
Tuesday, October 10, 2017

Statistical thoughts on learning from Google search trends

Today I was invited to give comments at the book signing presentation by Seth Stephens-Davidowitz on his book "Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are". Here are my comments, which are random statistical thoughts I had while reading the book.
    • "many economists to pay more attention to human behavior”
    • “people are not rational”
    • Reminded me of the most interesting aspect of this book: behind every google search, there is a human decision. 
  • Google search expanded our knowledge.
    • There are two types of knowledge: the facts that you know, and the facts that you know how to find when needed.
    • Google increased the amount of the second type of knowledge by multiple orders of magnitude.
    • In 2012, it was estimated that there were 3.3 billion Google searches per day, or about 38K a second (3.3 billion / 86,400 seconds ≈ 38K).
    • Now: about 63K a second, according to Internet Live Stats.
  • Google ranking algorithm: 
    • How does it change how we acquire knowledge and evaluate evidence? 
    • Has it been personalizing results based on your location and prior search patterns?
    • In what ways does a Google search differ from people asking friends for opinions or suggestions?
    • Google consumer confidence report 2017
      • Trust in Google remains high, as 72.3% of respondents trust the accuracy of Google search results.
      • 63.7% of respondents don’t know how Google makes money from search.
      • 65.3% said they would not want more relevant Google search results if it meant Google would use their search history to generate those results (something Google is doing anyway).
  • Google Flu Trends story
    • In 2008, researchers from Google experimented with predicting trends of seasonal flu based on people’s searches.
    • They published a paper in Nature, explaining an intuitive idea that people who are coming down with the flu would search for flu-related information on Google, providing almost instant signals of overall flu prevalence. 
    • The research behind this paper was based on machine learning algorithms constructed using search data fitted against real-world flu tracking information from the Centers for Disease Control and Prevention. 
    • Google Flu Trends overestimated the peak of the 2013 flu season by 140 percent.
    • In a 2014 Science paper, researchers found that:
      • Google’s GFT algorithm simply overfitted: it was sensitive to seasonal terms unrelated to the flu, like “high school basketball.”
      • The algorithm also did not account for changes in search behavior over time, including influences from Google’s own new search features.
      • They suggested that GFT should be updated in collaboration with the CDC.
    • This is an example showing that there is information in “big data,” but using that information to derive correct knowledge and insights requires careful modeling and interdisciplinary collaboration (see the overfitting sketch after this list).
  • Natural language is hard
  • Actually, science is hard.
  • Google searches reveal secrets.
    • Related to research conducted by my collaborator Sarah Cowan.
  • Pay attention to data collected and not collected: https://www.squawkpoint.com/2015/01/sample-bias/
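
The 2014 Science critique is easy to reproduce in miniature. Below is a minimal sketch of the overfitting failure mode, not Google's actual GFT pipeline (whose term selection and data are proprietary): with far more candidate search terms than weekly observations, an unregularized regression fits the training flu rates almost perfectly and still generalizes badly. All data here are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic setup: 50 training weeks, 200 candidate search terms.
# Only the first 5 terms are genuinely flu-related; the rest are
# irrelevant seasonal noise (think "high school basketball").
n_weeks, n_terms = 50, 200
X_train = rng.normal(size=(n_weeks, n_terms))
true_beta = np.zeros(n_terms)
true_beta[:5] = 1.0
y_train = X_train @ true_beta + rng.normal(scale=0.5, size=n_weeks)

# With more terms than weeks, least squares can interpolate the
# training data (lstsq returns the minimum-norm solution).
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_mse = np.mean((X_train @ beta_hat - y_train) ** 2)

# On new weeks, the overfitted model does far worse than the
# irreducible noise level (variance 0.25) would allow.
X_test = rng.normal(size=(n_weeks, n_terms))
y_test = X_test @ true_beta + rng.normal(scale=0.5, size=n_weeks)
test_mse = np.mean((X_test @ beta_hat - y_test) ** 2)

print(f"train MSE: {train_mse:.4f}  test MSE: {test_mse:.4f}")
```

Regularizing term selection and periodically refitting against CDC ground truth, as the Science authors recommend, are the standard guards against exactly this failure.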

Monday, September 25, 2017

Aggregated relational data (ARD), what is that?

Today, an academic friend of mine asked me about a series of papers I wrote on "how many X's do you know" questions. These papers are about a special kind of indirect "network" data that do not contain detailed information on individual edges in a network. Rather, they contain counts of the edges that connect an ego to a number of specific subpopulations.

Relational data, such as records of citations, online communication, and social contacts, contain interesting information about the mechanisms that drive the dynamics of such interactions in different contexts. Current technology allows detailed observation and recording of these interactions, creating both opportunities and challenges. Aggregated relational data (ARD) are local summaries of these interactions, obtained via aggregation. This approach, known as the network scale-up method, has become a useful and common means of learning about hidden, hard-to-count, and relatively small populations in the social network.
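
To make the scale-up idea concrete, here is the classic estimator associated with the network scale-up method (a sketch; the notation is mine, not from any particular paper). Suppose respondent i reports y_{ik}, the number of people they know in subpopulation k:

```latex
% Basic network scale-up estimates (a sketch; notation is mine).
% N      : total population size
% N_k    : known size of subpopulation k
% y_{ik} : people respondent i reports knowing in subpopulation k
% H      : the hidden population of interest

% Step 1: estimate respondent i's personal network size (degree)
% from the subpopulations of known size.
\hat{d}_i = N \cdot \frac{\sum_k y_{ik}}{\sum_k N_k}

% Step 2: scale up the reports about H to estimate its size.
\hat{N}_H = N \cdot \frac{\sum_i y_{iH}}{\sum_i \hat{d}_i}
```

The model-based work described below refines both steps, replacing these simple ratio estimates with estimates that account for non-random mixing and variation in reporting.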

Our papers, beginning in 2006, started by proposing better statistical models for such data. We also discussed insights about data collection that we gained during this research. We showed that, via innovative statistical modeling, ARD can be used to estimate personal network sizes, the sizes and demographic profiles of hidden populations, and non-random mixing structures of the general social network. In particular, in our 2015 paper, we proposed a model for ARD that is close to the latent space model of Hoff et al. (2002) for full-network data. This would allow us to connect, and possibly combine, information from ARD with partially observed full-network data.
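
For readers unfamiliar with the latent space connection, here is a schematic of how the two data types line up (the distance form of the Hoff et al. (2002) model, plus the definition of ARD as aggregated edges; the exact parameterization in our 2015 paper differs in its details):

```latex
% Hoff et al. (2002) latent space model for a fully observed network:
% each actor i has a latent position z_i, and ties form independently with
\log \frac{P(y_{ij} = 1)}{P(y_{ij} = 0)} = \alpha - \lVert z_i - z_j \rVert

% ARD never observes the individual ties y_{ij}; for each ego i and
% subpopulation G_k it only records the aggregated count
y_{ik} = \sum_{j \in G_k} y_{ij}

% A latent space model for ARD therefore models these counts directly,
% with rates driven by ego i's position relative to where the members
% of G_k concentrate in the latent space.
```

Because both formulations share a latent space, an ego observed only through ARD and a subnetwork observed in full can inform the same latent positions, which is the sense in which the two data sources can be connected and combined.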