
Tuesday, October 10, 2017

Statistical thoughts on learning from Google search trends

Today I was invited to give comments at a book-signing presentation by Seth Stephens-Davidowitz on his book "Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are". Here are my comments: random statistical thoughts I had while reading the book.
    • "many economists to pay more attention to human behavior”
    • “people are not rational”
    • Reminded me of the most interesting aspect of this book: behind every google search, there is a human decision. 
  • Google search expanded our knowledge. 
    • There are two types of knowledge: the facts you know, and the facts you know how to find when needed.
    • Google has increased the second type of knowledge by multiple orders of magnitude.
    • In 2012, an estimated 3.3 billion Google searches were made per day, 47K a second.
    • Now: about 63K per second, according to Internet Live Stats (see the quick conversion sketch below).
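A quick back-of-the-envelope conversion between these daily and per-second figures (all are rough estimates quoted above; note that 3.3 billion a day works out to closer to 38K a second, so the two 2012 numbers are not from the same snapshot):

```python
# Back-of-the-envelope conversion between daily and per-second search volume,
# using the rough estimates quoted above.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

searches_per_day_2012 = 3.3e9   # 2012 estimate
print(f"2012: ~{searches_per_day_2012 / SECONDS_PER_DAY:,.0f} searches/second")  # ~38,194

per_second_now = 63_000         # Internet Live Stats figure
print(f"now:  ~{per_second_now * SECONDS_PER_DAY:,.0f} searches/day")            # ~5.4 billion
```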
  • Google ranking algorithm: 
    • How does it change how we acquire knowledge and evaluate evidence? 
    • Has it been using personalization based on your location and prior search patterns? 
    • In what ways does a Google search differ from people asking friends for opinions or suggestions?
    • Google consumer confidence report 2017
      • Trust in Google remains high: 72.3% of respondents trust the accuracy of Google search results.
      • 63.7% of respondents don’t know how Google makes money from search.
      • 65.3% said they would not want more relevant Google search results if it meant Google would use their search history to generate those results (something which Google is doing anyway).
  • Google Flu Trends story
    • In 2008, researchers from Google experimented with predicting trends of seasonal flu based on people’s searches. 
    • They published a paper in Nature, explaining the intuitive idea that people coming down with the flu would search for flu-related information on Google, providing an almost instant signal of overall flu prevalence. 
    • The research behind this paper was based on machine learning algorithms constructed using search data fitted against real-world flu tracking information from the Centers for Disease Control and Prevention. 
    • In 2013, Google Flu Trends overestimated the peak of the 2013 flu season by 140 percent. 
    • In a 2014 Science paper, researchers found that:
      • Google’s GFT algorithm simply overfitted (a toy simulation of this follows this list).
      • It was sensitive to seasonal terms unrelated to the flu, like “high school basketball.” 
      • The algorithm also did not account for changes in search behavior over time, including the influence of Google’s own new search features.
      • They suggested that GFT be updated in collaboration with the CDC. 
    • This is an example of how there is information in “big data,” but using that information to derive correct knowledge and insights requires careful modeling and interdisciplinary collaboration. 
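To make the overfitting point concrete, here is a small simulation, not the real GFT data or algorithm: it builds many search-term series that share the flu's winter seasonality but carry no flu-specific signal, shifts their levels midway to mimic a change in search behavior, and regresses flu prevalence on all of them. The fit looks excellent in the training window and degrades afterward:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

weeks = np.arange(300)                    # ~6 years of weekly data
season = np.sin(2 * np.pi * weeks / 52)   # shared winter seasonality

# "True" flu prevalence: seasonal cycle plus its own noise.
flu = 10 + 5 * season + rng.normal(0, 1, weeks.size)

# 100 simulated search-term series. All share the seasonal cycle
# (think "high school basketball"), but none tracks flu specifically.
terms = season[:, None] + rng.normal(0, 1, (weeks.size, 100))

# Mimic a change in search behavior midway (e.g., a new search feature):
terms[150:] += rng.normal(0, 0.5, 100)    # persistent per-term level shift

train, test = slice(0, 150), slice(150, 300)
model = LinearRegression().fit(terms[train], flu[train])

print("train R^2:", model.score(terms[train], flu[train]))  # near 1: overfit
print("test  R^2:", model.score(terms[test], flu[test]))    # much lower
```

With 100 regressors and only 150 training weeks, the model memorizes training noise along with the shared seasonality; once search behavior shifts, the out-of-sample fit deteriorates, which is qualitatively what happened to GFT.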
  • Natural language is hard
  • Actually, science is hard.
  • Google searches reveal secrets
    • Related to research conducted by my collaborator Sarah Cowan.
  • Pay attention to the data collected and the data not collected (a minimal simulated illustration follows): https://www.squawkpoint.com/2015/01/sample-bias/
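As a minimal sketch of the sample-bias point behind that link, using entirely simulated data and a hypothetical scenario: when the chance that a record is collected depends on the value being measured, the data you do collect give a distorted answer, and only knowing what was not collected can correct it.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population quantity, e.g., hours spent online per week.
population = rng.gamma(shape=2.0, scale=10.0, size=100_000)  # true mean = 20

# Biased collection: the more time someone spends online, the more
# likely they are to appear in the collected data at all.
p_include = population / population.max()
observed = population[rng.random(population.size) < p_include]

print(f"true mean:     {population.mean():.1f}")  # ~20
print(f"observed mean: {observed.mean():.1f}")    # ~30: the sample overstates it
```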