t+z statistics: October 2017

Today I was invited to give comments at the book-signing presentation by Seth Stephens-Davidowitz for his book "Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are". Here are my comments, which are random statistical thoughts I had while reading the book.

- “many economists to pay more attention to human behavior”
- “people are not rational”
- These reminded me of the most interesting aspect of this book: behind every Google search, there is a human decision.

- There are two types of knowledge: the facts that you know, and the facts that you know how to find when needed.
- Google has increased the amount of the second type of knowledge by several orders of magnitude.
- In 2012, there were an estimated 3.3 billion Google searches per day, or 47K per second.
- Now: 63K per second, according to Internet Live Stats.

How does Google change the way we acquire knowledge and evaluate evidence?
Does it personalize results based on your location and prior search patterns?
In what ways does a Google search differ from people asking friends for opinions or suggestions?

In 2008, researchers from Google experimented with predicting trends in seasonal flu based on people’s searches.
They published a paper in Nature, explaining an intuitive idea that people who are coming down with the flu would search for flu-related information on Google, providing almost instant signals of overall flu prevalence.
The research behind this paper was based on machine learning algorithms constructed using search data fitted against real-world flu tracking information from the Centers for Disease Control and Prevention.
In 2013, Google Flu Trends (GFT) missed the peak of the 2013 flu season by 140 percent.
In a 2014 Science paper, researchers found that Google’s GFT algorithm had simply overfitted.
It was sensitive to seasonal terms unrelated to the flu, like “high school basketball.”
The GFT algorithm also did not account for changes in search behavior over time, including the influence of Google’s own new search features.
It was suggested that GFT should be updated in collaboration with the CDC.
This example shows that there is information in “big data,” but that using it to derive correct knowledge and insights requires careful modeling and interdisciplinary collaboration.
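
To make the overfitting point concrete, here is a minimal sketch on purely synthetic data. It is not the GFT algorithm. All names (`spurious_term_demo`, `pearson`) and sizes (20 training weeks, 2,000 candidate terms) are my own assumptions for illustration. Every "search term" here is random noise with no relation to the "flu" series, yet if we screen enough of them, the best one looks predictive on the weeks used to select it and then fails on new weeks.

```python
import random


def pearson(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spurious_term_demo(n_train=20, n_test=20, n_terms=2000, seed=0):
    """Pick the noise 'term' most correlated with 'flu' on the training
    weeks, then check how it correlates on held-out weeks."""
    rng = random.Random(seed)
    n = n_train + n_test

    # Synthetic weekly "flu" signal: pure noise, by assumption.
    flu = [rng.gauss(0, 1) for _ in range(n)]

    # Candidate "search terms": independent noise, unrelated to flu.
    terms = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_terms)]

    # Screening step: keep the term with the largest |correlation|
    # on the training weeks only.
    best = max(terms, key=lambda t: abs(pearson(t[:n_train], flu[:n_train])))

    train_r = pearson(best[:n_train], flu[:n_train])
    test_r = pearson(best[n_train:], flu[n_train:])
    return train_r, test_r


train_r, test_r = spurious_term_demo()
print(f"correlation on training weeks: {train_r:+.2f}")
print(f"correlation on held-out weeks: {test_r:+.2f}")
```

The selected term has no real connection to the flu series, so its strong training correlation is an artifact of searching over many candidates. This is the same trap as picking up a seasonal term like “high school basketball,” and it is why held-out validation and domain knowledge both matter.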

Pay attention to data collected and not collected: https://www.squawkpoint.com/2015/01/sample-bias/
