Tuesday, October 10, 2017

Statistical thoughts on learning from Google search trends

Today I was invited to give comments at the book signing presentation by Seth Stephens-Davidowitz on his book: "Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are". Here are my comments, which were random statistical thoughts I had while reading this book.
    • "many economists to pay more attention to human behavior”
    • “people are not rational”
    • Reminded me of the most interesting aspect of this book: behind every Google search, there is a human decision.
  • Google search expanded our knowledge. 
    • There are two types of knowledge: the facts that you know, and the facts that you know how to find when needed.
    • Google increased the amount of the second type of knowledge by multiple orders of magnitude.
    • In 2012, there were an estimated 3.3 billion Google searches per day, 47K a second.
    • Now: 63K a second, according to Internet Live Stats.
  • Google ranking algorithm: 
    • How does it change how we acquire knowledge and evaluate evidence? 
    • Has it been using personalization based on your location and prior search pattern? 
    • In what ways does a Google search differ from people asking friends for opinions or suggestions?
    • Google consumer confidence report 2017
      • Trust in Google remains high: 72.3% of respondents trust the accuracy of Google search results.
      • 63.7% of respondents don’t know how Google makes money from search.
      • 65.3% said they would not want more relevant Google search results if it meant Google would use their search history to generate their results – something which Google is doing anyway.
  • Google Flu Trends story
    • In 2008, researchers from Google experimented with predicting trends of seasonal flu based on people’s searches.
    • They published a paper in Nature, explaining the intuitive idea that people who are coming down with the flu would search for flu-related information on Google, providing almost instant signals of overall flu prevalence.
    • The research behind this paper was based on machine learning algorithms constructed using search data fitted against real-world flu tracking information from the Centers for Disease Control and Prevention. 
    • In 2013, Google Flu Trends (GFT) missed the peak of the 2013 flu season by 140 percent.
    • In a 2014 Science paper, researchers found that:
      • The GFT algorithm simply overfitted.
      • It was sensitive to seasonal terms unrelated to the flu, like “high school basketball.”
      • The GFT algorithm also did not consider changes in search behavior over time, including influences from Google’s own new search features.
    • It was suggested that GFT should be updated in collaboration with CDC. 
    • This example shows that there is information in “big data,” but using such information to derive correct knowledge and insights requires careful modeling and interdisciplinary collaboration.
  • Natural language is hard
  • Actually science is hard.
  • Google searches reveal secrets
    • Related to research conducted by my collaborator Sarah Cowan.
  • Pay attention to data collected and not collected:

Monday, September 25, 2017

Aggregated relation data (ARD), what is that?

Today, an academic friend of mine asked me about a series of papers I wrote on the topic of "how many X's do you know" questions. These papers are about a special kind of indirect "network" data that do not have detailed information on individual edges in a network. Rather, they contain counts of the edges that connect an ego to a number of specific subpopulations.

Relational data, such as records of citations, online communication and social contacts, contain interesting information regarding the mechanisms that drive the dynamics of such interactions under different contexts. Current technology allows detailed observation and recording of these interactions, creating both opportunities and challenges. Aggregated relation data (ARD) are local summaries of these interactions, obtained via aggregation. Collecting and modeling ARD has become a useful and common means of learning about hidden, hard-to-count and relatively small populations in the social network, an approach also known as the network scale-up method.

Our papers since 2006 started by proposing better statistical models for such data. We then discussed the data collection insights we gained during our research. We showed that, via innovative statistical modeling, ARD can be used to estimate personal network sizes, the sizes and demographic profiles of hidden populations, and non-random mixing structures of the general social network. In particular, in our 2015 paper, we proposed a model for ARD that is close to the latent space model by Hoff et al. (2002) for full-network data. This would allow us to connect and possibly combine information from ARD with partially observed full network data.
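
To make the idea of ARD concrete, here is a minimal sketch of the classic basic scale-up estimator. It is only the textbook starting point, not the models proposed in the papers above. It assumes that each respondent's count for a group is proportional to that group's share of the total population. The function names and the toy numbers are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def estimate_degrees(
    known_counts: Sequence[Sequence[int]],
    known_sizes: Sequence[int],
    total_population: int,
) -> List[float]:
    """Basic scale-up estimate of each respondent's personal network size.

    known_counts[i][k] is respondent i's answer to "how many people do you
    know in subpopulation k?", where the size of subpopulation k is known
    and given by known_sizes[k].
    """
    total_known = sum(known_sizes)
    # If a fraction total_known / total_population of the population belongs
    # to the known groups, the same fraction of one's acquaintances should too.
    return [total_population * sum(row) / total_known for row in known_counts]


def estimate_hidden_size(
    hidden_counts: Sequence[int],
    degrees: Sequence[float],
    total_population: int,
) -> float:
    """Basic scale-up estimate of the size of a hidden population.

    hidden_counts[i] is how many members of the hidden population
    respondent i reports knowing.
    """
    return total_population * sum(hidden_counts) / sum(degrees)


if __name__ == "__main__":
    # Hypothetical toy survey: 3 respondents, 2 subpopulations of known size.
    N = 100_000
    known_sizes = [1_000, 3_000]
    known_counts = [[1, 3], [2, 6], [0, 4]]
    hidden_counts = [1, 0, 1]

    degrees = estimate_degrees(known_counts, known_sizes, N)
    print("estimated network sizes:", degrees)
    print("estimated hidden population size:",
          estimate_hidden_size(hidden_counts, degrees, N))
```

This simple estimator ignores non-random mixing and reporting errors, which are among the issues that more careful statistical models for ARD try to address.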

Monday, November 07, 2016

Postdoc position in Applied Statistics and Bayesian Modeling available immediately

A full-time postdoctoral position is available beginning immediately in the research group of Professor Tian Zheng, working on reproducible modeling of social and behavioral data using Bayesian computing, in close cooperation with our collaborators in the social sciences.

Requirements: The work is highly interdisciplinary, and applicants must have strong statistical and computational skills. Preferred educational background is a PhD in statistics, computer science, computational social science or a related field. Expertise in statistical computing is required. Experience with R/Stan, Python, and/or data visualization is preferred but not necessary.

Environment: The research group is at Columbia University, based in the Statistics department, in the great city of New York. There will also be the opportunity to interact with our collaborators on other research projects involving psychology, genetics, and neural science.

Appointment: The initial appointment will be for one year, and is renewable. Salaries will be set based on experience and skills.

Applicants should send email to providing:
• a brief description of past research experience
• a brief description of future research interests and goals
• a resume of educational and research experience, including publications
• three letters of reference

Wednesday, September 28, 2016

ASA SLDS JSM 2017 student paper competition

Call for papers
Student Paper Competition - JSM 2017
(July 29th-Aug 3rd, 2017, Baltimore, MD)
ASA Section on Statistical Learning and Data Science
Sponsored by SLDS

Key dates:
• Abstracts due December 15th, 2016
• Full papers due January 4th, 2017

The Section on Statistical Learning and Data Science (SLDS) of the American Statistical Association (ASA) is sponsoring a student paper competition for the 2017 Joint Statistical Meetings in Baltimore, MD, on July 29th-August 3rd, 2017.

The paper may present original methodological research or a real-world application (from various fields including but not limited to marketing, pharmaceutical, genomics, bioinformatics, imaging, defense, business, public health) that uses principles and methods in statistical learning and data science.

Papers that have been accepted for publication are NOT eligible for the competition. Selected winners will present their papers in a designated topic-contributed session at the 2017 JSM in Baltimore, MD, organized by the award committee. In this session, they will be presented with a monetary prize and an award certificate. Winning papers will be recommended for submission to Statistical Analysis and Data Mining: The ASA Data Science Journal, which is the flagship journal of the SLDS Section.

Graduate or undergraduate students who are enrolled in Fall 2016 or Winter/Spring 2017 are eligible to participate. The applicant MUST be the first author of the paper.

Abstracts (up to 1000 characters) are due 12:00 PM (noon) EST on December 15th, 2016 and shall be submitted via this Abstract submission form. ONLY students who submit their abstracts on time are eligible for submitting full papers after 12/15/2016.

Full papers and other application materials must be submitted electronically (in PDF, see instructions below) to Professor Tian Zheng by 12:00 PM (noon) EST on Wednesday, January 4th, 2017. ONLY students who submit their abstracts by 12/15/2016 are eligible for submitting full papers.

All full paper email entries must include the following:
  1. An email message containing:
    • List of authors and contact information;
    • Abstract with no more than 1000 characters.
  2. Unblinded manuscript - double-spaced with no more than 25 pages including figures, tables, references and appendix. Please use 11pt fonts (preferably Arial or Helvetica) and 1 inch margins all around.
  3. Blinded versions of the abstract and manuscript (with no author names or references that could easily lead to author identification).
  4. A reference letter from a faculty member familiar with the student's work which MUST include a verification of the applicant's student status and, in the case of joint authorship, should indicate the fraction of the applicant's contribution to the manuscript.
All materials must be in English.

Entries will be reviewed by the Student Paper Competition Award committee. The selection criteria used by the committee will include statistical novelty, innovation and significance of the contribution to the field of application as well as the professional quality of the manuscript.

This year’s student competition is sponsored by ASA SLDS and is chaired by Professor Tian Zheng (Columbia University). Award announcements will be made in mid-January 2017. For inquiries, please contact Professor Tian Zheng.

Saturday, September 10, 2016

Resources for data analytics at Columbia

I got an email asking for resources (other than courses) for data analytics at Columbia. This is what I wrote in reply: