Friday, August 21, 2015

Discussing "Statistical Methods for Observational Health Studies" (JSM 2015)

This post is based on my recollection of what I discussed in the session on "Statistical methods for observational health studies" that I organized. There has been a boom of such studies due to the availability of large collections of patient records and medical claims.

The analysis of "big data" from health studies differs from many machine learning tasks in two ways. First, association is not enough: identifying causal relations is essential for any possible intervention. Second, detecting an effect may not be enough: precise and accurate estimates of the effect size are often strongly desired.

In dealing with observational health studies, here is some of my advice, which is not intended to be comprehensive.

  • Understand the available data; in particular, understand what was not observed. When you do not have full control of the study design and data collection, there will always be issues: sampling bias, informative bias, measurement bias. Always ask what the measured data actually represent: how were the data accumulated? How were certain measures defined?
  • Know what your methods are estimating: association, causality, or Granger causality? The effect size under a particular model may not reflect the true effect size. What are the questions that need answers? What are the questions your methods can actually address?
  • Always carry out sensitivity analyses: are your results sensitive to model assumptions, or to small changes in the data? Available tools include simulations, multiple data sets, and resampling. This can be challenging for certain studies, as defining "agreement" between different findings can be tricky.
  • Always report uncertainty: combining the estimated sampling error from modeling with the results of sensitivity analysis gives the users of your results a better sense of the uncertainty. This is especially important when modeling strategies were introduced to address a small area estimation problem.
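As a concrete illustration of the last two points, here is a minimal sketch of using bootstrap resampling to attach an uncertainty interval to an effect estimate. The data and the `mean_diff` estimator are made up for illustration; in a real study the estimator would come from your causal model, not a raw group difference.

```python
import numpy as np

def bootstrap_effect(x, y, estimator, n_boot=1000, seed=0):
    """Resample (x, y) pairs with replacement and re-estimate the effect.

    `estimator` is any function mapping (x, y) -> a scalar effect estimate;
    the spread of the bootstrap estimates is one simple way to report
    uncertainty beyond a single model-based standard error.
    """
    rng = np.random.default_rng(seed)
    n = len(x)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # sample individuals with replacement
        estimates[b] = estimator(x[idx], y[idx])
    return np.percentile(estimates, [2.5, 97.5])

# Toy data: binary exposure with a true effect of 2.0 on the outcome.
rng = np.random.default_rng(1)
exposure = rng.integers(0, 2, size=500)
outcome = 2.0 * exposure + rng.normal(size=500)

def mean_diff(x, y):
    # Naive estimator: difference in mean outcome, exposed minus unexposed.
    return y[x == 1].mean() - y[x == 0].mean()

lo, hi = bootstrap_effect(exposure, outcome, mean_diff)
```

Reporting the interval `[lo, hi]` alongside the point estimate, together with how the estimate moves under alternative model assumptions, gives readers a more honest picture than the point estimate alone.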

Tuesday, August 18, 2015

The fifth V of Big Data: variables (JSM 2015 discussion)

I gave the following discussion (from recollection and my notes) during the session "the fifth V of Big Data: variables" organized by Cynthia Rudin.

The notion "Big Data" does not simply refer to a data set that is large in size. It includes all complex and nontraditional data that do not necessarily come in the form of a typical clean Excel sheet with rows corresponding to individuals and columns corresponding to variables. In other words, Big Data are often unstructured and do not have naturally defined variables.

Variables are central to nearly all statistical learning tasks. We study their distributions to build models and predictive tools. In Big Data, therefore, how to define variables is one of the important first steps, critical to the success of the statistical learning that follows. This step is also known as feature generation. Even when some variables are observed on the individuals in a data set, they often do not come in the form or on the scale most relevant to the learning task at hand. Domain knowledge, when used correctly, as we learned from Kaiser's talk, is often the most helpful in identifying and generating features. At other times, we need help from tools such as exploratory data analysis, sparse learning, and metric learning to form nonlinear transformations.
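As a toy illustration of feature generation, here is a sketch that turns a hypothetical unstructured event log (invented records of patient visits, in the spirit of the health-studies session above) into per-individual variables suitable for a rows-by-columns learning task:

```python
from collections import Counter
from datetime import date

# Hypothetical raw records: (patient_id, visit_date, diagnosis_code).
records = [
    ("p1", date(2015, 1, 5), "E11"),
    ("p1", date(2015, 3, 2), "E11"),
    ("p1", date(2015, 6, 9), "I10"),
    ("p2", date(2015, 2, 1), "I10"),
]

def patient_features(records, as_of=date(2015, 7, 1)):
    """Turn an event log into per-patient variables: visit count,
    number of distinct diagnoses, and days since the last visit."""
    raw = {}
    for pid, visit, code in records:
        f = raw.setdefault(pid, {"n_visits": 0, "codes": Counter(), "last": None})
        f["n_visits"] += 1
        f["codes"][code] += 1
        if f["last"] is None or visit > f["last"]:
            f["last"] = visit
    return {
        pid: {
            "n_visits": f["n_visits"],
            "n_distinct_dx": len(f["codes"]),
            "days_since_last": (as_of - f["last"]).days,
        }
        for pid, f in raw.items()
    }

features = patient_features(records)
```

Which summaries to compute (counts, recency, distinct codes, and so on) is exactly where domain knowledge enters: the raw log admits many possible variables, and only some of them will be on the scale relevant to the task.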

Generated variables first need to predict well, i.e., achieve accuracy. In addition, for many applications, they need to be interpretable. Sometimes we need to strike a balance between these two criteria. One way to achieve such a balance is to encourage sparsity in the solution, which is often computationally challenging.
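One standard way to encourage sparsity is an l1 penalty (the lasso). Here is a minimal sketch, on simulated data, using iterative soft-thresholding (ISTA); the exact zeros it produces are what buy interpretability, at some cost in raw accuracy:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 norm: shrinks coefficients toward zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimize 0.5*||y - Xb||^2 + lam*||b||_1 by iterative soft-thresholding."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L, L = Lipschitz const. of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y)             # gradient of the squared-error term
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Simulated data: 10 candidate variables, only the first two are relevant.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_b = np.zeros(10)
true_b[:2] = [3.0, -2.0]
y = X @ true_b + 0.1 * rng.normal(size=100)

b_hat = lasso_ista(X, y, lam=5.0)
```

The l1 penalty drives the eight irrelevant coefficients to (essentially) zero while slightly shrinking the two relevant ones, which is the accuracy-for-interpretability trade made explicit.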

Variables, the fifth V of Big Data, are essential for most statistical solutions and require a delicate three-way balance of accuracy, interpretability and computability.