Friday, August 21, 2015

Discussing "Statistical Methods for Observational Health Studies" (JSM 2015)

This post is based on my recollection of what I discussed in the session on "Statistical methods for observational health studies" that I organized. There has been a boom of such studies due to the availability of large collections of patient records and medical claims.

The analysis of the "big data" from health studies is different from many machine learning tasks in two ways. First, association is not enough. Identification of causal relations is essential for any possible intervention. Second, detection of an effect may not be enough. Often precise and accurate estimates of the effect size are desired. Strongly!

In dealing with observational health studies, here is some of my advice, which is not intended to be comprehensive.

  • Understand the available data; especially understand what was not observed. When you do not have full control of the study design/data collection, there will always be some issues: sampling bias, informative bias, measurement bias. Always ask questions about what the measured data actually represent: How were the data accumulated? How were certain measures defined?
  • Know what your methods are estimating: association, causality or Granger causality? The effect size under a particular model may not reflect the actual true effect size. What are the questions that need answers? What are the questions your methods can actually address?
  • Always carry out some sensitivity analysis: are your results sensitive to model assumptions, or small changes in the data? Available tools include simulations, multiple data sets or resampling. This can be challenging for certain studies as defining "agreement" between different findings can be tricky.
  • Always report uncertainty: combining the estimated sampling error from modeling with the results from sensitivity analysis gives the users of your results a better sense of uncertainty. This is especially important when modeling strategies were introduced to address a small area estimation problem.
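
The resampling and "small changes in the data" checks in the list above can be sketched in a few lines. This is only a minimal illustration, not a recipe from the session: the function names, the percentile bootstrap, the leave-one-out check and the made-up `effects` values are all assumptions chosen for the example.

```python
import random
import statistics


def bootstrap_interval(values, stat=statistics.mean, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for stat(values) (hypothetical helper)."""
    rng = random.Random(seed)
    n = len(values)
    # Recompute the statistic on n_boot resamples drawn with replacement.
    estimates = sorted(
        stat([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    tail = (1.0 - level) / 2.0
    lower = estimates[int(tail * n_boot)]
    upper = estimates[min(n_boot - 1, int((1.0 - tail) * n_boot))]
    return lower, upper


def leave_one_out_range(values, stat=statistics.mean):
    """Smallest and largest estimate after dropping each observation in turn."""
    estimates = [stat(values[:i] + values[i + 1:]) for i in range(len(values))]
    return min(estimates), max(estimates)


# Made-up effect measurements, for illustration only (one is an outlier).
effects = [0.8, 1.1, 0.9, 1.4, 0.7, 1.0, 3.2, 0.9, 1.2, 1.0]

print("point estimate:", statistics.mean(effects))
print("bootstrap interval:", bootstrap_interval(effects))
print("leave-one-out range:", leave_one_out_range(effects))
```

Reporting the bootstrap interval next to the leave-one-out range is one simple way to show both the sampling error and how much a single record can move the estimate.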

Tuesday, August 18, 2015

The fifth V of Big Data: variables (JSM 2015 discussion)

I gave the following discussion (from recollection and my notes) during the session "the fifth V of Big Data: variables" organized by Cynthia Rudin.

The notion "Big Data" does not simply refer to a data set that is large in size. It includes all complex and nontraditional data that do not necessarily come in the form of a typical clean Excel sheet with rows corresponding to individuals and columns corresponding to variables. In other words, Big Data are often unstructured and do not have naturally defined variables.

Variables are central to nearly all statistical learning tasks. We study their distributions to build models and predictive tools. Therefore, in Big Data, defining variables is one of the important first steps and is critical for the success of the statistical learning later on. This step is also known as feature generation. Even when some variables are observed on individuals in a data set, they often do not come in the form or scale most relevant to the learning task at hand. Domain knowledge, when used correctly as we have learnt from Kaiser's talk, is often the most helpful in identifying and generating features. At other times, we need help from tools such as exploratory data analysis, sparse learning and metric learning to form nonlinear transformations.

Generated variables first need to predict well, i.e., achieve accuracy. In addition, for many applications, they need to be interpretable. Here, we sometimes need to strike a balance between these two criteria. One way to achieve such a balance is to encourage sparsity in the solution, which is often computationally challenging.
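
As a small illustration of what "encouraging sparsity" can look like, here is a sketch under strong assumptions that are not from the discussion itself: an L1 (lasso) penalty, and features that are orthonormal. In that special case, minimizing (1/2)||y - Xb||^2 + penalty * ||b||_1 reduces to soft-thresholding the least-squares coefficients one at a time. The function names and the coefficient values are made up for the example.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; values within penalty of zero become 0.0."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def sparse_coefficients(least_squares_coefs, penalty):
    """Lasso solution when the features are orthonormal (assumed here)."""
    return [soft_threshold(c, penalty) for c in least_squares_coefs]


# Made-up least-squares coefficients for five candidate variables.
coefs = [2.5, -0.3, 0.1, -1.75, 0.4]
print(sparse_coefficients(coefs, penalty=0.5))  # [2.0, 0.0, 0.0, -1.25, 0.0]
```

Three of the five variables drop out entirely, which helps interpretability, while the surviving coefficients are shrunk, which is the price paid in accuracy. With correlated features no such closed form exists and an iterative solver is needed, which is where the computational cost comes in.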

Variables, the fifth V of Big Data, are essential for most statistical solutions and require a delicate three-way balance of accuracy, interpretability and computability.

Sunday, April 26, 2015

Forget P-values? or just let it be what it is

The p-value has always been controversial. It is required for certain publications, banned from some journals, hated by many yet quoted widely. Not all p-values are loved equally. Because of what someone popularized some 90 years ago, the small values below 0.05 have been the crowd's favorite.

When we teach hypothesis testing, we explain that the entire spectrum of p-values serves a single purpose: quantifying the "agreement" between an observed set of data and a statement (or claim) in the null hypothesis. Why do we single out the small values then? Why can't we talk about any specific p-value the same way we talk about today's temperature, i.e., as a measure of something?

First of all, the scale of the p-value is hard to talk about, which makes it different from temperature. The difference between 0.21 and 0.20 is not the same as that between 0.02 and 0.01. It almost feels like we should use the reciprocal of the p-values to discuss the likeliness of the corresponding observed statistics assuming the null hypothesis is true. If the null hypothesis is true, it takes, on average, 100 independent tests to observe a p-value below 0.01. The occurrence of a p-value under 0.02 is twice as likely, taking only about 50 tests to observe. Therefore 0.01 is twice as unlikely as 0.02. Using a similar calculation, 0.21 and 0.20 are almost identical in terms of likeliness under the null.
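
The reciprocal reading can be checked with a short simulation. This is a minimal sketch, assuming the p-value is uniformly distributed on (0, 1) under the null (true for a continuous test statistic), so each test is simulated as a single uniform draw; the function names, the number of runs and the seed are choices made for the illustration.

```python
import random


def expected_tests_until_below(threshold):
    """Mean number of independent null tests until a p-value falls below threshold."""
    return 1.0 / threshold


def simulate_tests_until_below(threshold, n_runs=5000, seed=1):
    """Monte Carlo estimate of the same quantity, drawing uniform null p-values."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        count = 1
        while rng.random() >= threshold:  # keep testing until p < threshold
            count += 1
        total += count
    return total / n_runs


for p in (0.01, 0.02, 0.20, 0.21):
    print(p, expected_tests_until_below(p), simulate_tests_until_below(p))
```

The theoretical values are 100 and 50 tests for 0.01 and 0.02, but only 5 and about 4.8 tests for 0.20 and 0.21, which is the sense in which the last two are nearly indistinguishable.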

In introductory statistics, it is said that a test of significance has four steps: stating the hypotheses and a desired level of significance, computing the test statistic, finding the p-value, and concluding given the p-value. It is step 4 that requires us to draw a line somewhere on the spectrum of p-values between 0 and 1. That line is called the level of significance. I never enjoyed explaining how one should choose the level of significance. Many of my students felt confused. Technically, if a student derived a p-value of 0.28, she can claim it is significant at a significance level of 0.30. This is silly because the significance level should convey a certain sense of rare occurrence, so rare that it is deemed contradictory with the null hypothesis. No one of common sense would argue that a chance close to 1 out of 3 represents rarity.

What common sense fails to deliver is how rare is contradictory enough. Why does 1/20 need to be a universal choice? It doesn't. Statisticians are not quite bothered by "insignificant results" as we think 0.051 is just as interesting as 0.049. Whenever possible, we just want to report the actual p-value instead of stating that we reject or fail to reject the null hypothesis at a certain level. We use p-values to compare the strength of evidence between variables and studies. However, sometimes we don't have a choice, so we get creative.

For any particular test between a null hypothesis and an alternative, a representative (i.e., not with selection bias) sample of p-values will offer a much better picture than the current published record of a handful of p-values under 0.05 out of who-knows-how-many trials. There have been suggestions on publishing insignificant results to avoid the so-called "cherry-picking" based on p-values. Despite the apparent appeal of such a reform, I cannot imagine it being practically possible. First of all, if we can assume that most people have been following the 0.05 "rule", publishing all the insignificant results could result in up to a 20-fold increase in the number of published studies (the figure that would apply if every tested null hypothesis were true). Yet it probably would create a very interesting data set for data mining. What would be useful is to have a public database of p-values on repeated studies of the same test (not just the same null hypothesis, as the test often depends on what the alternative is as well). In this database, a p-value can finally be just what it is, a measure of agreement between a data set and a claim.

Friday, April 17, 2015

My new cloud computer, which will never need to upgrade?

What has been stopping me from upgrading my computer is the pain of migrating from one machine to another. Over the years, I have intentionally made myself less reliant on one computer by sharing files across machines using Dropbox or Google Drive. I still have a big and old (see note below) office desktop that holds all my career (or most of it). Every time I need to work on something, I put a folder in Dropbox and work on the laptops on the go, at home, or in a coffee shop.

Using Columbia's Lion Mail Google account, we receive 10 TB of cloud storage on Google Drive. This is more than enough to hold all my files. A personal PC should have the following components: file storage, operating system, input/output devices, user software and user content. I just decided to move the file storage/user content component of my computer onto the cloud. The next step would be moving the most essential user software onto the cloud and remotely connecting to the cloud from any web browser to work. I can't wait to set up a remote cloud-based RStudio server to try out.

This whole thing started when I was searching for a faster desktop replacement. I saw the price required to buy the best available desktop and compared it with the pricing of cloud computing engines. The price tag of $8000 or higher for the most powerful PC/Mac would afford me non-stop computing on a 16-core cloud engine with 64 GB of memory or more for two full years. Considering the idle time a PC is likely to have, that may be equivalent to 4-5 years of ownership. It seems to me the best "personal" computer now is on the cloud.
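
The arithmetic behind this comparison can be spelled out as follows. This is only a back-of-the-envelope sketch: the $8000 budget and the two years come from the figures above, while the function names and the 40-50% "busy fraction" are assumptions, the latter back-derived so that two years of non-stop computing stretch to 4-5 years.

```python
HOURS_PER_YEAR = 24 * 365


def implied_hourly_rate(budget_dollars, years):
    """Hourly price at which the budget buys non-stop computing for the given years."""
    return budget_dollars / (years * HOURS_PER_YEAR)


def equivalent_pc_years(cloud_years, busy_fraction):
    """Years of PC ownership covered if the PC only computes busy_fraction of the time."""
    return cloud_years / busy_fraction


rate = implied_hourly_rate(8000, 2)
print(f"implied cloud price: ${rate:.2f} per hour")
for busy in (0.5, 0.4):  # assumed share of time the PC is actually computing
    print(f"busy {busy:.0%} of the time: {equivalent_pc_years(2, busy):.1f} years")
```

In other words, the comparison holds as long as the cloud engine costs no more than roughly 46 cents per hour, and the 4-5 year figure corresponds to a desktop that sits idle a little over half the time.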

Will this remove the need to go through the upgrading of our dearest work laptop? Not completely. We will still be buying new laptops to work on. But it will be more like upgrading an iPhone or iPad. This thought is enough to make my heart sing.

Note: my office PC was born in 2006 and is still going strong. Following advice my mom, a computer science professor, gave me about 20 years ago, I bought the best specification possible at that time.

Thursday, March 12, 2015

The trinity of data science: the wall, the nail and the hammer

In Leo Breiman's legendary paper "Statistical Modeling: The Two Cultures", a trend in statistical research was criticized: people, holding on to their models (methods), look for data to apply their models (methods) to. "If all you have is a hammer, everything looks like a nail," Breiman quoted.

A similar research scheme was observed recently in data science. The availability of large unstructured data sets (i.e., Big Data) has sparked the imagination of quantitative researchers (and would-be data scientists) everywhere. Challenges such as the LinkedIn Economic Graph Challenge have invited people to think hard about creative ways to unlock the information hidden in the vast "data" that can potentially lead to novel data products. The exploratory nature of this trend, to some extent, resembles the quest of a hammer in search of nails. Only this time, it is a vast expanse of underutilized wall in search of deserving nails. Once the most deserving nails (or hooks and other wall installations) are identified, the most appropriate hammers (or tools) will be identified or crafted to install them.

In every data science project, there is a trinity of basic components: the problem to be addressed, the data to be used and the method/model for the hack.

Figure: The Trinity of Data Science

Traditional statistical modeling usually starts with the problem, assumes a certain generative mechanism (model) for a potential data source, and devises a suitable method. The hack sequence is then problem-data-methods (or first start with the nail, then choose which wall to use, and then decide which hammer to use, considering the nature of the nail and the wall).

Data mining explores data using suitable methods to reveal interesting patterns and eventually suggests certain discoveries that address important scientific problems. The hack sequence is then data-methods-problem (the wall, the available hammers, and the nails).

Data scientists enter this tri-fold path at different points, depending on their career path. The ones from an applied domain have most likely entered from a problem entry point, then moved to data and then to methods. It is often very tempting to use the same methods when one moves from one set of data to the next. The training of these application domain data scientists often comes with a "manual" of popular methods for their data. Data evolve. So should the adopted methods, especially given the advancements in the methodological domain. The hammer that used to be the best for the nail/wall might no longer be the best given the current new collection of hammers. It is time to upgrade.

Methodological data scientists such as statisticians enter from the methodological perspective due to their training. When looking for ways to apply or extend their methods, they should consider problems where their methods might be applied and then find good data for the problem. In the process, one should never take for granted that the method can just be applied to the problem-data duo in its original form.

Computational data scientists and engineers often start from manipulating large data sets. Their algorithms were motivated by previous problems of interest or models that have been studied. When a similar large set of data becomes available, the most interesting problems that can be answered by this data set might be different from the ones that have been addressed before in another data set. One should use creative methods to identify novel patterns in such data and discover interesting problems to answer.

Looking at this trinity map of data science, it is then easy to understand that, for some, there will naturally be phases when one knows a few methods (from their training, or prior hacks of data sets) and looks for (other) data sets or problems to hack; and for some others, there will be phases when one has a big data set and looks for problems that can be answered by this data set. And there will also be those who start with a problem, find or collect data and apply existing or novel methods on the data.

These are all valid and natural "entry points" into data science. The most important thing here is to remember that there are many different hammers, many different nails and many different walls. The quest of a data scientist should always be to find the best match among the wall, the nail and the hammer, and to be willing to change, improve and create.