Wednesday, October 28, 2015

Prediction-oriented learning

Our paper on "Why significant variables aren’t automatically good predictors" just came out in PNAS. Here is the press release.

This paper was motivated by our long-time observation that variables selected using significance tests often (though not always) perform poorly in prediction. This is not news to many applied scientists we spoke to, but the reason for this phenomenon has not been explicitly discussed. Incidentally, we have also observed in our own research that variables selected using our I score seem to do better.

In the course of writing this paper, we came to realize how fundamentally different significance and predictivity are. For example, the innate predictivity of a set of variables (its power to discern between two classes) does not depend on observed data, whereas significance is a notion applied to sample statistics. As sample sizes grow, most sample statistics relevant to the population quantities that determine predictivity will eventually correlate well with predictivity. With finite samples, however, significant sets and predictive sets do not entirely overlap.

In our project, we tried various simulations to illustrate and understand this disparity, some of which are too complicated to be included in our intended-to-be-accessible paper. For example, using a designed scenario, we managed to recreate the Venn diagram above using simulations.

When we use observed data to evaluate variables for prediction, we inevitably have to rely on certain "sample statistics". The research presented in our paper found that certain prediction-oriented statistics correlate better with the level of predictivity, while test statistics may not. This is because test statistics are often designed to have power to detect ANY departure from the null hypothesis, but only certain departures lead to substantial improvement in predictive performance.
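
One simple way to see the gap between significance and predictivity is a toy simulation in R. This is only a sketch and not one of the simulations from the paper: the sample size, the size of the mean shift, and all variable names are assumptions chosen for illustration. A small mean shift between two classes is a departure from the null that a t-test picks up easily with enough data, yet it barely helps in telling the classes apart.

```r
# Illustrative simulation (not from the paper): a variable can be highly
# significant and still be a weak predictor.
set.seed(2015)

n_per_class <- 5000                       # assumed sample size per class
shift <- 0.2                              # assumed mean difference, in SD units

y <- rep(c(0, 1), each = n_per_class)     # class labels
x <- rnorm(length(y), mean = shift * y)   # class 1 is shifted by `shift`

# Significance: two-sample t-test comparing x between the two classes.
p_value <- t.test(x ~ factor(y))$p.value

# Predictivity: accuracy of the midpoint-threshold rule, which is the
# best possible rule for two equal-variance normal classes with equal priors.
predicted <- as.integer(x > shift / 2)
accuracy <- mean(predicted == y)

cat("p-value: ", format.pval(p_value), "\n")
cat("accuracy:", round(accuracy, 3), "\n")
```

Under these assumptions the best achievable accuracy is pnorm(shift / 2), roughly 0.54, so even the ideal classifier is only slightly better than a coin flip, while the test statistic is far out in the tail.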

Friday, October 09, 2015

Re-vision Minard's plot

Minard's plot is a famous example in the history of visualization. Using thickness of lines, it clearly documented Napoleon's fateful defeat in 1812.
Many have attempted to recreate this graph using modern tools. Here is mine, made with my favorite and only data science programming tool, R. It can still be improved: where the line turns sharply, the thickness is off. I got lazy with the algebra.

Thursday, October 08, 2015

Debugging the good results

You have a data set and an idea for modeling the data, in the hope that it will provide some information or a solution to a problem. In an ideal world, you would just cast the idea on the data like a never-fail spell and, ta-da, the solution would pop out of thin air.

It does not happen that way in the real world. Even in the wizarding world, when an angry Harry Potter tried to use the Cruciatus spell on an opponent, he failed. You-know-who commented, "you've got to mean it, Harry." It takes both strong willpower and skill to execute a powerful spell. The execution is the vessel that carries an idea. If the vessel sinks, so does the idea.

Now let's talk about debugging. We debug our analysis and code because the code, and the results it generates, are prone to mistakes. We are all aware of that. However, our drive to diligently debug is strong only when we are not getting the desired outcomes from our code.

Regina Nuzzo (@ReginaNuzzo) wrote in her recent Nature news feature:
“When the data don't seem to match previous estimates, you think, 'oh, boy! Did I make a mistake?'”
However, not all coding errors give silly results. Some, on the contrary, give pretty "good" results, the very results we have been hunting for. It takes strong willpower to check, proofread and debug in order to reduce the chance of false results. Over the years, I have had my fair share of false good results produced by programming errors. Therefore, to reduce the risk of cardiovascular arrest, members of my group tend to be more diligent when the results found are extremely exciting. Incidentally, both of the graduate students who recently presented good results (results that agree with our intuition) in our weekly meeting said at the end of the meeting that they would check their code more carefully.

Wednesday, October 07, 2015

Animated plot using R package animation

Step 1: install ImageMagick
Step 2: write a loop that creates a sequence of plots.
Step 3: use saveGIF({ }, ...)
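
The three steps can be sketched as follows. This is a minimal example, assuming the animation package is installed and ImageMagick is available on the system path. The helper draw_frame, the sine-wave content, the output file name and the frame settings are all made up for illustration.

```r
# Step 1 is done outside R: install ImageMagick, then install.packages("animation").
library(animation)

# Hypothetical helper: draw one frame, a sine wave shifted by a phase
# that depends on the frame index i.
draw_frame <- function(i, n_frames) {
  x <- seq(0, 2 * pi, length.out = 200)
  plot(x, sin(x + 2 * pi * i / n_frames), type = "l", lwd = 2,
       ylim = c(-1, 1), xlab = "x", ylab = "y",
       main = paste("Frame", i))
}

n_frames <- 20

# Steps 2 and 3: the loop that creates the sequence of plots goes inside
# the braces; saveGIF() records each plot as a frame and asks ImageMagick
# to combine them into one GIF.
saveGIF({
  for (i in seq_len(n_frames)) {
    draw_frame(i, n_frames)
  }
}, movie.name = "sine_wave.gif", interval = 0.1,
   ani.width = 480, ani.height = 320)
```

Here interval is the delay between frames in seconds, and ani.width and ani.height set the frame size in pixels. Each complete plot drawn inside the braces becomes one frame.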