Thursday, November 12, 2015

What can "Cinderella", the classic children's story, teach us about model selection?

Today in my linear regression class, we stumbled upon a nice example of model selection from a classic of children's literature, "Cinderella."

First, we need to define our selection goal: find a suitable wife (model). Second, we need to define our selection criterion. The king wanted a "suitable mother," but the prince was looking for something more. Third, given all the possible models, we need to decide how to carry out the search: the search mechanism. How did they search for the best fit in the story? "By royal command, every eligible maiden is to attend!" In other words, we are going to evaluate all candidate models.

At the ball, the prince spotted his best fit. But unfortunately, he didn't save his selected model, so he had to search again. This time, he simplified his criterion to a single glass slipper (which is absurd: if it fits perfectly, it won't fall off in the first place), as he now knew the best fit has this characteristic. Thanks to this simplification, the prince no longer had to serve as the selection criterion himself for the second search. They had found a good surrogate: the glass slipper. After this second, thorough search, they found the prince's match.

The story "Cinderella" is a story on all possible regression method. Selection-criterion + all possible candidate models is the most reliable way of finding the best match. Greedy search or stochastic search do not have the same guarantee. An exhaustive search is tedious, expensive (a grand dance ball in a real palace!) and time consuming. It helps when the search criterion can be replaced by a simpler surrogate (a glass slipper as opposed to a face-to-face dance) and when the model space is small ("a tiny kingdom in a faraway land").

Friday, November 06, 2015

Identity crisis solved: I am a unicorn trainer

Last night, I watched the JSM encore on "the Statistics Identity Crisis." First, let me just say I was so relieved to learn that I am not the only one who has puzzled over "am I a data scientist or not?" I have never found tomatoes so relatable.

Following the recommendation from the fourth talk, I also watched "Big Data, Data Science and Statistics. Can we all live together?" by Terry Speed. He compared the description of data science with his own research profile and said: "I guess I have been doing data science after all." Precisely what I felt.


These two videos touched on several important areas statisticians need to work on in order to be more involved in Data Science (if you want to be more involved, of course). In particular, we need to equip our students with problem-solving skills, programming and "hacking" skills, and collaboration and communication skills. These skills cannot be taught through the conventional pedagogy of Statistics. It would require more real-data projects with open-ended problems and opportunities to collaborate with non-statisticians. It would require encouraging students to never be satisfied with their models, their code, and their visualizations/presentations, and to strive for better models, faster algorithms, and clearer presentations. This has been a focus in every research project of mine with my PhD students. I love how Professor Lance Waller from Emory University hacked his business card to give himself the title of "unicorn trainer." My current PhD students are picking up domain knowledge and computing skills in parallel computing, data management, and visualization in their individual projects, in addition to their research in statistics and machine learning. In other words, they are indeed becoming unicorns.

Next semester, I will be teaching a course on "Applied Data Science" that is data-centric and project-based. It will not be organized by methodology topics, as the prerequisites cover both statistical modeling and machine learning. Every 2-3 weeks, we will explore analytics on a type of non-traditional data. There will be a lot of discussion, brainstorming, real-time hacking, code reviews, etc. It is intended for graduate students in Statistics to gain more data science skills and overcome their fear of the real-world messiness of real data (big or small). Hopefully, we can get some unicorns-to-be out of this course as well.

Incidentally, I ran into the ASA's statement on the role of Statistics in Data Science today. My favorite quote from the statement is:
For statisticians to help meet the considerable challenges faced by data scientists requires a sustained and substantial collaborative effort with researchers with expertise in data organization and in the flow and distribution of computation. Statisticians must engage them, learn from them, teach them, and work with them.

Sunday, November 01, 2015

Several pointers for graduate students in 140 characters

I gave an off-the-cuff "speech" at a recent casual pizza hour for our MA students. It was intended to be a chat, but the students didn't ask many questions and I just went on and on, so it felt like a speech. One of the students tweeted:

I couldn't have summarized it better.

Wednesday, October 28, 2015

Prediction-oriented learning

Our paper, "Why significant variables aren't automatically good predictors," just came out in PNAS. Here is the press release.

This paper was motivated by our long-standing observation that variables selected using significance tests often (though not always) perform poorly in prediction. This is not news to many of the applied scientists we spoke to, but the reason for the phenomenon has not been explicitly discussed. Incidentally, we have also observed in our own research that variables selected using our I score seem to do better.

Through the course of writing this paper, we came to realize how fundamentally different significance and predictivity are. For example, the innate predictivity of a set of variables (its power to discern between two classes) does not depend on the observed data, while significance is a notion applied to sample statistics. As sample sizes grow, most sample statistics relevant to the population quantities that determine predictivity will eventually correlate well with predictivity. With finite samples, however, significant sets and predictive sets do not entirely overlap.

In this project, we tried various simulations to illustrate and understand this disparity, some of which were too complicated to include in our intended-to-be-accessible paper. For example, using a designed scenario, we managed to recreate via simulation the Venn diagram above, in which significant sets and predictive sets only partially overlap.

When evaluating variables for prediction using observed data, we inevitably have to rely on certain "sample statistics." In the research presented in our paper, we found that certain prediction-oriented statistics correlate better with the level of predictivity, while test statistics may not. This is because test statistics are often designed to have power to detect ANY departure from the null hypothesis, but only certain departures lead to substantial improvement in predictive performance.
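
As a toy illustration of this gap (my own minimal sketch here, not one of the designed scenarios from the paper), consider a variable whose effect is real but tiny: with a large enough sample it shows up as highly significant, yet it barely moves the out-of-sample prediction error.

# A variable can be highly "significant" yet nearly useless for prediction.
set.seed(1)
n <- 20000
x <- rnorm(n)
y <- 0.03 * x + rnorm(n)                # true effect is nonzero but tiny

train <- 1:(n / 2)
test  <- (n / 2 + 1):n
fit_x    <- lm(y ~ x, subset = train)   # model with the "significant" variable
fit_null <- lm(y ~ 1, subset = train)   # intercept-only baseline

summary(fit_x)$coefficients["x", "Pr(>|t|)"]  # typically well below 0.01

rmse <- function(fit) {
  sqrt(mean((y[test] - predict(fit, data.frame(x = x[test])))^2))
}
c(with_x = rmse(fit_x), without_x = rmse(fit_null))  # nearly identical

The t-test correctly detects a departure from the null, but the departure is too small to matter for prediction.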

Friday, October 09, 2015

Re-visioning Minard's plot

Minard's plot is a famous example in the history of visualization. Using the thickness of its lines, it vividly documents Napoleon's fateful defeat in 1812.
Many have attempted to recreate this graph using modern tools. Here is mine, using my favorite (and only) data science programming tool, R. It can still be improved: where the line turns sharply, the thickness is off. I got lazy with the algebra.
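
For anyone who wants to play with the core idea, here is a minimal sketch (not my actual script) that maps army size to line width. It assumes the Minard.troops data frame from the HistData package, with columns long, lat, survivors, direction, and group.

# Core idea of Minard's plot: encode army size as line width.
library(HistData)   # assumed to provide the Minard.troops data
library(ggplot2)

ggplot(Minard.troops,
       aes(x = long, y = lat, group = group,
           colour = direction, linewidth = survivors)) +
  geom_path(lineend = "round") +          # rounded ends soften sharp turns
  scale_linewidth(range = c(0.5, 12)) +   # use the size aesthetic on ggplot2 < 3.4
  labs(title = "Napoleon's 1812 campaign, after Minard") +
  theme_void()

Because geom_path draws each segment at a constant width, the joints where the retreat turns sharply still look slightly off, the same issue I ran into.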