Thursday, November 12, 2015

What can "Cinderella", the classic children's story, teach us about model selection?

Today in my linear regression class, we stumbled upon a nice example of model selection from classic children's literature: "Cinderella."

First, we need to define our selection goal: find a suitable wife (model). Second, we need to define our selection criterion. The king wanted a "suitable mother" but the prince was looking for something more. Third, given all the possible models, we need to decide how to carry out the search, or the search mechanism. How did they search for the best fit in the story? "By royal command, every eligible maiden is to attend!" In other words, we are going to evaluate all candidate models. 

At the ball, the prince spotted his best fit. Unfortunately, he didn't save his selected model, so he needed to search again. This time, he simplified his criterion to a single glass slipper (which is absurd: if it fit perfectly, it wouldn't have fallen off in the first place), since he now knew that the best fit has this characteristic. Thanks to this simplification, the prince no longer had to be the selection criterion himself in the second search. They had found a good surrogate, the glass slipper. After this round of thorough search, they found the prince's match.

The story of "Cinderella" is a story about the all-possible-regressions method. A selection criterion plus an evaluation of all possible candidate models is the most reliable way of finding the best match. Greedy search and stochastic search do not come with the same guarantee. But an exhaustive search is tedious, expensive (a grand ball in a real palace!) and time-consuming. It helps when the search criterion can be replaced by a simpler surrogate (a glass slipper as opposed to a face-to-face dance) and when the model space is small ("a tiny kingdom in a faraway land").
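For readers who want to see the royal command in code, here is a minimal sketch of an all-possible-regressions search in Python. It is only an illustration under my own assumptions: BIC plays the role of the glass slipper (any other criterion could be swapped in), the data are simulated, and the function names `bic` and `all_possible_regressions` are made up for this post.

```python
from itertools import combinations

import numpy as np


def bic(y, design):
    """BIC of an ordinary least squares fit of y on the given design matrix."""
    n, k = design.shape
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = float(np.sum((y - design @ coef) ** 2))
    return n * np.log(rss / n) + k * np.log(n)


def all_possible_regressions(y, X):
    """Score every subset of the columns of X; return the best subset and all scores."""
    n, p = X.shape
    intercept = np.ones((n, 1))
    scores = {}
    for size in range(p + 1):
        for subset in combinations(range(p), size):
            # Every eligible maiden attends: no candidate model is skipped.
            cols = np.array(subset, dtype=int)
            design = np.hstack([intercept, X[:, cols]])
            scores[subset] = bic(y, design)
    best = min(scores, key=scores.get)  # lower BIC is better
    return best, scores


# Simulated kingdom (an assumption for illustration): only predictors 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

best, scores = all_possible_regressions(y, X)
print(len(scores), "models evaluated; lowest-BIC subset:", best)
```

With p candidate predictors the loop visits 2^p models, which is why this only works in a tiny kingdom: 4 predictors mean 16 fits, while 30 predictors already mean more than a billion.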

Friday, November 06, 2015

Identity crisis solved: I am a unicorn trainer

Last night, I watched the JSM encore on "the Statistics Identity Crisis". First, let me just say I was so relieved to learn that I am not the only one puzzled by the question "am I a data scientist or not?" I have never found tomatoes so relatable.

Following the recommendation from the fourth talk, I also watched "Big Data, Data Science and Statistics. Can we all live together?" by Terry Speed. He compared the description of data science with his research profile and said: "I guess I have been doing data science after all." Precisely what I felt.

Data Science, Big Data and Statistics – can we all live together? from Chalmers Internal on Vimeo.

These two videos touched on several important areas statisticians need to work on in order to be more involved in Data Science (if you want to be more involved, of course). In particular, we need to equip our students with problem-solving skills, programming and "hacking" skills, and collaboration and communication skills. These skills cannot be taught through the conventional pedagogy in Statistics. Teaching them would require more real-data projects with open-ended problems and opportunities to collaborate with non-statisticians. It would also require encouraging students never to be satisfied with their models, their code and their visualization/presentation, and to strive for better models, faster algorithms and clearer presentation. This has been a focus in every research project of mine with my PhD students. I love how Professor Lance Waller from Emory University hacked his business card to give himself the title of "unicorn trainer." My current PhD students are picking up domain knowledge and computing skills in parallel computing, data management and visualization in their individual projects, in addition to their research in statistics and machine learning. In other words, they are indeed becoming unicorns.

Next semester, I will be teaching a course on "Applied Data Science" that is data-centric and project-based. It will not be organized by methodology topics, as the prerequisites cover both statistical modeling and machine learning. Every 2-3 weeks, we will explore analytics on a type of non-traditional data. There will be a lot of discussion, brainstorming, real-time hacking, code reviews, etc. It is intended for graduate students in Statistics to gain more data science skills and overcome the fear of the messiness of real-world data (big or small). Hopefully, we can get some unicorns-to-be out of this course as well.

Incidentally, I ran into ASA's statement on the role of Statistics in Data Science today. My favorite quote from this statement is:
For statisticians to help meet the considerable challenges faced by data scientists requires a sustained and substantial collaborative effort with researchers with expertise in data organization and in the flow and distribution of computation. Statisticians must engage them, learn from them, teach them, and work with them.

Sunday, November 01, 2015

Several pointers for graduate students in 140 characters

I gave a casual "speech" at a recent pizza hour for our MA students. It was intended to be a chat, but the students didn't ask many questions and I just went on and on, so it felt like a speech. One of the students tweeted:

I couldn't have summarized it better.