Wednesday, August 17, 2016

Postdoc position available immediately

Postdoctoral position in spatiotemporal data analysis

A full-time postdoctoral position is available beginning immediately in the research group of Professor Tian Zheng working on analysis of large spatiotemporal data sets, in close cooperation with our collaborators in neural imaging. 

Requirements: The work is highly interdisciplinary, and applicants must have strong statistical and computational skills. Preferred educational background is a PhD in statistics, computer science, computational neuroscience or a related field. Expertise in high performance computing is required. Experience with GPU computing is preferred but not necessary.

Environment: The research group is at Columbia University, based in the Statistics department, in the great city of New York. There will also be the opportunity to work with our collaborators on other research projects involving statistical computing, genetics, and computational social science.

Appointment: The initial appointment will be for one year, and is renewable. Salaries will be set based on experience and skills.

Applicants should send email to providing:
• a brief description of past research experience
• a brief description of future research interests and goals
• a resume of educational and research experience, including publications
• three letters of reference

Wednesday, May 04, 2016

Call for papers--Statistical Learning for Data Science (SLDS)

Call For Papers 

(New deadline: June 12, 2016)

Special Session on

"Statistical Learning for Data Science (SLDS)"

In DSAA 2016: The 3rd IEEE International Conference on Data Science and Advanced Analytics

Montreal, Canada October 17-19, 2016

Organized by

  • Tian Zheng (Department of Statistics, Columbia University)
  • Wei Pan (Department of Biostatistics, University of Minnesota)
  • Hernando Ombao (Department of Statistics, University of California at Irvine)

Program Committee members

  • Genevera Allen (Department of Statistics, Rice University) 
  • Ke Deng (Center for Statistical Science, Tsinghua University)
  • Charles Doss (School of Statistics, University of Minnesota)
  • Bailey Fosdick (Department of Statistics, Colorado State University)
  • Mladen Kolar (University of Chicago, Booth School of Business)
  • Xi Luo (Department of Biostatistics and Center for Statistical Sciences, Brown University)
  • Ali Shojaie (Department of Biostatistics, University of Washington)
  • Gongjun Xu (School of Statistics, University of Minnesota)
  • Alexander Volfovsky (Department of Statistical Science, Duke University)
  • Sijian Wang (Department of Statistics, University of Wisconsin at Madison)
Statistics plays a central role in the data science approach. This special session aims to engage statisticians who study methods and theory that are fundamental to data science. Paper submissions on recent advances in statistical learning and modeling for complex data are encouraged.
Topics of interest include, but are not limited to:
  • Advances in theory or models associated with the analysis of massive, complex datasets;
  • Statistical modeling and data mining for data-driven solutions of real-world problems;
  • Innovative data mining algorithms or novel statistical approaches;
  • Comparison of techniques to solve a problem, along with an objective evaluation of the analyses and the solutions.
Conference content will be submitted for inclusion into IEEE Digital Library. The conference proceedings will be submitted for EI indexing through INSPEC by IEEE. Top quality papers accepted and presented at the conference will be selected for extension and publication in the special issues of some international journals, including IEEE TKDE, ACM TKDD, ACM TIIS and WWWJ.

Journal publication

Extended versions of accepted papers to this special session will be considered for a special issue of Statistical Analysis and Data Mining, the ASA data science journal.

Key dates

  • Paper Submission deadline: Sunday 12 June, 2016, 11:59 PM PDT
  • Notification of acceptance: 15 July, 2016
  • Final Camera-ready papers due: 19 August, 2016

Submission Instruction

Monday, January 18, 2016

Project-based learning (PBL)

This semester I will be teaching Applied Data Science (W4249), a project-based learning (PBL) course. I came up with the idea for this course without being aware of this line of discussion in education innovation. As I have been preparing for this course, I started looking for research on instructional practices using projects, and I have discovered some very nice discussion on PBL. PBL has been more widely discussed in K-12 education, especially in STEM curriculums. One of the main objectives of PBL is to guide students in the process of becoming self-directed lifelong learners. This is quite appealing, as one can never acquire, within their time in school, all the knowledge required to solve all the problems.

Thursday, November 12, 2015

What can "Cinderella", the classic children's story, teach us about model selection?

Today in my linear regression class, we stumbled upon a nice example of model selection from classic children's literature, "Cinderella."

First, we need to define our selection goal: find a suitable wife (model). Second, we need to define our selection criterion. The king wanted a "suitable mother" but the prince was looking for something more. Third, given all the possible models, we need to decide how to carry out the search, or the search mechanism. How did they search for the best fit in the story? "By royal command, every eligible maiden is to attend!" In other words, we are going to evaluate all candidate models. 

At the ball, the prince spotted his best fit. But unfortunately, he didn't save his selected model, so he needed to search again. This time, he simplified his criterion to a single glass slipper (which is absurd, since if it fit perfectly it wouldn't have fallen off in the first place), as he now knew the best fit has this characteristic. Due to this simplification, for the second search, the prince did not have to be the selection criterion any more. They found a good surrogate, the glass slipper. After this round of thorough search, they found the prince's match.

The story "Cinderella" is a story about the all-possible-regressions method. A selection criterion applied to all possible candidate models is the most reliable way of finding the best match. Greedy search and stochastic search do not offer the same guarantee. An exhaustive search is tedious, expensive (a grand dance ball in a real palace!) and time consuming. It helps when the search criterion can be replaced by a simpler surrogate (a glass slipper as opposed to a face-to-face dance) and when the model space is small ("a tiny kingdom in a faraway land").
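For readers who want to try the royal command at home, here is a minimal sketch of an all-possible-regressions search in Python. It is an illustration under stated assumptions, not the method used in class: the function names (`rss`, `bic`, `all_possible_regressions`) are made up for this example, BIC is simply one possible choice of selection criterion, and the data are simulated.

```python
from itertools import combinations

import numpy as np


def rss(X, y, cols):
    """Residual sum of squares of a least-squares fit on `cols` plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)


def bic(X, y, cols):
    """Selection criterion (an assumption here): BIC for a Gaussian linear model."""
    n = len(y)
    k = len(cols) + 1  # slopes plus intercept
    return n * np.log(rss(X, y, cols) / n) + k * np.log(n)


def all_possible_regressions(X, y):
    """Invite every eligible maiden: score all 2^p subsets and keep the best."""
    p = X.shape[1]
    candidates = [
        cols for size in range(p + 1) for cols in combinations(range(p), size)
    ]
    return min(candidates, key=lambda cols: bic(X, y, cols))


# Simulated kingdom: 5 candidate predictors, only columns 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + rng.normal(scale=0.5, size=200)

best = all_possible_regressions(X, y)
print("selected columns:", best)
```

With 5 predictors there are only 32 models to try on, which is the "tiny kingdom" case. The number of candidates doubles with every added predictor, which is why a cheap surrogate criterion or a smaller model space matters so much in practice.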

Friday, November 06, 2015

Identity crisis solved: I am a unicorn trainer

Last night, I watched the JSM encore on "the Statistics Identity Crisis". First, let me just say I was so relieved to learn that I am not the only one who felt the puzzlement "am I a data scientist or not?" I have never felt that tomatoes are so relatable.

Following the recommendation from the fourth talk, I also watched "Big Data, Data Science and Statistics. Can we all live together?" by Terry Speed. He compared the description of data science with his research profile and said: "I guess I have been doing data science after all." Precisely what I felt.

Data Science, Big Data and Statistics – can we all live together? from Chalmers Internal on Vimeo.

These two videos touched on several important areas statisticians need to work on in order to be more involved in Data Science (if you want to be more involved, of course). In particular, we need to equip our students with problem solving skills, programming and "hacking" skills, and collaboration and communication skills. These skills cannot be taught with the conventional pedagogy in Statistics. It would require more real-data projects with open-ended problems and opportunities to collaborate with non-statisticians. It would require encouraging students to never be satisfied with their models, their code and their visualization/presentation, and to strive for better models, faster algorithms and clearer presentation. This has been a focus in every research project of mine with my PhD students. I love how Professor Lance Waller from Emory University hacked his business card to give himself a title of "unicorn trainer." My current PhD students are picking up domain knowledge and computing skills in parallel computing, data management and visualization in their individual projects, in addition to their research in statistics and machine learning. In other words, they are indeed becoming unicorns.

Next semester, I will be teaching a course on "Applied Data Science" that is data-centric and project-based. It will not be organized by methodology topics, as the prerequisites cover both statistical modeling and machine learning. Every 2-3 weeks, we will explore analytics on a type of non-traditional data. There will be a lot of discussion, brainstorming, real-time hacking, code reviews, etc. It is intended for graduate students in Statistics to gain more data science skills and overcome the fear of the real-world messiness in real data (big or small). Hopefully, we can get some unicorns-to-be out of this course as well.

Incidentally, I ran into ASA's statement of the role of Statistics in Data Science today. My favorite quote from this statement is:
For statisticians to help meet the considerable challenges faced by data scientists requires a sustained and substantial collaborative effort with researchers with expertise in data organization and in the flow and distribution of computation. Statisticians must engage them, learn from them, teach them, and work with them.