Friday, November 01, 2013

R commander: why is clicking better than typing?

I came across a GUI for R called R commander. It resembles a typical, more user-friendly interface in which users can explore the drop-down menus and select (basic) things to apply to their data. I do not find it easier than writing my own R script, but I think it can be a blessing to people (e.g., students) who have never written a single script before coming into an intro stat class.

I think the main reason this makes life easier for certain user groups is that you don't have to remember much to get started. The interface has the same structure as the other interfaces of a personal computer's operating system (PC or Mac), so you understand, more or less, what you are supposed to do. Most students should therefore already have the essential skills needed to use R commander before taking the intro stat course, even if not R itself.

The regular R console is another story. You can copy and paste the examples from a teacher's lecture notes without having a clue about what you are doing. This is understandably frustrating. If a student in an intro stat class decides to go deeper into statistics, s/he would eventually need to learn how to program (R, C, Perl, Python, or whatever). Programming will naturally become more interesting (or less frustrating) once the student is into statistics already.

Monday, October 21, 2013

BBC Horizon - Homeopathy the test

This is one of my favorite teaching examples, as it explains very well the placebo effect, the importance of blinded randomized experiments, and the meaning of statistical significance.
If you prefer to read, you can go to the show's page.

Friday, October 18, 2013

NRG Research Highlight: A Mendelian code for complex disease


Key quotes:

  1. [The authors] analysed the phenotypic information present in large numbers of electronic medical records from the United States and Denmark to look for co-morbidities among Mendelian and complex diseases
  2. each complex disease was found to be associated with a unique set of Mendelian conditions.
  3. This finding led the authors to explore genetic models that could explain the risk of complex disease in patients with more than one Mendelian phenotype. The best explanation was provided by a model in which non-additive genetic interactions in specific 'communities' of loci have crucial roles.
What are "non-additive genetic interactions"?

Tuesday, October 01, 2013

The leap from an academic program to the corporate world (or any world)

At a recent meeting on educational issues, someone reported feedback from big companies on fresh graduates from academic graduate programs in general. It is widely felt that there is a gap between the training and the expectations of the corporate world. More specifically, most companies felt that fresh graduates from academic programs need more preparation in four areas: strategic thinking, project management, teamwork, and presenting the big picture of a project (the elevator talk).

Hmm, these are all important for surviving in academia too, I thought. Maybe there should be more emphasis on having final projects (with final presentations) in our courses at all levels (undergraduate, MA and PhD). To make this handful of projects count, mentoring during the project and feedback after the project are key.

This semester, for G6101, there will be an assigned data project as part of the final exam, mimicking the format of our qualifying exam on applied statistics. G6101 doesn't have many presentation opportunities for the students (yet). For W4335 "sample surveys", I am experimenting with assigning two small project ideas every week as "optional projects". Students are required to do two such projects during the semester and "present" their results on the discussion board. I hope these will bring them closer to their landing pad in the new world (whatever and wherever it may be) after they finish our program/course.

Tuesday, June 18, 2013

Postdoc position available. Come and join us!

Postdoctoral position in statistical modeling of social networks

A full-time postdoctoral position is available beginning Fall 2014 in the research group of Tian Zheng and Andrew Gelman working on statistical analysis and modeling of social network data, in close cooperation with our experimental collaborators. Four key papers of this project so far are:

Requirements: The work is highly interdisciplinary, and applicants must have strong statistical and computational skills. Social science research skills are preferred but not necessary. Preferred educational background is a PhD in statistics, computer science, political science, sociology, or a related field. Expertise in Bayesian modeling and computing is required. Previous experience with network data is preferred but not required.

Environment: The research group is at Columbia University, based in the Statistics department and closely integrated with the Applied Statistics Center at Columbia, in the great city of New York. There will also be the opportunity to work with Zheng and Gelman and their collaborators on other research projects involving statistical computing, genetics, and computational social science.

Appointment: The initial appointment will be for one year, and is renewable. Salaries will be set based on experience and skills.

Applicants should send email to providing:
• a one-page description of past research experience
• a one-page description of future research interests and goals
• a resume of educational and research experience, including publications
• three letters of reference

Thursday, May 02, 2013

Two useful R packages

I came across two useful R packages today.
1) partykit, which makes prettier classification and regression tree plots.
2) arrayhelpers, from which I only used one function today, array2df, which converts a multidimensional array to a "flat" matrix for easier plotting.
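
For readers who want to see what these two ideas look like in practice, here is a minimal sketch. The toy array and its dimension names are made up for illustration. The "flat" conversion is done with base R's as.data.frame(as.table(...)) as a stand-in, so it shows the idea rather than array2df's own output format. The tree part assumes partykit is installed and is skipped otherwise.

```r
# A small 2 x 3 x 2 array with named dimensions (made-up data for illustration).
arr <- array(1:12, dim = c(2, 3, 2),
             dimnames = list(sex  = c("F", "M"),
                             dose = c("low", "mid", "high"),
                             site = c("A", "B")))

# Base-R stand-in for the array-to-"flat" idea: one row per cell,
# one column per dimension, plus a column holding the cell value.
flat <- as.data.frame(as.table(arr), responseName = "value")
print(head(flat))

# A conditional inference tree on the built-in iris data, plotted with
# partykit's plot method. Skipped if partykit is not installed.
if (requireNamespace("partykit", quietly = TRUE)) {
  tree <- partykit::ctree(Species ~ ., data = iris)
  plot(tree)
}
```

The flat version has one row per array cell, which is the shape most plotting functions expect.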

Tuesday, March 12, 2013

What is the BIG data?

Tyler told me that he is going to present on a panel about "Big Data." "What is big data?" I asked him, and he didn't give me a satisfactory definition. For the past two years, every time I have heard the phrase "big data", it has reminded me of the "Big Salad" episode of Seinfeld.
ELAINE: Um, hum, I don't know.. . . A big salad?
GEORGE: What big salad? I'm going to the coffee shop.
ELAINE: They have big salads.
GEORGE: I've never seen a big salad.
ELAINE: They have a big salad.
GEORGE: Is that what I ask for? The BIG salad?
ELAINE: It's okay, you don't…
GEORGE: No, no, Hey I'll get it. What's in the BIG salad?
JERRY: Big lettuce, big carrots, tomatoes like volleyballs.
GEORGE: (???), we'll see you in a little while.
I feel that I sort of know what they are referring to, but I cannot really take in the idea of generalizing all the special needs of complex, messy, multi-disciplinary, multi-source, incomplete, biased-by-design, ... data into one word, "BIG." Yes, they are big. So big that getting them into a form that can be handled by traditional computational mechanisms becomes hard. Innovations are needed either to allow nearly lossless reduction of the BIG data to a manageable size, or to lead to new computational mechanisms for BIG data. This is the foremost step of any big data project. This part of the battle is more computer science than statistics.

To a statistician, the whole new era of "BIG data" feels like a call for more dynamic models that can capture trends in space and time, better model-based tools for integrating multiple, individually incomplete data sources, and systematic data analysis tools that can mitigate design and sampling biases in the huge collection of existing data. It is like jigsaw puzzles. A small data set is like a small puzzle, and a BIG data set is like a gigantic, bigger-than-a-football-field kind of puzzle. Exciting, amazing, fun (?), and intimidating. Not all of our old puzzle-solving tricks will work well, but some fundamental principles still prevail, as long as it is indeed, as we understand it, a jigsaw puzzle and not something else.