Friday, January 30, 2015

Lego, sampling and bad-behaving confidence intervals

Yesterday, during the second lecture of our Introduction to Data Science course for students in non-quantitative programs, we did a sampling demo adapted from Andrew Gelman and Deborah Nolan's book, Teaching Statistics: A Bag of Tricks.

Change from candies to legos. The original teaching recipe uses candies. A side effect is that the instructor always ends up with a lot of leftover candy, as students are becoming more and more health conscious. So this time, I decided to use lego pieces. One advantage of this change is that we can skip the kitchen scale and just count the number of studs (or "points") on the lego pieces.

Preparation. The night before, I counted out two bags of 100 lego pieces each: population A and population B. Population A consists of about 30 large pieces and 70 tiny pieces. Population B consists of 100 similar pieces (4 studs, 6 studs and 8 studs).

In-class demo. At the beginning of the lecture, we explained to the students what they needed to do, then passed one bag to one half of the class and the other bag to the other half, along with data recording sheets.

Results. Before class, I asked Ke Shen, an MA student in our program who is very good at visualization and R, to create an R Shiny app for this demo, where I could quickly key in the numbers and display the confidence intervals.

Here are population A samples.
Here are population B samples. 

Conclusion. Several things we noticed from this demo:
  1. Sampling lego pieces can be pretty noisy.
  2. All samples of population A over-estimated the true population mean (the red line). Samples of population B seemed to do better.
  3. Population variation affects the width of the confidence intervals.
  4. But even the wider confidence intervals missed the true mean because of the large bias.
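
These observations can be reproduced in a small simulation. The sketch below is not the class data. It assumes that a piece's chance of being picked is proportional to its stud count (one plausible source of the bias, not something we measured), that each student draws 5 pieces, and that the bags look like the hypothetical ones defined in the code (16-stud and 1-stud pieces for population A, a roughly even mix of 4, 6 and 8 studs for population B).

```python
import math
import random
import statistics


def size_biased_sample(population, k, rng):
    """Draw k pieces without replacement, chance proportional to stud count."""
    remaining = list(population)
    picked = []
    for _ in range(k):
        idx = rng.choices(range(len(remaining)), weights=remaining, k=1)[0]
        picked.append(remaining.pop(idx))
    return picked


def interval(sample):
    """Normal-approximation 95% confidence interval for the mean."""
    mean = statistics.fmean(sample)
    half = 1.96 * statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - half, mean + half


def simulate(population, k=5, trials=2000, seed=0):
    """Repeat the demo many times and summarize how the intervals behave."""
    rng = random.Random(seed)
    true_mean = statistics.fmean(population)
    covered = over = 0
    widths = []
    for _ in range(trials):
        sample = size_biased_sample(population, k, rng)
        low, high = interval(sample)
        covered += low <= true_mean <= high
        over += statistics.fmean(sample) > true_mean
        widths.append(high - low)
    return {
        "true_mean": true_mean,
        "coverage": covered / trials,    # share of intervals containing the truth
        "over_rate": over / trials,      # share of samples that over-estimate
        "mean_width": statistics.fmean(widths),
    }


# Hypothetical bags; the stud counts are assumptions, not the real pieces.
population_a = [16] * 30 + [1] * 70
population_b = [4] * 34 + [6] * 33 + [8] * 33

result_a = simulate(population_a)
result_b = simulate(population_b)
for name, result in (("A", result_a), ("B", result_b)):
    print(name, {key: round(value, 2) for key, value in result.items()})
```

Under these assumptions, population A should show wider intervals, more over-estimation and much lower coverage than population B, which is the same pattern we saw in class.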