t+z statistics
On statistics, data science and applications. http://www.stat.columbia.edu/~tzheng.
Tian Zheng

Statistical thoughts on learning from Google search trends (posted 2017-10-10)
Today I was invited to give comments at the book signing presentation by Seth Stephens-Davidowitz on his book "Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are". Here are my comments, which were random statistical thoughts I had while reading this book.
Nobel Prize Economics 2017
"many economists to pay more attention to human behavior”
“people

Aggregated relation data (ARD), what is that? (posted 2017-09-25)
Today, an academic friend of mine asked me about a series of papers I wrote on the topic of "how many X's do you know" questions. These papers are about a special kind of indirect "network" data that do not have detailed information on individual edges in a network. Rather, they contain counts of the edges that connect an ego to a number of specific subpopulations.
Relational data, such

Postdoc position in Applied Statistics and Bayesian Modeling available immediately (posted 2016-11-07)
A full-time postdoctoral position is available beginning immediately in the research group of Professor Tian Zheng working on reproducible modeling of social and behavioral data using Bayesian computing, in close cooperation with our collaborators in social sciences.
Requirements: The work is highly interdisciplinary, and applicants must have strong statistical and computational skills.

ASA SLDS JSM 2017 student paper competition (posted 2016-09-28)
Call for papers
Student Paper Competition - JSM 2017
(July 29th-Aug 3rd, 2017, Baltimore, MD)
ASA Section on Statistical Learning and Data Science
Sponsored by SLDS
Key dates:
• Abstracts due December 15th, 2016
• Full papers due January 4th, 2017
The Section on Statistical Learning and Data Science (SLDS) of the American Statistical Association (ASA) is sponsoring a student paper competition

Resources for data analytics at Columbia (posted 2016-09-10)
I got an email asking for resources (other than courses) for data analytics at Columbia. This is what I wrote in reply:
The library runs workshops on analytics: http://library.columbia.edu/research/workshops.html.
You can join student organizations such as Columbia Data Science Society (http://datascience.columbia.edu/columbia-data-science-society) and Columbia Statistics Club (https://

ADS alum Yuhan Sun's summer intern at UNICEF (posted 2016-09-05)
A Spring 2016 alum from our Applied Data Science course, Yuhan Sun (MA in Statistics, Columbia University), spent the past summer as a data scientist at UNICEF. She extended a Shiny app that provides a web-based application for generating child mortality estimates. These estimates are computed from empirical data using the United Nations Inter-agency Group for Child Mortality Estimation (UN

A happy birthday in numbers (posted 2016-09-03)
On my birthday, I received a total of 65 "happy birthday!" messages (in either Chinese or English) via social messaging. For the first year, I got birthday wishes from LinkedIn. I thought it'd be fun to use Tableau to visualize some basic information about these social messages, as a snapshot of a subset of my social network.
More than 3/4 of my birthday wishing social network are Chinese.

A semester of data science fun with new teaching approaches (posted 2016-09-02)
This semester I am teaching GU4243/GR5243 Applied Data Science (syllabus) and GR5705 Introduction to Data Science (syllabus). ADS will be a Project-Based Learning (PBL) class and IDS will use the flipped classroom approach.

Applied Data Science (posted 2016-09-01)
Another semester of data science fun is just around the corner.
Follow us at TZStatsADS.

Postdoc position available immediately (posted 2016-08-17)
Postdoctoral position in spatiotemporal data analysis
A full-time postdoctoral position is available beginning immediately in the research group of Professor Tian Zheng working on analysis of large spatiotemporal data sets, in close cooperation with our collaborators in neural imaging.
Requirements: The work is highly interdisciplinary, and applicants must have strong statistical and

Call for papers--Statistical Learning for Data Science (SLDS) (posted 2016-05-04)
Call For Papers
(New deadline: June 12, 2016)
Special Session on
"Statistical Learning for Data Science (SLDS)"
In DSAA 2016: The 3rd IEEE International Conference on Data Science and Advanced Analytics
Montreal, Canada October 17-19, 2016
Organized by
Tian Zheng (Department of Statistics, Columbia University)
Wei Pan (Department of Biostatistics, University of Minnesota)
Hernando Ombao,

Project-based learning (PBL) (posted 2016-01-18)
This semester I will be teaching Applied Data Science (W4249), a project-based learning (PBL) course. I came up with the idea for this course without being aware of this line of discussion in education innovation. As I have been preparing for this course, I started looking for research on instructional practices using projects. I have discovered some very nice discussion on PBL. PBL has been

What can "Cinderella", the classic children's story, teach us about model selection? (posted 2015-11-12)
Today in my linear regression class, we accidentally came across a nice example for model selection from classic children's literature, "Cinderella."
First, we need to define our selection goal: find a suitable wife (model). Second, we need to define our selection criterion. The king wanted a "suitable mother" but the prince was looking for something more. Third, given all the possible models, we

Identity crisis solved: I am a unicorn trainer (posted 2015-11-06)
Last night, I watched the JSM encore on "the Statistics Identity Crisis". First, let me just say I was so relieved to learn that I am not the only one who felt the puzzlement "am I a data scientist or not?" I have never felt that tomatoes are so relatable.
Following the recommendation from the fourth talk, I also watched "Big Data, Data Science and Statistics. Can we all live together?" by

Several pointers for graduate students in 140 characters (posted 2015-11-01)
I gave a casual "speech" at a recent casual pizza hour for our MA students. It was intended to be a chat but the students didn't ask many questions and I just went on and on. So it felt like a speech. One of the students tweeted:
I couldn't have summarized better.

Prediction-oriented learning (posted 2015-10-28)
Our paper on "Why significant variables aren’t automatically good predictors" just came out in PNAS. Here is the press release.
This paper was motivated by our long-time observation that variables selected using significance tests often (not always) perform poorly in prediction. This is not news to many applied scientists we spoke to but the reason for this phenomenon has not been explicitly

Re-vision Minard's plot (posted 2015-10-09)
Minard's plot is a famous example in the history of visualization. Using thickness of lines, it clearly documented Napoleon's fateful defeat in 1812.
Many have attempted to recreate this graph using modern tools. Here is mine using my favorite and only data science programming tool, R. It can still be improved: when the line turns sharply, the thickness is off. I got lazy with algebra.

Debugging the good results (posted 2015-10-08)
You have a data set and an idea to model the data, in the hope that it will provide some information or solution to a problem. In the ideal world, you shall just cast the idea on the data like a never-fail spell and, ta-da, the solution shall just pop out of thin air.
It does not happen in the real world. Even in the wizarding world, when an angry Harry Potter tried to use the Cruciatus spell on an

Animated plot using R package animation (posted 2015-10-07)
Step 1: install ImageMagick
Step 2: write a loop that creates a sequence of plots.
Step 3: use saveGIF({ }, ...)
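
The three steps above can be sketched in R as follows. This is only a minimal sketch, assuming the animation package is installed and ImageMagick can be found on the system path. The sine-wave plot, the frame count n_frames, and the output file name sine_wave.gif are made up for illustration and are not from the original post.

```r
# Assumption: install.packages("animation") has been run and
# ImageMagick (Step 1) is installed and on the PATH.
library(animation)

n_frames <- 20                          # hypothetical number of frames
x <- seq(0, 2 * pi, length.out = 200)   # grid shared by every frame

# Step 3: wrap the plotting loop in saveGIF({ }, ...).
saveGIF({
  # Step 2: each pass through the loop draws one complete plot,
  # and each plot becomes one frame of the GIF.
  for (i in seq_len(n_frames)) {
    phase <- 2 * pi * i / n_frames
    plot(x, sin(x + phase), type = "l", ylim = c(-1, 1),
         xlab = "x", ylab = "sin(x + phase)",
         main = paste("Frame", i, "of", n_frames))
  }
}, movie.name = "sine_wave.gif", interval = 0.1)
```

Here interval is the delay between frames in seconds, so a smaller value gives a faster animation. Anything that can be drawn inside the loop can replace the plot() call.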

Usage trends on R and Python (2014 to 2015) (posted 2015-09-22)
According to a recent KDnuggets poll, both R and Python have furthered their dominance as programming languages for analytics, data analysis, data mining and modeling. As a self-learning task, I made my very first chord diagram using the R package circlize. It was quite easy to use and took me about 2 minutes to make a first draft and another 30 minutes to refine the layout, color, etc (given I was

ASA SLDM call for papers (JSM 2016) (posted 2015-09-07)
Call for papers
Student Paper Competition - JSM 2016
(July 30th-Aug 4th, 2016, Chicago, IL)
ASA Section on Statistical Learning and Data Mining
Jointly sponsored by SLDM and PANDORA
Key dates:
• Abstracts due December 15th, 2015
• Full papers due January 4th, 2016
The Section on Statistical Learning and Data Mining (SLDM) of the American Statistical Association (ASA) is sponsoring a student paper

A simple solution for R vioplot cut() error (posted 2015-09-07)
I am using R's violin plots to visualize side-by-side comparisons of empirical distributions. At some specifications, the simulated values can all be equal to a constant. That will cause the following error.
Error in cut.default(x, breaks = breaks) : 'breaks' are not unique
To resolve this problem, one can try the following simple fix: instead of vioplot(x, ...), input vioplot(x+rnorm(length(x), 0

Discussing "Statistical Methods for Observational Health Studies" (JSM 2015) (posted 2015-08-21)
This post is based on my recollection of what I discussed in the session on "Statistical methods for observational health studies" that I organized. There has been a boom of such studies due to the availability of large collections of patient records and medical claims.
The analysis of the "big data" from health studies is different from many machine learning tasks in two ways. First, association is

The fifth V of Big Data: variables (JSM 2015 discussion) (posted 2015-08-18)
I gave the following discussion (from recollection and my notes) during the session "the fifth V of Big Data: variables" organized by Cynthia Rudin.
The notion "Big Data" does not simply refer to a data set that is large in size. It includes all complex and nontraditional data that do not necessarily come in the form of a typical clean Excel sheet with rows corresponding to individuals and

Forget P-values? or just let it be what it is (posted 2015-04-26)
The p-value has always been controversial. It is required for certain publications, banned from some journals, hated by many yet quoted widely. Not all p-values are loved equally. Because of what someone popularized some 90 years ago, the small values below 0.05 have been the crowd's favorite.
When we teach hypothesis testing, we explain that the entire spectrum of p-values serves a single purpose: