t+z statistics

On statistics, data science and applications. http://www.stat.columbia.edu/~tzheng.
Tian Zheng (http://www.blogger.com/profile/13210536019151991331)
Feed updated 2026-03-02; showing posts 1-25 of 194.

The Collaboratory Program at Columbia (posted 2021-11-01)

A summary of my @TheHDSR paper with @IsabelleZaugg, Trish Culligan and Richard Witten on the Collaboratory Program @Columbia, a collaboration of @ColumbiaEship and @DataSciColumbia. https://hdsr.mitpress.mit.edu/pub/j8v2h5hc/release/2 Columbia's Collaboratory program is dedicated to "Preparing Tomorrow’s Leaders for a Data-Rich World". It offers a meta-model for a data science education accelerator

Forget P-values? or just let it be what it is (posted 2018-07-11)

The p-value has always been controversial. It is required for certain publications, banned from some journals, hated by many, yet quoted widely. Not all p-values are loved equally. Because of a "rule" popularized some 90 years ago, small values below 0.05 have been the crowd's favorite. When teaching hypothesis testing, we explain that the entire spectrum of p-values serves a single purpose:

DSI Scholars Initiation Boot Camp - Project-based learning of the Data Science life cycle (posted 2018-06-21)

"How do you give a Data Science bootcamp to a mixed cohort of undergraduate and graduate students from a wide range of disciplines?" This is a question I was facing earlier this year when I needed to design the initiation boot camp for our inaugural class of DSI Scholars.  Some of them are rising sophomores, and some of them are PhDs in English. Some of these students had years of

Statistical thoughts on learning from Google search trends (posted 2017-10-10)

Today I was invited to give comments at the book signing presentation by Seth Stephens-Davidowitz on his book: "Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are". Here are my comments, which were random statistical thoughts I had while reading this book. Nobel Prize Economics 2017 "many economists to pay more attention to human behavior" "people

Aggregated relation data (ARD), what is that? (posted 2017-09-25)

Today, an academic friend of mine asked me about a series of papers I wrote on the topic of "how many X's do you know" questions. These papers are about a special kind of indirect "network" data that do not have detailed information on individual edges in a network. Rather, they contain counts of the edges that connect an ego to a number of specific subpopulations. Relational data, such
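To make the data structure concrete, here is a minimal R sketch of how aggregated counts relate to a full network. The toy network, the group labels `A`, `B`, `C`, and the edge probability are all hypothetical and chosen only for illustration; in practice the full adjacency matrix is not observed and only the counts are collected through survey questions.

```r
# Hypothetical toy network: 8 people, each in one of three subpopulations.
set.seed(42)
n <- 8
group <- c("A", "A", "A", "B", "B", "C", "C", "C")

# Simulate a symmetric 0/1 adjacency matrix (the full network, usually unobserved).
adj <- matrix(0, n, n)
adj[upper.tri(adj)] <- rbinom(n * (n - 1) / 2, size = 1, prob = 0.4)
adj <- adj + t(adj)

# ARD: for each ego (row), count the edges that go into each subpopulation (column).
ard <- sapply(split(seq_len(n), group),
              function(members) rowSums(adj[, members, drop = FALSE]))
print(ard)
```

Each row of `ard` answers "how many people in group A, B and C does this person know" without revealing which individuals those are.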

Postdoc position in Applied Statistics and Bayesian Modeling available immediately (posted 2016-11-07)

A full-time postdoctoral position is available beginning immediately in the research group of Professor Tian Zheng working on reproducible modeling of social and behavioral data using Bayesian computing, in close cooperation with our collaborators in social sciences.  Requirements: The work is highly interdisciplinary, and applicants must have strong statistical and computational

ASA SLDS JSM 2017 student paper competition (posted 2016-09-28)

Call for papers
Student Paper Competition - JSM 2017 (July 29th-Aug 3rd, 2017, Baltimore, MD)
ASA Section on Statistical Learning and Data Science
Sponsored by SLDS
Key dates:
• Abstracts due December 15th, 2016
• Full papers due January 4th, 2017
The Section on Statistical Learning and Data Science (SLDS) of the American Statistical Association (ASA) is sponsoring a student paper competition

Resources for data analytics at Columbia (posted 2016-09-10, updated 2016-09-13)

I got an email asking for resources (other than courses) for data analytics at Columbia. This is what I wrote in reply: The library runs workshops on analytics: http://library.columbia.edu/research/workshops.html.   You can join student organizations such as Columbia Data Science Society (http://datascience.columbia.edu/columbia-data-science-society) and Columbia Statistics Club (

ADS alum Yuhan Sun's summer intern at UNICEF (posted 2016-09-05, updated 2018-07-11)

A Spring 2016 alum of our Applied Data Science course, Yuhan Sun (MA in Statistics, Columbia University), spent the past summer as a data scientist at UNICEF. She extended a Shiny app that provides a web-based application for generating child mortality estimates. These estimates are computed from empirical data using the United Nations Inter-agency Group for Child Mortality Estimation (

A happy birthday in numbers (posted 2016-09-03, updated 2018-07-11)

On my birthday, I received a total of 65 "happy birthday!" messages (in either Chinese or English) via social messaging. For the first year, I got birthday wishes from LinkedIn. I thought it would be fun to use Tableau to visualize some basic information about these social messages, as a snapshot of a subset of my social network. More than 3/4 of my birthday-wishing social network is Chinese.

A semester of data science fun with new teaching approaches (posted 2016-09-02, updated 2018-07-11)

This semester I am teaching GU4243/GR5243 Applied Data Science (syllabus) and GR5705 Introduction to Data Science (syllabus). ADS will be a Project-Based Learning (PBL) class and IDS will be using the flipped classroom approach.

Applied Data Science (posted 2016-09-01, updated 2018-07-11)

Another semester of data science fun is just around the corner.  Follow us at TZStatsADS.

Postdoc position available immediately (posted 2016-08-17)

Postdoctoral position in spatiotemporal data analysis

A full-time postdoctoral position is available beginning immediately in the research group of Professor Tian Zheng working on analysis of large spatiotemporal data sets, in close cooperation with our collaborators in neural imaging.  Requirements: The work is highly interdisciplinary, and applicants must have strong

Call for papers--Statistical Learning for Data Science (SLDS) (posted 2016-05-04, updated 2016-06-12)

Call For Papers (New deadline: June 12, 2016)
Special Session on "Statistical Learning for Data Science (SLDS)"
In DSAA 2016: The 3rd IEEE International Conference on Data Science and Advanced Analytics
Montreal, Canada
October 17-19, 2016
Organized by
Tian Zheng (Department of Statistics, Columbia University)
Wei Pan (Department of Biostatistics, University of

Project-based learning (PBL) (posted 2016-01-18)

This semester I will be teaching Applied Data Science (W4249), a project-based learning (PBL) course. I came up with the idea for this course without being aware of this line of discussion in education innovation. As I have been preparing for this course, I started looking for research on instructional practices using projects. I have discovered some very nice discussions on PBL. PBL has been

What can "Cinderella", the classic children's story, teach us about model selection? (posted 2015-11-12)

Today in my linear regression class, we accidentally realized a nice example for model selection from classic children's literature, "Cinderella." First, we need to define our selection goal: find a suitable wife (model). Second, we need to define our selection criterion. The king wanted a "suitable mother" but the prince was looking for something more. Third, given all the possible models, we

Identity crisis solved: I am a unicorn trainer (posted 2015-11-06)

Last night, I watched the JSM encore on "the Statistics Identity Crisis". First, let me just say I was so relieved to learn that I am not the only one who felt the puzzlement "am I a data scientist or not?" I have never felt that tomatoes are so relatable. Following the recommendation from the fourth talk, I also watched "Big Data, Data Science and Statistics. Can we all live together?" by

Several pointers for graduate students in 140 characters (posted 2015-11-01, updated 2018-07-11)

I gave a casual "speech" at a recent casual pizza hour for our MA students. It was intended to be a chat but the students didn't ask many questions and I just went on and on. So it felt like a speech. One of the students tweeted: I couldn't have summarized it better.

Prediction-oriented learning (posted 2015-10-28)

Our paper on "Why significant variables aren’t automatically good predictors" just came out in PNAS. Here is the press release. This paper was motivated by our long-time observation that variables selected using significance tests often (not always) perform poorly in prediction. This is not news to many applied scientists we spoke to but the reason for this phenomenon has not been explicitly
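One way to see the gap between significance and predictive power is a small simulation. The R sketch below is an illustration under assumed settings (a mean shift of 0.1 standard deviations, 50,000 observations, a fixed classification threshold) and is not the analysis from the paper: with a large sample, a tiny difference between two groups yields a very small p-value, yet classifying on that variable is barely better than guessing.

```r
set.seed(1)
n <- 50000

# Binary outcome and a predictor whose mean differs only slightly between groups.
y <- rbinom(n, size = 1, prob = 0.5)
x <- rnorm(n, mean = 0.1 * y, sd = 1)

# Significance: two-sample t-test of x between the two outcome groups.
p_value <- t.test(x ~ y)$p.value

# Prediction: classify y = 1 when x exceeds the midpoint of the two group means.
predicted <- as.integer(x > 0.05)
accuracy <- mean(predicted == y)

cat("p-value:", format(p_value, digits = 3), "\n")
cat("classification accuracy:", round(accuracy, 3), "\n")
```

The threshold 0.05 is the midpoint between the two assumed group means, so the classifier uses the variable about as well as it can be used; the accuracy stays close to one half regardless of how small the p-value becomes.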

Re-vision Minard's plot (posted 2015-10-09)

Minard's plot is a famous example in the history of visualization. Using thickness of lines, it clearly documented Napoleon's fateful defeat in 1812. Many have attempted to recreate this graph using modern tools. Here is mine using my favorite and only data science programming tool, R. It can still be improved: when the line turns sharply, the thickness is off. I got lazy with algebra.

Debugging the good results (posted 2015-10-08)

You have a data set and an idea to model the data, in the hope that it will provide some information or solution to a problem. In an ideal world, you would just cast the idea on the data like a never-fail spell and, ta-da, the solution would just pop out of thin air. It does not happen in the real world. Even in the wizard world, when an angry Harry Potter tried to use the Cruciatus

Animated plot using R package animation (posted 2015-10-07)

Step 1: install ImageMagick.
Step 2: write a loop that creates a sequence of plots.
Step 3: use saveGIF({ }, ...)
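The three steps can be sketched in R roughly as follows. This is a minimal sketch, assuming the `animation` package and ImageMagick are both installed; the helper names `frame_data` and `make_gif`, the sine-curve frames, and the file name `sine.gif` are hypothetical choices made for illustration.

```r
# Data for frame i of n_frames: a sine curve revealed progressively
# (an illustrative choice of what to animate).
frame_data <- function(i, n_frames) {
  x <- seq(0, 2 * pi, length.out = 200)
  keep <- x <= 2 * pi * i / n_frames
  list(x = x[keep], y = sin(x[keep]))
}

# Wrap the plotting loop in saveGIF(); each plot() call becomes one frame.
# Requires the animation package and ImageMagick (Step 1).
make_gif <- function(file = "sine.gif", n_frames = 20) {
  animation::saveGIF({
    for (i in seq_len(n_frames)) {
      d <- frame_data(i, n_frames)
      plot(d$x, d$y, type = "l",
           xlim = c(0, 2 * pi), ylim = c(-1, 1),
           xlab = "x", ylab = "sin(x)")
    }
  }, movie.name = file, interval = 0.1)  # interval: seconds between frames
}
```

Calling `make_gif()` should then produce the animated GIF. Keeping the per-frame data in its own function makes it easy to check a single frame with an ordinary `plot()` call before rendering the whole animation.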

Usage trends on R and Python (2014 to 2015) (posted 2015-09-22, updated 2018-07-11)

According to a recent KDnuggets poll, both R and Python have furthered their dominance as programming languages for analytics, data analysis, data mining and modeling. As a self-learning task, I made my very first chord diagram using the R package circlize. It was quite easy to use and took me about 2 minutes to make a first draft and another 30 minutes to refine the layout, color, etc (given I was

ASA SLDM call for papers (JSM 2016) (posted 2015-09-07)

Call for papers
Student Paper Competition - JSM 2016 (July 30th-Aug 4th, 2016, Chicago, IL)
ASA Section on Statistical Learning and Data Mining
Jointly sponsored by SLDM and PANDORA
Key dates:
• Abstracts due December 15th, 2015
• Full papers due January 4th, 2016
The Section on Statistical Learning and Data Mining (SLDM) of the American Statistical Association (ASA) is sponsoring a student paper

A simple solution for R vioplot cut() error (posted 2015-09-07)

I am using R's violin plots to visualize side-by-side comparisons of empirical distributions. At some specifications, the simulated values can all be equal to a constant. That causes the following error: Error in cut.default(x, breaks = breaks) : 'breaks' are not unique. To resolve this problem, one can try the following simple fix: instead of vioplot(x, ...), use vioplot(x+rnorm(
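The mechanism behind the error and the jitter fix can be sketched with base R alone. The arguments to `rnorm()` in the post are cut off, so the noise scale `sd = 1e-6` below is an assumption, as is the idea that the breaks are derived from the range of the data; the point is only that a constant vector gives duplicated breaks and a tiny perturbation does not.

```r
# A constant sample, as can happen at some simulation settings.
x <- rep(3, 100)

# Breaks built from the range of a constant vector are all identical,
# which cut() rejects.
err_msg <- tryCatch({
  cut(x, breaks = seq(min(x), max(x), length.out = 5))
  NA_character_
}, error = function(e) conditionMessage(e))
print(err_msg)

# The fix: add negligible Gaussian noise so the values are no longer all equal.
# sd = 1e-6 is an assumed scale, far too small to change the picture.
set.seed(2015)
x_jittered <- x + rnorm(length(x), mean = 0, sd = 1e-6)
binned <- cut(x_jittered,
              breaks = seq(min(x_jittered), max(x_jittered), length.out = 5),
              include.lowest = TRUE)

# With the vioplot package installed, the jittered values can then be plotted:
# vioplot::vioplot(x_jittered)
```

The noise scale should be chosen small relative to the spread of the other groups in the plot, so the constant group still appears as a flat line.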
