
Tuesday, September 22, 2015

Usage trends on R and Python (2014 to 2015)

According to the latest KDnuggets poll, both R and Python have extended their dominance as programming languages for analytics, data analysis, data mining and modeling. As a self-learning task, I made my very first chord diagram using the R package circlize. It was quite easy to use: it took me about 2 minutes to make a first draft and another 30 minutes to refine the layout, colors, etc. (given that I was remote-desktopping from an iPad to my 10-year-old Windows PC in my office!).
The chord diagram shows that both Python and R gained new users. The biggest movements are from previous R-only or Python-only users who decided to adopt the other programming language. This is natural, as R and Python offer very different user experiences and in some areas complement each other. New analytics researchers/data scientists with little or no prior experience almost exclusively chose R as a starting point. Users who decided to start using Python all had prior programming experience.

This confirms what I have been speculating (which by no means can be claimed as novel or original). The most attractive aspect of R is its relative ease of use. R has its own programming challenges and can sometimes be hard to debug. But it does not take long for new users to start hacking data. This is precisely the bottleneck for Python: not everyone is willing to make the leap into script programming. The problem areas for R are computing speed, memory management and its interface with other programming tools, all of which are improving. As statisticians, we can leverage our years of experience with R and learn new computational tricks and new tools from other languages that have been interfaced with R, without the need to leave R.

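Here is the R code (it requires library(circlize); grid.col holds the sector colors, and its definition is not shown):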
> mat.v
             R2015 Python2015 other2015 none2015
R2014      0.40480     0.0506   0.00460    0.000
Python2014 0.01771     0.2093   0.00529    0.000
other2014  0.04370     0.0253   0.16100    0.000
none2014   0.04400     0.0000   0.00000    0.036
> circos.clear()
> circos.par(start.degree=-105)
> circos.par(gap.degree=c(rep(2, nrow(mat.v)-1), 30, 
                          rep(2, ncol(mat.v)-1), 30))
> chordDiagram(mat.v, order=c("R2014", "none2014", 
                             "other2014", "Python2014",
                             "Python2015", "other2015", 
                             "none2015", "R2015"), 
              grid.col=grid.col, directional=TRUE)
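
As a quick sanity check on the reading of the diagram, here is a minimal base-R sketch that re-enters the matrix printed above and compares its row and column sums. It assumes that the entries of mat.v are proportions of respondents, with rows as 2014 choices and columns as 2015 choices; the names share2014, share2015 and summary.tab are mine and are not part of the original session.

```r
# Flow matrix as printed in the session above
# (assumed: rows = 2014 choice, columns = 2015 choice).
mat.v <- matrix(
  c(0.40480, 0.0506, 0.00460, 0.000,
    0.01771, 0.2093, 0.00529, 0.000,
    0.04370, 0.0253, 0.16100, 0.000,
    0.04400, 0.0000, 0.00000, 0.036),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    c("R2014", "Python2014", "other2014", "none2014"),
    c("R2015", "Python2015", "other2015", "none2015")
  )
)

# Row sums: share of each group in 2014. Column sums: share in 2015.
share2014 <- rowSums(mat.v)
share2015 <- colSums(mat.v)

# Rows and columns are listed in the same group order, so the two
# vectors can be compared position by position.
summary.tab <- rbind(
  share2014 = unname(share2014),
  share2015 = unname(share2015),
  change    = unname(share2015) - unname(share2014)
)
colnames(summary.tab) <- sub("2014$", "", names(share2014))

print(round(summary.tab, 4))
```

With these numbers, R's share goes from about 0.46 to about 0.51 and Python's from about 0.23 to about 0.29, which is consistent with both languages gaining users.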





Monday, September 07, 2015

ASA SLDM call for papers (JSM 2016)

Call for papers
Student Paper Competition-JSM 2016
(July 30th-Aug 4th, 2016, Chicago, IL)
ASA Section on Statistical Learning and Data Mining
Jointly sponsored by SLDM and PANDORA

Key dates:
• Abstracts due December 15th, 2015
• Full papers due January 4th, 2016

The Section on Statistical Learning and Data Mining (SLDM) of the American Statistical Association (ASA) is sponsoring a student paper competition for the 2016 Joint Statistical Meetings in Chicago, IL, on July 30th-August 4th, 2016.

The paper may present original methodological research or a real-world application (from various fields including but not limited to marketing, pharmaceutical, genomics, bioinformatics, imaging, defense, business, public health) that uses principles and methods in statistical learning and data mining.

Papers that have been accepted for publication are NOT eligible for the competition. Selected winners will present their papers in a designated session at the 2016 JSM in Chicago, IL, organized by the award committee. In this session, they will be presented with a monetary prize and an award certificate. Winning papers will be recommended for submission to Statistical Analysis and Data Mining: The ASA Data Science Journal, which is the flagship journal of the SLDM Section.

Graduate or undergraduate students who are enrolled in Fall 2015 or Winter/Spring 2016 are eligible to participate. The applicant MUST be the first author of the paper.

Abstracts (up to 1000 characters) are due 12:00 PM (noon) EST on December 15th, 2015 and shall be submitted via this Abstract submission form (http://goo.gl/forms/HHMZ1051Lt). ONLY students who submit their abstracts on time are eligible for submitting full papers after 12/15/2015.

Full papers and other application materials must be submitted electronically (in PDF, see instruction below) to Professor Tian Zheng (tian.zheng@columbia.edu) by 12:00 PM (noon) EST on Monday, January 4th, 2016. ONLY students who submit their abstracts by 12/15/2015 are eligible for submitting full papers.

All full paper email entries must include the following:
  1. An email message containing:
    • List of authors and contact information;
    • Abstract with no more than 1000 characters.
  2. Unblinded manuscript - double-spaced with no more than 25 pages including figures, tables, references and appendix. Please use 11pt fonts (preferably Arial or Helvetica) and 1 inch margins all around.
  3. Blinded versions of the abstract and manuscript (with no author names or references that could easily lead to author identification).
  4. A reference letter from a faculty member familiar with the student's work which MUST include a verification of the applicant's student status and, in the case of joint authorship, should indicate the fraction of the applicant's contribution to the manuscript.
All materials must be in English.

Entries will be reviewed by the Student Paper Competition Award committee. The selection criteria used by the committee will include statistical novelty, innovation and significance of the contribution to the field of application as well as the professional quality of the manuscript.

This year’s student competition is sponsored jointly by SLDM and PANDORA and is chaired by Professor Tian Zheng (Columbia University). Award announcements will be made in mid-January 2016. For inquiries, please contact Professor Tian Zheng (tian.zheng@columbia.edu).

A simple solution for R vioplot cut() error

I am using R's violin plots to visualize side-by-side comparisons of empirical distributions. Under some specifications, the simulated values can all be equal to a constant, which causes the following error:
Error in cut.default(x, breaks = breaks) : 'breaks' are not unique
To resolve this problem, one can try the following simple fix: instead of vioplot(x, ...), use vioplot(x+rnorm(length(x), 0, 1e-6), ...), or simply use the function jitter(). The standard deviation of the random noise should be much smaller than the scale of x.
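
The fix can be wrapped in a small helper so that the noise scales with the data. This is only a sketch: add_tiny_noise and its rel argument are hypothetical names of my own (not part of the vioplot package), and the plotting call is skipped if vioplot is not installed.

```r
# Hypothetical helper (not part of vioplot): perturb x by noise whose
# standard deviation is a tiny fraction (rel) of the magnitude of x.
add_tiny_noise <- function(x, rel = 1e-6) {
  scale <- max(abs(x), 1)  # fall back to 1 when x is all zeros
  x + rnorm(length(x), mean = 0, sd = rel * scale)
}

set.seed(1)
x_const <- rep(3, 50)            # degenerate sample: every value equals 3
x_fixed <- add_tiny_noise(x_const)
y <- rnorm(50, mean = 3)         # an ordinary sample for comparison

# Plot only if the vioplot package is available.
if (requireNamespace("vioplot", quietly = TRUE)) {
  vioplot::vioplot(x_fixed, y, names = c("constant + noise", "normal"))
}
```

The perturbed values are no longer all identical, while the change is far too small to be visible at the scale of the plot.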