Tuesday, August 18, 2015

The fifth V of Big Data: variables (JSM 2015 discussion)

I gave the following discussion (from recollection and my notes) during the session "the fifth V of Big Data: variables" organized by Cynthia Rudin.

The notion "Big Data" does not simply refer to a data set that is large in size. It includes all complex and nontraditional data that do not necessarily come in the form of a typical clean Excel sheet with rows corresponding to individuals and columns corresponding to variables. In other words, Big Data are often unstructured and do not have naturally defined variables.

Variables are central to nearly all statistical learning tasks. We study their distributions to build models and predictive tools. Therefore, in Big Data, how to define variables is one of the important first steps that is critical for the success of the statistical learning later on. This step is also known as feature generation. Even when some variables are observed on individuals in a data set, they often do not come in the form or scale most relevant with the learning task at hands. Domain knowledge, when used correctly as we have learnt from Kaiser's talk, is often the most helpful in identifying and generating features. At other times, we need some help from such as exploratory data analysis, sparse learning and metric learning to form nonlinear transformation.

For variables generated, they first need to predict well, i.e., achieve accuracy. In addition, for many application, they need to be interpretable. Here, sometime we need to strike a balance between these two criteria. One way to achieve such a balance is to encourage sparsity in the solution, which is often computational challenging.

Variables, the fifth V of Big Data, are essential for most statistical solutions and require a delicate three-way balance of accuracy, interpretability and computability.

No comments: