Tuesday, March 12, 2013

What is BIG data?

Tyler told me that he is going to present on a panel about "Big Data." "What is big data?" I asked him, and he didn't give me a satisfactory definition. For the past two years, every time I have heard the phrase "big data," it has reminded me of the "Big Salad" episode of Seinfeld.
ELAINE: Um, hum, I don't know.. . . A big salad?
GEORGE: What big salad? I'm going to the coffee shop.
ELAINE: They have big salads.
GEORGE: I've never seen a big salad.
ELAINE: They have a big salad.
GEORGE: Is that what I ask for? The BIG salad?
ELAINE: It's okay, you don't…
GEORGE: No, no, Hey I'll get it. What's in the BIG salad?
JERRY: Big lettuce, big carrots, tomatoes like volleyballs.
GEORGE: (???), we'll see you in a little while.
I feel that I sort of know what people are referring to, but I cannot really accept the idea of generalizing all the special needs of complex, messy, multi-disciplinary, multi-source, incomplete, biased-by-design, ... data into the one word "BIG." Yes, the data are big. So big that getting them into a form that can be handled by traditional computational mechanisms becomes hard. Innovations are needed either to allow nearly lossless reduction of the BIG data to a manageable size, or to lead to new computational mechanisms for BIG data. This is the foremost step of any big data project. This part of the battle is more computer science than statistics.

To a statistician, the whole new era of "BIG data" feels like a call for more dynamic models that can capture trends in space and time, better model-based tools for integrating multiple, individually incomplete data sources, and systematic data analysis tools that can mitigate design and sampling biases in the huge collection of existing data. It is like a jigsaw puzzle. A small data set is like a small puzzle, and a BIG data set is like a gigantic, bigger-than-a-football-field kind of puzzle. Exciting, amazing, fun (?), and intimidating. Not all of our old puzzle-solving tricks will work well, but some fundamental principles still prevail, as long as it is indeed a jigsaw puzzle, as we understand it, and not something else.