Friday, March 24, 2006

Data: the only and narrow window

Today, at an ISERP lunch group, Peter Hoff from the University of Washington gave a talk on latent factor models for network data. It is nice to have a 3-hour discussion on a single topic, since you get to stop, think, and discuss a particular problem without worrying about running out of time.

One question I had was whether the latent factors fitted to the data correspond to demographic characteristics of the nodes in the network. Even before I asked, I knew the answer would be "not necessarily". The latent factors just provide a way, or a model, to decompose the variation structure of a network into more interpretable factors that represent the initiator and the receiver of an edge.
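To make the "not necessarily" concrete for myself, here is a minimal sketch. It assumes a simplified bilinear form in which the latent strength of an edge from node i to node j is the inner product of a sender factor u_i and a receiver factor v_j; this is an illustration, not necessarily the exact model from the talk, and the sizes, seed, and rotation angle are made up. Rotating both sets of factors by the same orthogonal matrix leaves every edge strength unchanged, so the fitted coordinates by themselves need not line up with any particular node attribute.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2  # hypothetical: 6 nodes, 2 latent dimensions

U = rng.normal(size=(n, k))  # sender (initiator) factors, one row per node
V = rng.normal(size=(n, k))  # receiver factors, one row per node
M = U @ V.T                  # M[i, j]: latent strength of the edge i -> j

# Rotate both factor sets by the same orthogonal matrix R.
theta = 0.7  # arbitrary angle, chosen only for illustration
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
U_rot = U @ R
V_rot = V @ R
M_rot = U_rot @ V_rot.T      # equals U R R^T V^T = U V^T

print(np.allclose(M, M_rot))  # same edge strengths
print(np.allclose(U, U_rot))  # different factor coordinates
```

The two factorizations describe the same network equally well, which is one reason to be cautious about reading demographic meaning into individual latent coordinates.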

There was other discussion along this direction as well. I didn't catch all of it, since I was busy working through some simple numerical examples to help myself understand better. Then I heard Andrew say: "We cannot claim to infer the data-generating mechanism behind the data. We can only infer a data-generating mechanism that can generate the data observed."

This reminded me of my thoughts on data and models.

Data (limited observed values) always partition all possible models into equivalence classes. For example, in regression, n points (x_i, y_i) define classes of curves that take the same values at the x_i's. Regression analysis is simply trying to find the class closest to the data. In a modeling effort, the targeted model space intersects the data's equivalence classes. After the intersection, if more than one model remains in an equivalence class, we get an identifiability issue.
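A small numerical sketch of the regression example, with made-up data points: two different curves that agree at every observed x_i fall into the same equivalence class, and the data alone cannot tell them apart. The curve g is built by adding to f a term that vanishes at each x_i; the names f and g and the scaling constant 0.5 are chosen only for illustration.

```python
import numpy as np

# Observed data: n = 4 points (x_i, y_i). Values are made up for illustration.
x_obs = np.array([0.0, 1.0, 2.0, 3.0])
y_obs = np.array([1.0, 3.0, 2.0, 5.0])

# Curve f: the cubic polynomial that interpolates the 4 points.
coef = np.polyfit(x_obs, y_obs, deg=3)

def f(x):
    return np.polyval(coef, x)

def g(x):
    # Curve g: f plus a term that is zero at every observed x_i,
    # so g matches f on the data but differs everywhere else.
    bump = np.prod([x - xi for xi in x_obs], axis=0)
    return f(x) + 0.5 * bump

print(np.allclose(f(x_obs), g(x_obs)))  # identical at the observed x_i's
print(f(1.5), g(1.5))                   # different between the observations
```

If the model space is restricted to polynomials of degree at most 3, only f survives the intersection with this class. If quartics are allowed, both f and g remain, and the data cannot choose between them.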

We can only understand the world to the extent that the data allow. When we ask others about the size of their data sets, we may just sound like coworkers comparing offices: "How's the view in your new office?" "Pretty good! The window's much bigger than what I used to have." "Wow, nice! You can see so much more now!"