It all started when I received an email from one of my current students:
"In recitation today we were talking a little about independent variables vs. variables that appear to be associated. Yuejing suggested that if we see an association between variables, this does not mean that there is a causal relation between the two. Are statisticians always limited to this 'weakened' position? If not, what formal tools are used to decide at what point, from a statistical point of view, the distribution of one variable compared to that of another suggests causality instead of a mere association?"
Here is my reply:
"Usually, analysis of observational data cannot establish causal relations between factors and outcomes, unless the pattern is extremely strong. Experiments are the best way to establish causation. This is why anything about humans is hard to verify, since it is usually not ethical to run experiments on humans unless no harm can come of them.
For example, to test whether a new medicine is effective, pharmaceutical companies need to run clinical trials (experiments on humans). Before that, they need to use lab animals to test whether the treatment is safe. Statistical analysis is the most important part of the report the FDA requires before approving a new medicine.
Also, people have been talking about genes that cause diseases, right? To see whether a specific gene truly causes a specific disease, researchers need to examine transgenic lab animals to confirm such a causal relation.
There is a branch of statistics called "causal analysis" for observational data. However, even there, no conclusion can be made with 100% confidence. This is not something unique to statistics. In biology, physics, and chemistry, most scientific results have exceptions that we don't fully understand. It is especially true of astronomy, where most published results are only theories. The difference is that statistics may be the only science that explicitly admits we don't know everything, and the only one that attempts to estimate the extent of our ignorance."
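The contrast between observational data and experiments can be made concrete with a small simulation. The sketch below uses an invented scenario (all names and numbers are hypothetical): a hidden "severity" variable makes sicker patients more likely to take a drug, so the observational comparison makes a genuinely helpful drug look harmful, while a randomized experiment recovers the true effect.

```python
# Hypothetical simulation: a hidden confounder ("severity") biases the
# observational estimate of a drug's effect; randomization removes the bias.
import random

random.seed(0)
n = 100_000
TRUE_EFFECT = 0.2  # the drug truly improves the outcome by +0.2

# --- Observational data: sicker patients are more likely to take the drug ---
obs_treated, obs_control = [], []
for _ in range(n):
    severity = random.random()               # hidden confounder in [0, 1)
    treated = random.random() < severity     # sicker -> more likely treated
    # outcome: higher is better; severity hurts, the drug helps
    outcome = 1.0 - severity + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 0.1)
    (obs_treated if treated else obs_control).append(outcome)

naive_effect = (sum(obs_treated) / len(obs_treated)
                - sum(obs_control) / len(obs_control))

# --- Randomized experiment: a coin flip decides who is treated ---
rct_treated, rct_control = [], []
for _ in range(n):
    severity = random.random()
    treated = random.random() < 0.5          # randomization breaks the link
    outcome = 1.0 - severity + (TRUE_EFFECT if treated else 0.0) + random.gauss(0, 0.1)
    (rct_treated if treated else rct_control).append(outcome)

rct_effect = (sum(rct_treated) / len(rct_treated)
              - sum(rct_control) / len(rct_control))

print(f"true effect:            +{TRUE_EFFECT:.2f}")
print(f"observational estimate: {naive_effect:+.2f}")  # biased, negative
print(f"randomized estimate:    {rct_effect:+.2f}")    # close to +0.20
```

In the observational arm the treated group is sicker on average, so the naive difference in means comes out negative even though the drug helps; only the randomized comparison estimates the causal effect.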
Here are comments from my colleague, Professor Andrew Gelman:
1. this is the kind of thing you can put on your blog!
2. the student asked a good question!
3. your answer is reasonable.
more generally, when we speak of causation in statistics, we usually speak of "treatments" or "interventions", not just of "variables". thus, we would not say that variable A causes variable B. rather, we would say that a specified treatment T, which affects variable A, also affects variable B. thus, T affects both A and B. the idea is to design an experiment so that T is something that is realistic.
for example, surveys have found a positive correlation between "social capital" (roughly, the size and quality of your social network) and "happiness". does social capital cause happiness, or does happiness cause social capital, or neither, or both, or ...? the way a statistician would address this issue would be to design a particular treatment, designed to directly affect "social capital", and see how it affects "happiness". and to consider other treatments to directly affect "happiness" and see how they affect "social capital". or to look for observational data that has the form of a "natural experiment" affecting these. in this example, things get interesting because measurements and treatments can be made at individual or group levels. hence multilevel modeling becomes relevant.
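Gelman's point can also be sketched numerically. In the toy world below (everything here is invented for illustration), a latent trait drives both "social capital" and "happiness", so the two correlate in survey data even though neither causes the other; a treatment T that directly raises social capital then shows no effect on happiness, which only the experiment reveals.

```python
# Hypothetical sketch: a latent trait produces a survey correlation between
# two variables with no causal link between them; an intervention on one
# variable shows a near-zero effect on the other.
import random

random.seed(1)
n = 50_000

def correlation(xs, ys):
    """Pearson correlation, computed from scratch."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    sx = (sum((x - mx) ** 2 for x in xs) / len(xs)) ** 0.5
    sy = (sum((y - my) ** 2 for y in ys) / len(ys)) ** 0.5
    return cov / (sx * sy)

# Survey world: a latent trait (say, sociability) drives both variables.
capital, happiness = [], []
for _ in range(n):
    trait = random.gauss(0, 1)
    capital.append(trait + random.gauss(0, 1))
    happiness.append(trait + random.gauss(0, 1))

r = correlation(capital, happiness)
print(f"survey correlation: {r:.2f}")  # about 0.5 in this toy model

# Experiment: treatment T adds +2.0 to social capital for a random half;
# happiness is generated exactly as before, so the true effect on it is zero.
treated_h, control_h = [], []
for _ in range(n):
    trait = random.gauss(0, 1)
    t = random.random() < 0.5
    cap = trait + random.gauss(0, 1) + (2.0 if t else 0.0)
    hap = trait + random.gauss(0, 1)
    (treated_h if t else control_h).append(hap)

effect = (sum(treated_h) / len(treated_h)
          - sum(control_h) / len(control_h))
print(f"estimated effect of T on happiness: {effect:+.2f}")  # near zero
```

The survey correlation is real, but intervening on social capital does nothing to happiness in this toy world, which is exactly why a designed treatment, rather than the observed association, is what carries causal information.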