The analysis of "big data" from health studies differs from many machine learning tasks in two ways. First, association is not enough: identifying causal relations is essential for any possible intervention. Second, detecting an effect may not be enough: precise and accurate estimates of the effect size are often strongly desired.
For dealing with observational health studies, here is some of my advice, which is not intended to be comprehensive.
- Understand the available data, and especially understand what was not observed. When you do not have full control of the study design and data collection, there will always be some issues: sampling bias, information bias, measurement bias. Always ask what the measured data actually represent: how were the data collected? How were certain measures defined?
- Know what your methods are estimating: association, causality, or Granger causality? The effect size under a particular model may not reflect the true effect size. What questions need to be answered? What questions can your methods actually address?
- Always carry out some sensitivity analysis: are your results sensitive to model assumptions or to small changes in the data? Available tools include simulations, multiple data sets, and resampling. This can be challenging for certain studies, as defining "agreement" between different findings can be tricky.
- Always report uncertainty: combining the estimated sampling error from modeling with the results of sensitivity analysis gives the users of your results a better sense of uncertainty. This is especially important when modeling strategies are introduced to address a small area estimation problem.
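
To make a few of these points concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption made for illustration: the simulated data, the coefficients, and the helper names `simulate`, `ols_slope`, and `bootstrap_interval` are hypothetical and do not come from any real study. It simulates an exposure `x` and an outcome `y` that share a confounder `z`, compares the unadjusted association with the estimate adjusted for `z`, and uses bootstrap resampling as one simple way to check sensitivity to small changes in the data and to report uncertainty.

```python
import numpy as np


def simulate(n, rng, true_effect=1.0):
    """Simulated (hypothetical) data: z confounds the x -> y relation."""
    z = rng.normal(size=n)                      # confounder
    x = 0.8 * z + rng.normal(size=n)            # exposure depends on z
    y = true_effect * x + 1.5 * z + rng.normal(size=n)  # outcome depends on x and z
    return x, y, z


def ols_slope(y, *covariates):
    """Least-squares coefficient of the first covariate (intercept included)."""
    design = np.column_stack([np.ones(len(y)), *covariates])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


def bootstrap_interval(y, covariates, n_boot=500, level=0.95, rng=None):
    """Percentile bootstrap interval for the first covariate's coefficient."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(y)
    estimates = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)        # resample rows with replacement
        estimates[b] = ols_slope(y[idx], *(c[idx] for c in covariates))
    alpha = (1.0 - level) / 2.0
    return np.quantile(estimates, [alpha, 1.0 - alpha])


rng = np.random.default_rng(0)
x, y, z = simulate(2000, rng)

naive = ols_slope(y, x)          # association only: ignores the confounder
adjusted = ols_slope(y, x, z)    # adjusts for z, which is observed here
lo, hi = bootstrap_interval(y, [x, z], rng=rng)

print(f"naive slope:    {naive:.3f}")
print(f"adjusted slope: {adjusted:.3f}")
print(f"95% bootstrap interval for adjusted slope: [{lo:.3f}, {hi:.3f}]")
```

In this simulation the unadjusted slope overstates the effect because `z` drives both variables. In a real observational study the confounder may be exactly the thing that was not observed, and a bootstrap interval only captures sampling variability under the chosen model, so it should be reported alongside the sensitivity checks rather than in place of them.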