In response to my post on "The appliedness of Statistical Research"
Nick Payne said...
This is only somewhat tangential to your comment. After years of being a client of statisticians and then becoming one, I am struck by something: most scientists and researchers who generate data understand that their data are only an approximation of the variable they are truly studying, while most consulting statisticians act as if the data WERE the variable of interest. How can I get applied statisticians to look past the data toward the actual variable? One issue here is that the measuring instrument often introduces (or subtracts) attributes that are not relevant to the true underlying variable.
I have thought about this before. I feel that "variable" is a vaguely defined scientific term, which may carry different meanings in different physical and social sciences, yet has an unambiguous meaning in mathematics. In statistics, when we study the properties of methods, we treat variables at a purely mathematical level, since we work with tools from mathematics. However, when interpreting the values of statistics, or the outputs of a statistical procedure, one needs to distinguish between the variables as they appear in the data and the variables as they exist in reality (the true quantities we are trying to do inference on). The true quantities that scientists study can sometimes be called parameters in statistics, but not always. I believe the whole philosophy behind formal statistical inference is to quantify and understand the "gap" between what we observe in the collected data and the truth.
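A toy way to see this gap is the classical measurement-error model, where the instrument reports the true variable plus noise and a naive analysis quietly substitutes the observed variable for the true one. This is my own minimal sketch, with made-up numbers, not anything from the exchange above:

```python
# Minimal sketch of the gap between the observed variable X and the true
# variable Z, via the classical measurement-error model. All names and
# numbers here are hypothetical, purely for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, beta, sigma_u = 10_000, 2.0, 0.8

z = rng.normal(size=n)                       # the true variable
x = z + rng.normal(scale=sigma_u, size=n)    # what the instrument records
y = beta * z + rng.normal(size=n)            # outcome driven by the TRUE variable

# Treating x as if it were z attenuates the slope toward zero by the
# reliability ratio var(z) / (var(z) + var(u)).
naive_slope = np.polyfit(x, y, 1)[0]
reliability = 1.0 / (1.0 + sigma_u**2)
print(f"true beta       = {beta}")
print(f"naive slope     = {naive_slope:.3f}")               # about beta * reliability
print(f"corrected slope = {naive_slope / reliability:.3f}") # recovers about beta
```

The naive slope comes out near 1.2 rather than 2: the data are not the variable, and inference that ignores the gap is biased.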
Thursday, June 30, 2005
Tuesday, June 21, 2005
Friday, June 10, 2005
PowerBlog
Andrew once said that if you want something to reach a lot of people, you put it in a blog. No, no, no! Not just A blog, but the A-blog (Andrew's blog). I found an interesting picture and posted it in my normal-curves post. Then Andrew posted it on his blog, and we got some discussions going.
Thursday, June 09, 2005
Mea Culpa
When I was writing my blog entry on Murphy's law, I simply assumed that a coffee shop inside a business building would be open on all five workdays of the week, and that if it were ever closed on a workday, there had to be some special reason.
Today, I was very embarrassed to see the coffee shop's schedule: it is only open Monday through Thursday! I guess I used the wrong prior a couple of days ago.
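For fun, here is the "wrong prior" in miniature (my own toy numbers, not anything from the post): a dogmatic prior that puts probability one on "open every workday" can never be revised by evidence, while a more honest prior updates as soon as you find the door locked.

```python
# Toy Bayesian update for "is the coffee shop normally open on Fridays?"
# The probabilities are invented purely to illustrate the 'wrong prior' joke.
def posterior_open(prior_open: float, p_closed_if_open: float = 0.05) -> float:
    """P(normally open Fridays | found it closed one Friday)."""
    # If normally open, it is closed today only with small probability;
    # if normally closed on Fridays, it is closed today with probability 1.
    num = p_closed_if_open * prior_open
    return num / (num + 1.0 * (1.0 - prior_open))

print(posterior_open(1.0))   # dogmatic prior: posterior stays at 1.0
print(posterior_open(0.8))   # reasonable prior: drops to about 0.17
```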
Tuesday, June 07, 2005
Normal Curves
I found the following picture while reading Chinese news online. It is coverage of a recent hail storm in Beijing. What might be the physical explanation for the curves appearing like that?
Teen sleep
Every morning I receive my dose of current news from the morning news on TV. (Why not read? I can't do that while brushing my teeth, can I?) Anyway, I find TV coverage of scientific research quite interesting. The way they interpret and dramatize the results of current research papers can be very entertaining to watch.
Today, the one that caught my attention was a discussion of how school schedules are to blame for sleep deprivation in teens, based on a recent paper. It is argued that because human bodies undergo changes during the teenage years, including changes to the biological clock, it is hard for most teens to go to bed early. As a result, early school schedules inevitably deprive students of much-needed sleep.
The coverage I watched even had an in-house re-enactment (they videotaped a teenage girl waking up). One shocking statement ended the whole segment, something like "a 25-minute difference in sleep can make the difference between an A and a C," which I failed to find in the original article.
Monday, June 06, 2005
Trapped by software that is too advanced?
Last summer, I happily upgraded my Adobe Acrobat from 5.0 to 6.0. The new Acrobat displays files more clearly and smoothly. Both the text and image capture tools have become more powerful, and the software can even READ a PDF file aloud to you. It also has better security features; for example, my bank won't let me view my PDF bank statement unless my Acrobat 6.0 has certain updates installed.
But when I was preparing a grant proposal, I found myself trapped by this technological advancement: the federal funding agency does not accept PDF files prepared with Acrobat 6.0. Huh?!
I first thought it wouldn't be a problem, since I could probably save my PDF (again, in Acrobat) as a lower-version file, just as I can in Microsoft Word, Excel, etc. But no such luck. Now I don't even remember how I eventually circumvented the issue. Maybe the struggle was too painful to remember. :)
Then today, while poking through all the functions Acrobat provides, I accidentally found a command in the File menu called "Reduce File Size". It lets you save the document as a new file that is both smaller and compatible with a lower version of Acrobat. Bingo! It worked. Just in case I forget this trick again, I am blogging it here.
Apparently, to the people at Adobe, the only reason one would ever want to lower the version of a PDF file is to reduce its size. :)
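(For the record, in case the menu command ever goes missing again: I believe Ghostscript can also rewrite a PDF at a chosen compatibility level from the command line. A sketch, with hypothetical file names; PDF 1.4 is roughly what Acrobat 5 understands.)

```python
# Hedged sketch: downgrade a PDF's compatibility level by rewriting it with
# Ghostscript (assumes the 'gs' executable is installed; file names are
# hypothetical examples).
import subprocess

subprocess.run([
    "gs",
    "-sDEVICE=pdfwrite",         # rewrite through Ghostscript's PDF writer
    "-dCompatibilityLevel=1.4",  # PDF 1.4, roughly Acrobat 5 territory
    "-dNOPAUSE", "-dBATCH",      # run non-interactively
    "-sOutputFile=proposal_acro5.pdf",
    "proposal_acro6.pdf",
], check=True)
```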
An Entropy-Based Statistic for Genomewide Association Studies
Am. J. Hum. Genet. 77:27–40, 2005
Jinying Zhao, Eric Boerwinkle, and Momiao Xiong
Abstract: Efficient genotyping methods and the availability of a large collection of single-nucleotide polymorphisms provide valuable tools for genetic studies of human disease. The standard chi-square statistic for case-control studies, which uses a linear function of allele frequencies, has limited power when the number of marker loci is large. We introduce a novel test statistic for genetic association studies that uses Shannon entropy and a nonlinear function of allele frequencies to amplify the differences in allele and haplotype frequencies to maintain statistical power with large numbers of marker loci. We investigate the relationship between the entropy-based test statistic and the standard chi-square statistic and show that, in most cases, the power of the entropy-based statistic is greater than that of the standard chi-square statistic. The distribution of the entropy-based statistic and the type I error rates are validated using simulation studies. Finally, we apply the new entropy-based test statistic to two real data sets, one for the COMT gene and schizophrenia and one for the MMP-2 gene and esophageal carcinoma, to evaluate the performance of the new method for genetic association studies. The results show that the entropy-based statistic obtained smaller P values than did the standard chi-square statistic.
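As a rough illustration of the contrast the abstract draws, and emphatically not the authors' actual statistic (which is defined in the paper), one can compare the Pearson chi-square statistic with the likelihood-ratio G statistic, a familiar test that is "entropy-based" in the sense that it is built from terms of the form O log(O/E):

```python
# Sketch contrasting the Pearson chi-square with an entropy-based
# (log-likelihood-ratio, 'G') statistic on a 2 x k table of allele counts.
# This illustrates the general idea only; it is NOT the specific entropy
# statistic proposed by Zhao, Boerwinkle, and Xiong.
import numpy as np

def expected_counts(table: np.ndarray) -> np.ndarray:
    """Expected counts under independence of rows and columns."""
    return np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()

def pearson_chi2(table: np.ndarray) -> float:
    e = expected_counts(table)
    return float(((table - e) ** 2 / e).sum())

def g_statistic(table: np.ndarray) -> float:
    """2 * sum O*log(O/E): a difference of entropy-like terms."""
    e = expected_counts(table)
    mask = table > 0                     # convention: 0 * log 0 = 0
    return float(2 * (table[mask] * np.log(table[mask] / e[mask])).sum())

# Hypothetical allele counts at one marker (rows: cases, controls).
counts = np.array([[180, 220],
                   [140, 260]])
print(f"chi-square = {pearson_chi2(counts):.2f}")   # about 8.3
print(f"G          = {g_statistic(counts):.2f}")    # about 8.4
# Both are asymptotically chi-square with (r-1)(c-1) degrees of freedom.
```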
Friday, June 03, 2005
Phylogenetic Estimation of Context-Dependent Substitution Rates by Maximum Likelihood
Mol Biol Evol. 21:468–488, 2004
Adam Siepel and David Haussler
Abstract: Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within N-tuples of adjacent sites, their ability to accommodate overlapping N-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. [Follow the link above to read more]
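To make "context-dependent substitution" concrete, here is a toy counting exercise of my own (hypothetical sequences, and only a descriptive tabulation, nothing like the paper's EM-fitted phylogenetic models): tally how often a site differs between two aligned sequences, conditional on the identity of its 5' neighbor.

```python
# Toy illustration of context-dependent substitution: the fraction of
# substituted sites between two aligned sequences, broken down by the 5'
# neighboring base. Sequences are invented; this is a descriptive count,
# not the EM-based estimator of Siepel and Haussler.
from collections import defaultdict

def substitution_freq_by_context(seq1: str, seq2: str) -> dict:
    """Fraction of differing sites, keyed by the 5' neighbor in seq1."""
    subs, totals = defaultdict(int), defaultdict(int)
    for i in range(1, len(seq1)):
        left = seq1[i - 1]
        totals[left] += 1
        if seq1[i] != seq2[i]:
            subs[left] += 1
    return {base: subs[base] / totals[base] for base in totals}

a = "ACGTCGACGTTTACGA"
b = "ACATCAACGTTTATGA"
print(substitution_freq_by_context(a, b))
# If rates truly depended only on the site itself, these per-context
# fractions would all estimate the same quantity; systematic differences
# are the signature of context dependence.
```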
Murphy's law
This morning I felt like buying my breakfast at the coffee cart in the lobby of the SSW building. On my way, I wondered: what if the cart is closed today, given its recently light business? But I decided to go and try my luck anyway; surely the cart wouldn't suspend its service to our building just because of the low traffic of the summer months. Then I found that the cart WAS indeed closed today. So I had to turn around, walk back two blocks, and buy my coffee and bagel from a nearby grocery store.
Bitterly, I thought, "every time!" Why on earth is Murphy's law always right?
Actually, I know the answer this time. It is no coincidence that I decided to buy coffee from the coffee cart on the very day it was closed. Yesterday, when I went out to buy lunch, I saw the cart manager (a very nice Indian man) cleaning up the cart, with several bags behind him on the lounge couch. I could tell from their shapes that the bags held unsold bagels, muffins, fruit juice, etc. I told myself that the business must have turned out to be much lighter than they had expected. Then I thought that maybe I should give them a little support, since a coffee cart in the lobby would be very convenient. Consciously or unconsciously, I decided to buy coffee from them this morning.
My understanding of Murphy's law is this: you fear something is going to happen, and that something indeed happens. Most of the time, there is a common cause behind both the fear and the feared event. Of course, we also tend to remember the occasions when bad things happen.
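The common-cause story is easy to simulate (all numbers invented): a hidden state, "business is slow," raises both the probability that I worry and the probability that the cart closes, so the worry predicts the closure even though neither causes the other.

```python
# Toy simulation of Murphy's law as a common-cause effect. A latent state
# (slow business) drives BOTH the fear and the closure; conditioning on the
# fear then inflates the observed closure rate. All numbers are made up.
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

slow = rng.random(n) < 0.3                            # hidden common cause
fear = rng.random(n) < np.where(slow, 0.8, 0.1)       # clues make me worry
closed = rng.random(n) < np.where(slow, 0.5, 0.05)    # slow business closes cart

print(f"P(closed)        = {closed.mean():.3f}")        # about 0.19
print(f"P(closed | fear) = {closed[fear].mean():.3f}")  # about 0.40
```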
Thursday, June 02, 2005
Aspirin scare?
A news item on "pain relievers increase the risk of breast cancer" has been covered by multiple news agencies over the past several days. This comes as a shock, since anti-inflammatory pain relievers such as aspirin were believed to lower the risks of inflammatory disorders such as breast cancer, IBD, etc.
The findings were published in the May 31, 2005, issue of the Journal of the National Cancer Institute. The study followed, for six years, 114,460 women who had participated in the earlier California Teachers Study (so I suppose they are women, teachers, and Californians). At the beginning of the study they were all free of breast cancer (though not necessarily free of other cancers or inflammatory disorders). During 1995-2001, 2,391 (2.1%) of the women were diagnosed with breast cancer. Data on the use, frequency, and duration of nonsteroidal anti-inflammatory drugs (NSAIDs) were collected through self-administered questionnaires (in 1995, 1997, and 2000).
Most news coverage of this study cited only the above description. The news would then go on to say that "daily long-term use of aspirin was associated with an 81% increased risk" (RR = 1.81, 95% CI = 1.12 to 2.92).
First, what does an 81% increase in risk amount to when the overall risk is only about 2%?
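Some back-of-the-envelope arithmetic, using the cohort-wide 2.1% six-year incidence as a rough baseline (which glosses over the exact subgroup the RR refers to):

```python
# Translate the headline relative risk into absolute terms. The 2.1%
# baseline is the study-wide six-year incidence, used here only as a
# back-of-the-envelope reference point.
baseline = 0.021   # overall 6-year breast cancer risk in the cohort
rr = 1.81          # reported RR for daily long-term aspirin use
print(f"absolute risk: {baseline:.1%} -> {rr * baseline:.1%} "
      f"(an increase of {(rr - 1) * baseline:.1%} points)")
```

An "81% increased risk" here means moving from roughly 2.1% to roughly 3.8%, an absolute increase of under two percentage points.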
In the paper, the authors honestly presented the following finding, which is no surprise at all:
"After accounting for age, regular NSAID users were more likely to be white, to be overweight or obese, to be current or former smokers, to have had a mammogram in the last 2 years, and to have used postmenopausal hormone therapy than [non-]regular NSAID users."
From the multi-way tables presenting the relative-risk estimates for NSAID use, it is not "crystal clear" how exactly the statistical analysis was done. However, the paper does mention that the estimation adjusted for (by including covariates for) race, BMI, first-degree family history of breast cancer, menopausal status and hormone therapy use, smoking, alcohol intake, physical activity, mammography history, breast biopsy history, parity before age 30, and neighborhood socioeconomic status.
Is something missing? I was looking for age. That would have been my first guess; age should matter most. Who has a higher risk of cancer, and who is more likely to be a regular pain-reliever user? But I can't seem to find any mention of it in my ten-minute browse of the paper. I am sure the investigators considered age important. In another table, it is mentioned that the characteristics are age-standardized (meaning that they standardized the values of these covariates within each 10-year age category), which keeps the other covariates from being heavily correlated with age. But age is not adjusted for in estimating the relative risks? Is this a typo or an unfortunate omission?
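To see why such an omission would matter, here is a toy confounding simulation (numbers entirely invented, not the study's data): when age drives both NSAID use and cancer risk, the crude relative risk is inflated even though, in this simulation, NSAIDs have no effect at all.

```python
# Toy age-confounding simulation: age raises both the chance of regular
# NSAID use and the baseline cancer risk, so the crude RR overstates the
# (here, nonexistent) NSAID effect; stratifying by age removes the bias.
# All numbers are invented; they are NOT from the JNCI paper.
import numpy as np

rng = np.random.default_rng(2)
n = 500_000
old = rng.random(n) < 0.4                           # age stratum (the confounder)
nsaid = rng.random(n) < np.where(old, 0.5, 0.2)     # older women use NSAIDs more
cancer = rng.random(n) < np.where(old, 0.04, 0.01)  # and have higher cancer risk

crude_rr = cancer[nsaid].mean() / cancer[~nsaid].mean()
stratum_rrs = [cancer[nsaid & s].mean() / cancer[~nsaid & s].mean()
               for s in (old, ~old)]
print(f"crude RR        = {crude_rr:.2f}")                       # about 1.5
print(f"age-stratum RRs = {[f'{r:.2f}' for r in stratum_rrs]}")  # both near 1.0
```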
There are some other statistical question marks in this paper that I couldn't resolve during my fast read. Not to mention the multiple-comparison issues.
Wednesday, June 01, 2005
The appliedness of Statistical Research
Statistics is an interdisciplinary, formal information science. As in any science, there is something called "theoretical statistics" and something called "applied statistics". I think I am more on the applied side. But what makes statistical research "applied" without turning it into a service to another scientist?
To me, a young statistician, there seems to be a thin (very, very thin) line between "applied research" and "consulting/application". For an academic statistician, crossing the line by doing service-like application work for other scientists can be harmful, I suppose (or, I guess). But how far inside this line, within the applied-research domain, is safe enough? What determines the appropriate appliedness of a project?
Since there is no absolute standard for the appropriate appliedness of a research project (am I missing something here?), one who is dedicated to doing applied research faces two choices: make the research more applied, to address the needs of the other discipline, or make it less applied, to maintain some statistical intellectual merit. Sometimes, one does not have to make the choice. That is simply the dreamland for an applied statistician.