Tuesday, December 20, 2005
From time to time, good students I have known for a semester fail the final exam. This troubles me. On one hand, I know they studied hard, enjoyed the class and understood the material well. On the other hand, they didn't perform on the final.
Exams are not perfect, but they are regarded as a common means of evaluation in modern education. I know that relying on even a fair exam is only partially fair. But relying on my personal impression of the students would never be fair. So I choose to rely on the exam scores and my formula, even though I am really, really bothered by that.
Tuesday, December 13, 2005
Friday, December 02, 2005
The GSAS teaching center asked me to write a short paragraph on suggestions to new GSAS teaching fellows. I squibbed the following:
Basically speaking, teaching involves three components: the instructor, the students and the materials to teach. To better structure your teaching, you need to “know” these three components well.
- Know the instructor (yourself). Be clear about your teaching style. Don’t wait until you start looking for a job to write your teaching philosophy. Start a draft today and stick to the principles you truly believe in when you teach. Don’t know how to start? Just google a sample online.
- Know your students. Students come to your class with different backgrounds, different interests and different needs. The better you know them, the better you can plan your teaching inside and outside the classroom.
- Know the material. Isn’t this a given? Well, your teaching should be more than the materials printed in the textbooks. Based on your own understanding, translate the materials in the textbooks to a form that is more accessible to your students, by using examples, demos, illustrations, etc.
I am not saying I know the best about teaching. But this is what I believe. I believe that to do better is to understand better and think deeper. Go back to the roots. Not just for teaching but also for research. I think I can be called a fundamentalist, but not as usually defined:
A usually religious movement or point of view characterized by a return to fundamental principles, by rigid adherence to those principles, and often by intolerance of other views and opposition to secularism.
No, this is not me. I believe in fundamental principles, but I believe there are different principles for different things and that different people can choose to believe in different principles. I guess that is why I like to organize things top-down. :)
Monday, November 28, 2005
"Bosses beware: Workers today may not be as productive as usual. It's Cyber-Monday, one of the busiest online shopping days of the year. More than one-third of the consumers polled say they will go online at work where they have faster Internet connections. Total North American Internet traffic by visitors per minute already is up 35 percent from a typical Monday, according to the Akamai Net Usage Index."
Would that 35% increase be simply due to back-logged email communications after a holiday break? Uploading holiday pictures for sharing? Viewing holiday cards or pictures? Why would it have to be due to shopping?
Tuesday, November 22, 2005
Chinese students and scientists are playing an increasingly important role in US laboratories. According to the New York-based Institute of International Education, US academic institutions are now home to some 80,000 Chinese nationals, many of them in the sciences.
They are attracted to the United States, in the main, because of its excellent research universities, which are delighted to recruit well-trained and hard-working Chinese nationals. But as our News Feature on page 278 demonstrates, reality doesn't always quite meet the visitors' expectations.
Xuemei Han, a Chinese national, was admitted by Yale University to study ecology on the basis of her strong academic background. But the language barrier, funding problems and bureaucratic tussles ultimately led to a public falling-out, which quickly escalated to become a focal point for major protests by a large number of Chinese students at Yale.
Wednesday, November 16, 2005
Friday, November 11, 2005
Today, Beijing announced a set of five mascots for the upcoming 2008 Olympic Games. I think they have a non-modern design and look quite traditionally Chinese. I don't think people will think they are COOL. They gave me a warm feeling, although I believe not everyone around the world will think they are adorable. But I chose to like them, :), simply because I love the city where I grew up and choose to be proud of the things it tries its best to offer. Subjective, isn't it?
Anyway, this brought back some memories.
When Beijing applied to host the 2000 Olympic Games in 1993, I was in high school. We participated in all kinds of activities in support of Beijing's application. As far as I can tell, everybody, at that time, was deeply motivated. [I heard the enthusiasm was not that high in 2001 when Beijing actually won.] At one point in 1993, rumor had it that if Beijing won, the 1999 college graduates from major Beijing universities would be recruited by the local organizing committee to be volunteers during the games. According to schedule, I was going to college in 1994 and I was applying to Tsinghua, which at that time still had 5-year undergraduate programs. Believe it or not, I got all excited about it, imagining that I would be part of the Olympic Games. I can still get excited about Beijing and the Olympics even though I think it is a little bit silly. I don't really know where the excitement comes from. Somehow, the education I have received since I was little really got through, so that I love my country, my city and my Tsinghua (which was my home "town", and is my home in Beijing), for no significant logical reasons.
In 2001, I was doing an internship somewhere in New Jersey when Beijing was chosen to host the 2008 Olympics. I was sitting in my cubicle, watching real-time text coverage online. Five seconds after the announcement was made, another intern from China ran to my desk and said "have you known?" and I cheered under my breath, "YES!" We were the happiest people, at that moment, on that industrial-looking floor. It seems to be such a long time ago.
I don't usually get sentimental about things. But I did today when reading the news release.
BTW, the five mascots are named "Beibei", "Jingjing", "Huanhuan", "Yingying" and "Nini". When read together, "Bei", "Jing", "Huan", "Ying", "Ni" mean "Beijing Welcomes You". Well, we Chinese people sometimes try too hard at killing "multiple birds" with "one stone".
"Rule 6: The ingredients of good science are obvious—novelty of research topic, comprehensive coverage of the relevant literature, good data, good analysis including strong statistical support, and a thought-provoking discussion. The ingredients of good science reporting are obvious—good organization, the appropriate use of tables and figures, the right length, writing to the intended audience—do not ignore the obvious."
Read more on PLoS Computational Biology
Wednesday, November 09, 2005
Friday, November 04, 2005
Zhu G, Duffy DL, Eldridge A, Grace M, Mayne C, O'Gorman L, Aitken JF, Neale MC, Hayward NF, Green AC, Martin NG (1999) A Major Quantitative-Trait Locus for Mole Density Is Linked to the Familial Melanoma Gene CDKN2A: A Maximum-Likelihood Combined Linkage and Association Analysis in Twins and Their Sibs. Am. J. Hum. Genet. 65:483-492
Combined Linkage and Association Tests in Mx
D Posthuma, EJC de Geus, DI Boomsma, MC Neale - Behavior Genetics, 2004
Combined linkage and association analysis in pedigrees
KD Siegmund, H Vora, WJ Gauderman - Genet Epidemiol Suppl, 2001
Combined high resolution linkage and association mapping of quantitative trait loci
R Fan, M Xiong - European Journal of Human Genetics, 2003
Combined association and linkage analysis applied to the APOE locus
M Beekman, D Posthuma, BT Heijmans, N Lakenberg, … - Genetic Epidemiology, 2004
Combined Linkage and Association Mapping of Quantitative Trait Loci by Multiple Markers.
R Fan - Genetics, 2005
I used to think Harry Potter's world is cool. But I think our world is cool enough.
BTW: ABBA, what a great band!
Thursday, November 03, 2005
I started thinking about the use of blogs by people I know.
Blogging combined the personal web page with tools to make linking to other pages easier, specifically blogrolls and TrackBacks, as well as comments and afterthoughts. This way, instead of a few people being in control of threads on a forum, or anyone able to start threads on a list, there was a moderating effect that was the personality of the weblog's owner. Justin Hall, who began eleven years of personal blogging in 1994 while a student at Swarthmore College, is generally recognized as one of the earliest bloggers.
The term "weblog" may have been coined by Jorn Barger in December 1997. The shorter version, "blog", was coined by Peter Merholz, who, in April or May of 1999, broke the word weblog into the phrase "we blog" in the sidebar of his weblog. This was interpreted as a short form of the noun and also as a verb to blog, meaning "to edit one's weblog or a post to one's weblog". The site Open Diary, while not using the term blog until recently, launched in 1998, had over 2000 diaries by 1999, and near 400 000 as of September 2005. Blog usage spread during 1999, with the word being further popularized by the near-simultaneous arrival of the first hosted weblog tools: Evan Williams and Meg Hourihan's company Pyra Labs launched Blogger (which was purchased by Google in February 2003) and Paul Kedrosky's GrokSoup. As of March 2003, the Oxford English Dictionary included the terms weblog, weblogging and weblogger.
One of the pioneers of the tools that make blogging more than merely websites that scroll is Dave Winer. One of his most important contributions was the creation of servers which weblogs would ping to show that they had been updated. Blog reading utilities use the aggregated update data to show a user when their favorite blogs have new posts.
- Sharing ideas or opinions [for others]
- Logging thoughts [for self]
- Initiating discussions [for self and others]
- Logging online resources [for self]
- Dumping personal experiences [for nobody, maybe]
Lu, Xun (a Chinese writer) wrote an article titled "a memoir, so to forget" (an unofficial title; I made it up. There must be a better translation). The main idea of the title is that he wrote the piece because he wanted to start healing from the experiences he recounted in that article. I like this title a lot (the Chinese version, sure). The main reason I wrote some of my blog posts is likewise that I want them off my mind. (Sooner or later, I will probably write in my blog, "final exam, 3rd drawer", since I am an organizer.) Not only do I want the physical space surrounding me organized, I also want to have an organized mind. One rule in the principles of getting things organized is that you must have a place to dump. This dumping station should be very accessible (easy to dump), and very visible (hard for you to forget to attend). I also remember a short story I read years ago:
Two cities compete for the title of "cleanest city". One harshly prohibits its residents from dumping trash on the street but ends up with trash in every corner. So its mayor goes to the other city and sees really clean streets. He wonders how the other mayor has done that. He later finds out that, in the other city, the residents are told where to dump.
Naturally, we (should I say "I", just speaking for myself?) need a place to dump. For our mails, for our teaching notes, for our research notes. Now, we (or I) need a place to dump our opinions, ideas, feelings, joys, etc. But when I dump, I also need to show them. This part of my blogging behavior always makes me think. A friend of mine said that blogging is a subtle way of "showing off" and that is why he does not blog. Well, that is too harsh, or is it? Is the whole culture of blogging driven by people's internal desire to show off? Has it become so popular simply because it is more subtle than most other ways of showing off? Or shall I think more positively? Do people blog because they want to participate in a collective thinking community?
Monday, October 17, 2005
Saturday, October 15, 2005
Today, the Hawks center Collier died. In the news, analysts are speculating that the cause of death was cardiac arrest. I remember from my junior high biology class that there is a genetic condition found in extraordinarily tall people. With this condition, the blood vessels of the heart have a different structure from those of other people, which makes them grow taller but at the same time puts their hearts under more stress and makes them more prone to cardiac arrest. I remember my biology teacher telling us that several famous athletes were known to have died of this condition. I don't know whether this is related to those speculations, but this is what I can speculate.
I think the reason I can remember so vividly what the teacher said is that I found it very sad. A condition that can give one the physique to pursue sports can also kill him or her young.
Now, the focus is not actually on the dead but rather on the Knicks' center, newly acquired from the Bulls. The story is here. Long story short: the center, Eddy Curry, felt chest pain and an irregular heartbeat during training camp in 04, and the Bulls wanted him to take a DNA test to determine whether he is susceptible to a fatal heart defect. Curry refused, citing his privacy rights. The Bulls said that they would sign a big contract with Curry if the test came back negative; otherwise,
Paxson, speaking during the team's media day, told reporters the Bulls had offered Curry $400,000 annually for the next 50 years if he failed the genetic test.
When I first saw it, I thought, "oh, it is not about money". Or is it? If someone knows that he has such a defect and can't live long, would he want the big, short-term contract or the small, long-term contract? Of course, he would like to get as much money as possible in as short a time as possible. However, what stirs the controversy this time is not the money. Reading along, another news article says:
"Think about what's at stake here," said Alan Milstein, Curry's attorney. "As far as DNA testing, we're just at the beginning of that universe. Pretty soon, though, we'll know whether someone is predisposed to cancer, alcoholism, obesity, baldness and who knows what else.
"Hand that information to an employer," he added, "and imagine the implications. If the NBA were to get away with it, what about everyone else in this country looking for a job."
This led me to think about how the information we extract from genetic codes should be used. I know this is a very heated topic nowadays. It never really occurred to me that it is so current.
If employment can depend on whether someone is mentally sound, should high-risk genetic disorders be treated similarly? If an employment search cannot discriminate against people with disabilities, should genetic defects someone was born with be treated similarly? Sure, it is time to decide on individuals' privacy rights over their DNA information and on regulations for the use of such information. But I am having a hard time thinking of a way this can be decided without bias.
Tuesday, October 11, 2005
Then I told him about a project done by one of my former students. In that project, people filled out surveys on the amount of time they spend on different activities every week (studying, commuting, shopping, etc.). The most interesting finding of that project was that people had a wrong perception of how many hours there are in a week. People tend to overestimate the number of hours in a week by a large margin.
I said: "Maybe this is why we are overcommitting ourselves so much."
Dave said that another reason we overcommit ourselves is that when we make a commitment to something, we are committing time in the future. Since the future is unlimited, we tend to think or feel there will always be time.
Then I said: "Maybe there should be something like the credit companies that prevents us from overdrawing our future time too much." Dave said: "That would be a good idea."
Saturday, October 08, 2005
Great scientists come in two varieties, which Isaiah Berlin, quoting the seventh-century-BC poet Archilochus, called foxes and hedgehogs. Foxes know many tricks, hedgehogs only one. Foxes are interested in everything, and move easily from one problem to another. Hedgehogs are interested only in a few problems which they consider fundamental, and stick with the same problems for years or decades. Most of the great discoveries are made by hedgehogs, most of the little discoveries by foxes. Science needs both hedgehogs and foxes for its healthy growth, hedgehogs to dig deep into the nature of things, foxes to explore the complicated details of our marvelous universe. Albert Einstein was a hedgehog; Richard Feynman was a fox.
If the science of the 20th century was physics, and the science of the 21st century is said to be biology, is now a time for foxes or hedgehogs?
Many readers of The New York Review of Books are more likely to have encountered Feynman as a story-teller, for example in his book Surely You're Joking, Mr. Feynman!, than as a scientist. Not many are likely to have read his great textbook The Feynman Lectures on Physics, which was a best seller among physicists but was not intended for the general public. Now we have a collection of his letters, selected and edited by his daughter, Michelle. The letters do not tell us much about his science. For readers who are not scientists, it is important to understand that foxes may be as creative as hedgehogs. Feynman happened to be young at a time when there were great opportunities for foxes. The hedgehogs, Einstein and his followers at the beginning of the twentieth century, had dug deep and found new foundations for physics. When Feynman came onto the scene in the middle of the century, the foundations were firm and the universe was wide open for foxes to explore.
Tuesday, October 04, 2005
- It uses the double helix structure of the DNA as the theme
- The words "statistics" and "genetics" are parts of the two strands of the DNA, while the bonds between are binary codes that represent digital data. It is designed to indicate that our research is trying to build the bonds between these two fields through new computational tools for analyzing data.
- To add more taste of statistics, in the center, a segment of the DNA strand is replaced by Greek letters (common parameters and notations in statistics) and also the upper (kinda) strand is shaped close to a normal curve.
- The binary digits in the middle represent data, which originate from "genetics" and lift up "statistics". Later, hopefully, some programmer can make an animated version of this logo where the binary codes move like in the movie The Matrix.
Friday, September 23, 2005
I, for some uninteresting reason, just discovered this and installed a Really Simple Syndication reader on my PC. I really like the interface a lot. Very easy to use. The remaining question is whether I want to be alerted about newly published papers. We'll see.
Monday, September 19, 2005
This paper was a discussion paper. Brad Efron began his discussion with
"Before tearing into the paper, let me first applaud Professor Berkson's skeptical attitude toward asymptotics and fancy theory in general. Throughout his productive career he has always been primarily concerned with the practical, the computable and the verifiable---the right attitude for a good scientist doing good science. His mistake is not crediting Fisher (and Rao, Savage, Ghosh, Subramanyam, and me) with some of the same good sense. "
Efron mentioned three components of good science in his comments: applicability, computability, and "verifiability". This reminded me of the research of Andrew's student Jouni on Fully Bayesian Computing. In their research, they were building around general models (applicability), facilitating computing (computability) and promoting model checking (verifiability).
Monday, September 05, 2005
"Sometimes I think it's the opposite--people devalue what they know how to do because, to them, it's so "obvious." For example, Caroline has sometimes helped me with teaching issues (such as working with students on ideas for projects). When I thank C for the help, she typically says that she didn't do anything. She actually did a lot but she's such an expert in teaching and in drawing information out from people, that she doesn't realize how difficult it is for me."
I think that precisely because "people devalue what they know" and tend to think it is easy, they will tend to think less of those who do not know how to do it. The "invisible IQ test" I was thinking of is not formulated by values but rather by necessity and some kind of definition of basics.
For example, a professional figure skater will regard some of her moves as basics, nothing to be "proud" of, which will be very difficult for us to do. She will of course devalue such moves but will think anyone who can not do such moves is nothing like a figure skater. Of course, I don't think she will include things that are so advanced and specific into her "invisible IQ test".
Thursday, September 01, 2005
This reminds me of those kinds of scoring tests for depression, anxiety, etc. Since there is no scale for the "amount" of depression you feel, we need to score you using a limited number of very specific things you feel or don't feel.
I think everyone may unconsciously have a list of things that he/she thinks everyone should know. Also unconsciously, they will use these things to judge others. It would be interesting to see the differences between these individual invisible IQ tests. Sure, there must also be invisible personality tests.
Tuesday, August 30, 2005
Some Columbia students should do something similar on Low Plaza.
It would be nice to use the steps for the audience and the lower level of the plaza as the stage. A stereo should be set up to play the music. It could either be done in the evening, with every performer holding a candle or something shining, or it would be cool if it were done after a not-so-big snow.
Thursday, August 25, 2005
One of my students approached me after class, pointing to this line in his copy of the syllabus.
"This is a joke, right?" he said.
" 'Cause everybody here [on this campus] must have taken math back in high school."
"I am not saying we only require that all students have taken high school math before they can take 1111. We assume that they still remember much of their high school math before taking 1111." I thought for a while and responded defensively. That, I think, is a lot to assume.
The other day, I walked past the midtown Kmart and saw their back-2-school ad. They put an equation, x^2=x-1, in one of the pictures. I solved it in my head as I walked towards the subway and realized that Kmart probably didn't know that this equation did not have rational numerical solutions. Then I thought: "this is legitimately high school math, but I am not sure whether all my 1111 students can solve it." Not that I care since they will not need to do such things in my class.
It made me think about all those things we have learned in high school and never used. Is it a bad thing that we have forgotten much of it? Why should we study things we are not going to use? Just simply because we don't know what we are going to need in the future, we have to study a comprehensive foundation? Is this the only reason? Or is it healthier to keep one's brain busy at different things when one is young? Or is it healthier to have something to dump when one is getting older? Maybe, all those things we have learned and have no use for are some kind of placeholders for our future acquisitions? This is a wild thought. :)
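For the record, here is a small sketch that checks the Kmart equation numerically. It rewrites x^2 = x - 1 as x^2 - x + 1 = 0 and applies the quadratic formula; the function name `solve_quadratic` is just something I made up for illustration. The discriminant comes out negative, so the two roots are complex, which means there are no real solutions and therefore certainly no rational ones.

```python
import cmath


def solve_quadratic(a, b, c):
    """Return the discriminant and both roots of a*x^2 + b*x + c = 0."""
    disc = b * b - 4 * a * c
    # cmath.sqrt handles a negative discriminant by returning a complex number.
    root = cmath.sqrt(disc)
    return disc, ((-b + root) / (2 * a), (-b - root) / (2 * a))


# x^2 = x - 1  is the same as  x^2 - x + 1 = 0
disc, roots = solve_quadratic(1, -1, 1)
print("discriminant:", disc)
for r in roots:
    # Plugging each root back in should give (numerically) zero.
    print(r, "residual:", abs(r * r - (r - 1)))
```

A negative discriminant is the whole story here: no amount of high school algebra will turn up a number on the number line that satisfies the ad.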
Thursday, August 04, 2005
Wednesday, August 03, 2005
Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems
S. L. Lauritzen; D. J. Spiegelhalter
Abstract: A causal network is used in a number of areas as a depiction of patterns of `influence' among sets of variables. In expert systems it is common to perform `inference' by means of local computations on such large but sparse networks. In general, non-probabilistic methods are used to handle uncertainty when propagating the effects of evidence, and it has appeared that exact probabilistic methods are not computationally feasible. Motivated by an application in electromyography, we counter this claim by exploiting a range of local representations for the joint probability distribution, combined with topological changes to the original network termed `marrying' and `filling-in'. The resulting structure allows efficient algorithms for transfer between representations, providing rapid absorption and propagation of evidence. The scheme is first illustrated on a small, fictitious but challenging example, and the underlying theory and computational aspects are then discussed.
Tuesday, August 02, 2005
A. P. Dempster, N. M. Laird, and D. B. Rubin
Abstract: A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
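To remind myself how the algorithm in this abstract works, here is a minimal sketch of EM for one of the listed examples, a finite mixture model. It is my own toy version under stated assumptions, not code from the paper: two Gaussian components, simulated data, crude starting values, and made-up names such as `em_two_gaussians`. The E-step computes each point's probability of belonging to component 1, and the M-step re-estimates the weight, means and standard deviations from those probabilities. The log-likelihood is recorded after every iteration so the monotone behaviour mentioned in the abstract can be inspected.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(data, w, mu1, s1, mu2, s2):
    return sum(
        math.log(w * normal_pdf(x, mu1, s1) + (1 - w) * normal_pdf(x, mu2, s2))
        for x in data
    )


def em_two_gaussians(data, n_iter=50):
    """EM for a two-component Gaussian mixture (illustrative sketch only)."""
    n = len(data)
    # Crude starting values (an assumption): means at the extremes of the data.
    w, mu1, mu2 = 0.5, min(data), max(data)
    s1 = s2 = max((max(data) - min(data)) / 4.0, 1e-6)
    history = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 1.
        resp = []
        for x in data:
            a = w * normal_pdf(x, mu1, s1)
            b = (1 - w) * normal_pdf(x, mu2, s2)
            resp.append(a / (a + b))
        # M-step: weighted maximum likelihood updates.
        n1 = sum(resp)
        n2 = n - n1
        w = n1 / n
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1 - r) * x for r, x in zip(resp, data)) / n2
        s1 = max(math.sqrt(sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1), 1e-6)
        s2 = max(math.sqrt(sum((1 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2), 1e-6)
        history.append(log_likelihood(data, w, mu1, s1, mu2, s2))
    return (w, mu1, s1, mu2, s2), history


# Simulated data (hypothetical): half from N(0, 1), half from N(5, 1).
random.seed(0)
sample = [random.gauss(0, 1) for _ in range(200)] + [random.gauss(5, 1) for _ in range(200)]
params, loglik = em_two_gaussians(sample)
print(params)
```

The component labels here act as the "incomplete data": if we knew which component each point came from, the estimates would be simple averages, and EM replaces the unknown labels with their expected values.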
Peter J. Green
Abstract: Markov chain Monte Carlo methods for Bayesian computation have until recently been restricted to problems where the joint distribution of all variables has a density with respect to some fixed standard underlying measure. They have therefore not been available for application to Bayesian model determination, where the dimensionality of the parameter vector is typically not fixed. This paper proposes a new framework for the construction of reversible Markov chain samplers that jump between parameter subspaces of differing dimensionality, which is flexible and entirely constructive. It should therefore have wide applicability in model determination problems. The methodology is illustrated with applications to multiple change-point analysis in one and two dimensions, and to a Bayesian comparison of binomial experiments.
Peter F. Arndt and Terence Hwa
Motivation: Neighbor-dependent substitution processes generated specific pattern of dinucleotide frequencies in the genomes of most organisms. The CpG-methylation–deamination process is, e.g. a prominent process in vertebrates (CpG effect). Such processes, often with unknown mechanistic origins, need to be incorporated into realistic models of nucleotide substitutions.
Results: Based on a general framework of nucleotide substitutions we developed a method that is able to identify the most relevant neighbor-dependent substitution processes, estimate their relative frequencies and judge their importance in order to be included into the modeling. Starting from a model for neighbor independent nucleotide substitution we successively added neighbor-dependent substitution processes in the order of their ability to increase the likelihood of the model describing given data. The analysis of neighbor-dependent nucleotide substitutions based on repetitive elements found in the genomes of human, zebrafish and fruit fly is presented.
Availability: A web server to perform the presented analysis is freely available at: http://evogen.molgen.mpg.de/server/substitution-analysis
Thursday, June 30, 2005
Nick Payne said...
This is somewhat only tangential to your comment. After years of being a client of statisticians and then becoming one, I am struck by something. Most scientists and researchers who generate data understand that their data is only some approximation of the variable they are truly studying, while most consulting statisticians act as if the data WERE the variable of interest. How can I get applied statisticians to look past the data towards the actual variable? Issues in this are often that the measuring instrument introduces (or subtracts) attributes that are not relevant to the true underlying variable.
I have thought about this before. I feel the term "variable" is a vaguely defined scientific term, which may have different meanings in different physical & social sciences, AND is of unambiguous meaning in mathematics. In statistics, when we study the properties of methods, we treat variables at the utmost mathematical level since we would use tools in mathematics domains. However, when interpreting the values of statistics, or outputs from a statistical procedure, one needs to distinguish between the variables as in data and the variables as in reality (or the true quantities we are trying to do inference on). The true quantities that scientists are trying to study can sometimes be called parameters in statistics, but not always. I believe the whole philosophy behind formal statistical inference is trying to quantify and understand the "gap" between what we observe in the data collected and the truth.
Tuesday, June 21, 2005
Friday, June 10, 2005
Thursday, June 09, 2005
Today, I was very embarrassed to see the coffee shop's schedule. It is only open from Monday to Thursday! Guess I used a wrong prior a couple of days ago.
Tuesday, June 07, 2005
Today, the one that caught my attention is a discussion, based on a recent paper, of how the school schedule is to blame for sleep deprivation in teens. It is argued that because human bodies undergo changes during the teenage years, including in their biological clocks, it is hard for most teens to go to bed early. As a result, early school schedules inevitably deprive students of their much-needed sleep.
The one I watched even had an in-house re-enactment (they videotaped a teenage girl waking up). The whole coverage ends with one shocking statement, something like "a 25-minute difference in sleep can make the difference between an A and a C", which I fail to find in the original article.
Monday, June 06, 2005
But when I was preparing a grant proposal, I found I was trapped by this technological advancement. The Federal funding agency does not accept PDF files prepared with Acrobat 6.0. Huh?!
I first thought it wouldn't be a problem since I could probably save my PDF file (again in Acrobat) as a file of a lower version, just as I can do in Microsoft Word, Excel, etc. But it didn't happen. Now I don't even remember what I did to circumvent this issue eventually. Maybe the struggle was too painful to remember. :)
Accidentally, when I was trying all the functions provided by Acrobat today, I found a command in the file menu called "reduce file size". It allows you to reduce the file to a new file that is of a smaller size and also compatible with a lower version of Acrobat. Bingo! It worked. Just in case I forget this trick again, I am blogging it here.
Apparently, to the people at Adobe, the only reason one would want to lower the version of a PDF file is to reduce the size. :)
Jinying Zhao, Eric Boerwinkle, and Momiao Xiong
Abstract: Efficient genotyping methods and the availability of a large collection of single-nucleotide polymorphisms provide valuable tools for genetic studies of human disease. The standard chi-square statistic for case-control studies, which uses a linear function of allele frequencies, has limited power when the number of marker loci is large. We introduce a novel test statistic for genetic association studies that uses Shannon entropy and a nonlinear function of allele frequencies to amplify the differences in allele and haplotype frequencies to maintain statistical power with large numbers of marker loci. We investigate the relationship between the entropy-based test statistic and the standard chi-square statistic and show that, in most cases, the power of the entropy-based statistic is greater than that of the standard chi-square statistic. The distribution of the entropy-based statistic and the type I error rates are validated using simulation studies. Finally, we apply the new entropy-based test statistic to two real data sets, one for the COMT gene and schizophrenia and one for the MMP-2 gene and esophageal carcinoma, to evaluate the performance of the new method for genetic association studies. The results show that the entropy-based statistic obtained smaller P values than did the standard chi-square statistic.
Friday, June 03, 2005
Adam Siepel and David Haussler
Abstract: Nucleotide substitution in both coding and noncoding regions is context-dependent, in the sense that substitution rates depend on the identity of neighboring bases. Context-dependent substitution has been modeled in the case of two sequences and an unrooted phylogenetic tree, but it has only been accommodated in limited ways with more general phylogenies. In this article, extensions are presented to standard phylogenetic models that allow for better handling of context-dependent substitution, yet still permit exact inference at reasonable computational cost. The new models improve goodness of fit substantially for both coding and noncoding data. Considering context dependence leads to much larger improvements than does using a richer substitution model or allowing for rate variation across sites, under the assumption of site independence. The observed improvements appear to derive from three separate properties of the models: their explicit characterization of context-dependent substitution within N-tuples of adjacent sites, their ability to accommodate overlapping N-tuples, and their rich parameterization of the substitution process. Parameter estimation is accomplished using an expectation maximization algorithm, with a quasi-Newton algorithm for the maximization step; this approach is shown to be preferable to ordinary Newton methods for parameter-rich models. [Follow the link above to read more]
Bitterly, I thought "every time!" Why on earth is Murphy's law always right?
Actually I know the answer this time. It is not a coincidence that I decided to buy coffee from the coffee cart on the day that it was closed. Yesterday, when I went out to buy lunch, I saw the cart manager (a very nice Indian man) cleaning up the cart with several bags behind him on the lounge couch. I could tell from their shape that those bags held unsold bagels, muffins, fruit juice, etc. I told myself that business might have turned out to be much lighter than they expected. Then I thought, maybe I should give them a little support, since a coffee cart in the lobby would prove to be very convenient. Consciously or unconsciously, I decided to buy coffee from them this morning.
My understanding of Murphy's law is that you fear something is going to happen and that something indeed happens. Most of the time, there is a common cause for the fear and the object of the fear. Of course, we also tend to remember better when bad things happen.
Thursday, June 02, 2005
The findings were published in the May 31st 2005 issue of the Journal of the National Cancer Institute. The study followed 114,460 women who participated in a previous California Teachers Study (so I suppose they are women, teachers and Californians) for 6 years. At the beginning of the study, they were all free of breast cancer (not necessarily free of other cancers or inflammatory disorders). During 1995-2001, 2391 (2.1%) women were diagnosed with breast cancer. Data on the use, frequency and duration of nonsteroidal anti-inflammatory drugs (NSAIDs) were collected through self-administered questionnaires (1995, 1997 and 2000).
Most news coverage of this study only cited the above description of the study. Then the news would go on to say that "daily long-term use of aspirin was associated with an 81% increased risk" (RR=1.81, 95% CI= 1.12 to 2.92).
First, what is an 81% increase in risk when the overall risk is only 2%?
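To put a rough number on that question, here is a small sketch that turns the reported relative risks into absolute risks. It assumes, purely for illustration, that the overall 6-year incidence (2391 out of 114,460) can stand in for the baseline risk of women who do not take aspirin daily; the true baseline for that group is not given in this post. The function name is my own.

```python
def absolute_risk_change(baseline_risk, relative_risk):
    """Return (risk among the exposed, extra absolute risk) for a given relative risk."""
    exposed_risk = baseline_risk * relative_risk
    return exposed_risk, exposed_risk - baseline_risk


# Assumption: the overall incidence in the cohort is used as the baseline risk.
baseline = 2391 / 114460

# Point estimate and 95% CI bounds quoted in the news coverage.
for label, rr in [("point estimate", 1.81), ("lower bound", 1.12), ("upper bound", 2.92)]:
    exposed, extra = absolute_risk_change(baseline, rr)
    print(f"{label}: RR={rr:.2f}, risk {baseline:.1%} -> {exposed:.1%}, "
          f"extra {extra * 100:.1f} percentage points")
```

Under that assumption, an 81% relative increase moves the 6-year risk from about 2% to a little under 4%, which reads quite differently from the headline number.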
In the paper, the authors honestly presented the following finding, which is no surprise at all:
"After accounting for age, regular NSAID users were more likely to be white, to be overweight or obese, to be current or former smokers, to have had a mammogram in the last 2 years, and to have used postmenopausal hormone therapy than regular NSAID users. "
In their multi-way tables that present the relative risk estimates of NSAID use, it is not "crystal clear" how exactly the statistical analysis was done. However, the paper does mention that the estimates were adjusted (by including covariates) for race, BMI, first-degree family breast cancer history, menopausal and hormone therapy use, smoking, alcohol intake, physical activities, mammography history, breast biopsy history, parity status before age 30 and neighborhood socioeconomic status.
Is there something missing? I was looking for age. That would have been my first guess. Age should matter most. Who will have a higher risk of cancer? And who will be more likely to be regular pain reliever users? But I can't seem to find a mention of it in my quick 10-minute browse of their paper. I am sure the investigators have considered age as important. In another table, it is mentioned that the characteristics are age-standardized (well, meaning that they standardized the values of these covariates within each 10-year age category). This will make the other covariates not heavily correlated with age. But is age not adjusted for in estimating relative risks? Is this a typo or an unfortunate omission?
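To see why the age question matters, here is a sketch with made-up counts (they are not from the paper). Within each hypothetical age group the risk is identical for users and non-users, but users are mostly older. The crude relative risk then looks alarming, while a stratified estimate (the Mantel-Haenszel risk ratio, used here only as one standard way to adjust, not as the paper's method) comes back to 1.

```python
def risk_ratio(cases_exposed, n_exposed, cases_unexposed, n_unexposed):
    """Crude relative risk: risk among the exposed divided by risk among the unexposed."""
    return (cases_exposed / n_exposed) / (cases_unexposed / n_unexposed)


def mantel_haenszel_rr(strata):
    """Mantel-Haenszel risk ratio pooled over strata.

    Each stratum is (cases_exposed, n_exposed, cases_unexposed, n_unexposed).
    """
    numerator = 0.0
    denominator = 0.0
    for a, n1, c, n0 in strata:
        total = n1 + n0
        numerator += a * n0 / total
        denominator += c * n1 / total
    return numerator / denominator


# Hypothetical counts: (user cases, users, non-user cases, non-users).
strata = [
    (10, 1000, 40, 4000),   # younger women: 1% risk in both groups
    (160, 4000, 40, 1000),  # older women: 4% risk in both groups
]

crude = risk_ratio(
    sum(s[0] for s in strata), sum(s[1] for s in strata),
    sum(s[2] for s in strata), sum(s[3] for s in strata),
)
adjusted = mantel_haenszel_rr(strata)
print(f"crude RR = {crude:.3f}, age-adjusted RR = {adjusted:.3f}")
```

In this toy example all of the apparent effect comes from who ends up in which age group, which is exactly the worry if age is left out of the adjustment.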
There are some other statistical question marks in this paper that I didn't figure out during my fast read. Not to mention multiple comparison issues.
Wednesday, June 01, 2005
To me, a young statistician, there seems to be a thin (very, very thin) line between "applied research" and "consulting/application". For an academic statistician, crossing the line by doing service-like applications for other scientists can be harmful, I suppose (or, I guess). But how far inside the applied research domain, away from this line, is safe enough? What decides the appropriate appliedness of a project?
Since there is no absolute standard for the appropriate appliedness of research projects (am I missing something here?), one who is dedicated to doing applied research is faced with two choices: make an applied project more applied to address the needs of the other discipline, or make it less applied to maintain some statistical intellectual merit. Sometimes, one does not have to make a choice. That is just the dreamland for an applied statistician.
Tuesday, May 31, 2005
Anyone who currently has direct control over more than two computers knows that it gets harder and harder to shop for another machine. There are just so many things to consider.
Maybe I should talk more about my current situation. I have a fairly powerful workstation in my office, and I also have a reasonably powerful home desktop for evenings and weekends. Then I have a super light SONY VAIO laptop (3 pounds) for travel and teaching. Until recently, I was fairly happy with my arrangements. I have recently started my own religious schedule of backing up my office machines. I never leave the only copy of anything important on my home machine or laptop. All three machines share a virtual hard drive on our department's server (which wouldn't be possible for everyone, but it is for me since I use Columbia as my ISP at home). And Regina introduced me to a brilliant piece of software, Total Commander, which compares directories and synchronizes them. I should be happy and productive.
But this summer, I need to go back to China for nearly a month. And, God forbid, I am thinking about doing some work during my stay there. My little SONY VAIO baby is just not powerful enough for me to do serious work. It is still okay for PowerPoint or LaTeX (I maximized its RAM to 384MB), but it gets hot really fast, which makes extended use not very enjoyable.
So I started thinking about getting another laptop. This just freaks me out. I don't quite like any of the Windows laptops on the market right now (picky, picky, picky). I think the Apple PowerBook looks cool and has a lot of neat applications. But thinking about managing 4 computers and 2 operating systems (occasionally I also need to do some UNIX stuff) is overwhelming. Am I getting too old? I dread the thought of remembering all the different shortcuts on different machines. Fair enough. Maybe I will just get a Windows laptop then. But I cannot even imagine installing all the essential software (even the freely available kind) on this new computer. I wonder whether there is a way or a service that can duplicate the working setup of my current machines and put it onto a new (more powerful) machine. I suppose it is not age then. It is my progressive laziness.
Thursday, May 26, 2005
Sunday, May 22, 2005
1) The use of different terms. All the intro stat books we have examined for our teaching universally use "measure of center" and "measure of spread", not "central tendency" and "dispersion". Andrew once commented that only people for whom English is not a native language will find such usage of the language worth noting. Here, I am not fascinated by the differences. In my teaching, I have had a number of students who were confused by the different terms for the same thing in statistics. This is the first time I have realized where such "variety" gets started.
2) MAD (mean absolute deviation) was taught. I just quickly checked several intro stat texts we use in this department. None of them covers MAD as a measure of spread. Only one covers a related topic, MAD regression. I can't understand what makes such a topic essential in a high-school math text but unnecessary in a college intro stat text. Personally, I like MAD.
3) The extensive use of the TI-83/84 calculator. It is very foreign to me for a math textbook to illustrate how to use a calculator. Using calculators for homework or exams wasn't acceptable when I was in China. The use of a calculator has the advantage of enabling students to work on statistical problems of real-life scale. But following a diagram such as "stats->2->enter->..." in the book, the student, as my pupil said today, "had no idea what he was doing."
This is not an AP stat course. To my understanding, the text used for AP stat is usually very similar to a college text, but the emphasis is more on computation and less on concepts.
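For readers who have not met the MAD from item 2, here is a minimal sketch of how it is computed, next to the standard deviation for comparison. The data values are made up for illustration and the function name is my own.

```python
from statistics import mean, pstdev


def mean_absolute_deviation(values):
    """Average distance of the values from their mean."""
    center = mean(values)
    return mean(abs(x - center) for x in values)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data with mean 5

print(mean_absolute_deviation(scores))  # 1.5
print(pstdev(scores))                   # 2.0 (population standard deviation)
```

Both numbers measure spread around the mean. The MAD averages the plain distances, while the standard deviation squares them first, so it gives more weight to the values far from the center.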
Friday, May 20, 2005
Von Bing Yap and Terence P. Speed
Abstract: We studied the substitution patterns in 7661 well-conserved human–mouse alignments corresponding to the intergenic regions of human chromosome 22. Alignments with a high average GC content tend to have a higher human GC content than mouse GC content, indicating a lack of stationarity. Segmenting the alignments into four groups of GC content and fitting the general reversible substitution model (REV) separately gave significantly better fits than the overall fit and the levels of fit are close to that expected under an REV model. In addition, most of the fitted rate matrices are not of the HKY type but are remarkably strand-symmetric, and we constructed a number of substitution matrices that should be useful for genomic DNA sequence alignment. We did not find obvious signs of temporal inhomogeneity in the substitution rates and concluded that the conserved intergenic regions in human chromosome 22 and mouse appear to have evolved from their common ancestors via a process that is approximately reversible and strand-symmetric, assuming site homogeneity and independence.
John P. Huelsenbeck, Bret Larget, and Michael E. Alfaro
Abstract: A common problem in molecular phylogenetics is choosing a model of DNA substitution that does a good job of explaining the DNA sequence alignment without introducing superfluous parameters. A number of methods have been used to choose among a small set of candidate substitution models, such as the likelihood ratio test, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and Bayes factors. Current implementations of any of these criteria suffer from the limitation that only a small set of models are examined, or that the test does not allow easy comparison of non-nested models. In this article, we expand the pool of candidate substitution models to include all possible time-reversible models. This set includes seven models that have already been described. We show how Bayes factors can be calculated for these models using reversible jump Markov chain Monte Carlo, and apply the method to 16 DNA sequence alignments. [Follow the link above to read more.]
Tuesday, May 17, 2005
As a graduate of Columbia myself, I naturally took what I went through as the "right" way of doing it and was surprised to find out that the ceremony may go quite differently at other institutions. Of course, different universities have different colors for their gowns. I am not referring to this colorful side of academia. What surprised me was how casual Columbia's Ph.D. convocation appeared to be compared to those of some other universities. My colleagues from other universities seem to have a much better idea of the traditions. According to them, the hood is a big deal, symbolically. For example, one said that only those who had an academic position after graduation can wear the hood. Another said that the hood can only be worn after the degree has been conferred (thus after the convocation). There are even hooding ceremonies in some places, where Ph.D. advisors put the hoods on their protégés. I was nearly disappointed that such an unceremonious graduation had ended my long graduate study. It could even be that I had had a wrong doctoral convocation!
Now I recall how clueless I was when trying on my doctoral gown, hood and cap in the ladies' room of Low Library. In a small space full of women who had just earned their Ph.D.s, there were so many different ways in which the academic regalia were put on. I remember I asked the person next to me how to put on the cap, and she simply shrugged, saying she had no idea. Anyhow, I am at peace since I quite enjoyed my convocation.
As a researcher, one should do research on things one is curious about. So I did some research on academic regalia and found (at least) that not everyone in a doctoral gown needs to be a doctoral degree holder, but only those who have a real doctoral degree can wear a hood. It is, however, not necessary for the hood wearer to have any academic appointment. Sure, this random website I found could also be wrong.
"The traditional rule is that a candidate for a degree should not wear the hood of that degree until it is actually conferred. This rule still applies to those who are to be individually hooded during the commencement ceremony; they should not wear the hoods in the preliminary academic procession. However, when degrees are to be conferred en masse, without individual hooding, the groups involved, e.g., master's degree candidates at large universities, may wear their hoods in the preliminary procession and throughout the ceremony."
Sunday, May 15, 2005
Anyway, Andrew dropped by the other day and I told him about my new toy blog. He said I should have just used this one. Well, his blog is more about research and related topics. But mine is more about my own "data". (That is why there are so few posts. Ha!)
Friday, April 01, 2005
Just a poem received by email:
"Do it Anyway
People are unreasonable, illogical and self-centered.
Love them anyway!
If you do good, people will accuse you of selfish, ulterior motives.
Do good anyway!
If you are successful, you will win false friends and enemies.
The good you do will be forgotten tomorrow.
Do good anyway!
Honesty and frankness make you vulnerable.
Be honest and frank anyway!
What you spend years building may be destroyed overnight.
People really need help but may attack you if you help them.
Help them anyway!
Give the world the best you have and you'll get kicked in the teeth.
Give the world the best you've got anyway!
Written by: Mother Teresa"
Wednesday, March 30, 2005
The first article is about a professor who used an all-A policy in his teaching since he felt that the grading system was a distraction from real teaching and learning. I agree that the grading system IS a distraction. But reducing it to an all-A policy will create another distraction, especially for those students who are working hard and care about learning. Sure, we always say the grades are not important, but at the end of the day, they really matter. My teaching philosophy, when it comes to grades and exams, is to use exams as reinforcing tools. In other words, I put the most important things in the exams, things I want the students to remember longer. I have found that students remember better the things on which they lost a couple of points in an exam.
The second article discussed the culture of complaining, which made me think about a positive attitude towards students' complaints. I can't say I enjoy hearing complaints. Actually, I think no matter what I do, theoretically, there can always be a complaint about everything in my teaching. Then, should we start paying no attention to complaints, or to students' comments? Well, maybe if I, the teacher, listen really carefully, the complaints are not just about everything. I have found that when I spend a little bit more time with every student who asks me a question, I can find out so much more about what they think of the lectures, the homework, the project, the exams, the grades, etc. Isn't that what the teacher really needs to know? My colleague, Andrew Gelman, once discussed the concept of "aggressive teaching" in his cool blog. I think if we just listen and seek, more aggressively, we can derive much more positive information from students' complaints, even though they are about everything.
Tuesday, March 22, 2005
Tuesday, March 01, 2005
Sunday, February 27, 2005
Guess what! My run of bad luck doesn't stop for weekends. My computer just started playing a spontaneous-death game with me. It quietly goes black and stops without any warning. I searched online and found I am not the only one suffering from such technical errors. It might be due to a motherboard failure, and it turns out to be quite common among Optiplex GX270 machines. What did I know when I bought it?
Okay! I guess I will just take a break while waiting for the Dell tech to fix my machine.
Monday, February 07, 2005
"In recitation today we were talking a little about independent variables vs. variables that appear to be associated. Yuejing suggested that if we see an association between variables this does not mean that there is a causal relation between the two. Are statisticians always limited to this "weakened" position? If not, what formal tools are used to decide at what point, from a statistical point of view, the distribution of one variable compared to that of another suggests causality instead of a mere association?"
Here is my reply:
"Usually, analysis of observational data cannot establish causal relations between factors and results, unless the pattern is extremely strong. Experiments are the best way to establish causation. This is why anything about humans is hard to verify, since it is usually not ethical to do experiments on humans unless there won't be any harm.
For example, to test whether a new medicine is effective, pharmaceutical companies need to do clinical trials (experiments on humans). Before that they need to use lab animals to test whether the treatment is safe. Statistical analysis is the most important part of the report that the FDA requires to approve a new medicine.
Also, people have been talking about genes that cause diseases, right? To see whether these specific genes truly cause the specific disease, researchers need to examine transgenic lab animals to confirm such a causal relation.
There is a branch of statistics called "causal analysis" for observational data. However, even there, no conclusion can be made with 100% confidence. This is not something unique to statistics. In biology, physics, chemistry, most scientific results have exceptions that we don't fully understand. It is especially true for astronomy, where most published results are only theories. The difference is that statistics may be the only science that admits explicitly that we don't know all, and the only science that attempts to estimate the extent of our ignorance."
Here are comments from my colleague, Professor Andrew Gelman:
1. this is the kind of thing you can put on your blog!
2. the student asked a good question!
3. your answer is reasonable.
more generally, when we speak of causation in statistics, we usually speak of "treatments" or "interventions", not just of "variables". thus, we would not say that variable A causes variable B. rather, we would say that a specified treatment T which affects variable A, also affects variable B. thus, T affects both A and B. the idea is to design an experiment so that T is something that is realistic.
for example, surveys have found a positive correlation between "social capital" (roughly, the size and quality of your social network) and "happiness". does social capital cause happiness, or does happiness cause social capital, or neither, or both, or ...? the way a statistician would address this issue would be to design a particular treatment, designed to directly affect "social capital", and see how it affects "happiness". and to consider other treatments to directly affect "happiness" and see how they affect "social capital". or to look for observational data that has the form of a "natural experiment" affecting these. in this example, things get interesting because measurements and treatments can be made at individual or group levels. hence multilevel modeling becomes relevant.
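The difference between observing an association and applying a treatment can be sketched with a small simulation. Everything here is hypothetical: a hidden trait drives both "social capital" (A) and "happiness" (B), A has no effect on B by construction, and the treatment T is a coin flip that raises A by 2 units. The numbers and names are my own assumptions, chosen only to illustrate the idea.

```python
import random


def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def group_difference(values, treated):
    """Mean of the treated group minus mean of the control group."""
    t = [v for v, flag in zip(values, treated) if flag]
    c = [v for v, flag in zip(values, treated) if not flag]
    return sum(t) / len(t) - sum(c) / len(c)


def simulate(n=5000, seed=0):
    rng = random.Random(seed)
    hidden = [rng.gauss(0, 1) for _ in range(n)]  # unobserved common cause

    # Observational data: A and B both depend on the hidden trait only.
    a_obs = [h + rng.gauss(0, 1) for h in hidden]
    b_obs = [h + rng.gauss(0, 1) for h in hidden]

    # Experiment: T is assigned by coin flip and shifts A; B is untouched by A.
    treated = [rng.random() < 0.5 for _ in range(n)]
    a_exp = [h + (2.0 if t else 0.0) + rng.gauss(0, 1) for h, t in zip(hidden, treated)]
    b_exp = [h + rng.gauss(0, 1) for h in hidden]

    return (correlation(a_obs, b_obs),
            group_difference(a_exp, treated),
            group_difference(b_exp, treated))


corr, effect_on_a, effect_on_b = simulate()
print(f"observed correlation of A and B: {corr:.2f}")
print(f"effect of T on A: {effect_on_a:.2f}, effect of T on B: {effect_on_b:.2f}")
```

In this made-up world the survey shows a clear positive correlation, yet the randomized treatment moves A without moving B, so the association was never causal.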
Thursday, January 20, 2005
Monday, January 10, 2005
It will be some shock to the students.
It is natural for someone to look for a shared passion for the subject in his/her class of students, especially when the class is supposed to be young, curious and full of energy. In some cases, it is not that difficult. Students who would sign up for a semi-advanced mathematics class must at least have enjoyed most of the math classes they have taken. The fact that w1111 can be used to fulfill the science requirement for undergraduates always makes me wonder how many of my students actually care about the subject, and how many just come to get an easy A. Such doubts sometimes get in the way of my teaching. Maybe in the future, they should make a CVN version of the intro statistics course, and those who don't care too much about the course can just take it without getting out of their rooms (are the videos online now?). This way, they get what they want in a much more enjoyable way, our department can save the instructors and TAs for a more motivated audience, and no one's feelings will be hurt.
For now, I am getting myself motivated.