Saturday, November 25, 2006

Spams that mimic real emails

I always felt the spam emails I receive mimic the real emails in my inbox. I spoke to our web admin about this several months ago since I was worried that our mail server was infected by some viruses that "learn" the patterns of our real emails. Today, it just occurred to me: it is NOT that the spams mimic our real emails. Rather, ONLY spams that look close to our real emails get past our spam filter. I don't know why I didn't think of this earlier.

Sunday, November 19, 2006

BST files for journals

Here is a collection of style and bst files for journals. Some of them are pretty helpful.

Saturday, November 18, 2006

Ser-Venn-ity prayer redone.

Saw a nice Venn diagram in Andrew's blog and loved it. But I sort of feel the proportions do not agree with my perspective about the world, and I love prettier colors. So I made one for myself.

Friday, November 17, 2006

Friday, November 10, 2006

Wednesday, November 01, 2006

Too tired to proof-read your draft?

How about proof-hear it then?

In a referee report we received days ago, the referee kindly pointed out an overlooked typo in our draft. We misspelled "child" as "chilled". After many rounds of proofreading, this silly mistake still slipped through the big blind spot of our eyes (all three of us). We were more anxious about the logic, the flow, and the math of the draft.

I was telling a friend of mine about this incident since I think it is pretty funny. And then I remembered I had a nice experience using Adobe Acrobat Reader to "read" a thesis draft to me. I remember that hearing the draft read aloud allowed me to catch more typos. So I should do that more often.

To have your Adobe Acrobat Reader "read aloud" your paper:
view>read aloud>read ... [a page or the whole thing]
To stop or pause
view>read aloud>pause (stop)

You can change the voice and speed in "control panel>speech".

Sure, you would still have to proofread your math equations really HARD.

Wednesday, September 06, 2006

Have you seen so many colors in R before?

A colleague and good friend of mine, Dr. Ying Wei, sent me a document that contains a list of R colors. She compiled this list when she was working on her thesis. It is such a beautiful file by itself, not to mention that it can be so helpful at times. With Ying's approval, I am sharing this list in this post.
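For readers who want to build a similar chart themselves, here is a minimal sketch using only base R. It is not Ying's document; the grid layout (25 swatches per row) is my own arbitrary choice.

```r
## All color names that base R knows about.
cols <- colors()
head(cols)

## The RGB components behind one of the names.
col2rgb("steelblue")

## Draw every named color as a small swatch on a grid.
## The 25-per-row layout is an arbitrary choice for illustration.
n  <- length(cols)
nc <- 25
nr <- ceiling(n / nc)
idx <- seq_len(n) - 1
plot(0, 0, type = "n", xlim = c(0, nc), ylim = c(0, nr),
     axes = FALSE, xlab = "", ylab = "", main = "Named colors in R")
rect(idx %% nc, idx %/% nc, idx %% nc + 1, idx %/% nc + 1,
     col = cols, border = NA)
```

Any of these names can be passed wherever a plotting function accepts a col argument.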

Thursday, August 24, 2006

From PDF to Excel Table

I received the following email:

PDF2XL, our core software product can aid academic researchers in sociology, economics, political science and other disciplines to extract data from government and general publications. Extracting data from these sources is a prerequisite for numerous quantitative research projects. The data is usually provided in PDF format and PDF2XL can drastically reduce the amount of work involved in getting that data into a statistics-based software such as SPSS or Excel.

PDF2XL is regularly priced at $95, but our initial academic install base (UC Irvine, ASU, WSU and others) and their success, led us to believe strongly in its applicability to any research or academic setting. That is why we are introducing PDF2XL to the academic community by giving a free one-year license of PDF2XL to all academic users for the academic year 2006-2007.

We prepared a section on our website that contains information devoted to the academic community, for example, academic outreach main page and a case study about the use of PDF2XL in quantitative research. I have also attached the case study in PDF format.

Out of curiosity and my love for new geeky software, I tried this program on a short PDF file. Below is the screenshot. It allows you to select the table area and also lets you correct the table recognition done by the program. I can imagine that this will be useful if I need to deal with PDF files with lots of tables.

Saturday, August 19, 2006

The correlation plot in our JASA "how many X" paper

Andrew emailed me saying that he received a number of requests for the code that we used to make that figure. So here I have prepared a more annotated version of the code. R code.

I thought of providing the data from our paper as an example but decided not to since we are not the owners of these data.

Thursday, July 20, 2006

Fitting constrained least square regression in R

Prepared the following for an email inquiry.

Yes. There is something in R for such tasks. This requires a special package called mgcv, which should already be installed in a standard R configuration. See the help page of its pcls function (?pcls). In particular, check out the options Ain and bin.

Here is an example I wrote:
(copy-paste these lines into R console)

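## load the special package first.
library(mgcv)

## generate some fake data
x.1<-rnorm(100, 0, 1)
x.2<-rnorm(100, 0, 1)
x.3<-rnorm(100, 0, 1)
x.4<-rnorm(100, 0, 1)
y<-1+0.5*x.1-0.2*x.2+0.3*x.3+0.1*x.4+rnorm(100, 0, 0.01)
## make your own design matrix with one column corresponding to the intercept
x.mat<-cbind(rep(1, length(y)), x.1, x.2, x.3, x.4)

## this is the regular least-square regression
## (x.mat already has a column for the intercept, so lsfit should not add another one)
ls.print(lsfit(x.mat, y, intercept=FALSE))

## the penalized constrained least square regression
## pcls() minimizes the weighted least squares subject to Ain %*% p > bin,
## starting from a feasible initial value p.
## Assumption for illustration: Ain=diag(...) forces every coefficient to be
## non-negative, with no penalties (S, off, sp) and no equality constraints (C).
M<-list(y=y,
X=x.mat,
w=rep(1, length(y)),
p=rep(1, ncol(x.mat)),
C=matrix(0, 0, 0),
S=list(), off=array(0, 0), sp=array(0, 0),
Ain=diag(ncol(x.mat)),
bin=rep(0, ncol(x.mat)) )
pcls(M)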


Monday, July 03, 2006

News Headline starts with "Statisticians"

This piece of news caught my attention simply because it has one of my favorite words, "statistician," in its headline.

This news article outlines research conducted by a statistician at Oxford University on "how interconnected the human life is".

See the following argument:

It's simple math. Every person has two parents, four grandparents and eight great-grandparents. Keep doubling back through the generations — 16, 32, 64, 128 — and within a few hundred years you have thousands of ancestors.

It's nothing more than exponential growth combined with the facts of life. By the 15th century you've got a million ancestors. By the 13th you've got a billion. Sometime around the 9th century — just 40 generations ago — the number tops a trillion.

But wait. How could anybody — much less everybody — alive today have had a trillion ancestors living during the 9th century?

The answer is, they didn't. Imagine there was a man living 1,200 years ago whose daughter was your mother's 36th great-grandmother, and whose son was your father's 36th great-grandfather. That would put him on two branches on your family tree, one on your mother's side and one on your father's. In fact, most of the people who lived 1,200 years ago appear not twice, but thousands of times on our family trees, because there were only 200 million people on Earth back then.

Simple division — a trillion divided by 200 million — shows that on average each person back then would appear 5,000 times on the family tree of every single individual living today.

But things are never average. Many of the people who were alive in the year 800 never had children; they don't appear on anybody's family tree. Meanwhile, more prolific members of society would show up many more than 5,000 times on a lot of people's trees.

Keep going back in time, and there are fewer and fewer people available to put on more and more branches of the 6.5 billion family trees of people living today. It is mathematically inevitable that at some point, there will be a person who appears at least once on everybody's tree.
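The arithmetic in the quoted passage is easy to check. Below is a small sketch in R; the 200 million population figure is the one quoted in the article, and pairing 20, 30 and 40 generations with the 15th, 13th and 9th centuries is my own rough reading of the text.

```r
## Number of slots on a family tree after g generations of doubling.
generations <- c(20, 30, 40)
slots <- 2^generations
data.frame(generations, slots)

## Population figure quoted in the article for 1,200 years ago.
world_pop <- 200e6

## Average number of times each person back then appears on one tree,
## first with the article's round "a trillion", then with 2^40 exactly.
1e12 / world_pop
2^40 / world_pop
```

The exact power of two gives a slightly larger average than the round trillion, but the order of magnitude is the same.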

Monday, June 19, 2006

Sounds for the young ears

I am sure it is not a new scientific discovery that 1) human ears cannot hear all sounds, 2) some can hear more than others, and 3) aging affects hearing. But it is not until recently that people started to use such scientific facts in everyday life.

Someone in the UK has devised some ringtones (follow the link to find a sample sound file) using a high frequency so that only people younger than a given age (say 17) can hear them. Sure, it is a statistical threshold, in the sense that some teens may not hear it whereas some 30-year-olds can still hear it.

The scientific reasoning is that human ears become less sensitive over the years. Thus the range of frequencies a person can hear becomes narrower as he/she ages. When I told Xiaoli Meng from Harvard about this during the ICSA meeting, he wondered whether there is a frequency that the elderly can hear while the youngsters can't. I am afraid that this is not a fair game.

I am sure blogs around the world have covered the same topic but I still feel I should blog it.

Of course I tried to hear the sound but I can't hear it. It gives me a strange feeling. I am not upset because I am old by some definition now. Rather, I am upset because there is something I had the chance to hear but I didn't, and now it is too late for me to hear it. I am sure there are tons of things like this that I have missed. Still, having something so materialized thrown right into my face like this makes me feel bad.

Besides the sentimentals, I also thought of the implications of this. I wonder whether babies cry sometimes because they are annoyed by some high-pitched sounds that the parents cannot hear. If this is true, I wonder whether we should have a meter at home that detects sounds of all frequencies and their magnitudes. Just a thought.

Friday, May 26, 2006

"Willing to Do the Math: An Interview with David Botstein"

From PLoS Biology

The first half is about science education for undergraduates and the second half is about the birth of the human genome project. A very interesting article.

Saturday, May 20, 2006

Columbia is entering e-prints era

I came across a new website on the librarynet of Columbia. It says that Columbia is pilot testing an electronic interface for academic papers of its departments. This is truly good news. In the future, it will be very easy and semi-official to log every paper one finishes through this repository.

[Reference] Phi-divergence for Goodness-of-Fit tests

Goodness-of-fit tests via phi-divergences
Authors: Leah Jager (University of Washington), Jon A. Wellner (University of Washington)
Comments: 43 pages. Submitted to Annals of Statistics.

Tuesday, May 16, 2006

Lon Cardon and Tom Linder on Whole-Genome and Linkage Microarray Studies

I particularly like the following discussion on validation.

Conard: How do you go about validating the output information?

Cardon: I think there are two levels to validation. The one that probably gets overlooked most often is quality control. The first thing you want to look at is quality of the genotypes. Before you expend all that effort and spend the money, some mechanism for validating the genotype is important. That may mean re-genotyping, but it would be re-genotyping a relatively small number of markers and potentially using a different assay to really validate those findings before going further.

The second level concerns replication—the gold standard in association studies. When someone reports a finding, the best thing that can happen is someone in a different lab says, “I genotyped that same marker on my samples and I saw the same results.” That happened with macular degeneration recently, and that’s very hard to argue against.

Wednesday, May 10, 2006

Parents of baby boys used to say NO to social security? Of course not!

I was getting some first name distribution data for project MICHAEL. I found out, by googling and consulting Wikipedia, that the Social Security website maintains a nice list of first names of BIRTHS reported on applications for Social Security numbers. When I was excitedly "grabbing" data off this website, I noticed something unusual. The number of boy births did not match up with the number of girl births for a couple of decades. The trend is pretty interesting. See above.

I sent this graph to Andrew and he posted it in his blog. I was looking forward to some Freakonomics-type answers and I got some. Here are the explanations I can accept.

Yeah, I didn't think it was WWI. When SS was first enacted, enrollment was optional. Although SS benefits were small at the time, there were survivor benefits; plus, although regular benefits were roughly pegged to contributions, in general women had fewer market wage opportunities than men so SS was the only game in town. More women signed up sooner.
Posted by: Robert at May 10, 2006 10:47 PM

Jason Ruspini writes,
There is an error preventing me from posting on the blog, but survivors benefits (life insurance) would be my guess.
Cheers, Jason Ruspini

Saturday, May 06, 2006

Statistical compromises

For some project I am working on, I opened the book entitled "Permutation Methods -- A Distance Function Approach" by Mielke and Berry. At this moment, I am only working on the first page. I just want to blog the first sentence of their introduction: "Many of the statistical methods routinely used in contemporary research are based on a compromise with the ideal." I found this opening very arresting.

Wednesday, April 19, 2006

[New paper in the literature]: A systematic comparison and evaluation of biclustering methods for gene expression data

Bioinformatics Advance Access originally published online on February 24, 2006
Bioinformatics 2006 22(9):1122-1129; doi:10.1093/bioinformatics/btl060

Amela Prelić 1, Stefan Bleuler 1,*, Philip Zimmermann 2, Anja Wille 3,4, Peter Bühlmann 4, Wilhelm Gruissem 2, Lars Hennig 2, Lothar Thiele 1 and Eckart Zitzler 1

Motivation: In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, which is often referred to as biclustering, allows to identify sets of genes sharing compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and datasets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare with each other with respect to the biological relevance of the clusters as well as with other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of the biclustering method are currently available.

Results: First, this paper provides a methodology for comparing and validating biclustering methods that includes a simple binary reference model. Although this model captures the essential features of most biclustering approaches, it is still simple enough to exactly determine all optimal groupings; to this end, we propose a fast divide-and-conquer algorithm (Bimax). Second, we evaluate the performance of five salient biclustering algorithms together with the reference model and a hierarchical clustering method on various synthetic and real datasets for Saccharomyces cerevisiae and Arabidopsis thaliana. The comparison reveals that (1) biclustering in general has advantages over a conventional hierarchical clustering approach, (2) there are considerable performance differences between the tested methods and (3) already the simple reference model delivers relevant patterns within all considered settings.

Availability: The datasets used, the outcomes of the biclustering algorithms and the Bimax implementation for the reference model are available at


Supplementary information: Supplementary data are available at

Tuesday, April 18, 2006

[New paper in the literature]: Family-based designs in the age of large-scale gene-association studies

Nature Reviews Genetics 7, 385-394
Nan M. Laird and Christoph Lange

Abstract: Both population-based and family-based designs are commonly used in genetic association studies to locate genes that underlie complex diseases. The simplest version of the family-based design — the transmission disequilibrium test — is well known, but the numerous extensions that broaden its scope and power are less widely appreciated. Family-based designs have unique advantages over population-based designs, as they are robust against population admixture and stratification, allow both linkage and association to be tested for and offer a solution to the problem of model building. Furthermore, the fact that family-based designs contain both within- and between-family information has substantial benefits in terms of multiple-hypothesis testing, especially in the context of whole-genome association studies.

Monday, April 17, 2006

Papers on Cross-Validation and Bootstrap

Estimating the Error Rate of a Prediction Rule: Improvement of Cross Validation
B Efron. (1983) JASA

Improvements on Cross-Validation: The .632+ Bootstrap Method
B Efron and R Tibshirani (1997) JASA

On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality
JH Friedman. (1997) Data mining and Knowledge Discovery.

Friday, April 14, 2006

Will books be endangered species soon?

Science: who needs books is part of a series of articles on scientific publishing. The one sentence I liked most in this article is: "Would Darwin need a publisher now? Would he even write a book?" I can't help musing, would Darwin just put up a blog?

Yesterday, a brilliant young man, Atif E Gulab, entered my office. He is currently a freshman in SEAS at Columbia. He said he would like to do a summer independent research project that launches an on-line magazine on statistics/statistics education for high school students and college non-majors. He has started a blog on that.

During our discussion, one topic came up: why online? The answer is really not that hard to think of. There are so many things that an online magazine can carry whereas the regular ones can't, such as tons of pictures, streaming video interviews, music pieces, comments, etc.

I showed Atif the projects my W1111 students accomplished last semester using Wiki. Several projects employed multimedia forms in their data collection. The project reports are only interesting when you can really see and hear the pictures they used and the music they played to their survey respondents and experiment subjects. This is just like how the illustrated version of the Da Vinci Code, with pictures of all the discussed symbols and artworks, makes the reading of the book so much more interesting. I bet Dan Brown would not want to write a controversial story about a musician since he does not get to show the readers the actual music in a printed book.

Atif said maybe in 10 years, no one will read a real book any more. Well. I don't think so. I love books. Love holding them in my hands. Atif does not seem to be attached to actual printed books at all. Maybe after the passing of my generation, the species of books might really become endangered.

Thursday, April 13, 2006

Joint Modeling of Linkage and Association: Identifying SNPs Responsible for a Linkage Signal

Am. J. Hum. Genet., 76:934-949, 2005
Mingyao Li, Michael Boehnke, and Gonçalo R. Abecasis

Abstract: Once genetic linkage has been identified for a complex disease, the next step is often association analysis, in which single-nucleotide polymorphisms (SNPs) within the linkage region are genotyped and tested for association with the disease. If a SNP shows evidence of association, it is useful to know whether the linkage result can be explained, in part or in full, by the candidate SNP. We propose a novel approach that quantifies the degree of linkage disequilibrium (LD) between the candidate SNP and the putative disease locus through joint modeling of linkage and association. [Read more by following the link above]

Genome-wide strategies for detecting multiple loci that influence complex diseases

Nature Genetics 37, 413 - 417 (2005)
Published online: 27 March 2005; doi:10.1038/ng1537

Jonathan Marchini, Peter Donnelly, Lon R Cardon

Abstract: After nearly 10 years of intense academic and commercial research effort, large genome-wide association studies for common complex diseases are now imminent. Although these conditions involve a complex relationship between genotype and phenotype, including interactions between unlinked loci, the prevailing strategies for analysis of such studies focus on the locus-by-locus paradigm. Here we consider analytical methods that explicitly look for statistical interactions between loci. We show first that they are computationally feasible, even for studies of hundreds of thousands of loci, and second that even with a conservative correction for multiple testing, they can be more powerful than traditional analyses under a range of models for interlocus interactions. We also show that plausible variations across populations in allele frequencies among interacting loci can markedly affect the power to detect their marginal effects, which may account in part for the well-known difficulties in replicating association results. These results suggest that searching for interactions among genetic loci can be fruitfully incorporated into analysis strategies for genome-wide association studies.

Maximum-likelihood estimation of haplotype frequencies in nuclear families

Genetic Epidemiology 27:21 - 32
Tim Becker, Michael Knapp

Abstract: The importance of haplotype analysis in the context of association fine mapping of disease genes has grown steadily over the last years. Since experimental methods to determine haplotypes on a large scale are not available, phase has to be inferred statistically. For individual genotype data, several reconstruction techniques and many implementations of the expectation-maximization (EM) algorithm for haplotype frequency estimation exist. Recent research work has shown that incorporating available genotype information of related individuals largely increases the precision of haplotype frequency estimates. We, therefore, implemented a highly flexible program written in C, called FAMHAP, which calculates maximum likelihood estimates (MLEs) of haplotype frequencies from general nuclear families with an arbitrary number of children via the EM-algorithm for up to 20 SNPs. For more loci, we have implemented a locus-iterative mode of the EM-algorithm, which gives reliable approximations of the MLEs for up to 63 SNP loci, or less when multi-allelic markers are incorporated into the analysis. Missing genotypes can be handled as well. The program is able to distinguish cases (haplotypes transmitted to the first affected child of a family) from pseudo-controls (non-transmitted haplotypes with respect to the child). ... [Read more by following the link above] © 2004 Wiley-Liss, Inc.

Several multiclass gene expression papers

A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments.
Bioinformatics. 2004 Nov 1;20(16):2562-71
Broet P, Lewin A, Richardson S, Dalmasso C, Magdelenat H.

A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression.
Bioinformatics. 2004 Oct 12;20(15):2429-37.
Li T, Zhang C, Ogihara M.

BagBoosting for tumor classification with gene expression data.
Bioinformatics. 2004 Dec 12;20(18):3583-93
Dettling M.

For more papers on this topic:
Pubmed keywords: multiclass (or multi-class) AND gene expression

Wednesday, April 12, 2006

Bayesians, Frequentists, and Scientists

JASA 2005, vol. 100, no. 469, pp. 1 - 5
Brad Efron

Abstract: Broadly speaking, nineteenth century statistics was Bayesian, while the twentieth century was frequentist, at least from the point of view of most scientific practitioners. Here in the twenty-first century scientists are bringing statisticians much bigger problems to solve, often comprising millions of data points and thousands of parameters. Which statistical philosophy will dominate practice? My guess, backed up with some recent examples, is that a combination of Bayesian and frequentist ideas will be needed to deal with our increasingly intense scientific environment. This will be a challenging period for statisticians, both applied and theoretical, but it also opens the opportunity for a new golden age, rivaling that of Fisher, Neyman, and the other giants of the early 1900s. What follows is the text of the 164th ASA presidential address, delivered at the awards ceremony in Toronto on August 10, 2004.

Wednesday, April 05, 2006

NYC: Snow in April --- is it really abnormal?

Only summary data are available. I plotted the boxplots using five-number summaries provided. Some of the maximums are regarded as outliers by R.
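For anyone who wants to draw boxplots from summaries alone, here is a minimal sketch of two ways to do it in base R. The numbers below are made up for illustration; they are not the NYC snowfall data, and this is not the content of snow.R.

```r
## Hypothetical five-number summaries (min, Q1, median, Q3, max), in inches.
snow <- list(Jan = c(0, 2.5, 6.0, 11.0, 27.4),
             Feb = c(0, 3.0, 7.0, 12.0, 30.0),
             Apr = c(0, 0.0, 0.0, 1.0, 13.5))

## Way 1: treat each summary as a sample of five points.
## R applies its 1.5 * IQR rule, so a large maximum is drawn as an outlier.
b <- boxplot(snow, plot = FALSE)
b$out

## Way 2: hand the summaries to bxp() directly, so the whiskers
## reach the true minimum and maximum and nothing is flagged.
b_exact <- b
b_exact$stats <- sapply(snow, identity)   # one column per month
b_exact$out   <- numeric(0)
b_exact$group <- numeric(0)
bxp(b_exact, ylab = "Snowfall (inches)")
```

The first way reproduces the behavior described above, where some maximums show up as outliers; the second avoids it.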

R code: snow.R

In Spring 2005, I used a new example for w1111: snowfall in NYC. And that spring, we had a record amount of snow. My students said that I jinxed it. :)

Well, today is absolutely not my fault.

Tuesday, April 04, 2006

99 bottles of beer

I got this link from a friend's blog. In its own description:

"This Website holds a collection of the Song 99 Bottles of Beer programmed in different programming languages. Actually the song is represented in 933 different programming languages and variations. For more detailed information refer to historic information."

I saw this blog days ago and didn't have too much to say about it then. A couple of independent events over the past two days made me feel that it is worth noting. It finally hit me that after someone has proposed a "solution" to computerized generation of this song, 932 other parties still proposed different solutions just to achieve the SAME thing. In publishing scientific papers (well, I only know about statistics and genetics), you probably don't want to submit a paper on a new method that achieves the same thing as some existing methods. If there is no new, better performance, there is no new contribution, or so it seems. I used to agree with this statement, and then this number, "933", sort of "shocked" me into thinking. We definitely learn more from 933 different programs that can "instruct" a computer to print out the lyrics than from just one such program. Then why is there no credit for the runners-up who solve important problems in the scientific world?

I was reading an editorial on the South Korean stem cell scandal. The author argued that since there is absolutely no credit for the person who makes a discovery one day later than the first person, someone is pushed (by scientific greed) to fabricate something just to take first place for the moment and then go back to work out the details. Sure, most scientists have the integrity not to do something like this. The author also pointed out that the fame the first discoverer receives (accelerated by the internet these days) makes the scientific research world become more and more like a celebrity competition. I think the author has a point even though I don't think things are this dramatic in statistics.
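Just to make the point concrete, here is one more way to do the SAME thing, this time in R. It is only a sketch of my own; the helper names bottles and verse are made up, and the wording of the verses follows one common version of the song.

```r
## "3 bottles", "1 bottle", or "no more bottles".
bottles <- function(n) {
  if (n == 0) return("no more bottles")
  paste(n, if (n == 1) "bottle" else "bottles")
}

## The two lines of the verse that starts with n bottles on the wall.
verse <- function(n) {
  c(sprintf("%s of beer on the wall, %s of beer.", bottles(n), bottles(n)),
    sprintf("Take one down and pass it around, %s of beer on the wall.",
            bottles(n - 1)))
}

## The whole song, from 99 down to 1.
song <- unlist(lapply(99:1, verse))
cat(head(song, 4), sep = "\n")
```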

Wednesday, March 29, 2006

Mean, median, mutual funds and DoND

This morning I saw a commercial where a man cheered "more than half of our mutual funds outperform the market's average!" I can't help thinking about this statement statistically.

It would be an absolutely dumb statement if it were not "average" but "median". For "average", if the distribution of the returns is highly right-skewed, having more than a 50% chance of performing above the mean is a good indicator. But can the copywriter of that commercial be so statistically sophisticated? This could be an amusing w1111 example in the future, even though mutual funds may not be the most appropriate subject. I used to have students who complained about the car examples I used since they had little experience with automobiles.

This reminded me of my conversation with Ying Wei the past Monday. It was on the loss of efficiency when estimating the mean of a Gaussian distribution using the median compared with using the mean. That reminded me of the recent popular show on NBC: Deal or No Deal (DoND).
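A quick simulation shows why beating the mean is harder than beating the median under right skew. This is a sketch under an assumption of mine: the lognormal distribution below just stands in for "highly right-skewed" and has nothing to do with real fund returns.

```r
set.seed(1)

## Simulated right-skewed "returns" (lognormal is an assumption for illustration).
returns <- rlnorm(1e5, meanlog = 0, sdlog = 1)

## Share of values above the mean: well below one half under right skew.
frac_above_mean <- mean(returns > mean(returns))

## Share of values above the median: one half by construction.
frac_above_median <- mean(returns > median(returns))

c(above_mean = frac_above_mean, above_median = frac_above_median)

## For this lognormal the theoretical share above the mean is 1 - pnorm(sdlog / 2).
1 - pnorm(0.5)
```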
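The efficiency loss can be seen in a small simulation. For Gaussian data the asymptotic relative efficiency of the median with respect to the mean is 2/pi, about 0.64. The sample size and the number of replications below are arbitrary choices of mine.

```r
set.seed(2006)

n    <- 101    # observations per sample (arbitrary)
reps <- 5000   # number of simulated samples (arbitrary)

## Each row is one standard normal sample of size n.
sims <- matrix(rnorm(n * reps), nrow = reps)

## Sampling variance of the two estimators of the center.
var_mean   <- var(rowMeans(sims))
var_median <- var(apply(sims, 1, median))

## Relative efficiency of the median, next to the asymptotic value.
rel_eff <- var_mean / var_median
c(simulated = rel_eff, asymptotic = 2 / pi)
```

In other words, with the median you need roughly 1.5 times as many observations to match the precision of the mean.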

Here is my version of what that show is about: the show starts with 26 closed cases containing 26 fixed money values ranging from $0.01 to $1,000,000. The contestant opens several cases at random, in batches. The opened cases are eliminated from the board (i.e., cannot be won by the contestant). After each batch of cases is opened, the show pauses. Looking at the remaining unopened money amounts, a banker offers the contestant an amount of money to make him/her stop (and give up the chance of winning the biggest remaining value, of course). If the contestant refuses the offer, he/she has to eliminate one or more amounts by random guessing, which can actually make the next offer drop.

From the contestant's standpoint, he/she should accept an offer that is higher than the MEDIAN since he/she only plays once. If he/she keeps on playing, there is a 50-50 chance that he/she leaves with a value lower than the offer. On the banker's side, he needs to make an offer that is much lower than the mean since he needs to play the game many times. Thus, it is not a surprise to me that every time the banker makes an offer, it is always much lower than the mean of the remaining values. I have yet to figure out the magical amount (offer-median) and the reasoning behind it.
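To see how far apart the mean and the median of the board can be, here is a small sketch. The 26 case values are my recollection of the US version of the show and should be treated as an assumption; the number of cases opened in the first round (six) is also assumed.

```r
## Assumed case values of the US show, from $0.01 to $1,000,000.
cases <- c(0.01, 1, 5, 10, 25, 50, 75, 100, 200, 300, 400, 500, 750,
           1000, 5000, 10000, 25000, 50000, 75000, 100000, 200000,
           300000, 400000, 500000, 750000, 1000000)

## Mean and median of the full board.
c(mean = mean(cases), median = median(cases))

## Open six cases at random and look at what is left.
set.seed(26)
opened    <- sample(length(cases), 6)
remaining <- cases[-opened]
c(mean = mean(remaining), median = median(remaining))
```

A few huge amounts dominate the mean, so an offer can sit well above the median and still be far below the mean.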

Sunday, March 26, 2006

Blog as an alternative to group website

On March 23rd, in Tom's office, we discussed building a website for our "connection" research group (or how many X's do you know group). For something like this to work, someone must take up some tedious work. I suggested a blog. Andrew said he was happy with a blog except for one concern: he could not directly attach a file to his blog post. I was not sure about blogs, so I didn't insist on the idea. I used the "upload file" function of the editor and put document1.pdf in the same folder as the blog. However, it did not automatically generate a link to that file. The link needs to be added manually. It is a WORKABLE setup but not the most convenient one we may wish to have.

Regarding security, we can always implement .htaccess-level limited access. Try clicking on my teaching site for w1111. It is pretty easy to set up but I haven't mastered how to let users change their passwords occasionally. I suppose it is not our biggest concern now.

Friday, March 24, 2006

Data: the only and narrow window

Today, in an ISERP lunch group, Peter Hoff from the University of Washington gave a talk on latent factor models for network data. It is nice to have a 3-hour discussion on a topic since you get to stop, think about and discuss a particular problem without worrying about running out of time.

One of the questions I had was whether the latent factors fitted to the data correspond to demographic characteristics of the nodes in the network. Before I asked, I knew the answer would be "not necessarily". The latent factors just provide a way or a model to decompose the variation structure of a network into more interpretable factors that represent the initiator and the receiver of an edge in a network.

There were also other discussions along this direction. I didn't catch all of them since I was busy making some simple numerical examples to help myself understand better. Then I heard Andrew say: "We cannot claim to infer the data generating mechanism behind the data. We can only infer a data generating mechanism that can generate the data observed."

This reminded me of my thoughts on data and models.

Data (limited observed values) always classify all possible models into equivalence classes. For example, in regression, n points (x_i, y_i) define classes of curves that go through the same values at the x_i's. The regression analysis is simply trying to find the class with the closest distance to the data. In a modeling effort, the targeted model space intersects with the data's equivalence classes. After the intersection, if more than one model remains in an equivalence class, we get the identifiability issue.
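A tiny numerical example of such an equivalence class, with made-up data: two curves that agree at every observed x_i, and therefore have the same residuals, but differ everywhere else. The functions f1 and f2 and the four data points are purely illustrative.

```r
## Made-up data: four observed points.
x <- c(0, 1, 2, 3)
y <- c(0.1, 0.9, 2.2, 2.8)

## A straight line, and the same line plus a term that vanishes at every x_i.
f1 <- function(t) 0.1 + 0.95 * t
f2 <- function(t) f1(t) + sapply(t, function(s) prod(s - x))

## Same values at the observed x's, hence the same residual sum of squares.
all.equal(f1(x), f2(x))
rss1 <- sum((y - f1(x))^2)
rss2 <- sum((y - f2(x))^2)
c(rss1, rss2)

## But the two curves disagree between the observed points.
c(f1(1.5), f2(1.5))
```

The data cannot tell f1 and f2 apart. Restricting the model space to straight lines keeps only f1 from this class, which is how a modeling assumption buys identifiability.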

We can only understand the world to the extent that the data allow. When we ask others about the size of their data sets, we may just sound like coworkers comparing offices: "how's the view in your new office?" "pretty good! the window's much bigger than what I used to have" "wow, nice! you can see so much more now!"

Thursday, March 23, 2006

Finally, found it!

I used to run DOS commands in Splus to automate data generation (naming folders, etc.) using dos( ). But R does not have dos( ). Before, I had only help.searched 'dos' and didn't get what I wanted. Today, something just clicked: I help.searched 'system' and found that the function shell( ) was exactly what I needed.
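
For the record, a minimal sketch of the kind of thing I mean. The helper run_cmd and the folder names run_01, run_02, run_03 are made up for this example. As far as I know, shell( ) is only available in R on Windows, so the helper falls back to system( ) on other platforms.

```r
# Hypothetical helper: use shell() on Windows and system() elsewhere
run_cmd <- function(cmd) {
  if (.Platform$OS.type == "windows") {
    shell(cmd)
  } else {
    system(cmd)
  }
}

# Made-up folder names for three simulation runs, kept under tempdir()
run_dirs <- file.path(tempdir(), sprintf("run_%02d", 1:3))

for (d in run_dirs) {
  # Skip folders that already exist so the loop can be rerun safely
  if (!dir.exists(d)) {
    target <- normalizePath(d, mustWork = FALSE)
    run_cmd(paste("mkdir", shQuote(target)))
  }
}
```

For plain folder creation, base R's dir.create( ) does the same job without going through the operating system's command line; the shell route is mainly useful for commands that R has no function for.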

Remembering stress

I find I wake up to a different mood every day. Sometimes I can come up with a probable explanation, but most of the time I don't know why. On some lucky days, I wake up very excited about the day of work ahead of me. And that is what I want. It is not that I am a workaholic (even though it would not be a bad idea to become one). It is just that I will go to work no matter what my mood is, and excitement towards work will make the day so much more enjoyable AND productive. I remember I was in such a mood one January morning last year and then I had a bad fall on the icy street. Then my "work high" disappeared for a couple of months.

Today I woke up feeling exhausted. Every time this happens, I just want to take something that will boost my mood into an unreasonable "work high". I vaguely recall reading about the medical cause of depression, where I learned that our moods are affected by some enzyme in our brain. So I thought maybe the level of that-whatever-it-is thing in my head is highly variable. Maybe there is a way to stabilize it (doing Yoga maybe?).

So I went online and googled.

There is absolute proof that people suffering from depression have changes in their brains compared to people who do not suffer from depression. The hippocampus, a small part of the brain that is vital to the storage of memories, is smaller in people with a history of depression than in those who've never been depressed. A smaller hippocampus has fewer serotonin receptors. Serotonin is a neurotransmitter -- a chemical messenger that allows communication between nerves in the brain and the body. What scientists don't yet know is why the hippocampus is smaller.

Investigators have found that cortisol (a stress hormone that is important to the normal function of the hippocampus) is produced in excess in depressed people. They believe that cortisol has a toxic or poisonous effect on the hippocampus. It's also possible that depressed people are simply born with a smaller hippocampus and are therefore inclined to suffer from depression.

Okay. It is not really an enzyme. The hippocampus (hippo-campus?), which is in charge of memory storage, becomes smaller in depressed patients. Hmm ... interesting, I thought. Most people who have had depression went through things they DO want to forget. Sometimes, we hear ourselves saying "I am trying to forget about this", especially during stressful events. This can be interpreted as a signal to our brain (of course, we think using our brain, don't we?), and our brain takes the hint and signals the hippocampus to become smaller.

So we probably should keep remembering everything no matter how frustrating it is. :)

Tuesday, March 21, 2006

Direct 3D Robot Control

Robert Kass came to our department yesterday and gave a talk on Bayesian curve fitting and the modeling of neuron firing rates. During his presentation, he showed a video segment from the lab of one of his collaborators, in which a monkey was shown to control a robot arm through its thoughts. The monkey's desire to eat was sensed by neuron sensors on its head and translated by some computational algorithm. Dr. Kass mentioned that similar experiments have been planned in which the method developed by him and his colleagues would be used in the translation. It is moments like this that make doing applied statistics rewarding.

Tuesday, March 14, 2006

A helpful command in R

setwd(dir), in package base -- set the current working directory
getwd(), in package base -- get the current working directory

I used to put my own path string in each program to control the data input/output path. I think this is even easier.
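
A minimal sketch of how this looks in a program. The folder name analysis_output, the file name results.csv and the toy data frame are all made up for this example; the folder lives under tempdir() only so that the sketch does not touch any real project directory.

```r
# Remember where we started so we can go back afterwards
old_dir <- getwd()

# Hypothetical output folder for this analysis
out_dir <- file.path(tempdir(), "analysis_output")
dir.create(out_dir, showWarnings = FALSE)

# From here on, relative file names resolve inside out_dir
setwd(out_dir)

# Toy data, written and read back without any path string
toy <- data.frame(x = 1:3, y = c(2.1, 3.9, 6.2))
write.csv(toy, "results.csv", row.names = FALSE)
dat <- read.csv("results.csv")

# Restore the original working directory
setwd(old_dir)
```

The only path in the whole program is the one passed to setwd( ), so moving the project to another machine means changing a single line.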

Tuesday, February 28, 2006

Is it hard to manipulate an MRI?

This morning I heard in the news that people were using MRI to detect lies. Experiments were conducted and MRI was compared with regular lie detectors. Under the experimental design, brain images showed relatively (and statistically significantly) more activity when a person was trying to formulate a lie. I can't help thinking about how hard it would be to "pump" activity into my head when I try to lie. To me, it is definitely harder to manipulate my pulse in order to manipulate the regular lie detectors.

Monday, February 27, 2006


It is said in the news that China is "set to spend billions on wireless upgrade". China already has the biggest wireless market in the world. It is also reported that there were 59 million new subscriptions last year alone. So, it is only natural that China spends more money upgrading the network, which has led to heated competition among global telecommunications companies.

I have several reactions to this piece of news:

1) Does the number, 59 million new subscriptions, take into account the special cultural event, "super girl" (equivalent to American Idol in the US)? In the US, viewers vote online. In China, viewers vote by text messaging using cell phones. There was also a limit of 6 votes per phone number. Thus, loyal fans would buy prepaid cellular phone numbers and plans to boost votes for their idols. The TV event attracted tens of millions of votes per show. One can imagine how many new wireless service subscriptions can result from this nationwide event.

2) For a one-per-user sort of service like wireless service, where each regular user (not a fan of some TV reality competition) only needs one subscription, marketing models probably also consider the saturation of the market in a fixed population. I wonder whether there is something called "beware of fixed population extrapolation" in this field of research.

3) This also reminded me of the differences between the wireless markets in China and in the US. In China, people upgrade their cellphones far more frequently than people in the US. This could lead to different marketing models.

Thursday, February 23, 2006

Better, felt psychologically

Instance #1: Recently I put a small humidifier in my office. Andrew commented the other day that this sort of thing only makes one feel good psychologically.

Instance #2: Due to my teaching schedule, I have more time to visit the gym this semester. After a couple of visits, I feel good even though I don't think the effects should be so immediate.

After these instances, I kept thinking about how naive our minds are. So easy to fall for such "traps" of fake goodness. Then I changed my mind. Simply because, in both instances, I, or my brain, was the one that fell for such a mirage of improvement, there is a little bit of resistance inside of me to accepting that feeling good psychologically is a bad thing. Is there any "feeling good" that is not psychological? Feeling better is an improvement mentally, right? Most people are judgmental about themselves. They can, somehow, stand outside themselves and criticize what is wrong with them. You know, on all kinds of things: clothes, diet, working styles, etc. Whenever we initiate changes in our lives, we tend to feel better before any "real" physical effects kick in. But isn't happiness the most precious thing in modern life? Maybe we should say that these changes first take effect on one's mental health, even if those are not their only effects.

So in some way, for both my instances, I (an entity in both physical and mental sense) did get better.

Wednesday, February 22, 2006

Harvard's Summers Resigns

(From Harvard Website)
It was in the news yesterday and it is definitely in the news today. Reading about his presidency in today's WSJ article, I learned so much more about him than his unfortunate comment on women in science. Even for that comment, I was not upset with him. He is not the reason for the reality. On the contrary, he has become part of the reason that may lead to a change (hopefully, a better one). The WSJ article says "in the end, he [Summers] failed to appreciate the cultural differences between the hierarchical structure of the federal government and the more collaborative atmosphere of academia."

Tuesday, February 21, 2006

Think less and be happy!

From this week in Science:
Dijksterhuis et al. (p. 1005; see the news story by Miller) show that deliberate thinking about simple decisions (such as buying a shampoo) does yield choices that are judged to be more satisfying than those made with little thought, as expected. However, as the decisions become complex (more expensive items with many characteristics, such as cars), better decisions and happier ones come from not attending to the choices but allowing one's unconscious to sift through the many permutations for the optimal combination.
This reminded me of my student years. After each exam, I would think back on how I did. The more mistakes I could recall, the worse I felt about that exam, and the better the grade turned out to be.

Thursday, February 16, 2006

Diet or no diet: a dilemma for the food industry

Today I read a very interesting paragraph in WSJ:

"Big food companies often are late-comers to diet fads, which tend to bubble up through popular books and personal recommendations. Food makers, are, by their nature, predisposed to want people to eat more, as embodied by the classic Lay's potato-chip slogan, 'Bet you can't just eat one!'

But given Americans' obsessions with their waistlines, diet foods are one of the faster growing areas of the otherwise slack food business. As a result, companies from Nestle SA to Unilever and Kraft Foods Inc. are trying to get ahead of the game by creating their food fad. They have experimented with special starches, new types of fiber and a process that occurs in the small intestine called the 'ileal brake mechanism.' The goal: Create products that dieters will buy more of in order to eat less."

Monday, February 13, 2006

The battle between histograms and tables

In the guideline for contributors of a biomedical journal:
"Histograms should not be used to present data that can be captured easily in text or small tables, as they take up much more space."

Thursday, February 09, 2006

An example on my efforts in research

Today, in order to change a manuscript of mine to the format required by the journal, I spent more than two hours with LaTeX trying to figure it out.

Tuesday, February 07, 2006

How many of your former students did you meet in the GYM today?

Today I went to the Dodge gym of Columbia, during the day. This is one of the benefits I have earned by finishing all my teaching last semester. I believe it has been years since the last time I went to the gym, in the middle of a day, on a weekday, during an academic year.

I think I definitely saw at least four of my former students. That reminded me of one time when I said one can estimate how many students one has taught by the number of former students one meets on a brief walk on campus during a weekday or in a local restaurant during the weekend. Of course, such an estimate asks for extra information, such as how many students actually walk on campus and how many students go to local restaurants during the weekend.

For me, those numbers have been greater than zero almost with probability one. :) But four on a brief visit to the GYM?! That surprised me. Then I remembered the fact that a lot of former w1111 project groups wanted to do their data collection in the GYM, until at one point I "banned" that topic.

Oh, I thought, so they did go to the gym themselves.

In the unforeseeable future, when I have time, I will spend a fixed amount of time at different locations on campus, on the same grid of times of the day, on the same selection of days of the week, to see how the number of former students met varies across campus. Variables to control are estimated traffic flow and outdoor weather.

Monday, February 06, 2006

Recent Papers on HapMap project

  1. HapMap project website
  2. (10/28/2005) McVean G, CCA Spencer, R Chaix (2005) Perspectives on human genetic variation from the HapMap project. PLoS Genetics 1:e54
  3. (10/24/2005) Perlegen Scientists Genotype 4.6 million SNPs in Phase 2 of the HapMap Project Using Array-based Technologies. Perlegen's David Cox discusses [an easy-to-read reality check of the HapMap project]
  4. (05/10/2001) Reich et al. (2001) Linkage disequilibrium in the human genome. Nature 411:199-204. [Empirical results on the inter-marker LD patterns w.r.t. physical distances]
  5. (09/20/1995) Risch, N and B Devlin (1995) A comparison of linkage disequilibrium measures for fine-scale mapping. Genomics. 29:311-22.

It IS harder to get research support from federal agencies

From Columbia’s FY 2005 Externally Sponsored Research Report
"Competing sponsored project awards decreased approximately 17% from Fiscal Year 2004 ... Awards from federal sources decreased by 20%; non-federal awards decreased by more than 1%."

Friday, February 03, 2006

Restaurant inspection results from NYC Dept. Health

It is not that we will stop going to our favorite dining locales because of these results. But it is good to know exactly where they have problems.

Accepted genetically

During a meeting on Wednesday, one collaborator of mine happily reported that she was expecting a baby. We all congratulated her and discussed how it was a good time to have a baby (year of the dog, good luck, etc.). She also mentioned she had been going through all kinds of genetic testing and was waiting for whole-genome microarray results, for deletions and insertions and such. We were joking about how this baby has been on the frontiers of research in her (it's a girl) mom's field, even before her birth. My collaborator joked that she would tell her baby years later that "you went through all kinds of tests before you were finally accepted!".