Tuesday, November 13, 2007

Temp Directory and Working Directory of R

For a while, I have been using the setwd() command at the beginning of a project's code. This lets me save data, graphs, and output to the project's results folder without typing long path names.
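A minimal sketch of that habit is below. The folder name is made up for illustration (it lives under tempdir() so the example leaves nothing behind), and the file names are hypothetical too.

```r
# Hypothetical project results folder; in practice this would be a real path.
proj.dir <- file.path(tempdir(), "my_project", "results")
dir.create(proj.dir, recursive = TRUE, showWarnings = FALSE)

old.wd <- setwd(proj.dir)        # setwd() returns the previous working directory

# From here on, short relative file names land in the results folder.
write.csv(head(cars), "cars_head.csv", row.names = FALSE)
pdf("speed_vs_dist.pdf")
plot(cars)
invisible(dev.off())

setwd(old.wd)                    # restore the original working directory
```

Saving the old directory and restoring it at the end keeps the script from changing where later code runs.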

Yesterday, I was using a contributed package in R that requires access to R's temp directory. The exact location of this folder can be found with the command tempdir(). It just happened that the current location of this folder was not ideal for the task at hand and I needed to change it. There was no command in R to change it. I figured out a way to do it: add an entry named TMPDIR to the system's list of environment variables and set it to the desired folder. R reads this variable when it starts, so the change takes effect in the next R session.
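A quick way to check what a session is using is sketched below. It only inspects the current settings; the path in the comment is a placeholder, not a recommendation.

```r
# Where is this session's temp directory?
td <- tempdir()
print(td)

# The environment variables R consults at startup, in this order.
# unset = NA makes it obvious which ones are not defined.
vars <- Sys.getenv(c("TMPDIR", "TMP", "TEMP"), unset = NA)
print(vars)

# The location is fixed when R starts, so TMPDIR has to be set before
# launching R, e.g. from a Unix shell (placeholder path):
#   TMPDIR=/path/to/desired/folder R
```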

Monday, November 12, 2007

Useful websites for graphics in R

R graphical manuals (a nice collection of examples and code)

Code from the book "R Graphics"

Friday, November 02, 2007

Fast permuting r by c two-way tables

Today, I needed to permute many two-way tables of the same dimension (say 2 by 3) to carry out a test of independence. My data look like a big matrix with each row corresponding to a two-way table (arranged by row). The permutation simply draws a random 2 by 3 table under independence between the two dimensions, conditional on the observed marginal totals. I didn't find (or didn't have time to find) a function to do that in R. Therefore I wrote the following code:

# The function name was lost from this post; perm.table is a placeholder.
perm.table<-function(x){

# one row label and one column label per observation
x.r<-rep(1:nrow(x), times=rowSums(x))

x.c<-rep(1:ncol(x), times=colSums(x))

# every (row, column) pair once, so table() keeps empty cells; subtracted again below
temp.mat<-expand.grid(1:nrow(x), 1:ncol(x))

return(table(c(x.r, temp.mat[,1]), c(sample(x.c),temp.mat[,2])) -matrix(1, nrow(x), ncol(x)))

}

I also wrote a function that does the permutation for each row of my huge
matrix and used the apply() function to speed things up. It works pretty well
for me.
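A rough sketch of what such a per-row wrapper could look like is below. It is not the original code: the name perm.row and the toy data are made up, and instead of the function above it uses r2dtable() from base R's stats package, which draws random two-way tables with given row and column totals.

```r
# Permute one flattened table. v holds the counts of an nr x nc table,
# stored by row. perm.row, nr and nc are illustrative names.
perm.row <- function(v, nr, nc) {
  tab <- matrix(v, nrow = nr, ncol = nc, byrow = TRUE)   # rebuild the table
  # draw one random table with the same row and column totals
  new.tab <- r2dtable(1, rowSums(tab), colSums(tab))[[1]]
  as.vector(t(new.tab))                                   # flatten by row again
}

set.seed(1)
# Toy data: two 2 x 3 tables, one per row.
dat <- rbind(c(3, 1, 2, 0, 4, 5),
             c(2, 2, 2, 1, 1, 1))

# apply() returns one column per input row, hence the t().
perm <- t(apply(dat, 1, perm.row, nr = 2, nc = 3))
```

Each row of perm is a permuted table with the same margins as the matching row of dat, so a test statistic can be computed row by row and compared with the observed one.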

Wednesday, October 24, 2007

Tropical infectious diseases moving north

Today, CNN reported that some experts argue global warming has several epidemiological consequences. One of them is that tropical diseases are moving north. For example, malaria cases were reported in New Jersey.

I find this claim a little far-fetched. According to scientists, the temperature change due to global warming is about 1.33 ± 0.32 °F over the last 100 years. The difference between the average temperature of New Jersey and that of a tropical region is probably much greater than that. To me, one possible reason why a single disease is moving north could be a genetic mutation in the virus strain that increases cold resistance. And more frequent air travel can probably explain the trend of multiple diseases moving north.

Until we have more data, one cannot conclude that global warming is the cause and ignore other possible, equally threatening causes.

Monday, September 17, 2007

Probing genetic overlap among complex human phenotypes

Andrey Rzhetsky (a collaborator of mine) called me today. Among the many things we discussed, he told me that our recent paper on estimating genetic overlap from phenotype data (time at diagnosis) had attracted quite a bit of attention. It was the 24th most read paper in PNAS for the month of July 2007. It was also covered in MIT Technology Review and the Wired Science blog, among many other, more biomedically oriented sites. Besides nice graphics, the paper used time-to-event models to estimate potential genetic overlap between human disorders.

Sunday, September 16, 2007

Correction: Chairman Mao didn't say everything after all.

As Li pointed out, I remembered it wrong. I never aced any of these courses in my school days. BUT, I should have GOOGLED!!! :) That saying was published on May 11th, 1978. However, it was not proposed by Deng Xiaoping, despite the common perception. Deng only publicized it a lot. This is again a good example of Stigler's law of eponymy.

Friday, September 07, 2007

Chairman Mao was right!

I did not grow up in a generation fully influenced by Mao Zedong. Actually, he passed away several days after I was born. However, we still got to study quite a number of his quotes. One of them is "practice is the only way to validate the truth" (pardon my unpolished translation). 实践是检验真理的唯一标准。I had never given this quote much thought before yesterday.

Yesterday, I received an email regarding a computational biology challenge. The contestants are supposed to analyze the data and discover the gold standard genes or gene-gene interactions hidden in the data. This is a blinded challenge and the results will be announced during the conference. A winner will be named.

I was intrigued. For a couple of the challenges, I thought of some ideas. I thought my methods might actually work well. "Shall I enter this contest then?" I asked myself. Then I suddenly worried: what if the data do not agree with the assumptions of my method? "That would really be a problem!" I thought, "and I might end up at the bottom of the contest." But then it hit me: it would actually be nice if the method didn't work and we understood which assumptions were wrong. How else can we learn about biological systems if we don't fail?

This reminded me of my recent struggle with a simulation study. Part of the phenomenon I observed did not agree with my intuition. At first, I kept debugging my code. But the code was so short that I was finally convinced that what I observed was a true property of the statistic I was evaluating when the dimension is too large.

Often, I judge the validity of a method first by intuition. These recent experiences show that knowledge can actually arise when intuition fails. And intuition is just knowledge based on past failed judgements. Therefore, Chairman Mao was right about one thing: if we don't just do it and fail, we can't learn the truth. Sometimes, it is necessary to make fools of ourselves for the greater good, namely better knowledge for mankind. :) Or we can just have one more reason to feel less grumpy whenever we fail.

Friday, May 25, 2007

I want a PlayStation 3 ... for research

Next time I visit some of my friends, I will pay more attention to their little black gaming devices, called PS3. It turns out the best computer processor of our times is used in it. It is the Cell Chip, developed jointly by IBM, SONY and Toshiba. According to some very scientific study comparing the Cell Chip with conventionally designed chipsets, performance at comparable clock speeds improves severalfold. Compared to the tens of thousands of dollars for the IBM blade system that uses the Cell Chip, the $600 price tag of the PS3 seems to be a better option for academic research.

Of course, it is not as simple as it sounds. The reason why the Cell Chip can be faster for high performance computing is its well-designed structure. I am not equipped with the right knowledge to explain this. It is said to be an IBM PowerPC processor with eight vector processors. I can only assume this means better parallel capacity and better efficiency in dealing with high volume computation. Therefore, it takes *VERY* professional programming to take advantage of its power. Until R is programmed to take advantage of such chips, I will keep my eyes on those Core Duo Quad chips.

Tuesday, May 22, 2007


I came across this website that allows you to organize citations online. I think it is pretty neat. In particular, it shows the BibTeX format right on the screen.

Monday, May 14, 2007

Mother's day spending

Men are more likely than women to spend money on Mother's Day (in both percentage and amount). This is a fact, but it is also a nice example of "confounding" for intro stats. I heard about this in the news (people were joking about this as a "mom's boy" phenomenon) but can't find that exact news article today. I found the data on Mother's Day 2005 (a pretty neat collection of numbers). Here is the interesting part.

Celebrating Mother's day (i.e., buy gifts ... )

All 83%
Men 85%
Women 81.1%

Who do you plan to buy a Mother's Day gift for this year? (Check all that apply)

                        All      Men      Women
Mother or Stepmother    65.2%    64.9%    65.3%
(label missing)         19.4%    39.8%    ---
(label missing)          8.4%     4.9%    11.6%
(label missing)          7.4%     6.6%     8.0%
(label missing)          5.3%     3.9%     6.6%
(label missing)          5.7%     4.0%     7.2%
(label missing)          1.1%     0.8%     1.3%
Other relative          10.7%     5.5%    15.4%

Wednesday, May 09, 2007

Divided opinion?

Anderson Cooper 360 is a show on CNN that gives in-depth coverage of current events. In one of its recent airings, Anderson Cooper discussed a question raised during the recent Republican presidential debate: "do you believe in evolution?" One of the candidates added after the poll that he believed in evolution and also believed in God when watching things such as the sunset at the Grand Canyon. Part of the results from a survey was then shown on the screen:
    • 48% believed in God
    • 13% believed in Evolution (God is not involved)

It was then concluded that we are living in a divided country.

I was confused ("confused" is the most used word during my office hours) by how the options were organized. Shouldn't the options be:

    1. believed in God (no evolution)
    2. believed in God (evolution exists as part of God's plan)
    3. believed in Evolution (God is involved)
    4. believed in Evolution (There is God but God is not involved)
    5. believed in Evolution (There is no God)

The country may turn out not to be very divided at all, with most of the percentage split among the middle options and only small percentages holding the two extreme views.

Monday, April 30, 2007

Ten Simple Rules for Making Good Oral Presentations

PLoS Computational Biology's "Ten Simple Rules" series offers pretty nice guidelines, including:
  1. Bourne PE (2005) Ten simple rules for getting published. PLoS Comp Biol 1: e57.
  2. Bourne PE, Chalupa LM (2006) Ten simple rules for getting grants. PLoS Comp Biol 2: e12.
  3. Bourne PE, Korngreen A (2006) Ten simple rules for reviewers. PLoS Comp Biol 2: e110.
  4. Bourne PE, Friedberg I (2006) Ten simple rules for selecting a postdoctoral fellowship. PLoS Comp Biol 2: e121.
  5. Vicens Q, Bourne PE (2007) Ten simple rules for a successful collaboration. PLoS Comp Biol 3: e44.

The journal has just published a new set of rules for oral presentations.

Sunday, April 29, 2007

10 most common passwords

I was frustrated with my little Lexmark printer for a while today. First, I couldn't print to it from my desktop (which I finally fixed by re-installing the driver). Second, I totally forgot the PIN I set for its security. I tried everything I could think of for a 4-digit numeric PIN and realized how creative I must have been when setting it up and how lacking in imagination I am now. I don't know whether anyone has done a survey on how many passwords (for different purposes) we are asked for in a day, a week or a month. I bet the number won't be small.

Below are the 10 most common passwords published in PC Magazine on April 18, 2007. I guess I should just be happy that I am not using any of them.
  1. password
  2. 123456
  3. qwerty (look at your keyboard; I wonder why there is no right-hand version)
  4. abc123
  5. letmein
  6. monkey
  7. myspace1
  8. password1
  9. link182
  10. (your first name)

Saturday, April 28, 2007

"hoaxing, forging, trimming and cooking"

These are several ways of misusing data in science according to Charles Babbage, as mentioned in the following article.

"Deception and dishonesty with data: fraud in science" by David Hand
Significance, March 2007, pages 22-25

The part of this article I like most is where the author discusses that, despite the idealized image of scientists in the minds of the public, scientists are just humans who are under pressure to be productive and to compete in order to survive in the academic world. The author also observes that scientific fraud differs from other frauds, such as banking fraud, in that the perpetrators don't set out to be dishonest at the beginning. It is also true that some accusations of fraud in science are debatable, especially in the gray area of "trimming" unreliable data.