This paper was motivated by our long-standing observation that variables selected using significance tests often (though not always) perform poorly in prediction. This is not news to many applied scientists we have spoken to, but the reason for the phenomenon has not been explicitly discussed. Incidentally, we have also observed in our own research that variables selected using our I score seem to do better.
In the course of writing this paper, we came to realize how fundamentally different significance and predictivity are. For example, a set of variables has an innate predictivity (its power to discern between two classes) that does not depend on any observed data, whereas significance is a notion applied to sample statistics. As sample sizes grow, most sample statistics relevant to the population quantities that determine predictivity will eventually correlate well with predictivity. With finite samples, however, significant sets and predictive sets do not entirely overlap.
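To make the contrast concrete (the notation here is ours and purely illustrative, not the paper's): suppose a variable set induces a discrete partition with cell probabilities $p_{1j}$ among cases and $p_{0j}$ among controls, and suppose the two classes are balanced. The best attainable correct-classification rate is then the population quantity
\[
\theta \;=\; \frac{1}{2} \;+\; \frac{1}{4}\sum_{j}\bigl|\,p_{1j}-p_{0j}\,\bigr|,
\]
which involves no data at all. A chi-square or similar test statistic, by contrast, is a function of the observed cell counts, and its significance depends on the sample size as much as on the underlying departure.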
In our project, we tried various simulations to illustrate and understand this disparity, some of which are too complicated to include in a paper intended to be accessible. For example, using a designed scenario, we managed to recreate the Venn diagram above through simulation.
When evaluating variables for prediction from observed data, we inevitably have to rely on certain sample statistics. In the research presented in our paper, we have found that certain prediction-oriented statistics correlate better with the level of predictivity, while test statistics may not. This is because test statistics are typically designed to have power to detect ANY departure from the null hypothesis, but only certain departures lead to substantial improvement in predictive performance.
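To illustrate with a sketch of our own (not a simulation reported in the paper), consider two hypothetical variable sets, each inducing an eight-cell partition. The two departures from the null are constructed to have roughly the same squared (L2) size, so their chi-square statistics are comparably significant on average, but different absolute (L1) size, so the attainable classification accuracy differs. A minimal Python simulation under these assumptions:

    import numpy as np
    from scipy.stats import chi2_contingency

    rng = np.random.default_rng(0)

    def make_scenario(d):
        """Return (p, q): cell probabilities for cases (p = q + d) and controls (uniform q)."""
        k = len(d)
        q = np.full(k, 1.0 / k)
        return q + np.asarray(d), q

    # Hypothetical departures with (roughly) equal L2 size but different L1 size.
    delta, eps = 0.06, 0.03
    p_A, q_A = make_scenario([+delta, -delta, 0, 0, 0, 0, 0, 0])  # concentrated in two cells
    p_B, q_B = make_scenario([+eps, -eps] * 4)                    # spread over all eight cells

    def bayes_accuracy(p, q):
        """Population correct-classification rate with equal class priors."""
        return 0.5 + 0.25 * np.abs(p - q).sum()

    def simulate(p, q, n=2000, reps=500):
        """Average chi-square statistic and out-of-sample accuracy of the plug-in rule."""
        chi_stats, accs = [], []
        for _ in range(reps):
            cases = rng.multinomial(n, p)
            ctrls = rng.multinomial(n, q)
            chi_stats.append(chi2_contingency(np.vstack([cases, ctrls]))[0])
            # Plug-in rule: predict "case" in cells where observed cases outnumber controls.
            predict_case = cases > ctrls
            # Exact out-of-sample accuracy of that rule, from the true cell probabilities.
            accs.append(0.5 * (p[predict_case].sum() + q[~predict_case].sum()))
        return np.mean(chi_stats), np.mean(accs)

    for name, (p, q) in {"A (concentrated)": (p_A, q_A), "B (spread)": (p_B, q_B)}.items():
        chi, acc = simulate(p, q)
        print(f"{name}: mean chi-square = {chi:5.1f}, "
              f"Bayes accuracy = {bayes_accuracy(p, q):.3f}, mean out-of-sample accuracy = {acc:.3f}")

By construction, the two scenarios are comparably significant on average, while scenario B attains a higher best-possible accuracy than scenario A (0.56 versus 0.53 under these cell probabilities): equal significance, unequal predictivity.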