Congratulations to our PhD student, Xingzhi Wang, for successfully presenting his dissertation!
The thesis title is: Variable screening for high-dimensional data via Pearson's chi-square statistics
The current era of data abundance presents a new challenge: the number of
predictors can now reach thousands or even millions. This is particularly prevalent
in medical studies involving gene expression patterns or MRI-based morphology.
Variable/feature screening is a powerful tool for reducing the massive number
of predictors to a manageable scale, often smaller than the sample size. We derive
conditions that guarantee the validity of the Pearson chi-square statistic for
variable screening, first for the case where both the response and the predictor
variables are categorical, and subsequently for other data types, including
continuous variables.
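As a rough illustration of marginal screening with the Pearson chi-square statistic in the all-categorical case, the sketch below computes the statistic between each predictor and the response and keeps the top-ranked predictors. It is a minimal sketch, not the thesis's procedure: the function names `pearson_chi2` and `screen`, the `keep` parameter, and the choice to rank by the raw statistic are assumptions made for illustration only.

```python
import numpy as np

def pearson_chi2(x, y):
    """Pearson chi-square statistic for two categorical arrays of equal length."""
    # Map the observed labels to integer codes 0..R-1 and 0..C-1.
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    # Build the R x C contingency table of observed counts.
    table = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(table, (xi, yi), 1)
    # Expected counts under independence: (row total * column total) / n.
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())

def screen(X, y, keep):
    """Rank the columns of X by their chi-square statistic with y.

    Hypothetical helper: returns the indices of the `keep` highest-ranked
    columns and the full vector of statistics. Ranking by the raw statistic
    is an illustrative assumption; it ignores differing numbers of categories.
    """
    stats = np.array([pearson_chi2(X[:, j], y) for j in range(X.shape[1])])
    order = np.argsort(-stats)
    return order[:keep], stats
```

For example, with `X` an n-by-p array of category codes and `y` a length-n array of class labels, `screen(X, y, keep=20)` would return the indices of the 20 predictors most strongly associated with `y` under this ranking.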
Our theoretical contributions in categorical feature screening are twofold:
First, we derive conditions for controlling the false discovery rate through Bayesian
model averaging. Second, we establish uniform consistency of the screening
statistic at a certain rate, which holds even when the number of categories
increases and some categories have nearly zero probability.
Leveraging these theoretical results and data-based binning, we extend the
categorical-based screening method to non-categorical variables, including the
continuous variables commonly found in practice.
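To make the idea of data-based binning concrete, the sketch below discretizes continuous variables into equal-frequency bins and then computes the Pearson chi-square statistic on the binned data. This is only a sketch under stated assumptions: equal-frequency (quantile) binning, the helper names `quantile_bin` and `binned_chi2`, and a user-supplied `n_bins` are all illustrative choices, and the number of bins here is not derived from the thesis's formulae.

```python
import numpy as np

def quantile_bin(v, n_bins):
    """Assign each value to one of n_bins equal-frequency bins (codes 0..n_bins-1)."""
    v = np.asarray(v, dtype=float)
    # Interior cut points at the empirical quantiles of the data.
    cuts = np.quantile(v, np.linspace(0, 1, n_bins + 1)[1:-1])
    return np.searchsorted(cuts, v, side="right")

def binned_chi2(x, y, n_bins):
    """Pearson chi-square statistic after quantile-binning both variables.

    Hypothetical helper: n_bins is chosen by the caller for illustration.
    """
    bx = quantile_bin(x, n_bins)
    by = quantile_bin(y, n_bins)
    table = np.zeros((n_bins, n_bins))
    np.add.at(table, (bx, by), 1)
    # Drop bins left empty by ties so that expected counts stay positive.
    row_keep = table.sum(axis=1) > 0
    col_keep = table.sum(axis=0) > 0
    table = table[row_keep][:, col_keep]
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / table.sum()
    return float(((table - expected) ** 2 / expected).sum())
```

Because the bins are formed from the data, a nonlinear dependence such as y depending on the square of x can still produce a large statistic even when the linear correlation is close to zero.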
We present findings from extensive numerical studies contrasting the proposed
methods with existing methods. We complement these findings with real applications
to illustrate the practical implications of our work.
For continuous-domain regression or discrete-domain classification, variable/feature screening is an effective approach and, to a certain extent, even computationally indispensable when the number of predictor variables, p, is extremely large compared with n, the number of data cases. We first consider the situation where both the response and the predictor variables are categorical, and then extend the proposed marginal utility statistic beyond the categorical setting, which further expands its applications in diverse fields, including economics and medical studies. For categorical feature screening, we adopt the framework introduced by Guo et al. (2022) to screen categorical variables via Pearson's chi-square statistic. We derive a set of sufficient conditions for controlling the false discovery rate of the proposed method in the ultrahigh-dimensional setting. Our theoretical innovations are twofold: (i) we apply Bayesian model averaging to obtain the desired false discovery rate under the exchangeability assumption, and (ii) we do so by establishing the uniform consistency of Pearson's chi-square statistics at a certain rate that allows for a diverging number of categories and some cells with nearly zero cell probabilities.
Furthermore, equipped with the idea of data binning, we widen the applicability of Pearson's chi-square statistic beyond random variables with finite support by creating a novel population-level dependency measure, which we call the Pearson's chi-square divergence, and whose usefulness may carry over to multivariate variables or even more exotic random processes. Our main theoretical developments focus on (i) elaborating the conditions that guarantee the uniform consistency of Pearson's chi-square statistics under data-based binning to the Pearson's chi-square divergence, and (ii) offering some closed-form formulae, in terms of the sample size, for the number of bins that could be used empirically for certain classes of marginal distributions of the response and the covariates.
Committee Chair: Kung-Sik Chan
Committee Members: Montserrat Fuentes, Joseph B Lang, Boxiang Wang