Congratulations to our PhD student, Xingzhi Wang, for successfully presenting his dissertation!

Tuesday, August 29, 2023 - 8:30am
Xingzhi Wang

The thesis title is: Variable screening for high-dimensional data via Pearson's chi-square statistics

The current era of data abundance presents a new challenge: the number of predictors can now reach thousands or even millions. This is particularly prevalent in medical studies involving gene expression patterns or MRI-based morphology.


Variable/feature screening is a powerful tool for reducing the massive number of predictors to a manageable scale, often smaller than the sample size. We derive conditions that guarantee the validity of using the Pearson chi-square statistic for variable screening, first for the case where both the response and the predictor variables are categorical, and subsequently for other data types, including continuous variables.


Our theoretical contributions in categorical feature screening are twofold. First, we derive conditions for controlling the false discovery rate through Bayesian model averaging. Second, we establish a rate of uniform consistency of the screening statistic that holds even with an increasing number of categories, some of which may have nearly zero probability.


Leveraging these theoretical results and data-based binning, we extend the categorical screening method to non-categorical variables, including the continuous variables commonly found in practice.


We present findings from extensive numerical studies contrasting the proposed methods with existing methods. We complement these findings with real applications to illustrate the practical implications of our work.


For continuous-domain regression or discrete-domain classification problems, variable/feature screening is an effective approach and, to a certain extent, even computationally indispensable when the number of predictor variables, p, is extremely large compared to n, the number of data cases. We first consider the situation where both the response and the predictor variables are categorical, and then extend the proposed marginal utility statistic beyond the categorical setting, which further expands its applications in diverse fields including economics and medical studies. For categorical feature screening, we adopt the framework introduced by Guo et al. (2022) to screen categorical variables via Pearson's chi-square statistic. We derive a set of sufficient conditions for controlling the false discovery rate of the proposed method in the ultrahigh-dimensional setting. Our theoretical innovations are twofold: (i) we apply Bayesian model averaging to obtain the desired false discovery rate under the exchangeability assumption, and (ii) we do so by establishing the uniform consistency of the Pearson's chi-square statistics at a rate that allows for a diverging number of categories and for some cells with nearly zero probabilities.
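As a rough illustration of marginal screening with Pearson's chi-square statistic, the sketch below scores each categorical predictor against a categorical response and keeps the highest-scoring ones. It is only a minimal example under stated assumptions: the function names, the toy data, and the top-`keep` selection rule are hypothetical, and the code does not reproduce the framework of Guo et al. (2022) or the false discovery rate procedure developed in the thesis.

```python
from collections import Counter


def pearson_chi2(x, y):
    """Pearson chi-square statistic for two categorical sequences of equal length."""
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    row = Counter(x)            # marginal counts of the predictor
    col = Counter(y)            # marginal counts of the response
    joint = Counter(zip(x, y))  # observed cell counts
    stat = 0.0
    for a, count_a in row.items():
        for b, count_b in col.items():
            expected = count_a * count_b / n  # expected count under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def screen(predictors, y, keep):
    """Rank predictors by their chi-square statistic; return the top `keep` names.

    The top-`keep` rule is an assumption made for illustration only.
    """
    scores = {name: pearson_chi2(values, y) for name, values in predictors.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


# Toy data (hypothetical): one predictor determines y, the other is unrelated.
y = [0, 0, 0, 0, 1, 1, 1, 1]
predictors = {
    "x_signal": ["a", "a", "a", "a", "b", "b", "b", "b"],
    "x_noise": ["a", "b", "a", "b", "a", "b", "a", "b"],
}
selected, scores = screen(predictors, y, keep=1)
print(selected, scores)
```

In this toy case the informative predictor receives a statistic of 8 and the unrelated one receives 0, so only `x_signal` is retained.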

Furthermore, using data binning, we widen the applicability of Pearson's chi-square statistic beyond random variables with finite support by introducing a novel population-level dependence measure, which we call the Pearson's chi-square divergence, and which may extend to multivariate variables or even more exotic random processes. Our main theoretical developments focus on: (i) establishing conditions that guarantee the uniform consistency of the Pearson's chi-square statistics under data-based binning to the Pearson's chi-square divergence, and (ii) offering closed-form formulae, in terms of the sample size, for the number of bins that can be used in practice for certain classes of marginal distributions of the response and the covariates.
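The binning idea can be sketched as follows: continuous variables are first cut into equal-frequency bins, and the chi-square statistic is then computed on the resulting bin labels. This is a hedged illustration only. The helper names, the simulated data, and in particular the default number of bins (the cube root of n, rounded) are placeholder assumptions and are not the closed-form formulae derived in the thesis.

```python
import random
from collections import Counter


def quantile_bin(values, n_bins):
    """Assign each value to one of n_bins equal-frequency bins by rank.

    Ties are split by sort order, which is harmless for continuous data.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // n
    return labels


def pearson_chi2(x, y):
    """Pearson chi-square statistic for two categorical sequences."""
    n = len(x)
    row, col, joint = Counter(x), Counter(y), Counter(zip(x, y))
    stat = 0.0
    for a, count_a in row.items():
        for b, count_b in col.items():
            expected = count_a * count_b / n  # expected count under independence
            stat += (joint.get((a, b), 0) - expected) ** 2 / expected
    return stat


def binned_chi2(x, y, n_bins=None):
    """Chi-square statistic after data-based (quantile) binning of x and y."""
    if n_bins is None:
        # Placeholder rule, NOT the thesis's formula for the number of bins.
        n_bins = max(2, round(len(x) ** (1 / 3)))
    return pearson_chi2(quantile_bin(x, n_bins), quantile_bin(y, n_bins))


# Simulated data (hypothetical): y depends on x_dep nonlinearly, not on x_ind.
random.seed(0)
n = 500
x_dep = [random.uniform(-1, 1) for _ in range(n)]
x_ind = [random.uniform(-1, 1) for _ in range(n)]
y = [v * v + 0.1 * random.gauss(0, 1) for v in x_dep]

stat_dep = binned_chi2(x_dep, y)
stat_ind = binned_chi2(x_ind, y)
print(stat_dep, stat_ind)
```

The relationship between `y` and `x_dep` is quadratic, so their linear correlation is close to zero, yet the binned statistic should still be far larger for `x_dep` than for the independent `x_ind`.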

Committee Chair:  Kung-Sik Chan

Committee Members:  Montserrat Fuentes, Joseph B Lang, Boxiang Wang