Thesis title: Variable Screening for high-dimensional data via Pearson's chi-square statistics

The current era of data abundance presents a new challenge: the number of predictors can now reach thousands or even millions. This is particularly prevalent in medical studies involving gene expression patterns or MRI-based morphology. Variable (feature) screening is a powerful tool for reducing the massive number of predictors to a manageable scale, often smaller than the sample size. We derive conditions that guarantee the validity of the Pearson chi-square statistic for variable screening, first when both the response and the predictor variables are categorical, and then for other data types, including continuous variables. Our theoretical contributions in categorical feature screening are twofold. First, we derive conditions for controlling the false discovery rate through Bayesian model averaging. Second, we establish uniform consistency of the screening statistic at a rate that holds even when the number of categories increases and some categories have nearly zero probability. Building on these theoretical results and data-based binning, we extend the categorical screening method to non-categorical variables, including the continuous variables commonly found in practice. We present findings from extensive numerical studies contrasting the proposed methods with existing methods, and we complement these findings with real applications that illustrate the practical implications of our work.
Abstract: For continuous-domain regression or discrete-domain classification, variable (feature) screening is an effective approach and, to a certain extent, computationally indispensable when the number of predictor variables, p, is extremely large compared with n, the number of data cases. We first consider the situation where both the response and the predictor variables are categorical, and then extend the proposed marginal utility statistic beyond the categorical setting, which broadens its applicability in diverse fields such as economics and medical studies. For categorical feature screening, we adopt the framework introduced by Guo et al. (2022) for screening categorical variables via Pearson's chi-square statistic. We derive a set of sufficient conditions for controlling the false discovery rate of the proposed method in the ultrahigh-dimensional setting. Our theoretical innovations are twofold: (i) we apply Bayesian model averaging to obtain the desired false discovery rate under the exchangeability assumption, and (ii) we do so by establishing the uniform consistency of Pearson's chi-square statistics at a rate that allows for a diverging number of categories and for some cells with nearly zero cell probabilities. Furthermore, using the idea of data binning, we widen the applicability of Pearson's chi-square statistic beyond random variables with finite support by introducing a novel population-level dependency measure, which we call the Pearson's chi-square divergence, and which may extend to multivariate variables or even more exotic random processes.
Our main theoretical developments focus on (i) spelling out the conditions that guarantee the uniform consistency of Pearson's chi-square statistics under data-based binning to the Pearson's chi-square divergence, and (ii) offering some closed-form formulae, in terms of the sample size, for the number of bins that could be used empirically for certain classes of marginal distributions of the response and the covariates.
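As a rough illustration of the screening idea described above, the following minimal Python sketch ranks predictors by the textbook Pearson chi-square statistic of their contingency table with a categorical response, after equal-frequency binning of continuous predictors. It is not the thesis's procedure: the function names (`pearson_chi_square`, `quantile_bin`, `screen`), the rank-based binning rule, the fixed number of bins, the top-k cutoff, and the synthetic data are all assumptions made for illustration. In particular, the closed-form bin-number formulae and the false discovery rate control mentioned in the abstract are not reproduced here.

```python
import random
from collections import Counter


def pearson_chi_square(x, y):
    """Textbook Pearson chi-square statistic for two sequences of labels."""
    n = len(x)
    joint = Counter(zip(x, y))   # observed cell counts
    row = Counter(x)             # marginal counts of x
    col = Counter(y)             # marginal counts of y
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n   # count expected under independence
            observed = joint.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def quantile_bin(values, n_bins):
    """Assign each value to one of n_bins roughly equal-frequency bins by rank.

    Illustrative assumption: tied values may fall into different bins.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    labels = [0] * n
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // n
    return labels


def screen(columns, y, n_bins, keep):
    """Score every predictor column marginally and return the top `keep` indices."""
    scores = [pearson_chi_square(quantile_bin(col, n_bins), y) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return ranked[:keep], scores


# Synthetic example (hypothetical data): only column 0 depends on the response.
random.seed(0)
n, p = 400, 50
y = [random.randint(0, 1) for _ in range(n)]
columns = [[2.0 * y[i] + random.gauss(0.0, 1.0) for i in range(n)]]
columns += [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(p - 1)]

selected, scores = screen(columns, y, n_bins=4, keep=5)
print("selected predictor indices:", selected)
```

In this sketch the number of bins is a fixed, hand-picked value. In practice it would be chosen as a function of the sample size, which is the question the closed-form formulae in the thesis address.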
Committee Chair: Kung-Sik Chan
Committee Members: Montserrat Fuentes, Joseph B Lang, Boxiang Wang