Min Yang - Colloquium Speaker

Professor of Statistics, Department of Mathematics, Statistics, and Computer Science, University of Illinois-Chicago
Thursday, September 15, 2022 - 3:15pm
Colloquium Title: 
Information-Based Optimal Subdata Selection for Massive and Heterogeneous Data


Extraordinary amounts of data offer us unprecedented opportunities for scientific discovery and advancement. At the same time, analyzing big data presents unprecedented challenges due not only to the volume of data, but also to its variety and complexity, as well as the speed with which it must be analyzed. A critical question for the statistics community is how to detect statistical relationships within high volumes of data with complicated structure and turn them into actionable knowledge. Classical statistical models such as linear models or generalized linear models are powerful tools when relationships between the input and output variables are homogeneous, but they can be inadequate when a dataset contains heterogeneous patterns, as is often the case with big data. One strategy to address the heterogeneity issue is the Mixtures-of-Experts (ME) modeling approach. Its advantage is that it is flexible enough to be combined with a variety of different models. Unfortunately, its computing resource requirements are extremely demanding, making the computation formidable for large datasets. In this talk, I will discuss an optimal subdata selection strategy that preserves maximum information while demanding few computing resources. The proposed algorithm has two notable properties. First, under a clusterwise linear regression model, it can be shown that its statistical efficiency is asymptotically optimal, i.e., no other method has better statistical efficiency than the proposed one when the size of the full data is large. Second, as the size of the full data increases, the selected subdata preserve the rich information even when the subdata size is fixed.


