2021 Hogg and Craig Lecturer is Bin Yu

The Class of 1936 Second Chair in the College of Letters and Science, and Chancellor's Distinguished Professor, Departments of Statistics and of Electrical Engineering & Computer Sciences, University of California at Berkeley
Thursday, April 15, 2021 to Friday, April 16, 2021

Dr. Bin Yu from the University of California at Berkeley will be our 48th Hogg and Craig Lecturer.

Early in the 1969-70 academic year, Professor Allen T. Craig announced his retirement. He gave a retirement talk in January 1970. Under the leadership of Craig’s student and co-author, Professor Robert V. Hogg, the department decided to establish a lecture series to honor Professor Craig, and that January 1970 talk became the first in the series. When Professor Hogg passed away in 2014 at the age of 90, the department decided to incorporate his name into the lecture series.


Bin Yu

Bin Yu is Chancellor’s Distinguished Professor and Class of 1936 Second Chair in the departments of Statistics and EECS at UC Berkeley. She leads the Yu Group, which consists of 15-20 students and postdocs from Statistics and EECS. She was formally trained as a statistician, but her research extends beyond the realm of statistics. Together with her group, she has leveraged new computational developments to solve important scientific problems by combining novel statistical machine learning approaches with the domain expertise of her many collaborators in neuroscience, genomics and precision medicine. She and her team also develop theory to understand random forests and deep learning, both for insight into these methods and for guidance in practice. Dr. Yu is a member of the U.S. National Academy of Sciences and of the American Academy of Arts and Sciences. She is Past President of the Institute of Mathematical Statistics (IMS), a Guggenheim Fellow, Tukey Memorial Lecturer of the Bernoulli Society, Rietz Lecturer of IMS, and a COPSS E. L. Scott prize winner. She serves on the editorial board of the Proceedings of the National Academy of Sciences (PNAS) and on the scientific advisory committee of the UK Turing Institute for Data Science and AI.

All meetings will be conducted via Zoom; all times are Central Time.

Day 1: Thursday, April 15, 2021

Please use this Zoom link for the reception, presentation and lecture, which are open to everyone:


3:30 PM – 4:30 PM         Reception and Presentation of Annual Student Awards (the presentation begins at 4:00 PM)

4:30 PM – 5:30 PM         Lecture #1:

Veridical Data Science: the practice of responsible data analysis and decision-making

"A.I. is like nuclear energy — both promising and dangerous" — Bill Gates, 2019.

Data Science is a pillar of A.I. and has driven most of the recent cutting-edge discoveries in biomedical research. In practice, Data Science has a life cycle (DSLC) that includes problem formulation, data collection, data cleaning, modeling, result interpretation and the drawing of conclusions. Human judgment calls are ubiquitous at every step of this process, e.g., in choosing data cleaning methods, predictive algorithms and data perturbations. Such judgment calls are often responsible for the "dangers" of A.I. To mitigate these dangers as far as possible, we developed a framework based on three core principles: Predictability, Computability and Stability (PCS). Through a workflow and documentation (in R Markdown or Jupyter Notebook) that allows one to manage the whole DSLC, the PCS framework unifies, streamlines and expands on the best practices of machine learning and statistics, bringing us a step forward towards veridical Data Science. We will illustrate the PCS framework in the modeling stage through the development of DeepTune images for characterization of neurons in the difficult V4 area of the visual cortex.

Day 2: Friday, April 16, 2021

Please use this Zoom link for the reception and lecture, which are open to everyone:


4:00 PM – 4:30 PM         Reception

4:30 PM – 5:30 PM         Lecture #2:

Iterative Random Forests (iRF) with applications to biomedical problems through epiTree for epistasis discovery

Genomics has revolutionized biology, enabling the interrogation of whole transcriptomes, genome-wide binding sites for proteins, and many other molecular processes. However, individual genomic assays measure elements that interact in vivo as components of larger molecular machines. Understanding how these high-order interactions drive gene expression presents a substantial statistical challenge. Building on random forests (RFs) and random intersection trees (RITs) and through extensive, biologically inspired simulations, we developed the iterative random forest algorithm (iRF) to seek predictable and stable high-order Boolean interactions. We demonstrate the utility of iRF for high-order Boolean interaction discovery in two prediction problems: enhancer activity in the early Drosophila embryo and the red hair phenotype using UK Biobank data. The latter is a proof-of-concept step towards suggesting gene variants behind cardiovascular phenotypes for single-cell experiments as part of a Chan Zuckerberg Biohub Intercampus Award to UC Berkeley, UCSF and Stanford. It also motivates the development of the epiTree pipeline for epistasis discovery. Finally, a connection is made between iRF and our PCS framework for veridical data science (PCS stands for predictability, computability and stability).