Research Highlights

Coarsened-Data Statistical Methods for Spatial Epidemiology

by Dale Zimmerman

Estimating spatial intensity and estimating spatial variation in relative risk are important inference problems in spatial epidemiologic studies. A standard data assimilation component of these studies is the assignment of a geocode, i.e., point-level spatial coordinates, to the address of each subject in the study population.

Unfortunately, when geocoding is performed by the standard automated method of street-segment matching to a georeferenced road file and subsequent interpolation, it is rarely completely successful.

Typically, 10% to 30% of the addresses in the study population, and even higher percentages in particular subgroups, fail to geocode, potentially leading to a selection bias, called geographic bias, and an inefficient analysis. Missing-data methods could be considered for analyzing such data; however, since there is almost always some geographic information coarser than a point (e.g. a zip code) observed for the addresses that fail to geocode, a coarsened-data analysis is more appropriate.


Recently I have been developing methodology for estimating spatial intensity and relative risk functions from coarsened geocoded data. Coarsened-data analyses based on this methodology have been shown to yield substantially better estimates than analyses restricted to the observations that geocode.

For example, using data from a rural health study in Iowa in which only 64% of rural addresses and 85% of non-rural addresses geocoded, but imprecise locational information was available for all addresses, I obtained a kernel-based intensity estimate using only the data that geocoded and a coarsened-data intensity estimate using all the data. Pointwise ratios of each of these two estimates to the complete-data kernel intensity estimate are displayed in the figure.

The coarsened-data estimate more closely approximates the complete-data estimate; in fact its integrated absolute error is less than half that of the incomplete-data estimate.
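The kind of comparison described above can be sketched in a few lines. The following is a minimal illustration only, assuming a Gaussian kernel, a fixed bandwidth, no edge correction, and a handful of made-up coordinates on the unit square; the function names are hypothetical. It computes ordinary kernel intensity estimates and the integrated absolute error between them, and it does not implement the coarsened-data estimator itself.

```python
import math

def kernel_intensity(points, grid, bandwidth):
    """Gaussian-kernel intensity estimate (no edge correction) at each grid location."""
    norm = 1.0 / (2.0 * math.pi * bandwidth ** 2)
    estimate = []
    for gx, gy in grid:
        total = 0.0
        for px, py in points:
            d2 = (gx - px) ** 2 + (gy - py) ** 2
            total += norm * math.exp(-d2 / (2.0 * bandwidth ** 2))
        estimate.append(total)
    return estimate

def integrated_absolute_error(est_a, est_b, cell_area):
    """Riemann-sum approximation of the integral of |est_a - est_b| over the grid."""
    return cell_area * sum(abs(a - b) for a, b in zip(est_a, est_b))

def make_grid(n, lo=0.0, hi=1.0):
    """Centers of an n-by-n grid of square cells on [lo, hi]^2, plus the cell area."""
    step = (hi - lo) / n
    centers = [lo + (i + 0.5) * step for i in range(n)]
    return [(x, y) for x in centers for y in centers], step * step

# Hypothetical toy data: five subject locations, of which only three geocode.
all_points = [(0.2, 0.3), (0.25, 0.35), (0.7, 0.8), (0.75, 0.7), (0.5, 0.5)]
geocoded_only = all_points[:3]

grid, cell_area = make_grid(20)
complete = kernel_intensity(all_points, grid, bandwidth=0.1)
incomplete = kernel_intensity(geocoded_only, grid, bandwidth=0.1)
iae = integrated_absolute_error(complete, incomplete, cell_area)
```

A coarsened-data estimate evaluated on the same grid could be passed to integrated_absolute_error in place of the incomplete-data estimate to make the comparison reported above.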

High-Dimensional Models and Microarray Data Analysis

by Jian Huang

Over the past decade, DNA microarray technology has attracted tremendous interest in basic science labs, clinical labs, and industry. Microarrays are capable of monitoring the expression of thousands of genes simultaneously and have many important applications in biological and biomedical research. They are used, for example, to characterize disease states, to determine the effects of certain treatments, and to examine the process of development. Microarrays are also increasingly used for identifying genes and genomic regions that increase the risk of common and complex diseases such as diabetes, heart disease, and autism.


While microarrays have become a routine tool in research, analysis of microarray data is challenging. A hallmark of microarray data is high dimensionality: a typical microarray study surveys thousands of genes or more, but the sample size is often at most in the hundreds. This is called a "large p, small n" problem in statistics, where p refers to the number of variables (genes) and n refers to the sample size (the number of subjects participating in the study). Standard methods are not applicable to such problems since they require that p be smaller than n. Two other important features of microarray data are sparsity and the presence of cluster structure. Sparsity is due to the fact that the number of genes important to a trait or disease is usually small. The task of finding such genes for a given trait can be formulated as a variable selection problem in statistical modeling. Cluster structure is present since genes in the same biological pathways or functional groups tend to be correlated. Incorporating such information in statistical modeling facilitates the identification of statistically and biologically significant patterns in the data.
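To make the variable selection formulation concrete, here is a minimal sketch of one standard sparse-regression technique, the lasso, fitted by cyclic coordinate descent on a toy design with more variables than observations. The lasso is used purely as a familiar example of penalized variable selection and should not be read as the specific method developed in this research; the data, the penalty level lam, and all function names are assumptions made for illustration.

```python
import math

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def objective(X, y, beta, lam):
    """Lasso criterion: (1 / (2n)) * ||y - X beta||^2 + lam * ||beta||_1."""
    n = len(X)
    rss = sum((y[i] - sum(X[i][j] * beta[j] for j in range(len(beta)))) ** 2
              for i in range(n))
    return rss / (2.0 * n) + lam * sum(abs(b) for b in beta)

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimize the lasso criterion by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at zero
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Scaled inner product of column j with the partial residual
            # (the residual with column j's current contribution added back).
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta

# Hypothetical toy data with p > n: 6 "subjects", 15 "genes", and a response
# that depends on only two of the genes (a sparse signal).
n, p = 6, 15
X = [[math.sin((i + 1) * (j + 1)) for j in range(p)] for i in range(n)]
y = [2.0 * X[i][0] - 3.0 * X[i][4] for i in range(n)]

beta_hat = lasso_coordinate_descent(X, y, lam=0.05)
selected = [j for j, b in enumerate(beta_hat) if b != 0.0]
```

Ordinary least squares has no unique solution here because p exceeds n, whereas the penalty forces many coefficients to exactly zero, so that the nonzero entries of beta_hat act as the selected variables. Which variables are selected depends on the penalty level, which in practice would be chosen by a procedure such as cross-validation.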

I have been working on approaches for correlating microarray data with a clinical outcome. These methods take into account the features described above. The focus is on developing variable selection methods for identifying genes and pathways that are associated with a disease, such as age-related macular degeneration, or with a disease-related quantitative trait, such as the survival time of lymphoma patients.

The image above (from "A Primer of Genome Science" by Greg Gibson and Spencer Muse, Sinauer Associates, 2002) is part of a cDNA microarray. Each pixel in the image represents part of the DNA sequence of a gene. Red pixels indicate genes with relatively higher expression in the treatment sample than in the mutant sample. The dendrogram on the left side indicates that genes tend to be clustered according to their expression across the samples; the one on the top suggests that samples can also be clustered using gene expression.

MRI Tissue Classification of the Human Brain

by Dai Feng and Luke Tierney

Magnetic Resonance Imaging (MRI) is an important non-invasive tool for understanding the structure and function of the human brain. One important task is to use MR images to identify the major tissue types, white matter (WM), gray matter (GM), and cerebrospinal fluid (CSF), within a particular subject's brain. This is valuable, for example, in detecting diseases, in preparing for surgery, and in aiding subsequent functional studies of the brain.

An MR image is based on a discretization of the viewing area into a 3-dimensional array of volume elements, or voxels. Typical images consist of a 256 x 256 x 256 array of one cubic millimeter voxels. Segmentation is usually based on a T1-weighted image providing one measurement for each voxel. The measurements contain some noise that is usually modeled as normally distributed and independent from voxel to voxel. A simple model views each voxel as homogeneous, belonging entirely to one of the three major tissue types; the measurements are thus normally distributed with means depending on the tissue types of their voxels. The tissue types are not known and need to be identified from the image. Since nearby volumes tend to be of the same tissue type, a Markov random field model can be used to capture the spatial similarity of voxels. A Markov chain Monte Carlo approach can be used to fit this model.
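The simple model just described can be sketched as a Gibbs sampler, one common Markov chain Monte Carlo scheme. The sketch below is illustrative only and makes several simplifying assumptions that are not taken from the text: a small 2-dimensional grid instead of a 3-dimensional image, a Potts-type Markov random field on the four nearest neighbors, and tissue means, noise standard deviation, and interaction strength beta that are fixed at made-up values rather than estimated. All names are hypothetical.

```python
import math
import random

def label_conditional(value, neighbor_labels, means, sd, beta):
    """Conditional distribution of one voxel's tissue label given its
    measurement and its neighbors' labels (Potts prior, normal noise)."""
    log_w = []
    for k, mu in enumerate(means):
        log_lik = -0.5 * ((value - mu) / sd) ** 2
        agree = sum(1 for lab in neighbor_labels if lab == k)
        log_w.append(log_lik + beta * agree)
    m = max(log_w)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    return [v / total for v in w]

def gibbs_sweep(image, labels, means, sd, beta, rng):
    """Update every voxel label once, in place, from its full conditional."""
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            nbrs = [labels[rr][cc]
                    for rr, cc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1))
                    if 0 <= rr < rows and 0 <= cc < cols]
            probs = label_conditional(image[r][c], nbrs, means, sd, beta)
            labels[r][c] = rng.choices(range(len(means)), weights=probs)[0]
    return labels

# Hypothetical toy example: labels 0, 1, 2 stand for CSF, GM, WM.
rng = random.Random(0)
means, sd, beta = [30.0, 80.0, 120.0], 10.0, 0.8
truth = [[0] * 3 + [1] * 3 + [2] * 3 for _ in range(6)]
image = [[rng.gauss(means[k], sd) for k in row] for row in truth]
labels = [[rng.randrange(3) for _ in row] for row in truth]
for _ in range(20):
    gibbs_sweep(image, labels, means, sd, beta, rng)
```

In a fuller analysis the tissue means and noise variance would be treated as unknown and updated within the sampler, and the classification would typically be taken as the most frequently sampled label at each voxel over many sweeps rather than the final state.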

A more realistic model than the one just described would take into account the fact that the volume elements are not homogeneous; while some may contain only one tissue type, others on the interface will contain two or possibly three different tissue types. Our approach to this problem is to construct a higher resolution image in which each voxel is divided into 8 subvoxels. For each voxel the measured value is the sum of the unobserved measurements for the subvoxels. The subvoxels are in turn assumed to be homogeneous and follow the simpler model described above. This approach provides more accurate tissue classification and also allows more effective estimation of the proportion of each voxel that belongs to each of the major tissue types.
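The bookkeeping behind the subvoxel model can be illustrated with a short sketch. Given a hypothetical assignment of tissue types to the 8 subvoxels of one voxel, it returns the proportion of the voxel belonging to each tissue type and the expected measured value, which under the model above is the sum of the subvoxel means. The function name and the numerical means are assumptions made for illustration.

```python
from collections import Counter

TISSUES = ("CSF", "GM", "WM")

def voxel_summary(subvoxel_labels, subvoxel_means):
    """Summarize one voxel from the tissue labels of its 8 subvoxels.

    subvoxel_labels: sequence of 8 tissue names.
    subvoxel_means: dict mapping tissue name to the mean subvoxel measurement.
    Returns (tissue proportions, expected voxel measurement).
    """
    if len(subvoxel_labels) != 8:
        raise ValueError("expected exactly 8 subvoxel labels")
    counts = Counter(subvoxel_labels)
    proportions = {t: counts.get(t, 0) / 8 for t in TISSUES}
    # The voxel measurement is modeled as the sum of the subvoxel measurements,
    # so its expected value is the sum of the subvoxel means.
    expected_value = sum(subvoxel_means[lab] for lab in subvoxel_labels)
    return proportions, expected_value

# Hypothetical voxel on a GM/WM interface: 6 GM subvoxels and 2 WM subvoxels.
labels = ["GM"] * 6 + ["WM"] * 2
means = {"CSF": 4.0, "GM": 10.0, "WM": 15.0}  # made-up subvoxel means
proportions, expected_value = voxel_summary(labels, means)
```

If the subvoxel noise terms are independent with common variance, the variance of the voxel measurement is 8 times the subvoxel variance, which is what allows the simpler homogeneous model to be applied at the subvoxel level.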


The image on the left shows a coronal slice of a T1-weighted MR image of a brain, and the image on the right shows the corresponding tissue classifications with CSF shown in dark gray, GM in medium gray, and WM in light gray.