Geophysical Statistics Project

The mission of the Geophysical Statistics Project (GSP) is to encourage the application of statistical analysis and the development of new statistical methods in the geophysical sciences. To fulfill this goal, GSP must also be engaged in basic statistical research including mathematical statistics and probability theory. Although GSP is administered through the Climate and Global Dynamics Division, this project is an NCAR-wide effort and the research reported in the following sections reflects this breadth of scientific collaboration across the Center's divisions. A large part of GSP’s impact is through multi-year projects with substantial scientific involvement. For this reason many of the research results below are based on efforts continuing from last year's report. Project leadership is provided by Douglas Nychka (GSP). Co-Principal investigators for the project are Richard Katz (Environmental and Societal Impacts Group, ESIG) and Joseph Tribbia (Climate Dynamics and Predictability, CDP).

Time Series

Observation Driven Models for Rainfall Occurrence

This project is a collaborative effort among Sarah Streett (GSP), Linda Mearns (ESIG), and Stephan Sain (University of Colorado, Denver). The goal is to provide a realistic model for precipitation occurrence as one component of a statistical description of daily meteorology. The observation model consists of a Bernoulli {0,1} process where the probability is the logistic transformation of an autoregressive stochastic process. The key feature of this stochastic process is its dependence on past occurrences and seasonality.  A special case of this model is a Markov process for rainfall where transition probabilities depend on past history. Observation driven models have been fit to a representative set of reporting stations in the Southeastern U.S. and, in some cases, the resulting models provide a better reproduction of the distribution of extreme runs of wet and dry days than a simpler Markov model.

This figure indicates the ability of different binary times series models to fit daily precipitation occurrence. Plotted are boxplots indicating the distribution of the top 5% wet and dry runs for a station from North Carolina, an observation driven model, ODM(8) and a simple Markov model proposed by Stern and Coe using the previous days occurrence, SC(6) and the past two days occurrence SC(12). For this station the ODM(8) is better in reproducing the extremes for runs of wet days than the Markov models.

Spatial and Spatio-Temporal Processes

Pedotransfer Functions Based on Nonparametric Regression

Sain, in collaboration with Shrikant Jagtap (Florida State University), has developed a functional relationship between bulk soil properties and the soil’s water holding capacity. This function, known as the pedotransfer function, is important in the application of crop models to a wide variety of soil types. Although the composition (percentages of sand, silt, clay) and the amount of organic matter are mapped for soils, the drained upper limit (DUL) and drained lower limits (DLL) are difficult to measure and are not usually known. The statistical problem is to use an extensive field dataset where DUL and DLL are known and fit a flexible regression. The formulation of this model is a hybrid of spatial statistics ideas, additive models, and thin plate splines. It also includes an error covariance model for the correlation of errors as a function of the depth when soil layers are stacked in a common soil profile. The net result is an estimate of the pedotransfer for agricultural soils along with an estimate of the uncertainty. The progation of uncertainty in soil characteristics implied by the uncertainty in the pedotransfer function was assessed for practical significance using a crop model. Simulated yields for corn forced by observed daily meteorology from a site in North Carolina indicate that the year-to-year variability in weather introduces substantially more variance in the yields than uncertainty in the soil drained limits. The analysis of the soil is posted as functions in the R statistical package and makes it possible for other groups to use this estimated pedotransfer function and modify the estimator.

This figure depicts ensembles of soil profiles for a silty-loam soil (SIL) and sandy soil (S). The variation represents the uncertainty in the water holding capacity estimated from the field database and the correlation with depth is based on what is observed empirically in the data and reproduced by the statistical pedotransfer function model.

This figure illustrates the range of yields from a crop model (CERES maize) for a site in North Carolina using the ensemble of soil profile in the previous figure. The crop yields also depend on the observed meteorological variables and for this example this constitutes most of the variability from year to year.

Errors in Scaling Climate Models

Claudia Tebaldi (GSP, ESIG, and Research Applications Program) in collaboration with Tom Wigley (Climate Analysis Section, CAS) and Gerald Meehl (joint with CAS and Climate Change Research, CCR) have studied the errors in extrapolating the results from one climate model experiment to another. The long-term goal of such work is to provide a statistical tool where climate projections for a variety of different future emission scenarios can be inferred from a limited set of experiments using a specified set of emission scenarios. A test case was to scale NCAR Parallel Climate Model results for the A2 emissions scenario into the B2 scenario. The basic idea of scaling is quite simple. The pattern of climate change for one scenario is scaled by the difference in the global mean temperature between the two scenarios. Although this method has been given recent attention, one new aspect of this research is to model the error in the scaling exercise. It was found that the errors in scaling from A2 to B2 climate followed a Gaussian process, but with a nonstationary covariance function that modified the correlation as a function of latitude. This covariance reproduces the errors as the estimate is spatially averaged from the finest resolution of a model grid cell to continental scales.

Determining Extremes in Daily Ozone

Eric Gilleland (GSP and Colorado State University, CSU) and Nychka, in collaboration with William Cox (U.S. Environmental Protection Agency (USEPA) and Shelly Eberly (USEPA), have studied the problem of estimating the spatial field comprising the ozone standard proposed by the USEPA. The standard for an ozone monitoring station is an extreme order statistic (fourth highest) of the daily ozone averages observed over the summer season. This is a nonlinear statistic that has different spatial structure from daily ozone measurements and is possibly not normally distributed. Despite these difficulties, the goal is to estimate the spatial field of the standard at un-monitored locations. The approach is to create a space-time model for daily ozone and sample the conditional distribution of daily ozone for the entire summer season at arbitrary locations given the record of observed data. The conditional simulation of the standard is then the order statistic applied to the conditional sample of daily values. New results during this year are modeling the daily ozone field using several nonstationary covariance models. These include using elevation as an addition coordinate, a mixture of independent fields (proposed by Montserrat Fuentes, North Carolina State University) and varying the range and nugget parameters spatially using a new formulation by Christopher Paciorek (Carnegie Mellon University). The models are evaluated by cross-validation and are compared to uncertainty measures found by a Monte Carlo method.

Regression and Classification

Meta Analysis of Climate Simulations

Tebaldi, in collaboration with Linda Mearns (ESIG) and Richard Smith (University of North Carolina), have developed a Bayesian, random effects model to combine the results of different climate model simulations and quantify the uncertainty. This research is in the context of synthesizing the numerical experiments from different climate modeling centers in a way that produces credible probabilistic forecasts of future climate change.

Different atmosphere-ocean general circulation models (AOGCM) produce different climate change projections.  These differences are especially striking when trying to evaluate climate change on a regional (subcontinental) scale. Studies of multi-model ensemble predictions try to reconcile the AOGCM's responses by combining them into (weighted) means and to use the ensemble spread as an envelope of uncertainty around them. It is natural to embed this descriptive analysis by climate scientists in a statistical context in order to formalize these, otherwise, heuristic procedures. A statistical model also delineates the hypotheses and optimization criteria leading to a final estimate of climate change and uncertainty surrounding it. The basic concept is to consider the individual climate model results as independent samples from a super population of models that are unbiased with respect to the true climate.

A basic approach is a simple mixed-effects, linear statistical model that accounts for climate model bias as a function of the region and particular AOGCM. Another useful component is a correlation in bias between a particular AOGCM's ability to simulate present and future climate. The net results are posterior distribution for projected climate change across regions where the uncertainty combines information across AOGCMS and regions. The posteriors are found by Markov Chain Monte Carlo. In this specific project, we look at a dataset studied by Filippo Giorgi (Abdus Salam International Centre for Theoretical Physics) and Mearns (Journal of Climate, May 2002) consisting of nine AOGCM experiments run under two emission scenarios with simulations of climate for present and future. The results are aggregated over 22 regions. The purpose is to estimate changes in 30-year average temperatures (1961-1990 and 2071-2100) for each region and to attach a measure of uncertainty to them. A more controversial aspect of this analysis is that the parameters for the precision of model bias allow one to objectively rank the AOGCMs in the context of this statistical model.

This figure summarizes the posterior distributions by box-plots of the change in average temperature across the 22 regions for the 9 AOGCM experiments. Top panel, winter season. Bottom panel, summer season. The distributions for winter climate change have lower variance within region, but higher variance among regions than the distributions for summer climate change. Some regions show distinctly larger climate change signals than others, especially high latitude regions of the Northern experiments such as Alaska (ALA), Greenland (GRL), and Northern Asia (NAS).

Stochastic Multiresolution Models for Turbulence

This project is a collaboration among Brandon Whitcher (GSP), Jeff Weiss (University of Colorado, Boulder), Thomas Lee (CSU), and Curtis Storlie (CSU). With the current state of computing technology, it is relatively straightforward to generate high-resolution numerical integrations of the Naiver-Stokes (N-S) equations, the fundamental physical equations that describe (nonreactive) fluid motion. One feature in the simulation of geophysical type fluids is the presence of eddies and vortices. It has become of interest in the past five years to describe these coherent structures (i.e., the number, size, shape, amplitude of vortices).  Our goal is to formulate a statistical model for these structures. The initial statistical problem is to segment the time slices from a numerical simulation of turbulence to reliably identify vortices. The goal is verify some of the assumptions made by Weiss and James McWilliams (University of California at Los Angeles) in the derivation of scaling arguments. The vortex field (vorticity is the curl of velocity) is decomposed as a sum of individual models for the vortices where the individual modes use a multi-resolution basis. This model for the field is formulated as a multivariate multiple linear regression in the wavelet domain, and individual vortex models are isolated through a model selection strategy. The identification procedure depends on the usual kind of truncation ideas in wavelet regression and has been implemented on a parallel computer architecture to make analysis of a reasonable numerical experiment feasible.

A second approach is to use sequences of images to track vortices. Preliminary work suggests that tracking algorithms in the engineering literature may be effective with some adjustment to account for vortex coalescence and dissipation. Several scientific goals from this extension are to quantify the interactions among vortices in simulations of turbulence and to track storms and eddies from more realistic numerical experiments for the atmosphere and ocean.

This figure is a snapshot of the vorticity field from a simulation of 2-d quasi-geostrophic flow. The statistical problem is to readily identify the vortices in this image as distinct coherent elements.

Statistical Computing

Statistical Supercomputing with the Gibbs Sampler

This project is a collaboration among Tim Hoar (GSP), Chris Wikle (University of Missouri), and Ralph Milliff (Colorado Research Associates). Gibbs sampling for a Bayesian hierarchical spatial model was implemented using portable FORTRAN 90 code and ported to a massively parallel architecture, such as that of the NCAR IBM SP RS/6000 machines. The overall goal of this work is to understand what kind of software infrastructure is needed to complete a nontrivial analysis of a large geophysical data stream.

The experience gained from this project can be categorized in three ways: adaptation to hardware constraints, managing input and output streams, and leveraging the advantages of several programming environments. The first involves balancing software design with available computing resources and storage. For a massively parallel but shared resource, this turns out to be a nontrivial effort and is dictated by how large a time window can be accommodated for any individual run. The second category involves the coordination of data flow from the input data stream to the computational core and the subsequent handling of the output data stream. Finally, the development of the computational core was simplified by the flexible combination of several software environments. The development of an efficient FORTRAN 90 (F90) computational engine was facilitated by incremental comparisons to the pilot version coded in Matlab, an interactive, high-level computational language.

The resulting framework was used to successfully blend scatterometer wind observations (QuikSCAT) and analysis fields from the National Centers for Environmental Prediction (NCEP) to produce high-resolution surface wind fields for the tropical Pacific. The current work has been able to efficiently blend QuikSCAT surface winds with analysis winds over a 48° by 128° (at 0.5 degree resolution) domain in six-hour intervals for a three-year period. Posterior means from this analysis along with 50 posterior samples are being made available on DVD in network Common Data Format (netCDF). To our knowledge, this spatio-temporal data product based on Markov Chain Monte Carlo is substantially larger than any other analysis attempted by a statistical research group.

This figure gives an example of the U component (A) and the divergence (B) of a wind realization based on blending NCEP and satellite wind measurements. It one of the 50 ensemble members in the GibbsWinds data product for this time point. The bottom plots indicate the energy spectrum in the realized fields as a function of wave number (C) and the spectrum for just the NCEP wind data product (D). The goal is to reproduce an approximate polynomial decay (red line) for the wind energy, matching the scaling law expected for turbulent flow. Due to issues of data the assimilation and resolution, the NCEP winds alone (D) lack the right amount of power for small scales and depart from this relationship.

Research Software for Model-Data Fusion

Hoar, in collaboration with Jeff Anderson (CDP and Mesoscale and Microscale Meteorology Division), with support from the NCAR Data Assimilation Strategic Initiative, is developing a software framework for research on data assimilation.  The Data Assimilation Research Testbed (DART) will facilitate the combination of assimilation algorithms, models, and observation sets to allow increased understanding of all three. DART gives the user access to a scale of models ranging from toy dynamical systems involving a few degrees of freedom to the full atmosphere general circulation model used in the current NCAR Climate Atmosphere Model. The assimilation methods are also graduated from simple optimal interpolation to ensemble filters. The first test of DART was passed in July, 2003, when it was used successfully as the lab resource for the Advanced Study Program summer colloquium on data assimilation.

Sparse Approximations to Covariance Matrices

Reinhard Furrer (GSP), in collaboration with Anderson, has developed approximate statistical methods that scale to handle large spatial prediction problems where observations occur at irregular locations. The approach is to introduce sparsity in the covariance matrix by tapering with a compactly supported (positive definite) kernel (e.g., one finds the direct product of the covariance matrix and the tapering matrix.).   This technique is common in atmospheric data assimilation, but is not well used in general. Given the correct choice of taper, this approach can be justified based on matching the tail behavior of the spectral density of the covariance function. This analysis suggests that, under certain assumptions, spatial process estimators and classical kernel estimators will be asymptotically equivalent. Thus, we see the potential to unify these two approaches for function estimation.

The potential speedup is striking, suggesting that a spatial prediction (Kriging) can be computed for several thousand points interactively in the R environment. Also, as suggested by the theory, the tapering approximation is quite accurate sacrificing little statistical efficiency compared to the exact calculation.

This figure is an example of the computational time for the key step in a spatial prediction problem. Different lines indicate the time for solving a linear system for a 1-d Kriging problem based on two different high-level languages (Matlab and R) and different levels of tapering. For the R package the time difference between the exact solution verses using a tapered, sparse approximation is more that a factor of 100 for 1000 spatial locations.

Dynamical Systems

Forecasting and Data Assimilation: Adapting Particle Filter Methodology

This project is led by Thomas Bengtsson (GSP) in collaboration with Nychka and Chris Snyder (Mesoscale and Microscale Meteorology Division). Several aspects of numerical weather prediction (NWP) make forecasting and data assimilation particularly challenging: very high-dimensional systems, strongly non-linear (possibly chaotic) dynamics, and real-time requirements for assimilating data and physical models. Despite the practical complexity of data assimilation, the fundamental statistical operation is an application of Bayes Theorem (the update or analysis step) and a change of variables (the forecast step). In this project we implement this Bayes paradigm through an approximation based on a discrete sample from a mixture of Gaussian distributions, essentially a Monte Carlo approximation to the prior and posterior distributions. To handle non-linear systems and still have a stable filtering method, we have found that it is important to use nearest neighborhoods of states to derive updates. Building on the success of this non-Gaussian filter in Lorenz 1963 three-dimensional system, an extension was formulated for higher dimensional systems. The basic idea is to consider mixture approximations that are local both in the state space and also in the physical space. As a starting point, it is demonstrated that this approach can be effective in handling distinctly non-Gaussian distributions produced by the 40-dimensional Lorenz 1993 system. A practical filter for this problem uses a hybrid strategy that blends the nonGaussian filter for a low dimensional subset of the state vector with the ensemble Kalman filter for the other components.