# Helmholtz-Zentrum Dresden-Rossendorf

## Modelling and Evaluation

Fellow Researcher / Geostatistics, Compositional Data Analysis, Process Modelling
Phone: +49 351 260 - 4415
Email: r.tolosana@hzdr.de
09599 Freiberg

Research topics:

• Compositional data analysis (CoDa). If the variables of a data set describe the relative importance of a set of parts in a whole, then the data set should be treated as compositional. Such data are typically expressed in relative units (%, mg/l, ppm, molarity, etc.); they are always positive, and their total sum is equal to or smaller than a constant (100%, 10^6, etc., the so-called closure). Because of these constraints, CoDa should not be treated with classical correlation-based statistical techniques. Instead, one should work with a one-to-one set of log-ratios, which capture the relative character of compositions. The main research in this topic is adapting and applying statistical methods to compositions. Most classical multivariate statistical methods can be readily modified to work meaningfully with log-ratio-transformed data: new and old descriptive statistics and diagrams, linear models, geostatistics, latent variable models (factor analysis, endmember unmixing problems), time series, etc. The only general requirements are to use multivariate methods (as a composition always has several variables) and to avoid interpreting results in terms of "absolute increment" or "absolute decrement". Because compositions convey only relative information, CoDa can only support statements about relative increments or decrements, i.e. enrichment or depletion of one component with respect to another. Lessons learnt from CoDa can also be useful for many other kinds of constrained data (positive variables, grain-size distributions, orientations and other spherical data being the most relevant).
• Geostatistics. Data in geosciences are often georeferenced, i.e. we know and use the positions in space (and/or time) where the samples were taken. In these cases, it is often natural to assume that data taken at neighbouring locations are more similar than data taken at locations far apart. This idea of increasing variability with increasing spatial distance is behind the concept of the variogram, a function that describes how the variance of the difference between pairs of data increases with the distance between the sampling locations. Knowing the variogram of a data set allows us to map the variable in space using optimal interpolations, in the sense that they minimize the interpolation error variance. The main interest in this field is compositional geostatistics, i.e. obtaining consistent spatial models and maps of compositional data. The key idea here is to work in the set of all possible pairwise log-ratios. This gives a flexible set of tools and solutions, consistent with both conventional geostatistics and log-ratio CoDa methodologies. CoDa-geostatistics finds applications from intra-crystal variability analysis to national-scale geochemical surveys. Current research in this field is focused on block kriging for CoDa and geostatistical simulation of mineralogically consistent compositional data.
• Bayesian statistics. In the Bayesian paradigm, we want to know the probable value of some physically meaningful model parameters that condition the available data. In this framework, data are considered known random functions of the parameters, and the goal is to estimate the distribution of the uncertainty about the parameters conditional on the data. In most applications of practical interest, this posterior distribution of the parameters cannot be derived analytically, and one must resort to computationally intensive Markov chain Monte Carlo methods. This is a very general methodology with varied applications in all fields of science. Within the scope of HIF activities, endmember problems stand out. In an endmember problem, one assumes that a given sampled signal (chemical composition, XRD or Raman spectrum, etc.) is an additive mixture of some pure endmember signals (known or unknown), and the goal is to unmix the signal and estimate the proportions of each endmember in the samples. In most cases, some or all of the available and desired information (sampled signals, endmember signals and endmember proportions) is compositional, and probabilistic models must therefore be adapted to this fact.
• Model-data merging techniques. Sometimes, the physically meaningful parameters do not control the available data directly, but through their influence on some state variables of a system of differential equations. These typically model reaction-diffusion-advection processes or Lotka-Volterra-like dynamic systems. Bayesian analysis offers a framework to understand the relations between the parameters, state variables and data. Parameters can be estimated from available data (calibration), competing alternative models can be ranked by their goodness of fit (validation), and predicted state variables can be perturbed to fit the data (assimilation). Most often, these systems are regionalized, thus requiring geostatistical tools in several intermediate steps.
• R programming. R is a multi-platform, free and open-source statistical environment that has become a de facto standard in statistics. We have been working since 2003 on a package for compositional analysis (called "compositions"), and are now dealing with geostatistical applications, latent variable models, grain-size distribution applications and textural analysis. Improving the user-friendliness of "compositions" is also a line of work.
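
Two of the ideas above, the log-ratio representation of compositions and the empirical variogram, can be illustrated with a minimal Python sketch. It is not taken from the "compositions" package: the function names (`closure`, `clr`, `empirical_variogram`) and all data values are hypothetical and chosen only for illustration. The sketch uses the centred log-ratio (clr) transform as one common choice of log-ratio representation, and it computes the semivariogram (half the mean squared difference between pairs) of a single pairwise log-ratio along an assumed one-dimensional transect.

```python
import math


def closure(parts, total=1.0):
    """Rescale positive parts so that they sum to `total` (e.g. 100 for %)."""
    s = sum(parts)
    return [total * p / s for p in parts]


def clr(parts):
    """Centred log-ratio: log of each part divided by the geometric mean."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def empirical_variogram(coords, values, lag_edges):
    """Semivariance per lag bin [lag_edges[k], lag_edges[k + 1]).

    Returns half the mean squared difference of all data pairs whose
    separation distance falls in each bin, or None for an empty bin.
    """
    n_bins = len(lag_edges) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            # Euclidean distance between the two sampling locations.
            h = math.sqrt(sum((a - b) ** 2 for a, b in zip(coords[i], coords[j])))
            for k in range(n_bins):
                if lag_edges[k] <= h < lag_edges[k + 1]:
                    sums[k] += 0.5 * (values[i] - values[j]) ** 2
                    counts[k] += 1
                    break
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    # Hypothetical three-part compositions (ppm) along a 1-D transect.
    coords = [(0.0,), (1.0,), (2.0,), (3.0,)]
    samples = [
        [20.0, 30.0, 50.0],
        [22.0, 30.0, 48.0],
        [30.0, 25.0, 45.0],
        [35.0, 25.0, 40.0],
    ]
    # The clr scores are the same whether the parts are given in ppm or in %.
    print(clr(samples[0]))
    print(clr(closure(samples[0], 100.0)))
    # Semivariogram of one pairwise log-ratio, log(part 1 / part 2).
    logratio = [math.log(s[0] / s[1]) for s in samples]
    print(empirical_variogram(coords, logratio, [0.5, 1.5, 2.5, 3.5]))
```

Because only ratios between parts enter the transform, rescaling a composition to a different closure constant leaves the clr scores unchanged. This is why results are read as enrichment or depletion of one part relative to another, not as absolute changes.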

Teaching

• R for Geoscientists: a one-week short course (30 h) providing a practical introduction to the fundamental statistical techniques in the geosciences (exploratory data analysis, regression, discriminant analysis and geostatistics) from the point of view of the statistical software R.