Statistics in Metabolomics - PowerPoint PPT Presentation

1 / 41
About This Presentation
Title:

Statistics in Metabolomics

Description:

Statistics in Metabolomics David Banks ISDS Duke University 1. Background Metabolomics is the next step after genomics and proteomics. There are about 25,000 genes ... – PowerPoint PPT presentation

Number of Views:34
Avg rating:3.0/5.0
Slides: 42
Provided by: hsphHarva1
Category:

less

Transcript and Presenter's Notes

Title: Statistics in Metabolomics


1
Statistics in Metabolomics
  • David Banks
  • ISDS
  • Duke University

2
1. Background
  • Metabolomics is the next step after genomics and
    proteomics.
  • There are about 25,000 genes, most of which have
    unknown functions.
  • There are about 1,000,000 proteins, most of which
    are unstudied.

3
(No Transcript)
4
  • In contrast to the omics areas
  • There are only about 900 main metabolites, and we
    know their chemical structures
  • Also, we know (pretty well) the biochemical
    pathways that determine their production rates
  • Metabolites are low-weight molecular compounds
    produced in the course of processing raw
    materials.

5
  • Some common metabolites include
  • cholesterol
  • glucose, sucrose, fructose
  • amino acids
  • lactic acid, uric acid
  • ATP, ADP
  • drug metabolites, legal and illegal
  • These are produced in metabolic pathways, such as
    the Krebs (citrate) cycle for oxidation of
    glucose.

6
(No Transcript)
7
  • These pathways contain important information
    about the amount of each metabolite
  • Stoichiometric equations show how much material
    is produced in a given reaction i.e., mass
    balance.
  • Rate equations govern the speed at which
    reactions take place, and the location of the
    Gibbs equilibrium
  • This gives metabolomics an edge.

8
Biochemical Profile Map to Metabolic Pathways
9
  • The purposes of metabolomics are
  • Early detection of disease, such as necrosis,
    ALS, Alzheimers, and infection or inflammation.
  • Assessment of toxicity (especially liver
    toxicity) in new drugs.
  • Diet strategies, drug testing.
  • Elucidating biochemical pathways.
  • There is less raw information than for other
    omics, but more context.

10
2. Measurement Issues
  • To obtain data, a tissue sample is taken from a
    patient. Then
  • The sample is prepped and put onto wells on a
    silicon plate.
  • Each wells aliquot is subjected to gas and/or
    liquid chromatography.
  • After separation, the sample goes to a mass
    spectrometer.

11
  • The sample prep involves stabilizing the sample,
    adding spiked-in calibrants, and creating
    multiple aliquots (some are frozen) for QC
    purposes. This is roboticized.
  • Sources of error in this step include
  • within-subject variation
  • within-tissue variation
  • contamination by cleaning solvents
  • calibrant uncertainty
  • evaporation of volatiles.

12
  • Gas chromatography creates an ionized aerosol,
    and each droplet evaporates to a single ion.
    This is separated by mass in the column, then
    ejected to the spectrometer.
  • Sources of error in this step include
  • imperfect evaporation
  • adhesion in the column
  • ion fragmentation or adductance

13
  • The fourier mass spectrometer determines the mass
    to charge ratio of the ion from the field
    strength required to keep the ion spinning in a
    circle. This avoids the entry-time uncertainty
    in TOF machines, so the only main error is
    uncertainty about the field strength
  • Some laboratories use MALDI-TOF equipment, and
    the error sources are slightly different.

14
(No Transcript)
15
  • The result of this is a set of m/z ratios and
    timestamps for each ion, which can be viewed as a
    2-D histogram in the m/z x time plane.
  • One now estimates the amount of each metabolite.
    This entails normalization, which also introduces
    error.
  • The caveats pointed out in Baggerley et al.
    (Proteomics, 2003) apply.

16
(No Transcript)
17
3. Statistical Problems
  • Understanding the uncertainty budget in
    metabolomic data, which entails both quality
    control and cross-platform comparisons.
  • Identifying the peaks in the m/z x t plane, and
    estimating quantity of specific metabolites.
  • Finding markers for disease or toxicity, or
    measuring change.

18
3.1 Uncertainty
  • The classical NIST approach to this is to
  • build a model for the error terms
  • do a designed experiment with replicated
    measurements
  • fit a measurement equation to the data
  • See Cameron, Error Analysis, ESS Vol. 9, 1982.

19
  • Let z be the vector of raw data, and let x be the
    estimates. Then the measurement equation is
  • G(z) x µ e
  • where µ is the vector of unknown true values
    and e is decomposable into separate components.
  • For metabolite i, the estimate Xi is
  • gi(z) lnS wij ??sm(z) c(m,t)dm dt.

20
  • The law of propagation of error (this is
    essentially the delta method) says that the
    variance in X is about
  • Sni1 (?g /? zi)2 Varzi
  • Si?k 2 (?g/?zi)(?g/?zk) Covzi, zk
  • The weights depend upon the values of the spiked
    in calibrants, so this gets complicated.

21
  • Cross-platform experiments are also crucial for
    medical use. This leads to key comparison
    designs. Here the same sample (or aliquots of a
    standard solution or sample) are sent to multiple
    labs. Each lab produces its spectrogram.
  • It is impossible to decide which lab is best, but
    one can estimate how to adjust for interlab
    differences.

22
  • The Mandel bundle-of-lines model is what we
    suggest for interlaboratory comparisons. This
    assumes
  • Xik ai ßi ?k eik
  • where Xik is the estimate at lab i for
    metabolite k, ?k is the unknown true quantity of
    metabolite k, and
  • eik N(0,sik2).

23
  • To solve the equations given values from the
    labs, one must impose constraints. A Bayesian
    can put priors on the laboratory coefficients and
    the error variance.
  • Metabolomics needs a multivariate version, with
    models for the rates at which compounds
    volatilize.
  • We plan to use this model to compare the
    Metabolon lab in RTP to Chris Newgards lab at
    Duke.

24
3.2 Peak Identification
  • A classic problem in proteomics is to locate
    peaks and estimate their area or volume.
  • Unlike proteomics, metabolite peak location is
    mostly known. So Bayesian methods seem good (cf.
    Clyde and House). Metabolon uses proprietary
    software.

25
(No Transcript)
26
(No Transcript)
27
(No Transcript)
28
3.3 Data Mining
  • Different tools are appropriate for different
    kinds of metabolomic studies. The work we have
    done focuses on
  • Random Forests
  • Support Vector Machines
  • Robust Singular Value Decomposition

29
  • We had abundance data on 317 metabolites from 63
    subjects. Of these, 32 were healthy, 22 had ALS
    but were not on medication, and 9 had ALS and
    were taking medication.
  • The goal was to classify the two ALS groups and
    the healthy group.
  • Here pgtn. Also, some abundances were below
    detectability.

30
  • Using the Breiman-Cutler code for Random Forests,
    the out-of-bag error rate was 7.94 29 of the
    ALS patients and 29 of the healthy patients were
    correctly classified.
  • 20 of the 317 metabolites were important in the
    classification, and three were dominant.
  • RF can detect outliers via proximity scores.
    There were four such.

31
  • Several support vector machine approaches were
    tried on this data
  • Linear SVM
  • Polynomial SVM
  • Gaussian SVM
  • L1 SVM (Bradley and Mangasarian, 1998)
  • SCAD SVM (Fan and Li, 2000)
  • The SCAD SVM had the best loo error rate, 14.3.

32
  • The L1 SVM attempts to mimic the automatic
    variable selection in the LASSO (Tibshirani,
    1996) by solving the programming problem
  • Minb,w S1 yi(bwTxi) ?S wk
  • where the first sum is over n and the second
    is over p.
  • SCAD replaces the L1 penalty with a nonconvex
    penalty.

33
  • The SCAD SVM selected 18 of the metabolites as
    being important the L1 selected 32. This
    suggests that the automatic variable selection in
    L1 SVM is not very effective.
  • A further multiple tree analysis with FIRMPlusTM
    software from the GoldenHelix Co. did not achieve
    good classification.
  • So Random Forests wins. And the selected
    metabolites make sense.

34
  • Robust SVD (Liu et al., 2003) is used to
    simultaneously cluster patients (rows) and
    metabolites (columns). Given the patient by
    metabolite matrix X, one writes
  • Xik ri ck eik
  • where ri and ck are row and column effects.
    Then one can sort the array by the effect
    magnitudes.

35
  • To do a rSVD use alternating L1 regression,
    without an intercept, to estimate the row and
    column effects. First fit the row effect as a
    function of the column effect, and then reverse.
    Robustness stems from not using OLS.
  • Doing similar work on the residuals gives the
    second singular value solution.

36
(No Transcript)
37
3.3.1 Preterm Labor
  • The NIH wanted to decide whether amniotic fluid
    samples from women in preterm labor could support
    classification
  • Term delivery
  • Preterm delivery with inflammation
  • Preterm delivery without inflammation.

38
  • The analysis had samples from 113 women in
    preterm labor. We tried all of the usual
    classification methods.
  • As before, Random Forests gave the best results.
    The various SVMs were about 5-10 less
    predictive.
  • The main information was contained in amino acids
    and carbohydrates.

39
  • Predicted
  • Term Inflamm.
    No Inf.
  • Term 39 1
    0
  • True Inflamm. 7 32
    1
  • No Inf. 2 2
    29
  • RF accuracy was 100/113 88.49.

40
  • For those with term delivery, amino acids were
    low, carbohydrates were high.
  • For those who had preterm delivery without
    inflammation, both amino acids and carbohydrates
    were low.
  • For those who had inflammation, the carbohydrates
    were very low and the amino acids were high.

41
  • My collaborators in this research are
  • Chris Beecher, Metabolon, Inc.
  • Adele Cutler, USU
  • Leanna House, Duke University
  • Jackie Hughes-Oliver, NCSU
  • Xiadong Lin, U. of Cincinnati
  • Susan Simmons, UNC-Wilmington
  • Young Truong, UNC-Chapel Hill
  • Stan Young, NISS
Write a Comment
User Comments (0)
About PowerShow.com