Using Random Forests to explore a complex Metabolomic data set


Title: Using Random Forests to explore a complex Metabolomic data set


1
Using Random Forests to explore a complex
Metabolomic data set
  • Susan Simmons
  • Department of Mathematics and Statistics
  • University of North Carolina Wilmington

2
Collaborators
  • Dr. David Banks (Duke)
  • Dr. Jacqueline Hughes-Oliver (NC State)
  • Dr. Stan Young (NISS)
  • Dr. Young Truong (UNC)
  • Dr. Chris Beecher (Metabolon)
  • Dr. Xiaodong Lin (SAMSI)

3
(No Transcript)
4
Large data sets
  • Examples
  • Walmart: 20 million transactions daily
  • AT&T: 100 million customers; carries 200 million
    calls a day on its long-distance network
  • Mobil Oil: over 100 terabytes of oil exploration data
  • Human genome: gigabytes of data
  • IRA

5
Dimensionality
6
Dimensionality
  • 3,000 metabolites
  • 40,000 genes
  • 100,000 chemicals
  • Try to find the signal in these data sets (and
    not the noise): data mining
  • Examples of data mining techniques: pattern
    recognition, expert systems, genetic algorithms,
    neural networks, random forests

7
Today's talk
  • Focus on classification (supervised learning: use
    a response to guide the learning process)
  • Response is categorical (Each observation belongs
    to a class)
  • Interested in relationship between variables and
    the response
  • Short, fat data: few observations, many variables
    (instead of long, skinny data)

8
Long, skinny data
X Y Z
2 8 9
3 4 4
7 5 46
8 7 3
4 56 35
6 58 63
12 9 3
14 2 35
24 1 45
2 7 4
13 78 25
14 56 34
18 6 89
35 8 56
9
Short, fat data
X Y Z S T V M N R Q L H G K B C W
4 36 5 8 30 4 35 7 3 78 9 3 1 40 2 5 34
6 7 34 6 7 67 8 89 8 4 2 6 5 9 8 67 3
7 46 2 4 5 6 7 58 9 7 9 50 4 45 7 8 45
8 4 5 65 57 57 42 2 7 23 4 6 76 8 0 56 90
n < p problem
10
Random Forests
  • Developed by Leo Breiman (Berkeley) and Adele
    Cutler (Utah State)
  • Can handle the n < p problem
  • Random forests are comparable in accuracy to
    support vector machines
  • Random forests are a combination of tree
    predictors

11
Constructing a tree
Observation Gender Height (inches)
1 F 60
2 F 66
3 M 68
4 F 70
5 F 66
6 M 72
7 F 64
8 M 67
12
Tree for previous data set
All observations (N=8)
  Height <= 66 (N=4): Male N=0, Female N=4
  Height > 66 (N=4): Male N=3, Female N=1
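As a rough illustration of how a split like the one above could be chosen, the following Python sketch scores every candidate height threshold on the eight observations from the previous slide. The use of Gini impurity and the helper names `gini` and `best_split` are assumptions for illustration; the slides do not say which splitting criterion was used.

```python
from collections import Counter

# (gender, height in inches) for the eight observations on the slide
data = [("F", 60), ("F", 66), ("M", 68), ("F", 70),
        ("F", 66), ("M", 72), ("F", 64), ("M", 67)]

def gini(labels):
    """Gini impurity of a list of class labels (0 means the node is pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows):
    """Return (threshold, weighted impurity) of the best 'height <= t' split."""
    best = None
    heights = sorted({h for _, h in rows})
    for t in heights[:-1]:  # splitting at the maximum would leave one side empty
        left = [g for g, h in rows if h <= t]
        right = [g for g, h in rows if h > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
        if best is None or score < best[1]:
            best = (t, score)
    return best

threshold, impurity = best_split(data)
left = Counter(g for g, h in data if h <= threshold)
right = Counter(g for g, h in data if h > threshold)
print(threshold, dict(left), dict(right))
```

Under this criterion the chosen threshold is 66 inches, with four females on one side and three males plus one female on the other, which matches the node counts on the slide.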
13
Random Forest
  • First, the number of trees to be grown must be
    specified.
  • Also, the number of variables randomly selected
    at each node must be specified (m).
  • Each tree is constructed in the following manner:
  • 1. At each node, randomly select m variables to
    split on.

14
Random Forest
  • The node is split using the best split among the
    selected variables.
  • This process is continued until each node has
    only one observation, or all the observations
    belong to the same class.
  • Do this for each tree in the forest
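The construction steps on these two slides can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the Breiman and Cutler implementation: Gini impurity as the split criterion, the dictionary-based tree layout, the function names, and the toy data are all hypothetical. Bootstrap sampling, which the talk introduces on a later slide, is deliberately left out here.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, m, rng):
    """Grow one unpruned tree, trying only m randomly chosen variables per node."""
    if len(set(y)) == 1:  # one observation, or all observations in one class
        return {"leaf": y[0]}
    best = None
    for j in rng.sample(range(len(X[0])), m):
        values = sorted({row[j] for row in X})
        for t in values[:-1]:  # keeps both children non-empty
            left = [i for i, row in enumerate(X) if row[j] <= t]
            right = [i for i, row in enumerate(X) if row[j] > t]
            score = (len(left) * gini([y[i] for i in left])
                     + len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        # Simplification (assumption): if the selected variables are all
        # constant in this node, stop and use a majority-vote leaf.
        return {"leaf": Counter(y).most_common(1)[0][0]}
    _, j, t, left, right = best
    return {"var": j, "thr": t,
            "left": grow_tree([X[i] for i in left], [y[i] for i in left], m, rng),
            "right": grow_tree([X[i] for i in right], [y[i] for i in right], m, rng)}

def grow_forest(X, y, n_trees, m, seed=0):
    """Repeat the tree construction n_trees times."""
    rng = random.Random(seed)
    return [grow_tree(X, y, m, rng) for _ in range(n_trees)]

def predict_tree(tree, row):
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["thr"] else tree["right"]
    return tree["leaf"]

def predict_forest(forest, row):
    """Each tree votes; the most common class wins."""
    votes = Counter(predict_tree(t, row) for t in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: variables 0 and 1 separate the classes, variable 2 is noise
X = [[1, 5, 0], [2, 6, 1], [3, 7, 0], [8, 1, 1], [9, 2, 0], [10, 3, 1]]
y = ["A", "A", "A", "B", "B", "B"]
forest = grow_forest(X, y, n_trees=25, m=2)
print(predict_forest(forest, [2, 6, 1]))
```

Here the number of trees (25) and m (2) are the two values that the slides say must be specified up front.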

15
Example: Cereal Data
16
N=70 (40 G, 30 K)
  Calories < 100 (2 G, 15 K)
    Fat < 1: 15 K
    Fat > 1: 2 G
  Calories >= 100 (38 G, 15 K)
    Carbo < 12: 15 K
    Carbo > 12: 38 G
17
Random Forest
  • Another important feature is that each tree is
    created using a bootstrap sample of the learning
    set.
  • Each bootstrap sample contains approximately 2/3
    of the distinct observations (thus approximately
    1/3 is left out)
  • For each observation, we can use the trees built
    without it to get an idea of the error rate
    (each such tree votes on which class the
    observation belongs to).
  • Example

18
N=70 (40 G, 30 K)
  Calories < 100 (2 G, 15 K)
    Fat < 1: 15 K
    Fat > 1: 2 G
  Calories >= 100 (38 G, 15 K)
    Carbo < 12: 15 K
    Carbo > 12: 38 G
Observation withheld from creating this tree:
Calories Fat Carbo Mfr
98 2 10 K
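To see how this tree votes on the withheld observation, the tree can be written as a small Python function. This is a sketch under assumptions: the second root branch is taken to be the complement of Calories < 100, and boundary values (Fat equal to 1, Carbo equal to 12) are sent to the "greater than" side, neither of which the slide states. The function name `classify` is hypothetical.

```python
def classify(calories, fat, carbo):
    """Follow the cereal tree from the slide and return the predicted Mfr code."""
    if calories < 100:
        # Assumption: Fat == 1 goes to the 'G' side
        return "K" if fat < 1 else "G"
    # Assumption: this branch covers Calories >= 100; Carbo == 12 goes to 'G'
    return "K" if carbo < 12 else "G"

# The withheld observation from the slide
withheld = {"calories": 98, "fat": 2, "carbo": 10, "mfr": "K"}
predicted = classify(withheld["calories"], withheld["fat"], withheld["carbo"])
print(predicted, predicted == withheld["mfr"])
```

This tree votes G for an observation whose manufacturer is K, so it contributes one wrong vote. The error estimate for the observation depends on the majority vote across all trees that did not use it.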
19
Random Forest
  • This gives us an out-of-bag (OOB) error rate
  • Random forests also give us an idea of which
    variables are important for classifying
    individuals.
  • Also gives information about outliers
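The out-of-bag error rate rests on the earlier claim that each bootstrap sample holds roughly 2/3 of the data. A small simulation can check that claim. The forest size of 500 and the helper name `bootstrap_indices` are hypothetical; n = 70 is the size of the cereal learning set on the slides.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)
n = 70         # size of the cereal learning set on the slides
n_trees = 500  # hypothetical forest size

in_bag_fractions = []
for _ in range(n_trees):
    in_bag = set(bootstrap_indices(n, rng))  # distinct observations in this sample
    in_bag_fractions.append(len(in_bag) / n)

mean_in_bag = sum(in_bag_fractions) / n_trees
# Expected fraction of distinct observations in a bootstrap sample of size n
expected = 1 - (1 - 1 / n) ** n
print(round(mean_in_bag, 3), round(expected, 3))
```

As n grows, the expected in-bag fraction approaches 1 - 1/e, about 0.632, so each observation is out of bag for a little over a third of the trees.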

20
The era of the omics sciences
21
Just a few of the omics sciences
  • Genomics
  • Transcriptomics
  • Proteomics
  • Metabolomics
  • Phenomics
  • Toxicogenomics
  • Phylomics
  • Foldomics
  • Kinomics
  • Interactomics
  • Behavioromics
  • Variomics
  • Pharmacogenomics

22
Functional Genomics
Genomics
Transcriptomics
Proteomics
Metabolomics
23
Metabolomics
  • Metabolites are all the small molecules in a cell
    (e.g. ATP, sugar, pyruvate, urea)
  • 3,000 metabolites in the human body (compared to
    35,000 genes and approximately 100,000 proteins)
  • Most direct measure of cell physiology
  • Uses GC/MS and LC/MS to obtain measurements

24
Data
  • Currently only have GC/MS information
  • Missing values are very informative (below
    detection limits)
  • Imputed missing values using uniform random draws
    between 0 and the minimum observed value
  • 105 metabolites
  • 58 individuals (42 disease 1, 6 disease 2,
    and 10 controls)
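The imputation step described above might look like the following Python sketch. It assumes the minimum is taken per metabolite over its observed values and that missing values are coded as `None`; the slides specify neither, and the function name and sample intensities are hypothetical.

```python
import random

def impute_below_detection(values, rng):
    """Replace missing values (None) with Uniform(0, minimum observed) draws."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to define the minimum")
    low = min(observed)  # assumption: per-metabolite minimum
    return [rng.uniform(0, low) if v is None else v for v in values]

rng = random.Random(42)
# Hypothetical intensities for one metabolite; None marks a missing value
metabolite = [12.4, None, 8.1, 30.0, None, 9.7]
imputed = impute_below_detection(metabolite, rng)
print(imputed)
```

Drawing below the smallest observed value keeps the imputed entries consistent with the idea that a missing measurement fell below the detection limit.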

25
Confusion matrix
     1   2   3
1   40   1   8
2    0   5   1
3    2   0   1
OOB error: 20.69%
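The reported error rate can be reproduced from the confusion matrix with a few lines of Python. The variable names are illustrative only.

```python
# Confusion matrix as printed on the slide (3 classes)
confusion = [
    [40, 1, 8],
    [0, 5, 1],
    [2, 0, 1],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
oob_error = (total - correct) / total
column_totals = [sum(row[j] for row in confusion) for j in range(3)]

print(total, correct, round(100 * oob_error, 2), column_totals)
```

The diagonal holds 46 of the 58 individuals, so 12 are misclassified and 12/58 gives 20.69%. The columns sum to 42, 6, and 10, which matches the class sizes on the Data slide and suggests that the columns hold the true classes and the rows the predicted ones.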
26
Outlier
27
Variable Importance
28
Visual Data
  • Dostat

29
Conclusions
  • Random forests, support vector machines, and
    neural networks are some of the newest algorithms
    for understanding large datasets.
  • There is still much more to be done.

30
Thank you