Title: Using Random Forests to explore a complex Metabolomic data set
1Using Random Forests to explore a complex
Metabolomic data set
- Susan Simmons
- Department of Mathematics and Statistics
- University of North Carolina Wilmington
2Collaborators
- Dr. David Banks (Duke)
- Dr. Jacqueline Hughes-Oliver (NC State)
- Dr. Stan Young (NISS)
- Dr. Young Truong (UNC)
- Dr. Chris Beecher (Metabolon)
- Dr. Xiaodong Lin (SAMSI)
4Large data sets
- Examples
- Walmart
- 20 million transactions daily
- AT&T
- 100 million customers; carries 200 million
calls a day on its long-distance network
- Mobil Oil
- over 100 terabytes of data from oil exploration
- Human genome
- Gigabytes of data
- IRA
5Dimensionality
6Dimensionality
- 3,000 metabolites
- 40,000 genes
- 100,000 chemicals
- Try to find the signal in these data sets (and
not the noise): data mining
- Examples of data mining techniques: pattern
recognition, expert systems, genetic algorithms,
neural networks, random forests
7Today's talk
- Focus on classification (supervised learning: use
a response to guide the learning process)
- Response is categorical (each observation belongs
to a class)
- Interested in the relationship between variables and
the response
- Short, fat data (instead of long, skinny data)
8Long, skinny data
X Y Z
2 8 9
3 4 4
7 5 46
8 7 3
4 56 35
6 58 63
12 9 3
14 2 35
24 1 45
2 7 4
13 78 25
14 56 34
18 6 89
35 8 56
9Short, fat data
X Y Z S T V M N R Q L H G K B C W
4 36 5 8 30 4 35 7 3 78 9 3 1 40 2 5 34
6 7 34 6 7 67 8 89 8 4 2 6 5 9 8 67 3
7 46 2 4 5 6 7 58 9 7 9 50 4 45 7 8 45
8 4 5 65 57 57 42 2 7 23 4 6 76 8 0 56 90
n < p problem (fewer observations than variables)
10Random Forests
- Developed by Leo Breiman (Berkeley) and Adele
Cutler (Utah State)
- Can handle the n < p problem
- Random forests are comparable in accuracy to
support vector machines
- Random forests are a combination of tree
predictors
11Constructing a tree
Observation Gender Height (inches)
1 F 60
2 F 66
3 M 68
4 F 70
5 F 66
6 M 72
7 F 64
8 M 67
12Tree for previous data set
All observations N = 8
  Height ≤ 66 N = 4
    Male N = 0
    Female N = 4
  Height > 66 N = 4
    Male N = 3
    Female N = 1
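How such a split might be found can be sketched in a few lines. The slides do not say which criterion is used to choose a split; the sketch below assumes Gini impurity and tries each observed height as a cut point. The function names `gini` and `best_split` are illustrative, not taken from the talk.

```python
from collections import Counter

# Height and gender data from the "Constructing a tree" slide
heights = [60, 66, 68, 70, 66, 72, 64, 67]
genders = ["F", "F", "M", "F", "F", "M", "F", "M"]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(x, y):
    """Return (threshold, weighted impurity) of the best 'x <= threshold' split."""
    best = None
    # Every observed value except the largest is a candidate cut point,
    # so both sides of the split are always non-empty.
    for t in sorted(set(x))[:-1]:
        left = [yi for xi, yi in zip(x, y) if xi <= t]
        right = [yi for xi, yi in zip(x, y) if xi > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[1]:
            best = (t, score)
    return best

threshold, impurity = best_split(heights, genders)
print(threshold, impurity)
```

Under this assumed criterion the chosen cut is at 66 inches: the four observations at or below 66 are all female, and the four above contain three males and one female, which is the 4/4 partition shown on the slide.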
13Random Forest
- First, the number of trees to be grown must be
specified.
- Also, the number of variables randomly selected
at each node (m) must be specified.
- Each tree is constructed in the following manner:
- 1. At each node, randomly select m variables to
split on.
14Random Forest
- 2. The node is split using the best split among the
selected variables.
- 3. This process is continued until each node has
only one observation, or all the observations
belong to the same class.
- Do this for each tree in the forest.
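The construction steps on these two slides can be sketched as follows. This is a minimal illustration, not the Breiman and Cutler implementation: it assumes Gini impurity as the split criterion, combines the trees by majority vote, and grows every tree on the full data set, so the trees differ only through the random choice of m variables at each node. The toy data and all function names are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(rows, labels, m, rng):
    """Grow one unpruned tree, choosing among m random variables at each node."""
    if len(set(labels)) == 1:                      # node is pure: stop
        return {"leaf": labels[0]}
    best = None
    for j in rng.sample(range(len(rows[0])), m):   # m randomly selected variables
        for t in sorted({r[j] for r in rows})[:-1]:
            left = [i for i, r in enumerate(rows) if r[j] <= t]
            right = [i for i, r in enumerate(rows) if r[j] > t]
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right]))
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:
        # Simplification: if none of the m variables varies here, stop with
        # a majority-class leaf instead of drawing a new set of variables.
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    _, j, t, left, right = best
    return {"var": j, "thr": t,
            "left": grow_tree([rows[i] for i in left],
                              [labels[i] for i in left], m, rng),
            "right": grow_tree([rows[i] for i in right],
                               [labels[i] for i in right], m, rng)}

def predict(tree, row):
    while "leaf" not in tree:
        tree = tree["left"] if row[tree["var"]] <= tree["thr"] else tree["right"]
    return tree["leaf"]

def grow_forest(rows, labels, n_trees, m, seed=0):
    rng = random.Random(seed)
    return [grow_tree(rows, labels, m, rng) for _ in range(n_trees)]

def forest_predict(forest, row):
    """Each tree votes; the most common class wins."""
    votes = Counter(predict(tree, row) for tree in forest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: 6 observations, 4 variables, 2 classes
rows = [[1, 5, 3, 9], [2, 4, 8, 7], [6, 5, 2, 1],
        [7, 3, 9, 2], [3, 8, 4, 6], [8, 1, 7, 3]]
labels = ["A", "A", "B", "B", "A", "B"]

full_tree = grow_tree(rows, labels, m=len(rows[0]), rng=random.Random(0))
forest = grow_forest(rows, labels, n_trees=25, m=2, seed=0)
print(forest_predict(forest, [2, 6, 5, 8]))
```

Setting m equal to the number of variables reduces each tree to an ordinary unpruned classification tree; smaller values of m make the trees less alike.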
15Example: Cereal Data
16N = 70 (40 G, 30 K)
Calories < 100 (2 G, 15 K)
  Fat < 1: 15 K
  Fat > 1: 2 G
Calories ≥ 100 (38 G, 15 K)
  Carbo < 12: 15 K
  Carbo > 12: 38 G
17Random Forest
- Another important feature is that each tree is
created using a bootstrap sample of the learning
set.
- Each bootstrap sample contains approximately 2/3
of the data (thus approximately 1/3 is left out).
- Now, for each observation, we can use the trees built
without it to get an idea of the error rate
(each such tree votes on which class the
observation belongs to).
- Example
18N = 70 (40 G, 30 K)
Calories < 100 (2 G, 15 K)
  Fat < 1: 15 K
  Fat > 1: 2 G
Calories ≥ 100 (38 G, 15 K)
  Carbo < 12: 15 K
  Carbo > 12: 38 G
Observation withheld from creating this tree:
Calories Fat Carbo Mfr
98 2 10 K
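To make the vote concrete, the withheld cereal can be passed down the tree on this slide. The sketch assumes the second branch covers calories of 100 or more, and that values exactly on a boundary (fat of 1, carbo of 12) go to the upper branch; the slide does not settle either point. The function name `classify` is illustrative.

```python
def classify(cereal):
    """Follow the slide's tree and return the predicted manufacturer (G or K)."""
    if cereal["calories"] < 100:
        return "K" if cereal["fat"] < 1 else "G"
    # Assumed complement branch: calories of 100 or more
    return "K" if cereal["carbo"] < 12 else "G"

# The observation withheld from building this tree
withheld = {"calories": 98, "fat": 2, "carbo": 10, "mfr": "K"}

vote = classify(withheld)
print(vote, vote == withheld["mfr"])
```

Under this reading the tree votes G for a cereal that is actually K, so this tree contributes one wrong vote for that observation. The observation's final out-of-bag prediction depends on the votes of all trees that did not see it.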
19Random Forest
- This gives us an out-of-bag error rate.
- Random forests also give us an idea of which
variables are important for classifying
individuals.
- Also gives information about outliers.
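The bookkeeping behind the out-of-bag error rate can be sketched with a small simulation. It draws bootstrap samples and records, for each observation, which trees never saw it and so may vote on it. The sample size of 58 matches the data set in this talk; the choice of 500 trees and all names are assumptions made for illustration.

```python
import random

def bootstrap_indices(n, rng):
    """Draw n indices with replacement from range(n)."""
    return [rng.randrange(n) for _ in range(n)]

rng = random.Random(0)
n, n_trees = 58, 500          # 500 trees is an assumed, illustrative value

oob_fractions = []
oob_trees = {i: [] for i in range(n)}   # trees allowed to vote on observation i
for t in range(n_trees):
    in_bag = set(bootstrap_indices(n, rng))
    oob = [i for i in range(n) if i not in in_bag]
    oob_fractions.append(len(oob) / n)
    for i in oob:
        oob_trees[i].append(t)

mean_oob = sum(oob_fractions) / n_trees
print(round(mean_oob, 3))
```

The chance that a given observation is missed by one bootstrap sample is (1 - 1/n)^n, which approaches 1/e (about 0.37) as n grows. This is the "approximately 1/3" left out of each sample, so every observation is out of bag for roughly a third of the trees.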
20The era of the omics sciences
21Just a few of the omics sciences
- Genomics
- Transcriptomics
- Proteomics
- Metabolomics
- Phenomics
- Toxicogenomics
- Phylomics
- Foldomics
- Kinomics
- Interactomics
- Behavioromics
- Variomics
- Pharmacogenomics
22Functional Genomics
Genomics
Transcriptomics
Proteomics
Metabolomics
23Metabolomics
- Metabolites are all the small molecules in a cell
(e.g., ATP, sugar, pyruvate, urea)
- 3,000 metabolites in the human body (compared to
35,000 genes and approximately 100,000 proteins)
- Most direct measure of cell physiology
- Uses GC/MS and LC/MS to obtain measurements
24Data
- Currently only have GC/MS information
- Missing values are very informative (below
detection limits)
- Imputed data using uniform random variables from
0 to the minimum value
- 105 metabolites
- 58 individuals (42 disease 1, 6 disease 2,
and 10 controls)
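The imputation step might look like the sketch below. The slide does not say whether "minimum value" means the minimum for each metabolite or for the whole data set; the sketch assumes the smallest observed value of each metabolite. The example intensities and the function name are hypothetical.

```python
import random

def impute_below_detection(values, rng):
    """Replace None entries with draws from Uniform(0, smallest observed value)."""
    observed = [v for v in values if v is not None]
    floor = min(observed)   # assumed: the minimum is taken per metabolite
    return [rng.uniform(0, floor) if v is None else v for v in values]

rng = random.Random(1)
metabolite = [12.4, None, 8.1, 15.0, None, 9.7]   # hypothetical intensities
filled = impute_below_detection(metabolite, rng)
print(filled)
```

Drawing from below the smallest observed value keeps the imputed entries consistent with being under the detection limit, while the random draw avoids creating a block of tied values.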
25Confusion matrix
1 2 3
1 40 1 8
2 0 5 1
3 2 0 1
OOB error: 20.69%
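The reported out-of-bag error can be checked directly from the matrix: it is the share of observations that fall off the diagonal. The variable names below are illustrative.

```python
# Confusion matrix as shown on the slide
confusion = [
    [40, 1, 8],
    [0, 5, 1],
    [2, 0, 1],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(3))
oob_error = (total - correct) / total
column_totals = [sum(row[j] for row in confusion) for j in range(3)]

print(total, correct, round(100 * oob_error, 2), column_totals)
# prints: 58 46 20.69 [42, 6, 10]
```

The column totals (42, 6, 10) match the group sizes given on the Data slide, which suggests that columns index the true class and rows the predicted class. That orientation is an inference from the counts, not something the slide states.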
26Outlier
27Variable Importance
28Visual Data
29Conclusions
- Random forests, support vector machines, and
neural networks are some of the newest algorithms
for understanding large datasets.
- There is still much more to be done.
30Thank you