Title: Analysis of Gene Expression Metadata In Studies of HNSCC and LSCC Louise Showe Molecular and Cellula
1 Analysis of Gene Expression Metadata in Studies of HNSCC and LSCC
Louise Showe
Molecular and Cellular Oncogenesis Program
2
- Cancer Functional Genomics
- Gene Expression Microarrays
- Classification, Diagnosis, Prognosis, Response to Treatment
- Who will analyze the data?
3 Molecular Biology and Computational Biology
5 Data Pre-processing
6 Classification and Class Discovery
7 Biomarker Selection
8 Recursive Feature Elimination (RFE) Iteratively Discards Genes That Contribute Least
[Diagram: two feature-selection pipelines]
- Univariate Filter Approach: Data, Univariate Feature Selection, Learning Machine
- Multivariate Wrapper Approach: Data, Learning Machine, Multivariate Feature Selection
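The elimination loop named on this slide can be sketched as follows. This is a minimal illustration, not the PDA-RFE procedure used in the study: a small logistic regression stands in for the learning machine, the toy data are idealized, and the names `train_logistic` and `rfe` are hypothetical.

```python
import math


def train_logistic(X, y, epochs=200, lr=0.5, l2=0.01):
    """Fit an L2-regularised logistic regression by batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            err = 1.0 / (1.0 + math.exp(-z)) - label
            for j in range(d):
                grad_w[j] += err * row[j]
            grad_b += err
        w = [wj - lr * (gj / n + l2 * wj) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def rfe(X, y, n_keep):
    """Retrain on the surviving features and drop the one with the smallest |weight|."""
    active = list(range(len(X[0])))
    while len(active) > n_keep:
        X_sub = [[row[j] for j in active] for row in X]
        w = train_logistic(X_sub, y)
        worst = min(range(len(active)), key=lambda k: abs(w[k]))
        del active[worst]
    return active


# Toy data (assumption): features 0 and 2 track the class label,
# features 1 and 3 are balanced across classes and carry no signal.
X = [
    [-1.0, 1.0, -0.5, 1.0], [-1.0, -1.0, -0.5, 1.0],
    [-1.0, 1.0, -0.5, -1.0], [-1.0, -1.0, -0.5, -1.0],
    [1.0, 1.0, 0.5, 1.0], [1.0, -1.0, 0.5, 1.0],
    [1.0, 1.0, 0.5, -1.0], [1.0, -1.0, 0.5, -1.0],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(rfe(X, y, n_keep=2))
```

Because the model is retrained after every removal, a gene is judged by its contribution alongside the other surviving genes, which is what distinguishes the wrapper approach from a one-gene-at-a-time filter.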
9 Resampling Methods
- Goals
- estimate the value and uncertainty of model parameters
- identify the best classification model for the dataset
- boost the probability of correctly classifying hard samples
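The first goal, estimating a value together with its uncertainty, can be illustrated with a bootstrap. The sketch below is an assumption-laden example: the expression values are invented, and `bootstrap_mean_diff` is a hypothetical helper, not a method named in the talk.

```python
import random
import statistics


def bootstrap_mean_diff(group_a, group_b, n_boot=1000, seed=0):
    """Bootstrap the difference in group means; return (estimate, lower, upper)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        res_a = [rng.choice(group_a) for _ in group_a]
        res_b = [rng.choice(group_b) for _ in group_b]
        diffs.append(statistics.mean(res_a) - statistics.mean(res_b))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return statistics.mean(diffs), lower, upper


# Invented log2 expression values for one gene in two tumour classes.
class_a = [8.1, 7.9, 8.4, 8.0, 8.3]
class_b = [6.2, 6.0, 6.5, 6.1, 6.4]
estimate, lower, upper = bootstrap_mean_diff(class_a, class_b)
print(f"difference = {estimate:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

The width of the interval gives a rough sense of how much the estimate would move if the study were repeated with a different sample of patients.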
10 Biomedical Problem
- Patients with previous HNSCC are at risk for both lung metastases and new primary lung cancers.
- Distinction between a primary LSCC and an HNSCC lung metastasis can be very difficult.
- Clinical dilemma: depending on the origin, patients have drastically different treatment options and prognoses.
11 Can We Distinguish Primary Lung SCC Tumors From Head and Neck Derived SCC Metastasis?
12 Study Schema
18 HNSCC and 10 LSCC (Training Set For Gene Selection on U133A)
Identify Best Diagnostic Genes
Validate Genes On Independent Data Sets: Data for 122 Samples from 4 Sources, 2 Different Affy Chips
Test On Lung Nodules From Patients With Previous History Of HNSCC (MSKCC)
13 Top genes by PDA-RFE
Top PDA genes have distinct profiles. Top genes selected by t-test alone have a common profile.
14 Penn Data Set U133A
10 Genes Can Accurately Classify Penn Samples
15 Summary
- Excellent classification accuracy on small data set.
- Could it be validated on new samples?
16 External Validation Datasets
- U133A
- 40 HNSCC Samples from Minnesota (MN)
- U95Av2
- 21 LSCC Samples from Dana-Farber (DF)
- 11 LSCC Samples from Columbia (CU)
- Note: Only the 9530 probe sets common between U95Av2 and U133A were considered.
- Raw data were re-analyzed (RMA).
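Restricting two chip types to their shared probe sets might look like the sketch below. It assumes the probe sets have already been mapped to a shared identifier, since the slide does not say how the 9530 common probe sets were matched; the IDs, values, and the helper `restrict_to_common_probes` are hypothetical.

```python
def restrict_to_common_probes(expr_a, expr_b):
    """Keep only probe sets present on both chips and join the samples per probe.

    expr_a, expr_b: dict mapping probe-set ID -> list of values, one per sample.
    """
    common = sorted(set(expr_a) & set(expr_b))
    merged = {probe: expr_a[probe] + expr_b[probe] for probe in common}
    return common, merged


# Invented example: two samples on one chip type, three on the other.
chip_u133a = {"p1": [7.1, 7.3], "p2": [5.0, 5.2], "p3": [9.4, 9.1]}
chip_u95av2 = {"p2": [4.8, 5.1, 5.0], "p3": [9.0, 9.3, 9.2], "p4": [6.6, 6.4, 6.5]}

common, merged = restrict_to_common_probes(chip_u133a, chip_u95av2)
print(common)
print(merged["p2"])
```

Merging on the intersection keeps every sample but discards probes that only one platform measures, which is why the working set is smaller than either chip.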
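17 Unsupervised Clustering Shows Strong Systematic Bias Due To Chip Type And Institution
[Figure: unsupervised clustering of samples. Sample groups: LC.DF, LC.CU, HN.MN, HN.UP, LC.UP. Chip types: U95Av2, U133A. Institutions: Penn, Minnesota, CU, Dana-Farber. Cancer types: Head and Neck, Lung.]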
18 DWD Visualization
Courtesy of S. Marron, UNC
19 DWD Correction
- Correct Bias Due To Different Hybridization Sites
- 1. Merge Dana-Farber Set with Columbia Set (same chip U95, same phenotype LSCC)
- 2. Merge Minnesota Set with Penn Head and Neck Set (same chip U133, same phenotype HNSCC)
- Correct Bias Due To Affy Chip Type And Cancer Type
- 3. Merge Penn LSCC Set with combined Dana-Farber and Columbia Sets (different chips U133 vs. U95, same phenotype)
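Each merge step above removes a systematic offset between two batches. The sketch below shows only the shifting idea: it uses the difference of batch means as the separating direction, which is a simple stand-in and not the direction that DWD (distance weighted discrimination) would compute. The data and the helper names are hypothetical.

```python
import math


def column_means(rows):
    """Mean of every column (gene) over the rows (samples)."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def shift_batches(batch_a, batch_b):
    """Move each batch along the separating direction so its projected mean is zero."""
    mean_a, mean_b = column_means(batch_a), column_means(batch_b)
    delta = [a - b for a, b in zip(mean_a, mean_b)]
    norm = math.sqrt(sum(v * v for v in delta))
    if norm == 0.0:  # batches already share a mean; nothing to correct
        return [row[:] for row in batch_a], [row[:] for row in batch_b]
    direction = [v / norm for v in delta]  # stand-in for the DWD direction

    def recentre(batch):
        projections = [sum(x * d for x, d in zip(row, direction)) for row in batch]
        offset = sum(projections) / len(batch)
        return [[x - offset * d for x, d in zip(row, direction)] for row in batch]

    return recentre(batch_a), recentre(batch_b)


# Invented two-gene example with an obvious offset in the first gene.
site_1 = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
site_2 = [[6.0, 2.0], [7.0, 3.0], [8.0, 1.0]]
adj_1, adj_2 = shift_batches(site_1, site_2)
print(column_means(adj_1), column_means(adj_2))
```

Only the component along the separating direction is changed, so variation within each batch in all other directions is left untouched.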
20 DWD Corrected For Systematic Bias
[Figure: clustering after DWD correction, with Head and Neck and Lung groups]
Samples cluster by cancer type and not by chip type or institution
21 Reduction in systematic bias improved global correlation
[Figure panels: Before Correction, After Correction]
22 Reduction in systematic bias regularized classification
[Figure panels: After DWD, Before DWD]
23 10-Gene Classification of Independent Test Set
[Figure: classification of Head and Neck and Lung samples from Dana-Farber, CU, and Minnesota]
24 Study at MSKCC
- Training Set
- 52 subjects
- 31 HNSCC
- 18 LSCC
- Test Set
- 12 lung nodules in patients with prior HNSCC
Talbot et al., Cancer Res (2005) 65(8)
25 Methods used
- Gene Selection by t-test
- Classification by Support Vector Machines (SVM)
- Accuracy by Leave-one-out cross-validation (LOOCV)
Main Conclusions
- A minimum set of 500 genes is needed for robust classification
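The pipeline listed on this slide (t-test selection, a classifier, LOOCV) can be sketched in a few lines. This is an illustration under assumptions: a nearest-centroid rule stands in for the SVM, the data are invented, and gene selection is repeated inside each fold so the held-out sample never influences which genes are chosen. All function names are hypothetical.

```python
import statistics


def welch_t(a, b):
    """Welch t-statistic for two groups of values."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se if se > 0 else 0.0


def top_genes(X, y, k):
    """Indices of the k genes with the largest |t| between class 0 and class 1."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        scores.append(abs(welch_t(a, b)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def predict(X, y, genes, sample):
    """Nearest-centroid prediction on the selected genes (stand-in for an SVM)."""
    best_label, best_dist = None, float("inf")
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [statistics.mean(r[j] for r in rows) for j in genes]
        dist = sum((sample[j] - c) ** 2 for j, c in zip(genes, centroid))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


def loocv_accuracy(X, y, k):
    """Leave one sample out, select genes on the rest, then classify the held-out sample."""
    hits = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(X_tr, y_tr, k)
        hits += predict(X_tr, y_tr, genes, X[i]) == y[i]
    return hits / len(X)


# Invented data: gene 0 separates the classes, genes 1 and 2 do not.
X = [[1.0, 5.0, 3.1], [1.2, 4.0, 2.9], [0.9, 6.0, 3.0], [1.1, 5.5, 3.2],
     [3.0, 5.2, 3.0], [3.2, 4.4, 3.1], [2.9, 5.8, 2.8], [3.1, 5.1, 3.3]]
y = [0, 0, 0, 0, 1, 1, 1, 1]
print(top_genes(X, y, 1), loocv_accuracy(X, y, 1))
```

Selecting genes once on the full data set and then cross-validating would let the test sample leak into the selection step and inflate the reported accuracy.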
26 10-gene Classification of Talbot et al. Samples
[Figure: Head and Neck vs. Lung]
27 10 genes on MSKCC Samples
28 PDA Classification of 50 New Lung and 72 New HN Samples Is 96% Accurate With 10 Genes
29 Additional Test Set: Lung Nodules from patients with prior HNSCC
- 13 samples from patients with prior HNSCC
- 11 samples clinically classified as primary lung cancer (U01-U11)
- 2 samples (lung nodule and pancreatic nodule from the same patient) classified as metastases (U12 and U13)
30 10-gene classifier predictions agree with the clinical assessment
Based on data from Talbot et al., Cancer Res. 2005
31 10 genes separate HNSCC and adjacent tissue
[Figure: HNSCC vs. Adjacent Tissue (donor matched)]
32 Validation by QRT-PCR
33 Expression Ratios From QPCR For Selected Gene Pairs Correctly Classify New HNSCC and LSCC Samples
9 7
Vachani, Nebozhyn et al. Submitted to Cancer Research
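A gene-pair ratio classifier of the kind this slide describes could be sketched as below. Everything specific here is an assumption: the gene names, Ct values, majority-vote rule, and zero threshold are invented for illustration, and the log2 ratio is taken as a Ct difference, which presumes roughly 100% amplification efficiency.

```python
def log2_ratio(ct_a, ct_b):
    """log2(expression of A / expression of B) from QPCR cycle thresholds.

    A lower Ct means more transcript, so the ratio is Ct_B - Ct_A
    when amplification efficiency is close to 100% (assumption).
    """
    return ct_b - ct_a


def classify(sample, pairs, threshold=0.0):
    """Majority vote over gene pairs; each pair votes HNSCC if its log2 ratio exceeds the threshold."""
    votes = sum(1 for up, down in pairs
                if log2_ratio(sample[up], sample[down]) > threshold)
    return "HNSCC" if votes * 2 > len(pairs) else "LSCC"


# Hypothetical gene pairs and Ct values.
pairs = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_E", "GENE_F")]
sample_1 = {"GENE_A": 22.0, "GENE_B": 25.0, "GENE_C": 20.0,
            "GENE_D": 24.0, "GENE_E": 26.0, "GENE_F": 25.0}
sample_2 = {"GENE_A": 27.0, "GENE_B": 24.0, "GENE_C": 25.0,
            "GENE_D": 23.0, "GENE_E": 22.0, "GENE_F": 26.0}
print(classify(sample_1, pairs), classify(sample_2, pairs))
```

Ratios within a sample do not depend on the absolute signal scale, which is one reason pairwise ratios transfer more easily from microarrays to QPCR than raw expression values.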
34 Summary
- Microarray analysis enables diagnostics for HNSCC and LSCC with just a few genes
- Using different analysis methods can make a big difference in the results: 10 genes vs. 500 for HNSCC
- Selection of biomarkers by RFE is much superior to t-test
- When combining data from separate experiments, the observed batch effect needs to be addressed and alleviated
- Resampling and validation on independent data sets and experimental platforms are crucial for assessing the reliability and reproducibility of the results
- Close clinical, biological and mathematical collaboration is essential for proper design and analysis of experiments
35
- Wistar
- Michael Showe
- Michael Nebozhyn
- Malik Yousef
- Wen-Hwai Horng
- Linda Alila
- UNC
- Steve Marron
- Everett Zhou
- Xuxin Liu
- Univ of Pennsylvania
- Steve Albelda
- Anil Vachani
- Charles Powell (Columbia)
- Patrick Gaffney (U. of Minn)
- Bhuvanesh Singh (MSKCC)
- Matt Meyerson (Harvard)
- Ruth Muschel (Penn)