Title: Robust diagnosis of DLBCL from gene expression data from different laboratories
1. Robust diagnosis of DLBCL from gene expression data from different laboratories
DIMACS - RUTCOR Workshop on Boolean and Pseudo-Boolean Functions in Memory of Peter L. Hammer, January 19-22, 2009
2. Peter L Hammer, Sorin Alexe, David E Axelrod (Rutgers University)
Gustavo Stolovitzky (IBM T.J. Watson Research)
Gyan Bhanot, Arnold J Levine (Institute for Advanced Study, Princeton)
David Weissmann (Cancer Institute of New Jersey)
3. Overview
- Motivation
- Pattern-based ensemble classifiers
- Case study: compare data from two labs for DLBCL vs FL diagnosis
  - Shipp et al. (2002) Nature Med. 8(1), 68-74 (Whitehead Lab)
  - Stolovitzky G. (2005) In Deisboeck et al., Complex Systems Science in BioMedicine (in press) (preprint: http://www.wkap.nl/prod/a/Stolovitzky.pdf) (Dalla-Favera Lab)
Alexe, Alexe, Axelrod, Hammer, Weissmann (2005) Artificial Intelligence in Medicine
Bhanot, Alexe, Stolovitzky, Levine (2005) Genome Informatics
4. Non-Hodgkin lymphomas
- FL: low-grade non-Hodgkin lymphoma; no cure if advanced stage
  - Second most frequent subtype of nodal lymphoid malignancies
  - Incidence has risen from 2-3 to more than 5-7 per 100,000/year (1950 to 2000)
  - t(14;18) translocation: over-expression of anti-apoptotic bcl2
  - 25-60% of FL cases evolve to DLBCL
- DLBCL: high-grade non-Hodgkin lymphoma; high variability in response to treatment
  - Most frequent subtype of NHL
  - < 2 years survival if untreated
- Biomarkers of FL transformation to DLBCL
  - p53/MDM2 (Moller et al., 1999)
  - p16 (Pinyol, 1998)
  - p38MAPK (Elenitoba-Johnson et al., 2003)
  - c-myc (Lossos et al., 2002)
5. Gene arrays
- Gene arrays are a way to study the variation of mRNA levels between different types of cells.
- This allows diagnosis and inference of pathways that cause disease, including early-stage diagnosis.
- Identify molecular profiles of disease: personalized medicine
6. Lymphoma datasets
- Data
  - WI (Shipp et al., 2002): Affy HuGeneFL
  - CU (Dalla-Favera Lab; Stolovitzky, 2005): Affy Hu95Av2
- Samples
  - WI: 58 DLBCL, 19 FL
  - CU: 14 DLBCL, 7 FL
- Genes
  - WI: 6817
  - CU: 12581
7. Diagnosis problem
- Input
  - Training (biomedical) data
  - 2 classes: FL and DLBCL
  - m samples described by N >> m features
- Output
  - Collection of robust biomarkers, models
  - Robust, accurate classifier, tested on out-of-sample data
9. Patterns (Logical Analysis of Data, Hammer 1988)
Positive Patterns
Negative Patterns
Model
- Exhaustive collections of patterns
- Pattern space
- Classification / attribute analysis / new class identification
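To make the notion of a pattern concrete: in Logical Analysis of Data a pattern is usually described as a conjunction of threshold conditions on a few attributes, and a positive pattern covers some positive cases while covering no (or very few) negative ones. The sketch below illustrates that idea only. The class names `Condition` and `Pattern`, the gene names `g1` and `g2`, the cutpoints, and the toy samples are all hypothetical and are not taken from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    gene: str        # hypothetical gene identifier
    cutpoint: float  # hypothetical expression threshold
    above: bool      # True: expression > cutpoint, False: expression <= cutpoint

    def holds(self, sample: dict) -> bool:
        value = sample[self.gene]
        return value > self.cutpoint if self.above else value <= self.cutpoint


@dataclass(frozen=True)
class Pattern:
    conditions: tuple  # conjunction of Condition objects

    def covers(self, sample: dict) -> bool:
        return all(c.holds(sample) for c in self.conditions)


def prevalence_and_homogeneity(pattern, positives, negatives):
    """Prevalence: share of positive cases covered.
    Homogeneity: share of covered cases that are positive."""
    pos_cov = sum(pattern.covers(s) for s in positives)
    neg_cov = sum(pattern.covers(s) for s in negatives)
    covered = pos_cov + neg_cov
    prevalence = pos_cov / len(positives)
    homogeneity = pos_cov / covered if covered else 0.0
    return prevalence, homogeneity


# Toy expression values (hypothetical), DLBCL taken as the positive class.
dlbcl = [{"g1": 900.0, "g2": 100.0}, {"g1": 1200.0, "g2": 150.0}, {"g1": 300.0, "g2": 80.0}]
fl = [{"g1": 200.0, "g2": 400.0}, {"g1": 850.0, "g2": 500.0}]

# Candidate positive pattern: g1 > 800 AND g2 <= 200.
p1 = Pattern((Condition("g1", 800.0, True), Condition("g2", 200.0, False)))
prev, homog = prevalence_and_homogeneity(p1, dlbcl, fl)
```

Here `p1` covers two of the three toy DLBCL cases and none of the FL cases, so it would count as a pure positive pattern on this toy data. A negative pattern is defined symmetrically with the classes swapped.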
10. Data Preprocessing
- 50% P calls, UL 16000, LL 20
- Stratify WI data 2/1 into train/test; CU data used for testing
- Normalize data to median 1000 per array
- Generate 500 data sets using noise, k-fold stratified sampling, and jackknife
- Find genes with high correlation to phenotype using t-test or SNR. Keep genes that are in > 90% of datasets
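The preprocessing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: UL and LL are read as upper and lower clipping limits, SNR is taken as (mean difference) / (sum of standard deviations), and the resampling simply drops about one third of each class per round as a stand-in for the noise / k-fold / jackknife scheme. The function names, `top_k`, and the toy data are hypothetical.

```python
import random
import statistics
from itertools import permutations

LL, UL, TARGET_MEDIAN = 20.0, 16000.0, 1000.0  # limits and target median from the slide


def preprocess_array(values):
    """Clip one array to [LL, UL], then rescale it so that its median is 1000."""
    clipped = [min(max(v, LL), UL) for v in values]
    scale = TARGET_MEDIAN / statistics.median(clipped)
    return [v * scale for v in clipped]


def snr(xs, ys):
    """Signal-to-noise ratio (mean_x - mean_y) / (sd_x + sd_y); 0.0 if both sds are 0."""
    denom = statistics.stdev(xs) + statistics.stdev(ys)
    return (statistics.mean(xs) - statistics.mean(ys)) / denom if denom else 0.0


def stable_genes(class_a, class_b, n_resamples=500, n_folds=3, top_k=2,
                 keep_frac=0.9, seed=0):
    """Return gene indices ranked in the top_k by |SNR| in more than keep_frac of resamples.

    Assumption: each resample keeps about (n_folds - 1) / n_folds of every class
    (stratified), which only approximates the scheme named on the slide.
    """
    rng = random.Random(seed)
    n_genes = len(class_a[0])
    counts = [0] * n_genes
    for _ in range(n_resamples):
        sub_a, sub_b = (rng.sample(rows, max(2, len(rows) - len(rows) // n_folds))
                        for rows in (class_a, class_b))
        scores = [abs(snr([r[g] for r in sub_a], [r[g] for r in sub_b]))
                  for g in range(n_genes)]
        top = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)[:top_k]
        for g in top:
            counts[g] += 1
    return [g for g in range(n_genes) if counts[g] / n_resamples > keep_frac]


# Toy data (hypothetical): genes 0 and 1 separate the classes, genes 2-4 do not.
noise = list(permutations((950.0, 1000.0, 1050.0)))  # 6 orderings of uninformative values
dlbcl_raw = [[a, b, *n] for a, b, n in zip([1900, 2000, 2100, 1950, 2050, 2000],
                                           [450, 500, 400, 480, 520, 450], noise)]
fl_raw = [[a, b, *n] for a, b, n in zip([500, 450, 550, 500, 480, 520],
                                        [1500, 1450, 1550, 1500, 1480, 1520], noise)]
dlbcl = [preprocess_array(s) for s in dlbcl_raw]
fl = [preprocess_array(s) for s in fl_raw]
selected = stable_genes(dlbcl, fl)
```

On this toy data only genes 0 and 1 survive the stability filter. A t-test statistic could be substituted for `snr` without changing the rest of the sketch.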
11. Choosing support sets
- Create quality patterns using small subsets of genes, validate using weighted voting with 10-fold cross validation
- Sort genes by their appearance in good patterns
- Select top genes to cover each sample by at least 10 patterns
Alexe, Alexe, Hammer, Vizvari (2005)
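One way to read the selection step above is as a greedy procedure: rank genes by how often they occur in good patterns, then add genes in that order until every sample is covered by enough patterns built only from the selected genes. The sketch below follows that reading and is an assumption, not the published algorithm. The function name `choose_support_set`, the pattern representation, and the toy patterns are hypothetical.

```python
from collections import Counter


def choose_support_set(patterns, samples, min_cover=10):
    """Greedy support-set selection.

    patterns: list of (genes, covered) pairs, where genes is the set of genes a
              pattern uses and covered is the set of sample ids it covers.
    samples:  iterable of all sample ids that must be covered.
    Returns the selected genes in rank order, or None if the coverage target
    cannot be reached even with every gene.
    """
    # Rank genes by the number of patterns they appear in (ties broken by name).
    freq = Counter(g for genes, _ in patterns for g in genes)
    ranked = [g for g, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))]

    selected = set()
    for i, gene in enumerate(ranked):
        selected.add(gene)
        # Count, per sample, the patterns that use only selected genes.
        cover = Counter()
        for genes, covered in patterns:
            if genes <= selected:
                cover.update(covered)
        if all(cover[s] >= min_cover for s in samples):
            return ranked[: i + 1]
    return None


# Toy patterns (hypothetical): gene sets and the sample ids each pattern covers.
toy_patterns = [
    ({"A", "B"}, {1, 2}),
    ({"A"}, {1}),
    ({"B", "C"}, {2, 3}),
    ({"A", "C"}, {1, 3}),
    ({"D"}, {3}),
]
support = choose_support_set(toy_patterns, samples={1, 2, 3}, min_cover=2)
```

With the toy patterns and a coverage target of 2, genes A, B and C suffice and D is never needed. The slide's target corresponds to `min_cover=10`.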
12. The 30 genes that best distinguish FL from DLBCL
13. Genes identified by LAD (AIIM 2005) to distinguish DLBCL from FL
14. Examples of FL and DLBCL patterns
WI training data: each DLBCL case satisfies at least one of the patterns P1 and P2. Each FL case satisfies the pattern N1 (and none of the patterns P1 and P2).
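As an illustration of how patterns such as P1, P2 and N1 can be turned into a classifier, the sketch below re-encodes each sample as a 0/1 vector of pattern memberships ("pattern data") and classifies by a weighted vote of positive against negative patterns. The pattern definitions, gene names, cutpoints, and the equal-weight voting rule are hypothetical placeholders; the actual patterns from the talk are not reproduced in the transcript.

```python
# Hypothetical stand-ins for the DLBCL patterns P1, P2 and the FL pattern N1.
positive_patterns = [
    lambda s: s["g1"] > 800,                     # "P1"
    lambda s: s["g2"] <= 100 and s["g3"] > 500,  # "P2"
]
negative_patterns = [
    lambda s: s["g1"] <= 800 and s["g2"] > 100,  # "N1"
]


def to_pattern_data(samples, patterns):
    """Re-encode each sample as a 0/1 row: 1 if the sample satisfies the pattern."""
    return [[int(p(s)) for p in patterns] for s in samples]


def vote(row, n_pos):
    """Equal-weight vote: share of positive patterns hit minus share of negative ones."""
    pos, neg = row[:n_pos], row[n_pos:]
    score = sum(pos) / len(pos) - sum(neg) / len(neg)
    if score > 0:
        return "DLBCL"
    if score < 0:
        return "FL"
    return "unclassified"


# Toy samples (hypothetical expression values).
samples = [
    {"g1": 900, "g2": 150, "g3": 300},  # satisfies P1 only
    {"g1": 400, "g2": 80, "g3": 700},   # satisfies P2 only
    {"g1": 300, "g2": 400, "g3": 200},  # satisfies N1 only
]
rows = to_pattern_data(samples, positive_patterns + negative_patterns)
labels = [vote(r, n_pos=len(positive_patterns)) for r in rows]
```

The rows of `rows` are the pattern-space representation of the samples. Any standard classifier could be trained on them in place of the simple vote used here.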
15. Pattern data
16. Meta-classifier performance
17. Error distribution: raw and pattern data
18. Biology based method
19. p53 related genes identified by filtering procedure
FL → DLBCL progression
20. p53 pattern data
21. Examples of p53 responsive genes patterns
WI data: each DLBCL case satisfies one of the patterns P1, P2, P3. Each FL case satisfies one of the patterns N1, N2, N3.
22. p53 combinatorial biomarker
- At most one gene over-expressed: 77% of FL vs 21% of DLBCL cases (3.7 fold)
- At least two genes over-expressed: 79% of DLBCL vs 23% of FL cases (3.4 fold)
- Each individual gene is over-expressed in about 40-70% of DLBCL and 20-40% of FL cases (specificity 50-60%, sensitivity 60-70%)
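The combinatorial rule above amounts to counting how many of the marker genes are over-expressed in a case and calling it DLBCL-like when the count is at least two. A minimal sketch follows. The cutpoints and toy samples are hypothetical (the slide does not state how over-expression is defined), and the gene keys assume the three genes named later in the talk. The fold values are simply the ratios of the percentages quoted above.

```python
def n_overexpressed(sample, cutpoints):
    """Number of marker genes whose expression exceeds its cutpoint."""
    return sum(sample[gene] > cut for gene, cut in cutpoints.items())


def biomarker_call(sample, cutpoints):
    """At least two over-expressed genes: DLBCL-like; at most one: FL-like."""
    return "DLBCL-like" if n_overexpressed(sample, cutpoints) >= 2 else "FL-like"


def fold(rate_a, rate_b):
    """Fold difference between two rates, e.g. percentages of cases."""
    return rate_a / rate_b


# Hypothetical over-expression cutpoints; not taken from the talk.
cutpoints = {"Plk1": 1000.0, "Cdk2": 1000.0, "p53": 1000.0}

case_a = {"Plk1": 1500.0, "Cdk2": 1200.0, "p53": 800.0}  # two genes over-expressed
case_b = {"Plk1": 1500.0, "Cdk2": 700.0, "p53": 800.0}   # one gene over-expressed
calls = [biomarker_call(case_a, cutpoints), biomarker_call(case_b, cutpoints)]

fold_fl = round(fold(77, 21), 1)     # at most one gene: FL vs DLBCL
fold_dlbcl = round(fold(79, 23), 1)  # at least two genes: DLBCL vs FL
```

The two ratios reproduce the 3.7-fold and 3.4-fold figures on the slide, which is a useful consistency check on the reported percentages.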
23. What are these genes?
- Plk1 (stpk13): polo-like kinase, serine/threonine protein kinase 13, M-phase specific
  - Cell transformation, neoplastic; drives quiescent cells into mitosis
  - Over-expressed in various human tumors
  - Takai et al., Oncogene, 2005: plk1 as a potential target for cancer therapy and a new prognostic marker for cancer
  - Mito et al., Leuk Lymph, 2005: plk1 as a biomarker for DLBCL
- Cdk2 (p33): cyclin-dependent kinase; G2/M transition of mitotic cell cycle; interacts with cyclins A, B3, D, E
- P53: tumor suppressor gene (Levine 1982)
24. Conclusions
- Pattern-based meta-classifier is robust against noise
- Good prediction of FL → DLBCL
- Biology based analysis also possible
- Yields useful biomarker
- Should study biologically motivated sets of genes → build pathways
25. Thank you for your attention!