Robust diagnosis DLBCL from gene expression data from different laboratories

1
Robust diagnosis DLBCL from gene expression data
from different laboratories
Gyan Bhanot, IBM Research
DIMACS Workshop, June 22, 2005
2
  • Collaborators
  • Gabriela Alexe (1)
  • Arnold Levine (2, 3)
  • Gustavo Stolovitzky (1)

(1) IBM, (2) IAS, (3) UMDNJ
3
Overview
  • Motivation
  • Pattern-based meta-classifiers
  • Case study: compare data from two labs for DLBCL
    vs FL diagnosis

4
Motivation: Cancer is a genetic/proteomic disease
  • Genetic mutations/viruses/radiation modify
    pathways to create a survival advantage for the
    damaged cell
  • Gene Arrays are a way to study the variation of
    mRNA levels between diseased and healthy cells.
  • This allows diagnosis and inference of pathways
    that cause disease

5
Cancer diagnosis
  • Input
  • Training (biomedical / proteomic, microarray)
    data
  • k ≥ 2 classes (m samples) described by N >> m
    features
  • Output
  • Collection of robust biomarkers, models
  • Robust, accurate classifier / tested on
    out-of-sample data

6
(No Transcript)
7
Strategy of present paper
  • 1. Transform original data to pattern space
  • 2. Find robust sets of biomarkers with
    significant collective discriminatory power
  • 3. Use many machine learning tools on original
    and pattern data
  • ANN, SVM, kNN, Weighted voting, Classification
    trees
  • 4. Validate the results on data from a different
    lab

8
Patterns
Observed dataset
System response
9
Pattern basics
Positive patterns
Negative patterns
10
Individual classifiers used
  • SVM, ANN, WV, KNN, CART, LR
  • Trained / calibrated (leave-one-out)
  • raw data
  • pattern data
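
The leave-one-out calibration mentioned above can be sketched as follows. This is only an illustration: the nearest-centroid classifier is a stand-in chosen for brevity (it is not one of the classifiers listed on the slide), and the function names and toy data are hypothetical.

```python
from math import dist

def nearest_centroid_fit(X, y):
    """Return {label: centroid} computed from the training samples."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Assign x to the class whose centroid is closest (Euclidean)."""
    return min(centroids, key=lambda label: dist(centroids[label], x))

def leave_one_out_accuracy(X, y):
    """Hold out each sample once, train on the rest, test on the held-out one."""
    correct = 0
    for i in range(len(X)):
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(X_train, y_train)
        correct += nearest_centroid_predict(model, X[i]) == y[i]
    return correct / len(X)

# Toy data (hypothetical): two well-separated groups of samples.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [2.0, 2.1], [2.2, 1.9], [1.9, 2.3]]
y = ["FL", "FL", "FL", "DLBCL", "DLBCL", "DLBCL"]
loo_acc = leave_one_out_accuracy(X, y)
```

The same loop applies unchanged to raw data or pattern data; only the feature vectors differ.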

11
Application: Progression of Follicular Lymphoma
(FL) to Diffuse Large B Cell Lymphoma (DLBCL)
Gene Array data from different laboratories:
Shipp et al. (2002) Nature Med. 8(1), 68-74. (Whitehead Lab)
Stolovitzky G. (2005) In Deisboeck et al., Complex Systems Science in BioMedicine (in press) (preprint http://www.wkap.nl/prod/a/Stolovitzky.pdf). (Dalla-Favera Lab)
Alexe et al. (2005) Artificial Intelligence in Medicine (in press)
12
Non-Hodgkin lymphomas
  • FL: low-grade non-Hodgkin lymphoma
  • t(14;18) translocation: over-expression of
    anti-apoptotic bcl2
  • 25-60% of FL cases evolve to DLBCL
  • DLBCL: high-grade non-Hodgkin lymphoma
  • < 2 years survival if untreated
  • Biomarkers of FL transformation to DLBCL:
  • p53/MDM2 (Moller et al., 1999)
  • p16 (Pinyol, 1998)
  • p38MAPK (Elenitoba-Johnson et al., 2003)
  • c-myc (Lossos et al., 2002)

13
Lymphoma datasets
  • Data: WI (Shipp et al., 2002): Affy HuGeneFL
  • CU (Dalla-Favera Lab, Stolovitzky, 2005): Affy
    Hu95Av2
  • Samples:
  • WI: 58 DLBCL, 19 FL
  • CU: 14 DLBCL, 7 FL
  • Genes:
  • WI: 6817
  • CU: 12581

14
Data Preprocessing
  • 50% P calls, UL 16000, LL 20
  • Stratify WI data 2/1 into train/test; CU data
    used as test data only
  • Compute SD per gene across samples
  • Normalize data to mean 0, SD 1 per gene
  • Generate 500 data sets using noise + k-fold
    stratified sampling + jackknife
  • Find genes with high correlation to phenotype
    using t-test or SNR. Keep genes that are in >
    450/501 of datasets
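
A minimal sketch of the clipping, per-gene normalization, and stability filter described above. Several points are assumptions rather than the authors' procedure: SNR is taken as the difference of class means divided by the sum of class standard deviations (population SD), the resampling uses only a leave-one-out jackknife instead of the noise + k-fold + jackknife scheme, and all function names and the toy data are hypothetical.

```python
from statistics import mean, pstdev

def clip_expression(value, lower=20.0, upper=16000.0):
    """Apply the lower limit (LL) and upper limit (UL) to one expression value."""
    return max(lower, min(upper, value))

def zscore_rows(genes):
    """Normalize each gene (one row per gene) to mean 0, SD 1 across samples."""
    out = []
    for row in genes:
        mu, sd = mean(row), pstdev(row)
        out.append([(v - mu) / sd if sd > 0 else 0.0 for v in row])
    return out

def snr(row, labels, pos="DLBCL"):
    """Signal-to-noise ratio of one gene: (mean_pos - mean_neg) / (sd_pos + sd_neg)."""
    a = [v for v, lab in zip(row, labels) if lab == pos]
    b = [v for v, lab in zip(row, labels) if lab != pos]
    denom = pstdev(a) + pstdev(b)
    return (mean(a) - mean(b)) / denom if denom > 0 else 0.0

def stable_genes(genes, labels, top_n, min_fraction):
    """Keep genes ranked in the top_n by |SNR| in at least min_fraction of jackknife resamples."""
    counts = [0] * len(genes)
    n = len(labels)
    for leave_out in range(n):
        keep = [j for j in range(n) if j != leave_out]
        sub_labels = [labels[j] for j in keep]
        scores = [abs(snr([row[j] for j in keep], sub_labels)) for row in genes]
        ranked = sorted(range(len(genes)), key=lambda g: scores[g], reverse=True)
        for g in ranked[:top_n]:
            counts[g] += 1
    return [g for g, c in enumerate(counts) if c / n >= min_fraction]

# Toy data (hypothetical): 3 genes x 6 samples, first three samples are FL.
labels = ["FL", "FL", "FL", "DLBCL", "DLBCL", "DLBCL"]
raw = [
    [100, 120, 110, 900, 950, 1000],   # strongly discriminating gene
    [300, 310, 290, 305, 295, 300],    # uninformative gene
    [5, 30000, 200, 210, 190, 205],    # outliers that get clipped
]
clipped = [[clip_expression(v) for v in row] for row in raw]
z = zscore_rows(clipped)
stable = stable_genes(z, labels, top_n=1, min_fraction=0.9)
```

The 0.9 threshold loosely mirrors the 450/501 cutoff on the slide; with real data the resampling scheme and the number of genes kept would need to follow the original analysis.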

15
Choosing Support Sets
  • Create good patterns using small subsets of
    genes, validate using weighted voting with
    10-fold cross-validation
  • Sort genes by their appearance in good patterns
  • Select top genes to cover each sample by at least
    10 patterns
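
One way to read the selection step above is as a greedy cover: rank genes by how often they appear in good patterns, then add genes in that order until every sample is covered by enough patterns built only from selected genes. The sketch below follows that reading, which is an assumption; the data layout (each pattern as a pair of gene set and covered-sample set) and the function name are hypothetical.

```python
from collections import Counter

def choose_support_set(patterns, n_samples, min_cover):
    """patterns: list of (genes, covered_samples) pairs, both given as sets."""
    # Sort genes by their number of appearances in good patterns.
    freq = Counter(g for genes, _ in patterns for g in genes)
    ranked = [g for g, _ in freq.most_common()]
    selected = set()
    for gene in ranked:
        selected.add(gene)
        # Count, per sample, the patterns that use only selected genes.
        cover = [0] * n_samples
        for genes, covered in patterns:
            if genes <= selected:
                for s in covered:
                    cover[s] += 1
        if min(cover) >= min_cover:
            return selected
    return selected  # requirement not reachable: every gene was needed

# Toy example (hypothetical): 4 samples, 5 patterns over 3 genes.
patterns = [
    ({"g1"}, {0, 1}),
    ({"g1", "g2"}, {2, 3}),
    ({"g2"}, {0, 2}),
    ({"g1", "g2"}, {1, 3}),
    ({"g3"}, {0}),
]
support = choose_support_set(patterns, n_samples=4, min_cover=2)
```

On the slide the coverage requirement is 10 patterns per sample; the toy example uses 2 only to stay small.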

16
The 30 genes that best distinguish FL from DLBCL
17
Examples of FL and DLBCL patterns
WI training data: each DLBCL case satisfies at
least one of the patterns P1 and P2; each FL case
satisfies the pattern N1 (and none of the
patterns P1 and P2)
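
To make the idea of pattern data concrete, the sketch below assumes that a pattern is a conjunction of threshold conditions on a few genes, and that a sample is mapped to pattern space by recording which patterns it satisfies. The gene names, cut-points, and the particular P1, P2, N1 are hypothetical; the actual patterns on this slide are not in the transcript.

```python
def make_pattern(conditions):
    """conditions: list of (gene, op, cutpoint) with op either '>' or '<='."""
    def satisfied(sample):
        return all(
            sample[gene] > cut if op == ">" else sample[gene] <= cut
            for gene, op, cut in conditions
        )
    return satisfied

# Hypothetical positive (DLBCL) and negative (FL) patterns.
P1 = make_pattern([("geneA", ">", 1.0)])
P2 = make_pattern([("geneB", ">", 0.5), ("geneC", "<=", 0.0)])
N1 = make_pattern([("geneA", "<=", 1.0), ("geneB", "<=", 0.5)])

def to_pattern_space(sample, patterns):
    """Binary vector: 1 where the sample satisfies the pattern, else 0."""
    return [int(p(sample)) for p in patterns]

dlbcl_1 = {"geneA": 1.8, "geneB": 0.1, "geneC": 0.4}
dlbcl_2 = {"geneA": 0.2, "geneB": 0.9, "geneC": -0.3}
fl_1 = {"geneA": -0.5, "geneB": 0.2, "geneC": 0.1}

vec_dlbcl_1 = to_pattern_space(dlbcl_1, [P1, P2, N1])
vec_dlbcl_2 = to_pattern_space(dlbcl_2, [P1, P2, N1])
vec_fl_1 = to_pattern_space(fl_1, [P1, P2, N1])
```

In this toy setup each DLBCL sample triggers P1 or P2 and the FL sample triggers only N1, mirroring the coverage property stated above.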
18
Pattern data
19
Meta-classifier performance
20
Error distribution raw and pattern data
21
Biology Based Methods
22
p53 related genes identified by filtering
procedure
FL → DLBCL progression
23
p53 pattern data
24
Examples of p53 responsive genes patterns
WI data: each DLBCL case satisfies one of the
patterns P1, P2, P3; each FL case satisfies one of
the patterns N1, N2, N3
25
p53 combinatorial biomarker
77% of FL and 21% of DLBCL cases (3.7-fold): at
most one gene over-expressed.
79% of DLBCL and 23% of FL cases (3.4-fold): at
least two genes over-expressed.
Each individual gene is over-expressed in about
40-70% of DLBCL and 20-40% of FL cases
(specificity 50-60%, sensitivity 60-70%)
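
The counting rule and the fold figures on this slide can be checked with a few lines. The sketch reads the case figures as percentages and treats "over-expressed" as a precomputed 0/1 flag per gene; both the flag encoding and the function names are assumptions made for illustration.

```python
def count_rule(overexpressed_flags, threshold=2):
    """Call DLBCL when at least `threshold` genes are over-expressed, else FL."""
    return "DLBCL" if sum(overexpressed_flags) >= threshold else "FL"

def fold(fraction_a, fraction_b):
    """Enrichment of one class over the other for a given rule outcome."""
    return fraction_a / fraction_b

# At most one gene over-expressed: 77 (FL) vs 21 (DLBCL).
fold_fl = round(fold(77, 21), 1)
# At least two genes over-expressed: 79 (DLBCL) vs 23 (FL).
fold_dlbcl = round(fold(79, 23), 1)

call_two_up = count_rule([1, 1, 0])   # two genes over-expressed
call_one_up = count_rule([0, 1, 0])   # one gene over-expressed
```

The point of the combination is visible in the ratios: each gene alone separates the classes weakly, while the count of over-expressed genes gives a roughly 3.4- to 3.7-fold enrichment.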
26
What are these genes?
  • Plk1 (stpk13): polo-like kinase, serine/threonine
    protein kinase 13, M-phase specific
  • cell transformation, neoplastic; drives quiescent
    cells into mitosis
  • over-expressed in various human tumors
  • Takai et al., Oncogene, 2005: plk1 is a potential
    target for cancer therapy and a new prognostic
    marker for cancer
  • Mito et al., Leuk Lymph, 2005: plk1 is a biomarker
    for DLBCL
  • Cdk2 (p33): cyclin-dependent kinase; G2/M
    transition of mitotic cell cycle; interacts with
    cyclins A, B3, D, E
  • P53: tumor suppressor gene (Levine 1982)

27
Conclusions
  • Pattern-based meta-classifier is robust against
    noise
  • Good prediction of FL → DLBCL
  • Biology Based Analysis also possible
  • Yields useful Biomarker
  • Should Study Biologically motivated sets of genes
    → build pathways

28
  • Thank you for your attention !

29
Artificial neural networks
30
Support vector machines
Find a maximum margin hyperplane in pattern space
(Vapnik)
(P)
(D)
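
As a rough illustration of the maximum-margin idea, the sketch below trains a linear classifier by subgradient descent on the soft-margin hinge loss. This is a simple stand-in, not the quadratic-programming formulation labeled (P) and (D) on the slide and not necessarily what the authors ran; the hyperparameters, function names, and toy data are hypothetical.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.01, epochs=200):
    """Subgradient descent on the hinge loss with L2 penalty; labels y in {-1, +1}."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            margin = t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # L2 shrinkage favors a small norm of w, i.e. a wide margin.
            w = [wi * (1 - lr * lam) for wi in w]
            if margin < 1:  # point lies inside the margin or is misclassified
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

def svm_predict(w, b, x):
    """Side of the hyperplane w.x + b = 0 on which x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Toy, linearly separable data (hypothetical).
X = [[2.0, 2.0], [3.0, 3.0], [2.0, 3.0], [-2.0, -2.0], [-3.0, -3.0], [-2.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
predictions = [svm_predict(w, b, x) for x in X]
```

In pattern space the rows of X would be the binary pattern vectors rather than raw expression values.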
31
k-Nearest neighbors
  • Training data: samples in normalized peptide
    space
  • Prediction for test data: the dominant class of
    the k nearest neighbors in Euclidean metric

[Figure: positive and negative training cases; the new case is classified as negative]
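
The k-nearest-neighbors rule described on this slide can be sketched in a few lines. The toy coordinates and the function name are hypothetical, and k = 3 is an arbitrary choice.

```python
from collections import Counter
from math import dist

def knn_predict(train_X, train_y, x, k=3):
    """Return the dominant class among the k nearest training samples (Euclidean)."""
    nearest = sorted(zip(train_X, train_y), key=lambda pair: dist(pair[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy training data (hypothetical), already normalized.
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [3.0, 3.0], [3.0, 4.0], [4.0, 3.0]]
train_y = ["Negative", "Negative", "Negative", "Positive", "Positive", "Positive"]

new_case = knn_predict(train_X, train_y, [0.5, 0.5])
```

An odd k avoids ties in the two-class setting.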
32
Weighted voting
  • Pattern data
  • each pattern P is a voter
  • weight: fraction of cases correctly classified
    by the pattern
  • for each test case: compute the sum of weights of
    triggered positive patterns and of triggered
    negative patterns
  • classify by the higher total weight
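
A minimal sketch of weighted voting on pattern data. It assumes that a pattern classifies a training case correctly when it is triggered by a case of its own class or stays silent on a case of the other class; that reading of the weight, along with the function names and toy data, is an assumption.

```python
def pattern_weight(hits, labels, pattern_class):
    """Fraction of training cases on which the pattern's vote (or silence) is right."""
    correct = sum(
        (hit == 1) == (lab == pattern_class) for hit, lab in zip(hits, labels)
    )
    return correct / len(labels)

def weighted_vote(case_hits, weights, classes):
    """Sum the weights of triggered patterns per class; return the heavier class."""
    totals = {}
    for hit, w, cls in zip(case_hits, weights, classes):
        if hit:
            totals[cls] = totals.get(cls, 0.0) + w
    if not totals:
        return None  # no pattern triggered: case left unclassified
    return max(totals, key=totals.get)

# Toy training pattern data (hypothetical): rows are patterns, columns are cases.
labels = ["DLBCL", "DLBCL", "FL", "FL"]
classes = ["DLBCL", "DLBCL", "FL"]           # class voted for by P1, P2, N1
training_hits = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1]]
weights = [pattern_weight(h, labels, c) for h, c in zip(training_hits, classes)]

vote_a = weighted_vote([1, 1, 0], weights, classes)  # triggers P1 and P2
vote_b = weighted_vote([1, 0, 1], weights, classes)  # triggers P1 and N1
```

In the second test case the weaker positive pattern is outvoted by the stronger negative one, which is the behavior the weighting is meant to produce.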

33
Logistic regression
  • Dataset of two phenotypes (e.g., cancer vs.
    non-cancer)
  • Transform into logit space: y → ln(p/(1-p))
  • Find phenotype predictor as a linear combination
    of data values in logit space
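
The logit transform and a linear predictor in logit space can be sketched as below. The fit uses plain batch gradient descent on the log-loss, which is an assumption made for brevity (the slide does not say how the model was fitted); the learning rate, epoch count, function names, and toy data are hypothetical.

```python
from math import exp, log

def logit(p):
    """Map a probability p in (0, 1) to logit space: ln(p / (1 - p))."""
    return log(p / (1 - p))

def sigmoid(z):
    """Inverse of the logit: map a linear score back to a probability."""
    return 1 / (1 + exp(-z))

def predict_proba(w, b, x):
    """Probability of phenotype 1 from a linear combination of the data values."""
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, x)))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Batch gradient descent on the average log-loss; y holds 0/1 phenotypes."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, t in zip(X, y):
            err = predict_proba(w, b, x) - t
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
            grad_b += err
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b

# Toy one-feature dataset (hypothetical): 0 = non-cancer, 1 = cancer.
X = [[-2.0], [-1.0], [-0.5], [0.5], [1.0], [2.0]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
calls = [int(predict_proba(w, b, x) > 0.5) for x in X]
```

Because the toy data are perfectly separable, the weights keep growing with more epochs; real data or a penalty term would keep them bounded.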

Insightful Miner
34
Decision trees / forests
  • Find rules in training data
  • find root feature which best classifies samples
    by phenotype
  • iterate on each branch to find two new features
    which best split each branch by phenotype
  • if necessary prune weak support nodes
  • CART: Classification and Regression Trees
    (Breiman)
  • Many trees → forest
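
The first step above, finding the root feature that best classifies samples by phenotype, can be sketched as a search for the split with the lowest weighted Gini impurity. Gini is the criterion commonly associated with CART, but its use here is an assumption, as are the function names and the toy data; a full tree would apply the same search recursively on each branch.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Return (feature_index, threshold, weighted_gini) of the best single split."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive observed values.
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [lab for row, lab in zip(X, y) if row[j] <= thr]
            right = [lab for row, lab in zip(X, y) if row[j] > thr]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[2]:
                best = (j, thr, score)
    return best

# Toy data (hypothetical): feature 0 separates the phenotypes, feature 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = ["FL", "FL", "DLBCL", "DLBCL"]
root = best_split(X, y)
```

A forest repeats this construction on many resampled versions of the training data and lets the trees vote.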