Title: GGS Lecture: Knowledge discovery in large datasets
1. GGS Lecture: Knowledge discovery in large datasets
- Yvan Saeys
- yvan.saeys_at_ugent.be
2. Overview
- Emergence of large datasets
- Dealing with large datasets
- Dimension reduction techniques
- Case study: knowledge discovery for splice site prediction
- Computer exercise
3. Emergence of large datasets
- Examples: image processing, text mining, spam filtering, biological sequence analysis, micro-array data
- Complexity of datasets
- Many instances (examples)
- Many features (characteristics)
- Many dependencies between features (correlations)
4. Examples of large datasets
- Micro-array data
- Colon cancer dataset (2000 genes, 22 samples)
- Leukemia (7129 genes, 72 cell-lines)
- Gene prediction data
- Homo sapiens splice site data (e.g. Genie): 5788 sequences, 90 bp
- Text mining
- Hundreds of thousands of instances (documents), thousands of features
5. Dealing with complex data
- Data pre-processing
- Dimensionality reduction
- Instance selection
- Feature transformation/selection
- Data analysis
- Clustering
- Classification
- Requires methods that are fast and able to deal
with large amounts of data
6. Dimensionality reduction
- Instance selection
- Remove identical/inconsistent/incomplete instances (e.g. reduction of homologous genes in gene prediction tasks)
- Feature transformation/selection
- Projection techniques (e.g. principal component analysis)
- Compression techniques (e.g. minimum description length)
- Feature selection techniques
7. Principal component analysis (PCA)
- Transforms the original features of the data into a new set of variables (the principal components) that summarize the data
- Usually only the first 2 or 3 PCs are then used to visualize the data
- Example: clustering gene expression data (a minimal code sketch follows below)
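A minimal sketch of what such a PCA projection might look like in code (assuming Python with scikit-learn and a toy data matrix; names and sizes are illustrative, not the sporulation dataset from the next slide):

```python
# Minimal PCA sketch (illustrative; assumes scikit-learn is available).
# Rows are instances (e.g. genes or samples), columns are original features.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))           # toy data: 100 instances, 50 features

pca = PCA(n_components=2)                # keep only the first 2 principal components
X_2d = pca.fit_transform(X)              # project the data onto PC1 and PC2

print(X_2d.shape)                        # (100, 2): ready for a 2-D scatter plot
print(pca.explained_variance_ratio_)     # fraction of variance captured by each PC
```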
8. PCA Example
- Principal component analysis for clustering gene expression data for sporulation in Yeast (Yeung and Ruzzo, Bioinformatics 17(9), 2001)
- 447 genes, 7 timepoints
9. Feature selection techniques
- In contrast to projection or compression, the original features are not changed
- For classification purposes
- Goal: find a minimal subset of features with the best classification performance
10. Feature selection for Bioinformatics
- In many cases, the underlying biological process that is modeled is not yet fully understood
- Which features to include?
- Include as many features as possible, and hope the relevant ones are included
- Then apply feature selection techniques to identify the relevant features
- Visualization: learn something from your data (data → knowledge)
11. Benefits of feature selection
- Attain good or even better classification performance using a small subset of features
- Provide more cost-effective classifiers
- Fewer features to take into account
- faster classifiers
- Fewer features to store
- smaller datasets
- Gain more insight into the processes that
generated the data
12. Feature selection techniques
- Filter approach
- Wrapper approach
- Embedded approach
[Diagram: in the filter approach, FSS precedes the classification model; in the wrapper approach, an FSS search method is wrapped around the classification model; in the embedded approach, the classification model's parameters drive FSS]
13. Filter methods
- Independent of classification model
- Uses only dataset of annotated examples
- A relevance measure is calculated for each feature
- E.g. feature-class entropy
- Kullback-Leibler divergence (cross-entropy)
- Information gain, gain ratio
- Features with a relevance value lower than some threshold t will be removed (an illustrative sketch follows this list)
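As an illustration only (not the exact measure or threshold used in the lecture), a filter step could score each feature by its mutual information with the class (information gain) and keep only features scoring above a threshold t; the toy data, measure, and threshold below are assumptions:

```python
# Illustrative filter-method sketch (assumes scikit-learn; threshold is arbitrary).
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 30))    # toy binary feature matrix
y = rng.integers(0, 2, size=200)          # toy class labels

scores = mutual_info_classif(X, y, discrete_features=True)  # relevance per feature
t = 0.01                                  # threshold (problem-dependent assumption)
selected = np.where(scores >= t)[0]       # keep features scoring at least t
print(selected)
```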
14. Filter method example
- Feature-class entropy
- Measures the uncertainty about the class when observing feature i
- Example (a short entropy computation follows the table):

  f1 f2 f3 f4 | class      f1 f2 f3 f4 | class
   1  0  1  1 |   1         1  0  0  0 |   0
   0  1  1  0 |   1         0  0  1  0 |   0
   1  0  1  0 |   1         1  1  0  1 |   0
   0  1  0  1 |   1         0  1  0  1 |   0
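A small sketch (assuming Python; not part of the original slides) that computes the feature-class conditional entropy H(class | feature) for each feature in the toy table above; lower entropy means the feature tells us more about the class:

```python
# Compute H(class | feature) for the toy example above.
import math

# Rows: (f1, f2, f3, f4), class label.
data = [
    ((1, 0, 1, 1), 1), ((0, 1, 1, 0), 1), ((1, 0, 1, 0), 1), ((0, 1, 0, 1), 1),
    ((1, 0, 0, 0), 0), ((0, 0, 1, 0), 0), ((1, 1, 0, 1), 0), ((0, 1, 0, 1), 0),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

for i in range(4):
    # Weighted entropy of the class after splitting on the value of feature i.
    h = 0.0
    for v in (0, 1):
        subset = [c for feats, c in data if feats[i] == v]
        h += len(subset) / len(data) * entropy(subset)
    print(f"H(class | f{i + 1}) = {h:.3f}")
# f3 gets the lowest conditional entropy (about 0.811 bits), so it is the most informative.
```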
15. Wrapper method
- Specific to a classification algorithm
- The search for a good feature subset is guided by a search algorithm (e.g. greedy forward or backward)
- The algorithm uses the evaluation of the classifier as a guide to find good feature subsets
- Examples: sequential forward or backward search, simulated annealing, genetic algorithms
16. Wrapper method example
- Sequential backward elimination
- Starts with the set of all features
- Iteratively discards the feature whose removal results in the best classification performance (a minimal sketch follows below)
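A minimal sketch of sequential backward elimination, assuming scikit-learn, a naive Bayes classifier as the wrapped model, and toy data; the classifier choice and stopping point are assumptions, not the lecture's setup:

```python
# Illustrative sequential backward elimination (wrapper approach).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(150, 10))    # toy binary data
y = rng.integers(0, 2, size=150)

remaining = list(range(X.shape[1]))       # start from the full feature set
while len(remaining) > 1:
    # Score every candidate subset obtained by dropping one feature.
    scored = []
    for f in remaining:
        subset = [g for g in remaining if g != f]
        acc = cross_val_score(BernoulliNB(), X[:, subset], y, cv=5).mean()
        scored.append((acc, f))
    best_acc, worst_feature = max(scored)
    print(f"dropping f{worst_feature}: CV accuracy {best_acc:.3f}")
    remaining.remove(worst_feature)       # discard the least useful feature
```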
17. Wrapper method example (2)
[Figure: backward elimination search starting from the full feature set f1, f2, f3, f4, removing one feature at each step]
18. Embedded methods
- Specific to a classification algorithm
- Model parameters are directly used to derive feature weights (a generic sketch follows this list)
- Examples
- Weighted Naïve Bayes Method (WNBM)
- Weighted Linear Support Vector Machine (WLSVM)
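As a generic illustration of the embedded idea (not the WNBM or WLSVM methods themselves), the sketch below trains a linear SVM and reads feature weights directly from the learned model parameters; scikit-learn and the toy data are assumptions:

```python
# Illustrative embedded-style ranking: a linear SVM's learned coefficients
# serve directly as feature weights.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = (X[:, 2] + 0.5 * X[:, 5] + rng.normal(scale=0.5, size=200) > 0).astype(int)

model = LinearSVC(C=1.0, dual=False).fit(X, y)
weights = np.abs(model.coef_[0])          # |w_i| as the importance of feature i
ranking = np.argsort(weights)[::-1]       # most important features first
print(ranking, weights[ranking])
```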
19. Case study: knowledge discovery for splice site prediction
- Splice site prediction
- Correctly identify the borders of introns and exons in genes (splice sites)
- Important for gene prediction
- Split up into 2 tasks
- Donor prediction (exon → intron)
- Acceptor prediction (intron → exon)
20. Splice site prediction
- Splice sites are characterized by a conserved dinucleotide in the intron part of the sequence
- Donor sites: ...GT...
- Acceptor sites: ...AG...
- Classification problem
- Distinguish between true GT, AG and false GT, AG
21. Splice site prediction: Features
- Position dependent features
- e.g. an A at position 1, a C at position 17, ...
- Position independent features
- e.g. the subsequence TCG occurs, GAG occurs, ...
[Figure: example sequence atcgatcagtatcgat GT ctgagctatgag, with positions numbered 1, 2, 3, ..., 17, ..., 28]
22. Example: acceptor prediction
- Local context of 100 nucleotides around the splice site
- 100 position dependent features
- 400 binary features (A=1000, T=0100, C=0010, G=0001)
- 2x64 binary features, representing the occurrence of 3-mers
- Total: 528 binary features (an encoding sketch follows below)
- Color coding of feature importance
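A minimal sketch (assuming Python; the split of the 2x64 3-mer features into an upstream and a downstream half is an assumption) of how a 100-nucleotide local context could be turned into the 528 binary features described above:

```python
# Illustrative encoding of a candidate acceptor site as 528 binary features:
# 400 position-dependent bits (one-hot A/T/C/G per position) plus
# 2 x 64 bits marking which 3-mers occur upstream / downstream.
from itertools import product

ALPHABET = "ATCG"
ONE_HOT = {"A": (1, 0, 0, 0), "T": (0, 1, 0, 0), "C": (0, 0, 1, 0), "G": (0, 0, 0, 1)}
TRIMERS = ["".join(p) for p in product(ALPHABET, repeat=3)]   # 64 possible 3-mers

def encode(upstream, downstream):
    """upstream/downstream: 50-nt sequences flanking the candidate splice site."""
    context = (upstream + downstream).upper()
    features = []
    for base in context:                                  # 100 x 4 = 400 position bits
        features.extend(ONE_HOT[base])
    for part in (upstream.upper(), downstream.upper()):   # 2 x 64 occurrence bits
        present = {part[i:i + 3] for i in range(len(part) - 2)}
        features.extend(1 if t in present else 0 for t in TRIMERS)
    return features

up = "ATCG" * 12 + "TC"     # toy 50-nt upstream context
down = "GATC" * 12 + "AG"   # toy 50-nt downstream context
vec = encode(up, down)
print(len(vec))             # 528
```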
23. Donor prediction: 528 features
[Figure: color-coded feature importance for the 528 donor features; axes show local positions 1-50 on either side of the splice site, the four nucleotides A, T, C, G, and the 3-mer features (AAA, ATT, ATC, ...)]
24. Acceptor prediction: 528 features
[Figure: color-coded feature importance for the 528 acceptor features; labeled 3-mers include AAT, TAA, AGA, AGG, AGT, TAG, CAG]
25. How to decide on a splice site?
- Classification models
- PWM (position weight matrix)
- Collection of (conditional) probabilities (a small scoring sketch follows this list)
- Linear discriminant analysis
- Hyperplane decision function in a high-dimensional space
- Classification tree
- Decision is made by traversing a tree structure
- Decision nodes
- Leaf nodes
- Easy to interpret by a human
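A small sketch of the PWM idea under assumptions (toy aligned sites and a pseudocount of 0.1, not the lecture's model): per-position nucleotide probabilities are estimated from known true sites, and a candidate is scored by its log-probability under the matrix:

```python
# Illustrative position weight matrix (PWM) for splice-site scoring.
import math
from collections import Counter

true_sites = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGG", "CAGGTAAGT"]  # toy donor contexts
L = len(true_sites[0])

# Per-position nucleotide probabilities (with a small pseudocount).
pwm = []
for i in range(L):
    counts = Counter(s[i] for s in true_sites)
    total = len(true_sites) + 4 * 0.1
    pwm.append({b: (counts[b] + 0.1) / total for b in "ACGT"})

def score(seq):
    """Log-probability of seq under the PWM (higher = more splice-site-like)."""
    return sum(math.log(pwm[i][b]) for i, b in enumerate(seq))

print(score("AAGGTAAGT"), score("TTTTTTTTT"))
```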
26. Classification Tree
- Choose the best attribute by a given selection measure
- Extend the tree by adding a new branch for each attribute value
- Sort the training examples into the leaf nodes
- If the examples are unambiguously classified, then stop; else repeat steps 1-4 for the leaf nodes
- Prune unstable leaf nodes (a small tree-fitting sketch follows below)
[Figure: example classification tree with root node Temperature]
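A minimal, assumed illustration (scikit-learn, synthetic data; not the tree from the figure) of fitting a small classification tree and printing its decision and leaf nodes:

```python
# Illustrative classification tree (assumes scikit-learn; toy data).
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=5, n_informative=3, random_state=0)

tree = DecisionTreeClassifier(max_depth=3, random_state=0)  # small, interpretable tree
tree.fit(X, y)

# The text export shows decision nodes (tests on features) and leaf nodes (class labels).
print(export_text(tree, feature_names=[f"f{i}" for i in range(5)]))
```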
27. Acceptor prediction
- Original dataset: 353 binary features
- Reduce this set to 15 features (e.g. using a filter technique)
- 353 features are hard to visualize in e.g. a decision tree
- 15 features are easy to visualize
28. 352 Binary features
29. 15 Binary features
30. Computer exercise
- Feature selection for classification of human acceptor splice sites
- Use the WEKA machine learning toolkit for knowledge discovery in acceptor sites
- Download files from
- http://www.psb.ugent.be/yvsae/GGSlecture.html