1
GGS Lecture: Knowledge discovery in large datasets
  • Yvan Saeys
  • yvan.saeys@ugent.be

2
Overview
  • Emergence of large datasets
  • Dealing with large datasets
  • Dimension reduction techniques
  • Case study: knowledge discovery for splice site
    prediction
  • Computer exercise

3
Emergence of large datasets
  • Examples: image processing, text mining, spam
    filtering, biological sequence analysis,
    micro-array data
  • Complexity of datasets:
  • Many instances (examples)
  • Many features (characteristics)
  • Many dependencies between features (correlations)

4
Examples of large datasets
  • Micro-array data
  • Colon cancer dataset (2000 genes, 22 samples)
  • Leukemia (7129 genes, 72 cell-lines)
  • Gene prediction data
  • Homo sapiens splice site data (e.g. Genie): 5788
    sequences, 90 bp
  • Text mining
  • Hundreds of thousands of instances (documents),
    thousands of features

5
Dealing with complex data
  • Data pre-processing
  • Dimensionality reduction
  • Instance selection
  • Feature transformation/selection
  • Data analysis
  • Clustering
  • Classification
  • Requires methods that are fast and able to deal
    with large amounts of data

6
Dimensionality reduction
  • Instance selection
  • Remove identical/inconsistent/incomplete
    instances (e.g. reduction of homologous genes in
    gene prediction tasks)
  • Feature transformation/selection
  • Projection techniques (e.g. principal component
    analysis)
  • Compression techniques (e.g. minimum description
    length)
  • Feature selection techniques

7
Principal component analysis (PCA)
  • Transforms the original features of the data into a
    new set of variables (the principal components)
    that summarize the data
  • Usually only the first 2 or 3 PCs are then used to
    visualize the data
  • Example: clustering gene expression data

8
PCA Example
  • Principal component analysis for clustering gene
    expression data for sporulation in yeast (Yeung
    and Ruzzo, Bioinformatics 17(9), 2001)
  • 447 genes, 7 timepoints
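As an illustration (not part of the original slides), a minimal sketch of this kind of PCA analysis in Python, assuming scikit-learn as the toolkit and random placeholder data with the dimensions of the yeast example:

    import numpy as np
    from sklearn.decomposition import PCA

    # Expression matrix: one row per gene, one column per timepoint
    # (placeholder random data with the dimensions of the yeast example)
    X = np.random.rand(447, 7)

    # Project onto the first 2 principal components for visualization
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)

    # Fraction of the total variance captured by these 2 components
    print(pca.explained_variance_ratio_.sum())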

9
Feature selection techniques
  • In contrast to projection or compression, the
    original features are not changed
  • For classification purposes
  • Goal: find a minimal subset of features with the
    best classification performance

10
Feature selection for Bioinformatics
  • In many cases, the underlying biological process
    that is modeled is not yet fully understood
  • Which features to include?
  • Include as many features as possible, and hope
    the relevant ones are among them
  • Then apply feature selection techniques to
    identify the relevant features
  • Visualization, learn something from your data
    (data -> knowledge)

11
Benefits of feature selection
  • Attain equally good or even better classification
    performance using a small subset of features
  • Provide more cost-effective classifiers
  • Fewer features to take into account
    -> faster classifiers
  • Fewer features to store
    -> smaller datasets
  • Gain more insight into the processes that
    generated the data

12
Feature selection techniques
  • Filter approach
  • Wrapper approach
  • Embedded approach

(Figure: schematics of the three approaches: filter (feature subset
selection (FSS) independent of the classification model), wrapper (an
FSS search method wrapped around the classification model), embedded
(FSS derived from the classification model parameters))
13
Filter methods
  • Independent of the classification model
  • Uses only the dataset of annotated examples
  • A relevance measure is calculated for each
    feature
  • E.g. feature-class entropy
  • Kullback-Leibler divergence (cross-entropy)
  • Information gain, gain ratio
  • Features whose value is below some threshold t
    are removed

14
Filter method example
  • Feature-class entropy
  • Measures the uncertainty about the class when
    observing feature i
  • Example:

    f1 f2 f3 f4  class
     1  0  1  1      1
     0  1  1  0      1
     1  0  1  0      1
     0  1  0  1      1
     1  0  0  0      0
     0  0  1  0      0
     1  1  0  1      0
     0  1  0  1      0
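A small sketch (added for illustration) that computes the feature-class entropy H(C|f_i) and the resulting information gain for each feature of the table above; only f3 turns out to carry information about the class:

    import numpy as np

    # The 8 instances from the table: columns f1..f4, last column is the class
    data = np.array([
        [1, 0, 1, 1, 1],
        [0, 1, 1, 0, 1],
        [1, 0, 1, 0, 1],
        [0, 1, 0, 1, 1],
        [1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 1, 0],
        [0, 1, 0, 1, 0],
    ])
    X, y = data[:, :4], data[:, 4]

    def entropy(labels):
        # Shannon entropy of a label vector, in bits
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    for i in range(X.shape[1]):
        # Conditional entropy H(C|f_i): entropy of the class within each
        # value of feature i, weighted by how often that value occurs
        h_cond = sum((X[:, i] == v).mean() * entropy(y[X[:, i] == v])
                     for v in np.unique(X[:, i]))
        print(f"f{i+1}: H(C|f)={h_cond:.3f}, gain={entropy(y) - h_cond:.3f}")

Features whose gain falls below the threshold t would then be removed, as described on the previous slide.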

15
Wrapper method
  • Specific to a classification algorithm
  • The search for a good feature subset is guided by
    a search algorithm (e.g. greedy forward or
    backward)
  • The algorithm uses the evaluation of the
    classifier as a guide to find good feature
    subsets
  • Examples: sequential forward or backward search,
    simulated annealing, genetic algorithms

16
Wrapper method example
  • Sequential backward elimination
  • Starts with the set of all features
  • Iteratively discards the feature whose removal
    results in the best classification performance
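A minimal sketch of sequential backward elimination, assuming a Naive Bayes classifier and 5-fold cross-validation as the evaluation (both are illustrative choices, not prescribed by the slides):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB

    def backward_elimination(X, y, n_keep):
        # Start with the set of all features
        features = list(range(X.shape[1]))
        clf = BernoulliNB()  # assumed classifier; any model works
        while len(features) > n_keep:
            # Evaluate the classifier with each candidate feature removed
            scores = [cross_val_score(clf, X[:, [g for g in features if g != f]],
                                      y, cv=5).mean()
                      for f in features]
            # Discard the feature whose removal gives the best performance
            features.pop(int(np.argmax(scores)))
        return features

    # Example: keep the 10 best of 20 placeholder binary features
    X = np.random.randint(0, 2, size=(300, 20))
    y = np.random.randint(0, 2, size=300)
    kept = backward_elimination(X, y, n_keep=10)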

17
Wrapper method example (2)
Full feature set f1,f2,f3,f4
18
Embedded methods
  • Specific to a classification algorithm
  • Model parameters are directly used to derive
    feature weights (see the sketch below)
  • Examples:
  • Weighted Naïve Bayes Method (WNBM)
  • Weighted Linear Support Vector Machine (WLSVM)
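As an illustration, with a linear SVM the magnitudes of the learned hyperplane coefficients can be used directly as feature weights; a sketch assuming scikit-learn and placeholder data:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Placeholder binary feature matrix and class labels
    X = np.random.randint(0, 2, size=(1000, 528))
    y = np.random.randint(0, 2, size=1000)

    # Fit a linear SVM; its hyperplane has one coefficient per feature
    clf = LinearSVC().fit(X, y)

    # Larger magnitude = more influence on the decision function
    weights = np.abs(clf.coef_).ravel()
    ranking = np.argsort(weights)[::-1]  # feature indices, most important first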

19
Case study: knowledge discovery for splice site
prediction
  • Splice site prediction
  • Correctly identify the borders of introns and
    exons in genes (splice sites)
  • Important for gene prediction
  • Split up into 2 tasks
  • Donor prediction (exon -> intron)
  • Acceptor prediction (intron -> exon)

20
Splice site prediction
  • Splice sites are characterized by a conserved
    dinucleotide in the intron part of the sequence
  • Donor sites: ... GT ...
  • Acceptor sites: ... AG ...
  • Classification problem
  • Distinguish between true GT, AG and false GT, AG

21
Splice site prediction: Features
  • Position dependent features
  • e.g. an A at position 1, a C at position 17, ...
  • Position independent features
  • e.g. the subsequence TCG occurs, GAG occurs, ...

(Figure: the example sequence atcgatcagtatcgat GT ctgagctatgag with
positions 1, 2, 3, ..., 17, ..., 28 marked)
22
Example: acceptor prediction
  • Local context of 100 nucleotides around the
    splice site
  • 100 position dependent features
  • 400 binary features (A=1000, T=0100, C=0010,
    G=0001)
  • 2x64 binary features, representing the occurrence
    of 3-mers
  • Total: 528 binary features (a sketch of the
    encoding follows below)
  • Color coding of feature importance
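A sketch of how such a 528-bit feature vector could be assembled; the split of the 2x64 3-mer features into a left and a right half of the context is an assumption for illustration:

    from itertools import product

    # One-hot code per nucleotide: A=1000, T=0100, C=0010, G=0001
    ONEHOT = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0],
              'C': [0, 0, 1, 0], 'G': [0, 0, 0, 1]}
    KMERS = [''.join(p) for p in product('ATCG', repeat=3)]  # the 64 3-mers

    def encode(context):
        # context: 100-nucleotide local context around the candidate site
        # 400 position-dependent bits (100 positions x 4)
        bits = [b for nt in context for b in ONEHOT[nt]]
        # 2 x 64 position-independent bits: does each 3-mer occur in
        # the left / right half of the context? (assumed split)
        for half in (context[:50], context[50:]):
            bits += [1 if k in half else 0 for k in KMERS]
        return bits  # 400 + 128 = 528 binary features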

23
Donor prediction: 528 features

(Figure: color-coded importance map of the 528 features: the
position-dependent nucleotide features A, T, C, G across the 50+50
positions of the local context, and the position-independent 3-mer
features AAA, AAT, ..., TCG)
24
Acceptor prediction: 528 features

(Figure: the same color-coded importance map for acceptor sites, with
the 3-mers AAT, TAA, AGA, AGG, AGT, TAG, CAG highlighted)
25
How to decide on a splice site?
  • Classification models
  • PWM (position weight matrix; see the sketch below)
  • Collection of (conditional) probabilities
  • Linear discriminant analysis
  • Hyperplane decision function in a
    high-dimensional space
  • Classification tree
  • Decision is made by traversing a tree structure
  • Decision nodes
  • Leaf nodes
  • Easy for a human to interpret
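For instance, a PWM scores a candidate site by summing, per position, the log-odds of the observed nucleotide against a background distribution; a minimal sketch with toy numbers (the probabilities below are assumptions for illustration):

    import numpy as np

    def pwm_score(site, pwm):
        # pwm[i][nt]: probability of nucleotide nt at position i,
        # estimated from aligned true splice sites
        # Score = sum of per-position log-odds vs. a uniform background
        return sum(np.log2(pwm[i][nt] / 0.25) for i, nt in enumerate(site))

    # Toy PWM over 3 positions (assumed numbers, for illustration only)
    pwm = [{'A': 0.7, 'T': 0.1, 'C': 0.1, 'G': 0.1},
           {'A': 0.1, 'T': 0.1, 'C': 0.1, 'G': 0.7},
           {'A': 0.25, 'T': 0.25, 'C': 0.25, 'G': 0.25}]

    # A site is predicted as a true splice site if its score exceeds
    # some threshold t; higher = closer to the true-site profile
    print(pwm_score("AGT", pwm))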

26
Classification Tree
  • Choose the best attribute by a given selection
    measure
  • Extend the tree by adding a new branch for each
    attribute value
  • Sort the training examples to the leaf nodes
  • If examples are unambiguously classified Then Stop
    Else Repeat steps 1-4 for the leaf nodes
  • Prune unstable leaf nodes

(Figure: example decision tree with root node Temperature)
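A sketch of building and printing such a tree, assuming scikit-learn and placeholder data (the computer exercise itself uses WEKA):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Placeholder binary features and class labels
    X = np.random.randint(0, 2, size=(200, 15))
    y = np.random.randint(0, 2, size=200)

    # Entropy as the attribute selection measure; ccp_alpha enables pruning
    tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01).fit(X, y)

    # Print the tree as nested if/else rules, easy for a human to interpret
    print(export_text(tree, feature_names=[f"f{i+1}" for i in range(15)]))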
27
Acceptor prediction
  • Original dataset: 353 binary features
  • Reduce this set to 15 features, e.g. using a
    filter technique (see the sketch below)
  • 353 features are hard to visualize in e.g. a
    decision tree
  • 15 features are easy to visualize
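A sketch of that reduction step, using mutual information (information gain) as the filter criterion; scikit-learn and the placeholder data are assumptions:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Placeholder dataset: 353 binary features
    X = np.random.randint(0, 2, size=(500, 353))
    y = np.random.randint(0, 2, size=500)

    # Keep the 15 features with the highest mutual information with the class
    selector = SelectKBest(mutual_info_classif, k=15)
    X_small = selector.fit_transform(X, y)  # shape (500, 15)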

28
352 Binary features
29
15 Binary features
30
Computer exercise
  • Feature selection for classification of human
    acceptor splice sites
  • Use the WEKA machine learning toolkit for knowledge
    discovery in acceptor sites
  • Download the files from
  • http://www.psb.ugent.be/yvsae/GGSlecture.html