1
GGS Lecture: Knowledge discovery in large datasets
  • Yvan Saeys
  • yvan.saeys@ugent.be

2
Overview
  • Emergence of large datasets
  • Dealing with large datasets
  • Dimension reduction techniques
  • Case study: knowledge discovery for splice site
    prediction
  • Computer exercise

3
Emergence of large datasets
  • Examples: image processing, text mining, spam
    filtering, biological sequence analysis,
    micro-array data
  • Complexity of datasets:
  • Many instances (examples)
  • Many features (characteristics)
  • Many dependencies between features (correlations)

4
Examples of large datasets
  • Micro-array data
  • Colon cancer dataset (2000 genes, 22 samples)
  • Leukemia (7129 genes, 72 cell-lines)
  • Gene prediction data
  • Homo sapiens splice site data (e.g. Genie): 5788
    sequences, 90 bp
  • Text mining
  • Hundreds of thousands of instances (documents),
    thousands of features

5
Dealing with complex data
  • Data pre-processing
  • Dimensionality reduction
  • Instance selection
  • Feature transformation/selection
  • Data analysis
  • Clustering
  • Classification
  • Requires methods that are fast and able to deal
    with large amounts of data

6
Dimensionality reduction
  • Instance selection
  • Remove identical/inconsistent/incomplete
    instances (e.g. reduction of homologous genes in
    gene prediction tasks)
  • Feature transformation/selection
  • Projection techniques (e.g. principal component
    analysis)
  • Compression techniques (e.g. minimum description
    length)
  • Feature selection techniques

7
Principal component analysis (PCA)
  • Transforms the original features of the data into a
    new set of variables (the principal components)
    that summarize the data
  • Usually only the first 2 or 3 PCs are then used to
    visualize the data
  • Example: clustering gene expression data

8
PCA Example
  • Principal component analysis for clustering gene
    expression data for sporulation in yeast (Yeung
    and Ruzzo, Bioinformatics 17(9), 2001)
  • 447 genes, 7 timepoints
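As an illustration (not part of the original slides), a minimal sketch of this kind of PCA analysis in Python, assuming scikit-learn as the toolkit and random placeholder data with the dimensions of the yeast example:

    import numpy as np
    from sklearn.decomposition import PCA

    # Expression matrix: one row per gene, one column per timepoint
    # (placeholder random data with the dimensions of the yeast example)
    X = np.random.rand(447, 7)

    # Project onto the first 2 principal components for visualization
    pca = PCA(n_components=2)
    X_2d = pca.fit_transform(X)

    # Fraction of the total variance captured by these 2 components
    print(pca.explained_variance_ratio_.sum())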

9
Feature selection techniques
  • In contrast to projection or compression, the
    original features are not changed
  • For classification purposes
  • Goal: find a minimal subset of features with the
    best classification performance

10
Feature selection for Bioinformatics
  • In many cases, the underlying biological process
    that is modeled is not yet fully understood
  • Which features to include?
  • Include as many features as possible, and hope
    the relevant ones are among them
  • Then apply feature selection techniques to
    identify the relevant features
  • Visualization, learn something from your data
    (data -> knowledge)

11
Benefits of feature selection
  • Attain equally good or even better classification
    performance using a small subset of features
  • Provide more cost-effective classifiers
  • Fewer features to take into account
    -> faster classifiers
  • Fewer features to store
    -> smaller datasets
  • Gain more insight into the processes that
    generated the data

12
Feature selection techniques
  • Filter approach
  • Wrapper approach
  • Embedded approach

(Figure: schematics of the three approaches: filter (feature subset
selection (FSS) independent of the classification model), wrapper (an
FSS search method wrapped around the classification model), embedded
(FSS derived from the classification model parameters))
13
Filter methods
  • Independent of the classification model
  • Uses only the dataset of annotated examples
  • A relevance measure is calculated for each
    feature
  • E.g. feature-class entropy
  • Kullback-Leibler divergence (cross-entropy)
  • Information gain, gain ratio
  • Features whose value is below some threshold t
    are removed

14
Filter method example
  • Feature-class entropy
  • Measures the uncertainty about the class when
    observing feature i
  • Example:

    f1 f2 f3 f4  class
     1  0  1  1      1
     0  1  1  0      1
     1  0  1  0      1
     0  1  0  1      1
     1  0  0  0      0
     0  0  1  0      0
     1  1  0  1      0
     0  1  0  1      0
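A small sketch (added for illustration) that computes the feature-class entropy H(C|f_i) and the resulting information gain for each feature of the table above; only f3 turns out to carry information about the class:

    import numpy as np

    # The 8 instances from the table: columns f1..f4, last column is the class
    data = np.array([
        [1, 0, 1, 1, 1],
        [0, 1, 1, 0, 1],
        [1, 0, 1, 0, 1],
        [0, 1, 0, 1, 1],
        [1, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [1, 1, 0, 1, 0],
        [0, 1, 0, 1, 0],
    ])
    X, y = data[:, :4], data[:, 4]

    def entropy(labels):
        # Shannon entropy of a label vector, in bits
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    for i in range(X.shape[1]):
        # Conditional entropy H(C|f_i): entropy of the class within each
        # value of feature i, weighted by how often that value occurs
        h_cond = sum((X[:, i] == v).mean() * entropy(y[X[:, i] == v])
                     for v in np.unique(X[:, i]))
        print(f"f{i+1}: H(C|f)={h_cond:.3f}, gain={entropy(y) - h_cond:.3f}")

Features whose gain falls below the threshold t would then be removed, as described on the previous slide.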

15
Wrapper method
  • Specific to a classification algorithm
  • The search for a good feature subset is guided by
    a search algorithm (e.g. greedy forward or
    backward)
  • The algorithm uses the evaluation of the
    classifier as a guide to find good feature
    subsets
  • Examples: sequential forward or backward search,
    simulated annealing, genetic algorithms

16
Wrapper method example
  • Sequential backward elimination
  • Starts with the set of all features
  • Iteratively discards the feature whose removal
    results in the best classification performance
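A minimal sketch of sequential backward elimination, assuming a Naive Bayes classifier and 5-fold cross-validation as the evaluation (both are illustrative choices, not prescribed by the slides):

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.naive_bayes import BernoulliNB

    def backward_elimination(X, y, n_keep):
        # Start with the set of all features
        features = list(range(X.shape[1]))
        clf = BernoulliNB()  # assumed classifier; any model works
        while len(features) > n_keep:
            # Evaluate the classifier with each candidate feature removed
            scores = [cross_val_score(clf, X[:, [g for g in features if g != f]],
                                      y, cv=5).mean()
                      for f in features]
            # Discard the feature whose removal gives the best performance
            features.pop(int(np.argmax(scores)))
        return features

    # Example: keep the 10 best of 20 placeholder binary features
    X = np.random.randint(0, 2, size=(300, 20))
    y = np.random.randint(0, 2, size=300)
    kept = backward_elimination(X, y, n_keep=10)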

17
Wrapper method example (2)
Full feature set f1,f2,f3,f4
18
Embedded methods
  • Specific to a classification algorithm
  • Model parameters are directly used to derive
    feature weights (see the sketch below)
  • Examples:
  • Weighted Naïve Bayes Method (WNBM)
  • Weighted Linear Support Vector Machine (WLSVM)
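As an illustration, with a linear SVM the magnitudes of the learned hyperplane coefficients can be used directly as feature weights; a sketch assuming scikit-learn and placeholder data:

    import numpy as np
    from sklearn.svm import LinearSVC

    # Placeholder binary feature matrix and class labels
    X = np.random.randint(0, 2, size=(1000, 528))
    y = np.random.randint(0, 2, size=1000)

    # Fit a linear SVM; its hyperplane has one coefficient per feature
    clf = LinearSVC().fit(X, y)

    # Larger magnitude = more influence on the decision function
    weights = np.abs(clf.coef_).ravel()
    ranking = np.argsort(weights)[::-1]  # feature indices, most important first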

19
Case study: knowledge discovery for splice site
prediction
  • Splice site prediction
  • Correctly identify the borders of introns and
    exons in genes (splice sites)
  • Important for gene prediction
  • Split up into 2 tasks
  • Donor prediction (exon -> intron)
  • Acceptor prediction (intron -> exon)

20
Splice site prediction
  • Splice sites are characterized by a conserved
    dinucleotide in the intron part of the sequence
  • Donor sites: ... GT ...
  • Acceptor sites: ... AG ...
  • Classification problem
  • Distinguish between true GT, AG and false GT, AG

21
Splice site prediction: Features
  • Position dependent features
  • e.g. an A at position 1, a C at position 17, ...
  • Position independent features
  • e.g. the subsequence TCG occurs, GAG occurs, ...

(Figure: the example sequence atcgatcagtatcgat GT ctgagctatgag with
positions 1, 2, 3, ..., 17, ..., 28 marked)
22
Example: acceptor prediction
  • Local context of 100 nucleotides around the
    splice site
  • 100 position dependent features
  • 400 binary features (A=1000, T=0100, C=0010,
    G=0001)
  • 2x64 binary features, representing the occurrence
    of 3-mers
  • Total: 528 binary features (a sketch of the
    encoding follows below)
  • Color coding of feature importance
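A sketch of how such a 528-bit feature vector could be assembled; the split of the 2x64 3-mer features into a left and a right half of the context is an assumption for illustration:

    from itertools import product

    # One-hot code per nucleotide: A=1000, T=0100, C=0010, G=0001
    ONEHOT = {'A': [1, 0, 0, 0], 'T': [0, 1, 0, 0],
              'C': [0, 0, 1, 0], 'G': [0, 0, 0, 1]}
    KMERS = [''.join(p) for p in product('ATCG', repeat=3)]  # the 64 3-mers

    def encode(context):
        # context: 100-nucleotide local context around the candidate site
        # 400 position-dependent bits (100 positions x 4)
        bits = [b for nt in context for b in ONEHOT[nt]]
        # 2 x 64 position-independent bits: does each 3-mer occur in
        # the left / right half of the context? (assumed split)
        for half in (context[:50], context[50:]):
            bits += [1 if k in half else 0 for k in KMERS]
        return bits  # 400 + 128 = 528 binary features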

23
Donor prediction: 528 features

(Figure: color-coded importance map of the 528 features: the
position-dependent nucleotide features A, T, C, G across the 50+50
positions of the local context, and the position-independent 3-mer
features AAA, AAT, ..., TCG)
24
Acceptor prediction: 528 features

(Figure: the same color-coded importance map for acceptor sites, with
the 3-mers AAT, TAA, AGA, AGG, AGT, TAG, CAG highlighted)
25
How to decide on a splice site?
  • Classification models
  • PWM (position weight matrix; see the sketch below)
  • Collection of (conditional) probabilities
  • Linear discriminant analysis
  • Hyperplane decision function in a
    high-dimensional space
  • Classification tree
  • Decision is made by traversing a tree structure
  • Decision nodes
  • Leaf nodes
  • Easy for a human to interpret
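For instance, a PWM scores a candidate site by summing, per position, the log-odds of the observed nucleotide against a background distribution; a minimal sketch with toy numbers (the probabilities below are assumptions for illustration):

    import numpy as np

    def pwm_score(site, pwm):
        # pwm[i][nt]: probability of nucleotide nt at position i,
        # estimated from aligned true splice sites
        # Score = sum of per-position log-odds vs. a uniform background
        return sum(np.log2(pwm[i][nt] / 0.25) for i, nt in enumerate(site))

    # Toy PWM over 3 positions (assumed numbers, for illustration only)
    pwm = [{'A': 0.7, 'T': 0.1, 'C': 0.1, 'G': 0.1},
           {'A': 0.1, 'T': 0.1, 'C': 0.1, 'G': 0.7},
           {'A': 0.25, 'T': 0.25, 'C': 0.25, 'G': 0.25}]

    # A site is predicted as a true splice site if its score exceeds
    # some threshold t; higher = closer to the true-site profile
    print(pwm_score("AGT", pwm))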

26
Classification Tree
  • Choose the best attribute by a given selection
    measure
  • Extend the tree by adding a new branch for each
    attribute value
  • Sort the training examples to the leaf nodes
  • If examples are unambiguously classified Then Stop
    Else Repeat steps 1-4 for the leaf nodes
  • Prune unstable leaf nodes

(Figure: example decision tree with root node Temperature)
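A sketch of building and printing such a tree, assuming scikit-learn and placeholder data (the computer exercise itself uses WEKA):

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    # Placeholder binary features and class labels
    X = np.random.randint(0, 2, size=(200, 15))
    y = np.random.randint(0, 2, size=200)

    # Entropy as the attribute selection measure; ccp_alpha enables pruning
    tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=0.01).fit(X, y)

    # Print the tree as nested if/else rules, easy for a human to interpret
    print(export_text(tree, feature_names=[f"f{i+1}" for i in range(15)]))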
27
Acceptor prediction
  • Original dataset: 353 binary features
  • Reduce this set to 15 features, e.g. using a
    filter technique (see the sketch below)
  • 353 features are hard to visualize in e.g. a
    decision tree
  • 15 features are easy to visualize
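A sketch of that reduction step, using mutual information (information gain) as the filter criterion; scikit-learn and the placeholder data are assumptions:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    # Placeholder dataset: 353 binary features
    X = np.random.randint(0, 2, size=(500, 353))
    y = np.random.randint(0, 2, size=500)

    # Keep the 15 features with the highest mutual information with the class
    selector = SelectKBest(mutual_info_classif, k=15)
    X_small = selector.fit_transform(X, y)  # shape (500, 15)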

28
352 Binary features
29
15 Binary features
30
Computer exercise
  • Feature selection for classification of human
    acceptor splice sites
  • Use the WEKA machine learning toolkit for knowledge
    discovery in acceptor sites
  • Download the files from
  • http://www.psb.ugent.be/yvsae/GGSlecture.html