Feature Selection and Pattern Discovery from Microarray Gene Expressions (Slide Transcript)
1
Feature Selection and Pattern Discovery from
Microarray Gene Expressions
CSCI5980 Functional Genomics, Systems Biology
and Bioinformatics
  • Rui Kuang and Chad Myers
  • Department of Computer Science and Engineering
  • University of Minnesota
  • kuang_at_cs.umn.edu

2
Feature Selection and Pattern Discovery
  • Feature selection: identify the features that are most relevant for characterizing the system and its behavior.
  • Often improves classification accuracy
  • Focuses attention on the key features that explain the system
  • Often used for marker gene selection in microarray data analysis
  • Pattern discovery: identify special associations between features and samples.
  • Often involves a subset of samples and a subset of features, as in bi-clustering
  • Bi-clustering is difficult in general, but there are efficient algorithms that find all bi-clusters in the discretized case

3
Finding Disease-causing Factors
  • Clinical factors and personalized genetic/genomic information

[Figure: Human Disease]
4
Biomarkers and Personalized Medicine
  • Molecular traits that can characterize a certain phenotype (what we can observe or measure)
  • In cancer studies, biomarkers can be used for
  • Prognosis: predict the outcome of cancer (metastasis) to decide how aggressively to treat the patient.
  • Treatment response: predict a patient's response to a certain type of treatment.
  • Discovery:
  • DNA copy-number
  • Gene expression profiling
  • Proteomic profiling
  • Etc.

5
Biomarker Identification in a Case-Control study
[Figure: gene-expression matrices (genes × samples) for cases and controls]
6
Statistical Methods
[Figure: gene-expression matrix (genes × samples) with cases and controls]
7
Statistical Methods (Cont.)
  • Each gene is considered independently (a minimal per-gene test is sketched below)
  • Cannot detect combined markers that are highly discriminative between the groups
  • Uses all the samples to quantify the difference
  • Cannot capture markers specific to a subpopulation.
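
A minimal sketch of the per-gene testing described above, assuming the data form a genes × samples matrix and using a two-sample t-test from SciPy; the toy data and the choice of test are illustrative, not prescribed by the slides:

    # Score each gene independently with a two-sample t-test (cases vs. controls).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    expr = rng.normal(size=(100, 40))             # toy data: 100 genes x 40 samples
    is_case = np.array([True] * 20 + [False] * 20)

    p_values = np.array([
        stats.ttest_ind(gene[is_case], gene[~is_case]).pvalue
        for gene in expr                          # each gene is tested on its own
    ])
    top_genes = np.argsort(p_values)[:10]         # the 10 most differential genes
    print(top_genes)

Because each gene is scored in isolation, combinations of genes that are only jointly discriminative receive no special treatment, which is exactly the limitation listed above.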

8
Feature Selection
[Figure: gene-expression matrix (genes × samples) split into cases and controls]
Search for a minimal set of genes that leads to
maximum classification performance
9
Filters vs. Wrappers
  • Main goal: rank subsets of useful features.

10
Forward Selection
  • Add the highest ranked feature
  • Check classification performance

[Figure: forward selection workflow (rank features); classification accuracy: 75%]
11
Forward Selection
  • Add the highest ranked feature
  • Check classification performance
  • Add the next highest ranked feature

[Figure: forward selection workflow (rank features); classification accuracy: 75% → 95%]
12
Forward Selection
  • Add the highest ranked feature
  • Check classification performance
  • Add the next highest ranked feature

[Figure: forward selection workflow (rank features); classification accuracy: 95% → 80%]
13
Backward Elimination
  • Remove the lowest ranked feature
  • Check classification performance

[Figure: backward elimination workflow (rank features); classification accuracy: 60% → 75%]
14
Backward Elimination
  • Remove the lowest ranked feature
  • Check classification performance
  • Remove the next lowest ranked feature, until performance becomes worse

[Figure: backward elimination workflow (rank features); classification accuracy: 95%]
15
Feature Selection
Slightly more sophisticated versions of Forward
Selection and Backward Elimination
  • Forward Selection Method (a minimal code sketch follows this slide)
  • Steps
  • Build classifiers with 1 feature each and rank all features according to the predictive power of these classifiers
  • Start with the highest-ranked feature
  • a) Build a classifier and check its classification performance
  • If performance is worse than in the previous round, then stop
  • Else add the next feature and go to a).
  • Backward Elimination Method
  • Steps
  • 1. Start with all the features
  • 2. For each feature x in the set: remove x from the set and check the classification performance of the classifier built without x
  • 3. Remove the feature whose removal gives the best classification performance for the remaining features.
  • 4. Repeat 2-3 until the performance starts to drop
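
A minimal sketch of the forward-selection procedure listed above, assuming X is a samples × genes matrix and y holds the class labels; the logistic-regression classifier and 5-fold cross-validation are illustrative choices, not prescribed by the slides:

    # Rank genes by single-feature predictive power, then add them in rank order
    # until cross-validated accuracy stops improving.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def cv_accuracy(X, y, cols):
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, X[:, cols], y, cv=5).mean()

    def forward_selection(X, y):
        # Step 1: rank features by the accuracy of a one-feature classifier.
        ranking = sorted(range(X.shape[1]),
                         key=lambda j: cv_accuracy(X, y, [j]), reverse=True)
        selected = [ranking[0]]
        best = cv_accuracy(X, y, selected)
        for j in ranking[1:]:
            score = cv_accuracy(X, y, selected + [j])
            if score < best:                     # stop once performance gets worse
                break
            selected, best = selected + [j], score
        return selected, best

Backward elimination follows the same pattern: start from all columns and, in each round, drop the feature whose removal gives the best cross-validated accuracy, stopping once the score starts to drop.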

16
Feature Selection: Embedded Methods
  • Embedded methods incorporate variable selection as part of the model building process
  • Example: SVM feature selection (a minimal RFE-SVM sketch follows below)

[Figure: RFE-SVM flowchart; start with all features and eliminate until the stopping criterion is met (yes: stop; no: continue)]
Recursive Feature Elimination (RFE) SVM. Guyon-Weston, 2000. US patent 7,117,188
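
A minimal sketch of recursive feature elimination with a linear SVM, using scikit-learn's RFE wrapper; the synthetic data and the parameter values (50 retained features, 10% eliminated per round) are illustrative assumptions:

    # RFE repeatedly fits the SVM, ranks features by |weight|, and drops the lowest-ranked ones.
    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=60, n_features=500,
                               n_informative=10, random_state=0)
    svm = SVC(kernel="linear", C=1.0)
    rfe = RFE(estimator=svm, n_features_to_select=50, step=0.1)
    rfe.fit(X, y)
    selected_genes = rfe.get_support(indices=True)   # indices of the surviving features
    print(selected_genes)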
17
Embedded methods
[Figure: RFE-SVM flowchart, continued from the previous slide]
18
RFE SVM for cancer diagnosis
Differentiation of 14 tumor types. Ramaswamy et al., PNAS, 2001
19
Feature Selection (Cont.)
  • NP-hard to search through all the combinations
  • Need heuristic solutions
  • The underlying assumption is that maximum classification performance identifies the relevant features.
  • There might be more than one subset of features that gives the optimal classification performance.
  • Doesn't consider the modular structure of co-expressed genes
  • May omit other important genes in the marker module

20
Bi-cluster Structure
[Figure: gene-expression matrix (genes × cases and controls) containing a bi-cluster]
  • Bi-clustering: relevant knowledge can be hidden in a group of genes with a common pattern across a subset of the conditions, e.g. genes co-expressed under some conditions
  • It is NP-hard to discover even one bi-cluster
  • However, in the discretized case, the optimal solution can be found efficiently with association rule mining algorithms (a sketch of the discretization step follows this slide)

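
One way to connect bi-clustering to association rule mining, as suggested above: discretize the expression matrix and treat each sample as a transaction of (gene, level) items, so that a frequent itemset together with its supporting samples forms a discretized bi-cluster. The z-score encoding and the ±1 cutoff below are illustrative assumptions, not the lecture's exact procedure:

    # Turn a discretized genes x samples matrix into transactions for itemset mining.
    import numpy as np

    rng = np.random.default_rng(0)
    expr = rng.normal(size=(50, 30))                 # toy data: 50 genes x 30 samples
    z = (expr - expr.mean(axis=1, keepdims=True)) / expr.std(axis=1, keepdims=True)

    transactions = []
    for s in range(z.shape[1]):                      # one transaction per sample
        items = {f"g{g}_up" for g in range(z.shape[0]) if z[g, s] > 1.0}
        items |= {f"g{g}_down" for g in range(z.shape[0]) if z[g, s] < -1.0}
        transactions.append(items)
    # Mining these transactions (see the Apriori slides later) yields gene sets that
    # share the same discretized pattern across many samples.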
21
Association Rule Mining
  • Proposed by Agrawal et al. in 1993.
  • It is an important data mining model studied extensively by the database and data mining communities.
  • Assumes all data are categorical.
  • No good algorithm for numeric data.
  • Initially used for Market Basket Analysis to find how items purchased by customers are related.

22
Transaction data: supermarket data
  • Market basket transactions
  • t1: {bread, cheese, milk}
  • t2: {apple, eggs, salt, yogurt}
  • ...
  • tn: {biscuit, eggs, milk}
  • Concepts
  • Item: an item/article in a basket
  • I: the set of all items sold in the store
  • Transaction: the items purchased in a basket; it may have a TID (transaction ID)
  • Transactional dataset: a set of transactions

23
Rule strength measures
  • Support The rule holds with support sup in T
    (the transaction data set) if sup of
    transactions contain X ? Y.
  • sup Pr(X ? Y).
  • Confidence The rule holds in T with confidence
    conf if conf of tranactions that contain X also
    contain Y.
  • conf Pr(Y X)
  • An association rule is a pattern that states when
    X occurs, Y occurs with certain probability.
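
A small sketch that computes these two measures directly from their definitions; the toy transactions reuse the supermarket example from the earlier slide:

    # Support and confidence of a rule X -> Y over a list of transactions (sets of items).
    def support(itemset, transactions):
        return sum(itemset <= t for t in transactions) / len(transactions)

    def confidence(X, Y, transactions):
        return support(X | Y, transactions) / support(X, transactions)

    transactions = [{"bread", "cheese", "milk"},
                    {"apple", "eggs", "salt", "yogurt"},
                    {"biscuit", "eggs", "milk"}]
    print(support({"milk"}, transactions))                # 2/3
    print(confidence({"milk"}, {"bread"}, transactions))  # (1/3) / (2/3) = 1/2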

24
Association Rule Mining
  • Two types of patterns
  • Itemsets: collections of items
  • Example: {Milk, Diaper}
  • Association rules: X → Y, where X and Y are itemsets.
  • Example: {Milk} → {Diaper}

Set-Based Representation of Data
25
An example
  • Transaction data
  • t1: {Beef, Chicken, Milk}
  • t2: {Beef, Cheese}
  • t3: {Cheese, Boots}
  • t4: {Beef, Chicken, Cheese}
  • t5: {Beef, Chicken, Clothes, Cheese, Milk}
  • t6: {Chicken, Clothes, Milk}
  • t7: {Chicken, Milk, Clothes}
  • Assume
  • minsup = 30%
  • minconf = 80%
  • An example frequent itemset
  • {Chicken, Clothes, Milk}, sup = 3/7
  • Association rules from the itemset
  • Clothes → Milk, Chicken, sup = 3/7, conf = 3/3
  • Clothes, Chicken → Milk, sup = 3/7, conf = 3/3

26
Association Rule Mining
  • Process of finding interesting patterns
  • Find frequent itemsets using a support threshold
  • Find association rules from the frequent itemsets
  • Sort association rules according to confidence
  • Support filtering is necessary
  • To eliminate spurious patterns
  • To avoid an exponential search
  • Support has the anti-monotone property: X ⊆ Y implies σ(Y) ≤ σ(X)
  • Confidence is used because of its interpretation as a conditional probability

Given d items, there are 2^d possible candidate itemsets
27
The Apriori algorithm
  • Two steps
  • Find all itemsets that have minimum support (frequent itemsets, also called large itemsets).
  • Use the frequent itemsets to generate rules.
  • E.g., a frequent itemset
  • {Chicken, Clothes, Milk}, sup = 3/7
  • and one rule from the frequent itemset
  • Clothes → Milk, Chicken, sup = 3/7, conf = 3/3

28
Step 1: Mining all frequent itemsets
  • A frequent itemset is an itemset whose support is ≥ minsup.
  • Key idea: the apriori property (downward closure property): any subset of a frequent itemset is also a frequent itemset

Itemset lattice over items {A, B, C, D}:
  ABC  ABD  ACD  BCD
  AB  AC  AD  BC  BD  CD
  A  B  C  D
29
The Algorithm
  • Iterative algorithm (also called level-wise search): find all 1-item frequent itemsets, then all 2-item frequent itemsets, and so on.
  • In each iteration k, only consider itemsets that contain a frequent (k-1)-itemset.
  • Find the frequent itemsets of size 1: F1
  • From k = 2:
  • Ck = candidates of size k: those itemsets of size k that could be frequent, given Fk-1
  • Fk = those itemsets that are actually frequent, Fk ⊆ Ck (need to scan the database once).

30
Details: the algorithm (a runnable Python rendering follows below)
  • Algorithm Apriori(T)
  • C1 ← init-pass(T)
  • F1 ← {f | f ∈ C1, f.count/n ≥ minsup}   // n = no. of transactions in T
  • for (k = 2; Fk-1 ≠ ∅; k++) do
  •   Ck ← candidate-gen(Fk-1)
  •   for each transaction t ∈ T do
  •     for each candidate c ∈ Ck do
  •       if c is contained in t then
  •         c.count++
  •     end
  •   end
  •   Fk ← {c ∈ Ck | c.count/n ≥ minsup}
  • end
  • return F ← ∪k Fk
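
A compact, runnable Python rendering of the pseudocode above; candidate generation here is a straightforward join of frequent (k-1)-itemsets followed by the downward-closure prune, and the example transactions are the ones from the earlier supermarket slide:

    from itertools import combinations

    def apriori(transactions, minsup):
        n = len(transactions)
        items = sorted({i for t in transactions for i in t})
        def sup(itemset):
            return sum(set(itemset) <= t for t in transactions) / n
        F = {1: [(i,) for i in items if sup((i,)) >= minsup]}     # frequent 1-itemsets
        k = 2
        while F[k - 1]:
            prev = set(F[k - 1])
            # Join step: combine frequent (k-1)-itemsets into size-k candidates.
            cand = {tuple(sorted(set(a) | set(b)))
                    for a in prev for b in prev if len(set(a) | set(b)) == k}
            # Prune step: every (k-1)-subset of a candidate must itself be frequent.
            cand = {c for c in cand
                    if all(s in prev for s in combinations(c, k - 1))}
            F[k] = [c for c in cand if sup(c) >= minsup]          # one count per level
            k += 1
        return [c for level in F.values() for c in level]

    T = [{"Beef", "Chicken", "Milk"}, {"Beef", "Cheese"}, {"Cheese", "Boots"},
         {"Beef", "Chicken", "Cheese"}, {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
         {"Chicken", "Clothes", "Milk"}, {"Chicken", "Milk", "Clothes"}]
    print(apriori(T, minsup=3 / 7))   # includes ('Chicken', 'Clothes', 'Milk')

This toy version rescans the transaction list for every candidate; the real algorithm counts all candidates of one level in a single database pass.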

31
Step 2: Generate rules from frequent itemsets
  • Frequent itemsets ≠ association rules
  • One more step is needed to generate association rules (a minimal sketch follows below)
  • For each frequent itemset X,
  • For each proper nonempty subset A of X,
  • Let B = X − A
  • A → B is an association rule if
  • confidence(A → B) ≥ minconf, where
  • support(A → B) = support(A ∪ B) = support(X)
  • confidence(A → B) = support(A ∪ B) / support(A)
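
A minimal sketch of this rule-generation step for a single frequent itemset; the support table below is filled in by hand from the earlier example and would normally come from the Apriori pass:

    from itertools import combinations

    def rules_from_itemset(X, support, minconf):
        # `support` maps frozensets to their support values.
        X = frozenset(X)
        rules = []
        for r in range(1, len(X)):                      # proper nonempty subsets A
            for A in map(frozenset, combinations(X, r)):
                conf = support[X] / support[A]           # conf(A -> X-A) = sup(X) / sup(A)
                if conf >= minconf:
                    rules.append((set(A), set(X - A), support[X], conf))
        return rules

    support = {frozenset({"Chicken", "Clothes", "Milk"}): 3 / 7,
               frozenset({"Chicken", "Clothes"}): 3 / 7,
               frozenset({"Clothes", "Milk"}): 3 / 7,
               frozenset({"Chicken", "Milk"}): 4 / 7,
               frozenset({"Chicken"}): 5 / 7,
               frozenset({"Clothes"}): 3 / 7,
               frozenset({"Milk"}): 4 / 7}
    print(rules_from_itemset({"Chicken", "Clothes", "Milk"}, support, minconf=0.8))
    # prints the rules with confidence 3/3, e.g. {Clothes} -> {Chicken, Milk}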

32
On the Apriori Algorithm
  • Seems to be very expensive
  • Level-wise search
  • K = the size of the largest itemset
  • It makes at most K passes over the data
  • In practice, K is bounded (around 10).
  • The algorithm is very fast. Under some conditions, all rules can be found in linear time.
  • Scales up to large data sets

33
More on association rule mining
  • Clearly, the space of all association rules is exponential, O(2^m), where m is the number of items in I.
  • The mining exploits the sparseness of the data, and high minimum support and high minimum confidence values.
  • Still, it always produces a huge number of rules: thousands, tens of thousands, millions, ...

34
Association Rule Mining
  • Association analysis is mainly about finding frequent patterns. It is not clear how to introduce label information to find patterns that discriminate between two classes.
  • It is hard to extend the existing algorithms to handle tens of thousands of non-sparse features.
  • However, when it works, the identified patterns can provide much more information than just a set of selected features.