Mining Association Rules from Microarray Gene Expression Data - PowerPoint PPT Presentation

About This Presentation
Title:

Mining Association Rules from Microarray Gene Expression Data

Description:

Form: LHS = RHS, where LHS and RHS are disjoint itemsets. ... Association Rule Discovery (ARD) algorithms. Originally used in market basket analysis ... – PowerPoint PPT presentation

Number of Views:106
Avg rating:3.0/5.0
Slides: 23
Provided by: Gam53
Learn more at: https://cse.buffalo.edu
Category:

less

Transcript and Presenter's Notes

Title: Mining Association Rules from Microarray Gene Expression Data


1
Mining Association Rules from Microarray Gene
Expression Data
2
Association Rules
  • Form LHS gt RHS, where LHS and RHS are disjoint
    itemsets.
  • Frequent itemset supportT(I) a, where
    supportT(I) is the number of transactions in T
    that contains all the items in I and a is the
    minimum support.

3
Objective Measurements of Rule Interestingness
  • Support of the rule LHS gt RHS
  • Support(LHS U RHS), frequency that LHS and
    RHS occur together in a transaction.
  • Confidence of the rule LHS gt RHS
  • Support (LHS U RHS)/Support (LHS), frequency
    that RHS occurs when LHS occurs.

4
Association Rule Discovery (ARD) algorithms
  • Originally used in market basket analysis
  • Unsupervised and domain independent
  • Example Apriori algorithm (Agrawal, 1993)
  • Advantage Finds all associations
  • Disadvantage Very large number of associations

5
Microarray technology
  • Measurement of gene expression levels in cells
  • Simultaneous measurement of thousands of genes
  • Facilitates the study of gene interactions
  • Expression profileset of values for different
    conditions
  • One condition one slide
  • condition tissue, treatment or time point
  • Data matrices

6
Rules Applied to Gene Expression
  • Each gene expression experiment is a single
    transaction and each gene is an item.
  • Gene value may be numerical, may need to be
    binned as being up (expressed), down (repressed),
    or neither.
  • Items in a gene expression transaction can also
    include relevant facts describing the cellular
    environment.
  • Find frequent itemsets apply Apriori algorithm.
  • Generate association rules from the frequent
    itemsets.

7
Example - association rules
  • Database
  • T1 A? B? C? D? E?
  • T2 A? B? C? D? E?
  • T3 A? B? C? D? E?
  • T4 A? B? C? D? E?
  • T5 A? B? C? D? E?
  • Rules
  • R1 A? ? B? ? C? ? E? (sup0.40, conf0.67)
  • R2 A? ? B? ? D? (sup0.40, conf1.00)
  • R3 A? ? B? (sup0.60, conf1.00)

8
Rule induction algorithms
G1 is Gene 1, S1 is sample 1, ? means
overexpressed , and ? means underexpressed.
G1 G2 G3 G4 Class S1 ? ? ? ? Cancer S2 ?
? ? ? Non-Cancer . . . . . . . . . .
. . . . . . . . Sn ? ? ? ? Non-Cancer
Association rule discovery algorithm

Rule set
R1 G1? ? G3? ? Cancer R2 G1? ? G4? ?
Non-Cancer R3 G1? ? G3 ? ? G2? ? Cancer
9
Forming Association rules
  • Any frequent itemset of size greater than one can
    be divided into two itemsets, LHS and RHS.
  • Using objective measures If the confidence of a
    candidate rule exceeds a speficied minimum
    confidence criterion, the rule may be included.
  • A very large set of rules is usually generated.

10
Example
  • 28 treatments were recorded on 28 microarray
    chips 28 transactions.
  • Each chip contains expression levels of
    approximately 6200 genes (items).
  • After Apriori algorithm, 70,000,000 rules were
    generated.

11
Subjective Measures for Selecting Rules
  • Select genes before finding frequent items
    limit the number of items.
  • Limit the size of LHS or RHS
  • E.g., LHS contains only one item.
  • Domain specific measures -- Use knowledge about
    the domain under study to specify patterns.

12
Rule Filtering and Group (Tuzhilin and
Adomavicius 2002)
  • Rule templates specify restrictions on the
    combinations of genes and their expression levels
    that can appear in the body and head of the rule.
  • RulePart HAS Quantifier OF C1, C2, ..., CN
    ONLY
  • RulePart BODY, HEAD, or RULE.
  • C1, C2, ..., CN is a comparison set,
    representing a list of genes against which the
    discovered rules will be compared
  • A gene, e.g., G17
  • A gene with a particular expression level, e.g.,
    G17?
  • A group (category) of genes, e.g., DNA_Repair
  • A group of genes with an expression level, e.g.,
    DNA_Repair
  • A group of genes with a list of allowable or
    unallowable expression levels, e.g., DNA_Repair
    ?,

13
Rule Filtering cont
  • Quantifier a keyword or an expression specifying
    how many genes specified by C1, C2, ..., CN List
    have to be contained in RulePart.
  • ALL, ANY, NONE, specifying the number of genes
    from C1, C2, , CN the RulePart must have
  • A numeric value e.g., 2, specifying a rule must
    have exactly 2 genes from the comparison set
  • A range of numeric values e.g., 1,3,5-7
  • ONLY is used to indicate RulePart can have only
    the genes in the C1, C2, , CN list.

14
Rule Filtering cont
  • All rules that contain at least one of the
    following genes G1, G5, G7
  • RULE HAS (ANY) of G1, G5, G7,
  • Matching rules G1? gt G3?.
  • "When genes involved in the DNA repair are
    upregulated, what other gene categories are also
    up- or downregulated?"
  • BODY HAS (ANY) OF DNA_Repair? AND HEAD HAS
    (ANY) OF All_Genes?, ?

15
Macro Templates (Tuzhilin and Adomavicius 2002)
  • Detect unexpected rules
  • CONTRADICT (GeneExprSet, G, ExpLevel)
  • BODY HAS (ALL) of GeneExprSet AND
  • HEAD HAS (ALL) OF G ? ExpLevel
  • CONTRADICT(G1?, G2?,G4,?
  • Unexpected rule G1??G2??G3?? G4?

16
Rule Grouping
  • Group similar rules together into classes to be
    analyzed
  • Gene hierarchy group genes based on their
    functions.

ALL
F1
F2
G2
G3
G4
G5
G1
17
Rule Grouping cont
  • Aggregated rules
  • Groups F1G1,G2,G3, F2G4,G5
  • Rules R1G1??G4?, R2G1??G5?,
    R3G1??G3??G5?
  • Aggregated rule RF1?F2, R'F1??F2?
  • R1,R2,R3?R, R1,R2?R', R3?R'

18
Rule Derivation Procedure
Some features may be irrelevant or a reduction
may be required for efficiency reasons
Database of transactions
Feature selection
A rule induction algorithm is applied to the
database (e.g. association rule algorithm or
decision tree algorithm)
Reduced database
Rule induction
Initial rule set
Rules are assessed and ranked using different
measures of interestingness
Rule selection
Relevant rule set
The rules need to be validated to be accepted as
knowledge. Can be done by more detailed
biological experiments in the context of gene
expression data
Rule validation
Rules representing knowledge
19
Mining Association Rules from Clinical Data
20
Association Rules in Medical Data
  • Medical record data
  • Millions of claims for medical procedures
  • Each patient may have several claims
  • Each claim may have several line items or
    records, one for each procedure performed.
  • A diagnosis was reported with each procedure.
  • Data patient code, procedure code, diagnosis
    code
  • Goal discover relationships between procedure
    performed on a patient and the reported diagnosis.

21
Cont
  • Data items the set of all procedure and
    diagnosis codes (7,365 9,38316,748)
  • Transaction the set of procedure and diagnosis
    codes for each patient (1,257,645 patients)
  • May cause unexpected rule cast gt heart
    disease
  • Each itemset consists of one or more
    procedure/diagnosis codes.
  • The support of an itemset I is the number of
    patients whose set of items include all the items
    in I (gt 1).

22
Formulating Rules
  • Applying Aprior algorithm to generate all
    frequent itemsets.
  • Restricting to those frequent itemsets which
    contain both procedures and diagnoses
  • For each selected frequent itemset, one rule is
    formulated with all procedure codes on the left
    and all diagnosis codes on the right.
  • Computing confidence and eliminating rules with
    confidence less than 65.
Write a Comment
User Comments (0)
About PowerShow.com