Title: Mining Association Rules from Microarray Gene Expression Data
1Mining Association Rules from Microarray Gene
Expression Data
2Association Rules
- Form: LHS => RHS, where LHS and RHS are disjoint
itemsets.
- Frequent itemset: supportT(I) >= a, where
supportT(I) is the number of transactions in T
that contain all the items in I and a is the
minimum support.
3Objective Measurements of Rule Interestingness
- Support of the rule LHS => RHS
  - Support(LHS U RHS): frequency with which LHS and
    RHS occur together in a transaction.
- Confidence of the rule LHS => RHS
  - Support(LHS U RHS) / Support(LHS): frequency
    with which RHS occurs when LHS occurs.
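The two measures above can be sketched in a few lines of Python. This is only an illustration: the function names and the toy market-basket transactions are made up, and support is expressed as a fraction of transactions rather than a raw count.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions if items <= t) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Support(LHS U RHS) / Support(LHS); 0.0 if LHS never occurs."""
    lhs_support = support(lhs, transactions)
    if lhs_support == 0:
        return 0.0
    return support(frozenset(lhs) | frozenset(rhs), transactions) / lhs_support


# Hypothetical toy data: four transactions over three items.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here the rule bread => milk holds in 2 of 4 transactions, and milk appears in 2 of the 3 transactions that contain bread.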
4Association Rule Discovery (ARD) algorithms
- Originally used in market basket analysis
- Unsupervised and domain independent
- Example: Apriori algorithm (Agrawal, 1993)
- Advantage: finds all associations that meet the chosen thresholds
- Disadvantage: very large number of associations
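A minimal sketch of the level-wise Apriori search for frequent itemsets, assuming support is counted in transactions. The function name `apriori` and the toy data are illustrative, and no attempt is made at the optimizations a real implementation would need.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item seen in the data.
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        # Count each candidate and keep those that reach the threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical toy data: five transactions over items A, B, C.
toy = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent = apriori(toy, min_support=3)
```

With a minimum support of 3, all single items and all pairs are frequent, but {A, B, C} occurs in only two transactions and is dropped.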
5Microarray technology
- Measurement of gene expression levels in cells
- Simultaneous measurement of thousands of genes
- Facilitates the study of gene interactions
- Expression profile: set of values for different
conditions
- One condition = one slide
- Condition: tissue, treatment or time point
- Data matrices
6Rules Applied to Gene Expression
- Each gene expression experiment is a single
transaction and each gene is an item.
- Gene values may be numerical and may need to be
binned as being up (expressed), down (repressed),
or neither.
- Items in a gene expression transaction can also
include relevant facts describing the cellular
environment.
- Find frequent itemsets: apply the Apriori algorithm.
- Generate association rules from the frequent
itemsets.
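The binning step can be sketched as follows. The thresholds of +1.0 and -1.0 on a log-ratio scale, the `_up`/`_down` item naming, and the `facts` parameter are all assumptions made for illustration; the slides do not specify how values were binned.

```python
def bin_expression(profile, up=1.0, down=-1.0, facts=()):
    """Turn one experiment, given as {gene: log-ratio}, into a transaction.

    Thresholds are hypothetical; genes that are neither up nor down
    contribute no item. `facts` holds extra items that describe the
    cellular environment.
    """
    items = set(facts)
    for gene, value in profile.items():
        if value >= up:
            items.add(f"{gene}_up")
        elif value <= down:
            items.add(f"{gene}_down")
    return frozenset(items)


# Hypothetical experiment with three genes and one environmental fact.
transaction = bin_expression(
    {"G1": 2.3, "G2": -1.5, "G3": 0.2},
    facts={"heat_shock"},
)
```

The resulting transaction contains `G1_up`, `G2_down` and `heat_shock`, and can be passed to any frequent itemset miner.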
7Example - association rules
- Database
- T1 A? B? C? D? E?
- T2 A? B? C? D? E?
- T3 A? B? C? D? E?
- T4 A? B? C? D? E?
- T5 A? B? C? D? E?
- Rules
- R1 A? ? B? ? C? ? E? (sup = 0.40, conf = 0.67)
- R2 A? ? B? ? D? (sup = 0.40, conf = 1.00)
- R3 A? => B? (sup = 0.60, conf = 1.00)
8 Rule induction algorithms
G1 is Gene 1, S1 is sample 1, ↑ means
overexpressed, and ↓ means underexpressed.
     G1  G2  G3  G4  Class
S1   ?   ?   ?   ?   Cancer
S2   ?   ?   ?   ?   Non-Cancer
...
Sn   ?   ?   ?   ?   Non-Cancer
Association rule discovery algorithm
Rule set
R1: G1? AND G3? => Cancer
R2: G1? AND G4? => Non-Cancer
R3: G1? AND G3? AND G2? => Cancer
9Forming Association rules
- Any frequent itemset of size greater than one can
be divided into two itemsets, LHS and RHS.
- Using objective measures: if the confidence of a
candidate rule exceeds a specified minimum
confidence criterion, the rule may be included.
- A very large set of rules is usually generated.
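A sketch of how one frequent itemset is split into candidate rules. It assumes the support counts of the itemset and of all its subsets are already available, which Apriori provides. The function name is illustrative, and treating the confidence threshold as inclusive is a choice made here.

```python
from itertools import combinations


def rules_from_itemset(itemset, support_count, min_confidence):
    """Split `itemset` into every LHS/RHS pair and keep confident rules.

    `support_count` maps frozensets to counts and must contain the
    itemset and all of its non-empty subsets.
    """
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            rhs = itemset - lhs
            conf = support_count[itemset] / support_count[lhs]
            if conf >= min_confidence:
                rules.append((lhs, rhs, conf))
    return rules


# Hypothetical counts: A occurs 4 times, B 3 times, together 3 times.
counts = {
    frozenset({"A"}): 4,
    frozenset({"B"}): 3,
    frozenset({"A", "B"}): 3,
}
rules = rules_from_itemset({"A", "B"}, counts, min_confidence=0.8)
```

Only B => A survives (confidence 3/3), while A => B is dropped (confidence 3/4). An itemset with n items yields 2^n - 2 candidate rules, which is one reason the rule set grows so quickly.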
10Example
- 28 treatments were recorded on 28 microarray
chips: 28 transactions.
- Each chip contains expression levels of
approximately 6200 genes (items).
- Applying the Apriori algorithm generated
70,000,000 rules.
11Subjective Measures for Selecting Rules
- Select genes before finding frequent itemsets:
limits the number of items.
- Limit the size of LHS or RHS
  - E.g., LHS contains only one item.
- Domain-specific measures: use knowledge about
the domain under study to specify patterns.
12Rule Filtering and Group (Tuzhilin and
Adomavicius 2002)
- Rule templates: specify restrictions on the
combinations of genes and their expression levels
that can appear in the body and head of the rule.
- RulePart HAS Quantifier OF C1, C2, ..., CN
[ONLY]
- RulePart: BODY, HEAD, or RULE.
- C1, C2, ..., CN is a comparison set,
representing a list of genes against which the
discovered rules will be compared. Each entry can be:
  - A gene, e.g., G17
  - A gene with a particular expression level, e.g.,
    G17?
  - A group (category) of genes, e.g., DNA_Repair
  - A group of genes with an expression level, e.g.,
    DNA_Repair
  - A group of genes with a list of allowable or
    unallowable expression levels, e.g., DNA_Repair
    ?,
13Rule Filtering cont
- Quantifier: a keyword or an expression specifying
how many genes from the C1, C2, ..., CN list
have to be contained in RulePart.
  - ALL, ANY, NONE: specify the number of genes
    from C1, C2, ..., CN that the RulePart must have
  - A numeric value, e.g., 2: the rule must
    have exactly 2 genes from the comparison set
  - A range of numeric values, e.g., 1,3,5-7
- ONLY indicates that RulePart can have only
the genes in the C1, C2, ..., CN list.
14Rule Filtering cont
- All rules that contain at least one of the
following genes: G1, G5, G7
  - RULE HAS (ANY) OF G1, G5, G7
  - Matching rule: G1? => G3?
- "When genes involved in DNA repair are
upregulated, what other gene categories are also
up- or downregulated?"
  - BODY HAS (ANY) OF DNA_Repair↑ AND HEAD HAS
    (ANY) OF All_Genes{↑, ↓}
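One way to evaluate a single template clause is sketched below. The representation is an assumption made for illustration, not the notation of Tuzhilin and Adomavicius: a rule is a (body, head) pair of sets of (gene, level) tuples, a comparison-set entry is either a gene name (any level) or a (gene, level) tuple, and gene groups are assumed to be expanded into their member genes beforehand.

```python
def parse_counts(spec):
    """Expand a numeric quantifier such as "1,3,5-7" into {1, 3, 5, 6, 7}."""
    counts = set()
    for piece in spec.split(","):
        if "-" in piece:
            low, high = piece.split("-")
            counts.update(range(int(low), int(high) + 1))
        else:
            counts.add(int(piece))
    return counts


def entry_matches(item, entry):
    """An entry is a gene name (any level) or an exact (gene, level) pair."""
    return item == entry if isinstance(entry, tuple) else item[0] == entry


def rule_part(rule, part):
    """Select the BODY, the HEAD, or the whole RULE."""
    body, head = rule
    return {"BODY": body, "HEAD": head, "RULE": body | head}[part]


def has(items, quantifier, comparison, only=False):
    """Evaluate: RulePart HAS quantifier OF comparison [ONLY]."""
    found = sum(1 for e in comparison
                if any(entry_matches(i, e) for i in items))
    if quantifier == "ALL":
        ok = found == len(comparison)
    elif quantifier == "ANY":
        ok = found >= 1
    elif quantifier == "NONE":
        ok = found == 0
    else:
        ok = found in parse_counts(quantifier)
    if only:
        # ONLY: every item in the rule part must come from the list.
        ok = ok and all(any(entry_matches(i, e) for e in comparison)
                        for i in items)
    return ok


# Hypothetical rule: G1 up => G3 down.
rule = (frozenset({("G1", "up")}), frozenset({("G3", "down")}))
matches = has(rule_part(rule, "RULE"), "ANY", ["G1", "G5", "G7"])
```

Clauses joined by AND, as in the DNA repair example, can be combined with ordinary boolean operators on the results of `has`.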
15Macro Templates (Tuzhilin and Adomavicius 2002)
- Detect unexpected rules
- CONTRADICT (GeneExprSet, G, ExpLevel)
- BODY HAS (ALL) of GeneExprSet AND
- HEAD HAS (ALL) OF G ? ExpLevel
- Example: CONTRADICT({G1?, G2?}, G4, ?)
- Unexpected rule: G1? AND G2? AND G3? => G4?
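A possible reading of the CONTRADICT macro, sketched in Python. It assumes the macro selects rules whose body contains all of GeneExprSet and whose head contains gene G at an expression level other than ExpLevel. That interpretation, the (gene, level) representation, and the expression levels in the example are assumptions made for illustration.

```python
def contradict(rule, gene_expr_set, gene, exp_level):
    """True if the rule contradicts the expectation gene_expr_set => gene at exp_level.

    Assumed reading: the body has ALL of `gene_expr_set`, and the head
    contains `gene` with a level different from `exp_level`.
    """
    body, head = rule
    body_ok = set(gene_expr_set) <= set(body)
    head_ok = any(g == gene and level != exp_level for g, level in head)
    return body_ok and head_ok


# Hypothetical rule with made-up levels: G1 up, G2 down, G3 up => G4 down.
rule = (
    frozenset({("G1", "up"), ("G2", "down"), ("G3", "up")}),
    frozenset({("G4", "down")}),
)
expectation = {("G1", "up"), ("G2", "down")}
unexpected = contradict(rule, expectation, "G4", "up")
```

If G4 was expected to go up whenever G1 is up and G2 is down, this rule is flagged as unexpected because it reports G4 going down.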
16Rule Grouping
- Group similar rules together into classes to be
analyzed
- Gene hierarchy: group genes based on their
functions.
ALL
  F1: G1, G2, G3
  F2: G4, G5
17Rule Grouping cont
- Aggregated rules
- Groups: F1 = {G1, G2, G3}, F2 = {G4, G5}
- Rules: R1: G1? => G4?, R2: G1? => G5?,
R3: G1? AND G3? => G5?
- Aggregated rules: R: F1 => F2, R': F1? => F2?
- R1, R2 and R3 belong to R; R1 and R2 belong to R';
R3 does not belong to R'
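Aggregation over the gene hierarchy can be sketched by replacing each gene with its functional group and collecting the rules that map to the same aggregated rule. The groups follow the slide, but the expression levels are made up for illustration, and `keep_levels` is a hypothetical switch between the two levels of aggregation.

```python
from collections import defaultdict

# Gene hierarchy from the slide: F1 = {G1, G2, G3}, F2 = {G4, G5}.
GROUP_OF = {"G1": "F1", "G2": "F1", "G3": "F1", "G4": "F2", "G5": "F2"}


def aggregate(rule, keep_levels=False):
    """Replace every gene in a rule by its group, optionally keeping levels."""
    def lift(items):
        if keep_levels:
            return frozenset((GROUP_OF[g], level) for g, level in items)
        return frozenset(GROUP_OF[g] for g, _ in items)

    body, head = rule
    return (lift(body), lift(head))


def group_rules(rules, keep_levels=False):
    """Map each aggregated rule to the names of the rules it covers."""
    groups = defaultdict(list)
    for name, rule in rules.items():
        groups[aggregate(rule, keep_levels)].append(name)
    return dict(groups)


# Hypothetical expression levels, chosen only to show the two cases.
rules = {
    "R1": (frozenset({("G1", "up")}), frozenset({("G4", "up")})),
    "R2": (frozenset({("G1", "up")}), frozenset({("G5", "up")})),
    "R3": (frozenset({("G1", "up"), ("G3", "down")}),
           frozenset({("G5", "up")})),
}
by_group = group_rules(rules)
by_group_and_level = group_rules(rules, keep_levels=True)
```

Ignoring expression levels, all three rules collapse into F1 => F2. Keeping levels, R1 and R2 still share one aggregated rule, while R3 falls into a class of its own under the levels assumed here.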
18Rule Derivation Procedure
- Database of transactions
- Feature selection: some features may be
irrelevant, or a reduction may be required for
efficiency reasons
- Reduced database
- Rule induction: a rule induction algorithm is
applied to the database (e.g. association rule
algorithm or decision tree algorithm)
- Initial rule set
- Rule selection: rules are assessed and ranked
using different measures of interestingness
- Relevant rule set
- Rule validation: the rules need to be validated
to be accepted as knowledge. In the context of gene
expression data, this can be done by more detailed
biological experiments
- Rules representing knowledge
19Mining Association Rules from Clinical Data
20Association Rules in Medical Data
- Medical record data
- Millions of claims for medical procedures
- Each patient may have several claims
- Each claim may have several line items or
records, one for each procedure performed.
- A diagnosis was reported with each procedure.
- Data: patient code, procedure code, diagnosis
code
- Goal: discover relationships between the procedures
performed on a patient and the reported diagnoses.
21Cont
- Data items: the set of all procedure and
diagnosis codes (7,365 + 9,383 = 16,748)
- Transaction: the set of procedure and diagnosis
codes for each patient (1,257,645 patients)
  - May produce unexpected rules, e.g., cast => heart
    disease
- Each itemset consists of one or more
procedure/diagnosis codes.
- The support of an itemset I is the number of
patients whose set of items includes all the items
in I (> 1).
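A sketch of how per-patient transactions might be assembled from claim line items. The record layout (patient code, procedure code, diagnosis code) follows the slide, but the codes are invented and the ("proc", ...) / ("diag", ...) tagging is an assumption used to keep the two code spaces apart.

```python
from collections import defaultdict


def build_transactions(records):
    """Pool all procedure and diagnosis codes of each patient into one set.

    `records` is an iterable of (patient, procedure, diagnosis) line items.
    """
    by_patient = defaultdict(set)
    for patient, procedure, diagnosis in records:
        by_patient[patient].add(("proc", procedure))
        by_patient[patient].add(("diag", diagnosis))
    return {p: frozenset(items) for p, items in by_patient.items()}


def support_count(itemset, transactions):
    """Number of patients whose transaction contains every item in `itemset`."""
    items = frozenset(itemset)
    return sum(1 for t in transactions.values() if items <= t)


# Hypothetical line items for two patients.
records = [
    ("p1", "cast", "fracture"),
    ("p1", "ecg", "heart disease"),
    ("p2", "ecg", "heart disease"),
]
transactions = build_transactions(records)
odd_pair = support_count(
    {("proc", "cast"), ("diag", "heart disease")}, transactions
)
```

Because codes are pooled per patient rather than per claim, a cast and a heart disease diagnosis from unrelated claims end up in the same transaction for patient p1, which is presumably how a rule such as cast => heart disease can arise.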
22Formulating Rules
- Apply the Apriori algorithm to generate all
frequent itemsets.
- Restrict to those frequent itemsets which
contain both procedures and diagnoses.
- For each selected frequent itemset, formulate one
rule with all procedure codes on the left
and all diagnosis codes on the right.
- Compute confidence and eliminate rules with
confidence less than 65%.
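The steps above can be sketched as follows, assuming the frequent itemsets and their support counts are already available as a dictionary and that each item is tagged as ("proc", code) or ("diag", code). The tagging, the function name and the counts are hypothetical.

```python
def formulate_rules(frequent, min_confidence=0.65):
    """Build one procedures => diagnoses rule per mixed frequent itemset.

    `frequent` maps frozensets of (kind, code) items to support counts.
    Every subset of a frequent itemset is itself frequent, so the
    procedure-only part of each itemset is also a key of `frequent`.
    """
    rules = []
    for itemset, count in frequent.items():
        procedures = frozenset(i for i in itemset if i[0] == "proc")
        diagnoses = itemset - procedures
        if not procedures or not diagnoses:
            continue  # keep only itemsets with both kinds of code
        conf = count / frequent[procedures]
        if conf >= min_confidence:
            rules.append((procedures, diagnoses, conf))
    return rules


# Hypothetical support counts.
ecg, heart = ("proc", "ecg"), ("diag", "heart disease")
cast, fracture = ("proc", "cast"), ("diag", "fracture")
frequent = {
    frozenset({ecg}): 10,
    frozenset({heart}): 12,
    frozenset({ecg, heart}): 8,
    frozenset({cast}): 5,
    frozenset({fracture}): 9,
    frozenset({cast, fracture}): 3,
}
rules = formulate_rules(frequent)
```

With these counts, ecg => heart disease is kept (confidence 8/10) and cast => fracture is eliminated (confidence 3/5, below 65%).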