Mining Association Rules from Microarray Gene Expression Data - PowerPoint PPT Presentation

About This Presentation

Title:

Mining Association Rules from Microarray Gene Expression Data

Slides: 23

Provided by: Gam53

Learn more at: https://cse.buffalo.edu

Transcript and Presenter's Notes

Title: Mining Association Rules from Microarray Gene Expression Data

1
Mining Association Rules from Microarray Gene
Expression Data
2
Association Rules

Form: LHS => RHS, where LHS and RHS are disjoint
itemsets.
Frequent itemset: supportT(I) >= a, where
supportT(I) is the number of transactions in T
that contain all the items in I and a is the
minimum support.

3
Objective Measurements of Rule Interestingness

Support of the rule LHS => RHS:
Support(LHS U RHS), the frequency with which LHS and
RHS occur together in a transaction.
Confidence of the rule LHS => RHS:
Support(LHS U RHS) / Support(LHS), the frequency
with which RHS occurs when LHS occurs.
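The two measures above can be computed directly from a list of transactions. The following is a minimal sketch, not code from the presentation: the function names and the toy transactions are made up for illustration, and support is returned as a fraction of all transactions.

```python
def support_count(transactions, itemset):
    """Number of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return support_count(transactions, itemset) / len(transactions)


def confidence(transactions, lhs, rhs):
    """Support(LHS U RHS) / Support(LHS) for the rule LHS => RHS."""
    lhs, rhs = frozenset(lhs), frozenset(rhs)
    if lhs & rhs:
        raise ValueError("LHS and RHS must be disjoint")
    lhs_count = support_count(transactions, lhs)
    if lhs_count == 0:
        return 0.0  # the rule never applies, so report zero confidence
    return support_count(transactions, lhs | rhs) / lhs_count


# Hypothetical toy database of five transactions.
transactions = [
    frozenset({"A", "B", "C"}),
    frozenset({"A", "B"}),
    frozenset({"A", "C"}),
    frozenset({"B", "D"}),
    frozenset({"A", "B", "D"}),
]

rule_support = support(transactions, {"A", "B"})       # 3 of 5 transactions
rule_confidence = confidence(transactions, {"A"}, {"B"})  # 3 of the 4 with A
```

Working with counts rather than ratios of fractions avoids small floating-point differences when the two supports are divided.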

4
Association Rule Discovery (ARD) algorithms

Originally used in market basket analysis
Unsupervised and domain independent
Example: Apriori algorithm (Agrawal, 1993)
Advantage: Finds all associations
Disadvantage: Very large number of associations
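As a rough illustration of how a level-wise search like Apriori finds all frequent itemsets, here is a simplified sketch. It is an assumption-laden teaching version (function name, count-based threshold, and toy data are all hypothetical), not the optimized algorithm from the original paper.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frozenset(itemset): count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidate):
        return sum(1 for t in transactions if candidate <= t)

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        candidate = frozenset([item])
        n = count(candidate)
        if n >= min_support_count:
            current[candidate] = n

    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-itemsets.
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        current = {}
        for candidate in candidates:
            n = count(candidate)
            if n >= min_support_count:
                current[candidate] = n
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy database.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
frequent_itemsets = apriori(baskets, min_support_count=3)
```

In this toy run every single item and every pair is frequent, while the triple {a, b, c} occurs only twice and is dropped. Lowering the threshold shows how quickly the number of itemsets grows, which is the disadvantage noted on the slide.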

5
Microarray technology

Measurement of gene expression levels in cells
Simultaneous measurement of thousands of genes
Facilitates the study of gene interactions
Expression profile: set of values for different
conditions
One condition = one slide
Condition = tissue, treatment or time point
Data matrices

6
Rules Applied to Gene Expression

Each gene expression experiment is a single
transaction and each gene is an item.
Gene values may be numerical and may need to be
binned as up (expressed), down (repressed),
or neither.
Items in a gene expression transaction can also
include relevant facts describing the cellular
environment.
Find frequent itemsets: apply the Apriori algorithm.
Generate association rules from the frequent
itemsets.
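The binning step can be sketched as follows. This is only an illustration under stated assumptions: the values are taken to be log2 expression ratios, the cutoff of 1.0 (a twofold change) is an arbitrary choice, and the item naming scheme (`G1_up`, `G2_down`) and the `facts` argument are hypothetical.

```python
def bin_expression(log_ratios, threshold=1.0, facts=()):
    """Turn one experiment into a transaction (a set of items).

    log_ratios: dict mapping gene name -> numeric expression value
                (assumed here to be a log2 ratio).
    threshold:  assumed cutoff; values in (-threshold, threshold) count
                as "neither" and produce no item.
    facts:      optional extra items describing the cellular environment.
    """
    items = set(facts)
    for gene, value in log_ratios.items():
        if value >= threshold:
            items.add(f"{gene}_up")
        elif value <= -threshold:
            items.add(f"{gene}_down")
    return frozenset(items)


# One hypothetical experiment with three genes and one environment fact.
experiment = {"G1": 2.3, "G2": -1.5, "G3": 0.2}
transaction = bin_expression(experiment, facts={"heat_shock"})
```

Each experiment processed this way becomes one transaction, so a set of experiments yields the transaction database that a frequent-itemset search expects.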

7
Example - association rules

Database
T1 A? B? C? D? E?
T2 A? B? C? D? E?
T3 A? B? C? D? E?
T4 A? B? C? D? E?
T5 A? B? C? D? E?
Rules
R1: A? ? B? ? C? ? E? (sup = 0.40, conf = 0.67)
R2: A? ? B? ? D? (sup = 0.40, conf = 1.00)
R3: A? => B? (sup = 0.60, conf = 1.00)

8
Rule induction algorithms
G1 is Gene 1, S1 is sample 1, ↑ means
overexpressed, and ↓ means underexpressed.

      G1   G2   G3   G4   Class
S1    ?    ?    ?    ?    Cancer
S2    ?    ?    ?    ?    Non-Cancer
...   ...  ...  ...  ...  ...
Sn    ?    ?    ?    ?    Non-Cancer

Association rule discovery algorithm

Rule set
R1: G1? AND G3? => Cancer
R2: G1? AND G4? => Non-Cancer
R3: G1? AND G3? AND G2? => Cancer
9
Forming Association rules

Any frequent itemset of size greater than one can
be divided into two itemsets, LHS and RHS.
Using objective measures: if the confidence of a
candidate rule exceeds a specified minimum
confidence criterion, the rule may be included.
A very large set of rules is usually generated.
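The rule-forming step can be sketched as below. This is a hypothetical illustration, not the presenters' code: it assumes the frequent itemsets are given as a dictionary of counts that also contains every subset of each frequent itemset (which Apriori guarantees), and it treats a confidence equal to the minimum as acceptable.

```python
from itertools import combinations


def generate_rules(frequent, n_transactions, min_confidence):
    """Split each frequent itemset into LHS => RHS and keep confident rules.

    frequent: dict mapping frozenset(itemset) -> support count; assumed to
              contain every non-empty subset of each itemset it holds.
    Returns a list of (lhs, rhs, support, confidence) tuples.
    """
    rules = []
    for itemset, count in frequent.items():
        if len(itemset) < 2:
            continue  # a rule needs a non-empty LHS and a non-empty RHS
        for size in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), size):
                lhs = frozenset(lhs_items)
                rhs = itemset - lhs
                conf = count / frequent[lhs]
                if conf >= min_confidence:
                    rules.append((lhs, rhs, count / n_transactions, conf))
    return rules


# Hypothetical counts from a database of five transactions.
frequent = {
    frozenset({"A"}): 3,
    frozenset({"B"}): 4,
    frozenset({"A", "B"}): 3,
}
rules = generate_rules(frequent, n_transactions=5, min_confidence=0.8)
```

Here only A => B survives (confidence 3/3), while B => A is dropped (confidence 3/4). An itemset of size n yields 2^n - 2 candidate rules, which is one reason the resulting rule set becomes very large.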

10
Example

28 treatments were recorded on 28 microarray
chips = 28 transactions.
Each chip contains expression levels of
approximately 6200 genes (items).
After applying the Apriori algorithm, 70,000,000
rules were generated.

11
Subjective Measures for Selecting Rules

Select genes before finding frequent itemsets,
to limit the number of items.
Limit the size of LHS or RHS
E.g., LHS contains only one item.
Domain specific measures -- Use knowledge about
the domain under study to specify patterns.

12
Rule Filtering and Grouping (Tuzhilin and
Adomavicius 2002)

Rule templates specify restrictions on the
combinations of genes and their expression levels
that can appear in the body and head of the rule.
RulePart HAS Quantifier OF C1, C2, ..., CN
ONLY
RulePart: BODY, HEAD, or RULE.
C1, C2, ..., CN is a comparison set,
representing a list of genes against which the
discovered rules will be compared. An entry can be:
A gene, e.g., G17
A gene with a particular expression level, e.g.,
G17?
A group (category) of genes, e.g., DNA_Repair
A group of genes with an expression level, e.g.,
DNA_Repair
A group of genes with a list of allowable or
unallowable expression levels, e.g., DNA_Repair
?,

13
Rule Filtering cont

Quantifier: a keyword or an expression specifying
how many genes from the list C1, C2, ..., CN
have to be contained in RulePart.
ALL, ANY, NONE: specify the number of genes
from C1, C2, ..., CN the RulePart must have
A numeric value, e.g., 2: specifies that a rule must
have exactly 2 genes from the comparison set
A range of numeric values, e.g., 1,3,5-7
ONLY is used to indicate that RulePart can have only
the genes in the C1, C2, ..., CN list.

14
Rule Filtering cont

All rules that contain at least one of the
following genes: G1, G5, G7
RULE HAS (ANY) OF G1, G5, G7
Matching rules: G1? => G3?.
"When genes involved in the DNA repair are
upregulated, what other gene categories are also
up- or downregulated?"
BODY HAS (ANY) OF DNA_Repair? AND HEAD HAS
(ANY) OF All_Genes?, ?
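A rule template of the form "RulePart HAS Quantifier OF C1, ..., CN ONLY" can be approximated with a small predicate. The sketch below is an assumption-based illustration rather than the authors' implementation: rules are represented as (body, head) pairs of (gene, level) items, the comparison set contains plain gene names only, and the expression levels in the example are invented.

```python
def matches(rule, part, quantifier, genes, only=False):
    """Check a rule against a template.

    rule:       (body, head), each a frozenset of (gene, level) pairs.
    part:       "BODY", "HEAD", or "RULE".
    quantifier: "ALL", "ANY", "NONE", an int, or a collection of ints
                (a range such as 1,3,5-7 would be passed as {1, 3, 5, 6, 7}).
    genes:      the comparison set C1, ..., CN, here gene names only.
    only:       if True, the rule part may contain no other genes.
    """
    body, head = rule
    items = {"BODY": body, "HEAD": head, "RULE": body | head}[part]
    present = {gene for gene, _level in items}
    genes = set(genes)
    n_found = len(present & genes)

    if quantifier == "ALL":
        ok = n_found == len(genes)
    elif quantifier == "ANY":
        ok = n_found >= 1
    elif quantifier == "NONE":
        ok = n_found == 0
    elif isinstance(quantifier, int):
        ok = n_found == quantifier
    else:
        ok = n_found in set(quantifier)

    if only:
        ok = ok and present <= genes
    return ok


# Hypothetical rules; the "up"/"down" levels are made up for the example.
rule_set = [
    (frozenset({("G1", "up")}), frozenset({("G3", "down")})),
    (frozenset({("G2", "up")}), frozenset({("G4", "up")})),
]
# RULE HAS (ANY) OF G1, G5, G7
selected = [r for r in rule_set if matches(r, "RULE", "ANY", ["G1", "G5", "G7"])]
```

Templates joined with AND, such as a BODY condition plus a HEAD condition, can be expressed by combining two calls with Python's `and`.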

15
Macro Templates (Tuzhilin and Adomavicius 2002)

Detect unexpected rules
CONTRADICT (GeneExprSet, G, ExpLevel)
BODY HAS (ALL) of GeneExprSet AND
HEAD HAS (ALL) OF G ? ExpLevel
CONTRADICT(G1?, G2?,G4,?
Unexpected rule: G1? AND G2? AND G3? => G4?
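One possible reading of the CONTRADICT macro is sketched below. The interpretation is an assumption: a rule is taken to contradict an expectation when its body contains every item of GeneExprSet and its head contains gene G at a level other than ExpLevel. The rule representation and the "up"/"down" levels in the example are hypothetical.

```python
def contradicts(rule, gene_expr_set, gene, exp_level):
    """Assumed semantics of CONTRADICT(GeneExprSet, G, ExpLevel).

    rule: (body, head), each a frozenset of (gene, level) pairs.
    True when the body has all of gene_expr_set and the head has `gene`
    with an expression level different from `exp_level`.
    """
    body, head = rule
    body_has_all = set(gene_expr_set) <= body
    head_differs = any(g == gene and level != exp_level for g, level in head)
    return body_has_all and head_differs


# Hypothetical expectation: when G1 and G2 are up, G4 should be up.
expected_body = {("G1", "up"), ("G2", "up")}

surprising = (
    frozenset({("G1", "up"), ("G2", "up"), ("G3", "up")}),
    frozenset({("G4", "down")}),
)
unsurprising = (
    frozenset({("G1", "up"), ("G2", "up")}),
    frozenset({("G4", "up")}),
)

is_unexpected = contradicts(surprising, expected_body, "G4", "up")
is_expected = not contradicts(unsurprising, expected_body, "G4", "up")
```

Under this reading the macro only flags rules whose body is at least as specific as the stated expectation, so extra body items such as G3 do not prevent a match.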

16
Rule Grouping

Group similar rules together into classes to be
analyzed
Gene hierarchy group genes based on their
functions.

ALL
  F1: G1, G2, G3
  F2: G4, G5
17
Rule Grouping cont

Aggregated rules
Groups: F1 = {G1, G2, G3}, F2 = {G4, G5}
Rules: R1: G1? => G4?, R2: G1? => G5?,
R3: G1? AND G3? => G5?
Aggregated rules: R: F1 => F2, R': F1? => F2?
R1, R2, R3 belong to R; R1, R2 belong to R'; R3 does not belong to R'
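Rule grouping by a gene hierarchy can be sketched by replacing each gene with its group and collecting rules that map to the same aggregated rule. The code below is an illustrative sketch: the group assignment follows the slide (F1 contains G1, G2, G3 and F2 contains G4, G5), but the expression levels attached to the rules are invented, since the slide's level markers did not survive.

```python
from collections import defaultdict

GROUP_OF = {"G1": "F1", "G2": "F1", "G3": "F1", "G4": "F2", "G5": "F2"}


def aggregate(rule, keep_levels=False):
    """Map a gene-level rule to its group-level (aggregated) rule."""
    body, head = rule

    def lift(items):
        if keep_levels:
            return frozenset((GROUP_OF[g], level) for g, level in items)
        return frozenset(GROUP_OF[g] for g, _level in items)

    return (lift(body), lift(head))


def group_rules(rules, keep_levels=False):
    """Return {aggregated rule: [names of rules that map to it]}."""
    groups = defaultdict(list)
    for name, rule in rules.items():
        groups[aggregate(rule, keep_levels)].append(name)
    return dict(groups)


# Hypothetical expression levels for the three rules.
rules_by_name = {
    "R1": (frozenset({("G1", "up")}), frozenset({("G4", "up")})),
    "R2": (frozenset({("G1", "up")}), frozenset({("G5", "up")})),
    "R3": (frozenset({("G1", "up"), ("G3", "down")}), frozenset({("G5", "up")})),
}

without_levels = group_rules(rules_by_name)                 # one class: F1 => F2
with_levels = group_rules(rules_by_name, keep_levels=True)  # R3 separates
```

Ignoring levels puts all three rules into a single class F1 => F2. Keeping levels puts R1 and R2 together under "F1 up => F2 up" while R3 falls into a class of its own, because its body mixes two levels within F1.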

18
Rule Derivation Procedure
Database of transactions
-> Feature selection: some features may be irrelevant, or a reduction
   may be required for efficiency reasons.
-> Reduced database
-> Rule induction: a rule induction algorithm is applied to the
   database (e.g. association rule algorithm or decision tree algorithm).
-> Initial rule set
-> Rule selection: rules are assessed and ranked using different
   measures of interestingness.
-> Relevant rule set
-> Rule validation: the rules need to be validated to be accepted as
   knowledge. In the context of gene expression data, this can be done
   by more detailed biological experiments.
-> Rules representing knowledge
19
Mining Association Rules from Clinical Data
20
Association Rules in Medical Data

Medical record data
Millions of claims for medical procedures
Each patient may have several claims
Each claim may have several line items or
records, one for each procedure performed.
A diagnosis was reported with each procedure.
Data: patient code, procedure code, diagnosis
code
Goal: discover relationships between the procedures
performed on a patient and the reported diagnoses.

21
Association Rules in Medical Data cont

Data items: the set of all procedure and
diagnosis codes (7,365 + 9,383 = 16,748)
Transaction: the set of procedure and diagnosis
codes for each patient (1,257,645 patients)
May cause unexpected rules, e.g., cast => heart
disease
Each itemset consists of one or more
procedure/diagnosis codes.
The support of an itemset I is the number of
patients whose set of items includes all the items
in I (> 1).

22
Formulating Rules

Apply the Apriori algorithm to generate all
frequent itemsets.
Restrict to those frequent itemsets which
contain both procedures and diagnoses.
For each selected frequent itemset, one rule is
formulated with all procedure codes on the left
and all diagnosis codes on the right.
Compute confidence and eliminate rules with
confidence less than 65%.
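The steps on this slide can be sketched as a single pass over the frequent itemsets. This is a hedged illustration only: the `P:` and `D:` code prefixes, the helper names, and the counts are invented, and the frequent itemsets are assumed to be available as a dictionary of patient counts that includes every subset of each frequent itemset.

```python
def formulate_rules(frequent, is_procedure, min_confidence=0.65):
    """Form one 'procedures => diagnoses' rule per mixed frequent itemset.

    frequent:     dict mapping frozenset(codes) -> number of patients.
    is_procedure: function telling procedure codes from diagnosis codes.
    Returns a list of (procedures, diagnoses, confidence) tuples.
    """
    rules = []
    for itemset, count in frequent.items():
        procedures = frozenset(code for code in itemset if is_procedure(code))
        diagnoses = itemset - procedures
        if not procedures or not diagnoses:
            continue  # keep only itemsets with both kinds of codes
        conf = count / frequent[procedures]
        if conf >= min_confidence:
            rules.append((procedures, diagnoses, conf))
    return rules


# Hypothetical codes and patient counts.
frequent_codes = {
    frozenset({"P:cast"}): 50,
    frozenset({"P:xray"}): 100,
    frozenset({"D:fracture"}): 90,
    frozenset({"P:cast", "D:fracture"}): 45,
    frozenset({"P:xray", "D:fracture"}): 40,
}
clinical_rules = formulate_rules(frequent_codes, lambda code: code.startswith("P:"))
```

With these made-up counts, the rule cast => fracture is kept (confidence 45/50) and xray => fracture is eliminated (confidence 40/100). Fixing procedures on the left and diagnoses on the right yields at most one rule per itemset, which keeps the rule set far smaller than enumerating every LHS/RHS split.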
