Data Mining in DNA: Using the SUBDUE Knowledge Discovery System to Find Potential Gene Regulatory Sequences

Provided by: RonM93

Learn more at: https://ailab.wsu.edu

1
Data Mining in DNA Using the SUBDUE Knowledge
Discovery System to Find Potential Gene
Regulatory Sequences

by
Ronald K. Maglothin

2
Committee Members

Dr. Lawrence B. Holder, Supervisor
Dr. Diane J. Cook
Dr. Lynn L. Peterson

3
Outline

DNA Sequence Domain
SUBDUE Knowledge Discovery System
Experiments with Unsupervised SUBDUE
Experiments with Supervised SUBDUE
Conclusion and Future Work

4
DNA Structure

All cells use DNA to store their genetic
information.
A DNA molecule is composed of two linear strands
coiled in a double helix.
Each strand is made of the bases adenine (A),
thymine (T), cytosine (C), and guanine (G),
joined in a linear sequence.

5
DNA Sequence

These four bases constitute a four-letter
alphabet that cells use to store genetic
information.
Molecular biologists can break up a DNA molecule
and determine its base sequence, which can be
stored as a character string in a computer:

TTCAGCCGATATCCTGGTCAGATTCTCT
AAGTCGGCTATAGGACCAGTCTAAGAGA
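
The two strings above are the paired strands of one double-stranded fragment: each base on the second line is the complement of the base above it (A with T, C with G). A minimal Python sketch of that relationship follows; the function and variable names are illustrative and do not come from the presentation.

```python
# Base-pairing rule: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA string."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """Return the opposite strand read in its own 5'-to-3' direction."""
    return complement(strand)[::-1]


top = "TTCAGCCGATATCCTGGTCAGATTCTCT"
bottom = complement(top)
print(bottom)                   # the second strand, aligned under the first
print(reverse_complement(top))  # the same strand, read in the opposite direction
```

Storing only one strand is enough, since the other can always be derived this way.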
6
Genes

A gene is a DNA sequence that encodes
instructions for building a protein.
Gene expression is the process of using a gene to
make a protein:

DNA (gene) -> transcription -> RNA (transcript) -> translation -> Protein (product)
7
Gene Regulation

The primary mechanism is to control the rate of DNA
transcription.
Faster transcription -> more protein
Slower transcription -> less protein
Transcription rate is controlled by transcription
factors, which are proteins that bind to
specific DNA sequences.

8
Human Genome Project

A U.S.-led, worldwide effort to determine the
complete DNA sequence for humans, as well as
several other organisms.
These sequences will be used to study
mechanisms of disease
growth and development
evolutionary relationships

9
A Genome is a LOT of Data

Raw sequence (text)
Human (2005) 3 x 10^9 base pairs
Yeast (finished) 1.2 x 10^7 base pairs
Annotated sequence (Relational DB)
Links to 3D structures of protein products, other
genes in family, known transcription factors,
journal references, and other databases.

10
A Rich Domain for Knowledge Discovery

Most of the sequences (and genes) have unknown
function.
Efficient algorithms are needed to
identify important patterns
identify and classify possible genes
infer relationships between genes
predict protein structure

11
The SUBDUE Knowledge Discovery System

Input: A graph G
Output: A list of substructures that compress G
well
Uses a computationally constrained beam search
and inexact graph match

12
What is a substructure?

A definition subgraph and a list of subgraph
instances

[Figure: Input Graph: seven base vertices, numbered 1 to 7 and labeled A, T, C, or G, joined in a chain by "next" edges. Substructure: the definition is A -next-> T, with two instances in the input graph.]
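
A substructure, as defined on this slide, pairs a definition with the list of its instances. The Python sketch below illustrates that data structure on a chain of bases joined by "next" edges. It uses exact matching only (SUBDUE itself also allows inexact matches), and the class name, function names, and example sequence are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Substructure:
    definition: str                   # chain of base labels, e.g. "AT" means A -next-> T
    instances: list[tuple[int, ...]]  # vertex ids of each matching subgraph


def linear_graph(seq: str):
    """Build labeled vertices 1..n and 'next' edges between neighbours."""
    vertices = dict(enumerate(seq, start=1))
    edges = [(i, i + 1, "next") for i in range(1, len(seq))]
    return vertices, edges


def find_instances(vertices, edges, definition: str) -> Substructure:
    """Collect every chain of vertices whose labels spell out the definition."""
    successor = {src: dst for src, dst, label in edges if label == "next"}
    instances = []
    for start in vertices:
        path = [start]
        while len(path) < len(definition) and path[-1] in successor:
            path.append(successor[path[-1]])
        labels_match = all(vertices[v] == base for v, base in zip(path, definition))
        if len(path) == len(definition) and labels_match:
            instances.append(tuple(path))
    return Substructure(definition, instances)


vertices, edges = linear_graph("ATCGATA")  # hypothetical example sequence
sub = find_instances(vertices, edges, "AT")
print(sub.instances)
```

For the example sequence, the definition A -next-> T matches at vertex pairs (1, 2) and (5, 6).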
13
MDL Heuristic

SUBDUE uses the Minimum Description Length
Principle to evaluate substructures.
Description Length of a graph is the number of
bits needed to send the graph's adjacency matrix
to a remote computer.
Goal is to minimize DL(S) + DL(G|S), where S is the
substructure and G|S is G compressed using S.
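
To make the objective concrete, the sketch below scores candidate patterns by the sum DL(S) + DL(G|S), the usual form of SUBDUE's MDL objective. The bit count is a deliberately crude stand-in, not SUBDUE's actual adjacency-matrix encoding, and it treats the graph as a simple chain of bases; every function name here is an assumption.

```python
import math


def description_length(n_vertices: int, n_edges: int, n_labels: int) -> float:
    """Crude bit estimate: a label per vertex, two endpoints plus one label bit per edge."""
    if n_vertices == 0:
        return 0.0
    vertex_bits = n_vertices * math.log2(max(n_labels, 2))
    edge_bits = n_edges * (2 * math.log2(max(n_vertices, 2)) + 1)
    return vertex_bits + edge_bits


def compress(seq: str, pattern: str) -> list[str]:
    """Replace non-overlapping occurrences of pattern, left to right, with one 'S' vertex."""
    tokens, i = [], 0
    while i < len(seq):
        if seq.startswith(pattern, i):
            tokens.append("S")
            i += len(pattern)
        else:
            tokens.append(seq[i])
            i += 1
    return tokens


def mdl_value(seq: str, pattern: str) -> float:
    """DL(S) + DL(G|S) for a chain graph; lower is better."""
    dl_s = description_length(len(pattern), len(pattern) - 1, len(set(seq)))
    tokens = compress(seq, pattern)
    dl_g_given_s = description_length(len(tokens), len(tokens) - 1, len(set(tokens)))
    return dl_s + dl_g_given_s


seq = "ATGC" * 10  # hypothetical, highly repetitive input
for candidate in ("AT", "ATGC"):
    print(candidate, round(mdl_value(seq, candidate), 1))
```

On this repetitive input the longer pattern yields the smaller total, which is the behaviour the heuristic is meant to reward.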

14
SUBDUE Parameters

Iterations: Graph is compressed using the best
substructure, then discovery is restarted
Threshold: Controls how much two subgraphs can
differ and still be considered similar
Beam Width: The number of substructures in the
expansion list

15
Graph Representations

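Simple linear

[Figure: base vertices A, A, T, C, G joined in sequence by "next" edges.]

Downstream edges

[Figure: the same five base vertices A, A, T, C, G connected by edges labeled 1 through 4.]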
16
Graph Representations

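Start vertex

[Figure: base vertices A, A, T, C, G joined by "next" edges, plus a Start vertex with edges labeled 1 through 5.]

Backbone

[Figure: five "base" vertices joined by "next" edges; each "base" vertex has a "name" edge to its letter (A, A, C, T, G).]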
17
Graph Representations

Backbone-star

[Figure: the backbone graph (five "base" vertices joined by "next" edges, each with a "name" edge to A, A, C, T, or G) with five added elements labeled "star".]
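
As a rough illustration of how two of these representations differ, the Python sketch below builds the simple linear graph and the backbone graph from the same sequence. The vertex ids and the tuple layout for edges are assumptions chosen for readability; they are not SUBDUE's input file format.

```python
def simple_linear(seq: str):
    """One vertex per base, labeled with its letter, joined by 'next' edges."""
    vertices = {i: base for i, base in enumerate(seq)}
    edges = [(i, i + 1, "next") for i in range(len(seq) - 1)]
    return vertices, edges


def backbone(seq: str):
    """A chain of generic 'base' vertices; each points to its letter via a 'name' edge."""
    vertices, edges = {}, []
    for i, base in enumerate(seq):
        base_id, name_id = f"b{i}", f"n{i}"
        vertices[base_id] = "base"  # position in the chain, letter not fixed here
        vertices[name_id] = base    # the actual letter hangs off the chain
        edges.append((base_id, name_id, "name"))
        if i > 0:
            edges.append((f"b{i - 1}", base_id, "next"))
    return vertices, edges


for build in (simple_linear, backbone):
    vertices, edges = build("AACTG")
    print(build.__name__, len(vertices), "vertices,", len(edges), "edges")
```

The backbone form doubles the vertex count, but a pattern can then include a "base" vertex without its "name" edge, leaving that position unspecified.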
18
Unsupervised SUBDUE

Input: An entire yeast chromosome

[Figure: simple linear graph of base vertices A, A, T, C, G joined by "next" edges.]

Heuristic

Results: Not good; patterns with only two
to three bases

19
Polynomial Heuristic
20
Unsupervised SUBDUE - Discussion

Random noise is not a meaningful kind of pattern
variation in DNA.
Unsupervised SUBDUE finds DNA patterns that are
hard to evaluate and that are not focused on any
target concept.
We need to give SUBDUE more targeted input data
and to modify the system to use it effectively.

21
Supervised SUBDUE

Give SUBDUE two graphs: a graph of positive
instances of a target concept, and a graph of
negative instances.
SUBDUE discovers substructures in the positive
graph, finds instances in the negative graph, and
bases the overall heuristic value on the values
in both graphs.

22
New Data Sets

Clusters of coexpressed yeast genes compiled by
Brazma et al., from expression data generated by
DeRisi et al.
The expression level of each gene in a cluster
changed at the same time and by a similar degree
during the experiment; perhaps some genes in a
cluster are regulated by similar mechanisms?

23
New Data Sets

Positive examples
300-bp upstream windows (both strands) for all
genes in a given cluster
Negative examples
300-bp upstream windows for genes not in the
cluster, OR
300-bp windows randomly selected from the
complete genome (probably not involved in gene
regulation)
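
The window extraction described on this slide can be sketched as follows. This is a minimal Python illustration that assumes the chromosome is held as a plain string and that each gene is given by the 0-based index of its first base on the forward strand; genes on the reverse strand would need mirrored coordinates, which are omitted here. All names are hypothetical.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Opposite strand of a window, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def upstream_window(chromosome: str, gene_start: int, size: int = 300) -> str:
    """Up to `size` bases immediately before the gene start."""
    return chromosome[max(0, gene_start - size):gene_start]


def positive_examples(chromosome: str, gene_starts, size: int = 300) -> list[str]:
    """Both strands of the upstream window for every gene in a cluster."""
    examples = []
    for start in gene_starts:
        window = upstream_window(chromosome, start, size)
        examples.extend([window, reverse_complement(window)])
    return examples


def random_windows(chromosome: str, count: int, size: int = 300, seed: int = 0) -> list[str]:
    """Negative examples: windows drawn uniformly at random from the sequence."""
    rng = random.Random(seed)
    last_start = len(chromosome) - size
    starts = [rng.randint(0, last_start) for _ in range(count)]
    return [chromosome[s:s + size] for s in starts]


chromosome = "ACGT" * 200  # placeholder sequence, not real yeast data
positives = positive_examples(chromosome, gene_starts=[500, 700])
negatives = random_windows(chromosome, count=4)
print(len(positives), len(negatives))
```

A gene closer than 300 bp to the start of the sequence simply gets a shorter window in this sketch.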

24
Supervised Heuristic

Based on the substructure's values in the
positive and negative graphs

Numerator set to 1 when there are no negative instances
25
Compression Ratio

Normalize the graph values by using the inverse
of the graph compression

26
Negative Graph Value

When there are no negative instances, setting
numerator to 1 actually penalizes such
substructures.
Using 2 x DL(G-) in this situation gave better
results.

27
Ratio Heuristic Results
28
Concept DL Heuristic

Based on the size of a message containing the
compressed positive graph, plus the errors
(negative instances).

29
Concept DL Heuristic Results

Relative graph size affected results

30
Backbone Representation
[Figure: five "base" vertices joined by "next" edges; each "base" vertex has a "name" edge to its letter (A, A, C, T, G).]

Base vertices allowed don't-care positions, but the
heuristic had to be changed to accommodate them.
Overlap became very important.

31
DL Equations
32
Negative Graph Value

Using 2 x DL(G-) for no negative instances
favored such substructures too strongly.

33
Compression Difference Heuristic

Use subtraction with the compression values
instead of division.
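
The exact formulas are not preserved in this transcript, so the sketch below only illustrates the general trade-off between dividing and subtracting two compression values. It uses a stand-in measure (the fraction of bases saved when each exact occurrence of a pattern is collapsed to one vertex), and every name and example sequence is an assumption rather than the author's heuristic.

```python
def fraction_saved(sequences: list[str], pattern: str) -> float:
    """Share of vertices removed if each non-overlapping occurrence becomes one vertex."""
    total = sum(len(s) for s in sequences)
    if total == 0:
        return 0.0
    saved = sum(s.count(pattern) * (len(pattern) - 1) for s in sequences)
    return saved / total


def ratio_score(positive: list[str], negative: list[str], pattern: str):
    """Division needs a special case when the pattern never occurs in the negatives."""
    negative_saved = fraction_saved(negative, pattern)
    if negative_saved == 0:
        return None  # some substitute value would have to be chosen here
    return fraction_saved(positive, pattern) / negative_saved


def difference_score(positive: list[str], negative: list[str], pattern: str) -> float:
    """Subtraction is defined everywhere, including for zero negative instances."""
    return fraction_saved(positive, pattern) - fraction_saved(negative, pattern)


positive = ["TTGACGTCAT", "AAGACGTCCC"]  # made-up sequences for illustration
negative = ["TTTTTTTTTT", "ACACACACAC"]
for pattern in ("GACGTC", "AC"):
    print(pattern, ratio_score(positive, negative, pattern),
          difference_score(positive, negative, pattern))
```

With subtraction, a pattern that is absent from the negative examples needs no substitute value, and a pattern that is more common in the negatives than in the positives receives a negative score.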

34
Results
Cluster cr4.111101.77
35
Results
Cluster c2_4.2222200.39
36
Results of Brazma et al.
Cluster c2_4.2222200.39
37
Brazma Heuristic

Based on number of positive and negative instances

38
SUBDUE Using Brazma Heuristic
Cluster c2_4.2222200.39
39
Conclusion

SUBDUE can be used to discover likely
transcription factor binding sites.
Patterns found by SUBDUE are different from those
found by string-based algorithms, due to the
graph representation, beam search, and different
search heuristic.

40
Conclusion

Patterns found by unsupervised SUBDUE in DNA are
difficult to evaluate.
Using supervised SUBDUE can greatly focus the
search on the target concept.
Choosing the right graph representation and
heuristic is critical to success.

41
Future Work

Further refinement of the supervised MDL
heuristic.
Application of graph grammar theory to SUBDUE's
search.
Close collaboration with molecular biologists to
select data sets and evaluate results.
