Data Mining in DNA: Using the SUBDUE Knowledge Discovery System to Find Potential Gene Regulatory Sequences - PowerPoint PPT Presentation

Data Mining in DNA: Using the SUBDUE Knowledge Discovery System to Find Potential Gene Regulatory Sequences by Ronald K. Maglothin – PowerPoint PPT presentation

Slides: 42
Provided by: RonM93
Learn more at: https://ailab.wsu.edu
Transcript and Presenter's Notes


1
Data Mining in DNA: Using the SUBDUE Knowledge
Discovery System to Find Potential Gene
Regulatory Sequences
  • by Ronald K. Maglothin

2
Committee Members
  • Dr. Lawrence B. Holder, Supervisor
  • Dr. Diane J. Cook
  • Dr. Lynn L. Peterson

3
Outline
  • DNA Sequence Domain
  • SUBDUE Knowledge Discovery System
  • Experiments with Unsupervised SUBDUE
  • Experiments with Supervised SUBDUE
  • Conclusion and Future Work

4
DNA Structure
  • All cells use DNA to store their genetic
    information.
  • A DNA molecule is composed of two linear strands
    coiled in a double helix.
  • Each strand is made of the bases adenine (A),
    thymine (T), cytosine (C), and guanine (G),
    joined in a linear sequence.

5
DNA Sequence
  • These four bases constitute a four-letter
    alphabet that cells use to store genetic
    information.
  • Molecular biologists can break up a DNA molecule
    and determine its base sequence, which can be
    stored as a character string in a computer:

TTCAGCCGATATCCTGGTCAGATTCTCT
AAGTCGGCTATAGGACCAGTCTAAGAGA
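
The second string in the example above is the base-by-base complement of the first (A pairs with T, C pairs with G). A minimal Python sketch of that relationship follows; `complement` is an illustrative helper name, not something taken from the slides.

```python
# Watson-Crick base pairing: A<->T, C<->G.
PAIRS = str.maketrans("ACGT", "TGCA")

def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA string (same left-to-right order)."""
    return strand.translate(PAIRS)

top = "TTCAGCCGATATCCTGGTCAGATTCTCT"
bottom = complement(top)
print(top)
print(bottom)
```

Because the mapping is its own inverse, applying `complement` twice returns the original string.
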
6
Genes
  • A gene is a DNA sequence that encodes
    instructions for building a protein.
  • Gene expression is the process of using a gene to
    make a protein.

[Diagram: DNA (gene) → transcription → RNA (transcript) → translation → Protein (product)]
7
Gene Regulation
  • The primary mechanism is to control the rate of DNA
    transcription:
    • Faster transcription → more protein
    • Slower transcription → less protein
  • Transcription rate is controlled by transcription
    factors, proteins that bind to specific DNA
    sequences.

8
Human Genome Project
  • A U.S.-led, worldwide effort to determine the
    complete DNA sequence for humans, as well as
    several other organisms.
  • These sequences will be used to study:
    • mechanisms of disease
    • growth and development
    • evolutionary relationships

9
A Genome is a LOT of Data
  • Raw sequence (text)
    • Human (2005): 3 x 10^9 base pairs
    • Yeast (finished): 1.2 x 10^7 base pairs
  • Annotated sequence (relational DB)
    • Links to 3D structures of protein products, other
      genes in family, known transcription factors,
      journal references, and other databases.

10
A Rich Domain for Knowledge Discovery
  • Most of the sequences (and genes) have unknown
    function.
  • Efficient algorithms are needed to:
    • identify important patterns
    • identify and classify possible genes
    • infer relationships between genes
    • predict protein structure

11
The SUBDUE Knowledge Discovery System
  • Input: A graph G
  • Output: A list of substructures that compress G
    well
  • Uses a computationally constrained beam search
    and inexact graph match

12
What is a substructure?
  • A definition subgraph and a list of subgraph
    instances

[Figure: an input graph of seven numbered base vertices (three A, two T, one C, one G) joined in a chain by "next" edges; a substructure whose definition is A -next-> T; and its two instances, at vertices 1-2 and 5-6.]
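
As a rough illustration of a definition plus its instances, the sketch below encodes a DNA string as a chain of labeled vertices joined by "next" edges and lists every place the definition A -next-> T occurs. The sequence and the helper names are assumptions chosen for illustration; SUBDUE itself works on general graphs and also allows inexact matches.

```python
def linear_graph(seq):
    """Vertices are numbered from 1; edge (i, i+1) is labeled 'next'."""
    vertices = {i + 1: base for i, base in enumerate(seq)}
    edges = [(i, i + 1, "next") for i in range(1, len(seq))]
    return vertices, edges

def find_instances(vertices, edges, first, second):
    """Return vertex pairs (u, v) where label(u)=first -next-> label(v)=second."""
    return [(u, v) for u, v, label in edges
            if label == "next" and vertices[u] == first and vertices[v] == second]

vertices, edges = linear_graph("ATCGATA")   # hypothetical 7-base sequence
print(find_instances(vertices, edges, "A", "T"))
```
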
13
MDL Heuristic
  • SUBDUE uses the Minimum Description Length
    (MDL) principle to evaluate substructures.
  • The description length (DL) of a graph is the number
    of bits needed to send the graph's adjacency matrix
    to a remote computer.
  • Goal is to minimize DL(S) + DL(G|S), where S is the
    substructure and G|S is G compressed using S.
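
A toy illustration of the quantity DL(S) + DL(G|S), where S is a substructure and G|S is G after each instance of S is replaced by a single vertex. The sketch deliberately uses a crude size proxy (number of vertices plus number of edges) instead of the bit-level adjacency-matrix encoding the slide describes, and it treats the graph as a simple linear chain. Both are assumptions made to keep the example short.

```python
def chain_size(n_vertices):
    """Size proxy for a linear chain: n vertices plus n-1 'next' edges."""
    return n_vertices + max(n_vertices - 1, 0)

def compressed_size(seq, pattern):
    """Proxy for DL(S) + DL(G|S) when non-overlapping instances of
    `pattern` in `seq` are each collapsed into one vertex."""
    count = seq.count(pattern)            # str.count counts non-overlapping matches
    remaining = len(seq) - count * (len(pattern) - 1)
    return chain_size(len(pattern)) + chain_size(remaining)

seq = "ATCGATATAT"                        # hypothetical sequence
original = chain_size(len(seq))
for pattern in ("AT", "ATAT", "CG"):
    # A pattern is worth keeping only if the total drops below `original`.
    print(pattern, original, compressed_size(seq, pattern))
```

Under this proxy a short pattern that repeats often can beat a longer pattern that occurs once, which is the trade-off the MDL heuristic is meant to capture.
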

14
SUBDUE Parameters
  • Iterations: The graph is compressed using the best
    substructure, then discovery is restarted
  • Threshold: Controls how much two subgraphs can
    differ and still be considered similar
  • Beam Width: The number of substructures in the
    expansion list
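
To make the beam width parameter concrete, here is a simplified string analogue of a beam search: candidate patterns grow by one base per step, and only the `beam_width` best candidates are kept for further expansion. The scoring rule (occurrences times pattern length), the `max_len` limit, and all names are assumptions for illustration. SUBDUE expands subgraphs and scores them by compression, and this sketch does not model the iterations or threshold parameters.

```python
def count_overlapping(seq, pattern):
    """Count occurrences of pattern in seq, allowing overlaps."""
    return sum(seq.startswith(pattern, i)
               for i in range(len(seq) - len(pattern) + 1))

def beam_search(seq, beam_width=2, max_len=4):
    """Return the best-scoring pattern found with a width-limited search."""
    score = lambda p: count_overlapping(seq, p) * len(p)
    rank = lambda p: (-score(p), p)       # best score first, ties alphabetical
    beam = sorted(set(seq), key=rank)[:beam_width]
    best = beam[0]
    for _ in range(max_len - 1):
        # Extend every pattern on the beam by one base; keep those that occur.
        candidates = {p + b for p in beam for b in "ACGT"
                      if count_overlapping(seq, p + b) > 0}
        if not candidates:
            break
        beam = sorted(candidates, key=rank)[:beam_width]
        if score(beam[0]) > score(best):
            best = beam[0]
    return best

print(beam_search("ATATCGATATGG", beam_width=2))
```

A narrower beam explores fewer candidates per step and runs faster, at the risk of discarding a pattern that would have scored well after further extension.
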

15
Graph Representations
  • Simple linear

[Figure: five base vertices (A, A, T, C, G) joined in sequence by "next" edges.]
  • Downstream edges

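[Figure: five base vertices (A, A, T, C, G), each linked to every base downstream of it by an edge labeled with the distance between them (1 to 4).]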
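
The two encodings on this slide can be sketched as edge lists. In this illustration, vertices are identified by 1-based position, the simple linear form links each base to the next with a "next" edge, and the downstream form links each base to every later base within `max_dist` positions, labeling the edge with the distance. The function names, the `max_dist` parameter, and the distance-as-label reading of the figure are assumptions.

```python
def simple_linear(seq):
    """Edges (source, target, label) linking each base to its successor."""
    return [(i, i + 1, "next") for i in range(1, len(seq))]

def downstream_edges(seq, max_dist=4):
    """Edges from each base to the bases 1..max_dist positions downstream,
    labeled with the distance between them."""
    n = len(seq)
    return [(i, i + d, str(d))
            for i in range(1, n + 1)
            for d in range(1, max_dist + 1)
            if i + d <= n]

seq = "AATCG"
print(simple_linear(seq))
print(len(downstream_edges(seq)))
```
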
16
Graph Representations
  • Start vertex

[Figure: five base vertices (A, A, T, C, G) joined in sequence by "next" edges, plus a Start vertex with edges labeled 1 to 5.]
  • Backbone

[Figure: a chain of five "base" vertices joined by "next" edges; each "base" vertex has a "name" edge to a vertex holding its letter (A, A, C, T, G).]
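
A minimal sketch of the backbone encoding: each position gets a generic "base" vertex, consecutive "base" vertices are joined by "next" edges, and the letter hangs off its position through a "name" edge. Vertex numbering and function names are illustrative assumptions.

```python
def backbone_graph(seq):
    """Return (vertices, edges) for the backbone representation.

    vertices: dict of vertex id -> label
    edges: list of (source, target, label)
    """
    vertices, edges = {}, []
    n = len(seq)
    for i, letter in enumerate(seq, start=1):
        vertices[i] = "base"          # backbone vertex for position i
        vertices[n + i] = letter      # separate vertex holding the letter
        edges.append((i, n + i, "name"))
        if i < n:
            edges.append((i, i + 1, "next"))
    return vertices, edges

vertices, edges = backbone_graph("AACTG")
print(len(vertices), len(edges))
```

Keeping the letter on its own vertex means a subgraph can contain a "base" position without its "name" edge, so the position is present but its letter is left unspecified.
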
17
Graph Representations
  • Backbone-star


[Figure: five "base" vertices joined by "next" edges, each with a "name" edge to its letter (A, A, C, T, G), plus five "star" labels, one per base.]
18
Unsupervised SUBDUE
  • Input: An entire yeast chromosome

[Figure: simple linear representation, with base vertices (A, A, T, C, G) joined by "next" edges.]
  • Heuristic
  • Results: Not good; patterns with two to three bases

19
Polynomial Heuristic
20
Unsupervised SUBDUE - Discussion
  • Random noise is not a meaningful kind of pattern
    variation in DNA.
  • Unsupervised SUBDUE finds DNA patterns that are
    hard to evaluate and that are not focused on any
    target concept.
  • We need to give SUBDUE more targeted input data
    and to modify the system to use it effectively.

21
Supervised SUBDUE
  • Give SUBDUE two graphs: a graph of positive
    instances of a target concept, and a graph of
    negative instances.
  • SUBDUE discovers substructures in the positive
    graph, finds instances in the negative graph, and
    bases the overall heuristic value on the values
    in both graphs.
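
The inputs to such a heuristic can be illustrated with plain strings: count how often a candidate pattern occurs in the positive examples and in the negative examples. This sketch only gathers the two counts and does not combine them into a single value; the sequences and names are made up for illustration.

```python
def count_instances(sequences, pattern):
    """Total (possibly overlapping) occurrences of pattern across sequences."""
    total = 0
    for seq in sequences:
        total += sum(seq.startswith(pattern, i) for i in range(len(seq)))
    return total

positive = ["TTGACGTCA", "ACGTCATT"]   # hypothetical positive examples
negative = ["TTTTAAAA", "GGACGTGG"]    # hypothetical negative examples

pattern = "ACGTCA"
print(count_instances(positive, pattern), count_instances(negative, pattern))
```

A pattern that is frequent in the positive set and rare in the negative set is the kind of candidate a supervised search should favor.
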

22
New Data Sets
  • Clusters of coexpressed yeast genes compiled by
    Brazma et al., from expression data generated by
    DeRisi et al.
  • The expression level of each gene in a cluster
    changed at the same time and by a similar degree
    during the experiment. Perhaps some genes in a
    cluster are regulated by similar mechanisms?

23
New Data Sets
  • Positive examples:
    • 300-bp upstream windows (both strands) for all
      genes in a given cluster
  • Negative examples:
    • 300-bp upstream windows for genes not in the
      cluster, OR
    • 300-bp windows randomly selected from the
      complete genome (probably not involved in gene
      regulation)
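
Extracting such windows can be sketched as follows, assuming 0-based gene start coordinates on the forward strand and taking "both strands" to mean the window plus its reverse complement. Those conventions, the function names, and the toy genome are assumptions; the slides do not give the extraction details.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Sequence of the opposite strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def upstream_window(genome, gene_start, size=300):
    """Up to `size` bases immediately upstream of a forward-strand gene start."""
    return genome[max(gene_start - size, 0):gene_start]

def both_strands(window):
    """The window as given, plus the same region read from the other strand."""
    return [window, reverse_complement(window)]

def random_windows(genome, n, size=300, seed=0):
    """n windows of length `size` drawn uniformly at random from the genome."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(genome) - size + 1) for _ in range(n)]
    return [genome[s:s + size] for s in starts]

genome = "ACGT" * 250                     # toy 1000-bp genome
positives = both_strands(upstream_window(genome, gene_start=500))
negatives = random_windows(genome, n=3)
print(len(positives[0]), len(negatives))
```

Genes annotated on the reverse strand would need the window taken from the other side of the gene; that case is left out of this sketch.
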

24
Supervised Heuristic
  • Based on the substructure's values in the
    positive and negative graphs
  • Numerator set to 1 when there are no
    negative instances
25
Compression Ratio
  • Normalize the graph values by using the inverse
    of the graph compression

26
Negative Graph Value
  • When there are no negative instances, setting
    numerator to 1 actually penalizes such
    substructures.
  • Using 2 x DL(G-) in this situation gave better
    results.

27
Ratio Heuristic Results
28
Concept DL Heuristic
  • Based on the size of a message containing the
    compressed positive graph, plus the errors
    (negative instances).

29
Concept DL Heuristic Results
  • Relative graph size affected results

30
Backbone Representation
[Figure: a chain of five "base" vertices joined by "next" edges; each "base" vertex has a "name" edge to a vertex holding its letter (A, A, C, T, G).]
  • Base vertices allowed don't-care positions, but the
    heuristic had to be changed to accommodate them.
  • Overlap became very important.

31
DL Equations
32
Negative Graph Value
  • Using 2 x DL(G-) for no negative instances
    favored such substructures too strongly.

33
Compression Difference Heuristic
  • Use subtraction with the compression values
    instead of division.

34
Results
Cluster cr4.111101.77
35
Results
Cluster c2_4.2222200.39
36
Results of Brazma et al.
Cluster c2_4.2222200.39
37
Brazma Heuristic
  • Based on number of positive and negative instances

38
SUBDUE Using Brazma Heuristic
Cluster c2_4.2222200.39
39
Conclusion
  • SUBDUE can be used to discover likely
    transcription factor binding sites.
  • Patterns found by SUBDUE are different from those
    found by string-based algorithms, due to the
    graph representation, beam search, and different
    search heuristic.

40
Conclusion
  • Patterns found by unsupervised SUBDUE in DNA are
    difficult to evaluate.
  • Using supervised SUBDUE can greatly focus the
    search on the target concept.
  • Choosing the right graph representation and
    heuristic is critical to success.

41
Future Work
  • Further refinement of the supervised MDL
    heuristic.
  • Application of graph grammar theory to SUBDUE's
    search.
  • Close collaboration with molecular biologists to
    select data sets and evaluate results.