I. Programming Fundamentals - PowerPoint PPT Presentation

About This Presentation
Title:

I. Programming Fundamentals

Slides: 62
Provided by: thomasc52
Transcript and Presenter's Notes


1
I. Programming Fundamentals
  • Problem Solving
  • Problem Specification
  • Top-down Design
  • Languages
  • Debugging/Performance Tuning
  • Testing
  • Maintenance

2
1. Problem Solving (I. Programming Fundamentals)
  • Since the late 1960s, the phrase "Problem Solving with
    Computers"
  • The computer as a tool
  • Understand Problem
  • Specification
  • Design
  • Implement
  • Test
  • Maintain

3
2. Problem Specification (I. Programming Fundamentals)
  • Can be informal, formal, or in between.
  • A definition of Input/Output Relationships
  • Uncovers and clarifies any ambiguous issues
  • Involves interactions between end users and
    solution developers
  • Ideally, produces a specification document
  • Realistically, prototyping usually starts
    simultaneously with specification

4
3. Top-down Design (I. Programming Fundamentals)
  • Extremely important methodological philosophy.
  • Develops a solution in successive phases of
    decreasing levels of abstraction
  • Any problem can have a solution described (at
    some level of abstraction) on about a half sheet
    of paper.
  • Aliases: modular programming, stepwise
    refinement (object-oriented design is this
    philosophy with training wheels added)

5
4. Languages (I. Programming Fundamentals)
  • Choice of language can be important.
  • Often, however, final choice of language can be
    as much a matter of subjective, personal choice,
    as is a type of paint brush to an artist.
  • Issues: acceptance (for maintenance),
    performance, portability

6
4. Languages (I. Programming Fundamentals)
  • Language types
      • Procedural (C, Fortran, Pascal, Basic)
      • Object-oriented (C++, Smalltalk)
      • Functional, Declarative (LISP, Prolog)
      • String processing (SNOBOL)
  • Deployment technologies
      • Interpreted (Basic)
      • Compiled (C, Fortran, most languages)
  • Run-time systems
      • Statically linked (real-time, older systems)
      • Dynamically linked (most modern environments)

7
5. Debugging/Performance Tuning (I. Programming Fundamentals)
  • The most unpredictable phase of the process
  • Not a matter of luck, however
  • Scientific principles are critical
      1. Formulate hypothesis
      2. Perform an experiment, examine results
      3. Make a single change
      4. Repeat at step 1.
  • Crash testing, and obvious error finding
  • Debugging tools can assist (gdb)

8
6. Testing (I. Programming Fundamentals)
  • Goes beyond crash testing
  • Need to develop test sets
  • Functional testing
  • Structural testing
  • Must struggle with specification now

9
7. Maintenance (I. Programming Fundamentals)
  • The on-going, necessary update and debugging of
    finished software
  • This step never ends
  • Often, earlier steps ignore this phase for the
    sake of expediency
  • Language choice, specification, modularization,
    all bear on this step

10
II. Data Structures
  • A practical framework for holding data.
  • Must consider input, intermediate, and possibly
    computed output data
  • Impacts on
  • Development time
  • Memory usage
  • Performance (execution time)
  • Maintainability

11
II. Data Structures (cont.)
  • Scalar and array variables
  • Static and dynamic structures
  • Dense and sparse structures
  • Linear and linked structures
  • Lists, Stacks (LIFO), Queues (FIFO), Trees,
    Graphs, and Heaps
  • Dynamic structure efficiency relies on OS
    interaction, and program behavior
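As a small illustration of two of the structures listed above, the following Python sketch contrasts a stack (LIFO) with a queue (FIFO). Using a plain list and `collections.deque` is an illustrative choice, not something the slides prescribe, and the function names are hypothetical.

```python
from collections import deque


def drain_stack(items):
    """Push every item, then pop until empty: last in, first out."""
    stack = []
    for item in items:
        stack.append(item)           # push onto the top
    out = []
    while stack:
        out.append(stack.pop())      # pop from the same end
    return out


def drain_queue(items):
    """Enqueue every item, then dequeue until empty: first in, first out."""
    queue = deque()
    for item in items:
        queue.append(item)           # enqueue at the back
    out = []
    while queue:
        out.append(queue.popleft())  # dequeue from the front
    return out
```

The same three items come back reversed from the stack and in their original order from the queue.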

12
III. Algorithms
  • Control flow
  • Template structures
  • Complexity analysis

13
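III. Algorithms (Control Flow)
  • Sequential
  • Alternation or selection
  • Iteration or looping

[Figure: flowcharts of the three control structures: sequential (Statement 1, Statement 2), selection (a test choosing between Statement 1 and Statement 2), and iteration (a test controlling a Loop Body)]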
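The three control structures can be seen together in a few lines. This is only a sketch in Python; the function name `count_gc` and the choice of counting G/C bases are illustrative assumptions.

```python
def count_gc(sequence):
    """Return (number of G/C bases, total number of bases)."""
    gc = 0                      # sequential: statements run one after another
    total = len(sequence)
    for base in sequence:       # iteration: the loop body runs once per base
        if base in "GC":        # selection: choose one of two branches
            gc += 1
        else:
            pass                # A/T (or anything else) is not counted
    return gc, total
```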
14
III. Algorithms (Template Structures)
  • Divide and Conquer
  • Greedy
  • Backtracking
  • Branch and Bound
  • Searching (depth first, breadth first)
  • Dynamic Programming

15
III. Algorithms (Complexity Analysis)
  • for i = 1 to 100
  •   for j = 1 to 50
  •     x[i] = a[i] op b[j]
  • The inner statement executes 50 x 100, or 5,000,
    times. If the outer loop executed n times, and the
    inner one n times, we would say that this algorithm
    had complexity O(n^2).
  • In some sense, as the problem size n grows, the
    execution time will grow as the square of n.
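The counting argument above can be checked directly. The sketch below only counts how many times the inner statement would run; the function name is hypothetical and the body of the inner statement is left out on purpose.

```python
def count_inner_executions(outer, inner):
    """Count executions of the innermost statement of a doubly nested loop."""
    count = 0
    for _ in range(outer):
        for _ in range(inner):
            count += 1          # stands in for the inner statement
    return count


# Doubling n quadruples the work, which is what O(n^2) predicts.
growth = [count_inner_executions(n, n) for n in (10, 20, 40)]
```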

16
IV. Systems and Networks
[Figure: a processor connected to memory. Memory holds DATA (scalar and array) and PROGRAMS. The processor cycle is: 1. Fetch Instruction, 2. Fetch Data, 3. Execute Instruction, 4. Store Data.]
17
IV. Systems and Networks (cont.)
[Figure: layered system diagram with the labels: Tools and Applications; Libraries and Languages; Peripherals (disks, etc.); CPU/Memory; Network Operating System; Local Operating System]
18
IV. Systems and Networks (cont.)
[Figure: several computers, each with CPU, memory, and disk, attached to a shared network medium]
  • Many possible media: physical and protocols
  • Functional variants: message passing, shared
    files, shared memory
  • Security issues: protecting data, allowing sharing
  • Heterogeneous operating systems

19
V. Tools and Scripts
  • Tools
  • Debugging
  • Performance Tuning
  • Administration
  • Scripting
  • Programs of shell commands
  • Glue to allow other programs to work together,
    and manipulate whole files (of sequence, for
    example) as simple data objects

20
VI. Databases
  • Pile o' data
  • Stored on large non-volatile media (e.g. disk
    system); local vs. networked
  • Table structures
  • Primary key for each item
  • Strength is relational query methods
  • SQL: Structured Query Language
  • "retrieve from table X where name like Joe and
    age equal 32"
  • Insert, delete, update, etc.
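The informal query on this slide can be written as real SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the column names, the sample rows, and the function name `find_people` are all made up for illustration.

```python
import sqlite3


def find_people(name_pattern, age):
    """Build a tiny table X, modify it, then run the slide's query against it."""
    conn = sqlite3.connect(":memory:")
    # Each row gets an integer primary key.
    conn.execute("CREATE TABLE X (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO X (name, age) VALUES (?, ?)",
        [("Joe Smith", 32), ("Joey Lee", 40), ("Ann Joe", 32), ("Bob", 32)],
    )
    # Update and delete work on rows selected by a WHERE clause.
    conn.execute("UPDATE X SET age = ? WHERE name = ?", (33, "Bob"))
    conn.execute("DELETE FROM X WHERE name = ?", ("Joey Lee",))
    # "retrieve from table X where name like Joe and age equal 32"
    rows = conn.execute(
        "SELECT id, name, age FROM X WHERE name LIKE ? AND age = ? ORDER BY id",
        (name_pattern, age),
    ).fetchall()
    conn.close()
    return rows


matches = find_people("%Joe%", 32)
```

The `?` placeholders keep the values separate from the SQL text, which avoids quoting mistakes.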

21
Introductory BCB Examples
  • Bioinformatics
  • Sequence alignment and database search
  • Gene discovery pipeline
  • EST Clustering
  • Computational Biology
  • Gene Prediction
  • Analysis of Low Complexity

22
Sequence Alignment and Database Search (Bioinformatics)
  • Alignment-based
  • Smith/Waterman
  • Dynamic Programming
  • Markov-model based
  • Large Database issues

23
Sequence Alignment
  • Nucleotide vs. amino acids
  • Global vs. Local
  • Pair-wise vs. multiple
  • Simplest case
  • Global, Pair-wise
  • Must match at both ends

24
Sequence Alignment Example
  • Example
  • S1: TTACTTGCC (9 bases)
  • S2: ATGACGAC (8 bases)
  • Scoring (1 possibility)
  • 2 match
  • 0 mismatch
  • -1 gap in either sequence
  • One possible alignment
  • S1:     T  T  -  A  C  T  T  G  C  C
  • S2:     A  T  G  A  C  -  -  G  A  C
  • Scores: 0  2 -1  2  2 -1 -1  2  0  2
  • Score: 10 - 3 = 7
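The score of the alignment above can be recomputed column by column. This is a minimal sketch using the slide's scoring (2 for a match, 0 for a mismatch, -1 for a gap); the function name and the use of "-" as the gap character are assumptions.

```python
def score_alignment(row1, row2, match=2, mismatch=0, gap=-1):
    """Sum the column scores of two already-aligned, equal-length rows."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap        # gap in either sequence
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


slide_score = score_alignment("TT-ACTTGCC", "ATGAC--GAC")
```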

25
Cue to a Data Structure
[Figure: paths through an alignment grid; each step is one of three moves: gap in S2, gap in S1, or alignment (match/mismatch)]
26
How hard can this be?
  • Brute-force approach: consider all possible
    alignments, and choose the one with the best score
  • 3 choices at each internal branch point
  • Assume an n x n comparison: 3^n comparisons
  • n = 3 → 3^3 = 27 paths
  • n = 20 → 3^20 = 3.4 x 10^9 paths
  • n = 200 → 3^200 = 2.6 x 10^95 paths
  • If 1 path takes 1 nanosecond (10^-9 secs):
  • 8.4 x 10^78 years!
  • But, using data structures cleverly, this can be
    greatly sped up to O(n^2)
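The slides do not show the O(n^2) method itself. As a hedged sketch, the standard textbook dynamic-programming recurrence for global pairwise alignment (often called Needleman-Wunsch) fills an (n+1) x (m+1) table in which each cell takes the best of the three moves. The function names are hypothetical, and the scoring defaults are taken from the earlier example slide.

```python
def brute_force_paths(n):
    """Path count used on the slide: 3 choices at each of n branch points."""
    return 3 ** n


def global_alignment_score(s1, s2, match=2, mismatch=0, gap=-1):
    """Best global alignment score via a table of len(s1)+1 by len(s2)+1 cells."""
    rows, cols = len(s1) + 1, len(s2) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        table[i][0] = i * gap            # s1 prefix aligned against gaps only
    for j in range(1, cols):
        table[0][j] = j * gap            # s2 prefix aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            table[i][j] = max(
                table[i - 1][j - 1] + pair,  # align the two bases
                table[i - 1][j] + gap,       # gap in s2
                table[i][j - 1] + gap,       # gap in s1
            )
    return table[-1][-1]


best = global_alignment_score("TTACTTGCC", "ATGACGAC")
```

Each cell is computed once from three neighbours, so the work is proportional to the table size rather than to the number of paths.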

27
Basics of Practical Alignment Algorithm
Example sequences: AAAG, AGC
For large database searching, O(n^2) is impractical
28
Other Scoring Systems
29
EST Gene Discovery Pipeline (Bioinformatics)
30
EST Sequence Clustering (Bioinformatics)
  • Goal: group together expressed sequence tags
    (ESTs) and full-length cDNA data into gene-based
    indices
  • Sequences considered linked if similarity score
    exceeds some threshold

31
Data Flow
32
Basic Flow of Execution
33
Expanding on Step 4c
34
Hashing
  • Generate unique integer for 8-base windows

Hash example
Sequence: GCCACTTGGCGTTTTG
Hashes:
  Hash 1  GCCACTTG  48406
  Hash 2  CCACTTGG  44869
  Hash 3  CACTTGGC  27601
  Hash 4  ACTTGGCG  39668
  Hash 5  CTTGGCGT  59069
  ...etc.
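One encoding that gives a unique integer per 8-base window is to read the window as a base-4 number. The sketch below assumes A=0, C=1, G=2, T=3 with the first base as the least significant digit. That assumption was chosen because it reproduces the five values listed on the slide; the original tool's code is not shown, and the function names are hypothetical.

```python
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit code per base


def hash_window(window):
    """Read a window as a base-4 number, first base least significant."""
    value = 0
    for position, base in enumerate(window):
        value += BASE_CODE[base] * 4 ** position
    return value


def window_hashes(sequence, width=8):
    """Hash every overlapping window of the given width."""
    return [
        hash_window(sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]


hashes = window_hashes("GCCACTTGGCGTTTTG")
```

With 8 bases and 4 symbols, every hash falls in the range 0 to 4^8 - 1, so it can be used directly as a table index.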
35
Global Hash Table Data Structure
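[Figure: global hash table with entries indexed 0, 1, 2, ... 4^8 - 1. Each entry points to a linked list of the clusters that contain at least 1 hash with that value (the example shown is value 2). Each cluster record holds the cluster's representative sequence (name, sequence, hashes, hash indexes, touch count, ...) and a pointer to the next cluster member.]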
36
Gene Prediction (Computational Biology)
  • Contexts
  • Identifying full length transcripts
  • Finding genes in genomic sequence
  • Approaches
  • Deployment Issues

37
Genome Architecture in a Nutshell
38
Preview of an HMM Model for Gene Prediction
39
The Crux of Gene Prediction
40
Gene Prediction Approaches
  • Ab initio methods
  • Profile Hidden Markov Models (GENSCAN, HMMgene)
  • Neural Networks (GRAIL, Genie)
  • Decision Trees (MORGAN)
  • Issues
  • Seeding from training sets
  • Fully general approaches?
  • Interesting question
  • Can gene finding be done species-independent?

41
Simple Dicty Gene Finder (Intuition and an Example)
  • Basic idea (G. Klein): based on GC/AT content of
    introns vs. exons
  • Idealized example: count G/Cs and A/Ts in a
    window size of 10 bases.
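Counting G/Cs and A/Ts per window takes only a few lines. This sketch uses non-overlapping 10-base windows for simplicity, which is an assumption; the slides do not say whether the windows overlap, and the function names are hypothetical.

```python
def gc_at_counts(window):
    """Return (GC count, AT count) for one window of bases."""
    window = window.upper()
    gc = sum(base in "GC" for base in window)
    at = sum(base in "AT" for base in window)
    return gc, at


def windowed_counts(sequence, width=10):
    """Counts for consecutive non-overlapping windows; a short tail is dropped."""
    return [
        gc_at_counts(sequence[i:i + width])
        for i in range(0, len(sequence) - width + 1, width)
    ]


counts = windowed_counts("CGCGGGCGCCTATTTATATA")
```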

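[Figure: idealized sequence annotated <EXON> ... <INTRON> ... <EXON>, with AT and GC counts per 10-base window]
.CGCGGGCGCCGTATTTATATATTATA..AATATTTTATATAGCCCGGCGCGGCCG...
AT content values shown: 6, 10, 10, 10
GC content values shown: 10, 10, 6, 2
Labels: Acceptor Site; Donor Site; "Point where GC.left and AT.right are both maximized"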
42
Dicty Gene Finding Tool Model
  • Model Parameters
  • W -- window size
  • thrL -- low threshold, below which GC or AT
    content does not match hypothesis
  • thrH -- high threshold, above which GC or AT
    content matches hypothesis
  • m -- number of consecutive windows that
    will be examined
  • n -- number of windows out of m that
    must exceed the threshold to qualify for an
    intron/exon or exon/intron transition
  • tol -- maximum distance from the GC/AT
    content transition at which the GT or AG
    motif must be found

43
Dicty Gene Finding Tool Model
W = 8, m = 4, thrH = 7, thrL = 6
[Figure: six numbered windows (1 to 6) laid over the sequence below, annotated "G/C 7", "n = 3", and "n = 4"]
. . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . .
44
Dicty Intron/Gene Prediction Algorithm
  • 1. Calculate AT (GC) content in size W windows
    right and left of each base position.
  • 2. Calculate n:
  • AT count ≥ thrH, AT count ≤ thrL
  • for each window of m bases to the left and right
    of each base position.
  • 3. For each position, if ...
  • AT_left_high ≥ n and AT_right_low ≥ n
  • → potential acceptor site
  • AT_left_low ≥ n and AT_right_high ≥ n
  • → potential donor site

45
Dicty Intron/Gene Prediction Algorithm (continued)
  • 4. For each potential donor or acceptor site
  • If GT (donor) or AG (acceptor) motif is found
    within tol bases distance, note this as an
    intron boundary.
  • 5. Sort boundaries into candidate introns.
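Steps 1 to 4 above can be sketched as follows. This is one possible reading, not the original tool: the slides leave the exact window placement open, so the code assumes that the m windows on each side are consecutive windows of size W shifted by one base, and that "within tol" means the motif starts no more than tol bases from the position. `thr_high` and `thr_low` stand for the two thresholds, and all function names are hypothetical.

```python
def candidate_sites(seq, w, m, n, thr_high, thr_low):
    """Return (acceptors, donors): positions where AT content flips."""
    seq = seq.upper()
    # prefix[i] = number of A/T bases in seq[:i]; any window count is a subtraction.
    prefix = [0]
    for base in seq:
        prefix.append(prefix[-1] + (base in "AT"))

    def at_count(start):
        return prefix[start + w] - prefix[start]

    span = w + m - 1  # bases needed on each side for m shifted windows
    acceptors, donors = [], []
    for p in range(span, len(seq) - span + 1):
        left = [at_count(p - w - k) for k in range(m)]   # windows ending at p - k
        right = [at_count(p + k) for k in range(m)]      # windows starting at p + k
        left_high = sum(c >= thr_high for c in left)
        left_low = sum(c <= thr_low for c in left)
        right_high = sum(c >= thr_high for c in right)
        right_low = sum(c <= thr_low for c in right)
        if left_high >= n and right_low >= n:
            acceptors.append(p)   # AT-rich on the left, AT-poor on the right
        if left_low >= n and right_high >= n:
            donors.append(p)      # AT-poor on the left, AT-rich on the right
    return acceptors, donors


def motif_within(seq, p, motif, tol):
    """True if motif starts at most tol bases away from position p."""
    lo = max(p - tol, 0)
    hi = min(p + tol + len(motif), len(seq))
    return motif in seq.upper()[lo:hi]


toy = "GC" * 10 + "AT" * 10 + "GC" * 10   # idealized exon, intron, exon
toy_sites = candidate_sites(toy, w=4, m=3, n=3, thr_high=4, thr_low=0)
```

A candidate donor at position `p` would then be kept as an intron boundary only if `motif_within(seq, p, "GT", tol)` holds, and a candidate acceptor only if `motif_within(seq, p, "AG", tol)` holds.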

46
Test Data
  • >IIADP1D6358 Antiparallèle 811 bases
  • AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT
  • CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT
  • AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC
  • TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT
  • GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA
  • TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt
  • atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT
  • ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT
  • gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat
  • tatttgattaaaaatagaaggtttttttttttattttttttttttatttt
  • tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat
  • taattttaattttttttttttttttttttttttttttttttttttttttt
  • ttcatttttaacatcatttgattcattaatttattttttttttcaacatc
  • cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA
  • TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG
  • AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT
  • CGACCGAAGGC

47
Parameter Space to Search
  • Ranges
  • W -- 3 → 10 (8 values)
  • thrH -- 0.7 x W → W (≈4 values)
  • thrL -- 0.5 x W → 0.9 x W (≈4 values)
  • m -- 3 → 11 (9 values)
  • n -- m/2 → m (≈4 values)
  • tol -- 3 → 7 (5 values)
  • 3584 x 5 ≈ 18,000 sets of parameters
  • Search for sets that find all expected sites with
    a minimum of false positives.
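A parameter sweep like the one above can be generated with nested loops. The sketch below rounds the fractional bounds down, which is an assumption; because the slides do not state the rounding or any extra constraints, the number of sets it yields is not meant to reproduce the slide's 3584 x 5 figure. The function name and dictionary keys are hypothetical.

```python
def parameter_grid():
    """Yield one dict per parameter set; fractional bounds are rounded down."""
    for w in range(3, 11):                                   # W: 3 to 10
        for thr_high in range((7 * w) // 10, w + 1):         # 0.7 x W to W
            for thr_low in range(w // 2, (9 * w) // 10 + 1):  # 0.5 x W to 0.9 x W
                for m in range(3, 12):                       # m: 3 to 11
                    for n in range(m // 2, m + 1):           # m/2 to m
                        for tol in range(3, 8):              # tol: 3 to 7
                            yield {
                                "W": w, "thrH": thr_high, "thrL": thr_low,
                                "m": m, "n": n, "tol": tol,
                            }


grid = list(parameter_grid())
```

Each set would then be run against the known gene, keeping only the sets that find all expected sites with the fewest false positives.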

48
Test Data
idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687
. . . About 18,000 more lines like this . . .
49
Test Data Raw Results
len=811 W=3 n=1 m=3 thrL=1 thrH=2, Tol=4, Sites Found=18
Intron 1 91 213 - 213
Intron 2 236 - 241
Intron 3 267 - 267
Intron 4 385 399 - 399 - 467
Intron 5 471 759 - 759 - 797
Intron 6 799 - 799

len=811 W=3 n=1 m=3 thrL=2 thrH=2, Tol=4, Sites Found=29
Intron 1 91 213 - 213
Intron 2 219 - 223
Intron 3 236 - 241
Intron 4 267 - 267
Intron 5 305 - 312 - 335
Intron 6 341 - 341
Intron 7 385 399 - 399
Intron 8 429 - 433
Intron 9 441 - 467
Intron 10 471 - 753
Intron 11 759 - 759 - 797
Intron 12 799 - 799

len=811 W=3 n=1 m=3 thrL=1 thrH=3, Tol=4, Sites Found=13
Intron 1 91 213 - 213 - 241
Intron 2 267 399 - 399 - 467
Intron 3 471 759 - 759 - 786

. . . About 18,000 sets of results like this . . .
50
Test Data Filtered Results
len=811 W=6 n=5 m=9 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=3, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=8 thrL=5 thrH=6, Tol=6, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=6 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND
len=811 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites Found=11 ALL KNOWN SITES FOUND

This provides an initial set of parameters that are likely to be optimal.
51
Best Parameter Set on Known Gene
len=811 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites Found=11
  1 AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT
 51 CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT
101 AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC
151 TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT
201 GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA
251 TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt
301 atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT
351 ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt
401 gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT
451 TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt
501 tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat
551 taattttaat tttttttttt tttttttttt tttttttttt tttttttttt
601 ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc
651 cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA
701 TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG
751 AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT
801 CGACCGAAGG C

Intron 1 213 - 241 overpredicted (45 bases)
Intron 2 267 - 341 UNDERPREDICTED (37 BASES)
Intron 3 385 399 - 433 correct (325 bases)
Intron 4 471 - 687 CORRECT - (404 BASES)
52
Analysis of Unknown Gene
  • Started with 21 reads from Michel
    Satre (genomic and ESTs)
  • Used phred to assemble them
  • 4 contigs found
  • 4th contig was longest (1759 bases)
  • Used parameters from previous analysis
  • Results for contig4 compared . . . . . . .

53
Contig4 Sampled Results (a closer look)
  • W=6 n=5 m=5 thrL=5 thrH=6, Tol=6, Sites Found=14
  • Intron 1 54 - 401
  • Intron 2 579 - 612
  • Intron 3 711 782 - 1113
  • Intron 4 1185 1350 - 1350 - 1504
  • Intron 5 1628 - 1709
  • W=6 n=5 m=5 thrL=5 thrH=6, Tol=7, Sites Found=14
  • Intron 1 54 - 401
  • Intron 2 579 - 612
  • Intron 3 711 782 - 1113
  • Intron 4 1185 1350 - 1350 - 1504
  • Intron 5 1628 - 1709
  • len=1759 W=6 n=5 m=6 thrL=5 thrH=6, Tol=6, Sites
    Found=14
  • Intron 1 54 - 401
  • Intron 2 579 - 612
  • Intron 3 711 782 - 1164
  • Intron 4 1174 1350 - 1350 - 1504
  • len=1759 W=6 n=5 m=7 thrL=5 thrH=6, Tol=6, Sites
    Found=17
  • Intron 1 54 - 401
  • Intron 2 579 - 612
  • Intron 3 650 782 - 1043
  • Intron 4 1087 - 1164
  • Intron 5 1174 1350 - 1350 - 1504
  • Intron 6 1628 - 1709
  • Intron 7 1735
  • len=1759 W=6 n=5 m=7 thrL=5 thrH=6, Tol=7, Sites
    Found=19
  • Intron 1 54 - 401
  • Intron 2 579 - 612
  • Intron 3 650 - 683
  • Intron 4 711 782 - 1043
  • Intron 5 1087 1164 - 1164
  • Intron 6 1350 - 1350 - 1504
  • Intron 7 1628 - 1709
  • Intron 8 1735

54
Contig4 Results
  • len=1759 W=6 n=5 m=10 thrL=5 thrH=6, Tol=4, Sites
    Found=19
  • 1 TGATAATAAC AATAATAACA ATAATAATAA TAATAATAAT
    AATATTAATA
  • 51 ATTgtaataa taataatatt aataatgata ataataataa
    taataataat
  • 101 gataatgata ataataatat taatactgtt gataatcatg
    atgatgatat
  • 151 tataaataat ancaataatt ttaataaaaa tgaatatcca
    tcaagtaata
  • 201 tatcaccaat atctccaaaa tcttcaatat caagttttcc
    aacaaattta
  • 251 aataattcaa taaataatac aggttcaatg gtttcagatt
    ctttaagttc
  • 301 ttgtagaaat tcgatttcct ctagttcaat tgattcaagt
    gttgcttcaa
  • 351 ttcctataac aatacaatca atagattttg aagataagaa
    tattaaatca
  • 401 gACCAATTTA AAATAATATC AAAATCAAAT ATAGAAAATA
    CAATTGAAAC
  • 451 AAACCCAATA CCTCCATTCA ATCAAACCAA TAACCAATGT
    GAAGTTCAGT
  • 501 TACAATCACA TTCTTTACCA ACAATTTTAA AACAACCACA
    TATTTATAAA
  • 551 TCAAAATCAT TTTCTAGTAG TATCAATAgt aatagtaaaa
    ttaaaaaaat
  • 601 taaaaaatca agATCATTTG AAATTGAATC AAAAATTAAT
    TTATTTGATg
  • 651 taattaatca tatatattta aacctttcaa aagTTGGTAG
    TGAAGAACAA
  • 701 AAAATAACCA gtatgtatta aattaacaaa tgattaatat
    attgttgtaa
  • 751 aaatatatga aactaattta atattttaaa ggtgttttta
    aattatatga
  • 801 tatatatgat aagggtttta tttcaagaga tgatttaaaa
    gaagtattaa
  • 851 attatagaac taaacaaaat gggttaaaat ttcaagactt
    tacaatggaa
  • . . .
  • 1001 aagttaaaga aaaaggaaga aaatccaaat tatattttta
    aagaagAAAA
  • 1051 TATTGGAATA TATACCGGAA AAAGAAAGTT TTCATAgttt
    aaaaagatat
  • 1101 ttaaaaattg aaggatcaaa attatttttt atatctttat
    tttttattat
  • 1151 aaattcaatt ttagTAATAA CAAGTTTTTT AAATGTTCAT
    GCAAATAATA
  • 1201 AAAGAGCAAT AGAATTATTT GGCCCGGGTG TATATATAAC
    AAGAATTGCA
  • 1251 GCTCAATTAA TTGAATTCAA TGCAGCAATA ATTTTAATGA
    CAATGTGTAA
  • 1301 ACAATTATTT ACAATGATTA GAAATACCAA ATTTAAATTT
    TTATTTCCAG
  • 1351 TTGATAAATA TATGACATTT CATAAATTAA TTGGTTATAC
    ATTAATCATT
  • 1401 GCTTCATTTT TACATACTAT TGGTTGGATT GTTGGTATGG
    CAGTTGCNCC
  • 1451 CGGTAAACCT GATAATATTT TTTATGATTG TTTAGCACCT
    CATTTTAAAT
  • 1501 TTAGGCCACC AGTTTGGGAA ATGATTTTTA ACCGTTTACC
    AGGTGTAACA
  • 1551 GGTTTCACTT ACCAAAATCA TTTTTAATAA TTATGGCAAT
    TTTATCTTTA
  • 1601 AAAATTATTN GGAAATCTAA TTTNGNAgta agtttttttt
    tttaaaaaaa
  • 1651 aaaaaaaaaa aaaaattaat tattttttat tatataattt
    tatagttatt
  • 1701 ttattatagC CATCATTTAT TTATNGGATT TTATgtttna
    ttaattttac
  • 1751 atgggacaa
  • Intron 1 54 - 401

55
Further Intron Finding Options
  • Exhaustive parsing of sequence
  • 400 base sequence → 50 acceptor/donors
  • 20 donor/acceptors → 5 minutes on P750/.5GB
  • 24 donor/acceptors → 1 day
  • 30 donor/acceptors → year
  • Hybrid solution: rank top 20 d/a sites and parse
  • Use protein/predicted gene homology to edit
    results

56
Gene Prediction: Recognizing Initiation of Coding
57
High-level Outline
[Figure: decision flow with the labels: Consensus Kozak (0 errors, 1 error, ≥ 2 errors); ATG/UTR Heuristic; Stops upstream; E(stop); Check ORF for frame shifts]
58
Core Heuristic Components
226 Classes
  • Kozak Existence and Fidelity
  • ATG Heuristic
  • Template: (sIFl, sl, sFl) 5len ATG 3len (sIFr, sr, sFr)
  • Ideal: (1, 3, 3) 125 ATG 300 (0, 6, 2)
  • Stops left of candidate ATG
  • CDS Stops in minimum frame
  • UTR Heuristic
  • In frame stops to All stops Ratio
  • Frame shifts needed for perfect ORF
  • Not Used
  • Codon or Hexamer Frequencies.
  • Known protein starting motifs.

59
Verification and Testing
  • Generation of sets of known CDS reads (12,826)
  • known ATG reads (13,672)
  • known UTR reads (1,035)
  • Run Classifier against all three sets
  • Identify classes with highest CDS to ATG
    differential UTR vs. CDS/ATG
  • Grade A K0E.ATG.L.pSL.ORFr0F or 1FS
  • K0E.ATG.L.npSL.ORFr0FS or 1FS
  • K1E.ATG.L.pSL.ORF0FS or 1FS
  • K1E.ATG.L.npSL.ORF0FS or 1FS
  • KG1E.ATG.L.pSL.ORF0FS or 1FS
  • KG1E.ATG.L.npSL.ORF0FS or 1FS
  • Grade B Same as A, but with ATG in Middle 1/2
  • Grade C zSL for K0E only and ATG in L, M, or R
  • UTR Class

60
Accuracy and Yield of Classes
  • ATG True Positive (of 13,672)
  • Grade A: 867 - 6.3%
  • Grade B: 3,742 - 27.3%
  • UTR: 82 - 0.6%
  • Total: 34.3% (4,691)
  • CDS False Positive (of 12,826)
  • Grade A: 3 - 0.02%
  • Grade B: 753 - 5.5%
  • UTR: 1,725 - 13.5%
  • Total: 19.3% (2,481)
  • UTR True Positive (of 1,035)
  • 691 - 66.8%

Yield: 34% → 67%   Confidence: 95% → 87%
61
Fin