Title: I. Programming Fundamentals
1I. Programming Fundamentals
- Problem Solving
- Problem Specification
- Top-down Design
- Languages
- Debugging/Performance Tuning
- Testing
- Maintenance
21. Problem Solving(I. Programming Fundamentals)
- Since the late 1960s, the phrase "Problem Solving
with Computers"
- The computer as a tool
- Understand Problem
- Specification
- Design
- Implement
- Test
- Maintain
32. Problem Specification(I. Programming
Fundamentals)
- Can be informal, formal, or in between.
- A definition of Input/Output Relationships
- Uncovers and clarifies any ambiguous issues
- Involves interactions between end users and
solution developers
- Ideally, produces a specification document
- Realistically, prototyping usually starts
simultaneously with specification
43. Top-down Design(I. Programming Fundamentals)
- Extremely important methodological philosophy.
- Develops a solution in successive phases of
decreasing levels of abstraction
- Any problem can have a solution described (at
some level of abstraction) on about a half sheet
of paper.
- Aliases: modular programming, stepwise
refinement (object-oriented design is this
philosophy with training wheels added)
54. Languages(I. Programming Fundamentals)
- Choice of language can be important.
- Often, however, final choice of language can be
as much a matter of subjective, personal choice,
as is a type of paint brush to an artist.
- Issues: acceptance (for maintenance),
performance, portability
64. Languages(I. Programming Fundamentals)
- Language types
- Procedural (C, Fortran, Pascal, Basic)
- Object-oriented (C++, Smalltalk)
- Functional, Declarative (LISP, Prolog)
- String processing (SNOBOL)
- Deployment technologies
- Interpreted (Basic)
- Compiled (C, Fortran, most languages)
- Run-time systems
- Statically linked (real-time, older systems)
- Dynamically linked (most modern environments)
75. Debugging/Performance Tuning(I. Programming
Fundamentals)
- The most unpredictable phase of the process
- Not a matter of luck, however
- Scientific principles are critical
- Formulate hypothesis
- Perform an experiment, examine results
- Make a single change
- Repeat at step 1.
- Crash testing, and obvious error finding
- Debugging tools can assist (gdb)
86. Testing(I. Programming Fundamentals)
- Goes beyond crash testing
- Need to develop test sets
- Functional testing
- Structural testing
- Must struggle with specification now
97. Maintenance(I. Programming Fundamentals)
- The on-going, necessary update and debugging of
finished software
- This step never ends
- Often, earlier steps ignore this phase for the
sake of expediency
- Language choice, specification, modularization,
all bear on this step
10II. Data Structures
- A practical framework for holding data.
- Must consider input, intermediate, and possibly
computed output data
- Impacts on
- Development time
- Memory usage
- Performance (execution time)
- Maintainability
11II. Data Structures (cont.)
- Scalar and array variables
- Static and dynamic structures
- Dense and sparse structures
- Linear and linked structures
- Lists, Stacks (LIFO), Queues (FIFO), Trees,
Graphs, and Heaps
- Dynamic structure efficiency relies on OS
interaction, and program behavior
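As a small illustration of the LIFO and FIFO behavior named above, here is a sketch in Python (the language choice is an assumption; the slides do not prescribe one). It uses a plain list as a stack and `collections.deque` as a queue.

```python
from collections import deque

# Stack (LIFO): the last item pushed is the first one popped.
stack = []
for base in "ACG":
    stack.append(base)
lifo_order = [stack.pop() for _ in range(3)]

# Queue (FIFO): the first item enqueued is the first one dequeued.
queue = deque()
for base in "ACG":
    queue.append(base)
fifo_order = [queue.popleft() for _ in range(3)]

print(lifo_order)   # ['G', 'C', 'A']
print(fifo_order)   # ['A', 'C', 'G']
```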
12III. Algorithms
- Control flow
- Template structures
- Complexity analysis
13III. Algorithms(Control Flow)
- Sequential
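- Alternation or selection
- Iteration or looping
[Figure: three flowcharts. Sequential: Statement 1, Statement 2. Alternation: a test (?) choosing between Statement 1 and Statement 2. Iteration: a test (?) guarding a Loop Body]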
14III. Algorithms(Template Structures)
- Divide and Conquer
- Greedy
- Backtracking
- Branch and Bound
- Searching (depth first, breadth first)
- Dynamic Programming
15III. Algorithms(Complexity Analysis)
- for i = 1 to 100
-   for j = 1 to 50
-     x_i = a_i b_j
- Inner statement executes 50 x 100, or 5,000
times. If the outer loop executed n times, and the
inner one n times, we would say that this algorithm
had complexity O(n^2).
- In some sense, as the problem size n grows, the
execution time will grow as the square of n.
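The counting argument above can be checked directly. The sketch below (the function name `count_inner_steps` is hypothetical) only counts how often the innermost statement runs; it does not perform the slide's assignment.

```python
def count_inner_steps(outer, inner):
    """Count how many times the innermost statement of a doubly nested loop runs."""
    steps = 0
    for _ in range(outer):
        for _ in range(inner):
            steps += 1
    return steps

print(count_inner_steps(100, 50))        # the loop from the slide: 5000
for n in (10, 20, 40):
    print(n, count_inner_steps(n, n))    # doubling n quadruples the count
```

With both loops running n times the count is n * n, which is the O(n^2) growth described.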
16IV. Systems and Networks
[Figure: processor and memory. Memory holds DATA (scalar, array) and PROGRAMS. Processor cycle: 1. Fetch Instruction, 2. Fetch Data, 3. Execute Instruction, 4. Store Data]
17IV. Systems and Networks (cont.)
Tools and Applications
Libraries and Languages
Peripherals: disks, etc.
CPU/Memory
Network Operating System
Local Operating System
18IV. Systems and Networks (cont.)
[Figure: several computers, each with CPU, memory, and disk, attached to a shared network medium]
- Many possible media: physical and protocols
- Functional variants: message passing, shared
files, shared memory
- Security issues: protecting data, allowing sharing
- Heterogeneous operating systems
19V. Tools and Scripts
- Tools
- Debugging
- Performance Tuning
- Administration
- Scripting
- Programs of shell commands
- Glue to allow other programs to work together,
and manipulate whole files (of sequence, for
example) as simple data objects
20VI. Databases
- Pile o' data
- Stored on large non-volatile media (e.g. disk
system), local vs. networked.
- Table structures
- Primary key for each item
- Strength is relational query methods
- SQL: Structured Query Language
- retrieve from table X where name like Joe and
age equal 32
- Insert, delete, update, etc.
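The informal query above might look like the following in real SQL. This is a sketch using Python's built-in `sqlite3` module with an in-memory database; the column layout of table `X`, the sample rows, and the reading of "like Joe" as the pattern `Joe%` are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout for table X: a primary key plus the two queried columns.
conn.execute("CREATE TABLE X (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO X (name, age) VALUES (?, ?)",
    [("Joe", 32), ("Joe", 45), ("Joel", 32), ("Ann", 32)],
)

# "retrieve from table X where name like Joe and age equal 32"
rows = conn.execute(
    "SELECT id, name, age FROM X WHERE name LIKE ? AND age = ? ORDER BY id",
    ("Joe%", 32),
).fetchall()
print(rows)

# Update and delete, the other operations named on the slide.
conn.execute("UPDATE X SET age = 33 WHERE name = 'Ann'")
conn.execute("DELETE FROM X WHERE age > 40")
remaining = conn.execute("SELECT COUNT(*) FROM X").fetchone()[0]
print(remaining)
conn.close()
```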
21Introductory BCB Examples
- Bioinformatics
- Sequence alignment and database search
- Gene discovery pipeline
- EST Clustering
- Computational Biology
- Gene Prediction
- Analysis of Low Complexity
22Sequence Alignment and Database
Search(BioInformatics)
- Alignment-based
- Smith/Waterman
- Dynamic Programming
- Markov-model based
- Large Database issues
23Sequence Alignment
- Nucleotide vs. amino acids
- Global vs. Local
- Pair-wise vs. multiple
- Simplest case
- Global, Pair-wise
- Must match at both ends
24Sequence Alignment Example
- Example
- S1 = TTACTTGCC (9 bases)
- S2 = ATGACGAC (8 bases)
- Scoring (1 possibility)
- +2 match
- 0 mismatch
- -1 gap in either sequence
- One possible alignment
- S1:     T  T  -  A  C  T  T  G  C  C
- S2:     A  T  G  A  C  -  -  G  A  C
- Score:  0  2 -1  2  2 -1 -1  2  0  2
- Total score = 10 - 3 = 7
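The score of the alignment above can be recomputed column by column. A minimal sketch, assuming the slide's scoring (2 for a match, 0 for a mismatch, -1 for a gap); the function name `score_alignment` is hypothetical.

```python
MATCH, MISMATCH, GAP = 2, 0, -1   # scoring from the slide

def score_alignment(row1, row2):
    """Sum column scores for two equal-length aligned rows ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have equal length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += GAP
        elif a == b:
            total += MATCH
        else:
            total += MISMATCH
    return total

print(score_alignment("TT-ACTTGCC", "ATGAC--GAC"))   # 7
```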
25Cue to a Data Structure
[Figure: alignment grid with three kinds of moves: gap in S2, gap in S1, and alignment (match/mismatch)]
26How hard can this be?
- Brute force approach: consider all possible
alignments, and choose the one with the best score
- 3 choices at each internal branch point
- Assume n x n comparison: 3^n comparisons
- n = 3 → 3^3 = 27 paths
- n = 20 → 3^20 ≈ 3.4 x 10^9 paths
- n = 200 → 3^200 ≈ 2.6 x 10^95 paths
- If 1 path takes 1 nanosecond (10^-9 secs)
- 8.4 x 10^78 years!
- But, using data structures cleverly, this can be
greatly sped up to O(n^2)
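One standard way to reach O(n^2) is dynamic programming over a table of prefix scores. The sketch below is a textbook global-alignment recurrence rather than necessarily the exact method on these slides; it reuses the scoring from the earlier example (2 match, 0 mismatch, -1 gap) as default arguments and keeps only two rows of the table in memory.

```python
def global_alignment_score(s1, s2, match=2, mismatch=0, gap=-1):
    """Best global pairwise alignment score in O(len(s1) * len(s2)) time."""
    # prev[j] = best score aligning the first i-1 bases of s1 with the first j of s2
    prev = [j * gap for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [i * gap]
        for j in range(1, len(s2) + 1):
            diag = prev[j - 1] + (match if s1[i - 1] == s2[j - 1] else mismatch)
            up = prev[j] + gap          # gap in s2
            left = curr[j - 1] + gap    # gap in s1
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("AC", "AGC"))   # 3
print(global_alignment_score("TTACTTGCC", "ATGACGAC"))
```

Each table cell takes the best of the three choices (diagonal, up, left), so every path through the grid is covered without enumerating the paths one by one.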
27Basics of Practical Alignment Algorithm
Example sequences: AAAG, AGC
For large database searching, O(n^2) is impractical
28Other Scoring Systems
29EST Gene Discovery Pipeline(BioInformatics)
30EST Sequence Clustering(BioInformatics)
- Goal: group together expressed sequence tags
(ESTs) and full length cDNA data into gene-based
indices
- Sequences considered linked if similarity score
exceeds some threshold
31Data Flow
32Basic Flow of Execution
33Expanding on Step 4c
34Hashing
- Generate unique integer for 8-base windows
Hash example, sequence GCCACTTGGCGTTTTG:
Hash 1 GCCACTTG 48406
Hash 2 CCACTTGG 44869
Hash 3 CACTTGGC 27601
Hash 4 ACTTGGCG 39668
Hash 5 CTTGGCGT 59069
...etc.
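The slide does not state the hash function. One encoding that reproduces all five values shown is a base-4 number with A=0, C=1, G=2, T=3 and the first base as the least significant digit. The sketch below uses that inferred encoding, so treat it as an assumption rather than the original implementation.

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}   # 2 bits per base (inferred from the slide's values)

def window_hash(window):
    """Base-4 integer for a window; the first base is the least significant digit."""
    return sum(CODE[base] * 4 ** i for i, base in enumerate(window))

def window_hashes(seq, w=8):
    """Hash of every w-base window, sliding one base at a time."""
    return [window_hash(seq[i:i + w]) for i in range(len(seq) - w + 1)]

print(window_hashes("GCCACTTGGCGTTTTG")[:5])
# [48406, 44869, 27601, 39668, 59069]
```

With 8 bases at 2 bits each, every window maps to a distinct integer between 0 and 4^8 - 1.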
35Global Hash Table Data Structure
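[Figure: global hash table with slots 0, 1, 2, ..., 4^8 - 1. Slot 2 points to a linked list of clusters that contain at least 1 hash with value 2. Each cluster record holds: Cluster Representative Sequence; Name, Sequence, Hashes, Hash Indexes, Touch Count, ...; and a Pointer To Next Cluster Member]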
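A table like this can be sketched as a mapping from hash value to the clusters that contain it. In the sketch below a Python dictionary of lists stands in for the array of linked lists, the cluster names and sequences are made up, the hash encoding is the inferred one (A=0, C=1, G=2, T=3, first base least significant), and reading "touch count" as the number of hashes a read shares with a cluster is an assumption.

```python
from collections import defaultdict

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def window_hashes(seq, w=8):
    """Set of base-4 hashes for every w-base window of seq."""
    return {sum(CODE[b] * 4 ** k for k, b in enumerate(seq[i:i + w]))
            for i in range(len(seq) - w + 1)}

# hash value -> names of clusters having at least one window with that hash
table = defaultdict(list)

def add_cluster(name, representative):
    for h in window_hashes(representative):
        table[h].append(name)

def candidate_clusters(read):
    """Count shared hashes per cluster (a stand-in for the slide's touch count)."""
    touches = defaultdict(int)
    for h in window_hashes(read):
        for name in table.get(h, []):
            touches[name] += 1
    return dict(touches)

add_cluster("c1", "GCCACTTGGCGTTTTG")   # hypothetical cluster representatives
add_cluster("c2", "AAAAAAAAAAAAAAAA")
print(candidate_clusters("CCACTTGGCG"))   # {'c1': 3}
```

Only clusters that share at least one hash with the read are ever visited, which is the point of the structure: most clusters are skipped without any alignment work.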
36Gene Prediction(Computational Biology)
- Contexts
- Identifying full length transcripts
- Finding genes in genomic sequence
- Approaches
- Deployment Issues
37Genome Architecture in a Nutshell
38Preview of an HMM Model for Gene Prediction
39The Crux of Gene Prediction
40Gene Prediction Approaches
- Ab initio methods
- Profile Hidden Markov Models (GENSCAN, HMMgene)
- Neural Networks (GRAIL, Genie)
- Decision Trees (MORGAN)
- Issues
- Seeding from training sets
- Fully general approaches?
- Interesting question
- Can gene finding be done species-independent?
41Simple Dicty Gene Finder(Intuition and an
Example)
- Basic idea (G. Klein): based on GC/AT content of
introns vs. exons
- Idealized example: count G/Cs and A/Ts in a
window size of 10 bases.
<EXON> <INTRON> <EXON>
.CGCGGGCGCCGTATTTATATATTATA..AATATTTTATATAGCCCGGCGCGGCCG...
AT content (window labels in figure): 6, 10, 10, 10
GC content (window labels in figure): 10, 10, 6, 2
Acceptor Site
Point where GC.left and AT.right are both
maximized
Donor Site
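The window counting behind this idea can be sketched as follows. The function name `gc_at_counts` is hypothetical, and the use of non-overlapping windows is an assumption made to keep the example short; it is applied to the first part of the idealized sequence above.

```python
def gc_at_counts(seq, w=10):
    """Return (GC count, AT count) for each non-overlapping window of w bases."""
    seq = seq.upper()
    counts = []
    for start in range(0, len(seq) - w + 1, w):
        window = seq[start:start + w]
        gc = sum(base in "GC" for base in window)
        at = sum(base in "AT" for base in window)
        counts.append((gc, at))
    return counts

# First 26 bases of the idealized example: a GC-rich exon end, then an AT-rich intron start.
print(gc_at_counts("CGCGGGCGCCGTATTTATATATTATA"))   # [(10, 0), (1, 9)]
```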
42Dicty Gene Finding Tool Model
- Model Parameters
- W -- Window Size
- thrL (low threshold) -- threshold below which GC
or AT content does not match hypothesis
- thrH (high threshold) -- threshold above which GC
or AT content matches hypothesis
- m -- number of consecutive windows that
will be examined
- n -- number of windows out of m that
must exceed the threshold to qualify for an
intron/exon or exon/intron transition
- tol -- maximum distance from the GC/AT
content transition at which the GT or AG
motif must be found
43Dicty Gene Finding Tool Model
W = 8, m = 4, thrH = 7, thrL = 6
[Figure: windows 1 to 6 sliding along . . .GCGGGCGCTGGGGCCGCGTATTATAGTATTTAT. . ., with the label G/C7 and the resulting counts n = 3 and n = 4]
44Dicty Intron/Gene Prediction Algorithm
- 1. Calculate AT (GC) content in size W windows
right and left of each base position.
- 2. Calculate n: the number of windows with
- AT count ≥ thrH, and with AT count ≤ thrL,
- for each window of m bases to the left and right
of each base position.
- 3. For each position, if ...
- ATleft,high ≥ n and ATright,low ≥ n
- → potential acceptor site
- ATleft,low ≥ n and ATright,high ≥ n
- → potential donor site
- 4. For each potential donor site
- If GT (donor) or AG (acceptor) motif is found
within Tol bases distance, note this as an
intron boundary.
- 5. Sort boundaries into candidate introns.
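Steps 1 to 3 can be sketched as below. This is an illustrative reading of the slides, not the original tool: the function names, the 0-based position convention (position p is the boundary just before index p), the way the m windows are laid out on each side, and the use of "at least thrH" and "at most thrL" as the comparisons are all assumptions. Steps 4 and 5 (motif search and sorting) are left out.

```python
def at_count(seq, start, w):
    """Number of A/T bases in the w-base window beginning at start."""
    return sum(base in "AT" for base in seq[start:start + w])

def candidate_sites(seq, w, m, n, thr_low, thr_high):
    """Steps 1-3: flag positions where AT content flips between low and high."""
    seq = seq.upper()
    donors, acceptors = [], []
    span = w + m - 1                      # bases needed on each side of a position
    for p in range(span, len(seq) - span + 1):
        # m windows ending at or just before p, and m windows starting at or after p
        left = [at_count(seq, p - w - k, w) for k in range(m)]
        right = [at_count(seq, p + k, w) for k in range(m)]
        left_high = sum(c >= thr_high for c in left)
        left_low = sum(c <= thr_low for c in left)
        right_high = sum(c >= thr_high for c in right)
        right_low = sum(c <= thr_low for c in right)
        if left_high >= n and right_low >= n:
            acceptors.append(p)           # AT-rich intron on the left, exon on the right
        if left_low >= n and right_high >= n:
            donors.append(p)              # exon on the left, AT-rich intron on the right
    return donors, acceptors

exon_then_intron = "GC" * 10 + "AT" * 10   # made-up idealized sequence
print(candidate_sites(exon_then_intron, w=4, m=3, n=3, thr_low=0, thr_high=4))
# ([20], [])
```

On this idealized input only the exact GC-to-AT boundary qualifies as a donor candidate; on real sequence, looser thresholds and n smaller than m would be needed, which is what the parameter search explores.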
46Test Data
- >IIADP1D6358 Antiparallèle 811 bases
- AAAAACCTGCTTAGGATTAATTATGAGCGAATTTTTTTTCTTTAAAACTT
- CCAAAAATATTTTTTTTTTTTTTTTTTTTTAATAATTTCGGTTTGCTCAT
- AGATTTTTTATTTATTTAATTAATATTTTTAATTTTTTTTTTTTTAATCC
- TAAAAATAGATTTTATTTATTTTATTTAATTTTTAATTATTAAAAGATAT
- GAGATTTTTAAAGTTCGGGTTAGAAATTAATTTGGGTAAAGGAACTCTTA
- TTGAATTTGATGAACAgtgtacttaaatatttaattaatttttttttttt
- atttgttttaagaagaagaaaaagaaaaaatatagaaatagTAAAAAACT
- ATTTCCATATATTTGTTATACTCTTACACACAAGGTTATAAATTTAAAGT
- gttataaataatttaaaaattttattctgtaagaaaatttgttttgaaat
- tatttgattaaaaatagaaggtttttttttttattttttttttttatttt
- tatttttttttattttttataatttccgcgtttgaatttgttgtgtaaat
- taattttaattttttttttttttttttttttttttttttttttttttttt
- ttcatttttaacatcatttgattcattaatttattttttttttcaacatc
- cccaacccaaaaaaaaaaaataaaaaaaaatgataagAAATTTAACAAAA
- TTAACAAAATTTACAATTGAAAATAGATTTTACCAATCCTCATCAAAAGG
- AAGATTCAGTGGTAAAAATGGAAACAATGCATTCAGGGGATCTCTAGAGT
- CGACCGAAGGC
47Parameter Space to Search
- Ranges
- W -- 3 to 10 (8 values)
- thrH -- 0.7 x W to W (≈4 values)
- thrL -- 0.5 x W to 0.9 x W (≈4 values)
- m -- 3 to 11 (9 values)
- n -- m/2 to m (≈4 values)
- tol -- 3 to 7 (5 values)
- 3584 x 5 ≈ 18,000 sets of parameters
- Search for sets that find all expected sites with
a minimum of false positives.
48Test Data
idt t1.fasta 3 1 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 1 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 3 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 3 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 2 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 1 3 4 2 2 269 401 341 687
idt t1.fasta 3 4 4 2 3 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 1 2 4 2 2 269 401 341 687
idt t1.fasta 3 2 5 2 2 4 2 2 269 401 341 687
. . . . . About 18,000 more lines like this . . .
49Test Data Raw Results
len811 W3 n1 m3 thrL1 thrH2, Tol4, Sites Found18
Intron 1 91 213 - 213
Intron 2 236 - 241
Intron 3 267 - 267
Intron 4 385 399 - 399 - 467
Intron 5 471 759 - 759 - 797
Intron 6 799 - 799
len811 W3 n1 m3 thrL2 thrH2, Tol4, Sites Found29
Intron 1 91 213 - 213
Intron 2 219 - 223
Intron 3 236 - 241
Intron 4 267 - 267
Intron 5 305 - 312 - 335
Intron 6 341 - 341
Intron 7 385 399 - 399
Intron 8 429 - 433
Intron 9 441 - 467
Intron 10 471 - 753
Intron 11 759 - 759 - 797
Intron 12 799 - 799
len811 W3 n1 m3 thrL1 thrH3, Tol4, Sites Found13
Intron 1 91 213 - 213 - 241
Intron 2 267 399 - 399 - 467
Intron 3 471 759 - 759 - 786
. . . About 18,000 sets of results like this. . .
50Test Data Filtered Results
len811 W6 n5 m9 thrL5 thrH6, Tol3, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m10 thrL5 thrH6, Tol3, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m10 thrL5 thrH6, Tol4, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m5 thrL5 thrH6, Tol6, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m6 thrL5 thrH6, Tol6, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m7 thrL5 thrH6, Tol6, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m8 thrL5 thrH6, Tol6, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m5 thrL5 thrH6, Tol7, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m6 thrL5 thrH6, Tol7, Sites Found11 ALL KNOWN SITES FOUND
len811 W6 n5 m7 thrL5 thrH6, Tol7, Sites Found11 ALL KNOWN SITES FOUND
This provides an initial set of likely-to-be-optimal
parameters
51Best Parameter Set on Known Gene
len811 W6 n5 m10 thrL5 thrH6, Tol4, Sites Found11
1 AAAAACCTGC TTAGGATTAA TTATGAGCGA ATTTTTTTTC TTTAAAACTT
51 CCAAAAATAT TTTTTTTTTT TTTTTTTTTT AATAATTTCG GTTTGCTCAT
101 AGATTTTTTA TTTATTTAAT TAATATTTTT AATTTTTTTT TTTTTAATCC
151 TAAAAATAGA TTTTATTTAT TTTATTTAAT TTTTAATTAT TAAAAGATAT
201 GAGATTTTTA AAgttcgggt tagaaattaa tttgggtaaa gGAACTCTTA
251 TTGAATTTGA TGAACAgtgt acttaaatat ttaattaatt tttttttttt
301 atttgtttta agaagaagaa aaagaaaaaa tatagaaata gTAAAAAACT
351 ATTTCCATAT ATTTGTTATA CTCTTACACA CAAGgttata aatttaaagt
401 gttataaata atttaaaaat tttattctgt aagAAAATTT GTTTTGAAAT
451 TATTTGATTA AAAATAGAAG gttttttttt ttattttttt tttttatttt
501 tatttttttt tattttttat aatttccgcg tttgaatttg ttgtgtaaat
551 taattttaat tttttttttt tttttttttt tttttttttt tttttttttt
601 ttcattttta acatcatttg attcattaat ttattttttt tttcaacatc
651 cccaacccaa aaaaaaaaaa taaaaaaaaa tgataagAAA TTTAACAAAA
701 TTAACAAAAT TTACAATTGA AAATAGATTT TACCAATCCT CATCAAAAGG
751 AAGATTCAGT GGTAAAAATG GAAACAATGC ATTCAGGGGA TCTCTAGAGT
801 CGACCGAAGG C
Intron 1 213 - 241 overpredicted (45 bases)
Intron 2 267 - 341 UNDERPREDICTED (37 BASES)
Intron 3 385 399 - 433 correct (325 bases)
Intron 4 471 - 687 CORRECT - (404 BASES)
52Analysis of Unknown Gene
- Started with 21 reads from Michel
Satre (genomic and ESTs)
- Used phred to assemble them
- 4 contigs found
- 4th contig was longest (1759 bases)
- Used parameters from previous analysis
- Results for contig4 compared . . . . . . .
53Contig4 Sampled Results(a closer look)
- W6 n5 m5 thrL5 thrH6, Tol6, Sites Found14
- Intron 1 54 - 401
- Intron 2 579 - 612
- Intron 3 711 782 -1113
- Intron 4 1185 1350 -1350 -1504
- Intron 5 1628 -1709
- W6 n5 m5 thrL5 thrH6, Tol7, Sites Found14
- Intron 1 54 - 401
- Intron 2 579 - 612
- Intron 3 711 782 -1113
- Intron 4 1185 1350 -1350 -1504
- Intron 5 1628 -1709
- len1759 W6 n5 m6 thrL5 thrH6, Tol6, Sites
Found14
- Intron 1 54 - 401
- Intron 2 579 - 612
- Intron 3 711 782 -1164
- Intron 4 1174 1350 -1350 -1504
- len1759 W6 n5 m7 thrL5 thrH6, Tol6, Sites
Found17
- Intron 1 54 - 401
- Intron 2 579 - 612
- Intron 3 650 782 -1043
- Intron 4 1087 -1164
- Intron 5 1174 1350 -1350 -1504
- Intron 6 1628 -1709
- Intron 7 1735
- len1759 W6 n5 m7 thrL5 thrH6, Tol7, Sites
Found19
- Intron 1 54 - 401
- Intron 2 579 - 612
- Intron 3 650 - 683
- Intron 4 711 782 -1043
- Intron 5 1087 1164 -1164
- Intron 6 1350 -1350 -1504
- Intron 7 1628 -1709
- Intron 8 1735
54Contig4 Results
- len1759 W6 n5 m10 thrL5 thrH6, Tol4, Sites Found19
- 1 TGATAATAAC AATAATAACA ATAATAATAA TAATAATAAT AATATTAATA
- 51 ATTgtaataa taataatatt aataatgata ataataataa taataataat
- 101 gataatgata ataataatat taatactgtt gataatcatg atgatgatat
- 151 tataaataat ancaataatt ttaataaaaa tgaatatcca tcaagtaata
- 201 tatcaccaat atctccaaaa tcttcaatat caagttttcc aacaaattta
- 251 aataattcaa taaataatac aggttcaatg gtttcagatt ctttaagttc
- 301 ttgtagaaat tcgatttcct ctagttcaat tgattcaagt gttgcttcaa
- 351 ttcctataac aatacaatca atagattttg aagataagaa tattaaatca
- 401 gACCAATTTA AAATAATATC AAAATCAAAT ATAGAAAATA CAATTGAAAC
- 451 AAACCCAATA CCTCCATTCA ATCAAACCAA TAACCAATGT GAAGTTCAGT
- 501 TACAATCACA TTCTTTACCA ACAATTTTAA AACAACCACA TATTTATAAA
- 551 TCAAAATCAT TTTCTAGTAG TATCAATAgt aatagtaaaa ttaaaaaaat
- 601 taaaaaatca agATCATTTG AAATTGAATC AAAAATTAAT TTATTTGATg
- 651 taattaatca tatatattta aacctttcaa aagTTGGTAG TGAAGAACAA
- 701 AAAATAACCA gtatgtatta aattaacaaa tgattaatat attgttgtaa
- 751 aaatatatga aactaattta atattttaaa ggtgttttta aattatatga
- 801 tatatatgat aagggtttta tttcaagaga tgatttaaaa gaagtattaa
- 851 attatagaac taaacaaaat gggttaaaat ttcaagactt tacaatggaa
- . . .
- 1001 aagttaaaga aaaaggaaga aaatccaaat tatattttta aagaagAAAA
- 1051 TATTGGAATA TATACCGGAA AAAGAAAGTT TTCATAgttt aaaaagatat
- 1101 ttaaaaattg aaggatcaaa attatttttt atatctttat tttttattat
- 1151 aaattcaatt ttagTAATAA CAAGTTTTTT AAATGTTCAT GCAAATAATA
- 1201 AAAGAGCAAT AGAATTATTT GGCCCGGGTG TATATATAAC AAGAATTGCA
- 1251 GCTCAATTAA TTGAATTCAA TGCAGCAATA ATTTTAATGA CAATGTGTAA
- 1301 ACAATTATTT ACAATGATTA GAAATACCAA ATTTAAATTT TTATTTCCAG
- 1351 TTGATAAATA TATGACATTT CATAAATTAA TTGGTTATAC ATTAATCATT
- 1401 GCTTCATTTT TACATACTAT TGGTTGGATT GTTGGTATGG CAGTTGCNCC
- 1451 CGGTAAACCT GATAATATTT TTTATGATTG TTTAGCACCT CATTTTAAAT
- 1501 TTAGGCCACC AGTTTGGGAA ATGATTTTTA ACCGTTTACC AGGTGTAACA
- 1551 GGTTTCACTT ACCAAAATCA TTTTTAATAA TTATGGCAAT TTTATCTTTA
- 1601 AAAATTATTN GGAAATCTAA TTTNGNAgta agtttttttt tttaaaaaaa
- 1651 aaaaaaaaaa aaaaattaat tattttttat tatataattt tatagttatt
- 1701 ttattatagC CATCATTTAT TTATNGGATT TTATgtttna ttaattttac
- 1751 atgggacaa
- Intron 1 54 - 401
55Further Intron Finding Options
- Exhaustive parsing of sequence
- 400 base sequence → 50 acceptor/donors
- 20 donor/acceptors → 5 minutes on P750/.5GB
- 24 donor/acceptors → 1 day
- 30 donor/acceptors → year
- Hybrid solution: rank top 20 d/a sites and parse
- Use protein/predicted gene homology to edit
results
56Gene PredictionRecognizing Initiation of Coding
57High-level Outline
[Figure: decision flow. Consensus Kozak match with 0 errors, 1 error, or ≥ 2 errors; ATG/UTR heuristic; stops upstream, E(stop); check ORF for frame shifts]
58Core Heuristic Components
226 Classes
- Kozak Existence and Fidelity
- ATG Heuristic
- Template: (sIFl, sl, sFl) 5len ATG 3len (sIFr,
sr, sFr)
- Ideal: (1, 3, 3) 125 ATG 300 (0, 6, 2)
- Stops left of candidate ATG
- CDS Stops in minimum frame
- UTR Heuristic
- In frame stops to All stops Ratio
- Frame shifts needed for perfect ORF
- Not Used
- Codon or Hexamer Frequencies.
- Known protein starting motifs.
59Verification and Testing
- Generation of sets of known CDS reads (12,826)
- known ATG reads (13,672)
- known UTR reads (1,035)
- Run Classifier against all three sets
- Identify classes with highest CDS to ATG
differential UTR vs. CDS/ATG
- Grade A: K0E.ATG.L.pSL.ORFr0F or 1FS
- K0E.ATG.L.npSL.ORFr0FS or 1FS
- K1E.ATG.L.pSL.ORF0FS or 1FS
- K1E.ATG.L.npSL.ORF0FS or 1FS
- KG1E.ATG.L.pSL.ORF0FS or 1FS
- KG1E.ATG.L.npSL.ORF0FS or 1FS
- Grade B Same as A, but with ATG in Middle 1/2
- Grade C zSL for K0E only and ATG in L, M, or R
- UTR Class
60Accuracy and Yield of Classes
- ATG True Positive (of 13,672)
- Grade A: 867 (6.3%)
- Grade B: 3,742 (27.3%)
- UTR: 82 (0.6%)
- Total: 34.3% (4,691)
- CDS False Positive (of 12,826)
- Grade A: 3 (0.02%)
- Grade B: 753 (5.5%)
- UTR: 1,725 (13.5%)
- Total: 19.3% (2,481)
- UTR True Positive (of 1,035)
- 691 (66.8%)
Yield: 34% → 67%. Confidence: 95% → 87%.
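The totals above follow from the per-class counts and the three set sizes. A small sketch that recomputes them (the helper name `percent` and the dictionary layout are hypothetical; all counts are taken from the slides):

```python
def percent(count, total):
    """Share of total, as a percentage rounded to one decimal place."""
    return round(100.0 * count / total, 1)

atg_total, cds_total, utr_total = 13672, 12826, 1035   # set sizes from the slides
atg_hits = {"Grade A": 867, "Grade B": 3742, "UTR": 82}
cds_hits = {"Grade A": 3, "Grade B": 753, "UTR": 1725}

print("ATG true positives:", sum(atg_hits.values()), percent(sum(atg_hits.values()), atg_total))
print("CDS false positives:", sum(cds_hits.values()), percent(sum(cds_hits.values()), cds_total))
print("UTR true positives:", percent(691, utr_total))
```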
61Fin