Title: The Effect of Inverse Document Frequency Weights on Retrieval of Genomic Sequences: Towards a vector space approach
1The Effect of Inverse Document Frequency Weights
on Retrieval of Genomic Sequences: Towards a
vector space approach
- Kevin C. O'Kane
- Department of Computer Science
- The University of Northern Iowa
- Cedar Falls, Iowa 50613
2The area of natural language text indexing and
retrieval has been studied since the mid-50's.
In text retrieval, the problem is to locate
documents related to a natural language query.
To this purpose, natural language text indexing
programs have employed many techniques to
identify terms in a document most likely to be
content descriptors as opposed to terms that are
poor content descriptors. By eliminating poor
descriptors and pre-indexing documents by
descriptors more likely to be good
discriminators, the speed of selection and
precision of document relevance ranking can be
improved. The vector space model, developed
by G. Salton, views the problem as an
n-dimensional hyperspace of documents and
queries.
3Overview
- In text retrieval, the problem is to locate
documents related to a natural language query.
- Natural language text indexing programs identify
terms in a document most likely to be content
descriptors.
- The goal of these experiments is to apply text
indexing techniques to genomic data bases.
4Natural Language Indexing
- Natural language text indexing and retrieval has
been studied since the mid-50's. In text
retrieval, the problem is to locate documents
related to a natural language query.
- Natural language text indexing programs employ
techniques to identify terms in a document most
likely to be content descriptors.
- By eliminating poor descriptors and pre-indexing
documents by descriptors likely to be good
discriminators, the speed of selection and
precision of document relevance ranking can be
improved.
- The vector space model, developed by G. Salton,
views the problem as an n-dimensional hyperspace
of documents and queries.
5Document Hyperspace
6Hyperspace Queries
7Clustering Objects by Feature
8Cosine Similarity Coefficient
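In the vector space model, the cosine coefficient is usually computed as the dot product of two term-weight vectors divided by the product of their lengths. The following is a minimal sketch of that standard definition, not code from this work; the function name and the convention of returning 0.0 for an all-zero vector are assumptions made for illustration.

```python
import math


def cosine_similarity(doc, query):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(doc) != len(query):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(d * q for d, q in zip(doc, query))
    norm_doc = math.sqrt(sum(d * d for d in doc))
    norm_query = math.sqrt(sum(q * q for q in query))
    if norm_doc == 0.0 or norm_query == 0.0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_doc * norm_query)
```

Vectors pointing in the same direction score close to 1 regardless of their magnitude, and vectors sharing no terms score 0.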
9Genomic Data Bases
- EMBL (http://www.embl.org)
- SWISS-PROT (http://www.expasy.org/sprot/sprot-top.html)
- PROSITE (http://www.expasy.org/prosite/)
- PIR (http://pir.georgetown.edu/home.shtml)
- NCBI/NLM GenBank (http://www.ncbi.nih.gov/)
- MGD: The Mouse Genome Database (http://www.informatics.jax.org/)
- OMIM - Online Mendelian Inheritance in Man
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)
10nt Sequence Data Base
- NCBI nt data base: 12 billion bytes in length,
comprising 2,584,440 sequences in FASTA format
(Sept 2004).
- Example sequence:
>gi|2695852|emb|Y13263.1|ABY13263 Acipenser
baeri mRNA for immunoglobulin heavy chain, clone
CAAGAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTAT
AATAATGACAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGT
CCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAACTGTCCTGTAAA
GCCTCTGGATTCACATTCAGCAGCAACAACATGGGCTGGGTTCGACAAGC
TCCTGGAAAGGGTCTGGAATGGGTGTCTACTATAAGCTATAGTGTAAATG
CATACTATGCCCAGTCTGTCCAGGGAAGATTCACCATCTCCAGAGACGAT
TCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACTC
TGCCGTGTATTACTGTGCTCGAGAGTCTAACTTCAACCGCTTTGACTACT
GGGGATCCGGGACTATGGTGACCGTAACAAATGCTACGCCATCACCACCG
ACAGTGTTTCCGCTTATGCAGGCATGTTGTTCGGTCGATGTCACGGGTCC
TAGCGCTACGGGCTGCTTAGCAACCGAATTC
11GenBank
LOCUS       AAB2MCG1      289 bp    DNA    linear    PRI 23-AUG-2002
DEFINITION  Aotus azarai beta-2-microglobulin precursor exon 1.
ACCESSION   AF032092
VERSION     AF032092.1  GI:3265027
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      Aotus azarai (Azara's night monkey)
  ORGANISM  Aotus azarai
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Platyrrhini; Cebidae; Aotinae; Aotus.
REFERENCE   1 (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Muniz,J.A., Seuanez,H.N., Parham,P. and
            Cavanez,C.
  TITLE     beta2-Microglobulin in neotropical primates (Platyrrhini)
  JOURNAL   Immunogenetics 48 (2), 133-140 (1998)
  MEDLINE   98298008
  PUBMED    9634477
REFERENCE   2 (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Seuanez,H.N. and Parham,P.
  JOURNAL   Submitted (31-OCT-1997) Structural Biology, Stanford University,
            Fairchild Building Campus West Dr. Room D-100, Stanford, CA
            94305-5126, USA
FEATURES             Location/Qualifiers
     source          1..289
                     /organism="Aotus azarai"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:30591"
     sig_peptide     134..193
     exon            <134..200
                     /number=1
     intron          201..>289
                     /number=1
ORIGIN
        1 gtccccgcgg gccttgtcct gattggctgt ccctgcgggc cttgtcctga ttggctgtgc
       61 ccgactccgt ataacataaa tagaggcgtc gagtcgcgcg ggcattactg cagcggacta
      121 cacttgggtc gagatggctc gcttcgtggt ggtggccctg ctcgtgctac tctctctgtc
      181 tggcctggag gctatccagc gtaagtctct cctcccgtcc ggcgctggtc cttcccctcc
12Sequence Matching
- Current access to sequence databases is mainly by
heuristic-assisted pattern matching on flat or
nearly flat files using programs such as BLAST
and FASTA.
- The underlying data bases are growing rapidly, with
consequent deterioration of search times even on
large, multiprocessor systems as current software
tools reach design limits.
- BLAST systems index data base sequences according
to short code letter words (usually 3 letters
for amino acids and 11 for nucleotide data
bases) and scoring matrices.
- Queries are also decomposed into similar short code
words. The data base is scanned, and sequences with
words in common with the query are processed to
extend the initial code word match.
13Example BLAST Output
                                                               Score    E
Sequences producing significant alignments:                    (bits)  Value
emb|BX015832.1|CNS08KDO Single read from an extremity of a ...   918    0.0
emb|BX032891.1|CNS08XJJ Single read from an extremity of a ...   902    0.0
emb|BX065445.1|CNS09MNT Single read from an extremity of a ...   894    0.0
emb|BX052703.1|CNS09CTV Single read from an extremity of a ...   894    0.0
emb|BX030708.1|CNS08VUW Single read from an extremity of a ...   894    0.0
emb|BX030663.1|CNS08VTN Single read from an extremity of a ...   894    0.0
.............................................................................
>emb|BX015832.1|CNS08KDO Single read from an
extremity of a full-length cDNA clone made from
Anopheles gambiae total adult females. 3-PRIME
end of clone FK0AAA23DA12
Length = 866
Score = 918 bits (463), Expect = 0.0
Identities = 535/559 (95%)
Strand = Plus / Plus
14Developing A Vector Space Approach to Sequence
Indexing
- This work attempts to explore natural language
indexing techniques applied to genomic data bases
through:
- Weight-based indexing of k-tuples derived from
the NCBI nt sequence data base.
- Text terms used in genomic sequence data banks
and literature.
- Both applications are implemented for Linux and
written in Mumps and MDH, a Mumps-related C
toolkit capable of indexing data sets of up to
256 terabytes using a B-tree based
multidimensional data model that includes many
retrieval and sequence matching functions.
15Inverse Document Frequency Wgt.
- The IDF weight yields higher values for words
whose distribution is more concentrated and lower
values for words whose use is more widespread.
- Thus, words of broad context are weighted lower
than words of narrow context.
- Words of low weight are hypothesized to be poor
indexing terms while words with high weights are
hypothesized to be good indexing terms.
- The bulk of the words, as is the case in natural
language text, reside in the middle range.
16Natural Language Example
Word        Freq(i,j)  TotFreq  DocFreq    Wgt1    Wgt2  Wgt3       MCA

1 Death of a cult. (Apple Computer needs to alter its strategy) (column)
apple               4      261      112   1.716   9.757    17   -1.1625
computer            4      706      358   2.028   5.109    10  -19.4405
mac                 2      146       71   0.973   6.290     6   -0.0256
macintosh           4      210      107   2.038   9.940    20   -0.5855
strategy            2       79       67   1.696   6.406    11   -0.0592

3 WordPerfect. (WordPerfect for the Macintosh 2.0) (evaluation) Taub, Eric.
edit                2      111       77   1.387   6.128     8   -0.0961
frame               2        9        7   1.556  10.924    17    0.0131
import              2       29       19   1.310   8.927    12    0.0998
macintosh           3      210      107   1.529   7.705    12   -0.5855
macro               3       38       24   1.895  12.189    23    0.1075
outstand            1       10        9   0.900   5.711     5    0.0168
user                4      861      435   2.021   4.330     9  -26.8094
17Indexing Experiment
- Sequences from the NCBI "nt" (non-redundant
nucleotide) data base were used.
- The nt data base is approximately 12 billion
bytes in length, comprising 2,584,440 sequences in
FASTA format (Sept 2004).
- A word size of 11 was used throughout. A total
of 4,194,299 words were identified, slightly fewer
than the theoretical maximum of 4,194,304.
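The theoretical maximum follows from four nucleotide letters in eleven positions: 4^11 = 4,194,304. A minimal sketch of how 11-letter words might be enumerated and mapped to integers in that range is given below. The 2-bit code per base, the use of overlapping words, the skipping of words containing letters other than A, C, G and T, and the function names are all assumptions for illustration rather than details taken from the experiment.

```python
WORD_SIZE = 11
# Assumed 2-bit code per nucleotide; any fixed assignment would do.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
MAX_WORDS = 4 ** WORD_SIZE  # number of distinct 11-letter words


def encode_word(word):
    """Map an 11-letter nucleotide word to an integer in [0, 4**11)."""
    if len(word) != WORD_SIZE:
        raise ValueError("word must have exactly %d letters" % WORD_SIZE)
    value = 0
    for base in word.upper():
        value = value * 4 + BASE_CODE[base]  # KeyError on non-ACGT letters
    return value


def words_of(sequence):
    """Yield each overlapping 11-letter word made only of A, C, G, T."""
    seq = sequence.upper()
    for start in range(len(seq) - WORD_SIZE + 1):
        word = seq[start:start + WORD_SIZE]
        if all(base in BASE_CODE for base in word):
            yield word
```

An integer of this kind can serve directly as an array index, which is one way a word could be used to address a table of per-word entries.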
18Calculating the IDF Weight
- The overall frequencies of occurrence of all
possible 11 character words from each sequence
were determined along with the number of
sequences in which each unique word was found.
- A weight Wgt_i for each word i was calculated by
taking the Log10 of the total number of
sequences (N) divided by the number of sequences
in which the word occurred (DocFreq_i), multiplying
by 10, and truncating the result to an integer.
- Wgt_i = (int) ( 10 * Log10 ( N / DocFreq_i ) )
- In natural language indexing, this is referred to
as the inverse document frequency (IDF) weight.
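The counting and weighting steps can be sketched as follows. This is an illustrative in-memory version under stated assumptions: the function names are hypothetical, words are taken as overlapping substrings, and everything fits in memory, which would not hold for the full nt data base.

```python
import math
from collections import Counter


def idf_weight(n_sequences, doc_freq):
    """Wgt = (int)(10 * log10(N / DocFreq)), truncated toward zero."""
    return int(10 * math.log10(n_sequences / doc_freq))


def word_statistics(sequences, k=11):
    """Return (total_freq, doc_freq) counters over all k-letter words.

    total_freq counts every occurrence of a word; doc_freq counts the
    number of sequences in which the word appears at least once.
    """
    total_freq = Counter()
    doc_freq = Counter()
    for seq in sequences:
        words = [seq[i:i + k] for i in range(len(seq) - k + 1)]
        total_freq.update(words)
        doc_freq.update(set(words))  # each sequence counts once per word
    return total_freq, doc_freq
```

A word found in every sequence receives weight 0, and the weight grows as the word becomes rarer.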
19File Sizes
- Initial file analysis produces about 110
intermediate files of about 440 million bytes
each from the input data base (12 GB).
- out.table is a large (40 billion byte)
word-sequence file.
- freq.bin contains the inverse document frequency
weight for each word (53 million bytes).
- index (76 million bytes) gives for each word the
eight byte offset of the word's entry in
out.table.
- index and freq.bin are merged into ITABLE (112
million bytes) which contains for each word its
weight, offset, and a pointer to a list of
aliases (not used with the nt data base).
20Data Base
W = ( w1, w2, w3, ... wM )          vector of M weights

F = ( f1,1  f1,2  f1,3  ...  f1,N )
    ( f2,1  f2,2  f2,3  ...  f2,N )
    ( f3,1  f3,2  f3,3  ...  f3,N )
    ( ...                         )  word-sequence matrix
    ( fM,1  fM,2  fM,3  ...  fM,N )
21Number of Words at each Weight
for i = 1 to 120
    z_i ← 0
    for j = 1 to M
        if w_j = i then z_i ← z_i + 1
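The same tally can be written in a single pass over the weight vector instead of one pass per weight. A minimal sketch, assuming the weights are held in a Python list; the function name and the `max_weight` parameter are illustrative.

```python
from collections import Counter


def words_per_weight(weights, max_weight=120):
    """Return z where z[i - 1] is the number of words with weight i."""
    counts = Counter(weights)  # one pass over all M weights
    return [counts.get(i, 0) for i in range(1, max_weight + 1)]
```

Weights outside the range 1 to `max_weight` are ignored, matching the loop bounds of the pseudocode.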
22Number of Words at Each IDF Wgt.
23Sum of all Instances of Each Weight
for i = 1 to 120                  // for each weight
    x_i ← 0
    for j = 1 to M                // for each word
        for k = 1 to N            // for each sequence
            if f_j,k = i then x_i ← x_i + 1
24Number of Occurrences at Each IDF Level
25Sequence Retrieval
- For retrieval, a query sequence is read and
decomposed into 11 character words. These words
are reduced to a numeric equivalent which is used
as an index into the word-sequence table.
Entries in a master vector corresponding to
sequences are incremented by the weight of the
word if the word occurs in the sequence and if the
weight of the word lies within a specified range.
When all words have been processed, entries in
the master sequence vector are normalized
according to the length of the underlying
sequence in respect to the length of the query.
Finally, the master sequence vector is sorted and
the top scoring entries printed or submitted to a
Smith-Waterman alignment, sorted and then
printed. Optionally, the Smith-Waterman
alignments themselves can be printed and the
selected sequences can be extracted from the nt
data base and stored in a separate output file
for additional processing. FASTA post-processing
is an option.
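The index scoring stage described above can be sketched as follows. This is an illustrative in-memory version, not the Mumps/MDH implementation: the dictionary-based index, the function names, and in particular the length normalization (dividing a score by the ratio of sequence length to query length when the sequence is longer) are assumptions, since the exact normalization is not spelled out here. The Smith-Waterman and FASTA stages are omitted.

```python
import math
from collections import defaultdict


def build_index(sequences, k=11):
    """Return (postings, weights) for a dict of sequence id -> sequence."""
    postings = defaultdict(set)  # word -> ids of sequences containing it
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            postings[seq[i:i + k]].add(seq_id)
    n = len(sequences)
    weights = {word: int(10 * math.log10(n / len(ids)))
               for word, ids in postings.items()}
    return postings, weights


def retrieve(query, sequences, postings, weights, low, high, k=11, top=10):
    """Rank sequences by summed weights of words shared with the query."""
    master = defaultdict(float)  # the "master vector" of sequence scores
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        weight = weights.get(word)
        if weight is None or not (low <= weight <= high):
            continue  # unknown word, or weight outside the selected range
        for seq_id in postings[word]:
            master[seq_id] += weight
    # Assumed normalization: damp scores of sequences longer than the query.
    scored = {}
    for seq_id, score in master.items():
        ratio = len(sequences[seq_id]) / len(query)
        scored[seq_id] = score / max(ratio, 1.0)
    ranked = sorted(scored.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]
```

Raising `low` discards the widespread, low-weight words, so fewer posting lists are visited per query.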
26Unweighted Result for 500 Random Queries
27Result for 500 Random Queries Weight Range 65-120
28Overall Results for 500 Random Queries
29Index Scoring Results
Query >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate71
Query string has 289 letters
Searching ...
68224 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
31420 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
30508 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
30296 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
29800 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
29444 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
29240 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
29196 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
29120 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
28896 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
28116 >gi|2731651|gb|U81612.1|HCU81612 Hepatitis C virus polyprotein gene, partial cds
28116 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
27700 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
.............................................................................
30Smith-Waterman Result Scoring
Query >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate71
Query string has 289 letters
top >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate71

166 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAATCACCCAGATGTACACCAATGTAGACCAGGACCT 245
  1 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAGTCACCCAGATGTACACCAATGTAGACCAGGTCCT 80

246 CGTCGGCTGGCCGGCGCCCCCCGGAGCGCGTTCCTTGACACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 325
 81 CGTCGGCTGGCCGGCGCCGCCCGGAGCGCGTTCCTTGAGACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 160

326 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 405
161 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 240

406 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 454
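For reference, the score of a Smith-Waterman local alignment can be computed with the dynamic programming recurrence below. This is a generic sketch of the algorithm with a linear gap penalty. The match, mismatch and gap values are assumed for illustration and are not the parameters behind the scores reported in this presentation.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty)."""
    previous = [0] * (len(b) + 1)  # scores for the previous row of the table
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                      # start a new local alignment
                previous[j - 1] + step, # align a[i-1] with b[j-1]
                previous[j] + gap,      # gap in b
                current[j - 1] + gap,   # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best
```

Only two rows of the table are kept because the score alone is needed; recovering the alignment itself would require the full table or a traceback structure.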
31S-W Scores
566 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
505 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
504 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
503 >gi|19911914|dbj|AB072085.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
503 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
502 >gi|29467247|dbj|AB089520.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
502 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
501 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
499 >gi|29467670|dbj|AB100816.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
498 >gi|3157753|dbj|AB013627.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
498 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
497 >gi|29467311|dbj|AB089552.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
497 >gi|19911934|dbj|AB072095.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
497 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
496 >gi|19911900|dbj|AB072078.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
495 >gi|14150638|gb|AF369257.1| Hepatitis C virus Pt.3O NS3 protease gene, partial cds
495 >gi|14150616|gb|AF369246.1| Hepatitis C virus Pt.1A NS3 protease gene, partial cds
495 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
494 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
32Larger Sequences
- On larger query sequences (5,000 to 6,000
letters), the IDF method performed slightly
better than BLAST. On 25 sequences randomly
generated, the IDF method correctly ranked the
original sequence first 24 times and once at rank
3. BLAST, on the other hand, ranked the
original sequence first 21 times while the
remaining 4 were ranked 2, 2, 3 and 4. Average
time per query for the IDF method was 47.4
seconds and the average time for BLAST was 122.8
seconds.
33The Next Step
- Future work:
- Weighted Term Vectors.
- Other weighting schemes such as the Modified
Centroid Algorithm.
- Sequence-Sequence and Term-Term Correlations.
- Sequence clustering.
34References
- Altschul SF, Gish W, Miller W, Myers EW, Lipman
DJ. (1990) Basic local alignment search tool. J.
Mol. Biol. 215:403-10.
- O'Kane, K.C. and Lockner, M. J. (2004) Indexing
genomic sequence libraries, Information
Processing and Management, 41:265-274.
- O'Kane, K.C. (2004) The Effect of Inverse
Document Frequency Weights on Indexed Sequence
Retrieval, submitted.
- Pearson, W. R. (2000) Flexible sequence
similarity searching with the FASTA3 program
package. Methods Mol. Biol. 132:185-219.
- Salton, G. (1983), Introduction to Modern
Information Retrieval, McGraw-Hill (New York
1983).
- Smith, T.F. and Waterman, M.S. (1981)
Identification of common molecular subsequences.
J. Mol. Biol. 147:195-197.
36Hierarchical Data Base
37Bioinformatics
- Sloan Report on Bioinformatics from June 2004.
- Number of graduates
- There were only 26 new PhD's produced...
- 102 masters degrees awarded...
- Only 17 Bachelor's degrees produced...
- The data is for January 2002 until March
2003.
- "... in the next few years the number of
graduates is expected to increase by two or three
times."
- Average program enrollment
- 103 Bachelors
- 435 Masters
- 296 PhD
38B.S. in Bioinformatics at UNI
- Mathematics: 800060 800061 800064 800152
800164 (17 hours)
- Computer Science: 810061 810062 810065
810066 810080 810114 810115 810180 (24 hours)
- Biology: 840051 840052 840130 840140
840153 (19 hours)
- Chemistry: 860070 or both 860044 and 860048;
860063 (9-12 hours)
- Elective: One course from the following
(3 hours)
- Computer Science: 810143 810147 810153
810155 810161 810172 810181
- Total: 73-75 hours
39Courses
- 800060. Calculus I. The derivatives and
integrals of elementary functions and their
applications.
- 800061. Calculus II. Continuation of 800060.
- 800064. Elementary Probability and Statistics
for Bioinformatics. Descriptive statistics, basic
probability concepts, confidence intervals,
hypothesis testing, correlation and regression,
elementary concepts of survival analysis.
- 800152. Introduction to Probability. Axioms of
probability, sample spaces having equally likely
outcomes, conditional probability and
independence, random variables, expectation,
moment generating functions, jointly distributed
random variables, weak law of large numbers,
central limit theorem.
40Courses
- 800164. Statistical Methods in Bioinformatics.
Analysis of a DNA sequence, analysis of multiple
DNA and protein sequences, BLAST.
- 810061. Computer Science I. Introduction to
computer programming in the context of a modern
object-oriented programming language. Emphasis on
good programming techniques, object-oriented
design, and style through extensive practice in
designing, coding, and debugging programs.
- 810062. Computer Science II. Intermediate
programming in an object-oriented environment.
Topics include object-oriented design,
implementation of classes and methods, dynamic
polymorphism, frameworks, patterns, software
reuse, limitations, exceptions, and threads.
41Courses
- 810065. Computing for Bioinformatics I.
Intermediate programming with emphasis on
bioinformatics. Includes file handling, memory
management, multi-threading, B-trees,
introduction to dynamic programming including
Needleman-Wunsch and Smith-Waterman algorithms
for optimal alignments, exploration of BLAST,
FASTA and gapped alignment, substitution
matrices.
- 810066. Computing for Bioinformatics II.
Advanced bioinformatics computing; Perl and CGI
programming; data base facilities for
bioinformatics; pattern matching with regular
expressions; advanced dynamic programming;
optimal versus local alignment, multiple
alignments; data base mining tools, Entrez, SRS,
BLAST, FASTA, CLUSTAL; graphical 3-D
representation of proteins; phylogenetic trees.
42Courses
- 810080. Discrete Structures. Topics include
propositional and first-order logic; proofs and
inference; mathematical induction; sets,
relations, and functions; and graphs, lattices,
and Boolean algebra, all in the context of
computer science.
- 810114. Database Systems. Storage of, and access
to, physical databases; data models, query
languages, transaction processing, and recovery
techniques; object-oriented and distributed
database systems; and database design.
- 810115. Information Storage and Retrieval.
Natural language processing; analysis of textual
material by statistical, syntactic, and logical
methods; retrieval systems models, dictionary
construction, query processing, file structures,
content analysis; automatic retrieval systems and
question-answering systems; and evaluation of
retrieval effectiveness.
43Courses
- 810180. Undergraduate Research in Computer
Science.
- 840051. General Biology: Organismal Diversity.
Study of organismic biology emphasizing
evolutionary patterns and diversity of organisms
and interdependency of structure and function in
living systems.
- 840052. General Biology: Cell Structure and
Function. Study of cells, genetics, and DNA
technology emphasizing the chemical basis of life
and flow of information.
- 840130. Molecular Biology of the Cell.
Introduction to the molecular, biochemical, and
cellular structure and function of cells, DNA
structure and functions, and the translation of
genetic information into functional structures of
living cells. DNA replication, transcription of
genes, and synthesis and processing of proteins
will be emphasized.
44Courses
- 840140. Genetics. Analytical approach to
classical, molecular, and population genetics.
- 840153. Recombinant DNA Techniques. Study of
techniques for manipulating and analyzing DNA,
including genomic library construction,
polymerase chain reaction, oligonucleotide
synthesis, genomic analysis with computers, and
DNA and RNA isolation.
- 860070. General Chemistry I-II. Accelerated
course for well-prepared students. Content
similar to 860044 and 860048 but covered in one
semester. Completion satisfies General Chemistry
requirement of any chemistry major.
45Courses
- 860063. Applied Organic and Biochemistry. Basic
concepts in organic chemistry and biochemistry,
including nomenclature, functional groups,
reactivity, and macromolecules.
- Elective from:
- 810143(g). Operating Systems. History and
evolution of operating systems; process and
processor management; primary and auxiliary
storage management; performance evaluation,
security, and distributed systems issues; and
case studies of modern operating systems.
46Courses
- 810147. Networking. Network architectures and
communication protocol standards. Topics include
communication of digital data, data-link
protocols, local-area networks, network-layer
protocols, transport-layer protocols,
applications, network security, and management.
- 810153. Design and Analysis of Algorithms.
Algorithm design techniques such as dynamic
programming and greedy algorithms; complexity
analysis of algorithms; efficient algorithms for
classical problems; intractable problems and
techniques for addressing them; and algorithms
for parallel machines.
- 810155. Translation of Programming Languages.
Introduction to analysis of programming languages
and construction of translators.
47Courses
- 810161. Artificial Intelligence. Models of
intelligent behavior and problem solving;
knowledge representation and search methods;
learning; topics such as knowledge-based systems,
language understanding, and vision; optional
1-hour lab in symbolic programming techniques;
heuristic programming; symbolic representations
and algorithms; and applications to search,
parsing, and high-level problem-solving tasks.
- 810172. Software Engineering. Study of software
life cycle models and their phases--planning,
requirements, specifications, design,
implementation, testing, and maintenance.
Emphasis on tools, documentation, and
applications.
- 810181. Theory of Computation. Topics include
regular languages and grammars; finite state
automata; context-free languages and grammars;
language recognition and parsing; and Turing
computability and undecidability.