The Effect of Inverse Document Frequency Weights on Retrieval of Genomic Sequences: Towards a vector space approach


1
The Effect of Inverse Document Frequency Weights
on Retrieval of Genomic Sequences: Towards a
vector space approach

Kevin C. O'Kane
Department of Computer Science
The University of Northern Iowa
Cedar Falls, Iowa 50613

2
The area of natural language text indexing and
retrieval has been studied since the mid-50's.
In text retrieval, the problem is to locate
documents related to a natural language query.
To this purpose, natural language text indexing
programs have employed many techniques to
identify terms in a document most likely to be
content descriptors as opposed to terms that are
poor content descriptors. By eliminating poor
descriptors and pre-indexing documents by
descriptors more likely to be good
discriminators, the speed of selection and
precision of document relevance ranking can be
improved. The vector space model, developed
by G. Salton, views the problem as an
n-dimensional hyperspace of documents and
queries.
3
Overview

In text retrieval, the problem is to locate
documents related to a natural language query.
Natural language text indexing programs identify
terms in a document most likely to be content
descriptors.
The goal of these experiments is to apply text
indexing techniques to genomic data bases.

4
Natural Language Indexing

Natural language text indexing and retrieval has
been studied since the mid-50's. In text
retrieval, the problem is to locate documents
related to a natural language query.
Natural language text indexing programs employ
techniques to identify terms in a document most
likely to be content descriptors.
By eliminating poor descriptors and pre-indexing
documents by descriptors likely to be good
discriminators, the speed of selection and
precision of document relevance ranking can be
improved.
The vector space model, developed by G. Salton,
views the problem as an n-dimensional hyperspace
of documents and queries.
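
One common way to compare a query with a document in such a space is the cosine of the angle between their term-weight vectors. The following is a minimal sketch under that assumption; the function name `cosine_similarity` and the toy three-term vectors are illustrative and do not come from the presentation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Hypothetical three-term space: one query and two documents
query = [1.0, 0.0, 2.0]
docs = {"doc1": [2.0, 0.0, 4.0], "doc2": [0.0, 3.0, 0.0]}

# Rank documents by decreasing similarity to the query
ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]), reverse=True)
```

Here `doc1` points in the same direction as the query and ranks first, while `doc2` shares no terms with the query and scores zero.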

5
Document Hyperspace
6
Hyperspace Queries
7
Clustering Objects by Feature
8
Cosine Similarity Coefficient
9
Genomic Data Bases

EMBL (http://www.embl.org)
SWISS-PROT (http://www.expasy.org/sprot/sprot-top.html)
PROSITE (http://www.expasy.org/prosite/)
PIR (http://pir.georgetown.edu/home.shtml)
NCBI/NLM GenBank (http://www.ncbi.nih.gov/)
MGD - The Mouse Genome Database (http://www.informatics.jax.org/)
OMIM - Online Mendelian Inheritance in Man
(http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)

10
nt Sequence Data Base

NCBI nt data base: 12 billion bytes in length
comprising 2,584,440 sequences in FASTA format
(Sept 2004).
Example sequence:
>gi|2695852|emb|Y13263.1|ABY13263 Acipenser baeri mRNA for immunoglobulin heavy chain, clone
CAAGAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTAT
AATAATGACAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGT
CCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAACTGTCCTGTAAA
GCCTCTGGATTCACATTCAGCAGCAACAACATGGGCTGGGTTCGACAAGC
TCCTGGAAAGGGTCTGGAATGGGTGTCTACTATAAGCTATAGTGTAAATG
CATACTATGCCCAGTCTGTCCAGGGAAGATTCACCATCTCCAGAGACGAT
TCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACTC
TGCCGTGTATTACTGTGCTCGAGAGTCTAACTTCAACCGCTTTGACTACT
GGGGATCCGGGACTATGGTGACCGTAACAAATGCTACGCCATCACCACCG
ACAGTGTTTCCGCTTATGCAGGCATGTTGTTCGGTCGATGTCACGGGTCC
TAGCGCTACGGGCTGCTTAGCAACCGAATTC

11
GenBank

LOCUS       AAB2MCG1     289 bp    DNA     linear   PRI 23-AUG-2002
DEFINITION  Aotus azarai beta-2-microglobulin precursor exon 1.
ACCESSION   AF032092
VERSION     AF032092.1  GI:3265027
KEYWORDS    .
SEGMENT     1 of 2
SOURCE      Aotus azarai (Azara's night monkey)
  ORGANISM  Aotus azarai
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Platyrrhini; Cebidae; Aotinae; Aotus.
REFERENCE   1  (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Muniz,J.A., Seuanez,H.N., Parham,P. and
            Cavanez,C.
  TITLE     beta2-Microglobulin in neotropical primates (Platyrrhini)
  JOURNAL   Immunogenetics 48 (2), 133-140 (1998)
  MEDLINE   98298008
   PUBMED   9634477
REFERENCE   2  (bases 1 to 289)
  AUTHORS   Canavez,F.C., Ladasky,J.J., Seuanez,H.N. and Parham,P.
  JOURNAL   Submitted (31-OCT-1997) Structural Biology, Stanford University,
            Fairchild Building Campus West Dr. Room D-100, Stanford, CA
            94305-5126, USA
FEATURES             Location/Qualifiers
     source          1..289
                     /organism="Aotus azarai"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:30591"
     sig_peptide     134..193
     exon            <134..200
                     /number=1
     intron          201..>289
                     /number=1
ORIGIN
        1 gtccccgcgg gccttgtcct gattggctgt ccctgcgggc cttgtcctga ttggctgtgc
       61 ccgactccgt ataacataaa tagaggcgtc gagtcgcgcg ggcattactg cagcggacta
      121 cacttgggtc gagatggctc gcttcgtggt ggtggccctg ctcgtgctac tctctctgtc
      181 tggcctggag gctatccagc gtaagtctct cctcccgtcc ggcgctggtc cttcccctcc

12
Sequence Matching

Current access to sequence data bases is mainly by
heuristic-assisted pattern matching on flat or
nearly flat files using programs such as BLAST
and FASTA.
The underlying data bases are growing rapidly, with
consequent deterioration of search times even on
large multiprocessor systems as current software
tools reach their design limits.
BLAST systems index data base sequences according
to short code letter words (usually 3 letters
for amino acids and 11 for nucleotide data
bases) and scoring matrices.
Queries are also decomposed into similar short code
words. The data base is scanned, and sequences with
words in common with the query are processed to
extend the initial code word match.
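
The word-decomposition step can be illustrated with a short sketch. This is not BLAST's implementation: it only shows how a query and a data base sequence might be cut into overlapping fixed-length words and how the words they share (the candidate starting points for extension) could be found. The function names and the two toy sequences are assumptions made for illustration, and the extension step itself is omitted.

```python
def words(sequence, k=11):
    """Return all overlapping k-letter words in a sequence, left to right."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

def shared_words(query, subject, k=11):
    """Words of length k that occur in both the query and the subject."""
    return set(words(query, k)) & set(words(subject, k))

# Toy sequences: the first 13 letters of the query also occur in the subject
query = "CAAGAACCACAATACTGCAG"
subject = "TTTTCAAGAACCACAATTTT"

seeds = shared_words(query, subject)
```

A 20-letter sequence yields 10 overlapping 11-letter words; here three of the query's words also occur in the subject.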

13
Example BLAST Output

                                                                Score    E
Sequences producing significant alignments                      (bits) Value
emb|BX015832.1|CNS08KDO Single read from an extremity of a ...    918   0.0
emb|BX032891.1|CNS08XJJ Single read from an extremity of a ...    902   0.0
emb|BX065445.1|CNS09MNT Single read from an extremity of a ...    894   0.0
emb|BX052703.1|CNS09CTV Single read from an extremity of a ...    894   0.0
emb|BX030708.1|CNS08VUW Single read from an extremity of a ...    894   0.0
emb|BX030663.1|CNS08VTN Single read from an extremity of a ...    894   0.0
.............................................................................
>emb|BX015832.1|CNS08KDO Single read from an extremity of a full-length cDNA clone made from
Anopheles gambiae total adult females. 3-PRIME end of clone FK0AAA23DA12
Length = 866
Score = 918 bits (463), Expect = 0.0
Identities = 535/559 (95%)
Strand = Plus / Plus

14
Developing A Vector Space Approach to Sequence
Indexing

This work explores natural language
indexing techniques applied to genomic data bases
through:
Weight-based indexing of k-tuples derived from the
NCBI nt sequence data base.
Text terms used in genomic sequence data banks
and literature.
Both applications are implemented for Linux and
written in Mumps and MDH, a Mumps-related C
toolkit that can index data sets of up to
256 terabytes using a B-tree based
multidimensional data model and that includes many
retrieval and sequence matching functions.

15
Inverse Document Frequency Wgt.

The IDF weight yields higher values for words
whose distribution is more concentrated and lower
values for words whose use is more widespread.
Thus, words of broad context are weighted lower
than words of narrow context.
Words of low weight are hypothesized to be poor
indexing terms while words with high weights are
hypothesized to be good indexing terms.
The bulk of the words, as is the case in natural
language text, reside in the middle range.

16
Natural Language Example

Word        Freq(i,j)  TotFreq  DocFreq    Wgt1    Wgt2  Wgt3       MCA

1 Death of a cult. (Apple Computer needs to alter its strategy) (column)
apple               4      261      112   1.716   9.757    17   -1.1625
computer            4      706      358   2.028   5.109    10  -19.4405
mac                 2      146       71   0.973   6.290     6   -0.0256
macintosh           4      210      107   2.038   9.940    20   -0.5855
strategy            2       79       67   1.696   6.406    11   -0.0592

3 WordPerfect. (WordPerfect for the Macintosh 2.0) (evaluation) Taub, Eric.
edit                2      111       77   1.387   6.128     8   -0.0961
frame               2        9        7   1.556  10.924    17    0.0131
import              2       29       19   1.310   8.927    12    0.0998
macintosh           3      210      107   1.529   7.705    12   -0.5855
macro               3       38       24   1.895  12.189    23    0.1075
outstand            1       10        9   0.900   5.711     5    0.0168
user                4      861      435   2.021   4.330     9  -26.8094

17
Indexing Experiment

Sequences from the NCBI "nt" (non-redundant
nucleotide) data base were used.
The nt data base is approximately 12 billion
bytes in length comprising 2,584,440 sequences in
FASTA format (Sept 2004).
A word size of 11 was used throughout. A total
of 4,194,299 words were identified, slightly less
than the theoretical maximum of 4,194,304.
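
The theoretical maximum follows from the four-letter nucleotide alphabet: there are 4^11 = 4,194,304 distinct 11-letter words, so five of the possible words did not occur in the data base. The sketch below shows one way such a word could be mapped to an integer in the range 0 to 4^11 - 1, assuming a two-bit code per letter. The presentation does not state the encoding actually used, so the `CODE` table and the `encode` function are hypothetical.

```python
# Assumed two-bit code for each nucleotide letter (not from the presentation)
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

def encode(word):
    """Map a word over A, C, G, T to an integer, treating it as a base-4 number.

    Returns None if the word contains any other letter (for example N).
    """
    value = 0
    for letter in word.upper():
        if letter not in CODE:
            return None
        value = value * 4 + CODE[letter]
    return value

# Number of distinct 11-letter words over a 4-letter alphabet
max_words = 4 ** 11
```

With this code the word of eleven A's maps to 0 and the word of eleven T's maps to `max_words - 1`, so every word can serve directly as a table index.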

18
Calculating the IDF Weight

The overall frequencies of occurrence of all
possible 11 character words from each sequence
were determined along with the number of
sequences in which each unique word was found.
A weight Wgt_i for each word i was calculated as
10 times the Log10 of the total number of
sequences (N) divided by the number of sequences
in which the word occurred (DocFreq_i), truncated
to an integer:
$\mathrm{Wgt}_i = \mathrm{int}\left(10 \log_{10}(N / \mathrm{DocFreq}_i)\right)$
In natural language indexing, this is referred to
as the inverse document frequency (IDF) weight.
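
A minimal sketch of this calculation, assuming the sequences fit in memory as Python strings (the real data base is far too large for that, and the author's implementation is in Mumps and MDH). The function name `idf_weights` and the toy sequences are illustrative; a word size of 3 is used only to keep the example small.

```python
import math
from collections import Counter

def idf_weights(sequences, k=11):
    """Return {word: int(10 * log10(N / DocFreq))} for every k-letter word."""
    n = len(sequences)                 # N: total number of sequences
    doc_freq = Counter()               # DocFreq: sequences containing each word
    for seq in sequences:
        # A set, so that each word is counted once per sequence
        doc_freq.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return {w: int(10 * math.log10(n / df)) for w, df in doc_freq.items()}

# Four toy sequences, word size 3
weights = idf_weights(["ACGTAC", "ACGTTT", "ACGGGG", "TTTTTT"], k=3)
```

In this toy set `ACG` occurs in three of four sequences and receives weight 1, while `GTA` occurs in only one and receives weight 6: the more widespread the word, the lower its weight.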

19
File Sizes

Initial file analysis produces about 110
intermediate files of about 440 million bytes
each from the input data base (12 GB).
out.table is a large (40 billion byte)
word-sequence file.
freq.bin contains the inverse document frequency
weight for each word (53 million bytes)
index (76 million bytes) gives for each word the
eight byte offset of the word's entry in
out.table.
index and freq.bin are merged into ITABLE (112
million bytes) which contains for each word its
weight, offset, and a pointer to a list of
aliases (not used with the nt data base).

20
Data Base

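$W = (w_1, w_2, w_3, \ldots, w_M)$ (vector of M weights)

$F = \begin{pmatrix}
f_{1,1} & f_{1,2} & f_{1,3} & \cdots & f_{1,N} \\
f_{2,1} & f_{2,2} & f_{2,3} & \cdots & f_{2,N} \\
f_{3,1} & f_{3,2} & f_{3,3} & \cdots & f_{3,N} \\
\vdots  &         &         &        & \vdots  \\
f_{M,1} & f_{M,2} & f_{M,3} & \cdots & f_{M,N}
\end{pmatrix}$ (word-sequence matrix)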

21
Number of Words at each Weight

for i = 1 to 120
    z_i ← 0
    for j = 1 to M
        if w_j = i then z_i ← z_i + 1
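
The same count can be sketched in Python in a single pass over the weight vector rather than one pass per weight. The function name `words_per_weight` and the sample weights are illustrative assumptions.

```python
from collections import Counter

def words_per_weight(w, max_weight=120):
    """z[i] = number of words whose weight equals i, for i = 1..max_weight."""
    counts = Counter(w)   # one pass over the weight vector
    # A Counter returns 0 for weights that never occur
    return {i: counts[i] for i in range(1, max_weight + 1)}

# Hypothetical weight vector for five words
z = words_per_weight([3, 3, 64, 120, 7])
```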

22
Number of Words at Each IDF Wgt.
23
Sum of all Instances of Each Weight

for i = 1 to 120                  // for each weight
    x_i ← 0
    for j = 1 to M                // for each word
        for k = 1 to N            // for each sequence
            if f_{j,k} = i then x_i ← x_i + 1
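
A sketch of the same tally in Python. The transcript does not define the matrix entries, so this assumes that entry f[j][k] holds the weight of word j when it occurs in sequence k and 0 otherwise. Under that assumption, one pass over the matrix gives the same totals as the triple loop. The function name and the small matrix are hypothetical.

```python
def occurrences_per_weight(f, max_weight=120):
    """x[i] = number of matrix entries equal to weight i (x[0] is unused)."""
    x = [0] * (max_weight + 1)
    for row in f:                  # for each word
        for entry in row:          # for each sequence
            # ASSUMPTION: a non-zero entry is the word's weight
            if 1 <= entry <= max_weight:
                x[entry] += 1
    return x

# Hypothetical 2-word by 3-sequence matrix
f = [[3, 0, 3],
     [0, 64, 0]]
x = occurrences_per_weight(f)
```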

24
Number of Occurrences at Each IDF Level
25
Sequence Retrieval

For retrieval, a query sequence is read and
decomposed into 11 character words. These words
are reduced to a numeric equivalent which is used
as an index into the word-sequence table.
Entries in a master vector corresponding to
sequences are incremented by the weight of the
word if the word occurs in the sequence and the
weight of the word lies within a specified range.
When all words have been processed, entries in
the master sequence vector are normalized
according to the length of the underlying
sequence relative to the length of the query.
Finally, the master sequence vector is sorted and
the top scoring entries printed or submitted to a
Smith-Waterman alignment, sorted and then
printed. Optionally, the Smith-Waterman
alignments themselves can be printed and the
selected sequences can be extracted from the nt
data base and stored in a separate output file
for additional processing. FASTA post-processing
is an option.
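
The scoring steps described above can be sketched as follows. This is an in-memory illustration, not the author's Mumps/MDH implementation: the inverted index stands in for the word-sequence table, words are used directly as keys instead of being reduced to numbers, and the normalization formula is an assumption, since the slide does not give one. All names, the toy sequences, and the uniform weights are hypothetical. The Smith-Waterman and FASTA post-processing steps are omitted.

```python
from collections import defaultdict

def build_index(sequences, k=11):
    """Inverted index: word -> set of ids of the sequences containing it."""
    index = defaultdict(set)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)
    return index

def retrieve(query, index, weights, lengths, k=11, lo=1, hi=120, top=10):
    """Rank sequences by the summed weights of the query words they contain."""
    scores = defaultdict(float)            # plays the role of the master vector
    for i in range(len(query) - k + 1):
        word = query[i:i + k]
        wgt = weights.get(word)
        if wgt is None or not lo <= wgt <= hi:
            continue                       # unknown word or weight out of range
        for seq_id in index.get(word, ()):
            scores[seq_id] += wgt
    for seq_id in scores:
        # ASSUMED normalization: scale by the ratio of the shorter length
        # to the longer length of query and sequence
        n, m = lengths[seq_id], len(query)
        scores[seq_id] *= min(n, m) / max(n, m)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top]

# Toy data base with word size 4 and a uniform, hypothetical weight of 10
sequences = {"s1": "ACGTACGTAA", "s2": "ACGTTTTTTT", "s3": "GGGGGGGGGG"}
index = build_index(sequences, k=4)
weights = {word: 10 for word in index}
lengths = {seq_id: len(seq) for seq_id, seq in sequences.items()}

hits = retrieve("ACGTACGT", index, weights, lengths, k=4)
```

In this toy run `s1` shares all five query words and ranks first, `s2` shares only `ACGT`, and `s3` shares none and therefore never enters the master vector. Narrowing `lo` and `hi` excludes words of low or high weight from the score, which is the effect the weight-range experiments measure.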

26
Unweighted Result for 500 Random Queries
27
Result for 500 Random Queries Weight Range 65-120
28
Overall Results for 500 Random Queries
29
Index Scoring Results

Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate71
Query string has 289 letters
Searching ...
68224 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
31420 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
30508 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
30296 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
29800 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
29444 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
29240 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
29196 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
29120 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
28896 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
28116 >gi|2731651|gb|U81612.1|HCU81612 Hepatitis C virus polyprotein gene, partial cds
28116 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
27700 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
.............................................................................

30
Smith-Waterman Result Scoring

Query: >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate71
Query string has 289 letters
top >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, partial cds, isolate71
166 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAATCACCCAGATGTACACCAATGTAGACCAGGACCT 245
  1 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAGTCACCCAGATGTACACCAATGTAGACCAGGTCCT 80
246 CGTCGGCTGGCCGGCGCCCCCCGGAGCGCGTTCCTTGACACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 325
 81 CGTCGGCTGGCCGGCGCCGCCCGGAGCGCGTTCCTTGAGACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 160
326 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 405
161 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 240
406 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 454

31
S-W Scores

566 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
505 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
504 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
503 >gi|19911914|dbj|AB072085.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
503 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
502 >gi|29467247|dbj|AB089520.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
502 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
501 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
499 >gi|29467670|dbj|AB100816.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
498 >gi|3157753|dbj|AB013627.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
498 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
497 >gi|29467311|dbj|AB089552.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
497 >gi|19911934|dbj|AB072095.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
497 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
496 >gi|19911900|dbj|AB072078.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
495 >gi|14150638|gb|AF369257.1| Hepatitis C virus Pt.3O NS3 protease gene, partial cds
495 >gi|14150616|gb|AF369246.1| Hepatitis C virus Pt.1A NS3 protease gene, partial cds
495 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
494 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds

32
Larger Sequences

On larger query sequences (5,000 to 6,000
letters), the IDF method performed slightly
better than BLAST. On 25 sequences randomly
generated, the IDF method correctly ranked the
original sequence first 24 times and once at rank
3. BLAST, on the other hand, ranked the
original sequence first 21 times while the
remaining 4 were ranked 2, 2, 3 and 4. Average
time per query for the IDF method was 47.4
seconds and the average time for BLAST was 122.8
seconds.

33
The Next Step

Future work
Weighted Term Vectors.
Other weighting schemes such as the Modified
Centroid Algorithm.
Sequence-Sequence and Term-Term Correlations.
Sequence clustering.

34
References

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and
Lipman, D.J. (1990) Basic local alignment search tool.
J. Mol. Biol. 215:403-10.
O'Kane, K.C. and Lockner, M.J. (2004) Indexing
genomic sequence libraries. Information
Processing and Management 41:265-274.
O'Kane, K.C. (2004) The Effect of Inverse
Document Frequency Weights on Indexed Sequence
Retrieval, submitted.
Pearson, W.R. (2000) Flexible sequence
similarity searching with the FASTA3 program
package. Methods Mol. Biol. 132:185-219.
Salton, G. (1983) Introduction to Modern
Information Retrieval. McGraw-Hill, New York.
Smith, T.F. and Waterman, M.S. (1981)
Identification of common molecular subsequences.
J. Mol. Biol. 147:195-197.

36
Hierarchical Data Base
37
Bioinformatics

Sloan Report on Bioinformatics from June 2004.
Number of graduates
There were only 26 new PhDs produced...
102 masters degrees awarded...
Only 17 Bachelor's degrees produced...
The data is for January 2002 until March
2003.
"... in the next few years the number of
graduates is expected to increase by two or three
times."
Average program enrollment
103 Bachelors
435 Masters
296 PhD

38
B.S. in Bioinformatics at UNI

Mathematics: 800060, 800061, 800064, 800152, 800164 (17 hours)
Computer Science: 810061, 810062, 810065, 810066, 810080, 810114,
810115, 810180 (24 hours)
Biology: 840051, 840052, 840130, 840140, 840153 (19 hours)
Chemistry: 860070, or both 860044 and 860048; 860063 (9-12 hours)
Elective: one course from the following (3 hours)
Computer Science: 810143, 810147, 810153, 810155, 810161, 810172,
810181
Total: 73-75 hours

39
Courses

800060. Calculus I. The derivatives and
integrals of elementary functions and their
applications.
800061. Calculus II. Continuation of 800060.
800064. Elementary Probability and Statistics
for Bioinformatics. Descriptive statistics, basic
probability concepts, confidence intervals,
hypothesis testing, correlation and regression,
elementary concepts of survival analysis.
800152. Introduction to Probability. Axioms of
probability, sample spaces having equally likely
outcomes, conditional probability and
independence, random variables, expectation,
moment generating functions, jointly distributed
random variables, weak law of large numbers,
central limit theorem.

40
Courses

800164. Statistical Methods in Bioinformatics.
Analysis of a DNA sequence, analysis of multiple
DNA and protein sequences, BLAST.
810061. Computer Science I. Introduction to
computer programming in the context of a modern
object-oriented programming language. Emphasis on
good programming techniques, object-oriented
design, and style through extensive practice in
designing, coding, and debugging programs.
810062. Computer Science II. Intermediate
programming in an object-oriented environment.
Topics include object-oriented design,
implementation of classes and methods, dynamic
polymorphism, frameworks, patterns, software
reuses, limitations, exceptions, and threads.

41
Courses

810065. Computing for Bioinformatics I.
Intermediate programming with emphasis on
bioinformatics. Includes file handling, memory
management, multi-threading, B-trees,
introduction to dynamic programming including
Needleman-Wunsch and Smith-Waterman algorithms
for optimal alignments, exploration of BLAST,
FASTA and gapped alignment, substitution
matrices.
810066. Computing for Bioinformatics II.
Advanced bioinformatics computing; Perl and CGI
programming; data base facilities for
bioinformatics; pattern matching with regular
expressions; advanced dynamic programming;
optimal versus local alignment, multiple
alignments; data base mining tools, Entrez, SRS,
BLAST, FASTA, CLUSTAL; graphical 3-D
representation of proteins; phylogenic trees.

42
Courses

810080. Discrete Structures. Topics include
propositional and first-order logic; proofs and
inference; mathematical induction; sets,
relations, and functions; and graphs, lattices,
and Boolean algebra, all in the context of
computer science.
810114. Database Systems. Storage of, and access
to, physical databases; data models, query
languages, transaction processing, and recovery
techniques; object-oriented and distributed
database systems; and database design.
810115. Information Storage and Retrieval.
Natural language processing; analysis of textual
material by statistical, syntactic, and logical
methods; retrieval systems models, dictionary
construction, query processing, file structures,
content analysis; automatic retrieval systems and
question-answering systems; and evaluation of
retrieval effectiveness.

43
Courses

810180. Undergraduate Research in Computer
Science
840051. General Biology: Organismal Diversity.
Study of organismic biology emphasizing
evolutionary patterns and diversity of organisms
and interdependency of structure and function in
living systems.
840052. General Biology: Cell Structure and
Function. Study of cells, genetics, and DNA
technology emphasizing the chemical basis of life
and flow of information.
840130. Molecular Biology of the Cell.
Introduction to the molecular, biochemical, and
cellular structure and function of cells, DNA
structure and functions, and the translation of
genetic information into functional structures of
living cells. DNA replication, transcription of
genes, and synthesis and processing of proteins
will be emphasized.

44
Courses

840140. Genetics. Analytical approach to
classical, molecular, and population genetics.
840153. Recombinant DNA Techniques. Study of
techniques for manipulating and analyzing DNA,
including genomic library construction,
polymerase chain reaction, oligonucleotide
synthesis, genomic analysis with computers, and
DNA and RNA isolation.
860070. General Chemistry I-II. Accelerated
course for well-prepared students. Content
similar to 860044 and 860048 but covered in one
semester. Completion satisfies General Chemistry
requirement of any chemistry major.

45
Courses

860063. Applied Organic and Biochemistry. Basic
concepts in organic chemistry and biochemistry,
including nomenclature, functional groups,
reactivity, and macromolecules.
Elective from
810143(g). Operating Systems. History and
evolution of operating systems; process and
processor management; primary and auxiliary
storage management; performance evaluation,
security, and distributed systems issues; and
case studies of modern operating systems.

46
Courses

810147. Networking. Network architectures and
communication protocol standards. Topics include
communication of digital data, data-link
protocols, local-area networks, network-layer
protocols, transport-layer protocols,
applications, network security, and management.
810153. Design and Analysis of Algorithms.
Algorithm design techniques such as dynamic
programming and greedy algorithms; complexity
analysis of algorithms; efficient algorithms for
classical problems; intractable problems and
techniques for addressing them; and algorithms
for parallel machines.
810155. Translation of Programming Languages.
Introduction to analysis of programming languages
and construction of translators.

47
Courses

810161. Artificial Intelligence. Models of
intelligent behavior and problem solving;
knowledge representation and search methods;
learning; topics such as knowledge-based systems,
language understanding, and vision; optional
1-hour lab in symbolic programming techniques;
heuristic programming; symbolic representations
and algorithms; and applications to search,
parsing, and high-level problem-solving tasks.
810172. Software Engineering. Study of software
life cycle models and their phases--planning,
requirements, specifications, design,
implementation, testing, and maintenance.
Emphasis on tools, documentation, and
applications.
810181. Theory of Computation. Topics include
regular languages and grammars; finite state
automata; context-free languages and grammars;
language recognition and parsing; and Turing
computability and undecidability.