The Effect of Inverse Document Frequency Weights on Retrieval of Genomic Sequences: Towards a vector space approach

Slides: 48
Provided by: csU57
Learn more at: http://www.cs.uni.edu
Transcript and Presenter's Notes

1
The Effect of Inverse Document Frequency Weights
on Retrieval of Genomic Sequences: Towards a
vector space approach
  • Kevin C. O'Kane
  • Department of Computer Science
  • The University of Northern Iowa
  • Cedar Falls, Iowa 50613

2
The area of natural language text indexing and
retrieval has been studied since the mid-1950s.
In text retrieval, the problem is to locate
documents related to a natural language query.
To this end, natural language text indexing
programs have employed many techniques to
identify the terms in a document most likely to be
good content descriptors, as opposed to terms that
are poor content descriptors. By eliminating poor
descriptors and pre-indexing documents by
descriptors more likely to be good
discriminators, the speed of selection and the
precision of document relevance ranking can be
improved. The vector space model, developed
by G. Salton, views the problem as an
n-dimensional hyperspace of documents and
queries.
3
Overview
  • In text retrieval, the problem is to locate
    documents related to a natural language query.
  • Natural language text indexing programs identify
    terms in a document most likely to be content
    descriptors.
  • The goal of these experiments is to apply text
    indexing techniques to genomic data bases.

4
Natural Language Indexing
  • Natural language text indexing and retrieval have
    been studied since the mid-1950s. In text
    retrieval, the problem is to locate documents
    related to a natural language query.
  • Natural language text indexing programs employ
    techniques to identify terms in a document most
    likely to be content descriptors.
  • By eliminating poor descriptors and pre-indexing
    documents by descriptors likely to be good
    discriminators, the speed of selection and
    precision of document relevance ranking can be
    improved.
  • The vector space model, developed by G. Salton,
    views the problem as an n-dimensional hyperspace
    of documents and queries.

5
Document Hyperspace
6
Hyperspace Queries
7
Clustering Objects by Feature
8
Cosine Similarity Coefficient
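
The cosine similarity coefficient named in this slide title can be illustrated with a minimal sketch. The slide body did not survive in the transcript, so the function below is the standard textbook definition (dot product divided by the product of the vector lengths), not necessarily the exact form shown on the slide. The function name and the zero-vector convention are assumptions made for illustration.

```python
import math


def cosine_similarity(doc, query):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(doc) != len(query):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(d * q for d, q in zip(doc, query))
    norm_doc = math.sqrt(sum(d * d for d in doc))
    norm_query = math.sqrt(sum(q * q for q in query))
    if norm_doc == 0 or norm_query == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_doc * norm_query)
```

A document whose term weights point in the same direction as the query scores 1.0, and a document sharing no terms with the query scores 0.0.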
9
Genomic Data Bases
  • EMBL (http://www.embl.org)
  • SWISS-PROT (http://www.expasy.org/sprot/sprot-top.html)
  • PROSITE (http://www.expasy.org/prosite/)
  • PIR (http://pir.georgetown.edu/home.shtml)
  • NCBI/NLM GenBank (http://www.ncbi.nih.gov/)
  • MGD: The Mouse Genome Database
    (http://www.informatics.jax.org/)
  • OMIM: Online Mendelian Inheritance in Man
    (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=OMIM)

10
nt Sequence Data Base
  • NCBI nt data base: 12 billion bytes in length,
    comprising 2,584,440 sequences in FASTA format
    (Sept 2004).
  • Example sequence:
    >gi|2695852|emb|Y13263.1|ABY13263 Acipenser baeri mRNA for immunoglobulin heavy chain, clone
    CAAGAACCACAATACTGCAGTACAATGGGGATTTTAACAGCTCTCTGTAT
    AATAATGACAGCTCTATCAAGTGTCCGGTCTGATGTAGTGTTGACTGAGT
    CCGGACCAGCAGTTATAAAGCCTGGAGAGTCCCATAAACTGTCCTGTAAA
    GCCTCTGGATTCACATTCAGCAGCAACAACATGGGCTGGGTTCGACAAGC
    TCCTGGAAAGGGTCTGGAATGGGTGTCTACTATAAGCTATAGTGTAAATG
    CATACTATGCCCAGTCTGTCCAGGGAAGATTCACCATCTCCAGAGACGAT
    TCCAACAGCATGCTGTATTTACAAATGAACAGCCTGAAGACTGAAGACTC
    TGCCGTGTATTACTGTGCTCGAGAGTCTAACTTCAACCGCTTTGACTACT
    GGGGATCCGGGACTATGGTGACCGTAACAAATGCTACGCCATCACCACCG
    ACAGTGTTTCCGCTTATGCAGGCATGTTGTTCGGTCGATGTCACGGGTCC
    TAGCGCTACGGGCTGCTTAGCAACCGAATTC

11
GenBank
    LOCUS       AAB2MCG1      289 bp    DNA     linear   PRI 23-AUG-2002
    DEFINITION  Aotus azarai beta-2-microglobulin precursor exon 1.
    ACCESSION   AF032092
    VERSION     AF032092.1  GI:3265027
    KEYWORDS    .
    SEGMENT     1 of 2
    SOURCE      Aotus azarai (Azara's night monkey)
      ORGANISM  Aotus azarai
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
                Mammalia; Eutheria; Primates; Platyrrhini; Cebidae; Aotinae; Aotus.
    REFERENCE   1  (bases 1 to 289)
      AUTHORS   Canavez,F.C., Ladasky,J.J., Muniz,J.A., Seuanez,H.N., Parham,P. and
                Cavanez,C.
      TITLE     beta2-Microglobulin in neotropical primates (Platyrrhini)
      JOURNAL   Immunogenetics 48 (2), 133-140 (1998)
      MEDLINE   98298008
      PUBMED    9634477
    REFERENCE   2  (bases 1 to 289)
      AUTHORS   Canavez,F.C., Ladasky,J.J., Seuanez,H.N. and Parham,P.
      JOURNAL   Submitted (31-OCT-1997) Structural Biology, Stanford University,
                Fairchild Building Campus West Dr. Room D-100, Stanford, CA
                94305-5126, USA
    FEATURES             Location/Qualifiers
         source          1..289
                         /organism="Aotus azarai"
                         /mol_type="genomic DNA"
                         /db_xref="taxon:30591"
         sig_peptide     134..193
         exon            <134..200
                         /number=1
         intron          201..>289
                         /number=1
    ORIGIN
            1 gtccccgcgg gccttgtcct gattggctgt ccctgcgggc cttgtcctga ttggctgtgc
           61 ccgactccgt ataacataaa tagaggcgtc gagtcgcgcg ggcattactg cagcggacta
          121 cacttgggtc gagatggctc gcttcgtggt ggtggccctg ctcgtgctac tctctctgtc
          181 tggcctggag gctatccagc gtaagtctct cctcccgtcc ggcgctggtc cttcccctcc

12
Sequence Matching
  • Current access to sequence data bases is mainly by
    heuristic-assisted pattern matching on flat or
    nearly flat files, using programs such as BLAST
    and FASTA.
  • The underlying data bases are growing rapidly, with
    consequent deterioration of search times, even on
    large multiprocessor systems, as current software
    tools reach their design limits.
  • BLAST systems index data base sequences according
    to short code letter words (usually 3 letters
    for amino acids and 11 for nucleotide data
    bases) and scoring matrices.
  • Queries are also decomposed into similar short code
    words. The data base is scanned; sequences with
    words in common with the query are processed to
    extend the initial code word match.

13
Example BLAST Output

                                                                       Score    E
    Sequences producing significant alignments:                       (bits)  Value

    emb|BX015832.1|CNS08KDO Single read from an extremity of a ...     918    0.0
    emb|BX032891.1|CNS08XJJ Single read from an extremity of a ...     902    0.0
    emb|BX065445.1|CNS09MNT Single read from an extremity of a ...     894    0.0
    emb|BX052703.1|CNS09CTV Single read from an extremity of a ...     894    0.0
    emb|BX030708.1|CNS08VUW Single read from an extremity of a ...     894    0.0
    emb|BX030663.1|CNS08VTN Single read from an extremity of a ...     894    0.0
    .............................................................................

    >emb|BX015832.1|CNS08KDO Single read from an extremity of a
    full-length cDNA clone made from Anopheles gambiae total adult
    females. 3-PRIME end of clone FK0AAA23DA12
    Length = 866

    Score = 918 bits (463), Expect = 0.0
    Identities = 535/559 (95%)
    Strand = Plus / Plus


14
Developing A Vector Space Approach to Sequence
Indexing
  • This work explores natural language indexing
    techniques applied to genomic data bases
    through:
  • Weight-based indexing of k-tuples derived from
    the NCBI nt sequence data base.
  • Text terms used in genomic sequence data banks
    and literature.
  • Both applications are implemented for Linux and
    written in Mumps and MDH, a Mumps-related C
    toolkit capable of indexing data sets of up to
    256 terabytes using a B-tree based
    multidimensional data model that includes many
    retrieval and sequence matching functions.

15
Inverse Document Frequency Wgt.
  • The IDF weight yields higher values for words
    whose distribution is more concentrated and lower
    values for words whose use is more widespread.
  • Thus, words of broad context are weighted lower
    than words of narrow context.
  • Words of low weight are hypothesized to be poor
    indexing terms while words with high weights are
    hypothesized to be good indexing terms.
  • The bulk of the words, as is the case in natural
    language text, reside in the middle range.

16
Natural Language Example
    Word        Freq(i,j)  TotFreq  DocFreq    Wgt1     Wgt2   Wgt3        MCA

    Document 1: Death of a cult. (Apple Computer needs to alter its
    strategy) (column)
    apple           4        261      112     1.716    9.757    17     -1.1625
    computer        4        706      358     2.028    5.109    10    -19.4405
    mac             2        146       71     0.973    6.290     6     -0.0256
    macintosh       4        210      107     2.038    9.940    20     -0.5855
    strategy        2         79       67     1.696    6.406    11     -0.0592

    Document 3: WordPerfect. (WordPerfect for the Macintosh 2.0)
    (evaluation) Taub, Eric.
    edit            2        111       77     1.387    6.128     8     -0.0961
    frame           2          9        7     1.556   10.924    17      0.0131
    import          2         29       19     1.310    8.927    12      0.0998
    macintosh       3        210      107     1.529    7.705    12     -0.5855
    macro           3         38       24     1.895   12.189    23      0.1075
    outstand        1         10        9     0.900    5.711     5      0.0168
    user            4        861      435     2.021    4.330     9    -26.8094

17
Indexing Experiment
  • Sequences from the NCBI "nt" (non-redundant
    nucleotide) data base were used.
  • The nt data base is approximately 12 billion
    bytes in length comprising 2,584,440 sequences in
    FASTA format (Sept 2004).
  • A word size of 11 was used throughout. A total
    of 4,194,299 words were identified, slightly less
    than the theoretical maximum of 4,194,304.
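
The theoretical maximum quoted above follows from the four-letter nucleotide alphabet: 4^11 = 4,194,304 distinct 11-letter words. The sketch below shows one way to reduce such words to a numeric equivalent. The 2-bit-per-base encoding, the use of overlapping windows, the skipping of words containing letters other than A, C, G and T, and all names are assumptions made for illustration; the slides do not state how the original software handles these details.

```python
WORD_SIZE = 11
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed 2-bit encoding


def encode_word(word):
    """Map a word over A, C, G, T to an integer in [0, 4**len(word))."""
    code = 0
    for base in word:
        code = code * 4 + BASE_CODE[base]
    return code


def sequence_words(sequence, k=WORD_SIZE):
    """Yield the numeric code of every overlapping k-letter word.

    Words containing a letter other than A, C, G or T are skipped
    (an assumption; ambiguity codes such as N have no 2-bit code here).
    """
    sequence = sequence.upper()
    for start in range(len(sequence) - k + 1):
        word = sequence[start:start + k]
        if all(base in BASE_CODE for base in word):
            yield encode_word(word)
```

With this encoding every possible 11-letter word maps to a distinct integer below 4 ** 11, so the code can serve directly as a table index.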

18
Calculating the IDF Weight
  • The overall frequencies of occurrence of all
    possible 11 character words from each sequence
    were determined along with the number of
    sequences in which each unique word was found.
  • A weight Wgt_i for each word i was calculated by
    taking the Log10 of the total number of
    sequences (N) divided by the number of sequences
    in which the word occurred (DocFreq_i),
    multiplying by 10, and truncating the result to
    an integer.
  • Wgt_i = (int) (10 × Log10(N / DocFreq_i))
  • In natural language indexing, this is referred to
    as the inverse document frequency (IDF) weight.
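
The weight formula on this slide can be written as a short function. This is a minimal sketch of the stated calculation only; the function and parameter names are hypothetical, and the input check is an added assumption rather than something described in the slides.

```python
import math


def idf_weight(total_sequences, doc_freq):
    """Integer IDF weight: (int) (10 * log10(N / DocFreq))."""
    if doc_freq <= 0 or doc_freq > total_sequences:
        # Assumed sanity check, not described in the slides.
        raise ValueError("doc_freq must be between 1 and total_sequences")
    return int(10 * math.log10(total_sequences / doc_freq))
```

A word found in every sequence receives weight 0, and the weight grows as the word becomes rarer.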

19
File Sizes
  • Initial file analysis produces about 110
    intermediate files of about 440 million bytes
    each from the input data base (12 GB).
  • out.table is a large (40 billion byte)
    word-sequence file.
  • freq.bin contains the inverse document frequency
    weight for each word (53 million bytes)
  • index (76 million bytes) gives for each word the
    eight byte offset of the word's entry in
    out.table.
  • index and freq.bin are merged into ITABLE (112
    million bytes) which contains for each word its
    weight, offset, and a pointer to a list of
    aliases (not used with the nt data base).

20
Data Base
  • W = ( w_1, w_2, w_3, ... w_M ): vector of M weights
  • F: word-sequence matrix (M words by N sequences)
        ( f_1,1  f_1,2  f_1,3  ...  f_1,N )
        ( f_2,1  f_2,2  f_2,3  ...  f_2,N )
        ( f_3,1  f_3,2  f_3,3  ...  f_3,N )
        ...
        ( f_M,1  f_M,2  f_M,3  ...  f_M,N )


21
Number of Words at each Weight
    for i = 1 to 120
        z_i ← 0
        for j = 1 to M
            if w_j = i then z_i ← z_i + 1
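
The counting loop above can be sketched in Python as a single pass over the weight vector. The function name and the list-based return value are assumptions for illustration; the upper bound of 120 is taken from the slide.

```python
from collections import Counter


def words_per_weight(weights, max_weight=120):
    """Return a list z where z[i - 1] is the number of words with weight i."""
    counts = Counter(weights)
    return [counts.get(i, 0) for i in range(1, max_weight + 1)]
```

Counting in one pass avoids rescanning all M weights for each of the 120 levels, while producing the same totals as the nested loops.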

22
Number of Words at Each IDF Wgt.
23
Sum of all Instances of Each Weight
    for i = 1 to 120                          // for each weight
        x_i ← 0
        for j = 1 to M                        // for each word
            for k = 1 to N                    // for each sequence
                if f_j,k = i then x_i ← x_i + 1

24
Number of Occurrences at Each IDF Level
25
Sequence Retrieval
  • For retrieval, a query sequence is read and
    decomposed into 11 character words. These words
    are reduced to a numeric equivalent, which is used
    as an index into the word-sequence table.
  • If the weight of a word lies within a specified
    range, the entries in a master vector
    corresponding to the sequences in which the word
    occurs are incremented by the weight of the word.
  • When all words have been processed, entries in
    the master sequence vector are normalized
    according to the length of the underlying
    sequence relative to the length of the query.
  • Finally, the master sequence vector is sorted and
    the top scoring entries are either printed or
    submitted to a Smith-Waterman alignment, sorted
    again and then printed.
  • Optionally, the Smith-Waterman alignments
    themselves can be printed, and the selected
    sequences can be extracted from the nt data base
    and stored in a separate output file for
    additional processing. FASTA post-processing is
    an option.
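
The scoring steps described on this slide can be sketched as follows. This is an illustration under stated assumptions, not the Mumps/MDH implementation: the in-memory dictionaries stand in for the word-sequence table, all names are hypothetical, and the normalization (scaling a score down when the data base sequence is longer than the query) is only one possible reading, since the slides do not give the formula. The default range 65 to 120 is the one named in the results slide title.

```python
from collections import defaultdict


def score_sequences(query_words, index, weights, lengths, query_length,
                    low=65, high=120):
    """Rank sequences by summed IDF weight of the query words they contain.

    query_words  -- numeric word codes taken from the query, in order
    index        -- word code -> iterable of sequence ids containing the word
    weights      -- word code -> integer IDF weight
    lengths      -- sequence id -> sequence length
    """
    master = defaultdict(float)  # the "master vector" of per-sequence scores
    for word in query_words:
        weight = weights.get(word)
        if weight is None or not (low <= weight <= high):
            continue  # word unknown or outside the accepted weight range
        for seq_id in index.get(word, ()):
            master[seq_id] += weight

    ranked = []
    for seq_id, score in master.items():
        # ASSUMED normalization: penalize sequences longer than the query.
        factor = min(1.0, query_length / lengths[seq_id])
        ranked.append((seq_id, score * factor))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked
```

The top entries of the returned list would then be printed or passed on to a Smith-Waterman alignment, as the slide describes.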

26
Unweighted Result for 500 Random Queries
27
Result for 500 Random Queries Weight Range 65-120
28
Overall Results for 500 Random Queries
29
Index Scoring Results
  • Query: >gi|19911940|dbj|AB072098.1| Hepatitis C
    virus type 1b gene for polyprotein, NS3 region,
    partial cds, isolate71
  • Query string has 289 letters
  • Searching ...
  • 68224 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 31420 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
  • 30508 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 30296 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
  • 29800 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
  • 29444 >gi|19911960|dbj|AB072108.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 29240 >gi|19911888|dbj|AB072072.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 29196 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
  • 29120 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 28896 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
  • 28116 >gi|2731651|gb|U81612.1|HCU81612 Hepatitis C virus polyprotein gene, partial cds
  • 28116 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
  • 27700 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
  • ...

30
Smith-Waterman Result Scoring
  • Query: >gi|19911940|dbj|AB072098.1| Hepatitis C
    virus type 1b gene for polyprotein, NS3 region,
    partial cds, isolate71
  • Query string has 289 letters
  • top: >gi|19911940|dbj|AB072098.1| Hepatitis C
    virus type 1b gene for polyprotein, NS3 region,
    partial cds, isolate71

    166 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAATCACCCAGATGTACACCAATGTAGACCAGGACCT 245
      1 TTCCACGGTGCCGGCTCAAAGACCCTAGCCGGCCCGAAGGGCCAAGTCACCCAGATGTACACCAATGTAGACCAGGTCCT 80

    246 CGTCGGCTGGCCGGCGCCCCCCGGAGCGCGTTCCTTGACACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 325
     81 CGTCGGCTGGCCGGCGCCGCCCGGAGCGCGTTCCTTGAGACCATGCACCTGCGGCAGCTCGGACCTTTATTTGGTCACGA 160

    326 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 405
    161 GACATGCTGACGTCATCCCGGTGCGCCGGCGGGGCGACAGCAGGGGGAGCTTGCTTTCTCCTAGGCCCATCTCTTACTTA 240

    406 AAGGGCTCTTCGGGCGGTCCACTGCTTTGCCCCTCGGGGCACGCTGTGG 454


31
S-W Scores
  • 566 >gi|19911940|dbj|AB072098.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 505 >gi|29467668|dbj|AB100815.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
  • 504 >gi|29467317|dbj|AB089555.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
  • 503 >gi|19911914|dbj|AB072085.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 503 >gi|14150628|gb|AF369252.1| Hepatitis C virus Pt.128 NS3 protease gene, partial cds
  • 502 >gi|29467247|dbj|AB089520.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
  • 502 >gi|3157741|dbj|AB013621.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
  • 501 >gi|19911862|dbj|AB072059.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 499 >gi|29467670|dbj|AB100816.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
  • 498 >gi|3157753|dbj|AB013627.1| Hepatitis C virus RNA for polyprotein (NS3 proteinase region),
  • 498 >gi|14150634|gb|AF369255.1| Hepatitis C virus Pt.2F NS3 protease gene, partial cds
  • 497 >gi|29467311|dbj|AB089552.1| Hepatitis C virus NS3 gene for polyprotein, partial cds, isol
  • 497 >gi|19911934|dbj|AB072095.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 497 >gi|19911912|dbj|AB072084.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 496 >gi|19911900|dbj|AB072078.1| Hepatitis C virus type 1b gene for polyprotein, NS3 region, p
  • 495 >gi|14150638|gb|AF369257.1| Hepatitis C virus Pt.3O NS3 protease gene, partial cds
  • 495 >gi|14150616|gb|AF369246.1| Hepatitis C virus Pt.1A NS3 protease gene, partial cds
  • 495 >gi|14150646|gb|AF369261.1| Hepatitis C virus Pt.6A NS3 protease gene, partial cds
  • 494 >gi|14150620|gb|AF369248.1| Hepatitis C virus Pt.1 NS3 protease gene, partial cds
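
For readers unfamiliar with how a Smith-Waterman score is obtained, the following is a minimal sketch of the local alignment score computation. The scoring parameters (match +2, mismatch -1, linear gap penalty -1) and the function name are assumptions chosen for illustration; the slides do not state which scoring scheme produced the values listed above.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score between strings a and b (score only)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                       # start a new local alignment
                previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best
```

Only two rows of the dynamic programming table are kept, which is enough to obtain the score; recovering the alignment itself would require storing the full table or traceback pointers.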

32
Larger Sequences
  • On larger query sequences (5,000 to 6,000
    letters), the IDF method performed slightly
    better than BLAST. On 25 randomly generated
    sequences, the IDF method ranked the original
    sequence first 24 times and third once. BLAST
    ranked the original sequence first 21 times,
    with the remaining 4 ranked 2, 2, 3 and 4. The
    average time per query was 47.4 seconds for the
    IDF method and 122.8 seconds for BLAST.

33
The Next Step
  • Future work
  • Weighted Term Vectors.
  • Other weighting schemes such as the Modified
    Centroid Algorithm.
  • Sequence-Sequence and Term-Term Correlations.
  • Sequence clustering.

34
References
  • Altschul, S.F., Gish, W., Miller, W., Myers, E.W.
    and Lipman, D.J. (1990) Basic local alignment
    search tool. J. Mol. Biol. 215:403-10.
  • O'Kane, K.C. and Lockner, M.J. (2004) Indexing
    genomic sequence libraries. Information
    Processing and Management 41:265-274.
  • O'Kane, K.C. (2004) The Effect of Inverse
    Document Frequency Weights on Indexed Sequence
    Retrieval, submitted.
  • Pearson, W.R. (2000) Flexible sequence
    similarity searching with the FASTA3 program
    package. Methods Mol. Biol. 132:185-219.
  • Salton, G. (1983) Introduction to Modern
    Information Retrieval. McGraw-Hill, New York.
  • Smith, T.F. and Waterman, M.S. (1981)
    Identification of common molecular subsequences.
    J. Mol. Biol. 147:195-197.

35
(No Transcript)
36
Hierarchical Data Base
37
Bioinformatics
  • Sloan Report on Bioinformatics from June 2004.
  • Number of graduates
  • There were only 26 new PhD's produced...
  • 102 masters degrees awarded...
  • Only 17 Bachelor's degrees produced...
  • The data is for January 2002 until March
    2003.
  • "... in the next few years the number of
    graduates is expected to increase by two or three
    times."
  • Average program enrollment
  • 103 Bachelors
  • 435 Masters
  • 296 PhD

38
B.S. in Bioinformatics at UNI
  • Mathematics: 800060, 800061, 800064, 800152,
    800164 (17 hours)
  • Computer Science: 810061, 810062, 810065,
    810066, 810080, 810114, 810115, 810180 (24 hours)
  • Biology: 840051, 840052, 840130, 840140,
    840153 (19 hours)
  • Chemistry: 860070 (or both 860044 and 860048),
    860063 (9-12 hours)
  • Elective: one course from the following
    (3 hours)
  • Computer Science: 810143, 810147, 810153,
    810155, 810161, 810172, 810181

  • Total: 73-75 hours

39
Courses
  • 800060. Calculus I. The derivatives and
    integrals of elementary functions and their
    applications.
  • 800061. Calculus II. Continuation of 800060.
  • 800064. Elementary Probability and Statistics
    for Bioinformatics. Descriptive statistics, basic
    probability concepts, confidence intervals,
    hypothesis testing, correlation and regression,
    elementary concepts of survival analysis.
  • 800152. Introduction to Probability. Axioms of
    probability, sample spaces having equally likely
    outcomes, conditional probability and
    independence, random variables, expectation,
    moment generating functions, jointly distributed
    random variables, weak law of large numbers,
    central limit theorem.

40
Courses
  • 800164. Statistical Methods in Bioinformatics.
    Analysis of a DNA sequence, analysis of multiple
    DNA and protein sequences, BLAST.
  • 810061. Computer Science I. Introduction to
    computer programming in the context of a modern
    object-oriented programming language. Emphasis on
    good programming techniques, object-oriented
    design, and style through extensive practice in
    designing, coding, and debugging programs.
  • 810062. Computer Science II. Intermediate
    programming in an object-oriented environment.
    Topics include object-oriented design,
    implementation of classes and methods, dynamic
    polymorphism, frameworks, patterns, software
    reuses, limitations, exceptions, and threads.

41
Courses
  • 810065. Computing for Bioinformatics I.
    Intermediate programming with emphasis on
    bioinformatics. Includes file handling, memory
    management, multi-threading, B-trees,
    introduction to dynamic programming including
    Needleman-Wunsch and Smith-Waterman algorithms
    for optimal alignments, exploration of BLAST,
    FASTA and gapped alignment, substitution
    matrices.
  • 810066. Computing for Bioinformatics II.
    Advanced bioinformatics computing: Perl and CGI
    programming; data base facilities for
    bioinformatics; pattern matching with regular
    expressions; advanced dynamic programming;
    optimal versus local alignment, multiple
    alignments; data base mining tools, Entrez, SRS,
    BLAST, FASTA, CLUSTAL; graphical 3-D
    representation of proteins; phylogenetic trees.

42
Courses
  • 810080. Discrete Structures. Topics include
    propositional and first-order logic; proofs and
    inference; mathematical induction; sets,
    relations, and functions; and graphs, lattices,
    and Boolean algebra, all in the context of
    computer science.
  • 810114. Database Systems. Storage of, and access
    to, physical databases; data models, query
    languages, transaction processing, and recovery
    techniques; object-oriented and distributed
    database systems; and database design.
  • 810115. Information Storage and Retrieval.
    Natural language processing; analysis of textual
    material by statistical, syntactic, and logical
    methods; retrieval systems models, dictionary
    construction, query processing, file structures,
    content analysis; automatic retrieval systems and
    question-answering systems; and evaluation of
    retrieval effectiveness.

43
Courses
  • 810180. Undergraduate Research in Computer
    Science.
  • 840051. General Biology: Organismal Diversity.
    Study of organismic biology emphasizing
    evolutionary patterns and diversity of organisms
    and interdependency of structure and function in
    living systems.
  • 840052. General Biology: Cell Structure and
    Function. Study of cells, genetics, and DNA
    technology emphasizing the chemical basis of life
    and flow of information.
  • 840130. Molecular Biology of the Cell.
    Introduction to the molecular, biochemical, and
    cellular structure and function of cells, DNA
    structure and functions, and the translation of
    genetic information into functional structures of
    living cells. DNA replication, transcription of
    genes, and synthesis and processing of proteins
    will be emphasized.

44
Courses
  • 840140. Genetics. Analytical approach to
    classical, molecular, and population genetics.
  • 840153. Recombinant DNA Techniques. Study of
    techniques for manipulating and analyzing DNA,
    including genomic library construction,
    polymerase chain reaction, oligonucleotide
    synthesis, genomic analysis with computers, and
    DNA and RNA isolation.
  • 860070. General Chemistry I-II. Accelerated
    course for well-prepared students. Content
    similar to 860044 and 860048 but covered in one
    semester. Completion satisfies General Chemistry
    requirement of any chemistry major.

45
Courses
  • 860063. Applied Organic and Biochemistry. Basic
    concepts in organic chemistry and biochemistry,
    including nomenclature, functional groups,
    reactivity, and macromolecules.
  • Elective from
  • 810143(g). Operating Systems. History and
    evolution of operating systems; process and
    processor management; primary and auxiliary
    storage management; performance evaluation,
    security, and distributed systems issues; and
    case studies of modern operating systems.

46
Courses
  • 810147. Networking. Network architectures and
    communication protocol standards. Topics include
    communication of digital data, data-link
    protocols, local-area networks, network-layer
    protocols, transport-layer protocols,
    applications, network security, and management.
  • 810153. Design and Analysis of Algorithms.
    Algorithm design techniques such as dynamic
    programming and greedy algorithms; complexity
    analysis of algorithms; efficient algorithms for
    classical problems; intractable problems and
    techniques for addressing them; and algorithms
    for parallel machines.
  • 810155. Translation of Programming Languages.
    Introduction to analysis of programming languages
    and construction of translators.

47
Courses
  • 810161. Artificial Intelligence. Models of
    intelligent behavior and problem solving;
    knowledge representation and search methods;
    learning; topics such as knowledge-based systems,
    language understanding, and vision; optional
    1-hour lab in symbolic programming techniques;
    heuristic programming; symbolic representations
    and algorithms; and applications to search,
    parsing, and high-level problem-solving tasks.
  • 810172. Software Engineering. Study of software
    life cycle models and their phases--planning,
    requirements, specifications, design,
    implementation, testing, and maintenance.
    Emphasis on tools, documentation, and
    applications.
  • 810181. Theory of Computation. Topics include
    regular languages and grammars; finite state
    automata; context-free languages and grammars;
    language recognition and parsing; and Turing
    computability and undecidability.