Genomics Article Search - PowerPoint PPT Presentation

Slides: 12
Provided by: YOR46

Transcript and Presenter's Notes

Title: Genomics Article Search


1
Genomics Article Search
  • Group 3
  • -Adjoa, Alexia, Carmon, Nick, and Shelley-

2
Our Website: http://unix.aml.yorku.ca:8080/w04_g3/indexSearch.html

3
How to Split the XML Dataset
  • We split the dataset into individual articles,
    each delimited by opening and closing
    PubmedArticle tags.
  • This produces 1139 XML documents.
  • Each XML document references an XSL stylesheet.
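
The splitting step above can be sketched roughly as follows. This is a minimal illustration, not the group's actual code: it assumes the opening tag is written exactly as <PubmedArticle> with no attributes, and the class name ArticleSplitter, the method names, and the stylesheet name article.xsl are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

class ArticleSplitter {
    // Returns every <PubmedArticle>...</PubmedArticle> block found in the input, in order.
    // Assumption: the opening tag carries no attributes.
    static List<String> split(String dataset) {
        final String open = "<PubmedArticle>";
        final String close = "</PubmedArticle>";
        List<String> articles = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = dataset.indexOf(open, from);
            if (start < 0) break;
            int end = dataset.indexOf(close, start);
            if (end < 0) break; // unterminated article: stop rather than guess
            end += close.length();
            articles.add(dataset.substring(start, end));
            from = end;
        }
        return articles;
    }

    // Wraps one article as a standalone XML document that references an XSL stylesheet
    // through the standard xml-stylesheet processing instruction.
    static String withStylesheet(String article, String xslHref) {
        return "<?xml version=\"1.0\"?>\n"
             + "<?xml-stylesheet type=\"text/xsl\" href=\"" + xslHref + "\"?>\n"
             + article + "\n";
    }

    public static void main(String[] args) {
        String dataset = "<PubmedArticleSet>"
                + "<PubmedArticle><PMID>1</PMID></PubmedArticle>"
                + "<PubmedArticle><PMID>2</PMID></PubmedArticle>"
                + "</PubmedArticleSet>";
        List<String> parts = split(dataset);
        System.out.println(parts.size() + " articles found");
        // "article.xsl" is a hypothetical stylesheet name.
        System.out.println(withStylesheet(parts.get(0), "article.xsl"));
    }
}
```

Plain string scanning keeps the sketch short; a streaming XML parser would be more robust if the tags can carry attributes or unusual whitespace.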

4
Technology Used
  • Java
  • HTML
  • XML
  • XSL
  • Perl

5
How to Parse and Extract Keywords and PMIDs
  • We used DOM to parse the text content of the XML
    documents.
  • First, we extracted the text within the XML
    tags.
  • Second, we tokenized the words and PMIDs
    separately.
  • Third, we attached the document ID to each term.
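
The three parsing steps might look roughly like this in Java. It is a sketch under several assumptions: words are taken only from ArticleTitle and AbstractText elements, terms are lowercased, the output is a list of "term<TAB>docId" strings, and String.split is used in place of whatever tokenizer the group used. The class name TermExtractor is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

class TermExtractor {
    // Parses one article with DOM and returns "term<TAB>docId" pairs.
    static List<String> extract(String xml, int docId) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse article XML", e);
        }
        List<String> pairs = new ArrayList<>();

        // PMIDs are kept whole, separately from the words.
        NodeList pmids = doc.getElementsByTagName("PMID");
        for (int i = 0; i < pmids.getLength(); i++) {
            pairs.add(pmids.item(i).getTextContent().trim() + "\t" + docId);
        }

        // Words: text inside the (assumed) title and abstract elements,
        // lowercased and split on anything that is not a letter or digit.
        for (String tag : new String[] {"ArticleTitle", "AbstractText"}) {
            NodeList nodes = doc.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                String text = nodes.item(i).getTextContent().toLowerCase();
                for (String word : text.split("[^a-z0-9]+")) {
                    if (!word.isEmpty()) {
                        pairs.add(word + "\t" + docId); // attach the document ID to the term
                    }
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        String xml = "<PubmedArticle><PMID>123</PMID>"
                + "<ArticleTitle>Gene Expression in yeast.</ArticleTitle></PubmedArticle>";
        for (String pair : extract(xml, 7)) {
            System.out.println(pair);
        }
    }
}
```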

6
How to Sort
  • We used a Perl script to sort the tokenized
    terms.
  • The sorted terms appear in alphabetical and
    numerical order.
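
The group did this step in Perl. For consistency with the other examples, here is an equivalent sketch in Java, not a reconstruction of their script. It assumes the input is a list of "term<TAB>docId" strings, that all-digit terms (such as PMIDs) are ordered by numeric value ahead of words, and that words are ordered alphabetically. The class name TermSorter is hypothetical.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class TermSorter {
    static boolean isNumber(String s) {
        return s.matches("\\d+");
    }

    // Assumed ordering: numeric terms first (by value), then words alphabetically.
    static final Comparator<String> TERM_ORDER = (a, b) -> {
        boolean na = isNumber(a);
        boolean nb = isNumber(b);
        if (na && nb) return new BigInteger(a).compareTo(new BigInteger(b));
        if (na != nb) return na ? -1 : 1;
        return a.compareTo(b);
    };

    // Sorts "term<TAB>docId" pairs by term, then by document ID.
    static List<String> sortPairs(List<String> pairs) {
        List<String> sorted = new ArrayList<>(pairs);
        sorted.sort(Comparator
                .comparing((String p) -> p.split("\t")[0], TERM_ORDER)
                .thenComparingInt(p -> Integer.parseInt(p.split("\t")[1])));
        return sorted;
    }

    public static void main(String[] args) {
        List<String> pairs = Arrays.asList("yeast\t2", "gene\t10", "gene\t2", "100\t1", "9\t3");
        for (String pair : sortPairs(pairs)) {
            System.out.println(pair);
        }
    }
}
```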

7
How We Built the Two-Level Index
  • We used TreeMaps as the data structure for
    building the two-level index.
  • We chose TreeMaps because their entries are
    always kept sorted in ascending order of key.
  • The TreeMaps store the keys (words and numbers)
    and their corresponding values (document IDs or
    pointers to them).
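
As a small illustration of the ordering property, the sketch below keeps a TreeMap from each term to a sorted set of document IDs. The class name InMemoryIndex and the choice of TreeSet for the values are assumptions made for the example.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class InMemoryIndex {
    // term -> sorted set of document IDs; TreeMap keeps the terms in ascending key order.
    final TreeMap<String, TreeSet<Integer>> postings = new TreeMap<>();

    void add(String term, int docId) {
        postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
    }

    public static void main(String[] args) {
        InMemoryIndex index = new InMemoryIndex();
        // Insertion order does not matter; iteration is always by ascending key.
        index.add("yeast", 3);
        index.add("gene", 5);
        index.add("gene", 2);
        index.add("12345", 1);
        for (Map.Entry<String, TreeSet<Integer>> e : index.postings.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

With String keys, the default order is lexicographic, so digits sort before letters and "100" sorts before "9". A custom Comparator can be passed to the TreeMap constructor if numeric keys should be ordered by value.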

8
How We Built the Two-Level Index (cont'd)
  • A TreeMap that stores each keyword (or PMID)
    with the beginning and end pointers to the
    positions of its document IDs is written to a
    text file as the lexicon (first-level index).
  • All the document IDs are then written to another
    text file to form the second-level index, or
    postings file.
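
One way the lexicon and postings file could be laid out is sketched below. The file format is an assumption, since the slides do not specify it: each lexicon line is "term start end", where start (inclusive) and end (exclusive) are line positions in a postings file that holds one document ID per line. The class name TwoLevelIndexWriter and the file names are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class TwoLevelIndexWriter {
    // Appends "term start end" lines to `lexicon` and the document IDs to `postings`.
    // start is inclusive and end is exclusive, both counted as positions in `postings`.
    static void build(TreeMap<String, TreeSet<Integer>> index,
                      List<String> lexicon, List<Integer> postings) {
        for (Map.Entry<String, TreeSet<Integer>> e : index.entrySet()) {
            int start = postings.size();
            postings.addAll(e.getValue());
            int end = postings.size();
            lexicon.add(e.getKey() + " " + start + " " + end);
        }
    }

    public static void main(String[] args) throws IOException {
        TreeMap<String, TreeSet<Integer>> index = new TreeMap<>();
        index.computeIfAbsent("gene", k -> new TreeSet<>()).add(2);
        index.computeIfAbsent("gene", k -> new TreeSet<>()).add(5);
        index.computeIfAbsent("yeast", k -> new TreeSet<>()).add(3);

        List<String> lexicon = new ArrayList<>();
        List<Integer> postings = new ArrayList<>();
        build(index, lexicon, postings);

        // Write both levels as plain text files (hypothetical file names).
        Path dir = Files.createTempDirectory("index");
        Files.write(dir.resolve("lexicon.txt"), lexicon);
        List<String> postingLines = new ArrayList<>();
        for (int id : postings) {
            postingLines.add(Integer.toString(id));
        }
        Files.write(dir.resolve("postings.txt"), postingLines);
        System.out.println("Index written to " + dir);
    }
}
```

Because the TreeMap is iterated in key order, the lexicon comes out sorted and each term's document IDs sit in one contiguous run of the postings file.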

9
How to Search
  • The servlet class gets the keyword or PMID from
    the form and tries to find a match in the lexicon
    (text file).
  • If a match is found, it takes the position
    pointers for the document IDs and retrieves the
    corresponding document IDs from the postings file.
  • It then returns hyperlinks to all the matching
    XML documents to the user.
  • If no match is found, it returns the user to the
    search page to try again.
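
The lookup logic can be sketched without the servlet plumbing, which is omitted here so the example stays self-contained. It assumes lexicon lines of the form "term start end" (end exclusive), a postings list already read into memory, lowercased terms, and documents named <docId>.xml. All of these are illustrative choices, as is the class name IndexSearcher. Instead of redirecting the user, the no-match branch simply links back to the search page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class IndexSearcher {
    // Finds `query` in the lexicon and returns its document IDs from the postings list.
    // Returns an empty list when there is no match.
    static List<Integer> lookup(String query, List<String> lexicon, List<Integer> postings) {
        String term = query.trim().toLowerCase();
        for (String line : lexicon) {
            String[] fields = line.split(" ");
            if (fields[0].equals(term)) {
                int start = Integer.parseInt(fields[1]);
                int end = Integer.parseInt(fields[2]);
                return new ArrayList<>(postings.subList(start, end));
            }
        }
        return Collections.emptyList();
    }

    // Renders the result as HTML: one hyperlink per matching document,
    // or a link back to the search page when nothing matched.
    static String toHtml(List<Integer> docIds) {
        if (docIds.isEmpty()) {
            return "<p>No match found. <a href=\"indexSearch.html\">Try again</a></p>";
        }
        StringBuilder html = new StringBuilder("<ul>\n");
        for (int id : docIds) {
            // "<id>.xml" is a hypothetical naming scheme for the split documents.
            html.append("<li><a href=\"").append(id).append(".xml\">")
                .append(id).append(".xml</a></li>\n");
        }
        return html.append("</ul>").toString();
    }

    public static void main(String[] args) {
        List<String> lexicon = Arrays.asList("gene 0 2", "yeast 2 3");
        List<Integer> postings = Arrays.asList(2, 5, 3);
        System.out.println(toHtml(lookup("Gene", lexicon, postings)));
        System.out.println(toHtml(lookup("human", lexicon, postings)));
    }
}
```

A linear scan is enough for a sketch; since the lexicon is sorted, a binary search over it would also work.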

10
  • DEMO

11
Questions, Comments?