OUR GROUP MEMBERS: - PowerPoint PPT Presentation

About This Presentation
Title:

OUR GROUP MEMBERS:

Description:

Implementing a Prototype of XML Software Tool for Displaying and Searching Genomics Documents ... 2) All files are parsed using DOM architecture. ... – PowerPoint PPT presentation

Slides: 17
Provided by: YOR46

Transcript and Presenter's Notes

Title: OUR GROUP MEMBERS:


1
Implementing a Prototype of XML Software Tool for
Displaying and Searching Genomics Documents
  • OUR GROUP MEMBERS
  • Farhan Khalid
  • Vladimir Kossoi
  • Mani Yasrebi
  • David Feng
  • Mauricio Franco
  • Surjeet Roopani

2
INTRODUCTION
  • The objective of this assignment is to build a web information
    system for displaying and searching XML documents.
  • The implementation of this project is divided into three parts:
  • Creation of a two-level indexer
  • Searching
  • Creation of an XSL stylesheet for presenting XML documents

3
OUR SEARCHING PAGE
  • The main page is located at
    http://unix.aml.yorku.ca:8080/w04_g4/Search.html

4
DISPLAYING GENOMICS DOCUMENTS (This page is the
Java class SearchServlet)
5
DISPLAYING GENOMICS DOCUMENTS (This page is the
Java class SearchServlet)
  • This page displays a number of important things to the user:
  • Number of documents found that contain the specified term
  • Number of times the term was found in all the documents
  • Time (in seconds) it took to search for the term
  • Article titles

6
Formatted XML using an XSL stylesheet
7
INDEXER ARCHITECTURE
[Diagram: MainApp; steps "Parse files", "Sorted collection", "Create index"; output files Positions.txt, Dictionary.txt, ArticleTitles.txt]
8
General Design
  • There are two parts to the creation of the indexer:
  • Parsing all documents
  • Writing to file and creating a lexicon
  • Parsing involves:
  • 1) 1139 files are created.
  • 2) All files are parsed using DOM architecture.
  • Article titles are extracted and saved in a file
    using object serialization.
  • An ArrayList contains a collection of all terms,
    documents, and positions.
  • The ArrayList is sorted using Collections.sort(List l).
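
The parse-collect-sort steps above can be sketched roughly as follows. This is a minimal illustration, not the group's code: the class names Occurrence and DomIndexerSketch are hypothetical, a position is assumed to be a 0-based word index, and whitespace-only tokenizing stands in for the deck's own delimiter set.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical record: one occurrence of a term in one document.
final class Occurrence implements Comparable<Occurrence> {
    final String term;
    final String docName;
    final int position; // assumption: 0-based word index within the document

    Occurrence(String term, String docName, int position) {
        this.term = term;
        this.docName = docName;
        this.position = position;
    }

    // Natural order used by Collections.sort: term, then document, then position.
    @Override
    public int compareTo(Occurrence other) {
        int byTerm = term.compareTo(other.term);
        if (byTerm != 0) return byTerm;
        int byDoc = docName.compareTo(other.docName);
        if (byDoc != 0) return byDoc;
        return Integer.compare(position, other.position);
    }
}

final class DomIndexerSketch {
    // Parses one XML document with DOM and returns every term occurrence in it.
    static List<Occurrence> collect(String docName, String xml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse " + docName, e);
        }
        // All character data under the root element, markup stripped.
        String text = dom.getDocumentElement().getTextContent();
        List<Occurrence> out = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(text); // whitespace only (assumption)
        int position = 0;
        while (tokens.hasMoreTokens()) {
            out.add(new Occurrence(tokens.nextToken(), docName, position++));
        }
        return out;
    }
}
```

Calling collect for every file, appending the results to one ArrayList, and then calling Collections.sort on that list groups all occurrences of a term together, which is what makes the later merge of duplicate terms a single linear pass.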

9
General Design: Second part
  • Write to file and create a lexicon
  • Merge all duplicate terms
  • Write every term record to file in binary format
  • Store every term in a TreeMap as a Lexicon object
  • The connection (pointer) between the lexicon and
    Postings.txt is the start and end byte of record
    in Postings.txt.
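
A rough sketch of this second part, under stated assumptions: the record layout is taken to be <TotalFreq,DocName(TermFreq:pos1,...,posN)> with a colon after the per-document frequency, records are written as ASCII text rather than a true binary encoding, and the end byte is treated as exclusive. LexiconEntry and PostingsWriterSketch are hypothetical names.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical lexicon value: where a term's record lives in the postings data.
final class LexiconEntry {
    final long startByte; // inclusive
    final long endByte;   // exclusive (assumption)

    LexiconEntry(long startByte, long endByte) {
        this.startByte = startByte;
        this.endByte = endByte;
    }
}

final class PostingsWriterSketch {
    // Formats one merged term record: <total,doc(tf:p1,p2,...)doc(tf:...)>
    static String formatRecord(Map<String, List<Integer>> docs) {
        int total = 0;
        StringBuilder body = new StringBuilder();
        for (Map.Entry<String, List<Integer>> e : docs.entrySet()) {
            List<Integer> positions = e.getValue();
            total += positions.size();
            body.append(e.getKey()).append('(').append(positions.size()).append(':');
            for (int i = 0; i < positions.size(); i++) {
                if (i > 0) body.append(',');
                body.append(positions.get(i));
            }
            body.append(')');
        }
        return "<" + total + "," + body + ">";
    }

    // Writes every term record and remembers its byte range in the lexicon.
    static TreeMap<String, LexiconEntry> write(
            TreeMap<String, TreeMap<String, List<Integer>>> merged,
            ByteArrayOutputStream postings) {
        TreeMap<String, LexiconEntry> lexicon = new TreeMap<>();
        for (Map.Entry<String, TreeMap<String, List<Integer>>> e : merged.entrySet()) {
            byte[] record = formatRecord(e.getValue()).getBytes(StandardCharsets.US_ASCII);
            long start = postings.size(); // bytes written so far
            postings.write(record, 0, record.length);
            lexicon.put(e.getKey(), new LexiconEntry(start, start + record.length));
        }
        return lexicon;
    }
}
```

Because the lexicon stores only two numbers per term, it can stay in memory while the much larger postings data stays on disk.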

10
General Design: Second part
  • LEXICON

POSTINGS.TXT
<1,465.xml(1:85)><27,1129.xml(1:156)230.xml(2:110,197)
437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)
446.xml(7:202,19,2,3,14,14,15)705.xml(1:145)><534,1.xml(3:218,12,19)
1001.xml(1:316) ...
11
Lexicon
  • Our lexicon consists of 359,304 terms.
  • A term was considered to be a sequence of any characters except
    for the following:
  • " \t\n\r\f,'() and a period (.)
  • These characters were used as delimiters.
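
One way to express this definition of a term is with java.util.StringTokenizer. The sketch below is illustrative only: TermTokenizer is a hypothetical name, and the delimiter set is read as space, tab, newline, carriage return, form feed, comma, apostrophe, parentheses, and period. Whether the double quote shown on the slide is itself a delimiter is unclear, so it is left out here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

final class TermTokenizer {
    // Assumed delimiter set; the double quote is deliberately NOT included.
    static final String DELIMITERS = " \t\n\r\f,'().";

    // Splits text into terms: maximal runs of non-delimiter characters.
    static List<String> terms(String text) {
        List<String> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(text, DELIMITERS);
        while (tokenizer.hasMoreTokens()) {
            out.add(tokenizer.nextToken());
        }
        return out;
    }
}
```

Under these assumptions a possessive such as p53's splits into two terms, p53 and s, because the apostrophe is a delimiter.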

12
INDEXER
  • First level
  • In memory, a collection of all terms is stored in a
    TreeMap object.
  • The key is the term.
  • The values are the start and end byte of the term's
    record in Postings.txt.
  • Second level
  • Postings.txt
  • <1,465.xml(1:85)><27,1129.xml(1:156)230.xml(2:110,197)
    437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)>
  • Record format: <TotalFreqTerm, DocName(TermFreq:pos1,...,posN)>

13
Compression
  • Store the differences between positions.
  • For each term, the positions within each document are
    compressed.
  • E.g. 100,102,105,110
  • After compression: 100,2,3,5
  • In Postings.txt:
  • <1,465.xml(1:85)><27,1129.xml(1:156)230.xml(2:110,197)
    437.xml(16:173,8,12,13,3,22,9,3,21,19,10,9,13,3,20,10)
  • For efficiency reasons, positions are compressed
    only if there are more than 2 positions.
  • Without compression, our Postings.txt file is
    3.38 MB (3,546,244 bytes).
  • With compression, our Postings.txt file is 3.25
    MB (3,408,825 bytes).
  • We have saved 137,419 bytes.
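
The difference (gap) encoding described on this slide can be sketched as below. PositionGaps is a hypothetical name, and one detail is an assumption: the decoder is taken to recognise an uncompressed list by its length (2 or fewer positions), since the slide does not say how that case is flagged.

```java
import java.util.ArrayList;
import java.util.List;

final class PositionGaps {
    // Keeps the first position and replaces each later one with its distance
    // from the previous position. Lists of 2 or fewer are left unchanged.
    static List<Integer> compress(List<Integer> positions) {
        if (positions.size() <= 2) {
            return new ArrayList<>(positions);
        }
        List<Integer> out = new ArrayList<>(positions.size());
        out.add(positions.get(0));
        for (int i = 1; i < positions.size(); i++) {
            out.add(positions.get(i) - positions.get(i - 1));
        }
        return out;
    }

    // Inverse of compress: a running sum restores the absolute positions.
    static List<Integer> decompress(List<Integer> stored) {
        if (stored.size() <= 2) {
            return new ArrayList<>(stored);
        }
        List<Integer> out = new ArrayList<>(stored.size());
        int running = 0;
        for (int value : stored) {
            running += value;
            out.add(running);
        }
        return out;
    }
}
```

With the slide's example, compress turns 100,102,105,110 into 100,2,3,5, and decompress restores the original list. The saving comes from gaps needing fewer digits than absolute positions when the records are stored as text.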

14
SEARCHING
  • For searching we have created one servlet that
    does all the processing: SearchServlet.java.
  • The servlet receives the term and looks it up in the
    TreeMap object.
  • If found, it reads Postings.txt from the START
    until the END byte. All the bytes read before START
    are discarded, thereby saving memory.
  • If this is the first time the servlet is called, then
    the init method is executed:
  • Read two files: Dictionary.txt and
    ArticleTitles.txt.
  • Store them in two TreeMap objects.
15
SEARCHING/DECOMPRESSION
  • In SearchServlet the positions are NOT
    decompressed.
  • We have created another file called
    SearchServletCompression, where positions are
    decompressed. The reason behind this is
    SPEED. With compression, our search is much
    slower.
  • Below is a comparison

16
ANY