1
Final Project of Information Retrieval and
Extraction
  • by d93921022

2
Working Environment
  • OS: Linux 7.3
  • CPU: C800 MHz
  • Memory: 128 MB
  • Tools used:
  • stopper
  • stemmer
  • trec_eval
  • sqlite
  • Languages used:
  • shell script: controls the inverted-file indexing procedures
  • AWK: used to extract the needed parts from the documents (see the sketch below)
  • SQL: used while trying to adopt the file-format database, sqlite
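Since the FBIS sources are TREC-style SGML files, the AWK extraction step can be sketched roughly as below. This is only an illustration of the kind of script used: the tag names (<DOCNO>, <TEXT>) and the output format are assumptions, not the project's actual code.

```sh
#!/bin/sh
# Rough sketch of the AWK extraction step: pull the document id and body
# text out of one FBIS source file. Assumes TREC-style <DOCNO>/<TEXT> tags;
# the real script and its output format may differ.
SRC="$1"    # one FBIS source file

awk '
  /<DOCNO>/  { gsub(/<\/?DOCNO>/, ""); gsub(/^ +| +$/, ""); docno = $0 }
  /<TEXT>/   { intext = 1; next }     # body text starts
  /<\/TEXT>/ { intext = 0; next }     # body text ends
  intext     { print docno "\t" $0 }  # emit "docno <tab> text line"
' "$SRC"
```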

3
First Indexing Trial
  1. FBIS Source Files
  2. Documents Separation: 1851 5513
  3. Documents Pass Stemmer: 3352 10058
  4. Documents Pass Stopper: 3323 10929
  5. Words Sort by AWK: 4407 11909
  6. Term Frequency Count and Inverted File Indexing (one file per word; sketched below): > 9 hours, never finished
  • When thinking about the indexing procedure, the most direct way is to do it step by step.
  • So in the first trial, I ran each step and saved its result as the input of the next step.
  • However, as the directory size grew, the time needed to write a file increased out of control.
  • The time needed to generate the index files seemed unacceptable, and the run was stopped after 9 hours.
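A minimal sketch of the step that never finished, the one-file-per-word inverted file: each posting is appended to a separate file named after its word. The "word docno tf" input format is an assumption; the point is only that every append touches one of hundreds of thousands of files in a single directory.

```sh
#!/bin/sh
# Hypothetical sketch of the abandoned step 6: one index file per word.
# Input lines are assumed to look like "word docno tf", already sorted.
SORTED="$1"
OUTDIR=index
mkdir -p "$OUTDIR"

while read word docno tf; do
  # every append must locate "$OUTDIR/$word" among all the word files,
  # which is what blew up the running time as the directory grew
  echo "$docno $tf" >> "$OUTDIR/$word"
done < "$SORTED"
```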

4
Second Indexing Trial
  1. FBIS Source Files
  2. Documents Separation: 2329 5836
  3. Documents Pass Stemmer: 3005 10726
  4. Documents Pass Stopper: 2234 5229
  5. Words Sort by AWK: 2244 4827
  6. Words Count and Indexing
  7. Two-Character Directory Separating: 5
  8. Word Files Indexing: 124100, break
  • Generating the index still took too much time.
  • This seemed to be caused by the number of files in a single directory.
  • So I tried to set up 26×26 sub-directories based on the first two characters of each word and spread the index files across them (see the sketch below).
  • However, it still took too long, and this trial was stopped after it finished FBIS3, almost 13 hours in.
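A sketch of the 26×26 split, under the same assumed "word docno tf" line format: the per-word files are spread over sub-directories named by the first two characters of the word, so no single directory has to hold them all.

```sh
#!/bin/sh
# Hypothetical sketch of the 26x26 sub-directory scheme from this trial.
SORTED="$1"
OUTDIR=index

# pre-create the 26x26 directories a..z / a..z
for c1 in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
  for c2 in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    mkdir -p "$OUTDIR/$c1/$c2"
  done
done

while read word docno tf; do
  case "$word" in
    [a-z][a-z]*) ;;   # need at least two lowercase letters for the bucket
    *) continue ;;
  esac
  c1=$(echo "$word" | cut -c1)
  c2=$(echo "$word" | cut -c2)
  echo "$docno $tf" >> "$OUTDIR/$c1/$c2/$word"
done < "$SORTED"
```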

5
Third Indexing Trial
  1. FBIS Source Files
  2. Documents Separation: 2015 10938
  3. Documents Pass Stemmer: 2925 5542
  4. Documents Pass Stopper and Sort: 3417 10548
  5. Words Count and Indexing
  6. Suffix Directory Separating: 6
  7. Word Files Indexing (break after 11 hours)
  • Well, before finding a way to solve the time problem of the indexing step itself, the earlier steps also cost a lot of time.
  • I tried to combine the steps with a pipeline, but this only worked when using the system sort command (see the sketch below).
  • After merging the stopper and sort steps, at least one hour was saved.
  • The time cost was still far from acceptable.
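The pipelining of this trial can be sketched as below. The stemmer and stopper names stand for the course tools, and their file and stdout interfaces are assumptions; the part the slides confirm is that only the system sort could be fused into a pipe, which saves writing and re-reading one intermediate file.

```sh
#!/bin/sh
# Hypothetical sketch of the third trial's partial pipelining:
# the stopper output is piped straight into the system sort.
DOC="$1"                                      # one separated document

stemmer "$DOC" > "$DOC.stem"                  # still file-to-file
stopper "$DOC.stem" | sort > "$DOC.sorted"    # stopper + sort fused in a pipe
rm "$DOC.stem"
```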

6
Fourth Indexing Trial
  1. FBIS Source Files: 3351 10038
  2. Documents Separation
  3. Documents Pass Stemmer
  4. Documents Pass Stopper and Sort
  5. Words Count and Indexing
  6. Suffix Directory Separating: 2
  7. Word Files Indexing: 131423 141512
  • I finally found that the time was mostly spent searching for the location of the next write, which is a space-allocation characteristic of Linux file systems.
  • So I combined the earlier steps into one run per source file, from the raw file to the sorted word list; every intermediate file is removed as soon as the next step has used it (see the sketch below).
  • The time cost decreased amazingly: only one third of the time used in the last trial.
  • Indexing was finished for the first time, after 29 hours.
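A sketch of the fourth trial's driver loop, with hypothetical helper names (split_docs for the document separation, stemmer and stopper for the course tools): each FBIS source file is taken from raw text to a sorted word list in one run, and each intermediate file is deleted as soon as the next step has consumed it.

```sh
#!/bin/sh
# Hypothetical sketch of the per-source-file run with immediate cleanup
# of intermediate files (helper names are assumptions).
for SRC in fbis3/* fbis4/*; do
  BASE=$(basename "$SRC")

  split_docs "$SRC" > "$BASE.docs"                      # documents separation
  stemmer "$BASE.docs" > "$BASE.stem"; rm "$BASE.docs"  # stem, drop input
  stopper "$BASE.stem" | sort > "$BASE.sorted"; rm "$BASE.stem"
done
```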

7
Fifth Indexing Trial
  1. For Each FBIS Source File: 11026 11929
  2. Documents Separation
  3. Documents Pass Stemmer
  4. Documents Pass Stopper and Sort
  5. Words Count and Database Indexing
  • The indexing still took far too long, and I really wanted to find a way to decrease the time cost.
  • A file-format database might be a solution.
  • So I adopted sqlite and wrote all my index lines as table rows into a single database file (see the sketch below).
  • The time cost immediately went down to two and a half hours in total, which was amazing.
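A sketch of the sqlite step, assuming a simple (word, docno, tf) table; the table name, column names, and the use of the sqlite3 shell (rather than the older sqlite 2 shell) are assumptions. Batching the inserts inside one transaction is what makes writing a single database file so much faster than creating thousands of small index files.

```sh
#!/bin/sh
# Hypothetical sketch of writing the counted index lines into sqlite.
DB=index.db
COUNTED="$1"     # lines assumed to look like "word docno tf"

sqlite3 "$DB" 'CREATE TABLE IF NOT EXISTS inverted (word TEXT, docno TEXT, tf INTEGER);'

# turn each line into an INSERT and load the whole batch in one transaction
( echo 'BEGIN;'
  awk -v q="'" '{ printf "INSERT INTO inverted VALUES (%s%s%s, %s%s%s, %d);\n", q, $1, q, q, $2, q, $3 }' "$COUNTED"
  echo 'COMMIT;'
) | sqlite3 "$DB"
```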

8
Indexing - Level Analysis
  1. For Each FBIS Source File: 10853 11639 vs. 22257
     document count: 61578 → 130417 vs. 130417 (same)
     file size: 262877184 → 542937088 vs. same
  2. Documents Separation
  3. Documents Pass Stemmer
  4. Documents Pass Stopper and Sort
  5. Words Count and Database Indexing
  • Since the whole indexing could now be done in 2.5 hours, I then tried to measure the influence of the indexing level.
  • I indexed FBIS3 and then FBIS4 separately, then combined them into one set and tried again.
  • The time costs were nearly the same, and the document counts and file sizes were all equal.
  • This is not at all surprising, because the working procedure does not add any outside information.

9
Sixth Indexing Trial
  1. For Each FBIS Source File: 3549 3947 / 3304 3543
     file size: 176340992 → 365469696
  2. Documents Separation
  3. Documents Pass Stemmer
  4. Documents Pass Stopper and Sort
  5. Words Count and Write into a Single Indexing File
  • While revisiting the fourth and fifth trials, I figured that maybe the problem was the number of index files.
  • So I tried to write all the index lines into a single file (see the sketch below).
  • Two variants were tried:
  • Write after counting the term frequency of each word.
  • Append after computing all the frequencies of a document.
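Both single-file variants can be sketched under the same assumed "word docno tf" line format: either write the lines one by one as each word's count is produced, or append a whole document's counted lines in one go. Only the target, a single index file, is what the slide confirms; the details are illustrative.

```sh
#!/bin/sh
# Hypothetical sketch of the single-index-file variants of this trial
# (in practice only one of the two variants would be used).
COUNTED="$1"        # counted "word docno tf" lines for one source file
INDEX=all.index     # the single index file

# variant 1: write line by line, as each word's term frequency is counted
while read word docno tf; do
  echo "$word $docno $tf" >> "$INDEX"
done < "$COUNTED"

# variant 2: append the whole per-document batch at once
cat "$COUNTED" >> "$INDEX"
```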

10
Seventh Indexing Trial
  1. For Each FBIS Source File: 4438 5032
     file number: 646 → 655
     total file size: 178606080 → 367759360
  2. Documents Separation
  3. Documents Pass Stemmer
  4. Documents Pass Stopper and Sort
  5. Words Count and Write into 26×26 Indexing Files
  • When considering both querying and indexing, a single index file is just too large and would take a long time to search for the wanted terms.
  • So I modified the final step to write the index lines into different files based on the two leading characters of each word, as in the second trial (see the sketch below).
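A sketch of the final indexing scheme, again under the assumed "word docno tf" line format: the index lines are routed into one of 26×26 files keyed by the first two characters of the word, so a query only has to scan the one small file that can contain its term.

```sh
#!/bin/sh
# Hypothetical sketch of splitting the index into 26x26 files by the
# two leading characters of each word.
COUNTED="$1"
OUTDIR=index26
mkdir -p "$OUTDIR"

while read word docno tf; do
  key=$(echo "$word" | cut -c1-2)                  # e.g. "re" for "retriev"
  echo "$word $docno $tf" >> "$OUTDIR/$key.index"
done < "$COUNTED"
```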

11
Indexing Time
12
First Topic Query
  1. Extract Topics from Source Files and Pass Stemmer and Stopper: 1
  2. Select Per-Keyword Data from Index Database or Index File (sketched below)
  3. Weight Computing
  4. Ranking and Filtering
  5. Evaluation
  • Five query topics, 15 keywords in total.
  • Total time to query:
  • Index database: 1338 → 3127
  • Single index file: 900 → 1839
  • Separated index files: 2 04
  • This seems not efficient enough; if several terms are examined together, more time should be saved.
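The per-keyword lookup of this first query round can be sketched as below, reusing the assumed table layout from the earlier sqlite sketch: one SELECT per term, so fifteen keywords mean fifteen separate scans of the index.

```sh
#!/bin/sh
# Hypothetical sketch of the per-keyword lookup against the sqlite index
# (table/column names are the same assumptions as in the indexing sketch).
DB=index.db

for term in $(cat topic.terms); do     # one stemmed, stopped term per line
  sqlite3 "$DB" "SELECT docno, tf FROM inverted WHERE word = '$term';"
done
```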

13
Second Topic Query
  1. Extract Topics from Source Files and Pass Stemmer
    and Stopper
  2. Generate One Query String for Each Topic (sketched below)
  3. Select Data from Index Database or Index File
  4. Weight Computing
  5. Ranking and Filtering
  6. Evaluation
  • Total time to query:
  • Index database: 230 → 519
  • Single index file: 226 → 455
  • Separated index files: not much progress expected, since each queried file needs to be checked separately.
  • But as the number of query terms increases, using separated index files would save a lot more search time.
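The per-topic query string can be sketched as a single SELECT with an IN (...) list built from all of the topic's terms, so the index is scanned once per topic instead of once per term. Schema and file names are the same assumptions as before.

```sh
#!/bin/sh
# Hypothetical sketch of one query string per topic.
DB=index.db

# build "'term1','term2',..." from the topic's term list
TERMS=$(awk -v q="'" '{ printf "%s%s%s%s", sep, q, $1, q; sep = "," }' topic.terms)

sqlite3 "$DB" "SELECT word, docno, tf FROM inverted WHERE word IN ($TERMS);"
```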

14
Updated Topic Query
  1. Extract Topics from Source Files and Pass Stemmer
    and Stopper
  2. Generate Query Strings based on frequency of each
    term
  3. Select Data from Index Database or Index File
  4. Weight Computing
  5. Ranking and Filtering
  6. Evaluation
  • Some of the terms in the topics return far too many documents and do not seem to help at all.
  • I checked the document frequency of each term and removed the high-frequency (> 10) terms.
  • This did not work; some more related terms are needed for better precision.

15
Frequency Term Query
  1. Select Some Terms based on Descriptions,
    Narratives and web queries for each topic
  2. Order these terms based on document frequency of
    each word
  3. Deciding the Number of Terms to Use and Generate
    Query Strings
  4. The Following Steps are same as before
  • The number of terms was varied from five to 100.
  • Precision increases only at the beginning, while the first terms are being added.
  • Meanwhile the query time rises proportionally as the number of query terms grows.
  • Terms of high document frequency were removed; the thresholds tried were 10 and 20 (see the sketch below).
  • The stricter frequency limit (10) seems to help.
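The frequency-based term selection can be sketched as below: look up each candidate term's document frequency, drop the terms above the threshold, order the rest by ascending document frequency, and keep the first N. The threshold values 10 and 20 come from the slide; the schema, the file names, and the exact meaning of the threshold are assumptions.

```sh
#!/bin/sh
# Hypothetical sketch of choosing query terms by document frequency.
DB=index.db
THRESHOLD=10     # document-frequency cut-off (10 and 20 were tried)
N=20             # number of query terms to keep (5..100 were tried)

for term in $(cat candidate.terms); do
  df=$(sqlite3 "$DB" "SELECT COUNT(DISTINCT docno) FROM inverted WHERE word = '$term';")
  echo "$df $term"
done |
  sort -n |                                   # rarest terms first
  awk -v t="$THRESHOLD" '$1 + 0 <= t + 0' |   # drop high-frequency terms
  head -n "$N" |                              # keep the first N terms
  awk '{ print $2 }' > topic.terms
```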

16
Query Topic
17
Query Updated Topic
18
Query Terms
19
Query Time
query term topic 5 10 15 20 30 40 60 80 100
db FBIS 3 30 44 71 100 126 180 234 347 462 582
db FBIS 34 63 94 147 202 258 372 484 721 953 1197
file FBIS 3 43 67 90 117 143 192 245 349 476 594
file FBIS 34 89 135 188 243 343 404 510 722 986 1232
20
Conclusion
  • As I examined the index file and the term frequencies I generated, I found that many terms seem to be useless.
  • They may be meaningless, like "aaaf", or misspelled, like "internacion".
  • Some terms have a frequency count of less than three.
  • If these terms were removed, I suppose the queries would run even faster.
  • I could have spent more time sorting and indexing the inverted file.
  • However, when I tried part of this, the time it took made me wonder whether it was worthwhile.
  • Maybe a cache of recent queries would be better than a full sort.
  • Well, this is the end of my project report.