N-gram Search Engine on Wikipedia - PowerPoint PPT Presentation


Slides: 18

Provided by: csJ8

Learn more at: https://www.cs.jhu.edu

Transcript and Presenter's Notes

Title: N-gram Search Engine on Wikipedia

1
N-gram Search Engine on Wikipedia

Satoshi Sekine (NYU)
Kapil Dalwani (JHU)

2
Hammer: Fast and multi-functional n-gram search engine
Search n-grams FAST
INPUT: token, POS, chunk, NE
OUTPUT: frequency to text
3
Characteristics

Search up to 7-grams, with wildcards
Multi-level input
  Token, POS, chunk, NE, and combinations
  NOT, OR for POS, chunk, NE
Multi-level output
  Token, POS, chunk, NE
  Document information
  Original sentences, KWIC, n-gram
Display
  Show the results in order of frequency
Running environment
  Single CPU, PC-Linux, 400 MB process, 500 GB disk

4
Demo

http://linserv1.cims.nyu.edu:23232/ngram_wikipedia
2

5
Available for you

Web system
  At NYU: http://nlp.cs.nyu.edu/nsearch
  At JHU?
USB hard drive

6
Implementation: Overview
Suffix array for text
N-gram data
Inverted index for n-gram data
Search request
Wikipedia text
POS, chunk, NE for N-gram data
Wikipedia POS, chunk, NE
8
From n-gram to Inverted Index

Example 3-grams

Ngram ID  Position1  Position2  Position3
1         A          B          C
2         A          B          B
3         B          A          C

Posting list

A pos1: 1 2
A pos2: 3
B pos1: 3
B pos2: 1 2
B pos3: 2
C pos3: 1 3
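
The mapping from the example 3-grams to posting lists can be sketched in a few lines of Python. This is only an illustration of the idea on this slide, not the authors' implementation; the function name `build_positional_index` and the dictionary layout are assumptions made for the example.

```python
from collections import defaultdict

def build_positional_index(ngrams):
    """Map (token, position) -> sorted list of n-gram IDs.

    `ngrams` maps an n-gram ID to its tuple of tokens; positions are 1-based,
    matching the "A pos1", "B pos2" labels on the slide.
    """
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):
        for position, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, position)].append(ngram_id)
    return dict(index)

# The three example 3-grams from the slide.
ngrams = {
    1: ("A", "B", "C"),
    2: ("A", "B", "B"),
    3: ("B", "A", "C"),
}
index = build_positional_index(ngrams)

for (token, position), ids in sorted(index.items()):
    print(f"{token} pos{position}: {' '.join(map(str, ids))}")
```

Because IDs are appended in increasing order, every posting list comes out sorted, which is what the intersection step later relies on.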
9
Posting list

Posting list sizes vary widely (7-grams: 1.27B in total)
  Largest: EOS (100,906,888), "," (55,644,989), "the" (33,762,672)
  Smallest: conscipcuous, consiety, Mizuk (1)
3 types of posting list, for faster search and smaller index size
  Bitmap (freq > 1)
    EOS: 1.27B bits (bitmap) vs. 3.2B bits (list)
  List of n-gram IDs
  Encoded into the pointer (freq = 1)

[Figure: examples of the three types: a bitmap (1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1), a list (C pos3: 1 3), and a pointer-encoded entry (C pos3: 5)]
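
The bitmap-versus-list figures for EOS can be reproduced with simple arithmetic. The sketch below assumes 32-bit n-gram IDs (which gives the 3.2B-bit list size quoted for EOS) and picks whichever encoding is smaller; the slide does not state the exact cut-off the authors used, so the rule in `choose_representation` is an assumption for illustration.

```python
ID_BITS = 32                   # assumption: each n-gram ID is stored in 32 bits
TOTAL_7GRAMS = 1_270_000_000   # number of distinct 7-grams, from the slide

def bitmap_bits(total_ngrams):
    """A bitmap needs one bit per n-gram ID, regardless of frequency."""
    return total_ngrams

def list_bits(freq, id_bits=ID_BITS):
    """An explicit list needs one ID per entry."""
    return freq * id_bits

def choose_representation(freq, total_ngrams=TOTAL_7GRAMS, id_bits=ID_BITS):
    """Hypothetical selection rule: smallest encoding wins."""
    if freq == 1:
        return "pointer"       # the single ID is stored in place of the pointer
    if bitmap_bits(total_ngrams) < list_bits(freq, id_bits):
        return "bitmap"
    return "list"

eos_freq = 100_906_888
print(bitmap_bits(TOTAL_7GRAMS))        # 1.27B bits
print(list_bits(eos_freq))              # about 3.2B bits
print(choose_representation(eos_freq))
```

Under these assumptions the break-even point is a frequency of about 1/32 of all n-grams, so only a handful of very frequent tokens would be stored as bitmaps.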
10
Search

Given an n-gram request (A B C)
Get the posting lists for A, B and C
Intersect the posting lists
Use look-ahead (skipping) to speed up the intersection
  Look-ahead size: sqrt(size of posting list)
  Moffat and Zobel (1996)

Short list: 4 33 34 55 76 80 89 92 99
Long list (with SKIP): 4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
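
A minimal sketch of intersecting two sorted posting lists with a look-ahead of sqrt(list size), in the spirit of the skipping described on this slide. The function name and the exact skipping loop are assumptions for illustration, not the engine's actual code; the two lists are the ones shown in the slide's figure.

```python
import math

def intersect_with_skips(short, long):
    """Intersect two sorted lists of n-gram IDs.

    For each ID in `short`, jump through `long` in strides of
    sqrt(len(long)) while the look-ahead element is still <= the ID,
    then finish with a short linear scan.
    """
    skip = max(1, math.isqrt(len(long)))
    result = []
    j = 0
    for x in short:
        # Look ahead: skip a whole stride if it cannot overshoot x.
        while j + skip < len(long) and long[j + skip] <= x:
            j += skip
        # Linear scan inside the stride.
        while j < len(long) and long[j] < x:
            j += 1
        if j < len(long) and long[j] == x:
            result.append(x)
    return result

short = [4, 33, 34, 55, 76, 80, 89, 92, 99]
long = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
common = intersect_with_skips(short, long)
print(common)  # [4, 33, 76, 89]
```

For a three-token request the same function can be applied pairwise, starting from the shortest posting list so that the intermediate result stays small.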
11
Implementation: Overview
1. Search candidates
Suffix array for text
N-gram data
Inverted index for n-gram data
Search request
Wikipedia text
POS, chunk, NE for N-gram data
Wikipedia POS, chunk, NE
12
Filtering

Not all candidate n-gram IDs match the request
We need frequency and sentence information for the matched n-grams
POS, chunk and NE information is represented as IDs
  Reduces the index by more than 200 GB

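[Figure: the n-gram "A B" (Freq 123) broken down by annotation pattern, with tags NN, PERSON, VB, LOC and per-pattern frequencies Freq 10 and Freq 5]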
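
The filtering step can be pictured as checking each candidate's tags against per-position constraints, including the OR and NOT operators listed under Characteristics. The sketch below is hypothetical: the candidate tuples, the `("OR", ...)` / `("NOT", ...)` constraint encoding, and the function names are all assumptions, and tag names are kept as strings for readability even though the slide says the real index stores them as IDs.

```python
def tag_matches(tag, constraint):
    """Check one tag against one constraint.

    constraint is None (wildcard), ("OR", {tags}) or ("NOT", {tags}).
    """
    if constraint is None:
        return True
    kind, tags = constraint
    if kind == "OR":
        return tag in tags
    if kind == "NOT":
        return tag not in tags
    raise ValueError(f"unknown constraint kind: {kind}")

def filter_candidates(candidates, pos_constraints):
    """Keep candidates whose POS tags satisfy every positional constraint.

    Each candidate is (ngram_id, pos_tags, freq).
    """
    return [
        (ngram_id, pos_tags, freq)
        for ngram_id, pos_tags, freq in candidates
        if len(pos_tags) == len(pos_constraints)
        and all(tag_matches(t, c) for t, c in zip(pos_tags, pos_constraints))
    ]

# Made-up candidates for a 2-gram request.
candidates = [
    (1, ("NN", "VB"), 10),
    (2, ("VB", "NN"), 5),
    (3, ("JJ", "NN"), 2),
]
# Position 1: NN or JJ; position 2: anything except VB.
constraints = [("OR", {"NN", "JJ"}), ("NOT", {"VB"})]
kept = filter_candidates(candidates, constraints)
print(kept)  # [(3, ('JJ', 'NN'), 2)]
```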
13
Implementation: Overview
2. Filtering
Suffix array for text
N-gram data
Inverted index for n-gram data
Search request
Wikipedia text
POS, chunk, NE for N-gram data
Wikipedia POS, chunk, NE
14
Display

N-grams are displayed in descending order of frequency
  N-gram IDs are ordered by frequency
Sentences are retrieved using the suffix array
POS, chunk and NE are displayed with the sentence, KWIC and n-gram
Doc ID and Wikipedia article title (and possibly other document features) are displayed with sentences and KWIC
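
How a suffix array locates the sentences for a matched n-gram can be shown on a toy token sequence. This is a naive sketch (the suffix array is built by plain sorting, which would not scale to 1.7G words), and the function names and the KWIC formatting are assumptions for illustration.

```python
def build_suffix_array(tokens):
    """Naive token-level suffix array: start offsets sorted by suffix."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, query):
    """Return sorted start offsets where `query` occurs, via binary search."""
    q = list(query)
    m = len(q)

    def prefix(k):
        start = suffix_array[k]
        return tokens[start:start + m]

    # Lower bound: first suffix whose m-token prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < q:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-token prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= q:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])

def kwic(tokens, start, length, width=2):
    """Format one hit as left context [match] right context."""
    left = " ".join(tokens[max(0, start - width):start])
    match = " ".join(tokens[start:start + length])
    right = " ".join(tokens[start + length:start + length + width])
    return f"{left} [{match}] {right}".strip()

tokens = "the cat sat on the mat and the cat ran".split()
sa = build_suffix_array(tokens)
hits = find_occurrences(tokens, sa, ["the", "cat"])
print(hits)  # [0, 7]
for h in hits:
    print(kwic(tokens, h, 2))
```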

15
Size of data
Text: 1.7G words, 200M sentences, 2.4M articles
N-grams: 1-gram 8M, 2-gram 93M, 3-gram 377M, 4-gram 733M, 5-gram 1.00B, 6-gram 1.17B, 7-gram 1.27B
Total: 530 GB

Component                        Size
Suffix array for text            8 GB
N-gram data                      260 GB
Inverted index for n-gram data   108 GB
Wikipedia text                   8 GB
POS, chunk, NE for N-gram data   100 GB
Wikipedia POS, chunk, NE         6 GB
Others                           40 GB
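
As a quick consistency check, the seven component sizes quoted on this slide add up to the stated 530 GB total, and the per-order n-gram counts sum to roughly 4.65 billion distinct n-grams. The snippet only does this arithmetic; the variable names are made up for the example.

```python
# Component sizes in GB, in the order they appear on the slide.
component_sizes_gb = [8, 260, 108, 8, 100, 6, 40]
total_gb = sum(component_sizes_gb)

# Distinct n-gram counts in millions, for n = 1..7.
ngram_counts_millions = [8, 93, 377, 733, 1000, 1170, 1270]
total_ngrams_millions = sum(ngram_counts_millions)

print(total_gb)               # 530
print(total_ngrams_millions)  # 4651
```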
16
Future Work