Title: N-gram Search Engine on Wikipedia
1N-gram Search Engine on Wikipedia
- Satoshi Sekine (NYU)
- Kapil Dalwani (JHU)
2Hammer: Fast and multi-functional n-gram search engine
- Search n-grams FAST
- INPUT: token, POS, chunk, NE
- OUTPUT: frequency to text
3Characteristics
- Search up to 7-grams with wildcards
- Multi-level input
  - Token, POS, chunk, NE, and combinations
  - NOT, OR for POS, chunk, NE
- Multi-level output
  - Token, POS, chunk, NE
  - Document information
  - Original sentences, KWIC, n-gram
- Display
  - Show the results in order of frequency
- Running environment
  - Single CPU, PC-Linux, 400MB process, 500GB disk
4Demo
- http://linserv1.cims.nyu.edu:23232/ngram_wikipedia
5Available for you
- Web system
  - At NYU
    - http://nlp.cs.nyu.edu/nsearch
  - At JHU?
- USB hard drive
6Implementation Overview
[Figure: system components: search request; inverted index for n-gram data; N-gram data; POS, chunk, NE for N-gram data; suffix array for text; Wikipedia text; Wikipedia POS, chunk, NE]
8From n-gram to Inverted Index
- Example 3-grams
Ngram ID  Position1  Position2  Position3
1         A          B          C
2         A          B          B
3         B          A          C
- Posting list
Token, position  Ngram IDs
A pos1           1 2
A pos2           3
B pos1           3
B pos2           1 2
B pos3           2
C pos3           1 3
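The step from the example 3-grams to the posting lists can be sketched in a few lines of Python. This is only an illustration of the idea, not the authors' implementation; the function name `build_positional_index` and the in-memory dictionary layout are assumptions made for this example.

```python
from collections import defaultdict

def build_positional_index(ngrams):
    """Map (token, position) -> sorted list of n-gram IDs.

    `ngrams` is a dict {ngram_id: tuple_of_tokens}. Positions are 1-based,
    matching the "A pos1" notation used on the slide.
    """
    index = defaultdict(list)
    for ngram_id in sorted(ngrams):  # visiting IDs in order keeps lists sorted
        for position, token in enumerate(ngrams[ngram_id], start=1):
            index[(token, position)].append(ngram_id)
    return dict(index)

# The three example 3-grams from the slide.
example = {1: ("A", "B", "C"), 2: ("A", "B", "B"), 3: ("B", "A", "C")}
index = build_positional_index(example)

for (token, position), ids in sorted(index.items()):
    print(f"{token} pos{position}: {ids}")
```

Keying the index on the (token, position) pair is what lets a request such as (A B C) be answered by intersecting the lists for A pos1, B pos2 and C pos3.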
9Posting list
- Wide variation of posting list size (7-grams: 1.27B)
  - EOS (100,906,888), "," (55,644,989), the (33,762,672)
  - conscipcuous, consiety, Mizuk (1)
- 3 types for faster speed and smaller index size
  - Bitmap (freq > 1)
    - EOS: 1.27B bits (bitmap) vs. 3.2B bits (list)
  - List of n-gram IDs
  - Encoded into pointer (freq = 1)
[Figure: the three encodings: a bitmap (1 0 0 0 1 1 0 1 0 0 0 0 1 0 0 1), an ID list (C pos3: 1 3), and an ID stored in the pointer (C pos3: 5)]
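The trade-off between the three posting list types can be illustrated with a small sketch. The selection rule below is an assumption, not the threshold used in the actual system: it picks a bitmap whenever a plain list of 32-bit IDs would need more bits than one bit per n-gram, which is consistent with the EOS figures above (about 3.2B bits as a list versus 1.27B bits as a bitmap). The names `choose_encoding` and `to_bitmap`, and the 32-bit ID width, are hypothetical.

```python
ID_BITS = 32  # assumed width of one n-gram ID in a plain list

def choose_encoding(freq, total_ngrams, id_bits=ID_BITS):
    """Pick a posting list representation (assumed rule, for illustration)."""
    if freq == 1:
        return "pointer"   # the single ID is stored in place of a list pointer
    if freq * id_bits > total_ngrams:
        return "bitmap"    # one bit per n-gram is smaller than the ID list
    return "list"

def to_bitmap(ngram_ids, total_ngrams):
    """Encode 1-based n-gram IDs as a list of 0/1 flags, leftmost flag = ID 1."""
    bits = [0] * total_ngrams
    for ngram_id in ngram_ids:
        bits[ngram_id - 1] = 1
    return bits

TOTAL_7GRAMS = 1_270_000_000
EOS_FREQ = 100_906_888

print(choose_encoding(EOS_FREQ, TOTAL_7GRAMS))   # EOS is very frequent
print(EOS_FREQ * ID_BITS)                        # bits needed as a plain list
print(to_bitmap([1, 5, 6, 8, 13, 16], 16))       # bitmap pattern from the slide
```

Reading the slide's bitmap left to right as IDs 1 to 16 is itself an assumption; the point is only that dense lists are cheaper as bitmaps and singleton lists need no list at all.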
10Search
- Given an n-gram request (A B C)
  - Get posting lists for A, B and C
  - Search intersections of posting lists
- Use look-ahead to speed up the search
  - Look-ahead size: sqrt(size of posting list)
  - Moffat and Zobel (1996)
[Figure: intersecting two sorted posting lists, with a SKIP over part of the longer list]
4 33 34 55 76 80 89 92 99
4 12 15 19 22 33 37 46 59 60 62 76 82 89 94 98
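A minimal sketch of intersecting two sorted posting lists with a square-root look-ahead is given below. It follows the general skipping idea credited to Moffat and Zobel (1996) but is not taken from the system's code; the function name `intersect_with_skips` is hypothetical, and the two lists are the ones shown on the slide.

```python
import math

def intersect_with_skips(short, long):
    """Intersect two sorted ID lists, skipping ahead in the longer one."""
    skip = max(1, math.isqrt(len(long)))  # look-ahead size ~ sqrt(list size)
    result = []
    pos = 0
    for target in short:
        # Jump a whole block while the element one skip ahead is not past target.
        while pos + skip < len(long) and long[pos + skip] <= target:
            pos += skip
        # Finish with a linear scan inside the current block.
        while pos < len(long) and long[pos] < target:
            pos += 1
        if pos == len(long):
            break  # the longer list is exhausted
        if long[pos] == target:
            result.append(target)
    return result

short_list = [4, 33, 34, 55, 76, 80, 89, 92, 99]
long_list = [4, 12, 15, 19, 22, 33, 37, 46, 59, 60, 62, 76, 82, 89, 94, 98]
common = intersect_with_skips(short_list, long_list)
print(common)
```

For a three-token request the same routine can be applied twice, intersecting the result for the first two lists with the third list.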
11Implementation Overview
[Figure: the component diagram again, marking step 1: Search candidates.]
12Filtering
- Not all candidate n-gram IDs match the request
- We need frequency and sentence information for the matched n-grams
- POS, chunk and NE information is represented as IDs
- This reduces the index by more than 200GB
[Figure: n-gram "A B" (Freq = 123) with tag labels NN, PERSON, VB, LOC and per-variant frequencies (Freq = 10, Freq = 5)]
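The filtering step can be sketched as follows: each candidate n-gram ID is looked up in an annotation store and kept only if one of its tagged variants satisfies the request, including NOT and OR constraints. Everything here is hypothetical and for illustration only: the store layout, the constraint encoding, the names `matches` and `filter_candidates`, and the sample tags and frequencies are not taken from the system.

```python
# Hypothetical annotation store:
# n-gram ID -> list of (POS tag per position, frequency of that tagged variant).
ANNOTATIONS = {
    1: [(("NN", "VB"), 10), (("NN", "NN"), 5)],
    2: [(("JJ", "NN"), 7)],
}

def matches(tags, constraints):
    """Check one tagged variant against per-position constraints.

    A constraint is None (unconstrained), ("any", {tags}) for OR,
    or ("not", {tags}) for NOT.
    """
    for tag, constraint in zip(tags, constraints):
        if constraint is None:
            continue
        kind, tag_set = constraint
        if kind == "any" and tag not in tag_set:
            return False
        if kind == "not" and tag in tag_set:
            return False
    return True

def filter_candidates(candidate_ids, constraints, annotations=ANNOTATIONS):
    """Return {ngram_id: summed frequency of the variants that match}."""
    result = {}
    for ngram_id in candidate_ids:
        freq = sum(f for tags, f in annotations.get(ngram_id, [])
                   if matches(tags, constraints))
        if freq:
            result[ngram_id] = freq
    return result

# Request: position 1 is NN or NNS, position 2 is anything except NN.
request = [("any", {"NN", "NNS"}), ("not", {"NN"})]
filtered = filter_candidates([1, 2], request)
print(filtered)
```

Storing tags as small integer IDs instead of strings, as the slide describes, would shrink such a store considerably; strings are used here only for readability.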
13Implementation Overview
[Figure: the component diagram again, marking step 2: Filtering.]
14Display
- N-grams are displayed in descending order of frequency
  - N-gram IDs are ordered by frequency
- Sentences are searched using the suffix array
- POS, chunk, NE are displayed with sentence, KWIC, n-gram
- Doc ID and Wikipedia title (and possibly other document features) are displayed with sentences and KWIC
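How a suffix array supports the sentence and KWIC lookup can be shown with a toy, token-level sketch. This is an assumption-laden illustration rather than the system's code: the naive construction below would not scale to Wikipedia, the sample sentence is invented, and the names `build_suffix_array`, `find_occurrences` and `kwic` are hypothetical.

```python
def build_suffix_array(tokens):
    """Naive token-level suffix array: start offsets sorted by their suffix."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, query):
    """Binary-search the suffix array for all start offsets of `query`."""
    n = len(query)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # first suffix whose n-token prefix is >= query
        mid = (lo + hi) // 2
        if tokens[suffix_array[mid]:suffix_array[mid] + n] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # first suffix whose n-token prefix is > query
        mid = (lo + hi) // 2
        if tokens[suffix_array[mid]:suffix_array[mid] + n] <= query:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

def kwic(tokens, position, n, width=3):
    """Format a keyword-in-context line around the match at `position`."""
    left = " ".join(tokens[max(0, position - width):position])
    key = " ".join(tokens[position:position + n])
    right = " ".join(tokens[position + n:position + n + width])
    return f"{left} [{key}] {right}".strip()

tokens = "the cat sat on the mat and the cat ran".split()
suffix_array = build_suffix_array(tokens)
hits = find_occurrences(tokens, suffix_array, ["the", "cat"])
for position in hits:
    print(kwic(tokens, position, 2))
```

Because all suffixes starting with the same n-gram are adjacent in the sorted array, two binary searches are enough to locate every occurrence.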
15Size of data
Text: 1.7G words, 200M sentences, 2.4M articles
N-gram counts:
N  Count
1  8M
2  93M
3  377M
4  733M
5  1.00B
6  1.17B
7  1.27B
Index sizes (total 530GB):
Component                        Size
Suffix array for text            8 GB
N-gram data                      260 GB
Inverted index for n-gram data   108 GB
Wikipedia text                   8 GB
POS, chunk, NE for N-gram data   100 GB
Wikipedia POS, chunk, NE         6 GB
Others                           40 GB
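As a quick consistency check on the figures above, the component sizes and n-gram counts can be summed. The numbers are the rounded values from the slide, so the totals are approximate; the variable names are only for this example.

```python
# Rounded n-gram counts per order, as listed on the slide.
ngram_counts = {
    1: 8_000_000,
    2: 93_000_000,
    3: 377_000_000,
    4: 733_000_000,
    5: 1_000_000_000,
    6: 1_170_000_000,
    7: 1_270_000_000,
}

# Component sizes in GB, as listed on the slide.
component_sizes_gb = [8, 260, 108, 8, 100, 6, 40]

total_ngrams = sum(ngram_counts.values())
total_size_gb = sum(component_sizes_gb)

print(f"total n-grams: {total_ngrams:,}")
print(f"total size: {total_size_gb} GB")
```

The component sizes add up to the stated total of 530GB, and the index covers roughly 4.65 billion distinct n-grams across orders 1 to 7.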
16Future Work
- Other information (e.g., parse, coref, relation, genre, discourse)
- Longer n-grams
- Compress index, dictionary
- Ease the indexing load
  - Currently we need a big-memory machine
  - Distributed indexing
- Union operation for tokens
17Available for you
- Web demo
  - At NYU
    - http://nlp.cs.nyu.edu/nsearch
  - At JHU?
- USB hard drive