Design Challenges and Misconceptions in Named Entity Recognition
Lev Ratinov and Dan Roth

The named entity recognition (NER) problem: identify people, locations, organizations and other entities in text.
Demo: http://L2R.cs.uiuc.edu/cogcomp/demos.php
Download: http://L2R.cs.uiuc.edu/cogcomp/asoftware.php
A typical solution models NER as a structured prediction problem (HMM/CRF).
How to represent NEs? There are at least 5 representation schemes that vary in the way they mark the beginning, the end and single-token entities. The letters used are B(egin), E(nd), L(ast), I(nside), U(nit), and O(ther). The more states we use, the more expressive the model is, but it may require more training data to converge.
Choice of inference algorithm: Given that NER is a sequential problem, Viterbi is the most popular inference algorithm in NER. It computes the most likely label assignment in a first-order HMM/CRF in time that is quadratic in the number of states and linear in the input size. When non-local features are used, exact dynamic programming with Viterbi becomes intractable, and beam search is usually used. We show that greedy left-to-right decoding (beam search of size one) performs comparably to Viterbi and beam search, while being much faster. The reason is that in NER, short NE chunks are separated by many O-labeled tokens, which break the likelihood maximization problem into short independent chunks, where classifiers with local features perform well. The non-local dependencies in NER have a very long range, which second-order state transition features fail to model. With the label set Per/Org/Loc/Misc/O and the BILOU encoding, there are 17 states (B-Per, I-Per, U-Per, etc.), and with second-order Viterbi we need to consider 17^3 = 4,913 transitions at each step, in contrast to only 17 decisions in greedy decoding.
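A minimal sketch of greedy left-to-right decoding (beam search of size one), assuming a hypothetical score() function; this is an illustration of fixing each token's label from the already-decoded left context, not the authors' LBJ implementation.

def greedy_decode(tokens, labels, score):
    """Greedy left-to-right decoding: beam search with a beam of size one."""
    output = []
    prev = "O"  # label of the (virtual) token before the sentence
    for i in range(len(tokens)):
        # score() is assumed to score label y at position i given the
        # previously predicted label and local features of the tokens.
        best = max(labels, key=lambda y: score(tokens, i, prev, y))
        output.append(best)
        prev = best
    return output

# Toy usage with a trivial scorer: capitalized tokens look like single-token names.
labels = ["O", "U-Per"]
def toy_score(tokens, i, prev, y):
    is_cap = tokens[i][0].isupper()
    return 1.0 if (y == "U-Per") == is_cap else 0.0

print(greedy_decode(["John", "met", "Mary", "yesterday"], labels, toy_score))
# ['U-Per', 'O', 'U-Per', 'O']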
How to inject knowledge? We investigated two knowledge resources: word class models extracted from unlabeled text, and gazetteers extracted from Wikipedia. The approaches are independent and their contributions accumulate. Word class models hierarchically cluster words based on a distributional-similarity measure: two clusters are merged if they tend to appear in similar contexts. For example, clustering Friday and Monday together allows us to abstract both as a day of the week and helps deal with the data sparsity problem. Deciding which level of abstraction to use is challenging; however, in a binary clustering each word can be assigned, similarly to Huffman encoding, the 0/1 path from the root of the cluster tree to its leaf. For example, the encoding of IBM is 011. By taking prefixes of different lengths, we can consider different levels of abstraction.
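A small sketch of turning cluster-path prefixes into features at several levels of abstraction; the bit-paths and prefix lengths below are made up for illustration, not the actual clusters used in the paper.

# Hypothetical bit-paths from a hierarchical (Brown-style) word clustering.
paths = {"Friday": "0101110", "Monday": "0101111", "IBM": "011"}

def cluster_prefix_features(word, paths, prefix_lengths=(2, 4, 6)):
    """Word-class features: prefixes of the word's root-to-leaf bit path."""
    path = paths.get(word, "")
    return ["wc[:%d]=%s" % (n, path[:n]) for n in prefix_lengths if path]

print(cluster_prefix_features("Friday", paths))  # ['wc[:2]=01', 'wc[:4]=0101', 'wc[:6]=010111']
print(cluster_prefix_features("Monday", paths))  # shares the same short prefixes as Friday

Shorter prefixes give coarser word classes, so Friday and Monday collapse to the same feature at low prefix lengths while remaining distinguishable at the leaves.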
Gazetteers: We use a collection of 14 high-precision, low-recall lists extracted from the web that cover common names, countries, monetary units, temporal expressions, etc. While these gazetteers have excellent accuracy, they do not provide sufficient coverage. To further improve the coverage, we have extracted 16 gazetteers from Wikipedia, which collectively contain over 1.5M entities. Wikipedia is an open, collaborative encyclopedia with several attractive properties. (1) It is kept up to date manually by its contributors, hence new entities are constantly added to it. (2) It contains redirection pages, mapping several spelling variations of the same name to one canonical entry; for example, Suker is redirected to the entry about Davor Šuker, the Croatian footballer. (3) The entries in Wikipedia are manually tagged with categories; for example, the entry for Microsoft has categories such as Companies listed on NASDAQ, Cloud computing vendors, etc. We used the category structure in Wikipedia to group the titles and the corresponding redirects into higher-level concepts, such as locations, organizations, artworks, etc.
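A rough sketch of how gazetteer matches might be encoded as token-level features; the tiny gazetteer, the BILOU-style match encoding and the helper name are illustrative assumptions, not the paper's exact feature set.

# Hypothetical gazetteer: concept name -> set of (lowercased) multi-token entries.
gazetteers = {"LOC": {("golan", "heights"), ("israel",), ("syria",)}}

def gazetteer_features(tokens, gazetteers, max_len=5):
    """Mark each token participating in a gazetteer match with a B/I/L/U-style feature."""
    feats = [[] for _ in tokens]
    lowered = [t.lower() for t in tokens]
    for name, entries in gazetteers.items():
        for i in range(len(tokens)):
            for j in range(i + 1, min(len(tokens), i + max_len) + 1):
                if tuple(lowered[i:j]) in entries:
                    if j - i == 1:
                        feats[i].append("U-gaz-" + name)
                    else:
                        feats[i].append("B-gaz-" + name)
                        for k in range(i + 1, j - 1):
                            feats[k].append("I-gaz-" + name)
                        feats[j - 1].append("L-gaz-" + name)
    return feats

print(gazetteer_features(["retain", "the", "Golan", "Heights"], gazetteers))
# [[], [], ['B-gaz-LOC'], ['L-gaz-LOC']]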
How to model non-local information? Common intuition: multiple occurrences of the same token should be assigned the same label. We compared three approaches proposed in the literature, and found that no single approach consistently outperformed the rest on a collection of 5 datasets. However, the combination of the approaches was the most stable and generally performed the best.
Context aggregation: collect the contexts of all instances of a token, extract features from all the contexts, and use them for all instances. For example, all the instances of FIFA in a document are given context-aggregation features extracted from every occurrence of FIFA in that document.
Two-stage prediction aggregation: the model tags the text in a first round of inference. Then, for each token, we mark the most frequent label it was assigned across its multiple occurrences, and if named entities containing the token were found in the document, we mark this information too. We use these statistics as features in a second round of inference.
Extended prediction history: the previous two techniques treat all the tokens in the document identically. However, it is often the case that the easiest named entities to identify occur at the beginning of the document. When using beam search or greedy decoding, we collect, for each token, statistics on how often it was previously assigned each of the labels, and use them as features.
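A compact sketch of the extended-prediction-history idea on top of greedy decoding; the feature naming and the choice to count all earlier occurrences of a token are illustrative assumptions, not the exact recipe from the poster.

from collections import Counter, defaultdict

def decode_with_history(tokens, labels, score):
    """Greedy decoding in which each token also sees the labels it received earlier in the document."""
    history = defaultdict(Counter)  # token -> counts of labels predicted so far
    output, prev = [], "O"
    for i, tok in enumerate(tokens):
        hist_feats = ["hist:%s=%d" % (y, c) for y, c in history[tok].items()]
        # score() is assumed to accept the history features alongside local ones.
        best = max(labels, key=lambda y: score(tokens, i, prev, y, hist_feats))
        history[tok][best] += 1
        output.append(best)
        prev = best
    return output

Because early, easier mentions are decoded first, later occurrences of the same token inherit their label statistics as features, which pushes the decoder toward consistent labels across the document.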
Token     BIO1    BIO2    IOE1    IOE2    BILOU
retain    O       O       O       O       O
the       O       O       O       O       O
Golan     I-loc   B-loc   I-loc   B-loc   B-loc
Heights   I-loc   I-loc   E-loc   E-loc   L-loc
Israel    B-loc   B-loc   I-loc   E-loc   U-loc
captured  O       O       O       O       O
from      O       O       O       O       O
Syria     I-loc   B-loc   I-loc   E-loc   U-loc
Key result: BILOU performs the best overall on the 5 datasets, and better than BIO2, which is the encoding most commonly used for NER.
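A small sketch of converting BIO2 tags to the BILOU encoding, matching the table above; a minimal illustration that assumes well-formed BIO2 input rather than a robust converter.

def bio2_to_bilou(tags):
    """Convert a BIO2 tag sequence (B-X / I-X / O) to BILOU (B/I/L/U/O)."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + etype  # does the entity continue at the next token?
        if prefix == "B":
            out.append(("B-" if continues else "U-") + etype)
        else:  # prefix == "I"
            out.append(("I-" if continues else "L-") + etype)
    return out

print(bio2_to_bilou(["O", "O", "B-loc", "I-loc", "B-loc", "O", "O", "B-loc"]))
# ['O', 'O', 'B-loc', 'L-loc', 'U-loc', 'O', 'O', 'U-loc']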
Conclusions: We have presented a conditional model for NER that uses an averaged perceptron (implementation: Learning Based Java (LBJ), http://L2R.cs.uiuc.edu/cogcomp/asoftware.php) with expressive features to achieve new state-of-the-art performance on the named entity recognition task. We explored four fundamental design decisions: text chunk representation, inference algorithm, non-local features, and external knowledge.
Supported by NSF grant SoD-HCER-0613885, by MIAS, a DHS-IDS Center for Multimodal Information Access and Synthesis at UIUC, and by the National Library of Congress project NDIIPP.