1
Disambiguation of References to Individuals
  • Levon Lloyd (State University of New York)
  • Varun Bhagwan, Daniel Gruhl (IBM Research
    Center)
  • Andrew Tomkins (Yahoo Research)

2
Introduction
  • The problem of name ambiguity is widespread.
  • They study the problem of disambiguating textual
    references to individuals.
  • Their goal is to develop algorithms capable of
    clustering references to a particular name so
    that the resulting clusters correspond as closely
    as possible to the particular individuals.
  • They use linguistically derived features and
    bottom-up clustering to explore how well this
    disambiguation can be performed.

3
Applications
  • Dossier Creation
  • Relationship Detection
  • Person Search
  • Expertise Location
  • Authorship
  • Homepage Location
  • Maintaining sufficiently high precision is
    paramount; recall may then be improved as much
    as possible.

4
Related Work
  • Word sense disambiguation
  • Name co-reference
  • Place disambiguation
  • Authors in citations disambiguation
  • Templated-based extraction

5
Data Sets of Names
  • Two distinct data sets
  • Household name (famous actors/actresses or famous
    computer scientists/mathematicians)
  • General name (1000 names from analysis of web
    data)
  • From these general names, they restricted the
    set to those that occurred at least 500 times
    within the 2.1B web pages.
  • They then used information from the 1990 US
    Census to estimate the probability that a
    uniformly chosen person in that census would
    match both the first and last name (< 5 × 10^-8).
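Below is a minimal illustrative sketch of this rarity filter. It assumes the census tables give the fraction of people carrying each first and last name and that the two frequencies can be treated as independent; both the independence assumption and the example frequencies are made up for illustration and are not stated on the slide.

def name_probability(first_freq, last_freq):
    # Estimated probability that a uniformly chosen person matches
    # both names, under an independence assumption.
    return first_freq * last_freq

# Keep the name only if the estimate is below the 5 x 10^-8 threshold.
if name_probability(2.0e-4, 1.5e-4) < 5e-8:
    print("name is rare enough to keep")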

6
Data Gathering
  • They used the full 2.8B pages of IBM's
    WebFountain system to gather data and run
    experiments.
  • For each result, they extracted a region of 100
    words centered around the name, and replaced each
    occurrence of the first and last name with FIRST
    and LAST respectively.
  • The algorithm is asked to cluster references.
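Below is a minimal sketch of the snippet extraction described above, assuming whitespace tokenization and case-insensitive matching (details the slide does not specify); extract_snippet is a hypothetical helper.

def extract_snippet(page_text, first, last, window=100):
    # Return roughly `window` words centered on the first occurrence
    # of the full name, with the name replaced by FIRST / LAST tokens.
    tokens = page_text.split()
    for i in range(len(tokens) - 1):
        if tokens[i].lower() == first.lower() and tokens[i + 1].lower() == last.lower():
            start = max(0, i - window // 2)
            snippet = tokens[start:start + window]
            return " ".join(
                "FIRST" if t.lower() == first.lower()
                else "LAST" if t.lower() == last.lower()
                else t
                for t in snippet)
    return None

print(extract_snippet("... a talk by Levon Lloyd on name disambiguation ...",
                      "Levon", "Lloyd"))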

7
Feature Extraction
  • Keywords
  • tf-idf-scored tokenized keywords from the text
    snippets
  • Entities
  • Any person's name that occurs anywhere on the
    page, and any entity that exists in the Stanford
    TAP knowledge base.
  • Descriptions
  • Appositives and noun phrase modifiers that modify
    the name reference in the snippet.
  • Phrases
  • Heads of all noun phrases in the snippet.
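Below is a minimal sketch of the keyword feature, assuming plain whitespace tokenization and a standard tf-idf weighting (the slide does not specify the exact variant); tfidf_keywords and top_k are illustrative names.

import math
from collections import Counter

def tfidf_keywords(snippets, top_k=10):
    # Score each token in each snippet by tf-idf over the snippet
    # collection and keep the top_k tokens as keyword features.
    docs = [Counter(s.lower().split()) for s in snippets]
    df = Counter()
    for d in docs:
        df.update(d.keys())
    n = len(docs)
    features = []
    for d in docs:
        scores = {t: tf * math.log(n / df[t]) for t, tf in d.items()}
        features.append(sorted(scores, key=scores.get, reverse=True)[:top_k])
    return features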

8
Example of Description
9
Clustering
  • K-means Clustering
  • Any clusters that fell below a membership
    threshold (5) had their centroid reseeded into
    the center of the largest cluster plus a small
    offset.
  • Incremental Clustering
  • Seed generation
  • Classification
  • Merging
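Below is a minimal sketch of the reseeding rule described for the k-means baseline, assuming numeric feature vectors, Euclidean geometry, and a small random offset; this is illustrative only, not the authors' exact implementation.

import numpy as np

def reseed_small_clusters(centroids, labels, min_size=5, eps=1e-3):
    # Any cluster with fewer than min_size members has its centroid
    # moved to the centroid of the largest cluster plus a small
    # random offset, so the next k-means iteration can repopulate it.
    sizes = np.bincount(labels, minlength=len(centroids))
    largest = int(sizes.argmax())
    for c, size in enumerate(sizes):
        if size < min_size:
            centroids[c] = centroids[largest] + eps * np.random.randn(centroids.shape[1])
    return centroids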

10
Seed Generation
  • The goal of the seed generation step is to form a
    set of highly precise seed clusters that need not
    cover the entire set of documents.
  • Each feature is evaluated in turn in tf-idf
    order, and one of three actions is performed
    (sketched below)
  • If the feature has not appeared in any page
    already in a seed cluster and it occurs in more
    than a threshold number of pages, then a new seed
    cluster is created from the pages containing it.
  • If the feature has appeared in another seed
    cluster and the ratio is greater than a
    threshold, then the pages containing the feature
    are added to that cluster.
  • Otherwise, skip the feature.
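A rough sketch of that loop follows. The threshold values, the definition of the ratio (taken here as the fraction of the feature's pages already in the best-matching seed), and the data layout are all assumptions for illustration.

def generate_seeds(features_in_tfidf_order, pages_with_feature,
                   min_pages=10, min_ratio=0.8):
    # Build high-precision seed clusters; each seed is a set of page ids.
    seeds = []
    seeded_pages = set()
    for f in features_in_tfidf_order:
        pages = set(pages_with_feature[f])
        if not (pages & seeded_pages):
            # Feature has not appeared in any seeded page.
            if len(pages) > min_pages:
                seeds.append(set(pages))
                seeded_pages |= pages
        else:
            # Feature overlaps an existing seed; check the ratio
            # against its best-matching seed (assumed definition).
            best = max(seeds, key=lambda s: len(pages & s))
            if len(pages & best) / len(pages) > min_ratio:
                best |= pages
                seeded_pages |= pages
        # Otherwise the feature is skipped.
    return seeds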

11
Classification
  • This step classifies each page that was not
    assigned to a seed cluster.
  • For each page, they find the cluster that is
    closest to the page in the feature space. If the
    distance is below a threshold then add it to that
    cluster.
  • Otherwise, they find the cluster that is closest
    to it in their entity co-occurrence space. If the
    distance is below a threshold then add it to that
    cluster.
  • If the page is not close enough to any existing
    cluster, then create a singleton cluster with
    just this page.
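Below is a minimal sketch of this two-stage assignment. The distance function (Euclidean distance to a cluster centroid), the thresholds t_feat and t_ent, and the data layout are assumptions; the slide does not specify them.

import numpy as np

def dist(v, members):
    # Distance from a page vector to a cluster, taken here as
    # Euclidean distance to the cluster centroid (an assumption).
    return np.linalg.norm(v - np.mean(members, axis=0))

def classify(page_feat, page_ent, clusters_feat, clusters_ent,
             t_feat=1.0, t_ent=1.0):
    # clusters_feat[i] / clusters_ent[i] hold the member vectors of
    # cluster i in keyword-feature space and entity co-occurrence space.
    i = min(range(len(clusters_feat)),
            key=lambda k: dist(page_feat, clusters_feat[k]))
    if dist(page_feat, clusters_feat[i]) < t_feat:
        clusters_feat[i].append(page_feat)
        clusters_ent[i].append(page_ent)
        return
    j = min(range(len(clusters_ent)),
            key=lambda k: dist(page_ent, clusters_ent[k]))
    if dist(page_ent, clusters_ent[j]) < t_ent:
        clusters_feat[j].append(page_feat)
        clusters_ent[j].append(page_ent)
        return
    # Not close enough to any cluster: start a singleton.
    clusters_feat.append([page_feat])
    clusters_ent.append([page_ent])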

12
Cluster Merging
  • The first two steps often create too many
    clusters, thus they add a final step to merge
    clusters.
  • They merge clusters by repeatedly merging each
    cluster with its nearest neighbor in the feature
    space until no remaining pair of clusters is
    close enough to merge.
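Below is a minimal sketch of that merging loop, again using centroid distance in feature space and a merge_thresh parameter as assumed closeness criteria.

import numpy as np

def merge_clusters(clusters, merge_thresh=1.0):
    # Each cluster is a list of feature vectors. Repeatedly merge the
    # closest pair of clusters until no pair of centroids is within
    # merge_thresh of each other.
    while len(clusters) > 1:
        cents = [np.mean(c, axis=0) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(cents[i] - cents[j])
                if d < merge_thresh and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters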

13
Evaluation Metric
  • B-CUBED metric
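The slide only names the metric; for reference, the standard B-Cubed precision and recall are computed per item and then averaged, as in the small sketch below (b_cubed is an illustrative helper).

from collections import Counter

def b_cubed(pred_labels, true_labels):
    # For each item: precision = fraction of items in its predicted
    # cluster that share its true class; recall = fraction of items in
    # its true class that share its predicted cluster. Both averaged.
    n = len(pred_labels)
    pred_sizes = Counter(pred_labels)
    true_sizes = Counter(true_labels)
    pair_sizes = Counter(zip(pred_labels, true_labels))
    precision = sum(pair_sizes[(p, t)] / pred_sizes[p]
                    for p, t in zip(pred_labels, true_labels)) / n
    recall = sum(pair_sizes[(p, t)] / true_sizes[t]
                 for p, t in zip(pred_labels, true_labels)) / n
    return precision, recall

print(b_cubed(["a", "a", "b", "b"], [1, 1, 1, 2]))  # (0.75, 0.666...)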

14
Evaluating Features
15
Focus on Precision
  • They compute the cohesion of each cluster, and
    find that many smaller clusters have high precision.
  • They give the algorithm the ability to endorse
    certain clusters as appearing to be of high
    quality.
  • At approximately 10% of the data, the algorithm
    is able to select clusters of near-perfect
    precision.
  • They consider the case of 10 distinct names with
    200 snippets per name, resulting in 2000 data
    points to be clustered.
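The slide does not define cohesion; one plausible choice, shown purely as an assumption, is the average pairwise cosine similarity of a cluster's members, with clusters above a cutoff endorsed as likely high-precision.

import numpy as np

def cohesion(cluster):
    # Average pairwise cosine similarity of the cluster's member
    # vectors (one plausible definition; not specified on the slide).
    vecs = [v / np.linalg.norm(v) for v in cluster]
    sims = [float(vecs[i] @ vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return sum(sims) / len(sims) if sims else 1.0

def endorse(clusters, cutoff=0.6):
    # Endorse clusters whose cohesion exceeds the cutoff.
    return [c for c in clusters if cohesion(c) >= cutoff]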

16
(No Transcript)
17
Incremental Clustering
  • They consider terminating the algorithm after
    each phase.

18
Conclusion
  • They have presented a technique for
    disambiguating occurrences of an ambiguous name
    from snippets of web text referring to
    individuals.
  • They show that, over typical web references,
    linguistically enhanced feature vectors and an
    incremental classifier can return results for
    25% of the data with precision in excess of
    0.95, outperforming the non-enhanced approach by
    a factor of around 2.5.