Ontology-Driven Automatic Entity Disambiguation in Unstructured Text

1
Ontology-Driven Automatic Entity Disambiguation
in Unstructured Text
Jed Hassell, Boanerges Aleman-Meza, Budak Arpinar
5th International Semantic Web Conference,
Athens, GA, Nov. 5-9, 2006
Acknowledgement: NSF-ITR-IDM Award 0325464,
SemDIS: Discovering Complex Relationships in the
Semantic Web
2
The Question is
  • How to determine the most likely match of a
    named-entity in unstructured text?
  • Example: which A. Joshi is this text referring to?
    • out of, say, 20 candidate entities (in a
      populated ontology)

3
Likely match = confidence score
  • Idea is to spot entity names in text and assign
    each potential match a confidence score
  • The confidence score represents a degree of
    certainty that a spotted entity refers to a
    particular object in the ontology

4
Our Approach, three steps
  • Spot entity names
    • assign initial confidence score
  • Adjust confidence score using
    • proximity relationships (text)
    • co-occurrence relationships (text)
    • connections (graph)
    • popular entities (graph)
  • Iterate again to propagate result
    • finish when confidence scores are not updated

5
Spotting Entity Names
  • Search document for entity names within the
    ontology
  • Each match becomes a candidate entity
  • Assign initial confidence scores
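
The spotting step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy name table, the `ex:` URIs, the `Candidate` record and the initial score of 0.1 are all hypothetical, since the slides do not say how names are matched or which starting value is used.

```python
import re
from dataclasses import dataclass

# Hypothetical toy ontology: entity URI -> name as it may appear in text.
ONTOLOGY_NAMES = {
    "ex:a_joshi_1": "A. Joshi",
    "ex:a_joshi_2": "A. Joshi",
    "ex:t_finin": "T. Finin",
}

INITIAL_SCORE = 0.1  # assumed starting value; the slides do not give one


@dataclass
class Candidate:
    uri: str      # ontology entity this match may refer to
    name: str     # name as it appears in the text
    start: int    # character offset where the match starts
    end: int      # character offset where the match ends
    score: float  # current confidence score


def spot_entities(text, names=ONTOLOGY_NAMES, initial=INITIAL_SCORE):
    """Return one Candidate per (ontology entity, occurrence in text)."""
    candidates = []
    for uri, name in names.items():
        for match in re.finditer(re.escape(name), text):
            candidates.append(
                Candidate(uri, name, match.start(), match.end(), initial)
            )
    return candidates


doc = "PC members: A. Joshi and T. Finin."
spotted = spot_entities(doc)
```

Here the single mention of "A. Joshi" yields two candidates, one per ontology entity carrying that name, which is exactly the ambiguity the later steps try to resolve.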

6
Using Text-proximity Relationships
  • Relationships that can be expected to be in near
    text-proximity of the entity
  • Measured in terms of character spaces
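
One possible reading of the text-proximity clue is sketched below. The window of 50 characters and the boost of 0.2 are assumed values, and `proximity_boost` is a hypothetical helper, not the authors' code; `related_literals` stands for ontology values (for example an affiliation) attached to one candidate entity.

```python
def proximity_boost(text, entity_start, entity_end, related_literals,
                    window=50, boost=0.2):
    """Return the boost earned by related literals found near the entity.

    window: maximum distance in characters on either side (assumed value)
    boost:  amount added per literal found in the window (assumed value)
    """
    lo = max(0, entity_start - window)
    hi = min(len(text), entity_end + window)
    neighbourhood = text[lo:hi]
    hits = sum(1 for literal in related_literals if literal in neighbourhood)
    return boost * hits


# Hypothetical document: one affiliation next to the name, one far away.
text = ("A. Joshi (Example University) will chair the session. "
        + "x" * 200 + " Another Institute")
start = text.index("A. Joshi")
end = start + len("A. Joshi")

near = proximity_boost(text, start, end, ["Example University"])
far = proximity_boost(text, start, end, ["Another Institute"])
```

Only the literal that falls inside the character window contributes, so the candidate affiliated with "Example University" would be favoured for this mention.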

7
Using Co-occurrence Relations
  • Similar to text-proximity with the exception that
    proximity is not relevant
  • i.e., location within the document does not matter
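
The co-occurrence clue differs from the proximity clue only in dropping the window. A minimal sketch, with a hypothetical helper name and an assumed boost of 0.1 per literal:

```python
def cooccurrence_boost(text, related_literals, boost=0.1):
    """Return the boost for related literals found anywhere in the document.

    boost: amount added per literal found (assumed value)
    """
    hits = sum(1 for literal in related_literals if literal in text)
    return boost * hits


# Hypothetical document and areas of interest for one candidate entity.
document = ("Call for papers on the Semantic Web. " + "x" * 500
            + " Program committee: A. Joshi")
areas_of_interest = ["Semantic Web", "Computational Linguistics"]

gain = cooccurrence_boost(document, areas_of_interest)
```

The matching literal sits hundreds of characters away from the name and still counts, because position in the document is ignored.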

8
Using Popular Entities (graph)
  • Intention: bias the right entity to be the most
    popular entity
  • This should be used with care, depending on the
    domain
  • good for tie-breaking
  • DBLP scenario: entity with more papers
  • e.g., only two A. Joshi entities with >50 papers
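
As a tie-breaker, the popularity clue could look like the sketch below. The threshold of 50 papers echoes the example on the slide; the boost of 0.05, the paper counts and the `ex:` URIs are assumptions made for illustration.

```python
def popularity_boost(paper_count, threshold=50, boost=0.05):
    """Return a small boost for entities with more than `threshold` papers.

    boost is deliberately small (assumed value) so that it mostly breaks ties.
    """
    return boost if paper_count > threshold else 0.0


# Hypothetical candidates that are tied after the text-based clues.
paper_counts = {"ex:a_joshi_1": 120, "ex:a_joshi_2": 3}
scores = {"ex:a_joshi_1": 0.5, "ex:a_joshi_2": 0.5}

for uri in scores:
    scores[uri] += popularity_boost(paper_counts[uri])

best = max(scores, key=scores.get)
```

Keeping the boost small relative to the text-based clues is one way to follow the advice that popularity should be used with care.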

9
Using Relations to other Entities
  • Entities can be related to one another through
    their collaboration network
  • neighboring entities get a boost in their
    confidence score
  • i.e., propagation
  • This is the iterative step in our approach
  • It starts with the entities having the highest
    confidence score
  • Example: Conference Program Committee Members
  • - Professor Smith
  • - Professor Smith's co-editor in a recent book
  • - Professor Smith's recently graduated Ph.D.
    advisee
  • . . . . . . . . .
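
The iterative propagation step might be sketched as follows. The threshold, the boost and the rule that each edge boosts only once are assumptions chosen so that the loop reaches a point where no score is updated; the slides do not specify these details.

```python
def propagate(scores, neighbours, threshold=0.8, boost=0.1, max_rounds=100):
    """Boost neighbours of confident entities until no score is updated.

    threshold: minimum score for an entity to boost its neighbours (assumed)
    boost:     amount added to a neighbour's score (assumed)
    """
    scores = dict(scores)
    applied = set()  # (source, target) pairs already used, one boost per edge
    for _ in range(max_rounds):
        changed = False
        # Start with the entities that currently have the highest confidence.
        for uri in sorted(scores, key=scores.get, reverse=True):
            if scores[uri] < threshold:
                break
            for other in neighbours.get(uri, ()):
                if other in scores and (uri, other) not in applied:
                    scores[other] = min(1.0, scores[other] + boost)
                    applied.add((uri, other))
                    changed = True
        if not changed:
            break  # finish when confidence scores are not updated
    return scores


# Hypothetical collaboration network around "Professor Smith".
initial = {"ex:smith": 0.9, "ex:coeditor": 0.75,
           "ex:advisee": 0.5, "ex:unrelated": 0.75}
network = {"ex:smith": ["ex:coeditor", "ex:advisee"],
           "ex:coeditor": ["ex:smith"],
           "ex:advisee": ["ex:smith"]}

final = propagate(initial, network)
```

In this toy run the co-editor and the advisee gain confidence from their link to Smith, while the unrelated candidate with the same starting score as the co-editor stays where it was.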

10
In Summary, ontology-driven
  • Using clues
  • from the text where the entity appears
  • from the ontology
  • Example: RDF/XML snippet of a person's metadata

11
Overview of System Architecture
12
Once no more iterations are needed
  • Output of results: XML format
  • URI
  • Confidence score
  • Entity name (as it appears in the text)
  • Start and end position (location in a document)
  • Can easily be converted to other formats
  • Microformats, RDFa, ...
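
A result record with the four fields listed above could be serialized as in this sketch. The element names (`entities`, `entity`, `uri`, `confidence`, `name`, `start`, `end`) and the sample values are hypothetical; the actual schema is shown only as an image on the Sample Output slide.

```python
import xml.etree.ElementTree as ET


def results_to_xml(results):
    """Serialize disambiguation results to an XML string.

    Element names are illustrative assumptions, not the system's schema.
    """
    root = ET.Element("entities")
    for result in results:
        entity = ET.SubElement(root, "entity")
        ET.SubElement(entity, "uri").text = result["uri"]
        ET.SubElement(entity, "confidence").text = f"{result['score']:.2f}"
        ET.SubElement(entity, "name").text = result["name"]
        ET.SubElement(entity, "start").text = str(result["start"])
        ET.SubElement(entity, "end").text = str(result["end"])
    return ET.tostring(root, encoding="unicode")


# Hypothetical result for one spotted name.
xml_text = results_to_xml([
    {"uri": "ex:a_joshi_1", "score": 0.9, "name": "A. Joshi",
     "start": 12, "end": 20},
])
```

Because every record carries character offsets, the same data can be used to wrap the original text in microformat or RDFa markup.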

13
Sample Output
14
Sample Output - Microformat
15
Evaluation Gold Standard Set
  • We evaluate our method using a gold standard set
    of documents
  • Randomly chose 20 consecutive posts from DBWorld
  • Set of manually disambiguated documents
  • (two) humans validated the right entity match
  • We used precision and recall as the measurement
    of evaluation for our system

16
Evaluation, sample DBWorld post
17
Sample disambiguated document
18
Using DBLP data as ontology
  • Converted DBLP's bibliographic data to RDF
  • 447,121 authors
  • A SAX parser to convert DBLP's XML data to RDF
  • Created relationships such as co-author
  • Added
  • Affiliations (for a subset of authors)
  • Areas of interest (for a subset of authors)
  • spellings for international characters
  • Lessons learned led us to create SwetoDblp
    (containing many improvements)

SwetoDblp: http://lsdis.cs.uga.edu/projects/semdis/swetodblp/
DBLP: http://www.informatik.uni-trier.de/~ley/db/
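
The SAX-based conversion can be illustrated with Python's standard `xml.sax` module. This is a sketch under assumptions: the sample records and author names are made up, the `ex:coAuthor` predicate is hypothetical, and plain tuples stand in for RDF triples. Only the `article`, `inproceedings` and `author` element names come from DBLP's XML format.

```python
import xml.sax
from itertools import combinations


class CoAuthorHandler(xml.sax.ContentHandler):
    """Collect (author, predicate, author) triples from DBLP-style records."""

    RECORDS = ("article", "inproceedings")

    def __init__(self):
        super().__init__()
        self.triples = []
        self._authors = []
        self._in_author = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name in self.RECORDS:
            self._authors = []
        elif name == "author":
            self._in_author = True
            self._buffer = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._in_author:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "author":
            self._authors.append("".join(self._buffer).strip())
            self._in_author = False
        elif name in self.RECORDS:
            # Every pair of authors on one publication is a co-author link.
            for first, second in combinations(self._authors, 2):
                self.triples.append((first, "ex:coAuthor", second))


# Hypothetical sample in the shape of DBLP's XML.
SAMPLE = b"""<dblp>
  <inproceedings key="conf/x/1">
    <author>A. Author</author><author>B. Author</author>
    <author>C. Author</author><title>Paper One</title>
  </inproceedings>
  <article key="journals/y/2">
    <author>D. Author</author><title>Paper Two</title>
  </article>
</dblp>"""

handler = CoAuthorHandler()
xml.sax.parseString(SAMPLE, handler)
```

A streaming parser suits this job because the records are handled one at a time, so the whole bibliography never has to be held in memory.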
19
Evaluation, Precision & Recall
  • We define set A as the set of unique names
    identified using the disambiguated dataset (i.e.,
    exact results)
  • We define set B as the set of entities found by
    our method
  • A ∩ B represents the set of entities correctly
    identified by our method

20
Evaluation, Precision & Recall
  • Precision is the proportion of correctly
    disambiguated entities with regard to B
  • Recall is the proportion of correctly
    disambiguated entities with regard to A
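
Written out with the sets defined on the previous slide (A for the gold standard, B for the entities found by the method), the two measures are the standard ones:

```latex
\[
\text{Precision} = \frac{|A \cap B|}{|B|},
\qquad
\text{Recall} = \frac{|A \cap B|}{|A|}
\]
```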

21
Evaluation, Results
  • Precision and recall (compared to gold standard)
  • Precision and recall on a per document basis

Correct Disambiguation   Found Entities   Total Entities   Precision   Recall
602                      620              758              97.1%       79.4%
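
The reported percentages can be reproduced from the three counts in the table. The sketch below assumes that "Found Entities" is the size of B and "Total Entities" is the size of A, which matches the definitions given earlier.

```python
correct = 602  # correctly disambiguated entities, |A ∩ B|
found = 620    # entities found by the method, |B|
total = 758    # entities in the gold standard, |A|

precision = correct / found
recall = correct / total

print(f"precision = {precision:.1%}, recall = {recall:.1%}")
# prints "precision = 97.1%, recall = 79.4%"
```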
22
Related Work
  • Semex: Personal Information Management
  • The results of disambiguated entities are
    propagated to other ambiguous entities, which
    could then be reconciled based on recently
    reconciled entities (much like our work does)
  • Takes advantage of a predictable structure such
    as fields where an email or name is expected to
    appear
  • Our approach works with unstructured data

Semex: Dong, Halevy, Madhavan, SIGMOD-2005
23
Related Work
  • KIM
  • Contains an entity recognition portion that uses
    natural language processing
  • Evaluations performed on human annotated corpora
  • SCORE Technology (now, http://www.fortent.com/)
  • Uses associations from a knowledge base, yet
    implementation details are not available
    (commercial product)

SCORE: Sheth et al., IEEE Internet Computing, 6(4),
2002
KIM: Popov et al., ISWC-2003
24
Conclusions
  • Our method uses relationships between entities in
    the ontology to go beyond traditional
    syntactic-based disambiguation techniques
  • This work is among the first to successfully use
    relationships for identifying named-entities in
    text without relying on the structure of the text

25
Future Work
  • Improvements on spotting
  • e.g., canonical names (Tim, Timothy)
  • Integration/deployment as a UIMA component
  • allows analysis along a document collection
  • for applications such as semantic annotation and
    search
  • Further evaluations
  • Using different datasets and document sets
  • Compare with respect to other methods, and
  • to determine best contributing factor in
    disambiguation
  • measure how far in the list we missed the right
    entity

UIMA: IBM's Unstructured Information Management
Architecture
26
Scalability, Semantics, Automation
  • Usage of background knowledge in the form of a
    (large) populated ontology
  • Flexibility to use a different ontology, but,
  • the ontology must fit the domain
  • It's an automatic approach, yet
  • Human defines threshold values (and some weights)

27
References
  • Aleman-Meza, B., Nagarajan, M., Ramakrishnan, C.,
    Ding, L., Kolari, P., Sheth, A., Arpinar, B.,
    Joshi, A., Finin, T. Semantic Analytics on Social
    Networks: Experiences in Addressing the Problem
    of Conflict of Interest Detection. 15th
    International World Wide Web Conference,
    Edinburgh, Scotland (2006)
  • DBWorld. http://www.cs.wisc.edu/dbworld/ April 9,
    2006.
  • Dong, X. L., Halevy, A., Madhavan, J. Reference
    Reconciliation in Complex Information Spaces.
    Proc. of SIGMOD, Baltimore, MD (2005)
  • Ley, M. The DBLP Computer Science Bibliography:
    Evolution, Research Issues, Perspectives. Proc.
    of the 9th International Symposium on String
    Processing and Information Retrieval, Lisbon,
    Portugal (Sept. 2002) 1-10
  • Popov, B., Kiryakov, A., Kirilov, A., Manov, D.,
    Ognyanoff, D., Goranov, M. KIM - Semantic
    Annotation Platform. Proc. of the 2nd Intl.
    Semantic Web Conf., Sanibel Island, Florida (2003)
  • Sheth, A., Bertram, C., Avant, D., Hammond, B.,
    Kochut, K., Warke, Y. Managing Semantic Content
    for the Web. IEEE Internet Computing, 6(4) (2002)
    80-87
  • Zhu, J., Uren, V., Motta, E. ESpotter: Adaptive
    Named Entity Recognition for Web Browsing. 3rd
    Professional Knowledge Management Conference,
    Kaiserslautern, Germany (2005)
  • Evaluation datasets at
    http://lsdis.cs.uga.edu/~aleman/publications/Hassell_ISWC2006/