Title: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text
1. Ontology-Driven Automatic Entity Disambiguation in Unstructured Text
Jed Hassell, Boanerges Aleman-Meza, Budak Arpinar
5th International Semantic Web Conference, Athens, GA, Nov. 5-9, 2006
Acknowledgement: NSF-ITR-IDM Award 0325464, SemDIS: Discovering Complex Relationships in the Semantic Web
2. The Question is...
- How to determine the most likely match of a named entity in unstructured text?
- Example
  - Which A. Joshi is this text referring to?
  - out of, say, 20 candidate entities (in a populated ontology)
3. Likely Match: Confidence Score
- The idea is to spot entity names in text and assign each potential match a confidence score
- The confidence score represents a degree of certainty that a spotted entity refers to a particular object in the ontology
4. Our Approach: Three Steps
- Spot entity names
  - assign initial confidence scores
- Adjust confidence scores using
  - text-proximity relationships (text)
  - co-occurrence relationships (text)
  - connections (graph)
  - popular entities (graph)
- Iterate to propagate results (see the sketch after this list)
  - finish when confidence scores are no longer updated
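A minimal sketch of this loop, for illustration only: the Candidate record, the adjustment-function list, and the epsilon/max_iters stopping parameters are all assumptions, not the paper's implementation. The per-slide adjustments sketched on later slides would be bound to their extra inputs (the text, the ontology maps), e.g. with functools.partial, before being passed in:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    uri: str            # entity URI in the populated ontology
    name: str           # surface form as spotted in the text
    start: int          # character offset of the spotted name
    confidence: float   # degree of certainty for this match

def disambiguate(candidates, adjustments, epsilon=1e-3, max_iters=50):
    """Re-apply the score adjustments until no confidence score changes."""
    for _ in range(max_iters):
        before = [c.confidence for c in candidates]
        for adjust in adjustments:        # each adjustment mutates scores
            adjust(candidates)
        deltas = [abs(c.confidence - b)
                  for c, b in zip(candidates, before)]
        if not deltas or max(deltas) < epsilon:
            break                          # scores stable: finished
    return candidates
```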
5. Spotting Entity Names
- Search the document for entity names within the ontology
- Each match becomes a candidate entity
- Assign initial confidence scores (see the sketch below)
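A hedged sketch of the spotting step, reusing the Candidate class from the previous sketch. Reducing the ontology to a name-to-URIs dictionary and the 0.5 initial score are illustrative assumptions:

```python
import re

def spot_entities(text, names_to_uris, initial_confidence=0.5):
    """Every occurrence of a known name yields one candidate per URI."""
    candidates = []
    for name, uris in names_to_uris.items():
        for match in re.finditer(re.escape(name), text):
            for uri in uris:   # an ambiguous name yields several candidates
                candidates.append(
                    Candidate(uri, name, match.start(), initial_confidence))
    return candidates

# Example: two "A. Joshi" entries in the ontology give two candidates.
names = {"A. Joshi": ["http://example.org/person/joshi-1",
                      "http://example.org/person/joshi-2"]}
print(len(spot_entities("PC members: A. Joshi, ...", names)))  # -> 2
```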
6. Using Text-Proximity Relationships
- Relationships that can be expected to be in near text-proximity of the entity
- Measured in terms of character spaces (sketch below)
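A sketch of this adjustment under assumed, illustrative parameters; related_literals (mapping an entity URI to literal values such as an affiliation string), the 60-character window, and the flat boost are not the paper's values:

```python
def adjust_text_proximity(c, text, related_literals, window=60, boost=0.2):
    """Boost a candidate whose related literal values appear near the spot."""
    lo = max(0, c.start - window)
    hi = c.start + len(c.name) + window
    nearby = text[lo:hi]                  # window measured in character spaces
    for literal in related_literals.get(c.uri, []):
        if literal in nearby:
            # cap at 1.0 so repeated iterations reach a fixed point
            c.confidence = min(1.0, c.confidence + boost)
```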
7. Using Co-occurrence Relationships
- Similar to text-proximity, with the exception that proximity is not relevant
  - i.e., location within the document does not matter (sketch below)
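The co-occurrence variant under the same assumptions, differing only in that the whole document is searched:

```python
def adjust_co_occurrence(c, text, related_literals, boost=0.1):
    """Boost a candidate whose related literal values appear anywhere."""
    for literal in related_literals.get(c.uri, []):
        if literal in text:               # location in the document is ignored
            c.confidence = min(1.0, c.confidence + boost)
```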
8. Using Popular Entities (graph)
- Intention: bias toward the most popular entity being the right one
- This should be used with care, depending on the domain
  - good for tie-breaking (sketch below)
- DBLP scenario: the entity with more papers
  - e.g., only two A. Joshi entities with >50 papers
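A sketch of the popularity heuristic as a tie-breaker; paper_counts (a hypothetical URI-to-paper-count map derived from the ontology) and the deliberately small boost are assumptions:

```python
def adjust_popularity(candidates, paper_counts, boost=0.05):
    """Among candidates for the same spotted name, nudge the most popular."""
    groups = {}
    for c in candidates:
        groups.setdefault((c.name, c.start), []).append(c)
    for group in groups.values():
        if len(group) > 1:               # only ambiguous spots need tie-breaking
            top = max(group, key=lambda c: paper_counts.get(c.uri, 0))
            top.confidence = min(1.0, top.confidence + boost)
```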
9. Using Relations to Other Entities
- Entities can be related to one another through their collaboration network
  - neighboring entities get a boost in their confidence score, i.e., propagation
- This is the iterative step in our approach (sketch after this list)
  - it starts with the entities having the highest confidence scores
- Example: Conference Program Committee Members
  - Professor Smith
  - Professor Smith's co-editor on a recent book
  - Professor Smith's recently graduated Ph.D. advisee
  - ...
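A sketch of the propagation step; neighbors (a hypothetical URI-to-URIs adjacency map built from the collaboration network) and the threshold/boost values are assumptions:

```python
def propagate_from_neighbors(candidates, neighbors, threshold=0.8, boost=0.15):
    """Spread confidence through the collaboration network."""
    confident = {c.uri for c in candidates if c.confidence >= threshold}
    for c in candidates:
        if c.confidence < threshold:
            # an already-confident neighbor (e.g., a co-author spotted
            # elsewhere in the document) lifts this candidate's score
            if neighbors.get(c.uri, set()) & confident:
                c.confidence = min(1.0, c.confidence + boost)
```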
10. In Summary: Ontology-Driven
- Using clues
  - from the text where the entity appears
  - from the ontology
- Example: RDF/XML snippet of a person's metadata
11. Overview of System Architecture
12. Once No More Iterations Are Needed
- Output of results in XML format (illustrative snippet below)
  - URI
  - confidence score
  - entity name (as it appears in the text)
  - start and end positions (location in the document)
- Can easily be converted to other formats
  - Microformats, RDFa, ...
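A hypothetical rendering of one such output record, reusing the Candidate class from the earlier sketches; the element and attribute names here are invented, and the paper's actual XML schema may differ:

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml(c):
    """Serialize one candidate as an illustrative XML element."""
    end = c.start + len(c.name)
    return (f"<entity uri={quoteattr(c.uri)} "
            f"confidence={quoteattr(format(c.confidence, '.2f'))} "
            f"start={quoteattr(str(c.start))} end={quoteattr(str(end))}>"
            f"{escape(c.name)}</entity>")
```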
13. Sample Output
14. Sample Output: Microformat
15. Evaluation: Gold Standard Set
- We evaluate our method using a gold standard set of documents
  - randomly chose 20 consecutive posts from DBWorld
  - set of manually disambiguated documents
  - (two) humans validated the right entity match
- We used precision and recall as the evaluation measures for our system
16. Evaluation: Sample DBWorld Post
17. Sample Disambiguated Document
18. Using DBLP Data as the Ontology
- Converted DBLP's bibliographic data to RDF
  - 447,121 authors
- A SAX parser to convert DBLP's XML data to RDF (sketch below)
- Created relationships such as co-author
- Added
  - affiliations (for a subset of authors)
  - areas of interest (for a subset of authors)
  - spellings for international characters
- Lessons learned led us to create SwetoDblp (containing many improvements)
SwetoDblp: http://lsdis.cs.uga.edu/projects/semdis/swetodblp/
DBLP: http://www.informatik.uni-trier.de/~ley/db/
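A hedged sketch of such a conversion using Python's built-in SAX support; the element names follow DBLP's public XML, but the record types handled and the co-author triple format are simplified for illustration:

```python
import xml.sax

class DblpCoauthors(xml.sax.ContentHandler):
    """Collect co-author pairs from DBLP publication records."""
    PUB_TAGS = {"article", "inproceedings"}  # subset of DBLP's record types

    def __init__(self):
        super().__init__()
        self.in_author = False
        self.current = ""
        self.authors = []
        self.triples = []                    # (author, "co-author", author)

    def startElement(self, name, attrs):
        if name == "author":
            self.in_author, self.current = True, ""
        elif name in self.PUB_TAGS:
            self.authors = []

    def characters(self, content):
        if self.in_author:
            self.current += content

    def endElement(self, name):
        if name == "author":
            self.in_author = False
            self.authors.append(self.current)
        elif name in self.PUB_TAGS:
            # emit one co-author triple per author pair on this paper
            for i, a in enumerate(self.authors):
                for b in self.authors[i + 1:]:
                    self.triples.append((a, "co-author", b))

# Usage: handler = DblpCoauthors(); xml.sax.parse("dblp.xml", handler)
```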
19. Evaluation: Precision & Recall
- We define set A as the set of unique names identified using the disambiguated dataset (i.e., exact results)
- We define set B as the set of entities found by our method
- A ∩ B represents the set of entities correctly identified by our method
20. Evaluation: Precision & Recall
- Precision is the proportion of correctly disambiguated entities with regard to B
- Recall is the proportion of correctly disambiguated entities with regard to A
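In symbols, following the set definitions on the previous slide:

```latex
\mathrm{Precision} = \frac{|A \cap B|}{|B|}, \qquad
\mathrm{Recall} = \frac{|A \cap B|}{|A|}
```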
21. Evaluation: Results
- Precision and recall (compared to the gold standard; worked out below the table)
- Precision and recall on a per-document basis

Correct Disambiguations | Found Entities | Total Entities | Precision | Recall
602                     | 620            | 758            | 97.1%     | 79.4%
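The reported percentages follow directly from the table's counts:

```latex
\mathrm{Precision} = \frac{602}{620} \approx 97.1\%, \qquad
\mathrm{Recall} = \frac{602}{758} \approx 79.4\%
```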
22. Related Work
- Semex: Personal Information Management
  - The results of disambiguated entities are propagated to other ambiguous entities, which can then be reconciled based on recently reconciled entities (much like our work does)
  - Takes advantage of a predictable structure, such as fields where an email address or name is expected to appear
  - Our approach works with unstructured data
Semex: Dong, Halevy, Madhavan, SIGMOD 2005
23. Related Work
- KIM
  - Contains an entity-recognition component that uses natural language processing
  - Evaluations performed on human-annotated corpora
- SCORE technology (now http://www.fortent.com/)
  - Uses associations from a knowledge base, yet implementation details are not available (commercial product)
SCORE: Sheth et al., IEEE Internet Computing, 6(4), 2002
KIM: Popov et al., ISWC 2003
24. Conclusions
- Our method uses relationships between entities in the ontology to go beyond traditional syntax-based disambiguation techniques
- This work is among the first to successfully use relationships for identifying named entities in text without relying on the structure of the text
25. Future Work
- Improvements on spotting
  - e.g., canonical names (Tim / Timothy)
- Integration/deployment as a UIMA component
  - allows analysis across a document collection
  - for applications such as semantic annotation and search
- Further evaluations
  - using different datasets and document sets
  - comparing with other methods
  - determining the best contributing factor in disambiguation
  - measuring how far down the list the right entity was missed
UIMA: IBM's Unstructured Information Management Architecture
26. Scalability, Semantics, Automation
- Usage of background knowledge in the form of a (large) populated ontology
- Flexibility to use a different ontology, but
  - the ontology must fit the domain
- It is an automatic approach, yet
  - a human defines threshold values (and some weights)
27. References
- Aleman-Meza, B., Nagarajan, M., Ramakrishnan, C., Ding, L., Kolari, P., Sheth, A., Arpinar, B., Joshi, A., Finin, T.: Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. 15th International World Wide Web Conference, Edinburgh, Scotland (2006)
- DBWorld. http://www.cs.wisc.edu/dbworld/ (April 9, 2006)
- Dong, X. L., Halevy, A., Madhavan, J.: Reference Reconciliation in Complex Information Spaces. Proc. of SIGMOD, Baltimore, MD (2005)
- Ley, M.: The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. Proc. of the 9th International Symposium on String Processing and Information Retrieval, Lisbon, Portugal (Sept. 2002) 1-10
- Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM - Semantic Annotation Platform. Proc. of the 2nd Intl. Semantic Web Conf., Sanibel Island, Florida (2003)
- Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., Warke, Y.: Managing Semantic Content for the Web. IEEE Internet Computing, 6(4) (2002) 80-87
- Zhu, J., Uren, V., Motta, E.: ESpotter: Adaptive Named Entity Recognition for Web Browsing. 3rd Professional Knowledge Management Conference, Kaiserslautern, Germany (2005)
- Evaluation datasets at http://lsdis.cs.uga.edu/~aleman/publications/Hassell_ISWC2006/