Title: Ontology-Driven Automatic Entity Disambiguation in Unstructured Text
1. Ontology-Driven Automatic Entity Disambiguation in Unstructured Text
Jed Hassell, Boanerges Aleman-Meza, Budak Arpinar
5th International Semantic Web Conference, Athens, GA, Nov. 5-9, 2006
Acknowledgement: NSF-ITR-IDM Award 0325464, SemDIS: Discovering Complex Relationships in the Semantic Web
2. The Question is...
- How to determine the most likely match of a named entity in unstructured text?
- Example
  - Which A. Joshi is this text referring to?
  - out of, say, 20 candidate entities (in a populated ontology)
3. Likely Match: Confidence Score
- The idea is to spot entity names in text and assign each potential match a confidence score
- The confidence score represents a degree of certainty that a spotted entity refers to a particular object in the ontology
4. Our Approach: Three Steps
- Spot entity names
  - assign initial confidence scores
- Adjust confidence scores using
  - text-proximity relationships (text)
  - co-occurrence relationships (text)
  - connections (graph)
  - popular entities (graph)
- Iterate to propagate results (see the sketch after this list)
  - finish when confidence scores are no longer updated
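A minimal sketch of this loop, for illustration only: the Candidate record, the adjustment-function list, and the epsilon/max_iters stopping parameters are all assumptions, not the paper's implementation. The per-slide adjustments sketched on later slides would be bound to their extra inputs (the text, the ontology maps), e.g. with functools.partial, before being passed in:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    uri: str            # entity URI in the populated ontology
    name: str           # surface form as spotted in the text
    start: int          # character offset of the spotted name
    confidence: float   # degree of certainty for this match

def disambiguate(candidates, adjustments, epsilon=1e-3, max_iters=50):
    """Re-apply the score adjustments until no confidence score changes."""
    for _ in range(max_iters):
        before = [c.confidence for c in candidates]
        for adjust in adjustments:        # each adjustment mutates scores
            adjust(candidates)
        deltas = [abs(c.confidence - b)
                  for c, b in zip(candidates, before)]
        if not deltas or max(deltas) < epsilon:
            break                          # scores stable: finished
    return candidates
```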
5. Spotting Entity Names
- Search the document for entity names within the ontology
- Each match becomes a candidate entity
- Assign initial confidence scores (see the sketch below)
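A hedged sketch of the spotting step, reusing the Candidate class from the previous sketch. Reducing the ontology to a name-to-URIs dictionary and the 0.5 initial score are illustrative assumptions:

```python
import re

def spot_entities(text, names_to_uris, initial_confidence=0.5):
    """Every occurrence of a known name yields one candidate per URI."""
    candidates = []
    for name, uris in names_to_uris.items():
        for match in re.finditer(re.escape(name), text):
            for uri in uris:   # an ambiguous name yields several candidates
                candidates.append(
                    Candidate(uri, name, match.start(), initial_confidence))
    return candidates

# Example: two "A. Joshi" entries in the ontology give two candidates.
names = {"A. Joshi": ["http://example.org/person/joshi-1",
                      "http://example.org/person/joshi-2"]}
print(len(spot_entities("PC members: A. Joshi, ...", names)))  # -> 2
```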
6. Using Text-Proximity Relationships
- Relationships that can be expected to be in near text-proximity of the entity
- Measured in terms of character spaces (sketch below)
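A sketch of this adjustment under assumed, illustrative parameters; related_literals (mapping an entity URI to literal values such as an affiliation string), the 60-character window, and the flat boost are not the paper's values:

```python
def adjust_text_proximity(c, text, related_literals, window=60, boost=0.2):
    """Boost a candidate whose related literal values appear near the spot."""
    lo = max(0, c.start - window)
    hi = c.start + len(c.name) + window
    nearby = text[lo:hi]                  # window measured in character spaces
    for literal in related_literals.get(c.uri, []):
        if literal in nearby:
            # cap at 1.0 so repeated iterations reach a fixed point
            c.confidence = min(1.0, c.confidence + boost)
```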
7. Using Co-occurrence Relationships
- Similar to text-proximity, with the exception that proximity is not relevant
  - i.e., location within the document does not matter (sketch below)
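The co-occurrence variant under the same assumptions, differing only in that the whole document is searched:

```python
def adjust_co_occurrence(c, text, related_literals, boost=0.1):
    """Boost a candidate whose related literal values appear anywhere."""
    for literal in related_literals.get(c.uri, []):
        if literal in text:               # location in the document is ignored
            c.confidence = min(1.0, c.confidence + boost)
```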
8. Using Popular Entities (graph)
- Intention: bias toward the most popular entity being the right one
- This should be used with care, depending on the domain
  - good for tie-breaking (sketch below)
- DBLP scenario: the entity with more papers
  - e.g., only two A. Joshi entities with >50 papers
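A sketch of the popularity heuristic as a tie-breaker; paper_counts (a hypothetical URI-to-paper-count map derived from the ontology) and the deliberately small boost are assumptions:

```python
def adjust_popularity(candidates, paper_counts, boost=0.05):
    """Among candidates for the same spotted name, nudge the most popular."""
    groups = {}
    for c in candidates:
        groups.setdefault((c.name, c.start), []).append(c)
    for group in groups.values():
        if len(group) > 1:               # only ambiguous spots need tie-breaking
            top = max(group, key=lambda c: paper_counts.get(c.uri, 0))
            top.confidence = min(1.0, top.confidence + boost)
```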
9. Using Relations to Other Entities
- Entities can be related to one another through their collaboration network
  - neighboring entities get a boost in their confidence score, i.e., propagation
- This is the iterative step in our approach (sketch after this list)
  - it starts with the entities having the highest confidence scores
- Example: Conference Program Committee Members
  - Professor Smith
  - Professor Smith's co-editor on a recent book
  - Professor Smith's recently graduated Ph.D. advisee
  - ...
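A sketch of the propagation step; neighbors (a hypothetical URI-to-URIs adjacency map built from the collaboration network) and the threshold/boost values are assumptions:

```python
def propagate_from_neighbors(candidates, neighbors, threshold=0.8, boost=0.15):
    """Spread confidence through the collaboration network."""
    confident = {c.uri for c in candidates if c.confidence >= threshold}
    for c in candidates:
        if c.confidence < threshold:
            # an already-confident neighbor (e.g., a co-author spotted
            # elsewhere in the document) lifts this candidate's score
            if neighbors.get(c.uri, set()) & confident:
                c.confidence = min(1.0, c.confidence + boost)
```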
10. In Summary: Ontology-Driven
- Using clues
  - from the text where the entity appears
  - from the ontology
- Example: RDF/XML snippet of a person's metadata
11. Overview of System Architecture
12. Once No More Iterations Are Needed
- Output of results in XML format (illustrative snippet below)
  - URI
  - confidence score
  - entity name (as it appears in the text)
  - start and end positions (location in the document)
- Can easily be converted to other formats
  - Microformats, RDFa, ...
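A hypothetical rendering of one such output record, reusing the Candidate class from the earlier sketches; the element and attribute names here are invented, and the paper's actual XML schema may differ:

```python
from xml.sax.saxutils import escape, quoteattr

def to_xml(c):
    """Serialize one candidate as an illustrative XML element."""
    end = c.start + len(c.name)
    return (f"<entity uri={quoteattr(c.uri)} "
            f"confidence={quoteattr(format(c.confidence, '.2f'))} "
            f"start={quoteattr(str(c.start))} end={quoteattr(str(end))}>"
            f"{escape(c.name)}</entity>")
```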
13. Sample Output
14. Sample Output: Microformat
15. Evaluation: Gold Standard Set
- We evaluate our method using a gold standard set of documents
  - randomly chose 20 consecutive posts from DBWorld
  - set of manually disambiguated documents
  - (two) humans validated the right entity match
- We used precision and recall as the evaluation measures for our system
16. Evaluation: Sample DBWorld Post
17. Sample Disambiguated Document
18. Using DBLP Data as the Ontology
- Converted DBLP's bibliographic data to RDF
  - 447,121 authors
- A SAX parser to convert DBLP's XML data to RDF (sketch below)
- Created relationships such as co-author
- Added
  - affiliations (for a subset of authors)
  - areas of interest (for a subset of authors)
  - spellings for international characters
- Lessons learned led us to create SwetoDblp (containing many improvements)
SwetoDblp: http://lsdis.cs.uga.edu/projects/semdis/swetodblp/
DBLP: http://www.informatik.uni-trier.de/~ley/db/
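A hedged sketch of such a conversion using Python's built-in SAX support; the element names follow DBLP's public XML, but the record types handled and the co-author triple format are simplified for illustration:

```python
import xml.sax

class DblpCoauthors(xml.sax.ContentHandler):
    """Collect co-author pairs from DBLP publication records."""
    PUB_TAGS = {"article", "inproceedings"}  # subset of DBLP's record types

    def __init__(self):
        super().__init__()
        self.in_author = False
        self.current = ""
        self.authors = []
        self.triples = []                    # (author, "co-author", author)

    def startElement(self, name, attrs):
        if name == "author":
            self.in_author, self.current = True, ""
        elif name in self.PUB_TAGS:
            self.authors = []

    def characters(self, content):
        if self.in_author:
            self.current += content

    def endElement(self, name):
        if name == "author":
            self.in_author = False
            self.authors.append(self.current)
        elif name in self.PUB_TAGS:
            # emit one co-author triple per author pair on this paper
            for i, a in enumerate(self.authors):
                for b in self.authors[i + 1:]:
                    self.triples.append((a, "co-author", b))

# Usage: handler = DblpCoauthors(); xml.sax.parse("dblp.xml", handler)
```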
19. Evaluation: Precision & Recall
- We define set A as the set of unique names identified using the disambiguated dataset (i.e., exact results)
- We define set B as the set of entities found by our method
- A ∩ B represents the set of entities correctly identified by our method
20. Evaluation: Precision & Recall
- Precision is the proportion of correctly disambiguated entities with regard to B
- Recall is the proportion of correctly disambiguated entities with regard to A
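In symbols, following the set definitions on the previous slide:

```latex
\mathrm{Precision} = \frac{|A \cap B|}{|B|}, \qquad
\mathrm{Recall} = \frac{|A \cap B|}{|A|}
```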
21. Evaluation: Results
- Precision and recall (compared to the gold standard; worked out below the table)
- Precision and recall on a per-document basis

Correct Disambiguations | Found Entities | Total Entities | Precision | Recall
602                     | 620            | 758            | 97.1%     | 79.4%
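The reported percentages follow directly from the table's counts:

```latex
\mathrm{Precision} = \frac{602}{620} \approx 97.1\%, \qquad
\mathrm{Recall} = \frac{602}{758} \approx 79.4\%
```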
22. Related Work
- Semex: Personal Information Management
  - The results of disambiguated entities are propagated to other ambiguous entities, which can then be reconciled based on recently reconciled entities (much like our work does)
  - Takes advantage of a predictable structure, such as fields where an email address or name is expected to appear
  - Our approach works with unstructured data
Semex: Dong, Halevy, Madhavan, SIGMOD 2005
23. Related Work
- KIM
  - Contains an entity-recognition component that uses natural language processing
  - Evaluations performed on human-annotated corpora
- SCORE technology (now http://www.fortent.com/)
  - Uses associations from a knowledge base, yet implementation details are not available (commercial product)
SCORE: Sheth et al., IEEE Internet Computing, 6(4), 2002
KIM: Popov et al., ISWC 2003
24. Conclusions
- Our method uses relationships between entities in the ontology to go beyond traditional syntax-based disambiguation techniques
- This work is among the first to successfully use relationships for identifying named entities in text without relying on the structure of the text
25. Future Work
- Improvements on spotting
  - e.g., canonical names (Tim / Timothy)
- Integration/deployment as a UIMA component
  - allows analysis across a document collection
  - for applications such as semantic annotation and search
- Further evaluations
  - using different datasets and document sets
  - comparing with other methods
  - determining the best contributing factor in disambiguation
  - measuring how far down the list the right entity was missed
UIMA: IBM's Unstructured Information Management Architecture
26. Scalability, Semantics, Automation
- Usage of background knowledge in the form of a (large) populated ontology
- Flexibility to use a different ontology, but
  - the ontology must fit the domain
- It is an automatic approach, yet
  - a human defines threshold values (and some weights)
27. References
- Aleman-Meza, B., Nagarajan, M., Ramakrishnan, C., Ding, L., Kolari, P., Sheth, A., Arpinar, B., Joshi, A., Finin, T.: Semantic Analytics on Social Networks: Experiences in Addressing the Problem of Conflict of Interest Detection. 15th International World Wide Web Conference, Edinburgh, Scotland (2006)
- DBWorld. http://www.cs.wisc.edu/dbworld/ (April 9, 2006)
- Dong, X. L., Halevy, A., Madhavan, J.: Reference Reconciliation in Complex Information Spaces. Proc. of SIGMOD, Baltimore, MD (2005)
- Ley, M.: The DBLP Computer Science Bibliography: Evolution, Research Issues, Perspectives. Proc. of the 9th International Symposium on String Processing and Information Retrieval, Lisbon, Portugal (Sept. 2002) 1-10
- Popov, B., Kiryakov, A., Kirilov, A., Manov, D., Ognyanoff, D., Goranov, M.: KIM - Semantic Annotation Platform. Proc. of the 2nd Intl. Semantic Web Conf., Sanibel Island, Florida (2003)
- Sheth, A., Bertram, C., Avant, D., Hammond, B., Kochut, K., Warke, Y.: Managing Semantic Content for the Web. IEEE Internet Computing, 6(4) (2002) 80-87
- Zhu, J., Uren, V., Motta, E.: ESpotter: Adaptive Named Entity Recognition for Web Browsing. 3rd Professional Knowledge Management Conference, Kaiserslautern, Germany (2005)
- Evaluation datasets at http://lsdis.cs.uga.edu/~aleman/publications/Hassell_ISWC2006/