Disambiguation Algorithm for People Search on the Web presentation

1
Disambiguation Algorithm for People Search on the Web

Dmitri V. Kalashnikov, Sharad Mehrotra, Zhaoqi Chen, Rabia Nuray-Turan, Naveen Ashish
For questions visit http://www.ics.uci.edu/~dvk
Computer Science Department
University of California, Irvine

2
Entity (People) Search
[Figure: Top-K Webpages grouped by the person they refer to (Person1, Person2, Person3); label: "Unknown beforehand"]
3
Standard Approach to Entity Resolution
4
Key Observation: More Info is Available

5
RelDC Framework
6
Where is the Graph here? Use Extraction!
7
Overall Algorithm Overview

User Input. A user submits a query to the middleware via a web-based interface.
Web page Retrieval. The middleware queries a search engine's API and gets the top-K Web pages.
Preprocessing. The retrieved Web pages are preprocessed:
  TF/IDF. Preprocessing steps for computing TF/IDF are carried out.
  Ontology. Ontologies are used to enrich the Web page content.
  Extraction. Named entities and Web-related information are extracted from the Web pages.
Graph Creation. The Entity-Relationship Graph is generated.
Enhanced TF/IDF. Ontology-enhanced TF/IDF values are computed.
Clustering. Correlation clustering is applied.
Cluster Processing. Each resulting cluster is then processed as follows:
  Sketches. A set of keywords that represent the Web pages within a cluster is computed for each cluster. The goal is that the user should be able to find the person of interest by looking at the sketch.
  Cluster Ranking. All clusters are ranked by a chosen criterion, which determines the order in which they are presented to the user.
  Web page Ranking. Once the user homes in on a particular cluster, the Web pages in this cluster are presented in an order computed at this step.
Visualization of Results. The results are presented to the user in the form of clusters (and their sketches) that correspond to namesakes and can be explored further.
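The Sketches step can be illustrated with a small example. This is only a minimal sketch under stated assumptions: the slides do not say how sketch keywords are chosen, so the TF/IDF-style score, the function name cluster_sketches, and the top_n parameter are all hypothetical and stand in for whatever the actual system uses.

```python
import math
from collections import Counter

def cluster_sketches(clusters, top_n=3):
    """Pick top_n keywords per cluster with a TF/IDF-style score.

    clusters: dict mapping a cluster id to a list of pages,
    where each page is a list of tokens (hypothetical input format).
    """
    all_pages = [page for pages in clusters.values() for page in pages]
    n_pages = len(all_pages)

    # Document frequency: in how many pages does each token occur?
    df = Counter()
    for page in all_pages:
        df.update(set(page))

    sketches = {}
    for cid, pages in clusters.items():
        # Term frequency aggregated over all pages of the cluster.
        tf = Counter(tok for page in pages for tok in page)
        scores = {t: tf[t] * math.log(n_pages / df[t]) for t in tf}
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        sketches[cid] = ranked[:top_n]
    return sketches
```

A term that occurs on every retrieved page, such as the queried name itself, gets a score of zero and so drops to the bottom of each cluster's ranking, which is the behavior one would want from a keyword set meant to tell namesakes apart.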

8
Correlation Clustering

In CC, each pair of nodes (u,v) is labeled with a + or - edge
  labeling is done according to a similarity function s(u,v)
Similarity function s(u,v)
  if s(u,v) believes u and v are similar, then label +
  else label -
  s(u,v) is typically trained from past data
Clustering
  looks at edges
  tries to minimize disagreement
  disagreement for element x placed in cluster C is the number of - edges that connect x and other elements in C
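The edge labeling and the per-element disagreement described on this slide can be sketched as follows. The 0.5 threshold and the names label_edges and disagreement are assumptions made for illustration; the slides only say that s(u,v) decides whether u and v are similar.

```python
from itertools import combinations

def label_edges(nodes, s, threshold=0.5):
    """Label every unordered pair '+' or '-' using similarity function s.

    The threshold is an assumed way of turning s(u, v) into a decision.
    """
    return {
        frozenset((u, v)): "+" if s(u, v) >= threshold else "-"
        for u, v in combinations(nodes, 2)
    }

def disagreement(x, cluster, labels):
    """Number of '-' edges connecting x to the other elements of cluster."""
    return sum(
        1 for y in cluster
        if y != x and labels[frozenset((x, y))] == "-"
    )
```

For example, with three pages where only p1 and p2 are judged similar, placing p3 in the same cluster as p1 and p2 gives it a disagreement of 2, while p1 and p2 each have a disagreement of 1.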

9
Similarity Function

Connection strength between u and v: c(u,v) = \sum_k c_k w_k
  where c_k is the number of u-v paths of type k
  and w_k is the weight of u-v paths of type k
Similarity s(u,v) is a combination
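As a worked illustration of the connection strength, the snippet below assumes it is the weighted sum of path counts over path types, which is what the definitions of the counts and weights suggest. The path type names and weight values are made up for the example; in the presentation the weights would come from training.

```python
def connection_strength(path_counts, weights):
    """Weighted sum over path types: count of type k times weight of type k.

    path_counts: dict mapping a path type to the number of u-v paths of that type.
    weights: dict mapping a path type to its weight (path types without a
    weight contribute nothing, an assumption of this sketch).
    """
    return sum(count * weights.get(k, 0.0) for k, count in path_counts.items())

# Hypothetical path types and weights, for illustration only.
counts = {"shared_person": 2, "shared_organization": 1}
w = {"shared_person": 0.5, "shared_organization": 0.25}
strength = connection_strength(counts, w)  # 2 * 0.5 + 1 * 0.25
```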

10
Training s(u,v) on pre-labeled data
11
Experiments: Quality of Disambiguation
By Artiles et al. in SIGIR'05
By Bekkerman and McCallum in WWW'05
12
Experiments: Effect on Search
