Disambiguation Algorithm for People Search on the Web


Title: Disambiguation Algorithm for People Search on the Web


1
Disambiguation Algorithm for People Search on
the Web
  • Dmitri V. Kalashnikov, Sharad Mehrotra, Zhaoqi Chen,
    Rabia Nuray-Turan, Naveen Ashish
  • For questions visit http://www.ics.uci.edu/~dvk
  • Computer Science Department,
    University of California, Irvine

2
Entity (People) Search
[Figure: Top-K Webpages grouped into Person1, Person2, Person3; annotated "Unknown beforehand"]
3
Standard Approach to Entity Resolution
4
Key Observation: More Info is Available

5
RelDC Framework
6
Where is the Graph here? Use Extraction!
7
Overall Algorithm Overview
  • User Input. A user submits a query to the
    middleware via a Web-based interface.
  • Web page Retrieval. The middleware queries a
    search engine's API and gets the top-K Web pages.
  • Preprocessing. The retrieved Web pages are
    preprocessed:
      • TF/IDF. Preprocessing steps for computing
        TF/IDF are carried out.
      • Ontology. Ontologies are used to enrich the
        Web page content.
      • Extraction. Named entities and Web-related
        information are extracted from the Web pages.
  • Graph Creation. The Entity-Relationship Graph is
    generated.
  • Enhanced TF/IDF. Ontology-enhanced TF/IDF values
    are computed.
  • Clustering. Correlation clustering is applied.
  • Cluster Processing. Each resulting cluster is
    then processed as follows:
      • Sketches. A set of keywords that represents
        the Web pages within a cluster is computed for
        each cluster. The goal is that the user should
        be able to find the person of interest by
        looking at the sketch.
      • Cluster Ranking. All clusters are ranked by a
        chosen criterion so that they can be presented
        to the user in a certain order.
      • Web page Ranking. Once the user homes in on a
        particular cluster, the Web pages in this
        cluster are presented in a certain order,
        computed in this step.
  • Visualization of Results. The results are
    presented to the user in the form of clusters
    (and their sketches) that correspond to namesakes
    and can be explored further.
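
The slides do not say how the Sketches step picks its keywords. As a minimal sketch of one plausible choice, the code below scores each term by its frequency within a cluster, discounted by how many clusters it appears in (a TF/IDF-style weighting taken over clusters). The function name `cluster_sketches`, the `top_n` parameter, and the toy data are illustrative assumptions, not part of the presented system.

```python
import math
from collections import Counter

def cluster_sketches(clusters, top_n=3):
    """clusters: dict mapping cluster id -> list of token lists (one per Web page)."""
    # Term frequency per cluster, pooling all pages in the cluster.
    tf = {cid: Counter(tok for page in pages for tok in page)
          for cid, pages in clusters.items()}
    n = len(clusters)
    # Cluster frequency: the number of clusters in which each term occurs.
    cf = Counter(term for counts in tf.values() for term in counts)
    sketches = {}
    for cid, counts in tf.items():
        # Assumed scoring: term count times log(n / cluster frequency).
        scored = {t: c * math.log(n / cf[t]) for t, c in counts.items()}
        # Highest score first; ties broken alphabetically for determinism.
        ranked = sorted(scored, key=lambda t: (-scored[t], t))
        sketches[cid] = ranked[:top_n]
    return sketches

# Toy clusters for two hypothetical namesakes.
clusters = {
    "A": [["professor", "database", "irvine"], ["database", "research"]],
    "B": [["guitar", "band", "tour"], ["band", "album"]],
}
sketches = cluster_sketches(clusters)
```

Terms shared by every cluster score zero under this weighting, so they drop out of the sketch, which fits the stated goal of letting the user tell namesakes apart.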

8
Correlation Clustering
  • In CC, each pair of nodes (u,v) is labeled
    with a + or - edge
  • Labeling is done according to a similarity
    function s(u,v)
  • Similarity function s(u,v)
      • if s(u,v) believes u and v are similar, then
        label the edge +
      • else label it -
      • s(u,v) is typically trained from past data
  • Clustering
      • looks at the edges
      • tries to minimize disagreement
      • disagreement for an element x placed in
        cluster C is the number of - edges that
        connect x to the other elements in C
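
The disagreement measure defined on this slide can be sketched in a few lines. The code counts the - edges between an element and the other members of its cluster. The labeling function, the 0.5 threshold, and the toy similarity values are hypothetical stand-ins for a trained s(u,v), and no clustering search is shown.

```python
def make_labeler(similarity, threshold=0.5):
    """Turn a similarity function s(u, v) into a +/- edge labeler.

    The threshold is an assumption; the slides only say that s(u, v)
    decides whether u and v are similar.
    """
    def label(u, v):
        return "+" if similarity(u, v) >= threshold else "-"
    return label

def disagreement(x, cluster, label):
    """Number of '-' edges connecting x to the other elements of its cluster."""
    return sum(1 for y in cluster if y != x and label(x, y) == "-")

def total_disagreement(cluster, label):
    """Sum of per-element disagreements (each '-' edge is counted twice)."""
    return sum(disagreement(x, cluster, label) for x in cluster)

# Toy symmetric similarity over three Web pages (made-up values).
scores = {
    frozenset(("p1", "p2")): 0.9,
    frozenset(("p1", "p3")): 0.2,
    frozenset(("p2", "p3")): 0.8,
}
label = make_labeler(lambda u, v: scores[frozenset((u, v))])

cluster = ["p1", "p2", "p3"]
d_p1 = disagreement("p1", cluster, label)  # only the p1-p3 edge is '-'
d_p2 = disagreement("p2", cluster, label)  # both edges from p2 are '+'
```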

9
Similarity Function
  • Connection strength between u and v:
    c(u,v) = \sum_k c_k w_k
  • where c_k is the number of u-v paths of type k
  • and w_k is the weight of u-v paths of type k
  • Similarity s(u,v) is a combination
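
A small sketch of the connection strength follows. It reads the strength as a weighted sum over path types, the sum over k of c_k times w_k, which is an assumption because the formula itself did not survive in the transcript. The path-type names and weight values are made up for illustration. In the presented system the weights would presumably come from training.

```python
def connection_strength(path_counts, weights):
    """Weighted sum over path types: sum_k c_k * w_k.

    path_counts: dict mapping path type k -> number of u-v paths of that type (c_k)
    weights:     dict mapping path type k -> weight of that type (w_k)
    Path types with no known weight contribute nothing (an assumption).
    """
    return sum(c * weights.get(k, 0.0) for k, c in path_counts.items())

# Hypothetical path types between two Web pages u and v.
path_counts = {"page-org-page": 2, "page-person-page": 1}
weights = {"page-org-page": 0.5, "page-person-page": 0.25}

strength = connection_strength(path_counts, weights)  # 2*0.5 + 1*0.25
```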

10
Training s(u,v) on pre-labeled data
11
Experiments: Quality of Disambiguation
By Artiles et al. in SIGIR'05
By Bekkerman & McCallum in WWW'05
12
Experiments: Effect on Search