Grouping search-engine returned citations for person-name queries

1
Grouping search-engine returned citations for
person-name queries
  • Reema Al-Kamha, David W. Embley
  • (Proceedings of the 6th Annual ACM International
    Workshop on Web Information and Data Management,
    2004)

2
Abstract
  • They present a technique to group search-engine
    returned citations for person-name queries.
  • The objective is to put the returned citations in
    groups such that each group relates to one
    person.
  • They use a multi-faceted approach that considers
    evidence from three facets (attributes, links,
    page similarity).
  • They construct a relatedness confidence matrix
    for pairs of citations.
  • They merge pairs whose matching confidence value
    is above a threshold.

3
(No Transcript)
4
Related work
  • The problem is related to cross-document
    coreferencing and object identity.
  • G. Mann and D. Yarowsky (2003)
  • They use document vectors over biographical
    information such as birth year, birth place,
    and spouse name.
  • S. Tejada (2001)
  • For object identification, one technique is
    vector space modeling and the other is
    probabilistic modeling.

5
A multi-faceted approach
  • They use a multi-faceted method to group relevant
    citations.
  • Each facet represents one aspect of deciding
    whether two citations reference the same person
    or different persons.
  • In this paper, they consider attributes about a
    person, links within and among sites, and page
    similarity as facets.

6
Facet 1: Attributes
  • Attributes they found by manual inspection are
    phone number, email address, state, city and zip
    code.
  • In order to extract values from a web page, they
    write regular expressions for each attribute.
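
The slides do not show the authors' regular expressions, so the following is only a minimal sketch of what attribute extraction could look like. The patterns assume US-style phone numbers and ZIP codes; the names `EMAIL_RE`, `PHONE_RE`, `ZIP_RE`, and `extract_attributes` are illustrative, and the sample text is invented. City and state are left out because they would normally need a lookup list rather than a pattern.

```python
import re

# Assumed patterns for illustration only; the paper's own expressions are not given.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}[-.\s])\d{3}[-.\s]\d{4}\b")
ZIP_RE = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def extract_attributes(text):
    """Return every email, phone number, and ZIP code found in the page text."""
    return {
        "email": EMAIL_RE.findall(text),
        "phone": PHONE_RE.findall(text),
        "zip": ZIP_RE.findall(text),
    }


# Invented sample page text.
sample = "Contact Jane Doe at jane.doe@example.edu or (801) 555-0123, Provo, UT 84602"
attributes = extract_attributes(sample)
```

Two citations whose pages yield the same value for an attribute (for example, the same email address) would then count as sharing that attribute.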

7
Facet 2: Links (1)
  • If two URLs share a common host, they may refer
    to the same person.
  • If the URL of one citation has the same host as
    one of the URLs that belongs to the web page
    referred by the other citation, they may refer
    to the same person.
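
The two link rules above can be sketched as simple host comparisons. This is an assumption-laden illustration, not the authors' code: the function names are made up, the example URLs are invented, and URLs are assumed to carry a scheme such as `http://` so that `urlparse` can find the host.

```python
from urllib.parse import urlparse


def host_of(url):
    """Return the lowercased host of a URL, or None if it cannot be determined."""
    return urlparse(url).hostname


def share_host(url_a, url_b):
    """Rule 1: the two citation URLs are on the same host."""
    host_a, host_b = host_of(url_a), host_of(url_b)
    return host_a is not None and host_a == host_b


def page_links_to_host(citation_url, links_on_other_page):
    """Rule 2: a URL found on the other citation's page shares this citation's host."""
    host = host_of(citation_url)
    return host is not None and any(host_of(link) == host for link in links_on_other_page)


# Invented URLs for illustration.
same = share_host("http://www.example.edu/~kelly", "http://www.example.edu/faculty")
linked = page_links_to_host(
    "http://www.example.edu/~kelly",
    ["http://other.example.org/", "http://www.example.edu/courses"],
)
```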

8
(No Transcript)
9
Facet 2: Links (2)
  • Because many names often appear on popular hosts,
    when two citations share a popular host, there is
    less confidence that they refer to the same
    person.
  • They need a way to determine whether a host is
    popular or not.
  • The query link:siteURL in Google shows all pages
    that point to that URL.
  • A host h is popular for person-name queries if
    more than 400 pages point to h.
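
The popularity test itself reduces to a threshold check once the number of pointing pages is known. The sketch below assumes those counts have already been collected into a dictionary; it does not query any search engine, and the names `POPULAR_THRESHOLD`, `is_popular_host`, and the sample counts are hypothetical.

```python
POPULAR_THRESHOLD = 400  # from the slide: more than 400 pointing pages means popular


def is_popular_host(host, inlink_counts):
    """Return True if more than POPULAR_THRESHOLD pages point to the host.

    inlink_counts maps host -> number of pages pointing to it (assumed to be
    gathered beforehand); unknown hosts are treated as not popular.
    """
    return inlink_counts.get(host, 0) > POPULAR_THRESHOLD


# Invented counts for illustration.
counts = {"www.bigportal.example": 12000, "lab.example.edu": 35}
popular = is_popular_host("www.bigportal.example", counts)
obscure = is_popular_host("lab.example.edu", counts)
```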

10
Facet 3: Page Similarity (1)
  • If two different web pages are similar, they may
    refer to the same person.
  • They use pairs of words that start with a capital
    letter and that are either adjacent or separated
    by a connector (and, or, but), by a preposition
    which may be followed by an article (a, an, the),
    or by a single capital letter followed by a dot.
  • Example: David Embley, who is a professor of the
    Data Extraction Research Group in the Computer
    Science Department at Brigham Young University.
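
One possible reading of the pairing rule is sketched below. Several details are assumptions rather than the authors' definitions: the preposition list, the tokenizer, and the choice to treat a cap-word as a capitalized alphabetic token of at least two letters. The function names are illustrative.

```python
import re

CONNECTORS = {"and", "or", "but"}
ARTICLES = {"a", "an", "the"}
# Assumed preposition list; the slides do not enumerate one.
PREPOSITIONS = {"of", "in", "at", "for", "on", "to", "with", "by", "from"}


def tokenize(text):
    """Split into words and single punctuation marks, so a comma breaks adjacency."""
    return re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)


def is_cap_word(token):
    """A capitalized alphabetic token of at least two letters."""
    return len(token) > 1 and token.isalpha() and token[0].isupper()


def cap_word_pairs(text):
    """Return (first, second) cap-word pairs that are adjacent or validly separated."""
    tokens = tokenize(text)

    def at(k):
        return tokens[k] if k < len(tokens) else ""

    pairs = []
    for i, first in enumerate(tokens):
        if not is_cap_word(first):
            continue
        j = i + 1
        nxt = at(j)
        if is_cap_word(nxt):
            pass  # directly adjacent
        elif nxt.lower() in CONNECTORS:
            j += 1  # skip the connector
        elif nxt.lower() in PREPOSITIONS:
            j += 1  # skip the preposition and an optional article
            if at(j).lower() in ARTICLES:
                j += 1
        elif len(nxt) == 1 and nxt.isupper() and at(j + 1) == ".":
            j += 2  # skip a middle initial such as "W."
        if is_cap_word(at(j)):
            pairs.append((first, at(j)))
    return pairs


sentence = ("David Embley, who is a professor of the Data Extraction Research "
            "Group in the Computer Science Department at Brigham Young University.")
pairs = cap_word_pairs(sentence)
```

Under these assumptions, the example sentence yields pairs such as (David, Embley), (Group, Computer), and (Department, Brigham).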

11
Facet 3: Page Similarity (2)
  • They construct a stop-word list, which is a list
    of frequently appearing adjacent cap-word pairs.
  • Examples: Home Page, Privacy Policy
  • They collected approximately 10,000 web documents
    taken at random from the Open Directory Project.
  • They constructed all adjacent cap-word pairs,
    sorted them by frequency, and treated every pair
    with a frequency greater than two as a stop
    word.
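
Building the stop list amounts to counting pairs over a corpus and keeping the frequent ones. The sketch below assumes the cap-word pairs of each document have already been extracted, and it counts total occurrences across the corpus; whether the authors counted occurrences or documents is not stated in the slides. The function name and sample data are illustrative.

```python
from collections import Counter


def build_stop_pairs(pairs_per_document, min_frequency=2):
    """Return the set of pairs whose corpus frequency is greater than min_frequency."""
    counts = Counter()
    for pairs in pairs_per_document:
        counts.update(pairs)
    return {pair for pair, n in counts.items() if n > min_frequency}


# Invented miniature corpus: one list of cap-word pairs per document.
corpus = [
    [("Home", "Page"), ("Kelly", "Flanagan")],
    [("Home", "Page"), ("Privacy", "Policy")],
    [("Home", "Page"), ("Privacy", "Policy")],
]
stop_pairs = build_stop_pairs(corpus)
```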

12
Facet 3: Page Similarity (3)
  • They consider the number of adjacent cap-word
    pairs that two web pages share as an indicator
    of the similarity between the pages.
  • The greater the number of shared adjacent
    cap-word pairs, the greater the similarity
    between the pages.
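
As a rough illustration, the similarity indicator can be computed as the size of a set intersection. The sketch assumes each page has already been reduced to its cap-word pairs and that stop-word pairs are excluded before counting; the function name and the sample pairs are made up.

```python
def shared_pair_count(pairs_a, pairs_b, stop_pairs=frozenset()):
    """Count distinct cap-word pairs that occur on both pages, ignoring stop pairs."""
    return len((set(pairs_a) & set(pairs_b)) - set(stop_pairs))


# Invented example pages.
page_1 = [("Kelly", "Flanagan"), ("Computer", "Science"), ("Home", "Page")]
page_2 = [("Kelly", "Flanagan"), ("Computer", "Science"), ("Home", "Page"), ("Salt", "Lake")]
score = shared_pair_count(page_1, page_2, stop_pairs={("Home", "Page")})
```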

13
Confidence Matrix Construction (1)
  • They construct one confidence matrix for each
    facet.
  • First, they construct a training set to compute
    the conditional probabilities.
  • The training set must meet several requirements.
  • It should contain male, female, and
    gender-neutral names.
  • It should contain names whose returned citations
    fall into groups of different sizes.
  • It should contain names whose returned citations
    fall into different numbers of groups.
  • They entered each of the 9 names as a Google
    query and collected the first 50 returned
    citations for each name.

14
Confidence Matrix Construction (2)
  • They use the training set to estimate the
    conditional probabilities.
  • P(Same Person = Yes | Email = Yes)
  • P(Same Person = Yes | City = Yes and State = Yes)
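
A conditional probability of this kind can be estimated as a relative frequency over labeled citation pairs. The sketch below is only an assumed reading of the estimation step: the record layout (`same_person`, `email`, `city`, `state` flags), the function name, and the toy data are all hypothetical.

```python
def conditional_probability(labeled_pairs, evidence):
    """Estimate P(Same Person = Yes | evidence) as a relative frequency.

    labeled_pairs: list of dicts, each describing one pair of citations.
    evidence: function returning True when the pair shows the evidence.
    Returns None when no pair shows the evidence.
    """
    matching = [pair for pair in labeled_pairs if evidence(pair)]
    if not matching:
        return None
    return sum(1 for pair in matching if pair["same_person"]) / len(matching)


# Invented training pairs: flags say which attributes the two pages share.
training = [
    {"same_person": True, "email": True, "city": True, "state": True},
    {"same_person": True, "email": True, "city": False, "state": True},
    {"same_person": True, "email": True, "city": True, "state": True},
    {"same_person": False, "email": True, "city": False, "state": False},
    {"same_person": False, "email": False, "city": True, "state": True},
]
p_email = conditional_probability(training, lambda p: p["email"])
p_city_state = conditional_probability(training, lambda p: p["city"] and p["state"])
```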

15
Final Confidence Matrix
  • They generate the final confidence matrix by
    combining the confidence matrices for the three
    facets using Stanford certainty theory.
  • Stanford certainty theory gives the following
    rule to combine the evidence from two
    independent observations.
  • Suppose CF(E1) is the certainty factor associated
    with evidence E1 for some observation B, and
    CF(E2) is another certainty factor. The
    compounded CF of B is calculated by
    CF(E1) + CF(E2) - CF(E1) × CF(E2).
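
The combination rule can be applied repeatedly to fold in more than two facets. The sketch below assumes certainty factors in the range 0 to 1 and uses illustrative function names; the values 0.96, 0, and 0.78 are the ones used in the worked example later in the deck.

```python
from functools import reduce


def combine_two(cf1, cf2):
    """Combine two certainty factors: CF1 + CF2 - CF1 * CF2."""
    return cf1 + cf2 - cf1 * cf2


def combine(certainty_factors):
    """Fold the pairwise rule over any number of certainty factors."""
    return reduce(combine_two, certainty_factors, 0.0)


final_confidence = combine([0.96, 0.0, 0.78])
```

Applying the pairwise rule in sequence gives the same value as the expanded three-term expression, because the rule is equivalent to one minus the product of (1 - CF) over all factors.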

16
Grouping Algorithm
  • If the confidence between two citations Ci and
    Cj is high, they are grouped into a set S1.
  • If the confidence between two citations Cj and
    Ck is high, they are grouped into a set S2.
  • Because S1 and S2 share one or more citations,
    they are merged into one group S3.
  • Keep merging any two sets of citations that share
    one or more citations until no citation is shared
    between any two sets.
  • The confidence threshold is 0.8.
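
Merging sets until none overlap produces the connected components of the graph whose edges are the high-confidence pairs. The sketch below uses a union-find structure to get the same result; that data structure is a substitution made here for brevity, not something the slides describe. The matrix values and function name are invented, and "high" is taken to mean strictly above the 0.8 threshold.

```python
def group_citations(confidence, threshold=0.8):
    """Group citation indices so that pairs above the threshold end up together.

    confidence: square symmetric matrix (list of lists) of pairwise confidences.
    Returns a sorted list of groups, each a sorted list of citation indices.
    """
    n = len(confidence)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if confidence[i][j] > threshold:
                parent[find(i)] = find(j)  # merge the two sets

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Invented matrix: 0-1 and 1-2 are confident matches, 3 matches nothing.
matrix = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.85, 0.1],
    [0.1, 0.85, 1.0, 0.1],
    [0.1, 0.1, 0.1, 1.0],
]
groups = group_citations(matrix)
```

Citations 0 and 2 land in the same group even though their direct confidence is low, because each is confidently matched to citation 1.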

17
Example (1)
  • They apply their technique to the first 10
    returned citations for the person-name query
    Kelly Flanagan.
  • Pages referenced by the two citations C4 and C7
    have the same city and state.
  • They have P(Same Person = Yes | City = Yes
    and State = Yes) = 0.96.

18
Example (2)
  • The final confidence value between citations C1
    and C8, using Stanford certainty theory, is
    0.96 + 0 + 0.78 - 0.96×0 - 0.96×0.78 - 0.78×0
    + 0.96×0×0.78 = 0.9912.

19
Experimental results (1)
  • They chose 10 names by opening an arbitrary page
    from a phone book and choosing an arbitrary name
    from the page.
  • The system returned the grouping result for the
    first 50 returned citations for each name.
  • The test set contains 500 citations.

20
Experimental results (2)
  • To evaluate the performance of their system, they
    use split and merge measures.
  • First, they count how many splits are needed
    over all the groups so that the citations in each
    group relate to one person.
  • Then, they count how many merges are needed
    between the groups so that no two groups
    relate to the same person.
  • They normalize the split and merge scores to
    range between 0 and 1.
  • For an example, see the Evaluation example slide.

21
Experimental results (3)
22
Experimental results (4)
  • Using a multi-faceted approach gives much better
    performance than using each facet separately.
  • For groups that should have been merged, no
    evidence or only weak evidence was found to group
    them.
  • A human expert may look at pictures or rely on a
    deeper understanding of the meaning of
    distinguishing phrases.

23
Concluding remarks
  • They designed and implemented a system that can
    automatically group the returned citations from a
    search engine person-name query.
  • They used a multi-faceted approach that considers
    three facets.
  • They gave experimental evidence to show that
    their approach can be successful.

24
Evaluation example
  • Correct grouping result for 8 citations
  • G1: C1, C2, C4, C6, C7
  • G2: C3, C8
  • G3: C5
  • The grouping result of their system
  • G1: C1, C2, C4
  • G2: C3, C6, C7
  • G3: C5, C8
  • The number of splits over all the groups is
    0 + 1 + 1 = 2, and the total number of merges
    is 2.
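
The split and merge counts for this example can be checked with a short script. It is a sketch under one reading of the measures: a system group needs one split for each extra person it contains, and each person needs one merge for each extra fragment left after splitting. The normalization to the range 0 to 1 is not reproduced because the slides do not give its formula. The function name is illustrative.

```python
def split_merge_counts(correct, predicted):
    """Return (splits, merges) needed to turn the predicted grouping into the correct one.

    correct, predicted: dicts mapping a group label to a list of citation ids.
    """
    # Which correct group each citation belongs to.
    owner = {c: label for label, members in correct.items() for c in members}

    splits = 0
    fragments = {label: 0 for label in correct}
    for members in predicted.values():
        labels = {owner[c] for c in members}
        splits += len(labels) - 1  # one split per extra person in the group
        for label in labels:
            fragments[label] += 1  # this person now has one more fragment

    # One merge per extra fragment of the same person.
    merges = sum(count - 1 for count in fragments.values() if count > 0)
    return splits, merges


correct = {"G1": ["C1", "C2", "C4", "C6", "C7"], "G2": ["C3", "C8"], "G3": ["C5"]}
predicted = {"G1": ["C1", "C2", "C4"], "G2": ["C3", "C6", "C7"], "G3": ["C5", "C8"]}
splits, merges = split_merge_counts(correct, predicted)
```

For the groupings on this slide the script gives 2 splits and 2 merges, matching the counts above.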