Grouping search-engine returned citations for person-name queries

1
Grouping search-engine returned citations for
person-name queries

Reema Al-Kamha, David W. Embley
(Proceedings of the 6th annual ACM international
workshop on Web information and data management
2004)

2
Abstract

They present a technique to group search-engine
returned citations for person-name queries.
The objective is to put the returned citations in
groups such that each group relates to one
person.
They use a multi-faceted approach that considers
evidence from three facets (attributes, links,
page similarity).
They construct a relatedness confidence matrix
for pairs of citations.
They merge pairs whose matching confidence value
is above a threshold.

4
Related work

The problem is related to cross-document
coreferencing and object identity.
G. Mann and D. Yarowsky (2003)
They use document vectors over biographical
information such as birth year, birth place,
and spouse name.
S. Tejada (2001)
For object identification, one technique is
vector space modeling, and another is
probabilistic modeling.

5
A multi-faceted approach

They use a multi-faceted method to group relevant
citations.
Each facet represents one aspect of deciding
whether two citations reference the same person
or different persons.
In this paper, they consider attributes about a
person, links within and among sites, and page
similarity as facets.

6
Facet 1 Attributes

Attributes they found by manual inspection are
phone number, email address, state, city and zip
code.
In order to extract values from a web page, they
write regular expressions for each attribute.
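
The slides do not show the regular expressions themselves. The following is a minimal sketch of what per-attribute extraction could look like, assuming US-style phone numbers and zip codes. The patterns, the PATTERNS table and the extract_attributes helper are illustrative assumptions, not the authors' expressions; state and city are left out because they would more plausibly need lookup lists.

```python
import re

# Hypothetical patterns for three of the five attributes (assumed formats).
PATTERNS = {
    "phone": re.compile(r"\(?\b\d{3}\)?[-.\s]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "zip": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}

def extract_attributes(text):
    """Return, for each attribute, the set of values found in the page text."""
    return {name: set(p.findall(text)) for name, p in PATTERNS.items()}

page = "Contact: jane.doe@example.org, (801) 555-0123, Provo, UT 84602"
values = extract_attributes(page)
```

Two pages can then be compared attribute by attribute, for example by checking whether their email sets intersect.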

7
Facet 2 Links (1)

If the URLs of two citations share a common host,
the citations may refer to the same person.
If the URL of one citation has the same host as
one of the URLs found on the web page referenced
by the other citation, the two citations may
refer to the same person.
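
The two link checks can be sketched with the standard library URL parser. The function names and the idea of passing in the outgoing links of a page as a plain list are assumptions made for illustration; how the authors collected the links is not stated on the slide.

```python
from urllib.parse import urlparse

def host(url):
    """Lower-cased host part of a URL, or "" if there is none."""
    return urlparse(url).hostname or ""

def share_host(url_a, url_b):
    """Check 1: the two citation URLs have the same host."""
    return host(url_a) != "" and host(url_a) == host(url_b)

def linked_by_host(citation_url, other_page_links):
    """Check 2: some URL on the other citation's page has this citation's host."""
    h = host(citation_url)
    return h != "" and any(host(link) == h for link in other_page_links)
```

Either check only suggests that two citations may be related; it is one piece of evidence among several.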

9
Facet 2 Links (2)

Because many names often appear on popular hosts,
when two citations share a popular host, there is
less confidence that they refer to the same
person.
They need a way to determine whether a host is
popular or not.
The Google query link:siteURL shows all pages
that point to that URL.
A host h is popular for person-name queries if
more than 400 pages point to h.
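
As a small sketch of the popularity rule, assume the number of pages pointing to a host has already been obtained somehow (the inlink_count argument is hypothetical input; no search-engine call is made here). The labels returned by host_evidence are also illustrative, since the slide only says that confidence is lower for popular hosts.

```python
POPULAR_THRESHOLD = 400  # from the slide: more than 400 pages point to h

def is_popular(inlink_count, threshold=POPULAR_THRESHOLD):
    """A host is popular if more than `threshold` pages point to it."""
    return inlink_count > threshold

def host_evidence(hosts_match, inlink_count):
    """Label the strength of shared-host evidence (labels are assumptions)."""
    if not hosts_match:
        return "none"
    return "weak" if is_popular(inlink_count) else "strong"
```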

10
Facet 3 Page Similarity (1)

If two different web pages are similar, they may
refer to the same person.
They use pairs of words that each start with a
capital letter and that are either adjacent or
separated by a connector (and, or, but), by a
preposition which may be followed by an article
(a, an, the), or by a single capital letter
followed by a dot.
Example sentence: David Embley, who is a
professor of the Data Extraction Research Group
in the Computer Science Department at Brigham
Young University.
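
The pairing rule can be sketched as a small tokenizer plus a look-ahead. This is one possible reading of the rule, not the authors' implementation: the preposition list is a short assumed sample, and the tokenization is deliberately simple.

```python
import re

CONNECTORS = {"and", "or", "but"}
PREPOSITIONS = {"of", "in", "at", "for", "on", "to", "with", "by", "from"}  # assumed, partial
ARTICLES = {"a", "an", "the"}

def is_cap(tok):
    """A word made of letters that starts with a capital letter."""
    return tok.isalpha() and tok[0].isupper()

def is_initial(tok):
    """A single capital letter followed by a dot, such as 'W.'."""
    return len(tok) == 2 and tok[0].isupper() and tok[1] == "."

def cap_word_pairs(text):
    """Return cap-word pairs that are adjacent or separated by an allowed gap."""
    tokens = re.findall(r"[A-Z]\.(?=\s)|[A-Za-z]+|[^\sA-Za-z]+", text)
    pairs = []
    for i, tok in enumerate(tokens):
        if not is_cap(tok):
            continue
        rest = tokens[i + 1:i + 4]  # at most: preposition, article, cap word
        if not rest:
            continue
        k = 0  # index in `rest` where the second cap word must sit
        if rest[0] in CONNECTORS or is_initial(rest[0]):
            k = 1
        elif rest[0] in PREPOSITIONS:
            k = 2 if len(rest) > 1 and rest[1] in ARTICLES else 1
        if k < len(rest) and is_cap(rest[k]):
            pairs.append((tok, rest[k]))
    return pairs

sentence = ("David Embley, who is a professor of the Data Extraction "
            "Research Group in the Computer Science Department at "
            "Brigham Young University.")
pairs = cap_word_pairs(sentence)
```

Under this reading the example sentence yields ten pairs, starting with (David, Embley) and ending with (Young, University); "Group in the Computer" contributes (Group, Computer) through the preposition-plus-article gap.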

11
Facet 3 Page Similarity (2)

They construct a stop-word list, which is a list
of frequently appearing adjacent cap-word pairs
such as Home Page and Privacy Policy.
They collected approximately 10,000 web documents
taken at random from the Open Directory Project.
They constructed all adjacent cap-word pairs,
sorted them by frequency, and treated every pair
with a frequency greater than two as a stop
word.
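
A sketch of the stop-list construction, assuming each document has already been reduced to its list of cap-word pairs and that "frequency" means the total number of occurrences across the collection (the slide does not say whether it is per document). The function name is hypothetical.

```python
from collections import Counter

def build_stop_pairs(pairs_per_document, min_frequency=2):
    """Pairs occurring more than `min_frequency` times become stop words."""
    counts = Counter()
    for pairs in pairs_per_document:
        counts.update(pairs)
    return {pair for pair, n in counts.items() if n > min_frequency}

docs = [
    [("Home", "Page"), ("Privacy", "Policy")],
    [("Home", "Page"), ("Brigham", "Young")],
    [("Home", "Page"), ("Privacy", "Policy")],
]
stop_pairs = build_stop_pairs(docs)
```

In this toy collection only (Home, Page) occurs more than twice, so it is the only stop pair.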

12
Facet 3 Page Similarity (3)

They consider the number of adjacent cap-word
pairs that two web pages share as an indicator of
the similarity between them.
The greater the number of shared adjacent
cap-word pairs, the greater the similarity
between the pages.
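
Reading the indicator as the number of cap-word pairs the two pages have in common, with stop pairs ignored (an interpretation, since the slide does not spell this out), the count could be sketched as follows. The function name is hypothetical.

```python
def shared_pair_count(pairs_a, pairs_b, stop_pairs=frozenset()):
    """Number of distinct cap-word pairs on both pages, excluding stop pairs."""
    return len((set(pairs_a) & set(pairs_b)) - set(stop_pairs))

page_a = [("Home", "Page"), ("Data", "Extraction"), ("Brigham", "Young")]
page_b = [("Home", "Page"), ("Brigham", "Young"), ("Computer", "Science")]
score = shared_pair_count(page_a, page_b, stop_pairs={("Home", "Page")})
```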

13
Confidence Matrix Construction (1)

They construct one confidence matrix for each
facet.
First, they construct a training set to compute
the conditional probabilities.
The training set has to satisfy some
restrictions.
It should contain male, female, and
gender-neutral names.
It should contain names whose returned citations
fall into groups of different sizes.
It should contain names whose returned citations
fall into different numbers of groups.
They entered each of the 9 names as a Google
query and collected the first 50 returned
citations for each name.

14
Confidence Matrix Construction (2)

They use the training set to estimate the
conditional probabilities, for example:
P(Same Person = Yes | Email = Yes)
P(Same Person = Yes | City = Yes and State = Yes)
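
Estimating such a conditional probability from labeled training pairs amounts to a ratio of counts. The sketch below assumes each citation pair is a dictionary of boolean fields; the field names and the sample data are made up for illustration.

```python
def conditional_probability(pairs, evidence, outcome="same_person"):
    """Estimate P(outcome = Yes | evidence = Yes) from labeled citation pairs.

    Returns None when no pair shows the evidence.
    """
    matching = [p for p in pairs if p[evidence]]
    if not matching:
        return None
    return sum(1 for p in matching if p[outcome]) / len(matching)

# Hypothetical labeled pairs: did the emails match, and is it the same person?
training = [
    {"email": True, "same_person": True},
    {"email": True, "same_person": True},
    {"email": True, "same_person": True},
    {"email": True, "same_person": False},
    {"email": False, "same_person": False},
]
p_email = conditional_probability(training, "email")  # 3 of 4 pairs
```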

15
Final Confidence Matrix

They generate the final confidence matrix by
combining the confidence matrices for the three
facets using Stanford certainty theory.
Stanford certainty theory gives the following
rule to combine the evidence from two
independent observations.
Suppose CF(E1) is the certainty factor associated
with evidence E1 for some observation B, and
CF(E2) is another certainty factor. The
compounded CF of B is calculated by
CF(E1) + CF(E2) - (CF(E1) × CF(E2)).
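
For two non-negative certainty factors the combination rule is CF1 + CF2 - CF1 × CF2, and applying it repeatedly extends it to three facets. The sketch below assumes all factors lie between 0 and 1; the function names are illustrative, and the three input values are sample facet confidences.

```python
from functools import reduce

def combine(cf1, cf2):
    """Combine two certainty factors in [0, 1] for the same conclusion."""
    return cf1 + cf2 - cf1 * cf2

def combine_all(cfs):
    """Fold the pairwise rule over any number of certainty factors."""
    return reduce(combine, cfs, 0.0)

final = combine_all([0.96, 0.0, 0.78])  # about 0.9912
```

Because the rule is commutative and associative, the order in which the facets are combined does not matter, and a facet with confidence 0 leaves the result unchanged.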

16
Grouping Algorithm

If there is high confidence between two citations
Ci, Cj, they are grouped into a set S1.
If there is high confidence between two citations
Cj, Ck, they are grouped into a set S2.
Because S1 and S2 share one or more citations,
they are grouped together in one group S3.
Keep merging any two sets of citations that share
one or more citations until no citation is shared
between any two sets.
The threshold is 0.8.
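
The merging procedure can be sketched with a union-find structure, which is a substitute technique that produces the same groups as repeatedly merging sets that share a citation. Citations are numbered from 0, the confidence matrix is given as a dictionary keyed by citation pairs, and "high confidence" is taken to mean strictly above the threshold; all three are assumptions.

```python
def group_citations(n, confidence, threshold=0.8):
    """Group citations 0..n-1; pairs with confidence above threshold are linked."""
    parent = list(range(n))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), value in confidence.items():
        if value > threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for c in range(n):
        groups.setdefault(find(c), []).append(c)
    return sorted(groups.values())

matrix = {(0, 1): 0.9, (1, 2): 0.85, (3, 4): 0.5}
groups = group_citations(5, matrix)
```

Here citations 0 and 2 end up together even though no direct evidence links them, because both are confidently linked to citation 1.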

17
Example (1)

They apply their technique to the first 10
returned citations for the person-name query
Kelly Flanagan.
Pages referenced by the two citations C4 and C7
have the same city and state.
They have P(Same Person = Yes | City = Yes
and State = Yes) = 0.96.

18
Example (2)

The final confidence value between citations C1
and C8, using Stanford certainty theory, is
0.96 + 0 + 0.78 - 0.96 × 0 - 0.96 × 0.78
- 0.78 × 0 + 0.96 × 0 × 0.78 = 0.9912.

19
Experimental results (1)

They chose 10 names by opening an arbitrary page
from a phone book and choosing an arbitrary name
from the page.
The system returned the grouping result for the
first 50 returned citations for each name.
The test set contains 500 citations.

20
Experimental results (2)

To evaluate the performance of their system, they
use split and merge measures.
First, they count how many splits are needed
over all the groups so that the citations in each
group relate to one person.
Then, they count how many merges are needed
between the groups so that no two groups
relate to the same person.
They normalize the split and merge scores to
range between 0 and 1.
For an example, see the Evaluation example slide.

21
Experimental results (3)
22
Experimental results (4)

Using a multi-faceted approach gives much better
performance than using each facet separately.
For groups that should have been merged, no
evidence or only weak evidence was found to group
them.
A human expert may instead use pictures or a
deeper understanding of the meaning of
distinguishing phrases.

23
Concluding remarks

They designed and implemented a system that can
automatically group the returned citations from a
search engine person-name query.
They used a multi-faceted approach that considers
three facets.
They gave experimental evidence to show that
their approach can be successful.

24
Evaluation example

Correct grouping result for 8 citations:
G1: C1, C2, C4, C6, C7
G2: C3, C8
G3: C5
The grouping result of their system:
G1: C1, C2, C4
G2: C3, C6, C7
G3: C5, C8
The number of splits over all the groups is
0 + 1 + 1 = 2, and the total number of merges
is 2.
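
The split and merge counts for this example can be reproduced with a short sketch. It reflects one reading of the measure: a system group needs one split fewer than the number of distinct persons it mixes, and a person needs one merge fewer than the number of pieces their citations end up in. The function name is hypothetical, and the normalization to the range 0 to 1 is left out because the slides do not give it.

```python
def split_merge_counts(correct, system):
    """Count splits and merges needed to turn `system` into `correct`."""
    person_of = {c: g for g, members in enumerate(correct) for c in members}
    splits = 0
    pieces_per_person = {}
    for group in system:
        persons = {person_of[c] for c in group}
        splits += len(persons) - 1  # cut the group into one piece per person
        for p in persons:
            pieces_per_person[p] = pieces_per_person.get(p, 0) + 1
    merges = sum(n - 1 for n in pieces_per_person.values())
    return splits, merges

correct = [{1, 2, 4, 6, 7}, {3, 8}, {5}]
system = [{1, 2, 4}, {3, 6, 7}, {5, 8}]
splits, merges = split_merge_counts(correct, system)  # (2, 2)
```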