Linking and Summarizing Information on Entities


1
Linking and Summarizing Information on Entities

Presented by
Min-Yen Kan
Web IR / NLP Group (WING)
Department of Computer Science
National University of Singapore, Singapore
This talk archived as http://wing.comp.nus.edu.sg/kanmy/talks/080407-nihLMC.htm

2
Singapore, the garden city

4M people, sandwiched between Malaysia and
Indonesia
50 km from the equator: hot and humid year-long
Known for urban planning, fondness for
acronyms and aversion to bubble gum litterers :-D

WING at NUS
http://wing.comp.nus.edu.sg

1 postdoc, 6 Ph.D. students, 5 undergraduates
Projects in natural language processing,
digital libraries, and information retrieval.

3
Entity Centric Information Management

Collate all studies on SBP2 that have new findings in
the last year.
Oh, I meant the PROTEIN SBP2, not the gene.
What other proteins does SBP2 bind to?
Tell me more about the contradiction from
previous results.
Which Miller did the study on SBP2 in 2002?

4
Entity Centric Information Management

Two consequences to discuss today
Linkage
Joint work with Yee Fan TAN, Dongwon LEE (PSU) et
al.
Summarization
Joint work with Ziheng LIN et al.

5
What's Entity Linkage?

Aggregating data on an object together from
heterogeneous resources
Problem: Entity names are ambiguous!
Medical terms
Person names
Products
Customer records
These problems exist even when we have
controlled vocabulary and lexicons (Specialist,
UMLS, MeSH)

By UV cross-linking and immunoprecipitation, we
show that SBP2 specifically binds selenoprotein
mRNAs both in vitro and in vivo. The SBP2 clone
used in this study generates a 3173 nt transcript
(2541 nt of coding sequence plus a 632 nt 3' UTR
truncated at the polyadenylation site).
Protein
Gene
6
Examples of Split Records

Dongwon Lee, 110 E. Foster Ave. 410, State College, PA, 16802
  vs. LEE Dong, 110 East Foster Avenue Apartment 410, University Park, PA 16802-2343
Honda Fit
  vs. Honda Jazz
Joint Conf. on Digital Libraries
  vs. JCDL
Apple iPod Nano 4GB
  vs. 4GB iPod nano 4GB
Entity Linkage
  vs. De-duplication

Ironic, isn't it?
7
All over the web!
Jeffrey D. Ullman (Stanford University)
8
Record linkage, formally defined

Input:
Two lists of records, A and B
Output:
For each record a in A and for each record b in
B, do a and b refer to the same entity?
Note:
Entities do not come with unique identifiers
To disambiguate (deduplicate) items in a single
list L, we set A = B = L

9
Talk Outline

Linkage using the Web
Introduction
>> Record linkage using internal knowledge
String matching
Classification or clustering
Graphical formalisms
Blocking
Record linkage using search engines
Update Summarization

10
Fellegi-Sunter model
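[Figure: frequency of similarity plotted against Similarity(a, b), with curves for true non-matches and true matches. Two thresholds split the axis into three regions: designate as definite non-match, no-decision region (hold for human review), and designate as definite match. Also labeled: false matches and false non-matches.]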
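
The three decision regions of the Fellegi-Sunter model can be sketched as a two-threshold rule over a similarity score. This is a minimal illustration only: the function name and the threshold values are hypothetical, and in practice the thresholds would be chosen from the error rates one is willing to tolerate.

```python
def fs_decision(similarity, lower, upper):
    """Three-way decision on a similarity score using two thresholds.

    lower and upper are hypothetical cut-offs, not values from the talk.
    """
    if similarity >= upper:
        return "match"        # designate as definite match
    if similarity <= lower:
        return "non-match"    # designate as definite non-match
    return "review"           # no-decision region: hold for human review


for score in (0.95, 0.50, 0.10):
    print(score, fs_decision(score, lower=0.3, upper=0.8))
```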
11
String matching

String similarity
Strings as ordered sequences
Edit distance
Jaro and Jaro-Winkler
Strings as unordered sets
Jaccard similarity
Cosine similarity
Abbreviation matching
Pattern detection, e.g. National Institutes of
Health (NIH)

Ordered sequences: (a, b, c) ≠ (c, b, a)
Unordered sets: {a, b, c} = {c, b, a}
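
To make the ordered/unordered distinction concrete, here is a small sketch of three of the measures named above: edit distance (order matters), and Jaccard and cosine similarity over word tokens (order ignored). Whitespace tokenization and lowercasing are simplifying assumptions.

```python
import math
from collections import Counter


def edit_distance(a, b):
    """Levenshtein distance: treats strings as ordered sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity over the sets of word tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def cosine(a, b):
    """Cosine similarity over word-count vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


print(edit_distance("abc", "cba"))                        # order matters
print(jaccard("a b c", "c b a"))                          # order ignored
print(jaccard("Apple iPod Nano 4GB", "4GB iPod nano 4GB"))
```

The set-based measures treat the two iPod strings as close even though the word order differs, which is why they are often preferred for product names and addresses.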
12
Machine Learning

Create features
String similarity, relationships (e.g.
collaborators)
Then learn a model
Naïve Bayes, Support Vector Machine, K-means,
Agglomerative Clustering,

Yoojin Hong, Byung-Won On and Dongwon Lee.
System Support for Name Authority Control Problem
in Digital Libraries: OpenDBLP Approach. ECDL
2004.
Same Person?
Sudha Ram, Jinsoo Park and Dongwon Lee.
Digital Libraries for the Next Millennium:
Challenges and Research Directions. Information
Systems Frontiers 1999.
13
Graphical Methods: Social network analysis

Nodes: entities
Edges: relationships

Analysis
Connected components
Distance between nodes
Node/edge centrality
Cliques
Bipartite subgraphs
14
Talk Outline

Linkage using the Web
Introduction
Record linkage using internal knowledge
>> Record linkage using search engines
Search Engine Features
Adaptive Queries
Query Probing
Update Summarization

15
Record linkage using search engines

Previously
We assumed input data records contain sufficient
information to perform linkage
What if
There is insufficient or only noisy information?
e.g., linking short forms to long forms
Ask other people!
I.e., consult external (vs. internal) sources of
knowledge
Use web as collective knowledge base

16
Anatomy of Search Engine Results
Number of results
Ranked list
Title
Programmatically accessible through APIs
Snippet
URL
Web page
17
Derivable Features

Counts
Co-occurrence measure between count(q1),
count(q2) and count(q1 and q2)
Hyperlinkage
Count of web pages of q1 that point to pages of q2,
and vice versa
Incorporate additional indirect links with less
weight (e.g., q1 → p → q2)

Snippets or web pages
(Cosine) similarity using tokens
Counts of specific terms
e.g. number of snippets for q1 containing the
string q2
Further natural language processing

18
Web page features

Named entities (NE)
We consider people, organizations, locations
Each NE token a feature
NE-targeted (NE-T)
Motivation: middle names and titles
For NEs having a token of target name
Extract tokens that are not in target name as
features

Example text: Born Edward Charles Morrice Fox in
Chelsea, London
NE features: Charles, Chelsea, Morrice, Edward, Fox, London, ...
NE-T features: Charles, Morrice, ...
19
Using URLs

Where web pages are located is also useful
Hypothesis: If web pages of q1 and web pages of
q2 overlap a lot, q1 and q2 are the same entity
Measure this using URL / Host information

Caveat: Not all hosts are equally telling
citeseer vs. harvard.edu for author names
pubmed vs. diabetes-info.com for diabetic terms
Solution: Weight by Inverse Host Frequency
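
One way to read "weight by Inverse Host Frequency" is by analogy with inverse document frequency: a host that appears in the results of many queries (an aggregator) says little, while a rare host says a lot. The sketch below follows that reading; the exact formula, the function names, and the example hosts are assumptions rather than the definition used in the talk.

```python
import math
from urllib.parse import urlparse


def hosts(urls):
    """Set of hostnames appearing in a list of result URLs."""
    return {urlparse(u).hostname for u in urls}


def inverse_host_frequency(results_by_query):
    """Assumed form: IHF(h) = log(N / number of queries whose results include h)."""
    n = len(results_by_query)
    df = {}
    for urls in results_by_query.values():
        for h in hosts(urls):
            df[h] = df.get(h, 0) + 1
    return {h: math.log(n / c) for h, c in df.items()}


def weighted_host_overlap(urls1, urls2, ihf):
    """IHF-weighted Jaccard overlap between the hosts of two result lists."""
    h1, h2 = hosts(urls1), hosts(urls2)
    union = sum(ihf.get(h, 0.0) for h in h1 | h2)
    shared = sum(ihf.get(h, 0.0) for h in h1 & h2)
    return shared / union if union else 0.0


# Hypothetical search results for three queries.
results = {
    "q1": ["http://citeseer.example/a", "http://lab.example/x"],
    "q2": ["http://citeseer.example/b", "http://lab.example/y"],
    "q3": ["http://citeseer.example/c", "http://other.example/z"],
}
ihf = inverse_host_frequency(results)
print(weighted_host_overlap(results["q1"], results["q2"], ihf))
print(weighted_host_overlap(results["q1"], results["q3"], ihf))
```

In this toy data the aggregator host appears for every query, so it receives weight zero and contributes nothing to the overlap.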

20
URL Features (cont.)

Page URLs
Hypothesis: URL itself tells quite a lot
Home page of lindek
CS department, University of Alberta, Canada
MeURLin (Kan and Nguyen Thi, 2005)
Tokens (http, www, cs, ualberta, ca, lindek)
URI parts (scheme:http, hostname:cs, user:lindek, ...)
N-grams (ca ualberta, ualberta cs, cs www, www lindek)
Length of tokens

http://www.cs.ualberta.ca/~lindek/
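
A rough sketch of URL-derived features of the kind listed above, using only the standard library. This is not the MeURLin implementation: the feature names are illustrative, the example URL is assumed to include the tilde that marks a user directory, and the bigrams here simply follow token order in the URL rather than the ordering shown on the slide.

```python
import re
from urllib.parse import urlparse


def url_features(url):
    """Extract simple token, URI-part, bigram and length features from a URL."""
    tokens = [t.lower() for t in re.split(r"[^A-Za-z0-9]+", url) if t]
    parts = urlparse(url)
    return {
        "tokens": tokens,
        "scheme": parts.scheme,
        "hostname": parts.hostname,
        "path": parts.path,
        "bigrams": list(zip(tokens, tokens[1:])),  # adjacent token pairs
        "lengths": [len(t) for t in tokens],
    }


features = url_features("http://www.cs.ualberta.ca/~lindek/")
print(features["tokens"])
print(features["hostname"])
```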
21
Web search engine linkage

Test whether q1 and q2 should be linked
Hypothesis: Web pages of q1 and web pages of q2
share some representative data I
Similar to disconnected triples

Jeffrey D. Ullman: 384K pages
Jeffrey D. Ullman aho: 174K pages
J. Ullman: 124K pages
J. Ullman aho: 41K pages
Shimon Ullman: 27.3K pages
Shimon Ullman aho: 66 pages

q1
q2
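
The page counts above can be turned into a simple worked comparison: for each name, the fraction of its pages that also mention the shared term "aho". Reading "aho" as the representative data I, and using this particular ratio as the evidence, are assumptions made for illustration.

```python
def context_ratio(count_with_context, count_alone):
    """Fraction of a name's pages that also contain the context term."""
    return count_with_context / count_alone if count_alone else 0.0


# (pages for the name alone, pages for the name plus "aho"), from the slide.
counts = {
    "Jeffrey D. Ullman": (384_000, 174_000),
    "J. Ullman": (124_000, 41_000),
    "Shimon Ullman": (27_300, 66),
}
ratios = {name: context_ratio(with_aho, alone)
          for name, (alone, with_aho) in counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Under this reading, "Jeffrey D. Ullman" and "J. Ullman" both co-occur with "aho" on a large share of their pages, while "Shimon Ullman" almost never does, which supports linking the first two names but not the third.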
22
Evaluation - Full web pages in WEPS

Goal
To compare the usefulness of various features for
the Web People Search Task
Architecture

Input web pages → Feature vectors → Clusters
Feature vectors are compared using cosine
similarity and clustered with single link
hierarchical agglomerative clustering, subject
to a minimum similarity threshold
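
A minimal sketch of this pipeline's clustering step, assuming sparse feature vectors stored as dictionaries. It relies on the fact that single-link agglomerative clustering stopped at a minimum similarity threshold yields the connected components of the graph whose edges join pairs at or above that threshold. The toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def single_link_clusters(vectors, threshold):
    """Cluster indices of vectors; pairs with similarity >= threshold are linked."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


pages = [{"ir": 1, "nlp": 1}, {"nlp": 1, "summarization": 1}, {"protein": 1}]
print(single_link_clusters(pages, threshold=0.2))
```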
23
Evaluation

F (α = 0.5) and similarity threshold = 0.2

24
Evaluation - Author Disambiguation

Dataset
Manually-disambiguated dataset of 24 ambiguous
names in computer science domain
Each ambiguous name represented 2 unique authors
(k = 2) except for one where it represented 3
Each name is attributed to 30 citations on
average
Proportion of largest class ranges from 50% to
97%
Search engine
Google (http://www.google.com/)

25
Evaluation

Single link performs best
Good for clustering citations from different
publication pages together (some pages list only
selected publications)
Some authors have disparate research areas, not
well represented by a centroid vector
Resolving hostnames to IP addresses gives best
accuracy

Classification accuracy averaged over all names
26
Discussion
Per-name accuracies using single link
Per-name average number of URLs returned per
citation
27
Discussion

Apparent correlation between accuracy and average
number of URLs returned per citation
Author names with few URLs tend to fare poorly
since results are mainly aggregator web sites
What's the cost?
Lots of queries needed
Web page downloads are expensive
Hence, slow
Can we speed this up?
Sure thing

28
Query probing

Consider some publication venues
Joint Conference on Digital Libraries
European Conference on Digital Libraries
Digital Libraries
Query probing
Use the common n-gram "digital libraries" as a
query probe
If we can obtain information on all three
conferences, we save two queries
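
A small sketch of how a shared probe might be chosen: index every word n-gram by the names that contain it and pick the n-gram covering the most names. The function names and the choice of bigrams are assumptions; the talk does not specify how probes are selected.

```python
from collections import defaultdict


def ngrams(tokens, n):
    """Set of word n-grams in a token list."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def best_probe(names, n=2):
    """Return the n-gram shared by the most names, and the names it covers."""
    covered = defaultdict(set)
    for name in names:
        for gram in ngrams(name.lower().split(), n):
            covered[gram].add(name)
    # Ties are broken alphabetically so the result is deterministic.
    probe = max(covered, key=lambda g: (len(covered[g]), g))
    return probe, covered[probe]


venues = [
    "Joint Conference on Digital Libraries",
    "European Conference on Digital Libraries",
    "Digital Libraries",
]
probe, hits = best_probe(venues)
print(probe, len(hits))
```

One query for the probe then stands in for one query per venue, which is the saving the slide describes.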

29
Adaptive querying

Combine two methods when needed
Methods
Ms: stronger method but very slow (e.g. web page
similarity)
Mw: weaker method but fast (e.g. host overlap)
Aim
Accuracy close to Ms
Significantly lower running time than Ms

Algorithm
Execute Mw
If heuristic suggests that Mw results are likely
incorrect
Execute Ms
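
The algorithm above can be sketched as a small wrapper that always runs the weak method and escalates only when a heuristic flags its answer. The heuristic used in the example (a score falling in an uncertain middle band) and the feature names are hypothetical stand-ins; the talk does not say which heuristic is used.

```python
def adaptive_link(pair, weak, strong, looks_unreliable):
    """Run the fast method Mw first; fall back to the slow method Ms if needed.

    Returns the result and the name of the method that produced it.
    """
    result = weak(pair)
    if looks_unreliable(pair, result):
        return strong(pair), "Ms"
    return result, "Mw"


# Hypothetical stand-ins for the two methods and the heuristic.
weak = lambda pair: pair["host_overlap"]
strong = lambda pair: pair["page_similarity"]
uncertain = lambda pair, score: 0.2 < score < 0.8

print(adaptive_link({"host_overlap": 0.9, "page_similarity": 0.1}, weak, strong, uncertain))
print(adaptive_link({"host_overlap": 0.5, "page_similarity": 0.7}, weak, strong, uncertain))
```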

30
Entity Linkage - Conclusion

Important problem with a rich history
New external methods poll contextual evidence
for judgment
Need to combine methods to obtain best aspect of
each

31
Talk Outline

Linkage using the Web
>> Graph-based Update Summarization
Introduction
Timestamped Graphs
Evaluation and Conclusions

Now that all this data is linked, how do we
process it?
32
Applications of Summarization
Doing Less Work
Decision Support
33
More seriously: an exciting challenge ...

...put a book on the scanner, turn the dial to 2
pages, and read the result...
...download 1000 documents from the web, send
them to the summarizer, and select the best ones
by reading the summaries of the clusters...
...forward the Japanese email to the summarizer,
select 1 par, and skim the translated summary.
get a weekly digest of new treatments and
therapies for pressure ulcers

An update task
34
Simplifying summarization

Select important sentences verbatim from the
input text to form a summary
Input: A text document with k sentences
Output: Top n (n << k) sentences with the
highest numeric scores (each sentence in the
input document is assigned a numeric score)

Extractive Summarization
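
As a minimal sketch of this input/output contract, the function below scores each sentence and returns the top n verbatim, in document order. The scoring function (average term frequency of the sentence's words) and the regex sentence splitter are placeholder assumptions; any of the heuristics on the next slide could replace them.

```python
import re
from collections import Counter


def extractive_summary(text, n):
    """Return the n highest-scoring sentences of text, in original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"\w+", s.lower()) for s in sentences]
    tf = Counter(t for tokens in tokenized for t in tokens)
    # Placeholder score: mean document-level term frequency of the words.
    scores = [sum(tf[t] for t in tokens) / len(tokens) if tokens else 0.0
              for tokens in tokenized]
    top = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n]
    return [sentences[i] for i in sorted(top)]


doc = ("SBP2 binds selenoprotein mRNAs. The weather is nice. "
       "SBP2 binds mRNAs in vivo.")
summary = extractive_summary(doc, n=2)
print(summary)
```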
35
Summarization

Heuristics for extractive summarization
Cue/stigma phrases
Sentence position (relative to document,
section, paragraph)
Sentence length
TFIDF, TF scores
Similarity (with title, context, query)
Machine learning to tune weights by supervised
learning
Recently, graphical representations of text have
shed new light on the summarization problem

36
Revisiting Social Networks: Prestige

One motivation was to model the problem as
finding prestige of nodes in a social network
PageRank: random walk
In summarization, led to TextRank and LexRank
Did we leave anything out of our representation
for summarization?
Yes, the notion of an evolving network

37
Social networks change!

Natural evolving networks (Dorogovtsev and
Mendes, 2001)
Citation networks: New papers can cite old ones,
but the old network is static
The Web: new pages are added with an old page
connecting it to the web graph, old pages may
update links

38
Talk Outline

Linkage using the Web
Graph-based Update Summarization
Introduction
>> Timestamped Graphs
Evaluation and Conclusion

39
Evolutionary models for summarization

Writers and readers often follow conventional
rhetorical styles - articles are not written or
read in an arbitrary way
Consider the evolution of texts using a very
simplistic model
Writers write from the first sentence onwards in
a text
Readers read from the first sentence onwards of
a text
A simple model: sentences get added incrementally
to the graph

40
Timestamped Graph Construction

These assumptions suggest iteratively adding
sentences to the graph in chronological order.
At each iteration, consider which edges to add
to the graph.
For a single document: simple and straightforward;
add the 1st sentence, followed by the 2nd, and so
forth, until the last sentence is added
For multi-document: treat it as multiple
instances of single documents, which evolve in
parallel; i.e., add the 1st sentences of all
documents, followed by all 2nd sentences, and so
forth
NB: Doesn't really model chronological ordering
between articles; fix later

41
Timestamped Graph Construction

Model
Documents as columns
d_i: document i
Sentences as rows
s_j: jth sentence of a document

42
Timestamped Graph Construction

A multi-document example

[Figure: grid with documents doc1, doc2, doc3 as columns and sentences sent1, sent2, sent3 as rows]
43

An example TSG: DUC 2007 D0703A-A

44
Timestamped Graph Construction

This is just one instance of a TSG
Let's generalize and formalize them
Def: A timestamped graph algorithm tsg(M) is a
9-tuple (d, e, u, f, σ, t, i, s, τ) that
specifies a resulting algorithm that takes as
input the set of texts M and outputs a graph G

(d, e, u, f): properties of edges
(σ, t, i, s): properties of nodes
τ: input text transformation function
45
Edge properties (d, e, u, f)

Edge Direction (d)
Forward, backward, or undirected
Edge Number (e)
number of edges to instantiate per timestep
Edge Weight (u)
weighted or unweighted edges
Inter-document factor (f)
penalty factor for links between documents in
multi-document sets.

46
Node properties (σ, t, i, s)

Vertex selection function σ(u, G)
One strategy: among those nodes not yet
connected to u in G, choose the one with highest
similarity to u
Similarity functions: Jaccard, cosine, concept
links (Ye et al., 2005)
Text unit type (t)
Most extractive algorithms use sentences as
elementary units
Node increment factor (i)
How many nodes get added at each timestep
Skew degree (s)
Models how nodes in multi-document graphs are
added
Skew degree: how many iterations to wait before
adding the 1st sentence of the next document
Skip for today

47
Timestamped Graph Construction

Representations
We can model a number of different algorithms
using this 9-tuple formalism
(d, e, u, f, σ, t, i, s, τ)
The given toy example
(f, 1, 0, 1, max-cosine-based, sentence, 1, 0,
null)
LexRank graphs
(u, N, 1, 1, cosine-based, sentence, Lmax, 0,
null)
N: total number of sentences in the cluster
Lmax: the max document length
i.e., all sentences are added into the graph in
one timestep, each connected to all others, and
cosine scores are given to edge weights
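
The toy configuration (one unweighted edge per node per timestep, max-cosine vertex selection, one sentence per document per timestep, skew 0) might be sketched as follows. Several details are assumptions not confirmed by the transcript: that every node already in the graph adds an edge at each timestep, that an edge (u, v) records "u selected v", and that ties are broken by insertion order. Weights, the inter-document factor, skew and segmentation are left out.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sentences over word counts."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0


def build_tsg(docs, e=1):
    """docs: list of documents, each a list of sentence strings.

    Returns (nodes, edges). Nodes are (doc_index, sentence_index) pairs;
    edges are directed pairs (u, v) meaning that u selected v.
    """
    def text(node):
        return docs[node[0]][node[1]]

    nodes, edges = [], set()
    for step in range(max(len(d) for d in docs)):
        # Skew 0: add the step-th sentence of every document in parallel.
        nodes += [(i, step) for i, d in enumerate(docs) if step < len(d)]
        for u in list(nodes):
            for _ in range(e):
                # Vertex selection: most similar node not yet connected to u.
                candidates = [v for v in nodes if v != u
                              and (u, v) not in edges and (v, u) not in edges]
                if not candidates:
                    break
                v = max(candidates, key=lambda c: cosine(text(u), text(c)))
                edges.add((u, v))
    return nodes, edges


docs = [["a b", "c d"], ["a b c", "d e"]]
nodes, edges = build_tsg(docs)
print(len(nodes), sorted(edges))
```

Setting e to the total number of sentences and adding all sentences in one timestep would approach the fully connected LexRank-style graph described above.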

48
System Overview

Sentence splitting
Detect and mark sentence boundaries
Annotate each sentence with the doc ID and the
sentence number
E.g., XIE19980304.0061: 4 March 1998 from Xinhua
News; XIE19980304.0061-14: the 14th sentence of
this document
Graph construction
Construct TSG in this phase

49
System Overview

Sentence Ranking
Apply topic-sensitive random walk on the graph
to redistribute the weights of the nodes
Sentence extraction
Extract the top-ranked sentences
Two different modified MMR re-rankers are used,
depending on whether it is main or update task

50
Talk Outline

Linkage using the Web
Graph-based Update Summarization
Introduction
Timestamped Graphs
>> Evaluation and Conclusion

51
Evaluation

Dataset: DUC 2005, 2006 and 2007.
Evaluation tool: ROUGE, an n-gram based automatic
evaluation
Each dataset contains 50 or 45 clusters, each
cluster contains a query and 25 documents
Evaluate on some parameters
Do different e values affect the summarization
process?
e = 2 works best for DUC dataset
How do topic-sensitivity and edge weighting
perform in running PageRank?
Applying both seems to have best effect
How does skewing the graph affect the
information flow in the graph?
Skew of 1 works best, but need to try other
possibilities

52
Holistic Evaluation in DUC 2007

Extractive-based TSG system
Used modified maximal marginal relevance for
update tasks
Penalize links in previously read articles
Extension of inter-document factor (f)

Cluster 1
Cluster 2
Cluster 3
53
Evaluation Results

Main task: 10th of 32 systems
Update task: 3rd of 24 systems
Conclusion
TSG formalism better tailored to deal with
update / incremental text tasks
New method that may be competitive with current
approaches
Other top scoring systems may do sentence
compression (abstractive), not just extraction

54
Graph-based Update Summary - Conclusion

Proposed a timestamped graph model for text
understanding and summarization
Adds sentences in an incremental fashion
Future work
Freely skewed model
Empirical and theoretical properties of TSGs

55
Where do we go from here?

Organizing data around entities, events
How people deal with data anyways
Understand objects and their inter/intra-relationship
Automation requires domain-expertise within a
generic framework

Collate all studies on SBP2 that have new findings in
the last year.
Oh, I meant the PROTEIN SBP2, not the gene.
What other proteins does SBP2 bind to?
Tell me more about the contradiction from
previous results.
Which Miller did the study on SBP2 in 2002?

Thank you!
http://wing.comp.nus.edu.sg/

56
Backup Slides: Entity Linkage

50 Minute talk total
7 Apr 2008, 10-11 AM

57
Social network analysis

Connected triple
Random walk

Maximum flow
Clustering

58
Scalability Issues

Pairwise comparisons
Requires O(n^2) time
Major bottleneck

Possible solutions
Blocking techniques
Avoiding pairwise comparisons altogether

Input: d1, d2, ..., dn
for i = 1 to n
  for j = (i + 1) to n
    compute sim(di, dj)
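
To illustrate why blocking helps, the sketch below contrasts the full pairwise loop with a version that only compares records sharing a blocking key. The key used in the example (lowercased last token of a name) is a hypothetical choice; real blocking schemes are usually more forgiving of spelling variation.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs(records):
    """Every pair of record indices: n(n-1)/2 comparisons."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key):
    """Only pairs of record indices that fall into the same block."""
    blocks = defaultdict(list)
    for i, record in enumerate(records):
        blocks[key(record)].append(i)
    return [pair for ids in blocks.values() for pair in combinations(ids, 2)]


names = ["Dongwon Lee", "D. Lee", "Min-Yen Kan", "M. Kan", "Yee Fan Tan"]
last_token = lambda name: name.lower().split()[-1]  # hypothetical blocking key

print(len(all_pairs(names)))
print(blocked_pairs(names, last_token))
```

The trade-off is recall: any true match whose records land in different blocks is never compared.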
59
Cost-utility Framework
cost of acquiring fi
utility of acquiring fi
feature fi
known value
value that can be acquired
60
Record Matching
Header-reference pair (instance)
1 Given information
TITLE_MIN_LEN TITLE_MAX_LEN AUTHOR_MIN_LEN
AUTHOR_MAX_LEN VENUE_MIN_LEN VENUE_MAX_LEN
2 Information that can be acquired at a cost
TITLE_SIM AUTHOR_SIM VENUE_SIM
MATCH/MISMATCH?
Training data: Assume all feature-values and their
acquisition costs known
Testing data: Assume 1 known, but feature-values
and their acquisition costs in 2 unknown
Costs: Set to MIN_LEN MAX_LEN
61
Costs and Utilities

Costs
Trained 3 models (using M5), treat as regression
Utilities
Trained 2^3 = 8 classifiers (each to predict
match/mismatch using only known feature-values)
For a test instance with a missing feature-value
F
Get confidence of appropriate classifier without
F
Get expected confidence of appropriate classifier
with F
Utility is difference between the two confidence
scores
Note
Similar to Saar-Tsechansky et al.

62
Results
[Figure: two plots, one without cleaning of header records and one with manual cleaning of header records; x-axis in both: increasing proportion of feature-values acquired]
63
Selected Bibliography

General and surveys
Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, December 1969.
William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter Model of record linkage to the 1990 U.S. Decennial Census. Technical Report RR91/09, U.S. Bureau of the Census, 1991.
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(1):1-16, January 2007.
William E. Winkler. Overview of record linkage and current research directions. Technical Report RRS2006/02, U.S. Bureau of the Census, February 2006.
Mikhail Bilenko, Raymond J. Mooney, William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16-23, January/February 2003.
Min-Yen Kan and Yee Fan Tan. Record Matching in Digital Library Metadata. To appear in Communications of the ACM (CACM).

64
Selected Bibliography

String matching
Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the Association of Computing Machinery, 21(1):168-173, January 1974.
Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, March 1970.
Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, March 1981.
Andrés Marzal and Enrique Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926-932, September 1993.
Alvaro E. Monge and Charles Elkan. The field matching problem: Algorithms and applications. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 267-270, August 1996.
Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311-321, March 2004.
Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 39-48, August 2003.
Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Information Integration on the Web (IIWeb), pages 73-78, August 2003.
Ariel S. Schwartz and Marti A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific Symposium on Biocomputing (PSB), pages 451-462, January 2003.
Youngja Park and Roy J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 126-133, June 2001.
Jeffrey T. Chang, Hinrich Schütze, and Russ B. Altman. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612-620, November/December 2002.
Hiroko Ao and Toshihisa Takagi. ALICE: An algorithm to extract abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 12(5):576-586, September/October 2005.

65
Selected Bibliography

Direct classification or clustering, and blocking
Hui Han, Hongyuan Zha, and C. Lee Giles. A model-based K-means algorithm for name disambiguation. In Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, October 2003.
Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 296-305, June 2004.
Hui Han, Wei Xu, Hongyuan Zha, and C. Lee Giles. A hierarchical naive bayes mixture model for name disambiguation in author citations. In ACM Symposium on Applied Computing (SAC), pages 1065-1069, March 2005.
Hui Han, Hongyuan Zha, and C. Lee Giles. Name disambiguation in author citations using a K-way spectral clustering method. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 334-343, June 2005.
Dongwon Lee, Byung-Won On, Jaewoo Kang, and Sanghyun Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In ACM SIGMOD Workshop on Information Quality in Information Systems (IQIS), pages 69-76, June 2005.
Byung-Won On, Dongwon Lee, Jaewoo Kang, and Prasenjit Mitra. Comparative study of name disambiguation problem using a scalable blocking-based framework. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 344-353, June 2005.
Andrew McCallum, Kamal Nigam, and Lyle Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169-178, August 2000.
Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for record linkage. In National Conference on Artificial Intelligence (AAAI), July 2006.
Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In IEEE International Conference on Data Mining (ICDM), December 2006.

66
Selected Bibliography

Graphical models
Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311-321, March 2004.
John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), pages 282-289, June/July 2001.
Andrew McCallum and Ben Wellner. Object consolidation by graph partitioning with a conditionally-trained distance metric. In ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 19-24, August 2003.
Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated, conditional model of information extraction and coreference with application to citation matching. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 593-601, July 2004.
Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In ACM SIGMOD International Conference on Management of Data, pages 85-96, June 2005.
Indrajit Bhattacharya and Lise Getoor. A latent dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, pages 47-58, April 2006.

67
Selected Bibliography

Social network analysis
H. A. Kautz, B. Selman, and M. A. Shah. The hidden web. AI Magazine, 18(2):27-36, 1997.
P. Mutschke. Mining networks and central entities in digital libraries. A graph theoretic approach applied to co-author networks. In Intelligent Data Analysis (IDA), pages 155-166, August 2003.
M. E. J. Newman. Who is the best connected scientist? A study of scientific coauthorship networks. In Complex Networks, pages 337-370, February 2004.
E. Otte and R. Rousseau. Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28(6), December 2002.
T. Krichel and N. Bakkalbasi. A social network analysis of research collaboration in the economics community. In International Workshop on Webometrics, Informetrics and Scientometrics Seventh COLLNET Meeting, May 2006.
R. Rousseau and M. Thelwall. Escher staircases on the world wide web. First Monday, 9(6), June 2004.
D. G. Feitelson. On identifying name equivalences in digital libraries. Information Research, 9(4), October 2004.
R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In International conference on World Wide Web (WWW), pages 463-470, May 2005.
R. Holzer, B. Malin, and L. Sweeney. Email alias detection using social network analysis. In Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD), August 2005.
B. Malin, E. Airoldi, and K. M. Carley. A network analysis model for disambiguation of names in lists. Computational and Mathematical Organization Theory, 11(2):119-139, July 2005.
G. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150-160, August 2000.
P. K. Reddy and M. Kitsuregawa. An approach to build a cyber-community hierarchy. In SIAM ICDM Workshop on Web Analysis, April 2002.
Patrick Reuther. Personal name matching: New test collections and a social network based approach. Technical Report Mathematics/Computer Science 06-01, University of Trier, March 2006.
Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397-406, May 2006.

68
Selected Bibliography

Web-based methods
Jamie P. Callan, Margie E. Connell, and Aiqun Du. Automatic discovery of language models for text databases. In ACM SIGMOD International Conference on Management of Data, pages 479-490, June 1999.
Jamie P. Callan and Margie E. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems (TOIS), 19(2):97-130, April 2001.
Panagiotis G. Ipeirotis and Luis Gravano. Distributed search over the hidden-web: Hierarchical database sampling and selection. In International Conference on Very Large Databases (VLDB), pages 394-405, August 2002.
Luis Gravano, Panagiotis G. Ipeirotis, and Mehran Sahami. QProber: A system for automatic classification of hidden-web databases. ACM Transactions on Information Systems (TOIS), 21(1):1-41, January 2003.
Aron Culotta, Ron Bekkerman, and Andrew McCallum. Extracting social networks and contact information from email and the web. In Conference on Email and Anti-Spam (CEAS), July 2004.
Philipp Cimiano, Siegfried Handschuh, and Steffen Staab. Towards the self-annotating web. In International conference on World Wide Web (WWW), pages 462-471, May 2004.
Philipp Cimiano, Günter Ladwig, and Steffen Staab. Gimme the context: Context-driven automatic semantic annotation with C-PANKOW. In International conference on World Wide Web (WWW), pages 332-341, May 2005.
Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397-406, May 2006.
Yee Fan Tan, Min-Yen Kan, and Dongwon Lee. Search engine driven author disambiguation. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), June 2006.
Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee, and Yi Zhang. Googled name linkage. 2007.
Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, and Dongwon Lee. Record Linkage of Short Forms to Long Forms: A Case Study of Publication Venues. 2007.
Min-Yen Kan. Web page classification without the web page. In International conference on World Wide Web (WWW), pages 262-263, May 2004.
Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast webpage classification using url features. In International Conference on Information and Knowledge Management (CIKM), pages 325-326, October/November 2005.
Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In ACM SIGMOD International Conference on Management of Data, pages 265-276, June 2006.

69
Backup Slides - Summarization

50 Minute talk total
7 Apr 2008, 10-11 AM

70
A Summarization Machine
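[Figure: a summarization machine. Inputs: DOC, MULTIDOCS, QUERY. Output: Summary. Controls: length (Headline, Very Brief, Brief, Long; scale marked 10, 50, 100), Extract vs. Abstract, Indicative vs. Informative, Generic vs. Query-oriented, Background vs. Just the news.]
Generate a summary given a text document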
71
Summarization, defined

Definitions
Take a text document, extract content from it and
present the most important content to the user in
a condensed form and in a manner sensitive to the
user's or application's needs
Summarization requires
understanding the meaning of a text document
generating fluent text summary
Studies of human summarizers
Cremmins (65) and Endres-Niggemeyer (98) showed
that professional summarizers used clues to pick
summary content.

72
Skew Degree Examples

time(d1) < time(d2) < time(d3) < time(d4)

d1 d2 d3 d4
d1 d2 d3 d4
Freely skewed: Only add a new document when it
would be linked by some node using vertex
function σ
Skewed by 1
Skewed by 2
73
Input text transformation function (τ)

Document Segmentation Function (τ)
Problem observed in some clusters where some
documents in a multi-document cluster are very
long
Takes many timestamps to introduce all of the
sentences, causing too many edges to be drawn
τ(G) segments long documents into several sub
docs
Solution is too hacked; hope to investigate
more in current and future work

d5b
d5a
d5
74
Evaluation on number of edges (e)

Tried different e values
Optimal performance: e = 2
At e = 1, graph is too loosely connected, not
suitable for PageRank → very low performance
At e = N, a LexRank system
75
Evaluation (other edge parameters)

PageRank: generic vs. topic-sensitive
Edge weight (u): unweighted vs. weighted
Optimal performance: topic-sensitive PageRank
and weighted edges

Topic-sensitive   Weighted edges   ROUGE-1   ROUGE-2
No                No               0.39358   0.07690
Yes               No               0.39443   0.07838
No                Yes              0.39823   0.08072
Yes               Yes              0.39845   0.08282
76
Evaluation on skew degree (s)

Different skew degrees: s = 0, 1 and 2
Optimal performance: s = 1
s = 2 introduces a delay interval that is too
large
Need to try freely skewed graphs

Skew degree   ROUGE-1   ROUGE-2
0             0.36982   0.07580
1             0.37268   0.07682
2             0.36998   0.07489
77
Describing Summaries

Aspects of summarization (Sparck-Jones 97,
Hovy and Lin 99)
Input
Single-document vs. multi-document
Purpose
Situation: embedded in larger system (MT, IR) or
not?
Generic vs. query-oriented: author's view or
user's interest?
Indicative vs. informative: categorization or
understanding?
Background vs. just-the-news: does user have
prior knowledge?
Output
Extract vs. abstract: use text fragments or
re-phrase content?

78
Differences for main and update task processing

Main task
Construct a TSG for input cluster
Run topic-sensitive PageRank on the TSG
Apply first modified version of MMR to extract
sentences

Update task
Cluster A
Construct a TSG for cluster A
Run topic-sensitive PageRank on the TSG
Apply the second modified version of MMR to
extract sentences
Cluster B
Construct a TSG for clusters A and B
Run topic-sensitive PageRank on the TSG; only
retain sentences from B
Apply the second modified version of MMR to
extract sentences
Cluster C
Construct a TSG for clusters A, B and C
Run topic-sensitive PageRank on the TSG; only
retain sentences from C
Apply the second modified version of MMR to
extract sentences

79
Sentence Ranking

Once a timestamped graph is built, we want to
compute a prestige score for each node
PageRank: use an iterative method that allows
the weights of the nodes to redistribute until
stability is reached
Similarities as edges → weighted edges; query →
topic-sensitive

Topic sensitive (Q) portion
Standard random walk term
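
The slide's equation did not survive in this transcript, so the following is a generic sketch of a weighted, topic-sensitive PageRank rather than the exact formula used in the system. It assumes the common form PR(v) = (1 - d) q(v) + d Σ_u [w(u,v) / Σ_x w(u,x)] PR(u), where q is the normalized query relevance of each node (the topic-sensitive portion) and the sum is the standard random walk term. The damping value and the treatment of nodes without outgoing edges are also assumptions.

```python
def topic_sensitive_pagerank(weights, topic, damping=0.85, iters=100, tol=1e-10):
    """weights[u][v]: edge weight from u to v.
    topic[v]: non-negative relevance of node v to the query (not all zero).
    Returns a dict of scores that sum to 1.
    """
    nodes = list(topic)
    total = sum(topic.values())
    q = {v: topic[v] / total for v in nodes}
    rank = dict(q)
    for _ in range(iters):
        new = {v: (1 - damping) * q[v] for v in nodes}  # topic-sensitive portion
        for u in nodes:
            out = weights.get(u, {})
            out_sum = sum(out.values())
            if out_sum == 0:
                # Node with no outgoing edges: redistribute according to q.
                for v in nodes:
                    new[v] += damping * rank[u] * q[v]
            else:
                # Standard random walk term, proportional to edge weight.
                for v, w in out.items():
                    new[v] += damping * rank[u] * w / out_sum
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


graph = {"s1": {"s2": 1.0}, "s2": {"s1": 0.5, "s3": 0.5}, "s3": {}}
relevance = {"s1": 1.0, "s2": 1.0, "s3": 2.0}
scores = topic_sensitive_pagerank(graph, relevance)
print(scores)
```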
80
Sentence Extraction Main task

Original MMR: integrates a penalty of the
maximal similarity of the candidate document and
one selected document
Ye et al. (2005) introduced a modified MMR that
integrates a penalty of the total similarity of
the candidate sentence and all selected sentences
Score(s): PageRank score of s; S: selected
sentences
This is used in the main task

Penalty: All previous sentence similarity
81
Sentence Extraction Update task

Update task assumes readers already read previous
cluster(s)
implies we should not select sentences that have
redundant information with previous cluster(s)
Propose a modified MMR for the update task:
consider the total similarity of the candidate
sentence with all selected sentences and
sentences in previously-read cluster(s)
P contains some top-ranked sentences in previous
cluster(s)

Previous cluster overlap
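
Since the re-ranking formulas are not preserved in this transcript, here is a hedged sketch of the idea: greedily pick the sentence whose PageRank score, minus a penalty proportional to its total similarity with already selected sentences and with top sentences P from previously read clusters, is highest. The subtractive form, the penalty weight, and the word-overlap similarity are all assumptions. With an empty P this corresponds to the main-task variant.

```python
def word_overlap(a, b):
    """Illustrative similarity: Jaccard overlap of word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def mmr_select(scores, sim, k, previous=(), penalty=0.5):
    """scores: {sentence: PageRank score}; sim(a, b): similarity function;
    previous: top-ranked sentences from already-read clusters (P)."""
    selected = []
    candidates = set(scores)
    while candidates and len(selected) < k:
        def adjusted(s):
            overlap = (sum(sim(s, t) for t in selected)
                       + sum(sim(s, p) for p in previous))
            return scores[s] - penalty * overlap
        # sorted() makes tie-breaking deterministic.
        best = max(sorted(candidates), key=adjusted)
        selected.append(best)
        candidates.remove(best)
    return selected


ranked = {"a b c": 0.5, "a b d": 0.45, "x y z": 0.3}
print(mmr_select(ranked, word_overlap, k=2))
print(mmr_select(ranked, word_overlap, k=2, previous=("a b c",)))
```

In the second call the sentence already seen in a previous cluster is pushed down, so the summary favors new information.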
82
References

Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, (22).
Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.
S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics on 6th March 2001.
Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 1999.
Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concepts links. In Proceedings of DUC 2005.