Title: Linking and Summarizing Information on Entities
1Linking and Summarizing Information on Entities
- Presented by
- Min-Yen Kan
- Web IR / NLP Group (WING)
- Department of Computer Science, National University of Singapore, Singapore
- This talk archived as http://wing.comp.nus.edu.sg/~kanmy/talks/080407-nihLMC.htm
2Singapore, the garden city
- 4M people, sandwiched between Malaysia and Indonesia
- 50 km from the equator: hot and humid year-long
- Known for urban planning, fondness for acronyms and aversion to bubble gum litterers :-D
WING _at_ NUS
http://wing.comp.nus.edu.sg
- 1 postdoc, 6 Ph.D. students, 5 undergraduates
- Projects in natural language processing, digital libraries, and information retrieval.
3Entity Centric Information Management
- Collate all studies on SBP2 that report new findings in the last year.
- Oh, I meant the PROTEIN SBP2, not the gene.
- What other proteins does SBP2 bind to?
- Tell me more about the contradiction from previous results.
- Which Miller did the study on SBP2 in 2002?
4Entity Centric Information Management
- Two consequences to discuss today
- Linkage
- Joint work with Yee Fan TAN, Dongwon LEE (PSU) et al.
- Summarization
- Joint work with Ziheng LIN et al.
5What's Entity Linkage?
- Aggregating data on an object together from heterogeneous resources
- Problem: Entity names are ambiguous!
- Medical terms
- Person names
- Products
- Customer records
- These problems exist even when we have controlled vocabularies and lexicons (Specialist, UMLS, MeSH)
- Example: By UV cross-linking and immunoprecipitation, we show that SBP2 specifically binds selenoprotein mRNAs both in vitro and in vivo. The SBP2 clone used in this study generates a 3173 nt transcript (2541 nt of coding sequence plus a 632 nt 3' UTR truncated at the polyadenylation site).
- In the first sentence SBP2 is a Protein; in the second it is a Gene
6Examples of Split Records
- Dongwon Lee, 110 E. Foster Ave. 410, State College, PA, 16802 vs. LEE Dong, 110 East Foster Avenue Apartment 410, University Park, PA 16802-2343
- Honda Fit vs. Honda Jazz
- Joint Conf. on Digital Libraries vs. JCDL
- Apple iPod Nano 4GB vs. 4GB iPod nano 4GB
- Entity Linkage vs. De-duplication (ironic, isn't it?)
7All over the web!
Jeffrey D. Ullman (Stanford University)
8Record linkage, formally defined
- Input
- Two lists of records, A and B
- Output
- For each record a in A and each record b in B: do a and b refer to the same entity?
- Note
- Entities do not come with unique identifiers
- To disambiguate (deduplicate) items in a single list L, we set A = B = L
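The definition above can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: `link`, `deduplicate` and the toy matcher `same_tokens` are hypothetical names introduced here, and any pairwise decision function could be plugged in.

```python
from itertools import combinations, product


def link(A, B, same_entity):
    """Return all pairs (a, b), a in A and b in B, that the matcher accepts."""
    return [(a, b) for a, b in product(A, B) if same_entity(a, b)]


def deduplicate(L, same_entity):
    """The A = B = L case: visit each unordered pair once, skipping self-pairs."""
    return [(a, b) for a, b in combinations(L, 2) if same_entity(a, b)]


def same_tokens(a, b):
    """Hypothetical matcher: identical token sets, ignoring case and word order."""
    return set(a.lower().split()) == set(b.lower().split())


records = ["Apple iPod Nano 4GB", "4GB apple ipod nano", "Honda Jazz"]
print(deduplicate(records, same_tokens))
```

Note that the matcher never sees an identifier: it has to decide from the record contents alone, which is the point of the "Note" bullet.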
9Talk Outline
- Linkage using the Web
- Introduction
- >> Record linkage using internal knowledge
- String matching
- Classification or clustering
- Graphical formalisms
- Blocking
- Record linkage using search engines
- Update Summarization
10Fellegi-Sunter model
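[Figure: frequency of record pairs plotted against Similarity(a, b), with one distribution for true non-matches and one for true matches. Two thresholds split the similarity axis into three regions: designate as definite non-match, no-decision region (hold for human review), and designate as definite match. True matches below the lower threshold are false non-matches; true non-matches above the upper threshold are false matches.]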
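The three regions of the model amount to a two-threshold decision rule. A minimal sketch, assuming a similarity score is already available; the function name and the threshold values in the usage lines are illustrative only.

```python
def fellegi_sunter_decision(similarity, lower, upper):
    """Map a similarity score to one of the three Fellegi-Sunter regions."""
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if similarity >= upper:
        return "match"
    if similarity <= lower:
        return "non-match"
    return "review"  # no-decision region: hold for human review


for score in (0.95, 0.50, 0.10):
    print(score, fellegi_sunter_decision(score, lower=0.3, upper=0.8))
```

Moving the thresholds apart lowers both error types at the price of more pairs sent to human review.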
11String matching
- String similarity
- Strings as ordered sequences
- Edit distance
- Jaro and Jaro-Winkler
- Strings as unordered sets
- Jaccard similarity
- Cosine similarity
- Abbreviation matching
- Pattern detection, e.g. National Institutes of Health (NIH)
Ordered sequences: (a, b, c) ≠ (c, b, a)
Unordered sets: {a, b, c} = {c, b, a}
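To make the ordered vs. unordered contrast concrete, here is a small sketch of one measure from each family: Levenshtein edit distance and Jaccard similarity. These are textbook formulations, not code from the talk.

```python
def edit_distance(s, t):
    """Levenshtein distance: treats strings as ordered sequences."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or keep
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity: treats the inputs as unordered sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


print(edit_distance("abc", "cba"))  # order matters: distance is non-zero
print(jaccard("abc", "cba"))        # order ignored: the sets are identical
```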
12Machine Learning
- Create features
- String similarity, relationships (e.g. collaborators)
- Then learn a model
- Naïve Bayes, Support Vector Machine, K-means, Agglomerative Clustering, ...
Same Person?
- Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
- Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
13Graphical Methods: Social network analysis
- Nodes: entities
- Edges: relationships
- Analysis: connected components, distance between nodes, node/edge centrality, cliques, bipartite subgraphs
14Talk Outline
- Linkage using the Web
- Introduction
- Record linkage using internal knowledge
- >> Record linkage using search engines
- Search Engine Features
- Adaptive Queries
- Query Probing
- Update Summarization
15Record linkage using search engines
- Previously
- We assumed input data records contain sufficient information to perform linkage
- What if
- There is insufficient or only noisy information?
- e.g., linking short forms to long forms
- Ask other people!
- I.e., consult external (vs. internal) sources of knowledge
- Use web as collective knowledge base
16Anatomy of Search Engine Results
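[Figure: annotated search engine results page. Labeled parts: number of results; ranked list; title, snippet and URL of each result; the web page behind each result. All of this is programmatically accessible through APIs.]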
17Derivable Features
- Counts
- Co-occurrence measure between count(q1), count(q2) and count(q1 and q2)
- Hyperlinkage
- Count of web pages of q1 that point to pages of q2, and vice versa
- Incorporate additional indirect links with less weight (e.g., q1 → p → q2)
- Snippets or web pages
- (Cosine) similarity using tokens
- Counts of specific terms
- e.g. number of snippets for q1 containing the string q2
- Further natural language processing
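Two of the features above can be sketched directly. The slide does not say which co-occurrence measure was used, so the Jaccard coefficient over hit counts below is an assumption; the cosine function is the standard bag-of-tokens form. The counts in the usage lines are made up.

```python
import math
from collections import Counter


def jaccard_from_counts(count_q1, count_q2, count_both):
    """Co-occurrence from hit counts: |q1 and q2| / |q1 or q2| (assumed measure)."""
    union = count_q1 + count_q2 - count_both
    return count_both / union if union > 0 else 0.0


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two bags of tokens (e.g. snippets)."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


print(jaccard_from_counts(100, 50, 25))
print(cosine("digital libraries conference".split(),
             "joint conference on digital libraries".split()))
```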
18Web page features
- Named entities (NE)
- We consider people, organizations, locations
- Each NE token is a feature
- NE-targeted (NE-T)
- Motivation: middle names and titles
- For NEs having a token of the target name
- Extract tokens that are not in the target name as features
Example: Born Edward Charles Morrice Fox in Chelsea, London
- NE features: Charles, Chelsea, Morrice, Edward, Fox, London, ...
- NE-T features: Charles, Morrice, ...
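The NE and NE-T feature sets can be sketched as set operations. This assumes named entities have already been recognized by some NER tool (not shown) and that the target name in the example is "Edward Fox", which is an assumption made here for illustration; the function names are hypothetical.

```python
def ne_features(named_entities):
    """NE: every token of every recognized named entity is a feature."""
    return {token for ne in named_entities for token in ne.split()}


def ne_targeted_features(named_entities, target_name):
    """NE-T: for NEs sharing a token with the target name,
    keep only the tokens that are not part of the target name."""
    target = set(target_name.split())
    features = set()
    for ne in named_entities:
        tokens = set(ne.split())
        if tokens & target:
            features |= tokens - target
    return features


# Entities as an NER tool might return them for the example sentence.
entities = ["Edward Charles Morrice Fox", "Chelsea", "London"]
print(sorted(ne_features(entities)))
print(sorted(ne_targeted_features(entities, "Edward Fox")))
```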
19Using URLs
- Where web pages are located is also useful
- Hypothesis: If web pages of q1 and web pages of q2 overlap a lot, q1 and q2 are the same entity
- Measure this using URL / host information
- Caveat: Not all hosts are equally telling
- citeseer vs. harvard.edu for author names
- pubmed vs. diabetes-info.com for diabetic terms
- Solution: Weight by Inverse Host Frequency
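A sketch of host overlap weighted by inverse host frequency. The slide does not give the formula, so the IDF-style definition below (log of number of queries over number of queries whose results include the host) is an assumption, as are the function names and the example.* hosts.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def hosts(urls):
    """Set of hostnames appearing in a list of result URLs."""
    return {urlparse(u).hostname for u in urls}


def inverse_host_frequency(results_per_query):
    """IHF(h) = log(N / number of queries whose results include host h)."""
    n = len(results_per_query)
    host_freq = Counter()
    for urls in results_per_query.values():
        host_freq.update(hosts(urls))
    return {h: math.log(n / c) for h, c in host_freq.items()}


def weighted_host_overlap(urls1, urls2, ihf):
    """Share of IHF mass on hosts common to both result lists."""
    h1, h2 = hosts(urls1), hosts(urls2)
    shared = sum(ihf.get(h, 0.0) for h in h1 & h2)
    total = sum(ihf.get(h, 0.0) for h in h1 | h2)
    return shared / total if total else 0.0


results = {
    "q1": ["http://aggregator.example/a", "http://lab.example/pubs"],
    "q2": ["http://aggregator.example/b", "http://lab.example/people"],
    "q3": ["http://aggregator.example/c", "http://other.example/home"],
}
ihf = inverse_host_frequency(results)
print(weighted_host_overlap(results["q1"], results["q2"], ihf))
print(weighted_host_overlap(results["q1"], results["q3"], ihf))
```

A host that shows up for every query (the aggregator here) gets weight log(1) = 0, so sharing it says nothing about whether two names refer to the same entity.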
20URL Features (cont.)
- Page URLs
- Hypothesis: URL itself tells quite a lot
- Home page of lindek
- CS department, University of Alberta, Canada
- MeURLin (Kan and Nguyen Thi, 2005)
- Tokens (http, www, cs, ualberta, ca, lindek)
- URI parts (scheme:http, hostname:cs, user:lindek, ...)
- N-grams (ca ualberta, ualberta cs, cs www, www lindek)
- Length of tokens
Example URL: http://www.cs.ualberta.ca/~lindek/
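A rough sketch of URL-derived features using only the Python standard library. It is not MeURLin itself: the feature names are hypothetical, and the bigrams here are plain adjacent token pairs, whereas the slide lists its n-grams in a different order.

```python
import re
from urllib.parse import urlparse


def url_tokens(text):
    """Split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


def url_features(url):
    parts = urlparse(url)
    tokens = url_tokens(url)
    return {
        "tokens": tokens,
        "scheme": parts.scheme,
        "host_tokens": parts.hostname.split(".") if parts.hostname else [],
        "path_tokens": url_tokens(parts.path),
        "bigrams": list(zip(tokens, tokens[1:])),
        "token_lengths": [len(t) for t in tokens],
    }


features = url_features("http://www.cs.ualberta.ca/~lindek/")
for name, value in features.items():
    print(name, value)
```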
21Web search engine linkage
- Test whether q1 and q2 should be linked
- Hypothesis: Web pages of q1 and web pages of q2 share some representative data I
- Similar to disconnected triples
- Jeffrey D. Ullman: 384K pgs
- Jeffrey D. Ullman aho: 174K pgs
- J. Ullman: 124K pgs
- J. Ullman aho: 41K pgs
- Shimon Ullman: 27.3K pgs
- Shimon Ullman aho: 66 pgs
q1
q2
22Evaluation - Full web pages in WEPS
- Goal
- To compare the usefulness of various features for the Web People Search Task
- Architecture
- Input web pages → Feature vectors → Clusters
- Clustering step: cosine similarity, single link hierarchical agglomerative clustering, minimum similarity threshold
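A compact sketch of the clustering step. It relies on a known equivalence: single-link agglomerative clustering stopped at a minimum similarity threshold yields the connected components of the graph that links every pair at or above the threshold. Feature vectors are plain dicts here; the names and toy vectors are illustrative, not the WePS system.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def single_link_clusters(vectors, threshold):
    """Merge any two items whose similarity reaches the threshold (union-find)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(vectors)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


pages = [{"a": 1}, {"a": 1, "b": 1}, {"b": 1, "c": 1}, {"z": 1}]
print(single_link_clusters(pages, threshold=0.4))
print(single_link_clusters(pages, threshold=0.6))
```

Pages 0 and 2 share no feature, yet at the lower threshold they land in one cluster through page 1. That chaining is what lets single link join citations found on different publication pages.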
23Evaluation
- F-measure (α = 0.5) and similarity threshold 0.2
24Evaluation - Author Disambiguation
- Dataset
- Manually-disambiguated dataset of 24 ambiguous names in computer science domain
- Each ambiguous name represented 2 unique authors (k = 2) except for one where it represented 3
- Each name is attributed to 30 citations on average
- Proportion of largest class ranges from 50% to 97%
- Search engine
- Google (http://www.google.com/)
25Evaluation
- Single link performs best
- Good for clustering citations from different publication pages together (some pages list only selected publications)
- Some authors have disparate research areas, not well represented by a centroid vector
- Resolving hostnames to IP addresses gives best accuracy
[Figure: classification accuracy averaged over all names]
26Discussion
[Figure: per-name accuracies using single link]
[Figure: per-name average number of URLs returned per citation]
27Discussion
- Apparent correlation between accuracy and average number of URLs returned per citation
- Author names with few URLs tend to fare poorly since results are mainly aggregator web sites
- What's the cost?
- Lots of queries needed
- Web page downloads are expensive
- Hence, slow
- Can we speed this up?
- Sure thing
28Query probing
- Consider some publication venues
- Joint Conference on Digital Libraries
- European Conference on Digital Libraries
- Digital Libraries
- Query probing
- Use the common n-gram "digital libraries" as query probe
- If we can obtain information on all three conferences, we save two queries
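One way to pick a probe is to look for the longest word n-gram shared by all names in a group. This is a sketch under that assumption; `common_probe` is a hypothetical helper, and the slide does not say how probes were actually chosen.

```python
def ngrams(tokens, n):
    """All contiguous word n-grams of a token list, as a set of tuples."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def common_probe(names):
    """Longest word n-gram shared by every name, or None if there is none."""
    if not names:
        return None
    token_lists = [name.lower().split() for name in names]
    shortest = min(len(tokens) for tokens in token_lists)
    for n in range(shortest, 0, -1):
        shared = set.intersection(*(ngrams(tokens, n) for tokens in token_lists))
        if shared:
            return " ".join(min(shared))  # min() only makes ties deterministic
    return None


venues = [
    "Joint Conference on Digital Libraries",
    "European Conference on Digital Libraries",
    "Digital Libraries",
]
print(common_probe(venues))
```

One query for the shared probe can then stand in for three separate venue queries, provided its results cover all three.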
29Adaptive querying
- Combine two methods when needed
- Methods
- Ms: stronger method but very slow (e.g. web page similarity)
- Mw: weaker method but fast (e.g. host overlap)
- Aim
- Accuracy close to Ms
- Significantly shorter running time than Ms
- Algorithm
- Execute Mw
- If heuristic suggests that Mw results are likely incorrect
- Execute Ms
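The algorithm above is a fallback cascade. In the sketch below the two methods and the heuristic are stand-ins that read precomputed scores from a dict; the "grey zone" heuristic and its bounds are assumptions made for illustration, since the slide does not specify the heuristic.

```python
def adaptive_link(pair, weak, strong, is_doubtful):
    """Run the fast method Mw first; fall back to the slow method Ms
    only when the heuristic distrusts the Mw result."""
    result = weak(pair)
    if is_doubtful(pair, result):
        return strong(pair), "Ms"
    return result, "Mw"


# Hypothetical stand-ins for the two methods and the heuristic.
def host_overlap(pair):        # Mw: cheap, uses result URLs only
    return pair["host_overlap"]


def page_similarity(pair):     # Ms: expensive, downloads web pages
    return pair["page_similarity"]


def in_grey_zone(pair, score, low=0.2, high=0.8):
    return low < score < high  # Mw is trusted only when it is decisive


clear_case = {"host_overlap": 0.9, "page_similarity": 0.95}
unclear_case = {"host_overlap": 0.5, "page_similarity": 0.1}
print(adaptive_link(clear_case, host_overlap, page_similarity, in_grey_zone))
print(adaptive_link(unclear_case, host_overlap, page_similarity, in_grey_zone))
```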
30Entity Linkage - Conclusion
- Important problem with a rich history
- New external methods poll contextual evidence for judgment
- Need to combine methods to obtain best aspect of each
31Talk Outline
- Linkage using the Web
- >> Graph-based Update Summarization
- Introduction
- Timestamped Graphs
- Evaluation and Conclusions
Now that all this data is linked, how do we
process it?
32Applications of Summarization
Doing Less Work
Decision Support
33More seriously: an exciting challenge ...
- ...put a book on the scanner, turn the dial to 2 pages, and read the result...
- ...download 1000 documents from the web, send them to the summarizer, and select the best ones by reading the summaries of the clusters...
- ...forward the Japanese email to the summarizer, select 1 par, and skim the translated summary.
- get a weekly digest of new treatments and therapies for pressure ulcers
An update task
34Simplifying summarization
- Select important sentences verbatim from the input text to form a summary
- Input: A text document with k sentences
- Output: Top n (n << k) sentences with the highest numeric scores (each sentence in the input document is assigned a numeric score)
Extractive Summarization
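The input/output contract above fits in a few lines. The default scoring function (average term frequency of a sentence's words) is only a placeholder assumption; any of the heuristics on the next slide could be passed in as `score`.

```python
import re
from collections import Counter


def extractive_summary(sentences, n, score=None):
    """Return the n highest-scoring sentences, verbatim, in document order."""
    if score is None:
        tf = Counter(w for s in sentences for w in re.findall(r"\w+", s.lower()))

        def score(sentence):
            words = re.findall(r"\w+", sentence.lower())
            return sum(tf[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]


document = [
    "SBP2 binds selenoprotein mRNAs.",
    "The weather was nice.",
    "SBP2 binds mRNAs in vivo.",
]
print(extractive_summary(document, n=2))
```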
35Summarization
- Heuristics for extractive summarization
- Cue/stigma phrases
- Sentence position (relative to document, section, paragraph)
- Sentence length
- TFIDF, TF scores
- Similarity (with title, context, query)
- Machine learning to tune weights by supervised learning
- Recently, graphical representations of text have shed new light on the summarization problem
36Revisiting Social Networks: Prestige
- One motivation was to model the problem as finding prestige of nodes in a social network
- PageRank: random walk
- In summarization, led to TextRank and LexRank
- Did we leave anything out of our representation for summarization?
- Yes, the notion of an evolving network
37Social networks change!
- Natural evolving networks (Dorogovtsev and Mendes, 2001)
- Citation networks: New papers can cite old ones, but the old network is static
- The Web: new pages are added with an old page connecting them to the web graph; old pages may update links
38Talk Outline
- Linkage using the Web
- Graph-based Update Summarization
- Introduction
- >> Timestamped Graphs
- Evaluation and Conclusion
39Evolutionary models for summarization
- Writers and readers often follow conventional rhetorical styles: articles are not written or read in an arbitrary way
- Consider the evolution of texts using a very simplistic model
- Writers write from the first sentence onwards in a text
- Readers read from the first sentence onwards of a text
- A simple model: sentences get added incrementally to the graph
40Timestamped Graph Construction
- These assumptions suggest that we iteratively add sentences into the graph in chronological order.
- At each iteration, consider which edges to add to the graph.
- For single document: simple and straightforward; add 1st sentence, followed by the 2nd, and so forth, until the last sentence is added
- For multi-document: treat it as multiple instances of single documents, which evolve in parallel; i.e., add 1st sentences of all documents, followed by all 2nd sentences, and so forth
- NB: Doesn't really model chronological ordering between articles, fix later
41Timestamped Graph Construction
- Model
- Documents as columns
- d_i: document i
- Sentences as rows
- s_j: jth sentence of a document
42Timestamped Graph Construction
[Figure: grid of sentence nodes with documents doc1, doc2, doc3 as columns and sentences sent1, sent2, sent3 as rows]
43An example TSG: DUC 2007 D0703A-A
44Timestamped Graph Construction
- This is just one instance of TSGs
- Let's generalize and formalize them
- Def: A timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies a resulting algorithm that takes as input the set of texts M and outputs a graph G
- (d, e, u, f): properties of edges
- (σ, t, i, s): properties of nodes
- τ: input text transformation function
45Edge properties (d, e, u, f)
- Edge Direction (d)
- Forward, backward, or undirected
- Edge Number (e)
- Number of edges to instantiate per timestep
- Edge Weight (u)
- Weighted or unweighted edges
- Inter-document factor (f)
- Penalty factor for links between documents in multi-document sets.
46Node properties (σ, t, i, s)
- Vertex selection function σ(u, G)
- One strategy: among those nodes not yet connected to u in G, choose the one with highest similarity to u
- Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005)
- Text unit type (t)
- Most extractive algorithms use sentences as elementary units
- Node increment factor (i)
- How many nodes get added at each timestep
- Skew degree (s)
- Models how nodes in multi-document graphs are added
- Skew degree: how many iterations to wait before adding the 1st sentence of the next document
- Skip for today
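Putting the edge and node properties together, a toy construction loop might look as follows. This is a sketch under several assumptions, not the authors' implementation: edges are stored as directed pairs (u, chosen node), "not yet connected" is read as "u has no outgoing edge to that node", one node per document is added per timestep, skew is 0, and the inter-document factor is ignored. `build_tsg` and `token_jaccard` are hypothetical names.

```python
def build_tsg(docs, similarity, edges_per_step=1):
    """docs: list of documents, each a list of sentences.
    Returns (nodes, edges); a node is a (doc index, sentence index) pair."""
    nodes, edges = [], set()
    longest = max((len(doc) for doc in docs), default=0)
    for t in range(longest):
        # Timestep t: add the t-th sentence of every document that has one.
        nodes.extend((d, t) for d, doc in enumerate(docs) if t < len(doc))
        # Every node now in the graph instantiates up to e new edges.
        for u in nodes:
            for _ in range(edges_per_step):
                candidates = [v for v in nodes if v != u and (u, v) not in edges]
                if not candidates:
                    break
                best = max(candidates, key=lambda v: similarity(
                    docs[u[0]][u[1]], docs[v[0]][v[1]]))
                edges.add((u, best))
    return nodes, edges


def token_jaccard(a, b):
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0


docs = [["a b", "a c"], ["a b c"]]
nodes, edges = build_tsg(docs, token_jaccard)
print(nodes)
print(sorted(edges))
```

Because older sentences get a chance to add an edge at every timestep, early sentences accumulate more links than late ones, which is the evolutionary effect the model is after.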
47Timestamped Graph Construction
- Representations
- We can model a number of different algorithms using this 9-tuple formalism
- (d, e, u, f, σ, t, i, s, τ)
- The given toy example
- (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
- LexRank graphs
- (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null)
- N: total number of sentences in the cluster; Lmax: the max document length
- i.e., all sentences are added into the graph in one timestep, each connected to all others, and cosine scores are given to edge weights
48System Overview
- Sentence splitting
- Detect and mark sentence boundaries
- Annotate each sentence with the doc ID and the sentence number
- E.g., XIE19980304.0061: 4 March 1998 from Xinhua News; XIE19980304.0061-14: the 14th sentence of this document
- Graph construction
- Construct TSG in this phase
49System Overview
- Sentence Ranking
- Apply topic-sensitive random walk on the graph to redistribute the weights of the nodes
- Sentence extraction
- Extract the top-ranked sentences
- Two different modified MMR re-rankers are used, depending on whether it is main or update task
50Talk Outline
- Linkage using the Web
- Graph-based Update Summarization
- Introduction
- Timestamped Graphs
- >> Evaluation and Conclusion
51Evaluation
- Dataset: DUC 2005, 2006 and 2007.
- Evaluation tool: ROUGE, n-gram based automatic evaluation
- Each dataset contains 50 or 45 clusters; each cluster contains a query and 25 documents
- Evaluate on some parameters
- Do different e values affect the summarization process?
- e = 2 works best for DUC dataset
- How do topic-sensitivity and edge weighting perform in running PageRank?
- Applying both seems to have best effect
- How does skewing the graph affect the information flow in the graph?
- Skew of 1 works best, but need to try other possibilities
52Holistic Evaluation in DUC 2007
- Extractive-based TSG system
- Used modified maximal marginal relevance for update tasks
- Penalize links in previously read articles
- Extension of inter-document factor (f)
[Figure: Cluster 1, Cluster 2, Cluster 3]
53Evaluation Results
- Main task: 10th of 32 systems
- Update task: 3rd of 24 systems
- Conclusion
- TSG formalism better tailored to deal with update / incremental text tasks
- New method that may be competitive with current approaches
- Other top scoring systems may do sentence compression (abstractive), not just extraction
54Graph-based Update Summary - Conclusion
- Proposed a timestamped graph model for text understanding and summarization
- Adds sentences in an incremental fashion
- Future work
- Freely skewed model
- Empirical and theoretical properties of TSGs
55Where do we go from here?
- Organizing data around entities, events
- How people deal with data anyway
- Understand objects and their inter/intra-relationship
- Automation requires domain expertise within a generic framework
- Collate all studies on SBP2 that report new findings in the last year.
- Oh, I meant the PROTEIN SBP2, not the gene.
- What other proteins does SBP2 bind to?
- Tell me more about the contradiction from previous results.
- Which Miller did the study on SBP2 in 2002?
- Thank you!
- http://wing.comp.nus.edu.sg/
56Backup Slides - Entity Linkage
- 50 Minute talk total
- 7 Apr 2008, 10-11 AM
57Social network analysis
- Connected triple
- Random walk
[Figure: example graphs over nodes s, t, x1, x2 and x3]
58Scalability Issues
- Pairwise comparisons
- Requires O(n^2) time
- Major bottleneck
- Possible solutions
- Blocking techniques
- Avoiding pairwise comparisons altogether
Input: d1, d2, ..., dn
for i = 1 to n
    for j = (i + 1) to n
        compute sim(di, dj)
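Blocking replaces the double loop with comparisons inside small groups. A minimal sketch, assuming a cheap blocking key exists; the key used in the example (lowercased first token) is an illustrative choice only.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, blocking_key):
    """Yield only pairs of records that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)


def first_token(name):
    return name.lower().split()[0]


names = ["Lee D", "Lee Dongwon", "Kan M", "Tan Y", "Kan Min-Yen"]
pairs = list(candidate_pairs(names, first_token))
print(pairs)
print(len(pairs), "pairs instead of", len(names) * (len(names) - 1) // 2)
```

The saving comes at a risk: two records of the same entity that fall into different blocks are never compared, so the key has to be forgiving enough for the data at hand.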
59Cost-utility Framework
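[Figure: an instance with features fi, each holding either a known value or a value that can be acquired; every acquirable feature fi is annotated with the cost of acquiring fi and the utility of acquiring fi]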
60Record Matching
- Header-reference pair (instance): MATCH/MISMATCH?
- (1) Given information: TITLE_MIN_LEN, TITLE_MAX_LEN, AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN, VENUE_MAX_LEN
- (2) Information that can be acquired at a cost: TITLE_SIM, AUTHOR_SIM, VENUE_SIM
- Training data: Assume all feature-values and their acquisition costs known
- Testing data: Assume (1) known, but feature-values and their acquisition costs in (2) unknown
- Costs: Set to MIN_LEN MAX_LEN
61Costs and Utilities
- Costs
- Trained 3 models (using M5), treat as regression
- Utilities
- Trained 2^3 = 8 classifiers (each to predict match/mismatch using only known feature-values)
- For a test instance with a missing feature-value F
- Get confidence of appropriate classifier without F
- Get expected confidence of appropriate classifier with F
- Utility is difference between the two confidence scores
- Note
- Similar to Saar-Tsechansky et al.
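The utility computation can be sketched as an expectation over the possible values of the missing feature. The value distribution, the confidences, and the utility-per-cost selection rule below are all made-up illustrations; the slides do not state how the expectation or the final choice was computed.

```python
def utility(confidence_without, confidence_with, value_probability):
    """Expected confidence after acquiring F minus confidence without F.

    confidence_with[v]: classifier confidence if F turns out to be v.
    value_probability[v]: estimated probability that F takes value v.
    """
    expected = sum(p * confidence_with[v] for v, p in value_probability.items())
    return expected - confidence_without


def best_feature(candidates):
    """Pick the feature with the highest utility per unit cost.
    candidates maps a feature name to a (utility, cost) pair."""
    return max(candidates, key=lambda f: candidates[f][0] / candidates[f][1])


u_title = utility(0.6, {"high": 0.9, "low": 0.7}, {"high": 0.5, "low": 0.5})
print(round(u_title, 3))
print(best_feature({"TITLE_SIM": (u_title, 40.0), "VENUE_SIM": (0.1, 10.0)}))
```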
62Results
[Figure: two result plots, "Without cleaning of header records" and "With manual cleaning of header records"; x-axis of both: increasing proportion of feature-values acquired]
63Selected Bibliography
- General and surveys
- Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the American Statistical Association, 64(328):1183-1210, December 1969.
- William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter Model of record linkage to the 1990 U.S. Decennial Census. Technical Report RR91/09, U.S. Bureau of the Census, 1991.
- Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering (TKDE), 19(1):1-16, January 2007.
- William E. Winkler. Overview of record linkage and current research directions. Technical Report RRS2006/02, U.S. Bureau of the Census, February 2006.
- Mikhail Bilenko, Raymond J. Mooney, William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. Adaptive name matching in information integration. IEEE Intelligent Systems, 18(5):16-23, January/February 2003.
- Min-Yen Kan and Yee Fan Tan. Record Matching in Digital Library Metadata. To appear in Communications of the ACM (CACM).
64Selected Bibliography
- String matching
- Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the Association of Computing Machinery, 21(1):168-173, January 1974.
- Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443-453, March 1970.
- Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular Biology, 147(1):195-197, March 1981.
- Andrés Marzal and Enrique Vidal. Computation of normalized edit distance and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):926-932, September 1993.
- Alvaro E. Monge and Charles Elkan. The field matching problem: Algorithms and applications. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 267-270, August 1996.
- Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311-321, March 2004.
- Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 39-48, August 2003.
- Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
- William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Information Integration on the Web (IIWeb), pages 73-78, August 2003.
- Ariel S. Schwartz and Marti A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In Pacific Symposium on Biocomputing (PSB), pages 451-462, January 2003.
- Youngja Park and Roy J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 126-133, June 2001.
- Jeffrey T. Chang, Hinrich Schütze, and Russ B. Altman. Creating an online dictionary of abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 9(6):612-620, November/December 2002.
- Hiroko Ao and Toshihisa Takagi. ALICE: An algorithm to extract abbreviations from MEDLINE. Journal of the American Medical Informatics Association, 12(5):576-586, September/October 2005.
65Selected Bibliography
- Direct classification or clustering, and blocking
- Hui Han, Hongyuan Zha, and C. Lee Giles. A model-based K-means algorithm for name disambiguation. In Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data, October 2003.
- Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for name disambiguation in author citations. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 296-305, June 2004.
- Hui Han, Wei Xu, Hongyuan Zha, and C. Lee Giles. A hierarchical naive bayes mixture model for name disambiguation in author citations. In ACM Symposium on Applied Computing (SAC), pages 1065-1069, March 2005.
- Hui Han, Hongyuan Zha, and C. Lee Giles. Name disambiguation in author citations using a K-way spectral clustering method. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 334-343, June 2005.
- Dongwon Lee, Byung-Won On, Jaewoo Kang, and Sanghyun Park. Effective and scalable solutions for mixed and split citation problems in digital libraries. In ACM SIGMOD Workshop on Information Quality in Information Systems (IQIS), pages 69-76, June 2005.
- Byung-Won On, Dongwon Lee, Jaewoo Kang, and Prasenjit Mitra. Comparative study of name disambiguation problem using a scalable blocking-based framework. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 344-353, June 2005.
- Andrew McCallum, Kamal Nigam, and Lyle Ungar. Efficient clustering of high-dimensional data sets with application to reference matching. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169-178, August 2000.
- Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for record linkage. In National Conference on Artificial Intelligence (AAAI), July 2006.
- Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage and Clustering. In IEEE International Conference on Data Mining (ICDM), December 2006.
66Selected Bibliography
- Graphical models
- Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311-321, March 2004.
- John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), pages 282-289, June/July 2001.
- Andrew McCallum and Ben Wellner. Object consolidation by graph partitioning with a conditionally-trained distance metric. In ACM SIGKDD Workshop on Data Cleaning, Record Linkage, and Object Consolidation, pages 19-24, August 2003.
- Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated, conditional model of information extraction and coreference with application to citation matching. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 593-601, July 2004.
- Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
- Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex information spaces. In ACM SIGMOD International Conference on Management of Data, pages 85-96, June 2005.
- Indrajit Bhattacharya and Lise Getoor. A latent dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, pages 47-58, April 2006.
67Selected Bibliography
- Social network analysis
- H. A. Kautz, B. Selman, and M. A. Shah. The hidden web. AI Magazine, 18(2):27-36, 1997.
- P. Mutschke. Mining networks and central entities in digital libraries. A graph theoretic approach applied to co-author networks. In Intelligent Data Analysis (IDA), pages 155-166, August 2003.
- M. E. J. Newman. Who is the best connected scientist? A study of scientific coauthorship networks. In Complex Networks, pages 337-370, February 2004.
- E. Otte and R. Rousseau. Social network analysis: a powerful strategy, also for the information sciences. Journal of Information Science, 28(6), December 2002.
- T. Krichel and N. Bakkalbasi. A social network analysis of research collaboration in the economics community. In International Workshop on Webometrics, Informetrics and Scientometrics & Seventh COLLNET Meeting, May 2006.
- R. Rousseau and M. Thelwall. Escher staircases on the world wide web. First Monday, 9(6), June 2004.
- D. G. Feitelson. On identifying name equivalences in digital libraries. Information Research, 9(4), October 2004.
- R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In International conference on World Wide Web (WWW), pages 463-470, May 2005.
- R. Holzer, B. Malin, and L. Sweeney. Email alias detection using social network analysis. In Workshop on Link Discovery: Issues, Approaches and Applications (LinkKDD), August 2005.
- B. Malin, E. Airoldi, and K. M. Carley. A network analysis model for disambiguation of names in lists. Computational and Mathematical Organization Theory, 11(2):119-139, July 2005.
- G. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 150-160, August 2000.
- P. K. Reddy and M. Kitsuregawa. An approach to build a cyber-community hierarchy. In SIAM ICDM Workshop on Web Analysis, April 2002.
- Patrick Reuther. Personal name matching: New test collections and a social network based approach. Technical Report Mathematics/Computer Science 06-01, University of Trier, March 2006.
- Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397-406, May 2006.
68Selected Bibliography
- Web-based methods
- Jamie P. Callan, Margie E. Connell, and Aiqun Du. Automatic discovery of language models for text databases. In ACM SIGMOD International Conference on Management of Data, pages 479-490, June 1999.
- Jamie P. Callan and Margie E. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems (TOIS), 19(2):97-130, April 2001.
- Panagiotis G. Ipeirotis and Luis Gravano. Distributed search over the hidden-web: Hierarchical database sampling and selection. In International Conference on Very Large Databases (VLDB), pages 394-405, August 2002.
- Luis Gravano, Panagiotis G. Ipeirotis, and Mehran Sahami. QProber: A system for automatic classification of hidden-web databases. ACM Transactions on Information Systems (TOIS), 21(1):1-41, January 2003.
- Aron Culotta, Ron Bekkerman, and Andrew McCallum. Extracting social networks and contact information from email and the web. In Conference on Email and Anti-Spam (CEAS), July 2004.
- Philipp Cimiano, Siegfried Handschuh, and Steffen Staab. Towards the self-annotating web. In International conference on World Wide Web (WWW), pages 462-471, May 2004.
- Philipp Cimiano, Günter Ladwig, and Steffen Staab. Gimme the context: Context-driven automatic semantic annotation with C-PANKOW. In International conference on World Wide Web (WWW), pages 332-341, May 2005.
- Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide Web (WWW), pages 397-406, May 2006.
- Yee Fan Tan, Min-Yen Kan, and Dongwon Lee. Search engine driven author disambiguation. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), June 2006.
- Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee, and Yi Zhang. Googled name linkage. 2007.
- Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, and Dongwon Lee. Record Linkage of Short Forms to Long Forms: A Case Study of Publication Venues. 2007.
- Min-Yen Kan. Web page classification without the web page. In International conference on World Wide Web (WWW), pages 262-263, May 2004.
- Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast webpage classification using URL features. In International Conference on Information and Knowledge Management (CIKM), pages 325-326, October/November 2005.
- Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano. To search or to crawl? Towards a query optimizer for text-centric tasks. In ACM SIGMOD International Conference on Management of Data, pages 265-276, June 2006.
69Backup Slides - Summarization
70A Summarization Machine
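[Figure: a summarization machine. Inputs: DOC or MULTIDOCS, optionally with a QUERY. Dials: length (Headline, Very Brief, Brief, Long; 10, 50, 100), Extract vs. Abstract, Indicative vs. Informative, Generic vs. Query-oriented, Just the news vs. Background. Output: Summary.]
Generate a summary given a text document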
71Summarization, defined
- Definitions
- Take a text document, extract content from it and present the most important content to the user in a condensed form and in a manner sensitive to the user's or application's needs
- Summarization requires
- understanding the meaning of a text document
- generating fluent text summary
- Studies of human summarizers
- Cremmins (65) and Endres-Niggemeyer (98) showed that professional summarizers used clues to pick summary content.
72Skew Degree Examples
- time(d1) < time(d2) < time(d3) < time(d4)
[Figure: documents d1 d2 d3 d4 as columns in three layouts: skewed by 1, skewed by 2, and freely skewed]
- Freely skewed: Only add a new document when it would be linked by some node using vertex function σ
73Input text transformation function (τ)
- Document Segmentation Function (τ)
- Problem observed in some clusters where some documents in a multi-document cluster are very long
- Takes many timesteps to introduce all of the sentences, causing too many edges to be drawn
- τ(G) segments long documents into several sub-docs
- Solution is too hacked; hope to investigate more in current and future work
[Figure: long document d5 split into sub-documents d5a and d5b]
74Evaluation on number of edges (e)
- Tried different e values
- Optimal performance: e = 2
- At e = 1, graph is too loosely connected, not suitable for PageRank → very low performance
- At e = N, a LexRank system
[Figure: result plots over e, with e = 2 and e = N marked]
75Evaluation (other edge parameters)
- PageRank: generic vs topic-sensitive
- Edge weight (u): unweighted vs weighted
- Optimal performance: topic-sensitive PageRank and weighted edges
Topic-sensitive  Weighted edges  ROUGE-1  ROUGE-2
No               No              0.39358  0.07690
Yes              No              0.39443  0.07838
No               Yes             0.39823  0.08072
Yes              Yes             0.39845  0.08282
76Evaluation on skew degree (s)
- Different skew degrees: s = 0, 1 and 2
- Optimal performance: s = 1
- s = 2 introduces a delay interval that is too large
- Need to try freely skewed graphs
Skew degree  ROUGE-1  ROUGE-2
0            0.36982  0.07580
1            0.37268  0.07682
2            0.36998  0.07489
77Describing Summaries
- Aspects of summarization (Sparck-Jones 97, Hovy and Lin 99)
- Input
- Single-document vs. multi-document
- Purpose
- Situation: embedded in larger system (MT, IR) or not?
- Generic vs. query-oriented: author's view or user's interest?
- Indicative vs. informative: categorization or understanding?
- Background vs. just-the-news: does user have prior knowledge?
- Output
- Extract vs. abstract: use text fragments or re-phrase content?
78Differences for main and update task processing
- Main task
- Construct a TSG for input cluster
- Run topic-sensitive PageRank on the TSG
- Apply first modified version of MMR to extract
sentences
- Update task
- Cluster A
- Construct a TSG for cluster A
- Run topic-sensitive PageRank on the TSG
- Apply the second modified version of MMR to extract sentences
- Cluster B
- Construct a TSG for clusters A and B
- Run topic-sensitive PageRank on the TSG; only retain sentences from B
- Apply the second modified version of MMR to extract sentences
- Cluster C
- Construct a TSG for clusters A, B and C
- Run topic-sensitive PageRank on the TSG; only retain sentences from C
- Apply the second modified version of MMR to extract sentences
79Sentence Ranking
- Once a timestamped graph is built, we want to compute a prestige score for each node
- PageRank: use an iterative method that allows the weights of the nodes to redistribute until stability is reached
- Similarities as edges → weighted edges; query → topic-sensitive
[Equation annotations: topic-sensitive (Q) portion; standard random walk term]
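A sketch of a random walk with a query-biased jump, in the spirit of the two annotated terms. It follows the common topic-sensitive PageRank form (jump to a node in proportion to its query similarity, otherwise follow weighted edges) and is an assumption about the exact equation on the slide; the damping value, iteration count and dangling-node handling are choices made here.

```python
def topic_sensitive_pagerank(weights, relevance, damping=0.85, iterations=50):
    """weights[v][u]: weight of edge v -> u (every target must be a key of relevance).
    relevance[u]: similarity of node u to the query Q."""
    nodes = list(relevance)
    total = sum(relevance.values())
    if total > 0:
        jump = {u: relevance[u] / total for u in nodes}  # topic-sensitive portion
    else:
        jump = {u: 1.0 / len(nodes) for u in nodes}      # fall back to uniform
    rank = dict(jump)
    for _ in range(iterations):
        new_rank = {u: (1 - damping) * jump[u] for u in nodes}
        for v in nodes:
            out = weights.get(v, {})
            out_total = sum(out.values())
            if out_total == 0:
                # Dangling node: hand its rank back via the jump distribution.
                for u in nodes:
                    new_rank[u] += damping * rank[v] * jump[u]
            else:
                # Standard random walk term, split by edge weight.
                for u, w in out.items():
                    new_rank[u] += damping * rank[v] * w / out_total
        rank = new_rank
    return rank


graph = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
query_sim = {"a": 1.0, "b": 1.0, "c": 3.0}
scores = topic_sensitive_pagerank(graph, query_sim)
print({node: round(score, 3) for node, score in scores.items()})
```

Nodes a and c sit in mirror-image positions in this graph, so any difference in their scores comes from the query bias alone.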
80Sentence Extraction: Main task
- Original MMR: integrates a penalty of the maximal similarity of the candidate document and one selected document
- Ye et al. (2005) introduced a modified MMR: integrates a penalty of the total similarity of the candidate sentence and all selected sentences
- Score(s): PageRank score of s; S: selected sentences
- This is used in the main task
[Equation annotation: penalty, all previous sentence similarity]
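The difference between the two penalties can be shown with a small greedy re-ranker. The linear combination with weight `lam` is an assumption, since the exact equation is not in the text; `rerank` and the toy scores are illustrative.

```python
def rerank(scores, similarity, n, penalty="total", lam=0.5):
    """Greedy MMR-style selection of n sentences.

    penalty="max":   original MMR, maximal similarity to one selected item.
    penalty="total": modified MMR, summed similarity to all selected items.
    """
    selected = []
    candidates = sorted(scores)  # sorted only to make ties deterministic
    while candidates and len(selected) < n:
        def adjusted(s):
            sims = [similarity(s, t) for t in selected]
            if not sims:
                redundancy = 0.0
            elif penalty == "max":
                redundancy = max(sims)
            else:
                redundancy = sum(sims)
            return lam * scores[s] - (1 - lam) * redundancy

        best = max(candidates, key=adjusted)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: s1 and s2 are near-duplicates, s3 is different but scores lower.
pagerank_scores = {"s1": 1.0, "s2": 0.9, "s3": 0.5}
pair_sim = {frozenset(("s1", "s2")): 0.8}


def sim(x, y):
    return pair_sim.get(frozenset((x, y)), 0.0)


print(rerank(pagerank_scores, sim, n=2))           # redundancy penalized
print(rerank(pagerank_scores, sim, n=2, lam=1.0))  # pure score ranking
```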
81Sentence Extraction: Update task
- Update task assumes readers already read previous cluster(s)
- implies we should not select sentences that have redundant information with previous cluster(s)
- Propose a modified MMR for the update task
- consider the total similarity of the candidate sentence with all selected sentences and sentences in previously-read cluster(s)
- P contains some top-ranked sentences in previous cluster(s)
[Equation annotation: previous cluster overlap]
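For the update task the penalty also counts overlap with P, the top-ranked sentences of previously read clusters. As before, this is a sketch with an assumed linear weighting `lam`, not the equation from the slide; the function name and toy data are hypothetical.

```python
def select_update_sentences(scores, similarity, previous, n, lam=0.5):
    """Greedy selection that penalizes similarity to already selected
    sentences and to sentences in P (from previously read clusters)."""
    selected = []
    candidates = sorted(scores)  # sorted only to make ties deterministic
    while candidates and len(selected) < n:
        def adjusted(s):
            overlap = sum(similarity(s, t) for t in selected)
            overlap += sum(similarity(s, p) for p in previous)  # previous cluster overlap
            return lam * scores[s] - (1 - lam) * overlap

        best = max(candidates, key=adjusted)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: b1 ranks highest in cluster B but repeats a1 from cluster A.
cluster_b_scores = {"b1": 1.0, "b2": 0.8}
P = ["a1"]
known_sim = {frozenset(("b1", "a1")): 0.9}


def overlap_sim(x, y):
    return known_sim.get(frozenset((x, y)), 0.0)


print(select_update_sentences(cluster_b_scores, overlap_sim, P, n=1))
print(select_update_sentences(cluster_b_scores, overlap_sim, [], n=1))
```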
82References
- Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, (22).
- Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.
- S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics on 6th March 2001.
- Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
- Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. In Proceedings of ACM-SIAM Symposium on Discrete Algorithms, 1999.
- Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concepts links. In Proceedings of DUC 2005.