Linking and Summarizing Information on Entities

Provided by: tanyeefanm

1
Linking and Summarizing Information on Entities
  • Presented by
  • Min-Yen Kan
  • Web IR / NLP Group (WING)
  • Department of Computer Science, National
    University of Singapore, Singapore
  • This talk archived as
    http://wing.comp.nus.edu.sg/~kanmy/talks/080407-nihLMC.htm

2
Singapore, the garden city
  • 4M people, sandwiched between Malaysia and
    Indonesia
  • 50 km from the equator: hot and humid year-long
  • Known for urban planning, fondness for
    acronyms and aversion to bubble gum litterers :-D

WING @ NUS
http://wing.comp.nus.edu.sg
  • 1 postdoc, 6 Ph.D. students, 5 undergraduates
  • Projects in natural language processing,
    digital libraries, and information retrieval.

3
Entity Centric Information Management
  • Collate all studies on SBP2 that report new
    findings in the last year.
  • Oh, I meant the PROTEIN SBP2, not the gene.
  • What other proteins does SBP2 bind to?
  • Tell me more about the contradiction from
    previous results.
  • Which Miller did the study on SBP2 in 2002?

4
Entity Centric Information Management
  • Two consequences to discuss today
  • Linkage
  • Joint work with Yee Fan TAN, Dongwon LEE (PSU) et
    al.
  • Summarization
  • Joint work with Ziheng LIN et al.

5
What's Entity Linkage?
  • Aggregating data on an object together from
    heterogeneous resources
  • Problem: Entity names are ambiguous!
  • Medical terms
  • Person names
  • Products
  • Customer records
  • These problems exist even when we have
    controlled vocabulary and lexicons (Specialist,
    UMLS, MeSH)

By UV cross-linking and immunoprecipitation, we
show that SBP2 specifically binds selenoprotein
mRNAs both in vitro and in vivo. The SBP2 clone
used in this study generates a 3173 nt transcript
(2541 nt of coding sequence plus a 632 nt 3' UTR
truncated at the polyadenylation site).
Protein
Gene
6
Examples of Split Records
  • Dongwon Lee, 110 E. Foster Ave. 410, State
    College, PA, 16802 vs. LEE Dong, 110 East Foster
    Avenue Apartment 410, University Park, PA
    16802-2343
  • Honda Fit vs. Honda Jazz
  • Joint Conf. on Digital Libraries vs. JCDL
  • Apple iPod Nano 4GB vs. 4GB iPod nano 4GB
  • Entity Linkage vs. De-duplication

Ironic, isn't it?
7
All over the web!
Jeffrey D. Ullman (Stanford University)
8
Record linkage, formally defined
  • Input:
  • Two lists of records, A and B
  • Output:
  • For each record a in A and each record b in
    B, do a and b refer to the same entity?
  • Note:
  • Entities do not come with unique identifiers
  • To disambiguate (deduplicate) items in a single
    list L, we set A = B = L

9
Talk Outline
  • Linkage using the Web
  • Introduction
  • >> Record linkage using internal knowledge
  • String matching
  • Classification or clustering
  • Graphical formalisms
  • Blocking
  • Record linkage using search engines
  • Update Summarization

10
Fellegi-Sunter model
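[Figure: Fellegi-Sunter model. x-axis: Similarity(a, b);
y-axis: frequency of similarity. Curves: true
non-matches and true matches. Regions, from low to
high similarity: designate as definite non-match,
no-decision region (hold for human review), designate
as definite match. Overlap areas: false matches and
false non-matches.]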
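The three regions of the Fellegi-Sunter picture can be sketched as a two-threshold decision rule. This is only an illustration: the function name and the threshold values 0.3 and 0.8 are assumptions, not values from the talk.

```python
def fellegi_sunter_decision(similarity, lower=0.3, upper=0.8):
    """Three-way decision for one record pair (a, b).

    lower and upper are hypothetical thresholds; in practice they are
    chosen to bound the false match and false non-match rates.
    """
    if lower > upper:
        raise ValueError("lower must not exceed upper")
    if similarity >= upper:
        return "match"        # designate as definite match
    if similarity <= lower:
        return "non-match"    # designate as definite non-match
    return "review"           # no-decision region: hold for human review


for score in (0.1, 0.5, 0.9):
    print(score, fellegi_sunter_decision(score))
```

Widening the gap between the two thresholds sends more pairs to human review and reduces both kinds of error.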
11
String matching
  • String similarity
  • Strings as ordered sequences
  • Edit distance
  • Jaro and Jaro-Winkler
  • Strings as unordered sets
  • Jaccard similarity
  • Cosine similarity
  • Abbreviation matching
  • Pattern detection, e.g., National Institutes of
    Health (NIH)

Ordered sequences: (a, b, c) ≠ (c, b, a)
Unordered sets: {a, b, c} = {c, b, a}
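A minimal sketch of one measure from each family: edit distance treats strings as ordered sequences, Jaccard similarity treats them as unordered token sets. Whitespace tokenization and lowercasing are simplifying assumptions.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(t) + 1))          # distances from "" to prefixes of t
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard similarity over lowercased whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


print(edit_distance("kitten", "sitting"))                  # 3
print(jaccard("Apple iPod Nano 4GB", "4GB iPod nano 4GB"))  # 0.75
```

The iPod pair scores well on the set-based measure because token order and repetition are ignored, while a character-level edit distance on the same pair would be large.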
12
Machine Learning
  • Create features
  • String similarity, relationships (e.g.
    collaborators)
  • Then learn a model
  • Naïve Bayes, Support Vector Machine, K-means,
    Agglomerative Clustering, ...

Yoojin Hong, Byung-Won On and Dongwon Lee. System
Support for Name Authority Control Problem in
Digital Libraries: OpenDBLP Approach. ECDL 2004.
Same Person?
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital
Libraries for the Next Millennium: Challenges and
Research Directions. Information Systems Frontiers
1999.
13
Graphical Methods: Social network analysis
  • Nodes: entities
  • Edges: relationships

Analysis: connected components, distance between
nodes, node/edge centrality, cliques, bipartite
subgraphs
14
Talk Outline
  • Linkage using the Web
  • Introduction
  • Record linkage using internal knowledge
  • >> Record linkage using search engines
  • Search Engine Features
  • Adaptive Queries
  • Query Probing
  • Update Summarization

15
Record linkage using search engines
  • Previously
  • We assumed input data records contain sufficient
    information to perform linkage
  • What if
  • There is insufficient or only noisy information?
  • e.g., linking short forms to long forms
  • Ask other people!
  • I.e., consult external (vs. internal) sources of
    knowledge
  • Use web as collective knowledge base

16
Anatomy of Search Engine Results
[Screenshot: search engine results page, annotated
with number of results, ranked list, title,
snippet, URL, and linked web page]
Programmatically accessible through APIs
17
Derivable Features
  • Counts
  • Co-occurrence measure computed from count(q1),
    count(q2) and count(q1 and q2)
  • Hyperlinkage
  • Count of web pages of q1 that point to pages of
    q2, and vice versa
  • Incorporate additional indirect links with less
    weight (e.g., q1 → p → q2)
  • Snippets or web pages
  • (Cosine) similarity using tokens
  • Counts of specific terms
  • e.g. number of snippets for q1 containing the
    string q2
  • Further natural language processing

18
Web page features
  • Named entities (NE)
  • We consider people, organizations, locations
  • Each NE token is a feature
  • NE-targeted (NE-T)
  • Motivation: middle names and titles
  • For NEs having a token of the target name
  • Extract tokens that are not in the target name
    as features

Example text: Born Edward Charles Morrice Fox in
Chelsea, London
NE features: Charles, Chelsea, Morrice, Edward, Fox,
London, ...
NE-T features: Charles, Morrice, ...
19
Using URLs
  • Where web pages are located is also useful
  • Hypothesis: if web pages of q1 and web pages of
    q2 overlap a lot, q1 and q2 are the same entity
  • Measure this using URL / host information
  • Caveat: not all hosts are equally telling
  • citeseer vs. harvard.edu for author names
  • pubmed vs. diabetes-info.com for diabetic terms
  • Solution: weight by Inverse Host Frequency

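The slide does not define Inverse Host Frequency. The sketch below assumes an IDF-style definition, log(N / number of queries whose results include the host), and a weighted overlap ratio; both formulas, the function names, and the example URLs are illustrative assumptions.

```python
import math
from urllib.parse import urlparse


def hosts(urls):
    """Set of hostnames appearing in a list of result URLs."""
    return {urlparse(u).hostname for u in urls}


def inverse_host_frequency(results_by_query):
    """Assumed IDF-style weight: log(N / number of queries returning the host)."""
    n = len(results_by_query)
    df = {}
    for urls in results_by_query.values():
        for h in hosts(urls):
            df[h] = df.get(h, 0) + 1
    return {h: math.log(n / c) for h, c in df.items()}


def weighted_host_overlap(urls1, urls2, ihf):
    """Weight of shared hosts divided by weight of all hosts (0.0 if no weight)."""
    h1, h2 = hosts(urls1), hosts(urls2)
    shared = sum(ihf.get(h, 0.0) for h in h1 & h2)
    total = sum(ihf.get(h, 0.0) for h in h1 | h2)
    return shared / total if total else 0.0


# Hypothetical search results for three queries.
results = {
    "q1": ["http://aggregator.example/a", "http://dept.example.edu/x"],
    "q2": ["http://aggregator.example/b", "http://dept.example.edu/y"],
    "q3": ["http://aggregator.example/c", "http://other.example/z"],
}
ihf = inverse_host_frequency(results)
print(weighted_host_overlap(results["q1"], results["q2"], ihf))
print(weighted_host_overlap(results["q1"], results["q3"], ihf))
```

A host returned for every query (the aggregator here) gets weight zero, so sharing it says nothing about whether two queries name the same entity.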
20
URL Features (cont.)
  • Page URLs
  • Hypothesis: the URL itself tells quite a lot
  • Home page of lindek
  • CS department, University of Alberta, Canada
  • MeURLin (Kan and Nguyen Thi, 2005)
  • Tokens (http, www, cs, ualberta, ca, lindek)
  • URI parts (scheme:http, hostname:cs, user:lindek,
    ...)
  • N-grams (ca ualberta, ualberta cs, cs www, www
    lindek)
  • Length of tokens

http://www.cs.ualberta.ca/~lindek/
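A simplified sketch of URL-derived features: tokens, adjacent-token bigrams, and token lengths. It does not reproduce MeURLin's exact feature set (for instance, the slide orders hostname n-grams differently); the feature-name prefixes are invented for illustration.

```python
import re


def url_tokens(url):
    """Split a URL into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", url.lower())


def token_bigrams(tokens):
    """Adjacent token pairs, in the order they appear in the URL."""
    return list(zip(tokens, tokens[1:]))


def url_features(url):
    """Bag of features: tokens, bigrams, and token lengths."""
    tokens = url_tokens(url)
    feats = {f"tok:{t}" for t in tokens}
    feats |= {f"bi:{a}_{b}" for a, b in token_bigrams(tokens)}
    feats |= {f"len:{len(t)}" for t in tokens}
    return feats


print(url_tokens("http://www.cs.ualberta.ca/~lindek/"))
print(sorted(url_features("http://www.cs.ualberta.ca/~lindek/")))
```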
21
Web search engine linkage
  • Test whether q1 and q2 should be linked
  • Hypothesis: web pages of q1 and web pages of q2
    share some representative data I
  • Similar to disconnected triples
  • Jeffrey D. Ullman: 384K pgs
  • Jeffrey D. Ullman aho: 174K pgs
  • J. Ullman: 124K pgs
  • J. Ullman aho: 41K pgs
  • Shimon Ullman: 27.3K pgs
  • Shimon Ullman aho: 66 pgs
q1
q2
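As a worked example with the page counts above, one can compute the fraction of each name's pages that also mention the representative term "aho". The ratio is just one simple statistic chosen for illustration, not necessarily the measure used in the talk.

```python
# (pages for the name, pages for the name plus "aho"), in thousands,
# copied from the slide.
page_counts = {
    "Jeffrey D. Ullman": (384.0, 174.0),
    "J. Ullman": (124.0, 41.0),
    "Shimon Ullman": (27.3, 0.066),
}


def context_ratio(name_count, name_and_context_count):
    """Fraction of a name's pages that also contain the context term."""
    return name_and_context_count / name_count if name_count else 0.0


ratios = {name: context_ratio(*counts) for name, counts in page_counts.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

The ratios come out at roughly 0.45, 0.33 and 0.002, so "J. Ullman" behaves much more like "Jeffrey D. Ullman" than like "Shimon Ullman" with respect to this shared context.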
22
Evaluation - Full web pages in WEPS
  • Goal
  • To compare the usefulness of various features for
    the Web People Search Task
  • Architecture

Input web pages → Feature vectors → Cosine
similarity → Single link hierarchical
agglomerative clustering (minimum similarity
threshold) → Clusters
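The pipeline can be sketched as follows. Single-link agglomerative clustering stopped at a minimum similarity threshold yields the connected components of the graph that links every pair at or above the threshold, so the sketch uses union-find instead of an explicit merge loop. The bag-of-words vectors and toy pages are assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def single_link_clusters(vectors, threshold=0.2):
    """Group indices whose vectors are chained by pairs with sim >= threshold."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)   # merge the two clusters

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Toy "web pages" for an ambiguous surname (hypothetical text).
pages = [
    "database systems stanford compilers",
    "compilers automata stanford",
    "vision neuroscience weizmann",
    "vision cortex weizmann",
]
vectors = [Counter(p.split()) for p in pages]
print(single_link_clusters(vectors, threshold=0.2))
```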
23
Evaluation
  • F(α = 0.5) and similarity threshold = 0.2

24
Evaluation - Author Disambiguation
  • Dataset
  • Manually-disambiguated dataset of 24 ambiguous
    names in computer science domain
  • Each ambiguous name represented 2 unique authors
    (k = 2) except for one where it represented 3
  • Each name is attributed to 30 citations on
    average
  • Proportion of largest class ranges from 50% to
    97%
  • Search engine
  • Google (http://www.google.com/)

25
Evaluation
  • Single link performs best
  • Good for clustering citations from different
    publication pages together (some pages list only
    selected publications)
  • Some authors have disparate research areas, not
    well represented by a centroid vector
  • Resolving hostnames to IP addresses gives the
    best accuracy

Classification accuracy averaged over all names
26
Discussion
Per-name accuracies using single link
Per-name average number of URLs returned per
citation
27
Discussion
  • Apparent correlation between accuracy and average
    number of URLs returned per citation
  • Author names with few URLs tend to fare poorly
    since results are mainly aggregator web sites
  • What's the cost?
  • Lots of queries needed
  • Web page downloads are expensive
  • Hence, slow
  • Can we speed this up?
  • Sure thing

28
Query probing
  • Consider some publication venues
  • Joint Conference on Digital Libraries
  • European Conference on Digital Libraries
  • Digital Libraries
  • Query probing
  • Use the common n-gram "digital libraries" as a
    query probe
  • If we can obtain information on all three
    conferences, we save two queries

29
Adaptive querying
  • Combine two methods when needed
  • Methods
  • Ms: stronger method but very slow (e.g. web page
    similarity)
  • Mw: weaker method but fast (e.g. host overlap)
  • Aim
  • Accuracy close to Ms
  • Significantly lower running time than Ms
  • Algorithm
  • Execute Mw
  • If heuristic suggests that Mw results are likely
    incorrect
  • Execute Ms

30
Entity Linkage - Conclusion
  • Important problem with a rich history
  • New external methods poll contextual evidence
    for judgment
  • Need to combine methods to obtain best aspect of
    each

31
Talk Outline
  • Linkage using the Web
  • >> Graph-based Update Summarization
  • Introduction
  • Timestamped Graphs
  • Evaluation and Conclusions

Now that all this data is linked, how do we
process it?
32
Applications of Summarization
Doing Less Work
Decision Support
33
More seriously: an exciting challenge ...
  • ...put a book on the scanner, turn the dial to 2
    pages, and read the result...
  • ...download 1000 documents from the web, send
    them to the summarizer, and select the best ones
    by reading the summaries of the clusters...
  • ...forward the Japanese email to the summarizer,
    select 1 par, and skim the translated summary.
  • get a weekly digest of new treatments and
    therapies for pressure ulcers

An update task
34
Simplifying summarization
  • Select important sentences verbatim from the
    input text to form a summary
  • Input: A text document with k sentences
  • Output: Top n (n << k) sentences with the
    highest numeric scores (each sentence in the
    input document is assigned a numeric score)

Extractive Summarization
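A minimal sketch of this input/output contract: score every sentence, keep the top n, and emit them in document order. The scoring function (average term frequency of a sentence's words) is a placeholder assumption; any of the heuristics on the next slide could be substituted.

```python
import re
from collections import Counter


def extractive_summary(sentences, n):
    """Return the n highest-scoring sentences, verbatim, in original order."""
    words = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    tf = Counter(w for ws in words for w in ws)
    # Placeholder score: mean document-level frequency of the sentence's words.
    scores = [sum(tf[w] for w in ws) / len(ws) if ws else 0.0 for ws in words]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n]
    return [sentences[i] for i in sorted(top)]


sentences = [
    "SBP2 binds selenoprotein mRNAs.",
    "The weather was pleasant.",
    "SBP2 binds mRNAs in vitro and in vivo.",
]
print(extractive_summary(sentences, 2))
```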
35
Summarization
  • Heuristics for extractive summarization
  • Cue/stigma phrases
  • Sentence position (relative to document,
    section, paragraph)
  • Sentence length
  • TFIDF, TF scores
  • Similarity (with title, context, query)
  • Machine learning to tune weights by supervised
    learning
  • Recently, graphical representations of text have
    shed new light on the summarization problem

36
Revisiting Social Networks: Prestige
  • One motivation was to model the problem as
    finding prestige of nodes in a social network
  • PageRank: random walk
  • In summarization, this led to TextRank and LexRank
  • Did we leave anything out of our representation
    for summarization?
  • Yes, the notion of an evolving network

37
Social networks change!
  • Natural evolving networks (Dorogovtsev and
    Mendes, 2001)
  • Citation networks: new papers can cite old ones,
    but the old network is static
  • The Web: new pages are added with an old page
    connecting them to the web graph; old pages may
    update links

38
Talk Outline
  • Linkage using the Web
  • Graph-based Update Summarization
  • Introduction
  • >> Timestamped Graphs
  • Evaluation and Conclusion

39
Evolutionary models for summarization
  • Writers and readers often follow conventional
    rhetorical styles - articles are not written or
    read in an arbitrary way
  • Consider the evolution of texts using a very
    simplistic model
  • Writers write from the first sentence onwards in
    a text
  • Readers read from the first sentence onwards of
    a text
  • A simple model: sentences get added incrementally
    to the graph

40
Timestamped Graph Construction
  • These assumptions suggest iteratively adding
    sentences to the graph in chronological order.
  • At each iteration, consider which edges to add
    to the graph.
  • For a single document: simple and
    straightforward. Add the 1st sentence, followed
    by the 2nd, and so forth, until the last
    sentence is added
  • For multi-document: treat it as multiple
    instances of single documents, which evolve in
    parallel, i.e., add the 1st sentences of all
    documents, followed by all 2nd sentences, and so
    forth
  • NB: Doesn't really model chronological ordering
    between articles; fix later

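One possible reading of this construction, as a sketch: at timestep t the t-th sentence of every document is added, then every node in the graph adds e directed edges to its most similar nodes that it is not yet connected to. The edge-selection rule, the use of Jaccard similarity, and all names are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity over lowercased whitespace tokens."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def build_tsg(docs, e=1):
    """docs: list of documents, each a list of sentences.

    Returns (nodes, edges); a node is (doc_index, sentence_index) and an
    edge is a directed pair (u, v).
    """
    nodes, edges, text = [], set(), {}
    steps = max((len(d) for d in docs), default=0)
    for step in range(steps):
        # Add the step-th sentence of every document that has one.
        for di, doc in enumerate(docs):
            if step < len(doc):
                nodes.append((di, step))
                text[(di, step)] = doc[step]
        # Every node selects up to e new targets by similarity.
        for u in nodes:
            candidates = [v for v in nodes if v != u and (u, v) not in edges]
            candidates.sort(key=lambda v: jaccard(text[u], text[v]), reverse=True)
            for v in candidates[:e]:
                edges.add((u, v))
    return nodes, edges


docs = [["a b c", "a b d"], ["a b e"]]
nodes, edges = build_tsg(docs, e=1)
print(nodes)
print(sorted(edges))
```

Because older sentences get an edge-adding turn at every timestep, earlier sentences accumulate more outgoing edges than later ones, which is the evolutionary bias the model is after.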
41
Timestamped Graph Construction
  • Model
  • Documents as columns
  • d_i = document i
  • Sentences as rows
  • s_j = jth sentence of a document

42
Timestamped Graph Construction
  • A multi document example

[Figure: grid with documents doc1, doc2, doc3 as
columns and sentences sent1, sent2, sent3 as rows]
43
  • An example TSG: DUC 2007 D0703A-A

44
Timestamped Graph Construction
  • This is just one instance of a TSG
  • Let's generalize and formalize them
  • Def: A timestamped graph algorithm tsg(M) is a
    9-tuple (d, e, u, f, σ, t, i, s, τ) that
    specifies a resulting algorithm that takes as
    input the set of texts M and outputs a graph G

Properties of edges: (d, e, u, f)
Properties of nodes: (σ, t, i, s)
Input text transformation function: τ
45
Edge properties (d, e, u, f)
  • Edge Direction (d)
  • Forward, backward, or undirected
  • Edge Number (e)
  • number of edges to instantiate per timestep
  • Edge Weight (u)
  • weighted or unweighted edges
  • Inter-document factor (f)
  • penalty factor for links between documents in
    multi-document sets.

46
Node properties (σ, t, i, s)
  • Vertex selection function σ(u, G)
  • One strategy: among those nodes not yet
    connected to u in G, choose the one with the
    highest similarity to u
  • Similarity functions: Jaccard, cosine, concept
    links (Ye et al., 2005)
  • Text unit type (t)
  • Most extractive algorithms use sentences as
    elementary units
  • Node increment factor (i)
  • How many nodes get added at each timestep
  • Skew degree (s)
  • Models how nodes in multi-document graphs are
    added
  • Skew degree = how many iterations to wait before
    adding the 1st sentence of the next document
  • Skip for today

47
Timestamped Graph Construction
  • Representations
  • We can model a number of different algorithms
    using this 9-tuple formalism
  • (d, e, u, f, σ, t, i, s, τ)
  • The given toy example
  • (f, 1, 0, 1, max-cosine-based, sentence, 1, 0,
    null)
  • LexRank graphs
  • (u, N, 1, 1, cosine-based, sentence, Lmax, 0,
    null)
  • N = total number of sentences in the cluster;
    Lmax = the max document length
  • i.e., all sentences are added into the graph in
    one timestep, each connected to all others, and
    cosine scores are given to edge weights

48
System Overview
  • Sentence splitting
  • Detect and mark sentence boundaries
  • Annotate each sentence with the doc ID and the
    sentence number
  • E.g., XIE19980304.0061 = 4 March 1998 from Xinhua
    News; XIE19980304.0061-14 = the 14th sentence of
    this document
  • Graph construction
  • Construct TSG in this phase

49
System Overview
  • Sentence Ranking
  • Apply topic-sensitive random walk on the graph
    to redistribute the weights of the nodes
  • Sentence extraction
  • Extract the top-ranked sentences
  • Two different modified MMR re-rankers are used,
    depending on whether it is main or update task

50
Talk Outline
  • Linkage using the Web
  • Graph-based Update Summarization
  • Introduction
  • Timestamped Graphs
  • >> Evaluation and Conclusion

51
Evaluation
  • Dataset: DUC 2005, 2006 and 2007.
  • Evaluation tool: ROUGE, an n-gram based automatic
    evaluation
  • Each dataset contains 50 or 45 clusters, each
    cluster contains a query and 25 documents
  • Evaluate on some parameters
  • Do different e values affect the summarization
    process?
  • e = 2 works best for the DUC dataset
  • How do topic-sensitivity and edge weighting
    perform in running PageRank?
  • Applying both seems to have best effect
  • How does skewing the graph affect the
    information flow in the graph?
  • Skew of 1 works best, but need to try other
    possibilities

52
Holistic Evaluation in DUC 2007
  • Extractive-based TSG system
  • Used modified maximal marginal relevance for
    update tasks
  • Penalize links in previously read articles
  • Extension of inter-document factor (f)

Cluster 1
Cluster 2
Cluster 3
53
Evaluation Results
  • Main task: 10th of 32 systems
  • Update task: 3rd of 24 systems
  • Conclusion
  • TSG formalism better tailored to deal with
    update / incremental text tasks
  • New method that may be competitive with current
    approaches
  • Other top scoring systems may do sentence
    compression (abstractive), not just extraction

54
Graph-based Update Summary - Conclusion
  • Proposed a timestamped graph model for text
    understanding and summarization
  • Adds sentences in an incremental fashion
  • Future work
  • Freely skewed model
  • Empirical and theoretical properties of TSGs

55
Where do we go from here?
  • Organizing data around entities, events
  • How people deal with data anyways
  • Understand objects and their
    inter/intra-relationships
  • Automation requires domain-expertise within a
    generic framework
  • Collate all studies on SBP2 that report new
    findings in the last year.
  • Oh, I meant the PROTEIN SBP2, not the gene.
  • What other proteins does SBP2 bind to?
  • Tell me more about the contradiction from
    previous results.
  • Which Miller did the study on SBP2 in 2002?
  • Thank you!
  • http://wing.comp.nus.edu.sg/

56
Backup Slides Entity Linkage
  • 50 Minute talk total
  • 7 Apr 2008, 10-11 AM

57
Social network analysis
  • Connected triple
  • Random walk
  • Maximum flow
  • Clustering

[Figures: example graphs over nodes x1, x2, x3, and a
flow network with source s and sink t]
58
Scalability Issues
  • Pairwise comparisons
  • Requires O(n^2) time
  • Major bottleneck
  • Possible solutions
  • Blocking techniques
  • Avoiding pairwise comparisons altogether

Input: d1, d2, ..., dn
for i = 1 to n
  for j = (i + 1) to n
    compute sim(di, dj)
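Blocking, one of the possible solutions listed above, can be sketched as follows: records are grouped by a cheap key and only pairs within the same block are compared. The key used here (the lowercased first token) and the toy records are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records, block_key):
    """Yield only the record pairs that share a blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


def first_token(record):
    """Hypothetical blocking key: the lowercased first token."""
    return record.lower().split()[0]


records = [
    "Honda Jazz",
    "honda jazz hatchback",
    "Apple iPod Nano 4GB",
    "apple ipod nano",
]
pairs = list(candidate_pairs(records, first_token))
all_pairs = list(combinations(records, 2))
print(len(pairs), "of", len(all_pairs), "pairs compared")
```

The saving comes at a price: a true match whose records fall into different blocks (for example "4GB iPod nano 4GB" under this key) is never compared, so the choice of key trades recall for speed.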
59
Cost-utility Framework
[Figure: for each feature fi, its value is either
known or can be acquired; each acquisition has a
cost of acquiring fi and a utility of acquiring fi]
60
Record Matching
Header-reference pair (instance)
(1) Given information: TITLE_MIN_LEN, TITLE_MAX_LEN,
AUTHOR_MIN_LEN, AUTHOR_MAX_LEN, VENUE_MIN_LEN,
VENUE_MAX_LEN
(2) Information that can be acquired at a cost:
TITLE_SIM, AUTHOR_SIM, VENUE_SIM
Label: MATCH/MISMATCH?
Training data: Assume all feature-values and their
acquisition costs known
Testing data: Assume (1) known, but feature-values
and their acquisition costs in (2) unknown
Costs: Set to MIN_LEN MAX_LEN
61
Costs and Utilities
  • Costs
  • Trained 3 models (using M5), treat as regression
  • Utilities
  • Trained 2^3 = 8 classifiers (each to predict
    match/mismatch using only known feature-values)
  • For a test instance with a missing feature-value
    F
  • Get confidence of appropriate classifier without
    F
  • Get expected confidence of appropriate classifier
    with F
  • Utility is difference between the two confidence
    scores
  • Note
  • Similar to Saar-Tsechansky et al.

62
Results
[Two plots: without cleaning of header records, and
with manual cleaning of header records. x-axis in
both: increasing proportion of feature-values
acquired]
63
Selected Bibliography
  • General and surveys
  • Ivan P. Fellegi and Alan B. Sunter. A theory for
    record linkage. Journal of the American
    Statistical Association, 64(328):1183-1210,
    December 1969.
  • William E. Winkler and Yves Thibaudeau. An
    application of the Fellegi-Sunter Model of record
    linkage to the 1990 U.S. Decennial Census.
    Technical Report RR91/09, U.S. Bureau of the
    Census, 1991.
  • Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and
    Vassilios S. Verykios. Duplicate record
    detection: A survey. IEEE Transactions on
    Knowledge and Data Engineering (TKDE),
    19(1):1-16, January 2007.
  • William E. Winkler. Overview of record linkage
    and current research directions. Technical Report
    RRS2006/02, U.S. Bureau of the Census, February
    2006.
  • Mikhail Bilenko, Raymond J. Mooney, William W.
    Cohen, Pradeep Ravikumar, and Stephen E.
    Fienberg. Adaptive name matching in information
    integration. IEEE Intelligent Systems,
    18(5):16-23, January/February 2003.
  • Min-Yen Kan and Yee Fan Tan. Record Matching in
    Digital Library Metadata. To appear in
    Communications of the ACM (CACM).

64
Selected Bibliography
  • String matching
  • Robert A. Wagner and Michael J. Fischer. The
    string-to-string correction problem. Journal of
    the Association of Computing Machinery,
    21(1):168-173, January 1974.
  • Saul B. Needleman and Christian D. Wunsch.
    A general method applicable to the search for
    similarities in the amino acid sequence of two
    proteins. Journal of Molecular Biology,
    48(3):443-453, March 1970.
  • Temple F. Smith and Michael S. Waterman.
    Identification of common molecular subsequences.
    Journal of Molecular Biology, 147(1):195-197,
    March 1981.
  • Andrés Marzal and Enrique Vidal. Computation of
    normalized edit distance and applications. IEEE
    Transactions on Pattern Analysis and Machine
    Intelligence, 15(9):926-932, September 1993.
  • Alvaro E. Monge and Charles Elkan. The field
    matching problem: Algorithms and applications. In
    ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining, pages 267-270, August
    1996.
  • Jie Wei. Markov edit distance. IEEE Transactions
    on Pattern Analysis and Machine Intelligence,
    26(3):311-321, March 2004.
  • Mikhail Bilenko and Raymond J. Mooney. Adaptive
    duplicate detection using learnable string
    similarity measures. In ACM SIGKDD International
    Conference on Knowledge Discovery and Data
    Mining, pages 39-48, August 2003.
  • Andrew McCallum, Kedar Bellare, and Fernando
    Pereira. A Conditional Random Field For
    Discriminatively-Trained Finite-State String Edit
    Distance. In Conference on Uncertainty in
    Artificial Intelligence (UAI), July 2005.
  • William W. Cohen, Pradeep Ravikumar, and Stephen
    E. Fienberg. A comparison of string distance
    metrics for name-matching tasks. In Information
    Integration on the Web (IIWeb), pages 73-78,
    August 2003.
  • Ariel S. Schwartz and Marti A. Hearst. A simple
    algorithm for identifying abbreviation
    definitions in biomedical text. In Pacific
    Symposium on Biocomputing (PSB), pages 451-462,
    January 2003.
  • Youngja Park and Roy J. Byrd. Hybrid text mining
    for finding abbreviations and their definitions.
    In Conference on Empirical Methods in Natural
    Language Processing (EMNLP), pages 126-133, June
    2001.
  • Jeffrey T. Chang, Hinrich Schütze, and Russ B.
    Altman. Creating an online dictionary of
    abbreviations from MEDLINE. Journal of the
    American Medical Informatics Association,
    9(6):612-620, November/December 2002.
  • Hiroko Ao and Toshihisa Takagi. ALICE: An
    algorithm to extract abbreviations from MEDLINE.
    Journal of the American Medical Informatics
    Association, 12(5):576-586, September/October
    2005.

65
Selected Bibliography
  • Direct classification or clustering, and blocking
  • Hui Han, Hongyuan Zha, and C. Lee Giles. A
    model-based K-means algorithm for name
    disambiguation. In Workshop on Semantic Web
    Technologies for Searching and Retrieving
    Scientific Data, October 2003.
  • Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li,
    and Kostas Tsioutsiouliklis. Two supervised
    learning approaches for name disambiguation in
    author citations. In ACM/IEEE Joint Conference on
    Digital Libraries (JCDL), pages 296-305, June
    2004.
  • Hui Han, Wei Xu, Hongyuan Zha, and C. Lee Giles.
    A hierarchical naive bayes mixture model for name
    disambiguation in author citations. In ACM
    Symposium on Applied Computing (SAC), pages
    1065-1069, March 2005.
  • Hui Han, Hongyuan Zha, and C. Lee Giles. Name
    disambiguation in author citations using a K-way
    spectral clustering method. In ACM/IEEE Joint
    Conference on Digital Libraries (JCDL), pages
    334-343, June 2005.
  • Dongwon Lee, Byung-Won On, Jaewoo Kang, and
    Sanghyun Park. Effective and scalable solutions
    for mixed and split citation problems in digital
    libraries. In ACM SIGMOD Workshop on Information
    Quality in Information Systems (IQIS), pages
    69-76, June 2005.
  • Byung-Won On, Dongwon Lee, Jaewoo Kang, and
    Prasenjit Mitra. Comparative study of name
    disambiguation problem using a scalable
    blocking-based framework. In ACM/IEEE Joint
    Conference on Digital Libraries (JCDL), pages
    344-353, June 2005.
  • Andrew McCallum, Kamal Nigam, and Lyle Ungar.
    Efficient clustering of high-dimensional data
    sets with application to reference matching. In
    ACM SIGKDD International Conference on Knowledge
    Discovery and Data Mining, pages 169-178, August
    2000.
  • Matthew Michelson and Craig A. Knoblock. Learning
    blocking schemes for record linkage. In National
    Conference on Artificial Intelligence (AAAI),
    July 2006.
  • Mikhail Bilenko, Beena Kamath, and Raymond J.
    Mooney. Adaptive Blocking: Learning to Scale Up
    Record Linkage and Clustering. In IEEE
    International Conference on Data Mining (ICDM),
    December 2006.

66
Selected Bibliography
  • Graphical models
  • Jie Wei. Markov edit distance. IEEE Transactions
    on Pattern Analysis and Machine Intelligence,
    26(3):311-321, March 2004.
  • John Lafferty, Andrew McCallum, and Fernando
    Pereira. Conditional random fields: Probabilistic
    models for segmenting and labeling sequence data.
    In International Conference on Machine Learning
    (ICML), pages 282-289, June/July 2001.
  • Andrew McCallum and Ben Wellner. Object
    consolidation by graph partitioning with a
    conditionally-trained distance metric. In ACM
    SIGKDD Workshop on Data Cleaning, Record Linkage,
    and Object Consolidation, pages 19-24, August
    2003.
  • Ben Wellner, Andrew McCallum, Fuchun Peng, and
    Michael Hay. An integrated, conditional model of
    information extraction and coreference with
    application to citation matching. In Conference
    on Uncertainty in Artificial Intelligence (UAI),
    pages 593-601, July 2004.
  • Andrew McCallum, Kedar Bellare, and Fernando
    Pereira. A Conditional Random Field For
    Discriminatively-Trained Finite-State String Edit
    Distance. In Conference on Uncertainty in
    Artificial Intelligence (UAI), July 2005.
  • Xin Dong, Alon Halevy, and Jayant Madhavan.
    Reference reconciliation in complex information
    spaces. In ACM SIGMOD International Conference on
    Management of Data, pages 85-96, June 2005.
  • Indrajit Bhattacharya and Lise Getoor. A latent
    dirichlet model for unsupervised entity
    resolution. In SIAM International Conference on
    Data Mining, pages 47-58, April 2006.

67
Selected Bibliography
  • Social network analysis
  • H. A. Kautz, B. Selman, and M. A. Shah. The
    hidden web. AI Magazine, 18(2):27-36, 1997.
  • P. Mutschke. Mining networks and central entities
    in digital libraries. A graph theoretic approach
    applied to co-author networks. In Intelligent
    Data Analysis (IDA), pages 155-166, August 2003.
  • M. E. J. Newman. Who is the best connected
    scientist? A study of scientific coauthorship
    networks. In Complex Networks, pages 337-370,
    February 2004.
  • E. Otte and R. Rousseau. Social network analysis:
    a powerful strategy, also for the information
    sciences. Journal of Information Science, 28(6),
    December 2002.
  • T. Krichel and N. Bakkalbasi. A social network
    analysis of research collaboration in the
    economics community. In International Workshop on
    Webometrics, Informetrics and Scientometrics
    Seventh COLLNET Meeting, May 2006.
  • R. Rousseau and M. Thelwall. Escher staircases on
    the world wide web. First Monday, 9(6), June
    2004.
  • D. G. Feitelson. On identifying name equivalences
    in digital libraries. Information Research, 9(4),
    October 2004.
  • R. Bekkerman and A. McCallum. Disambiguating web
    appearances of people in a social network. In
    International conference on World Wide Web (WWW),
    pages 463-470, May 2005.
  • R. Holzer, B. Malin, and L. Sweeney. Email alias
    detection using social network analysis. In
    Workshop on Link Discovery: Issues, Approaches
    and Applications (LinkKDD), August 2005.
  • B. Malin, E. Airoldi, and K. M. Carley. A network
    analysis model for disambiguation of names in
    lists. Computational and Mathematical
    Organization Theory, 11(2):119-139, July 2005.
  • G. Flake, S. Lawrence, and C. L. Giles. Efficient
    identification of web communities. In ACM SIGKDD
    International Conference on Knowledge Discovery
    and Data Mining, pages 150-160, August 2000.
  • P. K. Reddy and M. Kitsuregawa. An approach to
    build a cyber-community hierarchy. In SIAM ICDM
    Workshop on Web Analysis, April 2002.
  • Patrick Reuther. Personal name matching: New test
    collections and a social network based approach.
    Technical Report Mathematics/Computer Science
    06-01, University of Trier, March 2006.
  • Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki,
    Keisuke Ishida, Takuichi Nishimura, Hideaki
    Takeda, Kôiti Hasida, and Mitsuru Ishizuka.
    POLYPHONET: an advanced social network extraction
    system from the web. In International conference
    on World Wide Web (WWW), pages 397-406, May 2006.

68
Selected Bibliography
  • Web-based methods
  • Jamie P. Callan, Margie E. Connell, and Aiqun Du.
    Automatic discovery of language models for text
    databases. In ACM SIGMOD International Conference
    on Management of Data, pages 479-490, June 1999.
  • Jamie P. Callan and Margie E. Connell.
    Query-based sampling of text databases. ACM
    Transactions on Information Systems (TOIS),
    19(2):97-130, April 2001.
  • Panagiotis G. Ipeirotis and Luis Gravano.
    Distributed search over the hidden-web:
    Hierarchical database sampling and selection. In
    International Conference on Very Large Databases
    (VLDB), pages 394-405, August 2002.
  • Luis Gravano, Panagiotis G. Ipeirotis, and Mehran
    Sahami. QProber: A system for automatic
    classification of hidden-web databases. ACM
    Transactions on Information Systems (TOIS),
    21(1):1-41, January 2003.
  • Aron Culotta, Ron Bekkerman, and Andrew McCallum.
    Extracting social networks and contact
    information from email and the web. In Conference
    on Email and Anti-Spam (CEAS), July 2004.
  • Philipp Cimiano, Siegfried Handschuh, and Steffen
    Staab. Towards the self-annotating web. In
    International conference on World Wide Web (WWW),
    pages 462-471, May 2004.
  • Philipp Cimiano, Günter Ladwig, and Steffen
    Staab. Gimme' the context: Context-driven
    automatic semantic annotation with C-PANKOW. In
    International conference on World Wide Web (WWW),
    pages 332-341, May 2005.
  • Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki,
    Keisuke Ishida, Takuichi Nishimura, Hideaki
    Takeda, Kôiti Hasida, and Mitsuru Ishizuka.
    POLYPHONET: an advanced social network extraction
    system from the web. In International conference
    on World Wide Web (WWW), pages 397-406, May 2006.
  • Yee Fan Tan, Min-Yen Kan, and Dongwon Lee. Search
    engine driven author disambiguation. In ACM/IEEE
    Joint Conference on Digital Libraries (JCDL),
    June 2006.
  • Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee, and
    Yi Zhang. Googled name linkage. 2007.
  • Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, and
    Dongwon Lee. Record Linkage of Short Forms to
    Long Forms: A Case Study of Publication Venues.
    2007.
  • Min-Yen Kan. Web page classification without the
    web page. In International conference on World
    Wide Web (WWW), pages 262-263, May 2004.
  • Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast
    webpage classification using url features. In
    International Conference on Information and
    Knowledge Management (CIKM), pages 325-326,
    October/November 2005.
  • Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay
    Jain, and Luis Gravano. To search or to crawl?
    Towards a query optimizer for text-centric tasks.
    In ACM SIGMOD International Conference on
    Management of Data, pages 265-276, June 2006.

69
Backup Slides - Summarization
  • 50 Minute talk total
  • 7 Apr 2008, 10-11 AM

70
A Summarization Machine
[Figure: a summarization machine. Inputs: DOC, MULTIDOCS, QUERY. Output settings: length (Headline, Very Brief, Brief, Long; 10, 50, 100), Extract vs. Abstract, Indicative vs. Informative, Generic vs. Query-oriented, Just the news vs. Background.]
Generate a summary given a text document
71
Summarization, defined
  • Definitions
  • Take a text document, extract content from it and
    present the most important content to the user in
    a condensed form and in a manner sensitive to the
    user's or application's needs
  • Summarization requires
  • understanding the meaning of a text document
  • generating a fluent text summary
  • Studies of human summarizers
  • Cremmins (65) and Endres-Niggemeyer (98) showed
    that professional summarizers used clues to pick
    summary content.

72
Skew Degree Examples
  • time(d1) < time(d2) < time(d3) < time(d4)

[Figure: documents d1 d2 d3 d4 laid out as timestamped graphs, shown skewed by 1 and skewed by 2]
  • Freely skewed: only add a new document when it
    would be linked by some node using vertex
    function s
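One way to read the skew degree is as a fixed delay between successive documents. The sketch below is only an illustration under that assumption (sentence j of document i enters the graph at timestep j + i * skew); the function name and the scheduling rule are hypothetical and are not taken from the slides.

```python
def sentence_timesteps(doc_lengths, skew):
    """Map (doc_index, sentence_index) to the timestep at which it is added.

    doc_lengths lists the number of sentences per document, in time order.
    Assumption (not from the slides): document i starts i * skew timesteps
    after document 0 and then adds one sentence per timestep, so skew = 0
    starts every document at the same timestep.
    """
    if skew < 0:
        raise ValueError("skew must be non-negative")
    schedule = {}
    for i, n_sentences in enumerate(doc_lengths):
        for j in range(n_sentences):
            schedule[(i, j)] = j + i * skew
    return schedule


if __name__ == "__main__":
    # Two documents of two sentences each, skewed by 1.
    print(sentence_timesteps([2, 2], skew=1))
```

Under this reading, a larger skew simply pushes later documents further back in time, which matches the idea of a growing delay interval.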
73
Input text transformation function (t)
  • Document Segmentation Function (t)
  • Problem observed in some clusters where some
    documents in a multi-document cluster are very
    long
  • Takes many timestamps to introduce all of the
    sentences, causing too many edges to be drawn
  • t(G) segments long documents into several sub
    docs
  • Solution is too hacked; hope to investigate
    more in current and future work

[Figure: document d5 segmented into sub-documents d5a and d5b]
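A minimal sketch of the segmentation idea, assuming a simple fixed cap on sentences per sub-document. The cap `max_len` and both function names are hypothetical; the slides do not say how the split points were actually chosen.

```python
def segment_document(sentences, max_len):
    """Split one document (a list of sentences) into consecutive chunks
    of at most max_len sentences each."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    return [sentences[i:i + max_len] for i in range(0, len(sentences), max_len)]


def segment_cluster(cluster, max_len):
    """Apply segment_document to every document in a cluster, returning
    the flat list of sub-documents in their original order."""
    sub_docs = []
    for doc in cluster:
        sub_docs.extend(segment_document(doc, max_len))
    return sub_docs


if __name__ == "__main__":
    d5 = ["s1", "s2", "s3", "s4", "s5"]
    # d5 becomes two sub-documents, analogous to d5a and d5b.
    print(segment_document(d5, max_len=3))
```

With shorter sub-documents, fewer timesteps are needed to introduce all sentences of a long document.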
74
Evaluation on number of edges (e)
  • Tried different e values
  • Optimal performance: e = 2
  • At e = 1, graph is too loosely connected, not
    suitable for PageRank → very low performance
  • At e = N, a LexRank system

[Figure: results for varying e, with e = 2 and N labelled]
75
Evaluation (other edge parameters)
  • PageRank: generic vs. topic-sensitive
  • Edge weight (u): unweighted vs. weighted
  • Optimal performance: topic-sensitive PageRank
    and weighted edges

Topic-sensitive   Weighted edges   ROUGE-1   ROUGE-2
No                No               0.39358   0.07690
Yes               No               0.39443   0.07838
No                Yes              0.39823   0.08072
Yes               Yes              0.39845   0.08282
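For readers unfamiliar with the metric, ROUGE-N is a recall measure over n-gram overlap between a system summary and a reference summary. The sketch below is a simplified single-reference version with whitespace tokens and no stemming or stopword handling; it is not the official ROUGE toolkit used to produce the numbers above, and the function names are illustrative.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, reference, n):
    """Clipped n-gram matches divided by the number of reference n-grams."""
    ref = ngrams(reference, n)
    cand = ngrams(candidate, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


if __name__ == "__main__":
    system = "the cat sat".split()
    gold = "the cat ran".split()
    print(rouge_n_recall(system, gold, 1), rouge_n_recall(system, gold, 2))
```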
76
Evaluation on skew degree (s)
  • Different skew degrees: s = 0, 1 and 2
  • Optimal performance: s = 1
  • s = 2 introduces a delay interval that is too
    large
  • Need to try freely skewed graphs

Skew degree   ROUGE-1   ROUGE-2
0             0.36982   0.07580
1             0.37268   0.07682
2             0.36998   0.07489
77
Describing Summaries
  • Aspects of summarization (Sparck-Jones 97,
    Hovy and Lin 99)
  • Input
  • Single-document vs. multi-document
  • Purpose
  • Situation: embedded in larger system (MT, IR) or
    not?
  • Generic vs. query-oriented: author's view or
    user's interest?
  • Indicative vs. informative: categorization or
    understanding?
  • Background vs. just-the-news: does user have
    prior knowledge?
  • Output
  • Extract vs. abstract: use text fragments or
    re-phrase content?

78
Differences for main and update task processing
  • Main task
  • Construct a TSG for input cluster
  • Run topic-sensitive PageRank on the TSG
  • Apply the first modified version of MMR to extract
    sentences
  • Update task
  • Cluster A
  • Construct a TSG for cluster A
  • Run topic-sensitive PageRank on the TSG
  • Apply the second modified version of MMR to
    extract sentences
  • Cluster B
  • Construct a TSG for clusters A and B
  • Run topic-sensitive PageRank on the TSG; only
    retain sentences from B
  • Apply the second modified version of MMR to
    extract sentences
  • Cluster C
  • Construct a TSG for clusters A, B and C
  • Run topic-sensitive PageRank on the TSG; only
    retain sentences from C
  • Apply the second modified version of MMR to
    extract sentences

79
Sentence Ranking
  • Once a timestamped graph is built, we want to
    compute a prestige score for each node
  • PageRank: use an iterative method that allows
    the weights of the nodes to redistribute until
    stability is reached
  • Similarities as edges → weighted edges; query →
    topic-sensitive

[Equation: PageRank update, annotated with a topic-sensitive (Q) portion and a standard random walk term]
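The two annotated parts of the ranking formula can be sketched as a weighted, topic-sensitive PageRank iteration. This is a generic formulation, not necessarily the exact equation on the slide: the damping value 0.85, the handling of nodes without out-edges, and all names are assumptions. Every node is expected to appear as a key of `query_relevance`.

```python
def topic_sensitive_pagerank(weights, query_relevance, damping=0.85,
                             tol=1e-10, max_iter=200):
    """weights[v][u] is the weight of edge v -> u (e.g. sentence similarity).
    query_relevance[u] >= 0 is the relevance of node u to the query."""
    nodes = list(query_relevance)
    total_q = sum(query_relevance.values())
    # Topic-sensitive portion: jump to nodes in proportion to query
    # relevance (uniform if the query matches nothing).
    if total_q > 0:
        teleport = {u: query_relevance[u] / total_q for u in nodes}
    else:
        teleport = {u: 1.0 / len(nodes) for u in nodes}
    out_weight = {v: sum(weights.get(v, {}).values()) for v in nodes}
    rank = {u: 1.0 / len(nodes) for u in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without out-edges is spread via the jump vector.
        dangling = sum(rank[v] for v in nodes if out_weight[v] == 0)
        new_rank = {}
        for u in nodes:
            # Standard random walk term: follow edges in proportion to weight.
            walk = sum(rank[v] * weights[v][u] / out_weight[v]
                       for v in nodes
                       if out_weight[v] > 0 and u in weights.get(v, {}))
            new_rank[u] = (damping * (walk + dangling * teleport[u])
                           + (1 - damping) * teleport[u])
        delta = sum(abs(new_rank[u] - rank[u]) for u in nodes)
        rank = new_rank
        if delta < tol:  # stability reached
            break
    return rank


if __name__ == "__main__":
    w = {"a": {"b": 1.0}, "b": {"a": 1.0, "c": 1.0}, "c": {"b": 1.0}}
    print(topic_sensitive_pagerank(w, {"a": 1.0, "b": 0.0, "c": 0.0}))
```

Setting every entry of `query_relevance` to the same value recovers generic PageRank, and setting every edge weight to 1 recovers the unweighted variant.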
80
Sentence Extraction Main task
  • Original MMR integrates a penalty of the
    maximal similarity of the candidate document and
    one selected document
  • Ye et al. (2005) introduced a modified MMR that
    integrates a penalty of the total similarity of
    the candidate sentence and all selected sentences
  • Score(s): PageRank score of s; S: selected
    sentences
  • This is used in the main task

[Equation annotation: the penalty is the similarity to all previously selected sentences]
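A minimal sketch of greedy extraction with a total-similarity penalty. The linear combination and the `trade_off` weight are assumptions made for illustration; the slide's exact formula is not recoverable from the transcript, and the function name is hypothetical.

```python
def select_sentences(scores, similarity, k, trade_off=0.7):
    """Greedily pick up to k sentences.

    scores: dict mapping sentence id to its PageRank score, Score(s).
    similarity: function (id, id) -> float.
    The penalty for a candidate is its total similarity to all sentences
    already in the selected list S (not the maximum, as in original MMR).
    """
    selected = []
    candidates = set(scores)
    while candidates and len(selected) < k:
        def mmr(s):
            penalty = sum(similarity(s, t) for t in selected)
            return trade_off * scores[s] - (1 - trade_off) * penalty
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    pairs = {frozenset(("s1", "s2")): 1.0}
    sim = lambda a, b: pairs.get(frozenset((a, b)), 0.0)
    print(select_sentences({"s1": 0.9, "s2": 0.85, "s3": 0.5}, sim, k=2,
                           trade_off=0.5))
```

In the usage example, s2 scores well but is a near-duplicate of s1, so the lower-scored s3 is chosen second.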
81
Sentence Extraction Update task
  • Update task assumes readers already read previous
    cluster(s)
  • implies we should not select sentences that have
    redundant information with previous cluster(s)
  • Propose a modified MMR for the update task
  • consider the total similarity of the candidate
    sentence with all selected sentences and
    sentences in previously-read cluster(s)
  • P contains some top-ranked sentences in previous
    cluster(s)

[Equation annotation: additional penalty for overlap with previous cluster(s)]
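The update-task variant can be sketched by adding a second penalty term over P, the top-ranked sentences from previously read clusters. As with the main-task version, the linear form and the `trade_off` weight are illustrative assumptions rather than the formula from the slide.

```python
def select_update_sentences(scores, similarity, previous, k, trade_off=0.7):
    """Greedily pick up to k sentences from the current cluster.

    scores: dict mapping sentence id to its PageRank score.
    similarity: function (id, id) -> float.
    previous: ids of top-ranked sentences from earlier clusters (the set P).
    The penalty sums similarity to already selected sentences and to the
    sentences in P, so content the reader has seen is discouraged.
    """
    selected = []
    candidates = set(scores)
    while candidates and len(selected) < k:
        def mmr(s):
            penalty = sum(similarity(s, t) for t in selected)
            penalty += sum(similarity(s, p) for p in previous)
            return trade_off * scores[s] - (1 - trade_off) * penalty
        # sorted() only makes tie-breaking deterministic.
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


if __name__ == "__main__":
    pairs = {frozenset(("b1", "a1")): 1.0}
    sim = lambda a, b: pairs.get(frozenset((a, b)), 0.0)
    print(select_update_sentences({"b1": 0.9, "b2": 0.6}, sim, ["a1"], k=1,
                                  trade_off=0.5))
```

Here b1 has the higher score but repeats a1 from the earlier cluster, so b2 is selected instead.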
82
References
  • Günes Erkan and Dragomir R. Radev. 2004.
    LexRank: Graph-based centrality as salience in
    text summarization. Journal of Artificial
    Intelligence Research, (22).
  • Rada Mihalcea and Paul Tarau. 2004. TextRank:
    Bringing order into texts. In Proceedings of
    EMNLP 2004.
  • S.N. Dorogovtsev and J.F.F. Mendes. 2001.
    Evolution of networks. Submitted to Advances in
    Physics on 6th March 2001.
  • Sergey Brin and Lawrence Page. 1998. The anatomy
    of a large-scale hypertextual Web search engine.
    Computer Networks and ISDN Systems, 30(1-7).
  • Jon M. Kleinberg. 1999. Authoritative sources in
    a hyperlinked environment. In Proceedings of
    ACM-SIAM Symposium on Discrete Algorithms, 1999.
  • Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen
    Kan. 2005. NUS at DUC 2005: Understanding
    documents via concepts links. In Proceedings of
    DUC 2005.