Near-Duplicate Detection for eRulemaking

1
Near-Duplicate Detection for eRulemaking
  • Grace Hui Yang, Jamie Callan
  • Language Technologies Institute
  • School of Computer Science
  • Carnegie Mellon University

  • Stuart Shulman
  • Library and Information Science
  • School of Information Sciences
  • University of Pittsburgh
2
Duplicates and Near-Duplicates
3
Duplicates and Near-Duplicates in eRulemaking
  • U.S. regulatory agencies must solicit, consider,
    and respond to public comments.
  • Some popular regulations attract hundreds of
    thousands of comments
  • Very labor-intensive to sort through manually

4
Duplicates and Near-Duplicates in eRulemaking
  • Special interest groups make form letters
    available for generating comments via email and
    the Web
  • Moveon.org, http://www.moveon.org
  • GetActive, http://www.getactive.org
  • Modifying a form letter is very easy

5
Form Letter
  • Insert screen shot of moveon.org, showing form
    letter and enter-your-comment-here

Individual Information
Personal Notes
6
Goal
  • Identify and organize duplicates for browsing.
  • Achieve highly effective near-duplicate detection
    by incorporating additional knowledge

7
What is a (Near-)Duplicate in eRulemaking?
(Text Documents)
8
Duplicate - Exact
9
Near Duplicate - Block Edit
10
Near Duplicate - Minor Change
11
Near Duplicate - Minor Change and Block Edit
12
Near Duplicate - Block Reordering
13
Near Duplicate - Key Block
14
How Can Near-Duplicates Be Detected?
15
Related Work
  • Duplicate Detection Using Fingerprints
    • Hashing functions: SHA1, Rabin
    • Fingerprint granularity (Shivakumar et al. 95; Hoad & Zobel 03)
    • Fingerprint size (Broder et al. 97)
    • Substring selection strategy
      • position-based (Brin et al. 95)
      • hash-value-based (Broder et al. 97)
      • anchor-based (Hoad & Zobel 03)
      • frequency-based (Chowdhury et al. 02)
  • Duplicate Detection Using Full-Text (Metzler et al. 05)
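
As an illustration of the fingerprint family of methods listed above, here is a minimal Python sketch. It assumes word-level shingles of length `k`, SHA-1 as the hash, and a simple "keep hashes divisible by `modulus`" rule as one possible hash-value-based selection strategy. The function names, parameter values, and the use of set overlap as the comparison are assumptions made for illustration, not details taken from the cited papers.

```python
import hashlib


def shingles(text, k=4):
    """Return the set of k-word substrings (shingles) of a text."""
    words = text.lower().split()
    if not words:
        return set()
    count = max(len(words) - k + 1, 1)
    return {" ".join(words[i:i + k]) for i in range(count)}


def fingerprint(text, k=4, modulus=1):
    """Hash every shingle with SHA-1 and keep hashes divisible by `modulus`.

    modulus=1 keeps every hash; larger values keep a smaller sample,
    which trades accuracy for a smaller fingerprint (an assumed setting).
    """
    selected = set()
    for shingle in shingles(text, k):
        digest = hashlib.sha1(shingle.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        if value % modulus == 0:
            selected.add(value)
    return selected


def resemblance(fp_a, fp_b):
    """Share of fingerprint hashes that two documents have in common."""
    if not fp_a and not fp_b:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)
```

Two documents whose resemblance exceeds a chosen cutoff would be reported as near-duplicates; the cutoff itself is a tuning choice.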

16
Our Detection Strategy
  • Group near-duplicates based on
    • Text similarity
    • Editing patterns
    • Metadata
  • Clustering!

17
Document Clustering
  • Put similar documents together
  • How is text similarity defined?
    • Similar vocabulary
    • Similar word frequencies
  • If two documents' similarity is above a threshold,
    put them into the same cluster
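
The threshold rule above can be sketched in Python as follows. This is only an illustration: it assumes cosine similarity over raw word counts as the text similarity measure, a single greedy pass that compares each document with the first member of each existing cluster, and a threshold of 0.8. None of these choices is stated in the slides.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: word -> frequency."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def threshold_cluster(docs, threshold=0.8):
    """Greedy single pass: join the first cluster whose representative
    (its first member) is similar enough, else start a new cluster."""
    vectors = [term_vector(d) for d in docs]
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Comparing against one representative per cluster keeps the pass cheap, at the cost of making the result depend on document order.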

18
(No Transcript)
19
(No Transcript)
20
Incorporating Pair-wise Constraints in Clustering
  • Key blocks are very common
  • Typical text similarity doesn't work
    • Different words, different frequencies

21
Incorporating Pair-wise Constraints in Clustering
  • Solution: incorporate pair-wise constraints in
    clustering
    • Editing patterns
    • Metadata
  • These provide hints to the clustering algorithm
    about how to group documents
  • Examples: must-link, cannot-link (Wagstaff &
    Cardie 2000), family-link

22
Must-links
  • Two documents must be in the same cluster
  • Created when
    • one document completely contains the other (key
      block), or
    • word overlap > 95% (minor change).
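
A minimal Python sketch of the two must-link tests follows. The slides do not define how containment or word overlap is computed, so this assumes whitespace-normalized, lowercased text, containment as a whole-word substring match, and overlap as shared vocabulary divided by the larger vocabulary. The function names are hypothetical.

```python
def normalize(text):
    """Lowercase and collapse whitespace so formatting differences are ignored."""
    return " ".join(text.lower().split())


def word_overlap(a, b):
    """Shared vocabulary relative to the larger of the two vocabularies
    (an assumed definition of word overlap)."""
    words_a = set(normalize(a).split())
    words_b = set(normalize(b).split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / max(len(words_a), len(words_b))


def contains(outer, inner):
    """True if `inner` appears in `outer` as a run of whole words."""
    return f" {normalize(inner)} " in f" {normalize(outer)} "


def is_must_link(a, b, overlap_threshold=0.95):
    """Must-link if one text contains the other (key block)
    or word overlap exceeds the threshold (minor change)."""
    if not normalize(a) or not normalize(b):
        return False
    if contains(a, b) or contains(b, a):
        return True
    return word_overlap(a, b) > overlap_threshold
```

With this sketch, a comment that wraps a form letter in personal notes is linked to the form letter through the containment test even when its overall word overlap is low.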

23
Cannot-links
  • Two documents cannot be in the same cluster
  • Created when two documents
    • cite different docket identification numbers
  • People submitted comments to the wrong places
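
The docket test can be sketched in Python as below. The regular expression is a hypothetical pattern written to match the two docket identifiers that appear in this presentation (USEPA-OAR-2002-0056 and USDOT-2003-16128); real docket formats may differ. Requiring that both documents cite at least one docket is also an assumption.

```python
import re

# Hypothetical pattern covering the two docket styles shown in the slides.
DOCKET_PATTERN = re.compile(r"\bUS[A-Z]+(?:-[A-Z]+)?-\d{4}-\d+\b")


def docket_ids(text):
    """Return the set of docket identifiers cited in a comment."""
    return set(DOCKET_PATTERN.findall(text))


def is_cannot_link(a, b):
    """Cannot-link if both comments cite dockets and share none of them."""
    ids_a, ids_b = docket_ids(a), docket_ids(b)
    if not ids_a or not ids_b:
        return False  # no evidence either way
    return ids_a.isdisjoint(ids_b)
```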

24
Family-links
  • Two documents are likely to be in the same
    cluster
  • Created when two documents have
    • the same email relayer,
    • the same docket identification number,
    • similar file sizes, or
    • the same footer block.
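
The four family-link conditions could be checked as in the Python sketch below. The `Comment` record and its field names are hypothetical, and the 10% tolerance for "similar file sizes" is an assumed value; the slides give neither.

```python
from dataclasses import dataclass


@dataclass
class Comment:
    """Hypothetical metadata record for one public comment."""
    relayer: str      # email relayer the comment came through
    docket_id: str    # docket identification number cited
    size_bytes: int   # file size
    footer: str       # trailing footer block, if any


def is_family_link(a, b, size_tolerance=0.10):
    """Family-link if any one of the four metadata conditions holds."""
    same_relayer = bool(a.relayer) and a.relayer == b.relayer
    same_docket = bool(a.docket_id) and a.docket_id == b.docket_id
    larger = max(a.size_bytes, b.size_bytes)
    similar_size = (
        larger > 0
        and abs(a.size_bytes - b.size_bytes) / larger <= size_tolerance
    )
    same_footer = bool(a.footer.strip()) and a.footer.strip() == b.footer.strip()
    return same_relayer or same_docket or similar_size or same_footer
```

Empty fields are treated as missing rather than as matching values, so two comments without a footer are not linked on that basis.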

25
How to Incorporate Pair-wise Constraints?
  • When forming clusters,
    • if two documents have a must-link, they must be
      put into the same group, even if their text
      similarity is low
    • if two documents have a cannot-link, they cannot
      be put into the same group, even if their text
      similarity is high
    • if two documents have a family-link, increase
      their text similarity score, so that their chance
      of being in the same group will be higher than
      before.
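
The three rules above can be combined in a small Python sketch. This is a greedy single-pass illustration, not the algorithm used in the presented system: it assumes a precomputed similarity matrix `sim`, a threshold of 0.8, and a fixed additive `boost` of 0.2 for family-links, all of which are assumed values. Because it is greedy, it can fail to honor a must-link when a cannot-link blocks the only cluster that contains the linked document.

```python
def constrained_cluster(sim, must=(), cannot=(), family=(),
                        threshold=0.8, boost=0.2):
    """Greedy clustering over a similarity matrix with pair-wise constraints.

    sim[i][j] is the text similarity of documents i and j.
    must, cannot, family are iterables of (i, j) index pairs.
    """
    must = {frozenset(p) for p in must}
    cannot = {frozenset(p) for p in cannot}
    family = {frozenset(p) for p in family}

    def score(i, j):
        pair = frozenset((i, j))
        if pair in must:
            return 1.0                      # force the pair together
        s = sim[i][j]
        if pair in family:
            s = min(1.0, s + boost)         # raise the chance of grouping
        return s

    clusters = []
    for i in range(len(sim)):
        best, best_score = None, 0.0
        for cluster in clusters:
            # A cannot-link with any member rules the cluster out.
            if any(frozenset((i, j)) in cannot for j in cluster):
                continue
            top = max(score(i, j) for j in cluster)
            if top >= threshold and top > best_score:
                best, best_score = cluster, top
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters
```

In this sketch a must-link overrides low similarity, a cannot-link overrides high similarity, and a family-link only shifts a borderline pair over the threshold.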

26
Evaluation
27
Evaluation Methodology
  • We created three 1,000-email subsets
    • Two from the EPA's Mercury dataset
      • docket (USEPA-OAR-2002-0056)
    • One from the DOT SUV dataset
      • docket (USDOT-2003-16128)
  • Assessors manually organized documents into
    near-duplicate clusters
  • Compare human-human agreement to human-computer
    agreement

28
Experimental Setup
  • Sample Name: NTF
  • # of Docs: 1000
  • # of Docs (duplicates removed): 275
  • # of Known form letters: 28
  • # of Assessors: 2
  • Assessor 1: UCSUR13
  • Assessor 2: UCSUR16

29
Experimental Setup
  • Sample Name: NTF2
  • # of Docs: 1000
  • # of Docs (duplicates removed): 270
  • # of Known form letters: 26
  • # of Assessors: 2
  • Assessor 1: UCSUR8
  • Assessor 2: UCSUR9

30
Experimental Setup
  • Sample Name: DOT
  • # of Docs: 1000
  • # of Docs (duplicates removed): 270
  • # of Known form letters: 4
  • # of Assessors: 2
  • Assessor 1: SUPER (Stuart)
  • Assessor 2: G (Grace)

31
Experimental Results
- Comparing human-DURIAN (DUplicate Removal In
lArge collectioN) intercoder agreement with
human-human intercoder agreement (measured in AC1)
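
For readers unfamiliar with AC1, the Python sketch below computes Gwet's AC1 agreement coefficient for two coders. How the clusterings were turned into ratings is not stated in the slides; this sketch assumes one rating per document pair (1 if a coder put the two documents in the same cluster, 0 otherwise). The function names are hypothetical.

```python
from itertools import combinations


def pair_labels(assignment):
    """assignment[d] is the cluster id a coder gave document d.
    Returns one 0/1 label per document pair (1 = same cluster)."""
    indices = range(len(assignment))
    return [int(assignment[i] == assignment[j])
            for i, j in combinations(indices, 2)]


def gwet_ac1(ratings_a, ratings_b):
    """Gwet's AC1 for two raters over the same items.

    AC1 = (pa - pe) / (1 - pe), where pa is observed agreement and
    pe = sum_q pi_q * (1 - pi_q) / (Q - 1), with pi_q the average share
    of ratings in category q across both raters.
    """
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    if len(categories) < 2:
        return 1.0  # both raters used one and the same category throughout
    pa = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n
    pe = 0.0
    for c in categories:
        pi = (ratings_a.count(c) + ratings_b.count(c)) / (2 * n)
        pe += pi * (1 - pi)
    pe /= len(categories) - 1
    return (pa - pe) / (1 - pe)
```

Under this assumption, human-human agreement is `gwet_ac1(pair_labels(assessor_1), pair_labels(assessor_2))`, and human-computer agreement replaces one assessor with the system's cluster assignment.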
32
Experimental Results
- Comparing with other duplicate detection
Algorithms (measured in F1)
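
F1 is the harmonic mean of precision and recall. The slides do not say how it was computed for clusterings, so the Python sketch below assumes a pairwise version: a document pair counts as a positive when both documents share a cluster, and the system's pairs are scored against a reference clustering.

```python
from itertools import combinations


def pairwise_f1(reference, predicted):
    """Pairwise F1 between two clusterings of the same documents.

    reference[d] and predicted[d] are the cluster ids of document d.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(reference)), 2):
        same_ref = reference[i] == reference[j]
        same_pred = predicted[i] == predicted[j]
        if same_pred and same_ref:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_ref:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```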
33
Impact of Pair-wise Constraints
  • Number of Constraints vs. F1.

34
Impact of Pair-wise Constraints
  • Number of Constraints vs. F1.

35
(No Transcript)
36
(No Transcript)
37
Conclusion
  • Near-duplicate detection on large public comment
    datasets is practical
    • Full text analysis and clustering
    • Use of additional knowledge
      • Introducing pair-wise constraints
  • Highly accurate
  • Efficient
  • Easily applied to other datasets

38
Please come to our demo (poster site B1)
Questions?