Title: L. Padmasree
1Signature Based Duplicate Detection in Digital
Libraries
- L. Padmasree
- Vamshi Ambati
- J. Anand Chandulal
- M. Sreenivasa Rao
School of Information Technology, JNT University,
Hyderabad, 500 072, India. srmeda_at_gmail.com
2Motivation
- Books scanned in Digital Libraries are procured from varied sources.
- Scanning centers are distributed across the country.
- Duplicates could arise between scanning points.
- Pre-scanning duplicate detection is required.
3Challenges
- Duplicate detection relies on metadata (title, author, publishing year, edition, etc.)
- Metadata is entered by varied operators, so there is scope for
  - Incorrectness
  - Incompleteness
- Errors could be
  - Typographical mistakes
  - Word disorder
  - Inconsistent abbreviations
  - Missing words
- These errors make duplicate detection more difficult.
- Duplicate detection must combine quick turnaround time with accuracy.
4RELATED WORK
- Most traditional methods based on string similarity are
  - character-based techniques
  - vector space based techniques
- Character-based techniques
  - rely on character edit operations, such as deletions, insertions, substitutions and subsequence comparison.
- Vector space based techniques
  - transform strings into vector representations on which similarity computations are conducted.
- In the present work we use an efficient and fast duplicate detection technique based on similarity search.
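For contrast with the signature approach, the character-based family mentioned above can be illustrated with the classic Levenshtein edit distance. This is a minimal sketch of that general technique, not code from the presented system; the function name `levenshtein` is chosen here for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# Two missing letters ("t" and "n") give an edit distance of 2.
print(levenshtein("The Ars of Japa", "The Arts of Japan"))  # 2
```

Each comparison costs time proportional to the product of the two string lengths, and a query has to be compared against every stored record, which is one reason a fixed-width signature comparison can be attractive for large repositories.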
5Our Approach
- Uses Signature file method
- Uses Similarity search techniques to find duplicates with close proximity match
- Language independent
- Fast and Accurate
- Uses Online Tool to customize
6The Process
- Metadata is created at scanning centers
- Signature is computed for the metadata
  - Uses the superimposed coding technique and the hashing method
- Signature is stored in the central repository
- Before a book is scanned, its metadata is submitted as a query
  - The same technique is used to compute the query signature
- Similarity search returns the closest-matching duplicate
7Duplicate Detection in Digital Library system
Duplicate Detection Technique
8Example of the process
Books Data (Central Repository)
Metadata of Books and their Signatures:
- The Meaning And Teaching Of Music - Will Earhart: 011111110000101111100011111011
- Some Famous Singers Of The 19th Century - Francis Rogers: 111001010000001001111110110110
- A Dictionary of Musical Terms - Dr.th.baker: 111100101000110100000111111111
- The Arts of Japan - Edward Dillon: 111101100000000000000011001111
Example Query: The Arts of Japan - Edward Dillon
Query variants and their signatures:
- Spell Mistakes: "The Ars of Japa Edward Dilon": 111101100001110000000011001111
- Missing Words: "The of Japan - Edward Dillon": 111101100000011000000011001111
- Jumbled Words: "Dillon Edward -The Japan of Arts": 111101100000100000000011001111
Result: The Arts of Japan - Edward Dillon
9Superimposed Coding Technique
- In the Superimposed Coding Technique each record is mapped into an individual binary signature.
- A record is either the title or the author name of the book, or the combination of both.
- Signatures of the records in the training data and testing data are encoded binary representations.
- The signature of the 'title or author name' of the book is obtained by superimposing the signatures of the words with the OR operation.
Example:
- Computer: 1100 1000 0100
- Programming: 0001 0101 0100
- Signature of the book (OR of the two): 1101 1101 0100
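The OR step can be sketched in a few lines. The word signatures below read the slide's example as Computer = 1100 1000 0100 and Programming = 0001 0101 0100; this is an assumption about how the flattened table lines up, chosen because it is the reading under which the OR gives the stated book signature 1101 1101 0100. The function name `superimpose` is hypothetical.

```python
def superimpose(word_signatures):
    """Combine word signatures into one record signature with bitwise OR."""
    record_signature = 0
    for signature in word_signatures:
        record_signature |= signature
    return record_signature


computer = int("110010000100", 2)      # assumed reading of the slide's table
programming = int("000101010100", 2)   # assumed reading of the slide's table

book = superimpose([computer, programming])
print(format(book, "012b"))  # 110111010100
```

Because OR is commutative, the record signature does not depend on word order, which is why a query with jumbled words can still land close to the stored record.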
10The Hashing method
- The signature of each word is obtained by a hashing method.
- The hashing function H(w) maps the word w into one of the patterns generated by computing a hash value of the word.
- The hash function uses a shift-and-add strategy.
- The ASCII values of the characters in the word are added and shifted in order to compute the hash value H(w).
- The final hash value is obtained by a mod operation with nCr.
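One possible reading of this slide is sketched below. The slides do not give the shift amount, the enumeration of patterns, or the values of n and r, so the following are all assumptions made for illustration: a one-bit left shift per character, a 32-bit mask, patterns taken to be the n-bit words with exactly r bits set (which is what reducing modulo nCr suggests), lexicographic ordering of those patterns, and the example values n = 12, r = 3. The function names are hypothetical.

```python
from math import comb


def shift_add_hash(word: str) -> int:
    """Shift-and-add hash over character codes (assumed: 1-bit shift, 32-bit mask)."""
    h = 0
    for ch in word:
        h = ((h << 1) + ord(ch)) & 0xFFFFFFFF
    return h


def unrank_combination(index: int, n: int, r: int) -> list:
    """Return the index-th r-subset of range(n) in lexicographic order."""
    positions = []
    start = 0
    for remaining in range(r, 0, -1):
        for pos in range(start, n):
            # Number of subsets whose next element is exactly `pos`.
            count = comb(n - pos - 1, remaining - 1)
            if index < count:
                positions.append(pos)
                start = pos + 1
                break
            index -= count
    return positions


def word_signature(word: str, n: int, r: int) -> int:
    """Map a word to an n-bit pattern with exactly r bits set."""
    index = shift_add_hash(word) % comb(n, r)   # final value: hash mod nCr
    signature = 0
    for pos in unrank_combination(index, n, r):
        signature |= 1 << (n - 1 - pos)         # position 0 is the leftmost bit
    return signature


print(format(word_signature("Computer", 12, 3), "012b"))
```

Since `ord` works on any Unicode character, the same sketch applies to metadata in non-Latin scripts, in line with the language-independence goal stated earlier.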
11Duplicate Detection in Digital Library System
- The Similarity Match Algorithm for Library Database
- Input: L, a library database consisting of documents D1, D2, ..., Dm; query Q.
- Output: B, the book corresponding to query Q.
- Procedure Library (D1, D2, ..., Dm, Q: in; B: out)
  - for i = 1 to m do
    - Si = superimposed-coding (Di)
  - end do
  - X = superimposed-coding (Q)
  - O = Jaccard (S1, S2, ..., Sm, X)
  - Look up in library database L the book B (document) whose signature has the minimum Jaccard distance to X.
- End
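The procedure above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the word hash is a CRC32-based stand-in rather than the shift-and-add hash from the slides, the signature width of 64 bits and 3 bits per word are assumed values, and words are split on non-word characters after lowercasing. The distance is computed as (r + s) / (q + r + s + t), reading the expression on the Jaccard Distance slide that way; the denominator then equals the signature width.

```python
import re
import zlib

N_BITS = 64        # signature width n (assumed; not stated on the slides)
BITS_PER_WORD = 3  # bits set per word r (assumed)


def word_signature(word):
    """Stand-in word hash based on CRC32; sets up to BITS_PER_WORD bits."""
    data = word.encode("utf-8")
    sig = 0
    for k in range(BITS_PER_WORD):
        sig |= 1 << (zlib.crc32(data + bytes([k])) % N_BITS)
    return sig


def superimposed_coding(record):
    """OR together the signatures of the words in a metadata record."""
    sig = 0
    for word in re.findall(r"\w+", record.lower()):
        sig |= word_signature(word)
    return sig


def distance(target, query):
    """(r + s) / (q + r + s + t); r + s is the number of differing bits."""
    return bin(target ^ query).count("1") / N_BITS


def library(documents, query):
    """Return (book, distance) for the document closest to the query."""
    signatures = [superimposed_coding(d) for d in documents]   # S1..Sm
    x = superimposed_coding(query)                              # X
    distances = [distance(s, x) for s in signatures]            # O
    best = min(range(len(documents)), key=distances.__getitem__)
    return documents[best], distances[best]


books = [
    "The Meaning And Teaching Of Music - Will Earhart",
    "Some Famous Singers Of The 19th Century - Francis Rogers",
    "A Dictionary of Musical Terms - Dr.th.baker",
    "The Arts of Japan - Edward Dillon",
]
for q in ["The Ars of Japa Edward Dilon",
          "The of Japan - Edward Dillon",
          "Dillon Edward -The Japan of Arts"]:
    print(q, "->", library(books, q))
```

In a deployment along the lines described in the slides, the stored signatures would be computed once and kept in the central repository rather than recomputed for every query as this sketch does.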
12Jaccard Distance
- The Jaccard distance between the query signature and the target signature is obtained using the expression
- d = (r + s) / (q + r + s + t)
- q: the number of bits equal to 1 in both the target and query signatures.
- r: the number of bits equal to 1 in the target signature but 0 in the query signature.
- s: the number of bits equal to 0 in the target signature but 1 in the query signature.
- t: the number of bits equal to 0 in both the target and query signatures.
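The four counts and the resulting distance can be computed with bitwise operations, as in the sketch below. It reads the slide's expression as d = (r + s) / (q + r + s + t). With t in the denominator this is the simple matching distance for binary vectors; the textbook Jaccard distance leaves t out, so both variants are shown. Function names are hypothetical, and the two signatures in the usage lines are copied from the example slide (the stored record and the spelling-mistake query).

```python
def bit_counts(target: int, query: int, n: int):
    """Return (q, r, s, t) for two n-bit signatures."""
    q = bin(target & query).count("1")    # 1 in both
    r = bin(target & ~query).count("1")   # 1 in target, 0 in query
    s = bin(~target & query).count("1")   # 0 in target, 1 in query
    t = n - q - r - s                     # 0 in both
    return q, r, s, t


def distance_as_written(target: int, query: int, n: int) -> float:
    """d = (r + s) / (q + r + s + t), the expression as read from the slide."""
    q, r, s, t = bit_counts(target, query, n)
    return (r + s) / (q + r + s + t)


def jaccard_distance(target: int, query: int, n: int) -> float:
    """Textbook Jaccard distance, which ignores the shared zeros t."""
    q, r, s, _ = bit_counts(target, query, n)
    return (r + s) / (q + r + s) if (q + r + s) else 0.0


target_bits = "111101100000000000000011001111"  # The Arts of Japan - Edward Dillon
query_bits = "111101100001110000000011001111"   # spelling-mistake query
n = len(target_bits)
target, query = int(target_bits, 2), int(query_bits, 2)
print(bit_counts(target, query, n))
print(distance_as_written(target, query, n), jaccard_distance(target, query, n))
```

For a fixed signature width both variants are zero exactly when the signatures are identical, but they can rank near matches differently when signatures have different numbers of set bits.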
13False drops
- False drops are minimized by an appropriate choice of the two parameters n and r.
- Online Tool
14EXPERIMENTAL RESULTS
                  Spell mistakes          Missing words           Jumbled words
Metadata records  False drop (%)  DR (%)  False drop (%)  DR (%)  False drop (%)  DR (%)
1000              7               93      9               91      3               97
5000              8               92      10              90      5               95
23000             10              90      12              88      5               95
DR = Detection Rate
15Scalability and accuracy of duplicate detection
system
18CONCLUSION
- An effective and efficient duplicate detection technique is proposed.
- Duplicate detection is done by similarity search using the signature file method, which detects duplicates despite typographical mistakes, word disorder, inconsistent abbreviations and even missing words.
- The technique is language independent and delivers high performance, with 95% accuracy.
19Questions?