Title: UMBC AN HONORS UNIVERSITY IN MARYLAND
Text Based Similarity Metrics and Delta for
Semantic Web Graphs
Krishnamurthy Viswanathan and Tim Finin,
University of Maryland, Baltimore County
Motivation
- Case 3: Different versions of the same SW graph
- In addition, when this case is detected, generate
a delta between the two versions
Classification
Text similarity is very useful in information
retrieval for near-duplicate and similarity
detection
Similarity metrics computed for each candidate
pair
Approach
Naïve Bayes classifier: Similarity in classes and properties
Naïve Bayes/SVM classifier: Difference only in base-URI
SVM classifier: Versioning relationship
Input corpus of SWDs
Convert to canonical form
Convert to n-triples format
Problem
Create Reduced Forms
Compute Text-Based Similarity Metrics
Identify pairs of similar documents
Generating Deltas
- Given a collection of SW graphs as RDF documents, identify pairs of graphs that are similar
- Generate a delta for pairs of graphs identified as having a versioning relationship
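A delta between two versions can be sketched as a pair of set differences over triples. This is only a minimal illustration, not the poster's delta algorithm: it assumes both graphs have already been canonicalized so that blank-node labels are comparable, and the function name `graph_delta` and the example terms are hypothetical.

```python
def graph_delta(old_triples, new_triples):
    """Return triples removed from and added to a graph between two versions.

    Assumes both inputs are already canonicalized, so that equal statements
    compare equal as tuples of strings.
    """
    old_set, new_set = set(old_triples), set(new_triples)
    return {
        "removed": sorted(old_set - new_set),
        "added": sorted(new_set - old_set),
    }


# Illustrative (hypothetical) versions of one small graph.
v1 = [
    ("<person:John>", "<a:likes>", "cheese"),
    ("<person:John>", "<a:livesIn>", "_:g1"),
]
v2 = [
    ("<person:John>", "<a:likes>", "wine"),
    ("<person:John>", "<a:livesIn>", "_:g1"),
]
delta = graph_delta(v1, v2)
```

Here `delta["removed"]` holds the statement about cheese and `delta["added"]` the statement about wine; the unchanged statement appears in neither list.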
Generate delta between versions
Identify ontology versions
Contributions
SW Graph Canonicalization
- Defined text-based similarity metrics characterizing relations between SW graphs
- Evaluated these metrics for three specific cases of similarity
_:x <a:hasCapital> _:y .
_:x <a:IsPartOf> USA .
<person:John> <a:likes> cheese .
<person:John> <a:livesIn> _:x .

<person:John> <a:livesIn> _:x .
_:x <a:IsPartOf> USA .
<person:John> <a:likes> cheese .
_:x <a:hasCapital> _:y .
Evaluation
- Case 1: Same classes and properties used, differing
only in literal content
- Three datasets of 400 semantic web documents for training and testing
- 17 combinations of similarity metrics tested
Jaccard, Containment, Cosine similarity, Hamming
distance between Simhash fingerprints
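The four metrics named above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: it assumes documents are compared as whitespace-separated tokens, and it builds a 64-bit Simhash fingerprint from MD5 token hashes, a choice made here only for reproducibility. All function names are hypothetical.

```python
import hashlib
import math
from collections import Counter


def jaccard(a, b):
    """|A ∩ B| / |A ∪ B| over distinct tokens."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of A's distinct tokens that also occur in B."""
    a, b = set(a), set(b)
    return len(a & b) / len(a) if a else 1.0


def cosine(a, b):
    """Cosine of the angle between token-frequency vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def simhash(tokens):
    """64-bit Simhash: each token votes on every bit, weighted by its count."""
    votes = [0] * 64
    for token, count in Counter(tokens).items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        for i in range(64):
            votes[i] += count if (h >> i) & 1 else -count
    return sum(1 << i for i, v in enumerate(votes) if v > 0)


def hamming(x, y):
    """Number of bit positions in which two fingerprints differ."""
    return bin(x ^ y).count("1")


doc_a = "John likes cheese".split()
doc_b = "John likes wine".split()
```

For these two token lists, Jaccard is 2/4, while containment and cosine are both 2/3; `hamming(simhash(doc_a), simhash(doc_b))` gives a small integer between 0 and 64, with 0 for identical documents.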
_:g2 <a:hasCapital> _:g1 .
_:g2 <a:IsPartOf> USA .
<person:John> <a:likes> cheese .
<person:John> <a:livesIn> _:g2 .

BNode Table
Old bnode identifier    New bnode identifier
_:y                     _:g1
_:x                     _:g2
- Assigns uniform identifiers to blank nodes
- Provides a deterministic order to statements
- Empirical method that works for most examples
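The two properties listed above (uniform blank-node identifiers and a deterministic statement order) can be illustrated with a small sketch. It is not the poster's algorithm and does not reproduce its exact numbering: it assumes triples are given as tuples of strings, sorts them with blank-node labels masked, and numbers blank nodes in order of first appearance. Like the method described above, it is empirical: statements that are identical once masked can still be ordered differently across serializations. The name `canonicalize` is hypothetical.

```python
def is_bnode(term):
    return term.startswith("_:")


def canonicalize(triples):
    """Relabel blank nodes uniformly and return the statements in sorted order."""

    def masked(triple):
        # Hide blank-node labels so the ordering does not depend on them.
        return tuple("_:" if is_bnode(t) else t for t in triple)

    ordered = sorted(triples, key=masked)

    # Number blank nodes by first appearance in the masked ordering.
    mapping = {}
    for triple in ordered:
        for term in triple:
            if is_bnode(term) and term not in mapping:
                mapping[term] = f"_:g{len(mapping) + 1}"

    relabeled = [tuple(mapping.get(t, t) for t in triple) for triple in ordered]
    return sorted(relabeled), mapping


# Two serializations of the same graph with different labels and ordering.
graph_1 = [
    ("_:x", "<a:hasCapital>", "_:y"),
    ("_:x", "<a:IsPartOf>", "USA"),
    ("<person:John>", "<a:likes>", "cheese"),
    ("<person:John>", "<a:livesIn>", "_:x"),
]
graph_2 = [
    ("<person:John>", "<a:livesIn>", "_:p"),
    ("_:p", "<a:IsPartOf>", "USA"),
    ("<person:John>", "<a:likes>", "cheese"),
    ("_:p", "<a:hasCapital>", "_:q"),
]
canon_1, table_1 = canonicalize(graph_1)
canon_2, table_2 = canonicalize(graph_2)
```

Both inputs yield the same list of statements, and `table_1` and `table_2` play the role of the BNode table, mapping old identifiers to new ones.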
Type of Similarity                     True Positive Rate  False Positive Rate  Precision  Recall
Similarity in classes and properties   0.986               0.014                0.987      0.986
Difference only in base URI            0.988               0.012                0.988      0.988
Versioning relationship                0.909               0.091                0.913      0.909
- Case 2: Differ only in base-URI
Four reduced forms
- Only literals from the original n-triple file
- All non-literal content from the original n-triple file
- Base-URI of every node replaced by
- Literals and base-URIs replaced by
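The four reduced forms can be sketched on triples held as tuples of strings. This is a rough illustration under several assumptions: the replacement symbol is not given above, so `BASE_MARK` and `LITERAL_MARK` are hypothetical stand-ins; the base-URI of a node is taken to be everything up to the last `#` or `/`; and literals are recognized by a leading double quote. The example URIs are invented.

```python
BASE_MARK = "BASE:"   # hypothetical stand-in for the base-URI replacement symbol
LITERAL_MARK = "LIT"  # hypothetical stand-in for the literal replacement symbol


def is_literal(term):
    return term.startswith('"')


def is_uri(term):
    return term.startswith("<") and term.endswith(">")


def mask_base(term):
    """Replace the base-URI of a URI term, keeping only its local name."""
    if not is_uri(term):
        return term
    inner = term[1:-1]
    cut = max(inner.rfind("#"), inner.rfind("/"))
    return f"<{BASE_MARK}{inner[cut + 1:]}>"


def reduced_forms(triples):
    """Return the four reduced forms of a list of (s, p, o) string tuples."""
    literals_only = [t for triple in triples for t in triple if is_literal(t)]
    non_literals = [
        tuple(t for t in triple if not is_literal(t)) for triple in triples
    ]
    base_masked = [tuple(mask_base(t) for t in triple) for triple in triples]
    all_masked = [
        tuple(LITERAL_MARK if is_literal(t) else mask_base(t) for t in triple)
        for triple in triples
    ]
    return literals_only, non_literals, base_masked, all_masked


example = [
    (
        "<http://example.org/people#John>",
        "<http://example.org/vocab#likes>",
        '"cheese"',
    )
]
literals_only, non_literals, base_masked, all_masked = reduced_forms(example)
```

Two documents that differ only in base-URI produce identical `base_masked` output, and two that differ only in literal content produce identical `non_literals` output, which is what makes these forms convenient inputs to the similarity metrics.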