1
Training-less Ontology-based Text Categorization.

Maciej Janik

Major professor: Dr. Krzysztof J. Kochut
Committee: Dr. John A. Miller, Dr. Khaled Rasheed, Dr. Amit P. Sheth
December 14th, 2007
PhD Prospectus presentation
2
Outline

Document categorization
Classic approach to categorization
Graph categorization and similarity metrics
Ontology-based approach to categorization
Algorithm sketch
Algorithm details and assumptions
Example and preliminary results
Planned work and expected results
References

3
Document categorization

"Document classification/categorization is a
problem in information science. The task is to
assign an electronic document to one or more
categories, based on its contents." [Wikipedia]

4
Document categorization by people

People categorize a document by understanding its
content, using their knowledge, and understanding
what the category is.
Categorization is based on:
Document content (features, graph)
Knowledge (ontology)
Category (category definition)
Perceived interest (categorization context)
5
Automatic text categorization

Automatic text classification can be defined as
the task of assigning category labels to new
documents based on the knowledge gained in a
classification system at the training stage.
Requires training with pre-classified documents.
Proposed solution:
Use already defined knowledge for document
categorization and skip the training stage.

6
Classic categorization

Methods are based on word/phrase statistics,
information gain and other probability or
similarity measures.
Examples [Sebastiani]:
Naïve Bayes, SVM, Decision Tree, k-NN
Categorization based on information (frequencies,
probabilities) learned from the training
documents.
Vocabulary extension/unification possible by use
of synonyms, homonyms, word groups (e.g. from
WordNet)
Document representation for categorization:
Set or vector of features; the most popular and
simplest is the bag of words
Does not include information about document
structure, relative position of phrases, etc.
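
As a minimal illustration of the bag-of-words representation (not code from the presentation; the tokenizer and the function name bag_of_words are assumptions made for this sketch), the example below shows that two sentences with different word order collapse to the same feature set:

```python
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text, split it into word tokens, and count each token."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)


a = bag_of_words("Ford sells Jaguar")
b = bag_of_words("Jaguar sells Ford")
# Word order is not stored, so the two sentences are indistinguishable.
print(a == b)
```

Who sells what to whom is lost, which is the structural information the graph representations on the next slide try to keep.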

7
Graph representation of text

Graph representation preserves (selected)
structural information from document
Relative word positions, to find closely
co-occurring phrases.
Paragraph, formatting (e.g. emphasis), part of
document.
Sample representations
Words form a directed graph, chained in order as
they appear in each sentence.
Words form a weighted graph, where edge connects
words within certain distance and weight
determines closeness.
Connected terms based on NLP processing or
co-occurrence.
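
The first two sample representations can be sketched as follows. This is only an illustration under assumptions: the tokenizer, the function names chain_graph and window_graph, and the 1/distance weighting are choices made for this example, not taken from the cited work.

```python
import re
from collections import defaultdict


def tokenize(sentence):
    return re.findall(r"[a-z0-9]+", sentence.lower())


def chain_graph(sentences):
    """Directed graph: edge (w1, w2) when w2 directly follows w1 in a sentence."""
    edges = set()
    for sentence in sentences:
        words = tokenize(sentence)
        edges.update(zip(words, words[1:]))
    return edges


def window_graph(sentences, window=3):
    """Undirected weighted graph: words at most `window` positions apart are
    connected, and each co-occurrence adds 1/distance to the edge weight."""
    weights = defaultdict(float)
    for sentence in sentences:
        words = tokenize(sentence)
        for i, w1 in enumerate(words):
            for d in range(1, window + 1):
                if i + d < len(words) and words[i + d] != w1:
                    weights[frozenset((w1, words[i + d]))] += 1.0 / d
    return dict(weights)


print(chain_graph(["Ford sells Jaguar"]))
print(window_graph(["Ford sells Jaguar"], window=2))
```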

8
Graph representations - examples
[Schenker], [Gamon]
9
Graph-based categorization

Categorization based on similarity metrics
[Schenker]:
Isomorphism
Maximum common subgraph / minimum common
supergraph
Graph edit distance
Statistical methods:
Diameter, degree distribution, betweenness
Comparison of node neighbors
Distance preservation measure
Methods:
k-NN: most straightforward
Similarity to centroids: graph mean and graph
median
Term distance to category
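
One of the listed metrics, the maximum common subgraph distance, becomes cheap to compute when every node carries a unique label (for example, one node per term), because the common subgraph is then simply the shared nodes and shared edges. The sketch below assumes that setting and assumes graph size is counted as nodes plus edges; the function names are illustrative.

```python
def graph_size(nodes, edges):
    # Assumption for this sketch: size = number of nodes + number of edges.
    return len(nodes) + len(edges)


def mcs_distance(g1, g2):
    """Distance in [0, 1] based on the maximum common subgraph.

    Each graph is a pair (set_of_node_labels, set_of_edges) and node
    labels are assumed unique, so the common subgraph is the intersection.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    biggest = max(graph_size(nodes1, edges1), graph_size(nodes2, edges2))
    if biggest == 0:
        return 0.0
    common = graph_size(nodes1 & nodes2, edges1 & edges2)
    return 1.0 - common / biggest


doc = ({"ford", "sells", "jaguar"}, {("ford", "sells"), ("sells", "jaguar")})
other = ({"ford", "sells", "volvo"}, {("ford", "sells"), ("sells", "volvo")})
print(mcs_distance(doc, other))
```

A k-NN classifier would then assign a document the majority category among the k training graphs with the smallest distance.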

10
Ontology

"An explicit specification of a
conceptualization." [Tom Gruber]
"Ontology is a data model that represents a set of
concepts within a domain and the relationships
between those concepts. It is used to reason
about the objects within that domain." [Wikipedia]

11
Ontology - example
12
Use of ontologies in classification

Term unification
Hierarchy of concepts
Entity recognition and disambiguation
Strengthening co-occurrence of related entities
Nearest neighbors

13
Ontology-based classification

Ontology IS the knowledge base and THE
CLASSIFIER: no need for a training set.
A rich instance base defines the known universe.
A schema with a taxonomy describes the
categorization structure.
Classification is based on entities recognized in
the text and the semantic relationships between them.
Assigned categories are based on entity types
and the taxonomy embedded in the schema.

14
OntoCategorization bases

Probability
Traditionally, a document is classified based on
the probabilities that a given feature (word, phrase)
belongs to a certain category.
Here, the more features belong to a category, the
more probable it is that the document belongs to
that category.
Similarity
A category is defined as an ontology fragment
(entities, classes, structures, etc.).
Similarity of the document graph to a given ontology
fragment describes closeness to the selected category.
Connectivity (components)
Knowledge is based on associations.
Entities in one category should form a connected
component, as they belong to the same subject.

15
Classes and categories

Classes do not have to be categories
Classes
Form taxonomy / partonomy
Strict, formal requirements
Membership based on features
Categories
Can include other categories, intersect with
them, etc.: a more set-like approach
A category can be a complex structure of classes,
relationships and instances
A topic of interest can span multiple,
normally unrelated classes in the schema

16
Who? What? Where? When? Why?

WWW: What (who)? Where? When?
These text dimensions are orthogonal (in most
texts).
It is fairly easy to find place and date/time.
What / who: description of the article's topic.
Ontology classification
Focus on the text core: find what and who by
matching entities.
Recognize relationships between entities to
construct an initial document graph.
Overlaying the ontology graph on the core entities
reveals the semantics of the analyzed text from
background knowledge.
Why? Hmm...

17
OntoCategorization system
18
Algorithm sketch

Convert text to thematic graph
From words to entities (spotting).
Extract relationships and form triples (NLP).
Overlay background knowledge.
Remove unwanted entities (time/place).
Categorize graph using ontology
Select the thematic component for categorization
(disambiguation and topic set).
Find the best category coverage for the selected
thematic graph.

19
Algorithm sketch - more details

Match phrases in the text with entities in the
ontology and assign initial weights.
Graph overlay: add relationships from the ontology
between matched entities.
Mark / remove entities related to dates and
places.
Add extracted relationships (NLP) between
recognized entities.
Propagate entity weights in the graph in a way
similar to the hubs-authorities algorithm [Kleinberg].
Find thematic graph(s) for further analysis:
connected components.
Calculate the most important entities based on
weight and graph centrality.
Find categories in the schema that cover the
largest part of the thematic component, are lowest
in the hierarchy, and include the most important
entities.

20
Experiments

Wikipedia ontology
Includes around 2,000,000 entries
Multiple entity names (variations for matching)
Has rich instance base (articles)
Internal href, templates and infobox relations
carry semantic connections among entries
Has a large schema: over 310,000 categories
They DO NOT form a taxonomy, just a graph (which
even includes cycles)

21
Experiments (2)

Wikipedia 2 RDF
Created initially by dbpedia.org
[Auer, Lehmann]
Creation of RDF: some modifications
Focus on href, infoboxes and templates
Special relationships for entities in infoboxes
and templates
Only English version of Wikipedia
Entity name variations for matching
Name, short name (no brackets), redirect,
disambiguation, alternate names
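
A small sketch of how such name variations might be generated for one entry. The helper names and the treatment of underscores are assumptions for illustration; the slides do not show how redirects and alternate names were actually collected, so they are passed in as plain lists here.

```python
import re


def short_name(title):
    """Drop a trailing bracketed qualifier: 'Jaguar (animal)' -> 'Jaguar'."""
    return re.sub(r"\s*\([^()]*\)\s*$", "", title)


def name_variations(title, redirects=(), alternates=()):
    """Collect lowercase literals that should match the given entry."""
    name = title.replace("_", " ")
    names = {name, short_name(name)}
    names.update(redirects)
    names.update(alternates)
    return {n.lower() for n in names if n}


print(name_variations("Jaguar_(animal)"))
print(name_variations("Ford_Motor_Company", redirects=["Ford Motor Co."]))
```

The short name is what makes "Jaguar" in running text a candidate for both the animal and the car maker, which is why a later disambiguation step is needed.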

22
Algorithm details (1)

Entity name matching
Entities and relationships are the content of the
document: they define its topic(s).
The ontology defines known entities, the literals or
phrases assigned to them, and classifications.
The analyzed text must contain some of these entities
to be categorizable; otherwise it is outside
the ontology scope.
Matching assigns spotted phrases to known
literals, and later to entities.
Possible use of stop words and/or stemming.
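
The matching step can be pictured with a greedy longest-match spotter over a literal table. The table below is hypothetical and hand-written for the example sentence used later in the talk; a real table would be derived from the ontology, and the actual matcher may work differently.

```python
import re

# Hypothetical literal -> candidate entities table (illustration only).
LITERALS = {
    "ford motor co": ["Ford_Motor_Company"],
    "ford": ["Ford_Motor_Company"],
    "jaguar": ["Jaguar_Cars", "Jaguar_(animal)"],
    "land rover": ["Land_Rover"],
    "alan mulally": ["Alan_Mulally"],
    "ceo": ["Chief_Executive_Officer"],
}
MAX_WORDS = max(len(literal.split()) for literal in LITERALS)


def spot_entities(text):
    """Scan left to right, always preferring the longest known phrase."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    spotted, i = [], 0
    while i < len(words):
        for n in range(min(MAX_WORDS, len(words) - i), 0, -1):
            phrase = " ".join(words[i:i + n])
            if phrase in LITERALS:
                spotted.append((phrase, LITERALS[phrase]))
                i += n
                break
        else:
            i += 1  # no literal starts at this word
    return spotted


sentence = ("Ford Motor Co. is in the process of selling Jaguar and "
            "Land Rover, according to Ford CEO Alan Mulally.")
for phrase, candidates in spot_entities(sentence):
    print(phrase, "->", candidates)
```

Note that "jaguar" keeps two candidate entities at this stage; choosing between them is left to the graph analysis that follows.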

23
Example of entity matching

Ford Motor Co. is in the process of selling
Jaguar and Land Rover, according to Ford
CEO Alan Mulally.

24
Algorithm details (2)

Semantic graph construction
Add relationships between recognized entities
from ontology, as ontology defines meaningful
(semantic) connections between them.
Add relationships extracted from NLP analysis of
annotated text.
Connected entities make graph analysis possible:
connectivity, path finding, etc.
Date and place elimination
Dates and places are orthogonal to topic.
A path connecting entities through a place or date
carries very little meaning for the document topic.
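
A rough sketch of graph construction with date and place elimination. The triples, the set of excluded place/date nodes, and the function name are all made up for illustration; in the described system the ontology triples would come from the Wikipedia RDF graph and the extra triples from NLP analysis.

```python
# Hypothetical ontology triples (subject, relation, object).
ONTOLOGY = [
    ("Jaguar_Cars", "parent_company", "Ford_Motor_Company"),
    ("Land_Rover", "parent_company", "Ford_Motor_Company"),
    ("Ford_Motor_Company", "has_CEO", "Alan_Mulally"),
    ("Ford_Motor_Company", "headquartered_in", "Dearborn"),
]
# Hypothetical set of entities typed as places or dates.
PLACES_AND_DATES = {"Dearborn", "2007"}


def build_semantic_graph(matched, ontology, nlp_triples=(),
                         excluded=PLACES_AND_DATES):
    """Keep matched entities that are not places/dates, then add every
    ontology or NLP triple whose both ends survived."""
    nodes = set(matched) - set(excluded)
    edges = {
        (s, rel, o)
        for s, rel, o in list(ontology) + list(nlp_triples)
        if s in nodes and o in nodes
    }
    return nodes, edges


matched = {"Ford_Motor_Company", "Jaguar_Cars", "Land_Rover",
           "Alan_Mulally", "Dearborn"}
nlp = [("Ford_Motor_Company", "sells", "Jaguar_Cars")]
nodes, edges = build_semantic_graph(matched, ONTOLOGY, nlp)
print(sorted(nodes))
print(sorted(edges))
```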

25
Example - parse tree and triples

Ford Motor Co. is in the process of selling
Jaguar and Land Rover, according to Ford CEO Alan
Mulally.

26
Example - NLP and ontology knowledge

Ford Motor Co. is in the process of selling
Jaguar and Land Rover, according to Ford CEO Alan
Mulally.

[Figure: graph connecting Ford Motor Company, Jaguar Cars, Jaguar (animal), Land Rover, Alan Mulally and Chief Executive Officer through the relationships named_after, parent_company, sells, has_CEO, CEO_of and is_a]
27
Algorithm details (3)

Weight propagation
Each entity has its initial weight assigned by
strength of phrase matching.
As on the web, entities are interconnected and
influence each other.
We are looking for authority entities: the
assumption is that they are most representative of
the topic.
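
The slides say only that weights are propagated "in a similar way" to the hubs-authorities algorithm, so the following is a plain HITS-style iteration seeded with the initial matching weights, not the exact scheme used in the prototype. Function names and the iteration count are assumptions.

```python
import math


def normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
    return {k: v / norm for k, v in scores.items()}


def propagate_weights(edges, initial, iterations=20):
    """HITS-style update on directed edges (src, dst).

    An entity's authority is the sum of the hub scores of entities
    pointing to it; its hub score is the sum of the authorities it
    points to. Both start from the initial matching weights.
    """
    nodes = list(initial)
    auth = dict(initial)
    hub = dict(initial)
    for _ in range(iterations):
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth, hub = normalize(new_auth), normalize(new_hub)
    return auth


edges = [("Jaguar_Cars", "Ford_Motor_Company"),
         ("Land_Rover", "Ford_Motor_Company"),
         ("Alan_Mulally", "Ford_Motor_Company")]
initial = {"Jaguar_Cars": 1.0, "Land_Rover": 1.0,
           "Alan_Mulally": 1.0, "Ford_Motor_Company": 1.0}
authority = propagate_weights(edges, initial)
print(max(authority, key=authority.get))
```

In this toy graph every other entity points to Ford_Motor_Company, so it ends up as the single authority.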

28
Algorithm details (4)

Thematic subgraph in matched graph
The assumption is that entities associated with the
same or related topics are interconnected in the
ontology, as they are in real life.
Graph component: topic-related entities.
Each document (or document fragment) should deal
with one or two main topics: keep only the most
important (by weight) and largest component(s).
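
Selecting thematic components can be sketched as finding connected components and ranking them by total entity weight, then by size. The ranking key is an assumption for this example; the slides name both criteria but not how they are combined.

```python
from collections import defaultdict


def connected_components(nodes, edges):
    """Components of an undirected graph given as node list and edge pairs."""
    adjacent = defaultdict(set)
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            component.add(node)
            for neighbor in adjacent[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(component)
    return components


def thematic_components(nodes, edges, weight, keep=1):
    """Keep the `keep` components with the highest total weight (ties: size)."""
    components = connected_components(nodes, edges)
    components.sort(key=lambda c: (sum(weight[n] for n in c), len(c)),
                    reverse=True)
    return components[:keep]


nodes = ["Ford_Motor_Company", "Jaguar_Cars", "Land_Rover", "Jaguar_(animal)"]
edges = [("Jaguar_Cars", "Ford_Motor_Company"),
         ("Land_Rover", "Ford_Motor_Company")]
weight = {n: 1.0 for n in nodes}
print(thematic_components(nodes, edges, weight))
```

Here the isolated Jaguar_(animal) node falls outside the kept component, which is how component selection can double as disambiguation.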

29
Thematic graph examples
[Figure: thematic graphs over the entities Chief Executive Officer, Jaguar Cars, Jaguar (animal), Ford Motor Company, Alan Mulally, Land Rover, Announcement, Sales, News, Business, Newspaper and Buyer]
30
Algorithm details (5)

Most important and central entities
A topic tends to center around a few entities that
are either most important (by weight) or most
central in the graph.
Also, the classification of the whole subgraph
should be a subset of the possible classifications
of these entities.
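
The slides do not define the centrality measure. As one possible reading, the sketch below uses the sum of shortest-path distances to all other entities in the component (lower means more central) and takes the union of the top entities by weight and by that measure. Both the measure and the function names are assumptions.

```python
from collections import defaultdict, deque


def farness(nodes, edges):
    """Sum of BFS distances from each node to every node it can reach.

    Assumes the nodes form one connected component, so that the sums
    are comparable between nodes.
    """
    adjacent = defaultdict(set)
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    result = {}
    for start in nodes:
        distance = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in adjacent[node]:
                if neighbor not in distance:
                    distance[neighbor] = distance[node] + 1
                    queue.append(neighbor)
        result[start] = sum(distance.values())
    return result


def top_entities(nodes, edges, weight, k=2):
    """Union of the k heaviest and the k most central entities."""
    far = farness(nodes, edges)
    by_weight = sorted(nodes, key=lambda n: -weight[n])[:k]
    by_centrality = sorted(nodes, key=lambda n: far[n])[:k]
    return set(by_weight) | set(by_centrality)


nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c")]
print(farness(nodes, edges))
print(top_entities(nodes, edges, {"a": 3.0, "b": 1.0, "c": 1.0}, k=1))
```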

31
Algorithm details (6)

Categorization
A category is defined as a set and/or hierarchy of
classes defined in the ontology schema.
Each entity has a hierarchy of assigned
categories.
The best ontology class for a graph should:
Cover the maximum number of entities in the graph.
Be on a relatively low level in the hierarchy.
Be close in the hierarchy to the classified entity.
Include the most important entities (the more, the
better).
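
To make these criteria concrete, here is one hypothetical way to score categories: covered entity weight divided by the average height of the category above the entities it covers. The scoring formula, the PARENTS fragment and the function names are assumptions for illustration only; the presentation does not give the actual formula.

```python
from collections import deque

# Hypothetical fragment of a category graph: node -> parent categories.
PARENTS = {
    "Jaguar_Cars": ["Car_manufacturers"],
    "Land_Rover": ["Car_manufacturers", "Off-road_vehicles"],
    "Ford_Motor_Company": ["Car_manufacturers"],
    "Alan_Mulally": ["Ford_executives"],
    "Ford_executives": ["Ford_people"],
    "Ford_people": ["Ford"],
    "Car_manufacturers": ["Automobiles"],
    "Off-road_vehicles": ["Automobiles"],
}


def category_heights(entity, parents):
    """Breadth-first walk upward: {category: fewest steps from the entity}.
    The `seen` set keeps the walk finite even if the graph has cycles."""
    heights, seen = {}, {entity}
    queue = deque([(entity, 0)])
    while queue:
        node, height = queue.popleft()
        for parent in parents.get(node, []):
            if parent not in seen:
                seen.add(parent)
                heights[parent] = height + 1
                queue.append((parent, height + 1))
    return heights


def rank_categories(entities, weight, parents):
    """Score = (covered weight / total weight) / average height."""
    total = sum(weight[e] for e in entities)
    stats = {}
    for entity in entities:
        for category, height in category_heights(entity, parents).items():
            count, height_sum, covered = stats.get(category, (0, 0, 0.0))
            stats[category] = (count + 1, height_sum + height,
                               covered + weight[entity])
    scored = [((covered / total) / (height_sum / count), category)
              for category, (count, height_sum, covered) in stats.items()]
    return sorted(scored, reverse=True)


entities = ["Jaguar_Cars", "Land_Rover", "Ford_Motor_Company", "Alan_Mulally"]
weight = {e: 1.0 for e in entities}
for score, category in rank_categories(entities, weight, PARENTS):
    print(round(score, 3), category)
```

With this toy data the broader Automobiles category covers the same three entities as Car_manufacturers but sits higher, so it scores lower, which mirrors the preference for categories low in the hierarchy.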

32
Entities and categories
[Figure: the entities Jaguar Cars, Alan Mulally, Jaguar (animal), Ford Motor Company, Land Rover and Chief Executive Officer linked to the categories Car Manufacturers, Felines, Living people, Off-road vehicles, Ford, Pantherinae, Ford people, Jaguar, Panthera and Ford executives]
33
Longer example

Ford, utility ready to work on plug-in car
Automaker, Southern California Edison to unveil
alliance in response to demand for
energy-efficient vehicles.
DETROIT (Reuters) -- Ford Motor Co. and power
utility Southern California Edison will announce
an unusual alliance Monday aimed at clearing the
way for a new generation of rechargeable electric
cars, the companies said.
Ford (Charts , Fortune 500) Chief Executive Alan
Mulally and Edison International (Charts ,
Fortune 500) Chief Executive John Bryson are
scheduled to meet with reporters at Edison's
headquarters in Rosemead, Calif., the companies
said.
...
Led by Toyota Motor Corp's (Charts) Prius, the
current generation of hybrid vehicles uses
batteries to power the vehicle at low speeds and
in to provide assistance during stop-and-go
traffic and hard acceleration, delivering higher
fuel economy.
General Motors Corp. (Charts , Fortune 500) has
already begun work this year to develop its own
plug-in hybrid car, designed to use little or no
gasoline over short distances. The company showed
off a concept version of the Chevrolet Volt in
January at the Detroit Auto show and has awarded
contracts to two battery makers to research
advanced batteries for a possible production
version.

34
Longer example

(Article text repeated from the previous slide.)

35
(No Transcript)
36
Longer example - graph properties

Initial number of vertexes: 205
Initial number of edges: 361
Largest component: 95
Component for analysis: 35
Central and most important entities:
Hybrid_vehicle: centrality 208, weight 1.516873
Automobile: centrality 213, weight 1.249790
Internal_combustion_engine: centrality 233, weight 1.069511
Ford_Motor_Company: centrality 237, weight 1.451533
Southern_California_Edison: centrality 351, weight 1.308824

37
Longer example - categories

Category:Automobiles
CAT instances <13>, (avg. height 2.384615) weight 0.874697
Category:Alternative_propulsion
CAT instances <4>, (avg. height 1.250000) weight 0.873287
Category:Car_manufacturers
instances <3>, (avg. height 1.000000) weight 0.781271
Category:Vehicles
CAT instances <13>, (avg. height 2.923077) weight 0.647903
Category:Transportation
CAT instances <11>, (avg. height 3.090909) weight 0.629714

38
Wikipedia categories

Wikipedia categories DO NOT form a taxonomy
They form just a directed graph that contains
cycles.
Not possible to use subsumption for categories.
Thesaurus-like structure [Voss].
Categories may be very deep and detailed, or very
broad
Hard to pinpoint a cut-off point that is good for
categorization.
There is no simple mapping between news
categories and categories in Wikipedia.

39
Overall performance of initial tests

Tests against a classic BOW statistical classifier
[McCallum].
Source articles and categories taken from CNN:
a total of 7158 documents in 14 categories.
Divided into a 50% training / 50% testing split.
Mapping between Wikipedia and CNN categories done
manually by crawling the generated Wikipedia schema
(still not really precise).

40
Text corpora - CNN news
41
CNN and Wikipedia

CNN categories
Classified by people
Describe mostly article interest, not necessarily
its content
Frequently describe the reader's interest rather
than the true subject.
Hard to match to Wikipedia categories
Wikipedia categories
Content-based
Very detailed and deep

42
Categorization results - BOW
43
Categorization results - BOW on Wikipedia
44
Categorization results - Wikipedia
45
Summary of work

Ontology storage and querying
BRAHMS: RDF/S storage
SPARQLeR: query language extension with path
queries
For use in the Glycomics project
Prototype of ontology-based categorization
Partial implementation: not all modules included
yet
Use of a general-purpose ontology: RDF graph
created from English Wikipedia
Initial tests confirm proof of concept
Published as technical report, submitted to WWW
2008

46
Remaining research

Goal
Create comprehensive model for ontology-based
categorization.
Create semantic context definition
Modify and/or create graph similarity measures
that exploit context information

47
Current work in progress

Goal
Create a system where the user can categorize a text
document with a given ontology using a specified
semantic context.
NLP module for relationship extraction
Definition of query context
Extension of SPARQL with context queries

48
Proposed work

Include NLP analysis in creating relationships
between entities
Will help to link entities that have no
connection in the ontology, or to strengthen an
existing connection.
Explore categorization to a user-defined context
(collection of instances, classes, structures,
path expressions).
Extend the definition of category to include context.
Experiment with other well-developed ontologies
to categorize more specialized documents,
e.g. PubMed
(optional) Study the applicability of the method
for ontology-based document summarization.

49
Published papers

Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench RDF Store And High Performance Memory System for Semantic Association Discovery", Fourth International Semantic Web Conference, ISWC 2005, Galway, Ireland, 6-10 November 2005
Krys Kochut, Maciej Janik. "SPARQLeR: Extended Sparql for Semantic Association Discovery", Fourth European Semantic Web Conference, ESWC 2007, Innsbruck, Austria, 3-7 June 2007
Matthew Perry, Maciej Janik, Cartic Ramakrishnan, Conrad Ibanez, Budak Arpinar, Amit Sheth. "Peer-to-Peer Discovery of Semantic Associations", Second International Workshop on Peer-to-Peer Knowledge Management, San Diego, CA, July 17, 2005
Maciej Janik, Krys Kochut. "Wikipedia in action: Ontological Knowledge in Text Categorization", UGA Technical Report No. UGA-CS-TR-07-001, November 2007 (submitted to WWW 2008)
S. Nimmagadda, A. Basu, M. Evenson, J. Han, M. Janik, R. Narra, K. Nimmagadda, A. Sharma, K.J. Kochut, J.A. Miller and W. S. York, "GlycoVault: A Bioinformatics Infrastructure for Glycan Pathway Visualization, Analysis and Modeling," Proceedings of the 5th International Conference on Information Technology: New Generations (ITNG'08), Las Vegas, Nevada (April 2008) (to appear)

50
References

Auer, S. and Lehmann, J., What have Innsbruck and Leipzig in common? Extracting Semantics from Wiki Content. in European Semantic Web Conference (ESWC'07), (Innsbruck, Austria, 2007), Springer, 503-517.
Gamon, M., Graph-Based Text Representation for Novelty Detection. in Workshop on TextGraphs at HLT-NAACL 2006, (New York, NY, US, 2006).
Gruber, T. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5 (2). 199-220, 1993.
Kleinberg, J.M., Authoritative Sources in a Hyperlinked Environment. in ACM-SIAM Symposium on Discrete Algorithms, (1998).
McCallum, A.K. Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow, 1996.
Nagarajan, M., Sheth, A.P., Aguilera, M., Keeton, K., Merchant, A. and Uysal, M. Altering Document Term Vectors for Classification - Ontologies as Expectations of Cooccurrence. LSDIS Technical Report, November, 2006.
Schenker, A., Bunke, H., Last, M. and Kandel, A. Graph-Theoretic Techniques for Web Content Mining. World Scientific, London, 2005.
Sebastiani, F. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34 (1). 1-47.
Voss, J. Collaborative thesaurus tagging the Wikipedia way. ArXiv Computer Science e-prints, cs/0604036.