1
Training-less Ontology-based Text Categorization.
  • Maciej Janik

Major professor: Dr. Krzysztof J. Kochut
Committee: Dr. John A. Miller, Dr. Khaled Rasheed, Dr. Amit P. Sheth
December 14th, 2007, PhD Prospectus presentation
2
Outline
  • Document categorization
  • Classic approach to categorization
  • Graph categorization and similarity metrics
  • Ontology-based approach to categorization
  • Algorithm sketch
  • Algorithm details and assumptions
  • Example and preliminary results
  • Planned work and expected results
  • References

3
Document categorization
  • "Document classification/categorization is a
    problem in information science. The task is to
    assign an electronic document to one or more
    categories, based on its contents." [Wikipedia]

4
Document categorization by people
  • People categorize a document by understanding its
    content, using their knowledge, and understanding
    what the category is.
  • Categorization is based on:
  • Document content: features, graph
  • Knowledge: ontology
  • Category: category definition
  • Perceived interest: categorization context
5
Automatic text categorization
  • Automatic text classification can be defined as
    the task of assigning category labels to new
    documents based on the knowledge gained in a
    classification system at the training stage.
  • Requires training with pre-classified documents.
  • Proposed solution:
  • Use already defined knowledge for document
    categorization and skip the training stage.

6
Classic categorization
  • Methods are based on word/phrase statistics,
    information gain and other probability or
    similarity measures.
  • Examples [Sebastiani]:
  • Naïve Bayes, SVM, Decision Tree, k-NN
  • Categorization is based on information
    (frequencies, probabilities) learned from the
    training documents.
  • Vocabulary extension/unification is possible by
    use of synonyms, homonyms, word groups (e.g. from
    WordNet)
  • Document representation for categorization:
  • Set or vector of features; the most popular and
    simplest is bag of words
  • Does not include information about document
    structure, relative position of phrases, etc.
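The bag-of-words limitation noted above can be illustrated with a minimal sketch. The tokenizer and the function name `bag_of_words` are assumptions made for illustration only; they are not taken from the presentation.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count them.

    Word order and position are discarded, which is exactly the
    information a bag-of-words representation loses.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

# Two sentences with opposite meaning get the same representation.
a = bag_of_words("Ford sells Jaguar")
b = bag_of_words("Jaguar sells Ford")
same_representation = (a == b)
```

Here `same_representation` is true even though the two sentences say different things, which motivates representations that keep some structure.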

7
Graph representation of text
  • Graph representation preserves (selected)
    structural information from the document:
  • Relative word positions, to find closely
    co-occurring phrases.
  • Paragraph, formatting (e.g. emphasis), part of
    document.
  • Sample representations:
  • Words form a directed graph, chained in the order
    they appear in each sentence.
  • Words form a weighted graph, where an edge connects
    words within a certain distance and the weight
    determines closeness.
  • Connected terms based on NLP processing or
    co-occurrence.
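The first two sample representations can be sketched as follows. This is a simplified illustration: whitespace tokenization, the `max_distance` parameter, and the `1/distance` weighting are assumptions, not details given in the slides.

```python
from collections import defaultdict

def chain_graph(sentences):
    """Directed graph: an edge from each word to the word that follows it
    in the same sentence. Returned as a set of (source, target) pairs."""
    edges = set()
    for sentence in sentences:
        words = sentence.lower().split()
        edges.update(zip(words, words[1:]))
    return edges

def window_graph(words, max_distance=2):
    """Weighted undirected graph: words at most max_distance apart are
    linked, and closer pairs contribute a larger weight (1/distance)."""
    weights = defaultdict(float)
    for i, word in enumerate(words):
        for d in range(1, max_distance + 1):
            if i + d < len(words):
                pair = tuple(sorted((word, words[i + d])))
                weights[pair] += 1.0 / d
    return dict(weights)

chain = chain_graph(["ford sells jaguar"])
window = window_graph(["a", "b", "c"], max_distance=2)
```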

8
Graph representations - examples
[Figures: example graph representations from Schenker and Gamon]
9
Graph-based categorization
  • Categorization based on similarity metrics
    [Schenker]:
  • Isomorphism
  • Maximum common subgraph / minimum common
    supergraph
  • Graph edit distance
  • Statistical methods:
  • Diameter, degree distribution, betweenness
  • Comparison of node neighbors
  • Distance preservation measure
  • Methods:
  • k-NN: most straightforward
  • Similarity to centroids: graph mean and graph
    median
  • term distance to category
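As an illustration of a maximum-common-subgraph metric, the sketch below assumes every node carries a unique label (for example, a word or an entity name). Under that assumption the maximum common subgraph is simply the shared nodes and shared edges. Measuring graph size as nodes plus edges is also an assumption; other size definitions are possible.

```python
def graph_size(nodes, edges):
    """Size of a graph, counted here as nodes plus edges (an assumption)."""
    return len(nodes) + len(edges)

def mcs_distance(g1, g2):
    """Distance in [0, 1] based on the maximum common subgraph.

    Each graph is a (nodes, edges) pair of sets. With uniquely labeled
    nodes the common subgraph is the intersection of both sets.
    """
    nodes1, edges1 = g1
    nodes2, edges2 = g2
    common = graph_size(nodes1 & nodes2, edges1 & edges2)
    largest = max(graph_size(nodes1, edges1), graph_size(nodes2, edges2))
    if largest == 0:
        return 0.0
    return 1.0 - common / largest

g1 = ({"a", "b", "c"}, {("a", "b"), ("b", "c")})
g2 = ({"a", "b", "d"}, {("a", "b")})
d12 = mcs_distance(g1, g2)
```

A k-NN classifier would then assign a document graph the majority category among its k nearest training graphs under this distance.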

10
Ontology
  • "An explicit specification of a
    conceptualization." [Tom Gruber]
  • "Ontology is a data model that represents a set of
    concepts within a domain and the relationships
    between those concepts. It is used to reason
    about the objects within that domain." [Wikipedia]

11
Ontology - example
12
Use of ontologies in classification
  • Term unification
  • Hierarchy of concepts
  • Entity recognition and disambiguation
  • Strengthening co-occurrence of related entities
  • Nearest neighbors

13
Ontology-based classification
  • Ontology IS the knowledge base and THE
    CLASSIFIER: no need for a training set.
  • A rich instance base defines the known universe.
  • Schema with taxonomy describes the categorization
    structure.
  • Classification is based on recognized entities in
    text and semantic relationships between them.
  • Categories assigned are based on entity types
    and the taxonomy embedded in the schema.

14
OntoCategorization bases
  • Probability
  • Traditionally, a document is classified based on
    probabilities that a given feature (word, phrase)
    belongs to a certain category.
  • Here: the more features belong to a category, the
    more probable it is that the document belongs to
    that category.
  • Similarity
  • Category is defined as an ontology fragment
    (entities, classes, structures, etc.)
  • Similarity of the document graph to a given
    ontology fragment describes closeness to the
    selected category
  • Connectivity (components)
  • Knowledge is based on associations.
  • Entities in one category should form a connected
    component, as they belong to the same subject.

15
Classes and categories
  • Classes do not have to be categories
  • Classes
  • Form taxonomy / partonomy
  • Strict, formal requirements
  • Membership based on features
  • Categories
  • Can include other categories, intersect with
    them, etc.: a more set-like approach
  • Category can be a complex structure of classes,
    relationships and instances
  • Topic of interest that can span multiple,
    normally unrelated classes in schema

16
Who? What? Where? When? Why?
  • WWW: What (who)? Where? When?
  • These text dimensions are orthogonal (in most
    texts).
  • It is fairly easy to find place and date/time.
  • What / who: description of the article's topic.
  • Ontology classification:
  • Focus on the text core: find what and who by
    matching entities.
  • Recognize relationships between entities to
    construct an initial document graph.
  • Graph overlay from the ontology on core entities
    reveals semantics from background knowledge of
    the analyzed text.
  • Why? Hmm

17
OntoCategorization system
18
Algorithm sketch
  • Convert text to thematic graph
  • From words to entities (spotting).
  • Extract relationships and form triples (NLP).
  • Overlay background knowledge.
  • Remove unwanted entities (time/place).
  • Categorize graph using ontology
  • Select thematic component for categorization
    (disambiguation and topic set)
  • Find best category coverage for selected thematic
    graph.

19
Algorithm sketch: more details
  • Match phrases in text with entities in the ontology
    and assign initial weight.
  • Graph overlay: add relationships from the ontology
    between matched entities.
  • Mark / remove entities related to dates and
    places.
  • Add extracted relationships (NLP) between
    recognized entities.
  • Propagate entity weight in the graph in a similar
    way as in the hubs-authorities algorithm
    [Kleinberg].
  • Find thematic graph(s) for further analysis:
    connected components.
  • Calculate most important entities based on weight
    and graph centrality.
  • Find categories in the schema that cover the
    largest part of the thematic component, are lowest
    in the hierarchy and include the most important
    entities.

20
Experiments
  • Wikipedia ontology
  • Includes around 2,000,000 entries
  • Multiple entity names (variations for matching)
  • Has rich instance base (articles)
  • Internal href, templates and infobox relations
    carry semantic connections among entries
  • Has a large schema of over 310,00 categories
  • They DO NOT form a taxonomy, just a graph (it even
    includes cycles)

21
Experiments (2)
  • Wikipedia 2 RDF
  • Created initially by dbpedia.org
  • [Auer, Lehmann]
  • Creation of RDF: some modifications
  • Focus on href, infoboxes and templates
  • Special relationships for entities in infoboxes
    and templates
  • Only English version of Wikipedia
  • Entity name variations for matching
  • Name, short name (no brackets), redirect,
    disambiguation, alternate names

22
Algorithm details (1)
  • Entity name matching
  • Entities and relationships are the content of the
    document: they define its topic(s).
  • Ontology defines known entities, literals or
    phrases assigned to them and classifications.
  • Analyzed text must contain some of these entities
    to be categorizable; otherwise it is outside of
    the ontology scope.
  • Matching assigns spotted phrases to known
    literals, and later to entities.
  • Possible use of stop words and/or stemming.

23
Example of entity matching
  • Ford Motor Co. is in the process of selling
    Jaguar and Land Rover, according to Ford CEO Alan
    Mulally.
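Entity spotting on this sentence can be sketched as a longest-match lookup of phrases in a lexicon of known literals. The `LEXICON` below is a hypothetical hand-made fragment, not the Wikipedia-derived ontology used in the experiments, and the tokenizer is an assumption.

```python
import re

# Hypothetical lexicon: literal phrase -> candidate entities.
LEXICON = {
    "ford motor co": ["Ford Motor Company"],
    "ford": ["Ford Motor Company"],
    "jaguar": ["Jaguar Cars", "Jaguar (animal)"],
    "land rover": ["Land Rover"],
    "ceo": ["Chief Executive Officer"],
    "alan mulally": ["Alan Mulally"],
}

def spot_entities(text, lexicon, max_len=3):
    """Scan left to right, preferring the longest phrase found in the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    spotted = []
    i = 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in lexicon:
                spotted.append((phrase, lexicon[phrase]))
                i += n
                break
        else:
            i += 1  # no phrase starts at this token
    return spotted

sentence = ("Ford Motor Co. is in the process of selling Jaguar and "
            "Land Rover, according to Ford CEO Alan Mulally.")
spotted = spot_entities(sentence, LEXICON)
```

Note that "jaguar" stays ambiguous after spotting (car maker or animal); in the approach described here, disambiguation is left to the later graph analysis.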

24
Algorithm details (2)
  • Semantic graph construction
  • Add relationships between recognized entities
    from ontology, as ontology defines meaningful
    (semantic) connections between them.
  • Add relationships extracted from NLP analysis of
    annotated text.
  • Connected entities make it possible to perform
    graph analysis: connectivity, finding paths, etc.
  • Date and place elimination
  • Dates and places are orthogonal to topic.
  • A path connecting entities through a place or date
    carries very little meaning for the document
    topic.
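A minimal sketch of the overlay and elimination steps, assuming the ontology is available as a list of (subject, predicate, object) triples. The triples, the `headquartered_in` relation, and the set of place entities are hypothetical illustrations.

```python
# Hypothetical ontology triples (subject, predicate, object).
ONTOLOGY = [
    ("Ford Motor Company", "has_CEO", "Alan Mulally"),
    ("Jaguar Cars", "parent_company", "Ford Motor Company"),
    ("Land Rover", "parent_company", "Ford Motor Company"),
    ("Jaguar Cars", "named_after", "Jaguar (animal)"),
    ("Ford Motor Company", "headquartered_in", "Detroit"),
    ("General Motors", "headquartered_in", "Detroit"),
]
PLACES_AND_DATES = {"Detroit"}  # assumed to be typed as place/date in the ontology

def overlay(matched, triples, excluded=frozenset()):
    """Keep only ontology relationships whose both ends were matched in the
    text and are not excluded as dates or places."""
    kept = set(matched) - set(excluded)
    return [(s, p, o) for (s, p, o) in triples if s in kept and o in kept]

matched = ["Ford Motor Company", "Alan Mulally", "Jaguar Cars",
           "Land Rover", "General Motors", "Detroit"]
graph = overlay(matched, ONTOLOGY, PLACES_AND_DATES)
graph_with_places = overlay(matched, ONTOLOGY)
```

With the place removed, General Motors is no longer linked to Ford merely because both are tied to Detroit.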

25
Example: parse tree and triples
  • Ford Motor Co. is in the process of selling
    Jaguar and Land Rover, according to Ford CEO Alan
    Mulally.

26
Example: NLP and ontology knowledge
  • Ford Motor Co. is in the process of selling
    Jaguar and Land Rover, according to Ford CEO Alan
    Mulally.

[Figure: graph linking Ford Motor Company, Jaguar Cars, Jaguar (animal), Land Rover, Alan Mulally and Chief Executive Officer through the relationships named_after, parent_company, sells, has_CEO, CEO_of and is_a]
27
Algorithm details (3)
  • Weight propagation
  • Each entity has its initial weight assigned by
    strength of phrase matching.
  • Like on the web, entities are interconnected and
    influence each other.
  • We are looking for authority entities; the
    assumption is that they are most representative of
    the topic.
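Weight propagation in the style of hubs and authorities can be sketched as below. This is the standard iterative hub/authority update seeded with the initial matching weights; how the actual system combines initial weights with propagated scores is not stated in the slides, so treat this as an assumption. Edge directions in the example are also hypothetical.

```python
import math

def _normalize(scores):
    """Scale scores to unit Euclidean length (left unchanged if all zero)."""
    norm = math.sqrt(sum(v * v for v in scores.values()))
    if norm == 0:
        return dict(scores)
    return {k: v / norm for k, v in scores.items()}

def propagate_weights(edges, initial, iterations=20):
    """Return authority scores for every node named in `initial`.

    edges: iterable of (source, target) pairs between those nodes.
    initial: node -> initial weight from phrase matching.
    """
    hub = _normalize(initial)
    auth = _normalize(initial)
    for _ in range(iterations):
        new_auth = {n: 0.0 for n in initial}
        for source, target in edges:
            new_auth[target] += hub[source]   # authority: pointed to by good hubs
        auth = _normalize(new_auth)
        new_hub = {n: 0.0 for n in initial}
        for source, target in edges:
            new_hub[source] += auth[target]   # hub: points to good authorities
        hub = _normalize(new_hub)
    return auth

edges = [("Jaguar Cars", "Ford Motor Company"),
         ("Land Rover", "Ford Motor Company"),
         ("Alan Mulally", "Ford Motor Company"),
         ("Alan Mulally", "Chief Executive Officer")]
initial = {name: 1.0 for edge in edges for name in edge}
authority = propagate_weights(edges, initial)
top_entity = max(authority, key=authority.get)
```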

28
Algorithm details (4)
  • Thematic subgraph in matched graph
  • The assumption is that entities associated with
    the same or related topics are interconnected in
    the ontology, the same as in real life.
  • Graph component: topic-related entities.
  • Each document (or document fragment) should deal
    with one or two main topics: leave only the most
    important (by weight) and largest component(s).
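Selecting a thematic component can be sketched as finding connected components and keeping the one with the largest total weight. Ranking by summed weight is an assumption (the slides mention both weight and size), and the edges and weights in the example are hypothetical.

```python
def connected_components(nodes, edges):
    """Connected components of an undirected graph, as a list of sets."""
    adjacency = {n: set() for n in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components

def thematic_component(nodes, edges, weight):
    """Pick the component with the largest total entity weight."""
    return max(connected_components(nodes, edges),
               key=lambda comp: sum(weight[n] for n in comp))

nodes = ["Ford Motor Company", "Alan Mulally", "Jaguar Cars",
         "Land Rover", "News", "Newspaper"]
edges = [("Ford Motor Company", "Alan Mulally"),
         ("Jaguar Cars", "Ford Motor Company"),
         ("Land Rover", "Ford Motor Company"),
         ("News", "Newspaper")]
weight = {"Ford Motor Company": 1.5, "Alan Mulally": 1.0, "Jaguar Cars": 1.2,
          "Land Rover": 1.1, "News": 0.4, "Newspaper": 0.3}
theme = thematic_component(nodes, edges, weight)
```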

29
Thematic graph examples
[Figure: thematic graph examples with the entities Chief Executive Officer, Jaguar Cars, Jaguar (animal), Ford Motor Company, Alan Mulally, Land Rover, Announcement, Sales, News, Business, Newspaper and Buyer]
30
Algorithm details (5)
  • Most important and central entities
  • Topic tends to center around a few entities that
    are either most important (by weight) or most
    central in the graph.
  • Also, the classification of the whole subgraph
    should be a subset of the possible classifications
    of these entities.
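The slides do not say which centrality measure is used. As one possible reading, the sketch below scores each entity by the sum of its shortest-path distances to all reachable entities, so that a lower value means a more central entity. Both the measure and the function names are assumptions.

```python
from collections import deque

def distance_sum(adjacency, source):
    """Sum of shortest-path lengths (in hops) from source to every
    reachable node, computed with breadth-first search."""
    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return sum(distance.values())

def most_central(adjacency):
    """Entity with the smallest distance sum within a connected component."""
    return min(adjacency, key=lambda n: distance_sum(adjacency, n))

# Hypothetical undirected component: Ford in the middle of a star.
adjacency = {
    "Ford Motor Company": {"Alan Mulally", "Jaguar Cars", "Land Rover"},
    "Alan Mulally": {"Ford Motor Company"},
    "Jaguar Cars": {"Ford Motor Company"},
    "Land Rover": {"Ford Motor Company"},
}
center = most_central(adjacency)
```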

31
Algorithm details (6)
  • Categorization
  • Category is defined as a set and/or hierarchy of
    classes defined in the ontology schema.
  • Each entity has a hierarchy of assigned
    categories.
  • The best ontology class for the graph should:
  • Cover the maximum number of entities in the graph.
  • Be at a relatively low level in the hierarchy.
  • Be close in the hierarchy to the classified entity.
  • Include the most important entities (the more, the
    better)
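The criteria above can be combined into a single score in many ways. The sketch below uses one illustrative formula, the covered share of entity weight divided by the average height of the category above the covered entities; it is an assumption, not the scoring function of the presented system. Category heights in the example are hypothetical.

```python
def score_categories(entity_categories, weight):
    """Score each category for a thematic graph.

    entity_categories: entity -> {category: height above the entity (>= 1)}.
    weight: entity -> importance weight.
    Higher coverage of important entities raises the score; a larger
    average height (a broader, more distant category) lowers it.
    """
    total_weight = sum(weight.values())
    covered = {}
    heights = {}
    for entity, categories in entity_categories.items():
        for category, height in categories.items():
            covered[category] = covered.get(category, 0.0) + weight[entity]
            heights.setdefault(category, []).append(height)
    scores = {}
    for category, covered_weight in covered.items():
        avg_height = sum(heights[category]) / len(heights[category])
        scores[category] = (covered_weight / total_weight) / avg_height
    return scores

entity_categories = {
    "Ford Motor Company": {"Ford": 1, "Car Manufacturers": 1},
    "Jaguar Cars": {"Jaguar": 1, "Ford": 2, "Car Manufacturers": 1},
    "Land Rover": {"Off-road vehicles": 1, "Ford": 2, "Car Manufacturers": 1},
    "Alan Mulally": {"Ford executives": 1, "Ford people": 2, "Ford": 3,
                     "Living people": 1},
}
weight = {entity: 1.0 for entity in entity_categories}
scores = score_categories(entity_categories, weight)
best = max(scores, key=scores.get)
```

In this toy example "Ford" covers all four entities but sits higher in the hierarchy on average, so the closer category "Car Manufacturers" scores better.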

32
Entities and categories
[Figure: the entities Jaguar Cars, Jaguar (animal), Ford Motor Company, Land Rover, Alan Mulally and Chief Executive Officer with their category hierarchy: Car Manufacturers, Felines, Living people, Off-road vehicles, Ford, Pantherinae, Ford people, Jaguar, Panthera, Ford executives]
33
Longer example
  • Ford, utility ready to work on plug-in car
    Automaker, Southern California Edison to unveil
    alliance in response to demand for
    energy-efficient vehicles.
  • DETROIT (Reuters) -- Ford Motor Co. and power
    utility Southern California Edison will announce
    an unusual alliance Monday aimed at clearing the
    way for a new generation of rechargeable electric
    cars, the companies said.
  • Ford (Charts , Fortune 500) Chief Executive Alan
    Mulally and Edison International (Charts ,
    Fortune 500) Chief Executive John Bryson are
    scheduled to meet with reporters at Edison's
    headquarters in Rosemead, Calif., the companies
    said.
  • ...
  • Led by Toyota Motor Corp's (Charts) Prius, the
    current generation of hybrid vehicles uses
    batteries to power the vehicle at low speeds and
    to provide assistance during stop-and-go
    traffic and hard acceleration, delivering higher
    fuel economy.
  • General Motors Corp. (Charts , Fortune 500) has
    already begun work this year to develop its own
    plug-in hybrid car, designed to use little or no
    gasoline over short distances. The company showed
    off a concept version of the Chevrolet Volt in
    January at the Detroit Auto show and has awarded
    contracts to two battery makers to research
    advanced batteries for a possible production
    version.

36
Longer example: graph properties
  • Initial number of vertices: 205
  • Initial number of edges: 361
  • Largest component: 95
  • Component for analysis: 35
  • Central and most important entities:
  • Hybrid_vehicle: centrality 208, weight 1.516873
  • Automobile: centrality 213, weight 1.249790
  • Internal_combustion_engine: centrality 233,
    weight 1.069511
  • Ford_Motor_Company: centrality 237, weight
    1.451533
  • Southern_California_Edison: centrality 351,
    weight 1.308824

37
Longer example: categories
  • Category:Automobiles
  • CAT instances <13>, (avg. height 2.384615), weight
    0.874697
  • Category:Alternative_propulsion
  • CAT instances <4>, (avg. height 1.250000), weight
    0.873287
  • Category:Car_manufacturers
  • instances <3>, (avg. height 1.000000), weight
    0.781271
  • Category:Vehicles
  • CAT instances <13>, (avg. height 2.923077),
    weight 0.647903
  • Category:Transportation
  • CAT instances <11>, (avg. height 3.090909),
    weight 0.629714

38
Wikipedia categories
  • Wikipedia categories DO NOT form a taxonomy
  • It is just a directed graph that contains
    cycles.
  • Not possible to use subsumption for categories.
  • Thesaurus-like structure [Voss].
  • Categories may be very deep and detailed, or very
    broad
  • Hard to pinpoint a cut-off point that is good for
    categorization.
  • There is no simple mapping between news
    categories and categories in Wikipedia.

39
Overall performance of initial tests
  • Tests against a classic BOW statistical classifier
    [McCallum].
  • Source articles and categories taken from CNN: a
    total of 7158 documents in 14 categories.
  • Divided into a 50% training / 50% testing split
  • Mapping between Wikipedia and CNN categories done
    manually by crawling the generated Wikipedia schema
    (still not really precise)

40
Text corpora: CNN news
41
CNN and Wikipedia
  • CNN categories
  • Classified by people
  • Describe mostly article interest, not necessarily
    its content
  • Frequently describe readers' interest rather
    than the true subject.
  • Hard to match to Wikipedia categories
  • Wikipedia categories
  • Content-based
  • Very detailed and deep

42
Categorization results - BOW
43
Categorization results - BOW on Wikipedia
44
Categorization results - Wikipedia
45
Summary of work
  • Ontology storage and querying
  • BRAHMS: RDF/S storage
  • SPARQLeR: query language extension with path
    queries
  • For use in Glycomics project
  • Prototype of ontology-based categorization
  • Partial implementation: not all modules included
    yet
  • Use of a general-purpose ontology: RDF graph
    created from English Wikipedia
  • Initial tests confirm proof of concept
  • Published as technical report, submitted to WWW
    2008

46
Remaining research
  • Goal
  • Create comprehensive model for ontology-based
    categorization.
  • Create semantic context definition
  • Modify and/or create graph similarity measures
    that exploit context information

47
Current work in progress
  • Goal
  • Create a system where a user can categorize a text
    document with a given ontology using a specified
    semantic context.
  • NLP module for relationship extraction
  • Definition of query context
  • Extension of SPARQL with context queries

48
Proposed work
  • Include NLP analysis in creating relationships
    between entities
  • Will help to link entities that do not have
    connection in ontology or strengthen this
    connection.
  • Explore categorization to a user-defined context
    (collection of instances, classes, structures,
    path expressions).
  • Extend definition of category to include context.
  • Experiment with other well-developed ontologies
    to categorize more specialized documents
  • E.g. PubMed
  • (optional) Study the applicability of the method
    for ontology-based document summarization.

49
Published papers
  • Maciej Janik, Krys Kochut. "BRAHMS: A WorkBench
    RDF Store And High Performance Memory System for
    Semantic Association Discovery", Fourth
    International Semantic Web Conference, ISWC 2005,
    Galway, Ireland, 6-10 November 2005
  • Krys Kochut, Maciej Janik. "SPARQLeR: Extended
    Sparql for Semantic Association Discovery",
    Fourth European Semantic Web Conference, ESWC
    2007, Innsbruck, Austria, 3-7 June 2007
  • Matthew Perry, Maciej Janik, Cartic Ramakrishnan,
    Conrad Ibanez, Budak Arpinar, Amit Sheth.
    "Peer-to-Peer Discovery of Semantic
    Associations", Second International Workshop on
    Peer-to-Peer Knowledge Management, San Diego, CA,
    July 17, 2005
  • Maciej Janik, Krys Kochut. "Wikipedia in action:
    Ontological Knowledge in Text Categorization",
    UGA Technical Report No. UGA-CS-TR-07-001,
    November 2007, submitted to WWW 2008
  • S. Nimmagadda, A. Basu, M. Evenson, J. Han, M.
    Janik, R. Narra, K. Nimmagadda, A. Sharma, K.J.
    Kochut, J.A. Miller and W. S. York, "GlycoVault:
    A Bioinformatics Infrastructure for Glycan
    Pathway Visualization, Analysis and Modeling,"
    Proceedings of the 5th International Conference
    on Information Technology: New Generations
    (ITNG'08), Las Vegas, Nevada (April 2008), to
    appear

50
References
  • Auer, S. and Lehmann, J., What have Innsbruck and
    Leipzig in common? Extracting Semantics from Wiki
    Content. in European Semantic Web Conference
    (ESWC'07), (Innsbruck, Austria, 2007), Springer,
    503-517.
  • Gamon, M., Graph-Based Text Representation for
    Novelty Detection. in Workshop on TextGraphs at
    HLT-NAACL 2006, (New York, NY, US, 2006).
  • Gruber, T. A Translation Approach to Portable
    Ontology Specifications. Knowledge Acquisition, 5
    (2). 199-220, 1993.
  • Kleinberg, J.M., Authoritative Sources in a
    Hyperlinked Environment. in ACM-SIAM Symposium on
    Discrete Algorithms, (1998).
  • McCallum, A.K. Bow: A toolkit for statistical
    language modeling, text retrieval, classification
    and clustering.
    http://www.cs.cmu.edu/~mccallum/bow, 1996.
  • Nagarajan, M., Sheth, A.P., Aguilera, M., Keeton,
    K., Merchant, A. and Uysal, M. Altering Document
    Term Vectors for Classification - Ontologies as
    Expectations of Cooccurrence. LSDIS Technical
    Report, November, 2006.
  • Schenker, A., Bunke, H., Last, M. and Kandel, A.
    Graph-Theoretic Techniques for Web Content
    Mining. World Scientific, London, 2005.
  • Sebastiani, F. Machine learning in automated text
    categorization. ACM Computing Surveys (CSUR), 34
    (1). 1 - 47.
  • Voss, J. Collaborative thesaurus tagging the
    Wikipedia way. ArXiv Computer Science e-prints,
    cs/0604036.