Pattern Analysis presentation

Title: Pattern Analysis

1
Pattern Analysis and Machine Intelligence Research Group
University of Waterloo
LORNET Theme 4
Data Mining and Knowledge Extraction for LO
Theme Leader: Mohamed Kamel
PIs: O. Basir, F. Karray, H. Tizhoosh
Assoc. PIs: A. Wong, C. DiMarco
2
Knowledge Extraction and LO Mining

GOAL
Develop data mining and knowledge extraction
techniques and tools for learning object
repositories.
These tools can provide context and facilitate
interactions, efficient organization, efficient
delivery, navigation and retrieval.

3
Theme Overview
Knowledge Extraction
From Text: Syntactic (Keyword, Keyphrase-based),
Semantic (Concept-based)
From Images: Image Features, Shape Features
From Text and Images: Describing Images with Text,
Enriching Text with Images
Tagging and Organizing
Classification (MCS, Data Partitioning,
Imbalanced Classes)
Clustering (Parallel/Distributed Clustering,
Cluster Aggregation)
Matching and Ranking
LO Similarity and Ranking
Association Rules / Social Networks
Reinforcement Learning
Specialized / Personalized Search
4
Types of Data in LORNET
LCMS
Course > Module > Lesson > LO
Subject Matter: Text, Images, Flash, Applets,
Metadata, Interaction Logs
TELOS
Resource, Semantic Layer
Resources: Metadata, Semantic References
Discussion Board
Board > Thread > Post
Discussions: Text, Interaction Logs
LOR
Metadata Record
LO Descriptors: Metadata
5
LO Mining Scenarios
Task Environment: TELOS
Knowledge Extraction: Ontology Construction
Tagging / Organizing: Grouping Components
Matching / Ranking: Finding and Ranking Components

Task Environment: E-Learning Design Environment (LMS)
Knowledge Extraction: Extracting LO Summary, Extracting LO Concepts, Extracting Image Description
Tagging / Organizing: Grouping LOs
Matching / Ranking: Finding Similar LOs, Ranking LOs

Task Environment: Learning Object Content MS (LCMS)
Knowledge Extraction: Summarizing Documents, Extracting Concepts from Documents
Tagging / Organizing: Grouping Documents, Tagging Documents
Matching / Ranking: Finding Similar Topics, Finding Similar Profiles, Building Social Networks, Detecting Plagiarism

Task Environment: LO Repository
Knowledge Extraction: Extracting Metadata, Extracting Ontologies
Tagging / Organizing: Classifying LOs, Building LO Clusters
Matching / Ranking: Detecting Duplicate LOs, Ranking LOs, Metadata Matching
6
LO Mining and Knowledge Extraction
7
Projects Overview
Information Extraction: Analyzing content to
extract relevant information
Keyword Extraction, Summarization, Concept
Extraction, Social Network Analysis
Categorization: Organizing LOs according to their
content
Classification: Traditional, MCS, Imbalanced
Clustering: Traditional, Ensembles, Distributed
Personalization: Providing user-specific results
Reinforcement Learning: Traditional,
Opposition-based
Image Mining: Describing and finding relevant
images
CBIR: Traditional, Fusion-based
Integration and Applications
Software Components
In Progress
Publications
Theme and Industry Collaboration
8
Information Extraction: Summarization
LO Content Package Summarization

Learning objects stored in IMS content packages
are loaded and parsed. Textual content files are
extracted for analysis.
Statistical term weighting and sentence ranking
are performed on each document and on the whole
collection.
The most relevant sentences are extracted for each
document.
Planned functionality: summarization of whole
modules or lessons (as opposed to single
documents).
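
The term-weighting and sentence-ranking steps described above can be sketched roughly as follows. This is a minimal illustration, not the component's actual implementation: it assumes TF-IDF as the statistical term weight and the average term weight as the sentence score, and the names summarize, split_sentences and tokenize are hypothetical.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(text):
    """Lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(documents, top_n=1):
    """Return the top_n highest-scoring sentences of each document, in original order."""
    # Collection-level statistic: number of documents containing each term.
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(tokenize(doc)))
    n_docs = len(documents)

    summaries = []
    for doc in documents:
        tf = Counter(tokenize(doc))
        # Smoothed TF-IDF weight (an assumed choice of term weighting).
        weight = {
            t: tf[t] * (math.log((1 + n_docs) / (1 + doc_freq[t])) + 1)
            for t in tf
        }
        sentences = split_sentences(doc)
        scored = []
        for idx, sentence in enumerate(sentences):
            tokens = tokenize(sentence)
            score = sum(weight[t] for t in tokens) / len(tokens) if tokens else 0.0
            scored.append((score, idx))
        # Highest score first; ties broken by position in the document.
        best = sorted(scored, key=lambda p: (-p[0], p[1]))[:top_n]
        summaries.append([sentences[i] for _, i in sorted(best, key=lambda p: p[1])])
    return summaries
```

Calling summarize on the text files extracted from one content package returns one short list of sentences per file; summarizing a whole module or lesson would need an extra aggregation step that is not shown here.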
Benefits
Provide summarized overview of learning objects
for quick browsing and access to learning
material.
Scenarios
Learning Management Systems can call the
summarization component to produce summaries for
content packages.

Data courtesy of the University of Saskatchewan
9
Information Extraction: Concept Extraction
Concept-Based Statistical Analyser
F-measure of Hierarchical Clustering
Dataset    Single-Term    Concept-based    Improvement (%)
Reuters    0.723          0.925            27.94
ACM        0.697          0.918            31.70
Brown      0.581          0.906            55.93
Entropy of Hierarchical Clustering
Dataset    Single-Term    Concept-based    Improvement (%)
Reuters    0.251          0.012            -95.21
ACM        0.317          0.043            -86.43
Brown      0.385          0.018            -95.32
Conceptual Ontological Graph (COG) Ranking
Precision of Search
Dataset    Single-Term    Concept-based    Improvement (%)
Cran       0.536          0.901            68.09
Reuters    0.591          0.897            51.77
Recall of Search Result
Dataset    Single-Term    Concept-based    Improvement (%)
Cran       0.486          0.827            70.16
Reuters    0.452          0.841            86.06
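
The Improvement figures appear to be the relative change of the concept-based score over the single-term score, expressed as a percentage. The slides do not state the formula, so this is inferred from the numbers, which it matches to within rounding. A small sketch (the function name relative_improvement is hypothetical):

```python
def relative_improvement(baseline, new):
    """Percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# F-measure on Reuters: single-term 0.723 versus concept-based 0.925.
reuters_f = relative_improvement(0.723, 0.925)

# Entropy on ACM: lower is better, so the change is negative.
acm_entropy = relative_improvement(0.317, 0.043)
```

For entropy a large negative value is the desired outcome, since lower entropy means purer clusters.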
10
Information Extraction: Keyword Extraction

Semantic Keyword Extraction
Tasks
Developing tools and techniques to extract
semantic keywords toward facilitating metadata
generation
Developing algorithms to enrich metadata (tags)
which can be applied in index-based multimedia
retrieval
Progress
Proposed a new information theoretic inclusion
index to measure the asymmetric dependency
between terms (and concepts), which can be used
in term selection (keyword extraction) and
taxonomy extraction (pseudo ontology)
Makrehchi, M. and Kamel, M., ICDM 07, WI 07

11
Information Extraction: Keyword Extraction

Rule base size shows quick initial growth,
followed by slow and irregular growth and rule
elimination
Learns 20 rules from the first 50 training rules
Learns 13 additional rules from the next 220
training rules

Rule-based Keyword Extraction

Learn rules to find keywords in English sentences
Rules represent sentence fragments
Specific enough for reliable keyword extraction
General enough to be applied to unseen sentences
Rule generalization
Begin with an exact sentence fragment
Merge it with another fragment by moving the words
that differ to the lowest common level in the
part-of-speech hierarchy
Keep the merged rule if it does not reduce the
precision and recall of keyword extraction; keep
the original rules otherwise
Keyword extraction
Find the sequence of rules that best covers an
unseen sentence
Extract keywords according to the rules
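
The merge step of rule generalization can be illustrated with a toy example. Everything here is an assumption made for illustration: the PARENT table is a made-up part-of-speech hierarchy, fragments are plain lists of tokens of equal length, and the precision/recall check that decides whether to keep a merged rule is left out.

```python
# Hypothetical toy part-of-speech hierarchy, stored as child -> parent.
PARENT = {
    "cat": "NOUN_SG", "dog": "NOUN_SG", "cats": "NOUN_PL",
    "NOUN_SG": "NOUN", "NOUN_PL": "NOUN",
    "chases": "VERB", "sees": "VERB",
    "the": "DET", "a": "DET",
    "NOUN": "ANY", "VERB": "ANY", "DET": "ANY",
}


def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def lowest_common_level(a, b):
    """Most specific node in the hierarchy that covers both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    return None  # the two tokens share no ancestor


def merge_rules(rule_a, rule_b):
    """Merge two equal-length fragments; return None if they cannot be merged."""
    if len(rule_a) != len(rule_b):
        return None
    merged = []
    for a, b in zip(rule_a, rule_b):
        # Identical tokens stay exact; differing tokens are lifted.
        node = a if a == b else lowest_common_level(a, b)
        if node is None:
            return None
        merged.append(node)
    return merged
```

With this hierarchy, merging the fragments "the cat chases" and "the dog sees" keeps "the" exact and lifts the other two positions to NOUN_SG and VERB. A full learner would then evaluate the merged rule on training data before accepting it.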

Both precision and recall values increase during
training
Precision (blue) increases by 10%
Recall (red) shows a slight upward trend

12
Categorization: Ensemble-based Clustering

Consensus Clustering
Categorization of learning objects using proposed
consensus clustering algorithms.
The goal of consensus clustering is to find a
clustering of the data objects that optimally
summarizes an ensemble of multiple clusterings.
Consensus clustering can offer several advantages
over a single data clustering, such as the
improvement of clustering accuracy, enhancing the
scalability of clustering algorithms to large
volumes of data objects, and enhancing the
robustness by reducing the sensitivity to outlier
data objects or noisy attributes.
Tasks
Development of techniques for producing ensembles
of multiple data clusterings where diverse
information about the structure of the data is
likely to occur.
Development of consensus algorithms to aggregate
the individual clusterings.
Development of solutions for the cluster
symbolic-label matching problem.
Empirical analysis on real-world data and
validation of proposed method.
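
One common way to aggregate an ensemble of clusterings is a co-association (evidence accumulation) matrix. The sketch below uses that technique as a stand-in; it is not claimed to be the consensus algorithm proposed by the project. The threshold parameter and the function names are hypothetical.

```python
from itertools import combinations


def co_association(clusterings):
    """Fraction of clusterings in which each pair of objects shares a cluster."""
    n = len(clusterings[0])
    counts = {pair: 0 for pair in combinations(range(n), 2)}
    for labels in clusterings:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(clusterings) for pair, c in counts.items()}


def consensus(clusterings, threshold=0.5):
    """Link pairs co-clustered in more than `threshold` of the ensemble,
    then return the connected components as the consensus clustering."""
    n = len(clusterings[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), fraction in co_association(clusterings).items():
        if fraction > threshold:
            parent[find(j)] = find(i)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    ids = {}
    return [ids.setdefault(find(x), len(ids)) for x in range(n)]
```

Because only "same cluster or not" is compared, the arbitrary label names used by each individual clustering do not matter, which sidesteps the symbolic-label matching problem for this particular scheme. For example, consensus([[0, 0, 1, 1], [1, 1, 0, 0], [0, 0, 0, 1]]) groups the first two objects together and the last two together.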

13
Categorization using cluster ensemble
Dataset                      Samples  Attributes  Classes  K-means mean error rate (%)  Ensembles mean error rate (%)
Synthetic1                   1000     8           5        17.41                        0
Yahoo! (text)                2340     1458        6        38.23                        16.24
Texture (image)              5500     40          11       37.99                        11.54
Optical Digit Recognition    500      64          10       27.31                        16.40
14
Categorization: Distributed Clustering
Hierarchical P2P Document Clustering

Peer nodes are arranged into groups called
neighborhoods.
Multiple neighborhoods are formed at each level
of the hierarchy.
The size of each neighborhood is determined
through a network partitioning factor.
Each neighborhood has a designated supernode.
Supernodes of level h form the neighborhoods for
level h+1.
Clustering is done within neighborhood
boundaries, then merged up the hierarchy
through the supernodes.
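
The way neighborhoods and supernodes stack into levels can be sketched as below. This is a simplified illustration under stated assumptions: a fixed neighborhood_size stands in for the network partitioning factor (whose exact definition is not given in the slides), the first peer of each neighborhood is arbitrarily chosen as its supernode, and build_hierarchy is a hypothetical name. The clustering and merging steps themselves are not shown.

```python
def build_hierarchy(peers, neighborhood_size):
    """Return a list of levels; each level is a list of neighborhoods,
    and each neighborhood is a list of peers."""
    if neighborhood_size < 2:
        raise ValueError("neighborhood_size must be at least 2")
    levels = []
    current = list(peers)
    while current:
        # Split the peers of this level into consecutive neighborhoods.
        neighborhoods = [
            current[i:i + neighborhood_size]
            for i in range(0, len(current), neighborhood_size)
        ]
        levels.append(neighborhoods)
        if len(neighborhoods) == 1:
            break  # a single neighborhood is the top of the hierarchy
        # Assumption: the first peer of each neighborhood is its supernode;
        # the supernodes become the peers of the next level up.
        current = [hood[0] for hood in neighborhoods]
    return levels
```

In a full system each neighborhood would cluster its own documents, and each supernode would pass a summary of the resulting clusters to the level above for merging.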
Benefits
Significant speedup over centralized clustering
and flat peer-to-peer clustering.
Multiple levels of clusters.
Distributed summarization of clusters using
CorePhrase keyphrase extraction.
Scenarios

HP2PC Architecture
HP2PC Example: 3-level network, 16 nodes
15
Categorization: Multiple Classifier Systems

Progress
Proposed a set of evaluation measures to select
sub-optimal training partitions for training
classifier ensembles.
Proposed an ensemble training algorithm called
Clustering, De-clustering, and Selection (CDS).
Proposed and optimized a cooperative training
algorithm called Cooperative Clustering,
De-clustering, and Selection (CO-CDS).
Investigated the applications of proposed
training methods (CDS and CO-CDS) on LO
classification.

Tasks
To investigate various aspects of cooperation in
Multiple Classifier Systems (Classifier
Ensembles)
To develop evaluation measures in order to
estimate various types of cooperation in the
system
To gain insight into the impact of changes in the
cooperative components with respect to system
performance using the proposed evaluation
measures
To apply these findings to optimize existing
ensemble methods
To apply these findings to develop novel ensemble
methods with the goal of improving classification
accuracy and reducing computational complexity

16
Categorization: Imbalanced Class Distribution

Objective
Advance classification of multi-class imbalanced
data
Tasks
To develop the cost-sensitive boosting algorithm
AdaC2.M1
To improve the identification performance on the
important classes
To balance classification performance among
several classes

17
Categorization: Imbalanced Class Distribution
Class Distribution
Class    Size    Dist. (%)
C1       49      7.84
C2       288     46.08
C3       288     46.08
Performance of Base Classification and AdaBoost
                  C4.5                 HPWR (Od3)
Class    Meas.    Base     AdaBoost    Base     AdaBoost
C1       R        0        5.11        10.70    44.06
C1       P        N/A      6.5         11.82    32.89
C1       F        N/A      5.84        10.83    35.84
C2       R        73.21    92.28       88.31    87.43
C2       P        69.53    88.75       86.79    91.99
C2       F        72.29    90.38       87.43    89.64
C3       R        67.94    91.36       87.63    88.42
C3       P        73.89    87.88       87.07    89.91
C3       F        71.91    89.42       86.99    89.03
G-measure         0        11.46       33.32    68.50
Balanced performance among classes - Evaluated by
G-mean
                  C4.5                             HPWR (Od3)
Class    Meas.    Base     AdaBoost    AdaC2.M1    Base     AdaBoost    AdaC2.M1
C1       R        0        5.11        77.58       10.70    44.06       65.72
C1       P        N/A      6.50        14.12       11.82    32.89       30.83
C2       R        73.21    92.28       64.73       88.31    87.43       83.12
C2       P        69.53    88.75       97.24       86.79    91.99       91.38
C3       R        67.94    91.36       65.23       87.63    88.42       83.95
C3       P        73.89    87.88       93.22       87.07    89.91       90.81
G-mean            0        11.46       68.42       33.32    68.50       76.08
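
The slides do not define G-mean. Assuming the usual definition, the geometric mean of the per-class recalls, it can be computed as in this sketch (the function name g_mean is hypothetical):

```python
import math


def g_mean(recalls):
    """Geometric mean of per-class recalls (any common scale, e.g. percent)."""
    if not recalls:
        raise ValueError("at least one recall value is required")
    if any(r <= 0 for r in recalls):
        # One completely missed class drives the whole score to zero.
        return 0.0
    return math.exp(sum(math.log(r) for r in recalls) / len(recalls))
```

The zero case is what makes the measure useful for imbalanced data: a classifier that never identifies the small class scores 0 no matter how well it does on the large classes, as in the C4.5 Base column. Applying this function to the averaged recalls in the table will not necessarily reproduce the reported G-mean row, which may have been averaged over several runs.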
18
Personalization

Opposition-based Reinforcement Learning for
Personalizing Image Search
Developing a reliable technique to assist users
and to facilitate and enhance the learning process
The personalized ORL tool helps the user find the
desired images among the search results
The tool gathers the images in the search results
and selects a sample of them
By presenting the sample and interacting with the
user, it learns the user's preferences

19
Personalization
20
Image Mining: CBIR

Content based image retrieval
Build an IR system that can retrieve images based
on Textual Cues, Image content, NL Queries

Queries: Query Image (QI), Query Text (QT), Query
Document
Image Retrieval Tool Set
Results: documents that contain QI, images that
contain QT, images that match QI, NL description
of an image, rich documents, automated image
tagging

21
Illustrative Example
IZM
FD
Accuracy 55
Accuracy 70
Accuracy 95
Accuracy 60
The proposed approach
MTAR
22
Experimental Results (Cont'd)
The Performance of the proposed approach
23
Integration and Applications

Progress
Finished core parts of the common data mining
framework.
Built components and services from theme
researchers' work around the data mining
framework.
Provided documentation for the data mining
framework and software components.
Launched web site to host components and
documentation from Theme 4:
http://pami.uwaterloo.ca/projects/lornet/software/

24
Integration and Applications

Progress
Core parts of the common data mining framework
are available, including
Vector and matrix manipulation.
Document parsing and tokenization.
Statistical term and sentence analysis.
Similarity calculation using multiple distance
functions.
IMS Content Package compliant parser.
Components and tools built around the common data
mining framework
Metadata extraction from single documents
supports Dublin Core encoding.
Document similarity calculation using cosine
similarity.
Single document and content package
summarization.
Building of standard text datasets from large
document collections.
Integration with TELOS
Developed C TELOS connector for integrating
Theme 4 components.
Worked on component manifest specification with
Theme 6.
Provided metadata extraction as part of a
complete scenario for TELOS components
integration.
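
The cosine-similarity document comparison listed among the components above can be sketched as follows. This is a minimal illustration that uses raw term counts and whitespace tokenization; the framework's own tokenization and term weighting are not described in the slides, and the function name cosine_similarity is hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(doc_a, doc_b):
    """Cosine of the angle between the term-count vectors of two texts."""
    a = Counter(doc_a.lower().split())
    b = Counter(doc_b.lower().split())
    # Only terms present in both documents contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0
```

The score ranges from 0 for documents with no terms in common to 1 for documents with identical term distributions, independent of document length.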

25
Industry Collaboration

Pattern Discovery Software (PDS) provided data
mining software tools for use by researchers.
Vestech provided opportunities for researchers to
work on speech technologies.
Desire2Learn opened job opportunities for LORNET
researchers.

26
Software Components
Overview of Components
Scenarios for Use of Software Components

General Tools
C Connector for TELOS
Common Data Mining Framework
Standard Text Mining Tools
Metadata Extractor
Document Summarizer
Content Package Summarizer
Document Similarity
LO Recommender
Metadata Harvester
Keyword Extractor
Taxonomy Extractor
Metadata Enrichment Tools
Concept-based and Semantic Text Mining Tools
Metadata Extractor
LO Search Engine
Document Similarity

Environment: TELOS
Data Types: Metadata, Ontology
Tasks:
Ontology construction and unification
Finding relations between components
Ranking components
Grouping components
Tagging components

Environment: Learning Object Repository
Data Types: Metadata, Structured Text, Categorical
Tasks:
Automatic metadata extraction
LO automatic classification
LO organization through clustering
Multiple organization strategies through cluster
ensembles

Environment: e-Learning Environment
Data Types: Structured Text, Images, Object
Relationships, Context
Tasks:
Extracting concepts from LO
Summarizing Documents
Grouping LOs
Tagging LOs
Discovering Similar Topics
Discovering Similar Peers
Building Social Networks
Detecting Plagiarism
LO recommendation using similarity ranking
Personalization / Specialization through
reinforcement learning

User-centric Tools
Personalized Search Engine
Social Network Learner
Image Mining Tools
Content-based Image Search
Personalized Image Search
Consensus-based Fusion for Image Retrieval

Legend
Integrated
Ready
In Progress
Year 5

27
Publications
Project                                           Papers (accepted / published)    Papers (submitted / in prep)    Theses (completed / in progress)
4.1 Information Extraction from Text              11                               7                               3/2
4.2 Semantic Knowledge Synthesis from Text        10                               4                               4/1
4.3 Knowledge Discovery through Categorization    12                               10                              4/1
4.4 Knowledge from Interaction                    8                                3                               1/2
4.5 Knowledge from Image Mining                   10                               3                               2/1
Total                                             51                               27                              14/7 (21)
28
Theme 4 Team
Leader: M. Kamel

PIs
Dr. Basir
Dr. Karray
Dr. Tizhoosh
Assoc. PIs: Wong, DiMarco

Researchers
H. Ayad
R. Kashef
A. Ghazel
Dr. Makrehchi
M. Shokri
S. Hassan
A. Farahat
Dr. R. Khoury

Graduated
R. Khoury, PhD 07
L. Chen, PhD 07
M. Makrehchi, PhD 07
K. Hammouda, PhD 07
R. Dara, PhD 07
Y. Sun, PhD 07
K. Shaban, PhD 06
Y. Sun, PhD 06
M. Hussin, PhD 05
Jan Bakus, PhD 05
A. Adegorite, MASc 04
A. Khandani, MASc 05
S. Podder, MASc 04

Industry
PDS
Vestech
Desire2Learn

Funding
CRC/CFI/OIT
NSERC
PAMI Lab

29
Pattern Analysis and Machine Intelligence Lab
Electrical and Computer Engineering
University of Waterloo, Canada
www.pami.uwaterloo.ca
