1
A Survey on Text Categorization with Machine Learning
  • Chikayama lab.
  • Dai Saito

2
Introduction: Text Categorization
  • Many digital texts are available
  • E-mail, online news, blogs
  • The need for automatic text categorization is increasing
  • No human effort required
  • Saves time and cost

3
Introduction: Text Categorization
  • Applications
  • Spam filtering
  • Topic categorization

4
Introduction: Machine Learning
  • Builds categorization rules automatically from features of the text
  • Types of Machine Learning (ML)
  • Supervised learning
  • Labeling
  • Unsupervised learning
  • Clustering

5
Introduction: Flow of ML
  • Prepare labeled training texts
  • Extract features from each text
  • Learn
  • Categorize new texts

[Figure: a new text ("?") is assigned to Label 1 or Label 2]
6
Outline
  • Introduction
  • Text Categorization
  • Feature of Text
  • Learning Algorithm
  • Conclusion

7
Number of labels
  • Binary-label
  • True or false (e.g. spam or not)
  • Can be applied to the other types
  • Multi-label
  • Many labels, but each text has exactly one label
  • Overlapping-label
  • One text can have several labels

[Figure: example label assignments for the binary, multi-label, and overlapping-label cases]
8
Types of labels
  • Topic categorization
  • The basic task
  • Compares individual words
  • Author categorization
  • Sentiment categorization
  • Ex) product reviews
  • Needs more linguistic information

9
Outline
  • Introduction
  • Text Categorization
  • Feature of Text
  • Learning Algorithm
  • Conclusion

10
Feature of Text
  • How to express a feature of a text?
  • Bag of Words
  • Ignores the order of words
  • Structure
  • Ex) "I like this car." vs. "I don't like this car."
  • Bag of Words will not work well
  • (d: document = text)
  • (t: term = word)
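As a sketch of the point above: a minimal bag-of-words representation (toy tokenizer, no real NLP library) shows that the two opposite example sentences differ in only one feature once word order is discarded.

```python
from collections import Counter

def bag_of_words(text):
    """Count term occurrences, ignoring word order (toy whitespace tokenizer)."""
    return Counter(text.lower().split())

d1 = bag_of_words("I like this car")
d2 = bag_of_words("I don't like this car")
# The two sentences have opposite meanings but share almost every feature:
shared = set(d1) & set(d2)
```

This is why purely bag-of-words features can fail on sentiment-style tasks: the negation that flips the meaning is just one more term.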

11
Preprocessing
  • Remove stop words
  • "the", "a", "for", ...
  • Stemming
  • relational → relate, truly → true
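A minimal sketch of these two preprocessing steps, with an invented three-word stop list and a toy suffix-rule "stemmer" hard-coded to reproduce the slide's two examples (a real system would use something like the Porter stemmer):

```python
import re

STOP_WORDS = {"the", "a", "for"}  # tiny illustrative list

def stem(word):
    """Toy stemmer: a couple of hard-coded suffix rules, for illustration only."""
    rules = [("ational", "ate"), ("uly", "ue"), ("ly", "")]
    for suffix, repl in rules:
        if word.endswith(suffix):
            return word[: -len(suffix)] + repl
    return word

def preprocess(text):
    """Lowercase, tokenize, drop stop words, stem."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("The truly relational model")` drops "the" and stems the rest.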

12
Term Weighting
  • Term Frequency (tf)
  • Number of occurrences of a term in a document
  • Terms frequent in a document seem important for categorization
  • tf-idf
  • Terms appearing in many documents are not useful for categorization
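The two ideas combine as tf-idf: term frequency times inverse document frequency. A minimal sketch using the common idf(t) = log(N / df(t)) variant (toy documents, whitespace tokenization):

```python
import math
from collections import Counter

def tfidf(docs):
    """Weight each term by tf * log(N / df): frequent in this document,
    rare across the collection."""
    N = len(docs)
    tfs = [Counter(doc.lower().split()) for doc in docs]
    df = Counter()                      # in how many documents each term occurs
    for tf in tfs:
        df.update(tf.keys())
    return [{t: tf[t] * math.log(N / df[t]) for t in tf} for tf in tfs]

weights = tfidf(["spam offer now", "meeting now", "offer offer now"])
```

"now" appears in every document, so its idf (hence its weight) is zero; "offer" gets a higher weight where it occurs twice.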

13
Sentiment Weighting
  • For sentiment classification, weight a word as positive or negative
  • Constructing a sentiment dictionary
  • WordNet [Kamps et al. '04]
  • A synonym database
  • Uses the distance from "good" and "bad"

d(good, happy) = 2
d(bad, happy) = 4
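A sketch of this idea: shortest-path distance in a synonym graph, with polarity read off as d(bad, w) − d(good, w). The graph below is an invented stand-in for WordNet's synonym links, chosen so that the distances match the slide's example.

```python
from collections import deque

# Toy synonym graph (edges invented for illustration, not real WordNet data):
SYNONYMS = {
    "good": ["nice", "fine"],
    "nice": ["good", "happy", "fine"],
    "fine": ["good", "nice", "poor"],
    "happy": ["nice"],
    "poor": ["fine", "bad", "unhappy"],
    "bad": ["poor"],
    "unhappy": ["poor"],
}

def distance(a, b):
    """Shortest path length between two words, via breadth-first search."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        word, d = queue.popleft()
        if word == b:
            return d
        for nxt in SYNONYMS.get(word, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return float("inf")

def polarity(word):
    """Positive value: the word sits closer to 'good' than to 'bad'."""
    return distance("bad", word) - distance("good", word)
```

Here "happy" is 2 steps from "good" and 4 from "bad", so its polarity is positive.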
14
Dimension Reduction
  • Size of the feature matrix is (#terms) × (#documents)
  • #terms ≈ size of the dictionary
  • High calculation cost
  • Risk of overfitting
  • Best for training data ≠ best for real data
  • Choose effective features
  • to improve accuracy and calculation cost

15
Dimension Reduction
  • df-threshold
  • Terms appearing in very few documents (e.g. only one) are not important
  • Score-based selection
  • If t and cj are independent, the score equals zero
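The df-threshold step can be sketched in a few lines (toy documents; `min_df` is the cutoff, with `min_df=2` dropping terms that occur in only one document):

```python
from collections import Counter

def df_filter(docs, min_df=2):
    """Keep only terms whose document frequency is at least min_df."""
    df = Counter()
    for doc in docs:
        df.update(set(doc.lower().split()))   # count each term once per document
    return {t for t, n in df.items() if n >= min_df}

vocab = df_filter(["spam offer", "offer meeting", "meeting today"])
```

Here "spam" and "today" each appear in a single document and are pruned from the feature set.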

16
Outline
  • Introduction
  • Text Categorization
  • Feature of Text
  • Learning Algorithm
  • Conclusion

17
Learning Algorithm
  • Many (almost all?) algorithms have been used for Text Categorization
  • Simple approaches
  • Naïve Bayes
  • k-Nearest Neighbor
  • High-performance approaches
  • Boosting
  • Support Vector Machine
  • Hierarchical Learning

18
Naïve Bayes
  • Bayes' rule: P(c | d) = P(d | c) P(c) / P(d)
  • P(d | c) is hard to calculate directly
  • Assumption: each term occurs independently
  • So P(d | c) ≈ ∏i P(ti | c)
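A minimal multinomial Naïve Bayes sketch under that independence assumption, with add-one (Laplace) smoothing so unseen terms do not zero out the product; log-probabilities avoid underflow. The training data below is invented.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative sketch)."""

    def fit(self, docs, labels):
        self.term_counts = defaultdict(Counter)   # per-class term counts
        self.class_counts = Counter(labels)       # for the prior P(c)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            terms = doc.lower().split()
            self.term_counts[label].update(terms)
            self.vocab.update(terms)

    def predict(self, doc):
        n = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c, count in self.class_counts.items():
            # log P(c) + sum of log P(t | c), terms assumed independent
            score = math.log(count / n)
            total = sum(self.term_counts[c].values())
            for t in doc.lower().split():
                score += math.log(
                    (self.term_counts[c][t] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best, best_score = c, score
        return best

nb = NaiveBayes()
nb.fit(["buy cheap pills", "cheap offer now", "meeting at noon",
        "project meeting notes"], ["spam", "spam", "ham", "ham"])
```

Despite the (wrong) independence assumption, this classifier is a standard, fast baseline for text.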

19
k-Nearest Neighbor
  • Define a distance between two texts
  • Ex) Sim(d1, d2) = (d1 · d2) / (|d1| |d2|) = cos θ
  • Check the k most similar texts and categorize by majority vote
  • The larger the training data, the higher the memory and search cost
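A sketch of this cosine-similarity kNN on raw term-count vectors (toy labeled documents; a real system would use tf-idf weights and an index instead of the linear scan, which is exactly the cost the slide mentions):

```python
import math
from collections import Counter

def cosine(d1, d2):
    """Sim(d1, d2) = d1 . d2 / (|d1| |d2|), i.e. cos(theta) between term vectors."""
    dot = sum(d1[t] * d2[t] for t in d1)
    norm = math.sqrt(sum(v * v for v in d1.values())) * \
           math.sqrt(sum(v * v for v in d2.values()))
    return dot / norm if norm else 0.0

def knn_predict(train, new_doc, k=3):
    """Majority vote among the k training documents most similar to new_doc."""
    vec = Counter(new_doc.lower().split())
    neighbors = sorted(
        train,
        key=lambda dl: cosine(Counter(dl[0].lower().split()), vec),
        reverse=True,
    )[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [("cheap pills offer", "spam"), ("cheap offer now", "spam"),
         ("team meeting notes", "ham"), ("meeting agenda today", "ham"),
         ("buy cheap now", "spam")]
```

Note that `train` is scanned in full for every prediction: kNN defers all work to query time.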

[Figure: the k = 3 nearest neighbors around a new document decide its label]
20
Boosting
  • BoosTexter [Schapire et al. '00]
  • AdaBoost
  • Builds many weak learners with different parameters
  • The K-th weak learner checks the performance of learners 1..K-1 and tries to correctly classify the training data they scored worst on
  • BoosTexter uses decision stumps as weak learners
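A heavily simplified sketch of this scheme: binary AdaBoost whose weak learners are one-word decision stumps, in the spirit of BoosTexter (not its actual multi-label algorithm). Labels are +1/−1; after each round, misclassified examples gain weight, so the next stump focuses on them. All data is invented.

```python
import math

def train_adaboost(docs, labels, rounds=5):
    """AdaBoost with one-word decision stumps. A stump (word, sign)
    predicts sign if the word is present in the document, -sign otherwise."""
    sets = [set(d.lower().split()) for d in docs]
    vocab = sorted(set().union(*sets))
    w = [1.0 / len(docs)] * len(docs)              # example weights
    ensemble = []                                  # (word, sign, alpha)

    def weighted_error(word, sign):
        return sum(wi for wi, s, y in zip(w, sets, labels)
                   if (sign if word in s else -sign) != y)

    for _ in range(rounds):
        # Best stump = lowest weighted error under the current weights
        word, sign = min(((wd, sg) for wd in vocab for sg in (+1, -1)),
                         key=lambda ws: weighted_error(*ws))
        preds = [(sign if word in s else -sign) for s in sets]
        err = min(max(weighted_error(word, sign), 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)    # stump's vote weight
        ensemble.append((word, sign, alpha))
        # Re-weight: misclassified examples get more weight next round
        w = [wi * math.exp(-alpha * p * y) for wi, p, y in zip(w, preds, labels)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, doc):
    s = set(doc.lower().split())
    score = sum(a * (sg if wd in s else -sg) for wd, sg, a in ensemble)
    return 1 if score >= 0 else -1

docs = ["cheap pills now", "cheap offer", "win money now",
        "team meeting", "meeting notes", "project plan"]
labels = [1, 1, 1, -1, -1, -1]
ens = train_adaboost(docs, labels)
```

Each stump alone is a weak classifier; the weighted vote of all rounds is the strong one.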

21
Simple example of Boosting
22
Support Vector Machine
  • Text Categorization with SVM [Joachims '98]
  • Maximizes the margin between classes
23
Text Categorization with SVM
  • SVM works well for Text Categorization
  • Robust to high dimensionality
  • Robust to overfitting
  • Most Text Categorization problems are linearly separable
  • All of OHSUMED (MEDLINE collection)
  • Most of Reuters-21578 (news collection)
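Since these problems are linearly separable, a linear SVM suffices. As a toy sketch (not Joachims' SVMlight), here is a linear SVM trained by Pegasos-style stochastic subgradient descent on the hinge loss, over sparse {feature index: value} vectors; data and indices are invented.

```python
import random

def train_linear_svm(X, y, dim, epochs=200, lam=0.01):
    """Toy linear SVM: minimize hinge loss + L2 regularizer by stochastic
    subgradient descent. X holds sparse {index: value} vectors, y holds +1/-1."""
    w = [0.0] * dim
    t = 0
    random.seed(0)                                 # deterministic toy run
    for _ in range(epochs):
        for i in random.sample(range(len(X)), len(X)):
            t += 1
            eta = 1.0 / (lam * t)                  # decaying step size
            margin = y[i] * sum(w[j] * v for j, v in X[i].items())
            for j in range(dim):                   # shrink: the L2 regularizer
                w[j] *= (1 - eta * lam)
            if margin < 1:                         # inside the margin: push out
                for j, v in X[i].items():
                    w[j] += eta * y[i] * v
    return w

def svm_predict(w, x):
    return 1 if sum(w[j] * v for j, v in x.items()) >= 0 else -1

# Toy vocabulary: 0=cheap, 1=offer, 2=meeting, 3=notes
X = [{0: 1, 1: 1}, {0: 1}, {2: 1, 3: 1}, {2: 1}]
y = [1, 1, -1, -1]
w = train_linear_svm(X, y, dim=4)
```

The regularizer is what drives the margin maximization the slide refers to: among all separating hyperplanes, the smallest-norm weight vector has the largest margin.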

24
Comparison of these methods
  • [Sebastiani '02]
  • Reuters-21578 (2 versions, with different numbers of categories)

Method        Ver.1 (90)   Ver.2 (10)
k-NN          .860         .823
Naïve Bayes   .795         .815
Boosting      .878         -
SVM           .870         .920
25
Hierarchical Learning
  • TreeBoost [Esuli et al. '06]
  • A Boosting algorithm for hierarchical labels
  • Training data: a label hierarchy plus labeled texts
  • Applies AdaBoost recursively
  • A better classifier than flat AdaBoost
  • Accuracy: up 2-3 points
  • Time: both training and categorization time go down
  • Hierarchical SVM [Cai et al. '04]
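The recursive idea can be sketched independently of the learner: classify at the root, then descend into the chosen child's subtree and repeat. Here a simple word-overlap score against per-child centroids stands in for the per-node boosted classifiers TreeBoost actually trains; the hierarchy and documents are invented.

```python
from collections import Counter

def centroid(docs):
    """Merged bag of words for one child's training documents."""
    return Counter(" ".join(docs).lower().split())

def classify(tree, doc):
    """Route a document down the label hierarchy, one decision per level.
    tree maps label -> (training docs for that subtree, subtree or None)."""
    terms = doc.lower().split()
    label = max(tree, key=lambda c: sum(centroid(tree[c][0])[t] for t in terms))
    _, subtree = tree[label]
    return [label] + (classify(subtree, doc) if subtree else [])

tree = {
    "L1": (["cheap pills offer", "win money now"], {
        "L11": (["cheap pills offer"], None),
        "L12": (["win money now"], None),
    }),
    "L2": (["team meeting agenda"], None),
}
```

Categorization time drops because each document faces only one small decision per level instead of one flat decision over all leaf labels.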

26
TreeBoost
[Figure: label hierarchy — root with children L1..L4; L1 splits into L11, L12; L4 into L41..L43; L42 into L421, L422]
27
Outline
  • Introduction
  • Text Categorization
  • Feature of Text
  • Learning Algorithm
  • Conclusion

28
Conclusion
  • An overview of Text Categorization with Machine Learning
  • Feature of Text
  • Learning Algorithm
  • Future work
  • Natural Language Processing with Machine Learning, especially in Japanese
  • Calculation cost