Text Mining

Slides: 16

Provided by: Inb2

Transcript and Presenter's Notes

1
Text Mining

2
Text Mining

Data Mining
Data Mining is the discovery of interesting,
unexpected or valuable structures in large data
sets.
Text Mining
Text Mining deals with unstructured textual
information and attempts to discover structure
and implicit meanings buried within the text.

3
Why Text Mining?

A huge amount of digital text data is available on
computers and the internet. According to a study by
the University of California, Berkeley (2000):
The internet has about 2.5 billion documents.
7.3 million documents are added to the internet
each day.
No human can manually extract information from this
volume of text, so automated systems are needed for
this purpose.

4
Fields of Text Mining

Categorization
Assigning a new document to one of a set of
predefined document categories
Clustering
Finding clusters/categories in a given set of
documents
Summarization
Term Extraction
Extracting important terms and keywords used in
the document

5
Fields of Text Mining (Contd.)

Information Retrieval
Finding the document(s) relevant to a query
Information/Feature Extraction
Extraction of (processable) information from a
given document
Thematic Indexing
Using knowledge about the meanings of words to
identify the broad topics covered in a document.
..

6
Flow of Text Mining Application

Remove stop words from the document.
Reduce the words to their stems.
Assign a weight to each word.
Select n words as the feature vector that will
represent the document.
Note: This is only one method. NLP and other
methods can also be used in text mining.
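
The four steps above can be sketched roughly as follows. This is a minimal
illustration, not the presenters' implementation: the stop-word entries beyond
the eight listed on the Stop Words slide, the crude_stem helper, and the use of
raw frequency as the weight are all assumptions made for the example.

```python
import re
from collections import Counter

# First eight entries come from the slides; the rest are assumed for the demo.
STOP_WORDS = {"in", "about", "for", "is", "am", "he", "she", "yours",
              "the", "a", "of", "and", "to"}


def crude_stem(word):
    """Hypothetical placeholder stemmer: strips a few English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def feature_vector(text, n=3):
    """Return the n highest-weighted stems of a document with their weights."""
    words = re.findall(r"[a-z]+", text.lower())
    content = [w for w in words if w not in STOP_WORDS]  # step 1: stop words
    stems = [crude_stem(w) for w in content]             # step 2: stemming
    weights = Counter(stems)                             # step 3: raw frequency
    return weights.most_common(n)                        # step 4: select n


text = "He is mining texts and she mined the text for mining patterns."
print(feature_vector(text, n=2))  # [('min', 3), ('text', 2)]
```

Note that the crude stemmer reduces "mining" and "mined" to "min", which is a
stem rather than a linguistic root.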

7
Stop Words

These words carry no information of their own.
Their purpose is to connect the useful words.
English Stop Words
in, about, for, is, am, he, she, yours, ..
Urdu Stop Words
[Urdu examples not legible in this transcript]
Sindhi Stop Words
[Sindhi examples not legible in this transcript]

8
Stemming

Stripping a word down to its stem (root). The stem
can differ from the word's linguistic root.
Advantage: Related word forms all have the same
stem [examples not legible in this transcript].
Disadvantage: Two different words can have the same
stem (for example, "International" and "Internal"
may both have the stem "Intern").
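
Both effects can be reproduced with a toy suffix stripper. This is only a
sketch: it is not the Porter or Lovins algorithm, and the suffix list is
hypothetical, chosen just to make the example work.

```python
# Hypothetical suffix list, longest suffixes first so they are tried first.
SUFFIXES = ("ational", "ing", "ed", "al", "s")


def toy_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Advantage: related forms collapse to one stem.
print([toy_stem(w) for w in ("connects", "connected", "connecting")])
# ['connect', 'connect', 'connect']

# Disadvantage: unrelated words can collapse to the same stem as well.
print(toy_stem("International"), toy_stem("Internal"))  # intern intern
```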

9
English Stemming Algorithms

Output of the Porter and Lovins algorithms

10
Urdu Stemming Algorithm

Some Examples of Stemming
UrduWord ye.noonghunna -> stemword
UrduWord wao.noonghunna -> stemword
UrduWord tay.alif -> stemword
aXYaZ -> XYZ
[Urdu examples for each rule not legible in this transcript]
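
One possible reading of these rules is that a word ending in the named letters
is reduced to its stem by dropping them, and that a five-letter word with alif
in the first and fourth positions loses both alifs. The sketch below follows
that reading. The Unicode code points chosen for each letter, the min_stem
guard, and the sample words are assumptions, not taken from the slides.

```python
# Letter names follow the slide; the code points are assumed.
YE = "\u06CC"           # ARABIC LETTER FARSI YEH
WAO = "\u0648"          # ARABIC LETTER WAW
NOON_GHUNNA = "\u06BA"  # ARABIC LETTER NOON GHUNNA
TAY = "\u062A"          # ARABIC LETTER TEH
ALIF = "\u0627"         # ARABIC LETTER ALEF

SUFFIXES = (YE + NOON_GHUNNA, WAO + NOON_GHUNNA, TAY + ALIF)


def urdu_stem(word, min_stem=2):
    """Apply the aXYaZ -> XYZ pattern first, then the three suffix rules."""
    if len(word) == 5 and word[0] == ALIF and word[3] == ALIF:
        return word[1] + word[2] + word[4]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Sample stems (assumed for illustration): kitab "book", kar "do", qism "kind".
KITAB = "\u06A9\u062A\u0627\u0628"
KAR = "\u06A9\u0631"
QISM = "\u0642\u0633\u0645"

print(urdu_stem(KITAB + YE + NOON_GHUNNA) == KITAB)               # True
print(urdu_stem(KAR + TAY + ALIF) == KAR)                         # True
print(urdu_stem(ALIF + QISM[0] + QISM[1] + ALIF + QISM[2]) == QISM)  # True
```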

11
Sindhi Stemming Algorithm

Some Examples of Stemming
SindhiWord wao -> stemword
SindhiWord alif -> stemword
[Sindhi examples for each rule not legible in this transcript]

12
Weight Assignment to Words

Every word is assigned a weight that depends on the
following factors:
A word that occurs frequently in a document gets a
high weight.
A word that occurs in many documents gets a low
weight.
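
These two factors correspond to what is commonly formalized as TF-IDF
weighting. The slides do not give a formula, so the variant below (relative
term frequency times the natural log of inverse document frequency) is an
assumption, shown only to make the idea concrete.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: list of token lists. Returns one {word: weight} dict per document."""
    n_docs = len(docs)
    # Number of documents in which each word appears at least once.
    doc_freq = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return weights


docs = [["text", "mining", "text"], ["data", "mining"]]
w = tf_idf(docs)
# "text" is frequent in document 0 and absent elsewhere, so it scores high.
# "mining" occurs in every document, so its weight is zero.
print(w[0]["text"] > w[0]["mining"], w[0]["mining"])  # True 0.0
```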

13
Summarizer Algorithm

Assign a weight to each sentence using the
frequencies of the words occurring in it and the
length of the sentence.
Select the highest-weighted sentences as the summary.
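
A minimal sketch of such a summarizer follows. The slides do not say how word
frequency and sentence length are combined, so dividing the summed word
frequencies by the sentence length is an assumption, as is the simple regex
used to split sentences.

```python
import re
from collections import Counter


def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)

    def score(words):
        # Assumed weighting: average document-wide frequency of the words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n_sentences])]


text = "Text mining finds structure in text. Cats sleep. Text mining uses text."
print(summarize(text, 1))  # ['Text mining uses text.']
```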

14
Term Detection Algorithm

Form all possible one-, two-, and three-word
combinations of words.
Treat each of these combinations as a single word.
Select the n combinations with the highest frequencies.
Conditionally remove a word if it is part of a
larger combination.
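
The steps above can be sketched as follows. Two points are assumptions, since
the slides leave them open: "combinations" is read as runs of consecutive
words (n-grams), and the removal condition is taken to be "the shorter term
occurs inside a longer selected term with the same frequency".

```python
from collections import Counter


def detect_terms(tokens, n=5, max_len=3):
    """Return up to n frequent 1- to max_len-word terms as tuples of words."""
    counts = Counter(
        tuple(tokens[i:i + size])
        for size in range(1, max_len + 1)
        for i in range(len(tokens) - size + 1)
    )
    top = [term for term, _ in counts.most_common(n)]

    def inside(short, long):
        # True if `short` appears as a contiguous run within `long`.
        return len(short) < len(long) and any(
            long[i:i + len(short)] == short
            for i in range(len(long) - len(short) + 1)
        )

    # Assumed condition: drop a term that only ever occurs as part of a
    # longer selected term (same frequency).
    return [t for t in top
            if not any(inside(t, o) and counts[t] == counts[o] for o in top)]


tokens = "text mining and text mining".split()
print(detect_terms(tokens, n=3))  # [('text', 'mining')]
```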

15
Categorization Algorithm

Compute the centroid of all documents in each
category.
Find the distance between the new document and the
centroid of each category.
Assign the new document to the category with the
smallest distance.
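
A compact sketch of this nearest-centroid scheme is given below. It assumes
each document is already a fixed-length list of word weights (its feature
vector) and uses Euclidean distance; the slides do not name a distance measure,
and the category names and numbers are made up for the example.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally long feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def categorize(new_vector, categories):
    """categories: {name: [vector, ...]}. Returns the nearest category name."""
    centroids = {name: centroid(vecs) for name, vecs in categories.items()}
    return min(centroids,
               key=lambda name: math.dist(new_vector, centroids[name]))


# Hypothetical two-feature example.
categories = {
    "sports": [[1.0, 0.0], [0.8, 0.2]],
    "tech": [[0.0, 1.0], [0.2, 0.8]],
}
print(categorize([0.9, 0.1], categories))  # sports
```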
