Text Mining - PowerPoint PPT Presentation

About This Presentation
Title:

Text Mining

Description:

Data Mining is the discovery of interesting, unexpected or valuable structures ... 7.3 million documents are increasing on internet on each day. ... – PowerPoint PPT presentation

Slides: 16
Provided by: Inb2
Category:
Tags: mining | text


Transcript and Presenter's Notes

Title: Text Mining


1
Text Mining
??? ???? ?????? ??????
  • ????? ????
  • ???? ??????? ?????
  • ????? ?????

2
Text Mining
  • Data Mining
  • Data Mining is the discovery of interesting,
    unexpected or valuable structures in large data
    sets.
  • Text Mining
  • Text Mining deals with unstructured textual
    information and attempts to discover the structure
    and implicit meanings buried within the text.

3
Why Text Mining?
  • A huge amount of digital text data is available on
    computers and the internet. According to a study by
    the University of California, Berkeley (2000):
  • The internet has about 2.5 billion documents.
  • 7.3 million documents are added to the internet
    each day.
  • No human can manually extract information from
    this volume of text; automated systems are needed
    for this purpose.

4
Fields of Text Mining
  • Categorization
  • Assigning a new document to one of the defined
    categories of documents
  • Clustering
  • Finding clusters/categories in a given set of
    documents
  • Summarization
  • Term Extraction
  • Extracting important terms and keywords used in
    the document

5
Fields of Text Mining (Contd.)
  • Information Retrieval
  • Finding related document(s) corresponding to a
    query
  • Information/Feature Extraction
  • Extraction of (processable) information from a
    given document
  • Thematic Indexing
  • Using knowledge about the meaning of words to
    identify the broad topics covered in the document.
  • ..

6
Flow of Text Mining Application
  • Remove stop words from the document.
  • Reduce the words to their stems.
  • Assign a weight to each word.
  • Select n words as the feature vector that will
    represent the document.
  • Note: this is only one method; NLP and other
    methods can also be used in text mining.
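
The four steps above can be sketched in Python as follows. This is a minimal illustration, not the presenter's implementation: the stop-word list, the `naive_stem` suffix rule, and the use of raw term frequency as the weight are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical, tiny stop-word list; a real system would use a fuller one.
STOP_WORDS = {"in", "about", "for", "is", "am", "he", "she", "yours",
              "the", "a", "of", "and"}

def naive_stem(word):
    """Toy stemmer (assumption): strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def feature_vector(text, n):
    """Return the n highest-weighted stems of a document with their weights."""
    words = [w.strip(".,;:!?()\"'").lower() for w in text.split()]
    words = [w for w in words if w and w not in STOP_WORDS]  # step 1: remove stop words
    stems = [naive_stem(w) for w in words]                   # step 2: stem
    weights = Counter(stems)                                 # step 3: weight = term frequency
    return weights.most_common(n)                            # step 4: keep n features

print(feature_vector("He is mining text. Text mining mines the texts for meaning.", 2))
```

Each step is a separate line in `feature_vector`, so any one of them can be swapped for a better component (a real stemmer, a different weighting) without touching the rest.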

7
Stop Words
  • These words carry no information of their own.
    Their purpose is to connect useful words.
  • English Stop Words
  • in, about, for, is, am, he, she, yours, ..
  • Urdu Stop Words
  • ??? ???? ??? ??? ??? ??? ??? ???
    ????..............
  • Sindhi Stop Words
  • ?? ??? ??? ??? ??????? ??? ??? ???? ???? ........

8
Stemming
  • Stripping a word down to its stem (root). The stem
    can differ from the word's linguistic root.
  • Advantage ???????? ????????? ?????????? all
    have same stem ???
  • Disadvantage: two different words can have the
    same stem (example: "International" and "Internal"
    may both have the stem "Intern").
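
The collision between "International" and "Internal" can be reproduced with a very small suffix stripper. The suffix list below is hypothetical and chosen only to show the effect; it is not the Porter or Lovins rule set, and real stemmers may treat these two words differently.

```python
# Hypothetical suffix list, longest first, chosen only for this illustration.
SUFFIXES = ("ational", "ation", "al")

def strip_suffix(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

for w in ("International", "Internal"):
    print(w, "->", strip_suffix(w))
```

Both words are reduced to `intern`, so a system that compares stems can no longer tell them apart.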

9
English Stemming Algorithms
  • Output of the Porter and Lovins algorithms

10
Urdu Stemming Algorithm
  • Some Examples of Stemming
  • UrduWord ye.noonghunna → stemword
  • ?? ??? - ?? ???
  • UrduWord wao.noonghunna → stemword
  • ????? - ?? ???
  • UrduWord tay.alif → stemword
  • ????? - ?? ???
  • aXYaZ → XYZ
  • ????? ???

11
Sindhi Stemming Algorithm
  • Some Examples of Stemming
  • SindhiWord wao → stemword
  • Example???? ? ???
  • SindhiWord alif → stemword
  • Example ???? ? ???

12
Weight Assignment to Words
  • Every word is assigned a weight depending on the
    following factors:
  • A word that occurs frequently in a document gets
    a high weight.
  • A word that occurs in many documents gets a low
    weight.
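
One common weighting that combines these two factors is TF-IDF. The slide does not name a formula, so the exact form used here (raw count multiplied by the logarithm of the inverse document frequency) is an assumption made for the sketch.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: weight} dict per document.

    documents: list of token lists.
    weight = (count of word in the document)
             * log(number of documents / number of documents containing the word)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each word once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append(
            {w: c * math.log(n_docs / doc_freq[w]) for w, c in counts.items()}
        )
    return weights

docs = [
    ["text", "mining", "text", "data"],
    ["data", "mining", "structure"],
    ["data", "stemming"],
]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

In the first document, "text" occurs twice and appears nowhere else, so it gets the highest weight; "data" appears in every document, so its weight is zero.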

13
Summarizer Algorithm
  • Assign a weight to each sentence using the
    frequencies of the words occurring in it and the
    length of the sentence.
  • Select the sentences with the highest weights as
    the summary.
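
A minimal sketch of such a summarizer is shown below. The slide does not say how word frequency and sentence length are combined; dividing the summed word frequencies by the sentence length is an assumption, as are the simple regular expressions used to split sentences and words.

```python
import re
from collections import Counter

def summarize(text, k):
    """Return the k highest-weighted sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenized = [re.findall(r"[a-z']+", s.lower()) for s in sentences]
    freq = Counter(w for words in tokenized for w in words)

    def weight(words):
        # Assumption: average word frequency, so that long sentences are not
        # favored merely for containing more words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: weight(tokenized[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]

text = ("Text mining finds structure in text. "
        "The weather was pleasant yesterday. "
        "Mining text needs stemming.")
print(summarize(text, 2))
```

In practice the word frequencies would be computed after stop-word removal and stemming, so that connecting words do not inflate a sentence's weight.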

14
Term Detection Algorithm
  • Form all possible one-, two- and three-word
    combinations of words.
  • Treat each of these combinations as a single word.
  • Select the n combinations with the highest
    frequencies.
  • Remove (conditionally) a word if it is part of a
    larger combination.
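
The steps above can be sketched as follows. Several details are assumptions, because the slide leaves them open: combinations are taken from consecutive words only, the removal condition is "a longer combination containing this one occurs exactly as often", and the pruning is applied before the top-n cut so that n terms are still returned.

```python
from collections import Counter

def candidate_terms(words, n):
    """Return the n most frequent 1-, 2- and 3-word sequences after pruning."""
    counts = Counter()
    for size in (1, 2, 3):
        for i in range(len(words) - size + 1):
            counts[tuple(words[i : i + size])] += 1

    def is_redundant(gram):
        # Assumed condition: drop a gram when a longer gram containing it
        # occurs exactly as often, i.e. the gram never appears on its own.
        return any(
            len(other) > len(gram)
            and counts[other] == counts[gram]
            and any(other[j : j + len(gram)] == gram
                    for j in range(len(other) - len(gram) + 1))
            for other in counts
        )

    kept = {g: c for g, c in counts.items() if not is_redundant(g)}
    # Most frequent first; ties go to longer grams, then alphabetical order.
    ranked = sorted(kept.items(), key=lambda item: (-item[1], -len(item[0]), item[0]))
    return [" ".join(g) for g, _ in ranked[:n]]

words = ["text", "mining", "tools", "help", "text", "mining",
         "research", "text", "analysis"]
print(candidate_terms(words, 2))
```

Here "mining" is dropped because it only ever occurs inside "text mining", while "text" survives because it also occurs in "text analysis".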

15
Categorization Algorithm
  • Compute the centroid of all documents in each
    category.
  • Find the distance between the new document and
    the centroid of each category.
  • Assign the new document to the category with the
    smallest distance.
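
A minimal sketch of this nearest-centroid scheme follows. It assumes that documents have already been turned into equal-length numeric feature vectors and that distance means Euclidean distance; the slide does not specify the measure, and cosine distance is another common choice. The category names and vectors are made up for the example.

```python
import math

def centroid(vectors):
    """Component-wise mean of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def categorize(new_vector, categories):
    """categories: {name: list of document vectors}. Return the nearest category."""
    centroids = {name: centroid(vecs) for name, vecs in categories.items()}
    return min(centroids, key=lambda name: euclidean(new_vector, centroids[name]))

# Hypothetical term-weight vectors over the vocabulary (goal, match, vote, party).
training = {
    "sports":   [[3, 2, 0, 0], [2, 3, 0, 1]],
    "politics": [[0, 0, 3, 2], [1, 0, 2, 3]],
}
print(categorize([2, 2, 0, 0], training))
```

Only one centroid per category has to be stored and compared, which keeps classification cheap even when the training set is large.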