From Textual Information to Numerical Vectors - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: From Textual Information to Numerical Vectors


1
From Textual Information to Numerical Vectors
  • Shilpa Venkataramana
  • Kent State University

2
Introduction
  • To mine text, we need to process it into a form that data mining procedures can use.
  • From the earlier chapter, this involves generating features in a spreadsheet format.
  • Classical data mining looks at highly structured data.
  • The spreadsheet model is the embodiment of a representation that supports predictive modeling.
  • Predictive text mining is simpler and more restrictive than open-ended data mining.
  • Text is called unstructured because it is very far from the spreadsheet model into which we need to put data for prediction.

3
Introduction
  • Transformation of the data to the spreadsheet model is a methodical and carefully organized procedure that fills in the cells of a spreadsheet.
  • We have to determine the nature of each column in the spreadsheet.
  • Some features are easy to obtain (e.g., the words in a text); others, such as the grammatical function of a word in a sentence, are difficult.
  • To discuss:
  • How to obtain the kinds of features generated from text.

4
Collecting Documents
  • The first step in text mining is to collect the data (the documents).
  • A web page retrieval application for an intranet implicitly specifies the relevant documents to be the web pages on the intranet.
  • If the documents are identified, then they can be obtained; the main issue is to cleanse the samples and ensure high quality.
  • For a web application comprising a number of autonomous websites, one may deploy a software tool such as a web crawler to collect the documents.

5
Collecting Documents
  • In other applications you may have a logging process attached to an input data stream for a length of time (e.g., for an email audit you log the incoming and outgoing messages at the mail server for a period of time).
  • For R&D work in text mining we need generic data: a corpus.
  • The accompanying software uses the Reuters corpus (RCV1).
  • In the early days (1960s and 1970s), 1 million words was considered a large collection; the Brown corpus of that size consists of 500 samples of about 2,000 words each of American English text.

6
Collecting Documents
  • A European corpus modeled on the Brown corpus was built for British English.
  • In the 1970s and 80s more resources became available, often government sponsored.
  • A widely used corpus is the Penn Treebank (a collection of manually parsed sentences from the Wall Street Journal).
  • Another resource is the World Wide Web. Web crawlers can build collections of pages from a particular site such as Yahoo. Given the size of the web, such collections require cleaning before use.

7
Document Standardization
  • When documents are collected, they may arrive in different formats.
  • Some documents may be collected in Word format, others as simple ASCII text. To process these documents we have to convert them to a standard format.
  • Standard format: XML.
  • XML is the Extensible Markup Language.

8
Document Standardization-XML
  • XML is a standard way to insert tags into text to identify its parts.
  • Each document is marked off from the corpus through XML.
  • XML uses tags such as:
  • <Date>
  • <Subject>
  • <Topic>
  • <Text>
  • <Body>
  • <Header>

9
XML An Example
  • <?xml version="1.0" encoding="ISO-8859-1"?>
  • <note>
  •   <to>Tove</to>
  •   <from>Jani</from>
  •   <heading>Reminder</heading>
  •   <body>Don't forget me this weekend!</body>
  • </note>

10
XML
  • The main reason to identify the parts is to allow selection of those parts that are used to generate features.
  • The selected parts of a document are concatenated into strings, separated by tags (a small sketch follows below).
  • Document standardization: why should we care?
  • The advantage of data standardization is that mining tools can be applied without having to consider the pedigree of each document.
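  • As an illustrative sketch (not from the slides), selecting and concatenating the tagged parts of the sample note document above could look like this in Python, using the standard xml.etree.ElementTree module; the chosen tag names are simply those of the example:

    import xml.etree.ElementTree as ET

    doc = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>"""

    root = ET.fromstring(doc)
    # Keep only the parts we want to generate features from.
    selected = " ".join(root.findtext(tag, default="") for tag in ("heading", "body"))
    print(selected)  # Reminder Don't forget me this weekend!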

11
Tokenization
  • The documents are collected in XML format; now we examine the data.
  • Break the character stream into words: TOKENS.
  • Each token is an instance of a type, so the number of tokens is higher than the number of types.
  • If "the" occurs twice in a sentence, those are 2 tokens of one type; a token refers to an occurrence of a type.
  • The characters space and tab are not tokens but white space.
  • A comma or colon between letters is a token (e.g., USA,INDIA); between digits it is a delimiter (e.g., 121,135).
  • The apostrophe has a number of uses (delimiter or part of a token), e.g., D'Angelo.
  • When it is followed by a terminator, it is an internal quote (e.g., Tess'.).

12
Tokenization - Pseudocode
  • A dash is a terminator of a token when it is preceded or followed by another dash (e.g., 522-3333).
  • Without identifying tokens, it is difficult to imagine extracting higher-level information from a document.
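  • The slide's pseudocode was not transcribed; as a hedged sketch, a simple regular-expression tokenizer reflecting the rules above (white space separates tokens, commas and dashes stay inside numbers, apostrophes stay inside words, remaining punctuation becomes its own token) might look like this:

    import re

    # Numbers may keep internal commas/dashes (121,135 or 522-3333); words may
    # keep internal apostrophes (D'Angelo); other punctuation is its own token.
    TOKEN = re.compile(r"\d+(?:[,\-]\d+)*|\w+(?:'\w+)*|[^\w\s]")

    def tokenize(text):
        return TOKEN.findall(text)

    print(tokenize("D'Angelo called 522-3333 from USA,INDIA."))
    # ["D'Angelo", 'called', '522-3333', 'from', 'USA', ',', 'INDIA', '.']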

13
Lemmatization
  • Once a character stream has been segmented into a sequence of tokens, what is the next step?
  • Convert each token to a standard form: stemming or lemmatization (application dependent).
  • This reduces the number of distinct types in the corpus and increases the frequency of occurrence of individual types.
  • English speakers agree that the nouns Book and Books are two forms of the same word; it is an advantage to eliminate this kind of variation.
  • Normalization that regularizes only grammatical variants is called inflectional stemming.

14
Stemming to a Root
  • Grammatical variants include singular/plural and present/past.
  • It is always advantageous to eliminate this kind of variation before further processing.
  • When normalization is confined to regular grammatical variants such as singular/plural and present/past, the process is called inflectional stemming.
  • The intent of stemming to a root is to reach a form with no inflectional or derivational prefixes or suffixes; the end result is aggressive stemming.
  • Example: it reduces the number of types in the text.

15
Stemming Pseudocode
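  • The stemming pseudocode itself was not transcribed; a minimal sketch of inflectional stemming with a few suffix rules (not the slide's algorithm, and far less careful than a real stemmer such as Porter's) might look like this:

    def inflectional_stem(word):
        w = word.lower()
        # strip or rewrite a few common inflectional suffixes
        for suffix, replacement in (("ies", "y"), ("sses", "ss"),
                                    ("ing", ""), ("ed", ""), ("s", "")):
            if w.endswith(suffix) and len(w) - len(suffix) >= 2:
                return w[: len(w) - len(suffix)] + replacement
        return w

    print([inflectional_stem(w) for w in ["Books", "studies", "exiting", "walked"]])
    # ['book', 'study', 'exit', 'walk']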
16
Vector Generation for prediction
  • Consider the problem of categorizing documents.
  • The characteristic features are the tokens or words the documents contain.
  • Without deep analysis, we can choose to describe each document by features that represent the most frequent tokens.
  • The collection of features is called a dictionary.
  • The tokens or words in the dictionary form the basis for creating a spreadsheet of numeric data corresponding to the document collection.
  • Each row is a document; each column is a feature.

17
Vector Generation for prediction
  • A cell in the spreadsheet is a measurement of a feature for a document.
  • In the basic model of the data, we simply check the presence or absence of words.
  • Checking for words is simple because we do not scan the whole dictionary for each word; we build a hash table (see the sketch after this list). Large samples of digital documents are readily available, which gives confidence about the variations and combinations of words that occur.
  • If prediction is our goal, then we need one more column for the correct answer.
  • In preparing data for learning, information is available from the document labels. Our labels are binary answers (also called the class).
  • Instead of generating a global dictionary, we can consider only the words in the class that we are trying to predict (a local dictionary).
  • If this class is far smaller than the negative class, which is typical, the local dictionary is far smaller than the global dictionary.
  • Another reduction in dictionary size is to compile a list of stopwords and remove them from the dictionary.
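  • A minimal sketch of this step (illustrative only, with made-up sample documents): build the dictionary as a hash table mapping each non-stopword to a column index, then fill a 0/1 presence row per document and append the label as the last column.

    docs = ["the broken vase", "the clean vase", "a broken window"]   # sample text
    labels = [1, 0, 1]                                                # class column
    stopwords = {"the", "a"}

    vocab = {}                                  # dictionary: word -> column index
    for doc in docs:
        for word in doc.split():
            if word not in stopwords and word not in vocab:
                vocab[word] = len(vocab)

    rows = []
    for doc, label in zip(docs, labels):
        row = [0] * len(vocab)
        for word in doc.split():
            if word in vocab:                   # constant-time hash-table lookup
                row[vocab[word]] = 1
        rows.append(row + [label])              # last column is the right answer
    print(vocab)
    print(rows)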

18
  • Stopwords almost never have any predictive capability; examples are articles such as a and the and pronouns such as it and they.
  • Frequency information on the word counts can be quite useful in reducing the dictionary size and improving predictive performance.
  • Very frequent words are often stopwords and can be deleted.
  • An alternative to local dictionary generation is to generate a global dictionary from all documents in the collection. Special feature selection routines then attempt to select the subset of words that has the greatest potential for prediction (independent selection methods).
  • If we have 100 topics to categorize, then we have 100 problems to solve. Our choices are 100 small dictionaries or 1 global dictionary.

19
  • The vectors implied by the spreadsheet model are regenerated to correspond to the smaller dictionary.
  • Instead of placing every variation of a word in the dictionary, we can follow the approach of a printed dictionary and avoid storing every variation (no separate singular/plural or past/present forms).
  • Verbs are stored in stemmed form.
  • This adds a layer of complexity to text processing, but performance improves and the dictionary size is reduced.
  • A universal procedure of trimming words to their root form can blur differences in meaning (e.g., exit and exiting have different meanings in the context of programming).
  • With a small dictionary you can capture the best words easily.
  • The use of tokens and stemming are examples of procedures that help produce smaller dictionaries, improving the manageability of learning and its accuracy.
  • Documents can thus be converted to a spreadsheet.

20
  • Each column is a feature; each row is a document.
  • The model of data for predictive text mining is a spreadsheet populated by ones and zeros.
  • The cells represent the presence or absence of dictionary words in the document collection. For higher accuracy, additional transformations can be applied.
  • They are:
  • Word pairs and collocations
  • Frequency
  • Tf-idf
  • Word pairs and collocations serve to increase the size of the dictionary and improve the performance of prediction.
  • Instead of 0s and 1s in the cells, the frequency of the word can be used (if the word "the" occurs 10 times, the count 10 is used; see the small sketch below).
  • Counts can give better results than binary values in the cells.
  • These often lead to solutions similar to those of the binary data model, yet the additional frequency information can yield simpler solutions.
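  • As a tiny illustrative sketch, the frequency variant simply replaces the 0/1 check with a count per dictionary word:

    from collections import Counter

    counts = Counter("the broken vase and the clean vase".split())
    print(counts["the"], counts["vase"], counts["broken"])   # 2 2 1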

21
  • Frequencies are helpful in prediction but add complexity to the solutions.
  • A compromise that works is a three-value system, 0/1/2 (see the sketch after this list):
  • Word did not occur: 0
  • Word occurred once: 1
  • Word occurred 2 or more times: 2
  • This captures much of the added value of frequency without adding much complexity to the model.
  • Another variant is zeroing values below a threshold, so that a token must reach a minimum frequency before being considered of any use.
  • This reduces the complexity of the spreadsheet used by data mining algorithms.
  • Other methods to reduce complexity are chi-square, mutual information, odds ratio, etc.
  • The next step beyond counting frequency is to modify the count by the perceived importance of that word.
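  • A small illustrative sketch of the three-value coding and the threshold variant described above (the threshold value 3 is just an example):

    def three_value(count):
        # 0 = absent, 1 = occurred once, 2 = occurred two or more times
        return 0 if count == 0 else 1 if count == 1 else 2

    def thresholded(count, min_freq=3):
        # zero out counts below a minimum frequency
        return count if count >= min_freq else 0

    counts = [0, 1, 5, 2]
    print([three_value(c) for c in counts])   # [0, 1, 2, 2]
    print([thresholded(c) for c in counts])   # [0, 0, 5, 0]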

22
  • Tf-idf computes the weightings or scores of words.
  • The values are positive numbers, so we capture more than the absence or presence of the words.
  • In Eq. (a) the weight assigned to word j is its term frequency modified by a scale factor for the importance of the word. The scale factor is the inverse document frequency, Eq. (b); a small computational sketch follows below.
  • It simply checks the number of documents containing the word, df(j), and reverses the scaling.
  • tf-idf(j) = tf(j) × idf(j)   (Eq. a)
  • idf(j) = log(N / df(j))   (Eq. b)
  • When a word appears in every document, the scale factor drops, perhaps to zero; if the word is rare and appears in few documents, the scale factor zooms upward and the word appears important.
  • Alternatives to this tf-idf formulation exist, but the motivation is the same. The result is a positive score that replaces the simple frequency or binary (true/false) entry in each cell of the spreadsheet.
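  • A minimal sketch that follows Eq. (a) and Eq. (b) directly (the variable names are ours, not the book's):

    import math

    def tf_idf(tf_j, df_j, N):
        """tf_j: count of word j in the document; df_j: number of documents
        containing word j; N: total number of documents."""
        idf_j = math.log(N / df_j)     # Eq. (b)
        return tf_j * idf_j            # Eq. (a)

    # A word in every one of 1000 documents scores 0; a rarer word scores higher.
    print(tf_idf(3, 1000, 1000))   # 0.0
    print(tf_idf(3, 10, 1000))     # about 13.8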

23
  • Another variant is to weight tokens differently depending on which part of the document they come from.
  • Which data transformation method is best?
  • There is no universal answer.
  • The best predictive accuracy depends on matching these methods with the data and the prediction methods.
  • The best variation for one method may not be the best for another; test them all.
  • We describe the data as populating a spreadsheet, but most cells are 0.
  • Each document contains only a small subset of the dictionary words.
  • In text classification, a text corpus has thousands of dictionary words, while each individual document has relatively few unique tokens.
  • Most of the spreadsheet row for a document is therefore 0. Rather than store all the 0s, it is better to represent the spreadsheet as a set of sparse vectors (each row is a list of pairs, where one element of the pair is a column index and the other is the corresponding nonzero value). By not storing the zeros, we save a great deal of memory.
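  • A minimal illustrative sketch of the sparse representation: keep only (column, nonzero value) pairs for each row.

    dense_row = [0, 0, 3, 0, 0, 0, 1, 0]      # counts over the whole dictionary
    sparse_row = [(col, val) for col, val in enumerate(dense_row) if val != 0]
    print(sparse_row)                         # [(2, 3), (6, 1)]

    def cell(sparse_row, col):
        # recover any cell value without having stored the zeros
        return dict(sparse_row).get(col, 0)

    print(cell(sparse_row, 2), cell(sparse_row, 5))   # 3 0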

24
Multi Word Features
  • So far, features are associated with single words (tokens delimited by white space).
  • This simple scenario can be extended to include pairs of words, e.g., bon and vivant; instead of separating them, we could treat bon vivant as a single feature.
  • Why stop at pairs? Why not consider multiword features?
  • Unlike word pairs, the words in a multiword feature need not be consecutive.
  • E.g., with Don Smith as a feature, we can ignore his middle name Leroy, which may appear in some references to the person.
  • In this case we have to accommodate references to the noun that involve a number of adjectives, with the desired adjective not adjacent to the noun. E.g., we want to accept the phrase "broken and dirty vase" as an instance of "broken vase".

25
  • A multiword feature is a set of x words occurring within a maximum window of size y (y > x, naturally).
  • How are such features extracted from text? Specialized methods are needed.
  • If we use frequency-based methods, we look for combinations of words that are relatively frequent.
  • A straightforward implementation enumerates combinations of x words within a window of y tokens (sketched below).
  • Measuring the value of a multiword feature is done via the correlation between the words in the potential feature; measures such as mutual information or the likelihood ratio are used.
  • An algorithm for generating multiword features exists, but a straightforward implementation consumes a lot of memory.
  • Multiword features are not found too often in a document collection, but they are highly predictive.
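  • A rough illustrative sketch of the straightforward (memory-hungry) approach: enumerate every combination of x words inside each window of y consecutive tokens and count how often each combination appears. A real implementation would also avoid recounting the same occurrence in overlapping windows and would rank candidates by frequency, mutual information, or a likelihood ratio.

    from collections import Counter
    from itertools import combinations

    def multiword_candidates(tokens, x=2, y=4):
        counts = Counter()
        for start in range(len(tokens) - y + 1):
            window = tokens[start:start + y]
            for combo in combinations(window, x):   # words need not be adjacent
                counts[combo] += 1
        return counts

    tokens = "the broken and dirty vase near the broken vase".split()
    print(multiword_candidates(tokens)[("broken", "vase")])   # 2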

26
(No Transcript)
27
  • Labels for Right Answers
  • For prediction, an extra column is added to the spreadsheet.
  • This last column contains the labels and looks no different from the others.
  • It is a 0 or 1 indicating the right answer, i.e., true or false.
  • In the sparse vector format, labels are appended to each vector separately as either a one (positive class) or a zero (negative class).
  • Feature Selection by Attribute Ranking
  • In addition to frequency-based approaches, feature selection can be done in a number of ways.
  • Select a set of features for each category to form a local dictionary for that category.
  • Independently rank the feature attributes according to their predictive abilities for the category under consideration.
  • The predictive ability of an attribute can be measured by a quantity expressing how it is correlated with the label.
  • Assume there are n documents; let x denote the presence or absence of attribute j in a document and y denote the label of the document in the last column.

28
  • A commonly used ranking score is the information gain criterion, L(j).
  • The quantity L(j) is the number of bits required to encode the label and the attribute j minus the number of bits required to encode the attribute alone.
  • Several probability estimates are needed to compute L(j); they can be easily estimated from document counts using simple estimators.
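  • The slide's formula and estimators were not transcribed. As a hedged sketch of the quantity described above (bits to encode the label together with attribute j, minus bits to encode the attribute alone), the score can be estimated from simple counts over the n documents:

    import math
    from collections import Counter

    def entropy(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

    def L(xs, ys):
        """xs: presence (1) or absence (0) of attribute j per document;
        ys: 0/1 class labels from the last column."""
        joint = Counter(zip(xs, ys))
        return entropy(list(joint.values())) - entropy(list(Counter(xs).values()))

    xs = [1, 1, 0, 0, 1, 0]
    ys = [1, 1, 0, 0, 0, 1]
    print(round(L(xs, ys), 3))   # fewer remaining bits: attribute says more about the label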

29
Sentence Boundary Determination
  • If the XML markup for the corpus does not mark sentence boundaries, it is necessary to mark the sentences ourselves.
  • It is necessary to determine when a period is part of a token and when it is not.
  • For more sophisticated processing such as linguistic parsing, the algorithms often require a complete sentence as input.
  • Extraction algorithms operate on text a sentence at a time.
  • These algorithms work best when sentences are identified clearly.
  • Sentence boundary determination is the problem of deciding which instances of a period followed by white space are sentence delimiters and which are not (we assume the characters ? and ! always end a sentence); it is thus a classification problem (a simple sketch follows).
  • Measuring the algorithm's accuracy and making adjustments will give better performance.
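  • A tiny rule-based sketch of this classification decision (illustrative only; the abbreviation list and the two clues, a known abbreviation before the period and a capitalized word after it, are our own simplification):

    import re

    ABBREVIATIONS = {"dr", "mr", "mrs", "p.m", "a.m", "e.g", "etc"}   # assumed list

    def sentence_boundaries(text):
        decisions = []
        # every period followed by white space and another word is a candidate
        for m in re.finditer(r"\.(?=\s+(\S+))", text):
            prev_word = text[:m.start()].split()[-1].lower().rstrip(".")
            next_word = m.group(1)
            is_boundary = prev_word not in ABBREVIATIONS and next_word[:1].isupper()
            decisions.append((m.start(), is_boundary))
        return decisions

    text = "Dr. Smith arrived at 9 p.m. yesterday. He left today."
    print(sentence_boundaries(text))   # only the period after "yesterday" is a boundary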

30
(No Transcript)
31
Thank you!