From Textual Information to Numerical Vectors - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: From Textual Information to Numerical Vectors


1
From Textual Information to Numerical Vectors
  • Shilpa Venkataramana
  • Kent State University

2
Introduction
  • To mine text, we need to process it into a form that data mining procedures can use.
  • From the earlier chapter, this involves generating features in a spreadsheet format.
  • Classical data mining looks at highly structured data.
  • The spreadsheet model is the embodiment of a representation that supports predictive modeling.
  • Predictive text mining is simpler and more restrictive than open-ended data mining.
  • Text is called unstructured because it is very far from the spreadsheet model into which we need to put data for prediction.

3
Introduction
  • Transformation of the data to the spreadsheet model is a methodical and carefully organized procedure that fills in the cells of a spreadsheet.
  • We have to determine the nature of each column in the spreadsheet.
  • Some features are easy to obtain (e.g., the words in a text); others, such as the grammatical function of a word in a sentence, are difficult.
  • To discuss:
  • How to obtain the kinds of features generated from text.

4
Collecting Documents
  • The first step in text mining is to collect the data (the documents).
  • A web page retrieval application for an intranet implicitly specifies the relevant documents to be the web pages on the intranet.
  • If the documents are identified, then they can be obtained; the main issue is to cleanse the samples and ensure high quality.
  • For a web application comprising a number of autonomous websites, one may deploy a software tool such as a web crawler to collect the documents.

5
Collecting Documents
  • In other applications you may have a logging process attached to an input data stream for a length of time (e.g., for an email audit you log the incoming and outgoing messages at the mail server for a period of time).
  • For R&D work in text mining we need generic data: a corpus.
  • The accompanying software uses the Reuters corpus (RCV1).
  • In the early days (1960s and 1970s), 1 million words was considered a large collection; the Brown corpus of that size consists of 500 samples of about 2,000 words each of American English text.

6
Collecting Documents
  • A European corpus modeled on the Brown corpus was built for British English.
  • In the 1970s and 80s more resources became available, often government sponsored.
  • A widely used corpus is the Penn Treebank (a collection of manually parsed sentences from the Wall Street Journal).
  • Another resource is the World Wide Web. Web crawlers can build collections of pages from a particular site such as Yahoo. Given the size of the web, such collections require cleaning before use.

7
Document Standardization
  • When documents are collected, they may arrive in different formats.
  • Some documents may be collected in Word format, others as simple ASCII text. To process these documents we have to convert them to a standard format.
  • Standard format: XML.
  • XML is the Extensible Markup Language.

8
Document Standardization-XML
  • XML is a standard way to insert tags into text to identify its parts.
  • Each document is marked off from the corpus through XML.
  • XML uses tags such as:
  • <Date>
  • <Subject>
  • <Topic>
  • <Text>
  • <Body>
  • <Header>

9
XML An Example
  • <?xml version="1.0" encoding="ISO-8859-1"?>
  • <note>
  •   <to>Tove</to>
  •   <from>Jani</from>
  •   <heading>Reminder</heading>
  •   <body>Don't forget me this weekend!</body>
  • </note>

10
XML
  • The main reason to identify the parts is to allow selection of those parts that are used to generate features.
  • The selected parts of a document are concatenated into strings, separated by tags (a small sketch follows below).
  • Document standardization: why should we care?
  • The advantage of data standardization is that mining tools can be applied without having to consider the pedigree of each document.
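  • As an illustrative sketch (not from the slides), selecting and concatenating the tagged parts of the sample note document above could look like this in Python, using the standard xml.etree.ElementTree module; the chosen tag names are simply those of the example:

    import xml.etree.ElementTree as ET

    doc = b"""<?xml version="1.0" encoding="ISO-8859-1"?>
    <note>
      <to>Tove</to>
      <from>Jani</from>
      <heading>Reminder</heading>
      <body>Don't forget me this weekend!</body>
    </note>"""

    root = ET.fromstring(doc)
    # Keep only the parts we want to generate features from.
    selected = " ".join(root.findtext(tag, default="") for tag in ("heading", "body"))
    print(selected)  # Reminder Don't forget me this weekend!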

11
Tokenization
  • The documents are collected in XML format; now we examine the data.
  • Break the character stream into words: TOKENS.
  • Each token is an instance of a type, so the number of tokens is higher than the number of types.
  • If "the" occurs twice in a sentence, those are 2 tokens of one type; a token refers to an occurrence of a type.
  • The characters space and tab are not tokens but white space.
  • A comma or colon between letters is a token (e.g., USA,INDIA); between digits it is a delimiter (e.g., 121,135).
  • The apostrophe has a number of uses (delimiter or part of a token), e.g., D'Angelo.
  • When it is followed by a terminator, it is an internal quote (e.g., Tess'.).

12
Tokenization - Pseudocode
  • A dash is a terminator of a token when it is preceded or followed by another dash (e.g., 522-3333).
  • Without identifying tokens, it is difficult to imagine extracting higher-level information from a document.
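  • The slide's pseudocode was not transcribed; as a hedged sketch, a simple regular-expression tokenizer reflecting the rules above (white space separates tokens, commas and dashes stay inside numbers, apostrophes stay inside words, remaining punctuation becomes its own token) might look like this:

    import re

    # Numbers may keep internal commas/dashes (121,135 or 522-3333); words may
    # keep internal apostrophes (D'Angelo); other punctuation is its own token.
    TOKEN = re.compile(r"\d+(?:[,\-]\d+)*|\w+(?:'\w+)*|[^\w\s]")

    def tokenize(text):
        return TOKEN.findall(text)

    print(tokenize("D'Angelo called 522-3333 from USA,INDIA."))
    # ["D'Angelo", 'called', '522-3333', 'from', 'USA', ',', 'INDIA', '.']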

13
Lemmatization
  • Once a character stream has been segmented into a sequence of tokens, what is the next step?
  • Convert each token to a standard form: stemming or lemmatization (application dependent).
  • This reduces the number of distinct types in the corpus and increases the frequency of occurrence of individual types.
  • English speakers agree that the nouns Book and Books are two forms of the same word; it is an advantage to eliminate this kind of variation.
  • Normalization that regularizes only grammatical variants is called inflectional stemming.

14
Stemming to a Root
  • Grammatical variants include singular/plural and present/past.
  • It is always advantageous to eliminate this kind of variation before further processing.
  • When normalization is confined to regular grammatical variants such as singular/plural and present/past, the process is called inflectional stemming.
  • The intent of stemming to a root is to reach a form with no inflectional or derivational prefixes or suffixes; the end result is aggressive stemming.
  • Example: it reduces the number of types in the text.

15
Stemming Pseudocode
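  • The stemming pseudocode itself was not transcribed; a minimal sketch of inflectional stemming with a few suffix rules (not the slide's algorithm, and far less careful than a real stemmer such as Porter's) might look like this:

    def inflectional_stem(word):
        w = word.lower()
        # strip or rewrite a few common inflectional suffixes
        for suffix, replacement in (("ies", "y"), ("sses", "ss"),
                                    ("ing", ""), ("ed", ""), ("s", "")):
            if w.endswith(suffix) and len(w) - len(suffix) >= 2:
                return w[: len(w) - len(suffix)] + replacement
        return w

    print([inflectional_stem(w) for w in ["Books", "studies", "exiting", "walked"]])
    # ['book', 'study', 'exit', 'walk']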
16
Vector Generation for prediction
  • Consider the problem of categorizing documents.
  • The characteristic features are the tokens or words the documents contain.
  • Without deep analysis, we can choose to describe each document by features that represent the most frequent tokens.
  • The collection of features is called a dictionary.
  • The tokens or words in the dictionary form the basis for creating a spreadsheet of numeric data corresponding to the document collection.
  • Each row is a document; each column is a feature.

17
Vector Generation for prediction
  • A cell in the spreadsheet is a measurement of a feature for a document.
  • In the basic model of the data, we simply check the presence or absence of words.
  • Checking for words is simple because we do not scan the whole dictionary for each word; we build a hash table (see the sketch after this list). Large samples of digital documents are readily available, which gives confidence about the variations and combinations of words that occur.
  • If prediction is our goal, then we need one more column for the correct answer.
  • In preparing data for learning, information is available from the document labels. Our labels are binary answers (also called the class).
  • Instead of generating a global dictionary, we can consider only the words in the class that we are trying to predict (a local dictionary).
  • If this class is far smaller than the negative class, which is typical, the local dictionary is far smaller than the global dictionary.
  • Another reduction in dictionary size is to compile a list of stopwords and remove them from the dictionary.
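  • A minimal sketch of this step (illustrative only, with made-up sample documents): build the dictionary as a hash table mapping each non-stopword to a column index, then fill a 0/1 presence row per document and append the label as the last column.

    docs = ["the broken vase", "the clean vase", "a broken window"]   # sample text
    labels = [1, 0, 1]                                                # class column
    stopwords = {"the", "a"}

    vocab = {}                                  # dictionary: word -> column index
    for doc in docs:
        for word in doc.split():
            if word not in stopwords and word not in vocab:
                vocab[word] = len(vocab)

    rows = []
    for doc, label in zip(docs, labels):
        row = [0] * len(vocab)
        for word in doc.split():
            if word in vocab:                   # constant-time hash-table lookup
                row[vocab[word]] = 1
        rows.append(row + [label])              # last column is the right answer
    print(vocab)
    print(rows)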

18
  • Stopwords almost never have any predictive capability; examples are articles such as a and the and pronouns such as it and they.
  • Frequency information on the word counts can be quite useful in reducing the dictionary size and improving predictive performance.
  • Very frequent words are often stopwords and can be deleted.
  • An alternative to local dictionary generation is to generate a global dictionary from all documents in the collection. Special feature selection routines then attempt to select the subset of words that has the greatest potential for prediction (independent selection methods).
  • If we have 100 topics to categorize, then we have 100 problems to solve. Our choices are 100 small dictionaries or 1 global dictionary.

19
  • The vectors implied by the spreadsheet model are regenerated to correspond to the smaller dictionary.
  • Instead of placing every variation of a word in the dictionary, we can follow the approach of a printed dictionary and avoid storing every variation (no separate singular/plural or past/present forms).
  • Verbs are stored in stemmed form.
  • This adds a layer of complexity to text processing, but performance improves and the dictionary size is reduced.
  • A universal procedure of trimming words to their root form can blur differences in meaning (e.g., exit and exiting have different meanings in the context of programming).
  • With a small dictionary you can capture the best words easily.
  • The use of tokens and stemming are examples of procedures that help produce smaller dictionaries, improving the manageability of learning and its accuracy.
  • Documents can thus be converted to a spreadsheet.

20
  • Each column is a feature; each row is a document.
  • The model of data for predictive text mining is a spreadsheet populated by ones and zeros.
  • The cells represent the presence or absence of dictionary words in the document collection. For higher accuracy, additional transformations can be applied.
  • They are:
  • Word pairs and collocations
  • Frequency
  • Tf-idf
  • Word pairs and collocations serve to increase the size of the dictionary and improve the performance of prediction.
  • Instead of 0s and 1s in the cells, the frequency of the word can be used (if the word "the" occurs 10 times, the count 10 is used; see the small sketch below).
  • Counts can give better results than binary values in the cells.
  • These often lead to solutions similar to those of the binary data model, yet the additional frequency information can yield simpler solutions.
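  • As a tiny illustrative sketch, the frequency variant simply replaces the 0/1 check with a count per dictionary word:

    from collections import Counter

    counts = Counter("the broken vase and the clean vase".split())
    print(counts["the"], counts["vase"], counts["broken"])   # 2 2 1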

21
  • Frequencies are helpful in prediction but add complexity to the solutions.
  • A compromise that works is a three-value system, 0/1/2 (see the sketch after this list):
  • Word did not occur: 0
  • Word occurred once: 1
  • Word occurred 2 or more times: 2
  • This captures much of the added value of frequency without adding much complexity to the model.
  • Another variant is zeroing values below a threshold, so that a token must reach a minimum frequency before being considered of any use.
  • This reduces the complexity of the spreadsheet used by data mining algorithms.
  • Other methods to reduce complexity are chi-square, mutual information, odds ratio, etc.
  • The next step beyond counting frequency is to modify the count by the perceived importance of that word.
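  • A small illustrative sketch of the three-value coding and the threshold variant described above (the threshold value 3 is just an example):

    def three_value(count):
        # 0 = absent, 1 = occurred once, 2 = occurred two or more times
        return 0 if count == 0 else 1 if count == 1 else 2

    def thresholded(count, min_freq=3):
        # zero out counts below a minimum frequency
        return count if count >= min_freq else 0

    counts = [0, 1, 5, 2]
    print([three_value(c) for c in counts])   # [0, 1, 2, 2]
    print([thresholded(c) for c in counts])   # [0, 0, 5, 0]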

22
  • Tf-idf computes the weightings or scores of words.
  • The values are positive numbers, so we capture more than the absence or presence of the words.
  • In Eq. (a) the weight assigned to word j is its term frequency modified by a scale factor for the importance of the word. The scale factor is the inverse document frequency, Eq. (b); a small computational sketch follows below.
  • It simply checks the number of documents containing the word, df(j), and reverses the scaling.
  • tf-idf(j) = tf(j) × idf(j)   (Eq. a)
  • idf(j) = log(N / df(j))   (Eq. b)
  • When a word appears in every document, the scale factor drops, perhaps to zero; if the word is rare and appears in few documents, the scale factor zooms upward and the word appears important.
  • Alternatives to this tf-idf formulation exist, but the motivation is the same. The result is a positive score that replaces the simple frequency or binary (true/false) entry in each cell of the spreadsheet.
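  • A minimal sketch that follows Eq. (a) and Eq. (b) directly (the variable names are ours, not the book's):

    import math

    def tf_idf(tf_j, df_j, N):
        """tf_j: count of word j in the document; df_j: number of documents
        containing word j; N: total number of documents."""
        idf_j = math.log(N / df_j)     # Eq. (b)
        return tf_j * idf_j            # Eq. (a)

    # A word in every one of 1000 documents scores 0; a rarer word scores higher.
    print(tf_idf(3, 1000, 1000))   # 0.0
    print(tf_idf(3, 10, 1000))     # about 13.8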

23
  • Another variant is to weight tokens differently depending on which part of the document they come from.
  • Which data transformation method is best?
  • There is no universal answer.
  • The best predictive accuracy depends on matching these methods with the data and the prediction methods.
  • The best variation for one method may not be the best for another; test them all.
  • We describe the data as populating a spreadsheet, but most cells are 0.
  • Each document contains only a small subset of the dictionary words.
  • In text classification, a text corpus has thousands of dictionary words, while each individual document has relatively few unique tokens.
  • Most of the spreadsheet row for a document is therefore 0. Rather than store all the 0s, it is better to represent the spreadsheet as a set of sparse vectors (each row is a list of pairs, where one element of the pair is a column index and the other is the corresponding nonzero value). By not storing the zeros, we save a great deal of memory.
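  • A minimal illustrative sketch of the sparse representation: keep only (column, nonzero value) pairs for each row.

    dense_row = [0, 0, 3, 0, 0, 0, 1, 0]      # counts over the whole dictionary
    sparse_row = [(col, val) for col, val in enumerate(dense_row) if val != 0]
    print(sparse_row)                         # [(2, 3), (6, 1)]

    def cell(sparse_row, col):
        # recover any cell value without having stored the zeros
        return dict(sparse_row).get(col, 0)

    print(cell(sparse_row, 2), cell(sparse_row, 5))   # 3 0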

24
Multi Word Features
  • So far, features are associated with single words (tokens delimited by white space).
  • This simple scenario can be extended to include pairs of words, e.g., bon and vivant; instead of separating them, we could treat bon vivant as a single feature.
  • Why stop at pairs? Why not consider multiword features?
  • Unlike word pairs, the words in a multiword feature need not be consecutive.
  • E.g., with Don Smith as a feature, we can ignore his middle name Leroy, which may appear in some references to the person.
  • In this case we have to accommodate references to the noun that involve a number of adjectives, with the desired adjective not adjacent to the noun. E.g., we want to accept the phrase "broken and dirty vase" as an instance of "broken vase".

25
  • A multiword feature is a set of x words occurring within a maximum window of size y (y > x, naturally).
  • How are such features extracted from text? Specialized methods are needed.
  • If we use frequency-based methods, we look for combinations of words that are relatively frequent.
  • A straightforward implementation enumerates combinations of x words within a window of y tokens (sketched below).
  • Measuring the value of a multiword feature is done via the correlation between the words in the potential feature; measures such as mutual information or the likelihood ratio are used.
  • An algorithm for generating multiword features exists, but a straightforward implementation consumes a lot of memory.
  • Multiword features are not found too often in a document collection, but they are highly predictive.
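  • A rough illustrative sketch of the straightforward (memory-hungry) approach: enumerate every combination of x words inside each window of y consecutive tokens and count how often each combination appears. A real implementation would also avoid recounting the same occurrence in overlapping windows and would rank candidates by frequency, mutual information, or a likelihood ratio.

    from collections import Counter
    from itertools import combinations

    def multiword_candidates(tokens, x=2, y=4):
        counts = Counter()
        for start in range(len(tokens) - y + 1):
            window = tokens[start:start + y]
            for combo in combinations(window, x):   # words need not be adjacent
                counts[combo] += 1
        return counts

    tokens = "the broken and dirty vase near the broken vase".split()
    print(multiword_candidates(tokens)[("broken", "vase")])   # 2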

26
(No Transcript)
27
  • Labels for Right Answers
  • For prediction, an extra column is added to the spreadsheet.
  • This last column contains the labels and looks no different from the others.
  • It is a 0 or 1 indicating the right answer, i.e., true or false.
  • In the sparse vector format, labels are appended to each vector separately as either a one (positive class) or a zero (negative class).
  • Feature Selection by Attribute Ranking
  • In addition to frequency-based approaches, feature selection can be done in a number of ways.
  • Select a set of features for each category to form a local dictionary for that category.
  • Independently rank the feature attributes according to their predictive abilities for the category under consideration.
  • The predictive ability of an attribute can be measured by a quantity expressing how it is correlated with the label.
  • Assume there are n documents; let x denote the presence or absence of attribute j in a document and y denote the label of the document in the last column.

28
  • A commonly used ranking score is the information gain criterion, L(j).
  • The quantity L(j) is the number of bits required to encode the label and the attribute j minus the number of bits required to encode the attribute alone.
  • Several probability estimates are needed to compute L(j); they can be easily estimated from document counts using simple estimators.
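  • The slide's formula and estimators were not transcribed. As a hedged sketch of the quantity described above (bits to encode the label together with attribute j, minus bits to encode the attribute alone), the score can be estimated from simple counts over the n documents:

    import math
    from collections import Counter

    def entropy(counts):
        n = sum(counts)
        return -sum(c / n * math.log2(c / n) for c in counts if c)

    def L(xs, ys):
        """xs: presence (1) or absence (0) of attribute j per document;
        ys: 0/1 class labels from the last column."""
        joint = Counter(zip(xs, ys))
        return entropy(list(joint.values())) - entropy(list(Counter(xs).values()))

    xs = [1, 1, 0, 0, 1, 0]
    ys = [1, 1, 0, 0, 0, 1]
    print(round(L(xs, ys), 3))   # fewer remaining bits: attribute says more about the label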

29
Sentence Boundary Determination
  • If the XML markup for the corpus does not mark sentence boundaries, it is necessary to mark the sentences ourselves.
  • It is necessary to determine when a period is part of a token and when it is not.
  • For more sophisticated processing such as linguistic parsing, the algorithms often require a complete sentence as input.
  • Extraction algorithms operate on text a sentence at a time.
  • These algorithms work best when sentences are identified clearly.
  • Sentence boundary determination is the problem of deciding which instances of a period followed by white space are sentence delimiters and which are not (we assume the characters ? and ! always end a sentence); it is thus a classification problem (a simple sketch follows).
  • Measuring the algorithm's accuracy and making adjustments will give better performance.
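  • A tiny rule-based sketch of this classification decision (illustrative only; the abbreviation list and the two clues, a known abbreviation before the period and a capitalized word after it, are our own simplification):

    import re

    ABBREVIATIONS = {"dr", "mr", "mrs", "p.m", "a.m", "e.g", "etc"}   # assumed list

    def sentence_boundaries(text):
        decisions = []
        # every period followed by white space and another word is a candidate
        for m in re.finditer(r"\.(?=\s+(\S+))", text):
            prev_word = text[:m.start()].split()[-1].lower().rstrip(".")
            next_word = m.group(1)
            is_boundary = prev_word not in ABBREVIATIONS and next_word[:1].isupper()
            decisions.append((m.start(), is_boundary))
        return decisions

    text = "Dr. Smith arrived at 9 p.m. yesterday. He left today."
    print(sentence_boundaries(text))   # only the period after "yesterday" is a boundary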

30
(No Transcript)
31
Thank you!