Processing of Large Document Collections 1
  • Helena Ahonen-Myka
  • University of Helsinki

Organization of the course
  • Classes 17.9., 22.10., 23.10., 26.11.
  • lectures (Helena Ahonen-Myka) 10-12,13-15
  • exercise sessions (Lili Aunimo) 15-17
  • required presence 75%
  • Exercises are given (and returned) each week
  • required 75%
  • Exam 4.12. at 16-20, Auditorio
  • Points: exam 30 pts, exercises 30 pts

  • 17.9. Character sets, preprocessing of text, text
    categorization
  • 22.10. Text summarization
  • 23.10. Text compression
  • 26.11. to be announced
  • self-study: basic transformations for text data,
    using linguistic tools, etc.

In this part...
  • Character sets
  • preprocessing of text
  • text categorization

1. Character sets
  • Abstract character vs. its graphical
    representation
  • abstract characters are grouped into alphabets
  • each alphabet forms the basis of the written form
    of a certain language or a set of languages

Character sets
  • For instance
  • for English
  • uppercase letters A-Z
  • lowercase letters a-z
  • punctuation marks
  • digits 0-9
  • common symbols
  • ideographic symbols of Chinese and Japanese
  • phonetic letters of Western languages

Character sets
  • To represent text digitally, we need a mapping
    between (abstract) characters and values stored
    digitally (integers)
  • this mapping is a character set
  • the domain of the character set is called a
    character repertoire (= the alphabet for which
    the mapping is defined)

Character sets
  • For each character in the character repertoire,
    the character set defines a code value in the set
    of code points
  • in English:
  • 26 letters in both lower- and uppercase
  • ten digits + some punctuation marks
  • in Russian: Cyrillic letters
  • both could use the same set of code points (if
    not a bilingual document)
  • in Japanese: could be over 6000 characters

Character sets
  • The mere existence of a character set supports
    operations like editing and searching of text
  • usually character sets have some structure
  • e.g. integers within a small range
  • all lower-case (resp. upper-case) letters have
    code values that are consecutive integers
    (simplifies sorting etc.)
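
The structure mentioned above can be checked directly. The following is a minimal sketch, assuming the ASCII/Unicode code values that Python exposes through ord(); the sample word list is made up for illustration.

```python
import string

# Code values of the lowercase letters a-z.
lower_codes = [ord(ch) for ch in string.ascii_lowercase]

# Consecutive integers: each code value is the previous one plus 1.
is_consecutive = all(b - a == 1 for a, b in zip(lower_codes, lower_codes[1:]))

# Because of this structure, sorting by code value gives
# alphabetical order for plain lowercase words.
words = ["pear", "apple", "fig"]
sorted_words = sorted(words)

print(lower_codes[0], lower_codes[-1], is_consecutive)
print(sorted_words)
```

Note that this only holds within one case and one alphabet: uppercase letters have smaller code values than lowercase ones, so mixed-case sorting by raw code value is not alphabetical.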

Character set standards
  • Character sets can be arbitrary, but in practice
    standardization is needed for interoperability
    (between computers, programs,...)
  • early standards were designed for English only,
    or for a small group of languages at a time

Character set standards
  • ISO-8859 (e.g. ISO Latin1)
  • Unicode
  • UTF-8, UTF-16

  • American Standard Code for Information
    Interchange (ASCII)
  • A seven-bit code -> 128 code points
  • actually 95 printable characters only
  • code points 0-31 and 127 are assigned to control
    characters (mostly outdated)
  • ISO 646 (1972) version of ASCII incorporated
    several national variants (accented letters and
    currency symbols)

  • With 7 bits, the set of code points is too small
    for anything else than American English
  • solution:
  • 8 bits brings more code points (256)
  • ASCII character repertoire is mapped to the
    values 0-127
  • additional symbols are mapped to other values

Extended ASCII
  • Problem
  • different manufacturers each developed their own
    8-bit extensions to ASCII
  • different character repertoires -> translation
    between them is not always possible
  • also 256 code values is not enough to represent
    all the alphabets -> different variants for
    different languages

ISO 8859
  • Standardization of 8-bit character sets
  • In the 80s the multipart standard ISO 8859 was
    developed
  • defines a collection of 8-bit character sets,
    each designed for a group of languages
  • the first part: ISO 8859-1 (ISO Latin1)
  • covers most Western European languages
  • 0-127 identical to ASCII, 128-159 (mostly)
    unused, 96 code values for accented letters and
    symbols

  • 256 is not enough code points
  • for ideographically represented languages
    (Chinese, Japanese)
  • for simultaneous use of several languages
  • solution: more than one byte for each code value
  • a 16-bit character set has 65,536 code points

  • 16-bit character set, i.e. 65,536 code points
  • not sufficient for all the characters required
    for Chinese, Japanese, and Korean scripts in
    distinct positions
  • CJK consolidation: characters of these scripts
    are given the same value if they look the same

  • Code values for all the characters used to write
    contemporary major languages
  • also the classical forms of some languages
  • Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic,
    Devanagari, Bengali, Gurmukhi, Gujarati, Oriya,
    Tamil, Telugu, Kannada, Malayalam, Thai, Lao,
    Georgian, Tibetan
  • Chinese, Japanese, and Korean ideograms, and the
    Japanese and Korean phonetic and syllabic scripts

  • punctuation marks
  • technical and mathematical symbols
  • arrows
  • dingbats (pointing hands, stars, ...)
  • both accented letters and separate diacritical
    marks (accents, tildes) are included, with a
    mechanism for building composite characters
  • can also create problems: two characters that
    look the same may have different code values
  • -> normalization may be necessary
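
A small illustration of the normalization problem, sketched with Python's standard unicodedata module. The choice of the letter é and of the NFC normalization form is an assumption made for this example; other forms (NFD, NFKC, NFKD) exist.

```python
import unicodedata

# Two ways to write the same visible character:
precomposed = "\u00e9"   # é as one code value
combined = "e\u0301"     # e followed by COMBINING ACUTE ACCENT

# They look the same but compare as different strings.
same_before = precomposed == combined

# After normalizing both to the composed form (NFC) they are equal.
nfc_a = unicodedata.normalize("NFC", precomposed)
nfc_b = unicodedata.normalize("NFC", combined)
same_after = nfc_a == nfc_b

print(same_before, same_after)
```

Without such a step, searching a text for one spelling would miss occurrences written with the other.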

  • Code values for nearly 39,000 symbols are
    provided
  • some part is reserved for an expansion method
    (see later)
  • 6,400 code points are reserved for private use
  • they will never be assigned to any character by
    the standard, so they will not conflict with the
    standard
Unicode encodings
  • Encoding is a mapping that transforms a code
    value into a sequence of bytes for storage and
    transmission
  • identity mapping for an 8-bit code?
  • it may be necessary to encode 8-bit characters as
    sequences of 7-bit (ASCII) characters
  • e.g. Quoted-Printable (QP)
  • code values 128-255 as a sequence of 3 bytes
  • 1: ASCII code for "=", 2 & 3: hexadecimal digits
    of the value
  • 233 -> E9 -> =E9
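
The Quoted-Printable idea can be sketched in a few lines. This is a simplified illustration, not a complete implementation: the function names are hypothetical, ISO Latin1 is assumed as the 8-bit character set, and line-length limits and the handling of control characters are left out.

```python
def qp_encode_byte(value: int) -> str:
    """Encode one 8-bit code value in the Quoted-Printable style."""
    if not 0 <= value <= 255:
        raise ValueError("not an 8-bit code value")
    # Values 128-255 (and "=" itself, which is the escape character)
    # become "=" followed by two uppercase hexadecimal digits.
    if value >= 128 or value == ord("="):
        return "=%02X" % value
    # Other ASCII values are sent as themselves.
    return chr(value)


def qp_encode(text: str, charset: str = "latin-1") -> str:
    """Encode a string whose characters fit in the given 8-bit set."""
    return "".join(qp_encode_byte(b) for b in text.encode(charset))


print(qp_encode_byte(233))
print(qp_encode("caf\u00e9"))
```

The result contains only 7-bit ASCII characters, so it can pass through channels that are not 8-bit clean.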

Unicode encodings
  • UTF-8
  • ASCII code values are likely to be more common in
    most text than any other values
  • in UTF-8 encoding ASCII characters are sent as
    themselves (one byte, high-order bit 0)
  • other characters are encoded as multi-byte
    sequences (high-order bit of each byte is set to
    1); the original design allowed up to six bytes,
    current UTF-8 uses at most four
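
The byte patterns can be inspected with Python's built-in UTF-8 codec. A minimal sketch; the helper name describe_utf8 and the three sample characters are chosen only for illustration.

```python
def describe_utf8(ch: str) -> str:
    """Show the code value of a character and its UTF-8 bytes in binary."""
    encoded = ch.encode("utf-8")
    bits = " ".join(f"{byte:08b}" for byte in encoded)
    return f"U+{ord(ch):04X}: {len(encoded)} byte(s), bits {bits}"


# An ASCII letter, an accented Latin1 letter, and the euro sign.
for ch in ["A", "\u00e9", "\u20ac"]:
    print(describe_utf8(ch))
```

The ASCII letter occupies one byte whose high-order bit is 0, while every byte of the two non-ASCII characters has its high-order bit set to 1.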

Unicode encodings
  • UTF-16: expansion method
  • two 16-bit values are combined to a 32-bit value
    -> a million characters available
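
The expansion method pairs a value from one reserved range with a value from a second reserved range. The sketch below assumes the surrogate ranges defined by the Unicode standard (D800-DBFF and DC00-DFFF); the function name is hypothetical.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine two 16-bit surrogate values into one code value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    # Each surrogate contributes 10 bits; the result starts at 0x10000.
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# Number of characters reachable through pairs: 1024 * 1024.
pair_count = (0xDBFF - 0xD800 + 1) * (0xDFFF - 0xDC00 + 1)

print(hex(combine_surrogates(0xD834, 0xDD1E)), pair_count)
```

The pair count of 1,048,576 is where the "a million characters" figure comes from.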

2. Preprocessing of text
  • Text cannot be directly interpreted by many
    document processing applications
  • an indexing procedure is needed
  • mapping of a text into a compact representation
    of its content
  • which are the meaningful units of text?
  • how should these units be combined?
  • usually not important

Vector model
  • A document is usually represented as a vector of
    term weights
  • the vector has as many dimensions as there are
    terms (or features) in the whole collection of
    documents
  • the weight represents how much the term
    contributes to the semantics of the document

Vector model
  • Different approaches
  • different ways to understand what a term is
  • different ways to compute term weights

  • Words
  • typical choice
  • set of words, bag of words
  • phrases
  • syntactical phrases
  • statistical phrases
  • usefulness not yet known?

  • Part of the text is not considered as terms
  • very common words (function words)
  • articles, prepositions, conjunctions
  • numerals
  • these words are pruned
  • stopword list
  • other preprocessing possible
  • stemming, base words
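
The pruning steps above can be sketched as a small pipeline. The stopword list is a tiny made-up sample and naive_stem is a deliberately crude suffix stripper written for this example; a real application would use a full stopword list and a proper stemmer.

```python
import re

# A tiny illustrative stopword list (articles, prepositions, conjunctions).
STOPWORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "to"}


def naive_stem(token: str) -> str:
    """Strip one common English suffix; a toy stand-in for a stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(text: str) -> list:
    """Lowercase, keep alphabetic tokens only, drop stopwords, stem."""
    # The pattern keeps letters only, so numerals are pruned as well.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


print(preprocess("The 3 cats sat on the mat"))
```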

Weights of terms
  • Weights usually range between 0 and 1
  • binary weights may be used
  • 1 denotes presence, 0 absence of the term in the
    document
  • often the tf-idf function is used
  • higher weight, if the term occurs often in the
    document
  • lower weight, if the term occurs in many documents
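
One common way to realize these two properties is tf * log(N / df), followed by cosine normalization so that the weights fall between 0 and 1. The sketch below assumes exactly this variant (many others exist); the sample documents are made up.

```python
import math
from collections import Counter


def tfidf_vectors(tokenized_docs):
    """Return one {term: weight} dict per document, cosine-normalized."""
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)
        weights = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        if norm > 0:
            weights = {t: w / norm for t, w in weights.items()}
        vectors.append(weights)
    return vectors


docs = [["wheat", "farm", "wheat"], ["wheat", "export"], ["bank", "money"]]
vectors = tfidf_vectors(docs)
for vector in vectors:
    print(vector)
```

In the first document "farm" ends up with a higher weight than "wheat" even though "wheat" occurs twice, because "wheat" also appears in another document.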

  • Either the full text of the document or selected
    parts of it are indexed
  • e.g. in a patent categorization application
  • title, abstract, the first 20 lines of the
    summary, and the section containing the claims of
    novelty of the described invention
  • some parts may be considered more important
  • e.g. higher weight for the terms in the title

Dimensionality reduction
  • Many algorithms cannot handle high dimensionality
    of the term space (= large number of terms)
  • usually dimensionality reduction is applied
  • dimensionality reduction also reduces overfitting
  • a classifier that overfits the training data is
    good at re-classifying the training data but
    worse at classifying previously unseen data

Dimensionality reduction
  • Local dimensionality reduction
  • for each category, a reduced set of terms is
    chosen for classification under that category
  • hence, different subsets are used when working
    with different categories
  • global dimensionality reduction
  • a reduced set of terms is chosen for the
    classification under all categories

Dimensionality reduction
  • Dimensionality reduction by term selection
  • the terms of the reduced term set are a subset of
    the original term set
  • Dimensionality reduction by term extraction
  • the terms are not of the same type as the terms in
    the original term set, but are obtained by
    combinations and transformations of the original
    ones

Dimensionality reduction by term selection
  • Goal: select terms that, when used for document
    indexing, yield the highest effectiveness in the
    given application
  • wrapper approach
  • the reduced set of terms is found iteratively and
    tested with the application
  • filtering approach
  • keep the terms that receive the highest score
    according to a function that measures the
    importance of the term for the task

Dimensionality reduction by term selection
  • Many functions available
  • document frequency: keep the high frequency terms
  • stopwords have been already removed
  • 50% of the words occur only once in the document
    collection
  • e.g. remove all terms occurring in at most 3
    documents
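
Document-frequency selection is a filtering approach with a very simple score. A minimal sketch; the function name, the min_df parameter, and the sample documents are assumptions for the example. Removing all terms that occur in at most 3 documents corresponds to min_df=4.

```python
from collections import Counter


def select_by_document_frequency(tokenized_docs, min_df):
    """Keep the terms that occur in at least min_df documents."""
    # Count each term once per document, however often it occurs there.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term for term, count in df.items() if count >= min_df}


docs = [["a", "b", "a"], ["a", "c"], ["a", "b", "d"]]
kept = select_by_document_frequency(docs, min_df=2)
print(sorted(kept))
```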

Dimensionality reduction by term selection
  • Information-theoretic term selection functions,
    e.g.
  • chi-square
  • information gain
  • mutual information
  • odds ratio
  • relevancy score
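
As an illustration of one of these functions, the chi-square score of a term and a category can be computed from a 2x2 contingency table of document counts. This sketch assumes the usual formulation N(AD - CB)^2 / ((A+C)(B+D)(A+B)(C+D)); the labeled sample documents are made up.

```python
def chi_square(term, category, labeled_docs):
    """Chi-square score of a term for a category.

    labeled_docs is a list of (set_of_terms, category_label) pairs.
    """
    a = b = c = d = 0
    for terms, label in labeled_docs:
        has_term = term in terms
        in_category = label == category
        if has_term and in_category:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document not in category
        elif in_category:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document not in category
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denominator


corpus = [
    ({"wheat", "farm"}, "grain"),
    ({"wheat", "export"}, "grain"),
    ({"bank", "money"}, "finance"),
    ({"bank", "export"}, "finance"),
]
print(chi_square("wheat", "grain", corpus))
print(chi_square("export", "grain", corpus))
```

A term that is distributed independently of the category scores 0; a term that occurs exactly in the documents of the category gets the maximal score N.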

Dimensionality reduction by term extraction
  • Term extraction attempts to generate, from the
    original term set, a set of synthetic terms
    that maximize effectiveness
  • due to polysemy, homonymy, and synonymy, the
    original terms may not be optimal dimensions for
    document content representation

Dimensionality reduction by term extraction
  • Term clustering
  • tries to group words with a high degree of
    pairwise semantic relatedness
  • groups (or their centroids) may be used as
    dimensions
  • latent semantic indexing
  • compresses document vectors into vectors of a
    lower-dimensional space whose dimensions are
    obtained as combinations of the original
    dimensions by looking at their patterns of
    co-occurrence

3. Text categorization
  • Text classification, topic classification/spotting
  • problem setting
  • assume a predefined set of categories, a set of
    documents
  • label each document with one (or more) categories

Text categorization
  • Two major approaches
  • knowledge engineering -> end of 80s
  • manually defined set of rules encoding expert
    knowledge on how to classify documents under the
    given categories
  • machine learning, 90s ->
  • an automatic text classifier is built by
    learning, from a set of preclassified documents,
    the characteristics of the categories

Text categorization
  • Let
  • $D$ = a domain of documents
  • $C = \{c_1, \ldots, c_{|C|}\}$ = a set of predefined
    categories
  • $T$ = true, $F$ = false
  • The task is to approximate the unknown target
    function $\breve{\Phi} : D \times C \to \{T, F\}$ by
    means of a function $\Phi : D \times C \to \{T, F\}$,
    such that the functions coincide as much as possible
  • function $\breve{\Phi}$: how documents should be
    classified
  • function $\Phi$: classifier (hypothesis, model)

We assume...
  • Categories are just symbolic labels
  • no additional knowledge of their meaning is
    available
  • No knowledge outside of the documents is
    available
  • all decisions have to be made on the basis of the
    knowledge extracted from the documents
  • metadata, e.g., publication date, document type,
    source etc. is not used

-> general methods
  • Methods do not depend on any application-dependent
    knowledge
  • in operational applications all kinds of knowledge
    can be used
  • content-based decisions are necessarily
    subjective
  • it is often difficult to measure the
    effectiveness of the classifiers
  • even human classifiers do not always agree

Single-label vs. multi-label
  • Single-label text categorization
  • exactly 1 category must be assigned to each dj ∈
    D
  • Multi-label text categorization
  • any number of categories may be assigned to the
    same dj ∈ D
  • Special case of single-label: binary
  • each dj must be assigned either to category ci or
    to its complement ¬ci

Single-label, multi-label
  • The binary case (and, hence, the single-label
    case) is more general than the multi-label
    case
  • an algorithm for binary classification can also
    be used for multi-label classification
  • the converse is not true
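
The reduction from multi-label to binary classification can be sketched as one independent binary decision per category. The keyword-based binary classifiers below are toy stand-ins invented for the example; any binary classifier could take their place.

```python
def multi_label_classify(document, binary_classifiers):
    """Assign every category whose binary classifier answers True.

    binary_classifiers maps a category name to a function
    document -> bool (under the category or under its complement).
    """
    return {
        category
        for category, classifier in binary_classifiers.items()
        if classifier(document)
    }


# Toy binary classifiers: one keyword test per category.
classifiers = {
    "grain": lambda doc: "wheat" in doc,
    "trade": lambda doc: "export" in doc,
    "finance": lambda doc: "bank" in doc,
}

print(sorted(multi_label_classify({"wheat", "export"}, classifiers)))
```

Because each decision is made separately, a document can end up with zero, one, or several categories.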

Category-pivoted vs. document-pivoted
  • Two different ways for using a text classifier
  • given a document, we want to find all the
    categories, under which it should be filed ->
    document-pivoted categorization (DPC)
  • given a category, we want to find all the
    documents that should be filed under it ->
    category-pivoted categorization (CPC)

Category-pivoted vs. document-pivoted
  • The distinction is important, since the sets C
    and D might not be available in their entirety
    right from the start
  • DPC suitable when documents become available at
    different moments in time, e.g. filtering e-mail
  • CPC suitable when new categories are added after
    some documents have already been classified (and
    have to be reclassified)

Category-pivoted vs. document-pivoted
  • Some algorithms may apply to one style and not
    the other, but most techniques are capable of
    working in either mode

Hard-categorization vs. ranking categorization
  • Hard categorization
  • the classifier answers T or F
  • Ranking categorization
  • given a document, the classifier might rank the
    categories according to their estimated
    appropriateness to the document
  • respectively, given a category, the classifier
    might rank the documents

Applications of text categorization
  • Automatic indexing for Boolean information
    retrieval systems
  • document organization
  • text filtering
  • word sense disambiguation
  • hierarchical categorization of Web pages

Automatic indexing for Boolean IR systems
  • In an information retrieval system, each document
    is assigned one or more keywords or keyphrases
    describing its content
  • keywords belong to a finite set called a controlled
    dictionary
  • TC problem: the entries in a controlled
    dictionary are viewed as categories
  • k1 ≤ x ≤ k2 keywords are assigned to each
    document
  • document-pivoted TC

Document organization
  • Indexing with a controlled vocabulary is an
    instance of the general problem of document base
    organization
  • e.g. a newspaper office has to classify the
    incoming classified ads under categories such
    as Personals, Cars for Sale, Real Estate etc.
  • organization of patents, filing of newspaper
    articles

Text filtering
  • Classifying a stream of incoming documents
    dispatched in an asynchronous way by an
    information producer to an information consumer
  • e.g. newsfeed
  • producer: news agency, consumer: newspaper
  • the filtering system should block the delivery of
    documents the consumer is likely not interested in

Word sense disambiguation
  • Given the occurrence in a text of an ambiguous
    word, find the sense of this particular word
  • E.g.
  • Bank of England
  • the bank of river Thames
  • Last week I borrowed some money from the bank.

Word sense disambiguation
  • Indexing by word senses rather than by words
  • text categorization
  • documents = word occurrence contexts
  • categories = word senses
  • also resolving other natural language ambiguities
  • context-sensitive spelling correction, part of
    speech tagging, prepositional phrase attachment,
    word choice selection in machine translation

Hierarchical categorization of Web pages
  • E.g. Yahoo-like hierarchical web catalogues
  • typically, each category should be populated by
    a few documents
  • new categories are added, obsolete ones removed
  • usage of link structure in classification
  • usage of the hierarchical structure

Knowledge engineering approach
  • In the 80s: knowledge engineering techniques
  • manually building expert systems capable of
    taking text categorization decisions
  • expert system consists of a set of rules
  • wheat & farm -> wheat
  • wheat & commodity -> wheat
  • bushels & export -> wheat
  • wheat & winter & ¬soft -> wheat
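
A rule set of this kind can be sketched as a disjunction of conjunctions over the words of a document. The reading of each rule as "all listed words must occur" is an assumption, and so is treating "soft" in the last rule as a negated condition; the function and variable names are hypothetical.

```python
# Each rule is (required words, forbidden words); a document is filed
# under "wheat" if at least one rule matches.
WHEAT_RULES = [
    ({"wheat", "farm"}, set()),
    ({"wheat", "commodity"}, set()),
    ({"bushels", "export"}, set()),
    ({"wheat", "winter"}, {"soft"}),  # assumption: "soft" is negated
]


def is_wheat(text: str) -> bool:
    """Apply the manually defined rules to the words of a text."""
    words = set(text.lower().split())
    return any(
        required <= words and not (forbidden & words)
        for required, forbidden in WHEAT_RULES
    )


print(is_wheat("Wheat farm subsidies rise"))
print(is_wheat("Bank raises interest rates"))
```

Every new category, and every change in vocabulary, requires someone to edit such a rule list by hand.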

Knowledge engineering approach
  • Drawback: rules must be manually defined by a
    knowledge engineer with the aid of a domain
    expert
  • any update again necessitates human intervention
  • totally domain dependent
  • -> expensive and slow process

Machine learning approach
  • A general inductive process (learner)
    automatically builds a classifier for a category
    ci by observing the characteristics of a set of
    documents manually classified under ci or ¬ci by
    a domain expert
  • from these characteristics the learner gleans the
    characteristics that a new unseen document should
    have in order to be classified under ci
  • supervised learning (= supervised by the
    knowledge of the training documents)

Machine learning approach
  • The learner is domain independent
  • usually available off-the-shelf
  • the inductive process is easily repeated, if the
    set of categories changes
  • manually classified documents are often already
    available
  • a manual process may already exist
  • if not, it is still easier to manually classify a
    set of documents than to build and tune a set of
    rules
Training set, test set, validation set
  • Initial corpus of manually classified documents
  • let dj belong to the initial corpus
  • for each pair <dj, ci> it is known if dj should
    be filed under ci
  • positive examples, negative examples of a category

Training set, test set, validation set
  • The initial corpus is divided into two sets
  • a training (and validation) set
  • a test set
  • the training set is used to build the classifier
  • the test set is used for testing the
    effectiveness of the classifiers
  • each document is fed to the classifier and the
    decision is compared to the manual category

Training set, test set, validation set
  • The documents in the test set are not used in the
    construction of the classifier
  • alternative: k-fold cross-validation
  • k different classifiers are built by partitioning
    the initial corpus into k disjoint sets and then
    iteratively applying the train-and-test approach
    on pairs, where k-1 sets construct a training set
    and 1 set is used as a test set
  • individual results are then averaged
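
The k-fold procedure can be sketched as follows. The function names are hypothetical, the corpus is split without shuffling (a real experiment would usually shuffle first), and train_fn and eval_fn are placeholders for whatever learner and effectiveness measure are used.

```python
def k_fold_splits(items, k):
    """Yield (training set, test set) pairs for k disjoint folds."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The remaining k-1 folds together form the training set.
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, k, train_fn, eval_fn):
    """Build k classifiers and average their individual results."""
    scores = [
        eval_fn(train_fn(train), test)
        for train, test in k_fold_splits(items, k)
    ]
    return sum(scores) / len(scores)


splits = list(k_fold_splits(list(range(10)), 5))
for train, test in splits:
    print(len(train), test)
```

Each document is used for testing exactly once and for training k-1 times.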

Training set, test set, validation set
  • The training set can be split into two parts
  • one part (the validation set) is used for
    optimising parameters
  • test which values of parameters yield the best
    effectiveness
  • test set and validation set must be kept separate

Inductive construction of classifiers
  • A ranking classifier for a category ci
  • definition of a function that, given a document,
    returns a categorization status value for it,
    i.e. a number between 0 and 1
  • documents are ranked according to their
    categorization status value

Inductive construction of classifiers
  • A hard classifier for a category
  • definition of a function that returns true or
    false, or
  • definition of a function that returns a value
    between 0 and 1, followed by a definition of a
    threshold
  • if the value is higher than the threshold -> true
  • otherwise -> false
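
Both uses of a categorization status value can be sketched with one scoring function. The score used here (the fraction of tokens equal to "wheat") is a toy stand-in invented for the example, and the helper names are hypothetical.

```python
def wheat_score(tokens):
    """Toy categorization status value in [0, 1] for category wheat."""
    if not tokens:
        return 0.0
    return tokens.count("wheat") / len(tokens)


def make_hard_classifier(csv_function, threshold):
    """Turn a scoring function into a true/false classifier."""
    def classify(document):
        return csv_function(document) > threshold
    return classify


def rank_documents(documents, csv_function):
    """Ranking categorization: best-scoring documents first."""
    return sorted(documents, key=csv_function, reverse=True)


d1 = ["wheat", "wheat", "farm"]
d2 = ["bank", "wheat"]
d3 = ["bank", "money"]

is_wheat_doc = make_hard_classifier(wheat_score, threshold=0.5)
print([is_wheat_doc(d) for d in (d1, d2, d3)])
print(rank_documents([d3, d1, d2], wheat_score))
```

The threshold is one of the parameters that would typically be tuned on a validation set.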

  • Probabilistic classifiers (Naïve Bayes)
  • decision tree classifiers
  • decision rule classifiers
  • regression methods
  • on-line methods
  • neural networks
  • example-based classifiers (k-NN)
  • support vector machines

Rocchio method
  • Linear classifier method
  • for each category, an explicit profile (or
    prototypical document) is constructed
  • benefit: profile is understandable even for
    humans

Rocchio method
  • A classifier is a vector of the same dimension as
    the documents
  • weights
  • classifying: cosine similarity of the category
    vector and the document vector
