Title: Processing of Large Document Collections 1
Processing of Large Document Collections 1
- Helena Ahonen-Myka
- University of Helsinki
Organization of the course
- Classes 17.9., 22.10., 23.10., 26.11.
- lectures (Helena Ahonen-Myka) 10-12,13-15
- exercise sessions (Lili Aunimo) 15-17
- required presence 75%
- Exercises are given (and returned) each week
- required 75%
- Exam 4.12. at 16-20, Auditorio
- Points: exam 30 pts, exercises 30 pts
Schedule
- 17.9. Character sets, preprocessing of text, text categorization
- 22.10. Text summarization
- 23.10. Text compression
- 26.11. to be announced
- self-study: basic transformations for text data, using linguistic tools, etc.
In this part...
- Character sets
- preprocessing of text
- text categorization
1. Character sets
- Abstract character vs. its graphical representation
- abstract characters are grouped into alphabets
- each alphabet forms the basis of the written form
of a certain language or a set of languages
Character sets
- For instance
- for English
- uppercase letters A-Z
- lowercase letters a-z
- punctuation marks
- digits 0-9
- common symbols ,
- ideographic symbols of Chinese and Japanese
- phonetic letters of Western languages
Character sets
- To represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers)
- this mapping is a character set
- the domain of the character set is called a character repertoire (= the alphabet for which the mapping is defined)
Character sets
- For each character in the character repertoire, the character set defines a code value in the set of code points (see the sketch below)
- in English
- 26 letters in both lower- and uppercase
- ten digits, some punctuation marks
- in Russian: Cyrillic letters
- both could use the same set of code points (if not a bilingual document)
- in Japanese: could be over 6000 characters
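A small illustration (Python, not part of the original slides) of the mapping between characters and code values, using the built-in ord() and chr():

    print(ord("A"))    # 65: the code value assigned to 'A'
    print(chr(97))     # 'a': the character with code value 97
    print(ord("я"))    # 1103: a Cyrillic letter has its own code point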
Character sets
- The mere existence of a character set supports operations like editing and searching of text
- usually character sets have some structure
- e.g. integers within a small range
- all lower-case (resp. upper-case) letters have
code values that are consecutive integers
(simplifies sorting etc.)
Character set standards
- Character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs, ...)
- early standards were designed for English only, or for a small group of languages at a time
Character set standards
- ASCII
- ISO-8859 (e.g. ISO Latin1)
- Unicode
- UTF-8, UTF-16
ASCII
- American Standard Code for Information Interchange
- a seven-bit code -> 128 code points
- actually 95 printable characters only
- code points 0-31 and 127 are assigned to control characters (mostly outdated)
- the ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)
ASCII
- With 7 bits, the set of code points is too small for anything else than American English
- solution:
- 8 bits brings more code points (256)
- the ASCII character repertoire is mapped to the values 0-127
- additional symbols are mapped to the other values
Extended ASCII
- Problem:
- different manufacturers each developed their own 8-bit extensions to ASCII
- different character repertoires -> translation between them is not always possible
- also, 256 code values is not enough to represent all the alphabets -> different variants for different languages
ISO 8859
- Standardization of 8-bit character sets
- in the 80s the multipart standard ISO 8859 was produced
- defines a collection of 8-bit character sets, each designed for a group of languages
- the first part: ISO 8859-1 (ISO Latin1)
- covers most Western European languages
- 0-127 identical to ASCII, 128-159 (mostly)
unused, 96 code values for accented letters and
symbols
Unicode
- 256 code points is not enough
- for ideographically represented languages (Chinese, Japanese)
- for simultaneous use of several languages
- solution: more than one byte for each code value
- a 16-bit character set has 65,536 code points
Unicode
- A 16-bit character set, i.e. 65,536 code points
- not sufficient to give all the characters required for the Chinese, Japanese, and Korean scripts distinct positions
- CJK consolidation: characters of these scripts are given the same value if they look the same
Unicode
- Code values for all the characters used to write contemporary major languages
- also the classical forms of some languages
- Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
- Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts
Unicode
- punctuation marks
- technical and mathematical symbols
- arrows
- dingbats (pointing hands, stars, ...)
- both accented letters and separate diacritical marks (accents, tildes) are included, with a mechanism for building composite characters
- this can also create problems: two characters that look the same may have different code values
- -> normalization may be necessary (sketched below)
Unicode
- Code values for nearly 39,000 symbols are provided
- some part is reserved for an expansion method (see later)
- 6,400 code points are reserved for private use
- they will never be assigned to any character by
the standard, so they will not conflict with the
standard
Unicode encodings
- Encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission
- identity mapping for an 8-bit code?
- it may be necessary to encode 8-bit characters as sequences of 7-bit (ASCII) characters
- e.g. Quoted-Printable (QP, sketched below)
- code values 128-255 are encoded as a sequence of 3 bytes
- byte 1: the ASCII code for '=', bytes 2-3: the hexadecimal digits of the value
- e.g. 233 -> E9 -> =E9
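A minimal sketch (Python, not part of the original slides) of the Quoted-Printable encoding described above, using the standard quopri module; the letter é (code value 233 in ISO 8859-1) serves as the example:

    import quopri

    raw = "é".encode("latin-1")            # b'\xe9', code value 233
    encoded = quopri.encodestring(raw)     # b'=E9': '=' followed by two hex digits
    print(encoded)
    print(quopri.decodestring(encoded))    # back to b'\xe9'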
Unicode encodings
- UTF-8 (sketched below)
- ASCII code values are likely to be more common in most text than any other values
- in UTF-8 encoding, ASCII characters are sent as themselves (high-order bit 0)
- other characters (two-byte code values) are encoded using up to six bytes (the high-order bit is set to 1)
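A small illustration (Python, not part of the original slides) of UTF-8's variable-length encoding:

    print("A".encode("utf-8"))     # b'A': an ASCII character is sent as itself (1 byte)
    print("é".encode("utf-8"))     # b'\xc3\xa9': 2 bytes, high-order bits set
    print("漢".encode("utf-8"))    # b'\xe6\xbc\xa2': a CJK ideogram takes 3 bytes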
Unicode encodings
- UTF-16: expansion method (sketched below)
- two 16-bit values are combined into a 32-bit value
- -> a million characters available
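A small illustration (Python, not part of the original slides) of the UTF-16 expansion method: a code point beyond the 16-bit range is stored as a pair of 16-bit values:

    ch = "\U0001F600"                     # code point U+1F600, beyond the 16-bit range
    print(ch.encode("utf-16-be").hex())   # 'd83dde00': two 16-bit values (a surrogate pair)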
2. Preprocessing of text
- Text cannot be directly interpreted by many document processing applications
- an indexing procedure is needed
- mapping of a text into a compact representation of its content
- which are the meaningful units of text?
- how should these units be combined?
- usually not important
Vector model
- A document is usually represented as a vector of term weights (sketched below)
- the vector has as many dimensions as there are terms (or features) in the whole collection of documents
- the weight represents how much the term contributes to the semantics of the document
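A minimal sketch (Python, not part of the original slides; the vocabulary and document are invented) of a document as a vector of term weights, here with binary weights:

    vocab = ["export", "farm", "prices", "wheat"]        # terms of the whole collection
    doc = "wheat export prices rise".split()
    vector = [1 if term in doc else 0 for term in vocab]
    print(vector)    # [1, 0, 1, 1]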
Vector model
- Different approaches
- different ways to understand what a term is
- different ways to compute term weights
Terms
- Words
- typical choice
- set of words, bag of words
- phrases
- syntactical phrases
- statistical phrases
- usefulness not yet known?
Terms
- Some parts of the text are not considered as terms
- very common words (function words)
- articles, prepositions, conjunctions
- numerals
- these words are pruned
- stopword list
- other preprocessing possible
- stemming, base words
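A minimal sketch (Python, not part of the original slides) of pruning function words and numerals with a stopword list; the list itself is only illustrative:

    STOPWORDS = {"the", "a", "of", "and", "in", "to", "for"}

    def prune(text):
        # keep only tokens that are neither stopwords nor numerals
        return [w for w in text.lower().split() if w not in STOPWORDS and not w.isdigit()]

    print(prune("The export of wheat rose in 2001"))    # ['export', 'wheat', 'rose']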
Weights of terms
- Weights usually range between 0 and 1
- binary weights may be used
- 1 denotes presence, 0 absence of the term in the document
- often the tf-idf function is used (sketched below)
- higher weight, if the term occurs often in the document
- lower weight, if the term occurs in many documents
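A minimal sketch (Python, not part of the original slides; the toy documents are invented) of a tf-idf weighting of this kind: terms frequent in the document get a higher weight, terms occurring in many documents a lower one:

    import math
    from collections import Counter

    docs = [d.split() for d in [
        "wheat farm wheat harvest",
        "farm equipment for sale",
        "wheat export prices rise",
    ]]
    N = len(docs)

    def tfidf(doc):
        counts = Counter(doc)
        weights = {}
        for term, count in counts.items():
            tf = count / len(doc)                     # term frequency in the document
            df = sum(1 for d in docs if term in d)    # document frequency in the collection
            weights[term] = tf * math.log(N / df)
        return weights

    print(tfidf(docs[0]))    # 'wheat' outweighs 'farm', which occurs in two documents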
Structure
- Either the full text of the document or selected parts of it are indexed
- e.g. in a patent categorization application:
- title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention
- some parts may be considered more important
- e.g. higher weight for the terms in the title
Dimensionality reduction
- Many algorithms cannot handle the high dimensionality of the term space (= large number of terms)
- usually dimensionality reduction is applied
- dimensionality reduction also reduces overfitting
- classifier that overfits the training data is
good at re-classifying the training data but
worse at classifying previously unseen data
Dimensionality reduction
- Local dimensionality reduction
- for each category, a reduced set of terms is chosen for classification under that category
- hence, different subsets are used when working with different categories
- Global dimensionality reduction
- a reduced set of terms is chosen for the classification under all categories
Dimensionality reduction
- Dimensionality reduction by term selection
- the terms of the reduced term set are a subset of the original term set
- Dimensionality reduction by term extraction
- the terms are not of the same type as the terms in the original term set, but are obtained by combinations and transformations of the original ones
Dimensionality reduction by term selection
- Goal: select terms that, when used for document indexing, yield the highest effectiveness in the given application
- wrapper approach
- the reduced set of terms is found iteratively and tested with the application
- filtering approach
- keep the terms that receive the highest score according to a function that measures the importance of the term for the task
Dimensionality reduction by term selection
- Many functions available
- document frequency: keep the high-frequency terms
- stopwords have already been removed
- 50% of the words occur only once in the document collection
- e.g. remove all terms occurring in at most 3 documents (sketched below)
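A minimal sketch (Python, not part of the original slides) of term selection by document frequency, dropping terms that occur in at most 3 documents; the threshold is taken from the example above:

    from collections import Counter

    def select_by_df(tokenized_docs, min_df=4):
        # count in how many documents each term occurs
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))
        # keep only terms occurring in at least min_df documents
        return {term for term, count in df.items() if count >= min_df}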
Dimensionality reduction by term selection
- Information-theoretic term selection functions, e.g.
- chi-square (sketched below)
- information gain
- mutual information
- odds ratio
- relevancy score
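A minimal sketch (Python, not part of the original slides; the variable names are ours) of a chi-square score for a term/category pair, computed from a 2x2 contingency table of document counts:

    def chi_square(A, B, C, D):
        # A: category documents containing the term, B: other documents containing it,
        # C: category documents without the term,    D: other documents without it
        N = A + B + C + D
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        return N * (A * D - C * B) ** 2 / denom if denom else 0.0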
Dimensionality reduction by term extraction
- Term extraction attempts to generate, from the original term set, a set of synthetic terms that maximizes effectiveness
- due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation
Dimensionality reduction by term extraction
- Term clustering
- tries to group words with a high degree of pairwise semantic relatedness
- the groups (or their centroids) may be used as dimensions
- Latent semantic indexing (sketched below)
- compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence
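A minimal sketch (Python with NumPy, not part of the original slides; the matrix values are invented) of latent semantic indexing via a truncated singular value decomposition of the term-document matrix:

    import numpy as np

    # term-document matrix: rows = terms, columns = documents
    A = np.array([[2.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 2.0, 1.0]])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                    # keep the k strongest dimensions
    docs_reduced = np.diag(s[:k]) @ Vt[:k]   # documents in the reduced space
    print(docs_reduced.shape)                # (2, 3): 3 documents, 2 dimensions each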
3. Text categorization
- Text classification, topic classification/spotting/detection
- problem setting
- assume a predefined set of categories and a set of documents
- label each document with one (or more) categories
Text categorization
- Two major approaches
- knowledge engineering (-> end of the 80s)
- manually defined set of rules encoding expert knowledge on how to classify documents under the given categories
- machine learning (90s ->)
- an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories
Text categorization
- Let
- D: a domain of documents
- C = {c1, ..., c|C|}: a set of predefined categories
- T = true, F = false
- The task is to approximate the unknown target function Φ : D x C -> {T, F} by means of a function Φ' : D x C -> {T, F}, such that the two functions coincide as much as possible
- function Φ: how documents should be classified
- function Φ': the classifier (hypothesis, model); a sketch follows below
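A small sketch (Python, not part of the original slides; the names are ours) of the shared signature of the target function and the learned classifier:

    # Both functions take a (document, category) pair and answer T (True) or F (False).
    def target_function(document: str, category: str) -> bool:
        ...   # the "ideal" assignment, known only for manually classified documents

    def classifier(document: str, category: str) -> bool:
        ...   # the learned approximation of the target function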
We assume...
- Categories are just symbolic labels
- no additional knowledge of their meaning is available
- No knowledge outside of the documents is available
- all decisions have to be made on the basis of the knowledge extracted from the documents
- metadata, e.g. publication date, document type, source etc., is not used
-> general methods
- Methods do not depend on any application-dependent knowledge
- in operational applications all kinds of knowledge can be used
- content-based decisions are necessarily subjective
- it is often difficult to measure the effectiveness of the classifiers
- even human classifiers do not always agree
Single-label vs. multi-label
- Single-label text categorization
- exactly 1 category must be assigned to each dj ∈ D
- Multi-label text categorization
- any number of categories may be assigned to the same dj ∈ D
- Special case of single-label: binary
- each dj must be assigned either to category ci or to its complement ¬ci
Single-label, multi-label
- The binary case (and, hence, the single-label case) is more general than the multi-label case
- an algorithm for binary classification can also be used for multi-label classification
- the converse is not true
Category-pivoted vs. document-pivoted
- Two different ways of using a text classifier
- given a document, we want to find all the categories under which it should be filed -> document-pivoted categorization (DPC)
- given a category, we want to find all the documents that should be filed under it -> category-pivoted categorization (CPC)
Category-pivoted vs. document-pivoted
- The distinction is important, since the sets C and D might not be available in their entirety right from the start
- DPC is suitable when documents become available at different moments in time, e.g. filtering e-mail
- CPC is suitable when new categories are added after some documents have already been classified (and have to be reclassified)
Category-pivoted vs. document-pivoted
- Some algorithms may apply to one style and not
the other, but most techniques are capable of
working in either mode
Hard categorization vs. ranking categorization
- Hard categorization
- the classifier answers T or F
- Ranking categorization
- given a document, the classifier might rank the categories according to their estimated appropriateness to the document
- respectively, given a category, the classifier might rank the documents
Applications of text categorization
- Automatic indexing for Boolean information retrieval systems
- document organization
- text filtering
- word sense disambiguation
- hierarchical categorization of Web pages
Automatic indexing for Boolean IR systems
- In an information retrieval system, each document is assigned one or more keywords or keyphrases describing its content
- the keywords belong to a finite set called a controlled dictionary
- as a TC problem: the entries in the controlled dictionary are viewed as categories
- k1 ≤ x ≤ k2 keywords are assigned to each document
- document-pivoted TC
Document organization
- Indexing with a controlled vocabulary is an instance of the general problem of document base organization
- e.g. a newspaper office has to classify the incoming classified ads under categories such as Personals, Cars for Sale, Real Estate, etc.
- organization of patents, filing of newspaper articles...
Text filtering
- Classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer
- e.g. a newsfeed
- producer: news agency, consumer: newspaper
- the filtering system should block the delivery of documents the consumer is likely not interested in
Word sense disambiguation
- Given the occurrence in a text of an ambiguous word, find the sense of this particular word occurrence
- E.g.
- Bank of England
- the bank of river Thames
- Last week I borrowed some money from the bank.
Word sense disambiguation
- Indexing by word senses rather than by words
- as text categorization:
- documents: word occurrence contexts
- categories: word senses
- also for resolving other natural language ambiguities
- context-sensitive spelling correction, part-of-speech tagging, prepositional phrase attachment, word choice selection in machine translation
Hierarchical categorization of Web pages
- E.g. Yahoo-like hierarchical Web catalogues
- typically, each category should be populated by a few documents
- new categories are added, obsolete ones removed
- usage of the link structure in classification
- usage of the hierarchical structure
Knowledge engineering approach
- In the 80s: knowledge engineering techniques
- building manually expert systems capable of taking text categorization decisions
- an expert system consists of a set of rules
- wheat & farm -> wheat
- wheat & commodity -> wheat
- bushels & export -> wheat
- wheat & winter & ¬soft -> wheat
Knowledge engineering approach
- Drawback: rules must be manually defined by a knowledge engineer with the aid of a domain expert
- any update necessitates human intervention again
- totally domain dependent
- -> an expensive and slow process
Machine learning approach
- A general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ¬ci by a domain expert
- from these characteristics the learner gleans the characteristics that a new unseen document should have in order to be classified under ci
- supervised learning (= supervised by the knowledge of the training documents)
Machine learning approach
- The learner is domain independent
- usually available off-the-shelf
- the inductive process is easily repeated if the set of categories changes
- manually classified documents are often already available
- a manual process may exist
- if not, it is still easier to manually classify a set of documents than to build and tune a set of rules
Training set, test set, validation set
- Initial corpus of manually classified documents
- let dj belong to the initial corpus
- for each pair <dj, ci> it is known whether dj should be filed under ci
- positive examples, negative examples of a category
Training set, test set, validation set
- The initial corpus is divided into two sets
- a training (and validation) set
- a test set
- the training set is used to build the classifier
- the test set is used for testing the effectiveness of the classifier
- each document is fed to the classifier and the decision is compared to the manual category
Training set, test set, validation set
- The documents in the test set are not used in the construction of the classifier
- alternative: k-fold cross-validation (sketched below)
- k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach on pairs, where k-1 sets form the training set and 1 set is used as the test set
- the individual results are then averaged
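A minimal sketch (Python, not part of the original slides) of k-fold cross-validation; train() and evaluate() stand for whatever learner and effectiveness measure are used:

    def k_fold(docs, labels, k, train, evaluate):
        # partition the corpus into k disjoint folds
        doc_folds = [docs[i::k] for i in range(k)]
        label_folds = [labels[i::k] for i in range(k)]
        scores = []
        for i in range(k):
            # k-1 folds form the training set, the remaining fold is the test set
            train_docs = [d for j in range(k) if j != i for d in doc_folds[j]]
            train_labels = [l for j in range(k) if j != i for l in label_folds[j]]
            classifier = train(train_docs, train_labels)
            scores.append(evaluate(classifier, doc_folds[i], label_folds[i]))
        # average the individual results
        return sum(scores) / k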
Training set, test set, validation set
- The training set can be split into two parts
- one part is used for optimising parameters
- test which values of the parameters yield the best effectiveness
- the test set and the validation set must be kept separate
Inductive construction of classifiers
- A ranking classifier for a category ci
- definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1
- documents are ranked according to their categorization status value
Inductive construction of classifiers
- A hard classifier for a category
- definition of a function that returns true or false, or
- definition of a function that returns a value between 0 and 1, followed by a definition of a threshold (sketched below)
- if the value is higher than the threshold -> true
- otherwise -> false
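A minimal sketch (Python, not part of the original slides) of turning a 0..1 categorization status value into a hard decision; the threshold of 0.5 is illustrative and would normally be tuned on a validation set:

    def hard_classify(status_value, threshold=0.5):
        # True if the categorization status value exceeds the threshold
        return status_value > threshold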
Learners
- Probabilistic classifiers (Naïve Bayes)
- decision tree classifiers
- decision rule classifiers
- regression methods
- on-line methods
- neural networks
- example-based classifiers (k-NN)
- support vector machines
Rocchio method
- A linear classifier method
- for each category, an explicit profile (or prototypical document) is constructed
- benefit: the profile is understandable even for humans
Rocchio method
- A classifier is a vector of the same dimension as the documents
- weights
- classifying: cosine similarity of the category vector and the document vector (sketched below)
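A minimal sketch (Python, not part of the original slides; the profile weights are invented) of classifying by cosine similarity between a category profile vector and a document vector; the construction of the profiles themselves is not shown:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def classify(doc_vector, profiles):
        # assign the category whose profile vector is most similar to the document
        return max(profiles, key=lambda c: cosine(profiles[c], doc_vector))

    profiles = {"wheat": [0.8, 0.1, 0.6], "cars": [0.1, 0.9, 0.2]}
    print(classify([0.7, 0.0, 0.5], profiles))    # 'wheat'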