Title: Processing of Large Document Collections 1
Processing of Large Document Collections 1
- Helena Ahonen-Myka
- University of Helsinki
Organization of the course
- Classes 17.9., 22.10., 23.10., 26.11.
- lectures (Helena Ahonen-Myka) 10-12,13-15
- exercise sessions (Lili Aunimo) 15-17
- required presence 75%
- Exercises are given (and returned) each week
- required 75%
- Exam 4.12. at 16-20, Auditorio
- Points: exam 30 pts, exercises 30 pts
Schedule
- 17.9. Character sets, preprocessing of text, text categorization
- 22.10. Text summarization
- 23.10. Text compression
- 26.11. to be announced
- self-study: basic transformations for text data, using linguistic tools, etc.
In this part...
- Character sets
- preprocessing of text
- text categorization
1. Character sets
- Abstract character vs. its graphical representation
- abstract characters are grouped into alphabets
- each alphabet forms the basis of the written form
of a certain language or a set of languages
Character sets
- For instance
- for English
- uppercase letters A-Z
- lowercase letters a-z
- punctuation marks
- digits 0-9
- common symbols ,
- ideographic symbols of Chinese and Japanese
- phonetic letters of Western languages
Character sets
- To represent text digitally, we need a mapping between (abstract) characters and values stored digitally (integers)
- this mapping is a character set
- the domain of the character set is called a character repertoire (= the alphabet for which the mapping is defined)
Character sets
- For each character in the character repertoire, the character set defines a code value in the set of code points (see the sketch below)
- in English
- 26 letters in both lower- and uppercase
- ten digits, some punctuation marks
- in Russian: Cyrillic letters
- both could use the same set of code points (if not a bilingual document)
- in Japanese: could be over 6000 characters
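A small illustration (Python, not part of the original slides) of the mapping between characters and code values, using the built-in ord() and chr():

    print(ord("A"))    # 65: the code value assigned to 'A'
    print(chr(97))     # 'a': the character with code value 97
    print(ord("я"))    # 1103: a Cyrillic letter has its own code point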
Character sets
- The mere existence of a character set supports operations like editing and searching of text
- usually character sets have some structure
- e.g. integers within a small range
- all lower-case (resp. upper-case) letters have
code values that are consecutive integers
(simplifies sorting etc.)
Character set standards
- Character sets can be arbitrary, but in practice standardization is needed for interoperability (between computers, programs, ...)
- early standards were designed for English only, or for a small group of languages at a time
Character set standards
- ASCII
- ISO-8859 (e.g. ISO Latin1)
- Unicode
- UTF-8, UTF-16
ASCII
- American Standard Code for Information Interchange
- a seven-bit code -> 128 code points
- actually 95 printable characters only
- code points 0-31 and 127 are assigned to control characters (mostly outdated)
- the ISO 646 (1972) version of ASCII incorporated several national variants (accented letters and currency symbols)
ASCII
- With 7 bits, the set of code points is too small for anything else than American English
- solution:
- 8 bits brings more code points (256)
- the ASCII character repertoire is mapped to the values 0-127
- additional symbols are mapped to the other values
Extended ASCII
- Problem:
- different manufacturers each developed their own 8-bit extensions to ASCII
- different character repertoires -> translation between them is not always possible
- also, 256 code values is not enough to represent all the alphabets -> different variants for different languages
ISO 8859
- Standardization of 8-bit character sets
- in the 80s the multipart standard ISO 8859 was produced
- defines a collection of 8-bit character sets, each designed for a group of languages
- the first part: ISO 8859-1 (ISO Latin1)
- covers most Western European languages
- 0-127 identical to ASCII, 128-159 (mostly)
unused, 96 code values for accented letters and
symbols
Unicode
- 256 code points is not enough
- for ideographically represented languages (Chinese, Japanese)
- for simultaneous use of several languages
- solution: more than one byte for each code value
- a 16-bit character set has 65,536 code points
Unicode
- A 16-bit character set, i.e. 65,536 code points
- not sufficient to give all the characters required for the Chinese, Japanese, and Korean scripts distinct positions
- CJK consolidation: characters of these scripts are given the same value if they look the same
Unicode
- Code values for all the characters used to write contemporary major languages
- also the classical forms of some languages
- Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic, Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Thai, Lao, Georgian, Tibetan
- Chinese, Japanese, and Korean ideograms, and the Japanese and Korean phonetic and syllabic scripts
Unicode
- punctuation marks
- technical and mathematical symbols
- arrows
- dingbats (pointing hands, stars, ...)
- both accented letters and separate diacritical marks (accents, tildes) are included, with a mechanism for building composite characters
- this can also create problems: two characters that look the same may have different code values
- -> normalization may be necessary (sketched below)
Unicode
- Code values for nearly 39,000 symbols are provided
- some part is reserved for an expansion method (see later)
- 6,400 code points are reserved for private use
- they will never be assigned to any character by
the standard, so they will not conflict with the
standard
Unicode encodings
- Encoding is a mapping that transforms a code value into a sequence of bytes for storage and transmission
- identity mapping for an 8-bit code?
- it may be necessary to encode 8-bit characters as sequences of 7-bit (ASCII) characters
- e.g. Quoted-Printable (QP, sketched below)
- code values 128-255 are encoded as a sequence of 3 bytes
- byte 1: the ASCII code for '=', bytes 2-3: the hexadecimal digits of the value
- e.g. 233 -> E9 -> =E9
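A minimal sketch (Python, not part of the original slides) of the Quoted-Printable encoding described above, using the standard quopri module; the letter é (code value 233 in ISO 8859-1) serves as the example:

    import quopri

    raw = "é".encode("latin-1")            # b'\xe9', code value 233
    encoded = quopri.encodestring(raw)     # b'=E9': '=' followed by two hex digits
    print(encoded)
    print(quopri.decodestring(encoded))    # back to b'\xe9'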
Unicode encodings
- UTF-8 (sketched below)
- ASCII code values are likely to be more common in most text than any other values
- in UTF-8 encoding, ASCII characters are sent as themselves (high-order bit 0)
- other characters (two-byte code values) are encoded using up to six bytes (the high-order bit is set to 1)
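A small illustration (Python, not part of the original slides) of UTF-8's variable-length encoding:

    print("A".encode("utf-8"))     # b'A': an ASCII character is sent as itself (1 byte)
    print("é".encode("utf-8"))     # b'\xc3\xa9': 2 bytes, high-order bits set
    print("漢".encode("utf-8"))    # b'\xe6\xbc\xa2': a CJK ideogram takes 3 bytes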
Unicode encodings
- UTF-16: expansion method (sketched below)
- two 16-bit values are combined into a 32-bit value
- -> a million characters available
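A small illustration (Python, not part of the original slides) of the UTF-16 expansion method: a code point beyond the 16-bit range is stored as a pair of 16-bit values:

    ch = "\U0001F600"                     # code point U+1F600, beyond the 16-bit range
    print(ch.encode("utf-16-be").hex())   # 'd83dde00': two 16-bit values (a surrogate pair)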
2. Preprocessing of text
- Text cannot be directly interpreted by many document processing applications
- an indexing procedure is needed
- mapping of a text into a compact representation of its content
- which are the meaningful units of text?
- how should these units be combined?
- usually not important
Vector model
- A document is usually represented as a vector of term weights (sketched below)
- the vector has as many dimensions as there are terms (or features) in the whole collection of documents
- the weight represents how much the term contributes to the semantics of the document
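A minimal sketch (Python, not part of the original slides; the vocabulary and document are invented) of a document as a vector of term weights, here with binary weights:

    vocab = ["export", "farm", "prices", "wheat"]        # terms of the whole collection
    doc = "wheat export prices rise".split()
    vector = [1 if term in doc else 0 for term in vocab]
    print(vector)    # [1, 0, 1, 1]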
Vector model
- Different approaches
- different ways to understand what a term is
- different ways to compute term weights
Terms
- Words
- typical choice
- set of words, bag of words
- phrases
- syntactical phrases
- statistical phrases
- usefulness not yet known?
Terms
- Some parts of the text are not considered as terms
- very common words (function words)
- articles, prepositions, conjunctions
- numerals
- these words are pruned
- stopword list
- other preprocessing possible
- stemming, base words
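A minimal sketch (Python, not part of the original slides) of pruning function words and numerals with a stopword list; the list itself is only illustrative:

    STOPWORDS = {"the", "a", "of", "and", "in", "to", "for"}

    def prune(text):
        # keep only tokens that are neither stopwords nor numerals
        return [w for w in text.lower().split() if w not in STOPWORDS and not w.isdigit()]

    print(prune("The export of wheat rose in 2001"))    # ['export', 'wheat', 'rose']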
Weights of terms
- Weights usually range between 0 and 1
- binary weights may be used
- 1 denotes presence, 0 absence of the term in the document
- often the tf-idf function is used (sketched below)
- higher weight, if the term occurs often in the document
- lower weight, if the term occurs in many documents
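A minimal sketch (Python, not part of the original slides; the toy documents are invented) of a tf-idf weighting of this kind: terms frequent in the document get a higher weight, terms occurring in many documents a lower one:

    import math
    from collections import Counter

    docs = [d.split() for d in [
        "wheat farm wheat harvest",
        "farm equipment for sale",
        "wheat export prices rise",
    ]]
    N = len(docs)

    def tfidf(doc):
        counts = Counter(doc)
        weights = {}
        for term, count in counts.items():
            tf = count / len(doc)                     # term frequency in the document
            df = sum(1 for d in docs if term in d)    # document frequency in the collection
            weights[term] = tf * math.log(N / df)
        return weights

    print(tfidf(docs[0]))    # 'wheat' outweighs 'farm', which occurs in two documents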
Structure
- Either the full text of the document or selected parts of it are indexed
- e.g. in a patent categorization application:
- title, abstract, the first 20 lines of the summary, and the section containing the claims of novelty of the described invention
- some parts may be considered more important
- e.g. higher weight for the terms in the title
Dimensionality reduction
- Many algorithms cannot handle the high dimensionality of the term space (= large number of terms)
- usually dimensionality reduction is applied
- dimensionality reduction also reduces overfitting
- classifier that overfits the training data is
good at re-classifying the training data but
worse at classifying previously unseen data
Dimensionality reduction
- Local dimensionality reduction
- for each category, a reduced set of terms is chosen for classification under that category
- hence, different subsets are used when working with different categories
- Global dimensionality reduction
- a reduced set of terms is chosen for the classification under all categories
Dimensionality reduction
- Dimensionality reduction by term selection
- the terms of the reduced term set are a subset of the original term set
- Dimensionality reduction by term extraction
- the terms are not of the same type as the terms in the original term set, but are obtained by combinations and transformations of the original ones
Dimensionality reduction by term selection
- Goal: select terms that, when used for document indexing, yield the highest effectiveness in the given application
- wrapper approach
- the reduced set of terms is found iteratively and tested with the application
- filtering approach
- keep the terms that receive the highest score according to a function that measures the importance of the term for the task
Dimensionality reduction by term selection
- Many functions available
- document frequency: keep the high-frequency terms
- stopwords have already been removed
- 50% of the words occur only once in the document collection
- e.g. remove all terms occurring in at most 3 documents (sketched below)
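A minimal sketch (Python, not part of the original slides) of term selection by document frequency, dropping terms that occur in at most 3 documents; the threshold is taken from the example above:

    from collections import Counter

    def select_by_df(tokenized_docs, min_df=4):
        # count in how many documents each term occurs
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))
        # keep only terms occurring in at least min_df documents
        return {term for term, count in df.items() if count >= min_df}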
Dimensionality reduction by term selection
- Information-theoretic term selection functions, e.g.
- chi-square (sketched below)
- information gain
- mutual information
- odds ratio
- relevancy score
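A minimal sketch (Python, not part of the original slides; the variable names are ours) of a chi-square score for a term/category pair, computed from a 2x2 contingency table of document counts:

    def chi_square(A, B, C, D):
        # A: category documents containing the term, B: other documents containing it,
        # C: category documents without the term,    D: other documents without it
        N = A + B + C + D
        denom = (A + C) * (B + D) * (A + B) * (C + D)
        return N * (A * D - C * B) ** 2 / denom if denom else 0.0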
Dimensionality reduction by term extraction
- Term extraction attempts to generate, from the original term set, a set of synthetic terms that maximizes effectiveness
- due to polysemy, homonymy, and synonymy, the original terms may not be optimal dimensions for document content representation
Dimensionality reduction by term extraction
- Term clustering
- tries to group words with a high degree of pairwise semantic relatedness
- the groups (or their centroids) may be used as dimensions
- Latent semantic indexing (sketched below)
- compresses document vectors into vectors of a lower-dimensional space whose dimensions are obtained as combinations of the original dimensions by looking at their patterns of co-occurrence
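A minimal sketch (Python with NumPy, not part of the original slides; the matrix values are invented) of latent semantic indexing via a truncated singular value decomposition of the term-document matrix:

    import numpy as np

    # term-document matrix: rows = terms, columns = documents
    A = np.array([[2.0, 0.0, 1.0],
                  [1.0, 1.0, 0.0],
                  [0.0, 2.0, 1.0]])
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 2                                    # keep the k strongest dimensions
    docs_reduced = np.diag(s[:k]) @ Vt[:k]   # documents in the reduced space
    print(docs_reduced.shape)                # (2, 3): 3 documents, 2 dimensions each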
3. Text categorization
- Text classification, topic classification/spotting/detection
- problem setting
- assume a predefined set of categories and a set of documents
- label each document with one (or more) categories
Text categorization
- Two major approaches
- knowledge engineering (-> end of the 80s)
- manually defined set of rules encoding expert knowledge on how to classify documents under the given categories
- machine learning (90s ->)
- an automatic text classifier is built by learning, from a set of preclassified documents, the characteristics of the categories
Text categorization
- Let
- D: a domain of documents
- C = {c1, ..., c|C|}: a set of predefined categories
- T = true, F = false
- The task is to approximate the unknown target function Φ : D x C -> {T, F} by means of a function Φ' : D x C -> {T, F}, such that the two functions coincide as much as possible
- function Φ: how documents should be classified
- function Φ': the classifier (hypothesis, model); a sketch follows below
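A small sketch (Python, not part of the original slides; the names are ours) of the shared signature of the target function and the learned classifier:

    # Both functions take a (document, category) pair and answer T (True) or F (False).
    def target_function(document: str, category: str) -> bool:
        ...   # the "ideal" assignment, known only for manually classified documents

    def classifier(document: str, category: str) -> bool:
        ...   # the learned approximation of the target function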
We assume...
- Categories are just symbolic labels
- no additional knowledge of their meaning is available
- No knowledge outside of the documents is available
- all decisions have to be made on the basis of the knowledge extracted from the documents
- metadata, e.g. publication date, document type, source etc., is not used
-> general methods
- Methods do not depend on any application-dependent knowledge
- in operational applications all kinds of knowledge can be used
- content-based decisions are necessarily subjective
- it is often difficult to measure the effectiveness of the classifiers
- even human classifiers do not always agree
Single-label vs. multi-label
- Single-label text categorization
- exactly 1 category must be assigned to each dj ∈ D
- Multi-label text categorization
- any number of categories may be assigned to the same dj ∈ D
- Special case of single-label: binary
- each dj must be assigned either to category ci or to its complement ¬ci
Single-label, multi-label
- The binary case (and, hence, the single-label case) is more general than the multi-label case
- an algorithm for binary classification can also be used for multi-label classification
- the converse is not true
Category-pivoted vs. document-pivoted
- Two different ways of using a text classifier
- given a document, we want to find all the categories under which it should be filed -> document-pivoted categorization (DPC)
- given a category, we want to find all the documents that should be filed under it -> category-pivoted categorization (CPC)
Category-pivoted vs. document-pivoted
- The distinction is important, since the sets C and D might not be available in their entirety right from the start
- DPC is suitable when documents become available at different moments in time, e.g. filtering e-mail
- CPC is suitable when new categories are added after some documents have already been classified (and have to be reclassified)
Category-pivoted vs. document-pivoted
- Some algorithms may apply to one style and not
the other, but most techniques are capable of
working in either mode
Hard categorization vs. ranking categorization
- Hard categorization
- the classifier answers T or F
- Ranking categorization
- given a document, the classifier might rank the categories according to their estimated appropriateness to the document
- respectively, given a category, the classifier might rank the documents
Applications of text categorization
- Automatic indexing for Boolean information retrieval systems
- document organization
- text filtering
- word sense disambiguation
- hierarchical categorization of Web pages
Automatic indexing for Boolean IR systems
- In an information retrieval system, each document is assigned one or more keywords or keyphrases describing its content
- the keywords belong to a finite set called a controlled dictionary
- as a TC problem: the entries in the controlled dictionary are viewed as categories
- k1 ≤ x ≤ k2 keywords are assigned to each document
- document-pivoted TC
Document organization
- Indexing with a controlled vocabulary is an instance of the general problem of document base organization
- e.g. a newspaper office has to classify the incoming classified ads under categories such as Personals, Cars for Sale, Real Estate, etc.
- organization of patents, filing of newspaper articles...
Text filtering
- Classifying a stream of incoming documents dispatched in an asynchronous way by an information producer to an information consumer
- e.g. a newsfeed
- producer: news agency, consumer: newspaper
- the filtering system should block the delivery of documents the consumer is likely not interested in
Word sense disambiguation
- Given the occurrence in a text of an ambiguous word, find the sense of this particular word occurrence
- E.g.
- Bank of England
- the bank of river Thames
- Last week I borrowed some money from the bank.
Word sense disambiguation
- Indexing by word senses rather than by words
- as text categorization:
- documents: word occurrence contexts
- categories: word senses
- also for resolving other natural language ambiguities
- context-sensitive spelling correction, part-of-speech tagging, prepositional phrase attachment, word choice selection in machine translation
Hierarchical categorization of Web pages
- E.g. Yahoo-like hierarchical Web catalogues
- typically, each category should be populated by a few documents
- new categories are added, obsolete ones removed
- usage of the link structure in classification
- usage of the hierarchical structure
Knowledge engineering approach
- In the 80s: knowledge engineering techniques
- building manually expert systems capable of taking text categorization decisions
- an expert system consists of a set of rules
- wheat & farm -> wheat
- wheat & commodity -> wheat
- bushels & export -> wheat
- wheat & winter & ¬soft -> wheat
Knowledge engineering approach
- Drawback: rules must be manually defined by a knowledge engineer with the aid of a domain expert
- any update necessitates human intervention again
- totally domain dependent
- -> an expensive and slow process
Machine learning approach
- A general inductive process (learner) automatically builds a classifier for a category ci by observing the characteristics of a set of documents manually classified under ci or ¬ci by a domain expert
- from these characteristics the learner gleans the characteristics that a new unseen document should have in order to be classified under ci
- supervised learning (= supervised by the knowledge of the training documents)
Machine learning approach
- The learner is domain independent
- usually available off-the-shelf
- the inductive process is easily repeated if the set of categories changes
- manually classified documents are often already available
- a manual process may exist
- if not, it is still easier to manually classify a set of documents than to build and tune a set of rules
Training set, test set, validation set
- Initial corpus of manually classified documents
- let dj belong to the initial corpus
- for each pair <dj, ci> it is known whether dj should be filed under ci
- positive examples, negative examples of a category
Training set, test set, validation set
- The initial corpus is divided into two sets
- a training (and validation) set
- a test set
- the training set is used to build the classifier
- the test set is used for testing the effectiveness of the classifier
- each document is fed to the classifier and the decision is compared to the manual category
Training set, test set, validation set
- The documents in the test set are not used in the construction of the classifier
- alternative: k-fold cross-validation (sketched below)
- k different classifiers are built by partitioning the initial corpus into k disjoint sets and then iteratively applying the train-and-test approach on pairs, where k-1 sets form the training set and 1 set is used as the test set
- the individual results are then averaged
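A minimal sketch (Python, not part of the original slides) of k-fold cross-validation; train() and evaluate() stand for whatever learner and effectiveness measure are used:

    def k_fold(docs, labels, k, train, evaluate):
        # partition the corpus into k disjoint folds
        doc_folds = [docs[i::k] for i in range(k)]
        label_folds = [labels[i::k] for i in range(k)]
        scores = []
        for i in range(k):
            # k-1 folds form the training set, the remaining fold is the test set
            train_docs = [d for j in range(k) if j != i for d in doc_folds[j]]
            train_labels = [l for j in range(k) if j != i for l in label_folds[j]]
            classifier = train(train_docs, train_labels)
            scores.append(evaluate(classifier, doc_folds[i], label_folds[i]))
        # average the individual results
        return sum(scores) / k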
Training set, test set, validation set
- The training set can be split into two parts
- one part is used for optimising parameters
- test which values of the parameters yield the best effectiveness
- the test set and the validation set must be kept separate
Inductive construction of classifiers
- A ranking classifier for a category ci
- definition of a function that, given a document, returns a categorization status value for it, i.e. a number between 0 and 1
- documents are ranked according to their categorization status value
Inductive construction of classifiers
- A hard classifier for a category
- definition of a function that returns true or false, or
- definition of a function that returns a value between 0 and 1, followed by a definition of a threshold (sketched below)
- if the value is higher than the threshold -> true
- otherwise -> false
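A minimal sketch (Python, not part of the original slides) of turning a 0..1 categorization status value into a hard decision; the threshold of 0.5 is illustrative and would normally be tuned on a validation set:

    def hard_classify(status_value, threshold=0.5):
        # True if the categorization status value exceeds the threshold
        return status_value > threshold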
Learners
- Probabilistic classifiers (Naïve Bayes)
- decision tree classifiers
- decision rule classifiers
- regression methods
- on-line methods
- neural networks
- example-based classifiers (k-NN)
- support vector machines
Rocchio method
- A linear classifier method
- for each category, an explicit profile (or prototypical document) is constructed
- benefit: the profile is understandable even for humans
Rocchio method
- A classifier is a vector of the same dimension as the documents
- weights
- classifying: cosine similarity of the category vector and the document vector (sketched below)
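A minimal sketch (Python, not part of the original slides; the profile weights are invented) of classifying by cosine similarity between a category profile vector and a document vector; the construction of the profiles themselves is not shown:

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
        return dot / norm if norm else 0.0

    def classify(doc_vector, profiles):
        # assign the category whose profile vector is most similar to the document
        return max(profiles, key=lambda c: cosine(profiles[c], doc_vector))

    profiles = {"wheat": [0.8, 0.1, 0.6], "cars": [0.1, 0.9, 0.2]}
    print(classify([0.7, 0.0, 0.5], profiles))    # 'wheat'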