Processing of Large Document Collections 1

1
Processing of Large Document Collections 1
  • Helena Ahonen-Myka
  • University of Helsinki

2
Organization of the course
  • Classes 17.9., 22.10., 23.10., 26.11.
  • lectures (Helena Ahonen-Myka) 10-12, 13-15
  • exercise sessions (Lili Aunimo) 15-17
  • required presence 75 %
  • Exercises are given (and returned) each week
  • required 75 %
  • Exam 4.12. at 16-20, Auditorio
  • Points: Exam 30 pts, exercises 30 pts

3
Schedule
  • 17.9. Character sets, preprocessing of text, text
    categorization
  • 22.10. Text summarization
  • 23.10. Text compression
  • 26.11. to be announced
  • self-study: basic transformations for text data,
    using linguistic tools, etc.

4
In this part...
  • Character sets
  • preprocessing of text
  • text categorization

5
1. Character sets
  • Abstract character vs. its graphical
    representation
  • abstract characters are grouped into alphabets
  • each alphabet forms the basis of the written form
    of a certain language or a set of languages

6
Character sets
  • For instance
  • for English
  • uppercase letters A-Z
  • lowercase letters a-z
  • punctuation marks
  • digits 0-9
  • common symbols
  • ideographic symbols of Chinese and Japanese
  • phonetic letters of Western languages

7
Character sets
  • To represent text digitally, we need a mapping
    between (abstract) characters and values stored
    digitally (integers)
  • this mapping is a character set
  • the domain of the character set is called a
    character repertoire (= the alphabet for which
    the mapping is defined)

8
Character sets
  • For each character in the character repertoire,
    the character set defines a code value in the set
    of code points
  • in English:
  • 26 letters in both lower- and uppercase
  • ten digits, some punctuation marks
  • in Russian: Cyrillic letters
  • both could use the same set of code points (if
    the document is not bilingual)
  • in Japanese: over 6,000 characters may be needed

9
Character sets
  • The mere existence of a character set supports
    operations like editing and searching of text
  • usually character sets have some structure
  • e.g. integers within a small range
  • all lower-case (resp. upper-case) letters have
    code values that are consecutive integers
    (simplifies sorting etc.)
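  • for example, a small Python check (an illustration, not part
    of the original slides):

      # consecutive code values make comparisons and sorting
      # work directly on the characters
      print(ord("a"), ord("b"), ord("z"))   # 97 98 122
      print(ord("B") - ord("A"))            # 1: upper-case letters are consecutive too
      print(sorted("dacb"))                 # ['a', 'b', 'c', 'd'] (ordered by code value)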

10
Character set standards
  • Character sets can be arbitrary, but in practice
    standardization is needed for interoperability
    (between computers, programs,...)
  • early standards were designed for English only,
    or for a small group of languages at a time

11
Character set standards
  • ASCII
  • ISO-8859 (e.g. ISO Latin1)
  • Unicode
  • UTF-8, UTF-16

12
ASCII
  • American Standard Code for Information
    Interchange
  • A seven-bit code -> 128 code points
  • actually only 95 printable characters
  • code points 0-31 and 127 are assigned to control
    characters (mostly outdated)
  • ISO 646 (1972) version of ASCII incorporated
    several national variants (accented letters and
    currency symbols)

13
ASCII
  • With 7 bits, the set of code points is too small
    for anything other than American English
  • solution:
  • 8 bits bring more code points (256)
  • the ASCII character repertoire is mapped to the
    values 0-127
  • additional symbols are mapped to the other values

14
Extended ASCII
  • Problem
  • different manufacturers each developed their own
    8-bit extensions to ASCII
  • different character repertoires -> translation
    between them is not always possible
  • also, 256 code values are not enough to represent
    all the alphabets -> different variants for
    different languages

15
ISO 8859
  • Standardization of 8-bit character sets
  • In the 80s, the multipart standard ISO 8859 was
    produced
  • defines a collection of 8-bit character sets,
    each designed for a group of languages
  • the first part ISO 8859-1 (ISO Latin1)
  • covers most Western European languages
  • 0-127 identical to ASCII, 128-159 (mostly)
    unused, 96 code values for accented letters and
    symbols

16
Unicode
  • 256 code points are not enough
  • for ideographically represented languages
    (Chinese, Japanese)
  • for simultaneous use of several languages
  • solution: use more than one byte for each code
    value
  • a 16-bit character set has 65,536 code points

17
Unicode
  • 16-bit character set, i.e. 65,536 code points
  • not sufficient to give all the characters required
    for the Chinese, Japanese, and Korean scripts
    distinct positions
  • CJK consolidation: characters of these scripts
    are given the same value if they look the same

18
Unicode
  • Code values for all the characters used to write
    contemporary major languages
  • also the classical forms of some languages
  • Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic,
    Devanagari, Bengali, Gurmukhi, Gujarati, Oriya,
    Tamil, Telugu, Kannada, Malayalam, Thai, Lao,
    Georgian, Tibetan
  • Chinese, Japanese, and Korean ideograms, and the
    Japanese and Korean phonetic and syllabic scripts

19
Unicode
  • punctuation marks
  • technical and mathematical symbols
  • arrows
  • dingbats (pointing hands, stars, ...)
  • both accented letters and separate diacritical
    marks (accents, tildes) are included, with a
    mechanism for building composite characters
  • this can also create problems: two characters that
    look the same may have different code values
  • -> normalization may be necessary

20
Unicode
  • Code values for nearly 39,000 symbols are
    provided
  • some part is reserved for an expansion method
    (see later)
  • 6,400 code points are reserved for private use
  • they will never be assigned to any character by
    the standard, so they will not conflict with the
    standard

21
Unicode encodings
  • Encoding is a mapping that transforms a code
    value into a sequence of bytes for storage and
    transmission
  • identity mapping for an 8-bit code?
  • it may be necessary to encode 8-bit characters as
    sequences of 7-bit (ASCII) characters
  • e.g. Quoted-Printable (QP)
  • code values 128-255 are encoded as a sequence of
    3 bytes
  • byte 1: the ASCII code for '=', bytes 2-3: the two
    hexadecimal digits of the code value
  • e.g. code value 233 -> hex E9 -> encoded as =E9
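  • a minimal sketch of the idea in Python (illustrative only;
    real QP also escapes the '=' character itself and handles
    line breaks):

      # Quoted-Printable-style encoding of a single 8-bit value:
      # values 128-255 become '=' followed by two hex digits.
      def qp_encode_byte(value):
          if value < 128:
              return chr(value)               # ASCII bytes pass through
          return "={:02X}".format(value)      # e.g. 233 -> "=E9"

      print(qp_encode_byte(233))   # =E9
      print(qp_encode_byte(65))    # A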

22
Unicode encodings
  • UTF-8
  • ASCII code values are likely to be more common in
    most text than any other values
  • in UTF-8 encoding, ASCII characters are sent as
    themselves (high-order bit 0)
  • other characters are encoded using two to six
    bytes (each with the high-order bit set to 1)
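  • a quick illustration in Python (the example characters are
    arbitrary choices):

      # UTF-8: ASCII code points use a single byte, other characters
      # use multi-byte sequences whose bytes have the high bit set
      for ch in ["A", "é", "間"]:
          print(ch, [hex(b) for b in ch.encode("utf-8")])
      # A   ['0x41']                  (1 byte, high bit 0)
      # é   ['0xc3', '0xa9']          (2 bytes, high bit 1)
      # 間  ['0xe9', '0x96', '0x93']  (3 bytes, high bit 1)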

23
Unicode encodings
  • UTF-16 expansion method
  • two 16-bit values (a surrogate pair) are combined
    into a 32-bit value -> about a million additional
    characters become available
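  • a small Python check of the idea (the character used is just
    an example of a code point above U+FFFF):

      # characters above U+FFFF are stored in UTF-16 as a pair of
      # 16-bit code units (a surrogate pair)
      ch = "\U0001D11E"                     # MUSICAL SYMBOL G CLEF
      print([hex(b) for b in ch.encode("utf-16-be")])
      # ['0xd8', '0x34', '0xdd', '0x1e'] -> the units 0xD834 0xDD1E
      # together encode one character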

24
2. Preprocessing of text
  • Text cannot be directly interpreted by many
    document processing applications
  • an indexing procedure is needed
  • mapping of a text into a compact representation
    of its content
  • which are the meaningful units of text?
  • how should these units be combined? (the
    combination is usually not considered important)

25
Vector model
  • A document is usually represented as a vector of
    term weights
  • the vector has as many dimensions as there are
    terms (or features) in the whole collection of
    documents
  • the weight represents how much the term
    contributes to the semantics of the document

26
Vector model
  • Different approaches
  • different ways to understand what a term is
  • different ways to compute term weights

27
Terms
  • Words
  • typical choice
  • set of words, bag of words
  • phrases
  • syntactical phrases
  • statistical phrases
  • usefulness not yet known?

28
Terms
  • Part of the text is not considered as terms
  • very common words (function words)
  • articles, prepositions, conjunctions
  • numerals
  • these words are pruned using a stopword list
  • other preprocessing is possible
  • stemming, reduction to base forms (see the sketch
    below)
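  • a minimal preprocessing sketch in Python (the stopword list
    and the crude suffix-stripping rule are illustrative
    assumptions, not a recommended implementation):

      STOPWORDS = {"the", "a", "an", "of", "and", "or", "in", "on", "is"}

      def preprocess(text):
          # tokenize, lower-case, prune stopwords, crude "stemming"
          tokens = [t.lower() for t in text.split() if t.isalpha()]
          tokens = [t for t in tokens if t not in STOPWORDS]
          # toy stemmer: strip a plural 's' (a real system would use
          # e.g. a Porter-style stemmer or a morphological analyser)
          return [t[:-1] if t.endswith("s") and len(t) > 3 else t
                  for t in tokens]

      print(preprocess("The claims of the described inventions"))
      # ['claim', 'described', 'invention']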

29
Weights of terms
  • Weights usually range between 0 and 1
  • binary weights may be used
  • 1 denotes presence, 0 absence of the term in the
    document
  • often the tfidf function is used
  • higher weight, if the term occurs often in the
    document
  • lower weight, if the term occurs in many documents
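  • a rough sketch of one common tf-idf variant in Python (the
    exact weighting and normalization differ between systems):

      import math

      # high weight: term frequent in the document
      # low weight: term occurs in many documents of the collection
      def tfidf(term_count_in_doc, docs_containing_term, n_docs):
          if term_count_in_doc == 0 or docs_containing_term == 0:
              return 0.0
          return term_count_in_doc * math.log(n_docs / docs_containing_term)

      # term occurs 3 times in the document and in 10 of 1000 documents
      print(tfidf(3, 10, 1000))   # about 13.8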

30
Structure
  • Either the full text of the document or selected
    parts of it are indexed
  • e.g. in a patent categorization application
  • title, abstract, the first 20 lines of the
    summary, and the section containing the claims of
    novelty of the described invention
  • some parts may be considered more important
  • e.g. higher weight for the terms in the title

31
Dimensionality reduction
  • Many algorithms cannot handle the high
    dimensionality of the term space (= a large
    number of terms)
  • usually dimensionality reduction is applied
  • dimensionality reduction also reduces overfitting
  • a classifier that overfits the training data is
    good at re-classifying the training data but
    worse at classifying previously unseen data

32
Dimensionality reduction
  • Local dimensionality reduction
  • for each category, a reduced set of terms is
    chosen for classification under that category
  • hence, different subsets are used when working
    with different categories
  • global dimensionality reduction
  • a reduced set of terms is chosen for the
    classification under all categories

33
Dimensionality reduction
  • Dimensionality reduction by term selection
  • the terms of the reduced term set are a subset of
    the original term set
  • Dimensionality reduction by term extraction
  • the terms are not of the same type as the terms
    in the original term set, but are obtained by
    combinations and transformations of the original
    ones

34
Dimensionality reduction by term selection
  • Goal: select terms that, when used for document
    indexing, yield the highest effectiveness in the
    given application
  • wrapper approach
  • the reduced set of terms is found iteratively and
    tested with the application
  • filtering approach
  • keep the terms that receive the highest score
    according to a function that measures the
    importance of the term for the task

35
Dimensionality reduction by term selection
  • Many functions available
  • document frequency: keep the high-frequency terms
  • stopwords have already been removed
  • 50 % of the words occur only once in the document
    collection
  • e.g. remove all terms occurring in at most 3
    documents (sketched below)
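  • a sketch of document-frequency-based selection in Python
    (the threshold follows the example above; the input format
    is an assumption):

      from collections import Counter

      # documents: a list of token lists (already preprocessed)
      def select_by_document_frequency(documents, min_df=4):
          df = Counter()
          for doc in documents:
              df.update(set(doc))       # count each term once per document
          # keep terms occurring in at least min_df documents,
          # i.e. drop terms occurring in at most min_df - 1 documents
          return {term for term, count in df.items() if count >= min_df}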

36
Dimensionality reduction by term selection
  • Information-theoretic term selection functions,
    e.g.
  • chi-square
  • information gain
  • mutual information
  • odds ratio
  • relevancy score
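  • for instance, the chi-square score of a term t for a category
    c can be computed from a 2x2 contingency table of document
    counts; a sketch (the counts below are made-up numbers):

      # A: docs of c containing t      B: docs not of c containing t
      # C: docs of c without t         D: docs not of c without t
      def chi_square(A, B, C, D):
          N = A + B + C + D
          numerator = N * (A * D - C * B) ** 2
          denominator = (A + C) * (B + D) * (A + B) * (C + D)
          return numerator / denominator if denominator else 0.0

      print(chi_square(A=40, B=10, C=60, D=890))   # high score -> t is informative for c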

37
Dimensionality reduction by term extraction
  • Term extraction attempts to generate, from the
    original term set, a set of synthetic terms
    that maximize effectiveness
  • due to polysemy, homonymy, and synonymy, the
    original terms may not be optimal dimensions for
    document content representation

38
Dimensionality reduction by term extraction
  • Term clustering
  • tries to group words with a high degree of
    pairwise semantic relatedness
  • groups (or their centroids) may be used as
    dimensions
  • latent semantic indexing
  • compresses document vectors into vectors of a
    lower-dimensional space whose dimensions are
    obtained as combinations of the original
    dimensions by looking at their patterns of
    co-occurrence
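  • latent semantic indexing is usually implemented with a
    truncated singular value decomposition of the term-document
    matrix; a minimal numpy sketch (the matrix and the number of
    dimensions k are illustrative):

      import numpy as np

      # term-document matrix: rows = terms, columns = documents
      X = np.array([[2, 0, 1, 0],
                    [1, 1, 0, 0],
                    [0, 2, 0, 1],
                    [0, 0, 1, 2]], dtype=float)

      U, s, Vt = np.linalg.svd(X, full_matrices=False)
      k = 2                                      # reduced dimensionality
      doc_vectors = (np.diag(s[:k]) @ Vt[:k]).T  # documents in k dimensions
      print(doc_vectors.shape)                   # (4, 2)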

39
3. Text categorization
  • Text classification, topic classification/spotting
    /detection
  • problem setting
  • assume a predefined set of categories and a set
    of documents
  • label each document with one (or more) categories

40
Text categorization
  • Two major approaches
  • knowledge engineering (until the end of the 80s)
  • manually defined set of rules encoding expert
    knowledge on how to classify documents under the
    given categories
  • machine learning (from the 90s onwards)
  • an automatic text classifier is built by
    learning, from a set of preclassified documents,
    the characteristics of the categories

41
Text categorization
  • Let
  • D: a domain of documents
  • C = {c1, ..., c|C|}: a set of predefined
    categories
  • T = true, F = false
  • The task is to approximate the unknown target
    function Φ : D x C -> {T, F} by means of a
    function Φ' : D x C -> {T, F}, such that the two
    functions coincide as much as possible
  • function Φ: how documents should be classified
  • function Φ': the classifier (hypothesis, model)

42
We assume...
  • Categories are just symbolic labels
  • no additional knowledge of their meaning is
    available
  • No knowledge outside of the documents is
    available
  • all decisions have to be made on the basis of the
    knowledge extracted from the documents
  • metadata, e.g., publication date, document type,
    source etc. is not used

43
-> general methods
  • Methods do not depend on any application-dependent
    knowledge
  • in operational applications all kinds of
    knowledge can be used
  • content-based decisions are necessarily
    subjective
  • it is often difficult to measure the
    effectiveness of the classifiers
  • even human classifiers do not always agree

44
Single-label vs. multi-label
  • Single-label text categorization
  • exactly 1 category must be assigned to each
    document dj ∈ D
  • Multi-label text categorization
  • any number of categories may be assigned to the
    same dj ∈ D
  • Special case of single-label: binary
  • each dj must be assigned either to category ci or
    to its complement ¬ci

45
Single-label, multi-label
  • The binary case (and, hence, the single-label
    case) is more general than the multi-label case
  • an algorithm for binary classification can also
    be used for multi-label classification
  • the converse is not true

46
Category-pivoted vs. document-pivoted
  • Two different ways for using a text classifier
  • given a document, we want to find all the
    categories under which it should be filed ->
    document-pivoted categorization (DPC)
  • given a category, we want to find all the
    documents that should be filed under it ->
    category-pivoted categorization (CPC)

47
Category-pivoted vs. document-pivoted
  • The distinction is important, since the sets C
    and D might not be available in their entirety
    right from the start
  • DPC suitable when documents become available at
    different moments in time, e.g. filtering e-mail
  • CPC suitable when new categories are added after
    some documents have already been classified (and
    have to be reclassified)

48
Category-pivoted vs. document-pivoted
  • Some algorithms may apply to one style and not
    the other, but most techniques are capable of
    working in either mode

49
Hard-categorization vs. ranking categorization
  • Hard categorization
  • the classifier answers T or F
  • Ranking categorization
  • given a document, the classifier might rank the
    categories according to their estimated
    appropriateness to the document
  • respectively, given a category, the classifier
    might rank the documents

50
Applications of text categorization
  • Automatic indexing for Boolean information
    retrieval systems
  • document organization
  • text filtering
  • word sense disambiguation
  • hierarchical categorization of Web pages

51
Automatic indexing for Boolean IR systems
  • In an information retrieval system, each document
    is assigned one or more keywords or keyphrases
    describing its content
  • the keywords belong to a finite set called a
    controlled dictionary
  • as a TC problem: the entries in the controlled
    dictionary are viewed as categories
  • k1 ≤ x ≤ k2 keywords are assigned to each
    document
  • document-pivoted TC

52
Document organization
  • Indexing with a controlled vocabulary is an
    instance of the general problem of document base
    organization
  • e.g. a newspaper office has to classify the
    incoming classified ads under categories such
    as Personals, Cars for Sale, Real Estate etc.
  • organization of patents, filing of newspaper
    articles...

53
Text filtering
  • Classifying a stream of incoming documents
    dispatched in an asynchronous way by an
    information producer to an information consumer
  • e.g. newsfeed
  • producer: news agency, consumer: newspaper
  • the filtering system should block the delivery of
    documents the consumer is likely not interested in

54
Word sense disambiguation
  • Given the occurrence in a text of an ambiguous
    word, find the sense of this particular word
    occurrence
  • E.g.
  • Bank of England
  • the bank of river Thames
  • Last week I borrowed some money from the bank.

55
Word sense disambiguation
  • Indexing by word senses rather than by words
  • as a text categorization problem:
  • documents = word occurrence contexts
  • categories = word senses
  • also resolving other natural language ambiguities
  • context-sensitive spelling correction, part of
    speech tagging, prepositional phrase attachment,
    word choice selection in machine translation

56
Hierarchical categorization of Web pages
  • E.g. Yahoo!-like hierarchical web catalogues
  • typically, each category should be populated by
    a few documents
  • new categories are added, obsolete ones removed
  • usage of link structure in classification
  • usage of the hierarchical structure

57
Knowledge engineering approach
  • In the 80s: knowledge engineering techniques
  • manually building expert systems capable of
    making text categorization decisions
  • an expert system consists of a set of rules, e.g.
  • (wheat & farm) -> wheat
  • (wheat & commodity) -> wheat
  • (bushels & export) -> wheat
  • (wheat & winter & ¬soft) -> wheat
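  • a toy sketch of such a rule set in Python (the rules mirror
    the wheat examples above; the negated rule is omitted for
    brevity and the representation is an assumption):

      # each rule: a set of terms that must all occur in the document
      RULES = [
          {"wheat", "farm"},
          {"wheat", "commodity"},
          {"bushels", "export"},
      ]

      def is_wheat(document_terms):
          terms = set(document_terms)
          return any(rule <= terms for rule in RULES)   # a rule fires if all its terms occur

      print(is_wheat(["wheat", "prices", "farm"]))      # True
      print(is_wheat(["oil", "export"]))                # False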

58
Knowledge engineering approach
  • Drawback rules must be manually defined by a
    knowledge engineer with the aid of a domain
    expert
  • any update necessitates human intervention again
  • totally domain dependent
  • -> an expensive and slow process

59
Machine learning approach
  • A general inductive process (learner)
    automatically builds a classifier for a category
    ci by observing the characteristics of a set of
    documents manually classified under ci or ¬ci by
    a domain expert
  • from these characteristics the learner gleans the
    characteristics that a new unseen document should
    have in order to be classified under ci
  • supervised learning (= supervised by the
    knowledge of the training documents)

60
Machine learning approach
  • The learner is domain independent
  • usually available off-the-shelf
  • the inductive process is easily repeated, if the
    set of categories changes
  • manually classified documents often already
    available
  • manual process may exist
  • if not, it is still easier to manually classify a
    set of documents than to build and tune a set of
    rules

61
Training set, test set, validation set
  • Initial corpus of manually classified documents
  • let dj belong to the initial corpus
  • for each pair <dj, ci> it is known whether dj
    should be filed under ci
  • positive examples, negative examples of a category

62
Training set, test set, validation set
  • The initial corpus is divided into two sets
  • a training (and validation) set
  • a test set
  • the training set is used to build the classifier
  • the test set is used for testing the
    effectiveness of the classifiers
  • each document is fed to the classifier and the
    decision is compared to the manual category

63
Training set, test set, validation set
  • The documents in the test set are not used in the
    construction of the classifier
  • alternative: k-fold cross-validation
  • k different classifiers are built by partitioning
    the initial corpus into k disjoint sets and then
    iteratively applying the train-and-test approach,
    each time using k-1 sets as the training set and
    the remaining set as the test set
  • individual results are then averaged
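  • a minimal sketch of the k-fold splitting logic in Python
    (train_and_test stands for whatever learner and
    effectiveness measure are used):

      def k_fold_indices(n_docs, k):
          # partition document indices into k disjoint folds
          folds = [list(range(i, n_docs, k)) for i in range(k)]
          for i in range(k):
              test = folds[i]
              train = [idx for j, fold in enumerate(folds)
                       if j != i for idx in fold]
              yield train, test

      # results = [train_and_test(train, test)
      #            for train, test in k_fold_indices(len(corpus), 5)]
      # effectiveness = sum(results) / len(results)   # average over the k runs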

64
Training set, test set, validation set
  • The training set can be split into two parts
  • one part (the validation set) is used for
    optimising parameters
  • test which values of parameters yield the best
    effectiveness
  • test set and validation set must be kept separate

65
Inductive construction of classifiers
  • A ranking classifier for a category ci
  • definition of a function that, given a document,
    returns a categorization status value for it,
    i.e. a number between 0 and 1
  • documents are ranked according to their
    categorization status value

66
Inductive construction of classifiers
  • A hard classifier for a category
  • definition of a function that returns true or
    false, or
  • definition of a function that returns a value
    between 0 and 1, followed by a definition of a
    threshold
  • if the value is higher than the threshold -> true
  • otherwise -> false
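  • a sketch in Python of turning a status-value classifier into
    a hard classifier (the score function and the threshold value
    are assumptions):

      THRESHOLD = 0.5   # chosen e.g. by experimenting on a validation set

      def hard_classify(score, document, category):
          # score(document, category) returns a categorization
          # status value between 0 and 1
          return score(document, category) >= THRESHOLD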

67
Learners
  • Probabilistic classifiers (Naïve Bayes)
  • decision tree classifiers
  • decision rule classifiers
  • regression methods
  • on-line methods
  • neural networks
  • example-based classifiers (k-NN)
  • support vector machines

68
Rocchio method
  • Linear classifier method
  • for each category, an explicit profile (or
    prototypical document) is constructed
  • benefit: the profile is understandable even for
    humans

69
Rocchio method
  • A classifier is a vector of the same dimension as
    the documents
  • weights are computed from the training documents
    (positive and negative examples of the category)
  • classifying: cosine similarity of the category
    vector and the document vector
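  • a minimal Rocchio-style sketch in Python (the beta and gamma
    parameters and the plain-list vectors are illustrative
    assumptions):

      import math

      def rocchio_profile(positive_docs, negative_docs, beta=16, gamma=4):
          # each document is a term-weight vector of the same dimension
          dim = len(positive_docs[0])
          profile = []
          for i in range(dim):
              pos = sum(d[i] for d in positive_docs) / len(positive_docs)
              neg = (sum(d[i] for d in negative_docs) / len(negative_docs)
                     if negative_docs else 0.0)
              profile.append(beta * pos - gamma * neg)
          return profile

      def cosine(u, v):
          dot = sum(a * b for a, b in zip(u, v))
          norm = (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))
          return dot / norm if norm else 0.0

      # classifying: score = cosine(rocchio_profile(pos, neg), document_vector)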