Automatic Term Weighting, Lexical Statistics and Quantitative Terminology

1
Automatic Term Weighting, Lexical Statistics and
Quantitative Terminology
  • Kyo Kageura
  • National Institute of Informatics
  • July 05, 2003

2
Project
  • To rescue/recover the sphere of lexicology
  • To release the richness and productivity of
    lexico-conceptual sets from the dominance of
    discourse
  • while maintaining the traceable procedure in the
    process of doing this
  • and starting from textual corpora

3
Contents
  • Sphere of Texts and Sphere of Lexicon/ology
  • Three (representative) methods of automatic term
    weighting and their meanings
  • From corpus-based lexical statistics to (still)
    corpus-based quantitative lexicology
  • Measuring lexical productivity in lexicon (i.e.
    lexicological concept of productivity) from
    textual data, with some experiments
  • Conclusions

4
Textual Sphere and Lexicological Sphere

  • [Diagram: the Lexicological Sphere and the Textual Sphere.
    Labels in the diagram: lexicon, lexicology, quantitative
    lexicology, terms, complex terms; annotation: "This exists"]
  • So what about talking about lexicology when
    talking about corpus-based

5
Lexicological Sphere and Texts
  • Lexicology deals with an actual set of words
  • which does not mean its natural history
  • A lexicological model with expectations addresses
    the realistic possibility of existence, not
    permissible forms or a fantasy land
  • thus actual data is required
  • the primary language data is texts
  • Thus the recovery of lexicological
    characteristics becomes the task of lexicology

6
Automatic Term Weighting (ATW)
  • Reviewing some representative ATW methods gives
    important insights into the current topic
  • while at the same time giving insights into ATW itself
  • We look at
  • Tfidf (its info-theoretic interpretation by
    Aizawa)
  • Term representativeness (by Hisamitsu)
  • Lexical measure (by Nakagawa)
  • which goes from texts to lexicology, almost.

7
ATW1 tfidf
Tfidf and many other similar measures, in fact
most of those used in IR, are based on the
document-term matrix, which has a formal
duality. Thus the weight of a term is always and
only meaningful vis-à-vis the given set of
documents or its population (Dfitf thus makes
sense too, as in the probabilistic model).
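
The duality mentioned above can be illustrated with a minimal sketch. It assumes a plain count-times-log-inverse-frequency weighting (one common tfidf variant, not necessarily the one discussed in the talk) and a toy corpus invented for illustration. The helper names `weight` and `transpose` are hypothetical. The same function yields tfidf on the document-term matrix and dfitf on its transpose.

```python
import math
from collections import Counter

def weight(rows):
    """Weight each cell of a count matrix given as a list of {column: count} rows.

    Cell weight = count * log(number of rows / number of rows containing the column).
    """
    n_rows = len(rows)
    containing = Counter(col for row in rows for col in row)
    return [
        {col: count * math.log(n_rows / containing[col]) for col, count in row.items()}
        for row in rows
    ]

def transpose(rows):
    """Swap the roles of rows and columns; returns (row_labels, rows)."""
    columns = {}
    for r, row in enumerate(rows):
        for col, count in row.items():
            columns.setdefault(col, {})[r] = count
    labels = sorted(columns)
    return labels, [columns[label] for label in labels]

# Hypothetical toy corpus, for illustration only.
docs = [
    ["term", "weighting", "term", "model"],
    ["lexical", "model", "system"],
    ["term", "system", "system", "knowledge"],
]
doc_term = [Counter(d) for d in docs]   # rows = documents, columns = terms
tfidf = weight(doc_term)                # terms weighted against the documents

terms, term_doc = transpose(doc_term)   # rows = terms, columns = documents
dfitf = weight(term_doc)                # documents weighted against the terms
```

In both directions the weight only has meaning relative to the other axis of the matrix, which is the point the slide makes about tfidf.
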
8
ATW2 Term representativeness
  • You shall know the meaning of a word by the
    company it keeps (or look at a person's friends,
    if there are any, to know the person)
  • To calculate the weight of a term ti, take the
    distribution of words that accompany ti in a
    certain window size and calculate the distance
    between this and the distribution of a random chunk
    of the same window size (NB size normalisation
    is necessary due to the LNRE nature of language data).
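
A rough sketch of the idea, under simplifying assumptions that are not in the slides: the distance is taken to be the Kullback-Leibler divergence, the background is the unigram distribution of the whole corpus rather than a random chunk of matching size, and the size normalisation mentioned above is omitted. The function names are hypothetical.

```python
import math
from collections import Counter

def window_distribution(tokens, positions, half_window):
    """Relative frequencies of the words within half_window tokens of the given positions."""
    counts = Counter()
    for p in positions:
        lo, hi = max(0, p - half_window), min(len(tokens), p + half_window + 1)
        for q in range(lo, hi):
            if q != p:
                counts[tokens[q]] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

def kl_divergence(p, q):
    """KL divergence of p from q; every word in p must have non-zero mass in q."""
    return sum(pw * math.log(pw / q[w]) for w, pw in p.items())

def representativeness(tokens, term, half_window=2):
    """Distance between the words around `term` and the corpus as a whole."""
    positions = [i for i, t in enumerate(tokens) if t == term]
    around = window_distribution(tokens, positions, half_window)
    n = len(tokens)
    background = {w: c / n for w, c in Counter(tokens).items()}
    return kl_divergence(around, background)

# Hypothetical token sequence, for illustration only.
tokens = ["the", "lexical", "system", "the", "lexical", "model", "a", "b"]
score = representativeness(tokens, "lexical", half_window=1)
```

A term that never occurs gets a score of 0.0 here, since there is no company to compare. Without size normalisation, scores of frequent and rare terms are not directly comparable, which is exactly the caveat in the slide.
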

9
ATW2 Term representativeness
  • This method discards the factor of dominant
    discourse or minor discourse at the level of
    observed texts (or does not favour people
    who randomly buy friends with money).
  • This method calculates the characteristic that
    the term ti, if it appears at all, can attract at
    the level of discourse (depending on the nature
    of the window the method takes, of course).

10
ATW3 Nakagawa's method
  • Observe the number of different elements (element
    types) that accompany ti within the complex
    lexical units in texts.
  • This reflects, therefore, a nature of lexical
    productivity of the focal element ti, but
    together with the degree of its use in discourse
    (texts)
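
The observation step can be sketched as follows. This only counts the distinct element types found immediately to the left and right of each element inside complex terms. It is not Nakagawa's full scoring formula, and the input format (tuples of elements) and the function name are assumptions made for illustration.

```python
from collections import defaultdict

def neighbour_types(complex_terms):
    """Collect the distinct elements seen directly before and after each element.

    complex_terms: iterable of tuples of elements, one tuple per term occurrence.
    Returns (left, right), each mapping an element to a set of neighbouring types.
    """
    left, right = defaultdict(set), defaultdict(set)
    for term in complex_terms:
        for first, second in zip(term, term[1:]):
            right[first].add(second)   # `second` follows `first`
            left[second].add(first)    # `first` precedes `second`
    return left, right

# Hypothetical occurrences of complex terms in a text, for illustration only.
occurrences = [
    ("expert", "system"),
    ("knowledge", "base", "system"),
    ("expert", "system"),
    ("system", "model"),
]
left, right = neighbour_types(occurrences)
n_left = len(left["system"])    # distinct elements seen before "system"
n_right = len(right["system"])  # distinct elements seen after "system"
```

Because the counts come from occurrences in texts, an element that is simply used more often has more chances to show new neighbours, which is the mixture of productivity and use that the slide points out.
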

11
ATW to Quantitative Lexicology
  • To characterise lexicological nature of elements
    from their occurrence in texts
  • As in the method of term representativeness in
    Hisamitsu, the discourse size factor should be
    reduced, more essentially
  • As in Nakagawa's method, the point of observation
    should be limited to complex terms (or those
    which are supposed to be registered or can be
    registered to the lexicon/lexicological sphere).

12
A Quantitative Terminological Study
  • Aim: To recover the productivity of constituent
    elements of simplex and complex terms as heads.
  • Observe, like Nakagawa, the window range of
    simplex and complex terms in texts, e.g.

13
Some preconditions/assumptions
  • The corpus and the target terminological space
    should
  • belong to and represent the same domain
  • cover the same period of time
  • in general match qualitatively
  • We are concerned with defining a measure which
    can compare productivity of elements in the
    same lexicological/terminological sphere.

14
Definition of measures (a)
  • f(i,N): frequency of ti in the text of size N
  • This is the extent of use in discourse, and has nothing
    to do with lexicological productivity
  • d(i,N): number of different complex words whose
    head is ti in the text of size N
  • the first manifestation of lexicological
    productivity
  • basically identical to Nakagawa (2000)
  • thus this is the point of departure
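
The two measures can be sketched on a list of term occurrences. Two assumptions are made here that the slides do not state: the head is taken to be the last element of a term, and f(i,N) is counted over occurrences of ti as the head of a simplex or complex term. The function name and data are hypothetical.

```python
from collections import Counter, defaultdict

def head_measures(term_occurrences):
    """Return (f, d) for every head element.

    term_occurrences: iterable of tuples of elements, one per term token in the text.
    f[head]: number of term tokens headed by `head` (simplex terms included).
    d[head]: number of different complex terms (two or more elements) headed by `head`.
    """
    f = Counter()
    complex_types = defaultdict(set)
    for term in term_occurrences:
        head = term[-1]  # assumption: head-final terms
        f[head] += 1
        if len(term) > 1:
            complex_types[head].add(term)
    d = {head: len(types) for head, types in complex_types.items()}
    return f, d

# Hypothetical term occurrences, for illustration only.
occurrences = [
    ("system",),
    ("expert", "system"),
    ("expert", "system"),
    ("operating", "system"),
    ("system", "model"),
]
f, d = head_measures(occurrences)
```

Here "system" is used four times as a head but forms only two different complex terms, which separates the extent of use f from the manifested productivity d.
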

15
Definition of measures (b)
  • d(i,N) means the manifestation of the
    productivity of ti as it occurs in the corpus
  • d(i,N) is sensitive to the extent of use of the
    focal element in the textual corpus,
  • e.g. the following can be the case

16
Definition of measures (c)
  • Better measure for manifested productivity
  • d(i,λN): the overall transition pattern of
    d(i,λN), where λ takes a positive real value (à la
    Hisamitsu).
  • The measure for potential productivity
  • d(i): d(i,λN) with λ → ∞, discarding all the
    quantitative factors
  • Can be computed by LNRE models
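
As a partial illustration of the transition pattern d(i,λN), the sketch below uses binomial interpolation, a standard device in the LNRE literature: a complex term seen m times in the full text is expected to be missed with probability (1 - λ)^m in a sample of relative size λ. This covers only 0 < λ ≤ 1. The limit λ → ∞ requires a fitted LNRE model, which is not shown, and nothing here is claimed to be the model used in the talk. The function name and counts are hypothetical.

```python
from collections import Counter

def expected_types(type_frequencies, lam):
    """Expected number of distinct complex terms in a sample of relative size lam.

    type_frequencies: frequencies, in the full text, of the complex terms headed by ti.
    lam: relative sample size, 0 < lam <= 1 (interpolation only).
    """
    if not 0 < lam <= 1:
        raise ValueError("this sketch only interpolates: 0 < lam <= 1")
    return sum(1 - (1 - lam) ** m for m in type_frequencies)

# Hypothetical frequencies of complex terms headed by one element.
observed = Counter({"expert system": 4, "operating system": 2, "tutoring system": 1})
curve = [(lam, expected_types(observed.values(), lam)) for lam in (0.25, 0.5, 0.75, 1.0)]
```

At λ = 1 the function returns the observed d(i,N). The values for smaller λ trace how d grows with the amount of text, which is the shape the slide calls the transition pattern.
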

17
The measures and prob. distributions
  • Three distributions
  • 1) The occurrence probability of heads in
    theoretical lexicological space.
  • 2) The occurrence probability of modifiers for
    each head.
  • 3) The probability of use of the head in the
    text.
  • Relations
  • f(i,N) corresponds to 3)
  • d(i) corresponds to 1)
  • d(i,N) corresponds to 2), 3)

18
Experiments (1/5)
  • Artificial intelligence abstracts in Japanese
  • 4 elements, i.e. "system", "model" (general) and
    "knowledge", "information" (specific), are observed

19
Experiments (2/5)

20
Experiments (3/5)

21
Experiments (4/5)

22
Experiments (5/5)
General elements, such as system or model,
have high lexicological productivity, while
subject-specific elements, such as knowledge or
information, have rather low productivity.

23
Summary
  • Starting from the observation of ATW methods and
    going into examining corpus-based quantitative
    terminological study, we
  • clarified the position of lexicology/lexicon
  • clarified the basic framework of quantitative
    lexicology/terminology, with relevant measures.
  • gave some corresponding distributions
  • gave the framework of interpretation to measures
  • carried out experiments

24
Remaining problems
  • Concepts of lexicologisation and word
  • To be registered to the lexicon
  • To be consolidated as a lexical unit within the
    syntagmatic stream of language manifestations
  • Distribution of complex words in texts and word
    unit
  • reference-head vs. modifier-head
  • The former is related to an essential
    concept(ualisation) of lexicon/lexicology