Title: Automatic Term Weighting, Lexical Statistics and Quantitative Terminology
1 Automatic Term Weighting, Lexical Statistics and Quantitative Terminology
- Kyo Kageura
- National Institute of Informatics
- July 05, 2003
2 Project
- To rescue/recover the sphere of lexicology
- To release the richness and productivity of lexico-conceptual sets from the dominance of discourse
  - while maintaining a traceable procedure in the process of doing so
  - and starting from textual corpora
3 Contents
- Sphere of texts and sphere of the lexicon/lexicology
- Three (representative) methods of automatic term weighting and their meanings
- From corpus-based lexical statistics to (still) corpus-based quantitative lexicology
- Measuring lexical productivity in the lexicon (i.e. the lexicological concept of productivity) from textual data, with some experiments
- Conclusions
4 Textual Sphere and Lexicological Sphere
[Diagram contrasting the textual sphere (corpora) with the lexicological sphere (terms, complex terms, the lexicon, lexicology, quantitative lexicology); the lexicological sphere exists in its own right. So what about talking about lexicology, when the talk is corpus-based?]
5 Lexicological Sphere and Texts
- Lexicology deals with the actual set of words
  - which does not mean its natural history
- A lexicological model with expectations addresses the realistic possibility of existence, not merely permissible forms or a fantasy land
  - thus actual data is required
  - and the primary language data is texts
- The task of lexicology thus becomes the recovery of lexicological characteristics from textual data
6 Automatic Term Weighting (ATW)
- Reviewing some representative ATW methods gives important insights into the current topic
  - while at the same time giving insights into the ATW methods themselves
- We look at:
  - tf-idf (and its information-theoretic interpretation by Aizawa)
  - term representativeness (by Hisamitsu)
  - the lexical measure (by Nakagawa)
  - an ordering which goes from texts to lexicology, almost
7 ATW 1: tf-idf
Tf-idf and many other similar measures, in fact most of those used in IR, are based on the document-term matrix, which has a formal duality. The weight of a term is thus always and only meaningful vis-à-vis the given set of documents or their population (df-itf thus makes sense as well, as in the probabilistic model).
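For concreteness, a minimal sketch of the standard tf-idf computation over a toy tokenised collection (the input format and the log-based idf variant are assumptions here; Aizawa's information-theoretic reformulation is not reproduced):

```python
import math
from collections import Counter

def tfidf(docs):
    """Standard tf-idf weights over a small tokenised collection.

    docs: list of documents, each a list of tokens.
    Returns a dict mapping (doc_index, term) to its weight.
    """
    n_docs = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter(term for doc in docs for term in set(doc))
    weights = {}
    for i, doc in enumerate(docs):
        for term, freq in Counter(doc).items():
            # tf * idf; the weight depends on df, i.e. on the whole
            # document set, which is exactly the duality point above
            weights[(i, term)] = freq * math.log(n_docs / df[term])
    return weights
```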
8 ATW 2: Term representativeness
- "You shall know the meaning of a word by the company it keeps" (or look at the friends to know the person, if there are any, anyway)
- To calculate the weight of a term ti, take the distribution of words that accompany ti within a certain window size, and calculate the distance between this and the distribution of a random chunk of the same size (NB: size normalisation is necessary due to the LNRE nature of language data); a code sketch follows below.
9 ATW 2: Term representativeness (cont.)
- This method discards the factor of dominant or minor discourse at the level of observed texts (or: it does no favour to people who simply buy friends with money).
- It calculates the company that the term ti, if it appears at all, can attract at the level of discourse (depending, of course, on the nature of the window the method takes).
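A rough sketch of the computation described on slide 8, assuming a flat token list, (smoothed) KL divergence as the distance, and a single size-matched random chunk as the baseline; Hisamitsu's actual distance measure and size-normalisation procedure are more elaborate:

```python
import math
import random
from collections import Counter

def kl_distance(p_counts, q_counts):
    """Smoothed KL divergence between two word-frequency samples (Counters)."""
    vocab = set(p_counts) | set(q_counts)
    pn = sum(p_counts.values()) + len(vocab)  # add-one smoothing
    qn = sum(q_counts.values()) + len(vocab)
    return sum((p_counts[w] + 1) / pn
               * math.log(((p_counts[w] + 1) / pn) / ((q_counts[w] + 1) / qn))
               for w in vocab)

def representativeness(tokens, term, window=10):
    """Distance between the words accompanying `term` and a random
    chunk of the same total size drawn from the whole text."""
    company = Counter()
    for i, tok in enumerate(tokens):
        if tok == term:
            # words within the window around this occurrence of `term`
            company.update(tokens[max(0, i - window):i + window + 1])
    size = sum(company.values())
    start = random.randrange(max(1, len(tokens) - size))
    baseline = Counter(tokens[start:start + size])
    return kl_distance(company, baseline)
```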
10 ATW 3: Nakagawa's method
- Observe the number of different elements (element types) that accompany ti within the complex lexical units in texts.
- This therefore reflects the lexical productivity of the focal element ti, but conflated with the degree of its use in discourse (texts).
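A minimal sketch of this observation, assuming the complex terms have already been extracted and segmented into constituent elements; Nakagawa's published weighting combines such left and right type counts into a single term score, and only the counting core is shown here:

```python
from collections import defaultdict

def accompanying_element_types(complex_terms):
    """Count, for each element, the distinct element types that
    accompany it inside complex lexical units.

    complex_terms: list of complex terms, each a list of elements,
    e.g. [["knowledge", "base"], ["knowledge", "representation"]].
    """
    neighbours = defaultdict(set)
    for term in complex_terms:
        for i, elem in enumerate(term):
            if i > 0:
                neighbours[elem].add(term[i - 1])  # left neighbour type
            if i < len(term) - 1:
                neighbours[elem].add(term[i + 1])  # right neighbour type
    return {elem: len(types) for elem, types in neighbours.items()}
```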
11 ATW to Quantitative Lexicology
- The aim: to characterise the lexicological nature of elements from their occurrence in texts
- As in Hisamitsu's term representativeness, the discourse-size factor should be reduced, and more essentially so
- As in Nakagawa's method, the point of observation should be limited to complex terms (or to those which are supposed to be, or can be, registered to the lexicon/lexicological sphere).
12 A Quantitative Terminological Study
- Aim: to recover the productivity of the constituent elements of simplex and complex terms as heads.
- Observe, like Nakagawa, the window range of simplex and complex terms in texts, e.g.
13 Some preconditions/assumptions
- The corpus and the target terminological space should:
  - belong to and represent the same domain
  - cover the same period of time
  - in general, match qualitatively
- We are concerned with defining a measure which can compare the productivity of elements within the same lexicological/terminological sphere.
14 Definition of measures (a)
- f(i,N): the frequency of ti in a text of size N
  - this is the extent of use in discourse, and has nothing to do with lexicological productivity
- d(i,N): the number of different complex words whose head is ti in a text of size N
  - the first manifestation of lexicological productivity
  - basically identical to Nakagawa (2000)
  - thus this is the point of departure
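A small sketch of the two measures, under the assumption that the term occurrences in the text come pre-segmented into elements and are head-final (as in Japanese compounds); the input format is hypothetical:

```python
from collections import Counter, defaultdict

def f_and_d(term_occurrences):
    """Compute f(i,N) and d(i,N) from the (simplex and complex) term
    occurrences found in a text of size N.

    term_occurrences: list of terms, each a tuple of constituent
    elements, with the last element taken as the head (head-final).
    """
    f = Counter()             # f(i,N): frequency of ti in the text
    types = defaultdict(set)  # distinct complex-term types per head
    for term in term_occurrences:
        for elem in term:
            f[elem] += 1      # every occurrence of ti counts towards f
        types[term[-1]].add(term)
    d = {head: len(ts) for head, ts in types.items()}
    return f, d
```

For example, `f_and_d([("expert", "system"), ("expert", "system"), ("inference", "system")])` gives f["system"] = 3 but d["system"] = 2.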
15 Definition of measures (b)
- d(i,N) is the manifestation of the productivity of ti as it occurs in the corpus
- d(i,N) is sensitive to the extent of use of the focal element in the textual corpus
  - e.g. two heads with the same potential productivity can show very different d(i,N) values if one is used far more often in the text than the other
16 Definition of measures (c)
- A better measure for manifested productivity:
  - d(i, αN): the overall transition pattern of d(i, αN), where α takes a positive real value (à la Hisamitsu)
- The measure for potential productivity:
  - d(i) = lim_{α→∞} d(i, αN): discard all the quantitative factors
  - can be computed by LNRE models
17 The measures and prob. distributions
- Three distributions:
  - 1) the occurrence probability of heads in the theoretical lexicological space
  - 2) the occurrence probability of modifiers for each head
  - 3) the probability of use of the head in the text
- Relations:
  - f(i,N) ↔ 3)
  - d(i) ↔ 1)
  - d(i,N) ↔ 2), 3)
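These relations can be made concrete with a toy generative reading (all distributions and numbers below are invented for illustration): heads are drawn by their probability of use (3) and modifiers by a per-head modifier distribution standing in for (2), so that f(i,N) tracks (3) alone while d(i,N) mixes (2) and (3):

```python
import random
from collections import Counter

rng = random.Random(0)

# Invented stand-ins for the distributions:
# (2) modifiers available for each head, (3) probability of use of each head.
modifiers = {
    "system":    ["expert", "information", "knowledge", "production", "inference"],
    "knowledge": ["domain", "expert"],
}
use_prob = {"system": 0.7, "knowledge": 0.3}   # distribution (3)

def generate(n):
    """Generate n complex-term tokens: a head drawn by (3), a modifier by (2)."""
    heads = rng.choices(list(use_prob), weights=list(use_prob.values()), k=n)
    return [(rng.choice(modifiers[h]), h) for h in heads]  # uniform stand-in for (2)

terms = generate(200)
f = Counter(h for _, h in terms)                                  # f(i,N), governed by (3)
d = {h: len({t for t in terms if t[1] == h}) for h in use_prob}   # d(i,N), by (2) and (3)
print(f, d)
```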
18 Experiments (1/5)
- Artificial intelligence abstracts in Japanese
- 4 elements are observed: "system" and "model" (general), and "knowledge" and "information" (specific)
19 Experiments (2/5)
20 Experiments (3/5)
21 Experiments (4/5)
[Slides 19-21: plots of the experimental results; not recoverable from the extracted text.]
22 Experiments (5/5)
General elements, such as "system" or "model", have high lexicological productivity, while subject-specific elements, such as "knowledge" or "information", have rather low productivity.
23 Summary
- Starting from the observation of ATW methods and going on to examine a corpus-based quantitative terminological study, we:
  - clarified the position of lexicology/the lexicon
  - clarified the basic framework of quantitative lexicology/terminology, with the relevant measures
  - gave the corresponding probability distributions
  - gave a framework for interpreting the measures
  - carried out experiments
24 Remaining problems
- The concepts of lexicologisation and of the word
  - to be registered to the lexicon
  - to be consolidated as a lexical unit within the syntagmatic stream of language manifestations
- The distribution of complex words in texts and the word unit
  - reference-head vs. modifier-head
  - the former is related to an essential concept(ualisation) of the lexicon/lexicology