frankliang0086yahoo'com'cn - PowerPoint PPT Presentation

Provided by: frank127
Transcript and Presenter's Notes

Title: frankliang0086yahoo'com'cn


1
???????????????
??????????????!
  • ??????????
  • ???
  • frankliang0086_at_yahoo.com.cn

2
????
  • ??????
  • ?????????
  • ????
  • ???????
  • ????
  • ????????????

3
????
  • ??? ??
  • ??? ??
  • ??? ??
  • ??? ??

4
?????????
  • ???
  • ??
  • ????
  • ??????????
  • ????
  • ??
  • ???????????
  • ????
  • ???????

5
?????????
  • ???
  • ??
  • ??????
  • ??????
  • ??
  • ?????????(CIA)
  • ????????????
  • ??????

6
?????????
  • ???
  • ??
  • ???????
  • ???????????
  • ??
  • ???????
  • ???

7
????
  • ????
  • ????????????
  • ???????????????
  • ??????????????????(??,??,??,?????)

8
???????
  • ???????
  • ??????( \ ????),????????????
  • ????,???????????,????,????
  • ??????????,???????
  • ??????????????
  • ????????(????)

9
????
  • ????
  • ????,????(??),???????
  • ????,????,????
  • ???,????
  • ????????(??)

10
??????????
  • Corpus
  • (pl. corpora or corpuses) A collection of texts,
    now usually in machine-readable form, compiled to
    be representative of a particular kind of
    language and often provided with some kind of
    annotation.

11
??????????
  • Types of corpora
  • Annotated corpus (or tagged corpus): a corpus
    enhanced with various types of linguistic
    information. An annotated corpus may be considered
    a repository of linguistic information, because
    the information that was implicit in the plain
    text has been made explicit through concrete
    annotation (added value).
  • Comparable (reference) corpus: a corpus used for
    comparison of different (types of) languages.
    Comparable corpora often follow the same
    composition pattern. If comparable corpora are
    annotated, their annotation schemes are often
    similar.

12
??????????
  • Monitor corpus: a growing, non-finite collection
    of texts, of primary use in lexicography. A
    monitor corpus tracks language change by growing
    at a constant rate while leaving the relative
    weight of its components (i.e. its balance), as
    defined by the parameters, untouched. The same
    composition schema should be followed year by
    year, the basis being a reference corpus of texts
    spoken or written in one single year.

13
??????????
  • Monolingual corpus: a corpus which contains texts
    in a single language.
  • Multilingual corpus: a set of small monolingual
    corpora (or subcorpora) that use the same or
    similar sampling procedures and categories for
    each language but contain completely different
    texts in those languages.
  • Parallel (aligned) corpus: a multilingual corpus
    in which texts in one language and their
    translations into other languages are aligned
    sentence by sentence, preferably phrase by
    phrase.

14
??????????
  • Special corpus: a corpus assembled for a specific
    purpose. Special corpora vary in size and
    composition according to that purpose. They are
    not balanced (except within the scope of their
    given purpose) and, if used for other purposes,
    give a distorted view of the language segment.
    Their main advantage is that the texts can be
    selected so that the phenomena one is looking for
    occur much more frequently than in a balanced
    corpus. A corpus enriched in this way can be much
    smaller than a balanced corpus providing the same
    data.
  • General corpus

15
??????????
  • Token: an individual occurrence of a word.
  • Type: a distinct word form. "I see a cat and a
    dog" contains seven tokens but only six types
    (the type 'a' occurs twice).
  • (standardized) type/token ratio
  • Frequency
  • absolute frequency
  • relative (normalized) frequency
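
The counts above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular tool: the function names are made up for the example, tokenization is a plain lowercase whitespace split, and the standardized ratio is assumed to be the mean type/token ratio over consecutive fixed-size chunks (a chunk size of 1000 is a common choice, taken here as an assumption).

```python
from collections import Counter


def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()


def type_token_ratio(tokens):
    # Number of distinct word forms divided by number of running words.
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def standardized_ttr(tokens, chunk_size=1000):
    # Mean TTR over consecutive full chunks, which makes texts of
    # different lengths more comparable than the raw ratio.
    chunks = [tokens[i:i + chunk_size]
              for i in range(0, len(tokens) - chunk_size + 1, chunk_size)]
    if not chunks:
        # Text shorter than one chunk: fall back to the raw ratio.
        return type_token_ratio(tokens)
    return sum(type_token_ratio(c) for c in chunks) / len(chunks)


def relative_frequency(tokens, word, per=1000):
    # Absolute frequency normalized to occurrences per `per` tokens.
    return Counter(tokens)[word] / len(tokens) * per


tokens = tokenize("I see a cat and a dog")
print(len(tokens), len(set(tokens)))  # tokens and types
print(type_token_ratio(tokens))
print(relative_frequency(tokens, "a"))
```

Normalizing to a fixed base such as per 1000 tokens is what allows frequencies from corpora of different sizes to be compared.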

16
??????????
  • Keywords
  • Keywords are words whose normalized frequency in
    one corpus (the observed corpus) is significantly
    higher or lower than in another, comparable
    corpus (the reference corpus).
  • Positive keywords (unusually frequent) and
    negative keywords (unusually infrequent)
  • Significance tests: chi-square test,
    log-likelihood test
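
As a rough illustration of the log-likelihood test for keywords, the sketch below uses the two-cell form of the statistic that many corpus tools report: expected frequencies are derived from the pooled corpora and the score is 2 * sum(O * ln(O / E)). The function names and the example counts are hypothetical; 3.84 is the chi-square critical value for p < 0.05 with one degree of freedom.

```python
import math


def log_likelihood(freq_obs, size_obs, freq_ref, size_ref):
    # Expected frequencies if the word were equally common in both corpora.
    total = size_obs + size_ref
    pooled = freq_obs + freq_ref
    expected_obs = size_obs * pooled / total
    expected_ref = size_ref * pooled / total
    score = 0.0
    for observed, expected in ((freq_obs, expected_obs),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def keyness(freq_obs, size_obs, freq_ref, size_ref, critical=3.84):
    # Classify a word as a positive keyword, a negative keyword, or neither.
    score = log_likelihood(freq_obs, size_obs, freq_ref, size_ref)
    if score < critical:
        return score, "not significant"
    higher = freq_obs / size_obs > freq_ref / size_ref
    return score, "positive" if higher else "negative"


# Hypothetical counts: 50 hits in a 1000-token observed corpus,
# 10 hits in a 1000-token reference corpus.
print(keyness(50, 1000, 10, 1000))
```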

17
??????????
  • Concordance
  • A list of the occurrences of a particular word or
    sequence of words, each shown in its context. The
    concordance is at the centre of corpus
    linguistics, because it gives access to many
    important language patterns in texts.
    Concordances of major works such as the Bible and
    Shakespeare have been available for many years.
    The computer has made concordances easy to
    compile. (Related terms: concordancer,
    concordance lines)
  • Computer-generated concordances can be very
    flexible: the context of a word can be selected
    on various criteria (for example, by counting the
    words on either side).
  • Interpreting concordance lines can be a demanding
    task.
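
A concordancer of the key-word-in-context kind can be sketched as follows. This is a toy version under simple assumptions (pre-tokenized input, case-insensitive matching of a single word, context measured as a number of words on either side); the function name and the sample sentence are invented for the example.

```python
def concordance(tokens, node, width=3):
    # Return one (left context, node, right context) triple per hit.
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == node.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


tokens = "the cat sat on the mat near the door".split()
for left, node, right in concordance(tokens, "the", width=2):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>12}  {node}  {right}")
```

Aligning the node word in a central column is what makes recurring patterns to its left and right easy to spot by eye.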

18
??????????
  • Collocation
  • A term for combinations of words that have a
    certain mutual expectancy, i.e. words regularly
    keep company with certain other words. When a
    collocation occurs more frequently than chance
    would predict, it is called a significant
    collocation.
  • Words are like people. A man may be madly in love
    with a woman who does not love him at all. She is
    everything to him, but he is nothing to her.
    (Consider the relation between I and am.)
  • You shall know a word by the company it keeps.
    (Firth 1957)
  • Measures of collocation strength: MI, t-score,
    z-score, etc.
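
The three measures named above can be sketched in their simplified textbook forms, all built on the observed co-occurrence count O and the count E expected by chance. The formulas assumed here are E = f(node) * f(collocate) / N, MI = log2(O / E), t = (O - E) / sqrt(O) and z = (O - E) / sqrt(E); real tools differ in details such as span adjustment, so treat this as an illustration with hypothetical counts rather than a reference implementation.

```python
import math


def collocation_scores(pair_freq, node_freq, collocate_freq, corpus_size):
    # pair_freq must be at least 1 for these formulas to be defined.
    expected = node_freq * collocate_freq / corpus_size
    mi = math.log2(pair_freq / expected)
    t_score = (pair_freq - expected) / math.sqrt(pair_freq)
    z_score = (pair_freq - expected) / math.sqrt(expected)
    return mi, t_score, z_score


# Hypothetical counts: the pair occurs 10 times, each word 100 times,
# in a corpus of 10,000 tokens.
mi, t_score, z_score = collocation_scores(10, 100, 100, 10_000)
print(round(mi, 2), round(t_score, 2), round(z_score, 2))
```

MI tends to favour rare but exclusive pairings, while the t-score favours frequent ones, which is why the measures are often reported side by side.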

19
??????????
  • Colligation
  • collocation patterns based on syntactic groups
    rather than individual words. (Barnbrook 1996)
  • There is a book on the desk.
  • There_EX is_VBZ a_AT1 book_NN1 on_II the_AT
    desk_NN1 ._.
  • EX VBZ AT1 NN1 II AT NN1.
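
The step from the tagged sentence to its tag sequence can be sketched in Python. The sketch assumes the word_TAG format shown above, with the tag following the last underscore of each item; the function name is invented for the example.

```python
def tag_sequence(tagged_sentence):
    # Keep only the tag of each word_TAG item, dropping the word itself.
    tags = []
    for item in tagged_sentence.split():
        _word, _sep, tag = item.rpartition("_")
        tags.append(tag)
    return tags


tagged = "There_EX is_VBZ a_AT1 book_NN1 on_II the_AT desk_NN1 ._."
print(" ".join(tag_sequence(tagged)))
```

Counting such tag sequences across a corpus, instead of the words themselves, is one way to look for colligation patterns.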

20
??????????
  • POS (part-of-speech) tagging
  • The most basic type of linguistic corpus
    annotation. Its aim is to assign to each lexical
    unit in the text a code (or tag) indicating its
    part of speech (e.g. singular common noun - NN,
    past participle - VVN). Part-of-speech
    information is a fundamental basis for increasing
    the specificity of data retrieval from corpora,
    and it also forms an essential foundation for
    further forms of analysis such as syntactic
    parsing and semantic field annotation.

21
??????????
  • Pattern
  • In pattern grammar, Hunston does not give a clear
    definition of a pattern, but she describes a verb
    pattern as a verb and the words that come after
    it (such as V n that), and a noun pattern as a
    noun and the words that come after it (such as N
    that).

22
??????????
  • Pattern
  • In pattern matching, a pattern is a regular
    expression: a string in which certain symbols or
    combinations of symbols do not stand for
    themselves literally, but for a whole category of
    strings.
  • For example
  • \w+ments? stands for all words ending in -ment
    or -ments, such as agreement, achievements,
    abcment, etc. (\w+ matches one or more word
    characters, and s? makes the final s optional.)
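
The example can be tried out with Python's re module. The pattern is written here as \b\w+ments?\b: the + is needed because a bare \w matches only a single character, and the \b word boundaries (an addition for this sketch) keep the match from starting or stopping in the middle of a word. The function name and the test sentence are invented for the example.

```python
import re

# One or more word characters, then "ment", then an optional "s",
# anchored at word boundaries on both sides.
MENT_PATTERN = re.compile(r"\b\w+ments?\b")


def find_ment_words(text):
    # Return every word in the text that ends in -ment or -ments.
    return MENT_PATTERN.findall(text)


sample = "The agreement lists achievements, abcment and a mentor."
print(find_ment_words(sample))
```

Note that "mentor" is not matched: the pattern requires at least one character before "ment" and a word boundary right after the optional "s".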
