Corpus Linguistics

1
Corpus Linguistics
  • Developing a PolyU Language Bank
  • Sherman Lee
  • egslee_at_inet.polyu.edu.hk
  • PI: Grahame Bilbow
  • Thanks to Chris Greaves, Raymond Cheung, Li Lan

2
Outline
  • Background
  • Goals of corpus linguistics
  • Types of corpora
  • Applications of corpus analysis
  • As an illustration
  • Exploring units of meaning
  • Case study
  • Developing a PolyU Language Bank
  • Aims and objectives of project
  • Similar existing projects
  • Procedures
  • The PolyU Language Bank
  • Current status
  • Sample corpora
  • Sample search

3
Goals of corpus linguistics
  • Chomskyan linguistics:
  • Langue (competence)
  • Ideal speaker/hearer
  • Language as innate mental faculty
  • Intuitive evidence
  • Universals
  • Grammar
  • Corpus linguistics:
  • Parole (performance)
  • Complexity/variation
  • Language as social phenomenon
  • Empirical evidence
  • Differences
  • Meaning

4
Basic tools
  • Corpus: a systematic collection of speech or
    writing that is built according to explicit
    design criteria for a specific purpose
  • cf. the EAGLES broad definition: a corpus can
    potentially contain any text type, incl. word
    lists, dictionaries, etc.
  • Concordancer: search engine (e.g. WordSmith,
    SARA)
  • Concordance: occurrences of search item,
    displayed in a list with immediate context shown
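
A concordance of this kind can be sketched in a few lines of Python. This is only an illustration of the keyword-in-context idea, not how WordSmith or SARA work internally; the function name, the sample text, and the context width are all assumptions made for the example.

```python
import re

def concordance(text, keyword, width=30):
    """Return one line per hit of `keyword`, with `width` characters of context each side."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search item lines up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines

# Invented sample text, for illustration only.
sample = ("The dog days of summer are here. My dogs are barking, "
          "so the dog catcher took a rest.")
hits = concordance(sample, "dog", width=20)
for line in hits:
    print(line)
```

The word-boundary match means that "dogs" is not returned for the search item "dog"; a real concordancer would normally offer wildcard or lemma searches to cover such forms.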

5
Types of corpora
  • Written vs Spoken
  • General vs Specialised
  • e.g. ESP, Learner corpora
  • Monolingual vs Multilingual
  • e.g. Parallel, Comparable
  • Synchronic vs Diachronic; Monitor
  • Annotated vs Unannotated

6
Written corpora
7
Specialised corpora
8
Other examples of available corpora
9
Some applications of corpus analysis
  • Language teaching and learning:
  • Empirical teaching data: authentic examples of
    language use
  • Reference source: answering learners' questions
    or explaining learner errors
  • What's the difference between "at last" and "in
    the end"?
  • How is "hardly" used?
  • Preparation of teaching materials, e.g.
    vocabulary lists, CLOZE tests
  • CALL: concordancing and data-driven learning
  • Translation:
  • Using parallel texts to find suitable translation
    equivalents
  • Creation of translation databases or glossaries
    for domain-specific terminology, e.g. business,
    law, science
  • Exploring units of meaning in texts
  • Linguistics and language research:
  • Lexicography and lexical studies, e.g. relative
    word frequency
  • Language variation, e.g. linguistic features
    across registers
  • Grammar: corpora used as data to test
    hypotheses, syntactic theory
  • Pragmatics and discourse, e.g. CA of discourse
    features in spoken (conversational) data
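
Relative word frequency, mentioned above under lexical studies, is usually reported per million words so that corpora of different sizes can be compared. The following is a minimal sketch; the tokeniser is deliberately crude, and the two short texts are invented stand-ins for real corpora.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def per_million(word, tokens):
    """Occurrences of `word` per million tokens (0.0 for an empty corpus)."""
    if not tokens:
        return 0.0
    return Counter(tokens)[word] * 1_000_000 / len(tokens)

# Invented toy texts; a real comparison would use full corpora.
corpus_a = tokenize("at last the bus came and at last we left")
corpus_b = tokenize("in the end we walked home")

freq_a = per_million("last", corpus_a)
freq_b = per_million("last", corpus_b)
print(freq_a, freq_b)
```

With texts this small the figures are meaningless in themselves; the point is only that dividing by corpus size makes counts from, say, a 1-million-word and a 100-million-word corpus comparable.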

10
Exploring meaning, units of meaning
  • Focus on meaning because
  • People interested in the meanings of texts, in
    how language is actually used in discourse
  • Meaning is a key problem for translation,
    language learning, information management
  • What are basic units of meaning?
  • Language teaching (TEFL): vocabulary often
    introduced in the form of new single words
  • Words considered to be basic units of meaning
  • Is the word an ideal unit of meaning?
  • "If you dog a dog during the dog days of
    summer, you'll be a dog-tired dog catcher"
  • "Can I sit down? My dogs are barking"
  • Most lexical errors made by language learners
    result from failure to deal with ambiguities of
    single words

11
Unambiguous Units of Meaning
  • Notion of an Unambiguous Unit of Meaning
    necessary for understanding meaning
  • UUoM: keyword and all words in the context that
    contribute to making the word unambiguous
  • Compounds, idioms, multi-word units,
    collocations, set phrases
  • Often determined by a syntactic pattern:
  • Adj N: friendly fire, closing remarks
  • V N: invite proposals, draw conclusions
  • Adv A: politically correct, environmentally
    friendly
  • N of N: cause of death, proof of identity, code
    of practice, duty of care
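
Candidate units of the "N of N" type can be pulled out of raw text with a simple pattern search. The sketch below is a rough approximation only: it has no part-of-speech information, so it matches any "X of Y" sequence (including ones like "one of the"), and the function name and sample sentences are assumptions for the example. A tagged corpus would allow the pattern to be restricted to nouns.

```python
import re
from collections import Counter

def of_phrases(text):
    """Count 'X of Y' word sequences as candidates for the N of N pattern."""
    pairs = re.findall(r"\b([a-z]+) of ([a-z]+)\b", text.lower())
    return Counter(f"{first} of {second}" for first, second in pairs)

# Invented sample sentences, for illustration only.
sample = ("The cause of death was unclear. Proof of identity is required. "
          "The cause of death was later revised.")
counts = of_phrases(sample)
print(counts.most_common())
```

Sequences that recur across many texts are the ones worth checking by hand as possible units of meaning.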

12
Case study
  • Search for units of meaning in online
    dictionaries and corpora
  • friendly fire
  • environmentally friendly
  • Corpora from the 1990s:
  • British National Corpus (BNC)
  • 100,000,000 words
  • Written (90%)
  • Extracts from regional/national newspapers,
    specialist periodicals, academic books, popular
    fiction, un/published letters, memos,
    school/university essays
  • Spoken (10%)
  • Informal conversation, formal meetings (business,
    government), radio shows, phone-ins
  • The Times (1995, Jan-March)
  • 10,220,367 words
  • Written: business, home news, readers' letters,
    reviews
  • Corpora from the 1960s-1970s:
  • Brown corpus / LOB corpus
  • Each 1 million words
  • Written, balanced corpora of 15 genres of text

15
Search results
16
What the results show
  • friendly fire, environmentally friendly
  • Represent fairly new concepts
  • Occur in the newer corpora (1990s) as units of
    meaning
  • Occur as entries in some of the online
    dictionaries only (not bilingual dictionaries)
  • New terminology and terms of common usage not
    always recorded in dictionaries and termbanks
  • One way of using corpora for learning and
    translation
  • Use corpus evidence to help students recognise
    units of meaning; introduce the notion of units
    of meaning into language learning

17
Aims of PULB project
  • To design and build an archive of language
    corpora: a language bank
  • To be used by staff and students in the
    department
  • For teaching, language learning and research
    purposes
  • To provide a user-friendly platform
  • A WWW interface via which users can freely access
    the language bank
  • With browse, search and concordance facilities

18
Ingredients of PULB
  • Sources: standard corpora, departmental
    collections
  • Medium: written texts, transcribed spoken data
  • Language types: native speaker, learner corpora
  • Languages: English, Chinese, Japanese, French,
    German
  • Genres: business, law, academia, media, social,
    literature
  • Target size: 30 million words (European) /
    characters (Asian)
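
The target size is stated in words for the European languages and in characters for the Asian ones, since Chinese and Japanese text is not written with spaces between words. A minimal sketch of the two counting modes follows; the function name and the `unit` parameter are assumptions for illustration, and the character count here simply includes every non-space character, punctuation included.

```python
def corpus_size(text, unit):
    """Size of `text` in whitespace-separated words or in non-space characters."""
    if unit == "words":
        return len(text.split())
    if unit == "characters":
        return sum(1 for ch in text if not ch.isspace())
    raise ValueError("unit must be 'words' or 'characters'")

english_size = corpus_size("draw conclusions from data", "words")
chinese_size = corpus_size("语料库 语言学", "characters")
print(english_size, chinese_size)
```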

19
Why a language bank? What's in it for us?
  • Free and simple shared access to a collection of
    language corpora
  • That you can utilise for your teaching
  • Authentic examples of language use at your
    fingertips
  • Empirical teaching data covering different
    specialisms (ESP, EAP)
  • That you can utilise for your research
  • A ready-made collection of data waiting for you
    to work on
  • Saving on time and resources
  • Way of incorporating new methods and information
    technology into the department's teaching and
    research activities
  • Increase students' awareness of this rapidly
    developing methodology / branch of language
    studies (corpus linguistics, corpora studies)
  • Way of integrating theory with technology in the
    classroom
  • Train students to be more computer-literate
  • All of the above can:
  • Motivate students to become active learners
  • Help students to more effectively learn the
    target language (cf goals of DDL)

20
Similar existing projects
  • W3 Corpora Project (Essex)
  • http://clwww.essex.ac.uk/w3c/
  • Access to corpora (Gutenberg texts, LOB,
    LOB-tagged)
  • Web interface for performing searches
  • Online tutorial and info on corpus linguistics
  • Web Concordancer (VLC, PolyU)
  • http://vlc.polyu.edu.hk/concordance/
  • Access to variety of corpora and texts
    (bilingual/parallel corpora, news, Bible, works
    of fiction)
  • Web interface for performing searches

30
Directions for PULB
  • Build a language bank with features that parallel
    those of similar sites
  • VLC:
  • Bring together corpora and texts of various types
    and genres, of different languages
  • Essex:
  • Make available different facilities for different
    categories of users (cf. legal considerations)
  • Provide on-site tutorial, corpora-based info
  • Include extra features
  • Allow searches in multiple texts / corpora
    simultaneously
  • Some form of parallel concordancing
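
Searching several corpora at once amounts to running the same query over each collection and reporting the hits side by side. The sketch below assumes each corpus is available as a plain string keyed by name; the corpus names and texts are invented for the example and are not PULB data.

```python
import re

def search_corpora(corpora, phrase):
    """Count whole-phrase hits of `phrase` in each corpus; `corpora` maps name to text."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    return {name: len(pattern.findall(text)) for name, text in corpora.items()}

# Invented toy corpora, for illustration only.
toy_corpora = {
    "news_1990s": ("Two soldiers were killed by friendly fire. "
                   "The friendly fire incident was investigated."),
    "fiction_1960s": "A friendly dog sat by the fire.",
}
hits = search_corpora(toy_corpora, "friendly fire")
print(hits)
```

Reporting results per corpus rather than as a single total keeps the kind of contrast seen in the case study visible, where a unit of meaning turns up in the newer corpora but not the older ones.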

31
Target composition of PULB
  • [Diagram] PolyU Language Bank, branching by
    language: French, German, Chinese, Japanese,
    English
  • Chinese: Business Chinese, Legal Chinese
  • Japanese: Business Japanese, Japanese Literature
  • English: General corpora, Learner corpora, Spoken
    corpora, Specialised corpora
  • Labels under the English branch: ICE, BNC, BROWN,
    Student work, Teaching reflections, Business
    writing, Social interactions, Business English
    (PUBC), Legal English, Academic English,
    Workplace English, HK spoken corpus, Conference
    speeches, Academic presentations, English
    Literature
32
Procedures (i)
  • Collate, sort, categorise data from various
    sources
  • Commercially available data
  • Departmental collections, incl.
  • PolyU Business Corpus (Li and Bilbow)
  • Bilingual corpora (Xu)
  • ESP / EAP corpora (Forey)
  • Learner corpora (Sengupta)

33
Procedures (ii)
  • For the departmental collections
  • Decide how to present each collection
  • E.g. Sub-categories, macro categories
  • Clean up texts
  • E.g. Duplications of text samples
  • E.g. Structural features (headings, typographic
    features)
  • E.g. Personal information found in data
  • To protect anonymity or privacy of authors and
    speakers
  • Annotate texts
  • Provide descriptive information about each corpus
  • Compiler, time of compilation, type of
    collection
  • Provide descriptive information about the texts
  • Number, size, genre of subtexts
  • Bibliographic info (written text)
  • Ethnographic info (spoken data)
  • Provide structural information for texts if
    necessary
  • Mark texts for paragraph boundaries etc
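
Two of the clean-up steps above, removing duplicated text samples and masking personal information, can be sketched as follows. This is an illustration under simple assumptions: duplicates are taken to be samples identical apart from case and spacing, and the only personal information handled is e-mail addresses. The function names and the `<EMAIL>` placeholder are invented for the example; real anonymisation would also need to cover names, addresses, phone numbers and the like.

```python
import re

# Simple e-mail pattern; an assumption for the sketch, not a full address grammar.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def remove_duplicates(samples):
    """Keep the first copy of each sample, ignoring differences in case and spacing."""
    seen = set()
    unique = []
    for sample in samples:
        key = " ".join(sample.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique

def mask_emails(text, placeholder="<EMAIL>"):
    """Replace e-mail addresses with a placeholder to protect authors' privacy."""
    return EMAIL.sub(placeholder, text)

samples = ["Dear Sir,  thank you.", "dear sir, thank you.", "Best regards"]
cleaned = remove_duplicates(samples)
masked = mask_emails("Contact jane.doe@example.com today")
print(cleaned)
print(masked)
```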

34
Procedures (iii)
  • Put corpora together on platform; set up search
    and support facilities
  • PULB map
  • Browse facility
  • Search and concordance facilities
  • Tutorial / general information
  • Transplant PULB onto dept website for use by
    staff and students
  • Promote PULB among corpora community
  • Data provider to data archives / distribution
    sites, e.g. OLAC, ICAME

35
The PolyU Language Bank
  • Current status
  • Range of corpora totalling 12M words
  • Individual corpus descriptions
  • Index of corpora
  • Simple-to-use built-in concordancer
  • Available at http://langbank.engl.polyu.edu.hk/

37
The PolyU Language Bank
  • Some of the currently available corpora
  • PolyU Business Corpus (Eng, Chi, Jap)
  • BNC Sampler Corpus (Spoken, Written)
  • Corpus of Multilingual Texts
  • Corpus of Nursing and Health Science Texts
  • Learner Corpus of Essays and Reports
  • HK Bilingual Corpus of Legal and Documentary
    Texts
  • ...

41
How you can contribute
  • Talk to us about your ideas
  • What would you like to see being incorporated
    into PULB?
  • In terms of corpora
  • In terms of search facilities and supplementary
    information
  • Can you think of other ways in which PULB can be
    organised and structured?
  • How likely are you to make use of PULB in your
    teaching and research?
  • Do you have any suggestions for corpus studies
    based on available or potentially available
    corpora from PULB?
  • Do you know of similar projects being undertaken
    elsewhere that we can learn from?
  • Talk to us about your collections / corpora
  • Do you have collections of language data from
    past research projects that are (could be)
    presented as a corpus (corpora)?
  • Can we help you put your collections to good use?
  • Can we work together to incorporate your
    collections into PULB?

42
Concluding remarks
  • Corpora represent a valuable but under-exploited
    resource for teaching and research
  • PULB aims to bring together various corpora under
    a single departmental archive, accessible via WWW
  • You can help us by contributing your ideas and/or
    your language collections
  • Please visit and test the PULB website at
    http://langbank.engl.polyu.edu.hk/ and provide us
    with feedback using the online evaluation form
  • Thank you very much

43
Social grooming
44
CLOZE
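
A CLOZE test of the kind listed earlier under teaching materials could be generated from corpus text roughly as follows. This is a minimal sketch assuming fixed-interval deletion (every fifth word by default); the function name, the interval, and the sample sentence are assumptions for the example. Punctuation attached to a deleted word is blanked along with it, so real material would need a proper tokeniser.

```python
def make_cloze(text, every=5):
    """Blank out every `every`-th word; return the gapped text and the answer key."""
    words = text.split()
    answers = []
    for i in range(every - 1, len(words), every):
        answers.append(words[i])
        words[i] = "_____"
    return " ".join(words), answers

# Invented sample sentence, for illustration only.
source = ("Corpora give learners authentic examples of language "
          "as it is really used")
gapped, key = make_cloze(source)
print(gapped)
print(key)
```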
45
PolyU Business Corpus
  • Compiled in 1999-2000 (Li and Bilbow)
  • Multilingual - comparable corpora
  • English (c. 1.3 M words)
  • Chinese (c. 1.2 M words)
  • Japanese (c. 1.1 M words)
  • Business texts from newspapers, government
    reports, company reports and brochures
  • Has been used for creating a bilingual
    English-Chinese business lexicon

46
PolyU Business Lexicon
47
Duplication