From Synergy to Knowledge: Integrating multiple language resources. Part I: Language Resources and Tools

Title: 1 Author: Javan Last modified by: churen Created Date: 12/28/2003 4:30:03 AM Document presentation format: Company – PowerPoint PPT presentation



1
From Synergy to Knowledge: Integrating multiple
language resources. Part I: Language Resources
and Tools
  • Chu-Ren Huang
  • Academia Sinica
  • http://cwn.ling.sinica.edu.tw/huang/huang.htm

2
Outline: Language Resources and Tools
  • Introduction: 10 Years in Chinese Language
    Processing, a mirror for other Asian languages
  • The Starting Point: Resources and Resources
    Sharing
  • OLAC: The Open Language Archives Community
  • Asian Language Resources Committee of AFNLP
  • Standards: ISO TC37 Language Resources
    Management
  • Language Archives Project of Taiwan
  • Tools: Getting Started in NLP with NLTK

3
Why Resources and Tools
  • Language Resources
  • Foundation and empirical basis of scientific
    studies of natural languages
  • The only reliable source for language specific
    features
  • Infrastructure for knowledge representation and
    knowledge engineering
  • Essential to preserve linguistic and cultural
    diversity
  • Tools
  • Needed to process
  • General enough for multilingual processing and
    cross-lingual comparison
  • Robust enough to deal with language specific
    issues

4
Chinese Language Processing as a Mirror
  • For the development of Asian Language Processing
  • Unlike Japanese, which has enjoyed being one of
    the leaders in technological innovation
  • The development of Chinese language processing
    coincides with the developing economies of Taiwan
    and China
  • Especially the availability of Chinese language
    PCs
  • Similar to the situation of many Asian languages
    now

5
CLP in the past 10 years
  • A review of what happened in the past ten years
    in Chinese Language Processing (1992-2002)
  • from a somewhat personal perspective
  • 1992: Corpora
  • Completion of the first Chinese corpus for
    linguistic research (Huang and Chen, COLING
    92: 1214-1217)
  • untagged, non-segmented
  • but searchable

6
CLP 1992-1993
  • 1992: Segmentation Standard
  • Announcement of the first national standard for
    word segmentation by PRC government.
  • GB 13715: 信息处理用现代汉语分词规范 (Contemporary Chinese
    word segmentation specification for information
    processing).
  • 1993: Lexicon
  • Completion and release of the first version of
    CKIP lexicon (with the category set and ICG
    thematic roles)
  • First version of K. Chen's parser for Chinese

7
CLP Corpus 1994-1995
  • 1994
  • 10th anniversary of the automation of
    Chinese historical textual databases.
  • Completion of the pre-Qin Classic Chinese corpus
    at Academia Sinica.
  • 1995
  • Completion of Sinica Corpus (v. 1.0, 1 million
    words), the first balanced and tagged Chinese
    corpus.

8
CLP 1996
  • Research Institutes
  • 10th Anniversary of the Institute of
    Computational Linguistics at Peking University
  • 10th Anniversary of the Chinese Knowledge
    Information Processing Group at Academia Sinica
  • Anthology of Papers
  • Readings in Chinese Natural Language Processing
    (Journal of Chinese Linguistics Monograph)
  • Editors: Huang, Chen, and Tsou

9
CLP 1996 November-1997
  • Sinica Corpus on Web
  • One of the first fully searchable language corpora
    on the WWW
  • http://www.sinica.edu.tw/ftms-bin/kiwi.sh (old
    webpage in web archives)
  • http://www.sinica.edu.tw/SinicaCorpus/ (current
    page)
  • 1997
  • Publication of the first Chinese dictionary
    compiled directly from a corpus (Huang et al.'s
    Mandarin Daily Classifier Dictionary and
    Noun-Classifier Collocation Dictionary)
  • The Tenth Annual ROCLING conference

10
CLP 1998
  • KnowledgeNet
  • Release of HowNet, the first full-fledged Chinese
    and English-Chinese LKB
  • http://www.keenage.com/
  • Segmentation Standard
  • Official announcement of CNS14366 for Taiwan

11
CLP 2000 Treebanks
  • Simultaneous completion and announcement of two
    Chinese Treebanks
  • Penn Chinese Treebank
  • Sinica Treebank
  • ACL Workshop on Chinese Language Processing

12
CLP 2001-2002
  • 2001: Society
  • Formal approval of the formation of
    ACL SigHAN, the first international organization
    on Chinese Language Processing
  • 2002
  • First SigHAN workshop on Chinese Language
    Processing
  • Formal launch of Hsieh's Intelligent Character
    Encoding System (a sustainable solution to the
    missing character problem)
  • COLING2002 in Taipei

13
CLP 2003 -
  • 2003
  • THE FIRST INTERNATIONAL CHINESE WORD SEGMENTATION
    BAKEOFF
  • http://www.sighan.org/bakeoff2003/
  • 2002-2005
  • Chinese Proposition Bank
  • http://www.cis.upenn.edu/chinese/cpb/
  • 2003,2005,2007
  • Chinese Gigaword Corpus v.1., v.2, and tagged
    version

14
What CLP Development Showed
  • Resources lead
  • When tools and standards complete a
    comprehensive infrastructure
  • Research will bloom

15
Resources Development
  • Towards a Sharable and Sustainable Model of
    Resources Development
  • OLAC
  • Open Language Archives Community
  • http://www.language-archives.org

16
OLAC Aims
  • OLAC, the Open Language Archives Community, is an
    international partnership of institutions and
    individuals who are creating a worldwide virtual
    library of language resources by
  • developing consensus on best current practice for
    the digital archiving of language resources
  • developing a network of interoperating
    repositories and services for housing and
    accessing such resources.

17
OLAC Organization
  • Coordinators: Steven Bird, Gary Simons
  • Council: Anthony Aristar (Linguist List),
    Christopher Cieri (LDC), Gary Holton (Alaska
    Native Language Center), Chu-Ren Huang (Academia
    Sinica), Heidi Johnson (Archive of the Indigenous
    Languages of Latin America), Laurent Romary
    (Atilf, University of Nancy), Joan Spanne (SIL),
    Martin Wynne (Oxford Text Archive)
  • Participating Archives and Services: 39 archives
    including LDC, ELRA, DFKI, CBOLD, ANLC, LACITO,
    Perseus, SIL, APS, Utrecht, Academia Sinica,
    TalkBank, Rosetta, MPI
  • Individual Members: 120

18
Types of Language Resource
  • DATA: any information which documents or
    describes a language, such as a
  • monograph, data file, shoebox of index cards,
    unanalyzed recordings, heavily annotated texts,
    complete descriptive grammar
  • TOOLS: computational resources that facilitate
    creating, viewing, querying, or otherwise using
    language data
  • includes fonts, stylesheets, DTDs, Schemas
  • ADVICE: any information about
  • reliable data sources, appropriate tools and
    practices

19
The Gap
20
Coordinated Approach


OAI
OLAC
"A shared architectural vision, having many
components, and implemented in stages by the
community, will bridge the gap"
Analogies: federated databases, semantic web
21
OLAC
USER SERVICES
OLAC SERVICES
OLAC REPOSITORIES
CONTENT
METADATA
OAI
22
The Foundation: 3 initiatives
  • Dublin Core Metadata Initiative (DC)
  • founded in 1995 (Dublin, Ohio)
  • conventions for resource discovery on the web
  • Open Archives Initiative (OAI)
  • founded in 1999 (Santa Fe)
  • interoperability of e-print services
  • Open Language Archives Community (OLAC)
  • founded in 2000 (Philadelphia)
  • a partnership of institutions and individuals
  • creating a worldwide virtual library of language
    resources

23
Foundation 1: DC Elements
  • 15 metadata elements
  • broad interdisciplinary consensus
  • each element is optional and repeatable
  • applies to digital and traditional formats
  • Title, Creator, Subject, Description, Publisher,
    Contributor, Date, Type, Format, Identifier,
    Source, Language, Relation, Coverage, Rights.
  • dublincore.org
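
As a rough illustration of the slide above, the following Python sketch builds a record from the 15 Dublin Core elements. The helper name `make_dc_record`, the `record` wrapper element, and the example values are assumptions made for illustration; only the element names and the Dublin Core namespace come from the standard.

```python
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set.
DC_NS = "http://purl.org/dc/elements/1.1/"

# The 15 elements listed on the slide, in lower case.
DC_ELEMENTS = [
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
]


def make_dc_record(fields):
    """Build a record from (element, value) pairs.

    Every element is optional and repeatable, so the input is a list
    of pairs rather than a dict with one value per key.
    """
    record = ET.Element("record")  # hypothetical wrapper element
    for name, value in fields:
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        child = ET.SubElement(record, f"{{{DC_NS}}}{name}")
        child.text = value
    return record


# Hypothetical values, for illustration only.
example = make_dc_record([
    ("title", "Example corpus"),
    ("creator", "First Creator"),
    ("creator", "Second Creator"),  # repeated element
    ("language", "en"),
])
```

Passing a list of pairs instead of a dictionary is what lets `creator` appear twice, which mirrors the "optional and repeatable" property.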

24
Foundation 1: DC Qualifiers
  • Encoding Schemes
  • a controlled vocabulary or notation used to
    express the value of an element
  • helps a client system to interpret the element
    content
  • e.g. Language "en" (not "English", "Anglais",
    ...)
  • Refinements
  • makes the meaning of an element more specific
  • e.g. Subject.language, Type.linguistic
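
The two kinds of qualifier can be sketched in a few lines of Python. This is only an illustration: the three-entry code table and the function names are hypothetical, and a real client would load a full controlled vocabulary.

```python
# Tiny illustrative subset of two-letter language codes (an assumption
# for this sketch, not a complete vocabulary).
LANGUAGE_CODES = {"en": "English", "fr": "French", "zh": "Chinese"}


def check_language(value):
    """Encoding scheme: accept only values from the controlled vocabulary."""
    return value in LANGUAGE_CODES


def split_refinement(name):
    """Refinement: split 'Subject.language' into (element, refinement).

    The refinement is None when the element is unqualified.
    """
    element, _, refinement = name.partition(".")
    return element, (refinement or None)
```

With this table, `check_language("en")` succeeds while `check_language("English")` does not, which is the distinction the slide draws between a coded value and a free-text name.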

25
Foundation 2: OAI Repository
26
Foundation 2: OAI Standards
  • To implement the OAI infrastructure, an archive
    must comply with two standards
  • 1. The OAI Shared Metadata Set
  • Dublin Core
  • interoperability across all repositories
  • 2. The OAI Metadata Harvesting Protocol
  • HTTP requests - 6 verbs
  • Identify, ListIdentifiers, ListMetadataFormats,
    ListSets, ListRecords, GetRecord
  • XML responses
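
A harvesting request is an ordinary HTTP request whose query string names one of the six verbs. The sketch below only builds request URLs and does not contact any server. The base URL is a made-up placeholder, and the `metadataPrefix=oai_dc` argument follows OAI-PMH 2.0, which may differ from the protocol version these slides describe.

```python
from urllib.parse import urlencode

# The six verbs listed on the slide.
OAI_VERBS = {
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
}


def oai_request_url(base_url, verb, **arguments):
    """Return the URL for one harvesting request."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, for illustration only.
BASE = "http://archive.example.org/oai"

identify_url = oai_request_url(BASE, "Identify")
records_url = oai_request_url(BASE, "ListRecords", metadataPrefix="oai_dc")
```

The repository answers each such request with an XML document, which a service provider would then parse.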

27
Foundation 2: OAI Service Providers and Data
Providers
28
Foundation 3: OLAC and OAI
  • Recall: OAI data providers must support
  • Dublin Core Metadata
  • OAI Metadata harvesting protocol
  • BUT: OAI data providers can support
  • a more specialized metadata format
  • a more specialized harvesting protocol
  • What OLAC does:
  • specialized metadata for language resources
  • specialized harvesting (extra validation)

29
OLAC Standards
  • Aside:
  • standards: the protocols and interfaces that
    allow the community to function
  • recommendations: "standards" for representing
    linguistic content
  • OLAC has three primary standards
  • OLACMS: the OLAC Metadata Set (Qualified DC)
  • OLAC MHP: refinements to the OAI protocol
  • OLAC Process: a procedure for identifying Best
    Common Practice Recommendations

30
The OLAC Metadata Set
  • The three categories of metadata
  • Work language: describes information entities and
    their intellectual attributes
  • e.g. names of works and their creators
  • Document language: describes and provides access
    to the physical manifestation of information
  • e.g. format, publisher, date, rights
  • Subject language: describes what a document is
    about
  • e.g. subject, description

31
OLACMS and Controlled Vocabularies
  • Language
  • A language of the intellectual content of the
    resource (OLAC-Language)
  • Subject.language
  • A language which the content of the resource
    describes or discusses (OLAC-Language)
  • OLAC-Language
  • A vocabulary for identifying the language(s) that
    the data is in, or that a piece of linguistic
    description is about, or that a particular tool
    can process

32
Summary: With the software in place, we have a
complete platform
CONTENT
METADATA
OAI
33
Summary: Repositories completely bridge the gap,
letting us consistently organize and archive our
resources
OLAC REPOSITORIES
CONTENT
METADATA
OAI
34
OLAC
USER SERVICES
OLAC SERVICES
OLAC REPOSITORIES
CONTENT
METADATA
OAI
Acknowledgements: ISLE and TalkBank projects
(NSF), participants of the Philadelphia workshop,
Eva Banik (programmer), Hernando de Soto (the
analogy)
35
OLACMS helps archive versatility
  • Given a shared metadata standard:
  • New language archives can be created on the fly
    by harvesting existing archives
  • Rich information can be inferred by establishing
    temporal and geographic anchors for each
    document.
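
One way to picture "created on the fly": once every archive exposes the same metadata fields, a new virtual archive is just a filter over harvested records. The sketch below assumes records have already been harvested into plain dictionaries. The field name `subject_language`, the archive names, and the records themselves are all hypothetical.

```python
def build_virtual_archive(harvested, wanted_language):
    """Collect records about one language from several harvested archives.

    `harvested` maps an archive name to a list of record dicts that
    share the same metadata fields.
    """
    selected = []
    for archive_name, records in harvested.items():
        for record in records:
            if wanted_language in record.get("subject_language", []):
                # Remember where each record came from.
                selected.append({**record, "source_archive": archive_name})
    return selected


# Hypothetical harvested data, for illustration only.
harvested = {
    "archive_a": [
        {"title": "Field notes 1", "subject_language": ["ALV"]},
        {"title": "Grammar sketch", "subject_language": ["AIS"]},
    ],
    "archive_b": [
        {"title": "Word list", "subject_language": ["ALV", "AIS"]},
    ],
}

alv_archive = build_virtual_archive(harvested, "ALV")
```

The same loop could select on a date or place field instead, which is where the temporal and geographic anchors mentioned above would come in.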

36
OLAC Infrastructure
  • Helps to Solve Language Archive Problems such as
  • Language Identification
  • and
  • Metadata Set for Multi-lingual Language Archives

37
The Language Identification Problem
  • The DC code (e.g. "en" for English) is not enough
    to describe all the languages in the world
  • Ethnologue (http://www.ethnologue.org) is
    comprehensive but not complete
  • Potential problems of using Ethnologue (or any
    existing language list)
  • over-splitting
  • over-chunking
  • omission

38
A Fundamental Solution to Language Identification
Problems
  • Registering language groups with an OLAC
    registration service
  • OLAC language classification server would house a
    comprehensive list of language family names
    (defined by users) and their extensional
    definitions (i.e. sets of Ethnologue codes)
  • ASAmis ALV, AIS
  • ALV Amis, AIS Nataoran
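
The extensional definition idea can be sketched as a small lookup table. This reads the example above as "the group Amis is defined as the set of codes ALV and AIS", which is an interpretation of the slide. The function names and the in-memory dictionaries are hypothetical stand-ins for the proposed registration service.

```python
# User-defined group name -> set of language codes (extensional definition).
LANGUAGE_GROUPS = {
    "Amis": {"ALV", "AIS"},
}

# Code -> language name, as given in the slide's example.
CODE_NAMES = {"ALV": "Amis", "AIS": "Nataoran"}


def expand_group(name):
    """Return the set of codes that a registered group name stands for."""
    return set(LANGUAGE_GROUPS.get(name, set()))


def groups_containing(code):
    """Return, in sorted order, every registered group that includes a code."""
    return sorted(g for g, codes in LANGUAGE_GROUPS.items() if code in codes)
```

A search for the group "Amis" would then be rewritten as a search for any record coded ALV or AIS, so over-splitting in the underlying code list no longer hides related resources.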

39
Describing Multi-Lingual Resources in OLACMS
  • Directionality is crucial in multilingual
    resources
  • However, OLAC metadata is flat and unordered
  • Bi-directional MT:
  • <Language code="X"/>
  • <Language code="Y"/>
  • <Subject.language code="X"/>
  • <Subject.language code="Y"/>
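
Why flatness loses directionality can be shown with a small sketch. It models a resource as a list of (language, subject language) links and then flattens them into an unordered set of elements, as the metadata above does. The function name and the two example resources are assumptions made for illustration.

```python
def flat_metadata(links):
    """Flatten (language, subject_language) links into unordered elements."""
    elements = set()
    for language, subject_language in links:
        elements.add(("Language", language))
        elements.add(("Subject.language", subject_language))
    return frozenset(elements)


# A bi-directional resource: X describes Y, and Y describes X.
bidirectional = flat_metadata([("X", "Y"), ("Y", "X")])

# Two bundled monolingual resources: X describes X, and Y describes Y.
two_monolingual = flat_metadata([("X", "X"), ("Y", "Y")])

# Both flatten to the same four elements, so the pairing is lost.
same_description = bidirectional == two_monolingual
```

Both resources end up with exactly the four elements listed above, so a harvester cannot tell which language is paired with which.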

40
Multi-lingual Resources II
  • Text language
  • Bitext (bilingual aligned corpus)
  • There is always a directionality
  • Original language
  • Translation Subject.language
  • Language Description (Field Notes)
  • Elicitation, transcription, translation, notes
  • ?Multiple related resources

41
Language Archives Project of Taiwan
  • Part of the National Digital Archives Project
    (NDAP)
  • Pilot Stage: 2000-2001
  • First Phase: 2002-2006
  • Both Language Archives
  • And Linguistic Anchor

42
Language and Digital Archives
43
Digital Archives are Linguistically Anchored
  • Archives are anchored with Lexical KnowledgeBase
    (LKB)
  • -because LKB as collection of lexical types
    instantiated in archives uniquely defines each
    archive
  • -And each lexical item is the conceptual atom
    projecting knowledge from archive to archive

44
Multi-anchor Knowledge Linking
  • Geographical anchor based on GIS (geography
    information system)
  • -Ecology (Fauna, Weather, Geology etc.)
  • -Socio-Anthropological classification
  • Linguistic anchor based on LKB
  • -etymology, language grouping, loan words,

45
Institute of Linguistics
Language Archives
46
  • Two branch projects
  • 1. Chinese Archives: 5 sub-projects
  • Early Mandarin Chinese Lexicon
  • Lexical Database of Pre-Qin Bronze and Bamboo
    Manuscripts
  • Modern Chinese Corpus and Treebank
  • New Age Corpus: Linguistic Representations and
    Archives of Multimedia Data
  • Southern-Min Archive: A Database of Historical
    Change in Language Distribution
  • 2. Formosan Language Archives

47
Early Mandarin Chinese Lexicon
  • GOAL
  • Collect the corpus and the lexicon of the Early
    Mandarin Chinese period.
  • Provide a systematic knowledge thesaurus as
    well as a powerful instrument for the study of
    grammatical development.
  • Archives Description
  • Digitization of texts (10,000,000 characters).
  • Tagging of grammatical markers (3,500,000
    characters).
  • Construction of the lexical database.
  • http://www.sinica.edu.tw/Early_Mandarin

49
Lexical Database of Pre-Qin Bronze and Bamboo
Manuscripts
  • Archives Description
  • Digitize the bronze inscriptions from the
    Shang to the Eastern Chou dynasties.
  • Construct a typological lexicon of bronze
    inscriptions and bamboo scripts, with accurate
    encoding and analysis of the bronze inscriptions
    and Chu scripts.
  • Achievement
  • Proofread bronze inscriptions (12,113 pieces).
  • http://Inscription.sinica.edu.tw

51
Modern Chinese Corpus and Treebank
  • Achievement
  • Segmented words tagged with their part of speech
    (10-million-word version in 2006).
  • Syntactic tree structures: 30,000.
  • http://www.sinica.edu.tw/SinicaCorpus
  • http://treebank.sinica.edu.tw

54
Treebank
55
New Age Corpus: Linguistic Representations and
Archives of Multimedia Data
  • Archives Description
  • A multimodal corpus of spoken Mandarin in Taiwan.
  • By means of different designs of tasks and
    scenarios.
  • Combining data format of written transcripts with
    digital technology of video and audio processing.

56
New Age Corpus: Linguistic Representations and
Archives of Multimedia Data
  • Achievement
  • Transcribed and transformed 11 hours of digital
    data.
  • Tagged 5 hours of speech data.
  • http://mmc.sinica.edu.tw

58
Southern-Min Archive: A Database of Historical
Change in Language Distribution
  • Archives Description
  • From the perspectives of historical change and
    geographical distribution.
  • A tagged corpus of Southern Min written documents
    from the 16th century to the 20th century.
  • A linguistic Geographical Information System
    displaying distributions of languages in Hsinfeng.

60
Formosan Language archives
  • Archives Description
  • Preserve the endangered Formosan Austronesian
    languages
  • 1.1 corpora, lexicons and grammars
  • 1.2 integration of linguistic information with
    GIS.
  • fifteen extant Formosan languages
  • 2.1 Rukai, Yami, Saisiyat, Tsou, Atayal, Bunun,
    Paiwan, Amis and Puyuma
  • http://formosan.sinica.edu.tw/

64
Sinica BOW: Bilingual Ontological Wordnet
  • To construct a Chinese WordNet as the linguistic
    ontology for knowledge representation
  • To provide linguistic anchoring grounded with
    temporal information by building a synchronic
    lexicon for all historical periods and
  • To provide linguistic anchoring reference and
    implementation services.

66
Asian Language Resources Committee
  • Mailing list: alr@cl.cs.titech.ac.jp
  • Affiliated with AFNLP
  • Cataloguing Asian Language Resources
  • Will adopt OLACMS and search engine
  • Hosting ALR Workshops (5 so far)
  • Asian Language Processing Special Issues in
    Language Resources and Evaluation
  • Co-Chairs: Tokunaga (take@cl.cs.titech.ac.jp),
    Huang (churen@sinica.edu.tw)
  • http://www.cl.cs.titech.ac.jp/alr/

67
An overview of the Natural Language Toolkit
  • http://nltk.sourceforge.net
  • Project Leaders: Steven Bird, Edward Loper, Ewan
    Klein
  • Acknowledgement: I would like to thank Steven
    Bird for agreeing to let me use these slides on
    NLTK

68
Summary
  • NLTK is a suite of open source Python modules,
    data sets and tutorials
  • supporting research and development in natural
    language processing
  • Download NLTK from nltk.sourceforge.net
  • A Truly Multilingual Toolkit accessible to
    beginning researchers in NLP
  • A good way to attract international scholars to
    research on your language
  • Also a good stepping stone for a developing HLT
    language to test a full range of NLP applications

69
Components of NLTK
  1. Code: corpus readers, tokenizers, stemmers,
    taggers, chunkers, parsers, wordnet, ... (50k
    lines of code)
  2. Corpora: 20 annotated data sets widely used in
    natural language processing (300 MB of data)
  3. Documentation: a 360-page book, articles,
    reviews, API documentation

70
1. Code
  • corpus readers
  • tokenizers
  • stemmers
  • taggers
  • parsers
  • wordnet
  • semantic interpretation
  • clusterers
  • evaluation metrics

71
2. Corpora
  • Brown Corpus
  • Carnegie Mellon Pronouncing Dictionary
  • CoNLL 2000 Chunking Corpus
  • Project Gutenberg Selections
  • NIST 1999 Information Extraction Entity
    Recognition Corpus
  • US Presidential Inaugural Address Corpus
  • Indian Language POS-Tagged Corpus
  • Prepositional Phrase Attachment Corpus
  • SENSEVAL 2 Corpus
  • Sinica Treebank Corpus Sample
  • Universal Declaration of Human Rights Corpus
  • Stopwords Corpus
  • TIMIT Corpus Sample
  • Treebank Corpus Sample

72
3. Documentation
  • a 360-page book about natural language processing
    in Python and NLTK
  • teaches Python and NLP
  • provides numerous examples and exercises
  • installation instructions
  • presentation slides for some of the book chapters
  • API Documentation describes every module,
    interface, class, and method

73
Parser demonstrations
74
Interactive session (WordNet)
75
Adoption in NLP courses
  • Amsterdam, Ben-Gurion, Brown, Bryn Mawr,
    CDAC-Mumbai, Coruña, Edinburgh, Erlangen,
    Georgetown, Helsinki, IIT-Bombay, Iowa State,
    Konstanz, MIT, Macquarie, Magdeburg, Malta,
    Marquette, Melbourne, Nancy, Naval Postgraduate
    School, Northeastern, Ohio State, Pitt, San Diego
    State, Simon Fraser, Stanford, Syracuse
    University, Tsuda College, U Colorado, UC
    Berkeley, UMass Amherst, UNAM, U Penn, UT Austin,
    Warsaw

76
Contribute
  • NLTK is an open source project
  • all code, data, documentation is free
  • dozens of people have contributed over the past 6
    years
  • please visit the website for project ideas
  • sign up on the NLTK-Announce mailing list to hear
    about new releases