Title: From Synergy to Knowledge: Integrating multiple language resources Part I: Language Resources and Tools
1 From Synergy to Knowledge: Integrating Multiple Language Resources. Part I: Language Resources and Tools
- Chu-Ren Huang
- Academia Sinica
- http://cwn.ling.sinica.edu.tw/huang/huang.htm
2 Outline: Language Resources and Tools
- Introduction: 10 Years in Chinese Language Processing, a Mirror for Other Asian Languages
- The Starting Point: Resources and Resources Sharing
- OLAC: The Open Language Archives Community
- Asian Language Resources Committee of AFNLP
- Standards: ISO TC37 Language Resources Management
- Language Archives Project of Taiwan
- Tools: Getting Started in NLP with NLTK
3 Why Resources and Tools
- Language Resources
- Foundation and empirical basis of scientific studies of natural languages
- The only reliable source for language-specific features
- Infrastructure for knowledge representation and knowledge engineering
- Essential to preserve linguistic and cultural diversity
- Tools
- Needed to process language resources
- General enough for multilingual processing and cross-lingual comparison
- Robust enough to deal with language-specific issues
4 Chinese Language Processing as a Mirror
- For the development of Asian Language Processing
- Unlike Japanese, which has enjoyed being one of the leaders in technological innovation
- The development of Chinese language processing coincides with the developing economies of Taiwan and China
- Especially the availability of Chinese-language PCs
- Similar to the situation of many Asian languages now
5 CLP in the Past 10 Years
- A review of what happened in the past ten years in Chinese Language Processing (1992-2002)
- from a somewhat personal perspective
- 1992: Corpora
- Completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING-92, pp. 1214-1217)
- untagged, non-segmented
- but searchable
6 CLP 1992-1993
- 1992: Segmentation Standard
- Announcement of the first national standard for word segmentation by the PRC government
- GB 13715, Contemporary Chinese Word Segmentation Specification for Information Processing (信息处理用现代汉语分词规范)
- 1993: Lexicon
- Completion and release of the first version of the CKIP lexicon (with the category set and ICG thematic roles)
- First version of K. Chen's parser for Chinese
7 CLP Corpus 1994-1995
- 1994
- 10th anniversary of the automation of Chinese historical textual databases
- Completion of the pre-Qin Classical Chinese corpus at Academia Sinica
- 1995
- Completion of Sinica Corpus (v. 1.0, 1 million words), the first balanced and tagged Chinese corpus
8 CLP 1996
- Research Institutes
- 10th anniversary of the Institute of Computational Linguistics at Peking University
- 10th anniversary of the Chinese Knowledge Information Processing Group at Academia Sinica
- Anthology of Papers
- Readings in Chinese Natural Language Processing (Journal of Chinese Linguistics Monograph)
- Editors: Huang, Chen, and Tsou
9 CLP November 1996 to 1997
- Sinica Corpus on the Web
- One of the first fully searchable language corpora on the WWW
- http://www.sinica.edu.tw/ftms-bin/kiwi.sh (old webpage in web archives)
- http://www.sinica.edu.tw/SinicaCorpus/ (current page)
- 1997
- Publication of the first Chinese dictionary compiled directly from a corpus (Huang et al.'s Mandarin Daily Classifier Dictionary and Noun-Classifier Collocation Dictionary)
- The Tenth Annual ROCLING conference
10 CLP 1998
- KnowledgeNet
- Release of HowNet, the first full-fledged Chinese and English-Chinese LKB
- http://www.keenage.com/
- Segmentation Standard
- Official announcement of CNS14366 for Taiwan
11 CLP 2000: Treebanks
- Simultaneous completion and announcement of two Chinese Treebanks
- Penn Chinese Treebank
- Sinica Treebank
- ACL Workshop on Chinese Language Processing
12 CLP 2001-2002
- 2001: Society
- Formal approval of the formation of ACL SigHAN, the first international organization on Chinese Language Processing
- 2002
- First SigHAN workshop on Chinese Language Processing
- Formal launch of Hsieh's Intelligent Character Encoding System (a sustainable solution to the missing character problem)
- COLING 2002 in Taipei
13 CLP 2003 Onward
- 2003
- The First International Chinese Word Segmentation Bakeoff
- http://www.sighan.org/bakeoff2003/
- 2002-2005
- Chinese Proposition Bank
- http://www.cis.upenn.edu/chinese/cpb/
- 2003, 2005, 2007
- Chinese Gigaword Corpus v.1, v.2, and tagged version
14 What Did CLP Development Show?
- Resources lead
- When tools and standards complete a comprehensive infrastructure
- Research will bloom
15 Resources Development
- Towards a Sharable and Sustainable Model of Resources Development
- OLAC
- Open Language Archives Community
- http://www.language-archives.org
16 OLAC Aims
- OLAC, the Open Language Archives Community, is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources by
- developing consensus on best current practice for the digital archiving of language resources
- developing a network of interoperating repositories and services for housing and accessing such resources.
17 OLAC Organization
- Coordinators: Steven Bird, Gary Simons
- Council: Anthony Aristar (Linguist List), Christopher Cieri (LDC), Gary Holton (Alaska Native Language Center), Chu-Ren Huang (Academia Sinica), Heidi Johnson (Archive of the Indigenous Languages of Latin America), Laurent Romary (Atilf, University of Nancy), Joan Spanne (SIL), Martin Wynne (Oxford Text Archive)
- Participating Archives and Services: 39 archives, including LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht, Academia Sinica, TalkBank, Rosetta, MPI
- Individual Members: 120
18 Types of Language Resource
- DATA: any information which documents or describes a language, such as a
- monograph, data file, shoebox of index cards, unanalyzed recordings, heavily annotated texts, complete descriptive grammar
- TOOLS: computational resources that facilitate creating, viewing, querying, or otherwise using language data
- includes fonts, stylesheets, DTDs, Schemas
- ADVICE: any information about
- reliable data sources, appropriate tools and practices
19 The Gap
20 Coordinated Approach
OAI
OLAC
"A shared architectural vision, having many components, and implemented in stages by the community, will bridge the gap"
Analogies: federated databases, semantic web
21 OLAC
USER SERVICES
OLAC SERVICES
OLAC REPOSITORIES
CONTENT
METADATA
OAI
22 The Foundation: 3 Initiatives
- Dublin Core Metadata Initiative (DC)
- founded in 1995 (Dublin, Ohio)
- conventions for resource discovery on the web
- Open Archives Initiative (OAI)
- founded in 1999 (Santa Fe)
- interoperability of e-print services
- Open Language Archives Community (OLAC)
- founded in 2000 (Philadelphia)
- a partnership of institutions and individuals
- creating a worldwide virtual library of language
resources
23 Foundation 1: DC Elements
- 15 metadata elements
- broad interdisciplinary consensus
- each element is optional and repeatable
- applies to digital and traditional formats
- Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights
- dublincore.org
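The "optional and repeatable" property of the 15 elements can be illustrated with a small Python sketch. This is only an illustration, not an official Dublin Core API: the `make_record` helper and the sample values are hypothetical, and only the element names come from the slide.

```python
# The 15 Dublin Core elements, as listed on the slide.
DC_ELEMENTS = (
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
)


def make_record(**fields):
    """Build a record mapping each supplied element to a list of values.

    Hypothetical helper: every element is optional (it may be absent) and
    repeatable (it may carry several values), so values are stored as lists.
    """
    record = {}
    for name, value in fields.items():
        if name not in DC_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {name}")
        values = value if isinstance(value, (list, tuple)) else [value]
        record[name] = [str(v) for v in values]
    return record


# Invented sample values, not taken from a real archive.
example = make_record(
    Title="Sample bilingual word list",
    Creator=["A. Researcher", "B. Researcher"],  # repeated element
    Language="en",
)
print(example)
```

Elements that are not supplied (for example Publisher) are simply absent from the record, which is how optionality shows up in this sketch.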
24 Foundation 1: DC Qualifiers
- Encoding Schemes
- a controlled vocabulary or notation used to express the value of an element
- helps a client system to interpret the element content
- e.g. Language "en" (not "English", "Anglais", ...)
- Refinements
- make the meaning of an element more specific
- e.g. Subject.language, Type.linguistic
25 Foundation 2: OAI Repository
26 Foundation 2: OAI Standards
- To implement the OAI infrastructure, an archive must comply with two standards
- 1. The OAI Shared Metadata Set
- Dublin Core
- interoperability across all repositories
- 2. The OAI Metadata Harvesting Protocol
- HTTP requests: 6 verbs
- Identify, ListIdentifiers, ListMetadataFormats, ListSets, ListRecords, GetRecord
- XML responses
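Since the harvesting protocol is plain HTTP requests carrying one of six verbs, a request can be sketched as a URL with a `verb` query parameter. The Python sketch below only builds such URLs and does not contact any server. The base URL is a hypothetical placeholder, and the `metadataPrefix=oai_dc` argument follows the published OAI-PMH 2.0 specification rather than anything stated on the slide.

```python
from urllib.parse import urlencode

# The six protocol verbs listed on the slide.
OAI_VERBS = (
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
)


def oai_request_url(base_url, verb, **arguments):
    """Return the URL for one harvesting request (hypothetical helper)."""
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, used for illustration only.
BASE_URL = "http://archive.example.org/oai"

print(oai_request_url(BASE_URL, "Identify"))
print(oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc"))
```

A service provider would issue such requests to each data provider and parse the XML responses. Keeping URL construction separate from network access makes the verb set easy to check in isolation.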
27 Foundation 2: OAI Service Providers and Data Providers
28 Foundation 3: OLAC and OAI
- Recall: OAI data providers must support
- Dublin Core metadata
- OAI metadata harvesting protocol
- BUT: OAI data providers can also support
- a more specialized metadata format
- a more specialized harvesting protocol
- What OLAC does
- specialized metadata for language resources
- specialized harvesting (extra validation)
29 OLAC Standards
- Aside
- standards: the protocols and interfaces that allow the community to function
- recommendations: "standards" for representing linguistic content
- OLAC has three primary standards
- OLACMS: the OLAC Metadata Set (Qualified DC)
- OLAC MHP: refinements to the OAI protocol
- OLAC Process: a procedure for identifying Best Common Practice Recommendations
30 The OLAC Metadata Set
- The three categories of metadata
- Work language: describes information entities and their intellectual attributes
- e.g. names of works and their creators
- Document language: describes and provides access to the physical manifestation of information
- e.g. format, publisher, date, rights
- Subject language: describes what a document is about
- e.g. subject, description
31 OLACMS and Controlled Vocabularies
- Language
- A language of the intellectual content of the resource (OLAC-Language)
- Subject.language
- A language which the content of the resource describes or discusses (OLAC-Language)
- OLAC-Language
- A vocabulary for identifying the language(s) that the data is in, or that a piece of linguistic description is about, or that a particular tool can process
32 Summary: With the software in place, we have a complete platform
CONTENT
METADATA
OAI
33 Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources
OLAC REPOSITORIES
CONTENT
METADATA
OAI
34 OLAC
USER SERVICES
OLAC SERVICES
OLAC REPOSITORIES
CONTENT
METADATA
OAI
Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)
35 OLACMS Helps Archive Versatility
- Given a shared metadata standard
- New language archives can be created on the fly by harvesting existing archives
- Rich information can be inferred by establishing temporal and geographic anchors for each document.
36 OLAC Infrastructure
- Helps to Solve Language Archive Problems such as
- Language Identification
- and
- Metadata Set for Multi-lingual Language Archives
37 The Language Identification Problem
- The DC code (e.g. en for English) is not enough to describe all the languages in the world
- Ethnologue (http://www.ethnologue.org) is comprehensive but not complete
- Potential problems of using Ethnologue (or any existing language list)
- over-splitting
- over-chunking
- omission
38 A Fundamental Solution to Language Identification Problems
- Registering language groups with an OLAC registration service
- An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes)
- e.g. ASAmis: ALV, AIS
- where ALV is Amis and AIS is Nataoran
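The idea of an extensional definition, where a user-defined group name stands for a set of language codes, can be sketched in Python as follows. The `LanguageGroupRegistry` class and its methods are hypothetical and do not describe an actual OLAC service. Only the group name ASAmis and the codes ALV and AIS come from the slide.

```python
# Language codes named on the slide, with their language names.
LANGUAGE_NAMES = {"ALV": "Amis", "AIS": "Nataoran"}


class LanguageGroupRegistry:
    """Hypothetical in-memory registry of user-defined language groups."""

    def __init__(self):
        self._groups = {}

    def register(self, group_name, codes):
        # A group is defined extensionally: by the set of codes it contains.
        self._groups[group_name] = frozenset(codes)

    def members(self, group_name):
        return self._groups[group_name]

    def groups_containing(self, code):
        # Reverse lookup: every registered group that includes this code.
        return sorted(
            name for name, codes in self._groups.items() if code in codes
        )


registry = LanguageGroupRegistry()
registry.register("ASAmis", {"ALV", "AIS"})

print(sorted(registry.members("ASAmis")))
print(registry.groups_containing("AIS"))
```

Because a group is just a set of existing codes, a search for the group can be expanded into a search over its member codes, so over-splitting in the underlying list is worked around without editing that list.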
39 Describing Multi-Lingual Resources in OLACMS
- Directionality is crucial in multilingual resources
- However, OLAC metadata is flat and unordered
- Bi-directional MT
- <Language code="X"/>
- <Language code="Y"/>
- <Subject.language code="X"/>
- <Subject.language code="Y"/>
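Why a flat, unordered record cannot carry direction can be seen in a short Python sketch. It builds the four Language and Subject.language elements of the bi-directional MT example with the standard-library `xml.etree.ElementTree`. The `record` wrapper element and both helper functions are assumptions made for illustration, not part of the OLAC metadata schema.

```python
import xml.etree.ElementTree as ET


def flat_metadata(languages, subject_languages):
    """Build a flat record; 'record' is a hypothetical wrapper element."""
    record = ET.Element("record")
    for code in languages:
        ET.SubElement(record, "Language", code=code)
    for code in subject_languages:
        ET.SubElement(record, "Subject.language", code=code)
    return record


def as_unordered(record):
    """View the record the way a flat, unordered schema does: as a set."""
    return frozenset((element.tag, element.get("code")) for element in record)


bidirectional = flat_metadata(["X", "Y"], ["X", "Y"])
print(ET.tostring(bidirectional, encoding="unicode"))

# Listing the codes in the opposite order yields the same unordered
# description, so nothing records which Language goes with which
# Subject.language.
swapped = flat_metadata(["Y", "X"], ["Y", "X"])
print(as_unordered(bidirectional) == as_unordered(swapped))
```

Under these assumptions the two records compare equal as sets, so a reader of the metadata cannot tell an X-to-Y resource paired with a Y-to-X resource from any other combination of the same four elements.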
40 Multi-lingual Resources II
- Text: Language
- Bitext (bilingual aligned corpus)
- There is always a directionality
- Original: Language
- Translation: Subject.language
- Language Description (Field Notes)
- Elicitation, transcription, translation, notes
- ?Multiple related resources
41 Language Archives Project of Taiwan
- Part of the National Digital Archives Project (NDAP)
- Pilot Stage: 2000-2001
- First Phase: 2002-2006
- Both Language Archives
- And Linguistic Anchor
42 Language and Digital Archives
43 Digital Archives are Linguistically Anchored
- Archives are anchored with a Lexical KnowledgeBase (LKB)
- because the LKB, as a collection of lexical types instantiated in archives, uniquely defines each archive
- and each lexical item is the conceptual atom projecting knowledge from archive to archive
44 Multi-anchor Knowledge Linking
- Geographical anchor based on GIS (geographic information system)
- Ecology (fauna, weather, geology, etc.)
- Socio-anthropological classification
- Linguistic anchor based on LKB
- etymology, language grouping, loan words
45 Institute of Linguistics Language Archives
46 Two branch projects
- 1. Chinese Archives: 5 sub-projects
- Early Mandarin Chinese Lexicon
- Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
- Modern Chinese Corpus and Treebank
- New Age Corpus: Linguistic Representations and Archives of Multimedia Data
- Southern-Min Archive: A Database of Historical Change in Language Distribution
- 2. Formosan Language Archives
47 Early Mandarin Chinese Lexicon
- GOAL
- Collect the corpus and the lexicon of the Early Mandarin Chinese period.
- Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development.
- Archives Description
- Digitization of texts (10,000,000 characters).
- Tagging of grammatical markers (3,500,000 characters).
- Construction of the lexical database.
- http://www.sinica.edu.tw/Early_Mandarin
49 Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
- Archives Description
- Digitizing the bronze inscriptions from the Shang to the Eastern Chou dynasties.
- Constructing a typological lexicon of bronze inscriptions and bamboo scripts, with accurate encoding and analysis of the bronze inscriptions and Chu scripts.
- Achievement
- Proofread bronze inscriptions (12,113 pieces).
- http://Inscription.sinica.edu.tw
51 Modern Chinese Corpus and Treebank
- Achievement
- Segmented words tagged with their part of speech (10-million-word version in 2006).
- Syntactic tree structures: 30,000.
- http://www.sinica.edu.tw/SinicaCorpus
- http://treebank.sinica.edu.tw
54 Treebank
55 New Age Corpus: Linguistic Representations and Archives of Multimedia Data
- Archives Description
- A multimodal corpus of spoken Mandarin in Taiwan.
- Collected by means of different designs of tasks and scenarios.
- Combining written transcripts with digital video and audio processing technology.
56 New Age Corpus: Linguistic Representations and Archives of Multimedia Data
- Achievement
- Transcribed and transformed 11 hours of digital data.
- Tagged 5 hours of speech data.
- http://mmc.sinica.edu.tw
58 Southern-Min Archive: A Database of Historical Change in Language Distribution
- Archives Description
- From the perspectives of historical change and geographical distribution.
- A tagged corpus of Southern Min written documents from the 16th century to the 20th century.
- A linguistic Geographical Information System displaying the distribution of languages in Hsinfeng.
60 Formosan Language Archives
- Archives Description
- Preserve the endangered Formosan Austronesian languages
- 1.1 corpora, lexicons and grammars
- 1.2 integration of linguistic information with GIS.
- fifteen extant Formosan languages
- 2.1 Rukai, Yami, Saisiyat, Tsou, Atayal, Bunun, Paiwan, Amis and Puyuma
- http://formosan.sinica.edu.tw/
64 Sinica BOW: Bilingual Ontological Wordnet
- To construct a Chinese WordNet as the linguistic ontology for knowledge representation
- To provide linguistic anchoring grounded with temporal information by building a synchronic lexicon for all historical periods, and
- To provide linguistic anchoring reference and implementation services.
66 Asian Language Resources Committee
- Mailing list: alr_at_cl.cs.titech.ac.jp
- Affiliated with AFNLP
- Cataloguing Asian Language Resources
- Will adopt OLACMS and search engine
- Hosting ALR Workshops (5 so far)
- Asian Language Processing Special Issues in Language Resources and Evaluation
- Co-Chairs: Tokunaga take_at_cl.cs.titech.ac.jp
- Huang churen_at_sinica.edu.tw
- http://www.cl.cs.titech.ac.jp/alr/
67 An Overview of the Natural Language Toolkit
- http://nltk.sourceforge.net
- Project Leaders: Steven Bird, Edward Loper, Ewan Klein
- Acknowledgement: I would like to thank Steven Bird for agreeing to let me use these slides on NLTK
68 Summary
- NLTK is a suite of open source Python modules, data sets and tutorials
- supporting research and development in natural language processing
- Download NLTK from nltk.sourceforge.net
- A truly multilingual toolkit accessible to beginning researchers in NLP
- A good way to attract international scholars to research on your language
- Also a good stepping stone for a developing HLT language to test a full range of NLP applications
69 Components of NLTK
- Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ... (50k lines of code)
- Corpora: 20 annotated data sets widely used in natural language processing (300Mb data)
- Documentation: a 360-page book, articles, reviews, API documentation
70 1. Code
- corpus readers
- tokenizers
- stemmers
- taggers
- parsers
- wordnet
- semantic interpretation
- clusterers
- evaluation metrics
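As a rough illustration of three of the code components listed above (tokenizers, stemmers, and simple frequency statistics), the Python sketch below uses the API of a current NLTK release, which may differ from the version distributed at nltk.sourceforge.net when these slides were prepared. It assumes the `nltk` package is installed. The calls shown need no corpus downloads, and the sample sentence is invented for the example.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Invented sample text, used only for this illustration.
text = "NLTK provides tokenizers, stemmers, taggers and parsers. Taggers tag tokens."

# Tokenizer: split the text into runs of word characters, then lowercase.
tokenizer = RegexpTokenizer(r"\w+")
tokens = [token.lower() for token in tokenizer.tokenize(text)]

# Stemmer: reduce each token to its Porter stem.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for token in tokens]

# Frequency distribution over the lowercased tokens.
freq = FreqDist(tokens)

print(tokens)
print(stems)
print(freq.most_common(3))
```

The regular-expression tokenizer is chosen here because it is language-neutral and needs no trained model, which suits the multilingual use the slides emphasize. Languages written without spaces, such as Chinese, would need a word segmenter instead.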
71 2. Corpora
- Brown Corpus
- Carnegie Mellon Pronouncing Dictionary
- CoNLL 2000 Chunking Corpus
- Project Gutenberg Selections
- NIST 1999 Information Extraction Entity Recognition Corpus
- US Presidential Inaugural Address Corpus
- Indian Language POS-Tagged Corpus
- Prepositional Phrase Attachment Corpus
- SENSEVAL 2 Corpus
- Sinica Treebank Corpus Sample
- Universal Declaration of Human Rights Corpus
- Stopwords Corpus
- TIMIT Corpus Sample
- Treebank Corpus Sample
72 3. Documentation
- a 360-page book about natural language processing in Python and NLTK
- teaches Python and NLP
- provides numerous examples and exercises
- installation instructions
- presentation slides for some of the book chapters
- API Documentation: describes every module, interface, class, and method
73 Parser demonstrations
74 Interactive session (WordNet)
75 Adoption in NLP courses
- Amsterdam, Ben-Gurion, Brown, Bryn Mawr,
CDAC-Mumbai, Coruña, Edinburgh, Erlangen,
Georgetown, Helsinki, IIT-Bombay, Iowa State,
Konstanz, MIT, Macquarie, Magdeburg, Malta,
Marquette, Melbourne, Nancy, Naval Postgraduate
School, Northeastern, Ohio State, Pitt, San Diego
State, Simon Fraser, Stanford, Syracuse
University, Tsuda College, U Colorado, UC
Berkeley, UMass Amherst, UNAM, U Penn, UT Austin,
Warsaw
76 Contribute
- NLTK is an open source project
- all code, data, documentation is free
- dozens of people have contributed over the past 6 years
- please visit the website for project ideas
- sign up on the NLTK-Announce mailing list to hear about new releases