From Synergy to Knowledge: Integrating multiple language resources Part I: Language Resources and Tools

Description:

Title: 1 Author: Javan Last modified by: churen Created Date: 12/28/2003 4:30:03 AM Document presentation format: Company – PowerPoint PPT presentation

Slides: 77

Provided by: JAV49

Transcript and Presenter's Notes

1
From Synergy to Knowledge: Integrating multiple language resources
Part I: Language Resources and Tools

Chu-Ren Huang
Academia Sinica
http://cwn.ling.sinica.edu.tw/huang/huang.htm

2
Outline: Language Resources and Tools

Introduction: 10 Years in Chinese Language Processing, a Mirror for Other Asian Languages
The Starting Point: Resources and Resource Sharing
OLAC: The Open Language Archives Community
Asian Language Resources Committee of AFNLP
Standards: ISO TC37 Language Resources Management
Language Archives Project of Taiwan
Tools: Getting Started in NLP with NLTK

3
Why Resources and Tools

Language Resources
Foundation and empirical basis of scientific
studies of natural languages
The only reliable source for language specific
features
Infrastructure for knowledge representation and
knowledge engineering
Essential to preserve linguistic and cultural
diversity
Tools
Needed to process
General enough for multilingual processing and
cross-lingual comparison
Robust enough to deal with language specific
issues

4
Chinese Language Processing as a Mirror

For the development of Asian Language Processing
Unlike Japanese, which has enjoyed being one of
the leaders in technological innovation
The development of Chinese language processing
coincides with the developing economies of Taiwan
and China
Especially the availability of Chinese language
PCs
Similar to the situation of many Asian languages
now

5
CLP in the past 10 years

A review of what happened in the past ten years
in Chinese Language Processing (1992-2002)
from a somewhat personal perspective
1992: Corpora
Completion of the first Chinese corpus for linguistic research (Huang and Chen, COLING 92, pp. 1214-1217)
- untagged, non-segmented
- but searchable

6
CLP 1992-1993

1992: Segmentation Standard
Announcement of the first national standard for word segmentation by the PRC government.
GB 13715: 信息处理用现代汉语分词规范 (Contemporary Chinese Word Segmentation Specification for Information Processing).
1993: Lexicon
Completion and release of the first version of the CKIP lexicon (with the category set and ICG thematic roles)
First version of K. Chen's parser for Chinese

7
CLP Corpus 1994-1995

1994
Tenth anniversary of the automation of Chinese historical textual databases.
Completion of the pre-Qin Classical Chinese corpus at Academia Sinica.
1995
Completion of Sinica Corpus (v. 1.0, 1 million words), the first balanced and tagged Chinese corpus.

8
CLP 1996

Research Institutes
10th Anniversary of the Institute of
Computational Linguistics at Peking University
10th Anniversary of the Chinese Knowledge
Information Processing Group at Academia Sinica
Anthology of Papers
Readings in Chinese Natural Language Processing
(Journal of Chinese Linguistics Monograph)
Editors: Huang, Chen, and Tsou

9
CLP 1996 November-1997

Sinica Corpus on Web
One of the first fully searchable language corpora on the WWW
http://www.sinica.edu.tw/ftms-bin/kiwi.sh (old webpage in web archives)
http://www.sinica.edu.tw/SinicaCorpus/ (current page)
1997
Publication of the first Chinese dictionary compiled directly from a corpus (Huang et al.'s Mandarin Daily Classifier Dictionary and Noun-Classifier Collocation Dictionary)
The Tenth Annual ROCLING conference

10
CLP 1998

KnowledgeNet
Release of HowNet, the first full-fledged Chinese and English-Chinese LKB
http://www.keenage.com/
Segmentation Standard
Official announcement of CNS14366 for Taiwan

11
CLP 2000 Treebanks

Simultaneous completion and announcement of two
Chinese Treebanks
Penn Chinese Treebank
Sinica Treebank
ACL Workshop on Chinese Language Processing

12
CLP 2001-2002

2001 Society
Formal approval of the formation of
ACL SigHAN, the first international organization
on Chinese Language Processing
2002
First SigHAN workshop on Chinese Language
Processing
Formal launch of Hsieh's Intelligent Character Encoding System (a sustainable solution to the missing character problem)
COLING2002 in Taipei

13
CLP 2003 -

2003
The First International Chinese Word Segmentation Bakeoff
http://www.sighan.org/bakeoff2003/
2002-2005
Chinese Proposition Bank
http://www.cis.upenn.edu/chinese/cpb/
2003, 2005, 2007
Chinese Gigaword Corpus v.1, v.2, and tagged version

14
What Did CLP Development Show?

Resources lead
When tools and standards complete a comprehensive infrastructure,
research will bloom

15
Resources Development

Towards a Sharable and Sustainable Model of
Resources Development
OLAC
Open Language Archives Community
http://www.language-archives.org

16
OLAC Aims

OLAC, the Open Language Archives Community, is an
international partnership of institutions and
individuals who are creating a worldwide virtual
library of language resources by
developing consensus on best current practice for
the digital archiving of language resources
developing a network of interoperating
repositories and services for housing and
accessing such resources.

17
OLAC Organization

Coordinators: Steven Bird, Gary Simons
Council: Anthony Aristar (Linguist List), Christopher Cieri (LDC), Gary Holton (Alaska Native Language Center), Chu-Ren Huang (Academia Sinica), Heidi Johnson (Archive of the Indigenous Languages of Latin America), Laurent Romary (Atilf, University of Nancy), Joan Spanne (SIL), Martin Wynne (Oxford Text Archive)
Participating Archives and Services: 39 archives, including LDC, ELRA, DFKI, CBOLD, ANLC, LACITO, Perseus, SIL, APS, Utrecht, Academia Sinica, TalkBank, Rosetta, MPI
Individual Members: 120

18
Types of Language Resource

DATA: any information which documents or describes a language, such as a monograph, data file, shoebox of index cards, unanalyzed recordings, heavily annotated texts, or complete descriptive grammar
TOOLS: computational resources that facilitate creating, viewing, querying, or otherwise using language data
includes fonts, stylesheets, DTDs, Schemas
ADVICE: any information about reliable data sources, appropriate tools and practices

19
The Gap
20
Coordinated Approach

OAI
OLAC
"A shared architectural vision, having many
components, and implemented in stages by the
community, will bridge the gap"
Analogies: federated databases, semantic web
21
OLAC
USER SERVICES
OLAC SERVICES
OLAC REPOSITORIES
CONTENT
METADATA
OAI
22
The Foundation: 3 initiatives

Dublin Core Metadata Initiative (DC)
founded in 1995 (Dublin, Ohio)
conventions for resource discovery on the web
Open Archives Initiative (OAI)
founded in 1999 (Santa Fe)
interoperability of e-print services
Open Language Archives Community (OLAC)
founded in 2000 (Philadelphia)
a partnership of institutions and individuals
creating a worldwide virtual library of language
resources

23
Foundation 1: DC Elements

15 metadata elements
broad interdisciplinary consensus
each element is optional and repeatable
applies to digital and traditional formats
Title, Creator, Subject, Description, Publisher,
Contributor, Date, Type, Format, Identifier,
Source, Language, Relation, Coverage, Rights.
dublincore.org
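
The properties listed on this slide (15 elements, each optional and repeatable) can be illustrated with a small Python sketch. This is only one possible representation, not a Dublin Core reference implementation: the function name `validate_record` and the sample values are assumptions made for illustration.

```python
# The 15 Dublin Core elements named on the slide.
DC_ELEMENTS = frozenset({
    "Title", "Creator", "Subject", "Description", "Publisher",
    "Contributor", "Date", "Type", "Format", "Identifier",
    "Source", "Language", "Relation", "Coverage", "Rights",
})


def validate_record(record):
    """Return the sorted element names in `record` that are not DC elements.

    `record` is a list of (element, value) pairs rather than a dict,
    because every element is optional and may occur more than once.
    """
    return sorted({name for name, _ in record if name not in DC_ELEMENTS})


# Hypothetical record; the values are illustrative, not real catalogue data.
record = [
    ("Title", "Sinica Corpus"),
    ("Creator", "Academia Sinica"),
    ("Language", "zh"),
    ("Subject", "balanced corpus"),
    ("Subject", "part-of-speech tagging"),  # a repeated element is allowed
]

unknown = validate_record(record)  # empty list: every element used is a DC element
```

A list of pairs is used so that repetition and omission are both natural; a fixed-field structure would force a decision about cardinality that Dublin Core deliberately leaves open.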

24
Foundation 1: DC Qualifiers

Encoding Schemes
a controlled vocabulary or notation used to
express the value of an element
helps a client system to interpret the element
content
e.g. Language "en" (not "English", "Anglais",
...)
Refinements
makes the meaning of an element more specific
e.g. Subject.language, Type.linguistic

25
Foundation 2: OAI Repository
26
Foundation 2: OAI Standards

To implement the OAI infrastructure, an archive
must comply with two standards
1. The OAI Shared Metadata Set
Dublin Core
interoperability across all repositories
2. The OAI Metadata Harvesting Protocol
HTTP requests - 6 verbs
Identify, ListIdentifiers, ListMetadataFormats,
ListSets, ListRecords, GetRecord
XML responses
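
As a rough illustration of the harvesting protocol described above, the following Python sketch builds the HTTP request for one of the six verbs. The base URL is hypothetical, and the `metadataPrefix=oai_dc` argument comes from the OAI protocol specification rather than from these slides; the helper name `oai_request_url` is likewise an assumption.

```python
from urllib.parse import urlencode

# The six protocol verbs listed on the slide.
OAI_VERBS = (
    "Identify", "ListIdentifiers", "ListMetadataFormats",
    "ListSets", "ListRecords", "GetRecord",
)


def oai_request_url(base_url, verb, **arguments):
    """Build the GET URL for one harvesting request.

    Raises ValueError for anything that is not one of the six verbs.
    """
    if verb not in OAI_VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    query = urlencode({"verb": verb, **arguments})
    return f"{base_url}?{query}"


# Hypothetical repository address, used only for illustration.
BASE_URL = "http://archive.example.org/oai"

url = oai_request_url(BASE_URL, "ListRecords", metadataPrefix="oai_dc")
```

A harvester would fetch this URL and parse the XML response; that step is omitted here because it depends on a live repository.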

27
Foundation 2: OAI Service Providers and Data Providers
28
Foundation 3 OLAC OAI

Recall: OAI data providers must support
Dublin Core Metadata
OAI Metadata harvesting protocol
But OAI data providers can also support
a more specialized metadata format
a more specialized harvesting protocol
What OLAC does:
specialized metadata for language resources
specialized harvesting (extra validation)

29
OLAC Standards

Aside
standards: the protocols and interfaces that allow the community to function
recommendations: "standards" for representing linguistic content
OLAC has three primary standards
OLACMS: the OLAC Metadata Set (Qualified DC)
OLAC MHP: refinements to the OAI protocol
OLAC Process: a procedure for identifying Best Common Practice Recommendations

30
The OLAC Metadata Set

The three categories of metadata
Work language: describes information entities and their intellectual attributes
e.g. names of works and their creators
Document language: describes and provides access to the physical manifestation of information
e.g. format, publisher, date, rights
Subject language: describes what a document is about
e.g. subject, description

31
OLACMS and Controlled Vocabularies

Language
A language of the intellectual content of the
resource (OLAC-Language)
Subject.language
A language which the content of the resource
describes or discusses (OLAC-Language)
OLAC-Language
A vocabulary for identifying the language(s) that
the data is in, or that a piece of linguistic
description is about, or that a particular tool
can process

32
Summary: With the software in place, we have a complete platform
CONTENT
METADATA
OAI
33
Summary: Repositories completely bridge the gap, letting us consistently organize and archive our resources
OLAC REPOSITORIES
CONTENT
METADATA
OAI
34
OLAC
USER SERVICES
OLAC SERVICES
OLAC REPOSITORIES
CONTENT
METADATA
OAI
Acknowledgements: ISLE and TalkBank projects (NSF), participants of the Philadelphia workshop, Eva Banik (programmer), Hernando de Soto (the analogy)
35
OLACMS helps archive versatility

Given a shared metadata standard
New language archives can be created on the fly
by harvesting existing archives
Rich information can be inferred by establishing
temporal and geographic anchors for each
document.
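
The idea of creating a new archive "on the fly" from harvested metadata can be sketched in Python as a simple filter over records. Everything here is a hypothetical illustration: the record layout, the identifiers, and the function name `harvest_by_language` are assumptions, not part of OLAC.

```python
def harvest_by_language(records, code):
    """Return the records whose Language or Subject.language includes `code`."""
    selected = []
    for record in records:
        codes = set(record.get("Language", []))
        codes |= set(record.get("Subject.language", []))
        if code in codes:
            selected.append(record)
    return selected


# Hypothetical records, as if already harvested from two existing archives.
harvested = [
    {"Identifier": "archive-a:001", "Language": ["zh"], "Subject.language": ["ALV"]},
    {"Identifier": "archive-a:002", "Language": ["en"], "Subject.language": ["zh"]},
    {"Identifier": "archive-b:001", "Language": ["ALV"]},
]

# A virtual archive of everything in or about the language coded ALV.
amis_archive = harvest_by_language(harvested, "ALV")
amis_ids = [record["Identifier"] for record in amis_archive]
```

Because every archive uses the same element names, the filter needs no per-archive logic; that uniformity is what makes the virtual archive cheap to build.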

36
OLAC Infrastructure

Helps to Solve Language Archive Problems such as
Language Identification
and
Metadata Set for Multi-lingual Language Archives

37
The Language Identification Problem

The DC code (e.g. en for English) is not enough to describe all the languages in the world
Ethnologue (http://www.ethnologue.org) is comprehensive but not complete
Potential problems of using Ethnologue (or any existing language list)
over-splitting
over-chunking
omission

38
A Fundamental Solution to Language Identification
Problems

Registering language groups with an OLAC
registration service
An OLAC language classification server would house a comprehensive list of language family names (defined by users) and their extensional definitions (i.e. sets of Ethnologue codes)
ASAmis = {ALV, AIS}
ALV = Amis, AIS = Nataoran
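
The registration idea above, where a user-defined group name stands for a set of language codes, can be sketched in Python as follows. The class and method names are hypothetical, and this is not a description of any actual OLAC service; only the group name ASAmis and the codes ALV and AIS come from the slide.

```python
class LanguageGroupRegistry:
    """Maps user-defined group names to their extensional definitions."""

    def __init__(self):
        self._groups = {}

    def register(self, name, codes):
        # The definition is extensional: just the set of member codes.
        self._groups[name] = frozenset(codes)

    def expand(self, name):
        """Return the set of codes that the group name stands for."""
        return self._groups[name]

    def matches(self, name, record_codes):
        """True if any code on a record belongs to the named group."""
        return not self._groups[name].isdisjoint(record_codes)


registry = LanguageGroupRegistry()
registry.register("ASAmis", {"ALV", "AIS"})

members = registry.expand("ASAmis")
hit = registry.matches("ASAmis", ["AIS"])
miss = registry.matches("ASAmis", ["zh"])
```

Searching by group name then reduces to expanding the name and matching any member code, which sidesteps the question of whether the underlying list has split or merged the varieties.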

39
Describing Multi-Lingual Resources in OLACMS

Directionality is crucial in multilingual
resources
However, OLAC metadata is flat and unordered
Bi-directional MT
<Language code="X"/>
<Language code="Y"/>
<Subject.language code="X"/>
<Subject.language code="Y"/>
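
To make "flat and unordered" concrete, the sketch below models a metadata record in Python as a set of (element, code) pairs. The modelling choice and the name `olac_record` are assumptions for illustration; X and Y are the placeholder codes from the slide.

```python
def olac_record(elements):
    """Model a flat metadata record as an unordered set of (element, code) pairs."""
    return frozenset(elements)


# The four elements the slide lists for a bi-directional MT system.
bidirectional = olac_record([
    ("Language", "X"),
    ("Language", "Y"),
    ("Subject.language", "X"),
    ("Subject.language", "Y"),
])

# The same elements written in a different order.
reordered = olac_record([
    ("Subject.language", "Y"),
    ("Language", "Y"),
    ("Subject.language", "X"),
    ("Language", "X"),
])

# Order carries no information, so the two records are indistinguishable.
same = bidirectional == reordered
```

Nothing in such a record pairs a particular Language with a particular Subject.language, so any directional relation between X and Y would have to be recorded some other way.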

40
Multi-lingual Resources II

Text language
Bitext (bilingual aligned corpus)
There is always a directionality
Original language
Translation Subject.language
Language Description (Field Notes)
Elicitation, transcription, translation, notes
?Multiple related resources

41
Language Archives Project of Taiwan

Part of the National Digital Archives Project
(NDAP)
Pilot Stage 2000-2001
First Phase 2002-2006
Both Language Archives
And Linguistic Anchor

42
Language and Digital Archives
43
Digital Archives are Linguistically Anchored

Archives are anchored with Lexical KnowledgeBase
(LKB)
-because LKB as collection of lexical types
instantiated in archives uniquely defines each
archive
-And each lexical item is the conceptual atom
projecting knowledge from archive to archive

44
Multi-anchor Knowledge Linking

Geographical anchor based on GIS (geographic information system)
-Ecology (Fauna, Weather, Geology etc.)
-Socio-Anthropological classification
Linguistic anchor based on LKB
-etymology, language grouping, loan words,

45
Institute of Linguistics
Language Archives
46

Two branch projects
1. Chinese Archives: 5 sub-projects
Early Mandarin Chinese Lexicon
Lexical Database of Pre-Qin Bronze and Bamboo Manuscripts
Modern Chinese Corpus and Treebank
New Age Corpus: Linguistic Representations and Archives of Multimedia Data
Southern-Min Archive: A Database of Historical Change in Language Distribution
2. Formosan Language Archives

47
Early Mandarin Chinese Lexicon

GOAL
Collect the corpus and the lexicon of the Early Mandarin Chinese period.
Provide a systematic knowledge thesaurus as well as a powerful instrument for the study of grammatical development.
Archives Description
Digitization of texts (10,000,000 characters).
Tagging of grammatical markers (3,500,000 characters).
Construction of the lexical database.
http://www.sinica.edu.tw/Early_Mandarin

49
Lexical Database of Pre-Qin Bronze and Bamboo
Manuscripts

Archives Description
Digitize the bronze inscriptions from the Shang to the Eastern Chou dynasties.
Construct a typological lexicon of bronze inscriptions and bamboo scripts.
Provide accurate encoding and analysis for the bronze inscriptions and Chu scripts.
Achievement
Proofread bronze inscriptions (12,113 pieces).
http://Inscription.sinica.edu.tw

51
Modern Chinese Corpus and Treebank

Achievement
Segmented words tagged with their part of speech (10-million-word version in 2006).
Syntactic tree structures: 30,000.
http://www.sinica.edu.tw/SinicaCorpus
http://treebank.sinica.edu.tw

54
Treebank
55
New Age Corpus: Linguistic Representations and Archives of Multimedia Data

Archives Description
A multimodal corpus of spoken Mandarin in Taiwan.
By means of different designs of tasks and
scenarios.
Combining data format of written transcripts with
digital technology of video and audio processing.

56
New Age Corpus: Linguistic Representations and Archives of Multimedia Data

Achievement
Transcribed and transformed 11 hours of digital data.
Tagged 5 hours of speech data.
http://mmc.sinica.edu.tw

58
Southern-Min Archive: A Database of Historical Change in Language Distribution

Archives Description
From the perspectives of historical change and
geographical distribution.
A tagged corpus of Southern Min written documents
from 16th century to 20th century.
A linguistic Geographic Information System displaying distributions of languages in Hsinfeng.

60
Formosan Language Archives

Archives Description
Preserve the endangered Formosan Austronesian
languages
1.1 corpora, lexicons and grammars
1.2 integration of linguistic information with
GIS.
fifteen extant Formosan languages
2.1 Rukai, Yami, Saisiyat, Tsou, Atayal, Bunun,
Paiwan, Amis and Puyuma
http://formosan.sinica.edu.tw/

64
Sinica BOW: Bilingual Ontological Wordnet

To construct a Chinese WordNet as the linguistic
ontology for knowledge representation
To provide linguistic anchoring grounded with
temporal information by building a synchronic
lexicon for all historical periods and
To provide linguistic anchoring reference and
implementation services.

66
Asian Language Resources Committee

Mail List: alr_at_cl.cs.titech.ac.jp
Affiliated with AFNLP
Cataloguing Asian Language Resources
Will adopt OLACMS and search engine
Hosting ALR Workshops (5 so far)
Asian Language Processing Special Issues in Language Resources and Evaluation
Co-Chairs: Tokunaga (take_at_cl.cs.titech.ac.jp), Huang (churen_at_sinica.edu.tw)
http://www.cl.cs.titech.ac.jp/alr/

67
An overview of the Natural Language Toolkit

http://nltk.sourceforge.net
Project Leaders: Steven Bird, Edward Loper, Ewan Klein
Acknowledgement: I would like to thank Steven Bird for agreeing to let me use these slides on NLTK

68
Summary

NLTK is a suite of open source Python modules,
data sets and tutorials
supporting research and development in natural
language processing
Download NLTK from nltk.sourceforge.net
A Truly Multilingual Toolkit accessible to
beginning researchers in NLP
A good way to attract international scholars to
research on your language
Also a good stepping stone for a developing HLT
language to test a full range of NLP applications

69
Components of NLTK

Code: corpus readers, tokenizers, stemmers, taggers, chunkers, parsers, wordnet, ... (50k lines of code)
Corpora: 20 annotated data sets widely used in natural language processing (300Mb data)
Documentation: a 360-page book, articles, reviews, API documentation

70
1. Code

corpus readers
tokenizers
stemmers
taggers
parsers
wordnet
semantic interpretation
clusterers
evaluation metrics

71
2. Corpora

Brown Corpus
Carnegie Mellon Pronouncing Dictionary
CoNLL 2000 Chunking Corpus
Project Gutenberg Selections
NIST 1999 Information Extraction Entity
Recognition Corpus
US Presidential Inaugural Address Corpus
Indian Language POS-Tagged Corpus
Prepositional Phrase Attachment Corpus
SENSEVAL 2 Corpus
Sinica Treebank Corpus Sample
Universal Declaration of Human Rights Corpus
Stopwords Corpus
TIMIT Corpus Sample
Treebank Corpus Sample

72
3. Documentation

a 360-page book about natural language processing
in Python and NLTK
teaches Python and NLP
provides numerous examples and exercises
installation instructions
presentation slides for some of the book chapters
API Documentation: describes every module, interface, class, and method

73
Parser demonstrations
74
Interactive session (WordNet)
75
Adoption in NLP courses

Amsterdam, Ben-Gurion, Brown, Bryn Mawr,
CDAC-Mumbai, Coruña, Edinburgh, Erlangen,
Georgetown, Helsinki, IIT-Bombay, Iowa State,
Konstanz, MIT, Macquarie, Magdeburg, Malta,
Marquette, Melbourne, Nancy, Naval Postgraduate
School, Northeastern, Ohio State, Pitt, San Diego
State, Simon Fraser, Stanford, Syracuse
University, Tsuda College, U Colorado, UC
Berkeley, UMass Amherst, UNAM, U Penn, UT Austin,
Warsaw

76
Contribute

NLTK is an open source project
all code, data, and documentation are free
dozens of people have contributed over the past 6
years
please visit the website for project ideas
sign up on the NLTK-Announce mailing list to hear
about new releases
