Title: The Rosetta Project: ALL Language Archive and the Impact of the NSDL through Daughter Repository Net
1. The Rosetta Project: ALL Language Archive and the Impact of the NSDL
through Daughter Repository Networks
- Laura Buszard-Welcher, UC Berkeley / Long Now Foundation
- Susan Hooyenga, Eastern Michigan University
- Will Lewis, Univ. Washington / CSU Fresno
2. Panel Goals
- To present the Rosetta Project and its relationship to other
  organizations that promote best practice in the creation and archiving
  of digital language resources.
- To discuss the impact of this network on the discipline of linguistics.
- To examine the potential impact of the NSDL through a complex of
  organizations that functionally extends the NSDL through a daughter
  repository node.
3. Panel Participants
- Laura Buszard-Welcher, representing the Rosetta Project and also
  presenting OLAC (Open Language Archives Community)
- Susan Hooyenga, representing the LINGUIST List and E-MELD Project
  (Electronic Metastructure for Endangered Languages Data)
- Will Lewis, representing ODIN (Online Database of Interlinear Text) and
  GOLD (General Ontology for Linguistic Description)
4. The Rosetta Project Archive
- NSDL node for documentation of the world's languages
- Largest repository of language documentation on the Net
- Global network of linguists, speakers, educators
- Over 95,000 pages of resources on over 2,300 languages
- Doubled in size since joining NSDL in 2004
5. Goals: Linguistically Sophisticated Site with Comprehensive, Global Scope
- The Rosetta Project supports documentation of the world's nearly 7,000
  languages
- Leverages relatively new linguistic infrastructure: standardized
  metadata formats, global language catalog
- Promotes linguistic diversity by broadly disseminating resources on
  languages with small numbers of speakers
- Contributes to the effort to document endangered languages
6. Supporting Metadata Standardization and Interoperability
- Open Language Archives Community (OLAC) -- participating archive (and
  individuals)
- Electronic Metastructure for Endangered Languages Data (E-MELD)
- General Ontology for Linguistic Description (GOLD)
- Linguistic Society of America (LSA) Conversation on Endangered
  Languages Archiving
- NEW: Digital Endangered Languages and Musics Archive Network (DELAMAN)
7. Metadata
- oai_dc, nsdl_dc, olac_dc
- Open Language Archives Community (OLAC): set of 23 metadata elements
  and controlled vocabularies
- Subject.language (language described, rather than audience language)
- Uses Ethnologue language codes (now ISO/DIS 639-3)
- Type.linguistic (grammar, lexicon, text)
- Recommended extensions (Discourse Types, Linguistic Field,
  Participant roles)
- Enables searches across a network of archives that use OLAC metadata:
  http://www.language-archives.org/tools/search/
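
As a rough illustration of the Subject.language and Type.linguistic elements, the sketch below builds a small Dublin Core style record with Python's standard library. The Dublin Core and XML Schema instance namespace URIs are the standard ones. The OLAC namespace URI, the `olac:code` attribute layout, the function name `build_record`, and the sample title are assumptions made for this sketch and should be checked against the OLAC metadata specification before use.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for Dublin Core and XML Schema instance.
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
# Assumed OLAC namespace URI; verify against the OLAC specification.
OLAC = "http://www.language-archives.org/OLAC/1.1/"

for prefix, uri in (("dc", DC), ("xsi", XSI), ("olac", OLAC)):
    ET.register_namespace(prefix, uri)


def build_record(title, language_code, linguistic_type):
    """Build a simplified, hypothetical OLAC-style record."""
    record = ET.Element(f"{{{OLAC}}}olac")
    ET.SubElement(record, f"{{{DC}}}title").text = title

    # Subject.language: the language the resource describes.
    subject = ET.SubElement(record, f"{{{DC}}}subject")
    subject.set(f"{{{XSI}}}type", "olac:language")
    subject.set(f"{{{OLAC}}}code", language_code)

    # Type.linguistic: the kind of linguistic resource.
    rtype = ET.SubElement(record, f"{{{DC}}}type")
    rtype.set(f"{{{XSI}}}type", "olac:linguistic-type")
    rtype.set(f"{{{OLAC}}}code", linguistic_type)
    return record


if __name__ == "__main__":
    rec = build_record("Navajo wordlist", "nav", "lexicon")
    print(ET.tostring(rec, encoding="unicode"))
```

Because every participating archive describes the subject language with the same code set, a cross-archive search only has to compare codes rather than free-text language names.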
8. Developing tools for collaborative linguistic research
- Endangered Language Query Room
- DOCS (Digital Online Curation Services)
- LangGator
- Wordlist tool (collaboration with MPI-EVA)
- New Rosetta V2.0 Website
9. Site Infrastructure
- Plone 2.1 content management system, running in the Zope Application
  Server
- Open source, leverages worldwide developer communities
- Lots of plug-in modules for functionality expansion
- CMF Bibliography AT, Plone Board, etc.
- Heavily modified infrastructure (language node design) and user
  interface
- For a site demo, please come to our poster
10. Nodal Architecture
- New feature on the Rosetta V2.0 site
- Languages, language families, family subgroups, and dialects are all
  represented by nodes.
- A node is a content aggregation page
- Nodes and parent-child relationships each have unique IDs
- The system currently represents Ethnologue language relationships, but
  has the flexibility to be agnostic about them and to represent
  relationships from various theoretical perspectives
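
One way to picture this design is a small graph in which nodes and parent-child links both carry their own IDs, and each link records which classification asserts it. The sketch below is only an illustration under those assumptions; the class names, the ID scheme, and the `classification` label are hypothetical and do not describe the actual Rosetta implementation.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical ID scheme: one shared counter


@dataclass
class Node:
    name: str
    kind: str  # "family", "subgroup", "language", or "dialect"
    node_id: int = field(default_factory=lambda: next(_ids))


@dataclass
class Relation:
    parent_id: int
    child_id: int
    classification: str  # which source or theory asserts this link
    relation_id: int = field(default_factory=lambda: next(_ids))


class NodeGraph:
    def __init__(self):
        self.nodes = {}
        self.relations = []

    def add_node(self, name, kind):
        node = Node(name, kind)
        self.nodes[node.node_id] = node
        return node

    def link(self, parent, child, classification):
        rel = Relation(parent.node_id, child.node_id, classification)
        self.relations.append(rel)
        return rel

    def children(self, parent, classification):
        """Children of a node according to one classification only."""
        return [
            self.nodes[r.child_id]
            for r in self.relations
            if r.parent_id == parent.node_id
            and r.classification == classification
        ]


if __name__ == "__main__":
    graph = NodeGraph()
    family = graph.add_node("Family A", "family")
    language = graph.add_node("Language X", "language")
    graph.link(family, language, "ethnologue")
    print([n.name for n in graph.children(family, "ethnologue")])
```

Keeping the relationship separate from the node is what lets two competing classifications coexist: the same language node can be linked under different parents, each link tagged with its own source.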
11. Node Pages
- Accessible from a variety of browse and search pages
- Browse by language name, family, country, data type
- Quick search, advanced search
- Node page organization
- Node metadata
- Descriptive Resources
- Navigation classification tree
- Links to people functions, LINGUIST List people search
- External links and searches
12. Content
- In-house collection, vetting
- Primary focus of collection
- Rosetta descriptive categories
- Special collections
- Endangered Language Fund Digital Archives
- Alan Lomax Audio Collection
- Future collections that come in through DOCS
- Future development
- Uploaded, peer-reviewed resources
- Collaborative content areas (bulletin boards,
wiki)
13. Scanning
- Historically, the primary focus of in-house collection
- Rosetta serves over 95,000 images from a variety of published resources
- Excerpts in data categories (see following slides)
- Public domain resources can be scanned in their entirety
14. Categories of Collection (1)
15. Categories of Collection (2)
16. New! Audio Digitization
- Alan Lomax language audio collection
- Digitization of reel-to-reel and cassette language audio
- Accepting audio deposits (on a limited basis)
- Revised depositor consent forms
17. Resource Pages
- Accessed from node pages
- Bibliographic metadata
- Links to other resources
- Resource bundles
- Associated resource files
- Scanned images
- OCRed live text files
- Annotated text files
- Audio/video files
- User comments
18. Rosetta Corpus as Resource
- Three related projects tapping the Rosetta corpus
- ODIN: Online Database of Interlinear Text
- GOLD: General Ontology for Linguistic Description
- E-MELD: Electronic Metastructure for Endangered Languages Data
- ODIN and GOLD are outgrowths of the E-MELD project
19. ODIN
- Focus of ODIN
- Discover linguistic resources on the Web that contain language data
- Specifically interlinear data (structurally translated)
- Data extracted and indexed
- Searchable by language name and grammatical content
- Pointers given to source documents
20. ODIN
- Example snippet of interlinear text (from scanned text on Emberá-saija,
  spoken in Colombia):
  ba-pa-ci conaa ome                  (language data)
  be-HAB-PST old-man with             (linguistic analysis)
  He lived there with the old man.    (translation)
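
The three-line structure of the snippet (language data, linguistic analysis, translation) lends itself to a simple word-by-word alignment. The following is a minimal sketch, not ODIN's actual extraction code; the names `AlignedWord` and `parse_igt` are hypothetical, and the sketch assumes that words are separated by spaces and morphemes by hyphens.

```python
from dataclasses import dataclass


@dataclass
class AlignedWord:
    form: str        # word in the language being described
    gloss: str       # linguistic analysis of that word
    morphemes: list  # (morpheme, gloss) pairs; empty if they do not line up


def parse_igt(language_line, gloss_line, translation):
    """Align an interlinear example word by word."""
    forms = language_line.split()
    glosses = gloss_line.split()
    if len(forms) != len(glosses):
        raise ValueError("language and gloss lines differ in word count")

    words = []
    for form, gloss in zip(forms, glosses):
        form_parts = form.split("-")
        gloss_parts = gloss.split("-")
        # Pair morphemes only when the hyphen counts agree. A gloss such
        # as "old-man" for an unsegmented form has no morpheme alignment.
        if len(form_parts) == len(gloss_parts):
            pairs = list(zip(form_parts, gloss_parts))
        else:
            pairs = []
        words.append(AlignedWord(form, gloss, pairs))
    return {"words": words, "translation": translation}


if __name__ == "__main__":
    igt = parse_igt(
        "ba-pa-ci conaa ome",
        "be-HAB-PST old-man with",
        "He lived there with the old man.",
    )
    for word in igt["words"]:
        print(word.form, word.gloss, word.morphemes)
```

Note that the second word shows why naive hyphen splitting is not enough: "conaa" is one unsegmented form, while its gloss "old-man" contains a hyphen, so the sketch falls back to word-level alignment there.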
21. Heterogeneous Vocabularies
- Problem: linguists are not consistent in the terminology used in
  analyses!
- For example:
- be-HAB-PST old-man with
- PST, or past tense, can be expressed as
- PAST, PST, 3sPT, PASTC, COPPST, 1sgPAST, and many more!
22. GOLD
- General Ontology for Linguistic Description
- One purpose: normalize vocabularies for search
- Diagram: the variant tags PAST, PST, PASTC, COPPST, 3sPT, and 1sgPAST
  mapped to the normalized concepts TensePast and PersonSingular
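
The normalization idea can be sketched as a lookup from each variant tag to a set of shared concept labels, so that a single query matches differently encoded glosses. The table below is illustrative only: the labels reuse the strings shown on the slide, but the assignment of each tag to concepts is an assumption of this sketch and is not taken from the GOLD ontology itself.

```python
# Illustrative tag-to-concept table (assumed assignments, see lead-in).
TAG_TO_CONCEPTS = {
    "PAST": {"TensePast"},
    "PST": {"TensePast"},
    "PASTC": {"TensePast"},
    "COPPST": {"TensePast"},
    "3sPT": {"TensePast", "PersonSingular"},
    "1sgPAST": {"TensePast", "PersonSingular"},
}


def normalize(gloss):
    """Return the normalized concepts found in a hyphen-separated gloss."""
    concepts = set()
    for tag in gloss.split("-"):
        # Tags that are not in the table (e.g. lexical glosses) are skipped.
        concepts |= TAG_TO_CONCEPTS.get(tag, set())
    return concepts


def search(glosses, concept):
    """Return the glosses whose normalized concepts include `concept`."""
    return [g for g in glosses if concept in normalize(g)]


if __name__ == "__main__":
    data = ["be-HAB-PST", "go-1sgPAST", "old-man"]
    print(search(data, "TensePast"))
```

A query for the single concept TensePast then finds both "be-HAB-PST" and "go-1sgPAST", even though the two sources spell the tag differently.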
23. Smart Search
- Normalized vocabularies allow search over differently encoded resources
- Descriptive summaries, called language profiles, can be automatically
  created
- Profiles, and data, can be queried
24. Smart Search
25. Language Profiles
- Automatically generated
- Searchable themselves as resources
- Facilitate multi-language comparisons and analyses
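
A language profile of this kind could be approximated by counting the normalized grammatical concepts attested in a language's glossed examples. The sketch below assumes a small, hypothetical normalization table (`NORMALIZED`, including the invented label `AspectHabitual`) and hypothetical function names; it is meant only to show how profiles become queryable objects, not how the project actually builds them.

```python
from collections import Counter

# Hypothetical normalization table; a real system would draw on a shared
# ontology rather than a hard-coded dictionary.
NORMALIZED = {
    "PST": "TensePast",
    "PAST": "TensePast",
    "HAB": "AspectHabitual",
}


def build_profile(gloss_lines):
    """Count normalized concepts across the gloss lines of one language."""
    profile = Counter()
    for line in gloss_lines:
        for word in line.split():
            for tag in word.split("-"):
                if tag in NORMALIZED:
                    profile[NORMALIZED[tag]] += 1
    return profile


def languages_with(profiles, concept):
    """Names of languages whose profile attests the given concept."""
    return sorted(name for name, p in profiles.items() if p[concept] > 0)


if __name__ == "__main__":
    profiles = {
        "Language A": build_profile(["be-HAB-PST old-man with"]),
        "Language B": build_profile(["go-PAST"]),
    }
    print(languages_with(profiles, "TensePast"))
```

Since each profile is just a table of concept counts, comparing several languages reduces to comparing their tables.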
26. Community Functions
- Goal: build a network of linguists, speakers, educators
- People
- Member pages
- Regional and language curators
- Collaborative content
- Discussions (nodes, resources)
- Resource upload
- Vetting by volunteer language/family experts
- In the future? Wiki documents (unvetted, but
resources produced may go through higher vetting
levels)
27. Member Gallery
- Central access to member search and browse
- Central access to language forums
- Highlighted members
28. Member Profile Page
- User-defined content area
- List of recent uploads
- Lists of recent forum postings
29. LINGUIST List
- The world's largest online linguistic resource, since 1989
- A moderated listserv, with 21,000 subscribers worldwide
- Eastern Michigan University, Wayne State University, and University of
  Arizona
- www.linguistlist.org
30. The ALL Project
- 2-year NSF grant, 2003-2005
- LINGUIST List collaborated with Rosetta
- Increased our database of linguists around the world
- From 7,000 to 12,000 entries
- Institution, email address, linguistic subfield, and language specialty
31. Directory of Linguists
32. Browse Facility
33. Linguists specializing in Navajo
34. Linguist listing
35. The Navajo Language
36. Search results for Navajo
37. Search by language
38. Search results for Kayardild
39. Search results for Lardil
40. E-MELD: Electronic Metastructure for Endangered Languages Data
- 5-year NSF project, 2001-2006
- Goal: to aid in
- the preservation of endangered languages data
- the development of infrastructure for electronic archives and
  recommendations of best practices in digital language documentation
41. E-MELD Recommendations
- Developed through 5 annual workshops
- By working groups of language engineers and documentary linguists
- E-MELD School of Best Practices in Digital Language Documentation
- http://emeld.org/school/
- Information on Audio, Video, Unicode, XML, Metadata, Case Studies
  (Documentation of 12 endangered languages), etc.
- Ask-An-Expert
42. E-MELD School of Best Practices
43. E-MELD: Educating field linguists
- Interoperability between digital repositories
- Long-term preservation of irreplaceable materials
- Distinguishing between working format (e.g.,
Excel), presentation format (HTML), and archival
format (XML)
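
To make the working-versus-archival distinction concrete, the sketch below converts a small spreadsheet export (CSV, standing in for an Excel working file) into a plain XML document. The element names `wordlist` and `entry`, the function name `csv_to_xml`, and the column headers are hypothetical; this is not a schema recommended by E-MELD, and it assumes the column headers are valid XML element names.

```python
import csv
import io
import xml.etree.ElementTree as ET


def csv_to_xml(csv_text):
    """Convert CSV text with a header row into a simple XML string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    root = ET.Element("wordlist")  # hypothetical root element
    for row in reader:
        entry = ET.SubElement(root, "entry")
        for column, value in row.items():
            # Each column header becomes a child element of the entry.
            ET.SubElement(entry, column).text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    working_copy = "form,gloss\nome,with\n"
    print(csv_to_xml(working_copy))
```

The working file stays convenient for day-to-day editing, while the XML output is explicit about structure and does not depend on any particular spreadsheet program to be read later.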
44. New LINGUIST List projects
- Multi-Tree
- Language family trees
- Approved by experts in the field
- LL-MAP
- Using GIS to show where languages are spoken
45. Daughter Network Added Value
- Enriches Rosetta resources
- E-MELD best practice recommendations
- Better linguistic metadata (OLAC and Ethnologue)
- Functionally extends the NSDL through
- Making resources discoverable through OLAC
- Collaborations with organizations to build networked content
  (LINGUIST List people search)
- Queries on Rosetta content by external tools that are exposed on other
  sites (ODIN, also LangGator)
46. URLs
- Electronic Metastructure for Endangered Languages Data (E-MELD):
  http://www.emeld.org (School of Best Practice, FIELD Tool)
- Endangered Language Query Rooms:
  http://rosettaproject.org:8080/emeldbase/
- The Ethnologue: http://www.ethnologue.com
- General Ontology for Linguistic Description (GOLD):
  http://www.linguistics-ontology.org or
  http://emeld.org/school/workroom/terminology/
- LINGUIST List: http://www.linguistlist.org
- National Science Digital Library (NSDL): http://nsdl.org
- ODIN: www.csufresno.edu/odin
- Open Language Archives Community (OLAC):
  http://www.language-archives.org
- The Rosetta Project: http://www.rosettaproject.org/live. A preview of
  the new Web site is available at http://preview.rosettaproject.org.