The Rosetta Project: ALL Language Archive and the Impact of the NSDL through Daughter Repository Networks

Description:

Transcribed indigenous texts with word glosses, free translations and grammatical markup. ... A common text with translation for each language. ...

Slides: 47
Provided by: laurabusza

1
The Rosetta Project: ALL Language Archive and
the Impact of the NSDL through Daughter
Repository Networks
  • Laura Buszard-Welcher, UC Berkeley / Long Now
    Foundation
  • Susan Hooyenga, Eastern Michigan University
  • Will Lewis, Univ. Washington / CSU Fresno

2
Panel Goals
  • To present the Rosetta Project and its
    relationship to other organizations that promote
    best practice in the creation and archiving of
    digital language resources.
  • To discuss the impact of this network on the
    discipline of linguistics.
  • To examine the potential impact of the NSDL
    through a complex of organizations that
    functionally extends the NSDL through a daughter
    repository node.

3
Panel Participants
  • Laura Buszard-Welcher, representing the Rosetta
    Project and also presenting OLAC (Open Language
    Archives Community)
  • Susan Hooyenga, representing the LINGUIST List
    and E-MELD Project (Electronic Metastructure for
    Endangered Languages Data)
  • Will Lewis, representing ODIN (Online Database of
    Interlinear Text) and GOLD (General Ontology for
    Linguistic Description)

4
The Rosetta Project Archive
  • NSDL node for documentation of the world's
    languages
  • Largest repository of language documentation on
    the Net
  • Global network of linguists, speakers, educators
  • Over 95,000 pages of resources on over 2,300
    languages
  • Doubled in size since joining NSDL in 2004

5
Goals: Linguistically Sophisticated Site with
Comprehensive, Global Scope
  • The Rosetta Project supports documentation of the
    world's nearly 7000 languages
  • Leverages relatively new linguistic
    infrastructure: standardized metadata formats,
    global language catalog
  • Promotes linguistic diversity by broadly
    disseminating resources on languages with small
    numbers of speakers
  • Contributes to the effort to document endangered
    languages

6
Supporting Metadata Standardization and
Interoperability
  • Open Language Archives Community (OLAC) --
    participating archive (and individuals)
  • Electronic Metastructure for Endangered Languages
    Data (E-MELD)
  • General Ontology for Linguistic Description
    (GOLD)
  • Linguistic Society of America (LSA) Conversation
    on Endangered Languages Archiving
  • NEW: Digital Endangered Languages and Musics
    Archive Network (DELAMAN)

7
Metadata
  • oai_dc, nsdl_dc, olac_dc
  • Open Language Archives Community (OLAC): Set of
    23 metadata elements and controlled vocabularies
  • Subject.language (language described, rather than
    audience language)
  • Uses Ethnologue language codes (now ISO/DIS 639-3)
  • Type.linguistic (grammar, lexicon, text)
  • Recommended extensions (Discourse Types,
    Linguistic Field, Participant roles)
  • Enables searches across a network of archives
    that use OLAC metadata:
    http://www.language-archives.org/tools/search/
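
A minimal sketch of why shared metadata makes cross-archive search possible. Only the ideas of a subject-language code and a linguistic type come from the slide; the `Record` fields, archive names, language codes, and sample entries are hypothetical and do not reproduce the OLAC schema or its search service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    archive: str            # hypothetical archive name
    title: str
    subject_language: str   # code of the language described (illustrative value)
    linguistic_type: str    # "grammar", "lexicon", or "text"


def search(records, language=None, linguistic_type=None):
    """Return the records that match every criterion that was supplied."""
    return [
        r for r in records
        if (language is None or r.subject_language == language)
        and (linguistic_type is None or r.linguistic_type == linguistic_type)
    ]


# Records harvested from two hypothetical archives that share one vocabulary.
RECORDS = [
    Record("archive-a", "Sketch grammar", "nav", "grammar"),
    Record("archive-b", "Word list", "nav", "lexicon"),
    Record("archive-b", "Narrative text", "lbz", "text"),
]

hits = search(RECORDS, language="nav")
print([(r.archive, r.title) for r in hits])
```

Because both archives describe their holdings with the same fields and codes, one query reaches both without archive-specific logic.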

8
Developing tools for collaborative linguistic
research
  • Endangered Language Query Room
  • DOCS (Digital Online Curation Services)
  • LangGator
  • Wordlist tool (collaboration with MPI-EVA)
  • New Rosetta V2.0 Website

9
Site Infrastructure
  • Plone 2.1 content management system, running in
    the Zope Application Server
  • Open source, leverages worldwide developer
    communities
  • Lots of plug-in modules for functionality
    expansion
  • CMF Bibliography AT, Plone Board, etc.
  • Heavily modified infrastructure (language node
    design) and user interface
  • For a site demo, please come to our poster

10
Nodal Architecture
  • New feature on the Rosetta V2.0 site
  • Languages, language families, family subgroups,
    dialects all represented by nodes.
  • A node is a content aggregation page
  • Nodes and parent-child relationships each have
    unique IDs
  • The system currently represents Ethnologue
    language relationships, but has the flexibility
    to be agnostic about them and to represent
    relationships from various theoretical
    perspectives
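
One way to picture the node design is sketched here, assuming a simple in-memory graph. The class names, the `scheme` field, and the placeholder node names are hypothetical and are not taken from the Rosetta V2.0 code. The point is that relationships carry their own IDs and a label for the classification that asserts them, so several classifications can coexist over the same nodes.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Node:
    node_id: int
    name: str
    kind: str  # "family", "subgroup", "language", or "dialect"


@dataclass(frozen=True)
class Relation:
    relation_id: int
    parent_id: int
    child_id: int
    scheme: str  # which classification asserts this parent-child link


class NodeGraph:
    def __init__(self):
        self._ids = count(1)   # one counter, so every ID is unique
        self.nodes = {}
        self.relations = []

    def add_node(self, name, kind):
        node = Node(next(self._ids), name, kind)
        self.nodes[node.node_id] = node
        return node

    def link(self, parent, child, scheme):
        rel = Relation(next(self._ids), parent.node_id, child.node_id, scheme)
        self.relations.append(rel)
        return rel

    def children(self, parent, scheme):
        """Children of a node according to one classification scheme."""
        return [self.nodes[r.child_id] for r in self.relations
                if r.parent_id == parent.node_id and r.scheme == scheme]


graph = NodeGraph()
family = graph.add_node("Family A", "family")
subgroup = graph.add_node("Subgroup B", "subgroup")
language = graph.add_node("Language C", "language")

# The same nodes, related differently under two classifications.
graph.link(family, subgroup, scheme="ethnologue")
graph.link(subgroup, language, scheme="ethnologue")
graph.link(family, language, scheme="alternative")

print([n.name for n in graph.children(family, "ethnologue")])
print([n.name for n in graph.children(family, "alternative")])
```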

11
Node Pages
  • Accessible from a variety of browse and search
    pages
  • Browse by language name, family, country, data
    type
  • Quick search, advanced search
  • Node page organization
  • Node metadata
  • Descriptive Resources
  • Navigation classification tree
  • Links to people functions, LINGUIST List people
    search
  • External links searches

12
Content
  • In-house collection, vetting
  • Primary focus of collection
  • Rosetta descriptive categories
  • Special collections
  • Endangered Language Fund Digital Archives
  • Alan Lomax Audio Collection
  • Future collections that come in through DOCS
  • Future development
  • Uploaded, peer-reviewed resources
  • Collaborative content areas (bulletin boards,
    wiki)

13
Scanning
  • Historically, the primary focus of in-house
    collection
  • Rosetta serves over 95,000 images from a variety
    of published resources
  • Excerpts in data categories (see following
    slides)
  • Public domain resources can be scanned in their
    entirety

14
Categories of Collection (1)
15
Categories of Collection (2)
16
New! Audio Digitization
  • Alan Lomax language audio collection
  • Digitization of reel-to-reel and cassette
    language audio
  • Accepting audio deposits (on a limited basis)
  • Revised depositor consent forms

17
Resource Pages
  • Accessed from node pages
  • Bibliographic metadata
  • Links to other resources
  • Resource bundles
  • Associated resource files
  • Scanned images
  • OCRed live text files
  • Annotated text files
  • Audio/video files
  • User comments

18
Rosetta Corpus as Resource
  • Three related projects tapping Rosetta corpus
  • ODIN: Online Database of Interlinear Text
  • GOLD: General Ontology for Linguistic Description
  • E-MELD: Electronic Metastructure for Endangered
    Languages Data
  • ODIN and GOLD: outgrowths of the E-MELD project

19
ODIN
  • Focus of ODIN
  • Discover on the Web linguistic resources with
    language data
  • Specifically interlinear data (structurally
    translated)
  • Data extracted and indexed
  • Searchable by language name, and grammatical
    content
  • Pointers given to source documents

20
ODIN
  • Example snippet of interlinear text
  • (from scanned text on Emberá-saija, spoken in
    Colombia)
  • Language data: ba-pa-ci conaa ome
  • Linguistic analysis: be-HAB-PST old-man with
  • Translation: He lived there with the old man.

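
A minimal sketch of how the language line and the analysis line of an interlinear example can be paired for indexing. It assumes whitespace-separated words and hyphen-separated morphemes. The function name and the fallback rule are hypothetical and do not describe ODIN's actual extraction code.

```python
def align_interlinear(language_line, gloss_line):
    """Pair each word with its gloss, and morphemes with labels when counts match."""
    words = language_line.split()
    glosses = gloss_line.split()
    if len(words) != len(glosses):
        raise ValueError("language line and gloss line have different word counts")

    aligned = []
    for word, gloss in zip(words, glosses):
        morphemes, labels = word.split("-"), gloss.split("-")
        if len(morphemes) == len(labels):
            # One label per morpheme: align them individually.
            aligned.append(list(zip(morphemes, labels)))
        else:
            # Hyphens do not line up (e.g. "conaa" / "old-man"): keep the word whole.
            aligned.append([(word, gloss)])
    return aligned


pairs = align_interlinear("ba-pa-ci conaa ome", "be-HAB-PST old-man with")
for word_pairs in pairs:
    print(word_pairs)
```

The fallback matters because a hyphen in the gloss line does not always mark a morpheme boundary; "old-man" glosses the single form "conaa".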
21
Heterogeneous Vocabularies
  • Problem: linguists are not consistent in the
    terminology used in analyses!
  • For example:
  • be-HAB-PST old-man with
  • PST, or past tense, can be expressed as
  • PAST, PST, 3sPT, PASTC, COPPST,
  • 1sgPAST, and many more!

22
GOLD
  • General Ontology for Linguistic Description
  • One purpose: normalize vocabularies for search
  • Diagram: variant labels (PAST, PST, PASTC, COPPST,
    3sPT, 1sgPAST) mapped to GOLD concepts (TensePast,
    PersonSingular)

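
The normalization idea can be sketched as a lookup from variant labels to a single concept. The concept name `"PastTense"` is a placeholder rather than an actual GOLD identifier, and the gloss lines other than the Emberá one are invented for illustration. Combined labels such as 3sPT also encode person and number, which this sketch ignores.

```python
# Variant labels from the slide, all mapped to one placeholder concept.
PAST_VARIANTS = ["PAST", "PST", "3sPT", "PASTC", "COPPST", "1sgPAST"]
LABEL_TO_CONCEPT = {label: "PastTense" for label in PAST_VARIANTS}


def normalize(label):
    """Map a source-specific label to its shared concept, if one is known."""
    return LABEL_TO_CONCEPT.get(label, label)


def concepts_in(gloss_line):
    """Collect the normalized labels that occur anywhere in a gloss line."""
    return {normalize(label)
            for word in gloss_line.split()
            for label in word.split("-")}


def search(gloss_lines, concept):
    """Return the gloss lines that contain the concept under any label."""
    return [line for line in gloss_lines if concept in concepts_in(line)]


GLOSS_LINES = [
    "be-HAB-PST old-man with",   # from the slide
    "go-PAST house",             # invented example
    "see-PRES dog",              # invented example
]

print(search(GLOSS_LINES, "PastTense"))
```

A query for the shared concept finds both the PST and the PAST line, even though the two sources never used the same label.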
23
Smart Search
  • Normalized vocabularies allow search over
    differently encoded resources
  • Descriptive summaries, called language profiles,
    can be automatically created
  • Profiles, and data, can be queried

24
Smart Search
25
Language Profiles
  • Automatically generated
  • Searchable themselves as resources
  • Facilitate multi-language comparisons and analyses

26
Community Functions
  • Goal: build a network of linguists, speakers,
    educators
  • People
  • Member pages
  • Regional and language curators
  • Collaborative content
  • Discussions (nodes, resources)
  • Resource upload
  • Vetting by volunteer language/family experts
  • In the future? Wiki documents (unvetted, but
    resources produced may go through higher vetting
    levels)

27
Member Gallery
  • Central access to member search and browse
  • Central access to language forums
  • Highlighted members

28
Member Profile Page
  • User-defined content area
  • List of recent uploads
  • Lists of recent forum postings

29
LINGUIST List
  • The world's largest online linguistic resource
    since 1989
  • A moderated listserv, with 21,000 subscribers
    worldwide
  • Eastern Michigan University, Wayne State
    University, and University of Arizona
  • www.linguistlist.org

30
The ALL Project
  • 2-year NSF grant, 2003-2005
  • LINGUIST List collaborated with Rosetta
  • Increased our database of linguists around the
    world
  • From 7,000 to 12,000 entries
  • Institution, email address, linguistic subfield,
    and language specialty

31
Directory of Linguists
32
Browse Facility
33
Linguists specializing in Navajo
34
Linguist listing
35
The Navajo Language
36
Search results for Navajo
37
Search by language
38
Search results for Kayardild
39
Search results for Lardil
40
E-MELD: Electronic Metastructure for Endangered
Languages Data
  • 5-year NSF project, 2001-2006
  • Goal: to aid in
  • the preservation of endangered languages data
  • the development of infrastructure for electronic
    archives and recommendations of best practices in
    digital language documentation

41
E-MELD Recommendations
  • Developed through 5 annual workshops
  • By working groups of language engineers and
    documentary linguists
  • E-MELD School of Best Practices in Digital
    Language Documentation
  • http://emeld.org/school/
  • Information on Audio, Video, Unicode, XML,
    Metadata, Case Studies (Documentation of 12
    endangered languages), etc.
  • Ask-An-Expert

42
E-MELD School of Best Practices
43
E-MELD: Educating field linguists
  • Interoperability between digital repositories
  • Long-term preservation of irreplaceable materials
  • Distinguishing between working format (e.g.,
    Excel), presentation format (HTML), and archival
    format (XML)
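
A minimal sketch of moving data from a working format to an archival one, assuming the spreadsheet has been exported as CSV with `form` and `gloss` columns. The column names and XML element names are hypothetical and do not represent an E-MELD-recommended schema.

```python
import csv
import io
import xml.etree.ElementTree as ET


def wordlist_csv_to_xml(csv_text):
    """Convert a CSV word list into a simple XML document (as a string)."""
    root = ET.Element("wordlist")
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = ET.SubElement(root, "entry")
        ET.SubElement(entry, "form").text = row["form"]
        ET.SubElement(entry, "gloss").text = row["gloss"]
    return ET.tostring(root, encoding="unicode")


# Stand-in for a spreadsheet export; the forms come from the Emberá example.
sample = "form,gloss\nconaa,old man\nome,with\n"
xml_text = wordlist_csv_to_xml(sample)
print(xml_text)
```

The XML makes the structure of each entry explicit instead of leaving it implicit in column positions, which is what allows a presentation format such as HTML to be generated from the archival copy later.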

44
New LINGUIST List projects
  • Multi-Tree
  • Language family trees
  • Approved by experts in the field
  • LL-MAP
  • Using GIS to show where languages are spoken

45
Daughter Network: Added Value
  • Enriches Rosetta resources
  • E-MELD best practice recommendations
  • Better linguistic metadata (OLAC and Ethnologue)
  • Functionally extends the NSDL through:
  • Making resources discoverable through OLAC
  • Collaborations with organizations to build
    networked content (LINGUIST List people search)
  • Queries on Rosetta content by external tools that
    are exposed on other sites (ODIN, also LangGator)

46
URLs
  • Electronic Metastructure for Endangered Languages
    Data (E-MELD): http://www.emeld.org (School of
    Best Practice, FIELD Tool).
  • Endangered Language Query Rooms:
    http://rosettaproject.org:8080/emeldbase/.
  • The Ethnologue: http://www.ethnologue.com.
  • General Ontology for Linguistic Description
    (GOLD): http://www.linguistics-ontology.org or
    http://emeld.org/school/workroom/terminology/
  • LINGUIST List: http://www.linguistlist.org
  • National Science Digital Library (NSDL):
    http://nsdl.org
  • ODIN: www.csufresno.edu/odin
  • Open Language Archives Community (OLAC):
    http://www.language-archives.org.
  • The Rosetta Project: http://www.rosettaproject.org/live.
    A preview of the new Web site is available
    at http://preview.rosettaproject.org.