A Field Linguist - PowerPoint PPT Presentation

1
A Field Linguist's Guide to Making Long-Lasting
Texts and Databases
  • LSA Organized Session
  • January 4, 2007
  • Anaheim, California

2
  • Organized by: Jeff Good and Heidi Johnson,
    Open Language Archives Community (OLAC)
    Outreach Committee
  • Moderator: Laura Welcher
  • Speakers: Debbie Anderson, Michael Appleby,
    Jessica Boynton, Naomi Fox, Connie Dickinson

3
Presentations from this session will be posted
at http://www.language-archives.org/news.html#olac07

4
Best Practice in Your Back Pocket: Getting the
Most Out of the Tools You Have
  • Laura Welcher
  • The Rosetta Project / Long Now Foundation

5
A great way to freak out a linguist
  • To be in compliance with best practice
    recommendations (ahem), your interlinear glossed
    text needs to be in XML format with
    morphosyntactic tags that reference the GOLD
    ontology.

6
Reality Check
  • There's a difference between ideal best-practice
    resources (which are still somewhat of a moving
    target) and a good, sufficient approximation.
  • Some common practices are far from ideal or
    sufficient (like saving the dictionary you worked
    5 years on as a Microsoft Word document file).
  • We can easily modify these practices to produce
    archivable resources that will last.
  • And this can be done using tools that you already
    have, and knowledge that is easy to acquire.
  • Hence the title: "Best practice in your back
    pocket: getting the most out of the tools that
    you have."

7
Best Practice
  • E-MELD project (Electronic Metastructure for
    Endangered Languages Data)
  • Goals:
  • Help preserve endangered languages data
  • Develop infrastructure for electronic archives
  • Defining best practice:
  • E-MELD summer workshops, http://www.emeld.org
  • Promoting best practice:
  • School of Best Practice at
    http://www.emeld.org/school/index.html

8
Good, Better, Best Practice
  • The information presented here comes from
    presentations of the E-MELD team, particularly
    the following:
  • Simons and Dry (2006), "Good, Better, and Best
    Practice: The Experience of the E-MELD Project",
    http://www.linguistlist.org/emeld/documents/Bielefeld-Dry-Simons.pdf

9
The first consideration: working, presentation
and archival formats
  • The process of creating digital language
    resources usually involves creating files in
    different formats:
  • Working format
  • Presentation format
  • Archival format

10
Working Format
  • The saved format of whatever program you are
    working in
  • .doc (MS Word)
  • .xls (Excel)
  • .fp7 (FileMaker Pro)
  • This format is what you use for your own
    convenience and productivity
  • Typically this format is proprietary
  • Less typically, people may work in programs whose
    native format is not proprietary, automatically
    saving in .txt (plain text), .xml or .html (types
    of formatted plain text)
  • A proprietary working file format is not the only
    format you should have!

11
Archival Format
  • A very important format -- this format helps
    ensure that your resource will last and be usable
    well into the future
  • An archival format has LOTS of good qualities
    (Simons, 2004):
  • Lossless
  • Open Standard
  • Transparent
  • Supported by multiple vendors

12
Archival Format: Lossless
  • Avoid compressed formats that lose content
  • A good rule of thumb is to use uncompressed
    formats:
  • Text: .txt, .html, .xml
  • Images: .tiff, .bmp
  • Audio: .wav (Windows), .aiff (Apple), .au (Sun,
    Java, Unix), but make sure it is PCM
    (uncompressed)
  • Video: .avi (some codecs), .rtv
  • Most compressed formats lose content, but some
    are lossless (.zip for text, black and white .gif
    for images, Apple Lossless (ALAC) for audio, the
    JPEG 2000 video codec) -- use with caution!
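
The "make sure it is PCM" advice above can be checked mechanically. The following is a minimal sketch in Python, assuming only the standard-library `wave` module; the function name `is_uncompressed_pcm` and the in-memory test file are illustrative assumptions, not part of the original presentation.

```python
import io
import wave


def is_uncompressed_pcm(wav_file):
    """Return True if a WAV file (path or file object) holds uncompressed PCM audio."""
    try:
        with wave.open(wav_file, "rb") as wav:
            # The standard-library reader reports "NONE" for uncompressed data.
            return wav.getcomptype() == "NONE"
    except (wave.Error, EOFError):
        # Raised for data the reader cannot parse, including encodings
        # it does not support.
        return False


# Build one second of silent 16-bit mono PCM audio in memory for the demo.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)      # mono
    out.setsampwidth(2)      # 2 bytes per sample = 16-bit
    out.setframerate(8000)   # 8 kHz
    out.writeframes(b"\x00\x00" * 8000)
buffer.seek(0)

print(is_uncompressed_pcm(buffer))
```

In practice you would pass the path of a field recording instead of the in-memory buffer. A `False` result only means the file deserves a closer look before you treat it as an archival copy.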

13
Archival Format: Open Standard
  • Avoid proprietary formats like .doc, .xls, .fp7
  • The company that produces the software may stop
    supporting the format, rendering your file
    unreadable
  • For your archival format, choose a file format
    that is an open standard, like .xml, .html, .pdf
    or .rtf
  • Open standard means that the specification of
    the format is publicly available, and anyone
    can implement it.

14
Archival Format: Transparent
  • Use a file format that is easy to interpret
  • Example: text files (.txt)
  • Have common characters like letters, numbers,
    punctuation
  • Virtually no formatting (tabs, returns)
  • Because of the simplicity of this file type, many
    programs can read it and make use of the data
  • Other transparent formats: .wav and .aiff can be
    read by any audio program
  • Not transparent: .zip, .mp3 (require a special
    algorithm for interpretation)

15
Archival Format: Supported by Multiple Vendors
  • Prefer formats that are widely supported
  • If more vendors support it, it is less likely to
    become obsolete
  • This is another reason to prefer an open standard
    format to a proprietary one

16
Presentation Format
  • Presentation formats are those you choose for the
    convenience and ease of accessibility and display
  • It is fine for presentation formats to be
    compressed, so long as you make a lossless
    archival copy as well
  • Examples of presentation formats include .pdf
    files, .mp3 files, .jpg images, MPEG-2 video

17
So far, so good?
  • As a responsible linguist creating digital
    language documentation that will last well into
    the future, you:
  • Know the difference between a working,
    presentation, and archival file format
  • Know what makes a good archival format (LOTS)
  • Maintain an archival format of your data
  • Anything beyond this? Yes, a bit more.

18
Best Practice Digital Resources are:
  • Preservable: in formats that are not vulnerable
    to decay or obsolescence (see LOTS)
  • Intelligible: so that content is easily
    understood by future scholars
  • Accessible: so that resources are easily
    discovered and accessed
  • They are also interoperable, but this is mostly a
    concern of archives and services
  • (Simons and Dry, 2006)

19
Create Preservable Resources
  • Linguists are responsible for making preservable
    resources
  • That is, creating archival formats that follow
    the principles of LOTS

20
Create Intelligible Resources
  • In order to create resources that are
    intelligible to others, you must document your
    practices!
  • Documentation includes:
  • Your markup practices
  • The encoding you use
  • Metadata about your resources
  • This information should be kept in a file or
    files in an archival format, and archived along
    with your resources.

21
Presentational Markup
  • Many people use presentational markup,
    particularly in the working formats like
    Microsoft Word.
  • Presentational markup means that aspects of the
    presentation (like bold, italics, indenting) are
    themselves meaningful
  • For example

22
Example of Presentational Markup
  • AS_5.2.1978_audio Alice Spear, Potawatomi,
    Crane Boy, May 2, 1978, Mayetta, Kansas.
  • <bold>AS_5.2.1978_audio</bold>
  • <plain.text>Alice Spear</plain.text>
  • <italics>Crane Boy</italics>
  • <plain.text>May 2, 1978</plain.text>
  • <plain.text>Mayetta, Kansas</plain.text>

23
Presentational Markup
  • Presentational markup is not recommended. BUT if
    you do use it, describe all meaningful aspects
    (e.g. "bold means head word", "italics is used
    for the part of speech")
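
Once the meaning of each presentational tag has been written down, it can also be applied mechanically. The sketch below, in Python with the standard-library `xml.etree.ElementTree` module, renames presentational tags to descriptive ones. The mapping, the `entry` wrapper element, the `part.of.speech` tag name, and the part-of-speech value "n" are hypothetical examples, not conventions from the presentation.

```python
import xml.etree.ElementTree as ET

# Hypothetical documentation of what each presentational tag means.
PRESENTATION_TO_MEANING = {
    "bold": "headword",
    "italics": "part.of.speech",
}


def to_descriptive(xml_text):
    """Rename presentational tags to the descriptive tags they stand for."""
    root = ET.fromstring(xml_text)
    for element in root.iter():
        # Tags without a documented meaning are left unchanged.
        element.tag = PRESENTATION_TO_MEANING.get(element.tag, element.tag)
    return ET.tostring(root, encoding="unicode")


entry = "<entry><bold>mnomen</bold><italics>n</italics></entry>"
print(to_descriptive(entry))
```

The conversion is only as good as the documentation behind the mapping, which is why describing every meaningful use of bold or italics matters.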

24
Descriptive Markup
  • It is better practice to use descriptive markup,
    like XML
  • XML is basically text with tags that provide
    information about what is between the tags
  • <headword>mnomen</headword>
  • <gloss>rice</gloss>
  • Tags can be also used to group information, much
    like you would group information in a database
    record, and have a whole set of information in a
    database

25
Example of Descriptive Markup
  • AS_5.2.1978_audio Alice Spear, Potawatomi,
    Crane Boy, May 2, 1978, Mayetta, Kansas.
  • <ID>AS_5.2.1978_audio</ID>
  • <speaker>Alice Spear</speaker>
  • <description>Crane Boy</description>
  • <recording.date>May 2, 1978</recording.date>
  • <location>Mayetta, Kansas</location>

26
Descriptive Markup: XML
  • <?xml version="1.0" encoding="UTF-8"?>
  • <?xml-stylesheet type="text/xsl" href="archive.xsl"?>
  • <my.archive>
  • <record>
  • <identifier>AS_5.2.1978_audio</identifier>
  • <subject.language code="x-sil-POT"/>
  • <language code="en"/>
  • <format>Analog audio recording on Cassette
    tape</format>
  • <contributor refine="speaker">Alice
    Spear</contributor>
  • <contributor refine="researcher">Laura
    Buszard-Welcher</contributor>
  • <description>Crane Boy narrative told in
    Potawatomi and in English</description>
  • <date code="1978-05-02"/>
  • <coverage>Mayetta, Kansas</coverage>
  • <relation>digital audio AS_5.2.1978_audio.wav,
    interlinear text AS_5.2.1978_audio.txt</relation>
  • <type.linguistic code="primary_text"/>
  • <rights>Some restrictions: contact field
    linguist</rights>
  • </record>
  • </my.archive>
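
Because the tags say what each piece of content is, a program can pull out exactly the fields it needs. Here is a minimal sketch, assuming Python's standard-library `xml.etree.ElementTree` and an abridged copy of the record from this slide; the helper name `summarize` and the keys of the returned dictionary are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the record shown on the slide.
RECORD_XML = """<my.archive>
  <record>
    <identifier>AS_5.2.1978_audio</identifier>
    <contributor refine="speaker">Alice Spear</contributor>
    <contributor refine="researcher">Laura Buszard-Welcher</contributor>
    <date code="1978-05-02"/>
    <coverage>Mayetta, Kansas</coverage>
  </record>
</my.archive>"""


def summarize(record):
    """Collect a few fields of one <record> element into a dictionary."""
    # Index contributors by their role, e.g. "speaker" or "researcher".
    contributors = {c.get("refine"): c.text for c in record.findall("contributor")}
    return {
        "identifier": record.findtext("identifier"),
        "speaker": contributors.get("speaker"),
        "date": record.find("date").get("code"),
        "location": record.findtext("coverage"),
    }


archive = ET.fromstring(RECORD_XML)
summaries = [summarize(record) for record in archive.findall("record")]
print(summaries[0])
```

The same lookup on a presentational file would have to guess which bold or italic run holds the speaker, which is the practical case for descriptive markup.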

27
Descriptive Markup: XML
  • It is a good practice to use standard tags where
    they are available.
  • OLAC has a set of tags that you would use for
    metadata to describe your resources
  • GOLD has a set of tags used for morphosyntactic
    description
  • Otherwise, be sure to document the meaning of the
    tags that you use
  • Although some people feel comfortable working in
    XML, many don't like to use it as a working
    format.
  • Fortunately many common programs now allow you to
    save your work as an XML file.

28
The Advantage of XML
  • Besides creating an archival data file, XML has
    other advantages
  • By creating stylesheets, you can give the same
    XML file different presentation forms
  • For example
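
The idea of one data file with several presentation forms can be sketched without a real stylesheet. The Python example below is a stand-in for XSLT, not XSLT itself: two small functions play the role of two stylesheets over the same XML entry. The `entry` wrapper element and both function names are assumptions made for illustration.

```python
import html
import xml.etree.ElementTree as ET

# One descriptive-markup entry, using the tags from the earlier slide.
ENTRY_XML = "<entry><headword>mnomen</headword><gloss>rice</gloss></entry>"


def as_plain_text(entry):
    """Render the entry as a one-line plain-text gloss."""
    return f"{entry.findtext('headword')} '{entry.findtext('gloss')}'"


def as_html(entry):
    """Render the same entry as an HTML paragraph for a web page."""
    headword = html.escape(entry.findtext("headword"))
    gloss = html.escape(entry.findtext("gloss"))
    return f"<p><b>{headword}</b> <i>{gloss}</i></p>"


entry = ET.fromstring(ENTRY_XML)
print(as_plain_text(entry))
print(as_html(entry))
```

The archival XML never changes; only the rendering does, so a new presentation form costs a new function (or stylesheet) rather than a new copy of the data.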

29
Delimited Text
  • Another kind of markup that you might find
    yourself using is delimited text.
  • Spreadsheet and database programs allow you to
    export your data as text, delimited by a
    particular character
  • Comma separated text (.csv)
  • Tab separated text (.tab)
  • To help with intelligibility, create an initial
    record where the name of each field / cell is
    given inside the record itself. That way, the
    names of your fields / cells will be exported and
    saved along with the rest of your data.
  • Text data exported this way is good practice,
    particularly if you are careful about documenting
    your practices inside your fields / cells (for
    more on this see following slides).
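
As a rough illustration of the header-record advice above, the following Python sketch writes tab-delimited text whose first record names the fields, then reads it back. It assumes the standard-library `csv` module; the field names and the sample entry (including the part of speech "n") are hypothetical.

```python
import csv
import io

# Hypothetical field names and a single sample lexicon entry.
FIELDS = ["headword", "part_of_speech", "gloss"]
rows = [{"headword": "mnomen", "part_of_speech": "n", "gloss": "rice"}]


def export_tab_delimited(rows, fields):
    """Return tab-delimited text whose first record holds the field names."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer, fieldnames=fields, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()   # the initial record naming each field
    writer.writerows(rows)
    return buffer.getvalue()


text = export_tab_delimited(rows, FIELDS)
print(text)

# Any later reader can recover the field names from the file itself.
recovered = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
print(recovered[0]["gloss"])
```

Because the names travel inside the file, the export stays interpretable even if the spreadsheet or database that produced it is no longer available.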

30
Other aspects of markup
  • Document any special conventions that you use
  • What do your morpheme boundary markers mean ( /
    - / any others?)
  • What glossing conventions do you use? Give the
    full names of abbreviations (e.g. POS means
    possessive, PV means preverb).
  • Describe grammatical terms that you use (like
    aorist, or preverb) and what they mean for the
    language you are describing. (You don't have to
    write a grammar -- a sentence or two describing
    each term is sufficient.)
  • Also note if you are using standard terminology
    sets, like Leipzig Glossing Rules, or GOLD
    terminology

31
Document the Encoding
  • Identify the character set you are using
  • Document any non-standard characters
  • Best practice is to use Unicode
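
One way to start documenting your characters is to list every non-ASCII character in a text along with its Unicode code point and name. This is a minimal sketch using Python's standard-library `unicodedata` module; the function name and the sample string are invented for illustration.

```python
import unicodedata


def non_ascii_inventory(text):
    """Map each non-ASCII character in text to its code point and Unicode name."""
    inventory = {}
    for char in text:
        if ord(char) > 127 and char not in inventory:
            code_point = f"U+{ord(char):04X}"
            # Characters without an official name get the placeholder "UNNAMED".
            inventory[char] = (code_point, unicodedata.name(char, "UNNAMED"))
    return inventory


sample = "ŋa ɛ"  # hypothetical transcription fragment
for char, (code_point, name) in non_ascii_inventory(sample).items():
    print(char, code_point, name)
```

A list like this, saved as plain text next to the data, tells a future reader which characters to expect. Anything that shows up as "UNNAMED" is a candidate for the "non-standard characters" note.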

32
Create Metadata
  • You will need to create some additional
    information about your resources
  • Metadata usually includes information about:
  • The setting (time, date, participants, location)
  • The language (ISO 639-3)
  • Linguistic type (text, grammar, lexicon) and
    subject
  • Access restrictions
  • There are metadata standards for language
    resources: OLAC and IMDI

33
OLAC Metadata Elements
  • Contributor (content)
  • Coverage (e.g. location)
  • Creator (content)
  • Date
  • Description
  • Format
  • Encoding Format (character set)
  • Markup Format (XML schema)
  • Identifier (file name, URL)
  • Language (audience)
  • Publisher
  • Relation (to another resource)
  • Rights (controlled vocab.)
  • Source (say, for re-elicited data)
  • Subject (controlled vocab.)
  • Subject Language (ISO 639-3 code)
  • Title
  • Linguistic Type
  • http://www.language-archives.org/OLAC/olacms.html

34
Create Metadata
  • Keep a metadata record for each of your
    resources.
  • The records should themselves be in an archival
    format. This could be
  • A text file (good)
  • Delimited text, exported from a simple database
    file (good)
  • An XML file (better)
  • An OLAC or IMDI formatted XML file (best)
  • Your archivist may have a preference about
    metadata formats, and prefer something relatively
    simple (like a paper form) if the archive will be
    manually entering the metadata.
  • Archive this file along with the rest of your
    resources.
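
The "An XML file (better)" option can be produced from whatever working notes you keep. The sketch below builds a simple XML metadata record from a Python dictionary using the standard-library `xml.etree.ElementTree`. The element names are borrowed loosely from the record shown earlier; this is not a valid OLAC or IMDI record, and `build_record` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Metadata for one resource, with values taken from the example record.
metadata = {
    "identifier": "AS_5.2.1978_audio",
    "description": "Crane Boy narrative told in Potawatomi and in English",
    "date": "1978-05-02",
    "coverage": "Mayetta, Kansas",
    "rights": "Some restrictions: contact field linguist",
}


def build_record(fields):
    """Build a <record> element with one child element per metadata field."""
    record = ET.Element("record")
    for name, value in fields.items():
        ET.SubElement(record, name).text = value
    return record


record = build_record(metadata)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

To save the record alongside the resource it describes, `ET.ElementTree(record).write(path, encoding="utf-8", xml_declaration=True)` writes it as a UTF-8 XML file. Ask your archive which schema it expects before settling on element names.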

35
Make your resources accessible
  • Archive, archive, archive! (Not just on your
    own computer or your departmental server.
    Archives are committed to the long-term
    preservation and availability of your resources.)
  • Before you leave to do fieldwork, or when you are
    writing your grant, establish contact with the
    archive where you intend to deposit your
    resources
  • Archivists will:
  • give you guidelines for creating archival files
  • help you select the best metadata set
  • give you information about setting access levels
  • When you return, the first thing to do is send
    your files, along with the metadata and markup
    descriptions to the archive
  • Most archives will then give you an ID number for
    your resources that you can then cite in your
    publications

36
A Community Responsibility
  • Best practice involves what individual field
    linguists do, but also how we collectively use
    and care for these resources
  • This broader community involves:
  • Other researchers like yourself who create
    resources
  • A growing set of interconnected digital language
    archives that care for, protect, and disseminate
    your resources
  • People who develop tools and services to make
    your resources locatable, searchable, and
    reusable
  • Others: linguistics organizations, organizations
    like OLAC and DELAMAN, and funding agencies that
    promote the work of this community

37
Unicode
  • Debbie Anderson: A Field Linguist's Guide to
    Unicode
  • Michael Appleby: How to Use Unicode on Your
    Computer

38
Field Case Studies: Texts and Databases
  • Jessica Boynton:
    Transcription, Time-Alignment and Annotation
  • Naomi Fox:
    Using FileMaker Pro to Produce Archivable
    Language Documentation
  • Connie Dickinson:
    The Tsafiki Text Factory

39
Panel Session
  • Talks are 25 minutes, consecutive.
  • Please remember or write down your questions!
  • We will field them in a panel session after the
    talks.