Title: A Field Linguist's Guide to Making Long-Lasting Texts and Databases
1. A Field Linguist's Guide to Making Long-Lasting Texts and Databases
- LSA Organized Session
- January 4, 2007
- Anaheim, California
2.
- Organized by
  - Jeff Good and Heidi Johnson
  - Open Language Archives Community (OLAC)
  - Outreach Committee
- Moderator
  - Laura Welcher
- Speakers
  - Debbie Anderson, Michael Appleby, Jessica Boynton, Naomi Fox, Connie Dickinson
3. Presentations from this session will be posted at http://www.language-archives.org/news.html#olac07
4. Best Practice in Your Back Pocket: Getting the Most Out of the Tools You Have
- Laura Welcher
- The Rosetta Project / Long Now Foundation
5. A great way to freak out a linguist
- To be in compliance with best practice
recommendations (ahem), your interlinear glossed
text needs to be in XML format with
morphosyntactic tags that reference the GOLD
ontology.
6. Reality Check
- There's a difference between ideal best-practice resources (still somewhat of a moving target) and a good, sufficient approximation.
- Some common practices are far from ideal or even sufficient (like saving the dictionary you worked on for 5 years as a Microsoft Word document).
- We can easily modify these practices to produce archivable resources that will last.
- This can be done using tools that you already have and knowledge that is easy to acquire.
- Hence the title: "Best practice in your back pocket: getting the most out of the tools that you have."
7. Best Practice
- E-MELD project (Electronic Metastructure for Endangered Languages Data)
- Goals
  - Help preserve endangered languages data
  - Develop infrastructure for electronic archives
- Defining best practice
  - E-MELD summer workshops: http://www.emeld.org
- Promoting best practice
  - School of Best Practice at http://www.emeld.org/school/index.html
8. Good, Better, Best Practice
- The information presented here comes from presentations of the E-MELD team, particularly the following:
  - Simons and Dry (2006), "Good, Better, and Best Practice: The Experience of the E-MELD Project," http://www.linguistlist.org/emeld/documents/Bielefeld-Dry-Simons.pdf
9. The first consideration: working, presentation, and archival formats
- The process of creating digital language resources usually involves creating files in different formats:
  - Working format
  - Presentation format
  - Archival format
10. Working Format
- The saved format of whatever program you are working in
  - .doc (MS Word)
  - .xls (Excel)
  - .fp7 (FileMaker Pro)
- This format is what you use for your own convenience and productivity
  - Typically this format is proprietary
  - Less typically, people may work in programs whose native format is not proprietary, automatically saving in .txt (plain text), .xml or .html (types of formatted plain text)
- A proprietary working file format is not the only format you should have!
11. Archival Format
- A very important format: it helps ensure that your resource will last and be usable well into the future
- An archival format has LOTS of good qualities (Simons, 2004):
  - Lossless
  - Open Standard
  - Transparent
  - Supported by multiple vendors
12. Archival Format: Lossless
- Avoid compressed formats that lose content
- A good rule of thumb is to use uncompressed formats
  - Text: .txt, .html, .xml
  - Images: .tiff, .bmp
  - Audio: .wav (Windows), .aiff (Apple), .au (Sun, Java, Unix), but make sure it is PCM (uncompressed)
  - Video: .avi (some codecs), .rtv
- Most compressed formats lose content, but some are lossless (.zip for text, black and white .gif for images, Apple Lossless (ALAC, typically stored in .m4a files) for audio, the JPEG 2000 video codec); use with caution!
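One way to see what "lossless" means in practice is to compress some data, decompress it again, and check that every byte comes back unchanged. The following is a minimal Python sketch, assuming the standard-library zlib module (the same family of compression used inside .zip files); the function name and sample data are illustrative only and were not part of the session.

```python
import zlib


def is_lossless_round_trip(original: bytes) -> bool:
    """Compress and decompress the data, then compare with the original bytes."""
    compressed = zlib.compress(original, 9)  # 9 = strongest compression level
    restored = zlib.decompress(compressed)
    return restored == original


# Illustrative sample: a repeated tab-separated headword/gloss line.
sample = ("mnomen\trice\n" * 200).encode("utf-8")
compressed_size = len(zlib.compress(sample, 9))

print("original bytes:  ", len(sample))
print("compressed bytes:", compressed_size)
print("lossless:        ", is_lossless_round_trip(sample))
```

A lossy format such as .mp3 or .jpg could not pass this kind of check, because the decompressed data is only an approximation of the original.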
13. Archival Format: Open
- Avoid proprietary formats like .doc, .xls, .fp7
  - The company that produces the software may stop supporting the format, rendering your file unreadable
- For your archival format, choose a file format that is an open standard, like .xml, .html, .pdf or .rtf
  - Open standard means that the specification of the format is publicly available, and anyone can implement it.
14. Archival Format: Transparent
- Use a file format that is easy to interpret
- Example: text files (.txt)
  - Have common characters like letters, numbers, punctuation
  - Virtually no formatting (tabs, returns)
  - Because of the simplicity of this file type, many programs can read it and make use of the data
- Other transparent formats: .wav and .aiff can be read by any audio program
- Not transparent: .zip, .mp3 (require a special algorithm for interpretation)
15. Archival Format: Supported
- Prefer formats that are widely supported
  - If more vendors support it, it is less likely to become obsolete
- This is another reason to prefer an open standard format to a proprietary one
16. Presentation Format
- Presentation formats are those you choose for convenience and ease of access and display
- It is fine for presentation formats to be compressed, so long as you make a lossless archival copy as well
- Examples of presentation formats include .pdf files, .mp3 files, .jpg images, MPEG-2 video
17. So far, so good?
- As a responsible linguist creating digital language documentation that will last well into the future, you:
  - Know the difference between a working, presentation, and archival file format
  - Know what makes a good archival format (LOTS)
  - Maintain an archival format of your data
- Anything beyond this? Yes, a bit more
18. Best Practice: Digital Resources are
- Preservable: in formats that are not vulnerable to decay or obsolescence (see LOTS)
- Intelligible: so that content is easily understood by future scholars
- Accessible: so that resources are easily discovered and accessed
- They are also interoperable, but this is mostly a concern of archives and services
- (Simons and Dry, 2006)
19. Create Preservable Resources
- Linguists are responsible for making preservable resources
- That is, creating archival formats that follow the principles of LOTS
20. Create Intelligible Resources
- In order to create resources that are intelligible to others, you must document your practices!
- Documentation includes
  - Your markup practices
  - The encoding you use
  - Metadata about your resources
- This information should be kept in a file or files in an archival format, and archived along with your resources.
21. Presentational Markup
- Many people use presentational markup, particularly in working formats like Microsoft Word.
- Presentational markup means that aspects of the presentation (like bold, italics, indenting) are themselves meaningful
- For example:
22. Example of Presentational Markup
- AS_5.2.1978_audio Alice Spear, Potawatomi, Crane Boy, May 2, 1978, Mayetta, Kansas.
  - <bold>AS_5.2.1978_audio</bold>
  - <plain.text>Alice Spear</plain.text>
  - <italics>Crane Boy</italics>
  - <plain.text>May 2, 1978</plain.text>
  - <plain.text>Mayetta, Kansas</plain.text>
23. Presentational Markup
- Presentational markup is not recommended. BUT if you do use it, describe all meaningful aspects (e.g. "bold means head word", "italics is used for the part of speech")
24. Descriptive Markup
- It is better practice to use descriptive markup, like XML
- XML is basically text with tags that provide information about what is between the tags
  - <headword>mnomen</headword>
  - <gloss>rice</gloss>
- Tags can also be used to group information, much as you would group information in a database record and keep a whole set of records in a database
25. Example of Descriptive Markup
- AS_5.2.1978_audio Alice Spear, Potawatomi, Crane Boy, May 2, 1978, Mayetta, Kansas.
  - <ID>AS_5.2.1978_audio</ID>
  - <speaker>Alice Spear</speaker>
  - <description>Crane Boy</description>
  - <recording.date>May 2, 1978</recording.date>
  - <location>Mayetta, Kansas</location>
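Descriptive tags like these do not have to be typed by hand. The following is a minimal Python sketch, assuming the standard-library xml.etree.ElementTree module, that builds the same five tagged fields from a list of (tag, value) pairs. The make_record function and the enclosing record tag are illustrative choices for this sketch, not tools or conventions prescribed by the session.

```python
import xml.etree.ElementTree as ET


def make_record(fields):
    """Wrap (tag, text) pairs in a single <record> element (wrapper tag is an assumption)."""
    record = ET.Element("record")
    for tag, text in fields:
        ET.SubElement(record, tag).text = text
    return record


# The same information as in the example above, as tag/value pairs.
fields = [
    ("ID", "AS_5.2.1978_audio"),
    ("speaker", "Alice Spear"),
    ("description", "Crane Boy"),
    ("recording.date", "May 2, 1978"),
    ("location", "Mayetta, Kansas"),
]

record = make_record(fields)
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)
```

Because each value sits inside a named tag, a later program (or a later scholar) can ask for the speaker or the location directly instead of guessing from bold or italics.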
26. Descriptive Markup: XML
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="archive.xsl"?>
<my.archive>
  <record>
    <identifier>AS_5.2.1978_audio</identifier>
    <subject.language code="x-sil-POT"/><language code="en"/>
    <format>Analog audio recording on Cassette tape</format>
    <contributor refine="speaker">Alice Spear</contributor>
    <contributor refine="researcher">Laura Buszard-Welcher</contributor>
    <description>Crane Boy narrative told in Potawatomi and in English</description>
    <date code="1978-05-02"/>
    <coverage>Mayetta, Kansas</coverage>
    <relation>digital audio AS_5.2.1978_audio.wav, interlinear text AS_5.2.1978_audio.txt</relation>
    <type.linguistic code="primary_text"/>
    <rights>Some restrictions: contact field linguist</rights>
  </record>
</my.archive>
27. Descriptive Markup: XML
- It is a good practice to use standard tags where they are available.
  - OLAC has a set of tags that you would use for metadata to describe your resources
  - GOLD has a set of tags used for morphosyntactic description
- Otherwise, be sure to document the meaning of the tags that you use
- Although some people feel comfortable working in XML, many don't like to use it as a working format.
- Fortunately, many common programs now allow you to save your work as an XML file.
28. The Advantage of XML
- Besides creating an archival data file, XML has other advantages
- By creating stylesheets, you can give the same XML file different presentation forms
- For example:
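As a rough illustration of this idea, the sketch below renders one XML record in two presentation forms: a one-line plain-text summary and an HTML table. It uses two small Python functions as a stand-in for XSLT stylesheets, which is an assumption of this sketch and not the approach shown in the session; the function names and the shortened record are illustrative.

```python
import xml.etree.ElementTree as ET
from html import escape

# A shortened version of the metadata record, used only for illustration.
RECORD_XML = """<record>
  <identifier>AS_5.2.1978_audio</identifier>
  <contributor refine="speaker">Alice Spear</contributor>
  <description>Crane Boy narrative told in Potawatomi and in English</description>
  <coverage>Mayetta, Kansas</coverage>
</record>"""


def as_plain_text(record):
    """Presentation form 1: a single summary line."""
    tags = ("identifier", "contributor", "description", "coverage")
    return " | ".join(record.findtext(tag) for tag in tags)


def as_html(record):
    """Presentation form 2: an HTML table with one row per field."""
    rows = "".join(
        f"<tr><th>{escape(child.tag)}</th><td>{escape(child.text or '')}</td></tr>"
        for child in record
    )
    return f"<table>{rows}</table>"


record = ET.fromstring(RECORD_XML)
plain_view = as_plain_text(record)
html_view = as_html(record)
print(plain_view)
print(html_view)
```

The archival XML stays untouched; only the presentation changes, which is the same division of labor a stylesheet provides.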
29. Delimited Text
- Another kind of markup that you might find yourself using is delimited text.
- Spreadsheet and database programs allow you to export your data as text, delimited by a particular character
  - Comma separated text (.csv)
  - Tab separated text (.tab)
- To help with intelligibility, create an initial record where the name of each field / cell is given inside the record itself. That way, the names of your fields / cells will be exported and saved along with the rest of your data.
- Exporting text data this way is good practice, particularly if you are careful about documenting your practices inside your fields / cells (for more on this see following slides).
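The "initial record with field names" advice can be sketched in a few lines of Python, assuming the standard-library csv module. The field names, the export_delimited function, and the second entry (a placeholder included only to show how a comma inside a field is quoted) are illustrative assumptions, not data from the session.

```python
import csv
import io

FIELD_NAMES = ["headword", "gloss"]

entries = [
    {"headword": "mnomen", "gloss": "rice"},
    # Placeholder entry, not real language data: shows quoting of an embedded comma.
    {"headword": "placeholder-headword", "gloss": "placeholder gloss, with a comma"},
]


def export_delimited(rows, field_names, delimiter):
    """Return delimited text whose first record holds the field names."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=field_names, delimiter=delimiter, lineterminator="\n"
    )
    writer.writeheader()  # the initial record naming each field
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = export_delimited(entries, FIELD_NAMES, ",")
tab_text = export_delimited(entries, FIELD_NAMES, "\t")
print(csv_text)
print(tab_text)
```

Anyone who opens either export later sees the field names on the first line, so the meaning of each column travels with the data.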
30. Other aspects of markup
- Document any special conventions that you use
  - What do your morpheme boundary markers mean ( / - / any others?)
  - What glossing conventions do you use? Give the full names of abbreviations (e.g. POS means possessive, PV means preverb).
  - Describe grammatical terms that you use (like aorist, or preverb) and what they mean for the language you are describing. You don't have to write a grammar; a sentence or two describing each term is sufficient.
  - Also note if you are using standard terminology sets, like the Leipzig Glossing Rules or GOLD terminology
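A simple check can catch abbreviations that were used in glosses but never documented. The Python sketch below assumes that gloss abbreviations are written as runs of two or more capital letters; the gloss line and the AOR abbreviation are hypothetical placeholders, and the function name is illustrative. Abbreviations containing digits (such as person and number labels) would need a broader pattern.

```python
import re

# Abbreviations documented so far, following the examples given above.
DOCUMENTED_ABBREVIATIONS = {
    "POS": "possessive",
    "PV": "preverb",
}


def undocumented_abbreviations(gloss_line, documented):
    """List capitalized abbreviations in a gloss line that lack a documented meaning."""
    # Assumption: an abbreviation is any run of two or more capital letters.
    found = set(re.findall(r"\b[A-Z]{2,}\b", gloss_line))
    return sorted(found - set(documented))


# Hypothetical gloss line, invented purely for illustration.
gloss_line = "PV-go-AOR POS-house"
missing = undocumented_abbreviations(gloss_line, DOCUMENTED_ABBREVIATIONS)
print(missing)
```

Running a check like this before archiving gives a short to-do list of terms that still need a sentence or two of explanation.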
31. Document the Encoding
- Identify the character set you are using
- Document any non-standard characters
- Best practice is to use Unicode
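To help document an encoding, it can be useful to confirm that a file really is valid UTF-8 and to list the special characters it contains. The following is a minimal Python sketch, assuming the standard-library unicodedata module; the function names are illustrative, and the sample string is an arbitrary set of phonetic symbols rather than data from any particular language.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes can be decoded as UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def non_ascii_inventory(text: str):
    """List each distinct non-ASCII character with its code point and Unicode name."""
    chars = sorted({ch for ch in text if ord(ch) > 127})
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED")) for ch in chars]


# Arbitrary phonetic symbols (schwa, eng, glottal stop), written as escapes.
sample = "\u0259 \u014b \u0294"
inventory = non_ascii_inventory(sample)

print(is_valid_utf8(sample.encode("utf-8")))
for code_point, name in inventory:
    print(code_point, name)
```

An inventory like this can be pasted straight into the documentation file that is archived alongside the resource.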
32. Create Metadata
- You will need to create some additional information about your resources
- Metadata usually includes information about
  - The setting (time, date, participants, location)
  - The language (ISO 639-3)
  - Linguistic type (text, grammar, lexicon) and subject
  - Access restrictions
- There are metadata standards for language resources: OLAC and IMDI
33. OLAC Metadata Elements
- Contributor (content)
- Coverage (e.g. location)
- Creator (content)
- Date
- Description
- Format
- Encoding Format (character set)
- Markup Format (XML schema)
- Identifier (file name, URL)
- Language (audience)
- Publisher
- Relation (to another resource)
- Rights (controlled vocab.)
- Source (say, for re-elicited data)
- Subject (controlled vocab.)
- Subject Language (ISO 639-3 code)
- Title
- Linguistic Type
- http://www.language-archives.org/OLAC/olacms.html
34. Create Metadata
- Keep a metadata record for each of your resources.
- The records should themselves be in an archival format. This could be
  - A text file (good)
  - Delimited text, exported from a simple database file (good)
  - An XML file (better)
  - An OLAC or IMDI formatted XML file (best)
- Your archivist may have a preference about metadata formats, and prefer something relatively simple (like a paper form) if the archive will be manually entering the metadata.
- Archive this file along with the rest of your resources.
35. Make your resources accessible
- Archive, archive, archive! (Not just on your own or your departmental server. Archives are committed to the long-term preservation and availability of your resources.)
- Before you leave to do fieldwork, or when you are writing your grant, establish contact with the archive where you intend to deposit your resources
- Archivists will
  - give you guidelines for creating archival files
  - help you select the best metadata set
  - give you information about setting access levels
- When you return, the first thing to do is send your files, along with the metadata and markup descriptions, to the archive
- Most archives will then give you an ID number for your resources that you can then cite in your publications
36. A Community Responsibility
- Best practice involves what individual field linguists do, but also how we collectively use and care for these resources
- This broader community involves
  - Other researchers like yourself who create resources
  - A growing set of interconnected digital language archives that care for, protect, and disseminate your resources
  - People who develop tools and services to make your resources locatable, searchable, and reusable
  - Others: linguistics organizations, organizations like OLAC and DELAMAN, funding agencies who promote the work of this community
37. Unicode
- Debbie Anderson: A field linguist's guide to Unicode
- Michael Appleby: How to use Unicode on your computer
38. Field Case Studies: Texts and Databases
- Jessica Boynton
  - Transcription, Time-Alignment and Annotation
- Naomi Fox
  - Using FileMaker Pro to produce archivable language documentation
- Connie Dickinson
  - The Tsafiki Text Factory
39. Panel Session
- Talks are 25 minutes, consecutive.
- Please remember or write down your questions!
- We will field them in a panel session after the
talks.