Title: Towards a solution for the sharing of phonological data
1Towards a solution for the sharing of
phonological data
- Yvan Rose
- Memorial University of Newfoundland
- Brian MacWhinney
- Carnegie Mellon University
2Map of presentation
- Context: no specialized tool to facilitate research in phonological development
- A preliminary attempt: ChildPhon
- A more promising solution: Phon
- Current state of the Phon project
- Developments in foreseeable future
- Potential
- Publicly-available cross-linguistic database
- Proposal
3Context (until recently)
- CHILDES tools (focus on CLAN)
- Number of tools for multimedia data storage and analysis
- Mostly deals with morphological and syntactic aspects of development
- Not easily extensible
- What about phonology?
- No CHILDES tool adapted for phonology
- Data sharing and broad-based investigations are
challenging
4A first attempt
- ChildPhon (Rose 2003)
- Analytical (relational) database for child language data
- Designed within FileMaker Pro
- Main features
- Interface for double-blind transcriptions
- Automatic functions based on phonetic transcriptions
- Syllabification of transcribed forms
- Detection of common processes observed in child
language (e.g. onset cluster reduction)
5Problems with ChildPhon
- No support for Unicode fonts, hence no cross-platform compatibility (Macintosh-only)
- Not compatible with CHILDES / TalkBank, hence no data exchange functions
- Automatic parses limited, not customizable
- Multimedia capabilities are minimal (at best)
- Requires use of proprietary software and font
- Algorithms are destructive
- Statistical functions are minimal
- No web implementation
- In sum: good idea, bad implementation
6Phon: a more promising solution
- Interdisciplinary project (first of its kind between Linguistics and Computer Science at Memorial University of Newfoundland)
- Software designers and programmers: Rodrigue Byrne, Gregory Hedlund, Philip O'Brien, Yvan Rose, Harold Wareham
- Financial support
- Faculty of Arts, Memorial University
- Social Sciences and Humanities Research Council of Canada (SSHRC)
- Canada Fund for Innovation (CFI)
- National Science Foundation (NSF)
7Phon Overview
- Software underpinnings
- Programmed in Java, Unicode font encoding
- Cross-platform compatible (Mac, Windows, ...)
- XML data storage structure
- Compatible with TalkBank schema
- User management system
- Extended multimedia capabilities
- More flexible automatic algorithms
- Specialized query language
- Offers a complete solution for data sharing
8Phon usability
- Intuitive graphical user interface
- Helpful wizards (e.g. project creation, queries)
- Record navigator
- Custom selection of data fields
- General / record-by-record
- Intuitive query language
- Standard terminology
- Built-in queries (modifiable by user)
- Query memorization and saving
9Phon main functions
- User management
- Media segmentation
- Phonetic transcription
- Transcription merging (selection of final transcriptions for analysis)
- Phrase segmentation and alignment (further segmentation according to research needs)
- Syllable alignment (alignment of syllables of target and actual forms)
- Database query
10User management
- Secure login
- User tasks / privileges management
11Media segmentation
- Generally similar to CLAN
- Hit the space bar to define a speech segment
- Default segment length: user-defined
- Useful for working on small speech segments
- Segment editing
- Change numerical value
- Stretch the time segment by sliding pointer
Yvan Rose: Replace the yellow line in the segment timebar with a waveform.
12Transcription general interface
13Transcription
- Built-in IPA character map
- Symbol categories
- Access to sound segment
- Interface for double-blind transcriptions
- Tied with user management functions
Yvan Rose: Link adult transcription to an electronic IPA dictionary. Need to develop a transcription system for sounds that can't be transcribed easily: ability to assign a feature set to a dummy character; ability to use the forward slash to assign two competing symbols to a given sound (e.g. p/b would imply that voicing cannot be transcribed accurately). The alternants will be considered as one consonant by the syllabifier and query interpreter.
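The slash notation proposed in the note above could be parsed along these lines. This is a minimal sketch, not Phon's implementation: the function names `parse_segments` and `slot_matches`, and the space-separated input format, are assumptions made for illustration.

```python
def parse_segments(transcription):
    """Split a space-separated transcription into segment slots.

    Each slot is a tuple of alternants: ("p", "b") for "p/b",
    ("a",) for a plain symbol. One slot counts as one segment,
    however many competing symbols it holds.
    """
    slots = []
    for token in transcription.split():
        alternants = tuple(part for part in token.split("/") if part)
        if not alternants:
            raise ValueError(f"empty segment in {transcription!r}")
        slots.append(alternants)
    return slots


def slot_matches(slot, symbol):
    """A slot matches a symbol if any of its alternants does."""
    return symbol in slot


slots = parse_segments("p/b a t")
print(len(slots))  # 3: "p/b" is counted as a single consonant
```

Keeping the alternants together in one slot is what lets a syllabifier or query treat p/b as one consonant while still matching either symbol.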
14Transcription merging
- Comparison of competing transcriptions
- Direct access to media segment
- Selection of most accurate transcription
- Further refinement of selected transcription
Yvan Rose: People requested an algorithm that would enable a comparison of transcriptions based on specific parameters (e.g. voicing). This algorithm could build on the feature sets associated with each segment transcribed.
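A feature-based comparison of two competing transcriptions, as suggested in the note above, might look like the following sketch. The feature table is hypothetical and deliberately tiny, and the names `FEATURES`, `differing_features`, and `compare` are illustrative assumptions rather than part of Phon.

```python
# Hypothetical feature table; a real one would cover the full IPA.
FEATURES = {
    "p": {"place": "labial", "manner": "stop", "voice": False},
    "b": {"place": "labial", "manner": "stop", "voice": True},
    "t": {"place": "coronal", "manner": "stop", "voice": False},
    "d": {"place": "coronal", "manner": "stop", "voice": True},
    "s": {"place": "coronal", "manner": "fricative", "voice": False},
    "z": {"place": "coronal", "manner": "fricative", "voice": True},
}


def differing_features(seg_a, seg_b):
    """Names of the features on which two segments disagree."""
    fa, fb = FEATURES[seg_a], FEATURES[seg_b]
    return {name for name in fa if fa[name] != fb[name]}


def compare(trans_a, trans_b, ignore=()):
    """List the positions where two transcriptions disagree.

    Features named in `ignore` (e.g. "voice") are not counted as
    disagreements. This sketch only handles equal-length input.
    """
    if len(trans_a) != len(trans_b):
        raise ValueError("transcriptions must have the same length")
    disagreements = []
    for i, (a, b) in enumerate(zip(trans_a, trans_b)):
        diff = differing_features(a, b) - set(ignore)
        if diff:
            disagreements.append((i, a, b, sorted(diff)))
    return disagreements


first = ["p", "t"]
second = ["b", "s"]
print(compare(first, second))
# [(0, 'p', 'b', ['voice']), (1, 't', 's', ['manner'])]
print(compare(first, second, ignore=("voice",)))
# [(1, 't', 's', ['manner'])]
```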
15Phrase alignment
- Further segmentation of the utterances
- Useful for research on phonological domains
- A simple mouse click sets and resets the domain boundaries
Yvan Rose: Several people requested different levels of segmentation. This includes morpho-syntactic levels of segmentation, as well as various levels of the prosodic hierarchy. Also add a PLAY button in the interface of this module.
16Syllabification algorithm
- Syllabification algorithm
- Refined labeling of each syllabic position
- Each label is a valid object for query
[Figure: two syllable trees with labeled syllable, rhyme (R), onset (O), and nucleus (N) positions]
17Syllabification algorithm
- Parameters of syllabification are user-definable
[Figure labels: timing tier; syllable constituents]
Yvan Rose: The parameters will be revised thoroughly. To add (among others): word-final codas, list of exceptional clusters. Also add, to complement stress attraction, an option of ambisyllabic syllabification of intervocalic consonants in Strong-Weak syllable juncture. In addition to this, we also need a way to manually assign a syllabification to each consonant which cannot be accounted for by the automatic algorithm.
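To illustrate what a user-definable syllabification parameter could look like, here is a minimal sketch of an onset-maximizing syllabifier. It is not the Phon algorithm, whose parameters are not spelled out in these slides: the function `syllabify`, the labels "O", "N", and "C", and the inventory of legal onsets are all assumptions for illustration.

```python
def syllabify(segments, vowels, legal_onsets):
    """Label each segment "O" (onset), "N" (nucleus) or "C" (coda).

    `vowels` and `legal_onsets` play the role of user-definable
    parameters: between two nuclei, the longest consonant sequence
    found in `legal_onsets` is assigned to the following onset and
    any remaining consonants become codas of the preceding syllable.
    """
    labels = [None] * len(segments)
    nuclei = [i for i, seg in enumerate(segments) if seg in vowels]
    if not nuclei:
        raise ValueError("no nucleus found")
    for i in nuclei:
        labels[i] = "N"
    # Simplification: word-initial consonants are all onsets,
    # word-final consonants are all codas.
    for i in range(nuclei[0]):
        labels[i] = "O"
    for i in range(nuclei[-1] + 1, len(segments)):
        labels[i] = "C"
    # Intervocalic clusters: maximize the onset of the next syllable.
    for left, right in zip(nuclei, nuclei[1:]):
        split = right  # default: no onset, the whole cluster is coda
        for start in range(left + 1, right):
            if tuple(segments[start:right]) in legal_onsets:
                split = start  # first hit is the longest legal onset
                break
        for i in range(left + 1, right):
            labels[i] = "O" if i >= split else "C"
    return list(zip(segments, labels))


# Illustrative parameters only.
onsets = {("t",), ("d",), ("n",), ("r",), ("s",), ("d", "r"), ("t", "r")}
result = syllabify(list("tundras"), vowels={"u", "a"}, legal_onsets=onsets)
print(result)
# [('t', 'O'), ('u', 'N'), ('n', 'C'), ('d', 'O'), ('r', 'O'), ('a', 'N'), ('s', 'C')]
```

Removing `("d", "r")` from the onset inventory would make the same call treat only "r" as the second onset, which is the kind of effect a user-definable parameter is meant to expose.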
18Syllable alignment
- Automatic alignment of syllables
- Manual modifications
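Automatic alignment of target and actual syllables can be sketched as a standard global sequence alignment. The version below is an assumption-laden illustration, not Phon's algorithm and not Kondrak's: it scores syllable pairs with Python's `difflib.SequenceMatcher` on plain strings, and the names `similarity` and `align` and the gap penalty are illustrative.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Crude string similarity between two syllables, in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def align(target, actual, gap=-0.5):
    """Globally align two syllable lists (Needleman-Wunsch style).

    Returns (target_syllable, actual_syllable) pairs; None marks a
    syllable with no counterpart (deletion or insertion).
    """
    n, m = len(target), len(actual)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + similarity(target[i - 1], actual[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover the pairs.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and score[i][j]
                == score[i - 1][j - 1] + similarity(target[i - 1], actual[j - 1])):
            pairs.append((target[i - 1], actual[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or score[i][j] == score[i - 1][j] + gap):
            pairs.append((target[i - 1], None))
            i -= 1
        else:
            pairs.append((None, actual[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


print(align(["ba", "na", "na"], ["na", "na"]))
# [('ba', None), ('na', 'na'), ('na', 'na')]
```

Manual modification then amounts to letting the user override individual pairs in the returned list.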
19Query language
- Quick and accurate queries on large amounts of data
- Language features
- Uses terms familiar to phonologists to compose queries
- Syllable constituents: onset, nucleus, ...
- Stressed vs. unstressed syllables
- Custom predicates
- History of recent queries
- Ability to save queries
20Query language components
- Selectors (e.g. Onset(Syllable x))
- Predicates (e.g. Branching(Onset(Syllable x)))
- Boolean connectives
- Example
let corpusName "TestCorpus",
let corpus Corpus(corpusName),
let records Records(corpus)
foreach r in records
  foreach p in Phrases(r)
    foreach s in Syllables(p)
      Branching(Onset(TargetSyllable(s))) AND NOT Branching(Onset(ActualSyllable(s)))
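The logic of the example query can be mirrored in a general-purpose language. The sketch below is only an analogy in Python, not the Phon query language itself: the classes `Syllable` and `AlignedSyllable`, the nested-list layout of records and phrases, and the sample data are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Syllable:
    onset: list
    nucleus: list
    coda: list = field(default_factory=list)


@dataclass
class AlignedSyllable:
    target: Syllable  # syllable of the target (adult) form
    actual: Syllable  # aligned syllable of the child's actual form


def branching(constituent):
    """A constituent branches if it holds more than one segment."""
    return len(constituent) > 1


def onset_reductions(records):
    """Collect syllables whose target onset branches but whose
    actual onset does not (records -> phrases -> syllables)."""
    matches = []
    for record in records:
        for phrase in record:
            for syl in phrase:
                if branching(syl.target.onset) and not branching(syl.actual.onset):
                    matches.append(syl)
    return matches


# Hypothetical data: target "tun.dras" produced as "dun.das".
phrase = [
    AlignedSyllable(Syllable(["t"], ["u"], ["n"]), Syllable(["d"], ["u"], ["n"])),
    AlignedSyllable(Syllable(["d", "r"], ["a"], ["s"]), Syllable(["d"], ["a"], ["s"])),
]
records = [[phrase]]

matches = onset_reductions(records)
print(len(matches))              # 1
print(matches[0].target.onset)   # ['d', 'r']
```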
21Query tree structure
- Branching onset reduction in 2nd syllable
[Figure: query tree. A Record contains a TargetPhrase (segments T U N D R A S) and an ActualPhrase (segments D U N D A S), each parsed into syllables with onset, rhyme, nucleus, and coda nodes. The query branching(onset(pos(TargetPhrase, 2))) AND NOT branching(onset(pos(ActualPhrase, 2))) evaluates to TRUE AND NOT FALSE: MATCH.]
22Query results
- View in application
- Use to generate textual reports
- Recording session (e.g. to exemplify a given process)
- Time slice (e.g. to exemplify a stage of acquisition)
- Entire database (to exemplify a learning curve)
- Export
- As Unicode file
- As ASCII file (modulo font conversion
limitations)
23Enhancements (short term)
- Improvement of syllable alignment algorithm (building on Kondrak's 2003 algorithm)
- Import function
- ChildPhon files (including font translator; almost done!)
- CHAT files
- Incorporation of user-defined fields
- Incorporation of statistical functions
- Chart report generator
- Ability to select various chart formats
- Bar graphs (for proportions within and across sessions)
- Line graphs (for learning curves)
24Enhancements (longer term)
- Interoperability with Praat
- Export to Praat (similar to CLAN function)
- Interface to accommodate acoustic measurement data
- Web-based interface
- Data sharing at a distance
- Easy query of corpora on CHILDES database
- Further automation
- Automatic detection of pre-identified processes
Yvan Rose: Include a function to extract phonetic inventories per session/stage. Get examples of canned analyses in the literature on clinical phonology.
25Development timeline
- End of fall of 2004
- Completion of current development phase
- Release of testing (Beta) version
- Winter of 2005
- Bug fixes
- Improvement of functionality and user interface (including short-term enhancements)
- Website creation (http://www.phon.ca/)
- Completion of technical documentation
- Notes to programmers
- User guide
- Summer of 2005
- Release of Phon 1.0 as open-source freeware
26Potential
- Standard for data sharing
- Large-scale investigations
- Cross-linguistic investigations
- Enhancement to CHILDES
- Elaboration of a database fulfilling the needs of acquisitionists focussing on phonology and related issues
- Investigation of interface issues (e.g. between morpho-syntax and phonology)
27How to realize this potential
- Team of researchers specializing in
- Early acquisition (including babbling)
- Segmental development
- Prosodic development
- Phonological disorders
- Second language acquisition
- ...
- Feedback on software development project
- Data contribution
- Existing corpora in digital format
- Conversion of printed corpora
- Identification of corpora (printed, with or without audio files)
- Setting of conventions for data conversion
28Our proposal
- Constitution of a research team to develop a phonological component of CHILDES
- Database
- Supporting software
- Elaboration, with the research team, of a grant application to support:
- Database elaboration
- Software development
- Periodical meetings
- Workshops
29Concretely
- Feedback on software project
- Software needs for various types of research: let us know what you need
- Implementation: let us know how you want it to work
- Contribution to grant application
- Kinds of research the new database would enable: let us know what you would like to do
- Impacts of this research (e.g. theoretical, clinical, ...)
- Supporting letters
- Contribution to the public database
- Sharing of existing / future corpora
- Establishment of conventions to format older
corpora
30Special thanks
- The Phon team at Memorial
- Rodrigue Byrne
- Harold Wareham
- Gregory Hedlund
- Philip O'Brien
- For his great help with the TalkBank XML schema
- Franklin Chen (Carnegie Mellon University)
- For their useful feedback on an early version of this software
- Heather Goad (McGill), Paula Fikkert (Nijmegen),
Clara Levelt (Leiden), Katherine Demuth (Brown),
Mark Johnson (Brown), Carrie Dyck (Memorial),
Phil Branigan (Memorial), Brian MacWhinney
(Carnegie Mellon), Bryan Gick (UBC), Sophie
Wauquier-Gravelines (Nantes), Sharon Inkelas (UC
Berkeley), Conxita Lleó, Sonia Frota (Lisbon),
Maria João Freitas (Lisbon), Ronald Sprouse (UC
Berkeley), Joe Pater (UMass, Amherst), John
Archibald (Calgary), Éliane Lebel (Memorial)
hoping that no one was forgotten