Title: Kirrkirr: A Java-based visualisation tool for XML dictionaries of Australian Languages
1Kirrkirr: A Java-based visualisation tool for XML
dictionaries of Australian Languages
- Kevin Jansz
- Department of Computer Science, University of
Sydney, Australia
- Christopher Manning
- Computer Science and Linguistics, Stanford
University, USA
- Nitin Indurkhya
- School of Applied Science, Nanyang Technological
University, Singapore
2Project Objectives
- Aims of the project:
- providing innovative ways of representing a
dictionary, through creative use of the medium of
computers
- augmenting dictionaries from text corpora
- providing practical, educationally useful
programs as a result (at low labor cost)
- examining the richness of lexical structure, in
particular the connotational and figurative use
of words
- Main initial target: an interactive front end for
exploring or using the Warlpiri dictionary.
3Talk Outline
- The research agendas
- Kirrkirr: A Warlpiri dictionary browser
- The Lexical Database
- exploiting the strengths of XML
- indexing XML data
- User interface and visualization
- User studies
4Research Program: Lexicon
- A lexicon is not just words but a vast network of
associations between words and within and across
the concepts represented by words
- The aim of this work is to provide people with a
better understanding of this conceptual map.
- Traditional paper dictionaries offer very limited
ways of making such networks visible
- On a computer, one can imagine all sorts of ways
of bringing out such relationships
5Research: Computational Lexicography
- Dictionaries on computers are now commonplace
- But there has been little attempt to utilize the
potential of the new medium
- Goal: fun dictionary tools that are effective for
language learning, browsing, and research
- Special interest: dictionaries for minority
languages. Here economic, motivational, and user
support reasons all point to an important role
for computers.
6MRD Structure
- The internal structures of current Machine
Readable Dictionaries (MRDs) usually merely mimic
the structure of the printed form (Boguraev 1990)
- Some work, notably WordNet (Miller 1995), has
involved a fundamental rethinking of dictionary
content and organization (in WordNet,
organization via synsets which are related via
links of part, subkind, opposite)
- But there has been little in the way of software
to make such research truly usable by different
communities of users.
7Initial focus: Kirrkirr, a Warlpiri browser
- Warlpiri is an Australian Aboriginal language
spoken in the Tanami desert (NW of Alice Springs)
- Rich lexical materials have been collected by
linguists over decades (Ken Hale, MIT, from the
1950s), resulting in one of the most
comprehensive lexical databases for any
Australian Language
- There is a relatively large community of people
interested in learning their traditional language
- Until now, results haven't been produced in a
format usable by the community (only raw
printouts)
- Kirrkirr aims to build a computer interface for
browsing the Warlpiri dictionary.
8Educational goals
- Dictionary structure and usability are often
dictated by professional linguists, while the
needs of others (speakers, semi-speakers, young
users, second language learners) are not met
- Aim is to avoid this
- A low level of literacy makes an e-dictionary
potentially more useful than a paper edition as
it is less dependent on good knowledge of
spelling and alphabetical order.
- Making it fun and easy to use, and providing
multimedia content and the pronunciations of
words, is a considerable help as well.
9Target user community
10Kirrkirr: A Warlpiri dictionary browser
- (Jansz 1998; Jansz, Manning and Indurkhya 1999)
- An environment for the interactive exploration of
dictionaries.
- Although our current work has just been with
Warlpiri, the design is general (Arrernte coming
soon!)
- Attempts to more fully utilize graphical
interfaces, hypertext, multimedia, and different
ways of indexing and accessing information
- Written in Java, it can either be run over the
web (high bandwidth) or run locally (here Java's
main advantage is cross-platform support).
11Specific goals
- An interactive environment that encourages
exploration: easy and fun to use
- Reduction of the dependence on alphabetical order
- Catering to the needs of different user groups
(kids, teachers, professionals)
- Flexible enough to display appropriate
information in appropriate ways depending on user
level
12Overview
- Kirrkirr provides various modules:
- Graph layout of word relationships
- Formatted dictionary entries
- Semantic domain browsing
- A notes facility for jotting in the margin
- Multimedia: audio, pictures
- Advanced searching interfaces
- others in planning: formatting (XSL) editing,
figuration patterns
- These attempt to cater to users with different
interests and competence levels
13(Kirrkirr screen shot)
14The lexical database
- Original materials are stored in an ad hoc format
of markup using backslash codes with some (rather
odd) nesting of structural tags
- These were converted to XML using an
error-correcting stack-based parser (written in
Perl).
- The inconsistency and flexibility of dictionary
entries actually made this a surprisingly
difficult task.
- But the parser tries to impose data integrity
- Use of XML gives a clear structure to the data,
and makes available many (free) tools
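The conversion step can be illustrated with a minimal Java sketch (the original parser was written in Perl and is not shown in the talk). Everything specific here is an assumption made for illustration: the tag inventory `me`, `hw`, `gl`, and the convention that `\xx` opens a field and `\exx` closes it. The stack holds the currently open elements; a close code that skips unclosed inner fields closes them first, a stray close code is dropped, and anything still open at the end is closed.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Set;

class BackslashToXml {
    // Hypothetical tag inventory; the talk does not list the real codes.
    private static final Set<String> TAGS = Set.of("me", "hw", "gl");

    /** Converts whitespace-separated tokens such as "\hw", "word", "\ehw" to XML. */
    static String convert(String input) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // stack of open elements
        boolean lastWasText = false;
        for (String tok : input.trim().split("\\s+")) {
            if (tok.startsWith("\\e") && TAGS.contains(tok.substring(2))) {
                String name = tok.substring(2);
                if (!open.contains(name)) continue;           // stray close code: drop it
                while (!open.peek().equals(name)) {           // close forgotten inner fields
                    out.append("</").append(open.pop()).append('>');
                }
                out.append("</").append(open.pop()).append('>');
                lastWasText = false;
            } else if (tok.startsWith("\\") && TAGS.contains(tok.substring(1))) {
                open.push(tok.substring(1));
                out.append('<').append(tok.substring(1)).append('>');
                lastWasText = false;
            } else if (!tok.isEmpty()) {
                if (lastWasText) out.append(' ');             // rejoin running text
                out.append(escape(tok));
                lastWasText = true;
            }
        }
        while (!open.isEmpty()) {                             // close whatever is left open
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        System.out.println(convert("\\me \\hw WORD \\ehw \\gl a gloss \\eme"));
    }
}
```

The output is always well-formed, which is the sense in which such a parser "imposes data integrity"; whether the repaired nesting is the intended one still needs a human check.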
15XML
- XML separates the structure of the data from its
presentation
- Much of the recent enthusiasm for XML has
centered around representing simple and rigid
structures such as database records
- The rich hierarchical and variable structure of
dictionary entries is really more what something
like XML excels at!
- Result remains a portable, tangible text file
16Alternative: a database
- The obvious thing for storing a lot of data
- Has clear advantages: structure, indexing, query
language, relationships, integrity.
- Many people have suggested using a database for
lexical data and some have actually done it
(IITLEX, Austin and Nathan)
- But in general lexicographers oppose the
rigidity, and, in practice, standard relational
databases are quite ill-suited to dictionaries
17Problems with using a Relational Database
- Dictionary entries vary enormously in structure
- Data is fragmented
- Dictionaries are only loosely structured
- Same element can appear at many levels (dialect,
cross-reference, ...)
- Database model makes it hard to extend the
dictionary structure
- Lessens portability
18XML indexing - challenges
- Despite the various XML parsers available, it is
surprising that little consideration has been
given to making single entries retrievable from
the file
- Present XML parsers tend to put the entire XML
document (or its parsed tree form) in memory
before the data extraction process begins
- This is not practical when parsing significant
XML databases (e.g., the Warlpiri dictionary is
approx. 10 MB).
19XML Indexing - solutions
- The hierarchical structure of XML lends itself to
indexing, as each separate entry in the XML file
can be considered as a separate entity
- To make the Warlpiri dictionary usable for
Kirrkirr, an ad hoc indexing system was developed
- Uses a slightly modified Ælfred XML parser
- Entries are indexed by headword in a separate
index file
- The system returns an XML document object
containing the single dictionary entry,
facilitating:
- processing for related words (Graph layout)
- XSL processing to HTML
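A minimal sketch of headword indexing, under stated assumptions: it uses the JDK's built-in DOM parser rather than the modified Ælfred parser, the element names `ENTRY` and `HW` are hypothetical, entry start tags carry no attributes, and the index is held in a map in memory instead of being written to a separate index file. The idea it illustrates is the one described above: record the byte offset and length of each entry once, then seek to and parse only the requested entry.

```java
import java.io.ByteArrayInputStream;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

class HeadwordIndex {
    private final Path xmlFile;
    // headword -> {byte offset of <ENTRY>, byte length of the entry}
    private final Map<String, long[]> index = new HashMap<>();

    HeadwordIndex(Path xmlFile) throws Exception {
        this.xmlFile = xmlFile;
        // One-off scan (done in memory here for brevity). ISO-8859-1 maps every
        // byte to one char, so string positions equal byte offsets in the file.
        String raw = new String(Files.readAllBytes(xmlFile), StandardCharsets.ISO_8859_1);
        int start = 0;
        while ((start = raw.indexOf("<ENTRY>", start)) >= 0) {
            int end = raw.indexOf("</ENTRY>", start);
            if (end < 0) break;
            end += "</ENTRY>".length();
            int hwStart = raw.indexOf("<HW>", start);
            int hwEnd = raw.indexOf("</HW>", start);
            if (hwStart >= 0 && hwStart < hwEnd && hwEnd < end) {
                // Re-decode the headword bytes as UTF-8 to recover non-ASCII letters.
                byte[] hw = raw.substring(hwStart + 4, hwEnd).getBytes(StandardCharsets.ISO_8859_1);
                index.put(new String(hw, StandardCharsets.UTF_8), new long[] {start, end - start});
            }
            start = end;
        }
    }

    /** Parses only the requested entry; returns null for an unknown headword. */
    Document lookup(String headword) throws Exception {
        long[] pos = index.get(headword);
        if (pos == null) return null;
        byte[] buf = new byte[(int) pos[1]];
        try (RandomAccessFile raf = new RandomAccessFile(xmlFile.toFile(), "r")) {
            raf.seek(pos[0]);
            raf.readFully(buf);
        }
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(buf));
    }
}
```

The returned `Document` holds a single entry, so it can be handed straight to a graph layout or an XSL transformation without touching the rest of the file.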
20XML Indexing - solutions (2)
- The use of the XML indexing process considerably
improves efficiency as only requested entries are
parsed, hence conserving time and bandwidth
- Once whole entries are parsed, they are kept
temporarily in a cache
- Thus the system is a middle ground, combining the
structure and indexing of a relational database
with the freedom and functionality of XML.
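The talk does not say how the cache decides what to drop. The sketch below assumes a small least-recently-used policy, built on `LinkedHashMap` in access order; the class name `EntryCache`, the `loader` function, and the `loads` counter are all illustrative.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class EntryCache<V> {
    private final Function<String, V> loader;   // e.g. parse one entry via the index
    private final Map<String, V> cache;
    int loads = 0;                              // how often the loader actually ran

    EntryCache(int capacity, Function<String, V> loader) {
        this.loader = loader;
        // accessOrder = true keeps the least recently used entry first.
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
                return size() > capacity;       // evict once over capacity
            }
        };
    }

    V get(String headword) {
        V value = cache.get(headword);
        if (value == null) {                    // miss: parse and remember
            value = loader.apply(headword);
            loads++;
            cache.put(headword, value);
        }
        return value;
    }
}
```

Browsing tends to revisit the same few words (back and forth along links), which is why even a small cache like this saves most re-parsing.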
21Kirrkirr's XML Index Process
(Diagram: Index in Memory, Kirrkirr, XML document object)
22XQL - Potential
- An alternative to investigate for the future is
using a standard query language such as XQL
to get material out of the XML dictionary, rather
than using our ad hoc index.
- At the moment not a huge issue since most
retrieval is focussed on components of a
particular word
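To give a feel for what query-language retrieval looks like, here is a sketch that uses XPath from the JDK as a stand-in; it is not XQL, and the element names `DICTIONARY`, `ENTRY`, `HW` and `GL` are assumed for illustration. Note that this form of evaluation parses the whole document, which is exactly the cost the ad hoc index avoids.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class DictionaryQuery {
    /** Returns the gloss text for a headword, or "" when there is no match. */
    static String glossOf(String dictionaryXml, String headword) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind $hw through a variable resolver instead of pasting the headword
        // into the expression, so quotes in the headword cannot break the query.
        xpath.setXPathVariableResolver(
                name -> "hw".equals(name.getLocalPart()) ? headword : null);
        return xpath.evaluate("/DICTIONARY/ENTRY[HW = $hw]/GL",
                new InputSource(new StringReader(dictionaryXml)));
    }
}
```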
23Visualization of dictionary information
- For applications with simple textual content
behind them, there is little that can be done but
an on-line reflection of a printed page
- But we want more than just definitions of words:
we want to know their relationships to other
words, and the patterning in these relationships
- In a computational approach, the program can
mediate between the lexical data and the user
- The interface can select from and choose how to
present information (according to the user's
preferences) in many different ways
24Previous work
- Current systems present the search-dominated
interface of classic Information Retrieval
systems: you type a word in a search box
- Results try to mimic, but are generally inferior
to, the printed version of the dictionary
- Good feature: rapid searching
- But these systems do little to utilize the
captivating qualities of computers:
interactivity, user control and adaptability
(Brown 1985).
25Previous work (2)
- Current systems are only effective when the user
has a clearly specified information need; even
here, we are ignoring the distinction between
information gained and knowledge sought (Sharpe
1995)
- Lack browsing, and chances for incidental or
curiosity-driven learning
- Lack tangibility and situatedness of paper;
ineffective for getting an idea of a collection
- We wish to exploit the essence of hypertext,
which is click-to-explore browsing
26Previous work (3)
- Little research work (in corpus linguistics,
visualization etc.) on dictionary visualization
- WordNet built a rich network of relationships,
which fundamentally departed from the paper
dictionary tradition, and has been used in many
computational projects
- However, very little has been done in the way of
interfaces that make these relationships visible
and intelligible to users.
- Graphical representations seem particularly
important given our target users.
27Graph-based visualization
- There is a little previous work on graphical
representations of dictionaries
- For instance, the Visual Thesaurus by Plumb
Design, derived from WordNet
- But it is also a good demonstration of how
chaotic and confusing graphical interfaces can
become.
28Perils of visualization
29Graph-based visualization
- (Jansz 1998; Jansz, Manning and Indurkhya 1999)
- Classic graph layout problem
- Adapts work by Eades et al. (1998) and Huang et
al. (1998) on visualization and navigation of WWW
document linkages
- Uses the spring algorithm. Big advantage is that
it is an iterative updating algorithm, and so
makes interactivity easy
- it wiggles and people can play with it.
- Clarity and simplicity of graph: software
maintains a set of focus nodes to prevent
overcrowding
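The iterative character of a spring layout can be seen in a small sketch. This is a simplified, generic spring embedder and not Kirrkirr's code: it assumes Hooke's-law springs along edges, inverse-square repulsion between all node pairs, and illustrative constants (`REST`, `STIFFNESS`, `REPULSION`, `MAX_STEP`). Each call to `step()` moves every node a little, which is what lets a display redraw between steps and respond while the user drags nodes.

```java
class SpringLayout {
    static final double REST = 100;        // preferred edge length
    static final double STIFFNESS = 0.05;  // spring constant
    static final double REPULSION = 5000;  // strength of node-node repulsion
    static final double MAX_STEP = 10;     // cap on movement per iteration

    final double[] x, y;                   // node positions, updated in place
    final int[][] edges;                   // pairs of node indices

    SpringLayout(double[] x, double[] y, int[][] edges) {
        this.x = x;
        this.y = y;
        this.edges = edges;
    }

    /** One iteration: accumulate forces, then move each node a bounded amount. */
    void step() {
        int n = x.length;
        double[] fx = new double[n], fy = new double[n];
        for (int i = 0; i < n; i++) {                  // all pairs repel
            for (int j = i + 1; j < n; j++) {
                double dx = x[i] - x[j], dy = y[i] - y[j];
                double d = Math.max(Math.hypot(dx, dy), 0.1);
                double f = REPULSION / (d * d);
                fx[i] += f * dx / d; fy[i] += f * dy / d;
                fx[j] -= f * dx / d; fy[j] -= f * dy / d;
            }
        }
        for (int[] e : edges) {                        // linked nodes attract
            double dx = x[e[1]] - x[e[0]], dy = y[e[1]] - y[e[0]];
            double d = Math.max(Math.hypot(dx, dy), 0.1);
            double f = STIFFNESS * (d - REST);         // negative when too close
            fx[e[0]] += f * dx / d; fy[e[0]] += f * dy / d;
            fx[e[1]] -= f * dx / d; fy[e[1]] -= f * dy / d;
        }
        for (int i = 0; i < n; i++) {
            double len = Math.hypot(fx[i], fy[i]);
            double scale = len > MAX_STEP ? MAX_STEP / len : 1.0;
            x[i] += fx[i] * scale;
            y[i] += fy[i] * scale;
        }
    }

    public static void main(String[] args) {
        SpringLayout g = new SpringLayout(new double[] {0, 400, 200},
                new double[] {0, 0, 300}, new int[][] {{0, 1}, {1, 2}});
        for (int i = 0; i < 200; i++) g.step();
        for (int i = 0; i < g.x.length; i++) System.out.println(g.x[i] + ", " + g.y[i]);
    }
}
```

Restricting the layout to a small set of focus nodes, as the slide describes, also keeps the all-pairs repulsion loop cheap.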
30Educational advantages
- Alphabetical order is important, but
- A web of words offers other effective
opportunities for learning
- A student can opportunistically explore words
that are related in various ways
- Important semantic relationships can be
understood
31Kirrkirr network display
32Kirrkirr network display
33Formatted dictionary entries
- Are produced automatically from the XML by using
XSL (via James Clark's XT)
- XSL allows easy modeling of some user
preferences.
- Most trivially, one can leave out information
such as part of speech, or detailed definitions,
which we do by providing several stylesheets to
choose from
- This is useful as many users find information
overload quite confusing and demotivating
- Can produce bilingual or monolingual dictionary
- Opportunities for various output styles, and
formats such as RTF or TeX for printing.
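Choosing between stylesheets can be sketched with the JDK's `javax.xml.transform` API (used here in place of XT). The two stylesheets and the element names `ENTRY`, `HW`, `POS` and `GL` are invented for illustration: `FULL` prints the part of speech, `BRIEF` leaves it out, and the entry XML itself is untouched.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class EntryFormatter {
    private static final String HEAD =
            "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "<xsl:output method='html'/>"
            + "<xsl:template match='/ENTRY'><p><b><xsl:value-of select='HW'/></b>";
    private static final String TAIL =
            "<xsl:text> </xsl:text><xsl:value-of select='GL'/></p>"
            + "</xsl:template></xsl:stylesheet>";

    // Hypothetical stylesheets: same entry, more or less detail.
    static final String FULL =
            HEAD + "<xsl:text> </xsl:text><i><xsl:value-of select='POS'/></i>" + TAIL;
    static final String BRIEF = HEAD + TAIL;

    /** Applies the chosen stylesheet to one entry and returns the HTML. */
    static String format(String entryXml, String stylesheet) throws Exception {
        Transformer t = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(stylesheet)));
        StringWriter out = new StringWriter();
        t.transform(new StreamSource(new StringReader(entryXml)), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String entry = "<ENTRY><HW>WORD</HW><POS>noun</POS><GL>a gloss</GL></ENTRY>";
        System.out.println(format(entry, FULL));
        System.out.println(format(entry, BRIEF));
    }
}
```

A user-level setting then only has to pick which stylesheet string (or file) is passed to `format`.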
34Formatted dictionary entries
35Rich typology of link types
- The semantically rich types of linkages present
in a dictionary (synonym, antonym, hyponym,
subheadword, variant, coverbs, ...) solve one of
the major problems of the web: we have many link
types with a clear semantic interpretation
- Use consistent color-coded text and edges to show
these link types
- Gives a richer browsing experience
- Unlike HTML, you can tell where you are going
before clicking
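One way to keep text links and graph edges consistent is to define each link type and its color in a single place. The sketch below is an assumption about how that could be organized, not Kirrkirr's code, and the particular colors are invented.

```java
enum LinkType {
    // Colors are illustrative; the talk does not specify them.
    SYNONYM(0x0000FF),
    ANTONYM(0xFF0000),
    HYPONYM(0x008000),
    SUBHEADWORD(0xFF8000),
    VARIANT(0x800080),
    COVERB(0x808080);

    final int rgb;   // shared by hypertext links and network edges

    LinkType(int rgb) {
        this.rgb = rgb;
    }

    /** Color in the "#rrggbb" form used in HTML. */
    String htmlColor() {
        return String.format("#%06x", rgb);
    }

    /** A hypertext link whose color announces the kind of relationship. */
    String toHtmlLink(String headword) {
        return "<a href=\"#" + headword + "\" style=\"color:" + htmlColor() + "\">"
                + headword + "</a>";
    }
}
```

A graph view can read the same `rgb` value when drawing an edge, so an antonym looks the same in the formatted entry and in the network pane.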
36Browsing
- Work (at PARC and elsewhere; Pirolli et al. 1996)
has stressed the role of browsing as well as
searching in information access
- It provides a context for learning
- We provide browsing in several ways:
- conventional hypertext
- but with rich semantically-interpreted links
- their color-coding matches network edges
- network-based display of words
- Other methods being investigated
- browsing through semantic domains
- deriving terminology sets (words that are used
together in culturally important activities)
automatically from text corpora
37Other components
- Multimedia (currently pictures and audio)
- Can hear pronunciations / see objects
- I'm keen to put in videos of Warlpiri sign
language
- Advanced search page
- search various fields, regular expressions, etc.
- Notes: one can annotate dictionary entries (to
correct or personalize)
38User study
- Mim Corris (Yuendumu, Willowra)
- Jane Simpson (Lajamanu)
- User testing with primary and (lower) secondary
students
- Observation of trainee Warlpiri literacy workers
- Comments from teachers, other adults etc.
- Purely qualitative observational study of
dictionary use. (Doing anything much else would
be difficult.)
- Initial reactions are very enthusiastic
- Could use as a basis for classroom activities
(better with some further development: games and
puzzles)
39A positive anecdote
- One of the introductory Warlpiri literacy
students, who had not been very interested in the
literacy class, spent nearly 3/4 of an hour looking
at Kirrkirr, apparently in absorbed concentration.
She wasn't especially interested in the sound and
picture possibilities. She moved between words,
scrolling along the list, typing in the search,
clicking on the words in the network pane. She
wasn't even put off when the dictionary
definitions stopped appearing, looking at the
networks of words instead. This is quite unlike
her attitude to the backslash-coded electronic
dictionary (where she lost interest quickly
because of the difficulty for her of narrowing
down searches). After the Kirrkirr demo she
asked if she could have a printed dictionary to
take away with her to use in camp to learn the
words. I interpret this as a desire to learn
words in her own time and place.
40Conclusions
- Kirrkirr is just a prototype of what one can do
to develop new ways to visualize lexicons
- We have addressed the challenge of making
dictionary information usable by creating an
application which mediates between
well-structured data and users' needs for
searching/browsing and presentation
- While we have focused our research on Warlpiri,
the system can be easily applied to other
languages
41Conclusions (cont.)
- "... The best future applications of MRDs in
education will be those most able to respond to
the insights and needs of their users" (Kegl
1995)
- Kirrkirr can be seen as a step towards the future
of e-dictionaries