Title: quads.esds.ac.uk/squad
SMART QUALITATIVE DATA: METHODS AND COMMUNITY
TOOLS FOR DATA MARK-UP
THE PROJECT
SQUAD aims to explore methodological and
technical solutions for exposing digital
qualitative data to make them fully shareable and
exploitable. The main objectives are to:
- specify, test and propose an eXtensible Markup
Language (XML) schema for storing and marking up
qualitative data
- investigate requirements for contextualising
qualitative data and developing standards for
data documentation
- use natural language processing (NLP) to develop
semi-automated tools for preparing marked-up
qualitative data for sharing
- research tools for publishing and interrogating
data via the web
Qualitative Data Mark-Up Tools (QDMT)
WHAT FEATURES OF TEXT CAN BE MARKED UP?
Spoken interview texts provide the clearest and
most common example of the types of encoding
features that can be marked up. There are three
basic groups of structural features:
- utterance, specific turn taker, defining
idiosyncrasies in transcription
- links to analytic annotation and other data types
(e.g. thematic codes, concepts, audio or video
links, researcher annotations)
- identifying information such as real names,
company names, place names, occupations, temporal
information
Example: Italy's business world was rocked by the
announcement last Thursday that Mr. Verdi would
leave his job as vice-president of Music Masters
of Milan, Inc to become operations director of
Arthur Anderson.
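As a minimal sketch of how the identifying information in the example sentence might be marked up, the Python snippet below wraps a hand-listed set of entities in TEI-style elements. The entity list, the choice of tags (placeName, date, persName, roleName, orgName), the wrapping s element and the function name mark_up are assumptions made for illustration; they are not the SQUAD schema.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

SENTENCE = (
    "Italy's business world was rocked by the announcement last Thursday "
    "that Mr. Verdi would leave his job as vice-president of Music Masters "
    "of Milan, Inc to become operations director of Arthur Anderson."
)

# Hand-picked (text, tag) pairs; an NLP system would find these automatically.
ENTITIES = [
    ("Italy", "placeName"),
    ("last Thursday", "date"),
    ("Mr. Verdi", "persName"),
    ("vice-president", "roleName"),
    ("Music Masters of Milan, Inc", "orgName"),
    ("operations director", "roleName"),
    ("Arthur Anderson", "orgName"),
]

def mark_up(text, entities):
    """Wrap the first occurrence of each entity string in an XML element."""
    spans = []
    for surface, tag in entities:
        start = text.find(surface)
        if start >= 0:
            spans.append((start, start + len(surface), tag))
    spans.sort()
    out, pos = [], 0
    for start, end, tag in spans:
        if start < pos:  # skip any span that overlaps the previous one
            continue
        out.append(escape(text[pos:start]))
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        pos = end
    out.append(escape(text[pos:]))
    return "<s>" + "".join(out) + "</s>"

marked = mark_up(SENTENCE, ENTITIES)
ET.fromstring(marked)  # raises ParseError if the result is not well-formed
print(marked)
```

Because the tags are embedded in the running text, removing them again recovers the original sentence unchanged.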
DEFINING CONTEXT
Rich context enables informed re-use of data, but
defining how to provide context for raw data to
make it more usable is complex. ESDS Qualidata
has done much to establish informal ways of
documenting raw data. Micro and macro level
features should be considered, including:
- how the research question was framed
- the research application process
- project progress
- fieldwork situations
- analysis processes
Fieldwork observations are useful, as are
timelines and political chronologies. Equally,
when undertaking a replication or restudy,
detailed information on sampling procedures,
fieldwork approaches and question guides will be
essential. SQUAD has identified a minimal
generic set of elements that represent a baseline
for contextualising data.
USING NLP TOOLS
Information Extraction (IE) is a sub-field of NLP
which aims to identify key pieces of information
in texts using 'shallow' analysis techniques. A
typical IE system will perform Named Entity
Recognition, where particular kinds of proper
names and terms are identified, classified and
marked up. This is a means of annotating
documents with semantic metadata, enabling
resource discovery and data exploration. The
Edinburgh LT-XML and CME tools have been used to
process the data.
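The identify, classify and mark-up steps of Named Entity Recognition can be illustrated with a toy rule-based recogniser. This is only a sketch in Python: it does not use or imitate the Edinburgh tools, and the gazetteer entries, the title-based person rule and the name element with its type attribute are all hypothetical choices made for the example.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical gazetteer: surface form -> entity class.
GAZETTEER = {
    "Milan": "location",
    "Italy": "location",
    "Music Masters": "organisation",
}
# Simple pattern rule: a title followed by a capitalised word is a person.
PERSON_RULE = re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.? [A-Z][a-z]+")

def recognise(text):
    """Identify and classify: return non-overlapping (start, end, class) spans."""
    found = [(m.start(), m.end(), "person") for m in PERSON_RULE.finditer(text)]
    for surface, label in GAZETTEER.items():
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            found.append((m.start(), m.end(), label))
    # Earliest span first; on a tie, prefer the longer span.
    found.sort(key=lambda s: (s[0], -(s[1] - s[0])))
    spans, last_end = [], 0
    for start, end, label in found:
        if start >= last_end:
            spans.append((start, end, label))
            last_end = end
    return spans

def annotate(text):
    """Mark up: wrap each recognised span in a <name type="..."> element."""
    out, pos = [], 0
    for start, end, label in recognise(text):
        out.append(escape(text[pos:start]))
        out.append(f'<name type="{label}">{escape(text[start:end])}</name>')
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)

example = "Mr. Verdi left Music Masters of Milan last Thursday."
print(annotate(example))
```

A production system would replace the hand-written rules with trained or more extensive components, but the output has the same shape: text in which each entity carries a class label that can later drive search or anonymisation.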
METADATA STANDARDS
The XML schema will specify a reduced set of
Text Encoding Initiative (TEI) elements:
- core tag set for transcription
- names, numbers, dates <persName>
- links and cross references <ref>
- notes and annotations <note>
- text structure <body>
- unique to spoken texts <kinesic>
- linking, segmentation and alignment <link>
- advanced pointing - XPointer framework
- text and AV synchronisation
- contextual information (participants, setting,
text)
<u who="interviewer" xml:id="u1">There's just one or two factual things first of all do you mind my asking how old you are?</u>
<u who="subject" xml:id="u2">49.</u>
<u who="interviewer" xml:id="u3">And what schools did you go to?</u>
<u who="subject" xml:id="u4">
<orgName>King Street</orgName>
Interview text with XML tags embedded
ANONYMISING DATA TOOL
This tool imports marked-up data from the
Edinburgh pipeline system. Named entities are
highlighted and co-reference chains (e.g.
numerous references to a single person) are
identified. Names can be anonymised with chosen
pseudonyms. The mapping of names to pseudonyms
is saved. Annotations are explored in an XML
format in the NITE NXT model. NXT uses stand-off
annotation, where annotation is linked to or
referenced by words.
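The pseudonym substitution described for the anonymising tool can be sketched in Python on a fragment modelled on the interview excerpt. This is an illustration under stated assumptions, not the tool itself: the wrapping body element and the closing of utterance u4 are added here only to make the fragment well-formed, and the pseudonym "School A", the NAME_TAGS list and the function name anonymise are hypothetical.

```python
import json
import xml.etree.ElementTree as ET

# Illustrative fragment; <body> and the closing </u> on u4 are assumptions
# added so that the fragment parses.
TRANSCRIPT = """<body>
<u who="interviewer" xml:id="u3">And what schools did you go to?</u>
<u who="subject" xml:id="u4"><orgName>King Street</orgName></u>
</body>"""

# Element names treated as carrying identifying names (assumed list).
NAME_TAGS = ("persName", "orgName", "placeName")

def anonymise(root, pseudonyms):
    """Replace tagged names in place; return the name -> pseudonym map used."""
    used = {}
    for elem in root.iter():
        if elem.tag in NAME_TAGS and elem.text in pseudonyms:
            original = elem.text
            elem.text = pseudonyms[original]
            used[original] = elem.text
    return used

root = ET.fromstring(TRANSCRIPT)
mapping = anonymise(root, {"King Street": "School A"})
anonymised_xml = ET.tostring(root, encoding="unicode")
mapping_json = json.dumps(mapping, indent=2)  # kept separately from the data
print(anonymised_xml)
print(mapping_json)
```

Looking names up in a single dictionary means that every mention of the same name receives the same pseudonym, and keeping the mapping apart from the transcript lets the archive hold the key without publishing it.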
TOOLS PROGRESS
- defined header metadata for a standardised
transcript
- defined and tested generic XML models for
qualitative data
- tested and refined NLP tools for qualitative data
- built front end to NLP named entity tools
- chosen software to enable annotation of data
- explored export formats for longer-term archiving
- investigated powerful XML based indexing tools
for searching and retrieving data
- investigated web display of multimedia data and
pointers to other resources using XML, extending
the functionality of ESDS Qualidata
DATA EXCHANGE STANDARDS
- A uniform format for richly encoding qualitative
research is necessary as it enables preservation
and re-use of metadata, data and annotation;
ensures consistency of presentation and
description of data; supports the development of
common web-based publishing and search tools; and
facilitates data interchange and comparison
among datasets.
- SQUAD has produced a limited formal definition of
a common XML vocabulary and DTD based on the TEI
and tested a new Qualitative Data Interchange
Format (QDIF).
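As a rough illustration of what a restricted common vocabulary means in practice, the Python sketch below reports any element names in a transcript that fall outside an allowed set. The ALLOWED set is a hypothetical subset built from element names that appear on this poster; it is not the SQUAD DTD or QDIF. The check is also weaker than DTD validation, which needs a validating parser (the standard library parser used here only checks well-formedness).

```python
import xml.etree.ElementTree as ET

# Hypothetical reduced vocabulary, for illustration only.
ALLOWED = {"body", "u", "persName", "orgName", "ref", "note", "kinesic", "link"}

def unknown_elements(xml_text, allowed=ALLOWED):
    """Return the sorted element names that are outside the allowed vocabulary."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    return sorted({elem.tag for elem in root.iter()} - set(allowed))

good = '<body><u who="subject"><orgName>King Street</orgName></u></body>'
bad = '<body><u who="subject"><b>King Street</b></u></body>'
print(unknown_elements(good))
print(unknown_elements(bad))
```

A check of this kind is what lets publishing and search tools rely on every dataset using the same element names.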
THE PROJECT TEAM
Claire Grover
Maria Milosavljevic
Louise Corti
Libby Bishop
Mijail Alexandrov Kabadjov
CONTACT
Louise Corti and Claire Grover
UK Data Archive, University of Essex,
Colchester, Essex CO4 3SQ
Email: quads_at_esds.ac.uk
Tel: +44 (0)1206 872145
URL: quads.esds.ac.uk/squad