1
Automated indexing of survey questionnaires and
interviews

Louise Corti
UK Data Archive, University of Essex
NaCTeM, Manchester
25 January 2008

2
Data collections at UKDA

social, economic and historical data
collections
around 5000 data collections (studies)
studies: primary data derived from social research
methods
surveys, collated statistics, qualitative
interviewing, fieldwork and observation

3
Resource discovery at UKDA

how do our users currently locate data?
at the highest level, a study level metadata
record is compiled for each study following the
DDI XML-based metadata schema
free text fields plus controlled vocabulary
DDI - all the national social science data
archives in the world use this standard, so
harvesting across collections is possible

6
Resource discovery at UKDA

the simplest resource discovery is a free-text
search, or a search on some key fields from the
catalogue records, e.g. health

7
Free text search
8
Use of key words

UKDA manually assign key words to (index) data
for resource discovery purposes
surveys: study description and question level
qualitative data: study level, methods
key words used thus describe the methodology but
not the research data per se

9
Key word search
Key words
10
Key words
11
Key words are manually assigned at the survey
question level but captured in a database at the
study level
This is very laborious as there can be hundreds
per study!
BUT key words are NOT linked to questions, i.e.
at UKDA there is NO correspondence between the
two in the current metadata schema.
12
Keyword searches can be refined by the user
making use of the UKDA thesaurus of terms
Select this term and a new search is run
13
UKDA Thesaurus

HASSET (Humanities and Social Science Electronic
Thesaurus) is a subject thesaurus which has been
developed by the UKDA over the past 20 years
initially based on the UNESCO thesaurus, it has
been continuously expanded and updated for use in
the UKDA's online catalogue
display of the hierarchical relationships of
terms can help users to broaden a search or make
it more specific; cross-referencing to synonyms
suggests alternative search terms, as does the
provision of links to other conceptually related
terms
it employs the conventional range of term
relationships: equivalence (preferred and
non-preferred terms), hierarchical
relationships (broader and narrower terms) and
associative relationships (related terms)
stored in SQL tables; a multilingual version,
ELSST, was developed for the EC
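
The three kinds of term relationship described above can be sketched as a small in-memory structure. This is only an illustration, not the UKDA's SQL implementation: the class names, method names and the example terms are all assumptions made for the sketch and are not taken from HASSET itself.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A preferred term and its hierarchical and associative links."""
    label: str
    broader: set = field(default_factory=set)
    narrower: set = field(default_factory=set)
    related: set = field(default_factory=set)


class Thesaurus:
    def __init__(self):
        self.terms = {}  # preferred label -> Term
        self.use = {}    # non-preferred label -> preferred label

    def add_term(self, label):
        return self.terms.setdefault(label, Term(label))

    def add_hierarchy(self, broader, narrower):
        # Hierarchical relationship, stored in both directions.
        self.add_term(broader).narrower.add(narrower)
        self.add_term(narrower).broader.add(broader)

    def add_related(self, a, b):
        # Associative relationship, symmetric.
        self.add_term(a).related.add(b)
        self.add_term(b).related.add(a)

    def add_synonym(self, non_preferred, preferred):
        # Equivalence relationship: non-preferred points to preferred.
        self.add_term(preferred)
        self.use[non_preferred] = preferred

    def resolve(self, label):
        return self.use.get(label, label)

    def suggestions(self, label):
        """Terms a user could follow to broaden, narrow or shift a search."""
        term = self.terms[self.resolve(label)]
        return {
            "broader": sorted(term.broader),
            "narrower": sorted(term.narrower),
            "related": sorted(term.related),
        }


# Hypothetical entries, for illustration only.
demo = Thesaurus()
demo.add_hierarchy("HEALTH", "CHRONIC ILLNESS")
demo.add_related("CHRONIC ILLNESS", "DISABILITIES")
demo.add_synonym("LONG-TERM ILLNESS", "CHRONIC ILLNESS")
```

Calling `demo.suggestions("LONG-TERM ILLNESS")` first resolves the non-preferred term and then lists the broader, narrower and related terms that could be offered to the user for a new search.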

14
Metadata

DDI allows for:
study level:
methodology and data description, authors, rights
management and access etc.
file level:
description of individual files, e.g. SPSS files,
work file, audio file (but not currently used
in-house)
question (variable) level:
question description, text, variable names and
values, groups PLUS key words (again not used)

15
Variable search
16
But key words NOT linked to the variable!
17
Indexing survey questions

survey questions, e.g.
Do you suffer from any long-standing limiting
illness?
Keywords assigned: long-term illness
Government survey questions are often
standardised to provide comparability across
surveys
we have large databases of individual questions
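
Because standardised questions recur across surveys, a first pass at key word assignment could be a simple phrase lookup. The sketch below assumes a hand-built table mapping question phrases to controlled terms; the table contents and the function names are hypothetical, and this is not a description of how UKDA indexers work.

```python
import re

# Hypothetical lookup: question phrase -> controlled key word.
PHRASE_TO_TERM = {
    "long-standing limiting illness": "long-term illness",
    "long-standing illness": "long-term illness",
    "unemployed": "unemployment",
}


def normalise(text):
    """Lower-case and collapse punctuation so phrases compare cleanly."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def assign_keywords(question, lookup=PHRASE_TO_TERM):
    """Return the controlled terms whose phrase occurs in the question."""
    padded = f" {normalise(question)} "
    found = {
        term
        for phrase, term in lookup.items()
        if f" {normalise(phrase)} " in padded
    }
    return sorted(found)


question = "Do you suffer from any long-standing limiting illness?"
keywords = assign_keywords(question)
```

A lookup like this only proposes terms for questions whose wording it has already seen, which is why a database of manually indexed questions would be needed first.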

18
Semi-automated solutions

UKDA indexing: there must be an easier way!
first a database of questions linked to key terms
(controlled vocab) must be built to test any
automated assignment
methodology and coder reliability should also be
investigated
in-house guidelines are in place but assignment
is still subjective
no stringent quality control on key word
assignment
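
Two simple measures could support the testing mentioned above: precision and recall of automatically proposed key words against the manual ones, and the overlap between two coders indexing the same question. This is a minimal sketch; the choice of measures (including Jaccard overlap for coder agreement) is an assumption, not something the project specified.

```python
def precision_recall(proposed, manual):
    """Compare automatically proposed key words with manual ones."""
    proposed, manual = set(proposed), set(manual)
    hits = len(proposed & manual)
    precision = hits / len(proposed) if proposed else 0.0
    recall = hits / len(manual) if manual else 0.0
    return precision, recall


def jaccard(coder_a, coder_b):
    """Share of key words on which two coders agree (1.0 = identical)."""
    a, b = set(coder_a), set(coder_b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


# Hypothetical key word sets for one survey question.
auto_terms = {"health", "long-term illness"}
manual_terms = {"long-term illness", "disabilities"}
p, r = precision_recall(auto_terms, manual_terms)
agreement = jaccard(manual_terms, {"long-term illness", "health"})
```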

19
Key words for qualitative data

a different challenge
indexing done at study level and is largely
conceptual
work needs to be done on how researchers assign
key words to data and how they search for
qualitative data
analysis of in-house data processing methods
analysis of UKDA search logs: what terms do
users enter?
can utilise named entity recognition, term
extraction and document summarisation tools on
these kinds of data (e.g. an unstructured
transcribed interview)
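
The search log analysis could start with something as simple as counting the words users type and flagging those that fall outside the controlled vocabulary. The log entries and vocabulary below are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter


def term_frequencies(queries):
    """Count how often each word appears across logged search queries."""
    counts = Counter()
    for query in queries:
        counts.update(query.lower().split())
    return counts


def outside_vocabulary(counts, vocabulary):
    """Words users searched for that are not in the controlled vocabulary."""
    vocab = {v.lower() for v in vocabulary}
    return {term: n for term, n in counts.items() if term not in vocab}


# Hypothetical log entries and vocabulary.
log = ["Health survey", "health inequalities", "oral history"]
counts = term_frequencies(log)
unmatched = outside_vocabulary(counts, ["health", "history"])
```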

20
What about using NaCTeM tools?

given that we can provide databases of terms
linked to data (study, file and parts)
could test NaCTeM tools on the data to assign
terms or concepts, or to summarise text
nice front end processing tools are essential
processors must have the option to agree or edit
any terms
terms should be output to DDI XML metadata at the
study, file and variable level
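
As a rough sketch of the last point, agreed terms could be written into a DDI-style XML record at study and variable level. The element names used here (`codeBook`, `stdyDscr`, `stdyInfo`, `subject`, `keyword`, `dataDscr`, `var`, `qstn`, `qstnLit`, `concept`) follow my reading of the DDI Codebook 2.x schema and should be checked against it; namespaces, element order and validation are ignored, and the function name is hypothetical. The terms passed in are assumed to have already been agreed by a processor.

```python
import xml.etree.ElementTree as ET


def build_codebook(study_keywords, variables, vocab="HASSET"):
    """Build a DDI-like tree holding study and variable level key words.

    variables is a list of (name, question_text, terms) tuples.
    """
    code_book = ET.Element("codeBook")

    # Study level: key words under the study description.
    study = ET.SubElement(code_book, "stdyDscr")
    info = ET.SubElement(study, "stdyInfo")
    subject = ET.SubElement(info, "subject")
    for kw in study_keywords:
        ET.SubElement(subject, "keyword", vocab=vocab).text = kw

    # Variable level: question text plus the terms linked to it.
    data = ET.SubElement(code_book, "dataDscr")
    for name, question, terms in variables:
        var = ET.SubElement(data, "var", name=name)
        qstn = ET.SubElement(var, "qstn")
        ET.SubElement(qstn, "qstnLit").text = question
        for term in terms:
            ET.SubElement(var, "concept", vocab=vocab).text = term
    return code_book


record = build_codebook(
    ["health"],
    [("q1",
      "Do you suffer from any long-standing limiting illness?",
      ["long-term illness"])],
)
xml_text = ET.tostring(record, encoding="unicode")
```

Keeping the terms inside each `var` element is what would give the link between key words and questions that the current catalogue lacks.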

21
Structural and content mark up of textual
interview data

for spoken interview texts, useful encoding
features are:
utterance, specific turn taker, defining
idiosyncrasies in transcription
links to analytic annotation and other data types
(e.g. thematic codes, concepts, audio or video
links, researcher memos, maps, images, URLs etc.)
identifying information such as real names,
company names, place names, occupations, temporal
information

22
A sample interview

ID: 001
Sex: M
YOB: 1921
Place: Oldham
Finalocc: Postman
<u id='1' who='interviewer'>Right, it starts with
your grandparents. So give me the names and dates
of birth of both. Do you remember those sets of
grandparents?</u>
<u id='2' who='subject'>Yes.</u>
<u id='3' who='interviewer'>Well, we'll start with
your mum's parents? Where did they live?</u>
<u id='4' who='subject'>They lived in Widness,
Lancashire.</u>
<u id='5' who='interviewer'>How do you remember
them?</u>
<u id='6' who='subject'>When we Mum used to take
me to see them and me Grandma came to live with
us in the end, didn't she?</u>
<u id='7' who='Welham'>Welham: Yes, when Granddad
died - '48.</u>
<u id='8' who='interviewer'>So he died when he was
48?</u>
<u id='9' who='Welham'>Welham: No, he was 52. He
died in 1948.</u>
<u id='10' who='interviewer'>But I remember it.
How old would I be then?</u>
<u id='11' who='Welham'>Welham: Oh, you would have
been little then.</u>
<u id='12' who='subject'>I remember him, he used
to have whiskers. He used to put me on his knee
and give me a kiss.</u>
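
Once utterances carry an id and a turn taker, they can be read back and grouped by speaker. The sketch below assumes each utterance is a well-formed `u` element with `id` and `who` attributes inside a wrapper element; the wrapper name `text` and the function name are assumptions, and only a few shortened turns from the sample are used.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# A few turns from the sample, wrapped in an assumed <text> element.
SAMPLE = """<text>
<u id='3' who='interviewer'>Where did they live?</u>
<u id='4' who='subject'>They lived in Widness, Lancashire.</u>
<u id='5' who='interviewer'>How do you remember them?</u>
</text>"""


def turns_by_speaker(xml_text):
    """Group (utterance id, text) pairs under each turn taker."""
    root = ET.fromstring(xml_text)
    turns = defaultdict(list)
    for u in root.iter("u"):
        turns[u.get("who")].append((int(u.get("id")), (u.text or "").strip()))
    return dict(turns)


turns = turns_by_speaker(SAMPLE)
```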

23
ESRC SQUAD project

developed and tested universal standards and
technologies:
long-term digital archiving
publishing
data exchange
investigated user-friendly tools for
semi-automating processes already used to prepare
qualitative data and materials:
formatted text documents ready for output
mark-up of structural features of textual data
annotation and anonymisation tool
automated coding/indexing linked to a domain
ontology

24
Identifying elements

Identify atomic elements of information in text:
Person names
Company/Organisation names
Occupations
Locations
Dates and times
Example:
Italy's business world was rocked by the
announcement last Thursday that Mr. Verdi would
leave his job as vice-president of Music Masters
of Milan, Inc to become operations director of
Arthur Andersen
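
A very small rule-based pass over the example sentence shows how such elements might be picked out. This is a toy sketch using regular expressions and tiny hand-made word lists, all of which are assumptions; it is not the tool set tested at UKDA.

```python
import re

SENTENCE = (
    "Italy's business world was rocked by the announcement last Thursday "
    "that Mr. Verdi would leave his job as vice-president of Music Masters "
    "of Milan, Inc to become operations director of Arthur Andersen"
)

WEEKDAYS = "Monday|Tuesday|Wednesday|Thursday|Friday|Saturday|Sunday"

# Hand-written rules and small word lists, for illustration only.
RULES = {
    "PERSON": re.compile(r"\b(?:Mr|Mrs|Ms|Dr)\.\s+[A-Z][a-z]+"),
    "DATE": re.compile(rf"\b(?:last|next)\s+(?:{WEEKDAYS})\b"),
    "OCCUPATION": re.compile(r"\b(?:vice-president|operations director)\b"),
    "LOCATION": re.compile(r"\b(?:Italy|Milan)\b"),
    # Capitalised words (optionally joined by "of") ending in "Inc".
    "ORGANISATION": re.compile(
        r"\b[A-Z][a-z]+(?:\s+(?:of\s+)?[A-Z][a-z]+)*,?\s+Inc\b"
    ),
}


def find_entities(text):
    """Return (type, matched text) pairs in order of appearance."""
    hits = []
    for label, pattern in RULES.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), label, match.group()))
    return [(label, span) for _, label, span in sorted(hits)]


entities = find_entities(SENTENCE)
```

These rules find the person, date, occupations, locations and the company ending in "Inc", but not the final company name, since nothing in the rules marks it as an organisation. Gaps like this are why rules need tuning to the material.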

25
Testing NLP tools

UKDA have investigated some basic NLP tools to
identify named entities with a nice GUI
part of ESRC SQUAD award
rules can be written but are necessarily geared
to a specific domain, and individual interviews
can cover almost any subject!
system tuned to a sample of routine interview
data, not jargon-laden

26
27
XML schema - TEI

main aim: to tag data with key XML elements
work on an XML schema has specified a reduced
set of Text Encoding Initiative (TEI) elements
core tag set for transcription:
names, numbers, dates <persName>
links and cross references <ref>
text structure <body>
unique to spoken texts <kinesic>
contextual information (participants, setting,
text)
New XML schema developed under JISC funding
(DEXT) called QuDEx to describe annotation,
linking, segmentation and alignment of
qualitative data (www.data-archive.ac.uk/dext)

28
Transcript with manual XML mark-up
29
Automated XML mark-up: input data file for NLP
tools
30
Data processed through Edinburgh LT-XML and CME
tools
The main Graphical User Interface (GUI)
Invokes the SQUADCoder in NXT
31
NXT tool
Locate the NXT metadata file which must be set
up with named entity types
The NXT generic window running the SQUAD Coder
32
The SQUADCoder Window
All the references to a particular entity
The Named Entity Hierarchy
Transcription view
33
Annotation tool - anonymise
The Coreference Action Panel
34
Annotation tool
Enter pseudonym
35
Anonymised data
The Anonymised Transcription View
36
Annotated data in the NXT: what formats and how
stored?

NXT uses stand-off annotation: annotation
linked to, or referencing, individual words
uses the NITE NXT XML model
creates a new anonymised version of the text
saves the original file
saves a matrix of references: names to pseudonyms
outputs annotations, who worked on the file etc.
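
The idea of stand-off annotation and a names-to-pseudonyms matrix can be sketched in a few lines. This is not the NITE NXT XML model: the `Annotation` class, the token-index scheme and the pseudonym are all hypothetical, and annotations are assumed not to overlap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Stand-off annotation: points at tokens instead of wrapping them."""
    entity_type: str
    start: int       # index of the first token covered
    end: int         # index one past the last token covered
    annotator: str   # who worked on the file


def anonymise(tokens, annotations, pseudonyms):
    """Return anonymised text; the original token list is left untouched."""
    out = list(tokens)
    # Work right to left so earlier token indices stay valid.
    for ann in sorted(annotations, key=lambda a: a.start, reverse=True):
        original = " ".join(tokens[ann.start:ann.end])
        if original in pseudonyms:
            out[ann.start:ann.end] = [pseudonyms[original]]
    return " ".join(out)


TOKENS = "Mr. Verdi would leave his job".split()
ANNOTATIONS = [Annotation("person", 0, 2, "coder1")]
PSEUDONYMS = {"Mr. Verdi": "Mr. Rossi"}  # hypothetical pseudonym

anonymised = anonymise(TOKENS, ANNOTATIONS, PSEUDONYMS)
```

Because the annotations only reference token positions, the original text, the pseudonym table and the anonymised version can each be saved separately.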

37
Next steps

these are all demo tools; none has been taken any
further since project funding ended
I would like collaboration on annotation of data
through semantic tagging and document mark-up:
automatic term recognition and XML element
tagging
automatic document classification and indexing
auto-summarisation of text (document reduction)
possibly detecting structural relationships
can NaCTeM (ASSERT) tools be used to undertake:
key word assignment for survey questions and
structured catalogue records?
term extraction, summarisation and mark-up of
spoken interview data?
coreferencing in interviews?