Title: From Edwardians Online to Qualidata Online
1. From Edwardians Online to Qualidata Online
- Preparing data for online access
- Libby Bishop, ESDS Qualidata
- Economic and Social Data Service, UK Data Archive
- Online Access to Qualitative Data: Opportunities and Challenges
- Friday 5 December, 2003
- Royal Statistical Society
2. Towards a standard format for qualitative data resources
- data needs to be preserved in a uniform resource format
  - easier for provider (maintenance, tools, interchange)
  - easier for user (consistency across data sets)
- DDI provides an XML framework for survey content (variables), but there is currently no suitable standard format for the content of qualitative data
- need a comprehensive application that will enable
  - data interchange
  - sophisticated on-line searching
  - retrieval from encoded texts
The Edwardians Online Pilot
3. Edwardians Online Development work
- A six-month investigative project towards developing such a framework in a specific resource creation project
- Data
- Models using XML standards technologies
- New functionality
- Data coding methods
4. Basic search and retrieve functionality
- developed online querying function based on annotation of texts and themes in XML
  - keyword search of interview summaries database
  - keyword search of interview full transcripts
  - search or browse by themes from list
  - retrieve extracts of text in particular documents coded by that theme
  - jump from extract to view in full document
  - filter searches on subsets of interviewees, e.g. age, gender
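The combination of keyword search and interviewee filtering can be illustrated with a minimal Python sketch. It is not the pilot's implementation: the `Utterance` record, the `INTERVIEWEES` table, the function name `keyword_search`, and the sample sentences are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Utterance:
    doc_id: str   # transcript identifier (hypothetical values below)
    line_id: int  # position of the utterance within the transcript
    text: str


# Hypothetical interviewee metadata keyed by transcript ID.
INTERVIEWEES = {
    "SN30": {"gender": "female", "birth_year": 1892},
    "SN31": {"gender": "male", "birth_year": 1901},
}


def keyword_search(utterances, keyword, **filters):
    """Return utterances containing keyword (case-insensitive), restricted to
    interviewees whose metadata matches every key=value pair in filters."""
    needle = keyword.lower()
    hits = []
    for u in utterances:
        meta = INTERVIEWEES.get(u.doc_id, {})
        if any(meta.get(key) != value for key, value in filters.items()):
            continue  # interviewee is outside the requested subset
        if needle in u.text.lower():
            hits.append(u)
    return hits


# Invented sample utterances, not taken from the collection.
corpus = [
    Utterance("SN30", 1, "We went to chapel every Sunday."),
    Utterance("SN31", 1, "Father never went to chapel."),
    Utterance("SN31", 2, "I started work at fourteen."),
]

for hit in keyword_search(corpus, "chapel", gender="female"):
    print(hit.doc_id, hit.line_id, hit.text)
```

Keeping the demographic fields in a separate table keyed by transcript ID means a filter is applied once per interviewee rather than being repeated on every utterance.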
5. Family Life and Work Experience Before 1918 project
- Life history interviews: classic sociological study of Edwardian society by Professor Paul Thompson
- One of our larger datasets, conducted in the early seventies; nearly 500 interviews completed
- Of value because of scope and diversity: cross-national sample of people born in Britain before 1918
- Broadly representative of qualitative interview data
- Data exists in various formats in various locations
  - originally recorded on audio tapes
  - transcribed as typed paper documents
  - includes supporting source materials: essays, letters
- Texts coded in thematic analysis of content
- Paper source has proved very popular for reuse
6. Example of Interview Text
- interviews up to 100,000 words
- 8 hrs of audio tape
- secondary source: transcription of the dialogue, with errors in interpretation
- no time indexes between sound and content
- loosely structured
- alternate speakers
7. Thematic Coding
- 1. Household
- 2. Domestic routine
- 3. Meals
- 4. Influence and discipline
- 5. Recreation in the home
- 6. Recreation outside the home
- 7. Weekend activities and religion
- 8. Politics
- 9. Parents' interests
- 10. Children's leisure
- 11. Community and social class
- 12. School
- 13. Work, except domestic service
- 14. Life after leaving school
- 15. Marriage
- 16. Childbirth, including sexual knowledge
- 18. Domestic service
- 19. Institutions and boarding schools
- 20. Occupational history
- Texts coded into broad themes in family life and work
- Coded, then extracts cut-and-pasted to separate filing system
- Coding systems vary in complexity
- Text coded by theme to assist research
  - management of dataset
  - more rigorous interpretation of text
8. Example of Thematic Coding
- Thematic sections of variable length
- May be overlapping
9. Why Preserve Thematic Coding?
- Preserving codes preserves record of primary interpretation of dataset, promotes openness in research: replication, confirmation, re-interpretation.
- Useful as retrieval aids for voluminous bodies of text?
  - Original cut-and-paste thematic segments proved important and popular finding aid for paper collections.
  - User familiarity: CAQDAS information retrieval and management
- Some limitations
  - Codes vary with content and individual coder's interpretation, so quality is variable
  - Coding is not a complete representation of thematic content; for example, not coded for migration or health
12. From Edwardians Online to Qualidata Online
- Expand number of accessible datasets
- Expanded online functionality
  - Ability to search across multiple datasets
  - Ability to filter on basic demographics (age, gender, residence, occupation)
  - Ability to combine keyword search and filter
- Standardise and automate transcript processing tools and procedures
13. Material from additional collections
- Mothers and Daughters by Mildred Blaxter (in Scots)
- 100 Families by Paul Thompson (without speaker tags)
- Key processing steps
  - Scan
  - OCR
  - Proof
  - Format
  - XML
14. Preparing data
- Prepare digital files in appropriate format
- OCR and manual tidy up
- Macros to prepare text for mark-up
- Assign line IDs, remove unicode
- Excel sheet to add speaker IDs (turn takers)
- Database to tag (code) lines by theme
- Scripts to transform docs to XML
- Scripts to process web retrievals
- VBScript to process retrieval request using XLink and XPointer
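The "assign line IDs, remove unicode" step might look something like the Python sketch below. The slide does not say what removing unicode involved, so this assumes it means reducing the text to plain ASCII; the project used macros rather than Python, and the function names `to_ascii` and `assign_line_ids` are hypothetical.

```python
import unicodedata

# Assumed substitutions for characters that OCR and word processors commonly
# introduce; the real macro rules are not given in the slides.
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u00bd": "1/2",                # vulgar fraction one half
}


def to_ascii(text):
    """Replace common typographic characters, then drop anything non-ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # NFKD splits accented letters into base letter plus combining mark,
    # so the encode step keeps the base letter and discards the mark.
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def assign_line_ids(lines, start=1):
    """Number each non-empty line, returning (line_id, cleaned_text) pairs."""
    numbered = []
    next_id = start
    for line in lines:
        cleaned = to_ascii(line).strip()
        if not cleaned:
            continue  # blank lines do not receive an ID
        numbered.append((next_id, cleaned))
        next_id += 1
    return numbered


for line_id, text in assign_line_ids(["\u201cYes.\u201d", "", "caf\u00e9 at 10 \u00bd"]):
    print(line_id, text)
```

The explicit replacement table runs before normalisation because compatibility decomposition alone would turn the fraction into digits around a non-ASCII slash, which the ASCII filter would then mangle.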
15. Getting from .tif
16. To basic XML
- <u id="96" who="subject">I would rather nae ken if I had cancer. I told my man that, I says "If I have cancer, don't tell me". I mean you might hae an idea yourself, but I wouldnae like to be telt. I told him that.</u>
- <u id="97" who="interviewer">And how has your own health been over the years?</u>
- <u id="98" who="subject">Och, up an' doon, y'ken .</u>
- <u id="99" who="interviewer">Any serious illness?</u>
- <u id="100" who="subject">No ... nae illnesses .. nae illness, ken, in that wey. Just once I took an afa' turn at [missing words] I couldnae get ... I wis aye sleepin' this tablets I got fae the doctor and I had to sign for this tablets .. I just couldnae keep awake.</u>
- <u id="101" who="interviewer">And did he say what it was?</u>
- <u id="102" who="subject">I canna mind now, it's that long ago.. But I was really bad at that time, otherwise now .. apart fae this broken arm and operations kidneys .. I had an operation for a cyst.</u>
- <u id="103" who="interviewer">Uh-huh.. was it ... epileptic, was she?</u>
- <u id="104" who="subject">An this shoulder.. I couldnae move it..</u>
- <u id="105" who="interviewer">Uh-huh .. a joint? Seized up ... and how long ago was that?</u>
- <u id="106" who="subject">Well, it'll be.. ten year this 8th June. It was the same day as Robert Kennedy was killed. that's how I ken. I was goin' into hospital in the mornin an I mind tellin the patients [missing words] "What a shame, Robert Kennedy's been shot, an' killed" [missing words]
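Utterance-level mark-up of this kind is straightforward to read back with a standard XML parser. The Python sketch below uses three of the utterances shown above; the wrapping `<transcript>` root element and the function name `load_utterances` are assumptions, since the slide shows only the `<u>` elements.

```python
import xml.etree.ElementTree as ET

# Three utterances from the slide, wrapped in an assumed root element so
# that the fragment is a well-formed XML document.
SAMPLE = """<transcript>
<u id="97" who="interviewer">And how has your own health been over the years?</u>
<u id="98" who="subject">Och, up an' doon, y'ken .</u>
<u id="99" who="interviewer">Any serious illness?</u>
</transcript>"""


def load_utterances(xml_text):
    """Return (id, speaker, text) tuples for every <u> element, in order."""
    root = ET.fromstring(xml_text)
    return [
        (int(u.get("id")), u.get("who"), (u.text or "").strip())
        for u in root.iter("u")
    ]


for line_id, who, text in load_utterances(SAMPLE):
    print(f"{line_id:>4} {who:<12} {text}")
```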
17. Word document created from OCR
18. Issues in scanning and OCR
- Scanning done at 300 dpi, grey scale
- OCR varies hugely with quality of original; special challenges include (but are not limited to)
  - Character recognition
  - Stray marks on page
  - Missing words
  - Interviewers' notes
  - Creative character interpretation: section breaks, font changes, footnotes, super- and sub-scripts, and so on.
- Partially automated with macros, but much judgement (clerical and research) still required
19. OCR and manual tidy up
- Work required to digitise older typeface not to be underestimated
- Average of 12 hours clerical labour to prepare a 70-page document
- Apply macros in UltraEdit or Excel to remove page nos, speaker line breaks etc.
- 040
- Mrs Florrie D., Wootton. Father, farm worker. B. 1892.
- Your name is Mrs Florence is it?
- Florrie - yes.
- And you live at 13?
- Castle Road. Wooton.
- And you're a widow?
- Yes.
- And the year of your marriage was 1911?
- Yes.
- And the year of birth 28th July 1892?
- Yes.
- And that was at Whichford?
20. Final Word file (human and Excel readable)
21. Notes for transcription proofing
- Main aim is to check that the speech flows and reads properly and that there are no missing sections of text. In addition:
  - Each individual stream of speech should be a continuous line followed by a carriage return.
  - Obvious spelling mistakes to be corrected, with the following exceptions:
    - Peculiar or unrecognized spellings that should be left as they are include proper names, place names and obvious cases where the original transcriber was trying to indicate the phonetics of the speech.
    - E.g. "never mixed with a lot of em".
  - The spelling of proper names such as place names, person names should be consistent.
  - Poor grammar is typically the interview content so ignore this.
  - Page numbers.
22. Transcription editing guidelines
- Basic editing (spelling, punctuation)
- Research editing
  - Interviewers' annotations
  - Text supplemental to transcript itself
- Editing to conform with Excel and XML
  - MUST have tabs between speaker tags and utterance, BUT no other tabs
  - Handling special characters (10 ½ or ten and a half or 10.5?)
  - Replace double spacing with paragraph formatting to create extra space at end of paragraph
- See handout
  - Qualidata Transcription Editing Guidelines
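The tab rule lends itself to an automated check before a file goes into Excel. The Python sketch below is one way such a check could work, not a tool described in the slides; the function name `check_tabs` and the speaker tags `INT` and `SUBJ` are hypothetical.

```python
def check_tabs(lines):
    """Return (line_number, reason) for each line that breaks the rule of
    exactly one tab, separating the speaker tag from the utterance."""
    problems = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # blank lines are ignored
        tabs = line.count("\t")
        if tabs != 1:
            problems.append((number, f"expected 1 tab, found {tabs}"))
            continue
        tag, utterance = line.split("\t")
        if not tag.strip() or not utterance.strip():
            problems.append((number, "empty speaker tag or utterance"))
    return problems


sample = [
    "INT\tAnd you're a widow?",   # valid
    "SUBJ Yes.",                  # space used instead of a tab
    "INT\tAnd\tthe year?",        # stray tab inside the utterance
]
for number, reason in check_tabs(sample):
    print(f"line {number}: {reason}")
```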
23. Transformation of transcripts to XML
- Export key fields to Excel sheet
  - Doc ID, Line ID
- Text is marked up at the utterance level
- Excel macros create marked-up transcripts from tab-delimited file
  - SN30.xml
  - SN31.xml
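The project did this transformation with Excel macros. As a rough equivalent, the Python sketch below turns a tab-delimited export into utterance-level XML. The column order (line ID, speaker, text), the `<transcript>` root element, and the function name `rows_to_xml` are assumptions made for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(tsv_text, doc_id):
    """Build a transcript document with one <u> element per input row.

    Each row is assumed to hold: line ID, speaker, utterance text.
    """
    root = ET.Element("transcript", {"id": doc_id})
    # QUOTE_NONE keeps quotation marks in the speech as literal characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines in the export
        line_id, who, text = row
        u = ET.SubElement(root, "u", {"id": line_id, "who": who})
        u.text = text  # ElementTree escapes &, < and > on output
    return ET.tostring(root, encoding="unicode")


export = "1\tinterviewer\tAnd you're a widow?\n2\tsubject\tYes.\n"
print(rows_to_xml(export, "SN30"))
```

Building the elements through an XML library rather than by string concatenation means special characters in the speech cannot break the mark-up.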
25. Using Excel macros to create XML transcript
26. New tags for searching on demographic variables
27. Handling unique features
- None of the Edwardians or 100 Families transcripts have speaker tags
- Need some way of indicating who is speaking when search results are returned
- Turn takers (usually transcribed)
- Logic test to assign interviewer/subject based on end-of-line character
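The slides do not spell out the logic test. One plausible reading, assumed here, is that a line ending in a question mark is attributed to the interviewer and any other line to the subject. The Python sketch below applies that rule to the opening lines of interview 040 and then flags places where the same speaker appears twice in a row, which would need a manual check. The function names are hypothetical.

```python
def assign_speakers(lines):
    """Tag each line by its final character (assumed rule: '?' = interviewer)."""
    tagged = []
    for line in lines:
        text = line.strip()
        who = "interviewer" if text.endswith("?") else "subject"
        tagged.append((who, text))
    return tagged


def turn_taking_breaks(tagged):
    """Return indexes where a speaker follows themselves, breaking alternation."""
    return [
        i for i in range(1, len(tagged))
        if tagged[i][0] == tagged[i - 1][0]
    ]


lines = [
    "Your name is Mrs Florence is it?",
    "Florrie - yes.",
    "And you live at 13?",
    "Castle Road. Wooton.",
    "And you're a widow?",
    "Yes.",
]
tagged = assign_speakers(lines)
for who, text in tagged:
    print(f"{who:<12} {text}")
print("lines to check:", turn_taking_breaks(tagged))
```

A rule this simple will misattribute interviewer statements and subject questions, so the alternation check is what catches most of the errors it makes.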
28. Check turn-taking with no speaker tags
29-33. Screenshots for quali online
34. Thematic coding: stand-off architecture in XML
- Challenges for developing an XML application included the multiple hierarchies in the transcript texts and overlapping fields or elements
  - dialogue structure v thematic content
- Conventional mark-up of these structures in a single document violates nesting rules of XML
- Solution: stand-off annotation approach, whereby data and coding are stored in different documents (annotation linked by XLink and XPointers)
- Proven utility as method for annotating multi-coded dialogue corpora
  - allows for multiple coding schemes
  - accommodates overlapping elements
  - easily extendable
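The idea behind stand-off annotation can be shown without any XML at all. In the Python sketch below, the base text is a table of utterances and each theme is a separate list of spans that point into it by utterance ID, so overlapping themes never have to nest inside one another. The `Span` class, the placeholder utterances, and the span boundaries are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    theme: str
    start: int  # first utterance ID covered
    end: int    # last utterance ID covered (inclusive)


# Base layer: utterance ID -> text (placeholder content).
BASE = {n: f"utterance {n}" for n in range(1, 9)}

# Annotation layer, stored apart from the text. "household" and "work"
# overlap on utterances 3-4, which inline nested tags could not express.
ANNOTATIONS = [
    Span("household", 1, 4),
    Span("work", 3, 6),
    Span("politics", 6, 8),
]


def themes_at(utterance_id, annotations):
    """Return the sorted themes whose spans cover the given utterance."""
    return sorted({s.theme for s in annotations
                   if s.start <= utterance_id <= s.end})


def extract(theme, base, annotations):
    """Return the text of every utterance coded with the given theme."""
    ids = sorted({n for s in annotations if s.theme == theme
                  for n in range(s.start, s.end + 1) if n in base})
    return [base[n] for n in ids]


print(themes_at(3, ANNOTATIONS))
print(extract("work", BASE, ANNOTATIONS))
```

Adding a new coding scheme means adding spans to the annotation layer; the base text is never edited.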
35. Example of Stand-off XML Architecture
- Base-line text unit: utterances (<U>)
- <U> attributes
  - id
  - speaker
  - start time (audio file)
  - end time (audio file)
- Theme layers shown in diagram: work, household, politics
38. Applying thematic coding
- Annotator tool in MS Access using append query (appends lines to table)
  - For each transcript, save as section table
  - Add to total section table (cum. Line ID)
  - Export table as tab-delimited file
- Perl scripts create files with pointers to relevant parts of text for EACH transcript
  - householdSN30.xml
  - childbirthSN30.xml
  - householdSN31.xml
  - childbirthSN31.xml
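The project generated these pointer files with Perl scripts, which are not shown. The Python sketch below illustrates the same kind of step under stated assumptions: the exported table is taken to have four columns (document ID, theme, start line, end line), the `FLWE` identifier prefix and the `N to M` pointer notation are copied from the example theme file in this presentation, the `#` before `xpointer` is assumed, and the standard XLink namespace URI is used. The function name `build_theme_files` is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import defaultdict

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)  # serialise with the xlink: prefix


def build_theme_files(tsv_text, study="FLWE"):
    """Group coded line ranges by (theme, document) and return a dict
    mapping an output file name to its XML text."""
    grouped = defaultdict(list)
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if not row:
            continue
        doc_id, theme, start, end = row  # assumed column order
        grouped[(theme, doc_id)].append((int(start), int(end)))

    files = {}
    for (theme, doc_id), ranges in grouped.items():
        tag = theme.capitalize()
        root = ET.Element(f"{tag}_set", {"id": f"{study}{doc_id}"})
        for n, (start, end) in enumerate(sorted(ranges), start=1):
            ET.SubElement(root, tag, {
                "id": f"{study}{doc_id}_{n}",
                f"{{{XLINK}}}type": "simple",
                f"{{{XLINK}}}href": f"{doc_id}.xml#xpointer({start} to {end})",
            })
        files[f"{theme}SN{doc_id}.xml"] = ET.tostring(root, encoding="unicode")
    return files


table = "30\tchildbirth\t53\t54\n30\tchildbirth\t257\t264\n31\thousehold\t1\t12\n"
for name, xml_text in build_theme_files(table).items():
    print(name)
    print(xml_text)
```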
40. Theme x-query files
- File: Childbirth30.xml
- <Childbirth_set id="FLWE30" xmlns:xlink="http://www.w3.org/1999/xlink/">
- <Childbirth id="FLWE30_1" xlink:type="simple" xlink:href="30.xml#xpointer(53 to 54)" />
- <Childbirth id="FLWE30_2" xlink:type="simple" xlink:href="30.xml#xpointer(257 to 264)" />
- </Childbirth_set>
- Theme ID: Childbirth
- Document ID: 30
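Resolving such a theme file against a transcript amounts to reading each pointer and pulling out the utterances in that line range. The pilot did this with VBScript on the web server; the Python sketch below only illustrates the logic. It assumes the pointer takes the form `document#xpointer(N to M)`, which is this project's own notation rather than standard XPointer syntax, and the transcript content is a placeholder. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed pointer format: "<document>#xpointer(<start> to <end>)".
POINTER = re.compile(r"^(?P<doc>[^#]+)#xpointer\((?P<start>\d+) to (?P<end>\d+)\)$")

THEME = """<Childbirth_set id="FLWE30" xmlns:xlink="http://www.w3.org/1999/xlink/">
<Childbirth id="FLWE30_1" xlink:type="simple" xlink:href="30.xml#xpointer(53 to 54)" />
<Childbirth id="FLWE30_2" xlink:type="simple" xlink:href="30.xml#xpointer(257 to 264)" />
</Childbirth_set>"""

# Placeholder transcript; real files hold the interview utterances.
TRANSCRIPT = "<transcript>" + "".join(
    f'<u id="{n}" who="subject">line {n}</u>' for n in range(52, 56)
) + "</transcript>"


def read_pointers(theme_xml):
    """Yield (document, start_line, end_line) for each link in a theme file."""
    root = ET.fromstring(theme_xml)
    for link in root:
        for name, value in link.attrib.items():
            # Match the href attribute whatever namespace URI was declared.
            if name == "href" or name.endswith("}href"):
                match = POINTER.match(value)
                if match:
                    yield match["doc"], int(match["start"]), int(match["end"])


def extract(transcript_xml, start, end):
    """Return the text of utterances whose id lies in [start, end]."""
    root = ET.fromstring(transcript_xml)
    return [u.text for u in root.iter("u") if start <= int(u.get("id")) <= end]


for doc, start, end in read_pointers(THEME):
    print(doc, start, end, extract(TRANSCRIPT, start, end))
```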
43. Files required for web query
- Web Directories
  - \Documents (for indexing only)
    - SN30.asp (but could be xml files)
  - \Transcripts (for web xml retrieval)
    - SN30.xml
  - \Themes
    - childbirthSN30.xml (X-query pointers to the relevant parts of xml docs)
44. Phase II functionality and beyond
- Will be adding
  - Boolean searching to view overlapping themes
  - Key word in theme search
  - Add in to text pointers to other materials: notes, researchers' annotations, audio, pictures, geo-references.
- Will investigate/would like to develop
  - neat tool sets for publishing and querying data
  - Enable simultaneous manipulation and display of quantitative data, e.g. via the NESSTAR system
  - Document and thematic coding on-line
  - New code retrieval on the fly
  - Linked thesaurus tools