From Edwardians Online - PowerPoint PPT Presentation

Title: From Edwardians Online


1
  • From Edwardians Online to Qualidata Online
  • Preparing data for online access
  • Libby Bishop, ESDS Qualidata, Economic and Social
    Data Service, UK Data Archive
  • Online Access to Qualitative Data: Opportunities
    and Challenges
  • Friday 5 December, 2003
  • Royal Statistical Society

2
Towards a standard format for qualitative data
resources
  • data needs to be preserved in a uniform resource
    format
  • easier for provider (maintenance, tools,
    interchange)
  • easier for user (consistency across data sets)
  • DDI provides an XML framework for survey content
    (variables), but there is currently no suitable
    standard format for the content of qualitative data
  • need a comprehensive application that will
    enable:
  • data interchange
  • sophisticated on-line searching
  • retrieval from encoded texts

The Edwardians Online Pilot
3
Edwardians Online Development work
  • A six month investigative project towards
    developing such a framework in a specific
    resource creation project
  • Data
  • Models using XML standards technologies
  • New functionality
  • Data coding methods

4
Basic search and retrieve functionality
  • developed online querying function based on
    annotation of texts and themes in XML
  • keyword search of interview summaries database
  • keyword search of interview full transcripts
  • search or browse by themes from list - retrieve
    extracts of text in particular documents coded by
    that theme
  • jump from extract to view in full document
  • filter searches on subsets of interviewees, e.g.
    age, gender
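
The combination of keyword search and demographic filtering listed above can be sketched in a few lines of Python. This is only an illustration, not the project's implementation: the `Interview` record, its field names and the two sample interviews are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Interview:
    # Hypothetical record layout; the real metadata fields are not given on the slides.
    doc_id: str
    gender: str
    birth_year: int
    utterances: list[str]


def search(interviews, keyword, gender=None, born_before=None):
    """Return (doc_id, line_number, text) for utterances containing the keyword."""
    needle = keyword.lower()
    hits = []
    for iv in interviews:
        # Apply the demographic filters before looking at any text.
        if gender is not None and iv.gender != gender:
            continue
        if born_before is not None and iv.birth_year >= born_before:
            continue
        for n, text in enumerate(iv.utterances, start=1):
            if needle in text.lower():
                hits.append((iv.doc_id, n, text))
    return hits


# Invented sample data, for illustration only.
corpus = [
    Interview("SN30", "F", 1892, ["And you're a widow?", "Yes.", "Father was a farm worker."]),
    Interview("SN31", "M", 1901, ["My father worked on the farm too."]),
]
results = search(corpus, "farm", gender="F")
```

The line number returned with each hit is what would let a front end jump from an extract to its place in the full document.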

5
Family Life and Work Experience Before 1918
project
  • Life history interviews - classic sociological
    study of Edwardian Society by Professor Paul
    Thompson
  • One of our larger datasets, conducted in the early
    seventies - nearly 500 interviews completed
  • Of value because of scope and diversity:
    cross-national sample of people born in Britain
    before 1918
  • Broadly representative of qualitative interview
    data
  • Data exists in various formats in various
    locations: originally recorded on audio tapes,
    transcribed as typed paper documents; includes
    supporting source materials - essays, letters
  • Texts coded in thematic analysis of content
  • Paper source has proved to be very popular for
    reuse

6
Example of Interview Text
  • interviews up to 100,000 words
  • 8hrs of audio tape
  • secondary source transcription of the dialogue:
    errors in interpretation
  • no time indexes between sound content
  • loosely structured
  • alternate speakers

7
Thematic Coding
  1. Household
  2. Domestic routine
  3. Meals
  4. Influence and discipline
  5. Recreation in the home
  6. Recreation outside the home
  7. Weekend activities and religion
  8. Politics
  9. Parents' interests
  10. Children's leisure
  11. Community and social class
  12. School
  13. Work, except domestic service
  14. Life after leaving school
  15. Marriage
  16. Childbirth - including sexual knowledge
  18. Domestic service
  19. Institutions and boarding schools
  20. Occupational history
  • Texts coded into broad themes in family life and
    work
  • Coded, then extracts cut and pasted into a
    separate filing system
  • Coding systems vary in complexity
  • Text coded by theme to assist research:
  • management of dataset
  • more rigorous interpretation of text

8
Example of Thematic Coding
  • Thematic sections of variable length
  • May be overlapping

9
Why Preserve Thematic Coding?
  • Preserving codes preserves a record of the primary
    interpretation of the dataset and promotes
    openness in research:
  • replication, confirmation, re-interpretation.
  • Useful as retrieval aids for voluminous bodies of
    text?
  • Original cut-and-paste thematic segments proved
    important and popular finding aid for paper
    collections.
  • User familiarity: CAQDAS information retrieval
    and management
  • Some limitations
  • Codes vary with content and individual coders'
    interpretation, so quality is variable
  • Coding is not a complete representation of
    thematic content; for example, not coded for
    migration or health

10
(No Transcript)
11
(No Transcript)
12
From Edwardians Online to Qualidata Online
  • Expand number of accessible datasets
  • Expanded online functionality
  • Ability to search across multiple datasets
  • Ability to filter on basic demographics (age,
    gender, residence, occupation)
  • Ability to combine keyword search and filter
  • Standardise and automate transcript processing
    tools and procedures

13
Material from additional collections
  • Mothers and Daughters by Mildred Blaxter (in
    Scots)
  • 100 Families by Paul Thompson (without speaker
    tags)
  • Key processing steps
  • Scan
  • OCR
  • Proof
  • Format
  • XML

14
Preparing data
  • Prepare digital files in appropriate format
  • OCR and manual tidy up
  • Macros to prepare text for mark-up
  • Assign line IDs; remove Unicode (non-ASCII) characters
  • Excel sheet to add speaker IDs (turn takers)
  • Database to tag (code) lines by theme
  • Scripts to transform docs to XML
  • Scripts to process web retrievals
  • VB Script to process retrieval request using
    XLink and XPointer
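
As a rough illustration of the "assign line IDs, remove Unicode" step, the Python sketch below maps a few typographic characters to ASCII, drops whatever non-ASCII remains, and numbers the surviving lines. The slides describe macros, not Python, so this is a stand-in; the function names and the replacement table (including writing ½ as 1/2) are assumptions made for the example.

```python
import unicodedata

# Assumed replacements, applied before stripping so apostrophes and fractions survive.
REPLACEMENTS = str.maketrans({
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u00bd": "1/2",                # one possible convention for the half sign
})


def to_ascii(text):
    """Map known characters, decompose accents, then drop anything non-ASCII."""
    text = text.translate(REPLACEMENTS)
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def assign_line_ids(lines, start=1):
    """Clean each line, skip empty ones, and pair the rest with sequential IDs."""
    cleaned = (to_ascii(line).strip() for line in lines)
    return list(enumerate((line for line in cleaned if line), start=start))


numbered = assign_line_ids(["Och, up an\u2019 doon", "", "  Yes. "])
```

Mapping characters explicitly before stripping matters for dialect transcripts, where a silently dropped apostrophe changes the spelling the transcriber intended.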

15
Getting from .tif
16
To basic XML
  • <u id="96" who="subject">I would rather nae ken
    if I had cancer. I told my man that, I says "If I
    have cancer, don't tell me". I mean you might
    hae an idea yourself, but I wouldnae like to be
    telt. I told him that.</u>
  • <u id="97" who="interviewer">And how has your own
    health been over the years?</u>
  • <u id="98" who="subject">Och, up an' doon,
    y'ken.</u>
  • <u id="99" who="interviewer">Any serious
    illness?</u>
  • <u id="100" who="subject">No ... nae illnesses
    .. nae illness, ken, in that wey. Just once I
    took an afa' turn at missing words I couldnae
    get ... I wis aye sleepin' this tablets I got fae
    the doctor and I had to sign for this tablets ..
    I just couldnae keep awake.</u>
  • <u id="101" who="interviewer">And did he say what
    it was?</u>
  • <u id="102" who="subject">I canna mind now, it's
    that long ago.. But I was really bad at that
    time, otherwise now .. apart fae this broken arm
    and operations kidneys .. I had an operation for
    a cyst.</u>
  • <u id="103" who="interviewer">Uh-huh.. was it ...
    epileptic, was she?</u>
  • <u id="104" who="subject">An this shoulder.. I
    couldnae move it..</u>
  • <u id="105" who="interviewer">Uh-huh .. a joint?
    Seized up ... and how long ago was that?</u>
  • <u id="106" who="subject">Well, it'll be.. ten
    year this 8th June. It was the same day as
    Robert Kennedy was killed. that's how I ken. I
    was goin' into hospital in the mornin an I mind
    tellin the patients missing words "What a
    shame, Robert Kennedy's been shot, an' killed"
    missing words

17
Word document created from OCR
18
Issues in scanning and OCR
  • Scanning done at 300 dpi, grey scale
  • OCR varies hugely with quality of original;
    special challenges include (but are not limited
    to):
  • Character recognition
  • Stray marks on page
  • Missing words
  • Interviewers' notes
  • Creative character interpretation: section
    breaks, font changes, footnotes, super- and
    sub-scripts, and so on.
  • Partially automated with macros, but much
    judgement (clerical and research) still required

19
OCR and manual tidy up
  • Work required to digitise older typefaces is not
    to be underestimated
  • Average of 12 hours clerical labour to prepare a
    70 page document
  • Apply macros in UltraEdit or Excel to remove
    page nos, speaker line breaks etc.
  • 040
  • Mrs Florrie D., Wootton. Father, farm worker. B.
    1892.
  • Your name is Mrs Florence is it?
  • Florrie - yes.
  • And you live at 13?
  • Castle Road. Wooton.
  • And you're a widow?
  • Yes.
  • And the year of your marriage was 1911?
  • Yes.
  • And the year of birth 28th July 1892?
  • Yes.
  • And that was at Whichford?

20
Final Word file (human and Excel readable)
21
Notes for transcription proofing
  • Main aim is to check that the speech flows and
    reads properly and that there are no missing
    sections of text. In addition:
  • Each individual stream of speech should be a
    continuous line followed by a carriage return.
  • Obvious spelling mistakes to be corrected, with
    the following exceptions:
  • Peculiar or unrecognized spellings that should be
    left as they are include proper names, place
    names and obvious cases where the original
    transcriber was trying to indicate the phonetics
    of the speech.
  • E.g. never mixed with a lot of em.
  • The spelling of proper names such as place
    names, person names should be consistent.
  • Poor grammar is typically part of the interview
    content, so ignore it.
  • Page numbers.

22
Transcription editing guidelines
  • Basic editing (spelling, punctuation)
  • Research editing
  • Interviewers' annotations
  • Text supplemental to transcript itself
  • Editing to conform with Excel and XML
  • MUST have tabs between speaker tags and
    utterance, BUT no other tabs
  • Handling special characters (10 ½ or ten and a
    half or 10.5?)
  • Replace double spacing with paragraph formatting
    to create extra space at end of paragraph
  • See handout: Qualidata Transcription Editing
    Guidelines

23
Transformation of transcripts to XML
  • Export key fields to Excel sheet
  • Doc ID, Line ID
  • Text is marked up at the utterance level
  • Excel macros create marked up transcripts from
    tab delimited file
  • SN30.xml
  • SN31.xml
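
For readers without the Excel macros to hand, the transformation from a tab-delimited file to utterance-level mark-up might look like the Python sketch below. It is a stand-in for the macros, not a copy of them: the three-column layout (line ID, speaker, text) and the `transcript` wrapper element are assumptions, while the `u` element with `id` and `who` attributes follows the basic XML slide.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rows_to_xml(tsv_text, doc_id):
    """Build one <u> element per tab-delimited row: line ID, speaker, text."""
    root = ET.Element("transcript", {"id": doc_id})  # wrapper element is assumed
    # QUOTE_NONE keeps quotation marks inside the speech as ordinary characters.
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        line_id, who, text = row  # fails loudly if a stray tab adds a column
        utterance = ET.SubElement(root, "u", {"id": line_id, "who": who})
        utterance.text = text
    return ET.tostring(root, encoding="unicode")


sample = "98\tsubject\tOch, up an' doon, y'ken.\n99\tinterviewer\tAny serious illness?\n"
xml_out = rows_to_xml(sample, "SN30")
```

The strict three-way unpacking is deliberate: it turns the editing rule about having no tabs other than the one after the speaker tag into an error at conversion time.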

24
(No Transcript)
25
Using Excel macros to create XML transcript
26
New tags for searching on demographic variables
27
Handling unique features
  • None of the Edwardians or 100 Families
    transcripts have speaker tags
  • Need some way of indicating who is speaking when
    search results are returned
  • Turn takers (usually transcribed)
  • Logic test to assign interviewer/subject based on
    end of line character
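
One plausible reading of that logic test is sketched below in Python. The slides do not say which end-of-line character is used, so treating a closing question mark as the sign of an interviewer turn is an assumption, as are the function names. The second function flags places where the same speaker appears twice in a row, which is where a manual turn-taking check would be worth making.

```python
def assign_speakers(utterances):
    """Label a turn 'interviewer' if it ends with a question mark, else 'subject'."""
    tagged = []
    for text in utterances:
        who = "interviewer" if text.rstrip().endswith("?") else "subject"
        tagged.append((who, text))
    return tagged


def suspicious_turns(tagged):
    """Return indexes where the speaker does not alternate, for manual review."""
    return [i for i in range(1, len(tagged)) if tagged[i][0] == tagged[i - 1][0]]


# Opening lines of the sample interview shown on the OCR tidy-up slide.
lines = [
    "Your name is Mrs Florence is it?",
    "Florrie - yes.",
    "And you live at 13?",
    "Castle Road. Wooton.",
]
tagged = assign_speakers(lines)
to_check = suspicious_turns(tagged)
```

A rule this simple will mislabel subjects who answer with a question and interviewers who make statements, so the flagged indexes are a prompt for proofreading, not a substitute for it.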

28
Check turn-taking with no speaker tags
29
Screenshots for quali online
30
Screenshots for quali online
31
Screenshots for quali online
32
Screenshots for quali online
33
Screenshots for quali online
34
Thematic coding stand-off Architecture in XML
  • Challenges for developing an XML application
    included the multiple hierarchies in the
    transcript texts and overlapping fields or
    elements
  • dialogue structure v thematic content
  • Conventional mark-up of these structures in a
    single document violates nesting rules of XML
  • Solution - stand-off annotation approach
    whereby data and coding are stored in different
    documents (annotation linked by XLink and
    XPointer)
  • Proven utility as a method for annotating
    multi-coded dialogue corpora:
  • allows for multiple coding schemes
  • accommodates overlapping elements
  • easily extendable

35
Base-line text unit: utterances (<U>)
Theme work
Theme household
  • <U> attributes
  • id
  • speaker
  • start time (audio file)
  • end time (audio file)

Theme politics
Example of Stand-off XML Architecture
36
(No Transcript)
37
(No Transcript)
38
Applying thematic coding
  • Annotator tool in MS Access using append query
    (appends lines to table)
  • For each transcript, save as section table
  • Add to total section table (cum. Line ID)
  • Export table as tab delimited file
  • Perl scripts create files with pointers to
    relevant parts of the text for EACH transcript
  • householdSN30.xml
  • childbirthSN30.xml
  • householdSN31.xml
  • childbirthSN31.xml
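
The pointer-file step can be illustrated in Python (the project itself used Perl). The sketch groups coded line ranges by theme and document and writes one small XML document per pair. The row layout, the `FLWE` ID prefix default and the `#` separator inside the href are assumptions based on the theme file slide; the namespace used here is the standard W3C XLink URI.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XLINK = "http://www.w3.org/1999/xlink"
ET.register_namespace("xlink", XLINK)


def build_theme_files(rows, prefix="FLWE"):
    """rows: (doc_id, theme, start_line, end_line). Returns {filename: xml_text}."""
    grouped = defaultdict(list)
    for doc_id, theme, start, end in rows:
        grouped[(theme, doc_id)].append((start, end))

    files = {}
    for (theme, doc_id), spans in grouped.items():
        tag = theme.capitalize()
        root = ET.Element(f"{tag}_set", {"id": f"{prefix}{doc_id}"})
        for n, (start, end) in enumerate(sorted(spans), start=1):
            # Each child points at a line range in the transcript, not at a copy of it.
            ET.SubElement(root, tag, {
                "id": f"{prefix}{doc_id}_{n}",
                f"{{{XLINK}}}type": "simple",
                f"{{{XLINK}}}href": f"{doc_id}.xml#xpointer({start} to {end})",
            })
        files[f"{theme}SN{doc_id}.xml"] = ET.tostring(root, encoding="unicode")
    return files


coded_rows = [
    ("30", "childbirth", 257, 264),
    ("30", "childbirth", 53, 54),
    ("30", "household", 1, 20),
]
theme_files = build_theme_files(coded_rows)
```

Because the transcript is never modified, a new coding scheme only means writing more pointer files, and overlapping themes cause no nesting conflicts.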

39
(No Transcript)
40
Theme x-query files
  • File: Childbirth30.xml
  • <Childbirth_set id="FLWE30"
    xmlns:xlink="http://www.w3.org/1999/xlink/">
  •   <Childbirth id="FLWE30_1" xlink:type="simple"
    xlink:href="30.xml#xpointer(53 to 54)" />
  •   <Childbirth id="FLWE30_2" xlink:type="simple"
    xlink:href="30.xml#xpointer(257 to 264)" />
  •   </Childbirth_set>
  • Theme ID: Childbirth
  • Document ID: 30
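
To show how a theme file like this could be resolved into text extracts, here is a minimal Python sketch (the project's retrieval ran as VB Script). It assumes the href has the form `30.xml#xpointer(53 to 54)`, that transcripts hold `u` elements with numeric `id` attributes, and that the files use the standard XLink namespace URI; the `resolve` function and the tiny sample documents are invented for the example.

```python
import re
import xml.etree.ElementTree as ET

XLINK = "{http://www.w3.org/1999/xlink}"  # must match the URI declared in the files
POINTER = re.compile(r"^(?P<doc>[^#]+)#xpointer\((?P<start>\d+) to (?P<end>\d+)\)$")


def resolve(theme_xml, transcripts):
    """Return (pointer_id, [utterance texts]) for each link in a theme file."""
    extracts = []
    for link in ET.fromstring(theme_xml):
        match = POINTER.match(link.get(XLINK + "href", ""))
        if not match:
            continue  # ignore links that do not follow the assumed pattern
        root = ET.fromstring(transcripts[match["doc"]])
        first, last = int(match["start"]), int(match["end"])
        lines = [u.text for u in root.iter("u") if first <= int(u.get("id")) <= last]
        extracts.append((link.get("id"), lines))
    return extracts


# Invented miniature documents, for illustration only.
theme_doc = (
    '<Childbirth_set id="FLWE30" xmlns:xlink="http://www.w3.org/1999/xlink">'
    '<Childbirth id="FLWE30_1" xlink:type="simple" xlink:href="30.xml#xpointer(2 to 3)"/>'
    "</Childbirth_set>"
)
transcript_docs = {
    "30.xml": (
        "<transcript>"
        '<u id="1" who="interviewer">A?</u><u id="2" who="subject">B.</u>'
        '<u id="3" who="interviewer">C?</u><u id="4" who="subject">D.</u>'
        "</transcript>"
    )
}
extracts = resolve(theme_doc, transcript_docs)
```

Returning the pointer ID alongside the text keeps each extract traceable to the coding decision that produced it.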

41
(No Transcript)
42
(No Transcript)
43
Files required for web query
  • Web Directories
  • \Documents (for indexing only)
  • - SN30.asp (but could be xml files)
  • \Transcripts (for web xml retrieval)
  • - SN30.xml
  • \Themes
  • - childbirthSN30.xml (X-query pointers to the
    relevant parts of xml docs)

44
Phase II functionality and beyond
  • Will be adding
  • Boolean searching to view overlapping themes
  • Key word in theme search
  • Add in-text pointers to other materials:
    notes, researchers' annotations, audio, pictures,
    geo-references.
  • Will investigate/would like to develop
  • neat tool sets for publishing and querying data
  • Enable simultaneous manipulation and display of
    quantitative data, e.g. via the NESSTAR system
  • Document and thematic coding on-line
  • New code retrieval on the fly
  • Linked thesaurus tools
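
Boolean searching over overlapping themes reduces to set operations on coded line ranges. The Python sketch below is one possible approach, not a description of the planned Phase II code; the function names and the second set of line ranges are invented for the example.

```python
def lines_covered(spans):
    """Expand inclusive (start, end) line ranges into a set of line IDs."""
    covered = set()
    for start, end in spans:
        covered.update(range(start, end + 1))
    return covered


def both_themes(spans_a, spans_b):
    """Line IDs coded with theme A AND theme B."""
    return sorted(lines_covered(spans_a) & lines_covered(spans_b))


def either_theme(spans_a, spans_b):
    """Line IDs coded with theme A OR theme B."""
    return sorted(lines_covered(spans_a) | lines_covered(spans_b))


childbirth = [(53, 54), (257, 264)]
marriage = [(50, 53), (260, 270)]  # invented ranges for illustration
overlap = both_themes(childbirth, marriage)
```

Working at the level of line IDs means the same functions apply whatever coding scheme produced the ranges.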