Text Mining for the Social Sciences

Transcript and Presenter's Notes

Title: Text Mining for the Social Sciences


1
Text Mining for the Social Sciences
  • Brian Rea and Sophia Ananiadou
  • National Centre for Text Mining (NaCTeM)
  • www.nactem.ac.uk
  • University of Manchester

2
Outline
  • What is Text Mining
  • Techniques and Solutions
  • How can the Social Sciences Benefit
  • Case Study: ASSERT Project
  • Future Opportunities
  • Conclusions

3
What is Text Mining
  • Text mining discovers and extracts information
    hidden in unstructured texts.
  • It helps to construct hypotheses based upon
    associations between the extracted information.
  • Because of this, it can often discover things
    overlooked by human readers.

4
What Text Mining is not
  • It is not Google!
  • Text mining is not based upon understanding of
    document content.
  • Instead, it predicts the most likely meaning of a
    fragment of text based upon language models.
  • Text mining will generally not pick up on
    sarcasm, irony or other subtleties of language
    usage.
  • Text mining tools must be tuned before use on
    different text types, styles or languages.

5
NaCTeM: Our Vision
  • to harness the synergy from service provision and
    user needs within biomedicine, allied to
    development and research within text mining
  • to provide a quality service underpinned by
    proven text mining tools / techniques, enabled via
    MIMAS, for the biosciences in the first instance
  • to consolidate existing knowledge, activities,
    and models and transfer them widely to other
    areas, e.g. humanities / social sciences /
    clinical, etc.

6
Gathering Information
  • Information retrieval
  • Gather, select and filter documents that may prove
    useful
  • Information extraction
  • Partial, shallow, deep language analysis
  • Find relevant entities, facts about entities
  • Resources: ontologies, lexicons, terminologies,
    thesauri, grammars, annotated corpora
  • Tools: document converters, sentence detectors,
    tokenisers, taggers, chunkers, parsers, NE
    recognisers, semantic analysers
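
To make the tool chain above concrete, here is a minimal sketch of the first few stages (sentence detector, tokeniser, tagger) chained together. It is not NaCTeM's software: the function names and the regex rules are illustrative assumptions, and the "tagger" only marks capitalised tokens as candidate names.

```python
import re

def split_sentences(text):
    """Naive sentence detector: split after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenise(sentence):
    """Naive tokeniser: words (allowing internal hyphens/apostrophes) or single punctuation marks."""
    return re.findall(r"\w+(?:[-']\w+)*|[^\w\s]", sentence)

def tag_capitalised(tokens):
    """Toy tagger (assumption): capitalised tokens are candidate names (CAP), all others O."""
    return [(tok, "CAP" if tok[:1].isupper() else "O") for tok in tokens]

def pipeline(text):
    """Run each stage in order; the output of one tool is the input of the next."""
    return [tag_capitalised(tokenise(s)) for s in split_sentences(text)]

result = pipeline("Text mining finds facts. Manchester hosts NaCTeM.")
for sentence in result:
    print(sentence)
```

Note that the toy tagger also marks the sentence-initial word "Text" as a candidate name, which is one small illustration of why real taggers and NE recognisers need the resources listed above.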

7
Information Retrieval
  • The base technology is similar to that used in
    common search engines.
  • This is greatly enhanced by text mining:
  • Moving away from common bag-of-words approaches
  • Indexing concepts, entities and facts
  • Using pre-processing to improve efficiency
  • We use this approach iteratively with other tools
    to minimise intensive runtime processing
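
As a rough illustration of indexing concepts rather than words, the sketch below maps different surface forms to one concept identifier at indexing time, so a concept query also finds documents that never use the query word. The lexicon, concept identifiers and documents are all hypothetical, and simple substring matching stands in for real term recognition.

```python
from collections import defaultdict

# Hypothetical concept lexicon: several surface forms share one concept id.
CONCEPTS = {
    "unemployment": "C_UNEMPLOYMENT",
    "joblessness": "C_UNEMPLOYMENT",
    "housing": "C_HOUSING",
}

def build_concept_index(docs):
    """Pre-processing step: map each concept id to the ids of documents mentioning it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        lowered = text.lower()
        for surface, concept in CONCEPTS.items():
            if surface in lowered:  # substring match, a stand-in for term recognition
                index[concept].add(doc_id)
    return index

docs = {
    1: "Unemployment rose among school leavers.",
    2: "Joblessness is linked to poor health.",
    3: "The survey covered housing policy.",
}
index = build_concept_index(docs)

# A plain keyword search for "unemployment" would miss document 2.
hits = sorted(index["C_UNEMPLOYMENT"])
print(hits)
```

Because the index is built once in advance, the query itself is a dictionary lookup, which is the sense in which pre-processing reduces runtime work.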

8
Terminology and Named-Entities
  • Extraction of named entities (names of
    organisations, people, locations), technical
    terms for any domain
  • Discovery of concepts allows semantic annotation
    of documents
  • Improves information access by going beyond index
    terms
  • Links same entities via co-reference (anaphora)
  • Enables semantic querying
  • Construction of concept networks from text
  • Allows clustering, classification of documents
  • Visualisation of knowledge maps

9
Terminology Management
  • Term clustering (linking semantically similar
    terms) and term classification (assigning terms
    to classes from a pre-defined classification
    scheme)
  • Possible applications
  • Metadata creation
  • Topic detection
  • Conceptual indexing (with facts, events)
  • Document clustering/classification

10
Information Extraction
  • Extraction of relationships (events and facts)
    for knowledge discovery
  • Information extraction, more sophisticated
    annotation of texts
  • Beyond named entities: facts, events
  • Fact: THE QUEEN and the Royal Family remain at
    Osborne
  • Event: THE DUKE, attended by the Hon. E.C. Yorke,
    honoured the Lyceum with his presence on Saturday
    evening, to witness the performance of The
    Bells.
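
The "fact" example on this slide can be turned into a structured record with a single hand-written pattern. This is only a sketch of pattern-based extraction: the pattern, the function name and the field names are assumptions made for illustration, and real systems rely on parsing and named-entity recognition rather than one regular expression.

```python
import re

# Illustrative pattern (assumption): "<who> remain(s) at <Place>".
RESIDENCE = re.compile(r"(?P<who>.+?)\s+remains?\s+at\s+(?P<where>[A-Z]\w+)")

def extract_residence_fact(sentence):
    """Return a fact record, or None when the pattern does not match."""
    match = RESIDENCE.search(sentence)
    if match is None:
        return None
    return {
        "type": "residence",
        "who": match.group("who"),
        "where": match.group("where"),
    }

fact = extract_residence_fact("THE QUEEN and the Royal Family remain at Osborne.")
print(fact)
```

Records of this shape are what make it possible to query for facts ("who was at Osborne?") instead of searching for words.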

11
Information Extraction Annotation
  • Annotated texts for semantic IR and IE
  • Example: the GENIA annotation editor
  • MeSH terms: Human, Blood Cells, and Transcription
    Factors.
  • Annotation: POS, named entity, parse tree
  • Any text may be annotated using named entities,
    facts, events
  • Consider searching for documents about things that
    kill cancer

12
GRID-based Development
  • 70 million seconds, that is, about 2 years

13
Using the GRID
  • Challenge
  • Analysis of very large data sets
  • Combining distributed resources
  • Solution
  • Large PC clusters
  • National infrastructure (NDS)
  • Grid enabled software (UIMA)
  • Experiments
  • The entire MEDLINE was parsed in 8 days
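
The two figures quoted on these slides can be checked with a few lines of arithmetic. The sketch assumes that the 70 million seconds is the single-processor estimate for the same MEDLINE parsing job that took 8 days on the cluster; the slides do not state this explicitly, so the speed-up figure is indicative only.

```python
SECONDS_PER_DAY = 24 * 60 * 60   # 86,400
DAYS_PER_YEAR = 365.25

estimated_seconds = 70_000_000   # figure quoted on the slide
cluster_days = 8                 # MEDLINE parsing time quoted on the slide

# Convert the estimate to days and years.
estimated_days = estimated_seconds / SECONDS_PER_DAY
estimated_years = estimated_days / DAYS_PER_YEAR

# Assumption: both figures describe the same job, so their ratio is the speed-up.
speedup = estimated_days / cluster_days

print(f"{estimated_days:.0f} days, about {estimated_years:.1f} years")
print(f"speed-up of roughly {speedup:.0f}x")
```

Under that assumption, the cluster finishes in 8 days what would otherwise take a little over two years, a speed-up on the order of a hundred.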

14
How the Social Sciences Can Benefit
  • Improved browsing and searching of domain-specific
    and general literature.
  • Assisted analysis of text-based qualitative data
  • Automated extraction of text-based quantitative
    data
  • Discovery of key concepts and relationships in
    literature, ideal for virtual learning
    environments
  • Discovery of trends and change over time

15
Case Study: ASSERT Project
  • Automatic Summarisation for Systematic
    Reviews using Text Mining
  • Engaging the user community: EPPI (Evidence for
    Policy and Practice Information and Co-ordinating
    Centre)
  • Document clustering/classification (see demo)
  • Information extraction
  • Summarisation
  • Visualisation

16
Systematic Reviews
  • First, extensive searches are carried out in
    order to locate as much relevant research as
    possible according to a query.
  • Then the mass of data retrieved by this process
    is screened until only the most relevant and
    reliable literature remains to form the focus of
    the review.
  • Finally, the literature is synthesised and
    summary reports are written to inform policy and
    practice by helping users of the research to make
    evidence-informed decisions.

17
Search Solutions
  • A combination of Web Crawl and Information
    Retrieval systems to allow for iteratively deeper
    searches.
  • Terminology Management to discover key concepts
    for later stages and to improve search criteria
  • Clustering Techniques to categorise and collate
    documents on similar subtopics
  • Visualisation to allow for improved usability and
    access to documents
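
As a sketch of how clustering can collate documents on similar subtopics, the example below groups short texts by the cosine similarity of their word counts, using a greedy single-pass scheme. This is not the method used in ASSERT (the slides do not say which algorithm it uses); the threshold, function names and sample documents are assumptions.

```python
import math
import re
from collections import Counter

def vectorise(text):
    """Represent a document as word counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def cluster(docs, threshold=0.3):
    """Greedy single-pass clustering: join the first cluster whose seed document is similar enough."""
    clusters = []  # each entry is (seed_vector, [document indices])
    for i, text in enumerate(docs):
        vec = vectorise(text)
        for seed, members in clusters:
            if cosine(seed, vec) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((vec, [i]))  # no similar cluster found: start a new one
    return [members for _, members in clusters]

docs = [
    "school exclusion and pupil behaviour",
    "pupil behaviour in secondary school",
    "smoking cessation among young adults",
    "young adults and smoking cessation programmes",
]
groups = cluster(docs)
print(groups)
```

In practice the word counts would be replaced by the key concepts found during terminology management, so that documents are grouped by what they are about rather than by shared function words.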

18
Screening Solutions
  • As above for retrieval, visualisation and access
    to subtopics
  • Named Entity Recognition
  • Fact and Event Extraction to provide key details
    to the reviewer
  • Summarisation techniques to identify significant
    sections of each document

19
Synthesis Solutions
  • Multi-Document Summarisation techniques to assist
    with comprehension of the subtopics.
  • Fact and Event databases to assist in examination
    and linking of evidence
  • Evidence retrieval and reference through existing
    systems to assist in report generation

20
Demonstration
  • ASSERT

21
Community Call
  • Additional funding by JISC to organise and
    support a community call for the social sciences.
  • 360K to be shared between at least 2 projects
    to support expansion of ASSERT and related tools
    for other social science research.
  • Specific focus on RoI from assistive technologies
    and significant benefit to existing research.

22
Looking into the future
  • NCeSS usability case studies
  • Close links with industry: AZ, IBM, Xerox
  • Involvement with UKPMC
  • Pilot project with the BBC
  • Not only science:
  • history, archaeology, humanities, business
  • Improving scalability through parallelisation and
    GRID computing
  • Further inclusion of Data Mining techniques
  • New and extended services for full text analysis

23
BBC pilot project
  • Analyse, structure and visualise BBC news online,
    according to a user's query, using advanced text
    mining techniques
  • Concept discovery and retrieval
  • The interface allows a user to enter a query across
    the document collection and automatically
    calculates a list of concepts specific to the
    query, ranked by perceived importance.
  • Creation of user oriented knowledge maps
  • Based on clusters of articles and their automatic
    concept categorisation.

24
Sentiment Analysis
  • Information from user surveys is an essential
    resource for obtaining opinions
  • Fixed style questionnaires make it hard to obtain
    free opinion whilst open questions are often hard
    to analyse consistently.

  • Example free-text response: Cellular phone ABC
    loses battery easily, but the style is very nifty.
    Back panel is sometimes heated up.
  • Extracted opinions:
  • Model: ABC
  • BAD: Lose battery
  • GOOD: style is nifty
  • MISC.: Back panel is heated

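
The mobile phone review on this slide can be used to sketch how free-text opinions might be sorted into GOOD, BAD and MISC slots. The cue-word lists and clause-splitting rule below are hypothetical and far smaller than anything a real system would use; the sketch only shows the general shape of lexicon-based opinion extraction.

```python
import re

# Hypothetical opinion cue words; a real lexicon would be much larger.
POSITIVE = {"nifty", "good", "great"}
NEGATIVE = {"loses", "bad", "poor"}

def classify_clauses(text):
    """Split text into clauses and file each under GOOD, BAD or MISC by cue words."""
    slots = {"GOOD": [], "BAD": [], "MISC": []}
    # Clause boundaries (assumption): sentence ends, semicolons, and ", but".
    for clause in re.split(r"[.;]\s*|,\s*but\s+", text):
        clause = clause.strip()
        if not clause:
            continue
        words = set(re.findall(r"[a-z]+", clause.lower()))
        if words & NEGATIVE:
            slots["BAD"].append(clause)
        elif words & POSITIVE:
            slots["GOOD"].append(clause)
        else:
            slots["MISC"].append(clause)
    return slots

review = ("Cellular phone ABC loses battery easily, but the style is very nifty. "
          "Back panel is sometimes heated up.")
slots = classify_clauses(review)
print(slots)
```

Open survey answers processed this way become countable, which addresses the consistency problem with open questions noted above, though sarcasm and irony would still be missed.
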
25
Epidemiology (NIBHI)
  • Combines techniques from
  • Systematic reviews
  • Information extraction
  • Knowledge management and inference
  • Evidence-informed hypothesis generation
  • Filtering to specific location or time period for
    detailed analysis

26
Conclusions
  • Text Mining has a proven track record in the
    sciences
  • The numerous techniques are fully extendable to
    social science challenges
  • The ASSERT project highlights some of the key
    benefits of this assistive technology
  • Usability of tools is a key challenge, one that
    is solvable with input from potential end-users
  • GRID computing is allowing for novel solutions to
    previously impossible challenges

27
How to contact us
  • Visit the Text Mining Centre Website at
  • http://www.nactem.ac.uk
  • brian.rea_at_manchester.ac.uk
  • sophia.ananiadou_at_manchester.ac.uk