Michael Young (www.socialscienceautomation.com): Turning Text Into Data

1
Michael Young
www.socialscienceautomation.com
Turning Text Into Data
2
The lure
  • The formal approach to discourse analysis
    undertaken by Social Science Automation would
    undoubtedly be of interest to many students and
    faculty in IR, both from a practical standpoint
    in that it provides a useful tool for conducting
    research, and from a methodological perspective,
    representing an innovative mix of content and
    discourse analysis that seeks to capture formally
    the meanings of different texts.

3
Agenda
  • Why Social Science Automation exists
  • Brief overview of WorldView
  • Slightly longer overview of Profiler Plus and
    what it does.
  • Does it work?
  • Some current projects
  • Demonstrations
  • Please ask questions at any time.

4
Why Social Science Automation Exists
  • Most of the available data in the social sciences
    is text.
  • For most of the history of the social sciences we
    have dealt with text extraordinarily poorly.
  • In 1997, I was a failing academic ...
  • Many current and former faculty here supported my
    efforts in many ways

5
  • These circumstances led to the development of
    WorldView and later of Profiler Plus (because
    WorldView turned out to be the easy part).

6
WorldView
  • WorldView is a program for representing,
    exploring, and comparing semantic network
    representations of text.

7
For the United States to be strong abroad, it
must be strong at home.
8
Profiler Plus
  • Profiler Plus is a general purpose text coding
    program.
  • That is, a rule-based system for manipulating
    tokens (words and punctuation), whose rules are
    collected in sets and applied in sequence.
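
The idea of rules collected in sets and applied in sequence can be sketched in a few lines of Python. This is only an illustration, not Profiler Plus itself: the rule functions, the STRENGTH code, and the `RULE_SETS` layout are all hypothetical names invented for the example.

```python
def lowercase_literals(tokens):
    """Rule: normalise every token to lower case."""
    return [t.lower() for t in tokens]

def drop_punctuation(tokens):
    """Rule: delete tokens that are not alphanumeric."""
    return [t for t in tokens if t.isalnum()]

def mark_strength(tokens):
    """Rule: replace the literal 'strong' with a (hypothetical) STRENGTH code."""
    return ["STRENGTH" if t == "strong" else t for t in tokens]

# Rules are grouped into sets; the sets run in order, and each rule sees
# the output of the rule before it.
RULE_SETS = [
    [lowercase_literals, drop_punctuation],  # set 1: normalisation
    [mark_strength],                         # set 2: coding
]

def apply_rule_sets(tokens, rule_sets):
    for rule_set in rule_sets:
        for rule in rule_set:
            tokens = rule(tokens)
    return tokens

result = apply_rule_sets(["Strong", "abroad", ",", "strong", "at", "home", "."], RULE_SETS)
print(result)  # ['STRENGTH', 'abroad', 'STRENGTH', 'at', 'home']
```

Order matters here: `mark_strength` only matches the lower-case literal, so it relies on the normalisation set having run first.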

9
Profiler Plus
Data
Coding Scheme
10
JARGON!
  • Token: a word or punctuation mark (hi ! ,).
  • Slot: a place in a token to store information.
  • Literal: the sequence of actual characters found
    in the text.
  • Part of speech: grammatical function of a token,
    for example noun.
  • Lemma: base form of a word.
    • lemmings → lemming
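
One way to picture these terms is as a small data structure. The sketch below is an assumption made for illustration, not the representation Profiler Plus uses: the `Token` class and the slot names `pos` and `lemma` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    """Hypothetical token: a literal plus named slots for extra information."""
    literal: str                               # the actual characters found in the text
    slots: dict = field(default_factory=dict)  # e.g. part of speech, lemma

# The slide's lemma example: "lemmings" has the base form "lemming".
tok = Token("lemmings", {"pos": "noun", "lemma": "lemming"})
print(tok.literal, tok.slots["pos"], tok.slots["lemma"])  # lemmings noun lemming
```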

11
Definition of A Coding Scheme
  • A coding scheme consists of a set of rules for:
    • Adding, deleting and moving tokens
    • Changing the values in a token's slots
    • Outputting results
  • All based on the token's other slot values and
    relationships to other tokens
  • The simplest case is matching a literal.
    • → find all occurrences of anger and output
      ANGER for each occurrence.
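
The literal-matching case can be sketched as follows. The function name `code_literal`, the regular-expression tokenizer, and the choice to ignore case are assumptions made for this example, not details taken from Profiler Plus.

```python
import re

def code_literal(text, literal, code):
    """Emit `code` once for every token that equals `literal` (case ignored)."""
    # Words and punctuation marks become separate tokens.
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return [code for tok in tokens if tok == literal]

hits = code_literal("His anger grew. Anger, he said, is useless.", "anger", "ANGER")
print(hits)  # ['ANGER', 'ANGER']
```

Because the match is on whole tokens, a word such as "angered" is not coded; catching it would need a rule on the lemma rather than the literal.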

12
Sequence of Operations
  • Tokenizing
  • Building sentence data structures
  • Assigning a part of speech and lemma to each
    token
  • Manipulating the tokens
  • Creating output
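
The five steps above can be lined up as a toy pipeline. Everything in this sketch is an assumption for illustration: the hand-written `LEXICON` stands in for a real part-of-speech tagger and lemmatizer, and the single manipulation rule (deleting punctuation) is just one possible rule.

```python
import re

# Toy lexicon: lower-cased literal -> (part of speech, lemma). Purely illustrative.
LEXICON = {
    "rome": ("noun", "Rome"),
    "was": ("verb", "be"),
    "destroyed": ("verb", "destroy"),
    "by": ("preposition", "by"),
    "the": ("determiner", "the"),
    "huns": ("noun", "Hun"),
}

def tokenize(text):
    """Step 1: split into sentences, then into word and punctuation literals."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]

def build_sentence(literals):
    """Step 2: wrap each literal in a token (here, a dict of slots)."""
    return [{"literal": lit} for lit in literals]

def tag(sentence):
    """Step 3: fill the part-of-speech and lemma slots from the toy lexicon."""
    for tok in sentence:
        pos, lemma = LEXICON.get(tok["literal"].lower(), ("unknown", tok["literal"]))
        tok["pos"], tok["lemma"] = pos, lemma
    return sentence

def manipulate(sentence):
    """Step 4: one example rule, deleting punctuation tokens."""
    return [tok for tok in sentence if tok["literal"].isalnum()]

def output(sentence):
    """Step 5: emit the lemma of every remaining token."""
    return [tok["lemma"] for tok in sentence]

def run(text):
    return [output(manipulate(tag(build_sentence(lits)))) for lits in tokenize(text)]

print(run("Rome was destroyed by the Huns."))
# [['Rome', 'be', 'destroy', 'by', 'the', 'Hun']]
```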

13
Tokenizing
  • Breaks the text into sentences, which limits the
    scope of token matching to a single sentence.
  • Separates punctuation from words so that we do
    not have to concern ourselves with trailing
    periods, commas, semicolons, etc.
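
Both effects of tokenizing can be seen in a short sketch. The regular expressions are a naive stand-in chosen for this example (they would mis-split abbreviations such as "U.S."), and `sentences_with_both` is a hypothetical helper, not a Profiler Plus function.

```python
import re

def tokenize(text):
    """Split text into sentences, each a list of word and punctuation tokens."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences if s]

def sentences_with_both(sentences, first, second):
    """Return the sentences in which both literals occur."""
    return [s for s in sentences if first in s and second in s]

one = tokenize("Rome was destroyed by the Huns.")
two = tokenize("Rome was rich. The Huns came later.")

print(one[0])  # ['Rome', 'was', 'destroyed', 'by', 'the', 'Huns', '.']
print(len(sentences_with_both(one, "Rome", "Huns")))  # 1
print(len(sentences_with_both(two, "Rome", "Huns")))  # 0
```

The literal "Huns" matches in the first text because the trailing period has become its own token, and the second text yields no match because the two words fall in different sentences.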

14
  • Sentence data structure
  • Rome was destroyed by the Huns

15
  • Part of Speech
    • Noun, verb, preposition, adjective, adverb
  • Lemma
    • Base form of token

16
  • Rome was destroyed by the Huns

17
  • Manipulating tokens
  • Using rules with operators that create, move, and
    delete tokens as well as comparing and changing
    values of the slots in the tokens.
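
As a rough sketch of what such operators might look like, the functions below create, move, and delete tokens and compare and change slot values. The function names and the dict-based tokens are hypothetical, chosen for this example only.

```python
def delete_where(tokens, slot, value):
    """Delete every token whose `slot` equals `value`."""
    return [t for t in tokens if t.get(slot) != value]

def set_slot_where(tokens, slot, value, new_slot, new_value):
    """Compare `slot` to `value`; on a match, set `new_slot` (tokens are copied)."""
    return [
        {**t, new_slot: new_value} if t.get(slot) == value else dict(t)
        for t in tokens
    ]

def create(tokens, index, literal, **slots):
    """Create a new token and place it at `index`."""
    return tokens[:index] + [{"literal": literal, **slots}] + tokens[index:]

def move(tokens, src, dst):
    """Move the token at `src` to position `dst` (counted after removal)."""
    tokens = list(tokens)
    tokens.insert(dst, tokens.pop(src))
    return tokens

tokens = [
    {"literal": "the", "pos": "determiner"},
    {"literal": "Huns", "pos": "noun"},
    {"literal": "destroyed", "pos": "verb"},
]
no_det = delete_where(tokens, "pos", "determiner")
tagged = set_slot_where(no_det, "pos", "verb", "lemma", "destroy")
grown = create(tagged, 2, "Rome", pos="noun")
fronted = move(grown, 2, 0)

print([t["literal"] for t in grown])    # ['Huns', 'destroyed', 'Rome']
print([t["literal"] for t in fronted])  # ['Rome', 'Huns', 'destroyed']
```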

18
Reduction Table
19
  • Rome has been destroyed by the Huns.
  • ↓
  • Huns destroyed Rome.
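
The reduction on this slide can be imitated with a single toy rule. The transcript does not show the actual reduction rules, so the sketch below is an assumption: the part-of-speech tags are assigned by hand, and `reduce_passive` is a hypothetical function that handles only the simple pattern "X (auxiliaries) VERB by Y".

```python
# Hand-tagged tokens for the slide's sentence (tags are assumptions).
SENTENCE = [
    {"literal": "Rome", "pos": "noun"},
    {"literal": "has", "pos": "aux"},
    {"literal": "been", "pos": "aux"},
    {"literal": "destroyed", "pos": "verb"},
    {"literal": "by", "pos": "prep"},
    {"literal": "the", "pos": "det"},
    {"literal": "Huns", "pos": "noun"},
    {"literal": ".", "pos": "punct"},
]

def reduce_passive(tokens):
    """Rewrite 'X <aux...> VERB by Y' as 'Y VERB X'; otherwise leave order alone."""
    # Delete punctuation, determiners and auxiliaries, judged by the pos slot.
    kept = [t for t in tokens if t["pos"] not in {"punct", "det", "aux"}]
    literals = [t["literal"] for t in kept]
    if "by" not in literals:
        return literals  # rule does not apply
    i = literals.index("by")
    if i == 0 or kept[i - 1]["pos"] != "verb":
        return literals  # "by" is not preceded by a verb
    patient, verb, agent = literals[: i - 1], literals[i - 1], literals[i + 1 :]
    # Move the agent in front of the verb and the patient behind it.
    return agent + [verb] + patient

print(" ".join(reduce_passive(SENTENCE)))  # Huns destroyed Rome
```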

20
Does it work?
21
To quote a Harvard Professor...
Don't take my word for it...
  • Our results surprised us. The computer program we
    evaluated, VRA Reader, was able to extract
    information on a level equal to trained
    Harvard undergraduates, and this was for a short
    term application. Computer programs do not get
    tired, bored, and distracted, perhaps for the
    first time, automated programs are ready for
    primetime.
  • Gary King and Will Lowe, International Organization
    57, 2003

25
Select Projects
  • ICEWS-CWEST (DARPA)
  • Near real-time event data extraction from the
    full text of news feeds
  • Coding capabilities in Arabic and Russian
  • Perception Metrics
  • Company branding
  • Nation branding
  • Political Perception
  • Candidate image in media
  • Achievement Metrics
  • Who are the trouble makers?
  • Refinement of text mapping and WorldView

26
Demonstrations
  • XEditor
  • Profiler Plus
  • WorldView