Transcript and Presenter's Notes

Title: Tobias Blanke


1
How do computers understand texts?
Tobias Blanke
2
My contact details
  • Name: Tobias Blanke
  • Telephone: 020 7848 1975
  • Email: tobias.blanke@kcl.ac.uk
  • Address: 51 Oakfield Road (!), N4 4LD

3
Outline
  • How do computers understand texts so that you
    don't have to read them?
  • The same steps
  • We stay with searching for a long time.
  • How to use text analysis for Linked Data
  • You will build your own Twitter miner

4
Why? A simple question
  • Suppose you have a million documents and a
    question: what do you do?
  • Solution: the user reads all the documents in the
    store, retains the relevant documents and discards
    all the others. Perfect retrieval NOT POSSIBLE!!!
  • Alternative: use a high-speed computer to read the
    entire document collection and extract the
    relevant documents.

5
Data Geeks are in demand
New research by the McKinsey Global Institute
(MGI) forecasts a 50 to 60 percent gap between
the supply and demand of people with deep
analytical talent.
http://jonathanstray.com/investigating-thousands-or-millions-of-documents-by-visualizing-clusters
6
The Search Problem
7
The problem of traditional text analysis is
retrieval
  • Goal: find documents relevant to an information
    need from a large document set

[Diagram: an information need is expressed as a query; a "magic system" retrieves an answer list from the document collection]
8
Example
Google
Web
9
Search problem
  • First applications in libraries (1950s)
  • ISBN: 0-201-12227-8
  • Author: Salton, Gerard
  • Title: Automatic text processing: the
    transformation, analysis, and retrieval of
    information by computer
  • Publisher: Addison-Wesley
  • Date: 1989
  • Content: <Text>
  • External attributes and internal attribute
    (content)
  • Search by external attributes: search in
    databases
  • IR: search by content

10
Text Mining
  • Text mining describes the application
    of data mining techniques to the automated discovery
    of useful or interesting knowledge from
    unstructured text.
  • Task: discuss with your neighbour what a system
    needs in order to
  • determine who is a terrorist
  • determine the sentiment of a text

11
The big Picture
  • IR is easy.
  • Let's stay with search for a while

12
Search is still the biggest application
  • Security applications: search for the villain
  • Biomedical applications: semantic search
  • Online media applications: disambiguate
    information
  • Sentiment analysis: find nice movies
  • Human consumption is still key

13
Why is the human so important?
  • Because we talk about information, and
    understanding remains a human domain
  • "There will be information on the Web that has a
    clearly defined meaning and can be analysed and
    traced by computer programs; there will be
    information, such as poetry and art, that
    requires the whole human intellect for an
    understanding that will always be subjective."
    (Tim Berners-Lee, Spinning the Semantic Web)
  • "There is virtually no semantics in the
    Semantic Web. (...) Semantic content, in the
    Semantic Web, is generated by humans, ontologised
    by humans, and ultimately consumed by humans.
    Indeed, it is not unusual to hear complaints
    about how difficult it is to find and retain
    good ontologists."
    (https://uhra.herts.ac.uk/dspace/bitstream/2299/3629/1/903250.pdf)

14
The Central Problem The Human
Information Seeker
Authors
Concepts
Concepts
Query Terms
Document Terms
Do these represent the same concepts?
15
The Black Box
[Diagram: documents and a query go into the black box; results come out]
Slide is from Jimmy Lin's tutorial
16
Inside The IR Black Box
[Diagram: the query and the documents are each turned into a representation; the document representations are stored in an index, and a comparison function matches the query representation against the index to produce the results]
Slide is from Jimmy Lin's tutorial
17
Possible approaches
  • 1. String matching (linear search in documents)
  • - Syntactical
  • - Difficult to improve
  • 2. Indexing
  • - Semantics
  • - Flexible to further improvement

18
Indexing-based IR: similarity text analysis
[Diagram: the document and the query (or another document) are each indexed into keyword representations; query evaluation then asks: how similar is this document to the query or to another document?]
Slide is from Jimmy Lin's tutorial
19
Main problems
  • Document indexing
  • How to best represent their contents?
  • Matching
  • To what extent does an identified information
    source correspond to a query/document?
  • System evaluation
  • How good is a system?
  • Are the retrieved documents relevant?
    (precision)
  • Are all the relevant documents retrieved?
    (recall)

20
Indexing
21
Document indexing
  • Goal: find the important meanings and create an
    internal representation
  • Factors to consider
  • Accuracy in representing meanings (semantics)
  • Exhaustiveness (cover all the contents)

[Diagram: a trade-off between coverage and accuracy along the scale string - word - phrase - concept]
Slide is from Jimmy Lin's tutorial
22
Text Representation Issues
  • In general, it is hard to capture these features
    from a text document
  • One, it is difficult to extract them
    automatically
  • Two, even if we did, it won't scale!
  • One simplification is to represent documents as a
    bag of words
  • Each document is represented as a bag of the words
    it contains, and each component of the bag
    represents some measurement of the relative
    importance of a single word.

23
Some immediate problems
How do we compare these bags of words to find out
whether they are similar? Let's say we have three
bags: (House, Garden, House, door), (Household,
Garden, Flat) and (House, House, House,
Gardening). How do we normalise these bags? Why
is normalisation needed? What would we want to
normalise? (See the sketch below.)
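A minimal sketch of comparing the bags above, assuming Python; relative frequencies are used as one possible normalisation, and the bags and the overlap measure are illustrative rather than the slide's own method:

    from collections import Counter

    # Three toy "bags of words" from the slide
    bags = [
        ["house", "garden", "house", "door"],
        ["household", "garden", "flat"],
        ["house", "house", "house", "gardening"],
    ]

    def normalise(bag):
        """Turn raw counts into relative frequencies so that
        long and short documents become comparable."""
        counts = Counter(bag)
        total = sum(counts.values())
        return {term: freq / total for term, freq in counts.items()}

    def overlap(a, b):
        """A crude similarity: sum of shared term weights."""
        return sum(min(a.get(t, 0), b.get(t, 0)) for t in set(a) | set(b))

    vectors = [normalise(bag) for bag in bags]
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            print(f"bag {i+1} vs bag {j+1}: {overlap(vectors[i], vectors[j]):.2f}")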
24
Keyword selection and weighting
  • How to select important keywords?

 
25
Luhn's Ideas
  • Frequency of word occurrence in a document is a
    useful measurement of word significance

26
Zipf and Luhn
27
Top 50 Terms
WSJ87 collection, a 131.6 MB collection of 46,449
newspaper articles (19 million term occurrences)
TIME collection, a 1.6 MB collection of 423 short
TIME magazine articles (245,412 term occurrences)
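Luhn's observation is easy to reproduce: count term occurrences and look at the rank/frequency list. A small sketch, assuming Python and any plain-text corpus file (the filename is a placeholder):

    import re
    from collections import Counter

    # Count word occurrences in a plain-text file (path is illustrative)
    with open("corpus.txt", encoding="utf-8") as f:
        tokens = re.findall(r"[a-z]+", f.read().lower())

    freq = Counter(tokens)

    # Zipf-like behaviour: a few very frequent words, a long tail of rare ones
    for rank, (term, count) in enumerate(freq.most_common(50), start=1):
        print(f"{rank:2d}  {term:15s} {count}")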
28
Scholarship and the Long Tail
  • Scholarship follows a long-tailed distribution:
    the interest in relatively unknown items declines
    much more slowly than it would if popularity
    were described by a normal distribution
  • We have few statistical tools for dealing with
    long-tailed distributions
  • Other problems include contested terms
  • Graham White, "On Scholarship" (in Bartscherer
    ed., Switching Codes)

29
Stopwords / Stoplist
  • Some words do not bear useful information. Common
    examples:
  • of, in, about, with, I, although, ...
  • A stoplist contains stopwords, not to be used as
    index terms
  • Prepositions
  • Articles
  • Pronouns
  • http://www.textfixer.com/resources/common-english-words.txt
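A small sketch of applying a stoplist before indexing, assuming Python; the stoplist here is a tiny hard-coded sample rather than the full list linked above:

    # A tiny illustrative stoplist; in practice load a full list such as the one above
    STOPWORDS = {"of", "in", "about", "with", "i", "although", "a", "the", "is", "and"}

    def remove_stopwords(tokens):
        """Drop tokens that carry little content before indexing."""
        return [t for t in tokens if t.lower() not in STOPWORDS]

    print(remove_stopwords("This is a document about searching in text".split()))
    # -> ['This', 'document', 'searching', 'text']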

30
Stemming
  • Reason
  • Different word forms may bear similar meaning
    (e.g. search, searching): create a standard
    representation for them
  • Stemming
  • Removing some endings of words
  • computer
  • compute
  • computes
  • computing
  • computed
  • computation

Is it always good to stem? Give examples!
comput
Slide is from Jimmy Lin's tutorial
31
Porter algorithm (Porter, M.F., 1980, "An
algorithm for suffix stripping", Program, 14(3),
130-137)
http://qaa.ath.cx/porter_js_demo.html
  • Step 1: plurals and past participles
  • SSES -> SS            caresses -> caress
  • (*v*) ING ->          motoring -> motor
  • Step 2: adj -> n, n -> v, n -> adj, ...
  • (m>0) OUSNESS -> OUS  callousness -> callous
  • (m>0) ATIONAL -> ATE  relational -> relate
  • Step 3
  • (m>0) ICATE -> IC     triplicate -> triplic
  • Step 4
  • (m>1) AL ->           revival -> reviv
  • (m>1) ANCE ->         allowance -> allow
  • Step 5
  • (m>1) E ->            probate -> probat
  • (m>1 and *d and *L) -> single letter   controll
    -> control

Slide is from Jimmy Lin's tutorial
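The Porter stemmer is available in several libraries; a minimal sketch assuming Python with NLTK installed (word list is illustrative):

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()
    words = ["computer", "compute", "computes", "computing", "computed",
             "computation", "caresses", "motoring", "relational", "allowance"]

    # All the "comput-" forms collapse to (roughly) the same stem
    for w in words:
        print(f"{w:12s} -> {stemmer.stem(w)}")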
32
Lemmatization
  • Transform to the standard form according to the syntactic
    category. Produce vs Produc-
  • E.g. verb + ing -> verb
  • noun + s -> noun
  • Needs POS tagging
  • More accurate than stemming, but needs more
    resources

Slide partly taken from Jimmy Lin's tutorial
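Lemmatization needs a dictionary and the part of speech; a short sketch assuming Python with NLTK and its WordNet data installed:

    from nltk.stem import WordNetLemmatizer  # requires nltk.download("wordnet") once

    lemmatizer = WordNetLemmatizer()

    # The POS tag matters: "computing" as a verb lemmatizes to "compute"
    print(lemmatizer.lemmatize("computing", pos="v"))  # compute
    print(lemmatizer.lemmatize("computers", pos="n"))  # computer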
33
Index Documents (Bag of Words Approach)
[Diagram: the document "This is a document in text analysis" is indexed as the bag of words: analysis, document, is, text, this]
34
Result of indexing
  • Each document is represented by a set of weighted
    keywords (terms)
  • D1 -> {(t1, w1), (t2, w2), ...}
  • e.g. D1 -> {(comput, 0.2), (architect, 0.3), ...}
  • D2 -> {(comput, 0.1), (network, 0.5), ...}
  • Inverted file:
  • comput -> {(D1, 0.2), (D2, 0.1), ...}
  • The inverted file is used during retrieval for
    higher efficiency.

Slide partly taken from Jimmy Lin's tutorial
35
Inverted Index Example
Doc 1: "This is a sample document with one sample sentence"
Doc 2: "This is another sample document"

Dictionary                        Postings (doc id, freq)
Term       docs   total freq
This       2      2               (1, 1), (2, 1)
is         2      2               (1, 1), (2, 1)
sample     2      3               (1, 2), (2, 1)
another    1      1               (2, 1)
...

Slide is from ChengXiang Zhai
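A minimal sketch of building the dictionary and postings above, assuming Python:

    from collections import Counter, defaultdict

    docs = {
        1: "This is a sample document with one sample sentence",
        2: "This is another sample document",
    }

    # term -> list of (doc id, frequency) postings
    inverted = defaultdict(list)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            inverted[term].append((doc_id, freq))

    for term, postings in sorted(inverted.items()):
        total = sum(f for _, f in postings)
        print(f"{term:10s} docs={len(postings)} total={total} postings={postings}")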
36
Similarity
37
Similarity Models
  • Boolean model
  • Vector-space model
  • Many more

38
Boolean model
  • Document: logical conjunction of keywords
  • Query: Boolean expression of keywords
  • e.g. D = t1 ∧ t2 ∧ ... ∧ tn
  • Q = (t1 ∧ t2) ∨ (t3 ∧ ¬t4)
  • Problems
  • Either too many or too few documents are returned
  • End-users cannot manipulate Boolean operators
    correctly
  • E.g. documents about poverty and crime
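The Boolean model can be mimicked with set operations over an inverted index; a small sketch assuming Python (the postings are illustrative):

    # Which documents contain which terms (toy postings as sets of doc ids)
    postings = {
        "poverty": {1, 2, 5},
        "crime":   {2, 3},
    }

    # Query: poverty AND crime  ->  intersection
    print(postings["poverty"] & postings["crime"])   # {2}

    # Query: poverty OR crime   ->  union (often returns too many documents)
    print(postings["poverty"] | postings["crime"])   # {1, 2, 3, 5}

    # Query: poverty AND NOT crime  ->  set difference
    print(postings["poverty"] - postings["crime"])   # {1, 5}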

39
Vector space model
  • Vector space: all the keywords encountered
  • <t1, t2, t3, ..., tn>
  • Document
  • D = <a1, a2, a3, ..., an>
  • ai = weight of ti in D
  • Query
  • Q = <b1, b2, b3, ..., bn>
  • bi = weight of ti in Q
  • R(D,Q) = Sim(D,Q)

40
Cosine Similarity
Similarity calculated using COSINE similarity
between two vectors
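Cosine similarity is the dot product of the two weight vectors divided by the product of their lengths: sim(D,Q) = sum(ai*bi) / (sqrt(sum(ai^2)) * sqrt(sum(bi^2))). A minimal sketch assuming Python (the vectors are illustrative):

    import math

    def cosine(d, q):
        """Cosine similarity between two weight vectors of equal length."""
        dot = sum(a * b for a, b in zip(d, q))
        norm_d = math.sqrt(sum(a * a for a in d))
        norm_q = math.sqrt(sum(b * b for b in q))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

    # Toy document and query vectors over the same keyword space
    print(cosine([0.2, 0.0, 0.3], [0.1, 0.5, 0.0]))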
41
Tf/Idf
  • tf = term frequency
  • frequency of a term/keyword in a document
  • The higher the tf, the higher the importance
    (weight) for the doc.
  • df = document frequency
  • no. of documents containing the term
  • distribution of the term
  • idf = inverse document frequency
  • the unevenness of the term's distribution in the corpus
  • the specificity of a term to a document
  • The more evenly the term is distributed, the
    less specific it is to a document
  • weight(t,D) = tf(t,D) * idf(t) (see the sketch below)
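A minimal sketch of the weighting formula above, assuming Python and idf(t) = log(N / df(t)), one common variant; the toy documents are illustrative:

    import math
    from collections import Counter

    docs = [
        "computer architecture and networks",
        "computer networks and protocols",
        "the poetry of networks",
    ]
    N = len(docs)
    tokenised = [d.split() for d in docs]

    # document frequency: in how many documents does each term occur?
    df = Counter(term for doc in tokenised for term in set(doc))

    def tfidf(term, doc):
        tf = doc.count(term)            # term frequency in this document
        idf = math.log(N / df[term])    # rare terms get a higher idf
        return tf * idf

    print(tfidf("computer", tokenised[0]))   # appears in 2 of 3 docs: positive weight
    print(tfidf("networks", tokenised[0]))   # appears in every doc: idf = log(1) = 0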

42
Exercise
  • (1) Define the term/document matrix (see the sketch below)
  • D1: The silver truck arrives
  • D2: The silver cannon fires silver bullets
  • D3: The truck is on fire
  • (2) Compute TF/IDF from Reuters
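Part (1) of the exercise can be checked mechanically; a short sketch assuming Python that prints the raw term/document count matrix for D1-D3:

    from collections import Counter

    docs = {
        "D1": "the silver truck arrives",
        "D2": "the silver cannon fires silver bullets",
        "D3": "the truck is on fire",
    }
    counts = {name: Counter(text.split()) for name, text in docs.items()}
    vocab = sorted(set(term for c in counts.values() for term in c))

    # Rows are terms, columns are documents, cells are raw term frequencies
    print(f"{'term':10s}" + "".join(f"{d:>5s}" for d in docs))
    for term in vocab:
        print(f"{term:10s}" + "".join(f"{counts[d][term]:5d}" for d in docs))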

43
Let's code our first text analysis engine
  • search.pl

44
Our corpus
  • A study on Kant's critique of judgement
  • Aristotle's Metaphysics
  • Hegel's Aesthetics
  • Plato's Charmides
  • McGreedy's War Diaries
  • Excerpts from the Royal Irish Society

45
Text Analysis is an Experimental Science!
46
Text Analysis is an Experimental Science!
  • Formulate a hypothesis
  • Design an experiment to answer the question
  • Perform the experiment
  • Does the experiment answer the question?
  • Rinse, repeat

47
Test Collections
  • Three components of a test collection
  • A collection of documents
  • A set of topics
  • Sets of relevant documents based on expert
    judgments
  • Metrics for assessing performance
  • Precision
  • Recall
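Both metrics can be computed directly from the retrieved set and the expert judgments; a minimal sketch assuming Python (the doc ids are illustrative):

    # Expert judgments and a system's answer list
    relevant  = {1, 3, 5, 7, 9}
    retrieved = {1, 2, 3, 4, 5}

    true_positives = relevant & retrieved
    precision = len(true_positives) / len(retrieved)   # how much of what was returned is relevant?
    recall    = len(true_positives) / len(relevant)    # how much of the relevant material was found?

    print(f"precision = {precision:.2f}, recall = {recall:.2f}")   # 0.60, 0.60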

48
Precision vs. Recall
[Diagram: Venn diagram over all docs, showing the retrieved set and the relevant set; precision and recall measure their overlap]
Slide taken from Jimmy Lin's tutorial
49
The TREC experiments
  • Once per year
  • A set of documents and queries are distributed
    to the participants (the standard answers are
    unknown) (April)
  • Participants work (very hard) to construct,
    fine-tune their systems, and submit the answers
    (1000/query) at the deadline (July)
  • NIST people manually evaluate the answers and
    provide correct answers (and classification of IR
    systems) (July August)
  • TREC conference (November)

50
Towards Linked Data
Beyond the Simple Stuff
51
  • Build Relationships between Documents
  • Structure in the classic web Hyperlinks
  • Mashing
  • Cluster and create links
  • Build Relationships within Documents
  • Information Extraction

52
The traditional web
  • Hyperlinks

53
Web Mining
  • No stable document collection (spider, crawler)
  • Huge number of documents (partial collection)
  • Multimedia documents
  • Great variation of document quality
  • Multilingual problem

54
Exploiting Inter-Document Links
Description (anchor text)
Links indicate the utility of a doc
What does a link tell us?
Slide is from ChengXiang Zhai
55
Mashing
56
Information filtering
  • Instead of running changing queries over a stable
    document collection, we now want to filter an incoming
    document flow against stable interests (queries)
  • A yes/no decision (instead of ordering documents)
  • Advantage: the description of the user's interests may
    be improved using relevance feedback (the user is
    more willing to cooperate)
  • The basic techniques used for IF are the same as
    those for IR: two sides of the same coin

[Diagram: the incoming stream doc3, doc2, doc1 is matched against a user profile; each document is either kept or ignored]
Slide taken from Jimmy Lin's tutorial
57
Let's mine Twitter
  • Imagine you are a social scientist interested
    in the Arab Spring and the influence of social
    media, or something else
  • You know that social media plays an important
    role. Even the Pope tweets, with an iPad!

58
Twitter API
  • Its easy to get information out of Twitter
  • Search API http//twitter.com/!/search/house
  • http//twitter.com/statuses/public_timeline.rss

59
Twitter exercise
  • What do we want to look for?
  • Form groups
  • Create an account with Yahoo Pipes
  • http://pipes.yahoo.com/pipes/
  • (You can use your Google one)
  • Create a Pipe. What do you see?

60
  • I. Access the keywords source
  • Fetch CSV module
  • Enter the URL of the CSV file: http://dl.dropbox.com/u/868826/Filter-Demo.csv
  • Use "keywords" as the column name
  • II. Loop through each element in the CSV file and
    build a search URL formatted for RSS output.
  • Under Operators, fetch the Loop module
  • Fetch URLs > URL Builder into the Loop's big field
  • As base use http://search.twitter.com/search.atom
  • As query parameters use "q" in the first box and
    then item.keywords in the second
  • Assign results to item.loopurlbuilder
  • III. Connect the CSV and Loop modules

61
  • IV. Search Twitter
  • Under Operators, fetch the Loop module
  • Fetch Sources > Fetch Feed into the Loop's big
    field
  • As URL use item.loopurlbuilder
  • Emit all results
  • V. Connect the two Loop modules
  • VI. Sort
  • 1. Under Operators, fetch the Sort module
  • Sort by item.y:published.utime in descending
    order
  • VII. Connect the Sort module to the Pipe Output, the
    final module in every Yahoo Pipe.
  • VIII. Save and run the Pipe

http://www.squidoo.com/yahoo-pipes-guide
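Yahoo Pipes has since been retired and the old search.twitter.com Atom feed no longer exists, but the logic of the pipe (read keywords from a CSV file, build one search URL per keyword) can be sketched in Python; the CSV path and the feed URL are taken from the slides and treated as placeholders:

    import csv
    from urllib.parse import urlencode

    BASE = "http://search.twitter.com/search.atom"   # historical endpoint from the slide

    # I. read the keywords column from the CSV file (path is a placeholder)
    with open("Filter-Demo.csv", newline="", encoding="utf-8") as f:
        keywords = [row["keywords"] for row in csv.DictReader(f)]

    # II. loop over the keywords and build one search URL per keyword,
    #     mirroring the URL Builder module (q=<keyword>)
    urls = [f"{BASE}?{urlencode({'q': kw})}" for kw in keywords]

    # IV./VI. fetching each URL and sorting the entries by published date
    #         would correspond to the Fetch Feed and Sort modules
    for url in urls:
        print(url)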
62
Cluster to Create Links
63
Group together similar documents
Idea: frequent terms carry more information about
the cluster they might belong to; highly
correlated frequent terms probably belong to the
same cluster
http://www.iboogie.com/
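A minimal clustering sketch, assuming Python with scikit-learn installed; the documents and the number of clusters are illustrative, not the slide's own example:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = [
        "the silver truck arrives at the harbour",
        "a truck delivers silver to the harbour",
        "the critique of judgement and aesthetics",
        "aesthetics, judgement and the beautiful",
    ]

    # Represent each document as a tf-idf weighted bag of words ...
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)

    # ... and group similar documents together
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    for doc, label in zip(docs, labels):
        print(label, doc)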
64
Clustering Example
How many terms do these documents have?
65
English Novels
  1. Normalise
  2. Calculate similarity according to dot product

66
Let's code again
  • Compare.pl

67
FReSH (Forging ReSTful Services for e-Humanities)
  • Creating Semantic Relationships

68

69
  • Digital edition of 6 newspapers / periodicals
  • Monthly Repository (1806-1837)
  • Northern Star (1837-1852)
  • The Leader (1850-1860)
  • English Woman's Journal (1858-1864)
  • Tomahawk (1867-1870)
  • Publishers' Circular (1837-1959; NCSE 1880-1890)

70
Semantic view
  • Chain of readings

71
OCR Problems
  • Thin Compimy in fmmod to iiKu'-t tho dooiro
    ol'.those who seek, without Hpcoiilal/ioii, Hiifo
    and .profltublo invtwtmont for larjo or Hinall
    HiiniH, at a hi(jlilt"r rulo of intoront tlian
    can be obtainod from tho in 'ihlio 1'uihIh, and
    on oh Hocuro a basin. Tho invoHlinont Hystom,
    whilo it olfors tho preutoHt advantages to tho
    public, nifordH to i(.H -moniberH n perfect
    Boourity, luul a hi hor rato ofintonmt than can
    bo obtained oluowhoro, 'I'ho capital of 250,000
    in divided, for tho oonvonionco of invoiitmont
    and tninafor, into 1 bIiui-ob, of which 10a.
    only'wiUbe oallod.

72
  • N-grams
  • Latent Semantic Indexing

http://www.seo-blog.com/latent-semantic-indexing-lsi-explained.php
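Character n-grams are one way to match OCR-damaged words against clean candidates; a small sketch assuming Python (the word pair is illustrative):

    def ngrams(word, n=3):
        """Set of character n-grams, e.g. 'invest' -> {'inv', 'nve', 'ves', 'est'}."""
        return {word[i:i + n] for i in range(len(word) - n + 1)}

    def jaccard(a, b):
        """Overlap of two n-gram sets, between 0 and 1."""
        return len(a & b) / len(a | b)

    # An OCR-garbled token from the slide against clean candidates
    print(jaccard(ngrams("invtwtmont"), ngrams("investment")))
    print(jaccard(ngrams("invtwtmont"), ngrams("committee")))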
73
Demo
74
Producing Structured Information
  • Information Extraction

75
Information Extraction (IE)
  • IE systems
  • Identify documents of a specific type
  • Extract information according to pre-defined
    templates
  • Current approaches to IE focus on restricted
    domains, for instance news wires

http://www.opencalais.com/about
http://viewer.opencalais.com/
76
History of IE: terror, fleets, catastrophes and
management
  • The Message Understanding Conferences (MUC) were
    initiated and financed by DARPA (Defense Advanced
    Research Projects Agency) to encourage the
    development of new and better methods of
    information extraction. The character of this
    competition (many concurrent research teams
    competing against one another) required the
    development of standards for evaluation, e.g. the
    adoption of metrics like precision and recall.
  • http://en.wikipedia.org/wiki/Message_Understanding_Conference

The MUC-4 Terrorism Task The task given to
participants in the MUC-4 evaluation (1991) was
to extract specific information on terrorist
incidents from newspaper and newswire texts
relating to South America.
77
Hunting for Things
  • Named entity recognition
  • Labelling names of things
  • An entity is a discrete thing like "King's
    College London"
  • But also dates, places, etc.
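A minimal named entity recognition sketch, assuming Python with spaCy and its small English model installed; the sentence is illustrative:

    import spacy

    # Requires: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("Tobias Blanke teaches at King's College London on 12 May 2012.")
    for ent in doc.ents:
        print(ent.text, ent.label_)   # e.g. PERSON, ORG, DATE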

78
The aims: things and their relations
  • Find and understand the limited relevant parts of
    texts
  • Clear, factual information (who did what to whom
    when?)
  • Produce a structured representation of the
    relevant information relations
  • Terrorists have heads
  • Storms cause damage

79
Independent linguistic tools
  • A Text Zoner, which turns a text into a set of
    segments.
  • A Preprocessor, which turns a text or text segment
    into a sequence of sentences, each of which is
    a sequence of lexical items.
  • A Filter, which turns a sequence of sentences
    into a smaller set of sentences by filtering out
    irrelevant ones.
  • A Preparser, which takes a sequence of lexical
    items and tries to identify reliably determinable
    small-scale structures, e.g. names
  • A Parser, which takes a set of lexical items
    (words and phrases) and outputs a set of
    parse-tree fragments, which may or may not be
    complete.

80
Independent linguistic tools II
  • A Fragment Combiner, which attempts to combine
    parse-tree or logical-form fragments into a
    structure of the same type for the whole
    sentence.
  • A Semantic Interpreter, which generates semantic
    structures or logical forms from parse-tree
    fragments.
  • A Lexical Disambiguator, which indexes lexical
    items to one and only one lexical sense, or can
    be viewed as reducing the ambiguity of the
    predicates in the logical form fragments.
  • A Coreference Resolver which identifies different
    descriptions of the same entity in different
    parts of a text.
  • A Template Generator which fills the IE templates
    from the semantic structures. Off to Linked Data

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.6480&rep=rep1&type=pdf
81
Stanford NLP
82
More examples from Stanford
  • Use conventional classification algorithms to
    classify substrings of a document as to be
    extracted or not.
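One way to read "classify substrings as to be extracted or not" is as an ordinary supervised classifier over candidate strings; a toy sketch assuming Python with scikit-learn, with invented training examples (this is not the Stanford system itself):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy candidate substrings labelled extract (1) or ignore (0)
    candidates = ["King's College London", "Addison-Wesley", "on the other hand",
                  "Northern Star", "a basin", "Monthly Repository", "and so on"]
    labels     = [1, 1, 0, 1, 0, 1, 0]

    # Character n-gram features plus a linear classifier
    model = make_pipeline(
        CountVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(candidates, labels)
    print(model.predict(["English Woman's Journal", "by and large"]))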

83
Let's code again
  • Great ideas

84
  • Parliament at Stormont 1921-1972
  • Transcripts of all debates - Hansards

85
  • Georeferencing basic principles
  • Informal: based on place names
  • Formal: based on coordinates, etc.
  • Benefits
  • Resolving ambiguity
  • Ease of access to data objects
  • Integration of data from heterogeneous sources
  • Resolving space and time

86
DBpedia
  • Linked Data: all we need to do now is to return
    the results in the right format
  • For instance, extracting entities with
  • http://dbpedia.org/spotlight
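A sketch of calling DBpedia Spotlight over HTTP, assuming Python with requests; the endpoint URL, parameters and response field names are best-effort assumptions about the public Spotlight service, not taken from the slides:

    import requests

    # Endpoint and parameters are assumptions based on the public Spotlight service
    resp = requests.get(
        "https://api.dbpedia-spotlight.org/en/annotate",
        params={"text": "Hegel lectured on aesthetics in Berlin.", "confidence": 0.5},
        headers={"Accept": "application/json"},
        timeout=30,
    )
    for res in resp.json().get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])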

87
Sponging
88
Thanks
89
Example from Stanford
The task given to participants in the MUC-4
evaluation (1991) was to extract specific
information on terrorist incidents from
newspaper and newswire texts relating to South
America. Part-of-speech taggers are systems that
assign one and only one part-of-speech symbol
(like Proper noun, or Auxiliary verb) to a word
in a running text, and do so on the basis
(usually) of statistical generalizations across
very large bodies of text.
90
(No Transcript)