Title: How do computers understand texts?
1 How do computers understand texts?
Tobias Blanke
2 My contact details
- Name: Tobias Blanke
- Telephone: 020 7848 1975
- Email: tobias.blanke_at_kcl.ac.uk
- Address: 51 Oakfield Road (!), N4 4LD
3 Outline
- How do computers understand texts so that you don't have to read them?
- The same steps
- We stay with searching for a long time.
- How to use text analysis for Linked Data
- You will build your own Twitter miner
4 Why? A simple question
- Suppose you have a million documents and a question: what do you do?
- Solution: the user reads all the documents in the store, retains the relevant documents and discards all the others. Perfect retrieval, but NOT POSSIBLE!
- Alternative: use a high-speed computer to read the entire document collection and extract the relevant documents.
5 Data Geeks are in demand
New research by the McKinsey Global Institute (MGI) forecasts a 50 to 60 percent gap between the supply and demand of people with deep analytical talent.
http://jonathanstray.com/investigating-thousands-or-millions-of-documents-by-visualizing-clusters
6 The Search Problem
7 The problem of traditional text analysis is retrieval
- Goal: find documents relevant to an information need from a large document set
(Diagram: an information need is formulated as a query, a "magic system" performs retrieval over the document collection and returns an answer list)
8 Example
(Diagram: Google as the retrieval system, the Web as the document collection)
9 Search problem
- First applications in libraries (1950s)
- ISBN: 0-201-12227-8
- Author: Salton, Gerard
- Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
- Publisher: Addison-Wesley
- Date: 1989
- Content: <Text>
- External attributes and internal attribute (content)
- Search by external attributes: search in databases
- IR: search by content
10 Text Mining
- Text mining describes the application of data mining techniques to the automated discovery of useful or interesting knowledge from unstructured text.
- Task: discuss with your neighbour what a system needs in order to
- Determine who is a terrorist
- Determine the sentiment of a text
11 The big picture
- IR is easy.
- Let's stay with search for a while.
12 Search is still the biggest application
- Security applications: search for the villain
- Biomedical applications: semantic search
- Online media applications: disambiguate information
- Sentiment analysis: find nice movies
- Human consumption is still key
13 Why is the human so important?
- Because we talk about information, and understanding remains a human domain
- "There will be information on the Web that has a clearly defined meaning and can be analysed and traced by computer programs; there will be information, such as poetry and art, that requires the whole human intellect for an understanding that will always be subjective." (Tim Berners-Lee, Spinning the Semantic Web)
- "There is virtually no semantics in the Semantic Web. (...) Semantic content, in the Semantic Web, is generated by humans, ontologised by humans, and ultimately consumed by humans. Indeed, it is not unusual to hear complaints about how difficult it is to find and retain good ontologists." (https://uhra.herts.ac.uk/dspace/bitstream/2299/3629/1/903250.pdf)
14 The Central Problem: The Human
(Diagram: the information seeker turns concepts into query terms; authors turn concepts into document terms. Do these represent the same concepts?)
15 The Black Box
(Diagram: a query and the documents go into a black box, which returns results)
Slide is from Jimmy Lin's tutorial
16 Inside the IR Black Box
(Diagram: the documents and the query each pass through a representation step; the document representations are stored in an index, and a comparison function matches the query representation against the index to produce the results)
Slide is from Jimmy Lin's tutorial
17 Possible approaches
- 1. String matching (linear search in documents)
  - Syntactical
  - Difficult to improve
- 2. Indexing
  - Semantic
  - Flexible for further improvement
18 Indexing-based IR / similarity text analysis
(Diagram: both the document and the query/document are indexed, the query additionally analysed, into keyword representations; query evaluation then asks how similar this document is to the query/another document)
Slide is from Jimmy Lin's tutorial
19 Main problems
- Document indexing
- How do we best represent document contents?
- Matching
- To what extent does an identified information source correspond to a query/document?
- System evaluation
- How good is a system?
- Are the retrieved documents relevant? (precision)
- Are all the relevant documents retrieved? (recall)
20 Indexing
21 Document indexing
- Goal: find the important meanings and create an internal representation
- Factors to consider
- Accuracy in representing meanings (semantics)
- Exhaustiveness (cover all the contents)
(Diagram: representations range from string to word to phrase to concept, trading coverage against accuracy)
Slide is from Jimmy Lin's tutorial
22 Text Representation Issues
- In general, it is hard to capture these features from a text document
- One, it is difficult to extract them automatically
- Two, even if we did, it won't scale!
- One simplification is to represent documents as a bag of words (see the sketch below)
- Each document is represented as a bag of the words it contains, and each component of the bag records some measurement of the relative importance of a single word.
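A minimal sketch of the bag-of-words idea in Python; the tokenisation is a simplifying assumption (lower-case and split on whitespace):

from collections import Counter

def bag_of_words(text):
    # Naive tokenisation: lower-case and split on whitespace.
    tokens = text.lower().split()
    # The bag maps each word to how often it occurs in the document.
    return Counter(tokens)

doc = "This is a document in text analysis"
print(bag_of_words(doc))
# e.g. Counter({'this': 1, 'is': 1, 'a': 1, 'document': 1, ...})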
23 Some immediate problems
How do we compare these bags of words to find out whether they are similar? Let's say we have three bags: (House, Garden, House, door), (Household, Garden, Flat), (House, House, House, Gardening). How do we normalise these bags? Why is normalisation needed? What would we want to normalise?
24 Keyword selection and weighting
- How do we select important keywords?
25 Luhn's Ideas
- The frequency of word occurrence in a document is a useful measurement of word significance
26 Zipf and Luhn
27 Top 50 Terms
WSJ87 collection: a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences)
TIME collection: a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences)
28 Scholarship and the Long Tail
- Scholarship follows a long-tailed distribution: the interest in relatively unknown items declines much more slowly than it would if popularity were described by a normal distribution
- We have few statistical tools for dealing with long-tailed distributions
- Other problems include contested terms
- Graham White, "On Scholarship" (in Bartscherer ed., Switching Codes)
29 Stopwords / Stoplist
- Some words do not carry useful information. Common examples: of, in, about, with, I, although, ...
- A stoplist contains stopwords, which are not to be used as index terms
- Prepositions
- Articles
- Pronouns
- http://www.textfixer.com/resources/common-english-words.txt
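A small sketch of stopword removal, assuming a hand-picked stoplist (in practice one would load a full list such as the textfixer.com file above):

# A tiny illustrative stoplist; a real one would be much longer.
STOPWORDS = {"of", "in", "about", "with", "i", "although", "a", "the", "is"}

def remove_stopwords(tokens):
    # Keep only tokens that are not on the stoplist.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("This is a document about poverty in London".split()))
# ['This', 'document', 'poverty', 'London']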
30 Stemming
- Reason
- Different word forms may bear similar meaning (e.g. search, searching): create a standard representation for them
- Stemming
- Removing some endings of words
- computer
- compute
- computes
- computing
- computed
- computation
All of these reduce to the stem: comput
Is it always good to stem? Give examples!
Slide is from Jimmy Lin's tutorial
31 Porter algorithm (Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3), 130-137)
http://qaa.ath.cx/porter_js_demo.html
- Step 1: plurals and past participles
- SSES → SS: caresses → caress
- (v) ING → : motoring → motor
- Step 2: adj→n, n→v, n→adj, ...
- (m>0) OUSNESS → OUS: callousness → callous
- (m>0) ATIONAL → ATE: relational → relate
- Step 3
- (m>0) ICATE → IC: triplicate → triplic
- Step 4
- (m>1) AL → : revival → reviv
- (m>1) ANCE → : allowance → allow
- Step 5
- (m>1) E → : probate → probat
- (m>1 and d and L) → single letter: controll → control
Slide is from Jimmy Lin's tutorial
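Rather than re-implementing all of Porter's rules, a sketch can lean on an existing implementation; the example below assumes the NLTK library is installed:

from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()
words = ["computer", "compute", "computes", "computing", "computed", "computation"]
# Porter stemming strips suffixes, so these forms collapse towards the shared stem "comput".
print([stemmer.stem(w) for w in words])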
32 Lemmatization
- Transform to the standard form according to syntactic category ("produce" vs. the stem "produc-")
- E.g. verb + ing → verb
- noun + s → noun
- Needs POS tagging
- More accurate than stemming, but needs more resources
Slide partly taken from Jimmy Lin's tutorial
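A lemmatization sketch, again assuming NLTK (and its WordNet data); the syntactic category has to be supplied, which is why POS tagging is needed:

from nltk.stem import WordNetLemmatizer  # assumes nltk and the WordNet data are installed

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("searching", pos="v"))  # verb + ing -> verb: 'search'
print(lemmatizer.lemmatize("trucks", pos="n"))     # noun + s -> noun: 'truck'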
33 Index Documents (Bag of Words Approach)
(Diagram: the document "This is a document in text analysis" is turned into the index terms "Document Analysis Text Is This")
34 Result of indexing
- Each document is represented by a set of weighted keywords (terms)
- D1 → {(t1, w1), (t2, w2), ...}
- e.g. D1 → {(comput, 0.2), (architect, 0.3), ...}
- D2 → {(comput, 0.1), (network, 0.5), ...}
- Inverted file
- comput → {(D1, 0.2), (D2, 0.1), ...}
- The inverted file is used during retrieval for higher efficiency.
Slide partly taken from Jimmy Lin's tutorial
35 Inverted Index Example
Doc 1: This is a sample document with one sample sentence
Doc 2: This is another sample document

Dictionary                          Postings (doc id, freq)
Term      docs   total freq
This      2      2                  (1, 1) (2, 1)
is        2      2                  (1, 1) (2, 1)
sample    2      3                  (1, 2) (2, 1)
another   1      1                  (2, 1)

Slide is from ChengXiang Zhai
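A minimal sketch of building this inverted file in Python, using the two example documents above:

from collections import Counter, defaultdict

docs = {
    1: "This is a sample document with one sample sentence",
    2: "This is another sample document",
}

# Inverted file: term -> list of (doc id, frequency) postings.
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for term, freq in Counter(text.lower().split()).items():
        inverted[term].append((doc_id, freq))

print(inverted["sample"])  # [(1, 2), (2, 1)]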
36 Similarity
37 Similarity Models
- Boolean model
- Vector-space model
- Many more
38 Boolean model
- Document: logical conjunction of keywords
- Query: Boolean expression of keywords
- e.g. D = t1 ∧ t2 ∧ ... ∧ tn
- Q = (t1 ∧ t2) ∨ (t3 ∧ ¬t4)
- Problems
- Queries often return far too many documents, or too few
- End-users cannot manipulate Boolean operators correctly
- E.g. documents about poverty and crime
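A sketch of Boolean retrieval with set operations over a term-to-documents index (the index contents and the query are invented for illustration):

# Term -> set of documents containing it (a Boolean inverted index).
index = {
    "poverty": {1, 2, 4},
    "crime":   {2, 3, 4},
    "london":  {1, 4},
}

# Query: poverty AND crime AND NOT london
result = (index["poverty"] & index["crime"]) - index["london"]
print(result)  # {2}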
39 Vector space model
- Vector space: all the keywords encountered
- <t1, t2, t3, ..., tn>
- Document
- D = <a1, a2, a3, ..., an>
- ai = weight of ti in D
- Query
- Q = <b1, b2, b3, ..., bn>
- bi = weight of ti in Q
- R(D,Q) = Sim(D,Q)
40 Cosine Similarity
Similarity is calculated using the cosine similarity between the two vectors.
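The original slide shows the formula as an image; in symbols it is sim(D,Q) = (D . Q) / (|D| |Q|). A sketch in Python over term-weight dictionaries, reusing the example weights from the indexing slide:

import math

def cosine(d, q):
    # d and q map terms to weights.
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(cosine({"comput": 0.2, "architect": 0.3}, {"comput": 0.1, "network": 0.5}))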
41 Tf/Idf
- tf = term frequency
- frequency of a term/keyword in a document
- The higher the tf, the higher the importance (weight) for the document
- df = document frequency
- number of documents containing the term
- distribution of the term
- idf = inverse document frequency
- the unevenness of term distribution in the corpus
- the specificity of a term to a document
- The more evenly a term is distributed, the less specific it is to a document
- weight(t,D) = tf(t,D) * idf(t)
42 Exercise
- (1) Define the term/document matrix (a sketch follows below)
- D1: The silver truck arrives
- D2: The silver cannon fires silver bullets
- D3: The truck is on fire
- (2) Compute TF/IDF from Reuters
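A sketch of exercise (1), building the term/document matrix for D1-D3 and weighting it with weight(t,D) = tf(t,D) * idf(t); idf is taken as log(N/df), one common variant:

import math
from collections import Counter

docs = {
    "D1": "The silver truck arrives",
    "D2": "The silver cannon fires silver bullets",
    "D3": "The truck is on fire",
}

tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
vocab = sorted({t for counts in tf.values() for t in counts})
N = len(docs)
df = {t: sum(1 for counts in tf.values() if t in counts) for t in vocab}

# Term/document matrix with tf-idf weights.
weights = {d: {t: tf[d][t] * math.log(N / df[t]) for t in tf[d]} for d in docs}
for d, row in weights.items():
    print(d, row)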
43 Let's code our first text analysis engine
44 Our corpus
- A study on Kant's critique of judgement
- Aristotle's Metaphysics
- Hegel's Aesthetics
- Plato's Charmides
- McGreedy's War Diaries
- Excerpts from the Royal Irish Society
45 Text Analysis is an Experimental Science!
46 Text Analysis is an Experimental Science!
- Formulate a hypothesis
- Design an experiment to answer the question
- Perform the experiment
- Does the experiment answer the question?
- Rinse, repeat
47 Test Collections
- Three components of a test collection
- Collection of documents
- Set of topics
- Sets of relevant documents based on expert judgments
- Metrics for assessing performance
- Precision
- Recall
48 Precision vs. Recall
(Diagram: within the set of all documents, the retrieved set and the relevant set overlap; precision and recall measure that overlap)
Slide taken from Jimmy Lin's tutorial
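A sketch of the two metrics computed from a retrieved set and a relevant set (the document ids here are invented):

retrieved = {1, 2, 3, 5, 8}
relevant = {2, 3, 4, 8, 9, 10}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # are the retrieved documents relevant?
recall = len(hits) / len(relevant)      # are all the relevant documents retrieved?
print(precision, recall)  # 0.6 0.5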
49 The TREC experiments
- Once per year
- A set of documents and queries is distributed to the participants (the standard answers are unknown) (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit their answers (1000 per query) at the deadline (July)
- NIST people manually evaluate the answers and provide the correct answers (and a classification of IR systems) (July to August)
- TREC conference (November)
50 Towards Linked Data
Beyond the Simple Stuff
51
- Build relationships between documents
- Structure in the classic web: hyperlinks
- Mashing
- Cluster and create links
- Build relationships within documents
- Information Extraction
52 The traditional web
53 Web Mining
- No stable document collection (spider, crawler)
- Huge number of documents (partial collection)
- Multimedia documents
- Great variation in document quality
- Multilingual problem
54 Exploiting Inter-Document Links
- Description (anchor text)
- Links indicate the utility of a document
- What does a link tell us?
Slide is from ChengXiang Zhai
55 Mashing
56 Information filtering
- Instead of running changing queries over a stable document collection, we now want to filter an incoming document flow against stable interests (queries)
- A yes/no decision (instead of ordering documents)
- Advantage: the description of the user's interest may be improved using relevance feedback (the user is more willing to cooperate)
- The basic techniques used for IF are the same as those for IR: two sides of the same coin
(Diagram: an incoming stream doc3, doc2, doc1 passes through the IF component, which matches each document against a user profile and decides to keep or ignore it)
Slide taken from Jimmy Lin's tutorial
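A sketch of information filtering as a yes/no decision against a stable user profile (the profile terms, threshold and documents are invented for illustration):

# The user profile is a set of interest keywords; keep a document if enough of
# its words match the profile.
profile = {"arab", "spring", "protest", "twitter"}
THRESHOLD = 2

def keep(document):
    tokens = set(document.lower().split())
    return len(tokens & profile) >= THRESHOLD

stream = ["Protest organised on Twitter", "Football results from Saturday"]
print([doc for doc in stream if keep(doc)])  # keeps only the first document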
57 Let's mine Twitter
- Imagine you are a social scientist interested in the Arab Spring and the influence of social media, or in something else
- You know that social media plays an important role. Even the Pope tweets with an iPad!
58 Twitter API
- It's easy to get information out of Twitter
- Search API: http://twitter.com/#!/search/house
- http://twitter.com/statuses/public_timeline.rss
59 Twitter exercise
- What do we want to look for?
- Form groups
- Create an account with Yahoo Pipes
- http://pipes.yahoo.com/pipes/
- (You can use your Google one)
- Create a Pipe. What do you see?
60
- I. Access the keywords source
- Fetch CSV module
- Enter the URL of the CSV file: http://dl.dropbox.com/u/868826/Filter-Demo.csv
- Use "keywords" as the column name
- II. Loop through each element in the CSV file and build a search URL formatted for RSS output
- Under Operators, fetch the Loop module
- Drag the URL Builder module into the Loop's big field
- As base use http://search.twitter.com/search.atom
- As query parameters use "q" in the first box and item.keywords in the second
- Assign the results to item.loopurlbuilder
- III. Connect the CSV and Loop modules
61
- IV. Search Twitter
- Under Operators, fetch the Loop module
- Drag Sources > Fetch Feed into the Loop's big field
- As URL use item.loopurlbuilder
- Emit all results
- V. Connect the two Loop modules
- VI. Sort
- Under Operators, fetch the Sort module
- Sort by item.y:published.utime in descending order
- VII. Connect the Sort module to Pipe Output, the final module in every Yahoo Pipe
- VIII. Save and run the pipe
http://www.squidoo.com/yahoo-pipes-guide
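For comparison, a rough Python equivalent of the pipe described above, assuming the same CSV of keywords (with a "keywords" column) and the historic Twitter Atom search endpoint from the slides, which is no longer live; feedparser is used to read the feeds:

import csv
import urllib.parse
import urllib.request
import feedparser  # assumes the feedparser package is installed

CSV_URL = "http://dl.dropbox.com/u/868826/Filter-Demo.csv"
SEARCH_URL = "http://search.twitter.com/search.atom?q="  # historic endpoint from the slide

# I. Fetch the keywords from the CSV file.
with urllib.request.urlopen(CSV_URL) as response:
    rows = csv.DictReader(response.read().decode("utf-8").splitlines())
    keywords = [row["keywords"] for row in rows]

# II.-V. Build a search URL per keyword and fetch each feed.
entries = []
for kw in keywords:
    feed = feedparser.parse(SEARCH_URL + urllib.parse.quote(kw))
    entries.extend(feed.entries)

# VI.-VII. Sort by publication date, most recent first, and print.
entries.sort(key=lambda e: e.get("published", ""), reverse=True)
for e in entries[:10]:
    print(e.get("published", ""), e.get("title", ""))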
62 Cluster to Create Links
63 Group together similar documents
Idea: frequent terms carry more information about the cluster they might belong to. Highly correlated frequent terms probably belong to the same cluster.
http://www.iboogie.com/
64 Clustering Example
How many terms do these documents have?
65 English Novels
- Normalise
- Calculate similarity according to the dot product (see the sketch below)
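A sketch of the normalise-and-dot-product idea for comparing two documents (the two snippets here are invented examples, not the novels themselves):

import math
from collections import Counter

def unit_vector(text):
    # Term-frequency vector, normalised to unit length.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def dot(u, v):
    return sum(w * v[t] for t, w in u.items() if t in v)

a = unit_vector("the house and the garden")
b = unit_vector("a garden behind the house")
print(dot(a, b))  # closer to 1 means more similar vocabulary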
66 Let's code again
67 FReSH (Forging ReSTful Services for e-Humanities)
- Creating Semantic Relationships
68
69
- Digital edition of 6 newspapers / periodicals
- Monthly Repository (1806-1837)
- Northern Star (1837-1852)
- The Leader (1850-1860)
- English Woman's Journal (1858-1864)
- Tomahawk (1867-1870)
- Publishers' Circular (1837-1959; NCSE 1880-1890)
70 Semantic view
71 OCR Problems
- Thin Compimy in fmmod to iiKu'-t tho dooiro
ol'.those who seek, without Hpcoiilal/ioii, Hiifo
and .profltublo invtwtmont for larjo or Hinall
HiiniH, at a hi(jlilt"r rulo of intoront tlian
can be obtainod from tho in 'ihlio 1'uihIh, and
on oh Hocuro a basin. Tho invoHlinont Hystom,
whilo it olfors tho preutoHt advantages to tho
public, nifordH to i(.H -moniberH n perfect
Boourity, luul a hi hor rato ofintonmt than can
bo obtained oluowhoro, 'I'ho capital of 250,000
in divided, for tho oonvonionco of invoiitmont
and tninafor, into 1 bIiui-ob, of which 10a.
only'wiUbe oallod.
72
- N-grams
- Latent Semantic Indexing: http://www.seo-blog.com/latent-semantic-indexing-lsi-explained.php
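A sketch of character n-grams, which can help match OCR-damaged spellings like those above against clean text (the trigram size and the similarity measure are illustrative choices):

def ngrams(word, n=3):
    # Character n-grams of a word, e.g. trigrams.
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def overlap(a, b, n=3):
    # Dice coefficient over the two n-gram sets.
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# Compare a clean word with its OCR-garbled form from the slide above.
print(overlap("investment", "invtwtmont"))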
73 Demo
74 Producing Structured Information
75 Information Extraction (IE)
- IE systems
- Identify documents of a specific type
- Extract information according to pre-defined templates
- Current approaches to IE focus on restricted domains, for instance news wires
http://www.opencalais.com/about
http://viewer.opencalais.com/
76 History of IE: Terror, fleets, catastrophes and management
- The Message Understanding Conferences (MUC) were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction. The character of this competition, with many concurrent research teams competing against one another, required the development of standards for evaluation, e.g. the adoption of metrics like precision and recall.
- http://en.wikipedia.org/wiki/Message_Understanding_Conference
The MUC-4 Terrorism Task: the task given to participants in the MUC-4 evaluation (1991) was to extract specific information on terrorist incidents from newspaper and newswire texts relating to South America.
77 Hunting for Things
- Named entity recognition
- Labelling names of things
- An entity is a discrete thing like King's College London
- But also dates, places, etc.
78 The aims: things and their relations
- Find and understand the limited relevant parts of texts
- Clear, factual information (who did what to whom, when?)
- Produce a structured representation of the relevant information: relations
- Terrorists have heads
- Storms cause damage
79 Independent linguistic tools
- A Text Zoner, which turns a text into a set of segments.
- A Preprocessor, which turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items.
- A Filter, which turns a sequence of sentences into a smaller set of sentences by filtering out irrelevant ones.
- A Preparser, which takes a sequence of lexical items and tries to identify reliably determinable small-scale structures, e.g. names.
- A Parser, which takes a set of lexical items (words and phrases) and outputs a set of parse-tree fragments, which may or may not be complete.
80 Independent linguistic tools II
- A Fragment Combiner, which attempts to combine parse-tree or logical-form fragments into a structure of the same type for the whole sentence.
- A Semantic Interpreter, which generates semantic structures or logical forms from parse-tree fragments.
- A Lexical Disambiguator, which indexes lexical items to one and only one lexical sense, or can be viewed as reducing the ambiguity of the predicates in the logical-form fragments.
- A Coreference Resolver, which identifies different descriptions of the same entity in different parts of a text.
- A Template Generator, which fills the IE templates from the semantic structures. Off to Linked Data!
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.6480&rep=rep1&type=pdf
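A sketch of how such independent tools chain together; each stage below is only a crude stand-in for the corresponding component listed above:

def zoner(text):
    return text.split("\n\n")  # text -> segments

def preprocessor(segment):
    return segment.split(". ")  # segment -> sentences

def filter_relevant(sentences):
    return [s for s in sentences if "attack" in s.lower()]  # keep relevant sentences

def template_generator(sentence):
    return {"type": "incident", "text": sentence}  # sentence -> filled template

text = "An attack was reported in the capital. Markets were quiet."
templates = [template_generator(s)
             for segment in zoner(text)
             for s in filter_relevant(preprocessor(segment))]
print(templates)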
81 Stanford NLP
82 More examples from Stanford
- Use conventional classification algorithms to classify substrings of a document as to be extracted or not.
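The Stanford tools need a Java setup; as a stand-in, here is a sketch of named entity recognition with NLTK's built-in chunker (assumes nltk and its tokenizer, tagger and chunker data packages are installed):

import nltk

sentence = "Tobias Blanke teaches text analysis at King's College London."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # part-of-speech tags
tree = nltk.ne_chunk(tagged)   # chunk named entities (PERSON, ORGANIZATION, GPE, ...)

for subtree in tree:
    if hasattr(subtree, "label"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))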
83 Let's code again
84
- Parliament at Stormont, 1921-1972
- Transcripts of all debates (Hansards)
85
- Georeferencing: basic principles
- Informal: based on place names
- Formal: based on coordinates, etc.
- Benefits
- Resolving ambiguity
- Ease of access to data objects
- Integration of data from heterogeneous sources
- Resolving space and time
86 DBpedia
- Linked Data: all we need to do now is to return the results in the right format
- For instance, by extracting entities with DBpedia Spotlight
- http://dbpedia.org/spotlight
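A sketch of calling the DBpedia Spotlight annotation service over HTTP; the endpoint URL, the parameters and the shape of the JSON response used here are assumptions about the public service, so check the Spotlight documentation before relying on them:

import json
import urllib.parse
import urllib.request

# Assumed public Spotlight endpoint and parameters.
ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"
params = urllib.parse.urlencode({"text": "King's College London is on the Strand.",
                                 "confidence": 0.4})

request = urllib.request.Request(ENDPOINT + "?" + params,
                                 headers={"Accept": "application/json"})
with urllib.request.urlopen(request) as response:
    data = json.load(response)

# Assumed response shape: a "Resources" list with "@surfaceForm" and "@URI" keys.
for resource in data.get("Resources", []):
    print(resource.get("@surfaceForm"), "->", resource.get("@URI"))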
87 Sponging
88 Thanks
89 Example from Stanford
The task given to participants in the MUC-4 evaluation (1991) was to extract specific information on terrorist incidents from newspaper and newswire texts relating to South America.
Part-of-speech taggers: systems that assign one and only one part-of-speech symbol (like "proper noun" or "auxiliary verb") to a word in a running text, and do so (usually) on the basis of statistical generalizations across very large bodies of text.
90 (No transcript)