Title: How do computers understand texts?
1 How do computers understand texts?
Tobias Blanke
2 My contact details
- Name: Tobias Blanke
- Telephone: 020 7848 1975
- Email: tobias.blanke_at_kcl.ac.uk
- Address: 51 Oakfield Road (!), N4 4LD
3 Outline
- How do computers understand texts so that you don't have to read them?
- The same steps
- We stay with searching for a long time.
- How to use text analysis for Linked Data
- You will build your own Twitter miner
4 Why? A simple question
- Suppose you have a million documents and a question: what do you do?
- Solution: the user reads all the documents in the store, retains the relevant documents and discards all the others. Perfect retrieval, but NOT POSSIBLE!
- Alternative: use a high-speed computer to read the entire document collection and extract the relevant documents.
5 Data Geeks are in demand
New research by the McKinsey Global Institute (MGI) forecasts a 50 to 60 percent gap between the supply and demand of people with deep analytical talent.
http://jonathanstray.com/investigating-thousands-or-millions-of-documents-by-visualizing-clusters
6 The Search Problem
7 The problem of traditional text analysis is retrieval
- Goal: find documents relevant to an information need from a large document set
(Diagram: an information need is formulated as a query, a "magic system" performs retrieval over the document collection and returns an answer list)
8 Example
(Diagram: Google as the retrieval system, the Web as the document collection)
9 Search problem
- First applications in libraries (1950s)
- ISBN: 0-201-12227-8
- Author: Salton, Gerard
- Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
- Publisher: Addison-Wesley
- Date: 1989
- Content: <Text>
- External attributes and internal attribute (content)
- Search by external attributes: search in databases
- IR: search by content
10 Text Mining
- Text mining describes the application of data mining techniques to the automated discovery of useful or interesting knowledge from unstructured text.
- Task: discuss with your neighbour what a system needs in order to
- Determine who is a terrorist
- Determine the sentiment of a text
11 The big picture
- IR is easy.
- Let's stay with search for a while.
12 Search is still the biggest application
- Security applications: search for the villain
- Biomedical applications: semantic search
- Online media applications: disambiguate information
- Sentiment analysis: find nice movies
- Human consumption is still key
13 Why is the human so important?
- Because we talk about information, and understanding remains a human domain
- "There will be information on the Web that has a clearly defined meaning and can be analysed and traced by computer programs; there will be information, such as poetry and art, that requires the whole human intellect for an understanding that will always be subjective." (Tim Berners-Lee, Spinning the Semantic Web)
- "There is virtually no semantics in the Semantic Web. (...) Semantic content, in the Semantic Web, is generated by humans, ontologised by humans, and ultimately consumed by humans. Indeed, it is not unusual to hear complaints about how difficult it is to find and retain good ontologists." (https://uhra.herts.ac.uk/dspace/bitstream/2299/3629/1/903250.pdf)
14 The Central Problem: The Human
(Diagram: the information seeker turns concepts into query terms; authors turn concepts into document terms. Do these represent the same concepts?)
15 The Black Box
(Diagram: a query and the documents go into a black box, which returns results)
Slide is from Jimmy Lin's tutorial
16 Inside the IR Black Box
(Diagram: the documents and the query each pass through a representation step; the document representations are stored in an index, and a comparison function matches the query representation against the index to produce the results)
Slide is from Jimmy Lin's tutorial
17 Possible approaches
- 1. String matching (linear search in documents)
  - Syntactical
  - Difficult to improve
- 2. Indexing
  - Semantic
  - Flexible for further improvement
18 Indexing-based IR / similarity text analysis
(Diagram: both the document and the query/document are indexed, the query additionally analysed, into keyword representations; query evaluation then asks how similar this document is to the query/another document)
Slide is from Jimmy Lin's tutorial
19 Main problems
- Document indexing
- How do we best represent document contents?
- Matching
- To what extent does an identified information source correspond to a query/document?
- System evaluation
- How good is a system?
- Are the retrieved documents relevant? (precision)
- Are all the relevant documents retrieved? (recall)
20 Indexing
21 Document indexing
- Goal: find the important meanings and create an internal representation
- Factors to consider
- Accuracy in representing meanings (semantics)
- Exhaustiveness (cover all the contents)
(Diagram: representations range from string to word to phrase to concept, trading coverage against accuracy)
Slide is from Jimmy Lin's tutorial
22 Text Representation Issues
- In general, it is hard to capture these features from a text document
- One, it is difficult to extract them automatically
- Two, even if we did, it won't scale!
- One simplification is to represent documents as a bag of words (see the sketch below)
- Each document is represented as a bag of the words it contains, and each component of the bag records some measurement of the relative importance of a single word.
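A minimal sketch of the bag-of-words idea in Python; the tokenisation is a simplifying assumption (lower-case and split on whitespace):

from collections import Counter

def bag_of_words(text):
    # Naive tokenisation: lower-case and split on whitespace.
    tokens = text.lower().split()
    # The bag maps each word to how often it occurs in the document.
    return Counter(tokens)

doc = "This is a document in text analysis"
print(bag_of_words(doc))
# e.g. Counter({'this': 1, 'is': 1, 'a': 1, 'document': 1, ...})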
23 Some immediate problems
How do we compare these bags of words to find out whether they are similar? Let's say we have three bags: (House, Garden, House, door), (Household, Garden, Flat), (House, House, House, Gardening). How do we normalise these bags? Why is normalisation needed? What would we want to normalise?
24 Keyword selection and weighting
- How do we select important keywords?
25 Luhn's Ideas
- The frequency of word occurrence in a document is a useful measurement of word significance
26 Zipf and Luhn
27 Top 50 Terms
WSJ87 collection: a 131.6 MB collection of 46,449 newspaper articles (19 million term occurrences)
TIME collection: a 1.6 MB collection of 423 short TIME magazine articles (245,412 term occurrences)
28 Scholarship and the Long Tail
- Scholarship follows a long-tailed distribution: the interest in relatively unknown items declines much more slowly than it would if popularity were described by a normal distribution
- We have few statistical tools for dealing with long-tailed distributions
- Other problems include contested terms
- Graham White, "On Scholarship" (in Bartscherer ed., Switching Codes)
29 Stopwords / Stoplist
- Some words do not carry useful information. Common examples: of, in, about, with, I, although, ...
- A stoplist contains stopwords, which are not to be used as index terms
- Prepositions
- Articles
- Pronouns
- http://www.textfixer.com/resources/common-english-words.txt
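A small sketch of stopword removal, assuming a hand-picked stoplist (in practice one would load a full list such as the textfixer.com file above):

# A tiny illustrative stoplist; a real one would be much longer.
STOPWORDS = {"of", "in", "about", "with", "i", "although", "a", "the", "is"}

def remove_stopwords(tokens):
    # Keep only tokens that are not on the stoplist.
    return [t for t in tokens if t.lower() not in STOPWORDS]

print(remove_stopwords("This is a document about poverty in London".split()))
# ['This', 'document', 'poverty', 'London']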
30 Stemming
- Reason
- Different word forms may bear similar meaning (e.g. search, searching): create a standard representation for them
- Stemming
- Removing some endings of words
- computer
- compute
- computes
- computing
- computed
- computation
All of these reduce to the stem: comput
Is it always good to stem? Give examples!
Slide is from Jimmy Lin's tutorial
31 Porter algorithm (Porter, M.F., 1980, An algorithm for suffix stripping, Program, 14(3), 130-137)
http://qaa.ath.cx/porter_js_demo.html
- Step 1: plurals and past participles
- SSES → SS: caresses → caress
- (v) ING → : motoring → motor
- Step 2: adj→n, n→v, n→adj, ...
- (m>0) OUSNESS → OUS: callousness → callous
- (m>0) ATIONAL → ATE: relational → relate
- Step 3
- (m>0) ICATE → IC: triplicate → triplic
- Step 4
- (m>1) AL → : revival → reviv
- (m>1) ANCE → : allowance → allow
- Step 5
- (m>1) E → : probate → probat
- (m>1 and d and L) → single letter: controll → control
Slide is from Jimmy Lin's tutorial
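Rather than re-implementing all of Porter's rules, a sketch can lean on an existing implementation; the example below assumes the NLTK library is installed:

from nltk.stem import PorterStemmer  # assumes nltk is installed

stemmer = PorterStemmer()
words = ["computer", "compute", "computes", "computing", "computed", "computation"]
# Porter stemming strips suffixes, so these forms collapse towards the shared stem "comput".
print([stemmer.stem(w) for w in words])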
32 Lemmatization
- Transform to the standard form according to syntactic category ("produce" vs. the stem "produc-")
- E.g. verb + ing → verb
- noun + s → noun
- Needs POS tagging
- More accurate than stemming, but needs more resources
Slide partly taken from Jimmy Lin's tutorial
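A lemmatization sketch, again assuming NLTK (and its WordNet data); the syntactic category has to be supplied, which is why POS tagging is needed:

from nltk.stem import WordNetLemmatizer  # assumes nltk and the WordNet data are installed

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("searching", pos="v"))  # verb + ing -> verb: 'search'
print(lemmatizer.lemmatize("trucks", pos="n"))     # noun + s -> noun: 'truck'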
33 Index Documents (Bag of Words Approach)
(Diagram: the document "This is a document in text analysis" is turned into the index terms "Document Analysis Text Is This")
34 Result of indexing
- Each document is represented by a set of weighted keywords (terms)
- D1 → {(t1, w1), (t2, w2), ...}
- e.g. D1 → {(comput, 0.2), (architect, 0.3), ...}
- D2 → {(comput, 0.1), (network, 0.5), ...}
- Inverted file
- comput → {(D1, 0.2), (D2, 0.1), ...}
- The inverted file is used during retrieval for higher efficiency.
Slide partly taken from Jimmy Lin's tutorial
35 Inverted Index Example
Doc 1: This is a sample document with one sample sentence
Doc 2: This is another sample document

Dictionary                          Postings (doc id, freq)
Term      docs   total freq
This      2      2                  (1, 1) (2, 1)
is        2      2                  (1, 1) (2, 1)
sample    2      3                  (1, 2) (2, 1)
another   1      1                  (2, 1)

Slide is from ChengXiang Zhai
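A minimal sketch of building this inverted file in Python, using the two example documents above:

from collections import Counter, defaultdict

docs = {
    1: "This is a sample document with one sample sentence",
    2: "This is another sample document",
}

# Inverted file: term -> list of (doc id, frequency) postings.
inverted = defaultdict(list)
for doc_id, text in docs.items():
    for term, freq in Counter(text.lower().split()).items():
        inverted[term].append((doc_id, freq))

print(inverted["sample"])  # [(1, 2), (2, 1)]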
36 Similarity
37 Similarity Models
- Boolean model
- Vector-space model
- Many more
38 Boolean model
- Document: logical conjunction of keywords
- Query: Boolean expression of keywords
- e.g. D = t1 ∧ t2 ∧ ... ∧ tn
- Q = (t1 ∧ t2) ∨ (t3 ∧ ¬t4)
- Problems
- Queries often return far too many documents, or too few
- End-users cannot manipulate Boolean operators correctly
- E.g. documents about poverty and crime
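A sketch of Boolean retrieval with set operations over a term-to-documents index (the index contents and the query are invented for illustration):

# Term -> set of documents containing it (a Boolean inverted index).
index = {
    "poverty": {1, 2, 4},
    "crime":   {2, 3, 4},
    "london":  {1, 4},
}

# Query: poverty AND crime AND NOT london
result = (index["poverty"] & index["crime"]) - index["london"]
print(result)  # {2}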
39 Vector space model
- Vector space: all the keywords encountered
- <t1, t2, t3, ..., tn>
- Document
- D = <a1, a2, a3, ..., an>
- ai = weight of ti in D
- Query
- Q = <b1, b2, b3, ..., bn>
- bi = weight of ti in Q
- R(D,Q) = Sim(D,Q)
40 Cosine Similarity
Similarity is calculated using the cosine similarity between the two vectors.
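The original slide shows the formula as an image; in symbols it is sim(D,Q) = (D . Q) / (|D| |Q|). A sketch in Python over term-weight dictionaries, reusing the example weights from the indexing slide:

import math

def cosine(d, q):
    # d and q map terms to weights.
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

print(cosine({"comput": 0.2, "architect": 0.3}, {"comput": 0.1, "network": 0.5}))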
41 Tf/Idf
- tf = term frequency
- frequency of a term/keyword in a document
- The higher the tf, the higher the importance (weight) for the document
- df = document frequency
- number of documents containing the term
- distribution of the term
- idf = inverse document frequency
- the unevenness of term distribution in the corpus
- the specificity of a term to a document
- The more evenly a term is distributed, the less specific it is to a document
- weight(t,D) = tf(t,D) * idf(t)
42 Exercise
- (1) Define the term/document matrix (a sketch follows below)
- D1: The silver truck arrives
- D2: The silver cannon fires silver bullets
- D3: The truck is on fire
- (2) Compute TF/IDF from Reuters
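A sketch of exercise (1), building the term/document matrix for D1-D3 and weighting it with weight(t,D) = tf(t,D) * idf(t); idf is taken as log(N/df), one common variant:

import math
from collections import Counter

docs = {
    "D1": "The silver truck arrives",
    "D2": "The silver cannon fires silver bullets",
    "D3": "The truck is on fire",
}

tf = {d: Counter(text.lower().split()) for d, text in docs.items()}
vocab = sorted({t for counts in tf.values() for t in counts})
N = len(docs)
df = {t: sum(1 for counts in tf.values() if t in counts) for t in vocab}

# Term/document matrix with tf-idf weights.
weights = {d: {t: tf[d][t] * math.log(N / df[t]) for t in tf[d]} for d in docs}
for d, row in weights.items():
    print(d, row)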
43 Let's code our first text analysis engine
44 Our corpus
- A study on Kant's critique of judgement
- Aristotle's Metaphysics
- Hegel's Aesthetics
- Plato's Charmides
- McGreedy's War Diaries
- Excerpts from the Royal Irish Society
45 Text Analysis is an Experimental Science!
46 Text Analysis is an Experimental Science!
- Formulate a hypothesis
- Design an experiment to answer the question
- Perform the experiment
- Does the experiment answer the question?
- Rinse, repeat
47 Test Collections
- Three components of a test collection
- Collection of documents
- Set of topics
- Sets of relevant documents based on expert judgments
- Metrics for assessing performance
- Precision
- Recall
48 Precision vs. Recall
(Diagram: within the set of all documents, the retrieved set and the relevant set overlap; precision and recall measure that overlap)
Slide taken from Jimmy Lin's tutorial
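A sketch of the two metrics computed from a retrieved set and a relevant set (the document ids here are invented):

retrieved = {1, 2, 3, 5, 8}
relevant = {2, 3, 4, 8, 9, 10}

hits = retrieved & relevant
precision = len(hits) / len(retrieved)  # are the retrieved documents relevant?
recall = len(hits) / len(relevant)      # are all the relevant documents retrieved?
print(precision, recall)  # 0.6 0.5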
49 The TREC experiments
- Once per year
- A set of documents and queries is distributed to the participants (the standard answers are unknown) (April)
- Participants work (very hard) to construct and fine-tune their systems, and submit their answers (1000 per query) at the deadline (July)
- NIST people manually evaluate the answers and provide the correct answers (and a classification of IR systems) (July to August)
- TREC conference (November)
50 Towards Linked Data
Beyond the Simple Stuff
51
- Build relationships between documents
- Structure in the classic web: hyperlinks
- Mashing
- Cluster and create links
- Build relationships within documents
- Information Extraction
52 The traditional web
53 Web Mining
- No stable document collection (spider, crawler)
- Huge number of documents (partial collection)
- Multimedia documents
- Great variation in document quality
- Multilingual problem
54 Exploiting Inter-Document Links
- Description (anchor text)
- Links indicate the utility of a document
- What does a link tell us?
Slide is from ChengXiang Zhai
55 Mashing
56 Information filtering
- Instead of running changing queries over a stable document collection, we now want to filter an incoming document flow against stable interests (queries)
- A yes/no decision (instead of ordering documents)
- Advantage: the description of the user's interest may be improved using relevance feedback (the user is more willing to cooperate)
- The basic techniques used for IF are the same as those for IR: two sides of the same coin
(Diagram: an incoming stream doc3, doc2, doc1 passes through the IF component, which matches each document against a user profile and decides to keep or ignore it)
Slide taken from Jimmy Lin's tutorial
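A sketch of information filtering as a yes/no decision against a stable user profile (the profile terms, threshold and documents are invented for illustration):

# The user profile is a set of interest keywords; keep a document if enough of
# its words match the profile.
profile = {"arab", "spring", "protest", "twitter"}
THRESHOLD = 2

def keep(document):
    tokens = set(document.lower().split())
    return len(tokens & profile) >= THRESHOLD

stream = ["Protest organised on Twitter", "Football results from Saturday"]
print([doc for doc in stream if keep(doc)])  # keeps only the first document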
57 Let's mine Twitter
- Imagine you are a social scientist interested in the Arab Spring and the influence of social media, or in something else
- You know that social media plays an important role. Even the Pope tweets with an iPad!
58 Twitter API
- It's easy to get information out of Twitter
- Search API: http://twitter.com/#!/search/house
- http://twitter.com/statuses/public_timeline.rss
59 Twitter exercise
- What do we want to look for?
- Form groups
- Create an account with Yahoo Pipes
- http://pipes.yahoo.com/pipes/
- (You can use your Google one)
- Create a Pipe. What do you see?
60
- I. Access the keywords source
- Fetch CSV module
- Enter the URL of the CSV file: http://dl.dropbox.com/u/868826/Filter-Demo.csv
- Use "keywords" as the column name
- II. Loop through each element in the CSV file and build a search URL formatted for RSS output
- Under Operators, fetch the Loop module
- Drag the URL Builder module into the Loop's big field
- As base use http://search.twitter.com/search.atom
- As query parameters use "q" in the first box and item.keywords in the second
- Assign the results to item.loopurlbuilder
- III. Connect the CSV and Loop modules
61
- IV. Search Twitter
- Under Operators, fetch the Loop module
- Drag Sources > Fetch Feed into the Loop's big field
- As URL use item.loopurlbuilder
- Emit all results
- V. Connect the two Loop modules
- VI. Sort
- Under Operators, fetch the Sort module
- Sort by item.y:published.utime in descending order
- VII. Connect the Sort module to Pipe Output, the final module in every Yahoo Pipe
- VIII. Save and run the pipe
http://www.squidoo.com/yahoo-pipes-guide
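For comparison, a rough Python equivalent of the pipe described above, assuming the same CSV of keywords (with a "keywords" column) and the historic Twitter Atom search endpoint from the slides, which is no longer live; feedparser is used to read the feeds:

import csv
import urllib.parse
import urllib.request
import feedparser  # assumes the feedparser package is installed

CSV_URL = "http://dl.dropbox.com/u/868826/Filter-Demo.csv"
SEARCH_URL = "http://search.twitter.com/search.atom?q="  # historic endpoint from the slide

# I. Fetch the keywords from the CSV file.
with urllib.request.urlopen(CSV_URL) as response:
    rows = csv.DictReader(response.read().decode("utf-8").splitlines())
    keywords = [row["keywords"] for row in rows]

# II.-V. Build a search URL per keyword and fetch each feed.
entries = []
for kw in keywords:
    feed = feedparser.parse(SEARCH_URL + urllib.parse.quote(kw))
    entries.extend(feed.entries)

# VI.-VII. Sort by publication date, most recent first, and print.
entries.sort(key=lambda e: e.get("published", ""), reverse=True)
for e in entries[:10]:
    print(e.get("published", ""), e.get("title", ""))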
62 Cluster to Create Links
63 Group together similar documents
Idea: frequent terms carry more information about the cluster they might belong to. Highly correlated frequent terms probably belong to the same cluster.
http://www.iboogie.com/
64 Clustering Example
How many terms do these documents have?
65 English Novels
- Normalise
- Calculate similarity according to the dot product (see the sketch below)
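A sketch of the normalise-and-dot-product idea for comparing two documents (the two snippets here are invented examples, not the novels themselves):

import math
from collections import Counter

def unit_vector(text):
    # Term-frequency vector, normalised to unit length.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / norm for t, c in counts.items()}

def dot(u, v):
    return sum(w * v[t] for t, w in u.items() if t in v)

a = unit_vector("the house and the garden")
b = unit_vector("a garden behind the house")
print(dot(a, b))  # closer to 1 means more similar vocabulary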
66 Let's code again
67 FReSH (Forging ReSTful Services for e-Humanities)
- Creating Semantic Relationships
68
69
- Digital edition of 6 newspapers / periodicals
- Monthly Repository (1806-1837)
- Northern Star (1837-1852)
- The Leader (1850-1860)
- English Woman's Journal (1858-1864)
- Tomahawk (1867-1870)
- Publishers' Circular (1837-1959; NCSE 1880-1890)
70 Semantic view
71 OCR Problems
- Thin Compimy in fmmod to iiKu'-t tho dooiro
ol'.those who seek, without Hpcoiilal/ioii, Hiifo
and .profltublo invtwtmont for larjo or Hinall
HiiniH, at a hi(jlilt"r rulo of intoront tlian
can be obtainod from tho in 'ihlio 1'uihIh, and
on oh Hocuro a basin. Tho invoHlinont Hystom,
whilo it olfors tho preutoHt advantages to tho
public, nifordH to i(.H -moniberH n perfect
Boourity, luul a hi hor rato ofintonmt than can
bo obtained oluowhoro, 'I'ho capital of 250,000
in divided, for tho oonvonionco of invoiitmont
and tninafor, into 1 bIiui-ob, of which 10a.
only'wiUbe oallod.
72
- N-grams
- Latent Semantic Indexing: http://www.seo-blog.com/latent-semantic-indexing-lsi-explained.php
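A sketch of character n-grams, which can help match OCR-damaged spellings like those above against clean text (the trigram size and the similarity measure are illustrative choices):

def ngrams(word, n=3):
    # Character n-grams of a word, e.g. trigrams.
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def overlap(a, b, n=3):
    # Dice coefficient over the two n-gram sets.
    ga, gb = ngrams(a, n), ngrams(b, n)
    return 2 * len(ga & gb) / (len(ga) + len(gb))

# Compare a clean word with its OCR-garbled form from the slide above.
print(overlap("investment", "invtwtmont"))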
73 Demo
74 Producing Structured Information
75 Information Extraction (IE)
- IE systems
- Identify documents of a specific type
- Extract information according to pre-defined templates
- Current approaches to IE focus on restricted domains, for instance news wires
http://www.opencalais.com/about
http://viewer.opencalais.com/
76 History of IE: Terror, fleets, catastrophes and management
- The Message Understanding Conferences (MUC) were initiated and financed by DARPA (Defense Advanced Research Projects Agency) to encourage the development of new and better methods of information extraction. The character of this competition, with many concurrent research teams competing against one another, required the development of standards for evaluation, e.g. the adoption of metrics like precision and recall.
- http://en.wikipedia.org/wiki/Message_Understanding_Conference
The MUC-4 Terrorism Task: the task given to participants in the MUC-4 evaluation (1991) was to extract specific information on terrorist incidents from newspaper and newswire texts relating to South America.
77 Hunting for Things
- Named entity recognition
- Labelling names of things
- An entity is a discrete thing like King's College London
- But also dates, places, etc.
78 The aims: things and their relations
- Find and understand the limited relevant parts of texts
- Clear, factual information (who did what to whom, when?)
- Produce a structured representation of the relevant information: relations
- Terrorists have heads
- Storms cause damage
79 Independent linguistic tools
- A Text Zoner, which turns a text into a set of segments.
- A Preprocessor, which turns a text or text segment into a sequence of sentences, each of which is a sequence of lexical items.
- A Filter, which turns a sequence of sentences into a smaller set of sentences by filtering out irrelevant ones.
- A Preparser, which takes a sequence of lexical items and tries to identify reliably determinable small-scale structures, e.g. names.
- A Parser, which takes a set of lexical items (words and phrases) and outputs a set of parse-tree fragments, which may or may not be complete.
80 Independent linguistic tools II
- A Fragment Combiner, which attempts to combine parse-tree or logical-form fragments into a structure of the same type for the whole sentence.
- A Semantic Interpreter, which generates semantic structures or logical forms from parse-tree fragments.
- A Lexical Disambiguator, which indexes lexical items to one and only one lexical sense, or can be viewed as reducing the ambiguity of the predicates in the logical-form fragments.
- A Coreference Resolver, which identifies different descriptions of the same entity in different parts of a text.
- A Template Generator, which fills the IE templates from the semantic structures. Off to Linked Data!
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.61.6480&rep=rep1&type=pdf
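A sketch of how such independent tools chain together; each stage below is only a crude stand-in for the corresponding component listed above:

def zoner(text):
    return text.split("\n\n")  # text -> segments

def preprocessor(segment):
    return segment.split(". ")  # segment -> sentences

def filter_relevant(sentences):
    return [s for s in sentences if "attack" in s.lower()]  # keep relevant sentences

def template_generator(sentence):
    return {"type": "incident", "text": sentence}  # sentence -> filled template

text = "An attack was reported in the capital. Markets were quiet."
templates = [template_generator(s)
             for segment in zoner(text)
             for s in filter_relevant(preprocessor(segment))]
print(templates)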
81 Stanford NLP
82 More examples from Stanford
- Use conventional classification algorithms to classify substrings of a document as to be extracted or not.
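The Stanford tools need a Java setup; as a stand-in, here is a sketch of named entity recognition with NLTK's built-in chunker (assumes nltk and its tokenizer, tagger and chunker data packages are installed):

import nltk

sentence = "Tobias Blanke teaches text analysis at King's College London."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # part-of-speech tags
tree = nltk.ne_chunk(tagged)   # chunk named entities (PERSON, ORGANIZATION, GPE, ...)

for subtree in tree:
    if hasattr(subtree, "label"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))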
83 Let's code again
84
- Parliament at Stormont, 1921-1972
- Transcripts of all debates (Hansards)
85
- Georeferencing: basic principles
- Informal: based on place names
- Formal: based on coordinates, etc.
- Benefits
- Resolving ambiguity
- Ease of access to data objects
- Integration of data from heterogeneous sources
- Resolving space and time
86 DBpedia
- Linked Data: all we need to do now is to return the results in the right format
- For instance, by extracting entities with DBpedia Spotlight
- http://dbpedia.org/spotlight
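A sketch of calling the DBpedia Spotlight annotation service over HTTP; the endpoint URL, the parameters and the shape of the JSON response used here are assumptions about the public service, so check the Spotlight documentation before relying on them:

import json
import urllib.parse
import urllib.request

# Assumed public Spotlight endpoint and parameters.
ENDPOINT = "https://api.dbpedia-spotlight.org/en/annotate"
params = urllib.parse.urlencode({"text": "King's College London is on the Strand.",
                                 "confidence": 0.4})

request = urllib.request.Request(ENDPOINT + "?" + params,
                                 headers={"Accept": "application/json"})
with urllib.request.urlopen(request) as response:
    data = json.load(response)

# Assumed response shape: a "Resources" list with "@surfaceForm" and "@URI" keys.
for resource in data.get("Resources", []):
    print(resource.get("@surfaceForm"), "->", resource.get("@URI"))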
87 Sponging
88 Thanks
89 Example from Stanford
The task given to participants in the MUC-4 evaluation (1991) was to extract specific information on terrorist incidents from newspaper and newswire texts relating to South America.
Part-of-speech taggers: systems that assign one and only one part-of-speech symbol (like "proper noun" or "auxiliary verb") to a word in a running text, and do so (usually) on the basis of statistical generalizations across very large bodies of text.
90 (No transcript)