Title: Web Information Extraction using a Search Engine
1Web Information Extraction using a Search Engine
Gijs Geleijnse
2Outline
- Introduction
- Turing
- Beyond Turing
3Web Information Extraction
- Information Extraction is the task of identifying entities and their relations in natural language texts.
- Information Extraction is easy when we UNDERSTAND documents.
- Document Understanding is AI complete: computers need to be made as intelligent as people.
4Information Extraction
- Gijs eats his wok dish with chop sticks,
- while Dragan has his soup with vinegar.
- We need grammar, precise semantics, deal with
ambiguities, make interpretations, read between
the lines etc.
5Information Extraction
- Gijs eats his wok dish with chop sticks,
- while Natasa has his soup with vinegar.
- We need grammar, precise semantics, deal with
ambiguities, make interpretations, read between
the lines etc.
6Why Web Information Extraction?
- To be really intelligent, applications need to be world wise.
- Hence, they need interpretable information
- Given some information demand, the web seems a good place to look
- We are interested in structured, machine interpretable information
7Web Information
- Why use unstructured texts?
- Get opinions from social websites
- Get facts from Wikipedia
- Web Information Extraction is valuable when
- We create knowledge that cannot be extracted from e.g. Wikipedia or Last.fm
8Web Information Extraction
- Find relevant texts
- Identify entities
- Identify relations
- Problems
- No consistent newspaper-like texts
- No training sets
- Various languages
- Typos, jokes, rubbish
- Solution
- Keep it simple
- Explore Redundancy
9Redundancy
- Redundancy of information does the trick
- We can assume that important concepts are mentioned on multiple pages
- Because of the expected redundancy we do not have to recognize each occurrence
- Simple techniques just might work
10Historical persons
Assume given biographical characterizations
11Pattern-based approach
- Concept has been successfully applied in various studies
- Simple and effective
- Precise queries lead to highly relevant snippets
- "Amsterdam is the capital of"
- We get the relation for free
- Amsterdam is the capital of the Netherlands
- the Netherlands is a country
- has Amsterdam as its capital
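A minimal sketch of how a precise query such as "Amsterdam is the capital of" could be turned into a relation instance: scan the returned snippets for the capitalized phrase that follows the pattern and let redundancy pick the most frequent completion. The function name `extract_completions` and the example snippets are hypothetical; in practice the snippets would come from a search engine.

```python
import re
from collections import Counter

def extract_completions(pattern, snippets):
    """Count the capitalized phrases that directly follow `pattern` in the snippets."""
    # Optional article "the", then one or more capitalized words.
    regex = re.compile(
        re.escape(pattern) + r"\s+(?:the\s+)?([A-Z][\w-]*(?:\s+[A-Z][\w-]*)*)"
    )
    counts = Counter()
    for snippet in snippets:
        for match in regex.finditer(snippet):
            counts[match.group(1)] += 1
    return counts

# Made-up snippets standing in for search engine results.
snippets = [
    "Amsterdam is the capital of the Netherlands, but the government sits in The Hague.",
    "Did you know that Amsterdam is the capital of the Netherlands?",
    "Amsterdam is the capital of Holland according to this blog.",
]
counts = extract_completions("Amsterdam is the capital of", snippets)
best, support = counts.most_common(1)[0]
print(best, support)
```

Because the answer is taken from the most frequent completion, a single noisy snippet does not change the result.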
12Historical persons
Querying with genders is not the best choice
13Query periods
person ( period ) is the best pattern.
141912 - 1954
person was
15Beyond Turing
- How to identify entities?
- How to find relation patterns?
- How to process this data?
16Identifying Entities/Instances/Terms
- Basically two alternatives
- - Set of Rules
- - requires some intelligence
- - Machine Learning
- - requires training set
17Learning Patterns
- Queries need to give many useful results
- A pattern needs to be
- precise (to give high-quality results)
- frequent (to get (m)any results)
- not too narrow (not only applicable to one pair)
- Use a training set of instance pairs to learn the patterns.
- We want to perform as few queries as possible to learn the patterns.
- Bootstrapping: learned relations can be used to learn patterns.
18Use Instances found in queries
19Use Instances found in queries
starring Marlon Brando
20Retrieving patterns
We formulate queries with the elements in the training set:
allintext Michael Jackson Thriller
allintext Thriller Michael Jackson
We retrieve all inner-sentence fragments between the instances and normalize them (remove punctuation marks and capitals).
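The retrieval step can be sketched as follows, assuming the search results are already available as a list of text snippets. The helper names `normalize` and `fragments_between` are hypothetical, and the sentence boundary check is deliberately crude (any of `.`, `!`, `?` ends a sentence).

```python
import re
import string
from collections import Counter

def normalize(fragment):
    """Lowercase, drop punctuation marks, and collapse whitespace."""
    no_punct = fragment.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())

def fragments_between(first, second, snippets):
    """Collect normalized text found between `first` and `second` in one sentence."""
    # [^.!?]*? keeps the match inside a single sentence.
    regex = re.compile(re.escape(first) + r"([^.!?]*?)" + re.escape(second))
    found = Counter()
    for snippet in snippets:
        for match in regex.finditer(snippet):
            middle = normalize(match.group(1))
            if middle:
                found[middle] += 1
    return found

# Made-up snippets for the training pair (Michael Jackson, Thriller).
snippets = [
    "Thriller, by Michael Jackson, was released in 1982.",
    "I love Thriller. Michael Jackson rocks.",
    "Michael Jackson's album Thriller sold millions.",
]
album_first = fragments_between("Thriller", "Michael Jackson", snippets)
artist_first = fragments_between("Michael Jackson", "Thriller", snippets)
print(album_first, artist_first)
```

The second snippet yields nothing because the two instances sit in different sentences, which is the reason for restricting the match to inner-sentence fragments.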
21Retrieving relation patterns
We now have a (long) list of patterns:
album by artist
artists album album
album cover by artist
album di artist
artist-cd album
...
Now to compute scores: frequency, precision, wide-spreadness.
22Retrieving relation patterns
Frequency: we take the frequency of the pattern in the list obtained.
Precision:
- we google the pattern in combination with an instance
- observe the fraction of useful results
e.g. if we google "ABBA's new album" we divide the number of excerpts with an album title by the total number of excerpts found
23Evaluate relation patterns
Wide-spreadness: we count the number of different instances found with the query.
Score = freq · prec · spr
We only compute the scores of the N most frequent patterns.
Number of queries: 2 · |training set| + N · |instance set|
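A small sketch of the scoring step. It assumes that the three scores are combined by multiplication and that the queries add up as two per training pair plus one per combination of a top-N pattern and an instance; treat both as assumptions. The function names and all numbers in the example are hypothetical.

```python
from collections import Counter

def score_patterns(pattern_list, precision, spread, top_n):
    """Score the top_n most frequent patterns as freq * prec * spr (assumed product)."""
    freq = Counter(pattern_list)
    scores = {}
    for pattern, count in freq.most_common(top_n):
        scores[pattern] = count * precision.get(pattern, 0.0) * spread.get(pattern, 0)
    # Highest score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

def query_count(n_training_pairs, top_n, n_instances):
    """Assumed budget: two queries per training pair, one per (pattern, instance)."""
    return 2 * n_training_pairs + top_n * n_instances

# Made-up pattern list, precision values, and spread counts.
pattern_list = (
    ["album by artist"] * 5 + ["album cover by artist"] * 3 + ["album di artist"]
)
precision = {"album by artist": 0.8, "album cover by artist": 0.5}
spread = {"album by artist": 10, "album cover by artist": 2}

ranking = score_patterns(pattern_list, precision, spread, top_n=2)
print(ranking)
print(query_count(n_training_pairs=3, top_n=2, n_instances=4))
```

Restricting the evaluation to the N most frequent patterns is what keeps the number of precision and spread queries bounded.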
24Experiment hyponym patterns
Are Hearst patterns [Hea92] indeed the most effective patterns for the is-a relation?
O = ((country, hyponym), (all countries, country, countries), is_a, (Afghanistan, country), (Afghanistan, countries), (Akrotiri, country), (Akrotiri, countries), ...)
25Case-study Hearst Patterns
Both the common Hearst Patterns and relations
typical for this setting (countries) perform
well. Method to select the most effective
hyponym patterns.
26Case-study Burger King
TREC QA question: In which countries can Burger King be found?
O = ((country, restaurant), (all countries, McDonalds, KFC), located_in, (McDonalds, USA), (KFC, China), ...)
27Case-study Burger King
We first find patterns using the method
described
28How to recognize a hamburger chain?
- We know where to look for them
- Capitalized Words
- The query "restaurants like X and" gives enough hits.
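The three recognition rules could be sketched like this: look at the position right after a known pattern, keep only a capitalized phrase, and accept it when a check query returns enough hits. The names `capitalized_after` and `is_chain`, the threshold `min_hits`, and the `hit_count` callable are all hypothetical; the hit counts below are made up and stand in for a real search engine.

```python
import re

CAPITALIZED = re.compile(r"[A-Z][\w'&-]*(?:\s+[A-Z][\w'&-]*)*")

def capitalized_after(pattern, snippet):
    """Return the capitalized phrase directly following `pattern`, or None."""
    index = snippet.find(pattern)
    if index < 0:
        return None
    rest = snippet[index + len(pattern):].lstrip()
    match = CAPITALIZED.match(rest)
    return match.group(0) if match else None

def is_chain(candidate, hit_count, min_hits=100):
    """Accept a candidate when the check query returns enough hits."""
    return hit_count(f'"restaurants like {candidate} and"') >= min_hits

# Made-up hit counts for illustration only.
fake_hits = {
    '"restaurants like Burger King and"': 4200,
    '"restaurants like Fast and"': 3,
}

def hit_count(query):
    return fake_hits.get(query, 0)

found = capitalized_after(
    "chains such as", "Fast food chains such as Burger King are everywhere."
)
print(found, is_chain(found, hit_count), is_chain("Fast", hit_count))
```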
29Instances learned
Learned while evaluating the patterns identified
- Restaurants
- Cuisines
- Stopwords
- Geography
30Case-study Burger King
- Finally, we use the patterns found in combination with Burger King to find relations.
- Precision: 80%
- Recall: 85%
- Most errors are due to countries in which Burger King plans to open restaurants.
31Processing Data
- Okay, we have now presented a pattern-based method to extract information from the web
- How to use the method to identify new information?
- If you know the instances, do alternatives exist?
33[Figure: genre labels: hard rock, folk, country, camp, classical, pop, rap, soul]
34Extracting Community-based meta-data from the Web
- We combine evidence from multiple web sources
- Does the Web community agree with Last.fm
community?
35Overview
- Problem description
- Three alternative methods to find co-occurrences
- Using instance similarities to improve mapping
- Experimental results: tagging music artists with genres and painters with art movements
- Conclusions
36Two Problems
- A: a set of instances (artists/painters/...)
- L: a set of labels/tags/genres/...
- 1. Find the most appropriate mapping m(a) ∈ L for each instance a ∈ A.
- 2. Find a score t(a,b) expressing the relatedness of each pair (a,b) ∈ A × A.
37Using co-occurrences for the mapping
- We use the Web to find co-occurrences of artists and labels.
- If two terms co-occur relatively often in the same context, we can consider them to be related
- (a) How to find co-occurrences / which context to choose?
- (b) How to use them to find a mapping m?
38Finding artist similarities
- Three alternatives to find web co-occurrences between a and l using
- the number of hits
- patterns
- full documents.
39Co-occurrences using Google hits
co(abba, disco) = 2,780,000
co(abba, hard rock) = 625,000
Use of additional terms can specify the query.
Google Complexity: |A| × |L| queries.
40Co-occurrences using patterns
Take a text fragment that expresses the relation between a label and an artist:
"[label] artists such as [artist]"
take a label or an artist:
"country artists such as"
"artists such as Johnny Cash"
and Google!
41Co-occurrences using patterns
- The relation can be specified
- Search for terms in the excerpts
- only |A| + |L| queries
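A sketch of the pattern-based co-occurrence count, assuming one query per label plus one query per artist, with the other side of the relation found by scanning the returned excerpts. The `search` callable and the canned excerpts are hypothetical stand-ins for a search engine, and the substring match is deliberately crude (a label such as "pop" would also match inside "popular").

```python
from collections import defaultdict

def pattern_cooccurrences(artists, labels, search):
    """Count artist-label co-occurrences in excerpts; returns (counts, queries used)."""
    co = defaultdict(int)
    queries = 0
    for label in labels:
        queries += 1
        for excerpt in search(f'"{label} artists such as"'):
            for artist in artists:
                if artist.lower() in excerpt.lower():
                    co[(artist, label)] += 1
    for artist in artists:
        queries += 1
        for excerpt in search(f'"artists such as {artist}"'):
            for label in labels:
                if label.lower() in excerpt.lower():
                    co[(artist, label)] += 1
    return dict(co), queries

# Canned excerpts for illustration only.
canned = {
    '"country artists such as"': ["country artists such as Johnny Cash and Dolly Parton"],
    '"pop artists such as"': ["pop artists such as ABBA"],
    '"artists such as Johnny Cash"': ["classic country artists such as Johnny Cash"],
    '"artists such as ABBA"': ["disco and pop artists such as ABBA"],
}

def search(query):
    return canned.get(query, [])

co, queries = pattern_cooccurrences(["Johnny Cash", "ABBA"], ["country", "pop"], search)
print(co, queries)
```

With two artists and two labels the sketch issues four queries, whereas counting hits for every artist-label pair would also need four here but grows with the product of the two set sizes.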
42Co-occurrences using full documents
- Google considers these pages very relevant to the query (Britney Spears)
- The tags on these pages will probably reflect her.
- again only O(|A| + |L|) queries
43Query country music
44Using these co-occurrences
Using relative frequencies
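One plausible reading of "relative frequencies", offered as an assumption rather than the method of the talk: divide each co-occurrence count by the total count of its label, so that very common labels do not win by sheer volume, and map each artist to the label with the highest share. The function name and all counts are made up.

```python
def map_by_relative_frequency(co, artists, labels):
    """Map each artist to the label with the highest co(a, l) / sum over a' of co(a', l)."""
    label_totals = {l: sum(co.get((a, l), 0) for a in artists) for l in labels}
    mapping = {}
    for a in artists:
        best_label, best_score = None, 0.0
        for l in labels:
            total = label_totals[l]
            score = co.get((a, l), 0) / total if total else 0.0
            if score > best_score:
                best_label, best_score = l, score
        mapping[a] = best_label  # None when the artist never co-occurs with a label
    return mapping

# Made-up counts: "pop" is a far more common label than "disco".
co = {
    ("ABBA", "disco"): 30,
    ("ABBA", "pop"): 50,
    ("Madonna", "disco"): 20,
    ("Madonna", "pop"): 450,
}
mapping = map_by_relative_frequency(co, ["ABBA", "Madonna"], ["disco", "pop"])
print(mapping)
```

On raw counts ABBA would be mapped to pop (50 against 30); after normalizing per label, disco wins with a share of 0.6 against 0.1.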
45Same trick to find similar artists
- Same three approaches to find co-occurrences between the artists
- the number of hits → |A| · (|A| + |L|) queries
- patterns → O(|A| + |L|) queries
- full documents → |A| + |L| queries
46Using artist similarity to improve mapping
- We can use the computed relatedness between artists to improve the classification.
[Figure: the label country shown three times]
47The final mapping
We take an artist and its k nearest neighbors. We take a majority vote among their initial mappings m(a).
[Figure: the label country shown three times]
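The voting step can be sketched as follows, assuming the initial mapping m and the relatedness scores t(a,b) are already computed. The function name `final_mapping`, the tie-breaking rule (keep the artist's own label when it is among the winners), and the example values are assumptions for illustration.

```python
from collections import Counter

def final_mapping(artist, initial, similarity, k):
    """Majority vote over the artist and its k most related artists."""
    others = [b for b in initial if b != artist]
    others.sort(key=lambda b: similarity.get((artist, b), 0.0), reverse=True)
    voters = [artist] + others[:k]
    votes = Counter(initial[v] for v in voters if initial[v] is not None)
    if not votes:
        return None
    ranked = votes.most_common()
    best_count = ranked[0][1]
    winners = [label for label, count in ranked if count == best_count]
    # Assumed tie-break: prefer the artist's own initial label.
    if initial[artist] in winners:
        return initial[artist]
    return winners[0]

# Made-up initial mapping m(a) and relatedness scores t(a, b).
initial = {
    "Johnny Cash": "rock",
    "Willie Nelson": "country",
    "Dolly Parton": "country",
    "Metallica": "hard rock",
}
similarity = {
    ("Johnny Cash", "Willie Nelson"): 0.9,
    ("Johnny Cash", "Dolly Parton"): 0.8,
    ("Johnny Cash", "Metallica"): 0.1,
}
print(final_mapping("Johnny Cash", initial, similarity, k=2))
```

With k = 0 the artist keeps its initial label, so k controls how strongly the neighbors can override the first mapping.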
48Experiments
- Which of the three methods performs best?
- Does the use of artist similarity improve the mapping? Or: how to choose k?
- We only evaluate precision (if we find nothing, it's wrong)
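A minimal sketch of this evaluation rule, assuming a ground truth with one label per instance: an instance without a mapping counts as an error, so the score is taken over all ground-truth instances. The function name and the example data are hypothetical.

```python
def precision(mapping, ground_truth):
    """Fraction of ground-truth instances mapped to the right label.

    A missing or None mapping counts as wrong.
    """
    if not ground_truth:
        raise ValueError("ground_truth must not be empty")
    correct = sum(
        1 for instance, label in ground_truth.items() if mapping.get(instance) == label
    )
    return correct / len(ground_truth)

# Made-up example: one wrong-by-omission (None) and one instance never mapped.
mapping = {"ABBA": "pop", "Johnny Cash": "country", "Metallica": None}
ground_truth = {
    "ABBA": "pop",
    "Johnny Cash": "country",
    "Metallica": "hard rock",
    "Mozart": "classical",
}
print(precision(mapping, ground_truth))
```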
49Classifying artists into Genres JKU dataset
- 224 artists divided over 14 genres
- publicly available dataset
- previous work on this data set focused on clustering the artists (e.g. Knees et al. 2004, Schedl et al. 2005)
50Genres JKU dataset
[Figure: results for documents, patterns, and Google hits]
51Comparing with Last.fm
52Classifying painters into movements
- Experiment conducted on
- 1,280 painters (en.wikipedia.org/List_of_painters)
- 77 art movements (List_of_art_movements)
- Evaluation
- We visited the pages describing the art movements.
- Ground truth: painters mentioned on one of these pages.
- Leads to a set of 160 painter-movement pairs.
53Classifying painters into movements
54Classifying painters into movements
55Last slide Conclusions
- Presented a method to gather community-based meta-data
- Good methods require low Google Complexity
- Experimental results are encouraging
- Web Information Extraction is fun
- Papers available via http://gijsg.dse.nl