Research update Language modeling, ngrams and search engines - PowerPoint PPT Presentation

About This Presentation

Title:

Research update Language modeling, ngrams and search engines

Slides: 53

Provided by: davidj7

Transcript and Presenter's Notes

1
Research update: Language modeling, n-grams and
search engines

David Johnson
UTas September 2005

Supervisors: Dr Vishv Malhotra and Dr Peter Vamplew
2
Overview

AISAT 2004
SWIRL 2004
Language modeling
A New Information Retrieval Tool
Questions

3
AISAT 2004

AISAT 2004 - The 2nd International Conference on
Artificial Intelligence in Science and
Technology
November 2004, UTAS, Hobart
Hosted by The School of Engineering

4
AISAT 2004

Johnson DG, Malhotra VM, Vamplew PW, Patro S,
"Refining Search Queries from Examples Using
Boolean Expressions and Latent Semantic
Analysis"
Extension of prior research (Vishv M. Malhotra,
Sunanda Patro, David Johnson: "Synthesize Web
Queries: Search the Web by Examples", ICEIS 2005)
The contribution of this paper was
investigating the application of LSA to improve
the refined queries
This is only a brief overview; the full paper is
available from the UTAS Eprints Repository

5
AISAT 2004

Consider the problems facing a web searcher
looking for information in an unfamiliar domain:
Lack of knowledge of domain-specific terms,
jargon and concepts
Difficulty rating the importance of newly
discovered terms in targeting relevant
information
Lack of understanding about how adding terms and
changing the structure of the query will affect
the search results

6
AISAT 2004

This often leads to problems with the resulting
query:
Poor recall: most of the relevant documents are
not located
Poor precision: many of the retrieved documents
are not relevant
Frustration often results
The MSN-Harris Survey (August 2004) reports there is
a significant minority -- 29 percent of search
engine users -- who only sometimes or rarely
find what they want
(http://www.microsoft.com/presspass/press/2004/aug04/08-02searchpollpr.mspx)
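The two measures named on this slide can be made concrete with a small sketch. The function name and the example document ids below are hypothetical and only illustrate the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant).

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a collection of retrieved document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 20 documents retrieved, 7 of them relevant,
# out of 35 relevant documents in the whole collection.
retrieved = range(1, 21)
relevant = list(range(1, 8)) + list(range(101, 129))
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

A query can score well on one measure and badly on the other, which is why both are listed as separate problems.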

7
AISAT 2004

However, it is usually relatively easy for a
searcher to classify some example documents (from
an initial search) as relevant or not
The text from these documents can then be
analyzed to build a better query: one that will
select more relevant and fewer irrelevant
documents

8
AISAT 2004
Resubmit to Web Search Engine
9
AISAT 2004

The new query is initially built in conjunctive
normal form (CNF)
Original query (a OR b OR ...) AND (a OR c OR ...)
AND ...
Each maxterm is chosen to select all documents in
the set Relevant and reject as many of the
documents from the set Irrelevant as possible
The CNF expression (i.e. the conjunction of all
maxterms) is chosen to reject all documents from
the set Irrelevant
In order to minimise the size of the CNF
expression, terms with high selective potential
must be used
In some cases further optimization was required:
for instance, (at the time) Google would only
accept queries of up to 10 keywords in length
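The construction described on this slide can be sketched as a greedy procedure. This is a minimal sketch under stated assumptions: documents are represented as sets of terms, all function names are hypothetical, and the `score` used to pick terms is a simple stand-in (share of uncovered relevant documents gained minus share of still-selected irrelevant documents matched), not the Potential(t) function used in the paper.

```python
def build_maxterm(relevant, irrelevant, still_selected, vocabulary):
    """Greedily OR terms together until every relevant document is selected."""
    uncovered = set(range(len(relevant)))
    maxterm = []
    while uncovered:
        def score(term):
            # Stand-in scoring (an assumption, not the paper's Potential(t)).
            gain = sum(term in relevant[i] for i in uncovered)
            cost = sum(term in irrelevant[j] for j in still_selected)
            penalty = cost / len(still_selected) if still_selected else 0.0
            return gain / len(uncovered) - penalty

        candidates = [t for t in vocabulary if t not in maxterm
                      and any(t in relevant[i] for i in uncovered)]
        if not candidates:
            break  # an empty relevant document can never be covered
        best = max(candidates, key=score)
        maxterm.append(best)
        uncovered -= {i for i in uncovered if best in relevant[i]}
    return maxterm


def build_cnf(relevant, irrelevant, max_maxterms=5):
    """AND maxterms together until no irrelevant document is still selected."""
    vocabulary = sorted(set().union(*relevant))
    still_selected = set(range(len(irrelevant)))
    cnf = []
    while still_selected and len(cnf) < max_maxterms:
        maxterm = build_maxterm(relevant, irrelevant, still_selected, vocabulary)
        rejected = {j for j in still_selected
                    if not any(t in irrelevant[j] for t in maxterm)}
        if not rejected:
            break  # no further progress is possible with this greedy choice
        cnf.append(maxterm)
        still_selected -= rejected
    return cnf


# Hypothetical toy data: each document is the set of terms it contains.
relevant = [{"elephant", "habitat", "tusks"},
            {"elephant", "herd", "tusks"},
            {"elephant", "habitat", "calves"}]
irrelevant = [{"elephant", "jokes", "funny"},
              {"elephant", "seal", "habitat"}]
cnf = build_cnf(relevant, irrelevant)
print(" AND ".join("(" + " OR ".join(m) + ")" for m in cnf))
```

The `max_maxterms` cap is one simple way to respect a search engine's query-length limit; the paper's own optimization step is not reproduced here.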

10
AISAT 2004
The potential of a candidate term, t, in a
partially constructed maxterm is calculated from
the following quantities
TR: relevant documents not yet selected
TIR: irrelevant documents still selected by the
conjunction of prior maxterms
TRt: new relevant documents selected by term t
TIRt: irrelevant documents from TIR selected by term t
11
AISAT 2004

The Potential(t) function enabled the formulation
of effective queries, but they were often
counterintuitive, e.g. when searching for information
related to the Sun (as a star)
The synthesised query was:
sun AND (solar AND (recent OR (field AND tour))
OR (lower AND home) OR (million AND core))
This was felt to be at least partially due to
overtraining on the relatively small sets of
example documents, but could we improve the
generated queries?

12
AISAT 2004

Latent Semantic Analysis (LSA) was investigated
as a technique with the potential to help in
selecting more meaningful terms
It is "a theory and method for extracting and
representing the contextual-usage meaning of
words by statistical computations applied to a
large corpus of text" (Landauer and Dumais,
1997)
It uses no humanly constructed dictionaries or
knowledge bases, yet has been shown to overlap
with human scores on some language-based judgment
tasks

13
AISAT 2004

LSA can help overcome the problems of polysemy
(same word different meaning) and synonymy
(different word same meaning) by associating
words with abstract concepts via its analysis of
word usage patterns
It can be very computationally intensive
(requiring the singular value decomposition (SVD)
of a large matrix), but for the small collections
of relevant / irrelevant documents we are
analysing it can be performed quickly (CPU time)

14
AISAT 2004

The term weighting was adjusted to account for
the LSA weighting as follows
Weight(t) Potential(t) ((Normalised LSA
weight in set relevant) (Normalised LSA
weight in set irrelevant))
During the experimentation that led to the
development of this weighting scheme, we noted
that not allowing the Potential(t) values
sufficient weight resulted in very long queries
Note: the Potential(t) value is recalculated
after each term is selected, but the LSA values
are based on the entire document sets

15
AISAT 2004

Results - Example 1
Naïve query: "elephants", searching for
information suitable for a high school project
7 out of first 20 documents, and around 30 of
all initial documents from the naïve query were
judged relevant
The initial number of relevant documents was
quite high, reflecting the reasonably broad
nature of the information need

16
AISAT 2004

Results - Example 1
Refined query, no LSA:
elephant AND (forests OR lived OR calves OR
poachers OR maximus OR threaten OR electric OR
tel OR kudu)
Refined query, with LSA:
elephant AND (climate OR habitats OR grasslands
OR quantities OR africana OR threaten OR insects
OR electric OR kudu)
LSA did help in the selection of intuitive
words, although not as much as hoped: it avoided
"lived" and "tel" (abbreviation for telephone
number), although it still selected "quantities"

17
AISAT 2004

Results - Example 2
Naïve query: "mushrooms", searching for
information on growth, lifecycle and structure of
mushrooms
NOT interested in recipes, magic mushrooms, or
identifying wild edible mushrooms
3 of first 20 documents and 13 of initial
documents relevant

18
AISAT 2004

Results - Example 2
Refined query, no LSA:
mushrooms AND (cylindrical OR ancestors OR hyphae
OR cellulose OR issued OR hydrogen OR developing
OR putting)
Refined query, with LSA:
mushrooms AND (ascospores OR hyphae OR itis OR
peroxide OR discharge OR developing OR pulled OR
jean)
In this example the LSA query has identified some
additional technical terms, although it still
includes the quite general term "pulled". Note
that the selection of the term "jean" was due to
Jean being the first name of several authors
mentioned in relevant documents
(ITIS: Integrated Taxonomic Information System)

21
AISAT 2004

Conclusions
The refined search queries show significant
improvement over original naïve queries, without
the need for detailed domain knowledge
LSA did assist in the formulation of more
meaningful Boolean web queries
LSA did not significantly improve retrieval
performance compared to the original query
enhancement algorithm

22
AISAT 2004

Further Comments (not in original
presentation/paper)
A criticism of this approach is that the
requirement for the user to review and classify
several documents requires quite a bit of effort:
were we really helping the user or not?
While it is our assertion that the review /
classification process required is not very
time-consuming, we have tried to address this
criticism by taking a different approach in
current work

23
SWIRL 2004

Strategic Workshop on Information Retrieval in
Lorne, December 2004
Organized by Justin Zobel from RMIT and Alistair
Moffat from the University of Melbourne
Funded by the Australian Academy of Technological
Sciences and Engineering under their Frontiers
of Science and Technology Missions and Workshops
program
17 international and 13 Australian researchers
plus 8 research students
Included researchers from industry and government
as well as academia (Microsoft, NIST, CSIRO)
(NIST: National Institute of Standards and
Technology, a US Government research organization)

24
SWIRL 2004

It was a discussion-based residential workshop
The aim for the workshop was to try and define
what we know (and also don't know) about
Information Retrieval, examining past work to
identify fundamental contributions, challenges,
and turning points and then to examine possible
future research directions, including possible
joint projects. That is, our goal was to pause
for a few minutes, reflect on the lessons of past
research, and reconsider what questions are
important and which research might lead to
genuine advances.
From the SWIRL 2004 web site:
http://www.cs.mu.oz.au/alistair/swirl2004/

25
SWIRL 2004

Attendees included researchers responsible for
many of the key innovations in web searching and
the implementation of effective and efficient
information retrieval systems.

John Tait (University of Sunderland)
Dave Harper (The Robert Gordon University,
Scotland)
Alistair Moffat (University of Melbourne)
Justin Zobel (RMIT)
Andrew Turpin (University of Melbourne)
Bruce Croft (University of Massachusetts,
Amherst)
Ross Wilkinson (CSIRO)
Bill Hersh, M.D. (Oregon Health & Science University)
Robert Dale (Macquarie University)
Kal Järvelin (University of Tampere, Finland)
Jamie Callan (Carnegie Mellon University)
David Harper (CSIRO)
26
SWIRL 2004

Lorne Beach during the site visit (Winter)

27
SWIRL 2004

Lorne Beach when we arrived (Summer)

28
SWIRL 2004

Program - Day 1
Travel from Melbourne to Lorne (group bus)
Keynote Presentation
The IR Landscape - Bruce Croft
Presentation
Adventures in IR evaluation - Ellen Voorhees
Group Discussion
IR: where are we, how did we get here, and where
might we go? - Mark Sanderson
Workshop Dinner

29
SWIRL 2004

Program - Day 2
Group Discussion
What motivates research - Justin Zobel / Alistair
Moffat
Small Group Discussion
Challenges in information retrieval and language
modeling - David Harper, Phil Vines and David
Johnson; Challenges in Contextual Retrieval;
Challenges in Metasearch; Challenges in Cross
Language Information Retrieval (CLIR)
Group Discussion
Important papers for new research students to be
aware of - David Hawking
Small Group Project
How to spend 1M per year for five years - Ross
Wilkinson

30
SWIRL 2004

Program - Day 3
Presentations from Group Projects
Mark Sanderson, Ellen Voorhees, David Johnson, et
al.: Experimental Educational Search Engine,
targeting information needs for upper primary to
grade 10
Group Discussion
Where to now with SIGIR? - Jamie Callan
Group Discussion
Writing the ideal SIGIR 2005 paper - Susan
Dumais
Return to Melbourne
Via Port Campbell National Park and Great Ocean
Road

31
SWIRL 2004

Summary
An excellent opportunity to meet and talk to many
respected IR researchers
Provided guidance on the current state of the
art and ideas for the future direction of my
research
Provided many useful contacts

32
Language Modeling - Introduction

The seminal paper for language modeling in modern
IR: "A Language Modeling Approach to Information
Retrieval", Jay Ponte and Bruce Croft, 1998
Like LSA, it can deal with the problems of
polysemy (same word different meaning) and
synonymy (different word same meaning)
Unlike LSA, it has a firm theoretical
underpinning
Less computationally demanding than LSA, which is
particularly important for large document
collections

33
Language Modeling - Introduction

Documents and queries are considered to be
generated stochastically using their underlying
language model
The document in a collection that is considered
most relevant to a given query is the one with
the highest probability of generating the query
from its language model
The assumption of word order independence is
usually made to make the model mathematically
tractable, although bi-grams, tri-grams, etc. can
be accommodated

34
Language Modeling - Introduction

A language model of a document is a function that
for any given word calculates the probability of
that word appearing in the document
The probability of a phrase (or query) is
calculated by multiplying together the individual
word probabilities (applying the word order
independence assumption)
All we need is a method to estimate the language
models of the documents!
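The ranking idea on these slides can be sketched with the simplest possible estimate, the raw relative frequency of each word in the document. This is an illustrative sketch only: the function names and the two toy documents are hypothetical, and no smoothing is applied, so any query word missing from a document drives that document's score to zero.

```python
from collections import Counter


def mle_model(text):
    """Unsmoothed unigram language model: P(w | Md) = count(w) / document length."""
    words = text.lower().split()
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for an empty document
    return lambda w: counts[w] / total


def query_probability(query, model):
    """Multiply the word probabilities (word order independence assumption)."""
    p = 1.0
    for w in query.lower().split():
        p *= model(w)
    return p


# Hypothetical two-document collection.
docs = {
    "d1": "the african elephant is the largest land mammal",
    "d2": "the elephant seal is a large marine mammal",
}
models = {name: mle_model(text) for name, text in docs.items()}
query = "african elephant"
ranking = sorted(docs, key=lambda name: query_probability(query, models[name]),
                 reverse=True)
for name in ranking:
    print(name, query_probability(query, models[name]))
```

Document d2 never mentions "african", so it scores exactly zero even though it shares a query word with d1.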

35
Language Modeling - Model Estimation

Obviously the document itself is the primary data
source, but:
If a word, w1, doesn't appear in a document,
should we have P(w1|Md) = 0? (in other words,
meaning that it is impossible for language model
Md to generate word w1)
If word w2 appears 5 times in a 1,000 word
document, is P(w2|Md) = 0.005 really a reasonable
estimate?
It is important to smooth the language model to
overcome problems caused by lack of training
data

36
Language Modeling - Model Estimation

There are a number of smoothing methods in use,
the basic premise being that some of the
probability mass in the model should be taken
from the observed data to be used as an estimate
for unseen words
They take into account the Zipf's law nature of
word usage (a few words used many times, many
words used infrequently) to improve estimates,
for instance the Good-Turing algorithm
Corpus or general English word usage data may
also be used to augment the document data, but
care has to be taken. The Ponte-Croft paper
addresses this problem by calculating a
risk-adjustment factor, based on the difference
between corpus and document word usage
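One way to see how corpus data can augment the document data is linear interpolation between the document model and a corpus model. Note that this swaps in a different, simpler technique (often called Jelinek-Mercer smoothing); it is neither the Good-Turing algorithm nor the Ponte-Croft risk adjustment mentioned above. The function name, the mixing weight `lam=0.5` and the toy data are assumptions made for illustration.

```python
from collections import Counter


def smoothed_model(doc_words, corpus_words, lam=0.5):
    """P(w) = lam * P(w | document) + (1 - lam) * P(w | corpus)."""
    doc_counts, corpus_counts = Counter(doc_words), Counter(corpus_words)
    doc_len, corpus_len = len(doc_words), len(corpus_words)

    def prob(w):
        p_doc = doc_counts[w] / doc_len if doc_len else 0.0
        p_corpus = corpus_counts[w] / corpus_len if corpus_len else 0.0
        return lam * p_doc + (1 - lam) * p_corpus

    return prob


# Hypothetical data: the corpus is the document plus one more sentence.
doc = "elephants live in africa".split()
corpus = doc + "elephants eat grass".split()
prob = smoothed_model(doc, corpus)
print(prob("grass"))      # unseen in the document, but no longer zero
print(prob("elephants"))  # seen in both the document and the corpus
```

Probability mass is moved from words seen in the document to words seen only in the corpus, while the total over the corpus vocabulary still sums to one.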

37
Language Modeling - Model Estimation

Zipf's law
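For reference, Zipf's law is commonly stated as follows, where f(r) is the frequency of the word with frequency rank r and the exponent s is an empirical parameter that is close to 1 for many natural-language corpora. The notation is chosen here for illustration and is not taken from the slides.

```latex
f(r) \propto \frac{1}{r^{s}}, \qquad s \approx 1
```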

38
Language Modeling - Results

A basic implementation of the Ponte-Croft method
proved very useful in quickly locating relevant
documents in small local document collections
It is planned to implement an improved version as
part of the information retrieval tool that is
currently under development (discussed next)

39
A New Information Retrieval Tool

Goals
Assist the user to get required information from
an ad-hoc web search more quickly and
effectively
Data driven, to allow quick access to key parts of
retrieved documents
In some cases provide answers to questions
directly without the user needing to view
documents
Assist in refining web queries
Use language modeling techniques locally on
retrieved documents to satisfy more complex
information needs, i.e. those not able to be
expressed adequately in a web query

40
A New Information Retrieval Tool

Information Flow

Web Query
Search Engine
Links to Results
WWW

From bi-gram List
Question may be answered
Jump directly to relevant parts of documents
containing bi-grams
Explore documents further using language
modeling
Formulate and run a refined web query
including/rejecting selected bi-grams

Retrieved Document Text (150 docs: 35 seconds;
200 docs: 50 seconds)
User
Bi-gram List
41
A New Information Retrieval Tool

Why use bi-grams?
They express a concept much better than a single word

Peanut
Foods/cooking: Peanut butter, Peanut candy,
Roasted peanut, Chocolate peanut, Peanut brittle,
Peanut cookie, Peanut recipe, Peanut lover,
Peanut soup, Peanut oil
Agriculture: Peanut institute, Peanut commission,
Peanut grower, Peanut farmer, Peanut producer,
Peanut plant
Commercial/Brands: Peanut software, Peanut linux,
Peanut van, Peanut inn, Peanut clothing,
Peanut Apparel, Baby peanut, Peanut sandals,
Peanut ties, Mr peanut
Medical: Peanut allergy
Other: Peanut gallery
42
A New Information Retrieval Tool

Why use bi-grams?
Can be used directly as a web search term by
using quotes
Easily combined into well targeted searches using
the OR and - operators
Reasonably easy to extract meaningful bi-grams
from document text using simple rules
Tri-grams or higher n-grams tend to occur too
infrequently
Also, bi-grams seem to be somewhat neglected in
current IR research, possibly because of the
sparse data problem: a corpus with a
vocabulary of 50,000 words has the potential for
2,500,000,000 bi-grams (50,000 x 50,000), causing
difficulty in language modeling and many other IR
techniques

43
A New Information Retrieval Tool

Simple bi-gram extraction rules
Ignore a bi-gram if it contains a stop word (a
word that doesn't convey much meaning, for
instance "a", "an", "and", "of", "the", etc.; without
this step the most frequent bi-grams are usually
"of the", "in the", "to the" and so on)
If bi-grams are found that are the same except
for plurals (e.g. "african elephant" and "african
elephants") only present the most common form to
the user
Sort bi-grams by descending occurrence count
within descending document occurrence count
Alternative method
Part of speech filtering: almost all interesting
bi-grams are of the form "Adjective Noun" or
"Noun Noun"
It isn't always possible to determine the part of
speech exactly (for example unknown words, words
with multiple possible parts of speech), but we
can certainly reject many bi-grams that could
never be of the required form
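The three simple rules above (stop-word filtering, plural folding, two-level sort) can be sketched as follows. This is a minimal sketch, not the tool's actual implementation: the stop-word list is a short illustrative sample, `singular` is a crude hypothetical plural-folding helper, and bi-grams are not allowed to span punctuation, which is an added assumption.

```python
import re
from collections import Counter, defaultdict

# Short illustrative stop-word list (an assumption, not the tool's list).
STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is", "are",
              "on", "for", "about"}


def bigrams(text):
    """Yield adjacent word pairs that contain no stop word."""
    for chunk in re.split(r"[.!?;:,]", text.lower()):
        words = re.findall(r"[a-z0-9]+", chunk)
        for a, b in zip(words, words[1:]):
            if a not in STOP_WORDS and b not in STOP_WORDS:
                yield a, b


def singular(word):
    """Crude plural folding: drop one trailing 's' (hypothetical helper)."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def ranked_bigrams(documents):
    """Return (bi-gram, document count, occurrence count) rows, best first."""
    forms = defaultdict(Counter)  # folded key -> counts of each surface form
    doc_ids = defaultdict(set)    # folded key -> ids of documents containing it
    for doc_id, text in enumerate(documents):
        for a, b in bigrams(text):
            key = (a, singular(b))
            forms[key][f"{a} {b}"] += 1
            doc_ids[key].add(doc_id)
    rows = []
    for key, surface in forms.items():
        label, _ = surface.most_common(1)[0]  # present the most common form
        rows.append((label, len(doc_ids[key]), sum(surface.values())))
    # Document count first, then total occurrences, then alphabetical.
    rows.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return rows


# Hypothetical documents standing in for retrieved web pages.
docs = [
    "The African elephant is the largest land mammal. African elephants live in herds.",
    "African elephants and the Asian elephant are endangered species.",
    "Jokes about the elephant in the room.",
]
for label, n_docs, n_total in ranked_bigrams(docs):
    print(f"{label}: {n_docs} docs, {n_total} occurrences")
```

Part-of-speech filtering, the alternative method on this slide, would replace the stop-word test with a check on the tags of both words and is not shown here.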

44
A New Information Retrieval Tool

Example of an automatically generated bi-gram
list
150 documents from a simple Google search:
"Elephants"

elephant jokes
baby elephant
elephant elephas
70 years
indian elephant
elephant society
large ears
wild elephants
land mammal
white elephant
forest elephant
baby elephants
elephants eat
ivory tusks
family elephantidae
species survival

13 feet
largest living
largest land
elephant seal
natural history
young elephants
female elephants
elephant conservation
long time
elephant range
give birth
incisor teeth
20 years
ivory trade
60 years
blood vessels

african elephant
elephant man
loxodonta africana
elephant loxodonta
asian elephant
south africa
privacy policy
united states
years old
elephas maximus
22 months
national park
endangered species
elephants live
forest elephants
small objects

45
A New Information Retrieval Tool

From browsing the bi-gram list the user:
Gets an idea of the topic areas in the retrieved
data
Can jump directly to relevant parts of retrieved
documents
Can mark bi-grams to include in or exclude from a
new web search
Can mark portions of retrieved documents as
relevant/irrelevant to use in a more targeted
local search using language modeling (this allows
drilling down to discover information on concepts
that are difficult to express as a web query)
The user can also run natural language and/or
keyword queries, using language modeling, to assist
in examining local data (i.e. the document text
downloaded as part of the retrieval process)

46
A New Information Retrieval Tool

Example of a new web query formulated by marking
relevant/irrelevant bi-grams:
elephant ("african elephant" OR "asian elephants"
OR "loxodonta africana")
-"elephant man" -"elephant seal" -"elephant
jokes" -"white elephant"
Using the criterion of our test information need
(web pages with suitable information for a high
school project on the land mammal elephant), 48
of the first 50 pages returned by the search were
judged to be relevant.
The web search also indicated there were about
228,000 pages matching our query, so we were
still getting a very wide coverage
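Assembling such a query from the marked bi-grams is mechanical, as the following sketch suggests. The function name `build_query` is hypothetical; the quoting, OR and - syntax simply mirrors the example query on this slide.

```python
def build_query(keyword, include, exclude):
    """Combine a keyword with quoted bi-grams to include (OR) and exclude (-)."""
    parts = [keyword]
    if include:
        parts.append("(" + " OR ".join(f'"{b}"' for b in include) + ")")
    parts.extend(f'-"{b}"' for b in exclude)
    return " ".join(parts)


query = build_query(
    "elephant",
    include=["african elephant", "asian elephants", "loxodonta africana"],
    exclude=["elephant man", "elephant seal", "elephant jokes", "white elephant"],
)
print(query)
```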

47
A New Information Retrieval Tool

Are the generated bi-gram lists always useful?
At least some of the information in the retrieved
documents must be relevant for interesting
bi-grams to be generated
On the other hand, if the bi-grams are all way
off track, the user knows immediately to rethink
the initial search, rather than reviewing many
irrelevant documents
Initial testing with 27 different one-word Google
searches (150 documents retrieved for each
search) generated useful bi-gram lists in 21
cases
By using a two-word search (e.g. "angle geometry"
instead of "angle", "cobra snake" instead of
"cobra"), useful bi-gram lists were obtained for
five of the remaining six cases

51
A New Information Retrieval Tool

Future work
Continue development and refinement of this
information retrieval tool
Identify examples of information needs that are
difficult to satisfy by direct web search alone
User survey: how does the tool perform in
practice? How could it be improved?

52
Questions/Comments?
