Research update: Language modeling, n-grams and search engines

Transcript and Presenter's Notes
1
Research update: Language modeling, n-grams and search engines
  • David Johnson
  • UTas, September 2005

Supervisors: Dr Vishv Malhotra and Dr Peter Vamplew
2
Overview
  • AISAT 2004
  • SWIRL 2004
  • Language modeling
  • A New Information Retrieval Tool
  • Questions

3
AISAT 2004
  • AISAT 2004 - The 2nd International Conference on
    Artificial Intelligence in Science and
    Technology
  • November 2004, UTAS, Hobart
  • Hosted by The School of Engineering

4
AISAT 2004
  • Johnson DG, Malhotra VM, Vamplew PW, Patro S: "Refining Search Queries from Examples Using Boolean Expressions and Latent Semantic Analysis"
  • Extension of prior research (Vishv M. Malhotra, Sunanda Patro, David Johnson: "Synthesize Web Queries: Search the Web by Examples", ICEIS 2005); the contribution of this paper was investigating the application of LSA to improve the refined queries
  • This is only a brief overview; the full paper is available from the UTAS ePrints Repository

5
AISAT 2004
  • Consider the problems facing a web searcher
    looking for information in an unfamiliar domain
  • Lack of knowledge of domain-specific terms, jargon and concepts
  • Difficulty rating the importance of newly
    discovered terms in targeting relevant
    information
  • Lack of understanding about how extra terms and
    changing the structure of the query will affect
    the search results

6
AISAT 2004
  • This often leads to problems with the resulting
    query
  • Poor recall: most of the relevant documents are not located
  • Poor precision: many of the retrieved documents are not relevant
  • Frustration often results
  • The MSN-Harris Survey (August 2004) reports there is a significant minority -- 29 percent of search engine users -- who only sometimes or rarely find what they want (http://www.microsoft.com/presspass/press/2004/aug04/08-02searchpollpr.mspx)

7
AISAT 2004
  • However, it is usually relatively easy for a searcher to classify some example documents (from an initial search) as relevant or not
  • The text from these documents can then be analyzed to build a better query, one that will select more relevant and fewer irrelevant documents

8
AISAT 2004
[Flow diagram; only the label "Resubmit to Web Search Engine" is preserved in this transcript]
9
AISAT 2004
  • The new query is initially built in conjunctive normal form (CNF)
  • Original query: (a OR b OR ...) AND (a OR c OR ...) AND ...
  • Each maxterm is chosen to select all documents in the set Relevant and reject as many of the documents from Irrelevant as possible
  • The CNF expression (i.e. the conjunction of all maxterms) is chosen to reject all documents from Irrelevant
  • In order to minimise the size of the CNF expression, terms with high selective potential must be used
  • In some cases further optimization was required; for instance (at the time) Google would only accept queries of up to 10 keywords in length

10
AISAT 2004
The potential of a candidate term, t, in a partially constructed maxterm is calculated from the following quantities (the formula itself appeared as an image in the original slide; a sketch follows below):
  • TR: relevant documents not yet selected
  • TIR: irrelevant documents still selected by the conjunction of prior maxterms
  • TRt: new relevant documents selected by term t
  • TIRt: irrelevant documents from TIR selected by term t
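
Since the formula image is not preserved in this transcript, the following Python sketch shows one plausible greedy maxterm construction using the four quantities defined above. The stand-in scoring (size of TRt minus size of TIRt) is an assumption for illustration, not the published Potential(t).

def potential(term, TR, TIR, index):
    """Stand-in scoring for a candidate term (NOT the published formula).

    TR    - relevant documents not yet selected by this maxterm
    TIR   - irrelevant documents still selected by prior maxterms
    index - mapping from term to the set of documents containing it
    """
    TR_t = TR & index[term]    # new relevant docs the term would select
    TIR_t = TIR & index[term]  # irrelevant docs it would also select
    return len(TR_t) - len(TIR_t)

def build_maxterm(relevant, irrelevant, index):
    """Greedily OR terms together until all relevant docs are selected."""
    maxterm, TR, TIR = [], set(relevant), set(irrelevant)
    while TR:
        best = max(index, key=lambda t: potential(t, TR, TIR, index))
        if not (TR & index[best]):
            break               # no remaining term helps; stop
        maxterm.append(best)
        TR -= index[best]       # these relevant docs are now covered
    return maxterm
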
11
AISAT 2004
  • The Potential(t) function enabled the formulation of effective queries, but they were often counterintuitive, e.g. searching for information related to the Sun (as a star)
  • The synthesised query was:
  • sun AND (solar AND (recent OR (field AND tour)) OR (lower AND home) OR (million AND core))
  • This was felt to be at least partially due to overtraining on the relatively small sets of example documents; but could we improve the generated queries?

12
AISAT 2004
  • Latent Semantic Analysis (LSA) was investigated as a technique with the potential to help in selecting more meaningful terms
  • It is "a theory and method for extracting and representing the contextual-usage meaning of words by statistical computations applied to a large corpus of text" (Landauer and Dumais, 1997)
  • It uses no humanly constructed dictionaries or knowledge bases, yet has been shown to overlap with human scores on some language-based judgment tasks

13
AISAT 2004
  • LSA can help overcome the problems of polysemy (same word, different meaning) and synonymy (different word, same meaning) by associating words with abstract concepts via its analysis of word usage patterns
  • It can be very computationally intensive (requiring the singular value decomposition (SVD) of a large matrix), but for the small collections of relevant / irrelevant documents we are analysing it can be performed quickly (in CPU time)
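
As a rough illustration of the kind of computation LSA involves (not the implementation used in the paper), the sketch below builds a term-document count matrix from small tokenised document sets, takes its SVD with numpy, and derives a normalised per-term importance weight; the rank k and the weighting scheme are illustrative assumptions.

import numpy as np

def lsa_term_weights(docs, k=10):
    """docs: list of token lists. Returns a normalised weight per term
    from a rank-k SVD of the term-document count matrix."""
    vocab = sorted({w for doc in docs for w in doc})
    idx = {w: i for i, w in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for w in doc:
            A[idx[w], j] += 1.0                  # raw term counts
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    k = min(k, len(s))
    term_vecs = U[:, :k] * s[:k]                 # terms in concept space
    weights = np.linalg.norm(term_vecs, axis=1)  # importance = vector length
    return dict(zip(vocab, weights / weights.max()))
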

14
AISAT 2004
  • The term weighting was adjusted to account for the LSA weighting as follows:
  • Weight(t) = Potential(t) × ((normalised LSA weight in set Relevant) − (normalised LSA weight in set Irrelevant))
  • During the experimentation that led to the development of this weighting scheme, we noted that not allowing the Potential(t) values sufficient weight resulted in very long queries
  • Note: the Potential(t) value is recalculated after each term is selected, but the LSA values are based on the entire document sets
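
A one-function sketch of this combined weighting, assuming the operators lost from the transcript were a multiplication and a subtraction as reconstructed above; lsa_rel and lsa_irr would be normalised LSA term weights such as those produced by the sketch on the previous slide.

def weight(t, potential_t, lsa_rel, lsa_irr):
    """Weight(t) = Potential(t) * (LSA weight in Relevant - in Irrelevant).

    The operators are reconstructed assumptions; the transcript lost them.
    """
    return potential_t * (lsa_rel.get(t, 0.0) - lsa_irr.get(t, 0.0))
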

15
AISAT 2004
  • Results - Example 1
  • Naïve query: "elephants", searching for information suitable for a high school project
  • 7 out of the first 20 documents, and around 30% of all initial documents from the naïve query, were judged relevant
  • The initial number of relevant documents was quite high, reflecting the reasonably broad nature of the information need

16
AISAT 2004
  • Results - Example 1
  • Refined query, no LSA:
  • elephant AND (forests OR lived OR calves OR poachers OR maximus OR threaten OR electric OR tel OR kudu)
  • Refined query, with LSA:
  • elephant AND (climate OR habitats OR grasslands OR quantities OR africana OR threaten OR insects OR electric OR kudu)
  • LSA did help in the selection of intuitive words, although not as much as hoped: it avoided "lived" and "tel" (abbreviation for "telephone no."), although it still selected "quantities"

17
AISAT 2004
  • Results - Example 2
  • Naïve query: "mushrooms", searching for information on the growth, lifecycle and structure of mushrooms
  • NOT interested in recipes, magic mushrooms, or identifying wild edible mushrooms
  • 3 of the first 20 documents and 13% of the initial documents were relevant

18
AISAT 2004
  • Results - Example 2
  • Refined query, no LSA:
  • mushrooms AND (cylindrical OR ancestors OR hyphae OR cellulose OR issued OR hydrogen OR developing OR putting)
  • Refined query, with LSA:
  • mushrooms AND (ascospores OR hyphae OR itis OR peroxide OR discharge OR developing OR pulled OR jean)
  • In this example the LSA query has identified some additional technical terms, although it still includes the quite general term "pulled". Note that the selection of the term "jean" was due to "Jean" being the first name of several authors mentioned in relevant documents
  • ITIS = Integrated Taxonomic Information System

19
(No Transcript)
20
(No Transcript)
21
AISAT 2004
  • Conclusions
  • The refined search queries show significant improvement over the original naïve queries, without the need for detailed domain knowledge
  • LSA did assist in the formulation of more meaningful Boolean web queries
  • LSA did not significantly improve retrieval performance compared to the original query enhancement algorithm

22
AISAT 2004
  • Further Comments (not in original presentation/paper)
  • A criticism of this approach is that the requirement for the user to review and classify several documents demands quite a bit of effort: were we really helping the user or not?
  • While it is our assertion that the review / classification process required is not very time consuming, we have tried to address this criticism by taking a different approach in current work

23
SWIRL 2004
  • Strategic Workshop on Information Retrieval in
    Lorne, December 2004
  • Organized by Justin Zobel from RMIT and Alistair
    Moffat from the University of Melbourne
  • Funded by the Australian Academy of Technological
    Sciences and Engineering under their Frontiers
    of Science and Technology Missions and Workshops
    program
  • 17 international and 13 Australian researchers, plus 8 research students
  • Included researchers from industry and government as well as academia (Microsoft, NIST, CSIRO)
  • (NIST = National Institute of Standards and Technology, a US Government research organization)

24
SWIRL 2004
  • It was a discussion-based residential workshop
  • "The aim for the workshop was to try and define what we know (and also don't know) about Information Retrieval, examining past work to identify fundamental contributions, challenges, and turning points and then to examine possible future research directions, including possible joint projects. That is, our goal was to pause for a few minutes, reflect on the lessons of past research, and reconsider what questions are important and which research might lead to genuine advances."
  • From the SWIRL 2004 web site: http://www.cs.mu.oz.au/~alistair/swirl2004/

25
SWIRL 2004
  • Attendees included researchers responsible for many of the key innovations in web searching and the implementation of effective and efficient information retrieval systems.

John Tait (University of Sunderland)
Dave Harper (The Robert Gordon University, Scotland)
Alistair Moffat (University of Melbourne)
Justin Zobel (RMIT)
Andrew Turpin (University of Melbourne)
Bruce Croft (University of Massachusetts, Amherst)
Ross Wilkinson (CSIRO)
Bill Hersh, M.D. (Oregon Health & Science University)
Robert Dale (Macquarie University)
Kal Järvelin (University of Tampere, Finland)
Jamie Callan (Carnegie Mellon University)
David Hawking (CSIRO)
26
SWIRL 2004
  • Lorne Beach during the site visit (Winter)

27
SWIRL 2004
  • Lorne Beach when we arrived (Summer)

28
SWIRL 2004
  • Program - Day 1
  • Travel from Melbourne to Lorne (group bus)
  • Keynote Presentation
  • "The IR Landscape" - Bruce Croft
  • Presentation
  • "Adventures in IR evaluation" - Ellen Voorhees
  • Group Discussion
  • "IR: where are we, how did we get here, and where might we go?" - Mark Sanderson
  • Workshop Dinner

29
SWIRL 2004
  • Program - Day 2
  • Group Discussion
  • "What motivates research" - Justin Zobel / Alistair Moffat
  • Small Group Discussion
  • "Challenges in information retrieval and language modeling" - David Harper, Phil Vines and David Johnson (Challenges in Contextual Retrieval, Challenges in Metasearch, Challenges in Cross Language Information Retrieval (CLIR))
  • Group Discussion
  • "Important papers for new research students to be aware of" - David Hawking
  • Small Group Project
  • "How to spend $1M per year for five years" - Ross Wilkinson

30
SWIRL 2004
  • Program - Day 3
  • Presentations from Group Projects
  • Mark Sanderson, Ellen Voorhees, David Johnson, et al.: "Experimental Educational Search Engine" (targeting information needs for upper primary to grade 10)
  • Group Discussion
  • "Where to now with SIGIR?" - Jamie Callan
  • Group Discussion
  • "Writing the ideal SIGIR 2005 paper" - Susan Dumais
  • Return to Melbourne, via Port Campbell National Park and the Great Ocean Road

31
SWIRL 2004
  • Summary
  • An excellent opportunity to meet and talk to many
    respected IR researchers
  • Provided guidance on the current state of the
    art and ideas for the future direction of my
    research
  • Provided many useful contacts

32
Language Modeling - Introduction
  • The seminal paper for language modeling in modern IR: "A Language Modeling Approach to Information Retrieval", Jay Ponte and Bruce Croft, 1998
  • Like LSA, it can deal with the problems of polysemy (same word, different meaning) and synonymy (different word, same meaning)
  • Unlike LSA, it has a firm theoretical underpinning
  • Less computationally demanding than LSA; particularly important for large document collections

33
Language Modeling - Introduction
  • Documents and queries are considered to be
    generated stochastically using their underlying
    language model
  • The document in a collection that is considered
    most relevant to a given query is the one with
    the highest probability of generating the query
    from its language model
  • The assumption of word order independence is
    usually made to make the model mathematically
    tractable, although bi-grams, tri-grams, etc. can
    be accommodated

34
Language Modeling - Introduction
  • A language model of a document is a function that, for any given word, calculates the probability of that word appearing in the document
  • The probability of a phrase (or query) is
    calculated by multiplying together the individual
    word probabilities (applying the word order
    independence assumption)
  • All we need is a method to estimate the language
    models of the documents!
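
A minimal sketch of this query-likelihood scoring under the word-independence assumption, using raw maximum-likelihood estimates (names are illustrative); it deliberately exposes the zero-probability case that the next slides address.

import math
from collections import Counter

def query_log_likelihood(query_terms, doc_terms):
    """log P(query | Md) for an unsmoothed unigram document model."""
    counts, n = Counter(doc_terms), len(doc_terms)
    score = 0.0
    for w in query_terms:
        p = counts[w] / n            # maximum-likelihood P(w | Md)
        if p == 0.0:
            return float("-inf")     # unseen word: motivates smoothing
        score += math.log(p)         # product of probabilities, in log space
    return score
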

35
Language Modeling - Model Estimation
  • Obviously the document itself is the primary data source, but:
  • If a word, w1, doesn't appear in a document, should we have P(w1 | Md) = 0? (in other words, that it is impossible for language model Md to generate word w1)
  • If word w2 appears 5 times in a 1,000-word document, is P(w2 | Md) = 0.005 really a reasonable estimate?
  • It is important to smooth the language model to overcome problems caused by lack of training data

36
Language Modeling - Model Estimation
  • There are a number of smoothing methods in use, the basic premise being that some of the probability mass in the model should be taken from the observed data to be used as an estimate for unseen words
  • They take into account the Zipf's-law nature of word usage (a few words used many times, many words used infrequently) to improve estimates; for instance, the Good-Turing algorithm
  • Corpus or general English word usage data may also be used to augment the document data, but care has to be taken. The Ponte-Croft paper addresses this problem by calculating a risk-adjustment factor, based on the difference between corpus and document word usage
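
The Ponte-Croft risk-adjustment factor is more involved than this transcript records; as a hedged stand-in that illustrates the same idea of mixing document and corpus evidence, the sketch below uses Jelinek-Mercer linear interpolation, a standard smoothing method (the parameter lam is illustrative, not from the paper).

from collections import Counter

def smoothed_prob(w, doc_terms, corpus_counts, corpus_size, lam=0.7):
    """P(w | Md) = lam * P_ml(w | doc) + (1 - lam) * P(w | corpus)."""
    p_doc = Counter(doc_terms)[w] / len(doc_terms)    # document evidence
    p_corpus = corpus_counts.get(w, 0) / corpus_size  # corpus fallback
    return lam * p_doc + (1 - lam) * p_corpus
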

37
Language Modeling - Model Estimation
  • Zipf's law
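
The figure itself is not preserved in this transcript; as an illustrative aside, the small sketch below tabulates rank × frequency for the most common words in a token stream, a product Zipf's law predicts stays roughly constant.

from collections import Counter

def zipf_table(tokens, top=10):
    """Return (rank, word, count, rank*count) for the top-ranked words."""
    rows = []
    for rank, (word, count) in enumerate(Counter(tokens).most_common(top), 1):
        rows.append((rank, word, count, rank * count))
    return rows
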

38
Language Modeling - Results
  • A basic implementation of the Ponte-Croft method proved very useful in quickly locating relevant documents in small local document collections
  • It is planned to implement an improved version as part of the information retrieval tool that is currently under development (discussed next)

39
A New Information Retrieval Tool
  • Goals
  • Assist the user to get required information from
    an ad-hoc web search more quickly and
    effectively
  • Data-driven, to allow quick access to key parts of retrieved documents
  • In some cases provide answers to questions
    directly without the user needing to view
    documents
  • Assist in refining web queries
  • Use language modeling techniques locally on retrieved documents to satisfy more complex information needs, i.e. those not able to be expressed adequately in a web query

40
A New Information Retrieval Tool
  • Information Flow

[Flow diagram: the user submits a Web Query to a Search Engine, which returns Links to Results from the WWW; the tool downloads the Retrieved Document Text (150 docs in about 35 seconds, 200 docs in about 50 seconds) and presents a Bi-gram List to the User]

  • From the bi-gram list:
  • The question may be answered directly
  • Jump directly to relevant parts of documents containing bi-grams
  • Explore documents further using language modeling
  • Formulate and run a refined web query including/rejecting selected bi-grams
41
A New Information Retrieval Tool
  • Why use bi-grams?
  • Express a concept much better than a single word

Peanut
  • Foods/cooking: Peanut butter, Peanut candy, Roasted peanut, Chocolate peanut, Peanut brittle, Peanut cookie, Peanut recipe, Peanut lover, Peanut soup, Peanut oil
  • Agriculture: Peanut institute, Peanut commission, Peanut grower, Peanut farmer, Peanut producer, Peanut plant
  • Commercial/Brands: Peanut software, Peanut linux, Peanut van, Peanut inn, Peanut clothing, Peanut apparel, Baby peanut, Peanut sandals, Peanut ties, Mr peanut
  • Medical: Peanut allergy
  • Other: Peanut gallery
42
A New Information Retrieval Tool
  • Why use bi-grams?
  • Can be used directly as a web search term by using quotes
  • Easily combined into well-targeted searches using the OR and - operators
  • Reasonably easy to extract meaningful bi-grams from document text using simple rules
  • Tri-grams or higher n-grams tend to occur too infrequently
  • Also, bi-grams seem to be somewhat neglected in current IR research, possibly because of the sparse data problem: a corpus with a vocabulary of 50,000 words has the potential for 2,500,000,000 bi-grams, causing difficulty in language modeling and many other IR techniques

43
A New Information Retrieval Tool
  • Simple bi-gram extraction rules (see the sketch after this list):
  • Ignore a bi-gram if it contains a stop word (a word that doesn't convey much meaning, for instance: a, an, and, of, the, etc.; without this step the most frequent bi-grams are usually "of the", "in the", "to the" and so on)
  • If bi-grams are found that are the same except for plurals (e.g. "african elephant" and "african elephants"), only present the most common form to the user
  • Sort bi-grams by descending occurrence count within descending document occurrence count
  • Alternative method:
  • Part-of-speech filtering: almost all interesting bi-grams are of the form Adjective + Noun or Noun + Noun
  • It isn't always possible to determine the part of speech exactly (for example unknown words, or words with multiple possible parts of speech), but we can certainly reject many bi-grams that could never be of the required form
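
A sketch of the simple rules listed above, assuming each retrieved document arrives as a list of lowercased tokens; the stop-word list and the crude plural folding are simplified placeholders for the actual rules.

from collections import Counter

STOP_WORDS = {"a", "an", "and", "of", "the", "in", "to", "is", "for"}

def extract_bigrams(docs):
    """docs: one list of lowercased tokens per retrieved document."""
    total, doc_freq = Counter(), Counter()
    for tokens in docs:
        seen = set()
        for w1, w2 in zip(tokens, tokens[1:]):
            if w1 in STOP_WORDS or w2 in STOP_WORDS:
                continue  # rule 1: skip bi-grams containing a stop word
            total[(w1, w2)] += 1
            seen.add((w1, w2))
        doc_freq.update(seen)
    # Rule 2: fold plural variants, keeping the most common surface form.
    best_form = {}
    for bg in total:
        key = tuple(w.rstrip("s") for w in bg)   # crude plural folding
        if key not in best_form or total[bg] > total[best_form[key]]:
            best_form[key] = bg
    # Rule 3: descending occurrence count within descending doc count.
    return sorted(best_form.values(),
                  key=lambda bg: (doc_freq[bg], total[bg]), reverse=True)
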

44
A New Information Retrieval Tool
  • Example of an automatically generated bi-gram list
  • 150 documents from a simple Google search: "Elephants"
  • elephant jokes
  • baby elephant
  • elephant elephas
  • 70 years
  • indian elephant
  • elephant society
  • large ears
  • wild elephants
  • land mammal
  • white elephant
  • forest elephant
  • baby elephants
  • elephants eat
  • ivory tusks
  • family elephantidae
  • species survival
  • 13 feet
  • largest living
  • largest land
  • elephant seal
  • natural history
  • young elephants
  • female elephants
  • elephant conservation
  • long time
  • elephant range
  • give birth
  • incisor teeth
  • 20 years
  • ivory trade
  • 60 years
  • blood vessels
  • african elephant
  • elephant man
  • loxodonta africana
  • elephant loxodonta
  • asian elephant
  • south africa
  • privacy policy
  • united states
  • years old
  • elephas maximus
  • 22 months
  • national park
  • endangered species
  • elephants live
  • forest elephants
  • small objects

45
A New Information Retrieval Tool
  • From browsing the bi-gram list, the user:
  • Gets an idea of the topic areas in the retrieved
    data
  • Can jump directly to relevant parts of retrieved
    documents
  • Can mark bi-grams to include/exclude from a new
    web search
  • Can mark portions of retrieved documents as
    relevant/irrelevant to use in a more targeted
    local search using language modeling (this allows
    drilling down to discover information on concepts
    that are difficult to express as a web query)
  • The user can also use natural language and/or
    keyword queries using language modeling to assist
    in examining local data (i.e. the document text
    downloaded as part of the retrieval process)

46
A New Information Retrieval Tool
  • Example of a new web query formulated by marking relevant/irrelevant bi-grams (a sketch of the construction follows below):
  • elephant ("african elephant" OR "asian elephants" OR "loxodonta africana") -"elephant man" -"elephant seal" -"elephant jokes" -"white elephant"
  • Using the criterion of our test information need (web pages with suitable information for a high school project on the land mammal elephant), 48 of the first 50 pages returned by the search were judged to be relevant
  • The web search also indicated there were about 228,000 pages matching our query, so we were still getting very wide coverage
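
A small sketch of how bi-grams marked relevant/irrelevant could be assembled into a refined query of exactly this shape; the function name and signature are illustrative, not the tool's actual code.

def refine_query(seed, include, exclude):
    """Quote and OR the included bi-grams; prefix excluded ones with '-'."""
    ors = " OR ".join('"%s"' % b for b in include)
    nots = " ".join('-"%s"' % b for b in exclude)
    return ("%s (%s) %s" % (seed, ors, nots)).strip()

# Reproduces the example query above:
refine_query(
    "elephant",
    ["african elephant", "asian elephants", "loxodonta africana"],
    ["elephant man", "elephant seal", "elephant jokes", "white elephant"],
)
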

47
A New Information Retrieval Tool
  • Are the generated bi-gram lists always useful?
  • At least some of the information in the retrieved documents must be relevant for interesting bi-grams to be generated
  • On the other hand, if the bi-grams are all way off track, the user knows immediately to rethink the initial search, rather than reviewing many irrelevant documents
  • Initial testing with 27 different one-word Google searches (150 documents retrieved for each search) generated useful bi-gram lists in 21 cases
  • By using a two-word search (e.g. "angle geometry" instead of "angle", "cobra snake" instead of "cobra"), useful bi-gram lists were obtained for five of the remaining six cases

48
(No Transcript)
49
(No Transcript)
50
(No Transcript)
51
A New Information Retrieval Tool
  • Future work
  • Continue development and refinement of this information retrieval tool
  • Identify examples of information needs that are difficult to satisfy by direct web search alone
  • User survey: how does the tool perform in practice? How could it be improved?

52
Questions/Comments?