Chapter 4 Query Languages

Provided by: KCK86

1
Chapter 4: Query Languages

2
Introduction
  • Cover different kinds of queries posed to text
    retrieval systems
  • Keyword-based query languages
  • include simple words and phrases as well as
    Boolean operators
  • Pattern matching
  • complement keyword searching with data retrieval
    capabilities
  • Structural queries
  • querying on structure of text

3
Keyword-Based Querying
  • Query is formulation of user information need
  • Keyword-based queries are popular
  • intuitive
  • easy to express
  • allow for fast ranking
  • Query can be simply a word
  • or, in general, a more complex combination of
    operations involving several words

4
Single-Word Queries
  • Most elementary query is a word
  • Word is sequence of letters surrounded by
    separators
  • some characters are not letters but do not split
    a word, e.g. the hyphen in 'on-line'
  • Result of word queries is
  • set of documents containing at least one of the
    words in query
  • resulting documents are ranked using
  • term frequency (count of word in document)
  • inverse document frequency (inverse of no. of
    documents in which word appears, so rarer words
    weigh more)
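The ranking idea on this slide can be sketched as follows. This is a minimal illustration, assuming whitespace tokenization and the common weighting tf * log(N / df); the slide does not fix a formula, and the names `tokenize`, `tf_idf_rank`, and the sample `docs` are hypothetical.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: words are separated by whitespace only.
    return text.lower().split()

def tf_idf_rank(query, docs):
    """Return ids of documents containing at least one query word, best first."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(docs)
    # Document frequency: number of documents in which each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    query_words = tokenize(query)
    scores = {}
    for doc_id, tokens in enumerate(tokenized):
        tf = Counter(tokens)  # term frequency within this document
        matched = [w for w in query_words if tf[w] > 0]
        if matched:
            scores[doc_id] = sum(tf[w] * math.log(n_docs / df[w]) for w in matched)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sample collection.
docs = [
    "text retrieval systems retrieve text",
    "query languages",
    "text query",
]
```

With this sample collection, a query for "text" ranks document 0 ahead of document 2 because the word occurs twice in document 0.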

5
Context Queries
  • Complement single-word queries with ability to
    search words in given context, i.e. near other
    words
  • words near each other signal higher likelihood of
    relevance than if they appear apart
  • form phrases of words or find words which are
    proximal in text

6
Phrase
  • Sequence of single-word queries
  • for instance, possible to search for word
    'enhance' and then word 'retrieval'
  • uninteresting words in text are not considered at
    all
  • e.g. above example query could match text such as
    'enhance the retrieval'

7
Proximity
  • Sequence of single words or phrases is given
    together with maximum allowed distance between
    them
  • For instance, above query stated as
  • 'enhance' and 'retrieval' should occur within
    four words
  • a possible match could be 'enhance the power of
    retrieval'
  • Distance can be measured in characters or words
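A minimal sketch of the two context queries, phrase and proximity, assuming whitespace tokenization and distance measured in words. The stopword list and the function names are hypothetical, chosen only for illustration.

```python
# Assumption: a tiny list of "uninteresting" words; real systems use longer lists.
STOPWORDS = {"the", "of", "a", "an"}

def phrase_match(text, phrase):
    """True if the phrase words occur consecutively once stopwords are dropped."""
    words = [w for w in text.lower().split() if w not in STOPWORDS]
    target = [w for w in phrase.lower().split() if w not in STOPWORDS]
    return any(
        words[i:i + len(target)] == target
        for i in range(len(words) - len(target) + 1)
    )

def within_distance(text, first, second, max_words):
    """True if `second` occurs at most `max_words` positions after `first`."""
    words = text.lower().split()
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    return any(0 < j - i <= max_words for i in first_pos for j in second_pos)
```

Here `phrase_match("enhance the retrieval", "enhance retrieval")` holds because 'the' is ignored, and `within_distance("enhance the power of retrieval", "enhance", "retrieval", 4)` holds because the two words are four positions apart.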

8
Boolean Queries
  • Oldest form of combining keyword queries is to
    use Boolean operators
  • Boolean query has a syntax composed of
  • atoms (i.e. basic queries) that retrieve
    documents, and
  • Boolean operators which work on their operands
    (sets of documents)
  • query syntax tree can be defined
  • leaves are basic queries
  • internal nodes are operators

9
Boolean Queries (Cont.)
  • Retrieve all documents that contain the word
    'translation' as well as either the word 'syntax'
    or the word 'syntactic'

  • query syntax tree: 'translation' AND ('syntax' OR
    'syntactic')

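A query syntax tree of this kind can be evaluated bottom-up over sets of documents. The sketch below assumes a tiny inverted index mapping each word to a set of document ids; the index contents and the tuple encoding of the tree are hypothetical.

```python
# Hypothetical inverted index: word -> set of ids of documents containing it.
index = {
    "translation": {1, 2, 3},
    "syntax": {2, 5},
    "syntactic": {3, 4},
}

def evaluate(node, index):
    """Evaluate a query tree: leaves are words, internal nodes are (op, left, right)."""
    if isinstance(node, str):
        # Leaf: a basic query that retrieves a set of documents.
        return index.get(node, set())
    op, left, right = node
    left_docs = evaluate(left, index)
    right_docs = evaluate(right, index)
    if op == "AND":
        return left_docs & right_docs
    if op == "OR":
        return left_docs | right_docs
    raise ValueError(f"unknown operator: {op}")

# 'translation' AND ('syntax' OR 'syntactic')
query = ("AND", "translation", ("OR", "syntax", "syntactic"))
result = evaluate(query, index)
```

With this index, `result` contains documents 2 and 3. Each document is either in the set or not, which is the absence of ranking noted for Boolean queries.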
10
Boolean Queries (Cont.)
  • No ranking of retrieved documents provided
  • document either satisfies query (retrieved) or
    does not (not retrieved)
  • does not allow partial matching between document
    and user query
  • to overcome this limitation, a fuzzy Boolean set
    of operators has been proposed
  • instead of requiring elements in all operands
    (AND) or in at least one operand (OR), retrieve
    elements appearing in some operands

11
Natural Language
  • Distinction between AND and OR completely blurred
  • simply an enumeration of words and context
    queries
  • all documents matching portion of user query are
    retrieved
  • higher ranking assigned to documents matching
    more parts of query
  • eliminated any reference to Boolean operators
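One simple reading of this slide is to score each document by how many distinct query words it matches. The sketch below assumes that reading and whitespace tokenization; the function name and the tie-breaking by document id are illustrative choices.

```python
def rank_by_matched_terms(query, docs):
    """Rank documents matching any query word; more matched words rank higher."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, doc in enumerate(docs):
        matched = len(terms & set(doc.lower().split()))
        if matched:
            scored.append((matched, doc_id))
    # Sort by number of matched terms (descending), then by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

# Hypothetical sample collection.
sample_docs = [
    "boolean operators",
    "structural query languages",
    "pattern matching query",
]
```

For the query "structural query languages", document 1 matches three words and document 2 matches one, so the ranking is 1 then 2; document 0 is not retrieved.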

12
Pattern Matching
  • Query formulation based on concept of a pattern,
    which allows retrieval of pieces of text that
    have some property
  • Pattern is set of syntactic features that occur
    in text segment
  • Segments satisfying pattern specification said to
    match the pattern
  • We are interested in documents containing
    segments that match given search pattern

13
Pattern Matching (Cont.)
  • Most used types of pattern are
  • words
  • string (sequence of characters) that is a word in
    text
  • prefixes
  • string that forms beginning of text word
  • prefix 'comput' retrieves documents with words
    such as 'computer', 'computation'
  • suffixes
  • string that forms termination of word
  • suffix 'ters' retrieves documents with words such
    as 'testers', 'computers'

14
  • substrings
  • string which can appear within word
  • substring 'tal' retrieves documents with words
    such as 'coastal', 'talk', 'metallic'
  • ranges
  • pair of strings that match any word lying between
    them in lexicographical order
  • strings are ordered by the ordering of the
    alphabet (lexicographical or dictionary order)
  • range between words 'held' and 'hold' retrieves
    strings such as 'hoax', 'hissing'
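The word-level pattern types above (prefix, suffix, substring, range) can be sketched as filters over a vocabulary. The vocabulary below is a hypothetical sample built from the words on these slides, and the helper names are illustrative.

```python
# Hypothetical vocabulary, taken from the example words on the slides.
vocabulary = [
    "coastal", "computation", "computer", "computers", "held",
    "hissing", "hoax", "hold", "metallic", "talk", "testers",
]

def by_prefix(words, prefix):
    return [w for w in words if w.startswith(prefix)]

def by_suffix(words, suffix):
    return [w for w in words if w.endswith(suffix)]

def by_substring(words, fragment):
    return [w for w in words if fragment in w]

def by_range(words, low, high):
    # Python compares strings lexicographically, character by character.
    return [w for w in words if low <= w <= high]
```

For example, `by_range(vocabulary, "held", "hold")` keeps 'held', 'hissing', 'hoax', and 'hold', since 'hoax' sorts before 'hold' in dictionary order.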

15
  • allowing errors
  • word together with error threshold
  • retrieves all text words similar to given word
  • pattern or text may have errors (typing,
    spelling); documents with words that are
    erroneous variants are retrieved (similarity
    measured with edit distance)
  • if typing error splits 'flower' into 'flo wer',
    still found with one error
  • regular expression (r.e.)
  • r.e. is built up by simple strings and operators
    like union, concatenation and repetition
  • query like pro (blem | tein) (s | ε) (0 | 1 |
    2)* will match words like 'problem02',
    'proteins'
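Both ideas can be tried out in a few lines. The sketch assumes the standard Levenshtein edit distance (insertions, deletions, substitutions) as the error measure, and reads the slide's regular expression as 'pro', then 'blem' or 'tein', then an optional 's', then any number of the digits 0, 1, 2. The Python `re` spelling of that pattern is a translation, not the slide's notation.

```python
import re

def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute or keep
            ))
        prev = curr
    return prev[-1]

# Assumed translation of the slide's regular expression into Python syntax.
PATTERN = re.compile(r"pro(blem|tein)s?[012]*")

def matches_pattern(word):
    return PATTERN.fullmatch(word) is not None
```

The split word 'flo wer' is at edit distance 1 from 'flower' (one inserted space), so an error threshold of one still retrieves it.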

16
  • Extended patterns
  • subset of regular expressions expressed with
    simpler syntax
  • classes of characters, i.e. some position in
    pattern matched by any character from pre-defined
    set (e.g. some character must be a digit, not a
    letter, a vowel, etc.)
  • conditional expressions, i.e. part of pattern may
    or may not appear
  • wild characters, i.e. match any sequence in text
    (e.g. any word that starts with 'flo' and ends
    with 'ers', which matches 'flowers' as well as
    'flounders')
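The three extended-pattern features can be expressed with Python's `re` module, used here only as a stand-in for whatever simpler syntax a retrieval system would offer. The 'page' and 'colour' patterns are hypothetical examples; only the 'flo ... ers' pattern comes from the slide.

```python
import re

def full_match(pattern, word):
    """True if the whole word matches the pattern."""
    return re.fullmatch(pattern, word) is not None

# Class of characters: the last position must be a digit (hypothetical example).
DIGIT_CLASS = r"page[0-9]"

# Conditional expression: the 'u' may or may not appear (hypothetical example).
OPTIONAL_PART = r"colou?r"

# Wild characters: any sequence between 'flo' and 'ers' (example from the slide).
WILDCARD = r"flo.*ers"
```

For instance, `full_match(WILDCARD, "flounders")` holds, while `full_match(WILDCARD, "floor")` does not.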

17
Structural Queries
  • Allowing user to query texts based on structure,
    and not content
  • mixing contents and structure in queries can pose
    powerful queries (much more expressive)
  • An example
  • select set of documents that satisfy certain
    constraints on content (using word, phrase, or
    patterns) and then
  • apply structural constraints expressed using
    containment, proximity, or the chapters and
    sections present in documents

18
  • Types of structures
  • fixed structure
  • hypertext
  • hierarchical structure

19
Fixed Structure
  • Document has fixed set of fields
  • each field has some text inside
  • some fields not present in all documents
  • fields not allowed to nest or overlap
  • retrieval activity restricted to specifying that
    given pattern was to be found only in given
    fields
  • this model reasonable when text collection has
    fixed structure
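A fixed-structure query restricts the search to one field. The sketch below assumes each document is a flat mapping from field name to text, with no nesting and with some fields possibly missing; the sample documents and field names are hypothetical.

```python
# Hypothetical collection: flat fields only, and not every field in every document.
documents = [
    {"title": "query languages", "author": "smith",
     "body": "boolean operators and patterns"},
    {"title": "indexing",
     "body": "inverted files support query processing"},
]

def search_field(docs, field, word):
    """Return ids of documents whose given field contains the word."""
    word = word.lower()
    return [
        doc_id
        for doc_id, doc in enumerate(docs)
        # A missing field is treated as empty text.
        if word in doc.get(field, "").lower().split()
    ]
```

Searching for 'query' in the `title` field returns only document 0, while the same word in the `body` field returns only document 1.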

20
Hypertext
  • Retrieval from hypertext began as navigational
    activity
  • user manually traverses hypertext nodes, following
    links to search for what they want
  • not possible to query hypertext based on its
    structure
  • WebGlimpse - interesting proposal to allow
    navigation plus ability to search by content in
    neighborhood of current node

21
Hierarchical Structure
  • Represent recursive decomposition of text
  • natural model for many text collections
  • Figure 4.3 shows example of hierarchical
    structure that consists of page of a book, its
    schematic view and parsed query to retrieve figure

22
Hierarchical Models: PAT Expressions
  • Structure marked in the text by tags (as in HTML)
  • defined in terms of initial and final tags
  • each pair of initial and final tags defines a
    region, i.e. a set of contiguous text areas
  • areas of a region cannot nest or overlap
  • possible to select areas containing other areas,
    contained in other areas, or followed by other
    areas
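The three selections named on this slide can be sketched over regions stored as lists of (start, end) character offsets. This is a simplified illustration, not the PAT expression language itself; the offsets, the region names, and the exact meaning given to "followed by" (an area that ends before some area of the other region starts) are assumptions.

```python
def containing(outer, inner):
    """Areas of `outer` that contain at least one area of `inner`."""
    return [a for a in outer
            if any(a[0] <= b[0] and b[1] <= a[1] for b in inner)]

def contained_in(inner, outer):
    """Areas of `inner` that lie inside at least one area of `outer`."""
    return [b for b in inner
            if any(a[0] <= b[0] and b[1] <= a[1] for a in outer)]

def followed_by(first, second):
    """Areas of `first` that end before some area of `second` starts."""
    return [a for a in first if any(a[1] < b[0] for b in second)]

# Hypothetical regions: two chapters and one figure, as (start, end) offsets.
chapters = [(0, 100), (101, 200)]
figures = [(120, 130)]
```

With these offsets, only the second chapter contains the figure, and only the first chapter is followed by it.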

23
Overlapped Lists
  • Allows area of regions to overlap, but not to
    nest
  • considers use of inverted list where words and
    also regions are indexed
  • allows performing set union and combining
    regions

24
List of References
  • Attempt to make definition and querying of
    structured text uniform, using common language
  • the language allows for querying on path
    expressions, which describe paths in structure
    tree
  • answers to queries are lists of references
  • a reference is a pointer to a region of database
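A path expression can be pictured as a sequence of element names followed from the root of the structure tree, with the answer being a list of references to the regions reached. The sketch below assumes a tree of dictionaries with hypothetical `name`, `span`, and `children` keys, and uses the span itself as the reference; it does not reproduce the actual query language.

```python
def select(node, path):
    """Follow `path` (a list of element names) from `node`; return matching spans."""
    if not path or node["name"] != path[0]:
        return []
    if len(path) == 1:
        # End of the path: the reference is a pointer to this region.
        return [node["span"]]
    refs = []
    for child in node.get("children", []):
        refs.extend(select(child, path[1:]))
    return refs

# Hypothetical structure tree with (start, end) spans.
book = {
    "name": "book", "span": (0, 300), "children": [
        {"name": "chapter", "span": (0, 100), "children": [
            {"name": "section", "span": (10, 50)},
        ]},
        {"name": "chapter", "span": (101, 300), "children": [
            {"name": "section", "span": (110, 200)},
            {"name": "figure", "span": (201, 250)},
        ]},
    ],
}
```

Following the path book, chapter, section returns the two section spans, one from each chapter.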

25
Proximal Nodes
  • Tries to find good compromise between
    expressiveness and efficiency
  • specifies fully compositional language where
    leaves of query syntax tree formed by basic
    queries on contents or names of structural
    elements (e.g. all chapters)
  • internal nodes combine results
  • for efficiency, operations at internal nodes must
    relate nodes close in text

26
Tree Matching
  • Relies on a single primitive: tree inclusion
  • interpret structure of text database and query as
    trees
  • determine embedding of query into database which
    respects hierarchical relationships between nodes
    of query
  • simple queries return roots of the matches