Query Optimization by Genetic Algorithms

1
Query Optimization by Genetic Algorithms
  • Suhail Owais,
  • Pavel Kromer,
  • Václav Snášel
  • Department of Computer Science, VŠB-Technical
    University of Ostrava,
  • 17. listopadu 15, Ostrava - Poruba, Czech Republic

2
Outline
  1. Introduction
  2. Information Retrieval (IR)
  3. Genetic Algorithms (GA)
  4. Optimization
  5. State of the art
  6. IR and GA
  7. Experiments
  8. Conclusion
  9. Future Work

3
Internet
4
Information Retrieval
  • In principle, suppose there is a set of documents
    and a person (the user of these documents). The user
    formulates a question (a request or query) whose
    answer is the subset of documents satisfying
    the information need expressed by the question:
    the relevant documents.
  • IR covers searching for information in documents, for
    documents in a collection of documents, and for metadata
    in documents.
  • Searches run over databases, or over hypertext
    networked databases such as the Internet or an intranet.

5
Information Retrieval System - IRS
  • An IRS is concerned with
  • responding to the queries of users who are
    seeking information in text.
  • retrieving all documents relevant to the user
    query from a collection of documents, while
    retrieving as few non-relevant documents as
    possible.

6
Retrieved and Relevant Documents for the User Query
[Figure: the collection of documents, containing the set of relevant documents and the set of retrieved documents; their overlap is the relevant retrieved documents.]
7
IR Evaluation
  • The most common measures of retrieval
    effectiveness are
  • Precision: the percentage of the retrieved
    documents that are relevant to the user query
  • Recall: the percentage of the relevant documents
    that are retrieved
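The two measures above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the function name `precision_recall` and the document ids are hypothetical, and the values are returned as fractions rather than percentages.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for two collections of document ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall({"d1", "d2", "d3", "d4"},
                        {"d1", "d2", "d3", "d7", "d8", "d9"})
print(p, r)  # 0.75 0.5
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero.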

8
Genetic Algorithm
  • GAs borrow the optimization strategies that
    Darwinian evolution uses successfully in nature
    and transform them for application in
    mathematical optimization theory, to find the
    global optimum in a defined phase space.
  • GAs are used in IR problems, especially for
    optimizing Boolean queries.
  • GA operators: Selection, Fitness function,
    Crossover, and Mutation.

9
GA Flowchart Diagram
[Flowchart: Start → Encoding → Initialize Population → Evaluate Fitness → Condition Satisfied? Yes: Optimized Query → End. No: Selection → Crossover → Mutation → Regenerate New Offspring → back to Evaluate Fitness.]
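The loop in the flowchart can be sketched as a generic Python function. This is a rough sketch under stated assumptions, not the authors' program. The name `evolve`, its parameters, and the callback signatures are hypothetical. The default values are the ones listed on the later Variables initialization slide. Replacing the two least fit individuals with the offspring is an assumed policy that the slides do not specify.

```python
import random


def evolve(population, fitness, crossover, mutate,
           max_generations=50, p_crossover=0.8, p_mutation=0.2,
           target=None, rng=None):
    """Generic GA loop: evaluate, test the stop condition, select, recombine, mutate."""
    if rng is None:
        rng = random.Random(0)
    population = list(population)
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        # Stop condition: the best individual already reaches the target fitness.
        if target is not None and fitness(ranked[0]) >= target:
            break
        parent_a, parent_b = ranked[0], ranked[1]  # select the two fittest
        child_a, child_b = parent_a, parent_b
        if rng.random() < p_crossover:
            child_a, child_b = crossover(parent_a, parent_b, rng)
        if rng.random() < p_mutation:
            child_a = mutate(child_a, rng)
        if rng.random() < p_mutation:
            child_b = mutate(child_b, rng)
        # Assumed replacement policy: the offspring replace the two least fit.
        population = ranked[:-2] + [child_a, child_b]
    return max(population, key=fitness)
```

The function expects at least three individuals, so that the fittest one always survives into the next generation.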
10
Optimization
  • The procedure or procedures used to make a
    system or design as effective or functional as
    possible, especially the mathematical techniques
    involved.
  • The process of modifying a system to improve
    its efficiency. The system can be a single
    computer program, a collection of computers, or
    even an entire network such as the Internet.

11
State of the art 1
Evolutionary Learning of Boolean Queries by
Multiobjective Genetic Programming
  • Authors: Cordon et al., Springer-Verlag GmbH, 2002
  • Subject: Automatic derivation of Boolean queries
    by incorporating a Pareto-based multiobjective
    evolutionary approach, MOGA, into the genetic
    programming technique.
  • Notes:
  • A query is represented as a parse tree with a maximum
    of 20 nodes.
  • Boolean operators used are AND, OR and NOT.
  • Maximum number of documents is 1400.
  • Result: The proposed approach performed
    appropriately on seven queries of the well-known
    Cranfield collection in terms of absolute
    retrieval performance and of the quality of the
    obtained Pareto sets.

12
State of the art 2
An Appropriate Boolean Query Reformulation
Interface for Information Retrieval Based on
Adaptive Generalization
  • Authors: Yoshioka et al., WIRI 2005, in
    conjunction with IEEE 2005, Tokyo, Japan
  • Subject: Implement a user query interface that
    supports reformulation of IR queries by using
    abstract concepts.
  • Notes:
  • The IR interface uses small numbers of query terms
    and concept categories with Boolean expressions.
  • It reformulates a Boolean query by using only words
    that exist in the original query.
  • Boolean operators used are AND and OR.
  • Result: Proposed a new IR interface with Boolean
    query reformulation (ABRIR-AG). It finds
    complementary query terms that exist in relevant
    documents and reformulates Boolean query formulas
    to clarify the information need.

ABRIR-AG: Appropriate Boolean query
Reformulation for IR - Adaptive Generalization
13
IR and GA
  • Collection or set of Documents
  • Terms for Document di
  • Weighting function

Documents \ Terms   W1  W2  W3  W4  W5  W6  W7  W8  W9
d1                   1   1   0   1   0   1   1   0   0
d2                   1   0   1   0   1   1   1   1   1
d3                   0   0   1   1   0   1   1   0   1
d4                   1   0   1   1   1   0   1   0   1

0: W2 is not in document d2
1: W8 is in document d2
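The Boolean weighting above can be held in a small term-document structure. The sketch below is illustrative only; the names `TERMS`, `MATRIX`, `contains`, and `documents_with` are hypothetical. The d2 row is pieced together from the slide's two annotations (W2 absent, W8 present), so it should be read as a reconstruction.

```python
TERMS = ["W1", "W2", "W3", "W4", "W5", "W6", "W7", "W8", "W9"]

# Rows of the example matrix. The d2 row is reconstructed from the slide's
# annotations (W2 absent, W8 present), so treat it as an assumption.
MATRIX = {
    "d1": [1, 1, 0, 1, 0, 1, 1, 0, 0],
    "d2": [1, 0, 1, 0, 1, 1, 1, 1, 1],
    "d3": [0, 0, 1, 1, 0, 1, 1, 0, 1],
    "d4": [1, 0, 1, 1, 1, 0, 1, 0, 1],
}


def contains(doc, term):
    """True when the Boolean weight of `term` in `doc` is 1."""
    return MATRIX[doc][TERMS.index(term)] == 1


def documents_with(term):
    """Ids of all documents whose weight for `term` is 1."""
    return [doc for doc in MATRIX if contains(doc, term)]


print(contains("d2", "W2"), contains("d2", "W8"))  # False True
print(documents_with("W2"))                        # ['d1']
```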
14
Chromosome Encoding
  • A query is a combination of a set of terms and a set of
    Boolean operators
  • A set of queries will be encoded as chromosomes
    for genetic programming in prefix form, such as

Infix:  (w2 OR w6) AND (w9 AND w3)
Prefix: AND (OR w2 w6) (AND w9 w3)

Infix:  (w3 AND w4) XOR ((w5 AND w6) OR w8)
Prefix: XOR (AND w3 w4) (OR (AND w5 w6) w8)
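One way to produce the prefix encoding is to parse the parenthesized infix query into a tree and print it operator-first. The sketch below is an assumption about how this could be done, not the authors' encoder. The names `tokenize`, `parse`, and `to_prefix` are hypothetical. It handles only binary AND, OR and XOR with explicit parentheses, and it does no error checking.

```python
import re

OPERATORS = {"AND", "OR", "XOR"}


def tokenize(text):
    """Split an infix query into parentheses, operators and terms."""
    return re.findall(r"[()]|\w+", text)


def parse(tokens):
    """Parse tokens into a nested tuple (operator, left, right) or a term string."""
    def operand(pos):
        if tokens[pos] == "(":
            node, pos = expression(pos + 1)
            return node, pos + 1  # skip the closing ")"
        return tokens[pos], pos + 1

    def expression(pos):
        node, pos = operand(pos)
        while pos < len(tokens) and tokens[pos].upper() in OPERATORS:
            right, next_pos = operand(pos + 1)
            node = (tokens[pos].upper(), node, right)
            pos = next_pos
        return node, pos

    tree, _ = expression(0)
    return tree


def to_prefix(node, top=True):
    """Render a tree in prefix notation, parenthesizing nested sub-queries."""
    if isinstance(node, str):
        return node
    op, left, right = node
    text = f"{op} {to_prefix(left, False)} {to_prefix(right, False)}"
    return text if top else f"({text})"


print(to_prefix(parse(tokenize("(w2 OR w6) AND (w9 AND w3)"))))
# AND (OR w2 w6) (AND w9 w3)
print(to_prefix(parse(tokenize("(w3 AND w4) XOR ((w5 AND w6) OR w8)"))))
# XOR (AND w3 w4) (OR (AND w5 w6) w8)
```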
15
Tree Structure Representation
XOR (AND w3 w4) (OR (AND w5 w6) w8)
AND (OR w2 w6) (AND w9 w3)
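The two trees can be written as nested tuples and evaluated against the set of terms in a document. This is a minimal sketch of that idea; the tuple layout, the names `matches` and `count_nodes`, and the sample document are assumptions made for illustration.

```python
import operator

# Boolean operators available to the queries in this presentation.
OPS = {"AND": operator.and_, "OR": operator.or_, "XOR": operator.xor}

# AND (OR w2 w6) (AND w9 w3) as a nested tuple: (operator, left, right).
QUERY_A = ("AND", ("OR", "w2", "w6"), ("AND", "w9", "w3"))
# XOR (AND w3 w4) (OR (AND w5 w6) w8)
QUERY_B = ("XOR", ("AND", "w3", "w4"), ("OR", ("AND", "w5", "w6"), "w8"))


def matches(node, doc_terms):
    """Evaluate a query tree against the set of terms present in one document."""
    if isinstance(node, str):  # leaf: is the term in the document?
        return node in doc_terms
    op, left, right = node
    return OPS[op](matches(left, doc_terms), matches(right, doc_terms))


def count_nodes(node):
    """Number of nodes (operators plus terms) in a query tree."""
    if isinstance(node, str):
        return 1
    return 1 + count_nodes(node[1]) + count_nodes(node[2])


doc = {"w3", "w6", "w9"}  # hypothetical document
print(matches(QUERY_A, doc), matches(QUERY_B, doc))  # True False
print(count_nodes(QUERY_A), count_nodes(QUERY_B))    # 7 9
```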
16
Fitness function
  • Recall and Precision functions are used to
    evaluate chromosomes.
  • Selection operator
  • From the population of chromosomes, the two
    chromosomes with the highest fitness
    values for the precision or recall measure are
    selected.
  • r_d: the relevance of document d (1 for relevant
    and 0 for nonrelevant),
  • f_d: whether document d is retrieved (1 for retrieved
    and 0 for not retrieved), and
  • α and β: arbitrary weights used specifically in the
    precision fitness function.
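The formulas themselves did not survive in this transcript. The sketch below assumes a common form: recall fitness is the share of relevant documents retrieved, and precision fitness is a weighted sum of recall (first weight, 0.25 in the experiments) and precision (second weight, 1.0). That form is an assumption, chosen because it matches the definitions above and the P. V. and R. V. values reported in the results tables. The function names are hypothetical.

```python
def recall_fitness(relevant, retrieved):
    """Share of the relevant documents that the query retrieved."""
    hits = sum(r * f for r, f in zip(relevant, retrieved))
    total_relevant = sum(relevant)
    return hits / total_relevant if total_relevant else 0.0


def precision_fitness(relevant, retrieved, alpha=0.25, beta=1.0):
    """Assumed form: alpha * recall + beta * precision."""
    hits = sum(r * f for r, f in zip(relevant, retrieved))
    total_retrieved = sum(retrieved)
    precision = hits / total_retrieved if total_retrieved else 0.0
    return alpha * recall_fitness(relevant, retrieved) + beta * precision


# Hypothetical judgments for five documents: r_d and f_d as 0/1 lists.
r = [1, 1, 0, 1, 0]
f = [1, 1, 1, 0, 0]
print(recall_fitness(r, f))
print(precision_fitness(r, f))
```

With perfect retrieval (every relevant document retrieved and nothing else), this form gives 0.25 + 1.0 = 1.25.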

17
Crossover Operator
  • Randomly choose one node position in each tree to
    be exchanged

OR 4
OR 1
18
Exchange Subtrees
Two new offspring are created
19
Reindexing nodes in offspring
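The three crossover steps (pick a node in each tree, exchange the subtrees, reindex) can be sketched with nested tuples. This is an illustrative sketch, not the authors' code; `subtrees`, `replace_at`, and `crossover` are hypothetical names. Node positions are recomputed by a preorder walk each time, so the offspring need no separate reindexing pass.

```python
import random


def subtrees(node, path=()):
    """Yield (path, subtree) for every node, in preorder."""
    yield path, node
    if not isinstance(node, str):
        yield from subtrees(node[1], path + (1,))
        yield from subtrees(node[2], path + (2,))


def replace_at(node, path, new):
    """Return a copy of `node` with the subtree at `path` replaced by `new`."""
    if not path:
        return new
    parts = list(node)
    parts[path[0]] = replace_at(node[path[0]], path[1:], new)
    return tuple(parts)


def crossover(parent_a, parent_b, rng=random):
    """Swap one randomly chosen subtree of each parent."""
    path_a, sub_a = rng.choice(list(subtrees(parent_a)))
    path_b, sub_b = rng.choice(list(subtrees(parent_b)))
    return replace_at(parent_a, path_a, sub_b), replace_at(parent_b, path_b, sub_a)


a = ("AND", ("OR", "w2", "w6"), ("AND", "w9", "w3"))
b = ("XOR", ("AND", "w3", "w4"), ("OR", ("AND", "w5", "w6"), "w8"))
child_a, child_b = crossover(a, b, random.Random(1))
print(child_a)
print(child_b)
```

Because the subtrees are only exchanged, the two offspring together always contain the same number of nodes as the two parents.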
20
Mutation Operator
Randomly changes one of the Boolean logical
operators to another; the position is also chosen
randomly
AND , OR , XOR
AND , XOR
AND
If no position is selected, no mutation is applied to
this offspring
AND 4
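The mutation step can be sketched as follows: pick one operator node at random and replace its operator with a different one from the set. The names `operator_paths` and `mutate` and the tuple layout are assumptions for illustration. The decision whether to mutate at all (the mutation probability) is left to the caller.

```python
import random

OPERATORS = ("AND", "OR", "XOR")


def operator_paths(node, path=()):
    """Paths to every operator node in an (operator, left, right) tuple tree."""
    if isinstance(node, str):
        return []
    return ([path] + operator_paths(node[1], path + (1,))
            + operator_paths(node[2], path + (2,)))


def mutate(node, rng=random):
    """Replace the operator at one random position with a different operator."""
    paths = operator_paths(node)
    if not paths:  # a single term has no operator to change
        return node

    def rebuild(current, path):
        if not path:
            choices = [op for op in OPERATORS if op != current[0]]
            return (rng.choice(choices), current[1], current[2])
        parts = list(current)
        parts[path[0]] = rebuild(current[path[0]], path[1:])
        return tuple(parts)

    return rebuild(node, rng.choice(paths))


query = ("AND", ("OR", "w2", "w6"), ("AND", "w9", "w3"))
print(mutate(query, random.Random(3)))
```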
21
Experiments
  • The implementation of our genetic program was
    tested under the following conditions and
    limitations:
  • Two sets of queries, represented in tree
    prefix form, used as two different initial
    populations
  • Boolean model of a collection of documents
  • Different collections of documents
  • User query / request:
  • w8 OR w2

22
Initial Populations
  • The two initial populations differ in the
    sub queries they contain:
  • Initial Population 1 contains the sub query
  • w8 AND w2
  • Initial Population 2 contains the sub queries
  • w8 AND w2
  • w8 OR w2
  • w8 XOR w2

Initial Population 1
1. ("w13" and "w8") and ("w10" or "w4")
2. ("w1" and ("w8" and "w2")) or ("w4" or "w2")
3. ("w1" or "w2") and (("w5" or "w4") and ("w3" and "w6"))
4. ("w9" and "w14")
5. ("w14") and "w1"
6. ("w2" or "w6") or ("w8" and "w13")
7. ("w3" and "w4") or (("w12" xor "w15") and "w8")
8. "w1" or "w5"

Initial Population 2
1. ("w13" and "w8") and ("w10" or "w4")
2. ("w1" and ("w8" and "w2")) or ("w4" or "w2")
3. ("w1" or "w2") and (("w5" or "w4") and ("w3" and "w6"))
4. ("w9" and "w14")
5. ("w14") and "w1"
6. ("w2" xor "w8") or ("w8" and "w13")
7. ("w3" and "w4") or (("w2" or "w8") and "w8")
8. "w1" or "w5"
23
Variables initialization
  • Crossover probability value = 0.8
  • Mutation probability value = 0.2
  • Population size (number of chromosomes) = 8
  • Maximum number of generations = 50
  • α = 0.25
  • β = 1.0

24
Document Collections
  • Three different document collections with varying
    numbers of words and documents.

Collection   Number of Documents   Number of Words   Maximum Number of Words in each Document
1            10                    30                10
2            200                   50                50
3            5000                  2000              500
25
Notes on limitations
  • Single-point crossover
  • The mutation operator is applied only to the Boolean
    operators AND, OR, and XOR.
  • The fitness function must be specified in the input data as
    either
  • PrecisionFitness or
  • RecallFitness.
  • The maximum value of PrecisionFitness is α + β,
  • so it may be greater than one (> 1), and
  • it cannot be interpreted as the probability of
    retrieving a relevant document.
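As a worked check, assume PrecisionFitness is the weighted sum of recall and precision (the slide's formula is not in this transcript, so this form is an assumption). Recall and precision each peak at 1, so with the weights from the Variables initialization slide:

```latex
\[
\max(\text{PrecisionFitness}) = \alpha \cdot 1 + \beta \cdot 1 = 0.25 + 1.0 = 1.25
\]
```

This agrees with the highest P. V. entries in the results tables, which are 1.25.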

26
Experiments
  • The experiments vary three settings:
  • Initial population used:
  • Initial Population 1 or
  • Initial Population 2.
  • Fitness function used:
  • PrecisionFitness or
  • RecallFitness.
  • Collection used:
  • Collection 1,
  • Collection 2, or
  • Collection 3.

27
Experiments Results Using IP. 1
IP: Initial Population, FF: Fitness Function, R.: Recall, P.: Precision, V.: Value

FF   Relaxed Optimized Query                         P. V.   R. V.

Collection 1
P.   (( "w8" OR "w2" ) OR ( "w8" AND "w2" ))         1.25    1.00
R.   (( "w13" OR "w8" ) OR ( "w6" OR "w2" ))         1.08    1.00

Collection 2
P.   (( "w13" AND "w8" ) OR ( "w8" OR "w2" ))        1.25    1.00
R.   ( "w8" OR "w2" )                                1.25    1.00

Collection 3
P.   ( "w13" AND "w8" )                              1.05    0.18
R.   ( "w8" OR "w2" )                                1.25    1.00
28
Experiments Results Using IP. 2
IP: Initial Population, FF: Fitness Function, R.: Recall, P.: Precision, V.: Value

FF   Relaxed Optimized Query                                                           P. V.   R. V.

Collection 1
P.   (( "w8" OR "w2" ) OR (( "w2" AND "w4" ) OR (( "w8" OR "w2" ) AND "w1" )))         1.25    1.00
R.   (( "w2" OR "w4" ) OR ( "w6" OR "w2" ))                                            1.07    0.93

Collection 2
P.   (( "w13" AND "w8" ) OR ( "w8" XOR "w2" ))                                         1.25    1.00
R.   (( "w13" AND "w8" ) OR ( "w8" XOR "w2" ))                                         1.25    1.00

Collection 3
P.   ( "w8" OR "w2" )                                                                  1.25    1.00
R.   (( "w13" AND "w8" ) OR ( "w8" OR "w2" ))                                          1.25    1.00
29
Precision and Recall Diagrams
Collections
30
Precision and Recall Diagrams
Initial Populations
31
Conclusions
  • The final population contains a set of individuals
    that have the same fitness values;
  • one is chosen randomly as the optimized query.
  • Selecting initial queries with different sub
    queries similar to the user query increases
    the quality of the initial population,
  • → which gave better results.
  • When precision was used as the fitness
    measure and the experiment was run over the largest
    collection, the recall values in the final
    population were low.
  • In many experiments, almost all members of the
    population reached the maximum values of
    precision and recall before the given number
    of generations was reached.

32
Future works
  • Use more unweighted Boolean operators, such as the
  • ADJ and OF operators
  • Let mutation operate over all Boolean operators
    (AND, OR, XOR, ADJ, OF, and NOT)
  • Try to improve the selection method for choosing the
    best individual from a set of queries with equal
    values of precision or recall.
  • Apply a fuzzy theory approach to this
    problem
  • - Use weights for terms in documents instead of
    Boolean weights.

33
Thanks for your attention
Suhail Owais suhailowais_at_yahoo.com