Title: Query Optimization by Genetic Algorithms
1Query Optimization by Genetic Algorithms
- Suhail Owais,
- Pavel Kromer,
- Vaclav Snašel
- Department of Computer Science, VŠB-Technical University of Ostrava, 17. listopadu 15, Ostrava - Poruba, Czech Republic
2Outline
- Introduction
- Information Retrieval (IR)
- Genetic Algorithms (GA)
- Optimization
- State of the art
- IR and GA
- Experiments
- Conclusion
- Future Work
3Internet
4Information Retrieval
- In principle, suppose there is a set of documents and a person (a user of these documents). The user formulates a question (request or query) whose answer is the subset of documents satisfying the information need expressed by the question: the relevant documents.
- Searching covers information in documents, documents in a collection of documents, and metadata in documents.
- Searching takes place in databases or in hypertext networked databases such as the Internet or an intranet.
5Information Retrieval System - IRS
- An IRS is concerned
- with responding to user requests (queries) for information-seeking text;
- with retrieving all documents relevant to the user query from a collection of documents, while retrieving as few non-relevant documents as possible.
6Retrieved - Relevant Documents to the user Query
[Figure: Venn diagram of the collection of documents, showing the relevant documents, the retrieved documents, and their overlap, the relevant retrieved documents]
7IR Evaluation
- The most common measures of retrieval effectiveness are
- Precision: the percentage of the retrieved documents that are relevant to the user query
- Recall: the percentage of the relevant documents that are retrieved
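The two measures can be illustrated with a minimal sketch over sets of document identifiers. The function names and the sample sets are illustrative assumptions, not taken from the slides, and the values are returned as fractions rather than percentages.

```python
def precision(retrieved: set, relevant: set) -> float:
    """Fraction of the retrieved documents that are relevant."""
    if not retrieved:
        return 0.0  # convention chosen here for an empty result set
    return len(retrieved & relevant) / len(retrieved)


def recall(retrieved: set, relevant: set) -> float:
    """Fraction of the relevant documents that were retrieved."""
    if not relevant:
        return 0.0  # convention chosen here when nothing is relevant
    return len(retrieved & relevant) / len(relevant)


# Hypothetical example: 3 relevant documents, 4 retrieved, 2 in common.
relevant = {"d1", "d2", "d4"}
retrieved = {"d2", "d3", "d4", "d5"}

print(precision(retrieved, relevant))  # 2 of the 4 retrieved are relevant
print(recall(retrieved, relevant))     # 2 of the 3 relevant were retrieved
```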
8Genetic Algorithm
- GAs use Darwinian evolution to extract optimization strategies that nature uses successfully and transform them for application in mathematical optimization theory, in order to find the global optimum in a defined phase space.
- GAs are used in IR problems, especially in optimizing a Boolean query.
- GA operators: selection, fitness function, crossover, and mutation.
9GA Flowchart Diagram
[Flowchart: Start, Encoding, Initialize Population, Evaluate Fitness, Condition Satisfied? Yes: Optimized Query, End. No: Selection, Crossover, Mutation, Regenerate New Offspring, then back to Evaluate Fitness]
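The loop in the flowchart can be sketched generically as follows. This is only an outline under stated assumptions: the name `run_ga`, the replacement scheme (the two best individuals are kept and the rest of the population is refilled with offspring), and the toy bit-string chromosomes are illustrative choices, not the tree-encoded queries or the exact procedure of the presented system.

```python
import random


def run_ga(population, fitness, crossover, mutate, max_generations, target, rng):
    """Evaluate, test the stopping condition, then select, cross over,
    mutate and regenerate, as in the flowchart."""
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        if fitness(ranked[0]) >= target:            # "Condition satisfied?"
            break
        parent_a, parent_b = ranked[0], ranked[1]   # selection: the best two
        offspring = []
        while len(offspring) < len(population) - 2:
            child_a, child_b = crossover(parent_a, parent_b, rng)
            offspring += [mutate(child_a, rng), mutate(child_b, rng)]
        # Assumption: keep both parents and refill the rest with offspring.
        population = [parent_a, parent_b] + offspring[: len(population) - 2]
    return max(population, key=fitness)


# Toy problem (assumption): maximise the number of 1s in a bit string.
def one_point(a, b, rng):
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


def flip_one(bits, rng):
    if rng.random() < 0.2:                          # toy mutation probability
        i = rng.randrange(len(bits))
        bits = bits[:i] + [1 - bits[i]] + bits[i + 1:]
    return bits


rng = random.Random(0)
initial = [[rng.randint(0, 1) for _ in range(10)] for _ in range(8)]
best = run_ga(initial, sum, one_point, flip_one, 50, 10, rng)
print(sum(best), best)
```

Because the two best individuals are always carried over in this sketch, the best fitness in the population never decreases from one generation to the next.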
10Optimization
- The procedure or procedures used to make a system or design as effective or functional as possible, especially the mathematical techniques involved.
- The process of modifying a system to improve its efficiency. The system can be a single computer program, a collection of computers, or even an entire network such as the Internet.
11State of the art 1
Evolutionary Learning of Boolean Queries by
Multiobjective Genetic Programming
- Authors: Cordon et al., Springer-Verlag GmbH 2002
- Subject: Automatic derivation of Boolean queries by incorporating a Pareto-based multiobjective evolutionary approach, MOGA, into the genetic programming technique.
- Notes
- A query is represented as a parse tree with a maximum of 20 nodes.
- Boolean operators used are AND, OR and NOT.
- Maximum number of documents is 1400.
- Result: The proposed approach performed appropriately on seven queries of the well-known Cranfield collection, in terms of absolute retrieval performance and of the quality of the obtained Pareto sets.
12State of the art 2
An Appropriate Boolean Query Reformulation
Interface for Information Retrieval Based on
Adaptive Generalization
- Authors: Yoshioka et al., WIRI 2005, in conjunction with IEEE 2005, Tokyo, Japan
- Subject: Implement a user query interface that supports reformulation of IR queries by using abstract concepts.
- Notes
- The IR interface uses a small number of query terms and concept categories with Boolean expressions.
- A Boolean query is reformulated by using only words that exist in the original query.
- Boolean operators used are AND and OR.
- Result: Proposed a new IR interface with Boolean query reformulation (ABRIR-AG). It finds complementary query terms that exist in relevant documents and reformulates Boolean query formulas to clarify the information need.
ABRIR-AG: Appropriate Boolean query Reformulation for IR - Adaptive Generalization
13IR and GA
- Collection or set of Documents
- Terms for Document di
- Weighting function
Documents (rows) by terms (columns):
      W1 W2 W3 W4 W5 W6 W7 W8 W9
d1     1  1  0  1  0  1  1  0  0
d2     1  0  1  0  1  1  1  1  1
d3     0  0  1  1  0  1  1  0  1
d4     1  0  1  1  1  0  1  0  1
1 = term in document, 0 = term not in document
Example: W2 is not in document d2; W8 is in document d2
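A Boolean document-term model of this kind can be held in a small lookup structure. The sketch below is illustrative only: the names `TERMS`, `MATRIX`, `contains` and `documents_with` are assumptions, and only rows d1, d3 and d4 of the slide are used.

```python
TERMS = ["w1", "w2", "w3", "w4", "w5", "w6", "w7", "w8", "w9"]

# One row of 0/1 weights per document: 1 means the term occurs in it.
MATRIX = {
    "d1": [1, 1, 0, 1, 0, 1, 1, 0, 0],
    "d3": [0, 0, 1, 1, 0, 1, 1, 0, 1],
    "d4": [1, 0, 1, 1, 1, 0, 1, 0, 1],
}


def contains(doc: str, term: str) -> bool:
    """True if the Boolean weight of `term` in `doc` is 1."""
    return MATRIX[doc][TERMS.index(term)] == 1


def documents_with(term: str) -> list:
    """All documents whose weight for `term` is 1, in insertion order."""
    return [doc for doc in MATRIX if contains(doc, term)]


print(documents_with("w3"))
print(contains("d1", "w8"))
```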
14Chromosome Encoding
- A query is a combination of elements from the set of terms and the set of Boolean operators
- The set of queries is encoded as chromosomes for genetic programming in prefix form, such as
Infix: (w2 OR w6) AND (w9 AND w3)
Prefix: AND (OR w2 w6) (AND w9 w3)
Infix: (w3 AND w4) XOR ((w5 AND w6) OR w8)
Prefix: XOR (AND w3 w4) (OR (AND w5 w6) w8)
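The infix-to-prefix encoding can be sketched as a short recursive function. As an assumption for illustration, an infix query is written here as nested `(left, operator, right)` tuples; the function name `to_prefix` is hypothetical.

```python
def to_prefix(query, top=True):
    """Render a term or a (left, operator, right) tuple in prefix form.
    Nested sub-queries are wrapped in parentheses; the outermost is not."""
    if isinstance(query, str):
        return query
    left, op, right = query
    text = f"{op} {to_prefix(left, False)} {to_prefix(right, False)}"
    return text if top else f"({text})"


q1 = (("w2", "OR", "w6"), "AND", ("w9", "AND", "w3"))
q2 = (("w3", "AND", "w4"), "XOR", (("w5", "AND", "w6"), "OR", "w8"))

print(to_prefix(q1))  # AND (OR w2 w6) (AND w9 w3)
print(to_prefix(q2))  # XOR (AND w3 w4) (OR (AND w5 w6) w8)
```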
15Tree Structure Representation
XOR (AND w3 w4) (OR (AND w5 w6) w8)
AND (OR w2 w6) (AND w9 w3)
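A prefix string such as the two above can be turned into a tree and evaluated against one document. This is a minimal sketch: the parser, the tuple-based tree and the names `parse_prefix` and `evaluate` are assumptions, and a document is modelled simply as the set of terms it contains.

```python
import re


def parse_prefix(text):
    """Parse 'AND (OR w2 w6) (AND w9 w3)' into nested (op, left, right) tuples."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def parse():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            node = parse()
            pos += 1                 # skip the matching ")"
            return node
        if token in ("AND", "OR", "XOR"):
            left = parse()
            right = parse()
            return (token, left, right)
        return token                 # a term such as "w2"

    return parse()


def evaluate(node, doc_terms):
    """True if the document (a set of terms) satisfies the query tree."""
    if isinstance(node, str):
        return node in doc_terms
    op, left, right = node
    a, b = evaluate(left, doc_terms), evaluate(right, doc_terms)
    if op == "AND":
        return a and b
    if op == "OR":
        return a or b
    return a != b                    # XOR


tree = parse_prefix("AND (OR w2 w6) (AND w9 w3)")
print(tree)
print(evaluate(tree, {"w3", "w4", "w6", "w7", "w9"}))  # a document with w6, w9, w3
print(evaluate(tree, {"w1", "w2", "w4", "w6", "w7"}))  # a document lacking w9 and w3
```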
16Fitness function
- Recall and precision functions are used to evaluate chromosomes.
- Selection operators
- From the population of chromosomes, the two best chromosomes, i.e. those with the highest fitness values for the precision or recall measure, are selected.
- r_d: the relevance of document d (1 for relevant and 0 for non-relevant),
- f_d: the retrieval of document d (1 for retrieved and 0 for not retrieved), and
- α and β are arbitrary weights added specially to the precision fitness function.
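The formulas themselves are not reproduced in this transcript. The sketch below assumes that the recall fitness is the ordinary recall computed from r_d and f_d, and that the precision fitness is the weighted sum α · recall + β · precision. That assumption is consistent with the maximum value α + β stated later in the slides but is not confirmed by them. The function names are illustrative, and the default weights are the values 0.25 and 1.0 given in the variables slide.

```python
def recall_fitness(r, f):
    """r[d] = 1 if document d is relevant, f[d] = 1 if it was retrieved."""
    hits = sum(rd * fd for rd, fd in zip(r, f))
    return hits / sum(r) if sum(r) else 0.0


def precision_fitness(r, f, alpha=0.25, beta=1.0):
    """ASSUMED form: alpha * recall + beta * precision."""
    hits = sum(rd * fd for rd, fd in zip(r, f))
    precision = hits / sum(f) if sum(f) else 0.0
    return alpha * recall_fitness(r, f) + beta * precision


def select_best_two(population, fitness):
    """Selection as described above: the two chromosomes with the highest fitness."""
    return sorted(population, key=fitness, reverse=True)[:2]


r = [1, 1, 0, 1, 0]   # hypothetical relevance flags for five documents
f = [1, 0, 0, 1, 1]   # hypothetical retrieval flags for the same documents

print(recall_fitness(r, f))
print(precision_fitness(r, f))
print(precision_fitness(r, r))  # perfect retrieval reaches alpha + beta
```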
17Crossover Operator
- Randomly choose one node position in each tree to be exchanged
[Figure: the two parent trees; the chosen nodes are OR 4 in one tree and OR 1 in the other]
18Exchange Sub trees
Two new offspring are created
19Reindexing nodes in offspring
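The three steps (choose a node in each tree, exchange the subtrees, reindex) can be sketched on trees stored as nested lists `[operator, left, right]`. All names are illustrative assumptions, as is the choice to allow any node, including the root or a leaf, as the crossover point. Node positions are computed on demand in preorder, so reindexing here amounts to enumerating the nodes of each offspring again.

```python
import copy
import random


def node_paths(tree, path=()):
    """Positions of all nodes in preorder; a path lists child indices (1 or 2)."""
    paths = [path]
    if isinstance(tree, list):
        paths += node_paths(tree[1], path + (1,))
        paths += node_paths(tree[2], path + (2,))
    return paths


def get_subtree(tree, path):
    for index in path:
        tree = tree[index]
    return tree


def replace_subtree(tree, path, new):
    if not path:                       # replacing the root
        return new
    get_subtree(tree, path[:-1])[path[-1]] = new
    return tree


def crossover(parent_a, parent_b, rng):
    """Single-point crossover: swap one randomly chosen subtree of each parent."""
    a, b = copy.deepcopy(parent_a), copy.deepcopy(parent_b)
    path_a = rng.choice(node_paths(a))
    path_b = rng.choice(node_paths(b))
    sub_a, sub_b = get_subtree(a, path_a), get_subtree(b, path_b)
    return replace_subtree(a, path_a, sub_b), replace_subtree(b, path_b, sub_a)


parent_1 = ["AND", ["OR", "w2", "w6"], ["AND", "w9", "w3"]]
parent_2 = ["XOR", ["AND", "w3", "w4"], ["OR", ["AND", "w5", "w6"], "w8"]]
child_1, child_2 = crossover(parent_1, parent_2, random.Random(1))
print(child_1)
print(child_2)
print(list(enumerate(node_paths(child_1))))  # fresh node indices after the swap
```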
20Mutation Operator
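- Randomly changes one of the Boolean logical operators to another; the position is chosen randomly
- Operators: AND, OR, XOR
[Figure: in one offspring the operator of the chosen node 4 is replaced by one of the other two operators (AND, XOR), here AND, giving node AND 4]
- If no node is selected, no mutation is done on that offspring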
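A mutation of this kind can be sketched on the same nested-list trees `[operator, left, right]`. The names are illustrative, and applying the mutation probability once per offspring is an assumption; 0.2 is the mutation probability listed in the variables slide.

```python
import copy
import random

OPERATORS = ["AND", "OR", "XOR"]


def operator_nodes(tree):
    """All internal nodes [op, left, right] of the tree, in preorder."""
    if not isinstance(tree, list):
        return []
    return [tree] + operator_nodes(tree[1]) + operator_nodes(tree[2])


def mutate(tree, rng, probability=0.2):
    """With the given probability, replace the operator of one randomly
    chosen node by a different operator; otherwise return an unchanged copy."""
    tree = copy.deepcopy(tree)
    nodes = operator_nodes(tree)
    if nodes and rng.random() < probability:
        node = rng.choice(nodes)
        node[0] = rng.choice([op for op in OPERATORS if op != node[0]])
    return tree


offspring = ["XOR", ["AND", "w3", "w4"], ["OR", ["AND", "w5", "w6"], "w8"]]
print(mutate(offspring, random.Random(3), probability=1.0))  # one operator changed
print(mutate(offspring, random.Random(3), probability=0.0))  # no mutation
```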
21Experiments
- The implementation of our genetic program was tested under the following conditions and limitations:
- Two sets of queries, represented in tree prefix form, used as two different initial populations
- Boolean model of a collection of documents
- Different collections of documents
- User query / request
- w8 OR w2
22Initial Populations
- The two initial populations differ in the subqueries they contain:
- Initial Population 1 contains the subquery
- w8 AND w2
- Initial Population 2 contains the subqueries
- w8 AND w2
- w8 OR w2
- w8 XOR w2
Initial Population 1
1. ("w13"and"w8")and("w10"or"w4")
2. ("w1"and("w8"and"w2"))or("w4"or"w2")
3. ("w1"or"w2")and(("w5"or"w4")and("w3"and"w6"))
4. ("w9"and"w14")
5. ("w14") and "w1"
6. ("w2"or"w6")or("w8"and"w13")
7. ("w3"and"w4")or(("w12" xor "w15")and"w8")
8. "w1" or "w5"
Initial Population 2
1. ("w13"and"w8")and("w10"or"w4")
2. ("w1"and("w8"and"w2"))or("w4"or"w2")
3. ("w1"or"w2")and(("w5"or"w4")and("w3"and"w6"))
4. ("w9"and"w14")
5. ("w14") and "w1"
6. ("w2"xor"w8")or("w8"and"w13")
7. ("w3"and"w4")or(("w2" or "w8")and"w8")
8. "w1" or "w5"
23Variables initialization
- Crossover probability = 0.8
- Mutation probability = 0.2
- Population size (number of chromosomes) = 8
- Maximum number of generations = 50
- α = 0.25
- β = 1.0
24Document Collections
- Three different document collections with varying numbers of words and documents.
Collection  Number of Documents  Number of Words  Maximum Number of Words in each Document
1           10                   30               10
2           200                  50               50
3           5000                 2000             500
25Notes on limitations
- Single-point crossover
- The mutation operator is applied only to the Boolean operators AND, OR, and XOR.
- The fitness function must be specified in the input data as either
- PrecisionFitness or
- RecallFitness.
- The maximum value of PrecisionFitness is α + β,
- so it may be greater than one (> 1), and
- it cannot be interpreted as the probability of retrieving a relevant document.
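A short worked bound shows why the value can exceed one. It assumes, as an illustration not confirmed by the slides, that PrecisionFitness is a weighted sum of recall R and precision P, written here as F_P; the weights are the values given in the variables slide.

```latex
% Assumed form of the precision fitness (not stated explicitly in the slides)
F_P = \alpha R + \beta P, \qquad 0 \le R \le 1, \quad 0 \le P \le 1
% The maximum is reached when R = P = 1
\max F_P = \alpha + \beta = 0.25 + 1.0 = 1.25 > 1
```

Under this assumption, a PrecisionFitness of 1.25 corresponds to a query with both recall and precision equal to 1.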
26Experiments
- The set of experiments was done over three test cases, depending on:
- Initial Population used
- Initial Population 1 OR
- Initial Population 2.
- Fitness function used
- PrecisionFitness OR
- RecallFitness.
- Collection used
- Collection 1 OR
- Collection 2 OR
- Collection 3.
27Experiments Results Using IP. 1
IP: Initial Population, FF: Fitness Function, R: Recall, P: Precision, V: Value
FF  Relaxed Optimized Query                   P.V.  R.V.
Collection 1
P.  (( "w8" OR "w2" ) OR ( "w8" AND "w2" ))   1.25  1.00
R.  (( "w13" OR "w8" ) OR ( "w6" OR "w2" ))   1.08  1.00
Collection 2
P.  (( "w13" AND "w8" ) OR ( "w8" OR "w2" ))  1.25  1.00
R.  ( "w8" OR "w2" )                          1.25  1.00
Collection 3
P.  ( "w13" AND "w8" )                        1.05  0.18
R.  ( "w8" OR "w2" )                          1.25  1.00
28Experiments Results Using IP. 2
IP: Initial Population, FF: Fitness Function, R: Recall, P: Precision, V: Value
FF  Relaxed Optimized Query                                                    P.V.  R.V.
Collection 1
P.  (( "w8" OR "w2" ) OR (( "w2" AND "w4" ) OR (( "w8" OR "w2" ) AND "w1" )))  1.25  1.00
R.  (( "w2" OR "w4" ) OR ( "w6" OR "w2" ))                                     1.07  0.93
Collection 2
P.  (( "w13" AND "w8" ) OR ( "w8" XOR "w2" ))                                  1.25  1.00
R.  (( "w13" AND "w8" ) OR ( "w8" XOR "w2" ))                                  1.25  1.00
Collection 3
P.  ( "w8" OR "w2" )                                                           1.25  1.00
R.  (( "w13" AND "w8" ) OR ( "w8" OR "w2" ))                                   1.25  1.00
29Precision and Recall Diagrams
Collections
30Precision and Recall Diagrams
Initial Populations
31Conclusions
- The final population contains a set of individuals that have the same fitness values; one is randomly chosen to be the optimized query.
- Selecting queries with different subqueries similar to the user query increases the quality of the initial population, and this gave better results.
- In particular, when precision was used as the fitness measure and the experiment was run over the largest collection, the recall values in the final population were low.
- In many experiments, almost all members of the population reached the maximum values of precision and recall before the given number of generations was reached.
32Future works
- Use more unweighted Boolean operators, such as
- the ADJ and OF operators
- Let mutation operate over all Boolean operators (AND, OR, XOR, ADJ, OF, and NOT)
- Try to improve the selection method for choosing the best individual from a set of queries with equal values of precision or recall.
- Apply a fuzzy theory approach to this problem
- Use weights for terms in documents instead of Boolean weights.
33Thanks for your attention
Suhail Owais suhailowais_at_yahoo.com