Title: Genetic Learning for Information Retrieval
1 Genetic Learning for Information Retrieval
- Andrew Trotman
- Computer Science
- 365 × 24 × 60 / 40 = 13,140
2 Genetic Learning
- The Core Algorithm
- Crossover, Mutation, Reproduction
- Fitness proportionate selection
- Genetic Algorithms
- Chromosome is an array
- Genetic Programming
- Chromosome is an abstract syntax tree
[Figure: chromosome diagram; labels A B C D E F, X, 1 2 3 4 5 6]
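The core loop named on this slide can be sketched for an array chromosome as follows. This is a minimal illustration, not the implementation behind the talk. The gene range, population size, crossover and mutation rates, the elitism step, and the function names are all assumptions chosen for the example. The 25 generations default mirrors the experiment slide.

```python
import random

def roulette_select(population, fitnesses, rng):
    """Fitness proportionate selection: chance of selection is fitness / total."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if running > pick:
            return individual
    return population[-1]  # fallback for rounding or all-zero fitness

def crossover(a, b, rng):
    """One-point crossover of two equal-length array chromosomes."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:]

def mutate(chromosome, rng, rate):
    """Replace each gene with a fresh random value with probability `rate`."""
    return [rng.random() if rng.random() < rate else gene for gene in chromosome]

def evolve(fitness, length=6, pop_size=20, generations=25, seed=0,
           crossover_rate=0.7, mutation_rate=0.05):
    """Evolve array chromosomes; `fitness` must return non-negative values."""
    rng = random.Random(seed)
    population = [[rng.random() for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        fitnesses = [fitness(c) for c in population]
        next_population = [best]  # elitism (an assumption): keep the best so far
        while len(next_population) < pop_size:
            parent = roulette_select(population, fitnesses, rng)
            if rng.random() < crossover_rate:
                mate = roulette_select(population, fitnesses, rng)
                child = crossover(parent, mate, rng)
            else:
                child = list(parent)  # reproduction: copy the parent unchanged
            next_population.append(mutate(child, rng, mutation_rate))
        population = next_population
        best = max(population, key=fitness)
    return best

# Toy fitness for illustration only: the sum of the genes.
print(evolve(sum))
```

Genetic programming follows the same loop, but crossover and mutation operate on subtrees of an abstract syntax tree instead of array slices.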
3 Information Retrieval (Text)
- Online Systems
- Dialog, LexisNexis, etc.
- Web Systems
- Alta Vista, Excite, Google, etc.
- Scientific Literature Systems
- CiteSeer, PubMed, BioMedNet, etc.
- Question
- How should scientific literature be ranked?
- Less time searching / More time researching
- Higher exposure for good work
4 How Google Works
- PageRank
- Document ranking from PageRank
- A document's PageRank is some factor (d) of the rank of incoming citations
- A document's influence is some factor of its rank and its outgoing citations
- Characteristics of Scientific Literature
- Citations unidirectional (backwards in time)
- 12 month publication cycle
- Scientific citation cliques
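The PageRank idea described on this slide can be sketched as a simple iteration over a citation graph. This is a simplified illustration, not Google's production algorithm. The damping value d = 0.85, the fixed iteration count, and the even redistribution of rank from documents with no outgoing citations are assumptions made for the example.

```python
def pagerank(citations, d=0.85, iterations=50):
    """citations maps each document to the list of documents it cites."""
    docs = set(citations)
    for targets in citations.values():
        docs.update(targets)
    docs = sorted(docs)
    n = len(docs)
    rank = {doc: 1.0 / n for doc in docs}
    for _ in range(iterations):
        new_rank = {doc: (1.0 - d) / n for doc in docs}
        for doc in docs:
            targets = citations.get(doc, [])
            if targets:
                # A document passes a d-scaled share of its rank to each citation.
                share = d * rank[doc] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # No outgoing citations: spread the rank evenly (an assumption).
                share = d * rank[doc] / n
                for target in docs:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical three-paper graph: A and B both cite C.
print(pagerank({"A": ["C"], "B": ["C"], "C": []}))
```

Because scientific citations only point backwards in time, the newest papers have no incoming citations and start at the bottom of such a ranking, which is one reason the slide lists these characteristics.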
5 How IR works
- Indexing
- Build the dictionary
- Construct the Postings (<d,f> pairs)
- Searching
- Look up terms in dictionary
- Boolean resolution
- Rank on density (probability, vector space, etc.)
- Performance
- Recall and precision
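The indexing and searching steps above can be sketched in a few lines. This is a toy illustration under stated assumptions: whitespace tokenisation, AND-only Boolean resolution, and a summed term frequency score standing in for a real density-based ranking function. All function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_index(documents):
    """Build the dictionary: term -> postings list of (doc_id, frequency) pairs."""
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term].append((doc_id, freq))
    return index

def search(index, query):
    """Look up each term, resolve as Boolean AND, then rank by summed frequency."""
    terms = query.lower().split()
    postings = [dict(index.get(term, [])) for term in terms]
    if not postings:
        return []
    matching = set(postings[0])
    for posting in postings[1:]:
        matching &= set(posting)
    scores = {doc: sum(posting[doc] for posting in postings) for doc in matching}
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved that is relevant. Recall: fraction of relevant retrieved."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical three-document collection.
docs = {
    1: "genetic learning for retrieval",
    2: "genetic genetic programming retrieval retrieval",
    3: "text retrieval",
}
index = build_index(docs)
results = search(index, "genetic retrieval")
print(results)
print(precision_recall(results, relevant=[2, 3]))
```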
6 Structured-IR
- Sci-Lit documents have structure
- Title, abstract, conclusions, etc.
- <d,f> becomes <d,p,f>
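One way to picture the extended posting is to record, for every term occurrence, which structural element it came from. The sketch below assumes that p names the structure (title, abstract, and so on) and that documents arrive already split into named parts. Both are assumptions made for illustration, as is the function name.

```python
from collections import Counter, defaultdict

def build_structured_index(documents):
    """documents maps doc_id -> {structure_name: text}.

    Returns term -> list of (doc_id, structure_name, frequency) postings.
    """
    index = defaultdict(list)
    for doc_id, parts in documents.items():
        for part, text in parts.items():
            for term, freq in Counter(text.lower().split()).items():
                index[term].append((doc_id, part, freq))
    return index

# Hypothetical one-document collection.
docs = {
    1: {
        "title": "genetic learning",
        "abstract": "genetic algorithms learn genetic weights",
    }
}
structured = build_structured_index(docs)
print(structured["genetic"])
```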
7 Using Structure in Ranking
- Documents have structure
- Title, Abstract, Conclusions, etc.
- Weight each structure on importance
- Title higher than Abstract higher than
- How to choose the weights
- Specified in the query (XIRQL)
- Query feedback
- Learn with a Genetic Algorithm
- Adapt ranking model to use structure
- Each tree node is a locus
- Weights are genes
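A small sketch of how a chromosome of weights could drive a structure-aware score. It assumes one locus per structural element and a score that simply multiplies each term frequency by the weight of the structure it occurred in. The actual ranking models in the talk are probabilistic and vector space, so this shows only the shape of the idea. The structure names and weight values are hypothetical.

```python
STRUCTURES = ["title", "abstract", "conclusions"]  # one locus per structure

def weighted_scores(postings, chromosome):
    """Score documents from (doc_id, structure, frequency) postings.

    chromosome[i] is the gene (weight) for STRUCTURES[i].
    """
    weights = dict(zip(STRUCTURES, chromosome))
    scores = {}
    for doc_id, part, freq in postings:
        # Structures without a locus get a neutral weight of 1.0 (an assumption).
        scores[doc_id] = scores.get(doc_id, 0.0) + weights.get(part, 1.0) * freq
    return scores

def rank(postings, chromosome):
    """Order documents by descending weighted score, ties broken by doc_id."""
    scores = weighted_scores(postings, chromosome)
    return sorted(scores, key=lambda doc: (-scores[doc], doc))

# Hypothetical postings for one query term.
postings = [(1, "title", 1), (2, "abstract", 2)]
print(rank(postings, [1.0, 1.0, 1.0]))  # unweighted
print(rank(postings, [3.0, 1.0, 1.0]))  # title weighted higher
```

A genetic algorithm would then search over such chromosomes, using retrieval performance on the training queries as the fitness.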
8 Experiment
- 50 training queries
- 50 evaluation queries
- 25 generations
- Probabilistic IR
- Vector Space IR
Results
- PROBABILISTIC IR
- 75.5% of queries improved
- 6.7% increase in MAP (8.8% max)
- VECTOR SPACE IR
- 61% of queries improved
- 4.7% increase in MAP (5.4% max)
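For readers unfamiliar with the measure, the sketch below assumes MAP denotes mean average precision, its usual reading in information retrieval. The rankings and relevance judgements are made up for the example and are not data from the experiment.

```python
def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)

def mean_average_precision(runs):
    """runs is a list of (ranking, relevant) pairs, one per query."""
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(average_precision(ranking, rel) for ranking, rel in runs) / len(runs)

# Two hypothetical queries.
runs = [([1, 2, 3, 4], [1, 3]), ([2, 1], [1])]
print(mean_average_precision(runs))
```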
9 Ranking Algorithms
- Multitude exist
- Probability, vector space, Boolean
- Several published nomenclatures
- Over 100,000 published algorithms
- Purpose
- Put relevant documents first
- Sorting
- Performance measured with precision
- Sources
- Some guy thought it up
10 Experiment
- 50 training queries
- 50 evaluation queries
- 31 runs
- Weekend time limit
- Compare to Probabilistic
Results
- 67% of queries improved
- 15% increase in MAP
11 Function Comparison
Vector Space
Probability
Learned
w_dq = Σ_{t∈q} (((((((((U / sqrt(sqrt(nt))) / (mq /
sqrt((((Lq / (sqrt(sqrt(Ld)) / sqrt((U / nc))))
min(mq, N)) / sqrt(((((((Tmax / sqrt(U)) /
sqrt((((log2(sqrt(nt)) / sqrt(nt)) / sqrt(Umax))
/ (M / nc)))) / sqrt((U / nc))) - uq) / mq) /
sqrt(nt))))))) / sqrt((log(Tmax) / nc))) /
sqrt(nt)) / sqrt(nt)) / sqrt((Lq /
sqrt(((sqrt((sqrt(sqrt(Ld)) / sqrt((min(mq,
sqrt((((log(Tmax) / nc) / sqrt(Umax)) / (mq /
sqrt(((N min((sqrt(nc) / sqrt(U)), Ld)) /
sqrt(N))))))) / sqrt(Ld))))) / sqrt((Tmax / nc)))
/ sqrt(nt)))))) / sqrt((min(mq, N) / nc))) /
sqrt((log(Tmax) / nc))) / sqrt(nt))
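A learned function of this kind is an abstract syntax tree over a small set of operators. The sketch below shows how such a tree might be represented and evaluated. It uses nested tuples for nodes and "protected" operators that avoid division by zero and square roots of negatives, which is a common convention in genetic programming but is assumed here, not stated in the slides. The example tree is the first fragment of the function above, and the variable values are hypothetical because the slides do not define the symbols.

```python
import math

def protected_div(a, b):
    """Division that returns 1.0 when the divisor is zero (assumed convention)."""
    return a / b if b != 0 else 1.0

def protected_sqrt(a):
    """Square root of the absolute value, so negative inputs do not fail."""
    return math.sqrt(abs(a))

def protected_log(a):
    """Natural log of the absolute value, with log(0) defined as 0.0."""
    return math.log(abs(a)) if a != 0 else 0.0

OPS = {
    "/": protected_div,
    "-": lambda a, b: a - b,
    "sqrt": protected_sqrt,
    "log": protected_log,
    "min": min,
}

def evaluate(node, variables):
    """Evaluate a tree: a variable name, a number, or (operator, child, ...)."""
    if isinstance(node, str):
        return variables[node]
    if isinstance(node, (int, float)):
        return float(node)
    op, *children = node
    return OPS[op](*[evaluate(child, variables) for child in children])

def size(node):
    """Count the nodes in a tree, a rough measure of function bloat."""
    if isinstance(node, tuple):
        return 1 + sum(size(child) for child in node[1:])
    return 1

# (U / sqrt(sqrt(nt))) with hypothetical values for U and nt.
tree = ("/", "U", ("sqrt", ("sqrt", "nt")))
print(evaluate(tree, {"U": 8.0, "nt": 16.0}), size(tree))
```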
12 Conclusions
- Using document structure improved ranking
- Structure weights can be learned with a GA
- GP can be used to learn ranking functions
- Speculation
- Combining GA and GP to learn a structure ranking algorithm will do better than either GA or GP alone
13 Questions?
14 Random Numbers
- Are your results an artifact of your random number generator?
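One hedged way to probe this question is to repeat a stochastic run under several independent seeds and look at the spread of the outcomes. The sketch below assumes the learning run can be wrapped as a function that takes a random number generator and returns a single score. The `run` callable and the seed values are hypothetical.

```python
import random
import statistics

def seed_sensitivity(run, seeds):
    """Repeat `run` once per seed and return (mean, standard deviation) of its scores.

    Needs at least two seeds, since a sample standard deviation is computed.
    """
    results = [run(random.Random(seed)) for seed in seeds]
    return statistics.mean(results), statistics.stdev(results)

# Stand-in for a learning run: returns a random score in [0, 1).
def toy_run(rng):
    return rng.random()

print(seed_sensitivity(toy_run, seeds=[1, 2, 3, 4, 5]))
```

A spread across seeds that is large relative to the reported improvement would suggest the result depends on the generator's output rather than on the method.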