A Search Engine That Learns - PowerPoint PPT Presentation

About This Presentation
Title:

A Search Engine That Learns

Description:

20000 newsgroup articles from UCI Knowledge Discovery in Databases Archive. Hand formatted HTML ... Experiment further using LD as a fitness function ... – PowerPoint PPT presentation

Slides: 25
Provided by: jeffe68

Transcript and Presenter's Notes

Title: A Search Engine That Learns


1
A Search Engine That Learns
  • Jeff Elser jelser_at_cs.montana.edu
  • John Paxton paxton_at_cs.montana.edu
  • Montana State University - Bozeman

2
Presentation Outline
  • Problem
  • Background Information
  • Approach
  • Preliminary Results
  • Future Work
  • Summary
  • Questions

3
I. Problem
  • RightNow software use
  • Spidering and searching
  • Website optimization
  • Page-by-page optimization is tedious and time-consuming
  • Dual ownership should allow perfect optimization
  • Solutions
  • Search engine adjustments
  • Suggesting specific web page changes

4
II. Background: Search Engine
  • Spidering
  • Indexing
  • Weighting factors
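One way to picture weighting factors is as multipliers on where a query term appears in a page. The sketch below is illustrative only: the factor names, the weights, and the linear scoring rule are assumptions, not Htdig's actual configuration or formula.

```python
from typing import Dict, List

# Hypothetical factor names and weights, chosen only for illustration.
WEIGHTS: Dict[str, float] = {"title": 100.0, "heading": 5.0, "keywords": 100.0, "text": 1.0}

def score_page(term_counts: Dict[str, int], weights: Dict[str, float]) -> float:
    """Sum, over page parts, the part's weight times the term's occurrences there."""
    return sum(weights.get(factor, 0.0) * count for factor, count in term_counts.items())

def rank_pages(index: Dict[str, Dict[str, int]], weights: Dict[str, float]) -> List[str]:
    """Return page URLs ordered from highest to lowest score."""
    return sorted(index, key=lambda url: score_page(index[url], weights), reverse=True)

# Occurrences of one query term in each part of three made-up pages.
index = {
    "a.html": {"title": 1, "text": 3},
    "b.html": {"text": 40},
    "c.html": {"heading": 2, "text": 10},
}
print(rank_pages(index, WEIGHTS))
```

Raising the `text` weight relative to `title` would move `b.html` ahead of `a.html`, which is the kind of adjustment a learned set of weights could make.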

5
II. Background: Genetic Algorithms
  • Goldberg's Simple GA
  • Mutation
  • Crossover
  • Elitism
  • Non-overlapping populations
  • Several fitness functions
  • Individual 1
  • Fitness 2
  • Individual 2
  • Fitness 4
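The crossover and mutation operators listed above can be sketched on bit strings. This is a minimal illustration, not the presenters' code: the bit-counting fitness and the two example individuals are assumptions chosen so that their fitness values are 2 and 4.

```python
import random

def fitness(individual):
    """Toy fitness (an assumption): the number of 1 bits."""
    return sum(individual)

def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    cut = rng.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def mutate(individual, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

rng = random.Random(0)
individual_1 = [0, 1, 0, 0, 1, 0]   # fitness 2
individual_2 = [1, 1, 0, 1, 1, 0]   # fitness 4
child_1, child_2 = one_point_crossover(individual_1, individual_2, rng)
mutant = mutate(child_1, 0.01, rng)
print(fitness(individual_1), fitness(individual_2))
```

Elitism would additionally copy the best individual of each generation into the next one unchanged.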

6
III. Approach
  • Architecture
  • Training data
  • Testing controls (website source)
  • GA specifics
  • Fitness functions

7
A. Architecture
8
B. Training Data
  • Website source
  • 20,000 newsgroup articles from the UCI Knowledge
    Discovery in Databases Archive
  • Hand-formatted HTML
  • Chosen for word count and structure

9
C. Testing Controls
  • Webmaster provides training data
  • List of important keywords
  • Associated ranked pages
  • Tedious, but trivial compared to optimizing all
    pages

10
D. GA Specifics
  • Random initial population
  • Population size: 1000
  • Used GAlib's built-in random number generator
  • Genome
  • 16 real numbers corresponding to the 16 weighting
    factors
  • Range: 0.0 to 1000.0

11
D. GA Specifics
  • GA executes for 10,000 generations
  • Elitism is turned on
  • Mutation probability: 0.01
  • Crossover probability: 0.6
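The settings above can be put together in a small generational GA. The sketch below is written in plain Python rather than GAlib, and several details are assumptions: the cost is minimized, parents are picked by two-way tournament selection (swapped in for simplicity), crossover is one-point, and mutation redraws a gene uniformly from the allowed range.

```python
import random

GENOME_LENGTH = 16                # one real number per weighting factor
GENE_MIN, GENE_MAX = 0.0, 1000.0  # allowed range for each gene
P_MUTATION = 0.01
P_CROSSOVER = 0.6

def random_genome(rng):
    return [rng.uniform(GENE_MIN, GENE_MAX) for _ in range(GENOME_LENGTH)]

def tournament(population, scores, rng):
    """Return a copy of the better (lower-cost) of two randomly chosen individuals."""
    i, j = rng.randrange(len(population)), rng.randrange(len(population))
    return list(population[i] if scores[i] <= scores[j] else population[j])

def evolve(cost, population_size=1000, generations=10000, seed=0):
    """Minimize `cost` and return the best genome in the final population."""
    rng = random.Random(seed)
    population = [random_genome(rng) for _ in range(population_size)]
    for _ in range(generations):
        scores = [cost(genome) for genome in population]
        elite = list(population[scores.index(min(scores))])
        next_population = [elite]                     # elitism: best survives unchanged
        while len(next_population) < population_size:
            a = tournament(population, scores, rng)
            b = tournament(population, scores, rng)
            if rng.random() < P_CROSSOVER:            # one-point crossover
                cut = rng.randint(1, GENOME_LENGTH - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                for k in range(GENOME_LENGTH):        # per-gene mutation
                    if rng.random() < P_MUTATION:
                        child[k] = rng.uniform(GENE_MIN, GENE_MAX)
                next_population.append(child)
        population = next_population[:population_size]  # non-overlapping generations
    scores = [cost(genome) for genome in population]
    return population[scores.index(min(scores))]

def toy_cost(genome):
    """Placeholder cost (an assumption): distance of every gene from 500."""
    return sum(abs(gene - 500.0) for gene in genome)

best = evolve(toy_cost, population_size=30, generations=20)
print(round(toy_cost(best), 1))
```

The example run uses a small population and few generations so that it finishes quickly; the defaults of `evolve` match the values on the slides. In the presenters' setting, the cost would come from comparing the search engine's rankings against the webmaster's desired rankings.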

12
D. Fitness Function 1
  • Σ D
  • D = difference between (actual ranking) and (desired ranking)
  • +1 to avoid division by 0

13
D. Fitness Function 2
  • 100 penalty for pages that don't appear
  • -10 reward for pages with a perfect fit
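A rough sketch of how such a penalty and reward might be combined with rank differences follows. The constants come from the slide; everything else is an assumption, including summing the absolute rank difference for pages that appear in the wrong position and treating lower totals as better.

```python
MISSING_PENALTY = 100   # from the slide: page does not appear in the results
PERFECT_REWARD = -10    # from the slide: page appears at exactly the desired rank

def ranking_cost(desired, actual):
    """Lower is better. Both arguments are lists of page URLs, best first."""
    position = {url: rank for rank, url in enumerate(actual, start=1)}
    total = 0
    for desired_rank, url in enumerate(desired, start=1):
        if url not in position:
            total += MISSING_PENALTY
        elif position[url] == desired_rank:
            total += PERFECT_REWARD
        else:
            # Assumed: misplaced pages cost their distance from the desired rank.
            total += abs(position[url] - desired_rank)
    return total

desired = ["a.html", "b.html", "c.html"]
print(ranking_cost(desired, ["a.html", "c.html", "b.html"]))
print(ranking_cost(desired, ["a.html", "b.html"]))
```

With these assumptions, the first call scores one perfect fit and two pages that are each one position off, and the second scores two perfect fits and one missing page.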

14
IV. Preliminary Results
  • 12 tests using fitness function 2
  • 1 realistic set of desired rankings
  • 11 random sets
  • 4 tests obtained perfect rankings
  • 4 tests improved rankings, but did not achieve the optimal ranking
  • 4 tests showed no improvement

15
IV. Preliminary Results
[Chart: Htdig default weights vs. Fitness Function 2]
16
IV. Preliminary Results
[Chart: Fitness Function 2 vs. Htdig default weights]
17
V. Future Work: Fitness Function 3, Levenshtein Distance
  • D = string 1, A = string 2
  • Construct an m × n matrix M, where m = |D| + 1 and n = |A| + 1
  • M[i,0] = i and M[0,j] = j
  • For each remaining cell M[i,j]:
  • If D[i] = A[j], then cost = 0
  • If D[i] ≠ A[j], then cost = 1
  • M[i,j] = min(a, b, c), where
  • a = M[i-1,j] + 1
  • b = M[i,j-1] + 1
  • c = M[i-1,j-1] + cost
  • Distance = M[m-1,n-1], the bottom-right cell

Example matrix for the strings FROM (rows) and FARM (columns):

           F  A  R  M
        0  1  2  3  4
     F  1  0  1  2  3
     R  2  1  1  1  2
     O  3  2  2  2  2
     M  4  3  3  3  2
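The matrix construction described on this slide is the standard dynamic-programming form of Levenshtein distance. A minimal Python sketch, using the strings FROM and FARM as the example (the function names are illustrative):

```python
def levenshtein_matrix(d, a):
    """Build the (len(d) + 1) x (len(a) + 1) edit-distance matrix for strings d and a."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i                      # deleting the first i characters of d
    for j in range(cols):
        m[0][j] = j                      # inserting the first j characters of a
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if d[i - 1] == a[j - 1] else 1
            m[i][j] = min(
                m[i - 1][j] + 1,         # deletion
                m[i][j - 1] + 1,         # insertion
                m[i - 1][j - 1] + cost,  # match or substitution
            )
    return m

def levenshtein(d, a):
    """The distance is the bottom-right cell of the matrix."""
    return levenshtein_matrix(d, a)[-1][-1]

for row in levenshtein_matrix("FROM", "FARM"):
    print(row)
print(levenshtein("FROM", "FARM"))
```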
18
V. Future Work: Fitness Function 3, Levenshtein Distance
  • Reduce the URL comparison to string comparison
  • Experiment further using LD as a fitness function
  • Sigmoid weighting function to increase the
    importance of the front of the string
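One possible reading of these ideas, sketched in Python: each distinct URL is mapped to a single character so that two rankings become two strings, and every edit is weighted by a decreasing sigmoid of its position so that errors near the top of the ranking cost more. The weighting scheme, the `midpoint` and `steepness` parameters, and all function names are assumptions, not the presenters' design.

```python
import math

def front_weight(position, midpoint=5.0, steepness=1.5):
    """Decreasing sigmoid: near 1 at the front of the string, near 0 well past `midpoint`."""
    return 1.0 / (1.0 + math.exp((position - midpoint) / steepness))

def encode_rankings(desired_urls, actual_urls):
    """Map each distinct URL to one character so that rankings become strings."""
    symbols = {}
    def encode(urls):
        return "".join(symbols.setdefault(u, chr(ord("A") + len(symbols))) for u in urls)
    return encode(desired_urls), encode(actual_urls)

def weighted_levenshtein(d, a, weight=front_weight):
    """Levenshtein distance where each edit costs weight(position) instead of 1."""
    rows, cols = len(d) + 1, len(a) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        m[i][0] = m[i - 1][0] + weight(i - 1)
    for j in range(1, cols):
        m[0][j] = m[0][j - 1] + weight(j - 1)
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = 0.0 if d[i - 1] == a[j - 1] else weight(min(i, j) - 1)
            m[i][j] = min(
                m[i - 1][j] + weight(i - 1),      # deletion from d
                m[i][j - 1] + weight(j - 1),      # insertion from a
                m[i - 1][j - 1] + substitution,   # match or substitution
            )
    return m[-1][-1]

desired = ["/p%d.html" % k for k in range(8)]
wrong_first = ["/other.html"] + desired[1:]
wrong_last = desired[:-1] + ["/other.html"]
front = weighted_levenshtein(*encode_rankings(desired, wrong_first))
back = weighted_levenshtein(*encode_rankings(desired, wrong_last))
print(front > back)
```

Under this scheme a wrong page in first place is penalized more heavily than a wrong page in eighth place, whereas plain Levenshtein distance would score both cases as 1.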

19
V. Future Work
  • Create more extensive test sets
  • dare.com, studentaid.ed.gov, fafsa.ed.gov,
    americorps.org

20
V. Future Work
21
V. Future Work
22
V. Future Work
  • Dare.com
  • Use of meta tags and the title tag varies only
    occasionally from page to page
  • Keyword: marijuana
  • Desired rankings from an index show a main
    background page, and a link to a medical
    marijuana page
  • Actual rankings from the local site search put
    the medical marijuana page 3rd, and the
    background page 7th
  • Htdig with default weights ranked them both worse
    than 20th

23
V. Future Work
  • For pages that still do not rank properly, create
    optimization suggestions
  • Use custom meta tags to properly rank outliers
  • Use implicit user feedback to find the desired
    rankings

24
VI. Summary
  • Proof of concept
  • Testing on real-world websites will strengthen
    results and open other areas of study.

25
VII. Questions
  • Thanks for attending
  • Any questions?