Information Retrieval - PowerPoint PPT Presentation

About This Presentation
Slides: 69
Provided by: ChristosPa4
Tags: crawlers


Transcript and Presenter's Notes

Title: Information Retrieval


1
Information Retrieval
  • Christos Papatheodorou
  • Department of Archives and Library Science
  • Ionian University

2
Information Retrieval
  • HOW we express information needs (queries)
  • HOW we locate and retrieve information that
    satisfies those needs
  • HOW we evaluate the results of the search

3
Information Retrieval System
[Figure: Input (documents, queries) → Processor (document classification, search strategy) → Output, with feedback from the output.]
4
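Information retrieval process
[Figure: the user need enters through the User Interface as text; Text Operations (tokenization, stopwords, stemming, etc.) feed Query operations and Indexing; the DB manager maintains the Index over the Docs database; Searching runs the query against the Index and yields Retrieved docs; Ranking returns ranked docs to the user, who may give user feedback.]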
5
Recap
$IR = \langle D, Q, F, R(q_i, d_j) \rangle$
D: documents; Q: queries; F: framework for representing
documents; R: relevance of query $q_i$ to document $d_j$,
a number or a 0/1 value
6
Document categories
  • Structured
  • records, fields (databases)
  • Fully unstructured
  • free text
  • Pre-processing
  • Metadata
  • Stemming

7
Accompanying elements
  • Document identifier
  • ?a????µ??? ped??
  • Key words and phrases (keywords)
  • Abstract
  • Extracts (extraction), produced by someone other than the author
  • Reviews, written by someone other than the author

8
Formal definition of a document
  • Vocabulary V, controlled or not
  • Terms $w_i$
  • document a
  • frequency of term $w_i$ in a

9
Boolean model (1/2)
  • Based on set theory
  • The query terms are connected with the operators
    AND, OR, NOT
  • Example
  • Query: restaurants AND (Mideastern OR vegetarian)
    AND inexpensive
  • Answer: documents that contain the words
    restaurants, Mideastern, inexpensive or the words
    restaurants, vegetarian, inexpensive
  • The query is converted to disjunctive normal form
    (each component corresponds to a row of the truth
    table where the query is true)

10
Disjunctive normal form
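$k_a$  $k_b$  $k_c$  $\lnot k_c$  $k_a \land k_b$  $k_a \land \lnot k_c$  $(k_a \land k_b) \lor (k_a \land \lnot k_c)$
1      1      1      0            1                0                      1
1      1      0      1            1                1                      1
1      0      1      0            0                0                      0
1      0      0      1            0                1                      1
0      1      1      0            0                0                      0
0      1      0      1            0                0                      0
0      0      1      0            0                0                      0
0      0      0      1            0                0                      0
$q = k_a \land (k_b \lor \lnot k_c)$, with DNF $(k_a \land k_b) \lor (k_a \land \lnot k_c)$
$= (1,1,1) \lor (1,1,0) \lor (1,0,0)$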
11
Boolean model (2/2)
  • Query-document similarity: a match occurs when at
    least one component of the query's disjunctive normal
    form is identical to a document
  • Similarity values: 0 or 1
  • Example
  • $q = k_a \land (k_b \lor \lnot k_c) \rightarrow (1,1,1) \lor (1,1,0) \lor (1,0,0)$
  • $d = (0,1,0)$
  • Similarity = 0 (even though the term $k_b$ occurs
    in the document).
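
The matching rule in this example can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `dnf_components` and `boolean_similarity` are made up here, and the query is passed as an ordinary Python function over three 0/1 term indicators.

```python
from itertools import product

def dnf_components(query, n_terms):
    """Return every 0/1 assignment of the terms for which the query is true."""
    return [bits for bits in product((1, 0), repeat=n_terms) if query(*bits)]

def boolean_similarity(doc, components):
    """1 if the document vector equals some DNF component, otherwise 0."""
    return 1 if tuple(doc) in components else 0

# q = ka AND (kb OR NOT kc), the query used in the example
def q(ka, kb, kc):
    return bool(ka and (kb or not kc))

components = dnf_components(q, 3)
print(components)                                 # the DNF components of q
print(boolean_similarity((0, 1, 0), components))  # document d = (0, 1, 0)
```

The document (0, 1, 0) contains kb but not ka, so it matches none of the components and gets similarity 0, which is the all-or-nothing behaviour the Boolean model is criticised for.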

12
Disadvantages of the Boolean model
  • The importance of each term is not weighted
  • Order of appearance of each term
  • The NOT factor
  • Difficulty of composing boolean expressions
  • Data retrieval rather than information retrieval
  • It does not provide for
  • Ranking
  • Partial match
  • It returns either very few or far too many documents

13
Vector Model (1/2)
  • Similarity: cosine of the angle between two
    documents dk, dj
  • Computed from the inner product of the document
    vectors

14
Vector model (2/2)
  • Selection using a threshold on the degree of
    similarity
  • Problem: measuring the frequency of the terms
  • a representative measure of the weight of a term
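
A minimal sketch of cosine similarity with a selection threshold, assuming documents and queries are already given as equal-length lists of term weights. The helper names and the sample vectors are illustrative, not taken from the slides.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)

def select(query, docs, threshold):
    """Return (doc_id, similarity) pairs above the threshold, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in docs.items()]
    return sorted((s for s in scored if s[1] >= threshold),
                  key=lambda pair: pair[1], reverse=True)

docs = {"d1": [1.0, 1.0, 0.0], "d2": [0.0, 0.0, 1.0], "d3": [1.0, 0.0, 0.0]}
print(select([1.0, 1.0, 0.0], docs, threshold=0.5))
```

Unlike the Boolean model, a document that shares only some of the query terms (d3 here) still receives a non-zero score and is ranked below the full match.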

15
Advantages - disadvantages
  • Advantages
  • Ranking and assessment of documents based on
    their terms
  • Partial matching
  • Better performance
  • Disadvantages
  • Assumes that the index terms are independent
  • Terms that are missing

16
TFIDF Model
  • N documents; $n_i$: number of documents containing the term $k_i$;
    $freq_{ij}$: frequency of the term $k_i$ in document $d_j$.
  • Term frequency
  • Inverse document frequency
  • Term weighting tf idf
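
The slide's formulas did not survive in the transcript. The sketch below assumes one common textbook variant: term frequency normalised by the most frequent term in the document, idf = log(N / n_i), and weight = tf × idf. The function name `tf_idf` and the toy documents are illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: tf = freq / max freq in the document, idf = log(N / n_i).
    """
    n_docs = len(docs)
    # n_i: number of documents that contain term i
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_freq = max(freq.values()) if freq else 1
        weights.append({
            term: (count / max_freq) * math.log(n_docs / doc_freq[term])
            for term, count in freq.items()
        })
    return weights

docs = [["car", "auto", "car"], ["car", "engine"]]
for w in tf_idf(docs):
    print(w)
```

A term that occurs in every document ("car" here) gets idf 0 and therefore weight 0, which is the intended effect: it does not help to tell documents apart.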

17
Advantages - disadvantages
  • Better performance
  • Approximation through partial matching
  • Ranking based on the value of the cosine
  • Disadvantage: assumes that the index terms are
    independent

18
The concept of document clustering
  • Collection C of documents (index terms)
  • Query: a set A of index terms
  • Which documents belong to A?
  • Which features describe the terms of A
    (intra-cluster similarity, tf)
  • Which features distinguish the members
    of A (inter-cluster similarity, idf)

19
Clustering
  • Given a collection of documents, build a
    hierarchical grouping (taxonomy) based on a
    measure of their similarity
    (e.g. Yahoo)
  • Document similarity measures
  • Representation based on TFIDF
  • Euclidean distances between documents
  • Cosine of the angle between documents
  • Problems
  • Noise from a large number of useless terms
  • The notion of noise depends on the collection

20
Top-down clustering
  • k-Means Repeat
  • Choose k arbitrary centroids
  • Assign each document to nearest centroid
  • Recompute centroids
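
A small pure-Python sketch of the k-means loop. Assumptions made for the illustration: documents are numeric vectors, distance is squared Euclidean, the "arbitrary" initial centroids are simply the first k points, and the loop runs a fixed number of iterations instead of testing for convergence.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, iterations=10):
    """Return (assignment, centroids) after a fixed number of iterations."""
    centroids = [list(p) for p in points[:k]]  # arbitrary choice: first k points
    assignment = []
    for _ in range(iterations):
        # Assign each document to its nearest centroid
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Recompute each centroid as the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
assignment, centroids = k_means(points, k=2)
print(assignment, centroids)
```

In a retrieval setting the vectors would typically be TFIDF document vectors and the distance could be replaced by one minus the cosine similarity.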

21
Bottom-up clustering
  • Initially G is a collection of singleton groups,
    each with one document
  • Repeat
  • Find Γ, Δ in G with max s(Γ ∪ Δ)
  • Merge group Γ with group Δ
  • For each Γ keep track of best Δ

22
Probabilistic Model (1/2)
  • There is a subset R of the documents that are relevant to the
    query
  • The user indicates the relevant documents
  • Estimate the probability that a document is among the
    user's choices
  • $w_{ij} \in \{0, 1\}$: binary representation of the terms, as in
    the boolean model

23
Probabilistic Model (2/2)
  • The returned documents satisfy the query with
    a probability greater than a given
    threshold

24
Extended similarity
  • "Where can I get my motorbike fixed?"
  • "A good garage to repair your
    2-wheeler is at ..."
  • auto and car often occur together
    (co-occur)
  • Documents with related words are related
  • Basic approaches for search and
    clustering
  • Thesauri (WordNet)
  • Association of terms that occur together

[Figure: several documents in which "auto" and "car" co-occur, leading to the association car ↔ auto.]
25
Latent Semantic Indexing
  • Goal
  • Map the documents-terms matrix to a
    lower-dimensional matrix that corresponds to
    concepts
  • Conceptual rather than lexical similarity
  • Mathematically complex model based on
    linear algebra

26
Latent Semantic Indexing
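[Figure: singular value decomposition (SVD) of the t × d term-document matrix A (t terms, d documents) into U, D and V, with r retained dimensions; example terms: car, auto.]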
27
Extended Boolean
  • Goal
  • Assign weights to the terms of Boolean queries

Logical operators (and, or, etc.)
28
Other models
  • Set theory
  • Fuzzy set
  • Algebraic
  • Generalised Vector model
  • Neural networks
  • Probability theory
  • Bayesian networks
  • Inference networks
  • Belief networks

29
Retrieval evaluation
  • Precision
  • Relevant answers (Ra)/ Total answers (A)
  • Recall
  • Relevant answers / Relevant documents

[Figure: the document collection, containing the answer set A, the relevant set R, and their intersection Ra.]
30
Example
  • For a query q, the relevant documents are d3,
    d5, d9, d25, d39, d44, d56, d71, d89, d123
  • The search engine returned, in order of
    relevance, the documents d123, d84, d56, d6, d8,
    d9, d511, d129, d187, d25, d38, d48, d250, d113,
    d3
  • Precision: Ra/A = 5/15 = 33.3%
  • Recall: Ra/R = 5/10 = 50%

31
Precision / recall curve
Answer rank (relevant hits only)  Precision  Recall
1                                 1/1        1/10
3                                 2/3        2/10
6                                 3/6        3/10
10                                4/10       4/10
15                                5/15       5/10
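
The points of the curve can be recomputed from the ranked answer and the set of relevant documents given in the example. The sketch below is illustrative: the function name is made up, and `Fraction` is used so the values print as exact (reduced) ratios.

```python
from fractions import Fraction

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
           "d25", "d38", "d48", "d250", "d113", "d3"]

def precision_recall_points(ranking, relevant):
    """Return (rank, precision, recall) at every rank holding a relevant document."""
    points = []
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((rank, Fraction(hits, rank), Fraction(hits, len(relevant))))
    return points

points = precision_recall_points(ranking, relevant)
for rank, precision, recall in points:
    print(rank, precision, recall)
```

The last point (rank 15) gives the overall precision 5/15 and recall 5/10 of the whole answer.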
32
Retrieval evaluation
  • Measures that combine precision and recall
  • Harmonic mean
  • Other statistics
  • User-oriented measures
  • V: relevant documents known to the user
  • Rk: retrieved and known (A ∩ V)
  • Ru: retrieved and previously unknown to the user
  • Coverage: proportion of the known relevant documents that were retrieved, Rk / V
  • Novelty: proportion of new relevant
    documents, Ru / (Rk + Ru)
  • Document collections for evaluating methods
  • TREC
  • ISI
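
A short sketch of the combined and user-oriented measures. The harmonic mean is written in its usual form 2PR / (P + R); the counts passed to `coverage` and `novelty` are hypothetical numbers chosen only for the illustration, and the sizes of V, Rk and Ru are passed as integers.

```python
from fractions import Fraction

def harmonic_mean(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0
    return 2 * precision * recall / (precision + recall)

def coverage(rk, v):
    """Share of the relevant documents known to the user that were retrieved."""
    return Fraction(rk, v)

def novelty(rk, ru):
    """Share of the retrieved relevant documents that were new to the user."""
    return Fraction(ru, rk + ru)

# Precision 5/15 and recall 5/10, as in the worked example
print(harmonic_mean(Fraction(5, 15), Fraction(5, 10)))
# Hypothetical counts: the user knew 8 relevant documents, 4 of them were
# retrieved, and 1 further retrieved relevant document was new to the user
print(coverage(4, 8), novelty(4, 1))
```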

33
Query Languages
  • They help the user submit queries and
    rank the results (data
    retrieval languages have no ranking)
  • Protocols: languages that are not aimed at the
    user but are used by systems (e.g.
    for submitting queries to a CD-ROM archive or to
    on-line databases: Z39.50, CCL, WAIS)
  • Capabilities of query languages
  • Keywords (Single words, Context, boolean, natural
    language)
  • Pattern matching (words, prefixes, suffixes,
    error handling, ranges, regular expressions
    and extensions)
  • Structural queries (Forms, hypertext,
    hierarchical)

34
Keyword based querying (1/2)
  • Submitting single words
  • Submitting phrases (searching for a set of words that
    form a phrase)
  • Proximity measurement
  • Words or phrases are submitted together with a maximum
    allowed distance between them
  • Natural language queries
  • queries and documents are converted into
    term vectors with a weight for each term
  • search for the documents that are most similar to
    the queries
  • extraction of representative keywords from
    the queries

35
Keyword based querying (2/2)
  • Boolean queries consist of
  • simple queries (atoms) that retrieve documents
  • boolean operators (AND, OR, NOT, BUT)
    that are applied to sets of documents
  • A query tree is defined whose leaves correspond
    to the queries and whose internal nodes correspond to the
    operators
  • Example: translation AND (syntax OR syntactic)

[Figure: query tree with root AND, whose children are the leaf "translation" and an OR node with the leaves "syntax" and "syntactic".]
36
Pattern matching (1/3)
  • Search for lexical patterns inside
    documents
  • Patterns are combined with boolean
    operators to form keyword queries
  • Substrings
  • e.g. "any flow" matches "many flowers"
  • Ranges: alphabetical search for words
    that lie between a pair of strings
  • e.g. looking up a dictionary

37
Pattern matching (2/3)
  • Queries allowing errors: a string is
    given and is altered in order to find
    similar words
  • Alterations: insertion, deletion, substitution
    of letters, and changes in their position
  • Threshold on the alterations (edit distance): the
    minimum number of alterations needed to
    make two strings identical.
  • Regular expressions: strings
    or the following combinations of strings
  • Concatenation of strings: (te le) → tele
  • Union (alternation): (me|se)
  • Repetition of a string: e*
  • e.g. pro (tein|blem) (ε|0|1|2)* → protein or
    problem02, where ε is the empty string
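
A minimal sketch of searching with an error threshold, using the standard dynamic-programming edit distance with insertions, deletions and substitutions. Swapping adjacent letters is not counted as a single operation in this version, and the function names and word list are illustrative.

```python
def edit_distance(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def within_threshold(query, words, max_errors):
    """Words whose edit distance from the query does not exceed the threshold."""
    return [w for w in words if edit_distance(query, w) <= max_errors]

print(edit_distance("survey", "surgery"))
print(within_threshold("flower", ["flowers", "flow", "tower"], max_errors=1))
```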

38
Pattern matching (3/3)
  • Extended patterns
  • Classes of characters: combinations of character
    sets at positions of a pattern
    (e.g. allowing digits at positions of a pattern)
  • Wild characters (e.g. tele* → television,
    tele-education, teleconference, etc.)
  • Conditional expressions: searching or not for a
    part of a pattern.

39
Structural queries (1/3)
  • They allow searches that combine the
    content of the documents with their structure
  • Forms
  • The documents are structured into fields that neither
    overlap nor are nested
  • Search for patterns in a specific field

40
Structural queries (2/3)
  • Hypertext
  • Documents that have links either
  • between each other
  • at specific points of the text
  • Patterns are searched in pages or in their
    neighbours
  • Hierarchical structure
  • Patterns are searched in specific structures
  • The structures are encoded by tags (as in
    HTML), which delimit regions in the text
  • The regions may follow one another,
    overlap, or be nested
  • Indexing covers not only the terms but also
    the regions

41
Structural queries (3/3)
  • Example of a hierarchical structure and a query

[Figure: a document "Chapter 4" with the sections "4.1 Introduction" ("In this chapter ...") and "4.4 Structured queries", shown as a hierarchy: chapter → sections → titles and a figure. Query: the figure of a section whose title contains "structured".]
42
Query improvement (expansion) techniques
  • The user indicates the relevant documents (User
    relevance feedback)
  • Without user involvement
  • Information from the returned documents
    (automatic local analysis)
  • Information from the document collection (automatic
    global analysis)

43
User Relevance Feedback (1/2)
  • The user assesses the returned documents
    (relevant, non-relevant clusters)
  • Assumption: the relevant documents have similar
    terms
  • Goal: modify the query with related
    terms
  • Techniques
  • query expansion,
  • term reweighting

44
User Relevance Feedback (2/2)
  • query expansion
  • search for the query vector that best separates
    the relevant from the irrelevant documents
  • Used in the vector model.
  • term reweighting
  • re-estimation of the weight coefficient of the
    query terms
  • Used in the vector and probabilistic
    models

45
Analysis of the returned documents (local analysis)
  • Automatic identification of terms related to the
    query terms
  • Local clustering
  • The terms undergo stemming
  • Build a term-document matrix m (frequencies of the
    terms in the documents)
  • Term-term matrix $s = m m^t$, which shows for
    each term its similarity to the others ($m^t$
    is the transpose of m)
  • For each query term, the cluster with the most related terms
    is selected from matrix s
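
The matrix steps above can be sketched without any libraries. Assumptions for the illustration: the term-document matrix is a list of rows (one per stemmed term), similarities are raw, unnormalised inner products, and the "cluster" of a query term is simply its `size` most similar terms. The function names and the toy matrix are made up.

```python
def term_term_matrix(m):
    """s = m * m^t: s[u][v] is the inner product of the rows of terms u and v."""
    return [[sum(a * b for a, b in zip(row_u, row_v)) for row_v in m] for row_u in m]

def local_cluster(s, terms, query_term, size):
    """The `size` terms most similar to query_term according to matrix s."""
    u = terms.index(query_term)
    others = [v for v in range(len(terms)) if v != u]
    others.sort(key=lambda v: s[u][v], reverse=True)
    return [terms[v] for v in others[:size]]

terms = ["car", "auto", "engine", "flower"]
# Rows: terms; columns: frequencies in three returned documents
m = [
    [2, 1, 0],  # car
    [1, 2, 0],  # auto
    [1, 0, 0],  # engine
    [0, 0, 3],  # flower
]
s = term_term_matrix(m)
print(local_cluster(s, terms, "car", size=2))
```

The terms returned for a query term are the candidates for expanding the query.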

46
Local Context Analysis
  • Based on using groups of terms (instead of simple
    keywords) from the most relevant documents that
    are returned (top ranked documents)
  • The groups of terms correspond to
    concepts and are used to improve the
    query
  • The top ranked documents are split into passages, i.e.
    pieces of text of fixed length (e.g. 300
    words)

47
Local Context Analysis (algorithm)
  • Retrieve the n top ranked documents
    returned by executing the query
  • Split the top ranked documents into passages
  • Identify the groups of terms (concepts)
    by examining their co-occurrence with the terms of the
    query
  • Determine the relevance similarity(concept,
    query) with a method similar to tfidf and
    rank the concepts according to their similarity
    to the whole query
  • The query is expanded with the m top ranked concepts,
    each weighted $1 - 0.9\,(i/m)$, where i is the position of the concept
    in the ranking of the concepts

48
Automatic Global Analysis
  • Similarity thesaurus
  • Index terms, similarity query-index term
  • Statistical Thesaurus
  • Clustering of documents using similarity as the criterion
  • Selection of the terms for each cluster

49
Similarity thesaurus
  • Index terms concepts
  • The relationship (similarity) between the index
    terms of a document collection is sought
  • Inverse term frequency $itf_j = \log(t / t_j)$, where t is the
    number of terms in the collection and $t_j$ the number of
    terms in document $d_j$
  • Based on the itf values, the weight of
    each term in each document is computed
  • Build a term-document matrix whose values are the
    weights of the terms in the documents
  • Compute the similarity of the terms: the
    inner product of the rows of the above
    matrix
  • Build the similarity thesaurus: a term-term
    matrix whose values are the similarities

50
Query Expansion with a Similarity thesaurus
  • The weights of the query terms are determined
    in the same way as the weights of the terms
    in the thesaurus
  • Compute the similarity sim(q, kv) of each thesaurus term
    kv to the query
  • The top r ranked terms according to sim(q, kv)
    are selected for the expansion

51
Statistical Thesaurus
  • The thesaurus consists of classes of related terms
    from the document collection
  • Requirement: the classes should be highly
    distinct (discriminating), so that
    they are easy to tell apart
  • This property is ensured by terms with a low
    frequency of occurrence, i.e. very specific terms
  • Method
  • Clustering of the documents
  • from the document clusters, the terms with
    low frequency are selected to define the classes
    of the thesaurus

52
Document clustering: complete link algorithm
  • Initially each document is placed in a separate
    cluster
  • Compute the similarity of every pair of
    clusters using the vector model and the
    cosine measure
  • Merge the pair of clusters with the
    highest similarity. The new cluster that
    is formed has a similarity value equal to the
    similarity of the clusters that were merged
  • Repeat the two steps above until no
    clusters are left to merge
  • The result of the process is a hierarchy of
    document clusters
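
A compact sketch of the procedure. It assumes the usual complete-link definition, which the slide does not spell out: the similarity of two clusters is the lowest cosine similarity over all pairs of their documents. The function names and the three toy vectors are illustrative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms if norms else 0.0

def complete_link(vectors):
    """Return the merge history as (sorted document indices, similarity) pairs."""
    clusters = [[i] for i in range(len(vectors))]  # one cluster per document
    merges = []

    def similarity(a, b):
        # Complete link: the least similar pair decides (assumed definition)
        return min(cosine(vectors[i], vectors[j]) for i in a for j in b)

    while len(clusters) > 1:
        sim, x, y = max(
            (similarity(a, b), x, y)
            for x, a in enumerate(clusters)
            for y, b in enumerate(clusters) if x < y
        )
        merged = clusters[x] + clusters[y]
        merges.append((sorted(merged), sim))
        clusters = [c for i, c in enumerate(clusters) if i not in (x, y)] + [merged]
    return merges

vectors = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0)]
merges = complete_link(vectors)
for members, sim in merges:
    print(members, round(sim, 3))
```

The recorded similarity values are what a threshold can later be applied to when picking clusters from the hierarchy.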

53
Selecting the thesaurus terms
  • Select the clusters that will be used
  • the clusters with similarity values
    greater than a threshold are selected
  • Select the documents to be taken into account from
    the selected clusters
  • Use a threshold on the number of documents of the
    clusters that will be processed
  • Select the thesaurus terms
  • For each term of the selected documents
    the Minimum inverse document
    frequency (MIDF) is computed
  • the terms with MIDF values smaller than
    a threshold are selected

54
Query Expansion with a Statistical thesaurus
  • Compute the weight of each term that belongs to
    a class
  • Based on this computation, the
    weight of each class is computed
  • Select the term classes that will be used
    for query expansion

55
Web searching
  • Search Engines
  • Web directories
  • Hyperlink structure exploitation

56
Problems
  • Distributed data
  • Volatile data
  • Large volume of data
  • Quality
  • Heterogeneous data - multimedia

57
Search Engines
  • Difference from IR
  • The search is not run over the documents but over a
    (central) index
  • Indexing inverted files
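
A toy inverted file, to make the idea concrete: each term maps to the set of pages that contain it, so a query is answered from the index without scanning the pages. The tokenisation (lower-casing and splitting on whitespace), the function names and the sample pages are simplifying assumptions.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to the set of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in text.lower().split():
            index[term].add(page_id)
    return index

def search_any(index, terms):
    """Pages containing at least one of the terms."""
    result = set()
    for term in terms:
        result |= index.get(term, set())
    return result

def search_all(index, terms):
    """Pages containing every one of the terms."""
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings) if postings else set()

pages = {
    "p1": "web search engines use an index",
    "p2": "crawlers feed the index",
    "p3": "web crawlers follow links",
}
index = build_inverted_index(pages)
print(sorted(search_any(index, ["web", "index"])))
print(sorted(search_all(index, ["web", "crawlers"])))
```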

[Figure: search engine architecture with the components User, Interface, Query Engine, Indexer, Crawler and Web.]
58
Query Engine
  • Boolean, proximity, stemming, stop words
  • AltaVista: retrieves pages that contain
    at least one term
  • HotBot: retrieves pages that contain ALL
    the terms
  • Several search engines share the same query engine
    (Magellan, Excite)

59
Ranking
  • Tf-idf model
  • Boolean spread: extension of the boolean model
  • How many terms occur in each page of the answer
  • How many terms occur in each page to which
    the pages of the answer link
  • Vector spread: extension of the vector model that
    also computes the similarity with the pages
    to which the pages of the
    answer link
  • Most cited: only the terms of the pages that link
    to the pages of the answer
  • Web query: how connected the web pages are
  • HITS: hubs, authorities
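
A small sketch of the HITS iteration on a link graph, written from the standard description of the algorithm: a page's authority score is the sum of the hub scores of the pages linking to it, its hub score is the sum of the authority scores of the pages it links to, and both are normalised after every round. The graph, the function name and the fixed iteration count are assumptions.

```python
import math

def hits(links, iterations=20):
    """links: {page: [pages it links to]}. Returns (hub, authority) score dicts."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages that link to p
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        # Hub: sum of the authority scores of the pages that p links to
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

links = {"a": ["c"], "b": ["c"], "c": []}
hub, auth = hits(links)
print(max(auth, key=auth.get), hub)
```

In this toy graph "c" is the only authority, while "a" and "b" are equally good hubs.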

60
Crawling
  • Goal: keep a central catalogue of pages
    up to date
  • Periodic updating (up to 2 months), plus updates
    submitted by the administrators of the pages
  • Techniques
  • Depth first: expansion in depth along a link
  • Breadth first: all the links of a page
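
The two visiting orders can be compared on an in-memory link graph. This sketch makes no network requests; the graph, the page names and the function name `crawl_order` are hypothetical. A queue gives the breadth-first order and a stack gives a depth-first-style order.

```python
from collections import deque

def crawl_order(links, seed, strategy="breadth"):
    """Return the order in which pages are visited, starting from seed."""
    frontier = deque([seed])
    seen = {seed}
    order = []
    while frontier:
        # Queue (oldest first) for breadth first, stack (newest first) for depth first
        page = frontier.popleft() if strategy == "breadth" else frontier.pop()
        order.append(page)
        outlinks = links.get(page, [])
        if strategy != "breadth":
            outlinks = list(reversed(outlinks))  # so the first link is followed first
        for target in outlinks:
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return order

links = {"home": ["a", "b"], "a": ["c"], "b": ["d"], "c": [], "d": []}
print(crawl_order(links, "home", "breadth"))
print(crawl_order(links, "home", "depth"))
```

Breadth first visits every link of a page before going deeper, while depth first follows one link as far as it goes before returning to the others.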

61
Harvesting
  • Distributed architecture
  • Advantages (disadvantages of crawlers)
  • Load on the operation of web servers
  • Traffic problem: pages are retrieved and their
    content is thrown away
  • Independent engines without coordination
  • Disadvantages: many servers are required

62
Architecture
User
63
Brokers - Gatherers
  • Gatherers: collect information at regular
    intervals
  • Information goes to several brokers
  • Runs for a single server (no traffic)
  • Brokers: User Interface and indexing from
    gatherers and brokers
  • Topical and central brokers
  • Cooperation between brokers (filtering)

64
Replicator - Cache
  • Replicator
  • Replicates brokers according to demand and
    size
  • Assigns gatherers to brokers
  • Object cache
  • reduces server load and traffic

65
Web directories - Browsing
  • Hierarchical classifications of human knowledge (Yahoo)
  • Advantage: retrieval precision
  • Disadvantage: the classification
  • Combination of searching and browsing (WebGlimpse
    allows searching within a site through page indexing)
  • Meta-searchers

66
Hyperlink searching
  • Web Query Languages
  • Combination of content with link structure
  • Software Agents
  • They search Web pages by following the links
  • Heuristics for choosing the priority of pages

67
Recommendation systems
  • Social recommendation or collaborative filtering
  • Relevance feedback by many users for information
    ranking
  • Method
  • k-nearest neighbors (case-based reasoning)
  • Applications
  • User actions prediction
  • User profile learning
  • Links evaluation recommendation (Letizia,
    Syskill & Webert)

68
OPACs
  • 1st generation
  • short, non-standard records
  • search by title, author
  • 2nd generation
  • search by subject headings, keywords
  • 3rd generation
  • search vocabulary,
  • records with enriched information,
  • GUI, Z39.50, metadata