vitmav03 - PowerPoint PPT Presentation

About This Presentation

Title:

vitmav03

Slides: 44

Provided by: christo400

Transcript and Presenter's Notes

Title: vitmav03

1
Text Mining: Lecture X

vitmav03
BME TMIT
Tikk Domonkos
Szaszkó Sándor
http://textmining.tmit.bme.hu

2
Text Mining: Information Retrieval 2.
Lecture III

vitmav03
BME TMIT
Tikk Domonkos
Szaszkó Sándor
http://textmining.tmit.bme.hu

3
Previous lecture: reminder

Index construction

4
Today's lecture

Parametric and field search
Zones in documents
Document scoring: zone weighting
Index support for scoring
tf-idf and the vector space

5
Parametric search

Each document has, in addition to text, some
meta-data in fields, e.g.,
Language = French
Format = pdf
Subject = Physics, etc.
Date = Feb 2000
(Language, Format, Subject and Date are the
fields; French, pdf, Physics and Feb 2000 are
their values.)
A parametric search interface allows the user to
combine a full-text query with selections on
these field values, e.g.,
language, date range, etc.
6
Parametric search example
Notice that the output is a (large) table.
Various parameters in the table (column headings)
may be clicked on to effect a sort.
7
Parametric search example
We can add text search.
8
Parametric/field search

In these examples, we select field values
Values can be hierarchical, e.g.,
Geography: Continent → Country → State → City
A paradigm for navigating through the document
collection, e.g.,
Aerospace companies in Brazil can be arrived at
first by selecting Geography then Line of
Business, or vice versa

9
Index support for parametric search

Must be able to support queries of the form
"Find pdf documents that contain stanford
university"
A field selection (on doc format) and a phrase
query
Field selection: use an inverted index of field
values → docids
Organized by field name
Use compression etc. as before
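
A minimal sketch of the idea on this slide, assuming a toy in-memory collection. The documents, the names `field_index`, `term_index` and `parametric_search` are hypothetical, and the phrase query is simplified to a conjunction of terms (a real phrase query would also need positions).

```python
from collections import defaultdict

# Hypothetical toy collection: doc id -> (field values, body text).
docs = {
    1: ({"format": "pdf", "language": "English"}, "stanford university admissions"),
    2: ({"format": "html", "language": "English"}, "stanford university news"),
    3: ({"format": "pdf", "language": "French"}, "physics at another university"),
}

field_index = defaultdict(set)   # (field name, value) -> doc ids
term_index = defaultdict(set)    # term -> doc ids

for doc_id, (fields, text) in docs.items():
    for name, value in fields.items():
        field_index[(name, value)].add(doc_id)
    for term in text.split():
        term_index[term].add(doc_id)

def parametric_search(field, value, terms):
    """Intersect a field selection with a conjunctive term query."""
    result = set(field_index.get((field, value), set()))
    for term in terms:
        result &= term_index.get(term, set())
    return sorted(result)

hits = parametric_search("format", "pdf", ["stanford", "university"])
print(hits)
```

Only doc 1 is both a pdf and contains both terms, so the field postings and the term postings are combined exactly like two ordinary postings lists.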

10
Parametric index support

Optional: provide richer search on field values,
e.g., wildcards
Find books whose Author field contains "strup"
Range search: find docs authored between
September and December
Inverted index doesn't work (as well)
Use techniques from database range search
Use query optimization heuristics as before
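
The slide does not say which database technique is meant. As one possible illustration, the sketch below keeps the field values sorted and answers a range query with binary search (a B-tree would be the disk-based analogue). The data and the name `range_search` are hypothetical.

```python
import bisect

# Hypothetical (month number, doc id) pairs, kept sorted by month.
authored = sorted([(3, 11), (9, 4), (10, 7), (12, 2), (1, 9), (11, 5)])
months = [m for m, _ in authored]

def range_search(lo, hi):
    """Return doc ids whose month lies in the closed range [lo, hi]."""
    start = bisect.bisect_left(months, lo)
    end = bisect.bisect_right(months, hi)
    return [doc_id for _, doc_id in authored[start:end]]

# Docs authored between September (9) and December (12).
autumn_docs = range_search(9, 12)
print(autumn_docs)
```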

11
Field retrieval

In some cases, must retrieve field values
E.g., ISBN numbers of books by "strup"
Maintain a forward index: for each doc, those
field values that are retrievable
Indexing control file specifies which fields are
retrievable

12
Zones

A zone is an identified region within a doc
E.g., Title, Abstract, Bibliography
Generally culled from marked-up input or document
metadata (e.g., PowerPoint)
Contents of a zone are free text
Not a finite vocabulary
Indexes for each zone allow queries like
"sorting in Title AND smith in Bibliography AND
recur in Body"
Not queries like "all papers whose authors cite
themselves"

Why?
13
Zone indexes: simple view
(Figure: one inverted index per zone: Title,
Body, Author, etc.)
14
So we have a database now?

Not really.
Databases do lots of things we don't need
Transactions
Recovery (our index is not the system of record;
if it breaks, simply reconstruct from the
original source)
Indeed, we never have to store text in a search
engine, only indexes
We're focusing on optimized indexes for
text-oriented queries, not a SQL engine.

15
Scoring
16
Scoring

Thus far, our queries have all been Boolean
Docs either match or not
Good for expert users with precise understanding
of their needs and the corpus
Applications can consume 1000s of results
Not good for (the majority of) users with poor
Boolean formulation of their needs
Most users don't want to wade through 1000s of
results (cf. AltaVista)

17
Scoring

We wish to return in order the documents most
likely to be useful to the searcher
How can we rank order the docs in the corpus with
respect to a query?
Assign a score, say in [0,1],
for each doc on each query
Begin with a perfect world: no spammers
Nobody stuffing keywords into a doc to make it
match queries

18
Linear zone combinations

First generation of scoring methods: use a linear
combination of Booleans
E.g.,
Score = 0.6 <sorting in Title> + 0.3 <sorting in
Abstract> + 0.1 <sorting in Body>
Each expression such as <sorting in Title> takes
on a value in {0,1}.
Then the overall score is in [0,1].

For this example the scores can only take on a
finite set of values: what are they?
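
One way to explore the question is to enumerate every combination of zone matches. This is a small sketch using the three weights from the example; the function name `zone_score` is hypothetical, and scores are rounded only to hide floating-point noise.

```python
from itertools import product

# Zone weights taken from the example on this slide.
weights = {"Title": 0.6, "Abstract": 0.3, "Body": 0.1}

def zone_score(matches):
    """matches maps zone name -> 0 or 1 (Boolean match in that zone)."""
    return sum(weights[z] * matches[z] for z in weights)

# Try all 2^3 match patterns and collect the distinct scores.
possible = sorted({
    round(zone_score(dict(zip(weights, bits))), 10)
    for bits in product([0, 1], repeat=len(weights))
})
print(possible)
```

With three zones there are at most eight match patterns, so at most eight distinct scores.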
19
Linear zone combinations

In fact, the expressions between < > on the last
slide could be any Boolean query
Who generates the Score expression (with weights
such as 0.6 etc.)?
In uncommon cases: the user, through the UI
Most commonly: a query parser that takes the
user's Boolean query and runs it on the indexes
for each zone
Weights determined from user studies and
hard-coded into the query parser

20
Exercise

On the query bill OR rights suppose that we
retrieve the following docs from the various zone
indexes

Zone    Term    Doc IDs
Author  bill    1, 2
Author  rights  (none listed)
Title   bill    3, 5, 8
Title   rights  3, 5, 9
Body    bill    1, 2, 5, 9
Body    rights  3, 5, 8, 9

Compute the score for each doc based on the
weightings 0.6, 0.3, 0.1
21
General idea

We are given a weight vector whose components sum
up to 1.
There is a weight for each zone/field.
Given a Boolean query, we assign a score to each
doc by adding up the weighted contributions of
the zones/fields.
Typically users want to see the K
highest-scoring docs.
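
Selecting the K best docs does not require sorting every score. A possible sketch with Python's `heapq` follows; the score table is hypothetical and `top_k` is an illustrative name.

```python
import heapq

# Hypothetical accumulated scores: doc id -> score in [0, 1].
scores = {1: 0.7, 2: 0.7, 3: 0.4, 5: 0.4, 8: 0.4, 9: 0.4}

def top_k(scores, k):
    """Return the k highest-scoring (doc id, score) pairs.

    Ties are broken in favour of the lower doc id.
    """
    return heapq.nsmallest(k, scores.items(),
                           key=lambda item: (-item[1], item[0]))

best = top_k(scores, 3)
print(best)
```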

22
Index support for zone combinations

In the simplest version we have a separate
inverted index for each zone
Variant: have a single index with a separate
dictionary entry for each term and zone
E.g.,

bill.author → 1, 2
bill.title → 3, 5, 8
bill.body → 1, 2, 5, 9

Of course, compress zone names like
author/title/body.
23
Zone combinations index

The above scheme is still wasteful: each term is
potentially replicated for each zone
In a slightly better scheme, we encode the zone
in the postings
At query time, accumulate contributions to the
total score of a document from the various
postings, e.g.,

bill → 1.author, 1.body → 2.author, 2.body → 3.title

As before, the zone names get compressed.
24
Score accumulation

As we walk the postings for the query bill OR
rights, we accumulate scores for each doc in a
linear merge as before.
Note: we get both bill and rights in the Title
field of doc 3, but score it no higher.
Should we give more weight to more hits?
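
The accumulation can be sketched as follows, using the postings from the bill OR rights exercise with the zone stored in each posting. Two assumptions: the weights 0.6, 0.3, 0.1 are taken to apply to Author, Title, Body in that order (the slides do not state the mapping), and a dictionary of accumulators stands in for the linear merge. The name `score_or_query` is hypothetical.

```python
from collections import defaultdict

# Postings with the zone encoded in each entry (exercise data).
postings = {
    "bill":   [(1, "author"), (1, "body"), (2, "author"), (2, "body"),
               (3, "title"), (5, "title"), (5, "body"), (8, "title"),
               (9, "body")],
    "rights": [(3, "title"), (3, "body"), (5, "title"), (5, "body"),
               (8, "body"), (9, "title"), (9, "body")],
}

# Assumed mapping of the exercise weights to zones.
weights = {"author": 0.6, "title": 0.3, "body": 0.1}

def score_or_query(terms):
    """Score docs for a Boolean OR query.

    A zone contributes its weight once per doc, no matter how many
    query terms hit that zone.
    """
    matched_zones = defaultdict(set)      # doc id -> zones with a hit
    for term in terms:
        for doc_id, zone in postings.get(term, []):
            matched_zones[doc_id].add(zone)
    return {d: sum(weights[z] for z in zones)
            for d, zones in sorted(matched_zones.items())}

scores = score_or_query(["bill", "rights"])
print(scores)
```

Doc 3 has both terms in its title but still receives the title weight only once, which is the behaviour the slide questions.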

25
Scoring density-based

Zone combinations relied on the position of terms
in a doc: title, author, etc.
Obvious next idea: if a document talks about a
topic more, then it is a better match
This applies even when we only have a single
query term.
A query should then just specify terms that are
relevant to the information need
Document is relevant if it has a lot of the terms
Boolean syntax not required: more web-style

26
Binary term presence matrices

Record whether a document contains a word:
document is a binary vector X in {0,1}^v
Query is a vector Y
What we have implicitly assumed so far
Score: query satisfaction = overlap measure

27
Example

On the query ides of march, Shakespeare's Julius
Caesar has a score of 3
All other Shakespeare plays have a score of 2
(because they contain march) or 1
Thus in a rank order, Julius Caesar would come
out on top

28
Overlap matching

What's wrong with the overlap measure?
It doesn't consider:
Term frequency in document
Term scarcity in collection (document mention
frequency)
"of" is commoner than "ides" or "march"
Length of documents
(And queries: score not normalized)

29
Overlap matching

One can normalize in various ways
Jaccard coefficient
Cosine measure
What documents would score best using Jaccard
against a typical query?
Does the cosine measure fix this problem?
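
To experiment with these questions, the three measures can be written down for binary vectors represented as term sets. This is a sketch using the standard set-based definitions; the two documents are hypothetical.

```python
import math

def overlap(x, y):
    """Number of terms shared by two term sets."""
    return len(x & y)

def jaccard(x, y):
    """Size of the intersection over size of the union (0.0 if both empty)."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0

def cosine_binary(x, y):
    """Cosine of two binary vectors: shared terms / sqrt(|X| * |Y|)."""
    if not x or not y:
        return 0.0
    return len(x & y) / math.sqrt(len(x) * len(y))

query = {"ides", "of", "march"}
short_doc = {"ides", "of", "march", "beware"}          # hypothetical
long_doc = short_doc | {f"w{i}" for i in range(96)}    # same hits, 100 terms

for name, doc in [("short", short_doc), ("long", long_doc)]:
    print(name, overlap(query, doc), jaccard(query, doc),
          cosine_binary(query, doc))
```

Both documents have the same overlap, but Jaccard and cosine both score the short one higher, which hints at the kind of document each normalization favours.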

30
Term-document matrix based on occurrence counts

So far we have not taken word frequency into
account
Number of occurrences of a term in a document
Bag-of-words model
A document is a vector in the space N^v (one
column)

31
Occurrence counts vs. frequency

Consider again the query ides of march
ides occurs 5 times in Julius Caesar
ides occurs in no other play
march occurs a few dozen times (in several
plays)
of appears in every play
By this measure, the most relevant play would be
the one containing the most occurrences of "of"

32
Term frequency tf

A further problem: the previous measure favors
long documents, because they contain more words
First fix: use frequency instead of the raw
occurrence count (support)
tf_{t,d} = the number of occurrences of term t
in d divided by the number of words in d
Good news: the tf values of a document sum to 1
The L1 norm of the document vector is one
Question: is raw tf an adequate measure?
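
A short sketch of the length-normalized term frequency defined on this slide (occurrences of a term divided by the number of words in the document). The sample sentence and the name `normalized_tf` are hypothetical, and tokenization is plain whitespace splitting.

```python
from collections import Counter

def normalized_tf(text):
    """tf[t] = occurrences of t in the document / number of words in it."""
    words = text.lower().split()
    counts = Counter(words)
    return {t: c / len(words) for t, c in counts.items()}

tf = normalized_tf("beware the ides of march the ides are come")
total = sum(tf.values())   # L1 norm of the document vector
print(tf["ides"], total)
```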

33
Weighting term frequency tf

What is the relative importance of a word
occurring in a document
0 vs. 1 times
1 vs. 2 times
2 vs. 3 times
Not trivial: clearly the more often it appears
the better, but the benefit does not grow
proportionally (whereas with raw tf it does)
We can still use raw tf
But there are other options often used in
practice

34
Matching by dot product

We define the match score as the dot product of
the document and the query
Note: it is 0 if they are orthogonal (no words
in common)
We rank by the match score
We can also use logarithmic weighting (wf) in
the product instead of tf
It still does not take into account
The rarity (discriminating power) of a word in
the document collection (ides vs. of)
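
A sketch of dot-product matching with raw counts and with log-scaled weights. The slides do not preserve the definition of wf, so the common form wf = 1 + ln(count) for count > 0 (and 0 otherwise) is assumed here. The two documents and the name `dot_score` are hypothetical.

```python
import math
from collections import Counter

def wf(count):
    """Assumed log scaling: 1 + ln(count) for count > 0, else 0."""
    return 1.0 + math.log(count) if count > 0 else 0.0

def dot_score(query, doc, weight=wf):
    """Dot product of the weighted term-count vectors of query and doc."""
    q, d = Counter(query.split()), Counter(doc.split())
    return sum(weight(q[t]) * weight(d[t]) for t in q if t in d)

query = "ides of march"
doc_a = "of of of of of of of of"      # hypothetical: only the common word
doc_b = "beware the ides of march"     # hypothetical: all three query words

raw_a = dot_score(query, doc_a, weight=float)   # raw counts as weights
raw_b = dot_score(query, doc_b, weight=float)
wf_a = dot_score(query, doc_a)                  # log-scaled weights
wf_b = dot_score(query, doc_b)
print(raw_a, raw_b, wf_a, wf_b)
```

Log scaling shrinks the advantage of the document that merely repeats "of", but that document still scores slightly higher, since nothing yet accounts for how rare a word is in the collection.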

35
A word's importance should depend on its support
in the corpus

Which tells us more about a document's content?
10 occurrences of the word adóalany ("taxpayer")?
10 occurrences of is ("also")?
We would like to limit the weight of common words
But what counts as common?
Idea: collection frequency (cf)
The total number of occurrences of the term in
the whole collection

36
Document frequency (df)

However, document frequency (df) seems better

Word       cf     df
ferrari    10422  17
insurance  10440  3997

Both measures can only be given for known
(static) corpora.
So how do we use df?

37
tf-idf weighting

Components of the tf-idf measure
Term frequency (tf)
or wf: measures the density of the word in the
document
Inverse document frequency (idf)
gives the word's discriminating power based on
its rarity in the corpus
can be computed simply from the number of
documents containing the word (idf_i = 1/df_i)
but the most common version is a different one
(formula not preserved in the transcript)

38
Summary: tf-idf

Assign to each word i in each document d the
following weight (formula not preserved in the
transcript)
Increases with the number of occurrences within
the document
Increases with rarity within the corpus

What is the weight of a word that appears in
every document?
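
The weight formula is missing from this transcript. The sketch below assumes the widely used definition w(i,d) = tf(i,d) * log(n / df(i)), where n is the number of documents and tf is the raw count; the three-document corpus and the name `tf_idf` are hypothetical.

```python
import math
from collections import Counter

# Hypothetical three-document corpus.
corpus = {
    "d1": "the ides of march",
    "d2": "the march of the penguins",
    "d3": "the insurance policy",
}

def tf_idf(corpus):
    """Return {doc: {term: tf * log(n / df)}} using raw counts as tf."""
    n = len(corpus)
    counts = {d: Counter(text.split()) for d, text in corpus.items()}
    # df[t] = number of documents that contain t at least once.
    df = Counter(t for c in counts.values() for t in c)
    return {d: {t: tf * math.log(n / df[t]) for t, tf in c.items()}
            for d, c in counts.items()}

weights = tf_idf(corpus)
print(weights["d1"])
```

Under this assumed formula a word that appears in every document, such as "the" here, gets log(n/n) = 0, while a word found in a single document gets the largest idf.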
39
Real-valued term-document matrix

A function of the word occurrence counts
Bag-of-words model
Each document is represented by a real-valued
vector in R^v
Log-scaled tf.idf

Can be greater than 1!
40
On the bag-of-words representation

It does not distinguish between the sentences
"Nietzsche said God is dead"
and
"God said Nietzsche is dead".
Is this a problem for us?
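
The point can be checked directly. This sketch uses English renderings of the two sentences on this slide and a hypothetical helper `bag_of_words` that lowercases and splits on whitespace.

```python
from collections import Counter

def bag_of_words(sentence):
    """Map a sentence to its multiset of lowercased words."""
    return Counter(sentence.lower().split())

a = bag_of_words("Nietzsche said God is dead")
b = bag_of_words("God said Nietzsche is dead")
same = a == b   # word order is lost, so the two bags coincide
print(same)
```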

41
Document vectors

We treat each document as a vector of wf-idf
values, with one component per word
So we have a vector space
Its axes are the words/terms
Documents are points in the vector space
Even with stemming, the dimension of the vector
space will be well above 20,000
(Viewing the matrix the other way round, the
documents can be the axes and the words are the
elements of the vector space)

42
Document vectors (2)

Every query q can also be viewed as a vector in
the vector space (usually very sparse)
We determine the match by the proximity of the
vectors
Each document can then be assigned a relevance
value for the query q
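
The slide leaves the proximity measure open. As one possibility, the sketch below ranks documents by cosine similarity to the query, storing each sparse vector as a dictionary. All weights and document names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {term: weight}."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical wf-idf document vectors and a sparse query vector.
doc_vectors = {
    "julius_caesar": {"ides": 2.1, "march": 0.8, "caesar": 1.5},
    "hamlet":        {"march": 0.4, "ghost": 1.9},
    "macbeth":       {"witch": 2.0, "dagger": 1.2},
}
q = {"ides": 1.0, "march": 1.0}

# Relevance value of each document for q, highest first.
ranking = sorted(doc_vectors,
                 key=lambda d: cosine(q, doc_vectors[d]),
                 reverse=True)
print(ranking)
```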

43
Reading material

MG Ch. 4.4
New Retrieval Approaches Using SMART: TREC 4
Gerard Salton and Chris Buckley. Improving
Retrieval Performance by Relevance Feedback.
Journal of the American Society for Information
Science, 41(4):288-297, 1990.
