Evaluating Hierarchical Clustering of Search Results

1
Evaluating Hierarchical Clustering of Search Results
SPIRE 2005, Buenos Aires
  • Departamento de Lenguajes y Sistemas Informáticos
  • UNED, Spain
  • Juan Cigarrán
  • Anselmo Peñas
  • Julio Gonzalo
  • Felisa Verdejo
  • nlp.uned.es

2
Overview
  • Scenario
  • Assumptions
  • Features of a Good Hierarchical Clustering
  • Evaluation Measures
  • Minimal Browsing Area (MBA)
  • Distillation Factor (DF)
  • Hierarchy Quality (HQ)
  • Conclusion

3
Scenario
  • Complex information needs
  • Compile information from different sources
  • Inspect the whole list of documents
  • More than 100 documents
  • Help to
  • Find the relevant topics
  • Discriminate them from irrelevant documents
  • Approach
  • Hierarchical clustering with Formal Concept Analysis

6
Problem
  • How to define and measure the quality of a
    hierarchical clustering?
  • How to compare different clustering approaches?

7
Previous assumptions
  • Each cluster contains only those documents fully
    described by its descriptors

8
Previous assumptions
  • Open world perspective

9
Good Hierarchical Clustering
  • The content of the clusters.
  • Clusters should not mix relevant with non-relevant
    information

10
Good Hierarchical Clustering
  • The hierarchical arrangement of the clusters
  • Relevant information should be in the same path

11
Good Hierarchical Clustering
  • The number of clusters
  • Number of clusters substantially lower than the
    number of documents
  • How clusters are described
  • Cognitive load of reading a cluster description
  • Ability to predict the relevance of the
    information that it contains (not addressed here)

12
Evaluation Measures
  • Criterion
  • Minimize the browsing effort for finding ALL
    relevant information
  • Baseline
  • The original document list returned by a search
    engine

13
Evaluation Measures
  • Consider
  • Content of clusters
  • Hierarchical arrangement of clusters
  • Size of the hierarchy
  • Cognitive load of reading a document (in the
    baseline): Kd
  • Cognitive load of reading a node descriptor (in
    the hierarchy): Kn
  • Requirement
  • Relevance assessments are available

14
Minimal Browsing Area (MBA)
  • The minimal set of nodes the user has to traverse
    to find ALL the relevant documents minimising the
    number of irrelevant ones
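
A minimal sketch of the MBA idea, under simplifying assumptions that are not in the slides: the hierarchy is a tree rather than a lattice, and every document is attached to exactly one node. Under those assumptions the MBA is the set of nodes lying on a path from the root to any node that holds a relevant document. The function name, the dictionary layout, and the node and document labels are all hypothetical.

```python
from typing import Dict, List, Set


def minimal_browsing_area(
    children: Dict[str, List[str]],
    docs: Dict[str, List[str]],
    relevant: Set[str],
    root: str,
) -> Set[str]:
    """Return the nodes on a root-to-node path that reaches a relevant document.

    Assumes a tree-shaped hierarchy where each document sits in exactly one node.
    """
    mba: Set[str] = set()

    def visit(node: str) -> bool:
        # A node is kept if it holds a relevant document itself ...
        keep = any(d in relevant for d in docs.get(node, []))
        # ... or if any of its descendants does. Every child is visited,
        # so no branch is skipped by short-circuiting.
        for child in children.get(node, []):
            if visit(child):
                keep = True
        if keep:
            mba.add(node)
        return keep

    visit(root)
    return mba


# Hypothetical hierarchy and relevance judgements, for illustration only.
children = {"root": ["a", "b", "c"]}
docs = {"root": ["d1"], "a": ["d4"], "b": ["d2", "d3"], "c": ["d5"]}
relevant = {"d2", "d3", "d4"}

mba = minimal_browsing_area(children, docs, relevant, "root")
```

Here node "c" holds only an irrelevant document, so it stays outside the MBA. In a lattice, where a document can appear under several nodes, choosing the area that minimises irrelevant documents is a harder selection problem that this sketch does not attempt.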

15
Distillation Factor (DF)
  • Ability to isolate relevant information compared
    with the original document list (gain
    factor, DF > 1)
  • Equivalent to the ratio of the precision within
    the MBA to the precision of the original list
  • Considers only the cognitive load of reading
    documents

16
Distillation Factor (DF)
  • Example

Precision = 4/7
Precision (MBA) = 4/5
DF(L) = 7/5 = 1.4
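
The arithmetic of this example can be checked with a short sketch. It assumes, as the figures suggest, that 4 relevant documents sit among 7 in the original list and among 5 in the MBA, and that DF is the ratio of the two precisions. The function and variable names are illustrative.

```python
from fractions import Fraction


def distillation_factor(relevant: int, list_size: int, mba_size: int) -> Fraction:
    """DF as precision inside the MBA divided by precision of the original list."""
    precision_list = Fraction(relevant, list_size)
    precision_mba = Fraction(relevant, mba_size)
    return precision_mba / precision_list


# Counts taken from the example: 4 relevant documents, 7 in the list, 5 in the MBA.
df = distillation_factor(relevant=4, list_size=7, mba_size=5)
print(df, float(df))
```

Because the number of relevant documents cancels, the ratio reduces to list size over MBA size, which is why the example gives 7/5.
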
17
Distillation Factor (DF)
  • Counterexample
  • Bad clustering with good DF
  • Extend the DF measure considering the cognitive
    cost of making browsing decisions → HQ

18
Hierarchy Quality (HQ)
  • Assumption
  • When a node (in the MBA) is explored, all its
    lower neighbours have to be considered: some will
    in turn be explored, some will be discarded
  • Nview: subset of lower neighbours of each node
    belonging to the MBA

19
Hierarchy Quality (HQ)
  • Kn and Kd are directly related with the retrieval
    scenario in which the experiments take place
  • The researcher must tune K = Kn/Kd before
    conducting the experiment
  • HQ > 1 indicates an improvement of the clustering
    versus the original list
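
The HQ formula itself is not in this transcript. The sketch below uses one formulation that is consistent with the slides and should be read as an assumption: the cost of the original list (Kd per document) divided by the cost of browsing the hierarchy (Kd per document in the MBA plus Kn per node descriptor in Nview), which simplifies to |D| / (|D_MBA| + K·|Nview|) with K = Kn/Kd. The function name and the sample counts are hypothetical.

```python
def hierarchy_quality(list_size: int, mba_docs: int, n_view: int, k: float) -> float:
    """Assumed form of HQ: list cost over hierarchy browsing cost, with k = Kn/Kd.

    list_size: documents in the original ranked list
    mba_docs:  documents read inside the MBA
    n_view:    node descriptors the user has to read (size of Nview)
    """
    return list_size / (mba_docs + k * n_view)


# Hypothetical counts: 7 documents in the list, 5 in the MBA, 3 nodes viewed.
hq_cheap_nodes = hierarchy_quality(7, 5, 3, k=0.5)   # descriptors cost half a document
hq_costly_nodes = hierarchy_quality(7, 5, 3, k=1.0)  # descriptors cost a full document
hq_free_nodes = hierarchy_quality(7, 5, 3, k=0.0)    # reduces to DF = 7/5
```

With these counts the hierarchy beats the list when descriptors are cheap to read (HQ above 1) and loses when a descriptor costs as much as a document (HQ below 1), which illustrates why K has to be fixed for the retrieval scenario before the experiment.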

20
Hierarchy Quality (HQ)
  • Example

21
Conclusions and Future Work
  • Framework for comparing different clustering
    approaches taking into account
  • Content of clusters
  • Hierarchical arrangement of clusters
  • Cognitive load of reading documents and node
    descriptions
  • Adaptable to the retrieval scenario in which
    experiments take place
  • Future work
  • Conduct user studies to compare their results
    with the automatic evaluation
  • Results will reflect the quality of the
    descriptors
  • Will be used to fine-tune the Kd and Kn parameters

22
  • Thank you!