1
Which Kinds of Trend Metrics Are More Effective
for Emerging Trend Detection?
  • Yuen-Hsien Tseng, National Taiwan Normal University
  • Yu-I Lin, Taipei Municipal Univ. of Education
  • Chun-Hsien Kuo and Yi-Yang Lee, Science & Technology
    Policy Research and Information Center,
    Taipei, Taiwan, R.O.C. 106
This presentation is based on the work to appear
in Scientometrics.
2
Introduction: ETD
  • Monitoring research trends has always been a
    concern of policy makers
  • it helps resource allocation and technology
    forecasting.
  • Increasingly important research topics are of
    particular interest to those policy makers
  • They have also been termed hot topics, upward
    trends, or emerging trends.
  • ETD (Emerging Trend Detection)

3
But how to detect them effectively?
  • Domain experts are often consulted
  • they are good at identifying interesting research trends
  • But their observations do not generalize
    effectively to fields beyond their expertise
  • when a large number of research topics need to be
    prioritized, inconsistent decisions may result
  • An automatic mechanism for monitoring research
    trends in a large stream of upcoming publications
    would be of great help

4
Detecting trends in scientometrics
  • Noyons and van Raan pointed out that
  • Domain experts are often hard to find, due to
    busy schedules and lack of affinity with
    scientometrics studies
  • Policy makers are often overwhelmed by
    the amount of resulting information

5
Motivations
  • In past trend analysis,
  • different year spans may be used to create the
    time sequence
  • different indices were chosen for trend
    observation
  • A simple count of publications is a questionable
    basis for good trend sequences (Chi et al., 2006)
  • The effectiveness of these choices
  • had not been assessed quantitatively or comparatively
  • This work provides clues to better interpret the
    results when a certain choice was made

6
Questions
  • For effective trend detection, which options
    should be used?
  • Different year spans (data are from Smeaton et al.,
    2003, ACM SIGIR Forum)
  • A simple count used to create a sequence is
    questionable for ETD
  • Different criteria lead to different trend orderings
7
Simple Trend vs. Eigen-Trend
Chi, Tseng, and Tatemura (2006, CIKM) challenged
the validity of the simple accumulation of
published documents over time
  • Simple authority
  • Simple trend
  • (First) Authority: $U_1$
  • (First) Eigen-Trend: $s_{11} V_1$
  • Error

Break down by sources
$D = U S V^T$
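As a rough illustration of the decomposition on this slide, the sketch below computes the first singular triplet of a small source-by-period count matrix with power iteration, then forms the first authority vector and the first eigen-trend. The matrix orientation (rows are sources, columns are time periods), the toy counts, and the function name `first_singular_triplet` are assumptions made for the example, not details taken from Chi et al.

```python
import math

def first_singular_triplet(D, iters=200):
    """Approximate (u1, sigma1, v1) of matrix D (list of rows) by power iteration."""
    n_rows, n_cols = len(D), len(D[0])
    v = [1.0 / math.sqrt(n_cols)] * n_cols          # positive starting vector
    for _ in range(iters):
        u = [sum(D[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]  # u = D v
        w = [sum(D[i][j] * u[i] for i in range(n_rows)) for j in range(n_cols)]  # w = D^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(D[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
    sigma = math.sqrt(sum(x * x for x in u))
    return [x / sigma for x in u], sigma, v

# Hypothetical counts: 3 sources (rows) over 4 time periods (columns).
D = [
    [1, 2, 3, 4],
    [2, 4, 6, 8],
    [5, 1, 0, 3],
]

simple_trend = [sum(row[j] for row in D) for j in range(len(D[0]))]  # column sums
simple_authority = [sum(row) for row in D]                           # row sums

u1, s11, v1 = first_singular_triplet(D)
first_authority = u1                      # weight of each source
eigen_trend = [s11 * x for x in v1]       # s11 * V1, one value per period

print("simple trend:", simple_trend)
print("eigen-trend: ", [round(x, 3) for x in eigen_trend])
```

The simple trend weights every source equally, whereas the eigen-trend weights each source by its entry in the first authority vector, which is one way to read the contrast the slide draws.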
8
Outline of the following talk
  • ETD methodology
  • Trend metrics to be compared
  • Evaluation method
  • Data sets for evaluating ETD
  • Safety agriculture (SA)
  • Information retrieval (IR)
  • Evaluation results
  • Conclusions and implications

9
ETD methodology
  • Documents (terms) were clustered to yield topics
  • For each topic, a time series of number of
    publications over time was created
  • Topics were then ranked by a trend metric
  • an IR-based metaphor
  • Input
  • a set of publications (each with PY, TI, AU, C1,
    SO, ...)
  • Output
  • a ranked list of topics in decreasing order of
    interest
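The input-to-output flow on this slide can be sketched as follows, assuming the clustering step has already assigned each publication a topic. The `topic` field, the toy records, and the placeholder metric `net_growth` are hypothetical; any trend metric that maps a time series to a number could be passed in instead.

```python
from collections import Counter, defaultdict

def topic_time_series(records, years):
    """Count publications per topic and publication year (PY)."""
    counts = defaultdict(Counter)
    for rec in records:
        counts[rec["topic"]][rec["PY"]] += 1
    return {topic: [c[y] for y in years] for topic, c in counts.items()}

def rank_topics(series_by_topic, metric):
    """Return topics in decreasing order of the trend metric."""
    return sorted(series_by_topic,
                  key=lambda t: metric(series_by_topic[t]),
                  reverse=True)

def net_growth(series):
    """Placeholder metric: last count minus first count."""
    return series[-1] - series[0]

# Hypothetical records; only PY and the assumed topic label are used here.
records = [
    {"PY": 2003, "topic": "A"}, {"PY": 2004, "topic": "A"},
    {"PY": 2005, "topic": "A"}, {"PY": 2005, "topic": "A"},
    {"PY": 2003, "topic": "B"}, {"PY": 2003, "topic": "B"},
    {"PY": 2004, "topic": "B"},
]
years = [2003, 2004, 2005]

series = topic_time_series(records, years)
ranking = rank_topics(series, net_growth)
print(series)
print(ranking)
```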

10
Trend metrics to be compared (1/2)
  • api: average percentage of increase
  • used in a foresight survey in Japan (STFC, 2004)
  • used by Noyons et al. when n = 2
  • slp: slope of the linear regression line that
    best fits the data in the time series
  • slpz: same as slp, but the sequence is first
    z-score transformed: $z_i = (d_i - \mathrm{avg})/\mathrm{stderr}$
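A minimal sketch of these three metrics follows. It makes several assumptions: api is taken to be the mean of the period-to-period relative increases $(d_{i+1} - d_i)/d_i$, the regression uses time positions 1..n, and the "stderr" in the z-score is read as the population standard deviation. Counts are assumed to be positive so that api is defined.

```python
import math

def api(d):
    """Average percentage of increase: mean of (d[i+1] - d[i]) / d[i]."""
    steps = [(d[i + 1] - d[i]) / d[i] for i in range(len(d) - 1)]
    return sum(steps) / len(steps)

def slp(d):
    """Least-squares slope of d against time positions 1..n."""
    n = len(d)
    x_mean = (n + 1) / 2
    y_mean = sum(d) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(d, start=1))
    sxx = sum((x - x_mean) ** 2 for x in range(1, n + 1))
    return sxy / sxx

def zscore(d):
    """z-score transform using the population standard deviation (assumption)."""
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in d) / n)
    return [0.0] * n if sd == 0 else [(y - mean) / sd for y in d]

def slpz(d):
    """Slope of the z-score transformed sequence."""
    return slp(zscore(d))

d = [2, 4, 6, 8]
print(api(d), slp(d), slpz(d))
```

One consequence of the transform is that slpz ignores the scale of a topic: multiplying every count by ten changes slp but leaves slpz unchanged.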

11
Trend metrics to be compared (2/2)
  • slppi: a combination of api and slp.
  • $(d_1, d_2, \ldots, d_n) \rightarrow (pi_1, pi_2, \ldots, pi_{n-1})$,
    where $pi_i = (d_{i+1} - d_i)/d_i$
  • may be ideal for sharp increasing trend detection
  • slpc: eigen-trend broken down by C1
  • C1: first author's country
  • slpj: eigen-trend broken down by SO (journal)
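To make slppi concrete, the sketch below transforms a count sequence into its sequence of relative increases and then fits a regression slope to that sequence. Reading slppi as "slp applied to the pi sequence" is an interpretation of the slide, and the helper names are chosen for this example. Counts are assumed to be positive, and at least three periods are needed so that the pi sequence has two points.

```python
def slope(values):
    """Least-squares slope of values against positions 1..n."""
    n = len(values)
    x_mean = (n + 1) / 2
    y_mean = sum(values) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(values, start=1))
    sxx = sum((x - x_mean) ** 2 for x in range(1, n + 1))
    return sxy / sxx

def percent_increases(d):
    """pi_i = (d[i+1] - d[i]) / d[i] for each consecutive pair."""
    return [(d[i + 1] - d[i]) / d[i] for i in range(len(d) - 1)]

def slppi(d):
    """Slope of the sequence of relative increases."""
    return slope(percent_increases(d))

accelerating = [10, 11, 13, 17, 25]   # growth rate itself keeps rising
linear = [10, 20, 30, 40]             # constant increments, shrinking growth rate
geometric = [1, 2, 4, 8]              # constant growth rate

for name, d in [("accelerating", accelerating), ("linear", linear), ("geometric", geometric)]:
    print(name, round(slppi(d), 4))
```

Under this reading, a positive slppi means the growth rate is itself increasing, which fits the slide's remark about sharp increasing trends.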

12
Evaluation method: NAP and Pre_at_R
  • Assume A-E and V-Z are ten items to be ordered
    and A-E are relevant while V-Z are not.
  • Ordering S1 is the best by
  • NAP: Non-interpolated Average Precision rate
  • Pre_at_R: Precision rate at Recall position
  • Pre_at_R = r/R, where R is the total number of
    relevant items and r is the number of relevant
    items in the top R items
  • With NAP and Pre_at_R, we can evaluate which trend
    orderings are best

Pre_at_R = 0.60 = 3/5; NAP = 0.68 = (1/1 + 2/3 + 3/5 + 4/7 + 5/9)/5
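The worked numbers on this slide can be reproduced with the short sketch below. The slide's figure is not available here, so the ordering used is an assumption: it simply places the relevant items A-E at ranks 1, 3, 5, 7, and 9, which is what the fractions in the NAP sum imply.

```python
def pre_at_r(ranking, relevant):
    """Precision within the top R items, where R = number of relevant items."""
    R = len(relevant)
    return sum(1 for item in ranking[:R] if item in relevant) / R

def nap(ranking, relevant):
    """Non-interpolated average precision over the ranked list."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranking, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank      # precision at this relevant item's rank
    return total / len(relevant)

relevant = set("ABCDE")
assumed_ordering = list("AVBWCXDYEZ")   # relevant items at ranks 1, 3, 5, 7, 9
perfect_ordering = list("ABCDEVWXYZ")

print(pre_at_r(assumed_ordering, relevant), round(nap(assumed_ordering, relevant), 2))
print(pre_at_r(perfect_ordering, relevant), nap(perfect_ordering, relevant))
```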
13
Data set: SA
  • Six research domains regarding safety agriculture
    (SA) were enumerated by a group of experts from
    the Science & Technology Policy Research and
    Information Center (STPI)
  • food security, crop protection, livestock,
    fishery, agroforestry, and environment
  • for each domain, a query was formulated to search
    ISI's Web of Science database
  • 72500 records between 1996 and 2005 were
    downloaded

14
Topic detection for safety agriculture
  • Clustering analysis was based on controlled terms
  • 179 SC terms, each occurring in more than 10 documents
  • 3632 DE terms of this kind
  • Co-occurrences of terms from the same field (SC or
    DE) in the same records were counted
  • Similarity based on this count was used in a
    complete-link clustering algorithm
  • 80 clusters (topics) were found for SC terms
  • 1617 clusters for DE terms
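The two steps above, counting term co-occurrences and clustering with complete link, might look roughly like this in miniature. The slide does not give the similarity formula or the stopping rule, so the sketch uses the raw co-occurrence count as the similarity and a hypothetical `threshold` as the stopping point; the term lists are invented.

```python
from collections import Counter
from itertools import combinations

def cooccurrence(records):
    """Count how often each pair of terms appears in the same record."""
    co = Counter()
    for terms in records:
        for a, b in combinations(sorted(set(terms)), 2):
            co[(a, b)] += 1
    return co

def similarity(a, b, co):
    """Assumed similarity: the raw co-occurrence count of the two terms."""
    return co[tuple(sorted((a, b)))]

def complete_link(terms, co, threshold):
    """Agglomerative clustering; cluster similarity is the weakest pairwise link."""
    clusters = [[t] for t in terms]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            sim = min(similarity(a, b, co)
                      for a in clusters[i] for b in clusters[j])
            if best is None or sim > best[0]:
                best = (sim, i, j)
        if best[0] < threshold:       # no pair of clusters is similar enough
            break
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical controlled terms attached to six records.
records = [
    ["soil", "crop"], ["soil", "crop"], ["soil", "crop", "pest"],
    ["fish", "water"], ["fish", "water"], ["pest", "crop"],
]
co = cooccurrence(records)
terms = sorted({t for rec in records for t in rec})
clusters = complete_link(terms, co, threshold=2)
print(clusters)
```

Because complete link scores a merge by its weakest pair, "pest" stays on its own here: it co-occurs with "crop" twice but with "soil" only once.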

15
Trend Type Labelling by Experts (1/2)
  • We sampled 50% of clusters from SC and 10% from
    DE for experts to judge their trend types
  • 6 professors, 2 researchers, 1 admin. manager
  • Trend types
  • ++ sharp increasing
  • + increasing
  • fluctuation
  • - decreasing
  • -- sharp decreasing
  • ? inconclusive

16
Trend Type Labelling by Experts (2/2)
  • Experts were advised to judge the type of each
    cluster based on their knowledge
  • If this did not help, the time series of the
    cluster could be consulted
  • If this did not help either, the documents in the
    cluster could be examined.
  • If all these efforts failed, the cluster was
    labeled inconclusive

17
Experts' feedback
Figure: experts' trend-type judgments (++ sharp increase,
+ increase, - decrease, -- sharp decrease, undecidable)
for controlled-term clusters from different fields.
Data are from 72500 documents in the safety
agriculture area.
18
Data set: Information retrieval (IR)
  • 853 papers from the first ACM SIGIR conference to
    the 25th were clustered with a commercial software
    package called Clustan Graphics (Smeaton et al.,
    ACM SIGIR Forum, 2003)
  • 29 non-overlapping clusters were generated
  • They then inspected each cluster manually and
    assigned a topic description to reflect the theme
    of the majority of the papers in each cluster
  • Topics are sorted approximately in order of a
    combination of the year of their first
    appearance, and the number of papers published

19
Clustering and ordering of SIGIR papers by topics
made by Smeaton
20
Hot topics predicted by Smeaton et al.
  • The ideal paper title expected by Smeaton et al.
    to appear in SIGIR 2003 is
  • "Evaluation of a Language Model Implementation of
    a Topic-Based, Cross-Lingual Question-Answering
    and Summarisation System"

21
Fourteen session titles (topics) in the SIGIR
2003 conference
22
Evaluation results: SA
Avg is the average of the values in the SC and
DE rows
23
Prediction effectiveness when year span varies
from 1, 2, to 5
$(x_1, x_2) \rightarrow \left(\frac{x_1 - \mathrm{avg}}{\mathrm{stderr}}, \frac{x_2 - \mathrm{avg}}{\mathrm{stderr}}\right) = \left(\frac{(x_1 - x_2)/2}{|x_1 - x_2|/2}, \frac{(-x_1 + x_2)/2}{|x_1 - x_2|/2}\right)$.
Thus only 3 values result: (-1, 1), (1, -1), and
(0, 0), which in turn yield only 3 possible
slopes: 2, -2, and 0.
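The collapse described here can be checked numerically. The sketch below aggregates ten yearly counts into spans of 1, 2, or 5 years and computes the slope of the z-score transformed sequence for each. The series are invented, the z-score uses the population standard deviation (the reading under which the slopes come out as 2 and -2), and a constant sequence is mapped to zeros by convention.

```python
import math

def aggregate(series, span):
    """Sum consecutive yearly counts into periods of `span` years."""
    return [sum(series[i:i + span]) for i in range(0, len(series), span)]

def zscore(d):
    """z-score with population standard deviation; constant input maps to zeros."""
    n = len(d)
    mean = sum(d) / n
    sd = math.sqrt(sum((y - mean) ** 2 for y in d) / n)
    return [0.0] * n if sd == 0 else [(y - mean) / sd for y in d]

def slope(d):
    """Least-squares slope of d against positions 1..n."""
    n = len(d)
    x_mean = (n + 1) / 2
    y_mean = sum(d) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(d, start=1))
    sxx = sum((x - x_mean) ** 2 for x in range(1, n + 1))
    return sxy / sxx

# Hypothetical ten-year series for four topics.
yearly = {
    "rising":  [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "soaring": [1, 1, 1, 1, 1, 50, 60, 70, 80, 90],
    "falling": [9, 9, 9, 9, 9, 1, 1, 1, 1, 2],
    "flat":    [4, 4, 4, 4, 4, 4, 4, 4, 4, 4],
}

slopes_by_span = {
    span: {name: round(slope(zscore(aggregate(s, span))), 6)
           for name, s in yearly.items()}
    for span in (1, 2, 5)
}
for span, slopes in slopes_by_span.items():
    print(span, slopes)
```

With a span of 5 years the ten-year window leaves only two points per topic, so "rising" and "soaring" receive the same score and can no longer be told apart by this metric.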
24
Prediction based on less data
Percentage of performance drop for slp using only
the first n years of data, where n = 10, 8, 6, 4,
and 2 (panels: Pre_at_R and NAP).
25
Evaluation results: IR
26
Conclusions
  • Which metrics (methods) perform best for ETD?
  • api: average percentage of increase
  • slp: slope of the linear regression line
  • eigen-trends
  • Smeaton's chronological ordering
  • Our answer is slp, because it performs well
    under
  • different year spans (1, 2, 5)
  • different observation durations (10 vs. 25 years)
  • different domains (SA vs. IR)
  • different collection scales (72500 vs. 853 papers)
  • api only works for n = 2 (so Noyons' work is still
    valid)

27
Conclusions
  • Our goal is to explore the best way to predict
    upward trends in an environment where a large
    number of topics are to be monitored.
  • If a good trend index is used, the inspection in
    the order sorted by the index should be efficient
  • Our work helps determine which metric is the
    best under a given condition.

28
Implications
  • The IR-based method for evaluating trend
    index performance suggests a relatively objective
    and repeatable procedure to identify better
    indices and to gather evidence to support (or
    invalidate) our current results.