Indexing and Data Mining in Multimedia Databases

Description:

Data mining & fractals. Road map. Motivation problems / case study. Definition of fractals and power laws. Solutions to posed problems. More examples. USC 2001 ...

Provided by: christosf

Transcript and Presenter's Notes

1
Indexing and Data Mining in Multimedia Databases
  • Christos Faloutsos
  • CMU
  • www.cs.cmu.edu/christos

2
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • New tools for Data Mining: Fractals
  • Conclusions
  • Resources

3
Problem
  • Given a large collection of (multimedia) records,
    find similar/interesting things, i.e.,
  • Allow fast, approximate queries, and
  • Find rules/patterns

4
Sample queries
  • Similarity search
  • Find pairs of branches with similar sales
    patterns
  • find medical cases similar to Smith's
  • Find pairs of sensor series that move in sync
  • Find shapes like a spark-plug

5
Sample queries contd
  • Rule discovery
  • Clusters (of branches; of sensor data; ...)
  • Forecasting (total sales for next year?)
  • Outliers (e.g., unexpected part failures; fraud
    detection)

6
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • New tools for Data Mining: Fractals
  • Conclusions
  • Related projects @ CMU and resources

7
Indexing - Multimedia
  • Problem
  • given a set of (multimedia) objects,
  • find the ones similar to a desirable query object

8
distance function by expert
9
GEMINI - Pictorially
[Figure: each sequence S1 ... Sn (one value per day, days 1 to 365) is mapped to a point F(S1) ... F(Sn) in feature space; example features: avg, std]
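The GEMINI idea pictured here (map each sequence to a few features, search in feature space, then verify against the raw data) can be sketched as below. This is a minimal illustration, not the authors' code: the feature map (avg, std) comes from the slide, while the sqrt(n) scaling, the names `features` and `range_query`, the synthetic data and the threshold `eps` are assumptions made for the example. A plain list scan stands in for the spatial access method that would index the feature points in practice.

```python
import math
import random

def features(seq):
    """Map a sequence to (avg, std), both scaled by sqrt(n).

    With the population std and this scaling, the Euclidean distance
    between two feature points never exceeds the Euclidean distance
    between the original sequences (lower-bounding property)."""
    n = len(seq)
    avg = sum(seq) / n
    std = math.sqrt(sum((v - avg) ** 2 for v in seq) / n)
    scale = math.sqrt(n)
    return (scale * avg, scale * std)

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def range_query(query, sequences, feats, eps):
    """Filter in feature space, then refine with the true distance."""
    fq = features(query)
    # Filter step: cheap test on 2-d feature points (may admit false alarms).
    candidates = [i for i, f in enumerate(feats) if dist(fq, f) <= eps]
    # Refine step: discard false alarms using the full sequences.
    return [i for i in candidates if dist(query, sequences[i]) <= eps]

# Hypothetical data: 15 yearly sequences around three different levels.
rng = random.Random(0)
sequences = [[rng.gauss(mu, 1.0) for _ in range(365)]
             for mu in (0, 5, 10) for _ in range(5)]
feats = [features(s) for s in sequences]

query = sequences[0]
eps = 30.0
hits = range_query(query, sequences, feats, eps)
print(hits)
```

Because the feature distance lower-bounds the true distance, the filter step can return false alarms but no false dismissals, so the answer matches a full sequential scan.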
10
Remaining issues
  • how to extract features automatically?
  • how to merge similarity scores from different
    media

11
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • Visualization: FastMap
  • Relevance feedback: FALCON
  • Data Mining / Fractals
  • Conclusions

12
FastMap
      O1   O2   O3   O4   O5
O1     0    1    1  100  100
O2     1    0    1  100  100
O3     1    1    0  100  100
O4   100  100  100    0    1
O5   100  100  100    1    0
??
13
FastMap
  • Multi-dimensional scaling (MDS) can do that, but
    in O(N^2) time
  • We want a linear algorithm: FastMap [SIGMOD95]
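A compact sketch of the FastMap projection, applied to the O1 ... O5 distance matrix from the previous slide. It follows the usual description of the method (pick two far-apart pivot objects, project every object onto the line through them with the cosine law, then repeat on the residual distances). The function name `fastmap`, the simple pivot heuristic and the choice k = 2 are assumptions for illustration.

```python
import math

def fastmap(dist, k):
    """Map n objects, given only their distance matrix, to k-d points."""
    n = len(dist)
    coords = [[0.0] * k for _ in range(n)]

    def d2(i, j, col):
        # Squared distance left over after the first `col` coordinates.
        r = dist[i][j] ** 2
        for c in range(col):
            r -= (coords[i][c] - coords[j][c]) ** 2
        return max(r, 0.0)

    for col in range(k):
        # Pivot heuristic: start at object 0, jump to the farthest object,
        # then to the object farthest from that one.
        a = 0
        b = max(range(n), key=lambda j: d2(a, j, col))
        a = max(range(n), key=lambda j: d2(b, j, col))
        dab2 = d2(a, b, col)
        if dab2 == 0.0:
            break  # all residual distances are zero; nothing left to map
        dab = math.sqrt(dab2)
        for i in range(n):
            # Cosine law: position of object i along the line (a, b).
            coords[i][col] = (d2(a, i, col) + dab2 - d2(b, i, col)) / (2 * dab)
    return coords

# Distance matrix for O1 ... O5 from the slide.
D = [
    [0, 1, 1, 100, 100],
    [1, 0, 1, 100, 100],
    [1, 1, 0, 100, 100],
    [100, 100, 100, 0, 1],
    [100, 100, 100, 1, 0],
]
coords = fastmap(D, k=2)
for name, point in zip(["O1", "O2", "O3", "O4", "O5"], coords):
    print(name, [round(v, 3) for v in point])
```

Each coordinate needs only the distances from every object to the two pivots, so for a fixed k the cost grows linearly with the number of objects. On this matrix the first coordinate already separates {O1, O2, O3} from {O4, O5}.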

14
Applications: time sequences
  • given n co-evolving time sequences
  • visualize them + find rules [ICDE00]

[Plot: exchange rate vs. time, for DEM, JPY, HKD]
15
Applications - financial
  • currency exchange rates [ICDE00]

[Plot: labeled points include FRF, GBP, JPY, HKD, USD(t), USD(t-5)]
16
Applications - financial
  • currency exchange rates [ICDE00]

[Plot: labeled points include USD(t), USD(t-5)]
17
Application: VideoTrails
  • [ACM MM97]

18
VideoTrails - usage
  • scene-cut detection (about 10 errors)
  • scene classification (e.g., dialogue vs. action)

19
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • Visualization: FastMap
  • Relevance feedback: FALCON
  • Data Mining / Fractals
  • Conclusions

20
Merging similarity scores
  • e.g., video: text, color, motion, audio
  • weights change with the query!
  • solution 1: user specifies weights
  • solution 2: user gives examples
  • and we learn what he/she wants: relevance feedback
    (Rocchio, MARS, MindReader)
  • but how about disjunctive queries?

21
FALCON
[Figure: stock series shaped like Vs and like inverted Vs]
Trader wants only unstable stocks
22
Single query point methods
[Figure: the single query point (x) computed by Rocchio]
23
Single query point methods
[Figure: the single query points (x) computed by Rocchio, MindReader and MARS]
The averaging effect in action...
24
Main idea: FALCON Contours
[Wu, VLDB2000]
[Figure: contours in the plane of feature1 (e.g., temperature) vs. feature2 (e.g., frequency)]
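To make the contrast with single-query-point methods concrete, the sketch below compares two ways of scoring a candidate against several "good" examples. It assumes the aggregate dissimilarity is a generalized mean of the distances with a negative exponent, which acts like a soft minimum (a fuzzy OR); that form, the value alpha = -5 and all names here are illustrative assumptions, not a statement of the exact FALCON formulation.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid_dissimilarity(x, good):
    """Single-query-point style: distance to the average good example."""
    dims = len(good[0])
    center = [sum(g[d] for g in good) / len(good) for d in range(dims)]
    return dist(x, center)

def aggregate_dissimilarity(x, good, alpha=-5.0):
    """Generalized mean of the distances to all good examples.

    With a negative alpha the smallest distance dominates, so a point
    close to ANY good example scores well (disjunctive query)."""
    ds = [dist(x, g) for g in good]
    if min(ds) == 0.0:
        return 0.0  # x coincides with a good example
    return (sum(d ** alpha for d in ds) / len(ds)) ** (1.0 / alpha)

# Two good examples far apart: the user wants "either this or that".
good = [(0.0, 0.0), (10.0, 10.0)]
near_a_good_example = (0.5, 0.5)
in_the_middle = (5.0, 5.0)

for name, p in [("near", near_a_good_example), ("middle", in_the_middle)]:
    print(name,
          round(centroid_dissimilarity(p, good), 3),
          round(aggregate_dissimilarity(p, good), 3))
```

The centroid score prefers the point in the middle, which resembles neither example (the averaging effect), while the aggregate score prefers the point next to one of the good examples.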
25
Conclusions for indexing + visualization
  • GEMINI: fast indexing, exploiting off-the-shelf
    SAMs
  • FastMap: automatic feature extraction in O(N)
    time
  • FALCON: relevance feedback for disjunctive queries

26
Outline
  • Goal: Find similar / interesting things
  • Problem - Applications
  • Indexing - similarity search
  • New tools for Data Mining: Fractals
  • Conclusions
  • Resources

27
Data mining & fractals - Road map
  • Motivation problems / case study
  • Definition of fractals and power laws
  • Solutions to posed problems
  • More examples

28
Problem 1 - spatial d.m.
  • Galaxies (Sloan Digital Sky Survey w/ B. Nichol)
  • - spiral and elliptical galaxies
  • (stores & households; mpg & MTBF ...)
  • - patterns? (not Gaussian; not uniform)
  • attraction/repulsion?
  • separability??

29
Problem 2: dim. reduction
  • given attributes x1, ... xn
  • possibly, non-linearly correlated
  • drop the useless ones
  • (Q: why?
  • A: to avoid the dimensionality curse)

30
Answer
  • Fractals / self-similarities / power laws

31
What is a fractal?
  • self-similar point set, e.g., Sierpinski
    triangle

zero area; infinite length!
...
32
Definitions (contd)
  • Paradox: infinite perimeter; zero area!
  • dimensionality: between 1 and 2
  • actually: log(3)/log(2) = 1.58 (long story)

33
Intrinsic (fractal) dimension
E.g., cylinders; miles / gallon
  • Q: fractal dimension of a line?

x y
5 1
4 2
3 3
2 4
34
Intrinsic (fractal) dimension
  • Q: fractal dimension of a line?
  • A: nn(<= r) ~ r^1
  • (power law: y = x^a)

35
Intrinsic (fractal) dimension
  • Q: fractal dimension of a line?
  • A: nn(<= r) ~ r^1
  • (power law: y = x^a)
  • Q: fd of a plane?
  • A: nn(<= r) ~ r^2
  • fd = slope of (log(nn) vs. log(r))
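The definition "fd = slope of log(nn) vs. log(r)" can be tried directly. The sketch below counts the pairs of points within distance r for a few radii and fits the slope in log-log scale, for points on a line, points in a plane, and points on a Sierpinski triangle generated with the chaos game. The sample size, the radii, the generator and the function names are assumptions chosen for the example; the quadratic pair count is for illustration only and would be replaced by a faster method on real data. It needs Python 3.8+ for `math.dist`.

```python
import bisect
import math
import random

def correlation_dimension(points, radii):
    """Slope of log(#pairs within r) vs. log(r), by least squares.

    Assumes every radius captures at least one pair."""
    dists = sorted(math.dist(p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])
    xs = [math.log(r) for r in radii]
    ys = [math.log(bisect.bisect_right(dists, r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

def sierpinski(n, rng):
    """Chaos game: repeatedly jump halfway towards a random corner."""
    corners = [(0.0, 0.0), (1.0, 0.0), (0.5, math.sqrt(3) / 2)]
    x, y = 0.3, 0.3
    pts = []
    for _ in range(n + 20):
        cx, cy = rng.choice(corners)
        x, y = (x + cx) / 2, (y + cy) / 2
        pts.append((x, y))
    return pts[20:]  # drop the first few points (burn-in)

rng = random.Random(0)
N = 1000
radii = [0.01, 0.02, 0.04, 0.08]

line = [(t, 1.0 - t) for t in (rng.random() for _ in range(N))]
plane = [(rng.random(), rng.random()) for _ in range(N)]
triangle = sierpinski(N, rng)

fd_line = correlation_dimension(line, radii)
fd_plane = correlation_dimension(plane, radii)
fd_sierpinski = correlation_dimension(triangle, radii)
print(round(fd_line, 2), round(fd_plane, 2), round(fd_sierpinski, 2))
```

The three estimates should come out near 1, near 2, and near log(3)/log(2), with small deviations caused by the finite sample and by points close to the border of each set.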

36
Sierpinski triangle
correlation integral
37
Road map
  • Motivation problems / case studies
  • Definition of fractals and power laws
  • Solutions to posed problems
  • More examples
  • Conclusions

38
Solution 1: spatial d.m.
  • Galaxies (Sloan Digital Sky Survey w/ B. Nichol -
    BOPS plot - SIGMOD 2000)
  • clusters?
  • separable?
  • attraction/repulsion?
  • data scrubbing: duplicates?

39
Solution 1: spatial d.m.
[Plot: log(#pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs]
  • 1.8 slope
  • plateau!
  • repulsion!
40
Solution 1: spatial d.m.
[w/ Seeger, Traina, Traina, SIGMOD00]
[Plot: log(#pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs]
  • 1.8 slope
  • plateau!
  • repulsion!
41
spatial d.m.
Heuristic on choosing # of clusters
42
Solution 1: spatial d.m.
[Plot: log(#pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs]
  • 1.8 slope
  • plateau!
  • repulsion!
43
Solution 1: spatial d.m.
[Plot: log(#pairs within <= r) vs. log(r), for ell-ell, spi-spi and spi-ell pairs; annotation: duplicates]
  • 1.8 slope
  • plateau!
  • repulsion!!
44
Problem 2 Dim. reduction
45
Solution
  • drop the attributes that don't increase the
    partial f.d. (PFD)
  • dfn: PFD of attribute set A is the f.d. of the
    projected cloud of points [w/ Traina, Traina, Wu,
    SBBD00]
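A small sketch of the rule "drop the attributes that don't increase the partial f.d.". The data set has three attributes: x, a non-linear function of x, and an independent attribute z. The greedy left-to-right pass, the threshold of 0.3, the radii and the pair-counting estimator are all assumptions made for illustration; they are not taken from the cited paper. It needs Python 3.8+ for `math.dist`.

```python
import bisect
import math
import random

def fractal_dim(points, radii=(0.02, 0.04, 0.08, 0.16)):
    """Slope of log(#pairs within r) vs. log(r), by least squares."""
    dists = sorted(math.dist(p, q)
                   for i, p in enumerate(points) for q in points[i + 1:])
    xs = [math.log(r) for r in radii]
    ys = [math.log(bisect.bisect_right(dists, r)) for r in radii]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return num / sum((x - mx) ** 2 for x in xs)

def project(rows, attrs):
    """Keep only the listed attribute positions of every row."""
    return [tuple(row[a] for a in attrs) for row in rows]

def select_attributes(rows, n_attrs, threshold=0.3):
    """Keep an attribute only if it raises the PFD of the kept set."""
    kept, current = [], 0.0
    for a in range(n_attrs):
        pfd = fractal_dim(project(rows, kept + [a]))
        if pfd - current > threshold:
            kept.append(a)
            current = pfd
    return kept

rng = random.Random(0)
rows = []
for _ in range(800):
    x, z = rng.random(), rng.random()
    rows.append((x, x * x, z))  # attribute 1 is fully determined by attribute 0

kept = select_attributes(rows, n_attrs=3)
print(kept)
```

Attribute 1 adds nothing to the fractal dimension because the points (x, x*x) still lie on a curve, so it is dropped even though the correlation is non-linear; the independent attribute z raises the dimension from about 1 to about 2 and is kept.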

46
Problem 2 dim. reduction
[Figure: point cloud with global FD = 1; projections labeled PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
47
Problem 2 dim. reduction
[Figure: point cloud with global FD = 1; projections labeled PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
Notice: max variance would fail here
48
Problem 2 dim. reduction
[Figure: point cloud with global FD = 1; projections labeled PFD = 1, PFD = 1, PFD = 0, PFD = 1, PFD = 1]
Notice: SVD would fail here
49
Road map
  • Motivation problems / case studies
  • Definition of fractals and power laws
  • Solutions to posed problems
  • More examples
  • fractals
  • power laws
  • Conclusions

50
disk traffic
  • Not Poisson, not(?) iid - BUT self-similar
  • How to model it?

51
traffic
  • disk traces (80-20 law; multifractal) [ICDE02]

[Plot: bytes vs. time]
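One common way to turn an 80-20 law into a generator of bursty traffic is a recursive split: give 80% of the volume to one half of the time interval and 20% to the other, then repeat inside each half. The sketch below assumes that this is the kind of model the slide refers to; the function name `b_model`, the bias of 0.8, the total volume and the number of levels are illustrative choices.

```python
import random

def b_model(total, levels, bias=0.8, rng=None):
    """Recursively split `total` over 2**levels time slots.

    At every level, one half of each interval (chosen at random)
    receives `bias` of that interval's volume, the other half the rest."""
    rng = rng or random.Random(0)
    trace = [float(total)]
    for _ in range(levels):
        nxt = []
        for volume in trace:
            big, small = volume * bias, volume * (1.0 - bias)
            nxt.extend((big, small) if rng.random() < 0.5 else (small, big))
        trace = nxt
    return trace

trace = b_model(total=1_000_000, levels=10)
print(len(trace), round(sum(trace)), round(max(trace)))
```

With bias = 0.5 every slot receives the same volume (no burstiness); the closer the bias gets to 1, the more the volume concentrates in a few slots.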
52
Traffic
  • Many other time-sequences are bursty/clustered
    (such as?)

53
Tape accesses
# tapes needed, to retrieve n records? (# days
down, due to failures / hurricanes /
communication noise...)
54
Tape accesses
[Plot: # tapes retrieved vs. # qual. records; curves labeled "50-50 (Poisson)" and "real"]
55
More apps Brain scans
  • Oct-trees brain-scans

56
GIS points
  • Cross-roads of Montgomery county
  • any rules?

57
GIS
  • A: self-similarity
  • intrinsic dim. = 1.51
  • avg #neighbors(<= r) ~ r^D

[Plot: log(#pairs within <= r) vs. log(r)]
58
Examples: LB county
  • Long Beach county of CA (road end-points)

59
More fractals
  • cardiovascular system: 3 (!)
  • stock prices (LYCOS) - random walks: 1.5
  • Coastlines: 1.2-1.58 (?)

60
(No Transcript)
61
Road map
  • Motivation problems / case studies
  • Definition of fractals and power laws
  • Solutions to posed problems
  • More examples
  • fractals
  • power laws
  • Conclusions

62
Fractals <-> Power laws
  • self-similarity ->
  • <=> fractals
  • <=> scale-free
  • <=> power-laws (y = x^a, F = C*r^(-2))

[Plot: log(#pairs within <= r) vs. log(r); slope 1.58]
63
Zipf's law
[Plot: Bible rank-frequency plot, log(freq) vs. log(rank), in log-log scales; the top-ranked words include "the" and "and"]
Zipf's (first) Law
64
Zipf's law
  • similarly for first names (slope -1)
  • last names (-0.7)
  • etc
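The slopes quoted for such rank-frequency plots can be measured with a straight-line fit in log-log scale. The sketch below does this for a synthetic vocabulary whose k-th most frequent word occurs about 1200/k times, so the expected slope is close to -1; the corpus, the word names and the function name are made up for the example.

```python
import math
from collections import Counter

def rank_frequency_slope(tokens):
    """Least-squares slope of log(frequency) vs. log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(f) for f in freqs]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# Synthetic corpus: the word of rank k appears round(1200 / k) times.
tokens = []
for k in range(1, 51):
    tokens.extend([f"word{k}"] * round(1200 / k))

slope = rank_frequency_slope(tokens)
print(round(slope, 2))
```

On real text the same function can be fed a list of words, for example the result of `text.lower().split()`.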

65
More power laws
  • Energy of earthquakes (Gutenberg-Richter law)
    [simscience.org]

[Plots; axis labels: log(count), amplitude, magnitude, day]
66
Clickstream data
<url, u-id, ....>

67
Lotka's law
  • library science (Lotka's law of publication
    count) and citation counts (citeseer.nj.nec.com
    6/2001)

[Plot: log(count) vs. log(citations); labeled point: J. Ullman]
68
Korcak's law
[Plot: log(count(> area)) vs. log(area); Scandinavian lakes, area vs. complementary cumulative count (log-log axes)]
69
More power laws: Korcak
[Plot: log(count(> area)) vs. log(area); Japan islands, area vs. cumulative count (log-log axes)]
70
(Korcak's law: Aegean islands)
71
Olympic medals
[Plot: log(# medals) vs. log rank; labeled points: Russia, China, USA]
72
SALES data - store #96
[Plot: count of products vs. # units sold]
73
TELCO data
[Plot: count of customers vs. # of service units]
74
More power laws on the Internet
[Plot: log(degree) vs. log(rank); degree vs. rank, for Internet domains (log-log)] [SIGCOMM99]
75
Even more power laws
  • Income distribution (Pareto's law)
  • duration of UNIX jobs [Harchol-Balter]
  • Distribution of UNIX file sizes
  • Web graph [CLEVER-IBM; Barabasi]

76
Overall Conclusions
  • Find similar/interesting things in multimedia
    databases
  • Indexing: feature extraction (GEMINI)
  • automatic feature extraction: FastMap
  • Relevance feedback: FALCON

77
Conclusions - contd
  • New tools for Data Mining: Fractals/power laws
  • appear everywhere
  • lead to skewed distributions (not Gaussian, Poisson,
    uniformity, independence)
  • correlation integral for separability/cluster
    detection
  • PFD for dimensionality reduction

78
Resources
  • Software and papers
  • www.cs.cmu.edu/christos
  • Fractal dimension (FracDim)
  • Separability (sigmod 2000, kdd2001)
  • Relevance feedback for query by content (FALCON
    vldb 2000)

79
Resources
  • Manfred Schroeder: Chaos, Fractals and Power Laws