Title: A Local Approach to Large-Scale Distributed Data Mining
1. A Local Approach to Large-Scale Distributed Data Mining
- Assaf Schuster
- CS, Technion - Israel Institute of Technology
Joint work with Ran Wolff, Amir Bar-Or, Denis Krivitsky, Bobi Gilburd, Tsachi Scharfman, Arik Friedman, Liran Liss, Mirit Shalem, Daniel Keren, Tsachi Birk
2. Large-Scale Distributed Databases
- Sensor networks, peer-to-peer systems, and grid systems produce and store massive databases
- Often, data is secondary to knowledge
- Modeling (summary), patterns (tracking), and decision making (information retrieval)
- A decentralized system is essential for
- Cost (Skype: 0.001 per user)
- Power (WSN battery depletion)
- Ownership/Privacy/Anonymity
A P2P Database (Jan 2004): 60M users, 5M simultaneously connected, 45M downloads/month, 900M shared files
3. Data Mining Applications for P2P
- The current technology trend allows customers to collect huge amounts of data
- e-economy, cheap storage, high-speed connectivity
- P2P technology enables customers to share data
[Diagram: customer data mining; mirroring corporates'; recommendations (unbiased); e-Mule; Product Lifetime Cost (as opposed to CLV)]
4. Data Mining a P2P Database
- Impossible to collect the data
- Privacy, size, processing power
- ⇒ Distributed algorithms
- ⇒ In-network processing: push processing to the peers
- Internet scale
- ⇒ NO global operators, NO synchronization
- ⇒ NO global communication
- Ever-changing data and system
- Failures, crashes, joins, departures
- Data modified faster than it is propagated
- ⇒ Incremental algorithms
- ⇒ Ad-hoc, anytime results
5. How Not to Data Mine P2P
- Many data mining algorithms use decomposable statistics (Avg, Var, cross-tables, etc.)
- Global statistics can be calculated using sum reduction... or can they? (see the sketch after this list)
- Synchronization
- Bandwidth requirements
- Failures
- Consistency
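To make the decomposability point concrete, here is a minimal Python sketch, with illustrative names not taken from the talk, of how Avg and Var decompose into per-peer summaries (count, sum, sum of squares) that combine by plain sum reduction; the caveats listed above are exactly what breaks this naive scheme at Internet scale.

```python
# Minimal sketch: decomposable statistics via sum reduction.
# Each peer computes a local summary; summaries merge by element-wise
# addition, and Avg/Var are recovered at the end. Names are illustrative.

def local_summary(values):
    """Per-peer summary: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """The sum-reduction step: combine two summaries."""
    return tuple(x + y for x, y in zip(a, b))

def finalize(summary):
    """Recover the global Avg and Var from the merged summary."""
    n, s, sq = summary
    avg = s / n
    return avg, sq / n - avg * avg

# Two peers' data reduced to the global statistics (avg=3.0, var=2.0):
print(finalize(merge(local_summary([1, 2, 3]), local_summary([4, 5]))))
```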
- Literature: old-style parallelism
- rigid pattern, not data-adaptive
- no failure tolerance
- limited scalability
- one-time computation
6. An Alternative Approach to Data Mining in P2P
- A data mining process that behaves like a network protocol (e.g., routing)
- No termination: ad-hoc result
- No synchronization: speculative progress
- Scalability: 10M network nodes
- The outcome constantly feeds into
- The system
- User interface (personal portal)
7. An (Immature) Approach: Gossip, Sampling
- Inaccurate
- Hard to decompose
- Hard to employ in iterative data mining algorithms
- Assumes global knowledge (N, mixing times, ...)
- May work eventually
- Work in progress
- Gossip-Based Computation of Aggregate Information. Kempe, Dobra, Gehrke. FOCS'03.
- Towards Data Mining in Large and Fully Distributed Peer-to-Peer Overlay Networks. Kowalczyk, Jelasity, Eiben. BNAIC'03.
- Gossip and Mixing Times of Random Walks on Random Graphs. Boyd, Ghosh, Prabhakar, Shah.
8. A Successful Approach: Local Algorithms
- Every peer's result depends on the data gathered from a (small) environment of peers
- The size of the environment may depend on the problem at hand
- Eventual correctness guaranteed
9. Local Algorithms
- ✓ Scalability
- ✓ Robustness
- ✓ Incrementality
- ✓ Energy efficiency
- ✓ Asynchrony
10. Local Algorithms
- Performance independent of system size
- Examples: Coloring, MST, Persistent Bit [Awerbuch, Bar-Noy, Kuhn, Kutten, Linial, Moscibroda, Naor, Patt-Shamir, Peleg, Stockmeyer, Wattenhofer]
- Generalized (weighted) majority voting: decide whether Σ_{p∈peers} s(p) / Σ_{p∈peers} votes(p) > λ, for 1 > λ > 0 (see the sketch after this list)
- Locality depends on the significance of the vote
- in a tie, all votes must be counted
- For many applications a tie is not important: inconclusive voting
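A minimal sketch of the weighted-majority test above, assuming each peer's gathered knowledge is a collection of (s, votes) pairs; the function name and data layout are illustrative, not the paper's notation.

```python
# Sketch: the generalized majority test sum(s) / sum(votes) > lambda.
# `knowledge` holds the (s, votes) pairs this peer has gathered from
# its environment; names are illustrative.

def majority_exceeds(knowledge, lam):
    """True iff the fraction of supporting votes exceeds lam (0 < lam < 1);
    None while no votes have been gathered (inconclusive)."""
    total_s = sum(s for s, _ in knowledge)
    total_votes = sum(v for _, v in knowledge)
    if total_votes == 0:
        return None           # no evidence yet
    return total_s / total_votes > lam

# Three peers report (support, votes); 110/300 < 0.5, so the vote fails:
print(majority_exceeds([(60, 100), (30, 100), (20, 100)], 0.5))
```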
11. Correctness vs. Locality
- Information propagation ⇒ correctness
- No propagation ⇒ locality
- Rule of thumb: propagate when
- the neighbor needs to be persuaded, or
- a previous message to the neighbor turns out to be potentially misleading
(a sketch follows the figure below)
[Figure: example network of peers W, X, Y, Z]
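A hedged sketch of the rule of thumb above: a peer re-contacts a neighbor only when the conclusion implied by what it last sent no longer matches its current knowledge. The state names are illustrative simplifications, not the exact protocol of the paper.

```python
# Sketch: propagate only when the neighbor needs persuading or the
# previous message has become potentially misleading. All names are
# illustrative simplifications of the real protocol state.

def above(s, votes, lam):
    return votes > 0 and s / votes > lam

def should_send(local_s, local_votes, sent_s, sent_votes, lam):
    """Send an update iff the majority conclusion implied by our current
    knowledge (local_*) differs from the one implied by what we last
    told the neighbor (sent_*); otherwise stay silent, preserving locality."""
    return above(local_s, local_votes, lam) != above(sent_s, sent_votes, lam)
```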
12. Local Majority Voting
[Figure: majority-voting example over peers W, X, Y, Z]
[Wolff & Schuster, ICDM'03]
13. (No transcript)
14. Locality
- 1,600 nodes
- All initiated at once
- Local DB of 10K transactions
- Lock-step execution
- Run until there are no further messages
15. Dynamic Behavior: 1M Peers
- 1% noise
- At every simulator step
- 48% set input bits
- 0.1% noise
16. Local Majority Voting Variations: General Graphs
- [Birk, Liss, Schuster, Wolff, DiSC'04]
17. Local Majority Voting Variations: Private Votes
- K-privacy
- Oblivious counters
- Homomorphic encryption
- [Gilburd, Schuster, Wolff: CCGrid'04, HPDC'04, KDD'04]
18. A Decomposition Methodology
- Decompose data mining process into primitives
- Primitives are simpler
- Find local distributed algorithms for primitives
- Efficient in the common case
- Recompose data mining process from primitives
- Maintain correctness
- Maintain locality
- Maintain asynchrony
- Results: Association Rules [ICDM'03], Hill Climbing [DCOSS'05], Facility Location [submitted], Hierarchical Decision Trees [SDM'04], K-anonymity [submitted], Approximate Decision Trees [in writing]
19. Local Majority and Associations
- X is frequent if more than MinFreq of the transactions contain X (same for Y)
- X ⇒ Y is confident if more than MinConf of the transactions that contain X also contain Y
(see the sketch below)
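A minimal sketch of these two definitions over a single peer's local transaction database, representing transactions and itemsets as Python frozensets; the helper names are illustrative.

```python
# Sketch: frequency and confidence over a local transaction database.
# Transactions and itemsets are frozensets; names are illustrative.

def support(db, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def is_frequent(db, itemset, min_freq):
    return support(db, itemset) > min_freq

def is_confident(db, x, y, min_conf):
    """X => Y is confident iff support(X ∪ Y) / support(X) > MinConf."""
    sx = support(db, x)
    return sx > 0 and support(db, x | y) / sx > min_conf

db = [frozenset({"bread", "milk"}), frozenset({"bread"}), frozenset({"milk"})]
print(is_confident(db, frozenset({"bread"}), frozenset({"milk"}), 0.4))  # 0.5 > 0.4
```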
20. Local Majority and Associations
[Figure: example item "Apples" at 80% frequency]
21. Local Majority and Associations
- Find that X is frequent
- and that X ∪ Y is frequent
22. Local Majority and Associations
- Find that X is frequent
- and that X ∪ Y is frequent
- Then compare the frequencies of X ∪ Y and X
- Three votings! (see the sketch below)
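Note that the third comparison is itself a weighted majority: only transactions containing X vote, and each votes "yes" iff it also contains Y. A sketch under the same frozenset representation as above (names illustrative):

```python
# Sketch: the three votes behind rule X => Y, over one peer's local db.
# Vote 3 is a weighted majority in which only transactions containing X vote.

def rule_holds(db, x, y, min_freq, min_conf):
    votes = [1 if y <= t else 0 for t in db if x <= t]  # voters: t ⊇ X
    freq_x = len(votes) / len(db)
    freq_xy = sum(votes) / len(db)
    return (freq_x > min_freq                        # vote 1: X frequent
            and freq_xy > min_freq                   # vote 2: X ∪ Y frequent
            and sum(votes) > min_conf * len(votes))  # vote 3: confidence
```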
23. Activation Flow in Distributed Association Rule Mining
24. Avoiding Synchronization: The Iterative Scheme
25. Avoiding Synchronization: Speculation
26. The Big Picture
27. Associations: Performance
- By the time the database is scanned once, in parallel,
- the average peer has discovered 95% of the rules
- and has fewer than 10% false rules
28. Privacy Preserving: Performance
29. The Facility Location Problem
- A large network of motion sensors logs data
- A second tier of high-powered relays assists in data collection
- Both resources are limited
- Which relays to use?
30. Facility Location Problem
- Choose which locations to activate
- Minimizing sum of costs
- Distance based
- Data based
- Dynamic
31. Facility Location Problem: Cross-Domain Clustering
- e-Mule users each have a music database
- Search can improve if users are grouped by their preferences: clustering
- It is natural to talk of preference in terms of a music ontology: Country and Soul, Rock and 80's, Classics and Jazz, etc.
- The ontology provides important background knowledge
32. A Closer Look at the Facility Location Problem
- Choose representations
- Minimizing the sum of costs
- Discrepancy-based
- Private
- Dynamic
33. Facility Location Problem
- An old problem [Kuehn & Hamburger '63]; many variants
- Shown NP-hard [Kleinberg, Papadimitriou, Raghavan '98]
- A hill-climbing heuristic gives a constant-factor approximation [Arya, Garg, Khandekar, Munagala '01]
- Given a configuration of the facilities,
- choose the single step that minimizes the cost
34. Facility Location Problem
- Cost of a configuration:
- cost of activating facilities
- distances from points to the nearest facility
- Cost(C) = Σ_{fac∈C} activate-cost(fac) + Σ_{p∈points} min_{fac∈C} dist(p, fac)
- The distance term is computed locally by each sensor/peer (see the sketch below)
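A direct transcription of the formula, assuming points and facilities are coordinate tuples and activation costs live in a lookup table; math.dist (Euclidean distance, Python 3.8+) and the container choices are illustrative assumptions.

```python
import math

# Sketch: Cost(C) = activation costs + distances to the nearest facility.
# `config` is the set of active facilities, `points` the data points,
# `activate_cost` a facility -> cost table; all names are illustrative.

def cost(config, points, activate_cost):
    open_cost = sum(activate_cost[f] for f in config)
    conn_cost = sum(min(math.dist(p, f) for f in config) for p in points)
    return open_cost + conn_cost

# Each peer evaluates only the conn_cost term over its own points; the
# activation term depends only on the shared configuration.
```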
35. Climbing the Hill
- Cost(C) - Cost(C') = Σ_{p∈points} min_{fac∈C} d(p, fac) - Σ_{p∈points} min_{fac∈C'} d(p, fac)
  = Σ_{p∈points} [ min_{fac∈C} d(p, fac) - min_{fac∈C'} d(p, fac) ]
  = Σ_{p∈points} Δ_p(C, C')
- To choose between C and C', only the sign is needed
- Reducible to ArgMin
- ArgMin is reducible to majority vote ✓
- Each Δ_p(C, C') is computed locally by a sensor/peer (see the sketch below)
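A sketch of one peer's local contribution Σ_p Δ_p(C, C') over its own points, ignoring activation-cost differences for brevity; globally only the sign of the summed deltas is needed, which is exactly a majority-style threshold test.

```python
import math

# Sketch: this peer's share of Cost(C) - Cost(C'). Only the sign of the
# network-wide sum of these deltas matters when choosing between C and C'.

def local_delta(points, c_old, c_new):
    return sum(min(math.dist(p, f) for f in c_old)
               - min(math.dist(p, f) for f in c_new)
               for p in points)

# sum over peers of local_delta(...) > 0  <=>  C' improves on C, so the
# decision reduces to a sign test, i.e., a majority-vote-style threshold.
```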
36. Climbing the Hill
- We want to find the best next configuration in Next(C)
- For all pairs Ci, Cj in Next(C), check whether Cost(Ci) - Cost(Cj) < 0
- The cost of the best step wins them all
37. How Many Majority Votes per Hill Step?
M locations, K facilities:
- Full order: O(M²K²)
- Pivoting: O(MK)
- ArgMin: MK - 1
38. The Big Picture
- All participants begin with a predefined configuration
- Each develops the next possible steps
- Use ArgMin to compute the best step
- Continue speculatively
- Eventual convergence guaranteed (see the sketch below)
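A centralized sketch of the hill-climbing skeleton that the protocol distributes; in the in-network version, the minimization over Next(C) is what the ArgMin primitive computes, speculatively and without synchronization. Names are illustrative.

```python
# Sketch: hill climbing over configurations. `next_configs(C)` yields
# Next(C); `cost_fn` evaluates Cost(C). In the distributed protocol the
# min below is replaced by the local ArgMin primitive.

def hill_climb(initial, next_configs, cost_fn):
    current = initial
    while True:
        best = min(next_configs(current), key=cost_fn, default=None)
        if best is None or cost_fn(best) >= cost_fn(current):
            return current        # no improving step: local optimum
        current = best            # the best step "wins them all"
```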
39. Experimental Data
- 10 Gaussian distributions
- Random mean
- Variance 1
- 20% random noise
40. Convergence, Noise
- Internet topology (BRITE)
- 512 nodes
- 1K points per node
- 5 nodes selected randomly, 10 points modified in each
- every 5 simulator steps (35 times the average edge delay)
41. Message Overhead, Scalability
[Plots: de Bruijn topology; BRITE Internet topology]
42. Database Size
43. Locality
[Plot: instances]
44. Robustness (Different Instances)
45. Dynamic Experiments
- Switch databases among sensors
- at the average edge delay
- 98% accuracy retained
46. Static Experiments: Messages
- Scalable
- Topology robust
- On average: a single message per majority vote per participant
47. Static Experiments: Locality
- Percentiles of avg. environment size: 80, 90, 95
- Topology sensitive
- 4-6 peers' data
48. Conclusions
- Complex data mining algorithms can be implemented in-network
- Using local algorithms as building blocks assures
- superb scalability
- modest bandwidth demands
- rapid convergence
- energy efficiency
- robustness to stationary changes in the data
49. THE END!
50. A Successful Approach: Local Algorithms
- The complexity of computing the result does not directly depend on the number of participants.
- Each participant computes the result using information gathered from just a few nearby neighbors: its environment.
- Environment size may depend on the problem.
- Eventual correctness guaranteed.
51. Local Algorithms
- Locality
- ✓ Scalability
- ✓ Robustness
- ✓ Incrementality
- ✓ Asynchrony
52. Experimental Validation
- Large network of sensors
- Thousands
- Topologies: Grid, de Bruijn, Internet
- Synthetic data
- Static and dynamic (stationary)
53. Local Majority Voting: Locality