A Local Approach to Large-Scale Distributed Data Mining

Transcript and Presenter's Notes
1
A Local Approach to Large-Scale Distributed
Data Mining
  • Assaf Schuster
  • CS, Technion - Israel Institute of Technology

Joint work with Ran Wolff, Amir Bar-Or, Denis
Krivitsky, Bobi Gilburd, Tsachi Scharfman, Arik
Friedman, Liran Liss, Mirit Shalem, Daniel Keren,
Tsachi Birk
2
Large-Scale Distributed Databases
  • Sensor networks, peer-to-peer systems, and grid
    systems produce and store massive databases
  • Often, data is secondary to knowledge
  • Modeling (summary), patterns (tracking), and
    decision making (information retrieval)
  • A decentralized system is essential for
  • Cost (Skype: $0.001 per user)
  • Power (WSN battery depletion)
  • Ownership/Privacy/Anonymity

A P2P database (Jan 2004): 60M users, 5M
simultaneously connected, 45M downloads/month, 900M
shared files
3
Data Mining Applications for P2P
  • Current technology trends allow customers to
    collect huge amounts of data
  • e-economy, cheap storage, high-speed connectivity
  • P2P technology enables customers to share data

Customer data mining, mirroring the corporates':
  • Recommendations (unbiased), e.g., on e-Mule
  • Product Lifetime Cost (as opposed to CLV)
4
Data Mining a P2P Database
  • Impossible to collect the data
  • Privacy, Size, Processing power
  • ⇒ Distributed algorithms
  • ⇒ In-network processing: push processing to peers
  • Internet scale
  • ⇒ NO global operators, NO synchronization
  • ⇒ NO global communication
  • Ever-changing data and system
  • Failures, crashes, joins, departures
  • Data modified faster than propagated
  • ⇒ Incremental algorithms
  • ⇒ Ad-hoc, anytime results

5
How Not to Data Mine P2P
  • Many data mining algorithms use decomposable
    statistics (Avg, Var, cross-tables, etc.); see the
    sketch at the end of this slide
  • Global statistics can be calculated using sum
    reduction... or can they?
  • Synchronization
  • Bandwidth requirements
  • Failures
  • Consistency
  • Literature: old-style parallelism
  • rigid pattern, not data-adaptive
  • no failure tolerance
  • limited scalability
  • one-time computation
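
The decomposable-statistics point is easy to make concrete. A minimal
Python sketch (names and data are illustrative, not from the talk): a
global average decomposes into per-peer (sum, count) pairs that any
sum-reduction tree can merge.

    # Each peer condenses its local database into a (sum, count) pair;
    # combining two partial summaries is plain addition (sum reduction).
    def local_summary(values):
        return (sum(values), len(values))

    def merge(a, b):
        return (a[0] + b[0], a[1] + b[1])

    peers = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]  # three peers' local data
    total, count = 0.0, 0
    for p in peers:
        total, count = merge((total, count), local_summary(p))
    print(total / count)  # global average: 3.5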

6
An Alternative Approach to Data Mining in P2P
  • A data mining process that behaves like a network
    protocol (e.g., routing)
  • No termination: ad-hoc result
  • No synchronization: speculative progress
  • Scalability: 10M network nodes
  • The outcome constantly feeds into
  • The system
  • User Interface (personal portal)

7
An (Immature) Approach: Gossip, Sampling
  • Inaccurate
  • Hard to decompose
  • Hard to employ in iterative data mining algorithms
  • Assume global knowledge (N, mixing times, ...)
  • May work eventually
  • Work in progress
  • Gossip-Based Computation of Aggregate
    Information. Kempe, Dobra, Gehrke. FOCS '03.
  • Towards Data Mining in Large and Fully
    Distributed Peer-to-Peer Overlay Networks.
    Kowalczyk, Jelasity, Eiben. BNAIC '03.
  • Gossip and Mixing Times of Random Walks on
    Random Graphs. Boyd, Ghosh, Prabhakar, Shah.

8
A Successful Approach: Local Algorithms
  • Every peer's result depends on the data gathered
    from a (small) environment of peers
  • Size of environment may depend on the problem at
    hand
  • Eventual correctness guaranteed

9
Local Algorithms
  • ✓ Scalability
  • ✓ Robustness
  • ✓ Incrementality
  • ✓ Energy efficiency
  • ✓ Asynchrony

10
Local Algorithms
  • Performance independent of system size
  • Examples: Coloring, MST, Persistent Bit
    [Awerbuch, Bar-Noy, Kuhn, Kutten, Linial,
    Moscibroda, Naor, Patt-Shamir, Peleg, Stockmeyer,
    Wattenhofer]
  • Generalized (weighted) majority voting:
    Σ_p s(p) / Σ_p votes(p) > λ  (1 > λ > 0, p ∈ peers)
    (see the sketch below)
  • Depends on the significance of the vote
  • in a tie, all votes must be counted
  • For many applications a tie is not important:
    inconclusive voting
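
A minimal sketch of the voting predicate (counts and the threshold are
illustrative, not from the talk):

    # The vote passes iff  sum_p s(p) / sum_p votes(p) > lambda,
    # where s(p) is peer p's local support, votes(p) its local total,
    # and 0 < lambda < 1.
    def majority_holds(s, votes, lam):
        return sum(s.values()) > lam * sum(votes.values())

    s     = {"p1": 60, "p2": 10, "p3": 40}     # local "yes" counts
    votes = {"p1": 100, "p2": 100, "p3": 100}  # local totals
    print(majority_holds(s, votes, 0.5))       # False: 110/300 < 0.5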

11
Correctness vs. Locality
  • Information propagation ⇒ correctness
  • No propagation ⇒ locality
  • Rule of thumb: propagate when
  • the neighbor needs to be persuaded, or
  • a previous message to the neighbor turns out to be
    potentially misleading (see the sketch below)
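
A hedged sketch of this rule (NOT the exact protocol of Wolff & Schuster,
ICDM '03; the threshold and condition are assumptions for illustration):
peer u re-contacts neighbor v only when u's current counts could change
v's view of the vote's outcome.

    LAMBDA = 0.5  # majority threshold, 0 < LAMBDA < 1 (assumed value)

    def excess(yes, total):
        # Positive iff the weighted majority holds on these counts.
        return yes - LAMBDA * total

    def should_send(current, last_sent):
        # current, last_sent: (yes, total) pairs; what u knows now vs.
        # what u last reported to v.
        cur, old = excess(*current), excess(*last_sent)
        # (1) the neighbor needs persuading: the outcome u sees flipped;
        # (2) the last message is now potentially misleading: the margin
        #     has shrunk toward the threshold.
        return (cur > 0) != (old > 0) or abs(cur) < abs(old)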

12
Local Majority Voting
[Wolff & Schuster, ICDM '03]
14
Locality
  • 1,600 nodes
  • All initiated at once
  • Local DB of 10K transactions
  • Lock-step execution
  • Run until there are no further messages
15
Dynamic Behavior: 1M Peers
  • 1% noise
  • At every simulator step

[Plot labels: 48% set input bits; 0.1% noise]
16
Local Majority Voting Variations: General Graphs
  • [Birk, Liss, Schuster, Wolff, DISC '04]

17
Local Majority Voting Variations: Private Votes
  • K-Privacy
  • Oblivious Counters
  • Homomorphic encryption
  • [Gilburd, Schuster, Wolff: CCGrid '04, HPDC '04,
    KDD '04]

18
A Decomposition Methodology
  • Decompose data mining process into primitives
  • Primitives are simpler
  • Find local distributed algorithms for primitives
  • Efficient in the common case
  • Recompose data mining process from primitives
  • Maintain correctness
  • Maintain locality
  • Maintain asynchrony
  • Results: Association Rules [ICDM '03], Hill
    Climbing [DCOSS '05], Facility Location
    [submitted], Hierarchical Decision Trees
    [SDM '04], K-anonymity [submitted], Approximate
    Decision Trees [in writing]

19
Local Majority and Associations
  • An itemset X is frequent if more than MinFreq of the
    transactions contain X (same for Y)
  • A rule X ⇒ Y is confident if more than MinConf of the
    transactions that contain X also contain Y
    (see the sketch below)
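
Both definitions are threshold tests over counts, so each reduces to a
(weighted) majority vote across the peers' local databases. A sketch with
illustrative thresholds and counts:

    MIN_FREQ, MIN_CONF = 0.5, 0.7  # assumed thresholds

    def is_frequent(count_X, n_transactions):
        return count_X > MIN_FREQ * n_transactions

    def is_confident(count_XY, count_X):
        # X => Y: most transactions containing X also contain Y.
        return count_XY > MIN_CONF * count_X

    print(is_frequent(80, 100))  # True: 80% of transactions contain X
    print(is_confident(75, 80))  # True: 75/80 ~ 94% confidence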

20
Local Majority and Associations
  • Find that {Apples} is frequent

Apples: 80%
21
Local Majority and Associations
  • Find that {Apples} is frequent
  • And that {Bananas} is frequent
  • Bananas: 80%

22
Local Majority and Associations
  • Find that {Apples} is frequent
  • And that {Bananas} is frequent
  • Then compare the frequencies of {Apples, Bananas}
    and {Apples}
  • Three votings!
  • Apples ∧ Bananas: 75%

23
Activation Flow in Distributed Association Rule Mining
24
Avoiding Synchronization: The Iterative Scheme
25
Avoiding Synchronization: Speculation
26
The Big Picture
27
Associations: Performance
  • By the time the database has been scanned once, in
    parallel,
  • the average peer has discovered 95% of the
    rules
  • and has less than 10% false rules.

28
Privacy Preserving: Performance
29
The Facility Location Problem
  • A large network of motion sensors logs data
  • A second tier of high-powered relays assists in
    data collection
  • Both resources are limited
  • Which relays should be used?

30
Facility Location Problem
  • Choose which locations to activate
  • Minimizing the sum of costs
  • Distance-based
  • Data-based
  • Dynamic

31
Facility Location Problem: Cross-Domain Clustering
  • e-Mule users, each with a music database
  • Search can improve if users are grouped by their
    preferences: clustering
  • It is natural to talk of preference in terms of a
    music ontology: Country and Soul, Rock and 80's,
    Classics and Jazz, etc.
  • Ontology provides important background knowledge

32
A Closer Look at the Facility Location Problem
  • Choose representatives
  • Minimizing the sum of costs
  • Discrepancy-based
  • Private
  • Dynamic

33
Facility Location Problem
  • An old problem [Kuehn & Hamburger '63], many variants
  • Shown NP-hard [Kleinberg, Papadimitriou,
    Raghavan '98]
  • A hill-climbing heuristic gives a constant-factor
    approximation [Arya, Garg, Khandekar,
    Munagala '01]
  • Given a configuration of the facilities,
  • choose the single step that minimizes the cost

34
Facility Location Problem
  • Cost of a configuration =
    cost of activating facilities +
    distances from points to the nearest facility
  • Cost(C) = Σ_{fac∈C} activate-cost(fac)
            + Σ_{p∈points} min_{fac∈C} dist(p,fac)
    (see the sketch below)

Computed locally by sensor/peer
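
A direct sketch of this cost function (Euclidean distances and the sample
data are assumptions). The inner min is exactly the term each sensor/peer
computes locally; only the outer sums are global.

    import math

    def dist(p, f):
        # Euclidean distance between a point and a facility.
        return math.hypot(p[0] - f[0], p[1] - f[1])

    def cost(C, points, activate_cost):
        return (sum(activate_cost[f] for f in C) +
                sum(min(dist(p, f) for f in C) for p in points))

    activate_cost = {(0.0, 0.0): 1.0, (5.0, 0.0): 1.0}  # candidates
    points = [(0.0, 1.0), (4.0, 0.0), (5.0, 1.0)]
    print(cost(list(activate_cost), points, activate_cost))  # 2 + 3 = 5.0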
35
Climbing the Hill
  • Cost(C) − Cost(C′)
    = Σ_{p∈points} min_{fac∈C} d(p,fac) − Σ_{p∈points} min_{fac∈C′} d(p,fac)
    = Σ_{p∈points} [min_{fac∈C} d(p,fac) − min_{fac∈C′} d(p,fac)]
    = Σ_{p∈points} Δ_p(C,C′)
  • To choose between C and C′, only the sign is
    needed
  • Reducible to argmin
  • Argmin is reducible to majority vote (sketch below)

Computed locally by sensor/peer
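
A sketch of the decomposition: every point contributes a local term
Δ_p(C, C′), and only the sign of the global sum matters, which is exactly
the shape of a majority vote. Activation costs are omitted for brevity;
names are illustrative.

    def delta_p(p, C, C_prime, dist):
        # Local change in this point's distance term between C and C'.
        return (min(dist(p, f) for f in C) -
                min(dist(p, f) for f in C_prime))

    def step_improves(points, C, C_prime, dist):
        # True iff Cost(C) - Cost(C') > 0, i.e. moving to C' lowers cost.
        return sum(delta_p(p, C, C_prime, dist) for p in points) > 0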
36
Climbing the Hill
  • Want to find the best next configuration in
    Next(C)
  • For all pairs Ci, Cj in Next(C), check whether
    Cost(Ci) − Cost(Cj) < 0
  • The cost of the best step wins them all

37
How many majority votes per hill step?
M locations, K facilities:
  • Full order: O(M²K²)
  • Pivoting: O(MK)
  • ArgMin: MK − 1
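
For a rough sense of scale (numbers assumed for illustration): with
M = 1,000 locations and K = 10 facilities, a full order costs on the
order of M²K² = 10⁸ votes, pivoting on the order of MK = 10⁴, while
ArgMin needs exactly MK − 1 = 9,999.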
38
The Big Picture
  • All participants begin with a predefined
    configuration
  • Each develops the next possible steps
  • Use ArgMin to compute the best step
  • Continue speculatively
  • Eventual convergence guaranteed

39
Experimental Data
  • 10 Gaussian distributions
  • Random mean
  • Variance 1
  • 20% random noise

40
Convergence, Noise
  • Internet topology (BRITE)
  • 512 nodes
  • 1K points per node
  • 5 nodes selected randomly
  • 10 points modified in each
  • every 5 simulator steps
  • (3-5 times the average edge delay)

41
Message Overhead, Scalability
[Plots: de Bruijn topology; BRITE Internet topology]
42
Database Size
43
Locality
44
Robustness (different instances)
45
Dynamic Experiments
  • Switch databases among sensors
  • at average edge delay intervals
  • 98% accuracy retained

46
Static Experiments: Messages
  • Scalable
  • Robust to topology
  • On average: a single message per majority vote per
    participant

47
Static Experiments: Locality
  • Percentiles of avg. environment size: 80%, 90%, 95%
  • Topology-sensitive
  • 4-6 peers' data

48
Conclusions
  • Complex data mining algorithms can be
    implemented in-network
  • Using local algorithms as building blocks
    assures
  • superb scalability
  • modest bandwidth demands
  • rapid convergence
  • energy efficiency
  • robustness to stationary changes in the data

49
THE END!
  • Questions???

50
A Successful Approach: Local Algorithms
  • The complexity of computing the result does not
    directly depend on the number of participants.
  • Each participant computes the result using
    information gathered from just a few nearby
    neighbors (its environment).
  • Environment size may depend on the problem.
  • Eventual correctness guaranteed.

51
Local Algorithms
  • Locality implies
  • ✓ Scalability
  • ✓ Robustness
  • ✓ Incrementality
  • ✓ Asynchrony

52
Experimental Validation
  • Large network of sensors
  • Thousands
  • Topologies: grid, de Bruijn, Internet
  • Synthetic data
  • Static and dynamic (stationary)

53
Local Majority Voting: Locality