Title: A Local Approach to Large-Scale Distributed Data Mining
1. A Local Approach to Large-Scale Distributed Data Mining
- Assaf Schuster
- CS, Technion - Israel Institute of Technology
Joint work with Ran Wolff, Amir Bar-Or, Denis Krivitsky, Bobi Gilburd, Tsachi Scharfman, Arik Friedman, Liran Liss, Mirit Shalem, Daniel Keren, Tsachi Birk
2. Large-Scale Distributed Databases
- Sensor networks, peer-to-peer systems, and grid systems produce and store massive databases
- Often, data is secondary to knowledge
- Modeling (summary), patterns (tracking), and decision making (information retrieval)
- A decentralized system is essential for
- Cost (Skype: 0.001 per user)
- Power (WSN battery depletion)
- Ownership/Privacy/Anonymity
A P2P Database (Jan 2004): 60M users, 5M simultaneously connected, 45M downloads/month, 900M shared files
3. Data Mining Applications for P2P
- The current technology trend allows customers to collect huge amounts of data
- e-economy, cheap storage, high-speed connectivity
- P2P technology enables customers to share data
[Diagram: customer data mining; mirroring corporates'; recommendations (unbiased); e-Mule; Product Lifetime Cost (as opposed to CLV)]
4. Data Mining a P2P Database
- Impossible to collect the data
- Privacy, size, processing power
- ⇒ Distributed algorithms
- ⇒ In-network processing: push processing to the peers
- Internet scale
- ⇒ NO global operators, NO synchronization
- ⇒ NO global communication
- Ever-changing data and system
- Failures, crashes, joins, departures
- Data modified faster than it is propagated
- ⇒ Incremental algorithms
- ⇒ Ad-hoc, anytime results
5. How Not to Data Mine P2P
- Many data mining algorithms use decomposable statistics (Avg, Var, cross-tables, etc.)
- Global statistics can be calculated using sum reduction... or can they? (see the sketch after this list)
- Synchronization
- Bandwidth requirements
- Failures
- Consistency
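To make the decomposability point concrete, here is a minimal Python sketch, with illustrative names not taken from the talk, of how Avg and Var decompose into per-peer summaries (count, sum, sum of squares) that combine by plain sum reduction; the caveats listed above are exactly what breaks this naive scheme at Internet scale.

```python
# Minimal sketch: decomposable statistics via sum reduction.
# Each peer computes a local summary; summaries merge by element-wise
# addition, and Avg/Var are recovered at the end. Names are illustrative.

def local_summary(values):
    """Per-peer summary: (count, sum, sum of squares)."""
    return (len(values), sum(values), sum(v * v for v in values))

def merge(a, b):
    """The sum-reduction step: combine two summaries."""
    return tuple(x + y for x, y in zip(a, b))

def finalize(summary):
    """Recover the global Avg and Var from the merged summary."""
    n, s, sq = summary
    avg = s / n
    return avg, sq / n - avg * avg

# Two peers' data reduced to the global statistics (avg=3.0, var=2.0):
print(finalize(merge(local_summary([1, 2, 3]), local_summary([4, 5]))))
```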
- Literature: old-style parallelism
- rigid pattern, not data-adaptive
- no failure tolerance
- limited scalability
- one-time computation
6. An Alternative Approach to Data Mining in P2P
- A data mining process that behaves like a network protocol (e.g., routing)
- No termination: ad-hoc result
- No synchronization: speculative progress
- Scalability: 10M network nodes
- The outcome constantly feeds into
- The system
- User interface (personal portal)
7. An (Immature) Approach: Gossip, Sampling
- Inaccurate
- Hard to decompose
- Hard to employ in iterative data mining algorithms
- Assumes global knowledge (N, mixing times, ...)
- May work eventually
- Work in progress
- Gossip-Based Computation of Aggregate Information. Kempe, Dobra, Gehrke. FOCS'03.
- Towards Data Mining in Large and Fully Distributed Peer-to-Peer Overlay Networks. Kowalczyk, Jelasity, Eiben. BNAIC'03.
- Gossip and Mixing Times of Random Walks on Random Graphs. Boyd, Ghosh, Prabhakar, Shah.
8. A Successful Approach: Local Algorithms
- Every peer's result depends on the data gathered from a (small) environment of peers
- The size of the environment may depend on the problem at hand
- Eventual correctness guaranteed
9. Local Algorithms
- ✓ Scalability
- ✓ Robustness
- ✓ Incrementality
- ✓ Energy efficiency
- ✓ Asynchrony
10. Local Algorithms
- Performance independent of system size
- Examples: Coloring, MST, Persistent Bit [Awerbuch, Bar-Noy, Kuhn, Kutten, Linial, Moscibroda, Naor, Patt-Shamir, Peleg, Stockmeyer, Wattenhofer]
- Generalized (weighted) majority voting: decide whether Σ_{p∈peers} s(p) / Σ_{p∈peers} votes(p) > λ, for 1 > λ > 0 (see the sketch after this list)
- Locality depends on the significance of the vote
- in a tie, all votes must be counted
- For many applications a tie is not important: inconclusive voting
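A minimal sketch of the weighted-majority test above, assuming each peer's gathered knowledge is a collection of (s, votes) pairs; the function name and data layout are illustrative, not the paper's notation.

```python
# Sketch: the generalized majority test sum(s) / sum(votes) > lambda.
# `knowledge` holds the (s, votes) pairs this peer has gathered from
# its environment; names are illustrative.

def majority_exceeds(knowledge, lam):
    """True iff the fraction of supporting votes exceeds lam (0 < lam < 1);
    None while no votes have been gathered (inconclusive)."""
    total_s = sum(s for s, _ in knowledge)
    total_votes = sum(v for _, v in knowledge)
    if total_votes == 0:
        return None           # no evidence yet
    return total_s / total_votes > lam

# Three peers report (support, votes); 110/300 < 0.5, so the vote fails:
print(majority_exceeds([(60, 100), (30, 100), (20, 100)], 0.5))
```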
11. Correctness vs. Locality
- Information propagation ⇒ correctness
- No propagation ⇒ locality
- Rule of thumb: propagate when
- the neighbor needs to be persuaded, or
- a previous message to the neighbor turns out to be potentially misleading
(a sketch follows the figure below)
[Figure: example network of peers W, X, Y, Z]
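A hedged sketch of the rule of thumb above: a peer re-contacts a neighbor only when the conclusion implied by what it last sent no longer matches its current knowledge. The state names are illustrative simplifications, not the exact protocol of the paper.

```python
# Sketch: propagate only when the neighbor needs persuading or the
# previous message has become potentially misleading. All names are
# illustrative simplifications of the real protocol state.

def above(s, votes, lam):
    return votes > 0 and s / votes > lam

def should_send(local_s, local_votes, sent_s, sent_votes, lam):
    """Send an update iff the majority conclusion implied by our current
    knowledge (local_*) differs from the one implied by what we last
    told the neighbor (sent_*); otherwise stay silent, preserving locality."""
    return above(local_s, local_votes, lam) != above(sent_s, sent_votes, lam)
```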
12. Local Majority Voting
[Figure: majority-voting example over peers W, X, Y, Z]
[Wolff & Schuster, ICDM'03]
13. (No transcript)
14. Locality
- 1,600 nodes
- All initiated at once
- Local DB of 10K transactions
- Lock-step execution
- Run until there are no further messages
15. Dynamic Behavior: 1M Peers
- 1% noise
- At every simulator step
- 48% set input bits
- 0.1% noise
16. Local Majority Voting Variations: General Graphs
- [Birk, Liss, Schuster, Wolff, DiSC'04]
17. Local Majority Voting Variations: Private Votes
- K-privacy
- Oblivious counters
- Homomorphic encryption
- [Gilburd, Schuster, Wolff: CCGrid'04, HPDC'04, KDD'04]
18. A Decomposition Methodology
- Decompose data mining process into primitives
- Primitives are simpler
- Find local distributed algorithms for primitives
- Efficient in the common case
- Recompose data mining process from primitives
- Maintain correctness
- Maintain locality
- Maintain asynchrony
- Results: Association Rules [ICDM'03], Hill Climbing [DCOSS'05], Facility Location [submitted], Hierarchical Decision Trees [SDM'04], K-anonymity [submitted], Approximate Decision Trees [in writing]
19. Local Majority and Associations
- X is frequent if more than MinFreq of the transactions contain X (same for Y)
- X ⇒ Y is confident if more than MinConf of the transactions that contain X also contain Y
(see the sketch below)
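A minimal sketch of these two definitions over a single peer's local transaction database, representing transactions and itemsets as Python frozensets; the helper names are illustrative.

```python
# Sketch: frequency and confidence over a local transaction database.
# Transactions and itemsets are frozensets; names are illustrative.

def support(db, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def is_frequent(db, itemset, min_freq):
    return support(db, itemset) > min_freq

def is_confident(db, x, y, min_conf):
    """X => Y is confident iff support(X ∪ Y) / support(X) > MinConf."""
    sx = support(db, x)
    return sx > 0 and support(db, x | y) / sx > min_conf

db = [frozenset({"bread", "milk"}), frozenset({"bread"}), frozenset({"milk"})]
print(is_confident(db, frozenset({"bread"}), frozenset({"milk"}), 0.4))  # 0.5 > 0.4
```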
20. Local Majority and Associations
[Figure: example item "Apples" at 80% frequency]
21. Local Majority and Associations
- Find that X is frequent
- and that X ∪ Y is frequent
22. Local Majority and Associations
- Find that X is frequent
- and that X ∪ Y is frequent
- Then compare the frequencies of X ∪ Y and X
- Three votings! (see the sketch below)
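Note that the third comparison is itself a weighted majority: only transactions containing X vote, and each votes "yes" iff it also contains Y. A sketch under the same frozenset representation as above (names illustrative):

```python
# Sketch: the three votes behind rule X => Y, over one peer's local db.
# Vote 3 is a weighted majority in which only transactions containing X vote.

def rule_holds(db, x, y, min_freq, min_conf):
    votes = [1 if y <= t else 0 for t in db if x <= t]  # voters: t ⊇ X
    freq_x = len(votes) / len(db)
    freq_xy = sum(votes) / len(db)
    return (freq_x > min_freq                        # vote 1: X frequent
            and freq_xy > min_freq                   # vote 2: X ∪ Y frequent
            and sum(votes) > min_conf * len(votes))  # vote 3: confidence
```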
23. Activation Flow in Distributed Association Rule Mining
24. Avoiding Synchronization: The Iterative Scheme
25. Avoiding Synchronization: Speculation
26. The Big Picture
27. Associations: Performance
- By the time the database is scanned once, in parallel,
- the average peer has discovered 95% of the rules
- and has fewer than 10% false rules
28. Privacy Preserving: Performance
29. The Facility Location Problem
- A large network of motion sensors logs data
- A second tier of high-powered relays assists in data collection
- Both resources are limited
- Which relays to use?
30. Facility Location Problem
- Choose which locations to activate
- Minimizing sum of costs
- Distance based
- Data based
- Dynamic
31. Facility Location Problem: Cross-Domain Clustering
- e-Mule users each have a music database
- Search can improve if users are grouped by their preferences: clustering
- It is natural to talk of preference in terms of a music ontology: Country and Soul, Rock and 80's, Classics and Jazz, etc.
- The ontology provides important background knowledge
32. A Closer Look at the Facility Location Problem
- Choose representations
- Minimizing the sum of costs
- Discrepancy-based
- Private
- Dynamic
33. Facility Location Problem
- An old problem [Kuehn & Hamburger '63]; many variants
- Shown NP-hard [Kleinberg, Papadimitriou, Raghavan '98]
- A hill-climbing heuristic gives a constant-factor approximation [Arya, Garg, Khandekar, Munagala '01]
- Given a configuration of the facilities,
- choose the single step that minimizes the cost
34. Facility Location Problem
- Cost of a configuration:
- cost of activating facilities
- distances from points to the nearest facility
- Cost(C) = Σ_{fac∈C} activate-cost(fac) + Σ_{p∈points} min_{fac∈C} dist(p, fac)
- The distance term is computed locally by each sensor/peer (see the sketch below)
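A direct transcription of the formula, assuming points and facilities are coordinate tuples and activation costs live in a lookup table; math.dist (Euclidean distance, Python 3.8+) and the container choices are illustrative assumptions.

```python
import math

# Sketch: Cost(C) = activation costs + distances to the nearest facility.
# `config` is the set of active facilities, `points` the data points,
# `activate_cost` a facility -> cost table; all names are illustrative.

def cost(config, points, activate_cost):
    open_cost = sum(activate_cost[f] for f in config)
    conn_cost = sum(min(math.dist(p, f) for f in config) for p in points)
    return open_cost + conn_cost

# Each peer evaluates only the conn_cost term over its own points; the
# activation term depends only on the shared configuration.
```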
35. Climbing the Hill
- Cost(C) - Cost(C') = Σ_{p∈points} min_{fac∈C} d(p, fac) - Σ_{p∈points} min_{fac∈C'} d(p, fac)
  = Σ_{p∈points} [ min_{fac∈C} d(p, fac) - min_{fac∈C'} d(p, fac) ]
  = Σ_{p∈points} Δ_p(C, C')
- To choose between C and C', only the sign is needed
- Reducible to ArgMin
- ArgMin is reducible to majority vote ✓
- Each Δ_p(C, C') is computed locally by a sensor/peer (see the sketch below)
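A sketch of one peer's local contribution Σ_p Δ_p(C, C') over its own points, ignoring activation-cost differences for brevity; globally only the sign of the summed deltas is needed, which is exactly a majority-style threshold test.

```python
import math

# Sketch: this peer's share of Cost(C) - Cost(C'). Only the sign of the
# network-wide sum of these deltas matters when choosing between C and C'.

def local_delta(points, c_old, c_new):
    return sum(min(math.dist(p, f) for f in c_old)
               - min(math.dist(p, f) for f in c_new)
               for p in points)

# sum over peers of local_delta(...) > 0  <=>  C' improves on C, so the
# decision reduces to a sign test, i.e., a majority-vote-style threshold.
```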
36. Climbing the Hill
- We want to find the best next configuration in Next(C)
- For all pairs Ci, Cj in Next(C), check whether Cost(Ci) - Cost(Cj) < 0
- The cost of the best step wins them all
37. How Many Majority Votes per Hill Step?
M locations, K facilities:
- Full order: O(M²K²)
- Pivoting: O(MK)
- ArgMin: MK - 1
38. The Big Picture
- All participants begin with a predefined configuration
- Each develops the next possible steps
- Use ArgMin to compute the best step
- Continue speculatively
- Eventual convergence guaranteed (see the sketch below)
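A centralized sketch of the hill-climbing skeleton that the protocol distributes; in the in-network version, the minimization over Next(C) is what the ArgMin primitive computes, speculatively and without synchronization. Names are illustrative.

```python
# Sketch: hill climbing over configurations. `next_configs(C)` yields
# Next(C); `cost_fn` evaluates Cost(C). In the distributed protocol the
# min below is replaced by the local ArgMin primitive.

def hill_climb(initial, next_configs, cost_fn):
    current = initial
    while True:
        best = min(next_configs(current), key=cost_fn, default=None)
        if best is None or cost_fn(best) >= cost_fn(current):
            return current        # no improving step: local optimum
        current = best            # the best step "wins them all"
```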
39. Experimental Data
- 10 Gaussian distributions
- Random mean
- Variance 1
- 20% random noise
40. Convergence, Noise
- Internet topology (BRITE)
- 512 nodes
- 1K points per node
- 5 nodes selected randomly, 10 points modified in each
- every 5 simulator steps (35 times the average edge delay)
41. Message Overhead, Scalability
[Plots: de Bruijn topology; BRITE Internet topology]
42. Database Size
43. Locality
[Plot: instances]
44. Robustness (Different Instances)
45. Dynamic Experiments
- Switch databases among sensors
- at the average edge delay
- 98% accuracy retained
46. Static Experiments: Messages
- Scalable
- Topology robust
- On average: a single message per majority vote per participant
47. Static Experiments: Locality
- Percentiles of avg. environment size: 80, 90, 95
- Topology sensitive
- 4-6 peers' data
48. Conclusions
- Complex data mining algorithms can be implemented in-network
- Using local algorithms as building blocks assures
- superb scalability
- modest bandwidth demands
- rapid convergence
- energy efficiency
- robustness to stationary changes in the data
49. THE END!
50. A Successful Approach: Local Algorithms
- The complexity of computing the result does not directly depend on the number of participants.
- Each participant computes the result using information gathered from just a few nearby neighbors: its environment.
- Environment size may depend on the problem.
- Eventual correctness guaranteed.
51. Local Algorithms
- Locality
- ✓ Scalability
- ✓ Robustness
- ✓ Incrementality
- ✓ Asynchrony
52. Experimental Validation
- Large network of sensors
- Thousands
- Topologies: Grid, de Bruijn, Internet
- Synthetic data
- Static and dynamic (stationary)
53. Local Majority Voting: Locality