Title: Intelligent Remote Sensing Using Wireless Networks
1The Netflix Challenge: Parallel Collaborative Filtering
James Jolly, Ben Murrell
CS 387: Parallel Programming with MPI
Dr. Fikret Ercal
2What is Netflix?
- subscription-based movie rental
- online frontend
- over 100,000 movies to pick from
- 8M subscribers
- 2007 net income: $67M
3What is the Netflix Prize?
- attempt to increase Cinematch accuracy
- predict how users will rate unseen movies
- $1M for a 10% improvement
4The contest dataset
- contains 100,480,577 ratings
- from 480,189 users
- for 17,770 movies
5Why is it hard?
- user tastes difficult to model in general
- movies tough to classify
- large volume of data
6Sounds like a job for collaborative filtering!
- infer relationships between users
- leverage them to make predictions
7Why is it hard?
User      Movie            Rating
Dijkstra  Office Space     5
Knuth     Office Space     5
Turing    Office Space     5
Knuth     Dr. Strangelove  4
Turing    Dr. Strangelove  2
Boole     Titanic          5
Knuth     Titanic          1
Turing    Titanic          2
8What makes users similar?
[Figure: user ratings for Office Space, Titanic, and Dr. Strangelove]
9What makes users similar? The Pearson Correlation Coefficient!
[Figure: user ratings for Office Space, Titanic, and Dr. Strangelove]
pc = .813
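As a rough illustration of the coefficient named on this slide, the sketch below computes a Pearson correlation over the movies two users have both rated. The function name `pearson`, the two sample users, and the choice to return 0.0 when there is too little overlap are assumptions made for this example; the ratings are invented and are not taken from the contest data.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the movies both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    n = len(common)
    if n < 2:
        return 0.0  # assumption: too little overlap counts as "uncorrelated"
    xs = [ratings_a[m] for m in common]
    ys = [ratings_b[m] for m in common]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a user who rates everything the same carries no signal
    return cov / sqrt(var_x * var_y)

# Hypothetical users with invented ratings (not from the slides).
user_a = {"Office Space": 5, "Dr. Strangelove": 3, "Titanic": 1}
user_b = {"Office Space": 4, "Dr. Strangelove": 3, "Titanic": 2}
user_c = {"Office Space": 1, "Dr. Strangelove": 3, "Titanic": 5}

print(pearson(user_a, user_b))  # same ordering of tastes
print(pearson(user_a, user_c))  # opposite ordering of tastes
```

Restricting the sums to co-rated movies is one common convention; other treatments of missing ratings are possible.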
10Building a similarity matrix
         Turing  Knuth  Boole  Chomsky
Turing   1.000   0.813  0.750  0.125
Knuth    0.813   1.000  0.325  0.500
Boole    0.750   0.325  1.000  0.500
Chomsky  0.125   0.500  0.500  1.000
11Predicting user ratings
Would Chomsky like Grammar Rock?
- approach
  - use matrix to find users like Chomsky
  - drop ratings from those who haven't seen it
  - take weighted average of remaining ratings
12Predicting user ratings
         Turing  Knuth  Boole  Chomsky
Turing   1.000   0.813  0.750  0.125
Knuth    0.813   1.000  0.325  0.500
Boole    0.750   0.325  1.000  0.500
Chomsky  0.125   0.500  0.500  1.000
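Suppose Turing, Knuth, and Boole rated it 5, 3, and 1. Chomsky's similarities to them are .125, .5, and .5, and since .125 + .5 + .5 = 1.125, we predict the weighted average

$$r_{\text{Chomsky}} = \frac{.125}{1.125}\cdot 5 + \frac{.5}{1.125}\cdot 3 + \frac{.5}{1.125}\cdot 1 \approx 2.333$$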
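The three-step approach (find similar users, drop those who have not seen the movie, take a weighted average) can be sketched as follows. The function name `predict` and the dictionary layout are assumptions for illustration, and the sketch assumes non-negative similarities; the numbers are Chomsky's row of the matrix and the three ratings from the example.

```python
def predict(similarities, ratings):
    """Similarity-weighted average of the ratings given by users who saw the movie.

    similarities: {other_user: similarity to the target user}
    ratings:      {other_user: that user's rating of the movie}
    """
    seen = [u for u in similarities if u in ratings]  # drop users without a rating
    total_weight = sum(similarities[u] for u in seen)
    if total_weight == 0:
        return None  # assumption: no usable neighbours means no prediction
    return sum(similarities[u] * ratings[u] for u in seen) / total_weight

chomsky_row = {"Turing": 0.125, "Knuth": 0.500, "Boole": 0.500}
grammar_rock = {"Turing": 5, "Knuth": 3, "Boole": 1}

print(predict(chomsky_row, grammar_rock))
```

Dividing by the sum of the weights is what makes this a weighted average: the result always lies between the lowest and highest neighbour rating.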
13So how is the data really organized?
movie file 1: user 1, rating 5; user 13, rating 3; user 42, rating 2
movie file 2: user 13, rating 1; user 42, rating 1; user 1337, rating 2
movie file 3: user 13, rating 5; user 311, rating 4; user 666, rating 5
14Training Data
- 17,770 text files (one for each movie)
- > 2 GB
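Because the training data is grouped by movie while user-to-user correlation needs each user's ratings side by side, a regrouping pass is a natural first step. The sketch below does this in memory for the three sample movie files; the names `movie_files` and `group_by_user`, the pairing of each record group with a particular file number, and the in-memory dictionary layout are assumptions for illustration, not the actual on-disk format.

```python
from collections import defaultdict

# Sample records keyed by movie: {movie_id: [(user_id, rating), ...]}
movie_files = {
    1: [(1, 5), (13, 3), (42, 2)],
    2: [(13, 1), (42, 1), (1337, 2)],
    3: [(13, 5), (311, 4), (666, 5)],
}

def group_by_user(movie_files):
    """Invert movie-major records into {user_id: {movie_id: rating}}."""
    by_user = defaultdict(dict)
    for movie_id, records in movie_files.items():
        for user_id, rating in records:
            by_user[user_id][movie_id] = rating
    return dict(by_user)

ratings_by_user = group_by_user(movie_files)
print(ratings_by_user[13])  # all of user 13's ratings in one place
```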
15Parallelization
- Two Step Process
  - Learning Step
  - Prediction Step
- Concerns
  - Data Distribution
  - Task Distribution
16Parallelizing the learning step
user 1 user 2 user 3 user 4 user 5 user 6 user 7 user 8
user 1 c1,1 c1,2 c1,3 c1,4 c1,5 c1,6 c1,7 c1,8
user 2 c2,1 c2,2 c2,3 c2,4 c2,5 c2,6 c2,7 c2,8
user 3 c3,1 c3,2 c3,3 c3,4 c3,5 c3,6 c3,7 c3,8
user 4 c4,1 c4,2 c4,3 c4,4 c4,5 c4,6 c4,7 c4,8
user 5 c5,1 c5,2 c5,3 c5,4 c5,5 c5,6 c5,7 c5,8
user 6 c6,1 c6,2 c6,3 c6,4 c6,5 c6,6 c6,7 c6,8
user 7 c7,1 c7,2 c7,3 c7,4 c7,5 c7,6 c7,7 c7,8
user 8 c8,1 c8,2 c8,3 c8,4 c8,5 c8,6 c8,7 c8,8
17Parallelizing the learning step
user 1 user 2 user 3 user 4 user 5 user 6 user 7 user 8
user 1 c1,1 c1,2 c1,3 c1,4 c1,5 c1,6 c1,7 c1,8
user 2 c2,1 c2,2 c2,3 c2,4 c2,5 c2,6 c2,7 c2,8
user 3 c3,1 c3,2 c3,3 c3,4 c3,5 c3,6 c3,7 c3,8
user 4 c4,1 c4,2 c4,3 c4,4 c4,5 c4,6 c4,7 c4,8
user 5 c5,1 c5,2 c5,3 c5,4 c5,5 c5,6 c5,7 c5,8
user 6 c6,1 c6,2 c6,3 c6,4 c6,5 c6,6 c6,7 c6,8
user 7 c7,1 c7,2 c7,3 c7,4 c7,5 c7,6 c7,7 c7,8
user 8 c8,1 c8,2 c8,3 c8,4 c8,5 c8,6 c8,7 c8,8
[Figure: the correlation matrix divided among processors P1, P2, P3, and P4]
18Parallelizing the learning step
- store data as user → movie rating (grouped by user)
- each proc has all rating data for n/p users
- calculate each ci,j
- calculation requires message passing (only 1/p of correlations can be calculated locally within a node)
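The 1/p figure can be checked with a small counting sketch. It assumes a contiguous block distribution in which each of p processors holds n/p users, and it counts a correlation ci,j as local when both users live on the same processor; `owner` and `local_fraction` are hypothetical helper names.

```python
def owner(user, n_users, n_procs):
    """Rank holding a user's ratings under a contiguous block distribution (0-based)."""
    block = n_users // n_procs  # assumption: n_procs divides n_users evenly
    return user // block

def local_fraction(n_users, n_procs):
    """Fraction of the n x n correlations whose two users sit on the same rank."""
    local = sum(
        1
        for i in range(n_users)
        for j in range(n_users)
        if owner(i, n_users, n_procs) == owner(j, n_users, n_procs)
    )
    return local / (n_users * n_users)

# 8 users on 4 processors, as in the matrix above.
print(local_fraction(8, 4))
```

Each processor can compute (n/p)^2 entries on its own, so the p processors together cover n^2/p of the n^2 entries; every other entry needs ratings held elsewhere.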
19Parallelizing the prediction step
- Data distribution directly affects task distribution
- Method 1: store all user information on each processor and stripe movie information (less communication)

P0 sends predict(user, movie) and receives a rating estimate

P1                    P2                    P3
All User Information  All User Information  All User Information
Movie1                Movie2                Movie3
Movie4                Movie5                Movie6
Movie7                Movie8                Movie9
Movie10               Movie11               Movie12
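The movie layout in this diagram matches a cyclic (round-robin) striping over three workers, which the sketch below reproduces. The helper names `movie_owner` and `route_prediction` and the request dictionary are assumptions for illustration; the point is that P0 only needs to contact the one worker that owns the movie.

```python
def movie_owner(movie_id, n_workers=3):
    """Worker number (1-based) holding a movie under cyclic striping."""
    return (movie_id - 1) % n_workers + 1

def route_prediction(user_id, movie_id, n_workers=3):
    """P0 forwards the whole request to the single worker that owns the movie."""
    return {"worker": movie_owner(movie_id, n_workers), "request": (user_id, movie_id)}

# Rebuild the layout for Movie1..Movie12 on P1..P3.
layout = {w: [m for m in range(1, 13) if movie_owner(m) == w] for w in (1, 2, 3)}
print(layout)
print(route_prediction(user_id=42, movie_id=8))
```

Since every worker already holds all user information, the chosen worker can answer alone, which is why this method needs little communication.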
20Parallelizing the prediction step
- Data distribution directly affects task distribution
- Method 2: store all movie information on each processor and stripe user information (more communication)

P0 sends predict(user, movie) and gathers partial estimates

P1                 P2                 P3
All Movie Ratings  All Movie Ratings  All Movie Ratings
User1              User2              User3
User4              User5              User6
User7              User8              User9
User10             User11             User12
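One way to read "gather partial estimates" is that each worker returns a partial numerator and denominator of the weighted average for the users it holds, and P0 adds them up. The sketch below simulates that in a single process; the function names, the one-user-per-worker split, and the assumption that each worker already knows the target user's similarities are all illustrative. In an MPI program the combining step would typically be a sum reduction such as `MPI_Reduce` with `MPI_SUM`.

```python
def partial_estimate(local_users, similarities, ratings):
    """One worker's share: (sum of weight * rating, sum of weights) over its users."""
    num = 0.0
    den = 0.0
    for u in local_users:
        if u in similarities and u in ratings:  # skip users who have not seen the movie
            num += similarities[u] * ratings[u]
            den += similarities[u]
    return num, den

def combine(partials):
    """P0's step: add the partial sums, then divide once."""
    num = sum(p[0] for p in partials)
    den = sum(p[1] for p in partials)
    return num / den if den else None

similarities = {"Turing": 0.125, "Knuth": 0.5, "Boole": 0.5}
ratings = {"Turing": 5, "Knuth": 3, "Boole": 1}
workers = [["Turing"], ["Knuth"], ["Boole"]]  # hypothetical user striping

partials = [partial_estimate(w, similarities, ratings) for w in workers]
estimate = combine(partials)
print(partials, estimate)
```

Sending the two sums rather than a per-worker average matters: averages of averages would weight each worker equally regardless of how similar its users are.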
21Parallelizing the prediction step
- Data distribution directly affects task distribution
- Method 3: hybrid approach (lots of communication, high number of nodes)
P0 sends predict(user, movie)

P1         P2         P3
Users 1-3  Users 1-3  Users 1-3
Movie1     Movie2     Movie3
Movie4     Movie5     Movie6
Movie7     Movie8     Movie9
Movie10    Movie11    Movie12

P4         P5         P6
Users 1-3  Users 1-3  Users 1-3
Movie13    Movie14    Movie15
Movie16    Movie17    Movie18
Movie19    Movie20    Movie21
Movie22    Movie23    Movie24

P7         P8         P9
Users 4-6  Users 4-6  Users 4-6
Movie13    Movie14    Movie15
Movie16    Movie17    Movie18
Movie19    Movie20    Movie21
Movie22    Movie23    Movie24

Users 4-6  Users 4-6  Users 4-6
Movie25    Movie26    Movie27
Movie28    Movie29    Movie30
Movie31    Movie32    Movie33
Movie34    Movie35    Movie36
22Our Present Implementation
- operates on a trimmed-down dataset
- stripes movie information and stores the similarity matrix on each processor
  - this won't scale well!
- storing all movie information on each node would be optimal, but nic.mst.edu can't handle it
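A back-of-the-envelope calculation suggests why a full similarity matrix on every processor will not scale. The sketch assumes a dense matrix with one 4-byte float per entry, which is an assumption for illustration rather than a description of the current implementation; the user count is the one given for the contest dataset.

```python
N_USERS = 480_189          # users in the contest dataset
BYTES_PER_ENTRY = 4        # assumption: single-precision float per similarity

full_entries = N_USERS * N_USERS                 # dense n x n matrix
unique_pairs = N_USERS * (N_USERS - 1) // 2      # symmetric, diagonal omitted

full_gb = full_entries * BYTES_PER_ENTRY / 1e9
half_gb = unique_pairs * BYTES_PER_ENTRY / 1e9

print(f"dense matrix:      {full_entries:,} entries, about {full_gb:,.0f} GB")
print(f"unique pairs only: {unique_pairs:,} entries, about {half_gb:,.0f} GB")
```

Even storing only the unique pairs leaves hundreds of gigabytes per copy under these assumptions, far more than the 2 GB of raw training data.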
23In summary
- tackling the Netflix Prize requires lots of data handling
- we are working toward an implementation that can operate on the entire training set
- simple collaborative filtering should get us close to the old Cinematch performance