Title: ????????????? Hadoop ? Mahout
1????????????? Hadoop ? Mahout ? ???????? ????????
??????? ?????? ???????????? ?.?.?. ???.????????
?????????????????????? ???????????? ??????????
2Hadoop ? Mahout ??????? ?.?.
Big Data
- Big Data ?????? ????????? ??????? ???????
?????? - ????????? ? ?????????
- ??????? ???????? ????????? ?????????? ???????
????????????? ????????? - ?????? ?????????? Gartner ? IDC
- Big Data ?????? ? ??? 10 ???????? ??????
????????? ???????? ?????????????? ?????????? - ????? Big Data ???? ?? ????? ??????????????
- MapReduce ???? ?? ???????? ?????????? ???????
????????? ?????? ? Big Data
Outline
- Basics of MapReduce and Apache Hadoop
- The Hadoop ecosystem
- Machine learning in Apache Mahout
??????? Hadoop ? MapReduce
- MapReduce technology was invented at Google for its Internet search system
- ???? ??????? ? ???????????? ??????? ?????? ?????? ?? ??????? ???????????, ???????????? ?????
- Google does not distribute its implementation of MapReduce
- Jeffrey Dean, Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters
- Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung. The Google File System
- Apache Hadoop: an open-source implementation of MapReduce
- Developed on the basis of Google's ideas
- Written in Java
- http://hadoop.apache.org/
??? ?????????? Hadoop
- ??? ?????????? Hadoop
- ????? ??????? ??????? Hadoop ? Yahoo!
- 4500 ????????
- ???????????? ??? ????????? ??????? ? ???????
????????? ??????????
???????? ?????????? Hadoop
- HDFS (Hadoop Distributed File System): data storage
- MapReduce: data processing
HDFS
[Figure: a file split into 64 MB blocks]
HDFS
Name Node
Data Node 1
Data Node 2
Data Node 3
1, 4, 6
1, 3, 5
1, 2, 5
Data Node 4
Data Node 5
Data Node 6
?????? ? HDFS
- ????? ?????? ? HDFS ???????????? ?? ??????
???????? - ?????? ???????????? HDFS
- ?? ???????? ??????????? ??????? ls, cp, mv ? ?.?.
- ?????????? ???????????? ??????????? ???????
- hadoop dfs cmd
- ???????
- hadoop dfs -ls
  Found 3 items
  -rw-r--r-- 1 hadoop supergroup 0 2011-06-22 13:58 /user/hadoop/file1
  -rw-r--r-- 1 hadoop supergroup 0 2011-06-22 13:58 /user/hadoop/file2
  -rw-r--r-- 1 hadoop supergroup 0 2011-06-22 13:58 /user/hadoop/file3
- hadoop dfs -put /tmp/file4
- hadoop dfs -cat file4
  Hello, world!
??????????? HDFS
- HDFS ?????????????????? ???????? ???????,
???????????????? ??? ???????????? ?????????
?????? ? ???????? ??????? - ???????? ?? ??? ???? ?????!
- ?????? Write Once Read Many
- ?????? ???????? ????, ????? ?????? ????????? ?
????? - ??????? ?????? ?????
- Default is 64 MB (often 128 or 256 MB)
- ?? ?????????? ???????????? ?????? (???? ?????? ?
?.?.)
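HDFS stores a file as a sequence of large fixed-size blocks, and the last block may be only partly filled. The following is a minimal sketch of the block-count arithmetic, not Hadoop code: the class name, method name, and the 200 MB file size are made up for illustration.

```java
// Sketch: how many HDFS blocks a file of a given size occupies.
// HdfsBlockMath and blockCount are illustrative names, not part of Hadoop.
class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks = ceil(fileSize / blockSize); the last block may be partial.
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        if (fileSizeBytes < 0 || blockSizeBytes <= 0) {
            throw new IllegalArgumentException("file size must be >= 0 and block size > 0");
        }
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    public static void main(String[] args) {
        long fileSize = 200 * MB; // hypothetical 200 MB file
        System.out.println(blockCount(fileSize, 64 * MB));  // 4 (64 + 64 + 64 + 8 MB)
        System.out.println(blockCount(fileSize, 128 * MB)); // 2 (128 + 72 MB)
    }
}
```

A larger block size means fewer blocks per file, so the Name Node has less metadata to track.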
MapReduce
- MapReduce ?????????? ?????????????? ??????????
- ???? MapReduce ????????? ?????? ?????????? ?
??????????? ??????????????? ?????????????? - ??????????? ????????? ?????? ?????? ??????????
- ?????????????? ?????? ? ???????? ??????????????
????????????? - MapReduce ???????? ? ??????? ??? ? ??????
???????????? - ???????? ? ????? ?????
- ????????????? ???????????? ???????
- ???????????? ?????? ??????
- ????????? ????? ??????? ? ???????
Source: http://www.youtube.com/watch?v=SS27F-hYWfU
??????? Map ? Reduce
Source: http://developer.yahoo.com/hadoop/tutorial/module4.html
?????? MapReduce WordCount
- ?????? ?????????, ??????? ??? ????? ???????????
? ????? - ?????????? ????????? ? Web-?????????
- ?????????? ????????? ????? ??? ?????????????
????? - ???????? ??????
- ????????? ?????
- ?????? ???? ??????? ?? ???? ????????????
- ??????
- ???? MapReduce ????????? ?????? ?????????? ?
??????????? ??????????????? ??????????????.
??????????? ????????? ?????? ?????? ??????????
WordCount ??????? Map
- ???????? ??????
- ???? MapReduce ????????? ?????? ?????????? ? ??????????? ??????????????? ??????????????. ??????????? ????????? ?????? ?????? ??????????
- Processing results:
- <????, 1>, <mapreduce, 1>, <?????????, 1>, <??????, 1>, <??????????, 1>, <?, 1>, <???????????, 1>, <???????????????, 1>, <??????????????, 1>, <???????????, 1>, <?????????, 1>, <??????, 1>, <??????, 1>, <??????????, 1>
- Sorting and grouping by key:
- <mapreduce, 1>, <??????????????, 1>, <?, 1>, <??????, 1>, <??????, 1>, <???????????, 1>, <??????????, 1>, <??????????, 1>, <???????????, 1>, <?????????, 1>, <???????????????, 1>, <?????????, 1>, <??????, 1>, <????, 1>.
WordCount ??????? Reduce
- Pairs with the same key are passed to a single Reduce call:
- <mapreduce, 1> → <mapreduce, 1>
- <??????????????, 1> → <??????????????, 1>
- <?, 1> → <?, 1>
- <??????, 1>, <??????, 1> → <??????, 2>
- <???????????, 1> → <???????????, 1>
- <??????????, 1>, <??????????, 1> → <??????????, 2>
- <???????????, 1> → <???????????, 1>
- <?????????, 1> → <?????????, 1>
- <???????????????, 1> → <???????????????, 1>
- <?????????, 1> → <?????????, 1>
- <??????, 1> → <??????, 1>
- <????, 1> → <????, 1>
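The WordCount data flow on these slides (Map emits a pair per word, pairs are sorted and grouped by key, Reduce sums the values of each key) can be sketched in a single Java process. This is only an illustration of the flow under simplifying assumptions: it does not use the Hadoop API, and the class and method names are made up.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of WordCount: map -> sort/group by key -> reduce.
// Not Hadoop code; WordCountSketch and its methods are illustrative names.
class WordCountSketch {

    // Map: emit a <word, 1> pair for every word in the input line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Sort/group: collect the emitted values per key, keys in sorted order.
    static TreeMap<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: sum all values that share one key.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    // Run the three phases over all input lines.
    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        group(pairs).forEach((word, ones) -> counts.put(word, reduce(word, ones)));
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(List.of("to be or not to be"))); // {be=2, not=1, or=1, to=2}
    }
}
```

In a real cluster the map calls run in parallel on different nodes and the grouping step moves data over the network; here everything happens in memory.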
?????? MapReduce
- MapReduce ???????? ?????? ? ??????? ??????
WordCount - ???? ????? ??????? ? ?????????? ???????? ???????
- ??????????? MapReduce
- ??????????? ??????????????? ?????????????????
??????? Map ? Reduce ????? ???????????? ????????
?????? ??????????? ?? ???????? ???? ?? ????? - ???????????????? ?????? ????? ??????????? ??
?????? ???????? (? HDFS) ? ?????????????? ?????
?? ?????? ???????? - ?????????????????? ??? ?????? ?? ????? ???????
??????? Map ??? Reduce ??????????? ?? ??????
??????? - ?????????? MapReduce
- ????????????? ???????? ????????? ??????
- ??????? ????????? ??????? ?? ?????????????????
??????????? ?????????? ? ??????
?????? ??????? ?????? Hadoop
- hadoop jar hadoop-examples-*.jar grep input output 'dfs[a-z.]+'
- hadoop-examples-*.jar: name of the archive with examples from the Hadoop distribution
- grep: name of the command in the examples archive
- input: directory with the input data (in HDFS)
- output: directory for the output data (in HDFS)
- 'dfs[a-z.]+': the search pattern
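To see what such a search pattern selects, here is a small local sketch using java.util.regex. It assumes the pattern is the regular expression dfs[a-z.]+ (the brackets and plus sign appear to have been lost in the slide text), and it only imitates the idea of counting matched strings; it is not the Hadoop grep example itself. The input text is made up.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: count how often each string matching a regex occurs in some text.
// GrepSketch is an illustrative name; this does not use Hadoop.
class GrepSketch {

    static Map<String, Integer> countMatches(String regex, String text) {
        Map<String, Integer> counts = new TreeMap<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find()) {
            counts.merge(m.group(), 1, Integer::sum); // one more occurrence of this match
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up input resembling configuration lines
        String text = "dfs.replication=3\ndfs.name.dir=/data\ndfs.replication=2";
        System.out.println(countMatches("dfs[a-z.]+", text)); // {dfs.name.dir=1, dfs.replication=2}
    }
}
```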
?????????? Hadoop
- MapReduce ?????? ?????? ????????????????, ??
?????????????? - ?????????? ??????????? ???????? ??????????
??????? ??????? ??????????? - Hadoop ?????? ? ????????? ? ?????????????????
- ?? ?????? Hadoop ????????? ??????????
- ??????????? ???????? ??? ??????? ?????????
?????????? ?????, ???????????? Hadoop ???
??????????????? - ???????????? Hadoop
- ???????? ??????? ??? Hadoop
?????????? Hadoop
- Pig: a declarative language for data analysis
- Hive: data analysis using a language close to SQL
- Oozie: workflows in Hadoop
- HBase: a database (non-relational), an analogue of Google Big Table
- Mahout: machine learning
- Sqoop: data transfer from an RDBMS to Hadoop and back
- Flume: log transfer into HDFS
- Zookeeper, MRUnit, Avro, Giraph, Ambari, Cassandra, HCatalog, Fuse-DFS, etc.
???????????? Hadoop
- Apache
- hadoop.apache.org
- ???????????? ???????????, ?????? Hadoop
- ?????????????? ????????????
- ????????? Hadoop, HBase, Pig, Hive, Mahout, Sqoop, Zookeeper ? ??.
- ???????? ????????????? ????????? ? ?????????????????, ??????????, ????????????
- ?????????? ?????????????? ?????????????
- Cloudera
- MapR
- Hortonworks
- Intel
???????? ??????? Hadoop
- Amazon Elastic MapReduce (Amazon EMR)
- http://aws.amazon.com/elasticmapreduce/
- Partnership with MapR
- Apache Hadoop on Rackspace
- http://www.rackspace.com/knowledge_center/article/apache-hadoop-on-rackspace-private-cloud
- Partnership with Hortonworks
- Microsoft Windows Azure
- http://www.windowsazure.com/en-us/home/scenarios/big-data/
- Qubole Data Service
- http://www.qubole.com/qubole-data-service
- Web interface for data analysis with Hadoop, Hive, Pig, etc. on Amazon EMR
Apache Mahout
- ?????????????? ?????????? ????????? ????????
(machine learning) - ?????? ??????
- ? ???????? Hadoop
- ???????? ?? ????? ??????????
- Mahout ????? ?? ?????????? ?????, ????????
???????? ?????? - ???????? ???????? ????
- ??????? ?? Java
- ???????? Apache 2.0
- ???????? ???????
- http://mahout.apache.org/
???????? ???????? ? Mahout
- Collaborative filtering
- Recommendations
- Clustering
- ??????????? ???????? ? ?????? (????????, ??????? ?? ?????????)
- ??????? Google News ?????????? ??????? ?? ???? ????
- Algorithms in Mahout: K-Means, Fuzzy K-Means, Mean Shift, Dirichlet, Canopy, etc.
- Classification
- ??????????? ?????????????? ??????? ? ????????? ?????? (?????? ???????? ???????)
- ??????? ??????????? ?????, ??????????? ???????? ?????? (????? ? ????????, ?????? ? ?.?.)
- Algorithms in Mahout: Logistic Regression, Naive Bayes, Support Vector Machines, Online Passive Aggressive, etc.
????????????
???????????? ????????????
- ??????? ??????? ???????????? ????? ???????????
??????? ????? ?? ?????? ? ??????? - 1M NetflixPrize
- ???????? Netflix ???????? ???????????? ??
????????? ????????? ???????????? DVD - ?????? ????? 1 ??????? ????????
- ??????? ????????? ????? ???????? ???????? ???????????? ?? 10%
- ???? ???????? ??????? BellKor's Pragmatic Chaos ? 2009 ?.
- ???????????? ????????? ? 2006 ?? 2009 ?.
- ?????? ??? ???????????? ???? ?? ???????? $50 000
- http://www.netflixprize.com/
??????? ????????????
- ?? ?????? ????????
- ?????? ???????????? ?????? ????? ???????, ??????
????? ????????????? ??? ?????? ????? ??????? ???
???????????? ?????????? - ?????????? ??????? ???????????? ?????? ???????
?? ?????? ? ???????????? - ?? ?????? ????????????
- ???????????? ?? ?????? ?????? ?????????????
- ??????? ???????????? ????? ???? ??????
- ????? ??????????? ????? ???????, ?? ????????? ??
?????? - ?????????? ? Mahout
????????????
- ???????????? ? Mahout ???????? ?? ??????
???????????? ????????????? - ???????????? ? Mahout
- ???????????? (????? ?????)
- ?????? (????? ?????)
- ???????????? (????? ??????? ????????)
- ?????? ?????? ? ????????????? ??? Mahout ?? ??????? GroupLens (??????????? ????????) ?????? ?????????????? ???????
- 196 242 3 881250949
- 186 302 3 891717742
- 22 377 1 878887116
- 244 51 2 880606923
- Columns: user id, item id, rating, timestamp (the timestamp is not used in Mahout)
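A line in this format can be split into its four columns as follows. This is a minimal sketch assuming whitespace- or comma-separated fields; the RatingLine class is hypothetical and is not how Mahout's FileDataModel is implemented.

```java
// Sketch: parse one "user id, item id, rating, timestamp" line.
// RatingLine is an illustrative class, not part of Mahout.
class RatingLine {
    final long userId;
    final long itemId;
    final float rating;
    final long timestamp; // 0 when the column is absent

    RatingLine(long userId, long itemId, float rating, long timestamp) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
        this.timestamp = timestamp;
    }

    // Accepts fields separated by spaces, tabs, or commas; the timestamp is optional.
    static RatingLine parse(String line) {
        String[] fields = line.trim().split("[\\s,]+");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected at least 3 fields: " + line);
        }
        long ts = fields.length > 3 ? Long.parseLong(fields[3]) : 0L;
        return new RatingLine(Long.parseLong(fields[0]), Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]), ts);
    }

    public static void main(String[] args) {
        RatingLine r = parse("196 242 3 881250949");
        System.out.println(r.userId + " " + r.itemId + " " + r.rating); // 196 242 3.0
    }
}
```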
??????? ? ????????????
- ?? ?????? ?????????????
- ????? ????????????? ? ???????? ???????
- ??????????, ??? ???????? ???? ?????????????
- ????????????? ??????? ? ???????????? ?
?????????????? ??????? ????????????? - ?????????? ?????? ????? ??????????????,
???????????? ?????? ???????? - ?? ?????? ????????
- ????? ???????, ??????? ?? ??, ??????? ???????????
???????????? - ????????????? ???????? ?????????? ?? ???
- ???????????? ?????? ??????????????, ??????
???????? ???????? ?????. ???????????? ?????
???????????? ? ?????????? ?????? (?
?????????????? Hadoop)
???????????? ?? ?????? ?????????????
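import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;
public class UserBasedExample { // class name is illustrative; the slide shows only main()
    public static void main(String[] args) throws Exception {
        // Load (user id, item id, rating) preferences from a file
        DataModel model = new FileDataModel(new File("u.data"));
        // Pearson correlation as the user similarity measure
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        // Neighborhood of the 2 most similar users
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Ask for 1 recommendation for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}
RecommendedItem[item:643, value:4.27682]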
???????????? ?? ?????? ?????????????
Source: Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action
????? ??????? ?????????????
- ??? ??????????, ??? ????? ????????????? ???????
- ???? ????????? - ????? ?? -1 ?? 1.
- 1 ????? ????????????? ?????????
- 0 ? ????????????? ??? ????? ??????
- -1 ????? ????????????? ??????????????
- Mahout ?????????? ????????? ?????????? ???????
????????? - ??????????? ???????
- ????????? ??????????
- ?????????? ????????
- ??????????? ????????
- ??????????????? ?????????????
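The user-based example in this deck uses PearsonCorrelationSimilarity, a measure that ranges from -1 to 1. As a rough illustration of where such a number comes from, here is a plain-Java sketch of the Pearson correlation computed over the items both users have rated. It is a textbook formula under simplifying assumptions, not Mahout's implementation, and the class name and sample ratings are made up.

```java
import java.util.Map;

// Sketch: Pearson correlation between two users' ratings (item id -> rating).
// PearsonSketch is an illustrative name; this is not Mahout code.
class PearsonSketch {

    // Uses only items rated by both users; returns NaN when the value is undefined.
    static double pearson(Map<Long, Double> a, Map<Long, Double> b) {
        int n = 0;
        double sumA = 0, sumB = 0, sumAA = 0, sumBB = 0, sumAB = 0;
        for (Map.Entry<Long, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other == null) {
                continue; // item not rated by the second user
            }
            double x = e.getValue();
            double y = other;
            n++;
            sumA += x;
            sumB += y;
            sumAA += x * x;
            sumBB += y * y;
            sumAB += x * y;
        }
        if (n < 2) {
            return Double.NaN; // fewer than two common items
        }
        double cov = sumAB - sumA * sumB / n;
        double varA = sumAA - sumA * sumA / n;
        double varB = sumBB - sumB * sumB / n;
        double denom = Math.sqrt(varA * varB);
        return denom == 0 ? Double.NaN : cov / denom;
    }

    public static void main(String[] args) {
        Map<Long, Double> u1 = Map.of(1L, 1.0, 2L, 2.0, 3L, 3.0);
        Map<Long, Double> u2 = Map.of(1L, 2.0, 2L, 4.0, 3L, 6.0); // same taste as u1
        Map<Long, Double> u3 = Map.of(1L, 3.0, 2L, 2.0, 3L, 1.0); // opposite taste
        System.out.println(pearson(u1, u2)); // 1.0
        System.out.println(pearson(u1, u3)); // -1.0
    }
}
```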
???????? ????????????
Fixed number of neighbors (NearestNUserNeighborhood)
Neighbors within a threshold (ThresholdUserNeighborhood)
Source: Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action
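The two neighborhood types named on this slide differ only in how neighbors are selected from the similarity scores: a fixed count of the most similar users, or every user above a similarity cut-off. The sketch below shows that difference in plain Java. The class, the method names, and the similarity values are invented for illustration and do not reproduce Mahout's classes.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

// Sketch: two ways to pick a user's neighborhood from similarity scores
// (other user id -> similarity). Illustrative only, not Mahout code.
class NeighborhoodSketch {

    // Fixed-size neighborhood: the n users with the highest similarity.
    static List<Long> nearestN(Map<Long, Double> sims, int n) {
        List<Map.Entry<Long, Double>> entries = new ArrayList<>(sims.entrySet());
        entries.sort(Map.Entry.<Long, Double>comparingByValue().reversed());
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, Double> e : entries.subList(0, Math.min(n, entries.size()))) {
            result.add(e.getKey());
        }
        return result;
    }

    // Threshold neighborhood: every user whose similarity is at least the threshold.
    static List<Long> aboveThreshold(Map<Long, Double> sims, double threshold) {
        List<Long> result = new ArrayList<>();
        for (Map.Entry<Long, Double> e : sims.entrySet()) {
            if (e.getValue() >= threshold) {
                result.add(e.getKey());
            }
        }
        Collections.sort(result); // deterministic order by user id
        return result;
    }

    public static void main(String[] args) {
        // Made-up similarities of users 2..5 to the target user
        Map<Long, Double> sims = Map.of(2L, 0.9, 3L, 0.4, 4L, -0.2, 5L, 0.7);
        System.out.println(nearestN(sims, 2));         // [2, 5]
        System.out.println(aboveThreshold(sims, 0.3)); // [2, 3, 5]
    }
}
```

With a fixed count the neighborhood may include weakly similar users when few good matches exist; with a threshold it may be empty or very large, depending on the data.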
????? ??????????
- ????? ??? ????????? ????????????? ??????
- ????? ??? ????????? ??????
- ???????? ??????
- ???????????? ?????? ???
- ?????????? ?????? ??? ?????? ??????
- ????????? ???????????? ? ??????? ???????????!
Source: Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action
???????????? ?? ?????? ????????
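import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;
public class ItemBasedExample { // class name is illustrative; the slide shows only main()
    public static void main(String[] args) throws Exception {
        DataModel dataModel = new FileDataModel(new File("u.data"));
        // Item-to-item similarity based on the log-likelihood ratio
        ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
        ItemBasedRecommender recommender = new GenericItemBasedRecommender(dataModel, itemSimilarity);
        // Ask for 1 recommendation for user 1
        List<RecommendedItem> recommendations = recommender.recommend(1, 1);
        for (RecommendedItem recommendation : recommendations) {
            System.out.println(recommendation);
        }
    }
}
RecommendedItem[item:271, value:4.27682]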
Mahout ? Hadoop
Source: Sean Owen, Robin Anil, Ted Dunning, and Ellen Friedman. Mahout in Action
Mahout ? Hadoop
- Mahout can run standalone or in a Hadoop cluster
- Mahout recommendations are run in Hadoop using the RecommenderJob class
- Preference data must be written to HDFS
- The resulting recommendations are written to HDFS
- Recommendations can be transferred to a database using sqoop
?????? ??????? Mahout ? Hadoop
hadoop jar mahout-core-0.7-job.jar \
  org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
  -Dmapred.input.dir=input \
  -Dmapred.output.dir=output \
  --usersFile users_list.txt
- ????????? ?????????
- -Dmapred.input.dir: directory with the preference data (in HDFS; may contain several files)
- -Dmapred.output.dir: directory where the generated recommendations are written (in HDFS)
- --usersFile: file with the IDs of the users for whom recommendations should be generated
- --similarityClassname: name of the class that implements the similarity computation
- --numRecommendations: number of recommendations per user
Summary
- MapReduce: a programming model for processing large volumes of data (Big Data)
- Hadoop: an open-source implementation of MapReduce
- The Hadoop ecosystem
- Mahout: machine learning in Hadoop
- Recommendations, clustering, classification
- Recommendations in Mahout
- ???????????? ????????????, ??????, ??????
- User-based and item-based recommendations
- User and item similarity
- User neighborhood
- Running Mahout RecommenderJob in Hadoop
???????? ???????? ?????? ??????? avs_at_imm.uran.ru www.asozykin.ru