Pass4sure CCD-410 Cloudera Certified Developer - PowerPoint PPT Presentation

About This Presentation
Title:

Pass4sure CCD-410 Cloudera Certified Developer

Description:

Big data company Cloudera is preparing to launch major new open-source software for storing and serving lots of different kinds of unstructured data, with an eye toward challenging heavyweights in the database business, VentureBeat has learned.

Transcript and Presenter's Notes


1
Cloudera Certified Developer for Apache Hadoop
(CCDH)
2
Who We Are
Mission: To help organizations profit from their
data
  • How We Do It
  • We deliver relevant products and services.
  • A distribution of Apache Hadoop that is tested,
    certified and supported
  • Comprehensive support and professional service
    offerings
  • A suite of management software for Hadoop
    operations
  • Training and certification programs for
    developers, administrators, managers and data
    scientists
  • Technical Team
  • Unmatched knowledge and experience.
  • Founders, committers and contributors to Hadoop
  • A wealth of experience in the design and delivery
    of production software
  • Credentials
  • The Apache Hadoop experts.
  • Number 1 distribution of Apache Hadoop in the
    world
  • Largest contributor to the open source Hadoop
    ecosystem
  • More committers on staff than any other company
  • More than 100 customers across a wide variety of
    industries
  • Strong growth in revenue and new accounts

Leadership: Strong executive team with proven
abilities.
  • Mike Olson (CEO)
  • Kirk Dunn (COO)
  • Charles Zedlewski (VP, Product)
  • Mary Rorabaugh (CFO)
  • Jeff Hammerbacher (Chief Scientist)
  • Amr Awadallah (VP Engineering)
  • Doug Cutting (Chief Architect)
  • Omer Trajman (VP, Customer Solutions)
3
Users of Cloudera
Retail Consumer
Financial
Web
Media
Telecom
https://www.pass4sureexam.com/ccD-410.html
4
What is Apache Hadoop?
CORE HADOOP COMPONENTS
  • Hadoop is a platform for data storage and
    processing that is:
  • Scalable
  • Fault tolerant
  • Open source
  • Scalability
  • Scale-out architecture divides workloads across
    multiple nodes
  • Flexible file system eliminates ETL bottlenecks
  • Low Cost
  • Can be deployed on commodity hardware
  • Open source platform guards against vendor lock-in
  • Flexibility
  • A single repository for storing, processing and
    analyzing any type of data
  • Not bound by a single schema

5
What Makes Hadoop Different?
  • Ability to scale out to Petabytes in size using
    commodity hardware
  • Processing (MapReduce) jobs are sent to the data
    versus shipping the data to be processed
  • Hadoop doesn't impose a single data format, so it
    can easily handle structured, semi-structured and
    unstructured data
  • Manages fault tolerance and data replication
    automatically

6
Why the Need for Hadoop?
1.8 trillion gigabytes of data was created in
2011
  • More than 90% is unstructured data
  • Approx. 500 quadrillion files
  • Quantity doubles every 2 years

[Chart: gigabytes of data created (in billions), 2005 to 2015; y-axis from 0 to 10,000]
Source: IDC 2011
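The doubling claim can be turned into a rough projection. The following is a minimal sketch, assuming the 2011 figure of 1.8 trillion gigabytes as the baseline and a strict doubling every two years; the function name `projected_gigabytes` and its parameters are illustrative and do not come from the slide.

```python
def projected_gigabytes(year, base_year=2011, base_gb=1.8e12, doubling_years=2):
    """Project data created in `year`, assuming steady doubling every `doubling_years`."""
    elapsed = year - base_year
    return base_gb * 2 ** (elapsed / doubling_years)


if __name__ == "__main__":
    # Print the assumed baseline and two later projections.
    for year in (2011, 2013, 2015):
        print(year, f"{projected_gigabytes(year):.2e} GB")
```

Under these assumptions the volume quadruples between 2011 and 2015; real growth will not follow a clean exponential.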
7
Hadoop Use Cases
Application
Application
Industry
Use Case
Use Case
Social Network Analysis
Clickstream Sessionization
Clickstream Sessionization
Content Optimization
Network Analytics
Mediation
Loyalty & Promotions Analysis
ADVANCED ANALYTICS
Data Factory
DATA PROCESSING
Fraud Analysis
Trade Reconciliation
Entity Analysis
SIGINT
Sequencing Analysis
Genome Mapping
8
Hadoop in the Enterprise
ANALYSTS
BUSINESS USERS
OPERATORS
ENGINEERS
IDEs
BI / Analytics
Enterprise Reporting
Management Tools
Enterprise Data Warehouse
CUSTOMERS
Web Application
Logs
Files
Web Data
Relational Databases
9
What is CDH?
  • Cloudera's Distribution Including Apache Hadoop
    (CDH) is an enterprise-ready distribution of
    Hadoop that is:
  • 100% Apache open source
  • Contains all components needed for deployment
  • Fully documented and supported
  • Released on a reliable schedule
  • Stable and Reliable
  • Extensive Cloudera QA systems, software
    processes
  • Tested and run in production at scale
  • Proven at scale in dozens of enterprise
    environments
  • Community Driven
  • Incorporates only main-line components from the
    Apache Hadoop ecosystem: no forks or proprietary
    underpinnings
  • FREE
  • Fastest Path to Success
  • No need to write your own scripts or do
    integration testing on different components
  • Works with a wide range of operating systems,
    hardware, databases and data warehouses

10
Cloudera's Commitment to the Open Source Community
Component   Cloudera Committers   Cloudera Founder   2011 Commits
Common      6                     Yes                1
HDFS        6                     Yes                2
MapReduce   5                     Yes                1
HBase       2                     No                 2
Zookeeper   1                     Yes                2
Oozie       1                     Yes                1
Pig         0                     No                 3
Hive        1                     No                 2
Sqoop       2                     Yes                1
Flume       3                     Yes                1
Hue         3                     Yes                1
Snappy      2                     No                 1
Bigtop      8                     Yes                1
Avro        4                     Yes                1
Whirr       2                     Yes                1
11
Components of CDH
Cloudera Enterprise
  • User Interface: HUE
  • Workflow: APACHE OOZIE
  • Scheduling: APACHE OOZIE
  • File System Mount: FUSE-DFS
  • Data Integration: APACHE FLUME, APACHE SQOOP
  • Fast Read/Write Access: APACHE HBASE
  • Languages / Compilers: APACHE PIG, APACHE HIVE
  • Coordination: APACHE ZOOKEEPER
12
Hadoop Distributed File System
Block Size: 64 MB; Replication Factor: 3
[Diagram: file blocks 1-5, each replicated across multiple DataNodes]
Cost is $400-500/TB
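The block size and replication factor above determine how a file is laid out. The following is a minimal sketch of that arithmetic, using the slide's values of 64 MB and 3; the function name `hdfs_footprint` and the 300 MB example file are hypothetical.

```python
import math

BLOCK_SIZE_MB = 64       # block size from the slide
REPLICATION_FACTOR = 3   # replication factor from the slide


def hdfs_footprint(file_size_mb):
    """Return (number of blocks, raw storage in MB) for one file."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # The final block only occupies as much disk as it holds data,
    # so raw usage is the file size times the number of replicas.
    raw_mb = file_size_mb * REPLICATION_FACTOR
    return blocks, raw_mb


if __name__ == "__main__":
    blocks, raw_mb = hdfs_footprint(300)  # hypothetical 300 MB file
    print(blocks, "blocks,", blocks * REPLICATION_FACTOR, "replicas,", raw_mb, "MB raw")
```

A 300 MB file splits into five blocks, which happens to match the five numbered blocks in the diagram, and consumes 900 MB of raw disk across the cluster.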
13
Components of Hadoop
  • NameNode: Holds all metadata for HDFS
  • Needs to be a highly reliable machine
  • RAID drives, typically RAID 10
  • Dual power supplies
  • Dual network cards, bonded
  • The more memory the better: typically 36 GB to
    64 GB
  • Secondary NameNode: Provides checkpointing for
    the NameNode. The same hardware as the NameNode
    should be used

14
Components of Hadoop
  • DataNodes: Hardware will depend on the specific
    needs of the cluster
  • No RAID needed, JBOD (just a bunch of disks) is
    used
  • Typical ratio is:
  • 1 hard drive
  • 2 cores
  • 4GB of RAM
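As a rough illustration of the ratio above, the sketch below scales cores and memory with the number of drives in a DataNode. The function name `datanode_profile` and the 12-drive example are assumptions; actual sizing depends on the workload.

```python
def datanode_profile(num_drives, cores_per_drive=2, ram_gb_per_drive=4):
    """Scale cores and RAM with drive count using the 1 drive : 2 cores : 4 GB ratio."""
    return {
        "drives": num_drives,
        "cores": num_drives * cores_per_drive,
        "ram_gb": num_drives * ram_gb_per_drive,
    }


if __name__ == "__main__":
    print(datanode_profile(12))  # hypothetical 12-drive node
```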

15
Networking
  • One of the most important things to consider when
    setting up a Hadoop cluster
  • Typically a top-of-rack switch is used with
    Hadoop, together with a core switch
  • Be careful not to oversubscribe the backplane of
    the switch!

16
Map
  • Records from the data source (lines out of files,
    rows of a database, etc.) are fed into the map
    function as key-value pairs, e.g., (filename,
    line).
  • map() produces one or more intermediate values
    along with an output key from the input.
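A map function of this shape can be sketched in a few lines. The example below is plain Python rather than the Hadoop Java API, and it assumes a word-count job: the input pair is (filename, line) and each emitted pair is (word, 1). The name `map_fn` is illustrative.

```python
def map_fn(filename, line):
    """Emit one intermediate (word, 1) pair per word in the input line."""
    for word in line.split():
        yield (word.lower(), 1)


if __name__ == "__main__":
    for pair in map_fn("notes.txt", "The quick the"):
        print(pair)
```

The input key (the filename) is ignored here; the output key is chosen by the map function itself.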

[Diagram: (key, values) inputs → Map Task → Shuffle Phase → (key 1, int. values) → Reduce Task → Final (key, values)]
17
Reduce
  • After the map phase is over, all the intermediate
    values for a given output key are combined
    together into a list
  • reduce() combines those intermediate values into
    one or more final values for that same output key
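The grouping and combining steps can be sketched as follows. This is plain Python, not the Hadoop API, and assumes word-count style intermediate pairs; `shuffle` and `reduce_fn` are illustrative names.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group intermediate values by output key, as the shuffle phase does."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_fn(key, values):
    """Combine all intermediate values for one key into a single final value."""
    return (key, sum(values))


if __name__ == "__main__":
    intermediate = [("the", 1), ("quick", 1), ("the", 1)]
    for key, values in sorted(shuffle(intermediate).items()):
        print(reduce_fn(key, values))
```

Each call to `reduce_fn` sees only the values for one key, which is what lets a cluster run many reduce calls in parallel.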

18
MapReduce Execution
19
Sqoop
  • SQL to Hadoop
  • Tool to import/export any JDBC-supported database
    into Hadoop
  • Transfer data between Hadoop and external
    databases or EDW
  • High performance connectors for some RDBMS
  • Developed at Cloudera

20
Flume
  • Distributed, reliable, available service for
    efficiently moving large amounts of data as it is
    produced
  • Suited for gathering logs from multiple systems
  • Inserting them into HDFS as they are generated
  • Design goals
  • Reliability, Scalability, Manageability,
    Extensibility
  • Developed at Cloudera

21
Flume high-level architecture
[Diagram: MASTER, four Agents, two Processors and Collector(s); decorators shown: encrypt, batch, compress]
  • Master sends configuration to all Agents
  • Configurable levels of reliability: guarantee
    delivery in event of failure
  • Deployable, centrally administered
  • Optionally pre-process incoming data: perform
    transformations, suppressions, metadata enrichment
  • Writes to multiple HDFS file formats (text,
    sequence, JSON, Avro, others)
  • Parallelized writes across many collectors: as
    much write throughput as
  • Flexibly deploy decorators at any step to improve
    performance, reliability or security
22
HBase
  • Column-family store. Based on design of Google
    BigTable
  • Provides interactive access to information
  • Holds extremely large datasets (multi-TB)
  • Constrained access model
  • (key, value) lookup
  • Limited transactions (only one row)
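The constrained access model can be illustrated with a toy in-memory store. This sketch is not the HBase API; the class `TinyColumnFamilyStore` and its methods are hypothetical, and it only models lookup by key and all-or-nothing updates confined to one row.

```python
from collections import defaultdict


class TinyColumnFamilyStore:
    """Toy in-memory model of a row-keyed, column-family store (not the HBase API)."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> column family -> qualifier -> value
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put_row(self, row_key, cells):
        """Apply several cell updates to ONE row, all or nothing."""
        for family, _qualifier in cells:
            if family not in self.families:
                raise KeyError(f"unknown column family: {family}")
        # Validation passed, so every update to this single row is applied.
        for (family, qualifier), value in cells.items():
            self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier):
        """Look up one value by (row key, family, qualifier); None if absent."""
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier)
```

There is deliberately no method that updates two rows at once, mirroring the single-row transaction limit noted above.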

23
HBase
24
Hive
  • SQL-based data warehousing application
  • Language is SQL-like
  • Supports SELECT, JOIN, GROUP BY, etc.
  • Features for analyzing very large data sets
  • Partition columns, Sampling, Buckets
  • Example
  • SELECT s.word, s.freq, k.freq FROM shakespeare s
  • JOIN ON (s.word = k.word) WHERE s.freq > 5
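To see what a query of this shape returns, the sketch below runs an equivalent join in SQLite from Python. SQLite stands in for Hive purely for illustration. The name of the joined table is not legible in the slide, so `other_corpus` is a hypothetical placeholder, and the comparison operator and sample rows are assumptions as well.

```python
import sqlite3


def run_word_join(shakespeare_rows, other_rows, min_freq=5):
    """Join two word-frequency tables on word, keeping words above min_freq."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE shakespeare (word TEXT, freq INTEGER)")
    # Hypothetical second table; the slide does not show its name.
    conn.execute("CREATE TABLE other_corpus (word TEXT, freq INTEGER)")
    conn.executemany("INSERT INTO shakespeare VALUES (?, ?)", shakespeare_rows)
    conn.executemany("INSERT INTO other_corpus VALUES (?, ?)", other_rows)
    rows = conn.execute(
        "SELECT s.word, s.freq, k.freq FROM shakespeare s "
        "JOIN other_corpus k ON (s.word = k.word) "
        "WHERE s.freq > ? ORDER BY s.word",
        (min_freq,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_word_join([("the", 100), ("thou", 3)], [("the", 80), ("thou", 1)]))
```

Hive would compile a query like this into MapReduce jobs over files in HDFS rather than executing it in a single process.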

25
Pig
  • Data-flow oriented language: Pig Latin
  • Datatypes include sets, associative arrays,
    tuples
  • High-level language for routing data, allows
    easy integration of Java for complex tasks
  • Example
  • emps = LOAD 'people.txt' AS (id, name, salary);
  • rich = FILTER emps BY salary > 100000;
  • srtd = ORDER rich BY salary DESC;
  • STORE srtd INTO 'rich_people.txt';
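The same data flow can be mirrored step by step in plain Python, which may help readers new to Pig Latin. This is a sketch that assumes tab-separated input rows of (id, name, salary); the function name `rich_people` is illustrative, and the STORE step is left to the caller.

```python
import csv


def rich_people(lines, threshold=100000):
    """Mirror LOAD -> FILTER -> ORDER on tab-separated (id, name, salary) rows."""
    reader = csv.reader(lines, delimiter="\t")
    # LOAD ... AS (id, name, salary)
    emps = [(row[0], row[1], int(row[2])) for row in reader]
    # FILTER emps BY salary > threshold
    rich = [emp for emp in emps if emp[2] > threshold]
    # ORDER rich BY salary DESC
    return sorted(rich, key=lambda emp: emp[2], reverse=True)


if __name__ == "__main__":
    sample = ["1\tAnn\t150000", "2\tBob\t90000", "3\tCy\t200000"]
    for row in rich_people(sample):
        print(row)
```

Each Pig statement names an intermediate relation, just as each Python step here names an intermediate list.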

26
Oozie
Oozie is a workflow/coordination service to manage
data processing jobs for Hadoop
27
Zookeeper
  • Zookeeper is a distributed consensus engine
  • Provides well-defined concurrent access
    semantics
  • Leader election
  • Service discovery
  • Distributed locking / mutual exclusion
  • Message board / mailboxes

28
Pipes and Streaming
  • Multi-language connector libraries for MapReduce
  • Write native-code MapReduce in C++
  • Write MapReduce passes in any scripting language,
    including:
  • Perl
  • Python
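A Streaming-style job in Python usually amounts to two small line-oriented functions: one that turns input lines into tab-separated key/value records, and one that aggregates records that arrive sorted by key. The sketch below assumes a word-count job; `mapper` and `reducer` are illustrative names, and wiring them to standard input and output in separate scripts is left out.

```python
from itertools import groupby


def mapper(lines):
    """Turn raw input lines into tab-separated 'word<TAB>1' records."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum counts per word; input records must already be sorted by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(count) for _, count in group)
        yield f"{word}\t{total}"
```

Locally, `sorted(mapper(lines))` stands in for the sort that the framework performs between the two stages.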

29
FUSE - DFS
  • Allows mounting of HDFS volumes via Linux FUSE
    file system
  • Does allow easy integration with other systems
    for data import/export
  • Does not imply HDFS can be used as a
    general-purpose file system

30
Hadoop Security
  • Authentication is secured by Kerberos v5 and
    integrated with LDAP
  • Hadoop server can ensure that users and groups
    are who they say they are
  • Job Control includes Access Control Lists, which
    means Jobs can specify who can view logs,
    counters, configurations and who can modify a job
  • Tasks now run as the user who launched the job

31
Cloudera Enterprise
Cloudera Enterprise makes open source Hadoop
enterprise-easy
CLOUDERA ENTERPRISE COMPONENTS
  • Simplify and Accelerate Hadoop Deployment
  • Reduce Adoption Costs and Risks
  • Lower the Cost of Administration
  • Increase the Transparency & Control of Hadoop
  • Leverage the Experience of Our Experts

EFFECTIVENESS: Ensuring You Get Value From Your
Hadoop Deployment
EFFICIENCY: Enabling You to Affordably Run Hadoop
in Production
32
Cloudera Manager
The industry's first end-to-end management
application for Apache Hadoop
Proactively manages the Apache Hadoop stack
Automates the full operational lifecycle of
Apache Hadoop
33
Cloudera Enterprise
  • Demo

34
Cloudera Enterprise
Including Cloudera Support
Feature: Benefit
Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza and Revolution Analytics
Notification of New Developments and Events: Stay up to speed with what's going on in the Apache Hadoop community
35
Cloudera University
Public and Private Training to Enable Your Success
Class: Description
Developer Training & Certification (4 Days): Hands-on training and certification for developers who want to analyze their data but are new to Apache Hadoop
System Administrator Training & Certification (3 Days): Hands-on training and certification for administrators who will be responsible for setting up, configuring and monitoring an Apache Hadoop cluster
HBase Training (2 Days): Covers the HBase architecture, data model, and Java API as well as some advanced topics and best practices
Analyzing Data with Hive and Pig (2 Days): Hive and Pig training is designed for people who have a basic understanding of how Apache Hadoop works and want to utilize these languages for analysis of their data
Essentials for Managers (1 Day): Provides decision-makers the information they need to know about Apache Hadoop, answering questions such as "When is Hadoop appropriate?", "What are people using Hadoop for?" and "What do I need to know about choosing Hadoop?"
36
Cloudera Consulting Services
Put Our Expertise To Work For You.
Cloudera's team of Solutions Architects provides
guidance and hands-on expertise to address unique
enterprise challenges.
Service: Description
Use Case Discovery: Assess the appropriateness and value of Hadoop for your organization
New Hadoop Deployment: Set up and configure high performance, production-ready Hadoop clusters
Proof of Concept: Verify the prototype functionality and project feasibility for a new Hadoop cluster
Production Pilot: Deploy your first production-level project using Hadoop
Process and Team Development: Define the requirements and processes for creating a new Hadoop team
Hadoop Deployment Certification: Perform periodic health checks to certify and tune up existing Hadoop clusters
37
Journey of the Cloudera Customer
  • Discover the Benefits of Apache Hadoop:
    Flexibility to store and mine all types of data
  • Cloudera's Distribution: The fastest, surest path
    to success with Apache Hadoop
  • Subscribe to Cloudera Enterprise: Simplify and
    accelerate Apache Hadoop deployment
38
Cloudera in Production
  • Consulting Services
  • Cloudera University

Cloudera Services
ANALYSTS
BUSINESS USERS
CUSTOMERS
OPERATORS
ENGINEERS
Cloudera Enterprise
  • Cloudera Management Suite
  • Cloudera Support

IDEs
BI / Analytics
Enterprise Reporting
Management Tools
Web Application
Enterprise Data Warehouse
Cloudera's Distribution Including Apache Hadoop
(CDH) SCM Express
Operational Rules Engines
Logs
Files
Web Data
Relational Databases
39
Cloudera helps you profit from all your data.
Get Hadoop
twitter.com/cloudera
facebook.com/cloudera
40
Cloudera Manager
The first and only Hadoop management application
that:
  1. Manages the full Hadoop lifecycle
  2. Manages and monitors the complete Hadoop stack
  3. Incorporates comprehensive log and event management
  4. Has Technical Support integration built-in
41
Cloudera Manager
Key Features and Functionality
ONLY CLOUDERA
Automated Deployment: Installs the complete Hadoop stack in minutes. The simple, wizard-based interface guides you through the steps.
Centralized Management: Gives you complete, end-to-end visibility and control over your Hadoop cluster from a single interface.
Service Configuration Management: Set server roles, configure services and manage security across the cluster. Gracefully start, stop and restart services as needed.
Audit Trails: Maintains a complete record of configuration changes for SOX compliance.
Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds.
Intelligent Log Management: Gather, view and search Hadoop logs collected from across the cluster. Scans Hadoop logs for irregularities and warns you before they impact the cluster.
42
Cloudera Manager
Key Features and Functionality
ONLY CLOUDERA
Global Time Control: Establishes the time context globally for almost all views. Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis.
Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution.
Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching.
Alerting: Generates email alerts when certain events occur.
Operational Reports: Visualize current and historical disk usage by user, group and directory. Track MapReduce activity on the cluster by job or user.
Host Level Monitoring: View information pertaining to hosts in your cluster including status, resident memory, virtual memory and roles.
43
Two Editions
FREE EDITION
ENTERPRISE EDITION
Max Number of Nodes Supported: 50 (Free Edition), Unlimited (Enterprise Edition)
Automated Deployment
Host-Level Monitoring
Secure Communication Between Server & Agents
Configuration Management
Manage HDFS, MapReduce, HBase, Hue, Oozie & Zookeeper
Audit Trails
Start/Stop/Restart Services
Add/Restart/Decommission Role Instances
Configuration Versioning & History
Support for Kerberos
Service Monitoring
Proactive Health Checks
Status Health Summary
Intelligent Log Management
Events Management Alerts
Activity Monitoring
Operational Reporting
Global Time Control
Support Integration
Part of the Cloudera Enterprise subscription
44
View Service Health and Performance
45
Get Host-Level Snapshots
46
Monitor and Diagnose Cluster Workloads
47
Gather, View and Search Hadoop Logs
48
Track Events From Across the Cluster
49
Run Reports on System Performance & Usage
50
New in Cloudera Manager 3.7
ONLY CLOUDERA
1. Proactive Health Checks: Monitors dozens of service performance metrics and alerts you when you approach critical thresholds
2. Intelligent Log Management: Gathers and scans Hadoop logs for irregularities and warns you before they impact the cluster
3. Global Time Control: Correlates jobs, activities, logs, system changes, configuration changes and service metrics along a single timeline to simplify diagnosis
4. Support Integration: Takes a snapshot of the cluster state and automatically sends it to Cloudera support to assist with resolution
5. Event Management: Creates and aggregates relevant Hadoop events pertaining to system health, log messages, user services and activities and makes them available for alerting and searching
6. Alerts: Generates email alerts when certain events occur
7. Audit Trails: Maintains a complete record of configuration changes for SOX compliance
8. Operational Reporting: Visualize current and historical disk usage by user, group and directory and track MapReduce activity on the cluster by job or user
51
Cloudera Support
Our team of experts on call to help you meet your
SLAs
Feature: Benefit
Flexible Support Windows: Choose from 8x5 or 24x7 options to meet SLA requirements
Configuration Checks: Verify that your Hadoop cluster is fine-tuned for your environment
Issue Resolution and Escalation Processes: Proven processes ensure that support cases get resolved with maximum efficiency
Comprehensive Knowledgebase: Browse through hundreds of Articles and Tech Notes to expand upon your knowledge of Apache Hadoop
Certified Connectors: Connect your Apache Hadoop cluster to your existing data analysis tools such as IBM Netezza, Revolution Analytics, and MicroStrategy
Proactive Notification of New Developments and Events: Stay up to speed with what's going on in the Apache Hadoop community
52
Cloudera Enterprise
The Fastest Path to Success Running Apache Hadoop
in Production.
Why Cloudera Enterprise?
  • Apache Hadoop is a distributed system that
    presents unique operational challenges
  • The fixed cost of managing an internal patch and
    release infrastructure is prohibitive
  • Apache Hadoop skills and expertise are scarce
  • It's challenging to track consistently to
    community development efforts

Only Cloudera Enterprise
  • Has a management application that supports the
    full lifecycle of operationalizing Apache Hadoop
  • Has production support backed by the Apache
    committers
  • Has the depth of experience supporting hundreds
    of production Apache Hadoop clusters
53
Hadoop Distributed File System
Block Size: 64 MB; Replication Factor: 3
Cost is $400-500/TB
54
MapReduce Distributed Processing
55
Thank you.