Apache Kudu

Description:

This presentation gives an overview of the Apache Kudu project. It explains Kudu in terms of its architecture, schema, partitioning and replication, and provides an example deployment scale. It closes with links for further information and for connecting with the author.

Slides: 13
Provided by: semtechs

Transcript and Presenter's Notes


1
What Is Apache Kudu?
  • A column-oriented data store
  • Open source / Apache 2.0 license
  • Written in C++
  • Provides fast processing of OLAP workloads
  • Integrates with
    • MapReduce, Spark, Hadoop ecosystem, Impala
  • Scales to large datasets and large clusters
  • Consistency requirements can be chosen on a per-request basis

2
Kudu Architecture
  • Kudu tables are split into units called tablets
  • A Kudu cluster may have multiple Masters
    • One Master leads whilst the others follow
  • Tablet servers store and serve tablets
  • Raft consensus is used to elect leaders and followers
  • A tablet server may lead some tablets and follow others
  • This architecture supports
    • Fault tolerance
    • High availability
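
As a rough illustration of why a Raft-based design tolerates failures, the sketch below computes the majority needed to elect a leader and the number of replicas that can be lost. It is a minimal sketch of the general Raft majority rule, not Kudu's implementation; the function names are hypothetical.

```python
def majority(replicas: int) -> int:
    """Smallest number of votes that forms a strict majority."""
    return replicas // 2 + 1


def tolerated_failures(replicas: int) -> int:
    """Replicas that can be lost while a majority is still reachable."""
    return replicas - majority(replicas)


def wins_election(votes: int, replicas: int) -> bool:
    """A candidate becomes leader once it holds a majority of votes."""
    return votes >= majority(replicas)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, "replicas: majority", majority(n),
              "tolerates", tolerated_failures(n), "failure(s)")
```

With three Masters, for example, two votes elect a leader and one Master can fail without losing availability.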

3
Kudu Architecture (diagram only; no transcript text)

4
Kudu Schema
  • Structured data model similar to an RDBMS
  • Three main concerns for schema design
    • Column design
    • Primary key design
    • Partitioning design
  • Kudu has strongly-typed columns
  • It uses a columnar on-disk storage format

5
Kudu Schema
  • Schema design should accomplish
    • Efficient partition design
    • Even distribution of data across tablet servers
    • Even distribution of reads/writes across tablet servers
    • Even growth of data across tablet servers
    • Scans that read the minimum amount of data
  • The last point is also impacted by
    • Primary key design

6
Kudu Partitioning
  • Partitioning involves
    • Partitioning tables into tablets across tablet servers
  • Partitioning affects performance
    • Aim to partition evenly across the cluster
  • Strategies include
    • Range, hash, multilevel
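
The three strategies can be sketched in a few lines of Python. This is an illustrative model only: `zlib.crc32` is a stand-in hash chosen for determinism (Kudu uses its own hash function), and the function names, the `host` key and the split points are assumptions made for the example.

```python
import bisect
import zlib
from typing import Sequence, Tuple


def hash_bucket(key: str, num_buckets: int) -> int:
    """Hash partitioning: spread keys across a fixed number of buckets.

    crc32 is only a stand-in here, not the hash Kudu actually uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def range_partition(value: int, split_points: Sequence[int]) -> int:
    """Range partitioning: index of the range that contains `value`.

    `split_points` must be sorted. Partition 0 holds values below the
    first split point; each split point is the inclusive lower bound
    of the next partition.
    """
    return bisect.bisect_right(split_points, value)


def multilevel_partition(host: str, ts: int, num_buckets: int,
                         split_points: Sequence[int]) -> Tuple[int, int]:
    """Multilevel partitioning: combine a hash level with a range level."""
    return hash_bucket(host, num_buckets), range_partition(ts, split_points)


if __name__ == "__main__":
    splits = [10, 20]  # three ranges: <10, 10..19, >=20
    for host, ts in [("host-1", 5), ("host-2", 15), ("host-3", 25)]:
        print(host, ts, multilevel_partition(host, ts, 4, splits))
```

Hashing tends to spread writes evenly, while ranges keep related values together so that a scan can skip partitions it does not need; a multilevel scheme aims to get both effects.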

7
Kudu Column Types
  • Supported column types include
    • boolean
    • 8-bit signed integer
    • 16-bit / 32-bit / 64-bit signed integer
    • date (32-bit days since the Unix epoch)
    • unixtime_micros (64-bit microseconds since the Unix epoch)
    • single-precision (32-bit) IEEE-754 floating-point number
    • double-precision (64-bit) IEEE-754 floating-point number
    • decimal
    • varchar
    • UTF-8 encoded string (up to 64KB uncompressed)
    • binary (up to 64KB uncompressed)
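
To make the two time-based encodings concrete, the sketch below converts Python values into the integers described above: days since the Unix epoch for date, and microseconds since the Unix epoch for unixtime_micros. The helper names are hypothetical and the code does not use a Kudu client; it only shows the arithmetic.

```python
from datetime import date, datetime, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH_DATETIME = datetime(1970, 1, 1, tzinfo=timezone.utc)


def date_to_days(d: date) -> int:
    """Encode a date as whole days since the Unix epoch."""
    return (d - EPOCH_DATE).days


def datetime_to_unixtime_micros(dt: datetime) -> int:
    """Encode a timezone-aware datetime as microseconds since the epoch.

    Integer arithmetic is used to avoid floating-point rounding.
    `dt` must be timezone-aware, otherwise the subtraction raises TypeError.
    """
    delta = dt - EPOCH_DATETIME
    seconds = delta.days * 86_400 + delta.seconds
    return seconds * 1_000_000 + delta.microseconds


if __name__ == "__main__":
    print(date_to_days(date(2000, 1, 1)))
    print(datetime_to_unixtime_micros(
        datetime(2000, 1, 1, tzinfo=timezone.utc)))
```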

8
Kudu Replication
  • Kudu is rack aware
    • It knows the server rack assignments
  • It replicates operations, not on-disk data
    • It performs logical replication, not physical
  • Inserts and updates do transmit data over the network
  • Deletes do not need to move any data
  • Compaction does not transmit the data over the network
  • Tablets performing compactions don't need to
    • Compact at the same time
    • Use the same schedule
    • Remain in synchronisation
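
The idea of logical replication can be modelled with a toy example: every replica receives the same operations and applies them locally, while compaction is a purely local step that each replica runs whenever it likes. This is a simplified sketch, not Kudu's storage engine; the `Op` and `Replica` classes and the tombstone list are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Op:
    """A logical operation that is sent to every replica."""
    kind: str                     # "upsert" or "delete"
    key: str
    value: Optional[str] = None   # unused for deletes


@dataclass
class Replica:
    """Toy replica: live rows plus delete markers awaiting compaction."""
    rows: Dict[str, str] = field(default_factory=dict)
    tombstones: List[str] = field(default_factory=list)

    def apply(self, op: Op) -> None:
        """Apply one replicated operation to local state."""
        if op.kind == "upsert":
            self.rows[op.key] = op.value
        elif op.kind == "delete":
            self.rows.pop(op.key, None)
            self.tombstones.append(op.key)
        else:
            raise ValueError(f"unknown operation: {op.kind}")

    def compact(self) -> None:
        """Local clean-up; nothing is exchanged with other replicas."""
        self.tombstones.clear()


def replicate(op: Op, replicas: List[Replica]) -> None:
    """Send the same operation to every replica."""
    for replica in replicas:
        replica.apply(op)


if __name__ == "__main__":
    replicas = [Replica() for _ in range(3)]
    replicate(Op("upsert", "k1", "v1"), replicas)
    replicate(Op("upsert", "k2", "v2"), replicas)
    replicate(Op("delete", "k1"), replicas)
    replicas[0].compact()  # only this replica compacts for now
    print([r.rows for r in replicas])
    print([r.tombstones for r in replicas])
```

After the run, all three replicas hold the same rows even though only one of them has compacted, which is the sense in which compactions need not stay in step.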

9
Kudu Replication Terms
  • Kudu hot replica
    • A tablet replica that is continuously receiving writes
  • Kudu cold replica
    • A tablet replica that is not hot
    • A replica that is not frequently receiving writes
  • Kudu data on disk
    • Total amount of data stored on a tablet server, across all disks

10
Kudu Example Scale
  • 3 master servers
  • 100 tablet servers
  • 8 TiB of stored data per tablet server
    • post-replication and post-compression
  • 1000 tablets per tablet server
    • post-replication
  • 60 tablets per table
    • per tablet server, at table-creation time
  • 10 GiB of stored data per tablet
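
The figures above can be combined into a few cluster-wide totals. The arithmetic below is a back-of-the-envelope sketch; the replication factor of 3 is an assumption (it is not stated on the slide), and the variable names are made up for the example.

```python
GIB_PER_TIB = 1024

# Figures taken from the slide
tablet_servers = 100
stored_tib_per_server = 8      # post-replication, post-compression
tablets_per_server = 1000      # post-replication
tablets_per_table_per_server = 60
max_gib_per_tablet = 10

# Assumption for illustration only: three replicas of every tablet
replication_factor = 3

total_stored_tib = tablet_servers * stored_tib_per_server
total_tablet_replicas = tablet_servers * tablets_per_server
unique_data_tib = total_stored_tib / replication_factor
tablets_per_table_at_creation = tablet_servers * tablets_per_table_per_server

# Average tablet size if a server reaches both per-server figures at once
avg_gib_per_tablet = stored_tib_per_server * GIB_PER_TIB / tablets_per_server

if __name__ == "__main__":
    print("Stored data across the cluster (TiB):", total_stored_tib)
    print("Tablet replicas across the cluster:", total_tablet_replicas)
    print("Unique data before replication (TiB):", round(unique_data_tib, 1))
    print("Tablets per table at creation:", tablets_per_table_at_creation)
    print("Average GiB per tablet:", avg_gib_per_tablet)
```

Under these assumptions the average tablet holds about 8.2 GiB, which sits below the 10 GiB per-tablet figure.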

11
Available Books
  • See Big Data Made Easy
    • Apress Jan 2015
  • See Mastering Apache Spark
    • Packt Oct 2015
  • See Complete Guide to Open Source Big Data Stack
    • Apress Jan 2018
  • Find the author on Amazon
    • www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
  • Connect on LinkedIn
    • www.linkedin.com/in/mike-frampton-38563020

12
Connect
  • Feel free to connect on LinkedIn
    • www.linkedin.com/in/mike-frampton-38563020
  • See my open source blog at
    • open-source-systems.blogspot.com/
  • I am always interested in
    • New technology
    • Opportunities
    • Technology-based issues
    • Big data integration