Apache Arrow

Description:

This presentation gives an overview of the Apache Arrow project. It explains Arrow in terms of its in-memory structure, its purpose, its language interfaces and the projects that support it. It closes with links for further information and for connecting with the author.

Slides: 12
Provided by: semtechs

Transcript and Presenter's Notes

1
What Is Apache Arrow?
  • A development platform for in-memory data
  • It has a columnar memory format
  • It provides efficient analytic operations on
    modern hardware
  • Used for in-memory processing
  • Cross-language support
  • Open source / Apache 2.0 license
  • Supports zero-copy reads for lightning-fast data
    access
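
The columnar layout and zero-copy reads mentioned above can be pictured with a small sketch. It uses only the Python standard library, not Arrow's own API, and the names (rows, ids, prices) are hypothetical: each field is held in one contiguous typed buffer, and a memoryview reads that buffer without copying it.

```python
from array import array

# Row-oriented layout: one tuple per record (illustrative data).
rows = [(1, 10.5), (2, 20.0), (3, 7.25)]

# Column-oriented layout: one contiguous typed buffer per field.
ids = array("q", (r[0] for r in rows))      # 64-bit signed integers
prices = array("d", (r[1] for r in rows))   # 64-bit floats

# A memoryview exposes the same bytes without copying them.
view = memoryview(prices)
tail = view[1:]          # slice of the column, still no copy

# Changing the buffer is visible through the view, which shows
# that both names refer to the same memory.
prices[2] = 8.0

# An analytic scan touches only the column it needs.
total = sum(prices)
```

Because tail is a view rather than a copy, tail[1] reflects the update made through prices.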

2
Languages supported
  • Arrow supports many languages
  • C
  • C++
  • C#
  • Go
  • Java
  • JavaScript
  • MATLAB
  • Python
  • R
  • Ruby
  • Rust

3
Open Source Community Support
  • Many open source projects support Arrow
  • Calcite
  • Cassandra
  • Drill
  • Hadoop
  • HBase
  • Ibis
  • Impala
  • Kudu
  • Pandas
  • Parquet
  • Phoenix
  • Spark
  • Storm

4
The problem Arrow tackles
  • Each system has its own internal memory format
  • 70-80% of computation wasted on serialization
    and de-serialization
  • Similar functionality implemented in multiple
    projects
  • Overheads for cross-system communication
  • All systems utilize different memory formats

5
The problem Arrow tackles
  • No shared in-memory data model

6
Arrow solves this problem
  • All systems utilize the same memory format
  • In memory
  • Columnar format
  • Optimized for modern CPUs and GPUs
  • No overhead for cross-system communication
  • Projects can share functionality
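
The difference between private formats and a shared format can be sketched as follows. This is an illustration in plain Python rather than Arrow code; the two exchange functions are hypothetical, with JSON standing in for whatever interchange format two systems might otherwise use.

```python
import json
from array import array


def system_a_produce():
    """Producer builds a column of 64-bit floats."""
    return array("d", [1.5, 2.5, 4.0])


def exchange_with_private_formats(column):
    """Each side has its own format, so data is serialized and parsed back."""
    wire = json.dumps(list(column))         # serialize
    return array("d", json.loads(wire))     # de-serialize into a new copy


def exchange_with_shared_format(column):
    """Both sides agree on the byte layout, so the buffer is handed over as is."""
    return memoryview(column)


column = system_a_produce()
copied = exchange_with_private_formats(column)
shared = exchange_with_shared_format(column)
```

Here copied holds the same values in a second buffer that had to be encoded and decoded, while shared reads the producer's buffer directly.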

7
Arrow solves this problem
  • Arrow shared data model

8
Arrow works with Parquet
  • Arrow is an in-memory format
  • Parquet is designed for disk storage
  • Arrow and Parquet are intended to be used
    together
  • Parquet is a columnar file format
  • Used for data serialization
  • Parquet is a streaming format
  • Data must be decoded from start-to-end
  • Files are compressed and encoded
  • Means smaller files on disk
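
The trade-off between an in-memory buffer and a compressed form for storage can be sketched roughly as below. This is not Parquet: zlib from the Python standard library stands in, as an assumption, for Parquet's own compression and encoding, and the variable names are hypothetical.

```python
import zlib
from array import array

# An in-memory column with many repeated values (illustrative data).
column = array("q", [7] * 1000 + [9] * 1000)

# The in-memory form can be read directly, 8 bytes per value.
in_memory = column.tobytes()

# The stored form is compressed, so it takes less space...
on_disk = zlib.compress(in_memory)

# ...but it has to be decoded before the values can be used again.
restored = array("q")
restored.frombytes(zlib.decompress(on_disk))
```

The stored bytes are smaller than the in-memory buffer, and the column is usable again only after the decode step.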

9
Arrow Memory Buffer
  • Arrow supports data adjacency for sequential
    access
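
One way to picture adjacency is that all values of a column sit next to each other in a single buffer, with a separate offsets buffer marking where each variable-length value starts and ends. Arrow's columnar format describes a layout along these lines for strings. The sketch below is a simplified illustration (it has no validity bitmap for nulls), and the names are hypothetical.

```python
from array import array

words = ["arrow", "parquet", "go"]

# Values buffer: all string bytes stored back to back.
data = bytearray()
# Offsets buffer: offsets[i] .. offsets[i + 1] delimits value i.
offsets = array("i", [0])
for w in words:
    data += w.encode("utf-8")
    offsets.append(len(data))


def get(i):
    """Read value i by slicing the contiguous values buffer."""
    return bytes(data[offsets[i]:offsets[i + 1]]).decode("utf-8")


# A sequential scan walks the buffer from start to end.
scanned = [get(i) for i in range(len(words))]
```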

10
Available Books
  • See Big Data Made Easy
  • Apress Jan 2015
  • See Mastering Apache Spark
  • Packt Oct 2015
  • See Complete Guide to Open Source Big Data
    Stack
  • Apress Jan 2018
  • Find the author on Amazon
  • www.amazon.com/Michael-Frampton/e/B00NIQDOOM/
  • Connect on LinkedIn
  • www.linkedin.com/in/mike-frampton-38563020

11
Connect
  • Feel free to connect on LinkedIn
  • www.linkedin.com/in/mike-frampton-38563020
  • See my open source blog at
  • open-source-systems.blogspot.com/
  • I am always interested in
  • New technology
  • Opportunities
  • Technology based issues
  • Big data integration