Impala and BigQuery - PowerPoint PPT Presentation

Description:

Impala and BigQuery By David Gruzman BigDataCraft.com Impala Hive Traces While dremel converts data into own format, Impala supports multiple formats. – PowerPoint PPT presentation

Transcript and Presenter's Notes

1
Impala and BigQuery
By David Gruzman BigDataCraft.com
2
Impala and BigQuery
  • BigQuery is Google's database service based on
    Dremel. BigQuery is hosted by Google.
  • Impala is an open source database inspired by the
    Dremel paper. Impala is part of the Cloudera
    Hadoop distribution.

3
Today's agenda
  • Overview of Dremel as a technology
  • Overview of BigQuery
  • A few words about Impala
  • DG Mediamind use case
  • Deeper insights into Impala
  • Conclusions
  • Q&A

4
What is Dremel
  • It can be viewed as a kind of database technology
    / architecture.
  • The closest known examples are MPP databases.
  • The main difference from them is in-memory
    processing, with the following consequences:
  • Only small-to-big table joins (in the first
    releases).
  • Small result sizes.
  • No operations like external sorts.

5
Dremel's Philosophy
  • Let's implement the subset of SQL that has a fast
    and scalable implementation.
  • It is somewhat similar to other NoSQL systems: we
    do what we can do VERY FAST and scalably. The rest
    is the application's problem.

6
Why Dremel?
  • Google was the first to have MapReduce.
  • Google was the first to face MapReduce's main
    problem: latency. The problem also propagated to
    the engines built on top of MapReduce.
  • It is logical that Google was the first to
    approach it by developing a real-time query
    capability for big data.

7
How Dremel is used at Google
  • Dremel is not a replacement for MapReduce or
    Tenzing but complements them. (Tenzing is Google's
    Hive.)
  • An analyst can run many fast queries using Dremel.
  • After getting a good idea of what is needed, run a
    slow MapReduce job (or SQL based on MapReduce) to
    get precise results.

8
Why Dremel is unique
  • Dremel, with BigQuery built on top of it, is
    probably the only interactive big data query
    engine today.
  • I mean that it is the only engine capable of
    producing results over terabytes of data in
    seconds!
  • The main idea (my guess) is to harness a huge
    cluster of machines for a single query.

9
Dremel as a technology
  • Novel hierarchical columnar format.
  • LLVM-based code generation.
  • Distributed aggregation tree.
  • In-situ data processing (inside the storage).

10
Dremel Aggregation tree
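A minimal sketch of the aggregation tree idea, assuming a simple SUM ... GROUP BY query. It is an illustration only, not Dremel's actual implementation: leaf servers aggregate their local shard, and each level above merges the partial results of its children until a single result reaches the root. All function names and the sample data are hypothetical.

```python
from collections import defaultdict


def leaf_aggregate(rows):
    """Leaf server: compute partial SUM(value) GROUP BY key over local rows."""
    partial = defaultdict(int)
    for key, value in rows:
        partial[key] += value
    return dict(partial)


def merge_partials(partials):
    """Intermediate or root server: merge partial sums from child servers."""
    merged = defaultdict(int)
    for partial in partials:
        for key, value in partial.items():
            merged[key] += value
    return dict(merged)


def aggregate_tree(shards, fanout=2):
    """Aggregate every shard at the leaves, then merge level by level."""
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [leaf_aggregate(shard) for shard in shards]
    while len(level) > 1:
        # Each parent merges the results of up to `fanout` children.
        level = [merge_partials(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0] if level else {}


# Hypothetical data: four shards of (country, clicks) rows.
shards = [
    [("il", 3), ("us", 5)],
    [("us", 2), ("de", 7)],
    [("il", 1)],
    [("de", 4), ("us", 1)],
]
result = aggregate_tree(shards)
print(result)
```

Because every node only ever holds partial aggregates, the amount of data moving up the tree stays small when the query reduces the data strongly, which is the case this architecture targets.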
11
Dremel Nested columnar format
12
BigQuery
  • Service built by Google on top of the Dremel
    engine
  • The only (known to me) query engine as a service
    working with big data.
  • Query time does not depend on data size

13
BigQuery main capabilities
  • Aggregations
  • Join of a big table to a small table.
  • Join of two big tables (recently added).
  • Hierarchical data format. It makes
    pre-aggregations cheaper.

14
Main limitations
  • Small results size
  • Intermediate results should not exceed memory
    size.
  • No external tables

15
Pricing model
  • The pricing is per gigabyte of processed data.
  • The price is high: $35 per TB.
  • In my view it is costly because it is
    hyper-elastic.
  • You can do the same processing in Amazon, but it
    will take a few hours (and much less money).
  • In Amazon you cannot get the required CPU power
    for just a few seconds.

16
Why BigQuery is not popular
17
So, why is BigQuery not popular
  • Data is not created in Google's cloud. It is hard
    and impractical to move big data. It is heavy,
    after all.
  • Google is known to change its APIs. BigQuery has
    also changed during the last years. It is hard to
    build a business on it.
  • Many companies in Internet-related businesses are
    wary of sharing data with Google.
  • It is expensive. $35 per TB can lead to bills of
    thousands of dollars per day.

18
Dremel
19
At the same time, it is good technically
  • I got references from a company doing serious
    testing
  • Martin Fowler's company also tested it and gave
    very good feedback.

20
Question to all of you
  • Why did your organization decide not to use
    Google's BigQuery?

21
Where can we find Impala
22
Impala
23
What is Impala
  • Massively parallel processing (MPP) database
    engine, developed by Cloudera.
  • Integrated into the Hadoop stack at the same level
    as MapReduce, and not above it (as Hive and Pig
    are)

[Diagram: Hive and Pig run on top of MapReduce; MapReduce and Impala both run directly on HDFS]
24
Why Impala
  • Data has gravity
  • Today a lot of data lives in HDFS
  • It is not practical to move big data
  • It is practical to bring the engine to the data
  • At the same time, MapReduce is not a must
  • Impala processes data in the Hadoop cluster
    without using MapReduce

25
MapReduce bypass
  • Several other modern database engines have also
    realized the opportunity to bypass MapReduce and
    work directly with HDFS.
  • They take various approaches.

26
MapReduce Bypass
  • Existing MPP databases, like Greenplum, store
    their external tables in HDFS

27
MapReduce bypass
  • JethroData stores data in its own format on HDFS
    and also works with it without the MR layer.
  • Its proprietary format enables full indexing of
    the data together with columnar efficiency. For
    highly selective queries this approach has
    serious advantages.

28
Use Case from DG
  • I think it will be a typical case in the future
  • DG is using Hadoop and Hive
  • They are evaluating Impala to do part of the work
    more efficiently.
  • After their case presentation we will come back to
    discuss the internals of Impala

29
Again: Impala has a different place than Pig and
Hive
[Diagram: Hive and Pig run on top of MapReduce; MapReduce and Impala both run directly on HDFS]
30
Impala architecture
31
Impala: Dremel traces
  • LLVM code generation
  • It is really fast
  • C++ as the implementation language (not Java...)
  • Simple query engine. It actually does only the
    things which can be done in memory.
  • Broadcast join algorithm is implemented
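A minimal single-process sketch of a broadcast hash join, to illustrate why it fits an in-memory engine joining a big table to a small one. It is not Impala's code: the function name, the row layout (dicts) and the sample tables are all hypothetical. The small table is turned into a hash table that every node would receive in full, while each node probes it with only its own partition of the big table.

```python
def broadcast_hash_join(big_partitions, small_table, big_key, small_key):
    """Inner-join every partition of the big table with the whole small table."""
    # Build phase: hash the small table once. In a cluster this hash table
    # would be shipped ("broadcast") to every node, so it must fit in memory.
    lookup = {}
    for row in small_table:
        lookup.setdefault(row[small_key], []).append(row)

    # Probe phase: each node streams its local partition of the big table;
    # big-table rows never have to move over the network.
    joined = []
    for partition in big_partitions:
        for row in partition:
            for match in lookup.get(row[big_key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical data: a partitioned fact table and a small dimension table.
clicks = [
    [{"user": 1, "country_id": 10}, {"user": 2, "country_id": 20}],
    [{"user": 3, "country_id": 10}, {"user": 4, "country_id": 30}],
]
countries = [
    {"country_id": 10, "name": "IL"},
    {"country_id": 20, "name": "US"},
]
joined = broadcast_hash_join(clicks, countries, "country_id", "country_id")
for row in joined:
    print(row)
```

The limitation is visible in the build phase: the approach only works while the smaller side of the join fits into the memory of each node.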

32
LLVM code generation
  • Assume you want to write custom code for a
    specific query. It will be super efficient.
  • Code generation automates this process for each
    query.
  • We actually need to super-optimize the inner loop
    doing filtering (WHERE) and GROUP BY.
  • LLVM enables us to compile into native code in a
    fraction of a second.
  • LLVM enables us to enjoy new CPU capabilities like
    SSE in a portable way.
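The idea of per-query code generation can be sketched in a few lines. This is only an analogy: it uses Python's built-in `compile` and `exec` instead of LLVM, and the function name and query shape are assumptions made for the illustration. The point is that column names and constants are baked into the generated inner loop instead of being interpreted for every row.

```python
def generate_query_function(filter_column, threshold, group_column, sum_column):
    """Emit and compile source specialised for one query of the shape:

    SELECT group_column, SUM(sum_column)
    WHERE filter_column > threshold
    GROUP BY group_column
    """
    lines = [
        "def run(rows):",
        "    totals = {}",
        "    for row in rows:",
        # Column names and the constant are inlined as literals.
        f"        if row[{filter_column!r}] > {threshold!r}:",
        f"            key = row[{group_column!r}]",
        f"            totals[key] = totals.get(key, 0) + row[{sum_column!r}]",
        "    return totals",
    ]
    source = "\n".join(lines)
    namespace = {}
    exec(compile(source, "<generated query>", "exec"), namespace)
    return namespace["run"], source


# Hypothetical rows and query.
rows = [
    {"country": "il", "price": 5, "qty": 2},
    {"country": "us", "price": 1, "qty": 9},
    {"country": "il", "price": 7, "qty": 3},
]
run, source = generate_query_function("price", 2, "country", "qty")
print(source)
totals = run(rows)
print(totals)
```

A real engine would generate machine code rather than Python source, so the gain comes from removing virtual calls and branches in the inner loop; the sketch shows only the specialisation step.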

33
Why is code generation interesting?
  • If you develop your own engine, or some piece of
    code responsible for processing serious data
    volumes, code generation may give you an order of
    magnitude boost.
  • I had cases when usage of such technology was
    game changing

34
Impala: Hive traces
  • While Dremel converts data into its own format,
    Impala supports multiple formats. It is a kind of
    schema on read.
  • Impala shares the metastore with Hive, which
    enables very simple adoption
  • Internally, Impala has a well-defined way to add
    new formats

35
Impala's unique things
  • Impala's format adapters, called scanners, have
    predicate pushdown capability.
  • Probably the only open source MPP engine
  • Today we do not have any other means to run
    hundreds of CPU cores in one query efficiently
    without an expensive license.
  • Hive gives us the same, but not efficiently.
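As a rough illustration of what predicate pushdown in a scanner means, the sketch below applies the filter while the rows are being parsed, so rejected rows are never handed to the rest of the engine. The `scan_csv` function, its parameters and the sample data are hypothetical and do not reflect Impala's scanner interface.

```python
import csv
import io


def scan_csv(text, columns, predicate=None):
    """Hypothetical scanner: parse CSV rows, filter and project while reading."""
    reader = csv.DictReader(io.StringIO(text))
    for record in reader:
        if predicate is not None and not predicate(record):
            continue  # rejected inside the scanner, never passed upstream
        # Projection: hand over only the columns the query needs.
        yield {name: record[name] for name in columns}


# Hypothetical data and query: SELECT country WHERE clicks > 5
data = "country,clicks\nil,10\nus,3\nde,25\n"
rows = list(scan_csv(data, ["country"],
                     predicate=lambda r: int(r["clicks"]) > 5))
print(rows)
```

With a columnar format the same hook can skip whole blocks of data, which is where pushdown pays off most.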

36
Impala vs MPP
  • It usually takes many years to create an MPP
    database.
  • There are serious simplifications:
  • The data is read only.
  • It is actually not a DBMS, only a query engine.
  • No serious resource management, only measurement
    (all over the code).

37
Impala: a Hive killer?
  • Not so fast.
  • Hive does things Impala cannot do yet, like
    joins between several big tables.
  • Hive has convenient Java UDFs, while Impala does
    not
  • Impala does not have inter-query fault tolerance.
  • At the same time, MapReduce is not a good
    framework for a database engine

38
Impala Data Formats
  • There are scanners for the following formats:
  • RCFile
  • Parquet (native Dremel format)
  • CSV
  • Avro
  • SequenceFile

39
Impala future
  • Will get closer to other MPP engines
  • Support more formats
  • More advanced scheduling and resource management

40
Basic benchmark
  • TPC-H, Q1, SF10
  • 4 EC2 large instances
  • 4 seconds, while Hive takes about 1 minute.
  • This corresponds to a group-by speed of about
    235 MB/sec per core.

41
Impala price per GB
  • 1 large instance costs $0.24 per hour
  • The cluster costs $0.96 per hour.
  • Cost of 1 second: $0.96 / 3600
  • With such a cluster we process 1.75 GB per second
  • So the cost of processing 1 TB is about $0.15
  • It is about 230 times cheaper than BigQuery
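The arithmetic on this slide can be reproduced as a short calculation. The instance price, instance count and throughput are taken from the slide, and the BigQuery price is the 35 per TB quoted earlier in the deck; treating all prices as US dollars and a terabyte as 1000 GB are assumptions of this sketch.

```python
INSTANCE_PRICE_PER_HOUR = 0.24        # one EC2 large instance (from the slide)
INSTANCES = 4                         # cluster size used in the benchmark
CLUSTER_THROUGHPUT_GB_PER_SEC = 1.75  # measured processing rate (from the slide)
BIGQUERY_PRICE_PER_TB = 35.0          # price quoted earlier in the deck
GB_PER_TB = 1000                      # assumption: decimal terabytes

# Price of running the whole cluster for one second.
cluster_price_per_second = INSTANCE_PRICE_PER_HOUR * INSTANCES / 3600

# Time needed to process one terabyte at the measured rate.
seconds_per_tb = GB_PER_TB / CLUSTER_THROUGHPUT_GB_PER_SEC

impala_price_per_tb = cluster_price_per_second * seconds_per_tb
ratio = BIGQUERY_PRICE_PER_TB / impala_price_per_tb

print(f"Cluster time per TB: {seconds_per_tb:.0f} s")
print(f"Impala price per TB: {impala_price_per_tb:.2f}")
print(f"BigQuery / Impala price ratio: {ratio:.0f}")
```

The calculation only covers compute time while the cluster is busy; it ignores idle time, storage and the fact that EC2 billed by the hour at the time, so it is a lower bound on the real cost.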

42
Performance - summary
  • It is fast when data reduction is big.
  • It is fast when data is hot.
  • It should benefit from fast storage / SSD. My
    measurements show about 200 MB/sec per core for
    group-by processing.
  • Always faster than Hive, by at least 10 times.

43
What about clouds?
44
Impala in the cloud is not elastic
  • To be elastic we need to create a cluster when we
    need it.
  • Even if we agree to per-hour resolution, storage
    will be a problem
  • S3 will not give us hundreds of MB per second
    per instance
  • Data stored in the local file system is transient

45
Impala - conclusions
  • It is the first time I remember that we can get
    our hands on a free MPP database.
  • There is no risk in trying it side by side with
    Hive
  • It is possible to offload part of the work to
    Impala and do the rest with Hive
  • It is part of the Cloudera Hadoop distribution
    and easily installed by Cloudera Manager

46
Materials used
  • Benchmarks
  • http://www.slideshare.net/sudabon/performance-evaluation-of-cloudera-impala-20121208-15536323
  • https://amplab.cs.berkeley.edu/benchmark/
  • Architecture
  • http://www.slideshare.net/scottleber/impala-19176906
  • https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  • POC
  • http://martinfowler.com/articles/bigQueryPOC.html

47
Material used - comparisons
  • To Hive: http://www.quora.com/Cloudera/Does-Cloudera-Impala-have-any-drawbacks-when-compared-with-Hive
  • To Vertica: http://www.quora.com/Cloudera-Impala/How-does-Cloudera-Impala-compare-to-Vertica
  • To Dremel: http://www.quora.com/Cloudera-Impala/How-does-Clouderas-Impala-compare-to-Googles-Dremel

48
Thank you!!!
  • Special thanks to Faina Kamenetsky, who helped
    set up clusters in Amazon.

49
BigDataCraft.com
  • We are a boutique consulting company
  • Our services are:
  • On-paper POC
  • On-hardware POC
  • Architecture / design reviews
  • Custom integrations and bug fixing

50
Impala - Flow