Title: Beaconstac Analytics
1. Big Data and the Internet of Things (IoT)

2. Project Morpheus (Beaconstac Analytics)
May 2015
Garima Batra, Core Platform Engineer, MobStac
3. A quick intro to Beaconstac

- Beaconstac is a proximity marketing and analytics platform for beacons
- Several beacon-specific events are defined to aid proximity marketing
- The events include camp-on, beacon exit, region enter, region exit, etc.
- The Beaconstac analytics platform makes it easy for managers, marketers, and developers to analyze event data
- Components include the Beaconstac iOS/Android SDKs and the Beaconstac portal
4. Why Hadoop?

- Collect event logs generated from Beaconstac SDK usage
- Needed a system to answer queries like:
  - Heat map of beacons by number of visits received in a specified time interval
  - Heat map of beacons by amount of time spent in a specified time interval
  - Average time spent by users near different beacons
  - Last seen per user
  - Last seen per beacon
  - Analyzing data with custom attribute filters
  - Path traversed in an area by individual users
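One of these queries, the visit-count heat map, can be sketched as a Hadoop Streaming mapper/reducer pair in Python. The log format (tab-separated beacon ID, event type, timestamp) and the `camp_on` event label are assumptions for illustration, not Beaconstac's actual schema:

```python
#!/usr/bin/env python
# Hypothetical Hadoop Streaming mapper/reducer for the
# "visits per beacon" heat-map query. Assumed input: one event
# per line, tab-separated as beacon_id, event_type, timestamp.
import sys

def map_line(line):
    """Emit 'beacon_id<TAB>1' for every camp-on (visit) event."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 2 and fields[1] == "camp_on":
        return "%s\t1" % fields[0]
    return None

def reduce_lines(lines):
    """Sum counts per beacon. Hadoop Streaming sorts map output by
    key before the reduce phase, so each beacon's counts arrive
    together; we still accumulate in a dict for safety."""
    counts, order = {}, []
    for line in lines:
        beacon_id, count = line.rstrip("\n").split("\t")
        if beacon_id not in counts:
            counts[beacon_id] = 0
            order.append(beacon_id)
        counts[beacon_id] += int(count)
    return ["%s\t%d" % (b, counts[b]) for b in order]

# Under Hadoop Streaming this file would be invoked twice:
# once with "map" and once with "reduce" as the first argument.
if __name__ == "__main__" and len(sys.argv) > 1:
    if sys.argv[1] == "map":
        for raw in sys.stdin:
            out = map_line(raw)
            if out:
                print(out)
    else:
        for out in reduce_lines(sys.stdin):
            print(out)
```

The mapper/reducer split mirrors how the other queries (time spent, last seen) would be written: change what the mapper emits, keep the sort-then-aggregate shape.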
5. Leveraging Amazon EMR for Beaconstac Analytics

- Amazon's Streaming API for writing mapper and reducer functions in Python
- Input: copy programs to Amazon S3
- Output: copy the processed output data to S3
- Initial tests were run using the EMR console, where you can define the following:
  - Cluster configuration: name, termination protection, logging, log location on S3, etc.
  - Software configuration: Hadoop AMI version, applications to install on startup, etc.
  - Hardware configuration: node types (master, core, and task)
  - Security: keys, allowed users
  - Bootstrap actions: configure Hadoop, custom actions, etc.
  - Steps: streaming program, Hive program, Pig program
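The same console setup can be reproduced programmatically. A sketch of building a Hadoop Streaming step and submitting it to a running cluster with boto3's `add_job_flow_steps`; the bucket names, script paths, cluster ID, and streaming-jar location are assumptions (the jar path shown is the 2015 Hadoop-AMI convention, not necessarily the one the team used):

```python
# Hypothetical sketch: define an EMR Hadoop Streaming step.
import sys

# Streaming jar location on the deck-era (2015) Hadoop AMIs.
STREAMING_JAR = "/home/hadoop/contrib/streaming/hadoop-streaming.jar"

def streaming_step(name, mapper, reducer, input_path, output_path):
    """Return a step definition in the shape EMR's API expects."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": STREAMING_JAR,
            "Args": [
                # Ship the scripts with the job, then name them
                # as mapper/reducer by their file names.
                "-files", "%s,%s" % (mapper, reducer),
                "-mapper", mapper.rsplit("/", 1)[-1],
                "-reducer", reducer.rsplit("/", 1)[-1],
                "-input", input_path,
                "-output", output_path,
            ],
        },
    }

# Opt-in submission: requires AWS credentials and a live cluster.
if __name__ == "__main__" and "--submit" in sys.argv:
    import boto3  # hypothetical usage; cluster ID below is fake
    step = streaming_step(
        "beacon-visits",
        "s3://example-bucket/jobs/mapper.py",
        "s3://example-bucket/jobs/reducer.py",
        "s3://example-bucket/logs/2015-05-01/",
        "s3://example-bucket/output/2015-05-01/",
    )
    boto3.client("emr").add_job_flow_steps(
        JobFlowId="j-EXAMPLE", Steps=[step])
```

Keeping the step definition in a plain function makes it reusable from both ad-hoc scripts and the scheduled pipeline described later in the deck.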
6. Integrating EMR in production

7. Batch processing for Morpheus
AWS Data Pipeline

8. Deep dive into EMR startup and job submission
9. How does AWS Data Pipeline work?

- Pipeline definition: specifies the business logic of your data management
- AWS Data Pipeline web service: interprets the pipeline definition and assigns tasks to workers to move and transform data
- Task Runner: polls the AWS Data Pipeline web service for tasks and then performs those tasks
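The pipeline definition is a list of JSON objects. A minimal, largely hypothetical sketch of such a definition for a daily EMR run, in the object-list shape Data Pipeline's put-pipeline-definition call expects (object names, the schedule, and the elided step string are illustrative, not the Morpheus definition):

```python
# Hypothetical AWS Data Pipeline definition sketch.
# Each object has an id/name plus type-specific fields; real
# definitions need more fields (instance types, roles, etc.).
import json

pipeline_definition = {
    "objects": [
        {
            "id": "Default",
            "name": "Default",
            "scheduleType": "cron",
            "schedule": {"ref": "DailySchedule"},
        },
        {
            "id": "DailySchedule",
            "name": "DailySchedule",
            "type": "Schedule",
            "period": "1 day",
            "startDateTime": "2015-05-01T00:00:00",
        },
        {
            "id": "EmrCluster",
            "name": "EmrCluster",
            "type": "EmrCluster",
        },
        {
            "id": "ProcessLogs",
            "name": "ProcessLogs",
            "type": "EmrActivity",
            "runsOn": {"ref": "EmrCluster"},
            "step": "...streaming jar and arguments...",  # elided
        },
    ]
}

print(json.dumps(pipeline_definition, indent=2))
```

The three roles above map onto this: the JSON is the pipeline definition, the web service schedules `ProcessLogs` onto `EmrCluster`, and Task Runner (built into EMR-managed resources) executes the step.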
10. Morpheus version of Data Pipeline

Copy logs from Kafka to S3
- Runs every hour
- Requires a Kafka consumer script

Run EMR jobs
- Runs once every day
- Processes each job and produces output
- Each job comprises mapper and reducer scripts

Copy the output to Elasticsearch
- Runs once every day
- Inserts the output into Elasticsearch
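The hourly "copy logs from Kafka to S3" step could be sketched as below. The consumer is passed in as any iterable of messages with a `.value` bytes payload (kafka-python's `KafkaConsumer` qualifies), and `s3` as any client with a `put_object` method (boto3's S3 client qualifies); the topic name, bucket, and key layout are all hypothetical:

```python
# Hypothetical hourly Kafka-to-S3 consumer sketch.
import datetime

def s3_key_for(topic, timestamp):
    """Partition log batches on S3 by topic and hour, so each
    hourly run writes a distinct prefix the EMR jobs can read."""
    hour = timestamp.strftime("%Y/%m/%d/%H")
    return "logs/%s/%s/events.log" % (topic, hour)

def run(consumer, s3, bucket, topic):
    """Drain the consumer and upload one object for this run.
    Returns the number of messages copied."""
    batch = [msg.value.decode("utf-8") for msg in consumer]
    if batch:
        key = s3_key_for(topic, datetime.datetime.utcnow())
        s3.put_object(Bucket=bucket, Key=key,
                      Body="\n".join(batch).encode("utf-8"))
    return len(batch)
```

Partitioning keys by hour means the daily EMR run can take a day's prefix as its `-input`, and a failed hour can be re-copied without touching its neighbors.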
11. Settings file in each job
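The deck does not show the settings file's contents. An entirely hypothetical sketch of what a per-job settings file might contain, given the pipeline described on the previous slides (every key and path below is a guess):

```python
# Hypothetical per-job settings: where the job's scripts live,
# what it reads and writes, and where its output is indexed.
SETTINGS = {
    "job_name": "beacon_visits_heatmap",
    "mapper": "s3://morpheus-jobs/beacon_visits/mapper.py",
    "reducer": "s3://morpheus-jobs/beacon_visits/reducer.py",
    "input": "s3://morpheus-logs/{date}/",      # hourly Kafka copies
    "output": "s3://morpheus-output/beacon_visits/{date}/",
    "elasticsearch_index": "beacon_visits",     # for the daily ES load
}
```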
Questions?