Unstructured data is useful

Transcript and Presenter's Notes


3
Unstructured data is useful
  • Take everyone's favorite example, log parsing
  • 207.181.42.20 - - [07/Feb/2003:11:38:28 -0800]
    "GET /archive/2003/02/01/space_sh.shtml HTTP/1.1"
    200 11966
    "http://www.google.com/search?hl=en&lr=&ie=UTF-8&oe=UTF-8&q=Space+Shuttle+Columbia+November+2002"
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows 98; Q312461)"
  • Fields: ip-address identd authuser
    [DD/MMM/YYYY:hh:mm:ss TZ] "request string" status
    bytes "referrer" "user-agent"

4
Structured data is useful
  • Utility of unstructured data is improved by
    joining it with structured data
  • E.g., IP geolocation resolves IP addresses to
    city, state, country
  • ~100 MB of data
  • Available as a SQL database dump

5
Joining data
  • Problem: Merge the log records with the IP
    geolocation data
  • Too much log data to dump into a SQL db; how do we
    bring the db to us?
  • Hadoop MapReduce, Hive, and Pig all work from HDFS!

6
DBInputFormat
  • Connects to the JDBC interface
  • Selects records out of tables or via arbitrary
    queries
  • Provides an interface to specify arbitrary input
    queries, tables, and databases
  • Records are read into a DBWritable and provided as
    the value to the Mapper (see the sketch after this
    list)
  • Constraints:
  • Must be able to totally order results (e.g., by
    primary key)
  • Must be able to count the expected result set size
    ahead of time
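
A minimal Mapper sketch, assuming the MyRecord class defined on slide 9 and the old mapred API of Hadoop 0.19 (DBInputFormat supplies a LongWritable record number as the key):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class MyRecordMapper extends MapReduceBase
      implements Mapper<LongWritable, MyRecord, LongWritable, Text> {
    // key is the record number assigned by DBInputFormat; value is
    // one row of the result set, deserialized into a MyRecord
    public void map(LongWritable key, MyRecord value,
        OutputCollector<LongWritable, Text> output, Reporter reporter)
        throws IOException {
      output.collect(new LongWritable(value.pkey),
          new Text(Long.toString(value.val)));
    }
  }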

7
DBWritable
  • You define a class to hold a row from the
    database
  • Must be able to read its fields from a JDBC
    ResultSet
  • Must be able to write them to a JDBC
    PreparedStatement
  • Should also implement the regular Writable
    interface

8
Configuration Example
  JobConf conf = new JobConf(getConf(), Foo.class);
  conf.setInputFormat(DBInputFormat.class);
  DBConfiguration.configureDB(conf,
      "com.mysql.jdbc.Driver",
      "jdbc:mysql://localhost/mydatabase");
  String[] fields = { "my_pkey", "my_value" };
  // null conditions (no WHERE clause); order results by my_pkey
  DBInputFormat.setInput(conf, MyRecord.class, "mytable",
      null, "my_pkey", fields);
  // set Mapper, etc., and call JobClient.runJob(conf)

9
DBWritable Example
  class MyRecord implements Writable, DBWritable {
    long pkey;
    long val;

    public void readFields(DataInput in) throws IOException {
      this.pkey = in.readLong();
      this.val = in.readLong();
    }

    public void readFields(ResultSet resultSet)
        throws SQLException {
      this.pkey = resultSet.getLong(1);
      this.val = resultSet.getLong(2);
    }
  }
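
The slide shows only the read half. Per slide 7, the class must also write its fields to a JDBC PreparedStatement and, for the Writable contract, to a DataOutput; a minimal sketch of those two methods (not shown on the slide), to be added inside MyRecord:

  public void write(DataOutput out) throws IOException {
    out.writeLong(this.pkey);
    out.writeLong(this.val);
  }

  public void write(PreparedStatement stmt) throws SQLException {
    stmt.setLong(1, this.pkey);
    stmt.setLong(2, this.val);
  }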

10
Parallelism and scalability
  • Prepares a statement of the form
    SELECT ... ORDER BY ... LIMIT ... OFFSET ...
    for each Mapper
  • Each InputSplit corresponds to an OFFSET into the
    query results
  • (A counting query is required ahead of time to
    determine the split count)
  • Scalability is limited by the bandwidth of the
    database server
  • 100 Mappers/Reducers would easily saturate the
    pipe from one node
  • Could be used once to do a bulk import into HDFS
    for Hive, etc.
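
For example, with my_pkey as the ordering key and an illustrative split size of 1000 rows, consecutive Mappers would issue queries like:

  -- split 0
  SELECT my_pkey, my_value FROM mytable
    ORDER BY my_pkey LIMIT 1000 OFFSET 0;
  -- split 1
  SELECT my_pkey, my_value FROM mytable
    ORDER BY my_pkey LIMIT 1000 OFFSET 1000;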

11
DBOutputFormat
  • Define the table and fields to populate with
    results from the MapReduce job
  • Individual values emitted by Reducers are bundled
    into a SQL transaction
  • All are committed at the end of the reduce
    operation (during close())
  • The DBWritable interface provides
    write(PreparedStatement stmt)
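
A minimal output-configuration sketch, mirroring the input example on slide 8 (the table and field names are illustrative):

  conf.setOutputFormat(DBOutputFormat.class);
  DBConfiguration.configureDB(conf,
      "com.mysql.jdbc.Driver",
      "jdbc:mysql://localhost/mydatabase");
  // write each reduced record's fields into columns
  // my_pkey and my_value of out_table
  DBOutputFormat.setOutput(conf, "out_table", "my_pkey", "my_value");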

12
Flexibility
  • Any JDBC database can work (MySQL, Postgres,
    HSQLDB)
  • Supports quick read-in of existing tables for
    ad-hoc jobs
  • Database sharding currently needs to be handled on
    the db side
  • Future work: support client-side row-level
    sharding

13
Conclusions
  • Good for ad-hoc queries
  • May be useful for bulk-loading a database into Hive
  • A straightforward interface that extends the
    existing MapReduce API
  • Available in Hadoop 0.19
  • (But HADOOP-2536 can be applied to 0.18.x without
    much difficulty)
