Title: Advisory Expert Group Big Data
1 Advisory Expert Group: Big Data
2 Outline
- Big data and the National Accounts
- Establishing the right infrastructure
- Lessons learned: case studies from Statistics Canada
  - Traditional big data
  - Scanner data
  - Electricity consumption
  - Credit card and Interac
  - Remote sensing
3 Big data and the National Accounts
- From a business perspective: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization." (Gartner 2012; Wikipedia)
- From an NSO perspective: "Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing and that could reduce respondent burden, increase quality, develop new statistical products or enhance the detail of existing statistical products..." ????
4 Big data and the National Accounts
- Mick Couper of the University of Michigan's Survey Research Center cites the following limitations NSOs will face when confronting big data:
  - lack of covariates in the datasets
  - self-selection and self-reporting biases
  - lack of stability
  - privacy issues
  - access issues
  - opportunity for mischief
  - size issues
  - selective reporting of results (the file drawer problem)
- You could add to that:
  - Sustainability: data sources disappear, systems change, perceptions change.
- Source: Couper, Mick P., "Is the Sky Falling? New Technology, Changing Media, and the Future of Surveys." (Presentation, European Survey Research Association, 5th Conference, Ljubljana, Slovenia, July 2013)
5 Big data and the National Accounts
- There needs to be up-front acknowledgement that we are trying to fit a square peg in a round hole.
- The needs of business (big data to increase business intelligence) and of national accountants (big data to produce comprehensive macroeconomic statistics) are quite different.
Dimensions of the data   | Needs of National Accountants           | Needs of business
Scope of the dataset     | Comprehensive                           | Limited to the needs of the business
Use of the dataset       | Produce meaningful aggregate statistics | Find patterns, explore the detail
Structure of the dataset | On-going, stable, regular               | Structure can change as required by the business
6 Putting in place the appropriate infrastructure
- In order to determine how best to leverage big data, NSOs need to put in place the proper infrastructure to:
  - Obtain the data
  - Process the data
  - Evaluate the data
  - Integrate the data
7 Putting in place the appropriate infrastructure
Obtaining the data
- Use of legislation: e.g., Section 13 of Canada's Statistics Act states that "A person having the custody or charge of any documents or records that are maintained in any department or in any municipal office, corporation, business or organization, from which information sought in respect of the objects of this Act can be obtained or that would aid in the completion or correction of that information, shall grant access thereto for those purposes to a person authorized by the Chief Statistician to obtain that information or aid in the completion or correction of that information." (1970-71-72, c. 15, s. 12)
- Memoranda of understanding (MOUs), which outline:
  - Roles and responsibilities
  - Delivery mechanism
  - Uses of data
  - Termination of the agreement
- Purchasing big data
  - Many firms sell big data that can be used for business intelligence; it could also be purchased for statistical purposes. Under what conditions and terms should NSOs purchase big data?
8 Putting in place the appropriate infrastructure
Processing the data
- File transfer system: NSOs need a secure, high-capacity file transfer system to transfer data from the data provider to the NSO.
- Storage and processing capacity: In most NSOs (especially NA divisions) the processing capacity for big data does not exist.
- Software: Statistics Canada is leveraging the SAS distributed computing solution called SAS Grid to shorten the time needed to process and analyze its larger data holdings. Also, the Data Analysis Resource Center at Statistics Canada maintains a research computer with analytical software installed, offering a wide range of add-ons that provide advanced analytical and visualization tools particular to big data analytics.
- Information management policies: access, privacy, confidentiality, retention
9 Putting in place the appropriate infrastructure
Evaluating the data
- Big data community of practice
  - There needs to be a structure in place that allows analysts and programs to gain knowledge and share experiences with respect to big data, to engage with colleagues internally or externally when needed and to report findings to senior managers when appropriate.
- Big data needs to be evaluated with respect to its:
  - Quality
  - Coverage
  - Timeliness
  - Detail
  - Regularity
- In order to leverage big data we need to develop a research and development orientation.
10 Examples of big data research at Statistics Canada: International merchandise trade statistics
- Collection/access agreement: Access to detailed customs data is governed by two memoranda of understanding, one with the Canada Revenue Agency and one with the U.S. Census Bureau
- Cost: Nil
- Dimensions: 1.5 terabytes, 60 attributes
- Uses: Balance of Payments, International Merchandise Trade Statistics
- Timeliness: 35 days following the reference period
- Frequency: Daily, if required
- Potential uses: Creating an importer and exporter characteristics file that can be used to analyze the entry and exit of Canadian traders within the Canadian economy, and in studies of globalization, global production, goods for processing and foreign affiliate statistics.
11 Examples of big data research at Statistics Canada: Taxation statistics
- Collection/access agreement: Access to detailed taxation statistics is governed by a memorandum of understanding with the Canada Revenue Agency.
- Cost: Approximately 1.6 million
- Dimensions: 6 terabytes and growing
- Uses: Benchmark estimates of wages and salaries, output, property incomes, taxes, etc.
- Timeliness: Earliest use 45 days following the reference period
- Frequency: Mainly annual, some monthly (goods and services taxation statistics)
- Potential uses: Creation of a National Accounts longitudinal file, a business-level micro-data file that can be used to undertake studies such as GDP by city, GDP by firm size and productivity by firm size.
12 Examples of big data research at Statistics Canada: Government finance statistics
- Collection/access agreement: No formal agreement in place; institutional understanding between Statistics Canada and the government jurisdictions.
- Cost: Nil
- Dimensions: 40 million financial transactions, 200 GB
- Uses: Government Finance Statistics, government sector National Accounts
- Timeliness: Earliest is 15 days following the reference period.
- Frequency: Monthly, quarterly, annual
- Potential uses: Local government remains a survey of municipalities; access to electronic files will increase our ability to provide CMA-level data as well as increased revenue and expenditure detail. Potential data uses for the health, education and justice programs.
13 Examples of big data research at Statistics Canada: Electronic household transactions (credit and debit)
- Collection/access agreement: Memorandum of understanding outlining the roles and responsibilities of both Statistics Canada and the data provider.
- Cost: Nil
- Dimensions: Aggregated big data: number of transactions and value of transactions, aggregated by merchant group, by place of transaction (domestic, international) and by class of transactor (personal or commercial).
- Uses: Indicator for household final consumption expenditure and international travel
- Timeliness: Earliest is 15 days following the reference period.
- Frequency: Monthly
- Potential uses: International travel services, monthly household final consumption expenditure.
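The aggregation described under Dimensions (counts and values by merchant group, place of transaction and class of transactor) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, field names, merchant groups and amounts are hypothetical and are not taken from the data provider's files.

```python
from collections import defaultdict

# Hypothetical transaction records; every field name and value is illustrative.
transactions = [
    {"merchant_group": "grocery", "place": "domestic", "transactor": "personal", "value": 54.20},
    {"merchant_group": "grocery", "place": "domestic", "transactor": "personal", "value": 20.80},
    {"merchant_group": "airline", "place": "international", "transactor": "commercial", "value": 640.00},
    {"merchant_group": "airline", "place": "international", "transactor": "personal", "value": 410.00},
]


def aggregate(records):
    """Return {(merchant_group, place, transactor): (count, total_value)}."""
    counts = defaultdict(int)
    totals = defaultdict(float)
    for r in records:
        key = (r["merchant_group"], r["place"], r["transactor"])
        counts[key] += 1
        totals[key] += r["value"]
    # Round totals to cents to hide floating-point noise.
    return {k: (counts[k], round(totals[k], 2)) for k in counts}


aggregates = aggregate(transactions)
for key, (n, total) in sorted(aggregates.items()):
    print(key, n, total)
```

Only the aggregated cells would need to leave the provider in an arrangement like this, which is one reason aggregated delivery may be easier to negotiate than transaction-level files.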
16 Examples of big data research at Statistics Canada: Scanner data (vendor specific)
- Collection/access agreement: MOU in negotiation
- Cost: Current costs are nil, though the long-term approach being proposed would involve a quid pro quo agreement where CPD would provide the company their data back with value added (i.e., an implicit cost would be borne by the division).
- Dimensions: Sales, quantities, and item descriptions of all goods sold for a given store over a given period
- Uses: Consumer prices and household expenditure weights to feed the CPI
- Timeliness: TBD, though potentially as little as a one-day lag (e.g., weekly data for a given week could be delivered on the first day of the following week).
- Frequency: Initial data has been provided on a weekly aggregated basis. Future work will look at daily and/or transactional-level data.
- Dataset size: For one week of sales data (aggregated on the week) for one store:
  - roughly 4,000 KB
  - roughly 30,000 rows (i.e., unique items sold)
  - implies roughly 200 MB for one year of weekly aggregated data for one store.
- Potential uses moving forward: Direct input into the calculation of the CPI (potential replacement for collected prices), studies on consumer behaviour, CPI weights, household final consumption expenditures, retail sales.
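The dataset-size estimate above is simple arithmetic, reproduced here as a small Python sketch. The per-week figures come from the slide; the 52-week year, the decimal conversion (1 MB = 1,000 KB) and the store count used for scaling are assumptions made for illustration.

```python
# Figures quoted on the slide (all approximate).
KB_PER_STORE_WEEK = 4_000      # one week of weekly-aggregated sales, one store
ROWS_PER_STORE_WEEK = 30_000   # unique items sold
WEEKS_PER_YEAR = 52            # assumption: a 52-week year


def yearly_size_mb(kb_per_week, weeks=WEEKS_PER_YEAR, stores=1):
    """Approximate yearly volume in MB, taking 1 MB = 1,000 KB."""
    return kb_per_week * weeks * stores / 1_000


one_store_mb = yearly_size_mb(KB_PER_STORE_WEEK)
bytes_per_row = KB_PER_STORE_WEEK * 1_000 / ROWS_PER_STORE_WEEK

print(f"One store, one year: about {one_store_mb:.0f} MB")
print(f"Average row size: about {bytes_per_row:.0f} bytes")

# Hypothetical scale-up: the store count is an assumption, not from the source.
HYPOTHETICAL_STORES = 500
many_stores_gb = yearly_size_mb(KB_PER_STORE_WEEK, stores=HYPOTHETICAL_STORES) / 1_000
print(f"{HYPOTHETICAL_STORES} stores, one year: about {many_stores_gb:.0f} GB")
```

The one-store result (208 MB) is consistent with the "roughly 200 MB" quoted above. Daily or transaction-level delivery would be larger by a factor this sketch does not attempt to estimate.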
17 Examples of big data research at Statistics Canada: Smart meter household electricity consumption
- Collection/access agreement: Two memoranda of understanding with two regional electricity distributors
- Cost: Nil
- Dimensions: Roughly 200 GB of raw hourly electricity consumption data have been obtained, providing detailed information on approximately 120,000 customers from 2008 to 2013
- Uses: Household electricity consumption
- Timeliness: Earliest is 15 days following the reference period.
- Frequency: Hourly
- Potential uses: Household final consumption expenditure, monthly gross domestic product for utilities.
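Moving from hourly meter readings to a monthly indicator is, at its simplest, a grouped sum. The Python sketch below shows that step under stated assumptions: the tuple layout (customer, timestamp, kWh), the timestamp format and the readings themselves are hypothetical, and a production process would also need editing, imputation and weighting steps that are not shown.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical hourly readings: (customer_id, timestamp, kWh). Values are invented.
readings = [
    ("A", "2013-01-31 22:00", 1.2),
    ("A", "2013-01-31 23:00", 0.8),
    ("A", "2013-02-01 00:00", 0.5),
    ("B", "2013-01-31 23:00", 2.0),
    ("B", "2013-02-01 00:00", 1.5),
]


def monthly_totals(rows):
    """Sum hourly kWh into {(year, month): total_kwh} across all customers."""
    totals = defaultdict(float)
    for _customer, stamp, kwh in rows:
        t = datetime.strptime(stamp, "%Y-%m-%d %H:%M")
        totals[(t.year, t.month)] += kwh
    # Round to hide floating-point noise in the sums.
    return {k: round(v, 3) for k, v in totals.items()}


totals = monthly_totals(readings)
for (year, month), kwh in sorted(totals.items()):
    print(f"{year}-{month:02d}: {kwh} kWh")
```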
18 Examples of big data research at Statistics Canada: Smart meter household electricity consumption
[Figure: Total residential consumption]
19 Examples of big data research at Statistics Canada: Satellite imaging (land account)
- Collection/access agreement: Public data
- Cost: Nil
- Dimensions: 20 GB. Although not apparent here, the dimensions of this type of big data (which is not really big data, strictly speaking) may well explode in the coming years. LIDAR datasets (high-resolution laser scanning), as well as higher-resolution (space and time) satellite data, will require terabytes of storage and terahertz of processing capacity.
- Uses: Land accounts (land cover / land use change, 2000 and 2010-2013)
- Timeliness: 3-year lag
- Frequency: Annual
- Potential uses moving forward: Landscape and freshwater ecosystem accounts
20 Examples of big data research at Statistics Canada: Remote sensing (land use)
21 Examples of big data research at Statistics Canada: Water measurement instruments (water account)
- Collection/access agreement: Informal agreement with Water Survey of Canada (WSC)
- Cost: Nil
- Dimensions: Original WSC data is 5 GB; derived water yield data is 90 GB
- Uses: Water accounts (water yield)
- Timeliness: From real time to a lag of several years
- Frequency: Daily
- Potential uses moving forward: Freshwater ecosystem accounts
22 Some lessons learned so far
- Quid pro quo is important when trying to obtain big data. Firms are more willing to part with their big data if you show them how they will receive a business intelligence benefit on their side.
- Cost: big data is not always the cheapest option. It is sometimes easier to have the firm complete the survey than to create an infrastructure to receive and process their data. For example, the data received from local electricity providers is equivalent to the completion of two questions on our current survey.
- Classification systems: big data does not follow any standard classification system. For example, electronic retail transactions are classified according to merchant groups rather than industries.
- Big data aggregates: asking firms to aggregate their big data is an option.
- Data formats: we need to work with new data formats that we are often not familiar with.
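The classification lesson above usually translates into building a concordance between the provider's categories and the statistical classification. The Python sketch below illustrates the idea only: the merchant groups, industry labels, values and the mapping itself are hypothetical, and a real concordance would be many-to-many in places and would need subject-matter review.

```python
from collections import defaultdict

# Hypothetical concordance from provider merchant groups to industry labels.
# Both sides are illustrative; neither comes from an actual classification.
CONCORDANCE = {
    "grocery": "retail trade",
    "gas station": "retail trade",
    "restaurant": "accommodation and food services",
}

# Hypothetical aggregated transaction values by merchant group.
values_by_merchant_group = {
    "grocery": 120.0,
    "gas station": 80.0,
    "restaurant": 50.0,
    "miscellaneous": 10.0,  # deliberately missing from the concordance
}


def reclassify(values, concordance):
    """Re-aggregate values by industry; return (by_industry, unmapped_groups)."""
    by_industry = defaultdict(float)
    unmapped = []
    for group, value in values.items():
        industry = concordance.get(group)
        if industry is None:
            unmapped.append(group)  # flag for manual review rather than guessing
        else:
            by_industry[industry] += value
    return dict(by_industry), unmapped


by_industry, unmapped = reclassify(values_by_merchant_group, CONCORDANCE)
print(by_industry)
print("Unmapped merchant groups:", unmapped)
```

Reporting the unmapped groups separately, rather than forcing them into an industry, keeps the coverage gap visible when the data are evaluated.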
23 Discussion point for the AEG
- In order to exploit the potential of big data, NSOs need to make significant investments. How can we leverage the work taking place across various NSOs to minimize the investment and maximize the return?
- How do we promote the development of new data products using big data over using big data to re-construct existing data products? Do we adjust our frameworks to accommodate big data or do we adjust big data to accommodate our frameworks?