Title: Quality Challenges in Processing Administrative
Data to Produce Short-term Labour Cost Statistics
M. Carla Congia, Silvia Pacini, Donatella Tuzi
(tuzi_at_istat.it) Istat - Italy
European Conference on Quality in Official
Statistics 2008, Session on Administrative data.
Rome, 8-11 July 2008
Administrative data Session
Presentation Outline
- The Italian Oros Survey
- The peculiarities of the administrative source used
- The quality strategy in a context of timely and extensive use of administrative data
- Final remarks
Q2008. Rome, 8-11 July 2008
The Oros Survey
Since 2003 the Italian NSI has released quarterly
indicators on gross wages and total labour cost
(the Oros Survey) covering enterprises of all sizes
in the private non-agricultural sector. Indices are
released 70 days after the end of the reference
quarter.
In the past this information was collected monthly
only for large firms, through the Survey on Large
Enterprises (more than 500 employees).
The Oros Survey was planned to fill this gap in
Italian statistics, using administrative data
(employees' social contribution declarations to the
National Social Security Institute - INPS) for Small
and Medium Enterprises, integrated with the survey
data on Large Enterprises (LES).
Today the Oros Survey is an innovative Italian
example of administrative data used extensively to
produce timely business statistics.
The Administrative Sources
All Italian non-agricultural firms in the private
sector with at least one employee (roughly 12
million employees and 1.3 million employers per
year) have to pay monthly social security
contributions to INPS.
- INPS administrative register (AR)
  - Contains structural information for each administrative unit (administrative id., fiscal code, name, legal form, dates of registration and cancellation, etc.).
  - About 4 million records each quarter.
  - Transmitted at the end of the reference quarter.
- Employers' monthly declaration (DM10 form)
  - Highly detailed grid organized in administrative codes, with information on employment by type, paid days, wage bills, social contributions, credit terms and tax reliefs.
  - Each DM10 is spread over several records (8 records per unit on average).
  - About 10 million records each month.
  - Transmitted 35 days after the end of the reference quarter.
Peculiarities of the Administrative Source
- Unlike survey data, the use of an administrative source
  - reduces the financial costs of direct collection and avoids further response burden on enterprises
  - satisfies the growing demand for timely and detailed statistical information, for multiple statistical aims.
- Yet data collection is beyond the control of the NSI, which needs information about the quality of the administrative data used.
- Close relationships and coordination with the administrative institutions help to reduce the risk of data quality problems arising from dependence on the data supplier.
- In this, the Oros Survey does not differ from other register-based statistics.
Peculiarities of the Administrative Source (2)
- What makes the Oros Survey peculiar with respect to other register-based statistics is its release timeliness, which obliged Istat to acquire completely raw data, without any prior check or aggregation. Unusual statistical quality aspects are implied:
  - the processing of a huge quantity of complex data in a very short time
  - the lack of standardized metadata to translate administrative information
  - the continuous changes of administrative definitions and concepts.
- The acquisition of raw information allows Istat to monitor most of the processing aspects, but hard work is needed to guarantee a high standard of quality.
- A pervasive quality strategy has been implemented, covering the whole Oros production process.
The Quality Strategy in the Oros Production
Process
The Administrative Register
- The AR is used as a representation of the current population.
- But
  - it suffers from over-coverage problems (temporary suspensions and firm closures are under-recorded)
  - the economic activity code is drawn from the Italian Business Register (BR) (90% of the Oros active units)
  - hard work is needed to outline the estimation frame (exclusion of units not belonging to the Oros target population)
  - special attention is paid to the quality of the fiscal code as the leading matching variable.
Preliminary Checks and Retrieval of the
Statistical Variables
Meta-information on laws, regulations, contribution rates, codes and other technical aspects of Social Security is promptly collected and updated in a standardized METADATA DATABASE built in-house. It is necessary to carry out
- preliminary checks on raw data and correction of errors in codes, record duplications and inconsistencies with current legislation
- translation of the administrative data into statistical variables, through complex additions and subtractions of a huge number of wage and contribution items identified by numerous administrative codes (currently more than 5,000)
- estimation of some components for which information is not available in the administrative form (e.g. employers' injury insurance premiums and severance payments).
In this step each DM10 is reorganized into a single record.
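The translation step can be pictured with a minimal Python sketch. The codes, signs and variable names are hypothetical placeholders, not actual INPS codes or the contents of the Oros metadata database. The sketch only shows how the several records of one DM10 might be collapsed into one record through signed sums.

```python
from collections import defaultdict

# Hypothetical metadata: administrative code -> (statistical variable, sign).
# Real DM10 codes and their treatment are not reproduced here.
CODE_MAP = {
    "A001": ("wages", +1),
    "A002": ("wages", +1),
    "C010": ("contributions", +1),
    "R020": ("contributions", -1),  # e.g. a rebate that reduces contributions
}

def collapse_dm10(records):
    """Collapse (unit_id, code, amount) records into one record per unit."""
    out = defaultdict(
        lambda: {"wages": 0.0, "contributions": 0.0, "unknown_codes": []}
    )
    for unit_id, code, amount in records:
        if code not in CODE_MAP:
            # Unknown codes are kept for review rather than silently dropped.
            out[unit_id]["unknown_codes"].append(code)
            continue
        variable, sign = CODE_MAP[code]
        out[unit_id][variable] += sign * amount
    return dict(out)

records = [
    ("U1", "A001", 1000.0),
    ("U1", "A002", 200.0),
    ("U1", "C010", 400.0),
    ("U1", "R020", 50.0),
    ("U1", "Z999", 10.0),
]
result = collapse_dm10(records)
```

Keeping the code-to-variable mapping in a separate table, rather than in the processing logic, is one way to absorb frequent changes in administrative definitions without rewriting the procedure.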
Treatment of Measurement Errors
Once statistical data have been made available, a more traditional micro editing procedure is set up but, given the huge number of units, it is strongly based on selective criteria. A score function assigns to each of the 1.3 million units the probability that an error occurs in the target variables.
Cut-off thresholds are fixed to select anomalous values, but their identification is deeply affected by the significant tails in the distribution of the target variables:
- very low per capita wages (e.g. units with only supplementary earnings)
- negative per capita other labour costs (e.g. social contribution rebates).
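As a rough illustration of selective editing, the sketch below ranks units by a weighted robust distance from the median and selects those above a cut-off. This is an assumption made for illustration: the slides do not specify the Oros score function, and the score here is a suspicion measure, not an error probability. The weights, threshold and data values are invented.

```python
from statistics import median

def robust_scores(values, weights):
    """Weighted distance of each value from the median, in MAD units."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        mad = 1.0  # assumption: avoid division by zero on degenerate data
    return [w * abs(v - med) / mad for v, w in zip(values, weights)]

def select_for_review(unit_ids, values, weights, threshold):
    """Return the units whose score exceeds the cut-off threshold."""
    scores = robust_scores(values, weights)
    return [u for u, s in zip(unit_ids, scores) if s > threshold]

# Invented per capita other labour costs (euro) for six units.
unit_ids = ["a", "b", "c", "d", "e", "f"]
values = [430.0, 450.0, 410.0, 440.0, 6900.0, -1350.0]
weights = [1.0] * len(values)  # e.g. could reflect the unit's employment

selected = select_for_review(unit_ids, values, weights, threshold=5.0)
```

With heavy tails, a median-based scale is less distorted by extreme values than a mean and standard deviation would be. Legitimate extremes, such as rebates producing negative costs, would still be flagged and need interactive review.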
Figure 1: Distribution of the per capita other labour costs (euro values) in the Oros manufacturing small and medium enterprises, July 2007.
Mean 450; Median 430; Max 6,900; Min -1,350
Treatment of Measurement Errors (2)
The edit and imputation rules are based on known functional relations among the analyzed variables and are aimed at evaluating and maintaining, at unit record level, both cross-sectional and longitudinal consistency, using information on the closest months.
The number of monthly edits is generally not high, but even a single oversight may have a significant effect.
Example: quarterly changes of the Oros wage index in the Wholesale and retail trade sector (G). In the third quarter 2007, the number of employees of a unit was affected by a measurement error: part time workers 73,000; imputed data 2. Left uncorrected, the error would have implied a change of 0.8% instead of 3%.
This step is mainly interactive. Given the nature of the data, experience has shown that automatic corrections are best avoided.
Treatment of Non-response Errors
In the Oros Survey, non-responses are units delivering the DM10 with a delay. Nevertheless, about 95-98% of the Oros population is represented in the preliminary administrative data. Given the tested MAR (missing at random) nature of the missing units and their limited number in the preliminary data, they do not significantly affect the Oros wage and other labour cost changes.
Units belonging to Temporary Employment Agencies (TEA) are an exception, because of their strong characterization:
about 100 units accounting for 3% of total employment in the private sector (20% in sector K - Real estate, renting and business activities).
The absence of even a few of these units may significantly affect changes in the per capita indicators.
Treatment of Non-response Errors (2)
- Singling out TEA unit non-responses is not an easy task:
  - the population under study is represented by the current AR, which suffers from over-coverage problems (a list of respondents is not available). It follows that the active status of a unit must be predicted, through a longitudinal analysis of the unit's activity in the nearby quarters
  - given the strongly dynamic nature of TEA, hard work is necessary to follow their frequent changes (e.g. mergers, split-ups, etc.) over time, to separate real non-responses from non-active units.
Imputation of missing data is deterministic and largely based on the use of past information on non-respondents and panel information on the current respondents.
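One possible form of such a deterministic rule is sketched below. The non-respondent's last observed value is carried forward using the growth rate observed on the panel of units that responded in both periods. The exact rule used in Oros is not given in the slides, so this formula, the function names and the data are illustrative assumptions.

```python
def panel_growth(previous, current):
    """Ratio of totals over units observed in both periods (the panel)."""
    panel = [u for u in current if u in previous]
    prev_total = sum(previous[u] for u in panel)
    if not panel or prev_total == 0:
        return 1.0  # assumption: fall back to "no change"
    return sum(current[u] for u in panel) / prev_total

def impute_missing(previous, current, expected_active):
    """Impute units expected to be active but missing in the current period."""
    growth = panel_growth(previous, current)
    imputed = {}
    for unit in expected_active:
        if unit not in current and unit in previous:
            imputed[unit] = previous[unit] * growth
    return imputed

# Invented employment figures for three TEA units.
previous = {"T1": 100.0, "T2": 200.0, "T3": 50.0}
current = {"T1": 150.0, "T2": 300.0}  # T3 has not delivered yet
imputed = impute_missing(previous, current, ["T1", "T2", "T3"])
```

The list `expected_active` stands in for the prediction of each unit's active status, which the slides describe as a separate longitudinal analysis.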
Integration with Survey Data on Large Enterprises
In the Oros estimates special attention is given to Large Enterprises (firms with more than 500 employees - LE). In the Italian non-agricultural sector LE account for about 1,000 units employing 2 million workers.
- In the past, integration of survey data on LE was strongly motivated by the poor representation of these units in the preliminary administrative data.
- Nowadays the INPS source guarantees a good coverage of these units but, as experience has suggested, the use of the statistical source provides higher quality data:
  - enterprises can be recontacted in case of non-responses or suspected measurement errors
  - more rapid and efficient management of the frequent legal changes these units are subject to (e.g. mergers, split-ups, acquisitions, etc.).
Integration with Survey Data on Large Enterprises
(2)
- Combining survey and administrative data involves specific quality aspects:
  - harmonisation of variables
  - record matching: the fiscal code is the main linking variable, but ambiguities may arise because of formal errors or different updating times in the two sources (mergers, hive-offs and split-ups might be recorded in different periods).
Great effort goes into avoiding omissions and duplications, using supplementary information (legal name, number of employees, etc.).
About 12% of LES employment is manually reviewed and matched to the corresponding administrative firms.
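The matching logic can be sketched in Python as follows. Units are linked on a normalized fiscal code, and pairs whose employee counts diverge strongly are sent to manual review. The field names, the normalization and the 20% tolerance are assumptions for illustration, not the actual Oros procedure.

```python
def normalize(code):
    """Remove spaces and upper-case a fiscal code (basic formal cleanup)."""
    return code.replace(" ", "").upper()

def match_units(survey_units, admin_units, tolerance=0.2):
    """Match on fiscal code; flag pairs with diverging employee counts."""
    admin_by_code = {normalize(a["fiscal_code"]): a for a in admin_units}
    matched, review, unmatched = [], [], []
    for s in survey_units:
        a = admin_by_code.get(normalize(s["fiscal_code"]))
        if a is None:
            unmatched.append(s["fiscal_code"])
            continue
        gap = abs(s["employees"] - a["employees"]) / max(s["employees"], 1)
        if gap > tolerance:
            review.append(s["fiscal_code"])  # supplementary check needed
        else:
            matched.append(s["fiscal_code"])
    return matched, review, unmatched

survey_units = [
    {"fiscal_code": "abc 123", "employees": 1000},
    {"fiscal_code": "DEF456", "employees": 800},
    {"fiscal_code": "GHI789", "employees": 600},
]
admin_units = [
    {"fiscal_code": "ABC123", "employees": 990},
    {"fiscal_code": "DEF456", "employees": 300},
]
matched, review, unmatched = match_units(survey_units, admin_units)
```

A large gap in employee counts under the same fiscal code may signal a merger or split-up recorded at different times in the two sources, which is why such pairs are flagged rather than accepted automatically.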
Checks on Macro Data
Final checks on macro data are a key step in pursuing the quality target, to identify possible residual errors that may affect the estimates. These checks are mainly based on
- analytic and graphical inspection of the time series at sub-population detail: pre-defined statistical measures must stay within acceptance boundaries
- automatic detection of outliers based on TERROR, an application of the software TRAMO-SEATS, where the detection of suspected errors is based on REG-ARIMA model estimates
- comparison with figures from other statistical sources (e.g. National Accounts, Indices of wages according to collective agreements, etc.)
- variable relationships, whose coherence has to be guaranteed (e.g. the ratio of other labour costs to wages, etc.).
If any error is detected, a drill-down to micro data may be necessary.
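A coherence check on variable relationships could look like the minimal sketch below, which flags the periods in which the ratio of other labour costs to wages leaves an acceptance interval. The bounds and the figures are invented for illustration. The sketch does not reproduce TERROR or any REG-ARIMA modelling.

```python
def ratio_out_of_bounds(series, lower, upper):
    """Return periods whose other-labour-cost/wage ratio is outside [lower, upper]."""
    flagged = []
    for period, (wages, other_costs) in series.items():
        ratio = other_costs / wages
        if not lower <= ratio <= upper:
            flagged.append(period)
    return flagged

# Invented aggregates: period -> (wages, other labour costs).
series = {
    "2007Q1": (1000.0, 400.0),
    "2007Q2": (1010.0, 405.0),
    "2007Q3": (1020.0, 150.0),
}
flagged = ratio_out_of_bounds(series, lower=0.3, upper=0.5)
```

A flagged period would then be the starting point for the drill-down to micro data.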
Internal Oros Quality Reporting
- The quarterly documentation and updating of the Oros production process is a fundamental task in the general quality strategy:
  - metadata are archived
  - methodological information is documented
  - imputed data are flagged (and pre-imputation data are archived)
  - quality indicators on the impact of imputation are calculated.
The documentation of the Oros process guarantees its reproducibility and repeatability.
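Two simple indicators of this kind are sketched below: the share of imputed units, and the relative change in a total caused by imputation. They are generic examples, not necessarily the indicators computed for Oros, and the record layout is an assumption. Both can only be computed because flags and pre-imputation values are archived.

```python
def imputation_impact(records):
    """Imputation rate and relative change in the total due to imputation."""
    n_imputed = sum(1 for r in records if r["imputed"])
    total_before = sum(r["raw"] for r in records)
    total_after = sum(r["final"] for r in records)
    return {
        "imputation_rate": n_imputed / len(records),
        "relative_impact": (total_after - total_before) / total_before,
    }

# Invented records: archived raw value, final value, imputation flag.
records = [
    {"raw": 100.0, "final": 100.0, "imputed": False},
    {"raw": 300.0, "final": 200.0, "imputed": True},
    {"raw": 100.0, "final": 100.0, "imputed": False},
    {"raw": 500.0, "final": 500.0, "imputed": False},
]
indicators = imputation_impact(records)
```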
Final Remarks
- The Oros Survey was
  - developed without any previous experience in the use of administrative data for the production of short-term official statistics
  - gradually implemented, learning by doing.
- High timeliness, frequent changes in Social Security laws and regulations and highly detailed raw data imply relevant and unusual quality problems, managed through
  - close relationships and coordination with the administrative institution
  - a pervasive quality strategy along the whole production process
  - highly skilled human resources to handle the wide and non-conventional processing aspects, subject to frequent modifications
  - systematic documentation of the production steps.
Less standardizable than a traditional survey quality strategy?
References
Baldi C., Ceccato F., Cimino E., Congia M.C., Pacini S., Rapiti F., Tuzi D. (2004) Use of Administrative Data to produce Short Term Statistics on Employment, Wages and Labour Cost. Essays, n.15/2004, Istat, Rome.
Caporello G., Maravall A. (2002) A tool for quality control of time series data. Program TERROR. Bank of Spain.
Eurostat (2003) Quality assessment of administrative data for statistical purposes. Doc. Eurostat/A4/Quality/03/item6, available on the web site http://epp.eurostat.ec.europa.eu/pls/portal/docs/PAGE/PGP_DS_QUALITY/TAB47141301/DEFINITION_2.PDF
Istat, CBS, SFSO, Eurostat (2007) Recommended Practices for Editing and Imputation in Cross-Sectional Business Surveys, available on the web site http://edimbus.istat.it/dokeos/document/document.php?openDir2FRPM_EDIMBUS
Thank you for your attention
Donatella Tuzi, tuzi_at_istat.it