Title: European Conference on Quality in Survey Statistics
1. European Conference on Quality in Survey Statistics
2. Exploring the statistical utilisation of
financial statements to estimate sub-regional
economic variables
- P. Calia - C. Filippucci
- University of Bologna - Italy
- Filippucci_at_stat.unibo.it
3. Aim of the work
- Obtain statistical data at local level (SLL,
sub-regional level) and by economic sector that
official statistics do not provide (value added,
output, labour cost), using AIDA, a digital
database of company financial statements
- SLL: cluster of municipalities based on
commuting flows
- The Emilia-Romagna region and the manufacturing
sector are considered, for the year 2002
4. Aim of the work
- THE PROBLEM
- to deal with unit selection through an
appropriate estimation strategy for the variables
5. Why use accounting data
- Moreover
- Information is available on a large number of
businesses, many more than in sample surveys
- Reduces the respondent burden
- Supplies micro data for modelling economic
behaviour
6. Outline of the paper
- Do business accounts provide the information
needed to estimate the economic variables of
interest?
- AIDA reliability
- AIDA's coverage of the population
- Strategy for estimation
7. AIDA
- Edited by Bureau van Dijk
- Collects financial statements of Italian
companies since 1996
- Accessible via web after subscription
- 18,796 firms located in Emilia-Romagna out of
365,000 (year 2002)
8. AIDA (cont.)
- It is possible to:
- select firms by different criteria (economic
activity, location, identification codes, legal
form, etc.)
- select variables of interest (accounting items
and ratios, user-defined indexes, structural
economic variables, stakeholders, etc.)
- perform peer group comparisons and simple
statistical analysis
- export data to be processed with other software
9. AIDA (cont.)
- Companies required to file financial statements
are 26% of firms but 72% in terms of employment
- In AIDA they are more than 99%
- Total coverage in the manufacturing sector is
14% (firms) and 61% (employees); it improves when
only the above group is considered: 24% (firms)
and 84% (employees)
- AIDA includes only firms with an estimated 2002
production value > 500,000 euros
10. From business accounts to SNA concepts
- The problem has been addressed by the UN (UN,
2000): success depends on the formats used for
compiling financial statements
- In Italy, the structure, format and contents of
business accounts are established by law within
the EU framework
- Output components and costs are identified
11. From business accounts to SBS concepts (cont.)
- Exact correspondence cannot be established for
all variables (item detail and aggregation)
- In Italy, the adequacy of accounting data with
respect to EC regulations for business statistics
has been studied by the National Statistical
Institute (Istat):
- accounting data cover about 65% of the variables
considered by structural business surveys
- high correspondence between accounting and
survey values
- In fact, accounting data are used by Istat to
impute missing data in structural business
surveys
12. From business accounts to SBS concepts (cont.)
- Macrovariables in the SNA framework are not
exactly the same
- BUT
- Financial statements are the only source for
obtaining fairly good indicators of firms'
performance at sub-regional and activity-sector
levels
13. Production value (output)
14. Value added at factor costs
15. Reliability
- Two strategies:
- Internal checks
- External checks, comparing with ASIA 2002
- 2002 is considered because ASIA refers to 2002
- ASIA: permanent archive of all active firms in
industry and services, updated yearly. It is the
frame for all the business surveys
16. Issue 3: available information
- From ASIA E-R (2002)
- Fiscal code (unique id)
- Name of the enterprise
- Activity code (ateco 2002 Nace rev. 1.1)
- Geographical location
- Legal form
- Number of employees
- From AIDA
- Fiscal code and business register code (CCIAA)
- Name of the enterprise
- Activity code (ateco 2002 Nace rev. 1.1)
- Geographical location
- Legal form
- Date of birth
- End date of financial statement
- Accounting items from income statement
17. Preliminary check
- 1. Internal checks: code duplication and missing
data. Essentially no problems
- 2. External checks
- Linking AIDA and ASIA (by fiscal code, NACE Rev.
1.1 codes, location and legal form)
- Some problems, but very few, or easy to recover
and to correct using ASIA
18. Issue 3: Preliminary check (cont.)
- Only 35 records are duplicated by fiscal code
(same company):
- Different CCIAA code: only 2 are clearly
recording errors
- Different geographical location (33 companies)
- Different legal form (2 companies)
- Different economic activity code (17 companies)
- Differences in accounting items (5 companies)
- Retain the records whose geographical location
matches that in ASIA: 18,761 unique records
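The deduplication rule above (one record per fiscal code, preferring the record whose location agrees with ASIA) could be sketched as follows. This is only an illustration: the field names `fiscal_code` and `location`, the helper `deduplicate`, and the toy records are assumptions, not taken from AIDA or ASIA.

```python
def deduplicate(aida_records, asia_location):
    """Keep one AIDA record per fiscal code (hypothetical helper).

    aida_records: list of dicts with 'fiscal_code' and 'location' keys.
    asia_location: dict mapping fiscal code -> location recorded in ASIA.
    """
    kept = {}
    for rec in aida_records:
        code = rec["fiscal_code"]
        matches_asia = rec["location"] == asia_location.get(code)
        # Keep the first record seen; a record agreeing with ASIA overrides it.
        if code not in kept or matches_asia:
            kept[code] = rec
    return list(kept.values())


# Toy data, invented for illustration only.
aida = [
    {"fiscal_code": "A1", "location": "Bologna"},
    {"fiscal_code": "A1", "location": "Modena"},
    {"fiscal_code": "B2", "location": "Parma"},
]
asia = {"A1": "Modena", "B2": "Parma"}
unique = deduplicate(aida, asia)
print(len(unique), "unique records")
```

With the figures on the slide, the same rule takes 18,796 records with 35 duplicates down to 18,761 unique records.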
19. Issue 3: Preliminary check (cont.)
- Missing data in:
- Fiscal code (12 records, recovered in AIDA)
- Geographical location (15 records, 14 recovered
in AIDA)
- Legal form (859 records): easy to obtain from ASIA
- Economic activity (43 records)
20. Issue 2: Preliminary check (cont.)
- LINKING BY FISCAL CODE WITH ASIA
- 18,023 companies out of 18,761 match
- 738 are not in ASIA: NOT CONSIDERED IN OUR
ANALYSIS
- Reasons:
- not eligible according to ASIA (which excludes
businesses in sectors A, B, L, P, Q and division
91, and legal units classified as institutions or
belonging to the private non-profit sector)
- not resident in Emilia-Romagna in 2002 and moved
there later (location and other characteristics in
AIDA refer to the last balance sheet available)
- no longer active in 2002 but required to file
the balance sheet (e.g. liquidation, bankruptcy)
- acquired by another enterprise during that year
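The linkage step can be illustrated with a minimal sketch that splits AIDA fiscal codes into those found in ASIA and those not found. The function name `link_by_fiscal_code` and the sample codes are hypothetical; the actual linkage also compared activity code, location and legal form.

```python
def link_by_fiscal_code(aida_codes, asia_codes):
    """Split AIDA fiscal codes into matched and unmatched against ASIA."""
    asia_set = set(asia_codes)  # set lookup keeps the match linear in size
    matched = [c for c in aida_codes if c in asia_set]
    unmatched = [c for c in aida_codes if c not in asia_set]
    return matched, unmatched


# Toy codes, invented for illustration only.
aida_codes = ["A1", "B2", "C3", "D4"]
asia_codes = ["A1", "B2", "D4", "E5"]
matched, unmatched = link_by_fiscal_code(aida_codes, asia_codes)
print(len(matched), "matched;", len(unmatched), "not in ASIA")
```

On the real data the two parts are 18,023 matched and 738 unmatched, which add up to the 18,761 unique AIDA records.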
21. Issue 3: Preliminary check (cont.)
- Problems in comparing (because of point 2 on the
previous slide):
- Ateco code: 7,982 records in AIDA with values
different from ASIA (reduced to 3,245 considering
the 2-digit code)
- Geographical location: after normalisation, 814
businesses are located differently (371 change
SLL)
- Legal form: after normalisation, 1,416
enterprises have a different legal form (of which
859 with missing data)
- In the following analysis, data from ASIA have
been chosen
22. Framework for estimation
- The sample is not random: a predictive approach
is needed for estimation (Royall, 1988; Royall and
Pfeffermann, 1982; Särndal, 1996)
- The predictive approach is based on a
statistical model in the population relating the
variable of interest to auxiliary variables
- The model should be able to make the selection
process ignorable
- It is not a behavioural model: its only task is
to incorporate the relationships between observed
and non-observed units
23. Framework for estimation (cont.)
- The model is the only source for inference
- From this model, predictions for non-observed
units are obtained
- Auxiliary variables are chosen in order to
identify groups where the selection process is
not informative (Estevao et al., 1995)
24. Framework for estimation (cont.)
- Finite population
- Variable to be estimated
- Auxiliary variables known for each unit
- Subsets of units observed and non-observed
- Sample units in a sample s
- Selection mechanism for a sample s
25. Framework for estimation (cont.)
- In the model approach it is necessary to specify:
- function connected to the selection
- Y distribution in the finite population on the
basis of the a priori
- Joint distribution of Y and
- distribution based on observations
26. Framework for estimation (cont.)
- When making inference, the selection process
can be ignored if, given the known variables,
selection does not depend on the variable of
interest (Rubin, 1976)
- If, in a sample, the variable of interest is
independent of the unit selection but dependent
on the known Z,
- then the distribution of the observations equals
the one obtained ignoring the selection
(inference is possible referring only to the
observations and the model)
27. Framework for estimation (cont.)
- In a non-random sample, selection is represented
by an indicator variable Z
- Given s, inference for the group selected by
means of Z is obtained ignoring the selection
- To make inference on the other groups as well,
more must be known
- If X are variables known for all the units in
the population and, given X, Y does not depend on
Z, then the selection on Z can be ignored
- Inference on quantities of the finite population
also requires further information
28. Framework for estimation (cont.)
- The first task for model specification is
- to find out the variables driving the selection
29. Variables driving the selection
- Legal form is the crucial variable for
participating in AIDA, and it is not possible to
use it as an auxiliary variable
- Possible auxiliary variables are: size (number
of employees), sector of activity, sub-regional
area (SLL)
- To assess their relevance, coverage and
distributions in AIDA are compared with the
population
30. Tables by legal form
31. Coverage by size
- SMALL FIRMS ARE NOT REPRESENTED
- > 20 employees: more than 74% (firms), 76.5%
(employees)
- 1 employee: 1.2%
- 2-9 employees: 6% (firms) and 9% (employees)
- Distributions: in AIDA, 22% of firms have fewer
than 10 employees vs 80% in the population
32. Coverage by size
- Excluding proprietorships and partnerships,
coverage rises:
- > 20 employees: more than 90% (firms and
employees)
- 1 employee: 9%
- 2-9 employees: 30-40%
- distributions get closer, especially for
employees
33. Tables by size
34. Coverage by sector
- Coverage by sector ranges from 5% to 60% (firms)
and from 30% to 80% (employment)
- 50% of sectors have a coverage greater than 17%
(firms) and 58% (employment)
- AIDA and ASIA distributions do not differ much:
the simple index of dissimilarity is 0.22 and
0.11
- Considering the subgroup of companies and
cooperatives, coverage improves and distributions
get closer
35. Table by sector (2-digit code)
36. Coverage by SLL
- Coverage by SLL is highly variable: it ranges
between 3% and 23% (firms) and between 19% and
78% (employees)
- Coverage in 50% of SLLs is greater than 10.6%
(firms) and 53% (employees)
- AIDA and ASIA distributions do not differ much:
the simple dissimilarity indexes are very low
(0.12 for firms and 0.06 for employees)
- Considering the subgroup of companies and
cooperatives, coverage improves and distributions
get closer
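The coverage rates and dissimilarity indexes reported for sectors and SLLs can be computed along the following lines. The slides do not give the formula for the "simple index of dissimilarity"; this sketch assumes the common definition, half the sum of absolute differences between the two relative distributions. The counts are invented for illustration.

```python
def shares(counts):
    """Turn a dict of counts into relative frequencies."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def dissimilarity(counts_a, counts_b):
    """Assumed definition: 0.5 * sum over categories of |p_a - p_b|."""
    pa, pb = shares(counts_a), shares(counts_b)
    keys = set(pa) | set(pb)
    return 0.5 * sum(abs(pa.get(k, 0.0) - pb.get(k, 0.0)) for k in keys)


def coverage(counts_sample, counts_pop):
    """Share of population units in each category that the sample contains."""
    return {k: counts_sample.get(k, 0) / counts_pop[k] for k in counts_pop}


# Toy firm counts by category (e.g. sector or SLL), invented for illustration.
asia_counts = {"S1": 50, "S2": 50}   # population frame
aida_counts = {"S1": 30, "S2": 10}   # firms found in the database
cov = coverage(aida_counts, asia_counts)
d_index = dissimilarity(aida_counts, asia_counts)
print(cov, round(d_index, 2))
```

The index is 0 when the two distributions coincide and 1 when they do not overlap at all, so values such as 0.12 or 0.06 indicate similar distributions even when coverage itself is low.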
37. From the above analysis
- The evidence presented shows that the variables
used to study coverage are important in driving
the selection, hence they are relevant for
implementing a model-based estimation strategy
- Moreover, they are important from an economic
point of view
38. From the above analysis (cont.)
- Because of the severe undercoverage of small
firms, only firms with more than 1 employee are
considered (this excludes mostly proprietorships
and partnerships)
- As a consequence, totals will be underestimated,
even if the economic weight of firms with one
employee is negligible: they are 29% of the
population but represent only 2.8% of employees
39. BLU predictor
- The total of Y
- The predictor of T
- We need a predictor only for the non-observed
units of the population
- The form of the predictor, its unbiasedness and
its optimality depend on the model
- The general specification of a BLU predictor, as
well as the unbiasedness conditions, are obtained
assuming a general regression model
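The structure of a predictive estimator of a total (observed values summed as they are, plus model predictions for the non-observed units) can be sketched as below. The slide formulas are not available, so this assumes the simplest case: a group-mean model with uncorrelated, homoskedastic errors, under which the sample mean of a group is the BLU predictor for each non-observed unit in that group. The function name, group labels and values are hypothetical.

```python
from collections import defaultdict


def predict_total(observed, unobserved_groups):
    """Predicted total = sum of observed y + predictions for unobserved units.

    observed: list of (group, y) pairs for sampled units.
    unobserved_groups: group label of each non-observed unit.
    Assumes every unobserved group has at least one observed unit.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for group, y in observed:
        sums[group] += y
        counts[group] += 1
    means = {g: sums[g] / counts[g] for g in sums}
    observed_part = sum(y for _, y in observed)
    predicted_part = sum(means[g] for g in unobserved_groups)
    return observed_part + predicted_part


# Toy data, invented for illustration only.
observed = [("small", 10.0), ("small", 14.0), ("large", 100.0)]
unobserved = ["small", "small", "large"]
total = predict_total(observed, unobserved)
print(total)
```

Domain (SLL) totals follow by applying the same sum to the units of each domain; richer models only change how the prediction for a non-observed unit is formed.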
40. Model groups
- In our case, SLLs are the estimation domains:
they form a partition of the population of N units
- The population is to be divided into H groups
able to explain most of the variance in Y
- Groups are based on size and economic activity,
according to two strategies:
- 1 - groups are obtained by cross-classifying
firm size (number of employees) and economic
activity
- 2 - economic activity defines the groups and
size enters the model as a continuous variable
with a different parameter for each group
41. Model groups (cont.)
- SLL (estimation domain) could also be a relevant
auxiliary variable for establishing the model
groups
- SLLs are not relevant only if their effect is
captured by activity sector (many SLLs are
specialised in one activity)
- As a consequence, SLL could also be used to form
groups
42. Superpopulation models
- (1) Group h (size by sector) effect equal across
domains
- (2) Group (size x sector) effect plus independent
domain (q) effect
- (3) Interaction effect of size (continuous) with
sector plus independent domain effect
- (4) Interaction effect of size (continuous) with
sector
- (5) Group effect (size by sector) different across
domains
43. Superpopulation models (cont.)
- Group effects are fixed
- The model specification requires defining the
variance and covariance structure of the
observations
- The simplest structure is homoskedasticity
- We do not expect homoskedasticity: variability
should depend on size and/or economic activity
44. Superpopulation models (cont.)
- The analysis of the residuals from the model
estimated for value added confirms our expectation
- We concentrated on two main heteroskedasticity
structures:
- A) variance depending on x, where x is size and
k indexes the firm
- B) variance depending on m, where m is activity
sector or size
- A further alternative considers x^2 in A and B
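To see how the variance assumption changes the estimate, consider a no-intercept model y = beta * x + e within a single sector, with Var(e) proportional to a power of size x. The exact formulas of structures (A) and (B) are not recoverable from the slides, so the variance form used here is an assumption. Under it, weighted least squares with weights 1 / x^power gives the ratio of totals when power = 1 and the mean of the unit ratios when power = 2.

```python
def wls_slope(x, y, power=1):
    """WLS slope for y = beta * x + e, assuming Var(e) proportional to x**power."""
    weights = [1.0 / (xi ** power) for xi in x]
    num = sum(w * xi * yi for w, xi, yi in zip(weights, x, y))
    den = sum(w * xi * xi for w, xi in zip(weights, x))
    return num / den


# Toy data: x = employees, y = value added (invented for illustration).
x = [10.0, 20.0, 70.0]
y = [30.0, 50.0, 220.0]
beta_x = wls_slope(x, y, power=1)    # equals sum(y) / sum(x)
beta_x2 = wls_slope(x, y, power=2)   # equals the mean of y_i / x_i
print(beta_x, beta_x2)
```

Predictions for non-observed firms are then beta times their known size, so the choice of variance structure feeds directly into the predicted totals.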
45. Superpopulation models (cont.)
- Model selection is based on the usual criteria
for evaluating model optimality:
- model fitting indicators (AIC, likelihood ratio
test)
- standard errors
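As a reminder of how AIC ranks competing specifications, the sketch below computes AIC = 2k - 2 log L for a Gaussian model from its residuals, with the variance set to its maximum-likelihood estimate. It is a generic illustration with invented residuals, not the computation used for the models in this work; the homoskedastic Gaussian likelihood is an assumption.

```python
import math


def gaussian_aic(residuals, n_params):
    """AIC for a homoskedastic Gaussian model fitted by maximum likelihood.

    n_params counts the mean parameters; the error variance adds one more.
    """
    n = len(residuals)
    sigma2 = sum(r * r for r in residuals) / n  # ML estimate of the variance
    loglik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    return 2 * (n_params + 1) - 2 * loglik


# Invented residuals from two hypothetical fits with the same parameter count.
aic_a = gaussian_aic([0.1, -0.2, 0.1, 0.0], n_params=2)
aic_b = gaussian_aic([1.0, -2.0, 1.0, 0.0], n_params=2)
print("preferred:", "a" if aic_a < aic_b else "b")
```

The model with the lower AIC is preferred; adding parameters is rewarded only if the likelihood improves enough to offset the penalty of 2 per parameter.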
46. Main findings (cont.)
- Estimation under model 5 is not suitable because
it is possible only with a high aggregation of
groups, which is not considered useful
- The heteroskedasticity structure has a
fundamental role; in particular, the two
alternatives tested are relevant
- When x^2 is considered, further improvements are
obtained
47. Main findings (cont.)
- 1) Under assumption (A)
- in models 2 and 3 the SLL effect is not
significant (according to the usual tests), hence
models 1 and 4 (SLL not considered) should be
preferred
48. Main findings (cont.)
- 2) Under assumption (B)
- a) many specifications meet estimation problems
(the likelihood does not behave properly)
- b) when estimation is possible, there are cases
where the SLL effect is significant
- c) this happens in model 3 when the variance is
modelled by m, activity sector or size
- d) according to AIC, the best result is obtained
using activity sector
49. Main findings (cont.)
- As a consequence:
- The models to be considered are (1) and (3), but
according to AIC the variance specification (A)
gives the worse results
- Model (3) is to be preferred, also because it
includes a significant SLL effect
50. Conclusions
- AIDA is the only source for estimating the main
economic variables at sub-regional level and by
economic activity sector
- It is possible to reach a fairly good
approximation of national accounting variables
and use it to obtain estimates of firms'
performance at sub-regional and activity-sector
levels
51. Conclusions (cont.)
- Problems of estimation due to undercoverage and
selection bias are handled by predictive inference
- The approach has been implemented with different
model specifications and variance assumptions
- Model (3) is to be preferred, with the assumption
that variance depends on firm size and activity
sector
52. Further work
- Further checks of model robustness will be
carried out:
- Effect of influential observations
- Effects of different group aggregations
- Sensitivity analysis