Title: Objective
1 Data Integration: The Principles of Data Fusion
MRG, 17th May, Steve Wilcox
2 Start With A Question
- Q. What is the best way to achieve a data integration?
- A. It depends.
3 It Depends Upon
- What are the analysis objectives?
- What surveys are available?
- Is respondent-level data available, or too difficult to access?
- How complex is the survey data?
- How convenient does the integrated database need to be?
- Can the survey data be simplified to provide convenience without losing functionality?
- Are there important currencies to preserve?
- What integration techniques does a preferred supplier do well?
- How much money is available?
4 There Isn't A Simple Answer
- No solution could be declared universally the best.
- There are several valid but different approaches.
- Doing it well is as important as the choice of technique.
5 What Are We Trying To Estimate?
- Ownership of MP3 players: TGI.
- Number of spots seen in a TV schedule: BARB.
- Number of spots seen by MP3 owners: ?
6 Targeting With A Demographic Surrogate
              Adults         MP3 Owners TGI Profile
              TGI Profile    Eg1    Eg2    Actual
ABC1 16-44    27             100    27     48
ABC1 45+      27             0      27     13
C2DE 16-44    23             0      23     31
C2DE 45+      23             0      23     5
- Eg1: Buy against ABC1 16-44.
- Eg2: Buy against All Adults.
- Actual: Buy against ABC1 16-44. Hope to pick up others anyway.
- Source: BMRB Target Group Ratings
7 Targeting Performance
- Perfect if product penetration is 100% in a single demographic group.
- Perfect if a demographic target contains all product users and their profile is bland across all other profiles that affect viewing.
- In practice we may have to rely too much on our judgement.
- How can we embrace the less important demographic groups?
8 Profile Matching
              Adults         Adults      MP3 Owners
              TGI Profile    BARB TVR    TGI Profile
ABC1 16-44    27             10          48
ABC1 45+      27             5           13
C2DE 16-44    23             15          31
C2DE 45+      23             20          5
- Adult TVR: 12
- Assume MP3 owners and non-owners have the same TVR in each demographic segment.
- MP3 owners TVR: 11
- Source: BMRB Target Group Ratings
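
The profile matching arithmetic on this slide can be reproduced as a profile-weighted average of the segment TVRs. The sketch below uses the figures from the slide; the function name `profile_matched_tvr` is a hypothetical helper, and dividing by the profile total is an assumption made so that a profile that does not sum to exactly 100 (the MP3 owner profile as printed sums to 97) is still averaged correctly.

```python
# Figures from the slide: TGI profiles (percent) and BARB TVRs per segment.
segments = ["ABC1 16-44", "ABC1 45+", "C2DE 16-44", "C2DE 45+"]
adults_profile = [27, 27, 23, 23]   # Adults, TGI profile
mp3_profile = [48, 13, 31, 5]       # MP3 owners, TGI profile
barb_tvr = [10, 5, 15, 20]          # Adults, BARB TVR in each segment


def profile_matched_tvr(profile, tvr_by_segment):
    """Profile-weighted average TVR (hypothetical helper).

    Assumes the target group has the same TVR as all adults
    within each demographic segment.
    """
    total = sum(profile)
    weighted = sum(p * t for p, t in zip(profile, tvr_by_segment))
    return weighted / total


adult_tvr = profile_matched_tvr(adults_profile, barb_tvr)
mp3_tvr = profile_matched_tvr(mp3_profile, barb_tvr)
print(round(adult_tvr), round(mp3_tvr))  # 12 11
```

The rounded results agree with the Adult TVR of 12 and the MP3 owners TVR of 11 shown above.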
9 Profile Matching Performance
- Perfect if the segmentation covers all the product profile differences that might affect viewing.
- Perfect if the segmentation explains all the non-random variation in TV viewing between individuals.
- Bigger risk if bigger variation within segments.
- Minimise risk by introducing more demographics.
- Ideally the same segmentation for all viewing measurements.
10 Segmentation For Profile Matching
Profile             Segments    Cumulative    Ave. Seg. Sample
Region              6           6             1667
Class               4           24            416
Age                 7           168           60
Sex                 2           336           30
Multi-Channel       3           1008          10
Work Status         2           2016          5
Household Size      5           10080         1
Children            2           20160
Household Status    3           60480
Marital Status      2           120960
Children 0-3        2           241920
Ethnicity           2           483840
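
The cumulative column is the running product of the segment counts, and the average segment sample is the total sample divided by that product. A minimal sketch follows; the sample size of 10,000 is an assumption, not stated on the slide, chosen because it is consistent with the first row (10,000 / 6 is about 1667).

```python
from itertools import accumulate
from operator import mul

# Profiles and segment counts as listed on the slide.
profiles = [
    ("Region", 6), ("Class", 4), ("Age", 7), ("Sex", 2),
    ("Multi-Channel", 3), ("Work Status", 2), ("Household Size", 5),
    ("Children", 2), ("Household Status", 3), ("Marital Status", 2),
    ("Children 0-3", 2), ("Ethnicity", 2),
]

# Assumption: total sample size is not given on the slide.
SAMPLE_SIZE = 10_000

# Running product of segment counts = number of cells after each split.
cumulative = list(accumulate((count for _, count in profiles), mul))

for (name, count), cells in zip(profiles, cumulative):
    average = SAMPLE_SIZE / cells
    print(f"{name:<18}{count:>3}{cells:>9}{average:>10.1f}")
```

Under that assumed sample size, the average cell falls below ten respondents by the sixth profile, which is why a full cross-classification is not practical.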
11 Segmentation For Profile Matching
- Sample size dictates that the segmentation would be limited to 4 or 5 profiles.
- ANOVA-based technique required to find optimum segmentation.
- What if we still think the segmentation may fail to explain cross-survey interactions?
- Use MultiBasing or Data Fusion to extend beyond the 4 or 5 profile limitation.
12 Data Fusion
- Use all available demographics to predict the viewing behaviour of each individual in the TGI sample.
- Find a demographic match in the BARB panel for each TGI respondent.
- Assume they are the same person and give that BARB panel member's complete viewing record to the TGI respondent.
- Looks like a single source survey.
13
STEVE                 ROGER                 STOGER
London                London                London
AB                    AB                    AB
48                    49 (ish)              48
Male                  Male                  Male
Cable                 DTT                   Cable
Working (ish)         Working (ish)         Working
H/H Size 1 (ish)      H/H Size 1 (ish)      H/H Size 1
Head of H/H (ish)     Head of H/H (ish)     Head of H/H
No kids               No kids               No kids
No babies             No babies             No babies
Divorced              Divorced              Divorced
White                 White                 White
TGI Data              BARB Data             TGI+BARB Data
- Demographics are called linking variables or hooks.
14 Matching The Whole BARB Panel
- Stoger matches on 10 out of 12 hooks.
- BARB and TGI are both representative samples of the same diverse population.
- Finding a match for every BARB panel member in the (larger) TGI sample is an achievable sampling exercise.
- Latest fusion: whole sample matched on 11 out of 14 hooks.
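
The idea of counting matched hooks can be sketched with the Steve and Roger example. This is a simplified illustration, not the fusion algorithm itself: the hook names, the exact-equality comparison (the "(ish)" qualifiers are ignored) and the extra panel member `other` are all assumptions made for the example.

```python
# Hook names are hypothetical labels for the 12 rows of the Steve/Roger slide.
HOOKS = [
    "region", "social_class", "age", "sex", "multi_channel", "work_status",
    "hh_size", "head_of_hh", "kids", "babies", "marital_status", "ethnicity",
]


def make_record(*values):
    """Build a respondent record keyed by hook name."""
    return dict(zip(HOOKS, values))


# Steve is the TGI respondent, Roger the BARB panel member.
steve = make_record("London", "AB", 48, "Male", "Cable", "Working",
                    1, True, False, False, "Divorced", "White")
roger = make_record("London", "AB", 49, "Male", "DTT", "Working",
                    1, True, False, False, "Divorced", "White")
# Hypothetical extra panel member, not taken from the slides.
other = make_record("North", "C2", 30, "Female", "DTT", "Working",
                    3, False, True, False, "Married", "White")


def hooks_matched(a, b):
    """Count the hooks on which two records agree exactly."""
    return sum(a[hook] == b[hook] for hook in HOOKS)


def best_match(recipient, donors):
    """Pick the donor agreeing with the recipient on the most hooks."""
    return max(donors, key=lambda donor: hooks_matched(recipient, donor))


print(hooks_matched(steve, roger))  # 10
```

Steve and Roger differ only on age and multi-channel platform, which gives the 10 out of 12 quoted above.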
15 MP3 Owners Schedule TVRs
8pm-11pm         ABC1 25-44    MP3 Owners    Index
ITV1 Mon-Thu     73            57            79
CH4 Mon-Thu      53            31            59
Five Mon-Thu     29            30            97
Sky1 Mon-Thu     11            8             73
ITV1 Fri         13            10            73
CH4 Fri          11            11            100
Five Fri         9             6             67
Sky1 Fri         2             1             59
Total TVRs       200           153           77
1+ Cover         68            55            81
- Source: BARB/BMRB Target Group Ratings
16 Choosing The Hooks
- Extending the matching to 11 hooks is a hollow achievement if they are highly correlated.
- Try to find hooks which stretch the fusion process.
17 Using The Hooks
- Understand the relative importance of the hooks to the subject of both surveys.
- Understand the correlations between the hooks.
- Incorporate both into a summary measurement of the difference between two potential matches.
- Ensures that priority is given to most relevant hooks.
- Then more hooks can only improve the fusion.
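
A summary measurement of this kind can be sketched as an importance-weighted mismatch distance. The sketch covers only the importance part: the weights, hook names and records below are hypothetical, and the adjustment for correlations between hooks is not implemented here (a Mahalanobis-style distance would be one possible way to add it, which is an assumption rather than the method used in the presentation).

```python
# Hypothetical importance weights: larger means more relevant to viewing.
importance = {
    "age": 5.0,
    "multi_channel": 4.0,
    "region": 1.0,
    "marital_status": 0.5,
}


def weighted_distance(a, b, weights):
    """Sum the importance of every hook on which two records disagree."""
    return sum(w for hook, w in weights.items() if a[hook] != b[hook])


def mismatches(a, b, weights):
    """Unweighted count of disagreeing hooks, for comparison."""
    return sum(1 for hook in weights if a[hook] != b[hook])


# Hypothetical records for illustration only.
recipient = {"age": "45-54", "multi_channel": "Cable",
             "region": "London", "marital_status": "Divorced"}
donor_a = {"age": "45-54", "multi_channel": "Cable",
           "region": "Midlands", "marital_status": "Married"}
donor_b = {"age": "25-34", "multi_channel": "Cable",
           "region": "London", "marital_status": "Divorced"}

for name, donor in [("donor_a", donor_a), ("donor_b", donor_b)]:
    print(name, mismatches(recipient, donor, importance),
          weighted_distance(recipient, donor, importance))
```

Here donor_b disagrees on fewer hooks, but donor_a is the closer match once importance is taken into account, because its two mismatches fall on the least relevant hooks.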
18 Media Imperatives As Hooks
- Media imperatives may be available on other surveys.
- Not necessarily the most important hooks
- - must discriminate well on specific media behaviour.
- - must discriminate well on product usage or other media behaviour.
- If too specific, may explain random rather than systematic behaviour.
19 The Value Of A Hub Survey
- Top-line single source survey with a manageable respondent task.
- Tailor-made to embrace all hooks relevant to fusion surveys.
- Creative media imperative hooks.
- Evaluate hooks in terms of media interactions
- - ideal for media imperatives vs. demographics.
- Fuse other surveys onto the hub, one by one.
20 Media Currency Preservation
- Particularly important for mixed media fusions.
- Control the fusion so that all or a representative sample of each survey is used.
- Sampling error can change the media trading currencies.
- Calibration may be required.
21 The Transportation Algorithm
- Generate a virtual sample larger than the two surveys to be fused.
- Transport fragments of respondents from the two surveys to the virtual sample: match supply and demand.
- Complete weighted sample from both surveys is used.
- Preserves top-line currencies.
- Reduced effectiveness for estimation of survey interactions.
22 Respondents And Their Weights
[Diagram: respondent weights in the TGI Sample, Virtual Sample and BARB Sample. Values shown: 1.5, 1.0, 1.0, 0.5, 1.2, 0.7, 1.0, 1.8, 0.3, 1.5, 1.5]
- Weighted value of each TGI and BARB record is preserved in the virtual sample.
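
How respondent weights can be split into fragments while preserving every record's weighted value can be sketched as follows. This is a simple greedy pairing in list order, not the cost-minimising transportation algorithm: a real implementation would choose pairings that minimise the total hook distance. The function name `split_weights` is hypothetical, and the TGI and BARB weights are one possible reading of the values on the diagram, which is an assumption.

```python
def split_weights(supply, demand, tol=1e-9):
    """Greedily pair two weighted samples into (i, j, weight) fragments.

    Each fragment is one virtual respondent built from supply record i
    and demand record j. Both samples must carry the same total weight.
    """
    if not supply or not demand:
        return []
    if abs(sum(supply) - sum(demand)) > tol:
        raise ValueError("samples must have equal total weight")

    fragments = []
    i = j = 0
    s_left, d_left = supply[0], demand[0]
    while i < len(supply) and j < len(demand):
        w = min(s_left, d_left)  # largest fragment both records can give
        if w > tol:
            fragments.append((i, j, round(w, 9)))
        s_left -= w
        d_left -= w
        if s_left <= tol:  # supply record used up, move to the next
            i += 1
            s_left = supply[i] if i < len(supply) else 0.0
        if d_left <= tol:  # demand record used up, move to the next
            j += 1
            d_left = demand[j] if j < len(demand) else 0.0
    return fragments


# Assumed reading of the diagram: three TGI and three BARB respondents.
tgi_weights = [1.5, 1.0, 1.5]
barb_weights = [1.0, 1.2, 1.8]

virtual = split_weights(tgi_weights, barb_weights)
for tgi_index, barb_index, weight in virtual:
    print(tgi_index, barb_index, weight)
```

With these inputs the virtual sample has five respondents, more than either survey, and summing the fragment weights by TGI index or by BARB index returns each original weight.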
23 Fusion On The Fly
- Fusion tailored to each cross-survey analysis requirement.
- Fusion linkage is re-constructed for each analysis.
- Maybe increased sensitivity means less consistency.
24 Validation
- Check the diagnostics relevant to the chosen fusion algorithm.
- Currency preservation.
- Is any single source survey data available for comparison?
25 Regression To The Mean
- Happens when the hooks don't explain all the differences in viewing for a product group.
- Can measure it if some similar single source data exists.
- Split sample/foldover test using TGI half-hour viewing data.
26 Average Regression to the Mean
27 BARB Panel Lifestyle And Insights
- Product related information.
- - 100 Additional Panel Classifications.
- Limited validation of TGRs.
- Generates about 100 Additional Panel Classifications
28 Closeness of Actual and Fused data
29 Closeness of Actual and Fused data
30 Validation: Adults 25-54 Who Have A Mortgage
Based on ITV1 evening viewing across the week (TVRs)
Source: TGRs October 2004, BARB October 2004
31 Summary
- Fusion must be tailored to the objectives of the integration.
- Success based upon ability of hooks (demographics) to explain cross-survey interactions.
- In combination they form a powerful explanatory variable.
- Additional hooks can be constructed from media imperatives, if available.
- A generalised fusion creates a convenient and consistent analysis database.
32 Summary
- There can be a trade-off between sensitivity and currency preservation.
- Fusion is always at least as good as profile matching.
- General and/or specific validation is essential to build confidence in a technique.
- Integration algorithms have to be based upon well-developed theory; doing it well is as important as the choice of technique.
33 Data Fusion?