Title: NORC Data Enclave
1 NORC Data Enclave
- Module 2 
- Metadata for researchers
2 Overview
- What is metadata and why is it important?
- What can metadata do for you? 
- Guide to creating metadata 
- Creating metadata in the enclave
3 What is metadata?
4 Metadata and the survey life cycle
- A survey is not a static process 
- It evolves dynamically over time and involves
 many players
- It extends to aggregate data to reach decision
 makers
- Metadata is crucial for capturing knowledge
5 Importance of metadata
- Imagine a world without metadata. 
- Users would say 
- I can't find the right data! How do I get access?
- Where is the report / questionnaire /
 methodology?
- I don't understand this survey / file / variable
- I can't merge the files
- How do I weight the data?
- My results don't match the report; I can't
 reproduce the same results
- Are these things comparable?
- I didn't know someone had done this research before!
- Sound familiar?
- Metadata is an answer to a researcher's
 frustrations
- Producers and archivists are working to improve
 metadata, but researchers must also capture
 metadata (Life Cycle!)
6 When to capture metadata?
- Metadata must be captured at the time the event 
 occurs!
- Documenting after the fact leads to considerable
 loss of information
- This is true for producers and researchers
7 Metadata and the Replication standard
- Replication standard 
- Gary King, Harvard, 1995 
- The only way to understand and evaluate an
 empirical analysis fully is to know the exact
 process by which the data were generated
- A replication dataset includes all information
 necessary to replicate empirical results
- Metadata is crucial to meet the standard
- Composed of documentation and structured metadata 
- Undocumented data is useless
8 What can metadata do for you?
9 What can metadata do for you?
- Facilitate publication of results and increase 
 visibility of your work
- Integrate research results into the survey
 knowledge
- Facilitate reporting, citations, etc. 
- Capture research process (replication standard!) 
- Facilitate reusability / extend the research 
- Compare results 
- Outcome 
- Makes your life easier and everyone happy
10 Guide to Creating Good Metadata
11 Capturing and using metadata
- Starts with good practices 
- Needs to be complemented with tools
- In the enclave environment 
- Follow common guidelines and good practices 
- File and variable naming conventions 
- Code documentation 
- Good statistical methods 
- Take advantage of the collaboratory 
- Use the tools at your disposal 
- Exchange ideas with others 
- Express yourself using the blog, wiki, shared
 documents, etc.
- Explore available resources
12 Coding and naming conventions (1)
- Give meaningful names to files 
- Avoid spaces in names, don't use upper case
- Version your files (capture progress) 
- Use middle extensions 
- Include metadata in the name 
- Not too good 
- report.doc, notes.txt
- myfile.dta, table2.xls
- reg.do, test.do, results.
- Better
- byu_atp_final_report_200607.doc
- byu_results_200706.dta, byu_enterprise_by_project_success.xls
- income_regression_mode.v200706.do
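One way to apply these conventions consistently is to assemble file names from their parts inside the do-file itself. The sketch below is only an illustration: the project, content, and version values are hypothetical and simply mirror the "Better" names above.

```stata
* Sketch only: the project, content, and version values are hypothetical
local project "byu"
local content "results"
local version "200706"

* Assemble a lower-case name with no spaces: project_content_version.dta
local fname "`project'_`content'_`version'.dta"
display "`fname'"
```

Building the name from local macros keeps the project prefix and version stamp in one place, so every output file of a run carries the same metadata in its name.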
13 Coding and naming conventions (2)
- Give meaningful names to variables 
- Not too good 
- tmp3, ag_exp2, v324 
- Better 
- valid_enterprise, agricultural_expenditure, s1q3 
- Comments, comments, comments!! 
- Make sure to include lots of comments in your 
 source code
- This is the best time to capture knowledge! 
- It also promotes replicability and will help you
 in a few months when you try to remember what you
 did
- Share source code, use peer review
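As a minimal sketch of how meaningful names and embedded documentation can work together in Stata, the following renames a cryptic variable and attaches a label and a note to it. The data, the label text, and the note are hypothetical and serve only to illustrate the idea; `rename`, `label variable`, and `notes` are standard Stata commands.

```stata
* Hypothetical data: one cryptic variable, as in the "Not too good" list
clear
set obs 3
generate ag_exp2 = _n * 100

* Give the variable a meaningful name and record what it holds
rename ag_exp2 agricultural_expenditure
label variable agricultural_expenditure "Agricultural expenditure (illustrative)"

* Attach a free-text note that is saved with the dataset
notes agricultural_expenditure: Renamed from ag_exp2 (hypothetical source variable).

* Review the captured metadata
describe agricultural_expenditure
notes list agricultural_expenditure
```

Labels and notes are stored in the .dta file, so this documentation stays with the data when the file is shared.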
14 Not so good code example
    local mypath = "c:\data\anonymization\"
    global data_in = "`mypath'" + "\" + "Demohh1000.dta"
    global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
    global threshold = 0.8
    cd "`mypath'"
    set more off
    use "$data_in", clear
    tempfile temp
    gen fk=1
    gen wi=weight
    collapse (sum) fk wi, by(town province marstat sex age)
    gen pk=fk/wi
    gen qk=1-pk
    gen rk=(pk/qk) * log(1/pk) if fk==1
    replace rk=(pk/(qk^2)) * ((pk*log(pk))+qk) if fk==2
    replace rk=(pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk==3
    #delimit ;
    replace rk=(pk/fk) * (1 + (qk/(fk+1)) + ((2*qk^2) / ((fk+1)*(fk+2)))
     + ((6*qk^3) / ((fk+1)*(fk+2)*(fk+3)))
     + ((24*qk^4) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)))
     + ((120*qk^5) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)))
     + ((720*qk^6) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)))
     + ((5040*qk^7) / ((fk+1)*(fk+2)*(fk+3)*(fk+4)*(fk+5)*(fk+6)*(fk+7)))) if fk>3;
15 Better code example
    /*
     * Computes the disclosure risk at individual level
     *
     * @author John Anonymous (janon@example.org)
     * @version 2007.06
     * References:
     * - micro-Argus 4.1 manual, p27-25
     */
    // Configuration
    local mypath = "C:\data\anonymization\"
    global data_in = "`mypath'" + "\" + "Demohh1000.dta"
    global data_out = "`mypath'" + "\" + "Demohh1000.out.dta"
    global threshold = 0.8
    // Initialize
    cd "`mypath'"
    set more off
    use "$data_in", clear
16 Better code example (cont.)
    // Compute frequencies
    gen fk=1
    gen wi=weight
    // Group individuals by re-identification variables
    collapse (sum) fk wi, by(town province marstat sex age)
    gen pk=fk/wi
    gen qk=1-pk
    // Compute risk if cell frequency is 1
    gen rk=(pk/qk) * log(1/pk) if fk==1
    // Compute risk if cell frequency is 2
    replace rk=(pk/(qk^2)) * ((pk*log(pk))+qk) if fk==2
    // Compute risk if cell frequency is 3
    replace rk=(pk/(2*(qk^3))) * (qk*(3*qk-2) - (2*pk^2)*log(pk)) if fk==3
    // Compute risk if cell frequency is greater than 3 (series approximation)
17 Creating Metadata in the Enclave
18 Using the SharePoint-based portal
Use the HELP!
- But be aware that 
- Not all documented functions are available
- Some functions require administrative access
19 Editing content
20 Organizing work and exchanging ideas
Use the enclave announcements, tasks / to-do list, and
calendar to distribute and organize the research
work.
Use the discussion groups to exchange ideas,
submit questions, etc.
21 Using the blog to capture research events
- Research is an iterative, evolving process 
- Capturing ideas and milestones is crucial
- Personal logs have often been used in the past
- Blogs are today's version of them
22 Using the wiki to capture research knowledge
- Familiar with Wikipedia? 
- A wiki is a shared web site but does not require 
 programming skills to maintain
- Multiple authors can add, remove, and edit 
 content (mass authoring).
- Knowledge grows over time based on community
 contributions
- Pages automatically link to other pages on
 related topics
23 Sharing and tagging files
- Take advantage of the shared documents facility 
 to make information available to others
- Documents, papers, etc.
- Scripts, programs
- Tables
- Organize documents by keywords/topics
- Take advantage of the enclave search function
24 Report data quality issues!
- A survey is not perfect, problems are always 
 detected during research
- Data issues 
- Invalid codes, missing values, files that cannot be
 merged, missing files or variables, inconsistent
 results, bad distributions, etc.
- Metadata / Documentation issues
- Undocumented variables or codes, discrepancies
 between docs and data, the post-processing / cleaning /
 quality assurance black box, etc.
- Reporting this is crucial for other researchers 
 and for the producer