Title: Quantitative Methods For Social Sciences
1Quantitative Methods For Social Sciences
SKEMA Ph.D. programme 2010-2011
- Lionel Nesta
- Observatoire Français des Conjonctures Economiques
- Lionel.nesta_at_ofce.sciences-po.fr
2Objective of The Course
- The objective of the class is to provide students
with a set of techniques to analyze quantitative
data. It concerns the application of quantitative
and statistical approaches as developed in the
social sciences, for future decision makers,
policy makers, stakeholders, managers, etc.
- All courses are computer-based classes using the
STATA statistical package. The objective is to
reach levels of competence which provide the
students with skills to both read and understand
the work of others and to carry out their own
research.
3Examples
- Rise in biotechnology
- Should the EU fund fundamental research in
biotechnology?
- Has biotechnology increased the productivity of
firm-level R&D?
- Did it increase the speed of discovery in
pharmaceutical R&D?
- Increasing university-industry collaborations
- Does it facilitate innovation by firms?
- Does it increase the production of new knowledge
by academics?
- Does it modify the fundamental/applied nature of
research?
4Examples
- Economic (productivity) growth
- Does it come mainly from new firms or from improving
existing firms?
- Is market selection operating correctly?
- Why do good firms exit the market?
- How does the organisation of knowledge impact on
performance?
- How do knowledge stock and specialisation impact
on productivity?
- How do firms enter into new technological fields?
- Do firms diversify into new technologies/businesses
purposively?
5Structure of the Class
- Part 1 Descriptive Statistics: mean, variance,
standard deviation; data management
- Part 2 Statistical Inference: distributions;
comparison of means
- Part 3 Relationship Between Variables: ANOVA,
Chi-Square; correlation
- Part 4 Ordinary Least Squares (OLS): correlation
coefficient, simple regression; multiple regression
- Part 5 Extension to OLS: regression diagnostics;
qualitative explanatory variables
- Part 6 Qualitative Dependent Variables: linear
probability model; maximum likelihood (logit, probit)
12Part 1 Descriptive Statistics
13Types of Data
- Descriptive statistics is the branch of
statistics which gathers all techniques used to
describe and summarize quantitative and
qualitative data.
- Quantitative data
- Continuous
- Measured on a scale (each value lies within a range)
- The size of the number reflects the amount of the
variable
- Examples: age, wage, sales, height, weight, GDP
- Qualitative data
- Discrete, categorical
- The number reflects the category of the variable
- Examples: type of work, gender, nationality
14Descriptive Statistics
- Any means of summarizing data in a synthetic way
is useful: graphs, charts, tables.
- Quantitative data
- Graphs: scatter plots, line plots, histograms
- Central tendency
- Dispersion
- Qualitative data
- Graphs: pie graphs, histograms
- Tables: frequency, percentage, cumulative
percentage
- Cross tables
15Central Tendency and Dispersion
- A distribution is an ordered set of numbers
showing how many times each occurred, from the
lowest to the highest number or the reverse.
- Central tendency: measures of the degree to which
scores are clustered around the mean of a
distribution.
- Dispersion: measures of the fluctuations around the
characteristics of central tendency.
- In other words, the characteristics of central
tendency produce stylized facts, whereas the
characteristics of dispersion indicate how
representative a given stylized fact is.
16Central Tendency
- The mode
- The most frequent score in a distribution is called
the mode.
- The median
- The middle value of all observed values: 50% of
observed values are higher and 50% of observed
values are lower than the median.
- The mean
- The sum of all of the values divided by the
number of values.
The mode, the mean and the median are equal if
the distribution is symmetrical and
unimodal.
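As a minimal sketch of how these three measures can be obtained in Stata, the lines below use auto.dta, the example dataset shipped with Stata. This dataset is an assumption made for illustration; it is not the course data, and the variable name rep78_mode is hypothetical.

```stata
* Load the example dataset shipped with Stata (stand-in for the course data)
sysuse auto, clear

* Mean and median: the detail option adds percentiles; p50 is the median
summarize price, detail
display "mean   = " r(mean)
display "median = " r(p50)

* Mode: egen's mode() function; minmode picks the smallest value
* when several values are tied for most frequent
egen rep78_mode = mode(rep78), minmode
display "mode of rep78 = " rep78_mode[1]
```

The price variable is right-skewed, so its mean and median differ noticeably, which illustrates why the three measures only coincide in the symmetric, unimodal case.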
17Dispersion
- The range
- Difference between the maximum and minimum values
- The variance
- Average of the squared differences between data
points and the mean (average quadratic deviation)
- The standard deviation
- Square root of the variance; it therefore measures the
spread of data about the mean, in the
same units as the data
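The definitions above can be written out as follows. The notation (observations x_1, ..., x_n with mean x-bar) is introduced here for illustration and does not appear in the original slides.

```latex
\begin{align*}
\text{Range} &= \max_i x_i - \min_i x_i \\
\bar{x} &= \frac{1}{n}\sum_{i=1}^{n} x_i \\
\sigma^2 &= \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \\
\sigma &= \sqrt{\sigma^2}
\end{align*}
% Worked example with x = (2, 4, 6):
%   range = 6 - 2 = 4, mean = 4
%   variance = ((-2)^2 + 0^2 + 2^2) / 3 = 8/3, standard deviation = sqrt(8/3), about 1.63
```

Note that Stata's summarize divides by n - 1 rather than n (the sample variance), so for the same three values it would report a variance of 4 and a standard deviation of 2.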
18Research Productivity in the Bio-pharmaceutical
Industry
19Stylised Facts about Modern Biotech
- Innovations emerge from uncertain, complex
processes involving knowledge and markets. Role
of networks.
- Economic value is created in many ways, globally
and in geographical agglomerations.
- Various linkages exist among diverse actors
(LDFs, DBFs, universities, venture capital) in innovation
processes, but the firm plays a particularly
important role.
- Regulations, social structures and institutions
affect on-going innovation processes as well as
their impacts on society. Importance of IPR.
20STATA software
21The Stata software
- Stata Corp, spinoff from Texas A&M, College
Station, Texas (1985)
- Among the most widely used programs for
statistical analysis in social sciences.
- Probably the most widely used econometric software
among economists
- Data management (case selection, file reshaping,
creating derived data)
- Features of Stata are accessible via pull-down
menus
- The pull-down menu interface generates command
syntax.
22The Stata software
- STATA is a statistical software package in constant
evolution
- Official updates are regularly made available on the web
(command: update all)
- User-written commands are also shared with other Stata
users. Most are available through the Boston College
server
- ssc install module_name, all
- Hundreds of others can be reached as follows
- net search key_words
- net install module_name, all
23The Stata software
Screenshot of the Stata interface: pull-down menus, Review window,
Results window, Variables window, Command window
24The Stata software
- How to use STATA?
- Using pull-down menus
- Typing STATA instructions in the Command window
- Grouping a series of STATA instructions in a do-file (.do)
- Programming new functions (.ado files)
- Programming new functions with a powerful matrix
language (MATA) similar to C (Version 9.0 of
STATA onwards)
25The Stata software
- All STATA commands used from the menu or the
command window are automatically stored in the
Review window
- At the end of a session, the contents of the Review window
can be saved by right-clicking on it
- Save all: saves the commands as a do-file
- Send to do-file editor: a new window opens up.
- A do-file is a text file containing a list of
STATA commands which will be executed step by
step by STATA.
- It is recommended to explore results and methods
with the command window. Once the methods are
settled, save the series of commands as a do-file.
26The Stata software
- All STATA results are displayed in the Results
window
- This window is a buffer. Once output scrolls out of
the buffer, it is deleted. That is why you may
want to record results.
- log using log_name.txt (beginning of a session)
- log close (end of a session)
- It is recommended to save results in a log file.
Moreover, if you work with a do-file, you can
always reproduce old results by re-running the do-file.
27The Stata software
- Memory settings
- By default, 10 megabytes are available for
loading a database. If a database is larger than
10 MB, STATA does not load it. There
are also other limits (matrix size, number of
variables) which can be managed using the
commands below. (These memory settings apply to Stata 11
and earlier; since Stata 12, memory is managed automatically.)
- Useful commands
- describe using database_name.dta
- query memory
- clear
- set memory 500m, permanently
- set maxvar n, permanently
- set matsize n, permanently
- set virtual on, permanently
28Data Handling (1) Database creation
- 1st step: Creating a database
- Typing data in the database through the Data Editor
(edit)
- Importing data
- insheet using myfile.txt [, options]
- options: tab, comma, delimiter("char"), clear, names
- Importing data from a .txt file
- Without fixed format (without dictionary)
- infile var1 var2 var3 using myfile.txt [, options]
- With a fixed format (with dictionary)
- infile using mydict.dct, using(myfile.txt) [options]
29DH(2) Database Exploration
- 2nd step: Exploring the data
- To obtain a description of the database
- describe [varlist] [, options]
- inspect [varlist]
- codebook [varlist] [, options]
- nmissing [varlist] [, options]
- npresent [varlist] [, options]
- To display the values of a variable
- list [varlist] [if] [in] [, options]
- Example: list var1 if var2 > var3 in 1/100
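A short sketch of the exploration commands, again using the auto.dta example dataset shipped with Stata as a stand-in for the course data (an assumption). Only official commands are used here.

```stata
sysuse auto, clear

* Overall description of the database: variable names, types, labels
describe

* Detailed description of selected variables
codebook rep78
inspect mpg

* List selected variables for observations meeting a condition,
* restricted to the first 50 observations
list make price mpg if mpg > 25 in 1/50
```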
30DH(3) Database Organisation
- 3rd step: Organisation of the database
- Sorting observations
- sort varlist
- gsort -varlist
- Ordering variables
- order varlist
- aorder [varlist] (If no varlist is
specified, _all is assumed.)
- Merging several databases (adding
variables)
- merge varlist using base1.dta base2.dta [, options]
- Appending several databases (adding
observations)
- append using base1.dta base2.dta [, options]
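The following is a minimal sketch of merging and sorting, under stated assumptions: the two small firm-level datasets are hypothetical and typed in for illustration, and the merge uses the `merge 1:1` syntax introduced in Stata 11 rather than the older `merge varlist using` form shown above.

```stata
* First hypothetical dataset: sales by firm, stored in a temporary file
clear
input id sales
1 100
2 250
3 80
end
tempfile salesdata
save `salesdata'

* Second hypothetical dataset: employees by firm
clear
input id employees
1 10
2 40
3 5
end

* Add variables: match observations one-to-one on the key variable id
merge 1:1 id using `salesdata'
tabulate _merge
drop _merge

* Sort observations by descending sales
gsort -sales
list
```

The _merge variable created by merge records whether each observation was found in one or both files, so tabulating it before dropping it is a cheap check that the match worked as intended.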
31DH(3) Database Organisation
- 3rd step: Organisation of the database
- Modifying the shape of the database
- reshape long stubnames, i(varlist) j(varname)
- reshape wide stubnames, i(varlist) j(varname)

Wide form
id  sex  inc80  inc81  inc82
----------------------------
1   0    5000   5500   6000
2   1    2000   2200   3300
3   0    3000   2000   1000

Long form (i = id, j = year)
id  year  sex  inc
------------------
1   80    0    5000
1   81    0    5500
1   82    0    6000
2   80    1    2000
2   81    1    2200
2   82    1    3300
3   80    0    3000
3   81    0    2000
3   82    0    1000
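The wide and long forms on this slide can be reproduced with the sketch below, which types in the same nine income values and converts in both directions.

```stata
* Enter the data in wide form (one row per id)
clear
input id sex inc80 inc81 inc82
1 0 5000 5500 6000
2 1 2000 2200 3300
3 0 3000 2000 1000
end

* Wide -> long: inc is the stub, id identifies units,
* and year is created from the numeric suffix
reshape long inc, i(id) j(year)
list, sepby(id)

* Long -> wide: back to one row per id
reshape wide inc, i(id) j(year)
list
```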
32DH(4) Saving, Opening, Exporting
- 4th step: Save and re-use STATA database files
(.dta files)
- Changes the working directory to the specified
drive and directory
- cd "C:\STATA SKEMA"
- Saves the database as a STATA file (.dta)
- save myfile.dta, replace
- Opens a STATA format database (.dta)
- use myfile.dta, clear
- Exports a database as a text file
- outsheet [varlist] using myfile.txt [, options]
- options: comma, nonames, replace
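A minimal round-trip sketch of saving, exporting and re-importing. The file names auto_copy.dta and auto_export.txt are hypothetical and are written to the current working directory, then erased. The sketch assumes insheet and outsheet are still accepted by the installed release; in recent releases import delimited and export delimited are the documented replacements.

```stata
sysuse auto, clear

* Save the data in memory as a Stata file
save auto_copy.dta, replace

* Export three variables as a comma-separated text file
outsheet make price mpg using auto_export.txt, comma replace

* Re-open the Stata file, then re-import the text file instead
use auto_copy.dta, clear
insheet using auto_export.txt, comma names clear
describe

* Clean up the illustration files
erase auto_copy.dta
erase auto_export.txt
```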
33Handling Variables
- Create a new variable
- By assigning a value to it
- generate var1 = expression [if] [in]
- Using a predefined function: extensions to
generate
- egen var1 = fcn(arguments) [if] [in] [, options]
- by(varlist) is among the available options
- fcn: min, max, mode, mean, median, sd,
total, pctile, group, count, etc.
- Examples: egen mean_salaire = mean(salaire), by(age)
- egen nom_id = group(nom)
- egen n_firms = count(id), by(sector)
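A short sketch of generate and egen, using the auto.dta example dataset shipped with Stata (an assumption; the new variable names are hypothetical).

```stata
sysuse auto, clear

* generate: assign the value of an expression
generate price_k = price / 1000

* egen with by(): group-level statistics repeated on every row of the group
egen mean_price = mean(price), by(foreign)
egen n_cars = count(price), by(foreign)

* egen group(): numeric identifier for each distinct value
egen rep_group = group(rep78)

list make foreign price mean_price n_cars in 1/5
```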
34Handling Variables
- Variable modification and removal
- Modifying a variable which has already been
created
- replace var1 = expression [if] [in]
- Removing variables
- drop varlist
- keep varlist
- Removing observations
- drop if condition (or drop in range)
- keep if condition (or keep in range)
- Examples: drop if revenu < 100
- keep if age > 18
35Handling Variables
- Time series and panel data utilities
- Declaring data as time series or panel data
- tsset [panelvar] timevar [, options]
- options: daily, weekly, monthly, quarterly,
yearly
- Example: tsset id annee, yearly
- Using time series operators
- Lagged values: L.X = X(t-1); L2.X (or LL.X) = X(t-2)
- Forwarded values: F.X = X(t+1); F2.X (or FF.X) = X(t+2)
- Differenced values: D.X = X(t) - X(t-1);
D2.X (or DD.X) = [X(t) - X(t-1)] - [X(t-1) - X(t-2)]
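The operators can be checked by hand on a tiny panel. The data below are hypothetical (two firms, three years each) and are only meant to illustrate the operators.

```stata
* Hypothetical panel: two firms observed over three years
clear
input id year sales
1 2008 100
1 2009 120
1 2010 150
2 2008 50
2 2009 45
2 2010 60
end

* Declare the panel structure: id is the panel variable, year the time variable
tsset id year, yearly

* Lag, lead, first difference and second difference
generate sales_lag   = L.sales
generate sales_lead  = F.sales
generate sales_diff  = D.sales
generate sales_diff2 = D2.sales

list, sepby(id)
```

Operators never reach across panels: the lag of the first year of firm 2 is missing rather than the last value of firm 1, which is the main advantage over computing lags with explicit subscripts.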
36Descriptive Statistics with STATA
- Using log files: log using xxx, replace / log close
- Defining and using labels: label variable, label define,
label values
- Descriptive statistics: summarize, table, table ..., content(),
tabulate
- Manipulating .dta files and exporting: collapse, save,
outsheet using ...
37Log files
- Log files save the contents of the Results window. They are useful
when producing descriptive statistics on the
.dta files and on the variables.
- log using log_file_name, replace
- STATA instructions
- log close
- Advantage: very useful for retrieving old results
(replication and refutation)
- Drawback: tedious to manipulate
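A minimal sketch of a logged session. The file name session_example.log is hypothetical, and the text option is added here so that the log is plain text rather than Stata's default SMCL format.

```stata
* Close any log left open by a previous run
* (capture suppresses the error if no log is open)
capture log close

* Start recording the Results window in a plain-text log
log using session_example.log, text replace

sysuse auto, clear
summarize price mpg

* Stop recording
log close
```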
38Labelling variables
- Labelling is too often neglected.
- No influence on the results
- Large influence on the correct interpretation of
variables and results
- label variable: describes a variable
- label variable asset "real capital"
- label define: defines a value label
- label define firm_type 1 "biotech" 0 "Pharma"
- label values: applies the value label to a variable
- label values type firm_type
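The three label commands can be tried on a small hypothetical dataset that reuses the variable names from the slide (type and asset); the values are invented for illustration.

```stata
* Hypothetical data: firm type (1 = biotech, 0 = pharma) and assets
clear
input type asset
1 120
0 950
1 80
end

* Describe the variable
label variable asset "real capital"

* Define a value label, then attach it to the variable
label define firm_type 1 "biotech" 0 "Pharma"
label values type firm_type

* Tables now show the labels instead of the codes
tabulate type
describe
```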
39Descriptive statistics summarize
- summarize var1 var2 ... varN
- Produces the number of observations, mean, standard
deviation, min and max
- We can add a condition using if
- summarize var1 var2 ... varN if condition
- We can produce descriptive statistics by subsets
of the database using bysort
- bysort varcat: summarize var1 var2 ... varN
- Beware: no comma is needed before if. The comma
separates the command (including its if condition) from its
options, so it comes after the condition. If you get an error
message, there is a good chance that it comes
from a missing or misplaced comma.
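A short sketch of these variants, using the auto.dta example dataset shipped with Stata (an assumption). The last line shows where the comma goes when an if condition and an option are combined.

```stata
sysuse auto, clear

* Whole sample
summarize price mpg

* Subsample defined by a condition
summarize price mpg if foreign == 1

* Separately for each category of a variable
bysort foreign: summarize price mpg

* Condition first, then the comma, then options
summarize price if foreign == 1, detail
```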
40Descriptive statistics table
The table command applies to categorical
variables (string or numeric).
- table varcat1: provides the number of observations by
categories of varcat1
- table varcat1 varcat2: provides a cross table between varcat1
and varcat2
- table varcat, content(count var1 mean var1 sd var1 ...): provides
descriptive statistics on var1 by categories of varcat
41Descriptive statistics tabulate
The tabulate command is similar to table, but
its options are different.
- tabulate varcat, gen(varcat_): generates dummy variables for each
category of varcat
- tabulate varcat1 varcat2, options: generates measures of association
between two categorical variables
- tabulate varcat1, summarize(var2): provides descriptive
statistics on var2 by categories of varcat1
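A minimal sketch of the three uses of tabulate, with the auto.dta example dataset shipped with Stata (an assumption). The prefix origin_ is hypothetical, and chi2 is just one of the available association options.

```stata
sysuse auto, clear

* One-way frequency table
tabulate rep78

* Create one dummy variable per category: origin_1, origin_2
tabulate foreign, generate(origin_)

* Two-way table with a chi-squared test of association
tabulate foreign rep78, chi2

* Mean, standard deviation and frequency of price by category
tabulate foreign, summarize(price)
```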
42Stacking observations collapse
- The collapse command produces a new database
which is an aggregation of the old database.
- collapse aggregates lines (observations) by
categories of a categorical variable of your
choice
- collapse (mean) var1 var2 (sum) var3, by(varcat)
- Will generate a new database with as many lines
as there are categories of varcat, with 3
variables (means of var1 and var2, sum of var3) plus varcat
- collapse (mean) var1 var2 (sd) sdvar1=var1
sdvar2=var2, by(varcat1 varcat2)
- Will generate a new database with as many lines
as there are combinations of varcat1 and varcat2,
with 4 variables (means of var1 and var2, standard
deviations of var1 and var2) plus varcat1 and varcat2
- Note: collapse is useful for exporting tables
of results to Excel.
- Note: Please save the old and new databases
under different names!
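A short sketch of collapse, using the auto.dta example dataset shipped with Stata (an assumption; the names sd_price and n_obs are hypothetical). Since collapse replaces the data in memory, the original file must be reloaded afterwards if it is still needed.

```stata
sysuse auto, clear

* One line per category of foreign, with means, a standard deviation
* and a count; newname=oldname avoids a clash with the mean of price
collapse (mean) price mpg (sd) sd_price=price (count) n_obs=price, by(foreign)

list
```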
43Keywords for table and collapse
mean      means (default)
sd        standard deviations
sum       sums
rawsum    sums, ignoring optionally specified weight
count     number of nonmissing observations
max       maximums
min       minimums
iqr       interquartile range
median    medians
p1        1st percentile
p2        2nd percentile
...       3rd-49th percentiles
p50       50th percentile (same as median)
...       51st-97th percentiles
p98       98th percentile
p99       99th percentile
44Graphs
- Graphic representations are a very effective
means of synthesis.
- Pie graphs, which display proportions of a
population or a sample
- Two-way graphs, linking any two quantitative
dimensions
- Distribution graphs (histograms), which show the
central tendency and dispersion
of a variable
45Pie Graphs
graph pie, over(varcat)
46Two-way Graphs
Two-way graphs link two continuous variables, var1 and
var2. There are several types of two-way graphs:
- Line graphs: twoway line var1 var2
- Classical scatterplots: twoway scatter var1 var2
- Connected graphs: twoway connected var1 var2
47Line graphs
twoway line var1 var2
twoway line rdi year if name == "Abbott"
48Line graphs
twoway line var1 var2
twoway (line rdi year if name == "Amgen", sort) ///
(line rdi year if name == "Abbott", sort), ///
legend(on order(1 "Amgen" 2 "Abbott"))
49Connected graphs
twoway connected var1 var2
twoway (connected rdi year if name == "Abbott")
50Scatterplots
twoway scatter var1 var2
twoway scatter lrdi lassets
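The course dataset (rdi, year, name) is not reproduced here, so the sketch below uses the auto.dta example dataset shipped with Stata as a stand-in (an assumption) to show the same two-way syntax, including overlaid plots with a legend.

```stata
sysuse auto, clear

* Simple scatterplot: first variable on the vertical axis,
* second variable on the horizontal axis
twoway scatter price weight

* Two overlaid scatterplots, one per group, with an explicit legend
twoway (scatter price weight if foreign == 0) ///
       (scatter price weight if foreign == 1), ///
       legend(on order(1 "Domestic" 2 "Foreign"))
```

The /// at the end of a line tells Stata that the command continues on the next line, which is needed in a do-file but not when typing a single long line in the Command window.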
51Distribution graphs
- Distribution graphs plot the distribution of one
quantitative variable var1 at a time by means of
a histogram.
- On the horizontal axis, classes of var1 are
displayed.
- On the vertical axis, the density of each class
is displayed.
52Distribution histograms
hist var1
hist lassets
53Kernel distributions
Using kernel density estimation, one can estimate the
probability density function of var1. The estimated density
is useful for visually assessing the
normality of the distribution. Normal
distributions are also called Gaussian
distributions. They are very frequently used in
the sciences to model random processes. Their
widespread use rests on the law of large numbers and the
central limit theorem.
54Kernel distributions
kdensity var1
kdensity lassets
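A brief sketch of both distribution graphs, using the auto.dta example dataset shipped with Stata (an assumption; the variable lprice is hypothetical). The normal option overlays a normal density, which makes the visual check of normality easier.

```stata
sysuse auto, clear

* Histogram of price with a normal density overlaid
histogram price, normal

* Kernel density estimate of price with a normal density overlaid
kdensity price, normal

* Same check after a log transformation
generate lprice = ln(price)
kdensity lprice, normal
```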
55Exporting Graphs
One can simply copy and paste a graph into any
Microsoft Office application. One can also use do-files
and write:
graph export graph_name, as(extension) [options]
Example: graph export SKEMA_rdi.wmf, as(wmf) replace
Possible extensions: PostScript (ps), Encapsulated
PostScript (eps), Windows Metafile (wmf), Windows
Enhanced Metafile (emf), Macintosh PICT format
(pict), Portable Document Format (pdf)
56SPSS software: Statistical Package for the Social
Sciences
57SPSS Opening SPSS
58SPSS Importing data
59SPSS Importing data
60SPSS Importing data
- Settings in the import text dialogue box
- No predefined format (1)
- Delimited (2)
- First line contains the variable names (2)
- One observation per line / all observations (3)
- Tab delimited only (4)
- Finish (6)
61SPSS windows
- SPSS automatically opens two windows
- The datasheet window
- Observe, manage, modify and create data
- The results window
- Everything you do will be stored there
- The syntax window can also be opened
62SPSS Data sheet (1)
63SPSS Data sheet (2)
64SPSS Result / Journal
65SPSS Saving data
66SPSS working, at last!
67Recoding Variables
- Changing existing values to new values
(biotechnologie → DBF, pharmaceutique → LDF)
68Computing New Variables
- Taking logarithm (normalization of continuous
variables)
69Creating Dummy Variables
- Taking logarithm (normalization of continuous
variables)
70Computation of Descriptive Statistics
71Descriptive Statistics
72Splitting Database
73Descriptive Statistics (by type)
74Logarithm
- Normalization
- Taking the logarithm is a transformation which
often brings a right-skewed distribution closer to normal.
- Elasticities: http://en.wikipedia.org/wiki/Elasticity_(economics)
- A change in the log of x is (approximately) a relative
change in x itself.
- Cobb-Douglas production function
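The link between logarithms, relative changes and elasticities can be sketched as follows. The symbols (output Y, capital K, labour L, parameters A, alpha, beta) are the usual textbook notation and are an assumption here, since the slide names the Cobb-Douglas function without writing it out.

```latex
\begin{align*}
% A small change in ln x is a relative change in x
d\ln x &= \frac{dx}{x}
  \quad\Rightarrow\quad
  \ln x_1 - \ln x_0 \approx \frac{x_1 - x_0}{x_0}
  \text{ for small changes} \\
% Cobb-Douglas production function and its log-linear form
Y &= A K^{\alpha} L^{\beta}
  \quad\Rightarrow\quad
  \ln Y = \ln A + \alpha \ln K + \beta \ln L \\
% The coefficients of the log-linear form are elasticities
\alpha &= \frac{\partial \ln Y}{\partial \ln K},
  \qquad
  \beta = \frac{\partial \ln Y}{\partial \ln L}
\end{align*}
```

In the log-linear form, alpha reads directly as the percentage change in output associated with a one percent change in capital, which is why variables are often logged before estimating a regression.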