Title: Review of mathematics and data analysis for IR
1 Review of mathematics and data analysis for IR
2 Today's class
- Reminder of basic notation and terminology
- Overview of several important statistical tools for data analysis
- Discussion of matrix algebra in the context of data analysis (and IR)
- We'll handle probability during a later class
3 Basic Notation
- N: number of items summed
- Sigma: summation
- Subscript i: the ith element of the summation
- Common variables
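Put together, the notation above reads as follows. This is only a sketch: the symbol x for the items being summed is an assumption, since the slide lists "common variables" without naming them.

```latex
\sum_{i=1}^{N} x_i = x_1 + x_2 + \cdots + x_N
```

Read aloud: "the sum, for i running from 1 to N, of the ith value of x".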
4 Getting to know your data
- A group of n observations on p variables.
- For now we'll assume numeric variables.
- Also assume that the observations are independent and identically distributed (iid).
5 The mtcars data
- Road-test information from Motor Trend magazine.
- n = 32 cars
- p = 11 variables
6The mtcars data
, 1mpgMiles/(US) gallon , 2cylNumber of
cylinders , 3dispDisplacement (cu.in.) ,
4hpGross horsepower , 5dratRear axle ratio ,
6wtWeight (lb/1000) , 7qsec1/4 mile time ,
8vsV/S , 9amTransmission (0auto,
1manual) ,10gearNumber of forward
gears ,11carbNumber of carburetors
7 The mtcars data
                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4       21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag   21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Ferrari Dino    19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora   15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E      21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
With the two binary variables (vs and am) set aside, we now have n = 32 observations on p = 9 (numeric) variables.
8 The mtcars data
                 mpg cyl  disp  hp drat    wt  qsec gear carb
Mazda RX4       21.0   6 160.0 110 3.90 2.620 16.46    4    4
Mazda RX4 Wag   21.0   6 160.0 110 3.90 2.875 17.02    4    4
Ferrari Dino    19.7   6 145.0 175 3.62 2.770 15.50    5    6
Maserati Bora   15.0   8 301.0 335 3.54 3.570 14.60    5    8
Volvo 142E      21.4   4 121.0 109 4.11 2.780 18.60    4    2
9 Working with data
- From our data we may calculate a statistic, such as average mpg.
- We use statistics for different things depending on our needs.
- Description: the average mpg in our sample is easier to think about than all 32 individual values.
- Inference: extrapolate from our sample to a parameter in the population, e.g. the mpg of modern cars.
10 Working with data
- A statistic can describe one or many variables. For now we'll concentrate on univariate stats.
- They may be roughly divided into two categories:
- Measures of central tendency
- Measures of variability
11 Visualizing univariate data
Median = 19.2
12 Visualizing univariate data
Intuitively, what is the sample mean (xbar)? In what sense is it the data's center? Why might the mean and median diverge?
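One way to get a feel for the last question is to compare the two statistics on a small data set with and without an extreme value. The numbers below are hypothetical, chosen only for illustration; they are not taken from mtcars.

```python
from statistics import mean, median

# Hypothetical values, chosen only to illustrate the effect of one outlier.
symmetric = [1, 2, 3, 4, 5]
skewed = [1, 2, 3, 4, 100]

# In the symmetric data the mean and the median coincide.
print(mean(symmetric), median(symmetric))

# A single extreme value drags the mean upward; the median does not move.
print(mean(skewed), median(skewed))
```

The mean uses the size of every value, so it is sensitive to outliers and skew; the median depends only on the ordering of the values.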
13 Visualizing univariate data
mu is the population mean, a parameter we've been approximating with xbar.
14 Visualizing univariate data
So what are these, sigma and sigma squared?
15 Variance and Std. Deviation
- We mentioned measures of dispersion.
- Variance and std. deviation quantify how spread out the data are around the mean.
- More precisely, the standard deviation is the root-mean-square distance of the data points from the mean; informally, the typical distance of a data point from the mean.
- The variance is simply the square of the SD.
16 Variance and Std. Deviation
- Example
- X = 1, 2, 3, 2, 5, 1, 7
- Sum(x)
- Xbar
- Sum((x - xbar)^2)
- Var(x)
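The steps of this example can be worked through in a few lines of Python. This is a sketch that assumes the sample variance (dividing by n - 1) is intended, which is the convention that reproduces the mpg variance reported on the next slide; the variable names are illustrative.

```python
from statistics import mean, variance

x = [1, 2, 3, 2, 5, 1, 7]

total = sum(x)                                  # Sum(x)
xbar = mean(x)                                  # Xbar = Sum(x) / n
sum_sq_dev = sum((xi - xbar) ** 2 for xi in x)  # Sum((x - xbar)^2)
var_x = sum_sq_dev / (len(x) - 1)               # Var(x), dividing by n - 1 (assumed)

print(total, xbar, sum_sq_dev, var_x)

# The library routine uses the same n - 1 convention.
print(variance(x))
```

Dividing by n instead of n - 1 would give the population variance, which is slightly smaller.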
17 Numeric Overviews of Data: Descriptive Statistics
- Returning to our mpg example
- Mean(mpg) = 20.09
- Mean(disp) = 230.72
- Variance(mpg) = 36.32
- Variance(disp) = 15360.8
- Standard dev. (mpg) = 6.02
- Standard dev. (disp) = 123.94
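The mpg figures above can be recomputed from the data. The sketch below copies the mpg column from the full mtcars table shown later in these slides and assumes the sample (n - 1) versions of variance and standard deviation.

```python
from statistics import mean, median, stdev, variance

# mpg column of mtcars (32 cars), copied from the data table in these slides.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,
       16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4, 30.4, 33.9, 21.5, 15.5,
       15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

print("n        =", len(mpg))
print("mean     =", round(mean(mpg), 2))
print("median   =", median(mpg))
print("variance =", round(variance(mpg), 2))   # sample variance, n - 1
print("std dev  =", round(stdev(mpg), 2))      # square root of the variance
```

Rounded to two decimals the standard deviation comes out as 6.03 (about 6.027 before rounding), so the 6.02 on the slide appears to be truncated rather than rounded.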
18 Probabilistic Models: Estimating the Distribution of Data
xbar = 20.09, sigHat = 6.02
- The Gaussian approximation of the variable mpg
19 Probabilistic Models: Estimating the Distribution of Data
Different values for xbar or sigHat give different probability models.
20 Probabilistic Models: Estimating the Distribution of Data
xbar = 20.09, sigHat = 1
- The Gaussian approximation of the variable mpg
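To see how the choice of sigHat changes the model, the Gaussian density can be evaluated directly. The function name `gaussian_pdf` and the evaluation point of 25 mpg are illustrative assumptions, not part of the slides.

```python
import math

def gaussian_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

xbar = 20.09
for sig_hat in (6.02, 1.0):
    at_mean = gaussian_pdf(xbar, xbar, sig_hat)   # height of the curve at its centre
    at_25 = gaussian_pdf(25.0, xbar, sig_hat)     # height about 5 mpg above the centre
    print(f"sigHat = {sig_hat}: density at xbar = {at_mean:.4f}, at 25 mpg = {at_25:.6f}")
```

With sigHat = 1 the curve is tall and narrow, so a car at 25 mpg looks extremely unlikely; with sigHat = 6.02 the curve is lower and wider, and 25 mpg is unremarkable.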
21 Probabilistic Models: Estimating the Distribution of Data
The crucial point: based on our data we calculate a statistic, which is an estimator for a corresponding parameter in the population.
22 Probabilistic Models: Estimating the Distribution of Data
Statistic: xbar (sample). Parameter: mu (population).
23 Probabilistic Models: Estimating the Distribution of Data
If we drew another sample of size N from the population, we would compute new values for xbar and sigHat. These would probably be close to our observed statistics, but not identical. If we drew a whole bunch of new samples, each of size N, we would have a bunch (i.e. a sample) of estimates of mu. There is a nice result from statistics:
24 Probabilistic Models: Estimating the Distribution of Data
The sample mean xbar is (approximately, for large N) normally distributed around the population mean mu. Moreover, it has its own standard deviation (the standard error), sigma / sqrt(N).
25 One reason we take Averages
The sample mean gives us a better estimate of the
population mean than an individual data point
does because on average, means vary around the
population mean less than single data points.
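A small simulation makes this claim concrete. Everything in it is an assumption made for illustration: the population is taken to be Gaussian with hypothetical parameters MU and SIGMA, and the standard error is compared against the textbook value sigma / sqrt(N).

```python
import random
from statistics import stdev

random.seed(0)  # fixed seed so the sketch is repeatable

MU, SIGMA = 20.0, 6.0   # hypothetical population parameters
N = 32                  # size of each sample
NUM_SAMPLES = 2000      # how many samples we draw

sample_means = []
single_points = []
for _ in range(NUM_SAMPLES):
    sample = [random.gauss(MU, SIGMA) for _ in range(N)]
    sample_means.append(sum(sample) / N)   # one estimate of mu per sample
    single_points.append(sample[0])        # one raw data point per sample

print("sd of single data points:", round(stdev(single_points), 2))
print("sd of sample means:      ", round(stdev(sample_means), 2))
print("sigma / sqrt(N):         ", round(SIGMA / N ** 0.5, 2))
```

The spread of the sample means should land close to sigma / sqrt(N), far below the spread of individual points, which is the sense in which an average is the better estimate.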
27 Review of logarithms
- Read this statement as an English sentence
28 Review of logarithms
- So what does this statement mean?
Logarithms can be taken to a variety of bases.
The most common for us will be base 2.
29 Review of logarithms
- So what does this statement mean? Logs are closely related to exponents:
log_b(x) = y iff b^y = x
32 Review of logarithms
- Why do we use these weird exponents?
- Logs dampen the effect of very large numbers.
- Logs allow us to express complicated (e.g. multiplicative) relationships as simple addition.
- Logs also allow us to deal with very small numbers (such as probabilities) without numerical underflow problems.
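The points above can be checked with Python's standard `math` module. The probabilities in the last part are hypothetical, picked only so that their product is too small for a floating-point number.

```python
import math

# Logs and exponents are inverse operations: log2(8) = 3 because 2**3 = 8.
print(math.log2(8), 2 ** 3)

# A product becomes a sum once we take logs.
a, b = 16.0, 4.0
print(math.log2(a * b), math.log2(a) + math.log2(b))

# Hypothetical case: 400 independent events, each with probability 0.001.
probs = [0.001] * 400

product = 1.0
for p in probs:
    product *= p            # the true value, 10**-1200, underflows to 0.0

log_sum = sum(math.log2(p) for p in probs)   # the same quantity on the log scale

print(product)
print(log_sum)
```

The direct product collapses to zero, while the sum of logs is an ordinary negative number that can still be compared with other log-probabilities.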
33 Intro to Matrices
                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4            21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag        21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710           22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive       21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout    18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant              18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360           14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D            24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230             22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280             19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C            17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE           16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL           17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC          15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood   10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial    14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128             32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic          30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla       33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona        21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger     15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin          15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28           13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird     19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9            27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2        26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa         30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L       15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino         19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora        15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E           21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
- It is often the case that each observation we make comprises several variables (11 in the case of the mtcars data).
- To help us manipulate and understand such multivariate data, it is useful to consider our data set (sample) as a matrix.
- Matrix theory (and linear algebra) provides a strong apparatus for manipulating multivariate data.
34 Intro to Matrices
From your basic math class: a matrix is a 2-dimensional array.
For the purposes of this course a matrix will contain numbers, though in general this need not be the case.
35 Intro to Matrices
This is very much like a spreadsheet or a data set, where we have data arranged into rows and columns. Matrix algebra allows us to manipulate portions of this group of numbers, or the whole group at once.
Here, matrix A is a 3 x 3 matrix of integers.
36 Intro to Matrices
We can think of a matrix as a collection of row vectors, where a vector is just a 1-D array.
Rows of A: Row 1, Row 2, Row 3.
37 Intro to Matrices
Or we can consider the column vectors. An important thing about a matrix is that we can take either the row-wise or column-wise view of its vectors.
Columns of A: Column 1, Column 2, Column 3.
38 Intro to Matrices
Of course, such arrangements don't preclude consideration of the individual cells or values in the matrix.
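The row, column, and cell views of a matrix can be sketched with plain Python lists. The entries of A below are hypothetical, since the slides' own 3 x 3 example did not survive; storing a matrix as a list of rows is one common convention, not the only one.

```python
# A hypothetical 3 x 3 matrix of integers, stored as a list of rows.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

rows = [list(row) for row in A]            # row-wise view: three row vectors
columns = [list(col) for col in zip(*A)]   # column-wise view: three column vectors
cell = A[1][2]                             # row 2, column 3 (Python indices start at 0)

print("rows:   ", rows)
print("columns:", columns)
print("cell:   ", cell)
```

The same nine numbers are reachable through any of the three views; which one is convenient depends on the operation.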
39 Intro to Matrices
Some Special Matrices
A square matrix is any matrix that has the same
number of rows as columns
40 Intro to Matrices
Some Special Matrices
A diagonal matrix is a square matrix in which every element off the main diagonal is equal to 0.
A special kind of diagonal matrix is the identity matrix, which simply has all 1s on the main diagonal.
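These definitions translate directly into small checks. The helper names `is_square`, `is_diagonal`, and `identity` are made up for this sketch, and matrices are assumed to be lists of equal-length rows.

```python
def is_square(m):
    """True if the matrix has as many rows as columns."""
    return all(len(row) == len(m) for row in m)

def is_diagonal(m):
    """True if the matrix is square and every off-diagonal element is 0."""
    n = len(m)
    return is_square(m) and all(
        m[i][j] == 0 for i in range(n) for j in range(n) if i != j
    )

def identity(n):
    """The n x n identity matrix: 1s on the main diagonal, 0s elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]

print(is_square([[1, 2, 3], [4, 5, 6]]))
print(is_diagonal([[2, 0], [0, 5]]))
print(identity(3))
```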
41 Intro to Matrices
Transposing a Matrix
Transposing a matrix just involves swapping the
rows and columns
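A minimal sketch of transposition, assuming a matrix stored as a list of rows; the matrix A is a hypothetical example and `transpose` is an illustrative helper name.

```python
def transpose(m):
    """Swap rows and columns: row i of the result is column i of m."""
    return [list(col) for col in zip(*m)]

# Hypothetical 3 x 3 matrix of integers.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

At = transpose(A)
print(At)

# Transposing a 1 x 3 row vector gives a 3 x 1 column vector.
print(transpose([[1, 2, 3]]))
```

Transposing twice returns the original matrix.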
43 Intro to Matrices
Matrix Addition
45 Intro to Matrices
Matrix (and vector) Addition
46 Intro to Matrices
Matrix Subtraction
47 Intro to Matrices
Matrix Addition & Subtraction
Adding or subtracting matrices of different dimensions is undefined!
Matrices must have the same number of rows and columns in order to be added or subtracted.
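Addition and subtraction work cell by cell, which is why the dimensions must agree. The sketch below uses hypothetical matrices and helper names (`shape`, `add`, `subtract`) and raises an error for the undefined case.

```python
def shape(m):
    """(number of rows, number of columns) of a list-of-rows matrix."""
    return (len(m), len(m[0]))

def add(x, y):
    """Element-wise sum; defined only when x and y have the same shape."""
    if shape(x) != shape(y):
        raise ValueError(f"cannot add a {shape(x)} matrix and a {shape(y)} matrix")
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def subtract(x, y):
    """Element-wise difference; defined only when x and y have the same shape."""
    if shape(x) != shape(y):
        raise ValueError(f"cannot subtract a {shape(y)} matrix from a {shape(x)} matrix")
    return [[a - b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

X = [[1, 2], [3, 4]]
Y = [[10, 20], [30, 40]]
print(add(X, Y))
print(subtract(Y, X))
```

Vectors are handled the same way, since a vector is just a matrix with a single row or a single column.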
48 Intro to Matrices
Multiplication (by a scalar)
Multiply the 3 x 3 matrix A by the scalar 3 to get 3A.
49 Intro to Matrices
Multiplication (by a scalar)
Multiply the column vector v by the scalar 3 to
get 3v.
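Scalar multiplication multiplies every cell by the same number. The matrix A and column vector v below are hypothetical stand-ins for the slides' examples, and `scale` is an illustrative helper name.

```python
def scale(c, m):
    """Multiply every element of the matrix m by the scalar c."""
    return [[c * value for value in row] for row in m]

# Hypothetical 3 x 3 matrix and 3 x 1 column vector.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
v = [[1],
     [2],
     [3]]

print(scale(3, A))   # 3A
print(scale(3, v))   # 3v
```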
50 Intro to Matrices
Multiplication (of two vectors)
Multiplying two vectors gives a number (a
scalar). This operation is called the inner
product (or dot product) of two vectors.
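As a sketch, the inner product multiplies matching elements and adds up the results. The vectors u and w are hypothetical, and `dot` is an illustrative helper that treats vectors as flat lists.

```python
def dot(u, w):
    """Inner (dot) product of two vectors of equal length."""
    if len(u) != len(w):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, w))

u = [1, 2, 3]
w = [4, 5, 6]

# 1*4 + 2*5 + 3*6
print(dot(u, w))
```

The result is a single number, not a vector, which is what distinguishes this operation from element-wise addition.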
53 Intro to Matrices
Multiplication (of two Matrices)
Multiplying two matrices tends to freak people out: it doesn't seem right, especially after our nice results about adding matrices and multiplying matrices by scalars. But our discussion of multiplying vectors forms the basis for understanding how we multiply matrices, because vectors are basically just skinny matrices. Multiplying two matrices X and Y involves finding many dot products. Exactly how many, and in what order, is the confusing part.
54 Intro to Matrices
Multiplication (of two Matrices)
Consider these items: a 1 x 3 matrix times a 3 x 2 matrix gives a ? x ? matrix.
55 Intro to Matrices
Multiplication (of two Matrices)
Consider these items: a 1 x 3 matrix times a 3 x 2 matrix gives a 1 x 2 matrix.
The entries of the product are just dot products (scalars).
The dimensions of the product matrix are given by the outer dimensions of the factors. In order for matrix multiplication to be defined, the inner dimensions of the factors must match.
57 Intro to Matrices
Multiplication (of two Matrices)
Consider these items: a 2 x 2 matrix times a 2 x 1 matrix gives a 2 x 1 matrix.
58 Intro to Matrices
Multiplication (of two Matrices)
Consider these items: a 2 x 2 matrix times a 2 x 2 matrix gives a 2 x 2 matrix.
59 Intro to Matrices
Multiplication (of two Matrices)
In general, the ijth cell of the product matrix is the inner product of the ith row of X and the jth column of Y.
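The general rule can be sketched in a few lines: every cell of the product is one dot product of a row of X with a column of Y. The helper name `matmul` and all matrix entries below are hypothetical; only the shapes mirror the cases discussed on the preceding slides.

```python
def matmul(x, y):
    """Matrix product: cell (i, j) is the dot product of row i of x and column j of y."""
    if len(x[0]) != len(y):
        raise ValueError("inner dimensions must match")
    y_columns = list(zip(*y))   # column-wise view of y
    return [[sum(a * b for a, b in zip(row, col)) for col in y_columns]
            for row in x]

# (1 x 3) times (3 x 2) gives (1 x 2)
print(matmul([[1, 2, 3]], [[1, 2], [3, 4], [5, 6]]))

# (2 x 2) times (2 x 1) gives (2 x 1)
print(matmul([[1, 2], [3, 4]], [[5], [6]]))

# (2 x 2) times (2 x 2) gives (2 x 2)
print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

A product with r rows in X and c columns in Y therefore takes r times c dot products, and swapping the factors generally changes the result or makes the product undefined.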
60 Some practice
Find the following matrix products, if they exist
61 Some practice
This table shows information about the words that occur in a (fictional) group of documents. Each row is a document; each column is a term. 1 indicates that a term occurs in a document, 0 indicates that it doesn't.
Let A be the matrix specified by this table. The vector q represents a searcher's query, "dog paw prints". Use matrix (vector) multiplication to count the query words contained by each document.
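One possible approach is sketched below. The slide's table did not survive, so the document-term matrix A, the term list, and the resulting counts are all hypothetical; only the idea of multiplying A by the query vector q comes from the exercise.

```python
# Hypothetical vocabulary: one matrix column per term.
terms = ["dog", "cat", "paw", "prints"]

# Hypothetical document-term matrix: 5 documents (rows) x 4 terms (columns).
A = [[1, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 1]]

# The query "dog paw prints" as a 0/1 vector over the same terms.
query_words = {"dog", "paw", "prints"}
q = [1 if term in query_words else 0 for term in terms]

# A times q: one dot product per document row.
matches = [sum(a * b for a, b in zip(row, q)) for row in A]

for doc_number, count in enumerate(matches, start=1):
    print(f"document {doc_number}: {count} query words")
```

Because every entry is 0 or 1, each dot product simply counts the terms that the document and the query have in common.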
62 Some practice
Let A be the matrix specified by this table. Can you devise a matrix multiplication that will result in a 5 x 5 matrix giving the number of terms shared by each pair of docs? How about the equivalent for terms?
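One candidate answer, offered as a sketch rather than the intended solution: multiply A by its own transpose. The matrix A is hypothetical (5 documents by 4 terms), because the slide's table was not preserved, and the helper names are illustrative.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(x, y):
    """Cell (i, j) is the dot product of row i of x and column j of y."""
    if len(x[0]) != len(y):
        raise ValueError("inner dimensions must match")
    y_columns = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in y_columns]
            for row in x]

# Hypothetical document-term matrix: 5 documents (rows) x 4 terms (columns).
A = [[1, 0, 1, 1],
     [0, 1, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 0, 0],
     [1, 0, 0, 1]]

# (5 x 4) times (4 x 5): cell (i, j) counts the terms shared by documents i and j.
doc_doc = matmul(A, transpose(A))

# (4 x 5) times (5 x 4): cell (i, j) counts the documents containing both terms i and j.
term_term = matmul(transpose(A), A)

for row in doc_doc:
    print(row)
```

In both products the diagonal holds self-counts: the number of distinct terms in each document, or the number of documents containing each term.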
63 Summary
- Basic notation
- Statistical distributions as probabilistic models
- Using matrix algebra for multiplicative and additive operations