Review of mathematics and data analysis for IR

Slides: 64

Provided by: technica5

Transcript and Presenter's Notes

1
Review of mathematics and data analysis for IR
2
Today's class

Reminder of basic notation and terminology
Overview of several important statistical tools
for data analysis
Discussion of matrix algebra in the context of
data analysis (and IR)
We'll handle probability during a later class

3
Basic Notation
N: the number of items summed
Sigma (Σ): summation
Subscript i: the ith element of the summation
Common variables
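As a small illustration of this notation (the symbols x_i and \bar{x} are the usual conventions, assumed here rather than recovered from the slide), the sum of N items and their mean can be written as:

```latex
\sum_{i=1}^{N} x_i = x_1 + x_2 + \cdots + x_N,
\qquad
\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i
```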
4
Getting to know your data

A group of n observations on p variables.
For now we'll assume numeric variables.
Also assume the observations are independent
and identically distributed (iid).

5
The mtcars data

Road-test information from Motor Trend Magazine.
n = 32 cars
p = 11 variables

6
The mtcars data
[, 1] mpg   Miles/(US) gallon
[, 2] cyl   Number of cylinders
[, 3] disp  Displacement (cu.in.)
[, 4] hp    Gross horsepower
[, 5] drat  Rear axle ratio
[, 6] wt    Weight (lb/1000)
[, 7] qsec  1/4 mile time
[, 8] vs    V/S
[, 9] am    Transmission (0 = automatic, 1 = manual)
[,10] gear  Number of forward gears
[,11] carb  Number of carburetors

Road-test information from Motor Trend Magazine.
n = 32 cars
p = 11 variables

7
The mtcars data
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
So now we have n = 32 observations on p = 9
(numeric) variables.
8
The mtcars data
                     mpg cyl  disp  hp drat    wt  qsec gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02    4    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60    4    2
9
Working with data

From our data we may calculate a statistic such
as average mpg.
We use statistics for different things depending
on our needs.
Description: the average mpg in our sample is easier
to think about than all 32 individual values.
Inference: extrapolate from our sample to a
parameter in the population, e.g. the mpg of modern
cars.

10
Working with data

A statistic can describe one or many variables.
For now we'll concentrate on univariate stats.
These may be roughly divided into two categories:
Measures of central tendency
Measures of variability

11
Visualizing univariate data
Median = 19.2
12
Visualizing univariate data
Median = 19.2
Intuitively, what is the sample mean (xbar)? In
what sense is it the data's center? Why might the
mean and median diverge?
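One way to see why the mean and median can diverge is to add a single extreme value to a small sample. The sketch below uses made-up, mpg-like numbers (they are not taken from the mtcars data) and Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical values, chosen only to illustrate the effect of an outlier.
values = [18, 19, 20, 21, 22]
with_outlier = values + [60]

# Without the outlier the two measures of center agree.
print(mean(values), median(values))

# The outlier pulls the mean upward, while the median barely moves.
print(mean(with_outlier), median(with_outlier))
```

The mean uses every value's magnitude, so it is sensitive to skew and outliers; the median depends only on the ordering of the values.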
13
Visualizing univariate data
mu is the population mean, a parameter we've been
approximating with xbar.
14
Visualizing univariate data
So what are these, sigma and sigma squared?
15
Variance and Std. Deviation

We mentioned measures of dispersion.
Variance and std. deviation quantify how spread
out the data are around the mean.
Roughly, the standard deviation is the typical
distance of a data point from the mean; more
precisely, it is the square root of the average
squared distance from the mean.
The variance is simply the square of the SD.

16
Variance and Std. Deviation

Example
X = 1, 2, 3, 2, 5, 1, 7
Sum(x)
Xbar
Sum((x - xbar)^2)
Var(x)
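The quantities listed in this example can be worked through step by step. The sketch below assumes the sample variance (dividing by n - 1), which is consistent with the mpg variance quoted elsewhere in this deck; the slide's own worked values did not survive in the transcript.

```python
x = [1, 2, 3, 2, 5, 1, 7]

n = len(x)                                    # 7 observations
total = sum(x)                                # Sum(x) = 21
xbar = total / n                              # Xbar = 3.0
sum_sq = sum((xi - xbar) ** 2 for xi in x)    # Sum((x - xbar)^2) = 30.0
var = sum_sq / (n - 1)                        # Var(x) = 5.0 (assumes the n - 1 divisor)
sd = var ** 0.5                               # standard deviation = sqrt(5)

print(total, xbar, sum_sq, var, sd)
```

Dividing by n instead of n - 1 would give the population form of the variance, 30/7.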

17
Numeric Overviews of Data: Descriptive Statistics

Returning to our mpg example:
Mean(mpg) = 20.09
Mean(disp) = 230.72
Variance(mpg) = 36.32
Variance(disp) = 15360.8
Standard dev. (mpg) = 6.02
Standard dev. (disp) = 123.94
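These mpg figures can be reproduced from the mpg column of the mtcars table shown in this deck. The sketch below uses Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) form; the variable names are illustrative.

```python
from statistics import mean, median, stdev, variance

# mpg for the 32 cars, in the order they appear in the mtcars table.
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4,
       22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4,
       14.7, 32.4, 30.4, 33.9, 21.5, 15.5, 15.2, 13.3,
       19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4]

mpg_mean = mean(mpg)        # about 20.09
mpg_median = median(mpg)    # 19.2
mpg_var = variance(mpg)     # about 36.32 (sample variance)
mpg_sd = stdev(mpg)         # about 6.027

print(round(mpg_mean, 2), mpg_median, round(mpg_var, 2), round(mpg_sd, 3))
```

The standard deviation comes out near 6.027, so the 6.02 quoted on the slide appears to be truncated rather than rounded.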

18
Probabilistic Models Estimating the Distribution
of Data
xbar = 20.09, sigHat = 6.02

The Gaussian approximation of the variable mpg

19
Probabilistic Models Estimating the Distribution
of Data
xbar = 20.09, sigHat = 6.02

The Gaussian approximation of the variable mpg

Different values for xbar or sigHat give
different probability models.
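To make the effect of the two parameters concrete, here is a minimal sketch of the Gaussian density evaluated with the slide's estimates (xbar = 20.09, sigHat = 6.02) and with sigHat = 1. The function name `normal_pdf` is illustrative, not something from the presentation.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a Gaussian with the given mean and standard deviation at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))

xbar, sig_hat = 20.09, 6.02

wide = normal_pdf(xbar, xbar, sig_hat)   # peak height with sigHat = 6.02
narrow = normal_pdf(xbar, xbar, 1.0)     # peak height with sigHat = 1

# A smaller standard deviation gives a taller, narrower curve.
print(wide, narrow)
```

Changing xbar slides the curve left or right; changing sigHat stretches or squeezes it around that center.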
20
Probabilistic Models Estimating the Distribution
of Data
xbar = 20.09, sigHat = 1

The Gaussian approximation of the variable mpg

21
Probabilistic Models Estimating the Distribution
of Data
xbar = 20.09, sigHat = 6.02

The Gaussian approximation of the variable mpg

The crucial point: based on our data we calculate
a statistic, which is an estimator for a
corresponding parameter in the population.
22
Probabilistic Models Estimating the Distribution
of Data
xbar = 20.09, sigHat = 6.02

The Gaussian approximation of the variable mpg

The crucial point: based on our data we calculate
a statistic, which is an estimator for a
corresponding parameter in the population.
statistic
parameter
23
Probabilistic Models Estimating the Distribution
of Data
xbar = 20.09, sigHat = 6.02

The Gaussian approximation of the variable mpg

statistic
parameter
If we drew another sample of size N from the
population, we would compute new values for xbar
and sigHat. These would probably be close to our
observed statistics, but not identical. If we
drew a whole bunch of new samples, each of size
N, we would have a bunch (i.e., a sample) of
estimates of mu. There is a nice result from
statistics about how these estimates behave.
24
Probabilistic Models Estimating the Distribution
of Data
xbar = 20.09, sigHat = 6.02

The Gaussian approximation of the variable mpg

statistic
parameter
For large N, the sample mean xbar is approximately
normally distributed around the population mean mu
(the central limit theorem). Moreover, it has
its own standard deviation (the standard error).
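The formula that accompanied this slide is not in the transcript; the standard textbook expression for the standard error of the mean, evaluated with the mpg estimates used above (sigHat = 6.02, N = 32), would be:

```latex
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{N}}
\approx \frac{\hat{\sigma}}{\sqrt{N}}
= \frac{6.02}{\sqrt{32}} \approx 1.06
```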
25
One reason we take Averages

Distribution of x

Distribution of xbar

26
One reason we take Averages

Distribution of x

Distribution of xbar

The sample mean gives us a better estimate of the
population mean than an individual data point
does because on average, means vary around the
population mean less than single data points.
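This claim can be checked with a small simulation. The sketch below assumes a Gaussian population with made-up parameters (mu = 20, sigma = 6) and a sample size of 32; it compares the spread of single data points with the spread of sample means.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population parameters and sample size (not from the slides).
mu, sigma, n = 20.0, 6.0, 32
replicates = 2000

# Spread of individual data points drawn from the population.
single_points = [random.gauss(mu, sigma) for _ in range(replicates)]

# Spread of sample means, each computed from a fresh sample of size n.
sample_means = [
    statistics.mean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(replicates)
]

sd_points = statistics.stdev(single_points)
sd_means = statistics.stdev(sample_means)

print(sd_points, sd_means, sigma / n ** 0.5)
```

The spread of the sample means should come out close to sigma / sqrt(n), well below the spread of the single points.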
27
Review of logarithms

Read this statement as an English sentence

28
Review of logarithms

So what does this statement mean?

Logarithms can be taken to a variety of bases.
The most common for us will be base 2.
29
Review of logarithms

So what does this statement mean? Logs are
closely related to exponents.

iff
30
Review of logarithms

So what does this statement mean? Logs are
closely related to exponents.

iff
31
Review of logarithms

So what does this statement mean? Logs are
closely related to exponents.

iff
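The equations on these slides did not survive in the transcript. The standard relationship between logarithms and exponents, with a base-2 example chosen here for illustration, is:

```latex
\log_b x = y \iff b^{y} = x,
\qquad \text{for example} \qquad
\log_2 8 = 3 \iff 2^{3} = 8
```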
32
Review of logarithms

Why do we use these weird exponents?

Logs dampen the effect of very large numbers.
Logs allow us to express complicated (e.g.
multiplicative) relationships as simple addition.

Logs also allow us to deal with very small
numbers (such as probabilities) without numerical
underflow problems.
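A short sketch of these points, using base-2 logs from Python's standard `math` module. The list of probabilities is made up for illustration.

```python
import math

# Logs turn multiplication into addition: log2(4 * 8) = log2(4) + log2(8).
log_of_product = math.log2(4 * 8)
sum_of_logs = math.log2(4) + math.log2(8)

# Hypothetical example: 100 small, independent probabilities.
probs = [1e-5] * 100

# Multiplying them directly underflows: the true product, 1e-500, is smaller
# than the smallest positive double-precision float.
product = 1.0
for p in probs:
    product *= p

# Summing logs keeps the same information in a comfortable numeric range.
log_sum = sum(math.log2(p) for p in probs)

print(log_of_product, sum_of_logs, product, log_sum)
```

Comparing log-probabilities preserves the ordering of the underlying probabilities, because the logarithm is an increasing function.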
33
Intro to Matrices
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

It is often the case that each observation we
make comprises several variables (11 in the
case of the mtcars data).
To help us manipulate and understand such
multivariate data, it is useful to consider our
data set (sample) as a matrix.
Matrix theory (and linear algebra) provides a
strong apparatus for manipulating multivariate
data.

34
Intro to Matrices
From your basic math class: a matrix is a
2-dimensional array.
For the purposes of this course a matrix will
contain numbers, though in general this need not
be the case.
35
Intro to Matrices
From your basic math class: a matrix is a
2-dimensional array.
This is very much like a spreadsheet or a data
set, where we have data arranged into rows and
columns. Matrix algebra allows us to manipulate
portions of this group of numbers, or the whole
group at once.
Here, matrix A is a 3 x 3 matrix of integers
36
Intro to Matrices
From your basic math class: a matrix is a
2-dimensional array.
We can think of a matrix as a collection of row
vectors, where a vector is just a 1-D array.
Row 1
Row 2
Row 3
Here, matrix A is a 3 x 3 matrix of integers
37
Intro to Matrices
From your basic math class: a matrix is a
2-dimensional array.
Or we can consider the column vectors. An
important thing about a matrix is that we can
take either the row-wise or column-wise view of
its vectors.
Column 1
Column 2
Column 3
Here, matrix A is a 3 x 3 matrix of integers
38
Intro to Matrices
From your basic math class: a matrix is a
2-dimensional array.
Of course, such arrangements don't preclude
consideration of the individual cells or
values in the matrix.
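The row, column, and cell views can be sketched with a plain list of lists. The matrix A below is a made-up 3 x 3 matrix of integers, since the slide's own matrix is not in the transcript.

```python
# Hypothetical 3 x 3 matrix of integers, stored as a list of row vectors.
A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]

row_2 = A[1]                     # row-wise view: the second row (Python counts from 0)
col_2 = [row[1] for row in A]    # column-wise view: the second column
cell_23 = A[1][2]                # a single cell: row 2, column 3

print(row_2, col_2, cell_23)
```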
Here, matrix A is a 3 x 3 matrix of integers
39
Intro to Matrices
Some Special Matrices
A square matrix is any matrix that has the same
number of rows as columns
40
Intro to Matrices
Some Special Matrices
A diagonal matrix is a square matrix whose elements
are all 0 except those on the main diagonal.
A special kind of diagonal matrix is the identity
matrix, which simply has all 1s on the main
diagonal.
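A minimal sketch of these special matrices using lists of lists; the helper names `is_square`, `diagonal`, and `identity` are illustrative.

```python
def is_square(M):
    """True if M has as many columns in every row as it has rows."""
    return all(len(row) == len(M) for row in M)

def diagonal(values):
    """Square matrix with the given values on the main diagonal and 0 elsewhere."""
    n = len(values)
    return [[values[i] if i == j else 0 for j in range(n)] for i in range(n)]

def identity(n):
    """n x n identity matrix: a diagonal matrix with all 1s on the diagonal."""
    return diagonal([1] * n)

D = diagonal([2, 5, 7])
I3 = identity(3)
print(D, I3, is_square(D))
```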
41
Intro to Matrices
Transposing a Matrix
Transposing a matrix just involves swapping the
rows and columns
42
Intro to Matrices
Transposing a Matrix
Transposing a matrix just involves swapping the
rows and columns
Here, matrix A is a 3 x 3 matrix of integers
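Transposition can be sketched in one line with `zip`. The matrices below are made-up examples, and the helper name `transpose` is illustrative.

```python
def transpose(M):
    """Swap rows and columns: row i of M becomes column i of the result."""
    return [list(col) for col in zip(*M)]

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]      # hypothetical 3 x 3 matrix
B = [[1, 2, 3],
     [4, 5, 6]]      # hypothetical 2 x 3 matrix

print(transpose(A))  # still 3 x 3, mirrored across the main diagonal
print(transpose(B))  # a 2 x 3 matrix becomes 3 x 2
```

Transposing twice returns the original matrix.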
43
Intro to Matrices
Matrix Addition
44
Intro to Matrices
Matrix Addition
45
Intro to Matrices
Matrix (and vector) Addition
46
Intro to Matrices
Matrix Subtraction
47
Intro to Matrices
Matrix Addition and Subtraction
Undefined!
Two matrices must have the same dimensions (the
same number of rows and the same number of
columns) in order to be added or subtracted.
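A minimal sketch of elementwise addition and subtraction with the dimension check made explicit. The helper names and the example matrices are illustrative.

```python
def shape(M):
    """Return (rows, columns) of a matrix stored as a list of rows."""
    return (len(M), len(M[0]))

def add(X, Y):
    """Elementwise sum; defined only when X and Y have the same dimensions."""
    if shape(X) != shape(Y):
        raise ValueError("matrices must have the same dimensions")
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def subtract(X, Y):
    """Elementwise difference; defined only when X and Y have the same dimensions."""
    if shape(X) != shape(Y):
        raise ValueError("matrices must have the same dimensions")
    return [[x - y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

X = [[1, 2], [3, 4]]
Y = [[10, 20], [30, 40]]
print(add(X, Y), subtract(Y, X))
```

Calling `add` on a 1 x 2 matrix and a 2 x 1 matrix raises `ValueError`, which mirrors the "Undefined!" case on the slide.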
48
Intro to Matrices
Multiplication (by a scalar)
Multiply 3 x 3 matrix A by the scalar 3 to get 3A.
49
Intro to Matrices
Multiplication (by a scalar)
Multiply the column vector v by the scalar 3 to
get 3v.
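Scalar multiplication multiplies every element by the same number. The sketch below uses a made-up matrix A and column vector v (stored as a 3 x 1 matrix); the helper name `scale` is illustrative.

```python
def scale(c, M):
    """Multiply every element of matrix M by the scalar c."""
    return [[c * x for x in row] for row in M]

A = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]          # hypothetical 3 x 3 matrix
v = [[1], [2], [3]]      # hypothetical column vector, stored as a 3 x 1 matrix

three_A = scale(3, A)
three_v = scale(3, v)
print(three_A, three_v)
```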
50
Intro to Matrices
Multiplication (of two vectors)
Multiplying two vectors of the same length (a row
vector by a column vector) gives a single number (a
scalar). This operation is called the inner
product (or dot product) of two vectors.
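A minimal sketch of the inner product: multiply matching elements and add up the results. The vectors are made-up examples, and the helper name `dot` is illustrative.

```python
def dot(u, v):
    """Inner (dot) product of two vectors of the same length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))

u = [1, 2, 3]
v = [4, 5, 6]

# 1*4 + 2*5 + 3*6 = 32
print(dot(u, v))
```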
51
Intro to Matrices
Multiplication (of two vectors)
52
Intro to Matrices
Multiplication (of two vectors)
53
Intro to Matrices
Multiplication (of two Matrices)
Multiplying two matrices tends to freak people out:
it doesn't seem right, especially after our nice
results about adding matrices and multiplying
matrices by scalars. But our discussion of
multiplying vectors forms the basis for
understanding how we multiply matrices, because
vectors are basically just skinny
matrices. Multiplying two matrices X and Y
involves finding many dot products. Exactly how
many and in what order is the confusing part.
54
Intro to Matrices
Multiplication (of two Matrices)
Consider these items
1 x 3
? x ?
3 x 2
55
Intro to Matrices
Multiplication (of two Matrices)
Consider these items
These are just dot products (scalars).
1 x 3
1 x 3
3 x 2
3 x 2
The dimensions of the product matrix are given by
the outer dimensions of the factors. In order
for matrix multiplication to be defined the inner
dimensions of the factors must match.
1 x 2
56
Intro to Matrices
Multiplication (of two Matrices)
Consider these items
These are just dot products (scalars).
1 x 3
1 x 3
3 x 2
3 x 2
The dimensions of the product matrix are given by
the outer dimensions of the factors. In order
for matrix multiplication to be defined the inner
dimensions of the factors must match.
1 x 2
57
Intro to Matrices
Multiplication (of two Matrices)
Consider these items
2 x 2
2 x 2
2 x 1
2 x 1
2 x 1
58
Intro to Matrices
Multiplication (of two Matrices)
Consider these items
2 x 2
2 x 2
2 x 2
2 x 2
2 x 2
59
Intro to Matrices
Multiplication (of two Matrices)
Consider these items
2 x 2
2 x 2
2 x 2
2 x 2
In general, the ijth cell of the product matrix
is the inner product of the ith row of X and the
jth column of Y.
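The general rule can be sketched directly: check that the inner dimensions match, then fill each cell with the dot product of a row of X and a column of Y. The matrices below are made-up examples, and the helper name `matmul` is illustrative.

```python
def matmul(X, Y):
    """Matrix product: cell (i, j) is the dot product of row i of X and column j of Y."""
    if len(X[0]) != len(Y):
        raise ValueError("inner dimensions must match")
    cols_Y = list(zip(*Y))   # columns of Y, each as a tuple
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_Y]
            for row in X]

X = [[1, 2, 3]]              # 1 x 3
Y = [[1, 0],
     [0, 1],
     [1, 1]]                 # 3 x 2

# Outer dimensions give the result's shape: (1 x 3)(3 x 2) -> 1 x 2.
print(matmul(X, Y))
```

Reversing the order, `matmul(Y, X)`, fails here because the inner dimensions (2 and 1) do not match; matrix multiplication is not commutative in general.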
60
Some practice
Find the following matrix products, if they exist
61
Some practice
This table shows information about the words that
occur in a (fictional) group of documents. Each
row is a document; each column is a term. A 1
indicates that a term occurs in a document; a 0
indicates that it doesn't.
Let A be the matrix specified by this table. The
vector q represents a searcher's query, "dog paw
prints". Use matrix (vector) multiplication to
count the query words contained in each document.
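One possible approach, sketched with a made-up term-document table (the slide's actual table is not in the transcript): multiply A by the query vector q, so that each document's count is the dot product of its row with q. The term list and matrix values below are hypothetical.

```python
# Hypothetical terms (columns) and documents (rows); not the slide's table.
terms = ["dog", "cat", "paw", "prints", "food"]
A = [[1, 0, 1, 1, 0],   # doc 1
     [0, 1, 1, 0, 1],   # doc 2
     [1, 1, 0, 0, 1],   # doc 3
     [0, 1, 0, 0, 1],   # doc 4
     [1, 0, 0, 1, 0]]   # doc 5

# Query "dog paw prints" as a 0/1 vector over the same terms.
query_words = {"dog", "paw", "prints"}
q = [1 if t in query_words else 0 for t in terms]

# A times q: one dot product per document row.
counts = [sum(a * b for a, b in zip(row, q)) for row in A]
print(counts)
```

Each entry of `counts` is the number of query words that occur in the corresponding document.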
62
Some practice
This table shows information about the words that
occur in a (fictional) group of documents. Each
row is a document; each column is a term. A 1
indicates that a term occurs in a document; a 0
indicates that it doesn't.
Let A be the matrix specified by this table. Can
you devise a matrix multiplication that will
result in a 5 x 5 matrix giving the number of
terms shared by each pair of docs? How about the
equivalent for terms?
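One possible answer, sketched with a made-up 5 x 5 term-document matrix (the slide's table is not in the transcript): multiplying A by its transpose compares every pair of document rows, and multiplying the transpose by A compares every pair of term columns. The helper names are illustrative.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(X, Y):
    cols_Y = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols_Y]
            for row in X]

# Hypothetical 0/1 matrix: rows are documents, columns are the terms
# dog, cat, paw, prints, food.
A = [[1, 0, 1, 1, 0],
     [0, 1, 1, 0, 1],
     [1, 1, 0, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 0, 1, 0]]

doc_doc = matmul(A, transpose(A))     # cell (i, j): terms shared by docs i and j
term_term = matmul(transpose(A), A)   # cell (i, j): docs containing both terms i and j

print(doc_doc)
print(term_term)
```

Both products are symmetric, and their diagonals give the number of terms in each document and the number of documents containing each term, respectively.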
63
Summary

Basic notation
Statistical distributions as probabilistic models
Using matrix algebra for multiplicative and
additive operations.
