1. Lecture 8: Advanced Topics in Least Squares (Part Two)
2. Concerning the Homework: the dirty little secret of data analysis

3. You often spend more time futzing with reading files that are in inscrutable formats than on the intellectually interesting side of data analysis.
4. Sample MATLAB code: a routine to read a text file

cs = importdata('test.txt','');  % choose a non-occurring delimiter to force each complete line into one cell
Ns = length(cs);                 % importdata returns a cellstr data type (array of strings)
mag = zeros(Ns,1);
Nm = 0;
for i = 1:Ns
    s = char(cs(i));             % convert the cellstr element to a string
    smag  = s(48:50);            % characters 48-50 hold the magnitude
    stype = s(51:52);            % characters 51-52 hold the magnitude type
    if( strcmp(stype,'Mb') )
        Nm = Nm + 1;
        mag(Nm,1) = str2num(smag);  % convert the string to a number
    end
end
mag = mag(1:Nm);                 % truncate to the rows actually filled
5. What can go wrong in least squares: the solution m = [G^T G]^-1 G^T d fails when the matrix G^T G is singular, so that its inverse does not exist.
6. EXAMPLE: a straight-line fit

data vector: d = [d_1, d_2, d_3, ..., d_N]^T

G = [ 1 x_1 ; 1 x_2 ; 1 x_3 ; ... ; 1 x_N ]   (the equation is G m = d)

G^T G = [ N, Σ_i x_i ; Σ_i x_i, Σ_i x_i^2 ]

det(G^T G) = N Σ_i x_i^2 − (Σ_i x_i)^2

G^T G is singular when its determinant is zero.
7. det(G^T G) = N Σ_i x_i^2 − (Σ_i x_i)^2 = 0 when:

- N = 1, only one measurement (x,d):
  N Σ_i x_i^2 − (Σ_i x_i)^2 = x^2 − x^2 = 0
  You can't fit a straight line to only one point.
- N > 1, but all data measured at the same x:
  N Σ_i x_i^2 − (Σ_i x_i)^2 = N^2 x^2 − N^2 x^2 = 0
  Measuring the same point over and over doesn't help.
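To see this numerically, here is a minimal MATLAB sketch (the variable names are mine, not from the lecture) that builds G for a straight-line fit and shows that det(G^T G) vanishes when every measurement is made at the same x:

x = [2; 2; 2; 2];            % N > 1, but all data at the same x
G = [ones(length(x),1), x];  % straight-line fit: d_i = m_1 + m_2 * x_i
det(G'*G)                    % prints 0: G'G is singular
x = [1; 2; 3; 4];            % data spread over several x values
G = [ones(length(x),1), x];
det(G'*G)                    % prints 20, nonzero: G'G is invertible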
8. Another example: sums and differences

N_s sums, s_i, and N_d differences, d_i, of two unknowns m_1 and m_2:

data vector: [s_1, s_2, ..., s_Ns, d_1, d_2, ..., d_Nd]^T

G = [ 1 1 ; 1 1 ; ... ; 1 −1 ; 1 −1 ]   (N_s rows of [1 1], then N_d rows of [1 −1])

G^T G = [ N_s + N_d, N_s − N_d ; N_s − N_d, N_s + N_d ]

det(G^T G) = (N_s + N_d)^2 − (N_s − N_d)^2
           = N_s^2 + 2 N_s N_d + N_d^2 − N_s^2 + 2 N_s N_d − N_d^2
           = 4 N_s N_d

This is zero when N_s = 0 or N_d = 0, that is, when there are only measurements of one kind.
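A quick MATLAB check of the 4 N_s N_d result (a sketch with made-up counts of my choosing):

Ns = 3; Nd = 2;
G = [ones(Ns,2); [ones(Nd,1), -ones(Nd,1)]];  % Ns rows of [1 1], Nd rows of [1 -1]
det(G'*G)        % prints 24 = 4*Ns*Nd
G = ones(Ns,2);  % sums only, i.e. Nd = 0
det(G'*G)        % prints 0: singular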
9. This sort of missing measurement might be difficult to recognize in a complicated problem, but it happens all the time.
10. Example: Tomography

11. In this method, you try to plaster the subject with X-ray beams made at every possible position and direction, but you can easily wind up missing some small region.

[Figure: tomographic ray coverage with a gap; no data coverage in the missed region]
12. What to do?

- Introduce prior information: assumptions about the behavior of the unknowns that fill in the data gaps.
13. Examples of Prior Information

The unknowns:
- are close to some already-known value (e.g., the density of the mantle is close to 3000 kg/m^3)
- vary smoothly with time or with geographical position (e.g., ocean currents have length scales of tens of km)
- obey some physical law embodied in a PDE (e.g., water is incompressible, and thus its velocity satisfies div(v) = 0)
14. Are you only fooling yourself?

- It depends: are your assumptions good ones?
15. Application of the Maximum Likelihood Method to this problem

So, let's have a foray into the world of probability.
16. Overall Strategy

1. Represent the observed data as a probability distribution.
2. Represent prior information as a probability distribution.
3. Represent the relationship between data and model parameters as a probability distribution.
4. Combine the three distributions in a way that embodies combining the information that they contain.
5. Apply maximum likelihood to the combined distribution.
17. How to combine distributions in a way that embodies combining the information that they contain

- Short answer: multiply them.
- But let's step through a more well-reasoned analysis of why we should do that.

[Figure: two distributions p1(x) and p2(x), and their combination pT(x)]
18. How to quantify the information in a distribution p(x)?

Information compared to what? Compared to a distribution pN(x) that represents the state of complete ignorance. Example: pN(x) = a uniform distribution. The information content should be a scalar quantity, Q.
19. Q = ∫ ln[ p(x)/pN(x) ] p(x) dx

Q is the expected value of ln[ p(x)/pN(x) ].

Properties:
- Q = 0 when p(x) = pN(x)
- Q ≥ 0 always (since the limit of p·ln(p) as p→0 is 0)
- Q is invariant under a change of variables x→y
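Here is a small MATLAB sketch (my own illustration, not from the lecture) that approximates Q on a grid, for a normal p(x) measured against a uniform pN(x):

dx = 0.01; x = (-10:dx:10)';
p  = exp(-0.5*(x-1).^2) / sqrt(2*pi);  % normal distribution, mean 1, variance 1
pN = ones(size(x)) / 20;               % uniform distribution on [-10,10]
Q  = sum( log(p./pN) .* p ) * dx       % about 1.58; would be 0 if p equaled pN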
20. Combining distributions pA(x) and pB(x)

Desired properties of the combination:
- pA(x) combined with pB(x) is the same as pB(x) combined with pA(x)
- [pA(x) combined with pB(x)] combined with pC(x) is the same as pA(x) combined with [pB(x) combined with pC(x)]
- Q of pA(x) combined with pN(x) = QA
21. pAB(x) = pA(x) pB(x) / pN(x)

When pN(x) is the uniform distribution, combining is just multiplying.

But note that for points on the surface of a sphere, the null distribution p(θ,φ), where θ is latitude and φ is longitude, would not be uniform, but rather proportional to sin(θ).
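A quick MATLAB sketch (my own illustration, assuming a uniform pN so that combining is just multiplying) showing that the product of two normal distributions, renormalized, is a narrower and hence more informative combined distribution:

dx = 0.001; x = (-5:dx:5)';
pA  = exp(-0.5*(x-0).^2) / sqrt(2*pi);  % normal, mean 0, variance 1
pB  = exp(-0.5*(x-1).^2) / sqrt(2*pi);  % normal, mean 1, variance 1
pAB = pA .* pB;                         % combine: multiply (pN uniform)
pAB = pAB / (sum(pAB)*dx);              % renormalize to unit area
[~, imax] = max(pAB);
x(imax)                                 % peak at 0.5, between the two means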
22. Overall Strategy, step 1: Represent the observed data as a Normal probability distribution

pA(d) ∝ exp[ −½ (d − dobs)^T Cd^-1 (d − dobs) ]

In the absence of any other information, the best estimate of the mean of the data is the observed data itself. Cd is the prior covariance of the data. (I don't feel like typing the normalization.)
23. Overall Strategy, step 2: Represent prior information as a Normal probability distribution

pA(m) ∝ exp[ −½ (m − mA)^T Cm^-1 (m − mA) ]

mA is the prior estimate of the model: your best guess as to what it would be, in the absence of any observations. Cm, the prior covariance of the model, quantifies how good you think your prior estimate is.
24. Example: one observation, dobs = 0.8 with σd = 0.4, and one model parameter with mA = 1.0 and σm = 1.25.

25. [Figure: pA(d), peaked at dobs = 0.8, and pA(m), peaked at mA = 1, each plotted on the interval 0 to 2]
26. Overall Strategy, step 3: Represent the relationship between data and model parameters as a probability distribution

pT(d,m) ∝ exp[ −½ (d − Gm)^T CG^-1 (d − Gm) ]

This is a linear theory, Gm = d, relating the data, d, to the model parameters, m. CG, the prior covariance of the theory, quantifies how good you think your linear theory is.
27. Example theory: d = m, but only accurate to ±0.2.

28. [Figure: pT(d,m), concentrated along the line d = m, plotted for d and m from 0 to 2]
29. Overall Strategy, step 4: Combine the three distributions in a way that embodies combining the information that they contain

p(d,m) = pA(d) pA(m) pT(d,m)
       ∝ exp{ −½ [ (d − dobs)^T Cd^-1 (d − dobs) + (m − mA)^T Cm^-1 (m − mA) + (d − Gm)^T CG^-1 (d − Gm) ] }

A bit of a mess, but it can be simplified...

30. [Figure: the combined distribution p(d,m) = pA(d) pA(m) pT(d,m), plotted for d and m from 0 to 2]
31. Overall Strategy, step 5: Apply maximum likelihood to the combined distribution, p(d,m) = pA(d) pA(m) pT(d,m)

There are two distinct ways we could do this (a numerical sketch follows the figures below):
- Find the (d,m) combination that maximizes the joint probability distribution, p(d,m).
- Find the m that maximizes the individual probability distribution, p(m) = ∫ p(d,m) dd.

These do not necessarily give the same value for m.
32. [Figure: the maximum likelihood point of the joint distribution p(d,m) = pA(d) pA(m) pT(d,m), located at predicted data dpre and model estimate mest]
33. [Figure: the maximum likelihood point of the marginal distribution p(m) = ∫ p(d,m) dd, located at the model estimate mest]
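To make the two recipes concrete, here is a MATLAB sketch (my own construction, reusing the example numbers from slides 24 and 27 and treating 0.4, 1.25, and 0.2 as standard deviations) that evaluates p(d,m) on a grid, then locates both the joint maximum and the maximum of the marginal p(m). For this all-Gaussian example the two agree; in general they need not.

dx = 0.001; d = (0:dx:2)'; m = (0:dx:2)';
[Dg,Mg] = meshgrid(d,m);        % rows index m, columns index d
dobs = 0.8; sd = 0.4; mA = 1.0; sm = 1.25; sg = 0.2;
p = exp( -0.5*( (Dg-dobs).^2/sd^2 + (Mg-mA).^2/sm^2 + (Dg-Mg).^2/sg^2 ) );
[~,k] = max(p(:)); [im,id] = ind2sub(size(p),k);
[d(id), m(im)]                  % joint maximum: dpre ≈ 0.818, mest ≈ 0.823
pm = sum(p,2);                  % marginal p(m): integrate p(d,m) over d
[~,im2] = max(pm);
m(im2)                          % marginal maximum: mest ≈ 0.823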
34. Special case of an exact theory

In the limit CG → 0:
- exp[ −½ (d − Gm)^T CG^-1 (d − Gm) ] → δ(d − Gm), the Dirac delta function, with the property ∫ f(x) δ(x − y) dx = f(y)
- so p(m) = ∫ p(d,m) dd = pA(m) ∫ pA(d) δ(d − Gm) dd = pA(m) pA(d = Gm)
- and for normal distributions:
  p(m) ∝ exp{ −½ [ (Gm − dobs)^T Cd^-1 (Gm − dobs) + (m − mA)^T Cm^-1 (m − mA) ] }
35. Special case of an exact theory

Maximizing p(m) is equivalent to minimizing

(Gm − dobs)^T Cd^-1 (Gm − dobs) + (m − mA)^T Cm^-1 (m − mA)

that is, the weighted prediction error plus the weighted distance of the model from its prior value.
36. Solution, calculated via the usual messy minimization process:

mest = mA + M (dobs − G mA)

where M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1

Don't memorize this, but be prepared to use it (e.g., in the homework).
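A minimal MATLAB sketch of this solution (my own, reusing the slide 24 numbers and treating the theory d = m as exact, so CG does not enter):

G  = 1;                       % theory d = m
Cd = 0.4^2;  Cm = 1.25^2;     % prior variances of data and model
dobs = 0.8;  mA = 1.0;
M = inv(G'*inv(Cd)*G + inv(Cm)) * G'*inv(Cd);
mest = mA + M*(dobs - G*mA)   % = 1.0 + 0.907*(-0.2), about 0.82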
37. Interesting interpretation:

mest − mA = M (dobs − G mA)

The estimated model minus its prior equals a linear connection, M, applied to the observed data minus the prediction of the prior model.
38. Special case of no prior information, Cm → ∞:

M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1 → [G^T Cd^-1 G]^-1 G^T Cd^-1

mest = mA + [G^T Cd^-1 G]^-1 G^T Cd^-1 (dobs − G mA)
     = mA + [G^T Cd^-1 G]^-1 G^T Cd^-1 dobs − [G^T Cd^-1 G]^-1 G^T Cd^-1 G mA
     = mA + [G^T Cd^-1 G]^-1 G^T Cd^-1 dobs − mA
     = [G^T Cd^-1 G]^-1 G^T Cd^-1 dobs

This recovers weighted least squares.
39. Special case of infinitely accurate prior information, Cm → 0:

M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1 → 0

mest = mA + 0 = mA

This recovers the prior value of m.
40. Special uncorrelated case: Cm = σm^2 I and Cd = σd^2 I

M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1
  = [σd^-2 G^T G + σm^-2 I]^-1 G^T σd^-2
  = [G^T G + (σd/σm)^2 I]^-1 G^T

This formula is sometimes called damped least squares, with damping factor ε = σd/σm.
41. Damped least squares makes the process of avoiding the singular matrices associated with insufficient data trivially easy: you just add ε^2 I to G^T G before computing the inverse.

42. G^T G → G^T G + ε^2 I

This process regularizes the matrix, so its inverse always exists. Its interpretation is: in the absence of relevant data, assume the model parameter has its prior value.
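A MATLAB sketch (my own illustration) applying this to the sums-only singular problem from slide 8. The undamped normal equations would fail; the damped version returns an answer that splits each sum evenly, consistent with a prior value mA = 0 pulling the undetermined difference toward zero:

G = ones(3,2);                  % three sums of m1 and m2: G'G is singular
dobs = [1.1; 0.9; 1.0];
e2 = 0.01;                      % damping: e2 plays the role of (sd/sm)^2
mest = (G'*G + e2*eye(2)) \ (G'*dobs)   % about [0.499; 0.499]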
43. Are you only fooling yourself?

- It depends: is the assumption that you know the prior value a good one?
44. Smoothness Constraints

For example, the model is smooth when its second derivative is small:

d^2 m_i / dx^2 ∝ m_{i-1} − 2 m_i + m_{i+1}

(assuming the model parameters are organized according to one spatial variable)
45. The matrix D approximates the second derivative:

D = [ 1 −2  1  0  0 ... 0 ;
      0  1 −2  1  0 ... 0 ;
      ...
      0 ... 0  0  1 −2  1 ]

d^2 m / dx^2 ∝ D m
46. Choosing a smooth solution is thus equivalent to minimizing

(Dm)^T (Dm) = m^T (D^T D) m

Comparing this to the (m − mA)^T Cm^-1 (m − mA) minimization implied by the general solution

mest = mA + M (dobs − G mA),  where M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1

indicates that, to implement smoothness, we should make the choices

mA = 0  and  Cm^-1 = D^T D

(see the sketch below).
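A closing MATLAB sketch (my own construction) of these choices: build D, set mA = 0, and use a weighted D^T D in place of Cm^-1 (the weight e2 and the example numbers are mine). Five model parameters lie along a line, with data at positions 1, 2, and 5 only, so the smoothness constraint must fill the gap:

Nm = 5;
D = diag(ones(Nm-1,1),-1) - 2*diag(ones(Nm,1),0) + diag(ones(Nm-1,1),1);
D = D(2:Nm-1,:);                % keep interior rows: each is [... 1 -2 1 ...]
G = zeros(3,Nm); G(1,1) = 1; G(2,2) = 1; G(3,5) = 1;  % data at parameters 1, 2, 5
dobs = [1; 2; 5];
e2 = 0.1;                       % weight on the smoothness constraint
mest = (G'*G + e2*(D'*D)) \ (G'*dobs)   % returns [1;2;3;4;5]: a straight-line trend fills the gap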