1. Lecture 8: Advanced Topics in Least Squares (Part Two)
2. Concerning the Homework: the dirty little secret of data analysis

3. You often spend more time futzing with reading files that are in inscrutable formats than on the intellectually interesting side of data analysis.
4. Sample MATLAB code: a routine to read a text file

cs = importdata('test.txt','');  % choose a non-occurring delimiter to force each complete line into one cell
Ns = length(cs);                 % importdata returns a cellstr data type (array of strings)
mag = zeros(Ns,1);
Nm = 0;
for i = 1:Ns
    s = char(cs(i));             % convert the cellstr element to a string
    smag  = s(48:50);            % characters 48-50 hold the magnitude
    stype = s(51:52);            % characters 51-52 hold the magnitude type
    if( strcmp(stype,'Mb') )
        Nm = Nm + 1;
        mag(Nm,1) = str2num(smag);  % convert the string to a number
    end
end
mag = mag(1:Nm);                 % truncate to the rows actually filled
5. What can go wrong in least squares: the solution m = [G^T G]^-1 G^T d fails when the matrix G^T G is singular, so that its inverse does not exist.
6. EXAMPLE: a straight-line fit

data vector: d = [d_1, d_2, d_3, ..., d_N]^T

G = [ 1 x_1 ; 1 x_2 ; 1 x_3 ; ... ; 1 x_N ]   (the equation is G m = d)

G^T G = [ N, Σ_i x_i ; Σ_i x_i, Σ_i x_i^2 ]

det(G^T G) = N Σ_i x_i^2 − (Σ_i x_i)^2

G^T G is singular when its determinant is zero.
7. det(G^T G) = N Σ_i x_i^2 − (Σ_i x_i)^2 = 0 when:

- N = 1, only one measurement (x,d):
  N Σ_i x_i^2 − (Σ_i x_i)^2 = x^2 − x^2 = 0
  You can't fit a straight line to only one point.
- N > 1, but all data measured at the same x:
  N Σ_i x_i^2 − (Σ_i x_i)^2 = N^2 x^2 − N^2 x^2 = 0
  Measuring the same point over and over doesn't help.
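To see this numerically, here is a minimal MATLAB sketch (the variable names are mine, not from the lecture) that builds G for a straight-line fit and shows that det(G^T G) vanishes when every measurement is made at the same x:

x = [2; 2; 2; 2];            % N > 1, but all data at the same x
G = [ones(length(x),1), x];  % straight-line fit: d_i = m_1 + m_2 * x_i
det(G'*G)                    % prints 0: G'G is singular
x = [1; 2; 3; 4];            % data spread over several x values
G = [ones(length(x),1), x];
det(G'*G)                    % prints 20, nonzero: G'G is invertible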
8. Another example: sums and differences

N_s sums, s_i, and N_d differences, d_i, of two unknowns m_1 and m_2:

data vector: [s_1, s_2, ..., s_Ns, d_1, d_2, ..., d_Nd]^T

G = [ 1 1 ; 1 1 ; ... ; 1 −1 ; 1 −1 ]   (N_s rows of [1 1], then N_d rows of [1 −1])

G^T G = [ N_s + N_d, N_s − N_d ; N_s − N_d, N_s + N_d ]

det(G^T G) = (N_s + N_d)^2 − (N_s − N_d)^2
           = N_s^2 + 2 N_s N_d + N_d^2 − N_s^2 + 2 N_s N_d − N_d^2
           = 4 N_s N_d

This is zero when N_s = 0 or N_d = 0, that is, when there are only measurements of one kind.
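A quick MATLAB check of the 4 N_s N_d result (a sketch with made-up counts of my choosing):

Ns = 3; Nd = 2;
G = [ones(Ns,2); [ones(Nd,1), -ones(Nd,1)]];  % Ns rows of [1 1], Nd rows of [1 -1]
det(G'*G)        % prints 24 = 4*Ns*Nd
G = ones(Ns,2);  % sums only, i.e. Nd = 0
det(G'*G)        % prints 0: singular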
9. This sort of missing measurement might be difficult to recognize in a complicated problem, but it happens all the time.
10. Example: Tomography

11. In this method, you try to plaster the subject with X-ray beams made at every possible position and direction, but you can easily wind up missing some small region.

[Figure: tomographic ray coverage with a gap; no data coverage in the missed region]
12. What to do?

- Introduce prior information: assumptions about the behavior of the unknowns that fill in the data gaps.
13. Examples of Prior Information

The unknowns:
- are close to some already-known value (e.g., the density of the mantle is close to 3000 kg/m^3)
- vary smoothly with time or with geographical position (e.g., ocean currents have length scales of tens of km)
- obey some physical law embodied in a PDE (e.g., water is incompressible, and thus its velocity satisfies div(v) = 0)
14. Are you only fooling yourself?

- It depends: are your assumptions good ones?
15. Application of the Maximum Likelihood Method to this problem

So, let's have a foray into the world of probability.
16. Overall Strategy

1. Represent the observed data as a probability distribution.
2. Represent prior information as a probability distribution.
3. Represent the relationship between data and model parameters as a probability distribution.
4. Combine the three distributions in a way that embodies combining the information that they contain.
5. Apply maximum likelihood to the combined distribution.
17. How to combine distributions in a way that embodies combining the information that they contain

- Short answer: multiply them.
- But let's step through a more well-reasoned analysis of why we should do that.

[Figure: two distributions p1(x) and p2(x), and their combination pT(x)]
18. How to quantify the information in a distribution p(x)?

Information compared to what? Compared to a distribution pN(x) that represents the state of complete ignorance. Example: pN(x) = a uniform distribution. The information content should be a scalar quantity, Q.
19. Q = ∫ ln[ p(x)/pN(x) ] p(x) dx

Q is the expected value of ln[ p(x)/pN(x) ].

Properties:
- Q = 0 when p(x) = pN(x)
- Q ≥ 0 always (since the limit of p·ln(p) as p→0 is 0)
- Q is invariant under a change of variables x→y
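Here is a small MATLAB sketch (my own illustration, not from the lecture) that approximates Q on a grid, for a normal p(x) measured against a uniform pN(x):

dx = 0.01; x = (-10:dx:10)';
p  = exp(-0.5*(x-1).^2) / sqrt(2*pi);  % normal distribution, mean 1, variance 1
pN = ones(size(x)) / 20;               % uniform distribution on [-10,10]
Q  = sum( log(p./pN) .* p ) * dx       % about 1.58; would be 0 if p equaled pN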
20. Combining distributions pA(x) and pB(x)

Desired properties of the combination:
- pA(x) combined with pB(x) is the same as pB(x) combined with pA(x)
- [pA(x) combined with pB(x)] combined with pC(x) is the same as pA(x) combined with [pB(x) combined with pC(x)]
- Q of pA(x) combined with pN(x) = QA
21. pAB(x) = pA(x) pB(x) / pN(x)

When pN(x) is the uniform distribution, combining is just multiplying.

But note that for points on the surface of a sphere, the null distribution p(θ,φ), where θ is latitude and φ is longitude, would not be uniform, but rather proportional to sin(θ).
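A quick MATLAB sketch (my own illustration, assuming a uniform pN so that combining is just multiplying) showing that the product of two normal distributions, renormalized, is a narrower and hence more informative combined distribution:

dx = 0.001; x = (-5:dx:5)';
pA  = exp(-0.5*(x-0).^2) / sqrt(2*pi);  % normal, mean 0, variance 1
pB  = exp(-0.5*(x-1).^2) / sqrt(2*pi);  % normal, mean 1, variance 1
pAB = pA .* pB;                         % combine: multiply (pN uniform)
pAB = pAB / (sum(pAB)*dx);              % renormalize to unit area
[~, imax] = max(pAB);
x(imax)                                 % peak at 0.5, between the two means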
22. Overall Strategy, step 1: Represent the observed data as a Normal probability distribution

pA(d) ∝ exp[ −½ (d − dobs)^T Cd^-1 (d − dobs) ]

In the absence of any other information, the best estimate of the mean of the data is the observed data itself. Cd is the prior covariance of the data. (I don't feel like typing the normalization.)
23. Overall Strategy, step 2: Represent prior information as a Normal probability distribution

pA(m) ∝ exp[ −½ (m − mA)^T Cm^-1 (m − mA) ]

mA is the prior estimate of the model: your best guess as to what it would be, in the absence of any observations. Cm, the prior covariance of the model, quantifies how good you think your prior estimate is.
24. Example: one observation, dobs = 0.8 with σd = 0.4, and one model parameter with mA = 1.0 and σm = 1.25.

25. [Figure: pA(d), peaked at dobs = 0.8, and pA(m), peaked at mA = 1, each plotted on the interval 0 to 2]
26. Overall Strategy, step 3: Represent the relationship between data and model parameters as a probability distribution

pT(d,m) ∝ exp[ −½ (d − Gm)^T CG^-1 (d − Gm) ]

This is a linear theory, Gm = d, relating the data, d, to the model parameters, m. CG, the prior covariance of the theory, quantifies how good you think your linear theory is.
27. Example theory: d = m, but only accurate to ±0.2.

28. [Figure: pT(d,m), concentrated along the line d = m, plotted for d and m from 0 to 2]
29. Overall Strategy, step 4: Combine the three distributions in a way that embodies combining the information that they contain

p(d,m) = pA(d) pA(m) pT(d,m)
       ∝ exp{ −½ [ (d − dobs)^T Cd^-1 (d − dobs) + (m − mA)^T Cm^-1 (m − mA) + (d − Gm)^T CG^-1 (d − Gm) ] }

A bit of a mess, but it can be simplified...

30. [Figure: the combined distribution p(d,m) = pA(d) pA(m) pT(d,m), plotted for d and m from 0 to 2]
31. Overall Strategy, step 5: Apply maximum likelihood to the combined distribution, p(d,m) = pA(d) pA(m) pT(d,m)

There are two distinct ways we could do this (a numerical sketch follows the figures below):
- Find the (d,m) combination that maximizes the joint probability distribution, p(d,m).
- Find the m that maximizes the individual probability distribution, p(m) = ∫ p(d,m) dd.

These do not necessarily give the same value for m.
32. [Figure: the maximum likelihood point of the joint distribution p(d,m) = pA(d) pA(m) pT(d,m), located at predicted data dpre and model estimate mest]
33. [Figure: the maximum likelihood point of the marginal distribution p(m) = ∫ p(d,m) dd, located at the model estimate mest]
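To make the two recipes concrete, here is a MATLAB sketch (my own construction, reusing the example numbers from slides 24 and 27 and treating 0.4, 1.25, and 0.2 as standard deviations) that evaluates p(d,m) on a grid, then locates both the joint maximum and the maximum of the marginal p(m). For this all-Gaussian example the two agree; in general they need not.

dx = 0.001; d = (0:dx:2)'; m = (0:dx:2)';
[Dg,Mg] = meshgrid(d,m);        % rows index m, columns index d
dobs = 0.8; sd = 0.4; mA = 1.0; sm = 1.25; sg = 0.2;
p = exp( -0.5*( (Dg-dobs).^2/sd^2 + (Mg-mA).^2/sm^2 + (Dg-Mg).^2/sg^2 ) );
[~,k] = max(p(:)); [im,id] = ind2sub(size(p),k);
[d(id), m(im)]                  % joint maximum: dpre ≈ 0.818, mest ≈ 0.823
pm = sum(p,2);                  % marginal p(m): integrate p(d,m) over d
[~,im2] = max(pm);
m(im2)                          % marginal maximum: mest ≈ 0.823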
34. Special case of an exact theory

In the limit CG → 0:
- exp[ −½ (d − Gm)^T CG^-1 (d − Gm) ] → δ(d − Gm), the Dirac delta function, with the property ∫ f(x) δ(x − y) dx = f(y)
- so p(m) = ∫ p(d,m) dd = pA(m) ∫ pA(d) δ(d − Gm) dd = pA(m) pA(d = Gm)
- and for normal distributions:
  p(m) ∝ exp{ −½ [ (Gm − dobs)^T Cd^-1 (Gm − dobs) + (m − mA)^T Cm^-1 (m − mA) ] }
35. Special case of an exact theory

Maximizing p(m) is equivalent to minimizing

(Gm − dobs)^T Cd^-1 (Gm − dobs) + (m − mA)^T Cm^-1 (m − mA)

that is, the weighted prediction error plus the weighted distance of the model from its prior value.
36. Solution, calculated via the usual messy minimization process:

mest = mA + M (dobs − G mA)

where M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1

Don't memorize this, but be prepared to use it (e.g., in the homework).
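A minimal MATLAB sketch of this solution (my own, reusing the slide 24 numbers and treating the theory d = m as exact, so CG does not enter):

G  = 1;                       % theory d = m
Cd = 0.4^2;  Cm = 1.25^2;     % prior variances of data and model
dobs = 0.8;  mA = 1.0;
M = inv(G'*inv(Cd)*G + inv(Cm)) * G'*inv(Cd);
mest = mA + M*(dobs - G*mA)   % = 1.0 + 0.907*(-0.2), about 0.82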
37. Interesting interpretation:

mest − mA = M (dobs − G mA)

The estimated model minus its prior equals a linear connection, M, applied to the observed data minus the prediction of the prior model.
38. Special case of no prior information, Cm → ∞:

M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1 → [G^T Cd^-1 G]^-1 G^T Cd^-1

mest = mA + [G^T Cd^-1 G]^-1 G^T Cd^-1 (dobs − G mA)
     = mA + [G^T Cd^-1 G]^-1 G^T Cd^-1 dobs − [G^T Cd^-1 G]^-1 G^T Cd^-1 G mA
     = mA + [G^T Cd^-1 G]^-1 G^T Cd^-1 dobs − mA
     = [G^T Cd^-1 G]^-1 G^T Cd^-1 dobs

This recovers weighted least squares.
39. Special case of infinitely accurate prior information, Cm → 0:

M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1 → 0

mest = mA + 0 = mA

This recovers the prior value of m.
40. Special uncorrelated case: Cm = σm^2 I and Cd = σd^2 I

M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1
  = [σd^-2 G^T G + σm^-2 I]^-1 G^T σd^-2
  = [G^T G + (σd/σm)^2 I]^-1 G^T

This formula is sometimes called damped least squares, with damping factor ε = σd/σm.
41. Damped least squares makes the process of avoiding the singular matrices associated with insufficient data trivially easy: you just add ε^2 I to G^T G before computing the inverse.

42. G^T G → G^T G + ε^2 I

This process regularizes the matrix, so its inverse always exists. Its interpretation is: in the absence of relevant data, assume the model parameter has its prior value.
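A MATLAB sketch (my own illustration) applying this to the sums-only singular problem from slide 8. The undamped normal equations would fail; the damped version returns an answer that splits each sum evenly, consistent with a prior value mA = 0 pulling the undetermined difference toward zero:

G = ones(3,2);                  % three sums of m1 and m2: G'G is singular
dobs = [1.1; 0.9; 1.0];
e2 = 0.01;                      % damping: e2 plays the role of (sd/sm)^2
mest = (G'*G + e2*eye(2)) \ (G'*dobs)   % about [0.499; 0.499]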
43. Are you only fooling yourself?

- It depends: is the assumption that you know the prior value a good one?
44. Smoothness Constraints

For example, the model is smooth when its second derivative is small:

d^2 m_i / dx^2 ∝ m_{i-1} − 2 m_i + m_{i+1}

(assuming the model parameters are organized according to one spatial variable)
45. The matrix D approximates the second derivative:

D = [ 1 −2  1  0  0 ... 0 ;
      0  1 −2  1  0 ... 0 ;
      ...
      0 ... 0  0  1 −2  1 ]

d^2 m / dx^2 ∝ D m
46. Choosing a smooth solution is thus equivalent to minimizing

(Dm)^T (Dm) = m^T (D^T D) m

Comparing this to the (m − mA)^T Cm^-1 (m − mA) minimization implied by the general solution

mest = mA + M (dobs − G mA),  where M = [G^T Cd^-1 G + Cm^-1]^-1 G^T Cd^-1

indicates that, to implement smoothness, we should make the choices

mA = 0  and  Cm^-1 = D^T D

(see the sketch below).
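A closing MATLAB sketch (my own construction) of these choices: build D, set mA = 0, and use a weighted D^T D in place of Cm^-1 (the weight e2 and the example numbers are mine). Five model parameters lie along a line, with data at positions 1, 2, and 5 only, so the smoothness constraint must fill the gap:

Nm = 5;
D = diag(ones(Nm-1,1),-1) - 2*diag(ones(Nm,1),0) + diag(ones(Nm-1,1),1);
D = D(2:Nm-1,:);                % keep interior rows: each is [... 1 -2 1 ...]
G = zeros(3,Nm); G(1,1) = 1; G(2,2) = 1; G(3,5) = 1;  % data at parameters 1, 2, 5
dobs = [1; 2; 5];
e2 = 0.1;                       % weight on the smoothness constraint
mest = (G'*G + e2*(D'*D)) \ (G'*dobs)   % returns [1;2;3;4;5]: a straight-line trend fills the gap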