Title: Cancer Biomarkers, Insufficient Data, Multiple Explanations
1 Cancer Biomarkers, Insufficient Data, Multiple Explanations
2
- Do not trust everything you read.
3 Side Topic: Biomarkers from mass spec data
- SELDI: Surface Enhanced Laser Desorption/Ionization. It combines chromatography with TOF-MS.
- Advantages of SELDI technology:
- Uses small amounts (< 1 µl / 500-1000 cells) of sample (biopsies, microdissected tissue).
- Quickly obtains protein maps from multiple samples under the same conditions.
- Ideal for discovering biomarkers quickly.
4 ProteinChip Arrays
5 SELDI Process
(Figure copied from http://www.bmskorea.co.kr/new01_21-1.htm)
6 Protein mapping
7 Biomarker Discovery
- Markers can be easily found by comparing protein maps.
- SELDI is faster and more reproducible than 2D PAGE.
- It has been used to discover protein biomarkers of diseases such as ovarian, breast, prostate, and bladder cancers.
(Figure: normal vs. cancer protein maps; modified from the Ciphergen web site.)
8 Inferring biomarkers
- Conrads, Zhou, Petricoin, Liotta, and Veenstra (Cancer diagnosis using proteomic patterns, Expert Rev. Mol. Diagn. 3(4) (2003) 411-420): using a genetic algorithm, they found 10 proteins (out of 15,000 peaks!) as biomarkers that separate about 100 spectra.
- Questions:
- Are they the only separator?
- Might they be by-products of other, more important proteins?
- Could some of them be noise or pollutants?
9 Multiple Decision List Experiment (research work with Jian Liu)
- We answer the first question; the last two are not our business.
- Decision list (a minimal evaluation sketch follows this slide):
- If M1 then O1
- else if M2 then O2
- ...
- else if Mj-1 then Oj-1
- else Oj
- Each term Mi is a monomial of Boolean variables. Each Oi denotes positive or negative, for classification.
- k-decision list: the Mi's are monomials of size at most k. It is PAC-learnable (Rivest, 1987).
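To make the decision-list representation concrete, here is a minimal Python sketch of how such a list classifies a sample described by Boolean variables; the peak names, monomials, and labels are hypothetical illustrations, not the terms found in the experiment.

```python
# Minimal sketch: evaluating a k-decision list over Boolean features.
# Peak names, monomials, and labels below are hypothetical, for illustration only.

def evaluate_decision_list(decision_list, default_label, features):
    """Return the output Oi of the first monomial Mi satisfied by `features`.

    decision_list : list of (monomial, label) pairs; a monomial is a list of
                    (variable_name, required_value) literals (a conjunction).
    default_label : the final output Oj, returned when no monomial fires.
    features      : dict mapping variable names to booleans.
    """
    for monomial, label in decision_list:
        if all(features.get(name, False) == value for name, value in monomial):
            return label
    return default_label

# Example 3-decision list (each monomial has at most 3 literals).
dl = [
    ([("peak_A", True), ("peak_B", False)], "cancer"),
    ([("peak_C", True)], "normal"),
    ([("peak_D", True), ("peak_E", True), ("peak_F", False)], "cancer"),
]

sample = {"peak_A": True, "peak_B": False, "peak_C": True}
print(evaluate_decision_list(dl, "normal", sample))  # -> cancer (the first term fires)
```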
10 Learning Algorithm for k-DL
- k-decision list algorithm:
- Stage i: find a term that
- covers only the yet-uncovered cancer (or normal) set, and
- whose cover is largest.
- Add the term to the decision list as the i-th term, with Oi = cancer (or normal).
- Repeat until, say, the cancer set is fully covered. Then the last Oj = normal.
- Multiple k-decision lists: at each stage, find the many terms that achieve the same coverage. (A sketch of this greedy procedure follows.)
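Below is a minimal sketch of the greedy cover procedure described above, restricted to k = 1 (single-literal terms) to keep it short. The sample format and the choice to keep every equally good literal at each stage are my assumptions about how the multiple-decision-list variant could be organized, not the authors' actual implementation.

```python
# Minimal sketch of the greedy k-DL learner above, restricted to k = 1
# (single-literal terms) for brevity. Samples are (features, label) pairs,
# where `features` is a dict of booleans; all names are hypothetical.
from itertools import product

def learn_multiple_1dl(samples, target="cancer", default="normal"):
    """Greedy cover: at each stage keep every literal that covers the most
    yet-uncovered `target` samples while covering no non-target samples."""
    uncovered = [s for s in samples if s[1] == target]
    others = [s for s in samples if s[1] != target]
    names = sorted({f for feats, _ in samples for f in feats})
    stages = []  # stage i: (list of equally good literals, output label Oi)

    while uncovered:
        best_cover, best_literals = 0, []
        for name, value in product(names, (True, False)):
            # the term must not cover any non-target (e.g. normal) sample
            if any(feats.get(name, False) == value for feats, _ in others):
                continue
            cover = sum(1 for feats, _ in uncovered
                        if feats.get(name, False) == value)
            if cover > best_cover:
                best_cover, best_literals = cover, [(name, value)]
            elif cover == best_cover and cover > 0:
                best_literals.append((name, value))
        if not best_literals:
            break  # no consistent single literal left; a larger k would be needed
        stages.append((best_literals, target))   # keep ALL equally good terms
        chosen = best_literals[0]                # remove what one of them covers
        uncovered = [s for s in uncovered
                     if s[0].get(chosen[0], False) != chosen[1]]
    return stages, default                       # final default output Oj
```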
11 Thousands of Solutions

  proteins used   nodes   training   # of equivalent terms   testing accuracy
  in 3-DL                 accuracy   in each DL node         max       min
  3               4       (43,79)    (1,3,7,7)               (45/81)   (44/81)
  5               5       (42,81)    (1,5,1,24,24)           (44/81)   (43/81)
  8               5       (43,81)    (1,9,1,29,92)           (43/81)   (42/81)
  10              5       (43,81)    (1,9,3,46,175)          (43/81)   (42/81)
  15              6       (45,80)    (1,10,5,344,198,575)    (46/81)   (43/71)
  20              5       (45,81)    (1,10,5,121,833)        (45/81)   (43/73)
  25              4       (45,81)    (1,14,2,1350)           (46/81)   (44/80)
  30              4       (45,81)    (1,18,2,2047)           (46/81)   (44/80)
  -------------------------------- 2-DL --------------------------------
  50              4       (45/81)    (1,4,8,703)             (46/81)   (42/78)
  100             4       (45/81)    (1,4,60,2556)           (46/81)   (42/76)

- WCX I: performance of multiple decision lists. Notation (x/y) = (normal/cancer).
- Example: 1 × 4 × 60 × 2556 > ½ million 2-decision lists (involving 7.8 proteins each on average) are perfect for the (45/81) training spectra, but most fail to classify all 45+81+46+81 spectra perfectly. (On average such a random hypothesis cuts off 3 healthy women's ovaries and leaves one cancer undetected.) Why should we trust the 10 particular proteins that are perfect for 216 spectra? (See the sketch after this slide.)
- We need a new learning theory to deal with a small amount of data and too many relevant attributes.
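As a quick sanity check of the counting, assuming the per-node equivalent-term counts multiply (which is how the 1 × 4 × 60 × 2556 example reads), a few of the table's rows can be tallied with a short script; the row labels are simply my shorthand for the corresponding table entries.

```python
# Minimal sketch: if the equivalent-term counts at each DL node combine
# multiplicatively, the number of equivalent decision lists for a table row
# is their product. Row labels below are shorthand for the table above.
from math import prod

equivalent_terms = {
    "3-DL, 30 proteins": (1, 18, 2, 2047),
    "2-DL, 100 proteins": (1, 4, 60, 2556),
}

for row, counts in equivalent_terms.items():
    print(f"{row}: {prod(counts):,} equivalent decision lists")
# 2-DL, 100 proteins -> 613,440, i.e. more than half a million, as quoted above.
```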
12 Excursion into Learning Theory
- OED: "Induction is the process of inferring a general law or principle from the observations of particular instances."
- Science is induction from observed data to physical laws.
- But how?
13 Occam's Razor
- Commonly attributed to William of Ockham (1290-1349). It states: "Entities should not be multiplied beyond necessity."
- Commonly explained as: when you have choices, choose the simplest theory.
- Bertrand Russell: "It is vain to do with more what can be done with fewer."
- Newton (Principia): "Natura enim simplex est, et rerum causis superfluis non luxuriat." (Nature is simple, and does not luxuriate in superfluous causes.)
14 Example: Inferring a DFA
- A DFA accepts 1, 111, 11111, 1111111 and rejects 11, 1111, 111111. What is it?
- There are actually infinitely many DFAs satisfying these data.
- The first DFA makes a nontrivial inductive inference; the second does not. (A sketch of such a two-state DFA follows.)
(Figure: two DFAs consistent with the data, with transitions labeled 1.)
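The original figure is not recoverable here, but the natural "simple" hypothesis consistent with the data is a two-state DFA that accepts exactly the strings with an odd number of 1's. The following sketch is my illustration of that machine, not a reproduction of the slide's diagrams.

```python
# Minimal sketch: a two-state DFA accepting strings of 1's of odd length,
# the kind of nontrivial inductive inference the slide refers to.
def odd_ones_dfa(s):
    state = 0                      # state 0 = even number of 1's seen (reject)
    for ch in s:
        if ch == "1":
            state = 1 - state      # flip parity on each 1
    return state == 1              # accept iff the parity is odd

accepted = ["1", "111", "11111", "1111111"]
rejected = ["11", "1111", "111111"]
assert all(odd_ones_dfa(s) for s in accepted)
assert not any(odd_ones_dfa(s) for s in rejected)
print("The two-state parity DFA is consistent with all seven examples.")
```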
15 Example: History of Science
- Maxwell's (1831-1879) equations say that (a) an oscillating magnetic field gives rise to an oscillating electric field, and (b) an oscillating electric field gives rise to an oscillating magnetic field. Item (a) was known from M. Faraday's experiments. However, (b) is a theoretical inference by Maxwell, guided by his aesthetic appreciation of simplicity. The existence of such electromagnetic waves was demonstrated by the experiments of H. Hertz in 1888, 8 years after Maxwell's death, and this opened the new field of radio communication. Maxwell's theory is even relativistically invariant. This was long before Einstein's special relativity. As a matter of fact, it is even likely that Maxwell's theory influenced Einstein's 1905 paper on relativity, which was actually titled "On the electrodynamics of moving bodies".
- J. Kemeny, a former assistant to Einstein, explains the transition from the special theory to the general theory of relativity: "At the time, there were no new facts that failed to be explained by the special theory of relativity. Einstein was purely motivated by his conviction that the special theory was not the simplest theory which can explain all the observed facts. Reducing the number of variables obviously simplifies a theory. By the requirement of general covariance Einstein succeeded in replacing the previous 'gravitational mass' and 'inertial mass' by a single concept."
- Double helix vs. triple helix: 1953, Watson & Crick.
16 Counter Example
- Once upon a time, there was a little girl named Emma. Emma had never eaten a banana, nor had she ever been on a train. One day she had to journey from New York to Pittsburgh by train. To relieve Emma's anxiety, her mother gave her a large bag of bananas. At Emma's first bite of her banana, the train plunged into a tunnel. At the second bite, the train broke into daylight again. At the third bite, lo! into a tunnel; at the fourth bite, la! into daylight again. And so on all the way to Pittsburgh. Emma, being a bright little girl, told her grandpa at the station: "Every odd bite of a banana makes you blind; every even bite puts things right again." (N.R. Hanson, Perception and Discovery)
17 PAC Learning (L. Valiant, 1983)
- Fix a distribution over the sample space (P(v) for each v in the sample space). A concept class C is PAC-learnable (probably approximately correct learnable) iff there exists a learning algorithm A such that, for each f in C and each ε (0 < ε < 1), algorithm A halts in a polynomial number of steps and examples, and outputs a concept h in C which satisfies the following: with probability at least 1 - ε,
- Σ_{v: f(v) ≠ h(v)} P(v) < ε  (a small numeric illustration follows).
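To illustrate the error measure Σ_{v: f(v) ≠ h(v)} P(v), here is a tiny sketch on a made-up four-point distribution with a hypothetical target f and hypothesis h; none of these values come from the slides.

```python
# Minimal sketch: the PAC error of hypothesis h w.r.t. target f is the total
# probability mass of the points where they disagree. Toy values only.
P = {"v1": 0.4, "v2": 0.3, "v3": 0.2, "v4": 0.1}   # fixed distribution
f = {"v1": 1, "v2": 0, "v3": 1, "v4": 0}            # target concept
h = {"v1": 1, "v2": 0, "v3": 0, "v4": 0}            # learned hypothesis

error = sum(p for v, p in P.items() if f[v] != h[v])   # Σ_{f(v)≠h(v)} P(v)
print(error)  # 0.2, so h is an ε-approximation of f for any ε > 0.2
```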
18 Simplicity means understanding
- We will prove that, given a set of positive and negative data, any consistent concept of size "reasonably" shorter than the size of the data is an "approximately" correct concept with high probability. That is, if one finds a shorter representation of the data, then one learns. The shorter the conjecture is, the more efficiently it explains the data, and hence the more precise the future prediction.
- Let α < 1 and β ≥ 1 be constants, m be the number of examples, and s be the length (in number of bits) of the smallest concept in C consistent with the examples. An Occam algorithm is a polynomial-time algorithm which finds a hypothesis h in C consistent with the examples and satisfying
- K(h) ≤ s^β m^α.
19 Occam's Razor Theorem
- Theorem. A concept class C is polynomially PAC-learnable if there is an Occam algorithm for it. That is, with probability > 1 - ε, Σ_{v: f(v) ≠ h(v)} P(v) < ε.
- Proof. Fix an error tolerance ε (0 < ε < 1). Choose m such that
- m ≥ max{ (2s^β/ε)^{1/(1-α)}, (2/ε) log(1/ε) }.
- This is polynomial in s and 1/ε. Let m be as above. Let S be a set of r concepts, and let f be one of them.
- Claim. The probability that some concept h in S satisfies P(f Δ h) ≥ ε and is consistent with m independent examples of f is less than (1-ε)^m r.
- Proof of Claim. Let E_h be the event that hypothesis h agrees with all m examples of f. If P(h Δ f) ≥ ε, then h is a bad hypothesis; that is, h and f disagree with probability at least ε on a random example. The set of bad hypotheses is denoted by B. Since the m examples of f are independent,
- P(E_h) ≤ (1-ε)^m.
- Since there are at most r bad hypotheses,
- P(∪_{h in B} E_h) ≤ (1-ε)^m r.
20 Proof of the theorem, continued
- The postulated Occam algorithm finds a hypothesis of Kolmogorov complexity at most s^β m^α. The number r of hypotheses of this complexity satisfies
- log r ≤ s^β m^α.
- By the assumption on m, r ≤ (1-ε)^{-m/2} (use ε < -log(1-ε) < ε/(1-ε) for 0 < ε < 1). Using the claim, the probability of producing a hypothesis with error larger than ε is less than
- (1-ε)^m r ≤ (1-ε)^{m/2} < ε.
- The last inequality follows by substituting m. (A numeric illustration of the bound follows.)
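For a feel for the numbers, here is a small sketch that evaluates the sample bound m ≥ max{(2s^β/ε)^{1/(1-α)}, (2/ε) log(1/ε)} for some made-up parameter values and checks, in log space, that the failure probability (1-ε)^m · r from the proof indeed falls below ε; the parameter values are assumptions for illustration only.

```python
# Minimal sketch: compute the sample bound from the Occam Razor Theorem,
#   m >= max{ (2 s^beta / eps)^(1/(1-alpha)), (2/eps) * log(1/eps) },
# for illustrative (made-up) parameters, and verify numerically (in log space,
# base 2 here) that the failure probability (1-eps)^m * r is below eps,
# where log2(r) <= s^beta * m^alpha counts the short hypotheses.
import math

def occam_sample_bound(s, alpha, beta, eps):
    m1 = (2 * s**beta / eps) ** (1 / (1 - alpha))
    m2 = (2 / eps) * math.log(1 / eps)
    return math.ceil(max(m1, m2))

s, alpha, beta, eps = 100, 0.5, 1.0, 0.05       # hypothetical parameter values
m = occam_sample_bound(s, alpha, beta, eps)
log2_r = s**beta * m**alpha                      # upper bound on log2 of #(short hypotheses)
log2_failure = m * math.log2(1 - eps) + log2_r   # log2 of (1-eps)^m * r
print(f"m = {m}")                                # prints m = 16000000 for these values
print(log2_failure < math.log2(eps))             # True: failure probability < eps
```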
21 Inadequate data, too many relevant attributes?
- Data in biotechnology is often expensive or hard to get.
- PAC-learning theory, MDL, SVMs, and decision-tree algorithms all need sufficient data.
- The situation is similar for expression arrays, where too many attributes are relevant: which ones should we choose?
22 Epicurus' Multiple Explanations
- The Greek philosopher of science Epicurus (342-270 BC) proposed the Principle of Multiple Explanations: if more than one theory is consistent with the observations, keep all theories. This was 1500 years before Occam's razor!
- "There are also some things for which it is not enough to state a single cause, but several, of which one, however, is the case. Just as if you were to see the lifeless corpse of a man lying far away, it would be fitting to state all the causes of death in order that the single cause of this death may be stated. For you would not be able to establish conclusively that he died by the sword or of cold or of illness or perhaps by poison, but we know that there is something of this kind that happened to him." (Lucretius)
23 Can the two theories be integrated?
- When we do not have enough data, Epicurus says that we should simply keep all the alternative consistent hypotheses rather than select the simplest one. But how can such a philosophical idea be converted into concrete mathematics and learning algorithms?
24 A Theory of Learning with Insufficient Data
- Definition. With the PAC-learning notation, a concept class is polynomially Epicurus-learnable iff the learning algorithm always halts within time and number of examples p(f, 1/ε), for some polynomial p, with a list of hypotheses of which one is probably approximately correct.
- Definition. Let α < 1 and β ≥ 1 be constants, m be the number of examples, and s be the length (in number of bits) of the smallest concept in C consistent with the examples. An Epicurus algorithm is a polynomial-time algorithm which finds a collection of hypotheses h1, ..., hk in C such that
- they are all consistent with the examples;
- they satisfy K(hi) ≤ s^β m^α for i = 1, ..., k, where K(x) is the Kolmogorov complexity of x;
- they are mutually error-independent with respect to the true hypothesis h, that is, the events h1 Δ h, ..., hk Δ h are mutually independent, where hi Δ h is the symmetric difference of the two concepts.
25
- Theorem. A concept class C is polynomially Epicurus-learnable if there is an Epicurus algorithm outputting k hypotheses and using
- m ≥ (1/k) max{ (2s^β/ε)^{1/(1-α)}, (2/ε) log(1/ε) }
- examples, where 0 < ε < 1 is the error tolerance.
- This theorem gives a sample-size vs. learnability tradeoff (see the sketch after this slide). When k = 1, it becomes the old Occam's Razor theorem.
- Admittedly, the error-independence requirement is too strong to be practical.
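To see the tradeoff numerically, here is a small sketch that evaluates the Epicurus sample bound for a few values of k under the same made-up parameters as in the earlier Occam sketch; k = 1 reproduces the Occam bound.

```python
# Minimal sketch: the Epicurus sample bound is 1/k times the Occam bound,
#   m >= (1/k) * max{ (2 s^beta / eps)^(1/(1-alpha)), (2/eps) * log(1/eps) },
# so outputting more error-independent hypotheses allows fewer examples.
# Parameter values are hypothetical, for illustration only.
import math

def epicurus_sample_bound(s, alpha, beta, eps, k):
    occam = max((2 * s**beta / eps) ** (1 / (1 - alpha)),
                (2 / eps) * math.log(1 / eps))
    return math.ceil(occam / k)

s, alpha, beta, eps = 100, 0.5, 1.0, 0.05        # same hypothetical values as before
for k in (1, 10, 100, 1000):
    print(k, epicurus_sample_bound(s, alpha, beta, eps, k))
# k = 1 recovers the Occam bound; larger k trades more hypotheses for fewer examples.
```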
26
- Proof. Let m be as in the theorem, let C contain r concepts, let f be one of them, and let h1, ..., hk be the k error-independent hypotheses from the Epicurus algorithm.
- Claim. The probability that h1, ..., hk in C satisfy P(f Δ hi) ≥ ε and are all consistent with m independent examples of f is less than (1-ε)^{km} C(r,k), where C(r,k) is the binomial coefficient (r choose k).
- Proof of Claim. Let E(h1, ..., hk) be the event that hypotheses h1, ..., hk all agree with all m examples of f. If P(hi Δ f) ≥ ε for i = 1, ..., k, then, since the m examples of f are independent and the hi's are mutually f-error-independent,
- P(E(h1, ..., hk)) ≤ (1-ε)^{km}.
- Since there are at most C(r,k) sets of bad hypothesis choices,
- P(∪ {E(h1, ..., hk) : the hi's are bad hypotheses}) ≤ (1-ε)^{km} C(r,k).
- The postulated Epicurus algorithm finds k consistent hypotheses of Kolmogorov complexity at most s^β m^α. The number r of hypotheses of this complexity satisfies log r ≤ s^β m^α. By the assumption on m, C(r,k) ≤ (1-ε)^{-km/2}. Using the claim, the probability that all k hypotheses have error larger than ε is less than
- (1-ε)^{km} C(r,k) ≤ (1-ε)^{km/2}.
- Substituting m, we find that the right-hand side is at most ε.
27
- When there is not enough data to ensure that Occam learning converges, do Epicurus learning and leave the final selection to the experts.