Title: Conditional Entropy


1
Conditional Entropy
  • C27_B251, MSCS building, UNE, Armidale, NSW,
    Australia, on Friday, 23rd February 2007, from
    12 to 1pm.
  • By Dr. Charles R. Watson, School of Maths, Stats
    and Computer Science, University of New England,
    Armidale NSW 2351. URL:
    http://mcs.une.edu.au/cwatson7/I/ConditionalEntropy.html

2
Overview
  • Definitions: Entropy, Redundancy
  • Examples: text, DNA, proteins
  • Demonstrations:
    • Stream cipher
    • Tandem repeats
    • Splicing sites
  • Discussion: high-impact, cross-disciplinary
    applications

3
Information Entropy
  • Shannon defines entropy in terms of a discrete
    random variable X, with possible states (or
    outcomes) x_1, ..., x_n, as

$$E = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where $p_i = P(X = x_i)$ is the probability of the
ith outcome of X.
Here, the outcomes of X are the substrings of the
sequence of a given length, the frame size.
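
As a rough illustration of the definition above, the following Python sketch estimates E from the observed frequencies of the substrings of a sequence. The function name `frame_entropy` and the use of overlapping (sliding-window) frames are assumptions made for this example; the talk does not say how frames are extracted.

```python
from collections import Counter
from math import log2


def frame_entropy(sequence, frame_size):
    """Estimate Shannon entropy (in bits) of the substrings of one length.

    Assumption: frames overlap, i.e. a window of `frame_size` symbols
    slides along the sequence one position at a time.
    """
    frames = [sequence[i:i + frame_size]
              for i in range(len(sequence) - frame_size + 1)]
    if not frames:
        raise ValueError("frame_size is longer than the sequence")
    counts = Counter(frames)
    total = len(frames)
    # E = -sum(p_i * log2(p_i)), written as sum(-p_i * log2(p_i)).
    return sum(-(c / total) * log2(c / total) for c in counts.values())
```

A sequence made of one repeated symbol gives 0 bits at any frame size, and the estimate rises as the observed frames become more evenly distributed.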
4
Conditional Entropy
Conditional probability is written P(Y|X), and
is read "the probability of Y, given X".
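
A sketch of the standard textbook definitions, which may differ in notation from the form used in the talk. The symbol E follows the other slides; x and y range over the outcomes of X and Y, and base-2 logarithms are assumed so that the result is in bits.

```latex
\[
  P(Y = y \mid X = x) = \frac{P(X = x,\; Y = y)}{P(X = x)}
\]
\[
  E(Y \mid X) = -\sum_{x} \sum_{y} P(x, y) \,\log_2 P(y \mid x)
\]
```

When X determines Y completely, every P(y|x) is 0 or 1 and E(Y|X) is 0; when X and Y are independent, E(Y|X) equals E(Y).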
5
Example
  • Imagine tossing a coin with two sides.
  • The entropy of one toss is $\log_2 2 = 1$ bit.
  • The entropy of a toss, given that the coin is a
    double header, is $\log_2 1 = 0$ bits.
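
The two coin values can be checked numerically. This is a minimal sketch; the helper name `entropy` and the two probability lists are chosen here for illustration, with the first list assuming a fair coin.

```python
from math import log2


def entropy(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return sum(-p * log2(p) for p in probabilities if p > 0)


fair_coin = entropy([0.5, 0.5])      # heads and tails equally likely
double_header = entropy([1.0, 0.0])  # heads is certain, tails impossible
```

Here `fair_coin` evaluates to 1 bit and `double_header` to 0 bits, matching the two bullets above.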

6
Maximal Entropy
  • If $p_i = 1/N$ for $i = 1, \dots, N$, then
    $E = \log_2 N$.
  • For any skewed (non-uniform) distribution,
    $E < \log_2 N$.
  • Maximum entropy for random data increases
    linearly with frame size.
  • Entropy measures the randomness of the
    probability distribution
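
The linear growth with frame size follows directly from the uniform case. In the sketch below, A (the alphabet size) and k (the frame size) are symbols introduced here for illustration; there are then N = A^k possible frames.

```latex
\[
  E_{\max} = \log_2 N = \log_2 A^{k} = k \,\log_2 A
\]
```

For example, with the four DNA bases (A = 4) the maximum is 2k bits, so each extra symbol in the frame adds 2 bits. An estimate taken from a finite sequence cannot keep growing this way, since a sequence of length L contains at most L - k + 1 frames.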

7
Entropy Comparison
8
Redundancy = low Entropy
  • Occurs in natural language text, computer files
    and non-coding DNA.
  • Can be used to:
    • detect plagiarism,
    • reverse engineer software,
    • decrypt secret codes, and
    • identify individuals by their unique DNA
      fingerprint.

10
Entropy of the human proteome as a function of frame size
The human proteome, with 13,151,137 amino acids,
has low redundancy, that is, high entropy.
12
Entropy of Aesop's Fables
13
Plagiarism detection
  • The candidate thesis/assignment is loaded into
    the dictionary.
  • Then a linear-complexity search is performed
    over all known theses, web index searches and
    hyperlink-crawled pages.
  • The algorithm provides efficient detection of
    verbatim plagiarism with an objective percentage
    measure of original content.
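
The steps above can be sketched as follows. This is only an illustration under stated assumptions: the function name `original_percentage`, the `frame_size` parameter, and the use of a Python dict keyed on runs of words are all choices made here, not the data structure or scoring used in the talk.

```python
def original_percentage(candidate, known_texts, frame_size=5):
    """Percentage of candidate words not covered by any verbatim match.

    Assumption: a "match" is a run of `frame_size` consecutive words
    that also occurs in one of the known texts.
    """
    words = candidate.split()
    n = len(words)
    if n < frame_size:
        return 100.0

    # Load the candidate into the dictionary: frame -> start positions.
    index = {}
    for i in range(n - frame_size + 1):
        index.setdefault(tuple(words[i:i + frame_size]), []).append(i)

    # One pass over each known text, with a dictionary lookup per frame.
    copied = [False] * n
    for text in known_texts:
        other = text.split()
        for j in range(len(other) - frame_size + 1):
            for i in index.get(tuple(other[j:j + frame_size]), ()):
                for k in range(i, i + frame_size):
                    copied[k] = True

    return 100.0 * copied.count(False) / n
```

Each known text is scanned once, so the work grows roughly linearly with the amount of text searched, apart from frames that recur many times in the candidate.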

14
Information structure deduction
  • Efficient feature comparison without comparing
    large sequences.
  • The relative location of repeats can match large
    structures.
  • Verbatim plagiarism can be efficiently detected
    by comparing the relative location of spaces and
    common letters such as 'e' and 't'.
  • This comparison tolerates text inserted or
    deleted after the original was copied, because
    the relative locations of patterns elsewhere
    remain undisturbed.
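
One way to picture the comparison of relative locations is sketched below. The helper names `gaps` and `longest_shared_gap_run` are hypothetical, and the quadratic longest-common-run search is a simple stand-in chosen for clarity, not the efficient method described in the talk.

```python
def gaps(text, symbol=" "):
    """Distances between successive occurrences of `symbol` in `text`."""
    positions = [i for i, ch in enumerate(text) if ch == symbol]
    return [b - a for a, b in zip(positions, positions[1:])]


def longest_shared_gap_run(text_a, text_b, symbol=" "):
    """Length of the longest run of consecutive gaps common to both texts."""
    a, b = gaps(text_a, symbol), gaps(text_b, symbol)
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                # Extend the run that ended at the previous gap of each text.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best
```

Because only the distances between occurrences are compared, text added before or after a copied passage shifts the absolute positions but leaves the shared run of gaps intact.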

15
DNA fingerprinting
  • Tandem repeats are easily identified because
    the pattern recurs at a constant relative
    location, equal to the length of the repeated
    unit.
  • For example, TCATCATCATCATCATC matches the
    reference pattern CATC at locations 1, 4, 7, 10
    and 13 (counting from 0), with the relative
    locations 3, 3, 3, 3: the length of the
    repeated unit CAT.
  • These patterns are used in CODIS-13.
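
The example on this slide can be reproduced with a few lines of Python. The function names `pattern_locations` and `relative_locations` are made up for this sketch, and locations are counted from 0.

```python
def pattern_locations(sequence, pattern):
    """Start positions (from 0) of every occurrence, overlaps included."""
    locations = []
    start = sequence.find(pattern)
    while start != -1:
        locations.append(start)
        # Resume one position later so overlapping matches are found.
        start = sequence.find(pattern, start + 1)
    return locations


def relative_locations(locations):
    """Distances between successive occurrences."""
    return [b - a for a, b in zip(locations, locations[1:])]


hits = pattern_locations("TCATCATCATCATCATC", "CATC")
steps = relative_locations(hits)
```

Here `hits` is `[1, 4, 7, 10, 13]` and `steps` is `[3, 3, 3, 3]`; a constant step is the signature of a tandem repeat.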

16
Conclusion
  • An efficient algorithm
  • Applicable to any type of data
  • Learning (dictionary construction) is
    unsupervised
  • This is a work in progress, evolving as
    problems arise
  • The challenge for us now is to ask the right
    questions

17
References
  • Bayes, T. 1763, An Essay towards solving a
    Problem in the Doctrine of Chances,
    Philosophical Transactions of the Royal Society
    of London, vol. 53, pp. 370-418.
  • Fredkin, E. 1960, Trie Memory, Communications
    of the ACM, vol. 3, no. 9, pp. 490-499,
    September 1960.
  • Knuth, D. 1997, The Art of Computer Programming,
    Volume 3: Sorting and Searching, Third Edition,
    Addison-Wesley.
  • Shannon, C. E. 1948, A mathematical theory of
    communication, Bell System Technical Journal,
    vol. 27, pp. 379-423 and 623-656, July and
    October 1948.
  • Wikipedia. Retrieved regularly from
    http://en.wikipedia.org/wiki/
18
Demonstrations
  • Stream cipher: inherent entropy
  • Tandem repeats in non-coding DNA
  • Splicing sites: conserved binding patterns for
    noncoding RNA removal
  • Homologous DNA repairing genes.
  • mtDNA homologs of Neanderthal Hs