
Title: GECCO Report


1
GECCO Report and Some Issues about TFBSs
  • Tak-Ming Chan
  • August 2, 2007

2
GECCO Report
  • Genetic and Evolutionary Computation Conference
    (GECCO) 2007
  • Submissions: 577 in total; accepted as full
    papers: 266; accepted as posters (abstracts): 210

3
GECCO07
  • Tutorials and Workshops (2 days)
  • Introductory: GA, GP, EDA, ...
  • Advanced: Representations, Fitness landscapes,
    Problem hardness, ...
  • Specialized: Open-source software,
    Bioinformatics, Complex Networks, ...
  • Presentations (3 days)
  • 14 tracks: GA (109/43), GP (54/27), ES/EP (21/11),
    Real-World Applications (99/48), Biological
    Applications (25/10), ...
  • All sessions run in parallel

4
GECCO07
  • Keynote Event
  • Public Debate on Complexity and Evolution
  • with Richard Dawkins (The Selfish Gene), Lewis
    Wolpert and Steve Jones (Almost Like a Whale)
  • A different view from biologists on
    evolution, constraints, and complexity
  • Audio download:
    http://www.cs.ucl.ac.uk/staff/p.bentley/evodebate.html

5
Digest of some Presentations
  • Real-World Applications
  • A robust GP solution for hedge fund stock
    selection [1]
  • Robustness for volatile and extreme scenarios
  • Fitness: mean profit over the volatility (std.)
    of the scenario
  • Genetic Algorithms
  • Simple diversity mechanisms analysis [2]
  • Different diversity mechanisms (in two (μ+1) EAs)
    are suitable for different problems (two plateau
    functions)
  • Genotype diversity is better in a multimodal
    problem
  • Phenotype diversity is better in a needle-in-a-
    haystack problem

6
Digest of some Presentations
  • Biological Applications
  • Prof. Congdon, the chair of my session, had a
    conversation with me
  • GAMI (consensus-led) addresses TFBS
    identification in a multimodal way
    (unintentionally)
  • In further work, they found that Information
    Content did not perform as well as expected on
    their datasets
  • For more details, see:
  • [1] W. Yan, C. D. Clack, Evolving Robust GP
    Solutions for Hedge Fund Stock Selection in
    Emerging Markets, Proceedings of GECCO 07,
    pp. 2234-2241
  • [2] T. Friedrich, N. Hebbinghaus, F. Neumann,
    Rigorous Analyses of Simple Diversity Mechanisms,
    Proceedings of GECCO 07, pp. 1219-1225

7
A little bit more about London
  • Tired of Big Ben, Tower Bridge, and Buckingham
    Palace?
  • Try the Wellcome Collection near UCL
  • Especially good for seeing some interesting
    things about Medicine and Genetics

8
Some Issues about TFBSs
  • Information Content (IC)
  • Similarity measures between PWMs

[Figure labels: Transcription Factor, TFBS, Gene, Transcription, Regulates Gene Expression]
9
Evaluation of IC as a Metric for TFBS
Identification [3]
  • IC was incorporated into GAMI, which originally
    employed Match Count (MC) (Congdon et al.)
  • IC was expected to be more accurate than MC,
    but it turned out not to be
  • IC missed some 100%-conserved regions while MC
    did not
  • Several possible problems with IC are discussed:
  • Background frequencies
  • Different IC scores to a motif and the reverse
    complement
  • Synonyms problem

[3] An Evaluation of Information Content as a
Metric for the Inference of Putative Conserved
Noncoding Regions in DNA Sequences Using a
Genetic Algorithms Approach, IEEE Transactions
on Computational Biology and Bioinformatics, 2006
10
Scores for MC and IC
  • Rough correlation, but different peaks

11
Background frequencies
  • IC has a very strong preference for CG-rich
    regions in the SOX21 dataset
  • The background frequencies
  • However, the areas of conservation are not always
    CG-rich

12
Background frequencies
IS an issue!!
  • IC is not suitable for a strongly biased region
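  • Li Gang raised a similar question (a 2-letter
    case, for simplicity), with background
    frequencies (0.8, 0.2):
  • Conserved column (0.9, 0.1):
    $\log(0.9/0.8) \cdot 0.9 + \log(0.1/0.2) \cdot 0.1 \approx 0.0367$
  • Random column (0.5, 0.5):
    $\log(0.5/0.8) \cdot 0.5 + \log(0.5/0.2) \cdot 0.5 \approx 0.2231$
  • IC favors the random nucleotides over the
    conserved ones!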
  • Further questions raised (by Cyrus)
  • Is it possible that promoter regions are really
    biased in real-world problems?
  • In this paper, the sequences in the datasets
    tested are 8 kb-10 kb long; what about promoter
    regions?
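
The 2-letter arithmetic above can be checked with a few lines of Python. This is a minimal sketch, assuming IC is the relative entropy of a column against the background and that the logarithm is natural (which is what reproduces the two numbers on the slide); the function name is only illustrative.

```python
import math

def information_content(column, background):
    """Relative entropy (natural log) of one motif column against the background."""
    return sum(p * math.log(p / q) for p, q in zip(column, background) if p > 0)

background = [0.8, 0.2]    # strongly biased 2-letter background
conserved = [0.9, 0.1]     # column dominated by one letter
random_like = [0.5, 0.5]   # column with no preference

print(round(information_content(conserved, background), 4))    # 0.0367
print(round(information_content(random_like, background), 4))  # 0.2231
```

The conserved column scores lower simply because it resembles the biased background, which is the effect the slide points out.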

13
Different IC scores to a motif and the reverse
complement
MAY be an issue!!
  • Highest scoring motif with IC
  • caggcaccactcactgcccc (207.92)
  • Its reverse complement (C is less frequent than
    G in SOX21): ggggcagtgagtggtgcctg (189.73)
  • Both score 117 (of a possible 120 (20 × 6)) with MC
  • Questions
  • Should the background sequences be recalculated
    for reverse complement?
  • How are the motif instances aligned according to
    this motif (see the next issue of synonyms)?
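
A small sketch of why the two strands can score differently. The reverse complement is computed exactly; the background frequencies and the scoring function below are hypothetical stand-ins (not the SOX21 values and not the paper's IC formula), chosen only to show that a background with C rarer than G rewards the C-rich strand more.

```python
import math

COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]

def perfect_match_score(seq, background):
    """Illustrative score: sum of log(1 / background frequency) over the bases."""
    return sum(math.log(1.0 / background[base]) for base in seq.lower())

motif = "caggcaccactcactgcccc"
rc = reverse_complement(motif)
print(rc)  # ggggcagtgagtggtgcctg

# Hypothetical background in which C is rarer than G (an assumption).
background = {"a": 0.30, "c": 0.15, "g": 0.25, "t": 0.30}
print(perfect_match_score(motif, background) > perfect_match_score(rc, background))  # True
```

With a strand-symmetric background (equal frequencies for C and G, and for A and T) the two scores would coincide, which is one way to read the question about recalculating the background.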

14
Synonyms problem
  • Different motifs correspond to the same IC score
  • Have a look at how the motif instances are chosen
    and aligned for IC in this work

[Figure labels: high-MC motif; lower-quality synonym]
15
Synonyms problem
  • The procedure
  • First, GAMI (Consensus-led) is performed to
    locate the best motifs based on MC
  • One motif may correspond to several instances in
    the sequences (together they may contribute to
    the same MC)
  • ATCGATCG → ATCGATGG or ATCGAACG
  • When IC is tested, it is calculated based only on
    combinations of these instances (up to 1000
    combinations)
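
The consensus-led matching in the procedure above can be sketched as follows. This assumes MC is the sum, over sequences, of the best per-sequence number of positions matching the consensus (consistent with the maximum of 20 × 6 = 120 quoted earlier); the function names and toy sequences are made up for illustration and are not GAMI's code.

```python
def best_match(consensus, sequence):
    """Return (matches, instance) for the window that best matches the consensus."""
    width = len(consensus)
    best = (-1, "")
    for i in range(len(sequence) - width + 1):
        window = sequence[i:i + width]
        matches = sum(a == b for a, b in zip(consensus, window))
        if matches > best[0]:
            best = (matches, window)
    return best

def match_count(consensus, sequences):
    """Sum of the best per-sequence match counts."""
    return sum(best_match(consensus, s)[0] for s in sequences)

sequences = ["TTATCGATGGTT", "CCATCGAACGCC"]  # toy data
print(best_match("ATCGATCG", sequences[0]))   # (7, 'ATCGATGG')
print(best_match("ATCGATCG", sequences[1]))   # (7, 'ATCGAACG')
print(match_count("ATCGATCG", sequences))     # 14
```

Both instances differ from the consensus at one position yet contribute equally to MC, so different instance combinations can stand behind the same score.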

16
Synonyms problem
NOT an issue!!
  • As a result, a combination of motif instances
    may correspond to another consensus but is
    forced to align with the specific motif
  • The representational power of consensus
  • The inappropriate use of IC: it is meant for
    position-led representations!

17
Future work concerning IC
  • Some of the above issues have to be worked on
  • Background frequencies: how to estimate them?
  • Forward and Backward strands
  • Positional background frequencies
  • Pseudo-counts
  • Large pseudo-counts have a pronounced effect on
    IC, especially for small sample sizes
  • Some work has been proposed to estimate
    appropriate pseudo-counts
  • More
  • Context information of the instances
  • Additional information
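
To illustrate the pseudo-count point, here is a minimal sketch that computes the IC (in bits) of a single column. It assumes the same pseudo-count is added to every base and a uniform background; both are simplifying assumptions for illustration, not a recommendation from the talk.

```python
import math

def column_ic(counts, background, pseudocount):
    """IC in bits of one column after adding the same pseudo-count to every base."""
    total = sum(counts.values()) + pseudocount * len(counts)
    ic = 0.0
    for base, count in counts.items():
        p = (count + pseudocount) / total
        if p > 0:
            ic += p * math.log2(p / background[base])
    return ic

uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
small = {"A": 5, "C": 0, "G": 0, "T": 0}    # 5 instances, perfectly conserved
large = {"A": 50, "C": 0, "G": 0, "T": 0}   # 50 instances, perfectly conserved

for pseudocount in (0.01, 1.0):
    print(pseudocount,
          round(column_ic(small, uniform, pseudocount), 3),
          round(column_ic(large, uniform, pseudocount), 3))
```

With a pseudo-count of 1, the 5-instance column loses far more of its 2 bits than the 50-instance column does, which is the small-sample sensitivity noted above.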

18
Similarity measures between PWMs [4]
  • Useful in EC for maintaining diversity
  • For position-led representation
  • Position frequency matrix (PFM)
  • Position weight matrix (PWM)

19
D for PFMs
  • χ² test for each position

D (stands for the distance) is incremented by 1
20
For PFMs with different widths
  • Various shifts are tried to get the minimal D
    (note that overlap is at least 6 nucleotides)
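
A rough sketch of D, including the shift search for PFMs of different widths. Several details are assumptions rather than the method of [4]: the per-position test is taken to be a chi-square test on the 2 × 4 table of counts, the significance threshold is fixed at the 5% critical value for 3 degrees of freedom, and only overlapping columns are compared. Every column is assumed to contain at least one count, and both PFMs are assumed to be at least `min_overlap` wide.

```python
CHI2_CRITICAL = 7.815  # chi-square critical value, 3 degrees of freedom, alpha = 0.05

def chi2_statistic(col_a, col_b):
    """Chi-square statistic of the 2 x 4 table formed by two columns of counts."""
    total_a, total_b = sum(col_a), sum(col_b)
    grand = total_a + total_b
    stat = 0.0
    for a, b in zip(col_a, col_b):
        base_total = a + b
        if base_total == 0:
            continue  # base absent from both columns: contributes nothing
        for observed, row_total in ((a, total_a), (b, total_b)):
            expected = row_total * base_total / grand
            stat += (observed - expected) ** 2 / expected
    return stat

def distance(pfm_a, pfm_b):
    """D for equally wide PFMs: +1 for every position that differs significantly."""
    return sum(chi2_statistic(a, b) > CHI2_CRITICAL for a, b in zip(pfm_a, pfm_b))

def min_shifted_distance(pfm_a, pfm_b, min_overlap=6):
    """Try every shift of pfm_b against pfm_a and keep the smallest D."""
    best = None
    for shift in range(-(len(pfm_b) - min_overlap), len(pfm_a) - min_overlap + 1):
        start_a, start_b = max(shift, 0), max(-shift, 0)
        overlap = min(len(pfm_a) - start_a, len(pfm_b) - start_b)
        d = distance(pfm_a[start_a:start_a + overlap],
                     pfm_b[start_b:start_b + overlap])
        if best is None or d < best:
            best = d
    return best

# Toy PFMs: each column holds counts in the order A, C, G, T.
all_a = [[10, 0, 0, 0]] * 6
all_c = [[0, 10, 0, 0]] * 6
longer = [[0, 10, 0, 0]] + [[10, 0, 0, 0]] * 6   # all_a with one extra leading column

print(distance(all_a, all_a))               # 0
print(distance(all_a, all_c))               # 6
print(min_shifted_distance(longer, all_a))  # 0
```

Taking the minimum over shifts means a motif and a slightly longer version of it come out as identical, at the price of ignoring the non-overlapping flanks.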

21
Further issues
  • To adopt the D for PFMs in our work
  • Change the actual counts to normalized
    frequencies for the matrix used in our GA
  • C (Correlation Coefficient) for PWMs
  • Use a random DNA sequence to compute the sum of
    weights of the PWMs and then use C to measure the
    scores (skipped, not so practical)
  • The paper [4] did more than propose D and C
  • Beyond this presentation
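
For completeness, the idea behind C can be sketched as below: score every window of one random DNA sequence with both PWMs and take the Pearson correlation of the two score series. The sequence length, the uniform base distribution, the equal-width restriction and all names are assumptions made for this illustration.

```python
import math
import random

BASES = "ACGT"

def pwm_scores(pwm, sequence):
    """Sum of weights for every window; pwm is a list of {base: weight} columns."""
    width = len(pwm)
    return [sum(col[base] for col, base in zip(pwm, sequence[i:i + width]))
            for i in range(len(sequence) - width + 1)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation(pwm_a, pwm_b, length=2000, seed=0):
    """C for two equally wide PWMs, estimated on one random uniform DNA sequence."""
    rng = random.Random(seed)
    sequence = "".join(rng.choice(BASES) for _ in range(length))
    return pearson(pwm_scores(pwm_a, sequence), pwm_scores(pwm_b, sequence))

# Toy PWMs (weights are made up).
pwm = [{"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
       {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0}]
negated = [{base: -w for base, w in col.items()} for col in pwm]

print(round(correlation(pwm, pwm), 6))      # 1.0
print(round(correlation(pwm, negated), 6))  # -1.0
```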

[4] Measuring similarities between transcription
factor binding sites, BMC Bioinformatics, Sep 2005
(IF 3.62)
22
The End
  • Thank you very much!
  • Q and A?
  • Conference Report
  • Issues for TFBSs

[Photo caption: My footprint at Wellcome Collection. Health
is most important]