DESIGNING TRAINING REGULATORY DATASETS

Transcript and Presenter's Notes

Title: DESIGNING TRAINING REGULATORY DATASETS


1
DESIGNING TRAINING REGULATORY DATASETS
  • Enrique Blanco
  • Xavier Messeguer
  • Roderic Guigó

2
OUR APPROACH
3
1. SEQUENCE AND FUNCTION
?
SIMILAR FUNCTION
SIMILAR SEQUENCE
Transthyretin: NP_000362 (human) vs. NP_036813 (rat)

human  MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV
rat    MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV

human  HVFRKAADDTWEPFASGKTSESGELHGLTTEEEFVEGIYKVEIDTKSYWK
rat    KVFKRTADGSWEPFASGKTAESGELHGLTTDEKFTEGVYRVELDTKSYWK

human  ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE
rat    ALGISPFHEYAEVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN

4
2. FUNCTION AND SEQUENCE
The ThiL gene (S. typhimurium), encoding thiamin phosphate kinase,
can be displaced by THI80 (S. cerevisiae), encoding thiamin
pyrophosphokinase: the two are functionally equivalent. Comparison
of the known structure of THI80 with the structure of ThiL reveals
different folds. Thus, two different folds might catalyze the same
reaction.
Morett, Korbel, Rajan, Saab-Rincon, Olvera, Olvera, Schmidt, Snel,
Bork. Systematic discovery of analogous enzymes in thiamin
biosynthesis. Nature Biotechnology 21, 790-795 (2003).
ThiL   MACGEFSLIARYFDRVRSSRLDVETGIG-DDCALLNIPEKQTLAISTDTL
THI80  --MSEECIENPERIKIGTDLINIRNKMNLKELIHPNEDENSTLLILNQKI

ThiL   VAGNHFLPDIDPADLAYKALAVNLSDLAAMGADPAWLTLALTLPEVDEPW
THI80  DIPRPLFYKIWKLHDLKVCADGAANRLYDYLDDDETLRIKY-LPNYIIGD

ThiL   LEAFSDSLFALLNYYDMQLIGGDTTRG-PLSMTLGIHGYIPAGRALKRSG
THI80  LDSLSEKVYKYYRKNKVTIIKQTTQYSTDFTKCVNLISLHFNSPEFRSLI

ThiL   AKPGDWIYVTGTPGDSAAG--LAVLQNRLQVSEETDAHYLIQR----HLR
THI80  SNKDNLQSNHGIELEKGIHTLYNTMTESLVFSKVTPISLLALGGIGGRFD

ThiL   PTPRILHGQALRDIASAAIDLSDGLISDLGHIVKASGCGARVDVDALPKS
THI80  QTVHSITQLYTLSENASYFKLCYMTPTDLIFLIKKNGTLIEYDPQFRNTC

ThiL   DAMMRHVDDGQALRWALSGGEDYELCFTVPELNRGALDVAIGQLGVPFTC
THI80  IGNCGLLPIGEATLVKETRGLKWDVKNWPTSVVTGRVSSSNRFVGDNCCF

ThiL   IGQMSADIEGLNFVRDGMPVTFDWKGYDHFATP
THI80  IDTKDDIILNVEIFVDKLIDFL-----------
?
SIMILAR FUNCTION
SIMILAR SEQUENCE
?
5
3. FUNCTION AND SEQUENCE (TFBSs)
HNF1-α binding sites (human)
------------AGTTAATCATTGGCC---------
-------------GTTAATTATTGGCAAATGTCCC-
-------GTATGGGTTACTTATTCTCTCTTTGTTGA
------------GGTTAAGACTCTAAT---------
-------AGTCTAGTTAATAATCTACAATT------
---------TGAGATTAATA----------------
---------AATGATTAAAA----------------
-------------GTCAAACATTAAC----------
----------CCGATTAACCATTAACCCCCACCCC-
-------------GTTAATCAGAAAA----------
GGATGTATGTAGAATTACATAAGAA-----------
-------------CTTACTCAATAAC----------
SIMILAR SEQUENCE
SIMILAR FUNCTION
?
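A set of aligned sites like the HNF1 ones above can be summarized by counting bases per column and reading off a consensus. The following is only a minimal sketch: the names `ALIGNED_SITES`, `column_counts` and `consensus`, and the `min_rows` threshold, are assumptions made for illustration and do not come from the slides.

```python
from collections import Counter

# Sites copied from the slide; '-' is alignment padding, not a base.
ALIGNED_SITES = [
    "------------AGTTAATCATTGGCC---------",
    "-------------GTTAATTATTGGCAAATGTCCC-",
    "-------GTATGGGTTACTTATTCTCTCTTTGTTGA",
    "------------GGTTAAGACTCTAAT---------",
    "-------AGTCTAGTTAATAATCTACAATT------",
    "---------TGAGATTAATA----------------",
    "---------AATGATTAAAA----------------",
    "-------------GTCAAACATTAAC----------",
    "----------CCGATTAACCATTAACCCCCACCCC-",
    "-------------GTTAATCAGAAAA----------",
    "GGATGTATGTAGAATTACATAAGAA-----------",
    "-------------CTTACTCAATAAC----------",
]


def column_counts(rows):
    """Count bases per alignment column, skipping '-' padding."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width, "-") for r in rows]  # tolerate ragged rows
    return [Counter(r[i] for r in padded if r[i] != "-") for i in range(width)]


def consensus(rows, min_rows=2):
    """Most frequent base per column; '.' where fewer than min_rows sites cover it."""
    out = []
    for counts in column_counts(rows):
        if sum(counts.values()) < min_rows:
            out.append(".")
        else:
            out.append(counts.most_common(1)[0][0])
    return "".join(out)


if __name__ == "__main__":
    # Columns covered by at least half of the twelve sites.
    print(consensus(ALIGNED_SITES, min_rows=6))
```

Even when the sites share a recognizable core, individual columns vary considerably, which is why the slide puts a question mark on sequence similarity.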
6
4. TF-MAPS: A NEW ALPHABET
7
5. TF-MAPS: A NEW FORM OF ALIGNMENT
MAP 1
MAP 2
  • We can align the TF-MAPS in this new alphabet
  • Mapping score
  • Gaps
  • Positional conservation
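The three ingredients listed above (mapping score, gaps, positional conservation) can be combined in a small dynamic program. The slides do not give the scoring formula, so the linear combination below is an assumption, and `Site`, `align_maps`, `alpha`, `lam` and `mu` are hypothetical names. It is a sketch of the idea, not the authors' implementation.

```python
from typing import List, NamedTuple


class Site(NamedTuple):
    tf: str       # TF label: one symbol of the map "alphabet"
    pos: int      # position of the predicted site in the promoter
    score: float  # mapping score of the prediction


def align_maps(a: List[Site], b: List[Site],
               alpha: float = 1.0, lam: float = 0.1, mu: float = 0.1) -> float:
    """Score of the best chain of same-TF site pairs, collinear in both maps.

    Both maps are assumed to be sorted by position. A chain gains
    alpha * (sum of mapping scores), loses lam per skipped site (gaps) and
    mu per nucleotide of difference in spacing (positional conservation).
    """
    neg_inf = float("-inf")
    # best[i][j]: best score of a chain that ends by pairing a[i] with b[j]
    best = [[neg_inf] * len(b) for _ in a]
    top = 0.0
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x.tf != y.tf:
                continue
            s = alpha * (x.score + y.score)
            best[i][j] = s  # a chain may start at this pair
            for k in range(i):
                for l in range(j):
                    if best[k][l] == neg_inf:
                        continue
                    gaps = (i - k - 1) + (j - l - 1)
                    shift = abs((x.pos - a[k].pos) - (y.pos - b[l].pos))
                    cand = best[k][l] + s - lam * gaps - mu * shift
                    best[i][j] = max(best[i][j], cand)
            top = max(top, best[i][j])
    return top


if __name__ == "__main__":
    map1 = [Site("HNF1", 10, 1.0), Site("CEBP", 30, 1.0)]
    map2 = [Site("HNF1", 12, 1.0), Site("SP1", 20, 1.0), Site("CEBP", 35, 1.0)]
    print(align_maps(map1, map2))
```

The quadruple loop is deliberately naive (quadratic in each map length) to keep the recurrence readable. Because only TF labels and positions are compared, two promoters can align well without any nucleotide-level similarity.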

8
6. TF-MAP ALIGNMENT IN PROMOTER CHARACTERIZATION
TTR PROMOTER RECONSTRUCTION
TTR gene: ENSG00000118271
Pairwise TF-map alignments between TTR and 83
COREG(TTR) in cisRED.
A.G. Robertson et al. cisRED: a database system for
genome-scale computational discovery of regulatory
elements. Nucleic Acids Research, 34:D68-D73, 2006.
9
7. ACCURACY IN ABSENCE OF SEQUENCE SIMILARITY
The HRCZ-set (36 genes): SEQUENCE ALIGNMENT vs.
TF-MAP ALIGNMENT
10
8. RESULTS
Are TF-map alignments a simple reflection of
sequence conservation? NO
TF-MAP ALIGNMENT
CLUSTALW
11
DESIGN OF THE DATASET
12
9. PAIRWISE TF-MAP ALIGNMENT TRAINING
Predictions obtained with the TRANSFAC database:
V. Matys et al. TRANSFAC and its module TRANSCompel:
transcriptional gene regulation in eukaryotes.
Nucleic Acids Research 34:D108-D110 (2006).
Plots drawn with the program gff2ps:
J. F. Abril and R. Guigó. gff2ps: visualizing genomic
annotations. Bioinformatics, 16(8):743-744 (2000).
TRAINING: systematically estimate the parameters
that are globally optimal, in terms of real TFBS
detection, on a set of well-annotated promoter
pairs
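One simple way to search for globally optimal parameters is an exhaustive grid search. The slides do not say which search strategy was used, so this is only a sketch under that assumption. `grid_search`, `evaluate` and the parameter names are hypothetical; `evaluate` stands for any function that returns the mean accuracy over the training promoter pairs for a given parameter setting.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every parameter combination and keep the best-scoring one.

    evaluate: callable taking the parameters as keyword arguments and
              returning a score to maximize (for example mean accuracy
              over all annotated promoter pairs).
    grid:     dict mapping parameter name -> list of candidate values.
    """
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


if __name__ == "__main__":
    # Toy objective with a known optimum, standing in for real accuracy.
    def toy(alpha, lam, mu):
        return -((alpha - 0.5) ** 2 + (lam - 0.1) ** 2 + (mu - 0.1) ** 2)

    grid = {"alpha": [0.1, 0.5, 1.0], "lam": [0.0, 0.1], "mu": [0.1, 0.5]}
    print(grid_search(toy, grid))
```

Exhaustive search is only feasible for a handful of parameters and coarse grids, but it makes no assumption about the shape of the accuracy surface.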
13
10. ACCURACY TESTS
  • Levels: Nucleotide, Site
  • Measures: Sensitivity [0, 1]; Specificity (PPV)
    [0, 1]; Correlation Coefficient [-1, 1]; Coverage

[Figure: REAL pair of TFBS compared with the TF-MAP ALIGNMENT, each drawn on promoters H and M]
  • A set of experimentally annotated promoters
  • The promoter sequences (mapping)
  • Coordinates of the real TFBSs (alignment)
  • TFBSs present in both promoters (alignment)
  • Human/Mouse orthologous genes
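At the nucleotide level, the measures above can be computed from the annotated and predicted site coordinates on one promoter. The sketch below uses the standard definitions, with "specificity" taken in the PPV sense used on the slide. The function name `nucleotide_accuracy` and the half-open `(start, end)` interval convention are assumptions made for illustration.

```python
import math


def nucleotide_accuracy(real, predicted, length):
    """Sensitivity, specificity (PPV) and correlation coefficient per nucleotide.

    real, predicted: lists of half-open (start, end) intervals, 0-based.
    length:          promoter length in nucleotides.
    """
    def mask(intervals):
        m = [False] * length
        for start, end in intervals:
            for i in range(max(0, start), min(length, end)):
                m[i] = True
        return m

    r, p = mask(real), mask(predicted)
    tp = sum(1 for x, y in zip(r, p) if x and y)
    fp = sum(1 for x, y in zip(r, p) if y and not x)
    fn = sum(1 for x, y in zip(r, p) if x and not y)
    tn = length - tp - fp - fn

    sn = tp / (tp + fn) if tp + fn else 0.0  # fraction of real sites recovered
    sp = tp / (tp + fp) if tp + fp else 0.0  # fraction of predictions that are real
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fn * fp) / denom if denom else 0.0
    return sn, sp, cc


if __name__ == "__main__":
    # One real site at [10, 20) and one prediction at [15, 25) on 100 nt.
    print(nucleotide_accuracy([(10, 20)], [(15, 25)], 100))
```

Site-level measures follow the same pattern, counting whole sites as hits or misses instead of individual nucleotides.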

14
11. SOURCES OF INFORMATION
  • General Regulatory Repositories
  • Publications
  • The datasets of other programs
  • Individual experimental works
  • FORMATS / QUALITY / AVAILABILITY / STABILITY

15
12. ABS: ANNOTATED BINDING SITES
16
13. MY OWN EXPERIENCE (1)
  • MANUAL DATA CURATION
  • FINDING THE PROMOTER SEQUENCES IN THE GENOME
  • The original promoter entry does not exist
    (GenBank)
  • The gene has another name
  • The gene has not been annotated yet (RefSeq)
  • The promoter sequence does not match the current
    TSS (RefSeq)
  • The promoter sequence is not a promoter sequence
    (RefSeq)
  • FINDING THE MOTIFS IN THE PROMOTER SEQUENCES
  • The binding motif is not in the original promoter
    sequence
  • The motif is not in the coordinates that it was
    expected to be
  • The motif has changed slightly (a few
    nucleotides)
  • There are several motifs that could correspond to
    the real one
  • The relative position among the motifs of the
    same gene is wrong
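Several of the motif problems listed above can be flagged automatically before manual inspection. This is a minimal sketch, assuming 0-based coordinates and a simple mismatch count (no insertions or deletions). The names `find_motif` and `check_annotation` and the status labels are hypothetical.

```python
def find_motif(promoter, motif, max_mismatches=0):
    """Return (start, mismatches) for each window within max_mismatches of motif."""
    hits = []
    m = len(motif)
    for i in range(len(promoter) - m + 1):
        d = sum(1 for a, b in zip(promoter[i:i + m], motif) if a != b)
        if d <= max_mismatches:
            hits.append((i, d))
    return hits


def check_annotation(promoter, motif, expected_start, max_mismatches=2):
    """Classify an annotated site against the promoter sequence.

    'ok'        exact motif at the expected coordinate
    'changed'   motif at the expected coordinate with a few mismatches
    'shifted'   a single candidate, but at another coordinate
    'ambiguous' several candidates, none at the expected coordinate
    'missing'   no candidate in the promoter at all
    """
    promoter, motif = promoter.upper(), motif.upper()
    hits = find_motif(promoter, motif, max_mismatches)
    if not hits:
        return "missing"
    for start, d in hits:
        if start == expected_start:
            return "ok" if d == 0 else "changed"
    return "shifted" if len(hits) == 1 else "ambiguous"


if __name__ == "__main__":
    promoter = "AAAAGTTAATCATTAAAA"
    print(check_annotation(promoter, "GTTAATCATT", 4))
    print(check_annotation(promoter, "GTTAATCATT", 6))
```

Anything other than 'ok' still needs a human decision; the check only narrows down which annotations to look at first.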

17
14. MY OWN EXPERIENCE (2)
  • TF-MAPS AND ANNOTATIONS
  • The mapping function is not defined for a given
    TF
  • The TFBS is not predicted by the mapping function
    in one of the orthologs
  • MATCHING THE ALIGNMENTS AND THE ANNOTATIONS
  • There are several mapping definitions that
    recognize the same motif

18
NEW CHALLENGES: DESIGN OF FUTURE DATASETS
19
15. NON-COLLINEAR CONSERVATION
20
16. SITES IN OTHER SPECIES
COLLAGENASE-3 GENE (MMP13) promoters kindly
provided by Dr. López-Otín (Universidad de Oviedo)
21
17. ENCODE ChIP data
TRANSFAC V$E2F1_Q3, 10,000 bp
mouse
human
22
CONCLUSION
23
RESEARCH ON GENE REGULATION: DUAL PERSONALITY?
COMPUTER SCIENTIST EXPERT 1
BIOINFORMATICIAN EXPERT 2
EXPERIMENTALIST? EXPERT 3
RESEARCHER
24
eblanco_at_imim.es