Title: DESIGNING TRAINING REGULATORY DATASETS
- Enrique Blanco
- Xavier Messeguer
- Roderic Guigó
OUR APPROACH
1. SEQUENCE AND FUNCTION
SIMILAR SEQUENCE / SIMILAR FUNCTION ?
Transthyretin: NP_000362 (human) vs. NP_036813 (rat)

human  MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV
rat    MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV

human  HVFRKAADDTWEPFASGKTSESGELHGLTTEEEFVEGIYKVEIDTKSYWK
rat    KVFKRTADGSWEPFASGKTAESGELHGLTTDEKFTEGVYRVELDTKSYWK

human  ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE
rat    ALGISPFHEYAEVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN
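How similar the two transthyretin sequences are can be quantified with a few lines of code. The sketch below assumes the ungapped, position-by-position pairing shown on the slide; the function name `percent_identity` is illustrative and does not belong to any tool mentioned in this talk.

```python
# Human (NP_000362) and rat (NP_036813) transthyretin, as shown on the slide.
HUMAN = (
    "MASHRLLLLCLAGLVFVSEAGPTGTGESKCPLMVKVLDAVRGSPAINVAV"
    "HVFRKAADDTWEPFASGKTSESGELHGLTTEEEFVEGIYKVEIDTKSYWK"
    "ALGISPFHEHAEVVFTANDSGPRRYTIAALLSPYSYSTTAVVTNPKE"
)
RAT = (
    "MASLRLFLLCLAGLIFASEAGPGGAGESKCPLMVKVLDAVRGSPAVDVAV"
    "KVFKRTADGSWEPFASGKTAESGELHGLTTDEKFTEGVYRVELDTKSYWK"
    "ALGISPFHEYAEVVFTANDSGHRHYTIAALLSPYSYSTTAVVSNPQN"
)


def percent_identity(a: str, b: str) -> float:
    """Percentage of positions with the same residue (ungapped comparison)."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


identity = percent_identity(HUMAN, RAT)
print(f"{identity:.1f}% identical over {len(HUMAN)} residues")
```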
2. FUNCTION AND SEQUENCE
The ThiL gene (S. typhimurium), encoding thiamin phosphate kinase,
can be displaced by THI80 (S. cerevisiae), encoding thiamin
pyrophosphokinase: the two are functionally equivalent. Comparison of
the known structure of THI80 with the structure of ThiL reveals
different folds. Thus, two different folds might catalyze the same
reaction.
Morett, Korbel, Rajan, Saab-Rincon, Olvera, Olvera, Schmidt, Snel,
Bork. Systematic discovery of analogous enzymes in thiamin
biosynthesis. Nature Biotechnology 21, 790-795 (2003).
ThiL   MACGEFSLIARYFDRVRSSRLDVETGIG-DDCALLNIPEKQTLAISTDTL
THI80  --MSEECIENPERIKIGTDLINIRNKMNLKELIHPNEDENSTLLILNQKI

ThiL   VAGNHFLPDIDPADLAYKALAVNLSDLAAMGADPAWLTLALTLPEVDEPW
THI80  DIPRPLFYKIWKLHDLKVCADGAANRLYDYLDDDETLRIKY-LPNYIIGD

ThiL   LEAFSDSLFALLNYYDMQLIGGDTTRG-PLSMTLGIHGYIPAGRALKRSG
THI80  LDSLSEKVYKYYRKNKVTIIKQTTQYSTDFTKCVNLISLHFNSPEFRSLI

ThiL   AKPGDWIYVTGTPGDSAAG--LAVLQNRLQVSEETDAHYLIQR----HLR
THI80  SNKDNLQSNHGIELEKGIHTLYNTMTESLVFSKVTPISLLALGGIGGRFD

ThiL   PTPRILHGQALRDIASAAIDLSDGLISDLGHIVKASGCGARVDVDALPKS
THI80  QTVHSITQLYTLSENASYFKLCYMTPTDLIFLIKKNGTLIEYDPQFRNTC

ThiL   DAMMRHVDDGQALRWALSGGEDYELCFTVPELNRGALDVAIGQLGVPFTC
THI80  IGNCGLLPIGEATLVKETRGLKWDVKNWPTSVVTGRVSSSNRFVGDNCCF

ThiL   IGQMSADIEGLNFVRDGMPVTFDWKGYDHFATP
THI80  IDTKDDIILNVEIFVDKLIDFL-----------
SIMILAR FUNCTION / SIMILAR SEQUENCE ?
3. FUNCTION AND SEQUENCE (TFBSs)
HNF1-? binding sites (human)
------------AGTTAATCATTGGCC---------
-------------GTTAATTATTGGCAAATGTCCC-
-------GTATGGGTTACTTATTCTCTCTTTGTTGA
------------GGTTAAGACTCTAAT---------
-------AGTCTAGTTAATAATCTACAATT------
---------TGAGATTAATA----------------
---------AATGATTAAAA----------------
-------------GTCAAACATTAAC----------
----------CCGATTAACCATTAACCCCCACCCC-
-------------GTTAATCAGAAAA----------
GGATGTATGTAGAATTACATAAGAA-----------
-------------CTTACTCAATAAC----------
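One way to see what these binding sites have in common is to count the nucleotides in each column of the alignment. The sketch below does this for the twelve rows above. The helper names are illustrative, and the consensus is a plain majority vote over the columns covered by every site; it is not the TRANSFAC matrix for this factor.

```python
from collections import Counter

# The twelve aligned binding sites, copied from the slide ('-' marks a gap).
SITES = [
    "------------AGTTAATCATTGGCC---------",
    "-------------GTTAATTATTGGCAAATGTCCC-",
    "-------GTATGGGTTACTTATTCTCTCTTTGTTGA",
    "------------GGTTAAGACTCTAAT---------",
    "-------AGTCTAGTTAATAATCTACAATT------",
    "---------TGAGATTAATA----------------",
    "---------AATGATTAAAA----------------",
    "-------------GTCAAACATTAAC----------",
    "----------CCGATTAACCATTAACCCCCACCCC-",
    "-------------GTTAATCAGAAAA----------",
    "GGATGTATGTAGAATTACATAAGAA-----------",
    "-------------CTTACTCAATAAC----------",
]


def column_counts(rows):
    """Count nucleotides per alignment column, ignoring gap characters."""
    width = len(rows[0])
    if any(len(r) != width for r in rows):
        raise ValueError("all rows must have the same length")
    return [Counter(r[i] for r in rows if r[i] != "-") for i in range(width)]


def consensus(counts, min_rows):
    """Majority nucleotide per column; '.' where fewer than min_rows sites."""
    return "".join(
        c.most_common(1)[0][0] if sum(c.values()) >= min_rows else "."
        for c in counts
    )


counts = column_counts(SITES)
# Only columns present in all twelve sites contribute a letter.
print(consensus(counts, min_rows=len(SITES)))
```

Lowering `min_rows` extends the consensus into the flanks, at the price of basing each letter on fewer sites.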
SIMILAR SEQUENCE / SIMILAR FUNCTION ?
4. TF-MAPS: A NEW ALPHABET
5. TF-MAPS: A NEW FORM OF ALIGNMENT
MAP 1
MAP 2
- We can align the TF-MAPS in this new alphabet
- Mapping score
- Gaps
- Positional conservation
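A minimal sketch of how such an alignment could be scored is given below. It assumes that each map is a list of predicted sites (TF label, start, end, mapping score), that only sites with the same label can be paired, and that the total score combines the three ingredients listed above: the mapping scores are rewarded, unaligned sites (gaps) are penalised, and changes in the spacing between consecutive aligned pairs are penalised. The weights `alpha`, `lam` and `mu`, the `Site` record and the quadratic search over predecessors are assumptions made for illustration; they are not the parameters or the code of the actual program.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    tf: str       # transcription factor label (a "letter" of the new alphabet)
    start: int    # first position of the site in the promoter
    end: int      # last position of the site (inclusive)
    score: float  # mapping score of the prediction


def align_tf_maps(map_a, map_b, alpha=1.0, lam=0.1, mu=0.1):
    """Return (score, pairs) for the best collinear alignment of two maps."""
    a = sorted(map_a, key=lambda s: s.start)
    b = sorted(map_b, key=lambda s: s.start)
    best = {}  # (i, j) -> (score of best alignment ending here, previous key)
    top_key, top_score = None, 0.0
    for i, sa in enumerate(a):
        for j, sb in enumerate(b):
            if sa.tf != sb.tf:
                continue
            match = alpha * (sa.score + sb.score)
            cell = (match, None)  # option 1: start a new alignment here
            for (pi, pj), (prev_score, _) in best.items():
                pa, pb = a[pi], b[pj]
                if pi >= i or pj >= j or pa.end >= sa.start or pb.end >= sb.start:
                    continue  # keep pairs collinear and non-overlapping
                gaps = (i - pi - 1) + (j - pj - 1)  # skipped sites
                shift = abs((sa.start - pa.end) - (sb.start - pb.end))
                cand = prev_score + match - lam * gaps - mu * shift
                if cand > cell[0]:
                    cell = (cand, (pi, pj))
            best[(i, j)] = cell
            if cell[0] > top_score:
                top_key, top_score = (i, j), cell[0]
    pairs, key = [], top_key
    while key is not None:  # trace back from the best-scoring pair
        pairs.append((a[key[0]], b[key[1]]))
        key = best[key][1]
    return top_score, pairs[::-1]


# Toy maps with invented coordinates and scores.
human = [Site("HNF1", 10, 20, 1.0), Site("CEBP", 40, 50, 2.0)]
mouse = [Site("HNF1", 15, 25, 1.0), Site("SP1", 30, 35, 5.0), Site("CEBP", 50, 60, 2.0)]
score, pairs = align_tf_maps(human, mouse)
print(score, [(x.tf, x.start, y.start) for x, y in pairs])
```

In this toy case the high-scoring SP1 site in the second map stays unaligned because the first map has no site with that label, so it only contributes one gap penalty.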
6. TF-MAP ALIGNMENT IN PROMOTER CHARACTERIZATION
TTR PROMOTER RECONSTRUCTION
TTR gene: ENSG00000118271
Pairwise TF-map alignments between TTR and 83 COREG(TTR) in cisRED.
A. G. Robertson et al. cisRED: a database system for genome-scale
computational discovery of regulatory elements.
Nucleic Acids Research 34, D68-D73 (2006).
7. ACCURACY IN ABSENCE OF SEQUENCE SIMILARITY
The HRCZ-set (36 genes): SEQUENCE ALIGNMENT vs. TF-MAP ALIGNMENT
8. RESULTS
Are TF-map alignments a simple reflection of sequence conservation?
NO
TF-MAP ALIGNMENT
CLUSTALW
DESIGN OF THE DATASET
9. PAIRWISE TF-MAP ALIGNMENT: TRAINING
Predictions obtained with the TRANSFAC database: V. Matys et al.
TRANSFAC and its module TRANSCompel: transcriptional gene regulation
in eukaryotes. Nucleic Acids Research 34, D108-D110 (2006).
Plots produced with the program gff2ps: J. F. Abril and R. Guigó.
gff2ps: visualizing genomic annotations.
Bioinformatics 16(8), 743-744 (2000).
TRAINING: to systematically estimate the parameters that are
globally optimal, in terms of real TFBS detection, on a set of
well-annotated promoter pairs.
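The training idea can be sketched as an exhaustive search over a parameter grid: for each combination of values, every annotated promoter pair is aligned and the combination with the best mean accuracy is kept. In the sketch below, `align` and `accuracy` are placeholders supplied by the caller, and the parameter names and grid values are assumptions of this example, not those used in the actual training.

```python
from itertools import product
from statistics import mean


def train(pairs, align, accuracy, grid):
    """Return (best_params, best_mean_accuracy) over all grid combinations.

    pairs    -- non-empty iterable of (map_a, map_b, annotation) examples
    align    -- callable(map_a, map_b, **params) -> predicted alignment
    accuracy -- callable(prediction, annotation) -> float, higher is better
    grid     -- dict mapping parameter name -> list of candidate values
    """
    pairs = list(pairs)
    names = sorted(grid)
    best_params, best_acc = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        acc = mean(accuracy(align(a, b, **params), truth) for a, b, truth in pairs)
        if acc > best_acc:
            best_params, best_acc = params, acc
    return best_params, best_acc


# Toy check with stand-in functions whose accuracy peaks at lam=0.2, mu=0.5.
toy_pairs = [(None, None, None)]
toy_align = lambda a, b, lam, mu: (lam, mu)
toy_accuracy = lambda pred, truth: -abs(pred[0] - 0.2) - abs(pred[1] - 0.5)
best, acc = train(toy_pairs, toy_align, toy_accuracy,
                  {"lam": [0.1, 0.2, 0.3], "mu": [0.1, 0.5]})
print(best)
```

An exhaustive grid is the simplest choice; it stays practical only while the number of parameters and candidate values is small.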
10. ACCURACY TESTS
- Levels
  - Nucleotide
  - Site
- Measures
  - Sensitivity [0, 1]
  - Specificity (PPV) [0, 1]
  - Correlation Coefficient [-1, 1]
  - Coverage
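At the nucleotide level, these measures are usually computed from the counts of true and false positives and negatives. The sketch below assumes the standard definitions (sensitivity = TP / (TP + FN), specificity reported as PPV = TP / (TP + FP), and the correlation coefficient of the two binary labelings). Sites are given as half-open (start, end) intervals; that convention and the function name are choices made for this example only.

```python
from math import sqrt


def nucleotide_accuracy(real, predicted, length):
    """Return (Sn, Sp, CC) for two lists of half-open (start, end) sites."""
    def covered(sites):
        return {pos for start, end in sites for pos in range(start, end)}

    r, p = covered(real), covered(predicted)
    tp = len(r & p)                # real nucleotides that were predicted
    fp = len(p - r)                # predicted nucleotides that are not real
    fn = len(r - p)                # real nucleotides that were missed
    tn = length - tp - fp - fn     # everything else in the promoter
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    denom = sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, cc


# Toy case: a 10-nt real site and a 10-nt prediction overlapping by 5 nt.
sn, sp, cc = nucleotide_accuracy(real=[(10, 20)], predicted=[(15, 25)], length=100)
print(f"Sn={sn:.2f} Sp={sp:.2f} CC={cc:.2f}")
```

The site-level versions follow the same pattern, counting whole sites (for example, a prediction that overlaps a real site) instead of single nucleotides.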
[Figure: a REAL pair of TFBSs in the human (H) and mouse (M) promoters, compared with the TF-MAP ALIGNMENT of the same two promoters]
- A set of experimentally annotated promoters
- The promoter sequences (mapping)
- Coordinates of the real TFBSs (alignment)
- TFBSs present in both promoters (alignment)
- Human/Mouse orthologous genes
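The requirements above suggest one record per orthologous gene pair. The layout below is a hypothetical sketch: the class and field names are illustrative and do not reproduce the schema of any existing database, and the example values are invented.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedPromoter:
    species: str
    sequence: str                                # input for the mapping step
    sites: dict = field(default_factory=dict)    # TF name -> list of (start, end)


@dataclass
class PromoterPair:
    gene: str
    human: AnnotatedPromoter
    mouse: AnnotatedPromoter

    def shared_tfs(self):
        """TFs with at least one annotated site in both promoters."""
        in_human = {tf for tf, s in self.human.sites.items() if s}
        in_mouse = {tf for tf, s in self.mouse.sites.items() if s}
        return sorted(in_human & in_mouse)

    def usable_for_training(self):
        """A pair is informative only if some TFBS is annotated in both."""
        return bool(self.shared_tfs())


# Invented example record (placeholder gene name, sequences and coordinates).
pair = PromoterPair(
    gene="EXAMPLE",
    human=AnnotatedPromoter("human", "ACGT" * 10, {"HNF1": [(5, 18)], "SP1": [(25, 31)]}),
    mouse=AnnotatedPromoter("mouse", "ACGT" * 10, {"HNF1": [(7, 20)], "CEBP": []}),
)
print(pair.shared_tfs(), pair.usable_for_training())
```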
11. SOURCES OF INFORMATION
- General Regulatory Repositories
- Publications
- The datasets of other programs
- Individual experimental works
- FORMATS / QUALITY / AVAILABILITY / STABILITY
12. ABS: ANNOTATED BINDING SITES
13. MY OWN EXPERIENCE (1)
- MANUAL DATA CURATION
- FINDING THE PROMOTER SEQUENCES IN THE GENOME
  - The original promoter entry does not exist (GenBank)
  - The gene has another name
  - The gene has not been annotated yet (RefSeq)
  - The promoter sequence does not match the current TSS (RefSeq)
  - The promoter sequence is not a promoter sequence (RefSeq)
- FINDING THE MOTIFS IN THE PROMOTER SEQUENCES
  - The binding motif is not in the original promoter sequence
  - The motif is not in the coordinates that it was expected to be
  - The motif has changed slightly (a few nucleotides)
  - There are several motifs that could correspond to the real one
  - The relative position among the motifs of the same gene is wrong
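Several of the motif problems listed above (shifted coordinates, a few changed nucleotides, more than one candidate) can be screened automatically before manual inspection. The sketch below slides the annotated motif along the promoter and reports every position within a mismatch limit, ordered by number of mismatches and then by distance from the expected coordinate. The function name, the mismatch threshold and the toy sequences are assumptions of this example; only the forward strand is scanned.

```python
def relocate_motif(promoter, motif, expected_start, max_mismatches=2):
    """Return candidate (start, mismatches, offset) tuples, best first.

    offset is the distance between the position found and the annotated one.
    """
    promoter, motif = promoter.upper(), motif.upper()
    hits = []
    for start in range(len(promoter) - len(motif) + 1):
        window = promoter[start:start + len(motif)]
        mismatches = sum(x != y for x, y in zip(window, motif))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches, start - expected_start))
    return sorted(hits, key=lambda h: (h[1], abs(h[2])))


# Toy promoter: an exact copy of the motif at position 4 and a copy with
# one changed nucleotide at position 21; the annotation says position 2.
motif = "GTTAATCATTAAC"
promoter = "CCCC" + motif + "CCCC" + "GTTAGTCATTAAC" + "CCCC"
hits = relocate_motif(promoter, motif, expected_start=2)
for start, mismatches, offset in hits:
    print(start, mismatches, offset)
```

More than one reported candidate corresponds to the case of several motifs that could be the real one, which still needs a manual decision.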
14. MY OWN EXPERIENCE (2)
- TF-MAPS AND ANNOTATIONS
  - The mapping function is not defined for a given TF
  - The TFBS is not predicted by the mapping function in one of the orthologs
- MATCHING THE ALIGNMENTS AND THE ANNOTATIONS
  - There are several mapping definitions that recognize the same motif
NEW CHALLENGES: DESIGN OF FUTURE DATASETS
15. NON-COLLINEAR CONSERVATION
16. SITES IN OTHER SPECIES
COLLAGENASE-3 GENE (MMP13) promoters kindly
provided by Dr. López-Otín (Universidad de Oviedo)
17. ENCODE ChIP data
TRANSFAC V$E2F1_Q3, 10,000 bp
mouse
human
CONCLUSION
RESEARCH ON GENE REGULATION: DUAL PERSONALITY?
COMPUTER SCIENTIST EXPERT 1
BIOINFORMATICIAN EXPERT 2
EXPERIMENTALIST? EXPERT 3
RESEARCHER
eblanco_at_imim.es