Title: Essential Transcriptomics: Understanding the New TRANSFAC Professional Database of Transcription Fac
1. Essential Transcriptomics: Understanding the New TRANSFAC Professional Database of Transcription Factors
- Yannick Pouliot, PhD
- Bioresearch Informationist
- Lane Medical Library Knowledge Management Center
- lanebioresearch_at_stanford.edu
- 11/12/2008
2. The Bioresearch Informationist: At Your Service
- Yannick Pouliot, PhD, Lane Medical Library Knowledge Management Center
- Bioresearch Informationist: computational biologist in residence
- Role: Support laboratory researchers regarding biocomputational resources and their use
- Contact: lanebioresearch_at_stanford.edu
3. Contents
- What has been licensed
- Access
- TRANSFAC database contents
  - Components
  - Contents
  - Data types
- Example computational biology applications of TRANSFAC data and tools
And please: don't get hung up, ask questions!
4. Part I: What Is Provided and How to Access It
5. BKL: What Is Available
- BKL components licensed by Stanford
  - Proteome
  - TRANSFAC Pro (commercial version of the transcription factor database)
  - Parts of ExPlain
  - MATCH, CATCH tools: TF binding site search tools integrated with TRANSFAC
- What's not licensed
  - Full version of ExPlain
  - Human Gene Mutation Database
  - BRENDA (commercial version)
6. So What Is BKL?
- BKL (BIOBASE Knowledge Library) is a knowledge base of curated biomedical data extracted from selected primary literature sources
  - Knowledge base ≠ ordinary database
- BKL is useful for
  - Querying the literature in a robust manner, particularly for visualizing complex information
  - Analyzing your experimental data in the context of what is known (biological significance)
    - Used following data acquisition, clean-up and statistical analysis
7. BKL's Data Curation Enables Robust Querying
- BKL provides much of its value by applying rigorous curation and indexing to the data it provides
  - Systematic extraction of information from a source (more later)
  - Encoding the information → enforcing structure → making it truly usable
  - Compensating for weaknesses in source data
    - E.g., applying a controlled vocabulary to compensate for variable terminology in the original text
- NCBI and NLM provide somewhat similar curation, but significantly less of it
- Closest equivalent system: Ingenuity Pathways Analysis
8. BKL Content Sources
- Data extracted from
- Primary scientific literature
- Expert curation applied
- BKL DB updated weekly
- Public databases
- GO, OMIM, Ensembl, etc.
9. BKL Data Types
- Protein physical properties
  - Sequence, isoelectric point, molecular mass, transmembrane domain(s), structure, protein domains, alternative splice forms
- Gene ontology classification
  - GO Molecular Function/Biological Process/Cellular Component
- Interaction data
  - Protein-protein
  - Protein complexes
  - Protein binding
- Expression pattern
  - Organ/Tissue/Cell type/Tumor type
- Orthology data
  - Gene families
- Homology data
  - Related proteins
  - Classified by species
  - BLAST results available, starting with summary view
- Disease association
  - Biomarker or therapeutic target
  - Disease mechanism
10. Species Covered by BKL
- Homo sapiens
- Rattus norvegicus
- Mus musculus
- Caenorhabditis elegans
- Yeast
  - Saccharomyces cerevisiae
  - Schizosaccharomyces pombe
- Large number of pathogenic fungi
- Overall, >200 species
- In short: the usual mammals, C. elegans and fungi
11. Example Applications of BKL TF Data
- Determining what gene regions bind a TF
- Identifying
  - The expression pattern of a TF
  - The consequence of TF binding (activation, inhibition)
  - Genes known to be regulated by a TF
- Obtaining the consensus DNA sequence that binds a TF
  - TRANSFAC provides a position-specific matrix that specifies possible nucleotide substitutions for each position in the consensus sequence
  - → Can use a position-specific scoring matrix (PSSM) to perform sequence similarity searches using MATCH or another program
- Identifying groups of TFs that interact with each other to regulate gene transcription
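To make the relationship between a consensus sequence and a position-specific matrix concrete, here is a minimal Python sketch. It assumes a handful of aligned binding sites are already in hand; the sites below are invented for illustration and are not taken from TRANSFAC, and the function names are hypothetical. It counts bases per column and reads off the most frequent base at each position.

```python
from collections import Counter

# Hypothetical aligned binding sites, invented for illustration only
# (not taken from TRANSFAC).
SITES = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]


def count_matrix(sites):
    """Return one dict of base counts per alignment column."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for column in zip(*sites):
        counts = Counter(column)
        matrix.append({base: counts.get(base, 0) for base in "ACGT"})
    return matrix


def consensus(matrix):
    """Pick the most frequent base per column (ties resolved in A, C, G, T order)."""
    return "".join(max("ACGT", key=lambda base: col[base]) for col in matrix)


matrix = count_matrix(SITES)
for position, col in enumerate(matrix, start=1):
    print(position, col)
print(consensus(matrix))  # prints TGACTCA
```

The matrix keeps information the consensus string throws away: position 4 is C in three sites and G in one, which a matrix-based search can take into account.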
12. More Use Cases of BKL
- Gaining a rapid understanding of a protein and its properties, e.g.
  - Interactions
    - Understanding what proteins interact with your protein
  - Expression
    - Determining where a protein is present or absent
    - Understanding how a protein is regulated
  - Comparative genomics
    - Finding homologs, orthologs
- Understanding large data sets from e.g.
  - Microarray gene expression results
  - Proteomics (mass spec data)
- Inter-relating identifiers
13. New Developments Since Previous Version of TRANSFAC Pro: BKL
- Very different interface
- Proteome and TRANSFAC no longer exist as independent applications
  - Instead, they are now integrated together AND with other BIOBASE products
- Much better visualizer
  - Though problematic
- New additional application: ExPlain
  - Used for complex tasks that include MATCH, CATCH and others
  - Essentially a wizard
14. Accessing BKL
- User-unlimited site license purchased by Lane Library
  - SUNet ID required
  - If you want your own environment, you need to obtain a user login (free)
- Now fully browser-based
  - Nothing to install → new
  - Except the Flash plugin (for the visualizer), if you don't already have it
- No known issues with different browsers
  - Modern browsers preferable (IE7, FF2)
15. Accessing BKL
lane.stanford.edu → bioresearch
16. Part II: Data Contents
17. Contents of TRANSFAC Professional 11.4 (12/14/2007)
861 TFs added since June 2007 (8.8% increase)
18. TRANSFAC Contents: Data Types
- Data on >10,000 transcription factors and their properties
  - Genes that express transcription factors
  - Structural features of a transcription factor
  - Expression pattern
    - → TRANSFAC lists microarrays that include a TF gene
  - Regulatory networks (NEW: now uses viewer)
  - Functional properties
  - Interacting factors
  - Position-specific matrices that can be used for similarity searching of DNA sequences that might bind a factor (e.g., using MEME and MAST)
  - ChIP-on-chip data
19. Magnitude of Transcription Factor Universe Provided by TRANSFAC (Feb 2008)
20. Part III: Searching and Understanding BKL's TF Data
21. Options Relevant for TF Searching
- Easiest: use the Locus Report for your gene → ties everything together; especially useful for complex genes with, e.g., 3 promoters (next slide for more)
- Otherwise, for TFs the following options are relevant
  - Site: gene sites that are bound by TFs or complexes of TFs
  - Promoter
  - Composite element: minimal functional unit within which both protein-DNA and protein-protein interactions contribute to a highly specific pattern of transcriptional regulation
  - Matrix: nucleotide distribution matrices for the binding sites of transcription factors
  - Functional region: contains details about regulatory regions of a gene → broader than Promoter; would include enhancers distal to the promoter
22. Advanced Search Engine: Site Searching
23. The Best Starting Point: The Locus Report
Relevant to TFs
24. Expression Panel: Comprehensive Expression Data
25. Regulatory and Binding Elements within Gene Regulation Panel
26. Understanding Binding Sites: Regulatory Elements
27. Sometimes Messy Data: What Is Going On Here?
28. TF Target Gene Binding and Regulation
Describes the protein binding to DNA and other gene regulation activities attributed to the protein
Note: non-human source of ER
29. ChIP-Chip Data (click on a promoter within Gene Regulation panel)
Problem: what is the source of the ChIP-on-chip data?
30. TF Transcriptional Network (in Protein Binding and Regulatory Activity)
- Requires launching the viewer and requesting "Regulates" or "Regulated By"
- Slow
- May not return if too much data is returned
- User interface not great (what is that gene??)
Lists protein-protein interactions with OTHER TFs ONLY
Regulated by
Regulates
31. Annotation Section (Overlap?)
Q: Is there overlap between data in Annotations and other sections of the Locus Report (e.g., Expression)?
A: Unknown
32. Useful Identifiers and Links to Other Databases
33. The Binding Sequence Panel: A Tool for Evaluating TF Binding Sites
34. Part IV: Example Biocomputational Applications
35. MATCH and PATCH for Finding TF Binding Sites
- MATCH: uses PSSM searching to find TF binding sites
- PATCH: uses pattern searching to find TF binding sites
- Nice tools, but not industry standard
  - → Problem: how good are they? Who knows?
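As a rough illustration of what "pattern searching" means (as opposed to matrix searching), the sketch below translates an IUPAC nucleotide pattern into a regular expression and reports every match. This is not PATCH's actual algorithm, which these slides do not describe; the function name, pattern and sequence are hypothetical, and only the forward strand is searched.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def find_pattern(pattern, sequence):
    """Return 0-based start positions of all matches, overlapping ones included."""
    regex = "".join(IUPAC[code] for code in pattern.upper())
    # A zero-width lookahead lets overlapping matches be reported.
    return [m.start() for m in re.finditer(f"(?={regex})", sequence.upper())]


# Hypothetical pattern and sequence, invented for illustration.
print(find_pattern("TGASTCA", "ccTGACTCAggTGAGTCAtt"))  # prints [2, 11]
```

A pattern either matches or it does not; there is no score, which is the main practical difference from a PSSM search.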
36. What Is a PSSM/PWM?
- PSSM: frequency or likelihood matrix that describes the nucleotide or residue variation at each position in a DNA or protein sequence
  - Description from "Bioinformatics: Sequence, Structure and Databanks: A Practical Approach" (eBook)
- Can be derived from TRANSFAC's binding matrix
- Can be used as input for MATCH
  - How well does MATCH perform? Who knows
- Another option: MEME and MAST, classic programs for searching DNA or protein using a PSSM
  - → Lane FAQ on MEME
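The sketch below shows one common way a PSSM is used for searching: counts are converted to log-odds scores and a window is slid along the sequence. It is a generic textbook-style scheme, not the scoring used by MATCH or MAST; the count matrix, pseudocount, uniform background and threshold are all assumptions chosen for illustration.

```python
import math

# Hypothetical count matrix (one dict per position), invented for illustration.
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 4},
    {"A": 0, "C": 0, "G": 3, "T": 1},
    {"A": 4, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 3, "G": 1, "T": 0},
]


def log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert counts to log2 odds against a uniform background (an assumption)."""
    pwm = []
    for col in counts:
        total = sum(col.values()) + 4 * pseudocount
        pwm.append({
            base: math.log2((col[base] + pseudocount) / total / background)
            for base in "ACGT"
        })
    return pwm


def scan(pwm, sequence, threshold):
    """Return (start, window, score) for each window scoring at or above threshold."""
    width = len(pwm)
    seq = sequence.upper()
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in "ACGT" for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, window, round(score, 2)))
    return hits


pwm = log_odds(COUNTS)
hits = scan(pwm, "AATGACGGTTACAA", threshold=3.0)
for hit in hits:
    print(hit)
```

With these made-up numbers, the best-matching window (TGAC) outscores a near match (TTAC), and both pass the threshold. Choosing that threshold is where tools differ, and it is exactly the "how good are they?" question raised on the slides.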
37. BIOBASE's MATCH Program
- Now part of the ExPlain tool → essentially a wizard
- Need to log in with a (free) personal account
- MATCH is useful for, e.g.
  - Simple: searching for binding sites in individual sequences
  - More sophisticated: identifying those genes expressed in specific tissues or cell cycle stages that include a binding site, by first assembling a collection of potential target genes and searching that group
    - E.g., muscle-specific, immune-specific, etc.
- Reminder: Stanford has not licensed the full ExPlain
  - Some functions are not available, although they are listed in the documentation
38. Part V: Summary of Limitations
39. Content Limitations
- Standard caution: TRANSFAC should not be considered comprehensive or fully up to date
  - → Use as a first step in collecting TFs
- All data in a TRANSFAC record are derived from the primary experimental literature
- However, there are exceptions, e.g.
  - TFs without a known binding site in species X are sometimes included on the basis of binding observed in species Y (orthology-based transitive assignment)
    - → Note: they don't make that clear
  - Search engine can return TFs based on data originating from computational analysis
- Origin of data sometimes very unclear
  - Source paper?
  - Nature of evidence?
  - Quality score?
- Application-specific jargon can be confusing
  - compelsite, isogroup
  - → Heavy requirement to read the documentation to be clear on terms and how they are used in the application
40. Search Engine Limitations
- Use of querying topics can be confusing or limiting
- Various usability issues, presumably associated with the youth of the application, e.g.
  - Data sets are incompletely integrated for search purposes
  - Can find "ER-alpha" binding factor but not "ER-1" (Locus Report lists them as synonyms)
  - Can find factor source using "rec" but not "recombinant"
  - "Homo sapiens" does not overlap with "human"
41. In Short
- TRANSFAC is a solid source of TF data of many types
- BKL provides nice integration of all aspects of a TF
  - Goes way beyond what TRANSFAC used to provide
- However, the text search engine is unreliable because it lacks a thesaurus
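One way to work around the missing thesaurus is to expand your own query terms before searching. The sketch below is a hypothetical client-side helper, not a BKL feature: the synonym groups come from the examples on the Search Engine Limitations slide, while the record texts and function names are invented for illustration.

```python
# Synonym groups taken from the examples on the slides; a real thesaurus
# would be far larger and curated.
SYNONYM_GROUPS = [
    {"er-alpha", "er-1"},
    {"rec", "recombinant"},
    {"homo sapiens", "human"},
]


def expand(term):
    """Return the term plus any known synonyms, all lower-cased."""
    key = term.strip().lower()
    for group in SYNONYM_GROUPS:
        if key in group:
            return set(group)
    return {key}


def search(records, term):
    """Return records whose text contains the term or any of its synonyms."""
    variants = expand(term)
    return [r for r in records if any(v in r.lower() for v in variants)]


# Hypothetical record texts, invented for illustration.
RECORDS = [
    "ER-alpha binding factor, source: rec",
    "Factor isolated from human liver",
    "Unrelated yeast factor",
]

print(search(RECORDS, "ER-1"))
print(search(RECORDS, "Homo sapiens"))
```

Plain substring matching is used to keep the sketch short. A short variant such as "rec" would also match unrelated words that contain it, so a real helper would match on word boundaries.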