Title: CS177 Lecture 8 Bioinformatics Databases and genetic diseases
1 CS177 Lecture 8: Bioinformatics Databases (and genetic diseases)
2 Lecture overview
- Very brief and fast overview of on-line databases.
- Formulating queries in Entrez.
- Molecular biology of diseases, including an extensive example involving a lot of linking between a number of Entrez databases.
3 Bioinformatics Resources
- Reference: Chapter 3 in Sequence - Evolution - Function, E.V. Koonin and M.Y. Galperin, Kluwer Academic, 2003.
- Available on the NCBI Bookshelf.
4 Sequence Databases
- GenBank, EMBL, DDBJ: archival (International Nucleotide Sequence Database Collaboration); sequences have a common accession
- SWISS-PROT: curated, non-redundant, entries hyperlinked, e.g. to PubMed; TrEMBL: entries not yet ready for SWISS-PROT
- Motifs: PROSITE, BLOCKS, PRINTS
- Domains: Pfam, SMART, ProDom, COGs (NCBI)
- Motifs/domains: InterPro, CDD (NCBI)
5 More databases
- Structure: PDB/RCSB, MMDB (NCBI), SCOP, CATH, FSSP
- Organism-specific: e.g. E. coli, B. subtilis, Synechocystis sp. (bacteria); yeast (unicellular eukaryote); Arabidopsis, C. elegans (WormBase), fruit fly, human
- COGs: clusters of orthologous groups; KEGG: biochemical pathways; BIND: protein-protein interactions; ENZYME, LIGAND: enzymes and their substrates
- PubChem (NCBI): chemical substances
9 PubChem (new)
11 The (ever expanding) Entrez System
[Diagram of Entrez database nodes, including: NLM Catalog, PubChem, Compounds, BioAssays, Substances, Literature, Organism, Expression, HomoloGene, Gene]
12 Links Between and Within Nodes
[Diagram labels: Word weight, 3-D Structures, VAST, Phylogeny, Protein sequences, BLAST, Computational]
13 PubMed: Computation of Related Articles
- The neighbors of a document are those documents in the database that are the most similar to it. The similarity between documents is measured by the words they have in common, with some adjustment for document lengths.
- The value of a term depends on global and local types of information:
  - G: the number of different documents in the database that contain the term
  - L: the number of times the term occurs in a particular document
14 Global and local weights
- The global weight of a term is greater for the less frequent terms. The presence of a term that occurred in most of the documents would really tell one very little about a document.
- The local weight of a term is the measure of its importance in a particular document. Generally, the more frequent a term is within a document, the more important it is in representing the content of that document.
15 How we define similar documents
- The similarity between two documents is computed by adding up the weights (local wt1 × local wt2 × global wt) of all of the terms the two documents have in common. All results are ranked and the most similar documents become Related Articles.
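The weighting scheme on the last three slides can be sketched in a few lines of Python. This is a simplified illustration, not the PubMed implementation: the local weight is taken to be the raw term count, the global weight is assumed to be log(N / document frequency), and the document-length adjustment is left out. The function names and the toy documents are made up for the example.

```python
import math
from collections import Counter

def global_weights(docs):
    """Global weight per term: rarer terms score higher (assumed log(N/df))."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def similarity(doc_a, doc_b, gw):
    """Sum of local_wt1 * local_wt2 * global_wt over the shared terms."""
    local_a, local_b = Counter(doc_a), Counter(doc_b)
    shared = local_a.keys() & local_b.keys()
    return sum(local_a[t] * local_b[t] * gw[t] for t in shared)

def related(index, docs):
    """Indices of the other documents, most similar first."""
    gw = global_weights(docs)
    scores = [(similarity(docs[index], other, gw), j)
              for j, other in enumerate(docs) if j != index]
    return [j for _, j in sorted(scores, reverse=True)]

# Toy "abstracts", already split into words (hypothetical data).
docs = [
    "hfe iron overload hemochromatosis".split(),
    "hfe mutation hemochromatosis iron".split(),
    "p53 tumor suppressor mutation".split(),
]
print(related(0, docs))
```

Note that a term occurring in every document gets a global weight of log(1) = 0 under this assumption, which matches the remark that such a term tells one very little about a document.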
16 Entrez database queries
- The databases are indexed by different sets of terms.
- You can get to a particular DB by selecting it and then entering a null query.
- The Preview/Index tab displays the index terms and can be used to formulate a query (if you can't remember the syntax for the index).
- Limits can be used e.g. to select publications in a specified time range.
- Details shows the interpretation of the query.
18 Exercises!
- How many protein structures are there that include DNA and are from bacteria? bacteria[orgn] AND 1:100[DNAChainCount]
- In PubMed, how many articles are there from the journal Science and have Alzheimer in the title or abstract, and amyloid beta anywhere? How many since the year 2000?
- Notice that the results are not 100% accurate!
- In 3D Domains, how many domains are there with no more than two helices and 8 to 10 strands and are from the mouse? 0:2[HelixCount] AND 8:10[StrandCount] AND mouse[orgn]
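One possible way to write the PubMed exercise as a fielded query is sketched below. The helper names are hypothetical, and the field tags ([Journal], [Title/Abstract], [All Fields], [PDAT]) and the E-utilities esearch address are assumptions about the current PubMed interface rather than something shown on the slides. The code only builds the query string and URL; it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed address of the NCBI E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def field(term, tag):
    """Format one clause as term[tag], quoting multi-word phrases."""
    quoted = f'"{term}"' if " " in term else term
    return f"{quoted}[{tag}]"

def all_of(*clauses):
    """Join clauses with the Entrez boolean operator AND."""
    return " AND ".join(clauses)

def esearch_url(db, term):
    """Build (but do not fetch) a URL that asks only for the hit count."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "rettype": "count"})

query = all_of(
    field("Science", "Journal"),
    field("Alzheimer", "Title/Abstract"),
    field("amyloid beta", "All Fields"),
)
# Ranges use a colon, as in the structure queries above; 3000 is an open upper bound.
since_2000 = all_of(query, "2000:3000[PDAT]")

print(query)
print(esearch_url("pubmed", since_2000))
```

The Details tab mentioned earlier is the place to check how Entrez actually interprets such a query.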
19 Investigating genetic diseases
- Now we will see examples of how bioinformatics
databases can be used to investigate genetic
diseases.
20 Gene variants that can affect protein function
- Mutation to a stop codon truncates the protein product!
- Insertion/deletion of multiple bases changes the sequence of amino acid residues.
- Single point change could alter folding properties of the protein.
- Single point change could affect the active site of the protein.
- Single point change could affect an interaction site with another molecule.
21 Lodish et al., Molecular Cell Biology, W.H. Freeman, 2000
22 Sickle cell anemia
- The first molecular disease, i.e. the first genetic disease with a known molecular basis.
- The most common variant is caused by a Glu6Val mutation in the hemoglobin β-chain (HbS). However, there are hundreds of other mutations that can cause this (OMIM lists 524 variants!).
- This mutation causes the hemoglobin to polymerize; in turn the red blood cells form sickle shapes and clump together under low oxygen conditions or high hemoglobin concentrations.
- Confers some resistance to malaria, by inhibiting parasite growth.
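The notation Glu6Val reads as "residue 6, Glu in the reference, replaced by Val". A small sketch of applying such a substitution to a sequence follows. The helper is hypothetical, and the β-globin fragment is the start of the mature chain (initiator Met removed) written from memory, so treat it as an assumption and check it against the protein record.

```python
import re

# Only the residues needed for the examples in these slides.
THREE_TO_ONE = {"Glu": "E", "Val": "V", "Cys": "C", "Tyr": "Y"}

def apply_substitution(seq, mutation):
    """Apply a substitution such as 'Glu6Val' (1-based position) to seq."""
    match = re.fullmatch(r"([A-Z][a-z]{2})(\d+)([A-Z][a-z]{2})", mutation)
    if match is None:
        raise ValueError(f"cannot parse {mutation!r}")
    ref = THREE_TO_ONE[match.group(1)]
    pos = int(match.group(2))
    alt = THREE_TO_ONE[match.group(3)]
    if seq[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {seq[pos - 1]}")
    return seq[:pos - 1] + alt + seq[pos:]

# Assumed first 15 residues of the mature hemoglobin beta chain.
HBB_START = "VHLTPEEKSAVTALW"
print(apply_substitution(HBB_START, "Glu6Val"))
```

The check on the reference residue is the useful part: it catches numbering mismatches, for example between mature-chain numbering and precursor numbering.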
23 NHLBI web site
24 Exercise!
- Find an appropriate Hemoglobin structure and view it in Cn3D.
- Check the position of the Glu6Val mutation.
25 p53 tumor suppressor protein
- Li-Fraumeni syndrome: only one functional copy of p53 predisposes to cancer.
- Mutations in p53 are found in most tumor types.
- p53 binds to DNA and stimulates another gene to produce p21, which binds to another protein, cdk2. This prevents the cell from progressing through the cell cycle.
26 G. Giglia-Mari, A. Sarasin, Hum. Mutat. (2003) 21: 217-228.
27 Exercise!
- Use Cn3D to investigate the binding of p53 to DNA.
- Formulate a query for Structure that will require the DNA molecules to be present (there are 2 structures like this).
28 Important note!
- Most diseases (e.g. cancer) are complex and
involve multiple factors (not just a single
malfunctioning protein!).
29 Investigating a genetic disease
- The following EST comes from a hemochromatosis patient; your task is to identify the gene and specific mutation causing the illness, and why the protein is not functioning properly.
- The sequence:
  TGCCTCCTTTGGTGAAGGTGACACATCATGTGACCTCTTCAG
  TGACCACTCTACGGTGTCGGGCCTTGAACTACTACCCCCAGA
  ACATCACCATGAAGTGGCTGAAGGATAAGCAGCCAATGGAT
  GCCAAGGAGTTCGAACCTAAAGACGTATTGCCCAATGGGGA
  TGGGACCTACCAGGGCTGGATAACCTTGGCTGTACCCCCTGG
  GGAAGAGCAGAGATATACGTACCAGGTGGAGCACCCAGGCC
  TGGATCAGCCCCTCATTGTGATCTGGG
30 ESTs
- Expressed Sequence Tags: useful for discovering genes, obtaining data on gene expression/regulation, and in genome mapping.
- Short nucleotide sequences (200-500 bases or so) derived from mRNA expressed in cells.
- The introns from the genes will already be spliced out.
- mRNA is unstable, however, and so it is reverse transcribed into cDNA.
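As a rough illustration of the last point, first-strand cDNA is the DNA reverse complement of the mRNA, and an EST read from the sense strand looks like the mRNA with T in place of U. The function names and the six-base message below are made up for the example.

```python
# mRNA base -> complementary DNA base
RNA_TO_DNA_COMPLEMENT = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna):
    """Reverse-transcribe an mRNA (5'->3') into first-strand cDNA (5'->3')."""
    return "".join(RNA_TO_DNA_COMPLEMENT[base] for base in reversed(mrna.upper()))

def sense_strand_read(mrna):
    """Sequence of the mRNA as it would appear in a sense-strand EST."""
    return mrna.upper().replace("U", "T")

mrna = "AUGGCC"  # hypothetical message
print(first_strand_cdna(mrna))
print(sense_strand_read(mrna))
```

Because an EST may come from either strand, a BLAST hit on the minus strand simply means the read corresponds to the first-strand orientation.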
31 Hemochromatosis 2
- BLAST the example EST vs. the Human genome (could take a few minutes).
  - Which chromosome is hit?
  - What is the contig that is hit (reference assembly)?
  - Is the EST identical to the genomic sequence?
  - Take note of the coords of the difference.
- Click on Genome View.
- Select the map element at the bottom corresponding to the contig.
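Noting the coordinates of the difference amounts to walking along an ungapped alignment and reporting where the two sequences disagree. A minimal sketch follows. The EST fragment is copied from the sequence on slide 29; the genomic fragment is an assumption (the same bases with the reference Cys codon TGC, as implied by the peptide EQRYTCQVEHPG on a later slide), and the start coordinates are placeholders rather than real contig positions.

```python
def mismatches(query, subject, query_start=1, subject_start=1):
    """List (query_pos, subject_pos, query_base, subject_base) for an
    ungapped alignment of two equal-length sequences (1-based coords)."""
    if len(query) != len(subject):
        raise ValueError("ungapped comparison needs equal lengths")
    return [
        (query_start + i, subject_start + i, q, s)
        for i, (q, s) in enumerate(zip(query, subject))
        if q != s
    ]

est_fragment     = "GAGCAGAGATATACGTACCAGGTGGAGCACCCAGGC"  # from the EST
genomic_fragment = "GAGCAGAGATATACGTGCCAGGTGGAGCACCCAGGC"  # assumed reference

for q_pos, s_pos, q_base, s_base in mismatches(est_fragment, genomic_fragment):
    print(f"EST {q_base} at {q_pos} vs genome {s_base} at {s_pos}")
```

With a real BLAST hit, `query_start` and `subject_start` would be the start coordinates reported for the aligned segment.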
32 Hemochromatosis 3
- What gene is hit? Zoom in on the BLAST hit a few times.
- Display the entire gene sequence via "dl" and "Display".
- Copy and save the genomic sequence.
- Record the coords for the start of the genomic sequence.
33 Hemochromatosis 4
- Add the UniGene map to the view (if it is not already there). Click on the UniGene link Hs.233325.
- Note: Expression profile presents data for the expression level of the gene in various tissues.
- How many mRNAs and ESTs are there for the HFE gene?
- Take note of the mRNA accession NM_000410.
34 Hemochromatosis 5
- Go to Spidey: http://www.ncbi.nlm.nih.gov/spidey/
- To determine the intron/exon structure, paste the HFE gene sequence into the upper box, and enter the HFE mRNA accession NM_000410 in the lower box.
- Click Align.
35 Hemochromatosis 6
- How many exons are there?
- Which exon codes the residue that is changed in the original EST? (You have to do a little arithmetic!)
- Record some of the protein sequence around the changed residue: EQRYTCQVEHPG
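The "little arithmetic" is a coordinate conversion: find which exon interval contains the genomic position, then add up the lengths of the preceding exons to get the mRNA position and from there the codon. The sketch below assumes a plus-strand gene and 1-based inclusive coordinates; every number in it is hypothetical and is not the real HFE exon table from Spidey.

```python
def locate(exons, genomic_pos):
    """Return (exon_number, mrna_pos) for a genomic position, or None if
    the position falls in an intron or outside the gene."""
    mrna_offset = 0
    for number, (start, end) in enumerate(exons, start=1):
        if start <= genomic_pos <= end:
            return number, mrna_offset + (genomic_pos - start) + 1
        mrna_offset += end - start + 1
    return None

def codon_number(mrna_pos, cds_start):
    """Return (codon_number, position_in_codon), both 1-based."""
    offset = mrna_pos - cds_start
    return offset // 3 + 1, offset % 3 + 1

# Hypothetical exon intervals in genomic coordinates.
exons = [(1, 100), (301, 450), (701, 800)]
exon, mrna_pos = locate(exons, 320)
print(exon, mrna_pos, codon_number(mrna_pos, cds_start=31))
```

For the exercise, the exon intervals come from the Spidey alignment and `cds_start` from the CDS annotation of the mRNA record.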
36 Hemochromatosis 7
- From the Map Viewer page click on the HFE gene link.
- How many HFE transcripts are there? Which is the longest isoform?
- Follow Links to Protein and then to the report for NP_000401.
- Determine the residue number that corresponds to the mutation.
38 RNA splicing and isoforms
39 Hemochromatosis 8
- What effect does the mutation in the original EST have on the protein? (Look at the table for the Genetic Code.)
- Go back to the Gene Report; read the summary and take note of the GeneRIF bibliography; notice the C282Y entries.
- Now go to Links and then to GeneView in dbSNP to see a list of known SNPs.
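Instead of reading the Genetic Code table by eye, the relevant stretch of the EST can be translated and compared with the peptide recorded earlier (EQRYTCQVEHPG). The sketch below builds the standard codon table from its usual compact form; the fragment is copied from the EST on slide 29, and the reading frame was chosen to match that peptide.

```python
# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def translate(dna):
    """Translate a DNA string in frame 1, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

est_fragment = "GAGCAGAGATATACGTACCAGGTGGAGCACCCAGGC"  # from the patient EST
reference_peptide = "EQRYTCQVEHPG"                     # from the protein record

patient_peptide = translate(est_fragment)
changes = [(i + 1, ref, alt)
           for i, (ref, alt) in enumerate(zip(reference_peptide, patient_peptide))
           if ref != alt]
print(patient_peptide, changes)
```

The single difference falls on the cysteine of the reference peptide, which is the residue the C282Y entries refer to.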
40 Hemochromatosis 9
- In the SNP list, note that the one you want is currently shown.
- Select "view rs in gene region" and then click on "view rs" (actually, this is the default view).
- How many nonsynonymous substitutions do you see?
- Do you see the one we are particularly interested in?
41 Digression: SNPs
- Single Nucleotide Polymorphisms.
- A single base change that can occur in a person's DNA.
- On average SNPs occur about 1% of the time; most are outside of protein coding regions.
- Some SNPs may cause a disease; some may be associated with a disease; others may affect disposition to a disease; others may be simple genetic variation.
- dbSNP archives SNPs and other variations such as small-scale deletion/insertion polymorphisms (DIPs), etc.
43 Hemochromatosis 10
- Back to the Gene Report, click on Links and go to OMIM (can also get there via the Map Viewer).
- In the OMIM entry you can read a bit; also click on View List for Allelic Variants, where you can see the mutation again.
44 Hemochromatosis 11
- From the Gene Report again follow Links to Protein and scroll down to NP_000401.
- Click on Domains and then Show Details.
- What is the Conserved Domain in the region of interest?
- Follow the link to the CD.
- Click on View 3D Structure.
45 Hemochromatosis 12
- Look for residue position 282 in the query sequence.
- Highlight that column.
- Is the Cys282 conserved in the family?
- The C282Y mutation therefore likely has the effect of ...
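Checking whether a residue is conserved means mapping the residue number in the query to an alignment column (skipping gaps) and then tallying that column. A minimal sketch follows; the three-row alignment is a toy example, not the real Conserved Domain alignment, and the function names are made up.

```python
from collections import Counter

def column_for_residue(query_row, residue_number):
    """Alignment column (0-based) holding the query's Nth residue (1-based)."""
    seen = 0
    for col, char in enumerate(query_row):
        if char != "-":
            seen += 1
            if seen == residue_number:
                return col
    raise ValueError("residue number beyond end of query")

def conservation(alignment, residue_number):
    """Most common character in the column and the fraction of rows sharing it.
    The first row of the alignment is taken to be the query."""
    col = column_for_residue(alignment[0], residue_number)
    counts = Counter(row[col] for row in alignment)
    residue, n = counts.most_common(1)[0]
    return residue, n / len(alignment)

# Toy alignment (hypothetical); '-' marks a gap.
alignment = [
    "TC-QVEH",
    "TCAQVDH",
    "SC-EVEH",
]
print(conservation(alignment, 2))
print(conservation(alignment, 3))
```

In Cn3D the same judgment is made visually by highlighting the column; the tally just makes the criterion explicit.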
46 Aligning a sequence on a structure with Cn3D (example)
- Example: use structure 1n3eA, align sequence for 1m5xA.
- In Sequence/Alignment Viewer window select the menu item Imports/Show Imports.
- In the Import Viewer window select the menu item Edit/Import Sequences.
- In the Select Chain dialogue box select 1N3E A and click OK.
- In the Select Import Source dialogue box select Network via GI/Accession and click OK.
- In the Import Identifier dialogue box enter the accession 31615545 and click OK. The new sequence will appear.
- Select Algorithms/BLAST single and use the cursor to click anywhere on the 1m5xA sequence to align it using BLAST.
47 Aligning a sequence on a structure with Cn3D (example cont.)
- Select the menu item Alignments/Merge All to make the new alignment appear in the Sequence/Alignment Viewer window.
- The alignment should now appear in the Sequence/Alignment Viewer window; aligned residues will be red.
- Close the Import Viewer window; pick another color style for the alignment, if desired (e.g. identity).
- You can do this with multiple sequences; especially useful if there is no CD for the structure.
48 PDB
49 PDB File Header
HEADER    ISOMERASE/DNA                           01-MAR-00   1EJ9
TITLE     CRYSTAL STRUCTURE OF HUMAN TOPOISOMERASE I DNA COMPLEX
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: DNA TOPOISOMERASE I;
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: C-TERMINAL DOMAIN, RESIDUES 203-765;
COMPND   5 EC: 5.99.1.2;
COMPND   6 ENGINEERED: YES;
COMPND   7 MUTATION: YES;
COMPND   8 MOL_ID: 2;
COMPND   9 MOLECULE: DNA (5'-
COMPND  10 D(CAPAPAPAPAPGPAPCPTPCPAPGPAPAPAPAPAPTP
COMPND  11 TPTPTPT)-3');
COMPND  12 CHAIN: C;
COMPND  13 ENGINEERED: YES;
COMPND  14 MOL_ID: 3;
COMPND  15 MOLECULE: DNA (5'-
COMPND  16 D(CAPAPAPAPAPTPTPTPTPTPCPTPGPAPGPTPCPTP
COMPND  17 TPTPTPT)-3');
COMPND  18 CHAIN: D;
COMPND  19 ENGINEERED: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: HOMO SAPIENS;
SOURCE   3 EXPRESSION_SYSTEM_COMMON: BACULOVIRUS EXPRESSION SYSTEM;
SOURCE   4 EXPRESSION_SYSTEM_CELL: SF9 INSECT CELLS;
SOURCE   5 MOL_ID: 2;
SOURCE   6 SYNTHETIC: YES;
SOURCE   7 MOL_ID: 3;
SOURCE   8 SYNTHETIC: YES
KEYWDS    PROTEIN-DNA COMPLEX, TYPE I TOPOISOMERASE, HUMAN
REMARK   1
REMARK   2
REMARK   2 RESOLUTION. 2.60 ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM     : X-PLOR 3.1
REMARK   3   AUTHORS     : BRUNGER
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 27% PEG 400, 145 MM MGCL2, 20
REMARK 280 MM MES PH 6.8, 5 MM TRIS PH 8.0, 30 MM DTT
REMARK 290
...
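Header records like these are fixed-column text, so they can be read with simple slicing. The sketch below is a hand-rolled illustration, not a full PDB parser: it assumes the classic column layout (HEADER classification in columns 11-50, date in 51-59, ID code in 63-66; COMPND text from column 11) and uses an abbreviated, renumbered excerpt of the COMPND records above as sample input.

```python
def parse_header(line):
    """Split a HEADER record into its fixed-column fields."""
    return {
        "classification": line[10:50].strip(),
        "date": line[50:59].strip(),
        "id": line[62:66].strip(),
    }

def parse_compnd(lines):
    """Group 'KEY: VALUE;' tokens from COMPND records by MOL_ID."""
    text = " ".join(l[10:].strip() for l in lines if l.startswith("COMPND"))
    molecules = []
    for token in text.split(";"):
        if ":" not in token:
            continue
        key, value = (part.strip() for part in token.split(":", 1))
        if key == "MOL_ID":
            molecules.append({"MOL_ID": value})   # start a new molecule
        elif molecules:
            molecules[-1][key] = value
    return molecules

# Build the HEADER line with explicit padding so the columns are exact.
header_line = "HEADER".ljust(10) + "ISOMERASE/DNA".ljust(40) + "01-MAR-00" + "   " + "1EJ9"

# Abbreviated excerpt (continuation numbers renumbered for the example).
compnd_lines = [
    "COMPND    MOL_ID: 1;",
    "COMPND   2 MOLECULE: DNA TOPOISOMERASE I;",
    "COMPND   3 CHAIN: A;",
    "COMPND   4 MOL_ID: 2;",
    "COMPND   5 CHAIN: C;",
]

print(parse_header(header_line))
print(parse_compnd(compnd_lines))
```

This is the kind of information that the next slides describe being extracted and indexed when a PDB entry is imported into MMDB.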
51 From Coordinates to Models
1EJ9: Human topoisomerase I
52 Building the Structure Summary
[Diagram: the Structure Summary links to Taxonomy, PubMed, Protein, 3D Domains, Domains, Nucleotide]
53 Indexing into MMDB
Structure
- Import only experimentally determined structures
- Convert to ASN.1
- Verify sequences
- Create backbone model (Cα, P only)
- Create single-conformer model
Add secondary structure
Add chemical bonds
ASN.1 excerpts:
inter-residue-bonds atom-id-1 molecule-id 1, residue-id 1, atom-id 1, atom-id-2 molecule-id 1, residue-id 2, atom-id 9
id 1, name "helix 1", type helix, location subgraph residues interval molecule-id 1, from 49, to 61
54 Structure Indexing
topoisomerase AND 2[dnachaincount] AND human[organism]
- Entrez
  - MMDB-ID
  - MMDB entry date
  - EC number
  - Organism
  - Ligands
  - PDB code
  - PDB name
  - PDB description
- Experimental
  - Method
  - Resolution
- Literature
  - Article title
  - Author
  - Journal
  - Publication date
- Counters
  - Ligand types
  - Modified amino acids
  - Modified nucleotides
  - Modified ribonucleotides
  - Protein chains
  - DNA chains
  - RNA chains
- PDB
  - Accession
  - Release date
  - Class
  - Source
  - Description
  - Comment
55 Creating Sequence Records
One record per chain:
- Protein: 1EJ9A
- Nucleotide: 1EJ9C
- Nucleotide: 1EJ9D
56 Annotating Secondary Structure
1EJ9: Human topoisomerase I
α-Helices, β-strands, coils/loops
57 Creating 3D Domains
3D Domain 0: 1EJ9A0 = entire polypeptide
58 Creating 3D Domains
[Figure labels: 1EJ9A1, 1EJ9A4, 1EJ9A3, 1EJ9A5, 1EJ9A2]
< 3 Secondary Structure Elements
59 3D Domain Indexing
- Entrez
  - SDI
  - MMDB-ID
  - Accession
  - MMDB entry date
  - Organism
  - Domain number
  - Cumulative number
- Literature
  - Article title
  - Author
  - Publication date
- Counters
  - Modified amino acids
  - α-Helices
  - β-Strands
  - Residues
  - Molecular weight
- PDB
  - Accession
  - Release date
  - Class
  - Source
  - Description
  - Comment
Find all viral four helix bundles:
4[helixcount] AND 0[strandcount] AND 0[domainno] AND viruses[organism]
REMEMBER: 3D Domain 0 is the entire polypeptide chain!