Title: Tools in bioinformatics
1Tools in bioinformatics
Fall 2009-10
2Goals
Overview
- To provide students with practical knowledge of
bioinformatics tools and their application in
research
Prerequisites
- The course Introduction to bioinformatics
- Familiarity with topics in molecular biology
(cell biology, biochemistry, and genetics)
- Basic familiarity with computers and the internet
3Course website
Administration
http://ibis.tau.ac.il/intro_bioinfo/tools.html
4Administration
- Classes
- A class will be given every two weeks
- There are three class groups:
- Sunday 16:00-18:00
- Monday 12:00-14:00
- Monday 14:00-16:00
- Location
- Computer classroom Sherman 03
5Administration
- Teachers
- Nimrod Rubinstein rubi_at_post.tau.ac.il (Sundays)
- Daiana Alaluf daianaal_at_post.tau.ac.il (Mondays I)
- Osnat Penn penn_at_post.tau.ac.il (Mondays II)
- Reception hours: email your instructor with any
question at any time, or set up an appointment
(Britania 405, 6409245)
6Requirements
- Assignments: 50% of final grade (compulsory)
- Assignments include classwork and homework
- Classwork is planned to be completed during
the lesson and handed in at the end of it. It
will be checked but not graded.
- Homework should be handed in at the following
lesson (two weeks after it is handed out). It
will be checked and graded.
- Final project: 50% of final grade
- When emailing your instructor (a question, your
assignment, or anything else), please state in the
Subject field: Tools in Bioinfo, IDs, CW/HW
number (if relevant)
7BIOINFORMATICS DATABASES
8What's in a database?
- Sequences: genes, proteins, etc.
- Full genomes
- Expression data
- Structures
- Annotation: information about genes/proteins
- function
- cellular location
- chromosomal location
- introns/exons
- phenotypes, diseases
- Publications
9NCBI and Entrez
- One of the largest and most comprehensive
databases, belonging to the NIH (National
Institutes of Health, the primary federal agency
for conducting and supporting medical research in
the USA)
- Entrez is the search engine of NCBI
- Search for genes, proteins, genomes,
structures, diseases, publications, and more
http://www.ncbi.nlm.nih.gov
10PubMed: NCBI's database of biomedical articles
Yang X, Kurteva S, Ren X, Lee S, Sodroski J.
Subunit stoichiometry of human immunodeficiency
virus type 1 envelope glycoprotein trimers during
virus entry into host cells. J Virol. 2006
May;80(9):4388-95.
11Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J Virol[TA]
For the full list of field tags go to Help >
Search Field Descriptions and Tags
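A fielded query like the one above can also be assembled programmatically. The following is a minimal Python sketch; the helper name `fielded_query` is hypothetical (it is not part of any NCBI tool), and it only builds the query string, without contacting PubMed.

```python
# Hypothetical helper: joins (term, tag) pairs into a fielded PubMed query.
def fielded_query(terms, operator="AND"):
    """Build a query string such as 'Yang[AU] AND glycoprotein[TI]'."""
    parts = [f"{term}[{tag}]" for term, tag in terms]
    return f" {operator} ".join(parts)


query = fielded_query([
    ("Yang", "AU"),          # author
    ("glycoprotein", "TI"),  # word in the title
    ("2006", "DP"),          # date of publication
    ("J Virol", "TA"),       # journal title abbreviation
])
print(query)
```

The resulting string can be pasted into the PubMed search box as is.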
12Example
- Retrieve all publications in which the first
author is Davidovich C and the last author is
Yonath A
13Using limits
Retrieve the publications of Yonath A, in the
journals Nature and Proc Natl Acad Sci U S A.,
in the last 5 years
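The same limits can be written directly as a fielded query instead of being set through the limits page. The sketch below is one possible formulation and rests on assumptions: the function name is hypothetical, "the last 5 years" is taken to mean 2005 to 2009 (the course ran in Fall 2009), and the date range uses the `first:last[DP]` form.

```python
# Hypothetical helper: an author restricted to some journals and a year range.
def journal_limit_query(author, journals, first_year, last_year):
    """Combine an author, a set of journals (OR-ed), and a date range."""
    journal_part = " OR ".join(f'"{name}"[TA]' for name in journals)
    return f"{author}[AU] AND ({journal_part}) AND {first_year}:{last_year}[DP]"


limits_query = journal_limit_query(
    "Yonath A",
    ["Nature", "Proc Natl Acad Sci U S A"],
    2005,  # assumed start of "the last 5 years"
    2009,  # assumed end (year of the course)
)
print(limits_query)
```

The parentheses matter: without them, the OR between the two journals would not be grouped as one condition alongside the author and the dates.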
14Google scholar
http://scholar.google.com/
16GenBank: NCBI's gene and protein database
- GenBank is an annotated collection of all
publicly available DNA sequences (and their
amino-acid translations)
- Holds 106.5 billion bases in 108.5 million
sequence records (Oct. 2009)
17Searching NCBI for the protein human CD4
Search demonstration
19Using field descriptions, qualifiers, and Boolean
operators
- Cd4[GENE] AND human[ORGN], or Cd4[gene name] AND
human[organism]
- List of field codes: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
- Boolean operators: AND, OR, NOT
- Note: do not use the field Protein name [PROT],
only [GENE]!
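The same fielded query can be sent to NCBI without the web interface, through the E-utilities ESearch service. The sketch below only constructs the request URL and does not perform the request; the helper name `esearch_url` and the `retmax` value are illustrative assumptions, while the query itself is the one shown above.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Return an ESearch URL for the given Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)  # urlencode escapes [ ] and spaces


url = esearch_url("protein", "Cd4[GENE] AND human[ORGN]")
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record IDs, assuming the service is reachable.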
20This time we directly search in the protein
database
21RefSeq
- Subcollection of NCBI databases with only
non-redundant, highly annotated entries (genomic
DNA, transcript (RNA), and protein products)
23An explanation of GenBank records
24FASTA format
Header line (ID/accession followed by a description):
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQG
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPS
KLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVF
GLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQ
LELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFS
FPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPK
LQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRAT
QLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQ
CLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFC
VRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
sequence
Save accession numbers for future use (makes
searching quicker). RefSeq accession number:
NP_000607.1
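To make the header/sequence layout concrete, here is a minimal FASTA reader in Python. It is a sketch, not a replacement for an established parser: the function names are hypothetical, the embedded example holds only the first two sequence lines of the CD4 record, and pulling the accession out of the fourth `|`-separated field assumes the `gi|...|ref|...|` header style shown on the slide.

```python
def _make_record(header, chunks):
    """Split a header into ID and description, and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):          # a new record starts here
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:          # sequence line of the current record
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


# Truncated example: only the first two sequence lines of the record.
example = """>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQG
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPS
"""
identifier, description, sequence = parse_fasta(example)[0]
accession = identifier.split("|")[3]      # assumes the gi|...|ref|...| layout
print(accession, description, len(sequence))
```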
25Downloading
26Swissprot
- A protein sequence database which strives to
provide a high level of annotation regarding
protein function, domain structure,
post-translational modifications, and variants
- One entry for each protein
http://www.expasy.ch/sprot
28GenBank Vs. Swissprot
GenBank results
Swiss-Prot results
29PDB: Protein Data Bank
- Main database of 3D structures of macromolecules
- Includes 61,000 entries (proteins, nucleic
acids, complex assemblies)
- Is highly redundant
http://www.rcsb.org
30Human CD4 in complex with HIV gp120
PDB ID 1G9M
gp120
CD4
31Accession Numbers
- GenBank / EMBL: two letters followed by six digits, e.g. AY123456, or one letter followed by five digits, e.g. U12345
- RefSeq: distinguished from GenBank accessions by their prefix of two characters plus an underscore, e.g. NP_015325 (NM_ = mRNA transcript, NP_ = protein)
- Swiss-Prot: six characters: 1 [O,P,Q], 2 [0-9], 3 [A-Z,0-9], 4 [A-Z,0-9], 5 [A-Z,0-9], 6 [0-9], e.g. P12345 and Q9JJS7
- PDB: one digit followed by three letters/digits, e.g. 1hxw
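The accession formats listed on this slide translate naturally into regular expressions. The Python sketch below encodes only those rules; the names are hypothetical, the optional `.version` suffix on RefSeq accessions is an added assumption (based on NP_000607.1 seen earlier), and real accession schemes have more variants than these patterns cover. Note that the rules overlap: P12345 fits both the one-letter GenBank/EMBL form and the Swiss-Prot form, so the function returns every matching database rather than a single answer.

```python
import re

# Patterns follow the slide's rules only; real formats have more variants.
PATTERNS = {
    "GenBank/EMBL": re.compile(r"[A-Z]{2}[0-9]{6}|[A-Z][0-9]{5}"),
    "RefSeq": re.compile(r"[A-Z]{2}_[0-9]+(\.[0-9]+)?"),  # version suffix assumed
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "PDB": re.compile(r"[0-9][A-Za-z0-9]{3}"),
}


def classify_accession(accession):
    """Return the names of all databases whose format the accession fits."""
    return [name for name, pattern in PATTERNS.items()
            if pattern.fullmatch(accession)]


for acc in ["AY123456", "U12345", "NP_015325", "P12345", "Q9JJS7", "1hxw"]:
    print(acc, classify_accession(acc))
```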
32GeneCards
- All-in-one database of human genes (a project by
the Weizmann Institute)
- Attempts to integrate as many databases and
publications, and as much available knowledge,
as possible
http://www.genecards.org
34Organism specific databases
- Model organisms have independent databases
HIV database: http://hiv-web.lanl.gov/content/index
35Summary
- General and comprehensive databases
- NCBI, EMBL
- Genome specific databases (to be discussed)
- UCSC, ENSEMBL
- Highly annotated databases
- Human genes
- Genecards
- Proteins
- Swissprot, RefSeq
- Structures
- PDB
36As important
- Google (or any search engine)
37And always remember
- RT(F)M - Read the manual!!! (/help/FAQ)
38GO Gene Ontology
39Gene Ontology
- Strives to provide consistent descriptions of
gene products obtained from different databases
- GO annotations include three hierarchical
ontologies of gene products:
- cellular component(s): the environment in which
the gene product functions
- biological process(es): the biological
program/pathway in which the gene product is
involved
- molecular function(s): the elemental activities
of the gene product
- E.g., cytochrome c:
- cellular components: mitochondrial matrix and
mitochondrial inner membrane
- biological processes: oxidative phosphorylation
and induction of cell death
- molecular functions: oxidoreductase activity
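As an illustration of how the three ontologies group the annotations of one gene product, the cytochrome c example can be held in a small record. This is only a sketch of one possible in-memory representation: the class and field names are hypothetical and do not correspond to any official GO file format or API.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class GOAnnotation:
    """Annotations of one gene product, grouped by the three GO ontologies."""
    gene_product: str
    cellular_components: List[str] = field(default_factory=list)
    biological_processes: List[str] = field(default_factory=list)
    molecular_functions: List[str] = field(default_factory=list)

    def all_terms(self) -> List[str]:
        """Return every annotated term, regardless of ontology."""
        return (self.cellular_components
                + self.biological_processes
                + self.molecular_functions)


cytochrome_c = GOAnnotation(
    gene_product="cytochrome c",
    cellular_components=["mitochondrial matrix",
                         "mitochondrial inner membrane"],
    biological_processes=["oxidative phosphorylation",
                          "induction of cell death"],
    molecular_functions=["oxidoreductase activity"],
)
print(cytochrome_c.all_terms())
```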
40AmiGO: the official GO browser
44Through NCBI
46Enrichment analysis
Reference set: N genes in total, of which K have function f
Query set: n genes in total, of which k have function f
Is k/n > K/N, significantly?
47Statistical significance testing
Problem formulation: in a group of N genes there
are K special ones. If we sample n genes out of
N (without replacement) and find k special
ones, would that be considered a random
outcome? Mathematically, we use the
hypergeometric distribution to compute the
probability of obtaining k or more special ones
in a sample of n.
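The probability described above is the upper tail of the hypergeometric distribution, P(X >= k) = sum over i from k to min(n, K) of C(K, i) * C(N-K, n-i) / C(N, n). The Python sketch below computes it with the standard library only. The function name is hypothetical, the numbers in the usage example are toy values rather than data from the course, and no multiple-testing correction is applied (one would normally be needed when many functions are tested).

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """P(X >= k) when drawing n genes without replacement from N genes,
    K of which are 'special' (hypergeometric upper tail)."""
    upper = min(n, K)                      # cannot draw more special genes than this
    tail = sum(comb(K, i) * comb(N - K, n - i)
               for i in range(max(k, 0), upper + 1))
    return tail / comb(N, n)


# Toy example (hypothetical numbers): 10 genes, 3 special, sample of 4, 2 found.
p = enrichment_p_value(N=10, K=3, n=4, k=2)
print(p)
```

A small p-value suggests that finding k or more special genes in the query set is unlikely under random sampling, i.e. that function f is enriched.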
49Materials & Methods
21,121 siRNA knockdown assays, covering
the entire coding-sequence part of the genome
50Results
273 HIV-dependency factors (HDFs) were discovered
Biological processes
51Subcellular localizations
Molecular functions
52Observations
- Nuclear pore complex: its loss may impede HIV
nuclear access
- Mediator members (Mediator couples TFs to Pol II):
a requirement for activators to bind HIV LTRs
- Enzymes involved in glycosylation: HIV's envelope
protein is heavily glycosylated, which assists in the
virus's entry into cells