1
Tools in bioinformatics
Fall 2009-10
2
Goals
Overview
  • To provide students with practical knowledge of
    bioinformatics tools and their application in
    research

Prerequisites
  • The course Introduction to bioinformatics
  • Familiarity with topics in molecular biology
    (cell biology, biochemistry, and genetics)
  • Basic familiarity with computers and the internet

3
Course website
Administration
http://ibis.tau.ac.il/intro_bioinfo/tools.html
4
Administration
  • Classes
  • A class will be given every two weeks
  • There are three class groups:
    Sunday 16:00-18:00
    Monday 12:00-14:00
    Monday 14:00-16:00
  • Location
  • Computer classroom Sherman 03

5
Administration
  • Teachers
  • Nimrod Rubinstein rubi_at_post.tau.ac.il (Sundays)
  • Daiana Alaluf daianaal_at_post.tau.ac.il (Mondays I)
  • Osnat Penn penn_at_post.tau.ac.il (Mondays II)
  • Reception hours: email your instructor any
    question at any time, or set an appointment
    (Britania 405, 6409245)

6
Requirements
  • Assignments: 50% of final grade (compulsory)
  • Assignments include classwork and homework
  • Classwork is planned to be completed during
    the lesson and handed in at the end of it. It
    will be checked but not graded.
  • Homework should be handed in at the following
    lesson (two weeks after it is handed out). It
    will be checked and graded.
  • Final project: 50% of final grade
  • When emailing your instructor (a question, your
    assignment, or anything else), please state in the
    Subject field: Tools in Bioinfo, IDs, CW/HW
    number (if relevant)

7
BIOINFORMATICS DATABASES
8
What's in a database?
  • Sequences: genes, proteins, etc.
  • Full genomes
  • Expression data
  • Structures
  • Annotation: information about genes/proteins
    - function
    - cellular location
    - chromosomal location
    - introns/exons
    - phenotypes, diseases
  • Publications

9
NCBI and Entrez
  • One of the largest and most comprehensive
    databases, belonging to the NIH (National
    Institutes of Health, the primary federal agency
    for conducting and supporting medical research in
    the USA)
  • Entrez is the search engine of NCBI
  • Search for genes, proteins, genomes,
    structures, diseases, publications, and more

http://www.ncbi.nlm.nih.gov
10
PubMed: NCBI's database of biomedical articles
Yang X, Kurteva S, Ren X, Lee S, Sodroski J.
Subunit stoichiometry of human immunodeficiency
virus type 1 envelope glycoprotein trimers during
virus entry into host cells. J Virol. 2006
May;80(9):4388-95.
11
Use fields!
Yang[AU] AND glycoprotein[TI] AND 2006[DP] AND J
Virol[TA]
For the full list of field tags go to Help ->
Search Field Descriptions and Tags
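The fielded query above can also be assembled programmatically. The following is a minimal sketch, not part of the original slides; the helper name pubmed_query is hypothetical, and it only joins value/tag pairs with AND.

```python
def pubmed_query(*terms):
    """Join (value, tag) pairs into one PubMed query string using AND.

    Hypothetical helper for illustration; it does no validation of the tags.
    """
    return " AND ".join(f"{value}[{tag}]" for value, tag in terms)


query = pubmed_query(
    ("Yang", "AU"),          # AU: author
    ("glycoprotein", "TI"),  # TI: word in the article title
    ("2006", "DP"),          # DP: date of publication
    ("J Virol", "TA"),       # TA: journal title abbreviation
)
print(query)
```

The resulting string can be pasted into the PubMed search box as-is.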
12
Example
  • Retrieve all publications in which the first
    author is Davidovich C and the last author is
    Yonath A

13
Using limits
Retrieve the publications of Yonath A, in the
journals Nature and Proc Natl Acad Sci U S A.,
in the last 5 years
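As an alternative to the Limits page, both retrieval exercises can probably be expressed as fielded queries. This is a sketch under assumptions: it uses the PubMed tags [1AU] (first author), [LASTAU] (last author), [TA] (journal) and a year range in [DP]; the range 2005:2009 is only an assumed reading of "the last 5 years" as of this course, and the helper name any_of is hypothetical.

```python
def any_of(values, tag):
    """Build a parenthesised OR group, e.g. ("A"[TA] OR "B"[TA])."""
    return "(" + " OR ".join(f'"{v}"[{tag}]' for v in values) + ")"


# Exercise 1: first author Davidovich C, last author Yonath A
first_last_query = "Davidovich C[1AU] AND Yonath A[LASTAU]"

# Exercise 2: Yonath A in two journals, within an assumed 5-year window
journals = any_of(["Nature", "Proc Natl Acad Sci U S A"], "TA")
limits_query = f"Yonath A[AU] AND {journals} AND 2005:2009[DP]"

print(first_last_query)
print(limits_query)
```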
14
Google scholar
http://scholar.google.com/
15
(No Transcript)
16
GenBank: NCBI's gene and protein database
  • GenBank is an annotated collection of all
    publicly available DNA sequences (and their
    amino-acid translations)
  • Holds 106.5 billion bases in 108.5 million
    sequence records (Oct. 2009)

17
Searching NCBI for the protein human CD4
Search demonstration
18
(No Transcript)
19
Using field descriptions, qualifiers, and boolean
operators
  • Cd4[GENE] AND human[ORGN], or Cd4[gene name] AND
    human[organism]
  • List of field codes:
    http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers
  • Boolean operators: AND, OR, NOT
  • Note: do not use the field Protein name [PROT],
    only [GENE]!
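The same fielded search can be sent to NCBI without the web form, through the Entrez E-utilities esearch endpoint. The sketch below only builds the request URL and does not contact the server; the choice of db="protein" and the helper name esearch_url are assumptions for illustration.

```python
from urllib.parse import urlencode

# Base address of the E-utilities search service
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="protein"):
    """Return an esearch URL for a fielded Entrez query (no request is made)."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


url = esearch_url("Cd4[GENE] AND human[ORGN]")
print(url)
```

Opening the printed URL in a browser should return an XML list of matching record IDs.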

20
This time we directly search in the protein
database
21
RefSeq
  • Subcollection of NCBI databases with only
    non-redundant, highly annotated entries (genomic
    DNA, transcript (RNA), and protein products)

22
(No Transcript)
23
An explanation of GenBank records
24
Fasta format
Header line: ">" followed by the ID/accession and a description.
The sequence follows on the remaining lines.
>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQG
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPS
KLNDRADSRRSLWDQGNFPLIIKNLKIEDSDTYICEVEDQKEEVQLLVF
GLTANSDTHLLQGQSLTLTLESPPGSSPSVQCRSPRGKNIQGGKTLSVSQ
LELQDSGTWTCTVLQNQKKVEFKIDIVVLAFQKASSIVYKKEGEQVEFS
FPLAFTVEKLTGSGELWWQAERASSSKSWITFDLKNKEVSVKRVTQDPK
LQMGKKLPLHLTLPQALPQYAGSGNLTLALEAKTGKLHQEVNLVVMRAT
QLQKNLTCEVWGPTSPKLMLSLKLENKEAKVSKREKAVWVLNPEAGMWQ
CLLSDSGQVLLESNIKVLPTWSTPVQPMALIVLGGVAGLLLFIGLGIFFC
VRCRHRRRQAERMSQIKRLLSEKKTCQCPHRFQKTCSPI
Save accession numbers for future use (makes
searching quicker). RefSeq accession number:
NP_000607.1
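To make the FASTA layout concrete, here is a minimal parsing sketch. It assumes the NCBI-style header with "|" separators as in the CD4 record, and uses only the first two sequence lines of that record as sample input; the function names are hypothetical.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_header(header):
    """Split 'gi|number|ref|accession| description' into (accession, description)."""
    ident, _, description = header.partition(" ")
    fields = ident.split("|")
    accession = fields[3] if len(fields) > 3 else ident
    return accession, description


# Truncated sample: only the beginning of the CD4 sequence
example = """>gi|10835167|ref|NP_000607.1| CD4 antigen precursor [Homo sapiens]
MNRGVPFRHLLLVLQLALLPAATQG
KKVVLGKKGDTVELTCTASQKKSIQFHWKNSNQIKILGNQGSFLTKGPS
"""

header, sequence = parse_fasta(example)[0]
accession, description = split_header(header)
print(accession, description, len(sequence))
```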
25
Downloading
26
Swissprot
  • A protein sequence database which strives to
    provide a high level of annotation regarding:
    the function of a protein, domain structure,
    post-translational modifications, variants
  • One entry for each protein

http://www.expasy.ch/sprot
27
(No Transcript)
28
GenBank Vs. Swissprot
GenBank results
Swiss-Prot results
29
PDB: Protein Data Bank
  • Main database of 3D structures of macromolecules
  • Includes 61,000 entries (proteins, nucleic
    acids, complex assemblies)
  • Is highly redundant

http://www.rcsb.org
30
Human CD4 in complex with HIV gp120
PDB ID: 1G9M (figure labels: gp120, CD4)
31
Accession Numbers
  • GenBank / EMBL: two letters followed by six
    digits, e.g. AY123456, or one letter followed by
    five digits, e.g. U12345
  • RefSeq: accession numbers can be distinguished
    from GenBank accessions by their prefix of
    2 characters + underscore, e.g. NP_015325
    (NM_ = mRNA transcript, NP_ = protein)
  • Swiss-Prot: six characters
    (1: [O,P,Q]; 2: [0-9]; 3: [A-Z,0-9]; 4: [A-Z,0-9];
    5: [A-Z,0-9]; 6: [0-9]), e.g. P12345 and Q9JJS7
  • PDB: one digit followed by three letters/digits,
    e.g. 1hxw
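The accession formats listed above translate directly into regular expressions. The sketch below is an illustration only: the patterns encode just the rules on this slide (real databases have additional formats), the assumption that digits follow the RefSeq underscore is taken from the example, and the function name is hypothetical. Because some formats overlap (P12345 fits both the Swiss-Prot and the one-letter GenBank rule), the function returns every database whose rule matches.

```python
import re

# One pattern per database, following only the rules stated on the slide
PATTERNS = {
    "RefSeq": re.compile(r"[A-Z]{2}_\d+"),
    "Swiss-Prot": re.compile(r"[OPQ][0-9][A-Z0-9]{3}[0-9]"),
    "GenBank/EMBL": re.compile(r"[A-Z]{2}\d{6}|[A-Z]\d{5}"),
    "PDB": re.compile(r"[0-9][A-Z0-9]{3}"),
}


def candidate_databases(accession):
    """Return the names of all databases whose accession rule matches."""
    # Ignore case and drop a version suffix such as ".1"
    core = accession.strip().upper().split(".")[0]
    return [name for name, pattern in PATTERNS.items() if pattern.fullmatch(core)]


for acc in ["AY123456", "U12345", "NP_015325", "P12345", "Q9JJS7", "1hxw"]:
    print(acc, candidate_databases(acc))
```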
32
GeneCards
  • All-in-one database of human genes (a project by
    the Weizmann institute)
  • Attempts to integrate as many databases and
    publications, and as much of the available
    knowledge, as possible

http://www.genecards.org
33
(No Transcript)
34
Organism specific databases
  • Model organisms have independent databases

HIV database: http://hiv-web.lanl.gov/content/index
35
Summary
  • General and comprehensive databases: NCBI, EMBL
  • Genome specific databases (to be discussed):
    UCSC, ENSEMBL
  • Highly annotated databases:
  • Human genes: GeneCards
  • Proteins: Swissprot, RefSeq
  • Structures: PDB

36
As important
  1. Google (or any search engine)

37
And always remember
  • RT(F)M - Read the manual!!! (/help/FAQ)

38
GO: Gene Ontology
39
Gene Ontology
  • Strives to provide consistent descriptions of
    gene products obtained from different databases
  • GO annotations include three hierarchical
    ontologies of gene products:
  • cellular component(s): the environment in which
    the gene product functions
  • biological process(es): the biological
    program/pathway in which the gene product is
    involved
  • molecular function(s): the elemental activities
    of the gene product
  • E.g., cytochrome c:
  • cellular components: mitochondrial matrix and
    mitochondrial inner membrane
  • biological processes: oxidative phosphorylation
    and induction of cell death
  • molecular functions: oxidoreductase activity

40
AmiGO: the official GO browser
41
(No Transcript)
42
(No Transcript)
43
.
.
44
Through NCBI
45
.
.
46
Enrichment analysis
Reference set: N genes in total, K of them with function f
Query set: n genes in total, k of them with function f
Is k/n > K/N, significantly?
47
Statistical significance testing
Problem formulation: in a group of N genes there
are K special ones. If we sample n genes out of
N (without replacement) and find k special
ones, would that be considered a random
outcome? Mathematically, we use the
hypergeometric distribution to compute the
probability of obtaining k or more special ones
in a sample of n.
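The probability described above can be computed directly from binomial coefficients. This is a minimal sketch, assuming Python 3.8 or later for math.comb; the function name and the toy numbers in the usage line are illustrative and not taken from the slides. It evaluates P(X >= k) as the sum over i from k to min(n, K) of C(K, i) * C(N - K, n - i) / C(N, n).

```python
from math import comb


def enrichment_p_value(N, K, n, k):
    """P(at least k special genes in a sample of n), hypergeometric upper tail.

    N: genes in the reference set, K: special genes among them,
    n: genes in the query set (sampled without replacement),
    k: special genes observed in the query set.
    """
    total = comb(N, n)  # number of possible samples of size n
    favourable = sum(
        comb(K, i) * comb(N - K, n - i)  # samples with exactly i special genes
        for i in range(k, min(n, K) + 1)
    )
    return favourable / total


# Toy example: 10 genes, 4 special, sample 3, observe 2 special
p = enrichment_p_value(N=10, K=4, n=3, k=2)
print(p)
```

A small p-value suggests that the function is over-represented in the query set relative to the reference set; the threshold to use, and any correction for testing many functions at once, is left open here.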
48
(No Transcript)
49
Materials and Methods
21,121 siRNA knockdown assays, literally covering
the entire coding-sequence part of the genome
50
Results
273 HIV-dependency factors (HDFs) were discovered
Biological processes
51
Subcellular localizations
Molecular functions
52
Observations
  • Nuclear pore complex: their loss may impede HIV
    nuclear access
  • Mediator members (couple TFs to Pol II):
    requirement for activators to bind HIV LTRs
  • Enzymes involved in glycosylation: HIV's envelope
    protein is heavily glycosylated, assisting in the
    virus's entry to cells