
GBIO001-9 Bioinformatics

Slides: 75
Provided by: Mariaj596


1
GBIO001-9 Bioinformatics
  • Introduction

2
Instructors
  • Course instructor
  • Kristel Van Steen
  • Office 0/15
  • kristel.VanSteen_at_ulg.ac.be
  • http://www.montefiore.ulg.ac.be/kvansteen/Teaching20132014.html
  • Practical sessions coordinator
  • Kyrylo Bessonov (Kirill)
  • Office B37 1/16
  • kbessonov_at_ulg.ac.be

3
Overview
  1. Introduction to course scope
  2. Evaluation mode/schedule details
  3. Online systems
  4. Assignment submission system
  5. HW group sign up system
  6. Introduction to R language
  7. Basic syntax and data types
  8. Installation of key R libraries
  9. Introduction to public databases
  10. Homework mini assignment

4
Bioinformatics
  • Definition: the collection, classification,
    storage, and analysis of biochemical and
    biological information using computers, especially
    as applied to molecular genetics and genomics
    (Merriam-Webster dictionary)
  • Definition: a field that works on problems at the
    intersection of biology, computer science and
    statistics

5
Course Scope
  • This course is an introduction to the bioinformatics
    field, covering a wide array of topics
  • a) accessing and working with the main biological
    databases (PubMed, Ensembl)
  • b) sequence alignments
  • c) phylogenetics
  • d) statistical genetics
  • e) microarray/genotype data analysis

6
Course expected outcomes
  • At the end of the course students are expected to
    have gained a taste of various bioinformatics fields
    coupled to hands-on knowledge. Students should be
    able to perform multiple sequence alignments,
    query biological databases programmatically,
    perform GWA and microarray analyses, present
    scientific papers, and apply basic statistics
    (in the context of genetics)

7
Course practical aspects
  • Mode of delivery: in class
  • Activities: individual and group work
  • reading of scientific literature
  • practical assignments (analysis of
    papers/programming in R)
  • in-class group presentations
  • Meeting times
  • Tuesdays from 2pm to 6pm (at the latest)
  • Check the website each week for details
  • Room 1.21, Montefiore Institute (B28)

8
Course practical aspects
  • Course material will be posted one day before
    the next class on Prof. Kristel Van Steen's
    (lectures) and/or Kyrylo Bessonov's (practicals)
    website(s).
  • Assignment submission will be done online via a
    special submission website
  • After the deadline, the assignment should be
    e-mailed to Kyrylo Bessonov (kbessonov_at_ulg.ac.be)

9
What will we be doing?
  • We'll cover selected recent topics in
    bioinformatics both through lectures and
    assignments (including student presentations)
  • that basically means that we'll be reading papers
    from the bioinformatics literature and
    analyzing/critiquing them
  • hands-on lectures that will allow you to
    understand practical aspects of the
    bioinformatics topics
  • Self-learning through assignments

10
How will we do it?
  • Theory classes
  • All course notes are in English.
  • Main instructor: Kristel Van Steen
  • Guest lectures are to be expected on various
    bioinformatics topics
  • The theory part of the course is meant to be
    interactive
  • In-class discussions of papers / topics

11
How will we do it?
  • Practical classes
  • During these classes we will be looking at practical
    aspects of the topics introduced in theory
    classes. It is suggested to execute sample R
    scripts and demonstrations on your PCs.
  • Optional reading assignments will be assigned
  • to prepare for discussions in class based on the
    previously posted papers (not graded as such, but
    counted toward participation grades)
  • Homework assignments are of 3 types (graded)
  • Homework assignments result in a group report
    and can be handed in electronically in French or
    English
  • Homework assignments constitute an important part
    of this course

12
Types of HW assignments
  • The three types of homework assignments are:
  • Literature style assignment (Type 1)
  • A group of students is asked to select a paper
    from the provided ones. The group prepares an
    in-class presentation and a written report
  • All oral presentations of HW1, HW2 and HW3 will be
    done during our last class on Dec 10th, 2013
  • Programming style assignment (Type 2)
  • A group is asked to develop an R code to answer
    assignment questions
  • Classical style assignment (Type 3)
  • A group is provided with questions to be answered
    in the written report. Usually R scripts are
    provided and require execution / modification

13
HW assignments details
  • Every homework assignment involves writing a
    short report of no more than the equivalent of
    four single-spaced typed pages of text, excluding
    figures, tables and bibliography.
  • It should contain an abstract (e.g., depending on
    the homework style: description of the paper
    content, description of the problem) and a
    results/discussion part. If citations are made to
    other papers, there should be a bibliography (any
    style is OK)! Only one report per group is
    needed.
  • One member of the group should submit only the
    selected type of the HW, together with the full
    names of the group participants, via the online system

14
Selection of HW
  • Total of 4 graded assignments.
  • Students are asked to try all 3 types of
    assignments to gain broader exposure to course
    material
  • e.g. if group 1 selected type 1 assignment for
    HW1b, it should select either type 2 or 3 for HW2
  • Assignments will be posted on the practical
    website

15
Assigned HW deadlines
HW ID   Main topic                 Due date
HW1a    Databases                  Oct 8th
HW1b    GWAS                       Nov 8th
HW2     Sequence alignments        Dec 10th
HW3     Microarrays / Clustering   Jan 8th (preliminary)

Notes
1) Type 1 (literature style) presentations for HW1 to HW3
   will all be given during the Dec 10th class
2) The written report should be submitted by the due
   dates shown in the table above
16
Course Grading
  • Written exam: 40% of final mark
  • Multiple choice questions/open book
  • Assignments: 50% of final mark
  • Reports of homework assignments (1 per group)
    are handed in electronically in English or French
  • Participation in group and in-class discussions
    (10%)

17
Course materials
  • These will be posted on both Prof. Kristel Van
    Steen's and Kyrylo Bessonov's websites. Please
    check both sites
  • There is no course book
  • Course syllabus and schedule will be posted online

18
Assignment Submission
  • Step by Step Guide

19
Assignment submission
  • All assignments should be zipped into one file
    (.zip) and submitted online
  • Create a submission account

20
Account creation
  • Any member of the group can submit assignment
  • Account details will be emailed to you
    automatically
  • All GBIO009-1 students should create an account

21
Submit your assignment
  • After account creation, log in to the submission
    page
  • The remaining time to the deadline is displayed. It
    is a good idea to check it from time to time in order
    to stay on top of things
  • The file extension should be .zip
  • You can submit an assignment as many times as you wish
22
(No Transcript)
23
Introduction to R
  • A basic tutorial

24
Definition
  • R is a free software environment for statistical
    computing and graphics [1]
  • R is considered to be one of the most widely used
    languages amongst statisticians, data miners,
    bioinformaticians and others.
  • R is a free implementation of the S language
  • Commercial statistical packages include SPSS,
    SAS and MATLAB

[1] R Core Team, R: A Language and Environment for
Statistical Computing, Vienna, Austria
(http://www.R-project.org/)
25
Why to learn R?
  • Since it is free and open-source, R is widely
    used by bioinformaticians and statisticians
  • It is multiplatform
  • It has a very wide selection of additional
    libraries that allow it to be used in many domains,
    including bioinformatics
  • Main library repositories: CRAN and BioConductor

26
Programming? Should I be scared?
  • R is a scripting language and, as such, is much
    easier to learn than compiled languages such
    as C
  • R has reasonably well written documentation
    (vignettes)
  • Syntax in R is simple and intuitive if one has
    basic statistics skills
  • R scripts will be provided and explained in-class

27
Topics covered in this tutorial
  • Operators / Variables
  • Main object types
  • Plotting and plot modification functions
  • Writing and reading data to/from files

28
Variables/Operators
  • Variables store one element
  • > x <- 25
  • Here the variable x is assigned the value 25
  • Check the value assigned to the variable x
  • > x
  • [1] 25
  • Basic mathematical operators that can be
    applied to variables: (+), (-), (/), (*)
  • Use parentheses to obtain the desired order of
    mathematical operations

29
Arithmetic operators
  • What is the value of small z here?
  • > x <- 25
  • > y <- 15
  • > z <- (x + y)*2
  • > Z <- z*z
  • > z
  • [1] 80
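
The effect of parentheses on the order of operations can be sketched as follows. It reuses the x and y values from this slide; the names without_parens and with_parens are illustrative only.

```r
x <- 25
y <- 15

# Without parentheses, multiplication is evaluated before addition
without_parens <- x + y * 2

# With parentheses, the sum is evaluated first
with_parens <- (x + y) * 2

# R is case sensitive: z and Z are two different variables
z <- with_parens
Z <- z * z

print(c(without_parens, with_parens, Z))
```

Because names are case sensitive, assigning to Z leaves z unchanged.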

30
Vectors
  • Vectors have only 1 dimension and represent an
    enumerated sequence of data. They can also store
    variables
  • > v1 <- c(1, 2, 3, 4, 5)
  • > mean(v1)
  • [1] 3
  • The elements of a vector are specified/modified
    with square brackets (e.g. [number])
  • > v1[1] <- 48
  • > v1
  • [1] 48  2  3  4  5
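
A minimal sketch of common vector operations, building on the v1 vector from this slide. The extra variable names (first, middle, doubled and so on) are chosen only for illustration.

```r
v1 <- c(1, 2, 3, 4, 5)

# Read single elements and ranges with square brackets (indices start at 1)
first  <- v1[1]
middle <- v1[2:4]

# Modify one element in place
v1[1] <- 48

# Arithmetic is applied element by element
doubled <- v1 * 2

# Basic summaries
n_elements <- length(v1)
average    <- mean(v1)
```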

31
Logical operators
  • These operators mostly work on vectors, matrices
    and other data types
  • Type of data is not important, the same operators
    are used for numeric and character data types

Operator    Description
<           less than
<=          less than or equal to
>           greater than
>=          greater than or equal to
==          exactly equal to
!=          not equal to
!x          not x
x | y       x OR y
x & y       x AND y
32
Logical operators
  • Can be applied to vectors in the following way.
    The return value for each element is either TRUE or FALSE
  • > v1
  • [1] 48  2  3  4  5
  • > v1 <= 3
  • [1] FALSE  TRUE  TRUE FALSE FALSE
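
Logical vectors are most useful for selecting elements. The sketch below assumes the same v1 values as above; the bases vector and the other variable names are hypothetical examples.

```r
v1 <- c(48, 2, 3, 4, 5)

# A comparison returns one TRUE/FALSE per element
is_small <- v1 <= 3

# Combine conditions element-wise with & (AND) and | (OR)
in_range <- v1 > 2 & v1 < 10
extreme  <- v1 < 3 | v1 > 40

# Use a logical vector to select elements, or which() to get positions
small_values    <- v1[is_small]
small_positions <- which(is_small)

# The same operators work on character data
bases <- c("A", "C", "G", "T")
is_purine <- bases == "A" | bases == "G"
```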

33
R workspace
  • Display all workspace objects (variables,
    vectors, etc.) via ls()
  • > ls()
  • [1] "Z"  "v1" "x"  "y"  "z"
  • Useful tip: to save the workspace and restore it
    from a file use
  • > save.image(file = "workplace.rda")
  • > load(file = "workplace.rda")
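
A hedged sketch of a save/restore round trip. It uses save() on two named objects and a temporary file (both are choices made for this example) so that it does not overwrite anything in your working directory; save.image() does the same for every object in the global workspace.

```r
x <- 25
v1 <- c(48, 2, 3, 4, 5)

# Save two named objects to a temporary file
# (save.image() does the same for every object in the global workspace)
workspace_file <- tempfile(fileext = ".rda")
save(x, v1, file = workspace_file)

# Remove the objects and confirm that x is gone
rm(x, v1)
gone <- !exists("x", inherits = FALSE)

# Restore the objects from the file; load() returns their names
restored <- load(file = workspace_file)
```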

34
How to find help info?
  • Any function in R has help information
  • To invoke help use the ? operator or help()
  • ?function_name
  • ?mean
  • help(mean, try.all.packages = TRUE)
  • To search in all packages installed in your R
    installation always use try.all.packages = TRUE in
    help()
  • To search for a keyword in R documentation use
    help.search()
  • help.search("mean")

35
Basic data types
  • Data could be of 3 basic data types
  • numeric
  • character
  • logical
  • Numeric variable type
  • > x <- 1
  • > mode(x)
  • [1] "numeric"

36
Basic data types
  • Logical variable type (TRUE/FALSE)
  • > y <- 3 < 4
  • > mode(y)
  • [1] "logical"
  • Character variable type
  • > z <- "Hello class"
  • > mode(z)
  • [1] "character"
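
The three basic types can be checked side by side. This sketch reuses the x, y and z values from the slides; the vectors modes and checks are illustrative names.

```r
x <- 1
y <- 3 < 4
z <- "Hello class"

# mode() reports the basic storage type of each object
modes <- c(mode(x), mode(y), mode(z))

# is.*() functions test for a specific type and return TRUE or FALSE
checks <- c(is.numeric(x), is.logical(y), is.character(z), is.numeric(z))

print(modes)
print(checks)
```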

37
Objects/Data structures
  • The main data objects in R are
  • Matrices (single data type)
  • Data frames (support various data types)
  • Lists (contain a set of vectors)
  • Other more complex objects with slots
  • Matrices are 2D objects (rows/columns)
  • > m <- matrix(0, 2, 3)
  • > m
  •      [,1] [,2] [,3]
  • [1,]    0    0    0
  • [2,]    0    0    0
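
A short sketch of matrix indexing. It fills the matrix with the numbers 1 to 6 instead of zeros (a change made here so that the selected cells are distinguishable); all variable names are illustrative.

```r
# Fill a 2 x 3 matrix column by column with the numbers 1 to 6
m <- matrix(1:6, nrow = 2, ncol = 3)

dims       <- dim(m)    # number of rows, number of columns
one_cell   <- m[2, 3]   # row 2, column 3
first_row  <- m[1, ]    # leave the column index empty to take a whole row
second_col <- m[, 2]    # leave the row index empty to take a whole column

m[1, 1] <- 0            # modify a single cell
```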

38
Lists
  • Lists contain various vectors. Each vector in the
    list can be accessed with double square brackets
    (e.g. [[number]])
  • > x <- c(1, 2, 3, 4)
  • > y <- c(2, 3, 4)
  • > L1 <- list(x, y)
  • > L1
  • [[1]]
  • [1] 1 2 3 4
  • [[2]]
  • [1] 2 3 4
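
Accessing list elements by position and by name might look like this. L2 and its element names (numbers, label) are hypothetical and only serve the example.

```r
x <- c(1, 2, 3, 4)
y <- c(2, 3, 4)
L1 <- list(x, y)

second_vector <- L1[[2]]     # the whole second vector
one_value     <- L1[[1]][3]  # third element of the first vector

# Elements can also be named and then accessed with $
L2 <- list(numbers = x, label = "sample A")
by_name <- L2$label
n_items <- length(L2)
```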

39
Data frames
  • Data frames are similar to matrices but can
    contain various data types
  • > x <- c(1,5,10)
  • > y <- c("A", "B", "C")
  • > z <- data.frame(x,y)
  • > z
  •    x y
  • 1  1 A
  • 2  5 B
  • 3 10 C
  • To get/change column and row names use colnames()
    and rownames()
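
A sketch of renaming and querying the data frame from this slide. The column names (dose, group) and row names (s1 to s3) are invented for illustration.

```r
x <- c(1, 5, 10)
y <- c("A", "B", "C")
z <- data.frame(x, y)

# Rename columns and rows
colnames(z) <- c("dose", "group")
rownames(z) <- c("s1", "s2", "s3")

# Access a column by name, a cell by row/column name, and rows by condition
doses <- z$dose
cell  <- z["s2", "dose"]
high  <- z[z$dose > 3, ]
```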

40
Factors
  • Factors are special in that they combine an
    integer vector with a character vector. Thus each
    unique value (level) has a corresponding name and number
  • > letters <- c("A","B","C","A","C","C")
  • > letters <- factor(letters)
  • > letters
  • [1] A B C A C C
  • Levels: A B C
  • > summary(letters)
  • A B C
  • 2 1 3
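
The two parts of a factor can be inspected directly. This sketch uses the name letters_raw rather than letters, because letters is also a built-in R constant; that renaming is a choice made for this example only.

```r
letters_raw <- c("A", "B", "C", "A", "C", "C")
f <- factor(letters_raw)

level_names <- levels(f)      # the unique values, sorted
codes       <- as.integer(f)  # the integer code stored for each element
counts      <- table(f)       # how often each level occurs
```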

41
Input/Output
  • To read data into R from a text file use
    read.table()
  • read help(read.table) to learn more
  • scan() is a more flexible alternative
  • > raw_data <- read.table(file = "data_file.txt")
  • To write data from R to a text file use
    write.table()
  • > write.table(mydata, "data_file.txt")
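
A write-then-read round trip, as a minimal sketch. The contents of mydata, the tab separator and the use of a temporary file are assumptions made for this example.

```r
mydata <- data.frame(gene = c("ACT1", "IL23R"), score = c(1.5, 2.0))

# Write to a temporary tab-separated file with a header row
data_file <- tempfile(fileext = ".txt")
write.table(mydata, data_file, sep = "\t", row.names = FALSE, quote = FALSE)

# Read it back; header = TRUE tells R the first line holds column names
raw_data <- read.table(file = data_file, header = TRUE, sep = "\t")
```

Passing the same sep value to both functions keeps the columns aligned when a field contains spaces.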

42
Conversion between data types
  • One can convert one type of data into another
    using the as.xxx() functions, where xxx is a data
    type (e.g. as.numeric(), as.character())
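
A few conversions, sketched with illustrative variable names. The suppressWarnings() call is added here only to keep the output quiet when a conversion fails.

```r
# Character to numeric
n <- as.numeric("3.14")

# Numeric to character
s <- as.character(42)

# Logical to numeric: TRUE becomes 1, FALSE becomes 0
b <- as.numeric(c(TRUE, FALSE, TRUE))

# Text that is not a number becomes NA (R also raises a warning)
bad <- suppressWarnings(as.numeric("abc"))

# Numeric to logical: 0 is FALSE, anything else is TRUE
l <- as.logical(c(0, 1, 2))
```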

43
Plots generation in R
  • R provides a very rich set of plotting
    possibilities
  • The basic command is plot()
  • Many libraries provide their own version of the
    plot() function
  • When R plots graphics it opens a graphical device
    that can be either a window or a file

44
Plotting functions
  • R offers the following array of plotting functions

Function      Description
plot(x)       plots the values of x on the y axis
plot(x,y)     two-variable plot of x and y values (both axes scaled based on the values of x and y)
pie(y)        circular pie chart
boxplot(x)    box plot summarizing the variable via its quantiles
hist(x)       histogram (bar plot of binned values)
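
The functions in the table can be tried with a few lines. This is a sketch under stated assumptions: the data are random numbers, the groups vector is invented, and the plots are written to a temporary PDF file rather than to a window.

```r
set.seed(1)                        # make the random numbers reproducible
x <- rnorm(100)                    # 100 values from a standard normal
groups <- c(a = 10, b = 30, c = 60)

plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)                     # each plot below goes on a new page
plot(x)                            # values of x against their index
plot(x, x^2)                       # two-variable scatter plot
pie(groups)                        # pie chart of three named values
boxplot(x)                         # box plot based on quartiles
h <- hist(x)                       # histogram; the bins are also returned
dev.off()                          # close the device to finish the file

total_in_bins <- sum(h$counts)     # every value falls into exactly one bin
```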
45
Plot modification functions
  • Often R plots are not optimal and one would like
    to add colors, correct the position of the
    legend or make other modifications
  • R has an array of graphical parameters that are a
    bit complex to learn at first glance. The full
    list is documented in help(par)
  • Some of the graphical parameters can be specified
    inside plot() or using other graphical functions
    such as lines()

46
Plot modification functions
Function                                    Description
points(x,y)                                 adds points to the plot using coordinates specified in the x and y vectors
lines(x,y)                                  adds a line using coordinates in x and y
mtext(text, side = 3)                       adds text to a given margin specified by the side number
hist(x)                                     a histogram that bins the values of x into categories represented as bars
arrows(x0,y0,x1,y1, angle = 30, code = 1)   adds an arrow from (x0, y0) to (x1, y1). angle sets the angle of the arrow head and code specifies at which end the head is drawn
abline(h = y)                               draws a horizontal line at the y coordinate
rect(x1, y1, x2, y2)                        draws a rectangle with corners at (x1, y1) and (x2, y2)
legend(x, y, legend)                        adds a legend to the plot at the position specified by x and y
title()                                     adds a title to the plot
axis(side, vect)                            adds an axis on the chosen one of the 4 sides; vect specifies where tick marks are drawn
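
Several of these functions can be layered on top of one plot() call. The sketch below assumes simple made-up data and writes to a temporary PDF file; colors, labels and positions are arbitrary choices.

```r
x <- 1:10
y <- x^2

# Open a file device so the plot is written to a PDF instead of a window
plot_file <- tempfile(fileext = ".pdf")
pdf(plot_file)

plot(x, y, col = "blue", xlab = "x", ylab = "x squared")  # base scatter plot
lines(x, y, col = "grey")                    # connect the points
abline(h = 50, lty = 2)                      # dashed horizontal line at y = 50
mtext("margin note", side = 3)               # text in the top margin
legend("topleft", legend = "y = x^2", col = "blue", pch = 1)
title(main = "Example plot")

dev.off()                                    # close the device to finish the file
```

The modification functions only work after plot() has opened a plot on the current device.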
47
Installation of new libraries
  • There are two main R repositories
  • CRAN
  • BioConductor
  • To install a package/library from CRAN
  • install.packages("seqinr")
  • To install packages from BioConductor
  • source("http://bioconductor.org/biocLite.R")
  • biocLite("GenomicRanges")

48
Installation of new libraries
  • Download and install the latest R version on your PC.
    Go to http://cran.r-project.org/
  • Install the following libraries by running
  • install.packages(c("seqinr", "muscle", "ape",
    "GenABEL"))
  • source("http://bioconductor.org/biocLite.R")
  • biocLite(c("limma", "affy", "hgu133plus2.db", "Biostrings"))
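
Before installing, it can help to check which packages are already present. This is a minimal sketch: the wanted list is an example (stats ships with R and is included only to show an installed package), and the install call is left commented out because it needs internet access.

```r
wanted <- c("seqinr", "ape", "stats")

# Names of all packages currently installed in this R library
installed <- rownames(installed.packages())

# Packages from the wanted list that are not installed yet
missing_pkgs <- setdiff(wanted, installed)

# Uncomment to install whatever is missing from CRAN (needs internet access)
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)

print(missing_pkgs)
```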

49
Conclusions
  • We hope this course will provide you with a
    good array of analytical and practical skills
  • We chose R for this course as it is a very flexible
    language with a large scope of applications and is
    widely used
  • Our next class is October 1st
  • Prof. Kristel van Steen will cover introduction
    to bioinformatics and molecular biology topics

50
What are we looking for?
  • Data databases

51
Biologists Collect Lots of Data
  • Hundreds of thousands of species to explore
  • Millions of written articles in scientific
    journals
  • Detailed genetic information
  • gene names
  • phenotype of mutants
  • location of genes/mutations on chromosomes
  • linkage (distances between genes)
  • High Throughput lab technologies
  • PCR
  • Rapid inexpensive DNA sequencing (Illumina HiSeq)
  • Microarrays (Affymetrix)
  • Genome-wide SNP chips / SNP arrays (Illumina)
  • Must store data such that
  • Minimum data quality is checked
  • Well annotated according to standards
  • Made available to wide public to foster research

52
What is a database?
  • Organized collection of data
  • Information is stored in "records", "fields" and
    "tables"
  • Fields are categories
  • Must contain data of the same type (e.g. columns
    below)
  • Records contain data that is related to one
    object (e.g. protein, SNP) (e.g. rows below)

SNP ID       SNPSeqID             Gene                           +primer                     -primer
D1Mit160_1   10.MMHAP67FLD1.seq   lymphocyte antigen 84          AAGGTAAAAGGCAATCAGCACAGCC   TCAACCTGGAGTCAGAGGCT
M-05554_1    12.MMHAP31FLD3.seq   procollagen, type III, alpha   TGCGCAGAAGCTGAAGTCTA        TTTTGAGGTGTTAATGGTTCT
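
In R, a data frame is a natural stand-in for such a table: rows are records and columns are fields. The sketch below copies two fields from the table above; the column names snp_id and gene are chosen for this example.

```r
# A toy table: each row is a record, each column is a field
snps <- data.frame(
  snp_id = c("D1Mit160_1", "M-05554_1"),
  gene   = c("lymphocyte antigen 84", "procollagen, type III, alpha"),
  stringsAsFactors = FALSE
)

n_records   <- nrow(snps)      # number of records
field_names <- colnames(snps)  # names of the fields

# Every value in a field has the same type
field_types <- sapply(snps, class)

# Look up one record by the value of its ID field
hit <- snps[snps$snp_id == "M-05554_1", ]
```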
53
Genome sequencing generates lots of data
54
Biological Databases
The number of databases is constantly growing!
OBRC (Online Bioinformatics Resources Collection)
listed over 2826 databases as of 2013
55
Main databases by category
  • Literature
  • PubMed: scientific medical abstracts/citations
  • Health
  • OMIM: Online Mendelian Inheritance in Man
  • Nucleotide Sequences
  • Nucleotide: DNA and RNA sequences
  • Genomes
  • Genome: genome sequencing projects by organism
  • dbSNP: short genetic variations
  • Genes
  • Protein: protein sequences
  • UniProt: protein sequences and related
    information
  • Chemicals
  • PubChem Compound: chemical information with
    structures, information and links
  • Pathways
  • BioSystems: molecular pathways with links to
    genes, proteins
  • KEGG Pathway: information on main biological
    pathways

56
Growth of UniProtKB database
  • UniProtKB contains mainly protein sequences
    (entries). The database growth is exponential
  • Data management issues? (e.g. storage, search,
    indexing?)

[Figure: growth in the number of UniProtKB entries.
Source: http://www.ebi.ac.uk/uniprot/TrEMBLstats]
57
Primary and Secondary Databases
Primary databases: REAL EXPERIMENTAL DATA (raw)
  • Biomolecular sequences or structures and
    associated annotation information (organism,
    function, mutation linked to disease,
    functional/structural patterns, bibliographic
    etc.)
Secondary databases: DERIVED INFORMATION (analyzed and annotated)
  • Fruits of analyses of primary data in the
    primary sources (patterns, blocks, profiles etc.
    which represent the most conserved features of
    multiple alignments)
58
Primary Databases
  • Sequence Information
  • DNA: EMBL, GenBank, DDBJ
  • Protein: SwissProt, TrEMBL, PIR, OWL
  • Genome Information
  • GDB, MGD, ACeDB
  • Structure Information
  • PDB, NDB, CCDB/CSD

59
Secondary Databases
  • Sequence-related Information
  • ProSite, Enzyme, REBase
  • Genome-related Information
  • OMIM, TransFac
  • Structure-related Information
  • DSSP, HSSP, FSSP, PDBFinder
  • Pathway Information
  • KEGG, Pathways

60
GenBank database
  • Contains all DNA and protein sequences described
    in the scientific literature or collected in
    publicly funded research
  • One can search by protein name to get DNA/mRNA
    sequences
  • The search results could be filtered by species
    and other parameters

61
GenBank main fields
62
NCBI databases contain more than just DNA and
protein sequences
NCBI main portal: http://www.ncbi.nlm.nih.gov/
63
Fasta format to store sequences
  • The FASTA format is now universal for all
    databases and software that handles DNA and
    protein sequences
  • Specifications
  • One header line
  • starts with > and ends with a return (newline)
  • Saccharomyces cerevisiae strain YC81 actin (ACT1)
    gene
  • GenBank: JQ288018.1
  • >gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds
    TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAAC
    ACCCTGTTCTTTTGACTGAAGCTCCAATGAACCCTAAATCAAACAGAGAA
    AAGATGACTCAAATTATGTTTGAAACTTTCAACGTTCCAGCCTTCTACGT
    TTCCATCCAAGCCGTTTTGTCCTTGTACTCTTCCGGTAGAACTACTGGTA
    TTGTTTTGGATTCCGGTGATGGTGTTACTCACGTCGTTCCAATTTACGCT
    GGTTTCTCTCTACCTCACGCCATTTTGAGAATCGATTTGGCCGGTAGAGA
    TTTGACTGACTACTTGATGAAGATCTTGAGTGAACGTGGTTACTCTTTCT
    CCACCACTGCTGAAAGAGAAATTGTCCGTGACATCAAGGAAAAACTATGT
    TACGTCGCCTTGGACTTCGAGCAAGAAATGCAAACCGCTGCTCAATCTTC
    TTCAATTGAAAAATCCTACGAACTTCCAGATGGTCAAGTCATCACTATTG
    GTAAC
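
Reading a FASTA record needs only base R. The sketch below is an assumption-laden illustration: it uses a shortened copy of the record above (the header written in the NCBI style with | separators), supplied as a character vector instead of a file, and handles a single record only.

```r
# A shortened record; the real sequence on the slide is longer
fasta_lines <- c(
  ">gi|380876362|gb|JQ288018.1| Saccharomyces cerevisiae strain YC81 actin (ACT1) gene, partial cds",
  "TGGCATCATACCTTCTACAACGAATTGAGAGTTGCCCCAGAAGAAC",
  "ACCCTGTTCTTTTGACTGAAGCTCCAATGAACCCTAAATCAAACAGAGAA"
)

# The header is the line that starts with ">"
is_header <- substr(fasta_lines, 1, 1) == ">"
header <- sub("^>", "", fasta_lines[is_header])

# All remaining lines are joined into one sequence string
sequence <- paste(fasta_lines[!is_header], collapse = "")

seq_length  <- nchar(sequence)
base_counts <- table(strsplit(sequence, "")[[1]])
```

For real files, readLines() would supply fasta_lines, and packages such as seqinr (installed earlier in this tutorial) provide ready-made FASTA readers.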

64
OMIM database
  • Online Mendelian Inheritance in Man (OMIM)
  • information on all known Mendelian disorders
    linked to over 12,000 genes
  • Started in the 1960s by Dr. Victor A. McKusick as a
    catalog of Mendelian traits and disorders
  • Linked disease data
  • Links disease phenotypes and causative genes
  • Used by physicians and geneticists

65
OMIM basic search
  • Online tutorial: http://www.openhelix.com/OMIM
  • Each search result entry is marked with a symbol
    (such as *, # or %)
  • # entries are the most informative, as the molecular
    basis of the phenotype-genotype association is
    known
  • We will search for ankylosing spondylitis (AS)
  • AS is characterized by chronic inflammation of the spine

66
OMIM-search results
  • Look for the entries that link to the genes.
    Apply filters if needed

Filter results if known SNP is associated to the
entry
Some of the interesting entries. Try to look for
the ones with the # sign
67
OMIM-entries
68
OMIM Gene ID -entries
69
OMIM-Finding disease linked genes
  • Read the report for the top gene-linked
    phenotype
  • See the Mapping / Linkage heterogeneity section
  • Go back to the original results
  • Previously seen entry: 607562 IL23R

70
PubMed database
  • PubMed is one of the best known databases in the
    whole scientific community
  • Most biology-related literature from all the
    related fields is indexed by this database
  • It has a very powerful mechanism for constructing
    search queries
  • Many search fields and logical operators (AND,
    OR)
  • Provides electronic links to most journals
  • Example of searching by author for articles published
    within 2012-2013

71
Homework 1a
  • Exploring OMIM and PubMed databases

72
Homework 1a
  • Instructions
  • Only one type of HW.
  • This is Type 3 HW
  • Individual work. No groups
  • Total of 2 easy questions to answer
  • Do not forget to take print screen snapshots to
    show your work
  • Due date: October 8th at midnight
  • Upload your completed HW using the submission
    system

73
Homework 1a
  • Even though it is not critical for this HW,
    still register online for HW1a as shown below to
    get into the habit

74
  • Last slide! Thanks for attention!
  • Next class is on Oct 1st!