Title: Homework 1 and 2 review session
1Homework 1 and 2 review session
- Presented by
- Kirill Bessonov
- November 2012
2HW1 classical Q A (GenomeGraphs) (1)
- First two questions were on Bioconductor
libraries. There are BioC 608 packages - To get citations on particular library use
- citation("library_name")
- You were asked to get genomic data on specific
gene - library(GenomeGraphs)
- download the whole database of Ensemble IDs
- ensembl_Human_Genes useMart("ensembl",dataset"h
sapiens_gene_ensembl") - get info on gene form the database on the
Ensemble ID - gene lt- makeGene(id "ENSG00000115145",
type"ensembl_gene_id", biomart
ensembl_Human_Genes ) - get info on transcript
- transcript lt- makeTranscript(id
"ENSG00000115145", type"ensembl_gene_id",
biomart ensembl_Human_Genes) - gdPlot ( list("gene"gene, "transcripts"transcrip
t)) - retrieve info from the database displaying first
25 entries - getBM(c("ensembl_gene_id", "hgnc_symbol",
"description"), filterc("with_exon_transcript",
"with_protein_id", "with_transcript_variation"),va
lueslist(TRUE, TRUE, TRUE), ensembl_Human_Genes
)125,
3HW1 classical Q A (GenomeGraphs) (2)
- What is the gene name (i.e. hgnc_symbol) and
function represented by the Ensembl ID -
ENSG00000115145? - geneInfogetBM(c("ensembl_gene_id",
"hgnc_symbol", "description"), filterc("with_exon
_transcript", "with_protein_id",
"with_transcript_variation"),valueslist(TRUE,
TRUE, TRUE), ensembl_Human_Genes ) - gt geneInfogeneInfoensembl_gene_id
"ENSG00000115145", - ensembl_gene_id hgnc_symbol
description - 4829 ENSG00000115145 STAM2 signal
transducing adaptor molecule (SH3 domain and ITAM
motif) 2 - How many exons does the ensemble id
ENSG00000115145 has? 51 exons - attr(gene, "ens")
- ensembl_gene_id ensembl_transcript_id
ensembl_exon_id exon_chrom_start exon_chrom_end
rank strand biotype - 1 ENSG00000115145 ENST00000263904
ENSE00001351655 153032117 153032506
1 -1 protein_coding - ENSG00000115145 ENST00000263904
ENSE00002888710 153006659 153006743
2 -1 protein_coding -
- 48 ENSG00000115145 ENST00000494589
ENSE00002785037 153004538 153004636
3 -1 protein_coding - 49 ENSG00000115145 ENST00000494589
ENSE00002808134 153003676 153003822
4 -1 protein_coding - 50 ENSG00000115145 ENST00000494589
ENSE00002929781 153001402 153001471
5 -1 protein_coding - 51 ENSG00000115145 ENST00000494589
ENSE00001828491 153000503 153000527
6 -1 protein_coding
4HW1 classical Q A (GenomeGraphs) (3)
- Execute the following command. How many
chromosomes do you see? - 25 chromosomes. 22 autosomal pairs, 1 sex pair
and one mitochondrial chromosome - Why the number of chromosomes in this Ensembl
dataset is greater than 23 chromosome pairs? What
does MT, X and Y refer to? - Because of the MT chromosome, since X and Y can
be grouped to a single pair - gt getBM("chromosome_name","","",
ensembl_Human_Genes)c(122,433435),1 - 1 "1" "10" "11" "12" "13" "14" "15" "16" "17"
"18" "19" "2" "20" "21" "22" "3" "4" "5" "6"
"7" "8" "9" "MT" "X" "Y"
5HW2 Pairwise alignments (classical QA)
6HW2 Pairwise alignments (classical QA) Q1
- Please align globally using NeedlemanWunsch
algorithm the following DNA sequences. Use - The following scoring rules a) gap -5 b) match
between two bases 5 c) mismatch between two
bases 3
7HW2 Pairwise alignments (classical QA) Q3
- Do local protein alignment using BLOSUM 62 matrix
on the HEAGAWGHEE and PAWHAE sequence. The
scoring rules are a) gap -8 matches and
mismatches are given in BLOSUM 62 matrix.
8HW2 Pairwise alignments (classical QA) Q5
- Produce a dot plot of Human and Mouse p53
proteins from previous question and paste the
plot below. - Complete the lines of R code to get the dot
plot. - Are both proteins similar?
- Yes, very similar since we see clear diagonal
corresponding to gt90 of sequences length - Where is/are the region(s) of greatest variation
occur? - Between 50-100
9HW2 Pairwise alignments (classical QA) Q7
- What global alignment score do you get for the
two p53 proteins, when you use the BLOSUM62 alignm
ent matrix, a gap opening penalty of -10 and a
gap extension penalty of -0.5? Answer score of
1556 - query("p53_HUMAN", "ACP04637")
- p53_HUMAN_seq getSequence(p53_HUMAN)
-
- query("p53_MOUSE", "ACP02340")
- p53_MOUSE_seq getSequence(p53_MOUSE)
- globalAlign lt- pairwiseAlignment(p53_HUMAN_seq,
p53_MOUSE_seq, substitutionMatrix "BLOSUM62",
gapOpening -10, gapExtension -0.5) - Errors the R-code was not stated and the ID of
proteins were not given such as Uniprot ID P04637
10HW2 Computer Style Implementation of NW
algorithm in R
11HW2 Computer style (NW algorithm) 1
- Given the pseudo-code implement NW algorithm in R
- Algorithm has two parts
- Calculation of the alignment F-matrix
- Finding the optimal path(s) through the matrix
for to length(A) F(i,0) ? di for j0 to
length(B) F(0,j) ? dj for i1 to length(A)
for j1 to length(B) Match ?
F(i-1,j-1) S(Ai, Bj) Delete ? F(i-1, j)
d Insert ? F(i, j-1) d F(i,j) ?
max(Match, Insert, Delete)
d gap penalty score i and j positions in A
B sequences
12HW2 Computer style (NW algorithm) 2
- Fmatrix function(A,B)
- fmatrix matrix(0, nrow (nchar(A)1) , ncol
nchar(B)1) - d -8 this is gap penalty
- for(i in 0 nchar(A))
- fmatrixi1,1 d i populates initial
row with gap penalty -
- for(j in 0 nchar(B))
- fmatrix1,j1 d i
-
- for(i in 1 nchar(A))
- for(j in 1 nchar(B))
- score rules(A,B) get me sccore for the
pair of aa or nt - match fmatrixi,j score
- delete fmatrixi,j1 d
- insert fmatrixi1,j d
- fmatrixi1,j1 max(match,delete,insert
) -
-
- colnames(fmatrix) strsplit( paste(" " , B,
sep""), "")1
13HW2 Computer style (NW algorithm) 3
- rules function(A,B)
- s.matrix lt- matrix(rep(0,16), nrow 4, ncol4,
byrowTRUE, dimnames list(c("A","C","G","T"),c
("A","C","T","G"))) - s.matrix"A", c(2,-1,-1,-1)
- s.matrix"C", c(-1,2,-1,-1)
- s.matrix"T", c(-1,-1,2,-1)
- s.matrix"G", c(-1,-1,-1,2)
gt s.matrix A C T G A 2 -1 -1 -1 C -1 2 -1
-1 G -1 -1 2 -1 T -1 -1 -1 2
14HW2 Computer style (NW algorithm) 4
- Check the F-matrix
- fmatrixFmatrix("ATCG", "TG")
- T G
- -32 -32 -32
- A -8 -16 -24
- T -16 -6 -14
- C -24 -14 -4
- G -32 -22 -12
- Start finding the optimal path(s) through the
matrix - AlignmentA ""
- AlignmentB ""
- i nchar(A) 1
- j nchar(B) 1
- while(i gt 1 j gt 1)
-
- CurrentScore fmatrixi,j get score
at current position of F-matrix
15HW1 Computer style (NW algorithm) 5
- Selecting the bottom right cell and starting to
trace-back the path of optimal alignment - AlignmentA ""
- AlignmentB ""
- while(i gt 1 j gt 1)
- CurrentScore fmatrixi,j
- ScoreDiag fmatrixi - 1, j - 1
- ScoreUp fmatrixi, j - 1
- ScoreLeft fmatrixi - 1, j
-
- considering the score came from diagonal
- if (CurrentScore ScoreDiag
s.matrixsubstr(A,i,i), substr(B,j,j)) ) - AlignmentA paste(substr(A,i-1,i-1),Alig
nmentA, sep "") - AlignmentB paste(substr(B,j-1,j-1),Alig
nmentB, sep "") - i i - 1
- j j - 1
-
Which cell of the F-matrix I am now?
On diagonal path previous next cell
16HW2 Computer style (NW algorithm) 6
- considering if the score comes from left
(introducing a gap) - else if(CurrentScore ScoreLeft d)
-
- AlignmentA paste(substr(A,i-1,i-1),AlignmentA
, sep "") - AlignmentB paste( "-", AlignmentB, sep
"") - i i - 1
- considering if the score comes from upper cell
(introducing a gap) - else if(CurrentScore ScoreUp d)
-
- AlignmentA paste( "-", AlignmentA, sep "")
- AlignmentB paste(substr(B,j-1,j-1),
AlignmentB, sep "") - j j 1
- print(AlignmentA)
- print(AlignmentB)
- finalScore cat("Final score ",fmatrix(nchar(A)
1),(nchar(B)1))
17HW2 Computer style (NW algorithm) 7
- The scoring matrices could have been accessed
though character indices not requiring conversion
and making code faster - How one would output more than one BEST possible
alignments? - Please use more comments in your R-code
- Would be nice to see trace-backs visually
- Also the scoring rules were not stated clearly