Title: Introduction to the GCG Wisconsin Package
1Introduction to the GCG Wisconsin Package
- The Center for Bioinformatics
- UNC at Chapel Hill
- Jianping (JP) Jin Ph.D.
- Bioinformatics Scientist
- Phone (919)843-6105
- E-mail jjin_at_email.unc.edu
- Fax (919)843-3103
2What is GCG
- An integrated package of over 130 programs (the GCG Wisconsin Package).
- For extensive analyses of nucleic acid and protein sequences.
- Associated with most major public nucleic acid and protein databases.
- Runs on the UNIX operating system.
3Why use GCG
- Removes the need for end users to constantly collect new software.
- Removes the need to learn a new interface each time new software is released.
- Provides a flow of analyses within a single interface.
- The UNIX environment allows users to automate complex, repetitive tasks.
- Allows users to use multiple processors to accelerate their jobs.
- Supports almost all public databases, which can be updated daily, and provides fast local searches.
4Flexibility or Automation
- 1. MEME: find upstream regulatory motifs.
- 2. MotifSearch: find genes sharing these potential regulatory motifs.
- 3. PileUp: build a multiple sequence alignment.
- 4. Distances: extract pairwise distances from the alignment.
- 5. GrowTree: build a phylogenetic tree.
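Step 4 of the workflow above can be illustrated with a minimal Python sketch. It is not the GCG Distances program: it assumes an alignment given as a dictionary of equal-length gapped strings, computes the uncorrected proportion of differing sites (p-distance), and applies the Jukes-Cantor correction as one common choice. The function names and the toy alignment are made up for illustration.

```python
import math
from itertools import combinations


def p_distance(s1, s2, gap="-"):
    """Uncorrected distance: fraction of compared columns that differ.

    Columns where either sequence has a gap are skipped.
    """
    if len(s1) != len(s2):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    diffs = 0
    for x, y in zip(s1, s2):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return diffs / compared


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for nucleotides (infinite if p >= 0.75)."""
    if p >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Return {(name1, name2): p-distance} for every pair in the alignment."""
    return {
        (a, b): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


# Hypothetical toy alignment, for illustration only.
toy = {"seqA": "ACGT", "seqB": "ACGA", "seqC": "TCGA"}
matrix = distance_matrix(toy)
```

A tree-building step such as the one in step 5 would take a matrix like this as its input.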
5Interfaces
- Command line: run programs from the UNIX system prompt.
- SeqLab: a graphical user interface, requiring an X Windows display.
- SeqWeb: a web interface to a core set of sequence analysis programs.
6Limitations with GCG
- The GUI does not give users full access to the power of the command line, nor to the complete set of programs.
- Many programs place a limit on the maximum size of the sequences that they can handle (350 Kb). This limitation will be removed in version 11.
7Databases GCG Supports
- Nucleic acid databases
- GenBank
- EMBL (abridged)
- Protein databases
- NRL_3D
- UniProt (SWISS-PROT, PIR, TrEMBL)
- PROSITE, Pfam
- Restriction Enzymes (REBASE)
8Database Update Services
- DataServe: automatically updates nucleic acid databases on a daily basis via FTP.
- DataExtended: the most complete set of nucleic acid and protein data. The timing of the release is coordinated with the major GenBank releases, every 2-3 months.
- DataBasic: similar to DataExtended, but excludes EST and GSS data from GenBank and EMBL.
9File Importing and Exporting
- Reformat
- FromEMBL
- FromGenBank
- FromPIR ToPIR
- FromStaden ToStaden
- FromIG ToIG
- FromFastA ToFastA
10File Formats with GCG
- Single sequence files (in GCG format)
- List (a list of files)
- MSF (multiple sequence format)
- RSF (rich sequence format)
11Typical program
12Result from MAP analysis
13X-Windows server must be running
14SeqLab Main Window (List Mode)
15SeqLab Editor Mode
16Display by Features
17SeqLab Editor Mode (cont.)
18SeqLab Output Manager
19GCG Programs
- 1. Comparison
- 2. Database Searching and Retrieval
- 3. DNA/RNA Secondary Structure
- 4. Editing and Publication
- 5. Evolution
- 6. Fragment Assembly
- 7. Importing and exporting
- 8. Mapping
- 9. Primer Selection
- 10. Protein Analysis
- 11. Translation
20Create your own sequence
21PlasmidMap
22FindPatterns
23HmmerPfam Analysis
24Gene Finding (FRAME)
25Restriction Enzyme Map
26Consensus Sequence
27Phylogenetic Tree (Cladogram)
28Peptide Structure
29Peptide Structure (2)
30Isoelectric Analysis
31Transmembrane Domains
32Nucleic Acid Secondary Structure
33Pairwise Comparison (Gap)
- Needleman-Wunsch algorithm.
- A global alignment covering the whole length of both sequences; the resulting aligned sequences have the same length, with gaps inserted.
- Good when the two sequences are closely related.
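The global alignment idea can be sketched in a few lines of Python. This is a minimal illustration of the Needleman-Wunsch recurrence, not the Gap program itself: it assumes a simple match/mismatch score and a linear gap penalty, and the scoring values are arbitrary.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


nw_score, nw_a, nw_b = needleman_wunsch("ACGT", "AGT")
```

Both returned strings always have the same length, which is the "same length with inserted gaps" property described above.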
34Pairwise Comparison (BestFit)
- Algorithm of Smith and Waterman.
- A local homology alignment that finds the best segment of similarity between two sequences.
- The most sensitive sequence comparison method available.
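For contrast with global alignment, here is a minimal Python sketch of the Smith-Waterman recurrence. It is an illustration only, not the BestFit program: the match, mismatch and linear gap values are arbitrary assumptions. The key differences are that cell scores never drop below zero and that the traceback starts at the highest-scoring cell.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment; returns (score, aligned_a_segment, aligned_b_segment)."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(
                0,                      # start a new local alignment here
                h[i - 1][j - 1] + sub,
                h[i - 1][j] + gap,
                h[i][j - 1] + gap,
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score falls to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


sw_score, sw_a, sw_b = smith_waterman("TTTTACGTACGTCCCC", "GGGACGTACGTAAA")
```

Only the shared segment is reported; the unrelated flanking regions are left out of the alignment.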
35Comparison of two sequences
36GapShow
37Multiple Comparison (PileUp)
- Uses the method of Feng and Doolittle, which is similar to that of Higgins and Sharp.
- A series of progressive pairwise alignments (up to 500 sequences) generates the final alignment.
- An extension of Gap, so it is not ideal for finding the best local region of similarity, such as a shared motif.
38Multiple Comparison by Pileup
39Multiple Comparison by Pileup
40Dendrogram by Pileup
41Database Search
- Nearly always employs local alignment algorithms.
- Often uses heuristic methods (FASTA and BLAST) as a screen.
- Sequences that pass the screen are given a correct local similarity score, but there is no guarantee that all sequences with high Smith-Waterman scores pass through the screen.
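The screening idea can be sketched in Python as a shared-word filter. This is a deliberately simplified illustration, not the actual FASTA or BLAST heuristic: it assumes a database held as a dictionary of sequences and keeps only entries that share at least a minimum number of exact words of length k with the query. The function names, parameters and toy data are hypothetical.

```python
def words(seq, k):
    """Return the set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen(query, database, k=6, min_shared=1):
    """Return {name: shared_word_count} for entries that pass the screen."""
    query_words = words(query, k)
    hits = {}
    for name, seq in database.items():
        shared = len(query_words & words(seq, k))
        if shared >= min_shared:
            hits[name] = shared
    return hits


# Hypothetical toy database, for illustration only.
toy_db = {
    "s1": "TTACGTACGTTT",
    "s2": "ACGAACGAACGA",
    "s3": "GGGGGGGGGG",
}
strict = screen("ACGTACGTGG", toy_db, k=6)
relaxed = screen("ACGTACGTGG", toy_db, k=3)
```

Only the sequences that pass would then be scored with a rigorous local alignment. With k=6 the weakly similar entry s2 is dropped, while k=3 lets it through, which shows why a screen can miss sequences and why a smaller word size trades speed for sensitivity.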
42BLAST
- Accepts a number of sequences as input and lets you specify any number of databases, e.g. Blast -INfile2=PIR,SWPLUS -INfile=hsp70.msf.
- Supports 5 BLAST programs, but no gapped alignment is available for TBLASTX.
- For non-coding nucleotide homology searches, consider either reducing the word size from 11 to 6 or 7, or using FASTA.
- The number of scoring matrices is limited: BLOSUM62/45/80 and PAM70 are available for the MATRix parameter.
43Database Search (SSearch)
- A rigorous Smith-Waterman search for similarity between a query sequence and a group of sequences of the same type.
- The most sensitive method available for similarity searches.
- Very slow.
44HmmerSearch
- Uses a profile HMM as a query to search a sequence database.
- Profile HMM: a position-specific scoring table, a statistical model of the consensus of a multiple sequence alignment.
- The output can be used by any GCG program that accepts a list file.
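The "position-specific scoring table" idea can be sketched in Python. This is a simplification under stated assumptions, not HMMER: a real profile HMM also models insertions and deletions, while this sketch builds only per-column log-odds scores from an ungapped alignment, using a uniform background and a pseudocount. The function names and toy alignment are hypothetical.

```python
import math


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build one {residue: log-odds score} dictionary per alignment column."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pssm = []
    for col in range(length):
        counts = {ch: pseudocount for ch in alphabet}
        for seq in alignment:
            if seq[col] in counts:
                counts[seq[col]] += 1
        total = sum(counts.values())
        pssm.append(
            {ch: math.log2((counts[ch] / total) / background) for ch in alphabet}
        )
    return pssm


def score_window(pssm, window):
    """Sum the column scores for a window as long as the model."""
    return sum(column[ch] for column, ch in zip(pssm, window))


def best_window(pssm, seq):
    """Return (score, start) of the best-scoring window in seq."""
    width = len(pssm)
    return max(
        (score_window(pssm, seq[i:i + width]), i)
        for i in range(len(seq) - width + 1)
    )


# Hypothetical toy alignment, for illustration only.
model = build_pssm(["TATA", "TATA", "TATT", "TACA"])
top_score, top_start = best_window(model, "GGGTATAGGG")
```

Scoring every sequence in a database this way and ranking by score gives a rough picture of what a profile search does.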
45Profile Hidden Markov Model
46HmmerSearch
47HmmerSearch (cont.)
48HmmerSearch (cont.)
49HS (cont. Histogram of scores)
50HS (cont. resulting alignment)
51NetBLAST
- Sends your query sequences over the Internet to a server at NCBI in Bethesda.
- NetBLAST has some limitations, e.g. TBLASTX searches against the nr database are prohibited; only Alu, EST, GSS and STS can be searched.
- Does not support as many options as are available with BLAST.
52 NetBLAST
53PSIBLAST
- Similar to BLAST, except that it uses position-specific scoring matrices during the search.
- Uses protein sequence(s) to iteratively search protein database(s).
54MEME and MotifSearch
- MEME: Multiple EM for Motif Elicitation, a tool for discovering motifs in a group of DNA or protein sequences.
- Motif: a sequence pattern that occurs repeatedly in a group of related sequences.
- MotifSearch: uses a set of MEME profiles to search a database for new sequences similar to the original family.
55MEME PROFILE
56MEME (cont.)
57GrowTree (Cladogram)
58Access to GCG on Campus
- 1. Onyen and password, plus sign-up for the BioSci service at http://onyen.unc.edu
- 2. Computer connected to the campus network
- 3. PostScript printer connected to the campus network
- 4. SSH Secure Client
- 5. X-Windows Server (optional).
59Sign up BioScience
60Log onto GCG
61Log onto GCG (cont.)
62GCG Welcome Page
63How to get SeqLab to run
- Open X-Windows.
- Log on to the GCG server, nun.isis.unc.edu, through the SSH Secure Shell Client.
- At the prompt, enter the command export DISPLAY=yourMachineIP:0.0
- Enter the command xterm to open an xterm window.
- In the GCG main window, enter the command seqlab to start the SeqLab GUI.
64How to get SeqLab to run (cont.)