Title: Introduction to the GCG Wisconsin Package


1
Introduction to the GCG Wisconsin Package
  • The Center for Bioinformatics
  • UNC at Chapel Hill
  • Jianping (JP) Jin Ph.D.
  • Bioinformatics Scientist
  • Phone: (919)843-6105
  • E-mail: jjin_at_email.unc.edu
  • Fax: (919)843-3103

2
What is GCG
  • An integrated package of over 130 programs (the
    GCG Wisconsin Package).
  • For extensive analyses of nucleic acid and
    protein sequences.
  • Associated with most major public nucleic acid
    and protein databases.
  • Runs on the UNIX operating system.

3
Why use GCG
  • Removes the need for the constant collection of
    new software by end users.
  • Removes the need to learn a new interface each
    time new software is released.
  • Provides a flow of analyses within a single
    interface.
  • Unix environment allows users to automate
    complex, repetitive tasks.
  • Allows users to use multiple processors to
    accelerate their jobs.
  • Supports almost all public databases, which can
    be updated daily, and allows fast local searches.

4
Flexibility or Automation
  • 1. MEME: find upstream regulatory motifs
  • 2. MotifSearch: find genes sharing these potential
    regulatory motifs
  • 3. PileUp: multiple sequence alignment
  • 4. Distances: extract pairwise distances from the
    alignment
  • 5. GrowTree: build a phylogenetic tree.
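
The chain of programs above can be sketched as a simple plan in which each step reads the file written by the previous one. This is only an illustration of the automation idea: the file names are hypothetical, no real GCG command-line options are shown, and the strictly linear hand-off is a simplification (MotifSearch, for example, also needs a database to search).

```python
def plan_workflow(programs, first_input):
    """Chain programs so that each step reads the previous step's output file."""
    plan = []
    current = first_input
    for number, program in enumerate(programs, start=1):
        # Hypothetical output file name, used only for this illustration.
        output = f"step{number}_{program.lower()}.out"
        plan.append((program, current, output))
        current = output
    return plan


# Program names are taken from the slide; the input file name is made up.
PROGRAMS = ["MEME", "MotifSearch", "PileUp", "Distances", "GrowTree"]

for program, source, target in plan_workflow(PROGRAMS, "upstream_regions.list"):
    print(f"{program}: {source} -> {target}")
```

In practice each planned step would be turned into a real command line and run in order, which is what a UNIX shell script does.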

5
Interfaces
  • Command Line: running programs from the UNIX
    system prompt.
  • SeqLab: graphical user interface, requiring an X
    Windows display.
  • SeqWeb: web interface to a core set of sequence
    analysis programs.

6
Limitations with GCG
  • The GUI does not give users full access to the
    power of the command line, nor to the complete
    set of programs.
  • Many programs place a limit on the maximum size
    of the sequences that they can handle (350 Kb).
    This limitation will be removed in version 11.
7
Databases GCG Supports
  • Nucleic acid databases
      • GenBank
      • EMBL (abridged)
  • Protein databases
      • NRL_3D
      • UniProt (SWISS-PROT, PIR, TrEMBL)
  • PROSITE, Pfam
  • Restriction Enzymes (REBASE)

8
Database Update Services
  • DataServe: automatically updates nucleic acid
    data on a daily basis via FTP.
  • DataExtended: the most complete set of nucleic
    acid and protein data. The timing of each release
    is coordinated with the major GenBank releases,
    every 2-3 months.
  • DataBasic: similar to DataExtended, but excludes
    EST and GSS data from GenBank and EMBL.

9
File Importing and Exporting
  • Reformat
  • FromEMBL
  • FromGenBank
  • FromPIR, ToPIR
  • FromStaden, ToStaden
  • FromIG, ToIG
  • FromFastA, ToFastA

10
File Formats with GCG
  • Single sequence files (in GCG format)
  • List (a list of files)
  • MSF (multiple sequence format)
  • RSF (rich sequence format)

11
Typical program
12
Result from MAP analysis
13
X-Windows server must be running
14
SeqLab Main Window (List Mode)
15
SeqLab Editor Mode
16
Display by Features
17
SeqLab Editor Mode (cont.)
18
SeqLab Output Manager
19
GCG Programs
  • 1. Comparison
  • 2. Database Searching and Retrieval
  • 3. DNA/RNA Secondary Structure
  • 4. Editing and Publication
  • 5. Evolution
  • 6. Fragment Assembly
  • 7. Importing and exporting
  • 8. Mapping
  • 9. Primer Selection
  • 10. Protein Analysis
  • 11. Translation

20
Create your own sequence
21
PlasmidMap
22
FindPatterns
23
HmmerPfam Analysis
24
Gene Finding (FRAME)
25
Restriction Enzyme Map
26
Consensus Sequence
27
Phylogenetic Tree (Cladogram)
28
Peptide Structure
29
Peptide Structure (2)
30
Isoelectric Analysis
31
Transmembrane Domains
32
Nucleic Acid Secondary Structure
33
Pairwise Comparison (Gap)
  • Needleman-Wunsch algorithm.
  • A global alignment covering the whole length of
    both sequences; gaps are inserted so that the two
    aligned sequences have the same length.
  • Good when two sequences are closely related.
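
A minimal sketch of Needleman-Wunsch global alignment may make the idea concrete. It assumes a simple match/mismatch score and a linear gap penalty, with illustrative values chosen here; Gap itself uses scoring matrices and separate gap creation and extension penalties, so this is not a reimplementation of the program.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # pair two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = needleman_wunsch("GATTACA", "GATACA")
print(score)
print(top)
print(bottom)
```

Both returned strings always have the same length, which is the "same length with inserted gaps" property described above.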

34
Pairwise Comparison (BestFit)
  • Algorithm of Smith and Waterman.
  • Local homology alignment that finds the best
    segment of similarity between two sequences.
  • The most sensitive sequence comparison method
    available.
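
For comparison with the global method, here is a minimal sketch of Smith-Waterman local alignment. The scoring values and the linear gap penalty are assumptions made for illustration; BestFit itself uses scoring matrices and separate gap creation and extension penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, segment_a, segment_b) for the best local alignment."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for pairing a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # h[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            h[i][j] = max(
                0,                            # start a new alignment here
                h[i - 1][j - 1] + sub(i, j),  # pair two residues
                h[i - 1][j] + gap,            # gap in b
                h[i][j - 1] + gap,            # gap in a
            )
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        if h[i][j] == h[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTTTGATTACATTTT", "CCCGATTACACCC"))
```

The only change from the global recurrence is the floor at zero, which lets an alignment start and end anywhere and so isolates the best matching segment.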

35
Comparison of two sequences
36
GapShow
37
Multiple Comparison (PileUp)
  • Uses the method of Feng and Doolittle, similar to
    that of Higgins and Sharp.
  • A series of progressive pairwise alignments (up
    to 500 sequences) generates the final alignment.
  • An extension of Gap, so it is not ideal for
    finding the best local region of similarity, such
    as a shared motif.
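
Progressive methods decide the order of the pairwise alignments by clustering the sequences, most similar first. The sketch below shows only that ordering step, using average-linkage (UPGMA-style) clustering on a table of pairwise distances. The distance values and sequence names are made-up toy data, and the alignment of each merged pair is left out.

```python
from itertools import combinations


def average_distance(c1, c2, dist):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
    return total / (len(c1) * len(c2))


def join_order(names, dist):
    """Return the list of cluster merges, closest clusters first."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        # Pick the two clusters with the smallest average distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_distance(pair[0], pair[1], dist),
        )
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges


# Toy distances, invented for this example only.
DIST = {
    frozenset(("human", "chimp")): 0.1,
    frozenset(("human", "mouse")): 0.3,
    frozenset(("chimp", "mouse")): 0.3,
    frozenset(("human", "yeast")): 0.8,
    frozenset(("chimp", "yeast")): 0.8,
    frozenset(("mouse", "yeast")): 0.7,
}

for left, right in join_order(["human", "chimp", "mouse", "yeast"], DIST):
    print(left, "+", right)
```

Each merge corresponds to one pairwise alignment step: first two sequences, then a sequence against an existing alignment, and so on until a single alignment remains.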

38
Multiple Comparison by PileUp
39
Multiple Comparison by PileUp
40
Dendrogram by PileUp
41
Database Search
  • Nearly always employs local alignment algorithms.
  • Often uses heuristic methods (as a screen), such
    as FASTA and BLAST.
  • Sequences that pass the screen are given correct
    local similarity scores, but there is no guarantee
    that all sequences with high Smith-Waterman scores
    pass through the screen.
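
The screening idea can be illustrated with a toy word filter: only database sequences that share at least one exact word of length k with the query are passed on for full scoring. This is a simplified stand-in for the seeding stage of FASTA and BLAST, not their actual algorithms, and the sequences are invented for the example.

```python
def words(seq, k):
    """Return the set of all overlapping words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def screen(query, database, k):
    """Names of database sequences sharing at least one exact k-word with the query."""
    query_words = words(query, k)
    return [name for name, seq in database.items() if query_words & words(seq, k)]


QUERY = "ACGTTGCAAC"
DATABASE = {
    "close": "ACGTTGCAAC",     # identical to the query
    "diverged": "ACGTAGCAAG",  # related, but with scattered substitutions
    "unrelated": "TTTTTTTTTT",
}

print(screen(QUERY, DATABASE, k=6))
print(screen(QUERY, DATABASE, k=4))
```

With the longer word the diverged sequence is never examined, even though a rigorous Smith-Waterman comparison would score it well; a shorter word lets it through at the cost of more candidates to score. The same trade-off lies behind the advice to reduce the word size for non-coding nucleotide searches.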

42
BLAST
  • Accepts multiple sequences as input and lets you
    specify any number of databases, e.g. Blast
    -INfile2=PIR,SWPLUS -INfile=hsp70.msf.
  • Supports 5 BLAST programs, but gapped alignment
    is not available for TBLASTX.
  • For non-coding nucleotide homology searches,
    consider either reducing the word size from 11
    to 6 or 7, or using FASTA.
  • The number of scoring matrices is limited:
    BLOSUM62/45/80 and PAM70 are available for the
    MATRix parameter.

43
Database Search (SSearch)
  • A rigorous Smith-Waterman search for similarity
    between a query sequence and a group of sequences
    of the same type.
  • The most sensitive method available for
    similarity search.
  • Very slow.

44
HmmerSearch
  • Uses a profile HMM as a query to search a sequence
    database.
  • Profile HMM: a position-specific scoring table, a
    statistical model of the consensus of a multiple
    sequence alignment.
  • Output can be used by any GCG program that
    accepts a list file.
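
The "position-specific scoring table" part of a profile can be sketched as follows: count the residues in each column of a gapless alignment, convert the counts to log-odds scores, and slide the table along a target sequence. A real profile HMM also models insertions and deletions, which this sketch omits. The uniform background, the pseudocount, and the example sequences are assumptions made for illustration.

```python
import math


def build_pssm(alignment, alphabet="ACGT", pseudocount=1.0):
    """Build per-column log-odds scores from a gapless alignment."""
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pssm = []
    for col in range(len(alignment[0])):
        column = [seq[col] for seq in alignment]
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            letter: math.log2(((column.count(letter) + pseudocount) / total) / background)
            for letter in alphabet
        })
    return pssm


def best_window(pssm, sequence):
    """Return (score, start) of the highest-scoring window in sequence."""
    width = len(pssm)
    scored = [
        (sum(pssm[k][sequence[start + k]] for k in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Toy alignment, invented for this example.
ALIGNMENT = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
score, start = best_window(build_pssm(ALIGNMENT), "GGCGTATAATGCGC")
print(start, round(score, 2))
```

Unlike a single query sequence, the table rewards the residues that are conserved across the whole family and tolerates variation where the family itself varies.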

45
Profile Hidden Markov Model
46
HmmerSearch
47
HmmerSearch (cont.)
48
HmmerSearch (cont.)
49
HmmerSearch (cont.): histogram of scores
50
HmmerSearch (cont.): resulting alignment
51
NetBLAST
  • Sends your query sequences over the internet to a
    server at NCBI, Bethesda.
  • NetBLAST has some limitations, e.g. TBLASTX
    searches against the nr database are prohibited;
    only Alu, EST, GSS, and STS are allowed.
  • Does not support as many options as are available
    with BLAST.

52
NetBLAST
53
PSIBLAST
  • Similar to BLAST, except that it uses
    position-specific scoring matrices during the
    search.
  • Uses protein sequence(s) to iteratively search
    protein database(s).

54
MEME and MotifSearch
  • MEME (Multiple EM for Motif Elicitation): a tool
    for discovering motifs in a group of DNA or
    protein sequences.
  • Motif: a sequence pattern that occurs repeatedly
    in a group of related sequences.
  • MotifSearch: uses a set of MEME profiles to search
    a database for new sequences similar to the
    original family.

55
MEME PROFILE
56
MEME (cont.)
57
GrowTree (Cladogram)
58
Access to GCG on Campus
  • 1. An Onyen and password, plus a subscription to
    the BioSci service at http://onyen.unc.edu
  • 2. A computer connected to the campus network
  • 3. A PostScript printer connected to the campus
    network
  • 4. SSH Secure Client
  • 5. X-Windows server (optional).

59
Sign up for BioScience
60
Log onto GCG
61
Log onto GCG (cont.)
62
GCG Welcome Page
63
How to get SeqLab to run
  • Start the X-Windows server
  • Log on to the GCG server, nun.isis.unc.edu,
    through SSH Secure Shell Client
  • At the prompt, enter the command export
    DISPLAY=yourMachineIP:0.0
  • Enter the command xterm to open an xterm
    window
  • In the GCG main window, enter the command seqlab
    to start the SeqLab GUI.
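
The DISPLAY value in the steps above follows the usual X11 form host:display.screen. As a small sketch, the same setting can be built in a script before launching an X client; the helper name and the example address are hypothetical, and the launch line is left commented out because it needs a reachable X server.

```python
import os


def x_display_env(client_ip, display=0, screen=0):
    """Return a copy of the environment with DISPLAY pointing at client_ip."""
    env = dict(os.environ)
    env["DISPLAY"] = f"{client_ip}:{display}.{screen}"
    return env


# 192.0.2.10 is a documentation-only address standing in for yourMachineIP.
env = x_display_env("192.0.2.10")
print(env["DISPLAY"])

# To start an X client with this setting (requires a running X server):
# import subprocess
# subprocess.Popen(["xterm"], env=env)
```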

64
How to get SeqLab to run (cont.)