Midterm Project - PowerPoint PPT Presentation

Slides: 20
Provided by: yiying
Learn more at: https://www.cise.ufl.edu

1
Midterm Project
2
Database Schema
  • GeneIDTable
  • Information about gene and corresponding
    protein
  • gene_id, gene_name, gene_seq, protein_id,
    protein_name, protein_seq, gene_type
  • gene_id primary key (type varchar(255))
  • gene_type type varchar(255)
  • All other entries are of type longtext

3
Database Schema
  • GeneFuncTable
  • Information about gene functions
  • gene_id, gene_fun, comment
  • gene_id foreign key (type varchar(255))
  • All other entries are of type longtext

4
Database Schema
  • ProteinFuncTable
  • Information about protein functions
  • protein_id, protein_fun, comment
  • All entries are of type longtext

5
Database Schema
  • PathwayFuncTable
  • Information about pathway functions
  • pathway_id, pathway_name, pathway_fun,
    pathway_loc, comment
  • All entries are of type longtext

6
Database Schema
  • PathwayTable
  • Information about gene pathway association
  • gene_id, pathway_id
  • gene_id type varchar(255)
  • pathway_id type longtext

7
Database Schema
  • BiologicalProcessTable
  • Gene Ontology related table
  • Information about biological processes of a
    particular gene
  • gene_id, GO_num, biological_process
  • gene_id foreign key (type varchar(255))
  • All other entries are of type longtext

8
Database Schema
  • CellularComponentTable
  • Gene Ontology related table
  • Information about cellular component
  • gene_id, GO_num, cellular_component
  • gene_id foreign key (type varchar(255))
  • All other entries are of type longtext

9
Database Schema
  • MolecularFunctionTable
  • Gene Ontology related table
  • Information about molecular functions
  • gene_id, GO_num, molecular_function
  • gene_id foreign key (type varchar(255))
  • All other entries are of type longtext
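
The schema slides above could be turned into DDL roughly as follows. This is a minimal sketch, not the course's own setup script: it assumes a MySQL-style dialect (the slides name the types varchar(255) and longtext but not the database engine), the class name SchemaSketch is hypothetical, and only two of the tables are shown.

```java
// Hypothetical helper: builds CREATE TABLE statements for two of the tables.
// Assumes MySQL-style syntax; adjust for other engines.
final class SchemaSketch {

    // GeneIDTable: gene_id is the primary key, gene_type is varchar(255),
    // and every other column is longtext, as listed on the slide.
    static String geneIdTableDdl() {
        return "CREATE TABLE GeneIDTable ("
             + "gene_id VARCHAR(255) NOT NULL, "
             + "gene_name LONGTEXT, "
             + "gene_seq LONGTEXT, "
             + "protein_id LONGTEXT, "
             + "protein_name LONGTEXT, "
             + "protein_seq LONGTEXT, "
             + "gene_type VARCHAR(255), "
             + "PRIMARY KEY (gene_id))";
    }

    // BiologicalProcessTable: gene_id is a varchar(255) foreign key that
    // points back to GeneIDTable; the remaining columns are longtext.
    static String biologicalProcessTableDdl() {
        return "CREATE TABLE BiologicalProcessTable ("
             + "gene_id VARCHAR(255), "
             + "GO_num LONGTEXT, "
             + "biological_process LONGTEXT, "
             + "FOREIGN KEY (gene_id) REFERENCES GeneIDTable (gene_id))";
    }
}
```

Each string can be passed to java.sql.Statement.executeUpdate once a connection to the species database is open. GeneIDTable has to be created first so the foreign key has a target.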

10
Steps to Follow: Step 1
  • Get the RefSeq accession number of your species
    from the NCBI Genome database
  • e.g. NC_000913 for Escherichia coli K12

11
Steps to Follow: Step 2
  • Download the required files from the NCBI FTP site
    (ftp://ftp.ncbi.nlm.nih.gov)
  • genomes/Bacteria/<species name>/<RefSeq accession>.gbk
    (main information for genes, proteins, and GO
    functions)
  • e.g. genomes/Bacteria/Escherichia_coli_k12/NC_000913.gbk
  • genomes/Bacteria/<species name>/<RefSeq accession>.ffn
    (gene sequences)
  • e.g. genomes/Bacteria/Escherichia_coli_k12/NC_000913.ffn
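
The two file locations follow one pattern, so they can be built from the species directory and the accession number. A minimal sketch, assuming the directory layout given on this slide (the live FTP server may be organized differently today); RefSeqPaths and fileUrl are hypothetical names.

```java
// Hypothetical helper: builds the download location of a RefSeq file
// using the genomes/Bacteria/<species>/<accession>.<ext> layout from the slide.
final class RefSeqPaths {

    static final String FTP_ROOT = "ftp://ftp.ncbi.nlm.nih.gov/";

    // speciesDir: directory name, e.g. "Escherichia_coli_k12"
    // accession:  RefSeq accession number, e.g. "NC_000913"
    // extension:  "gbk" (genes, proteins, GO functions) or "ffn" (gene sequences)
    static String fileUrl(String speciesDir, String accession, String extension) {
        return FTP_ROOT + "genomes/Bacteria/" + speciesDir + "/"
             + accession + "." + extension;
    }
}
```

For the E. coli example, fileUrl("Escherichia_coli_k12", "NC_000913", "gbk") gives the .gbk location and the same call with "ffn" gives the sequence file.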

12
Steps to Follow: Step 3
  • Go to KEGG selected organisms
    (http://www.genome.jp/kegg/catalog/org_list.html)
  • Find your species and click its entry in the
    second column (e.g. eco for E. coli)
  • Go to the pathway maps to get the pathway information
    to put into the PathwayFunc table

13
Steps to Follow: Step 4
  • Use the eutils functions of NCBI Entrez to get the
    file that contains the gene pathway association
    (http://eutils.ncbi.nlm.nih.gov/entrez/eutils/)
  • Use esearch to search for your species in the gene
    database:
    http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=database&term=query&usehistory=y
  • Use efetch to fetch the result file:
  • http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=database&WebEnv=WebEnvString&query_key=key
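
The esearch and efetch calls differ only in their query parameters (db, term, usehistory, WebEnv, query_key), so the URLs can be assembled in code. A minimal sketch: the class EutilsUrls and its method names are hypothetical, the base URL is the one from the slide, and the regex-based element reader is a shortcut that assumes the simple, attribute-free WebEnv and QueryKey elements of an esearch result, not a general XML parser.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for building E-utilities request URLs.
final class EutilsUrls {

    static final String BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/";

    // esearch with usehistory=y stores the hits on the history server.
    static String esearchUrl(String db, String term) {
        return BASE + "esearch.fcgi?db=" + enc(db)
             + "&term=" + enc(term) + "&usehistory=y";
    }

    // efetch retrieves the stored hits via the WebEnv string and query key.
    static String efetchUrl(String db, String webEnv, String queryKey) {
        return BASE + "efetch.fcgi?db=" + enc(db)
             + "&WebEnv=" + enc(webEnv) + "&query_key=" + enc(queryKey);
    }

    // Returns the text of the first <name>...</name> element, or null if absent.
    static String element(String xml, String name) {
        Matcher m = Pattern.compile("<" + name + ">([^<]*)</" + name + ">").matcher(xml);
        return m.find() ? m.group(1) : null;
    }

    // URL-encodes one parameter value (spaces become '+', brackets are escaped).
    private static String enc(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }
}
```

A typical flow would request esearchUrl("gene", ...), read WebEnv and QueryKey from the returned XML with element(...), and pass both to efetchUrl.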

14
Steps to Follow: Step 5
  • Edit the .gbk file to remove its beginning and
    end parts
  • Parse the .gbk and .ffn files to fill all the
    tables except the PathwayFunc table and the Pathway
    table
  • Link to the sample parser file
  • Parse.java
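
For orientation, the .ffn file is in FASTA format: each record is a header line starting with '>' followed by one or more sequence lines. The sketch below reads such records into a map. It is an illustration only and is not the course's Parse.java; the class FfnReader and its parse method are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical FASTA reader for .ffn content (not the course's Parse.java).
final class FfnReader {

    // Maps each header (without the leading '>') to its concatenated sequence,
    // keeping the records in file order.
    static Map<String, String> parse(List<String> lines) {
        Map<String, String> records = new LinkedHashMap<>();
        String header = null;
        StringBuilder sequence = new StringBuilder();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) {
                continue; // skip blank lines
            }
            if (line.startsWith(">")) {
                // A new header closes the previous record, if any.
                if (header != null) {
                    records.put(header, sequence.toString());
                }
                header = line.substring(1);
                sequence.setLength(0);
            } else if (header != null) {
                sequence.append(line); // sequence may span several lines
            }
        }
        if (header != null) {
            records.put(header, sequence.toString()); // last record
        }
        return records;
    }
}
```

The input list can come from java.nio.file.Files.readAllLines on the downloaded .ffn file; the resulting sequences are what would go into the gene_seq column.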

15
Steps to Follow: Step 6
  • Parse the eutils result file to get the gene
    pathway association
  • Link to the sample parsePath file
  • ParsePath.java

16
Database Name Format
  • Example species: Escherichia Coli K12
  • Species name: Escherichia_Coli_K12
  • Database name: escherichia_coli_k12
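
Read from the example, the naming rule appears to be: replace spaces with underscores for the species name, then lowercase it for the database name. A minimal sketch under that reading; DbNames and its methods are hypothetical names.

```java
import java.util.Locale;

// Hypothetical helper that derives the names used in the example above.
final class DbNames {

    // "Escherichia Coli K12" -> "Escherichia_Coli_K12"
    static String speciesName(String species) {
        return species.trim().replaceAll("\\s+", "_");
    }

    // "Escherichia Coli K12" -> "escherichia_coli_k12"
    static String databaseName(String species) {
        return speciesName(species).toLowerCase(Locale.ROOT);
    }
}
```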

17
Sample Output Files
  • outputFile.txt (output file after parsing .gbk
    and .ffn files)
  • outputPath.txt (output file after parsing gene
    pathway association file)
  • PathwayFunc.txt (output file after analyzing KEGG
    pathways)

18
To Find the Number of Genes
  • Search for your species in the NCBI Gene database
  • e.g. Escherichia coli K12[orgn]
  • Compare the number of genes in your result with
    this number
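
The same check can be scripted if the species search is run through esearch, whose XML result begins with a Count element holding the total number of hits. A minimal sketch, assuming the esearch XML has already been downloaded into a string; GeneCountCheck and its methods are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: compares a parsed gene total with the esearch hit count.
final class GeneCountCheck {

    // Reads the first <Count> element, which holds the total number of hits.
    static int esearchCount(String xml) {
        Matcher m = Pattern.compile("<Count>(\\d+)</Count>").matcher(xml);
        if (!m.find()) {
            throw new IllegalArgumentException("no <Count> element found");
        }
        return Integer.parseInt(m.group(1));
    }

    // True when the number of genes you parsed equals the NCBI count.
    static boolean matches(String xml, int parsedGenes) {
        return esearchCount(xml) == parsedGenes;
    }
}
```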

19
  • Submit your project (the 3 output files, plus the
    parsers if you changed them) to
  • vgummulu_at_cise.ufl.edu
  • Any questions?
  • yizhang_at_cise.ufl.edu
  • anupamd_at_ufl.edu