GenBank Files and Libraries
Provided by: MartinS48
1
GenBank Files and Libraries
2
GenBank Files and Libraries
  • GenBank records have a defined structure, making it
    easy to parse the desired information with
    regular expressions
  • GenBank libraries are files that contain many
    individual GenBank records
  • For instance, the GenBank library of human RNA has
    28,158 records

3
GenBank Files and Libraries
  • Methods for processing large files
  • The input record separator variable $/
  • Using offsets to control file access

4
GenBank Files and Libraries
  • The input record separator variable $/
  • By default this input separator is set to \n
  • When a filehandle is read into an array, the input is
    split after each \n character, so each element holds one line
  • @array = <FH>;

5
GenBank Files and Libraries
  • The input separator does not have to be \n
  • You can set this variable in order to split a
    file however you like
  • For instance, GenBank libraries have many
    individual records
  • Suppose you want to create an array where each
    element is a GenBank record from a library

6
GenBank Files and Libraries
  • sub get_next_record {
  •     my($fh) = @_;    # filehandle stored in $fh
  •     my($record) = '';
  •     my($save_input_separator) = $/;
  •     $/ = "//\n";
  •     $record = <$fh>;
  •     $/ = $save_input_separator;    # restore $/
  •     return $record;
  • }

7
GenBank Files and Libraries
  • Using the get_next_record sub
  • while ( $record = get_next_record($fh) ) {
  •     push(@records, $record);
  • }

8
GenBank Files and Libraries
  • Using the input record separator variable $/
  • Essentially, this is the same as collecting a
    file into a scalar and then using the split
    command (except that split discards the //
    terminator, while reading with $/ keeps it)
  • $library = get_file($fh);
  • @records = split(/\/\/\n/, $library);
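
To make the comparison concrete, here is a minimal sketch that runs both approaches on a tiny library. The two-record string and the in-memory filehandle are assumptions made so the example needs no file on disk, and `local $/` is used here in place of the save-and-restore shown in get_next_record.

```perl
use strict;
use warnings;

# Hypothetical two-record library; real GenBank records are far longer.
my $library = "LOCUS AB000001\n//\nLOCUS AB000002\n//\n";

# Approach 1: read records with $/ set to the record terminator.
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";
my @by_separator;
{
    local $/ = "//\n";          # restored automatically at the end of the block
    @by_separator = <$fh>;
}
close($fh);

# Approach 2: split the whole library on the same terminator.
my @by_split = split(/\/\/\n/, $library);

print "Records via \$/:    ", scalar(@by_separator), "\n";
print "Records via split: ", scalar(@by_split), "\n";

# The record contents differ only in the trailing terminator.
print "Terminator kept by \$/ read: ",
    ($by_separator[0] =~ m{//\n\z} ? "yes" : "no"), "\n";
print "Terminator kept by split:   ",
    ($by_split[0] =~ m{//\n\z} ? "yes" : "no"), "\n";
```

Both approaches find the same records. The split version has to hold the entire library in memory first, which matters for large libraries.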

9
GenBank Files and Libraries
  • Using the file offset to control file reading
  • Perl keeps track of its current position while
    reading a file
  • This is similar to how it keeps track of its
    position in a string during pattern matching

10
GenBank Files and Libraries
  • Offsets
  • You can get access to the offset value by using
    the tell function
  • $offset = tell($filehandle);
  • The offset value can be very useful when using
    very large files (like GenBank libraries)
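
As an illustration of tell, the sketch below records the offset at which each record starts. The library string and the in-memory filehandle are hypothetical stand-ins for a real library file.

```perl
use strict;
use warnings;

# Hypothetical two-record library held in a string.
my $library = "LOCUS AB000001\n//\nLOCUS AB000002\n//\n";
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

my @offsets;
{
    local $/ = "//\n";              # read one GenBank record at a time
    my $offset = tell($fh);         # position before the first read
    while (my $record = <$fh>) {
        push @offsets, $offset;     # where the record just read began
        $offset = tell($fh);        # where the next record will begin
    }
}
close($fh);

print "Record start offsets: @offsets\n";
```

The offset is a byte count from the start of the file, so the first record starts at 0 and each later record starts where the previous one ended.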

11
GenBank Files and Libraries
  • example10-8.pl
  • processes a GenBank library using an array of
    records
  • array was made using input separator
  • each GenBank record was parsed to save Accession
    and offset
  • DBM file was written with key = accession and
    value = offset
  • allows rapid access to individual records in the
    file without having to scroll through the whole
    file
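
The indexing idea can be sketched without the book's helper module. In this hedged, self-contained version a plain hash (%index) stands in for the DBM file, the library is a hypothetical string read through an in-memory filehandle, and the accession is pulled out with a single regular expression rather than the module's parsing subroutines.

```perl
use strict;
use warnings;

# Hypothetical two-record library; only the fields needed here are included.
my $library = join '',
    "LOCUS       AB000001\nACCESSION   AB000001\nORIGIN\n        1 acgt\n//\n",
    "LOCUS       AB000002\nACCESSION   AB000002\nORIGIN\n        1 ttga\n//\n";
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";

local $/ = "//\n";                  # one GenBank record per read

# Pass 1: build the index of accession => starting offset.
my %index;                          # stands in for the DBM file
my $offset = tell($fh);
while (my $record = <$fh>) {
    if ($record =~ /^ACCESSION\s+(\S+)/m) {
        $index{$1} = $offset;
    }
    $offset = tell($fh);
}

# Lookup: jump straight to one record instead of rescanning the file.
my $wanted = 'AB000002';
seek($fh, $index{$wanted}, 0) or die "Cannot seek: $!";
my $found = <$fh>;
print $found;
close($fh);
```

A hash index is lost when the program exits. Storing the same key/value pairs in a DBM file is what lets the index be reused across runs.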

12
GenBank Files and Libraries
  • #!/usr/bin/perl
  • # Example 10-8 - make a DBM index of a GenBank library,
  • # and demonstrate its use interactively
  • use strict;
  • use warnings;
  • use BeginPerlBioinfo;    # see Chapter 6 about this module

13
GenBank Files and Libraries
  • # Declare and initialize variables
  • my $fh;
  • my $record;
  • my $dna;
  • my $annotation;
  • my %fields;
  • my %dbm;
  • my $answer;
  • my $offset;
  • my $library = 'library.gb';

14
GenBank Files and Libraries
  • # open DBM file, creating if necessary
  • unless(dbmopen(%dbm, 'GB', 0644)) {
  •     print "Cannot open DBM file GB with mode 0644\n";
  •     exit;
  • }

15
GenBank Files and Libraries
  • # Parse GenBank library, saving accession number and
  • # offset in DBM file
  • # open file and save filehandle in $fh
  • $fh = open_file($library);
  • # get offset  << should start at 0
  • $offset = tell($fh);

16
GenBank Files and Libraries
  • while ( $record = get_next_record($fh) ) {
  •     ($annotation, $dna) = get_annotation_and_dna($record);
  •     %fields = parse_annotation($annotation);
  •     my $accession = $fields{'ACCESSION'};
  •     # extract just the accession number from the accession
  •     # field and remove any trailing spaces
  •     $accession =~ s/^ACCESSION\s*//;
  •     $accession =~ s/\s*$//;
  •     # store the key/value of accession/offset
  •     $dbm{$accession} = $offset;
  •     # get offset for next record
  •     $offset = tell($fh);
  • }

17
GenBank Files and Libraries
  • print "Here are the available accession numbers:\n";
  • print join ( "\n", keys %dbm ), "\n";
  • print "Enter accession number (or quit): ";
  • This was written for a library with 5 records
  • Listing every key is not a great plan for a
    library with thousands

18
GenBank Files and Libraries
  • while( $answer = <STDIN> ) {
  •     chomp $answer;
  •     if($answer =~ /^\s*q/) {
  •         last;
  •     }
  •     $offset = $dbm{$answer};
  •     if (defined $offset) {
  •         seek($fh, $offset, 0);    # <<<<< seek uses offset
  •         $record = get_next_record($fh);
  •         print $record;
  •     } else {

19
GenBank Files and Libraries
  •         print "Do not have an entry for $answer\n";
  •     }
  •     print "\nEnter accession number (or quit): ";
  • }

20
GenBank Files and Libraries
  • seek function
  • seek(FH, $offset, 0);
  • This positions the file pointer for FH at the
    offset value
  • The 3rd argument is 'WHENCE', the point the offset
    is measured from
  • 0 = offset measured from the start of the file
  • 1 = offset measured from the current position
  • 2 = offset measured from the end of the file
    (negative offset expected)
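
The three WHENCE values can be tried out on a small in-memory file. This is a sketch only: the ten-character string is hypothetical, and the named constants SEEK_SET, SEEK_CUR and SEEK_END from the standard Fcntl module are used in place of the bare numbers 0, 1 and 2.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);    # SEEK_SET = 0, SEEK_CUR = 1, SEEK_END = 2

# Hypothetical ten-byte "file" whose contents equal their own offsets.
my $data = "0123456789";
open(my $fh, '<', \$data) or die "Cannot open in-memory file: $!";

# WHENCE 0: offset measured from the start of the file.
seek($fh, 3, SEEK_SET) or die "Cannot seek: $!";
read($fh, my $from_start, 1);

# WHENCE 1: offset measured from the current position
# (the read above left the pointer at offset 4).
seek($fh, 2, SEEK_CUR) or die "Cannot seek: $!";
read($fh, my $from_current, 1);

# WHENCE 2: offset measured from the end of the file, so it is negative.
seek($fh, -2, SEEK_END) or die "Cannot seek: $!";
read($fh, my $from_end, 1);

close($fh);
print "from start: $from_start, from current: $from_current, from end: $from_end\n";
```

Offsets stored in an index are measured from the start of the file, which is why the example program always passes 0 as the third argument.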

21
GenBank Files and Libraries
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
