Title: GenBank Files and Libraries
1GenBank Files and Libraries
2GenBank Files and Libraries
- GenBank records have a defined structure, making it easy to parse the desired information with regular expressions
- GenBank libraries are files that contain many individual GenBank records
- For instance, the GenBank library of human RNA has 28,158 records
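As a rough illustration of the first point, the sketch below pulls two values out of a record with regular expressions. The record text is an excerpt of the AB031069 entry shown later in these slides; the variable names are illustrative and not taken from the course code.

```perl
use strict;
use warnings;

# Excerpt of a GenBank record header (fields start in column 1).
my $record = <<'END';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
//
END

# The /m modifier lets ^ match at the start of every line in the record,
# so each keyword can be anchored to the beginning of its own line.
my ($accession) = $record =~ /^ACCESSION\s+(\S+)/m;
my ($length)    = $record =~ /^LOCUS\s+\S+\s+(\d+)\s+bp/m;

print "accession: $accession\n";
print "length:    $length bp\n";
```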
3GenBank Files and Libraries
- Methods for processing large files
- The input separator variable $/
- Using offsets to control file access
4GenBank Files and Libraries
- The input separator variable $/
- By default this input separator is set to \n
- When a file is read into an array, each element is split on the \n character
- @array = <FH>;
5GenBank Files and Libraries
- The input separator does not have to be \n
- You can set this variable in order to split a file however you like
- For instance, GenBank libraries have many individual records
- Suppose you want to create an array where each element is a GenBank record from a library
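A minimal sketch of that idea, assuming a tiny made-up library held in a string so the example runs without any data file (the record contents are placeholders, not real GenBank entries). GenBank records end with a line containing only //, so the separator is set to "//\n".

```perl
use strict;
use warnings;

# Two toy records; a real library would be opened from a file instead.
my $library_text = "LOCUS A1\nORIGIN\n acgt\n//\nLOCUS B2\nORIGIN\n ttga\n//\n";
open(my $fh, '<', \$library_text) or die "Cannot open in-memory file: $!";

my @records;
{
    # local() limits the change to this block; $/ reverts to "\n" afterwards.
    local $/ = "//\n";
    @records = <$fh>;    # one array element per GenBank record
}
close($fh);

print scalar(@records), " records\n";
print "first record:\n$records[0]";
```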
6GenBank Files and Libraries
- sub get_next_record {
-     my($fh) = @_;                        # filehandle stored in $fh
-     my($record) = '';
-     my($save_input_separator) = $/;
-     $/ = "//\n";
-     $record = <$fh>;
-     $/ = $save_input_separator;          # restore $/
-     return $record;
- }
7GenBank Files and Libraries
- Using the get_next_record sub
- while($record = get_next_record($fh)) {
-     push(@records, $record);
- }
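The sub and loop above can be tried out with the self-contained sketch below. It is a variant, not the course code: it uses local so that $/ is restored automatically, and it reads from a made-up in-memory library instead of a file.

```perl
use strict;
use warnings;

# Variant of get_next_record: local() restores $/ when the sub returns,
# so no explicit save/restore variable is needed.
sub get_next_record {
    my ($fh) = @_;
    local $/ = "//\n";
    my $record = <$fh>;    # undef once the end of the file is reached
    return $record;
}

# Placeholder library with three toy records.
my $library_text = "LOCUS A1\n//\nLOCUS B2\n//\nLOCUS C3\n//\n";
open(my $fh, '<', \$library_text) or die "Cannot open in-memory file: $!";

my @records;
while (defined(my $record = get_next_record($fh))) {
    push(@records, $record);
}
close($fh);

print "read ", scalar(@records), " records\n";
```

Testing with defined guards against a final chunk that happens to evaluate as false, such as a file ending in a lone "0".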
8GenBank Files and Libraries
- Using the input separator variable $/
- Essentially, this is the same as collecting a file into a scalar and then using the split command
- $library = get_file($fh);
- @records = split(/\/\/\n/, $library);
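The two approaches can be compared side by side. The sketch below uses a made-up two-record library in a string; it shows one practical difference, namely that split discards the "//\n" terminator while reading with $/ keeps it at the end of each record.

```perl
use strict;
use warnings;

# Placeholder library text standing in for the contents of a file.
my $library = "LOCUS A1\n//\nLOCUS B2\n//\n";

# Approach 1: whole file in a scalar, then split on the terminator.
my @by_split = split(/\/\/\n/, $library);

# Approach 2: read record by record with the input separator.
open(my $fh, '<', \$library) or die "Cannot open in-memory file: $!";
my @by_separator;
{
    local $/ = "//\n";
    @by_separator = <$fh>;
}
close($fh);

print "split:     [$by_split[0]]\n";
print "separator: [$by_separator[0]]\n";
```

For a large library the second approach can also be used one record at a time, so the whole file never has to sit in memory.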
9GenBank Files and Libraries
- Using the file offset to control file reading
- Perl keeps track of where it is while reading a file
- This is similar to how it keeps track of where it is during pattern matching
10GenBank Files and Libraries
- Offsets
- You can get access to the offset value by using the tell function
- $offset = tell(filehandle);
- The offset value can be very useful when using very large files (like GenBank libraries)
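A small sketch of tell in action, using a made-up two-line in-memory file. The offset is a byte count from the start of the file and advances as lines are read.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

my $start = tell($fh);          # position before anything is read
my $line  = <$fh>;              # read "first line\n"
my $after_first = tell($fh);    # position just past the first newline

close($fh);

print "offset at start:        $start\n";
print "offset after one line:  $after_first\n";
```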
11GenBank Files and Libraries
- example10-8.pl
- processes a GenBank library using an array of records
- array was made using input separator
- each GenBank record was parsed to save accession and offset
- DBM file was written with key = accession and value = offset
- allows rapid access to individual records in the file without having to scroll through the whole file
12GenBank Files and Libraries
- #!/usr/bin/perl
- # Example 10-8 - make a DBM index of a GenBank library,
- #   and demonstrate its use interactively
- use strict;
- use warnings;
- use BeginPerlBioinfo;     # see Chapter 6 about this module
13GenBank Files and Libraries
- # Declare and initialize variables
- my $fh;
- my $record;
- my $dna;
- my $annotation;
- my %fields;
- my %dbm;
- my $answer;
- my $offset;
- my $library = 'library.gb';
14GenBank Files and Libraries
- # open DBM file, creating if necessary
- unless(dbmopen(%dbm, 'GB', 0644)) {
-     print "Cannot open DBM file GB with mode 0644\n";
-     exit;
- }
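To see what dbmopen provides, the sketch below stores one key/value pair, closes the DBM file, and reads the value back through a second hash. The temporary directory, the hash name %again, and the stored offset of 0 are assumptions made for the demonstration; the course example uses the fixed name 'GB' in the current directory.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Keep the demo files out of the working directory; removed at exit.
my $dir  = tempdir(CLEANUP => 1);
my $base = "$dir/GB";

# First session: create the DBM file and store accession => offset.
my %dbm;
dbmopen(%dbm, $base, 0644) or die "Cannot open DBM file $base: $!";
$dbm{'AB031069'} = 0;
dbmclose(%dbm);

# Second session: the data persists on disk and is visible to a new hash.
my %again;
dbmopen(%again, $base, 0644) or die "Cannot reopen DBM file $base: $!";
my $offset = $again{'AB031069'};
dbmclose(%again);

print "stored offset: $offset\n";
```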
15GenBank Files and Libraries
- # Parse GenBank library, saving accession number and
- # offset in DBM file
- # open file and save filehandle in $fh
- $fh = open_file($library);
- # get offset   << should start at 0
- $offset = tell($fh);
16GenBank Files and Libraries
- while ( $record = get_next_record($fh) ) {
-     ($annotation, $dna) = get_annotation_and_dna($record);
-     %fields = parse_annotation($annotation);
-     my $accession = $fields{'ACCESSION'};
-     # extract just the accession number from the accession
-     # field and remove any trailing spaces
-     $accession =~ s/^ACCESSION\s*//;
-     $accession =~ s/\s*$//;
-     # store the key/value of accession/offset
-     $dbm{$accession} = $offset;
-     # get offset for next record
-     $offset = tell($fh);
- }
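The indexing loop can be sketched without the BeginPerlBioinfo helpers. The version below is a simplification under stated assumptions: it uses a plain hash (%index) in place of the DBM hash, a single regular expression in place of get_annotation_and_dna and parse_annotation, and a made-up two-record library held in memory.

```perl
use strict;
use warnings;

# Two toy records; $first and $second are placeholders, not real entries.
my $first  = "LOCUS       A1\nACCESSION   A1\n//\n";
my $second = "LOCUS       B2\nACCESSION   B2\n//\n";
my $library_text = $first . $second;
open(my $fh, '<', \$library_text) or die "Cannot open in-memory file: $!";

my %index;                    # accession => byte offset of the record
my $offset = tell($fh);       # offset of the first record (0)
{
    local $/ = "//\n";
    while (defined(my $record = <$fh>)) {
        # Record the offset captured BEFORE this record was read.
        if ($record =~ /^ACCESSION\s+(\S+)/m) {
            $index{$1} = $offset;
        }
        # The position now marks the start of the next record.
        $offset = tell($fh);
    }
}
close($fh);

for my $accession (sort keys %index) {
    print "$accession starts at byte $index{$accession}\n";
}
```

The key detail is the ordering: tell is called before each read, so the stored value points at the first byte of the record rather than the byte after it.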
17GenBank Files and Libraries
- print "Here are the available accession numbers:\n";
- print join ( "\n", keys %dbm ), "\n";
- print "Enter accession number (or quit): ";
- This was written for a library with 5 records
- Not a great plan for one with thousands
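For a large index, one possible alternative is to report the number of entries and show only a handful of them. The sketch below is an assumption about how that might look, not part of Example 10-8; it fills a plain hash with made-up accession names to stand in for the DBM hash.

```perl
use strict;
use warnings;

# Stand-in for the DBM hash: 28,158 made-up accessions with dummy offsets.
my %dbm;
$dbm{ sprintf("ACC%05d", $_) } = $_ * 100 for 1 .. 28158;

# each() walks the hash one pair at a time, so only a few keys are fetched.
my @sample;
while (my ($accession, $offset) = each %dbm) {
    push @sample, $accession;
    last if @sample >= 5;
}

# keys() in scalar context returns the count and resets the each() iterator.
my $total = keys %dbm;

print "The index holds $total accession numbers, for example:\n";
print join("\n", @sample), "\n";
```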
18GenBank Files and Libraries
- while( $answer = <STDIN> ) {
-     chomp $answer;
-     if($answer =~ /^\s*q/) {
-         last;
-     }
-     $offset = $dbm{$answer};
-     if (defined $offset) {
-         seek($fh, $offset, 0);           # <<<<< seek uses offset
-         $record = get_next_record($fh);
-         print $record;
-     } else {
19GenBank Files and Libraries
-     } else {
-         print "Do not have an entry for $answer\n";
-     }
-     print "\nEnter accession number (or quit): ";
- }
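The lookup step of the loop above can be isolated into a small function for experimenting. This is a sketch under assumptions: fetch_record is a hypothetical helper, the index is a plain hash built by hand, and the library is a made-up in-memory string rather than library.gb.

```perl
use strict;
use warnings;

# Hypothetical helper: jump to a record's offset and read just that record.
sub fetch_record {
    my ($fh, $index, $accession) = @_;
    my $offset = $index->{$accession};
    return unless defined $offset;            # unknown accession
    seek($fh, $offset, 0) or die "seek failed: $!";
    local $/ = "//\n";
    return scalar <$fh>;
}

# Toy library and a hand-built index of accession => byte offset.
my $first  = "LOCUS A1\n//\n";
my $second = "LOCUS B2\n//\n";
my $library_text = $first . $second;
my %index = (A1 => 0, B2 => length($first));

open(my $fh, '<', \$library_text) or die "Cannot open in-memory file: $!";

# Records can be fetched in any order, without rereading earlier ones.
my $record_b = fetch_record($fh, \%index, 'B2');
my $record_a = fetch_record($fh, \%index, 'A1');
my $missing  = fetch_record($fh, \%index, 'ZZ');

print $record_b;
print $record_a;
print "Do not have an entry for ZZ\n" unless defined $missing;
close($fh);
```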
20GenBank Files and Libraries
- seek function
- seek(FH, $offset, 0);
- This positions the pointer for this FH at the offset value
- The 3rd argument is for 'WHENCE'
- 0 = offset measured from the start of the file
- 1 = offset measured from the current position
- 2 = offset measured from the end of the file (negative offset expected)
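The three WHENCE values can be tried on a ten-character in-memory file. The sketch below uses the named constants SEEK_SET, SEEK_CUR and SEEK_END from the core Fcntl module, which correspond to 0, 1 and 2; the variable names and the sample text are made up for the demonstration.

```perl
use strict;
use warnings;
use Fcntl qw(:seek);    # exports SEEK_SET (0), SEEK_CUR (1), SEEK_END (2)

my $text = "0123456789";
open(my $fh, '<', \$text) or die "Cannot open in-memory file: $!";

# WHENCE 0: three bytes from the start, then read two bytes.
seek($fh, 3, SEEK_SET) or die "seek failed: $!";
read($fh, my $from_start, 2);

# WHENCE 1: the position is now 5; skip two more bytes and read two.
seek($fh, 2, SEEK_CUR) or die "seek failed: $!";
read($fh, my $from_current, 2);

# WHENCE 2: a negative offset counts back from the end of the file.
seek($fh, -3, SEEK_END) or die "seek failed: $!";
read($fh, my $from_end, 3);

close($fh);

print "from start:   $from_start\n";
print "from current: $from_current\n";
print "from end:     $from_end\n";
```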
21GenBank Files and Libraries
- LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
- DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
-             complete cds.
- ACCESSION   AB031069
- VERSION     AB031069.1  GI:8100074
- KEYWORDS    .
- SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
-             mRNA.
-   ORGANISM  Homo sapiens
-             Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
-             Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
- REFERENCE   1  (sites)
-   AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
-             Takano,T.
-   TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
-             is regulated by proteolysis
-   JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)