INSDC Sequencing Project Registry: NCBI web service protocol

1
INSDC Sequencing Project Registry: NCBI web
service protocol
Use and step-by-step description
National Center for Biotechnology Information,
NIH, Bethesda, MD, USA
2
Project definition
  • A project is defined as a collection of INSDC
    records originating from a single organization,
    or from a consortium of coordinating
    organizations.
  • The collective database records from a project
    make up a complete genome from a single-organism
    study or a metagenome comprising communities of
    organisms.
  • A project may contain genomic sequences, EST
    libraries and any other sequences that contribute
    to the assembly and annotation of the genome or
    metagenome

3
Field definitions
  • Assigned by INSDC
      • project ID
      • locus-tag prefix
  • Mandatory fields
      • submitter contact info
      • submitting organization
      • project type (single organism or metagenomic)
      • project name (for metagenomic) or organism
        name (for single organism)
      • strain/isolate/breed (for single organism)
      • physical source of material (for single
        organism)
  • Optional fields
      • project description
      • project URL
      • replicon names, estimated sizes
      • sequencing method
      • sequencing depth
      • estimated/calculated genome size
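
As a rough illustration, the field list on this slide could be modeled as a record type. This is only a sketch under stated assumptions: the class name, the attribute names, and the validation rule are hypothetical and simply mirror the mandatory/optional split above; they are not the registry's actual schema.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class ProjectType(Enum):
    SINGLE_ORGANISM = "single organism"
    METAGENOMIC = "metagenomic"


@dataclass
class ProjectRecord:
    # Mandatory fields (attribute names are hypothetical).
    submitter_contact: str
    submitting_organization: str
    project_type: ProjectType
    # Project name applies to metagenomic projects,
    # organism name to single-organism projects.
    project_name: Optional[str] = None
    organism_name: Optional[str] = None
    strain: Optional[str] = None           # strain/isolate/breed
    material_source: Optional[str] = None  # physical source of material
    # Optional fields.
    description: Optional[str] = None
    url: Optional[str] = None
    replicon_names: List[str] = field(default_factory=list)
    sequencing_method: Optional[str] = None
    sequencing_depth: Optional[str] = None
    genome_size: Optional[str] = None
    # Assigned by INSDC.
    project_id: Optional[int] = None
    locus_tag_prefix: Optional[str] = None

    def missing_mandatory(self) -> List[str]:
        """Return the names of mandatory fields that are still empty."""
        required = ["submitter_contact", "submitting_organization"]
        if self.project_type is ProjectType.METAGENOMIC:
            required.append("project_name")
        else:
            required += ["organism_name", "strain", "material_source"]
        return [name for name in required if not getattr(self, name)]
```

A caller could run missing_mandatory() before submitting and only proceed when it returns an empty list.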

4
Schematic diagram of a generic eukaryotic genome
project
[Diagram: a project overview linking NCBI data (nucleotide data in GenBank, genomic data in RefSeq, organism-specific overview) and external data (links to third-party sites). Example sub-projects shown:]
  • 1 Genomic sequencing (WGS) and assembly and
    annotation (complete), Center B
  • 2 Genomic sequencing (WGS) (complete), Center A
  • 4 BAC-ends sequencing (incomplete), Center F
  • 6 Large-scale cDNA sequencing (incomplete),
    Center B
5
Main tables in Genome Project database
6
(No Transcript)
7
International Nucleotide Sequence Database
Collaboration
Locus-tag prefix for annotated genes
8
INSDC project
[Diagram labels: NCBI genome project submission CGI; EMBL genome project submission CGI; DDBJ genome project submission CGI; NCBI Server; NCBI Project Database.]
http://www.ncbi.nlm.nih.gov/projects/gpws
9
  • Web services are web-based enterprise
    applications that use open, XML-based standards
    and transport protocols to exchange data with
    calling clients
  • WSDL is an XML-based description of how to
    communicate with the web service; namely, the
    protocol bindings and message formats required to
    interact with the web services listed in its
    directory.
  • WSDL is often used in combination with SOAP and
    XML Schema to provide web services over the
    internet. A client program connecting to a web
    service can read the WSDL to determine what
    functions are available on the server. Any
    special datatypes used are embedded in the WSDL
    file in the form of XML Schema. The client can
    then use SOAP to actually call one of the
    functions listed in the WSDL.
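
To make the "client reads the WSDL" step concrete, here is a minimal sketch that lists the operations declared in a WSDL document. The WSDL text below is a made-up, heavily trimmed example, not the real NCBI service description, and the operation names in it are assumptions chosen to echo the methods named in this talk. Only the WSDL 1.1 namespace URI is standard.

```python
import xml.etree.ElementTree as ET

WSDL_NS = "http://schemas.xmlsoap.org/wsdl/"  # standard WSDL 1.1 namespace

# Hypothetical, trimmed WSDL; a real one also carries types,
# messages, bindings, and a service address.
EXAMPLE_WSDL = """\
<definitions xmlns="http://schemas.xmlsoap.org/wsdl/" name="ExampleRegistry">
  <portType name="RegistryPort">
    <operation name="SubmitProject"/>
    <operation name="CheckStatus"/>
    <operation name="GetDocument"/>
  </portType>
</definitions>
"""


def list_operations(wsdl_text: str) -> list:
    """Return the operation names declared under every portType."""
    root = ET.fromstring(wsdl_text)
    path = f"{{{WSDL_NS}}}portType/{{{WSDL_NS}}}operation"
    return [op.get("name") for op in root.findall(path)]


print(list_operations(EXAMPLE_WSDL))
```

In practice a SOAP toolkit would do this parsing and generate the call stubs; the point here is only that the list of callable functions is plain XML the client can inspect.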

10
NCBI Web service implementation
  • Web service methods
      • Submit Project
      • Update Project
      • Delete Project
      • Check Status
      • Get Document ID
      • Get Document
  • Others
      • Bulk dump
      • Conflict resolution

11
Submitting a new project - eSubmit (example of a
successful submission)
  • Request (Collab CGI to NCBI Server): eSubmit
    with names and Locus_tag_prefix
  • Response (NCBI Server to Collab CGI): eOK, eNone
  • Normal case with a requested Locus_tag prefix
12
New project submission - eSubmit (inconsistent
request)
  • Request (Collab CGI to NCBI Server): eSubmit
    with names, the flag to auto-assign the
    locus_tag, and a Locus_tag_prefix
  • Response (NCBI Server to Collab CGI): eError,
    eProvidedLocusTagPrefixWillBeIgnored
  • Data error: returned if the
    CSubmission.AutoAssignment flag is set and
    pLocusTagPrefix is also provided by the
    submitter.
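
The two eSubmit outcomes on this slide and the previous one can be sketched as a single check. The status and detail names (eOK, eError, eNone, eProvidedLocusTagPrefixWillBeIgnored) come from the slides; the enum classes, the function name, and its parameters are hypothetical, and the real server presumably validates much more than this one rule.

```python
from enum import Enum
from typing import Optional, Tuple


class Status(Enum):
    eOK = "eOK"
    eError = "eError"


class Detail(Enum):
    eNone = "eNone"
    eProvidedLocusTagPrefixWillBeIgnored = "eProvidedLocusTagPrefixWillBeIgnored"


def check_submission(auto_assignment: bool,
                     locus_tag_prefix: Optional[str]) -> Tuple[Status, Detail]:
    """Flag the inconsistent request: auto-assignment plus an explicit prefix."""
    if auto_assignment and locus_tag_prefix:
        return Status.eError, Detail.eProvidedLocusTagPrefixWillBeIgnored
    return Status.eOK, Detail.eNone


# Normal case: the submitter requests a prefix and does not ask for auto-assignment.
print(check_submission(False, "ABC"))
# Inconsistent case: both the flag and a prefix are supplied.
print(check_submission(True, "ABC"))
```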
13
Providing Reliability
  • NCBI is providing dual middleware (API) servers
    and dual SQL servers.
  • Data are stored redundantly on the two SQL
    servers.
  • When both middleware servers are available, the
    API server used for a request is chosen by load
    balancing.
  • Having both API servers available, or either one
    alone, provides full functionality.
[Diagram labels: NCBI api Server (two instances); Sql Server1; Sql Server2.]
14
Normal handling of conflicting request
  • When both servers are up, there is no problem:
    both get the new request.
  • So, in this state, if a conflicting request
    incompatible with the new request is made, it can
    be rejected, as it should be.
[Diagram labels: New request; Sql Server1; Sql Server2; Conflicting request; Reject.]
15
There are multiple RARE reasons why a valid
request could have problems.
  • Connection to NCBI could be down, anywhere in
    between the Collab CGI and NCBI. Among rare
    events, we expect this to be the most common
    problem.
  • The entire NCBI site could be down.
    Historically, this has been extremely rare.
  • One or both of the database servers hosting the
    service could be down. (See later slides for
    partial service provided with one server up.)
  • (If any NCBI middleware API server is up, the
    request is handled.)

16
Benefits of Redundant SQL Servers
  • If any server is up, requests for information can
    be handled.
  • If any server is up, submissions for project IDs
    and locus_tags can be accepted.
  • Normally, a server going down and coming back
    requires only the minimal action of checking back
    to confirm that the state is now OK.

17
Expected transient case that can be handled
automatically
  • When a server is down but then comes back up, the
    request would normally be propagated by NCBI
    maintenance tasks.
  • The Collab API would receive the status eReceived
    until the maintenance task completed; then, for
    the new request, it would receive the eConfirmed
    status.
  • So, in this corrected state, if a conflicting
    request incompatible with the new request is
    made, it can still be rejected, as it should be.
[Diagram labels: New request; Received, not confirmed; NCBI Maintenance task; Sql Server1; Sql Server2; Conflicting request; Reject.]
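
The transient case can be walked through with a toy model. Everything in this sketch is an assumption made for illustration: the Server and Registry classes, the "rejected" return value, and keying the rows by locus_tag prefix are hypothetical, and the real maintenance tasks are not described in this talk. Only the eReceived and eConfirmed names come from the slides.

```python
class Server:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.rows = {}  # locus_tag prefix -> project name


class Registry:
    def __init__(self):
        self.servers = [Server("Sql Server1"), Server("Sql Server2")]

    def submit(self, prefix, project):
        live = [s for s in self.servers if s.up]
        if not live:
            raise RuntimeError("no SQL server available")
        # Reject if a reachable server already holds the prefix for another project.
        if any(s.rows.get(prefix, project) != project for s in live):
            return "rejected"
        for s in live:
            s.rows[prefix] = project
        return self.status(prefix)

    def status(self, prefix):
        # Confirmed only once every server holds the row.
        holders = [s for s in self.servers if prefix in s.rows]
        return "eConfirmed" if len(holders) == len(self.servers) else "eReceived"

    def maintenance(self):
        # Copy rows to any live server that missed them while it was down.
        live = [s for s in self.servers if s.up]
        for src in live:
            for dst in live:
                for prefix, project in src.rows.items():
                    dst.rows.setdefault(prefix, project)


reg = Registry()
reg.servers[1].up = False
print(reg.submit("ABC", "project A"))  # only one server reachable
reg.servers[1].up = True
reg.maintenance()
print(reg.status("ABC"))               # both servers now hold the row
print(reg.submit("ABC", "project B"))  # conflicting request
```

The sequence at the bottom follows the slide: the request is received while Sql Server2 is down, matures after the maintenance step, and a later conflicting request is turned away.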
18
Collab handling of eReceived
  • The following slides explain why eReceived is
    necessary as a possible return status.
  • To handle it, the collaborators can check back to
    confirm that the status has matured to
    eConfirmed, or to see if a problem was
    detected.
  • The possible EXTREMELY RARE and UNLIKELY problems
    will be presented in following slides.

19
Two Phase Commit
  • Computer scientists might recognize the problem
    as a natural consequence of a two-phase commit.
  • Normally, the two phases are hidden from
    submitters.
  • If the second phase is blocked by a server being
    down, then this complexity is revealed by the
    receipt of the eReceived status.

20
Unavoidable Complexity Caused by Redundant SQL
Servers
  • Redundant SQL Servers both prevent data loss and
    maximize uptime for queries. That is why we
    choose to accept the complexity of the two phase
    commit.
  • Even in this case the request can be accepted,
    but confirmation comes only after a delay.

21
Why bother with the two phase commit at all?
  • Although expected to be EXTREMELY RARE and
    UNLIKELY, the following slide shows a sequence of
    events prevented by the current system.
  • This slide shows what would NOT HAPPEN in the
    proposed system because of the two phase commit.
  • The following slide shows what would happen
    WITHOUT the two phase commit.

22
Illustration of what we will not allow and must
protect against
  • But when one server is down, then comes back up
    while the other goes down, watch what can happen!
  • So, in this state, if a conflicting request
    incompatible with the new request is made, it
    could be accepted, leading to an unacceptable
    data state.
[Diagram: unacceptable state prevented by the two-phase commit. Labels: New request; Conflicting request; Accepted!]
23
Why this event is expected to be so rare
  • This event requires the following sequence:
      • Server 2 going down.
      • Server 1 accepting a request, then going
        down, while
      • Server 2 comes back up to accept the
        conflicting request.

24
How this event would be handled.
  • Should this unlikely event happen, instead of the
    status maturing from eReceived to eConfirmed, it
    would degrade to eConflict for both.
  • The desired correction would be decided among the
    collaborators, by dialog.
  • Database would be patched to reflect the desired
    outcome.
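
One way to picture how a status matures or degrades is a comparison of what the two SQL servers hold. This is a sketch only: the reconcile function, the dictionary layout (locus_tag prefix mapped to project name), and the rule itself are assumptions made for illustration. Only the three status names come from the slides.

```python
def reconcile(server1: dict, server2: dict) -> dict:
    """Compare two servers' rows and return a status per locus_tag prefix."""
    statuses = {}
    for prefix in sorted(set(server1) | set(server2)):
        a, b = server1.get(prefix), server2.get(prefix)
        if a is not None and b is not None and a != b:
            # Each server accepted a different project for the same prefix:
            # both requests are flagged for discussion among collaborators.
            statuses[prefix] = "eConflict"
        elif a == b:
            statuses[prefix] = "eConfirmed"
        else:
            # Present on one server only; waiting for propagation.
            statuses[prefix] = "eReceived"
    return statuses


print(reconcile({"ABC": "project A", "XYZ": "project X"},
                {"ABC": "project B", "XYZ": "project X", "QRS": "project Q"}))
```

In this toy input, ABC was accepted for different projects on each server, XYZ agrees on both, and QRS has reached only one server so far.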

25
Illustration of rare event
  • The following slide illustrates what would
    happen should this rare sequence of events occur.
  • It may never happen, but the two-phase commit
    makes it possible, so we want to be clear at the
    beginning about what would happen and how it
    would be handled.

26
Proposed responses should this happen
  • But when one server is down, then comes back up
    while the other goes down, this is what we
    should do.
  • So, both requests are received, but not
    confirmed. Processes running on our servers will
    detect this for manual attention.
[Diagram labels: New request (Received, not confirmed); Conflicting request (Received, not confirmed).]
27
The previous should be a rare event
  • Then why bother handling it? Because:
      • The cost of automatically making this mistake
        would be high, and
      • The more normal, typical, frequent and
        expected recoveries, as on previous slides,
        are handled automatically.
  • It will be noticed by an eReceived state
    degrading to eConflict.

28
Handling of eReceived
  • All of the rare cases are noticed by the receipt
    of eReceived.
  • Collaborators need to check back for changes to
    the status of eReceived projects.
  • If the status matures to eConfirmed, no further
    action is needed.
  • If the status degrades to eConflict, then
    discussion will be needed. This will be rare!
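
The "check back" step might look like the loop below on the collaborator side. This is a minimal sketch under stated assumptions: check_status stands in for whatever call wraps the Check Status web service method, and the function name, polling interval, and retry limit are all hypothetical.

```python
import time
from typing import Callable


def wait_for_final_status(check_status: Callable[[], str],
                          interval_s: float = 60.0,
                          max_checks: int = 10,
                          sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the status is no longer eReceived or the check limit is hit."""
    status = check_status()
    for _ in range(max_checks - 1):
        if status != "eReceived":
            break
        sleep(interval_s)
        status = check_status()
    return status


# Simulated server replies, with sleeping disabled so the demo returns at once.
replies = iter(["eReceived", "eReceived", "eConfirmed"])
print(wait_for_final_status(lambda: next(replies), sleep=lambda s: None))
```

If the function returns eConfirmed, nothing more is needed; eConflict means the project goes to discussion; a lingering eReceived simply means checking again later.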

29
Non-collab users of the NCBI Genome Project data
  • Public data is available in Entrez
  • Use eUtils (also implemented as NCBI web service)
  • Discussion on the data elements in Entrez Genome
    Project Docsum
  • ftp dumps
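
For non-collab users, a search through eUtils starts with an ESearch request. The sketch below only builds the request URL and does not contact NCBI. The esearch.fcgi endpoint and its db, term, and retmax parameters are part of the public E-utilities; the database name passed as the default here is an assumption and should be checked against the current list of Entrez databases.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, db: str = "bioproject", retmax: int = 20) -> str:
    """Build an ESearch URL; the default db name is an assumption."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


print(esearch_url("Escherichia coli"))
```

The IDs returned by ESearch can then be passed to ESummary to retrieve the document summaries (Docsums) mentioned above.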