Title: A Classification of Biological Data Artifacts
1. A Classification of Biological Data Artifacts
Judice L.Y. Koh (1,2), Mong Li Lee (2), Vladimir Brusic (1)
(1) Knowledge Discovery Department, Institute for Infocomm Research
(2) School of Computing, National University of Singapore
2. http://research.i2r.a-star.edu.sg/Templar/
GenPept, EMBL, TrEMBL, NCBI RefSeq, Patent database
3. Data warehousing process
4. Biological data quality
Public molecular databases (GenBank, Swiss-Prot, DDBJ, EMBL, PIR, among others) provide rich sources of biological data. They supply the information for data analysis and the data sources for knowledge discovery in the BioWare data warehouse. The accuracy of data analysis and the ability to produce correct results from data mining rely on the quality of the data. But how can we ensure high-quality data in our data warehouse?
5. Objectives of our study
- Classification of biological data artifacts in biological databases.
- Critical assessment of the quality of the data that biologists and computer scientists have been using for data analysis and data mining.
- Roadmap to improving the quality of data in molecular databases.
- Form the basis of biological data cleaning.
6. Biological data artifacts
- Errors, discrepancies, redundancies, ambiguities, and incompleteness in molecular databases that reduce the quality of the biological data.
7. Outline of presentation
- Sources of Biological Data Artifacts
- HEADER Artifacts
- FEATURE Artifacts
- SEQUENCE Artifacts
- Data Cleaning Framework
- Conclusion
8. Sources of biological data artifacts
- (1) Diverse sources of data
  - Extensive duplication
  - Repeated submissions of the sequences to the same or different databases
  - Cross-updating of databases (propagation of errors)
- (2) Data annotation
  - Enrichment of sequences with descriptions of their structural and functional features, related references, and other sequence information
  - By database annotators or sequence submitters
  - Databases have different mechanisms for data annotation (GENBANK: only direct submission; SWISS-PROT: all sequence records)
  - Data entry errors can be introduced
  - Different interpretations
- (3) Lack of standardized nomenclature
  - Variations in naming conventions
  - Synonyms, homonyms, and abbreviations
- (4) Inadequacy of data quality control mechanisms
  - Systematic approaches to data cleaning are lacking
9. Classification of data artifacts
- HEADER: General information of the record.
- FEATURE: Descriptions of the structural, functional, and other physico-chemical properties of the sequence and regions of interest.
- SEQUENCE: Nucleotide or protein sequence.
10. Classification of data artifacts (diagram)
- HEADER
  - Invalid values: spelling errors, numerical names, format violation, undersized or oversized names
  - Ambiguity: synonyms, homonyms/abbreviations, misuse of fields
  - Incompatible schema: concatenated values, mis-fielded values
- FEATURE
  - Cross-annotation error: conflicting features across different database records
  - Annotation error: features do not correspond with sequence, over-prediction, putative features, under-prediction
  - Sequence structure violation: CDS miscoding
- SEQUENCE
  - Dubious sequences: uninformative sequences, undersized sequences, vector contaminated sequences
  - Fragments
  - Duplicates: replication of sequence information, different views, overlapping annotations of the same sequence
- Other labels in the diagram: sequence entry error, annotation error, dubious records
12. Spelling errors (HEADER, invalid values)
- Usually typographical errors
- Occur in different fields of the record
- We identified 569 possible misspelled words affecting up to 20,505 nucleotide records in Entrez.
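A dictionary lookup for possible misspellings could be sketched as follows. This is only an illustration under stated assumptions, not the procedure used to obtain the counts above: the vocabulary, the `find_misspellings` helper, and the similarity cutoff of 0.8 are all hypothetical.

```python
import difflib
import re

def find_misspellings(text, vocabulary, cutoff=0.8):
    """Return {word: suggestion} for words that are not in the vocabulary
    but closely resemble one of its entries (hypothetical helper)."""
    vocab = sorted({w.lower() for w in vocabulary})
    suspects = {}
    for word in re.findall(r"[A-Za-z]+", text):
        lower = word.lower()
        if lower in vocab:
            continue  # known word, nothing to report
        close = difflib.get_close_matches(lower, vocab, n=1, cutoff=cutoff)
        if close:
            suspects[word] = close[0]
    return suspects

# Assumed toy vocabulary and definition line, for illustration only.
vocabulary = ["protein", "kinase", "precursor", "hypothetical"]
definition = "Hypothetical protien kinase precursor"
suspects = find_misspellings(definition, vocabulary)
print(suspects)
```

Words with no close vocabulary entry are ignored, so a real vocabulary would need to cover gene and species names to keep false negatives low.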
13. Undersized/oversized fields (HEADER, invalid values)
- 0.05% (83) of the protein names gathered from the UniProt data records (Release 2.3) are longer than 400 characters.
- Protein record DCB2_HUMAN in UniProt
  - Definition: Discoidin, CUB and LCCL domain containing protein 2 precursor
  - Synonym 1: Endothelial and smooth muscle cell-derived neuropilin-like protein
  - Synonym 2: CUB, LCCL and coagulation factor V/VIII-homology domains protein 1
- Protein records ACTM_LELER and ACTM_HELTB
  - Synonym: M
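A length check on name fields can be sketched as below. The 400-character upper bound comes from the slide; the lower bound of 2 characters and the `flag_name_length` helper are assumptions made for illustration.

```python
def flag_name_length(names, min_len=2, max_len=400):
    """Split names into undersized and oversized groups.
    min_len is an assumed threshold; max_len follows the 400-character limit."""
    flagged = {"undersized": [], "oversized": []}
    for name in names:
        stripped = name.strip()
        if len(stripped) < min_len:
            flagged["undersized"].append(name)
        elif len(stripped) > max_len:
            flagged["oversized"].append(name)
    return flagged

long_name = "x" * 401  # synthetic oversized name
flagged = flag_name_length(["M", "Actin", long_name])
print(flagged["undersized"], len(flagged["oversized"]))
```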
14. Synonym / Homonym (HEADER, ambiguity)
- Synonyms: Different names given to the same sequence
- Homonyms: Different sequences given the same name
- The scorpion neurotoxin BmK-X precursor has many synonyms.
- It is also known as BmKX, BmK10, BmK-M10, Bmk M10, Neurotoxin M10, Alpha-neurotoxin TX9, and BmKalphaTx9.
- http://www.expasy.org/cgi-bin/niceprot.pl?P45697
15. Abbreviation (HEADER, ambiguity)
- Different types of sequences can have the same abbreviation.
- BMK stands for Big Map Kinase, B-cell/myeloid kinase, bovine midkine, as well as for Bradykinin-potentiating peptide.
- GK is the abbreviation for both Glycerol Kinase and the Geko gene of Drosophila melanogaster (fruit fly).
- http://www.expasy.org/cgi-bin/niceprot.pl?Q9R1D9
- Synonyms, homonyms, and abbreviations create ambiguities that cause problems in sequence identification and keyword searching.
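One way to surface ambiguous abbreviations is to index every expansion under its abbreviation and report those with more than one meaning. The sketch below uses the BMK and GK examples from the slide; the `ambiguous_abbreviations` helper and the input format of (abbreviation, name) pairs are assumptions.

```python
from collections import defaultdict

def ambiguous_abbreviations(pairs):
    """Map each abbreviation to its expansions and keep only those
    with more than one expansion (hypothetical helper)."""
    expansions = defaultdict(set)
    for abbreviation, name in pairs:
        expansions[abbreviation.upper()].add(name)
    return {a: sorted(n) for a, n in expansions.items() if len(n) > 1}

pairs = [
    ("BMK", "Big Map Kinase"),
    ("BMK", "B-cell/myeloid kinase"),
    ("BMK", "bovine midkine"),
    ("BMK", "Bradykinin-potentiating peptide"),
    ("GK", "Glycerol Kinase"),
    ("GK", "Geko gene"),
]
ambiguous = ambiguous_abbreviations(pairs)
print(ambiguous)
```

The same index, inverted to map every synonym to one preferred name, would be a starting point for normalizing synonyms before keyword searching.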
16. Misuse of fields (HEADER, ambiguity)
- Ambiguous field values
- Definition includes species, length of sequence, etc.
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=639947&dopt=GenPept
17. Concatenated values (HEADER, incompatible schema)
- Field concatenations occur during data transformations.
- When data fields of finer granularity are transformed into a schema with corresponding data fields of coarser granularity, the field values are concatenated.
- Multiple field values can be concatenated using "and" or "or".
- The gene name of the Swiss-Prot entry P29834 was "GRP 0.9 or GRP-1". This was recently corrected.
- http://www.expasy.org/cgi-bin/niceprot.pl?P15228
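Concatenated field values of this kind can be flagged by looking for a joining "and" or "or" inside a field that should hold a single value. The following is a minimal sketch; the `split_concatenated` helper and its regular expression are assumptions, not part of any database tool.

```python
import re

# Joining words surrounded by whitespace, matched case-insensitively.
_JOINER = re.compile(r"\s+(?:or|and)\s+", flags=re.IGNORECASE)

def split_concatenated(value):
    """Return the component values if the field looks concatenated,
    otherwise an empty list (hypothetical helper)."""
    parts = [p.strip() for p in _JOINER.split(value) if p.strip()]
    return parts if len(parts) > 1 else []

print(split_concatenated("GRP 0.9 or GRP-1"))
print(split_concatenated("GRP-1"))
```

A rule this simple also fires on legitimate names that contain "and", such as "Discoidin, CUB and LCCL domain containing protein 2 precursor", so it is better suited to short fields like gene names, with flagged values reviewed by hand.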
18. Mis-fielded values (HEADER, incompatible schema)
- Flaws in schema mapping: Source fields not taken into account in the transformed data schema may be incorrectly mapped to a wrong field.
- Sequence is directly submitted to GENBANK
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=18071172&dopt=GenPept
20. Conflicting features across different databases (FEATURE, cross-annotation error)
- Multiple database records of the same nucleotide or protein sequences contain inconsistent or conflicting feature annotations, arising from:
  - data entry errors,
  - mis-annotation of sequence functions,
  - different expert interpretations, and
  - inference of features or annotation transfer based on best matches of low sequence similarity.
- Different annotation groups
  - A comparative study of the annotations by three different groups of 340 genes of the Mycoplasma genitalium genome showed that incompatible descriptions were assigned to 8% of these genes.
  - Brenner SE (1999) Errors in genome annotation. TIG 15: 132-133.
- Same annotation group
21. http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11692004
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11692006
22. Putative features (FEATURE, annotation error)
- Functional annotation sometimes involves searching for the highest matching annotated sequence in the database.
- Features are then extrapolated from the most similar known sequences found by the search.
- In some cases, even the highest matching sequence from a database search has only weak sequence similarity and therefore may not share the function of the query sequence (Bork, 2000; Guigo et al., 2000).
- Blind inference can cause erroneous functional assignment.
- A study found that 24 of the Chlamydia trachomatis sequences contained erroneous functional assignments (Iliopoulos et al., 2003).
23. Intron/exon overlaps (FEATURE, sequence structure violation)
- Illogical feature entities that do not correspond to the logical constraints of the gene structure.
- Introns and exons must be non-overlapping except in cases of alternative splicing.
- 12 out of 42,359 nucleotide sequences have overlapping intron/exon regions.
24. Intron/exon overlaps (FEATURE, sequence structure violation)
- The Syn7 gene of putative polyketide synthase in NCBI TPA record BN000507 has overlapping intron 5 and exon 6.
- The rpb7 RNA polymerase II subunit in GENBANK record AF055916 has overlapping exon 1 and exon 2.
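Checking the non-overlap constraint amounts to an interval comparison over the annotated features of one record. The sketch below assumes features are given as (label, start, end) tuples with 1-based inclusive coordinates; the `find_overlaps` helper and the coordinates are made up for illustration and are not taken from BN000507 or AF055916.

```python
def find_overlaps(features):
    """Return pairs of feature labels whose coordinate ranges overlap.
    features: list of (label, start, end), 1-based inclusive (assumed format)."""
    ordered = sorted(features, key=lambda f: (f[1], f[2]))
    overlaps = []
    for i, (label_a, _start_a, end_a) in enumerate(ordered):
        for label_b, start_b, _end_b in ordered[i + 1:]:
            if start_b > end_a:
                break  # later features start even further right
            overlaps.append((label_a, label_b))
    return overlaps

# Hypothetical coordinates: intron 1 runs into exon 2.
features = [
    ("exon 1", 1, 100),
    ("intron 1", 101, 250),
    ("exon 2", 240, 400),
]
overlaps = find_overlaps(features)
print(overlaps)
```

Records with alternative splicing legitimately contain overlapping features, so a flagged pair is a candidate for review rather than a confirmed error.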
26. Uninformative sequence (SEQUENCE, dubious sequences)
- Sequences have meaningless content
- A high percentage of unknown residues (X) or unknown bases (N) reduces the complexity of the sequence and thus its information content.
- Three out of the nine residues of the unknown protein CP19 (XXFESXEMR) in UniProt record UN19_CLOPA are unknown.
- The chain C of an MHC protein (XFVKQNAXALX) in PDB contains 30% unknown residues.
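The share of unknown symbols in a sequence is simple to compute. The sketch below uses the two example sequences from the slide; the `unknown_fraction` helper is hypothetical, and any cutoff for calling a sequence uninformative would be a separate assumption.

```python
def unknown_fraction(sequence, unknown="X"):
    """Fraction of positions holding the unknown symbol
    ('X' for residues, 'N' for bases)."""
    if not sequence:
        return 0.0
    seq = sequence.upper()
    return seq.count(unknown.upper()) / len(seq)

cp19 = unknown_fraction("XXFESXEMR")       # 3 of 9 residues
chain_c = unknown_fraction("XFVKQNAXALX")  # 3 of 11 residues
print(round(cp19, 3), round(chain_c, 3))
```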
27. Undersized sequence (SEQUENCE, dubious sequences)
- Sequences have meaningless content
- Among the 5,146,255 protein records retrieved with Entrez from the major protein or translated nucleotide databases, 3,327 protein sequences are shorter than four residues (as of Sep 2004).
- By Nov 2004, the total number of undersized protein sequences had increased to 3,350.
- Among 43,026,887 nucleotide records retrieved with Entrez from the major nucleotide databases, 1,448 records contain sequences shorter than six bases (as of Sep 2004).
- By Nov 2004, the total number of undersized nucleotide sequences had increased to 1,711.
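The length thresholds quoted on this slide (fewer than four residues, fewer than six bases) translate directly into an integrity constraint. A minimal sketch follows; the record format of (identifier, kind, sequence) tuples, the identifiers, and the `undersized_records` helper are assumptions.

```python
# Minimum lengths taken from the slide: four residues, six bases.
MIN_LENGTH = {"protein": 4, "nucleotide": 6}

def undersized_records(records):
    """Return identifiers of records whose sequence is shorter than the
    minimum for its kind. records: iterable of (id, kind, sequence)."""
    return [
        rec_id
        for rec_id, kind, sequence in records
        if len(sequence) < MIN_LENGTH[kind]
    ]

# Hypothetical records for illustration.
records = [
    ("rec1", "protein", "MK"),
    ("rec2", "protein", "MKVL"),
    ("rec3", "nucleotide", "ACGTA"),
    ("rec4", "nucleotide", "ACGTAC"),
]
short = undersized_records(records)
print(short)
```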
28. Vector contaminated sequence (SEQUENCE, dubious sequences)
- Vectors are agents that carry DNA fragments into a host cell.
- The vector sequences probe and bind the DNA fragments at the 5' and 3' sites.
- The DNA fragment is then isolated from its vectors by cutting at the restriction enzyme sites.
- 8 out of 8,850 Candida albicans sequences are possibly contaminated with vectors commonly used for the cloning of Candida albicans sequences.
- We used BLAST to search for regions in the Candida albicans sequences that match any of the 18 cloning vectors. From the matched results, we selected those with matches at the 3' or 5' ends of the Candida albicans sequences. The matching sections range from 30 bases to 1,154 bases.
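The selection step, keeping only vector matches that sit at either end of a sequence, can be sketched as a filter over match coordinates. This does not run BLAST; it assumes the matches have already been parsed into (start, end) positions on the query sequence. The `terminal_hits` helper and the 10-base margin are hypothetical choices, not values from the study.

```python
def terminal_hits(hits, query_length, margin=10):
    """Keep matches that begin near the 5' end or finish near the 3' end.
    hits: list of (start, end) on the query, 1-based inclusive (assumed).
    margin: assumed tolerance, in bases, for calling a match terminal."""
    selected = []
    for start, end in hits:
        low, high = min(start, end), max(start, end)
        if low <= 1 + margin:
            selected.append((low, high, "5' end"))
        elif high >= query_length - margin:
            selected.append((low, high, "3' end"))
    return selected

# Hypothetical vector matches on a 1,000-base sequence.
hits = [(1, 45), (300, 360), (960, 1000)]
flagged = terminal_hits(hits, query_length=1000)
print(flagged)
```

Internal matches such as (300, 360) are dropped here because the slide describes selecting only matches at the sequence ends.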
29. Fragmented sequences in different records (SEQUENCE, fragments)
- Extensive redundancy is caused by records whose sequences are fragments of, or overlap with, more complete sequences in other records.
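The simplest case, a sequence wholly contained in a longer one, can be detected with a substring test. The sketch below is an assumption-laden illustration: the `find_fragments` helper and the record identifiers are made up, and partially overlapping sequences would need an alignment-based comparison instead.

```python
def find_fragments(records):
    """Return (fragment_id, container_id) pairs where one sequence is a
    proper substring of another. records: dict of id -> sequence."""
    pairs = []
    for frag_id, fragment in records.items():
        for full_id, full in records.items():
            if frag_id == full_id or len(fragment) >= len(full):
                continue
            if fragment in full:
                pairs.append((frag_id, full_id))
    return pairs

# Hypothetical records for illustration.
records = {
    "recA": "MKVLAAGIVGLLLAQ",
    "recB": "LAAGIVG",
    "recC": "MSTNPKPQRK",
}
fragments = find_fragments(records)
print(fragments)
```

The pairwise comparison is quadratic in the number of records, which is acceptable for a sketch but would need indexing to scale to a full database.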
30. Replication of sequence information (SEQUENCE, duplicates)
- Identical sequences with the same annotations
- Submission of the same sequence to different databases
- Repeated submission of the same sequence to the same database
- Initially submitted by different groups
- Protein sequences may be translated from duplicate nucleotide sequences
- http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
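Exact replication of a sequence across records can be found by grouping record identifiers under their sequence. This is a minimal sketch of one possible duplicate check; the `group_duplicates` helper, the (identifier, sequence) input format, and the identifiers are assumptions, and it compares sequences only, not annotations.

```python
from collections import defaultdict

def group_duplicates(records):
    """Group record identifiers that share an identical sequence.
    records: iterable of (id, sequence). Returns groups of two or more ids."""
    groups = defaultdict(list)
    for rec_id, sequence in records:
        groups[sequence.upper()].append(rec_id)
    return [ids for ids in groups.values() if len(ids) > 1]

# Hypothetical records for illustration.
records = [
    ("db1:0001", "MKVLAAGIVG"),
    ("db2:A123", "mkvlaagivg"),
    ("db1:0002", "MSTNPKPQRK"),
]
duplicates = group_duplicates(records)
print(duplicates)
```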
31. Duplicates (SEQUENCE): replication of sequence information, different views, overlapping annotations of the same sequence
- http://www.expasy.org/cgi-bin/niceprot.pl?Q95P69
- http://www.expasy.org/cgi-bin/niceprot.pl?Q9GNG8
34. Data cleaning framework
Levels: ATTRIBUTE, RECORD, SINGLE-SOURCE DATABASE, MULTI-SOURCE DATABASE
- Dictionary lookup: spelling errors, synonyms, homonyms/abbreviations
- Integrity constraints: uninformative sequences, undersized sequences, format violation, misuse of fields
- Vector screening: vector contaminated sequences
- Sequence structure parser: features do not correspond with sequence, sequence structure violation
- Schema remapping: concatenated values, mis-fielded values
- Duplicate detection: replication of sequence information, different views, overlapping annotations of the same sequence
- Comparative analysis: fragments, putative features, cross-annotation error
36. Conclusion
- 9 types of data artifacts.
- A combination of critical artifacts (vector contaminated sequences, duplicates, sequence structure violations) and non-critical artifacts (misspellings, synonyms).
- At least 20,000 sequence records in public databases contain some form of artifact.
- Deteriorating data quality requires more attention.
- The identification of these artifacts is an important preliminary step to accurate data mining and knowledge discovery.
- This classification provides a basis for the design of biological data cleaning methods.
37. Acknowledgement
- Supervisors: Prof. Vladimir Brusic, Dr. Lee Mong Li
- Biologists: Asif M. Khan, Paul T.J. Tan, Heiny Tan, Kenneth Lee, Songsak Tongchusak, Wilson Goh
- Engineer: Kavitha Gopalakrishnan