A Classification of Biological Data Artifacts - PowerPoint PPT Presentation

1
A Classification of Biological Data Artifacts
Judice L.Y. Koh (1,2), Mong Li Lee (2), Vladimir Brusic (1)
(1) Knowledge Discovery Department, Institute for Infocomm Research
(2) School of Computing, National University of Singapore
2
http://research.i2r.a-star.edu.sg/Templar/
GenPept, EMBL, TrEMBL, NCBI RefSeq, Patent
database
3
Data warehousing process
4
Biological data quality
Public molecular databases (GenBank, Swiss-Prot,
DDBJ, EMBL, PIR, among others) provide rich
sources of biological data. They supply the
information for data analysis and the data sources
for knowledge discovery in the BioWare data
warehouse. The accuracy of data analysis and the
ability to produce correct results from data
mining rely on the quality of the data. But how
can we ensure high-quality data in our data
warehouse?
5
Objectives of our study
  • Classification of biological data artifacts in
    biological databases
  • Critical assessment of the quality of data which
    biologists/computer scientists have been using
    for data analysis and data mining.
  • Roadmap to improving the quality of data in
    molecular databases.
  • Form the basis of biological data cleaning.

6
Biological data artifacts
  • Errors, discrepancies, redundancies, ambiguities,
    and incompleteness in molecular databases that
    reduce the quality of the biological data.

7
Outline of presentation
  • Sources of Biological Data Artifacts
  • HEADER Artifacts
  • FEATURE Artifacts
  • SEQUENCE Artifacts
  • Data Cleaning Framework
  • Conclusion

8
Sources of biological data artifacts
  • (1) Diverse sources of data
      • Extensive duplication
      • Repeated submissions of the sequences to the
        same or different databases
      • Cross-updating of databases (propagation of
        errors)
  • (2) Data annotation
      • Enrichment of sequences with descriptions of
        their structural and functional features,
        related references and other sequence
        information
      • By database annotators or sequence submitters
      • Databases have different mechanisms for data
        annotation (GenBank: only direct submissions;
        Swiss-Prot: all sequence records)
      • Data entry errors can be introduced
      • Different interpretations
  • (3) Lack of standardized nomenclature
      • Variations in naming conventions
      • Synonyms, homonyms, and abbreviations
  • (4) Inadequacy of data quality control mechanisms
      • Systematic approaches to data cleaning are
        lacking

9
Classification of data artifacts
  • HEADER: General information of the record.
  • FEATURE: Descriptions of the structural,
    functional, and other physico-chemical properties
    of the sequence and regions of interest.
  • SEQUENCE: Nucleotide or protein sequence.

10
[Diagram: classification of biological data artifacts]
  • HEADER: Invalid values, Ambiguity, Incompatible
    schema
  • FEATURE: Cross-annotation error, Annotation error,
    Sequence structure violation
  • SEQUENCE: Dubious sequences, Fragments, Duplicates
  • Specific artifacts named in the diagram: spelling
    errors, numerical names, format violation,
    undersized or oversized names, synonyms,
    homonyms/abbreviations, misuse of fields,
    concatenated values, mis-fielded values,
    conflicting features across different database
    records, features do not correspond with sequence,
    over-prediction, putative features,
    under-prediction, uninformative sequences, CDS
    miscoding, sequence entry error, undersized
    sequences, annotation error, fragments, dubious
    records, vector contaminated sequences,
    replication of sequence information, different
    views, overlapping annotations of the same
    sequence
12
Spelling errors (HEADER: Invalid values)
  • Usually typographical errors
  • Occur in different fields of the record
  • We identified 569 possibly misspelled words
    affecting up to 20,505 nucleotide records in
    Entrez.
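
A dictionary lookup of the kind listed later under the data cleaning framework could be sketched as follows. This is only an illustration, not the procedure used in the study: the vocabulary, the similarity cutoff, and the function names are all assumptions.

```python
import difflib

# Hypothetical controlled vocabulary; a real one would be far larger.
VOCABULARY = ["protein", "precursor", "kinase", "resistance", "polymerase"]

def suggest_corrections(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return close vocabulary matches for a word that is not in the vocabulary."""
    lowered = word.lower()
    if lowered in vocabulary:
        return []  # spelled correctly, nothing to suggest
    return difflib.get_close_matches(lowered, vocabulary, n=3, cutoff=cutoff)

def find_misspellings(text):
    """Map each suspicious word of a free-text field to its suggested corrections."""
    suspects = {}
    for word in text.split():
        suggestions = suggest_corrections(word.strip(".,;()"))
        if suggestions:
            suspects[word] = suggestions
    return suspects
```

Words with no close match in the vocabulary are left alone, so the sketch flags likely typos rather than every unknown term.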

13
Format violation: undersized/oversized fields (HEADER: Invalid values)
  • 0.05% (83) of the protein names gathered from the
    UniProt data records (Release 2.3) are longer
    than 400 characters.
  • Protein record DCB2_HUMAN in UniProt
      • Definition: Discoidin, CUB and LCCL domain
        containing protein 2 precursor
      • Synonym 1: Endothelial and smooth muscle
        cell-derived neuropilin-like protein
      • Synonym 2: CUB, LCCL and coagulation factor
        V/VIII-homology domains protein 1.
  • Protein records ACTM_LELER and ACTM_HELTB
      • Synonym: M
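
A length check on name fields can be expressed as a simple integrity constraint. In the sketch below, the 400-character upper bound comes from the slide; the lower bound of 2 characters and the function name are assumptions made for illustration.

```python
def check_name_length(name, min_len=2, max_len=400):
    """Classify a name field as 'undersized', 'oversized', or 'ok'.

    max_len=400 follows the slide; min_len=2 is an assumed threshold.
    """
    length = len(name.strip())
    if length < min_len:
        return "undersized"
    if length > max_len:
        return "oversized"
    return "ok"
```

With these thresholds the single-letter synonym "M" is reported as undersized.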

14
Synonyms and homonyms (HEADER: Ambiguity)
  • Synonyms: different names given to the same
    sequence
  • Homonyms: different sequences given the same
    name
  • The scorpion neurotoxin BmK-X precursor has
    several synonyms
  • It is also known as BmKX, BmK10, BmK-M10,
    Bmk M10, Neurotoxin M10, Alpha-neurotoxin
    TX9, and BmKalphaTx9.
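
One way to handle synonyms is to map every known variant to a single canonical name after normalizing case and punctuation. The sketch below uses the names listed on the slide; choosing "BmK-X" as the canonical form and the helper names are assumptions.

```python
import re

def normalize(name):
    """Lowercase a name and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())

# Names listed on the slide for the scorpion neurotoxin BmK-X precursor.
SYNONYMS = [
    "BmK-X", "BmKX", "BmK10", "BmK-M10", "Bmk M10",
    "Neurotoxin M10", "Alpha-neurotoxin TX9", "BmKalphaTx9",
]
CANONICAL = "BmK-X"  # assumed choice of canonical name

SYNONYM_INDEX = {normalize(s): CANONICAL for s in SYNONYMS}

def resolve(name):
    """Return the canonical name for a known synonym, else the name unchanged."""
    return SYNONYM_INDEX.get(normalize(name), name)
```

Normalization alone already merges spelling variants such as "BmK-M10" and "Bmk M10"; the table is still needed for unrelated names such as "Neurotoxin M10".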

http://www.expasy.org/cgi-bin/niceprot.pl?P45697
15
Abbreviations (HEADER: Ambiguity)
  • Different types of sequences can have the same
    abbreviation.
  • BMK stands for Big Map Kinase, B-cell/myeloid
    kinase, bovine midkine, as well as for
    Bradykinin-potentiating peptide.
  • GK is the abbreviation for both Glycerol
    Kinase and the Geko gene of Drosophila
    melanogaster (fruit fly).
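
Ambiguous abbreviations can be detected by grouping expansions under each abbreviation and reporting those with more than one meaning. The following is a minimal sketch; the input format (a list of abbreviation/expansion pairs) and the function name are assumptions.

```python
from collections import defaultdict

def find_homonyms(pairs):
    """Return abbreviations that map to more than one expansion.

    pairs: iterable of (abbreviation, expansion) tuples.
    """
    meanings = defaultdict(set)
    for abbreviation, expansion in pairs:
        meanings[abbreviation.upper()].add(expansion)
    return {a: sorted(m) for a, m in meanings.items() if len(m) > 1}
```

Applied to the GK example above, the function would report both "Glycerol kinase" and "Geko gene" under the key "GK".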

http://www.expasy.org/cgi-bin/niceprot.pl?Q9R1D9
The manifestation of synonyms, homonyms and
abbreviations results in information ambiguities,
which cause problems in sequence identification
and keyword searching.
16
Misuse of fields: ambiguous field values (HEADER: Ambiguity)
Definition includes species, length of sequence,
etc.
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=639947&dopt=GenPept
17
Concatenated values (HEADER: Incompatible schema)
  • Field concatenations occur during data
    transformations.
  • When data fields of finer granularity are
    transformed into a schema with corresponding
    data fields of coarser granularity, the field
    values are concatenated.
  • Multiple field values can be concatenated using
    "and" or "or".
  • The gene name of the Swiss-Prot entry P29834 was
    "GRP 0.9 or GRP-1". This was recently corrected.
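
Splitting a concatenated field back into separate values could look like the sketch below. It is an assumption-laden illustration: the function name is hypothetical, and the rule splits on every standalone "and" or "or", which would wrongly break legitimate names such as "Discoidin, CUB and LCCL domain containing protein 2", so the output would need review.

```python
import re

def split_concatenated(value):
    """Split a field value on standalone 'and' / 'or' connectives."""
    parts = re.split(r"\s+(?:and|or)\s+", value.strip(), flags=re.IGNORECASE)
    return [part for part in parts if part]
```

For the gene name quoted above, `split_concatenated("GRP 0.9 or GRP-1")` yields the two separate names.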

http://www.expasy.org/cgi-bin/niceprot.pl?P15228
18
Mis-fielded values (HEADER: Incompatible schema)
Flaws in schema mapping: source fields not taken
into account in the transformed data schema may
be incorrectly mapped to a wrong field.
Sequence is directly submitted to GenBank
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=18071172&dopt=GenPept
20
Conflicting features across different databases (FEATURE: Cross-annotation error)
  • Multiple database records of the same nucleotide
    or protein sequence contain inconsistent or
    conflicting feature annotations. Possible causes:
      • data entry errors,
      • mis-annotation of sequence functions,
      • different expert interpretations, and
      • inference of features or annotation transfer
        based on best matches of low sequence
        similarity.
  • Different annotation groups
      • A comparative study of the annotations by
        three different groups of 340 genes of the
        Mycoplasma genitalium genome showed that
        incompatible descriptions were assigned to 8%
        of these genes.
      • Brenner SE (1999) Errors in genome annotation.
        TIG 15: 132-133.
  • Same annotation group

21
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11692004
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=11692006
22
Putative features (FEATURE: Annotation error)
  • Functional annotation sometimes involves
    searching for the highest matching annotated
    sequence in the database.
  • Features are then extrapolated from the most
    similar known sequences returned by the search.
  • In some cases, even the highest matching sequence
    from a database search has only weak sequence
    similarity and therefore does not share the
    function of the query sequence (Bork, 2000;
    Guigo et al., 2000).
  • Blind inference can cause erroneous functional
    assignment.
  • A study found that 24 of the Chlamydia
    trachomatis sequences contained erroneous
    functional assignments (Iliopoulos et al., 2003).

23
Intron/exon overlaps (FEATURE: Sequence structure violation)
  • Illogical feature entities that do not
    satisfy the logical constraints of the gene
    structure.
  • 12 out of 42,359 nucleotide sequences have
    overlapping intron/exon regions.
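
The non-overlap constraint on introns and exons can be checked with a sort-and-scan over feature intervals. The sketch below is illustrative only: the tuple layout, the 1-based inclusive coordinates, and the function name are assumptions, and it does not model the alternative splicing exception.

```python
def find_overlaps(features):
    """Return pairs of feature labels whose coordinate ranges overlap.

    features: list of (label, start, end) with 1-based inclusive coordinates.
    """
    ordered = sorted(features, key=lambda f: (f[1], f[2]))
    overlaps = []
    for i, (label_a, _start_a, end_a) in enumerate(ordered):
        for label_b, start_b, _end_b in ordered[i + 1:]:
            if start_b > end_a:
                break  # sorted by start, so no later feature overlaps this one
            overlaps.append((label_a, label_b))
    return overlaps
```

A record in which an intron ends at position 200 while the next exon starts at position 190 would be reported as one overlapping pair.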

Introns and exons must be non-overlapping except
in cases of alternative splicing.
24
Intron/exon overlaps (FEATURE: Sequence structure violation)
The Syn7 gene of putative polyketide synthase in
NCBI TPA record BN000507 has overlapping intron 5
and exon 6. The rpb7 RNA polymerase II subunit in
GenBank record AF055916 has overlapping exon 1
and exon 2.
26
Uninformative sequences (SEQUENCE: Dubious sequences)
  • Sequences have meaningless content
  • A high percentage of unknown residues (X) or
    unknown bases (N) reduces the complexity of the
    sequence and thus its information content.
  • Three out of the nine residues of the unknown
    protein CP19 (XXFESXEMR) in UniProt record
    UN19_CLOPA are unknown.
  • Chain C of an MHC protein (XFVKQNAXALX) in
    PDB contains 30% unknown residues.
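
The proportion of unknown residues or bases is straightforward to compute. In this sketch, the 25% flagging threshold and the function names are assumptions chosen for illustration, not values from the study.

```python
def unknown_fraction(sequence, unknown="X"):
    """Fraction of positions holding the unknown symbol (X for protein, N for DNA)."""
    seq = sequence.upper()
    if not seq:
        return 0.0
    return seq.count(unknown) / len(seq)

def is_uninformative(sequence, unknown="X", threshold=0.25):
    """Flag a sequence whose unknown fraction reaches the assumed threshold."""
    return unknown_fraction(sequence, unknown) >= threshold
```

For the CP19 fragment XXFESXEMR the fraction is 3/9, so it is flagged under the assumed threshold.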

27
Undersized sequences (SEQUENCE: Dubious sequences)
  • Sequences have meaningless content
  • Among the 5,146,255 protein records retrieved
    through Entrez from the major protein or
    translated nucleotide databases, 3,327 protein
    sequences are shorter than four residues (as of
    Sep 2004).
  • By Nov 2004, the total number of undersized
    protein sequences had increased to 3,350.
  • Among 43,026,887 nucleotide records retrieved
    through Entrez from the major nucleotide
    databases, 1,448 records contain sequences
    shorter than six bases (as of Sep 2004).
  • By Nov 2004, the total number of undersized
    nucleotide sequences had increased to 1,711.
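
The size thresholds quoted above (fewer than four residues for proteins, fewer than six bases for nucleotides) translate directly into a length check. The dictionary and function names below are assumptions.

```python
# Minimum lengths taken from the slide: sequences below these are undersized.
MIN_LENGTH = {"protein": 4, "nucleotide": 6}

def is_undersized(sequence, kind):
    """Return True if the sequence is shorter than the minimum for its kind."""
    return len(sequence) < MIN_LENGTH[kind]
```

A three-residue protein such as "MKV" or a five-base nucleotide sequence would be flagged, while sequences exactly at the minimum are not.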

28
Vector contaminated sequences (SEQUENCE: Dubious sequences)
  • Vectors are agents that carry DNA fragments into
    a host cell.
  • The vector sequences probe and bind the DNA
    fragments at the 5' and 3' sites.
  • The DNA fragment is then isolated from its
    vectors by cutting at the restriction enzyme
    sites.

8 out of 8,850 Candida albicans sequences are
possibly contaminated with vectors commonly used
for the cloning of Candida albicans sequences.
We used BLAST to search for regions in the
Candida albicans sequences that match any of
the 18 cloning vectors. From the matched results,
we selected those with matches at the 3' or 5'
ends of the Candida albicans sequences. Matching
sections of the sequences extend from 30 bases to
1,154 bases.
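
The selection step described above, keeping only vector matches that sit at the 5' or 3' end of a sequence, might be sketched as follows. The sketch does not run BLAST; it assumes the hits are already available as (start, end) coordinates on the query sequence. The `end_window` tolerance and the function name are hypothetical, and `min_match=30` merely echoes the shortest match reported above.

```python
def terminal_vector_hits(seq_length, hits, min_match=30, end_window=10):
    """Keep vector hits that are long enough and touch either end of the sequence.

    hits: list of (start, end), 1-based inclusive positions on the query sequence,
    assumed to come from a prior similarity search against vector sequences.
    """
    flagged = []
    for start, end in hits:
        long_enough = (end - start + 1) >= min_match
        at_5_prime = start <= end_window
        at_3_prime = end > seq_length - end_window
        if long_enough and (at_5_prime or at_3_prime):
            flagged.append((start, end))
    return flagged
```

Internal matches and very short terminal matches are ignored, which mirrors the intent of screening only for likely cloning remnants.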
29
Fragmented sequences in different records (SEQUENCE: Fragments)
  • Extensive redundancy is caused by records whose
    sequences are fragments of, or overlap with,
    more complete sequences in other records.
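
A naive containment test can find records whose sequence is a fragment of a longer sequence in another record. The pairwise scan below is quadratic and only a sketch under assumed inputs (a dict of record id to sequence); it detects full containment, not partial overlaps.

```python
def find_fragments(records):
    """Return (fragment_id, container_id) pairs where one sequence lies inside another.

    records: dict mapping record id to sequence string.
    """
    pairs = []
    for frag_id, frag in records.items():
        for full_id, full in records.items():
            if frag_id != full_id and len(frag) < len(full) and frag in full:
                pairs.append((frag_id, full_id))
    return pairs
```

Detecting partially overlapping sequences would need an alignment-based comparison instead of a substring test.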

30
Replication of sequence information (SEQUENCE: Duplicates)
  • Identical sequences with the same annotations
  • Submission of the same sequence to different
    databases
  • Repeated submission of the same sequence to the
    same database
  • Initially submitted by different groups
  • Protein sequences may be translated from
    duplicate nucleotide sequences
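
Exact duplicates can be grouped by using the sequence itself as a dictionary key. This is a minimal sketch with an assumed input format (record id to sequence); it compares sequences only and ignores annotations.

```python
from collections import defaultdict

def group_duplicates(records):
    """Return lists of record ids that share an identical sequence (case-insensitive)."""
    groups = defaultdict(list)
    for record_id, sequence in records.items():
        groups[sequence.upper()].append(record_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]
```

For large collections a hash of the sequence could replace the raw string as the key to save memory.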

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=protein&list_uids=11692005&dopt=GenPept
31
Duplicates (SEQUENCE): replication of sequence information, different views, overlapping annotations of the same sequence
http://www.expasy.org/cgi-bin/niceprot.pl?Q95P69
http://www.expasy.org/cgi-bin/niceprot.pl?Q9GNG8
34
[Diagram: data cleaning framework, mapping cleaning methods to artifacts]
  • Dictionary lookup: spelling errors, synonyms,
    homonyms/abbreviations
  • Integrity constraints: uninformative sequences,
    undersized sequences, format violation, misuse of
    fields
  • Vector screening: vector contaminated sequences
  • Sequence structure parser: features do not
    correspond with sequence, sequence structure
    violation
  • Schema remapping: concatenated values, mis-fielded
    values
  • Duplicate detection: replication of sequence
    information, different views, overlapping
    annotations of the same sequence
  • Comparative analysis: fragments, putative
    features, cross-annotation error
  • Levels shown: ATTRIBUTE, RECORD, SINGLE-SOURCE
    DATABASE, MULTI-SOURCE DATABASE
36
Conclusion
  • 9 types of data artifacts.
  • A combination of critical artifacts (vector
    contaminated sequences, duplicates, sequence
    structure violations) and non-critical artifacts
    (misspellings, synonyms).
  • At least 20,000 sequence records in public
    databases contain some form of artifact.
  • Deteriorating data quality requires more
    attention.
  • The identification of these artifacts is an
    important preliminary step to accurate data
    mining and knowledge discovery.
  • This classification provides a basis for design
    of biological data cleaning methods.

37
Acknowledgement
  • Supervisors: Prof. Vladimir Brusic, Dr. Lee Mong
    Li
  • Biologists: Asif M. Khan, Paul T.J. Tan, Heiny
    Tan, Kenneth Lee, Songsak Tongchusak, Wilson Goh
  • Engineer: Kavitha Gopalakrishnan