Supporting On-the-Fly Data Integration for Bioinformatics

1
Supporting On-the-Fly Data Integration for Bioinformatics
  • Candidate: Xuan Zhang
  • Advisor: Gagan Agrawal

2
Road Map
  • Mission Statement
  • Motivation
  • Implementation
  • Comprehensive Examples
  • Future work
  • Conclusion

3
Mission Statement
  • Enhance information integration systems in
  • Functionality: on-the-fly data incorporation, flat-file data processing
  • Usability: declarative interface, low programming requirement

4
Motivation
  • Integration is essential for biological research
  • Biological data include
  • Sequences: DNA (GenBank), protein (Swiss-Prot)
  • Structure: RNA (RNAbase), protein (PDB)
  • Interaction: pathway (KEGG), regulation (GRBase)
  • Function: disease (OMIM)
  • Secondary: protein family (Pfam)
  • Biological data are inter-related.

5
Motivation
  • Challenges of bioinformatics integration
  • Data volume: overwhelming
  • DNA sequences: 100 gigabases (August 2005)
  • Data growth: exponential

Figure provided by PDB
6
Motivation
  • Challenges of bioinformatics integration (cont.)
  • Tools: many, and more keep appearing
  • Service interfaces: varied
  • Web pages
  • Web service
  • Grid service

7
Motivation
  • Challenges of bioinformatics integration (cont.)
  • Interoperability: low
  • Heterogeneous data sources
  • Semi-structured by nature
  • Flat file, relational, object-oriented databases
  • Independently developed tools
  • No data exchange standard
  • Little collaboration

8
Road Map
  • Mission Statement
  • Motivation
  • Implementation: approach overview, advantages, components
  • Future
  • Conclusion

9
Approach Summary
  • Metadata
  • Declarative description of data
  • Data mining algorithms for semi-automatic writing
  • Reusable by different requests on same data
  • Code generation
  • Request analysis and execution separated
  • General modules with plug-in data module

10
System Overview
[Diagram: the Information Integration System has two stages. Understand Data: a Data File is analyzed by the Layout Miner and the Schema Miner, producing a Metadata Description (Layout Descriptor and Schema Descriptor). Process Data: Code Generation takes the Metadata Description and a User Request and produces the Request Processor, which returns the Answer.]
11
Advantages
  • Simple interface
  • At metadata level, declarative
  • General data model
  • Semi-structured data
  • Flat file data
  • Low human involvement
  • Semi-automatic data incorporation
  • Low maintenance cost
  • Acceptable performance
  • Linear scaling guaranteed

12
Road Map
  • Mission Statement
  • Motivation
  • Implementation: approach overview, advantages, components
  • Future
  • Conclusion

13
System Components
  • Understand data
  • Layout mining
  • Schema mining
  • Process data
  • Wrapper generation
  • Query Process
  • Query Process with indices

14
Layout Mining
  • Goal 1: separate delimiters from values
  • D-score: location, frequency
  • Goal 2: organize delimiters and values
  • NFA

[Diagram: Data File → Token Parser → Tokens → Delimiter Mining → Candidate Delimiters → Layout Learning → Layout Descriptor]
15
Schema Mining Road Map
  • Schema Mining
  • Overview
  • Mining System
  • Core Mining Algorithm
  • Experiments

16
Schema Mining Goals
  • Ultimate goal: discover the schema of an unknown
    flat file dataset
  • Immediate goal: assign meaningful labels to
    attributes

17
Our Approach
  • Summarize values from the bottom up
  • Use knowledge from ontology and heuristics
  • A heads-up: attribute label ≠ attribute name
  • What we can mine: date
  • What we cannot do: creation date, last modification date, birthday, ...

18
Schema Mining Road Map
  • Schema Mining
  • Overview
  • Mining System
  • Core Mining Algorithm
  • Experiments

19
Schema Mining System
  • Major components
  • Data cleaning and summarization
  • Score calculation: score function, ontology, heuristics
  • Score clustering

[Diagram: Raw attribute values → Value cleaning and summarization → Attribute summaries → Score calculation → Scores → Clustering algorithm → Cutoff values → Labeling → Attribute labels]
20
Data Summarization
  • Goal: reduce the amount of data
  • Collect frequent tokens
  • Approximate frequent token mining algorithm
  • Token categorization by profile
  • Token profile: an ordered list of N (numerical),
    A (alphabetic) and special characters
  • Token categories: word, number, else and other
    user-defined categories
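
The token profile described above can be sketched as follows. This is a minimal illustration, not the presented system's code: the function names and the exact rule (each run of digits becomes N, each run of letters becomes A, everything else is kept) are assumptions based on the slide's definition.

```python
import re


def token_profile(token: str) -> str:
    """Replace each digit run with 'N' and each letter run with 'A'.

    A single pass is used so that the inserted 'N' is not itself
    rewritten as a letter run.
    """
    return re.sub(
        r"[0-9]+|[A-Za-z]+",
        lambda m: "N" if m.group()[0].isdigit() else "A",
        token,
    )


def token_category(token: str) -> str:
    """Map a token to one of the basic categories: word, number, else."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in ("N", "N.N"):
        return "number"
    return "else"
```

For instance, the GenBank date `12-SEP-1993` has the profile `N-A-N`, which falls into the `else` category until a user-defined category claims it.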

21
Score Function Template
  • Desired properties
  • Simple
  • Adjustable trade-off between sensitivity and
    error tolerance

22
Score Clustering
  • Goal: sort attributes into three groups, H
    (high), M (middle) and L (low), by score
  • Mathematically, find two scores, score_i and
    score_j, from score_1, score_2, score_3, ..., score_N,
    to minimize the standard deviation
  • N (the number of attributes) is not large, so the
    exact answer can be found.
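
Because N is small, the two cut points can be found by trying every pair. The sketch below assumes the objective is the sum of the within-group (population) standard deviations; the slide does not spell out the exact objective, so this choice and the function name are illustrative only.

```python
from itertools import combinations
from statistics import pstdev


def cluster_three(scores):
    """Split scores into (low, middle, high) by exhaustive search.

    Every pair of cut positions in the sorted list is tried, and the
    split with the smallest total within-group standard deviation wins.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores for three groups")
    best_cost, best_groups = None, None
    for i, j in combinations(range(1, n), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(pstdev(group) for group in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups
```

The search visits on the order of N² candidate splits, which is cheap for the handful of attributes in a typical flat file.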

23
Schema Mining Road Map
  • Schema Mining
  • Overview
  • Mining System
  • Core Mining Algorithm
  • Mining with ontology
  • Mining with heuristics
  • Experiments

24
Use of Ontology
  • An observation: ontology and schema are similar
  • Both satisfy the is-a relation
  • E.g., diabetes is a disease.
  • Ontology: diabetes is a child of disease
  • Schema: diabetes is a valid instance of
    attribute disease
  • Common ancestors in the ontology → attribute label

25
Real-world Complications
  • Task: find an arbitrary value in an ontology
  • Complete and comprehensive ontology? Use selective sampling
  • Error-free dataset? Use adjustable sensitivity and fault tolerance
  • Performance

26
Ontology Database
  • Goal: approximate a complete, comprehensive
    ontology database
  • Approach
  • Complete: sample popular terms
  • Comprehensive: public ontology databases,
    common facts
  • Result: 6 major categories, 386 terms

27
Ontology Based Metrics (1)
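  • Occurrence(term)

$$
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Count}_i, & \text{if } term = \mathrm{Frequent\_Token}_i \\
\min_{i \in [0,t]} \mathrm{Frequent\_Count}_i, & \text{if } term = \mathrm{Frequent\_Token}_0 \ldots \mathrm{Frequent\_Token}_t \\
0, & \text{otherwise}
\end{cases}
$$

  • Strength(term)

$$
\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \sum \mathrm{Strength}(child\_term)
$$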
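
A small sketch of these two metrics follows. It assumes that a multi-word term takes the minimum count of its constituent frequent tokens, and that a term's strength is its own occurrence plus the summed strength of its children; both readings are assumptions about the slide's formulas. The ontology fragment and counts are made up for illustration.

```python
def occurrence(term, frequent_counts):
    """Count of a term among an attribute's frequent tokens.

    A single-token term gets that token's count; a multi-token term gets
    the minimum count over its tokens; unknown tokens contribute 0.
    """
    tokens = term.lower().split()
    if not tokens:
        return 0
    return min(frequent_counts.get(token, 0) for token in tokens)


def strength(term, children, frequent_counts):
    """Occurrence of the term plus the strength of all its descendants."""
    total = occurrence(term, frequent_counts)
    for child in children.get(term, []):
        total += strength(child, children, frequent_counts)
    return total


# Hypothetical ontology fragment and frequent-token counts.
children = {"disease": ["diabetes", "breast cancer"]}
counts = {"diabetes": 12, "breast": 7, "cancer": 5}
disease_strength = strength("disease", children, counts)
```

Here `disease` never appears in the data itself, yet it accumulates strength from its children, which is what lets a common ancestor surface as an attribute label.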

28
Ontology Based Metrics (2)
  • Two factors
  • Relative strength compared with other concepts
  • Completeness of ontology as a whole
  • Ontology score: product of the two factors
  • Each modulated by the template score function

29
Mining With Heuristics (1)
  • Use token profile
  • number: N, N.N
  • date: N-A-N, N/N/N
  • Use frequent token counts
  • identification: Frequent_Counts = 1
  • Use other token information
  • biological sequence: length > 45, or in 10s

30
Mining With Heuristics (2)
  • Use token sequence information
  • people name: length (2-3), separator ("," or
    "and"), profile (not number, not date)
  • Again, these counts are modulated by the template
    function to calculate scores
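
The profile-based and length-based rules can be sketched as a per-value vote. This is only an illustration: the real system turns such counts into scores through the template function, which is not reproduced here, and reading "in 10s" as "length is a multiple of 10" is an assumption.

```python
import re
from collections import Counter


def token_profile(token):
    """Digit runs become 'N', letter runs become 'A', the rest is kept."""
    return re.sub(
        r"[0-9]+|[A-Za-z]+",
        lambda m: "N" if m.group()[0].isdigit() else "A",
        token,
    )


def heuristic_label(value):
    """Return a label suggested by the heuristics, or None."""
    profile = token_profile(value)
    if profile in ("N-A-N", "N/N/N"):
        return "date"
    if profile in ("N", "N.N"):
        return "number"
    # Assumed reading of "length > 45, or in 10s".
    if profile == "A" and (len(value) > 45 or len(value) % 10 == 0):
        return "sequence"
    return None


def vote(values):
    """Count how many values of one attribute support each label."""
    return Counter(heuristic_label(value) for value in values)
```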

31
Schema Mining Road Map
  • Schema Mining
  • Overview
  • Mining System
  • Core Mining Algorithm
  • Experiments

32
Schema Mining Experiment Design
  • Datasets: GenBank, UniProt/Swiss-Prot and Pfam
  • Cutoff values: exact clustering
  • Evaluation: weighted Cohen's kappa
  • Compare groups most, middle and little with true
    labels Y (yes), P (partial) and N (no)
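
For reference, weighted Cohen's kappa over three ordered categories can be computed as below. The sketch uses linear disagreement weights |i − j|; the presentation does not say which weighting was used, so that choice is an assumption. Categories are encoded as integers, e.g. most/Y = 2, middle/P = 1, little/N = 0.

```python
def weighted_kappa(rater_a, rater_b, n_categories):
    """Linear-weighted Cohen's kappa for two equal-length rating lists."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    k = n_categories
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Weighted observed and chance-expected disagreement.
    disagree_obs = sum(
        abs(i - j) * observed[i][j] for i in range(k) for j in range(k)
    )
    disagree_exp = sum(
        abs(i - j) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    if disagree_exp == 0:
        return 1.0  # both raters used one identical category throughout
    return 1.0 - disagree_obs / disagree_exp
```

A value of 1 means the mined groups match the true labels exactly; values near 0 mean agreement no better than chance.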

33
Result Summary: Kappa
[Chart: kappa value per attribute category, rated on the scale Very good / Good / Moderate. Categories: 1 cellular component, 2 database, 3 date, 4 free text, 5 ID, 6 molecule type, 7 name, 8 number, 9 organism, 10 publication method, 11 sequence]
34
Cellular Component (O)
35
Date (H)
36
Organism Name (O)
37
Schema Mining Summary
  • According to Kappa tests, results are good or
    very good
  • Possible improvement
  • Clustering method with better intelligence
  • Better ontology database
  • More involved language analysis
  • Hybrid of bottom-up and top-down approaches

38
System Components
  • Understand data
  • Layout mining
  • Schema mining
  • Process data
  • Metadata description language
  • Wrapper generation
  • Query Process
  • Query Process with indices

39
Data Process Overview
  • Automatic code generation approach
  • Input
  • Metadata about datasets involved
  • Optional
  • Implicit data transformation task
  • Request by users
  • Indexing functions
  • Output
  • Executable programs
  • General modules
  • Task-specific data module

40
Metadata Description
  • Two aspects of data in flat files
  • Logical view of the data
  • Physical data organization
  • Two components of every data descriptor
  • Schema description
  • Layout description
  • Design goals
  • Powerful
  • Easy for writing and interpretation

41
Metadata Challenges
  • Examples of sequence formats
  • ALN/ClustalW format
  • AMPS Block file format
  • ClustalW
  • Codata
  • EMBL
  • GCG/MSF
  • GDE
  • GenBank
  • Fasta (Pearson)
  • NBRF/PIR
  • PDB format
  • Pfam/Stockholm format
  • Phylip
  • Raw
  • RSF
  • UniProtKB/Swiss-Prot

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGE
MPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYD
MPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPRE
ETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAE
LESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGST
SAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPV
VSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
  • Major Challenges
  • Various representation
  • Semi-structured data

name          "Short name for sequence"
longname      "Long (more descriptive) name for sequence"
sequence-ID   "Unique ID number"
creation-date "mm/dd/yy hh:mm:ss"
direction     [-1|1]
strandedness  [1|2]
type          [DNA|RNA|PROTEIN|TEXT|MASK]
offset        (-999999,999999)
group-ID      (0,999)
creator       "Author's name"
descrip       "Verbose description"
comments      "Lines of comments that can be fairly arbitrary text about a sequence.
Return characters are allowed, but no internal double quotes or brace characters.
Remember to close with a double quote"
sequence      "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattg
c"
LOCUS       MMFOSB       4145 bp    mRNA    linear   ROD 12-SEP-1993
DEFINITION  Mouse fosB mRNA.
ACCESSION   X14897
VERSION     X14897.1  GI:50991
KEYWORDS    fos cellular oncogene; fosB oncogene; oncogene.
SOURCE      Mus musculus.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 4145)
  AUTHORS   Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and
            Bravo,R.
  TITLE     The product of a novel growth factor activated gene, fos B,
            interacts with JUN proteins enhancing their DNA binding activity
  JOURNAL   EMBO J. 8 (3), 805-813 (1989)
  MEDLINE   89251612
   PUBMED   2498083
COMMENT     clone=AC113-1; cell line=NIH3T3.
FEATURES             Location/Qualifiers
     source          1..4145
                     /organism="Mus musculus"
                     /db_xref="taxon:10090"
     CDS             1202..2218
                     /note="fosB protein (AA 1-338)"
                     /codon_start=1
                     /protein_id="CAA33026.1"
                     /db_xref="GI:50992"
                     /db_xref="MGD:95575"
                     /db_xref="SWISS-PROT:P13346"
                     /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC
                     AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT
                     SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV
                     RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH
                     KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL
                     TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS
                     LLAL"
BASE COUNT      960 a   1186 c   1007 g    991 t      1 others
ORIGIN
        1 ataaattctt attttgacac tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca
       61 aagtacagaa ggcttggtca catttaaatc actgagaact agagagaaat actatcgcaa
      121 actgtaatag acattacatc cataaaagtt tccccagtcc ttattgtaat attgcacagt
      181 gcaattgcta catggcaaac tagtgtagca tagaagtcaa agcaaaaaca aaccaaagaa
      241 aggagccaca agagtaaaac tgttcaacag ttaatagttc aaactaagcc attgaatcta
      301 tcattgggat cgttaaaatg aatcttccta caccttgcag tgtatgattt aacttttaca
      361 gaacacaagc caagtttaaa atcagcagta gagatattaa aatgaaaagg tttgctaata
      421 gagtaacatt aaataccctg aaggaaaaaa aacctaaata tcaaaataac tgattaaaat
      481 tcacttgcaa attagcacac gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga
      541 aaacataaaa caaaactatt aaaatagttt tagagggggt aaaatccagg tcctctgcca
      601 ggatgctaaa attagacttc aggggaattt tgaagtcttc aattttgaaa cctattaaaa
      661 agcccatgat tacagttaat taagagcagt gcacgcaaca gtgacacgcc tttagagagc
      721 attactgtgt atgaacatgt tggctgctac cagccacagt caatttaaca aggctgctca
      781 gtcatgaact taatacagag agagcacgcc taggcagcaa gcacagcttg ctgggccact
      841 ttcctccctg tcgtgacaca atcaatccgt gtacttggtg tatctgaagc gcacgctgca
      901 ccgcggcact gcccggcggg tttctgggcg gggagcgatc cccgcgtcgc cccccgtgaa
      961 accgacagag cctggacttt caggaggtac agcggcggtc tgaaggggat ctgggatctt
     1021 gcagagggaa cttgcatcga aacttgggca gttctccgaa ccggagacta agcttccccg
     1081 agcagcgcac tttggagacg tgtccggtct actccggact cgcatctcat tccactcggc
     1141 catagccttg gcttcccggc gacctcagcg tggtcacagg ggcccccctg tgcccaggga
     1201 aatgtttcaa gcttttcccg gagactacga ctccggctcc cggtgtagct catcaccctc
     1261 cgccgagtct cagtacctgt cttcggtgga ctccttcggc agtccaccca ccgccgccgc
     1321 ctcccaggag tgcgccggtc tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc
     1381 aatcacaacc agccaggatc ttcagtggct cgtgcaaccc accctcatct cttccatggc
     1441 c
List and example provided by EMBL-EBI
42
Schema Descriptors
  • Follow XML DTD standard for semi-structured data
  • Simple attribute list for relational data

<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

FASTA               //Schema name
ID string           //Data type definitions
DESCRIPTION string
SEQ string
43
Layout Descriptors
  • Overall structure (FASTA example)
  • DATASET FASTAData //Dataset name
  • DATATYPE FASTA //Schema name
  • DATASPACE LINESIZE=80
  • // ---- File layout details go here ----
  • DATA osu/fasta //File location

44
File Layout
  • Key observations on line-based biological data
    files
  • Strings of variable length
  • Delimiters widely used
  • Data fields may be divided into variables
  • Repetitive structures

>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3
45
Layout Descriptors
  • File layout (FASTA example)

>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3
  • DATASPACE LINESIZE=80
  • <
  • > ID DESCRIPTION
  • < \n SEQ >
  • \n EOF
  • >
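
To make the layout concrete, here is a hand-written reader for the same FASTA structure: each entry starts with ">", followed by the ID, a space, and the DESCRIPTION, and then one or more SEQ lines. This is a sketch of the record structure the descriptor expresses, not the code the system generates; the function name and dictionary keys are chosen to mirror the schema fields.

```python
import io


def read_fasta(handle):
    """Yield one dict per FASTA entry with keys ID, DESCRIPTION and SEQ."""
    entry = None
    for raw in handle:
        line = raw.rstrip("\n")
        if line.startswith(">"):
            if entry is not None:
                yield entry
            ident, _, description = line[1:].partition(" ")
            entry = {"ID": ident, "DESCRIPTION": description.strip(), "SEQ": ""}
        elif entry is not None and line.strip():
            # Sequence lines are repeated; concatenate them.
            entry["SEQ"] += line.strip()
    if entry is not None:
        yield entry


sample = io.StringIO(
    ">seq1 comment1\nASTPGHTIIYEAVCLHNDRTTIP\n"
    ">seq2 comment2\nASQKRPSQRHGSKYLATASTMDHARHGFL\nPRHRDTGILDSIGRFFGGDRGAPK\n"
)
entries = list(read_fasta(sample))
```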

46
System Component
  • Understand data
  • Layout mining
  • Schema mining
  • Process data
  • Metadata description language
  • Wrapper generation
  • Query execution
  • Query execution with indices

47
Wrapper Generation Road Map
  • Motivation and overview
  • System structure
  • Wrapper generation
  • Wrapper execution
  • Experiments

48
Wrapper Generation Motivation
  • Wrappers are essential for bioinformatics
    integration
  • Heterogeneous data sources
  • Function: transform data
  • Current solutions
  • Manually written wrappers
  • Scripts

49
Wrapper Generation: Advantages
  • Wrappers generated automatically
  • Stand-alone programs for integration systems and
    workflows
  • Little human interference. New resources can be
    integrated on-the-fly
  • Direct transformation. No unnecessary
    intermediate form needed
  • Only requires data description at metadata level,
    one descriptor/data source
  • Transfer data from flat files directly
  • No DB support required
  • No other domain or format heuristics

50
Wrapper Generation: System Overview
[Diagram: wrapper generation system. Inputs: Schema Descriptors, Layout Descriptor, Mapping File. Components: Mapping Generator, Layout Parser, Mapping Parser, Application Analyzer. Intermediate results: Data Entry Representation, Schema Mapping, WRAPINFO. The generated wrapper (DataReader, DataWriter, Synchronizer) transforms the Source Dataset into the Target Dataset.]
51
Layout Parse Tree
  • FASTA example

  • DATASPACE LINESIZE=80
  • <
  • > ID DESCRIPTION
  • < \n SEQ >
  • \n EOF
  • >

[Parse tree: the root is DATASPACE (linesize 80). Internal nodes (< >) are environments. Leaves are delimiter-variable (DLM-VAR) pairs: >-ID, (space)-DESCRIPTION, \n-SEQ and \n-DUMMY, followed by EOF.]
52
Schema Mapping
  • Algorithm: strict name matching
  • for each field ft in target schema
  • for each field fs in source schema
  • if ft = fs then add pair (fs, ft) to the mapping
  • Output
  • A list of attribute pairs
  • An editable file for the user to verify and modify
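
The matching loop above translates almost directly into code. The sketch below also reports target fields left without a partner, since those are the ones a user would have to fill in by hand in the editable mapping file; that helper and the example field names are illustrative assumptions.

```python
def map_schemas(source_fields, target_fields):
    """Pair source and target fields whose names are exactly equal."""
    mapping = []
    for ft in target_fields:
        for fs in source_fields:
            if ft == fs:
                mapping.append((fs, ft))
    return mapping


def unmatched_targets(mapping, target_fields):
    """Target fields that strict name matching could not map."""
    mapped = {ft for _, ft in mapping}
    return [ft for ft in target_fields if ft not in mapped]


# Hypothetical field lists for a Swiss-Prot-like source and a FASTA target.
source = ["ID", "DE", "SQ"]
target = ["ID", "DESCRIPTION", "SEQ"]
pairs = map_schemas(source, target)
todo = unmatched_targets(pairs, target)
```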

53
Wrapping Assumptions
  • Convert semi-structured (and structured) data to
    structured data
  • Both datasets are stored record-wise
  • Order of records not disturbed after wrapping

Semi-structured → Structured: data can be transformed entry by entry
54
Application Analyzer
  • Task: generate clear directions for the wrapper
    and organize them in WRAPINFOR
  • Sub-tasks
  • What values to store
  • How to extract values
  • How to store values
  • How to write values

55
Important Concepts (1)
  • Useful
  • An attribute is useful iff its values are in the
    target
  • Reachable
  • Node b is reachable from node a if there exists
    a valid layout configuration such that a.DLM and
    b.DLM define the boundaries of a.VAR,
  • i.e., a.DLM a.VAR b.DLM
  • A value instance lies between
  • Its own delimiter
  • The first appearance of one of its reachable delimiters
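
The boundary rule can be sketched as a small function: a value starts right after its own delimiter and ends at the earliest occurrence of any reachable delimiter. The function name, its signature, and the FASTA delimiters in the example are assumptions made for illustration.

```python
def extract_value(text, start, own_dlm, reachable_dlms):
    """Return (value, end_position), or (None, start) if own_dlm is absent.

    The value lies between own_dlm (searched from `start`) and the first
    appearance of any delimiter in reachable_dlms after it.
    """
    begin = text.find(own_dlm, start)
    if begin == -1:
        return None, start
    begin += len(own_dlm)
    ends = [
        pos
        for pos in (text.find(dlm, begin) for dlm in reachable_dlms)
        if pos != -1
    ]
    end = min(ends) if ends else len(text)
    return text[begin:end], end


text = ">seq1 comment1\nASTPG\n>seq2 c2\nKSAH\n"
ident, pos = extract_value(text, 0, ">", [" "])
descr, pos = extract_value(text, pos, " ", ["\n"])
seq, pos = extract_value(text, pos, "\n", ["\n"])
```

Returning the end position lets the caller continue scanning from where the previous value stopped, so an entry is read in a single left-to-right pass.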

56
Important Concepts (2)
  • Attribute cardinality
  • Regular attribute: fixed number of values per
    entry (e.g., ID)
  • Semi-structured attribute: varied number of
    values per entry (e.g., References)

57
WRAPINFOR
  • Contents: information to answer a particular
    wrapping task
  • Form: XML
  • 5 look-up tables: Delimiter, Usefulness,
    Cardinality, Label, Reachable
  • 3 parameters: one_to_one_total,
    one_to_multiple_total, complete_in
  • Function: plugs into the general modules to form a
    functional wrapper

58
Wrapper Generation Road Map
  • Motivation and overview of our approach
  • System structure
  • Wrapper generation
  • Wrapper execution
  • Experiments

59
Wrapper Overview
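[Diagram: the DataReader loads entries from the input dataset into a dataset buffer and fills the value buffer (one_to_one_values, one_to_multiple_values); the DataWriter writes buffered values to the output dataset; the Synchronizer drives both (load, run, halt). Other labels in the figure: FA, RA.]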
60
Wrapper Structure
  • One data module: WRAPINFO
  • Three general action modules
  • Synchronizer: central controller
  • DataReader, DataWriter: interact with datasets
  • One value buffer
  • Suitable for data grid
  • Transform data one entry at a time

61
Wrapper Execution
  • DataReader
  • Extract attribute values: Delimiter table, Reachable table
  • Fill value buffer: Label look-up table
  • DataWriter
  • Retrieve from value buffer: Label look-up table
  • Write target file: Delimiter table, Reachable table, Label table
  • Synchronizer
  • Call DataReader on source: parameters
  • Call DataWriter on target: parameters

62
Wrapper Experiments (1)
  • Analysis time: constant
  • Execution time: linear

[Plots: TRANSFAC-to-Reference problem, both axes in logarithmic scale]
63
Wrapper Experiments (2)
  • Performance comparable to handwritten code

[Plot: SWISSPROT-to-FASTA problem]
64
System Components
  • Understand data
  • Layout mining
  • Schema mining
  • Process data
  • Metadata description language
  • Wrapper generation
  • Query execution
  • Query execution with indices

65
Query Execution Road Map
  • Motivation
  • System Overview
  • System Implementation
  • Languages
  • System
  • Experiments

66
Limitation of Wrapper
  • Data wrapping
  • Data formatting, data projection
  • Other query types
  • Selection
  • Cross Product
  • Join
  • New Functionalities
  • Value examination
  • Multiple datasets

67
Advantages
  • Retrieve multiple pieces of information all at
    once
  • Data easily available
  • Declarative languages only
  • High flexibility
  • Low over-head
  • Suitable for data grid

68
System Enhanced
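[Diagram: Query analysis: the query parser extracts source/target names and mappings from the query; the descriptor parser reads the dataset descriptors (metadata collection) and yields schema and layout information; the application analyzer combines them into QUERYINFOR. Query execution: DataReader, DataWriter and Synchronizer use QUERYINFOR to read the source data files and write the target data file.]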
69
Query Execution: Road Map
  • Motivation
  • System Overview
  • System Implementation
  • Languages
  • Metadata Description Language
  • Query Language
  • System
  • Query Analysis
  • Query Execution
  • Experiments

70
Query Language
  • Declarative, SQL-like
  • Projection, selection, cross product, join
    queries
  • Example

AUTOWRAP POSTBLAST                            (target dataset)
FROM BLASTP, SWISSPROT                        (source datasets)
BY BLASTP.SP_ID = SWISSPROT.ID                (join criteria)
WHERE POSTBLAST.QUERY = BLASTP.QUERY          (attribute pairs)
      POSTBLAST.SP_AC = BLASTP.SP_AC
      POSTBLAST.SP_ID = BLASTP.SP_ID
      POSTBLAST.FULL_DESCR = SWISSPROT.DE
      POSTBLAST.SEQUENCE = SWISSPROT.SQ
      POSTBLAST.SCORE = BLASTP.SCORE
      POSTBLAST.E_VALUE = BLASTP.E_VALUE
71
Application Analyzer: Enhancement
  • Constant values in query
  • Pseudo-label look-up table
  • Other query information
  • Parameters comparing field pairs
  • Output: QUERYINFOR

72
Query Execution
  • Query-Proc Structure
  • DataReader and DataWriter
  • Similar to wrapper
  • Value buffer
  • Store useful values from one data entry of every
    source dataset

[Diagram: QUERYINFOR plugs into the DataReader, DataWriter and Synchronizer, which read the source data files and write the target data file.]
73
Enhanced Synchronizer
  • Synchronizer
  • Set up pseudo-attributes: pseudo-label look-up
    table
  • Call DataReader on sources 1 and 2, call
    DataWriter on target: parameters
  • Test join conditions: parameters
  • Clean value buffer: parameters

74
Post-BLAST Query
  • Goal: enhance BLAST output into FASTA format
  • Query: join query between BLAST output (source 1)
    and SWISSPROT (source 2)
  • 2 modes
  • UNIQUE: halt once a match is found in source 2
  • ALL: search all source 2 entries
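
The two modes amount to a nested-loop join that either stops at the first match or keeps scanning. The sketch below works on in-memory lists of dictionaries rather than flat files; the function name, its parameters, and the sample records are hypothetical, and only the UNIQUE/ALL behaviour is taken from the slide.

```python
def autowrap(source1, source2, join, pairs, mode="ALL"):
    """Join two entry lists and build target entries.

    join  -- (field in source 1, field in source 2) compared for equality
    pairs -- {target field: (source number, source field)}
    mode  -- "UNIQUE" stops at the first match per source-1 entry,
             "ALL" scans every source-2 entry
    """
    if mode not in ("UNIQUE", "ALL"):
        raise ValueError("mode must be 'UNIQUE' or 'ALL'")
    key1, key2 = join
    results = []
    for left in source1:
        for right in source2:
            if left[key1] != right[key2]:
                continue
            entries = {1: left, 2: right}
            results.append(
                {target: entries[src][field] for target, (src, field) in pairs.items()}
            )
            if mode == "UNIQUE":
                break
    return results


# Made-up records for illustration only.
blast = [{"QUERY": "q1", "SP_ID": "FOSB_MOUSE", "SCORE": "680"}]
swiss = [
    {"ID": "FOSB_MOUSE", "DE": "Protein fosB", "SQ": "MFQAFPG"},
    {"ID": "FOSB_MOUSE", "DE": "duplicate entry", "SQ": "MFQ"},
]
pairs = {"QUERY": (1, "QUERY"), "FULL_DESCR": (2, "DE"), "SEQUENCE": (2, "SQ")}
unique_hits = autowrap(blast, swiss, ("SP_ID", "ID"), pairs, mode="UNIQUE")
all_hits = autowrap(blast, swiss, ("SP_ID", "ID"), pairs, mode="ALL")
```

UNIQUE is the cheaper mode when the join key is known to identify at most one entry in source 2, since the inner scan ends early.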

75
Chip-Supplement Query
  • Goal: look up information on microarray genes and
    output it in tabular format
  • Query: join query between protein array and yeast
    genome database
  • 2 queries
  • Chip-Supplement: array join genome
  • Chip-Supplement-Sorted: genome join array

76
OMIM-Plus Query
  • Add reverse links from proteins to the disease database
  • Join query between OMIM database and SWISSPROT
    database
  • Results in OMIM form
  • 86.38 seconds/entry × 12,158 OMIM entries ≈ 291.7
    hours

77
System Components
  • Understand data
  • Layout mining
  • Schema mining
  • Process data
  • Metadata description language
  • Wrapper generation
  • Query execution
  • Query execution with indices

78
Query with Indices: Road Map
  • Motivation and Overview
  • System
  • System Enhancement
  • Language
  • System Implementation
  • Experiments

79
Query with Indices: Motivation
  • Goal
  • Improve the performance of query-proc program
  • Index
  • Maintain the advantages
  • Flat file based
  • Low requirement on programming

80
Challenges and Approaches
  • Various indexing algorithms for various
    biological data
  • User defined indexing functions
  • Standard function interfaces
  • Flat file data
  • Values parsed implicitly and ready to be indexed
  • Byte offset as pointer
  • Metadata about indices
  • Layout descriptor
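
Using a byte offset as the pointer into a flat file can be sketched as follows for FASTA-style data: one pass records where each entry starts, and a later lookup seeks straight to that position. The function names are illustrative and stand in for user-defined index generation and index retrieval functions; the file is opened in binary mode so that `tell()` returns true byte offsets.

```python
import io


def build_offset_index(handle):
    """Map each entry ID to the byte offset of its header line."""
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            parts = line[1:].split()
            if parts:
                index[parts[0].decode("ascii")] = offset
    return index


def fetch_entry(handle, index, ident):
    """Seek to an indexed entry and return its text up to the next header."""
    handle.seek(index[ident])
    lines = [handle.readline()]
    while True:
        line = handle.readline()
        if not line or line.startswith(b">"):
            break
        lines.append(line)
    return b"".join(lines).decode("ascii")


data = io.BytesIO(b">seq1 c1\nAAA\n>seq2 c2\nCCC\nGGG\n")
offsets = build_offset_index(data)
entry = fetch_entry(data, offsets, "seq2")
```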

81
System Revisit
[Diagram: the query-processing system (query parser, descriptor parser, application analyzer, QUERYINFOR, DataReader, DataWriter, Synchronizer, source data files, target data file), extended with an index file and index functions used during query execution.]
82
Language Enhancement
  • Describe indices
  • Indexing is a property of dataset
  • Extend layout descriptors
  • Maintain query format

  • DATASET name INDEX: one entry per indexed attribute, each giving attribute, index_file_loc, index_gen_fun, index_retr_fun and fun_loc
  • New meaning of "=": if an index is available, use the index-retrieving function; otherwise, compare values directly

AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE
83
System Enhancement
  • Metadata descriptor parser
  • Parse index information
  • Application analyzer
  • Index information: index look-up table
  • Test condition: compare_field_indexing

84
Query-Proc Enhancement
  • Synchronizer
  • if index is applicable, check availability of
    index data file
  • If no, call index generation function
  • Load indices
  • Call index retrieving function first for
    candidate entry list

85
Microarray Gene Information Look-up
  • Goal: gather information about genes (120)
  • Query: microarray output join genome database
  • Index: gene names in genome

86
BLAST-ENHANCE Query
  • Goal: add extra information to BLAST output
  • Query: BLAST output join Swiss-Prot database
  • Index: protein ID in Swiss-Prot

87
OMIM-PLUS Query
  • Goal: add Swiss-Prot links to OMIM
  • Query: OMIM join Swiss-Prot
  • Index: protein ID in Swiss-Prot

88
Homology Search Query
  • Goal: find similar sequences
  • Query: query sequence list against sequence database
  • Indexing algorithm
  • Sequence-based
  • Transformation of sub-string composition
  • Indexing n-D numerical values

89
Homology Search (1)
  • Index (Singh's algorithm)
  • Data: yeast genome
  • Wavelet coefficients
  • Minimum bounding rectangles

90
Homology Search (2)
  • Index (Ferhatosmanoglu's algorithm)
  • Data: GenBank
  • Wavelet coefficients
  • Scalar quantization
  • R-tree

91
Road Map
  • Mission Statement
  • Motivation
  • Implementation
  • Comprehensive Example
  • Future work
  • Conclusion

92
Gene Name Nomenclature
  • It is crucial to identify genes CORRECTLY and
    UNAMBIGUOUSLY
  • Genes with multiple names
  • Multiple genes sharing the same name
  • Historically, little central control over the naming
    process

"As biologists strive to make sense of the growing wealth of genomic information, this messy nomenclature is becoming a bugbear." (Helen Pearson, Nature, 2001)
93
Gene Name in DBs
  • Databases related to genes
  • Genome databases (main force in nomenclature)
  • SGD (yeast)
  • HGNC (human)
  • TAIR (a plant)
  • dictyBase (a one-cell amoeba)
  • Curated gene databases
  • Entrez Gene by NCBI
  • Curated gene product databases
  • Swiss-Prot by SIB and EBI

94
Queries About Gene Name
  • Gene identifier usage in databases
  • How are gene symbols in DB A used in DB B?
  • How are gene aliases in DB A used in DB B?
  • Nomenclature across species
  • Q1-Q2: genome to Entrez Gene, Swiss-Prot
  • Q3-Q4: Entrez Gene to Swiss-Prot
  • Nomenclature over time
  • Q5-Q7: Swiss-Prot to genome

95
Challenges
  • Various data representations
  • Line-based texts
  • Tabular forms with or without title
  • Format evolves over time
  • Data storage
  • Large volume
  • Each file is queried only a limited number of times

Approaches: metadata descriptors; format and schema learning; flat file processing
96
Integration System Revisit
[Diagram: the Information Integration System applied to the nomenclature study. Data files: genome databases, Entrez Gene, Swiss-Prot. User requests: join queries. Understand Data: the Layout Miner and Schema Miner produce a Layout Descriptor and a Schema Descriptor for each data source. Process Data: Code Generation produces the Query Processor.]
97
Nomenclature Results (1)
  • Across Species

98
Nomenclature Results (2)
  • Over time

Q5: How many gene IDs in Swiss-Prot are gene IDs in the genome database?
Q6: How many gene IDs in Swiss-Prot are aliases in the genome database?
Q7: How many gene aliases in Swiss-Prot are gene IDs in the genome database?
99
Performance
  • Linear w.r.t. source 1 size

100
Conclusion
  • A framework and a set of tools for on-the-fly
    flat file data integration
  • New data sources understood semi-automatically by
    data mining tools
  • New data processed automatically by generated
    programs
  • Advantages
  • High-level interface, flat file based, acceptable
    performance, low maintenance cost