Title: Supporting on-the-fly data Integration for bioinformatics
1Supporting on-the-fly data Integration for bioinformatics
- Candidate: Xuan Zhang
- Advisor: Gagan Agrawal
2Road Map
- Mission Statement
- Motivation
- Implementation
- Comprehensive Examples
- Future work
- Conclusion
3Mission Statement
- Enhance information integration systems on
- Functionality
- On-the-fly data incorporation
- Flat file data process
- Usability
- Declarative interface
- Low programming requirement
4Motivation
- Integration is essential for biological research
- Biological data include
- Sequences: DNA (GenBank), protein (Swiss-Prot)
- Structure: RNA (RNAbase), protein (PDB)
- Interaction: pathway (KEGG), regulation (GRBase)
- Function: disease (OMIM)
- Secondary: protein family (Pfam)
- Biological data is inter-related.
5Motivation
- Challenges of bioinformatics integration
- Data volume: overwhelming
- DNA sequence: 100 gigabases (August, 2005)
- Data growth: exponential
Figure provided by PDB
6Motivation
- Challenges of bioinformatics integration (cont.)
- Tools: many, and growing in number
- Service interfaces: varied
- Web pages
- Web service
- Grid service
7Motivation
- Challenges of bioinformatics integration (cont.)
- Interoperability: low
- Heterogeneous data sources
- Semi-structured by nature
- Flat file, relational, object-oriented databases
- Independently developed tools
- No data exchange standard
- Little Collaboration
8Road Map
- Mission Statement
- Motivation
- Implementation
- Approach Overview
- Advantage
- Components
- Future
- Conclusion
9Approach Summary
- Metadata
- Declarative description of data
- Data mining algorithms for semi-automatic writing
- Reusable by different requests on same data
- Code generation
- Request analysis and execution separated
- General modules with plug-in data module
10System Overview
[Diagram: the Information Integration System has two stages. Understand Data: a Layout Miner and a Schema Miner read the Data File and produce the Metadata Description (Layout Descriptor and Schema Descriptor). Process Data: Code Generation combines the Metadata Description with the User Request to build a Request Processor, which reads the Data File and returns the Answer.]
11Advantages
- Simple interface
- At metadata level, declarative
- General data model
- Semi-structured data
- Flat file data
- Low human involvement
- Semi-automatic data incorporation
- Low maintenance cost
- Acceptable performance
- Linear scaling guaranteed
12Road Map
- Mission Statement
- Motivation
- Implementation
- Approach Overview
- Advantage
- Components
- Future
- Conclusion
13System Components
- Understand data
- Layout mining
- Schema mining
- Process data
- Wrapper generation
- Query Process
- Query Process with indices
14Layout Mining
- Goal 1: Separate delimiters from values
- D-score: location, frequency
- Goal 2: Organize delimiters and values
- NFA
[Diagram: Data File -> Token Parser -> Tokens -> Delimiter Mining -> Candidate Delimiters -> Layout Learning -> Layout Descriptor]
15Schema Mining Road Map
- Schema Mining
- Overview
- Mining System
- Core Mining Algorithm
- Experiments
16Schema Mining Goals
- Ultimate goal: discover the schema of an unknown flat file dataset
- Immediate goal: assign meaningful labels to attributes
17Our Approach
- Summarize values from bottom up
- Use knowledge from
- Ontology
- Heuristics
- A heads-up: attribute label ≠ attribute name
- What we can mine
- date
- What we cannot do
- Creation date, last modification date, birthday,
18Schema Mining Road Map
- Schema Mining
- Overview
- Mining System
- Core Mining Algorithm
- Experiments
19Schema Mining System
- Major Components
- Data cleaning and summarization
- Score calculation
- Score function
- Ontology
- Heuristics
- Score clustering
[Diagram: Raw attribute values -> Value cleaning and summarization -> Attribute summaries -> Score calculation -> Scores -> Clustering algorithm -> Cutoff values -> Labeling -> Attribute Labels]
20Data Summarization
- Goal: reduce amount of data
- Collect frequent tokens
- Approximate frequent token mining algorithm
- Token categorization by profile
- Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
- Token categories
- Word, number, else and other user-defined categories
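The token profile idea can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `token_profile` and `categorize` are hypothetical, and the "date" rule is borrowed from the heuristics slides (N-A-N, N/N/N) as an example of a user-defined category.

```python
import re

def token_profile(token: str) -> str:
    """Collapse each run of digits to 'N' and each run of letters to 'A'.

    Special characters are kept as they are, so the result is an ordered
    list of N, A and special characters.
    """
    return re.sub(
        r"[0-9]+|[A-Za-z]+",
        lambda m: "N" if m.group()[0].isdigit() else "A",
        token,
    )

def categorize(token: str) -> str:
    """Assign a token category from its profile (assumed rules, for illustration)."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in ("N", "N.N"):
        return "number"
    if profile in ("N-A-N", "N/N/N"):  # hypothetical user-defined category
        return "date"
    return "else"

if __name__ == "__main__":
    for tok in ["12-SEP-1993", "4145", "fosB", "X14897.1"]:
        print(tok, token_profile(tok), categorize(tok))
```

Working on profiles rather than raw tokens keeps the summary small: many distinct dates or numbers share a single profile.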
21Score Function Template
- Desired properties
- Simple
- Adjustable trade-off between sensitivity and error tolerance
22Score Clustering
- Goal: Sort attributes into three groups, H (high), M (middle) and L (low), by scores
- Mathematically, find two scores, score_i and score_j, from score_1, score_2, score_3, ..., score_N, to minimize the standard deviation
- N (number of attributes) is not large. Exact answer can be found.
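Because N is small, the two cut points can be found by exhaustive search. The sketch below assumes the objective is the sum of the within-group (population) standard deviations and that every group is non-empty; the slide does not spell out either detail, and the function name `cluster_scores` is hypothetical.

```python
from itertools import combinations
from statistics import pstdev

def cluster_scores(scores):
    """Split scores into L, M, H groups by trying every pair of cut points.

    Assumed objective: minimize pstdev(L) + pstdev(M) + pstdev(H).
    Requires at least three scores so that no group is empty.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores")
    best_cost, best_groups = None, None
    # i and j are the start indices of the M and H groups in the sorted list.
    for i, j in combinations(range(1, n), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(pstdev(g) for g in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups

if __name__ == "__main__":
    low, middle, high = cluster_scores([0.9, 0.1, 0.55, 0.12, 0.95, 0.5])
    print("L:", low, "M:", middle, "H:", high)
```

The search examines on the order of N^2 splits, which is cheap when N is the number of attributes in one dataset.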
23Schema Mining Road Map
- Schema Mining
- Overview
- Mining System
- Core Mining Algorithm
- Mining with ontology
- Mining with heuristics
- Experiments
24Use of Ontology
- An observation: a similarity between ontology and schema
- Both satisfy is-a relation
- E.g. Diabetes is a disease.
- Ontology: diabetes is a child of disease
- Schema: diabetes is a valid instance of attribute disease
- Common ancestors in ontology -> attribute label
25Real-world Complications
- To find an arbitrary value in an ontology
- Complete and comprehensive ontology?
- Selective sampling
- Error-free dataset?
- Adjustable trade-off between sensitivity and fault tolerance
- Performance
26Ontology Database
- Goal: to approximate a complete, comprehensive ontology database
- Approach
- Complete: sample popular terms
- Comprehensive: public ontology databases and common facts
- Result
- 6 major categories
- 386 terms
27Ontology Based Metrics (1)
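- Occurrence(term)
\[
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Count}_i, & \text{if } term = \mathrm{Frequent\_Token}_i \\
\min_{i \in [0, t]} \mathrm{Frequent\_Count}_i, & \text{if } term = \mathrm{Frequent\_Token}_0 \ldots \mathrm{Frequent\_Token}_t \\
0, & \text{otherwise}
\end{cases}
\]
- Strength(term)
\[
\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \sum_{child\_term} \mathrm{Strength}(child\_term)
\]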
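A small Python sketch of the two metrics follows. It assumes that the operator lost between Occurrence(term) and Strength(child_term) is a sum over child terms, that a multi-token term matches when all of its tokens are frequent tokens, and that the ontology is given as a parent-to-children dictionary. The names `occurrence`, `strength`, `frequent_counts` and `children` are hypothetical.

```python
def occurrence(term, frequent_counts):
    """Occurrence of an ontology term among the frequent tokens.

    A single-token term gets that token's count; a multi-token term gets the
    minimum count over its tokens; a term with any unseen token gets 0.
    """
    tokens = term.lower().split()
    if not tokens or any(t not in frequent_counts for t in tokens):
        return 0
    return min(frequent_counts[t] for t in tokens)

def strength(term, children, frequent_counts):
    """Occurrence of the term plus the strengths of all its child terms (assumed sum)."""
    return occurrence(term, frequent_counts) + sum(
        strength(child, children, frequent_counts)
        for child in children.get(term, [])
    )

if __name__ == "__main__":
    # Toy ontology and token counts, made up for illustration only.
    children = {"disease": ["diabetes", "heart disease"]}
    frequent_counts = {"diabetes": 5, "heart": 3, "disease": 2}
    print(strength("disease", children, frequent_counts))
```

With the toy data, "heart disease" contributes min(3, 2) = 2, so the strength of "disease" is 2 + 5 + 2 = 9.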
28Ontology Based Metrics (2)
- Two factors
- Relative strength compared with other concepts
- Completeness of ontology as a whole
- Ontology score product of two factors
- Each modulated by the template score function
29Mining With Heuristics (1)
- Use token profile
- number: N, N.N
- date: N-A-N, N/N/N
- Use frequent token counts
- identification: Frequent_Counts = 1
- Use other token information
- biological sequence: length > 45, or in 10s
30Mining With Heuristics (2)
- Use token sequence information
- people name: length (2-3), separator ("," or "and"), profile (not number, date)
- Again, these counts are modulated by the template function to calculate scores
31Schema Mining Road Map
- Schema Mining
- Overview
- Mining System
- Core Mining Algorithm
- Experiments
32Schema Mining Experiment Design
- Datasets
- GenBank, UniProt SWISSPROT and Pfam
- Cutoff values
- Exact clustering
- Evaluation
- Weighted Cohen's Kappa
- Compare groups most, middle and little with true labels Y (yes), P (partial) and N (no)
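For reference, weighted Cohen's kappa for this three-level comparison can be computed as below. The sketch assumes linear disagreement weights |i - j| / (k - 1) and an ordinal coding most/Y = 2, middle/P = 1, little/N = 0; the slides do not state which weighting scheme was used, and `weighted_kappa` is a hypothetical name.

```python
def weighted_kappa(predicted, truth, k=3):
    """Weighted Cohen's kappa with linear disagreement weights (assumed).

    predicted and truth are equal-length lists of integer codes in 0..k-1.
    Raises ZeroDivisionError if all items fall into a single category.
    """
    n = len(predicted)
    observed = [[0.0] * k for _ in range(k)]
    for p, t in zip(predicted, truth):
        observed[p][t] += 1.0 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)  # 0 on the diagonal, 1 for most-vs-little
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * row[i] * col[j]
    return 1.0 - weighted_observed / weighted_expected

if __name__ == "__main__":
    group_code = {"most": 2, "middle": 1, "little": 0}
    label_code = {"Y": 2, "P": 1, "N": 0}
    predicted = [group_code[g] for g in ["most", "middle", "little", "most"]]
    truth = [label_code[t] for t in ["Y", "P", "N", "Y"]]
    print(weighted_kappa(predicted, truth))
```

Weighting matters here because confusing "most" with "middle" is a smaller error than confusing "most" with "little".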
33Result Summary Kappa
[Figure: Kappa value per attribute category, against bands labeled Very good, Good and Moderate]
Categories: 1 cellular component, 2 database, 3 date, 4 free text, 5 ID, 6 molecule type, 7 name, 8 number, 9 organism, 10 publication method, 11 sequence
34Cellular Component (O)
35Date (H)
36Organism Name (O)
37Schema Mining Summary
- According to Kappa tests, results are good or very good
- Possible improvement
- Clustering method with better intelligence
- Better ontology database
- More involved language analysis
- Hybrid of bottom-up and top-down approaches
38System Components
- Understand data
- Layout mining
- Schema mining
- Process data
- Metadata description language
- Wrapper generation
- Query Process
- Query Process with indices
39Data Process Overview
- Automatic code generation approach
- Input
- Metadata about datasets involved
- Optional
- Implicit data transformation task
- Request by users
- Indexing functions
- Output
- Executable programs
- General modules
- Task-specific data module
40Metadata Description
- Two aspects of data in flat files
- Logical view of the data
- Physical data organization
- Two components of every data descriptor
- Schema description
- Layout description
- Design goals
- Powerful
- Easy for writing and interpretation
41Metadata Challenges
- Examples of sequence formats
- ALN/ClustalW format
- AMPS Block file format
- ClustalW
- Codata
- EMBL
- GCG/MSF
- GDE
- GenBank
- Fasta (Pearson)
- NBRF/PIR
- PDB format
- Pfam/Stockholm format
- Phylip
- Raw
- RSF
- UniProtKB/Swiss-Prot
>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGE
MPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYD
MPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPRE
ETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAE
LESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGST
SAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPV
VSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL
- Major Challenges
- Various representation
- Semi-structured data
{
name "Short name for sequence"
longname "Long (more descriptive) name for sequence"
sequence-ID "Unique ID number"
creation-date "mm/dd/yy hh:mm:ss"
direction [-1|1]
strandedness [1|2]
type [DNA|RNA|PROTEIN|TEXT|MASK]
offset (-999999,999999)
group-ID (0,999)
creator "Author's name"
descrip "Verbose description"
comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote"
sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattg
c"
}
LOCUS       MMFOSB 4145 bp mRNA linear ROD 12-SEP-1993
DEFINITION  Mouse fosB mRNA.
ACCESSION   X14897
VERSION     X14897.1 GI:50991
KEYWORDS    fos cellular oncogene; fosB oncogene; oncogene.
SOURCE      Mus musculus.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1 (bases 1 to 4145)
  AUTHORS   Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and Bravo,R.
  TITLE     The product of a novel growth factor activated gene, fos B, interacts with JUN proteins enhancing their DNA binding activity
  JOURNAL   EMBO J. 8 (3), 805-813 (1989)
  MEDLINE   89251612
  PUBMED    2498083
COMMENT     clone=AC113-1; cell line=NIH3T3.
FEATURES    Location/Qualifiers
  source    1..4145
            /organism="Mus musculus"
            /db_xref="taxon:10090"
  CDS       1202..2218
            /note="fosB protein (AA 1-338)"
            /codon_start=1
            /protein_id="CAA33026.1"
            /db_xref="GI:50992"
            /db_xref="MGD:95575"
            /db_xref="SWISS-PROT:P13346"
            /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC
            AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT
            SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV
            RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH
            KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL
            TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS
            LLAL"
BASE COUNT  960 a 1186 c 1007 g 991 t 1 others
ORIGIN
1 ataaattctt attttgacac
tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca 61
aagtacagaa ggcttggtca catttaaatc actgagaact
agagagaaat actatcgcaa 121 actgtaatag acattacatc
cataaaagtt tccccagtcc ttattgtaat attgcacagt 181
gcaattgcta catggcaaac tagtgtagca tagaagtcaa
agcaaaaaca aaccaaagaa 241 aggagccaca agagtaaaac
tgttcaacag ttaatagttc aaactaagcc attgaatcta 301
tcattgggat cgttaaaatg aatcttccta caccttgcag
tgtatgattt aacttttaca 361 gaacacaagc caagtttaaa
atcagcagta gagatattaa aatgaaaagg tttgctaata 421
gagtaacatt aaataccctg aaggaaaaaa aacctaaata
tcaaaataac tgattaaaat 481 tcacttgcaa attagcacac
gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga 541
aaacataaaa caaaactatt aaaatagttt tagagggggt
aaaatccagg tcctctgcca 601 ggatgctaaa attagacttc
aggggaattt tgaagtcttc aattttgaaa cctattaaaa 661
agcccatgat tacagttaat taagagcagt gcacgcaaca
gtgacacgcc tttagagagc 721 attactgtgt atgaacatgt
tggctgctac cagccacagt caatttaaca aggctgctca 781
gtcatgaact taatacagag agagcacgcc taggcagcaa
gcacagcttg ctgggccact 841 ttcctccctg tcgtgacaca
atcaatccgt gtacttggtg tatctgaagc gcacgctgca 901
ccgcggcact gcccggcggg tttctgggcg gggagcgatc
cccgcgtcgc cccccgtgaa 961 accgacagag cctggacttt
caggaggtac agcggcggtc tgaaggggat ctgggatctt 1021
gcagagggaa cttgcatcga aacttgggca gttctccgaa
ccggagacta agcttccccg 1081 agcagcgcac tttggagacg
tgtccggtct actccggact cgcatctcat tccactcggc 1141
catagccttg gcttcccggc gacctcagcg tggtcacagg
ggcccccctg tgcccaggga 1201 aatgtttcaa gcttttcccg
gagactacga ctccggctcc cggtgtagct catcaccctc 1261
cgccgagtct cagtacctgt cttcggtgga ctccttcggc
agtccaccca ccgccgccgc 1321 ctcccaggag tgcgccggtc
tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc 1381
aatcacaacc agccaggatc ttcagtggct cgtgcaaccc
accctcatct cttccatggc 1441 c
List and example provided by EMBL-EBI
42Schema Descriptors
- Follow XML DTD standard for semi-structured data
- Simple attribute list for relational data
<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>
FASTA //Schema Name
ID string //Data type definitions
DESCRIPTION string
SEQ string
43Layout Descriptors
- Overall structure (FASTA example)
- DATASET FASTAData //Dataset name
- DATATYPE FASTA //Schema name
- DATASPACE LINESIZE=80
- // ---- File layout details go here ----
-
- DATA osu/fasta //File location
-
44File Layout
- Key observations on line-based biological data files
- Strings of variable length
- Delimiters widely used
- Data fields may be divided into variables
- Repetitive structures
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3
45Layout Descriptors
- File layout (FASTA example)
>seq1 comment1 \n
ASTPGHTIIYEAVCLHNDRTTIP \n
>seq2 comment2 \n
ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK \n
KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES \n
>seq3
- DATASPACE LINESIZE=80
- <
- > ID DESCRIPTION
- < \n SEQ >
- \n EOF
- >
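To make the FASTA layout concrete, here is a hand-written reader that follows the same structure: a ">" delimiter introduces ID and DESCRIPTION, each following line contributes to SEQ, and the next ">" or end of file closes the entry. It is a sketch of what a reader for this layout has to do, not the generated code itself; `read_fasta_entries` is a hypothetical name.

```python
def read_fasta_entries(lines):
    """Yield one dict per FASTA entry with keys ID, DESCRIPTION and SEQ."""
    entry = None
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new ">" delimiter ends the previous entry, if any.
            if entry is not None:
                yield entry
            header = line[1:].split(None, 1)
            entry = {
                "ID": header[0] if header else "",
                "DESCRIPTION": header[1].strip() if len(header) > 1 else "",
                "SEQ": "",
            }
        elif entry is not None:
            # Repeated "\n SEQ" structure: sequence lines are concatenated.
            entry["SEQ"] += line.strip()
    if entry is not None:
        yield entry  # EOF closes the last entry

if __name__ == "__main__":
    sample = [
        ">seq1 comment1\n",
        "ASTPGHTIIYEAVCLHNDRTTIP\n",
        ">seq2 comment2\n",
        "ASQKRPSQRHGSKYLATASTMDHARHGFLPRHRDTGILDSIGRFFGGDRGAPK\n",
        "KSAHKGFKGVDAQGTLSKIFKLGGRDSRSGSPMARRELVISLIVES\n",
    ]
    for e in read_fasta_entries(sample):
        print(e["ID"], e["DESCRIPTION"], len(e["SEQ"]))
```

Writing such a reader by hand for every format in the earlier list is exactly the effort the layout descriptor is meant to replace.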
46System Component
- Understand data
- Layout mining
- Schema mining
- Process data
- Metadata description language
- Wrapper generation
- Query execution
- Query execution with indices
47Wrapper Generation Road Map
- Motivation and overview
- System structure
- Wrapper generation
- Wrapper execution
- Experiments
48Wrapper Generation Motivation
- Wrappers are essential for bioinformatics integration
- Heterogeneous data sources
- Function: transform data
- Current solutions
- Manually written wrappers
- Scripts
49Wrapper Generation Advantages
- Wrapper generated automatically
- Stand-alone programs for integration systems and workflows
- Little human interference. New resources can be integrated on-the-fly
- Direct transformation. No unnecessary intermediate form needed
- Only requires data description at metadata level, one descriptor per data source
- Transfer data from flat files directly
- No DB support required
- No other domain or format heuristics
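50Wrapper Generation System Overview
[Diagram: the wrapper generation system takes the Schema Descriptors, the Layout Descriptor and a Mapping File (produced by the Mapping Generator). The Layout Parser yields the Data Entry Representation and the Mapping Parser yields the Schema Mapping; the Application Analyzer turns both into WRAPINFO. The wrapper consists of a Synchronizer, a DataReader on the Source Dataset and a DataWriter on the Target Dataset.]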
51Layout Parse Tree
- DATASPACE LINESIZE=80
- <
- > ID DESCRIPTION
- < \n SEQ >
- \n EOF
- >
[Diagram: parse tree of the layout above. Root: DATASPACE, linesize 80. Internal nodes (< >): environments. Leaves: delimiter-variable (DLM-VAR) pairs: >-ID, (space)-DESCRIPTION, \n-SEQ, \n-DUMMY, EOF.]
52Schema Mapping
- Algorithm: strict name matching
- for field ft in target schema
- for field fs in source schema
- if ft = fs then add pair (fs, ft) to the mapping
- Output
- A list of attribute pairs
- An editable file for the user to verify and modify
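The matching loop above translates directly into Python. The function name and the sample field names are illustrative only.

```python
def strict_name_mapping(source_fields, target_fields):
    """Return (source, target) pairs whose field names are exactly equal."""
    mapping = []
    for ft in target_fields:
        for fs in source_fields:
            if ft == fs:
                mapping.append((fs, ft))
    return mapping

if __name__ == "__main__":
    # Hypothetical schemas: only ID matches by name, so the remaining pairs
    # (for example DE -> DESCRIPTION) would have to be added by the user.
    source = ["ID", "DE", "SQ"]
    target = ["ID", "DESCRIPTION", "SEQ"]
    print(strict_name_mapping(source, target))
```

Strict matching misses pairs that differ only in naming, which is why the output is written to an editable file for the user to complete.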
53Wrapping Assumptions
- Convert semi-structured (and structured) data to structured data
- Both datasets are stored record-wise
- Order of records not disturbed after wrapping
- Semi-structured -> Structured: data can be transformed entry by entry
54Application Analyzer
- Task: to generate clear directions for the wrapper and organize them in WRAPINFOR
- Sub-tasks
- What values to store
- How to extract values
- How to store values
- How to write values
55Important Concepts (1)
- Useful
- An attribute is useful iff its values are in the target
- Reachable
- Node b is reachable from node a if there exists a valid layout configuration such that a.DLM and b.DLM define the boundaries of a.VAR
- i.e. a.DLM a.VAR b.DLM
- A value instance is between
- Its own delimiter
- The first appearance of its reachable delimiters
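The boundary rule can be illustrated with plain string searching: a value starts right after its own delimiter and ends at the earliest occurrence of any reachable delimiter. This is a sketch under that reading; `extract_value` and its arguments are hypothetical names, and the reachable sets in the example are written by hand rather than derived from a layout parse tree.

```python
def extract_value(text, start, own_dlm, reachable_dlms):
    """Return (value, end) for the value that follows own_dlm at or after start.

    The value ends at the first appearance of any reachable delimiter, or at
    the end of the text if none appears. Returns (None, start) if own_dlm is
    not found.
    """
    begin = text.find(own_dlm, start)
    if begin == -1:
        return None, start
    begin += len(own_dlm)
    ends = [pos for pos in (text.find(d, begin) for d in reachable_dlms) if pos != -1]
    end = min(ends) if ends else len(text)
    return text[begin:end], end

if __name__ == "__main__":
    text = ">seq1 comment1\nASTPG\n>seq2"
    seq_id, pos = extract_value(text, 0, ">", [" "])
    descr, pos = extract_value(text, pos, " ", ["\n"])
    seq, pos = extract_value(text, pos, "\n", ["\n"])
    print(seq_id, descr, seq)
```

Each call resumes from the position where the previous value ended, so one pass over the entry extracts all values.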
56Important Concepts (2)
- Attribute Cardinality
- Regular attribute: fixed number of values per entry
- E.g. ID
- Semi-structured attribute: varied number of values per entry
- E.g. References
57WRAPINFOR
- Contents: information to answer a particular wrapping task
- Form: in XML
- 5 look-up tables
- Delimiter, Usefulness, Cardinality, Label, Reachable
- 3 parameters
- one_to_one_total, one_to_multiple_total, complete_in
- Function: plug into general modules to form a functional wrapper
58Wrapper Generation Road Map
- Motivation and overview of our approach
- System structure
- Wrapper generation
- Wrapper execution
- Experiments
59Wrapper Overview
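[Diagram: the Synchronizer controls the DataReader (RA) and the DataWriter (FA) with load, run and halt signals. The DataReader reads the Input dataset through a Dataset buffer and fills the Value buffer (one_to_one_values, one_to_multiple_values); the DataWriter reads the Value buffer and writes the Output dataset.]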
60Wrapper Structure
- One data module: WRAPINFO
- Three general action modules
- Synchronizer: central controller
- DataReader, DataWriter: interact with datasets
- One value buffer
- Suitable for data grid
- Transform data one entry at a time
61Wrapper Execution
- DataReader
- Extract attribute value
- Delimiter table, Reachable table
- Fill value buffer: Label look-up table
- DataWriter
- Retrieve from value buffer: Label look-up table
- Write target file
- Delimiter table, Reachable table, Label table
- Synchronizer
- Call DataReader on source: parameters
- Call DataWriter on target: parameters
62Wrapper Experiments (1)
- Analysis time: constant
- Execution time: linear
[Figure: TRANSFAC-to-Reference Problem; both axes in logarithmic scale]
63Wrapper Experiments (2)
- Performance comparable to handwritten codes
SWISSPROT-to-FASTA Problem
64System Components
- Understand data
- Layout mining
- Schema mining
- Process data
- Metadata description language
- Wrapper generation
- Query execution
- Query execution with indices
65Query Execution Road Map
- Motivation
- System Overview
- System Implementation
- Languages
- System
- Experiments
66Limitation of Wrapper
- Data Wrapping
- Data formatting, data projection
- Other query types
- Selection
- Cross Product
- Join
- New Functionalities
- Value examination
- Multiple datasets
67Advantages
- Retrieve multiple pieces of information all at once
- Data easily available
- Declarative languages only
- High flexibility
- Low over-head
- Suitable for data grid
68System Enhanced
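[Diagram: Query analysis: the Query parser extracts source/target names and mappings from the query; the Descriptor parser reads the Dataset descriptors; Metadata collection passes schema and layout information to the Application analyzer, which outputs QUERYINFOR. Query execution: QUERYINFOR drives the Synchronizer, the DataReader on the source data files and the DataWriter on the target data file.]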
69Query Execution Road Map
- Motivation
- System Overview
- System Implementation
- Languages
- Metadata Description Language
- Query Language
- System
- Query Analysis
- Query Execution
- Experiments
70Query Language
- Declarative, SQL-like
- Projection, selection, cross product, join queries
- Example
AUTOWRAP POSTBLAST //Target dataset
FROM BLASTP, SWISSPROT //Source datasets
BY BLASTP.SP_ID = SWISSPROT.ID //Join criteria
WHERE POSTBLAST.QUERY = BLASTP.QUERY //Attribute pairs
POSTBLAST.SP_AC = BLASTP.SP_AC
POSTBLAST.SP_ID = BLASTP.SP_ID
POSTBLAST.FULL_DESCR = SWISSPROT.DE
POSTBLAST.SEQUENCE = SWISSPROT.SQ
POSTBLAST.SCORE = BLASTP.SCORE
POSTBLAST.E_VALUE = BLASTP.E_VALUE
71Application Analyzer Enhancement
- Constant values in query
- Pseudo-label look-up table
- Other query information
- Parameters: comparing field pairs
- Output: QUERYINFOR
72Query Execution
- Query-Proc Structure
- DataReader and DataWriter
- Similar to wrapper
- Value buffer
- Store useful values from one data entry of every source dataset
[Diagram: QUERYINFOR drives the Synchronizer, which calls the DataReader on the source data files and the DataWriter on the target data file.]
73Enhanced Synchronizer
- Synchronizer
- Set up pseudo-attributes: Pseudo-label look-up table
- Call DataReader on source 1 and 2: Parameters
- Call DataWriter on target: Parameters
- Test join conditions: Parameters
- Clean value buffer: Parameters
74Post-BLAST Query
- Goal: Enhance BLAST output to FASTA format
- Query: Join query between BLAST output (source 1) and SWISSPROT (source 2)
- 2 modes
- UNIQUE: halt once a match is found in source 2
- ALL: search all source 2 entries
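The two modes differ only in when the scan of source 2 stops. A minimal nested-loop sketch, assuming each entry has already been parsed into a dictionary (the function name, field names and sample entries are illustrative):

```python
def join_entries(source1, source2, key1, key2, mode="ALL"):
    """Yield (entry1, entry2) pairs whose join fields are equal.

    mode="UNIQUE" stops scanning source2 after the first match for an entry;
    mode="ALL" scans every source2 entry. source2 must be re-iterable.
    """
    if mode not in ("ALL", "UNIQUE"):
        raise ValueError("mode must be 'ALL' or 'UNIQUE'")
    for e1 in source1:
        for e2 in source2:
            if e1[key1] == e2[key2]:
                yield e1, e2
                if mode == "UNIQUE":
                    break

if __name__ == "__main__":
    blast = [{"QUERY": "q1", "SP_ID": "FOSB_MOUSE"}, {"QUERY": "q2", "SP_ID": "UNKNOWN"}]
    swissprot = [{"ID": "FOSB_MOUSE", "DE": "Protein fosB"}]
    for hit, record in join_entries(blast, swissprot, "SP_ID", "ID", mode="UNIQUE"):
        print(hit["QUERY"], record["DE"])
```

UNIQUE is safe when the join field is an identifier in source 2; without an index, either mode still scans source 2 once per source 1 entry in the worst case.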
75Chip-Supplement Query
- Goal: Look up microarray gene information into tabular format
- Query: Join query between protein array and yeast genome database
- 2 queries
- Chip-Supplement
- array join genome
- Chip-Supplement-Sorted
- genome join array
76OMIM-Plus Query
- Add reverse links of proteins to disease database
- Join query between OMIM database and SWISSPROT database
- Results in OMIM form
- 86.38 seconds/entry × 12,158 OMIM entries ≈ 291.7 hours
77System Components
- Understand data
- Layout mining
- Schema mining
- Process data
- Metadata description language
- Wrapper generation
- Query execution
- Query execution with indices
78Query with Indices Road Map
- Motivation and Overview
- System
- System Enhancement
- Language
- System Implementation
- Experiments
79Query With Indices Motivation
- Goal
- Improve the performance of query-proc program
- Index
- Maintain the advantages
- Flat file based
- Low requirement on programming
80Challenges and Approaches
- Various indexing algorithms for various biological data
- User defined indexing functions
- Standard function interfaces
- Flat file data
- Values parsed implicitly and ready to be indexed
- Byte offset as pointer
- Metadata about indices
- Layout descriptor
81System Revisit
[Diagram: same structure as the enhanced system. Query analysis: Query parser (source/target names, mappings), Descriptor parser (Dataset descriptors), Metadata collection (schema and layout information), Application analyzer, output QUERYINFOR. Query execution: Synchronizer, DataReader on source data files, DataWriter on target data file, now extended with an Index file and Index functions.]
82Language Enhancement
- Describe indices
- Indexing is a property of dataset
- Extend layout descriptors
- Maintain query format
DATASET name INDEX attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc, attribute:index_file_loc:index_gen_fun:index_retr_fun:fun_loc
- New meaning of "=": if an index is available, use the index retrieving function; else, compare values directly
AUTOWRAP GNAMES
FROM CHIPDATA, YEASTGENOME
BY CHIPDATA.GENE = YEASTGENOME.ID
WHERE
83System Enhancement
- Metadata Descriptor Parser
- parse index information
- Application Analyzer
- index information: index look-up table
- test condition: compare_field_indexing
84Query-Proc Enhancement
- Synchronizer
- If index is applicable, check availability of index data file
- If no, call index generation function
- Load indices
- Call index retrieving function first for candidate entry list
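The check-generate-load-retrieve sequence, with byte offsets as pointers into the flat file, might look like the sketch below. It assumes a FASTA-style file indexed on the ID after ">" and a JSON index file; all function names and the file layout are hypothetical stand-ins for the user-defined index generation and retrieving functions.

```python
import json
import os

def build_index(data_path):
    """Index generation: map each entry ID to the byte offset of its '>' line."""
    index = {}
    with open(data_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):
                parts = line[1:].split()
                if parts:
                    index[parts[0].decode()] = offset
    return index

def load_or_build_index(data_path, index_path):
    """Use the index file if it exists; otherwise generate and save it."""
    if os.path.exists(index_path):
        with open(index_path, "r", encoding="utf-8") as f:
            return json.load(f)
    index = build_index(data_path)
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index

def fetch_entry(data_path, offset):
    """Index retrieval: seek to the byte offset and read one entry."""
    with open(data_path, "rb") as f:
        f.seek(offset)
        header = f.readline().decode().strip()
        seq = []
        for line in f:
            if line.startswith(b">"):
                break
            seq.append(line.decode().strip())
    return header, "".join(seq)
```

A join then looks up each source 1 key in the index and reads only the matching source 2 entries, instead of scanning the whole file for every key.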
85Microarray Gene Information Look-up
- Goal: gather information about genes (120)
- Query: microarray output join genome database
- Index: gene names in genome
86BLAST-ENHANCE Query
- Goal: Add extra information to BLAST output
- Query: BLAST output join Swiss-Prot database
- Index: protein ID in Swiss-Prot
87OMIM-PLUS Query
- Goal: add Swiss-Prot link to OMIM
- Query: OMIM join Swiss-Prot
- Index: protein ID in Swiss-Prot
88Homology Search Query
- Goal: find similar sequences
- Query: query sequence list join sequence database
- Indexing algorithm
- Sequence-based
- Transformation of sub-string composition
- Indexing n-D numerical values
89Homology Search (1)
- Index (Singh's algorithm)
- Data: yeast genome
- Wavelet coefficients
- minimum bounding rectangles
90Homology Search (2)
- Index (Ferhatosmanoglu's algorithm)
- Data: GenBank
- Wavelet coefficients
- Scalar quantization
- R-tree
91Road Map
- Mission Statement
- Motivation
- Implementation
- Comprehensive Example
- Future work
- Conclusion
92Gene Name Nomenclature
- It is crucial to identify genes CORRECTLY and UNAMBIGUOUSLY
- Genes with multiple names
- Multiple genes share the same name
- Historically, little central control over the naming process
"As biologists strive to make sense of the growing wealth of genomic information, this messy nomenclature is becoming a bugbear" (Helen Pearson, Nature, 2001)
93Gene Name in DBs
- Databases related to genes
- Genome databases (main force in nomenclature)
- SGD (yeast)
- HGNC (human)
- TAIR (a plant)
- dictyBase (a one-cell amoeba)
- Curated gene databases
- Entrez Gene by NCBI
- Curated gene product databases
- Swiss-Prot by SIB and EBI
94Queries About Gene Name
- Gene identifier usage in databases
- How are gene symbols in DB A used in DB B?
- How are gene aliases in DB A used in DB B?
- Nomenclature across species
- Q1-Q2: genome -> Entrez Gene, Swiss-Prot
- Q3-Q4: Entrez Gene -> Swiss-Prot
- Nomenclature over time
- Q5-Q7: Swiss-Prot -> genome
95Challenges
- Various data representation
- Line-based texts
- Tabular forms with or without title
- Format evolves over time
- Data storage
- Large volume
- Each file queried limited times
- Approaches
- Metadata descriptors
- Format and schema learning
- Flat file processing
96Integration System Revisit
[Diagram: the Information Integration System applied to the nomenclature study. Data Files: Genome, Entrez Gene, Swiss-Prot. Understand Data: the Layout Miner and Schema Miner produce a Metadata Description (Layout Descriptor and Schema Descriptor) for each dataset. Process Data: Code Generation builds a Query Processor that answers the User Request (join queries).]
97Nomenclature Results (1)
98Nomenclature Results (2)
Q5: How many gene IDs in Swiss-Prot are gene IDs in genome?
Q6: How many gene IDs in Swiss-Prot are aliases in genome?
Q7: How many gene aliases in Swiss-Prot are gene IDs in genome?
99Performance
- Linear w.r.t. source 1 size
100Conclusion
- A framework and a set of tools for on-the-fly flat file data integration
- New data sources understood semi-automatically by data mining tools
- New data processed automatically by generated programs
- Advantages
- High-level interface, flat file based, acceptable performance, low maintenance cost