Supporting on-the-fly data Integration for bioinformatics

1
Supporting on-the-fly data Integration for
bioinformatics

Candidate: Xuan Zhang
Advisor: Gagan Agrawal

2
Road Map

Mission Statement
Motivation
Implementation
Comprehensive Examples
Future work
Conclusion

3
Mission Statement

Enhance information integration systems on
Functionality
On-the-fly data incorporation
Flat-file data processing
Usability
Declarative interface
Low programming requirement

4
Motivation

Integration is essential for biological research
Biological data include
Sequences: DNA (GenBank), protein (Swiss-Prot)
Structure: RNA (RNAbase), protein (PDB)
Interaction: pathway (KEGG), regulation (GRBase)
Function: disease (OMIM)
Secondary: protein family (Pfam)
Biological data are inter-related.

5
Motivation

Challenges of bioinformatics integration
Data volume: overwhelming
DNA sequence: 100 gigabases (August 2005)
Data growth: exponential

Figure provided by PDB
6
Motivation

Challenges of bioinformatics integration (cont.)
Tools: many, with more appearing
Service interfaces: varied
Web pages
Web service
Grid service

7
Motivation

Challenges of bioinformatics integration (cont.)
Interoperability: low
Heterogeneous data sources
Semi-structured by nature
Flat file, relational, object-oriented databases
Independently developed tools
No data exchange standard
Little collaboration

8
Road Map

Mission Statement
Motivation
Implementation
Future
Conclusion

Approach Overview
Advantage
Components

9
Approach Summary

Metadata
Declarative description of data
Data mining algorithms for semi-automatic writing
Reusable by different requests on same data
Code generation
Request analysis and execution separated
General modules with plug-in data module

10
System Overview
[Figure: Information Integration System. Understand Data: Data File -> Layout Miner and Schema Miner -> Metadata Description (Layout Descriptor, Schema Descriptor). Process Data: User Request and Metadata Description -> Code Generation -> Request Processor -> Answer.]
11
Advantages

Simple interface
At metadata level, declarative
General data model
Semi-structured data
Flat file data
Low human involvement
Semi-automatic data incorporation
Low maintenance cost
Acceptable performance
Linear scaling guaranteed

12
Road Map

Mission Statement
Motivation
Implementation
Future
Conclusion

Approach Overview
Advantage
Components

13
System Components

Understand data
Layout mining
Schema mining
Process data
Wrapper generation
Query Process
Query Process with indices

14
Layout Mining

Goal 1: Separate delimiters from values
D-score: location, frequency
Goal 2: Organize delimiters and values
NFA

[Figure: Data File -> Token Parser -> Tokens -> Delimiter Mining -> Candidate Delimiters -> Layout Learning -> Layout Descriptor]
15
Schema Mining Road Map

Schema Mining
Overview
Mining System
Core Mining Algorithm
Experiments

16
Schema Mining Goals

Ultimate goal: discover the schema of an unknown flat-file dataset
Immediate goal: assign meaningful labels to attributes

17
Our Approach

Summarize values from bottom up
Use knowledge from
Ontology
Heuristics
A heads-up: attribute label ≠ attribute name
What we can mine:
date
What we cannot do:
Creation date, last modification date, birthday, ...

18
Schema Mining Road Map

Schema Mining
Overview
Mining System
Core Mining Algorithm
Experiments

19
Schema Mining System

Major Components
Data cleaning and summarization
Score calculation
Score function
Ontology
Heuristics
Score clustering

[Figure: Raw attribute values -> Value cleaning and summarization -> Attribute summaries -> Score calculation -> Scores -> Clustering algorithm (cutoff values) -> Labeling -> Attribute labels]
20
Data Summarization

Goal: reduce the amount of data
Collect frequent tokens
Approximate frequent token mining algorithm
Token categorization by profile
Token profile: an ordered list of N (numerical), A (alphabetic) and special characters
Token categories
Word, number, else and other user-defined categories
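
A minimal sketch of how a token profile could be computed, assuming each run of digits collapses to N, each run of letters collapses to A, and every other character is kept as-is. The function names `token_profile` and `token_category` are illustrative and do not come from the presentation.

```python
import re

# One run of digits or one run of ASCII letters.
_RUN = re.compile(r"[0-9]+|[A-Za-z]+")


def token_profile(token: str) -> str:
    """Collapse digit runs to 'N' and letter runs to 'A'; keep other characters."""
    return _RUN.sub(lambda m: "N" if m.group()[0].isdigit() else "A", token)


def token_category(token: str) -> str:
    """Assumed mapping from profile to the built-in categories word/number/else."""
    profile = token_profile(token)
    if profile == "A":
        return "word"
    if profile in ("N", "N.N"):
        return "number"
    return "else"
```

For example, `token_profile("12-SEP-1993")` gives `N-A-N`, which is one of the date profiles listed on the heuristics slide.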

21
Score Function Template

Desired property
Simple
Adjustable trade-off between sensitivity and
error tolerance

22
Score Clustering

Goal: sort attributes into three groups, H (high), L (low) and M (middle), by score
Mathematically, find two cut scores, score_i and score_j, among score_1, score_2, score_3, ..., score_N that minimize the standard deviation
N (the number of attributes) is not large, so the exact answer can be found.
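
A sketch of the exact search, under a stated assumption: the slide does not say which standard deviation is minimized, so this version minimizes the sum of the within-group population standard deviations over all ways of cutting the sorted scores into three non-empty groups. `cluster_scores` is a hypothetical name.

```python
from itertools import combinations
from statistics import pstdev


def cluster_scores(scores):
    """Split scores into (low, middle, high) groups by exhaustive search.

    Assumed objective: the sum of within-group standard deviations.
    """
    ordered = sorted(scores)
    n = len(ordered)
    if n < 3:
        raise ValueError("need at least three scores")
    best_cost, best_groups = None, None
    # i and j are the two cut positions; every group stays non-empty.
    for i, j in combinations(range(1, n), 2):
        groups = (ordered[:i], ordered[i:j], ordered[j:])
        cost = sum(pstdev(group) for group in groups)
        if best_cost is None or cost < best_cost:
            best_cost, best_groups = cost, groups
    return best_groups
```

The search tries on the order of N squared cut pairs, which is cheap when N is the number of attributes in one dataset.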

23
Schema Mining Road Map

Schema Mining
Overview
Mining System
Core Mining Algorithm
Mining with ontology
Mining with heuristics
Experiments

24
Use of Ontology

An observation: a similarity between ontology and schema
Both satisfy the is-a relation
E.g., diabetes is a disease.
Ontology: diabetes is a child of disease
Schema: diabetes is a valid instance of attribute disease
Common ancestors in ontology -> attribute label

25
Real-world Complications

To find an arbitrary value in an ontology
Complete and comprehensive ontology?
Selective sampling
Error-free dataset?
Adjustable sensitivity and fault tolerance
Performance

26
Ontology Database

Goal: approximate a complete, comprehensive ontology database
Approach
Complete: sample popular terms
Comprehensive: public ontology databases, common facts
Result
6 major categories
386 terms

27
Ontology Based Metrics (1)

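\[
\mathrm{Occurrence}(term) =
\begin{cases}
\mathrm{Frequent\_Count}_i, & \text{if } term = \mathrm{Frequent\_Token}_i \\
\min_{i \in [0, t]} \mathrm{Frequent\_Count}_i, & \text{if } term = \mathrm{Frequent\_Token}_0 \ldots \mathrm{Frequent\_Token}_t \\
0, & \text{otherwise}
\end{cases}
\]
\[
\mathrm{Strength}(term) = \mathrm{Occurrence}(term) + \sum \mathrm{Strength}(child\_term)
\]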
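
A small sketch of the two metrics on a toy ontology. It assumes that a multi-word term takes the minimum count over its tokens, and that Strength adds a term's own occurrence to the strengths of its children (the operator is not legible in the transcript). The ontology and counts below are made up for illustration.

```python
def occurrence(term, frequent_counts):
    """Count for a term: its token count, or the minimum over a multi-token term."""
    tokens = term.split()
    if tokens and all(token in frequent_counts for token in tokens):
        return min(frequent_counts[token] for token in tokens)
    return 0


def strength(term, children, frequent_counts):
    """Assumed definition: own occurrence plus the strength of every child term."""
    total = occurrence(term, frequent_counts)
    for child in children.get(term, []):
        total += strength(child, children, frequent_counts)
    return total


# Toy data (hypothetical): parent term -> child terms, and frequent token counts.
ONTOLOGY = {"disease": ["diabetes", "cancer"], "cancer": ["lung cancer"]}
COUNTS = {"diabetes": 4, "cancer": 3, "lung": 2}
```

With this toy data, an attribute whose values mention diabetes and cancer accumulates strength at the common ancestor `disease`, which is the candidate label.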

28
Ontology Based Metrics (2)

Two factors
Relative strength compared with other concepts
Completeness of ontology as a whole
Ontology score: product of the two factors
Each modulated by the template score function

29
Mining With Heuristics (1)

Use token profile
number: N, N.N
date: N-A-N, N/N/N
Use frequent token counts
identification: Frequent_Counts1
Use other token information
biological sequence: length > 45, or in 10s

30
Mining With Heuristics (2)

Use token sequence information
people name: length (23), separator (, or and), profile (not number, date)
Again, these counts are modulated by the template function to calculate scores
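
A sketch of the profile-based heuristics for the number and date labels only, assuming the same profile convention as the data summarization slide (digit runs become N, letter runs become A). The rule table and function names are illustrative; the template score function that would modulate these counts is not shown in the transcript and is not modeled here.

```python
import re
from collections import Counter

_RUN = re.compile(r"[0-9]+|[A-Za-z]+")

# Profiles taken from the slide: number is N or N.N, date is N-A-N or N/N/N.
PROFILE_RULES = {
    "number": {"N", "N.N"},
    "date": {"N-A-N", "N/N/N"},
}


def token_profile(token):
    """Collapse digit runs to 'N' and letter runs to 'A'."""
    return _RUN.sub(lambda m: "N" if m.group()[0].isdigit() else "A", token)


def heuristic_counts(tokens):
    """Count, per label, how many tokens have a profile matching that label."""
    counts = Counter()
    for token in tokens:
        profile = token_profile(token)
        for label, profiles in PROFILE_RULES.items():
            if profile in profiles:
                counts[label] += 1
    return dict(counts)
```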

31
Schema Mining Road Map

Schema Mining
Overview
Mining System
Core Mining Algorithm
Experiments

32
Schema Mining Experiment Design

Datasets
GenBank, UniProt/Swiss-Prot and Pfam
Cutoff values
Exact clustering
Evaluation
Weighted Cohen's kappa
Compare the groups most, middle and little with the true labels Y (yes), P (partial) and N (no)
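
A sketch of weighted Cohen's kappa for ordinal categories. The slides do not say which weighting was used, so linear disagreement weights are assumed here. Categories are encoded as integers, for example most/middle/little and Y/P/N both mapped to 0, 1, 2.

```python
def weighted_kappa(predicted, actual, k):
    """Weighted Cohen's kappa with linear disagreement weights |i - j| / (k - 1).

    predicted and actual are equal-length lists of integers in range(k).
    Raises ZeroDivisionError when expected disagreement is zero.
    """
    n = len(predicted)
    observed = [[0] * k for _ in range(k)]
    for p, a in zip(predicted, actual):
        observed[p][a] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    observed_disagreement = 0.0
    expected_disagreement = 0.0
    for i in range(k):
        for j in range(k):
            weight = abs(i - j) / (k - 1)
            observed_disagreement += weight * observed[i][j]
            expected_disagreement += weight * row_totals[i] * col_totals[j] / n
    return 1.0 - observed_disagreement / expected_disagreement
```

A value of 1 means the mined groups agree perfectly with the true labels; partial misses (adjacent categories) are penalized less than full misses.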

33
Result Summary Kappa
Very good
Good
Moderate
Attributes: 1 cellular component, 2 database, 3 date, 4 free text, 5 ID, 6 molecule type, 7 name, 8 number, 9 organism, 10 publication method, 11 sequence
34
Cellular Component (O)
35
Date (H)
36
Organism Name (O)
37
Schema Mining Summary

According to Kappa tests, results are good or
very good
Possible improvement
Clustering method with better intelligence
Better ontology database
More involved language analysis
Hybrid of bottom-up and top-down approaches

38
System Components

Understand data
Layout mining
Schema mining
Process data
Metadata description language
Wrapper generation
Query Process
Query Process with indices

39
Data Process Overview

Automatic code generation approach
Input
Metadata about datasets involved
Optional
Implicit data transformation task
Request by users
Indexing functions
Output
Executable programs
General modules
Task-specific data module

40
Metadata Description

Two aspects of data in flat files
Logical view of the data
Physical data organization
Two components of every data descriptor
Schema description
Layout description
Design goals
Powerful
Easy for writing and interpretation

41
Metadata Challenges

Examples of sequence formats
ALN/ClustalW format
AMPS Block file format
ClustalW
Codata
EMBL
GCG/MSF
GDE
GenBank
Fasta (Pearson)
NBRF/PIR
PDB format
Pfam/Stockholm format
Phylip
Raw
RSF
UniProtKB/Swiss-Prot

>FOSB_MOUSE Protein fosB. 338 bp
MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQECAGLGE
MPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYD
MPGTSYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPRE
ETLTPEEEEKRRVRRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAE
LESEIAELQKEKERLEFVLVAHKPGCKIPYEEGPGPGPLAEVRDLPGST
SAKEDGFGWLLPPPPPPPLPFQSSRDAPPNLTASLFTHSEVQVLGDPFPV
VSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPSLLAL

Major Challenges
Various representation
Semi-structured data

name "Short name for sequence"
longname "Long (more descriptive) name for sequence"
sequence-ID "Unique ID number"
creation-date "mm/dd/yy hh:mm:ss"
direction [-1|1]
strandedness [1|2]
type [DNA|RNA|PROTEIN|TEXT|MASK]
offset (-999999,999999)
group-ID (0,999)
creator "Author's name"
descrip "Verbose description"
comments "Lines of comments that can be fairly arbitrary text about a sequence. Return characters are allowed, but no internal double quotes or brace characters. Remember to close with a double quote"
sequence "gctagctagctagctagctcttagctgtagtcgtagctgatgctagct
gatgctagctagctagctagctgatcgatgctagctgatcgtagctgacg
gactgatgctagctagctagctagctgtctagtgtcgtagtgcttattg
c"
LOCUS       MMFOSB       4145 bp    mRNA    linear   ROD 12-SEP-1993
DEFINITION  Mouse fosB mRNA.
ACCESSION   X14897
VERSION     X14897.1  GI:50991
KEYWORDS    fos cellular oncogene; fosB oncogene; oncogene.
SOURCE      Mus musculus.
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE   1  (bases 1 to 4145)
  AUTHORS   Zerial,M., Toschi,L., Ryseck,R.P., Schuermann,M., Muller,R. and
            Bravo,R.
  TITLE     The product of a novel growth factor activated gene, fos B,
            interacts with JUN proteins enhancing their DNA binding activity
  JOURNAL   EMBO J. 8 (3), 805-813 (1989)
  MEDLINE   89251612
  PUBMED    2498083
COMMENT     clone=AC113-1; cell line=NIH3T3.
FEATURES             Location/Qualifiers
     source          1..4145
                     /organism="Mus musculus"
                     /db_xref="taxon:10090"
     CDS             1202..2218
                     /note="fosB protein (AA 1-338)"
                     /codon_start=1
                     /protein_id="CAA33026.1"
                     /db_xref="GI:50992"
                     /db_xref="MGD:95575"
                     /db_xref="SWISS-PROT:P13346"
                     /translation="MFQAFPGDYDSGSRCSSSPSAESQYLSSVDSFGSPPTAAASQEC
                     AGLGEMPGSFVPTVTAITTSQDLQWLVQPTLISSMAQSQGQPLASQPPAVDPYDMPGT
                     SYSTPGLSAYSTGGASGSGGPSTSTTTSGPVSARPARARPRRPREETLTPEEEEKRRV
                     RRERNKLAAAKCRNRRRELTDRLQAETDQLEEEKAELESEIAELQKEKERLEFVLVAH
                     KPGCKIPYEEGPGPGPLAEVRDLPGSTSAKEDGFGWLLPPPPPPPLPFQSSRDAPPNL
                     TASLFTHSEVQVLGDPFPVVSPSYTSSFVLTCPEVSAFAGAQRTSGSEQPSDPLNSPS
                     LLAL"
BASE COUNT      960 a   1186 c   1007 g    991 t      1 others
ORIGIN
1 ataaattctt attttgacac
tcaccaaaat agtcacctgg aaaacccgct ttttgtgaca 61
aagtacagaa ggcttggtca catttaaatc actgagaact
agagagaaat actatcgcaa 121 actgtaatag acattacatc
cataaaagtt tccccagtcc ttattgtaat attgcacagt 181
gcaattgcta catggcaaac tagtgtagca tagaagtcaa
agcaaaaaca aaccaaagaa 241 aggagccaca agagtaaaac
tgttcaacag ttaatagttc aaactaagcc attgaatcta 301
tcattgggat cgttaaaatg aatcttccta caccttgcag
tgtatgattt aacttttaca 361 gaacacaagc caagtttaaa
atcagcagta gagatattaa aatgaaaagg tttgctaata 421
gagtaacatt aaataccctg aaggaaaaaa aacctaaata
tcaaaataac tgattaaaat 481 tcacttgcaa attagcacac
gaatatgcaa cttggaaatc atgcagtgtt ttatttaaga 541
aaacataaaa caaaactatt aaaatagttt tagagggggt
aaaatccagg tcctctgcca 601 ggatgctaaa attagacttc
aggggaattt tgaagtcttc aattttgaaa cctattaaaa 661
agcccatgat tacagttaat taagagcagt gcacgcaaca
gtgacacgcc tttagagagc 721 attactgtgt atgaacatgt
tggctgctac cagccacagt caatttaaca aggctgctca 781
gtcatgaact taatacagag agagcacgcc taggcagcaa
gcacagcttg ctgggccact 841 ttcctccctg tcgtgacaca
atcaatccgt gtacttggtg tatctgaagc gcacgctgca 901
ccgcggcact gcccggcggg tttctgggcg gggagcgatc
cccgcgtcgc cccccgtgaa 961 accgacagag cctggacttt
caggaggtac agcggcggtc tgaaggggat ctgggatctt 1021
gcagagggaa cttgcatcga aacttgggca gttctccgaa
ccggagacta agcttccccg 1081 agcagcgcac tttggagacg
tgtccggtct actccggact cgcatctcat tccactcggc 1141
catagccttg gcttcccggc gacctcagcg tggtcacagg
ggcccccctg tgcccaggga 1201 aatgtttcaa gcttttcccg
gagactacga ctccggctcc cggtgtagct catcaccctc 1261
cgccgagtct cagtacctgt cttcggtgga ctccttcggc
agtccaccca ccgccgccgc 1321 ctcccaggag tgcgccggtc
tcggggaaat gcccggctcc ttcgtgccaa cggtcaccgc 1381
aatcacaacc agccaggatc ttcagtggct cgtgcaaccc
accctcatct cttccatggc 1441 c
List and example provided by EMBL-EBI
42
Schema Descriptors

Follow XML DTD standard for semi-structured data
Simple attribute list for relational data

<?xml version='1.0' encoding='UTF-8'?>
<!ELEMENT FASTA (ID, DESCRIPTION, SEQ)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT DESCRIPTION (#PCDATA)>
<!ELEMENT SEQ (#PCDATA)>

FASTA //Schema name
ID string //Data type definitions
DESCRIPTION string
SEQ string
43
Layout Descriptors

Overall structure (FASTA example)

DATASET FASTAData //Dataset name
DATATYPE FASTA //Schema name
DATASPACE LINESIZE=80
// ---- File layout details go here ----
DATA osu/fasta //File location

44
File Layout

Key observations on line-based biological data
files
Strings of variable length
Delimiters widely used
Data fields may be divided into variables
Repetitive structures

>seq1 comment1 \nASTPGHTIIYEAVCLHNDRTTIP
\n>seq2 comment2 \nASQKRPSQRHGSKYLATASTMDHARHGFL
PRHRDTGILDSIGRFFGGDRGAPK \nKSAHKGFKGVDAQGTLSKIFKL
GGRDSRSGSPMARRELVISLIVES \n >seq3
45
Layout Descriptors

File layout (FASTA example)

>seq1 comment1 \nASTPGHTIIYEAVCLHNDRTTIP
\n>seq2 comment2 \nASQKRPSQRHGSKYLATASTMDHARHGFL
PRHRDTGILDSIGRFFGGDRGAPK \nKSAHKGFKGVDAQGTLSKIFKL
GGRDSRSGSPMARRELVISLIVES \n >seq3

DATASPACE LINESIZE=80
<
  > ID DESCRIPTION
  < \n SEQ >
  \n EOF
>
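
For comparison, a hand-written reader that follows the same layout: an entry starts with the `>` delimiter, the ID runs to the first space, the DESCRIPTION runs to the end of the line, and SEQ is the concatenation of the following lines up to the next entry or end of file. This is only an illustration of what the layout describes; it is not the code the system generates, and `read_fasta` is a hypothetical name.

```python
def read_fasta(text):
    """Parse FASTA text into a list of {ID, DESCRIPTION, SEQ} dictionaries."""
    entries = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # Header line: '>' ID ' ' DESCRIPTION
            ident, _, description = line[1:].partition(" ")
            current = {"ID": ident, "DESCRIPTION": description, "SEQ": ""}
            entries.append(current)
        elif current is not None and line:
            # Sequence lines repeat until the next header or end of input.
            current["SEQ"] += line
    return entries
```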

46
System Component

Understand data
Layout mining
Schema mining
Process data
Metadata description language
Wrapper generation
Query execution
Query execution with indices

47
Wrapper Generation Road Map

Motivation and overview
System structure
Wrapper generation
Wrapper execution
Experiments

48
Wrapper Generation Motivation

Wrappers are essential for bioinformatics
integration
Heterogeneous data sources
Function transform data
Current solutions
Manually written wrappers
Scripts

49
Wrapper Generation: Advantages

Wrapper generated automatically
Stand-alone programs for integration systems and
workflows
Little human interference. New resources can be
integrated on-the-fly
Direct transformation. No unnecessary
intermediate form needed
Only requires data description at metadata level,
one descriptor/data source
Transfer data from flat files directly
No DB support required
No other domain or format heuristics

50
Wrapper Generation: System Overview
[Figure: Wrapper generation system. Schema Descriptors -> Mapping Generator -> Mapping File -> Mapping Parser -> Schema Mapping. Layout Descriptor -> Layout Parser -> Data Entry Representation. Both feed the Application Analyzer, which produces WRAPINFO. The wrapper consists of DataReader (Source Dataset), DataWriter (Target Dataset) and Synchronizer.]
51
Layout Parse Tree

FASTA example

DATASPACE LINESIZE=80
<
  > ID DESCRIPTION
  < \n SEQ >
  \n EOF
>

[Figure: layout parse tree. Root: DATASPACE, linesize 80. Internal nodes: nested < > environments. Leaves: >-ID, -DESCRIPTION, \n-SEQ, \n-DUMMY EOF.]
Internal node: environment
Leaf: delimiter-variable (DLM-VAR) pair
52
Schema Mapping

Algorithm: strict name matching
for each field ft in the target schema
for each field fs in the source schema
if ft = fs then add the pair (fs, ft) to the mapping
Output
A list of attribute pairs
An editable file for the user to verify and modify
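
The strict name matching loop can be sketched directly. The function name and the example field lists are illustrative; only the nested loop and the equality test come from the slide.

```python
def strict_name_mapping(source_fields, target_fields):
    """Pair every target field with each source field of exactly the same name."""
    mapping = []
    for ft in target_fields:
        for fs in source_fields:
            if ft == fs:
                mapping.append((fs, ft))
    return mapping
```

Fields with different names on the two sides (for example a source `SQ` and a target `SEQ`) are not paired, which is why the generated mapping file is left editable for the user.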

53
Wrapping Assumptions

Convert semi-structured (and structured) data to
structured data
Both datasets are stored record-wise
Order of records not disturbed after wrapping

Semi-structured -> Structured
Data can be transformed entry by entry
54
Application Analyzer

Task: generate clear directions for the wrapper and organize them in WRAPINFOR
Sub-tasks
What values to store
How to extract values
How to store values
How to write values

55
Important Concepts (1)

Useful
An attribute is useful iff its values appear in the target
Reachable
Node b is reachable from node a if there exists a valid layout configuration such that a.DLM and b.DLM define the boundaries of a.VAR,
i.e., the data reads a.DLM a.VAR b.DLM
A value instance lies between
Its own delimiter
The first appearance of any of its reachable delimiters
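
A sketch of value extraction driven by reachable delimiters: the value starts right after the attribute's own delimiter and ends at the first appearance of any delimiter reachable from it. The function name, its signature and the string-based scanning are assumptions made for illustration.

```python
def extract_value(text, start, own_dlm, reachable_dlms):
    """Return (value, end_position) for the next value at or after `start`.

    Returns (None, start) when the attribute's own delimiter is not found.
    """
    begin = text.find(own_dlm, start)
    if begin < 0:
        return None, start
    begin += len(own_dlm)
    # The value ends at the earliest reachable delimiter, or at end of text.
    ends = [pos for pos in (text.find(d, begin) for d in reachable_dlms) if pos >= 0]
    end = min(ends) if ends else len(text)
    return text[begin:end], end
```

On a FASTA header, the ID is read with own delimiter `>` and reachable delimiter space, then DESCRIPTION with own delimiter space and reachable delimiter `\n`.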

56
Important Concepts (2)

Attribute Cardinality
Regular attribute: fixed number of values per entry
Example: ID
Semi-structured attribute: varied number of values per entry
Example: References

57
WRAPINFOR

Contents: information to answer a particular wrapping task
Form: XML
5 look-up tables
Delimiter, Usefulness, Cardinality, Label, Reachable
3 parameters
one_to_one_total, one_to_multiple_total, complete_in
Function: plug into general modules to form a functional wrapper

58
Wrapper Generation Road Map

Motivation and overview of our approach
System structure
Wrapper generation
Wrapper execution
Experiments

59
Wrapper Overview
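[Figure: wrapper at run time. Input dataset -> DataReader -> value buffer (one_to_one_values, one_to_multiple_values) -> DataWriter -> output dataset. The Synchronizer drives DataReader and DataWriter with load, run and halt signals. Other labels in the figure: FA, RA, dataset buffer.]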
60
Wrapper Structure

One data module: WRAPINFO
Three general action modules
Synchronizer: central controller
DataReader, DataWriter: interact with datasets
One value buffer
Suitable for data grid
Transform data one entry at a time

61
Wrapper Execution

DataReader
Extract attribute value (Delimiter table, Reachable table)
Fill value buffer (Label look-up table)
DataWriter
Retrieve from value buffer (Label look-up table)
Write target file (Delimiter table, Reachable table, Label table)
Synchronizer
Call DataReader on source (parameters)
Call DataWriter on target (parameters)
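
A much simplified sketch of the three general modules working on one entry at a time. The real wrapper is driven by the WRAPINFO look-up tables; here the source format (tab-separated ID, DESCRIPTION, SEQ) and the target format (FASTA) are hard-coded assumptions so that the control flow stays visible.

```python
import io


class DataReader:
    """Reads one source entry per call and fills the value buffer."""

    def __init__(self, stream):
        self.stream = stream

    def load(self, buffer):
        line = self.stream.readline()
        if not line:
            return False  # end of input: tells the synchronizer to halt
        ident, description, seq = line.rstrip("\n").split("\t")
        buffer.update(ID=ident, DESCRIPTION=description, SEQ=seq)
        return True


class DataWriter:
    """Writes the buffered entry in the target layout (FASTA here)."""

    def __init__(self, stream):
        self.stream = stream

    def write(self, buffer):
        self.stream.write(f">{buffer['ID']} {buffer['DESCRIPTION']}\n{buffer['SEQ']}\n")


def synchronize(reader, writer):
    """Central controller: read an entry, write it, clear the buffer, repeat."""
    buffer = {}
    count = 0
    while reader.load(buffer):
        writer.write(buffer)
        buffer.clear()
        count += 1
    return count
```

Because only one entry is held in the value buffer at any time, memory use does not grow with the size of the dataset.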

62
Wrapper Experiments (1)

Analysis time: constant
Execution time: linear

[Charts: TRANSFAC-to-Reference problem, axes in logarithmic scale]
63
Wrapper Experiments (2)

Performance comparable to handwritten code

[Chart: SWISSPROT-to-FASTA problem]
64
System Components

Understand data
Layout mining
Schema mining
Process data
Metadata description language
Wrapper generation
Query execution
Query execution with indices

65
Query Execution Road Map

Motivation
System Overview
System Implementation
Languages
System
Experiments

66
Limitation of Wrapper

Data Wrapping
Data formatting Data projection
Other query types
Selection
Cross Product
Join

New Functionalities
Value examination
Multiple datasets

67
Advantages

Retrieve multiple pieces of information all at
once
Data easily available
Declarative languages only
High flexibility
Low over-head
Suitable for data grid

68
System Enhanced
[Figure: enhanced system. Query analysis: the query parser extracts source/target names and mappings from the query; the descriptor parser reads the dataset descriptors (metadata collection) and yields schema and layout information; the application analyzer combines them into QUERYINFOR. Query execution: DataReader (source data files), DataWriter (target data file) and Synchronizer.]
69
Query Execution: Road Map

Motivation
System Overview
System Implementation
Languages
Metadata Description Language
Query Language
System
Query Analysis
Query Execution
Experiments

70
Query Language

Declarative, SQL-like
Projection, selection, cross product, join
queries
Example

AUTOWRAP POSTBLAST (target dataset)
FROM BLASTP, SWISSPROT (source datasets)
BY BLASTP.SP_ID = SWISSPROT.ID (join criteria)
WHERE (attribute pairs)
POSTBLAST.QUERY = BLASTP.QUERY
POSTBLAST.SP_AC = BLASTP.SP_AC
POSTBLAST.SP_ID = BLASTP.SP_ID
POSTBLAST.FULL_DESCR = SWISSPROT.DE
POSTBLAST.SEQUENCE = SWISSPROT.SQ
POSTBLAST.SCORE = BLASTP.SCORE
POSTBLAST.E_VALUE = BLASTP.E_VALUE
71
Application Analyzer: Enhancement

Constant values in query
Pseudo-label look-up table
Other query information
Parameters: comparing field pairs
Output: QUERYINFOR

72
Query Execution

Query-Proc Structure
DataReader and DataWriter
Similar to wrapper
Value buffer
Store useful values from one data entry of every
source dataset

[Figure: QUERYINFOR plugs into DataReader (source data files), DataWriter (target data file) and Synchronizer.]
73
Enhanced Synchronizer

Synchronizer
Set up pseudo-attributes (pseudo-label look-up table)
Call DataReader on sources 1 and 2
Call DataWriter on target (parameters)
Test join conditions (parameters)
Clean value buffer (parameters)

74
Post-BLAST Query

Goal: enhance BLAST output to FASTA format
Query: join query between BLAST output (source 1) and SWISSPROT (source 2)
2 modes
UNIQUE: halt once a match is found in source 2
ALL: search all source 2 entries
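
The two modes can be sketched as a nested-loop join over in-memory records. The record layout (dictionaries keyed by attribute name) and the function name are assumptions; the field names SP_ID and ID follow the example query.

```python
def join_entries(source1, source2, key1, key2, mode="ALL"):
    """Nested-loop join; in UNIQUE mode stop scanning source 2 at the first match."""
    if mode not in ("ALL", "UNIQUE"):
        raise ValueError("mode must be 'ALL' or 'UNIQUE'")
    results = []
    for left in source1:
        for right in source2:
            if left[key1] == right[key2]:
                results.append({**left, **right})
                if mode == "UNIQUE":
                    break
    return results
```

UNIQUE is the cheaper choice when the join key is known to identify at most one entry in source 2, since the scan of source 2 stops early for every matched entry.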

75
Chip-Supplement Query

Goal: look up information on microarray genes and output it in tabular format
Query: join query between protein array and yeast genome database
2 queries
Chip-Supplement
array join genome
Chip-Supplement-Sorted
genome join array

76
OMIM-Plus Query

Add reverse links of proteins to disease database
Join query between OMIM database and SWISSPROT
database
Results in OMIM form
86.38 seconds/entry × 12,158 OMIM entries ≈ 291.7 hours

77
System Components

Understand data
Layout mining
Schema mining
Process data
Metadata description language
Wrapper generation
Query execution
Query execution with indices

78
Query with Indices: Road Map

Motivation and Overview
System
System Enhancement
Language
System Implementation
Experiments

79
Query with Indices: Motivation

Goal
Improve the performance of query-proc program
Index
Maintain the advantages
Flat file based
Low requirement on programming

80
Challenges and Approaches

Various indexing algorithms for various
biological data
User defined indexing functions
Standard function interfaces
Flat file data
Values parsed implicitly and ready to be indexed
Byte offset as pointer
Metadata about indices
Layout descriptor

81
System Revisit
[Figure: system with indices. Query analysis: query parser (source/target names, mappings), descriptor parser over the dataset descriptors (metadata collection; schema and layout information), application analyzer -> QUERYINFOR. Query execution: DataReader (source data files), DataWriter (target data file), Synchronizer, plus index file and index functions.]
82
Language Enhancement

Describe indices
Indexing is a property of dataset
Extend layout descriptors
Maintain query format

DATASET name INDEX attribute index_file_loc index_gen_fun index_retr_fun fun_loc, attribute index_file_loc index_gen_fun index_retr_fun fun_loc
New meaning of "=": if an index is available, use the index retrieving function; else, compare values directly
AUTOWRAP GNAMES FROM CHIPDATA, YEASTGENOME BY CHIPDATA.GENE = YEASTGENOME.ID WHERE
83
System Enhancement

Metadata Descriptor Parser
parse index information
Application Analyzer
index information index look-up table
test condition compare_field_indexing

84
Query-Proc Enhancement

Synchronizer
if index is applicable, check availability of
index data file
If no, call index generation function
Load indices
Call index retrieving function first for
candidate entry list
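
A sketch of the index path, assuming a FASTA-like flat file, an index keyed by entry ID that stores byte offsets, and a JSON index file. The three function names stand in for the user-defined index generation and retrieval functions; they are illustrative, not the system's interface.

```python
import json
import os


def build_index(data_path, index_path):
    """Index generation: map each entry ID to the byte offset of its header."""
    index = {}
    with open(data_path, "rb") as data:
        while True:
            offset = data.tell()
            line = data.readline()
            if not line:
                break
            if line.startswith(b">"):
                parts = line[1:].split()
                if parts:
                    index[parts[0].decode()] = offset
    with open(index_path, "w") as out:
        json.dump(index, out)
    return index


def load_or_build_index(data_path, index_path):
    """Check availability of the index file; generate it if missing, then load."""
    if not os.path.exists(index_path):
        return build_index(data_path, index_path)
    with open(index_path) as handle:
        return json.load(handle)


def fetch_entry(data_path, index, key):
    """Index retrieval: seek to the stored offset and read one whole entry."""
    offset = index.get(key)
    if offset is None:
        return None
    with open(data_path, "rb") as data:
        data.seek(offset)
        lines = [data.readline()]
        for line in data:
            if line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode()
```

Using the byte offset as the pointer keeps the data in its original flat file: a lookup costs one seek instead of a scan over every entry.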

85
Microarray Gene Information Look-up

Goal: gather information about genes (120)
Query: microarray output join genome database
Index: gene names in genome

86
BLAST-ENHANCE Query

Goal: add extra information to BLAST output
Query: BLAST output join Swiss-Prot database
Index: protein ID in Swiss-Prot

87
OMIM-PLUS Query

Goal: add Swiss-Prot link to OMIM
Query: OMIM join Swiss-Prot
Index: protein ID in Swiss-Prot

88
Homology Search Query

Goal: find similar sequences
Query: query sequence list, sequence database
Indexing algorithm
Sequence-based
Transformation of sub-string composition
Indexing n-D numerical values

89
Homology Search (1)

Index (Singh's algorithm)
Data: yeast genome
wavelet coefficients
minimum bounding rectangles

90
Homology Search (2)

Index (Ferhatosmanoglu's algorithm)
Data: GenBank
Wavelet coefficients
Scalar quantization
R-tree

91
Road Map

Mission Statement
Motivation
Implementation
Comprehensive Example
Future work
Conclusion

92
Gene Name Nomenclature

It is crucial to identify genes CORRECTLY and UNAMBIGUOUSLY
Genes with multiple names
Multiple genes share the same name
Historically, little central control over the naming process

"As biologists strive to make sense of the growing wealth of genomic information, this messy nomenclature is becoming a bugbear" (Helen Pearson, Nature, 2001)
93
Gene Name in DBs

Databases related to genes
Genome databases (main force in nomenclature)
SGD (yeast)
HGNC (human)
TAIR (a plant)
dictyBase (a one-celled amoeba)
Curated gene databases
Entrez Gene by NCBI
Curated gene product databases
Swiss-Prot by SIB and EBI

94
Queries About Gene Name

Gene identifier usage in databases
How are gene symbols in DB A used in DB B?
How are gene aliases in DB A used in DB B?
Nomenclature across species
Q1-Q2: genome -> Entrez Gene, Swiss-Prot
Q3-Q4: Entrez Gene -> Swiss-Prot
Nomenclature over time
Q5-Q7: Swiss-Prot -> genome

95
Challenges

Various data representation
Line-based texts
Tabular forms with or without title
Format evolves over time
Data storage
Large volume
Each file queried limited times

Metadata descriptors
Format and schema learning
Flat file processing
96
Integration System Revisit
[Figure: Information Integration System applied to the nomenclature study. Understand Data: data files (Genome, Entrez Gene, Swiss-Prot) -> Layout Miner and Schema Miner -> Metadata Description (Layout Descriptor, Schema Descriptor). Process Data: User Request (join queries) and Metadata Description -> Code Generation -> Query Processor.]
97
Nomenclature Results (1)

Across Species

98
Nomenclature Results (2)

Over time

Q5: How many gene IDs in Swiss-Prot are gene IDs in the genome database?
Q6: How many gene IDs in Swiss-Prot are aliases in the genome database?
Q7: How many gene aliases in Swiss-Prot are gene IDs in the genome database?
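
Once IDs and aliases have been extracted from both sources, Q5 to Q7 reduce to set intersections. The sketch below assumes each record is a dictionary with an `id` string and an `aliases` list; the record shape and function name are hypothetical, and the presentation answers these questions with join queries over the flat files rather than in-memory sets.

```python
def nomenclature_counts(swissprot, genome):
    """Count Swiss-Prot IDs/aliases that reappear as genome IDs/aliases."""
    sp_ids = {entry["id"] for entry in swissprot}
    sp_aliases = set().union(*(entry["aliases"] for entry in swissprot))
    genome_ids = {entry["id"] for entry in genome}
    genome_aliases = set().union(*(entry["aliases"] for entry in genome))
    return {
        "Q5": len(sp_ids & genome_ids),       # Swiss-Prot ID is a genome ID
        "Q6": len(sp_ids & genome_aliases),   # Swiss-Prot ID is a genome alias
        "Q7": len(sp_aliases & genome_ids),   # Swiss-Prot alias is a genome ID
    }
```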
99
Performance