Title: Views: Fundamental Building Blocks in the Process of Knowledge Discovery
1Views: Fundamental Building Blocks in the Process of Knowledge Discovery
- Hideo Bannai (1), Yoshinori Tamada (2), Osamu Maruyama (3), Kenta Nakai (1), Satoru Miyano (1)
- (1) Human Genome Center, Inst. of Medical Science, Univ. of Tokyo
- (2) Dept. of Mathematical Sciences, Tokai Univ.
- (3) Faculty of Mathematics, Kyushu Univ.
2Motivation
- Speed up the computational knowledge discovery process as a whole
3The Knowledge Discovery Process
[Figure: the knowledge discovery process (Fayyad et al. 1996): Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns -> Evaluation/Interpretation]
- The keys to success
  - Develop a good algorithm for the problem
  - Find or generate appropriate attributes
  - Allow human intervention
4Contents
- Basic Concepts
  - Entity, View, View Operation, View Design
- Example discovery process
  - Characterization of sub-cellular localization signals
5Basic Concepts - Entity
- An entity is any object.
- Example
  - The set $E$ of S. cerevisiae gene names. Each gene name is an entity.
  - YAL069W, YAL068C, YLR188W, ...
6Basic Concepts - View
- A view is a function over entities
  - $v : E \to R$ (we call $R$ the range set)
- Example
  - A view over the set of S. cerevisiae gene names
[Figure: a view mapping gene names to DNA sequences: atgatcgtaaataa, atgcatcgtaaatc, atgattgtaagaat]
7Basic Concepts - View Operation
- A view operator is a function that creates new views from existing views.
- A function $\psi : R \to R'$ over range sets induces a view operator $\Psi : v \mapsto v'$
  - $v : E \to R$, $\psi : R \to R'$
  - $v' = \Psi[v] = \psi \circ v : E \to R \to R'$
  - $v'(e) = \psi(v(e))$
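
A minimal Python sketch of the idea on this slide: a view is modeled as a plain function over entities, and a function over range sets (called `psi` here) is lifted to a view operator by composition. The helper name `induce`, the entity names, and the pairing of names with sequences are assumptions made for illustration only.

```python
from typing import Callable, Dict, TypeVar

E = TypeVar("E")    # entity type
R = TypeVar("R")    # range set of the original view
R2 = TypeVar("R2")  # range set of the derived view


def induce(psi: Callable[[R], R2]) -> Callable[[Callable[[E], R]], Callable[[E], R2]]:
    """Lift a function over range sets to a view operator (v -> psi composed with v)."""
    def operator(v: Callable[[E], R]) -> Callable[[E], R2]:
        # The derived view applies v first, then psi: v'(e) = psi(v(e)).
        return lambda e: psi(v(e))
    return operator


# Hypothetical entities and toy sequences; the name/sequence pairing is illustrative.
SEQUENCES: Dict[str, str] = {
    "geneA": "atgatcgtaaataa",
    "geneB": "atgcatcgtaaatc",
    "geneC": "atgattgtaagaat",
}


def sequence_view(gene: str) -> str:
    """A view from gene names (entities) to DNA sequences (range set)."""
    return SEQUENCES[gene]


# len is a function over the range set of sequence_view, so it induces a
# view operator that turns the sequence view into a sequence-length view.
length_view = induce(len)(sequence_view)
```

Because the derived view is again an ordinary function over entities, it can be fed to further view operators in the same way.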
8Basic Concepts - View Operation
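- Example: codon translation
[Figure: the codon translation view operator maps the DNA sequence view to an amino acid sequence view: MIVNNTH, MHRKSLRR, MIVRMIRL]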
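
Codon translation as a view operator could be sketched as follows. This assumes the standard genetic code, drops any incomplete trailing codon, and writes stop codons as `*`; the function names are hypothetical and not taken from the authors' library.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a DNA string codon by codon, ignoring an incomplete last codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def codon_translation(view):
    """View operator: turn a DNA-sequence view into an amino-acid-sequence view."""
    return lambda entity: translate(view(entity))
```

Applied to the three DNA strings shown on the View slide, `translate` returns `MIVN`, `MHRK`, and `MIVR`, which agree with the first four residues of the three strings on this slide.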
9Basic Concepts - View Operation
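- Example: pattern matching (parameter: pattern = atcgt)
[Figure: the pattern match view operator maps the three DNA sequences to true, true, false]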
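
A parameterized view operator like the one on this slide can be sketched as a function that takes the parameter and returns the operator. Reading "pattern match" as plain substring containment is an assumption, and `contains_pattern` is a hypothetical name.

```python
def contains_pattern(pattern: str):
    """Return a view operator that tests whether a string view contains `pattern`."""
    def operator(view):
        # The new view has the range set {True, False}.
        return lambda entity: pattern in view(entity)
    return operator


# The three DNA strings from the View slide, used directly as entities here.
sequences = ["atgatcgtaaataa", "atgcatcgtaaatc", "atgattgtaagaat"]


def identity_view(seq):
    """Trivial view that returns the entity itself."""
    return seq


has_atcgt = contains_pattern("atcgt")(identity_view)
results = [has_atcgt(s) for s in sequences]
```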
10Basic Concepts - View Operation
- Hypothesis generation is a view operation (hypothesis view)
- Let $E$: entity set, $V$: views available
- Supervised Learning ($v_t$: target view)
  - $\mathrm{SLearner}[E, V, v_t] \to v'$ where for most $e \in E$, $v'(e) = v_t(e)$
- Unsupervised Learning
  - $\mathrm{ULearner}[E, V] \to v'$
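
To illustrate a supervised learner as a view operator, the sketch below takes an entity set, a collection of candidate views, and a target view, and returns the candidate that agrees with the target on the most entities. This deliberately simple selection rule is a stand-in chosen for brevity, not ID3 or any learner from the talk; all names and the toy labels are hypothetical.

```python
def s_learner(entities, views, target):
    """Return the view in `views` that matches `target` on the most entities."""
    def agreement(view):
        return sum(view(e) == target(e) for e in entities)
    return max(views, key=agreement)


# Toy entity set and labels, made up for this example.
entities = ["MLRSAV", "MKTAAL", "GEDAVL", "MRRLSA"]
labels = {"MLRSAV": True, "MKTAAL": False, "GEDAVL": False, "MRRLSA": True}


def target_view(e):
    """Target view: the known label of each entity."""
    return labels[e]


def starts_with_m(e):
    return e.startswith("M")


def has_r(e):
    return "R" in e


# The output of the learner is itself a view over the same entities.
hypothesis_view = s_learner(entities, [starts_with_m, has_r], target_view)
```

The point of the sketch is only the shape of the operation: views and a target go in, and a new view that mostly agrees with the target comes out.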
11Basic Concepts - View Operation
- Example: ID3 algorithm and Decision Tree View
[Figure: ID3 producing a decision tree view; labels shown in the figure: POS, POS, NEG, NEG, POS, POS]
12View Design
- Views and view operators can be combined to create new views.
- We call the structure of this combination the view design.
- Careful design of views by experts is an interface for human intervention.
13The Knowledge Discovery Process
- Data Preparation: Transformation of the entity set using views
- Data Mining: Generation of new views from existing views and entities
- Interpretation/Evaluation: Evaluation of the generated views
- Knowledge Consolidation: Using newly created views for the next discovery task
14Sub-cellular Localization Signals
- Proteins are carried to specific locations after being synthesized in the cytosol.
- A short amino acid sequence (the localization signal) holds the information.
- Signals at the N-terminal
  - signal peptide (SP)
  - mitochondrial targeting peptide (mTP)
  - chloroplast transit peptide (cTP)
15Previous Work
- PSORT (Nakai and Kanehisa, 1992)
  - An expert system to predict localization sites
  - Looks at the amino acid frequency of the first 20 residues -> linear discrimination
- MitoProt (Claros and Vincens, 1996)
  - Linear discrimination using 47 numerical attributes that try to model various knowledge about the signals (mitochondrial only)
- TargetP (Emanuelsson et al., 2000)
  - Neural network system (best predictor so far)
16Our Approach
- Find simple, interpretable rules that are still accurate
- Design views based on discussion with an expert
- Data
  - 940 sequences: SP (269), mTP (368), cTP (141), Other (162)
17View Design for Experiments
View Design 1 (binary classifier)
- Can we find structural characteristics?
[Figure: MLRSAVRLAGKDVRFGEDA -> substring (0, 15) -> MLRSAVRLAGKDVRF -> alphabet indexing (R -> 1, Other -> 0) -> 001000100000010 -> pattern match (0100010) -> True]
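
The chain of operations in View Design 1 can be sketched as a composition of small functions. Two readings are assumed here: the pair (0, 15) is taken as (start, length), and the final step is taken as substring containment of the pattern in the indexed string. All function names are hypothetical.

```python
def substring(start: int, length: int):
    """Step 1: keep `length` residues beginning at position `start`."""
    return lambda seq: seq[start:start + length]


def alphabet_index(mapping: dict, default: str):
    """Step 2: replace each residue by its index symbol (`default` if unmapped)."""
    return lambda seq: "".join(mapping.get(ch, default) for ch in seq)


def pattern_match(pattern: str):
    """Step 3: test whether the indexed string contains `pattern`."""
    return lambda seq: pattern in seq


def compose(*steps):
    """Chain the steps left to right into a single view."""
    def view(x):
        for step in steps:
            x = step(x)
        return x
    return view


indexed_view = compose(substring(0, 15), alphabet_index({"R": "1"}, "0"))
view_design_1 = compose(indexed_view, pattern_match("0100010"))

example = "MLRSAVRLAGKDVRFGEDA"
```

On the example sequence from the slide, `indexed_view(example)` gives `001000100000010` and `view_design_1(example)` gives `True`, as in the figure. The window, the indexing, and the pattern are the parameters a learner would search over.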
18View Design for Experiments
View Design 2 (binary classifier)
- What about overall composition?
[Figure: MLRSAVRLAGKDVRFGEDA -> substring (5, 10) -> VRLAGKDVRF -> Net Charge values 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0; remaining figure labels: 0.1, False, 0.8]
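
A composition-based view in the spirit of View Design 2 might look like the sketch below. Several things are assumed rather than taken from the slide: the numeric values are averaged over the window, the average is compared against a threshold with `>=`, the pair (5, 10) means (start, length), and `TOY_CHARGE` is a made-up index that is not an actual AAindex entry.

```python
from typing import Dict

# Hypothetical toy index for illustration only (not a real AAindex entry).
TOY_CHARGE: Dict[str, float] = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def substring(seq: str, start: int, length: int) -> str:
    """Keep `length` residues beginning at position `start`."""
    return seq[start:start + length]


def average_index(seq: str, index: Dict[str, float], default: float = 0.0) -> float:
    """Average the per-residue index values over the whole string."""
    if not seq:
        raise ValueError("cannot average over an empty sequence")
    return sum(index.get(ch, default) for ch in seq) / len(seq)


def view_design_2(seq: str, start: int = 5, length: int = 10,
                  threshold: float = 0.8) -> bool:
    """Binary view: is the average index value of the window at least `threshold`?"""
    window = substring(seq, start, length)
    return average_index(window, TOY_CHARGE) >= threshold


example = "MLRSAVRLAGKDVRFGEDA"
window = substring(example, 5, 10)
```

With the toy index, the window `VRLAGKDVRF` averages to about 0.2, so the view is `False` at a threshold of 0.8; the numbers in the slide's figure come from a different index and are not reproduced by this sketch.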
19View Design for Experiments
View Design 3 (binary classifier)
[Figure: View Design 1 and View Design 2 combined by Logical And]
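
Combining two binary views with a logical AND is itself a view operator, this time over several views at once. A minimal sketch follows, with two placeholder views standing in for View Design 1 and View Design 2; the placeholders are hypothetical and much simpler than the real designs.

```python
def logical_and(*views):
    """View operator: the new view is True only where every input view is True."""
    return lambda entity: all(view(entity) for view in views)


def placeholder_design_1(seq: str) -> bool:
    """Stand-in for a structural view: does the sequence contain 'RL'?"""
    return "RL" in seq


def placeholder_design_2(seq: str) -> bool:
    """Stand-in for a composition view: is at least 10% of the sequence R?"""
    return bool(seq) and seq.count("R") / len(seq) >= 0.1


view_design_3 = logical_and(placeholder_design_1, placeholder_design_2)
```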
20View Design for Experiments
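- Decision list of View Design 2 and 3
- Hypothesis generation
  - $\mathrm{DList}[E, \mathit{view2}, \mathit{view3}, v_t] \to$ decision list view (shown in the figure), where $v_t$ is the target view
[Figure: decision list applied to a sequence, with nodes P1, P2, P3 labeled SP? (View Design 2), Other? (View Design 3), mTP? (View Design 3); a yes at a node outputs SP, Other, or mTP respectively, and cTP is output when every answer is no]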
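
A decision list is an ordered sequence of (test, label) pairs with a default label, and can be sketched as a view built from binary views. The three predicates below are trivial placeholders, not the rules learned in the experiments, and the rule order is illustrative.

```python
def decision_list(rules, default):
    """Build a view that returns the label of the first rule whose test is True."""
    def view(entity):
        for test, label in rules:
            if test(entity):
                return label
        return default
    return view


# Placeholder binary views (hypothetical); in the talk these would be
# instances of View Design 2 or View Design 3 found by the learner.
def looks_like_sp(seq: str) -> bool:
    return seq.startswith("S")


def looks_like_other(seq: str) -> bool:
    return seq.startswith("O")


def looks_like_mtp(seq: str) -> bool:
    return seq.startswith("M")


classify = decision_list(
    [(looks_like_sp, "SP"), (looks_like_other, "Other"), (looks_like_mtp, "mTP")],
    default="cTP",
)
```

Since each node is a single binary view, every prediction can be traced back to one readable rule, which is what makes this hypothesis form easy for an expert to inspect.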
21Results
- 5-fold cross validation
- Accuracy was close to TargetP

Class    Our view           TargetP
SP       237/269 (88.1%)    245/269 (91.1%)
mTP      304/368 (82.6%)    300/368 (81.5%)
cTP      112/141 (79.4%)    120/141 (85.1%)
Other    141/162 (87.0%)    137/162 (84.0%)
22Results
[Figure: View Design 1 and View Design 2 combined by Logical And]
23Results
- The rules discovered were interpreted by an expert
- We were able to extract interpretable knowledge about the signals.
  - Hydropathy of SP
  - A pattern capturing the amphiphilic α-helix of cTP and mTP
- The expert was surprised that such simple rules could predict so accurately
24Hypothesis Creator
- Library implementing views and view operators
- http://www.HypothesisCreator.net/
25HC Core Team
- Yoshinori Tamada
- Osamu Maruyama
- Hideo Bannai
- Satoru Miyano
26Collaborators
- Takao Goto
- Hirotaka Katou
- Kentaro Kawamoto
- Hideo Miyake
- Kaori Samuraki
27Alphabet Indexing
- A mapping from one character to another.
- It is generally used to reduce an alphabet to a smaller alphabet.
- Example
  - K, R, D, E -> 1
  - ACFGHILMNPQSTVWY -> 2
  - MALRAVRSVRA -> 22212212212
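
An alphabet indexing can be represented as a character translation table. The sketch below builds one from groups of letters using Python's built-in `str.translate`; the helper name `make_indexing` is hypothetical.

```python
from typing import Callable, Dict


def make_indexing(groups: Dict[str, str]) -> Callable[[str], str]:
    """Build an alphabet indexing from {index symbol: letters mapped to it}."""
    table = {}
    for symbol, letters in groups.items():
        for letter in letters:
            # str.translate expects a mapping keyed by Unicode code points.
            table[ord(letter)] = symbol
    return lambda seq: seq.translate(table)


# The indexing from the example on this slide.
charged_vs_rest = make_indexing({"1": "KRDE", "2": "ACFGHILMNPQSTVWY"})
indexed = charged_vs_rest("MALRAVRSVRA")
```

Characters that appear in no group pass through unchanged, so an incomplete indexing is easy to spot in the output.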
28AAindex Database
- An entry in the AAindex database is a mapping from an amino acid to a numeric value, representing biochemical properties of the amino acid.
- Example