Title: Gene/protein name recognition using Support Vector Machine after dictionary matching
1 Gene/protein name recognition using Support Vector Machine after dictionary matching
Tomohiro Mitsumori1, Sevrani Fation1, Masaki Murata2, Kouichi Doi1 and Hirohumi Doi1
1Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), 8916-5, Takayama-cho, Ikoma-shi, Nara, 630-0101, Japan
mitsumor, fation, doy_at_is.aist-nara.ac.jp, doi_at_cl-sciences.co.jp
2Keihanna Human Info-Communication Research Center, Communications Research Laboratory (CRL), 2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan
murata_at_crl.go.jp
2 System description of our system
Tool: Yamcha
Dictionary: gene/protein name dictionary of SWISS-PROT and TrEMBL (used in feature extraction)
Learning: training data (TAGGED_GENE_CORPUS) → feature extraction (word, POS, orthographic, prefix, suffix, dictionary, preceding class) → SVM learning
Classification: test data → feature extraction (word, POS, orthographic, prefix, suffix, dictionary) → SVM classification (evaluating preceding class) → tagging of gene/protein names
3 Our system (Methods and Resources)
Method: machine learning algorithm, Support Vector Machine (SVM)
Tool: Yamcha (Yet Another Multipurpose Chunk Annotator), an SVM-based chunker
http://cl.aist-nara.ac.jp/taku-ku/software/yamcha/
Training corpus: TAGGED_GENE_CORPUS
External information source: SWISS-PROT and TrEMBL
5 Features used in our experiments
7 Orthographic Features
9 Prefix Feature (letter gram)
Uni-gram, bi-gram and tri-gram of the beginning letters of a word
e.g. such → s, su, suc
     NF-kappa → N, NF, NF-
Suffix Feature (letter gram)
Uni-gram, bi-gram and tri-gram of the ending letters of a word
e.g. such → h, ch, uch
     NF-kappa → a, pa, ppa
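The prefix and suffix features above can be sketched in a few lines of Python. This is only an illustration, not the code used in the experiments. The function name `letter_grams` is hypothetical, and stopping at the word length for words shorter than three letters is an assumption.

```python
def letter_grams(word, max_n=3):
    """Return (prefixes, suffixes): the first and last 1..max_n letters of word.

    Assumption: a word shorter than max_n only yields grams up to its own length.
    """
    sizes = range(1, min(max_n, len(word)) + 1)
    prefixes = [word[:n] for n in sizes]   # uni-, bi-, tri-gram from the start
    suffixes = [word[-n:] for n in sizes]  # uni-, bi-, tri-gram from the end
    return prefixes, suffixes


such_prefixes, such_suffixes = letter_grams("such")
nf_prefixes, nf_suffixes = letter_grams("NF-kappa")
```

With the two example words from the slide, the sketch gives the same letter grams as listed above.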
11 Gene/Protein Name Dictionary
Protein name: pattern matching with uni-gram, bi-gram and tri-gram (word gram)
Gene name: pattern matching with uni-gram (word gram)
12 Tagging Method of dictionary feature
Example sentence: such as NF-kappa B that are
(Table: Y/N dictionary feature for each word of the example sentence, with one row each for 1 gram, 2 gram and 3 gram matching.)
Y: the focused word was found in the dictionary (SWISS-PROT and/or TrEMBL).
N: the focused word was not found in the dictionary.
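A minimal sketch of word-gram dictionary tagging, written in Python for illustration. It assumes that every word inside a matching n-word window is marked Y, which may differ from the exact rule used in the experiments. The function name `dictionary_flags` and the one-entry `toy_dictionary` are hypothetical stand-ins for the SWISS-PROT / TrEMBL name lists.

```python
def dictionary_flags(tokens, dictionary, n):
    """Mark a token 'Y' if it lies inside an n-word window found in dictionary, else 'N'."""
    flags = ["N"] * len(tokens)
    for start in range(len(tokens) - n + 1):
        window = " ".join(tokens[start:start + n])
        if window in dictionary:
            for i in range(start, start + n):
                flags[i] = "Y"
    return flags


tokens = "such as NF-kappa B that are".split()
# Hypothetical entry; the real dictionary comes from SWISS-PROT and TrEMBL.
toy_dictionary = {"NF-kappa B"}
features = {n: dictionary_flags(tokens, toy_dictionary, n) for n in (1, 2, 3)}
```

With this toy dictionary only the 2 gram row contains Y, on the words NF-kappa and B.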
13 Example of Feature Vector
(Figure labels: Focused Word, Feature Vector, Preceding Feature)
15 Format of Yamcha (e.g. chunking to Noun Phrase)
Training Data (Features, Answer Tag) → SVM Learn
Test Data (Features) → SVM classify → Output (Estimated tags)
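As a rough illustration of the column layout, the Python sketch below writes one token per line with whitespace-separated feature columns, the answer tag in the last column, and an empty line after each sentence. The helper name `to_yamcha_lines` is hypothetical, and the POS tags and the gene-name example (instead of noun phrase chunking) are made up for illustration. The Yamcha documentation remains the reference for the exact format.

```python
def to_yamcha_lines(sentences):
    """Serialize sentences as one token per line, columns separated by spaces.

    Each token is a tuple of feature columns whose last element is the answer tag.
    An empty line closes each sentence.
    """
    lines = []
    for sentence in sentences:
        for columns in sentence:
            lines.append(" ".join(columns))
        lines.append("")
    return "\n".join(lines) + "\n"


# Illustrative columns: word, POS (assumed tags), answer tag.
sentence = [
    ("such", "JJ", "O"),
    ("as", "IN", "O"),
    ("NF-kappa", "NN", "B"),
    ("B", "NN", "I"),
    ("that", "WDT", "O"),
    ("are", "VBP", "O"),
]
training_text = to_yamcha_lines([sentence])
```

For test data the same layout would be used, with the last column filled in by the classifier as the estimated tag.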
16 BIO representation
B: Begin of chunk
I: Inside of chunk
O: Other
e.g.
such  as  NF-kappa  B  that  are
O     O   B         I  O     O
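The BIO encoding of the example can be reproduced with a short Python sketch. The function name `bio_tags` and the representation of a chunk as a (start, end) token span with an exclusive end are assumptions made for illustration.

```python
def bio_tags(num_tokens, chunks):
    """Return BIO tags for num_tokens tokens.

    chunks is a list of (start, end) token spans, end exclusive (assumed format).
    """
    tags = ["O"] * num_tokens
    for start, end in chunks:
        tags[start] = "B"              # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the chunk
    return tags


tokens = "such as NF-kappa B that are".split()
tags = bio_tags(len(tokens), [(2, 4)])  # "NF-kappa B" covers tokens 2 and 3
```

Here `tags` pairs with `tokens` position by position, giving the O O B I O O sequence of the example.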
18 Support Vector Machine
The SVM searches for the hyperplane with as large a margin as possible.
(Figure: two separating hyperplanes, one with a small margin and one with a large margin.)
Positive examples: gene/protein names
Negative examples: other words
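For reference, the large-margin idea is usually written as the standard hard-margin optimization problem below. The symbols are not from the slides and are introduced here as assumptions: x_i is the feature vector of word i, y_i is +1 for a gene/protein name and -1 for another word, and w and b define the hyperplane.

```latex
\min_{\mathbf{w},\, b} \ \frac{1}{2}\lVert \mathbf{w} \rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{w}\cdot\mathbf{x}_i + b) \ge 1, \qquad i = 1, \dots, n
```

Under these constraints the margin between the two classes is 2 / ||w||, so minimizing ||w|| is the same as making the margin as large as possible.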
19 Expanding for Multi class
Methods:
One vs. Rest: (B vs. IO), (I vs. BO) and (O vs. BI)
Pair wise: (B vs. I), (B vs. O) and (I vs. O)
Our Method ?
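The two ways of splitting the three-class B/I/O problem into binary SVM problems can be enumerated with a small Python sketch. The function names are hypothetical, and the majority vote shown for the pairwise case is one common way to combine the binary decisions, assumed here for illustration.

```python
from collections import Counter
from itertools import combinations

CLASSES = ["B", "I", "O"]


def one_vs_rest(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [r for r in classes if r != c]) for c in classes]


def pairwise(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def pairwise_vote(winners):
    """winners maps each class pair to the class its binary SVM chose.

    Returns the class with the most wins (assumed combination rule).
    """
    return Counter(winners.values()).most_common(1)[0][0]


ovr_problems = one_vs_rest(CLASSES)
pair_problems = pairwise(CLASSES)
voted = pairwise_vote({("B", "I"): "B", ("B", "O"): "B", ("I", "O"): "O"})
```

Both schemes need three binary classifiers for three classes; with more classes the pairwise scheme grows quadratically while one vs. rest grows linearly.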
20 SVM parameters used in our experiments
21 Three Runs
1st run: using SWISS-PROT with exact match
2nd run: using SWISS-PROT with regular expression (Perl)
  e.g. NF-kappa B → NF\Wkappa\WB → NF kappa B, NFkappa B, NFkappaB
3rd run: using SWISS-PROT and TrEMBL with exact match
3rd run note: no regular expression is used in this run.
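The regular-expression matching of the 2nd run can be sketched as follows. This is an illustration in Python's `re` module rather than Perl. Note that `\W` on its own matches exactly one non-word character, so the sketch uses `\W*` (zero or more) in order to also cover spellings such as NFkappaB; that choice and the function name `variant_pattern` are assumptions, not the original implementation.

```python
import re


def variant_pattern(name):
    """Build a regex from a dictionary name.

    Each run of non-word characters in the name (hyphen, space, ...) is replaced
    by \\W*, so the separators become optional. Assumed rule for illustration.
    """
    parts = [re.escape(p) for p in re.split(r"\W+", name) if p]
    return re.compile(r"\W*".join(parts))


pattern = variant_pattern("NF-kappa B")
variants = ["NF-kappa B", "NF kappa B", "NFkappa B", "NFkappaB"]
matched = [v for v in variants if pattern.fullmatch(v)]
```

All four spellings match the generated pattern, while a string with an extra letter between the parts does not.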
23 Results of our experiments
24 Conclusion
- (1) We carried out gene/protein name recognition using the SVM algorithm.
- (2) We used bag of words, POS, prefix, suffix and dictionary features.
- (3) Our best results were 0.8245 (precision), 0.7416 (recall) and 0.7809 (balanced f-score).
- (4) The effect of the dictionary feature was about 0.025.
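As a quick check, the balanced f-score is the harmonic mean of precision and recall, and the reported values are consistent with that definition. The function name `f_score` in this small Python sketch is hypothetical.

```python
def f_score(precision, recall):
    """Balanced f-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


best_f = f_score(0.8245, 0.7416)  # precision and recall of the best result
```

Rounded to four decimal places, `best_f` agrees with the reported 0.7809.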
25 Thank You!!