Tomohiro Mitsumori1, Sevrani Fation1, Masaki Murata2 - PowerPoint PPT Presentation

1
Gene/protein name recognition using Support Vector Machine after dictionary matching
Tomohiro Mitsumori1, Sevrani Fation1, Masaki Murata2, Kouichi Doi1 and Hirohumi Doi1
1 Graduate School of Information Science, Nara Institute of Science and Technology (NAIST), 8916-5, Takayama-cho, Ikoma-shi, Nara, 630-0101, Japan
mitsumor, fation, doy_at_is.aist-nara.ac.jp, doi_at_cl-sciences.co.jp
2 Keihanna Human Info-Communication Research Center, Communications Research Laboratory (CRL), 2-2-2, Hikaridai, Seika-cho, Soraku-gun, Kyoto, 619-0289, Japan
murata_at_crl.go.jp
2
System description
Tool: Yamcha
Feature extraction (training data): word, pos, orthographic, prefix, suffix, dictionary, preceding class
Learning: SVM learning on the training data (TAGGED_GENE_CORPUS)
Dictionary: tagging of gene/protein names using the gene/protein name dictionary of SWISS-PROT and TrEMBL
Feature extraction (test data): word, pos, orthographic, prefix, suffix, dictionary
Classification: SVM classification of the test data, evaluating the preceding class
Stages: Feature extraction, Learning, Classification
3
Our system (Methods and Resources)
Method: machine learning algorithm, Support Vector Machine (SVM)
Tool: Yamcha (Yet Another Multipurpose Chunk Annotator), an SVM-based chunker
http://cl.aist-nara.ac.jp/taku-ku/software/yamcha/
Training corpus: TAGGED_GENE_CORPUS
External information source: SWISS-PROT and TrEMBL
4
System description (same diagram as slide 2)
5
Features used in our experiments
6
System description (same diagram as slide 2)
7
Orthographic Features
8
System description (same diagram as slide 2)
9
Prefix Feature (letter gram)
uni-gram, bi-gram and tri-gram of the beginning letters of a word
e.g. such -> s, su, suc
NF-kappa -> N, NF, NF-
Suffix Feature (letter gram)
uni-gram, bi-gram and tri-gram of the ending letters of a word
e.g. such -> h, ch, uch
NF-kappa -> a, pa, ppa
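The letter-gram prefix and suffix features can be illustrated with a minimal Python sketch. The function name `letter_grams` and the `max_n` parameter are assumptions made for this illustration; the slides do not show how the features were actually computed.

```python
def letter_grams(word, max_n=3):
    """Return (prefixes, suffixes) of length 1..max_n for one word.

    Words shorter than max_n simply yield fewer grams.
    """
    sizes = range(1, min(max_n, len(word)) + 1)
    prefixes = [word[:n] for n in sizes]   # beginning letters
    suffixes = [word[-n:] for n in sizes]  # ending letters
    return prefixes, suffixes


print(letter_grams("such"))      # (['s', 'su', 'suc'], ['h', 'ch', 'uch'])
print(letter_grams("NF-kappa"))  # (['N', 'NF', 'NF-'], ['a', 'pa', 'ppa'])
```

Both calls reproduce the examples given on the slide.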
10
System description (same diagram as slide 2)
11
Gene/Protein Name Dictionary
Protein name pattern matching: uni-gram, bi-gram and tri-gram (word gram)
Gene name pattern matching: uni-gram (word gram)
12
Tagging Method of dictionary feature
Example sentence: such as NF-kappa B that are
[Table: a Y/N tag for each word of the example sentence, in one row each for 1 gram, 2 gram and 3 gram]
Y: the focused word was found in the dictionary (SWISS-PROT and/or TrEMBL).
N: the focused word was not found in the dictionary.
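A minimal Python sketch of one way such Y/N tags could be produced is given here. It assumes that a word is tagged Y at level n when it lies inside a word n-gram of the sentence that appears in the dictionary. That matching rule, the function name `dictionary_tags`, and the one-entry toy dictionary are all assumptions for illustration, so the output need not agree with the table on the slide.

```python
def dictionary_tags(words, dictionary, max_n=3):
    """Tag every word Y/N for word n-grams of size 1..max_n.

    A word gets Y at level n if it is part of any n consecutive words
    whose space-joined form is an entry of `dictionary`.
    """
    tags = {n: ["N"] * len(words) for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            if " ".join(words[start:start + n]) in dictionary:
                for i in range(start, start + n):
                    tags[n][i] = "Y"
    return tags


# Hypothetical dictionary with a single entry, not taken from SWISS-PROT/TrEMBL.
toy_dictionary = {"NF-kappa B"}
sentence = "such as NF-kappa B that are".split()
for n, row in dictionary_tags(sentence, toy_dictionary).items():
    print(f"{n} gram:", " ".join(row))
```

With this toy dictionary only the 2 gram row contains Y, on the words NF-kappa and B.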
13
Example of Feature Vector
[Figure labels: Focused Word, Feature Vector, Preceding Feature]
14
System description (same diagram as slide 2)
15
Format of Yamcha (e.g. chunking into noun phrases)
[Figure labels: Training Data (Features, Answer Tag), SVM Learn; Test Data (Features), SVM classify; Output (Estimated tags)]
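As a rough illustration of this kind of input, the Python sketch below writes a sentence in a column layout: one token per line, whitespace-separated feature columns, the answer tag in the last column, and an empty line after each sentence. The helper name `to_column_format`, the choice of columns and the POS tags are assumptions for this example and are not taken from the slides.

```python
def to_column_format(sentences):
    """Render sentences as one token per line with space-separated columns.

    Each sentence is a list of tuples; the last item of a tuple is the
    answer tag. An empty line marks the end of every sentence.
    """
    lines = []
    for sentence in sentences:
        for columns in sentence:
            lines.append(" ".join(columns))
        lines.append("")  # sentence boundary
    return "\n".join(lines)


# Columns: word, POS (illustrative tags only), answer tag.
example = [[
    ("such", "JJ", "O"), ("as", "IN", "O"),
    ("NF-kappa", "NN", "B"), ("B", "NN", "I"),
    ("that", "WDT", "O"), ("are", "VBP", "O"),
]]
print(to_column_format(example))
```

For test data the same layout would be used without the answer tag column, which the classifier then estimates.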
16
BIO representation
B: Beginning of chunk
I: Inside of chunk
O: Other
e.g. such as NF-kappa B that are
     O    O  B        I that=O are=O (tags: O O B I O O)
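The BIO encoding can be sketched in a few lines of Python. The function name `bio_tags` and the use of (start, end) token index pairs to describe chunks are assumptions made for this illustration.

```python
def bio_tags(num_tokens, chunks):
    """Convert chunk spans to BIO tags.

    `chunks` is a list of (start, end) token indices, end exclusive.
    """
    tags = ["O"] * num_tokens
    for start, end in chunks:
        tags[start] = "B"              # first token of the chunk
        for i in range(start + 1, end):
            tags[i] = "I"              # remaining tokens of the chunk
    return tags


tokens = "such as NF-kappa B that are".split()
# "NF-kappa B" covers tokens 2 and 3.
print(list(zip(tokens, bio_tags(len(tokens), [(2, 4)]))))
```

For the example sentence this yields O O B I O O, as on the slide.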
17
System description (same diagram as slide 2)
18
A Support Vector Machine searches for the hyperplane whose margin is as large as possible.
[Figure: two separating hyperplanes, one with a small margin and one with a large margin]
Positive example (gene/protein names)
Negative example (other words)
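For reference, the large-margin idea corresponds to the standard textbook hard-margin formulation below. The symbols are not taken from the slides: $x_i$ are feature vectors, $y_i \in \{+1, -1\}$ mark positive and negative examples, and $w$, $b$ define the hyperplane $w \cdot x + b = 0$.

```latex
\min_{w,\,b} \ \frac{1}{2}\lVert w \rVert^{2}
\quad \text{subject to} \quad
y_i \,(w \cdot x_i + b) \ge 1 \quad \text{for all } i,
\qquad
\text{margin} = \frac{2}{\lVert w \rVert}
```

Minimizing $\lVert w \rVert$ under these constraints is the same as maximizing the margin between the two classes.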
19
Extending to multi-class
Methods
One vs. Rest: (B vs. IO), (I vs. BO) and (O vs. BI)
Pairwise: (B vs. I), (B vs. O) and (I vs. O)
Our Method ?
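The two decompositions into binary problems can be enumerated with a short Python sketch. The function names are assumptions for illustration; the sketch only lists the binary problems and does not train any classifier.

```python
from itertools import combinations

CLASSES = ["B", "I", "O"]


def one_vs_rest(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [r for r in classes if r != c]) for c in classes]


def pairwise(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


print(one_vs_rest(CLASSES))  # [('B', ['I', 'O']), ('I', ['B', 'O']), ('O', ['B', 'I'])]
print(pairwise(CLASSES))     # [('B', 'I'), ('B', 'O'), ('I', 'O')]
```

With three classes both schemes give three binary classifiers; with k classes one vs. rest gives k and pairwise gives k(k-1)/2.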
20
SVM parameters used in our experiments
21
Three Runs
1st run: using SWISS-PROT with exact match
2nd run: using SWISS-PROT with regular expressions (Perl)
(e.g. NF-kappa B -> NF\Wkappa\WB, matching NF kappa B, NFkappa B, NFkappaB)
3rd run: using SWISS-PROT and TrEMBL with exact match
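The regular-expression run can be sketched in Python (the slides name Perl; Python is used here only for illustration). The pattern as transcribed would require exactly one non-word character at each gap, which would not match NFkappaB, so this sketch assumes a `\W*` quantifier between the parts. That quantifier and the helper name `loose_pattern` are assumptions, not the authors' confirmed expression.

```python
import re


def loose_pattern(name):
    """Build a regex that tolerates any run of non-word characters
    (including none) between the alphanumeric parts of a dictionary name."""
    parts = [re.escape(p) for p in re.split(r"\W+", name) if p]
    return re.compile(r"\W*".join(parts))


pattern = loose_pattern("NF-kappa B")
print(pattern.pattern)
for variant in ["NF-kappa B", "NF kappa B", "NFkappa B", "NFkappaB"]:
    print(variant, bool(pattern.fullmatch(variant)))
```

All four variants match under this assumption, whereas exact matching would accept only the first.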
22
System description (same diagram as slide 2)
23
Results of our experiments
24
Conclusion
  • We carried out gene/protein name recognition
    using the SVM algorithm.
  • We used bag-of-words, POS, prefix, suffix and
    dictionary features.
  • Our best results were 0.8245 (precision),
    0.7416 (recall) and 0.7809 (balanced f-score).
  • The effect of the dictionary feature was about
    0.025.
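The reported balanced f-score is consistent with the reported precision and recall, as a quick Python check shows. The sketch uses the usual definition of the balanced f-score as the harmonic mean of precision and recall; the function name `f_score` is chosen for this illustration.

```python
def f_score(precision, recall):
    """Balanced f-score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


print(round(f_score(0.8245, 0.7416), 4))  # 0.7809
```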

25
Thank You !!