Title: Concept Modeling in Bio-informatics
1Concept Modeling in Bio-informatics
- Sanida Omerovic, Saso Tomazic,
- Mateo Valero, Milos Milovanovic, David Torrents
- University of Ljubljana, Slovenia
- UPC, Barcelona, Catalonia
- IPSI Florence-2007
2WHAT IS A CONCEPT?
3Decision Making Algorithm
4Concept Modeling Layer
- What is a concept?
- How is it modeled?
- How is it built?
- How is it exploited?
- How is it updated?
5Classification of concept modeling (CM) and decision making systems (DMS)
- This classification is based on the following assumption:
- Any decision making system, regardless of whether the process is performed entirely by humans, supported by machines, or totally automated, is a layered process, with one layer (explicit or implicit) that can be called the
- Concept Modeling Layer
6Purpose (DMS)
- General
- Specialized (Bio-informatics)
7Bio-informatics
- Genomic researchers mostly deal with similarity issues between genomic sequences. Genomic sequences are treated as long sequences of letters
- A (adenine)
- G (guanine)
- C (cytosine)
- T (thymine)
- which represent the nitrogenous bases in the DNA structure.
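Treating a genomic sequence as a string of letters can be sketched in a few lines of Python. This is only an illustration, not code from the presentation: the function name `base_counts` and the choice to reject any letter outside A/G/C/T are assumptions.

```python
from collections import Counter

# The four letters used on the slide and the bases they stand for.
BASES = {"A": "adenine", "G": "guanine", "C": "cytosine", "T": "thymine"}


def base_counts(sequence: str) -> dict:
    """Count each base in a DNA string (hypothetical helper).

    Spaces are ignored; any letter other than A/G/C/T raises ValueError.
    """
    seq = sequence.upper().replace(" ", "")
    unknown = set(seq) - set(BASES)
    if unknown:
        raise ValueError(f"unexpected letters: {sorted(unknown)}")
    counts = Counter(seq)
    return {base: counts[base] for base in BASES}


print(base_counts("GATTCATCGA"))
```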
8DNA sequence
A DNA sequence is presented as an array of letters that map to the nucleotides in DNA (each nucleotide consists of one of four types of nitrogenous bases, A/G/C/T, a five-carbon sugar, and a molecule of phosphoric acid).
[Figure: DNA chemical compound with bases (A), (T), (G), (C), and the corresponding DNA sequence]
9DNA sequence analysis
- GATTCATCGA CCATCAAAT GATT
[Figure labels on the sequence: useful data, noisy data, start sequence (shown twice), end sequence]
10Bio-informatics in DMS
- Sequence concept (still impossible / there is no protein conceptual model)
- Sequence analysis (software: BLAST, Smith-Waterman, FASTA, etc.)
- Sequence retrieval (easy / available for free on the Web: ENSEMBL.ORG, NCBI, UCSC, etc.)
- Sequencing (hard / laboratory work at the level of chemical reactions to determine whether C, T, G, or A occurs in the DNA chain)
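Of the sequence analysis tools listed above, Smith-Waterman is the simplest to show in full: it is a dynamic-programming search for the best local alignment between two sequences. The Python sketch below is a minimal version under stated assumptions. The scoring values (match 2, mismatch -1, linear gap -2) are illustrative choices; real tools typically score proteins with a substitution matrix such as BLOSUM62 and use affine gap penalties, and BLAST itself relies on heuristics rather than this exhaustive search.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment.

    Scoring parameters are illustrative assumptions, not values from the slides.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below 0, which makes the alignment local.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap
            left = score[i][j - 1] + gap
            score[i][j] = max(0, diag, up, left)
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and score[i][j] > 0:
        diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if score[i][j] == diag:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("GATTCATCGA", "TTCATC"))
```

The quadratic cost of filling the matrix is one reason heuristic tools are preferred for searches against large databases.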
11Sequence analysis
- In the example shown in the next two figures, one can see a fraction of the results obtained from a BLAST comparison of the protein SLC7A7 (human) against the SwissProt protein database.
- We selected two illustrative examples, ranging from a perfect (word) match to a similar match.
12BLAST Sample session, perfect match
- >gi|12643348|sp|Q9UHI5|LAT2_HUMAN <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643348&dopt=GenPept> Gene info <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643348%5BPUID%5D> Large neutral amino acids transporter small subunit 2 (L-type amino acid transporter 2) (hLAT2)
- Length = 535
- Score = 665 bits (1717), Expect = 0.0, Method: Composition-based stats.
- Identities = 332/332 (100%), Positives = 332/332 (100%), Gaps = 0/332 (0%)

Query  1    MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK  60
            MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK
Sbjct  204  MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK  263

Query  61   NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA  120
            NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA
Sbjct  264  NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA  323

Query  121  LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD  180
            LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD
Sbjct  324  LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD  383

Query  181  MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW  240
            MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW
Sbjct  384  MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW  443

Query  241  SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG  300
            SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG
13BLAST Sample session, similar match
- >gi|12643378|sp|Q9UM01|YLA1_HUMAN <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643378&dopt=GenPept> Gene info <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643378%5BPUID%5D> Y+L amino acid transporter 1 (y(+)L-type amino acid transporter 1) (y+LAT-1) (Y+LAT1) (Monocyte amino acid permease 2) (MOP-2)
- Length = 511
- Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats.
- Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)

Query  2    GIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYKN  61
            GIV G E NFE G ALA FY GW LNYVTEE P N
Sbjct  202  GIVRLGQGASTHFE--NSFEG-SSFAVGDIALALYSALFSYSGWDTLNYVTEEIKNPERN  258

Query  62   LPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVAL  121
            LP I ISPVT Y NVAY T LASAVAVTF G WIPSVAL
Sbjct  259  LPLSIGISMPIVTIIYILTNVAYYTVLDMRDILASDAVAVTFADQIFGIFNWIIPLSVAL  318

Query  122  STFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSDM  181
            S FGGN S SRLFF GREGHLP MIHVR TPPLLF I L L D
Sbjct  319  SCFGGLNASIVAASRLFFVGSREGHLPDAICMIHVERFTPVPSLLFNGIMALIYLCVEDI  378

Query  182  YTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLWS  241
            LINY F F G GQ LRWKPD PRPK FPI L FL LS
Sbjct  379  FQLINYYSFSYWFFVGLSIVGQLYLRWKEPDRPRPLKLSVFFPIVFCLCTIFLVAVPLYS  438

Query  242  EPVVCGIGLAIMLTGVPVYFL--GVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGS  299
            IGAI LGP YFL V P T Q C V E
14BLAST output
Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats.
Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)
- BLAST expresses the level of similarity between the query sequence and a database sequence in terms of Score, Expect, Method, Identities, Positives, and Gaps. This is where our DMA layer ends; from this point on, inference must be done by researchers on the basis of software output (e.g. BLAST) and knowledge gathered elsewhere (books, computers, brains).
- Also, a forthcoming challenge in the field of comparative genomic analysis is to compare large amounts of genomic data (letters). For example, to compare one mammalian genomic sequence against all existing mammalian sequences, one would need a database with 60 GB of storage (Sargasso Sea project).
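The Identities and Gaps figures in a BLAST report can be recomputed from any pair of aligned rows. The following Python sketch is illustrative only: `alignment_stats` and `percent` are hypothetical helper names, not part of BLAST, and the Positives count is left out because it depends on a substitution matrix that the slides do not specify. Truncating the percentage instead of rounding is an assumption, chosen because it reproduces the 43 reported for 138/315.

```python
def alignment_stats(query: str, sbjct: str) -> dict:
    """Count identical columns and gap columns in two aligned rows.

    Both rows must have the same length; '-' marks a gap.
    """
    if len(query) != len(sbjct):
        raise ValueError("aligned rows must have equal length")
    identities = sum(1 for q, s in zip(query, sbjct) if q == s and q != "-")
    gaps = sum(1 for q, s in zip(query, sbjct) if q == "-" or s == "-")
    return {"length": len(query), "identities": identities, "gaps": gaps}


def percent(part: int, whole: int) -> int:
    """Integer percentage, truncated (assumed to match the report's display)."""
    return part * 100 // whole


# Toy aligned pair: 7 columns, 5 identical, 1 gap column.
print(alignment_stats("GATT-CA", "GACTTCA"))

# Totals taken from the similar-match report.
print(percent(138, 315), percent(203, 315), percent(10, 315))
```

Applied block by block to the Query and Sbjct rows and summed, the same counts give the totals in the report header.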
15Application for text analysis
- Frequency (number of occurrences)
- Distance
- --------------------------
- Exclude stop words (and, if, or, etc.)
- Stemming (traveling > travel, traveled > travel)
- Synonyms (sick = ill)
- Visual Basic
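The slide names Visual Basic as the implementation language; the sketch below uses Python instead, purely to illustrate the three normalization steps (stop words, stemming, synonyms) followed by a frequency count. The stop-word list, the synonym map, and the naive suffix-stripping `stem` function are all assumptions made for this example and are much cruder than a real stemmer.

```python
import re
from collections import Counter

# Illustrative lists; a real application would use much larger ones.
STOP_WORDS = {"and", "if", "or", "the", "a", "is", "of", "in"}
SYNONYMS = {"ill": "sick"}  # map each synonym to one canonical word


def stem(word: str) -> str:
    """Very naive suffix stripping; stands in for a real stemmer."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(word: str) -> str:
    """Stem a word, then replace it by its canonical synonym if one exists."""
    stemmed = stem(word)
    return SYNONYMS.get(stemmed, stemmed)


def term_frequencies(text: str) -> Counter:
    """Count normalized words, skipping stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(normalize(w) for w in words if w not in STOP_WORDS)


print(term_frequencies("Traveling and traveled if ill or sick"))
```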
16Home-made Brandy Production
- Grape gathering is the first phase in the production of brandy, though it may also be made from plums, figs, pears, or cornel berries. The gathered grapes are crushed and then poured into wooden barrels. They are mixed several times a day, the more often the better. The resulting mass is called wine marc. The process of alcoholic fermentation usually lasts fifteen or thirty days. When it is finished, or, as people usually say, when the marc is still, distillation begins, i.e. the making of brandy, which is done in special copper cauldrons. Handmade copper cauldrons can still be found in Tuscan households.
17Word pairs with frequency and distance
  word     word          frequency  distance
- brandy   grape         10         0
- brandy   alcohol       4          1
- brandy   distillation  3          3
- brandy   strength      3          5
- brandy   making        2          5
18Concept criteria
- Frequency > 5
- Distance < 2
- Concepts
- brandy grape
- brandy alcohol
- Transcription
- brandy - made of - grapes
- brandy - kind of - alcohol
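Applying the two criteria to the word-pair table can be sketched in Python as follows. The slide does not say how the criteria combine, so this is an assumption: treating them as alternatives (a pair qualifies if either holds) reproduces the two concepts listed above, whereas requiring both would keep only the brandy/grape pair. The function name `select_concepts` and its parameters are hypothetical.

```python
# (word, word, frequency, distance) rows copied from the word-pair table.
PAIRS = [
    ("brandy", "grape", 10, 0),
    ("brandy", "alcohol", 4, 1),
    ("brandy", "distillation", 3, 3),
    ("brandy", "strength", 3, 5),
    ("brandy", "making", 2, 5),
]


def select_concepts(pairs, min_frequency=5, max_distance=2, require_both=False):
    """Return the word pairs that satisfy the concept criteria.

    require_both=False (assumed default): frequency > 5 OR distance < 2.
    require_both=True: frequency > 5 AND distance < 2.
    """
    selected = []
    for word_a, word_b, frequency, distance in pairs:
        frequency_ok = frequency > min_frequency
        distance_ok = distance < max_distance
        if require_both:
            keep = frequency_ok and distance_ok
        else:
            keep = frequency_ok or distance_ok
        if keep:
            selected.append((word_a, word_b))
    return selected


print(select_concepts(PAIRS))
print(select_concepts(PAIRS, require_both=True))
```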
19Concept Modeling layer
- Implicit (concepts are not explicitly mentioned)
  - Protein conceptual model
- Explicit (concepts are explicitly mentioned and/or defined)
  - Frequency > 5
  - Distance < 2
20Concept definition (CM)
- Node in concept network
- (semantic web)
- Node in concept web
21Concept definitions
- A structure that carries meaning.
- Needs other concepts and relations among them to be defined. Without relations, a concept cannot exist.
- Relations between concepts can also be regarded as concepts.
- All concepts can be related to each other, forming either
- 1. a concept web (where relations are also concepts)
- 2. a concept network (where relations are not concepts)
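The difference between the two structures can be made concrete with a small Python sketch. It is one possible reading of the definitions above, not the authors' implementation: the class names `ConceptNetwork` and `ConceptWeb` and the triple representation are assumptions. In the network a relation is only a label on an edge, while in the web the relation is stored as a concept in its own right.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNetwork:
    """Concepts are nodes; relations are plain edge labels, not concepts."""
    concepts: set = field(default_factory=set)
    edges: list = field(default_factory=list)  # (source, label, target)

    def relate(self, source: str, label: str, target: str) -> None:
        self.concepts.update((source, target))
        self.edges.append((source, label, target))


@dataclass
class ConceptWeb:
    """Concepts are nodes; each relation is itself registered as a concept."""
    concepts: set = field(default_factory=set)
    links: list = field(default_factory=list)  # (source, relation, target)

    def relate(self, source: str, relation: str, target: str) -> None:
        self.concepts.update((source, relation, target))
        self.links.append((source, relation, target))


network = ConceptNetwork()
network.relate("brandy", "made of", "grapes")

web = ConceptWeb()
web.relate("brandy", "made of", "grapes")

print(sorted(network.concepts))
print(sorted(web.concepts))
```

Because "made of" is a concept in the web, it can take part in further relations, which the network form does not allow.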
22Concept Network Concept Web
23Concept Modeling Learning Module
24Thank you for your attention! Questions?