Concept Modeling in Bio-informatics - PowerPoint PPT Presentation


1
Concept Modeling in Bio-informatics
  • Sanida Omerovic, Saso Tomazic,
  • Mateo Valero, Milos Milovanovic, David
    Torrents
  • University of Ljubljana, Slovenia
  • UPC, Barcelona, Catalonia

  • IPSI Florence 2007

2
WHAT IS A CONCEPT?
3
Decision Making Algorithm
4
Concept Modeling Layer
  • What is a concept?
  • How is it modeled?
  • How is it built?
  • How is it exploited?
  • How is it updated?

5
Classification of concept modeling (CM) and
decision making systems (DMS)
  • This classification is based on the following
    assumption:
  • Any decision making system, regardless of whether
    the process is performed entirely by humans,
    supported by machines, or totally automated, is a
    layered process, with one layer (explicit or
    implicit) that can be called the Concept Modeling
    Layer.

6
Purpose (DMS)
  • General
  • Specialized (Bio-informatics)

7
Bio-informatics
  • Genomic researchers mostly deal with
    similarity issues between genomic sequences.
    Genomic sequences are treated as long sequences
    of letters
  • A (Adenine)
  • G (Guanine)
  • C (Cytosine)
  • T (Thymine)
  • which represent the nitrogenous bases in DNA
    structure.

8
DNA sequence
A DNA sequence is represented as an array of letters
that map to the nucleotides in DNA (each consisting
of one of four types of nitrogenous bases, A/G/C/T,
a five-carbon sugar, and a molecule of phosphoric
acid).
[Figure: a DNA chemical compound, with bases (A), (T), (G), and (C), shown alongside the corresponding DNA sequence]
9
DNA sequence analysis
  • GATTCATCGA CCATCAAAT GATT

[Figure: the sequence above annotated with the labels Start sequence (shown twice), Useful data, Noisy data, and End sequence]
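
The letter-array view of a DNA sequence described above can be sketched in a few lines of Python. This is only an illustration, not part of the original presentation: the function name `base_counts` is hypothetical, and the input is the example sequence from this slide with its spaces removed.

```python
from collections import Counter

# The four nitrogenous bases named in the slides.
BASES = {"A": "adenine", "G": "guanine", "C": "cytosine", "T": "thymine"}


def base_counts(sequence: str) -> dict:
    """Count each base in a DNA sequence given as a string of letters.

    Spaces are ignored; any letter other than A/G/C/T raises ValueError.
    """
    seq = sequence.replace(" ", "").upper()
    unknown = set(seq) - set(BASES)
    if unknown:
        raise ValueError(f"not a DNA letter: {sorted(unknown)}")
    counts = Counter(seq)
    return {base: counts.get(base, 0) for base in BASES}


# Example sequence taken from the slide.
example = "GATTCATCGA CCATCAAAT GATT"
print(base_counts(example))
```

Treating the sequence as a plain string is what makes text-style operations (searching, comparing, counting) applicable to genomic data.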
10
Bio-informatics in DMS
  • Sequence concept (still impossible; there is no
    protein conceptual model)
  • Sequence analysis (software: BLAST,
    Smith-Waterman, FASTA, etc.)
  • Sequence retrieval (easy; available for free on
    the Web: ENSEMBL.ORG, NCBI, UCSC, etc.)
  • Sequencing (hard; laboratory work at the level of
    chemical reactions to determine whether C, T, G,
    or A is present at a given position in the DNA
    chain)
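
As a rough illustration of what the sequence analysis step computes, here is a minimal sketch of the Smith-Waterman local alignment score, one of the methods named above. The scoring values (match +2, mismatch -1, linear gap -1) and the function name are assumptions chosen for the example; production tools typically use substitution matrices and affine gap penalties, and BLAST and FASTA are faster heuristics rather than this exact dynamic program.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between strings a and b.

    Scoring parameters are illustrative assumptions, not values from the slides.
    Only two rows of the dynamic-programming table are kept in memory.
    """
    cols = len(b) + 1
    prev = [0] * cols
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GATTCATCGA", "CATC"))
```

The score grows with the length and quality of the best matching region, which is the raw quantity that tools then turn into reported similarity measures.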

11
Sequence analysis
  • The next two figures show a fraction of the
    results obtained from a BLAST comparison of
    protein SLC7A7 (human) against the SwissProt
    protein database.
  • We selected two illustrative examples, ranging
    from a perfect (word) match to a similar match.

12
BLAST Sample session, perfect match
  • >gi|12643348|sp|Q9UHI5|LAT2_HUMAN
    <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643348&dopt=GenPept>
    Gene info
    <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643348%5BPUID%5D>
    Large neutral amino acids transporter small subunit 2
    (L-type amino acid transporter 2) (hLAT2)
  • Length = 535
  • Score = 665 bits (1717), Expect = 0.0, Method: Composition-based stats.
  • Identities = 332/332 (100%), Positives = 332/332 (100%), Gaps = 0/332 (0%)
  • Query   1  MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK  60
  •            MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK
  • Sbjct 204  MGIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYK  263
  • Query  61  NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA  120
  •            NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA
  • Sbjct 264  NLPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVA  323
  • Query 121  LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD  180
  •            LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD
  • Sbjct 324  LSTFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSD  383
  • Query 181  MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW  240
  •            MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW
  • Sbjct 384  MYTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLW  443
  • Query 241  SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG  300
  •            SEPVVCGIGLAIMLTGVPVYFLGVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGSG

13
BLAST Sample session, similar match
  • >gi|12643378|sp|Q9UM01|YLA1_HUMAN
    <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=Protein&list_uids=12643378&dopt=GenPept>
    Gene info
    <http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene&cmd=search&term=12643378%5BPUID%5D>
    Y+L amino acid transporter 1 (y(+)L-type amino acid transporter 1)
    (y+LAT-1) (Y+LAT1) (Monocyte amino acid permease 2) (MOP-2)
  • Length = 511
  • Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats.
  • Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)
  • Query   2  GIVQICKGEYFWLEPKNAFENFQEPDIGLVALAFLQGSFAYGGWNFLNYVTEELVDPYKN  61
  •            GIV G E NFE G ALA FY GW LNYVTEE P N
  • Sbjct 202  GIVRLGQGASTHFE--NSFEG-SSFAVGDIALALYSALFSYSGWDTLNYVTEEIKNPERN  258
  • Query  62  LPRAIFISIPLVTFVYVFANVAYVTAMSPQELLASNAVAVTFGEKLLGVMAWIMPISVAL  121
  •            LP I ISPVT Y NVAY T LASAVAVTF G WIPSVAL
  • Sbjct 259  LPLSIGISMPIVTIIYILTNVAYYTVLDMRDILASDAVAVTFADQIFGIFNWIIPLSVAL  318
  • Query 122  STFGGVNGSLFTSSRLFFAGAREGHLPSVLAMIHVKRCTPIPALLFTCISTLLMLVTSDM  181
  •            S FGGN S SRLFF GREGHLP MIHVR TPPLLF I L L D
  • Sbjct 319  SCFGGLNASIVAASRLFFVGSREGHLPDAICMIHVERFTPVPSLLFNGIMALIYLCVEDI  378
  • Query 182  YTLINYVGFINYLFYGVTVAGQIVLRWKKPDIPRPIKINLLFPIIYLLFWAFLLVFSLWS  241
  •            LINY F F G GQ LRWKPD PRPK FPI L FL LS
  • Sbjct 379  FQLINYYSFSYWFFVGLSIVGQLYLRWKEPDRPRPLKLSVFFPIVFCLCTIFLVAVPLYS  438
  • Query 242  EPVVCGIGLAIMLTGVPVYFL--GVYWQHKPKCFSDFIELLTLVSQKMCVVVYPEVERGS  299
  •            IGAI LGP YFL V P T Q C V E

14
BLAST output
Score = 257 bits (656), Expect = 4e-68, Method: Composition-based stats.
Identities = 138/315 (43%), Positives = 203/315 (64%), Gaps = 10/315 (3%)
  • BLAST expresses the level of similarity between
    the query sequence and a database sequence in
    terms of Score, Expect (E-value), Method,
    Identities, Positives, and Gaps. This is where
    our DMA layer ends; from this point on, inference
    has to be done by researchers on the basis of
    software (e.g. BLAST) output and knowledge
    gathered elsewhere (books, computers, brains).
  • A forthcoming challenge in the field of
    comparative genomic analysis is comparing large
    amounts of genomic data (letters). For example,
    to compare one mammalian genomic sequence against
    all existing mammalian sequences, one would need
    a database with 60 GB of storage (Sargasso Sea
    project).
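
The Identities and Gaps figures reported by BLAST can be reproduced from a pair of aligned rows, as in the minimal sketch below. The function names are hypothetical. Positives are left out because they depend on a substitution matrix, and the integer truncation in `percent` is an assumption that happens to be consistent with the figures on this slide (138/315 shown as 43).

```python
def alignment_stats(query_row: str, sbjct_row: str, gap_char: str = "-") -> dict:
    """Count identical columns and gap columns in two aligned rows."""
    if len(query_row) != len(sbjct_row):
        raise ValueError("aligned rows must have the same length")
    columns = list(zip(query_row, sbjct_row))
    gaps = sum(1 for q, s in columns if gap_char in (q, s))
    identities = sum(1 for q, s in columns if q == s and q != gap_char)
    return {"identities": identities, "gaps": gaps, "length": len(columns)}


def percent(part: int, whole: int) -> int:
    """Integer percentage, truncated (assumed rounding rule)."""
    return part * 100 // whole


# Toy aligned rows, not taken from the BLAST session.
stats = alignment_stats("GATT-CA", "GACTTCA")
print(stats, percent(stats["identities"], stats["length"]))

# Figures from the slide: identities, positives, gaps over 315 columns.
print(percent(138, 315), percent(203, 315), percent(10, 315))
```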

15
Application for text analysis
  • Frequency (number of occurrences)
  • Distance
  • --------------------------
  • Exclude stop words (and, if, or, etc.)
  • Stemming (traveling -> travel, traveled ->
    travel)
  • Synonyms (sick = ill)
  • Visual Basic
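
The preprocessing steps listed above (stop-word removal, stemming, synonym merging, frequency counting) can be sketched as follows. This is a Python illustration, not the Visual Basic application mentioned on the slide; the stop-word list, the synonym table, and the suffix-stripping rule are small hypothetical stand-ins for whatever the authors actually used.

```python
import re
from collections import Counter

# Illustrative assumptions: tiny stop-word and synonym tables.
STOP_WORDS = {"and", "if", "or", "the", "a", "is", "of"}
SYNONYMS = {"ill": "sick"}  # map each synonym to one canonical word


def stem(word: str) -> str:
    """Very naive stemmer: strip 'ing' or 'ed' if at least 3 letters remain."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def term_frequencies(text: str) -> Counter:
    """Count terms after stop-word removal, stemming, and synonym merging."""
    terms = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in STOP_WORDS:
            continue
        word = stem(word)
        terms.append(SYNONYMS.get(word, word))
    return Counter(terms)


print(term_frequencies("Traveling and traveled if ill or sick"))
```

A real system would use an established stemming algorithm and a curated thesaurus; the point here is only the order of the steps.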

16
Home-made Brandy Production
  • Grape-gathering is the first phase in the
    production of brandy, though it might also be
    made from plums, figs, pears, or cornel berries.
    The gathered grapes are crushed and then poured
    into wooden barrels. They are mixed several times
    a day, the more often the better. The obtained
    mass is called wine-marc. The process of
    alcoholic fermentation usually lasts fifteen to
    thirty days. When it is finished, or when, as
    people usually say, the marc is still,
    distillation begins, i.e. the making of brandy,
    which is done in special copper cauldrons.
    Handmade copper cauldrons can still be found in
    Tuscan households.

17
word 1    word 2          frequency   distance
brandy    grape           10          0
brandy    alcohol         4           1
brandy    distillation    3           3
brandy    strength        3           5
brandy    making          2           5

18
Concept criteria
  • Frequency > 5
  • Distance < 2
  • Concepts
  • brandy grape
  • brandy alcohol
  • Transcription
  • brandy - made of - grapes
  • brandy - kind of - alcohol
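
The selection of concepts from the word-pair table can be sketched as a simple filter. The slides do not say how the two criteria are combined. The sketch below assumes a pair qualifies if it meets either criterion (frequency above 5 or distance below 2), because that reading reproduces the two concepts listed here; requiring both would exclude "brandy alcohol", whose frequency is 4. The names `PAIRS` and `select_concepts` are hypothetical.

```python
# (word 1, word 2, frequency, distance), copied from the word-pair table.
PAIRS = [
    ("brandy", "grape", 10, 0),
    ("brandy", "alcohol", 4, 1),
    ("brandy", "distillation", 3, 3),
    ("brandy", "strength", 3, 5),
    ("brandy", "making", 2, 5),
]


def select_concepts(pairs, frequency_threshold=5, distance_threshold=2):
    """Keep pairs with frequency above, OR distance below, the thresholds.

    Combining the criteria with OR is an assumption, not stated in the slides.
    """
    return [
        (first, second)
        for first, second, frequency, distance in pairs
        if frequency > frequency_threshold or distance < distance_threshold
    ]


print(select_concepts(PAIRS))
```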

19
Concept Modeling layer
  • Implicit (concepts are not explicitly mentioned)
  • Protein conceptual model
  • Explicit (concepts are explicitly mentioned
    and/or defined)

  • Frequency > 5

  • Distance < 2

20
Concept definition (CM)
  • Node in concept network
  • (semantic web)
  • Node in concept web

21
Concept definitions
  • A structure that carries meaning.
  • Needs other concepts, and relations among them,
    to be defined. Without relations, a concept
    cannot exist.
  • Relations between concepts can also be regarded
    as concepts.
  • All concepts can be related to each other,
    forming either
  • 1. a concept web (where relations are also
    concepts)
  • 2. a concept network (where relations are not
    concepts)
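
One possible way to picture the difference between the two structures is sketched below, using the brandy relations from the earlier slide. The data layout and the function `network_to_web` are illustrative assumptions, not the authors' model: in the network, a relation is only a label on an edge; in the web, each relation is promoted to a node of its own.

```python
# Concept network: concepts are nodes, relations are edge labels.
concept_network = {
    "nodes": {"brandy", "grapes", "alcohol"},
    "edges": [
        ("brandy", "made of", "grapes"),
        ("brandy", "kind of", "alcohol"),
    ],
}


def network_to_web(network: dict) -> dict:
    """Build a concept web in which every relation is itself a node.

    Each labeled edge (source, relation, target) becomes two unlabeled
    links: source -> relation and relation -> target.
    """
    nodes = set(network["nodes"])
    links = []
    for source, relation, target in network["edges"]:
        nodes.add(relation)
        links.append((source, relation))
        links.append((relation, target))
    return {"nodes": nodes, "links": links}


concept_web = network_to_web(concept_network)
print(sorted(concept_web["nodes"]))
print(concept_web["links"])
```

Because relations are nodes in the web, they can in turn be related to other concepts, which the network representation cannot express.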

22
Concept Network Concept Web
23
Concept Modeling Learning Module
24
Thank you for your attention! Questions?