Lecture Integrated Logic Systems - PowerPoint PPT Presentation

Provided by: biotecTu

1
Lecture Integrated Logic Systems
  • The Semantic Web and Ontology term extraction
    from research articles
  • Andreas Doms
  • adoms_at_biotec.tu-dresden.de

2
Bioinformatics and the Semantic Web
  • A strategic research workshop of the NSF and the EU
    found that bioinformatics could play the role for
    the Semantic Web that physics played for the Web.
  • Why?
  • Masses of information
  • Data is public
  • Data is online
  • Data (more and more often) published in XML
  • Data standards are accepted and actively
    developed by bioinformatics community
  • Much valuable information is scattered (as
    production is cheap and hence often not centralised)
  • Systems integration and interoperation are a prime
    concern
  • Prediction: In the not too distant future many
    tools and databases will be accessible as web
    services

3
The Web
  • A great success story, but
  • it's the web for humans, not machines

4
Example: PubMed
  • >12,000,000 literature abstracts
  • Great resource if one knows what one is looking
    for
  • Kox1 has 17 hits
  • But diabetes will produce >200,000
  • Abstracts often need to be processed automatically

5
Results of PubMed
  • Lorenz P. Transcriptional repression mediated by
    the KRAB domain of the human C2H2 zinc finger
    protein Kox1/ZNF10 does not require histone
    deacetylation. Biol Chem. 2001 Apr;382(4):637-44.
  • Fredericks WJ. An engineered PAX3-KRAB
    transcriptional repressor inhibits the malignant
    phenotype of alveolar rhabdomyosarcoma cells
    harboring the endogenous PAX3-FKHR oncogene. Mol
    Cell Biol. 2000 Jul;20(14):5019-31. ...

6
Results of PubMed

7
Results of PubMed

8
GeneOntology
  • Biologists have recognised the problem of
    semantic inter-operability between disparate
    information sources
  • GeneOntology (GO) is an effort to provide a common
    vocabulary for molecular biology
  • GO has >19,000 terms in three branches:
    function, process, localisation

9
Ontology Merging
  • Big research efforts are being made in the field
    of the Semantic Web
  • One problem here is the convergence of ontologies
    from different sources
  • Research group A develops an ontology for
    biological processes in mice
  • Research group B develops an ontology for
    biological processes in humans
  • Many of the concepts will describe the same
    process, only in two different organisms

10
Ontology Merging
  • How can we merge those two ontologies?
  • How to automatically align concepts or subtrees?
  • Matching of the concepts' names?
  • Graph matching of the descendants of the concept?
  • Domain specific ontologies vs. one global
    ontology?

11
Browser
  • Difference between WWW and the Semantic Web?
  • How to browse the Semantic Web?
  • Why does Google work so well?
  • Why is searching in the Semantic Web
    different?

12
Browser
  • Difference between WWW and the Semantic Web?
  • Plain text or HTML with plain text vs. Semantic
    Facts and Web Services
  • How to browse the Semantic Web?
  • You want direct answers to your concrete
    questions
  • Why does Google work so well?
  • Answers are found in one single document;
    citations to good resources give a heuristic
  • Why is searching in the Semantic Web
    different?
  • Reasoning over facts, new information spread over
    several resources in the web

13
From HTML to XML
  • The Semantic Web needs facts (semantically rich
    information)
  • A resource A in the web can have an <Author>
  • The <Author> is a resource B itself (personal
    webpage)
  • Some fact base says that B <created> A
  • Most knowledge in the web is in plain text
  • Web authors will create information as semantic
    facts only very slowly
  • What about the currently available millions of
    sites?
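
The facts on this slide can be pictured as subject-predicate-object triples. The sketch below is only an illustration: the two URIs, the predicate names and the helper objects() are hypothetical, and a plain Python set stands in for a real fact base.

```python
# Hypothetical identifiers, used for illustration only.
A = "http://example.org/paper"      # resource A (a document on the web)
B = "http://example.org/~author"    # resource B (the author's personal webpage)

# A fact base as a set of (subject, predicate, object) triples.
facts = {
    (A, "author", B),    # A has an <Author>, which is resource B
    (B, "created", A),   # some fact base says that B <created> A
}

def objects(subject, predicate):
    """Return all objects o for which (subject, predicate, o) is a known fact."""
    return sorted(o for s, p, o in facts if s == subject and p == predicate)

print(objects(A, "author"))
```

A machine can answer "who is the author of A?" directly from such triples, which is not possible with plain text.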

14
Semantic Annotation
  • Natural language is very ambiguous
  • Try to extract facts from natural language texts
  • First problem: find and disambiguate entities
  • Use domain specific ontologies on domain specific
    texts

15
Example
  • PubMed article number 7638186: In this report we
    demonstrate that IL-12 induces tyrosine
    phosphorylation of a recently identified STAT
    family member, STAT4, and show that STAT4
    expression is regulated by T-cell activation.

  • tyrosine phosphorylation of STAT protein (is a
    GOTerm!)

16
Term Extraction
  • Sequence alignment algorithms are used very
    successfully in bioinformatics
  • Can they be used for term extraction?

17
Term Extraction
18
Term Extraction
  • No uniform vocabulary is used
  • GeneOntology is an attempt to unify the language of
    molecular biology; example GOTerms:
  • isomerase activity
  • cis-trans isomerase activity
  • peptidyl-prolyl cis-trans isomerase activity
  • cyclophilin-type peptidyl-prolyl cis-trans
    isomerase activity

19
The GeneOntology
  • Gene Ontology
  • Molecular function
  • Biological process
  • Cellular component

20
Example
  • Query for "levamisole inhibitor" retrieves more
    than 100 hits in PubMed.
  • Which enzymatic functions does levamisole
    inhibit?
  • Query for "levamisole inhibitor enzymatic activity"
    retrieves only 5 hits.
  • Many relevant articles are missing.

21
(No Transcript)
22
  • Transferase: 8
  • Kinase: 6
  • Hydrolase: 58
  • Oxidoreductase: 2
  • Lyase: 1
23
Phosphofructokinase
24
In PubMed this article is listed at position 84
25
Levamisole directly inhibits tumor
phosphofructokinase
26
Term Extraction
  • Can't we use sequence alignment for term
    extraction?
  • ATAGCTGCTAGCTAGATGTACTAGCATCGT
  • GCATGTAGGCATC

27
Term Extraction
  • What is similar to natural language texts?
  • ATAGCTGCTAGCTAGATGTACTAGCATCGT
  • GC---ATGTAG--GCATC

28
The idea
  • PubMed article 7638186: Interleukin 12 induces
    tyrosine phosphorylation and activation of STAT4
    in human lymphocytes. Interleukin 12 (IL-12) is
    an important immunoregulatory cytokine whose
    receptor is a member of the hematopoietin
    receptor superfamily. We have recently
    demonstrated that stimulation of human T and
    natural killer cells with IL-12 induces tyrosine
    phosphorylation of the Janus family tyrosine
    kinase JAK2 and Tyk2, implicating these kinases
    in the immediate biochemical response to IL-12.
    Recently, transcription factors known as STATs
    (signal transducers and activators of
    transcription) have been shown to be tyrosine
    phosphorylated and activated in response to a
    number of cytokines that bind hematopoietin
    receptors and activate JAK kinases. In this
    report we demonstrate that IL-12 induces tyrosine
    phosphorylation of a recently identified STAT
    family member, STAT4, and show that STAT4
    expression is regulated by T-cell activation.
    Furthermore, we show that IL-12 stimulates
    formation of a DNA-binding complex that
    recognizes a DNA sequence previously shown to
    bind STAT proteins and that this complex contains
    STAT4. These data, and the recent demonstration
    of JAK phosphorylation by IL-12, identify a rapid
    signal-transduction pathway likely to mediate
    IL-12-induced gene expression.

30
The idea
  • Let's find "tyrosine phosphorylation of STAT
    protein"
  • In this report we demonstrate that IL-12 induces
    tyrosine phosphorylation of a recently identified
    STAT family member, STAT4, and show that STAT4
    expression is regulated by T-cell activation.
  • Gaps and one missing word: is the word
    "protein" really needed for the correct
    extraction?

31
Local alignment for term extraction
  • As in sequence alignment for protein sequences,
    the algorithm aligns words from the GOTerm
    against words from free text.
  • Gaps in the GOTerm are allowed
  • Deletion of words from the term is allowed
  • Mutations are allowed in the form of stemmed
    word forms
  • activity, active → activ (word stem)
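
The "mutation" rule above can be sketched with a toy suffix stripper. This is an assumption for illustration, not the stemmer used in the lecture: the suffix list and the functions crude_stem() and same_stem() are hypothetical, and a real system would more likely use an established stemming algorithm.

```python
# Assumed, deliberately small suffix list; longer suffixes are tried first.
SUFFIXES = ("ation", "ity", "ing", "ed", "es", "e", "s")

def crude_stem(word):
    """Strip at most one common English suffix, keeping at least 3 letters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w

def same_stem(a, b):
    """Two words count as a match when their crude stems are identical."""
    return crude_stem(a) == crude_stem(b)

print(crude_stem("activity"), crude_stem("active"))
print(same_stem("bind", "binding"))
```

With this comparison, a term word and a text word match even when English grammar has changed the word form.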

32
Local alignment for term extraction
  • What to do if a word of a GOTerm is missing in
    the text but the concept is clearly mentioned in
    the sentence?
  • Do we align against the full text or against
    paragraphs or sentences?
  • Currently the best results are achieved with
    per-sentence alignment
  • How to penalize the gaps?

33
Local alignment for term extraction
  • The current implementation
  • Takes one article's abstract
  • Tokenizes it into sentences
  • For each potential term to be found in the
    sentence, it aligns the term with the sequence of
    words from the text (sentence and term are
    tokenized beforehand using the same tokenizer)
  • The scoring function (to compare two words) uses
    the concept of information content to calculate
    the gain or the penalty (for mismatches)
  • Each term has an information content, which is the
    sum of the information contents of its words
  • If all words of a term could be found (in the
    correct order, and there are no gaps), the term is
    matched 100%
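
The steps above can be sketched as a small dynamic program. This is a minimal sketch under stated assumptions, not the lecture's implementation: the word frequencies, DEFAULT_FREQ, GAP_PENALTY, DELETE_FACTOR and the function names are all hypothetical, and words are compared by exact lower-case equality rather than by stem.

```python
import math

# Hypothetical relative word frequencies in the term vocabulary (assumed values).
WORD_FREQ = {"of": 0.2, "protein": 0.05, "activity": 0.1}
DEFAULT_FREQ = 0.001   # assumed frequency for words not in the table
GAP_PENALTY = 0.5      # assumed cost per text word inserted inside a term
DELETE_FACTOR = 0.5    # assumed share of a word's IC charged when it is missing

def tokenize(text):
    """Lower-case, split on whitespace and drop surrounding punctuation."""
    words = (w.strip(".,;:()") for w in text.lower().split())
    return [w for w in words if w]

def ic(word):
    """Information content of a word: rare words carry more information."""
    return -math.log2(WORD_FREQ.get(word, DEFAULT_FREQ))

def match_percent(term, sentence):
    """Align all term words against any stretch of the sentence, in percent."""
    t, s = tokenize(term), tokenize(sentence)
    total = sum(ic(w) for w in t)
    # h[i][j]: best score for the first i term words, ending at text word j.
    # Row 0 stays 0.0, so the alignment may start anywhere in the sentence.
    h = [[0.0] * (len(s) + 1) for _ in range(len(t) + 1)]
    for i in range(1, len(t) + 1):
        missing = DELETE_FACTOR * ic(t[i - 1])
        h[i][0] = h[i - 1][0] - missing
        for j in range(1, len(s) + 1):
            best = max(h[i - 1][j] - missing,       # term word missing in text
                       h[i][j - 1] - GAP_PENALTY)   # extra text word (gap)
            if t[i - 1] == s[j - 1]:
                best = max(best, h[i - 1][j - 1] + ic(t[i - 1]))  # gain the IC
            h[i][j] = best
    score = max(h[len(t)])   # the alignment may end anywhere in the sentence
    return 100.0 * max(score, 0.0) / total

term = "tyrosine phosphorylation of STAT protein"
text = ("IL-12 induces tyrosine phosphorylation of a recently "
        "identified STAT family member")
print(round(match_percent(term, text), 1))
```

A term found word for word and without gaps reaches 100 percent; gaps and missing words lower the score in proportion to the information they carry.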

34
Extraction Examples
  • Grammatical roots. (Terms are used in a modified
    version because of English grammar rules.)
  • PMID 7744799: The protein products of this gene
    contain the basic-helix-loop-helix motif
    characteristic of a large family of transcription
    factors that bind to the canonical DNA sequence
    CANNTG as protein heterodimers.
  • This sentence mentions the GO term transcription
    factor binding (GO:0008134), but stemming is needed
    here to identify bind as binding.

35
Extraction Examples
  • Insertions. (Examples of insertion into the
    original term without changing the meaning are
    possible)
  • PMID 7578980: Primed monocytes transcribed TNF
    mRNA at a higher rate than freshly isolated
    monocytes upon activation with LPS. (monocyte
    activation (GO:0042117)) and, with an even longer
    insertion:
  • PMID 7612661: Although all nm23 proteins contain
    nucleoside diphosphate (NDP) kinase activity, it
    has not been established that the enzyme activity
    mediated the various functions of nm23 proteins.
    (protein kinase activity (GO:0004672)).
  • These examples suggest allowing insertions into
    the terms. The insertions are words that do not
    invalidate the meaning of the original term.

36
Extraction Examples
  • Hyphenated compounds vs. spaced words
  • The use of hyphenated compounds and spaced words
    is not always consistent. Terms like
    thioredoxin-disulfide reductase activity
    (GO:0004791) occur with and without the hyphen
    between the first two words.
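
One possible way to make both spellings comparable is to treat hyphens like spaces during tokenization. The sketch below assumes this normalization; the function name tokens() is hypothetical.

```python
import re

def tokens(text):
    """Split on whitespace and hyphens so both spellings yield the same words."""
    return [t for t in re.split(r"[\s\-]+", text.lower()) if t]

print(tokens("thioredoxin-disulfide reductase activity"))
print(tokens("thioredoxin disulfide reductase activity"))
```

The trade-off is that hyphens carrying meaning are split as well, so the same tokenizer has to be applied to both the term and the text.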

37
Extraction Examples
  • Alternatives. The term small-molecule carrier or
    transporter (GO:0005468) can be mentioned in the
    form small-molecule carrier or small-molecule
    transporter.
  • Endonuclease activity, active with either ribo-
    or deoxyribonucleic acids and producing
    5'-phosphomonoesters (GO:0016893) will most likely
    be used without the complementary subclause
    after the comma, although omitting it without
    reference to the context can make it ambiguous.
    1,239 terms (7.3%) contain one or more commas.
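
The "Alternatives" case can be sketched by expanding a term that ends in "X or Y" into its two readings before matching. This is a simplified assumption: the pattern and the function expand_alternatives() are hypothetical and only handle a single trailing pair of single-word alternatives.

```python
import re

# A trailing "<word> or <word>" pair; everything before it is the shared prefix.
ALTERNATIVE = re.compile(r"^(.*?)(\w+) or (\w+)$")

def expand_alternatives(term):
    """Return both readings of a term ending in 'X or Y', else the term itself."""
    m = ALTERNATIVE.match(term)
    if not m:
        return [term]
    prefix, first, second = m.groups()
    return [prefix + first, prefix + second]

print(expand_alternatives("small-molecule carrier or transporter"))
```

Each expanded variant can then be aligned against the text as if it were a term of its own.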

38
Extraction Examples
  • Square brackets.
  • Terms have brackets in their names, like
    [methionine synthase] reductase activity
    (GO:0030586).

39
Extraction Examples
  • Sensu extension.
  • For terms like structural constituent of chorion
    (sensu Insecta) (GO:0005213), it is unlikely that
    the author mentions them with the suffix (sensu
    Insecta), as it will be clear from the abstract's
    context that the topic is about insects.
  • In GO, 438 terms have such a suffix.
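
One way to handle such terms is to strip the sensu suffix before matching. The following is a minimal sketch under that assumption; the regular expression and the function strip_sensu() are hypothetical.

```python
import re

# A trailing "(sensu ...)" qualifier, including the whitespace before it.
SENSU = re.compile(r"\s*\(sensu [^)]*\)\s*$")

def strip_sensu(term):
    """Remove a trailing '(sensu ...)' suffix; other terms are returned unchanged."""
    return SENSU.sub("", term)

print(strip_sensu("structural constituent of chorion (sensu Insecta)"))
```

Stripping the suffix loses the organism restriction, so the original term should be kept alongside the shortened form.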

40
  • Comma-separated descriptive subclauses.
  • In the GO version of March 2004, 799 terms have
    commas followed by a space, e.g. hydrolase
    activity, acting on acid anhydrides, acting on
    GTP, involved in cellular and subcellular
    movement (GO:0050800). In the current GO version,
    this syntactical concept is used to encode
    relations other than "is a" or "part of".

41
Frequent words in the Gene Ontology
42
Dynamic programming
43
Idea of dynamic programming
44
Backtracking
45
(No Transcript)
46
(No Transcript)
47
Other approaches
  • Regular Expression
  • Edit distances (?)
  • Machine learning
  • Linguistic analysis

48
Tutorial today
  • Prepare your presentation at a machine in the
    lab.
  • Each student must comment on the project