Citances: Citation Sentences for Semantic Analysis of Bioscience Text

1
Citances: Citation Sentences for Semantic
Analysis of Bioscience Text
  • Preslav I. Nakov, Ariel S. Schwartz, and Marti
    A. Hearst
  • Computer Science Division and SIMS,
    University of California, Berkeley
  • http://biotext.berkeley.edu

Supported by NSF DBI-0317510 and a gift from
Genentech
2
Overview
  • We propose the use of the text of the sentences
    surrounding citations as an important tool for
    semantic interpretation of bioscience text.
  • We hypothesize several different uses of citation
    sentences (which we call citances), including
      • the creation of training and testing data for
        semantic analysis (especially for entity and
        relation recognition),
      • synonym set creation,
      • database curation,
      • document summarization,
      • and information retrieval generally.
  • We illustrate some of these ideas, showing that
    citations to one document in particular align
    well with what a curator extracted by hand.
  • We also show preliminary results on the problem
    of normalizing the different ways that the same
    concepts are expressed within a set of citances,
    using and improving on existing techniques in
    automatic paraphrase generation.

3
Motivation for using Citances in Bioscience Text
  • We are interested in utilizing the large volume
    of available bioscience text when designing
    information extraction and retrieval tools.
  • While the volume of available text is growing
    rapidly, only a few small annotated corpora exist
    for the bioscience domain.
  • Full text (as opposed to abstracts) is becoming
    more available, providing new opportunities for
    automatic text processing.
  • Citances provide an opportunity for coping with
    the shortage of annotated corpora. They
    essentially provide a semi-annotated corpus for
    free.

4
The Nature of Citances in Bioscience Literature
  • Citations are particularly abundant in
    biosciences.
  • Nearly every statement is backed up with at least
    one citation.
  • It is quite common for papers in the bioscience
    domain to be cited by 30-100 other papers.
  • The citances tend to state known biological facts
    with reference to the original papers that
    discovered them.
  • The cited facts are typically stated in a more
    concise way in the citing papers than in the
    original papers.
  • As the same facts are repeatedly stated in
    different ways in different papers, statistical
    models can be trained on existing citances to
    identify similar facts in unseen text.

5
Examples of Citances
The genetic data presented here clearly show
that the Eiger-induced small eye phenotype
depends strongly on the JNK signaling pathway. In
mammals, it has been demonstrated that the JNK
pathway is essential for the execution of
stress-induced cell death. JNK3, a JNK isoform
that is selectively expressed in the nervous
system, is required for neuronal cell death
caused by excitotoxic stress (Yang et al., 1997).
Embryonic fibroblasts from mouse deficient for
both JNK1 and JNK2 are resistant to UV-stimulated
apoptosis (Tournier et al., 2000). Whitfield et
al. (2001) have shown that Bim acts downstream of
the JNK pathway in NGF-deprivation-induced
neuronal cell death. One possible downstream
mechanism of the JNK pathway to induce cell death
may be transcriptional upregulation of Bim.
However, our results suggest the possibility that
Eiger-induced cell death signaling may be
independent of downstream jun expression, similar
to the observation that the effect of UV to cause
cell death does not require new gene expression
(Tournier et al., 2000). The JNK signaling also
mediates heat shock-induced cell death, the
execution of which is caspase independent (Gabai
et al., 2000). Furthermore, overexpression of the
EDA receptor or TAJ/TROY, a member of the TNF
receptor superfamily that exhibits extensive
homology to the EDA receptor, results in the
activation of the JNK pathway and
caspase-independent cell death (Eby et al., 2000;
Kumar et al., 2001). In some cases, JNK-induced
cell death is mediated by the release of
mitochondrial apoptogenic factors (Tournier et
al., 2000). Recently, it has been shown that
cancer cell death induced by TRAIL, a mammalian
TNF superfamily ligand, requires mitochondrial
release of Smac (Deng et al., 2002). One possible
mechanism of Eiger-induced cell death may be
JNK-mediated release of mitochondrial
caspase-independent cell death factors. In fact,
the Drosophila genome also encodes homologs of
such molecules: AIF, endo G and HtrA2. (Igaki
et al., EMBO J. 2002 June; 21(12): 3009-3018)
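
The passage above can be split into individual citances automatically. The following is a minimal sketch, not the authors' method: it assumes the author-year citation style seen in this example ("Yang et al., 1997", "Whitfield et al. (2001)") and a simple capital-letter heuristic for sentence boundaries. The names CITATION, SENTENCE_BREAK, and extract_citances are hypothetical, and two-author or numeric citation styles are not handled.

```python
import re

# Assumption: citations look like "(Yang et al., 1997)" or "Whitfield et al. (2001)".
# Group 1 captures the author part, group 2 the year.
CITATION = re.compile(r"([A-Z][\w-]+ et al\.),? \(?(\d{4})")

# Split after sentence-ending punctuation only when the next token starts with
# a capital letter, so "et al. (2001)" does not end a sentence.
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def extract_citances(text):
    """Return (sentence, [(author, year), ...]) for each sentence that cites something."""
    citances = []
    # Collapse line breaks first so wrapped slide text is treated as one paragraph.
    for sentence in SENTENCE_BREAK.split(" ".join(text.split())):
        found = CITATION.findall(sentence)
        if found:
            citances.append((sentence, found))
    return citances
```

Sentences without a citation, such as the speculative "One possible mechanism ..." statements in the passage, are dropped by this sketch.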
6
Illustrating Diagram
[Diagram not reproduced: numbered items linked to Fact 1, Fact 2, ..., Fact n]
7
A Source for Unannotated Comparable Corpora
  • Comparable corpora are a useful resource for the
    development of NLP tools for question answering
    and summarization.
  • Most domains outside of news do not contain many
    articles discussing the same events, but
    bioscience citances have some of the requisite
    characteristics in that they include redundancies
    that allow identification of comparable
    sentences.
  • We later demonstrate the use of citances as
    comparable corpora for automatic paraphrase
    extraction.

8
Summarization of the Target Papers
  • The set of citances that refer to a specific
    paper can be viewed as an indication of the
    important facts in the paper as seen by the
    scientific community in that field.
  • This is an excellent resource for summarization.
    In fact, we believe that a paper that is cited
    enough times can be summarized using only the
    citances pointing to it.
  • Instead of showing the user all the citances
    pointing to a paper (as is done in CiteSeer and
    in Nanba et al. (2000)), we propose to first
    cluster related citances, and then display to the
    user only a summary of each cluster.
  • The facts expressed by each cluster can be
    extracted and stored in a database.
  • This could facilitate answering advanced queries
    on facts, such as "retrieve all documents that
    describe which genes upregulate gene G."
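
The cluster-then-summarize proposal can be sketched as follows. This is an illustration under stated assumptions, not the method used in the paper: similarity is plain Jaccard overlap of word sets, clustering is a greedy single pass, and the "summary" is simply the shortest citance in each cluster. The stopword list, the 0.3 threshold, and all function names are hypothetical.

```python
import re

# Assumption: a tiny hand-picked stopword list, for illustration only.
STOPWORDS = {"the", "of", "a", "an", "in", "is", "that", "for",
             "to", "and", "by", "it", "has", "been"}

def tokens(citance):
    """Lowercased word set of a citance with stopwords removed."""
    words = re.findall(r"[a-z0-9]+", citance.lower())
    return {w for w in words if w not in STOPWORDS}

def jaccard(a, b):
    """Jaccard overlap of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_citances(citances, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed citance is similar enough."""
    clusters = []  # list of (seed_tokens, members)
    for citance in citances:
        t = tokens(citance)
        for seed, members in clusters:
            if jaccard(seed, t) >= threshold:
                members.append(citance)
                break
        else:
            # No existing cluster matched, so this citance seeds a new one.
            clusters.append((t, [citance]))
    return [members for _, members in clusters]

def summarize(cluster):
    """Stand-in summary (an assumption): the shortest citance in the cluster."""
    return min(cluster, key=len)
```

Calling summarize on each list returned by cluster_citances gives one line per cluster to show the user, in place of the full list of citances.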

9
Synonym Identification and Disambiguation
  • Bioscience literature is rife with abbreviations
    and synonyms.
  • Citances referring to the same article may allow
    synonyms to be identified and recorded.
  • A collection of related citances can help
    disambiguate terms with multiple meanings, since
    in some of the citances an unambiguous form of
    the term might be present.

10
Entity Recognition and Relation Extraction
  • Citances give us a way to build a model of
    many of the different ways to express a
    relationship type R between entities of type A
    and B.
  • We can seed learning algorithms with several
    examples using concepts that are semantically
    similar to A and similar to B, for which relation
    R is known to hold.
  • Then we can train a model to recognize this kind
    of relation for situations for which the relation
    is not known.
  • Since the results may extend to sentences that
    are not citances as well, citance-based corpora
    should provide a good collection for building NLP
    tools for recognizing entities and relations in
    unseen text.

11
Targets for Curation
  • We hypothesize that citances contain the most
    important information expressed in the cited
    document, and therefore contain the information
    that curators would want to make use of.
  • We have found support for this hypothesis with
    two sample papers being used by a cancer
    researcher who is recording information about the
    process of apoptosis.

12
Improved Citation Indexes for Information
Retrieval
  • Citation indexes can be improved by combining
    methods that use citance context (e.g., Mercer
    and Di Marco (2004)) with methods that use
    citance content (e.g., Bradshaw (2003)).
  • For example, indexing terms can be taken from
    citances referring to a target paper, weighting
    them by both their relative frequency and the
    type of citation they appear in.
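
One way to read the weighting idea is sketched below. The slides give no formula, so this is an assumption: each term occurrence counts as much as the weight of the citation type it appears in, and scores are normalized into a weighted relative frequency. The type labels, the values in TYPE_WEIGHTS, and the function name are hypothetical.

```python
import re
from collections import Counter

# Hypothetical citation types and weights; the slides do not specify any.
TYPE_WEIGHTS = {"supports": 1.0, "background": 0.5}

def index_terms(citances, default_weight=0.5):
    """Score index terms from (text, citation_type) pairs that cite one target paper.

    A term's score is its type-weighted count divided by the total
    type-weighted count over all terms, so the scores sum to 1.
    """
    scores = Counter()
    for text, citation_type in citances:
        weight = TYPE_WEIGHTS.get(citation_type, default_weight)
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            scores[term] += weight
    total = sum(scores.values())
    return {term: s / total for term, s in scores.items()} if total else {}
```

The highest-scoring terms would then serve as index terms for the cited paper.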

13
Related Work
  • Traditional citation analysis dates back to the
    1960s (Garfield). It includes
      • citation categorization,
      • context analysis,
      • citer motivation.
  • Citation indexing systems, such as ISI's SCI and
    CiteSeer.
  • Mercer and Di Marco (2004) propose to improve
    citation indexing using citation types.
  • Bradshaw (2003) introduces Reference Directed
    Indexing (RDI), which indexes documents using the
    terms in the citances citing them.

14
Related Work (cont.)
  • Teufel and Moens (2002) identify citances to
    improve summarization of the citing paper. They
    give lower weight to citances as candidate
    sentences for summarization.
  • Nanba et al. (2000) use citances as features for
    classifying papers into topics.
  • A field related to citation indexing is the use
    of link structure and anchor text of Web pages.
      • Applications include IR, classification, Web
        crawlers, and summarization.
  • See the full paper for references.

15
Issues for Processing Citances
  • Text span
      • Identification of the appropriate phrase,
        clause, or sentence that constitutes a citance.
      • Correct mapping of citations when shown as
        lists or groups (e.g., 22-25).
  • Grouping citances by topic
      • Citances that cite the same document should be
        grouped by the facts they state.
  • Normalizing or paraphrasing citances
      • For IR, summarization, learning synonyms,
        relation extraction, question answering, and
        machine translation.
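
The mapping of grouped citations such as "22-25" to individual references can be sketched as below. This is a minimal example that assumes numeric citation markers separated by commas or semicolons, with a hyphen or en dash for ranges. The function name is hypothetical.

```python
import re

def expand_citation_group(marker):
    """Expand a numeric citation marker, e.g. "[3, 22-25]" -> [3, 22, 23, 24, 25]."""
    numbers = []
    # Drop surrounding brackets or parentheses, then split on list separators.
    for part in re.split(r"[,;]", marker.strip("[]() ")):
        part = part.strip()
        if not part:
            continue
        # A range uses a hyphen or an en dash between two reference numbers.
        m = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if m:
            low, high = int(m.group(1)), int(m.group(2))
            if low > high:
                raise ValueError(f"descending citation range: {part!r}")
            numbers.extend(range(low, high + 1))
        elif part.isdigit():
            numbers.append(int(part))
        else:
            raise ValueError(f"unrecognized citation marker part: {part!r}")
    return numbers
```

With the group expanded, the citance can be attached to each cited document separately, so that a fact cited as "22-25" is counted for all four references.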

16
Paraphrasing Citances
17
Conclusions
  • We have motivated and discussed the potentially
    enormous role that the use of sentences
    surrounding citations, or citances, can have for
    automated analysis of bioscience literature.
  • In work not yet reported, we have found that
    citances align very well with rich information
    being curated by hand by a molecular biologist,
    and suspect they will be equally useful for other
    curation tasks.
  • We also hypothesize that they will be a gold mine
    of data for training algorithms to perform
    semantic analysis of bioscience text, and will
    improve the results of querying the bioscience
    literature.
  • Much work must be done before citances can be put
    to full use.
  • We have demonstrated some initial results in
    paraphrasing citances that discuss the same
    topic, but more work remains to be done to
    improve results, and to group similar citances
    together.
  • In future work, we plan to thoroughly explore the
    possibilities surrounding the analysis and use of
    citances for bioscience text analysis.