Title: Citances: Citation Sentences for Semantic Analysis of Bioscience Text
1 Citances: Citation Sentences for Semantic Analysis of Bioscience Text
- Preslav I. Nakov, Ariel S. Schwartz, and Marti A. Hearst
  Computer Science Division and SIMS
  University of California, Berkeley
  http://biotext.berkeley.edu
  Supported by NSF DBI-0317510 and a gift from Genentech
2 Overview
- We propose the use of the text of the sentences
  surrounding citations as an important tool for
  semantic interpretation of bioscience text.
- We hypothesize several different uses of citation
  sentences (which we call citances), including:
  - the creation of training and testing data for
    semantic analysis (especially for entity and
    relation recognition),
  - synonym set creation,
  - database curation,
  - document summarization,
  - and information retrieval generally.
- We illustrate some of these ideas, showing that
  citations to one document in particular align
  well with what a curator extracted by hand.
- We also show preliminary results on the problem
  of normalizing the different ways that the same
  concepts are expressed within a set of citances,
  using and improving on existing techniques in
  automatic paraphrase generation.
3 Motivation for Using Citances in Bioscience Text
- We are interested in utilizing the large volume
  of available bioscience text when designing
  information extraction and retrieval tools.
- While the amount of available text is growing
  rapidly, only a few small annotated corpora exist
  for the bioscience domain.
- Full text (as opposed to abstracts) is becoming
  more available, providing new opportunities for
  automatic text processing.
- Citances provide an opportunity for coping with
  the shortage of annotated data. They essentially
  provide a semi-annotated corpus for free.
4 The Nature of Citances in Bioscience Literature
- Citations are particularly abundant in
  biosciences.
- Nearly every statement is backed up with at least
  one citation.
- It is quite common for papers in the bioscience
  domain to be cited by 30-100 other papers.
- The citances tend to state known biological facts
  with reference to the original papers that
  discovered them.
- The cited facts are typically stated in a more
  concise way in the citing papers than in the
  original papers.
- As the same facts are repeatedly stated in
  different ways in different papers, statistical
  models can be trained on existing citances to
  identify similar facts in unseen text.
5 Examples of Citances
The genetic data presented here clearly show
that the Eiger-induced small eye phenotype
depends strongly on the JNK signaling pathway. In
mammals, it has been demonstrated that the JNK
pathway is essential for the execution of
stress-induced cell death. JNK3, a JNK isoform
that is selectively expressed in the nervous
system, is required for neuronal cell death
caused by excitotoxic stress (Yang et al., 1997).
Embryonic fibroblasts from mouse deficient for
both JNK1 and JNK2 are resistant to UV-stimulated
apoptosis (Tournier et al., 2000). Whitfield et
al. (2001) have shown that Bim acts downstream of
the JNK pathway in NGF-deprivation-induced
neuronal cell death. One possible downstream
mechanism of the JNK pathway to induce cell death
may be transcriptional upregulation of Bim.
However, our results suggest the possibility that
Eiger-induced cell death signaling may be
independent of downstream jun expression, similar
to the observation that the effect of UV to cause
cell death does not require new gene expression
(Tournier et al., 2000). The JNK signaling also
mediates heat shock-induced cell death, the
execution of which is caspase independent (Gabai
et al., 2000). Furthermore, overexpression of the
EDA receptor or TAJ/TROY, a member of the TNF
receptor superfamily that exhibits extensive
homology to the EDA receptor, results in the
activation of the JNK pathway and
caspase-independent cell death (Eby et al., 2000;
Kumar et al., 2001). In some cases, JNK-induced
cell death is mediated by the release of
mitochondrial apoptogenic factors (Tournier et
al., 2000). Recently, it has been shown that
cancer cell death induced by TRAIL, a mammalian
TNF superfamily ligand, requires mitochondrial
release of Smac (Deng et al., 2002). One possible
mechanism of Eiger-induced cell death may be
JNK-mediated release of mitochondrial
caspase-independent cell death factors. In fact,
the Drosophila genome also encodes homologs of
such molecules: AIF, endo G and HtrA2. (Igaki
et al., EMBO J. 2002 June; 21(12): 3009-3018)
6 Illustrating Diagram
[Diagram not preserved. Surviving labels: numbered nodes and Fact 1, Fact 2, ..., Fact n.]
7 A Source for Unannotated Comparable Corpora
- Comparable corpora are a useful resource for the
  development of NLP tools for question answering
  and summarization.
- Most domains outside of news do not contain many
  articles discussing the same events, but
  bioscience citances have some of the requisite
  characteristics in that they include redundancies
  that allow identification of comparable
  sentences.
- We later demonstrate the use of citances as
  comparable corpora for automatic paraphrase
  extraction.
8 Summarization of the Target Papers
- The set of citances that refer to a specific
  paper can be viewed as an indication of the
  important facts in the paper as seen by the
  scientific community in that field.
- This is an excellent resource for summarization.
  In fact, we believe that a paper that is cited
  enough times can be summarized using only the
  citances pointing to it.
- Instead of showing the user all the citances
  pointing to a paper (as is done in CiteSeer and
  in Nanba et al. (2000)), we propose to first
  cluster related citances, and then display to the
  user only a summary of each cluster.
- The facts expressed by each cluster can be
  extracted and stored in a database.
- This could facilitate answering advanced queries
  on facts, such as "retrieve all documents that
  describe which genes upregulate gene G."
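
The cluster-then-summarize idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the method used in the work: the stopword list, the Jaccard threshold, the greedy clustering, and the "shortest citance" summary are all placeholders chosen for brevity, and the function names are hypothetical.

```python
import re

# Small illustrative stopword list (an assumption, not from the slides).
STOPWORDS = {"the", "of", "a", "an", "in", "is", "that", "and", "to", "by", "for"}

def tokens(citance):
    """Lowercased word set with the illustrative stopwords removed."""
    return {w for w in re.findall(r"[a-z0-9]+", citance.lower())
            if w not in STOPWORDS}

def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_citances(citances, threshold=0.3):
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of citances; element 0 is its seed
    for c in citances:
        for cluster in clusters:
            if jaccard(tokens(c), tokens(cluster[0])) >= threshold:
                cluster.append(c)
                break
        else:
            clusters.append([c])  # no similar cluster found: start a new one
    return clusters

def representative(cluster):
    """Crude stand-in for a cluster summary: the shortest citance."""
    return min(cluster, key=len)
```

Calling `cluster_citances` on all citances that point to one paper and then showing `representative(cluster)` for each cluster would give the user one line per group instead of the full list.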
9 Synonym Identification and Disambiguation
- Bioscience literature is rife with abbreviations
  and synonyms.
- Citances referring to the same article may allow
  synonyms to be identified and recorded.
- A collection of related citances can help
  disambiguate terms with multiple meanings, since
  in some of the citances an unambiguous form of
  the term might be present.
10 Entity Recognition and Relation Extraction
- Citances give us a way to build a model of
  many of the different ways to express a
  relationship type R between entities of type A
  and B.
- We can seed learning algorithms with several
  examples using concepts that are semantically
  similar to A and similar to B, for which relation
  R is known to hold.
- Then we can train a model to recognize this kind
  of relation in cases where the relation is not
  yet known.
- Since the results may extend to sentences that
  are not citances as well, citance-based corpora
  should provide a good collection for building NLP
  tools for recognizing entities and relations in
  unseen text.
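
A very reduced version of the seeding idea is shown below. It substitutes simple surface-pattern harvesting for a trained statistical model, so it is a stand-in for illustration only; the function names, the regular expressions, and the placeholder entities `GeneX` and `GeneY` are assumptions.

```python
import re
from collections import Counter

def harvest_patterns(citances, seed_pairs):
    """Count the text found between A and B in citances mentioning a seed pair."""
    patterns = Counter()
    for a, b in seed_pairs:
        regex = re.compile(
            re.escape(a) + r"\s+(.+?)\s+" + re.escape(b), re.IGNORECASE)
        for c in citances:
            m = regex.search(c)
            if m:
                patterns[m.group(1).lower()] += 1
    return patterns

def apply_patterns(sentence, patterns, min_count=1):
    """Return (left, pattern, right) triples where a harvested pattern occurs."""
    hits = []
    for p, n in patterns.items():
        if n < min_count:
            continue
        # One token on each side of the pattern is taken as a candidate entity.
        m = re.search(r"(\w[\w-]*)\s+" + re.escape(p) + r"\s+(\w[\w-]*)",
                      sentence, re.IGNORECASE)
        if m:
            hits.append((m.group(1), p, m.group(2)))
    return hits

# Seed pair taken from the example citances; GeneX/GeneY are placeholders.
seeds = [("Bim", "JNK")]
found = harvest_patterns(["Bim acts downstream of the JNK pathway."], seeds)
print(apply_patterns("GeneX acts downstream of the GeneY pathway.", found))
```

A real system would need entity recognition on both sides of the pattern and a model that generalizes beyond exact string matches; the sketch only shows how known pairs can bootstrap candidate expressions of R.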
11 Targets for Curation
- We hypothesize that citances contain the most
  important information expressed in the cited
  document, and therefore contain the information
  that curators would want to make use of.
- We have found support for this hypothesis with
  two sample papers being used by a cancer
  researcher who is recording information about the
  process of apoptosis.
12 Improved Citation Indexes for Information Retrieval
- Citation indexes can be improved by combining
  methods that use citance context (e.g., Mercer
  and Di Marco (2004)) with methods that use
  citance content (e.g., Bradshaw (2003)).
- For example, indexing terms can be taken from
  citances referring to a target paper, weighting
  them by both their relative frequency and the
  type of citation they appear in.
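
One way to read "weighting by relative frequency and citation type" is sketched below. The citation-type labels and their weights are invented for the example, and the scoring formula (type-weighted counts normalized to sum to 1) is an assumption rather than anything specified in the slides.

```python
import re
from collections import Counter

# Hypothetical weights per citation type; labels and values are illustrative.
TYPE_WEIGHTS = {"supports": 1.0, "background": 0.5}

def index_terms(typed_citances, type_weights=TYPE_WEIGHTS, default_weight=0.25):
    """Score terms by type-weighted count, normalized so all scores sum to 1."""
    scores = Counter()
    for text, citation_type in typed_citances:
        w = type_weights.get(citation_type, default_weight)
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            scores[term] += w
    total = sum(scores.values())
    return {t: s / total for t, s in scores.items()} if total else {}

# Each citance is paired with the (assumed) type of the citation it comes from.
weights = index_terms([
    ("JNK mediates cell death", "supports"),
    ("cell death pathways", "background"),
])
for term, score in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term}\t{score:.3f}")
```

Terms that recur across citances, or that appear in more heavily weighted citation types, end up with higher scores and would be preferred as index terms for the target paper.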
13 Related Work
- Traditional citation analysis dates back to the
  1960s (Garfield). It includes:
  - citation categorization,
  - context analysis,
  - citer motivation.
- Citation indexing systems, such as ISI's SCI and
  CiteSeer.
- Mercer and Di Marco (2004) propose to improve
  citation indexing using citation types.
- Bradshaw (2003) introduces Reference Directed
  Indexing (RDI), which indexes documents using the
  terms in the citances citing them.
14 Related Work (cont.)
- Teufel and Moens (2002) identify citances to
  improve summarization of the citing paper. They
  give lower weight to citances as candidate
  sentences for summarization.
- Nanba et al. (2000) use citances as features for
  classifying papers into topics.
- A field related to citation indexing is the use
  of link structure and anchor text of Web pages.
  - Applications include IR, classification, Web
    crawlers, and summarization.
- See the full paper for references.
15 Issues for Processing Citances
- Text span
  - Identification of the appropriate phrase, clause,
    or sentence that constitutes a citance.
  - Correct mapping of citations when shown as lists
    or groups (e.g., 22-25).
- Grouping citances by topic
  - Citances that cite the same document should be
    grouped by the facts they state.
- Normalizing or paraphrasing citances
  - For IR, summarization, learning synonyms,
    relation extraction, question answering, and
    machine translation.
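
The mapping problem for grouped citations such as 22-25 can be illustrated with a small sketch. It assumes numeric citations written in square brackets, which is only one of several styles found in practice; the function names are hypothetical.

```python
import re

def expand_citation_group(group):
    """Expand a bracket-free group such as '3, 22-25' into [3, 22, 23, 24, 25]."""
    refs = []
    for part in re.split(r"\s*,\s*", group.strip()):
        # Accept either a hyphen or an en dash between the range endpoints.
        m = re.fullmatch(r"(\d+)\s*[-\u2013]\s*(\d+)", part)
        if m:
            lo, hi = int(m.group(1)), int(m.group(2))
            refs.extend(range(lo, hi + 1))
        elif part.isdigit():
            refs.append(int(part))
    return refs

def citations_in(sentence):
    """Collect every reference number covered by bracketed groups in a sentence."""
    refs = []
    for group in re.findall(r"\[([\d\s,\u2013-]+)\]", sentence):
        refs.extend(expand_citation_group(group))
    return refs
```

With this, a sentence ending in `[3, 22-25]` is treated as a citance for references 3, 22, 23, 24, and 25, not only for the two numbers that literally appear next to the hyphen.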
16 Paraphrasing Citances
17 Conclusions
- We have motivated and discussed the potentially
  enormous role that sentences surrounding
  citations, or citances, can play in automated
  analysis of bioscience literature.
- In work not yet reported, we have found that
  citances align very well with rich information
  being curated by hand by a molecular biologist,
  and suspect they will be equally useful for other
  curation tasks.
- We also hypothesize that citances will be a gold
  mine of data for training algorithms to perform
  semantic analysis of bioscience text, and will
  improve the results of querying the bioscience
  literature.
- Much work must be done before citances can be put
  to full use.
- We have demonstrated some initial results in
  paraphrasing citances that discuss the same
  topic, but more work remains to be done to
  improve results, and to group similar citances
  together.
- In future work, we plan to thoroughly explore the
  possibilities surrounding the analysis and use of
  citances for bioscience text analysis.