Title: CBioC: Massive Collaborative Curation of Biomedical Literature
1 CBioC: Massive Collaborative Curation of Biomedical Literature
- Chitta Baral, Hasan Davulcu, Anthony Gitter, Graciela Gonzalez, Geeta Joshi-Tope, Mutsumi Nakamura, Prabhdeep Singh, Luis Tari, and Lian Yu.
2 Premise: current status of curation from text
- Our initial focus is on curation of knowledge nuggets from biomedical articles.
- About 15 million abstracts in PubMed.
- 3 million published by US and EU researchers during 1994-2004 (800 articles per day).
- 300K articles published so far reporting protein-protein interactions in human, yeast and mouse.
- BIND (in 7 years): 23K; DIP: 3K; MINT: 2.4K.
3 Premise: high cost of human curation
- The overwhelming cost of large curation efforts may be unsustainable over long periods.
- BIND, Nov 2005: bad news.
  - Operated for 7 years
  - Listed over 100 curators and programmers
  - CAD 29 million received in 2003, plus other funding
- The curation efforts of AFCS have recently stopped.
- Lack of funding for some genome annotation projects.
4 Premise: summary
- Human curation of text is expensive.
- Human curation of text is not scalable.
- Human curation of text is not sustainable.
5 Why not resort to computers and do automatic extraction?
- Lessons from the DARPA-funded MUCs (Message Understanding Conferences), run for a decade in the 90s at a cost of tens of millions of dollars:
  - Getting to 60% recall and precision is quick.
  - Then every 5% improvement is about a year's work.
  - Even when we get to 90% for an individual entity extraction, recognizing 4 related entities succeeds only (0.9)^4 ≈ 0.66 of the time.
- Lessons from biomedical text extraction:
  - No proper evaluation.
  - It is recognized that recall and precision are not very good even in the best systems.
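The compounding effect in the MUC lesson above can be checked with a few lines of arithmetic. The sketch below assumes that entities are recognized independently and with the same per-entity accuracy, which is a simplification; the function name `joint_accuracy` is illustrative and not part of any system described here.

```python
def joint_accuracy(per_entity_accuracy: float, num_entities: int) -> float:
    """Probability that all entities are recognized correctly.

    Assumes independent errors across entities (an assumption of this sketch).
    """
    if not 0.0 <= per_entity_accuracy <= 1.0:
        raise ValueError("per_entity_accuracy must be between 0 and 1")
    if num_entities < 0:
        raise ValueError("num_entities must be non-negative")
    return per_entity_accuracy ** num_entities


if __name__ == "__main__":
    # How quickly 90% per-entity accuracy erodes as more entities are combined.
    for n in (1, 2, 4):
        print(n, round(joint_accuracy(0.9, n), 4))
```

For four related entities at 0.9 each, the joint accuracy comes out at roughly 0.66, so about one in three extracted relations would contain at least one wrong entity.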
6 What do we do?
- How do we curate not only the existing articles, but also the future articles?
- Too important to give up!
- Need to think of a new way to do it.
- Faster computers, better sequencing technology and better algorithms came to the rescue of the Human Genome Project.
- Hmm. What resources are we overlooking?
7 Key idea
- If lots of articles are being written, then a lot of people are writing them and a lot of people are reading them.
- If only we could make these people (the authors and the readers) contribute to the curation effort.
- Especially the readers: the ones who need the curated data!
8 Mass collaboration has worked in
- Wikipedia
- Project Gutenberg
- Netflix rating
- Amazon rating
- Etc.
9 Mass collaborative curation: initial hurdles
- Russ Altman mentioned the challenges with respect to the authors:
  - Sticking to a format
  - Submitting data
- An average reader:
  - (S)he is not normally interested in filling in a blank curation form.
  - We cannot make an average reader go through curation training.
- So it has to be very different from just making the existing curation tools available to the masses and expecting them to contribute.
10 Mass collaborative curation: key initial ideas
- Make it very easy:
  - The user need not remember where (which database, which web page) to put the curated knowledge.
  - The curation opportunity should present itself seamlessly.
  - Curation should not be a burden to an average user.
- Make the curated knowledge thin.
- There should be immediate rewards.
- Do not start with a blank slate.
11 Realization of the key ideas: a biologist with a gene name
- Goes to PubMed, types the gene name, clicks on one of the abstracts.
- The curation panel presents itself automatically.
- Our approach calls for researchers to contribute to the curation of facts as they read and research over the web.
- But not with a blank slate:
  - No one wants to be the first one!
  - Automatic extraction jump-starts the process, and then researchers improve upon the extracted data, ironing out inconsistencies by subsequent edits on a massive scale.
- Thin schemas:
  - Average users are turned off by traditional wide schemas.
  - Wide schemas need to be broken down.
12 Case study with CBioC
- When the abstract is displayed, all of the interactions reported in the abstract are shown.
- The interactions are either extracted automatically in advance by our system or, for brand-new abstracts, extracted at display time.
- Thus, data becomes immediately available.
- Researchers then edit the extracted data, add new interactions, vote on the accuracy of the extraction, assign a confidence rating, and read comments from other researchers.
- If one or more of them goes deep into obtaining related info, the effort is not wasted and the rest of the community benefits.
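The "extracted in advance, or at display time for brand-new abstracts" behavior can be pictured as a lookup with a fallback. The following is a minimal sketch, not the CBioC implementation: `InteractionStore`, `toy_extractor`, and the `(agent, relation, target)` tuple layout are hypothetical names introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Interaction = Tuple[str, str, str]  # (agent, relation, target), assumed layout


def toy_extractor(abstract: str) -> List[Interaction]:
    """Stand-in for a real extraction system: finds 'A binds B' only."""
    words = abstract.replace(".", " ").split()
    return [(words[i - 1], "binds", words[i + 1])
            for i, word in enumerate(words)
            if word == "binds" and 0 < i < len(words) - 1]


class InteractionStore:
    """Hypothetical store mapping a PubMed ID to its extracted interactions."""

    def __init__(self, extractor: Callable[[str], List[Interaction]]):
        self._extractor = extractor
        self._by_pmid: Dict[str, List[Interaction]] = {}

    def preload(self, pmid: str, abstract: str) -> None:
        """Batch path: run the extractor ahead of time."""
        self._by_pmid[pmid] = self._extractor(abstract)

    def interactions_for(self, pmid: str, abstract: str) -> List[Interaction]:
        """Display path: reuse stored results, else extract now and keep them."""
        if pmid not in self._by_pmid:
            self._by_pmid[pmid] = self._extractor(abstract)
        return self._by_pmid[pmid]
```

Either path leaves the result in the store, so the first reader of a new abstract already sees extracted data and later readers see the same entries plus any edits made since.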
14 Basic curation with CBioC
- Interactions are corrected, incorrect extractions are voted down, and each interaction is rated on reliability based on the experimental evidence presented by the author.
- It takes a few seconds to vote on the correctness of the extractions.
- With little effort by each researcher, information is made available immediately to the whole community.
17 With more effort
- Any researcher who wishes to do a bit more can:
  - add interactions missed by the extraction system
  - add interactions reported within the full article
  - fill in more fields in the database (such as organism, experimental method, location of the interaction, or supporting evidence).
- Added interactions are subject to the community vote, just like the automatically extracted interactions.
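One way to picture a thin record that can be enriched later is a structure whose required part is only the interaction and its source, with the extra fields listed above left empty until someone fills them in. This is a sketch under assumptions: the class name, the field names, and the placeholder values are hypothetical and do not come from the CBioC database.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CuratedInteraction:
    # Thin, required core (hypothetical field names).
    agent: str
    relation: str
    target: str
    pmid: str
    # True when a reader, rather than the extraction system, added the entry.
    added_by_reader: bool = False
    # Optional fields a researcher may fill in with more effort.
    organism: Optional[str] = None
    experimental_method: Optional[str] = None
    location: Optional[str] = None
    supporting_evidence: Optional[str] = None
    # Community votes: True means "Yes, correct", False means "No".
    votes: List[bool] = field(default_factory=list)

    def missing_optional_fields(self) -> List[str]:
        """Names of the optional fields that are still empty."""
        optional = ("organism", "experimental_method",
                    "location", "supporting_evidence")
        return [name for name in optional if getattr(self, name) is None]


if __name__ == "__main__":
    # Placeholder values only; not real curated data.
    entry = CuratedInteraction("GeneA", "activates", "GeneB", "00000000",
                               added_by_reader=True)
    entry.votes.append(True)
    print(entry.missing_optional_fields())
```

Reader-added and automatically extracted entries share the same `votes` list, which mirrors the point that both are subject to the community vote.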
20 Case study 2: modifying
- A researcher could also modify the reported interactions.
- For example, consider the following statement in PMID 16297884:
  - "PKCalpha but not PKCepsilon phosphorylated the catalytic subunit of the p110alpha/p85alpha PI3K"
21 Case study 2: modifying
- The automatic extraction system extracted (PKCepsilon, phosphorylates, p110alpha/p85alpha PI3K), an error caused by the grammatical construction of the statement.
- In this case, the researcher should vote No on the accuracy of the extraction. This one cannot really be modified; it will eventually be voted down by enough No votes.
- and/or click Modify, edit the interaction, and then rate its reliability based on the evidence presented by the author.
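To see how a construction like "A but not B phosphorylated C" can mislead an extractor, consider a deliberately naive rule that pairs every protein name found before the verb with the object after it. This is an illustration only: the source does not describe how the CBioC extraction system works internally, and `naive_extract`, `negation_aware_extract`, the toy protein list, and the hard-coded target are all hypothetical.

```python
import re
from typing import List, Tuple

Triple = Tuple[str, str, str]

SENTENCE = ("PKCalpha but not PKCepsilon phosphorylated the catalytic "
            "subunit of the p110alpha/p85alpha PI3K")
KNOWN_PROTEINS = ["PKCalpha", "PKCepsilon"]  # toy dictionary for this sketch
TARGET = "p110alpha/p85alpha PI3K"           # hard-coded to keep the sketch short


def naive_extract(sentence: str) -> List[Triple]:
    """Pair every known protein left of the verb with the target."""
    subject_part, _, _ = sentence.partition(" phosphorylated ")
    return [(protein, "phosphorylates", TARGET)
            for protein in KNOWN_PROTEINS if protein in subject_part]


def negation_aware_extract(sentence: str) -> List[Triple]:
    """Same rule, but skip proteins directly preceded by 'not'."""
    subject_part, _, _ = sentence.partition(" phosphorylated ")
    negated = set(re.findall(r"\bnot\s+(\w+)", subject_part))
    return [(protein, "phosphorylates", TARGET)
            for protein in KNOWN_PROTEINS
            if protein in subject_part and protein not in negated]


if __name__ == "__main__":
    print(naive_extract(SENTENCE))           # includes the PKCepsilon triple
    print(negation_aware_extract(SENTENCE))  # keeps only the PKCalpha triple
```

The naive rule reproduces the kind of wrong triple shown on the slide, which is exactly the sort of entry a reader can reject with a single No vote.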
27 Addressing challenges
- Use ontologies and some automated tools to ensure consistency.
- To enter data, a user must register.
- Does each voter have equal weight?
- Trust management
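The question of whether each voter has equal weight can be made concrete with a weighted tally. The sketch below is one possible scheme, not something the source specifies: the trust scores, the default weight for users without a score, and the function name `weighted_yes_share` are all assumptions.

```python
from typing import Dict, Iterable, Optional, Tuple

DEFAULT_TRUST = 1.0  # assumed weight for a registered user with no history


def weighted_yes_share(votes: Iterable[Tuple[str, bool]],
                       trust: Optional[Dict[str, float]] = None) -> Optional[float]:
    """Share of the total vote weight that says 'Yes'.

    Each vote is (user_id, says_yes). Returns None when no weight was cast.
    """
    trust = trust or {}
    yes_weight = 0.0
    total_weight = 0.0
    for user, says_yes in votes:
        # Negative trust scores are clamped to zero rather than subtracting weight.
        weight = max(trust.get(user, DEFAULT_TRUST), 0.0)
        total_weight += weight
        if says_yes:
            yes_weight += weight
    if total_weight == 0.0:
        return None
    return yes_weight / total_weight
```

With an empty `trust` mapping every registered user counts the same, so the equal-weight policy is just a special case of the weighted one.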
28 Summary so far
- The information/curation window pops up automatically.
- Automatic extraction is used as a bootstrap so that no user is working on a blank slate.
- Users vote on correctness, make corrections, and add facts.
- Suppose the automatic extraction system has 60% precision and recall:
  - A person will have an easier time discarding the 40% of wrongly extracted entries than identifying the 60% of correct entries and entering them!
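The 60% argument can be made concrete with a small count. The sketch below assumes a hypothetical set of abstracts with a known number of true facts and treats precision and recall as exact rates; `review_workload` and the field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    extracted: float         # entries shown to the reader
    correct: float           # extracted entries that are right
    wrong_to_discard: float  # extracted entries to vote down
    missed_to_add: float     # true facts the system did not find


def review_workload(true_facts: int, precision: float, recall: float) -> Workload:
    """Expected counts given precision and recall, both in (0, 1]."""
    if not (0 < precision <= 1 and 0 < recall <= 1):
        raise ValueError("precision and recall must be in (0, 1]")
    correct = recall * true_facts      # recall = correct / true_facts
    extracted = correct / precision    # precision = correct / extracted
    return Workload(extracted=extracted,
                    correct=correct,
                    wrong_to_discard=extracted - correct,
                    missed_to_add=true_facts - correct)


if __name__ == "__main__":
    print(review_workload(true_facts=100, precision=0.6, recall=0.6))
```

With 100 true facts and 60% precision and recall, the reader is shown about 100 entries, votes down about 40 of them, and adds about 40 missed facts, rather than finding and typing in all 100 from a blank form.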
29 Very useful byproducts
- Avoids some problems with the existing human curation approach:
  - Curators' bias
  - Curators miss things
  - Curators have disagreements
  - Slow access to the newest findings
  - Researchers at large have little or no control over what gets curated and when
- A large curated corpus of text gets created:
  - Very useful to evaluate and improve automated extraction systems.
30 Other features
- Other abstracts related to the specific interaction are accessible through the More Articles link.
- We are in the process of integrating data from publicly available databases.
- All data (raw and processed) will be publicly available.
- Working on an independent data access and querying engine.
31 Issues and further challenges
- Works well for certain kinds of knowledge curation (interactions, ...), but what about others (genome annotation?)
- Null values
- Full papers versus abstracts
- Are thin schemas enough?
- Curating new kinds of knowledge
32 Current status, current funding, call for collaboration
- Funded by Arizona State University.
- Second (basic) beta version released.
- Proposals sent for a fully functional implementation.
- Some collaborations with outside groups are in the works.
33 Current publications
- Collaborative Curation of Data from Bio-medical Texts and Abstracts and its Integration. Chitta Baral, Hasan Davulcu, Mutsumi Nakamura, Prabhdeep Singh, Lian Yu and Luis Tari. Proceedings of the 2nd International Workshop on Data Integration in the Life Sciences (DILS'05), San Diego, July 20-22, 2005. Lecture Notes in Computer Science, Springer.
- An initial report.
- Ready to be submitted to a journal.
34 Group members and advisory board
- Postdocs: Lian Yu and Graciela Gonzalez
- Biomedical expertise: Geeta Joshi-Tope (curation), Mike Berens (signal transduction in oncology)
- Students: Luis Tari, Prabhdeep Singh, Anthony Gitter, Amanda Ziegler
- Advisory board: Gary Bader, Ken Fukuda, Shankar Subramanian.
35 Thanks