Data Description - PowerPoint PPT Presentation

About This Presentation
Title:

Data Description


Slides: 11
Provided by: tan81

Transcript and Presenter's Notes

Title: Data Description


1
Data Description
  • Gold standard of tagged gene/protein names to
    train and test information extraction programs
  • 20K sentences from MEDLINE: 10K related to known
    gene names, 10K unrelated to known gene names
  • Biology background of annotators: biochemistry,
    genetics, molecular biology

2
Challenges
  • Defining a true positive gene/protein name
  • Wide definition with specificity constraint
  • Preservation of semantics
  • Partial match with semantic constraints
  • Tokenization
  • Interannotator agreement
  • Annotation rules
  • Consistency Problems
  • Human limitations
  • Applying the definition of a true positive
  • Entering data

3
Defining a True Positive Gene/Protein Name
  • Wide definition with specificity constraint
  • Include domains, complexes, subunits, promoters
    if and only if they refer to a specific gene/protein
  • GenBank entries justify this definition
  • Sf3b4, splicing factor 3b, subunit 4
  • Mus musculus transaldolase gene, promoter region
  • GenBank exceptions to this definition were
    incorporated into annotation scheme
  • bHLH transcription factor mRNA
  • Xenopus laevis similar to POU domain gene

4
Defining a True Positive Gene/Protein Name
  • Preservation of semantics
  • rabies immunoglobulin (RIG)
  • immunoglobulin alone in this context does not
    convey true semantics
  • IGG receptor
  • igg alone in this context does not convey true
    semantics
  • Tumor necrosis factor 1
  • Tumor necrosis factor alone in this context does
    not convey true semantics

5
Defining a True Positive Gene/Protein Name
  • Partial match with semantic constraints
  • Exact match: inflexible
  • Partial match: inaccurate
  • Insertions, e.g. all p53 genes
  • Deletions, e.g. major histocompatibility complex
  • Gold standard plus acceptable alternatives
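
The "gold standard plus acceptable alternatives" idea can be sketched as a small lookup: a predicted span counts as correct if it equals the gold span or one of the alternatives listed for it. This is a minimal sketch, not the scheme's actual scoring program. The names `Span` and `match_type` are hypothetical, the span values are illustrative, and inclusive token indices are assumed.

```python
from typing import NamedTuple, Set


class Span(NamedTuple):
    """A tagged name, located by sentence and inclusive token indices."""
    sentence_id: int
    start: int
    end: int


def match_type(predicted: Span, gold: Span, alternatives: Set[Span]) -> str:
    """Classify a predicted span against one gold name and its alternatives."""
    if predicted == gold:
        return "exact"
    if predicted in alternatives:
        return "alternative"
    return "miss"


# Illustrative values only: one gold span with two shorter acceptable spans.
gold = Span(3919, 2, 4)
alternatives = {Span(3919, 3, 4), Span(3919, 4, 4)}

exact_result = match_type(Span(3919, 2, 4), gold, alternatives)
alternative_result = match_type(Span(3919, 4, 4), gold, alternatives)
miss_result = match_type(Span(3919, 1, 4), gold, alternatives)
```

Scoring only "exact" reproduces the inflexible exact-match regime, while accepting "alternative" as well gives the constrained partial match described above.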

6
Tokenization
  • Gold standard with acceptable alternatives
  • Need sentence indices to determine which
    alternatives are allowed for each gold standard
    name
  • Consider an example sentence
  • tumor necrosis factor appears 3 times
  • tumor necrosis factor 1 appears 2 times
  • tumor necrosis factor is not an acceptable
    alternative for the places in the sentence where
    tumor necrosis factor 1 occurs
  • Problematic reliance on exact tokenization
  • Necessary for correct sentence indices
  • TOKENIZED_CORPUS provided
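
A short sketch of why sentence indices are needed: a plain string search for the shorter name also hits the positions where the longer name occurs, so only token positions can tell the two apart. The sentence below is made up for illustration, whitespace tokenization is an assumption, and `find_spans` is a hypothetical helper rather than part of the provided corpus tools.

```python
from typing import List, Tuple


def find_spans(tokens: List[str], phrase: str) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) token indices of every occurrence of phrase."""
    target = phrase.split()
    n = len(target)
    return [
        (i, i + n - 1)
        for i in range(len(tokens) - n + 1)
        if tokens[i:i + n] == target
    ]


# Made-up example sentence, tokenized on whitespace (an assumption).
sentence = (
    "tumor necrosis factor 1 binds tumor necrosis factor receptor "
    "and tumor necrosis factor 1 inhibitors"
)
tokens = sentence.split()

short_hits = find_spans(tokens, "tumor necrosis factor")
long_hits = find_spans(tokens, "tumor necrosis factor 1")

# Short-name hits that sit inside a longer gold name are not acceptable there.
nested = [s for s in short_hits if any(l[0] <= s[0] and s[1] <= l[1] for l in long_hits)]
```

Here the shorter name is found three times and the longer one twice, and two of the three short hits fall inside the longer name. Any change in tokenization shifts these indices, which is why the same tokenized text has to be used on both sides.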

7
Interannotator Agreement
  • Reflects difficulty of task
  • Compare one annotator's tags to at least two
    other annotators' tags
  • Broadcast news IE
  • 82-93 F-score (Robinson et al. DARPA Hub 4)
  • PASTA biological IE system
  • 89 F-score for terminology (Demetriou and
    Gaizauskas)
  • Did not perform interannotator agreement study
  • Datasets originally intended for internal use
    only
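
For reference, agreement between two annotators is commonly reported as an F-score over their tagged spans, treating one annotator as the reference. The following is a minimal sketch of that calculation, assuming exact span matching and made-up tags. No such study was run on this data, and `f_score` is a hypothetical helper.

```python
from typing import Iterable, Tuple

Tag = Tuple[int, int, int]  # (sentence_id, start, end), inclusive token indices


def f_score(reference: Iterable[Tag], candidate: Iterable[Tag]) -> float:
    """Balanced F-score (0.0 to 1.0) of candidate tags against reference tags."""
    ref, cand = set(reference), set(candidate)
    true_positives = len(ref & cand)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(cand)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Made-up tags: the annotators disagree on the extent of one name.
annotator_a = {(1, 0, 3), (1, 5, 7), (2, 4, 4)}
annotator_b = {(1, 0, 3), (1, 5, 8), (2, 4, 4)}

agreement = f_score(annotator_a, annotator_b)
```

Multiplying the result by 100 puts it on the same scale as the figures quoted above. With exact matching the balanced F-score is symmetric, so it does not matter which annotator is taken as the reference.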

8
Consistency Problems
  • Applying the definition of a true positive
    gene/protein name
  • Conjunct/adjunct judgement calls
  • src homology 2 and 3
  • Nested gene/protein name judgement calls
  • stress-activated protein kinase-Jun N-terminal
    kinase
  • Semantic Lookup Time
  • E2 RAD5 ( UBC2 )
  • LAZ3/BCL6 BTB/POZ
  • Are they stand-alone synonyms?
  • Is it a complex or fusion protein?
  • Sentence context is often not enough
  • Entire abstract context is sometimes not enough
    even for experts
  • Tradeoff between accuracy and annotation time
    required

9
Consistency Problems
  • Generating acceptable alternative indices
  • Manual generation open to human error
  • The prosomal RNA-binding protein p27K is a member
    of the alpha-type human prosomal gene family.
  • Fields (name, start, end, sentence id) entered into
    a text box on a web page
  • p27K 4 4 3919
  • protein p27K 3 4 3919
  • RNA-binding protein p27K 2 4 3919
  • alpha-type human prosomal gene 10 13 3919
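
Hand-typed entries like these can be checked mechanically against the tokenized sentence. The sketch below assumes whitespace tokenization and zero-based, inclusive token indices, which is consistent with the four entries shown. `parse_entry` and `entry_is_consistent` are hypothetical helper names, not part of the original annotation tool.

```python
from typing import List, Tuple

Entry = Tuple[str, int, int, int]  # (name, start, end, sentence_id)


def parse_entry(line: str) -> Entry:
    """Split 'name start end sentence_id'; the name itself may contain spaces."""
    name, start, end, sentence_id = line.rsplit(None, 3)
    return name, int(start), int(end), int(sentence_id)


def entry_is_consistent(entry: Entry, tokens: List[str]) -> bool:
    """True if the tokens at [start, end] spell out exactly the entered name."""
    name, start, end, _ = entry
    return " ".join(tokens[start:end + 1]) == name


sentence = (
    "The prosomal RNA-binding protein p27K is a member "
    "of the alpha-type human prosomal gene family."
)
tokens = sentence.split()  # whitespace tokenization is an assumption

lines = [
    "p27K 4 4 3919",
    "protein p27K 3 4 3919",
    "RNA-binding protein p27K 2 4 3919",
    "alpha-type human prosomal gene 10 13 3919",
]
checks = [entry_is_consistent(parse_entry(line), tokens) for line in lines]

# A deliberately mistyped index is caught by the same check.
typo_ok = entry_is_consistent(parse_entry("p27K 3 3 3919"), tokens)
```

All four entries pass under these assumptions, while the mistyped one fails. A check like this catches index typos at entry time, but not a wrong choice of which alternatives are acceptable.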

10
Future Implications
  • More careful definition of true positives,
    including grey areas
  • Will improve each annotator's consistency
  • Will improve interannotator agreement
  • Make tokenization as consistent as possible
  • Relax need for 100% tokenization agreement for
    gold standard/test sets
  • Automate generation of acceptable alternative
    indices