Challenges of Term Recognition in Biology

Title: Challenges of Term Recognition in Biology


1
Challenges of Term Recognition in Biology
  • Sophia Ananiadou
  • http://www.cs.salford.ac.uk/NLP.html
  • University of Salford
  • UK National Text Mining Centre

2
Outline
  • Automatic term recognition in Biology
  • ATR vs NER
  • Term formation patterns
  • Nested terms
  • Term variation
  • Acronyms
  • Conclusion

3
Automatic Term Extraction
  • One of the most crucial research topics in
    Biomedicine
  • New terms (names of genes, proteins, drugs etc)
    are constantly created
  • Existing resources are not sufficient
  • over 280 databases in use (manually created and
    curated)
  • controlled vocabularies are static repositories
  • impossible to update terminologies manually due
    to the dynamic nature of the domain

4
Automatic Term Recognition
  • ATR deals with spotting terms and their variants
    in texts and producing a list of candidate terms
  • Recognition is the first step of term
    identification
  • Classification (broad or fine grained) is the
    next step
  • Last step is term integration with terminological
    resources, dictionaries, ontologies and databases

5
The whole picture
Term identification
Recognition
Classification
Integration
6
NER and ATR
  • ATR: recognition and grouping of term variants
  • NER: recognition and classification of occurrences
    of terms into specific classes, e.g. gene,
    protein
  • In NER the steps of recognition and
    classification are merged; a classified
    terminological instance is a named entity
  • The tasks of ATR and NER share techniques but
    their ultimate goals are different
  • ATR: for resource building (lexica, ontologies)
  • NER: first step of IE and text mining

7
Naming in Biology
  • Naming conventions in Biology
  • Term formation guidelines from formal bodies e.g.
    Guidelines for Human Gene Nomenclature
  • Are these sufficient? How often do we encounter
    ad-hoc names?
  • Bride of sevenless (boss)
  • Yotiao
  • Term formation patterns
  • Use of existing resources, narrow, widen or
    adjust the meaning of a wordform
  • Use of general English words causes ambiguity:
    was, not

8
Term formation patterns
  • Modification of existing resources
  • Affixation (transferase)
  • Compounding (retinoic acid receptor)
  • Acronyms (RAR = retinoic acid receptor)
  • Neologisms
  • Foreign imports (cytomegalovirus)
  • Numerals (9-cis retinoic acid)
  • Symbols (Ca2+-calmodulin-dependent protein)
  • Eponyms (Ewing sarcoma)

9
Inner structure of terms
  • Majority of terms are multi-word units
  • Restricted set of syntactic patterns
  • Maximal vs nested term
  • leukaemic T cell line Kit225
  • Recognising the boundaries of multi-word terms
  • In Genia, 1/3 of all nested terms appear more
    than once as nested; half of nested terms do not
    appear independently in the corpus
  • Spotting nested terms on their own in the corpus
    is not sufficient
  • Establishing semantic relationships among
    constituents (as a first step to building
    ontologies)
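
To make the nested versus independent distinction concrete, here is a minimal sketch that counts how often a candidate occurs inside a longer term and how often it occurs as a maximal term of its own. It assumes the maximal terms have already been extracted; the function names and the toy input are hypothetical, and it enumerates every contiguous word sequence rather than filtering by syntactic pattern as a real ATR system would.

```python
from collections import Counter

def nested_candidates(term):
    """Return every contiguous word sub-sequence strictly shorter than the term."""
    words = term.split()
    n = len(words)
    return [
        " ".join(words[i:j])
        for i in range(n)
        for j in range(i + 1, n + 1)
        if j - i < n
    ]

def nested_vs_independent(maximal_terms):
    """Map each nested candidate to (nested count, independent count)."""
    independent = Counter(maximal_terms)
    nested = Counter()
    for term in maximal_terms:
        # set() so a candidate is counted once per containing term occurrence
        for candidate in set(nested_candidates(term)):
            nested[candidate] += 1
    return {c: (nested[c], independent.get(c, 0)) for c in nested}

# Toy input (invented): maximal terms as they might come out of a corpus.
terms = ["leukaemic T cell line Kit225", "T cell line", "T cell"]
stats = nested_vs_independent(terms)
print(stats["T cell"])     # (2, 1): nested twice, independent once
print(stats["cell line"])  # (2, 0): never appears on its own
```

A candidate such as "cell line" in this toy input would be missed by any approach that only looks for terms occurring independently, which is the point the slide makes about Genia.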

10
Term variation and usage of synonyms
  • High correlation between degree of term variation
    and dynamic nature of domain
  • Terminological variation is the ability to
    realise a concept in many ways
  • A term variant is a synonym: there is no change
    in meaning
  • Variation occurs in controlled vocabularies and
    in texts
  • Discrepancy of types of variation between the two

11
Types of variation
  • Orthographic
  • amino acid / amino-acid, oestrogen / estrogen
  • Morphological
  • cellular gene / cell gene / cell genes
  • Lexical
  • Carcinoma / cancer
  • Structural
  • Prepositional variants: promoter of gene
  • Coordinated variants: adrenal glands and gonads
  • Acronyms
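
As a rough illustration of how the simpler variation types could be conflated, the sketch below maps orthographic and inflectional variants to a shared key and groups terms by that key. The spelling table, the plural rule and the function names are assumptions made for the example, not rules from the presentation; lexical variants (carcinoma / cancer) and derivational ones (cellular gene / cell gene) would need additional resources.

```python
import re
from collections import defaultdict

# Hypothetical spelling rewrites; a real system would use a curated table.
SPELLING = {"oestrogen": "estrogen", "leukaemic": "leukemic"}

def normalise(term):
    """Map a term to a crude canonical key (illustrative rules only)."""
    words = re.split(r"[\s\-]+", term.lower().strip())
    out = []
    for w in words:
        w = SPELLING.get(w, w)
        # naive plural stripping; skips endings such as "virus" or "basis"
        if len(w) > 3 and w.endswith("s") and not w.endswith(("ss", "us", "is")):
            w = w[:-1]
        out.append(w)
    return " ".join(out)

def group_variants(terms):
    """Group surface forms that share the same normalised key."""
    groups = defaultdict(list)
    for t in terms:
        groups[normalise(t)].append(t)
    return dict(groups)

groups = group_variants(
    ["amino acid", "amino-acid", "oestrogen", "estrogen", "cell gene", "cell genes"]
)
print(groups["amino acid"])  # ['amino acid', 'amino-acid']
```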

12
Term variation part of ATR
  • Enhancing performance through linguistic
    normalisation of variants
  • Is it term coordination or conjunction?

13
Limitations (coordination)
  • The recognition and extraction of coordinated
    terms is highly ambiguous
  • Argument coordination (B and T cells) 90%
  • Head coordination (adrenal glands and gonads) 10%
  • Morphosyntactic clues are not sufficient as they
    are not systematic (argument coordination)
  • Jun and Fos families
  • mRNA and protein levels
  • More background knowledge needed for the
    extraction and decoding of term coordinations

14
Coordination
  • N1 and N2 PCP N3
  • chicken and mouse stimulating factors
  • chicken stimulating factor (N1 PCP N3)
  • mouse stimulating factor (N2 PCP N3)
  • dimerization and DNA binding domains
  • dimerization domain (N1 N3)
  • DNA binding domain (N2 PCP N3)
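
The two examples above share the surface pattern N1 and N2 PCP N3 but decode differently. A minimal sketch of generating both candidate readings follows; the function name and the reading labels are hypothetical, the head noun is passed in its singular form, and choosing between the readings is left open because, as noted on the previous slide, that requires background knowledge.

```python
def expand_coordination(n1, n2, pcp, n3):
    """Return both candidate readings of 'N1 and N2 PCP N3'."""
    # Reading 1: the participle modifies both coordinated nouns.
    shared = [f"{n1} {pcp} {n3}", f"{n2} {pcp} {n3}"]
    # Reading 2: the participle belongs to N2 only.
    local = [f"{n1} {n3}", f"{n2} {pcp} {n3}"]
    return {"pcp_shared": shared, "pcp_local": local}

readings = expand_coordination("dimerization", "DNA", "binding", "domain")
print(readings["pcp_local"])   # ['dimerization domain', 'DNA binding domain']
print(readings["pcp_shared"])  # ['dimerization binding domain', 'DNA binding domain']
```

For "chicken and mouse stimulating factors" the `pcp_shared` reading is the intended one, while for "dimerization and DNA binding domains" it is `pcp_local`; the pattern alone does not say which.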

15
Acronyms
  • Acronyms are a very productive type of term
    variation
  • Acronym variation (synonymy)
  • NF kappa B / NF kB / nuclear factor kappa B
  • Acronym ambiguity (polysemy) even in controlled
    vocabularies
  • GR for glucocorticoid receptor and glutathione
    reductase

16
Acronyms (2)
  • Acronym patterns: acronym formation structures
  • Acronym definition patterns are syntactic
    patterns describing contexts where acronyms are
    introduced
  • Linking acronyms with expanded forms (EFs)
  • Acronym generation from EFs as part of term
    variation
  • An issue is the selection of an optimal EF window
    for correct matching between acronym and EF
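
One common acronym definition pattern is "expanded form (ACRONYM)". The sketch below looks for that pattern and tries progressively larger EF windows to the left of the parenthesis until the word initials spell the acronym. It is an illustration under simplifying assumptions, not the acquisition system described in the references: the stopword list, the `max_extra_words` parameter and the initial-letter matching rule are all invented for the example.

```python
import re

STOPWORDS = {"of", "the", "and", "in", "for"}  # illustrative, not exhaustive

def find_acronym_definitions(text, max_extra_words=2):
    """Find (acronym, expanded form) pairs for 'expanded form (ACRONYM)'."""
    pairs = []
    for m in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        acronym = m.group(1)
        # only upper-case letters are matched, so "EFs" is treated as E, F
        letters = [c.lower() for c in acronym if c.isupper()]
        words = text[:m.start()].split()
        # grow the EF window from the minimum size up to the allowed slack
        for size in range(len(letters), len(letters) + max_extra_words + 1):
            if size > len(words):
                break
            window = words[-size:]
            if window[0].lower() in STOPWORDS:
                continue  # an expanded form should not start with a stopword
            initials = [w[0].lower() for w in window if w.lower() not in STOPWORDS]
            if initials == letters:
                pairs.append((acronym, " ".join(window)))
                break
    return pairs

text = "The retinoic acid receptor (RAR) and glutathione reductase (GR) were studied."
print(find_acronym_definitions(text))
# [('RAR', 'retinoic acid receptor'), ('GR', 'glutathione reductase')]
```

Too small a window misses expanded forms that contain function words; too large a window raises the chance of a spurious match, which is the trade-off the last bullet refers to.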

17
Acronyms (3)
  • Most well-known acronyms are not defined in the
    corpus: only 25% of acronyms are defined in the corpus
  • Need of acronym dictionaries to link acronyms
    with expanded forms
  • Further improvement of acronym acquisition
    systems (90% precision, 70-80% recall) using
    syntactic / semantic information
  • Acronym variation and ambiguity are still a challenge
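
For readers unfamiliar with the two measures quoted above, this short sketch computes precision and recall over sets of (acronym, expanded form) pairs. The data is an invented toy example and does not reproduce the figures on the slide.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted (acronym, expanded form) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Invented toy data, not results from the presentation.
gold = {
    ("RAR", "retinoic acid receptor"),
    ("GR", "glucocorticoid receptor"),
    ("GR", "glutathione reductase"),
    ("NF kB", "nuclear factor kappa B"),
}
predicted = {
    ("RAR", "retinoic acid receptor"),
    ("GR", "glucocorticoid receptor"),
}
p, r = precision_recall(predicted, gold)
print(p, r)  # 1.0 0.5
```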

18
Summary
  • Handling of variations is an integral part of ATR
  • Nested terms (recognition of boundaries,
    semantic relationships among constituents)
  • Unobserved variation generation (structural,
    spelling, acronym)
  • Combining existing ontologies with ATR for finer
    and broader semantic classification and
    integration with databases

19
UK National Centre for Text Mining
  • JISC award (2004-2007)
  • Establish a high quality service provision for
    the UK bioscience community
  • Identify best-of-breed tools from academia and
    industry
  • Types of services
  • Facilitating access to tools, resources and support
  • Offering on-line use of resources and tools (also
    to guide and instruct)
  • Offering a one-stop shop for complete, end-to-end
    processing
  • We welcome collaboration. Contact:
  • S.Ananiadou_at_salford.ac.uk

20
References
  • Frantzi, K., Ananiadou, S., Mima, H. (2000)
    Automatic recognition of multi-word terms.
    International Journal of Digital Libraries
    3, 115-130, Springer-Verlag
  • Nenadic, G., Spasic, I., Ananiadou, S. (2002)
    Automatic acronym acquisition and management with
    domain specific texts, in Proc. LREC-3, Las
    Palmas, Spain, 2155-2162
  • Nenadic, G., Spasic, I., Ananiadou, S. (2003)
    Terminology-Driven Mining of Biomedical
    Literature, in Journal of Bioinformatics, Vol.
    19(8), 938-943, Oxford University Press
  • Nenadic, G., Spasic, I., Ananiadou, S. (2004)
    Mining the Biomedical Literature: What's in a
    Term? in Proceedings of International Joint
    Conference on Natural Language Processing
    (IJCNLP-04), Hainan Island, China, 247-254