Supporting Annotation Layers for Natural Language Processing

1
Supporting Annotation Layers for Natural Language Processing
  • Preslav Nakov, Ariel Schwartz, Brian Wolf, Marti
    Hearst
  • Computer Science Division and SIMS, University of
    California, Berkeley
  • http://biotext.berkeley.edu

Supported by NSF DBI-0317510 and a gift from
Genentech
2
Project overview
  • A system for flexible querying of text that has
    been annotated with the results of NLP
    processing.
  • Supports:
      • self-overlapping and parallel layers,
      • integration of syntactic and ontological
        hierarchies,
      • and tight integration with SQL.
  • Designed to scale to very large corpora.
  • Demo of LQL (Layered Query Language) on examples
    taken from the NLP literature.

3
Key Contributions
  • Multiple overlapping layers (cannot be expressed
    in a single XML file)
  • Self-overlapping, parallel layers, allowing
    multiple syntactic parses of the same text
  • Integration of multiple intersecting hierarchies
    (e.g., MeSH, UMLS, WordNet)
  • Specialized query language
  • Flexible results format
  • Focused on scaling annotation-based queries to
    very large corpora (millions of documents) with
    many layers of annotations
      • 1.4 million MEDLINE abstracts
      • 10 million sentences annotated
      • 320 million multi-layered annotations
      • 70 GB database size

4
Layers of Annotations
  • Each annotation represents an interval spanning a
    sequence of characters
      • absolute start and end positions
  • Each layer corresponds to a conceptually
    different kind of annotation
  • Layers can be
      • Sequential
      • Overlapping (e.g., two multiple-word concepts
        sharing a word)
      • Hierarchical
          • spanning, when the intervals are nested as in a
            parse tree, or
          • ontological, when the token itself is derived
            from a hierarchical ontology
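
The interval view of annotations can be sketched in Python. This is only an illustration, not the system's actual data model: the `Annotation` class, the layer names, and the choice of an exclusive end offset are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    layer: str   # hypothetical layer name, e.g. "shallow_parse" or "concept"
    tag: str     # tag within the layer
    start: int   # absolute start character offset (inclusive)
    end: int     # absolute end character offset (exclusive; an assumption)


def contains(outer, inner):
    """True when inner lies entirely within outer (the nested, spanning case)."""
    return outer.start <= inner.start and inner.end <= outer.end


def overlaps(a, b):
    """True when the two intervals share at least one character."""
    return a.start < b.end and b.start < a.end


def partially_overlaps(a, b):
    """True when the intervals overlap but neither contains the other."""
    return overlaps(a, b) and not contains(a, b) and not contains(b, a)


text = "heart disease risk"
np = Annotation("shallow_parse", "NP", 0, 18)
c1 = Annotation("concept", "heart disease", 0, 13)
c2 = Annotation("concept", "disease risk", 6, 18)

# Two multi-word concepts sharing the word "disease" overlap partially,
# while both are nested inside the enclosing noun phrase.
shared = text[max(c1.start, c2.start):min(c1.end, c2.end)]
```

Here `c1` and `c2` show the overlapping case and `np` shows the spanning case; a sequential layer would simply be a list of annotations whose intervals do not overlap.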

5
Annotation Layers Example
6
System Architecture (Main table)
7
System Architecture (Indexes)
  • (Forward) doc_id, section, layer_id, sentence,
    first_word_pos, last_word_pos, tag_type
  • (Inverted) layer_id, tag_type, doc_id, section,
    sentence, first_word_pos, last_word_pos
  • (Inverted) word_id, layer_id, tag_type, doc_id,
    section, sentence, first_word_pos
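
The three composite indexes can be tried out on a toy table. The sketch below uses Python's built-in `sqlite3` as a stand-in database; the table name `annotation`, the column types, the index names, and the sample rows are assumptions, since the slides list only the index columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE annotation (            -- hypothetical layout of the main table
    doc_id INTEGER, section TEXT, layer_id INTEGER, sentence INTEGER,
    first_word_pos INTEGER, last_word_pos INTEGER,
    tag_type TEXT, word_id INTEGER
);
-- Forward index: start from a document, reach its annotations.
CREATE INDEX idx_forward ON annotation
    (doc_id, section, layer_id, sentence, first_word_pos, last_word_pos, tag_type);
-- Inverted index: start from a layer and tag, reach documents.
CREATE INDEX idx_inverted_layer ON annotation
    (layer_id, tag_type, doc_id, section, sentence, first_word_pos, last_word_pos);
-- Inverted index: start from a word, reach documents.
CREATE INDEX idx_inverted_word ON annotation
    (word_id, layer_id, tag_type, doc_id, section, sentence, first_word_pos);
""")

rows = [  # made-up sample annotations
    (1, "abstract", 1, 1, 1, 2, "NP", None),
    (1, "abstract", 2, 1, 3, 3, "verb", 17),
    (2, "title", 1, 1, 1, 3, "NP", None),
    (2, "abstract", 2, 2, 4, 4, "verb", 17),
]
conn.executemany("INSERT INTO annotation VALUES (?,?,?,?,?,?,?,?)", rows)


def annotations_in_doc(doc_id):
    """Forward lookup: all annotations of one document."""
    sql = "SELECT layer_id, tag_type FROM annotation WHERE doc_id = ? ORDER BY layer_id"
    return conn.execute(sql, (doc_id,)).fetchall()


def docs_with_tag(layer_id, tag_type):
    """Inverted lookup: documents that contain a given tag on a given layer."""
    sql = ("SELECT DISTINCT doc_id FROM annotation "
           "WHERE layer_id = ? AND tag_type = ? ORDER BY doc_id")
    return [r[0] for r in conn.execute(sql, (layer_id, tag_type))]


def docs_with_word(word_id):
    """Inverted lookup: documents that contain a given word."""
    sql = "SELECT DISTINCT doc_id FROM annotation WHERE word_id = ? ORDER BY doc_id"
    return [r[0] for r in conn.execute(sql, (word_id,))]
```

Each lookup filters on the leading columns of one index, which is why the forward index begins with `doc_id` and the inverted indexes begin with `layer_id` or `word_id`.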

8
Example query I
  • Protein-Protein Interactions
  • Goal: Find all sentences that consist of a noun
    phrase (NP) containing a gene, followed by a
    morphological variant of the verb activate,
    inhibit, or bind, followed by another NP
    containing a gene.
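
The goal above can be mimicked in plain Python over in-memory annotations. This is a rough sketch and not how the system evaluates queries: the `Ann` class, the word-position fields, the prefix-based approximation of "morphological variant", and the example sentence with placeholder gene names are all assumptions.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Ann:
    layer: str    # "shallow_parse", "gene" or "pos" in this sketch
    tag: str
    first: int    # first word position in the sentence
    last: int     # last word position (inclusive)
    content: str


# Crude stand-in for morphological variants: prefix match on a stem.
# Irregular forms such as "bound" would be missed.
VERB_STEMS = ("activat", "inhibit", "bind")


def inside(inner, outer):
    return outer.first <= inner.first and inner.last <= outer.last


def find_interactions(anns):
    """Return (NP, verb, NP) triples where both NPs contain a gene."""
    genes = [a for a in anns if a.layer == "gene"]
    nps = [a for a in anns
           if a.layer == "shallow_parse" and a.tag == "NP"
           and any(inside(g, a) for g in genes)]
    verbs = [a for a in anns
             if a.layer == "pos" and a.tag == "verb"
             and a.content.lower().startswith(VERB_STEMS)]
    hits = []
    for p1, verb, p2 in product(nps, verbs, nps):
        # Order matters; gaps between the three parts are allowed.
        if p1.last < verb.first and verb.last < p2.first:
            hits.append((p1.content, verb.content, p2.content))
    return hits


# Made-up sentence: "geneA protein inhibits geneB expression"
sentence = [
    Ann("shallow_parse", "NP", 1, 2, "geneA protein"),
    Ann("gene", "gene", 1, 1, "geneA"),
    Ann("pos", "verb", 3, 3, "inhibits"),
    Ann("shallow_parse", "NP", 4, 5, "geneB expression"),
    Ann("gene", "gene", 4, 4, "geneB"),
]
result = find_interactions(sentence)
```

Counting identical triples across many sentences (for example with `collections.Counter`) would give a frequency-ranked list of candidate interactions.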

9
Example query I - LQL
    SELECT p1_text, verb_content, p2_text, COUNT(*) AS cnt
    FROM (
      BEGIN_LQL
        [layer='sentence' { ALLOW GAPS }
          [layer='shallow_parse' && tag_name='NP'
            [layer='gene']
          ] AS p1
          [layer='pos' && tag_name="verb" &&
            (content ~ "activate" || content ~ "inhibit" || content ~ "bind")
          ] AS verb
          [layer='shallow_parse' && tag_name='NP'
            [layer='gene']
          ] AS p2
        ] SELECT p1.text AS p1_text, verb.content AS verb_content, p2.text AS p2_text
      END_LQL
    ) lql
    GROUP BY p1_text, verb_content, p2_text
    ORDER BY count(*) DESC

10
Example query I - Sample output
11
Example query II
  • Chemical-Disease Interactions
  • Example sentence: "Adherence to statin prevents
    one coronary heart disease event for every 429
    patients."
  • Goal: extract the relation that statin
    (potentially) prevents coronary heart disease.
  • The MeSH C subtree contains diseases.
  • MeSH supplementary concepts represent chemicals.

12
Example query II - LQL
    [layer='sentence' { NO ORDER, ALLOW GAPS }
      [layer='shallow_parse' && tag_name='NP'
        [layer='chemicals'] AS chemical
      ]
      [layer='shallow_parse' && tag_name='NP'
        [layer='mesh' && tree_number ~ 'C'] AS disease
      ]
    ] AS sent
    SELECT sent.pmid, chemical.text, disease.text, sent.text
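
The same idea can be sketched in Python: pair a chemical inside one NP with a MeSH disease term inside another NP of the same sentence, in either order. The dictionary layout, the placeholder `pmid`, the tree number attached to the disease term, and the requirement that the two NPs be distinct are illustrative assumptions, not details taken from the slides.

```python
from itertools import product


def in_subtree(tree_number, root):
    """True when a dot-separated MeSH tree number falls under root."""
    if len(root) == 1:                      # top-level category such as "C"
        return tree_number.startswith(root)
    return tree_number == root or tree_number.startswith(root + ".")


def chemical_disease_pairs(sentence):
    """Yield (pmid, chemical, disease) for every unordered NP pairing."""
    pairs = []
    # product(..., repeat=2) visits both orders, so word order is ignored.
    for np_chem, np_dis in product(sentence["nps"], repeat=2):
        if np_chem is np_dis:               # assumption: two distinct NPs
            continue
        for chem in np_chem["chemicals"]:
            for term, tree_number in np_dis["mesh"]:
                if in_subtree(tree_number, "C"):
                    pairs.append((sentence["pmid"], chem, term))
    return pairs


sentence = {
    "pmid": 0,  # placeholder identifier
    "text": "Adherence to statin prevents one coronary heart disease "
            "event for every 429 patients.",
    "nps": [
        {"text": "Adherence", "chemicals": [], "mesh": []},
        {"text": "statin", "chemicals": ["statin"], "mesh": []},
        {"text": "one coronary heart disease event", "chemicals": [],
         "mesh": [("coronary heart disease", "C14.280.647.250")]},  # illustrative
        {"text": "every 429 patients", "chemicals": [], "mesh": []},
    ],
}
pairs = chemical_disease_pairs(sentence)
```

Reversing the order of the NPs in the input leaves the result unchanged, which is the behaviour the NO ORDER option is meant to convey.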