Authors: Marius Pasca and Benjamin Van Durme - PowerPoint PPT Presentation

Weakly-Supervised Acquisition of Open-Domain Classes and Class Attributes from Web Documents and Query Logs. Authors: Marius Pasca and Benjamin Van Durme

Learn more at: https://cs.nyu.edu
1
Weakly-Supervised Acquisition of Open-Domain
Classes and Class Attributes from Web Documents
and Query Logs
  • Authors: Marius Pasca and Benjamin Van Durme
  • Presented by Bonan Min

2
Overview
  • Introduces a method that mines
    • a collection of Web search queries
    • a collection of Web documents
  • to acquire
    • open-domain classes in the form of instance sets
      (e.g., whales, seals, dolphins, sea lions)
      associated with class labels (e.g., marine animals)
    • large sets of open-domain attributes for each class
      (e.g., circulatory system, life cycle, evolution,
      food chain and scientific name for the class
      marine animals)

3
Acquiring Labeled Sets of Instances
  • For a candidate (instance, class label) pair to be
    retained, two conditions must be met
  • The class label must be a non-recursive noun
    phrase whose last component is a plural-form noun
    (e.g., zoonotic diseases).
  • The instance must also occur as a complete query
    somewhere in the query logs.
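
The two conditions above can be sketched as a simple filter. This is only an illustrative sketch, not the authors' implementation: the function names, the toy pairs, and the toy query log are all made up for this example, and the plural-noun test is a crude trailing-"s" heuristic standing in for real part-of-speech analysis. The "non-recursive noun phrase" check is not modelled at all.

```python
def is_plural_noun_label(label: str) -> bool:
    """Crude stand-in for 'last component is a plural-form noun'."""
    tokens = label.lower().split()
    if not tokens:
        return False
    last = tokens[-1]
    # Assumption: a trailing "s" (but not "ss") marks a plural noun.
    return len(last) > 2 and last.endswith("s") and not last.endswith("ss")


def filter_pairs(pairs, queries):
    """Keep (instance, label) pairs that satisfy both conditions."""
    # Condition 2 needs exact matches against complete queries.
    query_set = {q.strip().lower() for q in queries}
    kept = []
    for instance, label in pairs:
        if is_plural_noun_label(label) and instance.strip().lower() in query_set:
            kept.append((instance, label))
    return kept


# Hypothetical toy data, for illustration only.
candidate_pairs = [
    ("rabies", "zoonotic diseases"),        # passes both conditions
    ("rabies", "a serious illness"),        # label does not end in a plural noun
    ("lyme disease", "zoonotic diseases"),  # instance is never a complete query
]
query_log = ["rabies", "lyme disease symptoms", "kill bill"]

print(filter_pairs(candidate_pairs, query_log))
```

Only the first toy pair survives: the second fails the label condition, and the third fails because "lyme disease" appears only inside a longer query.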

Used to filter out inaccurate pairs.
To emphasize precision or recall. However, this
seems to imply that labeled classes don't overlap.

4
Mining Open-Domain Class Attributes
  • four stages
  • identification of a noisy pool of candidate
    attributes, as remainders of queries that also
    contain one of the class instances.
  • e.g., the query "cast jay and silent bob strike
    back" yields the candidate attribute "cast"
  • construction of internal search-signature vector
    representations for each candidate attribute,
    based on queries that contain a candidate
    attribute and a class instance.
  • These vectors consist of counts tied to the
    frequency with which an attribute occurs with a
    given templatized query, e.g., the query "cast for
    kill bill" counts toward the feature "X for Y"
  • construction of a reference internal
    search-signature vector representation for a
    small set of seed attributes provided as input. A
    reference vector is the normalized sum of the
    individual vectors corresponding to the seed
    attributes
  • the amount of supervision is limited to seed
    attributes being provided for only one of the
    classes. High precision but low recall?
  • ranking of candidate attributes with respect to
    each class, by computing similarity scores
    between their individual vector representations
    and the reference vector of the seed attributes.
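
The four stages can be sketched end to end on toy data. This is a minimal sketch under several assumptions that are not in the slides: queries are assumed to end with the instance, an optional connector word ("for", "of", "in") is folded into the template, the reference vector is normalized to sum to 1, and cosine similarity is used as a stand-in for whatever similarity function the authors use. All function names and the sample queries are hypothetical, and a single class is handled for brevity.

```python
import math
from collections import Counter, defaultdict

CONNECTORS = ("for", "of", "in")  # assumed connector words


def split_query(query, instances):
    """Stage 1: return (attribute, template) if the query ends with an instance."""
    q = query.lower().strip()
    for inst in instances:
        if q.endswith(" " + inst):
            words = q[: -len(inst)].split()
            if len(words) > 1 and words[-1] in CONNECTORS:
                return " ".join(words[:-1]), f"X {words[-1]} Y"
            if words:
                return " ".join(words), "X Y"
    return None


def build_vectors(queries, instances):
    """Stage 2: count, per candidate attribute, the templates it occurs in."""
    vectors = defaultdict(Counter)
    for query in queries:
        parsed = split_query(query, instances)
        if parsed:
            attribute, template = parsed
            vectors[attribute][template] += 1
    return vectors


def reference_vector(vectors, seeds):
    """Stage 3: normalized sum of the seed attributes' vectors."""
    total = Counter()
    for seed in seeds:
        total.update(vectors.get(seed, Counter()))
    norm = sum(total.values())
    return {t: c / norm for t, c in total.items()} if norm else {}


def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def rank_attributes(vectors, seeds):
    """Stage 4: rank non-seed candidates by similarity to the reference."""
    ref = reference_vector(vectors, seeds)
    scored = [(a, cosine(v, ref)) for a, v in vectors.items() if a not in seeds]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy data, for illustration only.
instances = ["kill bill", "jay and silent bob strike back"]
queries = [
    "cast for kill bill",
    "cast jay and silent bob strike back",
    "soundtrack for kill bill",
    "soundtrack jay and silent bob strike back",
    "director of kill bill",
]
ranked = rank_attributes(build_vectors(queries, instances), seeds=["cast"])
for attribute, score in ranked:
    print(f"{attribute}: {score:.2f}")
```

On this toy input, "soundtrack" occurs in the same templates as the seed "cast" and ranks first, while "director" shares no template with the seed and scores zero. With realistic query logs, the vectors would be far larger and sparser.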

5
Evaluation
  • Data set
  • 50 million unique queries submitted to Google in
    2006
  • The set of instances that can be potentially
    acquired by the extraction algorithm is
    heuristically limited to the top five million
    queries with the highest frequency within the
    input query logs.
  • 100 million Web documents in English, as
    available in a Web repository snapshot from 2006
  • Extraction results
  • After discarding classes with fewer than 25
    instances, the extracted set of classes consists
    of 4,583 class labels, each of them associated
    with 25 to 7,967 instances, with an average of
    189 instances per class.

6
Accuracy of Class Labels
  • A class label is
  • correct, if it captures a relevant concept
    although it could not be found in WordNet
  • subjectively correct, if it is relevant not in
    general but only in a particular context, either
    from a subjective viewpoint (e.g., modern
    appliances), or relative to a particular temporal
    anchor (e.g., current players), or in connection
    to a particular geographical area (e.g., area
    hospitals)
  • incorrect, if it does not capture any useful
    concept (e.g., multiple languages).
  • The manual analysis of the sample of 200 class
    labels indicates that 154 (77%) are relevant
    concepts and 27 (13.5%) are subjectively relevant
    concepts, for a total of 181 (90.5%) relevant
    concepts, whereas 19 (9.5%) of the labels are
    incorrect.
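
As a quick arithmetic check, the reported shares follow directly from the counts over the 200-label sample. The variable names below are only illustrative.

```python
sample_size = 200
counts = {"correct": 154, "subjectively correct": 27, "incorrect": 19}

# The three judgments should account for the whole sample.
assert sum(counts.values()) == sample_size

# Convert each count to a percentage of the sample.
shares = {label: 100 * n / sample_size for label, n in counts.items()}
relevant = shares["correct"] + shares["subjectively correct"]

print(shares)    # {'correct': 77.0, 'subjectively correct': 13.5, 'incorrect': 9.5}
print(relevant)  # 90.5
```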

7
Accuracy of Class Instances
  • Manual inspection of the automatically extracted
    instance sets indicates an average accuracy
    of 79.3% over the 37 gold-standard classes
    retained in the experiments.
  • They also claim 90% accuracy for class labels
    (37 out of 40 labels successfully matched with
    manual labels)

8
Evaluation of Class Attributes
9
Contribution
  • enables the simultaneous extraction of class
    instances, associated labels and attributes
  • Acquire thousands of open-domain classes covering
    a wide range of topics and domains
  • The accuracy exceeds 80% for both instance sets
    and class labels
  • The extraction of classes requires only a few
    commonly-used Is-A extraction patterns.
  • Extract attributes for thousands of open-domain,
    automatically-acquired classes
  • The amount of supervision is limited to five seed
    attributes provided for only one reference class.
  • The first approach to information extraction from
    a combination of both Web documents and search
    query logs, to extract open-domain knowledge
    that is expected to be suitable for later use