Toward the Self-Annotating Web


1
Toward the Self-Annotating Web
  • Presented by Haibo Zhao, The University of Georgia

2
Outline
  • Introduction
  • The process of PANKOW
  • Pattern-based categorization
  • Evaluation
  • Conclusion
  • Possible ways to improve

3
1 Introduction
  • The success of the Semantic Web depends on the
    proliferation of web pages annotated with
    metadata
  • How should these annotated web pages be generated:
    manually, semi-automatically, or automatically?

4
Problems with current large-scale information
extraction
  • Manually defining an information extraction
    system is a laborious, time-consuming task
  • Learning extraction rules requires too many
    training examples
  • Vicious circle: there is no Semantic Web because
    metadata is lacking, and there is no metadata
    because there is no Semantic Web that one can
    learn from

5
PANKOW
  • Pattern-based ANnotation through Knowledge On the
    Web
  • An unsupervised, pattern-based approach to
    categorize instances with regard to an ontology
  • Use globally available web data and structures to
    semantically annotate (or at least facilitate
    annotation of) web pages

6
Linguistic Patterns (1)
  • Hearst patterns: "Professors such as Dr. Sheth",
    "such Professors as Dr. Sheth", "Professors,
    especially Dr. Sheth", "Dr. Sheth and other
    Professors"

H1: CONCEPTs such as INSTANCE
H2: such CONCEPTs as INSTANCE
H3: CONCEPTs, (especially|including) INSTANCE
H4: INSTANCE (and|or) other CONCEPTs
7
Linguistic Patterns (2)
  • Definite patterns: "the Hilton hotel", "the hotel
    Hilton"

DEFINITE1: the INSTANCE CONCEPT
DEFINITE2: the CONCEPT INSTANCE
8
Linguistic Patterns (3)
  • Apposition and copula: "Hilton, a hotel in ...",
    "Hilton is a hotel ..."

APPOSITION: INSTANCE, a CONCEPT
COPULA: INSTANCE is a CONCEPT
9
2 The Process of PANKOW
10
Process details of PANKOW
  • (1) Identify candidate proper nouns
  • (2) Derive hypothesis phrases
  • (3) Query Google DB to get the number of
    hits for each hypothesis phrase
  • (4) Sum up the query results to a total
    for each instance-concept pair. Then the
    system categorizes the instances into
    their highest ranked concepts

11
(1) Identify candidate proper nouns
  • Input: A web page we want to annotate
  • Output: Set of candidate proper nouns
  • How: POS tagger (Part-Of-Speech tagger). A POS
    tagger assigns each word its syntactic
    category, such as adjective, common noun, or
    proper noun
  • Example
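
The slide's own example did not survive in the transcript. As a rough illustration only, the sketch below replaces the POS tagger with a crude capitalization heuristic; it is an assumption for demonstration, not the tagger used by PANKOW. The names `candidate_proper_nouns`, `CAPITALIZED_RUN`, and `LEADING_FUNCTION_WORDS` are hypothetical.

```python
import re

# Crude stand-in for a POS tagger (an assumption, not PANKOW's method):
# every run of capitalized words is treated as a candidate proper noun.
CAPITALIZED_RUN = re.compile(r"[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*")

# Small, hypothetical list of function words that are capitalized only
# because they start a sentence.
LEADING_FUNCTION_WORDS = {"The", "A", "An", "In", "It", "This", "We"}


def candidate_proper_nouns(text):
    """Return the set of capitalized word runs found in text."""
    candidates = set()
    for match in CAPITALIZED_RUN.finditer(text):
        words = match.group().split()
        # Drop leading function words such as a sentence-initial "The".
        while words and words[0] in LEADING_FUNCTION_WORDS:
            words.pop(0)
        if words:
            candidates.add(" ".join(words))
    return candidates


page_text = "The Hilton is a hotel in Athens. We stayed near Lake Herrick."
print(sorted(candidate_proper_nouns(page_text)))
```

A real tagger would also catch proper nouns this heuristic misses and reject capitalized common nouns, so the output here should be read as candidates only.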

12
(2) Derive hypothesis phrases
  • Input: Set of candidate proper nouns
  • Output: Set of hypothesis phrases
  • How: Insert all candidate proper nouns and
    candidate ontology concepts into the linguistic
    patterns to derive hypothesis phrases
  • Example
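
The slide's example is missing from the transcript. A minimal sketch of this step follows, assuming the linguistic patterns from the earlier slides are written as string templates. The pattern keys, the function name `hypothesis_phrases`, and the naive plural (appending "s") are assumptions made for illustration.

```python
# The patterns from the earlier slides as templates. Expanding the
# (especially|including) and (and|or) alternatives gives ten templates.
PATTERNS = {
    "HEARST1": "{concepts} such as {instance}",
    "HEARST2": "such {concepts} as {instance}",
    "HEARST3_ESPECIALLY": "{concepts}, especially {instance}",
    "HEARST3_INCLUDING": "{concepts}, including {instance}",
    "HEARST4_AND": "{instance} and other {concepts}",
    "HEARST4_OR": "{instance} or other {concepts}",
    "DEFINITE1": "the {instance} {concept}",
    "DEFINITE2": "the {concept} {instance}",
    "APPOSITION": "{instance}, a {concept}",
    "COPULA": "{instance} is a {concept}",
}


def hypothesis_phrases(instances, concepts):
    """Map each (instance, concept) pair to its phrases, keyed by pattern."""
    phrases = {}
    for instance in instances:
        for concept in concepts:
            plural = concept + "s"  # naive plural, an assumption
            phrases[(instance, concept)] = {
                name: template.format(
                    instance=instance, concept=concept, concepts=plural
                )
                for name, template in PATTERNS.items()
            }
    return phrases


for name, phrase in hypothesis_phrases(["Hilton"], ["hotel"])[("Hilton", "hotel")].items():
    print(name, "->", phrase)
```

The sketch ignores irregular plurals and the a/an distinction, both of which a real implementation would need to handle.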

13
(3) Query Google DB to get the number of hits for
each hypothesis phrase
  • Input: Set of hypothesis phrases
  • Output: The number of hits for each hypothesis
    phrase
  • How: Google is queried for the hypothesis
    phrases via the Google Web Service API
  • More about the Google Web API:
    http://www.google.com/apis/
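
A minimal sketch of the querying step follows. It does not call the Google Web API; the search backend is passed in as a hypothetical callable `search_hits(query) -> int`, and wrapping each phrase in double quotes to request an exact-phrase match is an assumption. The hit numbers in the usage example are invented.

```python
def count_hits(phrases, search_hits):
    """Return {phrase: hits}, querying each distinct phrase only once."""
    counts = {}
    for phrase in phrases:
        if phrase not in counts:
            # Quote the phrase so the backend treats it as an exact phrase
            # (assumed query syntax).
            counts[phrase] = search_hits('"' + phrase + '"')
    return counts


# Usage with a fake backend; the numbers are made up for illustration.
fake_index = {'"the Hilton hotel"': 120, '"Hilton is a hotel"': 7}
hits = count_hits(
    ["the Hilton hotel", "Hilton is a hotel", "the Hilton hotel"],
    lambda query: fake_index.get(query, 0),
)
print(hits)
```

Passing the backend in as a parameter keeps the counting logic independent of any particular search service.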

14
(4) Categorization
  • Input: The number of hits for each hypothesis
    phrase
  • Output: Annotated web page
  • How: The system sums up the query results to a
    total for each instance-concept pair, then
    categorizes each candidate proper noun into
    its highest-ranked concept
  • Example
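
The slide's example is not in the transcript. The summing and ranking described above can be sketched as follows, assuming hit counts are keyed by (instance, concept, pattern). The function name `categorize` and the sample numbers are invented for illustration.

```python
from collections import defaultdict


def categorize(hit_counts):
    """Sum hits per (instance, concept) and keep the top concept per instance.

    hit_counts maps (instance, concept, pattern) -> number of hits.
    Returns {instance: (best_concept, total_hits)}.
    """
    totals = defaultdict(int)
    for (instance, concept, _pattern), hits in hit_counts.items():
        totals[(instance, concept)] += hits

    best = {}
    for (instance, concept), total in totals.items():
        if instance not in best or total > best[instance][1]:
            best[instance] = (concept, total)
    return best


# Invented hit counts, for illustration only.
sample = {
    ("Hilton", "hotel", "DEFINITE1"): 120,
    ("Hilton", "hotel", "COPULA"): 7,
    ("Hilton", "city", "DEFINITE1"): 3,
    ("Athens", "city", "DEFINITE2"): 40,
}
print(categorize(sample))
```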

15
2 Determine the best categorization
  • Baseline
  • Linear weighting
  • Interactive Selection: Return the best n matches
    for each proper noun
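
The three strategies can be sketched in one place, under these assumptions: the baseline counts every pattern equally, linear weighting multiplies each pattern's hits by a per-pattern weight before summing, and interactive selection returns the n highest-scoring concepts. The slide does not say how the weights are chosen, so the weights below are invented, as are the function names.

```python
def weighted_totals(hit_counts, weights=None):
    """Score each (instance, concept) as a weighted sum of pattern hits.

    With weights=None every pattern counts equally (the baseline).
    """
    weights = weights or {}
    totals = {}
    for (instance, concept, pattern), hits in hit_counts.items():
        key = (instance, concept)
        totals[key] = totals.get(key, 0.0) + weights.get(pattern, 1.0) * hits
    return totals


def top_n(totals, n):
    """Return, per instance, its n highest-scoring concepts, best first."""
    per_instance = {}
    for (instance, concept), score in totals.items():
        per_instance.setdefault(instance, []).append((score, concept))
    return {
        instance: [c for _, c in sorted(pairs, key=lambda p: -p[0])[:n]]
        for instance, pairs in per_instance.items()
    }


# Invented hit counts and weights, for illustration only.
counts = {
    ("Hilton", "hotel", "COPULA"): 10,
    ("Hilton", "city", "DEFINITE1"): 30,
    ("Hilton", "person", "HEARST1"): 5,
}
print(top_n(weighted_totals(counts), 1))                  # baseline
print(top_n(weighted_totals(counts, {"COPULA": 5.0}), 2)) # weighted, n = 2
```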

16
3 Evaluation
  • 30 web pages to annotate, 277 proper nouns,
    59 concepts, and the 10 patterns mentioned
    earlier
  • Two human subjects (A and B) were asked to
    annotate the pages manually
  • The system results are compared with the manual
    annotation results

17
Results obtained by the system (top 60
instance-concept relations)
18
Evaluation measures
  • Reference standards (Results produced by Human A
    and Human B)

19
Evaluation measures
  • Define precision and accuracy
  • y ∈ {A, B}
  • Compute separately for A and B, then average the
    results to obtain the final precision and accuracy
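
The formulas on this slide did not survive in the transcript. The sketch below therefore rests on assumed definitions: precision is the share of the system's answers that agree with annotator y, and accuracy is the share of all instances annotated by y that the system categorized correctly. Function names and sample data are invented.

```python
def precision_and_accuracy(system, reference):
    """Compare system answers against one annotator's reference.

    Both arguments map instance -> concept. The definitions used here are
    assumptions; the slide's original formulas are not available.
    """
    answered = [i for i in system if i in reference]
    correct = sum(1 for i in answered if system[i] == reference[i])
    precision = correct / len(answered) if answered else 0.0
    accuracy = correct / len(reference) if reference else 0.0
    return precision, accuracy


def averaged_scores(system, reference_a, reference_b):
    """Score against A and B separately, then average the two results."""
    p_a, acc_a = precision_and_accuracy(system, reference_a)
    p_b, acc_b = precision_and_accuracy(system, reference_b)
    return (p_a + p_b) / 2, (acc_a + acc_b) / 2


# Invented annotations, for illustration only.
system = {"Hilton": "hotel", "Athens": "city"}
human_a = {"Hilton": "hotel", "Athens": "town", "Georgia": "state"}
human_b = {"Hilton": "hotel", "Athens": "city"}
print(averaged_scores(system, human_a, human_b))
```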

21
Interactive Selection
  • Instead of returning a unique concept for each
    instance, it returns the best n concepts to end
    users.
  • The results are good: the best accuracy is
    49.56%
  • This means that for almost half of the instances
    in a web page, the system can provide the user
    the correct answer (concept)

23
4 Conclusion
  • A simple but smart idea for categorization
  • A relatively effective approach to automatic
    annotation
  • Relies heavily on Google
  • It seems we can do more than what this paper did

24
My Opinions
  • It only takes advantage of the number of hits,
    but the Google Web API can also return an
    abstract for the phrase, which provides context.
    How can these contexts be made useful?
  • It only considers the is-a relationship. Is it
    possible to find patterns for other
    relationships?
  • Is there any way to generate these patterns
    automatically?

25
Questions?