1
Introduction to Web Science
  • Information Extraction for the Semantic Web

2
Six challenges of the Knowledge Life Cycle
  • Acquire
  • Model
  • Reuse
  • Retrieve
  • Publish
  • Maintain

3
What is Text Mining?
  • Text mining is about knowledge discovery from
    large collections of unstructured text.
  • It's not the same as data mining, which is more
    about discovering patterns in structured data
    stored in databases.
  • Information extraction (IE) is a major component
    of text mining.
  • IE is about extracting facts and structured
    information from unstructured text.

4
IE is not IR
IR pulls documents from large text collections
(usually the Web) in response to specific
keywords or queries. You analyse the documents.
IE pulls facts and structured information from
the content of large text collections. You
analyse the facts.
5
The Challenge of Web Science
  • The Web requires machine processable,
    repurposable data to complement hypertext
  • Such metadata can be divided into two types of
    information: explicit and implicit. IE is mainly
    concerned with implicit (semantic) metadata.
  • More on this later

6
IE by example (1)
  • the seminar at 4 pm will ...
  • How can we learn a rule to extract the seminar
    time?

7
IE by example (2)
8
IE by example (3)
9
Adaptive Information Extraction
  • IE
  • Systems capable of extracting information
  • AIE
  • Same as IE
  • But treats the usability and accessibility of
    the system as important
  • Makes it easy to port the system to new domains
  • Exploits machine learning

10
What is adaptable?
  • New domain information
  • Based upon an ontology which can change
  • Different sub-language features
  • POS, Noun chunks, etc
  • Different text genres
  • Free text, structured, semi-structured, etc
  • Different types
  • Text, String, Date, Name, etc

11
Shallow vs Deep Approaches
  • Shallow approach
  • Uses syntax primarily
  • Tokenisation, POS, etc.
  • Deep approach
  • Uses syntactic information
  • Uses semantics (Named entity, etc)
  • Heuristics (world knowledge, e.g. a brother is male)
  • Additional knowledge

12
Single vs Multi-Slot
  • Single
  • Extract one element at a time
  • The seminar is at 4pm.
  • Multi-Slot
  • Extract several concepts simultaneously
  • Tom is the brother of Mary.
  • Brother(Tom, Mary)
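A minimal sketch of the difference, assuming simple regular-expression rules (hypothetical, not a real multi-slot learner): the multi-slot rule binds both arguments of the Brother relation in a single match.

```python
import re

# Multi-slot extraction: one rule fills both slots of Brother(X, Y) at once.
BROTHER_RULE = re.compile(r"(\w+) is the brother of (\w+)")

def extract_brother(sentence):
    m = BROTHER_RULE.search(sentence)
    return ("Brother", m.group(1), m.group(2)) if m else None

print(extract_brother("Tom is the brother of Mary."))
# -> ('Brother', 'Tom', 'Mary')
```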

13
Batch vs Incremental Learners
  • Batch
  • Examples are collected
  • The system is trained on the examples
  • Simpler
  • Incremental
  • Add one rule at a time
  • Evaluate that rule
  • More complex
  • Must be careful about local maxima

14
Interactive vs Non-Interactive
  • Interactive
  • Use an oracle to verify and validate results
  • An oracle can be a person or a simple program
  • Non-Interactive
  • Use the training provided by the users only
  • Evaluation by 10-fold cross-validation
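A minimal sketch of that evaluation with scikit-learn (an assumption; the slides do not name a toolkit). The toy sentences, labels and Naive Bayes classifier are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB

# Non-interactive: train only on the annotations the users provided,
# then estimate performance with 10-fold cross-validation (toy data).
texts = ["the seminar at 4 pm", "lunch is at noon"] * 10
labels = [1, 0] * 10          # 1 = sentence mentions a seminar time

X = CountVectorizer().fit_transform(texts)
scores = cross_val_score(MultinomialNB(), X, labels, cv=10)
print(scores.mean())
```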

15
Top-Down vs Bottom-Up
  • Top-Down
  • Starts from a generic rule and specialises it
  • Bottom-Up
  • Starts from a specific rule and relaxes it

16
Top-Down
17
Bottom-Up
18
Generalisation task
  • The process of generating generic rules from
    domain-specific data
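A toy sketch of bottom-up generalisation (the token-tuple representation and wildcard scheme are assumptions): two specific rules are merged into a more generic one by replacing the positions where they differ with a wildcard.

```python
# Merge two specific rules into a more generic one: keep the tokens
# they share, replace the positions where they differ with a wildcard.
def generalise(rule_a, rule_b):
    assert len(rule_a) == len(rule_b)
    return tuple(a if a == b else "*" for a, b in zip(rule_a, rule_b))

r1 = ("seminar", "at", "4", "pm")
r2 = ("seminar", "at", "11", "am")
print(generalise(r1, r2))  # -> ('seminar', 'at', '*', '*')
```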

19
Overfitting vs Underfitting
  • Underfitting
  • When the learner fails to capture the full
    underlying model
  • Produces excessive bias
  • Overfitting
  • When the learner fits the model and the noise

20
Text mining stages
  • Document selection and filtering (IR techniques)
  • Document pre-processing (NLP techniques)
  • Document processing (NLP / ML / statistical
    techniques)

21
Stages of document processing
  • Document selection involves identification and
    retrieval of potentially relevant documents from
    a large set (e.g. the web) in order to reduce the
    search space. Standard or semantically-enhanced
    IR techniques can be used for this.
  • Document pre-processing involves cleaning and
    preparing the documents, e.g. removal of
    extraneous information, error correction,
    spelling normalisation, tokenisation, POS
    tagging, etc.
  • Document processing consists mainly of
    information extraction

22
Metadata extraction
  • Metadata extraction consists of two types
  • Explicit metadata extraction involves information
    describing the document, such as that contained
    in the header information of HTML documents
    (titles, abstracts, authors, creation date, etc.)
  • Implicit metadata extraction involves semantic
    information deduced from the material itself,
    i.e. endogenous information such as names of
    entities and relations contained in the text.
    This essentially involves Information Extraction
    techniques, often with the help of an ontology.

23
IE for Document Access
  • With traditional query engines, getting the facts
    can be hard and slow
  • Where has the President visited in the last year?
  • Which places in Europe have had cases of Bird
    Flu?
  • Which search terms would you use to get this kind
    of information?
  • How can you specify you want someone's home page?
  • IE returns information in a structured way
  • IR returns documents containing the relevant
    information somewhere (if you're lucky)

24
IE as an alternative to IR
  • IE returns knowledge at a much deeper level than
    traditional IR
  • Constructing a database through IE and linking it
    back to the documents can provide a valuable
    alternative search tool.
  • Even if results are not always accurate, they can
    be valuable if linked back to the original text

25
Some example applications
  • HaSIE
  • KIM
  • Threat Trackers

26
HaSIE
  • Aims to find out how companies report health
    and safety information
  • Answers questions such as
  • How many members of staff died or had accidents
    in the last year?
  • Is there anyone responsible for health and
    safety?
  • What measures have been put in place to improve
    health and safety in the workplace?

27
HaSIE
  • Identification of such information is too
    time-consuming and arduous to be done manually
  • IR systems can't cope with this because they
    return whole documents, which could be hundreds
    of pages
  • The system identifies relevant sections of each
    document, pulls out sentences about health and
    safety issues, and populates a database with
    relevant information

28
HaSIE
29
KIM
  • KIM is a software platform developed by Ontotext
    for semantic annotation of text.
  • KIM performs automatic ontology population and
    semantic annotation for Semantic Web and KM
    applications
  • Indexing and retrieval (an IE-enhanced search
    technology)
  • Query and exploration of formal knowledge

30
KIM
Ontotext's KIM query and results
31
Threat Tracker
  • Application developed by Alias-I which finds and
    relates information in documents
  • Intended for use by Information Analysts who use
    unstructured news feeds and standing collections
    as sources
  • Used by DARPA for tracking possible information
    about terrorists etc.
  • Identification of entities, aliases, relations
    etc. enables you to build up chains of related
    people and things

32
Threat Tracker
33
What is Named Entity Recognition?
  • Identification of proper names in texts, and
    their classification into a set of predefined
    categories of interest
  • Persons
  • Organisations (companies, government
    organisations, committees, etc)
  • Locations (cities, countries, rivers, etc)
  • Date and time expressions
  • Various other types as appropriate
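A quick way to try NE recognition today is spaCy, one widely used open-source toolkit (an assumption; the deck itself uses GATE/ANNIE, introduced later). The model name assumes the small English model has been downloaded.

```python
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Ms. Gina Torretta took the helm at BNC Holdings Inc. "
          "in Cambridge, UK on 19 February 1993.")

for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. 'Gina Torretta' PERSON, 'BNC Holdings Inc.' ORG, 'UK' GPE, ...
```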

34
Why is NE important?
  • NE provides a foundation from which to build more
    complex IE systems
  • Relations between NEs can provide tracking,
    ontological information and scenario building
  • Tracking (co-reference): Dr Head, John, he

35
Two kinds of approaches
  • Knowledge Engineering
  • rule based
  • developed by experienced language engineers
  • make use of human intuition
  • require only a small amount of training data
  • development can be very time consuming
  • some changes may be hard to accommodate
  • Learning Systems
  • use statistics or other machine learning
  • developers do not need expertise
  • require large amounts of annotated training data
  • some changes may require re-annotation of the
    entire training corpus

36
Typical NE pipeline
  • Pre-processing (tokenisation, sentence splitting,
    morphological analysis, POS tagging)
  • Entity finding (gazetteer lookup, NE grammars)
  • Coreference (alias finding, orthographic
    coreference etc.)
  • Export to database / XML
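The same stages can be sketched with NLTK (an assumption; ANNIE implements this pipeline inside GATE). Coreference and export are omitted here.

```python
import nltk

# Pre-processing, then entity finding; needs the 'punkt',
# 'averaged_perceptron_tagger', 'maxent_ne_chunker' and 'words'
# NLTK data packages to be installed.
sentence = "Ms. Gina Torretta took the helm at BNC Holdings Inc."

tokens = nltk.word_tokenize(sentence)   # tokenisation
tagged = nltk.pos_tag(tokens)           # POS tagging
tree = nltk.ne_chunk(tagged)            # NE finding (PERSON, ORGANIZATION, ...)
print(tree)
```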

37
GATE and ANNIE
  • GATE (General Architecture for Text
    Engineering) is a framework for language
    processing
  • ANNIE (A Nearly New Information Extraction
    system) is a suite of language processing tools,
    which provides NE recognition
  • GATE also includes
  • plugins for language processing, e.g. parsers,
    machine learning tools, stemmers, IR tools, IE
    components for various languages etc.
  • tools for visualising and manipulating ontologies
  • ontology-based information extraction tools
  • evaluation and benchmarking tools

38
GATE
39
Information Extraction for the Semantic Web
  • Traditional IE is based on a flat structure, e.g.
    recognising Person, Location, Organisation, Date,
    Time etc.
  • For the Semantic Web, we need information in a
    hierarchical structure
  • Idea is that we attach semantic metadata to the
    documents, pointing to concepts in an ontology
  • Information can be exported as an ontology
    annotated with instances, or as text annotated
    with links to the ontology
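A minimal sketch of that export with rdflib: extracted instances become RDF triples pointing at ontology concepts. The namespace, concept and property names here are hypothetical; only the document ID comes from the exercise text later in the deck.

```python
from rdflib import Graph, Literal, Namespace, RDF

# Export extracted instances as ontology-linked metadata (Turtle).
EX = Namespace("http://example.org/ontology#")   # hypothetical ontology
g = Graph()
g.bind("ex", EX)

g.add((EX.RussellTLewis, RDF.type, EX.Person))   # instance -> concept
g.add((EX.RussellTLewis, EX.presidentOf, EX.NewYorkTimesCo))
g.add((EX.RussellTLewis, EX.mentionedIn, Literal("wsj93_050.0203")))

print(g.serialize(format="turtle"))
```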

40
Richer NE Tagging
  • Attachment of instances in the text to concepts
    in the domain ontology
  • Disambiguation of instances, e.g. Cambridge, MA
    vs Cambridge, UK

41
Magpie
  • Developed by the Open University
  • Plugin for standard web browser
  • Automatically associates an ontology-based
    semantic layer to web resources, allowing
    relevant services to be linked
  • Provides the means for structured and informed
    exploration of web resources
  • e.g. looking at a list of publications, we can
    find information about an author such as projects
    they work on, other people they work with, etc.

42
Magpie in Action (1)
43
Magpie in Action (2)
44
Magpie in Action (3)
45
Evaluation metrics and tools
  • Evaluation metrics mathematically define how to
    measure the system's performance against a
    human-annotated gold standard
  • Scoring program implements the metric and
    provides performance measures
  • for each document and over the entire corpus
  • for each type of NE
  • may also evaluate changes over time
  • A gold standard reference set also needs to be
    provided; this may be time-consuming to produce
  • Visualisation tools show the results graphically
    and enable easy comparison

46
Methods of evaluation
  • Traditional IE is evaluated in terms of Precision
    and Recall
  • Precision - how accurate were the answers the
    system produced?
  • correct answers/answers produced
  • Recall - how good was the system at finding
    everything it should have found?
  • correct answers/total possible correct answers
  • There is usually a tradeoff between precision and
    recall, so a weighted average of the two
    (F-measure) is generally also used.
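The three measures in code, directly from the definitions above (F-measure shown in its balanced form, the harmonic mean; the example counts are made up):

```python
# correct  = correct answers the system produced,
# produced = total answers the system produced,
# possible = total possible correct answers in the gold standard.
def evaluate(correct, produced, possible):
    precision = correct / produced
    recall = correct / possible
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

print(evaluate(correct=80, produced=100, possible=120))
# -> (0.8, 0.666..., 0.727...)
```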

47
Metrics for Richer IE
  • Precision and Recall are not sufficient for
    ontology-based IE, because the distinction
    between right and wrong is less obvious
  • Recognising a Person as a Location is clearly
    wrong, but recognising a Research Assistant as a
    Lecturer is not so wrong
  • Similarity metrics need to be integrated
    additionally, such that items closer together in
    the hierarchy are given a higher score, if wrong
  • Also possible is a cost-based approach, where
    different weights can be given to each concept in
    the hierarchy, and to different types of error,
    and combined to form a single score
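A sketch of such a similarity-based score, under an assumed toy hierarchy and an assumed decay factor of 0.5 per edge: a wrong answer close to the correct concept in the hierarchy still earns partial credit.

```python
# Toy concept hierarchy (child -> parent); an assumption for illustration.
PARENT = {"Lecturer": "AcademicStaff", "ResearchAssistant": "AcademicStaff",
          "AcademicStaff": "Person", "Person": None, "Location": None}

def depth_to(concept, ancestor):
    # Number of edges from concept up to ancestor, or None if unreachable.
    d = 0
    while concept is not None:
        if concept == ancestor:
            return d
        concept, d = PARENT[concept], d + 1
    return None

def similarity(predicted, correct):
    # Partial credit decays by 0.5 per edge via the closest common ancestor.
    best = 0.0
    for ancestor in PARENT:
        d1, d2 = depth_to(predicted, ancestor), depth_to(correct, ancestor)
        if d1 is not None and d2 is not None:
            best = max(best, 0.5 ** (d1 + d2))
    return best

print(similarity("ResearchAssistant", "Lecturer"))  # 0.25 - not so wrong
print(similarity("Location", "Lecturer"))           # 0.0  - clearly wrong
```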

48
Visualisation of Results
  • Cluster Map example
  • Traditionally used to show documents classified
    according to topic
  • Here shows instances classified according to
    concept
  • Enables analysis, comparison and querying of
    results

49
The principle: Venn diagrams
Documents classified according to topic
50
Jobs by region
Instances classified by concept
51
Concept distribution
Shows the relative importance of different
concepts
52
Correct and incorrect instances attached to
concepts
53
Why is IE difficult?
  • BNC Holdings Inc named Ms G Torretta as its new
    chairman.
  • Nicholas Andrews was succeeded by Gina Torretta
    as chairman of BNC Holdings Inc.
  • Ms. Gina Torretta took the helm at BNC Holdings
    Inc.
  • Hint: what are they referring to?

54
Try IE yourself ... (1)
  • Given a particular text ...
  • Find all the successions ...
  • Hint: there are 6 including the one below
  • Hint: we do not have complete information
  • E.g.
  • <SUCCESSION-1>
  • ORGANIZATION "New York Times"
  • POST "president"
  • WHO_IS_IN "Russell T. Lewis"
  • WHO_IS_OUT "Lance R. Primis"

55
  • <DOC>
  • <DOCID> wsj93_050.0203 </DOCID>
  • <DOCNO> 930219-0013. </DOCNO>
  • <HL> Marketing Brief @ Noted.... </HL>
  • <DD> 02/19/93 </DD>
  • <SO> WALL STREET JOURNAL (J), PAGE B5 </SO>
  • <CO> NYTA </CO>
  • <IN> MEDIA (MED), PUBLISHING (PUB) </IN>
  • <TXT>
  • <p> New York Times Co. named Russell T. Lewis,
    45, president and general manager of its flagship
    New York Times newspaper, responsible for all
    business-side activities. He was executive vice
    president and deputy general manager. He succeeds
    Lance R. Primis, who in September was named
    president and chief operating officer of the
    parent.
  • </p>
  • </TXT>
  • </DOC>

56
Answer (1)
  • <SUCCESSION-2>
  • ORGANIZATION "New York Times"
  • POST "general manager"
  • WHO_IS_IN "Russell T. Lewis"
  • WHO_IS_OUT "Lance R. Primis"
  • <SUCCESSION-3>
  • ORGANIZATION "New York Times"
  • POST "executive vice president"
  • WHO_IS_IN
  • WHO_IS_OUT "Russell T. Lewis"

57
Answer (2)
  • <SUCCESSION-4>
  • ORGANIZATION "New York Times"
  • POST "deputy general manager"
  • WHO_IS_IN
  • WHO_IS_OUT "Russell T. Lewis"
  • <SUCCESSION-5>
  • ORGANIZATION "New York Times Co."
  • POST "president"
  • WHO_IS_IN "Lance R. Primis"
  • WHO_IS_OUT

58
Answer (3)
  • <SUCCESSION-6>
  • ORGANIZATION "New York Times Co."
  • POST "chief operating officer"
  • WHO_IS_IN "Lance R. Primis"
  • WHO_IS_OUT

59
Questions?