Title: Introduction to Web Science
1. Introduction to Web Science
- Information Extraction for the Semantic Web
2. Six challenges of the Knowledge Life Cycle
- Acquire
- Model
- Reuse
- Retrieve
- Publish
- Maintain
3. What is Text Mining?
- Text mining is about knowledge discovery from large collections of unstructured text.
- It is not the same as data mining, which is more about discovering patterns in structured data stored in databases.
- Information extraction (IE) is a major component of text mining.
- IE is about extracting facts and structured information from unstructured text.
4. IE is not IR
IR pulls documents from large text collections
(usually the Web) in response to specific
keywords or queries. You analyse the documents.
IE pulls facts and structured information from
the content of large text collections. You
analyse the facts.
5. Challenges of Web Science
- The Web requires machine-processable, repurposable data to complement hypertext
- Such metadata can be divided into two types of information: explicit and implicit. IE is mainly concerned with implicit (semantic) metadata.
- More on this later
6. IE by example (1)
- the seminar at 4 pm will ...
- How can we learn a rule to extract the seminar
time?
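One such rule could be a simple pattern over the surface text. The sketch below (the pattern and function name are my own, purely illustrative) extracts a seminar-time expression of the kind shown above:

```python
import re

# A hand-written extraction rule for seminar times: a 1-2 digit hour,
# optional ":minutes", then am/pm. Real IE rules would also use context
# (e.g. the word "seminar" nearby), omitted here for brevity.
TIME_RULE = re.compile(r"\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b", re.IGNORECASE)

def extract_time(sentence):
    """Return the first time expression found, or None."""
    m = TIME_RULE.search(sentence)
    return m.group(0) if m else None

print(extract_time("the seminar at 4 pm will cover IE"))  # -> 4 pm
```

Learning such a rule automatically, rather than writing it by hand, is exactly what the adaptive approaches below aim at.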
7. IE by example (2)
8. IE by example (3)
9. Adaptive Information Extraction
- IE
- Systems capable of extracting information
- AIE
- Same as IE
- But considers the usability and accessibility of a system to be important
- Makes it easy to port the system to new domains
- Exploits machine learning
10. What is adaptable?
- New domain information
- Based upon an ontology which can change
- Different sub-language features
- POS, Noun chunks, etc
- Different text genres
- Free text, structured, semi-structured, etc
- Different types
- Text, String, Date, Name, etc
11. Shallow vs Deep Approaches
- Shallow approach
- Uses syntax primarily
- Tokenisation, POS, etc.
- Deep approach
- Uses syntactic information
- Uses semantics (Named entity, etc)
- Heuristics (world knowledge, e.g. a brother is male)
- Additional knowledge
12. Single vs Multi-Slot
- Single
- Extract one element at a time
- The seminar is at 4pm.
- Multi Slot
- Extract several concepts simultaneously
- Tom is the brother of Mary.
- Brother(Tom, Mary)
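A multi-slot rule fills several slots of a relation from one sentence in a single match. The sketch below (pattern and names are illustrative, not from any particular IE system) does this for the Brother relation above:

```python
import re

# Multi-slot extraction: one rule captures both arguments of the
# Brother relation simultaneously.
BROTHER_RULE = re.compile(r"(\w+) is the brother of (\w+)")

def extract_brother(sentence):
    """Return ("Brother", X, Y) for "X is the brother of Y", else None."""
    m = BROTHER_RULE.search(sentence)
    return ("Brother", m.group(1), m.group(2)) if m else None

print(extract_brother("Tom is the brother of Mary."))
# -> ('Brother', 'Tom', 'Mary')
```

A single-slot learner would instead need two separate rules, one per argument, and a later step to pair them up.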
13. Batch vs Incremental Learners
- Batch
- Examples are collected
- The system is trained on the examples
- Simpler
- Incremental
- Adds one rule at a time to the rule set
- Evaluate that rule
- More complex
- Must be careful about local maxima
14. Interactive vs Non-Interactive
- Interactive
- Use an oracle to verify and validate results
- An oracle can be a person or a simple program
- Non-Interactive
- Use the training provided by the users only
- e.g. 10-fold cross-validation
15. Top-Down vs Bottom-Up
- Top-Down
- Starts from a generic rule and specialises it
- Bottom-Up
- Starts from a specific rule and relaxes it
16. Top-Down
17. Bottom-Up
18. Generalisation task
- The process of generating generic rules from
domain specific data
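A minimal sketch of bottom-up generalisation, under the simplifying assumption that rules are token sequences of equal length: two specific examples are merged by replacing every position where they differ with a wildcard.

```python
# Bottom-up generalisation (toy version): merge two specific
# token-sequence rules into one more generic pattern.
def generalise(rule_a, rule_b):
    """Keep tokens the rules agree on; wildcard the rest."""
    return ["*" if a != b else a for a, b in zip(rule_a, rule_b)]

r1 = "the seminar at 4 pm".split()
r2 = "the seminar at 3 pm".split()
print(generalise(r1, r2))  # -> ['the', 'seminar', 'at', '*', 'pm']
```

Real learners generalise more carefully (e.g. to POS tags or ontology classes rather than a bare wildcard), but the direction is the same: from domain-specific examples towards generic rules.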
19. Overfitting vs Underfitting
- Underfitting
- When the learner does not manage to detect the full underlying model
- Produces excessive bias
- Overfitting
- When the learner fits the model and the noise
20. Text mining stages
- Document selection and filtering (IR techniques)
- Document pre-processing (NLP techniques)
- Document processing (NLP / ML / statistical
techniques)
21. Stages of document processing
- Document selection involves identification and retrieval of potentially relevant documents from a large set (e.g. the Web) in order to reduce the search space. Standard or semantically-enhanced IR techniques can be used for this.
- Document pre-processing involves cleaning and preparing the documents, e.g. removal of extraneous information, error correction, spelling normalisation, tokenisation, POS tagging, etc.
- Document processing consists mainly of information extraction.
22. Metadata extraction
- Metadata extraction consists of two types:
- Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.)
- Implicit metadata extraction involves semantic information deduced from the material itself, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
23. IE for Document Access
- With traditional query engines, getting the facts can be hard and slow
- Where has the President visited in the last year?
- Which places in Europe have had cases of Bird Flu?
- Which search terms would you use to get this kind of information?
- How can you specify that you want someone's home page?
- IE returns information in a structured way
- IR returns documents containing the relevant information somewhere (if you're lucky)
24. IE as an alternative to IR
- IE returns knowledge at a much deeper level than traditional IR
- Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool.
- Even if results are not always accurate, they can be valuable if linked back to the original text
25. Some example applications
- HaSIE
- KIM
- Threat Trackers
26. HaSIE
- Aims to find out how companies report health and safety information
- Answers questions such as:
- How many members of staff died or had accidents in the last year?
- Is there anyone responsible for health and safety?
- What measures have been put in place to improve health and safety in the workplace?
27. HaSIE
- Identification of such information is too time-consuming and arduous to be done manually
- IR systems can't cope with this because they return whole documents, which could be hundreds of pages
- The system identifies relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with relevant information
28. HaSIE
29. KIM
- KIM is a software platform developed by Ontotext for semantic annotation of text.
- KIM performs automatic ontology population and semantic annotation for Semantic Web and KM applications:
- Indexing and retrieval (an IE-enhanced search technology)
- Query and exploration of formal knowledge
30. KIM
Ontotext's KIM query and results
31. Threat Tracker
- An application developed by Alias-i which finds and relates information in documents
- Intended for use by information analysts who use unstructured news feeds and standing collections as sources
- Used by DARPA for tracking possible information about terrorists etc.
- Identification of entities, aliases, relations etc. enables you to build up chains of related people and things
32. Threat Tracker
33. What is Named Entity Recognition?
- Identification of proper names in texts, and their classification into a set of predefined categories of interest:
- Persons
- Organisations (companies, government organisations, committees, etc.)
- Locations (cities, countries, rivers, etc.)
- Date and time expressions
- Various other types as appropriate
34. Why is NE important?
- NE provides a foundation from which to build more complex IE systems
- Relations between NEs can provide tracking, ontological information and scenario building
- Tracking (co-reference): "Dr Head", "John", "he"
35. Two kinds of approaches
- Knowledge Engineering
- rule based
- developed by experienced language engineers
- make use of human intuition
- require only small amount of training data
- development can be very time consuming
- some changes may be hard to accommodate
- Learning Systems
- use statistics or other machine learning
- developers do not need expertise
- require large amounts of annotated training data
- some changes may require re-annotation of the
entire training corpus
36. Typical NE pipeline
- Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
- Entity finding (gazetteer lookup, NE grammars)
- Coreference (alias finding, orthographic coreference, etc.)
- Export to database / XML
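The gazetteer-lookup step of the pipeline can be sketched very simply: match tokens against lists of known entity names. The entity lists and function below are toy examples I made up for illustration, not a real gazetteer.

```python
# Toy gazetteer: known names mapped to their NE category.
GAZETTEER = {
    "London": "Location",
    "Cambridge": "Location",
    "Ontotext": "Organisation",
    "DARPA": "Organisation",
}

def tag_entities(text):
    """Naive whitespace tokenisation, then tag tokens found in the gazetteer."""
    annotations = []
    for token in text.replace(",", " ").split():
        if token in GAZETTEER:
            annotations.append((token, GAZETTEER[token]))
    return annotations

print(tag_entities("DARPA funded work in London"))
# -> [('DARPA', 'Organisation'), ('London', 'Location')]
```

A real pipeline would combine this lookup with NE grammars over the pre-processing annotations (POS tags, sentence boundaries) to catch names not in any list.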
37. GATE and ANNIE
- GATE (General Architecture for Text Engineering) is a framework for language processing
- ANNIE (A Nearly-New Information Extraction system) is a suite of language processing tools which provides NE recognition
- GATE also includes:
- plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages, etc.
- tools for visualising and manipulating ontologies
- ontology-based information extraction tools
- evaluation and benchmarking tools
38. GATE
39. Information Extraction for the Semantic Web
- Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time, etc.
- For the Semantic Web, we need information in a hierarchical structure
- The idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology
- Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology
40. Richer NE Tagging
- Attachment of instances in the text to concepts in the domain ontology
- Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
41. Magpie
- Developed by the Open University
- Plugin for a standard web browser
- Automatically associates an ontology-based semantic layer to web resources, allowing relevant services to be linked
- Provides the means for a structured and informed exploration of web resources
- e.g. looking at a list of publications, we can find information about an author such as projects they work on, other people they work with, etc.
42. Magpie in Action (1)
43. Magpie in Action (2)
44. Magpie in Action (3)
45. Evaluation metrics and tools
- Evaluation metrics mathematically define how to measure the system's performance against a human-annotated gold standard
- A scoring program implements the metric and provides performance measures:
- for each document and over the entire corpus
- for each type of NE
- may also evaluate changes over time
- A gold standard reference set also needs to be provided; this may be time-consuming to produce
- Visualisation tools show the results graphically and enable easy comparison
46. Methods of evaluation
- Traditional IE is evaluated in terms of Precision and Recall
- Precision: how accurate were the answers the system produced?
- correct answers / answers produced
- Recall: how good was the system at finding everything it should have found?
- correct answers / total possible correct answers
- There is usually a tradeoff between precision and recall, so a weighted average of the two (F-measure) is generally also used.
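The three measures above can be computed directly from the counts. The sketch below uses the standard F-measure formula with a weight beta (beta = 1 gives the usual harmonic mean); the function name is my own.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """IE evaluation: correct = right answers the system gave,
    produced = all answers it gave, possible = gold-standard answers."""
    p = correct / produced if produced else 0.0   # precision
    r = correct / possible if possible else 0.0   # recall
    if p + r == 0:
        return p, r, 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r)  # weighted F-measure
    return p, r, f

# e.g. 10 answers produced, 8 correct, 16 in the gold standard:
p, r, f = precision_recall_f(correct=8, produced=10, possible=16)
print(p, r, f)  # -> 0.8 0.5 and F1 ~ 0.615
```

The tradeoff is visible here: producing fewer, safer answers raises precision but lowers recall, and F-measure penalises either extreme.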
47. Metrics for Richer IE
- Precision and Recall are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious
- Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong
- Similarity metrics additionally need to be integrated, such that items closer together in the hierarchy are given a higher score, if wrong
- Also possible is a cost-based approach, where different weights can be given to each concept in the hierarchy, and to different types of error, and combined to form a single score
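One possible similarity metric of this kind (my own toy formulation over an invented mini-ontology, not a standard measure) scores a wrong label higher the closer it sits to the gold concept in the hierarchy:

```python
# Toy ontology: child concept -> parent concept.
HIERARCHY = {
    "Lecturer": "AcademicStaff",
    "ResearchAssistant": "AcademicStaff",
    "AcademicStaff": "Person",
    "Person": "Entity",
    "City": "Location",
    "Location": "Entity",
}

def ancestors(concept):
    """Chain from the concept up to the root, inclusive."""
    chain = [concept]
    while concept in HIERARCHY:
        concept = HIERARCHY[concept]
        chain.append(concept)
    return chain

def similarity(gold, predicted):
    """1.0 for an exact match; otherwise penalise by the number of
    hierarchy steps to the closest common ancestor."""
    if gold == predicted:
        return 1.0
    a, b = ancestors(gold), ancestors(predicted)
    common = next(c for c in a if c in b)
    dist = a.index(common) + b.index(common)
    return max(0.0, 1.0 - dist / (len(a) + len(b)))

# Mistaking a Lecturer for a ResearchAssistant scores better than
# mistaking a Lecturer for a City:
print(similarity("Lecturer", "ResearchAssistant"))
print(similarity("Lecturer", "City"))
```

A cost-based variant would replace the uniform per-step penalty with per-concept and per-error-type weights, as the slide suggests.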
48. Visualisation of Results
- Cluster Map example
- Traditionally used to show documents classified according to topic
- Here it shows instances classified according to concept
- Enables analysis, comparison and querying of results
49. The principle: Venn diagrams
Documents classified according to topic
50. Jobs by region
Instances classified by concept
51. Concept distribution
Shows the relative importance of different
concepts
52. Correct and incorrect instances attached to concepts
53. Why is IE difficult?
- BNC Holdings Inc named Ms G Torretta as its new chairman.
- Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.
- Ms. Gina Torretta took the helm at BNC Holdings Inc.
- Hint: what are they all referring to?
54. Try IE yourself ... (1)
- Given a particular text ...
- Find all the successions ...
- Hint: there are 6, including the one below
- Hint: we do not have complete information
- E.g.
- <SUCCESSION-1>
- ORGANIZATION "New York Times"
- POST "president"
- WHO_IS_IN "Russell T. Lewis"
- WHO_IS_OUT "Lance R. Primis"
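The succession template can be represented as a simple record; the sketch below mirrors the slide's field names in a Python dataclass (the class itself is my own illustration, not part of the original MUC task definition). Optional fields capture the "incomplete information" hint.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Succession:
    """One filled succession template; missing slots stay None."""
    organization: str
    post: str
    who_is_in: Optional[str] = None
    who_is_out: Optional[str] = None

s1 = Succession("New York Times", "president",
                who_is_in="Russell T. Lewis",
                who_is_out="Lance R. Primis")
print(s1.post)  # -> president
```

Each answer on the following slides corresponds to one such record, some with empty WHO_IS_IN or WHO_IS_OUT slots.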
55.
- <DOC>
- <DOCID> wsj93_050.0203 </DOCID>
- <DOCNO> 930219-0013. </DOCNO>
- <HL> Marketing Brief @ Noted.... </HL>
- <DD> 02/19/93 </DD>
- <SO> WALL STREET JOURNAL (J), PAGE B5 </SO>
- <CO> NYTA </CO>
- <IN> MEDIA (MED), PUBLISHING (PUB) </IN>
- <TXT>
- <p> New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent. </p>
- </TXT>
- </DOC>
56. Answer (1)
- <SUCCESSION-2>
- ORGANIZATION "New York Times"
- POST "general manager"
- WHO_IS_IN "Russell T. Lewis"
- WHO_IS_OUT "Lance R. Primis"
- <SUCCESSION-3>
- ORGANIZATION "New York Times"
- POST "executive vice president"
- WHO_IS_IN
- WHO_IS_OUT "Russell T. Lewis"
57. Answer (2)
- <SUCCESSION-4>
- ORGANIZATION "New York Times"
- POST "deputy general manager"
- WHO_IS_IN
- WHO_IS_OUT "Russell T. Lewis"
- <SUCCESSION-5>
- ORGANIZATION "New York Times Co."
- POST "president"
- WHO_IS_IN "Lance R. Primis"
- WHO_IS_OUT
58. Answer (3)
- <SUCCESSION-6>
- ORGANIZATION "New York Times Co."
- POST "chief operating officer"
- WHO_IS_IN "Lance R. Primis"
- WHO_IS_OUT
59. Questions?