Title: Introduction to Web Science
1. Introduction to Web Science
- Information Extraction for the Semantic Web
2. Six challenges of the Knowledge Life Cycle
- Acquire
- Model
- Reuse
- Retrieve
- Publish
- Maintain
3. What is Text Mining?
- Text mining is about knowledge discovery from large collections of unstructured text.
- It is not the same as data mining, which is more about discovering patterns in structured data stored in databases.
- Information extraction (IE) is a major component of text mining.
- IE is about extracting facts and structured information from unstructured text.
4. IE is not IR
IR pulls documents from large text collections
(usually the Web) in response to specific
keywords or queries. You analyse the documents.
IE pulls facts and structured information from
the content of large text collections. You
analyse the facts.
5. Challenges of Web Science
- The Web requires machine-processable, repurposable data to complement hypertext
- Such metadata can be divided into two types of information: explicit and implicit. IE is mainly concerned with implicit (semantic) metadata.
- More on this later
6. IE by example (1)
- the seminar at 4 pm will ...
- How can we learn a rule to extract the seminar
time?
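One such rule could be a simple pattern over the surface text. The sketch below (the pattern and function name are my own, purely illustrative) extracts a seminar-time expression of the kind shown above:

```python
import re

# A hand-written extraction rule for seminar times: a 1-2 digit hour,
# optional ":minutes", then am/pm. Real IE rules would also use context
# (e.g. the word "seminar" nearby), omitted here for brevity.
TIME_RULE = re.compile(r"\b(\d{1,2})(?::(\d{2}))?\s*(am|pm)\b", re.IGNORECASE)

def extract_time(sentence):
    """Return the first time expression found, or None."""
    m = TIME_RULE.search(sentence)
    return m.group(0) if m else None

print(extract_time("the seminar at 4 pm will cover IE"))  # -> 4 pm
```

Learning such a rule automatically, rather than writing it by hand, is exactly what the adaptive approaches below aim at.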
7. IE by example (2)
8. IE by example (3)
9. Adaptive Information Extraction
- IE
- Systems capable of extracting information
- AIE
- Same as IE
- But considers the usability and accessibility of a system to be important
- Makes it easy to port the system to new domains
- Exploits machine learning
10. What is adaptable?
- New domain information
- Based upon an ontology which can change
- Different sub-language features
- POS, Noun chunks, etc
- Different text genres
- Free text, structured, semi-structured, etc
- Different types
- Text, String, Date, Name, etc
11. Shallow vs Deep Approaches
- Shallow approach
- Uses syntax primarily
- Tokenisation, POS, etc.
- Deep approach
- Uses syntactic information
- Uses semantics (Named entity, etc)
- Heuristics (world knowledge, e.g. a brother is male)
- Additional knowledge
12. Single vs Multi-Slot
- Single
- Extract one element at a time
- The seminar is at 4pm.
- Multi Slot
- Extract several concepts simultaneously
- Tom is the brother of Mary.
- Brother(Tom, Mary)
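A multi-slot rule fills several slots of a relation from one sentence in a single match. The sketch below (pattern and names are illustrative, not from any particular IE system) does this for the Brother relation above:

```python
import re

# Multi-slot extraction: one rule captures both arguments of the
# Brother relation simultaneously.
BROTHER_RULE = re.compile(r"(\w+) is the brother of (\w+)")

def extract_brother(sentence):
    """Return ("Brother", X, Y) for "X is the brother of Y", else None."""
    m = BROTHER_RULE.search(sentence)
    return ("Brother", m.group(1), m.group(2)) if m else None

print(extract_brother("Tom is the brother of Mary."))
# -> ('Brother', 'Tom', 'Mary')
```

A single-slot learner would instead need two separate rules, one per argument, and a later step to pair them up.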
13. Batch vs Incremental Learners
- Batch
- Examples are collected
- The system is trained on the examples
- Simpler
- Incremental
- Adds one rule at a time to the rule set
- Evaluate that rule
- More complex
- Must be careful about local maxima
14. Interactive vs Non-Interactive
- Interactive
- Use an oracle to verify and validate results
- An oracle can be a person or a simple program
- Non-Interactive
- Use the training provided by the users only
- e.g. 10-fold cross-validation
15. Top-Down vs Bottom-Up
- Top-Down
- Starts from a generic rule and specialises it
- Bottom-Up
- Starts from a specific rule and relaxes it
16. Top-Down
17. Bottom-Up
18. Generalisation task
- The process of generating generic rules from
domain specific data
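A minimal sketch of bottom-up generalisation, under the simplifying assumption that rules are token sequences of equal length: two specific examples are merged by replacing every position where they differ with a wildcard.

```python
# Bottom-up generalisation (toy version): merge two specific
# token-sequence rules into one more generic pattern.
def generalise(rule_a, rule_b):
    """Keep tokens the rules agree on; wildcard the rest."""
    return ["*" if a != b else a for a, b in zip(rule_a, rule_b)]

r1 = "the seminar at 4 pm".split()
r2 = "the seminar at 3 pm".split()
print(generalise(r1, r2))  # -> ['the', 'seminar', 'at', '*', 'pm']
```

Real learners generalise more carefully (e.g. to POS tags or ontology classes rather than a bare wildcard), but the direction is the same: from domain-specific examples towards generic rules.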
19. Overfitting vs Underfitting
- Underfitting
- When the learner does not manage to detect the full underlying model
- Produces excessive bias
- Overfitting
- When the learner fits the model and the noise
20. Text mining stages
- Document selection and filtering (IR techniques)
- Document pre-processing (NLP techniques)
- Document processing (NLP / ML / statistical
techniques)
21. Stages of document processing
- Document selection involves identification and retrieval of potentially relevant documents from a large set (e.g. the Web) in order to reduce the search space. Standard or semantically-enhanced IR techniques can be used for this.
- Document pre-processing involves cleaning and preparing the documents, e.g. removal of extraneous information, error correction, spelling normalisation, tokenisation, POS tagging, etc.
- Document processing consists mainly of information extraction.
22. Metadata extraction
- Metadata extraction consists of two types:
- Explicit metadata extraction involves information describing the document, such as that contained in the header information of HTML documents (titles, abstracts, authors, creation date, etc.)
- Implicit metadata extraction involves semantic information deduced from the material itself, i.e. endogenous information such as names of entities and relations contained in the text. This essentially involves Information Extraction techniques, often with the help of an ontology.
23. IE for Document Access
- With traditional query engines, getting the facts can be hard and slow
- Where has the President visited in the last year?
- Which places in Europe have had cases of Bird Flu?
- Which search terms would you use to get this kind of information?
- How can you specify that you want someone's home page?
- IE returns information in a structured way
- IR returns documents containing the relevant information somewhere (if you're lucky)
24. IE as an alternative to IR
- IE returns knowledge at a much deeper level than traditional IR
- Constructing a database through IE and linking it back to the documents can provide a valuable alternative search tool.
- Even if results are not always accurate, they can be valuable if linked back to the original text
25. Some example applications
- HaSIE
- KIM
- Threat Trackers
26. HaSIE
- Aims to find out how companies report health and safety information
- Answers questions such as:
- How many members of staff died or had accidents in the last year?
- Is there anyone responsible for health and safety?
- What measures have been put in place to improve health and safety in the workplace?
27. HaSIE
- Identification of such information is too time-consuming and arduous to be done manually
- IR systems can't cope with this because they return whole documents, which could be hundreds of pages
- The system identifies relevant sections of each document, pulls out sentences about health and safety issues, and populates a database with relevant information
28. HaSIE
29. KIM
- KIM is a software platform developed by Ontotext for semantic annotation of text.
- KIM performs automatic ontology population and semantic annotation for Semantic Web and KM applications:
- Indexing and retrieval (an IE-enhanced search technology)
- Query and exploration of formal knowledge
30. KIM
Ontotext's KIM query and results
31. Threat Tracker
- An application developed by Alias-i which finds and relates information in documents
- Intended for use by information analysts who use unstructured news feeds and standing collections as sources
- Used by DARPA for tracking possible information about terrorists etc.
- Identification of entities, aliases, relations etc. enables you to build up chains of related people and things
32. Threat Tracker
33. What is Named Entity Recognition?
- Identification of proper names in texts, and their classification into a set of predefined categories of interest:
- Persons
- Organisations (companies, government organisations, committees, etc.)
- Locations (cities, countries, rivers, etc.)
- Date and time expressions
- Various other types as appropriate
34. Why is NE important?
- NE provides a foundation from which to build more complex IE systems
- Relations between NEs can provide tracking, ontological information and scenario building
- Tracking (co-reference): "Dr Head", "John", "he"
35. Two kinds of approaches
- Knowledge Engineering
- rule based
- developed by experienced language engineers
- make use of human intuition
- require only small amount of training data
- development can be very time consuming
- some changes may be hard to accommodate
- Learning Systems
- use statistics or other machine learning
- developers do not need expertise
- require large amounts of annotated training data
- some changes may require re-annotation of the
entire training corpus
36. Typical NE pipeline
- Pre-processing (tokenisation, sentence splitting, morphological analysis, POS tagging)
- Entity finding (gazetteer lookup, NE grammars)
- Coreference (alias finding, orthographic coreference, etc.)
- Export to database / XML
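The gazetteer-lookup step of the pipeline can be sketched very simply: match tokens against lists of known entity names. The entity lists and function below are toy examples I made up for illustration, not a real gazetteer.

```python
# Toy gazetteer: known names mapped to their NE category.
GAZETTEER = {
    "London": "Location",
    "Cambridge": "Location",
    "Ontotext": "Organisation",
    "DARPA": "Organisation",
}

def tag_entities(text):
    """Naive whitespace tokenisation, then tag tokens found in the gazetteer."""
    annotations = []
    for token in text.replace(",", " ").split():
        if token in GAZETTEER:
            annotations.append((token, GAZETTEER[token]))
    return annotations

print(tag_entities("DARPA funded work in London"))
# -> [('DARPA', 'Organisation'), ('London', 'Location')]
```

A real pipeline would combine this lookup with NE grammars over the pre-processing annotations (POS tags, sentence boundaries) to catch names not in any list.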
37. GATE and ANNIE
- GATE (General Architecture for Text Engineering) is a framework for language processing
- ANNIE (A Nearly-New Information Extraction system) is a suite of language processing tools which provides NE recognition
- GATE also includes:
- plugins for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages, etc.
- tools for visualising and manipulating ontologies
- ontology-based information extraction tools
- evaluation and benchmarking tools
38. GATE
39. Information Extraction for the Semantic Web
- Traditional IE is based on a flat structure, e.g. recognising Person, Location, Organisation, Date, Time, etc.
- For the Semantic Web, we need information in a hierarchical structure
- The idea is that we attach semantic metadata to the documents, pointing to concepts in an ontology
- Information can be exported as an ontology annotated with instances, or as text annotated with links to the ontology
40. Richer NE Tagging
- Attachment of instances in the text to concepts in the domain ontology
- Disambiguation of instances, e.g. Cambridge, MA vs Cambridge, UK
41. Magpie
- Developed by the Open University
- Plugin for a standard web browser
- Automatically associates an ontology-based semantic layer to web resources, allowing relevant services to be linked
- Provides the means for a structured and informed exploration of web resources
- e.g. looking at a list of publications, we can find information about an author such as projects they work on, other people they work with, etc.
42. Magpie in Action (1)
43. Magpie in Action (2)
44. Magpie in Action (3)
45. Evaluation metrics and tools
- Evaluation metrics mathematically define how to measure the system's performance against a human-annotated gold standard
- A scoring program implements the metric and provides performance measures:
- for each document and over the entire corpus
- for each type of NE
- may also evaluate changes over time
- A gold standard reference set also needs to be provided; this may be time-consuming to produce
- Visualisation tools show the results graphically and enable easy comparison
46. Methods of evaluation
- Traditional IE is evaluated in terms of Precision and Recall
- Precision: how accurate were the answers the system produced?
- correct answers / answers produced
- Recall: how good was the system at finding everything it should have found?
- correct answers / total possible correct answers
- There is usually a tradeoff between precision and recall, so a weighted average of the two (F-measure) is generally also used.
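The three measures above can be computed directly from the counts. The sketch below uses the standard F-measure formula with a weight beta (beta = 1 gives the usual harmonic mean); the function name is my own.

```python
def precision_recall_f(correct, produced, possible, beta=1.0):
    """IE evaluation: correct = right answers the system gave,
    produced = all answers it gave, possible = gold-standard answers."""
    p = correct / produced if produced else 0.0   # precision
    r = correct / possible if possible else 0.0   # recall
    if p + r == 0:
        return p, r, 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r)  # weighted F-measure
    return p, r, f

# e.g. 10 answers produced, 8 correct, 16 in the gold standard:
p, r, f = precision_recall_f(correct=8, produced=10, possible=16)
print(p, r, f)  # -> 0.8 0.5 and F1 ~ 0.615
```

The tradeoff is visible here: producing fewer, safer answers raises precision but lowers recall, and F-measure penalises either extreme.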
47. Metrics for Richer IE
- Precision and Recall are not sufficient for ontology-based IE, because the distinction between right and wrong is less obvious
- Recognising a Person as a Location is clearly wrong, but recognising a Research Assistant as a Lecturer is not so wrong
- Similarity metrics additionally need to be integrated, such that items closer together in the hierarchy are given a higher score, if wrong
- Also possible is a cost-based approach, where different weights can be given to each concept in the hierarchy, and to different types of error, and combined to form a single score
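One possible similarity metric of this kind (my own toy formulation over an invented mini-ontology, not a standard measure) scores a wrong label higher the closer it sits to the gold concept in the hierarchy:

```python
# Toy ontology: child concept -> parent concept.
HIERARCHY = {
    "Lecturer": "AcademicStaff",
    "ResearchAssistant": "AcademicStaff",
    "AcademicStaff": "Person",
    "Person": "Entity",
    "City": "Location",
    "Location": "Entity",
}

def ancestors(concept):
    """Chain from the concept up to the root, inclusive."""
    chain = [concept]
    while concept in HIERARCHY:
        concept = HIERARCHY[concept]
        chain.append(concept)
    return chain

def similarity(gold, predicted):
    """1.0 for an exact match; otherwise penalise by the number of
    hierarchy steps to the closest common ancestor."""
    if gold == predicted:
        return 1.0
    a, b = ancestors(gold), ancestors(predicted)
    common = next(c for c in a if c in b)
    dist = a.index(common) + b.index(common)
    return max(0.0, 1.0 - dist / (len(a) + len(b)))

# Mistaking a Lecturer for a ResearchAssistant scores better than
# mistaking a Lecturer for a City:
print(similarity("Lecturer", "ResearchAssistant"))
print(similarity("Lecturer", "City"))
```

A cost-based variant would replace the uniform per-step penalty with per-concept and per-error-type weights, as the slide suggests.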
48. Visualisation of Results
- Cluster Map example
- Traditionally used to show documents classified according to topic
- Here it shows instances classified according to concept
- Enables analysis, comparison and querying of results
49. The principle: Venn diagrams
Documents classified according to topic
50. Jobs by region
Instances classified by concept
51. Concept distribution
Shows the relative importance of different
concepts
52. Correct and incorrect instances attached to concepts
53. Why is IE difficult?
- BNC Holdings Inc named Ms G Torretta as its new chairman.
- Nicholas Andrews was succeeded by Gina Torretta as chairman of BNC Holdings Inc.
- Ms. Gina Torretta took the helm at BNC Holdings Inc.
- Hint: what are they all referring to?
54. Try IE yourself ... (1)
- Given a particular text ...
- Find all the successions ...
- Hint: there are 6, including the one below
- Hint: we do not have complete information
- E.g.
- <SUCCESSION-1>
- ORGANIZATION "New York Times"
- POST "president"
- WHO_IS_IN "Russell T. Lewis"
- WHO_IS_OUT "Lance R. Primis"
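The succession template can be represented as a simple record; the sketch below mirrors the slide's field names in a Python dataclass (the class itself is my own illustration, not part of the original MUC task definition). Optional fields capture the "incomplete information" hint.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Succession:
    """One filled succession template; missing slots stay None."""
    organization: str
    post: str
    who_is_in: Optional[str] = None
    who_is_out: Optional[str] = None

s1 = Succession("New York Times", "president",
                who_is_in="Russell T. Lewis",
                who_is_out="Lance R. Primis")
print(s1.post)  # -> president
```

Each answer on the following slides corresponds to one such record, some with empty WHO_IS_IN or WHO_IS_OUT slots.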
55.
- <DOC>
- <DOCID> wsj93_050.0203 </DOCID>
- <DOCNO> 930219-0013. </DOCNO>
- <HL> Marketing Brief @ Noted.... </HL>
- <DD> 02/19/93 </DD>
- <SO> WALL STREET JOURNAL (J), PAGE B5 </SO>
- <CO> NYTA </CO>
- <IN> MEDIA (MED), PUBLISHING (PUB) </IN>
- <TXT>
- <p> New York Times Co. named Russell T. Lewis, 45, president and general manager of its flagship New York Times newspaper, responsible for all business-side activities. He was executive vice president and deputy general manager. He succeeds Lance R. Primis, who in September was named president and chief operating officer of the parent. </p>
- </TXT>
- </DOC>
56. Answer (1)
- <SUCCESSION-2>
- ORGANIZATION "New York Times"
- POST "general manager"
- WHO_IS_IN "Russell T. Lewis"
- WHO_IS_OUT "Lance R. Primis"
- <SUCCESSION-3>
- ORGANIZATION "New York Times"
- POST "executive vice president"
- WHO_IS_IN
- WHO_IS_OUT "Russell T. Lewis"
57. Answer (2)
- <SUCCESSION-4>
- ORGANIZATION "New York Times"
- POST "deputy general manager"
- WHO_IS_IN
- WHO_IS_OUT "Russell T. Lewis"
- <SUCCESSION-5>
- ORGANIZATION "New York Times Co."
- POST "president"
- WHO_IS_IN "Lance R. Primis"
- WHO_IS_OUT
58. Answer (3)
- <SUCCESSION-6>
- ORGANIZATION "New York Times Co."
- POST "chief operating officer"
- WHO_IS_IN "Lance R. Primis"
- WHO_IS_OUT
59. Questions?