Title: Quality Labelling of Web Content
1. Quality Labelling of Web Content
Software & Knowledge Engineering Laboratory,
Institute of Informatics & Telecommunications,
NCSR "Demokritos", Athens, Greece
Vangelis Karkaletsis
3rd IFIP Conference on Artificial Intelligence
Applications & Innovations (AIAI 2006), Athens,
7-9 June 2006
2. Contents
- Quality labels / trustmark schemes
- Existing labeling processes
- Needs for new technologies
- On-going projects and initiatives
  - QUATRO, MedIEQ, W3C WCL-XG, ...
- Concluding Remarks
3. Quality labels / trustmark schemes - I
- Quality labels / trustmark schemes have been
  established in many parts of the world
  - some are online versions of existing schemes,
  - others have been developed specifically for the
    web.
4. Quality labels / trustmark schemes - II
- Inform the user about the quality of the data and
  services provided
  - whether these are of a certain type, fulfil
    certain criteria or meet given requirements
  - for example, a label may include an assertion
    that the labeled web site has a suitable privacy
    policy, that the publisher is clearly identified,
    and that it meets legal practice in one or more
    identified countries.
- Two notable areas of interest for quality labels
  / trustmark schemes are
  - those designed to give consumers confidence in
    eCommerce operations, and
  - those that indicate that health related content
    has been peer reviewed
5. Quality labels / trustmark schemes - III: an
example, the WMA label for health related web
content
- Identification
- Content
- Confidentiality
- Advertising and Sponsoring
- Virtual Consultation
- Non compliance
6. Quality labels / trustmark schemes - IV: some
facts about health related web content
- The number of health web sites and online
  services is increasing day by day
- 70-80% of Internet users seek health information
  for themselves or for their relatives
- More than 4 out of 10 health information seekers
  say the material they find affects their health
  decisions
7. Quality labels / trustmark schemes - V: some facts
about health related web content
- Quality of health related web content is
  extremely variable
  - from evidence-based healthcare to widespread
    practice of fraud and potentially dangerous
    claims
- Increase in consumer knowledge changes how
  patients, professionals and providers interact
  - Patients are becoming more proactive in their
    care management
- Effects on public health
  - An example: vaccines
10. Existing labeling processes - I
- Organisations around the world are working on
  establishing quality standards
- For example, concerning health sites:
  - European Commission
    - eEurope 2002: Quality criteria for health related
      web sites
  - American Medical Association
    - Guidelines for medical and health information sites
      on the Internet
  - Internet Healthcare Coalition
    - eHealth Code of Ethics
  - ...
11. Existing labeling processes - II
- Quality standards initiatives are not enough
  - Self-adherence to codes of conduct or ethics is
    nothing more than a claim with little
    enforceability
- Labeling mechanisms need to be established
  - by third party accreditation
  - by creating portals where web sites are organized
    and characterized against certain labeling
    criteria
- Such initiatives already exist
12. Existing labeling processes - III
- Codes of Conduct: sets of quality
  criteria that provide a list of recommendations
  for the development and content of websites
- Quality Label: a logo displayed on screen that
  represents a commitment by a provider to
  implement or adhere to a code of conduct
- User Guidance: enables users to check whether a site
  complies with certain standards by accessing a
  series of questions from a displayed logo
- Filtering Tools: applied manually or
  automatically, they accept or reject web resources;
  resources are selected for their quality and
  relevance to a particular audience
- Third Party Certification: quality and
  accreditation labels are awarded by a third party
  to inform consumers that a site provides
  information meeting current standards for content
  and form
13. Existing Labeling Processes - IV
- (A.) A web site issues a request for a label to
  a third party (labeling operator)
  - The site is checked and, if OK, a label is generated
  - the label is either stored locally on the site's
    server, or stored in the operator's database (a
    link to the label is added to the web site)
  - the site's content is examined periodically and, if an
    unacceptable change occurs, the label is either
    removed or replaced with a relevant message
14. Existing Labeling Processes - V
- (B.) Location of unlabeled web sites in specific
  thematic areas
  - Characterization of the located sites with
    respect to certain criteria
  - Filtering of some of the web sites based on their
    characterization
  - Organizing the rest of the web sites into web
    directories to facilitate access by information
    consumers
15. Need for new labeling technologies - I
- Problems of existing labeling processes are due to
  - High costs of offering the service
  - Huge amount of information to assess (too many
    sites)
  - Content changes rapidly
  - Broken links to accredited websites
  - Non-standardised rating criteria
  - Dishonest use of the label
16. Need for new labeling technologies - II
- Most of the work in labeling processes is
  currently performed manually
  - A site may have hundreds of pages (static or dynamic)
    or other resources (.doc, .rtf, .pdf,
    images, ...)
  - Probably all or most of the resources have to be
    checked
  - A single label may be used for the whole site, or
    different labels for the site's resources
17. Need for new labeling technologies - III
- End-users' access to labeled resources must
  be improved
  - If labels could be recognised by web browsers and
    search engines, this would motivate content
    providers to label their resources
18. Need for new labeling technologies - IV
- Need for technologies that enable the automation
  of the labeling operator's work. This involves:
  - Technology for creating machine processable
    labels
    - establishing common schemas and vocabularies,
      exploiting semantic web technologies (RDF, OWL)
    - developing label generators with user-friendly
      interfaces
  - Technology for maintaining the labels
    - monitoring the label with respect to its issue
      and expiration dates and its integrity (when stored
      by the content provider)
    - monitoring the label with respect to the labeled
      content, exploiting content analysis technologies
  - Technology for locating unlabeled web resources
    - necessary for domain specific portals collecting
      web resources that meet certain quality criteria
19. Need for new labeling technologies - V
- Need for technologies that enable the end-user's
  access to the label and its content. This
  involves:
  - Technology for locating the label inside a web
    resource, and reading and validating the label's
    content
  - Technology for presenting the label's content and
    validation results to the end user
20. Technology for creating machine processable
labels - I
- Advantages of establishing common schemas and
  vocabularies
  - A label that is machine readable and uses common
    descriptors will be interpreted more easily by
    semantic web tools than one that uses purely
    proprietary elements
    - For instance, if a user agent is configured to
      look for Label A but finds a site that is
      accredited by Label B, at least the common
      elements will be recognised, even if those
      specific to Label B are not.
  - The incentive for content providers to gain
    accreditation for their material is therefore
    enhanced if the labeling scheme they adopt uses
    at least some of the common descriptors
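The Label A / Label B case can be illustrated with a minimal Python sketch. All element names below are hypothetical and do not come from any real labeling scheme; the point is only that an agent built for one scheme still understands the descriptors the two schemes share.

```python
# Hypothetical descriptor names, for illustration only.
COMMON_TERMS = {"privacy_statement", "contact_point", "issue_date"}
LABEL_A_TERMS = COMMON_TERMS | {"a_peer_reviewed"}


def recognised_elements(label: dict, known_terms: set) -> dict:
    """Return the part of a label that an agent with `known_terms` understands."""
    return {name: value for name, value in label.items() if name in known_terms}


# A label issued under scheme B: two common descriptors plus one proprietary one.
label_b = {
    "privacy_statement": True,
    "contact_point": True,
    "b_fair_marketing": True,
}

# An agent configured for Label A keeps the common elements and skips the rest.
understood = recognised_elements(label_b, LABEL_A_TERMS)
```

The more descriptors a scheme shares with others, the larger the understood part becomes for agents that were never configured for that scheme.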
21. Technology for creating machine processable
labels - II
- Advantages of establishing common schemas and
  vocabularies (cont.)
  - A common set of elements facilitates the
    application of web content analysis techniques to
    ensure that an accredited site continues to meet
    at least some of the common labeling criteria.
    - For instance, a content analyser cannot tell
      whether an e-mail sent to an eCommerce operator
      will be answered within a given time, but it
      can detect that a contact route is still provided
      3 months after the site was last reviewed by a
      human, even if the nature of the contact route
      has changed
  - The use of a common vocabulary offers commercial
    advantages to labeling authorities by increasing
    the value of the labels for content providers and
    end-users
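A rough sketch of the contact-route check, in Python. The two patterns are simplifying assumptions made for this example (an e-mail address anywhere in the page, or a link whose target mentions "contact"); a real content analyser would need far more robust detection.

```python
import re

# Assumed heuristics, not a complete definition of a "contact route".
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CONTACT_LINK = re.compile(r'href="[^"]*contact[^"]*"', re.IGNORECASE)


def has_contact_route(html: str) -> bool:
    """True if the page still offers some way to contact the operator."""
    return bool(EMAIL.search(html) or CONTACT_LINK.search(html))
```

The check passes whether the site offers an e-mail address or a contact form, which matches the idea that the nature of the route may change between reviews.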
22. Technology for creating machine processable
labels - III
- Establishing common schemas and vocabularies
  exploiting semantic web technologies (RDF, OWL)
- Some basics on the Resource Description Framework
  (RDF)
  - Established by W3C
  - Can be considered as providing our file format
    - We'll use it for sharing Web content labels
      (import/export/publish for our tools)
  - Queried using an SQL-like language, SPARQL
  - RDF/XML files make simple statements, for example:
    - Advertising is present here
    - There is a service of virtual consultation for
      professionals
    - The intended audience is health professionals
    - It's in the MeSH 'quality of health care' category
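The four statements can be pictured as subject-predicate-object triples. The sketch below is a plain-Python stand-in: the subject URL and the property names are invented for illustration, and the `match` helper only imitates the simplest kind of pattern a SPARQL query expresses. Real tools would use an RDF library and SPARQL itself.

```python
# Hypothetical subject and property names, for illustration only.
SITE = "http://example.org/"

triples = [
    (SITE, "hasAdvertising", "true"),
    (SITE, "virtualConsultationForProfessionals", "true"),
    (SITE, "intendedAudience", "health professionals"),
    (SITE, "meshCategory", "quality of health care"),
]


def match(store, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return [
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Who is the intended audience of this site?"
audience = match(triples, s=SITE, p="intendedAudience")
```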
23. Technology for creating machine processable
labels - IV
- Some basics on RDF (cont.)
  - Enables sharing the work by mixing schemas
    - QUATRO for talking about 'advertisements'
      - <quatro:ac>1</quatro:ac>
    - WMA for 'virtual consultations'
      - <wma:virtconsprof>0</wma:virtconsprof>
    - Dublin Core for 'intended audience'
      - <dc:audience>Children</dc:audience>
    - And unlimited others...
  - Re-use of existing RDF vocabularies means
    - saving time by not re-defining existing concepts
    - making re-use of both software and data more
      likely
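A minimal sketch of how properties from several vocabularies can sit in one RDF/XML description and be read back with Python's standard library. The `quatro` and `wma` namespace URIs are placeholders invented for this example, and binding `dc` to the DCMI Metadata Terms namespace is an assumption. The parser handles only this flat shape and is not a general RDF/XML reader.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The quatro and wma namespace URIs below are placeholders.
RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:quatro="http://example.org/quatro#"
    xmlns:wma="http://example.org/wma#"
    xmlns:dc="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://example.org/">
    <quatro:ac>1</quatro:ac>
    <wma:virtconsprof>0</wma:virtconsprof>
    <dc:audience>Children</dc:audience>
  </rdf:Description>
</rdf:RDF>"""


def read_statements(rdf_xml: str):
    """Return (subject, property, value) for each flat property element."""
    root = ET.fromstring(rdf_xml)
    statements = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree expands prefixes, so prop.tag is "{namespace}localname".
            statements.append((subject, prop.tag, prop.text))
    return statements


statements = read_statements(RDF_XML)
```

Because each property carries its full namespace, a tool that only knows Dublin Core can still pick out the audience statement and ignore the others.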
24. Technology for creating machine processable
labels - V
- Some basics on the Web Ontology Language (OWL)
  - W3C recommendation
  - Can be used to explicitly represent the meaning
    of terms in vocabularies and the relationships
    between those terms.
  - OWL has more facilities for expressing meaning
    and semantics than XML, RDF, and RDF-S, and thus
    goes beyond these languages in its ability to
    represent machine interpretable content on the
    Web.
  - OWL is a revision of the DAML+OIL web ontology
    language
  - OWL has three increasingly expressive
    sublanguages: OWL Lite, OWL DL, and OWL Full.
25. Technology for creating machine processable
labels - VI
- Having languages for representing label
  data is not enough to attract content providers
  to create labels and add them to their content
- Developing label generators with user-friendly
  interfaces for the content providers
  - An example: the ICRA label generator
26. Technology for maintaining the labels - I
- Monitoring the label's content
  - When a label is generated, the following data may
    be stored in the labeling authority's database
    - issue date, expiration date
    - the hash of the label's content
    - the label itself
  - If the label is reviewed at some point
    - the labeling authority updates the dates
    - the hash may also be updated if, after the review,
      the authority changes the label (the new one is
      sent to the content provider)
    - it is also possible that the provider asks for
      changes to the label
  - The data stored about the label can be used by a
    tool to check
    - the label against the expiration date (if the date
      has passed, alert the authority)
    - the label's integrity (when the label is stored by
      the content provider, he/she may change it)
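The two checks (expiration and integrity) can be sketched in a few lines of Python. The record layout and the choice of SHA-256 are assumptions made for this example; the slides do not name a hash function or a storage format.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class LabelRecord:
    """What the labeling authority keeps about one issued label (assumed layout)."""
    issue_date: date
    expiration_date: date
    content_hash: str  # hash of the label as issued
    label: str         # the label itself


def label_hash(label_text: str) -> str:
    # SHA-256 is an assumption; any collision-resistant hash would serve.
    return hashlib.sha256(label_text.encode("utf-8")).hexdigest()


def check_label(record: LabelRecord, served_label: str, today: date) -> list:
    """Return the problems found for the label currently served by the provider."""
    problems = []
    if today > record.expiration_date:
        problems.append("expired")   # the authority should be alerted
    if label_hash(served_label) != record.content_hash:
        problems.append("modified")  # the provider's copy differs from the issued one
    return problems
```

An empty result means the label is within its validity period and identical to what the authority issued; after a review, the authority would store new dates and, if the label changed, a new hash.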
27. Technology for maintaining the labels - II
- Monitoring the label's content against the
  content of the labeled resource using content
  analysis technologies
  - spidering technology that enables navigating the
    monitored site to locate resources (pages,
    documents) related to the labeling criteria
  - information extraction technology to extract from
    the located resources the data corresponding to
    the labeling criteria
28. Technology for maintaining the labels - III
- Spidering involves techniques and tools for
  - Site navigation: traverses a Web site,
    collecting information from each resource visited
    and forwarding it to the Resource
    classification and Link scoring modules
  - Resource classification: decides whether a
    resource is an interesting one
    and should be stored or not, exploiting
    - machine learning techniques to train a
      classifier,
    - techniques for natural language processing, image
      analysis and page layout analysis that provide
      the features needed for the classifier's
      training
    - domain specific resources (terminologies,
      vocabularies, ontologies)
  - Link scoring: evaluates the links to be followed.
    Only links with a score above a certain threshold
    are followed.
    - machine learning techniques, heuristics and domain
      specific resources can be used
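The interplay of the three modules can be sketched in Python over a toy in-memory site. Everything here is illustrative: the pages are invented, and the keyword-based `is_interesting` and `score_link` functions merely stand in for the trained classifier and link scorer the slide describes.

```python
from collections import deque

# Toy site: URL -> (page text, outgoing links). Invented data.
SITE = {
    "/": ("Welcome to our clinic", ["/privacy", "/contact", "/news"]),
    "/privacy": ("Privacy policy: how we handle patient data", []),
    "/contact": ("Contact us by e-mail or phone", ["/privacy"]),
    "/news": ("Latest news", ["/news/2006"]),
    "/news/2006": ("Archive", []),
}
KEYWORDS = ("privacy", "contact")  # stand-in for the labeling criteria


def is_interesting(text: str) -> bool:
    """Resource classification (keyword stand-in for a trained classifier)."""
    return any(k in text.lower() for k in KEYWORDS)


def score_link(url: str) -> float:
    """Link scoring (heuristic stand-in): promising URLs get a high score."""
    return 1.0 if any(k in url for k in KEYWORDS) else 0.2


def spider(start: str, threshold: float = 0.5) -> list:
    """Site navigation: visit pages, store interesting ones, follow good links."""
    seen, stored = {start}, []
    queue = deque([start])
    while queue:
        url = queue.popleft()
        text, links = SITE[url]
        if is_interesting(text):
            stored.append(url)
        for link in links:
            # Only links scoring above the threshold are followed.
            if link not in seen and score_link(link) > threshold:
                seen.add(link)
                queue.append(link)
    return stored
```

With the default threshold the news pages are never fetched, which is the saving link scoring is meant to provide on large sites.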
29. Technology for maintaining the labels - IV
- Information extraction may involve
  - wrappers for different types of resources
  - techniques and tools from the areas of machine
    learning, natural language processing and image
    analysis (in the case of image resources) for
    - term extraction
    - named entity recognition and classification
    - relation extraction
  - use of domain specific resources (terminologies,
    vocabularies, ontologies)
30. Technology for locating unlabeled web resources
- Use of focused crawling technology to locate
  unlabeled domain specific web resources,
  exploiting
  - existing search engines
  - machine learning techniques
  - domain specific resources
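One common way to realise focused crawling is a best-first search that always expands the most relevant known URL next. The sketch below assumes that approach; the slides do not specify an algorithm. The seed list stands in for search engine results, the `relevance` function for a trained model, and the link graph and scores are invented toy data.

```python
import heapq

# Toy link graph and relevance scores. Invented data.
GRAPH = {"a": ["b", "c"], "b": ["d"], "c": [], "d": []}
SCORES = {"a": 0.9, "b": 0.2, "c": 0.8, "d": 0.5}


def focused_crawl(seeds, fetch_links, relevance, limit=10):
    """Best-first crawl: visit URLs in order of estimated relevance."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-relevance(url), url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < limit:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-relevance(link), link))
    return visited


order = focused_crawl(["a"], lambda u: GRAPH.get(u, []), lambda u: SCORES[u])
```

The `limit` parameter caps the number of pages fetched, so low-scoring regions of the web are simply never reached.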
31. Technologies for improving accessibility to the
labels - I
- Locating the label of a web resource, reading and
  validating its content
  - parse the resource's content to locate an RDF
    label
  - if such a label exists
    - identify the labeling authority
    - calculate the label's hash
    - get the data stored for the specific resource in
      the authority's database
    - process the resource's content with the content
      analyser
    - validate the label against the data in the
      authority's database
    - provide the label's data and the validation
      results to the tools responsible for presenting
      them to the end-user
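A reduced sketch of this pipeline in Python, under several assumptions: the label is taken to be an `<rdf:RDF>` block embedded in the page, the authority's database is a plain dictionary from URL to stored hash, and SHA-256 is the hash. Identifying the authority and running the content analyser are left out. The three outcomes mirror the valid / invalid / unknown results that the deck attributes to the QUATRO proxy.

```python
import hashlib
import re

# Assumed convention: the label is an rdf:RDF element embedded in the page.
LABEL_PATTERN = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)


def find_label(html: str):
    """Locate an embedded RDF label, or return None if there is none."""
    found = LABEL_PATTERN.search(html)
    return found.group(0) if found else None


def sha256_hex(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def validate(url: str, html: str, authority_db: dict) -> str:
    """Compare the served label with the hash stored by the authority."""
    label = find_label(html)
    if label is None:
        return "unknown"   # nothing to validate
    stored_hash = authority_db.get(url)
    if stored_hash is None:
        return "unknown"   # the authority has no record for this resource
    return "valid" if sha256_hex(label) == stored_hash else "invalid"
```

The returned status, together with the label's data, is what a browser extension or search engine wrapper would then present to the end-user.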
32. Technologies for improving accessibility to the
labels - II
- Presenting the label's content and validation
  results to the end user
  - enabling existing web browsers and search engines
    - to communicate with web services that are able to
      locate and validate labels in the retrieved
      resources, as well as
    - to present the label's data and validation
      results in a format understandable by the
      end-user
  - natural language generation technology can be
    exploited to present the label's content in the
    end-user's language according to his/her
    knowledge and needs
33. QUATRO project - I
- The EC-funded project Quality Assurance and
  Content Description, QUATRO (Safer Internet
  Programme)
- Provides a common vocabulary and machine
  processable RDF schema for quality labeling
  - Known as RDF-CL, it allows a small amount of
    metadata to be applied to anything from a single
    resource, such as a web page, to millions of items
    on any number of web sites.
  - A default label can be set for a whole web site
    or set of web sites, with overrides coming into
    play as required.
  - Labels can be stored on the labeled site or in a
    database operated by the Labeling Authority.
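The default-plus-overrides idea can be illustrated with a small Python sketch. The rule used here (the longest matching URL prefix wins) is an assumption chosen for the example and is not taken from the RDF-CL schema itself; the label names are invented.

```python
def resolve_label(url: str, default: str, overrides: dict) -> str:
    """Pick the label for a URL: the longest matching override prefix wins."""
    matching = (prefix for prefix in overrides if url.startswith(prefix))
    best = max(matching, key=len, default=None)
    return overrides[best] if best is not None else default


# Hypothetical labels for one site.
SITE_DEFAULT = "general-audience"
OVERRIDES = {
    "http://example.org/forum/": "user-generated-content",
    "http://example.org/forum/kids/": "moderated-for-children",
}
```

With this data, a page under `/forum/kids/` gets the most specific override, a page elsewhere under `/forum/` gets the forum label, and every other page on the site falls back to the default, so only a handful of statements cover the whole site.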
34. QUATRO project - II
- The QUATRO vocabulary is divided into four
  categories
  - General criteria, such as whether the labeled
    site includes a privacy statement, contact point
    etc.
  - Criteria for labeling to ensure accuracy of
    information, such as the content provider's
    credentials and appropriate disclosure of
    funding.
  - Criteria for labeling to ensure compliance with
    rules and legislation for e-business, such as fair
    marketing practices and measures to protect
    children
  - Terms used in operating the labeling scheme
    itself, such as the date the label was issued,
    when it was last reviewed and by whom.
- Three different domain specific vocabularies
  developed by QUATRO partners
  - ICRA: nudity, sexual content, violence, ...
  - IQUA: integrity, responsibility,
    confidentiality, protection of intellectual and
    industrial property rights, ...
  - WMA: health related content and services
35. QUATRO project - III
- QUATRO develops tools to support the exploitation
  of its labels
  - QUATRO proxy server (QUAPRO)
    - Takes as input a URL, either from a search engine
      or a browser, and examines whether the resource
      contains labels
    - Parses the label (only for QUATRO-based labels)
      in order to check its validity in terms of
      - the label itself, or
      - the URL's content (in QUATRO, this is restricted
        to pages with pornographic content)
    - Returns a result on the label's validity (valid,
      invalid, unknown)
  - A browser extension, the Metadata Visualizer
    (ViQ)
  - A search engine wrapper, the Label Display
    Interface (LADI): a web interface
    displaying annotated search results that link to
    the corresponding labels
36. QUATRO project - IV: Architecture
[Figure: QUATRO architecture diagram. Components shown: the Web, ViQ (the browser plug-in), LADI (the search engine data wrapper), QUAPRO (the QUATRO proxy), FilterX (the content analyser), DAcc (the Data ACCess interface) and the Labelling Authorities' databases. Connections between components are labelled SOAP.]
37. QUATRO project: LADI
38. QUATRO project: ViQ
39. QUATRO project - V
- QUATRO addresses the needs of both the labeling
  operator and the end-user
- QUATRO's technology for creating machine
  processable labels represents the first step
  towards a platform that will support the work of
  the labeling operator
- Technology for automated web content analysis is
  still required
40. MedIEQ Project - I
- MedIEQ: Quality Labeling of Medical Web content
  using Multilingual Information Extraction
- EC-funded project
  - DG SANCO (Health & Consumer Protection),
    Directorate C: Public Health and Risk Assessment
  - Public Health Programme, Priority Area 1:
    Health Information, Action 1.5: eHealth
- Duration: 01/01/2006-01/01/2009
41. MedIEQ Project - II: Project Objectives
- develop a scheme for the quality labelling of
  health related web content and provide the tools
  supporting the creation and maintenance of, and
  access to, labelling data according to this scheme
- specify a methodology for the content analysis of
  medical web sites according to the MedIEQ scheme
  and develop the tools that will implement it
- integrate these technologies into a prototype
  labelling system
- demonstrate the resulting prototype in 7
  different languages (Spanish, Catalan, German,
  English, Greek, Czech, and Finnish) and two
  labelling applications (third party
  accreditation, classification)
42. MedIEQ Project - III
43. The MedIEQ Project - IV: Indicators for measuring
the achievement of objectives
- Reduction of the manual labelling time
  - Labeling unlabeled sites, monitoring labeled
    sites, ...
- Effective extraction from large collections of
  medical web content
  - Processing time, precision of extracted data, ...
- Effort required to customize the system for new
  languages
  - 7 languages to be supported
- Implementation of an open architecture
  - Effort required to integrate new techniques and
    tools, ...
44. W3C Content Label Incubator Group
- WCL-XG aims to foster ideas for how content
  providers can inform search engines, aggregators
  and other data systems that
  - their content is of a certain type,
  - fulfils certain criteria or meets given
    requirements
- Content labels will need to be applicable to a
  resource or a group of resources
- It should be possible to build systems that in
  some way show the labels to be trustworthy
- WCL-XG output may be used as input to a working
  group leading to a full W3C Recommendation
- WCL-XG started its work in February 2006 and is
  scheduled to complete in June
45. Concluding Remarks
- Quality labeling is an application area for
  content analysis, knowledge management, and
  intelligent interfaces
- Strong need for the development of tools
  assisting the work of labeling authorities
- Browsers and search engines will have to be
  enriched with functionalities that enable the
  recognition of machine processable labels
  (RDF-CL) and the presentation of their content to
  the end-user
46. Concluding Remarks
- Establishment of quality labels in practice
  cannot be enforced by measures
- If content providers realize that
  - content labels can be created and added easily
    to their content
  - labeling authorities are equipped with technology
    that facilitates the monitoring of the provided
    web content against the labeling criteria
  - search engines and browsers can inform users about
    the existence of quality labels and their
    features
- they will adopt machine readable content labeling
  technology
  - leading to an increase in labeled sites
  - improving in turn the quality of the knowledge
    disseminated through the Web
- What do you think? Is this the right way to
  proceed?
47. Useful Links
- QUATRO site
  - http://www.quatro-project.org
- MedIEQ site
  - http://www.medieq.org
- WCL-XG
  - http://www.w3.org/2005/Incubator/wcl
48. Quality Labelling of Web Content
Vangelis Karkaletsis
Thank you!