Quality Labelling of Web Content - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Quality Labelling of Web Content

1
Quality Labelling of Web Content
Software & Knowledge Engineering Laboratory, Institute of Informatics & Telecommunications, NCSR 'Demokritos', Athens, Greece
Vangelis Karkaletsis
3rd IFIP Conference on Artificial Intelligence Applications & Innovations (AIAI 2006), Athens, 7-9 June 2006
2
Contents

Quality labels / trustmark schemes
Existing labeling processes
Needs for new technologies
On-going projects and initiatives
  QUATRO, MedIEQ, W3C WCL-XG
Concluding Remarks

3
Quality labels / trustmark schemes - I

Quality labels / trustmark schemes have been established in many parts of the world
  some are online versions of existing schemes; others have been developed specifically for the web

4
Quality labels / trustmark schemes - II

They inform the user about the quality of the data and services provided
  whether these are of a certain type, fulfill certain criteria or meet given requirements
  for example, a label may include an assertion that the labeled web site has a suitable privacy policy, that the publisher is clearly identified, and that it meets legal practice in one or more identified countries
Two notable areas of interest for quality labels / trustmark schemes are
  those designed to give consumers confidence in eCommerce operations, and
  those that indicate that health related content has been peer reviewed

5
Quality labels / trustmark schemes - III: an example, the WMA label for health related web content

Identification
Content
Confidentiality
Advertising and Sponsoring
Virtual Consultation
Non compliance

6
Quality labels / trustmark schemes - IV: some facts about health related web content

The number of health web sites and online services is increasing day by day
70-80% of Internet users seek health information for themselves or for their relatives
More than 4 out of 10 health information seekers say the material they find affects their health decisions

7
Quality labels / trustmark schemes - V: some facts about health related web content

The quality of health related web content is extremely variable
  from evidence-based healthcare to widespread practice of fraud and potentially dangerous claims
The increase in consumer knowledge changes how patients, professionals and providers interact
  Patients are becoming more proactive in their care management
Effects on public health
  An example: vaccines

8
(No Transcript)
9
(No Transcript)
10
Existing labeling processes - I

Organisations around the world are working on establishing quality standards
For example, concerning health sites:
  European Commission
    eEurope 2002: Quality criteria for health related web sites
  American Medical Association
    Guidelines for medical and health information sites on the Internet
  Internet Healthcare Coalition
    eHealth Code of Ethics

11
Existing labeling processes - II

Quality standards initiatives are not enough
  Self-adherence to codes of conduct or ethics is nothing more than a claim with little enforceability
Labeling mechanisms need to be established
  by third party accreditation
  by creating portals where web sites are organized and characterized against certain labeling criteria
Such initiatives already exist

12
Existing labeling processes - III
Codes of Conduct: sets of quality criteria that provide a list of recommendations for the development and content of websites
Quality Label (logo): displayed on screen; it represents a commitment by a provider to implement or adhere to a code of conduct
User Guidance: enables users to check whether a site complies with certain standards by accessing a series of questions from a displayed logo
Filtering Tools: applied manually or automatically, they accept or reject web resources; resources are selected for their quality and relevance to a particular audience
Third Party certification: quality and accreditation labels are awarded by a third party to inform consumers that a site provides information meeting current standards for content and form
13
Existing Labeling Processes - IV

(A.) A web site issues a request for a label to a third party (the labeling operator)
  The site is checked and, if it is OK, a label is generated
  The label is either stored locally on the site's server or stored in the operator's database (a link to the label is added to the web site)
  The site's content is examined periodically and, if an unacceptable change occurs, the label is either removed or replaced with a relevant message

14
Existing Labeling Processes - V

(B.) Location of unlabeled web sites in specific thematic areas
  Characterization of the located sites with respect to certain criteria
  Filtering of some of the web sites based on their characterization
  Organizing the remaining web sites into web directories to facilitate access by information consumers

15
Need for new labeling technologies - I

Problems of existing labeling processes are due to
  High costs of offering the service
  Huge amount of information to assess (too many sites)
  Content that changes rapidly
  Broken links to accredited websites
  Non-standardised rating criteria
  Dishonest use of the label

16
Need for new labeling technologies - II

Most of the work in labeling processes is currently performed manually
  A site may have hundreds of pages (static or dynamic) or other resources (.doc, .rtf, .pdf, images, etc.)
  Probably all or most of the resources have to be checked
  A single label may be used for the whole site, or different labels for the site's resources

17
Need for new labeling technologies - III

Access of the end-users to labeled resources must be improved
  If labels could be recognised by web browsers and search engines, this would motivate content providers to label their resources

18
Need for new labeling technologies - IV

Need for technologies that enable the automation of the labeling operator's work. This involves:
Technology for creating machine processable labels
  establishing common schemas and vocabularies, exploiting semantic web technologies (RDF, OWL)
  developing label generators with user-friendly interfaces
Technology for maintaining the labels
  monitoring the label with respect to its issue and expiration dates and its integrity (when stored by the content provider)
  monitoring the label with respect to the labeled content, exploiting content analysis technologies
Technology for locating unlabeled web resources
  necessary for domain specific portals collecting web resources that meet certain quality criteria

19
Need for new labeling technologies - V

Need for technologies that enable the end-user's access to the label and its content. This involves:
  Technology for locating the label inside a web resource, and for reading and validating the label's content
  Technology for presenting the label's content and validation results to the end-user

20
Technology for creating machine processable
labels - I

Advantages of establishing common schemas and
vocabularies
A label that is machine readable and uses common
descriptors will be interpreted more easily by
semantic web tools than one that uses purely
proprietary elements
For instance, if a user agent is configured to
look for Label A but finds a site that is
accredited by Label B, at least the common
elements will be recognised, even if those
specific to Label B are not.
The incentive for content providers to gain
accreditation for their material is therefore
enhanced if the labeling scheme they adopt uses
at least some of the common descriptors

21
Technology for creating machine processable
labels - II

Advantages of establishing common schemas and
vocabularies (cont.)
A common set of elements facilitates the
application of web content analysis techniques to
ensure that an accredited site continues to meet
at least some of the common labeling criteria.
For instance, a content analyser cannot tell
whether an e-mail sent to an eCommerce operator
will be responded to within a given time, but it
can detect that a contact route is still provided
3 months after the site was last reviewed by a
human, even if the nature of the contact route
changes
The use of a common vocabulary offers commercial
advantages to labeling authorities by increasing
the value of the labels for content providers and
end-users
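
The contact-route example above can be illustrated with a small sketch. It assumes a deliberately naive content analyser that only looks for an e-mail address or a contact form in the page markup; the pattern names and the sample pages are hypothetical and are not taken from the QUATRO or MedIEQ tools.

```python
import re

# Naive, illustrative patterns; a real content analyser would be far richer.
CONTACT_PATTERNS = {
    "email": re.compile(r"mailto:|[\w.+-]+@[\w-]+(\.[\w-]+)+", re.IGNORECASE),
    "form": re.compile(r"<form\b[^>]*contact", re.IGNORECASE),
}


def contact_routes(html):
    """Return the sorted names of the contact routes detected in the markup."""
    return sorted(name for name, pattern in CONTACT_PATTERNS.items()
                  if pattern.search(html))


# The page as reviewed by a human, and the same page three months later.
page_at_review = '<a href="mailto:info@example.org">Write to us</a>'
page_later = '<form action="/contact" method="post"><input name="msg"></form>'
page_without_contact = "<p>Welcome to our shop</p>"

print(contact_routes(page_at_review))
print(contact_routes(page_later))
print(contact_routes(page_without_contact))
```

The nature of the contact route changes between the two versions of the page, but the criterion "a contact route is provided" still holds; only the last page would raise an alert for the labeling authority.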

22
Technology for creating machine processable
labels - III

Establishing common schemas and vocabularies, exploiting semantic web technologies (RDF, OWL)
Some basics on the Resource Description Framework (RDF)
  Established by W3C
  Can be considered as providing our file format
  We'll use it for sharing Web content labels (import/export/publish for our tools)
  Queried using an SQL-like language, SPARQL
  RDF/XML files make simple statements, such as
    Advertising is present here
    There is a service of virtual consultation for professionals
    The intended audience is health professionals
    It's in the MeSH 'quality of health care' category

23
Technology for creating machine processable
labels - IV

Some basics on RDF (cont.)
Enables sharing the work by mixing schemas
  QUATRO for talking about 'advertisements'
    <quatro:ac>1</quatro:ac>
  WMA for 'virtual consultations'
    <wma:virtconsprof>0</wma:virtconsprof>
  Dublin Core for 'intended audience'
    <dc:audience>Children</dc:audience>
  And unlimited others...
Re-use of existing RDF vocabularies means
  saving time from re-defining existing concepts
  making re-use of both software and data more likely
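
As a rough illustration of mixing schemas in one label, the sketch below wraps the three statements from this slide in a single RDF/XML description and reads them back with Python's standard XML parser. Only the rdf namespace URI is the real one; the quatro, wma and dc namespace URIs and the labeled URL are placeholders chosen for the example, and a real tool would normally use a proper RDF library rather than a plain XML parser.

```python
import xml.etree.ElementTree as ET

# Only the rdf URI is the official one; the others are placeholders.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "quatro": "http://example.org/quatro#",
    "wma": "http://example.org/wma#",
    "dc": "http://example.org/dc#",
}

LABEL_XML = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:quatro="http://example.org/quatro#"
         xmlns:wma="http://example.org/wma#"
         xmlns:dc="http://example.org/dc#">
  <rdf:Description rdf:about="http://www.example.org/">
    <quatro:ac>1</quatro:ac>
    <wma:virtconsprof>0</wma:virtconsprof>
    <dc:audience>Children</dc:audience>
  </rdf:Description>
</rdf:RDF>
"""


def read_label(xml_text):
    """Return {labeled URL: {prefixed property: value}} for each description."""
    root = ET.fromstring(xml_text.strip())
    about_attr = "{%s}about" % NS["rdf"]
    labels = {}
    for description in root.findall("rdf:Description", NS):
        properties = {}
        for child in description:
            # ElementTree tags look like "{namespace-uri}localname".
            uri, local_name = child.tag[1:].split("}")
            prefix = next((p for p, u in NS.items() if u == uri), uri)
            properties[prefix + ":" + local_name] = child.text
        labels[description.get(about_attr)] = properties
    return labels


label = read_label(LABEL_XML)
print(label)
```

Each vocabulary contributes only the terms it defines, and a consumer that knows just one of the vocabularies can still pick out the properties it understands.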

24
Technology for creating machine processable
labels - V

Some basics on the Web Ontology Language (OWL)
  W3C recommendation
  Can be used to explicitly represent the meaning of terms in vocabularies and the relationships between those terms
  OWL has more facilities for expressing meaning and semantics than XML, RDF, and RDF-S, and thus goes beyond these languages in its ability to represent machine interpretable content on the Web
  OWL is a revision of the DAML+OIL web ontology language
  OWL has three increasingly expressive sublanguages: OWL Lite, OWL DL, and OWL Full

25
Technology for creating machine processable
labels - VI

Having the languages for representing the labels' data is not enough to attract content providers to create labels and add them to their content
Label generators with user-friendly interfaces need to be developed for the content providers
  An example: the ICRA label generator

26
Technology for maintaining the labels - I

Monitoring the label's content
When a label is generated, the following data may be stored in the labeling authority's database
  issue date, expiration date
  the hash of the label's content
  the label itself
If the label is reviewed at some point
  the labeling authority updates the dates
  the hash may also be updated if, after the review, the authority changes the label (the new one is sent to the content provider)
  it is also possible that the provider asks for changes in the label
The data stored about the label can be used by a tool to examine
  the label against the expiration date (if the date has passed, alert the authority)
  the label's integrity (when the label is stored by the content provider, he/she may change it)
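
A minimal sketch of such a monitoring tool is given below. It assumes the authority stores the issue date, the expiration date and a hash of the label as issued; the record layout, the function names and the choice of SHA-256 are assumptions made for the example, since the slides do not specify them.

```python
import hashlib
from dataclasses import dataclass
from datetime import date


@dataclass
class LabelRecord:
    """What the labeling authority keeps about an issued label (assumed layout)."""
    issued: date
    expires: date
    content_hash: str  # hash of the label text as issued


def label_hash(label_text):
    # SHA-256 is an assumption; the slides do not name a hash function.
    return hashlib.sha256(label_text.encode("utf-8")).hexdigest()


def check_label(record, served_label, today):
    """Return a list of problems; an empty list means no alert is needed."""
    problems = []
    if today > record.expires:
        problems.append("expired")
    if label_hash(served_label) != record.content_hash:
        problems.append("modified")
    return problems


issued_text = "<quatro:ac>1</quatro:ac>"
record = LabelRecord(date(2006, 1, 1), date(2007, 1, 1), label_hash(issued_text))

# Label unchanged and still within its validity period.
print(check_label(record, issued_text, date(2006, 6, 7)))
# Label edited by the content provider and checked after the expiration date.
print(check_label(record, issued_text.replace("1", "0"), date(2007, 6, 7)))
```

When a review changes the label, the authority would store a new hash and new dates in the record before sending the updated label to the provider.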

27
Technology for maintaining the labels - II

Monitoring the label's content against the content of the labeled resource, using content analysis technologies
  spidering technology that enables navigating the monitored site to locate resources (pages, documents) related to the labeling criteria
  information extraction technology to extract from the located resources the data corresponding to the labeling criteria

28
Technology for maintaining the labels - III

Spidering involves techniques and tools for
Site navigation: traverses a web site, collecting information from each resource visited and forwarding it to the Resource Classification and Link Scoring modules
Resource classification: decides whether a resource is an interesting one and should be stored or not, exploiting
  machine learning techniques to train a classifier
  techniques for natural language processing, image analysis and page layout analysis that provide the necessary features for the classifier's training
  domain specific resources (terminologies, vocabularies, ontologies)
Link scoring: validates the links to be followed; only links with a score above a certain threshold are followed
  machine learning techniques, heuristics and domain specific resources can be used
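
The interplay of the three modules can be sketched on a toy, in-memory site. The keyword heuristics below merely stand in for the trained classifier and the link scorer; the site, the keyword lists and the threshold value are all hypothetical.

```python
from collections import deque

# A toy site: URL -> (page text, [(anchor text, target URL), ...]).
SITE = {
    "/": ("Welcome", [("About us", "/about"),
                      ("Privacy policy", "/privacy"),
                      ("Photo gallery", "/gallery")]),
    "/about": ("Contact: info@example.org", [("Contact details", "/contact")]),
    "/privacy": ("We do not share personal data.", []),
    "/gallery": ("Pictures", [("More photos", "/gallery/2")]),
    "/contact": ("Phone and address", []),
    "/gallery/2": ("More pictures", []),
}

LINK_KEYWORDS = {"about", "privacy", "contact", "policy"}
INTERESTING_TERMS = ("contact", "personal data", "phone")


def score_link(anchor_text):
    """Link scoring: fraction of anchor words that are labeling-related keywords."""
    words = anchor_text.lower().split()
    return sum(w in LINK_KEYWORDS for w in words) / len(words) if words else 0.0


def is_interesting(text):
    """Resource classification: a keyword stand-in for a trained classifier."""
    lowered = text.lower()
    return any(term in lowered for term in INTERESTING_TERMS)


def spider(start, threshold=0.4):
    """Site navigation: breadth-first traversal that follows only high-scoring links."""
    seen, queue, kept = {start}, deque([start]), []
    while queue:
        url = queue.popleft()
        text, links = SITE[url]
        if is_interesting(text):
            kept.append(url)
        for anchor, target in links:
            if target not in seen and score_link(anchor) > threshold:
                seen.add(target)
                queue.append(target)
    return kept


print(spider("/"))
```

The gallery pages are never visited because their links score below the threshold, which is the point of link scoring: the spider spends its effort on the parts of the site that are likely to matter for the labeling criteria.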

29
Technology for maintaining the labels - IV

Information extraction may involve
  wrappers for different types of resources
  techniques and tools from the areas of machine learning, natural language processing and image analysis (in the case of image resources) for
    term extraction
    named entity recognition and classification
    relation extraction
  use of domain specific resources (terminologies, vocabularies, ontologies)

30
Technology for locating unlabeled web resources

Use of focused crawling technology to locate unlabeled domain specific web resources, exploiting
  existing search engines
  machine learning techniques
  domain specific resources

31
Technologies for improving accessibility to the
labels - I

Locating the label of a web resource, reading and validating its content
  parse the resource's content to locate an RDF label
  if such a label exists
    identify the labeling authority
    calculate the label's hash
    get the data stored for the specific resource in the authority's database
    process the resource's content with the content analyser
    validate the label against the data in the authority's database
    provide the label's data and the validation results to the tools responsible for presenting them to the end-user
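
The steps above can be sketched as a single function. The sketch assumes the label is embedded in the page as an rdf:RDF block, that it names its authority in a hypothetical label:authority element, and that each authority's database maps a URL to the stored hash and expiration date. The content analysis step is left out, and the three outcomes (valid, invalid, unknown) are borrowed from the QUATRO proxy described later in the talk.

```python
import hashlib
import re
from datetime import date

RDF_BLOCK = re.compile(r"<rdf:RDF\b.*?</rdf:RDF>", re.DOTALL)
# "label:authority" is a made-up element used only for this example.
AUTHORITY = re.compile(r"<label:authority>(.*?)</label:authority>")


def sha256(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


LABEL = ("<rdf:RDF><label:authority>ExampleAuthority</label:authority>"
         "<quatro:ac>1</quatro:ac></rdf:RDF>")
URL = "http://www.example.org/"
PAGE = "<html><head>" + LABEL + "</head><body>Hello</body></html>"

# Hypothetical authority databases: authority name -> URL -> stored data.
AUTHORITY_DB = {
    "ExampleAuthority": {URL: {"hash": sha256(LABEL), "expires": date(2007, 1, 1)}},
}


def validate(url, html, today):
    """Return 'valid', 'invalid' or 'unknown' for the label found in html."""
    match = RDF_BLOCK.search(html)
    if match is None:
        return "unknown"  # no label in the resource
    label = match.group(0)
    authority = AUTHORITY.search(label)
    if authority is None or authority.group(1) not in AUTHORITY_DB:
        return "unknown"  # label from an authority we cannot check
    stored = AUTHORITY_DB[authority.group(1)].get(url)
    if stored is None:
        return "invalid"  # the authority never issued a label for this URL
    if sha256(label) != stored["hash"] or today > stored["expires"]:
        return "invalid"  # label was modified or has expired
    return "valid"


print(validate(URL, PAGE, date(2006, 6, 7)))
print(validate(URL, PAGE.replace("<quatro:ac>1", "<quatro:ac>0"), date(2006, 6, 7)))
print(validate(URL, "<html><body>No label here</body></html>", date(2006, 6, 7)))
```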

32
Technologies for improving accessibility to the
labels - II

Presenting the label's content and validation results to the end-user
  enabling existing web browsers and search engines
    to communicate with web services that are able to locate and validate labels in the retrieved resources, and
    to present the label's data and validation results in a format understandable by the end-user
  natural language generation technology can be exploited to present the label's content in the end-user's language, according to his/her knowledge and needs

33
QUATRO project - I

The EC-funded project Quality Assurance and Content Description, QUATRO (Safer Internet Programme)
Provides a common vocabulary and a machine processable RDF schema for quality labeling
  Known as RDF-CL, it allows a small amount of metadata to be applied to anything from a single resource, such as a web page, to millions of items on any number of web sites
  A default label can be set for a whole web site or set of web sites, with overrides coming into play as required
  Labels can be stored on the labeled site or in a database operated by the Labeling Authority
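
The idea of a site-wide default label with overrides can be illustrated as follows. This is only a sketch of the concept: the rule that the longest matching URL prefix wins, the property names and the URLs are assumptions made for the example and are not taken from the RDF-CL schema itself.

```python
def resolve_label(url, default_label, overrides):
    """Return the label for url: the default, refined by every matching override.

    Overrides are applied from the shortest to the longest matching URL prefix,
    so the most specific one wins (an assumed rule, for illustration only).
    """
    label = dict(default_label)
    matching = [prefix for prefix in overrides if url.startswith(prefix)]
    for prefix in sorted(matching, key=len):
        label.update(overrides[prefix])
    return label


default_label = {"quatro:ac": "0", "dc:audience": "General public"}
overrides = {
    "http://www.example.org/shop/": {"quatro:ac": "1"},
    "http://www.example.org/shop/kids/": {"dc:audience": "Children"},
}

print(resolve_label("http://www.example.org/index.html", default_label, overrides))
print(resolve_label("http://www.example.org/shop/kids/toys.html", default_label, overrides))
```

A handful of rules of this kind is enough to describe every page of a large site, which is how a small amount of metadata can cover millions of items.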

34
QUATRO project - II

The QUATRO vocabulary is divided into four categories
  General criteria, such as whether the labeled site includes a privacy statement, contact point, etc.
  Criteria for labeling to ensure accuracy of information, such as the content provider's credentials and appropriate disclosure of funding
  Criteria for labeling to ensure compliance with rules and legislation for e-business, such as fair marketing practices and measures to protect children
  Terms used in operating the labeling scheme itself, such as the date the label was issued, when it was last reviewed and by whom
Three different domain specific vocabularies developed by QUATRO partners
  ICRA: nudity, sexual content, violence, ...
  IQUA: integrity, responsibility, confidentiality, protection of intellectual and industrial property rights, ...
  WMA: health related content and services

35
QUATRO project - III

QUATRO develops tools to support the exploitation of its labels
QUATRO proxy server (QUAPRO)
  Takes as input a URL, either from a search engine or a browser, and examines whether there are labels inside
  Parses the label (only for QUATRO-based labels) in order to check its validity in terms of
    the label itself, or
    the URL's content (in QUATRO, this is restricted to pages with pornographic content)
  Returns a result on the label's validity (valid, invalid, unknown)
A browser extension, the Metadata Visualizer (ViQ)
A search engine wrapper, the Label Display Interface (LADI): a web interface displaying annotated search results that link to the corresponding labels

36
QUATRO project - IV: Architecture
[Architecture diagram. Components shown: the Web, ViQ (the browser plug-in), LADI (the search engine data wrapper), QUAPRO (the QUATRO proxy), FilterX (the content analyser), and the labelling authorities' databases reached through DAcc (the Data ACCess interface). The connections between components are marked SOAP.]
37
QUATRO project LADI
38
QUATRO project ViQ
39
QUATRO project - V

QUATRO addresses the needs of both the labeling operator and the end-user
  QUATRO's technology for creating machine processable labels represents the first step towards a platform that will support the work of the labeling operator
  Technology for automated web content analysis is still required

40
MedIEQ Project - I

MedIEQ: Quality Labeling of Medical Web content using Multilingual Information Extraction
EC-funded project
  DG SANCO (Health & Consumer Protection), Directorate C (Public Health and Risk Assessment)
  Public Health Programme, Priority Area 1: Health Information, Action 1.5: eHealth
Duration: 01/01/2006-01/01/2009

41
MedIEQ Project - II Project Objectives

develop a scheme for the quality labelling of
health related web content and provide the tools
supporting the creation, maintenance and access
of labelling data according to this scheme
specify a methodology for the content analysis of
medical web sites according to the MedIEQ scheme
and develop the tools that will implement it
integrate these technologies into a prototype
labelling system
demonstrate the resulting prototype in 7
different languages (Spanish, Catalan, German,
English, Greek, Czech, and Finnish) and two
labelling applications (third party
accreditation, classification)

42
MedIEQ Project - III
43
The MedIEQ Project - IV: Indicators for measuring the achievement of objectives

Reduction of the manual labelling time
  Labeling unlabeled sites, monitoring labeled sites, ...
Effective extraction from large collections of medical web content
  Processing time, precision of extracted data, ...
Effort required to customize the system for new languages
  7 languages to be supported
Implementation of an open architecture
  Effort required to integrate new techniques and tools, ...

44
W3C Content Label Incubator Group

WCL-XG aims to foster ideas for how content providers can inform search engines, aggregators and other data systems that
  their content is of a certain type,
  fulfils certain criteria or meets given requirements
Content labels will need to be applicable to a resource or a group of resources
It should be possible to build systems that in some way show the labels to be trustworthy
WCL-XG output may be used as input to a working group leading to a full W3C Recommendation
WCL-XG started its work in February 2006 and is scheduled to complete in June

45
Concluding

Quality labeling is an application area for
content analysis, knowledge management, and
intelligent interfaces
Strong need for the development of tools
assisting the work of labeling authorities
Browsers and search engines will have to be
enriched with functionalities that enable the
recognition of machine processable labels
(RDF-CL) and the presentation of their content to
the end-user

46
Concluding

Establishment of quality labels in practice cannot be enforced by measures
If content providers realize that
  content labels can be created and added easily to their content
  labeling authorities are equipped with technology that facilitates the monitoring of the provided web content against the labeling criteria
  search engines and browsers can inform users of the existence of quality labels and their features
they will adopt machine readable content labeling technology
  leading to an increase in labeled sites
  improving in turn the quality of the knowledge disseminated through the Web
What do you think? Is this the right way to proceed?

47
Useful Links

QUATRO site
http://www.quatro-project.org
MedIEQ site
http://www.medieq.org
WCL-XG
http://www.w3.org/2005/Incubator/wcl

48
Quality Labelling of Web Content
Vangelis Karkaletsis
Thank you!
3rd IFIP Conference on Artificial Intelligence Applications & Innovations (AIAI 2006), Athens, 7-9 June 2006
