Title: Web Mining for the Baltic Sea Open University (BSOU)
1 Web Mining for the Baltic Sea Open University (BSOU)
- by
- Siew Kheng Lee, Yaroslav Tsaruk,
- Kimmo Salmenjoki, Lorna Uden
- In collaboration with
- University of Applied Sciences, Kiel, Germany
- University of Vaasa, Finland
- Staffordshire University, UK
2 Content
- Introduction to BSOU and Web Mining
- Semantic Web based portal for BSOU
- Harvesting web information with Piggy Bank
- Conclusions
http://www.uwasa.fi/
3 Data → Web Mining
- Normal data mining, but now the information is more dynamic
- Methodologies
  - database design algorithms such as SOM, together with webservers and webpages using XML technologies and semantic approaches
- Technical approaches
  - OLAP, Yale and other commercial software, Web 2.0
- We look for a cross-technique between the traditional data mining and web mining methodologies
http://en.wikipedia.org/wiki/Data_mining
4 Introduction: BSN
- Baltic Sea Network (BSN) is a collaboration of higher education institutions, regional development organizations and other organizations of the Baltic Sea Region
- 9 countries: Denmark, Estonia, Finland, Germany, Latvia, Lithuania, Poland, Russia and Sweden
- Objective: to boost development, education and research through shared networking, joint projects and enhanced mobility
- International cooperation focusing on Welfare, Business Skills Management, Tourism and Information and Communication Technology
http://www.balticseanetwork.org/
5 BSOU Web Portals
- BSOU comprises 27 BSN educational institutions
- BSOU aims to provide joint degrees using the resources of its participating education institutions
- Draw all the information from the existing participating institutions together and organize it in a logical manner
- The challenge ahead of BSOU lies in finding this common bond, extracting it, and from there forming new structures that fit amongst them
- Language barriers, different systems, separate priorities, maintenance and efficient organization to overcome
6 Semantic Web based portal for BSOU
- For mining the websites of the partner universities, we have established BSOU's own composed ontology
- The BSOU ontology is organized in a hierarchical manner, taking the form of a parent-child node tree graph. The ontology was based on the Bologna model and the results of the data mining analysis of section 2.1
http://ttwin.techno.uwasa.fi/bsou_portal/
7 BSOU Ontology Structure
8 Semantic Information Design
- The BSOU ontology takes account of the specific subjects and courses obtained from the participating institutions. They are classified according to their departments.
- To enable easier handling of the bulk information, the BSOU ontology structure is divided into two parts: the BSOU courses ontology and the BSOU contact information ontology
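To make the two-part split concrete, here is a minimal JavaScript sketch of how such a parent-child hierarchy might be modeled; every name and value below is an illustrative assumption, not the actual BSOU schema.

```javascript
// Illustrative sketch only: a toy two-part hierarchy (courses vs.
// contact information), not the real BSOU ontology.
var bsouOntology = {
  courses: {                        // BSOU courses ontology
    'Environmental Management': {
      level: 'master',              // Bologna cycle of the degree
      subjects: ['Cleaner Production']
    }
  },
  contacts: {                       // BSOU contact information ontology
    'University of Vaasa': { country: 'Finland' }
  }
};

// Walk the parent-child tree and count its leaf nodes.
function countLeaves(node) {
  if (typeof node !== 'object' || node === null || Array.isArray(node)) return 1;
  var keys = Object.keys(node);
  if (keys.length === 0) return 1;
  var total = 0;
  for (var i = 0; i < keys.length; i++) total += countLeaves(node[keys[i]]);
  return total;
}
```

A tree walk like `countLeaves` is the kind of traversal the hierarchical ontology structure makes straightforward.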
9 Basic Semantic Web Approach
10 Courses → BSOU Portal
- How to combine information from 27 heterogeneous web sites?
- First a manual analysis into an XLS table → human overview
- Possible automation?
- Various Semantic Web based tool boxes: SemTalk (Visio add-on), KAON, KIM, Sesame, SMORE, PiggyBank, ...
11 Web Mining BSOU Websites for BSOU Portal Automation
- 2.0 Piggy Bank is an extension to the Firefox web browser that turns it into a Semantic Web browser
- 2.1 Raw data analysis of the related web sites
- 2.2 Information structure analysis of the underlying web sites
http://simile.mit.edu/piggy-bank/
12 2.0 Piggy Bank
- Piggy Bank uses existing information on the Web in a more useful and flexible way not offered by the original websites
- Ways to use Piggy Bank: collect, install scrapers, search and browse, map, save, tag, retrieve, combine, share
13 Piggy Bank Architecture
- Realized with its own user interface (search and browse), tags (for users), and screen scrapers (for web pages)
- Semantic Banks for sharing information
http://simile.mit.edu/semantic-bank/
14 2.1 Raw Data Analysis (for the related websites)
- Data mining was used to learn and outline the similarities and differences, and to spot the outliers and extremes, between these different contributors
- The analysis results reveal what state the web structure of these websites is in, in order to progress with the implementation of a semantic web solution
- Through data mining, meaningful data links and information can be discovered to construct a mental model structure before the integration of the semantic web solution
- The data collected from all the websites was compiled into a huge comparison spreadsheet (equivalent to 4 times A0 paper size). The fields (shown on the next slide) were compared and listed horizontally in the spreadsheet.
15 2.1 Raw Data Analysis (spreadsheet and fields)
16 2.2 Information Structure Analysis (of the underlying websites)
- To keep information loss to a minimum, every single unit of a contributor's website should have a decent HTML structure. This helps to eliminate prior pre-processing steps, all in the effort of obtaining the data directly in HTML format.
- An analysis report is drawn up to form a general idea of the quality of the web information structure of each and every page that a contributor has to offer.
- Assumptions: every participating institution has only one domain. Only English web pages are considered in this case study. Sites are distinguished by their page type, i.e. whether they contain actual information in HTML or are merely an HTML-masked page (e.g. consisting of links to .doc, .pdf, .ppt files).
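The page-type distinction above can be sketched as a small classifier; the function name and the one-half threshold are our own illustrative choices, not part of the study's tooling.

```javascript
// Hypothetical sketch: a page whose links point mostly at .doc/.pdf/.ppt
// files is treated as an "HTML-masked" page rather than a page carrying
// actual information in HTML.
function classifyPage(hrefs) {
  var masked = 0;
  for (var i = 0; i < hrefs.length; i++) {
    if (/\.(doc|pdf|ppt)$/i.test(hrefs[i])) masked++;
  }
  // If more than half the links lead to binary documents, treat the
  // page as a mere HTML mask around those files.
  return (hrefs.length > 0 && masked / hrefs.length > 0.5) ? 'masked' : 'html';
}
```

A masked page would be skipped by the HTML-oriented analysis, since its content lives in the linked files rather than in the markup.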
17 Case 1 Scriptlet
- Our approach is to use a Case 1 scriptlet to test a page for any information that proves to be scrapable
- Using this basic scriptlet enabled us to gauge whether the HTML page quality is fit for further scraping. When no results, or errors, come up from this basic scriptlet, we try to customize the generic scriptlet site-specifically and align it to our common ontology of Section 2.3, with very precise landmarks of the page identified
- If no further results are obtained, a quick check with the browser's DOM inspector is made at the specific location in the page to confirm the HTML tagging. The page can then be ignored, as it is considered not worth scraping with the scriptlet technique
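The Case 1 probe can be illustrated with a self-contained sketch. The real Piggy Bank scriptlets use `doc.evaluate()` with XPath inside the browser; here we approximate the same check with a regular expression over an HTML string so the sketch runs standalone, and the function name is our own.

```javascript
// Minimal sketch of the Case 1 idea: probe a page for table rows that
// might carry course data before investing in a site-specific scraper.
function looksScrapable(html) {
  // Count candidate rows: <tr> elements containing at least one <td>.
  var rows = html.match(/<tr[^>]*>[\s\S]*?<\/tr>/gi) || [];
  var hits = 0;
  for (var i = 0; i < rows.length; i++) {
    if (/<td[^>]*>/i.test(rows[i])) hits++;
  }
  return hits > 0; // any hit means the page merits a refined scriptlet
}
```

A page failing this cheap test would go to the DOM-inspector check described above before being discarded.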
18 Case 2 Scriptlet
- If results are obtained, then a more refined Case 2 scriptlet is further scrutinized to guarantee enhanced mining results
- If these results show similarity with other pages of the same domain, the site-specific scriptlet is pushed on to the scraping process. It is left at the refined stage if no common matches are found
- Case 2 Scriptlet
// begin of scriptlet code
var prefixRDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
var prefixDC = 'http://purl.org/dc/elements/1.1/';
var prefixBSOU = 'http://ttwin.techno.uwasa.fi/bsou_portal/ontology/bsou#';
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
  if (prefix == 'x') return namespace; else return null;
} : null;
var getNode = function(doc, contextNode, xpath, nsResolver) {
  return doc.evaluate(xpath, contextNode, nsResolver,
                      XPathResult.ANY_TYPE, null).iterateNext();
};
var cleanString = function(s) {
  return utilities.trimString(s);
};
var xpath = '/html/body/table/tbody/tr[1]/td[1][@class="tavatekst"]' +
            '/table[@class="tavatekst"]/tbody/tr';
19 Scraped Items and Domains (according to country)
- After scraping each participating institution's web domain, the results were integrated according to country grouping. The table shows the distribution of scraped domains and the number of information items scraped from these domains
- The country-grouping integration process ran smoothly, except for partial results from one partner that failed to be integrated into the merged country results
- It was found that the information, as encoded in the HTML of the page, contained unnecessary extra spaces between words to break a line, creating spaces that Piggy Bank could not handle. It is an example of poor HTML coding made to suit a page design rather than the content
20 Table of Scraped Items
21
- The fully merged results contain 16 fewer information items than the mathematically-counted number obtained by adding up the individual countries' items
- This happens because Piggy Bank automatically merges such items together: Piggy Bank employs a taxonomy that collapses identical information items and their distinguishing tags in its ontology when they are found to hold the same values
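The collapse of value-identical items can be sketched as follows; this is our own illustration of the merging effect, not Piggy Bank's internal code, and the `code`/`subject` identity key is an assumption.

```javascript
// Illustrative sketch: items whose property values coincide are
// collapsed into one, so the merged total can be smaller than the
// per-country sum (the 16-item discrepancy observed above).
function mergeItems(items) {
  var seen = {};
  var merged = [];
  for (var i = 0; i < items.length; i++) {
    // Identity is value-based: same code and subject => same item.
    var key = items[i].code + '|' + items[i].subject;
    if (!seen[key]) {
      seen[key] = true;
      merged.push(items[i]);
    }
  }
  return merged;
}
```

Two countries listing the same course thus contribute one merged item, not two.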
22 Conclusions
- Out of 27 contributors, only 15 institutions have decent HTML web pages in their respective domains. The other 45% of the contributors did not make the cut, mainly due to poor content aggregation in their own websites
- Some pages have such a deep web information hierarchy that it is rather impossible even to manually locate a piece of valuable information. These websites also have very little structure in their HTML code, making them poor sites for semantic tasks to navigate and build upon
- As we saw above, a lot of policy and web development issues need to be harmonized when communities of people or virtual organizations share their information and knowledge processes online. Hopefully the semantic web and its wider usage, together with common approaches to web development in general, will ease this aspect in the future
23 BSN Upcoming Events
- Institutes of higher professional education in future society
  - Location: Tallinn, Estonia
  - Date: 2nd-3rd November 2006
- Next Baltic Sea Network Partner Days
  - Location: Tallinn, Estonia
  - Date: January 2007
http://www.itcollege.ee/inenglish/index.php
24 References
- Baltic Sea Network, http://www.balticseanetwork.org/
- J. David Eisenberg: Forgiving Browsers Considered Harmful, A List Apart Articles, April 27, 2001
- D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, Cambridge, MA, 2001. ISBN 0-262-08290-X
- Siew Kheng Lee: Automated Web Information System via Semantic Web Technologies, Master thesis, University of Applied Sciences, Kiel, Germany, 2006
- SIMILE project: PiggyBank browser extension, http://simile.mit.edu/piggy-bank/, MIT
- Yaroslav Tsaruk, Kimmo Salmenjoki, Helmut Dispert: Using agent approach in developing a course sharing system, Information Society 2005 multiconference, Intelligent systems conference, Ljubljana, Slovenia, ISBN 961-6303-71-6, p. 362-365
- L. Uden, K. Salmenjoki, M. Hericko, L. Pavlic: Metadata for research and development collaboration between universities, Int. Journal of Metadata, Semantics and Ontologies, IJMSO, 2006 (to appear)
- Case 2 Scriptlet (continued)
var elmts = utilities.gatherElementsOnXPath(doc, doc, xpath, nsResolver);
for (var i = 1; i < 18; i++) {
  var elmt = elmts[i];
  var uri = cleanString(getNode(doc, elmt,
      './td[2]/text()[1]', nsResolver).nodeValue);
  model.addStatement(uri, prefixRDF + 'type', 'subject', false);
  try {
    var subject = cleanString(getNode(doc, elmt,
        './td[2]/text()[1]', nsResolver).nodeValue);
    model.addStatement(uri, prefixDC + 'subject', subject, false);
    model.addStatement(uri, prefixBSOU + 'level', 'master', false);
    model.addStatement(uri, prefixBSOU + 'course',
        'Masters in Environmental Management and Cleaner Production', false);
  } catch (e) { utilities.debugPrint(e); }
  try {
    var code = cleanString(getNode(doc, elmt,
        './td[1]/text()[1]', nsResolver).nodeValue);
    model.addStatement(uri, prefixBSOU + 'code', code, false);
  } catch (e) { utilities.debugPrint(e); }
  try {
    var ects = cleanString(getNode(doc, elmt,
        './td[3]/text()[1]', nsResolver).nodeValue);
    model.addStatement(uri, prefixBSOU + 'ects', ects, false);
  } catch (e) { utilities.debugPrint(e); }
}
25 Thank you