Title: Web Mining for the Baltic Sea Open University (BSOU)
1 Web Mining for the Baltic Sea Open University (BSOU)
- by
- Siew Kheng Lee, Yaroslav Tsaruk,
- Kimmo Salmenjoki, Lorna Uden
- In collaboration with
- University of Applied Sciences, Kiel, Germany
- University of Vaasa, Finland
- Staffordshire University, UK
2 Content
- Introduction to BSOU and Web Mining
- Semantic Web based portal for BSOU
- Harvesting web information with Piggy Bank
- Conclusions
http://www.uwasa.fi/
3 Data → Web Mining
- Normal data mining, but now the information is more dynamic
- Methodologies
  - database design algorithms such as SOM, together with webservers and webpages using XML technologies and semantic approaches
- Technical approaches
  - OLAP, Yale and other commercial software, Web 2.0
- We look for a cross-technique between the traditional data mining and web mining methodologies
http://en.wikipedia.org/wiki/Data_mining
4 Introduction: BSN
- Baltic Sea Network (BSN) is a collaboration of higher education institutions, regional development organizations and other organizations of the Baltic Sea Region
- 9 countries: Denmark, Estonia, Finland, Germany, Latvia, Lithuania, Poland, Russia and Sweden
- Objective: to boost development, education and research through shared networking, joint projects and enhanced mobility
- International cooperation focusing on Welfare, Business Skills Management, Tourism and Information and Communication Technology
http://www.balticseanetwork.org/
5 BSOU Web Portals
- BSOU comprises 27 BSN educational institutions
- BSOU aims to provide joint degrees using the resources of its participating education institutions
- Draw all the information from the existing participating institutions together and organize it in a logical manner
- The challenge ahead of BSOU lies in finding this common bond, extracting it, and from there forming new structures that fit amongst them
- Language barriers, different systems, separate priorities, maintenance and efficient organization to overcome
6 Semantic Web based portal for BSOU
- For mining the websites of the partner universities, we have established BSOU's own composed ontology
- The BSOU ontology is organized in a hierarchical manner, taking the form of a parent-child node tree graph. The ontology was based on the Bologna model and the results of the data mining analysis of section 2.1
http://ttwin.techno.uwasa.fi/bsou_portal/
7 BSOU Ontology Structure
8 Semantic Information Design
- The BSOU ontology takes account of the specific subjects and courses obtained from the participating institutions. They are classified according to their departments.
- To enable easier handling of the bulk information, the BSOU ontology structure is divided into two parts: the BSOU courses ontology and the BSOU contact information ontology
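To make the two-part split concrete, here is a minimal JavaScript sketch of how such a parent-child hierarchy might be modeled; every name and value below is an illustrative assumption, not the actual BSOU schema.

```javascript
// Illustrative sketch only: a toy two-part hierarchy (courses vs.
// contact information), not the real BSOU ontology.
var bsouOntology = {
  courses: {                        // BSOU courses ontology
    'Environmental Management': {
      level: 'master',              // Bologna cycle of the degree
      subjects: ['Cleaner Production']
    }
  },
  contacts: {                       // BSOU contact information ontology
    'University of Vaasa': { country: 'Finland' }
  }
};

// Walk the parent-child tree and count its leaf nodes.
function countLeaves(node) {
  if (typeof node !== 'object' || node === null || Array.isArray(node)) return 1;
  var keys = Object.keys(node);
  if (keys.length === 0) return 1;
  var total = 0;
  for (var i = 0; i < keys.length; i++) total += countLeaves(node[keys[i]]);
  return total;
}
```

A tree walk like `countLeaves` is the kind of traversal the hierarchical ontology structure makes straightforward.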
9 Basic Semantic Web Approach
10 Courses → BSOU Portal
- How to combine information from 27 heterogeneous web sites?
- First a manual analysis into an XLS table → human overview
- Possible automation?
- Various Semantic Web based tool boxes: SemTalk (Visio add-on), KAON, KIM, Sesame, SMORE, PiggyBank, ...
11 Web Mining BSOU Websites for BSOU Portal Automation
- 2.0 Piggy Bank is an extension to the Firefox web browser that turns it into a Semantic Web browser
- 2.1 Raw data analysis of the related web sites
- 2.2 Information structure analysis of the underlying web sites
http://simile.mit.edu/piggy-bank/
12 2.0 Piggy Bank
- Piggy Bank uses existing information on the Web in a more useful and flexible way not offered by the original websites
- Ways to use Piggy Bank: collect, install scrapers, search and browse, map, save, tag, retrieve, combine, share
13 Piggy Bank Architecture
- Realized with its own user interface (search and browse), tags (for users), and screen scrapers (for web pages)
- Semantic Banks for sharing information
http://simile.mit.edu/semantic-bank/
14 2.1 Raw Data Analysis (for the related websites)
- Data mining was used to learn and outline the similarities and differences, and to spot the outliers and extremes, between these different contributors
- The analysis results reveal what state the web structure of these websites is in, in order to progress with the implementation of a semantic web solution
- Through data mining, meaningful data links and information can be discovered to construct a mental model structure before the integration of the semantic web solution
- The data collected from all the websites was compiled into a huge comparison spreadsheet (equivalent to 4 times A0 paper size). The fields (shown on the next slide) were compared and listed horizontally in the spreadsheet.
15 2.1 Raw Data Analysis (spreadsheet and fields)
16 2.2 Information Structure Analysis (of the underlying websites)
- To keep information loss to a minimum, every single unit of a contributor's website should have a decent HTML structure. This helps to eliminate prior pre-processing steps, all in the effort of obtaining the data directly in HTML format.
- An analysis report is drawn up to form a general idea of the quality of the web information structure of each and every page that a contributor has to offer.
- Assumptions: every participating institution has only one domain. Only English web pages are considered in this case study. Sites are distinguished by their page type, i.e. whether they contain actual information in HTML or are merely an HTML-masked page (e.g. consisting of links to .doc, .pdf, .ppt files).
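The page-type distinction above can be sketched as a small classifier; the function name and the one-half threshold are our own illustrative choices, not part of the study's tooling.

```javascript
// Hypothetical sketch: a page whose links point mostly at .doc/.pdf/.ppt
// files is treated as an "HTML-masked" page rather than a page carrying
// actual information in HTML.
function classifyPage(hrefs) {
  var masked = 0;
  for (var i = 0; i < hrefs.length; i++) {
    if (/\.(doc|pdf|ppt)$/i.test(hrefs[i])) masked++;
  }
  // If more than half the links lead to binary documents, treat the
  // page as a mere HTML mask around those files.
  return (hrefs.length > 0 && masked / hrefs.length > 0.5) ? 'masked' : 'html';
}
```

A masked page would be skipped by the HTML-oriented analysis, since its content lives in the linked files rather than in the markup.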
17 Case 1 Scriptlet
- Our approach is to use a Case 1 scriptlet to test a page for any information that proves to be scrapable
- Using this basic scriptlet enabled us to gauge whether the HTML page quality is fit for further scraping. When no results, or errors, come up from this basic scriptlet, we try to customize the generic scriptlet site-specifically and align it to our common ontology of Section 2.3, with very precise landmarks of the page identified
- If no further results are obtained, a quick check with the browser's DOM inspector is made at the specific location in the page to confirm the HTML tagging. The page can then be ignored, as it is considered not worth scraping with the scriptlet technique
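The Case 1 probe can be illustrated with a self-contained sketch. The real Piggy Bank scriptlets use `doc.evaluate()` with XPath inside the browser; here we approximate the same check with a regular expression over an HTML string so the sketch runs standalone, and the function name is our own.

```javascript
// Minimal sketch of the Case 1 idea: probe a page for table rows that
// might carry course data before investing in a site-specific scraper.
function looksScrapable(html) {
  // Count candidate rows: <tr> elements containing at least one <td>.
  var rows = html.match(/<tr[^>]*>[\s\S]*?<\/tr>/gi) || [];
  var hits = 0;
  for (var i = 0; i < rows.length; i++) {
    if (/<td[^>]*>/i.test(rows[i])) hits++;
  }
  return hits > 0; // any hit means the page merits a refined scriptlet
}
```

A page failing this cheap test would go to the DOM-inspector check described above before being discarded.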
18 Case 2 Scriptlet
- If results are obtained, then a more refined Case 2 scriptlet is further scrutinized to guarantee enhanced mining results
- If these results show similarity with other pages of the same domain, the site-specific scriptlet is pushed on to the scraping process. It is left at the refined stage if no common matches are found
- Case 2 Scriptlet
// begin of scriptlet code
var prefixRDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
var prefixDC = 'http://purl.org/dc/elements/1.1/';
var prefixBSOU = 'http://ttwin.techno.uwasa.fi/bsou_portal/ontology/bsou#';
var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
  if (prefix == 'x') return namespace; else return null;
} : null;
var getNode = function(doc, contextNode, xpath, nsResolver) {
  return doc.evaluate(xpath, contextNode, nsResolver,
                      XPathResult.ANY_TYPE, null).iterateNext();
};
var cleanString = function(s) {
  return utilities.trimString(s);
};
var xpath = '/html/body/table/tbody/tr[1]/td[1][@class="tavatekst"]' +
            '/table[@class="tavatekst"]/tbody/tr';
19 Scraped Items and Domains (according to country)
- After scraping each participating institution's web domain, the results were integrated according to country grouping. The table shows the distribution of scraped domains and the number of information items scraped from these domains
- The country-grouping integration process ran smoothly, except for partial results from one partner that failed to be integrated into the merged country results
- It was found that the information, as encoded in the HTML of the page, contained unnecessary extra spaces between words to break a line, creating spaces that Piggy Bank could not handle. It is an example of poor HTML coding made to suit a page design rather than the content
20 Table of Scraped Items
21
- The fully merged results contain 16 fewer information items than the mathematically-counted number obtained by adding up the individual countries' items
- This happens because Piggy Bank automatically merges such items together: Piggy Bank employs a taxonomy that collapses identical information items and their distinguishing tags in its ontology when they are found to hold the same values
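The collapse of value-identical items can be sketched as follows; this is our own illustration of the merging effect, not Piggy Bank's internal code, and the `code`/`subject` identity key is an assumption.

```javascript
// Illustrative sketch: items whose property values coincide are
// collapsed into one, so the merged total can be smaller than the
// per-country sum (the 16-item discrepancy observed above).
function mergeItems(items) {
  var seen = {};
  var merged = [];
  for (var i = 0; i < items.length; i++) {
    // Identity is value-based: same code and subject => same item.
    var key = items[i].code + '|' + items[i].subject;
    if (!seen[key]) {
      seen[key] = true;
      merged.push(items[i]);
    }
  }
  return merged;
}
```

Two countries listing the same course thus contribute one merged item, not two.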
22 Conclusions
- Out of 27 contributors, only 15 institutions have decent HTML web pages in their respective domains. The other 45% of the contributors did not make the cut, mainly due to poor content aggregation in their own websites
- Some pages have such a deep web information hierarchy that it is rather impossible even to manually locate a piece of valuable information. These websites also have very little structure in their HTML code, making them poor sites for semantic tasks to navigate and build upon
- As we saw above, a lot of policy and web development issues need to be harmonized when communities of people or virtual organizations share their information and knowledge processes online. Hopefully the semantic web and its wider usage, together with common approaches to web development in general, will ease this aspect in the future
23 BSN Upcoming Events
- Institutes of higher professional education in future society
  - Location: Tallinn, Estonia
  - Date: 2nd-3rd November 2006
- Next Baltic Sea Network Partner Days
  - Location: Tallinn, Estonia
  - Date: January 2007
http://www.itcollege.ee/inenglish/index.php
24 References
- Baltic Sea Network, http://www.balticseanetwork.org/
- J. David Eisenberg: Forgiving Browsers Considered Harmful, A List Apart Articles, April 27, 2001
- D. Hand, H. Mannila, P. Smyth: Principles of Data Mining. MIT Press, Cambridge, MA, 2001. ISBN 0-262-08290-X
- Siew Kheng Lee: Automated Web Information System via Semantic Web Technologies, Master thesis, University of Applied Sciences, Kiel, Germany, 2006
- SIMILE project: PiggyBank browser extension, http://simile.mit.edu/piggy-bank/, MIT
- Yaroslav Tsaruk, Kimmo Salmenjoki, Helmut Dispert: Using agent approach in developing a course sharing system, Information Society 2005 multiconference, Intelligent systems conference, Ljubljana, Slovenia, ISBN 961-6303-71-6, p. 362-365
- L. Uden, K. Salmenjoki, M. Hericko, L. Pavlic: Metadata for research and development collaboration between universities, Int. Journal of Metadata, Semantics and Ontologies, IJMSO, 2006 (to appear)
- Case 2 Scriptlet (continued)
var elmts = utilities.gatherElementsOnXPath(doc, doc, xpath, nsResolver);
for (var i = 1; i < 18; i++) {
  var elmt = elmts[i];
  var uri = cleanString(getNode(doc, elmt,
      './td[2]/text()[1]', nsResolver).nodeValue);
  model.addStatement(uri, prefixRDF + 'type', 'subject', false);
  try {
    var subject = cleanString(getNode(doc, elmt,
        './td[2]/text()[1]', nsResolver).nodeValue);
    model.addStatement(uri, prefixDC + 'subject', subject, false);
    model.addStatement(uri, prefixBSOU + 'level', 'master', false);
    model.addStatement(uri, prefixBSOU + 'course',
        'Masters in Environmental Management and Cleaner Production', false);
  } catch (e) { utilities.debugPrint(e); }
  try {
    var code = cleanString(getNode(doc, elmt,
        './td[1]/text()[1]', nsResolver).nodeValue);
    model.addStatement(uri, prefixBSOU + 'code', code, false);
  } catch (e) { utilities.debugPrint(e); }
  try {
    var ects = cleanString(getNode(doc, elmt,
        './td[3]/text()[1]', nsResolver).nodeValue);
    model.addStatement(uri, prefixBSOU + 'ects', ects, false);
  } catch (e) { utilities.debugPrint(e); }
}
25 Thank you