Transcript and Presenter's Notes

Title: Web Mining for the Baltic Sea Open University (BSOU)

1
Web Mining for the Baltic Sea Open University
(BSOU)
  • by
  • Siew Kheng Lee, Yaroslav Tsaruk,
  • Kimmo Salmenjoki, Lorna Uden
  • In collaboration with
  • University of Applied Sciences, Kiel, Germany
  • University of Vaasa, Finland
  • Staffordshire University, UK

2
Content
  • Introduction to BSOU and Web Mining
  • Semantic Web based portal for BSOU
  • Harvesting web information with Piggy Bank
  • Conclusions

http://www.uwasa.fi/
3
Data → Web Mining
  • Normal data mining, but now the information is more
    dynamic
  • Methodologies: database design algorithms like SOM ->
    with web servers and web pages, using XML technologies
    and semantic approaches
  • Technical approaches: OLAP, Yale and other commercial
    software, Web 2.0
  • We look for a cross-technique between the traditional
    data mining and web mining methodologies

http://en.wikipedia.org/wiki/Data_mining
4
Introduction: BSN
  • The Baltic Sea Network (BSN) is a collaboration of
    higher education institutions, regional development
    organizations & other organizations of the Baltic Sea
    Region
  • 9 countries: Denmark, Estonia, Finland, Germany,
    Latvia, Lithuania, Poland, Russia and Sweden
  • Objective: to boost development, education & research
    through shared networking, joint projects & enhanced
    mobility
  • International cooperation focusing on Welfare, Business
    Skills & Management, Tourism and Information &
    Communication Technology

http://www.balticseanetwork.org/
5
BSOU Web Portals
  • BSOU comprises 27 BSN educational institutions
  • BSOU aims to provide joint degrees built on the
    resources of its participating education institutions
  • The goal is to draw together all the information from
    the participating institutions and organize it in a
    logical manner
  • The challenge ahead of BSOU lies in finding this common
    ground, extracting it, and forming new structures that
    fit across the institutions
  • Language barriers, different systems, separate
    priorities, maintenance and efficient organization
    remain to be overcome

6
Semantic Web based portal for BSOU
  • For mining the websites of the partner universities, we
    have established BSOU's own composed ontology
  • The BSOU ontology is organized in a hierarchical
    manner, resembling a parent-child node tree. The
    ontology was based on the Bologna model and on the
    results of the data mining analysis of Section 2.1

http://ttwin.techno.uwasa.fi/bsou_portal/
7
BSOU Ontology Structure
8
Semantic Information Design
  • The BSOU ontology takes into account the specific
    subjects and courses obtained from the participating
    institutions, classified according to their departments
  • To make the bulk information easier to handle, the BSOU
    ontology is divided into two parts: the BSOU courses
    ontology and the BSOU contact information ontology
    (sketched below)
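
A minimal sketch, as a JavaScript object literal, of how the
two-part hierarchy can be pictured. The course fields (code,
subject, level, ects) come from the Case 2 scriptlet later in
this presentation; the contact fields are assumptions, not the
actual ontology file:

    // Illustrative parent-child view of the two BSOU
    // sub-ontologies, not the actual ontology definition.
    var bsouOntology = {
        courses: {                  // BSOU courses ontology
            institution: {
                department: {
                    course: ['code', 'subject', 'level', 'ects']
                }
            }
        },
        contacts: {                 // BSOU contact information ontology
            institution: {
                contactPerson: ['name', 'email', 'phone']  // assumed fields
            }
        }
    };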

9
Basic Semantic Web Approach
10
Courses & BSOU Portal
  • How to combine information from 27 heterogeneous
    websites?
  • First a manual analysis into an XLS table -> human
    overview
  • Possible automation?
  • Various Semantic Web based toolboxes: SemTalk (Visio
    add-on), KAON, KIM, Sesame, SMORE, Piggy Bank, ...

11
Web Mining BSOU Websites for BSOU Portal
Automation
  • 2.0 Piggy Bank is an extension to the Firefox
    web browser that turns it into a Semantic Web
    Browser
  • 2.1 Raw data analysis for the related web sites
  • 2.2 Information structure analysis of underlying
    web sites

http://simile.mit.edu/piggy-bank/
12
2.0 Piggy Bank
  • Piggy Bank uses existing information on the Web in more
    useful and flexible ways than the original websites
    offer
  • Ways to use Piggy Bank: collect, install scrapers,
    search & browse, map, save, tag, retrieve, combine,
    share

13
Piggy Bank Architecture
  • Realized with its own user interface (search & browse),
    tags (for users), and screen scrapers (for web pages)
  • Semantic Banks for sharing information

http://simile.mit.edu/semantic-bank/
14
2.1 Raw Data Analysis (for the related websites)
  • Data mining was used to outline the similarities and
    differences, and to spot the outliers and extremes
    among the different contributors
  • The analysis results reveal the quality of the web
    structure of these websites, as needed to progress with
    the implementation of a semantic web solution
  • Through data mining, meaningful data links and
    information can be discovered to construct a mental
    model of the structure before the integration of the
    semantic web solution
  • The data collected from all the websites was compiled
    into a huge comparison spreadsheet (equivalent to 4
    times A0 paper size). The fields (shown on the next
    slide) were compared and listed horizontally in the
    spreadsheet

15
2.1 Raw Data Analysis (spreadsheet and fields)
16
2.2 Information Structure Analysis (of
underlying websites)
  • To minimize information loss, every single unit of a
    contributor's website should have a decent HTML
    structure. This helps eliminate pre-processing steps
    whose only purpose is to obtain the data in HTML format
  • An analysis report is drawn up to form a general idea
    of the quality of the web information structure of each
    and every page that a contributor has to offer
  • Assumptions: every participating institution has only
    one domain. Only English web pages are considered in
    this case study. Sites are distinguished by page type,
    i.e. whether they contain actual information in HTML or
    are merely an HTML-masked page (e.g. one consisting of
    links to .doc, .pdf, .ppt files); see the sketch below
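
A hedged sketch of how the page-type distinction could be
automated; the helper itself and the 50% threshold are
assumptions, not part of the reported analysis:

    // Flag an "HTML mask" page: one that mostly links out to
    // .doc/.pdf/.ppt files instead of carrying content in HTML.
    function isMaskPage(doc) {
        var links = doc.getElementsByTagName('a');
        var binary = 0;
        for (var i = 0; i < links.length; i++) {
            if (/\.(doc|pdf|ppt)(\?.*)?$/i.test(links[i].href)) binary++;
        }
        // treat the page as a mask if most links point to binary files
        return links.length > 0 && binary / links.length > 0.5;
    }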

17
Case 1 Scriptlet
  • Our approach is to use a Case 1 scriptlet to test
    whether any information on the page proves to be
    scrapable (see the sketch after this list)
  • Using this basic scriptlet enabled us to gauge whether
    the HTML page quality is fit for further scraping. When
    no results, or errors, come out of this basic
    scriptlet, we try to customize the generic scriptlet
    site-specifically and align it to our common ontology
    of Section 2.3, with very precise landmarks of the page
    being identified
  • If still no results are obtained, a quick check with
    the browser's DOM inspector is made at the specific
    location in the page to confirm the HTML tagging. Then
    the page can be ignored, as it is considered not worth
    scraping with the scriptlet technique
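
A minimal sketch of what such a generic Case 1 scriptlet might
look like, assuming the same Piggy Bank helpers (doc,
utilities) that appear in the Case 2 scriptlet on the next
slide; the coarse XPath and the array-like return value are
assumptions, not the project's actual test code:

    // Hypothetical generic probe: does the page expose any
    // table structure that a scraper could hold on to?
    var xpath = '//table//tr';   // deliberately coarse landmark
    var rows = utilities.gatherElementsOnXPath(doc, doc, xpath, null);
    if (!rows || rows.length == 0) {
        // nothing scrapable: candidate for the DOM-inspector
        // check, then the page is skipped
        utilities.debugPrint('no scrapable structure found');
    }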

18
Case 2 Scriptlet
  • If results are obtained, a more refined Case 2
    scriptlet is applied to guarantee enhanced mining
    results
  • If these results show similarity with other pages of
    the same domain, the site-specific scriptlet is pushed
    into the scraping process. It is left at the refined
    stage if no common matches are found
  • Case 2 Scriptlet:

    // begin of scriptlet code
    var prefixRDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';
    var prefixDC = 'http://purl.org/dc/elements/1.1/';
    var prefixBSOU = 'http://ttwin.techno.uwasa.fi/bsou_portal/ontology/bsou';
    var namespace = doc.documentElement.namespaceURI;
    var nsResolver = namespace ? function(prefix) {
        if (prefix == 'x') return namespace; else return null;
    } : null;
    // evaluate an XPath expression, return the first matching node
    var getNode = function(doc, contextNode, xpath, nsResolver) {
        return doc.evaluate(xpath, contextNode, nsResolver,
            XPathResult.ANY_TYPE, null).iterateNext();
    };
    var cleanString = function(s) {
        return utilities.trimString(s);
    };
    // site-specific landmark for the course rows on the page
    var xpath = '/html/body/table/tbody/tr[1]/td[1][@class="tavatekst"]'
        + '/table[@class="tavatekst"]/tbody/tr';
    var elmts = utilities.gatherElementsOnXPath(doc, doc, xpath, nsResolver);
    for (var i = 1; i < 18; i++) {
        var elmt = elmts[i];
        var uri = cleanString(getNode(doc, elmt, './td[2]/text()[1]',
            nsResolver).nodeValue);
        model.addStatement(uri, prefixRDF + 'type', 'subject', false);
        try {
            var subject = cleanString(getNode(doc, elmt, './td[2]/text()[1]',
                nsResolver).nodeValue);
            model.addStatement(uri, prefixDC + 'subject', subject, false);
            model.addStatement(uri, prefixBSOU + 'level', 'master', false);
            model.addStatement(uri, prefixBSOU + 'course',
                'Masters in Environmental Management and Cleaner Production',
                false);
        } catch (e) { utilities.debugPrint(e); }
        try {
            var code = cleanString(getNode(doc, elmt, './td[1]/text()[1]',
                nsResolver).nodeValue);
            model.addStatement(uri, prefixBSOU + 'code', code, false);
        } catch (e) { utilities.debugPrint(e); }
        try {
            var ects = cleanString(getNode(doc, elmt, './td[3]/text()[1]',
                nsResolver).nodeValue);
            model.addStatement(uri, prefixBSOU + 'ects', ects, false);
        } catch (e) { utilities.debugPrint(e); }
    }

19
Scraped Items & Domains (according to country)
  • After scraping each participating institution's web
    domain, the results are integrated according to country
    grouping. The table shows the distribution of scraped
    domains and the number of information items scraped
    from these domains
  • The country-grouping integration process ran smoothly,
    except for partial results from one partner that failed
    to be integrated into the merged country results
  • It was found that the information, as encoded in HTML
    inside the page, contains unnecessary extra spaces
    between words to break lines, creating similar spaces
    that Piggy Bank could not handle. It is an example of
    HTML coded to suit a page design rather than the
    content (a possible normalization is sketched below)
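
One possible fix, sketched here rather than taken from the
project: extending the cleanString helper from the Case 2
scriptlet so that runs of layout whitespace collapse to single
spaces before the values reach Piggy Bank:

    var cleanString = function(s) {
        // collapse the extra spaces used for line-breaking
        // in the source HTML into single spaces
        return utilities.trimString(s).replace(/\s+/g, ' ');
    };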

20
Table of Scraped Items
21
  • The fully merged results contain 16 fewer information
    items than the arithmetic sum of the individual
    countries' item counts
  • This happens because Piggy Bank automatically merges
    such items: it employs a taxonomy that collapses
    identical information items and their distinguishing
    tags in its ontology when they are found to carry the
    same values (illustrated below)
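
An illustration of the effect, not Piggy Bank's internal code:
when two scraped items carry the same identifying values, only
one survives the merge, so the merged total drops below the
per-country sum. The choice of key fields is an assumption:

    function mergeByValue(items) {
        var seen = {};
        var merged = [];
        for (var i = 0; i < items.length; i++) {
            // items with identical subject and code values
            // collapse into a single merged item
            var key = items[i].subject + '|' + items[i].code;
            if (!seen[key]) {
                seen[key] = true;
                merged.push(items[i]);
            }
        }
        return merged;  // e.g. 16 fewer items than the plain sum
    }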

22
Conclusions
  • Out of 27 contributors, only 15 institutions have
    decent HTML web pages in their respective domains. The
    other 45% of the contributors did not make the cut,
    mainly due to poor content aggregation within their own
    websites
  • Some pages have such a deep web information hierarchy
    that it is rather impossible to locate a piece of
    valuable information even manually. These websites also
    have very little structure in their HTML code, making
    them poor sites for semantic tasks to navigate and
    build upon
  • As we saw above, many policy and web development issues
    need to be harmonized when communities of people or
    virtual organizations share their information and
    knowledge processes online. Hopefully the semantic web
    and its wider usage, together with common approaches to
    web development in general, will ease this further in
    the future

23
BSN Upcoming Events
  • Institutes of higher professional education in future
    society
  • Location: Tallinn, Estonia
  • Date: 2nd-3rd November 2006
  • Next Baltic Sea Network Partner Days
  • Location: Tallinn, Estonia
  • Date: January 2007

http://www.itcollege.ee/inenglish/index.php
24
References
  • Baltic Sea Network, http://www.balticseanetwork.org/
  • J. David Eisenberg: Forgiving Browsers Considered
    Harmful, A List Apart Articles, April 27, 2001
  • D. Hand, H. Mannila, P. Smyth: Principles of Data
    Mining. MIT Press, Cambridge, MA, 2001. ISBN
    0-262-08290-X
  • Siew Kheng Lee: Automated Web Information System via
    Semantic Web Technologies, Master's thesis, University
    of Applied Sciences, Kiel, Germany, 2006
  • SIMILE project: Piggy Bank browser extension,
    http://simile.mit.edu/piggy-bank/, MIT
  • Yaroslav Tsaruk, Kimmo Salmenjoki, Helmut Dispert:
    Using an agent approach in developing a course sharing
    system, Information Society 2005 multiconference,
    Intelligent Systems conference, Ljubljana, Slovenia,
    ISBN 961-6303-71-6, pp. 362-365
  • L. Uden, K. Salmenjoki, M. Hericko, L. Pavlic: Metadata
    for research and development collaboration between
    universities, Int. Journal of Metadata, Semantics and
    Ontologies, IJMSO, 2006 (to appear)

25
Thank you