:: DIAsDEM ::

1
DIAsDEM
  • Seminar Web Mining
  • WS 2003/2004
  • Ingo Kampe
  • Heiko Scharff

2
Content
  • Introduction and data mining context
  • DIAsDEM - functioning
  • New extensions

3
Introduction
  • problems

4
Introduction
  • known data in databases (DB2, Oracle, ...)
  • easy to analyse, for example with SQL,
    home-grown programs or data miners
  • but in enterprises 80% of the data is in text
    documents (MS Word, plain text files, text
    archives, ...)
  • the knowledge is there, but unusable

5
Introduction
  • example (same meaning, different structure)
  • Mr. Schröder earns EUR 20.000 per month.
  • Mister Schröder earns 20000,- /month.
  • What does it mean?
  • How to compare? How to analyse?
  • Do both sentences mean the same?
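The comparison problem above can be illustrated with a minimal Python sketch that reduces both sentences to the same (person, amount, period) triple. The patterns and the name `normalise` are assumptions made for this illustration and are not part of DIAsDEM; the dot is read as a thousands separator, as is usual in German notation.

```python
import re

# Hypothetical patterns, written only for the two example sentences.
PERSON = re.compile(r"\b(?:Mr\.|Mister)\s+(\w+)")
AMOUNT = re.compile(r"\d{1,3}(?:\.\d{3})+|\d+")   # 20.000 or 20000
PERIOD = re.compile(r"(?:per\s+|/)\s*(month|year)")


def normalise(sentence):
    """Return (name, amount, period), or None if a part is missing."""
    person = PERSON.search(sentence)
    amount = AMOUNT.search(sentence)
    period = PERIOD.search(sentence)
    if not (person and amount and period):
        return None
    # Assumption: '.' is a thousands separator, so it is simply dropped.
    value = int(amount.group(0).replace(".", ""))
    return (person.group(1), value, period.group(1))


a = normalise("Mr. Schröder earns EUR 20.000 per month.")
b = normalise("Mister Schröder earns 20000,- /month.")
print(a, b, a == b)
```

Hand-written patterns like these only cover the two spellings shown; the point is that a comparison becomes trivial once both sentences share one structured form.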

6
Introduction
  • data mining context

7
Introduction
  • knowledge has to be made analysable
  • desirable:
  • semantically structured knowledge
  • queryable knowledge
  • possible solution: XML
  • semantic tagging
  • analysable (XPath, XQuery, Tamino, ...)
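Once text is semantically tagged, it can be queried with path expressions. The following sketch uses the limited XPath subset of Python's standard `xml.etree.ElementTree`. The wrapper elements `archive` and `sentence` are assumptions made for this example; the inner tags follow the tagging example used in this presentation.

```python
import xml.etree.ElementTree as ET

# Assumed document layout: only <person>, <capital> and <period>
# come from the slides, the wrapper elements are hypothetical.
DOC = """<archive>
  <sentence><person>Mr. Schröder</person> <capital amount="20000 EUR">earns EUR 20.000</capital> <period>per month</period>.</sentence>
  <sentence><person>Mister Schröder</person> <capital amount="20000 EUR">earns 20000,-</capital> <period>/month</period>.</sentence>
</archive>"""

root = ET.fromstring(DOC)

# All tagged persons, anywhere in the archive.
persons = [p.text for p in root.findall(".//person")]

# All amounts, read from the normalised attribute instead of the raw text.
amounts = [c.get("amount") for c in root.findall(".//capital[@amount]")]

print(persons)
print(amounts)
```

The two sentences differ in their raw text, but the query on the `amount` attribute returns the same value for both.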

8
Introduction
  • for humans
  • Mr. Schröder earns EUR 20.000 per month.
  • Mister Schröder earns 20000,- /month.
  • useless for computational analysis
  • the only useful information:
  • Mister Schröder
  • 20000 Euro
  • month

9
Introduction
  • need to
  • find important information
  • mark important information
  • <person>Mr. Schröder</person>
  • <capital amount="20000 EUR">earns EUR
    20.000</capital>
  • <period>per month</period>.
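The find-and-mark step above can be sketched in Python as follows. This is only an illustration with hand-written patterns for the first example sentence; the function name `tag_sentence` and the regular expressions are assumptions and do not reflect how DIAsDEM itself finds information.

```python
import re
from xml.sax.saxutils import escape


def _capital(match):
    # Normalise "20.000" to "20000" for the attribute, keep the raw text.
    amount = match.group(1).replace(".", "")
    return f'<capital amount="{amount} EUR">{match.group(0)}</capital>'


def tag_sentence(sentence):
    """Wrap person, capital and period in XML tags (toy patterns)."""
    text = escape(sentence)  # protect &, < and > in the raw text
    text = re.sub(r"\b((?:Mr\.|Mister)\s+\w+)", r"<person>\1</person>", text)
    text = re.sub(r"earns\s+EUR\s+(\d{1,3}(?:\.\d{3})+|\d+)", _capital, text)
    text = re.sub(r"(per\s+(?:month|year))", r"<period>\1</period>", text)
    return text


print(tag_sentence("Mr. Schröder earns EUR 20.000 per month."))
```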

10
DIAsDEM
  • DIAsDEM

11
DIAsDEM
  • DIAsDEM: Datenintegration von Altlastdaten und
    semistrukturierten Dokumenten mit
    Mining-Verfahren (integration of legacy data and
    semi-structured documents with data mining
    techniques)
  • project of the Deutsche Forschungsgemeinschaft
    (German Research Foundation)
  • domain-specific knowledge is necessary (!!!)

12
DIAsDEM
  • functioning

13
DIAsDEM
  • two-phase model
  • phase 1: knowledge discovery
  • iterative process (with expert knowledge)
  • training phase with a training text archive
  • finding segments (clusters) and semi-automatic
    annotation
  • derivation of an unstructured XML DTD
  • phase 2: semantic tagging
  • applying the discovered clusters to new archives
  • intelligent tagging of new, unknown texts from
    the same domain
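The two phases can be sketched in Python as below. This is a toy illustration and not the DIAsDEM algorithm: a greedy word-overlap (Jaccard) clustering is swapped in for the real clustering step, the threshold is arbitrary, the tag names `Income` and `Relocation` stand in for names an expert would choose, and all sentences except the two Schröder examples are made up.

```python
import re
from xml.sax.saxutils import escape


def tokens(segment):
    return set(re.findall(r"\w+", segment.lower()))


def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0


def discover_clusters(segments, threshold=0.25):
    """Phase 1 (toy version): greedy single-pass clustering."""
    clusters = []
    for seg in segments:
        toks = tokens(seg)
        best = max(clusters, key=lambda c: jaccard(toks, c["vocab"]),
                   default=None)
        if best is not None and jaccard(toks, best["vocab"]) >= threshold:
            best["members"].append(seg)
            best["vocab"] |= toks
        else:
            clusters.append({"members": [seg], "vocab": set(toks)})
    return clusters


def tag_new(segment, named_clusters, threshold=0.25):
    """Phase 2: tag an unseen segment with the best-matching cluster."""
    toks = tokens(segment)
    tag, score = max(((t, jaccard(toks, v))
                      for t, v in named_clusters.items()),
                     key=lambda pair: pair[1])
    text = escape(segment)
    return f"<{tag}>{text}</{tag}>" if score >= threshold else text


# Training archive: two slide examples plus two invented sentences.
TRAINING = [
    "Mr. Schröder earns EUR 20.000 per month.",
    "Mister Schröder earns 20000,- /month.",
    "The company moved to Leipzig.",
    "The company moved to Berlin.",
]
clusters = discover_clusters(TRAINING)

# Expert step: clusters get meaningful names (hypothetical names here).
named = {"Income": clusters[0]["vocab"], "Relocation": clusters[1]["vocab"]}

print(tag_new("Mrs. Meier earns EUR 3.000 per month.", named))
print(tag_new("It rained.", named))
```

Segments that match no cluster well enough stay untagged, which mirrors the idea that only texts from the same domain can be tagged reliably.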

14
DIAsDEM
  • figure: Winkler (2003b), page 6

15
DIAsDEM
  • expert knowledge is necessary to achieve good
    semantic tagging
  • What is needed?
  • <person>Mr. Schröder</person>
  • or
  • <title>Mr.</title>
  • <name>Schröder</name>

16
DIAsDEM
  • steps in DIAsDEM
  • finding segments (for example sentences) in the
    training texts by using thesauri and knowledge of
    named entities (persons, ...)
  • building an unstructured XML DTD
  • clustering of similar text elements (cluster
    names taken from the descriptors dominating each
    cluster)
  • renaming of clusters by experts
  • annotation of training texts
  • building a final XML DTD (for querying, XML-based
    databases like Tamino, data miners, ...)
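How a DTD could be derived from the final tag names can be sketched in a few lines of Python. This assumes that "unstructured" means a flat DTD in which every tag may appear anywhere in the document, without nesting or ordering; the function name `build_flat_dtd`, the root element `Document` and the tag list are hypothetical.

```python
def build_flat_dtd(root, tags, attributes=None):
    """Return a flat DTD: mixed content, every tag allowed anywhere."""
    if not tags:
        raise ValueError("at least one tag is required")
    attributes = attributes or {}
    # Mixed-content model: text interleaved with any of the tags.
    lines = [f"<!ELEMENT {root} (#PCDATA | {' | '.join(tags)})*>"]
    for tag in tags:
        lines.append(f"<!ELEMENT {tag} (#PCDATA)>")
        for attr in attributes.get(tag, []):
            lines.append(f"<!ATTLIST {tag} {attr} CDATA #IMPLIED>")
    return "\n".join(lines)


dtd = build_flat_dtd("Document",
                     ["person", "capital", "period"],
                     {"capital": ["amount"]})
print(dtd)
```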

17
Extensions
  • new extensions

18
Extensions
  • main goal
  • searching the internet for documents that match
    a user specification
  • downloading hypertext documents
  • extracting plain text from hypertext documents
  • importing plain text into DIAsDEM collection

19
Extensions
  • querying Google

20
Extensions - Google
  • the user enters the search words (panel)
  • Google is queried through the Google API with
    these search words
  • the result list of URLs (currently only 10,
    limited by Google) is automatically exported as
    a list to a text file

21
Extensions
  • processing and import

22
Extensions - Processing and Import
  • reading the URL list (exported text file)
  • downloading hypertext files into a directory and
    renaming the files (enumeration)
  • detagging the files
  • cleaning hypertext documents
  • deleting comments and tags
  • replacing special characters (not yet
    implemented)
  • importing files into the DIAsDEM collection
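The processing steps above can be sketched with the Python standard library. This is not the extension's actual code: the function names, the enumerated file names such as `0001.txt` and the UTF-8 decoding are assumptions. Note that `HTMLParser` also decodes character references such as `&ouml;`, which goes beyond what the slides report as implemented.

```python
import re
from html.parser import HTMLParser
from pathlib import Path
from urllib.request import urlopen


class _TextExtractor(HTMLParser):
    """Collects text; tags and comments are dropped by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def detag(html):
    """Return the plain text of an HTML document."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def read_url_list(path):
    """Read one URL per line, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def download_and_detag(url_file, out_dir):
    """Download every URL and store its plain text as 0001.txt, ..."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for n, url in enumerate(read_url_list(url_file), start=1):
        with urlopen(url, timeout=30) as resp:
            # Assumption: pages are UTF-8; undecodable bytes are replaced.
            html = resp.read().decode("utf-8", errors="replace")
        (out / f"{n:04d}.txt").write_text(detag(html), encoding="utf-8")
```

The resulting plain-text files would then be handed to the DIAsDEM import, which is not shown here.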

23
Questions?
  • ?

24
Literature
  • Graubitz, H., Spiliopoulou, M., Winkler, K.
    (2001). The DIAsDEM Framework for Converting
    Domain-Specific Texts into XML Documents with
    Data Mining Techniques. In Proceedings of the
    First IEEE International Conference on Data
    Mining, pages 171-178, San Jose, CA, USA,
    November / December 2001. IEEE Computer Society,
    Los Alamitos.
  • Winkler, K., Spiliopoulou, M. (2003a). Text
    Mining in der Wettbewerberanalyse: Konvertierung
    von Textarchiven in XML-Dokumente. In
    Proceedings der 6. Konferenz der SAS-Anwender in
    Forschung und Entwicklung, pages 347-363, Shaker
    Verlag, Aachen, Germany.
  • Winkler, K. (2003b). Getting Started with
    DIAsDEM Workbench 2.1: A Case-Based Approach.
    Technical Report, 121 pages. HHL - Leipzig
    Graduate School of Management.