Text Mining Tools - PowerPoint PPT Presentation

Provided by: Li194

Title: Text Mining Tools


1
Text Mining Tools
  • (May be used for your case analysis project)

2
1. Web Spiders
  • SpidersRUs
  • - Developed by the Artificial Intelligence Lab at the University of Arizona
  • - Spidering, indexing & searching
  • You may use this tool (or other spiders) to
    collect documents for your case analysis project.
  • Free download (software & manual) at
  • http://ai.bpa.arizona.edu/spidersrus/index.html
  • Refer to the manual to see how to use it.
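
To make the idea of spidering concrete, here is a minimal sketch of a breadth-first spider using only the Python standard library. It is not how SpidersRUs works internally (the slides do not describe that); the names `extract_links`, `fetch`, and `crawl`, and the `max_pages` limit, are assumptions made for illustration.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag
from urllib.request import urlopen


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html, base_url):
    """Return absolute URLs (fragments removed) linked from html."""
    parser = LinkParser()
    parser.feed(html)
    return [urldefrag(urljoin(base_url, link)).url for link in parser.links]


def fetch(url):
    """Download a page as text (hypothetical default fetcher)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(start_url, max_pages=10, fetch_page=fetch):
    """Breadth-first crawl from start_url; returns {url: html}."""
    pages = {}
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        try:
            html = fetch_page(url)
        except OSError:
            continue  # skip pages that cannot be downloaded
        pages[url] = html
        for link in extract_links(html, url):
            # Follow only web links that have not been queued before.
            if link.startswith(("http://", "https://")) and link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Calling `crawl("http://ai.bpa.arizona.edu/spidersrus/index.html", max_pages=20)` would attempt to collect up to 20 pages starting from that address. A real spider should also respect robots.txt and pause between requests, which this sketch omits.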

3
Web Spiders architecture

4
Convert HTML files to text files
  • Many programs can convert HTML files to text files.
  • Three examples:
  • http://www.softinterface.com/Convert-Doc/Convert-Doc.htm
  • http://www.jetman.dircon.co.uk/software/web2text.html
  • http://www.jafsoft.com/detagger/
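
If you prefer to script the conversion yourself, the following is a minimal sketch based on Python's standard `html.parser` module. It does not reproduce the behavior of any of the tools listed above; the names `TextExtractor` and `html_to_text` are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps visible text and drops tags, scripts, and styles."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>.
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string on a single line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text pieces with spaces, then collapse runs of whitespace.
    return " ".join(" ".join(parser.parts).split())
```

Because the text pieces are joined with spaces, a word split by inline markup (for example `wor<b>ld</b>`) comes out as two tokens. For a class project this is usually acceptable, but manual cleaning of the converted files is still advisable.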

5
2. TextAnalyst 2.3
  • Main functions
  • Topic structure display
  • Distillation of the meaning of the text
  • Text summarization of user-specified length
  • Word search and semantic search
  • Clustering
  • Download the trial version at
  • http://www.megaputer.com/products/ta/
  • Go to its HELP menu to see the tutorial.
  • The full version is available on
  • machines IS-lab-12 and IS-lab-13 in CoLab
    (GITC 4323)
  • This software is mainly used to help you
    understand the contents of your document
    collection.

6
TextAnalyst 2.3
  • Semantic weight
  • - The semantic weight of a concept is a measure
    of its importance in your document.
  • - The semantic weight of the relationship
    between a concept and its parent concept is a
    measure of the strength of that relationship.

7
3. SimStat & WordStat
  • SimStat: a statistical program
  • WordStat: a text analysis program, running with
    SimStat
  • Most previous students used these two tools and
    the web spider to do the case analysis project.
  • Download the trial versions at
  • http://www.simstat.com/wordstat.htm
  • Download both SimStat and WordStat. Install SimStat
    first, then WordStat.
  • You can find demos on their website and also on
    your own machine after installing these two
    tools.
  • The full version is available on
  • machine IS-lab-11 in CoLab (GITC 4323)
  • To use these two programs on this machine, you
    need to use the following account:
  • username: cis634, password: cis634, domain: this
    computer

8
3. SimStat & WordStat (cont.)
  • Basic steps for using WordStat/SimStat
  • Before the following steps, you need to collect
    your documents using a spider, clean them
    manually, and convert them from .html files to
    .txt files.
  • 1. Load documents and convert them into a database
    file
  • Start -> Program file -> Provalis Research -> WordStat
    -> Document Conversion Wizard
  • This wizard lets you load your documents, set
    variables, etc., and finally save the result in a
    SimStat (.dbf) file.
  • 2. Open the .dbf file from SimStat.
  • In SimStat, open the .dbf file you got from the
    previous step.
  • Then choose Statistics -> Content Analysis to go to WordStat.
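
Step 1 above turns a folder of text files into a table with one record per document. The sketch below illustrates that idea outside WordStat. It writes a CSV file rather than a .dbf file, because the Python standard library has no .dbf writer, and the column names `document` and `text` are hypothetical, not the variables the wizard creates.

```python
import csv
from pathlib import Path


def build_document_table(text_dir, csv_path):
    """Write one row per .txt file: a document name and its full text."""
    rows = []
    for path in sorted(Path(text_dir).glob("*.txt")):
        text = path.read_text(encoding="utf-8", errors="replace")
        rows.append({"document": path.stem, "text": text})
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["document", "text"])
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)  # number of documents loaded
```

For example, `build_document_table("converted_txt", "documents.csv")` would load every .txt file in a (hypothetical) `converted_txt` folder. Extra columns such as source site or date could be added as further variables in the same way.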

9
3. SimStat & WordStat (cont.)
  • 3. Work in WordStat
  • Build a specialized dictionary for your document
    collection. It should contain the important
    categories and keywords you are interested in.
  • Do frequency analysis, clustering, etc., for the
    categories and keywords in your dictionary.
  • You may save the frequency results for the
    keywords/categories in a .dbf file and analyze
    them further in SimStat.
  • 4. Work in SimStat
  • You may load the file obtained in the previous step
    into SimStat to do further statistical analysis.
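
To show what a dictionary-based frequency analysis computes, here is a minimal sketch in Python. It is a simple keyword count, not WordStat's own procedure, and the categories and keywords in `DICTIONARY` are made-up examples that you would replace with terms relevant to your own collection.

```python
import re
from collections import Counter

# Hypothetical dictionary: category -> lowercase keywords of interest.
DICTIONARY = {
    "security": {"virus", "firewall", "password"},
    "commerce": {"price", "order", "customer"},
}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def category_frequencies(text, dictionary):
    """Count how often each category's keywords occur in text."""
    word_counts = Counter(tokenize(text))
    return {
        category: sum(word_counts[word] for word in keywords)
        for category, keywords in dictionary.items()
    }
```

Running `category_frequencies` on each document gives one row of category counts per document, which is the kind of table that can then be analyzed statistically.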