CSC 9010: Text Mining Applications

Description:

CSC 9010: Text Mining Applications Dr. Paula Matuszek Paula_A_Matuszek_at_glaxosmithkline.com (610) 270-6851 So What Next? Evaluating systems Systems available Some good ... – PowerPoint PPT presentation

Provided by: cscVilla
1
CSC 9010: Text Mining Applications
  • Dr. Paula Matuszek
  • Paula_A_Matuszek_at_glaxosmithkline.com
  • (610) 270-6851

2
So What Next?
  • Evaluating systems
  • Systems available
  • Some good resources

3
Evaluating Text Mining Systems
  • There are dozens of text mining tools and systems
    available:
      • commercial
      • open source
      • research
  • How do you decide which to use?

4
Determine Information Need
  • First step: what are you trying to find out?
  • Locate a specific piece of information?
  • Locate and capture a large amount of specific
    information?
  • Locate a specific document?
  • Get the gist of one or more documents?
  • Organize documents into groups?
  • Find out something about the overall domain which
    is reflected in a set of documents?
  • ???

5
Determine Environment
  • What operating system?
  • What document formats?
      • ASCII or something richer?
  • What level of software maturity?
      • COTS, with support available, maybe already tuned
        for your specific problem
      • Open source or other fairly stable
      • Research tool
  • What is the cost justification?

6
Thinking About Information Needs
  • How specific is your need?
  • How much do you know already?
  • How big a corpus? How well-defined?
  • One-time question or continuing?
  • Incremental or episodic?

7
Information Extraction Tools
  • Extract specific information, probably from a
    large number of documents.
  • What's the typical precision and recall?
  • Knowledge base (KB) info
      • What entities are already defined?
      • How easy is it to add enumerated lists?
      • How easy is it to add patterns?
  • What document formats does it accept?
  • Performance?
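
As a rough illustration of the questions on this slide, the sketch below combines an enumerated list and a pattern into a toy extractor, then scores it for precision and recall against a hand-labeled answer set. The names COMPANIES, PHONE, extract and precision_recall, the sample sentence and the gold labels are all assumptions made up for this example; they do not come from any of the tools discussed in this deck.

```python
import re

# Hypothetical enumerated list (gazetteer) of known entity names.
COMPANIES = {"IBM", "Oracle", "SPSS"}
# Hypothetical pattern: US-style phone numbers such as (610) 555-0100.
PHONE = re.compile(r"\(\d{3}\) \d{3}-\d{4}")

def extract(text):
    """Return the set of (type, string) pairs found in text."""
    words = re.findall(r"[A-Za-z]+", text)
    found = {("company", w) for w in words if w in COMPANIES}
    found |= {("phone", m) for m in PHONE.findall(text)}
    return found

def precision_recall(predicted, gold):
    """Precision = correct / predicted; recall = correct / gold."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Made-up test sentence and hand-labeled gold standard.
text = "Call IBM at (610) 555-0100 or ask Oracle about Verity."
gold = {
    ("company", "IBM"),
    ("company", "Oracle"),
    ("company", "Verity"),
    ("phone", "(610) 555-0100"),
}
predicted = extract(text)
precision, recall = precision_recall(predicted, gold)
print(precision, recall)
```

Here everything the extractor returns is correct, but it misses "Verity" because that name is not in the enumerated list. Adding it is a one-line change, which is the kind of ease-of-extension question the slide asks about.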

8
Document Retrieval
  • Need a specific document or some information
  • For spidering:
      • Coverage, including kinds of documents
      • Performance, which affects refresh speed
      • Flexibility/configuration of spiders
      • Special needs? (focused crawling)
  • For retrieval:
      • Relevance ranking
      • Performance
      • Richness of query engine
      • Precision and recall
      • Query broadening and narrowing
  • For both: ease of use
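
One simple way to picture query broadening and narrowing is with Boolean search over an inverted index: requiring every term (AND) narrows the result set, while accepting any term (OR) broadens it. The sketch below is a minimal illustration under that assumption; the three documents and the function names build_index, search_all and search_any are invented for the example and do not describe how any particular retrieval system works.

```python
from collections import defaultdict

# Made-up mini corpus: document id -> text.
docs = {
    1: "text mining tools for patents",
    2: "data mining conference",
    3: "text retrieval and ranking",
}

def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search_all(index, terms):
    """Narrow: documents containing every term (AND)."""
    sets = [index.get(t, set()) for t in terms]
    return set.intersection(*sets) if sets else set()

def search_any(index, terms):
    """Broaden: documents containing at least one term (OR)."""
    return set().union(*(index.get(t, set()) for t in terms))

index = build_index(docs)
narrow = search_all(index, ["text", "mining"])
broad = search_any(index, ["text", "mining"])
print(narrow, broad)
```

Narrowing tends to raise precision at the cost of recall, and broadening does the reverse, which is why the two items sit next to each other in the checklist above.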

9
Document Categorization
  • You need to sort your documents
  • Does the system perform in real time?
  • How many categories total can it handle?
  • How many categories/document? Flat or
    hierarchical?
  • Categories defined automatically or by hand?
      • Automatically:
          • Assumes significant vocabulary differences among
            different groups.
          • Requires training examples
      • By hand: assumes
          • Time to do it!
          • Readily identifiable characteristics to
            distinguish groups
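
To make the "automatically" case concrete, here is a minimal sketch of a categorizer learned from training examples. It uses multinomial Naive Bayes with add-one smoothing, which is one common technique chosen here for illustration, not a method named in this deck. The training texts, the two categories and the function names train and classify are assumptions invented for the example.

```python
import math
from collections import Counter, defaultdict

def train(examples):
    """examples: list of (text, category). Returns counts needed to classify."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    doc_counts = Counter()               # category -> number of training docs
    vocab = set()
    for text, cat in examples:
        words = text.lower().split()
        word_counts[cat].update(words)
        doc_counts[cat] += 1
        vocab.update(words)
    return word_counts, doc_counts, vocab

def classify(text, model):
    """Return the category with the highest log-probability score."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_cat, best_score = None, -math.inf
    for cat in doc_counts:
        score = math.log(doc_counts[cat] / total_docs)
        total_words = sum(word_counts[cat].values())
        for w in text.lower().split():
            if w in vocab:  # words never seen in training are ignored
                count = word_counts[cat][w] + 1  # add-one smoothing
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_cat, best_score = cat, score
    return best_cat

# Made-up training examples with clearly different vocabularies.
examples = [
    ("protein gene expression assay", "biology"),
    ("gene sequence protein binding", "biology"),
    ("patent claim filing inventor", "patents"),
    ("inventor patent prior art", "patents"),
]
model = train(examples)
label = classify("new gene assay", model)
print(label)
```

The example works only because the two groups share no vocabulary, which is the assumption the slide calls out. With overlapping vocabularies, far more training examples would be needed.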

10
Document Clustering
  • What is going on in this domain?
  • What features of the document are used to cluster?
    Linguistic? Semantic? TF-IDF?
  • What methods are used for clustering? (How do we
    define "similar"?)
  • Any capability for incorporating domain
    knowledge?
  • Performance
  • Incremental? Or do you have to start over again
    to add new documents?
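
One common answer to "how do we define similar?" is to weight each term by TF-IDF and compare documents by the cosine of the angle between their weight vectors. The sketch below assumes that definition; the three tiny documents and the function names tfidf_vectors and cosine are invented for illustration, and real clustering tools may use quite different features and measures.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up documents: two about biology, one about patents.
docs = [
    "gene protein assay".split(),
    "protein gene binding".split(),
    "patent claim inventor".split(),
]
vecs = tfidf_vectors(docs)
sim_bio = cosine(vecs[0], vecs[1])    # share "gene" and "protein"
sim_cross = cosine(vecs[0], vecs[2])  # share no terms
print(sim_bio, sim_cross)
```

Note that the IDF weights depend on the whole corpus, so adding documents changes every vector. That is one reason the incremental question on this slide matters.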

11
Document Summarization
  • What do I have?
  • Sentence extraction or capture and generate?
  • How much can it be shortened?
  • How many documents at once?
  • Sentence extraction methods are heavily dependent
    on the method used to identify "important" words.
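
The dependence on "important" words can be seen in a small sentence-extraction sketch. It assumes the simplest possible definition of importance: a word is important if it is frequent in the text and is not a stopword, and a sentence is scored by the average importance of its words. The stopword list, the sample text and the function name summarize are assumptions made up for this example.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "of", "and", "is", "to", "in", "it", "are"}

def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, n_sentences=1):
    """Return the n highest-scoring sentences, in original order."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    sentences = [s.strip() for s in parts if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        tokens = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:n_sentences])
    return [sentences[i] for i in keep]

# Made-up three-sentence text.
text = ("Text mining extracts information from documents. "
        "The weather was pleasant. "
        "Mining tools rank documents.")
summary = summarize(text)
print(summary)
```

Changing the stopword list or swapping raw frequency for another weighting can change which sentence wins, which is the sensitivity the last bullet describes.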

12
Grab Bag of Systems Available: Entity or
Information Extraction
  • AeroText (Lockheed Martin)
  • GATE (U of Sheffield)
  • Sophia (CELI)
  • iMiner (IBM)
  • ClearTag (ClearForest)
  • Thing Finder (Inxight)
  • LexiQuest (SPSS)
  • Faustus/TextPRO (SRI)

13
Categorization/Clustering
  • Semio (Entrieva)
  • Oracle Text (Oracle)
  • Inxight Categorizer (Inxight)
  • Verity K2 (Verity)
  • Autonomy
  • ClearForest
  • LexiMine (SPSS)
  • iMiner, Lotus Discovery Server (IBM)

14
Summarizing
  • All over the place!
  • Every search engine
  • Mac OS 10.2 and later
  • Many others

15
What's Happening
  • Some specific domains are very hot or interesting
    or intriguing
  • Expertise finder
  • Patent retrieval, visualization
  • Reputation Minder
  • Biological text mining
  • Semantic web
  • In fact, anything web-related
  • ??

16
What's Happening
  • Some technologies are also gaining speed
  • Taxonomy identification/extraction
  • Question answering
  • Automatic markup for the semantic web, for
    instance
  • Integrated domain-based and statistical
    approaches
  • Machine learning of KBs

17
Some Useful Resources: Links
  • Portal text mining links, kept reasonably up to
    date
      • filebox.vt.edu/users/wfan/text_mining.html
      • www.cs.utexas.edu/users/pebronia/text-mining
  • A really excellent overview paper, still useful
    although from 2001
      • www.mitre.org/work/tech_papers/tech_papers_01/maybury_unstructured/maybury_unstructured.pdf
  • Best site to start with for software,
    conferences, etc.
      • www.kdnuggets.com/index.html

18
Useful Resources: Conferences
  • AAAI and IJCAI: Basic NL research; some good
    workshops and tutorials on text mining. Some of
    everything.
  • KDD: Text mining often included as a form of
    data mining, especially more statistical
    approaches. KDD Cup sometimes text-based.
  • SIGIR: Lots of information retrieval.
  • ACL: Lots of linguistics-based info, especially
    things like entity recognition and tagging.
  • Data mining conferences often include a text
    mining component. ICDM, for example.
  • Domain-specific conferences often include a text
    mining component too.

19
So Where Now?
  • You now all have a good background in the
    techniques and applications of text mining, and
    some ideas of how it's been applied.
  • Where do you think it will be in 10 years, and
    what will we be doing with it?