Text Mining Extraction WebBased Information Architectures - PowerPoint PPT Presentation

Slides: 34
Provided by: cjin
Transcript and Presenter's Notes

1
Text Mining -- Extraction: Web-Based Information
Architectures
  • MSEC 20-760, Mini II, Jaime Carbonell

2
General Topic: Text Extraction
  • Motivation: Text Mining
  • Context-Free Entity Extraction
  • Role-based Entity Extraction
  • Relational Extraction
  • eBusiness Applications

3
Text Mining (1)
  • The Need to Process Text Automatically
  • Text is meant to be read by humans, not programs.
  • Most useful information is stored as text.
  • (100 times as much online text as online DBs)
  • HTML web pages are text (with structuring tags).
  • Data Mining (covered later) operates on data
    tables (i.e. numbers, fixed fields, adherence to
    data models).

4
Text Mining (2)
  • The Need to Process Text Automatically
  • We need text -> data table transducers.
  • General Natural Language Understanding is still
    too hard.
  • But, can we solve simpler but useful
    sub-problems?
  • Yes: categorization of text by topic and
    extraction of certain kinds of information from
    free text or HTML-structured text are possible.

5
Text Mining (3)
  • Components of Text Mining
  • Categorization by topic or Genre
  • Introduced here; see Prof. Yang's lecture
  • Fact extraction from text
  • Topic of this class
  • Data Mining from DBs or extracted facts
  • Later lecture on Data Mining

6
Text Categorization (1)
  • Definition
  • Assign labels to each document or web-page
  • Labels may be topics such as Yahoo-categories
  • e.g. "finance," "sports," "news>world>asia>business"
  • Labels may be genres
  • e.g. "editorials," "movie-reviews," "news"
  • Labels may be binary
  • e.g. "interesting-to-me," "not-interesting-to-me"

7
Text Categorization (2)
  • Methods
  • Manual assignment (as in Yahoo)
  • Hand-coded rule based (as in Reuters)
  • (Usually: if the document contains a given
    Boolean combination of words, then assign it a
    specified category.)
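
The hand-coded rule approach above can be sketched as follows. This is a minimal illustration only: the two categories and their word rules are hypothetical, not taken from any real rule set such as the one the slide attributes to Reuters.

```python
import re

# Hypothetical rules (assumption): each category maps to a Boolean test
# over the set of words in the document. Real rule sets are far larger.
RULES = {
    "finance": lambda w: ("stock" in w or "shares" in w) and "price" in w,
    "sports": lambda w: "game" in w and ("score" in w or "team" in w),
}

def categorize(text):
    """Return every category whose Boolean word rule fires on the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return sorted(label for label, rule in RULES.items() if rule(words))

print(categorize("The stock price rose sharply"))
print(categorize("The team won the game"))
```

A document may match several rules or none, so the function returns a list of labels rather than a single one.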

8
Text Categorization (3)
  • Methods
  • Learning of document-label assignment function
  • Most new applications rely on machine learning
  • k-Nearest Neighbors (simple, powerful)
  • See Prof. Yang's lecture
  • Decision-tree induction (most common method)
  • Support-vector machines (newest method)
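
As one illustration of learning the document-label assignment, here is a small k-Nearest Neighbors sketch. The bag-of-words representation, the cosine similarity measure, and the six training sentences are all assumptions made for this example; the slides do not specify them.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def knn_label(text, training, k=3):
    """Label a document by majority vote among its k most similar examples."""
    query = bag_of_words(text)
    ranked = sorted(training,
                    key=lambda ex: cosine(query, bag_of_words(ex[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy labeled documents (assumption, invented for this sketch).
TRAINING = [
    ("shares fell as the bank cut its profit forecast", "finance"),
    ("the central bank raised interest rates", "finance"),
    ("investors sold shares after the profit warning", "finance"),
    ("the team scored twice in the final game", "sports"),
    ("the coach praised the team after the match", "sports"),
    ("the striker scored in the last game of the season", "sports"),
]

print(knn_label("the bank reported a profit on its shares", TRAINING))
```

With realistic data, stop-word removal and term weighting would matter; they are left out here to keep the voting step visible.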

9
Named Entity Identification I (1)
  • Purpose
  • To answer questions such as
  • Who is mentioned in these 100 Society articles?
  • What locations are listed in these 2000 web
    pages?
  • What companies are mentioned in these patent
    forms?
  • What products were evaluated by Consumer Reports
    this year?

10
Named Entity Identification I (2)
  • Example
  • President Clinton decided to send special trade
    envoy Mickey Kantor to the special Asian economic
    meeting in Singapore this week. Ms. Xuemei Peng,
    trade minister from China, and Mr. Hideto Suzuki
    from Japan's Ministry of Trade and Industry will
    also attend. Singapore, who is hosting the
    meeting, will probably be represented by its
    foreign and economic ministers. The Australian
    representative, Mr. Langford, will not attend,
    though no reason has been given. The parties hope
    to reach a framework for currency stabilization.

11
Named Entity Identification I (3)
  • Extracted Named Entities (NEs)
    PEOPLE               PLACES
    ------------------------------
    President Clinton    Singapore
    Mickey Kantor        Japan
    Ms. Xuemei Peng      China
    Mr. Hideto Suzuki    Australia
    Mr. Langford

12
Named Entity Identification II: Finite-State
Machines (1)
  • Definition of Finite State Acceptor (FSA)
  • An FSA is a directed graph
  • With a "start" node
  • With one or more "accepting" nodes

13
Named Entity Identification II: Finite-State
Machines (2)
  • Definition of Finite State Acceptor (FSA)
  • With link-labels matching input items
  • exact-match link labels
  • e.g. "China" matching only "China"
  • wildcard (?) match
  • e.g. "?" matches "100" or "China" or ...
  • feature-match
  • e.g. CAP matches any capitalized word
  • list-membership match
  • e.g. if HON-LIST = (Mr, Ms, Dr, President, ...)
  • it would match any of those words in the input

14
Named Entity Identification II: Finite-State
Machines (3)
  • Definition of Finite State Acceptor (FSA)
  • With an input source (e.g. string of words)
  • Outputs "YES" or "NO"
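
The FSA definition above can be sketched in a few lines. This is an illustrative sketch, not the course's implementation: the state names, the `FSA` class, and the exact contents of `HON_LIST` are assumptions. It models the four link types (exact match, wildcard, feature match, list membership) and tracks a set of active states so that a wildcard self-loop on the start node works.

```python
HON_LIST = {"Mr.", "Ms.", "Dr.", "President"}  # assumed contents

# Link matchers: each takes one input token and returns True or False.
def exact(word):
    return lambda tok: tok == word          # e.g. "China" matches only "China"

def wildcard(tok):
    return True                             # "?" matches any token

def cap(tok):
    return tok[:1].isupper()                # CAP: any capitalized word

def in_list(words):
    return lambda tok: tok in words         # list-membership match

class FSA:
    """Directed graph with one start node and a set of accepting nodes."""

    def __init__(self, start, accepting, links):
        self.start = start
        self.accepting = set(accepting)
        self.links = links                  # list of (source, matcher, target)

    def accepts(self, tokens):
        states = {self.start}
        for tok in tokens:
            # Follow every link whose label matches the current token.
            states = {dst for src, match, dst in self.links
                      if src in states and match(tok)}
            if not states:
                return "NO"
        return "YES" if states & self.accepting else "NO"

# Any words, then an honorific, then two capitalized words.
PERSON_FSA = FSA(
    start="s0",
    accepting={"s3"},
    links=[
        ("s0", wildcard, "s0"),
        ("s0", in_list(HON_LIST), "s1"),
        ("s1", cap, "s2"),
        ("s2", cap, "s3"),
    ],
)

print(PERSON_FSA.accepts("and Mr. Hideto Suzuki".split()))
print(PERSON_FSA.accepts("Mr. Langford".split()))
```

The acceptor only answers whether the whole input ends in an accepting node; it does not say which words formed the name.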

15
Named Entity Identification III: Finite-State
Machines
  • Definition of A Finite State Transducer (FST)
  • An FSA with variable binding
  • Outputs "NO" or "YES" + variable-bindings
  • Variable bindings encode recognized entity
  • e.g. "YES <firstname Hideto> <lastname Suzuki>"
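
A transducer of this kind might look like the sketch below. It is a simplification under stated assumptions: the link chain is linear, the variable names `hon`, `firstname`, and `lastname` are chosen for illustration, and trying the chain at every start position stands in for a wildcard self-loop on the start node.

```python
HON_LIST = {"Mr.", "Ms.", "Dr.", "President"}  # assumed contents

def is_cap(tok):
    """Feature match: the token starts with a capital letter."""
    return tok[:1].isupper()

# Each link is (matcher, name of the variable bound to the matched token).
PERSON_LINKS = [
    (lambda tok: tok in HON_LIST, "hon"),
    (is_cap, "firstname"),
    (is_cap, "lastname"),
]

def transduce(tokens, links=PERSON_LINKS):
    """Try the link chain at every start position; return the first match."""
    n = len(links)
    for i in range(len(tokens) - n + 1):
        window = tokens[i:i + n]
        if all(match(tok) for (match, _), tok in zip(links, window)):
            bindings = {var: tok for (_, var), tok in zip(links, window)}
            return "YES", bindings
    return "NO", {}

print(transduce("and Mr. Hideto Suzuki from Japan".split()))
print(transduce("the meeting in Singapore".split()))
```

The only difference from an acceptor is the second return value: the bindings record which input tokens filled which slot of the recognized entity.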

16
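Finite State Acceptor (FSA)
[Figure: FSA diagram running from a Start State to an Accept State, with link labels ?, HON-LIST, CAP, CAP]

17
Finite State Transducer (FST)
[Figure: FST diagram with link labels ?, HON-LIST, CAP, CAP and variable bindings HON = ?, FirstName = ?, LastName = ?]
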
18
Role-Situated Named Entities (1)
  • Motivation
  • It is useful to know roles of NEs, e.g.
  • Who participated in the economic meeting?
  • Who hosted the economic meeting?
  • Who was discussed in the economic meeting?
  • Who was absent from the economic meeting?

19
Role-Situated Named Entities (2)
  • How do we Assign Roles to Entities?
  • Instead of one FSM, use a trio of FSMs
  • <left-context-FSA> <entity-FSM> <right-context-FSA>
  • Where left and right context help assign role

20
Role-Situated Named Entities (3)
  • Example
  • If <right-context> =
  • <? "not" ("attend" | "participate")>
  • Then entity.role = ABSENT
  • If <left-context> =
  • <("meet" | "meeting") ("in" | "at")>
  • Then entity.role = HOST
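
The two context rules above can be sketched as plain token tests around an already-recognized entity. Everything beyond the two patterns is an assumption for illustration: the tokenizer (which drops punctuation), the helper names, and the choice to return `None` when no rule fires.

```python
import re

def tokenize(text):
    """Split into alphabetic tokens, dropping punctuation (a simplification)."""
    return re.findall(r"[A-Za-z]+", text)

def right_is_absent(ctx):
    # <? "not" ("attend" | "participate")>: any one token, "not", then a verb
    return (len(ctx) >= 3 and ctx[1] == "not"
            and ctx[2] in {"attend", "participate"})

def left_is_host(ctx):
    # <("meet" | "meeting") ("in" | "at")> immediately before the entity
    return (len(ctx) >= 2 and ctx[-2] in {"meet", "meeting"}
            and ctx[-1] in {"in", "at"})

def assign_role(tokens, start, end):
    """Assign a role to the entity occupying tokens[start:end]."""
    if right_is_absent(tokens[end:end + 3]):
        return "ABSENT"
    if left_is_host(tokens[max(0, start - 2):start]):
        return "HOST"
    return None

def find(tokens, entity):
    """Return the (start, end) span of the first occurrence of the entity."""
    ent = tokenize(entity)
    for i in range(len(tokens) - len(ent) + 1):
        if tokens[i:i + len(ent)] == ent:
            return i, i + len(ent)
    raise ValueError(entity)

tokens = tokenize(
    "President Clinton decided to send special trade envoy Mickey Kantor "
    "to the special Asian economic meeting in Singapore this week. "
    "The Australian representative, Mr. Langford, will not attend."
)
for name in ("Singapore", "Mr. Langford", "Mickey Kantor"):
    print(name, assign_role(tokens, *find(tokens, name)))
```

In a fuller system the entity spans would come from the entity FSM rather than from a string search.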

21
Relational Information Extraction (1)
  • Motivation
  • It is useful to know who is doing what to whom

22
Relational Information Extraction (2)
  • Example
  • "John Snell reporting for Wall Street. Today
    Flexicon Inc. announced a tender offer for
    Supplyhouse Ltd. for $30 per share, representing
    a 30% premium over Friday's closing price.
    Flexicon expects to acquire Supplyhouse by Q4
    2001 without problems from federal regulators"

23
Relational Information Extraction (3)
  • Extraction System is Template of FSMs
  • Corporate-acquisition
  • acquirer: <company-FSM> <r-acquirer-FSM>
  • acquiree: <l-acquiree-FSM> <company-FSM>
  • share-price: <money-FSM> <r-stock-FSM>
  • date: <l-event-date-FSM> <date-FSM>

24
Relational Information Extraction (4)
  • Output is Instantiated FSM
  • Corporate-acquisition
  • acquirer: "Flexicon Inc."
  • acquiree: "Supplyhouse Ltd."
  • share-price: "30 USD"
  • date: "Q4 2001"
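
A rough sketch of filling such a template, using regular expressions in place of hand-built FSMs. The patterns, the `COMPANY` definition, and the shortened sample text (with a dollar sign assumed on the price) are illustrative assumptions; each slot pairs an entity pattern with a left or right context pattern, as in the template of FSMs.

```python
import re

# Sample text adapted from the example; the "$" is an assumption.
TEXT = ("John Snell reporting for Wall Street. Today Flexicon Inc. announced a "
        "tender offer for Supplyhouse Ltd. for $30 per share. Flexicon expects "
        "to acquire Supplyhouse by Q4 2001 without problems from federal "
        "regulators.")

# Assumed stand-in for a company FSM: capitalized word plus a legal suffix.
COMPANY = r"[A-Z][a-z]+ (?:Inc|Ltd|Corp)\."

TEMPLATE = {
    # slot: entity pattern (the group) wrapped in its context pattern
    "acquirer": rf"({COMPANY}) announced a tender offer",
    "acquiree": rf"tender offer for ({COMPANY})",
    "share-price": r"\$?(\d+) per share",
    "date": r"by (Q[1-4] \d{4})",
}

def fill_template(text, template=TEMPLATE):
    """Return one value per slot, or None where no pattern matched."""
    filled = {}
    for slot, pattern in template.items():
        m = re.search(pattern, text)
        filled[slot] = m.group(1) if m else None
    return filled

print(fill_template(TEXT))
```

Slots whose pattern finds nothing stay `None`, which keeps a partially filled template usable.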

25
Fact Extraction: State of the Art (1)
  • Observations
  • Entity -> entity+roles -> relation templates
  • Increasing richness of information extracted
  • But not equivalent to language understanding
  • Only pre-determined info types extracted

26
Fact Extraction: State of the Art (2)
  • Observations
  • Useful for relational DB filling
    Acquirer    Acquiree       Sh.price    Year
    -------------------------------------------
    Flexicon    Logi-truck     18          1999
    Flexicon    Supplyhouse    30          2001
    buy.com     reel.com       10          2000
    ...         ...            ...         ...
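
To make the database-filling step concrete, here is a small sketch that loads the three rows above into an in-memory SQLite table and queries them. The table name, column names, and the `acquisitions_by` helper are assumptions for illustration.

```python
import sqlite3

# Extracted facts, copied from the table above.
ROWS = [
    ("Flexicon", "Logi-truck", 18, 1999),
    ("Flexicon", "Supplyhouse", 30, 2001),
    ("buy.com", "reel.com", 10, 2000),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE acquisition ("
    "acquirer TEXT, acquiree TEXT, share_price INTEGER, year INTEGER)"
)
conn.executemany("INSERT INTO acquisition VALUES (?, ?, ?, ?)", ROWS)

def acquisitions_by(acquirer):
    """List (acquiree, year) pairs for one acquirer, oldest first."""
    cur = conn.execute(
        "SELECT acquiree, year FROM acquisition "
        "WHERE acquirer = ? ORDER BY year",
        (acquirer,),
    )
    return cur.fetchall()

print(acquisitions_by("Flexicon"))
```

Once the facts are in a table, ordinary SQL (and the data mining methods mentioned earlier) can operate on them.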

27
Fact Extraction: State of the Art (3)
  • Technical Approaches
  • Manually-built ad-hoc extraction "rules"
  • Manually-built FSTs
  • Feature-based training from labeled instances
  • (Naive Bayes, Decision Trees)
  • Hidden Markov Models
  • FSTs with feedback-driven tuning

28
Applications of Text Extraction I (1)
  • Financial
  • Email auto-response
  • e.g. "What is the balance of account N007623013?"
  • First categorize as balance-request
  • Then extract account number

29
Applications of Text Extraction I (2)
  • Financial
  • Template filling from bank order
  • e.g. "Please transfer 100,000 USD from N007623013
    to checking account A011129081 tomorrow"
  • First categorize as transfer

30
Applications of Text Extraction I (3)
  • Financial
  • Template filling from bank order
  • Then extract
  • account-transfer
  • <from N007623013>
  • <to A011129081>
  • <amount 100,000>
  • <date ??>
  • Then employee checks template and adds/corrects
    information such as missing date (e.g. if the
    system cannot interpret "tomorrow")
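
The categorize-then-extract flow for the two financial examples could be sketched as below. The keyword-based categorizer, the account-number pattern (one letter followed by nine digits), and the explicit-date pattern are assumptions made for this example; a relative date such as "tomorrow" is deliberately left unfilled for the employee to complete.

```python
import re

ACCOUNT = r"[A-Z]\d{9}"  # assumed account format: one letter, nine digits

def categorize(message):
    """Very rough keyword categorizer (assumption, for illustration)."""
    text = message.lower()
    if "transfer" in text:
        return "account-transfer"
    if "balance" in text:
        return "balance-request"
    return "unknown"

def extract_account(message):
    """Return the first account number in the message, or None."""
    m = re.search(ACCOUNT, message)
    return m.group(0) if m else None

def extract_transfer(message):
    """Fill the account-transfer template; unfilled slots stay None."""
    patterns = {
        "from": rf"from ({ACCOUNT})",
        "to": rf"to (?:checking |savings )?account ({ACCOUNT})",
        "amount": r"transfer ([\d,]+) USD",
        "date": r"on (\d{4}-\d{2}-\d{2})",  # only explicit dates are handled
    }
    filled = {}
    for slot, pattern in patterns.items():
        m = re.search(pattern, message)
        filled[slot] = m.group(1) if m else None
    return filled

order = ("Please transfer 100,000 USD from N007623013 "
         "to checking account A011129081 tomorrow")
if categorize(order) == "account-transfer":
    print(extract_transfer(order))

query = "What is the balance of account N007623013?"
if categorize(query) == "balance-request":
    print(extract_account(query))
```

Leaving the date as `None` rather than guessing mirrors the workflow on the slide, where a person reviews the template before it is acted on.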

31
Applications of Text Extraction II (1)
  • Informational
  • For all seminar announcements in BB
  • extract time/title/speaker/location
  • From email messages about proposed meetings
  • extract time/participants/location

32
Applications of Text Extraction II (2)
  • Large-scale Web applications
  • Build DB of all job openings
  • Categorize web pages as job descriptions
  • Extract company/date/salary/level/...
  • Fill in relational DB with extracted info
  • Whizbang! (a Pittsburgh eCompany) is doing just
    this via its flipdog.com site
  • Build DB of all web-posted resumes,
  • first categorizing pages as resumes,
  • then extracting key fields name/expertise/...

33
Applications of Text Extraction II (3)
  • Corporate Intelligence
  • Extract key facts about competition web sites
  • New products offered
  • Any changes to prices, sales, etc.
  • Extract key facts about customers of competitors