Text Mining - PowerPoint PPT Presentation

Slides: 32
Provided by: WalterDa6

Transcript and Presenter's Notes


1
Text Mining
  • Walter Daelemans
  • CNTS
  • Department of Linguistics
  • University of Antwerp
  • walter.daelemans_at_ua.ac.be

2
Centre for Dutch Language and Speech (CNTS)
  • Part of department of linguistics, University of
    Antwerp
  • Staff
  • 2 tenured, 10-15 with temporary funding from EU,
    IWT, FWO, NTU, language industry, BOF, ...
  • Topics
  • Corpus Linguistics (mainly Dutch)
  • Child language acquisition / computational
    psycholinguistics
  • Language Technology
  • machine learning of language
  • shallow parsing
  • text mining

3
Information Overload
  • Language is the most natural and most used
    knowledge representation formalism
  • Non-structured or weakly structured information
  • Text
  • Databases with text fields
  • Web pages, e-mail messages, blogs, chat, ...
  • (Non-structured) information overload
  • Doubles every three months (Gardner)
  • Hampers knowledge management and business
    intelligence
  • Translation bottleneck

4
Natural Language Understanding?
  • Word meaning
  • Morphological analysis
  • Complex Word Interpretation
  • Word Sense Disambiguation
  • Sentence Meaning
  • Syntactic structure (parsing)
  • Sentence interpretation
  • Discourse Meaning
  • World Knowledge
  • Frames, scenarios, grounding, intentions, ...
  • Fremdzugehen (German): fremd-zu-gehen, "to be unfaithful", not
    Fremd-Zug-Ehen, "external train marriages"
  • The box is in the pen
  • I eat a pizza with extra cheese
  • I eat a pizza with a fork
  • I eat a pizza with my daughter
  • The mayors didn't want the students to strike
    because they feared violence
  • The mayors didn't want the students to strike
    because they preached the revolution

5
State of the Art
  • Robust, efficient, accurate, unrestricted
    language understanding will not be available for
    a long time
  • AI-complete problem
  • Alternative
  • Text mining: automatic extraction of reusable
    knowledge from text, based on linguistic analysis
    of the text

6
Approach
  • Text analysis tools (shallow instead of deep
    understanding)
  • Robust / Efficient / Accurate
  • Text Mining applications
  • Question Answering
  • Summarization
  • Ontology extraction
  • Information extraction
  • Text categorization
  • For embedding in
  • End user applications related to knowledge search
    / management / discovery / communication

7
Examples
  • Application Areas
  • Data mining (KDD) from unstructured and
    semi-structured data
  • (Corporate) Knowledge Management
  • Intelligence
  • Example Applications
  • Email routing and filtering (spam filtering)
  • Finding protein interactions in biomedical text
  • Brokering
  • Matching on-line resumes and vacancies
  • Buying and selling property

8
Text Data Mining (Discovery)
  • Find relevant information
  • Information extraction
  • Text categorization
  • Analyze the text
  • Text mining
  • Discover new information
  • Integrate different sources
  • Data mining

9
Don Swanson (1981): medical hypothesis generation
  • stress is associated with migraines
  • stress can lead to loss of magnesium
  • calcium channel blockers prevent some migraines
  • magnesium is a natural calcium channel blocker
  • spreading cortical depression (SCD) is implicated
    in some migraines
  • high levels of magnesium inhibit SCD
  • migraine patients have high platelet
    aggregability
  • magnesium can suppress platelet aggregability
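
The chain of statements above can be read as a search for intermediate terms that connect two concepts never linked directly. The following is a minimal sketch of that idea, not Swanson's actual procedure: the triples are hand-normalised paraphrases of the bullets, and the relation names and the `linking_terms` helper are assumptions made for illustration.

```python
from collections import defaultdict

# Statements paraphrased from the slide as (source, relation, target).
# The term normalisation is done by hand here; it is an assumption,
# not the output of an extraction system.
FACTS = [
    ("stress", "associated_with", "migraine"),
    ("stress", "depletes", "magnesium"),
    ("calcium channel blockers", "prevent", "migraine"),
    ("magnesium", "is_a", "calcium channel blockers"),
    ("spreading cortical depression", "implicated_in", "migraine"),
    ("magnesium", "inhibits", "spreading cortical depression"),
    ("platelet aggregability", "high_in", "migraine"),
    ("magnesium", "suppresses", "platelet aggregability"),
]


def linking_terms(facts, a, c):
    """Return the intermediate terms that are linked to both a and c."""
    neighbours = defaultdict(set)
    for source, _relation, target in facts:
        neighbours[source].add(target)
        neighbours[target].add(source)
    return sorted((neighbours[a] & neighbours[c]) - {a, c})


links = linking_terms(FACTS, "magnesium", "migraine")
print(links)
```

No single statement mentions magnesium and migraine together, yet four intermediate terms connect them, which is what makes the pair a candidate hypothesis.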

10
CNTS text analysis tools
  • MBSP
  • Flexible and adaptable
  • Dutch and English
  • State of the Art accuracy and efficiency
  • 90 sentences / 1000 words/sec
  • Configurable combination of linguistic modules
  • Modules developed using Machine Learning
  • TiMBL
  • Adaptation through re-training and
    semi-supervised learning
  • Client-server set-up

11
CNTS shallow understanding
12
Insulatard is an isophane insulin suspension
(NPH).
13
Insulatard is an isophane insulin suspension
(NPH).
Insulatard is an isophane insulin suspension ( NPH
) .
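
The tokenization step shown on this slide can be approximated with a short regular expression. This is only a sketch: the slides do not describe how MBSP tokenizes, so the pattern and the `tokenize` helper are stand-ins.

```python
import re

# A token is either a run of word characters or a single punctuation mark.
# This pattern is an illustrative assumption, not MBSP's tokenizer.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


tokens = tokenize("Insulatard is an isophane insulin suspension (NPH).")
print(" ".join(tokens))
```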
14
Insulatard is an isophane insulin suspension
(NPH).
Insulatard/NNP is/VBZ an/DT isophane/JJ insulin/NN suspension/NN
(/Punc NPH/NNP )/Punc ./Punc
15
Insulatard is an isophane insulin suspension
(NPH).
[NP Insulatard] [VP is] [NP an isophane insulin suspension] ( NPH )
16
Insulatard is an isophane insulin suspension
(NPH).
Insulatard: Medicine name; NPH: Hormone
17
Insulatard is an isophane insulin suspension
(NPH).
SBJ Insulatard is PREDC an isophane
insulin suspension ( NPH )
18
Application: Question Answering
  • Give answer to question
  • (document retrieval: find documents relevant to
    query)
  • Who invented the telephone?
  • Alexander Graham Bell
  • When was the telephone invented?
  • 1876

19
QA System: Shapaqa
  • Parse question
  • When was the telephone invented?
  • Which slots are given?
  • Verb: invented
  • Object: telephone
  • Which slots are asked?
  • Temporal phrase linked to verb
  • Document retrieval on internet with given slot
    keywords
  • Parsing of sentences with all given slots
  • Count most frequent entry found in asked slot
    (temporal phrase)
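
The steps above can be sketched in a few lines. This is a simplification under stated assumptions, not Shapaqa itself: the given slots are matched as plain substrings instead of parsed constituents, the asked temporal slot is approximated by a four-digit-year regex, and the retrieved sentences are passed in as a list. The function name `answer_when` and the last example sentence are hypothetical.

```python
import re
from collections import Counter

# Stand-in for "temporal phrase": any four-digit year (an assumption).
YEAR_RE = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def answer_when(given_slots, sentences):
    """Rank candidate years found in sentences that contain all given slots."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        if all(slot.lower() in lowered for slot in given_slots):
            counts.update(YEAR_RE.findall(sentence))
    return counts.most_common()


SENTENCES = [
    "The telephone was invented by Alexander Graham Bell in 1876",
    "When Alexander Graham Bell invented the telephone in 1876 , he hoped "
    "that these same electrical signals could",
    # Hypothetical distractor: it lacks the given slots, so it is skipped.
    "Bell was born in 1847",
]

ranking = answer_when(["invented", "telephone"], SENTENCES)
print(ranking)
```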

20
Shapaqa example
  • When was the telephone invented?
  • Google: invented AND the telephone
  • produces 835 pages
  • 53 parsed sentences with both input slots and
    with a temporal phrase
  • is through his interest in Deafness and
    fascination with acoustics that the telephone was
    invented in 1876 , with the intent of helping
    Deaf and hard of hearing
  • The telephone was invented by Alexander Graham
    Bell in 1876
  • When Alexander Graham Bell invented the telephone
    in 1876 , he hoped that these same electrical
    signals could

21
Shapaqa frequency ranking
  • So when was the phone invented?
  • Internet answer is noisy, but robust
  • 17 1876
  • 3 1874
  • 2 ago
  • 2 later
  • 1 Bell
  • System was developed quickly
  • Precision 76% (Google 31%)
  • International competition (TREC): MRR 0.45
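
For reference, mean reciprocal rank (MRR) averages, over all questions, the reciprocal of the rank at which the first correct answer appears (0 if it never appears). A small sketch follows; the first ranking is the one on this slide, while the second question and its ranking are hypothetical and only there to make the average non-trivial.

```python
def mean_reciprocal_rank(ranked_answers, gold):
    """Average of 1/rank of the first correct answer, 0 when it is missing."""
    total = 0.0
    for answers, correct in zip(ranked_answers, gold):
        for rank, answer in enumerate(answers, start=1):
            if answer == correct:
                total += 1.0 / rank
                break
    return total / len(gold)


rankings = [
    ["1876", "1874", "ago", "later", "Bell"],  # ranking from the slide
    ["Edison", "Bell"],                        # hypothetical second question
]
gold = ["1876", "Bell"]

mrr = mean_reciprocal_rank(rankings, gold)
print(mrr)
```

The first question contributes 1/1 and the second 1/2, giving 0.75.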

22
Application: Biomedical text mining (EU project
BioMinT)
[Figure: architecture diagram. Labels: IR, IE, Linguistic / Semantic
Features, Templates, Factoids, Text Analysis, Medline abstracts]
23
(Partial) Factoids
  • The mouse lymphoma assay (MLA) utilizing the Tk
    gene is widely used to identify chemical mutagens.

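[Figure: shallow-parse diagram of the sentence. Labels: CELL-LINE (The
mouse lymphoma assay; MLA), DNA part (the Tk gene), verb groups
(utilizing; is widely used; to identify), chemical mutagens; S and O mark
subject and object links]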
24
<!DOCTYPE MBSP SYSTEM 'mbsp.dtd'>
<MBSP>
<S cnt="s1">
  <NP rel="SBJ" of="s1_1">
    <W pos="DT">The</W>
    <W pos="NN" sem="cell_line">mouse</W>
    <W pos="NN" sem="cell_line">lymphoma</W>
    <W pos="NN">assay</W>
  </NP>
  <W pos="openparen">(</W>
  <NP>
    <W pos="NN" sem="cell_line">MLA</W>
  </NP>
  <W pos="closeparen">)</W>
  <VP id="s1_1">
    <W pos="VBG">utilizing</W>
  </VP>
  <NP rel="OBJ" of="s1_1">
    <W pos="DT">the</W>
    <W pos="NN" sem="DNA_part">Tk</W>
    <W pos="NN" sem="DNA_part">gene</W>
  </NP>
  <VP id="s1_2">
    <W pos="VBZ">is</W>
    <W pos="RB">widely</W>
    <W pos="VBN">used</W>
  </VP>
  <VP id="s1_3">
    <W pos="TO">to</W>
    <W pos="VB">identify</W>
  </VP>
  <NP rel="OBJ" of="s1_3">
    <W pos="JJ">chemical</W>
    <W pos="NNS">mutagens</W>
  </NP>
  <W pos="period">.</W>
</S>
</MBSP>
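
Output of this kind can be turned into relation triples with a standard XML parser. The sketch below assumes the slide's markup is ordinary XML with `rel`, `of` and `id` attributes, and uses a shortened copy of the sentence; the helper names `words` and `relations` are illustrative and not part of MBSP.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example, written as ordinary XML.
XML = """<MBSP><S cnt="s1">
<NP rel="SBJ" of="s1_1"><W pos="DT">The</W><W pos="NN" sem="cell_line">mouse</W>
<W pos="NN" sem="cell_line">lymphoma</W><W pos="NN">assay</W></NP>
<VP id="s1_1"><W pos="VBG">utilizing</W></VP>
<NP rel="OBJ" of="s1_1"><W pos="DT">the</W><W pos="NN" sem="DNA_part">Tk</W>
<W pos="NN" sem="DNA_part">gene</W></NP>
</S></MBSP>"""


def words(element):
    """Join the text of all W elements below the given chunk."""
    return " ".join(w.text for w in element.iter("W"))


def relations(xml_text):
    """Return (relation, verb group, noun phrase) triples."""
    root = ET.fromstring(xml_text)
    verbs = {vp.get("id"): words(vp) for vp in root.iter("VP")}
    triples = []
    for chunk in root.iter("NP"):
        if chunk.get("rel") and chunk.get("of") in verbs:
            triples.append((chunk.get("rel"), verbs[chunk.get("of")], words(chunk)))
    return triples


triples = relations(XML)
for triple in triples:
    print(triple)
```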
25
Extracted IEX Templates from shallow parser output
  • NP(<X protein>) contain NP(Y "domain")
  • EVENT: contain
  • PROTEIN: <protein>
  • DOMAIN: domainf
  • NP(<X protein>) be associated with NP(Y
    disease)
  • EVENT: associated_with
  • PROTEIN: <protein>
  • DISEASE: head
  • NP(<X protein>) regulate NP(Y)
  • EVENT: regulate
  • PROTEIN: <protein>
  • Y

Jee-Hyub Kim (Geneva)
(): to be extracted, <>: semantic constraint, "":
lexical constraint
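
As a rough illustration of how such a template could be applied to shallow parser output, the sketch below matches the "regulate" pattern against a sequence of chunks. The chunk dictionaries, the `match_regulate` function and the example sentence are all hypothetical; the slides do not show how the extracted templates are applied.

```python
def match_regulate(chunks):
    """Fill the 'regulate' template for each NP(protein) VP(regulate) NP run.

    Each chunk is a dict with 'type' and 'text', plus 'sem' for NPs with a
    semantic class and 'lemma' for VPs (a hypothetical representation).
    """
    templates = []
    for left, verb, right in zip(chunks, chunks[1:], chunks[2:]):
        if (left["type"] == "NP" and left.get("sem") == "protein"
                and verb["type"] == "VP" and verb.get("lemma") == "regulate"
                and right["type"] == "NP"):
            templates.append({
                "EVENT": "regulate",
                "PROTEIN": left["text"],
                "Y": right["text"],
            })
    return templates


# Hypothetical chunked sentence: "ProteinA regulates cell growth"
CHUNKS = [
    {"type": "NP", "text": "ProteinA", "sem": "protein"},
    {"type": "VP", "text": "regulates", "lemma": "regulate"},
    {"type": "NP", "text": "cell growth"},
]

filled = match_regulate(CHUNKS)
print(filled)
```

The semantic constraint is checked on the first NP only, mirroring the `<protein>` marker in the pattern; the second NP is extracted without any constraint.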
26
Application: Ontology Extraction
  • Clustering of head nouns of Subject-Verb and
    Verb-Object relations
  • Combine with pattern matching and heuristics
  • Case study: Medline (4 million words, hepatitis),
    SwissProt corpus
  • Results
  • Better clusters with shallow parsing
  • Useful in knowledge management, thesaurus
    development, ...

Ontobasis (IWT)
27
Example (SwissProt corpus)
gene show significant homology
amino_acid_sequence have/indicate/lack/reveal/show homology
protein show homology, immunoreactivity, reactivity, sequence similarity
protein inhibit catalytic activity, apoptosis, protein synthesis...
protein exhibit significant homology
protein bind copper, ubiquitin
protein correspond isoelectric point
induction requires protein synthesis
Edman degradation of intact protein
regulatory subunit of cAMP-dependent protein kinase
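
One way to see how head nouns could be grouped by the relations they take part in is to compare their sets of (verb, object) contexts. The sketch below uses Jaccard overlap as a stand-in similarity, since the slides do not name the clustering method; the contexts are a hand-picked subset of the example above, with "significant homology" normalised to "homology" as a simplifying assumption.

```python
from itertools import combinations

# (verb, object head) contexts per subject head noun, taken from the
# example above and simplified by hand.
CONTEXTS = {
    "gene": {("show", "homology")},
    "amino_acid_sequence": {
        ("have", "homology"), ("indicate", "homology"), ("lack", "homology"),
        ("reveal", "homology"), ("show", "homology"),
    },
    "protein": {
        ("show", "homology"), ("show", "immunoreactivity"),
        ("inhibit", "apoptosis"), ("bind", "copper"),
    },
}


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)


def most_similar_pair(contexts):
    """Return the pair of nouns whose context sets overlap most."""
    pairs = combinations(sorted(contexts), 2)
    return max(pairs, key=lambda p: jaccard(contexts[p[0]], contexts[p[1]]))


best = most_similar_pair(CONTEXTS)
print(best, jaccard(CONTEXTS[best[0]], CONTEXTS[best[1]]))
```

A full clustering would merge the closest pair repeatedly; this shows only the similarity step on which such a procedure would rest.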
29
Further development
  • Semantic roles
  • Faster adaptation to new domains
  • Domain semantics (NER / concept tagging)
  • Active Learning / semi-supervised learning
  • More analytic power
  • Negation, modality, quantification
  • Limited event and scenario recognition

30
Conclusions
  • Text Mining tasks benefit from text analysis
  • Understanding can be formulated as a flexible
    heterarchy of classifiers
  • These classifiers can be trained / adapted on
    annotated corpora and can eventually approximate
    deep understanding

31
Questions?
  • Walter Daelemans
  • A1.10 Campus Drie Eiken
  • (September Stadscampus)
  • Walter.daelemans_at_ua.ac.be