LingPipe - PowerPoint PPT Presentation

About This Presentation
Title:

LingPipe

Slides: 13
Provided by: EPLT

Transcript and Presenter's Notes

1
LingPipe
  • http://www.alias-i.com/lingpipe/

2
Does a variety of tasks
  • Tokenization
  • Part of Speech Tagging
  • Named Entity Detection
  • Clustering
  • Identifies Significant Phrases
  • Other
    • Topic Classification
    • Database Text Mining
    • Spell Checker
    • Sentiment Analysis
    • Chinese Word Segmentation

3
Other Niceties
  • It's free
  • Plenty of documentation
  • Tutorials for every subtask
  • Highly Configurable
  • Source Code
    • Very complex, but well written
    • Good comments
  • Gives examples on how to edit code
  • Can be trained for several languages.

4
Tokenization
  • Divides text into sentences and words using
    fairly sophisticated methods.
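
To give a feel for what sentence and word splitting involves, here is a minimal Python sketch. It is a simple regular-expression heuristic, not LingPipe's method; the function names and the abbreviation list are made up for this example.

```python
import re

# Hypothetical abbreviation list; a real system would need a far larger one.
ABBREVIATIONS = {"Mr.", "Mrs.", "Ms.", "Dr."}


def split_words(text):
    """Split text into word tokens and single-character punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def split_sentences(text):
    """End a sentence at a chunk ending in . ! or ?, unless it is a known abbreviation."""
    sentences, current = [], []
    for chunk in text.split():
        current.append(chunk)
        if chunk[-1] in ".!?" and chunk not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences
```

With this sketch, `split_sentences("This is Mr. Bob Smith. Bob lives in Redmond.")` keeps "Mr." inside the first sentence instead of breaking after it, which is the kind of case that makes naive splitting on periods unreliable.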

5
Part of Speech Tagging
  • You can output the N-best results
  • You can output a confidence score for each word.
  • You can also retrain the Part of Speech Tagger.
  • You can also edit how it runs.
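
One common way to attach a confidence score to each word's tag is to compute posterior tag probabilities with the forward-backward algorithm on a hidden Markov model. The slides do not say how LingPipe computes its scores, so the Python sketch below is only an illustration: the tag set, the probabilities, and the function name are all invented.

```python
# Toy HMM. Every number below is invented for illustration only.
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.4, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9, "dog": 0.05, "barks": 0.05},
    "NOUN": {"the": 0.05, "dog": 0.8, "barks": 0.15},
    "VERB": {"the": 0.05, "dog": 0.15, "barks": 0.8},
}


def tag_confidences(words):
    """Return one dict per word mapping tag -> posterior probability.

    Words outside the toy vocabulary raise KeyError.
    """
    if not words:
        return []
    n = len(words)
    # Forward pass: fwd[i][t] = P(words[0..i], tag at i is t)
    fwd = [{t: START[t] * EMIT[t][words[0]] for t in TAGS}]
    for i in range(1, n):
        fwd.append({
            t: EMIT[t][words[i]] * sum(fwd[i - 1][s] * TRANS[s][t] for s in TAGS)
            for t in TAGS
        })
    # Backward pass: bwd[i][s] = P(words[i+1..] | tag at i is s)
    bwd = [dict.fromkeys(TAGS, 1.0) for _ in range(n)]
    for i in range(n - 2, -1, -1):
        bwd[i] = {
            s: sum(TRANS[s][t] * EMIT[t][words[i + 1]] * bwd[i + 1][t] for t in TAGS)
            for s in TAGS
        }
    total = sum(fwd[n - 1].values())  # probability of the whole word sequence
    return [{t: fwd[i][t] * bwd[i][t] / total for t in TAGS} for i in range(n)]
```

For each word the returned probabilities sum to 1, so the highest one can be read as the tagger's confidence in its best tag for that word.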

6
Named Entity Detection
  • The default detection distinguishes between three
    types of entities.
    • People (distinguishes male and female)
    • Place
    • Organization
  • It can be trained to recognize any type of
    entity.
  • You can get corpora online
  • You can annotate your own corpora using
    WordFreak, which also comes with LingPipe.

7
Sample Input/Output
  • Input: <DOCUMENT><P>This is Mr. Bob Smith. Bob lives
    in Redmond. He works for Microsoft.</P></DOCUMENT>
  • Output: <DOCUMENT><P><sent>This is Mr. <ENAMEX id="13"
    type="PERSON">Bob Smith.</ENAMEX> </sent>
    <sent><ENAMEX id="13" type="PERSON">Bob</ENAMEX>
    lives in
    <ENAMEX id="14" type="LOCATION">Redmond</ENAMEX>
    . </sent>
    <sent><ENAMEX id="13" type="MALE_PRONOUN">He</ENAMEX>
    works for <ENAMEX id="15" type="ORGANIZATION">Microsoft</ENAMEX>
    . </sent></P></DOCUMENT>
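
The annotated output is ordinary XML, so it can be read back with any XML parser. The Python sketch below uses the standard library to group mentions by their `id` attribute, which is how the sample ties "Bob Smith", "Bob" and "He" to the same entity. The function name is made up for this example, and the string is the sample output from this slide with its angle brackets and quotes restored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Sample output from the slide, with markup restored.
SAMPLE_OUTPUT = (
    '<DOCUMENT><P><sent>This is Mr. '
    '<ENAMEX id="13" type="PERSON">Bob Smith.</ENAMEX> </sent>'
    '<sent><ENAMEX id="13" type="PERSON">Bob</ENAMEX> lives in '
    '<ENAMEX id="14" type="LOCATION">Redmond</ENAMEX> . </sent>'
    '<sent><ENAMEX id="13" type="MALE_PRONOUN">He</ENAMEX> works for '
    '<ENAMEX id="15" type="ORGANIZATION">Microsoft</ENAMEX> . </sent>'
    '</P></DOCUMENT>'
)


def mentions_by_entity(xml_text):
    """Map each ENAMEX id to its (type, text) mentions in document order."""
    root = ET.fromstring(xml_text)
    chains = defaultdict(list)
    for element in root.iter("ENAMEX"):
        chains[element.get("id")].append((element.get("type"), element.text))
    return dict(chains)
```

Calling `mentions_by_entity(SAMPLE_OUTPUT)["13"]` returns the three mentions of Bob Smith, including the pronoun, while ids 14 and 15 each have a single mention.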

8
Dictionary
  • To increase the accuracy of LingPipe, you can
    import a dictionary.
  • A dictionary forces certain strings to be
    recognized as particular entity types.
  • Common dictionaries include
    • Gazetteer
    • List of people's names
    • Company names
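
The idea of dictionary-driven recognition can be sketched as a longest-match lookup over tokens. This Python example is an illustration under assumptions, not LingPipe's implementation: the function name, the dictionary format (token tuples mapped to type names), and the entries are all hypothetical.

```python
def dictionary_chunks(tokens, dictionary):
    """Return (start, end, type) spans, preferring the longest phrase at each position.

    dictionary maps tuples of tokens to an entity type.
    """
    max_len = max((len(phrase) for phrase in dictionary), default=0)
    spans, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in dictionary:
                spans.append((i, i + n, dictionary[phrase]))
                i += n
                break
        else:
            i += 1  # no dictionary phrase starts at this token
    return spans


# Hypothetical dictionary entries for illustration.
EXAMPLE_DICTIONARY = {
    ("Bob", "Smith"): "PERSON",
    ("Redmond",): "LOCATION",
    ("Microsoft",): "ORGANIZATION",
}
```

For the tokens of "Bob Smith works for Microsoft in Redmond", this yields one PERSON span covering the first two tokens, followed by an ORGANIZATION span and a LOCATION span.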

9
Coreference
  • It identifies different references to the same
    entity, such as "Bob Smith" and "Bob".
  • It does not identify entities across documents.
  • It links pronouns to their antecedents.
  • It does not do other anaphora resolution, as in
    "Jane was the woman who pulled the trigger."

10
Clustering
  • Single-link Clustering
    • Chops off the longest link
  • Clustering with proximity bounds
    • Merges based on proximity
  • Extract for K-clusters
    • You can specify how many clusters you want
  • Complete-Link Clustering
    • Variant of single-link that compares whole clusters
      (farthest pair of members rather than closest)
  • Within-Cluster Point Scatter
    • You don't need to specify the number of clusters.
    • It detects the best breaking point.
    • This is the method used to do NER across
      documents.
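
The difference between single-link and complete-link, and the two stopping rules (a fixed number of clusters or a proximity bound), can be sketched in a few lines of Python. This is a bottom-up illustration on one-dimensional points, not LingPipe's code; the function and parameter names are assumptions made for the example.

```python
def agglomerative(points, linkage="single", k=1, max_distance=float("inf")):
    """Merge the two closest clusters until k remain or the closest
    pair is farther apart than max_distance.

    points are plain numbers (1-D) to keep the sketch short.
    """
    # single-link: distance of the closest members; complete-link: the farthest.
    pick = min if linkage == "single" else max
    clusters = [[p] for p in points]

    def dist(a, b):
        return pick(abs(x - y) for x in a for y in b)

    while len(clusters) > max(k, 1):
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > max_distance:
            break  # proximity bound reached
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return sorted(sorted(c) for c in clusters)
```

On the points 1, 2, 3, 10, 11, 20 with `max_distance=1`, single-link chains 1, 2 and 3 into one cluster because each neighbor is within the bound, while complete-link leaves 3 on its own because 1 and 3 are two apart. Asking for `k=2` instead ignores the bound and keeps merging until two clusters remain.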

11
Significant Phrases
  • Finds phrases whose words occur together more
    often than chance would predict
  • Seems to be mostly named entities
    • e.g. Puget Sound, George Bush
  • Helps tell the genre of an article
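
"More often than chance" can be made concrete by comparing how often two words appear together with how often they would if they were independent. The slides do not say which statistic LingPipe uses; the Python sketch below swaps in pointwise mutual information (PMI) as a simple stand-in, and the function name and `min_count` parameter are assumptions for this example.

```python
import math
from collections import Counter


def significant_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by PMI, highest first.

    Pairs seen fewer than min_count times are dropped, since PMI
    is unreliable for rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)
    scored = []
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi  # observed probability of the pair
        p_a, p_b = unigrams[a] / n_uni, unigrams[b] / n_uni
        # PMI > 0 means the pair occurs more often than independence predicts.
        scored.append((math.log2(p_ab / (p_a * p_b)), (a, b)))
    return [pair for _, pair in sorted(scored, reverse=True)]
```

On a small text where "puget sound" repeats and its two words rarely appear apart, that pair ranks above pairs built from frequent words such as "the" or "is", which matches the observation that the significant phrases are mostly named entities.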

12
Questions?