A Text Processing Tool for the Romanian Language

1
A Text Processing Tool for the Romanian Language
  • Oana Frunza and Diana Inkpen
    School of Information Technology and Engineering,
    University of Ottawa
    {ofrunza,diana}@site.uottawa.ca
  • David Nadeau
    Institute for Information Technology,
    National Research Council of Canada
    David.Nadeau@nrc-cnrc.gc.ca

2
Outline
  • BALIE System
  • RO-BALIE
  • Capabilities
  • Improvements
  • Evaluation Results
  • Future Work

3
BALIE: BaseLine Information Extraction
  • Multilingual information extraction system
    • Language identification
    • Tokenization
    • Sentence boundary detection
    • Part-of-speech tagging
  • Available for English, French, German, and Spanish [1]
  • Trainable, open-source system written in Java
  • Uses WEKA [2], a machine learning tool
  • Uses QTag [3], a language-independent
    probabilistic part-of-speech tagger

4
BALIE: BaseLine Information Extraction (cont.)
  • Input example
  • 1. Introduction
    Information Extraction (IE) is the name given to
    any process which selectively structures and
    combines data which is found, explicitly stated
    or implied, in one or more texts.

5
BALIE: BaseLine Information Extraction (cont.)
  • Output
    <?xml version="1.0" ?>
    <balie>
      <tokenList>
        <s>
          <token type="2" pos="number" canon="1">1</token>
          <token type="1" pos="period" canon=".">.</token>
          <token type="2" pos="noun" canon="introduction">Introduction</token>
        </s>
        <s>
          <token type="2" pos="noun" canon="information">Information</token>
        </s>
      </tokenList>
    </balie>

6
RO-BALIE
  • Improvements
  • Easier manipulation of the input and output texts
  • A new tag set that maps the numerical tag set
    internally used by BALIE
  • More information in the output provided by the
    system
  • Available at
    http://www.site.uottawa.ca/ofrunza/RO-Balie/RO-Balie.html

7
RO-BALIE
  • Language Identification
  • Character 2-grams (sequences of 2 characters)
  • Naïve Bayes classifier
  • Overall accuracy is 99.25%.

Language   Training files   Test files   Correctly classified   Accuracy (%)
English    50               27           27                     100
French     50               26           25                     96
Spanish    50               25           25                     100
German     50               27           27                     100
Romanian   50               32           32                     100
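
The approach on this slide (character 2-grams fed to a Naïve Bayes classifier) can be sketched in a few lines of Python. This is only an illustration under assumptions: the class name, the add-one smoothing, and the toy training sentences are not from the presentation, and the real system is written in Java on top of WEKA and was trained on 50 files per language.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Return the overlapping 2-character sequences of the lowercased text."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]


class BigramNaiveBayes:
    """Multinomial Naive Bayes over character 2-grams (add-one smoothing assumed)."""

    def __init__(self):
        self.counts = {}       # language -> Counter of 2-gram frequencies
        self.totals = {}       # language -> total number of 2-grams seen
        self.docs = Counter()  # language -> number of training texts
        self.vocab = set()     # all 2-grams seen in training

    def train(self, language, text):
        grams = char_bigrams(text)
        self.counts.setdefault(language, Counter()).update(grams)
        self.totals[language] = self.totals.get(language, 0) + len(grams)
        self.docs[language] += 1
        self.vocab.update(grams)

    def classify(self, text):
        grams = char_bigrams(text)
        n_docs = sum(self.docs.values())
        best, best_score = None, float("-inf")
        for language, counter in self.counts.items():
            # log prior + sum of smoothed log likelihoods of each 2-gram
            score = math.log(self.docs[language] / n_docs)
            denom = self.totals[language] + len(self.vocab)
            for gram in grams:
                score += math.log((counter[gram] + 1) / denom)
            if score > best_score:
                best, best_score = language, score
        return best


# Toy training texts, made up for this sketch.
clf = BigramNaiveBayes()
clf.train("English", "the quick brown fox jumps over the lazy dog and then the dog sleeps")
clf.train("Romanian", "aceasta este o propozitie scrisa in limba romana si este foarte scurta")
clf.train("French", "ceci est une phrase ecrite en langue francaise et elle est tres courte")
print(clf.classify("limba romana"))
```

With so little training text the classifier is only reliable on inputs that resemble the training sentences; the per-language accuracies in the table come from the real training files, not from anything like this toy setup.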

8
RO-BALIE (cont.)
  • Tokenization
  • Splits each compound word on the characters - and /
  • Examples: iat-o, socio-economic
  • Tokenization results

Tokens   Precision (%)   Recall (%)
904      99.5            98.7
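
The compound-splitting rule can be illustrated with a short Python sketch. It assumes whitespace-separated input and assumes that the separators - and / are kept as tokens of their own; the slides do not say whether RO-BALIE keeps or drops them, and the function names are hypothetical.

```python
import re


def split_compound(word):
    """Split a word on '-' and '/', keeping each separator as its own token."""
    return [part for part in re.split(r"([-/])", word) if part]


def tokenize(text):
    """Whitespace tokenization followed by compound splitting."""
    tokens = []
    for word in text.split():
        tokens.extend(split_compound(word))
    return tokens


print(tokenize("iat-o socio-economic"))
```

Dropping the separators instead would only require removing the capturing group from the regular expression.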

9
RO-BALIE (cont.)
  • Sentence Boundary Detection
  • Training: 106 hand-tagged English sentences
  • Decision tree classifier
  • Features
    • Beginning of the sentence (first token)
    • Previous token
    • Current token
    • Next token

10
RO-BALIE (cont.)
  • Sentence Boundary Detection (cont.)
  • Feature values: Period, Open Quote, Close Quote,
    New Line, Capital Word, Digit, Abbreviation, etc.
  • A list of 510 Romanian abbreviations
  • Evaluation on Orwell's novel 1984

Text       Accuracy (%)   Precision (%)   Recall (%)
Romanian   97             92              71
English    97.5           96.5            82
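
A minimal Python sketch of how the four features might be computed for a candidate boundary token follows. Several things here are assumptions rather than details from the slides: the reading of the first feature as the class of the sentence's first token, the function names, the quote characters, and the four-entry abbreviation sample (the real list has 510 entries). The resulting feature dictionaries would then be handed to a decision tree learner.

```python
# Tiny illustrative sample; the actual 510-entry list is not reproduced here.
ABBREVIATIONS = {"dl", "nr", "str", "etc"}
OPEN_QUOTES = {"\u201e", "\u00ab"}    # low double quote, left guillemet
CLOSE_QUOTES = {"\u201d", "\u00bb"}   # right double quote, right guillemet


def token_class(token):
    """Map a token to one of the coarse feature values named on the slide."""
    if token is None:
        return "None"            # no neighbour on this side
    if token == "\n":
        return "NewLine"
    if token == ".":
        return "Period"
    if token in OPEN_QUOTES:
        return "OpenQuote"
    if token in CLOSE_QUOTES:
        return "CloseQuote"
    if token.lower() in ABBREVIATIONS:
        return "Abbreviation"
    if token.isdigit():
        return "Digit"
    if token[:1].isupper():
        return "CapitalWord"
    return "Other"


def boundary_features(tokens, i, sentence_start):
    """Features for deciding whether tokens[i] ends the sentence that began at sentence_start."""
    previous = tokens[i - 1] if i > 0 else None
    following = tokens[i + 1] if i + 1 < len(tokens) else None
    return {
        "sentence_start": token_class(tokens[sentence_start]),
        "previous": token_class(previous),
        "current": token_class(tokens[i]),
        "next": token_class(following),
    }


tokens = ["Strada", "Unirii", "nr", ".", "5", ".", "Apel", "tirziu", "."]
print(boundary_features(tokens, 3, 0))  # period after the abbreviation "nr"
print(boundary_features(tokens, 8, 6))  # period at the end of the text
```

The first call shows why the abbreviation list matters: a period preceded by an abbreviation and followed by a digit looks very different to the classifier than a period preceded by an ordinary word.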

11
RO-BALIE (cont.)
  • Part-of-speech tagging: QTag tagger
  • Used a corpus of 40 million words of newspaper
    articles
  • Romanian newspapers over a 3-year period
  • The training corpus is 98% accurate
  • Our system has a tag set of 14 tags for POS and 30
    tags for punctuation

Training corpus     Test corpus    Accuracy (%)
2.5 million words   13,425 words   95.3

12
RO-BALIE (cont.)
  • Output for "Apel tirziu si inutil NISTORESCU."
    <?xml version="1.0" ?>
    <balie>
      <Language ID="Romanian">
        <tokenList>
          <Tokens Count="896">
            <s id="1">
              <token type="2" pos="NN" canon="apel">Apel</token>
              <token type="2" pos="ADV" canon="tirziu">tirziu</token>
              <token type="2" pos="CJ" canon="si">si</token>
              <token type="2" pos="NN" canon="inutil">inutil</token>
              <token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
              <token type="1" pos="PER" canon=".">.</token>
            </s>
          </Tokens>
        </tokenList>
      </Language>
    </balie>
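
Output in this shape is easy to consume downstream. The Python sketch below reads the detected language and the per-token annotations using the standard library's xml.etree.ElementTree. The element and attribute names are taken from the example on this slide; the function name read_tokens is hypothetical, and the sample is the single-sentence excerpt shown above, not a full 896-token document.

```python
import xml.etree.ElementTree as ET

# Excerpt reconstructed from the slide.
SAMPLE = """<?xml version="1.0" ?>
<balie>
  <Language ID="Romanian">
    <tokenList>
      <Tokens Count="896">
        <s id="1">
          <token type="2" pos="NN" canon="apel">Apel</token>
          <token type="2" pos="ADV" canon="tirziu">tirziu</token>
          <token type="2" pos="CJ" canon="si">si</token>
          <token type="2" pos="NN" canon="inutil">inutil</token>
          <token type="2" pos="PN" canon="nistorescu">NISTORESCU</token>
          <token type="1" pos="PER" canon=".">.</token>
        </s>
      </Tokens>
    </tokenList>
  </Language>
</balie>"""


def read_tokens(xml_text):
    """Return (language, rows); each row is (sentence id, surface form, POS tag, canonical form)."""
    root = ET.fromstring(xml_text)
    language = root.find("Language").get("ID")
    rows = []
    for sentence in root.iter("s"):
        for token in sentence.findall("token"):
            rows.append((sentence.get("id"), token.text,
                         token.get("pos"), token.get("canon")))
    return language, rows


language, rows = read_tokens(SAMPLE)
print(language)
for row in rows:
    print(row)
```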

13
RO-BALIE (cont.)
  • Future Work
  • Use machine learning for the tokenization task
  • Add new services: morphological analysis, named
    entity recognition, etc.
  • Add more specific information for each supported
    language.

14
RO-BALIE (cont.)
  • References
  • 1. http://balie.sourceforge.net/index.html
  • 2. http://www.cs.waikato.ac.nz/ml/weka/
  • 3. http://www.english.bham.ac.uk/staff/omason/software/qtag.html
  • RO-BALIE: http://www.site.uottawa.ca/ofrunza/RO-Balie/RO-Balie.html

15
THANK YOU!
  • Questions?