Description:

Title: PowerPoint Presentation - At Home with Technology: Web Page Overview Author: Anne Kolaczyk Last modified by: ABBAS Created Date: 1/19/2004 3:49:41 PM – PowerPoint PPT presentation

1
Mono and Cross Language Experiments on Persian Text
University of Tehran Database Research Group
Persian_at_CLEF 2008
  • Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
  • Database Research Group
  • School of Electrical and Computer Engineering
  • University of Tehran

18 Sep 2008
2
Outline
  • Persian Language
  • Persian Test Collections
  • Hamshahri in CLEF 2008
  • UT Participants
  • Using Part of Speech Tagging in Persian
    Information Retrieval
  • Fusion of Retrieval Models at CLEF 2008 Ad-Hoc
    Persian Track
  • Local Cluster Analysis Using Part of Speech
    Tagging
  • Investigation on Application of Local Cluster
    Analysis and Part of Speech Tagging on Persian
    Text
  • Cross Language Experiments at Persian_at_CLEF 2008
  • Next Year

3
The Persian Language
  • A branch of Indo-European Languages
  • Official Language of Iran, Afghanistan and
    Tajikistan
  • Its morphological analysis is comparatively
    difficult
  • The word ??? has two plural forms
  • Persian rules ?????
  • Arabic rules ?????

4
Some Processing Issues
  • Writing Style Issues
  • e.g. ?? ??? and ????? are the same
  • e.g. ?????? and ???? ?? are the same
  • KASRE
  • e.g. ???? ??? ???? ?? ?????? has two different
    meanings
  • CheraghAli burned the house
  • Ali's lantern burned the house

5
Some Processing Issues
  • Encoding
  • ?

6
Persian in the Middle East
User Population Growth on the Web (2000-2009)
December 31, 2007
Source: Internet World Stats, http://internetworldstats.com/
7
Persian Test Collections
  • IR Domain
  • Ghavanin (domain specific)
  • Hamshahri (news), WEB: http://ece.ut.ac.ir/dbrg/hamshahri
  • NLP Domain
  • Bijankhan (2 Million Word), WEB:
    http://ece.ut.ac.ir/dbrg/bijankhan

8
Hamshahri in CLEF 2008
  • News articles from the Hamshahri newspaper,
    1996 to 2002
  • Size of the documents varies from short news
    (under 1 KB) to rather long articles (e.g. 140
    KB)
  • 22 assessors
  • Evaluation based on DIRECT System

9
Hamshahri in CLEF 2008
Collection size: 564 MB (Unicode text)
No. of documents: 166,774
No. of unique terms: 417,339
Average length of documents: 380 terms
No. of categories: 9
No. of topics: 50 (bilingual)
10
Implementation of our methods
  • We submitted the top 100 results for each run

11
Using Part of Speech Tagging in Persian Information Retrieval
Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami,
Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
12
Using Part of Speech Tagging in Persian
Information Retrieval
Config. | Corpus | Query
1 | Tagged | Title with equal weighting for all POS tags
2 | Stemmed and tagged | Stemmed title with equal weighting for all POS tags
3 | Stemmed | Stemmed title without POS tagging
4 | Stemmed | Stemmed title plus description
5 | Stemmed (stop words removed) | Stemmed title plus description (stop words removed)
6 | Tagged | Title plus description with equal weighting for all POS tags
7 | Tagged | Title with various weighting schemes for different POS tags
8 | Normal | Title (neither stemmed nor tagged)
13
Using Part of Speech Tagging in Persian
Information Retrieval
Weighting scheme | Average precision | R-Precision
20 least-used tags omitted, others equal weight | 0.2745 | 0.3097
Noun=3, Verb=2, Adj=1, Adv=1 | 0.2635 | 0.3104
Noun=3, Verb=0, Adj=3, Adv=0 | 0.2597 | 0.2888
Noun=0, Verb=2, Adj=0, Adv=0 | 0.1108 | 0.1256
Noun=0, Verb=0, Adj=1, Adv=0 | 0.1198 | 0.1186
Noun=0, Verb=0, Adj=0, Adv=1 | 0.0977 | 0.1111
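
The slides do not show how the per-tag weights were applied to the query. As a minimal sketch, assuming each query term simply inherits the weight of its POS tag and that zero-weight terms are dropped, one of the schemes above (Noun 3, Verb 2, Adj 1, Adv 1) could be expressed as follows. The function name, the tag labels, the default weight for unlisted tags, and the sample query are all illustrative assumptions, not part of the original system.

```python
from typing import Dict, List, Tuple

# One of the schemes from the slide: Noun 3, Verb 2, Adj 1, Adv 1.
POS_WEIGHTS: Dict[str, float] = {"Noun": 3.0, "Verb": 2.0, "Adj": 1.0, "Adv": 1.0}


def weight_query(
    tagged_terms: List[Tuple[str, str]],
    pos_weights: Dict[str, float],
    default_weight: float = 0.0,  # assumption: tags not listed get no weight
) -> List[Tuple[str, float]]:
    """Attach a weight to each (term, tag) pair and drop zero-weight terms."""
    weighted = []
    for term, tag in tagged_terms:
        weight = pos_weights.get(tag, default_weight)
        if weight > 0:
            weighted.append((term, weight))
    return weighted


# Hypothetical tagged query (English stand-ins for Persian terms).
query = [("book", "Noun"), ("read", "Verb"), ("quickly", "Adv"), ("the", "Det")]
weighted_query = weight_query(query, POS_WEIGHTS)
```

The resulting (term, weight) pairs could then be passed to any retrieval engine that accepts weighted query terms.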
14
Using Part of Speech Tagging in Persian
Information Retrieval
15
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi,
Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Weighting Model | Description
BB2 | Bose-Einstein model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
BM25 | The BM25 probabilistic model
DFR_BM25 | The DFR version of BM25
IFB2 | Inverse Term Frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expB2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization
In_expC2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli's processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
InL2 | Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
PL2 | Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
TF_IDF | The tf-idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones' idf
Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/
16
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc
Persian Track
Weighting Model | Average Precision | R-Precision
BB2 | 0.3854 | 0.4167
BM25 | 0.3562 | 0.4009
DFR_BM25 | 0.4006 | 0.4347
IFB2 | 0.4017 | 0.4328
In_expB2 | 0.3997 | 0.4329
In_expC2 | 0.4190 | 0.4461
InL2 | 0.3832 | 0.4200
PL2 | 0.4314 | 0.4548
TF_IDF | 0.3574 | 0.4017
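
For readers unfamiliar with the two measures in this table, the following is a minimal sketch of how average precision and R-Precision are computed for a single topic, using their standard definitions. The reported figures are presumably averages over all topics; the document IDs and function names below are made up for illustration.

```python
from typing import List, Set


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant doc appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def r_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranked[:r] if doc_id in relevant) / r


# Hypothetical ranked list and relevance judgments for one topic.
ranked_list = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d3"}
ap = average_precision(ranked_list, relevant_docs)
rp = r_precision(ranked_list, relevant_docs)
```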
17
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc
Persian Track
  • And two other variations of this operator: IOWA
    and NOWA
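
The operator itself did not survive in this transcript. The name IOWA suggests the ordered weighted averaging (OWA) family, so the sketch below assumes a plain OWA fusion: for each document, the scores from the individual runs are sorted in descending order and combined with a fixed weight vector. The weights, scores, function names, and the choice to treat a missing document as score 0 are illustrative assumptions; the IOWA and NOWA variants are not reproduced here.

```python
from typing import Dict, List, Tuple


def owa(values: List[float], weights: List[float]) -> float:
    """Ordered weighted average: weights apply to positions after sorting."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    ordered = sorted(values, reverse=True)
    return sum(w * v for w, v in zip(weights, ordered))


def fuse_runs(
    runs: List[Dict[str, float]], weights: List[float]
) -> List[Tuple[str, float]]:
    """Fuse per-run document scores (assumed already normalized to [0, 1])."""
    doc_ids = set().union(*runs)
    fused = {
        doc_id: owa([run.get(doc_id, 0.0) for run in runs], weights)
        for doc_id in doc_ids
    }
    # Sort by fused score, breaking ties by document ID for determinism.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Three hypothetical runs and a hypothetical weight vector summing to 1.
example_runs = [{"a": 0.9, "b": 0.2}, {"a": 0.1, "b": 0.8}, {"b": 0.5}]
example_weights = [0.5, 0.3, 0.2]
fused_ranking = fuse_runs(example_runs, example_weights)
```

Because the weights attach to rank positions rather than to particular systems, a weight vector concentrated at the front behaves like a max over runs, and a uniform vector reduces to a simple average.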

18
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc
Persian Track
19
Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Post hoc Results
Retrieval Method | Toolkit | Average Precision | R-Precision | Dif
TF_IDF with unstemmed single terms | Terrier | 0.3847 | 0.4122 |
PL2 with 4gram terms | Terrier | 0.3669 | 0.3939 |
Indri with stemmed terms | Lemur | 0.3955 | 0.4149 |
IOWA | | 0.4515 | 0.4708 | 5.6
NOWA | | 0.4522 | 0.4736 | 5.67
20
Investigation on Application of Local Cluster Analysis and Part of Speech
Tagging on Persian Text
Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri
21
Investigation on Application of Local Cluster
Analysis and Part of Speech Tagging on Persian
Text
However, the results on the test set were not good
22
Cross Language Experiments at Persian_at_CLEF 2008
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar,
Farhad Oroumchian
Run | tot-ret | rel-ret | MAP | Retrieval Model | Tool
Using Light Stemmer | 5161 | 1967 | 26.89 | Vector Space | Lucene
Without Stemmer | 5161 | 1991 | 27.08 | Vector Space | Lucene
3Grams | 5161 | 1901 | 26.07 | Language Modeling | Lemur
4Grams | 5161 | 1950 | 26.70 | Language Modeling | Lemur
5Grams | 5161 | 1983 | 27.13 | Language Modeling | Lemur
Term-Based | 5161 | 2035 | 28.14 | Language Modeling | Lemur
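
The slide does not say how the 3-, 4- and 5-gram runs were tokenized. A common choice, assumed here, is overlapping character n-grams taken inside word boundaries, with words shorter than n kept whole. The function name and sample text are illustrative only.

```python
from typing import List


def char_ngrams(text: str, n: int) -> List[str]:
    """Split each whitespace-delimited word into overlapping character n-grams."""
    if n < 1:
        raise ValueError("n must be at least 1")
    grams: List[str] = []
    for word in text.split():
        if len(word) <= n:
            # Assumption: short words are indexed as-is rather than dropped.
            grams.append(word)
        else:
            grams.extend(word[i : i + n] for i in range(len(word) - n + 1))
    return grams


# English stand-in text; the same slicing works on Unicode Persian strings.
four_grams = char_ngrams("text retrieval", 4)
```

N-gram indexing sidesteps stemming entirely, which may be why it is compared against the light-stemmer and term-based runs in the table.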
23
Cross Language Experiments at Persian_at_CLEF 2008
Query Translation
  • Probabilistic Structured Queries (PSQ)
  • Combinatorial Translation Probability (CTP)
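
As a rough illustration of the first bullet, Probabilistic Structured Queries estimate the term frequency and document frequency of a source-language query term from the statistics of its possible translations, weighted by translation probability. The sketch below shows only that estimation step; the translation table, probabilities, and counts are invented for illustration, and CTP is not covered because the slides give no details about it.

```python
from typing import Dict


def psq_tf(translations: Dict[str, float], doc_tf: Dict[str, int]) -> float:
    """Estimated TF of a query term in one document:
    sum over translations f of p(f | e) * TF(f, d)."""
    return sum(p * doc_tf.get(f, 0) for f, p in translations.items())


def psq_df(translations: Dict[str, float], collection_df: Dict[str, int]) -> float:
    """Estimated DF of a query term in the collection:
    sum over translations f of p(f | e) * DF(f)."""
    return sum(p * collection_df.get(f, 0) for f, p in translations.items())


# Hypothetical translation probabilities for the English query term "book",
# using transliterated Persian candidates.
book_translations = {"ketab": 0.7, "daftar": 0.3}
document_tf = {"ketab": 2, "daftar": 1}
collection_df = {"ketab": 100, "daftar": 50}

estimated_tf = psq_tf(book_translations, document_tf)
estimated_df = psq_df(book_translations, collection_df)
```

The estimated TF and DF can then be plugged into an ordinary monolingual ranking formula in place of the usual counts.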

24
Cross Language Experiments at Persian_at_CLEF 2008
Query Translation Results
25
Cross Language Experiments at Persian_at_CLEF 2008
Document Translation
  • Using the Shiraz machine translation system from
    CRL at NMSU
  • Took 10 days to translate 130,000 docs from
    Persian to English

26
Cross Language Experiments at Persian_at_CLEF 2008
Document Translation Hybrid Results
27
Next Year
  • Ham2 for the Next Year
  • Extended Version of Hamshahri Collection
  • 2 times larger (1.5 GB)

<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<ORIGINALFILE>/1385/851011/news/_adabh.htm</ORIGINALFILE>
<ISSUE>?????? 11 ?? 1385 - ??? ??????? - ?????4172 - Jan 1, 2007</ISSUE>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">??? ? ???</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE><![CDATA[?????? ???? ? ????????? ????? ????? ? ????? ?????? ??? ??? ???? ???? ???? ???? ????? ??]]></TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[ ???? ???? ?? ????  ? ???? ????? ????? ????? ? ????? ?????? ??? ???? ??? ]]>
</TEXT>
</DOC>
28
Questions? Thanks for your attention
Database Research Group: http://ece.ut.ac.ir/dbrg