Title: Mono
1 Mono and Cross Language Experiments on Persian Text
University of Tehran Database Research Group
Persian_at_CLEF 2008
- Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
- Database Research Group
- School of Electrical and Computer Engineering
- University of Tehran
18 Sep 2008
2 Outline
- Persian Language
- Persian Test Collections
- Hamshahri in CLEF 2008
- UT Participants
- Using Part of Speech Tagging in Persian Information Retrieval
- Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
- Local Cluster Analysis Using Part of Speech Tagging
- Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
- Cross Language Experiments at Persian_at_CLEF 2008
- Next Year
3 The Persian Language
- A branch of the Indo-European languages
- Official language of Iran, Afghanistan and Tajikistan
- Its morphological analysis is comparatively difficult
- The word ??? has two plural forms
- Persian rules ?????
- Arabic rules ?????
4 Some Processing Issues
- Writing Style Issues
- e.g. ?? ??? and ????? are the same
- e.g. ?????? and ???? ?? are the same
- KASRE
- e.g. ???? ??? ???? ?? ?????? has two different meanings
- CheraghAli burned the house
- Ali's lantern burned the house
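The writing-style issue above (one word written in several surface forms) is commonly handled by normalizing text before indexing. The following is a minimal sketch, not the group's implementation. It assumes the variants differ in Arabic versus Persian code points for Yeh and Kaf, and in whether word parts are joined, separated by a space, or separated by a zero-width non-joiner (ZWNJ). The function names and the example word are illustrative; the slide's own examples did not survive extraction.

```python
import re

# Assumed character variants: Arabic code points that commonly appear
# in Persian text in place of the Persian ones.
CHAR_MAP = str.maketrans({
    "\u064a": "\u06cc",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06a9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})
ZWNJ = "\u200c"  # zero-width non-joiner


def normalize(text: str) -> str:
    """Unify character variants and collapse runs of whitespace."""
    text = text.translate(CHAR_MAP)
    return re.sub(r"\s+", " ", text).strip()


def match_key(text: str) -> str:
    """Key that ignores whether word parts are joined, spaced, or ZWNJ-separated."""
    return re.sub(r"\s+", "", normalize(text).replace(ZWNJ, ""))


# Three spellings of one hypothetical example word ("mi" + "ravam").
joined = "\u0645\u06cc\u0631\u0648\u0645"
spaced = "\u0645\u06cc \u0631\u0648\u0645"
zwnj = "\u0645\u064a\u200c\u0631\u0648\u0645"  # also uses the Arabic Yeh
same = match_key(joined) == match_key(spaced) == match_key(zwnj)
```

A key like this only makes sense for comparing short units such as a word and its affix. Applied to whole sentences it would merge unrelated words, and it does nothing for the KASRE ambiguity, which needs context to resolve.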
5 Some Processing Issues
6 Persian in the Middle East
User Population Growth on the Web (2000-2009)
December 31, 2007
Source: Internet World Stats, http://internetworldstats.com/
7 Persian Test Collections
- IR Domain
- Ghavanin (domain specific)
- Hamshahri (news), WEB: http://ece.ut.ac.ir/dbrg/hamshahri
- NLP Domain
- Bijankhan (2 Million Word), WEB: http://ece.ut.ac.ir/dbrg/bijankhan
8 Hamshahri in CLEF 2008
- News articles of Hamshahri newspaper from 1996 to 2002
- Size of the documents varies from short news (under 1 KB) to rather long articles (e.g. 140 KB)
- 22 assessors
- Evaluation based on DIRECT System
9 Hamshahri in CLEF 2008
Collection size: 564 MB (Unicode text)
No. of documents: 166,774
No. of unique terms: 417,339
Average length of documents: 380 terms
No. of categories: 9
No. of topics: 50 (bilingual)
10 Implementation of our methods
- We submitted the top 100 results for each run
11 Using Part of Speech Tagging in Persian Information Retrieval
Reza Karimpour, Amineh Ghorbani, Azadeh Pishdad, Mitra Mohtarami, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
12 Using Part of Speech Tagging in Persian Information Retrieval
Config. | Corpus | Query
1 | Tagged | Title with equal weighting for all POS tags
2 | Stemmed and tagged | Stemmed title with equal weighting for all POS tags
3 | Stemmed | Stemmed title without POS tagging
4 | Stemmed | Stemmed title plus description
5 | Stemmed (stop words removed) | Stemmed title plus description (stop words removed)
6 | Tagged | Title plus description with equal weighting for all POS tags
7 | Tagged | Title with various weighting schemes for different POS tags
8 | Normal | Title (neither stemmed nor tagged)
13 Using Part of Speech Tagging in Persian Information Retrieval
Weighting scheme | Average precision | R-Precision
20 less used tags omitted, others equal weight | 0.2745 | 0.3097
Noun=3, Verb=2, Adj=1, Adv=1 | 0.2635 | 0.3104
Noun=3, Verb=0, Adj=3, Adv=0 | 0.2597 | 0.2888
Noun=0, Verb=2, Adj=0, Adv=0 | 0.1108 | 0.1256
Noun=0, Verb=0, Adj=1, Adv=0 | 0.1198 | 0.1186
Noun=0, Verb=0, Adj=0, Adv=1 | 0.0977 | 0.1111
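The weighting schemes in this table give each query term a weight that depends on its POS tag. A minimal sketch of that idea follows. It assumes the tagged query is a list of (term, tag) pairs and that a weight of 0 drops the term from the query. The function name, the default for unlisted tags, and the example query are hypothetical and are not taken from the slides.

```python
from typing import Dict, List, Tuple

# One scheme from the table: Noun=3, Verb=2, Adj=1, Adv=1.
SCHEME: Dict[str, float] = {"Noun": 3, "Verb": 2, "Adj": 1, "Adv": 1}


def weight_query(tagged_terms: List[Tuple[str, str]],
                 scheme: Dict[str, float],
                 default: float = 0.0) -> Dict[str, float]:
    """Map each query term to the weight of its POS tag; drop zero-weight terms."""
    weights: Dict[str, float] = {}
    for term, tag in tagged_terms:
        w = scheme.get(tag, default)
        if w > 0:
            # If a term occurs more than once, keep its largest weight.
            weights[term] = max(weights.get(term, 0.0), w)
    return weights


# Hypothetical tagged query (English stand-ins for Persian terms).
query = [("lantern", "Noun"), ("burned", "Verb"), ("old", "Adj"), ("the", "Det")]
weighted = weight_query(query, SCHEME)
```

The resulting weights could then be passed to any retrieval engine that accepts per-term query weights.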
14 Using Part of Speech Tagging in Persian Information Retrieval
15 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Zahra Aghazade, Nazanin Dehghani, Leili Farzinvash, Razieh Rahimi, Abolfazl AleAhmad, Hadi Amiri, Farhad Oroumchian
Weighting Model | Description
BB2 | Bose-Einstein model for randomness, the ratio of two Bernoulli processes for first normalization, and Normalization 2 for term frequency normalization
BM25 | The BM25 probabilistic model
DFR_BM25 | The DFR version of BM25
IFB2 | Inverse Term Frequency model for randomness, the ratio of two Bernoulli processes for first normalization, and Normalization 2 for term frequency normalization
In_expB2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli processes for first normalization, and Normalization 2 for term frequency normalization
In_expC2 | Inverse expected document frequency model for randomness, the ratio of two Bernoulli processes for first normalization, and Normalization 2 for term frequency normalization with natural logarithm
InL2 | Inverse document frequency model for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
PL2 | Poisson estimation for randomness, succession for first normalization, and Normalization 2 for term frequency normalization
TF_IDF | The tf-idf weighting function, where tf is given by Robertson's tf and idf is given by the standard Sparck Jones idf
Terrier Open Source Retrieval Engine: http://ir.dcs.gla.ac.uk/terrier/
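As a point of reference for the BM25 row, here is a minimal sketch of the textbook Okapi BM25 term score. It is not Terrier's code. The parameter values k1 = 1.2 and b = 0.75 are commonly used defaults assumed here, and the idf form with +1 inside the logarithm is one of several variants in use. The collection size and average document length come from the Hamshahri statistics; tf, df and doc_len are made up.

```python
import math


def bm25_term_score(tf: float, df: int, doc_len: float, avg_doc_len: float,
                    num_docs: int, k1: float = 1.2, b: float = 0.75) -> float:
    """Score contribution of one query term to one document (textbook BM25)."""
    if tf <= 0 or df <= 0:
        return 0.0
    # Robertson/Sparck Jones idf with 0.5 smoothing; the +1 inside the log
    # keeps the value positive for very common terms (an assumed variant).
    idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
    # Term-frequency saturation with document-length normalization.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)


# Collection statistics from the Hamshahri slide; tf, df and doc_len are made up.
score = bm25_term_score(tf=3, df=1200, doc_len=380, avg_doc_len=380,
                        num_docs=166_774)
```

A document's score for a query is the sum of this value over the query terms. The DFR models in the table replace the idf and saturation parts with different probabilistic components but follow the same per-term, per-document structure.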
16 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
Weighting Model | Average Precision | R-Precision
BB2 | 0.3854 | 0.4167
BM25 | 0.3562 | 0.4009
DFR_BM25 | 0.4006 | 0.4347
IFB2 | 0.4017 | 0.4328
In_expB2 | 0.3997 | 0.4329
In_expC2 | 0.4190 | 0.4461
InL2 | 0.3832 | 0.4200
PL2 | 0.4314 | 0.4548
TF_IDF | 0.3574 | 0.4017
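For readers unfamiliar with the two measures reported here, the sketch below computes average precision and R-precision for a single topic from a ranked list and a set of relevant documents. The document ids and judgments are invented for illustration. The figures in the table are averages of such per-topic values over all topics, produced by the official evaluation and not by this code.

```python
from typing import List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Mean of precision values at the ranks where relevant documents appear."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return total / len(relevant)


def r_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranking[:r] if doc_id in relevant) / r


# Hypothetical ranking and relevance judgments for one topic.
ranking = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d3", "d1", "d8"}
ap = average_precision(ranking, relevant)
rp = r_precision(ranking, relevant)
```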
17 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
- And two other variations of this operator: IOWA and NOWA
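The operator referred to here is presumably Ordered Weighted Averaging (OWA), since IOWA and NOWA are named as its variations; the slide's own definition did not survive extraction. The following sketch shows plain OWA fusion only, under stated assumptions: scores are already normalized to [0, 1], a document missing from a run scores 0, and the weight vector is made up. It does not implement IOWA or NOWA and should not be read as the authors' method.

```python
from typing import Dict, List


def owa(scores: List[float], weights: List[float]) -> float:
    """Ordered weighted average: weights apply to score ranks, not to sources."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    ordered = sorted(scores, reverse=True)  # largest score first
    return sum(w * s for w, s in zip(weights, ordered))


def fuse(runs: List[Dict[str, float]], weights: List[float]) -> List[str]:
    """Rank documents by the OWA of their scores; a missing score counts as 0."""
    doc_ids = set().union(*runs)
    fused = {d: owa([run.get(d, 0.0) for run in runs], weights) for d in doc_ids}
    # Sort by fused score (descending), breaking ties by document id.
    return sorted(fused, key=lambda d: (-fused[d], d))


# Three hypothetical runs with scores already normalized to [0, 1].
runs = [{"d1": 0.9, "d2": 0.4}, {"d1": 0.2, "d2": 0.5}, {"d2": 0.6, "d3": 0.8}]
ranking = fuse(runs, weights=[0.5, 0.3, 0.2])
```

Because the weights attach to positions in the sorted score list, the vector [1, 0, 0] behaves like taking the maximum score and [0, 0, 1] like taking the minimum; vectors in between trade off the two.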
18 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track
19 Fusion of Retrieval Models at CLEF 2008 Ad-Hoc Persian Track: Post hoc Results
Retrieval Method | Toolkit | Average Precision | R-Precision | Dif
TF_IDF with unstemmed single terms | Terrier | 0.3847 | 0.4122 | -
PL2 with 4gram terms | Terrier | 0.3669 | 0.3939 | -
Indri with stemmed terms | Lemur | 0.3955 | 0.4149 | -
IOWA | - | 0.4515 | 0.4708 | 5.6
NOWA | - | 0.4522 | 0.4736 | 5.67
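The Dif column is not defined on the slide. It appears to be the absolute gain in average precision over the best single run (Indri with stemmed terms), expressed in points; that reading is an inference from the numbers, and the short check below shows it reproduces both reported values.

```python
best_single_ap = 0.3955  # Indri with stemmed terms
fused_ap = {"IOWA": 0.4515, "NOWA": 0.4522}

# Absolute gain in average precision, expressed in points (x100).
dif = {name: round((ap - best_single_ap) * 100, 2)
       for name, ap in fused_ap.items()}
```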
20 Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
Amir Hossein Jadidinejad, Mitra Mohtarami, Hadi Amiri
21 Investigation on Application of Local Cluster Analysis and Part of Speech Tagging on Persian Text
However, the results on the test set were not good
22 Cross Language Experiments at Persian_at_CLEF 2008
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
Run | tot-ret | rel-ret | MAP (%) | Retrieval Model | Tool
Using Light Stemmer | 5161 | 1967 | 26.89 | Vector Space | Lucene
Without Stemmer | 5161 | 1991 | 27.08 | Vector Space | Lucene
3Grams | 5161 | 1901 | 26.07 | Language Modeling | Lemur
4Grams | 5161 | 1950 | 26.70 | Language Modeling | Lemur
5Grams | 5161 | 1983 | 27.13 | Language Modeling | Lemur
Term-Based | 5161 | 2035 | 28.14 | Language Modeling | Lemur
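The 3Grams, 4Grams and 5Grams runs index n-grams in place of whole terms. The sketch below assumes these are overlapping character n-grams taken within each whitespace-separated token, and that tokens no longer than n are kept whole. The slides do not state either detail, and the function names and example text are illustrative.

```python
from typing import List


def char_ngrams(token: str, n: int) -> List[str]:
    """Overlapping character n-grams of a token; short tokens are kept whole."""
    if n <= 0:
        raise ValueError("n must be positive")
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def ngram_terms(text: str, n: int) -> List[str]:
    """Index terms for a text: n-grams of each whitespace-separated token."""
    terms: List[str] = []
    for token in text.split():
        terms.extend(char_ngrams(token, n))
    return terms


terms = ngram_terms("hamshahri news", 4)
```

The same function has to be applied to both documents and queries so that their index terms match. N-grams are a language-independent alternative to stemming, which fits the comparison with the stemmed and unstemmed runs in the table.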
23 Cross Language Experiments at Persian_at_CLEF 2008: Query Translation
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
- Probabilistic Structured Queries (PSQ)
- Combinatorial Translation Probability (CTP)
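The core idea of Probabilistic Structured Queries, as generally described in the cross-language retrieval literature, is to estimate how often a query term "occurs" in a document by summing the counts of its candidate translations, each weighted by its translation probability. The sketch below illustrates that estimate only. The translation table, its probabilities and the document counts are invented, and nothing here covers CTP or how the authors obtained their probabilities.

```python
from typing import Dict


def psq_tf(translations: Dict[str, float], doc_tf: Dict[str, int]) -> float:
    """Expected term frequency of a query term in a document.

    translations maps each candidate translation to its probability;
    doc_tf maps document-language terms to their counts in one document.
    """
    return sum(p * doc_tf.get(term, 0) for term, p in translations.items())


# Hypothetical translation probabilities for one query term, with
# transliterated stand-ins for the Persian candidates.
translations = {"khaneh": 0.7, "manzel": 0.3}
doc_tf = {"khaneh": 2, "manzel": 1, "cheragh": 4}
expected_tf = psq_tf(translations, doc_tf)
```

The same weighted sum can be applied to document frequencies, after which the estimated tf and df can be plugged into any standard weighting model.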
24 Cross Language Experiments at Persian_at_CLEF 2008: Query Translation Results
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
25 Cross Language Experiments at Persian_at_CLEF 2008: Document Translation
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
- Using Shiraz machine translation system from CRL of NMSU
- Took 10 days to translate 130,000 docs from Persian to English
26 Cross Language Experiments at Persian_at_CLEF 2008: Document Translation Hybrid Results
Abolfazl AleAhmad, Ehsan Kamalloo, Arash Zareh, Masoud Rahgozar, Farhad Oroumchian
27 Next Year
- Ham2 for the Next Year
- Extended Version of Hamshahri Collection
- 2 times larger (1.5 GB)
<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<ORIGINALFILE>/1385/851011/news/_adabh.htm</ORIGINALFILE>
<ISSUE>?????? 11 ?? 1385 - ??? ??????? - ?????4172 - Jan 1, 2007</ISSUE>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">??? ? ???</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE>
<![CDATA[?????? ???? ? ????????? ????? ????? ? ????? ?????? ??? ??? ???? ???? ???? ???? ????? ??]]>
</TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[ ???? ???? ?? ???? ? ???? ????? ????? ????? ? ????? ?????? ??? ???? ???]]>
</TEXT>
</DOC>
<DOC>
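The sample record above can be read with a standard XML parser. The sketch below is a minimal illustration and not a loader for the actual collection. It assumes each DOC element is well-formed XML with the tag names shown in the sample, replaces the Persian text with placeholders, and parses one record at a time; a real collection file that concatenates many DOC elements would first need to be split or wrapped in a root element. The function name and the output field names are hypothetical.

```python
import xml.etree.ElementTree as ET

XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"  # expanded xml:lang

# One record in the layout of the sample; Persian text replaced by placeholders.
SAMPLE = """<DOC>
<DOCID>HAM2-851011-001</DOCID>
<DOCNO>HAM2-851011-001</DOCNO>
<DATE>2007-01-01</DATE>
<CAT xml:lang="fa">placeholder</CAT>
<CAT xml:lang="en">Literature and Art</CAT>
<TITLE><![CDATA[placeholder title]]></TITLE>
<TEXT>
<image>/1385/851011/news/008505.jpg</image>
<![CDATA[placeholder body]]>
</TEXT>
</DOC>"""


def parse_doc(xml_text: str) -> dict:
    """Extract id, date, categories by language, title and body from one DOC."""
    doc = ET.fromstring(xml_text)
    text_el = doc.find("TEXT")
    body = ""
    if text_el is not None:
        # The body follows the <image> child, so it is stored as that child's tail.
        body = (text_el.text or "") + "".join(c.tail or "" for c in text_el)
    return {
        "id": doc.findtext("DOCID"),
        "date": doc.findtext("DATE"),
        "categories": {c.get(XML_LANG): c.text for c in doc.findall("CAT")},
        "title": (doc.findtext("TITLE") or "").strip(),
        "body": body.strip(),
    }


record = parse_doc(SAMPLE)
```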
28 Questions?
Thanks For Your Attention
Database Research Group: http://ece.ut.ac.ir/dbrg