Title: Computational Analysis of the Quran through Traditional Arabic Linguistics
1 Computational Analysis of the Quran through
Traditional Arabic Linguistics
- Kais Dukes
- School of Computing, University of Leeds
- sckd_at_leeds.ac.uk
- PhD Supervisor: Dr Eric Atwell
2 The Challenge: An interdisciplinary approach to
analysing the Quran
3 (1) What is the Quran?
The last in a series of 5 religious texts
4 (1) What is the Quran?
The central religious text of Islam
- Classical Arabic
- Islamic Law (legal logic)
- Divine guidance and direction
- Scientific and philosophical knowledge
- Has inspired many scientific achievements, e.g.
algebra and linguistics
5 (2) Traditional Arabic Linguistics
Originated with Arabs studying the language of the
Quran (detailed analysis for at least 1000 years)
- Orthography (diacritics and vowelization)
- Etymology (Semitic roots)
- Morphology (derivation and inflection)
- Syntax (origins of dependency grammar)
- Discourse analysis and rhetoric
- Semantics and pragmatics
6 (3) Computational Linguistics
Where are we now (2009)?
- Current use of computing to analyze the Quran is
mostly:
  - Keyword search (useful)
  - Frequency analysis (numerology?)
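As a rough illustration of these two techniques, the sketch below runs a keyword search and a word-frequency count over a tiny stand-in collection. The sample strings, the `VERSES` dictionary and both function names are hypothetical and are not taken from any existing Quran tool.

```python
from collections import Counter

# Hypothetical stand-in data: (chapter, verse) -> text. Not real verse text.
VERSES = {
    (1, 1): "first sample line about mercy",
    (1, 2): "second sample line about praise and mercy",
    (2, 1): "third sample line about guidance",
}

def keyword_search(verses, keyword):
    """Return the sorted locations of verses containing the keyword as a whole word."""
    keyword = keyword.lower()
    return sorted(loc for loc, text in verses.items()
                  if keyword in text.lower().split())

def word_frequencies(verses):
    """Count how often each whitespace-separated token occurs."""
    counts = Counter()
    for text in verses.values():
        counts.update(text.lower().split())
    return counts

print(keyword_search(VERSES, "mercy"))
print(word_frequencies(VERSES).most_common(3))
```

Both functions treat the text as a bag of surface tokens, which is why this kind of analysis cannot answer questions that depend on morphology or syntax.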
7 (3) Computational Linguistics
- How far can we go?
- Is an artificial intelligence system realistic?
- Example: a question-answering dialog system
- Question: How long should I breastfeed my child for?
- Answer: Mothers should suckle their offspring for
two years, if the father wishes to complete the
term (The Holy Quran, Verse 2:233).
8 PhD Research Project
Central Hypothesis: Augmenting the text of the
Quran with rich annotation will lead to a more
accurate AI system.
- PhD part 1: Prepare the data by annotating the Quran.
- PhD part 2: Use the data to build an AI system for
question-answering.
9 Annotating the Quran
Challenges
- Orthography: Complex script; verified in Unicode?
- Morphology: Arabic is highly inflected, which is
challenging to model computationally
- Syntax: Phrase structure or dependency grammar?
- Semantics: Logic, lexical frames or semantic networks?
10 Annotating the Quran
Solutions
- Recent computational advances have made it possible
to annotate the Quran to very high accuracy
- Community effort using volunteers
- Leverage existing resources from Traditional
Arabic Grammar
- Automatic annotation followed by manual verification
11 Recent Advances: Orthography
(2008) Does an accurate digital copy of the Quran
exist?
- Encoding issues
- Missing diacritics
- Simplified script (not Uthmani)
- Windows code page 1256, not Unicode
A Google search for verse (68:38) on Jan 21, 2008
shows many typos
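One simple way to flag the "missing diacritics" problem is to count combining marks in a string. This is a minimal sketch, assuming the diacritics are encoded as Unicode nonspacing marks (category `Mn`); the function names are hypothetical, and the special Uthmani annotation signs may need extra handling that is not shown here.

```python
import unicodedata

def count_diacritics(text):
    """Count Unicode nonspacing marks (category Mn), which include the Arabic short-vowel signs."""
    return sum(1 for ch in text if unicodedata.category(ch) == "Mn")

def strip_diacritics(text):
    """Remove all nonspacing marks, leaving only the base letters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Three consonants (kaf, ta, ba), each followed by a fatha mark (U+064E).
vowelised = "\u0643\u064E\u062A\u064E\u0628\u064E"
bare = strip_diacritics(vowelised)

print(count_diacritics(vowelised), count_diacritics(bare))
```

A copy of the text whose diacritic count is zero, or far below that of a verified copy, is a candidate for the encoding problems listed above.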
12 Recent Advances: Orthography
- Tanzil Project (http://tanzil.info)
- Stable version released May 2008
- Uses Unicode XML encoding, including the special
characters designed for the complex Arabic script
of the Quran
- Manually verified to 100% accuracy by a group of
experts who have memorized the entire text of the
Quran
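A Unicode XML copy of the text can be loaded with a standard XML parser. The sketch below assumes a layout of `sura` elements containing `aya` elements with `index` and `text` attributes; that layout, the placeholder text and the `load_verses` helper are assumptions for illustration and should be checked against the actual Tanzil download.

```python
import xml.etree.ElementTree as ET

# Assumed layout with placeholder text; not copied from the real file.
SAMPLE = """<quran>
  <sura index="1" name="sample">
    <aya index="1" text="placeholder text one"/>
    <aya index="2" text="placeholder text two"/>
  </sura>
</quran>"""

def load_verses(xml_string):
    """Map (chapter, verse) to verse text for every aya element found."""
    root = ET.fromstring(xml_string)
    verses = {}
    for sura in root.iter("sura"):
        for aya in sura.iter("aya"):
            key = (int(sura.get("index")), int(aya.get("index")))
            verses[key] = aya.get("text")
    return verses

verses = load_verses(SAMPLE)
print(verses[(1, 2)])
```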
13 Recent Advances: Orthography
- Java Quran API (http://jqurantree.org)
- March 2009
- Java classes for querying the Tanzil XML of the
Quran
- First step towards a software package for
analyzing the Quran
14 Recent Advances: Morphology
- Buckwalter Arabic Morphological Analyzer (2002)
- Morphological Analysis of the Quran at the
University of Haifa, Israel (2004)
- Lexeme and feature based morphological
representation of Arabic (Nizar Habash, 2006)
15 The Haifa Corpus (2004)
- Multiple analyses for each word (up to 5)
  - rbbfalNounTriptoticMascSgPronDependent1PSg
  - rbbfalNounTriptoticMascSgGen
- Not a manually verified corpus
- The authors report an F-measure of 86
- Non-standard annotation scheme not familiar to
traditional Arabic linguists (e.g. extracting a
list of all verbs in the corpus is non-trivial)
- Arabic text is encoded only phonetically instead
of using the original Arabic script. Searching for the
possible morphological analyses of a specific
word is not easy
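The slide does not say how the F-measure was computed. Assuming it is the usual balanced F-measure (the harmonic mean of precision and recall), it can be sketched as follows; the precision and recall values in the example are arbitrary and are not the corpus's reported figures.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Arbitrary illustrative values, not results from any corpus.
print(round(f_measure(0.9, 0.8), 3))
```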
16 The Crescent Corpus (2009)
- http://quran.uk.net
- Manually verified (99% accuracy)
- Popular website with very positive feedback
- Thousands of visitors
1. Initial tagging using the Buckwalter Analyzer
2. Paid annotator working for 3 months
3. Community of volunteers verifying against existing
books of Traditional Arabic Grammar which analyse the
Quran
Shows Arabic and English morphological
analysis side-by-side, with phonetic
transcription, search and translation.
17 The Crescent Corpus: Part-of-speech Tagging
- Part-of-speech tags adapted from Traditional
Arabic Grammar, and mapped to English equivalents
(not the other way around)
- These tags apply to words in the Quran, as well
as to individual morphological segments in the
text
18 The Crescent Corpus: Verified Uthmani Script
- Unicode Uthmani Script
- Sourced from the verified Tanzil project
19 The Crescent Corpus: Phonetics (faja'alnahumu)
- Phonetic transcription generated algorithmically
- Guided by Arabic vowelized diacritics
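To show how diacritics can guide a transcription, here is a toy sketch that handles only four consonants, the three short vowels, sukun and shadda. It is not the corpus's actual algorithm; the mapping tables and the `transcribe` function are hypothetical, and long vowels, hamza and the Uthmani-specific signs are deliberately left out.

```python
# Toy mapping tables: a small, illustrative subset of the Arabic script.
CONSONANTS = {"\u0628": "b", "\u062A": "t", "\u0631": "r", "\u0643": "k"}
VOWELS = {"\u064E": "a", "\u064F": "u", "\u0650": "i"}  # fatha, damma, kasra
SUKUN, SHADDA = "\u0652", "\u0651"

def transcribe(word):
    """Transcribe a fully vowelised word built from the toy alphabet above."""
    out = []
    last_consonant = None  # position in `out` of the most recent consonant
    for ch in word:
        if ch in CONSONANTS:
            last_consonant = len(out)
            out.append(CONSONANTS[ch])
        elif ch in VOWELS:
            out.append(VOWELS[ch])
        elif ch == SUKUN:
            continue  # sukun marks the absence of a vowel
        elif ch == SHADDA and last_consonant is not None:
            # Shadda doubles the consonant, whether it is stored before or after the vowel mark.
            out.insert(last_consonant, out[last_consonant])
        else:
            raise ValueError(f"character not covered by this sketch: {ch!r}")
    return "".join(out)

print(transcribe("\u0643\u064E\u062A\u064E\u0628\u064E"))
```

The shadda branch inserts the doubled consonant at the consonant's own position, so the result is the same whichever order the shadda and the vowel mark are stored in.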
20 The Crescent Corpus: Interlinear translation
- Word-for-word translation from accepted sources
- Interlinear translation scheme
21 The Crescent Corpus: Location Reference (21:70:4)
- Common standard for verses (Chapter:Verse)
- Extended in the Crescent corpus to include word
numbers and segment numbers, e.g. (21:70:4:2)
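A location reference of this kind maps naturally onto a small record type. The sketch below assumes references are written with colon separators in the order chapter, verse, word, segment (for example 21:70:4:2); the `Location` class and `parse_location` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Location:
    chapter: int
    verse: int
    word: Optional[int] = None     # present only in word-level references
    segment: Optional[int] = None  # present only in segment-level references

def parse_location(ref):
    """Parse 'chapter:verse[:word[:segment]]', with or without parentheses."""
    parts = [int(p) for p in ref.strip("()").split(":")]
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated numbers: {ref!r}")
    return Location(*parts)

print(parse_location("(21:70:4:2)"))
```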
22 The Crescent Corpus: Morphological Segmentation
- Division of a single word into multiple segments
- Part-of-speech tag assigned to each segment
- Traditional Arabic Grammar rules used for
division
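One possible in-memory representation of a segmented word is sketched below. The three-way split of the transcription faja'alnahumu and the tag names `CONJ`, `V` and `PRON` are illustrative assumptions and may not match the corpus's actual annotation; the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Segment:
    form: str  # phonetic form of this segment
    pos: str   # part-of-speech tag assigned to this segment

@dataclass
class Word:
    transcription: str
    segments: List[Segment]

def words_with_pos(words, pos):
    """Return every word that has at least one segment carrying the given tag."""
    return [w for w in words if any(s.pos == pos for s in w.segments)]

# Illustrative segmentation only; the split and tags are assumptions.
word = Word("faja'alnahumu", [
    Segment("fa", "CONJ"),
    Segment("ja'alna", "V"),
    Segment("humu", "PRON"),
])

print([s.pos for s in word.segments])
print(len(words_with_pos([word], "V")))
```

With per-segment tags, a query such as "list all verbs" becomes a simple filter.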
23 The Crescent Corpus: Morphological segment
features
24 The Crescent Corpus: Arabic Grammar Summary
25 The Crescent Treebank: Syntactic Annotation
(2010)
- Dependency Grammar based on إعراب (i'rab)
- Syntactico-semantic roles for each word
26 The Crescent Treebank: What's new about this
research?
- First Treebank of Classical Arabic
- Free Treebank of the Quran
- Well-defined formal representation of
Traditional Arabic Grammar using hybrid
constituency/dependency graphs
27 Automatic Annotation: Classical Arabic Dependency
Parser
- Joakim Nivre (2009): dependency parsing using a
shift/reduce queue/stack architecture with
machine learning
- Following a similar architecture, but with
hand-written rules, the custom parser has an
F-measure of 77.2
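The general idea of a shift/reduce parser driven by hand-written rules can be sketched in a few lines. This is not the parser described above: the two rules, the tag names and the relation labels are hypothetical, and the transition logic is a simplified arc-standard scheme chosen for illustration.

```python
from collections import deque

# Hypothetical hand-written rules: (head tag, dependent tag) -> relation label.
RULES = {("V", "N"): "subj", ("N", "ADJ"): "adj"}

def parse(tags):
    """Return (head, dependent, label) arcs over word positions for a tag sequence."""
    stack, queue, arcs = [], deque(range(len(tags))), []
    while queue or len(stack) > 1:
        if len(stack) >= 2:
            below, top = stack[-2], stack[-1]
            if (tags[top], tags[below]) in RULES:
                # LEFT-ARC: the top of the stack becomes head of the word below it.
                arcs.append((top, below, RULES[(tags[top], tags[below])]))
                del stack[-2]
                continue
            # Delay RIGHT-ARC while the top word could still take the next queue word.
            top_may_take_next = bool(queue) and (tags[top], tags[queue[0]]) in RULES
            if (tags[below], tags[top]) in RULES and not top_may_take_next:
                # RIGHT-ARC: the word below becomes head of the top of the stack.
                arcs.append((below, top, RULES[(tags[below], tags[top])]))
                stack.pop()
                continue
        if not queue:
            break  # no rule applies; remaining words stay unattached
        stack.append(queue.popleft())  # SHIFT
    return arcs

print(parse(["V", "N", "ADJ"]))
```

In a machine-learning parser, a trained classifier picks the next transition; here the `RULES` table plays that role.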
28 Questions and Feedback