Computational Analysis of the Quran through Traditional Arabic Linguistics

Transcript and Presenter's Notes


1
Computational Analysis of the Quran through
Traditional Arabic Linguistics
  • Kais Dukes
  • School of Computing, University of Leeds
  • sckd_at_leeds.ac.uk
  • PhD Supervisor: Dr Eric Atwell

2
The Challenge: An interdisciplinary approach to
analysing the Quran
3
(1) What is the Quran?
The last in a series of 5 religious texts
4
(1) What is the Quran?
The central religious text of Islam
  • Classical Arabic
  • Islamic Law (legal logic)
  • Divine guidance & direction
  • Scientific & philosophical knowledge
  • Has inspired many scientific achievements, e.g.
    algebra and linguistics

5
(2) Traditional Arabic Linguistics
Originated with Arabs studying the language of the
Quran (detailed analysis for at least 1000 years)
  • Orthography (diacritics and vowelization)
  • Etymology (Semitic roots)
  • Morphology (derivation and inflection)
  • Syntax (origins of dependency grammar)
  • Discourse Analysis & Rhetoric
  • Semantics & Pragmatics
6
(3) Computational Linguistics
Where are we now (2009)?
  • Current use of computing to analyze the Quran is
    mostly:
  • Keyword search (useful)
  • Frequency analysis (numerology?)
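
The two techniques named on this slide can be sketched in a few lines of Python. This is only an illustration: the `VERSES` dictionary holds placeholder tokens, not actual verse text, and the function names are made up for the example.

```python
from collections import Counter

# Placeholder data: (chapter, verse) -> space-separated tokens.
# These are stand-in tokens for illustration, NOT real verse text.
VERSES = {
    (1, 1): "name mercy mercy",
    (1, 2): "praise lord worlds",
    (1, 3): "mercy mercy",
}

def keyword_search(verses, keyword):
    """Return sorted references of verses containing keyword as a whole token."""
    return sorted(ref for ref, text in verses.items() if keyword in text.split())

def token_frequencies(verses):
    """Count how often each token occurs across all verses."""
    counts = Counter()
    for text in verses.values():
        counts.update(text.split())
    return counts

if __name__ == "__main__":
    print(keyword_search(VERSES, "mercy"))
    print(token_frequencies(VERSES).most_common(3))
```

Both functions treat the text as a flat bag of surface tokens, which is the limitation the later slides on morphological and syntactic annotation address.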

7
(3) Computational Linguistics
  • How far can we go?
  • Is an artificial intelligence system realistic?
  • Example: question-answering dialog system
  • Question: How long should I breastfeed my child
    for?
  • Answer: Mothers should suckle their offspring for
    two years, if the father wishes to complete the
    term (The Holy Quran, Verse 2:233).

8
PhD Research Project
Central Hypothesis: Augmenting the text of the
Quran with rich annotation will lead to a more
accurate AI system.
  • PhD part 1: Prepare the data by annotating the
    Quran.
  • PhD part 2: Use the data to build an AI system
    for question-answering.
9
Annotating the Quran
Challenges
  • Orthography: Complex script verified in Unicode?
  • Morphology: Arabic is highly inflected, and this
    is challenging to model by computer
  • Syntax: Phrase structure or dependency grammar?
  • Semantics: Logic, lexical frames or semantic
    networks?
10
Annotating the Quran
Solutions
  • Recent computational advances have made it
    possible to annotate the Quran to very high
    accuracy
  • Community effort using volunteers
  • Leverage existing resources from Traditional
    Arabic Grammar
  • Automatic annotation followed by manual
    verification
11
Recent Advances Orthography
(2008) Does an accurate digital copy of the Quran
exist?
  • Encoding Issues
  • Missing diacritics
  • Simplified script (not Uthmani)
  • Windows code page 1256, not Unicode

Google Search for verse (68:38) on Jan 21, 2008
shows many typos
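
Two of the encoding issues listed on this slide can be illustrated with the Python standard library. The sketch below is not part of the original slides. It checks whether a string survives Windows code page 1256 and shows how diacritics are lost when combining marks are stripped. The sample strings are assumptions chosen for the example: a short vowelized word written with escapes, and one Quranic annotation sign (U+06E2).

```python
import unicodedata

def encodable_in_cp1256(text):
    """True if every character of text exists in Windows code page 1256."""
    try:
        text.encode("cp1256")
        return True
    except UnicodeEncodeError:
        return False

def strip_diacritics(text):
    """Drop combining marks (vowel signs etc.), simulating an unvowelized copy."""
    return "".join(ch for ch in text if unicodedata.combining(ch) == 0)

# Beh + kasra, seen + sukun, meem + kasra (a vowelized three-letter word).
VOWELIZED = "\u0628\u0650\u0633\u0652\u0645\u0650"
# ARABIC SMALL HIGH MEEM ISOLATED FORM, one of the Quranic annotation signs.
QURANIC_SIGN = "\u06E2"

if __name__ == "__main__":
    print(encodable_in_cp1256(VOWELIZED))
    print(encodable_in_cp1256(VOWELIZED + QURANIC_SIGN))
    print(len(VOWELIZED), len(strip_diacritics(VOWELIZED)))
```

Basic letters and vowel marks fit into code page 1256, but the Quranic annotation sign does not, which is one reason a Unicode encoding is needed for the Uthmani script.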
12
Recent Advances Orthography
  • Tanzil Project (http://tanzil.info)
  • Stable version released May 2008
  • Uses Unicode XML encoding, including the special
    characters designed for the complex Arabic script
    of the Quran
  • Manually verified to 100% accuracy by a group of
    experts who have memorized the entire text of the
    Quran

13
Recent Advances Orthography
  • Java Quran API (http://jqurantree.org)
  • March 2009
  • Java classes for querying the Tanzil XML of the
    Quran
  • First step towards software package for
    analyzing the Quran
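
As a rough idea of what querying an XML edition of the text involves, here is a minimal Python sketch. It does not use or reproduce the Java API named on the slide. The XML layout (`sura` and `aya` elements with `index` and `text` attributes) is an assumption made for the example and may differ from the actual Tanzil files, and the sample contains placeholder text only.

```python
import xml.etree.ElementTree as ET

# Assumed layout for illustration; the real file's schema may differ.
SAMPLE_XML = """<quran>
  <sura index="1" name="sample-1">
    <aya index="1" text="placeholder text A"/>
    <aya index="2" text="placeholder text B"/>
  </sura>
  <sura index="2" name="sample-2">
    <aya index="1" text="placeholder text C"/>
  </sura>
</quran>"""

def load_verses(xml_text):
    """Map (chapter, verse) -> verse text from the assumed XML layout."""
    root = ET.fromstring(xml_text)
    verses = {}
    for sura in root.iter("sura"):
        chapter = int(sura.get("index"))
        for aya in sura.iter("aya"):
            verses[(chapter, int(aya.get("index")))] = aya.get("text")
    return verses

if __name__ == "__main__":
    verses = load_verses(SAMPLE_XML)
    print(verses[(2, 1)])
```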

14
Recent Advances Morphology
  • Buckwalter Arabic Morphological Analyzer (2002)
  • Morphological Analysis of the Quran at the
    University of Haifa, Israel (2004)
  • Lexeme feature based morphological
    representation of Arabic (Nizar Habash, 2006)

15
The Haifa Corpus (2004)
  • Multiple analyses for each word (up to 5), e.g.
  • rbb fa&l Noun Triptotic Masc Sg Pron Dependent
    1P Sg
  • rbb fa&l Noun Triptotic Masc Sg Gen
  • Not a manually verified corpus
  • The authors report an F-measure of 86%
  • Non-standard annotation scheme not familiar to
    traditional Arabic linguists (e.g. extracting a
    list of all verbs in the corpus is non-trivial)
  • Arabic text is only encoded phonetically instead
    of using the original Arabic script, so searching
    for the possible morphological analyses of a
    specific word is not easy

16
The Crescent Corpus (2009)
  • http://quran.uk.net
  • Manually verified (99% accuracy)
  • Popular website with very positive feedback
  • Thousands of visitors

1. Initial tagging using Buckwalter Analyzer
2. Paid annotator working for 3 months
3. Community of volunteers verifying against
existing books of Traditional Arabic Grammar which
analyse the Quran

Shows Arabic and English morphological analysis
side-by-side, with phonetic transcription, search
and translation.
17
The Crescent Corpus: Part-of-speech Tagging
  • Part-of-speech tags adapted from Traditional
    Arabic Grammar, and mapped to English equivalents
    (not the other way around)
  • These tags apply to words in the Quran, as well
    as to individual morphological segments in the
    text

18
The Crescent Corpus: Verified Uthmani Script
  • Unicode Uthmani Script
  • Sourced from the verified Tanzil project

19
The Crescent Corpus: Phonetics (faja'alnahumu)
  • Phonetic transcription generated algorithmically
  • Guided by Arabic vowelized diacritics
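
A diacritic-guided transcription can be sketched as a character-by-character lookup. The following Python example is a toy illustration, not the corpus's actual algorithm. The mapping tables cover only the handful of letters needed for the sample input and are assumptions made for this sketch.

```python
# Toy mapping tables (assumed for illustration; far from complete).
CONSONANTS = {
    "\u0641": "f",   # feh
    "\u062C": "j",   # jeem
    "\u0639": "'",   # ain
    "\u0644": "l",   # lam
    "\u0646": "n",   # noon
}
VOWEL_MARKS = {
    "\u064E": "a",   # fatha
    "\u064F": "u",   # damma
    "\u0650": "i",   # kasra
    "\u0652": "",    # sukun: no vowel follows the consonant
}

def transcribe(word):
    """Transliterate a fully vowelized word using the toy tables above."""
    out = []
    for ch in word:
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
        elif ch in VOWEL_MARKS:
            out.append(VOWEL_MARKS[ch])
        else:
            raise ValueError(f"no mapping for U+{ord(ch):04X}")
    return "".join(out)

# feh+fatha, jeem+fatha, ain+fatha, lam+sukun, noon+fatha
SAMPLE = "\u0641\u064E\u062C\u064E\u0639\u064E\u0644\u0652\u0646\u064E"

if __name__ == "__main__":
    print(transcribe(SAMPLE))
```

A real transcription scheme would also need rules for long vowels, gemination (shadda), and context-dependent letters, none of which this sketch attempts.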

20
The Crescent Corpus: Interlinear translation
  • Word-for-word translation from accepted sources
  • Interlinear translation scheme

21
The Crescent Corpus: Location Reference (21:70:4)
  • Common standard for verses (Chapter:Verse)
  • Extended in the Crescent corpus to include word
    numbers and segment numbers, e.g. (21:70:4:2)
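
The extended reference scheme maps naturally onto a small record type. The Python sketch below assumes references are written with colon separators, as in chapter:verse:word:segment. The names `Location` and `parse_location` are hypothetical and not taken from the corpus software.

```python
from typing import NamedTuple, Optional

class Location(NamedTuple):
    """A reference to a verse, optionally narrowed to a word and a segment."""
    chapter: int
    verse: int
    word: Optional[int] = None
    segment: Optional[int] = None

def parse_location(ref):
    """Parse 'chapter:verse[:word[:segment]]', with or without parentheses."""
    parts = ref.strip().strip("()").split(":")
    if not 2 <= len(parts) <= 4:
        raise ValueError(f"expected 2 to 4 colon-separated numbers, got {ref!r}")
    return Location(*(int(p) for p in parts))

if __name__ == "__main__":
    print(parse_location("(21:70:4:2)"))
    print(parse_location("21:70"))
```

Keeping the word and segment fields optional lets the same type address a whole verse, a word, or a single morphological segment.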

22
The Crescent Corpus: Morphological Segmentation
  • Division of a single word into multiple segments
  • Part-of-speech tag assigned to each segment
  • Traditional Arabic Grammar rules used for
    division
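
One possible way to represent a segmented word is as an ordered list of (form, tag) pairs. The Python sketch below is an assumption for illustration only. The three-way split of faja'alnahumu and the tag names `CONJ`, `V` and `PRON` are hypothetical labels chosen for the example, not the corpus's verified analysis or tagset.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass(frozen=True)
class Segment:
    form: str  # transliterated form of the segment
    tag: str   # part-of-speech tag assigned to this segment

@dataclass(frozen=True)
class Word:
    location: str                 # e.g. "21:70:4"
    segments: Tuple[Segment, ...]

    def surface(self) -> str:
        """Rejoin the segments into the full word form."""
        return "".join(s.form for s in self.segments)

    def tags(self) -> List[str]:
        """List the tags in segment order."""
        return [s.tag for s in self.segments]

# Hypothetical segmentation and tags, for illustration only.
EXAMPLE = Word(
    location="21:70:4",
    segments=(
        Segment("fa", "CONJ"),
        Segment("ja'alna", "V"),
        Segment("humu", "PRON"),
    ),
)

if __name__ == "__main__":
    print(EXAMPLE.surface(), EXAMPLE.tags())
```

With this layout, segment number 2 of word 21:70:4 is simply `EXAMPLE.segments[1]`, which is how a four-part location reference could address it.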

23
The Crescent Corpus: Morphological segment
features
24
The Crescent Corpus: Arabic Grammar Summary
25
The Crescent Treebank: Syntactic Annotation
(2010)
  • Dependency Grammar based on إعراب (i'rab)
  • Syntactico-semantic roles for each word

26
The Crescent Treebank: What's new about this
research?
  • First Treebank of Classical Arabic
  • Free Treebank of the Quran
  • Well-defined formal representation of
    Traditional Arabic Grammar using hybrid
    constituency/dependency graphs

27
Automatic Annotation: Classical Arabic Dependency
Parser
  • Joakim Nivre (2009): dependency parsing using a
    shift/reduce queue/stack architecture with
    machine learning
  • Following a similar architecture, but with
    hand-written rules, the custom parser has an
    F-measure of 77.2%
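
To make the queue/stack idea concrete, here is a minimal Python sketch of a greedy shift/reduce loop driven by hand-written rules. It is neither Nivre's algorithm nor the parser described on the slide. The rule table, the tag names and the dependency labels are all hypothetical, and the greedy reduction strategy is a deliberate simplification.

```python
from collections import deque

# Hypothetical rules: (left_tag, right_tag) -> (which side is head, arc label).
RULES = {
    ("CONJ", "V"): ("right", "conj"),  # verb heads a preceding conjunction
    ("V", "PRON"): ("left", "obj"),    # verb heads a following pronoun
}

def parse(tags, rules):
    """Greedy shift/reduce parse over a list of POS tags.

    Returns arcs as (head_index, dependent_index, label) in creation order.
    """
    stack = []
    queue = deque(range(len(tags)))
    arcs = []
    while queue or len(stack) > 1:
        if len(stack) >= 2:
            left, right = stack[-2], stack[-1]
            rule = rules.get((tags[left], tags[right]))
            if rule is not None:
                head_side, label = rule
                if head_side == "left":
                    arcs.append((left, right, label))
                    stack.pop()        # reduce: drop the right-hand dependent
                else:
                    arcs.append((right, left, label))
                    del stack[-2]      # reduce: drop the left-hand dependent
                continue
        if not queue:
            break                      # no rule applies and nothing left to shift
        stack.append(queue.popleft())  # shift the next token onto the stack
    return arcs

if __name__ == "__main__":
    print(parse(["CONJ", "V", "PRON"], RULES))
```

Because the loop reduces as soon as any rule matches, it can attach a word before its own dependents have been seen. A fuller rule-based parser would need lookahead or ordering constraints to avoid that.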

28
Questions and Feedback