Arabic Natural Language Processing: State of the Art and Prospects - PowerPoint PPT Presentation

1
Arabic Natural Language Processing: State of the Art and Prospects
  • Rached Zantout, Ph.D.
  • Electrical and Computer Engineering Department
  • Hariri Canadian University
  • Mechref, Chouf, Lebanon

2
Outline
  • What is NLP ?
  • Why NLP?
  • MT as a case study!
  • Problems solved by MT.
  • Main players in MT.
  • How does Arabic compare to other Languages as far
    as NLP is concerned?
  • MT as a case study.
  • What kind of research is being conducted in ANLP?
  • Recommendations!

3
Tracing the history of NLP
4
NL and NLP definitions
adapted from http://www.cs.bham.ac.uk/pxc/nlpa/index02.htm
  • 'natural language' (NL)
  • Any of the languages naturally used by humans,
  • not an artificial or man-made language such as a
    programming language.
  • (Arabic, English, Chinese, Swahili, etc.)
  • evolved over thousands of years.
  • efficient vehicles for human to human
    communication.
  • 'Natural language processing' (NLP)
  • attempts to use computers to process a NL.
  • Enter computers.
  • What's the connection?

5
Why ?
adapted from http://www.cs.utexas.edu/users/ear/cs378NLP/
  • Is there any reason a computer should know
    English or Chinese or Swahili?
  • Yes. There are several "killer apps" for NLP
  • retrieving information from the web,
  • translating documents from one language to
    another, and
  • spoken front ends to all kinds of application
    programs.

6
NLP includes
adapted from http://www.cs.bham.ac.uk/pxc/nlpa/index02.htm
  • Speech synthesis
  • is this very 'intelligent'?
  • synthesis of natural-sounding speech is
    technically complex
  • requires some 'understanding' of what is being
    spoken to ensure, for example, correct
    intonation. (bear vs. dear)
  • Speech recognition
  • reduction of continuous sound waves to discrete
    words.
  • Natural language understanding
  • moving from isolated words (written or via speech
    recognition)
  • to 'meaning'.
  • Natural language generation
  • generating appropriate NL responses to
    unpredictable inputs.
  • Machine translation (MT): translating one NL into
    another

7
Areas Related to NLP
  • Input
  • Speech Recognition.
  • Natural Language Understanding.
  • Lip Reading ?
  • Processing
  • Information Retrieval
  • Finding where textual resources reside.
  • Information Extraction
  • Extracting pertinent facts from textual
    resources.
  • Inference: drawing conclusions based on known
    facts.
  • Spelling Correction.
  • Grammar Checking.
  • Output
  • Natural Language Generation.
  • Speech Synthesis.
  • Machine Translation.
  • Conversational Agents.

8
NLP (taken from http://tangra.si.umich.edu/radev/NLP/notes/1.ppt)
  • Information extraction
  • Named entity recognition
  • Trend analysis
  • Subjectivity analysis
  • Text classification
  • Anaphora resolution, alias resolution
  • Cross-document cross-reference
  • Parsing
  • Semantic analysis
  • Word sense disambiguation
  • Word clustering
  • Question answering
  • Summarization
  • Document retrieval (filtering, routing)
  • Structured text (relational tables)
  • Paraphrasing and paraphrasing/entailment ID
  • Text generation
  • Machine translation

9
Sample projects
  • Noun phrase parser
  • Paraphrase identification
  • Question answering
  • NL access to databases
  • Named entity tagging
  • Rhetorical parsing
  • Anaphora resolution, entity cross-reference
  • Document and sentence alignment
  • Using bioinformatics methods
  • Encyclopedia
  • Information extraction
  • Speech processing
  • Sentence normalization
  • Text summarization
  • Sentence compression
  • Definition extraction
  • Crossword puzzle generation
  • Prepositional phrase attachment
  • Machine translation
  • Generation
  • Semi-structured document parsing
  • Semantic analysis of short queries
  • User-friendly summarization
  • Number classification
  • Domain-specific PP attachment
  • Time-dependent fact extraction

10
Main research forums and other pointers
  • Conferences: ACL/NAACL, SIGIR, AAAI/IJCAI, ANLP,
    Coling, HLT, EACL/NAACL, AMTA/MT Summit,
    ICSLP/Eurospeech
  • Journals: Computational Linguistics, Natural
    Language Engineering, Information Retrieval,
    Information Processing and Management, ACM
    Transactions on Information Systems, ACM TALIP,
    ACM TSLP
  • University centers: Columbia, CMU, JHU, Brown,
    UMass, MIT, UPenn, USC/ISI, NMSU, Michigan,
    Maryland, Edinburgh, Cambridge, Saarland,
    Sheffield, and many others
  • Industrial research sites: IBM, SRI, BBN, MITRE,
    MSR, (ATT, Bell Labs, PARC)
  • Startups: Language Weaver, Ask.com, LCC
  • The Anthology: http://www.aclweb.org/anthology

11
NLP Sources
  • Journals
  • Artificial Intelligence.
  • Computational Intelligence.
  • IEEE Transactions on Intelligent Systems.
  • Journal of Artificial Intelligence Research.
  • Cognitive Science.
  • Machine Translation.
  • Conferences
  • AAAI: American Association for Artificial
    Intelligence.
  • IJCAI: International Joint Conference on
    Artificial Intelligence.
  • Cognitive Science Society Conferences.
  • DARPA Speech and Natural Language Processing
    Workshop.
  • ARPA Workshop on Human Language Technology.
  • Machine Translation Summit series of conferences.
  • TALN series of conferences.
  • COLING series of conferences.
  • Collection of papers
  • Readings in Natural Language Processing.

12
Why NLP? Numbers
  • Information age! Information revolution!
  • Cheaper PCs
  • Advances in networking
  • Internet/www: central pillar of modern societies
  • Massive production of information
  • Growth of www?
  • 800 million documents as of Sep. 1999
  • People?
  • US: 6.5 M new adult users between 2/99 and 5/99
  • World: 26 million in 1995
  • 163.25 million as of 9/99

Year    92    93    94      96       Sep. 99
Sites   50    250   2000    > 100K   43 M
14
More Recent Statistics (2006)
15
Web Characterization: Country Statistics
http://www.oclc.org/research/projects/archive/wcp/stats/intnl.htm

1999                                  2002
Country    Percent of public sites    Country        Percent of public sites
US         49                         US             55
Germany    5                          Germany        6
UK         5                          Japan          5
Canada     4                          UK             3
Japan      3                          Canada         3
Australia  2                          Italy          2
Brazil     2                          France         2
Italy      2                          Netherlands    2
France     2                          Others         18
Others     16                         Unknown        4
Unknown    10
16
Web Characterization: Language Statistics
http://www.oclc.org/research/projects/archive/wcp/stats/intnl.htm

1999                                   2002
Language    Percent of public sites    Language      Percent of public sites
English     72                         English       72
German      7                          German        7
French      3                          Japanese      6
Japanese    3                          Spanish       3
Spanish     3                          French        3
Chinese     2                          Italian       2
Italian     2                          Dutch         2
Portuguese  2                          Chinese       2
Dutch       1                          Korean        1
Finnish     1                          Portuguese    1
Russian     1                          Russian       1
Swedish     1                          Polish        1
17
What's the Use of the Numbers?
  • Prove that there is a Linguistic Problem
  • Domination of the English Language.
  • Alienates non-English Speakers.
  • Computers are our interface to the internet
  • Computers do not understand a Natural Language.
  • We do not have enough time to guide computers to
    do what is required of them
  • E.g. search for all presentations about NLP on
    the internet.
  • Digest them and produce one presentation
    appropriate for my talk at UOB :-)

18
What's the Use of the Numbers?
  • Middle-East is a growing internet market
  • Growing very fast.
  • Lots of Arabs (read non-English speakers).
  • Need to communicate with my own language.
  • Need computer to save time for me while searching
    for information.
  • Dream: the computer could do most of my work and
    I can just relax?
  • Introducing the A into ANLP.

19
The Linguistic Problem: Machine Translation (MT), a Case Study
  • English: the de facto international language
  • Internet and www (CyberEnglish!)
  • Science and Technology
  • Trade and Industry
  • Politics and Media
  • Tourism
  • Etc.
  • English: key to accessing knowledge in all walks
    of life!
  • Alienation of the HUGE majority of world
    population
  • Impoverishment of world cultures

20
The Linguistic Challenge
  • France
  • 1997 7 French presence on www
  • Legislation introduced (forcing Internet content
    providers to translate web sites into French)
  • Pres. Chirac: "If in the new media, our language,
    our programs, our creations, are not strongly
    present, the young generation of our country will
    be economically and culturally marginalized"
  • "I do not want to see the European Culture
    sterilized or obliterated by the American
    Culture"
  • French is stronger than Arabic on the internet
    and the PC.

21
If not General NLP! How about at least MT?
  • Languages in the world
  • 6,800 living languages
  • 600 with written tradition
  • 95% of the world population speaks 100 languages
  • Translation Market
  • 8 Billion Global Market
  • Doubling every five years
  • (Donald Barabé, invited talk, MT Summit 2003)

22
The Problem
  • Coping with the huge amount of articles, books,
    patents in all disciplines (Assimilation)
  • Coping with the www massive volume
  • Exporting economic products (Dissemination)
  • Facing the Omnipresence of English
  • 50% of all scientific and technical references
  • → Linguistic, cultural, social, educational,
    economic, and political factors

23
Human Translation too limited → MT
Translation Cost in EU is 1 Billion
Official Languages from 11 to 20
1600 Human Translators
24
Why Machine Translation?
  • Full Translation
  • Domain specific
  • Weather reports
  • Machine-aided Translation
  • Translation dictionaries
  • Translation memories
  • Requires post-editing
  • Cross-lingual NLP applications
  • Cross-language IR
  • Cross-language Summarization

25
MT A Strategic Choice
  • USA: FCCSET report on MT (1993) at the
    President's request.
  • Japan: 200 million over 15 years until 1991.
    (Asian Multilingual MTS since 87)
  • EU: since 1991, 220 projects on Language
    Technology (30 million on Eurotra!)
  • 1996 report on the state of MT

26
MT Players
  • Governments
  • US, Europe, Japan, Canada, ex-USSR (cold
    war), Korea, Malaysia, Indonesia, Thailand, etc.
  • International institutions
  • UN, European Commission (12 languages soon to be
    22/23!!), etc.
  • Companies (R&D): Microsoft, Siemens, Fujitsu,
    Hitachi, Toshiba, Oki, NEC, Mitsubishi, Sharp

27
MT Market
  • World: estimated at 20 billion in 1991
  • MT tools market: 20 million in 1994
  • > 160 language pairs
  • > 60 MTSs being developed (as of 98)
  • Globalink claims 600 K users of its MTS
  • Lang. Eng. Corp. income (LogoVista): 2M
  • Smart Communications (Smart Translator): 6M
  • Systran (12 languages): 60,000 pages/year

28
AMT
29
ANLP: Asharq Al-Awsat (الشرق الأوسط) 09.10.03
30
ANLP State Compared to General NLP
  • Script problem
  • Arabic characters are nowhere near Latin-Based
    Characters.
  • Lack of funding
  • Governments.
  • Pan-Arab Organizations.
  • Industry ?! Private Sector.
  • Research ???
  • Infrastructure !

31
Progress in Western MT: Statistical MT example

2002:
insistent Wednesday may recurred her trips to Libya tomorrow for flying Cairo 6-4 ( AFP ) - an official announced today in the Egyptian lines company for flying Tuesday is a company " insistent for flying " may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment .

2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya Cairo 4-6 (AFP) - said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.

Human Translation:
Egypt Air May Resume its Flights to Libya Tomorrow Cairo, April 6 (AFP) - An Egypt Air official announced, on Tuesday, that Egypt Air will resume its flights to Libya as of tomorrow, Wednesday, after the UN Security Council had announced the suspension of the embargo imposed on Libya.

From a talk by Charles Wayne, DARPA
32
A First taste of Arabic Machine Translation
  • English Text
  • Before more than 30,000 fans who headed to the
    Cite Sportive from all Lebanese region on Sunday
    Nejmeh drew 1-1 with their traditional rivals
    Ansar in a breathtaking showdown, which saw both
    teams performing their best.
  • Human Translation
  • [Arabic text lost in extraction]
  • Ajeeb Translation
  • [Arabic text lost in extraction]

33
A 1st Taste of Arabic MT
  • A sample of sentences to be translated
  • Quite disappointing!
  • But, need for a more formal assessment and closer
    scrutiny

34
Multilingual Challenges: Morphological Variations
  • Affixation vs. Root+Pattern

English (affixation)    Arabic (root + pattern)
write → written         كتب → مكتوب
kill → killed           قتل → مقتول
do → done               فعل → مفعول
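
The root+pattern idea on this slide can be illustrated with a small sketch. It is not taken from the talk: the Latin transliteration, the digit-slot pattern notation ("1", "2", "3" standing for the first, second and third root consonant) and the function name apply_pattern are all assumptions made for illustration, and real Arabic morphology involves many more patterns and spelling rules than this.

```python
# Illustrative sketch of Arabic root+pattern word formation.
# Assumptions (not from the slides): transliterated consonants, and a
# pattern notation where the digits 1-3 mark slots for the root consonants.

def apply_pattern(root, pattern):
    """Interleave the consonants of `root` into the digit slots of `pattern`."""
    out = []
    for ch in pattern:
        if ch in "123":
            out.append(root[int(ch) - 1])  # slot -> matching root consonant
        else:
            out.append(ch)                 # vowels/affixes come from the pattern
    return "".join(out)

PERFECT = "1a2a3a"               # e.g. k-t-b -> kataba ("he wrote")
PASSIVE_PARTICIPLE = "ma12uu3"   # e.g. k-t-b -> maktuub ("written")

ROOTS = {
    "write": ("k", "t", "b"),
    "kill": ("q", "t", "l"),
    "do": ("f", "ʿ", "l"),
}

for gloss, root in ROOTS.items():
    print(gloss, apply_pattern(root, PERFECT), apply_pattern(root, PASSIVE_PARTICIPLE))
```

The point of the contrast is that English adds material at the edge of the stem (kill + ed), whereas the same pattern is threaded through every Arabic root, so a simple suffix-stripping stemmer designed for English does not transfer.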
35
Translation Divergences: conflation
  • Arabic: لست هنا (gloss: I-am-not here)
  • English: I am not here
  • French: Je ne suis pas ici (gloss: I not be not here)
36
Translation Divergences: categorial, thematic and structural
  • Arabic: أنا بردان (gloss: I cold)
  • English: I am cold
37
Translation Divergences: head swap and categorial
  • English: I swam across the river quickly
  • Arabic: أسرعت عبور النهر سباحة (gloss: I-sped
    crossing the-river swimming)
38
Translation Divergences: head swap and categorial
[Diagram: part-of-speech labels for the two sentences: verb, noun, prep, noun, adverb, verb]
39
Fluency vs. Accuracy
[Figure: MT types placed on Fluency and Accuracy axes: FAHQ MT, conMT, Prof. MT, Info. MT]
40
Evaluation of MTSs
  • Various methodologies put forward
  • Various aspects considered
  • Intelligibility, Fidelity, and other software
    engineering features
  • Mostly human-centered
  • Get users to compare human translation and MT output
  • Get users to evaluate MT output on a scale (e.g.
    1-5)
  • Subjective to a large extent

41
Automatic Evaluation Example: Bleu Metric
  • Test Sentence
  • colorless green ideas sleep furiously

Gold Standard References
  • all dull jade ideas sleep irately
  • drab emerald concepts sleep furiously
  • colorless immature thoughts nap angrily
42
Automatic Evaluation Example: Bleu Metric
  • Test Sentence
  • colorless green ideas sleep furiously

Gold Standard References
  • all dull jade ideas sleep irately
  • drab emerald concepts sleep furiously
  • colorless immature thoughts nap angrily
Unigram precision = 4/5
43
Automatic Evaluation Example: Bleu Metric
  • Test Sentence
  • colorless green ideas sleep furiously

Gold Standard References
  • all dull jade ideas sleep irately
  • drab emerald concepts sleep furiously
  • colorless immature thoughts nap angrily
Unigram precision = 4 / 5 = 0.8
Bigram precision = 2 / 4 = 0.5
Bleu Score = (a1 × a2 × ... × an)^(1/n) = (0.8 × 0.5)^(1/2) = 0.6325 → 63.25
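
The arithmetic on this slide can be reproduced with a short script. This is a minimal sketch of the simplified score shown here (the geometric mean of clipped n-gram precisions up to bigrams), not the full BLEU metric, which also applies a brevity penalty and usually goes up to 4-grams. The function names ngrams, modified_precision and simple_bleu are illustrative and do not come from any library.

```python
from collections import Counter
from math import prod

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of the candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    # For each n-gram, the most times it occurs in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return matched / total if total else 0.0

def simple_bleu(candidate, references, max_n=2):
    """Geometric mean of the 1..max_n precisions (no brevity penalty)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    return prod(precisions) ** (1 / max_n)

candidate = "colorless green ideas sleep furiously".split()
references = [
    "all dull jade ideas sleep irately".split(),
    "drab emerald concepts sleep furiously".split(),
    "colorless immature thoughts nap angrily".split(),
]

p1 = modified_precision(candidate, references, 1)  # 4 of 5 unigrams match
p2 = modified_precision(candidate, references, 2)  # 2 of 4 bigrams match
score = simple_bleu(candidate, references)
print(p1, p2, round(score, 4))
```

Only "green" fails to match at the unigram level, and only "ideas sleep" and "sleep furiously" match at the bigram level, which is where the 4/5 and 2/4 on the slide come from.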
44
Evaluating AMTs
  • 3 Arabic MT systems tested
  • Al-Mutarjim Al-Arabey (by ATA Software Tech.)
  • Al-Wafi (by ATA Software Tech.)
  • Arabtrans (by Arab.Net Tech.)
  • Sample texts translated.
  • Scoring by a human (1 or 0.5 or 0 )
  • Results

45
Analysis of the results
  • Poor AMT systems overall
  • Good Lexicon coverage in the domain Internet and
    Arabisation
  • Very Poor Grammatical results
  • detailed analysis focuses on bad areas.
  • Pronoun resolution and semantic correctness
  • barely above average
  • (almost 1 error out of each 2 cases!)
  • The technology used in AMTSs is outdated

46
Future Work
  • Develop awareness of the importance of MT and NLP
    for Arabic.
  • Develop our own MT system based on all that we
    have learned from the evaluation
  • Focus on Statistical techniques
  • Speed of Implementation.
  • Obtaining better results.

47
AMT and Lebanon (ECOMLEB, no. 2, 1st Quarter 2005)
  • "How can you explain why so many in the IT field
    can't find a job in Lebanon when we keep hearing
    that we are the best in the region?" Readers'
    Comments, p. 02.
  • "Khan Al-Saboun, a local soap maker in Tripoli,
    now sells soaps all over the world." University
    Series, p. 05.
  • "Lebanon has one of the highest rates of
    internet usage in the area, a good PC
    penetration, abundant human talent and resources
    in IT and particularly software and web design,
    and no money transfer restrictions." Interview
    with Minister of Economy and Trade, H.E. Adnan
    Kassar, p. 16.
  • "Lebanon needs to reduce brain drain."
    Interview with Minister of Economy and Trade,
    H.E. Adnan Kassar, p. 17.
  • "Lebanon has a multilingual and highly educated
    human resource base." Interview with Minister of
    Economy and Trade, H.E. Adnan Kassar, p. 17.
  • B2C e-commerce is expected to cross the US$1
    billion mark by 2008 in GCC countries,
    particularly in e-shopping, mainly in Saudi
    Arabia and the UAE; compound average growth of
    22% over 5 years; > 33.33% of transactions are
    bookings for airlines and hotels.

48
Recommendations
  • Develop Arab acceptance of the strategic nature
    of ANLP/AMT
  • Establishing an Arab Centre for Arabic language
    processing and AMT
  • Gather Arab researchers
  • Host and sponsor research
  • Morphology,
  • Parsing,
  • Speech
  • semantics, pragmatics
  • Building a central repository
  • software,
  • lexicons,
  • corpora,
  • Tools
  • and archive (literature)

49
Recommendations (cont.)
  • Strengthen ties between Academia, research
    centers, and industry
  • Sponsor Pan-Arab projects (ESPRIT-like)
  • Sponsor conferences, exhibitions, and trade
    shows
  • Coordinate Different Conferences
  • 2 upcoming ANLP conferences AT THE SAME TIME in 2
    Different places (KSA and Morocco)
  • Plan for a third (UAE).
  • Strengthen links with western institutions (on
    NLP/MT)
  • Already western researchers are active in ANLP
  • A workshop in London in the same time frame as
    both conferences in KSA and Morocco.

50
Thank you for your patience!
  • References
  • Ahmed Guessoum, Rached Zantout, A Methodology for
    Evaluating Arabic Machine Translation Systems,
    Machine Translation, Volume 18, Issue 4, Dec
    2004, Pages 299 - 335
  • R. Zantout and A. Guessoum, An Automatic
    English-Arabic HTML Page Translation System,
    Journal of Network and Computer Applications,
    vol. 4, no. 24, October 2001.
  • A. Guessoum and R. Zantout, A Methodology for a
    semi-automatic evaluation of the language
    coverage of machine translation system lexicons,
    The Journal of Machine Translation, Kluwer
    Academic Publishers, The Netherlands, vol. 16,
    October 2001.
  • Zantout, Rached and Guessoum, Ahmed, Arabic
    Machine Translation: A Strategic Choice for the
    Arab World, Journal of King Saud University, Vol.
    12, Computer and Information Sciences, pp.
    117-144, A.H. 1420-2000.
  • Ahmed Guessoum, Rached Zantout, Machine
    Translation, A Strategic Dimension for the Arab
    World, University Forum, University of Sharjah,
    Issue 41, Year 6, Muharram 1427, February 2006,
    pp. 32-37.
  • Guessoum, Ahmed and Zantout, Rached, Arabizing
    the Internet and its effect on the development of
    the Kingdom of Saudi Arabia, The 100 years
    symposium of the King Saud University, Riyadh,
    Saudi Arabia, 18-19/10/1999.
  • Guessoum, Ahmed and Zantout, Rached, Towards a
    Strategic Effort, with a Central Theme of Machine
    Translation, to meet the challenges of the
    Information Revolution, 1998 Symposium of
    Proliferation of Arabization and Development of
    Translation in the Kingdom of Saudi Arabia, King
    Saud University, Riyadh.
  • Machine Translation: Challenges and Approaches,
    Invited Lecture, CS 4705 Introduction to Natural
    Language Processing, Fall 2004, Nizar Habash,
    Post-doctoral Fellow, Center for Computational
    Learning Systems, Columbia University.