A Special Reading on Web as Corpus and English Writing Assistant

1
A Special Reading on Web as Corpus and English
Writing Assistant
  • Hua-Ping ZHANG
  • Ph.D. Candidate
  • Inst. of Computing Tech., CAS
  • 2004-5-8

2
Outline
  • Introduction (Adam Kilgarriff)
  • The Web as a Parallel Corpus (Philip Resnik)
  • wEBMT (Andy Way, Dublin City Univ.)
  • Automatic Association of Web Directories with
    Word Senses (Celina S.)
  • Web-based English Writing Assistant (Hua-Ping
    Zhang)
  • Conclusion

3
Introduction
  • Statistical approaches have proved successful in
    many NLP and information-processing applications.
    However, the corpus becomes the bottleneck: is it
    balanced? Is it large enough?
  • Smoothing techniques are used to address data
    sparseness in language modeling, but they cannot
    solve the problem. Performance improves with data
    size, and getting more data makes more difference
    than fine-tuning algorithms.
  • Language scientists are increasingly turning to
    the Web as a source of language data, either
    because the Web is the only source for the type
    of language they are interested in, or simply
    because it is free and instantly available.

4
Introduction II: Application Samples
  • Spelling check
  • Speculater or speculator?
  • Google search: 112 hits vs. 171,000
  • Finding the right translation
  • French phrase: groupe de travail
  • Frequency of candidate English translations on
    the Web:
  • labor cluster 21, labor collective 428, labor
    group 10,389, work group 148,331
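Both samples reduce to the same rule: pick the candidate with the highest web frequency. A minimal Python sketch, assuming the hit counts have already been fetched from a search engine (here they are hard-coded from the slide):

```python
# Pick the candidate spelling or translation with the most web hits.
def pick_most_frequent(candidates, hit_counts):
    """Return the candidate with the highest web frequency."""
    return max(candidates, key=lambda c: hit_counts.get(c, 0))

# Spelling: speculater vs. speculator
spelling_hits = {"speculater": 112, "speculator": 171_000}
print(pick_most_frequent(list(spelling_hits), spelling_hits))   # speculator

# Translating "groupe de travail": candidate English phrases
phrase_hits = {"labor cluster": 21, "labor collective": 428,
               "labor group": 10_389, "work group": 148_331}
print(pick_most_frequent(list(phrase_hits), phrase_hits))       # work group
```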

5
Introduction III
  • In principle, any collection of more than one
    text can be called a corpus. It should be
    considered under four main headings: sampling and
    representativeness, finite size, machine-readable
    form, and status as a standard reference.
  • What is a corpus? What is a good corpus? Is
    corpus x good for task y? The semantic question:
    is x a corpus at all?
  • To summarize, a corpus is a collection of texts
    when considered as an object of language or
    literary study.

6
Introduction IV
  • Web size and the multilingual Web
  • 172 million network addresses in Jan. 2003;
    4,285,199,774 pages indexed by Google
  • BNC: 100 million words vs. the Web's word count?
  • "deep breath": 732 occurrences in the BNC vs.
    868,631 on the Web
  • Function words, such as the, with, and in, occur
    with a frequency that is relatively stable across
    many different languages.
  • Language errors on the Web: the Web is a dirty
    corpus, but expected usage is much more frequent
    than what might be considered noise.
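The stable-frequency observation is what turns these counts into a size estimate: if a phrase occurs k times in the 100M-word BNC and K times on the Web, the ratio K/k scales the BNC size up to a web estimate. A sketch using the slide's "deep breath" figures:

```python
# Extrapolate total web word count from one phrase's relative frequency,
# assuming the phrase occurs at the same rate in the BNC and on the Web.
BNC_WORDS = 100_000_000

def estimate_web_words(bnc_count, web_count):
    """Scale the BNC size by the web/BNC count ratio."""
    return web_count / bnc_count * BNC_WORDS

# "deep breath": 732 BNC occurrences vs. 868,631 web hits (slide figures)
print(f"about {estimate_web_words(732, 868_631):.2e} English words")
```

In practice one averages such estimates over many function words and phrases to smooth out per-phrase quirks.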

7
Introduction V
8
Introduction VI
9
Introduction VII
10
Outline
  • Introduction (Adam Kilgarriff)
  • The Web as a Parallel Corpus (Philip Resnik)
  • wEBMT (Andy Way, Dublin City Univ.)
  • Automatic Association of Web Directories with
    Word Senses (Celina S.)
  • Web-based English Writing Assistant (Hua-Ping
    Zhang)
  • Conclusion

11
The Web as a Parallel Corpus
  • STRAND: Structural Translation Recognition,
    Acquiring Natural Data.
  • Motivation: when presenting the same content in
    two different languages, authors exhibit a very
    strong tendency to use the same document
    structure.

12
The Web as a Parallel Corpus II
13
The Web as a Parallel Corpus III
  • STRAND finds parallel text on the Web in three
    main steps:
  • Locating pages
  • Parent-page search: "english"/"anglais" and
    "french"/"francais" within 10 lines
  • Sibling-page search
  • Generating candidate pairs
  • 1) URL matching with manual rules, e.g.,
    english -> big5:
  • http://mysite.com/english/home_en.html
  • http://mysite.com/big5/home_cn.html
  • 2) Document length: length(E) ~ C * length(F)
  • Structural filtering
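The candidate-pair step can be sketched as a URL substitution plus a length filter. A minimal illustration; the substitution list and tolerance are small stand-ins for the paper's manually built rules:

```python
# STRAND-style candidate pairing: rewrite language tokens in the English
# URL and check whether the result is the French/Chinese URL, then apply
# the length(E) ~ C * length(F) filter.
SUBS = [("english", "big5"), ("_en.", "_cn.")]  # longest token first

def translate_url(url):
    for a, b in SUBS:
        url = url.replace(a, b)
    return url

def candidate_pair(url_e, url_f, len_e, len_f, c=1.0, tol=0.2):
    """True if the URLs differ only by a language substitution and the
    document lengths are close to the expected ratio."""
    url_match = translate_url(url_e) == url_f
    length_ok = abs(len_e - c * len_f) <= tol * c * len_f
    return url_match and length_ok

print(candidate_pair("http://mysite.com/english/home_en.html",
                     "http://mysite.com/big5/home_cn.html",
                     10_000, 9_500))  # True
```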

14
The Web as a Parallel Corpus IV
Parent page
Sibling page
15
The Web as a Parallel Corpus V
  • Structural filtering
  • Linearize the HTML structure and ignore the
    actual linguistic content of the documents:
  • START:element_label, e.g., START:A, START:LI
  • END:element_label, e.g., END:A
  • Chunk:length, e.g., Chunk:174
  • (e.g., <FONT COLOR="BLUE"> produces START:FONT
    followed by Chunk:12)
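The linearization can be sketched with the standard-library HTML parser (a minimal version; attributes are ignored, as in the token scheme above):

```python
from html.parser import HTMLParser

# Turn an HTML document into a flat sequence of START/END/Chunk tokens.
class Linearizer(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.tokens.append("START:" + tag.upper())

    def handle_endtag(self, tag):
        self.tokens.append("END:" + tag.upper())

    def handle_data(self, data):
        text = data.strip()
        if text:  # record only the length of nonmarkup text
            self.tokens.append("Chunk:%d" % len(text))

lin = Linearizer()
lin.feed('<FONT COLOR="BLUE">twelve chars</FONT>')
print(lin.tokens)  # ['START:FONT', 'Chunk:12', 'END:FONT']
```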

16
The Web as a Parallel Corpus VI
  • Align the linearized sequences using a standard
    dynamic programming technique.
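The dynamic programming step is the usual edit-distance recurrence over the two token sequences. A sketch with unit gap/mismatch costs (the costs are an assumption for illustration):

```python
# Edit-distance alignment cost between two linearized token sequences.
def align_cost(a, b, gap=1, mismatch=1):
    n, m = len(a), len(b)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * gap
    for j in range(1, m + 1):
        d[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            d[i][j] = min(d[i - 1][j - 1] + sub,   # match/substitute
                          d[i - 1][j] + gap,       # token only in a
                          d[i][j - 1] + gap)       # token only in b
    return d[n][m]

e = ["START:A", "Chunk:17", "END:A"]
f = ["START:A", "Chunk:21", "END:A", "START:LI"]
print(align_cost(e, f))  # 2: one chunk mismatch plus one unmatched token
```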

17
The Web as a Parallel Corpus VII
  • Using this alignment, we compute four scalar
    values that characterize the quality of the
    alignment:
  • dp: the difference percentage, indicating
    nonshared material (i.e., alignment tokens that
    are in one linearized file but not the other)
  • n: the number of aligned nonmarkup text chunks of
    unequal length
  • r: the correlation of lengths of the aligned
    nonmarkup chunks
  • p: the significance level of the correlation r

18
The Web as a Parallel Corpus VIII
Human raters judged the Web page pairs between
"mostly the same meaning" and "entirely the same
meaning" (3.25 on average), versus 2.5 for MT output.
19
Outline
  • Introduction (Adam Kilgarriff)
  • The Web as a Parallel Corpus (Philip Resnik)
  • wEBMT (Andy Way, Dublin City Univ.)
  • Automatic Association of Web Directories with
    Word Senses (Celina S.)
  • Web-based English Writing Assistant (Hua-Ping
    Zhang)
  • Conclusion

20
wEBMT: Developing and Validating an EBMT System
Using the WWW
  • Populate the system's memory with translations
    gathered from 3 online rule-based MT systems.
  • Automatically create 4 knowledge sources:
  • Phrasal lexicon
  • Marker lexicon
  • Generalized marker lexicon
  • Word-level lexicon

21
wEBMT II: Phrasal Lexicon
  • 218,697 English NPs and VPs from the Penn
    Treebank
  • Automatically translate them using 3 online MT
    systems (SDL, Reverso, Logomedia), fetching the
    translated web pages with UNIX wget

22
wEBMT III: Marker Lexicons
  • Marker words:
  • <DET> the, a, an, those, these, ...; <PREP> in,
    on, out, with, from, to, under, ...; <QUANT> all,
    some, few, many, ...; <CONJ> and, or; <POSS> my,
    your, our; <PRON> I, you, ...
  • Similarly for French
  • Marker-headed chunks in the source S map
    sequentially to their target equivalents in T
  • <DET> the board -> le conseil
  • Each chunk must also contain at least one
    non-marker word, e.g., "in the cold"
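Marker-based segmentation can be sketched as a left-to-right scan that opens a new chunk at each marker word. A minimal version; the word lists are abbreviated from the slide, and "at" is added to the preposition set so the trace example on a later slide segments as shown there (the sketch also does not enforce the one-non-marker-word constraint):

```python
# Split a sentence into marker-headed chunks, wEBMT-style.
MARKERS = {
    "DET": {"the", "a", "an", "those", "these"},
    "PREP": {"in", "on", "at", "out", "with", "from", "to", "under"},
    "QUANT": {"all", "some", "few", "many"},
    "CONJ": {"and", "or"},
    "POSS": {"my", "your", "our"},
    "PRON": {"i", "you"},
}

def tag_of(word):
    return next((t for t, ws in MARKERS.items() if word in ws), None)

def marker_chunks(sentence):
    """Each marker word starts a new chunk; other words extend it."""
    chunks, current = [], []
    for word in sentence.lower().split():
        tag = tag_of(word)
        if tag is not None and current:
            chunks.append(current)
            current = []
        current.append((tag, word) if tag else word)
    if current:
        chunks.append(current)
    return chunks

print(marker_chunks("the total at risk a year"))
# [[('DET', 'the'), 'total'], [('PREP', 'at'), 'risk'], [('DET', 'a'), 'year']]
```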

23
wEBMT III: Example
  • Input: "A major concern for the parent company is
    what advertisers are paying per page."
  • Chunk found in the marker lexicon: for the parent
    company -> pour la société mère
  • Chunk found in the generalized marker lexicon:
    <DET> major concern -> inquiétude majeure
  • Words found in the word-level lexicon:
    <DET> a -> une; <LEX> is -> est
  • Translation: "Une inquiétude majeure pour la
    société mère est quels annonceurs paient per
    page"

24
wEBMT IV: Trace
  • Segmentation of input:
  • The NP "the total at risk a year" could be
    segmented into <DET> the total, <PREP> at risk,
    <DET> a year
  • Retrieving translation chunks and weighting:
  • P(la maison | the house) = 8/10
  • P(le domicile | the house) = 2/10
  • P(s'écroula | collapsed) = 1/7
  • P(s'effondra | collapsed) = 6/7
  • P(la maison s'effondra | the house collapsed) =
    48/70
  • The generalized marker lexicon and word-level
    lexicon could be applied if the input were "a
    house collapsed" instead
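The weighting above multiplies per-chunk translation probabilities and keeps the best-scoring combination. A sketch using the slide's numbers:

```python
from itertools import product

# Per-chunk translation probabilities estimated from lexicon counts
# (figures from the slide).
chunk_probs = {
    "the house": {"la maison": 8 / 10, "le domicile": 2 / 10},
    "collapsed": {"s'effondra": 6 / 7, "s'ecroula": 1 / 7},
}

def best_translation(chunks):
    """Rank every combination of chunk translations by product of probs."""
    candidates = []
    for combo in product(*(chunk_probs[c].items() for c in chunks)):
        text = " ".join(t for t, _ in combo)
        prob = 1.0
        for _, chunk_prob in combo:
            prob *= chunk_prob
        candidates.append((prob, text))
    return max(candidates)

p, text = best_translation(["the house", "collapsed"])
print(text, p)  # la maison s'effondra, with probability 48/70
```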

25
Outline
  • Introduction (Adam Kilgarriff)
  • The Web as a Parallel Corpus (Philip Resnik)
  • wEBMT (Andy Way, Dublin City Univ.)
  • Automatic Association of Web Directories with
    Word Senses (Celina S.)
  • Web-based English Writing Assistant (Hua-Ping
    Zhang)
  • Conclusion

26
Automatic Association of Web Directories with
Word Senses
  • Open Directory Project: http://dmoz.org
  • "Circuit" has 6 synsets in WordNet
  • Q1: circuit, electrical circuit, -tour, ...
  • Q2: circuit, tour, journey, -electrical circuit,
    ...
  • Retrieve with each query against ODP;
    directories are returned, e.g.,
  • business/industries/electronics and .../contract
    manufacturers
  • Sense/directory comparison with a confidence
    score

27
Outline
  • Introduction (Adam Kilgarriff)
  • The Web as a Parallel Corpus (Philip Resnik)
  • wEBMT (Andy Way, Dublin City Univ.)
  • Automatic Association of Web Directories with
    Word Senses (Celina S.)
  • Web-based English Writing Assistant (Hua-Ping
    Zhang)
  • Conclusion

28
Web-based English Writing Assistant (WEWA)
  • [Chinese description of WEWA, garbled in the
    transcript] (MSA)

29
WEWA II: Problems
  • Non-native English speakers suffer from:
  • Spelling: speculater
  • Vocabulary: many, divers, diverse, various,
    considerable, numerous, very many, a good many,
    ever so many, many more, ever-recurring,
    frequent, repeated
  • Chinglish or Singlish vs. native usage: "learn
    knowledge" vs. "acquire knowledge"
  • Fluency, even beauty: "Our method enjoys many
    good properties"; "To the best of our knowledge,
    however, no previous study has so far dealt with
    the problem"

30
WEWA III: Motivation
  • Write as native speakers do!
  • Assumption: regular and correct words or phrases
    occur more frequently than invalid or improper
    ones.
  • WEWA performs machine learning from regular
    corpora or the Web. We can restrict the Web
    domain according to the application.
  • Occurrence factor / replacement cost (estimated
    with proper approaches, such as edit distance):
    prefer candidates that are more frequent and have
    a lower replacement cost.

31
WEWA IV: Hierarchy
  • Semantics: sentence structure
  • Syntax
  • Phrase/Chunk
  • Word: spelling/n-gram
32
WEWA V: Word Level
  • N-grams, considering occurrence and
    co-occurrence frequency.
  • W = w1 w2 ... wn; generate candidate sequences,
    where wij is similar to wi in semantics or
    spelling. We only consider open-class words.

[Candidate lattice figure: each input word wi
expands to candidates wi1 ... wik]
33
WEWA VI: Word Level (cont.)
  • Find the proposed sequence W* with the highest
    probability, that is:
  • W* = argmax_W P(W) = argmax_W ∏ p(wi | wi-1)
  • or argmax_W Freq(w1, w2, ..., wn) if the window
    size is less than 5.
  • p(wi | wi-1) and Freq(w1, w2, ..., wn) can be
    estimated from the Web or a corpus.
  • W* can be obtained via dynamic programming.
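The search over the candidate lattice is a Viterbi-style dynamic program. A sketch under the bigram formulation; the candidate lists and bigram probabilities are invented for illustration, reusing the "learn/acquire knowledge" example from the problems slide:

```python
# Viterbi-style search for the most probable candidate word sequence.
def best_sequence(candidates, bigram_prob, floor=1e-9):
    """candidates: one list of candidate words per input position;
    bigram_prob: P(w_i | w_{i-1}) estimated from the Web or a corpus."""
    # best[c] = (probability, path) over paths ending in candidate c
    best = {c: (1.0, [c]) for c in candidates[0]}
    for position in candidates[1:]:
        new_best = {}
        for c in position:
            p, path = max((prev_p * bigram_prob.get((prev, c), floor), path)
                          for prev, (prev_p, path) in best.items())
            new_best[c] = (p, path + [c])
        best = new_best
    return max(best.values())[1]

# Chinglish "learn knowledge" vs. native "acquire knowledge"
bigrams = {("acquire", "knowledge"): 0.002, ("learn", "knowledge"): 0.00001}
cands = [["learn", "acquire"], ["knowledge"]]
print(best_sequence(cands, bigrams))  # ['acquire', 'knowledge']
```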

34
WEWA VII: Phrase Level
  • With shallow parsing or chunking as described in
    wEBMT, retrieve the phrase components from the
    Web, compute statistics on the returned texts
    with regular expressions, and then return the
    more frequent phrase.
  • "I look up my shoes in the room"
  • look up my/NP/Pron vs. look for my/NP/Pron vs.
    look at my/NP/Pron
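The counting step can be sketched with regular expressions over retrieved text. A tiny hard-coded corpus stands in for web search results here:

```python
import re

# Count phrase-variant occurrences in retrieved text and keep the winner.
corpus = """I look for my shoes every morning. She looked for my keys.
He will look for my glasses. Look up my number in the directory.
They look at my drawings."""

variants = ["look up my", "look for my", "look at my"]
counts = {v: len(re.findall(re.escape(v), corpus, re.IGNORECASE))
          for v in variants}
print(max(counts, key=counts.get))  # look for my
```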

35
WEWA VIII: Syntax Level
  • Get all constituents after (partial) parsing.
  • Generate all possible substitute structures from
    the known constituents.
  • Find the proposed structure with the highest
    probability.
  • e.g., NP VP vs. VP NP

36
Outline
  • Introduction (Adam Kilgarriff)
  • The Web as a Parallel Corpus (Philip Resnik)
  • wEBMT (Andy Way, Dublin City Univ.)
  • Automatic Association of Web Directories with
    Word Senses (Celina S.)
  • Web-based English Writing Assistant (Hua-Ping
    Zhang)
  • Conclusion

37
Conclusion
  • The Web is a huge and cheap language corpus that
    contains the required information, although it
    also contains noise.
  • The Web is somewhat different from manually built
    corpora; traditional corpus approaches should be
    tuned to the Web.
  • The Web can help build parallel corpora
    automatically and acquire sense-tagged lexicons
    and corpora.
  • wEBMT proved successful after introducing the
    Web.

38
Conclusion II
  • The Web-based English Writing Assistant has a
    four-level hierarchy: word, phrase/chunk, syntax,
    and semantics.
  • The Web can provide word, n-gram, and phrase
    occurrence and co-occurrence frequencies, which
    help WEWA make decisions.
  • WEWA should consider both occurrence and
    replacement costs.

39
  • THANKS!