Title: A Special Reading on Web as Corpus and English Writing Assistant
1. A Special Reading on Web as Corpus and English Writing Assistant
- Hua-Ping ZHANG
- Ph.D. Candidate
- Institute of Computing Technology, CAS
- 2004-5-8
2. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
3. Introduction
- Statistical approaches have proved successful in many NLP and information processing applications. However, the corpus becomes the bottleneck: is it balanced? Is it large enough?
- Smoothing techniques address data sparseness in language modeling, but they cannot solve the problem. Performance improves with data size, and getting more data often makes more difference than fine-tuning algorithms.
- Language scientists are increasingly turning to the Web as a source of language data, either because the Web is the only source for the type of language they are interested in, or simply because it is free and instantly available.
4. Introduction II: Application Samples
- Spelling check
- "speculater" or "speculator"?
- Google hit counts: 112 vs. 171,000
- Finding the right translation
- French phrase "groupe de travail"
- Compare the Web frequency of candidate English translations (see the sketch below)
- "labor cluster" 21, "labor collective" 428, "labor group" 10,389, "work group" 148,331
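A minimal sketch of the ranking idea behind both examples: query each candidate as an exact phrase and keep the one with the most hits. The web_hit_count function is a hypothetical stand-in for whatever hit-count API is available; the candidate lists mirror the examples above.

def web_hit_count(phrase: str) -> int:
    # Hypothetical stand-in for a search-engine hit-count API; replace with
    # a real query or a count over locally crawled pages.
    raise NotImplementedError

def best_by_web_frequency(candidates):
    # Rank candidates by exact-phrase hit count and return the winner.
    counts = {c: web_hit_count('"%s"' % c) for c in candidates}
    return max(counts, key=counts.get), counts

# Spelling: best_by_web_frequency(["speculater", "speculator"])
# Translation of "groupe de travail":
# best_by_web_frequency(["labor cluster", "labor collective", "labor group", "work group"])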
5. Introduction III
- In principle, any collection of more than one text can be called a corpus. Four main headings should be considered: sampling and representativeness, finite size, machine-readable form, and a standard reference.
- What is a corpus? What is a good corpus? "Is corpus x good for task y?" is a more useful question than the semantic one "Is x a corpus at all?"
- To summarize, a corpus is a collection of texts when considered as an object of language or literary study.
6. Introduction IV
- Web size and the multilingual Web
- 172 million network addresses in January 2003; 4,285,199,774 pages indexed by Google
- BNC (100 million words) vs. the Web: how many words?
- "deep breath": 732 hits in the BNC vs. 868,631 on the Web
- Function words such as "the", "with" and "in" occur with a frequency that is relatively stable over many different types of text, so their counts can be used to estimate corpus size (see the sketch below).
- Language errors on the Web: the Web is a dirty corpus, but expected usage is much more frequent than what might be considered noise.
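A minimal sketch of the size-estimation idea, assuming we know the function words' relative frequencies in a reference corpus and can obtain their Web hit counts; all numbers below are illustrative placeholders, not measured values.

# Illustrative relative frequencies (tokens per million words) in a
# reference corpus; placeholders, not measured BNC values.
REFERENCE_FREQ_PER_MILLION = {"the": 60000, "in": 18000, "with": 6500}

def estimate_corpus_size(web_counts):
    # Extrapolate total word count from each function word's hit count,
    # then average the per-word estimates.
    estimates = []
    for word, count in web_counts.items():
        per_million = REFERENCE_FREQ_PER_MILLION[word]
        estimates.append(count / per_million * 1_000_000)
    return sum(estimates) / len(estimates)

# web_counts would come from search-engine hit counts for each word, e.g.
# estimate_corpus_size({"the": ..., "in": ..., "with": ...})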
7. Introduction V
8. Introduction VI
9. Introduction VII
10. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
11. The Web as a Parallel Corpus
- STRAND: Structural Translation Recognition, Acquiring Natural Data.
- Motivation: when presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure.
12. The Web as a Parallel Corpus II
13. The Web as a Parallel Corpus III
- STRAND finds parallel text on the Web in three main steps:
- Locating pages
- Parent pages: search for pages with links to both "english"/"anglais" and "french"/"francais" within 10 lines of each other
- Sibling pages
- Generating candidate pairs (see the sketch below)
- 1) URL matching with manual substitution rules, e.g. english -> big5:
- http://mysite.com/english/home_en.html
- http://mysite.com/big5/home_cn.html
- 2) Document length filter: length(E) ~ C * length(F)
- Structural filtering
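A rough sketch of the candidate-pair step under stated assumptions: the substitution rules and the length-ratio tolerance below are illustrative, not the ones used by STRAND.

# Illustrative language-substring substitution rules (STRAND uses a
# manually built list; these pairs are just examples).
SUBSTITUTION_RULES = [("english", "big5"), ("_en", "_cn"), ("/en/", "/zh/")]

def candidate_pairs(urls, lengths, c=1.0, tolerance=0.3):
    # Pair URLs that differ only by a language substring, then keep pairs
    # whose document lengths satisfy length(E) ~ C * length(F).
    url_set = set(urls)
    pairs = []
    for url in urls:
        for src, tgt in SUBSTITUTION_RULES:
            other = url.replace(src, tgt)
            if other != url and other in url_set:
                le, lf = lengths[url], lengths[other]
                if abs(le - c * lf) <= tolerance * c * lf:
                    pairs.append((url, other))
    return pairs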
14. The Web as a Parallel Corpus IV
Parent page
Sibling page
15. The Web as a Parallel Corpus V
- Structural filtering
- Linearize the HTML structure and ignore the actual linguistic content of the documents (see the sketch below):
- START:element_label, e.g., START:A, START:LI
- END:element_label, e.g., END:A
- Chunk:length, e.g., Chunk:174
- (e.g., <FONT COLOR="BLUE"> produces START:FONT followed by Chunk:12)
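A small sketch of such a linearizer using Python's standard html.parser; the token spelling follows the slide, the rest (upper-casing tag names, stripping whitespace) is assumed detail.

from html.parser import HTMLParser

class Linearizer(HTMLParser):
    # Turn an HTML document into a sequence of START:/END:/Chunk: tokens,
    # ignoring the linguistic content (only text-chunk lengths survive).
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append("START:" + tag.upper())
    def handle_endtag(self, tag):
        self.tokens.append("END:" + tag.upper())
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append("Chunk:%d" % len(text))

def linearize(html: str):
    parser = Linearizer()
    parser.feed(html)
    return parser.tokens

# linearize('<FONT COLOR="BLUE">twelve chars</FONT>')
# -> ['START:FONT', 'Chunk:12', 'END:FONT']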
16. The Web as a Parallel Corpus VI
- Align the two linearized sequences using a standard dynamic programming (edit-distance style) alignment.
17. The Web as a Parallel Corpus VII
- Using this alignment, we compute four scalar values that characterize the quality of the alignment (see the sketch below):
- dp: the difference percentage, indicating non-shared material (i.e., alignment tokens that are in one linearized file but not the other)
- n: the number of aligned non-markup text chunks of unequal length
- r: the correlation of lengths of the aligned non-markup chunks
- p: the significance level of the correlation r
18. The Web as a Parallel Corpus VIII
- Evaluation: pairs mined from the WWW were judged between "mostly the same meaning" and "entirely the same meaning" (3.25 on average), versus 2.5 for MT output.
19. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
20. wEBMT: Developing and Validating an EBMT System Using the WWW
- Populate the system's memory with translations gathered from 3 online rule-based MT systems.
- Automatically create 4 knowledge sources:
- Phrasal lexicon
- Marker lexicon
- Generalized marker lexicon
- Word-level lexicon
21. wEBMT II: Phrasal Lexicon
- 218,697 English NPs and VPs extracted from the Penn Treebank
- Automatically translate them using 3 online MT systems (SDL, Reverso, Logomedia), fetching the web-page translations with UNIX wget
22. wEBMT III: Marker Lexicons
- Marker words:
- <DET> the, a, an, those, these, ...; <PREP> in, on, out, with, from, to, under, ...; <QUANT> all, some, few, many, ...; <CONJ> and, or; <POSS> my, your, our, ...; <PRON> I, you, ...
- Likewise for French
- Marker-headed chunks in the source S map sequentially to their target equivalents in T
- <DET> the board : le conseil
- Each chunk must also contain at least one non-marker word, e.g. "in the cold" (see the sketch below)
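A rough sketch of marker-based segmentation with an intentionally incomplete marker set; a chunk opens at a marker word only after the current chunk already holds a non-marker word, which keeps "in the cold" together.

# Incomplete, illustrative marker sets (the slide lists more classes/words).
MARKERS = {
    "the": "DET", "a": "DET", "an": "DET", "those": "DET", "these": "DET",
    "in": "PREP", "on": "PREP", "at": "PREP", "with": "PREP", "from": "PREP", "to": "PREP",
    "all": "QUANT", "some": "QUANT", "many": "QUANT",
    "and": "CONJ", "or": "CONJ",
    "my": "POSS", "your": "POSS", "our": "POSS",
    "i": "PRON", "you": "PRON",
}

def marker_chunks(sentence: str):
    # Split a sentence into marker-headed chunks: a new chunk opens at a
    # marker word, but only after the current chunk has a non-marker word.
    chunks, current, has_content = [], [], False
    for word in sentence.lower().split():
        if word in MARKERS and has_content:
            chunks.append(current)
            current, has_content = [], False
        current.append(word)
        if word not in MARKERS:
            has_content = True
    if current:
        chunks.append(current)
    return chunks

# marker_chunks("the total at risk a year")
# -> [['the', 'total'], ['at', 'risk'], ['a', 'year']]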
23. wEBMT III: Example
- Input: "A major concern for the parent company is what advertisers are paying per page."
- Chunk found in the marker lexicon: "for the parent company" : "pour la societe mere"
- Chunk found in the generalized marker lexicon: <DET> major concern : "inquietude majeure"
- Words found in the word-level lexicon: <DET> a : une; <LEX> is : est
- Translation: "Une inquietude majeure pour la societe mere est quels annonceurs paient per page"
24. wEBMT IV: Trace
- Segmentation of input
- The NP "the total at risk a year" could be segmented into <DET>the total, <PREP>at risk, <DET>a year
- Retrieving translation chunks and weighting them (see the sketch below)
- P(la maison | the house) = 8/10
- P(le domicile | the house) = 2/10
- P(s'ecroula | collapsed) = 1/7
- P(s'effondra | collapsed) = 6/7
- P(la maison s'effondra | the house collapsed) = 48/70
- The generalized marker lexicon and the word-level lexicon can be applied instead if the input is, say, "a house collapsed"
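A small sketch of the weighting above: chunk translation probabilities are relative frequencies over the retrieved translations, and a combined chunk scores as the product of its parts (8/10 * 6/7 = 48/70). The dictionary layout is an assumption.

from collections import Counter
from fractions import Fraction

# Retrieved translation counts for each source chunk; the numbers mirror
# the slide's 8/10, 2/10, 1/7, 6/7.
translations = {
    "the house": Counter({"la maison": 8, "le domicile": 2}),
    "collapsed": Counter({"s'effondra": 6, "s'ecroula": 1}),
}

def p(target: str, source: str) -> Fraction:
    # Relative-frequency estimate P(target | source).
    counts = translations[source]
    return Fraction(counts[target], sum(counts.values()))

# Combined score for "the house collapsed" -> "la maison s'effondra"
score = p("la maison", "the house") * p("s'effondra", "collapsed")
# score == Fraction(24, 35), i.e. 48/70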
25. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
26. Automatic Association of Web Directories with Word Senses
- Open Directory Project: http://dmoz.org
- "Circuit" has 6 synsets in WordNet
- Q1: circuit, "electrical circuit", -tour, ...
- Q2: circuit, tour, journey, -"electrical circuit", ...
- Retrieve with each query in the ODP; directories are returned, e.g.
- business/industries/electronics and .../contract manufacturers
- Sense/directory comparison with a confidence score (see the sketch below)
27. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
28. Web-based English Writing Assistant (WEWA)
29. WEWA II: Problems
- Non-native speakers of English suffer from problems of:
- Spelling: "speculater"
- Vocabulary: many, divers, diverse, various, considerable, numerous, very many, a good many, ever so many, many more, ever-recurring, frequent, repeated
- Nativeness (Chinglish or Singlish): e.g. "acquire knowledge" vs. "learn knowledge"
- Fluency, even beauty: "Our method enjoys many good properties"; "To the best of our knowledge, however, no previous study has so far dealt with the problem"
30. WEWA III: Motivation
- Write as native speakers do!
- Assumption: regular and correct words or phrases occur more frequently than invalid or improper ones.
- WEWA learns from regular corpora or the Web; the Web domain can be restricted according to the application.
- Occurrence factor vs. replacement cost (estimated with suitable measures such as edit distance): prefer candidates that are more frequent but have a lower replacement cost.
31. WEWA IV: Hierarchy
[Hierarchy diagram, top to bottom: Semantics (sentence structure) > Syntax > Phrase/Chunk > Word (spelling/n-gram)]
32. WEWA V: Word Level
- N-grams, considering occurrence and co-occurrence frequency.
- For the input W = w1 w2 ... wn, generate candidate sequences, where each candidate wij is similar to wi in semantics or spelling. Only open-class words are considered.
[Candidate lattice diagram: each input word wi expands into a column of candidates wi1 ... wik]
33. WEWA VI: Word Level (cont.)
- Find the proposed sequence W* with the highest probability, that is:
- W* = argmax_W P(W)
-       = argmax_W prod_i p(wi | wi-1)
- or    = argmax_W Freq(w1, w2, ..., wn) if the window size is less than 5
- p(wi | wi-1) and Freq(w1, w2, ..., wn) can be estimated from the Web or a corpus.
- W* can be obtained via dynamic programming (see the sketch below).
34. WEWA VII: Phrase Level
- With shallow parsing or chunking as described in wEBMT, retrieve the phrase variants from the Web, collect statistics over the returned texts with regular expressions, and return the most frequent phrase (see the sketch below).
- Example: "I look up my shoes in the room"
- "look up" + my/NP/Pron vs. "look for" + my/NP/Pron vs. "look at" + my/NP/Pron
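A rough sketch of the counting step, assuming the relevant web text has already been downloaded (downloaded_pages_text is a placeholder); the regular expression for a following pronoun or short NP is only illustrative.

import re

def most_frequent_variant(web_text: str, verb_variants):
    # Count each verb variant followed by a pronoun/short NP in the
    # retrieved web text and return the most frequent one.
    tail = r"\s+(?:my|your|his|her|our|their|the\s+\w+|a\s+\w+)\b"
    counts = {v: len(re.findall(re.escape(v) + tail, web_text, re.IGNORECASE))
              for v in verb_variants}
    return max(counts, key=counts.get), counts

# Variants for "I look up my shoes in the room":
# best, counts = most_frequent_variant(downloaded_pages_text,
#                                      ["look up", "look for", "look at"])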
35. WEWA VIII: Syntax Level
- Get all constituents after (partial) parsing.
- Generate all possible substitute structures from the known constituents.
- Find the proposed structure with the highest probability.
- e.g. NP VP vs. VP NP
36. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
37. Conclusion
- The Web is a huge and cheap language corpus with helpful information on demand, although it contains noise.
- The Web is somewhat different from manually built corpora; traditional corpus approaches should be tuned to the Web.
- The Web can help build parallel corpora automatically and acquire sense-tagged lexicons and corpora.
- wEBMT proved successful after introducing the Web.
38. Conclusion II
- The Web-based English Writing Assistant has a four-level hierarchy: word, phrase/chunk, syntax, semantics.
- The Web can provide word/n-gram/phrase occurrence and co-occurrence frequencies, which help WEWA make decisions.
- WEWA should consider both occurrence and replacement costs.