Title: A Special Reading on Web as Corpus and English Writing Assistant
1. A Special Reading on Web as Corpus and English Writing Assistant
- Hua-Ping ZHANG
- Ph.D. Candidate
- Institute of Computing Technology, CAS
- 2004-5-8
2. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
3. Introduction
- Statistical approaches have proved successful in many NLP and information processing applications. However, the corpus becomes the bottleneck: is it balanced? Is it large enough?
- Smoothing techniques address data sparseness in language modeling, but they cannot solve the problem. Performance improves with data size, and getting more data often makes more difference than fine-tuning algorithms.
- Language scientists are increasingly turning to the Web as a source of language data, either because the Web is the only source for the type of language they are interested in, or simply because it is free and instantly available.
4. Introduction II: Application Samples
- Spelling check
- "speculater" or "speculator"?
- Google hit counts: 112 vs. 171,000
- Finding the right translation
- French phrase "groupe de travail"
- Compare the Web frequency of candidate English translations (see the sketch below)
- "labor cluster" 21, "labor collective" 428, "labor group" 10,389, "work group" 148,331
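A minimal sketch of the ranking idea behind both examples: query each candidate as an exact phrase and keep the one with the most hits. The web_hit_count function is a hypothetical stand-in for whatever hit-count API is available; the candidate lists mirror the examples above.

def web_hit_count(phrase: str) -> int:
    # Hypothetical stand-in for a search-engine hit-count API; replace with
    # a real query or a count over locally crawled pages.
    raise NotImplementedError

def best_by_web_frequency(candidates):
    # Rank candidates by exact-phrase hit count and return the winner.
    counts = {c: web_hit_count('"%s"' % c) for c in candidates}
    return max(counts, key=counts.get), counts

# Spelling: best_by_web_frequency(["speculater", "speculator"])
# Translation of "groupe de travail":
# best_by_web_frequency(["labor cluster", "labor collective", "labor group", "work group"])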
5. Introduction III
- In principle, any collection of more than one text can be called a corpus. Four main headings should be considered: sampling and representativeness, finite size, machine-readable form, and a standard reference.
- What is a corpus? What is a good corpus? "Is corpus x good for task y?" is a more useful question than the semantic one "Is x a corpus at all?"
- To summarize, a corpus is a collection of texts when considered as an object of language or literary study.
6. Introduction IV
- Web size and the multilingual Web
- 172 million network addresses in January 2003; 4,285,199,774 pages indexed by Google
- BNC (100 million words) vs. the Web: how many words?
- "deep breath": 732 hits in the BNC vs. 868,631 on the Web
- Function words such as "the", "with" and "in" occur with a frequency that is relatively stable over many different types of text, so their counts can be used to estimate corpus size (see the sketch below).
- Language errors on the Web: the Web is a dirty corpus, but expected usage is much more frequent than what might be considered noise.
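A minimal sketch of the size-estimation idea, assuming we know the function words' relative frequencies in a reference corpus and can obtain their Web hit counts; all numbers below are illustrative placeholders, not measured values.

# Illustrative relative frequencies (tokens per million words) in a
# reference corpus; placeholders, not measured BNC values.
REFERENCE_FREQ_PER_MILLION = {"the": 60000, "in": 18000, "with": 6500}

def estimate_corpus_size(web_counts):
    # Extrapolate total word count from each function word's hit count,
    # then average the per-word estimates.
    estimates = []
    for word, count in web_counts.items():
        per_million = REFERENCE_FREQ_PER_MILLION[word]
        estimates.append(count / per_million * 1_000_000)
    return sum(estimates) / len(estimates)

# web_counts would come from search-engine hit counts for each word, e.g.
# estimate_corpus_size({"the": ..., "in": ..., "with": ...})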
7. Introduction V
8. Introduction VI
9. Introduction VII
10. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
11. The Web as a Parallel Corpus
- STRAND: Structural Translation Recognition, Acquiring Natural Data.
- Motivation: when presenting the same content in two different languages, authors exhibit a very strong tendency to use the same document structure.
12. The Web as a Parallel Corpus II
13. The Web as a Parallel Corpus III
- STRAND finds parallel text on the Web in three main steps:
- Locating pages
- Parent pages: search for pages with links to both "english"/"anglais" and "french"/"francais" within 10 lines of each other
- Sibling pages
- Generating candidate pairs (see the sketch below)
- 1) URL matching with manual substitution rules, e.g. english -> big5:
- http://mysite.com/english/home_en.html
- http://mysite.com/big5/home_cn.html
- 2) Document length filter: length(E) ~ C * length(F)
- Structural filtering
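A rough sketch of the candidate-pair step under stated assumptions: the substitution rules and the length-ratio tolerance below are illustrative, not the ones used by STRAND.

# Illustrative language-substring substitution rules (STRAND uses a
# manually built list; these pairs are just examples).
SUBSTITUTION_RULES = [("english", "big5"), ("_en", "_cn"), ("/en/", "/zh/")]

def candidate_pairs(urls, lengths, c=1.0, tolerance=0.3):
    # Pair URLs that differ only by a language substring, then keep pairs
    # whose document lengths satisfy length(E) ~ C * length(F).
    url_set = set(urls)
    pairs = []
    for url in urls:
        for src, tgt in SUBSTITUTION_RULES:
            other = url.replace(src, tgt)
            if other != url and other in url_set:
                le, lf = lengths[url], lengths[other]
                if abs(le - c * lf) <= tolerance * c * lf:
                    pairs.append((url, other))
    return pairs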
14. The Web as a Parallel Corpus IV
Parent page
Sibling page
15. The Web as a Parallel Corpus V
- Structural filtering
- Linearize the HTML structure and ignore the actual linguistic content of the documents (see the sketch below):
- START:element_label, e.g., START:A, START:LI
- END:element_label, e.g., END:A
- Chunk:length, e.g., Chunk:174
- (e.g., <FONT COLOR="BLUE"> produces START:FONT followed by Chunk:12)
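A small sketch of such a linearizer using Python's standard html.parser; the token spelling follows the slide, the rest (upper-casing tag names, stripping whitespace) is assumed detail.

from html.parser import HTMLParser

class Linearizer(HTMLParser):
    # Turn an HTML document into a sequence of START:/END:/Chunk: tokens,
    # ignoring the linguistic content (only text-chunk lengths survive).
    def __init__(self):
        super().__init__()
        self.tokens = []
    def handle_starttag(self, tag, attrs):
        self.tokens.append("START:" + tag.upper())
    def handle_endtag(self, tag):
        self.tokens.append("END:" + tag.upper())
    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tokens.append("Chunk:%d" % len(text))

def linearize(html: str):
    parser = Linearizer()
    parser.feed(html)
    return parser.tokens

# linearize('<FONT COLOR="BLUE">twelve chars</FONT>')
# -> ['START:FONT', 'Chunk:12', 'END:FONT']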
16. The Web as a Parallel Corpus VI
- Align the two linearized sequences using a standard dynamic programming (edit-distance style) alignment.
17. The Web as a Parallel Corpus VII
- Using this alignment, we compute four scalar values that characterize the quality of the alignment (see the sketch below):
- dp: the difference percentage, indicating non-shared material (i.e., alignment tokens that are in one linearized file but not the other)
- n: the number of aligned non-markup text chunks of unequal length
- r: the correlation of lengths of the aligned non-markup chunks
- p: the significance level of the correlation r
18. The Web as a Parallel Corpus VIII
- Evaluation: pairs mined from the WWW were judged between "mostly the same meaning" and "entirely the same meaning" (3.25 on average), versus 2.5 for MT output.
19. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
20. wEBMT: Developing and Validating an EBMT System Using the WWW
- Populate the system's memory with translations gathered from 3 online rule-based MT systems.
- Automatically create 4 knowledge sources:
- Phrasal lexicon
- Marker lexicon
- Generalized marker lexicon
- Word-level lexicon
21. wEBMT II: Phrasal Lexicon
- 218,697 English NPs and VPs extracted from the Penn Treebank
- Automatically translate them using 3 online MT systems (SDL, Reverso, Logomedia), fetching the web-page translations with UNIX wget
22. wEBMT III: Marker Lexicons
- Marker words:
- <DET> the, a, an, those, these, ...; <PREP> in, on, out, with, from, to, under, ...; <QUANT> all, some, few, many, ...; <CONJ> and, or; <POSS> my, your, our, ...; <PRON> I, you, ...
- Likewise for French
- Marker-headed chunks in the source S map sequentially to their target equivalents in T
- <DET> the board : le conseil
- Each chunk must also contain at least one non-marker word, e.g. "in the cold" (see the sketch below)
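A rough sketch of marker-based segmentation with an intentionally incomplete marker set; a chunk opens at a marker word only after the current chunk already holds a non-marker word, which keeps "in the cold" together.

# Incomplete, illustrative marker sets (the slide lists more classes/words).
MARKERS = {
    "the": "DET", "a": "DET", "an": "DET", "those": "DET", "these": "DET",
    "in": "PREP", "on": "PREP", "at": "PREP", "with": "PREP", "from": "PREP", "to": "PREP",
    "all": "QUANT", "some": "QUANT", "many": "QUANT",
    "and": "CONJ", "or": "CONJ",
    "my": "POSS", "your": "POSS", "our": "POSS",
    "i": "PRON", "you": "PRON",
}

def marker_chunks(sentence: str):
    # Split a sentence into marker-headed chunks: a new chunk opens at a
    # marker word, but only after the current chunk has a non-marker word.
    chunks, current, has_content = [], [], False
    for word in sentence.lower().split():
        if word in MARKERS and has_content:
            chunks.append(current)
            current, has_content = [], False
        current.append(word)
        if word not in MARKERS:
            has_content = True
    if current:
        chunks.append(current)
    return chunks

# marker_chunks("the total at risk a year")
# -> [['the', 'total'], ['at', 'risk'], ['a', 'year']]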
23. wEBMT III: Example
- Input: "A major concern for the parent company is what advertisers are paying per page."
- Chunk found in the marker lexicon: "for the parent company" : "pour la societe mere"
- Chunk found in the generalized marker lexicon: <DET> major concern : "inquietude majeure"
- Words found in the word-level lexicon: <DET> a : une; <LEX> is : est
- Translation: "Une inquietude majeure pour la societe mere est quels annonceurs paient per page"
24. wEBMT IV: Trace
- Segmentation of input
- The NP "the total at risk a year" could be segmented into <DET>the total, <PREP>at risk, <DET>a year
- Retrieving translation chunks and weighting them (see the sketch below)
- P(la maison | the house) = 8/10
- P(le domicile | the house) = 2/10
- P(s'ecroula | collapsed) = 1/7
- P(s'effondra | collapsed) = 6/7
- P(la maison s'effondra | the house collapsed) = 48/70
- The generalized marker lexicon and the word-level lexicon can be applied instead if the input is, say, "a house collapsed"
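A small sketch of the weighting above: chunk translation probabilities are relative frequencies over the retrieved translations, and a combined chunk scores as the product of its parts (8/10 * 6/7 = 48/70). The dictionary layout is an assumption.

from collections import Counter
from fractions import Fraction

# Retrieved translation counts for each source chunk; the numbers mirror
# the slide's 8/10, 2/10, 1/7, 6/7.
translations = {
    "the house": Counter({"la maison": 8, "le domicile": 2}),
    "collapsed": Counter({"s'effondra": 6, "s'ecroula": 1}),
}

def p(target: str, source: str) -> Fraction:
    # Relative-frequency estimate P(target | source).
    counts = translations[source]
    return Fraction(counts[target], sum(counts.values()))

# Combined score for "the house collapsed" -> "la maison s'effondra"
score = p("la maison", "the house") * p("s'effondra", "collapsed")
# score == Fraction(24, 35), i.e. 48/70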
25. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
26. Automatic Association of Web Directories with Word Senses
- Open Directory Project: http://dmoz.org
- "Circuit" has 6 synsets in WordNet
- Q1: circuit, "electrical circuit", -tour, ...
- Q2: circuit, tour, journey, -"electrical circuit", ...
- Retrieve with each query in the ODP; directories are returned, e.g.
- business/industries/electronics and .../contract manufacturers
- Sense/directory comparison with a confidence score (see the sketch below)
27. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
28. Web-based English Writing Assistant (WEWA)
29. WEWA II: Problems
- Non-native speakers of English suffer from problems of:
- Spelling: "speculater"
- Vocabulary: many, divers, diverse, various, considerable, numerous, very many, a good many, ever so many, many more, ever-recurring, frequent, repeated
- Nativeness (Chinglish or Singlish): e.g. "acquire knowledge" vs. "learn knowledge"
- Fluency, even beauty: "Our method enjoys many good properties"; "To the best of our knowledge, however, no previous study has so far dealt with the problem"
30. WEWA III: Motivation
- Write as native speakers do!
- Assumption: regular and correct words or phrases occur more frequently than invalid or improper ones.
- WEWA learns from regular corpora or the Web; the Web domain can be restricted according to the application.
- Occurrence factor vs. replacement cost (estimated with suitable measures such as edit distance): prefer candidates that are more frequent but have a lower replacement cost.
31. WEWA IV: Hierarchy
[Hierarchy diagram, top to bottom: Semantics (sentence structure) > Syntax > Phrase/Chunk > Word (spelling/n-gram)]
32. WEWA V: Word Level
- N-grams, considering occurrence and co-occurrence frequency.
- For the input W = w1 w2 ... wn, generate candidate sequences, where each candidate wij is similar to wi in semantics or spelling. Only open-class words are considered.
[Candidate lattice diagram: each input word wi expands into a column of candidates wi1 ... wik]
33. WEWA VI: Word Level (cont.)
- Find the proposed sequence W* with the highest probability, that is:
- W* = argmax_W P(W)
-       = argmax_W prod_i p(wi | wi-1)
- or    = argmax_W Freq(w1, w2, ..., wn) if the window size is less than 5
- p(wi | wi-1) and Freq(w1, w2, ..., wn) can be estimated from the Web or a corpus.
- W* can be obtained via dynamic programming (see the sketch below).
34. WEWA VII: Phrase Level
- With shallow parsing or chunking as described in wEBMT, retrieve the phrase variants from the Web, collect statistics over the returned texts with regular expressions, and return the most frequent phrase (see the sketch below).
- Example: "I look up my shoes in the room"
- "look up" + my/NP/Pron vs. "look for" + my/NP/Pron vs. "look at" + my/NP/Pron
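A rough sketch of the counting step, assuming the relevant web text has already been downloaded (downloaded_pages_text is a placeholder); the regular expression for a following pronoun or short NP is only illustrative.

import re

def most_frequent_variant(web_text: str, verb_variants):
    # Count each verb variant followed by a pronoun/short NP in the
    # retrieved web text and return the most frequent one.
    tail = r"\s+(?:my|your|his|her|our|their|the\s+\w+|a\s+\w+)\b"
    counts = {v: len(re.findall(re.escape(v) + tail, web_text, re.IGNORECASE))
              for v in verb_variants}
    return max(counts, key=counts.get), counts

# Variants for "I look up my shoes in the room":
# best, counts = most_frequent_variant(downloaded_pages_text,
#                                      ["look up", "look for", "look at"])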
35. WEWA VIII: Syntax Level
- Get all constituents after (partial) parsing.
- Generate all possible substitute structures from the known constituents.
- Find the proposed structure with the highest probability.
- e.g. NP VP vs. VP NP
36. Outline
- Introduction (Adam Kilgarriff)
- The Web as a Parallel Corpus (Philip Resnik)
- wEBMT (Andy Way, Dublin City Univ.)
- Automatic Association of Web Directories with Word Senses (Celina S.)
- Web-based English Writing Assistant (Hua-Ping Zhang)
- Conclusion
37. Conclusion
- The Web is a huge and cheap language corpus with helpful information on demand, although it contains noise.
- The Web is somewhat different from manually built corpora; traditional corpus approaches should be tuned to the Web.
- The Web can help build parallel corpora automatically and acquire sense-tagged lexicons and corpora.
- wEBMT proved successful after introducing the Web.
38. Conclusion II
- The Web-based English Writing Assistant has a four-level hierarchy: word, phrase/chunk, syntax, semantics.
- The Web can provide word/n-gram/phrase occurrence and co-occurrence frequencies, which help WEWA make decisions.
- WEWA should consider both occurrence and replacement costs.