The Web as a Parallel Corpus - PowerPoint PPT Presentation


Transcript and Presenter's Notes

Title: The Web as a Parallel Corpus


1
The Web as a Parallel Corpus
  • A paper by Philip Resnik and Noah A. Smith
  • (2003, Computational Linguistics)
  • My interpretation of their research.
  • http://www.thebritishmuseum.ac.uk/compass/ixbin/goto?id=OBJ67

2
Contents
  • Introduction to parallel corpora
  • The STRAND Web-mining architecture (est. 1999)
  • Content-Based Matching
  • Exploiting the Internet Archive
  • Conclusions and Further Work

3
Introduction to parallel corpora
  • The Rosetta Stone dates from around 196 BC. The
    three texts on the stone carry the same content
    in hieroglyphic, Demotic, and Greek (three
    different scripts).
  • The Canadian Hansard and the Hong Kong Hansard are
    two other famous parallel corpora, especially
    because they are available electronically and are
    of high quality.
  • Motivation: Bitexts provide indispensable training
    data for statistical translation models.
  • The Web can be mined for suitable bilingual and
    multilingual texts.

4
STRAND Web-Mining Architecture(1)
  • Structural Translation Recognition, Acquiring
    Natural Data (STRAND) is the authors' software for
    finding pairs of Web pages that are translations
    of each other.
  • Using more parallel texts is always to the
    advantage of machine translation research and
    implementation.
  • How does STRAND work?
  • 1) Location of pages that might have parallel
    translations: look for parent pages and sibling
    pages. The web-page writer most probably has a
    language link such as "Chinese" or "Arabic"
    embedded in the page.
  • 2) Generation of candidate pairs that might be
    translations: see whether the pairs have the same
    HTML structure.
  • 3) Structural filtering out of non-translation
    candidate pairs: search the content of the pairs.

5
STRAND Web-Mining Architecture(2)
  • 1) Locating pairs: candidate pairs typically come
    from a single website. STRAND looks for sibling
    pages; these pages are often linked to each other
    by links offering the user "Français", "Español",
    or other language options.
  • 2) Generating pairs: for many websites the URLs
    can be compared, e.g.
    http://www.ottawa.ca/index_en.html vs.
    http://www.ottawa.ca/index_fr.html (see the
    sketch at the end of this slide).
  • 3) Structural filtering: first look at the HTML
    structure; web-page writers often use the same or
    a very similar template for both languages. Next,
    a markup analyzer using three token types produces
    a linear representation of each of the two
    candidate web pages (continued on the next slide).
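
  • A minimal sketch of the URL-substitution idea in
    step 2, assuming a small illustrative table of
    language markers (the marker list and function
    name are hypothetical, not STRAND's actual code):

    # Sketch: two URLs that differ only in a language marker (e.g. "_en"
    # vs. "_fr") are likely to be translations of the same page.
    LANGUAGE_MARKERS = [("_en", "_fr"), ("/en/", "/fr/"), ("english", "french")]

    def candidate_pairs(urls):
        """Yield (english_url, french_url) pairs differing only by a marker."""
        url_set = set(urls)
        for url in urls:
            for en_marker, fr_marker in LANGUAGE_MARKERS:
                if en_marker in url:
                    partner = url.replace(en_marker, fr_marker)
                    if partner in url_set:
                        yield url, partner

    urls = [
        "http://www.ottawa.ca/index_en.html",
        "http://www.ottawa.ca/index_fr.html",
    ]
    print(list(candidate_pairs(urls)))
    # [('http://www.ottawa.ca/index_en.html', 'http://www.ottawa.ca/index_fr.html')]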

6
STRAND Web-Mining Architecture(3)
  • Candidate pairs:

    <HTML>                          <HTML>
    <TITLE>City Hall</TITLE>        <TITLE>Hôtel de Ville</TITLE>
    <BODY>                          <BODY>
    <H1>Regional Government</H1>    Les affaires ...
    The business ...
  • Candidate pairs, now formed into two linear
    alignments (a sketch of such a linearizer follows
    below):

    [START:HTML]     [START:HTML]
    [START:TITLE]    [START:TITLE]
    [Chunk:8]        [Chunk:12]
    [END:TITLE]      [END:TITLE]
    [START:H1]       [START:BODY]
    [Chunk:18]       [Chunk:138]
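
  • A rough sketch of such a markup analyzer, built
    here on Python's standard html.parser; the three
    token types follow the paper's notation, but the
    implementation details are assumptions:

    # Sketch of a markup analyzer that linearizes a page into three token
    # types: [START:tag], [END:tag], and [Chunk:length of non-markup text].
    # Illustrative only; not the authors' actual tokenizer.
    from html.parser import HTMLParser

    class Linearizer(HTMLParser):
        def __init__(self):
            super().__init__()
            self.tokens = []

        def handle_starttag(self, tag, attrs):
            self.tokens.append(f"[START:{tag.upper()}]")

        def handle_endtag(self, tag):
            self.tokens.append(f"[END:{tag.upper()}]")

        def handle_data(self, data):
            text = data.strip()
            if text:
                self.tokens.append(f"[Chunk:{len(text)}]")

    def linearize(html):
        parser = Linearizer()
        parser.feed(html)
        return parser.tokens

    print(linearize("<HTML><TITLE>City Hall</TITLE><BODY>"
                    "<H1>Regional Government</H1>"))
    # ['[START:HTML]', '[START:TITLE]', '[Chunk:9]', '[END:TITLE]',
    #  '[START:BODY]', '[START:H1]', '[Chunk:19]', '[END:H1]']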

7
Using these two linear alignments
  • We use four scalar values to characterize the
    quality of the alignment:
  • dp (difference percentage): mismatches in the
    alignment (that is, tokens that don't match)
  • n: the number of aligned non-markup text chunks
  • r: the correlation of the lengths of the aligned
    non-markup chunks
  • p: the level of significance of the correlation r
  • Next, the analyst can manually set thresholds on
    these parameters and check the results. 100%
    precision and 68.6% recall have been obtained
    using STRAND to find English-French Web pages (a
    sketch of computing these statistics follows).
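
  • A hedged sketch of how these four statistics might
    be computed from two token sequences, using
    difflib for the alignment and scipy for the
    correlation; STRAND's actual implementation may
    differ:

    # Sketch: computing dp, n, r, p for a candidate pair from its two
    # linearized token sequences. Approximates the paper's description.
    import difflib
    from scipy.stats import pearsonr

    def chunk_length(token):
        """Return N for a [Chunk:N] token, or None for a markup token."""
        return int(token[7:-1]) if token.startswith("[Chunk:") else None

    def alignment_stats(tokens_a, tokens_b):
        # Treat every chunk as the same symbol so chunks of differing
        # lengths can still be aligned against each other.
        key_a = ["[Chunk]" if chunk_length(t) is not None else t for t in tokens_a]
        key_b = ["[Chunk]" if chunk_length(t) is not None else t for t in tokens_b]

        matcher = difflib.SequenceMatcher(None, key_a, key_b)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        dp = 1.0 - matched / max(len(key_a), len(key_b))  # difference percentage

        # Lengths of aligned non-markup text chunks.
        pairs = []
        for block in matcher.get_matching_blocks():
            for i in range(block.size):
                la = chunk_length(tokens_a[block.a + i])
                lb = chunk_length(tokens_b[block.b + i])
                if la is not None and lb is not None:
                    pairs.append((la, lb))

        n = len(pairs)  # number of aligned non-markup chunks
        if n > 1:
            r, p = pearsonr([a for a, _ in pairs], [b for _, b in pairs])
        else:
            r, p = 0.0, 1.0
        return dp, n, r, p

    english = ["[START:HTML]", "[START:TITLE]", "[Chunk:8]",
               "[END:TITLE]", "[START:H1]", "[Chunk:18]"]
    french = ["[START:HTML]", "[START:TITLE]", "[Chunk:12]",
              "[END:TITLE]", "[START:BODY]", "[Chunk:138]"]
    print(alignment_stats(english, french))  # dp ~ 0.17, n = 2, r = 1.0, p = 1.0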

8
Optimizing Parameters Using Machine Learning
  • A ninefold cross-validation experiment using
    decision-tree induction was run to predict the
    class assigned by the human judges. The learned
    classifiers were substantially different from the
    manually set (heuristic) thresholds (see the
    sketch below).
  • Manually set thresholds: 31% of good document
    pairs were discarded.
  • ML-set thresholds: 16% of good pairs were
    discarded (4% false positives).
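
  • A minimal sketch of the machine-learning variant,
    with the four statistics (dp, n, r, p) as features
    and human judgments as labels; scikit-learn and
    the toy rows below are illustrative stand-ins, not
    the paper's tool or data:

    # Sketch: learning a decision-tree classifier over (dp, n, r, p)
    # instead of hand-set thresholds. Toy placeholder data only.
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Each row: (dp, n, r, p) for one candidate pair; label 1 = judged a
    # genuine translation pair, 0 = judged not.
    X = [
        [0.05, 42, 0.97, 0.0001],
        [0.40,  7, 0.30, 0.2000],
        [0.08, 35, 0.95, 0.0005],
        [0.55,  4, 0.10, 0.6000],
    ]
    y = [1, 0, 1, 0]

    clf = DecisionTreeClassifier(max_depth=3, random_state=0)
    # The paper used ninefold cross-validation on its judged pairs;
    # two folds here only because the toy data set is tiny.
    scores = cross_val_score(clf, X, y, cv=2)
    print(scores.mean())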

9
  • Other Related Work / Other Linguistic Researchers
  • Some analysts use Parallel Text Miner (PTMiner),
    which uses existing search engines to locate
    pages that are likely to be in the other language
    of interest. A final filtering stage is then
    undertaken to clean the corpus.
  • Bilingual Internet Text Search (BITS) is used by
    other researchers and employs different matching
    techniques.
  • STRAND, PTMiner, and BITS are all largely
    independent of linguistic knowledge about
    particular languages, and therefore very easily
    ported to new language pairs.
  • Resnik has looked into English-Arabic,
    English-Chinese (Big5), and English-Basque.

10
Mining the Web
  • Researchers can and do mine the internet every
    day. The physicist Albert-László Barabási, for
    example, has had his team study the size, shape,
    and structure of the internet as well as the hit
    frequencies of numerous Web pages.
  • Spiders or crawlers are used in research.
  • The Internet Archive (www.archive.org/web/researcher/)
    is also instrumental in obtaining useful
    information.

11
The Internet Archive
  • The Internet Archive is a nonprofit organization
    attempting to archive the entire publicly
    available Web, preserving the content and
    providing free access to researchers, historians,
    scholars, and the general public.
  • (120 terabytes of information in 2002)
  • Over 10 billion Web pages.
  • Properties of the Archive:
  • 1) The archive is a temporal database, but it is
    not stored in temporal order.
  • 2) Extracting a document is an expensive
    operation (text extraction).
  • 3) Computational complexity must be kept low when
    mining this database.
  • 4) Data relevant for linguistic purposes are
    clearly available.
  • 5) A suite of tools exists for linguistic
    processing of the archive.

12
Building an English-Arabic Corpus
  • Step 1: Search for English-Arabic pairs. Look at
    24 top-level national domains for countries where
    Arabic is spoken, e.g. Egypt (.eg), Saudi Arabia
    (.sa), and Kuwait (.kw), plus other .com domains
    believed to be useful to Arabic-speaking people
    (see the sketch at the end of this slide).
  • Step 2: Resnik et al. mined two crawls of the
    Internet Archive comprising 8 TB and 12 TB. The
    relevant domains contained 19,917,923 pages.
  • Step 3: Only 8,294 pairs of English-Arabic
    bitexts were found.
  • EVALUATION
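
  • A small sketch of the Step 1 domain filter,
    listing only a few of the 24 Arabic-country
    top-level domains mentioned above (the full list
    is in the paper):

    # Sketch: keep only crawl URLs whose host falls under a top-level
    # national domain for a country where Arabic is spoken (subset shown).
    from urllib.parse import urlparse

    ARABIC_TLDS = {".eg", ".sa", ".kw"}  # Egypt, Saudi Arabia, Kuwait

    def is_candidate_domain(url):
        host = urlparse(url).hostname or ""
        return any(host.endswith(tld) for tld in ARABIC_TLDS)

    print(is_candidate_domain("http://www.example.gov.sa/"))   # True
    print(is_candidate_domain("http://www.example.ca/"))       # False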

13
Conclusions and Further Work
  • Initial web searches for parallel texts were
    undertaken in 1998; Resnik's report is from
    2002. The authors lament the limited range of
    languages available on the internet as well as
    the lack of data made available by some
    countries.
  • The growth of both the internet and the internet
    archive will considerably add to the expansion of
    parallel corpora.
  • Chen and Nie (2000), for example, found around
    15,000 English-Chinese document pairs.
  • One of the early STRAND projects for
    English-Chinese parallel texts found over 70,000
    pairs.
  • Because STRAND expects pages to be very similar
    in structural terms, the resulting document
    collections are particularly amenable to
    sentence- or segment-level alignment.