The Web as Corpus for Linguistic Research

Title: The Web as Corpus for Linguistic Research


1
The Web as Corpus for Linguistic Research
  • Martin Volk
  • Stockholm University

2
Diachronic Study: Handy vs. Natel
  • Restrict search
  • to German
  • to web addresses in Switzerland (CH)
  • (i.e. to the domain '.ch')
  • AltaVista frequencies (number of pages found)
  • Handy: 28'417 (last year: 23'792)
  • Natel: 23'156 (last year: 15'165)

3
Results of the AltaVista-Search

4
Results of the AltaVista-Search

5
The Web as Corpus
  • Advantages
  • Size: > 1 billion documents
  • Accessibility: from every networked computer
  • Timeliness: always new documents

6
Size of the Web per Language
  • How does one estimate how many words there are on
    the Web for a given language (as indexed by a
    search engine)?
  • Search for very frequent words (e.g. "the" for EN
    or "det" for SV). The relative frequency of these
    tokens per 1000 tokens is relatively constant,
    so the count for such a word, divided by its
    relative frequency, gives an estimate of the total.
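
The estimate described above can be sketched in a few lines of Python. This is only an illustration: the function name and the figures are invented, and it assumes the count reported by the search engine approximates the number of occurrences of the word (engines actually report pages, so a real estimate would need a correction for that).

```python
def estimate_total_tokens(hit_count: int, rel_freq_per_1000: float) -> float:
    """Estimate corpus size in tokens from the count of one frequent word."""
    if rel_freq_per_1000 <= 0:
        raise ValueError("relative frequency must be positive")
    # If the word accounts for rel_freq_per_1000 out of every 1000 tokens,
    # the corpus holds hit_count * 1000 / rel_freq_per_1000 tokens.
    return hit_count * 1000.0 / rel_freq_per_1000


# Hypothetical figures, chosen only to make the arithmetic easy to follow.
estimate = estimate_total_tokens(hit_count=60_000_000, rel_freq_per_1000=60.0)
print(f"{estimate:,.0f} tokens")  # 1,000,000,000 tokens
```

Averaging the estimates obtained from several frequent words would make the result less sensitive to any single word.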

7
Corpus Access
  • to a locally stored corpus
  • via database interface or
  • special software (e.g. concordance program)
  • → Complete searches are possible.
  • to the Web
  • via search engines (like Google or AltaVista)
  • → Complete searches are not possible.

8
Search engine
  • provides links to hits (ordered by relevance)
  • provides frequencies for the query items
  • allows
  • Restriction to a language
  • Restriction to time span
  • Restriction to domain (e.g. only Web addresses
    with suffix .se)
  • Combination of search items
  • (AND, OR, NOT, NEAR)
  • Search with regular expressions (wildcards *, ?)

9
Extraction and automatic "Learning" of names
  • Search for patterns
  • EN: "the following politicians ..."
  • DE: "die folgenden Politiker ..."
  • → Extraction of person names
  • EN: "Chemical companies like ..."
  • → Extraction of company names
  • DE: "PCs der Serie ..." ("PCs of the ... series")
  • → Extraction of product names
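
A minimal sketch of such pattern-based extraction, assuming the retrieved page text is already available as a string. The function name and the regular expression are illustrative assumptions: the pattern only handles single-word capitalized names separated by commas or "and", which is far cruder than a real extraction system.

```python
import re


def extract_after_trigger(trigger: str, text: str) -> list[str]:
    """Collect capitalized names that are enumerated right after a trigger phrase."""
    pattern = re.compile(
        re.escape(trigger) + r"\s+((?:[A-Z][\w&-]*(?:, and |, | and )?)+)"
    )
    names = []
    for match in pattern.finditer(text):
        # Split the enumeration "A, B and C" into its items.
        parts = re.split(r",\s*(?:and\s+)?|\s+and\s+", match.group(1))
        names.extend(part for part in parts if part)
    return names


sample = "Chemical companies like BASF, Bayer and Dow reported gains."
print(extract_after_trigger("Chemical companies like", sample))
# ['BASF', 'Bayer', 'Dow']
```

Names harvested this way could then be fed back as seeds to find further patterns, which is the "learning" step the slide title alludes to.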

10
Translation of compounds DE-EN (Grefenstette 1999)
  • DE Aktienkurs → EN ??
  • Aktie (share, stock) and
  • Kurs (course, price, rate)
  • Generation of all possible translations (share
    course, share price, share rate, stock course,
    ...)
  • Idea: The translation with the highest web
    frequency is correct!
  • Leads to 87% correct decisions.
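
The generate-and-rank idea can be sketched as follows. The hit counts are invented for illustration, and `hit_count` stands in for a real search engine query; none of this is taken from Grefenstette's implementation.

```python
from itertools import product


def rank_compound_translations(part_translations, hit_count):
    """Combine the translations of each compound part and rank by web frequency."""
    candidates = [" ".join(combo) for combo in product(*part_translations)]
    return sorted(candidates, key=hit_count, reverse=True)


# Hypothetical page counts; a real system would query a search engine here.
FAKE_HITS = {
    "share price": 900,
    "stock price": 700,
    "share rate": 40,
    "stock rate": 25,
    "stock course": 5,
    "share course": 2,
}

ranking = rank_compound_translations(
    [["share", "stock"], ["course", "price", "rate"]],  # Aktie, Kurs
    lambda phrase: FAKE_HITS.get(phrase, 0),
)
print(ranking[0])  # share price
```

Passing the counting function in as a parameter keeps the ranking logic independent of whichever search engine supplies the frequencies.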

11
Building Parallel Corpora (Resnik 1999: Mining
the web for bilingual text)
  • Parallel corpora are important for translators or
    computational linguists
  • Problem: Find translated texts on the Web!
  • Idea: Search for patterns like
  • → "Click here for English version"
  • Load the documents and compare their structure
    (length, number of paragraphs, names, numbers,
    ...)
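
The structure comparison in the last step might look roughly like the sketch below. It is not Resnik's algorithm: the three features (text length, paragraph count, shared numbers) and their equal weighting are assumptions made for illustration.

```python
import re


def structure_profile(text: str) -> dict:
    """Extract simple, language-independent structural features."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "length": len(text),
        "paragraphs": len(paragraphs),
        "numbers": set(re.findall(r"\d+", text)),
    }


def ratio(a: int, b: int) -> float:
    """Return smaller/larger, or 1.0 when both values are zero."""
    return min(a, b) / max(a, b) if max(a, b) else 1.0


def structural_similarity(doc_a: str, doc_b: str) -> float:
    """Average three feature similarities into a score between 0 and 1."""
    pa, pb = structure_profile(doc_a), structure_profile(doc_b)
    union = pa["numbers"] | pb["numbers"]
    number_overlap = len(pa["numbers"] & pb["numbers"]) / len(union) if union else 1.0
    return (
        ratio(pa["length"], pb["length"])
        + ratio(pa["paragraphs"], pb["paragraphs"])
        + number_overlap
    ) / 3


german = "Der Umsatz stieg 2002 um 12 Prozent.\n\nDie Firma hat 300 Mitarbeiter."
english = "Revenue rose by 12 percent in 2002.\n\nThe company has 300 employees."
unrelated = "Hello world."
print(structural_similarity(german, english) > structural_similarity(german, unrelated))
```

Numbers and names are useful features because they tend to survive translation unchanged, while length and paragraph count only need to be roughly proportional.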

12
Building Parallel Corpora (Idea by Martin Volk
2003 -)
  • Alternative approach for finding parallel texts
    on the Web
  • Automatic translation of a Web page.
  • Search for "similar" Web pages (available in
    Google) to the translated text
  • Compare the structure of the original with the
    documents found.

13
Answer Validation (Magnini et al. 2002: Is it
the right answer?)
  • Problem: A question answering system finds a set of
    possible answers to a given question.
  • Idea: If (parts of) the question and the answer
    co-occur frequently on the Web, then it is a good
    answer.
  • Procedure: Extraction of key words from the
    question and the answer. Increase the recall via
    pattern shortening.

14
Answer Validation (Example)
  • Question: Which river in the US is known as Big
    Muddy?
  • Candidate answers: Mississippi River and Columbia
    River
  • Web-Search 1: river NEAR US NEAR Big NEAR Muddy
    NEAR Mississippi NEAR River
  • Web-Search 2: river NEAR US NEAR Big NEAR Muddy
    NEAR Columbia NEAR River
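
The two web searches above can be generated and compared mechanically. The sketch below assumes the key words have already been extracted; the hit counts are invented, and ranking by raw co-occurrence count is a simplification of the scoring used by Magnini et al.

```python
def near_query(question_keywords, answer_keywords):
    """Join all key words with the NEAR operator, question first."""
    return " NEAR ".join([*question_keywords, *answer_keywords])


def rank_answers(question_keywords, candidates, hit_count):
    """Order candidate answers by how often they co-occur with the question."""
    scored = [
        (hit_count(near_query(question_keywords, answer.split())), answer)
        for answer in candidates
    ]
    return [answer for _, answer in sorted(scored, key=lambda p: p[0], reverse=True)]


question = ["river", "US", "Big", "Muddy"]
candidates = ["Mississippi River", "Columbia River"]

# Hypothetical page counts standing in for a real search engine.
FAKE_HITS = {
    near_query(question, ["Mississippi", "River"]): 120,
    near_query(question, ["Columbia", "River"]): 3,
}

print(near_query(question, ["Mississippi", "River"]))
# river NEAR US NEAR Big NEAR Muddy NEAR Mississippi NEAR River
print(rank_answers(question, candidates, lambda q: FAKE_HITS.get(q, 0)))
```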

15
Synonym Quality (Turney 2001: Mining the Web for
Synonyms)
  • Given: from TOEFL (Test of English as a Foreign
    Language)
  • Query word: levied
  • Possible synonyms: imposed, believed, requested,
    correlated
  • Question: Which of the possible synonyms is
    closest in meaning to the query word?

16
Synonym Quality
  • Pointwise Mutual Information Score
  • score(choice) = log (p(problem & choice) /
    (p(problem) p(choice)))
  • can be simplified to (log is monotonic and
    p(problem) is the same for every choice)
  • score(choice) = p(problem & choice) / p(choice)
  • Translates to
  • score(choice) = hits(problem AND choice) /
    hits(choice)

17
Synonym Quality
  • Alternative
  • score(choice) = hits(problem NEAR choice) /
    hits(choice)
  • Or
  • score(choice) = hits((problem AND choice) AND NOT
    ((problem OR choice) NEAR "not")) / hits(choice
    AND NOT ((problem OR choice) NEAR "not"))
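
A small sketch of the AND and NEAR variants of the score, applied to the TOEFL item from the earlier slide. The `hits` function is a stand-in for a search engine and all counts are invented; only the ratio itself comes from the slides.

```python
def pmi_ir_score(problem, choice, hits, operator="AND"):
    """score(choice) = hits(problem OP choice) / hits(choice)."""
    denominator = hits(choice)
    if denominator == 0:
        return 0.0  # a choice that never occurs cannot be scored
    return hits(f"{problem} {operator} {choice}") / denominator


def best_synonym(problem, choices, hits, operator="AND"):
    """Pick the choice with the highest score."""
    return max(choices, key=lambda c: pmi_ir_score(problem, c, hits, operator))


# Hypothetical page counts, for illustration only.
FAKE_HITS = {
    "imposed": 10_000, "levied AND imposed": 800,
    "believed": 50_000, "levied AND believed": 500,
    "requested": 30_000, "levied AND requested": 300,
    "correlated": 5_000, "levied AND correlated": 10,
}

choices = ["imposed", "believed", "requested", "correlated"]
print(best_synonym("levied", choices, lambda q: FAKE_HITS.get(q, 0)))  # imposed
```

Dividing by hits(choice) matters: with these counts "believed" co-occurs with "levied" almost as often as "imposed" does in absolute terms, but it is so frequent overall that its ratio is much lower.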

18
Synonym Quality
  • Results over 80 TOEFL questions

19
The Web as Corpus
  • Disadvantages
  • Size
  • Timeliness: always new documents
  • (Partially) unreliable data sources
  • Not linguistically structured

20
Bottleneck: Search engine
  • is optimized for content retrieval and not for
    linguistic questions.
  • does not support linguistic operators (in the
    same phrase, in the same clause, with suffix).
  • does not allow a (precise) restriction to domains
    or genre (text type).

21
Current Solution: Linguistic Filter
  • e.g. the program KWiCFinder
  • allows (more) precise queries
  • e.g. BEFORE, AFTER with a distance
  • is a front end to AltaVista
  • displays the results as Keyword-in-Context (KWIC)
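
For readers unfamiliar with the format, a keyword-in-context display can be sketched as below. This is a generic illustration, not KWiCFinder's implementation; the function name, the whitespace tokenization and the window size are assumptions.

```python
def kwic(text: str, keyword: str, width: int = 3):
    """Return (left context, keyword, right context) for every match."""
    tokens = text.split()
    rows = []
    for i, token in enumerate(tokens):
        # Ignore surrounding punctuation and case when matching.
        if token.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, token, right))
    return rows


for left, key, right in kwic("The web is a corpus. The corpus is large.", "corpus", width=2):
    # Right-align the left context so the keywords line up in one column.
    print(f"{left:>12} | {key} | {right}")
```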

22
Future: A Linguistic Search Engine
  • Linguistic analysis in indexing
  • Lemmatising and compound segmentation
  • Search Haus, Hauses, Häuser and Häusern
  • Search bok, boks, böcker, böckerna
  • Part-of-Speech Tagging
  • Search German 'Junge' as noun
  • Search English 'can' as noun
  • Phrase recognition
  • Search 'med barn' in the same noun phrase
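
To make the lemmatising idea concrete, here is a toy index keyed by lemma rather than by word form. The hand-written lemma table and the sample documents are assumptions for illustration; a real engine would use a morphological analyser during indexing.

```python
from collections import defaultdict

# Hypothetical lemma table covering only the forms from the slide.
LEMMAS = {"Haus": "Haus", "Hauses": "Haus", "Häuser": "Haus", "Häusern": "Haus"}


def build_lemma_index(docs: dict) -> dict:
    """Map each lemma to the ids of the documents containing any of its forms."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            word = token.strip(".,;:!?")
            # Unknown words are indexed under their surface form.
            index[LEMMAS.get(word, word)].add(doc_id)
    return index


docs = {1: "Die Häuser sind alt.", 2: "Das Dach des Hauses.", 3: "Ein Baum."}
index = build_lemma_index(docs)
print(sorted(index["Haus"]))  # [1, 2]
```

A query for 'Haus' then finds documents containing 'Häuser' or 'Hauses' without the user having to list every inflected form.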

23
Summary
  • The Web as Corpus
  • provides many new opportunities.
  • But a more linguistically-oriented access is
    necessary.