Title: The Web as Corpus for Linguistic Research
1 The Web as Corpus for Linguistic Research
- Martin Volk
- Stockholm University
2 Diachronic Study: Handy vs. Natel
- Restrict search
- to German
- to web addresses in Switzerland (CH)
- (i.e. to the domain '.ch')
- AltaVista frequencies (number of pages found)
- Handy: 28'417 (last year: 23'792)
- Natel: 23'156 (last year: 15'165)
3 Results of the AltaVista Search
4 Results of the AltaVista Search
5 The Web as Corpus
- Advantages
- Size: > 1 billion documents
- Accessibility: from every networked computer
- Timeliness: always new documents
6 Size of the Web per Language
- How does one estimate how many words there are in the web for a given language (as indexed by a search engine)?
- Search for very frequent words (e.g. for EN "the" or SV "det"). The relative frequency of these tokens per 1000 tokens is relatively constant, so the hit count divided by that frequency approximates the total number of tokens.
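The estimate can be sketched in a few lines of Python. This is a minimal illustration, not a published procedure: `estimate_tokens` is a hypothetical helper, and the hit count and relative frequency are invented placeholder values. Since a search engine reports pages rather than occurrences, the result is best read as a rough lower bound.

```python
def estimate_tokens(hits: int, freq_per_1000: float) -> float:
    """Estimate the number of tokens indexed for a language.

    hits: count reported by the search engine for a very frequent word.
    freq_per_1000: occurrences of that word per 1000 tokens, taken from
        a reference corpus of the same language.
    """
    if freq_per_1000 <= 0:
        raise ValueError("freq_per_1000 must be positive")
    return hits * 1000 / freq_per_1000


# Placeholder values for illustration only (assumptions, not measurements).
example_hits = 5_000_000
example_freq = 50.0  # a word assumed to make up 5% of all tokens

print(f"{estimate_tokens(example_hits, example_freq):,.0f} tokens")
```

Repeating the calculation with several frequent words and comparing the results gives a feeling for how stable the estimate is.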
7 Corpus Access
- to a locally stored corpus
- via database interface or
- special software (e.g. concordance program)
- → Complete searches are possible.
- to the Web
- via search engines (like Google or AltaVista)
- → Complete searches are not possible.
8 Search engine
- provides links to hits (ordered by relevance)
- provides frequencies for the query items
- allows
- Restriction to a language
- Restriction to time span
- Restriction to domain (e.g. only Web addresses with suffix .se)
- Combination of search items
- (AND, OR, NOT, NEAR)
- Search with regular expressions (wildcards *, ?)
9 Extraction and automatic "Learning" of names
- Search for patterns
- EN "the following politicians ..."
- DE "die folgenden Politiker ..." (the following politicians ...)
- → Extraction of person names
- EN "Chemical companies like ..."
- → Extraction of company names
- DE "PCs der Serie ..." (PCs of the series ...)
- → Extraction of product names
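A minimal sketch of such pattern-based extraction, assuming the retrieved page text is already available as a string. The function name `extract_after_cue`, the sample sentences and all names in them are invented for illustration. The rule "the list ends at the next full stop" is a simplification that fails for names containing a period.

```python
import re


def extract_after_cue(text: str, cue_phrase: str) -> list[str]:
    """Return the list items that follow a cue phrase, up to the next full stop."""
    pattern = re.compile(re.escape(cue_phrase) + r":?\s+([^.]+)\.", re.IGNORECASE)
    names = []
    for match in pattern.finditer(text):
        # Split the enumeration on commas and on "and".
        items = re.split(r"\s*,\s*(?:and\s+)?|\s+and\s+", match.group(1))
        names.extend(item.strip() for item in items if item.strip())
    return names


# Invented sample sentences (assumptions, not web data).
politicians = "We interviewed the following politicians: Anna Berg, Carl Holm and Eva Lind."
companies = "Chemical companies like Alpha Chem and Beta Works."

print(extract_after_cue(politicians, "the following politicians"))
print(extract_after_cue(companies, "Chemical companies like"))
```

The same function can be applied to the German cue phrases; only the conjunction in the splitting expression would need to change from "and" to "und".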
10 Translation of compounds DE-EN (Grefenstette 1999)
- DE Aktienkurs → EN ??
- Aktie (share, stock) and
- Kurs (course, price, rate)
- Generation of all possible translations (share course, share price, share rate, stock course, ...)
- Idea: The translation with the highest web frequency is correct!
- Leads to 87% correct decisions.
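The generate-and-rank idea can be sketched as follows. The part translations come from the slide; the web frequencies are invented placeholder numbers, and `best_translation` is a hypothetical helper, not code from Grefenstette (1999).

```python
from itertools import product


def best_translation(part_translations, web_frequency):
    """Combine the translations of each compound part and pick the most frequent.

    part_translations: one list of candidate translations per compound part.
    web_frequency: mapping from candidate phrase to its hit count.
    """
    candidates = [" ".join(parts) for parts in product(*part_translations)]
    best = max(candidates, key=lambda phrase: web_frequency.get(phrase, 0))
    return best, candidates


# Aktienkurs = Aktie + Kurs
parts = [["share", "stock"], ["course", "price", "rate"]]

# Invented hit counts for illustration only (assumptions, not real data).
counts = {
    "share course": 12,
    "share price": 9500,
    "share rate": 40,
    "stock course": 8,
    "stock price": 8700,
    "stock rate": 55,
}

best, candidates = best_translation(parts, counts)
print(candidates)
print(best)
```

With real data, the dictionary would be filled by querying a search engine once for each candidate phrase.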
11 Building Parallel Corpora (Resnik 1999: Mining the web for bilingual text)
- Parallel corpora are important for translators or computational linguists
- Problem: Find translated texts in the Web!
- Idea: Search for patterns like
- "Click here for English version"
- Load the documents and compare their structure (length, number of paragraphs, names, numbers, ...)
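A rough sketch of the structural comparison step, assuming both documents have already been downloaded and reduced to plain text. This is a simplified stand-in and not Resnik's actual algorithm: the three features, the threshold `max_length_ratio` and the function names are assumptions chosen for illustration.

```python
import re


def structure_profile(text: str) -> dict:
    """Collect simple, language-independent features of a document."""
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    return {
        "length": len(text),
        "paragraphs": len(paragraphs),
        "numbers": set(re.findall(r"\d+", text)),
    }


def looks_parallel(a: str, b: str, max_length_ratio: float = 1.5) -> bool:
    """Accept a pair if length, paragraph count and numbers are compatible."""
    pa, pb = structure_profile(a), structure_profile(b)
    longer = max(pa["length"], pb["length"])
    shorter = min(pa["length"], pb["length"])
    if shorter == 0:
        return False
    return (
        longer / shorter <= max_length_ratio
        and pa["paragraphs"] == pb["paragraphs"]
        and pa["numbers"] == pb["numbers"]
    )


# Invented sample texts (assumptions, not web data).
english = "The company was founded in 1999.\n\nIt has 250 employees."
german = "Die Firma wurde 1999 gegründet.\n\nSie hat 250 Mitarbeiter."

print(looks_parallel(english, german))
```

Requiring exact equality of paragraph counts and numbers is deliberately strict; a practical filter would probably tolerate small differences.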
12 Building Parallel Corpora (Idea by Martin Volk 2003 -)
- Alternative approach for finding parallel texts in the web
- Automatic translation of a Web page.
- Search for "similar" Web pages (available in Google) to the translated text
- Compare the structure of the original with the found documents.
13 Answer Validation (Magnini et al. 2002: Is it the right answer?)
- Problem: A question-answering system finds a set of possible answers to a given question.
- Idea: If (parts of) the question and the answer co-occur frequently in the Web, then it is a good answer.
- Procedure: Extraction of key words from the question and the answer. Increase the recall via pattern shortening.
14 Answer Validation (Example)
- Question: Which river in the US is known as Big Muddy?
- Answers: Mississippi River and Columbia River
- Web-Search 1: river NEAR US NEAR Big NEAR Muddy NEAR Mississippi NEAR River
- Web-Search 2: river NEAR US NEAR Big NEAR Muddy NEAR Columbia NEAR River
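The two web searches can be generated and compared mechanically. The sketch below assumes a search engine that supports a NEAR operator and is represented here by an arbitrary callable `hit_count`; that interface, the function names and the hit counts are all invented for illustration.

```python
def near_query(question_keywords, answer_keywords):
    """Join all keywords with the NEAR operator, question keywords first."""
    return " NEAR ".join([*question_keywords, *answer_keywords])


def rank_answers(question_keywords, answers, hit_count):
    """Sort candidate answers by the hit count of their NEAR query, best first.

    hit_count: any callable mapping a query string to a number; it stands
    in for a search engine (hypothetical interface).
    """
    scored = [
        (hit_count(near_query(question_keywords, keywords)), name)
        for name, keywords in answers.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


question = ["river", "US", "Big", "Muddy"]
answers = {
    "Mississippi River": ["Mississippi", "River"],
    "Columbia River": ["Columbia", "River"],
}

# Invented hit counts for illustration only (assumptions, not real data).
invented_counts = {
    near_query(question, answers["Mississippi River"]): 42,
    near_query(question, answers["Columbia River"]): 7,
}

print(near_query(question, answers["Mississippi River"]))
print(rank_answers(question, answers, lambda q: invented_counts.get(q, 0)))
```

If a query returns no hits at all, dropping keywords from it and searching again is one way to raise recall.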
15 Synonym Quality (Turney 2001: Mining the Web for Synonyms)
- Given: from TOEFL (Test of English as a Foreign Language)
- Query word: levied
- Possible synonyms: imposed, believed, requested, correlated
- Question: Which of the possible synonyms is closest in meaning to the query word?
16 Synonym Quality
- Pointwise Mutual Information Score
- score(choice) = log( p(problem & choice) / (p(problem) * p(choice)) )
- can be simplified to (p(problem) is the same for every choice, and log is monotonic)
- score(choice) = p(problem & choice) / p(choice)
- Translates to
- score(choice) = hits(problem AND choice) / hits(choice)
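The following sketch computes the full and the simplified score for the four TOEFL choices and checks that both pick the same candidate. All hit counts, the total page count and the function names are invented for illustration; the natural logarithm is used, which does not affect the ranking.

```python
import math


def pmi_score(hits_both, hits_problem, hits_choice, total_pages):
    """Full score: log of p(problem & choice) / (p(problem) * p(choice))."""
    if hits_both == 0:
        return float("-inf")  # the pair never co-occurs
    p_both = hits_both / total_pages
    p_problem = hits_problem / total_pages
    p_choice = hits_choice / total_pages
    return math.log(p_both / (p_problem * p_choice))


def simplified_score(hits_both, hits_choice):
    """Simplified score: hits(problem AND choice) / hits(choice)."""
    return hits_both / hits_choice


# Invented hit counts for the query word "levied" (assumptions, not real data).
TOTAL_PAGES = 1_000_000_000
HITS_PROBLEM = 40_000
choices = {
    # choice: (hits(problem AND choice), hits(choice))
    "imposed": (6_000, 2_000_000),
    "believed": (900, 9_000_000),
    "requested": (1_200, 8_000_000),
    "correlated": (50, 1_500_000),
}

by_full = max(
    choices,
    key=lambda c: pmi_score(choices[c][0], HITS_PROBLEM, choices[c][1], TOTAL_PAGES),
)
by_simple = max(choices, key=lambda c: simplified_score(*choices[c]))
print(by_full, by_simple)
```

Because p(problem) and the total page count are constant across choices, only the two hit counts per choice have to be queried.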
17 Synonym Quality
- Alternative
- score(choice) = hits(problem NEAR choice) / hits(choice)
- Or
- score(choice) = hits((problem AND choice) AND NOT ((problem OR choice) NEAR "not")) / hits(choice AND NOT ((problem OR choice) NEAR "not"))
18 Synonym Quality
- Results over 80 TOEFL questions
19 The Web as Corpus
- Disadvantages
- Size
- Timeliness: always new documents
- (Partially) unreliable data sources
- Not linguistically structured
20 Bottleneck: Search engine
- is optimized for content retrieval and not for linguistic questions.
- does not support linguistic operators (in the same phrase, in the same clause, with suffix).
- does not allow a (precise) restriction to domains or genre (text type).
21 Current Solution: Linguistics Filter
- e.g. the program KWiCFinder
- allows (more) precise queries
- e.g. BEFORE, AFTER with a distance
- is a front end to AltaVista
- displays the results as Keyword-in-Context (KWIC)
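To illustrate what a Keyword-in-Context display looks like, here is a minimal sketch over a plain string. It is not KWiCFinder's implementation; the function `kwic`, its `width` parameter and the sample sentence are assumptions made for this example.

```python
def kwic(text: str, keyword: str, width: int = 3) -> list[str]:
    """Return one line per occurrence of keyword with `width` words of context."""
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        # Compare case-insensitively and ignore attached punctuation.
        if token.strip(".,;:!?\"'").lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so that the keywords line up.
            lines.append(f"{left:>30}  {token}  {right}")
    return lines


# Invented sample text (assumption, not web data).
sample = "My Handy is new. The old Handy was slow."

for line in kwic(sample, "handy", width=2):
    print(line)
```

Aligning the keyword in a fixed column makes recurring left and right neighbours easy to spot when scanning many hits.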
22 Future: A Linguistic Search Engine
- Linguistic analysis in indexing
- Lemmatising and compound segmentation
- Search Haus, Hauses, Häuser and Häusern
- Search bok, boks, böcker, böckerna
- Part-of-Speech Tagging
- Search German 'Junge' as noun
- Search English 'can' as noun
- Phrase recognition
- Search 'med barn' in the same noun phrase
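Until lemmatising happens at indexing time, the effect can be approximated at query time by expanding a lemma into an OR query over its word forms. The sketch below uses a toy lexicon built from the forms on the slide; a real system would obtain them from a morphological analyser, and `expand_lemma` is a hypothetical helper.

```python
# Toy lexicon with the word forms from the slide (assumption: a real
# system would generate these with a morphological analyser).
FORMS = {
    "Haus": ["Haus", "Hauses", "Häuser", "Häusern"],
    "bok": ["bok", "boks", "böcker", "böckerna"],
}


def expand_lemma(lemma: str, forms: dict = FORMS) -> str:
    """Build an OR query over all known forms; fall back to the lemma itself."""
    return " OR ".join(forms.get(lemma, [lemma]))


print(expand_lemma("Haus"))
print(expand_lemma("bok"))
```

Query-time expansion only covers the forms listed in the lexicon, which is one reason why analysis during indexing is the more robust design.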
23 Summary
- The Web as Corpus
- provides many new opportunities.
- But a more linguistically oriented access is necessary.