Title: How linguistics can help you
1How linguistics can help you
- Lori Levin
- LTI Immigration Course 2005
2Outline
- What is linguistics?
- The influence of linguistics on language technologies
- The long rule of rationalism
- The triumph of empiricism
- What was forgotten
- The use of linguistics in Language Technologies
- Outside of LTI
- Inside LTI
- Linguistics courses
3Linguistics
- Linguistics is a
- Cognitive Science
- Social Science
- Area of the Humanities
- Neuroscience
- Computer science
- Primarily about the human mind and human
communication behavior.
4Linguistics as a Cognitive Science
- Knowledge of language is not conscious knowledge.
- Like knowing how to walk without knowing which neurons and muscles are involved.
- Sub-areas of linguistic knowledge
- Grammar of sentences (syntax), grammar of words (morphology), sentence meaning (semantics), word meaning (lexical semantics), language use in context (pragmatics and discourse analysis).
5Linguistics as a Cognitive Science
- Do human languages differ from each other in random ways, or are there common, universal properties?
- How are human languages different from mathematical languages, logical languages, programming languages, and animal communication systems?
6Linguistics as a Cognitive Science
- First language acquisition: How do human babies learn something so complex so quickly with such imperfect input?
- Second language acquisition: How do adults learn a second language, and why are they so bad at something that babies are so good at?
- How can you teach something that is not conscious knowledge for experts (native speakers)?
- Do adults learn languages better with immediate or delayed feedback on errors?
- Does explanation of foreign language grammar help adults learn the foreign language?
7Linguistics as a Cognitive Science
- Psycholinguistics: How is human language processed in the brain and how is human language produced?
- Why do you have to do a double take to understand this sentence (garden path sentence)?
- The cotton shirts are made of is soft.
- Neuro-linguistics: What areas of the brain are activated during language processing? How do brain injuries affect language production and comprehension?
8Linguistics as a Social Science
- Historical Linguistics: How do human languages change over time?
- Drift
- Corn used to mean all small grains, e.g., pepper corn, barley corn.
- What happened to the word britches?
- English f is systematically related to French p. What was the common sound that they both derived from in some ancient language?
- Foot/pied
- Father/père
- Contact
- Languages in proximity to each other will influence each other's vocabulary and grammar, even if the languages were previously unrelated.
9Language as a Social Science
- Sociolinguistics
- How do human languages vary with social factors such as
- Geography
- Age
- Ethnic group
- Sex
- Race
- Economic class
- Social setting
- In situations of language contact, what are the
factors that determine whether there will be
bilingualism or language loss?
10Documentary Linguistics
- Computer-based tools for describing languages
- Annotating corpora
- Databases for annotated corpora
- Managing lexicons
- Has become urgent because of the increased rate of language death.
11Computational Linguistics
- Formalisms for describing human languages
- Based on formal language theory
- Enable precise, testable formulations of linguistic rules
- Using linguistic rules for language processing.
12Language Technologies
- Computer-based tools for processing human languages
- Speech recognition
- Speech synthesis
- Machine translation
- Human-machine dialogue systems
- Information Retrieval, Extraction, and Summarization
- Computer-assisted language learning
13History
- Before Language Technologies there was Computational Linguistics
- Cognitive Science
- Artificial Intelligence: how can a computer understand a story as well as a human does
- Psycholinguistics: how is language processed by the human brain
- Formal Linguistic Theory: use formal language theory to model human language
- All of these topics would be covered at Computational Linguistics conferences
- Models of human linguistic knowledge and human language processing were thought to be relevant to computer-based processing of language
- All computational linguists knew a lot of linguistics
14History
- Computers got faster
- Toy systems and papers on theories of language gave way to implementations that worked on a real scale.
- Computational linguistics became more of a computer science than a cognitive science.
15History: Two Philosophical Approaches
- Rationalism
- The source of knowledge is reason.
- Knowledge comes from the mind.
- Empiricism
- The source of knowledge is experience.
- Knowledge comes from data.
16History: Two Philosophical Approaches
- Rationalism
- The source of knowledge is reason.
- Goal: To discover the mental representation of linguistic rules
- The primary data for studying language is the grammaticality judgment
- Your head contains a formal grammar that accepts strings of words that are in your language and rejects strings of words that are not in your language.
17History: Rationalism
- Grammaticality judgments
- The car needs to be washed. YES
- The car needs washed. NO
- Car the washed needs. NO
- This music gives a headache to me. NO
- This music gives me a headache. YES
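The idea of a formal grammar that accepts some word strings and rejects others can be sketched with a toy recognizer. The grammar, category names, and lexicon below are assumptions made up for this illustration (they cover only the three car sentences) and are not a claim about any speaker's actual mental grammar. The recognizer uses the standard CYK algorithm over binary rules.

```python
from itertools import product

# Hypothetical toy grammar in Chomsky normal form: (left, right) -> parent.
BINARY_RULES = {
    ("NP", "VP"): "S",
    ("Det", "N"): "NP",
    ("V", "InfP"): "VP",
    ("To", "BeP"): "InfP",
    ("Be", "Part"): "BeP",
}

# Hypothetical lexicon: each word has exactly one category.
LEXICON = {
    "the": "Det", "car": "N", "needs": "V",
    "to": "To", "be": "Be", "washed": "Part",
}


def is_grammatical(sentence):
    """Return True if the toy grammar derives the sentence as an S (CYK)."""
    words = sentence.lower().rstrip(".").split()
    n = len(words)
    if n == 0:
        return False
    # chart[i][j] holds the categories that span words[i:j].
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        if word in LEXICON:
            chart[i][i + 1].add(LEXICON[word])
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                pairs = product(chart[start][mid], chart[mid][end])
                for left, right in pairs:
                    parent = BINARY_RULES.get((left, right))
                    if parent is not None:
                        chart[start][end].add(parent)
    return "S" in chart[0][n]


if __name__ == "__main__":
    for s in ["The car needs to be washed.",
              "The car needs washed.",
              "Car the washed needs."]:
        print(s, "YES" if is_grammatical(s) else "NO")
```

With this toy grammar the three car sentences come out YES, NO, NO, matching the judgments listed above. A grammar with a rule combining V directly with Part would accept the second sentence, which is one way to see that the judgments depend on whose grammar is being modeled.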
18History: Rationalism
- Mental models of language are real
- Linguistics should strive to model human grammaticality judgment
- Corpora are artifacts
- Pieces of the mental model get mixed up with speech and writing errors, constraints of time and space, etc.
19History
- Empiricism
- The source of knowledge is experience
- Goal: to discover how meaning is negotiated in context
- The primary data is what is attested in a corpus
20History: Empiricism
- Sentences are found in corpora with different probabilities
- This music would give a headache to anyone with refined sensibilities.
- This music gives a headache to me.
- This music gives anyone with refined sensibilities a headache.
- This music gives me a headache.
- The car needs washed.
21History: Rationalism and Empiricism
- From 1957 until very close to the present time, linguistics was vehemently rationalist.
- Actually, some empiricists survived through the 20th century, but they typically didn't formalize their theories, so they didn't have a way to influence computational linguistics.
- Rationalists and empiricists were intellectual enemies.
- Computational Linguistics was mostly rationalist until the mid-1980s.
22History: Late 1980s and Early 1990s
- Rationalist vs Empiricist debates
- Empiricism triumphs in speech recognition
23History: The Rise of Empiricism, 1990s
- Speech recognition
- Statistical Machine Translation
- Information Retrieval
24
- I amar prestar aen. The world is changed. Han mathon ne nen. I feel it in the waters. Han mathon ne chae. I feel it in the earth. A han noston ned 'wilith. I smell it in the air. Much that once was is lost. For none now live who remember it.
- ...And some things that should not have been forgotten were lost.
- Lord of the Rings, Movie Script
25What was forgotten?
- Parsers
- Treebanks
- Lexicons
- Morphological analyzers
26
(TOP (S
  (NP (DT The) (NNP National) (NNP Park) (NNP Service))
  (VP (VBZ hopes) (PP (IN by) (NP (CD 1966)))
    (S (NP (-NONE- )) (AUX (TO to))
      (VP (VB have)
        (NP (CD 30,000) (NNS campsites))
        (ADJP (JJ available)
          (PP (IN for)
            (NP (NP (NP (CD 100,000) (NNS campers))
                    (NP (DT a) (NN day))).
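One way to start understanding a bracketed Treebank entry like the one on this slide is to read it into a nested structure and list its words. The sketch below is a minimal reader written for illustration; the function names `parse_bracketed` and `leaves` are hypothetical, and it assumes one tree per string with whitespace-separated labels and words. It is not the tooling distributed with any particular treebank.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def next_token():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets: input ended early")
        token = tokens[pos]
        pos += 1
        return token

    def read_node():
        # Called just after "(" has been consumed.
        label = next_token()
        children = []
        while True:
            token = next_token()
            if token == ")":
                return (label, children)
            if token == "(":
                children.append(read_node())
            else:
                children.append(token)  # a word (leaf)

    if next_token() != "(":
        raise ValueError("a tree must start with '('")
    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


if __name__ == "__main__":
    subject = "(NP (DT The) (NNP National) (NNP Park) (NNP Service))"
    print(leaves(parse_bracketed(subject)))
    # -> ['The', 'National', 'Park', 'Service']
```

The reader raises `ValueError` on unbalanced input, which is a cheap first check when asking whether a problem lies in the training code or in the annotated data.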
27Things that were lost that should not have been
forgotten
- Can you understand the Treebank?
- Can you evaluate the Treebank?
- Can you build a Treebank?
- If you train a statistical parser on the Treebank and it doesn't work well, can you attribute blame to your training or to mistakes in the Treebank?
28Language Technologies Today
- Influenced by both rationalism and empiricism.
- Rule based systems
- Rationalist linguistics
- Statistical systems
- Corpus linguistics
- Statistics and machine learning
29Jamie Callan
30Full-Text Indexing: German Decompounding
- Compounds that behave like English phrases
- computerviren (computer viruses) vs. computer and viren
- sonnenenergie (solar energy) vs. sonnen and energie
- Compounds that probably don't behave like English phrases
- gemüseexporteure (vegetable exporters)
- fussballeuropameisterschaft (European football cup)
- vs. fuss, ball, europa, meisterschaft
- vs. fussball, europa, meisterschaft
- vs. europameisterschaft im fußball
- Slightly irregular compounds
- schönheitskönigin (beauty queen) vs. schönheit and königin
- Note introduction of s between compounds
- erdbeben (earthquake) vs. erde and beben
- Note that ending e in erde is elided
(Chen, 2002)
Jamie Callan
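The decompounding cases above can be sketched with a simple dictionary-based splitter. This is an illustrative assumption, not the method of (Chen, 2002): the function name `decompound`, the tiny lexicon, and the two hard-coded adjustments (a linking s between parts, and a final e dropped from the first part) are all made up for this example.

```python
def decompound(word, lexicon):
    """Return every way to split `word` into entries of `lexicon`."""
    word = word.lower()
    results = []

    def split(rest, parts):
        if not rest:
            results.append(parts)
            return
        for end in range(1, len(rest) + 1):
            piece, remainder = rest[:end], rest[end:]
            if piece in lexicon:
                split(remainder, parts + [piece])
                # Linking s: schönheit + s + königin.
                if remainder.startswith("s") and len(remainder) > 1:
                    split(remainder[1:], parts + [piece])
            # Elided final e: erd(e) + beben.
            if remainder and piece + "e" in lexicon:
                split(remainder, parts + [piece + "e"])

    split(word, [])
    return results


if __name__ == "__main__":
    # Hypothetical mini-lexicon covering the examples on the slide.
    lexicon = {
        "computer", "viren", "sonnen", "energie",
        "schönheit", "königin", "erde", "beben",
        "fuss", "ball", "fussball", "europa",
        "meisterschaft", "europameisterschaft",
    }
    print(decompound("erdbeben", lexicon))
    # -> [['erde', 'beben']]
    for analysis in decompound("fussballeuropameisterschaft", lexicon):
        print(analysis)
```

For fussballeuropameisterschaft the splitter returns several analyses (fuss + ball + ..., fussball + ..., with europameisterschaft either kept whole or split). Choosing among them for indexing is the linguistic question the slide raises, and the sketch does not attempt to answer it.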
31Example of using linguistics with speech
recognition
- Integrate parsing and speech recognition so that
only parsable hypotheses are considered.
32Answer Verification
- Parse passages to create a dependency tree among words
- Attempt to unify logical forms of question and answer text
(M. Pasca and S. Harabagiu, SIGIR 2001)
Jamie Callan
33For More Information
- S. Harabagiu, D. Moldovan, C. Clark, M. Bowden, J. Williams, and J. Bensley. Answer mining by combining extraction techniques with abductive reasoning. In Proceedings of the Twelfth Text Retrieval Conference (TREC 2003). 2004.
- M. Pasca and S. Harabagiu. High performance question answering. In SIGIR 2001 conference proceedings.
34Recent Work on Combining Linguistic Structure
with Statistical MT
- David Chiang, Best Paper Award, Association for Computational Linguistics, 2005
- Johns Hopkins Workshop, 2005, Statistical Machine Translation by Parsing, led by Dan Melamed.
35Motivation for SMT by Parsing
- State-of-the-art SMT often produces word salad.
- Bolting trees onto FST-based (IBM-style) SMT doesn't seem to help.
- SMT is very compute-intensive (slow).
- SMT systems are getting very complicated, making them hard to study and improve.
Dan Melamed
36The Engineering Motivation for Syntax
- Need fewer parameters to express ordering preferences.
- E.g. Arabic adjectives always follow their nouns.
- Fewer parameters are easier to learn, given limited training data and/or computing resources.
- Less training data needed to reach a given level of accuracy.
- Better accuracy on fixed amount of data.
- All parameters interact during learning, so better estimates for syntactic parameters lead to better estimates for other types.
Dan Melamed
37But isn't syntax too expensive?
- Myth: Translation models involving syntax are computationally too expensive to train.
- Fact: Finite-state models are more expensive! (more parameters)
- Of course, bolting syntax on top of a finite state model incurs the combined cost of both. (So we avoided that.)
- In machine learning with structured inference (most of NLP), better models should train faster.
Dan Melamed
38LTI projects that use linguistic knowledge
- Javelin: Question answering
- Nyberg, Mitamura
- Radar: Information extraction from email
- Nyberg, Frederking, Levin
- Let's Go: Human-machine bus information system
- Eskenazi, Black
- Writing Tutor: Detect errors made by English learners
- Mitamura
39LTI projects that use linguistic knowledge
- AVENUE: Automatically learning machine translation rules for minor-major language pairs
- Carbonell, Lavie, Levin, Frederking, Brown
- SAMPLE: Reading assistant for English as a second language.
40LTI faculty with training in linguistics
- Lori Levin, Linguistics
- Teruko Mitamura, Linguistics
- Eric Nyberg, Computational Linguistics
- Alan Black, Cognitive Science
41Linguistics Courses at LTI
- Grammars and Lexicons, 11-721
- Levin and Mitamura
- Fall 2005
- Goals: Skills
- Write grammar rules that can be used for parsing
- Work on multilingual applications
- Understand syntactic annotation of data
- Goals: Knowledge
- Linguistic categories
- Noun, verb, noun phrase, verb phrase
- Linguistic Structures
- Main clauses, embedded clauses, relative clauses
- Linguistic Variation
- How to recognize the categories and structures even though they look different
- Grammar writing
- Writing grammar rules for a parser (English and Japanese)
42Linguistics Courses at LTI
- Grammar Formalisms, 11-722
- Levin, Lavie, Black
- Spring 2006
- Goal
- How to implement basic linguistic structures and semantics in several formalisms
- Lexical Functional Grammar, Head Driven Phrase Structure Grammar, Categorial Grammar, Tree Adjoining Grammar
- How parsers can be implemented for these formalisms
43Linguistics Courses at LTI
- Formal Semantics, 11-723
- Mandy Simons (Philosophy)
- Next offered in 2006-2007?
- Apply formal logic to the modeling of natural language meaning.
- Another thing that should not have been lost.
44Useful Linguistics Courses at the University of
Pittsburgh
- Phonetics and phonemics
- The inventory of human speech sounds
- Phonology
- Patterns of sounds and syllable structure
- Morphology
- Prefixes, suffixes, and other processes for making words out of smaller pieces
- Morphosyntax
- Morphology that affects syntax, e.g., passive and causative affixes
- Syntactic Theory
- The course taught at Pitt is not the most relevant kind of syntactic theory for LT, but it will give you insight into what languages have in common.