1
  • Linguistic Research and the CLARIN Infrastructure
  • Jan Odijk
  • Digital Humanities Lecture, Utrecht, 23 Oct 2012

2
Overview
  • Introduction
  • Basic Facts and Research Questions
  • Do the Research
  • Consult Grammars
  • Select relevant data from multiple sources
  • Apply tools to enrich data
  • Analyze the data
  • Conclusions

3
Introduction
  • Suppose you were a linguistic researcher in 1980
    (no internet, no computers, ...)
  • and suppose libraries did not exist.
  • I am a linguistic researcher in 2012
  • But no infrastructure for data and tools exists!
  • though there are many data and tools
  • CLARIN has as its main goal to remedy this

4
Basic Facts
  • Heel, erg, and zeer are synonyms (very)
  • Zeer, erg can modify verbs, adjectival
    predicates and prepositional predicates
  • Heel can only modify adjectival predicates
  • A: Hij is daar zeer/erg/heel blij mee
  • P: Hij is daar zeer/erg/*heel mee in zijn nopjes
  • V: Dat verbaast ons zeer/erg/*heel.

5
Basic Facts
  • English very is like heel in these respects
  • P: *He is very in love
  • A: He is very amorous
  • V: *It surprised us very (much)

6
Basic Facts
  • The difference is
  • not due to semantics
  • purely syntactic
  • As far as we know, it does not follow from a
    general rule
  • So it must be learned by a child acquiring Dutch
    as a first language

7
Research Question (1)
  • How does a child acquiring Dutch as a first
    language get to know that zeer and erg can
    modify verbs, prepositional and adjectival
    predicates?

8
Hypotheses (1)
  • Hypothesis 1a
  • Once a word is encountered for the first time, a
    critical phase (training phase) starts in which
    the word properties are determined based on the
    input; after this phase the word properties are
    fixed.
  • A sufficient number of actual examples occurring
    in this period sets the word properties (positive
    evidence)

9
Hypotheses (1)
  • Hypothesis 2a
  • Once a word is encountered for the first time,
    its grammatical properties are initially set by
    Semantic Bootstrapping: semcat -> syncat
  • A sufficient number of actual examples occurring
    in this period will add to the word properties
    (positive evidence)
  • A sufficient amount of input that contradicts the
    semantically bootstrapped properties overrules
    them

10
Research Question (2)
  • How can a child acquiring Dutch as a first
    language get to know that heel cannot modify
    prepositional predicates and verbs?
  • Children are never taught that it is not
    possible
  • They are also never or seldom corrected for
    language errors, and if they are, they seem to
    ignore it (Negative evidence plays no role)

11
Hypotheses (2)
  • Hypothesis 1b
  • Absence of relevant constructions in the training
    phase of a word leads to absence of the property
    (indirect negative evidence)
  • Hypothesis 2b
  • Absence of relevant constructions in the training
    phase of a word does not lead to absence of the
    property for semantically bootstrapped properties

12
Related Questions
  • Do children ever make errors of this kind?
  • Is a training phase for word properties real?
  • How long is this training phase?
  • What is a sufficiently large number of actual
    examples?
  • Does semantic bootstrapping play a role, and if
    so, which role?
  • Are these words acquired in different language
    acquisition stages?

13
Related Questions
  • Can this be related to the different modification
    potential?
  • Is there a relation with the fact that zeer
    appears to be rather formal, while heel and erg
    are not?

14
Related Questions
  • adverb-adjective agreement (substandard)
  • heel/hele dikke boeken 'very thick books'
  • erg/erge dikke boeken
  • zeer/zere dikke boeken
  • Is this somehow related?
  • What about other, closely related, words?

15
Consult Grammars
  • Currently
  • Consult paper and electronic grammars
  • ANS and e-ANS, e.g. section 15-3-1-1
  • In the near Future
  • Consult Taalportaal with (I hope/expect)
  • All examples formally marked as such
  • All examples parsed/tagged using ISOCAT DCs, and
    searchable
  • Links to (possibly complex) queries to illustrate
    the text with real data from treebanks and other
    annotated data

16
Find Data
  • Which data and tools (LRs) exist that might
    contribute to answering these questions?
  • Currently
  • you have to search for them in multiple places
  • Many relevant data are not publicly visible (you
    will find them only through personal contacts)
  • Or you have to create them yourself

17
Find Data
  • There is no place/site where you can query
  • Give me a list of all LRs for the Dutch language
  • What is the size of all Dutch text corpora (in
    tokens)
  • Give me a list of all Dutch data that contain
    children 2-7 years old as speaker
  • Give me a list of all Dutch data containing any
    of the words heel, zeer, erg
  • Not even in most individual data centres
    (TST-Centrale, ELRA, LDC, ...)

18
Find Data
  • CLARIN
  • Provides a flexible framework incl. tools for
    making descriptions of LRs (metadata)
  • CMDI
  • Supports (assistance, execution, funding) the
    creation of metadata for LRs
  • Supports making these metadata (and the actual
    data) visible and accessible via CLARIN portals

19
Find Data
  • CLARIN
  • Provides facilities for semantic interoperability
  • ISOCAT, Relation Registry (coming soon)
  • browsing, searching and querying facilities for
    the metadata
  • Initial prototype: the Virtual Language Observatory
  • Will enable you to collect the data that are
    relevant to you in a virtual collection
  • This will save the researcher a lot of time
  • It will enlarge the empirical basis for the
    research

20
Closely Related Words
  • Find words that are closely related
  • Adverbs that function as an intensifier
    (booster)
  • Are (near-)synonymous, hyponyms, or co-hyponyms
  • Also (near-)antonyms are relevant
  • In order to determine their properties and
    potential further generalizations

21
Closely Related Words
  • Using e.g.
  • Synonym information in traditional dictionaries
  • Dutch EuroWordnet (currently via ELRA M0016)
  • Or Cornetto (via the Dutch HLT-Agency)
  • Currently searchable only via
  • a plug-in for an old version (3.5) of Firefox, or
  • in programs via a Python module (a sketch of such
    a lookup follows below)
  • A CLARIN-NL project to improve this
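
Cornetto itself is only reachable here via the Firefox plug-in or its own
Python module, so purely as an illustration of the kind of synonym lookup
involved, here is a minimal sketch that uses the open Dutch data in NLTK's
Open Multilingual Wordnet as a stand-in for Cornetto (it assumes the
'wordnet' and 'omw-1.4' data have been downloaded and include Dutch, 'nld'):

```python
# Minimal sketch: collect Dutch (near-)synonyms for an intensifier,
# using NLTK's Open Multilingual Wordnet as a stand-in for Cornetto.
# Prerequisite: nltk.download('wordnet'); nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn

def dutch_synonyms(word):
    """Return Dutch lemmas that share a synset with `word`."""
    lemmas = set()
    for synset in wn.synsets(word, lang='nld'):
        lemmas.update(synset.lemma_names('nld'))
    lemmas.discard(word)
    return sorted(lemmas)

for w in ('zeer', 'heel', 'erg'):
    print(w, '->', dutch_synonyms(w))
```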

22
Closely Related Words
  • Found via synonym dictionaries
  • abnormaal afschuwelijk akelig bijster bijzonder
    bovenmatig buitengemeen buitensporig danig
    donders eminent enorm exceptioneel extra
    extraordinair extreem fabelachtig fenomenaal
    geweldig gigantisch intens kolossaal merkwaardig
    mirakels onbeschrijfelijk ongelofelijk ongehoord
    ongekend ongemeen onmenselijk onmetelijk
    ontzettend onwijs speciaal uitermate uiterst
    uitzonderlijk verdraaid verduiveld verrekte
    verschrikkelijk vet zeldzaam ..

23
Closely Related Words
  • zeer:adverb:3 / heel:adverb:5 (from Cornetto)
  • zeer:3/d_r-343077, allemachtig:2/d_r-9922,
    beestachtig:2/d_r-23835, bijzonder:4/c_546765,
    bliksems:2/d_r-32612, bloedig:2/d_r-32881,
    bovenmate:1/d_r-36728, buitengewoon:2/d_r-39235,
    buitenmate:1/d_r-39294, buitensporig:2/d_r-401837,
    crimineel:4/d_a-53026, deerlijk:2/d_r-57321,
    deksels:2/d_r-57728, donders:2/d_r-62605,
    drommels:2/d_r-65820, eindeloos:3/c_546740,
    enorm:2/d_r-74285, erbarmelijk:2/d_r-74877,
    fantastisch:6/d_r-79264, formidabel:2/d_r-82704,
    geweldig:4/d_r-92392, goddeloos:2/d_r-94633,
    godsjammerlijk:2/d_r-94798, grenzeloos:2/d_r-96846,
    grotelijks:1/d_r-98244, heel:5/d_r-106880,
    ijselijk:2/d_r-118854, ijzig:4/c_546756,
    intens:2/d_r-123517, krankzinnig:3/d_r-142403,
    machtig:4/d_r-165866, mirakels:1/d_r-173095,
    monsterachtig:2/d_r-175264, moorddadig:4/d_r-175475,
    oneindig:2/d_r-193740, onnoemelijk:2/d_r-194761,
    ontiegelijk:2/d_r-415154, ontstellend:2/d_r-415165,
    ontzaglijk:2/d_r-415176, ontzettend:3/d_r-196906,
    onuitsprekelijk:2/d_r-415180,
    onvoorstelbaar:2/d_r-415191, onwezenlijk:2/d_r-197464,
    onwijs:4/d_r-197468, overweldigend:2/d_r-205004,
    peilloos:2/d_r-213144, reusachtig:3/d_r-239357,
    reuze:2/d_r-239379, schrikkelijk:2/d_r-256144,
    sterk:7/d_r-272639, uiterst:4/d_r-300933,
    verdomd:2/d_r-308293, verdraaid:4/c_546761,
    verduiveld:2/d_r-308522, verduveld:2/d_r-308569,
    verrekt:3/d_r-418644, verrot:3/d_r-418648,
    verschrikkelijk:3/d_r-312634, vervloekt:2/d_r-314372,
    vreselijk:5/d_r-323099, waanzinnig:2/d_r-329061,
    zeldzaam:2/d_r-419882, zwaar:10/d_r-347153

24
Basic Facts Correct?
  • Check the basic facts
  • Check against occurrences in corpora
  • Problem: each of the 3 words is ambiguous!
  • Erg (4x): noun (de), noun (het) 'evil',
    adj/adv 'unpleasant', adv 'very'
  • Zeer (3x): noun 'pain', adj 'painful', adv 'very'
  • Heel (3x): adj 'whole', verb form 'heal',
    adv 'very'
  • A PoS-tagged corpus will help somewhat
  • But most corpora do not distinguish adj from adv
    by category! (searching for PoS bigrams will help
    slightly; see the sketch below)
  • A fully-parsed corpus would be ideal
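
A toy sketch of the PoS-bigram idea mentioned above; the tag names and the
tiny tagged sample are invented for illustration, since each corpus uses its
own tagset:

```python
# Toy sketch: filter a PoS-tagged corpus (a list of (word, tag) pairs)
# for bigrams in which erg/zeer/heel is immediately followed by an
# adjective, which removes most noun/adjective readings of these words.
# The tags below are illustrative, not a real corpus tagset.
tagged = [('dat', 'VNW'), ('is', 'WW'), ('erg', 'BW'),
          ('duur', 'ADJ'), ('.', 'LET')]

targets = {'erg', 'zeer', 'heel'}
hits = [(w1, w2)
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
        if w1.lower() in targets and t2.startswith('ADJ')]
print(hits)  # [('erg', 'duur')]
```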

25
Basic Facts Correct?
  • LASSY Small: a 1M-word manually verified parsed
    corpus
  • Interface to LASSY Small
  • Requires knowledge of XPATH/XQUERY
  • Very Simple Interface to LASSY Small
  • limited options but simple commands
  • Example-based interface: GrETEL (CLARIN Flanders)
  • Greedy Extraction of Trees for Empirical
    Linguistics
  • Generates an XPATH/XQUERY expression on the basis
    of an example sentence plus a marking of what is
    relevant in it

26
Basic Facts Correct?
  • Queries: erg as mod, zeer as mod, heel as mod
  • Extract from the statistics
  • Query: heel as mod of a WW (verb)

        erg   zeer   heel
  ADJ   143    268    263
  WW     35     49      9
  BW      1      1      7
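
The counts above come from the LASSY Small interfaces; purely as an
illustration, a table like this could also be collected directly from the
Alpino XML files with a script along the following lines (the directory
layout and attribute details are assumptions about the LASSY release, not a
recipe):

```python
# Rough sketch: count erg/zeer/heel as modifier (rel="mod") by the PoS
# (pt) of the head they modify, over a LASSY/Alpino treebank stored as
# one XML file per sentence. Paths and attribute names are assumptions.
import glob
from collections import Counter
from lxml import etree

counts = Counter()
for path in glob.glob('LassySmall/Treebank/**/*.xml', recursive=True):
    tree = etree.parse(path)
    for lemma in ('erg', 'zeer', 'heel'):
        for mod in tree.xpath(f'//node[@rel="mod" and @lemma="{lemma}"]'):
            # the head of the same phrase is a sibling node with rel="hd"
            head_pos = mod.xpath('../node[@rel="hd"]/@pt')
            if head_pos:
                counts[(lemma, head_pos[0])] += 1

for (lemma, pos), n in sorted(counts.items()):
    print(lemma, pos, n)
```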
27
Basic Facts Correct?
  • Analysis
  • 8 examples are forms that are ambiguous between
    an adjectival and a verbal participle
  • All are examples of adjectival participles, but
    LASSY represents all participles as verbal
  • In 1 example heel modifies the adj open from the
    expression open staan voor, but it is wrongly
    analyzed as modifying the verb staan
  • CLARIN will offer facilities to make annotations
    to such corpora
  • The same queries could be run
  • for the other related words
  • on LASSY Large Corpus (2.4 billion words,
    automatically parsed)
  • In the CGN corpus (but it uses a different
    interface)
  • But this will require facilities for batch
    jobs or more complicated queries (maybe via web
    services)

28
Acquisition Corpora Search
  • E.g. data in the CHILDES system (part of
    TalkBank)
  • 7 corpora for Dutch
  • But with their own data formats (CHAT) and tools
    (CLAN)
  • However, also mirrored at the MPI and accessible
    via (ANNEX/)TROVA (yet another interface)

29
Acquisition Corpora Search
  • Give records for utterances containing erg, with
  • Corpus (e.g. Van Kampen Corpus)
  • File (e.g. laura74.cha)
  • Line (e.g. 139)
  • Participant Role (e.g. Child)
  • Child Gender (e.g. female)
  • Age (e.g. 5;6.12)
  • UTT (e.g. ja , die s erg moeilijk .)
  • Maybe also some preceding/following context
  • Map attribute names and values to ISOCAT
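
A record extraction along these lines can be sketched directly on the CHAT
files, without CLAN. The sketch below assumes the usual CHAT conventions
(@ID headers of the form language|corpus|code|age|sex|...|role|..., and
utterance lines starting with *CHI:, *MOT:, etc.) and ignores continuation
lines and dependent tiers for simplicity:

```python
# Minimal sketch: pull (file, line, speaker role, sex, age, utterance)
# records for a search term out of CHAT (.cha) files.
# Assumes standard @ID header fields; continuation lines (starting with
# a tab) and dependent tiers (%mor, %syn, ...) are ignored.
import glob
import re

def search_chat(pattern, files):
    regex = re.compile(pattern)
    for path in files:
        speakers = {}                      # speaker code -> (role, sex, age)
        with open(path, encoding='utf-8') as fh:
            for lineno, line in enumerate(fh, start=1):
                if line.startswith('@ID:'):
                    f = line.split(':', 1)[1].strip().split('|')
                    speakers[f[2]] = (f[7], f[4], f[3])
                elif line.startswith('*') and regex.search(line):
                    code = line[1:].split(':', 1)[0]
                    role, sex, age = speakers.get(code, ('?', '?', '?'))
                    yield {'file': path, 'line': lineno, 'speaker': role,
                           'sex': sex, 'age': age,
                           'utt': line.split(':', 1)[1].strip()}

for rec in search_chat(r'\berg\b', glob.glob('VanKampen/*.cha')):
    print(rec)
```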

30
Acquisition Corpora Search
  • Corpus: Van Kampen
  • File: sarah21.cha
  • Line: 630
  • Speaker: Child
  • Child Gender: Female
  • Age: 2;7.16
  • UTT: prinses e(r)g groot !

31
Acquisition Corpora Search
  • For each child, give a list of pairs (session,
    age of the child)
  • For each child and each session, give the
    occurrences of zeer, heel, erg
  • etc., etc.
  • Such queries (some example attempts are shown on
    later slides)
  • Mixed metadata/content search
  • Over multiple resources
  • Specific output formats
  • are not so easy with the current interfaces!!
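
Reusing the search_chat helper from the sketch a few slides back, a crude
version of such a mixed metadata/content query (occurrences of zeer/heel/erg
per session file and recorded age) could look like this:

```python
# Crude follow-up sketch: count zeer/heel/erg per session file and per
# recorded age, reusing search_chat() from the earlier sketch.
import glob
from collections import Counter

per_session = Counter()
for rec in search_chat(r'\b(zeer|heel|erg)\b', glob.glob('VanKampen/*.cha')):
    per_session[(rec['file'], rec['age'])] += 1

for (session, age), n in sorted(per_session.items()):
    print(session, age, n)
```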

32
Acquisition Corpora Search
  • Heel is found 153 times in the Van Kampen corpus
  • Erg is found 77 times in the Van Kampen corpus
  • But many are irrelevant uses of erg
  • PoS-tagging the corpus might be useful
  • Search for PoS bigrams (e.g. erg/adj followed by
    /adj)
  • Add lemmas
  • Or even full parsing, at least of the adult speech

33
Acquisition Corpora Parse
  • CLARIN-NL
  • Web services are being developed
  • For PoS-tagging text
  • For full parsing of text
  • (and many more)
  • To be usable by humanities researchers
    in a user-friendly way in workflow systems
  • Usefulness depends on
  • Size of the data (effort to select manually)
  • Quality of the web services
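
Purely as an illustration of what calling such a tagging or parsing web
service could look like from a script; the URL and the JSON shape below are
hypothetical, the real CLARIN(-NL) services have their own endpoints and
protocols:

```python
# Hypothetical sketch: send text to a PoS-tagging web service and read
# the result back as JSON. Endpoint and response format are invented.
import json
import urllib.request

TAGGER_URL = 'https://example.org/tagger'   # hypothetical endpoint

def tag(text):
    req = urllib.request.Request(
        TAGGER_URL,
        data=json.dumps({'text': text}).encode('utf-8'),
        headers={'Content-Type': 'application/json'},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)              # e.g. a list of (word, tag) pairs

print(tag('prinses erg groot !'))
```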

34
Store the found data
  • The found and newly created data
  • should be stored in a supported format
  • With automatically generated metadata
  • With automatically generated provenance data
  • Using data categories mapped to or from ISOCAT
  • For which PIDs are provided
  • Stored on a server of a CLARIN-centre
  • So that they
  • can become proper resources on their own
  • Are visible, accessible and interpretable as part
    of enriched publications
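
A minimal sketch of the idea of storing derived data together with
automatically generated metadata and provenance; the field names are
illustrative only, in CLARIN this would be a CMDI record with ISOCAT data
categories and a PID assigned by the hosting centre:

```python
# Illustrative sketch: write a JSON 'sidecar' next to a derived data file
# with simple provenance information. Field names are invented; a real
# CLARIN deposit would use CMDI metadata and a centre-issued PID.
import datetime
import hashlib
import json
import platform
import sys

def write_sidecar(data_path, query, source):
    with open(data_path, 'rb') as fh:
        checksum = hashlib.sha256(fh.read()).hexdigest()
    record = {
        'resource': data_path,
        'derivedFrom': source,                 # e.g. 'LASSY Small'
        'query': query,                        # the query that produced it
        'created': datetime.datetime.now().isoformat(),
        'tool': f'python {sys.version.split()[0]} on {platform.platform()}',
        'sha256': checksum,
        'pid': None,                           # to be assigned by the centre
    }
    with open(data_path + '.meta.json', 'w', encoding='utf-8') as out:
        json.dump(record, out, indent=2)

write_sidecar('heel_mod_ww_hits.txt',
              '//node[@rel="mod" and @lemma="heel"]', 'LASSY Small')
```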

35
Search in CGN / SONAR
  • To assess the level of formality
  • Give absolute and relative frequencies of
    heel/hele/erg/erge/zeer as adj by text genre and
    by speaker/participant education level
  • In CGN (spoken corpus)
  • In SONAR (written corpus)
  • Idem, but for the word plus the following PoS tag
  • Idem, but in the fully parsed part of CGN and in
    LASSY, by the PoS tag of the modifiee head

36
Interpret the data
  • Interpret the data in light of the hypotheses
    being investigated
  • Apply analytical / statistical tools to the data
  • CLARIN should support the formats of frequently
    used statistical packages such as SPSS, R, etc.
    (a small export sketch follows below)
  • The research will surely lead to new questions,
    and thus to new queries
  • Reach conclusions and publish an open access
    enriched publication
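
As a small export sketch: the counts from the LASSY Small table earlier in
this presentation, written to a plain CSV file, which both R and SPSS can
read directly:

```python
# Small sketch: write the erg/zeer/heel counts (from the LASSY Small
# table shown earlier) to a CSV file for further analysis in R or SPSS.
import csv

rows = [
    ('erg',  'ADJ', 143), ('erg',  'WW', 35), ('erg',  'BW', 1),
    ('zeer', 'ADJ', 268), ('zeer', 'WW', 49), ('zeer', 'BW', 1),
    ('heel', 'ADJ', 263), ('heel', 'WW', 9),  ('heel', 'BW', 7),
]

with open('intensifier_counts.csv', 'w', newline='', encoding='utf-8') as fh:
    writer = csv.writer(fh)
    writer.writerow(['word', 'head_pos', 'count'])
    writer.writerows(rows)

# In R:  counts <- read.csv("intensifier_counts.csv")
```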

37
Broaden the scope
  • Do the same for worden/raken (become/get)
  • NP, PP and AP can be predicate complements
  • Worden and raken take predicate complements
  • They are (almost) synonymous
  • worden takes only NP or AP
  • raken takes only AP or PP

38
Broaden the scope
  • AP: Zij werd / raakte zwanger
  • PP: Zij *werd / raakte in verwachting
  • NP: Zij werd / *raakte burgemeester
  • And
  • repeat the process
  • Exercise

39
Conclusions
  • There is no adequate infrastructure for
    linguistic research
  • There are bits and pieces, but
  • Finding LRs is not easy
  • LRs have their own formats, data categories, user
    / search interfaces
  • Limited formal and no semantic interoperability
  • Search in combined LRs is very difficult, if not
    impossible
  • => the full research potential is not exploited
  • CLARIN(-NL) attempts to remedy this

40
CLARIN-NL
  • Thanks for your attention!
  • http://www.clarin.nl/

41
No Entry!
42
Basic Facts Correct?
  • De omgang met de buren gebeurt op een heel
    ontspannen manier en de vrouw van de dominee
    heeft zelfs al Wolderse vlaai leren bakken .
    (parse)
  • heel (ADJ) mod WW ontspannen
  • De verschijnselen zijn heel verschillend .
    (parse)
  • heel (ADJ) mod WW verschillend
  • ,, Op het voorterrein ging het nog heel
    overtuigend . (parse)
  • heel (ADJ) mod WW overtuigend
  • Ze hebben heel gericht en planmatig volkscafés
    bezocht om daar hun gif te spuien . (parse)
  • heel (ADJ) mod WW gericht
  • Ze is zelfs met een ' meester ' getrouwd Marc
    Dassesse _ mevrouw Spiritus-Dassesse zet heel
    geëmanicipeerd haar meisjesnaam voorop _ is nu
    een gerenomeerd fiscaal adviseur en hoogleraar
    aan de ULB . (parse)
  • heel (ADJ) mod WW geëmanicipeerd
  • Gelukkig krijg ik nog heel geregeld te horen '
    Gerard jongen , dat doe je gewoon foùt ' .
    (parse)
  • heel (ADJ) mod WW geregeld
  • Dat is een heel verrassend resultaat en het stemt
    tot optimisme . (parse)
  • heel (ADJ) mod WW verrassend
  • De biermarkt is heel versnipperd en wordt
    overspoeld door nieuwe productlanceringen .
    (parse)
  • heel (ADJ) mod WW versnipperd
  • Toch staan we hier heel open voor voorstellen .
    (parse)
  • heel (ADJ) mod WW staan

43
Metadata search: CGN + CHILDES Dutch, 2 < age < 7
44
Regexp content search: heel|zeer|erg|erge|hele
45
Result set export to file
46
CGN regexp: heel|erg
47
CGN regexp on the WORDS tier + POS
48
Exercise
  • Worden takes APs, not PPs, as predc
  • Use the LASSY-Small Very Simple Interface
  • Give me all sentences in which the word worden
    takes a predicative (predc) PP complement
  • rel='predc' and hlemma='worden' and postag='vz'
  • Do you find examples with this query?
  • How do you interpret this?
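
For comparison, a hedged sketch of roughly the same query run directly on
the Alpino XML of LASSY Small with lxml. In raw Alpino terms,
hlemma='worden' corresponds to a sibling head node with lemma 'worden', and
a prepositional predc is either a bare preposition (pt='vz') or a PP
(cat='pp'); the paths and attribute details are assumptions, and the
exercise above asks you to interpret whatever (few) hits you get:

```python
# Rough equivalent of the Very Simple Interface query above, run over
# Alpino XML files with an XPath expression. Directory layout and
# attribute details are assumptions about the LASSY release.
import glob
from lxml import etree

QUERY = ('//node[@rel="predc" and (@pt="vz" or @cat="pp") '
         'and ../node[@rel="hd" and @lemma="worden"]]')

for path in glob.glob('LassySmall/Treebank/**/*.xml', recursive=True):
    tree = etree.parse(path)
    if tree.xpath(QUERY):
        # each Alpino file carries the sentence text in a <sentence> element
        print(path, tree.xpath('string(//sentence)'))
```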