1
Chinese Information Extraction Technologies
  • Hsin-Hsi Chen
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University
  • Taipei, Taiwan
  • E-mail: hhchen@csie.ntu.edu.tw

2
Outline
  • Introduction to Information Extraction (IE)
  • Chinese IE Technologies
  • Tagging Environment for Chinese IE
  • Applications
  • Summary

3
Introduction
4
Introduction
  • Information Extraction
  • the extraction or pulling out of pertinent
    information from large volumes of texts
  • Information Extraction System
  • an automated system to extract pertinent
    information from large volumes of text
  • Information Extraction Technologies
  • techniques used to automatically extract
    specified information from text

(http://www.itl.nist.gov/iaui/894.02/related_projects/muc/)
5
An Example in Air Vehicle Launch
  • Original Document
  • Named-Entity-Tagged Document
  • Equivalence Classes
  • Co-Reference Tagged Document

6
<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID>
<DATASET> Air Vehicle Launch </DATASET>
<DD> 1997/06/12 </DD>
<DOCTYPE> ???? </DOCTYPE>
<DOCSRC> ???? </DOCSRC>
<TEXT> ????????
??????????????????? ????????????
???????,??????? ????????????,?????????????? ??????
?,?????????? ????????? ??????????? ????
?????????????????? ??????? ,?????????????????????
????? ??????????,????????????????
(Highlighting legend: red = location name, blue = date expression, green = organization name, purple = person name)
7
???????????????,??????????? ???????,??????????
??????????????????????????? ??????????????????????
?????? ??????????????,????????????
??????????????????????????? ??????????,???????????
?????????????????,???????? ??????????????????
?????????? ????? ?????????????,????????????? ??
??????????????????????????, ??,???????????????????

8
?????????????????????????? ????
?????????????????????????? ????????????
???????????????????????,? ??????????????????????
???? ?????? ??????,???????????? ?????????
</TEXT> </DOC>
9
<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID>
<DATASET> Air Vehicle Launch </DATASET>
<DD> 1997/06/12 </DD>
<DOCTYPE> ???? </DOCTYPE>
<DOCSRC> ???? </DOCSRC>
<ISRELEVANT> NO </ISRELEVANT>
<TITLE> <ENAMEX TYPE="LOCATION">?</ENAMEX>??<ENAMEX TYPE="LOCATION">??</ENAMEX>1065????? </TITLE>
<TEXT>
?????<ENAMEX TYPE="LOCATION">??</ENAMEX>?<ENAMEX TYPE="LOCATION">???</ENAMEX><TIMEX TYPE="DATE">???</TIMEX>??????<ENAMEX TYPE="LOCATION">???</ENAMEX>????????????????
?<ENAMEX TYPE="LOCATION">??</ENAMEX>?<TIMEX TYPE="DATE">???</TIMEX>,<ENAMEX TYPE="LOCATION">??</ENAMEX>?<ENAMEX TYPE="LOCATION">??</ENAMEX><TIMEX TYPE="DATE">??</TIMEX>?<ENAMEX TYPE="LOCATION">??</ENAMEX>?????????,???????????<ENAMEX TYPE="LOCATION">??</ENAMEX>????????,??<ENAMEX TYPE="LOCATION">??</ENAMEX>?????? ?????<ENAMEX TYPE="LOCATION">??</ENAMEX>????????????? ????
10
<ID="3">??? <ID="4" REF="3">?? <ID="5" REF="3">???????????? ???????
<ID="63">??????? <ID="66" REF="63">?????????????
?????? ?????
<ID="65" REF="63">????????????? <ID="70" REF="65">?? <ID="69" REF="65">?? <ID="64" REF="63">?????????
11
<DOC> <DOCID> NTU-AIR_LAUNCH-????-19970612-002 </DOCID>
<DATASET> Air Vehicle Launch </DATASET>
<DD> 1997/06/12 </DD>
<DOCTYPE> ???? </DOCTYPE>
<DOCSRC> <COREF ID="1">????</COREF> </DOCSRC>
<ISRELEVANT> NO </ISRELEVANT>
<TITLE> <COREF ID="6">?</COREF>??<COREF ID="23">??</COREF><COREF ID="45" REF="44" TYPE="IDENT" MIN="????">1065?????</COREF> </TITLE>
<TEXT>
?<COREF ID="2" REF="1" TYPE="IDENT">??</COREF>??<COREF ID="61">??</COREF>?<COREF ID="8" STATUS="OPT" REF="6" TYPE="IDENT">???</COREF><COREF ID="3">???</COREF>??????<COREF ID="7" REF="6" TYPE="IDENT">???</COREF>????<COREF ID="5" STATUS="OPT" REF="3" TYPE="IDENT" MIN="???">???????????? ?<COREF ID="24" REF="23" TYPE="IDENT">??</COREF>????</COREF>,<COREF ID="77"><COREF ID="9" REF="6" TYPE="IDENT">??</COREF>?<COREF ID="29">??</COREF></COREF><COREF ID="4" REF="3" TYPE="IDENT">??</COREF>?<COREF ID="62" REF="61" TYPE="IDENT">??</COREF>??<COREF ID="63" MIN="??">???????</COREF>,<COREF ID="64" REF="63" TYPE="IDENT" MIN="??">?????????</COREF>??<COREF ID="81" STATUS="OPT" REF="75" TYPE="IDENT" MIN="??"><COREF ID="30" REF="29" TYPE="IDENT">??</COREF>???</COREF>?????,??<COREF ID="31" REF="29" TYPE="IDENT">??</COREF>??????
???????????????????? ????
12
IE Evaluation in MUC-7 (1998)
  • Named Entity Task (NE): Insert SGML tags into the text to mark each string that represents a person, organization, or location name, or a date or time stamp, or a currency or percentage figure.
  • Multi-lingual Entity Task (MET): the NE task for Chinese and Japanese.
  • Co-reference Task (CO): Capture information on co-referring expressions, i.e., all mentions of a given entity, including those tagged in the NE and TE tasks.

13
IE Evaluation in MUC-7 (cont.)
  • Template Element Task (TE): Extract basic information related to organization, person, and artifact entities, drawing evidence from anywhere in the text.
  • Template Relation Task (TR): Extract relational information on the employee_of, manufacture_of, and location_of relations.
  • Scenario Template Task (ST): Extract pre-specified event information and relate the event information to the particular organization, person, or artifact entities involved in the event.

14
Chinese IE Technologies
  • Segmentation
  • Named Entity Extraction
  • Part of Speech/Sense Tagging
  • Full/Partial Parsing
  • Co-Reference Resolution

15
Segmentation
16
Segmentation
  • Problem
  • A Chinese sentence is composed of characters
    without word boundary
  • ?????????
  • ? ? ?? ? ? ???
  • ? ? ??? ? ???
  • Word Definition
  • A character string with an independent meaning
    and a specific syntactic function

17
Segmentation
  • Standard
  • China???????????????
  • Implemented in 1988
  • National standard in 1992 (GB/T13715-92)
  • Taiwan???????????????
  • Proposed by ROCLING in 1996
  • National standard in 1999 (CNS14366)

18
Segmentation Strategies
  • Dictionary is an important resource
  • List all possible words
  • Find the most plausible path from a word
    lattice
  • ???????????
  • ??????????????

19
Segmentation Strategies (Continued)
  • Disambiguation: select the best combination (a longest-word-first sketch follows this list)
  • Rule-based
  • Longest-word first: ???? ? ?? ?
    ????????? ? ??? ? ???
  • Delete the discontinuous fragments
  • Other heuristic rules: 2-3 word preference, ...
  • Parser
  • Statistics-based
  • Markov models, relaxation methods, and so on

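The longest-word-first heuristic above can be written down in a few lines. Below is a minimal sketch, assuming a toy dictionary and a standard textbook example sentence; neither is taken from the slides.

```python
# A minimal sketch of longest-word-first (forward maximum matching) segmentation.
DICTIONARY = {"研究", "研究生", "生命", "起源", "的"}   # illustrative toy dictionary
MAX_WORD_LEN = max(len(w) for w in DICTIONARY)

def longest_word_first(sentence):
    """Greedily take the longest dictionary word at each position."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if candidate in DICTIONARY or length == 1:
                tokens.append(candidate)      # single characters are the fallback
                i += length
                break
    return tokens

# The greedy choice of "研究生" over "研究" illustrates why
# longest-word-first still needs further disambiguation.
print(longest_word_first("研究生命的起源"))
```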
20
Segmentation Strategies
  • Dictionary Coverage
  • Dictionary cannot cover all the words
  • solutions
  • Morphological rules
  • (semi-)automatic construction of dictionaries
    automatic terminology extraction
  • Unknown word resolution

21
Morphological Rules
  • numeral classifier + classifier
  • ???, ???
  • date time
  • ????????
  • noun (or verb) prefix/suffix
  • ???
  • special verbs
  • ?? ?,?? ?,?? ?
  • ????,????,????,????
  • ???,???,???
  • ...

22
Term Extraction: n-gram Approach
  • Compute n-grams from a corpus
  • Select candidate terms
  • Successor variety
  • The successor variety increases sharply when a segment boundary is reached
  • Use i-grams and (i+1)-grams to select candidate terms of length i
  • Mutual Information (see the sketch after this list)
  • Significance Estimation Function

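As a concrete illustration of the mutual-information criterion, the sketch below scores adjacent character bigrams by pointwise mutual information over a raw character corpus. The corpus format, threshold, and frequency cut-off are illustrative assumptions, not the slides' settings.

```python
import math
from collections import Counter

def candidate_bigrams(corpus, pmi_threshold=3.0, min_freq=2):
    """Score character bigrams of a raw text by pointwise mutual information."""
    unigrams = Counter(corpus)
    bigrams = Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    candidates = {}
    for bg, freq in bigrams.items():
        p_xy = freq / n_bi
        p_x = unigrams[bg[0]] / n_uni
        p_y = unigrams[bg[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        if pmi > pmi_threshold and freq >= min_freq:
            candidates[bg] = pmi          # keep strongly associated, non-rare bigrams
    return candidates
```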
23
Named Entity Extraction
24
Named Entity Extraction
  • Five basic components in a document
  • People, affairs, time, places, things
  • Major unknown words
  • Named Entities in MET-2
  • Names: people, organizations, locations
  • Numbers: monetary/percentage expressions
  • Time: date/time expressions

25
Named People Extraction
  • Chinese person names
  • Chinese person names are composed of surnames and
    names.
  • Most Chinese surnames are a single character; a few rare ones are two characters.
  • Most given names are two characters; some rare ones are a single character (in Taiwan).
  • The length of Chinese person names ranges from 2
    to 6 characters.
  • Transliterated person names
  • Transliterated person names denote foreigners.
  • The length of transliterated person names is not
    restricted to 2 to 6 characters.

26
Named People Extraction: Chinese Person Names
  • Extraction Strategies
  • Baseline model: name-formulation statistics
  • Propose possible candidates.
  • Context cues
  • Add extra scores to the candidates.
  • When a title appears before (after) a string, the string is probably a person name.
  • Person names usually appear at the head or the tail of a sentence.
  • Persons may be accompanied by speech-act verbs like "??", "?", "??", etc.
  • Cache of occurrences of named people
  • A candidate appearing more than once has a strong tendency to be a person name.

27
Structure of Chinese Personal Names
  • Chinese surnames have the following three types
  • Single character like '?', '?', '?', '?'
  • Two characters like '??' and '??'
  • Two surnames together like '??'
  • Most names have the following two types
  • Single character
  • Two characters

28
Training Data
  • Name-formulation statistics are trained from a one-million-person-name corpus from Taiwan.
  • Each entry contains a surname, a given name, and the sex.
  • There are 489,305 male names and 509,110 female names.
  • A total of 598 surnames are retrieved from this 1-M corpus.
  • Surnames of very low frequency, like ?, ?, etc., are removed to avoid false alarms.
  • 541 surnames remain; they are used to trigger the person name extraction system.

29
Training Data
  • The probability of a Chinese character being the first character (the second character) of a given name is computed for males and females separately.
  • We compute the probabilities using the training tables for females and males, respectively.
  • Either the male score or the female score may exceed the thresholds.
  • In some cases, the female score is greater than the male score.
  • Thresholds are set so that 99% of the training data pass them.

30
Baseline Models: Name-Formulation Statistics
  • Model 1. Single character, e.g., ?, ?, ? and ?
  • P(C1)P(C2)P(C3) using the training table for male > Threshold1 and P(C2)P(C3) using the training table for male > Threshold2, or
  • P(C1)P(C2)P(C3) using the training table for female > Threshold3 and P(C2)P(C3) using the training table for female > Threshold4
  • Model 2. Two characters, e.g., ?? and ??
  • P(C2)P(C3) using the training table for male > Threshold2, or
  • P(C2)P(C3) using the training table for female > Threshold4
  • Model 3. Two surnames together, like '??'
  • P(C12)P(C2)P(C3) using the training table for female > Threshold3, P(C2)P(C3) using the training table for female > Threshold4, and P(C12)P(C2)P(C3) using the training table for female > P(C12)P(C2)P(C3) using the training table for male
    (a scoring sketch for Model 1 follows this slide)

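Below is a minimal sketch of Model 1 for the male table, assuming tiny illustrative probability tables and thresholds; the real tables come from the one-million-name corpus and the 99% criterion described above.

```python
# Illustrative probability tables; the real ones are trained from the
# one-million-name corpus described in the slides.
P_SURNAME = {"陳": 0.05, "林": 0.04}          # P(C1)
P_MALE_C2 = {"志": 0.012, "小": 0.004}        # P(C2), male table
P_MALE_C3 = {"明": 0.015, "華": 0.006}        # P(C3), male table

THRESHOLD1 = 1e-6    # threshold on P(C1)P(C2)P(C3)
THRESHOLD2 = 1e-5    # threshold on P(C2)P(C3)

def model1_male(candidate):
    """Model 1, male table: single-character surname + two-character given name."""
    c1, c2, c3 = candidate
    score_name = P_MALE_C2.get(c2, 0.0) * P_MALE_C3.get(c3, 0.0)
    score_full = P_SURNAME.get(c1, 0.0) * score_name
    return score_full > THRESHOLD1 and score_name > THRESHOLD2

print(model1_male("陳志明"))   # True with the toy tables above
```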
31
Cues from Character Levels
  • Gender
  • A married woman may add her husband's surname before her own surname; that forms type 3 person names.
  • Because a surname character may also serve as part of a given name, candidates with two surnames do not always belong to type 3.
  • The gender information helps disambiguate this type of person name.
  • Some Chinese characters score high for male names and some for female names; the following shows some examples.
  • Male ?????????????????????
  • Female ?????????????????????

32
Cues from Sentence Levels
  • Titles
  • When a title appears before (after) a candidate, the candidate is probably a person name; the title also helps decide the boundary of the name.
  • ????? vs. ??????? ...
  • Mutual Information
  • Telling whether a string is a content word or a name is indispensable.
  • ?????,??????
  • When there is a strong relationship between the surrounding words, the candidate has a high probability of being a content word.
  • Punctuation Marks
  • When a candidate is located at the end of a sentence, we give it an extra score.
  • Words around a caesura mark tend to be of the same type.

33
Cues from Passage/Document Level Cache
  • A person name may appear more than once in a
    paragraph.
  • There are four cases when cache is used.
  • (1) C1C2C3 and C1C2C4 are both in the cache, and
    C1C2 is correct.
  • (2) C1C2C3 and C1C2C4 are both in the cache, and
    both are correct.
  • (3) C1C2C3 and C1C2 are both in the cache, and
    C1C2C3 is correct.
  • (4) C1C2C3 and C1C2 are both in the cache, and
    C1C2 is correct.

34
Cache
  • The problem in using the cache is case selection.
  • Every entry in the cache is assigned a weight.
  • An entry with a clear right boundary (e.g., marked by a title or punctuation) has a high weight.
  • The other entries are assigned a low weight.
  • Use of weights in case selection:
  • high vs. high → case (2)
  • high vs. low or low vs. high → the high-weight entry is correct
  • low vs. low
  • check the score of the last character of the name part
  • ??? ???
  • ??? ???

35
Discussion
  • Typical types of errors:
  • Foreign names (e.g., ???, ???)
  • They are correctly identified as proper nouns, but are assigned wrong features.
  • About 20% of errors belong to this type.
  • Rare surnames (e.g., ?, ?, ?) or artists' stage names
  • Nearly 14% of errors come from this type.
  • Others
  • Other proper nouns (place names, organization names, etc.)
  • Identification errors

36
Omitted Name Problem
  • Texts often omit the given name and leave only the surname.
  • ??????
  • Strategies
  • If the candidate appeared earlier in the same paragraph, it is an omitted name.
  • If the candidate has a special title like ??????? or a general title like ??????..., it is an omitted name.
  • If two single characters with a very high probability of being surnames appear around a caesura mark, they are regarded as omitted names.

37
Transliterated Person Names
  • Challenging Issues
  • There is no special cue, such as the surname in Chinese person names, to trigger the recognition system.
  • There is no restriction on the length of a transliterated person name.
  • There is no large-scale corpus of transliterated person names.
  • Classification is ambiguous: '???' may denote a city or a former American president.

38
Strategy (1)
  • Character Condition
  • When a foreign name is transliterated, the choice of homophonous characters is restricted, e.g., Richard Macs: ????? vs. ?????
  • A basic character set can be trained from a corpus of transliterated names.
  • If all the characters in a string belong to this set, the string is regarded as a candidate.

39
Strategy (2)
  • Syllable Condition
  • Some characters which meet the character
    condition do not look like transliterated names.
  • Syllable Sequence
  • Simplified Condition
  • (1) For each candidate, we check the syllable of
    the first (the last) character.
  • (2) If the syllable does not belong to the
    training corpus, the character is deleted.
  • (3) The remaining characters are treated in a similar way.

40
Strategy (3)
  • Frequency Condition
  • For each two-character candidate, we check whether the frequency of the character pair exceeds a threshold.
  • The threshold is determined in the same way as in the baseline model for Chinese person names.
    (a sketch combining the three conditions follows this slide)

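Below is a minimal sketch combining the character, syllable, and frequency conditions. Every table is an illustrative stand-in for the resources trained from a transliterated-name corpus, and syllable_of is an assumed caller-supplied character-to-pinyin mapping.

```python
TRANSLIT_CHARS = set("克林頓柯爾史密斯巴")      # trained transliteration character set (illustrative)
FIRST_SYLLABLES = {"ke", "shi", "ba"}          # syllables seen name-initially (illustrative)
LAST_SYLLABLES = {"dun", "si", "er"}           # syllables seen name-finally (illustrative)
PAIR_FREQ = {("柯", "爾"): 120}                # character-pair counts (illustrative)
FREQ_THRESHOLD = 50

def character_condition(candidate):
    """All characters must belong to the trained transliteration set."""
    return all(ch in TRANSLIT_CHARS for ch in candidate)

def syllable_condition(candidate, syllable_of):
    """First/last syllables must have been observed in those positions."""
    return (syllable_of(candidate[0]) in FIRST_SYLLABLES
            and syllable_of(candidate[-1]) in LAST_SYLLABLES)

def frequency_condition(candidate):
    """Two-character candidates must also pass a pair-frequency threshold."""
    if len(candidate) != 2:
        return True
    return PAIR_FREQ.get((candidate[0], candidate[1]), 0) > FREQ_THRESHOLD
```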
41
Cues around Names
  • Cues within Transliterated Names
  • Character Condition
  • Syllable Condition
  • Frequency Condition
  • Cues around Transliterated Names
  • titles: the same as for Chinese person names
  • name introducers: "?", "??", "??", "??", and "??"
  • special verbs: the same as for Chinese person names
  • first name·middle name·last name

42
Discussion
  • Some transliterated person names may be
    identified by the Chinese person name extraction
    system.
  • ??? ???
  • Some nouns may look like transliterated person
    names.
  • popular brands of automobiles, e.g., '???' and
    '???'
  • Chinese proper nouns, e.g., '??', '??' and '??'
  • Chinese person names, e.g., '???'
  • Besides the nouns above, boundary errors also affect precision.
  • (?)???

43
Named Organization Extraction
  • A complete organization name can be divided into two parts: a name and a keyword.
  • Example: ?????
  • Many words can serve as names, but only certain fixed words can serve as keywords.
  • Challenging Issues
  • (1) A keyword is usually a common content word.
  • (2) A keyword may appear in abbreviated form.
  • (3) The keyword may be omitted completely.

44
Classification of Organization Names
  • Complete organization names
  • This type of organization name is usually composed of proper nouns and keywords.
  • Some organization names are very long, so determining the (left) boundary is difficult.
  • Some organization names with keywords are still ambiguous.
  • '???' usually denotes reading matter, not an organization.
  • Incomplete organization names
  • These organization names often omit their keywords.
  • The abbreviated organization names may be ambiguous.
  • '??' and '??' are famous sports teams in Taiwan and in the USA, respectively; however, they are also common content words.

45
Strategies
  • Keywords
  • A keyword signals not only the possible occurrence of an organization name but also its right boundary.
  • Prefix
  • A prefix is a good marker for the possible left boundary.
  • Single-character words
  • If the character preceding a possible keyword is a single-character word, then the content word is not a keyword.
  • If the characters preceding a possible keyword cannot exist independently, they form the name part of an organization.
  • Words of at least two characters
  • The words composing a name part usually have strong relationships.

46
Strategies
  • Parts of speech
  • The name part of an organization cannot extend beyond a transitive verb.
  • Numerals and classifiers are also helpful.
  • Cache
  • Problem: when should a pattern be put into the cache?
  • The character set is incomplete.
  • n-gram model
  • The pattern must consist of a name and an organization name keyword.
  • Its length must be greater than 2 words.
  • It must not cross any punctuation marks.
  • It must occur more often than a threshold.

47
Handcrafted Rules
  • OrganizationName → OrganizationName OrganizationNameKeyword, e.g., ??? ??
  • OrganizationName → CountryName OrganizationNameKeyword, e.g., ?? ???
  • OrganizationName → PersonName OrganizationNameKeyword, e.g., ??? ???
  • OrganizationName → CountryName OrganizationName, e.g., ?? ???
  • OrganizationName → LocationName OrganizationName, e.g., ???? ??
  • OrganizationName → CountryName DDD OrganizationNameKeyword, e.g., ?? ?? ????
  • OrganizationName → PersonName DD OrganizationNameKeyword, e.g., ??? ?? ???
  • OrganizationName → LocationName DD OrganizationNameKeyword, e.g., ?? ?? ????
    (a rule-application sketch follows this slide)

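Below is a minimal sketch of applying the simplest of these rules (Name + OrganizationNameKeyword) to a sequence of already-typed tokens. The token format and keyword list are illustrative assumptions, not the system's actual resources.

```python
ORG_KEYWORDS = {"公司", "大學", "銀行"}            # illustrative OrganizationNameKeyword entries

def find_organizations(tokens):
    """tokens: list of (string, type) pairs, e.g. ("美國", "CountryName")."""
    organizations, i = [], 0
    name_types = {"CountryName", "PersonName", "LocationName", "OrganizationName"}
    while i + 1 < len(tokens):
        word, word_type = tokens[i]
        nxt, _ = tokens[i + 1]
        if word_type in name_types and nxt in ORG_KEYWORDS:
            organizations.append(word + nxt)       # Name + Keyword -> OrganizationName
            i += 2
        else:
            i += 1
    return organizations

print(find_organizations([("美國", "CountryName"), ("銀行", "Noun")]))
```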
48
Discussion
  • Most errors result from organization names
    without keywords.
  • ??? ?? ????
  • ?? ?? ??
  • Identification errors
  • Even if keywords appear, organization names do
    not always exist.
  • ???? ????
  • Erroneous left boundaries are also a problem.
  • ????? (??)????
  • Ambiguities
  • ??? ????

49
Application of Gender Assignment
  • Anaphora resolution
  • "?????,??????????,??????????,?????????????????????
    ,?????,?????,?????????????????,????????"
  • The gender of a person name is useful for this problem.
  • The correct rate for gender assignment is 89%.
  • Co-Reference resolution

50
Named Location Extraction
  • A location name is composed of name and keyword
    parts.
  • Rules
  • LocationName → PersonName LocationNameKeyword
  • LocationName → LocationName LocationNameKeyword
  • Locative verbs like '??', '??', and so on, are
    introduced to treat location names without
    keywords.
  • Cache and n-gram models are also employed to
    extract location names.

51
Date Expressions
  • DATE → NUMBER YEAR (? ?)
  • DATE → NUMBER MTHUNIT (? ?)
  • DATE → NUMBER DUNIT (? ?)
  • DATE → REGINC (??)
  • DATE → FSTATE DATE (?? ??)
  • DATE → COMMON DATE (? ??)
  • DATE → REGINE DATE (?? ????)
  • DATE → DATE DMONTH (?? ??)
  • DATE → DATE BSTATE (?? ?)
  • DATE → FSTATEDATE DATE (?? ???)
  • DATE → FSTATEDATE DMONTH (?? ??)
  • DATE → FSTATEDATE FSTATEDATE (?? ??)
  • DATE → DATE YXY DATE (???? ? ????)
    (a regular-expression sketch of a few of these productions follows this slide)

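A few of the DATE productions above can be approximated with regular expressions. Below is a minimal sketch covering only the NUMBER + unit productions and a range connector; the character classes and the example sentence are illustrative assumptions.

```python
import re

# Digits plus common Chinese numerals (illustrative character class).
NUMBER = r"[0-9０-９一二三四五六七八九十百千零〇]+"
DATE_ATOM = rf"(?:{NUMBER}年|{NUMBER}月|{NUMBER}日)"
DATE = rf"{DATE_ATOM}+(?:[至到]{DATE_ATOM}+)?"     # optional range (a YXY-style connector)

def find_dates(text):
    """Return all date expressions matched by the toy grammar."""
    return re.findall(DATE, text)

print(find_dates("會議於一九九七年六月十二日舉行"))
```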
52
Time Expressions
  • TIME → NUMBER HUNIT (? ?)
  • TIME → NUMBER MUNIT (?? ?)
  • TIME → NUMBER SUNIT (? ?)
  • TIME → FSTATETIME TIME
  • TIME → FSTATE TIME
  • TIME → TIME BSTATE
  • TIME → MORN BSTATE
  • TIME → TIME TIME
  • TIME → TIME YXY TIME (?? ? ??)
  • TIME → NUMBER COLON NUMBER (03 : 45)

53
Monetary Expressions
  • DMONEY → MOUNIT NUMBER MOUNIT (?? ? ?)
  • DMONEY → NUMBER MOUNIT MOUNIT (? ? ??)
  • DMONEY → NUMBER MOUNIT (? ?)
  • DMONEY → MOUNIT MOUNIT NUMBER (?? 5)
  • DMONEY → MOUNIT NUMBER ( 5)
  • DMONEY → NUMBER YXY DMONEY (? ? ??)
  • DMONEY → DMONEY YXY DMONEY (?? ? ??)
  • DMONEY → DMONEY YXY NUMBER (200 - 500)

54
Percentage Expressions
  • DPERCENT → PERCENT NUMBER (??? ?)
  • DPERCENT → NUMBER PERCENT (3 )
  • DPERCENT → DPERCENT YXY DPERCENT (5 ? 8)
  • DPERCENT → DPERCENT YXY NUMBER (???? ? ?)
  • DPERCENT → NUMBER YXY DPERCENT (? ? ????)

55
Named Entity Extraction in MET2
  • Transform Chinese texts in GB codes into texts in
    Big-5 codes.
  • Segment Chinese texts into a sequence of tokens.
  • Identify named people.
  • Identify named organizations.
  • Identify named locations.
  • Use n-gram model to identify named
    organizations/locations.
  • Identify the rest of named expressions.
  • Transform the results from Big-5 codes back into GB codes.

56
from GB codes to Big-5 codes
  • Big-5 traditional character set and GB simplified
    character set are adopted in Taiwan and in China,
    respectively.
  • Our system is developed on the basis of Big-5 codes, so the transformation is required (a conversion sketch follows this slide).
  • Characters used in both the simplified and the traditional character sets result in erroneous mappings.
  • ?? vs. ?? ?? vs. ?? ?? vs. ???? vs. ?? ?? vs.
    ?? ??? vs. ?????? vs. ??? ?? vs. ?? ?? vs.
    ?????? vs. ???? and so on.
  • More unknown words may be generated.

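Below is a minimal sketch of the code-conversion step, assuming Python's built-in 'gb2312' and 'big5' codecs and going through Unicode; characters without a counterpart are replaced rather than raising an error, which is one source of the mapping problems noted above.

```python
def gb_to_big5(gb_bytes: bytes) -> bytes:
    """Convert GB-encoded bytes to Big-5 bytes via Unicode (lossy for unmapped characters)."""
    text = gb_bytes.decode("gb2312", errors="replace")   # GB bytes -> Unicode
    return text.encode("big5", errors="replace")         # Unicode -> Big-5 bytes

def big5_to_gb(big5_bytes: bytes) -> bytes:
    """Convert Big-5 bytes back to GB bytes for the final output."""
    text = big5_bytes.decode("big5", errors="replace")
    return text.encode("gb2312", errors="replace")
```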
57
Segmentation
  • We list all the possible words by dictionary
    look-up, and then resolve ambiguities by
    segmentation strategies.
  • The test documents in MET-2 are selected from newspapers in China.
  • Our dictionary is trained from Taiwan corpora.
  • Due to the different vocabulary sets, many more unknown words may be introduced, e.g., ???? vs. ????, ?? vs. ??, ?? vs. ???, ??? vs. ???, etc.
  • The unknown words from different code sets and
    different vocabulary sets make named entity
    extraction more challenging.

58
MET-2 Formal Run of NTUNLPL
  • F-measures (a computation sketch follows this slide)
  • P&R: 79.61
  • 2P&R: 77.88
  • P&2R: 81.42
  • Recall and Precision
  • name: (85%, 79%)
  • number: (91%, 98%)
  • time: (95%, 85%)

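The three figures above are weighted F-measures. Below is a minimal sketch of the usual MUC-style formula, where beta weights recall relative to precision (P&R: beta = 1; 2P&R: beta = 0.5; P&2R: beta = 2). The precision and recall values plugged in are illustrative, not the official per-category scores.

```python
def f_measure(precision, recall, beta=1.0):
    """F = ((beta^2 + 1) * P * R) / (beta^2 * P + R)."""
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.79, 0.85, beta=1.0))   # balanced P&R for an illustrative (P, R) pair
print(f_measure(0.79, 0.85, beta=2.0))   # recall-weighted P&2R for the same pair
```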
59
Named Persons
  • The recall rate and the precision rate are 91% and 74%.
  • Major errors
  • Segmentation, e.g., ??? → ?? ?: part of a person name may be regarded as a word during segmentation.
  • The surname, name character, and title lists are incomplete, e.g., ???, ?? ? ??, ?? ??
  • Blanks, e.g., ? ?: we cannot tell whether the blanks exist in the original documents or are inserted by the segmentation system.
  • Boundary errors
  • Japanese names, e.g., ?????

60
Evaluation Named Organization
  • The recall rate and the precision rate are 78% and 85%.
  • Major errors
  • More than two content words between the name and the keyword, e.g., ?? ?? ?? ?? ??
  • Absence of keywords, e.g., ???????
  • Absence of the name part: the name part does not satisfy the character condition, e.g., ????
  • n-gram errors, e.g., ???????????

61
Evaluation Named Locations
  • The recall rate and the precision rate are 78% and 69%.
  • Character set
  • The characters "?" and "?" in the string "????" do not belong to our transliterated character set.
  • Wrong keyword
  • The character "?" is an organization keyword, so the string "?????" is mistakenly regarded as an organization name.
  • common content words
  • The words such as "??", "??", etc., are common
    content words. We do not give them special tags.
  • single-character locations
  • The single-character locations such as "?", "?",
    and so on, are missed during recognition.

62
Evaluation Time/Date Expressions
  • The recall rates and the precision rates for date, time, monetary, and percentage expressions are (94%, 88%), (98%, 70%), (98%, 98%), and (83%, 98%), respectively.
  • Major errors
  • Propagation errors
  • segmentation before entity extraction, e.g., ??
  • named people extraction before date expressions
  • Missing date units
  • the date unit does not appear, e.g., ????
  • the date unit should appear, e.g., ???

63
  • Absent keywords
  • Some keywords are not listed.
  • E.g., ???????8?58? is divided into ??,
    8?58?
  • Rule coverage
  • E.g., ?????
  • Ambiguity
  • Some characters like ? can be used in time and
    monetary expressions. E.g., ???????? is
    divided into two parts ??? and ?????
  • The strings "??" and "??" are words. In our
    pipelined model, "????" and "????" will be missed.

64
Issues
  • Deal with the errors propagated from the previous
    modules
  • Pipelining model vs. interleaving model
  • Deal with the errors resulting from rule coverage
  • Handcrafted rules vs. learning rules
  • Deal with the errors resulting from segmentation
    standards
  • Vocabulary set of Academia Sinica vs. Peking University

65
Pipelining Model

(Diagram: input → segmentation → ambiguity resolution (only one result) → named entity extraction (named people, named locations, named organizations, number and date/time) → output)
66
Interleaving Model

(Diagram: input → table lookup → named people, named locations, named organizations, number and date/time → ambiguity resolution → output)
67
An Example in Interleaving Model
?
?
?
?
?
?
68
Learning Rules vs. Hand-Crafted Rules
  • Collect organization names.
  • Extract patterns
  • Cluster organization names based on keywords.
  • Assign features to name parts.
  • Employ the Teiresias algorithm to extract patterns (http://cbcsrv.watson.ibm.com/Tspd.html).

69
Teiresias algorithm
  • Patterns consist of words and wild cards (*), e.g.,
  • ?? ?? ??
  • ?? ?? ??
  • => ?? * ??
  • Parameter settings
  • L: the minimum number of non-wild-card words in a pattern
  • W: the maximum length of any subpattern containing L non-wild-card words
  • T: the confidence level, i.e., how many training instances this pattern must match

70
Keyword Set
  • Extracting keyword set
  • Input all the training instances (i.e., organization names) into the Teiresias algorithm.
  • Set the confidence level to 5.
  • Find all the patterns that do not end with a wild card, e.g., ( ?? ) ( ?? ) ( ? ? )
  • Regard the suffix of a pattern as a keyword, e.g., ??? ??? ? ?? ?? ??? ?? ?????

71
Features of Patterns
  • types
  • named entities
  • named people
  • named locations
  • named organizations
  • date expression ( ????????????? )
  • number ( 87????????? )
  • common nouns

72
Tagging
73
Tagging
  • Lexical level
  • Part of Speech Tagging
  • Named entity tagging
  • Sense Tagging
  • Syntactic level
  • Syntactic Category (Structure) Tagging
  • Discourse level
  • Anaphora-Antecedent Tagging
  • Co-Reference Tagging

74
Part-of-Speech Tagging
  • Issues of tagging accuracy
  • the amount of training data
  • the granularity of the tagging set
  • the occurrences of unknown words, and so on.
  • Academia Sinica Balanced Corpus
  • 5 million words
  • 46 tags
  • Language models, e.g., bigram, trigram, etc.

75
Sense Tagging
  • Assign sense labels to words in a sentence.
  • Sense Tagging Set
  • Tong2yi4ci2 Ci2lin2 (?????, Cilin)
  • 12 large categories
  • 94 middle categories
  • 1,428 small categories
  • 3,925 word clusters

76
A: People
  Aa (collective names): 01 Human being, The people, Everybody; 02 I, We; 03 You; 04 He/She, They; 05 Myself, Others, Someone; 06 Who
  Ab (people of all ages and both sexes): 01 A Man, A Woman, Men and Women; 02 An Old Person, An Adult, The old and the young; 03 A Teenager; 04 An Infant, A Child
  Ac (posture): 01 A Tall Person, A Dwarf; 02 A Fat Person, A Thin Person; 03 A Beautiful Woman, A Handsome Man
77
(No Transcript)
78
Degree of Polysemy in Mandarin Chinese
  • The small categories of Cilin are used to compute the distribution of word senses.
  • The ASBC corpus is employed to count the frequency of each word.
  • A total of 28,321 word types appear in both Cilin and the ASBC corpus.
  • A total of 5,922 words are polysemous.

79
(Table legend: degree = number of senses of a word; word type = a dictionary entry)
80
(Figure: 97.53% and 94.70% for N and V; 98.22% and 97.08% for A and K; 5,922; 4,132)
81
93.77% of polysemous words belong to the low-ambiguity class; they occupy only 58.52% of the tokens in the ASBC corpus.
(word token: an occurrence of a word type in the ASBC corpus; word type: a dictionary entry)
82
(Figure: low frequency (< 100), middle frequency (100 ≤ f < 1000), high frequency (≥ 1000); ambiguity degrees 2-4, 5-8, and > 8)
83
Phenomena
  • POS information reduces the degree of ambiguity.
  • A total of 8.94% of word tokens are highly ambiguous in Table 3; this decreases to 0.47% in Table 4.
  • Highly ambiguous words tend to be highly frequent.
  • 23.67% of word types are middle- or high-frequency words, and they account for 94.06% of word tokens.

84
Semantic Tagging Unambiguous Words
  • acquire the context for each semantic tag
    starting from the unambiguous words

Unambiguous words: those words that have only one sense in Cilin (resources: Cilin, ASBC)
85
Acquire Contextual Vectors
  • An unambiguous word is characterized by the
    surrounding words.
  • The window size is set to 6, and the stop words
    are removed.
  • A sense tag Ctag is represented as a vector (w1, w2, ..., wn).
  • MI metric
  • EM metric
    (a vector-acquisition sketch follows this slide)

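Below is a minimal sketch of the acquisition step: every occurrence of an unambiguous word contributes its neighbouring words (window of 6, taken here as three on each side, stop words removed) to the vector of its sense tag. The corpus format, stop-word list, and word-to-tag mapping are illustrative assumptions.

```python
from collections import Counter, defaultdict

STOP_WORDS = {"的", "了", "是"}       # illustrative stop-word list
HALF_WINDOW = 3                        # window size 6, split across both sides (assumption)

def sense_vectors(sentences, tag_of_unambiguous):
    """sentences: lists of tokens; tag_of_unambiguous: word -> Cilin sense tag."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        tokens = [t for t in tokens if t not in STOP_WORDS]
        for i, word in enumerate(tokens):
            tag = tag_of_unambiguous.get(word)
            if tag is None:
                continue
            context = tokens[max(0, i - HALF_WINDOW):i] + tokens[i + 1:i + 1 + HALF_WINDOW]
            vectors[tag].update(context)       # accumulate (w1, ..., wn) counts for this tag
    return vectors
```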
86
Semantic Tagging Ambiguous Words
  • Apply the information trained in the first stage to select the best sense tag from the candidates of each ambiguous word.

Ambiguous words: those words that have more than one sense in Cilin (resources: Cilin, ASBC, unambiguous words)
87
Apply and Retrain Contextual Vectors
  • Identify the context vector of an ambiguous word.
  • Measure the similarity between a sense vector and a context vector by the cosine formula (see the sketch after this slide).
  • Select the sense tag with the highest similarity score.
  • Retrain the sense vector for each sense tag.

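Below is a minimal sketch of the selection step: represent the ambiguous occurrence by its context counts, compare it with each candidate sense vector using the cosine formula, and keep the best tag. The vector representations match the sketch after the previous slide.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (dicts)."""
    dot = sum(count * v.get(word, 0) for word, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def best_sense(context_words, candidate_tags, sense_vecs):
    """Pick the candidate Cilin tag whose sense vector is closest to the context."""
    context = Counter(context_words)
    return max(candidate_tags, key=lambda tag: cosine(context, sense_vecs[tag]))
```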
88
Semantic Tagging Unknown Words
  • Adopt outside evidence from the mapping between WordNet synsets and Cilin sense tags to narrow down the candidate set.

(Diagram: Cilin and ASBC → unambiguous words, ambiguous words, and unknown words)
89
(No Transcript)
90
Experiments
  • Test Materials
  • Sample documents of different categories from
    ASBC corpus
  • Total 35,921 words in the test corpus
  • Research associates tag this corpus manually
  • Mark up the ambiguous words by looking up the
    Cilin dictionary
  • Tag the unknown words by looking up the mapping
    table
  • The tag mapper achieves approximately 82.52% accuracy.

91
(No Transcript)
92
(Results figure: 49.55%; 62.60%, 31.36%, 27.00%)
93
• The performance for tagging low-ambiguity (2-4), middle-ambiguity (5-8), and high-ambiguity (>8) words is similar (i.e., 63.98%, 60.92%, and 67.95%) when 1, 2, and 3 candidates are proposed.
• Under the middle categories and with 1-3 proposed candidates, the performance for tagging low-, middle-, and high-ambiguity words is 71.02%, 73.88%, and 75.94%.
94
(No Transcript)
95
Co-Reference Resolution
96
Introduction
  • Anaphora vs. Co-Reference
  • Anaphora
  • ?????
  • Type/Instance ??/??, ?????/??
  • Function/Value ?????/?? 30 ?
  • NP ?????? ?????/???

??
??
97
Flow of Co-Reference Resolution
Document
98
Find the Candidate List
(Diagram: document → determine candidates → all the candidates → co-reference resolution algorithm → singletons and co-reference classes 1 .. N)
99
Find the Candidates
  • Select all nouns (Cand-Terms)
  • Na (????)
  • Nb (????)
  • Nc (????)
  • Nd (????)
  • Nh (???)
  • Delete some Nds (total 171)
  • e.g., ??????????, ??, ??, ...
  • Select noun phrases (Cand-NP)
  • Select maximal noun phrases (Cand-MaxNP)

Some are found in named entity extraction
100
Recognize NPs whose head is Na (common noun)
??(Neqa) ?(DE) ??(Na) ??(Neqa) ?(Neu) ?(Nf)
??(Na) ?(Nep) ?(Neu) ?(Nf) ??(Na) ?(Nes)
??(Na)
101
Recognize NPs whose head is Nh (pronoun)
(Diagram: a recognizer over the POS tags Na, Nb, Nh, starting from Init)
??(Nb) ??(Nh) ???(Na) ??(Nd) ?(Nh) ??(Nh)
102
Cilin (?????)
  • 12 large categories
  • 94 middle categories
  • 1,428 small categories
  • 3,925 word clusters

103
Features
  • Classification
  • Word/Phrase Itself
  • Part of Speech of Head
  • Semantics of Head
  • Type of Named Entities
  • Positions (Sentences and Paragraphs)
  • Number: Singular, Plural, and Unknown
  • Gender: Pronouns and Chinese Person Names
  • Pronouns: Personal Pronouns, Demonstrative Pronouns

104
Co-Reference Resolution Algorithms
  • Strategy 1: simple pattern matching
  • Strategy 2: the Cardie clustering algorithm

105
Cardie Clustering Algorithm
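The slide's figure is not reproduced in this transcript. As a rough sketch of Cardie-style clustering for co-reference: scan the noun phrases from the end of the document backwards, compute a feature-based distance to each preceding NP, and merge their clusters when the distance falls below a radius r. The dist function below stands in for the weighted feature functions (including the semantic restrictions on the following slides), and the compatibility check on whole clusters is omitted for brevity.

```python
def cluster_noun_phrases(nps, dist, radius):
    """nps: NP feature records in document order; dist: pairwise distance function."""
    parent = list(range(len(nps)))          # each NP starts in its own cluster

    def find(i):                            # union-find root lookup
        while parent[i] != i:
            i = parent[i]
        return i

    for j in range(len(nps) - 1, -1, -1):   # later NPs first
        for i in range(j - 1, -1, -1):      # look back at earlier NPs
            if dist(nps[i], nps[j]) < radius and find(i) != find(j):
                parent[find(j)] = find(i)   # merge the two clusters
    return [find(i) for i in range(len(nps))]
```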
106
(No Transcript)
107
Semantics Restrictions
  • SemanticFun_1(NPi, NPj)
  • If the heads belong to the same word cluster, 0 is assigned; otherwise 1 is assigned (a sketch follows this slide).
  • SemanticFun_2(NPi, NPj)
  • Integrates POS, named entity type, and Cilin sense; the cases are:
  • only one of NPi and NPj is a named entity
  • neither NPi nor NPj is a named entity, and both are in Cilin
  • neither NPi nor NPj is a named entity, and only one of them is in Cilin
  • neither NPi nor NPj is a named entity, and neither is in Cilin → 0

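Below is a minimal sketch of SemanticFun_1 with an illustrative cluster lookup table; the real table is the word-cluster level of Cilin.

```python
CILIN_CLUSTER = {"醫生": "Ae05", "大夫": "Ae05", "學校": "Di02"}   # illustrative entries only

def semantic_fun_1(head_i, head_j):
    """Return 0 if the two heads share a Cilin word cluster, 1 otherwise."""
    cluster_i = CILIN_CLUSTER.get(head_i)
    cluster_j = CILIN_CLUSTER.get(head_j)
    return 0 if cluster_i is not None and cluster_i == cluster_j else 1
```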
108
SemanticFun_2
  • When one of them is a named entity:
  • the column denotes the type of the named entity
  • the row denotes the part of speech
  • an English string in a table cell denotes a Cilin sense

109
Experimental Results
110
Named Entity Tagging Environment
111
(No Transcript)
112
(No Transcript)
113
(No Transcript)
114
tag at the same time
115
(No Transcript)
116
?? vs. ??????
117
(No Transcript)
118
Named Entity Extraction in Bioinformatics
Application
  • Named Entities
  • Protein Name
  • Gene Name

119
Summary
  • Segmentation
  • Named Entity Extraction
  • POS and Sense Tagging
  • Co-Reference Resolution
  • NE Tagging Environment
  • Bioinformatics