Title: BURC: Bootstrapping Using ResearchCyc
1BURC Bootstrapping Using ResearchCyc
2Introduction to the Problem
- Goal: To extend Cyc's knowledge base using relationships implied to be possible, normal, or commonplace in the world
- Prior work with Cyc knowledge entry has been manually oriented
- How will we collect commonsense knowledge without a body and manual labor?
- Read, Parse, Mine!
- Proposal: Read text, parse it into a database, extract relations between words, and propose hypothetical relations between concepts
3Basic Analogy
- The shotgun approach to the Human Genome
- Extract millions of fragments, then knit them back together by finding commonalities
- Will it work for the Human Menome?
4What is Cyc?
- "The world's largest and most complete general knowledge base and commonsense reasoning engine"
- Started in the mid-1980s ("should take only 10 years")
- Logic based
- LISP oriented
- For WordNet users: each Concept is roughly a Synset
- Available from http://www.opencyc.org
- http://researchcyc.cyc.com
- Big (ResearchCyc v0.8)
  - Constants: 89,379
  - Assertions: 968,985
  - Deductions: 361,185
- Sample collection extents
  - EnglishWord: 18,007
  - Event: 6,050
  - PartiallyTangible: 24,387
  - Microtheory: 1,688
5Example of what Cyc currently knows about fingers
- Collection: Finger
- GAF Arg 1
- Mt: UniversalVocabularyMt
  - isa: AnimalBodyPartType
  - quotedIsa: DensoOntologyConstant
  - genls: Digit-AnatomicalPart
  - comment: "The collection of all digits of all Hands (q.v.). Fingers are (typically) flexibly jointed and are necessary to enabling the hand (and its owner) to perform grasping and manipulation actions."
- Mt: BaseKB
  - definingMt: AnimalPhysiologyVocabularyMt
- Mt: AnimalPhysiologyMt
  - properPhysicalPartTypes: Fingernail
- Mt: WordNetMappingMt
  - (synonymousExternalConcept Finger WordNet-Version2_0 "N05247839")
  - (synonymousExternalConcept Finger WordNet-1997Version "N04312497")
- GAF Arg 2
- Mt: UniversalVocabularyMt
  - (genls LittleFinger Finger)
  - (genls IndexFinger Finger)
  - (genls Thumb Finger)
  - (genls RingFinger Finger)
  - (genls MiddleFinger Finger)
- Mt: HumanActivitiesMt
  - (bodyPartsUsed-TypeType Typing Finger)
- Mt: HumanSocialLifeMt
  - (bodyPartsUsed-TypeType PointingAFinger Finger)
6Example of what Cyc currently knows about fingers - 2
- GAF Arg 3
- Mt: HumanPhysiologyMt
  - (relationAllExists anatomicalParts HomoSapiens Finger)
- Mt: VertebratePhysiologyMt
  - (relationAllExistsCount physicalParts Hand Finger 5)
- Mt: UniversalVocabularyMt
  - (relationAllOnly wornOn Ring-Jewelry Finger)
- Mt: AnimalPhysiologyMt
  - (relationExistsAll physicalParts Hand Finger)
- GAF Arg 4
- Mt: GeneralEnglishMt
  - (denotation Finger-TheWord CountNoun 0 Finger)
- Mt: AnimalPhysiologyMt
  - (conceptuallyRelated Fingernail Finger)
  - (properPhysicalPartTypes Hand Finger)
  - (relationAllInstance age Finger (YearsDuration 0 200))
  - (relationAllInstance widthOfObject Finger (Meter 0.001 0.2))
  - (relationAllInstance heightOfObject Finger (Meter 0.001 0.2))
  - (relationAllInstance lengthOfObject Finger (Meter 0.01 0.5))
  - (relationAllInstance massOfObject Finger (Kilogram 0.001 1))
7Bootstrapping with ResearchCyc
- Cyc has vocabulary about objects in the world and their relationships
- Cyc could still use more common relationships
- BURC uses what Cyc already has, plus lots of parsed text, to create new Cyc entries for common relationships found in the text
- Lenat's Bootstrap Hypothesis: once Cyc reaches a certain level/scale, it can help in its own development and start using NLP to augment its knowledge base
- BURC should help test this hypothesis
8The BURC Process: From seeds to Hypothe-seeds
- Use the link grammar parser for bulk parsing of text, primarily narratives based in worlds like ours. Other text styles could be included.
- Operates in two directions
  - Forward: from text to CycL
  - Backwards: from existing CycL to the text, to find new forward patterns
9BURC Process - 2
- Load the link fragments into a database (1-link and 2-link fragments), and compute the frequency of fragment occurrences. The database will be in a SQL format so multiple queries can be formed dynamically.
- Using Cyc knowledge as a starting point (the seeds), extract knowledge for use in Cyc
- Given a set of seed facts in Cyc, identify how those facts are represented as link fragments in the database
- Generate conjectures as to new knowledge AND new knowledge extraction patterns using the fragment patterns.
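The loading step above might look roughly like the sketch below. It is not the BURC implementation: the slides name only LGPTable and the columns Term1, Link1, Term2, Link2, Term3 and NumLinks, so the Freq column, the SQLite backend, the helper load_fragments, and the sample fragments are all assumptions made for illustration.

```python
import sqlite3
from collections import Counter

# Hypothetical schema: only the table name and the Term/Link/NumLinks
# columns appear in the slides; Freq and the column types are assumed.
SCHEMA = """
CREATE TABLE LGPTable (
    NumLinks INTEGER, Term1 TEXT, Link1 TEXT, Term2 TEXT,
    Link2 TEXT, Term3 TEXT, Freq INTEGER
)
"""

def load_fragments(conn, fragments):
    """Count identical fragments and store one row per distinct fragment."""
    counts = Counter(fragments)
    conn.execute(SCHEMA)
    rows = []
    for frag, freq in counts.items():
        # A 1-link fragment has 3 items; pad it to the 2-link shape (5 items).
        padded = tuple(frag) + (None,) * (5 - len(frag))
        num_links = 1 if len(frag) == 3 else 2
        rows.append((num_links, *padded, freq))
    conn.executemany("INSERT INTO LGPTable VALUES (?,?,?,?,?,?,?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
fragments = [  # toy fragments, not mined data
    ("white.a", "a", "blouse.n"),
    ("white.a", "a", "blouse.n"),
    ("finger.n", "ss", "push.v", "os", "button.n"),
]
distinct = load_fragments(conn, fragments)
```

Storing one row per distinct fragment with a count keeps the table small and lets the later queries read the frequency directly instead of aggregating on every lookup.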
10BURC Process - 3
- Use Cyc knowledge directly to conjecture new statements
- Cyc has lexical knowledge, which can be used as templates against the DB to form new statements
- For example, common adjectives applied to noun classes
- Cyc knows WhiteColor and Blouse but does not know that white is a common blouse color, although it becomes apparent after reading some text
- Optionally, gather supporting background statistics for hypothesis verification using other sources
- Perhaps Google Desktop, with a corpus larger than the fully parsed one
- Perhaps check against answer extraction engines
11KNEXT (KNowledge EXtraction from Text)
- Deriving general world knowledge from texts and taxonomies
- http://www.cs.rochester.edu/schubert/projects/world-knowledge-mining.html
- Lenhart K. Schubert and Matthew Tong, "Extracting and evaluating general world knowledge from the Brown Corpus", Proc. of the HLT-NAACL Workshop on Text Meaning, May 31, 2003, Edmonton, Alberta, pp. 7-13.
- System extracts commonsense relationships from text
- Limited to the pre-parsed Penn Treebank
- Generated 117,326 propositions (about 2 per sentence)
- About 60% judged reasonable by any given judge
12KNEXT (Example)
- (BLANCHE KNEW 0 SOMETHING MUST BE CAUSING STANLEY 'S NEW, STRANGE BEHAVIOR BUT SHE NEVER ONCE CONNECTED IT WITH KITTI WALKER.)
- A FEMALE-INDIVIDUAL MAY KNOW A PROPOSITION.
- SOMETHING MAY CAUSE A BEHAVIOR.
- A MALE-INDIVIDUAL MAY HAVE A BEHAVIOR.
- A BEHAVIOR CAN BE NEW.
- A BEHAVIOR CAN BE STRANGE.
- A FEMALE-INDIVIDUAL MAY CONNECT A THING-REFERRED-TO WITH A FEMALE-INDIVIDUAL.
- ((:I (:Q DET FEMALE-INDIVIDUAL) KNOW[V] (:Q DET PROPOS))
   (:I (:F K SOMETHING[N]) CAUSE[V] (:Q THE BEHAVIOR[N]))
   (:I (:Q DET MALE-INDIVIDUAL) HAVE[V] (:Q DET BEHAVIOR[N]))
   (:I (:Q DET BEHAVIOR[N]) NEW[A])
   (:I (:Q DET BEHAVIOR[N]) STRANGE[A])
   (:I (:Q DET FEMALE-INDIVIDUAL) CONNECT[V] (:Q DET THING-REFERRED-TO)
       (:P WITH[P] (:Q DET FEMALE-INDIVIDUAL))))
13Other Extraction Pattern Research
- Towards Terascale Knowledge Acquisition (Pantel, Ravichandran and Hovy, 2004)
- Learning Surface Text Patterns for a Question Answering System (Ravichandran & Hovy, 2002)
- Defined pattern precision: P = Ca/Co
  - Ca: total number of patterns with the answer term present
  - Co: total number of patterns with any term present
- DIRT: Discovery of Inference Rules from Text (Lin & Pantel, 2001)
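As a small worked illustration of the pattern precision measure P = Ca/Co, the sketch below scores one pattern from the list of terms it matched. The function name, its arguments, and the sample matches are hypothetical; only the ratio itself comes from the definition above.

```python
def pattern_precision(matched_terms, answer_term):
    """Return P = Ca / Co for a single extraction pattern.

    matched_terms: the term found in the answer slot for each match.
    Co counts every match; Ca counts matches holding the answer term.
    """
    co = len(matched_terms)
    if co == 0:
        return 0.0  # a pattern that never matched gets no credit
    ca = sum(1 for term in matched_terms if term == answer_term)
    return ca / co

# Toy matches for one pattern: two of the four hold the expected answer.
precision = pattern_precision(["1756", "1756", "1791", "Salzburg"], "1756")
```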
14Other Lexical Knowledge Research
- VerbOcean (Chklovski & Pantel): collecting pairs and searching to verify relationships
- Lexical Acquisition via Constraint Solving (Pedersen & Chen): acquiring syntactic and semantic classification rules of unknown words for LGP
- Information Extraction Using Link Grammar papers
- Automatic Meaning Discovery Using Google
15The General Backwards Model
- Given some Cyc relation Pred(?X, ?Y)
- Create SQL search query
- Look up the Cyc lexical entries for X and Y -> LX, LY
- Select * from LGPTable where Term1="<LX>" and Term3="<LY>"
- System returns records: LX Link1 Term2 Link2 LY (Freq)
- Generate new hypothetical extraction patterns
- Select * from LGPTable where Link1="<L1>" and Link2="<L2>" and Term2="<T2>"
- L1 T2 L2 -> generate hypothetical record (Pred ?S1 ?S3)
- Frequency information is propagated forward
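The two queries of the backwards model can be sketched as follows. This is a minimal illustration under stated assumptions: the in-memory SQLite table, the Freq column, the helper names backward_patterns and forward_hypotheses, and the three sample rows are invented here, and the mapping from words back to Cyc constants is left out.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE LGPTable (Term1, Link1, Term2, Link2, Term3, Freq)")
conn.executemany("INSERT INTO LGPTable VALUES (?,?,?,?,?,?)", [
    ("hand.n", "ss", "have.v", "os", "finger.n", 12),  # toy rows, not mined data
    ("car.n", "ss", "have.v", "os", "wheel.n", 7),
    ("dog.n", "ss", "chase.v", "os", "cat.n", 3),
])

def backward_patterns(conn, lx, ly):
    """Step 1: find how the seed pair (LX, LY) shows up as link fragments."""
    cur = conn.execute(
        "SELECT Link1, Term2, Link2, Freq FROM LGPTable "
        "WHERE Term1 = ? AND Term3 = ?", (lx, ly))
    return cur.fetchall()

def forward_hypotheses(conn, pred, pattern, seed):
    """Step 2: reuse a pattern (L1, T2, L2) to propose (Pred ?S1 ?S3)."""
    l1, t2, l2 = pattern
    cur = conn.execute(
        "SELECT Term1, Term3, Freq FROM LGPTable "
        "WHERE Link1 = ? AND Term2 = ? AND Link2 = ?", (l1, t2, l2))
    # The frequency is carried forward with each hypothesis.
    return [(pred, s1, s3, freq) for s1, s3, freq in cur if (s1, s3) != seed]

# Seed fact from Cyc: (properPhysicalPartTypes Hand Finger)
seed = ("hand.n", "finger.n")
patterns = backward_patterns(conn, *seed)
hyps = [h for l1, t2, l2, _ in patterns
        for h in forward_hypotheses(conn, "properPhysicalPartTypes",
                                    (l1, t2, l2), seed)]
```

With these rows the seed yields the pattern ss have.v os, which in turn proposes a part relation between car and wheel. The same pattern would also fire on fragments such as "dog has bone", which is why the frequency and the later filtering steps matter.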
16The General Backwards Model - 2
- Optional: Search Cyc for ?PRED(X, Y) and use the set to form a local ambiguity class to reduce search labor and identify ambiguity. One rule -> multiple relations.
- Stored as: SQLTemplate \ Pattern \ Pred1/Pred2/.../PredN
- Need to explore (canidateBinaryPred ARG1 ARG2 RELN)
- Optional: Form more specific patterns for Pred(X,_) and Pred(_,Y)
17Update the LGParser's CycL Rules
- <rule>
    <pattern> Link1 Term2 Link2 </pattern>
    <define>?ITEMr </define>
    <body>(is-node ?ITEMr "R")</body>
    <define>?ITEMl </define>
    <body>(is-node ?ITEMl "R")</body>
    <body>(?PRED1 ?ITEMl ?TERMr)</body>
    ...
    <body>(?PREDN ?ITEMl ?TERMr)</body>
  </rule>
- There are rules for translation of LGP output into CycL
- If the frequency information warrants it, then we can generate new LGP rules
- Results in expanded parser precision
18Forward Mining Adjective Relations
- There are 1941 GAFs on adjSemTrans, the primary lexical adjective predicate
- Find applicable fragments and use definitions
- Select * from LGPTable where NumLinks=1 and Link1='a' and Term1 like '%.a' and Term2 like '%.n'
- Returns records: Term1.a a Term2.n
- Potentially test using either an internal or a search-engine-based relevancy metric
- Query Cyc for (adjSemTrans <term1>-TheWord ?N RegularAdjFrame (?Pred :NOUN ?Val))
- Generate (plausiblePredValueOfType <term2> <?Pred> <?Val>)
- Possibly generate parsing rule
19Mining Adjective Knowledge Example
- "white blouse" as factoid
- white.a a blouse.n
- Potentially test using an internal or search engine relevancy metric: GC=70400
- (adjSemTrans White-TheWord 11 RegularAdjFrame (mainColorOfObject :NOUN WhiteColor))
- Hypothesis: (plausiblePredValueOfType Blouse mainColorOfObject WhiteColor)
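The forward adjective-mining steps, applied to the white blouse example, can be sketched as below. The three dictionaries and lists are toy stand-ins (assumptions) for the fragment table, the adjSemTrans query, and the denotation lookup; the frequencies are invented, and mine_adjectives is a hypothetical helper rather than BURC code.

```python
# Toy stand-ins for LGPTable rows and for Cyc's lexical knowledge.
FRAGMENTS = [("white.a", "a", "blouse.n", 40), ("white.a", "a", "shirt.n", 25)]
ADJ_SEM_TRANS = {"white": [("mainColorOfObject", "WhiteColor")]}
DENOTATION = {"blouse": "Blouse"}  # noun word -> Cyc collection

def mine_adjectives(fragments, adj_sem_trans, denotation):
    """Turn adjective-noun fragments into plausiblePredValueOfType hypotheses."""
    hypotheses = []
    for term1, link, term2, freq in fragments:
        # Keep only 1-link records of the form Term1.a a Term2.n
        if link != "a" or not term1.endswith(".a") or not term2.endswith(".n"):
            continue
        adj, noun = term1[:-2], term2[:-2]
        cyc_type = denotation.get(noun)
        if cyc_type is None:
            continue  # the noun has no known Cyc denotation in this toy lexicon
        for pred, val in adj_sem_trans.get(adj, []):
            hypotheses.append(
                (freq, f"(plausiblePredValueOfType {cyc_type} {pred} {val})"))
    return hypotheses

result = mine_adjectives(FRAGMENTS, ADJ_SEM_TRANS, DENOTATION)
```

The shirt fragment is skipped only because the toy lexicon has no entry for it; with a fuller denotation table it would produce a hypothesis of the same shape.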
20Update the LGParser's CycL Rules - 2
- <rule>
    <pattern> Term1.a a </pattern>
    <define>?ITEMr </define>
    <body>(is-node ?ITEMr "R")</body>
    <define>?ITEMl </define>
    <body>(?PRED ?ITEMr ?VAL)</body>
  </rule>
- There are rules for translation of LGP output into CycL
- We can use the adjSemTrans data to generate new translation rules
- Results in expanded parser precision
- <rule>
    <pattern> white.a a </pattern>
    <define>?ITEMr </define>
    <body>(is-node ?ITEMr "R")</body>
    <define>?ITEMl </define>
    <body>(mainColorOfObject ?ITEMr WhiteColor)</body>
  </rule>
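Generating such a rule from an adjSemTrans entry could be as simple as filling in a text template. The sketch below copies the rule layout from the slide; the function adjective_rule and its parameters are hypothetical, and nothing here is claimed to match the parser's real rule loader.

```python
from xml.sax.saxutils import escape

def adjective_rule(adj_word, pred, value):
    """Render one adjective translation rule in the layout shown above."""
    lines = [
        "<rule>",
        f"<pattern> {escape(adj_word)}.a a </pattern>",
        "<define>?ITEMr </define>",
        '<body>(is-node ?ITEMr "R")</body>',
        "<define>?ITEMl </define>",
        # The predicate and value come from the adjSemTrans template.
        f"<body>({escape(pred)} ?ITEMr {escape(value)})</body>",
        "</rule>",
    ]
    return "\n".join(lines)

rule_text = adjective_rule("white", "mainColorOfObject", "WhiteColor")
```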
21Mined Finger Descriptions
- 000010 (plausiblePredValueOfType Finger feelsSensation (PositiveAmountFn LevelOfSoreness))
- 000037 (plausiblePredValueOfType Finger forceCapacity Strong)
- 000025 (plausiblePredValueOfType Finger forceCapacity Strong)
- 000025 (plausiblePredValueOfType Finger hardnessOfObject Hard)
- 000037 (plausiblePredValueOfType Finger hardnessOfObject (MediumToVeryHighAmountFn Hardness))
- 000037 (plausiblePredValueOfType Finger hardnessOfObject (MediumToVeryHighAmountFn Hardness))
- 000002 (plausiblePredValueOfType Finger hasEvaluativeQuantity (MediumToVeryHighAmountFn Goodness-Generic))
- 000002 (plausiblePredValueOfType Finger hasPhysicalAttractiveness GoodLooking)
- 000047 (plausiblePredValueOfType Finger isa (LeftObjectOfPairFn REPLACE))
- 000015 (plausiblePredValueOfType Finger isa (RightObjectOfPairFn REPLACE))
- 000155 (plausiblePredValueOfType Finger lengthOfObject (RelativeGenericValueFn lengthOfObject REPLACE highAmountOf))
- 000155 (plausiblePredValueOfType Finger lengthOfObject (RelativeGenericValueFn lengthOfObject REPLACE highToVeryHighAmountOf))
- 000003 (plausiblePredValueOfType Finger mainColorOfObject BlackColor)
- 000010 (plausiblePredValueOfType Finger mainColorOfObject LightYellowishBrown-Color)
- 000010 (plausiblePredValueOfType Finger mainColorOfObject ModerateYellowishBrown-Color)
- 000010 (plausiblePredValueOfType Finger mainColorOfObject SunTan-FleshColor)
- 000002 (plausiblePredValueOfType Finger possessiveRelation SuddenChange)
22Mined Finger Descriptions
- 000006 (plausiblePredValueOfType Finger possessiveRelation (HighAmountFn Speed))
- 000094 (plausiblePredValueOfType Finger rigidityOfObject (HighAmountFn Rigidity))
- 000060 (plausiblePredValueOfType Finger sizeParameterOfObject (RelativeGenericValueFn sizeParameterOfObject REPLACE highAmountOf))
- 000052 (plausiblePredValueOfType Finger sizeParameterOfObject (RelativeGenericValueFn sizeParameterOfObject REPLACE highToVeryHighAmountOf))
- 000060 (plausiblePredValueOfType Finger sizeParameterOfObject (RelativeGenericValueFn sizeParameterOfObject REPLACE highToVeryHighAmountOf))
- 000285 (plausiblePredValueOfType Finger sizeParameterOfObject (RelativeGenericValueFn sizeParameterOfObject REPLACE veryLowToLowAmountOf))
- 000074 (plausiblePredValueOfType Finger sizeParameterOfObject (RelativeGenericValueFn sizeParameterOfObject REPLACE veryLowToLowAmountOf))
- 000029 (plausiblePredValueOfType Finger speedOfObject-Underspecified (LowAmountFn Speed))
- 000138 (plausiblePredValueOfType Finger surfaceFeatureOfObj Slippery)
- 000074 (plausiblePredValueOfType Finger temperatureOfObject Warm)
- 000004 (plausiblePredValueOfType Finger textureOfObject Rough)
- 000168 (plausiblePredValueOfType Finger thicknessOfObject (RelativeGenericValueFn thicknessOfObject REPLACE highAmountOf))
- 000168 (plausiblePredValueOfType Finger thicknessOfObject (RelativeGenericValueFn thicknessOfObject REPLACE highToVeryHighAmountOf))
- 000182 (plausiblePredValueOfType Finger wetnessOfObject Wet)
23Verb Semantic Filtering - 1: Discovering what a finger can do
- A similar process can be used to find information based on verb semantic parsing frames
- For each potential <NOUNWORD>-<VERB> pair, query Cyc to find basic relationships using the verb semantic templates
- (and
    (denotation <NOUNWORD> ?NOUNTYPE ?N ?CYCTERM)
    (wordForms ?WORD ?PRED "<VERB>")
    (speechPartPreds ?POS ?PRED)
    (semTransPredForPOS ?POS ?SEMTRANSPRED)
    (?SEMTRANSPRED ?WORD ?NUM ?FRAME ?TEMPLATE))
- Verify, for each potential relationship (<SPRED> <VERBTERM> <CYCTERM>) derivable from ?TEMPLATE, that it makes sense in the ontology
- (and
    (arg1Isa <SPRED> ?VTYP)
    (arg2Isa <SPRED> ?CTYP)
    (genls <CYCTERM> ?CTYP)
    (genls <VERBTERM> ?VTYP))
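The verification query amounts to checking both arguments against the predicate's argument types through genls. A minimal sketch follows, assuming a tiny hand-written ontology: the genls links and the argument types for primaryObjectMoving are illustrative guesses, not assertions copied from Cyc, and is_spec and plausible are hypothetical helper names.

```python
# Toy slice of an ontology; these genls links are illustrative assumptions.
GENLS = {
    "Finger": {"AnimalBodyPart"},
    "AnimalBodyPart": {"PartiallyTangible"},
    "Person": {"Agent-Generic"},
    "MovementEvent": {"Event"},
    "ChangeOfResidence": {"Action"},
    "Action": {"Event"},
}
ARG_ISA = {  # predicate -> (arg1Isa type, arg2Isa type)
    "primaryObjectMoving": ("MovementEvent", "PartiallyTangible"),  # assumed
    "performedBy": ("Action", "Agent-Generic"),
}

def is_spec(term, ancestor):
    """True if term equals ancestor or reaches it by following genls links."""
    seen, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(GENLS.get(current, ()))
    return False

def plausible(spred, verb_term, cyc_term):
    """Check (SPRED VERBTERM CYCTERM) against the predicate's argument types."""
    vtyp, ctyp = ARG_ISA[spred]
    return is_spec(verb_term, vtyp) and is_spec(cyc_term, ctyp)

moves = plausible("primaryObjectMoving", "MovementEvent", "Finger")
relocates = plausible("performedBy", "ChangeOfResidence", "Finger")
```

Under these toy types a finger can be the object moving in a movement event, but it cannot perform a change of residence because it is not an agent.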
24Verb Semantic Filtering - 2: Templates of Movement
- (verbSemTrans Move-TheWord 0 IntransitiveVerbFrame
    (and
      (isa :ACTION MovementEvent)
      (primaryObjectMoving :ACTION :SUBJECT)))
- (verbSemTrans Move-TheWord 1 IntransitiveVerbFrame
    (and
      (isa :ACTION ChangeOfResidence)
      (performedBy :ACTION :SUBJECT)))
- (verbSemTrans Move-TheWord 2 TransitiveNPFrame
    (and
      (isa :ACTION CausingAnotherObjectsTranslationalMotion)
      (objectActedOn :ACTION :OBJECT)
      (doneBy :ACTION :SUBJECT)))
- (arg1Isa performedBy Action)
- (arg2Isa performedBy Agent-Generic)
25Verb Semantic Filtering - 3
- BURC can use Cyc's knowledge of what things can perform what actions or have what attributes to filter out implausible relationships.
- (behaviorCapableOf Finger CausingAnotherObjectsTranslationalMotion doneBy)
- (behaviorCapableOf Finger ChangeOfResidence performedBy)
- (behaviorCapableOf Finger Inspecting performedBy)
- (behaviorCapableOf Finger Movement-TranslationEvent primaryObjectMoving)
- (behaviorCapableOf Finger MovementEvent primaryObjectMoving)
- (behaviorCapableOf Finger PushingAnObject providerOfMotiveForce)
- (behaviorCapableOf Finger Sliding-Generic objectMoving)
- (behaviorCapableOf Finger Sliding-Generic primaryObjectMoving)
- (behaviorCapableOf Finger Slipping objectMoving)
- (behaviorCapableOf Finger Slipping primaryObjectMoving)
- Cyc can help in its own knowledge entry process. 62% of generated hypotheses were filtered out using semantic role filtering.
26Other Direct Extraction Rules
- Some underspecified patterns exist based just on the links
- This could be used to extract ConceptNet-like output directly from link records
- Examples
  - <obj1> ss <act>.v os <obj2> -> capableOf(<obj1>, <act> <obj2>)
  - <act>.v os <obj> -> CapableOfReceivingAction(<obj>, <act>)
  - <obj> s <act>.v -> capableOf(<obj>, <act>)
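The three example patterns can be applied to link records with a few lines of code. The sketch below assumes records arrive as tuples of terms and lowercase link labels, as in the examples; extract_relations and the sample records are hypothetical.

```python
def extract_relations(record):
    """Map one link record to ConceptNet-like (relation, arg1, arg2) tuples.

    record is (term1, link1, term2) or (term1, link1, term2, link2, term3).
    """
    def word(term):
        # Drop the part-of-speech suffix: "push.v" -> "push"
        return term.rsplit(".", 1)[0]

    out = []
    if len(record) == 5:
        obj1, link1, act, link2, obj2 = record
        if link1 == "ss" and link2 == "os" and act.endswith(".v"):
            out.append(("capableOf", word(obj1), f"{word(act)} {word(obj2)}"))
    elif len(record) == 3:
        left, link, right = record
        if link == "os" and left.endswith(".v"):
            out.append(("CapableOfReceivingAction", word(right), word(left)))
        elif link == "s" and right.endswith(".v"):
            out.append(("capableOf", word(left), word(right)))
    return out

two_link = extract_relations(("finger.n", "ss", "push.v", "os", "button.n"))
object_only = extract_relations(("push.v", "os", "button.n"))
subject_only = extract_relations(("finger.n", "s", "move.v"))
```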
27Quest for Metrics
- Percentage of hypotheses that make sense to a panel of judges
- Percentage of hypotheses that are already known to Cyc
- Percentage of hypotheses that are known in other knowledge sources (WordNet, SUMO/MILO, VerbOcean, MIT OpenMind)
- Number of hypotheses generated vs. number of records
- What percentage of relations in Cyc can be found in the fragment pool
- The Pattern Precision measure
- Maybe compare against KNEXT, but need to see if they return real numbers
- Unfortunately we don't know all possible knowledge (otherwise we wouldn't be doing this); if we did, we could measure recall and precision.
- Simple space estimate: 2.3K binary predicates x 85K constants x 85K constants = 16.6175 T simple possibilities
28Desired Outputs
- Version of link grammar for bulk reading and generating fragments
- Database control program to queue texts, monitor their processing, and merge the fragment results
- The database of fragments with fragment counts for some corpus
- The hypothesis set generated by the system
- Optionally, an OpenMind/ConceptNet-like set of commonsense factoids
- Open enough that others could duplicate the work
29Did any of that make sense?
- Comments?
- Questions?
- Suggestions?