WRAPPER MAINTENANCE - PowerPoint PPT Presentation

About This Presentation

Title: WRAPPER MAINTENANCE

Description: What a wrapper is, how it is produced, and the definition of the wrapper maintenance problem. A brief presentation of selected works on wrapper ... (Raposo, Pan, Viña, Álvarez) ...

Slides: 41
Provided by: tsourakaki

Transcript and Presenter's Notes


1
WRAPPER MAINTENANCE
  • Diploma Thesis
  • Charalampos E. Tsourakakis
  • Academic Year 2005-2006

2
PRESENTATION OUTLINE
  • What a wrapper is, how it is produced, and the
    definition of the wrapper maintenance problem.
  • A brief presentation of selected works on wrapper
    verification and wrapper reinduction.
  • Presentation of our work: idea and results.
  • Conclusions and future work.

3
80% of web pages contain data that
originate from a database.
4
What is a web wrapper?
  • A web wrapper is a program that, based on a set of
    rules, automatically extracts information from web
    pages and stores it in a structured form.
  • The rule set relies on the regularity with which the
    information is presented to the user (the layout).
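
The definition above can be illustrated with a minimal sketch. The extraction rule, the page markup, and the names `ROW_RULE` and `wrap` are all hypothetical and only show the general idea of a rule that depends on layout regularity:

```python
import re

# Hypothetical extraction rule: a regular expression tied to the page layout.
ROW_RULE = re.compile(r"<li><b>(?P<name>[^<]+)</b>: (?P<price>[0-9.]+) EUR</li>")


def wrap(html: str) -> list[dict]:
    """Apply the rule to a page and return the records in structured form."""
    return [
        {"name": m.group("name"), "price": float(m.group("price"))}
        for m in ROW_RULE.finditer(html)
    ]


page = "<ul><li><b>Book</b>: 12.50 EUR</li><li><b>Pen</b>: 1.20 EUR</li></ul>"
records = wrap(page)

# The same data under a different layout: the rule no longer matches anything.
changed_page = "<table><tr><td>Book</td><td>12.50</td></tr></table>"
records_after_change = wrap(changed_page)
```

Because the rule encodes the layout, a layout change leaves `records_after_change` empty even though the underlying data is still on the page.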

5
Data Integration wrappers
6
Ways of producing wrappers
  • 1st way
  • A program written in some programming language.
  • Not cost-effective.
  • 2nd way
  • Wrapper Induction System
  • Input: a set of web pages with examples of the
    desired information.
  • Output: a wrapper

7
Wrapper Maintenance
  • Web pages change their layout quite often, and
    sometimes their content as well.
  • The wrapper's set of extraction rules then stops
    extracting the desired information.
  • Wrapper Maintenance
  • Wrapper verification, wrapper reinduction

8
(No Transcript)
9
PRESENTATION OUTLINE
  • What a wrapper is, how it is produced, and the
    definition of the wrapper maintenance problem.
  • A brief presentation of selected works on wrapper
    verification and wrapper reinduction.
  • Presentation of our work: idea and results.
  • Conclusions and future work.

10
STRAWMAN
11
RAPTURE
  • The first content-based method (Kushmerick).
  • Features such as the density of HTML characters,
    which follow the normal distribution.
  • From the verified information it computes the
    estimators μ1, σ1, as well as the probabilities that
    the data extracted for each attribute take the
    values they do.
  • Verified probability and computation of μ2, σ2.
  • For the testing pages it uses the estimators μ1, σ1
    to compute, for each feature, the probability that
    the data extracted per attribute take the values
    they do.
  • Testing probability based on μ2, σ2.
  • Comparison of the testing probability with a
    threshold.
  • Other features: letter density, digit density,
    punctuation density, number of tokens, token length.
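
The two-level estimation described above can be sketched roughly as follows. This is a simplified illustration in the spirit of the bullets, not Kushmerick's exact formulation: the feature set, the use of average log-densities as the batch score, the variance floor `SIGMA_FLOOR`, and the default threshold are all assumptions.

```python
import math
import string
from statistics import mean, stdev

SIGMA_FLOOR = 1e-3  # assumption: avoids division by zero for constant features


def features(value: str) -> dict[str, float]:
    """Content features of one extracted value (feature names are illustrative)."""
    n = max(len(value), 1)
    tokens = value.split()
    return {
        "letter_density": sum(c.isalpha() for c in value) / n,
        "digit_density": sum(c.isdigit() for c in value) / n,
        "punct_density": sum(c in string.punctuation for c in value) / n,
        "token_count": float(len(tokens)),
        "mean_token_length": mean([len(t) for t in tokens]) if tokens else 0.0,
    }


def log_normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Log-density of x under a normal distribution with a floored sigma."""
    sigma = max(sigma, SIGMA_FLOOR)
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma * math.sqrt(2 * math.pi))


def batch_score(batch, level1) -> float:
    """Average log-density of a batch of values under the per-feature normals."""
    total = 0.0
    for value in batch:
        feats = features(value)
        for name, (mu1, sigma1) in level1.items():
            total += log_normal_pdf(feats[name], mu1, sigma1)
    return total / len(batch)


def fit(verified_batches):
    """Estimate (mu1, sigma1) per feature, then (mu2, sigma2) of the batch scores."""
    rows = [features(v) for batch in verified_batches for v in batch]
    level1 = {
        name: (mean([r[name] for r in rows]), stdev([r[name] for r in rows]))
        for name in rows[0]
    }
    scores = [batch_score(b, level1) for b in verified_batches]
    return level1, (mean(scores), stdev(scores))


def verify(batch, level1, level2, threshold=0.05) -> bool:
    """Accept the batch if its score is plausible under (mu2, sigma2)."""
    mu2, sigma2 = level2
    z = (batch_score(batch, level1) - mu2) / max(sigma2, SIGMA_FLOOR)
    probability = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return probability >= threshold


verified = [["12.50", "3.99", "7.25"], ["8.10", "15.75", "22.49"], ["9.99", "4.35", "1.20"]]
level1, level2 = fit(verified)
```

With these made-up price values, a new batch of prices is accepted, while a batch of placeholder strings such as "N/A" is rejected because its letter density is far outside what was seen in the verified data.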

12
Wrapper Verification (Lerman, Minton, Knoblock)
  • An improvement on the RAPTURE algorithm.
  • DATAPROG: an algorithm for finding patterns in the
    information.
  • Pearson's statistical test.
  • For each common pattern, one term is added, defined
    in terms of:
  • N = tuples of the training attribute, n = tuples of
    the testing attribute, r_i = tuples that follow the
    pattern p_i

13
WRAPPER REINDUCTION (Raposo, Pan, Viña, Álvarez)
  • Query results are stored in a database while the
    wrapper is operating correctly.
  • Examples are located in the changed web pages.
  • The WI system is fed with the changed web pages and
    the examples.

14
PRESENTATION OUTLINE
  • What a wrapper is, how it is produced, and the
    definition of the wrapper maintenance problem.
  • A brief presentation of selected works on wrapper
    verification and wrapper reinduction.
  • Presentation of our work: idea and results.
  • Conclusions and future work.

15
Why did we emphasize wrapper verification?
  • A good reinduction system does not perform well
    without a good verification system.
  • The need to automate the verification part is
    greater than for the reinduction part, because WI
    systems already exist.

16
ARMAGEDDON
  • CHARACTERISTICS of the VERIFICATION module
  • Content-based system.
  • A composite algorithm that exploits the structure of
    the extracted information as much as possible.
  • Robust.
  • Very good performance on the verification task.
  • CHARACTERISTICS of the REINDUCTION module
  • Simple idea.
  • Good performance on pages with static content.
  • A helper tool for the user.

17
Verification System
  • Input: training attribute, testing attribute.
  • Suppose the information in the testing attribute is
    the correct one.
  • Then it has the same semantics as the training
    attribute,
  • and therefore a similar structure and similar
    patterns!

18
Stages of the verification algorithm
19
Construction of the metadata vectors
20
Example of a vector
  • Input:
  • 12 Aiginitoy Street
  • 11 Antifylou Street
  • 42 Hrwwn Polytexneiou Street
  • 25 Laodikeias Street
  • 53 Papagou Avenue
  • Vector:
  • <address, 5, 1, INTEGER CAPITALIZED CAPITALIZED, 5,
    0.101, 0.111, 0.707, 0, 5.6875, 3.2>
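
As a rough illustration of how such a vector could be derived from the input records, here is a minimal sketch. Only the names `digDen` and `averNumOfTokensPerLine` come from the presentation; the other keys, the choice to ignore whitespace when computing densities, and the function name are assumptions, so the values are not expected to reproduce every number in the slide's vector.

```python
def metadata_vector(records: list[str]) -> dict[str, float]:
    """Compute a few content statistics over the records of one attribute."""
    tokens_per_record = [record.split() for record in records]
    tokens = [tok for toks in tokens_per_record for tok in toks]
    chars = "".join(tokens)  # assumption: densities ignore whitespace
    n_chars = max(len(chars), 1)
    return {
        "numRecords": float(len(records)),
        "digDen": sum(c.isdigit() for c in chars) / n_chars,
        "upperDen": sum(c.isupper() for c in chars) / n_chars,   # hypothetical name
        "lowerDen": sum(c.islower() for c in chars) / n_chars,   # hypothetical name
        "averTokenLength": len(chars) / max(len(tokens), 1),      # hypothetical name
        "averNumOfTokensPerLine": len(tokens) / max(len(records), 1),
    }


addresses = [
    "12 Aiginitoy Street",
    "11 Antifylou Street",
    "42 Hrwwn Polytexneiou Street",
    "25 Laodikeias Street",
    "53 Papagou Avenue",
]
vector = metadata_vector(addresses)
```

For the five addresses, the 16 tokens over 5 records give an average of 3.2 tokens per line, which agrees with the last component of the vector shown above.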

21
Finding patterns
  • Construction of a hierarchy of token types.
  • A lexical analyzer assigns to each token the most
    specific type it can take:
  • CS123: ALPHANUM
  • 12: INTEGER
  • 12.3: DECIMAL
  • DATABASE: ALLUPPERCASE
  • course: ALLLOWERCASE
  • !: PUNCT
  • Alice: CAPITALIZED
  • TheBook: ALPHABETIC
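
A lexical analyzer of this kind can be sketched with ordered regular expressions, where the first match is the most specific type. The exact hierarchy and the regular expressions are assumptions (ASCII-only for brevity); only the type names and the example tokens come from the slide.

```python
import re

# Ordered from most specific to least specific; the first full match wins.
TOKEN_TYPES = [
    ("INTEGER", re.compile(r"\d+")),
    ("DECIMAL", re.compile(r"\d+\.\d+")),
    ("ALLUPPERCASE", re.compile(r"[A-Z]+")),
    ("ALLLOWERCASE", re.compile(r"[a-z]+")),
    ("CAPITALIZED", re.compile(r"[A-Z][a-z]+")),
    ("ALPHABETIC", re.compile(r"[A-Za-z]+")),
    ("ALPHANUM", re.compile(r"[A-Za-z0-9]+")),
    ("PUNCT", re.compile(r"[^\w\s]+")),
]


def token_type(token: str) -> str:
    """Return the most specific type whose pattern matches the whole token."""
    for name, pattern in TOKEN_TYPES:
        if pattern.fullmatch(token):
            return name
    return "UNKNOWN"  # hypothetical fallback for tokens outside the hierarchy


examples = ["CS123", "12", "12.3", "DATABASE", "course", "!", "Alice", "TheBook"]
types = {token: token_type(token) for token in examples}
```

Ordering the list from specific to general keeps the classifier a simple linear scan; a real hierarchy would presumably be a tree so that a token's more general ancestor types can also be retrieved.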
22
Finding patterns
  • The length of the starting patterns is determined
    from the mean number of tokens per record.
  • Null hypothesis testing.
  • Central limit theorem.
  • Z-test
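
One common way to combine the ingredients listed above is a one-sided z-test on a count, using the normal approximation that the central limit theorem justifies. The sketch below assumes that formulation; the baseline probability `p0`, the function name, and the 5% significance level are illustrative assumptions rather than details taken from the presentation.

```python
import math

Z_CRITICAL_ONE_SIDED_05 = 1.645  # standard normal quantile for alpha = 0.05


def is_significant(k: int, n: int, p0: float,
                   z_crit: float = Z_CRITICAL_ONE_SIDED_05) -> bool:
    """Test whether a token type seen in k of n records is significant.

    Null hypothesis: the type occurs at this position with baseline
    probability p0, so the count is approximately normal with
    mean n*p0 and variance n*p0*(1-p0).
    """
    expected = n * p0
    std = math.sqrt(n * p0 * (1.0 - p0))
    if std == 0.0:
        return k > expected
    z = (k - expected) / std
    return z > z_crit  # reject the null hypothesis when z is large


# 40 of 50 records start with the type, against a baseline of 0.3.
strong = is_significant(k=40, n=50, p0=0.3)
# 16 of 50 records is close to what the baseline alone would produce.
weak = is_significant(k=16, n=50, p0=0.3)
```

When the null hypothesis is rejected, the token type would be kept as part of a candidate pattern; otherwise it would be discarded.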

23
How it works (1)
  • Position 1: the type CAPITALIZED is not statistically
    significant (null hypothesis).
  • If the null hypothesis is rejected:

24
How it works (2)
  • Position 2: the type ALPHANUMERIC is not statistically
    significant after the type CAPITALIZED (null
    hypothesis).
  • If the null hypothesis is rejected:

25
How it works (3)
  • A pattern tree is produced.
  • depth = f(mean number of tokens per record).
  • Traversing the tree yields the starting patterns!

26
Pearson's test (goodness-of-fit method)
  • Training (ver) and testing (test) metadata
    vectors.
  • Test of the similarity of the vectors.
  • For digDen (x1), ..., averNumOfTokensPerLine (x6)

27
Penalty system
  • If q < χ, where χ = χ²(freedomDegrees − 1, 0.05), the
    system proceeds to a penalty system.
  • Reason: q can increase without the degrees of
    freedom increasing correspondingly.
  • Composite system.
  • Same semantics, similar patterns.
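
The comparison of q against a chi-square critical value can be sketched with the standard Pearson goodness-of-fit statistic. How the presentation maps its six features onto observed and expected values is not recoverable from the transcript, so the generic form below, the degrees of freedom taken as the number of categories minus one, and all function names are assumptions.

```python
# Chi-square critical values at significance level 0.05, indexed by degrees of freedom.
CHI2_CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070, 6: 12.592}


def pearson_q(observed: list[float], expected: list[float]) -> float:
    """Pearson statistic: sum over categories of (observed - expected)^2 / expected."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def passes_pearson(observed: list[float], expected: list[float]) -> bool:
    """True when q stays below the critical value, i.e. the fit is not rejected."""
    degrees_of_freedom = len(observed) - 1  # assumption about how df is counted
    return pearson_q(observed, expected) < CHI2_CRITICAL_05[degrees_of_freedom]


q_close = pearson_q([12, 18, 30], [10, 20, 30])
close = passes_pearson([12, 18, 30], [10, 20, 30])
far = passes_pearson([30, 20, 10], [10, 20, 30])
```

In the first case q is about 0.6, well below the critical value 5.991 for two degrees of freedom; in the second case q exceeds 50 and the fit is rejected.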

28
Basic concepts of the penalty system (1)
  • Groups of correlated token types.
  • Group 1 = {ALPHANUM}
  • Group 2 = {ALPHABETIC, ALLUPPERCASE, ALLLOWERCASE,
    CAPITALIZED}
  • Group 3 = {INTEGER, DECIMAL}
  • Correlated patterns.
  • Two patterns are correlated if there is a one-to-one
    correspondence between their token types in at
    least a certain number of positions, where that
    number is a function of the number of tokens of the
    shorter pattern.
  • Correlated pattern sets P1 = {p11, ..., p1m},
    P2 = {p21, ..., p2n}.
  • The sets are correlated if every pattern p1i is
    correlated with some p2j, and vice versa.
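
The definitions above can be sketched as follows. The three groups come from the slide; reading "correspondence" as "token types at the same position belong to the same group", and the threshold `min_fraction` of the shorter pattern's length, are assumptions, since the transcript does not give the actual function.

```python
import math

TOKEN_GROUPS = [
    {"ALPHANUM"},
    {"ALPHABETIC", "ALLUPPERCASE", "ALLLOWERCASE", "CAPITALIZED"},
    {"INTEGER", "DECIMAL"},
]


def same_group(a: str, b: str) -> bool:
    """Two token types correspond if they are equal or share a group."""
    return a == b or any(a in group and b in group for group in TOKEN_GROUPS)


def correlated(p1: list[str], p2: list[str], min_fraction: float = 0.5) -> bool:
    """Patterns are correlated if enough aligned positions correspond.

    min_fraction is a hypothetical stand-in for the unspecified function
    of the shorter pattern's token count.
    """
    shorter = min(len(p1), len(p2))
    matches = sum(same_group(a, b) for a, b in zip(p1, p2))
    return matches >= math.ceil(min_fraction * shorter)


def correlated_sets(P1: list[list[str]], P2: list[list[str]]) -> bool:
    """Every pattern in P1 is correlated with some pattern in P2, and vice versa."""
    forward = all(any(correlated(p, q) for q in P2) for p in P1)
    backward = all(any(correlated(p, q) for p in P1) for q in P2)
    return forward and backward


P_ver = [["INTEGER", "CAPITALIZED", "CAPITALIZED"]]
P_test = [["DECIMAL", "ALLUPPERCASE", "CAPITALIZED", "CAPITALIZED"]]
P_other = [["ALPHANUM"]]
```

Here `P_ver` and `P_test` are correlated sets because their patterns agree group-wise in the first three positions, whereas `P_other` is not correlated with `P_ver`.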

29
Basic concepts of the penalty system (2)
  • The number of records in the training attribute that
    gives us increased confidence that we have seen
    most of the patterns during training.
  • Case analysis on the relationship between the
    pattern sets Pver and Ptest.

30
Basic concepts of the penalty system (3)
  • Other parameters taken into account are:
  • The cardinalities of Pver, Ptest, and Pcommon.
  • The percentage of records of Ptest that are covered
    by the common patterns.

31
Reinduction System
  • Adapted to the STALKER implementation we had
    available (a single-slot, not a multi-slot,
    extractor).
  • A brute-force algorithm that searches the changed
    web pages for examples of correct information.
  • Output: annotation files in the format that STALKER
    expects.
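
The brute-force search for examples can be sketched as a plain substring scan of a changed page for values that were stored while the wrapper still worked. The function name and the (value, start, end) output are hypothetical; in particular, this is not STALKER's annotation file format, which the transcript does not describe.

```python
def find_examples(page_html: str, stored_values: list[str]) -> list[tuple[str, int, int]]:
    """Locate every occurrence of each stored value in the changed page.

    Returns (value, start, end) character offsets, which an annotation
    writer could then convert into whatever format the induction system needs.
    """
    found = []
    for value in stored_values:
        start = page_html.find(value)
        while start != -1:
            found.append((value, start, start + len(value)))
            start = page_html.find(value, start + 1)
    return found


changed_page = '<div class="r"><span>Book</span><em>12.50</em></div>'
stored = ["Book", "12.50", "Pen"]  # values extracted before the layout changed
examples = find_examples(changed_page, stored)
```

Values that no longer appear on the page (here "Pen") simply yield no example, which is one reason such a search works best on pages with static content.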

32
Possible outcomes of the verification system
  • a: the system concludes that the wrapper works
    correctly
  • b: in reality the wrapper works correctly
  • The combinations of a and b give 4 possible outcomes
    for the system

33
Evaluation metrics for the verification system
  • ac = accuracy = (TP + TN) / (TP + FP + FN + TN)
  • up = unchanged precision = TP / (TP + FP)
  • cp = changed precision = TN / (TN + FN)
  • ur = unchanged recall = TP / (TP + FN)
  • cr = changed recall = TN / (TN + FP)
  • Fchanged = (2 · cr · cp) / (cr + cp)
  • Funchanged = (2 · ur · up) / (ur + up)
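
These metrics follow directly from the four outcome counts. The sketch below assumes that a "positive" is the case where the wrapper still works (the page is unchanged); the function names, the zero-denominator convention, and the example counts are illustrative and are not results from the presentation.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Safe division; returns 0.0 when the denominator is zero (an assumption)."""
    return numerator / denominator if denominator else 0.0


def verification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Compute the metrics listed above from the confusion-matrix counts."""
    up = ratio(tp, tp + fp)  # unchanged precision
    cp = ratio(tn, tn + fn)  # changed precision
    ur = ratio(tp, tp + fn)  # unchanged recall
    cr = ratio(tn, tn + fp)  # changed recall
    return {
        "ac": ratio(tp + tn, tp + fp + fn + tn),
        "up": up,
        "cp": cp,
        "ur": ur,
        "cr": cr,
        "Fchanged": ratio(2 * cr * cp, cr + cp),
        "Funchanged": ratio(2 * ur * up, ur + up),
    }


# Made-up counts, for illustration only.
m = verification_metrics(tp=8, fp=1, fn=2, tn=9)
```

With these counts the accuracy is 17/20 = 0.85, and the two F-measures reduce to 2·TN/(2·TN + FN + FP) = 18/21 and 2·TP/(2·TP + FP + FN) = 16/19.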

34
RAPTURE DATASET
  • 16 query-able web sites
  • Indicatively, we mention:
  • www.altavista.com,
  • www.uk.lycos.de,
  • www.thriveonline.com,
  • www.news.com,
  • www.usnews.com
  • From each site we extracted information for 1 to 8
    attributes.

35
  • WEB SITE level
  • ac = 100, up = 100, cp = 100, ur = 100, cr = 100,
  • Fchanged = 100
  • Funchanged = 100
  • ATTRIBUTE level
  • ac = 99.37, up = 100, cp = 96.55, ur = 99.23, cr = 100,
  • Fchanged = 99.82
  • Funchanged = 99.61

36
Evaluation of the wrapper reinduction system
37
Observation
  • The reinduction system can be used to help the user
    produce a correct wrapper very easily.

38
PRESENTATION OUTLINE
  • What a wrapper is, how it is produced, and the
    definition of the wrapper maintenance problem.
  • A brief presentation of selected works on wrapper
    verification and wrapper reinduction.
  • Presentation of our work: idea and results.
  • Conclusions and future work.

39
Conclusions
  • Development of a robust, content-based system for
    wrapper verification.
  • Unlike earlier systems, it does not rely at all on
    HTML quantities.
  • Emphasis on semantics.
  • An observation concerning the idea that the
    percentages of records following a given common
    pattern are the same in the verified and in the
    training attribute!

40
Future work
  • Application of the method to other problems.
  • Extensive experiments to evaluate the wrapper
    verification system.
  • Development of a more sophisticated reinduction
    system.