XML, Information Extraction and Document structuring

1
XML, Information Extraction and Document
structuring
  • Maria Teresa PAZIENZA
  • Roma Tor Vergata University
  • Italy

2
In short
  • Why XML?
  • XML was created so that richly structured
    documents could be shared over the web.
  • It requires the integration of heterogeneous and
    distributed data and information sources.

3
In short
  • XML (Extensible Markup Language) is a markup
    language for documents containing structured
    information.
  • XML is being designed to deliver structured
    content over the web.
  • A markup language is a mechanism to identify
    structures in a document.

4
In short
  • Structured information contains both content
    (words, pictures, etc.) and some indication of
    what role that content plays (for example,
    content in a section heading has a different
    meaning from content in a footnote, which means
    something different than content in a figure
    caption or content in a database table, etc.).
  • Almost all documents have some structure

5
In short
  • The word "document" refers not only to
    traditional documents, but also to the myriad of
    other XML "data formats". These include vector
    graphics, e-commerce transactions, mathematical
    equations, object meta-data, server APIs, and a
    thousand other kinds of structured information.

6
In short
  • XML specifies neither semantics nor a tag set. In
    fact XML is really a meta-language for describing
    markup languages. In other words, XML provides a
    facility to define tags and the structural
    relationships between them.
  • Since there's no predefined tag set, there can't
    be any preconceived semantics. All of the
    semantics of an XML document will either be
    defined by the applications that process them or
    by stylesheets.

7
XML documents
  1. XML documents are composed of markup and content
  2. Elements are the most common form of markup
  3. If an element is not empty, it begins with a
    start-tag, <element>, and ends with an end-tag,
    </element>.

8
XML documents
  • Attributes are name-value pairs that occur inside
    start-tags after the element name.
  • For example, <div class="preface"> is a div
    element with the attribute class having the
    value preface.
  • In XML, all attribute values must be quoted.

9
XML document
  • Example 1. A Simple XML Document
    <?xml version="1.0"?>
    <oldjoke>
      <burns>Say <quote>goodnight</quote>, Gracie.</burns>
      <allen><quote>Goodnight, Gracie.</quote></allen>
      <applause/>
    </oldjoke>
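
As a rough illustration of how markup and content combine into a tree, the sketch below parses the example document with Python's standard `xml.etree.ElementTree` module and prints one indented line per element. The helper name `outline` is made up for this example and is not part of the presentation.

```python
import xml.etree.ElementTree as ET

# The example document from the slide.
OLDJOKE = """<?xml version="1.0"?>
<oldjoke>
  <burns>Say <quote>goodnight</quote>, Gracie.</burns>
  <allen><quote>Goodnight, Gracie.</quote></allen>
  <applause/>
</oldjoke>"""


def outline(element, depth=0):
    """Return one indented line per element (hypothetical helper)."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(OLDJOKE)
tree_lines = outline(root)
print("\n".join(tree_lines))

# Attributes are name-value pairs found in the start-tag.
div = ET.fromstring('<div class="preface"/>')
print(div.attrib)
```

The empty element `<applause/>` appears in the outline like any other element; it simply has no content and no children.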

10
XML currently
  • XML is not actually a markup language; it is a
    standard that specifies a syntax that allows
    anyone to create their own markup language.
  • It is a tag-based language to describe tree
    structures with a linear syntax.
  • The markup language you create will depend on
    the task you are trying to accomplish.

11
XML currently
  • XML makes it possible to separate data from the
    processes that act on that data.

12
XML semantics
  • A single ontology covering every application
    context will never be possible.
  • No one ontology can suit all subjects and
    domains, and a community as large and
    heterogeneous as the Web will not agree on one
    complex ontology for describing all its issues.

13
DTD (Document Type Definition)
  • It is the closest thing XML offers to
    ontological modeling.
  • It defines the legal nesting of tags and
    introduces attributes for them.
  • Defining tags, their nesting and attributes for
    tags may be seen as defining an ontology.

14
XML and DTD
  • An XML document is valid if it is well-formed
    and, when it uses a DTD, conforms to that DTD.
  • DTDs are not required for XML documents; they
    make it possible to define stronger constraints
    on documents.
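
The difference between the two checks can be sketched in Python. The standard-library parser used here only tests well-formedness; checking a document against a DTD needs a validating parser, which this sketch does not attempt. The function name `is_well_formed` is an assumption made for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the text parses as XML (hypothetical helper).

    This says nothing about validity against a DTD.
    """
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


samples = {
    "quoted attribute": '<div class="preface">text</div>',
    "unquoted attribute": "<div class=preface>text</div>",
    "badly nested tags": "<burns>Say <quote>goodnight</burns></quote>",
}

results = {name: is_well_formed(text) for name, text in samples.items()}
for name, ok in results.items():
    print(name, "->", "well-formed" if ok else "not well-formed")
```

Only the first sample passes: the second breaks the rule that attribute values must be quoted, and the third closes its tags in the wrong order.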

15
DTD
  • A DTD consists of three kinds of declarations:
  • Element declarations, which define composite
    tags and value ranges for elementary tags
  • Attribute declarations, which define the
    attributes of tags
  • Entity declarations
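
To make the three kinds of declaration concrete, the sketch below embeds a small DTD in the internal subset of the "oldjoke" document and parses it with Python's standard library. The DTD itself, the `id` attribute, and the `who` entity are illustrative assumptions, not taken from the presentation. The standard-library parser reads the declarations and expands the internal entity, but it does not validate the document against the element and attribute declarations.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE oldjoke [
  <!-- element declarations: legal nesting of tags -->
  <!ELEMENT oldjoke  (burns+, allen, applause?)>
  <!ELEMENT burns    (#PCDATA | quote)*>
  <!ELEMENT allen    (#PCDATA | quote)*>
  <!ELEMENT quote    (#PCDATA)>
  <!ELEMENT applause EMPTY>
  <!-- attribute declaration: an optional id on burns (assumed) -->
  <!ATTLIST burns id ID #IMPLIED>
  <!-- entity declaration: a named piece of replacement text (assumed) -->
  <!ENTITY who "Gracie">
]>
<oldjoke>
  <burns id="b1">Say <quote>goodnight</quote>, &who;.</burns>
  <allen><quote>Goodnight, &who;.</quote></allen>
  <applause/>
</oldjoke>"""

root = ET.fromstring(DOC)

# The entity reference &who; is replaced by its declared text.
burns_text = "".join(root.find("burns").itertext())
print(burns_text)
print(root.find("burns").get("id"))
```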

16
Issues to be addressed
  • The availability of large amounts of data on the
    Web raises several issues that the XML standards
    do not address:
  • Extracting data from large repositories of XML
    documents
  • Translating XML data between different ontologies
    (DTDs)
  • Integrating XML data from multiple XML sources
  • Transporting large amounts of selected XML data
    to users

17
XML
  • A first attempt to structure the entropy of the
    Web world

18
NAMIC project
  • NAMIC aims to extract relevant facts/events from
    the news streams of large European news agencies
    and newspaper producers, to provide hypertextual
    structures within each (monolingual) stream and
    then to produce cross-lingual links among
    streams.

19
NAMIC project
  • Language-specific processors (LPs) are
    responsible for text processing and event
    matching in independent text units in each
    stream. LPs compile an objective representation
    of each source text, including the detected
    morphosyntactic information, categorization into
    news-standard (IPTC) classes, and a description
    of relevant events.

20
NAMIC Schema news
    <NEWS NEWSID="ita_1" DATE="6/7/2000" PLACE="PARIGI"
          AGENCY="ANSA" LANG="ITA">
      <TITLE>
        <PP IDEN="0">
          <P>UE: IMMIGRAZIONE; APPREZZAMENTO A PARIGI PER
            POLITICA ITALIA(2)</P>
        </PP>
      </TITLE>

21
Schema news
      <BODY>
        <PP IDEN="0">
          <P>''I risultati raggiunti - ha proseguito il
            sottosegretario - si devono all' impegno di
            polizia, carabinieri, guardia di finanza, marina
            e aeronautica militare. Ma in questa occasione
            abbiamo ribadito che, proprio perche' l'Italia e'
            un paese di frontiera e i clandestini la
            considerano spesso un passaggio per arrivare nel
            resto dell' Europa, serve l'impegno di tutti. In
            particolare, abbiamo chiesto piu' spazio e piu'
            forza investigativa per l'Europol''.</P>
        </PP>
        <PP IDEN="1">
          <P>Nella riunione di oggi, il ministro degli
            interni francese, Jean-Pierre Chevenement, ha
            proposto all'Unione europea ''cinque piste'' in
            favore del co-sviluppo: privilegiare le
            iniziative dei migranti in favore dello sviluppo
            dei paesi di origine; favorire ''l'immigrazione
            alterna'' (come fa il Mali, con i giovani di un
            villaggio che vanno per un periodo in Francia poi
            tornano e lasciano il posto a coetanei di
            villaggi vicini); favorire l'accesso alla
            formazione, con corsi programmati come prevede il
            decreto italiano; facilitare la libera
            circolazione nell'Ue dei protagonisti delle
            politiche di sviluppo; favorire la partnership
            negoziata su un piano di parita' fra gli stati
            dell'Ue e i paesi di origine.</P>
        </PP>
      </BODY>
    </NEWS>

22
    <NEWS NEWSID="ita_1" DATE="6/7/2000" PLACE="PARIGI"
          AGENCY="ANSA" LANG="ITA">
      <CAT TYPE="cro" />
      <TITLE>
        <PP IDEN="0">
          <P>UE: IMMIGRAZIONE; APPREZZAMENTO A PARIGI PER
            POLITICA ITALIA(2)</P>
          <SYNTACTIC_GRAPH>
            <LNS>
              <LEX_HANDLE IDEN="1" POSTAG="NPR" SURFACE="UE">
                <TOKEN I="1" S="UE" SS="0" SE="2" POS="1" />
                <LEMMA IDEN="0" SURFACE="UE"
                       SYNTCAT="nome.proprio"
                       MORPHFEAT="mas.fem.plur.sing." />
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="2" POSTAG="COP" SURFACE=":">
                <TOKEN I="2" S=":" SS="2" SE="3" POS="2" />
                <LEMMA IDEN="1" SURFACE=":"
                       SYNTCAT="cong.coord.duep."
                       MORPHFEAT="invariante" />
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="3" POSTAG="NCS" SURFACE="IMMIGRAZIONE">
                <TOKEN I="3" S="IMMIGRAZIONE" SS="4" SE="16" POS="3" />
                <LEMMA IDEN="2" SURFACE="immigrazione"
                       SYNTCAT="nome.comune"
                       MORPHFEAT="fem.sing." />
              </LEX_HANDLE>

23
              <LEX_HANDLE IDEN="4" POSTAG="COP" SURFACE=";">
                <TOKEN I="4" S=";" SS="16" SE="17" POS="4" />
                <LEMMA IDEN="3" SURFACE=";"
                       SYNTCAT="cong.coord.pvirg."
                       MORPHFEAT="invariante" />
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="5" POSTAG="NCS" SURFACE="APPREZZAMENTO">
                <TOKEN I="5" S="APPREZZAMENTO" SS="18" SE="31" POS="5" />
                <LEMMA IDEN="4" SURFACE="apprezzamento"
                       SYNTCAT="nome.comune"
                       MORPHFEAT="mas.sing." />
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="6" POSTAG="PSE" SURFACE="A">
                <TOKEN I="6" S="A" SS="32" SE="33" POS="6" />
                <LEMMA IDEN="5" SURFACE="a"
                       SYNTCAT="prep.sempl."
                       MORPHFEAT="invariante" />
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="7" POSTAG="NPR" SURFACE="PARIGI">
                <TOKEN I="7" S="PARIGI" SS="34" SE="40" POS="7" />
                <LEMMA IDEN="1" SURFACE="parigi"
                       SYNTCAT="nome.proprio"
                       MORPHFEAT="invariante">
                  <NEC CAT="citta" />
                </LEMMA>
              </LEX_HANDLE>

24
              <LEX_HANDLE IDEN="8" POSTAG="PSE" SURFACE="PER">
                <TOKEN I="8" S="PER" SS="41" SE="44" POS="8" />
                <LEMMA IDEN="7" SURFACE="per"
                       SYNTCAT="prep.sempl."
                       MORPHFEAT="invariante" />
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="9" POSTAG="NCS" SURFACE="POLITICA">
                <TOKEN I="9" S="POLITICA" SS="45" SE="53" POS="9" />
                <LEMMA IDEN="8" SURFACE="politica"
                       SYNTCAT="nome.comune"
                       MORPHFEAT="fem.sing." />
                <LEMMA IDEN="8" SURFACE="politico"
                       SYNTCAT="nome.comune"
                       MORPHFEAT="mas.fem.sing." />
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="10" POSTAG="NPR" SURFACE="ITALIA">
                <TOKEN I="10" S="ITALIA" SS="54" SE="60" POS="10" />
                <LEMMA IDEN="1" SURFACE="italia"
                       SYNTCAT="nome.proprio"
                       MORPHFEAT="invariante">
                  <NEC CAT="paese" />
                </LEMMA>
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="11" POSTAG="COS" SURFACE="(">
                <TOKEN I="11" S="(" SS="60" SE="61" POS="11" />
                <LEMMA IDEN="10" SURFACE="("
                       SYNTCAT="cong.subord.paren."
                       MORPHFEAT="invariante" />
              </LEX_HANDLE>
              <LEX_HANDLE IDEN="12" POSTAG="NUM" SURFACE="2">
                <TOKEN I="12" S="2" SS="61" SE="62" POS="12" />
                <LEMMA IDEN="11" SURFACE="numero_card"
                       SYNTCAT="nome.comune"
                       MORPHFEAT="invariante" />
              </LEX_HANDLE>

25
              <SYNT_LINK IDEN="117" HEAD="222" MODIFIER="227"
                         TYPE="PP_PP" PLAUS="0.16666667" />
              <SYNT_LINK IDEN="118" HEAD="220" MODIFIER="227"
                         TYPE="PP_PP" PLAUS="0.16666667" />
              <SYNT_LINK IDEN="119" HEAD="217" MODIFIER="227"
                         TYPE="PP_PP" PLAUS="0.16666667" />
              <SYNT_LINK IDEN="120" HEAD="215" MODIFIER="227"
                         TYPE="PP_PP" PLAUS="0.16666667" />
              <SYNT_LINK IDEN="121" HEAD="211" MODIFIER="227"
                         TYPE="NP_PP" PLAUS="0.16666667" />
            </SRS>
          </SYNTACTIC_GRAPH>
        </PP>
      </BODY>
    </NEWS>
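
A downstream consumer of this annotation could read the lexical handles roughly as follows. This is a minimal sketch, assuming that `SS` and `SE` are start and end character offsets into the title and that the title read "UE: IMMIGRAZIONE; ..." (the colon and semicolon are inferred from the `cong.coord.duep.` and `cong.coord.pvirg.` categories). The fragment is copied from the slides and wrapped in a closed `<LNS>` element so it parses on its own; `read_handles` is a hypothetical helper, not part of NAMIC.

```python
import xml.etree.ElementTree as ET

# Assumed original title text (punctuation inferred, see lead-in).
TITLE = "UE: IMMIGRAZIONE; APPREZZAMENTO A PARIGI PER POLITICA ITALIA(2)"

LNS = """<LNS>
  <LEX_HANDLE IDEN="3" POSTAG="NCS" SURFACE="IMMIGRAZIONE">
    <TOKEN I="3" S="IMMIGRAZIONE" SS="4" SE="16" POS="3" />
    <LEMMA IDEN="2" SURFACE="immigrazione" SYNTCAT="nome.comune"
           MORPHFEAT="fem.sing." />
  </LEX_HANDLE>
  <LEX_HANDLE IDEN="7" POSTAG="NPR" SURFACE="PARIGI">
    <TOKEN I="7" S="PARIGI" SS="34" SE="40" POS="7" />
    <LEMMA IDEN="1" SURFACE="parigi" SYNTCAT="nome.proprio"
           MORPHFEAT="invariante">
      <NEC CAT="citta" />
    </LEMMA>
  </LEX_HANDLE>
</LNS>"""


def read_handles(xml_text):
    """Collect surface form, tag, span, lemma and entity class per handle."""
    rows = []
    for handle in ET.fromstring(xml_text).iter("LEX_HANDLE"):
        token = handle.find("TOKEN")
        lemma = handle.find("LEMMA")  # first lemma only, for brevity
        nec = lemma.find("NEC")       # named-entity class, if any
        rows.append({
            "surface": handle.get("SURFACE"),
            "postag": handle.get("POSTAG"),
            "span": (int(token.get("SS")), int(token.get("SE"))),
            "lemma": lemma.get("SURFACE"),
            "entity_class": nec.get("CAT") if nec is not None else None,
        })
    return rows


rows = read_handles(LNS)
for row in rows:
    start, end = row["span"]
    print(row["surface"], row["lemma"], row["entity_class"], TITLE[start:end])
```

Under the offset assumption, slicing the title with each span gives back the token's surface form, which is a cheap consistency check on the annotation.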

26
CROSSMARC project
  • It will develop a technology for e-retail product
    comparison.
  • It will be able to process pages written in
    several languages and will employ language
    technology methods for information extraction
    which will be extended and tailored to the
    characteristics of e-shopping.

27
prodotto
    <?xml version="1.0" encoding="UTF-8" ?>
    <document>
      <source>Dell Latitude LSH 500 Lire 5516000
        Pentium III 500 Mhz, 128 MbB Sdram, disco fisso
        da 20 GB, schermo TFT Svga da 12.1 pollici, chip
        grafico NeoMagic MagicMedia 256AV con 2.5 MB,
        lettore esterno per Cd-Rom, Ethernet 10/100
        Mbit/sec. Integrata
      </source>
    </document>

28
prodotto1
    <?xml version="1.0" encoding="UTF-8" ?>
    <document>
      <source>Dell Latitude LSH 500 Lire 5516000
        Pentium III 500 Mhz, 128 MbB Sdram, disco fisso
        da 20 GB, schermo TFT Svga da 12.1 pollici, chip
        grafico NeoMagic MagicMedia 256AV con 2.5 MB,
        lettore esterno per Cd-Rom, Ethernet 10/100
        Mbit/sec. Integrata </source>
      <tokenization>
        <TOKEN Id="1" Label="dell">Dell</TOKEN>
        <TOKEN Id="2" Label="latitude">Latitude</TOKEN>
        <TOKEN Id="3" Label="lsh">LSH</TOKEN>
        <TOKEN Id="4" Label="500">500</TOKEN>
        <TOKEN Id="5" Label="lire">Lire</TOKEN>
        <TOKEN Id="6" Label="5516000">5516000</TOKEN>
        <TOKEN Id="7" Label="pentium">Pentium</TOKEN>
        <TOKEN Id="8" Label="iii">III</TOKEN>
        <TOKEN Id="9" Label="500">500</TOKEN>
        <TOKEN Id="10" Label="mhz">Mhz</TOKEN>

29
prodotto1
        <TOKEN Id="35" Label="con">con</TOKEN>
        <TOKEN Id="36" Label="2.5">2.5</TOKEN>
        <TOKEN Id="37" Label="mb">MB</TOKEN>
        <TOKEN Id="38" Label=",">,</TOKEN>
        <TOKEN Id="39" Label="lettore">lettore</TOKEN>
        <TOKEN Id="40" Label="esterno">esterno</TOKEN>
        <TOKEN Id="41" Label="per">per</TOKEN>
        <TOKEN Id="42" Label="cd-rom">Cd-Rom</TOKEN>
        <TOKEN Id="43" Label=",">,</TOKEN>
        <TOKEN Id="44" Label="ethernet">Ethernet</TOKEN>
        <TOKEN Id="45" Label="10">10</TOKEN>
        <TOKEN Id="46" Label="/">/</TOKEN>
        <TOKEN Id="47" Label="100">100</TOKEN>
        <TOKEN Id="48" Label="mbit">Mbit</TOKEN>
        <TOKEN Id="49" Label="/">/</TOKEN>
        <TOKEN Id="50" Label="sec.">sec.</TOKEN>
        <TOKEN Id="51" Label="integrata">integrata</TOKEN>
      </tokenization>

30
prodotto2
      </tokenization>
      <named-entities>
        <named-entity sem-type="Processor Name"
                      normal="Intel Pentium III">
          <TOKEN Id="7" Label="pentium">Pentium</TOKEN>
          <TOKEN Id="8" Label="iii">III</TOKEN>
        </named-entity>
        <named-entity sem-type="Processor Speed"
                      normal="500 MHz">
          <TOKEN Id="9" Label="500">500</TOKEN>
          <TOKEN Id="10" Label="mhz">Mhz</TOKEN>
        </named-entity>
        <named-entity sem-type="Screen Type"
                      normal="Active Matrix (TFT)">
          <TOKEN Id="23" Label="tft">TFT</TOKEN>
        </named-entity>
        <named-entity sem-type="Drive Types"
                      normal="CD-ROM">
          <TOKEN Id="42" Label="cd-rom">Cd-Rom</TOKEN>
        </named-entity>
        <named-entity sem-type="Ports"
                      normal="10/100Base-T">
          <TOKEN Id="44" Label="ethernet">Ethernet</TOKEN>
          <TOKEN Id="45" Label="10">10</TOKEN>

31
prodotto3
    <PRODUCT>
      <DESCRIPTION>
        <MANUF>Dell</MANUF>
        <MODEL>Latitude LSH 500</MODEL>
        <NUMEX TYPE="MONEY" NORM="2848.77" UNIT="EUR">Lire 5516000</NUMEX>
        <PROCESSOR>Pentium III</PROCESSOR>
        <NUMEX TYPE="SPEED" NORM="500" UNIT="Mhz">500 Mhz</NUMEX> ,
        <NUMEX TYPE="CAPACITY" NORM="128" UNIT="Mbyte">128 Mb</NUMEX>
        <TERM>Sdram</TERM> , <TERM>disco fisso</TERM> da
        <NUMEX TYPE="CAPACITY" NORM="20000" UNIT="Mbyte">20 GB</NUMEX> ,
        <TERM>schermo TFT Svga</TERM> da
        <NUMEX TYPE="LENGHT" NORM="12.1" UNIT="inch">12.1 pollici</NUMEX> ,
        chip grafico NeoMagic MagicMedia
        <NUMEX TYPE="SIMPLE">256AV</NUMEX> con
        <NUMEX TYPE="CAPACITY" NORM="2.5" UNIT="Mbyte">2.5 MB</NUMEX> ,
        <TERM>lettore</TERM> <LOC_ATTR>esterno</LOC_ATTR> per
        <TERM>Cd-Rom</TERM> ,
        <TERM>Ethernet 10/100</TERM> Mbit/sec.
        <LOC_ATTR>integrata</LOC_ATTR>
      </DESCRIPTION>
    </PRODUCT>

32
prodotto4
    <PRODUCT>
      <MANUF>Dell</MANUF>
      <MODEL>Latitude LSH 500</MODEL>
      <PRICE>2848.77</PRICE>
      <PROCESSOR>
        <PROCESSOR-NAME>PIII</PROCESSOR-NAME>
        <PROCESSOR-SPEED>500</PROCESSOR-SPEED>
      </PROCESSOR>
      <SCREEN>
        <SCREEN-TYPE>TFT</SCREEN-TYPE>
        <SCREEN-SIZE>12.1</SCREEN-SIZE>
      </SCREEN>
      <MEMORY>
        <STANDARD-RAM>128</STANDARD-RAM>
      </MEMORY>
      <HARD-DISK>
        <CAPACITY>20000</CAPACITY>
      </HARD-DISK>
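
The step from the annotated description to the structured product record can be sketched as below. This is only an illustration of the idea, not the CROSSMARC implementation: the input is a closed excerpt of the annotated description shown earlier, and the helpers `normalized_values` and `to_record` are hypothetical. Only the manufacturer, model, and price fields are mapped.

```python
import xml.etree.ElementTree as ET

# Closed excerpt of the annotated description (from the slides).
DESCRIPTION = """<DESCRIPTION>
  <MANUF>Dell</MANUF>
  <MODEL>Latitude LSH 500</MODEL>
  <NUMEX TYPE="MONEY" NORM="2848.77" UNIT="EUR">Lire 5516000</NUMEX>
  <PROCESSOR>Pentium III</PROCESSOR>
  <NUMEX TYPE="SPEED" NORM="500" UNIT="Mhz">500 Mhz</NUMEX>
  <NUMEX TYPE="CAPACITY" NORM="128" UNIT="Mbyte">128 Mb</NUMEX>
  <NUMEX TYPE="SIMPLE">256AV</NUMEX>
</DESCRIPTION>"""


def normalized_values(xml_text):
    """List (type, normalized value, unit, source text) for each NUMEX
    that carries a NORM attribute; unnormalized ones are skipped."""
    root = ET.fromstring(xml_text)
    return [
        (n.get("TYPE"), float(n.get("NORM")), n.get("UNIT"), n.text)
        for n in root.iter("NUMEX")
        if n.get("NORM") is not None
    ]


def to_record(xml_text):
    """Build a small PRODUCT record from the annotated description."""
    root = ET.fromstring(xml_text)
    record = ET.Element("PRODUCT")
    ET.SubElement(record, "MANUF").text = root.findtext("MANUF")
    ET.SubElement(record, "MODEL").text = root.findtext("MODEL")
    price = root.find("NUMEX[@TYPE='MONEY']")
    ET.SubElement(record, "PRICE").text = price.get("NORM")
    return ET.tostring(record, encoding="unicode")


values = normalized_values(DESCRIPTION)
record_xml = to_record(DESCRIPTION)
print(values)
print(record_xml)
```

The record keeps the normalized value (`NORM`) rather than the surface text, which is what allows products described in different languages and currencies to be compared.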

33
ontodemo
    <?xml version="1.0" encoding="UTF-8" ?>
    <category>Hardware
      <product>Laptop Computers
        <feature>Operating System
          <attribute>Operating System</attribute>
        </feature>
        <feature>Processor
          <attribute>Processor Name</attribute>
          <attribute>Processor Speed</attribute>
        </feature>
        <feature>Screen
          <attribute>Screen Type</attribute>
          <attribute>Screen Size</attribute>
          <attribute>Maximum Resolution</attribute>
        </feature>
        ... </feature>
      </product>
    </category>

34
Gazetteer
    <?xml version="1.0" encoding="UTF-8" ?>
    <data>
      <surface normal="Windows NT" sem-type="Operating System">
        <T>Windows</T>
        <T>NT</T>
      </surface>
      <surface normal="Windows NT" sem-type="Operating System">
        <T>WinNT</T>
      </surface>
      <surface normal="Windows NT" sem-type="Operating System">
        <T>NT</T>
        <T>4</T>
      </surface>

35
      <surface normal="Windows 95/98" sem-type="Operating System">
        <T>Windows</T>
        <T>95</T>
        <T>/</T>
        <T>98</T>
      </surface>
      <surface normal="Windows 95/98" sem-type="Operating System">
        <T>Windows</T>
        <T>98</T>
      </surface>
      <surface normal="Windows 95/98" sem-type="OperatingSystem">
        <T>Windows</T>
        <T>95</T>
      </surface>
      <surface normal="Windows 95/98" sem-type="OperatingSystem">
        <T>Windows95</T>
      </surface>
      <surface normal="Windows 95/98" sem-type="OperatingSystem">
        <T>Windows98</T>
      </surface>
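
One plausible way to use such a gazetteer is to match token sequences against its surface forms and return the normalized name and semantic type. The sketch below assumes case-insensitive, longest-match-first lookup; both are choices made for this example, and the function names `load_gazetteer` and `annotate` are hypothetical. The entries are a closed excerpt of the listing above.

```python
import xml.etree.ElementTree as ET

GAZETTEER = """<data>
  <surface normal="Windows NT" sem-type="Operating System">
    <T>Windows</T><T>NT</T>
  </surface>
  <surface normal="Windows NT" sem-type="Operating System">
    <T>WinNT</T>
  </surface>
  <surface normal="Windows 95/98" sem-type="Operating System">
    <T>Windows</T><T>98</T>
  </surface>
</data>"""


def load_gazetteer(xml_text):
    """Map each lower-cased token sequence to (normal form, sem-type)."""
    table = {}
    for surface in ET.fromstring(xml_text).iter("surface"):
        key = tuple(t.text.lower() for t in surface.iter("T"))
        table[key] = (surface.get("normal"), surface.get("sem-type"))
    return table


def annotate(tokens, table):
    """Return (start, end, normal form, sem-type) for each match,
    trying the longest candidate sequence first at every position."""
    longest = max(len(key) for key in table)
    found, i = [], 0
    while i < len(tokens):
        for size in range(min(longest, len(tokens) - i), 0, -1):
            key = tuple(t.lower() for t in tokens[i:i + size])
            if key in table:
                found.append((i, i + size) + table[key])
                i += size
                break
        else:
            i += 1  # no entry starts at this token
    return found


table = load_gazetteer(GAZETTEER)
matches = annotate(["Laptop", "con", "Windows", "NT", "preinstallato"], table)
print(matches)
```

Several surface forms share one `normal` value, so "Windows NT" and "WinNT" both resolve to the same normalized operating-system name.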