Title: XG Multimedia Semantic News Use Case
1XG Multimedia Semantic News Use Case
- Thierry Declerck, DFKI GmbH
- Language Technology Lab
2Automatic Semantic Analysis of Metadata associated
with News Videos of Broadcasting Companies
- On-going work in the projects K-Space and MESH
3Metadata of News Broadcasters
- We analysed the metadata available from various broadcasters.
- Their data consists of audio/video material and textual metadata. This is a very valuable data set, since the textual metadata also includes manually annotated scene descriptions.
- This dataset can be used for building a training corpus for automated alignment of video, audio and text data.
- The next slides show an abstraction over the various types of metadata provided.
4The Metadata Labels
- <DOC filename=0324000-3_Journal_ENG_F4001C_26122003_2000>
- <TYPE>Earthquake Iran</TYPE>
- <SERIES>Journal F 4001 C</SERIES>
- <SEG sid=integer>
- <TITLE></TITLE>
- <DESCRIPTION></DESCRIPTION>
- <SCENES> </SCENES>
- <KEYWORDS></KEYWORDS>
- </SEG>
- </DOC>
5The Title Tag
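- <TITLE>
- TdT Erdbeben /Iran/Zerstörungen in Bam/Trümmer/Ruinen/Opfer (in English: earthquake / Iran / destruction in Bam / rubble / ruins / victims)
- </TITLE>
- Extracted: Erdbeben ("earthquake", a keyword for the disaster ontology) and the location Iran (via named-entity detection). The title contains other terms as well, but their role is still unclear.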
6The Description Tag
- <DESCRIPTION>
- Ein schweres Erdbeben hat im Iran die Stadt Bam fast völlig zerstört. (In English: "A severe earthquake has almost completely destroyed the city of Bam in Iran.")
- </DESCRIPTION>
- Linguistic and semantic analysis:
- [Subj-NP Ein schweres <noun-disaster>Erdbeben</noun-disaster>] [V hat] [LOC-PP im <ne-country>Iran</ne-country>] [OBJ-NP die Stadt <ne-city>Bam</ne-city>] [ADV fast völlig] [V zerstört].
- Extraction:
- Who (causation): Erdbeben (earthquake)
- What_action: zerstören (destroy)
- What: Stadt Bam (city of Bam). Here the system can infer that Bam is located in Iran.
- Where: Iran
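The extraction step on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the K-Space/MESH implementation: the chunk labels and semantic tags are taken from the slide, while the `Chunk` type, the `extract_event` function and the hand-built input are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    label: str                  # syntactic label from the slide, e.g. "Subj-NP"
    text: str                   # surface string of the chunk
    sem: Optional[str] = None   # semantic tag of the head, e.g. "ne-city"
    head: Optional[str] = None  # head word (lemma for the main verb)


# Hand-built analysis of the DESCRIPTION sentence (assumed parser output).
chunks = [
    Chunk("Subj-NP", "Ein schweres Erdbeben", "noun-disaster", "Erdbeben"),
    Chunk("V", "hat"),
    Chunk("LOC-PP", "im Iran", "ne-country", "Iran"),
    Chunk("OBJ-NP", "die Stadt Bam", "ne-city", "Bam"),
    Chunk("ADV", "fast völlig"),
    Chunk("V", "zerstört", head="zerstören"),
]


def extract_event(chunks):
    """Fill the who / what_action / what / where slots and infer locatedIn."""
    event, inferred = {}, []
    obj = loc = None
    for c in chunks:
        if c.label == "Subj-NP" and c.sem == "noun-disaster":
            event["who"] = c.head
        elif c.label == "V" and c.head is not None:
            event["what_action"] = c.head  # the auxiliary has no head here
        elif c.label == "OBJ-NP":
            event["what"], obj = c.head, c
        elif c.label == "LOC-PP":
            event["where"], loc = c.head, c
    # A city affected by an event located in a country lies in that country.
    if obj is not None and loc is not None:
        if obj.sem == "ne-city" and loc.sem == "ne-country":
            inferred.append((obj.head, "locatedIn", loc.head))
    return event, inferred


event, inferred = extract_event(chunks)
print(event)
print(inferred)
```

The inference rule at the end is deliberately narrow: it only fires when the object is tagged as a city and the locative phrase as a country.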
7The Scenes Tag
- <SCENES>
- <SCENE sid="1">Bam Menschen sitzen zwischen Trümmern auf Boden</SCENE>
- <SCENE sid="2">verzweifelte Menschen sitzen am Strassenrand</SCENE>
- <SCENE sid="3">Schuttberge</SCENE>
- <SCENE sid="4">zerstörte Häuser</SCENE>
- <SCENE sid="5">rauchende Trümmer</SCENE>
- </SCENES>
- Descriptions of the sequences of images displayed. Related entities can be extracted: people within ruins, desperate people, destroyed houses, smoking ruins, etc. All of these terms can be seen as consequences of the earthquake. Importantly, they also provide a description of what is to be seen in the video.
8The Keywords Tag
- <KEYWORDS>
- Naher Osten Iran Erdbeben
- </KEYWORDS>
- The pattern of the content of this tag (region, country, event) allows us to infer that Iran is located in the Near East ("Naher Osten").
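A minimal sketch of how such a keyword pattern could be exploited is given below. The two small gazetteers and the function name `infer_location` are assumptions made for the example; the slides do not say how the pattern is actually matched.

```python
# Tiny hand-made gazetteers (hypothetical; a real system would use larger lists).
REGIONS = {"Naher Osten": "Near East"}
COUNTRIES = {"Iran"}


def infer_location(keywords):
    """Return (country, 'locatedIn', region) if the keyword string starts
    with a known region followed by a known country, else None."""
    for region_de, region_en in REGIONS.items():
        if keywords.startswith(region_de + " "):
            rest = keywords[len(region_de):].split()
            if rest and rest[0] in COUNTRIES:
                return (rest[0], "locatedIn", region_en)
    return None


print(infer_location("Naher Osten Iran Erdbeben"))
```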
9Linguistic Knowledge Structures
- Multiple layers and levels
- Low-level linguistic features (tokenization, morphology, ...)
- Semantic properties of terms and phrases
- Named Entities
- Relation Extraction (incl. Grammatical Relations)
- Semantic linking to domain ontologies
- Can involve several abstraction layers connected through reasoning/mapping processes
- Semantic linking to other media analysis
- Associated to the domain ontology of MESH (natural disasters in the news)
10Association of Ontologies
11Semantic annotation of Text extracted from Images
- (Thierry Declerck, DFKI; Andreas Cobet, TUB)
12Background
- The data: the German broadcast news programme Tagesthemen.
- Extract text from key frames of shots. Annotate the extracted terms semantically.
- Analysis of the position of the text and of the kind of text extracted. 6 cases detected so far.
13Case 1: Above the picture, just a normal phrase,
mostly a nominal phrase (NP)
14Case 2: Below the picture, the name of a person and
the function of this person
15Case 3: Below the picture, the name of a person and
of a city/country
16Case 4: Above the picture, just a normal nominal
phrase, and below the picture, the name of a person
17Case 5: Below the picture, the word "Bericht" ("report", or
similar) and the name of a person (=> journalist)
18Case 6: A location name. No picture of a
specific human
19Cross-Media Ontologies
- The next slides are by courtesy of Paul Buitelaar, Michael Sintek and Malte Kiesel (DFKI GmbH), from the project SmartWeb. Paper: "Feature Representation for Cross-Lingual, Cross-Media Semantic Web Applications", presented at ESWC 2006.
20Semiotic Triangle
- See (Ogden & Richards, 1923), based on:
- Structural Linguistics (de Saussure, 1916)
- philosophical work by Peirce (mostly 19th century)
21Semiotic Triangle: the real world
... actual goalkeepers in the real world ...
22Semiotic Triangle: concepts
... actual goalkeepers in the real world ...
23Semiotic Triangle: words
... actual goalkeepers in the real world ...
goalkeeper (EN), Torwart (DE), doelman (NL) ...
24Semiotic Triangle: images
... actual goalkeepers in the real world ...
25Features
- Multilingual Features
- Terms with Linguistic Info and Context Models
- Example: goalkeeper
- part-of-speech: noun
- morphology: goal-keeper
- context (Google hits stats.): gets 420000, holds 212000, shoots 55900, ...
- Multimedia Features
- Images with Feature Models
- Example: goalkeeper
- color: 111111
- shape: human
- texture: keypatch-set 223
26Representation Proposal
- Attach multilingual and multimedia features to classes and properties (and also instances)
- Use of meta-classes ClassWithFeats and PropertyWithFeats with properties lingFeat and imgFeat (with ranges LingFeat and ImgFeat)
- The classes LingFeat and ImgFeat are used for complex feature descriptions
- Diagram, meta-classes: feat:ClassWithFeats (rdfs:subClassOf rdfs:Class) and feat:PropertyWithFeats (rdfs:subClassOf rdfs:Property), each with feat:lingFeat and feat:imgFeat
- Diagram, classes: if:ImgFeat (if:color, if:texture) and lf:LingFeat (lf:term, lf:lang)
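To make the proposal concrete, the sketch below encodes the goalkeeper example from the earlier slides as plain subject-predicate-object triples. It is an illustration only: the prefixes feat:, lf: and if: follow the slide, whereas the ex: prefix, the node names (ex:lf1, ex:if1, ...) and the helper functions are made up for the example, and no RDF library is used.

```python
# Triples as (subject, predicate, object) tuples.
triples = [
    ("feat:ClassWithFeats", "rdfs:subClassOf", "rdfs:Class"),
    # The domain class is an instance of the meta-class ...
    ("ex:Goalkeeper", "rdf:type", "feat:ClassWithFeats"),
    # ... and therefore may carry linguistic and image features.
    ("ex:Goalkeeper", "feat:lingFeat", "ex:lf1"),
    ("ex:Goalkeeper", "feat:lingFeat", "ex:lf2"),
    ("ex:Goalkeeper", "feat:lingFeat", "ex:lf3"),
    ("ex:Goalkeeper", "feat:imgFeat", "ex:if1"),
    ("ex:lf1", "rdf:type", "lf:LingFeat"),
    ("ex:lf1", "lf:term", "goalkeeper"),
    ("ex:lf1", "lf:lang", "en"),
    ("ex:lf2", "rdf:type", "lf:LingFeat"),
    ("ex:lf2", "lf:term", "Torwart"),
    ("ex:lf2", "lf:lang", "de"),
    ("ex:lf3", "rdf:type", "lf:LingFeat"),
    ("ex:lf3", "lf:term", "doelman"),
    ("ex:lf3", "lf:lang", "nl"),
    ("ex:if1", "rdf:type", "if:ImgFeat"),
    ("ex:if1", "if:color", "111111"),
    ("ex:if1", "if:texture", "keypatch-set 223"),
]


def objects(subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triple list."""
    return [o for (s, p, o) in triples if s == subject and p == predicate]


def terms_by_language(cls):
    """Collect the terms attached to a class, keyed by language code."""
    result = {}
    for feat in objects(cls, "feat:lingFeat"):
        lang = objects(feat, "lf:lang")[0]
        result[lang] = objects(feat, "lf:term")[0]
    return result


print(terms_by_language("ex:Goalkeeper"))
```

Because the features hang off the class itself (via the meta-class), the same lookup works for any class or property that carries lingFeat values.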
27Representation: Simplified Example
28Representation: LingInfo Ontology
- (Diagram: class hierarchy of the LingInfo ontology; the classes are linked by is-a relations.)
29Example Instance: Fußballspielers ("of the
football player")
30Features: Interacting Layers
31Translating XBRL Into Description Logic
- Thierry Declerck and Hans-Ulrich Krieger
- DFKI GmbH
32Motivation
- Towards large, intelligent, web-based financial information and decision support systems in the MUSING project
- Until now: a prototype based on XBRL (eXtensible Business Reporting Language), as developed within the eTen project WINS
- There we experienced the limitations of the XBRL schema, due to the lack of reasoning support over XML-based data and information extracted from documents.
- Hence the need to translate XBRL into an ontology
33XBRL Example: Header Metadata
- <?xml version="1.0" encoding="iso-8859-1" standalone="no"?>
- <group xmlns="http://www.xbrl.org/2001/instance" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:t="http://www.xbrl.org/german/ap/ci/2002-02-15" xmlns:ISO4217="http://www.iso.org/4217" xsi:schemaLocation="http://www.xbrl.org/german/ap/ci/2002-02-15 german_ap.xsd">
- <numericContext id="cn0" precision="8" cwa="false">
- <entity>
- <identifier scheme="http://www.xbrl.de/xbrl/sg2">001</identifier>
- </entity>
- <period>
- <startDate>2001-01-01</startDate>
- <endDate>2001-12-31</endDate>
- </period>
- <unit>
- <measure>ISO4217:EUR</measure>
- </unit>
- ...
- </nonNumericContext>
- ...
- <t:genInfo.doc.author nonNumericContext="c2">XBRL Deutschland e.V.</t:genInfo.doc.author>
- <t:genInfo.doc.author.city nonNumericContext="c2">Düsseldorf</t:genInfo.doc.author.city>
- <t:genInfo.doc.author.compName nonNumericContext="c2"/>
34XBRL: Financial Data
- ...
- <t:bs.ass numericContext="cn0">1338066</t:bs.ass>
- <t:bs.ass.accountingConvenience numericContext="cn0">0</t:bs.ass.accountingConvenience>
- <t:bs.ass.accountingConvenience.changeDem2Eur numericContext="cn0">0</t:bs.ass.accountingConvenience.changeDem2Eur>
- <t:bs.ass.accountingConvenience.startUpCost numericContext="cn0">0</t:bs.ass.accountingConvenience.startUpCost>
- <t:bs.ass.currAss numericContext="cn0">749385</t:bs.ass.currAss>
- <t:bs.ass.currAss.cashEquiv numericContext="cn0">259760</t:bs.ass.currAss.cashEquiv>
- <t:bs.ass.currAss.inventory numericContext="cn0">209343</t:bs.ass.currAss.inventory>
- <t:bs.ass.currAss.inventory.advPaymPaid numericContext="cn0">0</t:bs.ass.currAss.inventory.advPaymPaid>
- ...
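As an illustration of how such facts can be read programmatically, the sketch below parses a cut-down, well-formed instance built from four of the facts on this slide, using Python's standard `xml.etree.ElementTree`. The shortened sample document and the function name `numeric_facts` are assumptions for the example; the namespace URIs are the ones shown in the header slide.

```python
import xml.etree.ElementTree as ET

T_NS = "http://www.xbrl.org/german/ap/ci/2002-02-15"

# Shortened sample: contexts and most facts are omitted on purpose.
SAMPLE = """<group xmlns="http://www.xbrl.org/2001/instance"
       xmlns:t="http://www.xbrl.org/german/ap/ci/2002-02-15">
  <t:bs.ass numericContext="cn0">1338066</t:bs.ass>
  <t:bs.ass.currAss numericContext="cn0">749385</t:bs.ass.currAss>
  <t:bs.ass.currAss.cashEquiv numericContext="cn0">259760</t:bs.ass.currAss.cashEquiv>
  <t:bs.ass.currAss.inventory numericContext="cn0">209343</t:bs.ass.currAss.inventory>
</group>"""


def numeric_facts(xml_text):
    """Map each taxonomy element name to (context id, integer value)."""
    root = ET.fromstring(xml_text)
    prefix = "{" + T_NS + "}"   # ElementTree expands prefixes to {uri}name
    facts = {}
    for el in root:
        if el.tag.startswith(prefix) and "numericContext" in el.attrib:
            name = el.tag[len(prefix):]
            facts[name] = (el.attrib["numericContext"], int(el.text.strip()))
    return facts


facts = numeric_facts(SAMPLE)
for name, (context, value) in facts.items():
    print(name, context, value)
```

Note that the hierarchy of the balance sheet is only implicit in the dotted element names; nothing in the XML itself states that `bs.ass.currAss` is a part of `bs.ass`.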
35XBRL to OWL-XBRL
- XBRL taxonomies make use of XML in order to describe the structure of an XBRL document as well as to define new datatypes and properties relevant to XBRL.
- This makes it possible to check whether a concrete (business) document conforms to the syntactic structure defined in the schema.
- But there is a need for languages and tools that go beyond the syntactic expressive power of XML Schema.
- OWL, the Web Ontology Language, is the new emerging language for the Semantic Web and originates from the DAML+OIL standardization. OWL still makes use of constructs from RDF and RDFS, such as rdf:resource, rdfs:subClassOf, or rdfs:domain.
- Two important variants, OWL Lite and OWL DL, restrict the expressive power of RDFS, thereby ensuring decidability.
- What makes OWL unique (as compared to RDFS or even XML Schema) is the fact that it can describe resources in more detail and that it comes with a well-defined model-theoretic semantics, inherited from description logic.
36Actual Experiment with the Sesame DB
- The basic idea during our (manual) effort was that, even though we are developing an XBRL taxonomy in OWL using Protégé, the information that is stored on disk is still RDF at the syntactic level. We were thus interested in RDF database systems which make sense of the semantics of OWL and RDFS constructs such as rdfs:subClassOf or owl:equivalentClass.
- Current experiment with Sesame, an open-source middleware framework for storing and retrieving RDF data. Sesame partially supports the semantics of RDFS and OWL constructs via entailment rules that compute "missing" RDF triples (the deductive closure).
- From an RDF point of view, 62,598 additional triples were generated through Sesame's deductive closure.
37Example of Entailment Rule: hasPart relation
- <rule name="owl-transitiveProp">
- <!-- note: ?p, ?x, ?y, and ?z are variables -->
- <premise>
- <subject var="?p"/>
- <predicate uri="rdf:type"/>
- <object uri="owl:TransitiveProperty"/>
- </premise>
- <premise>
- <subject var="?x"/>
- <predicate var="?p"/>
- <object var="?y"/>
- </premise>
- <premise>
- <subject var="?y"/>
- <predicate var="?p"/>
- <object var="?z"/>
- </premise>
- <consequent>
- <subject var="?x"/>
38A concrete Example of deduced Relation
- Since we have classified hasPart (as well as partOf) as a transitive OWL property, the rule on the previous slide will fire, making implicit knowledge explicit and producing new triples such as
- <t_bs, hasPart, t_bs.ass.defTax>
- although only
- <t_bs, hasPart, t_bs.ass>
- <t_bs.ass, hasPart, t_bs.ass.defTax>
- can be found in the original XBRL specification.
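The effect of the transitivity rule can be reproduced with a few lines of plain Python. This is a naive fixed-point computation written for illustration; it is not how Sesame implements its inferencer, and the function name `deductive_closure` is hypothetical. The three input triples are the property declaration plus the two hasPart statements from this slide.

```python
def deductive_closure(triples):
    """Apply the transitive-property rule until no new triple is produced.

    Rule: (?p rdf:type owl:TransitiveProperty), (?x ?p ?y), (?y ?p ?z)
          => (?x ?p ?z)
    """
    closed = set(triples)
    transitive = {s for (s, p, o) in closed
                  if p == "rdf:type" and o == "owl:TransitiveProperty"}
    changed = True
    while changed:
        changed = False
        new = set()
        for (x, p, y) in closed:
            if p not in transitive:
                continue
            for (y2, p2, z) in closed:
                if p2 == p and y2 == y and (x, p, z) not in closed:
                    new.add((x, p, z))
        if new:
            closed |= new
            changed = True
    return closed


explicit = {
    ("hasPart", "rdf:type", "owl:TransitiveProperty"),
    ("t_bs", "hasPart", "t_bs.ass"),
    ("t_bs.ass", "hasPart", "t_bs.ass.defTax"),
}

closure = deductive_closure(explicit)
for triple in sorted(closure - explicit):
    print(triple)   # the triples that were only implicit before
```

The difference `closure - explicit` corresponds to what the slides call the additionally generated triples, here a single one.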
39Translating the Base Taxonomy
- In the GermanAP Commercial and Industrial (German Accounting Principles) taxonomy (http://www.xbrl-deutschland.de/xe news2.htm), the file xbrl-instance.xsd specifies the XBRL base taxonomy using XML Schema. It makes use of XML Schema datatypes, such as xsd:string or xsd:date, but also defines simple types (simpleType), complex types (complexType), elements (element), and attributes (attribute). Element and attribute declarations are used to restrict the usage of elements and attributes in XBRL XML documents.
- Since OWL only knows the distinction between classes and properties, the correspondence between XBRL and OWL description primitives is not a one-to-one mapping.
40Business Intelligence in MUSING
- Next-generation Business Intelligence: the MUSING European R&D project (MUlti-industry, Semantic-based next generation business INtelliGence). Towards a new generation of Business Intelligence (BI) tools and modules founded on semantic-based knowledge and content systems, enhancing the technological foundations of knowledge acquisition and reasoning in BI applications.
41Application Domains in MUSING
- The breakthrough impact of MUSING on semantic-based BI will be measured in three strategic, vertical domains:
- Finance, through development and validation of next-generation (Basel II and beyond) semantic-based BI solutions, with particular reference to Credit Risk Management
- Internationalisation (i.e., the process that allows an enterprise to evolve its business from a local to an international dimension, here expressly focusing on the information acquisition work concerning international partnerships, contracts, investments), through development and validation of next-generation semantic-based internationalisation platforms
- Operational Risk Management, through development and validation of semantic-driven knowledge systems for measurement and mitigation tools, with particular reference to IT operational risks faced by IT-intensive organisations.
42Processing of Quantitative Data
- Typical input: financial reports in PDF
43PDF to XBRL (OWL-XBRL)
- Mapping from PDF to HTML/XML
- Detection in the HTML/XML of relevant layout information that helps in reconstructing the logical units of the original PDF documents (title, header/footer, footnotes, tables, free text)
- Mapping of terms found in the XML version of the document to XBRL labels, with disambiguation where needed.
- Checking whether all the lines of the PDF documents are XBRL compliant. Non-compliant information is saved in a log file. Towards an XBRL checker for balance sheets delivered in proprietary formats.
- Generation of the results of the PDFtoXBRL procedure in a multilingual setting
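The term-mapping and compliance-checking steps above could look roughly like the following sketch. Everything specific in it is an assumption made for illustration: the German row labels, their assignment to the element names from the financial-data slide, the "label value" line format with "." as thousands separator, and the function name `map_lines`. The actual PDFtoXBRL procedure is not described in enough detail on the slides to reproduce it.

```python
# Hypothetical lookup table: normalised row label -> XBRL element name.
LABELS = {
    "umlaufvermögen": "t:bs.ass.currAss",
    "vorräte": "t:bs.ass.currAss.inventory",
}


def map_lines(lines):
    """Split each 'label value' line, map known labels to XBRL elements and
    collect everything else (with its line number) for the log file."""
    mapped, log = [], []
    for number, line in enumerate(lines, start=1):
        label, _, value = line.rpartition(" ")
        digits = value.replace(".", "")      # "749.385" -> "749385"
        key = label.strip().lower()
        if key in LABELS and digits.isdigit():
            mapped.append((LABELS[key], int(digits)))
        else:
            log.append((number, line))
    return mapped, log


rows = ["Umlaufvermögen 749.385", "Vorräte 209.343", "Sonstige Posten 12"]
mapped, log = map_lines(rows)
print(mapped)
print(log)
```

Keeping the unmapped lines together with their line numbers is what would allow the log to serve as a checker report for balance sheets in proprietary formats.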
44Processing of Qualitative Data
- TURNOVER, INCOME, GROWTH: "State of revenues, if depurated from sales related to Consip contract award, which remarkably affected the turnover in 2003, would have, on the contrary, recorded an increase of 3,23 against that microinformatics market which recorded an increase of 3,2 (Sirmi, january 2005)."
- Task: identifying relevant expressions and classifying them
45Integration of Data
- The challenge: merging data and information extracted from various types of documents, also in various languages, and, in the XG use case, especially integrating information from news wires.