A metadata infrastructure using ISO standards

1
A metadata infrastructure using ISO standards
2
Introduction
  • ISO becoming more open?
  • ISO 1.0: top-down, expensive, cobwebs?
  • ISO 2.0: ISO 1.0 plus bottom-up, free, webs?
  • Standards on wikis (wikification)
  • Open systems of metadata
  • Outline:
  • Use of extant standards (ISO 11179) for new ones (ISO 12620,
    ISO 639)
  • OmegaWiki as exemplar open-source project
  • Peacekeeping forces: ISO, WLDC

3
Introduction
  • Application area: human languages
  • 50% of languages are endangered (UNESCO)
  • large proportion of languages have no resources
    and no web presence
  • discontinuity and fragmentation of research
  • sustainability and curation issues
  • And yet...
  • Capability for capturing data like never before
  • Expansion of capacity of the Internet and growing
    pressure for an inclusive multilingual internet
  • OLPC programme
  • Language experts and non-experts are prepared to
    contribute time and resources
  • So, how to create an infrastructure in which to
    form communities around languages and harmonize
    results?

4
Introduction
  • Language experts may identify linguistic content
    in a highly precise manner
  • What are non-experts (user community) capable of?
  • Providing more specific sets of labels may help
    in discovery of written or spoken languages in
    all kinds of media and help to harmonize
    research activities - so long as people know what
    they are looking at.
  • Inaccuracies in currently tagged content: need to
    take the problem away from end users
  • More precise identification improves the chances
    of getting what you wanted: consider "coffee" vs.
    "coffee TYPE COLOUR" vs. "strong black
    coffee, in a mug, with 2 sugars".
  • Beyond documentation of names and
    representations, documentary information for each
    language might be helpful.
  • Working towards a machine-readable representation
    for all such information is a longer-term goal.

5
ISO standards
6
ISO standards
  • Metadata registry according to ISO 11179 series
    of standards (see, also, ISO 19763).
  • According to ISO 11179:
  • A Value Domain is associated with a Conceptual
    Domain; a Value Domain provides a representation
    for the Conceptual Domain.
  • An example of a Conceptual Domain and set of Value
    Domains is ISO 3166, Codes for the representation
    of names of countries.
  • ISO 3166 describes the set of seven Value
    Domains: short name in English, official name in
    English, short name in French, official name in
    French, alpha-2 code, alpha-3 code, and numeric
    code.
  • Each representation contains a set of values that
    may be used in the value domain associated with
    the data element concept (DEC); each one of the
    seven associations is a data element.
  • For each representation of the data, the
    permissible values, the datatype, the
    representation class, and possibly the units of
    measure, are altered.

Conceptual domain name: Countries of the world
Conceptual domain definition: Lists of current countries of the world
represented as names or codes.

Value domain name (1): Country codes - 2 character alpha
Permissible values:
<AF, The primary geopolitical entity known as "Democratic Republic of Afghanistan">
<AL, The primary geopolitical entity known as "People's Socialist Republic of Albania">
. . .
<ZW, The primary geopolitical entity known as "Republic of Zimbabwe">

Value domain name (2): Country codes - 3 character alpha
Permissible values:
<AFG, The primary geopolitical entity known as "Democratic Republic of Afghanistan">
<ALB, The primary geopolitical entity known as "People's Socialist Republic of Albania">
. . .
<ZWE, The primary geopolitical entity known as "Republic of Zimbabwe">
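The relationship between a conceptual domain and its value domains can be sketched in a few lines of Python. This is a minimal illustration only: the class and field names (ConceptualDomain, ValueDomain, permissible_values, meaning) are assumptions made for this sketch and are not taken from ISO 11179, and only the Afghanistan and Zimbabwe entries from the example are included.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ValueDomain:
    """One representation of a conceptual domain (names here are hypothetical)."""
    name: str
    # Maps each permissible value (a code) to its value meaning.
    permissible_values: Dict[str, str] = field(default_factory=dict)


@dataclass
class ConceptualDomain:
    """A set of value meanings, independent of how they are represented."""
    name: str
    definition: str
    value_domains: List[ValueDomain] = field(default_factory=list)

    def meaning(self, domain_name: str, code: str) -> str:
        """Return the value meaning for a code in the named value domain."""
        for vd in self.value_domains:
            if vd.name == domain_name:
                return vd.permissible_values[code]
        raise KeyError(domain_name)


AFGHANISTAN = 'The primary geopolitical entity known as "Democratic Republic of Afghanistan"'
ZIMBABWE = 'The primary geopolitical entity known as "Republic of Zimbabwe"'

countries = ConceptualDomain(
    name="Countries of the world",
    definition="Lists of current countries of the world represented as names or codes.",
    value_domains=[
        ValueDomain("Country codes - 2 character alpha",
                    {"AF": AFGHANISTAN, "ZW": ZIMBABWE}),
        ValueDomain("Country codes - 3 character alpha",
                    {"AFG": AFGHANISTAN, "ZWE": ZIMBABWE}),
    ],
)

# Two different codes from two value domains share one value meaning.
same_meaning = (
    countries.meaning("Country codes - 2 character alpha", "AF")
    == countries.meaning("Country codes - 3 character alpha", "AFG")
)
```

The point of the design is that the codes differ per value domain while the value meaning is stored once and shared, which is what lets the representations be interchanged.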
7
ISO standards
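Data categories, their conceptual values, and their implementation (diagram):
  • /Gender/: /masculine/, /feminine/, /neuter/. Implemented as an XML
    attribute named gen, with values m, f, n
  • /Language identifier/: /English/, /French/. Implemented as an XML
    attribute named lang, with values en, fr..
  • /Country/: /Afghanistan/. Implemented as an XML attribute named
    country, with values GB, FR, CN,
  • Anchors: <xml country="FR">
    <w lemme="vert" lang="fr" gen="...">verte</w>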
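How XML attributes on this slide anchor to data categories can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration under stated assumptions: the DATA_CATEGORIES table and the resolve function are hypothetical names for this sketch, and the gender value gen="f" is assumed, since the attribute value is not recoverable from the slide.

```python
import xml.etree.ElementTree as ET

# Attribute name -> (data category, {attribute value -> conceptual value}).
# The table layout is an assumption made for this sketch.
DATA_CATEGORIES = {
    "gen": ("/Gender/", {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"}),
    "lang": ("/Language identifier/", {"en": "/English/", "fr": "/French/"}),
}


def resolve(element):
    """Map each known attribute of an element to its data category and value."""
    resolved = {}
    for attr, value in element.attrib.items():
        if attr in DATA_CATEGORIES:
            category, values = DATA_CATEGORIES[attr]
            # Fall back to the raw attribute value if it is not registered.
            resolved[category] = values.get(value, value)
    return resolved


# gen="f" is an assumed value, used here for illustration only.
word = ET.fromstring('<w lemme="vert" lang="fr" gen="f">verte</w>')
resolved = resolve(word)
```

Attributes with no registered data category (here, lemme) are simply skipped, so the same lookup can be run over markup that mixes registered and unregistered attributes.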
8
ISO standards
  • ISO 12620 metamodel: ISO standard in preparation

9
Languages Infrastructure
  • OmegaWiki: a collaborative project to produce a
    free, multilingual resource in every language,
    with lexicological, terminological and thesaurus
    information. Backed by a relational database.
  • World Language Documentation Centre (WLDC):
    currently comprising 22 experts in language
    technologies, linguistics, terminology
    standardisation, and localisation
  • ISO: provision of the ISO 639 series of
    standards; the focus here is on 639-4 and 639-6;
    the standards provide the structure

10
Languages Infrastructure
  • Model for ISO 639 proposed and developed by
    LIRICS project participants (Gillam, Romary);
    recently accepted for inclusion and review in the
    current iteration of the developing ISO 639 part
    4.
  • Intended to be fully compatible with models being
    developed in ISO TC 37 in general, compatible
    with the Data Category Interchange Format defined
    in ISO 12620, and to provide a means for
    interlinking the collection of identifiers
    provided across the 639 series.
  • ISO TC 37 standards for computational use of
    terminology collections, specifically ISO 16642
    and its combination with ISO 12620, emphasize a
    metamodel in combination with metadata
    identifiers, referred to as data categories.
  • Language identifiers of ISO 639 shall be
    compatible, interoperable, mutually
    understandable, and usable to the degree of
    precision needed by the user up to the
    limitations of these identifiers.
  • Language identifiers themselves need to be
    described by metadata.
  • All of these metadata items can be submitted to
    the metadata registry specified according to ISO
    12620.
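The idea of interlinking identifiers across the 639 series can be sketched as a single record per language that carries one identifier per part. This is a hypothetical layout for illustration, not the ISO 639-4 model itself: LanguageRecord, RECORDS and convert are names invented for this sketch, and the ISO 639-2 entry for French uses the bibliographic code.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LanguageRecord:
    """One language, with its identifiers keyed by ISO 639 part."""
    name: str
    identifiers: Dict[str, str]


# Illustrative records; the field layout is an assumption of this sketch.
RECORDS: List[LanguageRecord] = [
    LanguageRecord("English", {"639-1": "en", "639-2": "eng", "639-3": "eng"}),
    LanguageRecord("French", {"639-1": "fr", "639-2": "fre", "639-3": "fra"}),
]


def convert(identifier: str, source: str, target: str) -> Optional[str]:
    """Map an identifier from one ISO 639 part to another, if both are recorded."""
    for record in RECORDS:
        if record.identifiers.get(source) == identifier:
            return record.identifiers.get(target)
    return None


french_639_3 = convert("fr", "639-1", "639-3")
```

Returning None rather than raising reflects the slide's caveat that identifiers are usable only "up to the limitations of these identifiers": not every language has a code in every part.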

11
Languages Infrastructure
  • ISO 639 model based on:
  • need to replicate simplistic structure of ISO
    639-1 and 639-2
  • inferred model of the Ethnologue as published
  • ISO 12620 / ISO 11179
  • emergent model through BSI for ISO 639-6, adapted,
    generalized and cross-validated from encyclopædic
    and other sources including:
  • Gordon Jr, R. G. (Ed.) (2005) Ethnologue:
    Languages of the World, 15th Edn. SIL
    International.
  • Voegelin, C.F. and F.M. (1977) Classification and
    index of the world's languages. New York, NY:
    Elsevier North Holland, Inc.
  • Ruhlen, M. (1987) A guide to the world's
    languages. Vol. 1: Classification. London: Edward
    Arnold.
  • Comrie, B. (Ed.) (1987) The World's major
    languages. New York, NY: Oxford University Press.
  • Chambers, J.K. and Trudgill, P. (1998)
    Dialectology. Cambridge: Cambridge University
    Press.
  • Dalby, D. (1999) Linguasphere Register of the
    world's languages and speech communities.
    Linguasphere Press.
  • Development of ISO 639-6 was initially assisted by a
    fund made available by the Department of Trade
    and Industry of the UK and administered by BSI;
    subsequent efforts in standardization and
    validation have been funded, and supported, by
    BSI and ICT Marketing Ltd.

12
Languages Infrastructure
Data categories Metadata registries
Co-ordination
SIL, LoC, Infoterm
UN
ISO 639-6 data
ISO 639-X data
13
Languages Infrastructure
  • The right organizational model? c/w Citizendium
  • Larry Sanger, a co-founder of Wikipedia who left
    to become one of its most vocal critics.
  • "Wikipedia has accomplished great things, but the
    world can do even better," Dr Sanger said. "By
    engaging expert editors, eliminating anonymous
    contribution, and launching a more mature
    community under a new charter, a much broader and
    more influential group of people and institutions
    will be able to improve upon Wikipedia's
    extremely useful, but often uneven work. The
    result will be not only enormous and free, but
    reliable."
  • A vetted set of editors, dubbed "constables",
    developing a set of rules for contributors to
    abide by.
  • Times Online, 7 September 2007

14
Languages Infrastructure
15
ISO 3166-1
16
ISO 639-1
17
ISO 639-6
19
Wikis for Languages
22
http://lux12.mpi.nl/isocat/
23
ISOcat architecture
client
tool
web interface
REST API
WS API
core DCR services
control access
manage session
manage user profile
manage access
manage balloting
manage comments
access data
manage system
administrator
DBMS
mirror
Kemps-Snijders, Windhouwer, Wittenburg and Wright
24
Languages Infrastructure
  • Language documentation via ISO 639-4: association
    of metadata descriptors to a model interoperable
    with DCIF (ISO 12620) (639-4 section 9)

25
Languages Infrastructure
  • Eventual inclusion of all available metadata

26
Languages Infrastructure
  • Language Codes Standards are growing in number
    and complexity
  • From 2 to 6
  • From 400 identifiers to upwards of 30000
  • From lists to databases
  • From tables to metadata registries
  • From published text documents to published
    databases
  • From IETF RFC to RFCs to RFCs
  • From a closed membership committee to an open
    Community initiative (OmegaWiki)
  • ... with accompanying (web) services and products

27
Languages Infrastructure
  • Language Codes Standards are growing in number
    and complexity
  • From 2 to 6: eventually back to 1?
  • From 400 identifiers to upwards of 30000: plus
    supporting metadata
  • From lists to databases: multiple metadata
    registers
  • From tables to metadata registries: registers,
    policies, auditors
  • From published text documents to published
    databases: SAD
  • From IETF RFC to RFCs to RFCs: consume, consume,
    consume
  • From a closed membership committee to an open
    Community initiative (OmegaWiki): supporting
    infrastructure, expert review of community
    contributions (e-Voting?)
  • ... with accompanying (web) services and products:
    Open Source and bespoke, and secured funding as
    necessary

28
ISO standards
29
Next steps
  • ISO: efforts with ISOcat (TC 37)
  • OmegaWiki: support for community building
  • WLDC: verification and validation in an on-going
    fashion
  • Connecting the whole thing... and evaluating at
    scale
  • A simple catalogue of names of all languages in
    ISO 639 parts 1-3 has potential for, at least,
    7500 x 7500 entries (> 56 million) plus associated
    status information
  • Further connectivity: SRB (MCAT)? OMII (Data 2.0)?
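The catalogue estimate above is the number of (language being named, language of the name) pairs: 7500 x 7500 = 56,250,000. A rough sketch of that arithmetic, together with one possible sparse layout for such a catalogue, follows. The dictionary layout and variable names are assumptions for illustration, and the three sample names are keyed by ISO 639-3 identifiers.

```python
from typing import Dict, Tuple

N_LANGUAGES = 7500  # approximate count used on the slide

# Upper bound: every language named in every language.
max_entries = N_LANGUAGES * N_LANGUAGES

# Sparse store keyed by (language being named, language of the name).
# Only names that are actually known are stored.
names: Dict[Tuple[str, str], str] = {
    ("fra", "eng"): "French",
    ("fra", "fra"): "français",
    ("eng", "fra"): "anglais",
}

# Fraction of the potential catalogue that is filled in so far.
coverage = len(names) / max_entries
```

A sparse mapping is used because most of the 56 million cells are likely to stay empty for a long time; a dense 7500 x 7500 table would mostly hold blanks.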

30
Acknowledgements
  • EU eContent project LIRICS (22236)
  • British Standards Institution
  • OmegaWiki
  • WLDC
  • Department of Trade and Industry's Knowledge
    Transfer Partnerships scheme (KTP 1739).
  • Contributions and efforts of colleagues and peers
    in ISO, BSI, IETF, in the projects identified,
    and in the wider community also.
  • And thank you for listening.