Title: A metadata infrastructure using ISO standards
1A metadata infrastructure using ISO standards
2Introduction
- ISO becoming more open?
- ISO 1.0: top-down, expensive, cobwebs?
- ISO 2.0: ISO 1.0 plus bottom-up, free, webs?
- Standards on Wikis (wikification)
- Open systems of metadata
- Outline
- Use of extant standards (11179) for new (12620, 639)
- OmegaWiki as exemplar O/S project
- Peacekeeping forces: ISO, WLDC
3Introduction
- Application area: human languages
- 50% of languages are endangered (UNESCO)
- large proportion of languages have no resources and no web presence
- discontinuity and fragmentation of research
- sustainability and curation issues
- And yet..
- Capability for capturing data like never before
- Expansion of capacity of the Internet and growing pressure for an inclusive multilingual internet
- OLPC programme
- Language experts and non-experts are prepared to contribute time and resources
- So, how to create an infrastructure in which to form communities around languages and harmonize results?
4Introduction
- Language experts may identify linguistic content in a highly precise manner
- What are non-experts (user community) capable of?
- Providing more specific sets of labels may help in discovery of written or spoken languages in all kinds of media and help to harmonize research activities, so long as people know what they are looking at.
- Inaccuracies of currently tagged content: need to take the problem away from end users
- More precise identification improves the chances of getting what you wanted: consider "coffee" vs. "coffee TYPE COLOUR" vs. "strong black coffee, in a mug, with 2 sugars".
- Beyond documentation of names and representations, documentary information for each language might be helpful.
- Working towards a machine-readable representation for all such information is a longer-term goal.
5ISO standards
6ISO standards
- Metadata registry according to ISO 11179 series of standards (see also ISO 19763).
- According to ISO 11179:
- A Value Domain is associated with a Conceptual Domain; a Value Domain provides a representation for the Conceptual Domain.
- Example Conceptual Domain and set of Value Domains: ISO 3166, Codes for the representation of names of countries.
- ISO 3166 describes a set of seven Value Domains: short name in English, official name in English, short name in French, official name in French, alpha-2 code, alpha-3 code, and numeric code.
- Each representation contains a set of values that may be used in the value domain associated with the data element concept (DEC); each one of the seven associations is a data element.
- For each representation of the data, the permissible values, the datatype, the representation class, and possibly the units of measure, are altered.
Conceptual domain name: Countries of the world
Conceptual domain definition: Lists of current countries of the world represented as names or codes.
Value domain name (1): Country codes, 2-character alpha
Permissible values:
<AF, The primary geopolitical entity known as "Democratic Republic of Afghanistan">
<AL, The primary geopolitical entity known as "People's Socialist Republic of Albania">
. . .
<ZW, The primary geopolitical entity known as "Republic of Zimbabwe">
Value domain name (2): Country codes, 3-character alpha
Permissible values:
<AFG, The primary geopolitical entity known as "Democratic Republic of Afghanistan">
<ALB, The primary geopolitical entity known as "People's Socialist Republic of Albania">
. . .
<ZWE, The primary geopolitical entity known as "Republic of Zimbabwe">
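The relationship on this slide, one conceptual domain represented by several value domains, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `ConceptualDomain` and `ValueDomain` and the `convert` method are hypothetical and not taken from ISO 11179, and only the permissible values quoted on the slide are included.

```python
from dataclasses import dataclass, field


@dataclass
class ValueDomain:
    """One representation of a conceptual domain, e.g. 2-character alpha codes."""
    name: str
    # Maps each permissible value to its value meaning.
    permissible_values: dict = field(default_factory=dict)


@dataclass
class ConceptualDomain:
    """A set of value meanings, independent of any single representation."""
    name: str
    definition: str
    value_domains: dict = field(default_factory=dict)

    def add_value_domain(self, value_domain):
        self.value_domains[value_domain.name] = value_domain

    def convert(self, value, source, target):
        """Re-express a value from one value domain in another via its shared meaning."""
        meaning = self.value_domains[source].permissible_values[value]
        for candidate, candidate_meaning in self.value_domains[target].permissible_values.items():
            if candidate_meaning == meaning:
                return candidate
        raise KeyError(f"{value!r} has no counterpart in {target!r}")


# Hypothetical registry content, limited to the values shown on the slide.
countries = ConceptualDomain(
    name="Countries of the world",
    definition="Lists of current countries of the world represented as names or codes.",
)
countries.add_value_domain(ValueDomain("alpha-2", {
    "AF": "Democratic Republic of Afghanistan",
    "AL": "People's Socialist Republic of Albania",
    "ZW": "Republic of Zimbabwe",
}))
countries.add_value_domain(ValueDomain("alpha-3", {
    "AFG": "Democratic Republic of Afghanistan",
    "ALB": "People's Socialist Republic of Albania",
    "ZWE": "Republic of Zimbabwe",
}))

print(countries.convert("AF", "alpha-2", "alpha-3"))
```

Because both value domains point at the same value meanings, a code in one representation can be re-expressed in the other without a direct code-to-code table.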
7ISO standards
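[Diagram: data categories and their implementation as XML attributes]
- /Gender/: values /masculine/, /feminine/, /neuter/; implemented as an XML attribute named gen, with values m, f, n
- /Language identifier/: values /English/, /French/; implemented as an XML attribute named lang, with values en, fr, ...
- /Country/: values such as /Afghanistan/; implemented as an XML attribute named country, with values GB, FR, CN, ...
- Anchors:
<xml country="FR">
<w lemme="vert" lang="fr" gen="...">verte</w>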
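The idea on this slide, that data categories such as /Gender/ or /Language identifier/ are implemented as XML attributes (gen, lang, country), might be sketched as follows. The table `ATTRIBUTE_MAP` and the function `resolve` are hypothetical names introduced for illustration. The root element name `doc` is a stand-in, and `gen="f"` is an assumed value, since the attribute value is not legible in the slide text.

```python
import xml.etree.ElementTree as ET

# attribute name -> (data category, {attribute value: value-level data category})
ATTRIBUTE_MAP = {
    "lang": ("/Language identifier/", {"en": "/English/", "fr": "/French/"}),
    "gen": ("/Gender/", {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"}),
    # The slide lists the codes GB, FR, CN but gives no names for them,
    # so country codes are passed through unchanged.
    "country": ("/Country/", {}),
}


def resolve(element):
    """Map the attributes of one XML element to data categories."""
    resolved = {}
    for attribute, value in element.attrib.items():
        if attribute in ATTRIBUTE_MAP:
            category, values = ATTRIBUTE_MAP[attribute]
            resolved[category] = values.get(value, value)
    return resolved


# Assumed sample: root name and gen value are illustrative only.
SAMPLE = '<doc country="FR"><w lemme="vert" lang="fr" gen="f">verte</w></doc>'

root = ET.fromstring(SAMPLE)
word = root.find("w")

print(resolve(root))
print(resolve(word))
```

Attributes without a registered data category (here `lemme`) are simply ignored, which keeps the mapping table the single place where attribute names are tied to categories.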
8ISO standards
- 12620 metamodel
- ISO standard in preparation
9Languages Infrastructure
- OmegaWiki, a collaborative project to produce a free, multilingual resource in every language, with lexicological, terminological and thesaurus information. Relational database.
- World Language Documentation Centre (WLDC), currently comprising 22 experts in language technologies, linguistics, terminology standardisation, and localisation
- ISO, provision of the ISO 639 series of standards; focus here on 639-4 and 639-6; standards provide the structure
10Languages Infrastructure
- Model for ISO 639 proposed and developed by LIRICS project participants (Gillam, Romary); recently accepted for inclusion and review in the current iteration of the developing ISO 639 part 4.
- Intended to be fully compatible with models being developed in ISO TC 37 in general, compatible with the Data Category Interchange Format defined in ISO 12620, and to provide a means for interlinking the collection of identifiers provided across the 639 series.
- ISO TC 37 standards for computational use of terminology collections, specifically ISO 16642 and its combination with ISO 12620, emphasize a metamodel in combination with metadata identifiers, referred to as data categories.
- Language identifiers of ISO 639 shall be compatible, interoperable, mutually understandable, and usable to the degree of precision needed by the user up to the limitations of these identifiers.
- Language identifiers themselves need to be described by metadata.
- All of these metadata items can be submitted to the metadata registry specified according to ISO 12620
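One way to picture "interlinking the collection of identifiers provided across the 639 series" is a record per language that carries one identifier per part, indexed so that any identifier leads to the others. The sketch below is an assumption-laden illustration, not the ISO 639-4 model: the record layout and the names `RECORDS`, `build_index` and `crosswalk` are hypothetical. For French, ISO 639-2 defines both a bibliographic code (fre) and a terminologic code (fra); only the bibliographic one is shown.

```python
# Hypothetical records: one entry per language, one identifier per ISO 639 part.
RECORDS = [
    {"name": "English", "identifiers": {"639-1": "en", "639-2": "eng", "639-3": "eng"}},
    {"name": "French", "identifiers": {"639-1": "fr", "639-2": "fre", "639-3": "fra"}},
]


def build_index(records):
    """Index every (part, code) pair so any identifier leads to the full record."""
    index = {}
    for record in records:
        for part, code in record["identifiers"].items():
            index[(part, code)] = record
    return index


def crosswalk(index, part, code, target_part):
    """Return the identifier in target_part for the language known as code in part."""
    record = index[(part, code)]
    return record["identifiers"].get(target_part)


INDEX = build_index(RECORDS)

print(crosswalk(INDEX, "639-1", "fr", "639-3"))
print(crosswalk(INDEX, "639-3", "eng", "639-1"))
```

Keying the index on the pair (part, code) rather than on the code alone matters because the same string can appear in more than one part, as "eng" does here.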
11Languages Infrastructure
- ISO 639 model based on:
- need to replicate simplistic structure of ISO 639-1 and 639-2
- inferred model of the Ethnologue as published
- ISO 12620 / ISO 11179
- emergent model through BSI for ISO 639-6, adapted, generalized and cross-validated from encyclopædic and other sources including:
- Gordon Jr, R. G. (Ed.) (2005). Ethnologue: Languages of the World, 15th Edn. SIL International.
- Voegelin, C.F. and F.M. (1977). Classification and index of the world's languages. New York, NY: Elsevier North Holland, Inc.
- Ruhlen, M. (1987). A guide to the world's languages. Vol. 1: Classification. London: Edward Arnold.
- Comrie, B. (Ed.) (1987). The World's major languages. New York: Oxford University Press.
- Chambers, J.K. and Trudgill, P. (1998). Dialectology. Cambridge: Cambridge University Press.
- Dalby, D. (1999). Linguasphere Register of the world's languages and speech communities. Linguasphere Press.
- development of ISO 639-6 initially assisted by a fund made available by the Department of Trade and Industry of the UK and administered by BSI; subsequent efforts in standardization and validation have been funded, and supported, by BSI and ICT Marketing Ltd.
12Languages Infrastructure
[Diagram labels: Data categories; Metadata registries; Co-ordination; SIL, LoC, Infoterm; UN; ISO 639-6 data; ISO 639-X data]
13Languages Infrastructure
- The right organizational model? c/w Citizendium
- Larry Sanger, a co-founder of Wikipedia who left to become one of its most vocal critics.
- "Wikipedia has accomplished great things, but the world can do even better," Dr Sanger said. "By engaging expert editors, eliminating anonymous contribution, and launching a more mature community under a new charter, a much broader and more influential group of people and institutions will be able to improve upon Wikipedia's extremely useful, but often uneven work. The result will be not only enormous and free, but reliable."
- A vetted set of editors, dubbed "constables", developing a set of rules for contributors to abide by.
- Times Online, 7 September 2007
14Languages Infrastructure
15ISO 3166-1
16ISO 639-1
17ISO 639-6
19Wikis for Languages
22http://lux12.mpi.nl/isocat/
23ISOcat architecture
[Architecture diagram, after Kemps-Snijders, Windhouwer, Wittenburg and Wright]
- Clients: tool, web interface
- Interfaces: REST API, WS API
- Core DCR services: control access, manage session, manage user profile, manage access, manage balloting, manage comments, access data, manage system
- Administrator
- DBMS, mirror
24Languages Infrastructure
- Language Documentation via ISO 639-4: association of metadata descriptors to a model interoperable with DCIF (12620) (639-4 section 9)
25Languages Infrastructure
- Eventual inclusion of all available metadata
26Languages Infrastructure
- Language Codes Standards are growing in number and complexity
- From 2 to 6
- From 400 identifiers to upwards of 30000
- From lists to databases
- From tables to metadata registries
- From published text documents to published databases
- From IETF RFC to RFCs to RFCs
- From a closed membership committee to an open Community initiative (OmegaWiki)
- ... with accompanying (web) services and products
27Languages Infrastructure
- Language Codes Standards are growing in number and complexity
- From 2 to 6: eventually back to 1?
- From 400 identifiers to upwards of 30000: plus supporting metadata
- From lists to databases: multiple metadata registers
- From tables to metadata registries: registers, policies, auditors
- From published text documents to published databases: SAD
- From IETF RFC to RFCs to RFCs: consume, consume, consume
- From a closed membership committee to an open Community initiative (OmegaWiki): supporting infrastructure, expert review of community contributions (e-Voting?)
- ... with accompanying (web) services and products: Open Source and bespoke, and secured funding as necessary
28ISO standards
29Next steps
- ISO efforts with ISOcat (TC 37)
- OmegaWiki support for community building
- WLDC verification and validation in an on-going fashion
- Connecting the whole thing and evaluating at scale
- a simple catalogue of names of all languages in ISO 639 parts 1-3 has potential for, at least, 7500 x 7500 entries (> 56 million) plus associated status information
- Further connectivity: SRB (MCAT)? OMII (Data 2.0)?
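The scale estimate on this slide can be checked with a line of arithmetic. The sketch assumes the intended reading is one name for each language in each language, which gives a square table; the function name `name_catalogue_size` is hypothetical.

```python
def name_catalogue_size(language_count):
    """Entries needed to name every language in every language (a square table)."""
    return language_count * language_count


entries = name_catalogue_size(7500)
print(f"{entries:,}")  # 56,250,000
```

This counts a single name per pair of languages; alternative names and the associated status information mentioned on the slide would add to the total.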
30Acknowledgements
- EU eContent project LIRICS (22236)
- British Standards Institution
- OmegaWiki
- WLDC
- Department of Trade and Industry's Knowledge Transfer Partnerships scheme (KTP 1739).
- Contributions and efforts of colleagues and peers in ISO, BSI, IETF, in the projects identified, and in the wider community also.
- And thank you for listening.