Title: Big Data and How to Overcome the Problems it Causes
1 Big Data and How to Overcome the Problems it Causes
- Ontology Engineering CSE 510/PHI 598
- Fall 2014
- September 8, 2014
2 Big Data Problem
- Wikipedia defines Big Data as a collection of
data sets so large and complex that it becomes
difficult to process using on-hand database
management tools.
- Gartner defines Big Data with three Vs:
  - Volume
  - Velocity (of production and analysis)
  - Variety
- This means that Big Data are beyond our control
(as opposed to those complex and big systems with
diverse and changing data where the complexity is
known)
3 The Promise of Big Data
- Great insights can be obtained from large diverse
data sets if properly exploited with the right
analytics
- Proper exploitation requires solutions in the
areas of:
  - Hardware
  - Software
  - Method
4 Knowledge Representations: Attribute-Value Systems
Restaurant | Cuisine | Cost | Avg. Diner Review | Avg. Critic Review | Reservation Required
Toms Diner | American | | 3.2 | 2.8 | No
Les Gros Poissons | French | | 4.5 | 4.8 | Yes
Il Grand Pesce | Italian | | 3.8 | 3.5 | Yes
El Gran Pez | Spanish | | 4.3 | 4.4 | No
Den Stora Fisken | Swedish | | 3.2 | 4.8 | Yes
De Grote Vis | Dutch | | 4.0 | 2.2 | Preferred
5 A Shortcoming of Attribute-Value Systems
Restaurant | Cuisine | Owner | Owner 2 | Owner 3
Toms Diner | American | Tom Washington | |
Les Gros Poissons | French | Jean Adams | Simone Jefferson |
Il Grand Pesce | Italian | Robert Madison | Simone Jefferson |
El Gran Pez | Spanish | Louis Adams | |
Den Stora Fisken | Swedish | Philip Jackson | Claire Van Buren | Susan Harrison
De Grote Vis | Dutch | Kate Tyler | |
6 Relational Database Solutions
- 1st Normal Form: no attributes which are
themselves sets
Restaurant | Cuisine | Owner
Toms Diner | American | Tom Washington
Les Gros Poissons | French | Jean Adams
Les Gros Poissons | French | Simone Jefferson
Il Grand Pesce | Italian | Robert Madison
Il Grand Pesce | Italian | Simone Jefferson
El Gran Pez | Spanish | Louis Adams
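As a rough illustration of the move to 1st Normal Form, the sketch below flattens rows whose owner attribute is a set into one row per restaurant and owner. The helper name `to_first_normal_form` and the in-memory list layout are assumptions made for this example; the slides do not prescribe an implementation.

```python
# Rows in the attribute-value style: the third field is a list of owners.
restaurants = [
    ("Toms Diner", "American", ["Tom Washington"]),
    ("Les Gros Poissons", "French", ["Jean Adams", "Simone Jefferson"]),
    ("Il Grand Pesce", "Italian", ["Robert Madison", "Simone Jefferson"]),
    ("El Gran Pez", "Spanish", ["Louis Adams"]),
]


def to_first_normal_form(rows):
    """Hypothetical helper: emit one (restaurant, cuisine, owner) row per owner."""
    return [
        (name, cuisine, owner)
        for name, cuisine, owners in rows
        for owner in owners
    ]


flat_rows = to_first_normal_form(restaurants)
for row in flat_rows:
    print(row)
```

Every field in `flat_rows` now holds a single value, at the price of repeating the restaurant and cuisine once per owner.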
7 Rows Represent Unique Objects
- Each row now uniquely represents an aggregate
entity of Restaurant and Owner
- This aggregate forms the primary key of the table
Restaurant | Cuisine | Owner
Toms Diner | American | Tom Washington
Les Gros Poissons | French | Jean Adams
Les Gros Poissons | French | Simone Jefferson
Il Grand Pesce | Italian | Robert Madison
Il Grand Pesce | Italian | Simone Jefferson
El Gran Pez | Spanish | Louis Adams
8 A Shortcoming of 1st Normal Form
- Since the attributes depend on only a part of the
primary key (i.e. Restaurant), the table is
subject to the risk of inconsistency if the
attributes in one row are changed but
not in the others
Restaurant | Cuisine | Owner
Toms Diner | American | Tom Washington
Les Gros Poissons | Creole | Jean Adams
Les Gros Poissons | French | Simone Jefferson
Il Grand Pesce | Italian | Robert Madison
Il Grand Pesce | Italian | Simone Jefferson
El Gran Pez | Spanish | Louis Adams
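The inconsistency in the table above can be detected mechanically. The following is a minimal sketch, assuming the table is held as a list of tuples; the function name `conflicting_cuisines` is hypothetical.

```python
from collections import defaultdict

# The 1st Normal Form table from this slide, including the inconsistent row.
rows = [
    ("Toms Diner", "American", "Tom Washington"),
    ("Les Gros Poissons", "Creole", "Jean Adams"),
    ("Les Gros Poissons", "French", "Simone Jefferson"),
    ("Il Grand Pesce", "Italian", "Robert Madison"),
    ("Il Grand Pesce", "Italian", "Simone Jefferson"),
    ("El Gran Pez", "Spanish", "Louis Adams"),
]


def conflicting_cuisines(table):
    """Return restaurants that are recorded with more than one cuisine."""
    seen = defaultdict(set)
    for restaurant, cuisine, _owner in table:
        seen[restaurant].add(cuisine)
    return {name: sorted(values) for name, values in seen.items() if len(values) > 1}


conflicts = conflicting_cuisines(rows)
print(conflicts)
```

Cuisine depends only on Restaurant, which is just one part of the key, so the same fact is stored once per owner and the copies can drift apart.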
9 Relational Database Solutions
- 2nd Normal Form requires that any attribute must
describe the object designated by the primary key
rather than just some part of it
Restaurant | Cuisine | Cost
Toms Diner | American |
Les Gros Poissons | Creole |
Il Grand Pesce | Italian |
El Gran Pez | Spanish |
Den Stora Fisken | Swedish |
De Grote Vis | Dutch |

Restaurant | Owner
Toms Diner | Tom Washington
Les Gros Poissons | Jean Adams
Les Gros Poissons | Simone Jefferson
Il Grand Pesce | Robert Madison
Il Grand Pesce | Simone Jefferson
El Gran Pez | Louis Adams
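One way to try out this decomposition is with Python's built-in `sqlite3` module. The table names `restaurant` and `ownership`, and the update from Creole to French, are assumptions chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE restaurant (name TEXT PRIMARY KEY, cuisine TEXT);
    CREATE TABLE ownership (
        restaurant TEXT REFERENCES restaurant(name),
        owner TEXT,
        PRIMARY KEY (restaurant, owner)
    );
    """
)
conn.executemany(
    "INSERT INTO restaurant VALUES (?, ?)",
    [("Toms Diner", "American"), ("Les Gros Poissons", "Creole"),
     ("Il Grand Pesce", "Italian"), ("El Gran Pez", "Spanish")],
)
conn.executemany(
    "INSERT INTO ownership VALUES (?, ?)",
    [("Toms Diner", "Tom Washington"), ("Les Gros Poissons", "Jean Adams"),
     ("Les Gros Poissons", "Simone Jefferson"), ("Il Grand Pesce", "Robert Madison"),
     ("Il Grand Pesce", "Simone Jefferson"), ("El Gran Pez", "Louis Adams")],
)

# The cuisine is stored once, so a change touches exactly one row.
conn.execute("UPDATE restaurant SET cuisine = 'French' WHERE name = 'Les Gros Poissons'")

joined = conn.execute(
    """
    SELECT o.restaurant, r.cuisine, o.owner
    FROM ownership AS o JOIN restaurant AS r ON r.name = o.restaurant
    WHERE o.restaurant = 'Les Gros Poissons'
    ORDER BY o.owner
    """
).fetchall()
print(joined)
```

Both owner rows now see the same cuisine because the value is read through the join rather than copied into each row.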
10 A Shortcoming of 2nd Normal Form
- While both Date and Day of Purchase describe the
unique object of the table (i.e. the
Restaurant-Owner primary key), there are duplicate
combinations of the two
- If one of the combinations is changed without the
other, a date may be shown as falling on two days
of the week
Restaurant | Owner | Date of Purchase | Day of Purchase
Toms Diner | Tom Washington | 5/3/1994 | Wednesday
Les Gros Poissons | Jean Adams | 4/14/2008 | Friday
Les Gros Poissons | Simone Jefferson | 4/14/2008 | Saturday
Il Grand Pesce | Robert Madison | 10/28/2003 | Thursday
Il Grand Pesce | Simone Jefferson | 2/2/1998 | Monday
El Gran Pez | Louis Adams | 7/30/2012 | Tuesday
11 Relational Database Solutions
- 3rd Normal Form requires that any attribute
describes the entity represented by the primary
key and only that entity
- No transitive descriptions as in the example from
the previous slide
Restaurant | Owner | Date of Purchase
Toms Diner | Tom Washington | 5/3/1994
Les Gros Poissons | Jean Adams | 4/14/2008
Les Gros Poissons | Simone Jefferson | 4/14/2008
Il Grand Pesce | Robert Madison | 10/28/2003
Il Grand Pesce | Simone Jefferson | 2/2/1998
El Gran Pez | Louis Adams | 7/30/2012

Date | Day of Week
5/3/1994 | Wednesday
4/14/2008 | Friday
10/28/2003 | Thursday
2/2/1998 | Monday
7/30/2012 | Tuesday
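The effect of this split can be sketched with two in-memory structures. The values are copied from the slide as illustrative data and are not checked against a calendar; the helper name `purchases_with_day` is hypothetical.

```python
# Purchase facts keyed by (restaurant, owner); the day of week is no longer stored here.
purchases = [
    ("Toms Diner", "Tom Washington", "5/3/1994"),
    ("Les Gros Poissons", "Jean Adams", "4/14/2008"),
    ("Les Gros Poissons", "Simone Jefferson", "4/14/2008"),
    ("Il Grand Pesce", "Robert Madison", "10/28/2003"),
    ("Il Grand Pesce", "Simone Jefferson", "2/2/1998"),
    ("El Gran Pez", "Louis Adams", "7/30/2012"),
]

# The day of week is a fact about the date alone, so it is stored once per date.
day_of_week = {
    "5/3/1994": "Wednesday",
    "4/14/2008": "Friday",
    "10/28/2003": "Thursday",
    "2/2/1998": "Monday",
    "7/30/2012": "Tuesday",
}


def purchases_with_day(purchase_rows, days):
    """Rebuild the wide table by looking each date up in the date table."""
    return [(r, o, d, days[d]) for r, o, d in purchase_rows]


report = purchases_with_day(purchases, day_of_week)
days_for_shared_date = {day for _r, _o, date, day in report if date == "4/14/2008"}
print(days_for_shared_date)
```

Because each date appears once in `day_of_week`, two purchases on the same date can no longer disagree about the day.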
12 Knowledge Representations As Highly Designed Artifacts
Restaurant | Cuisine | Cost
Toms Diner | American |
Les Gros Poissons | Creole |
Il Grand Pesce | Italian |
El Gran Pez | Spanish |
Den Stora Fisken | Swedish |
De Grote Vis | Dutch |

Restaurant | Owner | Date of Purchase
Toms Diner | Tom Washington | 5/3/1994
Les Gros Poissons | Jean Adams | 4/14/2008
Les Gros Poissons | Simone Jefferson | 4/14/2008
Il Grand Pesce | Robert Madison | 10/28/2003
Il Grand Pesce | Simone Jefferson | 2/2/1998
El Gran Pez | Louis Adams | 7/30/2012

Date | Day of Week
5/3/1994 | Wednesday
4/14/2008 | Friday
10/28/2003 | Thursday
2/2/1998 | Monday
7/30/2012 | Tuesday
13 Application Translation Layers
Presentation Layer
Business Layer
Data Access Layer
14 Big Data Hardware Solution
- Big Data is costly to handle and can overrun the
capabilities of even the largest single machines
- A solution is to distribute information across
many smaller machines
15 Hardware Solution is Contrary to Relational Design
- Relational databases are designed to run on single machines
- Attempting to disassemble them and run them on a
cluster of machines is very difficult
- Big Data requires a different data model, one
that is cluster-friendly, that is, one that can
be distributed while still being efficient at
retrieving the data that is needed
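To make "cluster-friendly" a little more concrete, here is a minimal sketch of hash partitioning, one common way to spread records over machines. The slides do not name a partitioning scheme, so the choice of hashing, the node count and the function name `node_for` are all assumptions.

```python
import hashlib

NODE_COUNT = 3  # assumed cluster size, for illustration only


def node_for(key, node_count=NODE_COUNT):
    """Map a record key to a node index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


keys = ["Toms Diner", "Les Gros Poissons", "Il Grand Pesce", "El Gran Pez"]
placement = {key: node_for(key) for key in keys}
for key, node in placement.items():
    print(f"{key} -> node {node}")
```

A lookup by key only has to contact the one node that `node_for` returns, which is why a model built around keyed lookups distributes well.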
16 NoSQL Database Solutions
- Do not require a highly structured representation
of data; the data models are relatively simple:
  - Key-Value Model
  - Document Model
  - Column Family Model
  - Graph Model
17 Key-Value Data Model
- Key-Value pair where the key is associated with
some value
- The value can be any type of object: a number, a
text value, an array, an image, a file, etc.
Toms Diner -> Value associated with Toms Diner
Les Gros Poissons -> Value associated with Les Gros Poissons
Il Grand Pesce -> Value associated with Il Grand Pesce
El Gran Pez -> Value associated with El Gran Pez
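A key-value store can be approximated with a dictionary whose values are opaque bytes. The `put` and `get` helpers and the JSON encoding of values are assumptions for this sketch, not part of any particular product's API.

```python
import json

store = {}  # key -> opaque bytes; the store itself never looks inside a value


def put(key, value):
    """Serialize the value and file it under the key."""
    store[key] = json.dumps(value).encode("utf-8")


def get(key):
    """Fetch by key and decode; only the caller knows the value's structure."""
    return json.loads(store[key].decode("utf-8"))


put("Toms Diner", {"Cuisine": "American", "Owner": "Tom Washington"})
put("El Gran Pez", {"Cuisine": "Spanish", "Owner": "Louis Adams"})

print(get("Toms Diner")["Cuisine"])
```

Retrieval by key is direct, but a question such as "which restaurants serve Spanish food" requires decoding every value, since the store cannot see inside them.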
18 Document Data Model
- Each element is a document, that is, a complex
data structure of some type, usually expressed in
JSON (JavaScript Object Notation)
- No set schema for the documents
- More transparent than the Key-Value model
{
  "id": 1,
  "Name": "Tom's Diner",
  "Cuisine": "American",
  "Cost": "",
  "Average Diner Review": 3.2,
  "Average Critic Review": 2.8,
  "Reservation Required": "No",
  "Owner": "Tom Washington"
}
19 Column Family Data Model
- A Row Key is associated with n-many column
families (i.e. groups of columns that store
related data)
Row Key: 1234
Restaurant Column Family
  Name: Toms Diner
  Cuisine: American
  Cost:
  Avg Review: 2.8
Owner Column Family
  Name: Tom Washington
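The layout above can be mimicked with nested dictionaries: row key, then column family, then column. The family names `restaurant` and `owner` and the helper `read_family` are assumptions for this sketch; the Cost column is left out because its value is missing on the slide.

```python
# row key -> column family -> column -> value
table = {
    "1234": {
        "restaurant": {
            "Name": "Toms Diner",
            "Cuisine": "American",
            "Avg Review": 2.8,
        },
        "owner": {
            "Name": "Tom Washington",
        },
    },
}


def read_family(rows, row_key, family):
    """Fetch one column family of one row without touching the other families."""
    return rows[row_key][family]


owner_columns = read_family(table, "1234", "owner")
print(owner_columns)
```

Grouping columns into families lets a reader fetch only the related columns it needs for a given row key.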
20 Aggregate Orientation
- As noticed and described by Martin Fowler, all of
the aforementioned NoSQL data models share an
orientation towards storing the description of
a significant object
- This enables the distribution of data that tends
to be requested together (cluster-friendly)
- It tends to be difficult to re-order the data to
query by different aggregates
Sadalage, P.J. and Fowler, M. (2012). NoSQL Distilled: A Brief Guide to the Emerging
World of Polyglot Persistence.
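The cost of querying by a different aggregate can be seen in a small sketch. Here the data is stored as restaurant aggregates, and asking "what does each owner own" forces a pass over every aggregate. The field name `Owners` and the helper `regroup_by_owner` are assumptions for illustration.

```python
from collections import defaultdict

# Data stored with Restaurant as the aggregate.
aggregates = {
    "Toms Diner": {"Cuisine": "American", "Owners": ["Tom Washington"]},
    "Les Gros Poissons": {"Cuisine": "French", "Owners": ["Jean Adams", "Simone Jefferson"]},
    "Il Grand Pesce": {"Cuisine": "Italian", "Owners": ["Robert Madison", "Simone Jefferson"]},
}


def regroup_by_owner(restaurant_aggregates):
    """Re-order the data around owners; this must visit every restaurant aggregate."""
    by_owner = defaultdict(list)
    for name, aggregate in restaurant_aggregates.items():
        for owner in aggregate["Owners"]:
            by_owner[owner].append(name)
    return dict(by_owner)


by_owner = regroup_by_owner(aggregates)
print(by_owner["Simone Jefferson"])
```

On a cluster the aggregates sit on different machines, so this regrouping means reading from all of them rather than from a single node.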
21 Graph Data Model
[Figure: graph with the nodes and edge labels Toms Diner, Restaurant, American Cuisine, Cost of (value missing), Avg. Diner Review of 3.2, Avg. Critic Review of 2.8, Reservations Not Required, Owner, Tom Washington, Date of Purchase, 5/13/94, Wednesday]
22 Graph Data Model
- Does not have an aggregate orientation but rather
the opposite: a granular orientation that breaks
the aggregate into its component elements
- Good for data exploration
- Still cluster-friendly: similar data can be
stored in separate graphs
23 RDF Data Model
- RDF specifies a regular syntax for well-formed
expressions
  - rdf:Statement: a simple expression that relates
  one entity to another
  - rdf:subject: the entity the statement is about
  - rdf:predicate: the relationship said to hold
  between the two entities
  - rdf:object: the entity that is related to the
  subject
- Humans are mortal
- UB's website homepage has URL http://www.buffalo.edu/
- Remus is the brother of Romulus
24 RDF Data Model
Subject | Predicate | Object
Toms Diner | Is_a | Restaurant
Toms Diner | Offers | American Cuisine
Toms Diner | Costs |
Toms Diner | Has_average_diners_review | 3.2
Toms Diner | Has_average_critics_review | 2.8
Toms Diner | Requires_reservation | No
Toms Diner | Has_owner | Tom Washington
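The subject-predicate-object rows above can be handled as plain tuples. This sketch uses only the Python standard library rather than an RDF toolkit, and the `match` helper is hypothetical. The Costs row is omitted because its object is missing on the slide.

```python
triples = [
    ("Toms Diner", "Is_a", "Restaurant"),
    ("Toms Diner", "Offers", "American Cuisine"),
    ("Toms Diner", "Has_average_diners_review", 3.2),
    ("Toms Diner", "Has_average_critics_review", 2.8),
    ("Toms Diner", "Requires_reservation", "No"),
    ("Toms Diner", "Has_owner", "Tom Washington"),
]


def match(store, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in store
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


owners = match(triples, predicate="Has_owner")
print(owners)
```

Since every fact has the same three-part shape, a new kind of fact is added as one more row and needs no schema change.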
25 Methodological Solution
[Figure: Linking Open Data cloud diagram, by Richard
Cyganiak and Anja Jentzsch. http://lod-cloud.net/]
26 Origin
- Formats of data sources included free text,
semi-structured and structured
- Some data sets are made available only a short
time prior to system testing
- Data sets and domain of interest will change
- Data cannot be collected into a single store
- Provide cross-source searching and analytics
- Need to maintain the provenance of data
27 High Level View of Ontology Content
- Enable Description of Human Activity
[Diagram. Nodes: People, Organizations, Artifacts, Actions, Natural and Artificial Environments, Time, Attributes. Edge labels: use, to perform, that take place in, are distinguished by]
28 High Level View of Ontology Content
- Including the Activity of Describing Human
Activity
[Diagram. Nodes: People, Organizations, Information, Time, Attributes. Edge labels: produce, that describe, at a, is distinguished by]
29 Current Import Structure of the I2WD Ontologies
Relation Ontology (RO)
Basic Formal Ontology (BFO)
RO BFO Bridge 1.1
Upper Level Ontology Mid-Level Ontology
Domain Ontology
Extended Relation Ontology
Time Ontology
Agent Ontology
Quality Ontology
Information Entity Ontology
Artifact Ontology
Geospatial Ontology
Event Ontology
ChEBI Ontology
Emotion Ontology
Manufactured Chemicals Ontology
AIRS Mid-Level Ontology
Counter-terrorism Ontology
Information Technology Ontology
30 Highlighted Capabilities of Ontologies
- Objects (persons, organizations, facilities,
materials, etc.) are linked to qualities,
functions and roles
  - these links can be time-stamped
  - these attributes can be differentiated between
  designed and improvised
  - these attributes can be measured using nominal
  (tall, average), ordinal (1st, best), interval
  (30° Celsius), and ratio (30mm, 10 gallons)
  measurement types
31 Highlighted Capabilities of I2WD Ontologies
- Events can be linked together with temporal or
causal relationships
- Ambiguous times (occurred during the Spring of
2010) and places (happened in New York) can be
integrated with more precise information
(occurred on April 18th, 2010, happened in
Central Park)
- Vocabulary for output of sentiment analysis
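One way to picture how an ambiguous time can be integrated with a precise one is to treat both as intervals and test containment. The boundaries chosen for "Spring of 2010" (March 1 to May 31) and the helper name `during` are assumptions for this sketch; the slides do not say how the ontologies delimit seasons.

```python
from datetime import date


def during(inner, outer):
    """True if the inner interval lies entirely within the outer interval."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]


# Assumed boundaries for the ambiguous period, for illustration only.
spring_2010 = (date(2010, 3, 1), date(2010, 5, 31))
# A precise report, represented as a one-day interval.
april_18_2010 = (date(2010, 4, 18), date(2010, 4, 18))

compatible = during(april_18_2010, spring_2010)
print(compatible)
```

Under this representation the vague report and the precise report can describe the same event, because the precise interval falls inside the vague one.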
32 Using States to Express Time Dependent Attributes
- In 2004, Alaa al-Tamimi became Mayor of Baghdad.
[Diagram of this statement.
Classes: Mayor Role, Temporal Interval, Gain Of Role, Year, Person, Government, City.
Instances: Alaa al-Tamimi's Mayor Role, Temporal Interval of Gain of Alaa al-Tamimi's Mayor Role, Gain of Alaa al-Tamimi's Mayor Role, 2004, Alaa al-Tamimi, City Government Of Baghdad, Baghdad.
Relations: Is instance of (seven links), Has role, Participates in (two links), Occurs on, Interval during, Delimited by, Is organizational context of.]
33 Designed and Measured Artifact Attributes
[Diagram.
Artifacts and materials: Samsung Galaxy S4, Design Specifications of Samsung Galaxy S4, Lithium Ion Battery, Portion of Lithium Cobalt Oxide, Lithium, Cobalt, Oxygen.
Thermal stability: Thermal Stability, Thermal Stability Nominal Measurement, Thermal Stability Nominal Measurement Value, text value "Poor".
Data transfer speed: Data Transfer Speed, Data Transfer Speed Specification, Data Transfer Speed Specification Value, Data Transfer Speed Ratio Measurement, Data Transfer Speed Measurement Value, decimal values 42.2 and 36.6, each using the measurement unit Mbps.
Relations: has_part, is made of, bearer_of, inheres_in, prescribes, prescribed_by, is nominal measurement of, is ratio measurement of, has text value, has decimal value, uses measurement unit.]
34 Ontology Content Based on Standards
Partial List of Doctrine and Standards Used
- Basic Formal Ontology (BFO)
- DOD Dictionary of Military and Associated Terms
(JP 1-02)
- Operations (FM 3-0)
- Multinational Operations (JP 3-16)
- Counterinsurgency (FM 3-24)
- International Standard Industrial Classification
of all Economic Activities Rev.4 (ISIC4)
- Universal Joint Task List (CJSCM 3500.04C)
- Weapon Technical Intelligence (WTI) Improvised
Explosive Device (IED) Lexicon
- JC3IEDM
- Information Artifact Ontology (IAO)
- Phenotype and Trait Ontology (PATO)
- Foundational Model of Anatomy (FMA)
- Regional Connection Calculus (RCC-8)
- Allen Time Calculus
- Wikipedia
35 Ontology Content Tested Against Data
Partial List of Data Sources Used
- Treasury Office of Foreign Assets Control
Specially Designated Nationals and Blocked
Persons
- NCTC Worldwide Incidents Tracking System
- UMD Global Terrorism Database
- RAND Database of Worldwide Terrorism Incidents
- LDM version .60 (TED)
- VMF PLI
- DCGS-A Event Reporting
- BFT Report (CCRi test data)
- Cidne Sigact (CCRi test data)
- Long War Journal
- Harmony Documents from CTC at West Point
- Threats Open Source Intelligence Gateway
36 Ontologies Use a Common Upper Ontology
[Diagram: class hierarchy with Entity, Object, Quality, Organization, Physical Artifact, Quality of Organization and Quality of Physical Artifact; relations shown: bearer_of, has_quality]
- Produces common patterns within ontologies
- Reuse of mappings from the sources
- Easier to include new sources of data
- Enables more uniformity between queries
- Easier to transition to new domains of interest
37 Ontologies are Modular
[Diagram: class hierarchy with Entity, Object, Organization, Physical Artifact and Spatial Location; relation shown: located_at]
- Each Class is defined in one place
- Facilitates locating a class within the target
ontologies
- Provides better recall in queries
- Less likely to overlook relevant data
38 Ontologies Enable both Early and Late Fusion
- Granular classes allow direct mappings from
various perspectives on the same domain while
preserving information that can later be used for
entity resolution
[Diagram: Data Source 1, Data Source 2 and Data Source 3 shown against the classes Model, Car, Manufacturer, Length of Wheelbase, Full Size, Mid Size, Compact and Vehicle Identification Number; relations shown: prescribes, has quality, manufactures, is nominally measured by, designates]
39 Organization of Ontologies
- A limited number of upper and mid-level
ontologies are carefully managed
- Domain ontologies are developed by subject matter
experts and tested by automated procedures
- Content is pushed from domain ontologies to
mid-level ontologies as usage levels warrant
40 Future Re-Organization of Ontologies
BFO
Upper Level Ontology Mid-Level Ontology
Domain Ontology
Extended Relation Ontology
Information Artifact Ontology
Quality Ontology
Time Ontology
Geospatial Ontology
Event Ontology
Artifact Ontology
Agent Ontology
Chemical Ontology
Plant Taxonomy
Animal Taxonomy
Geological Taxonomy
Military Events
Anthropogenic Feature
Human Anatomy
Watercraft
Interpersonal Events
Atmospheric Feature
Ethnicities
Ground Vehicles
Aircraft
Weather Events
Hydrographic Feature
Occupations
Nationalities
Clothing
Acts of Government
Landform
Military Units
Weapons
Geopolitical Feature
Legal System Events
Religions
Communication Devices
Acts of Artifact Use
Role Defined Area
Ideologies
Tools
Criminal Acts
Disease Ontology
Mental Function Ontology
41 Conformance Testing
- Inconsistency: A class is identified as being
uninstantiable
- Semantic Smuggling: A class or property is
reused with changed content
- Multiple Inheritance: A class or property is
asserted to be a subclass of more than one
superclass
- Taxonomy Overloading: A class or property is
related to its parent by a relationship other
than subclass
- Containment: A class or property is not a child
of any class or property of the imported
ontologies
- Conflation: A class or property includes
information model assertions that are not true of
the domain
- Logic of Terms: A class or property is a
set-theoretic combination of other classes or
properties
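Two of these checks, Multiple Inheritance and Containment, are simple enough to sketch over a toy hierarchy. The dictionary layout, the function names, and the classes "Amphibious Vehicle" and "Orphan Class" are hypothetical; this is not the automated procedure used for the I2WD ontologies.

```python
# class -> list of asserted superclasses (toy data for illustration)
subclass_of = {
    "Object": ["Entity"],
    "Physical Artifact": ["Object"],
    "Ground Vehicle": ["Physical Artifact"],
    "Watercraft": ["Physical Artifact"],
    "Amphibious Vehicle": ["Ground Vehicle", "Watercraft"],  # hypothetical class
    "Orphan Class": [],  # hypothetical class with no parent
}
imported_roots = {"Entity"}  # classes supplied by the imported ontologies


def multiple_inheritance(hierarchy):
    """Classes asserted to be a subclass of more than one superclass."""
    return sorted(c for c, parents in hierarchy.items() if len(parents) > 1)


def reaches_root(cls, hierarchy, roots, seen=None):
    """True if following superclass links from cls leads to an imported class."""
    seen = set() if seen is None else seen
    if cls in roots:
        return True
    if cls in seen:
        return False
    seen.add(cls)
    return any(reaches_root(p, hierarchy, roots, seen) for p in hierarchy.get(cls, []))


def uncontained(hierarchy, roots):
    """Classes that are not descendants of any imported class."""
    return sorted(c for c in hierarchy if not reaches_root(c, hierarchy, roots))


print(multiple_inheritance(subclass_of))
print(uncontained(subclass_of, imported_roots))
```

The remaining checks, such as Semantic Smuggling or Conflation, depend on the meaning of definitions and are less amenable to a purely structural test like this one.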
42 Building a Taxonomy: Common Problems
- Use-Mention Errors
- "Part of" rather than "subclass of"
43 Building a Taxonomy: Common Problems
- "Narrower in meaning than" rather than "subclass of"
- Logic of Terms
- In Thomasnet.com (http://www.thomasnet.com/browse)
classes are formed by conjunctions and the class
hierarchy contains examples of subclasses based
on search patterns
44 Building a Taxonomy: Common Problems
- "Narrower in meaning than" rather than "subclass of"
- In the Phenotypic Quality Ontology
(http://purl.obolibrary.org/obo/PATO_0000320)
classes are subclasses by hue.
45 Building a Taxonomy: Common Problems