Title: Big Data and How to Overcome the Problems it Causes


1
Big Data and How to Overcome the Problems it Causes
  • Ontology Engineering CSE 510/PHI 598
  • Fall 2014
  • September 8, 2014

2
Big Data Problem
  • Wikipedia defines Big Data as a collection of
    data sets so large and complex that it becomes
    difficult to process using on-hand database
    management tools.
  • Gartner defines Big Data with three Vs:
    • Volume
    • Velocity (of production and analysis)
    • Variety
  • This means that Big Data are beyond our control
    (as opposed to those complex and big systems with
    diverse and changing data where the complexity is
    known)

3
The Promise of Big Data
  • Great insights can be obtained from large diverse
    data sets if properly exploited with the right
    analytics
  • Proper exploitation requires solutions in the
    areas of
  • Hardware
  • Software
  • Method

4
Knowledge Representations: Attribute-Value Systems

Restaurant        | Cuisine  | Cost | Avg. Diner Review | Avg. Critic Review | Reservation Required
Tom's Diner       | American |      | 3.2               | 2.8                | No
Les Gros Poissons | French   |      | 4.5               | 4.8                | Yes
Il Grand Pesce    | Italian  |      | 3.8               | 3.5                | Yes
El Gran Pez       | Spanish  |      | 4.3               | 4.4                | No
Den Stora Fisken  | Swedish  |      | 3.2               | 4.8                | Yes
De Grote Vis      | Dutch    |      | 4.0               | 2.2                | Preferred

5
A Shortcoming of Attribute-Value Systems
  • Duplicate Attributes

Restaurant        | Cuisine  | Owner          | Owner 2          | Owner 3
Tom's Diner       | American | Tom Washington |                  |
Les Gros Poissons | French   | Jean Adams     | Simone Jefferson |
Il Grand Pesce    | Italian  | Robert Madison | Simone Jefferson |
El Gran Pez       | Spanish  | Louis Adams    |                  |
Den Stora Fisken  | Swedish  | Philip Jackson | Claire Van Buren | Susan Harrison
De Grote Vis      | Dutch    | Kate Tyler     |                  |

6
Relational Database Solutions
  • 1st Normal Form: no attributes which are
    themselves sets

Restaurant        | Cuisine  | Owner
Tom's Diner       | American | Tom Washington
Les Gros Poissons | French   | Jean Adams
Les Gros Poissons | French   | Simone Jefferson
Il Grand Pesce    | Italian  | Robert Madison
Il Grand Pesce    | Italian  | Simone Jefferson
El Gran Pez       | Spanish  | Louis Adams

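The step from repeated owner columns to 1st Normal Form can be sketched in a few lines of Python. This is only an illustration: the names `wide_rows` and `to_first_normal_form` are assumptions, and the repeated Owner columns are modeled here as a list.

```python
# Wide rows: (restaurant, cuisine, owners). The list of owners stands in for
# the repeated "Owner", "Owner 2", "Owner 3" columns (a modeling assumption).
wide_rows = [
    ("Tom's Diner", "American", ["Tom Washington"]),
    ("Les Gros Poissons", "French", ["Jean Adams", "Simone Jefferson"]),
    ("Il Grand Pesce", "Italian", ["Robert Madison", "Simone Jefferson"]),
    ("El Gran Pez", "Spanish", ["Louis Adams"]),
]


def to_first_normal_form(rows):
    """Emit one (restaurant, cuisine, owner) row per owner, so no cell holds a set."""
    return [
        (restaurant, cuisine, owner)
        for restaurant, cuisine, owners in rows
        for owner in owners
    ]


flat_rows = to_first_normal_form(wide_rows)
for row in flat_rows:
    print(row)
```

Each output row holds exactly one owner, at the price of repeating the restaurant's cuisine once per owner.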
7
Rows Represent Unique Objects
  • Each row now uniquely represents an aggregate
    entity of Restaurant and Owner
  • This aggregate forms the primary key of the table

Restaurant        | Cuisine  | Owner
Tom's Diner       | American | Tom Washington
Les Gros Poissons | French   | Jean Adams
Les Gros Poissons | French   | Simone Jefferson
Il Grand Pesce    | Italian  | Robert Madison
Il Grand Pesce    | Italian  | Simone Jefferson
El Gran Pez       | Spanish  | Louis Adams

8
A Shortcoming of 1st Normal Form
  • Since the attributes depend on only a part of the
    primary key (i.e. Restaurant), the table is
    subject to inconsistencies if an attribute is
    changed in one row for a restaurant but not in
    the others

Restaurant        | Cuisine  | Owner
Tom's Diner       | American | Tom Washington
Les Gros Poissons | Creole   | Jean Adams
Les Gros Poissons | French   | Simone Jefferson
Il Grand Pesce    | Italian  | Robert Madison
Il Grand Pesce    | Italian  | Simone Jefferson
El Gran Pez       | Spanish  | Louis Adams

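A check for this kind of inconsistency might look like the sketch below, which groups the rows by the Restaurant part of the key and reports any restaurant that ends up with more than one cuisine. The function name `conflicting_cuisines` is an assumption made for illustration.

```python
from collections import defaultdict

# Rows copied from the table on this slide: (restaurant, cuisine, owner).
rows = [
    ("Tom's Diner", "American", "Tom Washington"),
    ("Les Gros Poissons", "Creole", "Jean Adams"),
    ("Les Gros Poissons", "French", "Simone Jefferson"),
    ("Il Grand Pesce", "Italian", "Robert Madison"),
    ("Il Grand Pesce", "Italian", "Simone Jefferson"),
    ("El Gran Pez", "Spanish", "Louis Adams"),
]


def conflicting_cuisines(rows):
    """Return restaurants whose rows disagree about the cuisine."""
    cuisines = defaultdict(set)
    for restaurant, cuisine, _owner in rows:
        cuisines[restaurant].add(cuisine)
    return {name: found for name, found in cuisines.items() if len(found) > 1}


conflicts = conflicting_cuisines(rows)
print(conflicts)
```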
9
Relational Database Solutions
  • 2nd Normal Form requires that any attribute must
    describe the object designated by the primary key
    rather than just some part of it

Restaurant        | Cuisine  | Cost
Tom's Diner       | American |
Les Gros Poissons | Creole   |
Il Grand Pesce    | Italian  |
El Gran Pez       | Spanish  |
Den Stora Fisken  | Swedish  |
De Grote Vis      | Dutch    |

Restaurant        | Owner
Tom's Diner       | Tom Washington
Les Gros Poissons | Jean Adams
Les Gros Poissons | Simone Jefferson
Il Grand Pesce    | Robert Madison
Il Grand Pesce    | Simone Jefferson
El Gran Pez       | Louis Adams

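The two-table decomposition can be tried out with Python's built-in `sqlite3` module. This is a sketch under assumptions: the table and column names (`restaurant`, `ownership`, `name`, `cuisine`, `owner`) are invented for illustration, and only the four restaurants with listed owners are loaded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurant (
        name    TEXT PRIMARY KEY,
        cuisine TEXT NOT NULL
    );
    CREATE TABLE ownership (
        restaurant TEXT NOT NULL REFERENCES restaurant(name),
        owner      TEXT NOT NULL,
        PRIMARY KEY (restaurant, owner)
    );
""")
conn.executemany("INSERT INTO restaurant VALUES (?, ?)", [
    ("Tom's Diner", "American"),
    ("Les Gros Poissons", "French"),
    ("Il Grand Pesce", "Italian"),
    ("El Gran Pez", "Spanish"),
])
conn.executemany("INSERT INTO ownership VALUES (?, ?)", [
    ("Tom's Diner", "Tom Washington"),
    ("Les Gros Poissons", "Jean Adams"),
    ("Les Gros Poissons", "Simone Jefferson"),
    ("Il Grand Pesce", "Robert Madison"),
    ("Il Grand Pesce", "Simone Jefferson"),
    ("El Gran Pez", "Louis Adams"),
])

# Cuisine is stored once per restaurant, so a single UPDATE changes it
# consistently for every owner row that joins to it.
conn.execute(
    "UPDATE restaurant SET cuisine = ? WHERE name = ?",
    ("Creole", "Les Gros Poissons"),
)

joined = conn.execute("""
    SELECT o.restaurant, r.cuisine, o.owner
    FROM ownership AS o
    JOIN restaurant AS r ON r.name = o.restaurant
    ORDER BY o.restaurant, o.owner
""").fetchall()
for row in joined:
    print(row)
```

Because the cuisine no longer appears in the ownership rows, the Creole/French disagreement from the previous slide cannot be represented at all.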
10
A Shortcoming of 2nd Normal Form
  • While both Date of Purchase and Day of Purchase
    describe the unique object of the table (i.e. the
    Restaurant-Owner primary key), there are duplicate
    combinations of the two
  • If one of the combinations is changed without the
    other, a date may be shown as falling on two days
    of the week

Restaurant        | Owner            | Date of Purchase | Day of Purchase
Tom's Diner       | Tom Washington   | 5/3/1994         | Wednesday
Les Gros Poissons | Jean Adams       | 4/14/2008        | Friday
Les Gros Poissons | Simone Jefferson | 4/14/2008        | Saturday
Il Grand Pesce    | Robert Madison   | 10/28/2003       | Thursday
Il Grand Pesce    | Simone Jefferson | 2/2/1998         | Monday
El Gran Pez       | Louis Adams      | 7/30/2012        | Tuesday

11
Relational Database Solutions
  • 3rd Normal Form requires that any attribute
    describes the entity represented by the primary
    key and only that entity
  • No transitive descriptions as in the example from
    the previous slide

Restaurant        | Owner            | Date of Purchase
Tom's Diner       | Tom Washington   | 5/3/1994
Les Gros Poissons | Jean Adams       | 4/14/2008
Les Gros Poissons | Simone Jefferson | 4/14/2008
Il Grand Pesce    | Robert Madison   | 10/28/2003
Il Grand Pesce    | Simone Jefferson | 2/2/1998
El Gran Pez       | Louis Adams      | 7/30/2012

Date       | Day of Week
5/3/1994   | Wednesday
4/14/2008  | Friday
10/28/2003 | Thursday
2/2/1998   | Monday
7/30/2012  | Tuesday

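The effect of moving the day of the week into its own table can be sketched with two Python dictionaries. The names `purchases`, `day_of_week` and `purchase_day` are assumptions; the values are copied from the tables as given and are not checked against a calendar.

```python
# (restaurant, owner) -> date of purchase, as in the first table.
purchases = {
    ("Tom's Diner", "Tom Washington"): "5/3/1994",
    ("Les Gros Poissons", "Jean Adams"): "4/14/2008",
    ("Les Gros Poissons", "Simone Jefferson"): "4/14/2008",
    ("Il Grand Pesce", "Robert Madison"): "10/28/2003",
    ("Il Grand Pesce", "Simone Jefferson"): "2/2/1998",
    ("El Gran Pez", "Louis Adams"): "7/30/2012",
}

# date -> day of week, as in the second table. A dictionary has one value per
# key, so a date can no longer be recorded as falling on two different days.
day_of_week = {
    "5/3/1994": "Wednesday",
    "4/14/2008": "Friday",
    "10/28/2003": "Thursday",
    "2/2/1998": "Monday",
    "7/30/2012": "Tuesday",
}


def purchase_day(restaurant, owner):
    """Look up the day transitively: key -> date -> day of week."""
    return day_of_week[purchases[(restaurant, owner)]]


print(purchase_day("Les Gros Poissons", "Jean Adams"))
print(purchase_day("Les Gros Poissons", "Simone Jefferson"))
```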
12
Knowledge Representations As Highly Designed Artifacts

Restaurant        | Cuisine  | Cost
Tom's Diner       | American |
Les Gros Poissons | Creole   |
Il Grand Pesce    | Italian  |
El Gran Pez       | Spanish  |
Den Stora Fisken  | Swedish  |
De Grote Vis      | Dutch    |

Restaurant        | Owner            | Date of Purchase
Tom's Diner       | Tom Washington   | 5/3/1994
Les Gros Poissons | Jean Adams       | 4/14/2008
Les Gros Poissons | Simone Jefferson | 4/14/2008
Il Grand Pesce    | Robert Madison   | 10/28/2003
Il Grand Pesce    | Simone Jefferson | 2/2/1998
El Gran Pez       | Louis Adams      | 7/30/2012

Date       | Day of Week
5/3/1994   | Wednesday
4/14/2008  | Friday
10/28/2003 | Thursday
2/2/1998   | Monday
7/30/2012  | Tuesday

13
Application Translation Layers
Presentation Layer
Business Layer
Data Access Layer
14
Big Data Hardware Solution
  • Costly and can overrun the capabilities of the
    largest single machines
  • A solution is to distribute information across
    many smaller machines

15
Hardware Solution is Contrary to Relational Design
  • Relational databases are designed to run on
    single machines
  • Attempting to disassemble them and run them on a
    cluster of machines is very difficult
  • Big Data requires a different Data Model, one
    that is cluster friendly, that is, one that can
    be distributed while still being efficient at
    retrieving the data that is needed
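One common way to make data cluster friendly is to assign each record to a machine by hashing its key, so any machine can work out where a record lives without a central lookup. The sketch below assumes that approach; it is not taken from the presentation, and `node_for` and `partition` are hypothetical names.

```python
import hashlib


def node_for(key, node_count):
    """Map a key to a node index using a stable hash (same answer on every machine)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


def partition(keys, node_count):
    """Group keys by the node that would store them."""
    nodes = [[] for _ in range(node_count)]
    for key in keys:
        nodes[node_for(key, node_count)].append(key)
    return nodes


restaurants = [
    "Tom's Diner",
    "Les Gros Poissons",
    "Il Grand Pesce",
    "El Gran Pez",
    "Den Stora Fisken",
    "De Grote Vis",
]
nodes = partition(restaurants, 3)
for index, keys in enumerate(nodes):
    print(index, keys)
```

Retrieval by key touches only one machine, which is the efficiency the last bullet asks for; a query that is not by key would have to visit every machine.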

16
NoSQL Database Solutions
  • Do not require a highly structured representation
    of data; the data models are relatively simple
  • Key Value Model
  • Document Model
  • Column Family Model
  • Graph Model

17
Key-Value Data Model
  • A key-value pair associates a key with some
    value
  • The value can be any type of object: a number, a
    text value, an array, an image, a file, etc.

Tom's Diner       -> Value associated with Tom's Diner
Les Gros Poissons -> Value associated with Les Gros Poissons
Il Grand Pesce    -> Value associated with Il Grand Pesce
El Gran Pez       -> Value associated with El Gran Pez

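A toy in-memory version of the key-value model is sketched below. The class `KeyValueStore` and its `put`/`get` methods are assumptions for illustration and do not correspond to any particular product's API.

```python
import json


class KeyValueStore:
    """Toy key-value store: the store never looks inside a value."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data[key]


store = KeyValueStore()
# A value can be serialized structured data...
store.put(
    "Tom's Diner",
    json.dumps({"Cuisine": "American", "Owner": "Tom Washington"}).encode("utf-8"),
)
# ...or an arbitrary binary payload such as an image or a file.
store.put("El Gran Pez", b"\x00\x01 arbitrary binary payload")

# Only the client knows how to interpret what comes back.
tom = json.loads(store.get("Tom's Diner").decode("utf-8"))
print(tom["Cuisine"])
```

Since the values are opaque, the only supported access path is lookup by key; finding every restaurant with a given cuisine would mean fetching and decoding every value.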
18
Document Data Model
  • Each element is a document, that is, a complex
    data structure of some type, usually expressed in
    JSON (JavaScript Object Notation)
  • No set schema for the documents
  • More transparent than the Key-Value model
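The absence of a set schema, and the extra transparency over key-value storage, can be sketched with plain Python dictionaries standing in for JSON documents. The second document, its `Owners` field and the helper `find` are assumptions made for illustration.

```python
# Two documents in the same collection with different fields: no set schema.
documents = [
    {
        "id": 1,
        "Name": "Tom's Diner",
        "Cuisine": "American",
        "Average Diner Review": 3.2,
        "Average Critic Review": 2.8,
        "Reservation Required": "No",
        "Owner": "Tom Washington",
    },
    {
        "id": 2,
        "Name": "Les Gros Poissons",
        "Cuisine": "French",
        "Owners": ["Jean Adams", "Simone Jefferson"],  # hypothetical field
    },
]


def find(docs, field, value):
    """Return the documents whose field equals value; missing fields never match."""
    return [doc for doc in docs if doc.get(field) == value]


french = find(documents, "Cuisine", "French")
print([doc["Name"] for doc in french])
```

Unlike an opaque value in a key-value store, the fields of a document are visible to the store, so it can answer queries on them.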

{"id": 1, "Name": "Tom's Diner", "Cuisine": "American",
 "Cost": "", "Average Diner Review": 3.2,
 "Average Critic Review": 2.8,
 "Reservation Required": "No",
 "Owner": "Tom Washington"}
19
Column Family Data Model
  • A Row Key is associated with n-many column
    families (i.e. groups of columns that store
    related data)

Row Key: 1234
  Restaurant Column Family
    Name: Tom's Diner
    Cuisine: American
    Cost:
    Avg Review: 2.8
  Owner Column Family
    Name: Tom Washington
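As a rough sketch, a column-family row can be pictured as a nested mapping: row key, then column family, then column. The lowercase family names and the helper `read_family` are assumptions for illustration, not any product's API.

```python
# row key -> column family -> column -> value
table = {
    "1234": {
        "restaurant": {
            "Name": "Tom's Diner",
            "Cuisine": "American",
            "Avg Review": 2.8,
        },
        "owner": {
            "Name": "Tom Washington",
        },
    },
}


def read_family(table, row_key, family):
    """Fetch one group of related columns for a row without touching the others."""
    return table[row_key][family]


owner_columns = read_family(table, "1234", "owner")
print(owner_columns)
```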
20
Aggregate Orientation
  • As noticed and described by Martin Fowler, all of
    the aforementioned NoSQL data models share an
    orientation towards storing the description of
    a significant object
  • This enables the distribution of data that tends
    to be requested together (cluster-friendly)
  • Tends to be difficult to re-order the data to
    query by different aggregates

NoSQL Distilled: A Brief Guide to the Emerging
World of Polyglot Persistence, by Sadalage, P.J.
and Fowler, M. (2012)
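The difficulty of querying by a different aggregate can be sketched as follows: when data is stored as restaurant aggregates, a question about owners has to visit every aggregate and rebuild an inverted view. The names `restaurant_aggregates` and `restaurants_by_owner` are illustrative assumptions.

```python
from collections import defaultdict

# Data stored as restaurant aggregates: each restaurant carries its owners.
restaurant_aggregates = {
    "Tom's Diner": {"Cuisine": "American", "Owners": ["Tom Washington"]},
    "Les Gros Poissons": {"Cuisine": "French", "Owners": ["Jean Adams", "Simone Jefferson"]},
    "Il Grand Pesce": {"Cuisine": "Italian", "Owners": ["Robert Madison", "Simone Jefferson"]},
    "El Gran Pez": {"Cuisine": "Spanish", "Owners": ["Louis Adams"]},
}


def restaurants_by_owner(aggregates):
    """Rebuild the data around owners; this has to scan every restaurant aggregate."""
    index = defaultdict(list)
    for name, aggregate in aggregates.items():
        for owner in aggregate["Owners"]:
            index[owner].append(name)
    return dict(index)


by_owner = restaurants_by_owner(restaurant_aggregates)
print(by_owner["Simone Jefferson"])
```

On a cluster, that scan spans every machine holding restaurant aggregates, which is why the choice of aggregate matters at design time.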
21
Graph Data Model
[Diagram: graph of Tom's Diner with nodes and edges
labeled Restaurant, Owner, Tom Washington, Date of
Purchase 5/13/94, Wednesday, American Cuisine, Cost of,
Avg. Diner Review of 3.2, Avg. Critic Review of 2.8,
Reservations Not Required]
22
Graph Data Model
  • Does not have an aggregate orientation, rather
    the opposite, a granular orientation that breaks
    the aggregate into its composite elements
  • Good for data exploration
  • Still cluster friendly, similar data can be
    stored in separate graphs

23
RDF Data Model
  • RDF specifies a regular syntax for well formed
    expressions
  • rdf:Statement: a simple expression that relates
    one entity to another
  • rdf:subject: the entity the statement is about
  • rdf:predicate: the relationship said to hold
    between the two entities
  • rdf:object: the entity that is related to the
    subject
  • Humans are mortal
  • UB's website homepage has URL
    http://www.buffalo.edu/
  • Remus is the brother of Romulus

24
RDF Data Model
Subject     | Predicate                  | Object
Tom's Diner | Is_a                       | Restaurant
Tom's Diner | Offers                     | American Cuisine
Tom's Diner | Costs                      |
Tom's Diner | Has_average_diners_review  | 3.2
Tom's Diner | Has_average_critics_review | 2.8
Tom's Diner | Requires_reservation       | No
Tom's Diner | Has_owner                  | Tom Washington
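The subject-predicate-object table can be approximated in Python as a list of 3-tuples with a simple pattern match. This is a simplification: real RDF identifies subjects and predicates with IRIs, plain strings are used here only for readability, and the helper `match` is a hypothetical name. The Costs row is omitted because its object is missing from the table.

```python
# Each statement is a (subject, predicate, object) triple.
triples = [
    ("Tom's Diner", "Is_a", "Restaurant"),
    ("Tom's Diner", "Offers", "American Cuisine"),
    ("Tom's Diner", "Has_average_diners_review", 3.2),
    ("Tom's Diner", "Has_average_critics_review", 2.8),
    ("Tom's Diner", "Requires_reservation", "No"),
    ("Tom's Diner", "Has_owner", "Tom Washington"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


owners = match(triples, predicate="Has_owner")
print(owners)
```

Because every fact has the same three-part shape, the same `match` call works for any question, whether it starts from the subject, the predicate or the object.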
25
Methodological Solution
Linking Open Data cloud diagram, by Richard
Cyganiak and Anja Jentzsch. http://lod-cloud.net/
26
Origin
  • Formats of data sources included free text,
    semi-structured and structured
  • Some data sets are made available only a short
    time prior to system testing
  • Data sets and domain of interest will change
  • Data cannot be collected into a single store
  • Provide cross-source searching and analytics
  • Need to maintain the provenance of data

27
High Level View of Ontology Content
  • Enable Description of Human Activity

[Diagram: People and Organizations use Artifacts to
perform Actions that take place in Natural and
Artificial Environments; these are distinguished by
Time and Attributes]
28
High Level View of Ontology Content
  • Including the Activity of Describing Human
    Activity

[Diagram: People and Organizations produce
Information, at a Time, that describes human activity;
Information is distinguished by Attributes]
29
Current Import Structure of the I2WD Ontologies
Relation Ontology (RO)
Basic Formal Ontology (BFO)
RO BFO Bridge 1.1
Upper Level Ontology Mid-Level Ontology
Domain Ontology
Extended Relation Ontology
Time Ontology
Agent Ontology
Quality Ontology
Information Entity Ontology
Artifact Ontology
Geospatial Ontology
Event Ontology
ChEBI Ontology
Emotion Ontology
Manufactured Chemicals Ontology
AIRS Mid-Level Ontology
Counter-terrorism Ontology
Information Technology Ontology
30
Highlighted Capabilities of Ontologies
  • Objects (persons, organizations, facilities,
    materials, etc.) are linked to qualities,
    functions and roles
  • these links can be time-stamped
  • these attributes can be differentiated between
    designed and improvised
  • these attributes can be measured using nominal
    (tall, average), ordinal (1st, best), interval
    (30° Celsius), and ratio (30mm, 10 gallons)
    measurement types

31
Highlighted Capabilities of I2WD Ontologies
  • Events can be linked together with temporal or
    causal relationships
  • Ambiguous times ("occurred during the Spring of
    2010") and places ("happened in New York") can be
    integrated with more precise information
    ("occurred on April 18th, 2010", "happened in
    Central Park")
  • Vocabulary for output of sentiment analysis

32
Using States to Express Time Dependent Attributes
  • In 2004, Alaa al-Tamimi became Mayor of Baghdad.

[Diagram: instance graph for the statement above.
Classes: Person, Mayor Role, Gain Of Role, Temporal
Interval, Year, Government, City.
Instances, each linked to a class by "Is instance of":
Alaa al-Tamimi, Alaa al-Tamimi's Mayor Role, Gain of
Alaa al-Tamimi's Mayor Role, Temporal Interval of Gain
of Alaa al-Tamimi's Mayor Role, 2004, City Government
Of Baghdad, Baghdad.
Relations shown: Has role, Participates in, Occurs on,
Interval during, Is organizational Context of,
Delimited by]
33
Designed and Measured Artifact Attributes
[Diagram: designed and measured attributes of a
Samsung Galaxy S4.
Entities shown: Samsung Galaxy S4, Design
Specifications of Samsung Galaxy S4, Lithium Ion
Battery, Portion of Lithium Cobalt Oxide, Lithium,
Cobalt, Oxygen, Thermal Stability, Thermal Stability
Nominal Measurement, Thermal Stability Nominal
Measurement Value (text value "Poor"), Data Transfer
Speed, Data Transfer Speed Specification, Data
Transfer Speed Specification Value, Data Transfer
Speed Ratio Measurement, Data Transfer Speed
Measurement Value (decimal values 42.2 and 36.6, each
using measurement unit Mbps).
Relations shown: has_part, is made of, bearer_of,
Inheres_in, prescribes, prescribed_by, Is nominal
measurement of, Is ratio measurement of, Has text
value, Has decimal value, Uses measurement unit]
34
Ontology Content Based on Standards
Partial List of Doctrine and Standards Used
  • Basic Formal Ontology (BFO)
  • DOD Dictionary of Military and Associated Terms
    (JP 1-02)
  • Operations (FM 3-0)
  • Multinational Operations (JP 3-16)
  • Counterinsurgency (FM 3-24)
  • International Standard Industrial Classification
    of all Economic Activities Rev.4 (ISIC4)
  • Universal Joint Task List (CJCSM 3500.04C)
  • Weapon Technical Intelligence (WTI) Improvised
    Explosive Device (IED) Lexicon
  • JC3IEDM
  • Information Artifact Ontology (IAO)
  • Phenotype and Trait Ontology (PATO)
  • Foundational Model of Anatomy (FMA)
  • Regional Connection Calculus (RCC-8)
  • Allen Time Calculus
  • Wikipedia

35
Ontology Content Tested Against Data
Partial List of Data Sources Used
  • Treasury Office of Foreign Assets Control
    Specially Designated Nationals and Blocked
    Persons
  • NCTC Worldwide Incidents Tracking System
  • UMD Global Terrorism Database
  • RAND Database of Worldwide Terrorism Incidents
  • LDM version .60 (TED)
  • VMF PLI
  • DCGS-A Event Reporting
  • BFT Report (CCRi test data)
  • Cidne Sigact (CCRi test data)
  • Long War Journal
  • Harmony Documents from CTC at West Point
  • Threats Open Source Intelligence Gateway

36
Ontologies Use a Common Upper Ontology
[Diagram: classes Entity, Object, Quality,
Organization, Physical Artifact, Quality of
Organization, Quality of Physical Artifact; relations
shown: bearer_of, has_quality]
  • Produces common patterns within ontologies
  • Reuse of mappings from the sources
  • Easier to include new sources of data
  • Enables more uniformity between queries
  • Easier to transition to new domains of interest

37
Ontologies are Modular
[Diagram: classes Entity, Object, Organization,
Physical Artifact, Spatial Location; relation shown:
located_at]
  • Each Class is defined in one place
  • Facilitates locating a class within the target
    ontologies
  • Provides better recall in queries
  • Less likely to overlook relevant data

38
Ontologies Enable both Early and Late Fusion
  • Granular classes allow direct mappings from
    various perspectives on the same domain while
    preserving information that can be later used for
    entity resolution

[Diagram: Data Source 1, Data Source 2 and Data Source
3 mapped onto shared classes.
Labels shown: Car, Model, Manufacturer, Vehicle
Identification Number, Length of Wheelbase, Full Size,
Mid Size, Compact.
Relations shown: prescribes, has quality, manufactures,
is nominally measured by, designates]
39
Organization of Ontologies
  • A limited number of upper and mid-level
    ontologies are carefully managed
  • Domain ontologies are developed by subject matter
    experts and tested by automated procedures
  • Content is pushed from domain ontologies to
    mid-level ontologies as usage levels warrant

40
Future Re-Organization of Ontologies
BFO
Upper Level Ontology Mid-Level Ontology
Domain Ontology
Extended Relation Ontology
Information Artifact Ontology
Quality Ontology
Time Ontology
Geospatial Ontology
Event Ontology
Artifact Ontology
Agent Ontology
Chemical Ontology
Plant Taxonomy
Animal Taxonomy
Geological Taxonomy
Military Events
Anthropogenic Feature
Human Anatomy
Watercraft
Interpersonal Events
Atmospheric Feature
Ethnicities
Ground Vehicles
Aircraft
Weather Events
Hydrographic Feature
Occupations
Nationalities
Clothing
Acts of Government
Landform
Military Units
Weapons
Geopolitical Feature
Legal System Events
Religions
Communication Devices
Acts of Artifact Use
Role Defined Area
Ideologies
Tools
Criminal Acts
Disease Ontology
Mental Function Ontology
41
Conformance Testing
  • Inconsistency: A class is identified as being
    uninstantiable
  • Semantic Smuggling: A class or property is
    reused with changed content
  • Multiple Inheritance: A class or property is
    asserted to be a subclass of more than one
    superclass
  • Taxonomy Overloading: A class or property is
    related to its parent by a relationship other
    than subclass
  • Containment: A class or property is not a child
    of any class or property of the imported
    ontologies
  • Conflation: A class or property includes
    information model assertions that are not true of
    the domain
  • Logic of Terms: A class or property is a
    set-theoretic combination of other classes or
    properties
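Two of these checks, Multiple Inheritance and Containment, are mechanical enough to sketch in Python over a small class hierarchy. The hierarchy below (including "Amphibious Vehicle" and "Gadget") and the function names are hypothetical; the presentation does not describe how its automated tests are implemented.

```python
# class -> asserted superclasses (a hypothetical domain ontology)
parents = {
    "Object": ["Entity"],
    "Physical Artifact": ["Object"],
    "Ground Vehicle": ["Physical Artifact"],
    "Watercraft": ["Physical Artifact"],
    "Amphibious Vehicle": ["Ground Vehicle", "Watercraft"],
    "Gadget": [],
}
# Classes that come from the imported (upper or mid-level) ontologies.
imported = {"Entity", "Object"}


def ancestors(cls, parents):
    """Collect every superclass reachable from cls."""
    seen = set()
    stack = list(parents.get(cls, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(parents.get(parent, []))
    return seen


def multiple_inheritance(parents):
    """Classes asserted to be a subclass of more than one superclass."""
    return sorted(cls for cls, supers in parents.items() if len(supers) > 1)


def containment_violations(parents, imported):
    """Local classes that descend from no class of the imported ontologies."""
    return sorted(
        cls
        for cls in parents
        if cls not in imported and not (ancestors(cls, parents) & imported)
    )


print(multiple_inheritance(parents))
print(containment_violations(parents, imported))
```

The remaining checks, such as Semantic Smuggling or Conflation, depend on the meaning of definitions and are not covered by a structural sketch like this one.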

42
Building a Taxonomy: Common Problems
  • Use-Mention Errors
  • "Part of" rather than "subclass of"

43
Building a Taxonomy: Common Problems
  • "Narrower in meaning than" rather than "subclass of"
  • Logic of Terms
  • In Thomasnet.com (http://www.thomasnet.com/browse)
    classes are formed by conjunctions and the class
    hierarchy contains examples of subclasses based
    on search patterns

44
Building a Taxonomy: Common Problems
  • "Narrower in meaning than" rather than "subclass of"
  • In the Phenotypic Quality Ontology
    (http://purl.obolibrary.org/obo/PATO_0000320)
    classes are subclasses by hue.

45
Building a Taxonomy: Common Problems
  • Non-Disjoint Classes