X10 Overview - PowerPoint PPT Presentation

About This Presentation
Title:

X10 Overview

Description:

X10 Overview Vijay Saraswat vsaraswa_at_us.ibm.com This work has been supported in part by the Defense Advanced Research Projects Agency (DARPA) under contract No ... – PowerPoint PPT presentation

Slides: 30
Provided by: VijaySa4
Learn more at: http://saraswat.org

Transcript and Presenter's Notes

Title: X10 Overview


1
X10 Overview
  • Vijay Saraswat
  • vsaraswa_at_us.ibm.com

This work has been supported in part by the
Defense Advanced Research Projects Agency (DARPA)
under contract No. NBCH30390004.
2
Acknowledgements
  • X10 Tools
      • Julian Dolby, Steve Fink, Robert Fuhrer,
        Matthias Hauswirth, Peter Sweeney, Frank Tip,
        Mandana Vaziri
  • University partners
      • MIT (StreamIt), Purdue University (X10), UC
        Berkeley (StreamBit), U. Delaware (Atomic
        sections), U. Illinois (Fortran plug-in),
        Vanderbilt University (Productivity metrics),
        DePaul U (Semantics)
  • X10 core team
      • Philippe Charles
      • Chris Donawa (IBM Toronto)
      • Kemal Ebcioglu
      • Christian Grothoff (Purdue)
      • Allan Kielstra (IBM Toronto)
      • Douglas Lovell
      • Maged Michael
      • Christoph von Praun
      • Vivek Sarkar
  • Additional contributors to X10 ideas
      • David Bacon, Bob Blainey, Perry Cheng,
        Julian Dolby, Guang Gao (U Delaware), Robert
        O'Callahan, Filip Pizlo (Purdue), Lawrence
        Rauchwerger (Texas A&M), Mandana Vaziri, Jan
        Vitek (Purdue), V.T. Rajan, Radha Jagadeesan
        (DePaul)

X10 PM+Tools Team Lead: Kemal Ebcioglu, Vivek Sarkar
PERCS Principal Investigator: Mootaz Elnozahy
3
The X10 Programming Model
[Figure: the X10 programming model, drawn as a row of places. Labels: Place, Partitioned Global heap, Place-local heap, Activities, Outbound activities, Inbound activities, Inbound activity replies, Outbound activity replies, Immutable Data.]
  • Program may spawn multiple (local or remote)
    activities in parallel.
  • Program must use asynchronous operations to
    access/update remote data.
  • Program may repeatedly detect quiescence of a
    programmer-specified, data-dependent, distributed
    set of activities.
  • A program is a collection of places, each
    containing resident data and a dynamic collection
    of activities.
  • Program may distribute aggregate data (arrays)
    across places during allocation.
  • Program may directly operate only on local data,
    using atomic blocks.

Cluster Computing: P > 1
Shared Memory: P = 1
MPI: P > 1
4
X10 v0.409 Cheat Sheet
(Parts in [ ] are optional.)

DataType ::= ClassName | InterfaceName | ArrayType
           | nullable DataType
           | future DataType
Kind     ::= value | reference

Stm ::= async [ ( Place ) ] [ clocked ClockList ] Stm
      | when ( SimpleExpr ) Stm
      | finish Stm
      | next
      | c.resume()
      | c.drop()
      | for ( i : Region ) Stm
      | foreach ( i : Region ) Stm
      | ateach ( I : Distribution ) Stm

Expr ::= ArrayExpr

ClassModifier  ::= Kind
MethodModifier ::= atomic

x10.lang has the following classes (among others): point, range, region,
distribution, clock, array. Some of these are supported by special syntax.
5
X10 v0.409 Cheat Sheet: Array support

Region ::= Expr : Expr                          -- 1-D region
         | [ Range, ..., Range ]                -- Multidimensional region
         | Region && Region                     -- Intersection
         | Region || Region                     -- Union
         | Region - Region                      -- Set difference
         | BuiltinRegion

Distribution ::= Region -> Place                -- Constant distribution
               | Distribution | Place           -- Restriction
               | Distribution | Region          -- Restriction
               | Distribution || Distribution   -- Union
               | Distribution - Distribution    -- Set difference
               | Distribution.overlay ( Distribution )
               | BuiltinDistribution

ArrayExpr ::= new ArrayType ( Formal ) { Stm }
            | Distribution Expr                 -- Lifting
            | ArrayExpr [ Region ]              -- Section
            | ArrayExpr | Distribution          -- Restriction
            | ArrayExpr || ArrayExpr            -- Union
            | ArrayExpr.overlay(ArrayExpr)      -- Update
            | ArrayExpr.scan( fun , ArgList )
            | ArrayExpr.reduce( fun , ArgList )
            | ArrayExpr.lift( fun , ArgList )

ArrayType ::= Type Kind [ ]
            | Type Kind [ region(N) ]
            | Type Kind [ Region ]
            | Type Kind [ Distribution ]
Language supports type safety, memory safety,
place safety, clock safety
6
Design Principles
  • Support for scalability
      • Support locality.
      • Support asynchrony.
      • Ensure synchronization constructs scale.
      • Support aggregate operations.
      • Ensure optimizations expressible in source.
  • Support for productivity
      • Extend OO base.
      • Design must rule out large classes of errors
        (type safe, memory safe, pointer safe, lock safe,
        clock safe).
      • Support incremental introduction of types.
      • Integrate with static tools (Eclipse).
      • Support automatic static and dynamic optimization
        (CPO).

General purpose language for scalable server-side
applications, to be used by High Productivity and
High Performance programmers.
7
Past work
  • Java
      • Base language
  • Cilk
      • async, finish
  • PGAS languages
      • places
  • SPMD languages, Synchronous languages
      • clocks
  • Atomic operations
  • ZPL, Titanium, (HPF)
      • Regions, distributions

8
Future language extensions
  • Type system
      • semantic annotations
      • clocked finals
      • aliasing annotations
      • dependent types
  • Determinate programming
      • e.g. immutable data
  • Weaker memory model?
      • ordering constructs
  • First-class functions
  • Generics
  • Components?
  • User-definable primitive types
      • Support for operators
  • Relaxed exception model
  • Middleware focus
      • Persistence?
      • Fault tolerance?
      • XML support?

9
RandomAccess
public boolean run() {
  distribution D = distribution.factory.block(TABLE_SIZE);
  long[.] table = new long[D] (point [i]) { return i; };
  long[.] RanStarts = new long[distribution.factory.unique()]
      (point [i]) { return starts(i); };
  long[.] SmallTable = new long value[TABLE_SIZE]
      (point [i]) { return i*S_TABLE_INIT; };
  finish ateach (point [i] : RanStarts) {
    long ran = nextRandom(RanStarts[i]);
    for (int count : 1:N_UPDATES_PER_PLACE) {
      int J = f(ran);
      long K = SmallTable[g(ran)];
      async atomic table[J] ^= K;
      ran = nextRandom(ran);
    }
  }
  return table.sum() == EXPECTED_RESULT;
}

Allocate and initialize table as a
block-distributed array.
Allocate and initialize RanStarts with one random
number seed for each place.
Allocate a small immutable table that can be
copied to all places.
Everywhere in parallel, repeatedly generate
random table indices and atomically
read/modify/write table element.
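
For readers who want to experiment with the update pattern in plain Java, the following is a minimal sketch, not the X10 program itself. It assumes the per-element update is an XOR (as in the HPC Challenge RandomAccess benchmark), models each place as one iteration of a parallel stream, and uses a made-up `nextRandom` generator and index/value derivation, since the slide does not show `nextRandom`, `f`, or `g`. The class name `RandomAccessSketch` is hypothetical.

```java
import java.util.concurrent.atomic.AtomicLongArray;
import java.util.stream.IntStream;

// Hypothetical Java analogue; not the X10 program from the slide.
final class RandomAccessSketch {
    // Stand-in generator: the slide's nextRandom, f and g are not shown.
    static long nextRandom(long x) {
        return x * 6364136223846793005L + 1442695040888963407L;
    }

    // One pass: every "place" p applies `updates` atomic XOR updates.
    static void pass(AtomicLongArray table, int places, int updates) {
        IntStream.range(0, places).parallel().forEach(p -> {
            long ran = nextRandom(p + 1);
            for (int c = 0; c < updates; c++) {
                int j = (int) Math.floorMod(ran, (long) table.length());
                long k = ran >>> 32;
                // Atomic read/modify/write of one element (XOR is assumed).
                table.accumulateAndGet(j, k, (old, v) -> old ^ v);
                ran = nextRandom(ran);
            }
        });
    }

    // XOR is its own inverse, so two identical passes restore the table.
    static boolean demo() {
        AtomicLongArray table = new AtomicLongArray(64);
        for (int i = 0; i < table.length(); i++) table.set(i, i);
        pass(table, 4, 1000);
        pass(table, 4, 1000);
        for (int i = 0; i < table.length(); i++) {
            if (table.get(i) != i) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Because XOR updates commute, the final table does not depend on how the parallel updates interleave, which is what makes the self-check in `demo()` deterministic.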
10
Backup
11
Performance and Productivity Challenges
1) Memory wall: Architectures exhibit severe
non-uniformities in bandwidth and latency in the memory
hierarchy.
2) Frequency wall: Architectures introduce
hierarchical, heterogeneous parallelism to
compensate for the frequency-scaling slowdown.
Levels shown: clusters (scale-out), SMP, multiple cores
on a chip, coprocessors (SPUs), SMTs, SIMD, ILP.
3) Scalability wall: Software will need to
deliver 10^5-way parallelism to utilize
peta-scale parallel systems.
12
High Complexity Limits Development Productivity
[Figure: one billion transistors in a chip, with memory. 1995: the entire chip can be accessed in 1 cycle. 2010: only a small fraction of the chip can be accessed in 1 cycle.]
Major sources of complexity for the application developer:
1) Severe non-uniformities in data accesses
2) Applications must exhibit large degrees of
parallelism (up to 10^5 threads)
Complexity leads to increases in all phases of the
HPC Software Lifecycle related to parallel code.
[Figure: HPC Software Lifecycle. Labels: Requirements; Written Specification; Algorithm Development; Parallel Specification; Input Data; Source Code; Development of Parallel Source Code (Design, Code, Test, Port, Scale, Optimize); Production Runs of Parallel Code; Maintenance and Porting of Parallel Code.]
13
PERCS Programming Model/Tools: Overall Architecture
[Figure: layered architecture diagram. The labels are grouped below.]
Source languages: Java + Threads + Conc utils; X10 source code;
C/C++/MPI/OpenMP; Fortran/MPI/OpenMP; . . .
Toolkits: Java Development Toolkit, X10 Development Toolkit,
C Development Toolkit, Fortran Development Toolkit, . . .
Performance Exploration
Productivity Metrics
Integrated Programming Environment: Edit,
Compile, Debug, Visualize, Refactor. Use Eclipse
platform (eclipse.org) as foundation for
integrating tools. Morphogenic Software:
separation of concerns, separation of roles.
Fortran components
C/C++ components
Fast extern interface
C/C++ runtime
Fortran runtime
Integrated Concurrency Library: messages,
synchronization, threads
Continuous Program Optimization (CPO)
PERCS System Software (K42)
PERCS System Hardware
PERCS = Productive, Easy-to-use, Reliable Computer
Systems
14
async
Statement ::= async PlaceExpressionSingleList_opt Statement
  • async (P) S
  • Parent activity creates a new child activity at
    place P, to execute statement S; returns
    immediately.
  • S may reference final variables in enclosing
    blocks.

double A[D];        // Global dist. array
final int k = ...;
async ( A.distribution[99] ) {
  // Executed at A[99]'s place
  atomic A[99] = k;
}

cf. Cilk's spawn
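
As a rough Java analogy (a sketch under stated assumptions, not how the X10 runtime is implemented), a place can be modelled as a single-threaded executor and `async (P) S` as submitting `S` to it: the submit call returns immediately, and the parent waits only if it explicitly asks to. The class `AsyncSketch`, the method `runOnce`, and the array size of 100 are illustrative choices.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical model: a "place" is one single-threaded executor.
final class AsyncSketch {
    static int runOnce(int k) {
        ExecutorService place = Executors.newSingleThreadExecutor();
        AtomicIntegerArray a = new AtomicIntegerArray(100);
        try {
            // Like `async (P) { atomic A[99] = k; }`: submit returns immediately.
            Future<?> child = place.submit(() -> a.set(99, k));
            // X10 would use `finish` to wait; here the parent joins explicitly.
            child.get();
            return a.get(99);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            place.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnce(7)); // prints 7
    }
}
```

As in X10, the child only reads `k`, which is effectively final in the enclosing scope.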
15
finish
Statement ::= finish Statement
  • finish S
  • Execute S, but wait until all (transitively)
    spawned asyncs have terminated.
  • Trap all exceptions thrown by spawned activities.
  • Throw an (aggregate) exception if any spawned
    async terminates abruptly.
  • Useful for expressing synchronous operations on
    remote data
  • And potentially, ordering information in a weakly
    consistent memory model

finish ateach (point [i] : A) A[i] = i;
finish async (A.distribution[j]) A[j] = 2;
// All A[i]=i will complete before A[j]=2

cf. Cilk's sync
Rooted Exception Model
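
The two properties listed for `finish` (waiting for transitively spawned activities, and collecting their exceptions into one aggregate) can be sketched in Java with a `java.util.concurrent.Phaser` used as a dynamic counter. This is an illustration only: `FinishSketch`, `async`, and `await` are hypothetical names, and the fixed pool of four threads is an arbitrary choice.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not an X10 API: one object per `finish S`.
final class FinishSketch {
    private final ExecutorService pool = Executors.newFixedThreadPool(4);
    private final Phaser pending = new Phaser(1);   // party 0 is the parent
    private final List<Throwable> failures = new CopyOnWriteArrayList<>();

    // Spawn an activity; its body may call async() again (transitive spawn).
    void async(Runnable body) {
        pending.register();                         // count it before it starts
        pool.execute(() -> {
            try {
                body.run();
            } catch (Throwable t) {
                failures.add(t);                    // trap the exception
            } finally {
                pending.arriveAndDeregister();      // this activity is done
            }
        });
    }

    // Wait for all spawned activities, then throw one aggregate exception.
    void await() {
        pending.arriveAndAwaitAdvance();
        pool.shutdown();
        if (!failures.isEmpty()) {
            RuntimeException aggregate = new RuntimeException("activities failed");
            failures.forEach(aggregate::addSuppressed);
            throw aggregate;
        }
    }

    static int demo() {
        FinishSketch f = new FinishSketch();
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < 3; i++) {
            f.async(() -> {
                done.incrementAndGet();
                f.async(done::incrementAndGet);     // nested activity
            });
        }
        f.await();
        return done.get();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 6
    }
}
```

A child can only register a grandchild while it is itself still counted, so the phase cannot advance before every transitively spawned activity has finished.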
16
atomic
Statement ::= atomic Statement
MethodModifier ::= atomic
  • Atomic blocks are
      • Conceptually executed in a single step, while
        other activities are suspended
  • An atomic block may not include
      • Blocking operations
      • Accesses to data at remote places
      • Creation of activities at remote places

// target defined in lexically enclosing environment.
public atomic boolean CAS(Object old, Object new) {
  if (target.equals(old)) {
    target = new;
    return true;
  }
  return false;
}

// push data onto concurrent list-stack
Node<int> node = new Node<int>(17);
atomic { node.next = head; head = node; }
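
A plain-Java approximation of these two snippets is sketched below, assuming that a single lock per object is an acceptable stand-in for "executed in a single step". X10's `atomic` is a language construct and need not be implemented with locks; `AtomicSketch` and its members are hypothetical names.

```java
// Hypothetical Java analogue of the CAS method and the stack push.
final class AtomicSketch {
    static final class Node {
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private final Object lock = new Object();   // stands in for `atomic`
    private Object target;
    private Node head;

    AtomicSketch(Object initial) { this.target = initial; }

    // Like `public atomic boolean CAS(old, new)`: compare and swap in one step.
    boolean cas(Object expected, Object replacement) {
        synchronized (lock) {
            if (target.equals(expected)) {
                target = replacement;
                return true;
            }
            return false;
        }
    }

    Object get() {
        synchronized (lock) { return target; }
    }

    // Like `atomic { node.next = head; head = node; }`.
    void push(int value) {
        Node node = new Node(value);
        synchronized (lock) {
            node.next = head;
            head = node;
        }
    }

    int size() {
        synchronized (lock) {
            int n = 0;
            for (Node p = head; p != null; p = p.next) n++;
            return n;
        }
    }

    // Four threads push 1000 nodes each; no push may be lost.
    static int demo() {
        AtomicSketch s = new AtomicSketch("init");
        Thread[] ts = new Thread[4];
        for (int t = 0; t < ts.length; t++) {
            ts[t] = new Thread(() -> {
                for (int i = 0; i < 1000; i++) s.push(i);
            });
            ts[t].start();
        }
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return s.size();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 4000
    }
}
```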
17
when
Statement ::= WhenStatement
WhenStatement ::= when ( Expression ) Statement

class OneBuffer {
  nullable Object datum = null;
  boolean filled = false;
  public void send(Object v) {
    when ( !filled ) {
      this.datum = v;
      this.filled = true;
    }
  }
  public Object receive() {
    when ( filled ) {
      Object v = datum;
      datum = null;
      filled = false;
      return v;
    }
  }
}
  • Activity suspends until a state in which the
    guard is true; in that state the body is executed
    atomically.
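
For comparison, the same one-slot buffer can be written in plain Java with a monitor, where `while (!guard) wait()` plays the role of `when (guard)`. This is a sketch of the analogy, not a translation scheme used by X10; the class name `OneBufferSketch` and the `demo` method are made up for illustration.

```java
// Hypothetical Java rendering of OneBuffer using a monitor.
final class OneBufferSketch {
    private Object datum = null;
    private boolean filled = false;

    // when (!filled) { datum = v; filled = true; }
    synchronized void send(Object v) throws InterruptedException {
        while (filled) wait();      // suspend until the guard holds
        datum = v;
        filled = true;
        notifyAll();                // state changed: waiters re-check guards
    }

    // when (filled) { v = datum; datum = null; filled = false; return v; }
    synchronized Object receive() throws InterruptedException {
        while (!filled) wait();
        Object v = datum;
        datum = null;
        filled = false;
        notifyAll();
        return v;
    }

    // A producer sends 1..5 while the caller receives and sums them.
    static int demo() {
        OneBufferSketch buf = new OneBufferSketch();
        Thread producer = new Thread(() -> {
            try {
                for (int i = 1; i <= 5; i++) buf.send(i);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();
        int sum = 0;
        try {
            for (int i = 0; i < 5; i++) sum += (Integer) buf.receive();
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints 15
    }
}
```

The X10 version is shorter because the guard, the suspension, and the atomic execution of the body are all expressed by the single `when` construct.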

18
regions, distributions
region R = 0:100;
region R1 = [0:100, 0:200];
region RInner = [1:99, 1:199];
// a local distribution
distribution D1 = R -> here;
// a blocked distribution
distribution D = block(R);
// union of two distributions
distribution D = (0:1) -> P0 || (2:N) -> P1;
distribution DBoundary = D - RInner;
  • Region
  • a (multi-dimensional) set of indices
  • Distribution
  • A mapping from indices to places
  • High level algebraic operations are provided on
    regions and distributions

Based on ZPL.
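
To make "a mapping from indices to places" concrete, here is a small Java sketch of a one-dimensional block distribution. The blocking policy (contiguous chunks of ceil(size/places) indices) is an assumption made for this example; the slide does not say how X10's `block` splits a region. `BlockDistSketch`, `placeOf`, and `indicesAt` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical 1-D model: a region is the inclusive range lo..hi, and a
// distribution is a function from each index to a place number.
final class BlockDistSketch {
    final int lo, hi, places;

    BlockDistSketch(int lo, int hi, int places) {
        this.lo = lo;
        this.hi = hi;
        this.places = places;
    }

    int size() { return hi - lo + 1; }

    // Assumed policy: contiguous chunks of ceil(size / places) indices.
    int placeOf(int i) {
        if (i < lo || i > hi) throw new IndexOutOfBoundsException("index " + i);
        int block = (size() + places - 1) / places;
        return (i - lo) / block;
    }

    // Restriction of the distribution to one place: the indices it owns.
    List<Integer> indicesAt(int place) {
        List<Integer> owned = new ArrayList<>();
        for (int i = lo; i <= hi; i++) {
            if (placeOf(i) == place) owned.add(i);
        }
        return owned;
    }

    public static void main(String[] args) {
        BlockDistSketch d = new BlockDistSketch(0, 100, 4);
        System.out.println(d.placeOf(26));           // prints 1
        System.out.println(d.indicesAt(3).size());   // prints 23
    }
}
```

With the region 0:100 (101 points) and four places, the block size is 26, so the last place owns the 23 indices 78 through 100.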
19
arrays
  • Array section
      • A[RInner]
  • High level parallel array, reduction and span
    operators
      • Highly parallel library implementation
      • A-B (array subtraction)
      • A.reduce(intArray.add,0)
      • A.sum()
  • Arrays may be
      • Multidimensional
      • Distributed
      • Value types
      • Initialized in parallel
      • int[D] A = new int[D] (point [i,j])
        { return N*i+j; };
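
The aggregate operations named on this slide have close counterparts in Java's parallel streams, which the sketch below uses on a single machine. It stores an n-by-n array in row-major order and assumes the initializer N*i+j for the parallel initialization; `ArraySketch` and its methods are hypothetical names, and nothing here is distributed across places.

```java
import java.util.stream.IntStream;

// Hypothetical single-machine analogue of the array operations.
final class ArraySketch {
    // Fill an n-by-n array in parallel, one cell per point (i, j).
    static int[] init(int n) {
        int[] a = new int[n * n];
        IntStream.range(0, n * n).parallel().forEach(p -> {
            int i = p / n, j = p % n;
            a[p] = n * i + j;       // assumed initializer N*i+j
        });
        return a;
    }

    // Analogue of A.reduce(intArray.add, 0), i.e. A.sum().
    static int sum(int[] a) {
        return IntStream.of(a).parallel().reduce(0, Integer::sum);
    }

    // Analogue of the pointwise A-B.
    static int[] minus(int[] a, int[] b) {
        return IntStream.range(0, a.length).parallel()
                .map(p -> a[p] - b[p]).toArray();
    }

    public static void main(String[] args) {
        System.out.println(sum(init(4))); // prints 120
    }
}
```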

20
ateach, foreach
Statement ::= ateach ( FormalParam : Expression ) Statement
            | foreach ( FormalParam : Expression ) Statement
  • ateach (point p : A) S
      • Creates |region(A)| async statements
      • Instance p of statement S is executed at the
        place where A[p] is located
  • foreach (point p : R) S
      • Creates |R| async statements in parallel at
        current place
  • Termination of all activities can be ensured
    using finish.

public boolean run() {
  distribution D = distribution.factory.block(TABLE_SIZE);
  long[.] table = new long[D] (point [i]) { return i; };
  long[.] RanStarts = new long[distribution.factory.unique()]
      (point [i]) { return starts(i); };
  long[.] SmallTable = new long value[TABLE_SIZE]
      (point [i]) { return i*S_TABLE_INIT; };
  finish ateach (point [i] : RanStarts) {
    long ran = nextRandom(RanStarts[i]);
    for (int count : 1:N_UPDATES_PER_PLACE) {
      int J = f(ran);
      long K = SmallTable[g(ran)];
      async atomic table[J] ^= K;
      ran = nextRandom(ran);
    }
  }
  return table.sum() == EXPECTED_RESULT;
}
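
The "one activity per index, run where the element lives" behaviour of `ateach` can be imitated in Java by giving each place its own single-threaded executor and routing every index to its owner. The sketch below assumes a block distribution of indices 0..n-1 and simply records which place ran each instance; `AteachSketch` is a hypothetical name and the final loop over futures stands in for `finish`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical model: one single-threaded executor per place.
final class AteachSketch {
    // Runs one activity per index at its owning place and returns, for
    // each index, the number of the place that executed it.
    static int[] ateach(int n, int places) {
        ExecutorService[] place = new ExecutorService[places];
        for (int p = 0; p < places; p++) {
            place[p] = Executors.newSingleThreadExecutor();
        }
        int[] ranAt = new int[n];
        List<Future<?>> pending = new ArrayList<>();
        int block = (n + places - 1) / places;      // assumed block distribution
        for (int i = 0; i < n; i++) {
            final int index = i;
            final int owner = i / block;
            // Instance `index` of the body is submitted to its owner's executor.
            pending.add(place[owner].submit(() -> { ranAt[index] = owner; }));
        }
        try {
            for (Future<?> f : pending) f.get();    // plays the role of `finish`
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            for (ExecutorService ex : place) ex.shutdown();
        }
        return ranAt;
    }

    public static void main(String[] args) {
        System.out.println(java.util.Arrays.toString(ateach(10, 3)));
    }
}
```

A `foreach` would differ only in submitting every instance to the current place's executor instead of the owner's.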
21
clocks
  • async (P) clocked (c1,...,cn) S
      • (Clocked async) activity is registered on the
        clocks (c1,...,cn)
  • Static Semantics
      • An activity may operate only on those clocks it
        is live on.
      • In finish S, S may not contain any top-level
        clocked asyncs.
  • Dynamic Semantics
      • A clock c can advance only when all its
        registered activities have executed c.resume().
  • Operations
      • clock c = new clock();
      • c.resume()
          • Signals completion of work by activity in this
            clock phase.
      • next
          • Blocks until all clocks it is registered on can
            advance. Implicitly resumes all clocks.
      • c.drop()
          • Unregister activity with c.

No explicit operation to register a clock.
Supports over-sampling, hierarchical nesting.
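
Java's `java.util.concurrent.Phaser` offers a loosely similar set of operations, which can help build intuition: `register()` resembles becoming clocked, `arrive()` resembles `c.resume()`, `arriveAndAwaitAdvance()` resembles `next`, and `arriveAndDeregister()` resembles `c.drop()`. The correspondence is approximate (for example, a Phaser allows explicit registration, which X10 clocks do not), and `ClockSketch` below is a hypothetical example rather than an X10 mechanism.

```java
import java.util.concurrent.Phaser;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration of a two-phase computation on one "clock".
final class ClockSketch {
    // Returns true if every worker, after the barrier, saw all phase-0 work.
    static boolean demo(int workers) {
        Phaser clock = new Phaser(1);                 // the parent is registered
        AtomicInteger phase0 = new AtomicInteger();   // work done in phase 0
        AtomicInteger sawAll = new AtomicInteger();   // workers that saw all of it
        Thread[] ts = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            clock.register();                         // like `async clocked(c)`
            ts[w] = new Thread(() -> {
                phase0.incrementAndGet();             // phase-0 work
                clock.arriveAndAwaitAdvance();        // like `next`
                if (phase0.get() == workers) sawAll.incrementAndGet();
                clock.arriveAndDeregister();          // like `c.drop()`
            });
            ts[w].start();
        }
        clock.arriveAndAwaitAdvance();                // the parent's `next`
        clock.arriveAndDeregister();
        try {
            for (Thread t : ts) t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sawAll.get() == workers;
    }

    public static void main(String[] args) {
        System.out.println(demo(4)); // prints true
    }
}
```

The parent registers each worker before starting it and arrives only after the loop, so the first phase cannot end until every worker has been counted.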
22
Example: SpecJBB
finish async {
  clock c = new clock();
  Company company = createCompany(...);
  for (int w : 0:wh_num) for (int t : 0:term_num)
    async clocked(c) { // a client
      initialize;
      next; //1.
      while (company.mode != STOP) {
        select a transaction;
        think;
        process the transaction;
        if (company.mode == RECORDING)
          record data;
        if (company.mode == RAMP_DOWN)
          c.resume(); //2.
      }
      gather global data;
    } // a client

  // master activity
  next; //1.
  company.mode = RAMP_UP;
  sleep rampuptime;
  company.mode = RECORDING;
  sleep recordingtime;
  company.mode = RAMP_DOWN;
  next; //2.
  // All clients in RAMP_DOWN
  company.mode = STOP;
} // finish
// Simulation completed.
print results.
23
Formal semantics (FX10)
  • Based on Middleweight Java (MJ)
  • Configuration is a tree of located processes
      • Tree necessary for finish.
  • Clocks formalized using short circuits (PODC '88).
  • Bisimulation semantics.
  • Basic theorems
      • Equational laws
      • Clock quiescence is stable.
      • Monotonicity of places.
      • Deadlock freedom (for language w/out when).
      • Type Safety
      • Memory Safety

24
Current Status
  • We have an operational X10 0.41 implementation
  • All programs shown here run.

Timeline
  • 09/03: PERCS Kickoff
  • 02/04: X10 Kickoff
  • 07/04: X10 0.32 Spec Draft
  • 02/05: X10 Prototype 1
  • 07/05: X10 Productivity Study
  • 12/05: X10 Prototype 2
  • 06/06: Open Source Release?

[Figure: X10 Prototype 1 tool chain. Labels: X10 source, X10 Grammar, Parser, AST, Analysis passes, Annotated AST, Code Templates, Code emitter, Target Java, X10 Multithreaded RTS, Native code, JVM, PEM Events, Program output.]

  • Code metrics (classes+interfaces/LOC)
      • Parser: 45/14K
      • Translator: 112/9K
      • RTS: 190/10K
      • Polyglot base: 517/80K
      • Approx 180 test cases.
  • Structure
      • Translator based on Polyglot (Java compiler
        framework)
      • X10 extensions are modular.
      • Uses Jikes parser generator.
  • Limitations
      • Clocked final not yet implemented.
      • Type-checking incomplete.
      • No type inference.
      • Implicit syntax not supported.
25
Future Work: Implementation
  • Type checking/inference
      • Clocked types
      • Place-aware types
  • Consistency management
      • Lock assignment for atomic sections
      • Data-race detection
  • Activity aggregation
      • Batch activities into a single thread.
  • Message aggregation
      • Batch small messages.
  • Load-balancing
      • Dynamic, adaptive migration of places from one
        processor to another.
  • Continuous optimization
  • Efficient implementation of scan/reduce
  • Efficient invocation of components in foreign
    languages
      • C, Fortran
  • Garbage collection across multiple places

Welcome University Partners and other
collaborators.
26
Future work: Other topics
  • Design/Theory
      • Atomic blocks
      • Structural study of concurrency and distribution
      • Clocked types
      • Hierarchical places
      • Weak memory model
      • Persistence/Fault tolerance
      • Database integration
  • Tools
      • Refactoring language.
  • Applications
      • Several HPC programs planned currently.
      • Also web-based applications.

Welcome University Partners and other
collaborators.
27
Backup material
28
Type system
  • nullable is a type constructor
      • nullable T contains the values of T and null.
  • Place types, T@P, specify the place at which the
    data object lives.
  • Value classes
      • May only have final fields.
      • May only be subclassed by value classes.
      • Instances of value classes can be copied freely
        between places.

Future work: Include generics and dependent types.
29
Example: Latch
public class Latch implements future {
  protected boolean forced = false;
  protected nullable boxed result = null;
  protected nullable exception z = null;

  public atomic boolean setValue( nullable Object val,
                                  nullable exception z ) {
    if ( forced ) return false;
    // these assignments happen only once.
    this.result.val = val;
    this.z = z;
    this.forced = true;
    return true;
  }
  public atomic boolean forced() {
    return forced;
  }
  public Object force() {
    when ( forced ) {
      if (z != null) throw z;
      return result;
    }
  }
}

public interface future {
  boolean forced();
  Object force();
}

public class boxed {
  nullable Object val;
}
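
A plain-Java counterpart of the latch is sketched below, using `synchronized` for the atomic methods and `wait`/`notifyAll` for the `when (forced)` guard. It is an illustration under simplifying assumptions: the value is stored directly rather than in a `boxed` holder, failures are restricted to `RuntimeException`, and `LatchSketch` is a hypothetical name.

```java
// Hypothetical Java rendering of a set-once latch.
final class LatchSketch {
    private boolean forced = false;
    private Object result = null;
    private RuntimeException failure = null;

    // Succeeds only for the first caller; later calls change nothing.
    synchronized boolean setValue(Object val, RuntimeException z) {
        if (forced) return false;
        this.result = val;
        this.failure = z;
        this.forced = true;
        notifyAll();                // wake activities blocked in force()
        return true;
    }

    synchronized boolean forced() {
        return forced;
    }

    // Blocks until a value or failure has been set, then returns or throws.
    synchronized Object force() {
        try {
            while (!forced) wait(); // like `when (forced)`
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        if (failure != null) throw failure;
        return result;
    }

    public static void main(String[] args) {
        LatchSketch latch = new LatchSketch();
        new Thread(() -> latch.setValue("done", null)).start();
        System.out.println(latch.force()); // prints done
    }
}
```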