Customizable Domain-Specific Computing -- Proposal for NSF - PowerPoint PPT Presentation

Slides: 44
Provided by: yiz97
Learn more at: https://www.cs.rice.edu

Transcript and Presenter's Notes



1
Customizable Domain-Specific Computing --
Proposal for NSF Expedition in Computing Program
  • Point of Contact: Jason Cong
  • cong_at_cs.ucla.edu
  • Participating Universities:
  • UCLA (lead), Rice, Ohio State, and UC Santa
    Barbara
  • (Complete list of PI/Co-PIs available inside)

2
Outline
  • Motivation
  • Overall approach
  • Research plan
  • Management and collaboration plan
  • Value added as Expedition
  • Education and outreach plan
  • Deliverables and knowledge transfer

3
The Power Barrier
Source: Shekhar Borkar, Intel
4
Current Solution: Parallelization
Parallelization
Source: Shekhar Borkar, Intel
5
Rise of Multi-core Processors
Sony-Toshiba-IBM Cell Processor (1 PPE + 8 SPE)
Intel Larrabee (32 cores)
Nvidia's GT200 GPU (30 x 8 = 240 cores)
Sun Rock processor (4 x 4 = 16 cores)
6
Cluster of Computers
IBM BlueGene/L: No. 1 in the Top500 list of
Nov. 2007, now No. 4 in the newest Top500 list
7
Cost and Energy are Still a Big Issue
  • Cost of computing
  • HW acquisition
  • Energy bill
  • Heat removal
  • Space

8
Our Proposal: Beyond Parallelization
Customizable Domain-Specific Computing
Parallelization
Customization
Source: Shekhar Borkar, Intel
9
Motivation and Vision
  • A few facts
  • We have sufficient computing power for most
    applications
  • Each user/enterprise needs high computing power
    for only limited tasks in his/her
    application domain
  • Application-specific integrated circuits (ASICs)
    can deliver 1000X better power/performance
    efficiency, but are too expensive to design and
    manufacture
  • Our vision and approach
  • A general, domain-specific customizable platform
    with customizable computing engines and
    interconnects
  • Can be customized to a wide-range of applications
    in the domain with novel compilation and runtime
    systems
  • A supercomputer-in-a-box for the intended
    domain with 100X performance/power efficiency
    (vs. general-purpose solutions)
  • Can be mass-produced cost-efficiently
  • Can be programmed efficiently
  • Analogy: advance of civilization via
    specialization/customization

10
Overview of the Proposed Research
  • Domain-specific modeling
  • Domain-Specific Coordination Graph (DSCG) and
    Domain-Specific Language Extensions (DSLEs)
  • Executable models that generate application
    characterizations for CHP Mapping
  • Creation of customizable heterogeneous platform
    (CHP) for domain-specific computing
  • Customizable computing engines
  • Customizable interconnects
  • CHP mapping
  • Source-to-source CHP mapper
  • Compilation for customization
  • Adaptive runtime
  • Application domain: healthcare
  • Medical imaging
  • Hemodynamic simulation
  • Integration and demonstration

11
Need Slides from Glenn, Vivek, and Alex
  • 1. Overview of the research tasks in the thrust
    (1)
  • 2. Transformative nature of research in terms of
    its impact to the society and the field (1)
  • 3. Fundamental theoretical contribution and
    implication, if applicable (1)
  • 4. A well-integrated milestone chart with annual
    milestones covering all activities by PI/Co-PIs
    in the thrust (1)
  • 5. A list of possible summer research projects
    for UG and high-school students (1)

12
Application Domains: Medical Image Processing and
Hemodynamic Simulation
  • Medical imaging has changed the nature of
    healthcare and research
  • An in vivo method for understanding the nature of
    disease and the human condition
  • It is estimated that medical imaging accounts for
    $100 billion/year in US healthcare costs
  • Better/faster algorithms can minimize the time
    spent by the patient in the scanner and improve
    clinical assessment
  • Many advanced image processing techniques to
    improve images and analyses are too slow for
    clinical purposes
  • Compressive sensing promises much faster imaging
    but needs computationally demanding image recovery
    algorithms
  • Methods are needed to drive costs down while
    addressing computational needs
  • Hemodynamic simulation
  • Surgical procedures involving blood flow and
    vasculature increasingly consider hemodynamics
  • Planning reduces complications during the
    operation
  • Simulations built from angiography can take
    several days to construct

Magnetic resonance (MR) angiography of an aneurysm
Intracranial aneurysm reconstruction with
hemodynamics
13
Application Domains: Medical Image Processing
Pipeline
Pipeline stages: denoising, registration, segmentation, analysis
Full analysis of an anatomical volume (e.g.,
brain) takes 3 hours, but for real-time clinical
applications, this full pipeline must be
performed in < 2 minutes

Denoising (total variational algorithm)
  • Highly parallel, local and global communication
  • Sparse linear algebra, structured grid,
    optimization methods
  1. Minimization of an energy formed of a data
    fidelity term (modeling Rician noise) and a total
    variation regularization term (non-explicit
    solution, many iterations).
  2. One-step explicit solution; requires non-local
    communication; non-iterative

Registration (fluid registration)
  • Parallel, global communication
  • Dense linear algebra, optimization methods
  • Current models use physics principles with local
    (linear/nonlinear) regularization, and with local
    (L2) or non-local (mutual information, MI)
    similarity measures. MI requires computation of
    (non-local) histograms. PDEs are nonlinear.

Segmentation (level set methods)
  • Local communication
  • Dense linear algebra, spectral methods, MapReduce
  • Involves solving a system of nonlinear PDEs and
    the use of an implicit surface (level sets) that
    evolves to detect boundaries (anatomical regions)

Analysis
  • Local communication
  • Sparse linear algebra, n-body methods, graphical
    models
  1. 3D Navier-Stokes equations
  2. Population-based comparisons

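The timing target on this slide (a full analysis that takes 3 hours today versus under 2 minutes for clinical use) implies a minimum end-to-end speed-up. The short sketch below only restates those two figures; the slide gives no per-stage breakdown, so none is assumed here.

```python
# Required end-to-end speed-up implied by the slide's two figures.
current_minutes = 3 * 60   # full analysis of an anatomical volume: 3 hours
target_minutes = 2         # real-time clinical target: under 2 minutes

required_speedup = current_minutes / target_minutes
print(f"required speed-up: at least {required_speedup:.0f}x")
```

Because the target is "under" 2 minutes, 90x is a lower bound on the speed-up rather than an exact goal.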
14
Application Domains: Research Tasks
  • Creation of a real-time imaging pipeline
  • Each step involves computationally intensive
    algorithms with distinct communication and
    processing patterns
  • Implement current image processing and hemodynamic
    simulation algorithms
  • ITK concurrent collection-based implementation
  • Total variational techniques
  • Develop models of aneurysms for surgical planning
  • Establish baseline benchmarks for core
    algorithms, comparing to GPU and FPGA
  • A CHP-based environment will also foster a new
    class of image processing algorithms
  • Compressive sensing
  • Investigate new algorithms based on CHP, allowing
    for changes in performance parameters given a
    dynamic platform
  • Evaluation of speed-up and benefit analysis
    (cost, quality of life) using real clinical data

Importantly, core methods in both areas have
applications to other computational
domains.
(Insert results from Yi here)
15
Domain-Specific Modeling: Research Tasks
  • Design and implementation of Domain-Specific
    Coordination Graph (DSCG) and Domain-Specific
    Language Extensions (DSLEs)
  • DSLEs include domain-specific stencil
    computation, type systems, data structures,
    bitwidths, along with wrappers for
    domain-specific libraries
  • Design and implementation of simulation tools for
    executable models that generate application
    characterization to drive CHP Creation
  • Application characterization includes
    identification of intrinsic parallelism,
    communication topologies, and selection of
    operand clusters for customized CHP instructions
  • Creation of executable models for the medical
    imaging and hemodynamic flow simulation domains
  • Simulations should be driven by both synthetic
    and (anonymized) real-world data
  • Transformative nature: use of Domain-Specific
    Modeling to drive CHP Creation, instead of
    software playing second fiddle to hardware

16
Domain-Specific Modeling: Fundamental theoretical
contributions and implications
  • Deterministic executable models with implicit
    parallelism
  • Executable models are inherently fault-tolerant
  • Failed step can be re-executed without change in
    semantics
  • High-level stencil operations and their semantics
  • Y_{i,j} = ... - b(X_{i+2,j} - 6X_{i+1,j} - 6X_{i-1,j}
    + X_{i-2,j} + X_{i,j+2} - 6X_{i,j+1} - 6X_{i,j-1}
    + X_{i,j-2} + X_{i+1,j+1} + X_{i-1,j+1} + X_{i+1,j-1}
    + X_{i-1,j-1})
  • For example, replace the above statement by Y =
    -b applyStencil(X, St)
  • Type systems with error significance and
    probabilities
  • Enables new forms of local reasoning about
    uncertainties and errors in software
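To illustrate the high-level stencil idea on this slide, here is a minimal sketch of what an `applyStencil(X, St)` operation might look like. The slide does not define the operation, so the representation of `St` as a dictionary of offsets and weights, the handling of border points, and the function name `apply_stencil` are all assumptions. The weights are read from the slide's formula with the dropped signs between terms taken to be "+", and the elided "..." term of the original statement is not modeled.

```python
def apply_stencil(X, stencil):
    """Apply `stencil` (a dict mapping (di, dj) offsets to weights) to the
    interior points of the 2D grid X. Border points that lack a full
    neighbourhood are left at 0.0 (an assumption; the slide does not say)."""
    rows, cols = len(X), len(X[0])
    reach = max(max(abs(di), abs(dj)) for di, dj in stencil)
    Y = [[0.0] * cols for _ in range(rows)]
    for i in range(reach, rows - reach):
        for j in range(reach, cols - reach):
            Y[i][j] = sum(w * X[i + di][j + dj]
                          for (di, dj), w in stencil.items())
    return Y


# Weights as read from the slide's formula (signs between terms assumed '+').
ST = {
    (2, 0): 1, (1, 0): -6, (-1, 0): -6, (-2, 0): 1,
    (0, 2): 1, (0, 1): -6, (0, -1): -6, (0, -2): 1,
    (1, 1): 1, (-1, 1): 1, (1, -1): 1, (-1, -1): 1,
}

X = [[float(i * i + j) for j in range(6)] for i in range(6)]
b = 0.5
# Y = -b * applyStencil(X, St); the "..." term on the slide is omitted.
Y = [[-b * s for s in row] for row in apply_stencil(X, ST)]
```

Keeping the stencil as data rather than as a hand-expanded expression is what would let a compiler or mapper reason about its reach and access pattern.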

17
CHP Creation: Research Tasks
  • Hierarchical simulation methodology for CHP
    design space exploration
  • Fast analytical/statistical models
  • Initial design space pruning
  • Kernel-level simulation
  • Cycle-accurate simulation to refine the candidate
    set (e.g. SESC-based MC-Sim)
  • Full-system simulation
  • Cycle-accurate simulation of domain applications
    (e.g. SIMICS/GEMS)
  • CHP creation and optimization
  • Intelligent search for the optimal CHP, guided by
    domain-specific models and knowledge
  • E.g. knowledge of working set, ILP, etc.
  • Pruning guided by fast analytical/statistical
    models and/or kernel-level simulation
  • Validation by full-system simulation
  • Considering the impact of compilation and runtime
    systems
  • Silicon Implementation of CHP prototypes
  • Design based on simulation-driven exploration of
    design space
  • Industry partners will provide implementation
    support

18
CHP Design Space Exploration
  • Core Parameters

19
CHP Design Space Exploration
  • Core Parameters
  • NoC Parameters

20
CHP Design Space Exploration
  • Core Parameters
  • NoC Parameters
  • Custom Instructions and Accelerators

The key question: What is the desired level of
tunability for a given domain?
21
Adaptive Interconnect with RF-I
  • NoC Topology Adapts to Application Demand
  • One example is application-specific shortcuts in
    hybrid mesh topology

Figure labels: Physical Topology; Logical Topology A; Logical Topology B
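The benefit of application-specific shortcuts in a mesh can be illustrated with a small hop-count model. This is only a sketch: the 8x8 mesh size, the shortcut endpoints, and the function names are hypothetical, and a shortcut is modeled as one extra bidirectional single-hop link rather than as an actual RF-I channel.

```python
from collections import deque


def mesh_neighbors(node, width, height):
    """Yield the up/down/left/right neighbours of `node` inside the mesh."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def hop_count(src, dst, width, height, shortcuts=()):
    """Fewest hops from src to dst on a width x height mesh, treating each
    (a, b) pair in `shortcuts` as an extra bidirectional single-hop link."""
    extra = {}
    for a, b in shortcuts:
        extra.setdefault(a, []).append(b)
        extra.setdefault(b, []).append(a)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:  # breadth-first search gives the minimum hop count
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for nxt in list(mesh_neighbors(node, width, height)) + extra.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


plain = hop_count((0, 0), (7, 7), 8, 8)
with_shortcut = hop_count((0, 0), (7, 7), 8, 8, shortcuts=[((1, 0), (6, 7))])
```

Passing a different `shortcuts` list for each application corresponds to the slide's idea of several logical topologies over one physical topology.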
22
Tri-band On-Chip RF-I Test Results
Process: IBM 90nm CMOS Digital Process
Total 3 Channels: 30GHz, 50GHz, Base Band
Data Rate in each channel: RF Band 4Gbps, Base Band 2Gbps
Total Data Rate: 10Gbps
Bit Error Rate Across all Bands: <10E-9
Latency: 6 ps/mm
Energy Per Bit (RF): 0.09pJ/bit/mm
Energy Per Bit (BB): 0.125pJ/bit/mm
VCO power (5mW) can be shared by all (many tens)
parallel RF-I links in the NoC and does not
burden an individual link significantly.
Figure labels: 30GHz Channel; 50GHz Channel; Base Band Channel;
Output Spectrum of the RF-Bands (30GHz and 50GHz); Data Output Waveform
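The figures reported on this slide can be cross-checked with simple arithmetic. The sketch below assumes that both RF bands (30 GHz and 50 GHz) run at the stated 4 Gbps, which is consistent with the 10 Gbps total; the 10 mm link length is a hypothetical value chosen only to show how the per-mm numbers scale.

```python
# Aggregate data rate: two RF bands plus the baseband channel.
RF_BANDS = 2                 # 30 GHz and 50 GHz channels
RF_RATE_GBPS = 4             # per RF band (from the slide)
BASEBAND_RATE_GBPS = 2       # from the slide

total_gbps = RF_BANDS * RF_RATE_GBPS + BASEBAND_RATE_GBPS

# Per-mm figures from the slide, scaled to a hypothetical link length.
LATENCY_PS_PER_MM = 6
RF_ENERGY_PJ_PER_BIT_MM = 0.09
BB_ENERGY_PJ_PER_BIT_MM = 0.125

link_mm = 10                 # hypothetical link length, not from the slide
latency_ps = LATENCY_PS_PER_MM * link_mm
rf_energy_pj_per_bit = RF_ENERGY_PJ_PER_BIT_MM * link_mm
bb_energy_pj_per_bit = BB_ENERGY_PJ_PER_BIT_MM * link_mm

print(f"total data rate: {total_gbps} Gbps")
print(f"latency over {link_mm} mm: {latency_ps} ps")
```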
23
Further Amortization of CHP Cost
Daughter Board CHP
  • One generic CHP for all domains/applications
  • Still expect better performance/power efficiency
    over existing CMPs due to heterogeneity and
    programmability
  • One base CHP shared for a spectrum of domains
  • Contains key components and tunability for the
    intersection of many domain-optimal CHPs
  • Further cost amortization over many domains
  • Each domain has a domain-specific co-processor
  • Provides further customization within a domain
  • RF-I or optical provides low-latency
    communication between CHP and co-processor

Figure labels: General Purpose CHP (Fine-grain Cores); RF or optical
connection; Domain-Specific (DS) Daughter Board (DS IP blocks);
3D CHP (Layer 1, Layer 2; Fine-grain Cores, DCT Unit)
24
CHP Mapping: Overall Structure
25
CHP Mapping: Research Tasks
  • Source-to-source CHP Mapper for given CHP
  • Includes loop transformations, polyhedral
    optimizations, space-time scheduling, RF-I
    bandwidth allocation, mapping to heterogeneous
    cores
  • Reconfiguring and optimizing back-end
  • Includes selection of register file sizes, cache
    sizes, datapath bitwidths
  • C/C++-to-RTL synthesizer for FPGAs
  • Includes SDC scheduling, communication and
    behavior co-optimization
  • Adaptive Runtime
  • Includes fine-grained task scheduling using
    domain-specific information, as well as
    adaptation to different phases of the application
  • Software Reliability
  • Complement hardware reliability and fault-tolerant
    algorithms with type checking for error
    significance, DSCG test coverage, and
    re-execution of DSCG steps
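The idea of re-executing a failed step can be sketched as follows. This is a minimal illustration, not the proposal's runtime: it assumes a step is a deterministic function with no side effects, so running it again on the same inputs cannot change the program's meaning. The names `run_step` and `make_flaky`, and the use of `RuntimeError` as a stand-in for a detected fault, are hypothetical.

```python
def run_step(step, inputs, max_attempts=3):
    """Run a deterministic, side-effect-free step; on a transient failure,
    simply run it again with the same inputs."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return step(*inputs)
        except RuntimeError as err:   # stand-in for a detected fault
            last_error = err
    raise last_error


def make_flaky(step, failures):
    """Helper for the demo: wrap `step` so its first `failures` calls raise."""
    remaining = [failures]

    def wrapped(*args):
        if remaining[0] > 0:
            remaining[0] -= 1
            raise RuntimeError("simulated transient fault")
        return step(*args)

    return wrapped


def scale(values, factor):
    """A pure step: output depends only on its inputs."""
    return [v * factor for v in values]


# Two simulated faults, then success on the third attempt.
result = run_step(make_flaky(scale, failures=2), ([1, 2, 3], 2))
```

Re-execution is only safe here because `scale` never mutates its inputs; a step with side effects would need checkpointing or rollback instead.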

26
Tentative Experimental Hardware Platform
  • Nallatech FSB Compute module
  • FPGA-based accelerator unit
  • (Xilinx Virtex-5 LX330T FPGA, 51,840 slices)
  • Xeon-socket compatible
  • Allows stacking (2 to 4) compute modules
  • NVIDIA Tesla C1070
  • The fastest GPU / computing processor by NVIDIA
  • 4GB device memory
  • 30 Multi-processors (each has 8 cores)
  • Standard PCI-express 2.0 interface 8GB/s
  • Intel S7000 series server motherboard
  • Supporting up to 4 Xeon CPUs
  • 1066 MHz FSB (bandwidth 8.5GB/s)

27
CDSC Organization
People shown: Aberle, Baraniuk, Cong (Director), Cheng, Chang,
Sadayappan, Palsberg, Sarkar (Associate Dir), Vese

Thrust | UCLA | Rice | UCSB | Ohio State
Domain-specific specification | Bui, Reinman, Potkonjak | Sarkar, Baraniuk | - | Sadayappan
CHP creation | Chang, Cong, Reinman | - | Cheng | -
CHP mapping | Cong, Palsberg, Potkonjak | Sarkar | - | Sadayappan
Application modeling | Aberle, Bui, Vese | Baraniuk | - | -
Experimental systems | All (led by Cong and Bui) | All | All | All
28
Management and Collaboration Plan
  • Director: Jason Cong (UCLA), Associate Director:
    Vivek Sarkar (Rice)
  • Oversee the center operation
  • Research Executive Committee (REC): leaders of 4
    research thrusts + 2 directors
  • Monthly teleconferences to review the research
    progress and facilitate inter-thrust
    collaboration
  • Each thrust will have weekly or biweekly meeting
    driven by research milestones
  • Leveraging extensive collaboration history among
    PI/Co-PIs
  • Everyone had/has joint projects/publications with
    others in the center
  • Inter-campus students exchanges are planned and
    encouraged
  • Three center-wide meetings each year
  • January, May, and September (annual review, with
    guests from NSF and industry)
  • Research talks, poster sessions, brainstorm
    sessions, feedback session (at annual review)
  • Student activities
  • Seminars and workshops on interdisciplinary
    research, career development, ethics,
    entrepreneurship

29
Application Domains: Milestones
  • Year 1
  • Identify and prioritize components of the ITK
    library to transform to concurrent collections
    and CHP
  • Select major medical image processing algorithms
    to form benchmarks as part of image pipeline
  • Initiate GPU and FPGA implementations of the
    selected algorithms (as appropriate)
  • Identify core hemodynamic simulation algorithms
    for transformation into CHP
  • Establish image testbed with gold standard
    results
  • Year 2
  • Complete base image testbed
  • Demonstration of initial implementation of select
    image processing algorithms on Prototype 1a
  • Assess potential speed-up and subsequent points
    for compiler and hardware improvements, compare
    to baseline benchmarks
  • Ascertain issues related to translation of C
    code to target CHP code representation
  • Year 3
  • Complete and document initial implementation of
    ITK library components
  • Demonstrate remaining medical image processing
    algorithms in Prototype 1b, including changes
    identified in Year 2 testing
  • Complete compressive sensing, TV methods
  • Year 4
  • Demonstrate initial hemodynamic simulation
    algorithms running under CHP, Prototype 1b
  • Demonstrate the adaptive runtime environment
    based on the algorithms thus far for CHP
  • Assess degree of recoding required to move from
    C to CHP
  • Perform profiling to inform improvements to
    compiler and hardware implementations
  • Year 5
  • Final demonstration of ITK library for image
    processing and hemodynamics simulation on CHP
    prototype
  • Evaluation of CHP performance and impact relative
    to real-world clinical data

30
CHP Creation: Milestones
  • Year 1
  • Simulation Infrastructure
  • Initial CHP prototype: COTS components (Prototype
    1a) to enable SW development
  • Year 2
  • CHP design space exploration: initial space
    pruning
  • Domain-specific component synthesis and selection
  • Prototype RF-I chip (Prototype 1b) with traffic
    generators and multicast
  • Year 3
  • CHP design space exploration: refining with
    kernel simulation
  • CHP testbed creation: component design and unit
    test
  • Year 4
  • CHP design space exploration: full system
    simulation
  • CHP testbed prototyping (Prototype 2) on FPGAs
  • Year 5
  • CHP testbed tapeout (Prototype 2)
  • Full system integration and demonstration

31
Integrated Research and Education
  • New courses planned based on the research
  • Architecture and compilation for domain-specific
    computing
  • Computational techniques for medical imaging, and
  • Programming models and application development
    for domain-specific computing
  • With projects for new domains, e.g. scientific
    computing, VLSI CAD, and digital entertainment
  • May be jointly taught (multi-disciplinary)
  • Will be distributed and shared on Connexions
    (cnx.org), an open-access education project that
    now has about 750,000 users per month
  • Graduate student training
  • An estimated 18 students in total across the four
    campuses
  • Undergraduate student training
  • 10 summer research fellowships each year, via UCLA
    FOCUS, Rice AGEP and similar programs
  • AGEP program especially targets women and URM
    candidates
  • Outreach to high-school graduates
  • 5-7 each year, via UCLA SMARTS or similar programs

32
Outreach Partner: Frontier Opportunities in
Computing for Underrepresented Students (FOCUS)
  • Aims at increasing the number of underrepresented
    minorities interested in computing disciplines.
  • Currently has 50 underrepresented undergraduates
  • 23 in CS
  • 27 in CSE.
  • http://ceed.ucla.edu/focus/

2007 summer research poster competition

The first prize winner
33
Outreach Partner: Science Mathematics
Achievement and Research Technology for Students
(SMARTS)
  • A six-week summer college preparation program at
    UCLA
  • Engage underrepresented students in science,
    technology, engineering and math training.
  • SMARTS activities
  • Course related activities,
  • Math courses (Intro to Statistics and AP Calculus
    Readiness)
  • SAT preparation
  • Research activities
  • Will have CDSC faculty and graduate students
    involved to serve as mentors and provide projects
  • This year, the SMARTS program has over 80 applicants
  • 30-35 will be admitted (due to limited
    funding).

34
Possible Partner: Teach For America
(www.teachforamerica.org)
  • About 300 teachers in LA area (6000 nationwide)
  • Cover 3,000 students in LA area (400,000
    nationwide/year)
  • 95% underrepresented students
  • Initial contact with Celia Alvarado, Manager for
    LA area
  • Will let CDSC speak at the orientation of teachers
    in related areas (e.g. math, science)
  • TFA teachers will introduce CDSC summer program
    to high school students, and make recommendation
    of students for the program
  • Will contact TFA in areas close to other CDSC
    campuses

35
CHP Creation: Summer Outreach Projects
  • Premise
  • Small-scale, introductory projects that are
    self-contained
  • Leverage our development infrastructure to
    accelerate development time
  • Expose students to cutting edge tools and ideas
  • Simulation Infrastructure
  • Sample Undergraduate Project: Refinement of
    statistical regression models
  • Sample High School Project: Exploration of new
    design drivers and kernels
  • Physical Design
  • Sample Undergraduate Project: Analysis of
    critical loops in new design drivers and creation
    of custom accelerators (leveraging our design
    framework)
  • Sample High School Project: Basic power modeling
    methodology
  • RF Interconnect
  • Sample Undergraduate Project: NoC exploration
    with RF

36
Value Added As Expedition
37
CHP Creation: Transformative Nature of Research
  • Custom Heterogeneous CMPs
  • Conventional designs exploit parallelism with
    homogeneous resources
  • Designs are general
  • Resource allocation (i.e. datapath width, cache
    organization, types of functional units)
  • Instruction implementation (i.e. generally
    designed ISAs and machine organization)
  • Next transformative step in computing
  • Domain-specific integration of
  • Tunable processing cores (i.e. match the resource
    requirements of the application)
  • Programmable fabric (i.e. offload critical
    computation to customized bit-parallel datapaths)
  • Power-efficient domain-specific performance
  • Reconfigurable Network on Chip
  • Adapts to domain demand
  • Provide power-hungry bandwidth only where it is
    required
  • Shift away from designing for worst-case behavior
  • Two emerging technologies each enable efficient
    reconfiguration
  • RF Interconnect
  • Optical Interconnect

38
Knowledge Transfer
  • Main outcome of the project
  • CHP prototypes
  • Compilation and runtime system for CHP mapping
  • Application drivers: both the original source
    code and modified source code with
    domain-specific modeling
  • General methodology for customizable computing
    (mainly through publications)
  • Items 1-3 will be shared with the research
    community via web as they become available
  • Industrial partners
  • Altera, IBM, Intel, Magma, Mentor Graphics,
    Nvidia, Xilinx
  • More will be contacted and included if the
    project is officially funded
  • Startup experience
  • Aplus design technologies (acquired by Magma in
    2003), AutoESL design technologies (Magma and
    Xilinx were investors)
  • Extensive experience working with Office of IP
    Administration (OIPA) for tech transfer

39
Backup Slides
40
Domain of Focus: Health Care
  • Health care consumed 16% of US economic output
    as of 2004 and is still increasing rapidly
  • Has the most direct impact on quality of life
  • A revolution in this area, driven by the rapid
    advance of computer and information technologies,
    arguably has the most significant impact on
    society and the national economy
  • Many problems in this domain are extremely
    computationally challenging and beyond the reach
    of current computing technology

41
Simulation Framework
  • Statistical/Analytical Models: initial design
    space pruning
  • Architecture-agnostic application model
  • Working set size, thread scaling, sequential vs
    parallel sections, etc
  • Core model
  • NoC/coherence model
  • MC-Sim: design space refinement
  • Cycle-accurate simulator based on SESC (MIPS
    emulation)
  • No operating system overhead; useful for running
    application kernels
  • SIMICS/GEMS: full system simulation
  • Cycle-accurate simulation with operating system
    support
  • Used to run the full applications in the domain

42
Intelligent Design Space Exploration
  • Pruning
  • Initial studies with statistical/analytical
    models and kernel simulation will prune portions
    of the design space
  • Pruning will be conservative to avoid false
    negatives
  • Guided Space Exploration
  • Rather than explore the entire space in a brute
    force manner, we will filter certain
    architectural parameters based on domain specific
    knowledge
  • Example 1: Working set size is 2 MB for a
    particular domain
  • We will guide the design space to only consider
    architectural configurations which directly (i.e.
    2MB caches) or indirectly (i.e. aggressive
    prefetching or thread-level speculation) provide
    this effective working set size
  • Example 2: Domain has very limited ILP
  • We will prune dynamically scheduled cores from
    the space to avoid wasting power on needless ILP
    discovery
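The two examples above can be sketched as filters over an enumerated design space. Everything concrete in this sketch is hypothetical: the parameter names, the candidate values, and the crude rule that prefetching alone counts as indirectly covering the working set. It is meant only to show the shape of knowledge-guided pruning, not an actual CHP exploration tool.

```python
from itertools import product

# Hypothetical architectural parameters and candidate values.
CACHE_MB = (0.5, 1, 2, 4)
CORE_TYPES = ("in-order", "out-of-order")   # out-of-order = dynamically scheduled
PREFETCH = (False, True)


def candidate_configs():
    """Enumerate the full (brute-force) design space."""
    for cache_mb, core, prefetch in product(CACHE_MB, CORE_TYPES, PREFETCH):
        yield {"cache_mb": cache_mb, "core": core, "prefetch": prefetch}


def covers_working_set(cfg, working_set_mb):
    """Directly (cache is large enough) or indirectly (aggressive
    prefetching; a crude stand-in for a real effectiveness model)."""
    return cfg["cache_mb"] >= working_set_mb or cfg["prefetch"]


def prune(configs, working_set_mb, limited_ilp):
    """Keep only configurations consistent with the domain knowledge."""
    kept = []
    for cfg in configs:
        if not covers_working_set(cfg, working_set_mb):
            continue          # Example 1: cannot provide the working set
        if limited_ilp and cfg["core"] == "out-of-order":
            continue          # Example 2: no point paying for ILP discovery
        kept.append(cfg)
    return kept


all_cfgs = list(candidate_configs())
kept = prune(all_cfgs, working_set_mb=2, limited_ilp=True)
print(f"{len(kept)} of {len(all_cfgs)} configurations remain")
```

The surviving configurations would then go on to the more expensive kernel-level and full-system simulation stages.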

43
Connexions: Open Access Course Ware
  • Connexions (cnx.org) is a non-profit open access
    educational publishing system based at Rice
  • Goal: make high-quality educational content
    available for free on the web and at very low
    cost in print
  • Open-licensed repository of 10,000 Lego-block
    modules for authors, instructors, learners
  • Global reach: >1M users monthly from nearly
    200 countries
  • A vibrant community will develop around the
    materials (think Wikipedia)
  • Will take leadership role in quality evaluation
    of open materials
  • Build on successful Connexions partnership with
    IEEE (IEEEcnx.org)