Intel - PowerPoint PPT Presentation

Slides: 33
Provided by: matthewa3
Learn more at: http://www.cs.ucr.edu

1
Intel IXP2XXX Network Processor Architecture Overview
John Morgan, Infrastructure Processor Division
September 2004
2
IXP2400 External Features
  • External Interfaces
  • MSF Interface supports UTOPIA 1/2/3, SPI-3
    (POS-PL3), and CSIX.
  • Four independent, configurable, 8-bit channels
    with the ability to aggregate channels for wider
    interfaces.
  • Media interface can support channelized media on
    RX and 32-bit connect to Switch Fabric over SPI-3
    on TX (and vice versa) to support Switch Fabric
    option.
  • 2 Quad Data Rate SRAM channels.
  • A QDR SRAM channel can interface to
    Co-Processors.
  • 1 DDR SDRAM channel.
  • PCI 64/66 Host CPU interface.
  • Flash and PHY Mgmt interface.
  • Dedicated inter-IXP channel to communicate fabric
    flow control information from egress to ingress
    for dual chip solution.

[Figure: dual-chip IXP2400 system block diagram. Labels: Host CPU
(Optional); PCI 64-bit / 66 MHz; IXA SW; QDR SRAM, 1.6 GB/s, 64 MByte;
Classification Accelerator; IXP2400 (Ingress); CoProc Bus; DDR DRAM,
2 GByte; Micro-Engine Clusters; Customer ASICs; IXP2400 (Egress); Flash;
Slow Port; Utopia 1/2/3 or POS-PL2/3 Interface; Flow Control Bus;
Utopia 1/2/3, SPI-3 (POS-PL3), CSIX; ATM / POS PHY or Ethernet MAC;
Switch Fabric Port Interface.]
3
[Figure: IXP2400 block diagram. Labels: MEv2 1 through MEv2 8; Intel
XScale Core, 32K IC, 32K DC; Gasket; DDRAM; QDR SRAM 1 and QDR SRAM 2
(E/D Q); SPI-3 or CSIX; Rbuf 64 @ 128B; Tbuf 64 @ 128B; PCI (64b)
66 MHz; Hash 64/48/128; Scratch 16KB; CSRs (Fast_wr, UART, Timers, GPIO,
BootROM/Slow Port). Bus-width labels: 72, 64b, 32b, 18.]
4
IXP2400 Resources Summary
  • Half Duplex OC-48 / 2.5 Gb/sec Network Processor
  • (8) Multi-Threaded Microengines
  • Intel XScale Core
  • Media / Switch Fabric Interface
  • PCI interface
  • 2 QDR SRAM interface controllers
  • 1 DDR SDRAM interface controller
  • 8 bit asynchronous port
  • Flash and CPU bus
  • Additional integrated features
    • Hardware Hash Unit
    • 16 KByte Scratchpad Memory
    • Serial UART port
    • 8 general purpose I/O pins
    • Four 32-bit timers
    • JTAG Support

5
IXP2800 External Features
6
[Figure: IXP2800 block diagram. Labels: MEv2 1 through MEv2 16; RDRAM 1,
RDRAM 2, RDRAM 3; Stripe; QDR SRAM 1 through QDR SRAM 4 (E/D Q); Intel
XScale Core, 32K IC, 32K DC; Gasket; SPI-4 or CSIX; Rbuf 64 @ 128B;
Tbuf 64 @ 128B; PCI (64b) 66 MHz; Hash 48/64/128; Scratch 16KB; CSRs
(Fast_wr, UART, Timers, GPIO, BootROM/Slow Port). Bus-width labels: 64b,
16b, 18.]
7
IXP2800 Resources Summary
  • Half Duplex OC-192 / 10 Gb/sec Network Processor
  • (16) Multi-Threaded Microengines
  • Intel XScale Core
  • Media / Switch Fabric Interface
  • PCI interface
  • 4 QDR SRAM Interface Controllers
  • 3 Rambus DRAM Interface Controllers
  • 8 bit asynchronous port
  • Flash and CPU bus
  • Additional integrated features
    • Hardware Hash Unit for generating 48-, 64-, or 128-bit adaptive
      polynomial hash keys
    • 16 KByte Scratchpad Memory
    • Serial UART port for debug
    • 8 general purpose I/O pins
    • Four 32-bit timers
    • JTAG Support

8
IXP2800 and IXP2400 Comparison
Feature | IXP2400 | IXP2800
Frequency | 600/400 MHz | 1.4/1.0 GHz/650 MHz
DRAM Memory | 1 channel DDR DRAM, 150 MHz, up to 2 GB | 3 channels RDRAM, 800/1066 MHz, up to 2 GB
SRAM Memory | 2 channels QDR (or co-processor) | 4 channels QDR (or co-processor)
Media Interface | Separate 32-bit Tx and Rx, configurable to SPI-3, UTOPIA 3 or CSIX_L1 | Separate 16-bit Tx and Rx, configurable to SPI-4 P2 or CSIX_L1
Number of MicroEngines | 8 (MEv2) | 16 (MEv2)
Performance | Dual chip full duplex OC48 | Dual chip full duplex OC192
9
MicroEngine v2
[Figure: MicroEngine v2 block diagram. Inputs: D-Push Bus, S-Push Bus,
From Next Neighbor. Storage: Control Store (4K/8K instructions); Local
Memory (640 words, LM Addr 0 and LM Addr 1, 2 per CTX); two banks of 128
GPRs; 128 Next Neighbor registers; 128 S Xfer In and 128 D Xfer In;
128 S Xfer Out and 128 D Xfer Out. 32-bit execution data path:
A_Operand, B_Operand, Prev A, Prev B; add, shift, logical; Multiply;
Find first bit; P-Random; CRC Unit and CRC remain; ALU_Out. CAM: TAGs
0-15, Lock 0-15, Status and LRU Logic (6-bit), Status, Entry. Other:
Other Local CSRs, Timers, Timestamp. Outputs: To Next Neighbor, D-Pull
Bus, S-Pull Bus.]
10
Microengine v2 Features Part 1
  • Clock Rates
    • IXP2400: 600/400 MHz
    • IXP2800: 1.4/1.0 GHz/650 MHz
  • Control Store
    • IXP2400: 4K instruction store
    • IXP2800: 8K instruction store
  • Configurable to 4 or 8 threads
    • Each thread has its own program counter, registers, signal and
      wakeup events
    • Generalized Thread Signaling (15 signals per thread)
  • Local Storage Options
    • 256 GPRs
    • 256 Transfer Registers
    • 128 Next Neighbor Registers
    • 640 32-bit words of local memory

11
Microengine v2 Features Part 2
  • CAM (Content Addressable Memory)
    • Performs parallel lookup on 16 32-bit entries
    • Reports a 9-bit lookup result
      • 4 State bits (software controlled, no impact to hardware)
      • Hit: entry number that hit. Miss: LRU entry
      • 4-bit index of CAM entry (Hit) or LRU entry (Miss)
    • Improves usage of multiple threads on same data
  • CRC hardware
    • IXP2400: provides CRC_16, CRC_32
    • IXP2800: provides CRC_16, CRC_32, iSCSI, CRC_10 and CRC_5
    • Accelerates CRC computation for ATM AAL/SAR, ATM OAM and Storage
      applications
  • Multiply hardware
    • Supports 8x24, 16x16 and 32x32
    • Accelerates metering in QoS algorithms
      • DiffServ, MPLS
  • Pseudo Random Number generation
    • Accelerates RED, WRED algorithms
  • 64-bit Time-stamp and 16-bit Profile count
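
The CAM behavior listed on this slide can be modeled in a few lines. The following is a minimal software sketch, not the hardware's actual interface: the class name `MicroengineCam`, the method names, and the bit packing of the 9-bit result (state in the top 4 bits, then one hit bit, then the 4-bit entry index) are all assumptions made for illustration. The slide only states that the result has 9 bits, 4 of which are state bits and 4 of which index the hit or LRU entry. The hardware compares all 16 entries in parallel; the model uses a linear search.

```python
class MicroengineCam:
    """Toy model of a 16-entry CAM with per-entry state bits and LRU tracking."""

    ENTRIES = 16

    def __init__(self):
        self.tags = [None] * self.ENTRIES     # 32-bit tags; None marks an empty entry
        self.state = [0] * self.ENTRIES       # 4 software-controlled state bits
        self.lru = list(range(self.ENTRIES))  # front of the list = least recently used

    def _touch(self, entry):
        # Mark an entry as most recently used.
        self.lru.remove(entry)
        self.lru.append(entry)

    def write(self, entry, tag, state=0):
        # Load a tag and its state bits into a chosen entry.
        self.tags[entry] = tag & 0xFFFFFFFF
        self.state[entry] = state & 0xF
        self._touch(entry)

    def lookup(self, tag):
        # Assumed packing of the 9-bit result: state(4) | hit(1) | entry(4).
        tag &= 0xFFFFFFFF
        if tag in self.tags:
            entry = self.tags.index(tag)
            self._touch(entry)
            return (self.state[entry] << 5) | (1 << 4) | entry
        # Miss: report the LRU entry so software knows which slot to replace.
        return self.lru[0]


def decode(result):
    """Split the assumed 9-bit result into its three fields."""
    return {
        "state": (result >> 5) & 0xF,
        "hit": bool((result >> 4) & 1),
        "entry": result & 0xF,
    }


cam = MicroengineCam()
miss = decode(cam.lookup(0xABCD))   # nothing loaded yet, so this is a miss
cam.write(miss["entry"], 0xABCD, state=3)
hit = decode(cam.lookup(0xABCD))    # now a hit on the entry just written
```

On a miss, a thread would typically fetch the data, then write the tag into the reported LRU entry, so that other threads looking up the same tag get a hit instead of fetching again.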

12
Intel XScale Core Overview
  • High-performance, Low-power, 32-bit Embedded RISC
    processor
  • Clock rate
    • IXP2400: 600 MHz
    • IXP2800: 700/500/325 MHz
  • 32 Kbyte instruction cache
  • 32 Kbyte data cache
  • 2 Kbyte mini-data cache
  • Write buffer
  • Memory management unit

13
Web Switch Design Using Network Processors (NSF Project, 2002-2005)
  • Funded by NSF and Intel (not Intel Confidential)
  • L. Zhao, Y. Luo, L. Bhuyan and R. Iyer, "A Network Processor-Based
    Content Aware Switch," IEEE Micro, May/June 2006

14
Web Switch or Layer 5 Switch
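[Figure: a client request (GET /cgi-bin/form HTTP/1.1, Host:
www.yahoo.com) crosses the Internet to the Switch, which sits in front of
an Image Server, an Application Server, and an HTML Server. Packet layers
shown: IP, TCP, APP. DATA.]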
  • Layer 4 switch
    • Content blind
    • Storage overhead
    • Difficult to administer
  • Content-aware (Layer 5/7) switch
    • Partitions the server database over different nodes
    • Increases performance due to improved hit rate
    • Servers can be specialized for certain types of request
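
To make the difference concrete, a content-aware switch has to read the HTTP request before it can choose a server, whereas a Layer 4 switch sees only addresses and ports. The sketch below is a hypothetical dispatch rule, not the policy used in the project: the function name `pick_server` and the matching rules (a `/cgi-bin/` prefix and a few image extensions) are assumptions. Only the three server roles (image, application, HTML) come from the slide.

```python
def pick_server(http_request: str) -> str:
    """Choose a back-end role from the request line of an HTTP request."""
    request_line = http_request.split("\r\n", 1)[0]
    parts = request_line.split()
    if len(parts) < 2:
        return "html"  # malformed request line: fall back to the default server

    # Drop any query string and compare case-insensitively.
    path = parts[1].split("?", 1)[0].lower()

    if path.startswith("/cgi-bin/"):
        return "application"  # dynamic content
    if path.endswith((".jpg", ".jpeg", ".png", ".gif")):
        return "image"        # static images
    return "html"             # everything else


request = "GET /cgi-bin/form HTTP/1.1\r\nHost: www.yahoo.com\r\n\r\n"
target = pick_server(request)
```

Because the request arrives only after the TCP handshake completes, the switch must accept the client connection itself before it can run a rule like this one.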

15
Layer-7 Two-way Mechanisms
  • TCP gateway
  • Application level proxy on the web switch
    mediates the communication between the client and
    the server
  • TCP splicing
  • Reduces the TCP gateway overhead by forwarding packets directly in
    the OS

[Figure: TCP gateway and TCP splicing data paths. Labels: user, kernel,
kernel.]
16
TCP Splicing
  • Establish connection with the client
  • Three-way handshake
  • Choose the server
  • Establish connection with the server
  • Splice two connections
  • Map the sequence numbers for subsequent packets
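
The last step can be sketched as plain modular arithmetic. Once the two connections are spliced, every packet crossing the switch has its sequence and acknowledgment numbers shifted by fixed offsets derived from the initial sequence numbers (ISNs) of the two handshakes. The class and parameter names below are hypothetical, and the sketch ignores checksums, TCP options, and retransmissions.

```python
MOD = 1 << 32  # TCP sequence numbers wrap at 2**32


class Splice:
    """Fixed-offset sequence/ack translation between two spliced connections."""

    def __init__(self, client_isn, switch_isn_to_client,
                 switch_isn_to_server, server_isn):
        # Offset applied to client sequence numbers on the way to the server.
        self.c2s_delta = (switch_isn_to_server - client_isn) % MOD
        # Offset applied to server sequence numbers on the way to the client.
        self.s2c_delta = (switch_isn_to_client - server_isn) % MOD

    def client_to_server(self, seq, ack):
        # The client's ack refers to bytes numbered in the switch-to-client
        # space, so it is shifted back into the server's numbering.
        return (seq + self.c2s_delta) % MOD, (ack - self.s2c_delta) % MOD

    def server_to_client(self, seq, ack):
        # Mirror image: shift the server's seq forward, its ack back.
        return (seq + self.s2c_delta) % MOD, (ack - self.c2s_delta) % MOD


# Example with made-up ISNs; the switch reuses the client's ISN toward the
# server, so client sequence numbers pass through unchanged.
splice = Splice(client_isn=1000, switch_isn_to_client=5000,
                switch_isn_to_server=1000, server_isn=9000)
seq, ack = splice.server_to_client(seq=9001, ack=1101)
```

Reusing the client's ISN on the server-side connection is a common simplification because it leaves only one direction to translate. The sketch supports either choice.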

[Figure: timing diagram of the packet exchange between Client, Switch,
and Server, with time on the vertical axis.]
17
Partitioning the Workload
18
Latency on a Linux-based switch
  • Latency is reduced by TCP splicing

19
Latency using NP
20
Throughput
21
NePSim (http://www.cs.ucr.edu/yluo/nepsim/)
  • Objectives
  • Open-source
  • Cycle-level accuracy
  • Flexibility
  • Integrated power model
  • Fast simulation speed
  • Challenges
  • Domain specific instruction set
  • Porting network benchmarks
  • Difficulty in debugging multithreaded programs
  • Verification of the functionality and timing

Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, "NePSim," IEEE Micro Special
Issue on NP, Sept/Oct 2004; Intel IXP Summit, Sept 2004.
Users from UCSD, Univ. of Arizona, Georgia Tech, Northwestern Univ., and
Tsinghua Univ.
Between July 2004 and October 2006, NePSim had 3530 web page visits and
806 downloads.
22
NePSim Software Architecture
  • Microengine (six)
  • Memory (SRAM/SDRAM)
  • Network Device
  • Debugger
  • Statistics
  • Verification

[Figure: NePSim block diagram with components Microengine, SRAM, SDRAM,
Network Device, Debugger, Stats, and Verification.]


23
Power Model
H/W component | Model Type | Tool | Configurations
GPR per Microengine | Array | XCacti | 2 64-entry files, one read/write port per file
Control store, scratchpad | Cache w/o tag path | XCacti | 4 KB, 4 bytes per block, direct mapped, 10-bit address
ALU, shifter | ALU and shifter | Wattch | 32-bit

24
Benchmarks
  • ipfwdr
    • IPv4 forwarding (header validation, IP lookup)
    • Medium SRAM access
  • nat
    • Network address translation
    • Medium SRAM access
  • url
    • Examines payload for URL pattern
    • Heavy SDRAM access
  • md4
    • Computes a 128-bit message signature
    • Heavy computation and SDRAM access

25
Verification of NePSim
X. Chen, Y. Luo, H. Hsieh, L. Bhuyan, F. Balarin,
"Utilizing Formal Assertions for System Design of
Network Processors," Design Automation and Test
in Europe (DATE), 2004.
26
Performance-Power Trend
  • Power consumption increases faster than performance

[Figure: four panels (url, ipfwdr, md4, nat), each plotting Power and
Performance.]
27
Dynamic Voltage Scaling
  • Power = C · α · V² · f
  • Voltage scales with frequency
  • Reduce PE voltage and frequency when PE has idle
    time
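
A short numeric sketch shows why lowering voltage and frequency together pays off. It uses the standard dynamic power relation P = C · α · V² · f (switched capacitance, activity factor, supply voltage, clock frequency). The function names and the example scaling factor are assumptions for illustration, not measurements from NePSim.

```python
def dynamic_power(capacitance, activity, voltage, frequency):
    """Dynamic power in watts: P = C * alpha * V^2 * f."""
    return capacitance * activity * voltage ** 2 * frequency


def relative_power(v_scale, f_scale):
    """Power after scaling, as a fraction of the original power.

    C and alpha cancel out, leaving v_scale^2 * f_scale.
    """
    return v_scale ** 2 * f_scale


# Hypothetical case: a PE with idle time is slowed to 80% of its clock
# and its supply voltage is lowered to 80% as well.
fraction = relative_power(v_scale=0.8, f_scale=0.8)
saving = 1.0 - fraction
```

In this example the PE draws roughly half of its original dynamic power while losing only 20% of its clock rate, which is the kind of trade DVS exploits when a PE would otherwise sit idle.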


28
Power Reduction with DVS
[Figure: bar chart of Power Reduction and Perf. Reduction for url,
ipfwdr, md4, nat, and their average.]
Yan Luo, Jun Yang, Laxmi Bhuyan, Li Zhao, "NePSim: A Network Processor
Simulator with Power Evaluation Framework," IEEE Micro Special Issue on
Network Processors, Sept/Oct 2004
29
Power Saving by Clock Gating
  • Shutdown unnecessary PEs, re-activate PEs when
    needed
  • Clock gating retains PE instructions

Yan Luo, Jia Yu, Jun Yang, Laxmi Bhuyan, "Low Power Network Processor
Design Using Clock Gating," IEEE/ACM Design Automation Conference (DAC),
June 2005. Extended version to appear in ACM Trans. on Architecture and
Code Optimization.
30
Challenges of Clock Gating PEs
  • Terminating threads safely
    • Threads request memory resources
    • Stopping unfinished threads results in resource leakage
  • Reschedule packets to avoid orphan ports
    • Static thread-port mapping prohibits shutting down PEs
    • Dynamically assign packets to any waiting threads
  • Avoid extra packet loss
    • Burst packet arrival can overflow the internal buffer
    • Use a small extra buffer space to handle bursts
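
The three remedies above can be combined in one small model. This is a simplified sketch, not the scheme from the cited paper: the `Dispatcher` class, its method names, and the unbounded `backlog` queue are assumptions. It shows packets going to any waiting thread rather than a fixed port, a PE being gated only after its busy threads have finished, and a backlog absorbing bursts when no thread is free.

```python
from collections import deque


class Dispatcher:
    """Assign packets to any waiting thread; gate a PE only once it is idle."""

    def __init__(self, num_pes, threads_per_pe):
        self.active = set(range(num_pes))
        self.waiting = deque(
            (pe, t) for pe in range(num_pes) for t in range(threads_per_pe)
        )
        self.busy = {}          # (pe, thread) -> packet being processed
        self.backlog = deque()  # extra buffer space for bursts

    def arrive(self, packet):
        # Dynamic assignment: no static thread-to-port mapping.
        if self.waiting:
            worker = self.waiting.popleft()
            self.busy[worker] = packet
            return worker
        self.backlog.append(packet)
        return None

    def finish(self, worker):
        del self.busy[worker]
        if worker[0] not in self.active:
            return  # PE is being gated: the thread retires cleanly
        if self.backlog:
            self.busy[worker] = self.backlog.popleft()
        else:
            self.waiting.append(worker)

    def gate(self, pe):
        # Stop feeding the PE; report whether its clock can be stopped yet.
        self.active.discard(pe)
        self.waiting = deque(w for w in self.waiting if w[0] != pe)
        return self.is_drained(pe)

    def is_drained(self, pe):
        return all(w[0] != pe for w in self.busy)


d = Dispatcher(num_pes=2, threads_per_pe=2)
first = d.arrive("p1")          # lands on PE 0, thread 0
can_stop_now = d.gate(0)        # False: PE 0 still has a packet in flight
d.finish(first)
can_stop_later = d.is_drained(0)
```

Waiting for in-flight threads to finish before stopping the clock is what keeps them from leaking resources they had requested.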

31
Experiment Results of Clock Gating
  • <4% reduction in system throughput
32
Main Contributions
  • Constructed an execution driven multiprocessor
    router simulation framework, proposed a set of
    benchmark applications and evaluated performance
  • Built NePSim, the first open-source network
    processor simulator, ported network benchmarks
    and conducted performance and power evaluation
  • Applied dynamic voltage scaling to reduce power
    consumption
  • Used clock gating to adapt number of active PEs
    according to real-time traffic