Title: Why Parallel Optics
1- Why Parallel Optics?
- High bandwidth, low latency
- Extended distance vs copper links
- Greater Channel Packaging Density
- Lower cost per Gbit/s
- (see 2004 IBM Academy study on Intra-CEC
Interconnects, G. Katopis and M. Taubenblatt)
- IBM Applications
- pSeries Federation switch links (2004 GA)
- xSeries BladeCenter Apollo-lite
- Optical InfiniBand links
- BladeCenter (4X, 12X)
- (active cable or card dongle)
- z,pSeries (12X DDR)
- Industry wide applications (Sun, Topspin, etc.)
2 Prediction of Component Failures in Enterprise
Server Parallel Optical Interconnects
Dr. Casimer DeCusatis and Rob Atkins, IBM eServer
Hardware I/O
3 Prediction of Component Failures in Enterprise
Server Parallel Optical Interconnects
This version is for IBM internal use only.
4- Overview
- Reliability of multi-channel devices is critical
for both current and future eServer applications
- Conventional ESD signature fails
- Occur soon after manufacturing
- Tend to occur at the same position in the array
- Affect more than one laser element
- New failure mechanism recently discovered
- Latent damage, MTTF about 25-30 days
- Single laser fails, randomly distributed
- Related to manufacturing process, can be
screened; root cause analysis ongoing
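As a rough illustration of what an MTTF of about 25-30 days implies, the sketch below assumes that the time to failure of an affected laser is exponentially distributed. The slides do not state a lifetime distribution, so that model, the function name, and the 30-day horizon in the loop are illustrative assumptions only.

```python
import math


def fraction_failed(t_days, mttf_days):
    """Expected fraction of affected lasers failed by t_days.

    Assumes an exponential time-to-failure model, which is an
    illustrative assumption and not something the slides state.
    """
    if mttf_days <= 0:
        raise ValueError("mttf_days must be positive")
    return 1.0 - math.exp(-t_days / mttf_days)


# Compare the two ends of the quoted MTTF range at an arbitrary horizon.
for mttf in (25, 30):
    print(f"MTTF {mttf} d: fraction failed by day 30 = "
          f"{fraction_failed(30, mttf):.3f}")
```

Under this model, a shorter MTTF simply means a larger share of the affected population fails by any given day. It says nothing about how many lasers are affected in the first place.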
5- pSeries Federation Switch
- Optical Riser Card
- SNAP 12 Optics
- Two Transmitters (TX) per Riser Card
Each TX contains an array of 12 Vertical Cavity
Surface Emitting Lasers (VCSELs)
6 Visual Inspection of VCSEL
7 Spreadsheet Excerpt
8 FA Measurements
- FA on 108 optical riser cards
- Multiple laser fails in one TX were not typically
adjacent
- Pretty even distribution across array
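The two observations above (fails not typically adjacent, and spread fairly evenly across the array) are the kind of thing that can be checked mechanically from a list of failed channel positions. The following is a minimal sketch, assuming channels are indexed 0 to 11 within each 12-element array. The function names are hypothetical, and the underlying FA spreadsheet is not reproduced here, so no real data is shown.

```python
from collections import Counter

N_CHANNELS = 12  # each TX contains an array of 12 VCSELs


def chi_square_uniform(failed_channels, n_channels=N_CHANNELS):
    """Pearson chi-square statistic of fail positions against a uniform spread.

    failed_channels: channel index (0..n_channels-1) of every failed laser,
    pooled over all analyzed transmitters.
    """
    if not failed_channels:
        raise ValueError("need at least one failed channel")
    counts = Counter(failed_channels)
    expected = len(failed_channels) / n_channels
    return sum((counts.get(ch, 0) - expected) ** 2 / expected
               for ch in range(n_channels))


def has_adjacent_fails(failed_in_one_tx):
    """True if any two failed channels in a single TX are neighbors."""
    chans = sorted(set(failed_in_one_tx))
    return any(b - a == 1 for a, b in zip(chans, chans[1:]))
```

A small statistic, judged against a chi-square distribution with `n_channels - 1` degrees of freedom, is consistent with an even spread. A large one points to a position-dependent signature such as the conventional ESD fails described in the overview.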
9 Wafer mapping for some returned arrays
10 Spectral Analysis of remaining channels looked
normal
11 First Increase to Burn-In
- Due to considerable restriction of capacity in
starting up Proposal A, Proposal B was elected
first, in April 2004.
- Optics vendor later discovered that these
calculations were inaccurate
- Current to lasers is limited at the module
level, reducing the acceleration obtained from
temperature.
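One plausible reading of the last point is that burn-in acceleration depends on drive current as well as temperature, so a current cap at the module level lowers the overall acceleration. The sketch below uses a commonly quoted empirical form for laser wear-out, a current power law multiplied by an Arrhenius term. Whether this model applies to the defect discussed here is not stated in the slides; the exponent `n`, the activation energy, and every numeric value in the example are assumptions, and self-heating of the junction is ignored.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def acceleration_factor(i_stress_ma, i_use_ma, t_stress_c, t_use_c, ea_ev, n):
    """Combined current and temperature acceleration factor.

    AF = (I_stress / I_use)**n * exp(Ea/k * (1/T_use - 1/T_stress)),
    with temperatures converted to kelvin. An assumed empirical model.
    """
    t_stress_k = t_stress_c + 273.15
    t_use_k = t_use_c + 273.15
    current_term = (i_stress_ma / i_use_ma) ** n
    thermal_term = math.exp(ea_ev / K_B_EV * (1.0 / t_use_k - 1.0 / t_stress_k))
    return current_term * thermal_term


# Hypothetical comparison: same oven temperature, with and without a
# current cap that holds the stress current at the use-level current.
uncapped = acceleration_factor(15.0, 6.0, 100.0, 40.0, ea_ev=0.7, n=2.0)
capped = acceleration_factor(6.0, 6.0, 100.0, 40.0, ea_ev=0.7, n=2.0)
print(f"uncapped AF / capped AF = {uncapped / capped:.2f}")
```

With the current term pinned at 1, only the thermal term contributes, which would make a burn-in plan based on an assumed higher current overestimate its coverage.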
12 TIVA
- Thermally Induced Voltage Alteration
- How TIVA Works
- Scanned optical beam with photon energy below the
semiconductor bandgap locally heats integrated
circuit (IC).
- Temperature rise in IC creates local changes in
resistance.
- Constant current source supplies bias that
results in voltage variation with resistance
changes.
- Digital record of voltage versus scanner position
produces TIVA image.
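The steps above can be sketched as a toy model. It assumes the device under test behaves like a set of resistive elements in series, one per scan position, each with its own temperature coefficient of resistance. That is a deliberate simplification, not how Agilent's tool works internally, and all names and values are hypothetical. Heating one element by a small temperature step changes its resistance, and under constant-current bias the change appears as a voltage shift recorded at that scan position.

```python
def tiva_image(resistance_ohm, tcr_per_k, bias_current_a, delta_t_k):
    """Voltage change recorded at each scan position.

    resistance_ohm, tcr_per_k: 2D lists (same shape) giving the resistance
    and temperature coefficient of the element under each beam position.
    With a constant bias current I, heating one element by dT shifts the
    measured voltage by dV = I * R * tcr * dT.
    """
    image = []
    for r_row, tcr_row in zip(resistance_ohm, tcr_per_k):
        image.append([bias_current_a * r * tcr * delta_t_k
                      for r, tcr in zip(r_row, tcr_row)])
    return image


def strongest_pixel(image):
    """(row, col) of the scan position with the largest voltage change."""
    positions = [(row, col)
                 for row, line in enumerate(image)
                 for col, _ in enumerate(line)]
    return max(positions, key=lambda rc: abs(image[rc[0]][rc[1]]))


# Hypothetical 3x3 scan: uniform elements except one thermally
# sensitive site standing in for a leaky path.
resistance = [[1.0] * 3 for _ in range(3)]
tcr = [[0.001] * 3 for _ in range(3)]
tcr[1][2] = 0.05
image = tiva_image(resistance, tcr, bias_current_a=0.01, delta_t_k=1.0)
print(strongest_pixel(image))
```

The position with the strongest response is where the image would show contrast, which is how a localized leaky path stands out against the rest of the device.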
13 Failure Mechanism
- Failure Analysis to date indicates VCSEL damage
- Signature suggests ESD / EOS (Electrostatic
Discharge / Electrical Overstress)
- Verified by Agilent via TIVA analysis
TIVA analysis of channel 6: leaky path at left
hand side along the edge of the aperture.
TIVA analysis of channel 7: leaky path at left
hand side along the edge of the aperture.
14 Final Burn-In / Screen Details
- Currently, infant failures observed at customers
range from a minimum of 4 days to a maximum of
approximately 85 days
- Optics vendor implemented extended module burn-in
at 100C-48H for detection on March 17th, 2004
- Because the activation energy is unknown, optics
vendor further extended module burn-in to 10/18
days to increase detection sensitivity, allowing
for the possibility that the defect has a lower
activation energy
- Stage 1: one module will be drawn from each
production lot (max 8 modules per lot) for
100C-10 days burn-in (implementation date May
26th). This will increase confidence in modules
shipped in parallel
- Stage 2: implement 100% module burn-in at
100C-10/18 days. Optics vendor ordered additional
burn-in ovens and burn-in boards. By July 16th,
2004, optics vendor had adequate capacity for the
whole parallel product line.
- If failures are observed at 100C-10 days module
burn-in, modules will go for an additional 8 days
of burn-in. If there is any failure after 18 days
of burn-in, the shipment will be put on hold
pending analysis and disposition
- To further improve screen / detection
sensitivity, optics vendor implemented extended
VCSEL burn-in by doubling the current burn-in
time to 96H-100C-15mA (without impacting the end
of life), started May 21st, 2004.
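The reasoning behind lengthening the burn-in when the activation energy is unknown can be illustrated with the standard Arrhenius acceleration factor. In the sketch below, the 100C stress temperature and the 48 h / 10 day / 18 day durations come from the slides. The 40C use temperature and the candidate activation energies are assumptions chosen only to show the trend, and current acceleration is left out.

```python
import math

K_B_EV = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_af(ea_ev, t_stress_c, t_use_c):
    """Arrhenius acceleration factor between use and stress temperature."""
    t_stress_k = t_stress_c + 273.15
    t_use_k = t_use_c + 273.15
    return math.exp(ea_ev / K_B_EV * (1.0 / t_use_k - 1.0 / t_stress_k))


def equivalent_use_days(burn_in_days, ea_ev, t_stress_c=100.0, t_use_c=40.0):
    """Field-equivalent days covered by a burn-in of the given length."""
    return burn_in_days * arrhenius_af(ea_ev, t_stress_c, t_use_c)


# Assumed activation energies; the true value was unknown to the authors.
for ea in (0.3, 0.5, 0.7):
    covered = [equivalent_use_days(days, ea) for days in (2, 10, 18)]
    print(f"Ea = {ea} eV:",
          ", ".join(f"{c:.0f} d" for c in covered))
```

The lower the activation energy, the smaller the acceleration factor, so a fixed burn-in covers fewer field-equivalent days. Extending the burn-in is a way to keep coverage beyond the latest observed field failure (about 85 days) even under a pessimistic activation energy.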
15 Conclusions
- We observe ESD signature fails that appear to
have been caused by events occurring early in the
optics vendor manufacturing process
- ESD causes dislocations in VCSEL which grow with
time when powered on, resulting in latent
failures
- Extensive ESD audits done at optics vendor (and
ECAT card manufacturer and IBM for good measure)
- No root cause found
- Overall, optics vendor had very good ESD
processes. Identified a couple areas for
improvement
- Screen put in place at final module burn-in at
optics vendor
- 10/18 days
- Caught a few ESD fails before process
improvements in place
- Optics vendor process improvements were
implemented. Have not seen an ESD fail from a
part built after improvements were in place
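The 10/18-day module burn-in screen referred to in these conclusions can be summarized as a small decision function. This is a sketch of the rule as described in the burn-in details: failures at 10 days trigger 8 more days, and any failure after 18 days puts the shipment on hold. The function and constant names are hypothetical. Releasing a lot that shows no further failures during the extra 8 days is an assumption, since the slides do not spell out that case.

```python
SHIP = "ship"
EXTEND = "extend burn-in to 18 days"
HOLD = "hold shipment pending analysis and disposition"


def disposition(fails_at_10_days, fails_in_extra_8_days=None):
    """Disposition of a burn-in lot under the 10/18-day screen.

    fails_at_10_days: failures seen during the first 10 days at 100C.
    fails_in_extra_8_days: failures seen during the additional 8 days,
    or None if the extension has not been run yet.
    """
    if fails_at_10_days == 0:
        return SHIP
    if fails_in_extra_8_days is None:
        return EXTEND
    if fails_in_extra_8_days > 0:
        return HOLD
    # Assumption: a clean extension releases the lot; not stated in the slides.
    return SHIP


print(disposition(0))
print(disposition(1))
print(disposition(1, 2))
```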