Title: Windows NT 4.0 Setup and Debugging
1Windows NT 4.0 Setup and Debugging
- Joseph West
- Sr Technology Specialist
2Agenda
- Setup (build overview)
- Three phases of Setup
- Character-Based Setup
- Boot from Character-Based to GUI-Based Setup
- GUI-Based Setup
- Troubleshooting
- (Blue Screens Stop Codes)
- Latest information for NT 4.0
- SP4
3Hardware Compatibility List
- How important is it?
- Support parameters
- http://www.microsoft.com/hwtest/
- http://support.microsoft.com/
4Character-Based Setup
- Gathering of System Architecture Information
- CPU Type
- Motherboard Architecture
- Hard Drive Controllers
- File Systems
- Disk Free Space
- Memory
5Info Gathered is Required for Basic System
Initialization
- Failure to Detect will lead to failure of Setup
- Unsupported components and enhancements
- PCI 2.1
- Special Bus Drivers
- Caching Chips for Burst Mode
6Boot from Character-Based to GUI-Based Setup
- Windows NT Kernel is loaded completely for the first time
- Finds a valid Hard Drive
- Polls Adapters and tests Bus
- Most likely point of failure
- Drivers are loaded into Memory and Multi-threading is initialized
7GUI-Based Setup
- Install secondary Drivers
- Create Accounts
- Machine and Administrator
- Configure Network Settings
- Build final System Tree and Registry
8Troubleshooting Character-Based Setup
- NTHQ Tool
- Located in Support Directory
- Purpose is to show all hardware peripheral settings
- Works with PCI, PnP and Legacy peripherals
9Troubleshooting Character-Based Setup
10Troubleshooting Character-Based Setup
- Unsupported Controller and BIOS Enhancements
- 32-bit I/O
- Enhanced Drive Access
- Multiple Block Access or Rapid IDE
- Power Management Features
11Troubleshooting Character-Based Setup
- Setup Hangs During Initial Boot
- Disable CD-Boot capability before installing
- Needs to be done at both the Controller and BIOS levels
12Troubleshooting Character-Based Setup
- Setup Cannot Find Hard Drive
- Scan System for Viruses
- Make certain there is valid Boot Sector on the
Hard Drive
13Troubleshooting Character-Based Setup
- Setup Cannot Find Hard Drive
- If Hard Drive Controller is SCSI
- Are devices properly terminated?
- Is SCSI BIOS enabled?
- First Controller (if at all)
- On secondary Controllers, make certain BIOS is disabled
- Partition and format using current Controller
14Troubleshooting Character-Based Setup
- Setup Cannot Find Hard Drive
- If Hard Drive Controller is IDE or EIDE
- Make certain drive is on primary Controller Channel
- Make certain drive is jumpered correctly
- (i.e., Master, Slave, Independent)
15Troubleshooting Character-Based Setup
- Setup Does Not Detect Hard Drive Controller Correctly
- Manually select Controller type
- Make certain that an NT 4.0 driver is being loaded
- Use NTHQ Tool to check for correct IRQ and Memory addressing
16Troubleshooting Character-Based Setup
- Setup Cannot Find a Valid Partition
- If Windows 95 is on the system, back up and Fdisk Hard Drive (no support for FAT32)
- Recreate Partitions and Format with DOS 6.22
- Restore Windows 95 and proceed with Windows NT installation
- Make certain that correct HAL is being loaded
17Troubleshooting Failure to Reboot From Character-Based to GUI-Based Setup
- Stop Messages
- Record Hex Value, 0x1e, 0x7b, etc.
- Record Values in parentheses
- Record component where failure occurred
- Note where in Boot Process error occurred
- Call PSS (installation support)
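The values this slide asks you to record can be pulled out of the first line of a blue screen mechanically. The following is a minimal Python sketch, assuming the line has the form `*** STOP: 0x... (p1,p2,p3,p4)`; the sample line and the helper name `parse_stop_line` are illustrative and do not come from the deck.

```python
import re

# Assumed layout of the first blue-screen line, for example:
# *** STOP: 0x0000000A (0x00000000,0x00000002,0x00000000,0x8013B4F0)
STOP_LINE = re.compile(r"STOP:\s*(0x[0-9A-Fa-f]+)\s*\(([^)]*)\)")


def parse_stop_line(text):
    """Return (stop_code, [parameters]) as integers, or None if no match."""
    match = STOP_LINE.search(text)
    if match is None:
        return None
    code = int(match.group(1), 16)
    # Parameters are the comma-separated hex values inside the parentheses.
    params = [int(p.strip(), 16) for p in match.group(2).split(",") if p.strip()]
    return code, params


# Illustrative sample line (hypothetical values).
sample = "*** STOP: 0x0000000A (0x00000000,0x00000002,0x00000000,0x8013B4F0)"
code, params = parse_stop_line(sample)
print(f"code=0x{code:X} params={[hex(p) for p in params]}")
```

Having the code and the four parameters as separate values makes it easier to read them back accurately when calling PSS.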
18Troubleshooting Failure to Reboot
- Stop Messages Which can be Solved in the Field
- 0x7b, (0x4,0,0,0), or 0x8b
- Indicates problem with Master Boot Record
- Scan for Viruses
- Confirm correct Controller driver is loaded
- Refresh Master Boot Record
19Troubleshooting Failure to Reboot
- After Reboot, Video Remains Black
- Check for devices using IRQs 2, 9 or 12 (PCI)
- Scan Hard Drive for Viruses
20Troubleshooting Failure to Reboot
- Stop Messages Which can be Solved in the Field
- 0x1e or 0xa
- Disable any Third-party services or drivers which were loaded prior to Upgrade
- Use NTHQ to confirm appropriate Memory and IRQ settings
21Troubleshooting GUI-Based Setup Issues
- Setup Will Not Read From CD-ROM Drive
- Make certain CD is on HCL
- Copy I386 directory to the Hard Drive and start again from the beginning
- Make certain that the Controller and/or Hard Drive is correctly configured
22Troubleshooting GUI-Based Setup Issues
- If Setup Fails During Copy of Files to Hard Drive
- Disable all external Caches in BIOS
- Make certain Hard Drives are terminated correctly (Active preferred)
23Setup Enhancements in Windows NT 4.0
- Bootable CD-ROM
- Supports only El Torito Specification
- Can only be used in No Emulation Mode
- Must be supported by both System and SCSI BIOS
24Setup Enhancements in Windows NT 4.0
- Winnt Character-Based Setup Logging
- Using Winnt or Winnt32 /L
- Logs all actions during character-based setup to find last successful action
- Helps to isolate where setup halted without requiring special DLLs
25Setup Enhancements in Windows NT 4.0
- Restartable GUI-Based Setup
- If the machine fails during GUI-mode Setup the
problem can be fixed and setup will continue from
reboot
26Agenda
- Setup (build overview)
- Three phases of Setup
- Character-Based Setup
- Boot from Character-Based to GUI-Based Setup
- GUI-Based Setup
- Troubleshooting
- (Blue Screens Stop Codes)
- Latest information for NT 4.0
- SP4
27You're Up and Running, But ...
28Debugging(the connection)
- Connect
- Modem, Null-modem cable, LAN
- Boot.ini
- /Debug /Debugport=com1 /Baudrate=19200
- Symbols
- Retail NT CD, in the support\debug\platform\symbols sub-directory
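The Boot.ini step above amounts to appending the debugger switches to the operating system line you boot from. The following is a minimal Python sketch of that edit on a single line of text; the ARC path in the sample and the helper name `add_debug_switches` are assumptions made for illustration, and the sketch does not touch any real file.

```python
def add_debug_switches(os_line, port="com1", baud=19200):
    """Append kernel-debugger switches to one Boot.ini operating system line.

    Switches that are already present on the line are not added twice.
    """
    wanted = ["/DEBUG", f"/DEBUGPORT={port.upper()}", f"/BAUDRATE={baud}"]
    # Collect the names of switches already on the line (text before any "=").
    present = {
        token.split("=")[0].upper()
        for token in os_line.split()
        if token.startswith("/")
    }
    missing = [sw for sw in wanted if sw.split("=")[0] not in present]
    return " ".join([os_line.rstrip()] + missing)


# Hypothetical entry from the [operating systems] section.
sample = r'multi(0)disk(0)rdisk(0)partition(1)\WINNT="Windows NT Server Version 4.00"'
print(add_debug_switches(sample))
```

Running the helper a second time on its own output leaves the line unchanged, which keeps repeated edits from stacking duplicate switches.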
29Debugging(the connection)
30Interpreting Blue Screens
- The error code and parameters at the top of the screen
- The list of modules that have successfully loaded and initialized in the middle of the screen
- The list of modules that are currently on the stack at the bottom of the screen
31Stop Codes
Note: For a complete listing of stop codes, see Windows NTW 4.0 Resource Kit, Chapter 39, Windows NT Debugger, or article Q142657 on http://support.microsoft.com
32Common Stop Codes
- 0xA
- 0x1E
- 0x24
- 0x3F
- 0x50
- 0x7B
- 0x7F
- 0xC000021A
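For quick reference, the codes in this list can be paired with the symbolic names given on the slides that follow. The following is a minimal Python sketch of such a lookup; the table contains only the eight codes from this deck, and the helper name `describe_stop` is illustrative.

```python
# Stop codes and names exactly as they appear on the following slides.
STOP_CODES = {
    0x0000000A: "IRQL_NOT_LESS_OR_EQUAL",
    0x0000001E: "KMODE_EXCEPTION_NOT_HANDLED",
    0x00000024: "NTFS_FILE_SYSTEM",
    0x0000003F: "NO_MORE_SYSTEM_PTES",
    0x00000050: "PAGE_FAULT_IN_NONPAGED_AREA",
    0x0000007B: "INACCESSIBLE_BOOT_DEVICE",
    0x0000007F: "UNEXPECTED_KERNEL_MODE_TRAP",
    0xC000021A: "FATAL_SYSTEM_ERROR",
}


def describe_stop(code_text):
    """Turn a hex string such as '0x7b' into '0x0000007B NAME'."""
    code = int(code_text, 16)
    name = STOP_CODES.get(code, "not one of the common codes listed here")
    return f"0x{code:08X} {name}"


print(describe_stop("0x7b"))
print(describe_stop("0xA"))
```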
33 0xA
- 0x0000000A IRQL_NOT_LESS_OR_EQUAL
- Description
- An attempt was made to touch paged-out memory at an interrupt request level (IRQL) that is too high. Code that runs at higher interrupt levels can't touch paged-out memory because paging would be too expensive. If a pageable page is not committed but its virtual address range is still in the translation buffer, high-IRQL code can get away with touching it. But if the system is stressed, the memory manager will likely have paged that page out, and the bugcheck occurs when the page-in is attempted. This is why certain bugs tend not to show up on developers' boxes, which are less stressed than production.
- Typical Scenarios
- System configuration changes, virus scanners, other file I/O filters.
34 0x1E
- 0x0000001E KMODE_EXCEPTION_NOT_HANDLED
- Description
- Essentially, this bugcheck identifies an error that occurred in a section of code where no error detection routines were in place. Most exceptions are generated directly in the section of code that is executing. In this case, the error was not trapped by the code that was executing, so it fell through to this default error handler. This makes the error a very common exception. The actual instruction fault is usually similar to a STOP 0xA: a memory access violation.
- Typical Scenarios
- Invalid or obsolete third-party driver or system service, Microsoft driver or system service bug, file I/O filter drivers.
35 0x24
- 0x00000024 NTFS_FILE_SYSTEM
- Description
- A STOP 0x24 is the result of NTFS code that detects a problem with the structure of the NTFS file system. This is not a cut-and-dried exception code, and debugging it is sometimes difficult. Disk corruption can generate a STOP 0x23 (FAT_FILE_SYSTEM) or 0x24. However, any process involved in reading or writing data on a FAT or NTFS file system could cause the disk data to appear corrupted. Therefore SCSI and IDE drivers, as well as the disk structure itself (hard errors, i.e. bad blocks), can be suspect. The file system calls this bug check in multiple places, which helps identify the actual source line that generated it. This bugcheck can also be caused by I/O filter drivers (resource hangs, race conditions, etc.). After the above is eliminated, more low-level constructs such as file system synchronization objects, SCB attributes, etc. need to be examined by the debug engineer.
- Typical Scenarios
- This bugcheck is encountered when the NTFS file system has a corruption, or the hard drive has a bad block.
36 0x3F
- 0x0000003F NO_MORE_SYSTEM_PTES
- Description
- This stop isn't as common as most of the others in this section, but a good explanation is warranted. A STOP 0x3F is the result of a system doing lots of I/O, thereby fragmenting the system PTEs. The bugcheck occurs not because the system is out of PTEs, but because a driver requests a huge chunk of memory that can't be satisfied because a contiguous block that big isn't available.
- Typical Scenarios
- Often video drivers will allocate large amounts of kernel memory that must succeed. Also, some backup programs do the same.
- For these situations, consult a PSS engineer for the Registry hack that allows the increase of total system PTEs.
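The point of this slide is that a request can fail even when plenty of entries are free, because they are not contiguous. The following is a toy Python illustration of that idea, assuming a hypothetical pool of 16 slots; it models the concept only and is not how the NT memory manager tracks PTEs.

```python
def largest_free_run(in_use):
    """Return the length of the longest run of free (False) slots."""
    best = run = 0
    for used in in_use:
        run = 0 if used else run + 1
        best = max(best, run)
    return best


# Hypothetical pool of 16 slots where every fourth slot is still in use.
slots = [i % 4 == 3 for i in range(16)]

total_free = slots.count(False)
biggest = largest_free_run(slots)
request = 8  # a driver asks for 8 contiguous slots

print(f"free slots: {total_free}, largest contiguous run: {biggest}")
if request <= total_free and request > biggest:
    print("request fails: enough free slots, but no contiguous block that big")
```

In this toy pool three quarters of the slots are free, yet the largest contiguous run is far smaller than the request, which mirrors the fragmentation described above.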
37 0x50
- 0x00000050 PAGE_FAULT_IN_NONPAGED_AREA
- Description
- A STOP 0x50 is caused when a memory region that is not supposed to be paged out (usually for performance reasons) is paged out. This stop can be caused by a variety of problems, including corrupt NTFS volumes, bad network packet data, and in general kernel-mode drivers that corrupt memory. Drivers that free an MDL but don't communicate it to all portions of the driver are another cause. Others include Disk, Controller, and Disk Driver problems.
- Typical Scenarios
- Usually third-party kernel-mode drivers munging memory, or reading beyond allowable memory. Also, when the file system is pushed to the tested limits (large Mac volumes), bugs in NTFS are exposed that result in this STOP. This STOP can occur due to interaction problems between SCSI Controller firmware and Hard Drive firmware.
38 0x7B
- 0x0000007B INACCESSIBLE_BOOT_DEVICE
- Description
- During the initialization of the I/O system, the driver for the boot device may have failed to initialize the device that the system is attempting to boot from, or the file system that is supposed to read that device may have either failed its initialization or simply not recognized the data on the boot device as a file system structure.
- If this is the initial setup of the system, this error may have occurred because the system was installed on an unsupported Hard Disk or SCSI Controller.
- This error can also be caused by the installation of a new SCSI Adapter or Hard Disk Controller or by repartitioning the Hard Disk with the System Partition.
- Typical Scenarios
- VIRUS
- LBA type problems, MBR type problems, SCSI Controller/Hard Drive geometry issues, etc.
39 0x7F
- 0x0000007F UNEXPECTED_KERNEL_MODE_TRAP
- Description
- This error means a trap occurred in kernel mode, either a kind of trap that the kernel is not allowed to have or catch (a bound trap), or a kind of trap that is always instant death (double fault).
- Typical Scenarios
- Hardware, kernel mode drivers that manipulate critical system data in an untimely fashion.
- This STOP most often is the result of the processor taking a double fault, 0x7f (8,0,0,0). Note that these parameters can also show up for a modern software issue involving Netmon (bhnt.sys).
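The (8,0,0,0) example above suggests reading the first parameter as the processor trap number. The following is a minimal Python sketch of that reading, assuming the first parameter of a STOP 0x7F is the x86 exception vector; the table is deliberately short and the helper name `describe_trap` is illustrative.

```python
# A few x86 exception vectors, including the two kinds of trap named on
# this slide (bound trap and double fault).
X86_TRAPS = {
    0x0: "divide error",
    0x5: "bound trap (BOUND range exceeded)",
    0x6: "invalid opcode",
    0x8: "double fault",
}


def describe_trap(params):
    """Describe a STOP 0x7F from its parameter list, e.g. [8, 0, 0, 0]."""
    trap = params[0]  # assumption: first parameter is the trap number
    return X86_TRAPS.get(trap, f"trap 0x{trap:X} (not in this short table)")


print(describe_trap([8, 0, 0, 0]))
```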
40 0xC000021A
- 0xC000021A FATAL_SYSTEM_ERROR
- Description
- This is a typical description that accompanies this error: The Windows Subsystem System process terminated unexpectedly with a status of (0x6130F2B6 0x01B6FBA4). The system has been shutdown.
- The failing process sometimes is listed in the blue screen itself.
- This bugcheck occurs when a user-mode subsystem such as Winlogon or CSRSS is fatally compromised such that security cannot be guaranteed. The Operating System makes a transition into kernel mode and throws this exception.
- Typical Scenarios
- A typical cause of this crash would be an extensible perfmon counter that overwrites its Winlogon shared data buffer (Q171033), and in general any access violation that compromises a user-mode subsystem.
41Break
42Agenda
- Setup (build overview)
- Hardware Compatibility List
- Three Phases of Setup
- Character-Based Setup
- Boot from Character-Based to GUI-Based Setup
- GUI-Based Setup
- Troubleshooting
- (Blue Screens Stop Codes)
- Latest Information for NT 4.0
- SP4
43A Day in the Life
Video
44NT4 Service Pack 4
- Contents
- Hotfixes for important customer-reported problems
- Resource and memory leak bugfixes from NT5
- 30 support, diagnostic and repair tools from the NT Resource Kit are included on the SP4 CDROM
- Event log entries for clean and dirty shutdown
- Process Improvements
- Dedicated Service Pack test team
- Beta Program for Service Packs
- Improving the Knowledge Base, depth and ease of use
- Slipstreaming Service Packs into OEM releases
45Resource / Memory Leaks
- Problem
- Leaks lead to hung systems and bluescreen crashes
- Some customers do preventive reboots
- Difficult to stop or kill the offending process
- Solutions
- Fix leaks: several hundred in NT5, key fixes in NT4 SP4
- Job objects in NT5, set memory limits on a collection of processes
- Visual Studio adding leak checking to MFC and CRT
- Next Work Items
- Better leak detection
- Logging in under low resource conditions
- Stopping and killing processes
46Bugchecks (Blue Screens)
- Kernel mode code detected a serious error
- Blue screens are still frequent and very hard to diagnose
- Crash dumps take too long on large memory systems
- Prevention
- Find and fix bugs in our code
- Review all calls to KeBugCheck by NT5 RTM
- Improve diagnosis
- Reduced clutter on the blue screen, focus on key data, and add hints
- Crash dumps are now dramatically faster in NT5
- Developing comprehensive crashdump analysis tools for NT4 and NT5
47Bugchecks (Blue Screens)
48 3rd Party Drivers
- Problem
- One of the most common complaints from PSS
- Source of pool corruption - difficult to diagnose
- Solution
- DDK driver samples and documentation are improved in NT5
- Enhanced driver testing in NT4 and NT5, including pool corruption tests
- NT5 will have driver signing, warning level by default
- WDM drivers will drive higher quality
- We are testing major third-party anti-virus software regularly
49Unnecessary Reboots in NT5
- Problem
- Hardware and software configuration and maintenance
- Solutions
- Fixed 50 software configuration cases which required a reboot in NT4. Key fixes include:
- Adding, removing and configuring network protocols; changing IP addresses
- Reconfiguring settings on PCI and other PnP hardware
- Reboots still required for some rare cases
- Machine name change, domain membership changes, system locale and system font changes, service pack installation
- Hardware reconfiguration by clustering solutions in NTS/E
- Where possible, hotfixes will avoid requiring a reboot
50Diagnosis and Recovery
- Recovery Involves
- Detection (hard with a hung application or server)
- Diagnosis (need good tools, need parallel installs, bad error messages)
- System Recovery (chkdsk, crash dump biggest time hits)
- Application recovery (SQL, Exchange Store, etc)
- We are delivering
- 30 of the most critical support, diagnostic, and repair tools in SP4 and NT5 B2
- Fixing 35 worst error messages by B230, then next 200 as time allows
- NT5 Safe-mode Boot today and Floppy Boot by NT5 RTM
- Both support NTFS
- Web-based trouble-shooter for most common bluescreens
- Online chkdsk post NT5
51NT Test Initiatives
- Long duration Server stress
- 10 Servers running stress for a month starting at NT5 Beta 2
- Mix of stress including BackOffice, IIS, Client/Server, etc
- Specifically watching for memory and resource leaks
- Improved driver testing for NT4 and NT5
- Catch pool corruption
- Fault injection
- Better integration testing of Server applications
- BackOffice applications: Exchange, SQL Server
- Using automated scripts from BackOffice teams
- Testing with Oracle, SAP R/3, Lotus Notes
- 100 Top Server Applications from Tier 1 RDP customers
- Expanded tests for customer configurations
- RDP Customer configurations, ISP
52Resource Kit Tools
- Network Diagnostic and Support Tools
- nettest - quickly determine whether local user's network is configured properly (IDW)
- Applications, Service Problems and Memory Leaks
- memsnap - detection of memory and resource leaks over time (dump directory)
- Disk Problems
- fixacls - resets ACLs on system files to installation defaults, fixes users who hose their ACLs
- Debugger Tools
- debug wizard - easy setup of debuggers for customers
- Other
- windiff - file compare util, critical for many situations (reskit)
53Event Log Analyst
- Prototype tool for collecting and analyzing event log reliability data
- Designed for collecting reliability trend data from an entire datacenter in a few hours
- Collected data from 800 CDC servers in 5 hours
- Analysis is manual with Excel, less than 3 hours
- Provides trend analysis of reboots, bugchecks, and Dr Watsons
54Event Log Analyst
55Event Log Analyst Metrics
- Mean time between reboots
- Mean time between bugchecks
- Mean time between Dr Watsons
- Trend analysis of reboots/server-year
- Trend analysis of bugchecks/server-year
- Trend analysis of Dr Watsons/server-year
- Bugcheck distribution
- Dr Watson distribution
- SP4 Only: Availability percentage
- SP4 Only: Mean time to repair
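The metrics on this slide are simple ratios over event timestamps. The following is a minimal Python sketch of three of them; the sample dates and counts are made up, the function names are illustrative, and the availability formula (uptime divided by uptime plus downtime) is an assumption, since the deck does not define it.

```python
from datetime import datetime, timedelta


def mean_time_between(events):
    """Mean gap between consecutive event timestamps (None if fewer than 2)."""
    ordered = sorted(events)
    if len(ordered) < 2:
        return None
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)


def events_per_server_year(event_count, servers, observed_days):
    """Normalize an event count to events per server-year."""
    server_years = servers * observed_days / 365.0
    return event_count / server_years


def availability_percent(uptime, downtime):
    """Assumed definition: uptime as a percentage of total observed time."""
    return 100.0 * (uptime / (uptime + downtime))


# Hypothetical reboot timestamps for one server.
reboots = [datetime(1998, 1, 1), datetime(1998, 1, 11), datetime(1998, 1, 31)]
print(mean_time_between(reboots))
print(events_per_server_year(73, servers=10, observed_days=365))
print(availability_percent(timedelta(hours=99), timedelta(hours=1)))
```

The same `mean_time_between` helper applies unchanged to bugcheck or Dr Watson timestamps.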
56Tools for NT4 SP4 and NT5
- Network Diagnostic and Support Tools
- browstat - only useful tool for diagnosing browser problems (reskit)
- dhcpcmd - useful for fixing DHCP issues (reskit)
- dnscmd - diagnose and repair DNS problems (reskit)
- eseutil - used for WINS and DHCP database diagnosis and repair
- nettest - quickly determine whether local user's network is configured properly (IDW)
- winscl - diagnose and repair WINS (reskit)
- winsadd - command line tool for batching static and dynamic entries in WINS
- nltest - used for resetting secure channels, diagnosing and fixing trust problems (reskit)
57Tools for NT4 SP4 and NT5
- Applications, Service Problems and Memory Leaks
- depends - display and troubleshoot application dependency problems (IDW)
- tlist - list running processes, used in conjunction with kill (reskit)
- kill - forcibly terminate processes (reskit)
- memsnap - detection of memory and resource leaks over time (dump directory)
- pmon - detection of memory and resource leaks over time (reskit)
- pviewer - gather extended information about running processes (reskit)
- reg - registry utility, used for diagnosis and repair of many types of issues
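Detecting a leak "over time", as memsnap and pmon are described as doing, comes down to comparing successive snapshots and flagging usage that only grows. The following is a minimal Python sketch of that comparison; the process names, the numbers and the threshold are hypothetical, and the data layout is not the output format of either tool.

```python
def steadily_growing(samples, min_growth):
    """True if samples never decrease and grow by at least min_growth overall."""
    if len(samples) < 2:
        return False
    never_drops = all(later >= earlier
                      for earlier, later in zip(samples, samples[1:]))
    return never_drops and (samples[-1] - samples[0]) >= min_growth


# Hypothetical memory usage (KB) per process across four snapshots.
snapshots = {
    "services.exe": [1200, 1210, 1205, 1212],
    "leaky.exe": [800, 1500, 2300, 3100],
}

suspects = sorted(name for name, usage in snapshots.items()
                  if steadily_growing(usage, min_growth=1000))
print(suspects)
```

A process whose usage rises and falls is left alone; only steady growth past the threshold is reported as a suspect.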
58Tools for NT4 SP4 and NT5
- Disk Problems
- disksave - saves and restores the MBR (reskit)
- fixacls - resets ACLs on system files to installation defaults, fixes users who hose their ACLs
- ftedit - used daily to help customers repair fault tolerant volumes (reskit)
- Debugger Tools
- gflags - set global flags needed for various kinds of debugging (IDW)
- remote - allow remote debugging by PSS (reskit)
- debug wizard - easy setup of debuggers for customers
- all standard debuggers - already ships in /support dir
59Tools for NT4 SP4 and NT5
- Other
- uptomp - update system from uniproc to multiproc (reskit)
- robocopy - used daily by PSS during support calls, easiest way to move large amounts of data around very quickly.
- shutdown - remote shutdown of systems (reskit)
- ntevntlg.mdb ntmsgs.hlp - better error message docs (reskit)
- windiff - file compare utility critical for many situations (reskit)
- dumpel - dump event log messages from local or remote systems (reskit)
- list - used daily by PSS for reviewing exceedingly large log files, etc.
60Summary
- Best Practices matter
- Mature, disciplined planning procedures
- Design, Implement, Test
- Configuration Operational control
- Technology matters
- OS system services
- UPS, RAID, ECC Memory, multi-homing
- Cluster Services
- We can deliver availability with Windows NT today
- Microsoft is investing heavily in availability
61References and Resources
- http://www.microsoft.com/ntserver/
- http://www.microsoft.com/ntworkstation/
- http://www.microsoft.com/windowsnt5/
- http://www.microsoft.com/hwtest/
- http://support.microsoft.com/
- http://support.microsoft.com/support/kb/articles/q103/0/59.asp - Descriptions of Bug Codes for Windows NT
62References and Resources
- Inside Windows NT Second Edition, David A. Solomon
- MS Press 1998
- Windows NTW 4.0 Resource Kit
- Chapter 19 What Happens When You Start Your Computer
- Chapter 21 Troubleshooting Startup and Disk Problems
- Chapter 36 General Troubleshooting
- Chapter 39 Windows NT Debugger, or Q142657 article
- Supporting Windows NT Server in the Enterprise
- MS Press 1998
- Chapter 7 Troubleshooting Tools and Methods
63Questions?