Module 7: Server Cluster Maintenance and Troubleshooting - PowerPoint PPT Presentation

Title: Module 7: Troubleshooting Cluster Service Author: Priscilla Johnston Last modified by: xx Created Date: 8/9/2000 8:01:19 PM Document presentation format – PowerPoint PPT presentation

1
Module 7: Server Cluster Maintenance and Troubleshooting
2
Overview
  • Cluster Maintenance
  • Troubleshooting Cluster Service

3
  • Server cluster maintenance and troubleshooting
    are considered two separate disciplines.
    Maintenance is continuous, whereas
    troubleshooting has a beginning when the problem
    is discovered, and an end when the problem is
    resolved. The two disciplines are complementary,
    however. When every troubleshooting procedure
    that you follow fails, you will need to rebuild
    the cluster from a backup tape that was generated
    during a maintenance procedure.

4
  • After completing this module, you will be able
    to:
  • Perform the steps to successfully back up a
    server cluster.
  • Perform the steps to successfully restore a
    server cluster.
  • Evict a node from a server cluster.
  • Identify the tools that are necessary to
    troubleshoot a cluster failure.
  • Interpret the entries on the cluster log.
  • Identify and troubleshoot common server cluster
    failures: network communications problems, small
    computer system interface (SCSI) configuration
    problems, and group, resource, and quorum
    failures.

5
Cluster Maintenance
  • Backup
  • Restoring the First Node
  • Restoring Cluster Disks
  • Restoring the Second Node
  • Evicting a Node

6
  • Cluster service uses the self-tuning features of
    Microsoft Windows 2000 and requires very little
    maintenance. The only day-to-day maintenance
    operation that you need to perform is to back up
    the cluster.
  • Under special circumstances, a node in the
    cluster may need to be replaced, for example,
    when your organization decides to perform a
    hardware upgrade. In this situation, you need to
    evict a node from the cluster and add the
    upgraded node to the cluster.

7
Backup
  • Backing Up the System State
  • Backing Up the Local Disk
  • Backing Up the Cluster Disk

8
  • Backing up the cluster is no different from
    backing up Microsoft Windows 2000 Advanced
    Server. It is recommended that you perform
    regular backups by using the Windows 2000 Backup
    program (NTBackup), or other compatible backup
    programs. Additional backup agents are still
    necessary to back up applications running on the
    cluster, such as Microsoft SQL Server and
    Microsoft Exchange.
  • Note: A cluster-aware backup program will be able
    to perform the same backup operations as
    NTBackup, especially with regard to backing up
    the System State and the cluster configuration
    database.

9
Backing Up the System State
  • The configuration information for the cluster is
    located in the registry on each node
    (HKEY_LOCAL_MACHINE\Cluster). The Backup tool
    that is included with Windows 2000 backs up the
    cluster database when you back up each node's
    system state.
  • NTBackup backs up the system state on each node.
    The system state includes:
  • The quorum log.
  • The local registry.
  • The Cluster registry hive.

10
Backing Up the Local Disk
  • Follow standard computer backup procedures to
    back up the operating system and the data on the
    local drives. You must also back up key cluster
    files on the local disks.
  • On each node, back up the cluster database files:
    %systemroot%\cluster\CLUSDB and
    %systemroot%\cluster\CLUSDB.LOG.
  • On each node, back up the Cluster service folder,
    %systemroot%\cluster\.
  • Note: Backup is essential, but regular testing to
    make sure that backups and restores actually work
    as expected is also necessary. A good practice is
    to schedule test backup and restore operations
    frequently.
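The per-node file list above can be captured in a small helper. The following is a minimal sketch in Python, not part of the course material. The function name `cluster_backup_targets` is hypothetical, and the default `C:\WINNT` system root is only an assumption about where Windows 2000 is installed.

```python
from pathlib import PureWindowsPath


def cluster_backup_targets(system_root=r"C:\WINNT"):
    """Return the key cluster files and folder to back up on one node.

    system_root is an assumed install location; pass the real value
    of the systemroot variable for the node being backed up.
    """
    cluster_dir = PureWindowsPath(system_root) / "cluster"
    return {
        "database": cluster_dir / "CLUSDB",
        "database_log": cluster_dir / "CLUSDB.LOG",
        "service_folder": cluster_dir,
    }


# Print the list so it can be checked against the backup selection.
for label, path in cluster_backup_targets().items():
    print(f"{label}: {path}")
```

Running the helper once per node gives you a list to compare against the selections in your backup program.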

11
Backing Up the Cluster Disks
  • It is critical to back up cluster files on the
    quorum disk and data on the cluster disks,
    because Cluster service will write information to
    files in the \MSCS directory on the quorum disk
    and cluster-aware applications will likely be
    placing data on the cluster disk. Because either
    node of the cluster could own the cluster disk
    resource at any time, it is possible for each
    node to back up the data on the drive. However,
    having each node back up data would require you
    to install backup hardware and software on each
    cluster node, which is not the best solution.
  • One possibility is to identify a nonclustered
    server running Windows 2000 Server and schedule
    it to back up data remotely through a network
    connection to the cluster disk's administrative
    share or a hidden share that you create. For
    example, you might create FBackup$, GBackup$,
    HBackup$, and WBackup$ file share resources on
    the virtual server for the root of drives F, G,
    H, and W. F, G, and H would be cluster disks with
    data, and W would be the drive letter for the
    quorum disk. Hidden shares would not appear in a
    browse list and you could configure them to allow
    access only to members of the Backup Operators
    group.
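To illustrate the remote backup approach, the sketch below builds the network paths that a nonclustered backup server would connect to. It is a minimal Python example under stated assumptions: the virtual server name `VSERVER1` is hypothetical, and the share names assume the usual Windows convention that a hidden share ends in a dollar sign.

```python
def backup_share_paths(virtual_server, drive_letters):
    """Map each cluster drive letter to the UNC path of its hidden share.

    Assumes one hidden share per drive, named <letter>Backup$, published
    on the virtual server (the share naming is an assumption).
    """
    return {
        letter: "\\\\" + virtual_server + "\\" + letter + "Backup$"
        for letter in drive_letters
    }


# F, G, and H hold data; W is the quorum disk in the example above.
for letter, unc in backup_share_paths("VSERVER1", "FGHW").items():
    print(letter, "->", unc)
```

Because the shares are published on the virtual server, the paths stay the same regardless of which node currently owns the disks.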

12
  • The following sections describe the procedure for
    restoring a server cluster in the event that both
    nodes and the cluster disk fail. It is possible
    that any one of the components in the cluster
    could fail independently. In the case of a failed
    component, you follow the same procedure for
    restoring that specific component.

13
  • Performing a complete restore of a server cluster
    is a straightforward process.
  • Restore a node of the cluster.
  • Restore the cluster disks of the restored first
    node.
  • Restore the remaining node of the cluster.
  • Perform node testing.

14
Restoring the First Node
  • Steps For Restoring a Server Cluster
  • Restore the first node
  • Restore the cluster disks
  • Restore the second node
  • Perform node testing

15
Restoring a Node of the Cluster
  • To restore a node in a server cluster, you follow
    the same procedure that you would use in
    restoring a Windows 2000 operating system.
  • Install a fresh copy of Windows 2000 Advanced
    Server on the node to be restored.
  • Log on as Administrator and restore the system
    and boot partition, system state, and associated
    volumes from the backup. Make sure that you
    select the option to restore the system state to
    the original location in the backup program.
  • Restart the node.
  • Perform the steps for restoring the cluster disk.
    These steps follow in the next section.
  • Note: The difference between the time of the
    backup and the time of the restoration to the new
    computer may affect the computer account on the
    domain controller. You may have to join a
    workgroup and then rejoin the domain.

16
Restoring Cluster Disks
  • Restoring Disk Signature Files
  • Restoring the Data on the Cluster Disk
  • Restoring the Cluster Configuration Files

17
  • After you have restored a node in the cluster,
    you must restore the cluster disks. Restoring the
    cluster disks involves restoring the disk
    signature file that the cluster uses to identify
    the disk. You may also need to restore a cluster
    disk if you are running out of disk space or if
    a disk failure is impending. Mistakes made while
    replacing a cluster disk can be costly: the
    consequence can be the irrecoverable loss of all
    of the data on that disk. If the disk is the
    quorum disk, the server cluster's configuration
    data is at risk.
  • Before restoring the cluster disks, stop Cluster
    service on all of the nodes of the cluster.
    Stopping Cluster service will ensure that it will
    not attempt to start, which would place a lock on
    the disks.

18
Restoring Disk Signature Files
  • Because Cluster service relies on disk signatures
    to identify and mount volumes, if a disk is
    replaced, or if the bus is re-enumerated, Cluster
    service will not find the disk signatures that it
    is expecting and will not function.
  • You can run Dumpcfg.exe to extract the disk
    signature from the registry and write it to the
    new disk. Cluster service will recognize the new
    disk and successfully start the resource.
  • Note: Dumpcfg.exe is a Resource Kit utility that
    restores an old disk signature to a new disk.

19
  • If the disk that you are replacing is the quorum
    disk, use Cluster Administrator to move the
    quorum to a different disk, and proceed with the
    replacement of the disk. After the disk is
    brought back online, you can move the quorum back
    to the new disk.

20
Restoring the Data on the Cluster Disk
  • Restoring the data on the cluster disk is the
    same as a restore of a local disk. Before
    restoring the data, make sure that you have
    associated each cluster disk to the same drive
    letter as before the disaster or failure. When
    restoring, make sure that you restore the data to
    the original location and verify the integrity
    after you have completed the restore.

21
Restoring the Cluster Configuration Files
  • The cluster configuration files include the
    cluster database and the quorum log. The cluster
    database holds the configuration data (cluster
    objects and their settings) that is pertinent to
    the cluster. This database is the product of the
    cluster registry key checkpoint and the changes
    that are recorded in the quorum log. Every node
    of the cluster maintains a local copy of this
    database in the Cluster hive of the node's local
    registry.
  • After you have restored the disk signature file
    and data, you can start the server cluster. If
    the cluster files were not restored, or were
    corrupted, the following procedure can restore
    the cluster database from the registry of the
    restored node.

22
  • Identify the node on which you will restore the
    database (in the case of a disaster restore, this
    will be the first node that you have restored).
    Restore the cluster database on the selected node
    by restoring the system state. Restoring the
    system state creates a temporary folder under the
    Systemroot\Cluster folder called
    Cluster_backup.
  • You use NTBackup to restore the cluster
    configuration files, which places them on the
    node. You then restore the cluster database to
    the node's registry by using the Clusrest.exe
    tool. Clusrest.exe restores both the quorum log
    (Quorum.log) file and the cluster database
    (Clusdb).
  • Note: The Clusrest.exe tool is available in the
    Windows 2000 Resource Kit. This tool is a free
    download from www.microsoft.com.

23
Restoring the Second Node
  • Restoring the Remaining Node(s) of a Cluster
  • Perform Node Testing

24
  • After you complete the process of restoring a
    node of a cluster, and Cluster service has
    started successfully on the newly restored node,
    you can start the restore process on the other
    node of the cluster.

25
Restoring the Remaining Node(s) of the Cluster
  • The restoration of the second node of a cluster
    is the same procedure as restoring the first node
    of a cluster, except that you will not have to
    restore the cluster disks.

26
Performing Node Testing
  • Testing the failover and failback policy is
    recommended before putting the cluster back into
    production.
  • Verify that the disk and cluster resources are
    available on the correct node.
  • Fail over each group and resource to verify that
    they can successfully start on the other node of
    the cluster.
  • Test the failback policy of each resource by
    allowing the resource to fail back to a preferred
    owner after the node has come back online.

27
Evicting a Node
  • Steps for Evicting a Node
  • Back up both nodes
  • Verify backup
  • Move all groups to the remaining node
  • Stop Cluster service on the node to be removed
  • Evict the node
  • Unplug the server from the shared bus

28
  • If you need to change a node of a cluster, for
    example, to add a more powerful server, you need
    to logically remove the node before physically
    removing the node from the cluster. When you
    configure a new server with the shared bus, and
    the public and private networks, you can then run
    the Cluster Installation Wizard.
  • To remove a node from a cluster, right-click the
    node in Cluster Administrator to access the menu
    with the Stop Cluster and Evict Node options.

29
  • To evict a node:
  • Back up both nodes.
  • Verify backup.
  • Move all of the groups to the remaining node.
  • Stop Cluster service on the node that is to be
    removed.
  • Evict the node.
  • Unplug the server from the shared bus (if the
    shared bus is a SCSI bus, be careful about
    termination).
  • Note: If a new server is to join the cluster
    later, run the Cluster Installation Wizard and
    select Join a Cluster.

30
Troubleshooting Cluster Service
  • Troubleshooting Tools
  • Examining the Cluster Log
  • Troubleshooting Network Communications
  • SCSI Configuration Problems
  • Group and Resource Failures
  • Quorum Log Corruption

31
  • Troubleshooting a problem with Cluster service
    can be more complex than troubleshooting a single
    server because of the virtual servers and the
    need for intracluster communications. Virtual
    servers change ownership from one node to
    another, which may cause network connectivity
    problems. Applications running on the cluster are
    difficult to troubleshoot, because they are
    running on a virtual server instead of a physical
    server. You could also have a node-to-node
    communication problem because servers usually
    work independently of each other and not
    together. You might experience hardware problems
    with the shared bus and the cluster disk
    resources.

32
  • The most common failures are due to improper
    configurations within groups and resources.
    Cluster service will fail if the quorum log
    becomes corrupt. It is important to know how to
    repair the quorum log to restart the cluster.
  • You use the same tools to identify problems on
    the cluster as you would use to identify problems
    on a physical server. The best resource for
    troubleshooting is the cluster log because
    Cluster service records the activity of each node
    in the cluster log. This log can help you
    identify problems on the node or in the cluster.

33
Troubleshooting Tools
  • Disk Manager
  • Task Manager
  • Performance Monitor
  • Network Monitor
  • Dr. Watson
  • Services Snap-in

34
  • When troubleshooting Cluster service, you can use
    the same tools and methodologies that you would
    when troubleshooting Windows 2000 Advanced
    Server.

35
  • Cluster service writes logging information to the
    system log of every node in the cluster. Cluster
    service also writes a more detailed log of
    cluster activity to the cluster log on each node.
    Use these two sources to gather information when
    you begin troubleshooting a problem. You will be
    able to determine whether the problem is related
    to the network, to services or applications, or
    to physical components in the cluster.
  • Note: Use Event Viewer to filter the system log
    on the event source ClusSvc. You can view general
    events, such as "Microsoft Cluster service
    failed to join the cluster on this node" and
    "Microsoft Cluster service successfully created a
    cluster on this node."

36
  • After you have determined the type of problem,
    you can use the following tools to search for the
    source of the problem. You must check each node
    individually when using any of these tools.

37
  • Disk Manager. Use Disk Manager to check the
    health of the cluster disks. You can check
    whether the operating system recognizes the
    disks, and whether the cluster disks are basic
    rather than dynamic. You also need to verify that
    the drive letters of the cluster disks are the
    same on both nodes.

38
  • Task Manager. You can verify that Cluster service
    is running in Microsoft Windows 2000 Task
    Manager. You can also use Task Manager as a
    performance monitor, but you do not obtain the
    same level of detail as you would with
    Performance Monitor. In Task Manager, you will be
    able to
    verify the CPU utilization percentage and the
    memory resources on the node.

39
  • Performance Monitor. Microsoft Windows 2000
    Performance Monitor is the primary tool for
    finding bottlenecks on servers running Windows
    2000. It is recommended that you create a
    baseline before and after you add cluster
    resources to the cluster. You also need to create
    a baseline on each node during failover and
    failback of resources to check for potential
    physical resource deficiencies. It is recommended
    that you configure a computer to monitor the
    Cluster service property on every node of the
    cluster, and send an e-mail message to an
    administrator when a node or the cluster is
    offline.

40
  • Network Monitor. You use Microsoft Windows 2000
    Network Monitor to troubleshoot any node-to-node
    and client-to-node communication. You must
    configure Network Monitor to capture data on the
    private network to see node-to-node
    communication.

41
  • Dr. Watson. Dr. Watson is a user-mode debugging
    tool. If a clustered application or the Cluster
    Administrator crashes, the debugging information
    is found in the Dr. Watson log file.

42
  • Services Snap-in. Cluster service runs as a
    service in Windows 2000. If Cluster service is
    not running correctly, check the properties of
    the service through the services snap-in to
    ensure that the default properties have not
    changed. Verify that Cluster service:
  • Is set to start automatically.
  • Is set to log on as the designated domain service
    account.
  • Is set to restart after a failure.

43
  • Make sure that the following four services have
    started:
  • Network Connections (Network Connections has a
    Remote Procedure Call (RPC) dependency)
  • RPC
  • Windows Management Instrumentation Driver
    Extensions
  • Windows Time

44
Examining the Cluster Log
45
  • The cluster log is a diagnostic log that is a
    more complete record of cluster activity than the
    Microsoft Windows 2000 Event Log. The cluster log
    records the Cluster service activity (Clussvc.exe
    and associated processes) that leads up to the
    events that are recorded in the event log.
    Although the event log can point you to a
    problem, the cluster log helps you to determine
    the source of the problem. So, for diagnosis,
    check the event log for general information and
    the cluster log for specific details about the
    cluster status. If you see a problem in the event
    log, note the timestamp and go to approximately
    the same timestamp on the cluster log.
  • The cluster log is enabled by default when you
    install Cluster service, but will not start
    logging information until after the first restart
    of the node. Cluster log output is written to
    %SystemRoot%\Cluster\Cluster.log, and you can
    view it with Microsoft WordPad.
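Matching an event log entry to the cluster log requires a time conversion, because the cluster log records timestamps in Greenwich Mean Time (as this module notes when it describes the log entries), whereas Event Viewer displays local time. The sketch below is a minimal Python illustration; the function name, the 60-second slack, and the UTC offset in the example are all assumptions chosen for the demonstration.

```python
from datetime import datetime, timedelta, timezone


def cluster_log_window(event_local, utc_offset_hours, slack_seconds=60):
    """Convert a local event-log time to GMT and return a search window.

    Returns (start, end) as timezone-aware GMT datetimes that bracket
    the event by slack_seconds on each side.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    event_gmt = event_local.replace(tzinfo=local_tz).astimezone(timezone.utc)
    slack = timedelta(seconds=slack_seconds)
    return event_gmt - slack, event_gmt + slack


# Example: an event shown at 11:00:50 local time on a server at GMT-7.
start, end = cluster_log_window(datetime(1999, 6, 9, 11, 0, 50), -7)
fmt = "%Y/%m/%d-%H:%M:%S"
print("Search the cluster log from", start.strftime(fmt), "to", end.strftime(fmt))
```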

46
Setting the Logging Level
  • You can set four logging levels in the cluster
    log. The default level is 2, which logs enough
    information for normal troubleshooting. To set a
    different logging level, click Start, point to
    Settings, click Control Panel, and then
    double-click the System icon. On the Advanced
    tab, create a system environment variable called
    ClusterLogLevel with a value of 0, 1, 2, or 3,
    where 0 = no logging, 1 = errors only, 2 = errors
    and warnings, and 3 = everything that happens.
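The four level values can be summarized in a small lookup. The following Python sketch is illustrative only: it reads a ClusterLogLevel value from a dictionary of environment variables (or from the real environment if none is given) and falls back to the default of 2 described above. The function name `describe_log_level` is hypothetical.

```python
import os

# Meaning of each ClusterLogLevel value, as listed in this module.
LOG_LEVELS = {
    0: "no logging",
    1: "errors only",
    2: "errors and warnings",
    3: "everything that happens",
}


def describe_log_level(env=None):
    """Return (level, meaning) for the ClusterLogLevel variable.

    env is a mapping of environment variables; the process environment
    is used when it is omitted. A missing variable means level 2.
    """
    env = os.environ if env is None else env
    level = int(env.get("ClusterLogLevel", "2"))
    if level not in LOG_LEVELS:
        raise ValueError(f"ClusterLogLevel must be 0-3, got {level}")
    return level, LOG_LEVELS[level]


print(describe_log_level({"ClusterLogLevel": "3"}))
print(describe_log_level({}))
```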

47
Setting the Log File Size
  • The log file defaults to a maximum size of 8
    megabytes (MB). When the log file reaches 8 MB,
    new entries start overwriting the existing data
    in the log file. To specify a larger file size,
    add the registry entry ClusterLogSize under
    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\ClusSvc\Parameters.
    ClusterLogSize has a type of DWORD and specifies
    the maximum size in MB for the log file. If this
    value is set to 0, logging is disabled.
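The rules for ClusterLogSize can be expressed as a short function. This is a minimal Python sketch of the behavior described above, not a tool that reads the registry; the function name `effective_log_size` is hypothetical, and passing `None` stands for "the entry does not exist."

```python
DEFAULT_LOG_SIZE_MB = 8  # default maximum size stated in this module


def effective_log_size(cluster_log_size=None):
    """Interpret a ClusterLogSize value given in megabytes.

    None means the entry is absent, so the 8 MB default applies.
    0 means logging is disabled. Returns (state, size_in_bytes).
    """
    size_mb = DEFAULT_LOG_SIZE_MB if cluster_log_size is None else cluster_log_size
    if size_mb < 0:
        raise ValueError("ClusterLogSize cannot be negative")
    if size_mb == 0:
        return ("disabled", 0)
    return ("enabled", size_mb * 1024 * 1024)


print(effective_log_size())    # entry absent, default applies
print(effective_log_size(16))  # larger log requested
print(effective_log_size(0))   # logging disabled
```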

48
Cluster Log Entries
  • There are two types of cluster log entries:
    Component Event Log entries and Resource
    dynamic-link library (DLL) log entries. Cluster
    service is made up of a number of components,
    such as the database manager and the global
    update manager. The cluster log records the
    interactions of these components, making it a
    powerful diagnostic tool. Because resource groups
    are the basic unit of failover, resource DLL
    entries are essential to understanding cluster
    activity.
  • The first line in the body of a typical cluster
    log is:
  • 378.32c::1999/06/09-18:00:18.874 Cluster service
    started - Cluster Node Version 3.2051

49
  • The main elements of this line are common to
    every line of the log:
  • The IDs of the process and thread issuing the log
    entry. These two IDs are concatenated, separated
    by a period. In the previous example, the Process
    ID is 378, and the Thread ID is 32c.
  • Timestamp. The timestamp is recorded in the
    following format, in Greenwich Mean Time (GMT):
  • yyyy/mm/dd-hh:mm:ss.sss
  • Event description. One example of an event
    description would be "Cluster service started."
50
Component Event Log Entries
  • In the following example, NM indicates the
    component that wrote the event to the cluster
    log; in this case, NM stands for node manager.
  • 378.380::1999/06/09-18:00:50.881 [NM] Forming
    cluster membership.

51
Resource DLL Log Entries
  • The following example is a cluster log entry for
    a resource DLL event. This example is one of the
    entries from the disk arbitration process.
  • 15c.458::1999/06/09-18:00:47.897 Physical Disk
    <Disk D:>: [DISKARB] Arbitration Parameters (1
    9999).

52
  • Instead of listing an abbreviated component name
    between the timestamp and event description as
    component log entries do, entries describing
    resource DLL events list the following
    information:
  • Resource type (Physical Disk)
  • Resource name (<Disk D:>)
  • The event description in this example is
    "[DISKARB] Arbitration Parameters (1 9999)."
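The difference between the two entry types can be sketched as a classifier that works on the text following the timestamp. This Python example is illustrative and rests on assumptions about the log layout: component entries are taken to begin with a bracketed abbreviation such as [NM], and resource DLL entries are taken to begin with the resource type followed by the resource name in angle brackets and a colon. The function name is hypothetical.

```python
import re

# "[NM] Forming cluster membership."  (assumed component layout)
COMPONENT_RE = re.compile(r"^\[(?P<component>[A-Za-z]+)\]\s*(?P<event>.*)$")
# "Physical Disk <Disk D:>: ..."  (assumed resource DLL layout)
RESOURCE_RE = re.compile(
    r"^(?P<type>[^<]+?)\s*<(?P<name>[^>]+)>:\s*(?P<event>.*)$"
)


def classify_description(description):
    """Classify the text after the timestamp as component or resource."""
    match = COMPONENT_RE.match(description)
    if match:
        return {
            "kind": "component",
            "component": match["component"],
            "event": match["event"],
        }
    match = RESOURCE_RE.match(description)
    if match:
        return {
            "kind": "resource",
            "resource_type": match["type"],
            "resource_name": match["name"],
            "event": match["event"],
        }
    return {"kind": "other", "event": description}


print(classify_description("[NM] Forming cluster membership."))
print(classify_description(
    "Physical Disk <Disk D:>: [DISKARB] Arbitration Parameters (1 9999)."
))
```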

53
Troubleshooting Network Communications
  • Troubleshooting Node-to-Node Communication
  • Verify RPC Communications
  • Verify Cluster Heartbeats
  • Troubleshooting Client-to-Node Communications
  • Check NetBT Cache with Nbtstat
  • Ping IP Address
  • WINS Static Mappings

54
  • There are two types of cluster network
    communications that can fail: the client may be
    unable to access the cluster or the nodes may be
    unable to communicate with each other. When
    client communications are interrupted, there is a
    problem with the public network. When the nodes
    are unable to communicate, there is a problem
    with either the public or the private network.
    Troubleshooting these two types of
    network-related problems requires different
    approaches.

55
Troubleshooting Node-to-Node Communications
  • You can use Windows 2000 Network Monitor before
    installing Cluster service to capture the trace
    of the ping between the nodes on the public and
    private network. After Cluster service is
    installed, you use Network Monitor to verify
    remote procedure call (RPC) communication and
    cluster heartbeats.
  • Note: You can also use RPC Ping, which is an RPC
    connectivity verification tool that is a free
    download from www.microsoft.com. This tool
    verifies that Windows 2000 Server services are
    responding to the call requests of remote
    procedures between nodes.

56
Verifying RPC Communication
  • To verify that RPC communication is occurring
    between the nodes of a cluster, use a network
    capture utility, such as Microsoft Network
    Monitor. Windows 2000 Server includes a simple
    version of Network Monitor that you can install
    by using the Network program in Control Panel.
  • To verify RPC communication, configure the
    Capture utility to capture all of the traffic
    between the nodes of a cluster. After you have
    started a capture, using Cluster Administrator to
    create a group or resource will result in RPC
    traffic between the nodes.

57
Verifying Cluster Heartbeats
  • As with RPC communication, to verify that cluster
    heartbeats are occurring between the nodes of a
    cluster, you must use a network capture utility.
  • Cluster service uses User Datagram Protocol (UDP)
    port 3343 to send heartbeats on the network. Use
    Network Monitor to capture port 3343 to verify
    both nodes of the cluster are sending and
    receiving cluster heartbeats.
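Once you have a capture, the check for two-way heartbeats can be sketched as follows. This Python example does not capture traffic itself. It assumes that the captured packets have already been exported into simple records with `src`, `dst`, `protocol`, and `dst_port` fields, which is a hypothetical structure chosen for illustration, as are the node addresses.

```python
HEARTBEAT_PORT = 3343  # UDP port that Cluster service uses for heartbeats


def heartbeat_flows(packets, port=HEARTBEAT_PORT):
    """Return the set of (source, destination) pairs seen on the UDP port."""
    return {
        (p["src"], p["dst"])
        for p in packets
        if p["protocol"] == "UDP" and p["dst_port"] == port
    }


def heartbeats_in_both_directions(packets, node_a, node_b):
    """True when each node sent at least one heartbeat to the other."""
    flows = heartbeat_flows(packets)
    return (node_a, node_b) in flows and (node_b, node_a) in flows


# Hypothetical capture records from the private network.
capture = [
    {"src": "10.0.0.1", "dst": "10.0.0.2", "protocol": "UDP", "dst_port": 3343},
    {"src": "10.0.0.2", "dst": "10.0.0.1", "protocol": "UDP", "dst_port": 3343},
    {"src": "10.0.0.1", "dst": "10.0.0.2", "protocol": "TCP", "dst_port": 135},
]
print(heartbeats_in_both_directions(capture, "10.0.0.1", "10.0.0.2"))
```

If the function returns False, look at which direction is missing to decide which node's network adapter or cabling to examine first.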

58
Troubleshooting Client-to-Node Communications
  • After a failover occurs, clients must still be
    able to gain access to a cluster, even though
    they will be accessing a different node. The
    client must be able to resolve any cluster
    network names so that they will always connect to
    the node on which the resources are online. If
    clients cannot connect to virtual servers, verify
    that
  • The client is accessing the cluster by using the
    correct network name or IP address.
  • The client has the Transmission Control
    Protocol/Internet Protocol (TCP/IP) protocol
    correctly installed and configured.

59
Check NetBT Cache with Nbtstat
  • Depending on the resource that is being accessed,
    the client can address the cluster by specifying
    either the resource network name or the IP
    address. In the case of the network name, you can
    verify proper name resolution by checking the
    NetBT cache (using the Nbtstat.exe utility) to
    determine whether the name had been previously
    resolved. Also, confirm proper Windows Internet
    Name Service (WINS) configuration, at the client
    and at the cluster nodes.

60
Ping IP Address Using Ping Utility
  • If the client is accessing the resource through a
    specific IP address, ping the IP address of the
    cluster resource and cluster nodes from a command
    prompt.

61
WINS Static Mappings
  • You should not create static network name to IP
    address mappings for any cluster names in a WINS
    database. WINS is the only name resolution method
    that will cause problems when using static
    mappings, because WINS static mappings use the
    media access control (MAC) address of the network
    card as part of the static mapping.

62
  • If clients are having a problem connecting to a
    virtual server, an administrator might have
    created a WINS static mapping for a virtual
    server. The node for which the mapping is created
    will be able to bring the network name resource
    online and clients will be able to connect.
    However, if failover occurs, the second node in
    the cluster will be able to bring the IP address
    online but not the network name. When the second
    node attempts to bring the network name online,
    WINS will return an error preventing it from
    registering the network name. WINS prevents the
    network name from going online because the second
    node does not have the same physical address as
    the one recorded in the static mapping for the
    network name.
  • Note: For more WINS troubleshooting information,
    see "Recommended WINS Configuration for Microsoft
    Cluster Server," Q193890, on the Student compact
    disc.

63
SCSI Configuration Problems
  • SCSI Controllers
  • SCSI Termination
  • SCSI Cabling

64
  • If you suffer from hardware failures, you may
    have to replace hardware components of the
    cluster. If you replace components in the SCSI
    subsystems, you need to make sure that the new
    SCSI configurations conform to the following
    guidelines.

65
  • SCSI Controllers

  • SCSI IDs. Each device on the shared SCSI bus must
    have a unique SCSI ID. Most SCSI controllers
    default to SCSI ID 7. Therefore, you must change
    the SCSI ID for one of the controllers on the
    shared SCSI bus to something other than ID 7.
  • Boot Time SCSI Bus Reset. Cluster service uses
    SCSI bus resets, but in a controlled way during a
    membership regroup operation. Some SCSI
    controllers reset the SCSI bus when they
    initialize at start time, before Windows 2000 is
    loaded. If the SCSI controllers reset the SCSI
    bus, the bus reset can interrupt any data
    transfers between the other node and drives on
    the shared SCSI bus. Therefore, you should
    disable automatic SCSI bus resets, if possible,
    by using the adapter configuration program
    accessible at computer start time.
  • Non-Compliant Controllers. It is important to
    verify that the SCSI controllers that are being
    used are on the Cluster service Hardware
    Compatibility List (HCL). For a SCSI controller
    to work with Cluster service, it must support the
    SCSI reserve and release commands and bus resets.
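The unique-ID rule is easy to check once you have written down the ID of every device on the shared bus. The following Python sketch shows one way to find conflicts; the device names and ID assignments are hypothetical examples, and the function name is not part of any Microsoft tool.

```python
from collections import Counter


def scsi_id_conflicts(devices):
    """Return {scsi_id: [device names]} for every ID used more than once.

    devices maps a device name to the SCSI ID it is configured with.
    """
    counts = Counter(devices.values())
    return {
        scsi_id: sorted(name for name, dev_id in devices.items() if dev_id == scsi_id)
        for scsi_id, count in counts.items()
        if count > 1
    }


# Both controllers left at the common default of 7: a conflict.
shared_bus = {"node1-controller": 7, "node2-controller": 7, "disk-1": 0, "disk-2": 1}
print(scsi_id_conflicts(shared_bus))

# One controller changed to ID 6: no conflicts remain.
shared_bus["node2-controller"] = 6
print(scsi_id_conflicts(shared_bus))
```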
66
  • SCSI Termination

  • Active or Forced-Perfect Termination. There are
    three types of termination that are used for
    terminating the SCSI bus: passive termination,
    active termination, and forced perfect
    termination. Because both active and forced
    perfect termination use electronics to provide
    termination, these types provide the best
    termination. You should not use passive
    termination in a cluster, because it can result
    in problems, such as unnecessary failover or
    inability to access the quorum disk.
  • On-Card Termination. Many SCSI controllers
    provide on-card termination; however, the on-card
    termination does not provide termination when the
    computer is not turned on. On-card termination
    only becomes an issue when external terminators
    are not used. When using external terminators,
    the on-card termination should be disabled.
67
  • SCSI Cabling

  • Tri-Link or Y-cable SCSI Connectors. Attaching
    Y-cables or tri-link connectors to the back of
    the SCSI controllers at each end of the bus is
    one method that you can use to allow the SCSI bus
    to remain terminated even when one node is turned
    off. These components allow you to use external
    terminators that will continue to provide
    termination if a node is turned off. You must
    ensure that the SCSI cards in the nodes are not
    providing termination when using these
    connectors.
  • Long Cables. It is very common to have multiple
    external SCSI drives on the shared SCSI bus. When
    configuring multiple external drives, it is very
    important not to exceed the maximum combined
    cable length that the controller manufacturer
    recommends. The SCSI specifications specify the
    maximum combined cable length when using
    different types of cabling. If the manufacturer
    of the controller recommends a shorter distance,
    be sure to follow the recommendation of the
    manufacturer.
68
Group and Resource Failures
69
  • If groups or resources are not available to
    clients, you need to verify whether it is a
    restart, failover, or failback problem. In
    Cluster Administrator, you will see a visual
    notification that a group or a resource in a
    group is offline. Because there are a variety of
    reasons for a failure, you will have to
    troubleshoot the cause to find out whether it is
    a resource or group failure.

70
Problem: A resource fails, but is not brought back online.
Possible resolution:
  • In the Policies dialog box for the resource
    properties, verify that Don't restart is cleared
    (not selected).
  • Verify that the resource dependencies are
    correctly configured.
  • Verify that any dependent resources are online.

Problem: The default quorum resource will not come online.
Possible resolution:
  • Verify that there are no hardware errors by using
    Event Viewer and looking for disk input/output
    (I/O) error messages.

Problem: Cannot bring a group online.
Possible resolution:
  • Verify that there are no hardware or
    configuration problems with any disk resources
    for the group.
  • Verify that the resource dependencies are
    correctly configured.
  • Move the group to the other node and attempt to
    bring the group online. If this works, verify
    that the first node can gain access to everything
    that is necessary to bring the group's resources
    online (for instance, the disk resource).
71
  • (continued)

Problem: A Group Cannot Be Moved or Failed Over to the Other Node
Possible Resolution:
  • Verify that the resource is properly installed on the node.
  • Verify that the other node is set as a possible owner for all resources in the group in the Properties dialog box for the resource.

Problem: A Group Failed Over But Did Not Fail Back
Possible Resolution:
  • Verify that the failback policies for the group are properly configured. In the Properties dialog box for the group, verify that Prevent failback is cleared. If Failback immediately is selected, be sure to wait long enough for the group to fail back.
  • Check these settings for all of the resources within a group. Because groups fail over as a whole, one resource that is prevented from failing back will affect the entire group.
  • Ensure that the node to which you want the groups to fail back is configured as the preferred owner of the group. If not, Cluster service will leave the groups on the node to which they failed over.
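The failback checks above can be read as a small decision rule. The sketch below is a simplified model for reasoning about the settings, not the algorithm that Cluster service actually uses; the function name, parameters, and node names are hypothetical.

```python
def should_fail_back(prevent_failback, preferred_owners, current_node, recovered_node):
    """Simplified model of the failback checks described in the table.

    prevent_failback -- True when Prevent failback is selected for the group
    preferred_owners -- ordered list of preferred owner node names
    current_node     -- node that currently hosts the group
    recovered_node   -- node that has just come back online
    """
    if prevent_failback:
        # Failback is disabled, so the group stays where it is.
        return False
    if recovered_node not in preferred_owners:
        # Without a preferred owner, the group is left on the failover node.
        return False
    if current_node not in preferred_owners:
        return True
    # Fail back only if the recovered node is preferred over the current one.
    return preferred_owners.index(recovered_node) < preferred_owners.index(current_node)


print(should_fail_back(False, ["NodeA"], "NodeB", "NodeA"))
print(should_fail_back(False, [], "NodeB", "NodeA"))
```

The second call models the last row of the table: with no preferred owner configured, the group is left on the node to which it failed over.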
72
  • (continued)

Problem: The Entire Group Failed and Has Not Restarted
Possible Resolution:
  • If the node on which the group had been running is offline, verify that the other node is a possible owner of the group and of all of the resources in the group.
  • Ensure that the group has not exceeded its failover threshold or its failover period.
  • Bring the resources online one at a time to determine which resource is causing the problem.
  • Create a temporary group (for testing purposes), and then move the resources to it one at a time, bringing each resource online after moving the resource.
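To illustrate the failover threshold and failover period check, the sketch below counts how many failovers fall inside the period and compares that count with the threshold. It is a simplified model under assumed semantics; the function name, the timestamps, and the threshold and period values are hypothetical, and the real values are read from the group's properties.

```python
from datetime import datetime, timedelta


def exceeded_failover_threshold(failover_times, threshold, period_hours, now):
    """Return True when more than `threshold` failovers occurred in the period.

    failover_times -- datetimes at which the group failed over (hypothetical data)
    threshold      -- failover threshold configured for the group
    period_hours   -- failover period configured for the group, in hours
    now            -- time at which the check is made
    """
    window_start = now - timedelta(hours=period_hours)
    recent = [t for t in failover_times if t >= window_start]
    return len(recent) > threshold


# Hypothetical history: three failovers in the last 6 hours, one much older.
now = datetime(2024, 1, 1, 12, 0)
history = [now - timedelta(hours=h) for h in (1, 2, 3, 30)]
print(exceeded_failover_threshold(history, threshold=2, period_hours=6, now=now))
```

If a check like this indicates that the threshold has been exceeded, the group is more likely to be offline because of its policy settings than because of a new fault.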
73
Quorum Log Corruption
  • Reset the Quorum Log
  • clussvc -debug -resetquorumlog
  • Delete the Quorum Log
  • -noquorumlogging

74
  • Microsoft Cluster service maintains details about
    changes within the cluster through a quorum log
    file. If this file becomes corrupted for any
    reason, it is possible that Cluster service will
    not start. The following error message may occur
    when you attempt to start Cluster service on a
    node of the server cluster: Event ID: 1147,
    Source: ClusSvc
  • If the cluster will not start because of a
    corrupted quorum log, you can reset the quorum
    log. If Cluster service still will not start
    after attempting a reset, you can access the
    quorum disk and remove the corrupted quorum log.

75
Reset the Quorum Log
  • If you do not have a backup of the quorum log
    file, perform the following steps:
  • Open a command prompt.
  • Change to the %SystemRoot%\Cluster folder.
  • Start Cluster service by typing clussvc -debug
    -resetquorumlog, which attempts to create a new
    quorum log file that is based on the cluster
    configuration information in the local system's
    cluster registry hive.
  • Stop Cluster service by pressing CTRL+C.
  • Restart Cluster service by typing net start
    clussvc
  • Close the command prompt.

76
Delete the Quorum Log
  • If the log file becomes corrupted and cannot be
    reset, Cluster service may not start. To correct
    this problem, you must use the -noquorumlogging
    option when starting Cluster service. This option
    allows the cluster to start without quorum
    logging. You may then access the quorum disk and
    remove the corrupted Quolog.log file.

77
  • Use the following procedure to help recover from
    this situation:
  • If Cluster service is running, use Control Panel
    on both nodes to stop Cluster service.
  • On one node, use the Services tool in Control
    Panel to specify the startup parameter for
    Cluster service as -noquorumlogging and start the
    service.
  • On the quorum disk, run Chkdsk. If the disk does
    not show corruption, the log file may be
    corrupted. In this case, delete the Quolog.log
    file and any .tmp files that are located in the
    MSCS folder on the quorum disk.
  • In Services, stop Cluster service, and then start
    Cluster service without startup parameters. After
    the service starts, you may start it on the other
    node.
  • Note: When you disable quorum logging within a
    cluster, changes to the cluster configuration
    cannot be logged. If a node goes offline during
    this period, recent changes may be lost if
    changes could not be communicated to the other
    node. Quorum logging should only be disabled when
    necessary to recover from log file corruption.
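The file cleanup step in the procedure above (deleting Quolog.log and any .tmp files from the MSCS folder) can be previewed before anything is removed. The sketch below only lists the candidate files and deletes nothing; the function name and the example path are hypothetical, and it assumes that Chkdsk has already been run as the procedure describes.

```python
from pathlib import Path


def quorum_cleanup_candidates(mscs_folder):
    """List Quolog.log and *.tmp files in the given MSCS folder.

    This is a dry run: it returns the paths and does not delete them.
    """
    folder = Path(mscs_folder)
    candidates = []
    for entry in folder.iterdir():
        if not entry.is_file():
            continue  # ignore subfolders
        name = entry.name.lower()
        if name == "quolog.log" or name.endswith(".tmp"):
            candidates.append(entry)
    return candidates


if __name__ == "__main__":
    # Hypothetical quorum disk path; substitute the MSCS folder on your quorum disk.
    example = Path("Q:/MSCS")
    if example.is_dir():
        for path in quorum_cleanup_candidates(example):
            print(path)
```

Listing the files first makes it easier to confirm that only the log and temporary files are removed and that nothing else in the MSCS folder is touched.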

78
Lab A Cluster Maintenance
79
Objectives
  • After completing this lab, you will be able to:
  • Back up cluster configuration files.
  • Restore cluster configuration files.
  • Evict a node from the cluster.
  • Uninstall Cluster service.

80
Scenario
  • In this exercise, you will back up a node's
    system state, which includes the cluster
    configuration files. After the backup is
    complete, you will restore the system state and
    verify that the cluster configuration files were
    restored to the node. At this point, to restore
    the cluster, you would run the Clusrest.exe
    utility, but for the purposes of this lab, you
    will not restore the cluster. You will evict a
    node from a cluster and uninstall Cluster
    service on both nodes.
  • The following exercises will refer to your
    computers as Node A and Node B. For this lab, you
    will perform all of the tasks on both Node A and
    Node B, with the exception of evicting a node,
    which you will perform only on Node B.

81
Exercise 1 Backup and Restore
  • In this exercise, you will learn how NTBackup is
    used to back up and restore the cluster.

82
To Back Up the Cluster
  • Complete this lab from Node A and Node B.
  • Click Start, point to Programs, point to
    Accessories, point to System Tools, and then
    click Backup.
  • In the Backup dialog box, click Backup Wizard.
  • In the Backup Wizard dialog box, click Next.
  • Select Only backup the System State data, and
    then click Next.
  • In the Backup media or file name dialog box,
    type c:\Backup.bkf and then click Next.
  • Click Finish to start the backup.
  • NTBackup will start backing up the system state,
    which will take a couple of minutes.
  • When the backup is complete, click Close.

83
To Restore the Cluster
  1. In the Backup dialog box, click Restore Wizard.
  2. Click Next.
  3. Click Import File to locate the backup file of
    the system state.
  4. In the Catalog backup file dialog box, type
    c:\Backup.bkf and then click OK.
  5. In the What to restore box, expand File, expand
    Media created.
  6. Select the System State box, click Next, and then
    click Finish.
  7. In the Enter Backup File Name dialog box, click
    OK.

84
  • The Restore process will take a couple of
    minutes.
  • When Restore is complete, click Close.
  • When prompted to restart the computer, click No.
  • Close NTBackup.
  • Note: NTBackup does not restore the cluster files
    to the cluster disk. NTBackup places the cluster
    files on the local node.

85
To examine the cluster files that are restored by
NTBackup
  • Click Start, and then click Run.
  • In the Run dialog box, type %systemroot%\cluster
    and then click OK.
  • Double-click the cluster_backup folder to view
    the files that are restored by NTBackup.
  • What utility would you use to restore these files
    to the shared drive?___________________

86
To create a group after backup
  • To test the restore process, you will create a
    group after the backup. The restore procedure
    will roll back the cluster to the state when the
    backup was performed.
  • Perform this task from Node A.
  • In Cluster Administrator, click File, select New,
    select Group.
  • In the New Group dialog box, fill out the
    following properties: Name: Test Group;
    Description: Test Group
  • Click Next.
  • In the Preferred Owners dialog box, click Finish.
  • Click OK to acknowledge that the group was
    successfully created.

87
To install Clusrest.exe
  • In this task, you will install the Clusrest
    utility on Node B and restore the cluster to the
    state of the last backup. Close Cluster
    Administrator on Node A and Node B if it is
    running.
  • Perform this task from Node B.
  • On the Start menu, click Run.
  • In the Run dialog box, type
    c:\moc\2087a\labfiles\mscs and then click OK.
  • In the Microsoft Web Installation Wizard Tool
    CLUSREST.EXE dialog box, click Next.
  • Click I Agree, and then click Next.
  • Click Install Now.
  • Click Finish.

88
  1. On the Start menu, click Run.
  2. In the Run dialog box, type cmd and then click
    OK.
  3. In the command prompt, type cd\program
    files\resource kit and then press ENTER.
  4. In the command prompt, type clusrest and press
    ENTER.
  5. In the command prompt, type y to continue.
  6. Wait for clusrest to finish before proceeding.
  7. Open Cluster Administrator.
  8. Expand Groups and notice that the Test Group that
    was created in the previous task is now missing
    and that Node A is offline.

89
Exercise 2 Removing Cluster Service
  • In this exercise, you will remove Cluster service
    from both computers in the cluster.

90
To evict a node
  • Complete this task from Node A only.
  • Log on as Administrator with a password of
    password.
  • Open Cluster Administrator from the
    Administrative Tools menu.
  • If prompted, click Yes to restart Cluster service
    on Node A.
  • Right-click Node B.
  • Click Stop Cluster Service.
  • Click Yes.
  • Right-click Node B.
  • Click Evict Node.
  • Click Yes.

91
To remove Cluster service from Node A and Node B
  • Complete this task from Node A and Node B.
  • Log on as Administrator with a password of
    password.
  • On the Start menu, select Settings, and then
    click Control Panel.
  • Open Add/Remove Programs from Control Panel.
  • Click Add/Remove Windows Components.
  • Clear the Cluster Service check box, and then
    click Next.
  • Click Finish.
  • Click Yes to restart the computer.

92
Review
  • Cluster Maintenance
  • Troubleshooting Cluster Service