Title: Framework Functionality, Tuning and Troubleshooting
1Framework Functionality, Tuning and
Troubleshooting
2Purpose
- Tuning is a delicate and complex process. The variables are vast, and tuning must begin at the operating system and network levels. Before applications can be streamlined, it is crucial to ensure that the operating system can service requests and that the network is stable. An understanding of the tuning components is imperative, and only careful analysis and implementation will ensure that the environment is streamlined and scalable. All of these variables are related to each other, and balancing them is key. Every environment is unique with regard to its applications and the functions they serve. This document explains the components that must be tuned. As in most scenarios, a baseline configuration is implemented before reaching a final working configuration.
3Agenda
- Operating System Tuning
- File Descriptors
- Process Per User
- Network Tuning
- Memory Buffer
- Socket Buffer
- TCP
- UDP
- Tivoli Tuning
- Oserv
- Gateway
- Dump/Cores
- Endpoint Manager
- MDist/MDist2
- Endpoints
- Endpoint Communication Reliability
- 3.7.1 4.1
- 4.1.1 4.3.1
4Functionality, Tuning and Troubleshooting
- Gateway Daemon
- Jobq threads
- TMF_Dispatch Thread
- Threads Categorized
- Gateway Functionality - logstatus
- epact.bdb Endpoint Activity Database
- Gateway Commands
- Gateway Tuning Considerations
- Gateway tuning script
- Examples of errors
5Operating System Tuning
- File Descriptors
- Handle created by a process when a file is opened.
- Retired when the file is closed or the process terminates.
- Commands to view descriptors
- Commands to change descriptors
- Make the change permanent.
- Sockets
- Formulas
- Processes Per User
- Predefined Limit
- Maximum number of concurrent processes per user ID.
- Formulas
6Operating System Tuning contd
Network Tuning
- Network Options
- Memory Buffer
- Formula
- Socket Buffer Maximum
- Formula
- TCP
- sendspace
- recvspace
- Formula
- UDP
- sendspace
- recvspace
- Formula
7Tivoli Tuning
- ITM/DM
- request_manager.threads
- taskengine.max_threads
- Endpoints
- ifs_ignore
- Login thread usage
- Endpoint command usage
- continue_eplogin_onerror
- Oserv
- rpc_max_threads
- iom_by_name
- set_force_bind
- Gateway
- rpc_maxthreads
- max_concurrent_jobs
- max_concurrent_logins
- eplogin_timeout
- continue_eplogin_onerror
- tcp_backlog
- Endpoint Manager
- epmgr_rpc_max_threads
- login_limit
- timeout_interval
- MDist/MDist2
- max_sessions
- mem_max
- net_load
- max_conn
- packet size
- rpt_dir
- disk_dir
8Endpoint Communication Reliability
GATEWAY: Receives endpoint login request and puts the login request in the JOBQ. JOBQ handler starts login thread.
EPMGR: Processes initial, migratory or isolated logins. Sends login_info to the gateway.
LCFD: Sends login request to the gateway. Receives login results from the gateway.
Is the endpoint login a normal login?
Yes: Process normal login. Sends result back to endpoint.
No: For initial, migratory or isolated logins, endpoint login threads contact the endpoint manager. Sends new login_info to the endpoint.
Login Process for TMF 3.7.x or 4.1
9Endpoint Communication Reliability continued
GATEWAY: Receives endpoint login request and puts the login request in the endpoint login JOBQ. JOBQ handler starts login thread.
LCFD: Sends login request to the gateway. Receives login timeout. Sets login interval. Receives login results from the gateway, or a -1 timeout if the endpoint manager is busy, and resets the timeout in the login cycle.
EPMGR: Processes initial, migratory or isolated logins. Sends login_info to the gateway. If the endpoint manager is busy, throws an exception.
Is the endpoint login a normal login?
Yes: Process normal login. Sends result back to endpoint.
No: Initial, migratory or isolated logins are forwarded to the endpoint manager. Sends new login_info, or a -1 timeout if the endpoint manager is busy, to the endpoint.
Login Process for TMF 4.1.1
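The TMF 4.1.1 decision above can be sketched as a small function. This is only an illustration of the branching described on the slide, not gateway code: the function name, its arguments and the returned strings are hypothetical, and only the -1 timeout value comes from the slide.

```python
LOGIN_TIMEOUT = -1  # value returned to the endpoint when the endpoint manager is busy


def handle_login(is_normal_login, epmgr_busy):
    """Return what the gateway sends back to the endpoint (illustrative only)."""
    if is_normal_login:
        # Normal logins are processed by the gateway itself.
        return "login result"
    # Initial, migratory or isolated logins are forwarded to the endpoint manager.
    if epmgr_busy:
        # In 4.1.1 a busy endpoint manager raises an exception and the
        # endpoint receives a -1 timeout so it can reset its login cycle.
        return LOGIN_TIMEOUT
    return "login_info"


if __name__ == "__main__":
    print(handle_login(True, False))
    print(handle_login(False, True))
    print(handle_login(False, False))
```

In the 3.7.x / 4.1 flow there is no busy branch, so the endpoint has no explicit signal to back off.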
10Gateway Log Messages
- Gateway Boot Process
- Read
- Initialize
- Opens
- Contacts Endpoint Manager
- Threads started
- Reconnect Thread
- Listens for incoming requests from the endpoints
- Login Thread
- Gateway process listens to the broadcast login requests from endpoints
- Reader Thread
- Processes all gateway requests assigned to the reader queue
- Logstatus Thread
- At every logstatus_interval logs status of JOBQs and endpoints
11Endpoint Login Thread
- Endpoint Data Functions
- Data Stored
- Downcall Process
- Checks credentials of the user
- Resolves method dependencies
- Opens Connection
- Endpoint Performs Downcall
- Returns Results
- Upcall Process
- Reconnect Thread gets Upcall
- Reader Thread creates Job
- JOBQ Monitoring Thread Starts
- Updates
- Upcall Method resolves
- Gateway Returns Results.
- Endpoint Login Thread
- Login request
- Runs login_filter
- Compares endpoint data
- Login_info
- Upgrade
- Login Policy Boot Method
- Multi-Cast
- Notifies Repeater
- Login_status
- Endpoint Logout Process
- Logout Packet
12Gateways Purposes and Impact
- Gateway Process
- Designed to handle a large percentage of management functions.
- Gateway Stability
- Number of management units affected
- Ability to manage endpoints
- Crucial to 24x7 operations
- Products Impacted by Gateway Stability
- Tivoli Software Distribution
- Tivoli Inventory
- System Availability
13Gateway Daemon
- Gateway Thread Usage
- Fixed Remote Procedure Call (RPC)
- Gateway method spawned from an oserv thread
- gateway_method_in()
- Two types of gateway threads execute gateway methods
- Jobq threads
- Tmf_dispatch thread
14Gateway Daemon (contd)
- Jobq Threads
- new_job()
- Processed by pthreads
- max_concurrent_jobs
- Pthreads best practices
- Pthreads = max_concurrent_jobs = number of hosting gateways, > 200 and < 1000 (or to the nearest 100)
- Monitoring memory usage of gateway process.
- Monitoring the endpoint manager and management regions methods
- TMF_Dispatch Threads
- tmf_dispatch()
15Gateway Thread Execution
16Gateway Function - logstatus
- wgateway <gw_label> logstatus
- Jobq_threads in queue
- Jobq_threads running
- Reader_threads in queue
- GW methods running
- Examples of logstatus responses
- bash wgateway gbl_gw logstatus
- Status Data ---------
- Jobq_threads in queue 0
- Jobq_threads running 0
- Reader_threads in queue 0
- GW methods running 1
17Gateway Function logstatus (contd)
- Logstatus_interval
- Sets interval for gatelog update
- Set smaller than the default for more useful data
- Entries are written to the gatelog regardless of debug level
- Example of logstatus entry in gatelog
- 2002/09/18 23:03:48 05 010275A0 STATUS DATA jobqq 0 jobqr 0 reqdq 0 gwmethods 0
- Setting the logstatus_interval
- wgateway <gw_label> logstatus_interval 600
18Logstatus of a Gateway
The reconnect_thread listens to TCP communication
from an endpoint
Gateway Method Calls
Gateway_method_in()
The read queue processes only endpoint logins,
endpoint logouts, upcall, upcall proxy and
endpoint control packets
tmf_dispatch()
Job Queue
Running Jobs
Exit
19Gateway Commands
- set_session_timeout seconds
- Can override with task, software package times, etc.
- set_rpc_maxthreads count
- Currently must be at least (max_concurrent_jobs + 50)
- set_max_concurrent_jobs count
- Currently max 2000
- set_max_concurrent_logins count
- logstatus_interval seconds
- Interval at which statistics are written to the gatelog
- set_method_trace_time seconds
- Disable, or set interval to refresh epact.bdb with endpoint method date and time information
- set_debug_level
- Keep the value of 1 when not troubleshooting to reduce the load on system resources.
20Gateway Tuning Considerations
- Operating System File Descriptors
- Oserv RPC threads
- Gateway operations RPC threads and jobq
- Endpoint Manager RPC threads
21RPC Threads
OS File Descriptors
Limits
Admin Console SP Editor Web Gateway
Admin Console Agent SP Editor
Oserv RPC Threads
Limits
EPMgr RPC Threads
Gateway RPC Threads
The number of OS File Descriptors limits the
number of all other RPC Threads in Tivoli.
22Operating System File Descriptors
- Threads created consume OS file descriptors
- NT/W2K (all versions): testing reveals that the maximum is 2038.
- AIX: is not affected if the file descriptors (nofiles(descriptors)) are set to unlimited, but only for the soft limit.
- Solaris: does have limitations on how large the file descriptors (nofiles) can be set, but the best practice is to ensure that the value is at least twice the value of the oserv RPC threads.
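The "at least twice the oserv RPC threads" guideline above can be checked with a few lines of Python on a Unix host. This is a minimal sketch: `fd_limit_ok` is a hypothetical helper, and the value 250 used in the demo is the rpc_max_threads default quoted elsewhere in this deck.

```python
import resource


def fd_limit_ok(nofiles_soft, rpc_max_threads, factor=2):
    """Return True if the soft descriptor limit is at least factor x RPC threads.

    A soft limit of resource.RLIM_INFINITY ("unlimited") always passes.
    """
    if nofiles_soft == resource.RLIM_INFINITY:
        return True
    return nofiles_soft >= factor * rpc_max_threads


if __name__ == "__main__":
    # Limits of the current process; the oserv inherits the limits of the
    # shell or startup script that launches it.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft nofiles:", soft, "hard nofiles:", hard)
    print("enough for 250 RPC threads:", fd_limit_ok(soft, 250))
```

The check uses the soft limit because that is the value a process actually runs with until it is raised.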
23Procedures to change File Descriptors
- NT/W2K (all versions)
- Default desktop memory heap is 512
- Check value
- regedt32
- HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\Session Manager\SubSystems
- The default value for this registry value will look something like the following
- %SystemRoot%\system32\csrss.exe ObjectDirectory=\Windows SharedSection=1024,3072,512 Windows=On SubSystemType=Windows ServerDll=basesrv,1 ServerDll=winsrv:UserServerDllInitialization,3 ServerDll=winsrv:ConServerDllInitialization,2 ProfileControl=Off MaxRequestThreads=16
- Double-click Windows and change the SharedSection to 1024,3072,1024
- This changes the third value from 512 to 1024
- Close regedt32
- Reboot the system
24Procedures to change File Descriptors (contd)
- AIX
- ulimit -a will list the current values for the current user ID.
- The data will look something like the following
- file(blocks) 100
- data(kbytes) 523256
- stack(kbytes) 512
- coredump(blocks) 200
- nofiles(descriptors) 64
- memory(kbytes) unlimited
- ulimit -n will set the current user ID's descriptors dynamically
- (e.g. ulimit -n unlimited)
- The oserv needs to be recycled (NOT reexec) before changes will take effect
- ulimit -Sn is the command to put in the system startup files so that the value is set statically at boot time to accommodate Tivoli RPC threads.
- The operating system will need to be rebooted before changes will take effect.
25Procedures to change File Descriptors (contd)
- Solaris
- ulimit -a will list the current values for the current user ID.
- The data will look something like the following
- time(seconds) unlimited
- file(blocks) 2097151
- data(kbytes) 131072
- stack(kbytes) 32768
- memory(kbytes) 32768
- coredump(blocks) 2097151
- nofiles(descriptors) 2000
- ulimit -n will set the current user ID's descriptors dynamically
- (e.g. ulimit -n unlimited)
- The oserv needs to be recycled (NOT reexec) before changes will take effect
- ulimit -Sn is the command to put in the system startup files so that the value is set statically at boot time to accommodate Tivoli RPC threads.
- The operating system will need to be rebooted before changes will take effect.
26Oserv RPC Threads
- Set rpc_max_threads as high as the OS will allow
- (default is 250)
- Viewing this parameter
- odadmin get_rpc_max_threads
- Setting this parameter
- odadmin set_rpc_max_threads
27RPC Threads and Jobq
- Every gateway method is handled by a Gateway RPC thread
- initially spawned via an oserv thread.
- The Gateway RPC thread will determine what type of thread to execute this method with
- Queued
- Note: the valid range for the maximum concurrent jobs allowed on a gateway is 200-2500, and it cannot be greater than (rpc_maxthreads - 50)
- To set the value
- wgateway <gw_label> set_max_concurrent_jobs <value>
- Non-Queued
- Note: the valid range is 250-2000, and it cannot be less than (max_concurrent_jobs + 50)
- Formula: rpc_maxthreads > max_concurrent_jobs + max_concurrent_logins + 20
- To set the value
- wgateway <gw_label> set_rpc_maxthreads <value>
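The range and ordering rules on this slide can be checked before any values are applied. The sketch below is an illustration, not part of any Tivoli tool: the function name is hypothetical, and it assumes the slide's limits read as rpc_maxthreads - 50 and max_concurrent_jobs + max_concurrent_logins + 20 (the operators are a reading of the slide).

```python
def check_gateway_threads(rpc_maxthreads, max_concurrent_jobs, max_concurrent_logins):
    """Return a list of rule violations; an empty list means the values agree."""
    problems = []
    if not 200 <= max_concurrent_jobs <= 2500:
        problems.append("max_concurrent_jobs outside 200-2500")
    if not 250 <= rpc_maxthreads <= 2000:
        problems.append("rpc_maxthreads outside 250-2000")
    if max_concurrent_jobs > rpc_maxthreads - 50:
        problems.append("max_concurrent_jobs exceeds rpc_maxthreads - 50")
    if rpc_maxthreads <= max_concurrent_jobs + max_concurrent_logins + 20:
        problems.append("rpc_maxthreads not above jobs + logins + 20")
    return problems


if __name__ == "__main__":
    print(check_gateway_threads(500, 200, 100))   # consistent values
    print(check_gateway_threads(250, 250, 100))   # two violations
```

Running the check on each gateway's intended values before calling wgateway avoids applying a combination the gateway would reject.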
28Endpoint Manager Threads
- Endpoint Manager Policy Threads
- max_install: maximum number of concurrent allow_install_policy scripts
- max_sgp: maximum number of concurrent select_gateway_policy scripts
- max_after: maximum number of concurrent after_install_policy scripts
- Viewing the Endpoint Manager RPC threads
- wepmgr get max_epmgr_rpc_threads
- Yields something like the following
- max_install 10
- max_sgp 10
- max_after 10
- login_interval 300
- stanza_interval 720
- max_iom_records 500
- epmgr_flags 1
- max_epmgr_rpc_threads 300
- automigrate off
- migrate_max 0
- chk_cntl_chars 0
- labelspace
29Gw_tune.pl script
Usage: gw_tune.pl -h -r -s
Calculate the recommended values of RPC threads for the oserv, epmgr and gateway. Additionally, calculate the max_concurrent_jobs setting for the gateways. (Read the script header for methodology, or the IBM/Tivoli Gateway Tuning Field Guide.)
This script only works on the TMR it is being run on. It should be run on each TMR in an environment.
No options: This message.
-r: Calculate and show the recommendations ONLY.
-s: Calculate the recommendations AND actually set them in the TMR.
-h: This usage statement.
NOTE: This script can ask you to change a value that requires operating system changes.
30Error and Examples
- Gateway Thread Usage errors in the gatelog
- 2003/09/20 14:04:22 01 RPC Request rejected outstanding threads 250
- Summary of Queued Methods
- Downcalls to Endpoints
- Excessive isolated logins
- Delayed distributions
- Summary of Non-Queued Methods
- Note: the following methods (except for login_policy) are bound by an RPC thread limit of 250 on the gateway.
- The foremost method is the login_policy method
- Deleting and adding endpoints uses RPC threads
- Mass migrations of endpoints use up RPC threads on the old gateway.
- Debug levels
- Endpoints
- Debug level 3 is typically the best level to troubleshoot effectively.
- Debug level 4 will provide the necessary data to identify a network issue in addition to Tivoli issues.
- Gateways
- Debug level 6 is typically the best level to troubleshoot effectively.
- Debug level 9 provides more granular errors for troubleshooting and diagnosis.
- NOTE: There is a policy script mechanism that I engineered to help with endpoint login and streamlining of endpoint maintenance.
31The Tivoli Factors
- Endpoint Instability
- Gateway Instability/Unresponsiveness
- Endpoint Manager Instability/Unresponsiveness
- Oserv instability
32The Tivoli Factors - Oserv Instability
- Resource exhaustion of RPC threads (rpc_max_threads)
- Limits
- Windows (hard-coded in TMF: 2038)
- Unix (flavour specific, limited by the number of OS file descriptors)
- FWK commands that contact all MNs will exhaust a poorly configured parameter (wchkdb, wgateway)
- ITM's Request Manager uses an oserv thread for each endpoint downcall it manages.
- request_manager.threads (the default is 10, which is usually too low)
- Be cautious when changing.
- The tec_gateway uses RPC threads to receive upcalls from endpoints.
- Parameter GWThreadCount (tec_gateway max RPC threads): range 250-10000 (default 250)
- Be very cautious when changing.
33The Tivoli Factors - Oserv Instability
- Disk/File System Space
- DBDIR
- Corrupt/inconsistent odb.bdb
- Memory on AIX
- Default maximum process size of 256 MB (function of LDR_CNTRL and hard limit)
- Memory leaks can cause core dumps when the limit is reached
- Transaction log size
- odb.log
34The Tivoli Factors - Endpoint Manager Instability
- Resource exhaustion of RPC threads (max_epmgr_rpc_threads)
- Too many endpoint calls can exhaust this resource quickly
- Logins (Isolation, Initial, Orphaned)
- Migrations
- Long-running policy scripts and w-commands
- The real story about max_sgp, max_allow and max_after
- Hung/hanging gateway methods
- Restricted/impacted by the availability of oserv RPC threads
- So exercise some caution when changing
35The Tivoli Factors - Gateway Instability
- Resource exhaustion of RPC threads (rpc_maxthreads)
- Each gateway method uses an RPC thread until its type is determined.
- Each MDist2 session uses an RPC thread (rpt <-> app, rpt <-> rpt)
- Restricted/impacted by the availability of oserv RPC threads.
- JOBQ threads (pthreads): max_concurrent_jobs, max_concurrent_logins
- Too many upcalls can result in this resource being overextended
- Results in requests being queued
- Can result in isolation logins
36The Tivoli Factors - Gateway Instability
- Disk Space
- Depot directory (rpt_dir)
- depot and states directories share the same file system
- Corrupt/inconsistent MDist2.bdb in states directory
- reconnect_thread overloaded
- Single threaded
- TCP Backlog full
- TCP requests get a slow response
- No failures from a framework perspective
- Applications fail because of delay
37The Tivoli Factors - Gateway workflow
38The Tivoli Factors - Endpoint Instability
- Gateway unavailable
- Down
- Results in an isolation login to an alternate gateway
- Unmanageable until it completes login
- Busy
- Results in queued requests
- Delays
- Retries
- Port saturation
- Upcall and downcall clash.
- ITM heartbeat is a downcall that had a large dependency list
- Fixed in 5.1.1-ITM-FP05
- Upcalls not possible during downcall.
39Software Distribution
- Slow reporting
- Distributions and reporting use MDist2 sessions
- Consider increasing sessions to ease contention
- report_thread_limit controls the number of mdist2_result methods that run from SH on TMR server
- When report_thread_limit is reached, SWDIST throws an exception and messages in queue go into INTERRUPTED state
- Parameter conn_retry_interval is used to trigger reprocessing of messages
- Default of 900 seconds is too high. Decrease to 120 seconds
- no endpoints on gateway
- customers using 20 seconds with good results
40Distributed Monitoring
- Too many upcalls because of too many monitors
- saturate the reader_thread on gateway
- Symptoms
- JOBQ_THREAD_EXCEPTION errors because listening port no longer available to reciprocating downcall.
- Not many jobs in JOBQ
- Upcall request delays seen in lcfd.log (no errors)
41IBM Tivoli Monitoring
- ITM Heartbeat downcall takes too long (fixed, just an example)
- Upcalls on endpoints can't succeed because of downcalls
- When an application upcall fails it forces the endpoint to become isolated
- Cascading effect of many isolation logins causing gateways to become unresponsive because they are busy
- Incorrectly configured Web Health Console and Heartbeat
- Too many downcalls generated
42Inventory
- Collection Table of Contents (CTOC) processing
- Uses pthreads which are configured by max_input_thread and max_output_threads
- Maximum of 100 input and output threads respectively
- Each pthread opens an IOM channel to pick up datapack
- same IOM channel used if datapack source is the same as previous connection
- new IOM channel opened if datapack source is different from previous connection
- Maximum of 250 concurrent IOM channels
- RPC threads used to maintain IOM channels for upcall (CTOC) and downcall (datapack)
- RIM connections
- RDBMS deadlocks caused by too many connections
43Best Practices - Stability
- Monitor Disk/File System Space and Transaction Log Size
- To avoid odb.bdb corruption and oserv instability
- To avoid gateway distribution failures because of MDist2.bdb corruption
- Don't stop oserv if odb.log is too large
- INVESTIGATE!
- Inventory
- 1 RIM and 5 connections as a starting point
- Increase connections and RIMs as performance requires
- Software Distribution
- 50,000 endpoints per distribution
- 7,000-10,000 endpoints per activity plan
- 5-10 activities per plan
44ifs_ignore
- Gateway will listen on one or all interfaces
- Gateway listens to the same interfaces as the oserv based on the odadmin set_force_bind setting
- Use one or all, no other solution
- More flexibility and control
- Can now select which interfaces you want the gateway to use
- Was available in 3.7.1 and 4.1 as an idlattr
- wgateway <gwlabel> set ifs_ignore <IP address>
- wgateway <gwlabel> run ifs_ignore_remove <IP address>
45mcache_bwcontrol
- Method/dependency downloads were not bandwidth controlled
- Method/dependency downloads flooded slow links
- Preloading/prestaging of cache directory workarounds
- Method/dependency downloads can now use mdist2 bandwidth
- wgateway <gwlabel> mcache_bwcontrol TRUE
- Default is FALSE.
46run sync_login_interval
- Change to EPMGR login_interval
- Gateway would use old value
- Gateway loads login_interval at boot time
- Hence a restart of the gateway was needed to effect changes
- wgateway <gwlabel> run sync_login_interval
- Gateway's login_interval is synchronized with EPMGR
- No restart of gateway needed to effect change.
47New Fixes/Features
- In TMF 4.3.1 the RDBMS interface module (RIM) uses ODBC to connect to a Microsoft SQL Server database rather than using a Microsoft SQL Server client. Therefore, you must create an ODBC data source to enable RIM to connect to a Microsoft SQL Server database. For more information about creating an ODBC data source, see the chapter about using RIM objects in the Tivoli Enterprise Installation Guide.
48New Fixes/Features
- Update JRE to 1.4.2 in TMF 4.3.1
- As JRE 1.3.1 is going out of service and new platforms require JRE 1.4, TMF 4.3.1 will distribute the new JRE 1.4.2. TCM Java components such as the Software Package Editor, APM, CCM, GUIs and ISMP will be rebuilt using the new JDK 1.4 and will run with JRE 1.4.2.
49New Fixes/Features
- The following new platforms are supported in 4.3.1
- AIX 6.1 (Server/Gateway)
- Windows 2008 (Server/Gateway, ISMP)
- Windows Vista EP (added Desktop, formerly not supported due to JRE)
- SLES 10 and RHEL 5 (EP, Server/Gateway, ISMP)
- Linux PPC (Server/Gateway)
- Solaris x86 (Server/Gateway)
50New Fixes/Features
- Manage 64 bit Windows OS in TMF 4.3.1
- TCM/TMF are 32-bit applications, but may need to run 64-bit binaries or .vbs scripts and directly access the 64-bit registry and file system on Windows endpoints. Wrappers were produced that take as input the name of a script or binary and the required parameters, disable registry and file system redirection, run the binary or script, then restore the redirection.
- (wrun-AMD64.exe and wrun-IA64.exe) under directory lcf_bundle.43100\bin\w32-ix86\tools
51Oserv Tuning
- Protect the oserv
- Child process RPC threads must not be greater than 60% of oserv RPC threads
52Gateway Tuning
- Configure gateway to handle workload
- Preliminary/TBD - max_concurrent_logins = number of endpoints that do regular logins (workstation machines, up to a maximum of 500)
- Preliminary/TBD - max_concurrent_jobs = max(number of endpoints that are servers or have DM/ITM/TEC installed, Mdist2 sessions total)
- Preliminary/TBD - rpc_maxthreads = max(max_concurrent_jobs, max_concurrent_logins) * 1.3
- Child RPC threads < 60% of oserv RPC threads
- Use logstatus to monitor workload (set logstatus_interval 300)
- 750-1000 endpoints per gateway (Windows)
- 1500-2000 endpoints per gateway (Unix)
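The preliminary sizing rules on this slide can be worked through with a short calculation. This is a sketch under stated assumptions: the function and argument names are hypothetical, the operators (min with a 500 cap, max, and a 1.3 multiplier) are a reading of the slide, and the slide itself marks these rules as Preliminary/TBD.

```python
def preliminary_gateway_settings(workstation_endpoints, server_endpoints, mdist2_sessions):
    """Apply the slide's preliminary rules and return the three settings."""
    # Endpoints that do regular logins, capped at 500.
    max_concurrent_logins = min(workstation_endpoints, 500)
    # Larger of the "server-like" endpoint count and the total MDist2 sessions.
    max_concurrent_jobs = max(server_endpoints, mdist2_sessions)
    # 1.3 x the larger value, rounded up using integer arithmetic.
    largest = max(max_concurrent_jobs, max_concurrent_logins)
    rpc_maxthreads = -(-largest * 13 // 10)
    return {
        "max_concurrent_logins": max_concurrent_logins,
        "max_concurrent_jobs": max_concurrent_jobs,
        "rpc_maxthreads": rpc_maxthreads,
    }


if __name__ == "__main__":
    # Example: 800 workstations, 300 servers, 40 MDist2 sessions.
    print(preliminary_gateway_settings(800, 300, 40))
```

The result still has to respect the valid ranges of each parameter and the 60% child-thread rule against the oserv.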
53Gateway Tuning (cont.)
- Memory
- 10MB for gateway process
- 3KB per endpoint
- 48KB per thread (pthread and RPC)
- 48KB per job in JobQ
- Minimum requirement of 256MB for 1,000 endpoints
- 350MB gateway process for 1,000 endpoints (stress test)
- Function of Mdist2 mem_max
- Recommendation 512MB per 1,000 endpoints
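The per-item figures above add up to a baseline estimate for the gateway process. The sketch below simply sums them; the function name and the example thread count are illustrative assumptions.

```python
KB = 1024
MB = 1024 * KB


def gateway_memory_bytes(endpoints, threads, queued_jobs):
    """Baseline estimate: 10MB process + 3KB/endpoint + 48KB/thread + 48KB/queued job."""
    return 10 * MB + 3 * KB * endpoints + 48 * KB * threads + 48 * KB * queued_jobs


if __name__ == "__main__":
    estimate = gateway_memory_bytes(endpoints=1000, threads=500, queued_jobs=0)
    print("baseline estimate: %.1f MB" % (estimate / MB))
```

For 1,000 endpoints and 500 threads this baseline comes to well under 100MB, far below the 350MB seen in the stress test, which the slide attributes to MDist2 mem_max. That gap is why the recommendation is 512MB per 1,000 endpoints rather than the raw sum.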
54Endpoint Manager Tuning
- Beware of Endpoint Manager RPC thread usage
- Increase max_epmgr_rpc_threads
- Must increase oserv's rpc_max_threads > (max_epmgr_rpc_threads/0.6)
- Must increase OS file descriptor limit > rpc_max_threads
- EPMGR
- CLI: wepmgr login_limit <NUM>
- Default is 80% of max_epmgr_rpc_threads
- TBD/preliminary: recommended value is 60% of max_epmgr_rpc_threads
- 0 disallows logins
- Thread usage by login type (NOTE: not all of the threads are run simultaneously)
- MIGRATION
- Avoid migrating endpoints to specific gateways
- Each migration command uses a total of 12 epmgr threads
- Use gateway clouds/farms where possible
55Endpoint Manager Tuning (cont.)
- ISOLATION
- 7 epmgr threads
- Impatient endpoints exacerbate the problem
- Make udp_interval/login_interval reasonable (default)
- ORPHANED
- 7 epmgr threads
- Do something about these logins
- INITIAL
- 8 epmgr threads
- Don't be too aggressive with roll-out of new endpoints
- Thread usage of other operations
- CLI wep status uses 6 epmgr threads for security
- CLI wdelep uses 7 threads, plus one thread for each PM
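The per-login thread counts from the Endpoint Manager Tuning slides give a rough upper bound on endpoint manager thread demand, and the 0.6 rule gives the matching oserv setting. This is a minimal sketch: the names are hypothetical, and because the slides note that not all threads run simultaneously, the first function is a worst case rather than a measurement.

```python
# epmgr threads used per operation, as listed on the slides.
EPMGR_THREADS = {"migration": 12, "isolation": 7, "orphaned": 7, "initial": 8}


def epmgr_thread_upper_bound(concurrent):
    """Worst-case epmgr threads for a dict of {login type: concurrent count}."""
    return sum(EPMGR_THREADS[kind] * count for kind, count in concurrent.items())


def min_oserv_rpc_max_threads(max_epmgr_rpc_threads):
    """Smallest integer strictly greater than max_epmgr_rpc_threads / 0.6."""
    return max_epmgr_rpc_threads * 10 // 6 + 1


if __name__ == "__main__":
    demand = epmgr_thread_upper_bound({"initial": 10, "isolation": 5})
    print("worst-case epmgr threads:", demand)
    print("oserv rpc_max_threads for max_epmgr_rpc_threads=300:",
          min_oserv_rpc_max_threads(300))
```

The OS file descriptor limit then has to exceed the resulting rpc_max_threads value.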
56Gateway Internals
- Spawning new gateway methods
- Oserv RPC thread
- Gateway RPC thread
- What type of method is it?
- tmf_dispatch method
- jobq method
- tmf_dispatch methods
- Use a gateway RPC thread
- Use an oserv RPC thread
- Methods that run against TME objects, e.g. epmgr
57Gateway Internals
- jobq methods
- Jobs are placed in Job Queue waiting for available pthreads
- max_concurrent_jobs pthreads available for job processing
- Methods that run on endpoints, e.g. upcalls
58Workload/Workflow
- Workload: how much work is the gateway doing?
- CLI
- bash wgateway cayman_gw4110 logstatus
- Status Data ------
- Jobq_threads in queue 0
- Jobq_threads running 0
- Reader_threads in queue 0
- GW methods running 1
- Number of Endpoints 2
- Endpoints connected 2
59Workload/Workflow
- Updated STATUS DATA in gatelog
- STATUS DATA jobqq 0 jobqr 3 redyq 0 gwmethods 0 loginq 0 loginr 0 ephq 0 ephr 0 sendq 0 recvq 1 high 0 med 0 low 0
- THREAD USAGE
- jobqq: JOBQ threads in queue
- jobqr: JOBQ threads running
- redyq: Reader_threads in queue
- gwmethods: GW methods/RPC threads running
- LOGIN QUEUE
- loginq: login requests in login queue
- loginr: logins running
60Workload/Workflow
- HEALTH CHECK
- ephq: endpoint health checks in queue
- ephr: endpoint health checks running (maximum of 20)
- MDIST2
- sendq: jobs in send queue on gateway
- recvq: jobs in receive queue on gateway
- high: number of high-priority sessions in use on gateway
- med: number of medium-priority sessions in use on gateway
- low: number of low-priority sessions in use on gateway
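The STATUS DATA counters described above are easy to pull out of a gatelog line for trending. The sketch below is an assumption-laden illustration: the exact separator between counter name and value in a real gatelog is not certain from these slides, so the pattern accepts a space, "=", ":" or no separator at all; `parse_status_data` is a hypothetical helper.

```python
import re


def parse_status_data(line):
    """Return {counter: value} from a gatelog STATUS DATA line, or {} if absent."""
    _, marker, tail = line.partition("STATUS DATA")
    if not marker:
        return {}
    # Counter name, optional separator, then an integer value.
    pairs = re.findall(r"([A-Za-z_]+)\s*[=:]?\s*(\d+)", tail)
    return {name: int(value) for name, value in pairs}


if __name__ == "__main__":
    sample = ("2002/09/18 23:03:48 05 010275A0 STATUS DATA "
              "jobqq 0 jobqr 3 redyq 0 gwmethods 0 loginq 0 loginr 0")
    counters = parse_status_data(sample)
    print(counters)
    # A persistently non-zero jobqq or loginq suggests the gateway is queuing work.
    print("queued work:", counters.get("jobqq", 0) + counters.get("loginq", 0))
```

With logstatus_interval set to a few minutes, feeding each line through a parser like this gives a simple time series of queue depth per gateway.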
61Impediments
- TCP Backlog
- listen(), handshaking, accept()
- Contains the completed TCP connections
- Contains the incomplete TCP connections (on Windows)
- Windows has a backlog size of 128 (maximum)
- Unix uses a syncache for incomplete connections (configurable)
- TCP connection request flood
- Backlog gets full
- SYN,RST seen as connections are refused
- Connection attempts never seen by gateway
62Impediments
- reconnect_thread
- There is only one
- High volume of TCP connections can overwhelm thread
- Needs to accept() connections before TCP Backlog is full
- Applications may trigger fail-over mechanisms due to delays
- TMA does not record an error because upcall eventually works
- Low number of jobs running and waiting in Job Queue
63Tuning and Recommendations
- Memory usage on AIX
- AIX allocates 16 segments to each process
- Each segment is 256MB
- Process Private Data (heap, stack and user area) is one segment
- LDR_CNTRL environment variable can be used to assign multiple data segments to a process
- LDR_CNTRL=MAXDATA=0x80000000 (maximum of 8 data segments per process = 2GB)
- Used in odadmin environ
- http://nscp.upenn.edu/aix4.3html/aixprggd/genprogc/lrg_prg_support.htm
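The relationship between the MAXDATA value and the number of 256MB segments is plain arithmetic, shown below as a sketch. The helper names are hypothetical; the segment size and the limit of 8 data segments are taken from the slide.

```python
SEGMENT_BYTES = 256 * 1024 * 1024  # one AIX segment, per the slide
MAX_DATA_SEGMENTS = 8              # upper limit quoted on the slide


def maxdata_segments(maxdata):
    """Number of whole 256MB data segments a MAXDATA value represents."""
    return maxdata // SEGMENT_BYTES


def ldr_cntrl_for_segments(segments):
    """Build the LDR_CNTRL setting for the requested number of data segments."""
    if not 1 <= segments <= MAX_DATA_SEGMENTS:
        raise ValueError("segments must be between 1 and %d" % MAX_DATA_SEGMENTS)
    return "LDR_CNTRL=MAXDATA=0x%08X" % (segments * SEGMENT_BYTES)


if __name__ == "__main__":
    print(maxdata_segments(0x80000000))   # segments in the slide's example value
    print(ldr_cntrl_for_segments(4))      # a 1GB data area
```

Choosing fewer than 8 segments leaves more of the 16-segment address space for shared memory and mapped files.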
64Tuning and Recommendation
- TCP Backlog - revisited
- AIX
- /usr/sbin/no -o clean_partial_conns=1
- This setting will instruct the kernel to randomly remove half-open sockets from the q0 queue to make room for new sockets.
- /usr/sbin/no -a or no -o somaxconn
- /usr/sbin/no -o somaxconn=NewValue (default 1024)
- Change takes effect immediately. Change is effective until next boot. Permanent change is made by adding the no command to /etc/rc.net.
65Tuning and Recommendations
- Solaris
- /usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q 1024
- The q queue holds sockets awaiting an accept() call from the application.
- /usr/sbin/ndd -set /dev/tcp tcp_conn_req_max_q0 2048
- The q0 queue contains half-open sockets.
66Tuning and Recommendations
- Tru64 UNIX
- /sbin/sysconfig -r socket sominconn=65535
- The value of sominconn determines how many simultaneous incoming SYN packets can be handled by the system.
- /sbin/sysconfig -r socket somaxconn=65535
- The value of somaxconn sets the maximum number of pending TCP connections.
- HP-UX
- /usr/sbin/ndd -set tcp_syn_rcvd_max 1024
- /usr/sbin/ndd -set tcp_conn_request_max 200
67Tuning and Recommendations
- Linux kernel 2.2
- /sbin/sysctl -w net.ipv4.tcp_max_syn_backlog=1280
- Increases the size of the socket queue (effectively, q0).
- /sbin/sysctl -w net.ipv4.tcp_syncookies=1
- Enables support for TCP SYN cookies, which mitigates the effectiveness of SYN floods.
- Possible side effects - see RFC1323 and RFC2018
- FreeBSD
- sysctl -w kern.ipc.somaxconn=1024
68Tuning and Recommendation
- IRIX
- The listen() queue is hardcoded to 32. However, the system actually enforces the limit of pending connections as ((3 * backlog) / 2) + 1. This yields a maximum backlog of 49 connections
- Windows
- NT fixed at 100
- W2K fixed at 128
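The IRIX figure of 49 follows from the enforcement formula applied to the hardcoded queue of 32. The one-line check below assumes the formula reads ((3 * backlog) / 2) + 1 with integer division, which is the reading consistent with the stated result; the function name is hypothetical.

```python
def irix_effective_backlog(backlog):
    """Pending-connection limit enforced for a given listen() backlog."""
    return (3 * backlog) // 2 + 1


if __name__ == "__main__":
    print(irix_effective_backlog(32))
```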
69Login Types
- Login Message Types
- LOGIN_EP LOGIN_FAIL
- LOGIN_EP sub-types
- LOGIN_INITIAL
- LOGIN_NORMAL
- LOGIN_ISOLATED
- LOGIN_INFO
- LOGIN_MIGRATING
- LOGIN_TMR_REDIRECT
70INITIAL login
- INITIAL login
- Endpoint sends a LOGIN_INITIAL to gateway
- Gateway forwards packet to EPMGR
- EPMGR processes login
- allow_install_policy
- should this endpoint be allowed access?
- select_gateway_policy
- where should he login?
- after_install_policy
- now that it exists, now what?
- Gateway passes back a LOGIN_INFO
- Endpoint sends a LOGIN_NORMAL to assigned gateway
- login_policy
- now that you are available, what should you do?
71ISOLATION login
- ISOLATION login
- Endpoint sends a LOGIN_ISOLATED when the LOGIN_NORMAL is not answered.
- Gateway's response is dependent on whether it owns the endpoint
- Ownership: uses key to decrypt packet
- treated like LOGIN_NORMAL
- No ownership (alternate gateway): packet sent to EPMGR
- EPMGR uses key to decrypt packet
- select_gateway_policy
- Assigns endpoint to new gateway
- Alternate (intercepting) gateway returns LOGIN_INFO to endpoint
- Endpoint sends LOGIN_NORMAL to new gateway
- login_policy
72MIGRATORY login
- MIGRATORY login
- Endpoint sends LOGIN_NORMAL to gateway
- No ownership, gateway forwards LOGIN_MIGRATING to EPMGR
- CLI wep migrate used
- CLI wep set gateway e/-g was not used
- EPMGR tells gateway where endpoint should be logged in
- Gateway sends LOGIN_INFO to endpoint
- Endpoint sends LOGIN_NORMAL to the migration gateway
- login_policy
73ORPHANED login
- ORPHANED login
- Endpoint sends LOGIN_NORMAL
- Gateway forwards to EPMGR (not in epcache)
- EPMGR can't find key (resource not found)
- Gateway returns LOGIN_FAIL
- Endpoint sends LOGIN_ISOLATED
- EPMGR uses master key to decrypt packet
- Checks region number for validity
- EPMGR creates new account
- Policy scripts executed as per INITIAL login
- Gateway sends LOGIN_INFO to endpoint
- Endpoint sends LOGIN_NORMAL to gateway
- login_policy
74TMR Redirection
- TMR Redirection
- Endpoint sends LOGIN_INITIAL to gateway
- Gateway forwards login to EPMGR
- EPMGR processes login
- allow_install_policy
- select_gateway_policy
- Passes back a gateway in another TMR region
- EPMGR understands that this is a redirection
- Gateway passes new gateway in LOGIN_INFO to endpoint
- Endpoint does a LOGIN_TMR_REDIRECT to new gateway
75Health Check
- A health check, rather than a heart beat.
- Framework based.
- No products needed
- Configurable
- wgateway <gwlabel> epcheck_interval
- Executes every epcheck_interval seconds (default 3600)
- wgateway <gwlabel> epcheck_atboot TRUE|FALSE
- Check the endpoints at boot time (default FALSE)
- CLI: wepstatus <eplabel>
- Reads the results from the gateway's health check cache
- Does not contact the endpoint
- CLI: wep <eplabel> status 1
- Initiates intrinsic health check method on single endpoint
- Gateway's health check cache updated with new result
76What does it do?
- Purpose: to establish that downcalls and upcalls will work
- Runs as intrinsic method on LCFD
- Creates a temporary file in the temp directory, writes to it, then deletes it.
- Checks the amount of temp space: it must be more than diag_temp_space.
- Creates a temporary file in the cache root directory (LCF_DATDIR/cache), writes to it, then deletes it.
- Checks free space on cache file system/volume.
- Generates tmersrvd token and spawns test process.
- Generates BuiltinNTAdministrator token and spawns test process
- Not performed on Netware, OS/2 and OS/400
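The file and free-space steps of the health check can be approximated locally when diagnosing an endpoint by hand. The sketch below is not the lcfd intrinsic method: it is a stand-in written for illustration, the function name and status strings are hypothetical, and the threshold plays the role of diag_temp_space.

```python
import os
import shutil
import tempfile


def check_directory(path, min_free_bytes):
    """Create, write and delete a temp file in path, then check free space.

    Returns "cannot_write", "low_space" or "ok".
    """
    try:
        handle, name = tempfile.mkstemp(dir=path)
        try:
            os.write(handle, b"health check")
        finally:
            os.close(handle)
            os.remove(name)
    except OSError:
        # Missing directory, permission problem or a failed write.
        return "cannot_write"
    # Free space must be more than the threshold.
    if shutil.disk_usage(path).free <= min_free_bytes:
        return "low_space"
    return "ok"


if __name__ == "__main__":
    print(check_directory(tempfile.gettempdir(), 1024 * 1024))
```

Running the same check against both the temp directory and the cache directory mirrors the first four steps listed above; the token and spawn tests have no simple local equivalent.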
77Results
- Connected
- The endpoint has passed all the health check tests. Downcall, method and task executions and upcall are successful.
- Disconnected
- The endpoint sends a logout message to the gateway when it logs out. The gateway will not attempt any downcalls to the endpoint until its state changes.
- Unavailable
- Gateway cannot communicate with endpoint. The important point here is that the lcfd process accepted the connection from the gateway, but the method execution failed.
78Results
- Unreachable
- Gateway cannot communicate with endpoint. In this case the lcfd process did not accept the connection from the gateway. This could mean that the lcfd process is not running, that there is a network failure, or that the process is terminally ill.
- Unknown
- Gateway that is hosting the endpoint is down, or the gateway has no health check status for the endpoint yet.
79Results
- Endpoint Health Status Error Codes
- 0 - File permission error - cannot create temporary file.
- 1 - Insufficient disk space in temporary directory.
- 2 - File permission error - cannot create/update method cache.
- 3 - Insufficient disk space in LCF_CACHEDIR.
- 4 - Cannot generate token for tmersrvd account.
- 5 - Cannot generate token for builtin administrator account.
- 6 - Cannot spawn a process.
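When scripting around health check results, the error codes above are convenient to keep as a lookup table. The table below copies the slide's meanings; the function name and the fallback text are hypothetical.

```python
HEALTH_ERRORS = {
    0: "File permission error - cannot create temporary file",
    1: "Insufficient disk space in temporary directory",
    2: "File permission error - cannot create/update method cache",
    3: "Insufficient disk space in LCF_CACHEDIR",
    4: "Cannot generate token for tmersrvd account",
    5: "Cannot generate token for builtin administrator account",
    6: "Cannot spawn a process",
}


def describe_health_error(code):
    """Return the meaning of an endpoint health status error code."""
    return HEALTH_ERRORS.get(code, "Unknown error code %d" % code)


if __name__ == "__main__":
    for code in (1, 6, 9):
        print(code, "-", describe_health_error(code))
```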
80How long does a login take?
- A monitoring thread manages login queue
- Number of endpoint login jobs in run state
- Number of endpoint login jobs in the queue
- Rolling average time taken by typical initial or normal logins
- Initial login (in this context)
- Any login that is sent to the EPMGR
- INITIAL, ISOLATED, ORPHANED
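A rolling average of login times, as tracked by the monitoring thread, can be sketched with a fixed-size window. This is an illustration of the idea only: the class name and the window size are assumptions, and the slides do not say how many samples the gateway actually keeps.

```python
from collections import deque


class RollingAverage:
    """Average of the most recent `window` login durations (in seconds)."""

    def __init__(self, window):
        self.samples = deque(maxlen=window)  # oldest sample drops off automatically

    def add(self, seconds):
        self.samples.append(seconds)

    def value(self):
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)


if __name__ == "__main__":
    # Separate averages for normal logins and for logins sent to the EPMGR.
    normal = RollingAverage(window=3)
    for duration in (2, 4, 6, 8):
        normal.add(duration)
    print("rolling average of last 3 normal logins:", normal.value())
```

Keeping one average for normal logins and one for logins that go to the EPMGR matches the two categories the slide distinguishes.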
81Questions?