Title: ASE 131: Troubleshooting ASE
1ASE 131 Troubleshooting ASE
David WeinPrincipal Product Support
Engineer david.wein_at_sybase.com July 17th, 2003
2Agenda
- Introduction
- Presentation Goals
- Problem Solving Resources
- Understanding ASE Diagnostics
- Finding information in error messages
- Decoding stack traces
- Picking good key words for research
- Trace Flags
- The shared memory dump facility
- Types of Problems
- Performance Issues
- Locking / Blocking / Deadlocking
- Low-Level Task Related Problems
- Backup and Recovery Issues
3Presentation Goals
- Help you help yourself
- What resources are available to customers?
- Understand how Sybase Tech Support works on
problems - What is the process? What can you do to help?
- Understand ASEs diagnostics
- Errors, stack traces, and memory dumps
- Understand how to deal with specific problems
that may be encountered with ASE - Minimize impact by understanding what is
happening within ASE and how to handle it.
4Online Problem Solving Resources
- Online support home is http//www.sybase.com/suppo
rt - Product Manuals
- Technical Documents
- Newsgroups
- Solved Cases / Existing Change Requests
- EBF Downloads
- Case Management
- ASE documents are very strong
- Troubleshooting Guide
- Early chapters are a wealth of information on
getting out of trouble and staying out of
trouble. - Also contains detailed write-ups on many ASE
messages and should always be checked as a first
step. - System Administration (SAG) and Performance and
Tuning (PT) guides are also very useful for
understanding the area of ASE where you may be
having problems
5Online Problem Solving Resources, Continued
- Public newsgroups are an invaluable service
- Over a dozen ASE newsgroups covering general
issue to specific topcis such as High
Availability and Linux - Searchable archive
- Informally monitored by many Sybase employees and
lots of experienced customers - Solved cases contains over 17,000 searchable
support cases with answers - Common problems and their answers are well
documented in this knowledge base. - Hundreds of additional entries are added every
month - Change Requests (CRs or bugs) are also searchable
- Contains most CRs reported by customers starting
in May of 2002. - Contains a summary and applicable version
6PT Newsgroup Screenshot
7Agenda
- Introduction
- Problem Solving Resources
- Working with Sybase Tech Support
- Change Request / Bug Fix Process
- Understanding ASE Diagnostics
- Finding information in error messages
- Decoding stack traces
- Picking good key words for research
- Trace flags
- The shared memory dump facility
- Types of Problems
- Performance Issues
- Locking / Blocking / Deadlocking
- Low-Level Task Related Problems
- Backup and Recovery Issues
8ASE Error Message Formats
- Messages that appear in the log have a header
string that provides good information - 0200000000562002/12/16 234049.14 kernel
current process (0x11890001) infected with 11 - Use this information to isolate overlapping
messages on multi-engine systems
Date / Time
Family ID
ASE Layer
Process ID
Engine ID
9ASE Error Message Formats, Continued
- ASE uses both numbered and unnumbered messages
- The numbered messages correspond to what is
found in the sysmessages table. - The unnumbered messages are more internal in
nature, do not exist in sysmessages, and are not
sent to the client. - Examples
- Numbered message
- Msg 208, Level 16, State 1
- Line 1 foo not found. Specify owner.objectname
or use sp_help to check whether the object exists
(sp_help may produce lots of output). - Unnumbered message
- 2002/09/05 164449.82 kernel sddone write
error on virtual disk 16 block 3990498
10Severity and State
- Numbered messages consist of four components
- The message number itself, i.e. 208, 695, 1105,
etc. - A severity level between 10 and 24.
- Severity determines how ASE responds to the error
- Levels 10 18 represent a user error. These are
non-fatal. 10 16 can be corrected by the user.
17 and 18 may require DBA intervention. - Levels 19 26 are fatal and will result in a
terminated client session. These are more severe
problems and represent internal inconsistencies - Severity determines how the error is reported
- Non-SA client only receive non-fatal errors (lt
18). They receive an alternate message for
severity 19 and higher. - SA clients receive errors regardless of severity
- Only severity 19 and higher are sent to the
errorlog / console. - If traceflag 3602 is enabled, severity 10-18 are
also sent to the errorlog. - A state value
- States identify the instance of an error. For
errors raised several places, this isolates the
code location that raised the instance. - This is very important for tech support!
- The actual text of the message
11Stack Traces
An invaluable diagnostic tool
- Stack traces provide useful information relating
to what a task was doing when a fatal condition
was encountered - Stack traces go to the errorlog and console when
a fatal error is encountered - Fatal conditions such as time slice violations
and signals also cause stack traces to be
generated - Trace flag 3601 was cause a stack trace to be
generated for non-fatal errors as well, but this
is not normally advisable. - Stack traces are often preceded or followed by
relevant information - Be sure to save the entire errorlog if you see a
stack trace - If you are working with Tech Support, they will
want to see the entire log, not just a cut and
paste of the stack trace. - Picking good key words out of a stack may help
you find the cause online - Using good key words for searches may turn up the
answer in the known problems or solved cases
databases. -
12Stack Traces, Continued
An example stack trace
- 0500000000742002/06/21 085821.17 kernel
current process (0x128b007b) infected with
11 - 0500000000742002/06/21 085821.23 kernel
Address 0x0068b7f8 (getnext_ctlib_subst0x138),
siginfo (code, address) (1, 0x0000010a) - 0500000000742002/06/21 085821.23 kernel
- 0500000000742002/06/21 085821.23 kernel SQL
causing error execute etcm.etcm.ap_get_ba_addres
s1 _at_country 'CAN', _at_addr_type 'I', _at_basicba
'999999', _at_addrsub '99', _at_baname '' - 0500000000742002/06/21 085821.23 kernel
- 0500000000742002/06/21 085821.23 server SQL
Text execute etcm.etcm.ap_get_ba_address1
_at_country 'CAN', _at_addr_type 'I', _at_basicba
'999999', _at_addrsub '99', _at_baname '' - 0500000000742002/06/21 085821.23 kernel
curdb 13 pstat 0x10100 lasterror 0 - 0500000000742002/06/21 085821.23 kernel
preverror 0 transtate 0 - 0500000000742002/06/21 085821.23 kernel
curcmd 197 program
13Stack Traces, Continued
- 0500000000742002/06/21 085821.24 kernel pc
0x0064f8dc pcstkwalk0x24(0x29b869a0, 0x00000000,
0x0000270f, 0x00000002, 0xfffffff8) - 0500000000742002/06/21 085821.24 kernel pc
0x0064f7e8 ucstkgentrace0x194(0x128b007b,
0x2d3ccd58, 0x2d3ccd58, 0x2bb795c8, 0x00000000) - 0500000000742002/06/21 085821.24 kernel pc
0x0061cc90 ucbacktrace0xa8(0x2bb795c8,
0x00000001, 0x128b007b, 0x2d14f238, 0x00000000) - 0500000000742002/06/21 085821.25 kernel pc
0x000f62e0 terminate_process0xd70(0x00beef50,
0x000054e8, 0x000054e4, 0xffffffff, 0x00005000) - 0500000000742002/06/21 085821.25 kernel pc
0x006324d4 kisignal0x1e0(0x29b8745c, 0x29b871c4,
0x0000000b, 0x29b87198, 0x29b87450) - 0500000000742002/06/21 085821.26 kernel pc
0xff34b7dc _getfp0x220(0x0000000b, 0x29b87450,
0x29b87198, 0x006322f4, 0x00bf3424) - 0500000000742002/06/21 085821.26 kernel pc
0xff3484c0 _fork0x864(0x0000000b, 0x00bf3380,
0x00000000, 0x00000000, 0x00000000)
14Stack Traces, Continued
- 0500000000742002/06/21 085821.26 kernel pc
0x0068b77c getnext_ctlib_subst0xbc(0x00000200,
0x00000000, 0x00000000, 0x00000000, 0x38433e50) - 0500000000742002/06/21 085821.26 kernel
Handler pc 0x006886ec omni_backout installed by
the following function- - 0500000000742002/06/21 085821.26 kernel pc
0x0068b490 omni_getnext0x2c0(0x38433e50,
0x2c65f574, 0x2d3ccd58, 0x2c662c5c, 0x2d3d2298) - 0500000000742002/06/21 085821.27 kernel pc
0x0042a760 exec_eop0x1750(0x00002000,
0x2d3ccd58, 0x00000000, 0x00000000, 0x000005c5) - 0500000000742002/06/21 085821.27 kernel pc
0x004297ac exec_eop0x79c(0x00007000, 0x2d3ccd58,
0x45478880, 0x00000000, 0x00000000) - 0500000000742002/06/21 085821.28 kernel
Handler pc 0x0042e510 execerr installed by the
following function- - 0500000000742002/06/21 085821.28 kernel
Handler pc 0x0056f508 jcsExHandler installed by
the following function- - 0500000000742002/06/21 085821.30 kernel
Handler pc 0x003dfe84 sortmerr installed by the
following function-
15Stack Traces, Continued
- 0500000000742002/06/21 085821.30 kernel
Handler pc 0x0042e510 execerr installed by the
following function- - 0500000000742002/06/21 085821.30 kernel
Handler pc 0x0042e510 execerr installed by the
following function- - 0500000000742002/06/21 085821.30 kernel
Handler pc 0x00365b38 aritherr installed by the
following function- - 0500000000742002/06/21 085821.30 kernel pc
0x00428960 execute0xb80(0x455e96d0, 0x0000702c,
0x455e96d0, 0x2d3ccd58, 0x00005000) - 0500000000742002/06/21 085821.30 kernel pc
0x0038aebc s_execute0x2404(0x80000000,
0x000000c5, 0x455e9348, 0x00005393, 0x00001000) - 0500000000742002/06/21 085821.30 kernel
Handler pc 0x003a4e50 s_handle installed by the
following function- - 0500000000742002/06/21 085821.30 kernel pc
0x003a1e8c sequencer0x10d8(0x00be0400,
0x00beec00, 0x29025200, 0x00005388, 0x2d3ccd58) - 0500000000742002/06/21 085821.30 kernel pc
0x003921f8 execproc0x79c(0x00005388, 0x00007000,
0x2d3ccd58, 0x00005000, 0x0000001f)
16Stack Traces, Continued
- 0500000000742002/06/21 085821.30 kernel pc
0x0038b3d4 s_execute0x291c(0x00beec00,
0x00be2800, 0x29025200, 0x00000008, 0x00000000) - 0500000000742002/06/21 085821.30 kernel
Handler pc 0x003a4e50 s_handle installed by the
following function- - 0500000000742002/06/21 085821.30 kernel pc
0x003a1e8c sequencer0x10d8(0x00be0400,
0x00beec00, 0x29025200, 0x00005388, 0x2d3ccd58) - 0500000000742002/06/21 085821.30 kernel pc
0x003921f8 execproc0x79c(0x00005388, 0x00007000,
0x2d3ccd58, 0x00005000, 0x0000001b) - 0500000000742002/06/21 085821.30 kernel pc
0x0038b3d4 s_execute0x291c(0x00beec00,
0x00be2800, 0x29025200, 0x00000008, 0x00000000) - 0500000000742002/06/21 085821.30 kernel
Handler pc 0x003a4e50 s_handle installed by the
following function- - 0500000000742002/06/21 085821.30 kernel pc
0x003a1e8c sequencer0x10d8(0x00be0400,
0x00beec00, 0x29025200, 0x00005388, 0x2d3ccd58)
17Stack Traces, Continued
- 0500000000742002/06/21 085821.30 kernel pc
0x003921f8 execproc0x79c(0x00005388, 0x00007000,
0x2d3ccd58, 0x00005000, 0x0000001b) - 0500000000742002/06/21 085821.30 kernel pc
0x0038b3d4 s_execute0x291c(0x00beec00,
0x00be2800, 0x29025200, 0x00000008, 0x00000000) - 0500000000742002/06/21 085821.30 kernel
Handler pc 0x003a4e50 s_handle installed by the
following function- - 0500000000742002/06/21 085821.30 kernel pc
0x003a1e8c sequencer0x10d8(0x00be0400,
0x00beec00, 0x29025200, 0x00005388, 0x2d3ccd58) - 0500000000742002/06/21 085821.30 kernel pc
0x0011cf9c tdsrecv_language0xb4(0x00be0400,
0x00befc00, 0x00000001, 0x00005000, 0x00005400) - 0500000000742002/06/21 085821.30 kernel
Handler pc 0x0013d9ec hdl_backout installed by
the following function- - 0500000000742002/06/21 085821.31 kernel
Handler pc 0x0033fe1c ut_handle installed by
the following function- - 0500000000742002/06/21 085821.31 kernel
Handler pc 0x0033fe1c ut_handle installed by
the following function-
18Stack Traces, Continued
- 0500000000742002/06/21 085821.31 kernel pc
0x0012c6bc conn_hdlr0x20f0(0x00befc00,
0x00be0400, 0x00000000, 0x000054e4, 0x00005400) - 0500000000742002/06/21 085821.31 kernel pc
0x00662618 _coldstart(0x00000081, 0x0012a5cc,
0x00000000, 0x00000000, 0x00000000) - 0500000000742002/06/21 085821.31 kernel end
of stack trace, spid 74, kpid 311099515, suid 141
19Good Information from Stack Traces
- While most of the stack is only relevant for
internal Sybase folks, some of it can be used to
help match problems with exiting cases or bugs. - First, make note of the problem. Here it is an
infected with 11 - 0500000000742002/06/21 085821.17 kernel
current process (0x128b007b) infected with
11 - Next, note which function was executing when the
signal arrived - 0500000000742002/06/21 085821.23 kernel
Address 0x0068b7f8 (getnext_ctlib_subst0x138),
siginfo (code, address) (1, 0x0000010a) - Here the function getnext_ctlib_subst() is where
the problem was encountered. - This line is only relevant for signal stack
traces and doesnt apply to time slices or most
other errors.
20Good Information from Stack Traces, Continued
- The SQL causing error can help you reproduce or
avoid the problem. - 0500000000742002/06/21 085821.23 kernel SQL
causing error execute etcm.etcm.ap_get_ba_addres
s1 _at_country 'CAN', _at_addr_type 'I', _at_basicba
'999999', _at_addrsub '99', _at_baname ' - Lines that follow contain the current database
(pcurdb) as well as any errors that were
encountered prior to the event that triggered the
stack trace.
21Picking Good Key Words for Research
- Use good searching strategies, such as
remembering synonyms. - When researching stack traces, search on
functions near the top of the stack. - For infected with issues, try searching on the
function in the address line following the
infected with 11. - In the stack example we just looked at, we would
search on getnext_ctlib_subst. - Search example on following slides
22Picking Good Keywords for Research, Continued
- Some ASE functions appear in almost every stack
trace, and are therefore poor keywords to use
when doing research - Avoid searching based on these functions
- pcstkwalk
- ucstkgentrace
- ucbacktrace
- os_backtrace
- terminate_process
- kisignal
- hdl_default
- hdl_backout
- s_handle
- execerr
- aritherr
- ex_raise
- close_network
- execute
- s_execute
- exec_eop
- sequencer
- s_compile
- tdsrecv_language
- conn_hdlr
- _coldstart
- kaclkintr
- _fork
- _getfp
- Functions towards the top of the stack are more
likely to be relevant than those on the bottom - If you match the bottom of the stack but not the
top, you likely havent matched the problem.
23Researching our Stack Trace
Step 1 Search solved cases
Keyword we have entered
24Researching our Stack Trace, Continued
Step 2 Get the results
25Researching our Stack Trace, Continued
Step 3 Click on the case number and see the
details
26Researching our Stack Trace, Continued
Step 4 Search for CR details
Changed search type
27Researching our Stack Trace, Continued
Step 5 Get more details on the CR
28ASE Trace Flags
- ASE has many, many trace flags that affect the
server behavior. - Some are officially documented, some are not.
- Some basic rules apply for all trace flags
- Dont turn on a trace flag unless you understand
what it does and what the side effects are. - Document who turned the flag on, when it was
turned on, and why it was turned on. - Dont ever add a trace flag to a RUN_server file
without adding a comment detailing why it is
there.
29ASE Trace Flags How they work
- With few exceptions, trace flags are not tracked
on a per-connection basis. - A single, server-wide, mask of activated trace
flags is maintained. - When a connection turns on a trace flag, it will
inherit the behavior of all other activate trace
flags within the server. - Example
- Spid 10 runs dbcc traceon (3604, 302, 310)
- Spid 11 runs dbcc traceon (8399)
- Trace 302 and 310 output will be generated for
any query spid 11 runs. However, the output will
not be sent back to the client since spid 11 did
not activate trace 3604 - Some trace flags will automatically apply to all
connections, regardless of whether or not they
have activated other flags. - Access to the ASE code is required to determine
which flags have this behavior, therefore when in
doubt you should consult Sybase Tech Support - Most flags that affect query processing fall into
this category - The only flags that must appear in the RUN_server
file are those that affect the boot-time behavior
and initialization of the server.
30The Sybmon Utility
ASEs advanced diagnostic facility
- Diagnostic utility built into the dataserver
binary - Provides access to shared memory structures
- May be used to analyze live servers or dumped
shared memory images of servers - Non-intrusive and requires no prior planning
before invoking - The halt and restart commands are intrusive
and will cause the server to suspend processing.
However, these are the exceptions. - Sybmon dumps provide a snapshot image into a
server - This snapshot is often times very useful in
determining the cause of a problem - However, Sybmon dumps alone cannot answer every
problem - Problems that occur over time (example loss of
SIGALRMs) cant be seen with a snapshot - Problems that are occur outside of the shared
memory segment cant be seen.
31The Shared Memory Dump Facility (CSMD)
- ASE has the ability to copy its shared memory to
a file for examination by Sybase Tech Support - The shared memory dump can be manually triggered,
or it can be event based - The event based dump is called a Configurable
Shared Memory Dump or CSMD - CSMD can be configured for
- A specific error or a group of errors, such as
605 or 6xx - All errors with a severity equal to or greater
than that configured - A signal, such as signal 10 or 11 (infected with
10 or 11) - A timeslice violation
- A panic / stack overflow
- An unnumbered message that might appear in the
log - Memory dumps can be analyzed by Sybase to
determine the state of the server and the
contents of memory structures at the time of a
fault
32sp_shmdumpconfig Stored Procedure
The interface to CSMD
- The sp_shmdumpconfig stored procedure interfaces
with the CSMD facility - Add or drop dump conditions
- View configured dump conditions
- View and modify dump defaults
- To view current setup, run sp_shmdumpconfig with
no arguments - Example on the next slide
- For details on syntax and usage, see the notes
section of this slide - Two relevant sp_configure parameters
- dump on conditions set to 1 to turn on, set to
0 (default) to turn off - max dump conditions sets the maximum number of
dump conditions that can be configured (default
10)
33sp_shmdumpconfig Sample Output
- gt sp_shmdumpconfig
- gt go
- Type Value Maxdumps Page_Cache Proc_Cache
Unused_Space Est_File_Size - Filename Directory
- ---------- ----- -------- ---------- ----------
------------ -------------
------------------------------ ---------- - Error 813 Default Default Default
Default 13 MB - Default File Name Default Directory
- Signal 10 Default Include Default
Default 22 MB - Default File Name Default Directory
- Timeslice --- Default Default Default
Default 13 MB - Default File Name Default Directory
- Panic --- Default Default Default
Default 13 MB - Default File Name Default Directory
- Defaults --- 1 Omit Omit
Omit 13 MB - Generated File Name /opt/sybase/csmd
- Current number of conditions 4
- Maximum number of conditions 10
34Sybmon and CSMD What You Should Know
- Sections of memory included in dump files are
configurable - Core data structures are always included
- Page cache, procedure cache, and unused space are
optional - Inclusion of procedure cache is recommended
- Dump files can be very large
- Size is based on the amount of memory ASE is
configured for and the optional memory modes - ASE processing will stop while the dump file is
being written - This is done to achieve a consistent image of
memory in the dump file - Time of the halt is based on the size of the dump
and the speed of the I/O system - Specify a dump directory, but dont specify a
file name - It is best to let ASE generate the name for the
dump file. This prevents dump files from being
overwritten - Always FTP dump files in binary mode
35Should You Be Proactive With CSMD?
- Some shops setup CSMD before they encounter
problems to help speed resolution should they
encounter any - Do this at your own discretion.
- Keep in mind that generating a CSMD causes ASE to
halt while the file is being written - Some conditions to consider
- Signals 4 (very rare), 10, 11 (signal 0xC0000005
on NT) - Timeslice, panic
- Fatal errors (configure for severity 19)
- Keep in mind that the dump file may not be that
useful and Tech Support may not be interested in
it - Based on the specifics of the issue the CSMD may
not be the best way to approach the problem
36Agenda
- Introduction
- Problem Solving Resources
- Working with Sybase Tech Support
- Change Request / Bug Fix Process
- Understanding ASE Diagnostics
- Finding information in error messages
- Decoding stack traces
- Picking good key words for research
- Trace Flags
- The shared memory dump facility
- Types of Problems
- Performance Issues
- Locking / Blocking / Deadlocking
- Low-Level Task Related Problems
- Backup and Recovery Issues
37Performance Problems
- Planning ahead greatly simplifies the resolution
of performance problems. - Due to the variances in applications and
environment, it is difficult to determine what is
normal and what is abnormal for a particular
customer. - Gathering benchmark data prior to encountering a
problem allows for the isolation of problems
38Basic Performance Troubleshooting
- Isolation query issues from system issues
- Optimizer issue, blocking issue, or resource
issue? - Gather data over time
- Establish baselines for all levels of activity
- Understand typical system loads at various times
- Look for trends
- Avoid tuning based on a single 5 minute sample
- Understand what you are changing
- Read documentation before tuning anything
- Document what you did and why you did it
- Dont ever turn on a trace flag until you know
what it does and what the side effects are!
39Tools to Consider for General Problems
- sp_sysmon
- Very good at giving an overall view of what the
server is doing. - Not so good at narrowing down to specific users,
queries, etc. - Well documented and easy to use, but can be
difficult to find meaningful data. - Monitor and historical servers
- Provides data sp_sysmon cant
- Better at monitoring specific users, queries,
objects, etc. - Requires separate setup, but this is trivial
- Can serve client apps written with the Monitor
Client Library, such as Sybase central plug-in - Shipped as part of ASE, no separate licenses
required - DBXray
- New optional feature for ASE 12.5.0.3
- Provides real-time, graphical monitoring of ASE
- Other external tools
- OS level tools such as iostat, vmstat, sar, etc.
- In extreme cases, tools such as truss can be used
to determine the amount of time spent in
individual system calls, but this adds a lot of
overhead.
40Guidelines for Using sp_sysmon
- Archive your output
- If performance drops off, archived output will be
very useful - Run sysmon prior to major changes
- Hardware
- Application
- Configuration parameters
- Run sysmon for short periods of time
- Dont overflow monitor counters (unsigned int)
- A sample that includes both busy and idle time
may reveal neither. - Look at per second and per transaction numbers
- Better than raw numbers for making comparisons
- Be careful of conflicts
- Monitor and historical server share counters with
sysmon - Run only a single sysmon at a time
- Be careful of automated recommendations
- Newer versions make recommendations
- Look at the system over time and consider
implications prior to following any of them.
41Locking / Blocking Tempdb Contention
As the overall scalability of ASE dramatically
improved, tempdb became a bottleneck
- Why is this a problem?
- Creating and dropping tables requires exclusive
locks on system catalogs - High throughput environments create / drop many
temp tables - This is normally fast, but having to do physical
IO while holding these locks can queue up other
processes - Becomes a point of serialization in a high
throughput environment
42Tempdb Contention, Continued
Types of temporary tables
- Procedural tables
- Created inside of a stored procedure
- Visible only to that procedure and sub procedures
- Automatically dropped when proc exists
- Session tables
- Created outside of a proc using tablename
- Visible only to that session
- Automatically dropped when that session exists
- Shareable tables
- A regular table that lives in tempdb
- Visible to all users
- Will not be automatically dropped (except for ASE
reboot)
Contention
43Tempdb Contention, Continued
Some things to try
- On the application side, reduce the number of
tables created - Replace procedural tables with session tables
- proc can truncate the table at entry
- Have multiple sessions share the same regular
table - Row lock this table
- Requires additional logic at the application
level - One the server side, reduce latency while holding
locks - Place tempdb on a ramdisk
- Use named caches to take load out of default data
cache
44Tempdb Contention, Continued
Common Misconceptions
- Performing create followed by insertselect is
better than doing select into - Based on the belief that ASE holds the system
table locks for duration of select into - Reality is select into releases catalog locks
prior to inserting rows - Select into also faster due to reducing logging
- Truncating a table prior to dropping it reduces
time locks are held - This is absolutely true. During drop we must
deallocate all extents while holding the system
table locks. - However, ASE already truncates temp tables prior
to grabbing system table locks (introduced in
11.9.2) - Traceflag 3703 disables this, and it can increase
contention. A common misconception is that it
will reduce contention.
45Tempdb Contention, Continued
Big Improvements in 12.5.0.3
- Multiple tempdbs spread the load
- 12.5.0.3 allows DBAs to create and assign
multiple tempdbs, based on application, login,
etc. - Spreads the load and dramatically improves
throughput - Lazy log writing reduces physical IO
- Log buffers are not flushed to disk during commit
processing in temporary databases - Data buffers not flushed to disk at end of select
into where target table is in a temporary
database
46Deadlocks
- Deadlocks are almost always an application issue.
- Get details by turning on print deadlock
information - sp_configure option
- Replaces the older 1204 traceflag for every day
troubleshooting - Traceflags can provided additional details
- 1204 prints deadlock chains (similar to print
deadlock information above). This is required
for other deadlock traceflags, so it should only
be used when the trace flags below are required. - 1205 prints a stack trace for each task in the
deadlock chain. Tech support may request this in
rare circumstances (requires the use of trace
1204) - 1218 prints the query plan (showplan) upon a
deadlock. (requires the use of trace 1204) - My favorite way to understand deadlocks is to
diagram them. - Note 12.5.0.3 MDA tables also contain deadlock
information
47Other Lock Troubleshooting Tips
- Traceflag 1202 will insert blocked lock requests
into syslocks - Normally blocked requests do not go to syslocks,
and therefore are not displayed by sp_lock - 1202 causes them to be displayed for additional
troubleshooting - Traceflags 1212 and 1217 provide extensive lock
tracing - 1212 traces every lock acquisition and release
- 1217 traces lock acquisition and release for user
tables only (same as 1212, but doesnt display
system table locks) - Both of these traces provide A LOT of output, but
may be helpful in isolated test and development
environments - May provide a better idea of how the locking
system is handling your query
48Low-Level Task Related Problems
Things that make your process go boom
- A task may be terminated due to some internal
problems - Time slice violations
- Operating system signals
- Stack and stack guard overflows
- Normally the task will be terminated and ASE will
move on. However - If the terminating task is holding a spinlock,
ASE will shut itself down. - Stack overflow issues will always result in an
ASE shutdown
49Low-Level Task Related Problems Time Slice
Time slice violations prevent a run-away task
from taking over your server
- ASE uses a non-preemptive scheduling algorithm
- Each scheduled task is responsible for
voluntarily or automatically scheduling itself
out. - Every task is given a time quantum to run in,
along with a grace time. - If a task fails to yield after this grace time is
exhausted, it will be killed. - Tasks normally run for a very short time,
yielding when - The task must sleep for a resource, such a
network or physical I/O - Execution has completed and the task becomes idle
- They have exhausted their time quantum and are
executing in the grace period. ASE code
frequently checks to see if the time quantum has
been exhausted and if the task should yield. - Tasks may fail to yield within the grace period
when - The task is caught in a loop that does not
contain any yield point - A long code path is being executed that does not
contain a necessary yield point - An operating system call that was expected to
take a short period of time takes a long time
50Time Slice - Yielding
51Time Slice Errors
52Time Slice Errors, Continued
What should you do if you get a time slice error?
- Two sp_configure parameters are commonly
referenced - time slice -gt typically this should not be
changed as it will have implications for every
task. - cpu grace time -gt this can be increased to give
a task more time to execute before being killed.
This will only impact processes that are running
for an extremely long period of time. - Tweaking parameters is easy, but not always
fruitful - The default grace time of 500 already gives a
task 50 seconds to yield, which should be more
than enough time. - If it doesnt yield in 50 seconds, it probably
wont yield in 75 or 100 seconds, and you will
have tied up the engine for that much longer - Finding a solution
- Try to isolate the behavior, i.e. errors occur on
RPCs to a specific site. - Pick good key words from the stack and check the
solved cases database. - Work with Sybase Tech Support to find the root
cause. In some cases, a configured shared memory
dump may help.
53Low-Level Task Related Problems - Signals
ASE can survive most signals
- ASE tasks may receive fatal signals from the
operating system. - At boot time ASE installs handlers for many of
these signals so that it can respond
appropriately. - ASE prefers not to dump core and exit. Rather,
it will terminate its internal task that
generating the signal and continue scheduling
work on that engine. - A message in the error log will read Current
process infected with and a stack trace will
be generated. - Signals you may see
- Infected with (signal) 11 / Segmentation
Violation an attempt was made to read or write
memory outside of ASEs valid address space. - Infected with (signal) 10 / Bus Error similar to
signal 11, but may be related to a misaligned
attempt to access memory. - On the Windows platform, you would see a Storage
Access Violation with code 0xC0000005
54Signals, Continued
Dealing with signal problems
- Causes of fatal signals
- Coding error within ASE
- Operating system / hardware problem
- Improperly coded applications should not result
in ASE performing illegal memory access. - Be sure to see if this is a known problem
- Check solved cases and open bugs, using good key
words from the stack trace. - Consider a newer EBF is your release is
significantly behind. - Make sure your OS patches meet those required by
the release bulletin, as well as those
recommended by your OS vendor. - Reproduce, reproduce, reproduce
- It is very difficult to develop a code solution
to a signal without a reproduction. - Use information such a the SQL Causing Error,
application, and login to try to reproduce the
problem. - Tech support can and will assist without a
reproduction, and their efforts will largely
focus on developing a reproduction. - Shared memory dumps are very common in these
situations, and may help support in developing a
reproduction.
55Low-Level Task Related Problems Stacks
ASE must monitor its own stacks to look for
overflows
- ASE maintains stack space for its tasks
- The per-task allocation consists of space for the
stack, followed by a guard word. - The guard word allows a task to overflow its
stack without affecting adjacent memory.
Normally, it is empty space. - If the guard word is not empty, or if it has been
overflowed, ASE considers memory to be corrupt
and panics, shutting itself down immediately. - What to do
- The size of the stack and the guard word are
independently configurable. If you must make an
adjustment, change the stack size and not the
guard word. - Again, check OS patches. This is especially
important on Solaris. - A shared memory dump for the panic condition can
be gathered. This allows support to examine the
raw stacks. - If this is a recurring problem, try to isolate
specific queries.
56Backup and Recovery Problems
- Validate your backups!!!
- Most major backup and recovery issues occur when
a customer is forced to go to backup and then
finds out their backups cannot be loaded. - My advice do not consider you database backed-up
unless you have successfully tested a load of the
dump. - This can be an expensive bit of overhead on VLDB
systems, but it is a very good investment! - There is very little that anybody can do to get a
corrupted database or transaction dump to load.
57Thank You