CICS Transaction Server for z/OS
Version 4 Release 1
Recovery and Restart Guide
SC34-7012-01
Note
Before using this information and the product it supports, read the information in “Notices” on page 243.
This edition applies to Version 4 Release 1 of CICS Transaction Server for z/OS (product number 5655-S97) and to
all subsequent releases and modifications until otherwise indicated in new editions.
Recovering information from the system log . . 61
Driving backout processing for in-flight units of
work...............61
Concurrent processing of new work and backout 61
Other backout processing.........62
Rebuilding the CICS state after an abnormal
termination ..............62
Files ................62
Temporary storage ...........63
Transient data............63
Start requests .............64
Terminal control resources ........64
Distributed transaction resources ......65
Chapter 7. Automatic restart
management ............67
CICS ARM processing ...........67
Registering with ARM ..........68
Waiting for predecessor subsystems .....68
De-registering from ARM.........68
Failing to register ...........69
ARM couple data sets ..........69
CICS restart JCL and parameters .......69
Workload policies ............70
Connecting to VTAM ...........70
The COVR transaction..........71
Messages associated with automatic restart . . . 71
Automatic restart of CICS data-sharing servers. . 71
Server ARM processing .........71
Chapter 8. Unit of work recovery and
abend processing ..........73
Unit of work recovery ...........73
Transaction backout..........74
Backout-failed recovery .........79
Commit-failed recovery .........83
Indoubt failure recovery .........84
Investigating an indoubt failure.......85
Recovery from failures associated with the coupling
facility ................88
Cache failure support ..........88
Lost locks recovery ...........89
Connection failure to a coupling facility cache
structure ..............91
Connection failure to a coupling facility lock
structure ..............91
MVS system recovery and sysplex recovery. . 91
Transaction abend processing ........92
Exit code ..............92
Abnormal termination of a task.......93
Actions taken at transaction failure ......94
Processing operating system abends and program
checks ................94
Chapter 9. Communication error
processing .............97
Terminal error processing..........97
Node error program (DFHZNEP) ......97
Terminal error program (DFHTEP).....97
Intersystem communication failures ......98
Part 3. Implementing recovery and
restart..............99
Chapter 10. Planning aspects of
recovery.............101
Application design considerations ......101
Questions relating to recovery requirements . . 101
Validate the recovery requirements statement . . 102
Designing the end user's restart procedure . . . 103
End user’s standby procedures ......103
Communications between application and user . . 103
Security ..............104
System definitions for recovery-related functions . . 104
Documentation and test plans ........105
Chapter 11. Defining system and
general log streams........107
Defining log streams to MVS ........108
Defining system log streams ........108
Specifying a JOURNALMODEL resource
definition..............109
Model log streams for CICS system logs .. . 110
Activity keypointing ..........112
Defining forward recovery log streams .....116
Model log streams for CICS general logs .. . 117
Merging data on shared general log streams . . 118
Defining the log of logs ..........118
Log of logs failure ...........119
Reading log streams offline ........119
Effect of daylight saving time changes .....120
Adjusting local time ..........120
Time stamping log and journal records ....120
Chapter 12. Defining recoverability for
CICS-managed resources ......123
Recovery for transactions .........123
Defining transaction recovery attributes. . . 123
Recovery for files ............125
VSAM files .............125
Basic direct access method (BDAM) .....126
Defining files as recoverable resources ....126
File recovery attribute consistency checking
(non-RLS).............129
Implementing forward recovery with
user-written utilities ..........131
Implementing forward recovery with CICS
VSAM Recovery MVS/ESA.......131
Recovery for intrapartition transient data ....131
Backward recovery..........131
Forward recovery ...........133
Recovery for extrapartition transient data ....134
Input extrapartition data sets .......134
Output extrapartition data sets......135
Using post-initialization (PLTPI) programs. . 135
Recovery for temporary storage .......135
Backward recovery..........135
Forward recovery ...........136
Recovery for Web services .........136
Configuring CICS to support persistent
messages ..............136
Defining local queues in a service provider .. 137
Persistent message processing .......138
Chapter 13. Programming for recovery 141
Designing applications for recovery ......141
Splitting the application into transactions . . . 141
SAA-compatible applications .......143
Program design............143
Dividing transactions into units of work .. . 143
Processing dialogs with users .......144
Mechanisms for passing data between
transactions .............145
Designing to avoid transaction deadlocks . . . 146
Implications of interval control START requests 147
Implications of automatic task initiation (TD
trigger level)............148
Implications of presenting large amounts of data
to the user .............148
Managing transaction and system failures ....149
Transaction failures ..........149
System failures ............151
Handling abends and program level abend exits 151
Processing the IOERR condition ......152
START TRANSID commands .......153
PL/I programs and error handling .....153
Locking (enqueuing on) resources in application
programs...............153
Implicit locking for files .........154
Implicit enqueuing on logically recoverable TD
destinations .............157
Implicit enqueuing on recoverable temporary
storage queues ............157
Implicit enqueuing on DL/I databases with
DBCTL ..............158
Explicit enqueuing (by the application
programmer) ............158
Possibility of transaction deadlock .....159
User exits for transaction backout......160
Where you can add your own code .....160
XRCINIT exit ............161
XRCINPT exit ............161
XFCBFAIL global user exit ........161
XFCLDEL global user exit ........162
XFCBOVER global user exit.......162
XFCBOUT global user exit ........162
Coding transaction backout exits ......162
Chapter 14. Using a program error
program (PEP) ...........163
The CICS-supplied PEP ..........163
Your own PEP.............164
Omitting the PEP ............165
Chapter 15. Resolving retained locks
on recoverable resources ......167
Quiescing RLS data sets ..........167
The RLS quiesce and unquiesce functions . . . 168
Switching from RLS to non-RLS access mode. . . 172
Exception for read-only operations .....172
What can prevent a switch to non-RLS access
mode?...............173
Resolving retained locks before opening data
sets in non-RLS mode.........174
Resolving retained locks and preserving data
integrity ..............176
Choosing data availability over data integrity . . 177
The batch-enabling sample programs ....178
CEMT command examples ........178
A special case: lost locks.........180
Overriding retained locks ........180
Coupling facility data table retained locks ....182
Chapter 16. Moving recoverable data
sets that have retained locks....183
Procedure for moving a data set with retained
locks ................183
Using the REPRO method ........183
Using the EXPORT and IMPORT functions . . 185
Rebuilding alternate indexes .......186
Chapter 17. Forward recovery
procedures ............187
Forward recovery of data sets accessed in RLS
mode ................187
Recovery of data set with volume still available 188
Recovery of data set with loss of volume . . . 189
Forward recovery of data sets accessed in non-RLS
mode ................198
Procedure for failed RLS mode forward recovery
operation ...............198
Procedure for failed non-RLS mode forward
recovery operation...........201
Chapter 18. Backup-while-open (BWO) 203
BWO and concurrent copy .........203
BWO and backups..........203
BWO requirements...........204
Hardware requirements .........205
Which data sets are eligible for BWO .....205
How you request BWO ..........206
Specifying BWO using access method services . . 206
Specifying BWO on CICS file resource
definitions .............207
Removing BWO attributes .........208
Systems administration ..........208
BWO processing ............209
File opening .............210
File closing (non-RLS mode) .......212
Shutdown and restart.........213
Data set backup and restore .......213
Forward recovery logging ........215
Forward recovery ...........216
Recovering VSAM spheres with AIXs ....217
An assembler program that calls DFSMS callable
services ...............218
Chapter 19. Disaster recovery ....223
Why have a disaster recovery plan? ......223
Disaster recovery testing.........224
Six tiers of solutions for off-site recovery ....225
Tier 0: no off-site data.........225
Tier 1 - physical removal........225
Tier 2 - physical removal with hot site ....227
Tier 3 - electronic vaulting ........227
Tier 0–3 solutions ...........228
Tier 4 - active secondary site .......229
Tier 5 - two-site, two-phase commit .....231
Tier 6 - minimal to zero data loss......231
Tier 4–6 solutions ...........233
Disaster recovery and high availability .....234
Peer-to-peer remote copy (PPRC) and extended
remote copy (XRC) ..........234
Remote Recovery Data Facility......236
Choosing between RRDF and 3990-6 solutions . . 237
Disaster recovery personnel considerations. . 237
Returning to your primary site......238
Disaster recovery facilities .........238
MVS system logger recovery support ....238
CICS VSAM Recovery QSAM copy .....239
Remote Recovery Data Facility support....239
CICS VR shadowing ..........239
CICS emergency restart considerations .....239
Indoubt and backout failure support....239
Remote site recovery for RLS-mode data sets . . 239
Final summary .............240
Part 4. Appendixes ........241
Notices ..............243
Trademarks ..............244
Bibliography............245
CICS books for CICS Transaction Server for z/OS . . 245
CICSPlex SM books for CICS Transaction Server
for z/OS ...............246
Other CICS publications ..........246
Accessibility............247
Index ...............249
Preface
What this book is about
This book contains guidance about determining your CICS® recovery and restart
needs, deciding which CICS facilities are most appropriate, and implementing your
design in a CICS region.
The information in this book is generally restricted to a single CICS region. For
information about interconnected CICS regions, see the CICS Intercommunication Guide.
This manual does not describe recovery and restart for the CICS front end
programming interface. For information on this topic, see the CICS Front End Programming Interface User's Guide.
Who should read this book
This book is for those responsible for restart and recovery planning, design, and
implementation—either for a complete system, or for a particular function or
component.
What you need to know to understand this book
To understand this book, you should have experience of installing CICS and the
products with which it is to work, or of writing CICS application programs, or of
writing exit programs.
You should also understand your application requirements well enough to be able
to make decisions about realistic recovery and restart needs, and the trade-offs
between those needs and the performance overhead they incur.
How to use this book
This book deals with a wide variety of topics, all of which contribute to the
recovery and restart characteristics of your system.
It’s unlikely that any one reader would have to implement all the possible
techniques discussed in this book. By using the table of contents, you can find the
sections relevant to your work. Readers new to recovery and restart should find
the first section helpful, because it introduces the concepts of recovery and restart.
Changes in CICS Transaction Server for z/OS, Version 4
Release 1
For information about changes that have been made in this release, please refer to
What's New in the information center, or the following publications:
v CICS Transaction Server for z/OS What's New
v CICS Transaction Server for z/OS Upgrading from CICS TS Version 3.2
v CICS Transaction Server for z/OS Upgrading from CICS TS Version 3.1
v CICS Transaction Server for z/OS Upgrading from CICS TS Version 2.3
Any technical changes that are made to the text after release are indicated by a
vertical bar (|) to the left of each new or changed line of information.
Chapter 1. Recovery and restart facilities

It is very important that a transaction processing system such as CICS can restart
and recover following a failure. This section describes some of the basic concepts
of the recovery and restart facilities provided by CICS.
Problems that occur in a data processing system could be failures with
communication protocols, data sets, programs, or hardware. These problems are
potentially more severe in online systems than in batch systems, because the data
is processed in an unpredictable sequence from many different sources.
Online applications therefore require a system with special mechanisms for
recovery and restart that batch systems do not require. These mechanisms ensure
that each resource associated with an interrupted online application returns to a
known state so that processing can restart safely. Together with suitable operating
procedures, these mechanisms should provide automatic recovery from failures
and allow the system to restart with the minimum of disruption.
The two main recovery requirements of an online system are:
v To maintain the integrity and consistency of data
v To minimize the effect of failures
CICS provides a facility to meet these two requirements called the recovery
manager. The CICS recovery manager provides the recovery and restart functions
that are needed in an online system.
Maintaining the integrity of data
Data integrity means that the data is in the form you expect and has not been
corrupted. The objective of recovery operations on files, databases, and similar data
resources is to maintain and restore the integrity of the information.
Recovery must also ensure consistency of related changes, whereby they are made
as a whole or not at all. (The term resources used in this book, unless stated
otherwise, refers to data resources.)
Logging changes
One way of maintaining the integrity of a resource is to keep a record, or log, of all
the changes made to a resource while the system is executing normally. If a failure
occurs, the logged information can help recover the data.
An online system can use the logged information in two ways:
1. It can be used to back out incomplete or invalid changes to one or more
resources. This is called backward recovery, or backout. For backout, it is
necessary to record the contents of a data element before it is changed. These
records are called before-images. In general, backout is applicable to processing
failures that prevent one or more transactions (or a batch program) from
completing.
2. It can be used to reconstruct changes to a resource, starting with a backup copy
of the resource taken earlier. This is called forward recovery. For forward
recovery, it is necessary to record the contents of a data element after it is
changed. These records are called after-images.
In general, forward recovery is applicable to data set failures, or failures in
similar data resources, which cause data to become unusable because it has
been corrupted or because the physical storage medium has been damaged.
Minimizing the effect of failures
An online system should limit the effect of any failure. Where possible, a failure
that affects only one user, one application, or one data set should not halt the
entire system.
Furthermore, if processing for one user is forced to stop prematurely, it should be
possible to back out any changes made to any data sets as if the processing had
not started.
If processing for the entire system stops, there may be many users whose updating
work is interrupted. On a subsequent startup of the system, only those data set
updates in process (in-flight) at the time of failure should be backed out. Backing
out only the in-flight updates makes restart quicker, and reduces the amount of
data to reenter.
Ideally, it should be possible to restore the data to a consistent, known state
following any type of failure, with minimal loss of valid updating activity.
The role of CICS
The CICS recovery manager and the log manager perform the logging functions
necessary to support automatic backout. Automatic backout is provided for most
CICS resources, such as databases, files, and auxiliary temporary storage queues,
either following a transaction failure or during an emergency restart of CICS.
If the backout of a VSAM file fails, CICS backout failure processing ensures that all
locks on the backout-failed records are retained, and the backout-failed parts of the
unit of work (UOW) are shunted to await retry. The VSAM file remains open for
use. For an explanation of shunted units of work and retained locks, see “Shunted
units of work” on page 13.
If the cause of the backout failure is a physically damaged data set, and provided
the damage affects only a localized section of the data set, you can choose a time
when it is convenient to take the data set offline for recovery. You can then use the
forward recovery log with a forward recovery utility, such as CICS VSAM
Recovery, to restore the data set and re-enable it for CICS use.
Note: In many cases, a data set failure also causes a processing failure. In this
event, forward recovery must be followed by backward recovery.
You don't need to shut CICS down to perform these recovery operations. For data
sets accessed by CICS in VSAM record-level sharing (RLS) mode, you can quiesce
the data set to allow you to perform the forward recovery offline. On completion
of forward recovery, setting the data set to unquiesced causes CICS to perform the
backward recovery automatically.
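For example, assuming a data set named CICSTS.ACCT.KSDS (an invented name used
only for illustration), the quiesce and unquiesce operations can be issued with the
CEMT master terminal transaction:

   CEMT SET DSNAME(CICSTS.ACCT.KSDS) QUIESCED
   (perform the forward recovery offline, then)
   CEMT SET DSNAME(CICSTS.ACCT.KSDS) UNQUIESCED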
For files accessed in non-RLS mode, you can issue a SET DSNAME RETRY command
after the forward recovery, which causes CICS to perform the backward recovery
online.
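For example, using the same invented data set name, the retry might be requested
with:

   CEMT SET DSNAME(CICSTS.ACCT.KSDS) RETRY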
Another way is to shut down CICS with an immediate shutdown and perform the
forward recovery, after which a CICS emergency restart performs the backward
recovery.
Recoverable resources
In CICS, a recoverable resource is any resource with recorded recovery information
that can be recovered by backout.
The following resources can be made recoverable:
v CICS files that relate to:
– VSAM data sets
– BDAM data sets
v Data tables (but user-maintained data tables are not recovered after a CICS
failure, only after a transaction failure)
v Coupling facility data tables
v The CICS system definition (CSD) file
v Intrapartition transient data destinations
v Auxiliary temporary storage queues
v Resource definitions dynamically installed using resource definition online
(RDO)
In some environments, a VSAM file managed by CICS file control might need to
remain online and open for update for extended periods. You can use a backup
manager, such as DFSMSdss, in a separate job under MVS™, to back up a VSAM
file at regular intervals while it is open for update by CICS applications. This
operation is known as backup-while-open (BWO). Even changes made to the
VSAM file while the backup is in progress are recorded.
DFSMSdss is a functional component of DFSMS/MVS, and is the primary data
mover. When used with supporting hardware, DFSMSdss also provides a
concurrent copy capability. This capability enables you to copy or back up data
while that data is being used.
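A minimal sketch of such a backup job follows. The job name, DD names, data set
names, and allocation values are all invented for illustration; the options your
installation uses for a BWO backup will differ:

   //BWOBKUP  EXEC PGM=ADRDSSU
   //SYSPRINT DD SYSOUT=*
   //BACKUP   DD DSN=CICSTS.ACCT.BACKUP,DISP=(NEW,CATLG),
   //            UNIT=SYSDA,SPACE=(CYL,(50,10),RLSE)
   //SYSIN    DD *
     DUMP DATASET(INCLUDE(CICSTS.ACCT.KSDS)) -
          OUTDDNAME(BACKUP) CONCURRENT
   /*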
If a data set failure occurs, you can use a backup of the data set and a forward
recovery utility, such as CICS VSAM Recovery (CICSVR), to recover the VSAM file.
CICS backward recovery (backout)
Backward recovery, or backout, is a way of undoing changes made to resources
such as files or databases.
Backout is one of the fundamental recovery mechanisms of CICS. It relies on
recovery information recorded while CICS and its transactions are running
normally.
Before a change is made to a resource, the recovery information for backout, in the
form of a before-image, is recorded on the CICS system log. A before-image is a
record of what the resource was like before the change. These before-images are
used by CICS to perform backout in two situations:
v In the event of failure of an individual in-flight transaction, which CICS backs
out dynamically at the time of failure (dynamic transaction backout)
v In the event of an emergency restart, when CICS backs out all those transactions
that were in-flight at the time of the CICS failure (emergency restart backout).
Although these occur in different situations, CICS uses the same backout process in
each case. CICS does not distinguish between dynamic backout and emergency
restart backout. See Chapter 6, “CICS emergency restart,” on page 61 for an
explanation of how CICS reattaches failed in-flight units of work in order to
perform transaction backout following an emergency restart.
Each CICS region has only one system log, which cannot be shared with any other
CICS region. The system log is written to a unique MVS system logger log stream.
The CICS system log is intended for use only for recovery purposes, for example
during dynamic transaction backout, or during emergency restart. It is not meant
to be used for any other purpose.
CICS supports two physical log streams: a primary and a secondary log stream.
CICS uses the secondary log stream to store log records of failed units of work,
and of long-running tasks that have not caused any data to be written to the log
for two complete activity keypoints. Failed units of work are moved from the
primary to the secondary log stream at the next activity keypoint. Logically, the
primary and secondary log streams form one log, which as a general rule is
referred to as the system log.
Dynamic transaction backout
In the event of a transaction failure, or if an application explicitly requests a
syncpoint rollback, the CICS recovery manager uses the system log data to drive
the resource managers to back out any updates made by the current unit of work.
This process, known as dynamic transaction backout, takes place while the rest of
CICS continues normally.
For example, when any updates made to a recoverable data set are to be backed
out, file control uses the system log records to reverse the updates. When all the
updates made in the unit of work have been backed out, the unit of work is
completed. The locks held on the updated records are freed if the backout is
successful.
For data sets open in RLS mode, CICS requests VSAM RLS to release the locks; for
data sets open in non-RLS mode, the CICS enqueue domain releases the locks
automatically.
See “Units of work” on page 13 for a description of units of work.
Emergency restart backout
If a CICS region fails, you restart CICS with an emergency restart to back out any
transactions that were in-flight at the time of failure.
During emergency restart, the recovery manager uses the system log data to drive
backout processing for any units of work that were in-flight at the time of the
failure. The backout of units of work during emergency restart is the same as a
dynamic backout; there is no distinction between the backout that takes place at
emergency restart and that which takes place at any other time. At this point,
while recovery processing continues, CICS is ready to accept new work for normal
processing.
The recovery manager also drives:
v The backout processing for any units of work that were in a backout-failed state
at the time of the CICS failure
v The commit processing for any units of work that had not finished commit
processing at the time of failure (for example, for resource definitions that were
being installed when CICS failed)
v The commit processing for any units of work that were in a commit-failed state
at the time of the CICS failure
See “Unit of work recovery” on page 73 for an explanation of the terms
commit-failed and backout-failed.
The recovery manager drives these backout and commit processes because the
condition that caused them to fail might be resolved by the time CICS restarts. If
the condition that caused a failure has not been resolved, the unit of work remains
in backout- or commit-failed state. See “Backout-failed recovery” on page 79 and
“Commit-failed recovery” on page 83 for more information.
CICS forward recovery
Some types of data set failure cannot be corrected by backward recovery; for
example, failures that cause physical damage to a database or data set.
Recovery from failures of this type is usually based on the following actions:
1. Take a backup copy of the data set at regular intervals.
2. Record an after-image of every change to the data set on the forward recovery
log (a general log stream managed by the MVS system logger).
3. After the failure, restore the most recent backup copy of the failed data set, and
use the information recorded on the forward recovery log to update the data
set with all the changes that have occurred since the backup copy was taken.
These operations are known as forward recovery. On completion of the forward
recovery, as a fourth step, CICS also performs backout of units of work that failed
in-flight as a result of the data set failure.
Forward recovery of CICS data sets
CICS supports forward recovery of VSAM data sets updated by CICS file control
(that is, by files or CICS-maintained data tables defined by a CICS file definition).
CICS writes the after-images of changes made to a data set to a forward recovery
log, which is a general log stream managed by the MVS system logger.
CICS obtains the log stream name of a VSAM forward recovery log in one of two
ways (illustrative definitions follow the note after this list):
1. For files opened in VSAM record level sharing (RLS) mode, the explicit log
stream name is obtained directly from the VSAM ICF catalog entry for the data
set.
2. For files in non-RLS mode, the log stream name is derived from:
v The VSAM ICF catalog entry for the data set if it is defined there, and if
RLS=YES is specified as a system initialization parameter. In this case, CICS
file control manages writes to the log stream directly.
v A journal model definition referenced by a forward recovery journal name
specified in the file resource definition.
Forward recovery journal names are of the form DFHJnn where nn is a
number in the range 1–99 and is obtained from the forward recovery log id
(FWDRECOVLOG) in the FILE resource definition.
In this case, CICS creates a journal entry for the forward recovery log, which
can be mapped by a JOURNALMODEL resource definition. Although this
method enables user application programs to reference the log, and write
user journal records to it, doing so is not recommended. Ensure that
forward recovery log streams are reserved for forward recovery data only.
Note: You cannot use a CICS system log stream as a forward recovery log.
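As an illustration of both methods, every name and number below is invented
rather than a supplied default. The ALTER command records the forward recovery
attributes in the ICF catalog (used for RLS-mode files, and for non-RLS files when
RLS=YES is in effect); the DEFINE statements, shown in DFHCSDUP command format,
sketch the alternative journal-name route for non-RLS files:

   ALTER CICSTS.ACCT.KSDS LOG(ALL) LOGSTREAMID(CICSUSER.ACCT.FWDLOG)

   DEFINE FILE(ACCTFIL) GROUP(ACCTGRP)
          DSNAME(CICSTS.ACCT.KSDS)
          RECOVERY(ALL) FWDRECOVLOG(23)
   DEFINE JOURNALMODEL(ACCTJM) GROUP(ACCTGRP)
          JOURNALNAME(DFHJ23) TYPE(MVS)
          STREAMNAME(CICSUSER.ACCT.FWDLOG)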
The VSAM recovery options or the CICS file control recovery options that you
require to implement forward recovery are explained further in “Defining files as
recoverable resources” on page 126.
For details of procedures for performing forward recovery, see Chapter 17,
“Forward recovery procedures,” on page 187.
Forward recovery for non-VSAM resources
CICS does not provide forward recovery logging for non-VSAM resources, such as
BDAM files. However, you can provide this support yourself by ensuring that the
necessary information is logged to a suitable log stream. In the case of BDAM files,
you can use the CICS autojournaling facility to write the necessary after-images to
a log stream.
Failures that require CICS recovery processing
The following section briefly describes CICS recovery processing after a
communication failure, transaction failure, and system failure.
Whenever possible, CICS attempts to contain the effects of a failure, typically by
terminating only the offending task while all other tasks continue normally. The
updates performed by a prematurely terminated task can be backed out
automatically.
CICS recovery processing following a communication failure
Causes of communication failure include:
v Terminal failure
v Printer terminal running out of paper
v Power failure at a terminal
v Invalid SNA status
v Network path failure
v Loss of an MVS image that is a member of a sysplex
There are two aspects to processing following a communications failure:
1. If the failure occurs during a conversation that is not engaged in syncpoint
protocol, CICS must terminate the conversation and allow customized handling
of the error, if required. An example of when such customization is helpful is
for 3270 device types. This is described below.
2. If the failure occurs during the execution of a CICS syncpoint, where the
conversation is with another resource manager (perhaps in another CICS
region), CICS handles the resynchronization. This is described in the CICS Intercommunication Guide.
If the link fails and is later reestablished, CICS and its partners use the SNA
set-and-test-sequence-numbers (STSN) command to find out what they were doing
(backout or commit) at the time of link failure. For more information on link
failure, see the CICS Intercommunication Guide.
When communication fails, the communication system access method either retries
the transmission or notifies CICS. If a retry is successful, CICS is not informed.
Information about the error can be recorded by the operating system. If the retries
are not successful, CICS is notified.
When CICS detects a communication failure, it gives control to one of two
programs:
v The node error program (NEP) for VTAM® logical units
v The terminal error program (TEP) for non-VTAM terminals
Both dummy and sample versions of these programs are provided by CICS. The
dummy versions do nothing; they allow the default actions selected by CICS to
proceed. The sample versions show how to write your own NEP or TEP to change
the default actions.
The types of processing that might be in a user-written NEP or TEP are:
v Logging additional error information. CICS provides some error information
when an error occurs.
v Retrying the transmission. This is not recommended because the access method
will already have made several attempts.
v Leaving the terminal out of service. This means that it is unavailable to the
terminal operator until the problem is fixed and the terminal is put back into
service by means of a master terminal transaction (see the example after this
list).
v Abending the task if it is still active (see “CICS recovery processing following a
transaction failure” on page 10).
v Reducing the amount of error information printed.
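For example, after the problem has been fixed, an operator might restore such a
terminal with the master terminal transaction (the terminal name here is
invented):

   CEMT SET TERMINAL(T123) INSERVICE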
For more information about NEPs and TEPs, see Chapter 9, “Communication error
processing,” on page 97.
XCF/MRO partner failures
Loss of communication between CICS regions can be caused by the loss of an MVS
image in which CICS regions are running.
If the regions are communicating over XCF/MRO links, the loss of connectivity
may not be immediately apparent because XCF waits for a reply to a message it
issues.
The loss of an MVS image in a sysplex is detected by XCF in another MVS, and
XCF issues message IXC402D. If the failed MVS is running CICS regions connected
through XCF/MRO to CICS regions in another MVS, tasks running in the active
regions are initially suspended in an IRLINK WAIT state.
XCF/MRO-connected regions do not detect the loss of an MVS image and its
resident CICS regions until an operator replies to the XCF IXC402D message.
When the operator replies to IXC402D, the CICS interregion communication
program, DFHIRP, is notified; the suspended tasks are abended and the MRO
connections are closed. Until the reply is issued to IXC402D, an INQUIRE
CONNECTION command continues to show connections to regions in the failed
MVS as in service and normal.
When the failed MVS image and its CICS regions are restarted, the interregion
communication links are reopened automatically.
CICS recovery processing following a transaction failure
Transactions can fail for a variety of reasons, including a program check in an
application program, an invalid request from an application that causes an abend,
a task issuing an ABEND request, or I/O errors on a data set that is being accessed
by a transaction.
During normal execution of a transaction working with recoverable resources,
CICS stores recovery information in the system log. If the transaction fails, CICS
uses the information from the system log to back out the changes made by the
interrupted unit of work. Recoverable resources are thus not left in a partially
updated or inconsistent state. Backing out an individual transaction is called
dynamic transaction backout.
After dynamic transaction backout has completed, the transaction can restart
automatically without the operator being aware of it happening. This function is
especially useful in those cases where the cause of transaction failure is temporary
and an attempt to rerun the transaction is likely to succeed (for example, DL/I
program isolation deadlock). The conditions when a transaction can be
automatically restarted are described under “Abnormal termination of a task” on
page 93.
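Whether a transaction is eligible for automatic restart is controlled by the
RESTART attribute on its transaction definition; the following is only a sketch,
with invented transaction, group, and program names:

   DEFINE TRANSACTION(ACCT) GROUP(ACCTGRP)
          PROGRAM(ACCTPGM) RESTART(YES)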
If dynamic transaction backout fails, perhaps because of an I/O error on a VSAM
data set, CICS backout failure processing shunts the unit of work and converts the
locks that are held on the backout-failed records into retained locks. The data set
remains open for use, allowing the shunted unit of work to be retried. If backout
keeps failing because the data set is damaged, you can create a new data set using
a backup copy and then perform forward recovery, using a utility such as CICSVR.
When the data set is recovered, retry the shunted unit of work to complete the
failed backout and release the locks.
Chapter 8, “Unit of work recovery and abend processing,” on page 73 gives more
details about CICS processing of a transaction failure.
CICS recovery processing following a system failure
Causes of a system failure include a processor failure, the loss of an electrical power
supply, an operating system failure, or a CICS failure.
During normal execution, CICS stores recovery information on its system log stream, which is managed by the MVS system logger. If you specify
START=AUTO, CICS automatically performs an emergency restart when it restarts
after a system failure.
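For example, the system initialization parameters for the region might include the
following; the activity keypoint frequency shown here is only an illustrative
value, not a recommendation:

   START=AUTO,
   AKPFREQ=4000,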
During an emergency restart, the CICS log manager reads the system log backward
and passes information to the CICS recovery manager.
The CICS recovery manager then uses the information retrieved from the system
log to:
v Back out recoverable resources.
v Recover changes to terminal resource definitions. (All resource definitions
installed at the time of the CICS failure are initially restored from the CICS
global catalog.)
A special case of CICS processing following a system failure is covered in
Chapter 6, “CICS emergency restart,” on page 61.
Chapter 2. Resource recovery in CICS
Before you begin to plan and implement resource recovery in CICS, you should
understand the concepts involved, including units of work, logging and journaling.
Units of work
When resources are being changed, there comes a point when the changes are
complete and do not need backout if a failure occurs later. The period between the
start of a particular set of changes and the point at which they are complete is
called a unit of work (UOW). The unit of work is a fundamental concept of all CICS
backout mechanisms.
From the application designer's point of view, a UOW is a sequence of actions that
needs to be complete before any of the individual actions can be regarded as
complete. To ensure data integrity, a unit of work must be atomic, consistent,
isolated, and durable.
The CICS recovery manager operates with units of work. If a transaction that
consists of multiple UOWs fails, or the CICS region fails, committed UOWs are not
backed out.
A unit of work can be in one of the following states:
v Active (in-flight)
v Shunted following a failure of some kind
v Indoubt pending the decision of the unit of work coordinator.
v Completed and no longer of interest to the recovery manager
Shunted units of work
A shunted unit of work is one awaiting resolution of an indoubt failure, a commit
failure, or a backout failure. The CICS recovery manager attempts to complete a
shunted unit of work when the failure that caused it to be shunted has been
resolved.
A unit of work can be unshunted and then shunted again (in theory, any number
of times). For example, a unit of work could go through the following stages:
1. A unit of work fails indoubt and is shunted.
2. After resynchronization, CICS finds that the decision is to back out the indoubt
unit of work.
3. Recovery manager unshunts the unit of work to perform backout.
4. If backout fails, it is shunted again.
5. Recovery manager unshunts the unit of work to retry the backout.
6. Steps 4 and 5 can occur several times until the backout succeeds.
These situations can persist for some time, depending on how long it takes to
resolve the cause of the failure. Because it is undesirable for transaction resources
to be held up for too long, CICS attempts to release as many resources as possible
while a unit of work is shunted. This is generally achieved by abending the user
task to which the unit of work belongs, resulting in the release of the following:
v Locks on recoverable data. If the unit of work is shunted indoubt, all locks are
retained. If it is shunted because of a commit- or backout-failure, only the locks
on the failed resources are retained.
v System log records, which include:
– Records written by the resource managers, which they need to perform
recovery in the event of transaction or CICS failures. Generally, these records
are used to support transaction backout, but the RDO resource manager also
writes records for rebuilding the CICS state in the event of a CICS failure.
– CICS recovery manager records, which include identifiers relating to the
original transaction such as:
- The transaction ID
- The task ID
- The CICS terminal ID
- The VTAM LUNAME
- The user ID
- The operator ID.
Locks
For files opened in RLS mode, VSAM maintains a single central lock structure
using the lock-assist mechanism of the MVS coupling facility. This central lock
structure provides sysplex-wide locking at a record level. Control interval (CI)
locking is not used.
The locks for files accessed in non-RLS mode, the scope of which is limited to a
single CICS region, are file-control managed locks. Initially, when CICS processes a
read-for-update request, CICS obtains a CI lock. File control then issues an ENQ
request to the enqueue domain to acquire a CICS lock on the specific record. This
enables file control to notify VSAM to release the CI lock before returning control
to the application program. Releasing the CI lock minimizes the potential for
deadlocks to occur.
For coupling facility data tables updated under the locking model, the coupling
facility data table server stores the lock with its record in the CFDT. As in the case
of RLS locks, storing the lock with its record in the coupling facility list structure
that holds the coupling facility data table ensures sysplex-wide locking at record
level.
For both RLS and non-RLS recoverable files, CICS releases all locks on completion
of a unit of work. For recoverable coupling facility data tables, the locks are
released on completion of a unit of work by the CFDT server.
Active and retained states for locks
CICS supports active and retained states for locks.
When a lock is first acquired, it is an active lock. It remains an active lock until
the unit of work completes successfully, when it is released. It is converted into a
retained lock if the unit of work fails, or if CICS or an SMSVSAM server fails:
v If a unit of work fails, VSAM RLS or the CICS enqueue domain continues to
hold the record locks for recoverable data sets that were owned by the failed
unit of work, but converts them into retained locks. Retaining locks ensures that
data integrity for those records is maintained until the unit of work is completed.
v If a CICS region fails, locks are converted into retained locks to ensure that data
integrity is maintained while CICS is being restarted.
v If an SMSVSAM server fails, locks are converted into retained locks (with the
conversion being carried out by the other servers in the sysplex, or by the first
server to restart if all servers have failed). This means that a UOW that held
active RLS locks will hold retained RLS locks following the failure of an
SMSVSAM server.
Converting active locks into retained locks not only protects data integrity. It also
ensures that new requests for locks owned by the failed unit of work do not wait,
but instead are rejected with the LOCKED response.
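A file control request that encounters such a retained lock raises the LOCKED
condition. The following COBOL fragment is a sketch only (the file name, working
storage fields, and paragraph name are invented) of how an application might
test for that response instead of abending:

   EXEC CICS READ FILE('ACCTFIL') INTO(WS-RECORD) RIDFLD(WS-KEY)
        UPDATE RESP(WS-RESP)
   END-EXEC
   IF WS-RESP = DFHRESP(LOCKED)
   *> The record is protected by a retained lock from a failed unit
   *> of work; report it and let the user retry later.
      PERFORM REPORT-LOCKED-RECORD
   END-IF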
Synchronization points
The end of a UOW is indicated to CICS by a synchronization point, usually
abbreviated to syncpoint.
A syncpoint arises in the following ways:
v Implicitly at the end of a transaction as a result of an EXEC CICS RETURN
command at the highest logical level. This means that a UOW cannot span tasks.
v Explicitly by EXEC CICS SYNCPOINT commands issued by an application program
at appropriate points in the transaction.
v Implicitly through a DL/I program specification block (PSB) termination (TERM)
call or command. This means that only one DL/I PSB can be scheduled within a
UOW.
Note that an explicit EXEC CICS SYNCPOINT command, or an implicit syncpoint at
the end of a task, implies a DL/I PSB termination call.
v Implicitly through one of the following CICS commands:
A UOW that does not change a recoverable resource has no meaningful effect for
the CICS recovery mechanisms. Nonrecoverable resources are never backed out.
A unit of work can also be ended by backout, which causes a syncpoint in one of
the following ways:
v Implicitly when a transaction terminates abnormally, and CICS performs
dynamic transaction backout
v Explicitly by EXEC CICS SYNCPOINT ROLLBACK commands issued by an application
program to back out changes made by the UOW (a COBOL fragment illustrating
both commands follows this list).
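The following COBOL fragment is a sketch with invented file and field names; the
first unit of work is committed explicitly, and the second is backed out when a
validation check fails:

   *> First unit of work: add a record, then commit it
   EXEC CICS WRITE FILE('ACCTFIL') FROM(WS-REC1) RIDFLD(WS-KEY1)
   END-EXEC
   EXEC CICS SYNCPOINT END-EXEC
   *> Second unit of work: add another record, but back it out
   *> if the subsequent validation fails
   EXEC CICS WRITE FILE('ACCTFIL') FROM(WS-REC2) RIDFLD(WS-KEY2)
   END-EXEC
   IF WS-VALID NOT = 'Y'
      EXEC CICS SYNCPOINT ROLLBACK END-EXEC
   END-IF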
Examples of synchronization points
In Figure 1, task A is a nonconversational (or pseudoconversational) task with one
UOW, and task B is a multiple UOW task (typically a conversational task in which
each UOW accepts new data from the user). The figure shows how UOWs end at
syncpoints. During the task, the application program can issue syncpoints
explicitly, and, at the end, CICS issues a syncpoint.
Figure 1. Units of work and syncpoints. Task A runs from start of task (SOT) to
end of task (EOT) as a single unit of work (UOW). Task B contains four units of
work, each ended by a syncpoint (SP), with the final syncpoint taken at end of
task.
Figure 2 on page 17 shows that database changes made by a task are not
committed until a syncpoint is executed. If task processing is interrupted because
of a failure of any kind, changes made within the abending UOW are
automatically backed out.
If there is a system failure at time X:
v The change(s) made in task A have been committed and are therefore not backed
out.
v In task B, the changes shown as Mod 1 and Mod 2 have been committed, but
the change shown as Mod 3 is not committed and is backed out.
v All the changes made in task C are backed out.
Figure 2. Backout of units of work. The figure shows tasks A, B, and C, each
making database modifications (Mod 1, Mod 2, and so on) within units of work,
and a system failure at time X. Modifications committed by a syncpoint before
time X remain committed; modifications in units of work still in flight at time X
are backed out. (Abbreviations: SOT = start of task, EOT = end of task,
UOW = unit of work, SP = syncpoint, Mod = modification to database,
X = moment of system failure.)
CICS recovery manager
The recovery manager ensures the integrity and consistency of resources (such as
files and databases) both within a single CICS region and distributed over
interconnected systems in a network.
Figure 3 on page 18 shows the resource managers and their resources with which
the CICS recovery manager works.
The main functions of the CICS recovery manager are:
v Managing the state, and controlling the execution, of each UOW
v Coordinating UOW-related changes during syncpoint processing for recoverable
resources
v Coordinating UOW-related changes during restart processing for recoverable
resources
v Coordinating recoverable conversations to remote nodes
v Temporarily suspending completion (shunting), and later resuming completion
(unshunting), of UOWs that cannot immediately complete commit or backout
processing because the required resources are unavailable, because of system,
communication, or media failure
Figure 3. CICS recovery manager and resources it works with. The figure shows
the recovery manager and its log at the center, surrounded by the resource
managers (file control, including FC/RLS; temporary storage; transient data; RDO;
DBCTL; DB2; and MQM) and the communications managers (LU6.1, LU6.2, and
MRO).
Managing the state of each unit of work
The CICS recovery manager maintains, for each UOW in a CICS region, a record of
the changes of state that occur during its lifetime.
Typical events that cause state changes include:
v Creation of the UOW, with a unique identifier
v Premature termination of the UOW because of transaction failure
v Receipt of a syncpoint request
v Entry into the indoubt period during two-phase commit processing (see the
CICS Transaction Server for z/OS Glossary for a definition of two-phase commit)