Monday, October 14, 2019

Transaction Guard | Application Continuity | DML Failover in RAC 19c


Introduction:

Starting from 12c TAF can failover DMLs to the available nodes on a RAC environment in case of any interruption happens to it including node/service/network failure. This feature is called  Application Continuity or Transaction Guard.

Pre-requisites: [For 19c]

The applications should use Oracle Client 19c. Although the Oracle note Doc ID 2011697.1 mentioning that any 12.1+ Oracle Client will work, it didn't work for me while testing on a 19c DB.
Application Continuity works on the assumption that the applications are well written in terms of connection usage:
  • Borrowing connections and returning them to the connection pool instead of pinning connections.
  • If a statement cache at the application server level is enabled it must be disabled when the replay is used. or use JDBC statement cache, which is compatible with Application Continuity. 
  • Additional CPU utilization will happen on the client-side in order to handle garbage collections. 
Implementation:

# srvctl add service -database sprint -service pssfztest_rpt -preferred sprint1,sprint2 -tafpolicy BASIC -failovertype TRANSACTION -commit_outcome TRUE -failovermethod BASIC -failoverretry 100 -stopoption IMMEDIATE -session_state DYNAMIC -role PRIMARY -policy AUTOMATIC -clbgoal long -verbose -failover_restore LEVEL1 -replay_init_time 3600

Notes:

- Don't use replay_init_time along with failoverdelay or you will get this error when trying to start the service:
CRS-2632: There are no more servers to try to place resource 'ora.sprint.pssfztest_gg.svc' on that would satisfy its placement policy

- In order for the applications to use the DML failover feature (Application Continuity) application users should be granted execute permission on DBMS_APP_CONT: [For simplicity I'm granting it for everyone but some applications don't work properly with this feature, so it's recommended to test your application and grant this permission to the users of the applications that support Application Continuity]
SQL> grant execute on DBMS_APP_CONT to public;

- clbgoal=short is less stable than clbgoal=long where failover retries can be exhausted before reaching its max limit.

- clbgoal=short balance the sessions between RAC nodes based on response time, while clbgoal=long balance the sessions based on the total number of sessions on each node.

- The PRECONNECT option for -tafpolicy parameter is deprecated in 19c.

- If you set -failovertype = TRANSACTION, then you must set -commit_outcome to TRUE.

- For -session_state Oracle recommends to set it to DYNAMIC for most applications, to use the default session settings (NLS settings, optimizer preferences,..) after the session fails over.

- replay_init_time: Specifies the time in seconds after which replay (failover) will not happen. [It's set to 3600sec =1 hour above]

The following activities if happened can cause the transaction to failover without being disrupted [transaction will hang for a few seconds till the connectivity get restored on the available nodes]:

- instance crash.
- Partial Network disruption.
- OS kill -9 of the instance main processes (PMON/SMON).
- OS kill -STOP followed by kill -9 of the same session.
- shutdown immediate (from SQLPLUS console).
- shu abort (from SQLPLUS console).
- srvctl stop service -d sprint -i sprint1
- srvctl stop service -d sprint -i sprint1 -force
- srvctl stop instance -d sprint -i sprint1 -failover
- ALTER SYSTEM KILL SESSION command.
- ALTER SYSTEM DISCONNECT SESSION command.

The following activities will terminate the DML transaction WITHOUT failing them over but the session itself will re-connect automatically: [If performed on the node where the session is connected]
- OS kill -9 of the session PID.
- ALTER SYSTEM CANCEL SQL '<SID>,<SERIAL#>';
- srvctl stop instance -d sprint -i sprint1 -force
- crsctl stop cluster
- crsctl stop crs



Conclusion:

- Application Continuity feature can let you carry out activities like software patching, hardware/network maintenance with real ZERO downtime.

- Before using Application Continuity feature, you have to make sure that your applications are compatible with this feature by testing all the scenarios you may go through. It will be wise if you consult your application vendor before implementing this feature.
Using this feature blindly without proper testing may result in unexpected application behavior.

- Applications should not use the default database service (which has the same name as DB_NAME/DB_UNIQUE_NAME of the database) as this service is not supported by the high availability features.

References:

3 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Thankyou Mahmmoud that's a great explanation on this HA feature. Also for your warning about correct service name usage.

    ReplyDelete