Tuesday, April 18, 2023

ERROR OGG-01172 Discard file exceeded max bytes

Problem:

On a Goldengate 12.2 setup, one of the Replicat processes got ABENDED with the below error:

ERROR   OGG-01172  Oracle GoldenGate Delivery for Oracle, rep1.prm:  Discard file (/goldengate/gghome/dirrpt/rep1.dsc) exceeded max bytes (2000000000).


Solution:


You can modify the parameter file of the same process, increasing the maximum size of the discard file, which can go up to 2048 MB, like this:

DISCARDFILE ?GG_HOME/dirrpt/rep1.dsc, APPEND, MEGABYTES 2048

But again, once it reaches 2048 MB the process will ABEND regardless of which option you use (APPEND/PURGE/ROLLOVER), as these options take effect only when the Replicat process starts up!
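If you want to keep reporting to a discard file but prevent it from hitting the size limit, the DISCARDROLLOVER parameter (covered in the second reference below) can age the file out on a schedule instead. A minimal sketch, with an example schedule you would adjust to your own maintenance window:

-- Keep the existing discard file definition and roll it over weekly (example schedule)
DISCARDFILE ?GG_HOME/dirrpt/rep1.dsc, APPEND, MEGABYTES 2048
DISCARDROLLOVER AT 01:00 ON SUNDAY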

Another solution is to simply disable the reporting of discarded transactions to the discard file, if you don't care about them.

Remove or comment out the entire DISCARDFILE parameter line and replace it with the NODISCARDFILE parameter inside the Replicat parameter file:

-- DISCARDFILE ?GG_HOME/dirrpt/rep1.dsc, APPEND, MEGABYTES 2048
NODISCARDFILE

Note: The parameters DISCARDFILE & NODISCARDFILE are mutually exclusive, which means you cannot use both of them in the same parameter file, you have to use only one; otherwise the process will ABEND with the below cute error when you try to start it up:

ERROR   OGG-10107 Parsing error, parameter [nodiscardfile] conflicts with parameter [discardfile]


References:

https://docs.oracle.com/goldengate/c1221/gg-winux/GWURF/discardfile-nodiscardfile.htm#GWURF462

https://docs.oracle.com/en/middleware/goldengate/non-stop/12.3.0.1/reference/discardrollover.html#GUID-9A8B91D2-F0D7-46ED-B553-D72E76D9BCE6

Goldengate ports to allow between Source & Target

I'm getting this question frequently, so I thought I'd write it up in a separate post:

Question: What are the Ports you need to Allow for goldengate to Work between Source & Target?

Answer: You need to allow the following ports between Source and Target (bidirectionally) in order for goldengate to work:

First, check whether the DYNAMICPORTLIST parameter is being used in mgr.prm; it specifies the list of ports goldengate will use, so you can open the firewall for this list plus port 7809.

For example, if the parameter specifies the list of ports in this fashion:

DYNAMICPORTLIST 7820-7830, 7833, 7835

Then you need to open the firewall between source and target for the following ports:

Ports in the range 7820 to 7830, plus the individual ports 7833 and 7835, along with port 7809, which is the default port for the Manager process.

If the DYNAMICPORTLIST parameter is not configured to define the range of ports goldengate will use, then goldengate will use ports randomly from a range starting at port 7819 up to port 12818, along with port 7809, which is used by the Manager process.
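For example, you can quickly verify both pieces on your own setup; the paths and host below are placeholders to adjust to your environment:

# On the target, check whether the Manager defines DYNAMICPORTLIST:
grep -i DYNAMICPORTLIST /goldengate/gghome/dirprm/mgr.prm

# From the source, confirm the Manager port and one of the dynamic ports are reachable:
telnet <target_host> 7809
telnet <target_host> 7820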


Friday, April 14, 2023

Goldengate Pump Processes Fail to Start Throwing Error "TCP/IP error 111 (Connection refused)"

 Problem:

Goldengate pump processes fail to start on the source site; ggserr.log reports this error:

TCP/IP error 111 (Connection refused), endpoint: 10.1.1.1:7809


 Analysis:

First, make sure that the goldengate ports are open between the Source & Target sites. If the DYNAMICPORTLIST parameter is not configured to define the range of ports goldengate will use, then goldengate can use ports randomly from a range starting at port 7819 up to port 12818, along with port 7809, which is used by the Manager process.

I've already checked the route between Source & Target and confirmed that the ports are open:

# telnet 10.1.1.1 7809
Trying 10.1.1.1...
Connected to 10.1.1.1.


# telnet 10.1.1.1 7819
Trying 10.1.1.1...
Connected to 10.1.1.1.

When checking the GLOBALS file, it shows two ALLOWOUTPUTDIR paths:

# cat /goldengate/gghome/GLOBALS

GGSCHEMA GGOWNER
CHECKPOINTTABLE GGOWNER.CHECKPOINTTAB
ALLOWOUTPUTDIR /goldengate/ggdata_bkp
ALLOWOUTPUTDIR /goldengate/ggdata

While the right path "/goldengate/ggdata", which is being used by the Source site, already exists, the other path "/goldengate/ggdata_bkp" does not exist:

# ls -ld /goldengate/ggdata_bkp

ls: cannot access /goldengate/ggdata_bkp: No such file or directory
 

 Solution:

Make sure the path mentioned after the ALLOWOUTPUTDIR parameter in the GLOBALS file refers to an existing directory; otherwise remove or comment out the line that refers to the non-existent directory, then restart the Manager.
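Alternatively, if that second output directory is actually needed, a minimal fix (assuming the goldengate OS user should own it) is to create the missing directory instead of removing the line:

# Create the missing directory so the existing GLOBALS entry becomes valid
mkdir -p /goldengate/ggdata_bkp
chown goldengate:goldengate /goldengate/ggdata_bkp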

In my case, I removed the line that refers to the non-existent directory; GLOBALS now looks like this:

# cat /goldengate/gghome/GLOBALS

GGSCHEMA GGOWNER
CHECKPOINTTABLE GGOWNER.CHECKPOINTTAB

ALLOWOUTPUTDIR /goldengate/ggdata

Then, restarted the Manager process on the TARGET site:

GGSCI> stop manager

Manager process is required by other GGS processes.
Are you sure you want to stop it (y/n)?
y

GGSCI> start manager

 

Then, started the pump processes on the SOURCE site:

GGSCI> start p*
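You can then confirm from GGSCI that the pump processes came up and are RUNNING:

GGSCI> info all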


References:

https://docs.oracle.com/en/middleware/goldengate/core/19.1/admin/administering-oracle-goldengate.pdf

Thursday, April 13, 2023

Goldengate Manager is not Deleting|Purging Old Trail Files

Problem:

In a Goldengate 19.1.0.0.1 setup, the following line was set inside mgr.prm to automatically delete the processes' trail files older than 24 hours, but unfortunately the trail files are not getting deleted:

# cat mgr.prm | grep PURGEOLDEXTRACTS

PURGEOLDEXTRACTS /goldengate/ggdata/*  , USECHECKPOINTS, MINKEEPHOURS 24

I can see trail files as old as 10 days:

[goldengate@ggserver ggdata]$ ll /goldengate/ggdata
total 228G
-rw-r----- 1 goldengate goldengate  96M Apr  3 22:55 RB000000000
-rw-r----- 1 goldengate goldengate  96M Apr  3 23:00 RB000000001
-rw-r----- 1 goldengate goldengate  96M Apr  3 23:00 RB000000002
-rw-r----- 1 goldengate goldengate  96M Apr  3 23:09 RC000000000
....
-rw-r----- 1 goldengate goldengate  96M Apr 13 08:03 RD000000365
-rw-r----- 1 goldengate goldengate  96M Apr 13 08:10 RC000001055
-rw-r----- 1 goldengate goldengate  96M Apr 13 08:13 RA000000409


Analysis:

The trail files location /goldengate/ggdata was a symbolic link to the actual location, which was /u01/goldengate/ggdata:

[goldengate@ggserver ggdata]$ ls -ld /goldengate/ggdata
lrwxrwxrwx 1 goldengate goldengate 22 Oct  4  2022 /goldengate/ggdata -> /u01/goldengate/ggdata
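You can double-check which path is the real directory and which one is the link, for example:

$ readlink -f /goldengate/ggdata
/u01/goldengate/ggdata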
 

Solution:

Make sure the trail files location in the PURGEOLDEXTRACTS parameter line refers to the actual location, not to a symbolic link:

# cat mgr.prm | grep PURGEOLDEXTRACTS

PURGEOLDEXTRACTS /u01/goldengate/ggdata/*  , USECHECKPOINTS, MINKEEPHOURS 24
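Keep in mind that mgr.prm changes take effect only after the Manager process is restarted; you can then ask the Manager to report its purge rules and the trail files eligible for purging:

GGSCI> stop manager
GGSCI> start manager
GGSCI> send manager GETPURGEOLDEXTRACTS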

Monday, April 10, 2023

Recovering the Standby Database from MRP0: Detected orphaned datafiles after Opening it in Snapshot mode

Problem:

I ran into the following situation on a 19c STANDBY DB:

1- Converted the Standby DB to SNAPSHOT mode for testing purposes.
2- During the test, the /u01 filesystem where ORACLE is installed got full and the following errors started to be reported in the alertlog:
ORA-00313: open failed for members of log group 2 of thread 1
ORA-00312: online log 2 thread 1: '+REDO2/SPR/ONLINELOG/group_2.260.1040915087'
ORA-17503: ksfdopn:2 Failed to open file +REDO2/SPR/ONLINELOG/group_2.260.1040915087
ORA-15045: ASM file name 'ORA-09925: Unable to create audit trail file
Linux-' is not in reference form
ORA-15120: ASM file name 'ORA-09925: Unable to create audit trail file
Linux-' does not begin with the ASM prefix character
Linux-x86_64 Error: 28: No space left on device

3- Converted the DB back to Physical Standby.
4- Started up the recovery; it started but was not recovering anything, and there were no errors in the alertlog either!

Analysis:

When converting the database back from SNAPSHOT to PHYSICAL STANDBY mode, I could see in the DB alertlog that the flashback completed suspiciously fast, within a few seconds, although the DB had been open in SNAPSHOT mode for a long time with tons of transactions performed on it, which should require a long time to flash back.

Also, when I checked the database incarnation using the "list incarnation" RMAN command, it was still pointing to the incarnation created when the DB was converted to SNAPSHOT STANDBY, which is wrong.

Note: Each time you convert the PHYSICAL STANDBY to SNAPSHOT mode, the DB will be assigned a new incarnation number.

Solution:

1- Set the DB incarnation back to the BASE incarnation that was current when the DB was in PHYSICAL STANDBY mode:

RMAN> RESET DATABASE TO INCARNATION 1920977;

RMAN> LIST INCARNATION;
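You can also cross-check the incarnations from SQL*Plus; the incarnation key 1920977 above is of course specific to this environment:

SQL> select incarnation#, resetlogs_change#, resetlogs_time, status from v$database_incarnation order by incarnation#;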

2- Flashback the database to a time older than the time you converted the DB to SNAPSHOT mode while it was still in PHYSICAL STANDBY mode: [Luckily, the Flashback feature was turned on before the conversion to SNAPSHOT mode]
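Note: FLASHBACK DATABASE requires the managed recovery to be stopped and the standby to be in MOUNT mode (which a physical standby normally is); a minimal preparation sketch:

SQL> ALTER DATABASE RECOVER MANAGED STANDBY DATABASE CANCEL;
SQL> SELECT OPEN_MODE FROM V$DATABASE;  -- should return MOUNTED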

SQL> select timestamp_to_scn(to_timestamp('08/04/2023 00:01:00','DD/MM/YYYY HH24:MI:SS')) as scn from dual;

6683184418651

SQL> Flashback database to SCN 6683184418651;

Start the Recovery process:

SQL> RECOVER MANAGED STANDBY DATABASE USING CURRENT LOGFILE NODELAY DISCONNECT; 

Monitor the alertlog to make sure that the recovery process is now recovering from the archivelogs until it is in sync with the primary.

SQL> select value "Time LAG between Primary and Standby" from v$dataguard_stats where name='apply lag';

Done. The database is supposed to be a functional PHYSICAL STANDBY now.

Conclusion:

- Having the ORACLE_HOME filesystem full while the DB was in SNAPSHOT STANDBY mode triggered this weird issue.

- If the Standby database is unable to recover from archivelogs after converting it back from SNAPSHOT mode, or if it's throwing ORA-19909 in the alertlog, first check whether the database is on the right base incarnation (the one that was CURRENT before the conversion to SNAPSHOT mode) using the "list incarnation" RMAN command.

- If the recovery is still unable to start, you may need to consider flashing back the Standby database to a point in time before the conversion to SNAPSHOT mode.

- Enabling the FLASHBACK feature on the STANDBY DB can save you from many catastrophic scenarios, one of them being a full rebuild of the Standby if you reach a dead end; a quick check is shown below.
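A minimal check/enable sketch; note that turning flashback on requires sufficient space in the fast recovery area:

SQL> select flashback_on from v$database;
SQL> alter database flashback on;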