Friday, February 14, 2014

Oracle Database Monitoring Script



The backbone monitoring solution for Oracle databases in my environment is dbalarm.sh script. It's a kind of "deploy and forget" script where it can monitor lots of database system components in one go.

The script can report/monitor Database, ASM, Clusterware, Golden Gate and OS main events, including:
  
Database Monitoring:
  - Monitor the DB instance ALERT LOG file and report ORA and TNS errors.
  - Monitor TABLESPACES utilization.  - Monitor FLASH RECOVERY AREA (FRA) utilization.
  - Monitor ASM Disk Groups utilization.
  - Monitor BLOCKING LOCKS.
  - Monitor database named SERVICES. [Service names to be provided to SERVICEMON variable]
  - Monitor RMAN Backup Failure.
  - Monitor the database if it goes OFFLINE or gets HANGED.
  When the Paranoid mode is set to ON:
  - Startup/Shutdown events of the DB instances will be reported.
  - ALTER SYSTEM commands will be reported.
  - ALTER DATABASE commands will be reported.
  - EXPORT/IMPORT operations will be reported.

- Listener Monitoring:
  - Monitor the LISTENERS' LOG file and report TNS errors which plays a crucial rule in reporting connectivity security breaches, applications misconfiguration and network failures.

Grid Infrastructure Monitoring:
  - Monitor the ASM alert log for ORA and TNS errors.
  - Monitor the Grid Infrastructure alert log for ORA and TNS errors.
  - Monitor the Grid Infrastructure alert log for Shutdown/Startup events.
  - Monitor the Grid Infrastructure alert log for Node eviction events.
  - Monitor the Grid Infrastructure alert log for Network IP conflict.
  - Monitor the Grid Infrastructure alert log for Heart Beat failures.
  - Monitor the Grid Infrastructure alert log for service failure events.

OS/Hardware Staff:
  - Monitor the CPU for high utilization.
  - Monitor the Filesystems / Mount Points space utilization.
  - Monitor dmesg log for new entries which let you know what is going on on the OS side.

Golden Gate Monitoring:
  - Monitor the Golden Gate log for Errors and Process ABENDED events.

How it works?

The script is coded to send only new errors that are triggered since the last execution of the script, once the error gets reported once, it won't be reported again unless it appears again in the alert log file.

In order to tailor the script to fulfill your requirements, you have to follow the following simple steps:

Step 1:
Download the script from this link:


Step 2:
Open the script and change the E-mail address to your email address in the line# 60
EMAIL="youremail@yourcompany.com"

Note: sendmail service should be configured on your server to allow emails to be sent out from the DB machine.

Step 3:
By Oracle user:
In the crontab, schedule the script to run at least every 5 minutes:
# crontab -e
#Add this line: 
*/5 * * * * /home/oracle/dbalarm.sh
Note: /home/oracle/dbalarm.sh is the full path that points to dbalarm script where /home/oracle is the Oracle user home directory.

In case you will schedule this script to run from root user's crontab: [Not recommended]
# crontab -e
#Add this line to schedule the run of dbalarm.sh script every 5minutes:
*/5 * * * * su - oracle -c /home/oracle/dbalarm.sh
Now the only thing left here is to set back and relax and the script will report you all errors and all the breached predefined threshold.

One thing more, you can adjust the threshold inside the script as per your preferences by altering the below red colored values in THRESHOLDS section inside the script:

# #########################
# THRESHOLDS:
# #########################
# Modify the THRESHOLDS to the value you prefer:

HTMLENABLE=Y       # Enable HTML Email Format[DB]
FSTHRESHOLD=95 # THRESHOLD FOR FILESYSTEM %USED [OS]
CPUTHRESHOLD=95  # THRESHOLD FOR CPU %UTILIZATION [OS]
TBSTHRESHOLD=95 # THRESHOLD FOR TABLESPACE %USED [DB]
FRATHRESHOLD=95 # THRESHOLD FOR FRA %USED [DB]
FSITHRESHOLD=95   # THRESHOLD FOR FILESYSTEM INODES %USED [OS]
ASMTHRESHOLD=95 # THRESHOLD FOR ASM DISK GROUPS [DB]
BLOCKTHRESHOLD=1          # THRESHOLD FOR BLOCKED SESSIONS count [DB]
WAIT_FOR_LOCK_THRES=1 # THRESHOLD FOR REPORTING BLOCKED SESSIONS BLOCKED FOR MORE THAN N MINUTES
LAST_MIN_BKP_CHK=5 # REPORT RMAN Backups FAILURE in the last N MINUTES. Should be the same interval as the execution interval of dbalarm script in crontab.
CHKRMANBKP=Y         # Enable/Disable Checking of RMAN Backup FAILURE. [Default Enabled]
CHKLISTENER=Y         # Enable/Disable Checking Listeners: [Default Enabled]
CHKOFFLINEDB=Y      # Enable/Disable Database Down Alert: [Default Enabled]
CHKGOLDENGATE=Y   # Enable/Disable Goldengate Alert: [Default Enabled]
CPUDIGMORE=Y          # Break down to DB Active sessions when CPU hit the threshold: [RECOMMENDED TO SET =N on VERY BUSY environments]
TIMEOUTDIGMORE=Y  # Enable/Disable the display of Network Errors when TIMEOUT error gets detected. [Default Enabled]
SERVICEMON=""       # Monitor Specific Named DB Services. e.g.  SERVICEMON="'ORCL_RO','ERP_SRVC','SAP_SERVICE'"
PARANOIDMODE=N    # Paranoid mode will report more events like export/import, instance shutdown/startup. [Default Disabled]
CHKASMALERTLOG=Y  # Enable/Disable Monitoring ASM instance ALERTLOG. [Default Enabled]
CHKCLSALERTLOG=Y  # Enable/Disable Monitoring GRID INFRASTRUCTURE ALERTLOG[Default Enabled]
DEVICEDRIVERLOG=Y  # Enable/Disable Check "dmesg" Device Driver log for errors. [Default Enabled]

You can control script features like checking listener/offline databases/Golden Gate alert/show DB active sessions when CPU is high/monitor specific services by using Y to enable or N to disable it.

You can enable the script to do more checks like (reporting instance startup/shutdown, export/import operations, alter system/database commands and other major DB activities) when activating the Paranoid mode by setting its threshold PARANOIDMODE=Y

In addition, the script gives you the option to exclude specific database, tablespace, ASM Diskgroup, filesystm from having the script run against. You can do so by editing the following parameters:

# #######################################
# Excluded INSTANCES:
# #######################################
# Here you can mention the instances dbalarm will IGNORE and will NOT run against:
# Use pipe "|" as a separator between each instance name.
# e.g. Excluding: -MGMTDB, ASM instances:

EXL_DB="\-MGMTDB|ASM|APX"       #Excluded INSTANCES [Will not get reported offline].

# #########################
# Excluded TABLESPACES:
# #########################
# Here you can exclude one or more tablespace if you don't want to be alerted when they hit the threshold:
# e.g. to exclude "UNDOTBS1" modify the following variable in this fashion without removing "donotremove" value:
# EXL_TBS="donotremove|UNDOTBS1"
EXL_TBS="donotremove"

# #########################
# Excluded ASM Diskgroups:
# #########################
# Here you can exclude one or more ASM Disk Groups if you don't want to be alerted when they hit the threshold:
# e.g. to exclude "FRA" DISKGROUP modify the following variable in this fashion without removing "donotremove" value:
# EXL_DISK_GROUP="donotremove|FRA"
EXL_DISK_GROUP="donotremove"

# ################################
# Excluded FILESYSTEM/MOUNT POINTS:
# ################################
# Here you can exclude specific filesystems/mount points from being reported by dbalarm:
# e.g. Excluding: /dev/mapper, /dev/asm mount points:

EXL_FS="\/dev\/mapper\/|\/dev\/asm\/"                          #Excluded mount points [Will be skipped during the check].

Moreover, you can exclude specific ORA- TNS- errors in case you are OK with ignoring them, I've already excluded some minor frequent incoming errors to not get bothered about them:

# #########################
# Excluded ERRORS:
# #########################
# Here you can exclude the errors that you don't want to be alerted when they appear in the logs:
# Use pipe "|" between each error.

EXL_DB_ALERT_ERR="ORA-2396|TNS-00507|TNS-12502|TNS-12560|TNS-12537|TNS-00505"   #Excluded ALERTLOG ERRORS [Will not get reported].
EXL_LSNR_ERR="TNS-00507|TNS-12502|TNS-12560|TNS-12537|TNS-00505"  #Excluded LISTENER ERRORS [Will not get reported].
EXL_GRID_ALERT_ERR="donotremove"    #Excluded GRID INFRA ERRORS [Will not get reported].
EXL_GG_ERR="donotremove"                        #Excluded GoldenGate ERRORS [Will not get reported].
EXL_DMESG_ERR="donotremove|scsmd"     #Excluded OS DEVICE DRIVERS ERRORS [Will not get reported].


This script was tested on Linux and SUN environments.

DISCLAIMER: THIS SCRIPT IS DISTRIBUTED IN THE HOPE THAT IT WILL BE USEFUL BUT WITHOUT ANY WARRANTY. IT IS PROVIDED "AS IS".

If you're looking for a script to check the database health check on a daily basis, please follow this link: 
http://dba-tips.blogspot.ae/2015/05/oracle-database-health-check-script.html

More and more smart and "easy to use" scripts for database administration tasks can be found in the DBA Bundle:
http://dba-tips.blogspot.ae/2014/02/oracle-database-administration-scripts.html

Finally, If you have an opinion/comment/suggestion please share it with me, your feedback is highly appreciated.

In case the download link is not working, you can find the full code below:



21 comments:

  1. Hi GIThub,
    I think this script DBAALARM is for Linux only.Is it compatible with AIX also?
    Hari

    ReplyDelete
  2. Excellent work. This is working fine

    Thanks,
    Balajee

    ReplyDelete
  3. Hi, I'm using your script. But ORA errors that get through the mail can not read.For example: ORA-01110: ТЮИК ДЮММШУ 228.

    ReplyDelete
  4. Hi Hari,
    Actually I didn't test this script on an AIX environment, so I cannot give you an answer, but I've a doubt that all the functions inside the script will work there.

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. Is it possible to sent alert for RMAN failure as below :

    STATUS IN('RUNNING WITH WARNINGS','RUNNING WITH ERRORS','COMPLETED WITH WARNINGS','COMPLETED WITH ERRORS','FAILED')

    ReplyDelete
  7. Hi Rahul,

    Thanks for your suggestion. Today I'll be publishing a new release of the script including your suggestion.

    15-10-18 Added RMAN Backup Failure Check. [Recommended by Rahul Malode]

    ReplyDelete
  8. Also, created a new post explaining this feature inside dbalarm script
    https://dba-tips.blogspot.com/2018/10/monitoring-of-failed-rman-backup-jobs.html

    ReplyDelete
  9. Thanks for sharing this script Adel. Excellent work

    ReplyDelete
  10. hello,
    service monitoring feature looks not working with me, in the script I've updated SERVICEMON parameter to:
    SERVICEMON="SUPPORT_SERVICE" but whenever SUPPORT_SERVICE goes down I don't get an alert.

    ReplyDelete
  11. Hi Zaid,

    Please enclose the service name between a single quote along with the double quote, then it will work.
    e.g.
    SERVICEMON="'SUPPORT_SERVICE'"

    ReplyDelete
  12. The last version which send the alerts in html is not working on my servers. no errors but I'm not receiving the emails, any idea?

    ReplyDelete
  13. I doubt that your SMTP server is blocking the Emails that include HTML content. Please check first with your Network team, if that was the case then you can revert back to the TEXT format by setting the following option "HTMLENABLE" under THRESHOLD section to "N" like this:

    # #########################
    # THRESHOLDS:
    # #########################
    # Modify the THRESHOLDS to the value you prefer:

    HTMLENABLE=N # Enable HTML Email Format [DB]

    ....

    ReplyDelete
  14. Hi,
    I am getting some error on below line. Database is running on Red Hat Enterprise Linux Server release 7.6 (Maipo). Appreciate your help in this.


    # Getting Utilized CPU (100-%IDLE):
    CPU_UTL_FLOAT=`echo "scale=2; 100-($CPU_IDLE)"|bc`



    ReplyDelete
  15. dbalerm.sh: line 334: bc: command not found
    CPU CHECK Completed.

    ReplyDelete
  16. Hi Abdul,

    Inside the script please replace line 334:

    CPU_UTL_FLOAT=`echo "scale=2; 100-($CPU_IDLE)"|bc`

    with this line:

    CPU_UTL_FLOAT=`awk "BEGIN {print 100-($CPU_IDLE)}"`

    ReplyDelete
  17. The script hang in this stage when running it for the first time:

    [Grid Infrastructure Setup Detected] Locating Grid Infrastructure ALERTLOG ...
    Checking GRID INFRASTRUCTURE ALERTLOG ...

    What is the problem?

    ReplyDelete
  18. Just cancel the current execution by pressing Ctrl+c
    And execute the script again, you shouldn't face this issue again.

    ReplyDelete
  19. Hi. I am unable to reach to dropbox link. Could you please share your script to vn6498@gmail.com. Appreciate your prompt help.

    ReplyDelete
  20. Able to copy the script, thank you. However, facing many errors as it is not picking up the email id - line 70 , not picking the environment from .bash_profile (FYI - i am running on 12.2.1) etc. Is there anyway to discuss one to one with you.

    ReplyDelete
  21. Sure, you can Email me on mahmmoudadel@hotmail.com

    ReplyDelete