Optimizing and Verifying TSM Backups on AIX PowerHA Clusters

One of my customers recently noticed that the TSM backups for their AIX PowerHA clusters were taking a long time and seemed to be backing up a lot more data than they expected, so they asked me to look into it. In this article, I’ll explain what the problem was, how it was resolved, and include some simple scripts that I used for testing.

NOTE 1: This article involves two different usages of the word “node” – 1. The “nodes” of a PowerHA cluster (the servers that make up the cluster) and 2. The “nodes” defined in TSM (host definitions). In an attempt to avoid confusion, I’ll refer to each node of a cluster as a “server,” and continue to use the word “node” for the nodes defined in TSM. This may actually cause more confusion than it prevents, but that’s a risk we’ll have to take.

NOTE 2: I realize that TSM is now officially known as “IBM Spectrum Protect.” But it will always be TSM in my heart (plus it is faster to type), so I’ll refer to it as TSM in this article. Theoretically, I should also use the official product term “IBM PowerHA SystemMirror for AIX” instead of just PowerHA. But that is far too lengthy, so PowerHA will have to suffice. Just be glad that I’m no longer calling it HACMP.

PowerHA/TSM Configuration Details
The customer had several two-server clusters, most of which had a single resource group in an active/passive configuration containing one or more shared/cluster volume groups and a cluster IP address. In TSM, they had nodes defined for each server in the cluster as well as a separate node for the cluster hostname. That way, the filesystems in the cluster resource group volume group(s) would get backed up to a separate TSM node. This would avoid the problem of having the shared/cluster files backed up to TSM Node A when the resource group was active on Server A and then backed up to TSM Node B when it was active on Server B. In that scenario, in order to restore shared/cluster files, you would need to know which server was active when the applicable backup of the file was taken. That would be confusing, and life is hard enough without adding extra confusion.

Within their configuration, the active server of the cluster would always have two TSM scheduler processes running – one for the regular TSM node and a second for the cluster TSM node. The TSM scheduler for the cluster node would be started using the “-optfile” option of the “dsmc sched” command to specify a cluster dsm.opt file (located in one of the cluster filesystems). The inactive server of the cluster would only have one TSM scheduler instance running for the regular TSM node.
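As a rough sketch only (the exact start method varies by site – inittab, the SRC, or a PowerHA application start script – and the cluster opt file path shown here is the one used in the example later in this article), the two scheduler instances on the active server might be started like this:

# Scheduler for the regular TSM node (uses the default dsm.opt):
nohup dsmc sched > /dev/null 2>&1 &

# Scheduler for the cluster TSM node (uses the cluster dsm.opt in a cluster filesystem):
nohup dsmc sched -optfile=/tsm_CL/dsm.CL.opt > /dev/null 2>&1 &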

The Problem: Duplicate TSM Backups
When I reviewed the TSM configuration, the problem was simple enough to identify – by default, a TSM filesystem backup on AIX will back up all locally mounted filesystems (except /tmp), so on the active server of the cluster all filesystems were being backed up twice – once to the regular TSM node, and once to the cluster node. Some of the filesystems were quite large, so the duplication added a lot of time to the backup window and sent a lot of unnecessary data to TSM.

If I hadn’t been able to find the problem by examining the TSM configuration, I would have done some digging with server performance monitoring tools. In this case, the customer had Galileo Performance Explorer installed on all of their AIX servers, so I could have quickly and easily reviewed performance data from both the problem server and the TSM server to identify any issues or bottlenecks in CPU, memory, disk, paging space, adapters, or network, and I could have tracked the trends over many months to see if and when a problem began to occur.

The Solution: Include/Exclude Files
The desired result is for the non-cluster filesystems to be backed up to the regular TSM nodes while the cluster filesystems are backed up to the cluster node. This can be accomplished with include/exclude files, which are specified via the “Inclexcl” option in the dsm.sys file.

Since the TSM default is to back up all local filesystems, the include/exclude file for the regular TSM node just needs to explicitly exclude the cluster filesystems. That would leave all of the non-cluster filesystems to be backed up.

For the include/exclude file for the cluster TSM node, we would do the opposite – have it specifically include all of the cluster filesystems and then have it explicitly exclude all other filesystems.

Example
Let’s demonstrate this with an example.  Say we have a two-server Oracle cluster consisting of Server_A and Server_B.  Each server has two non-clustered volume groups containing the same filesystems as shown below:

rootvg:
/
/home
/opt
/opt/galileo
/tmp
/usr
/var
/var/hacmp

u01vg:
/u01   (contains local Oracle binaries)

The cluster has a single resource group (active/passive) which contains one volume group:

oraclevg:
/tsm_CL   (includes the cluster TSM files - dsm.opt file, include/exclude file, logs, etc.)
/u02
/u03

Since each server in the cluster contains the same filesystems, they can each have identical include/exclude files for their regular TSM nodes, which exclude the cluster filesystems:

/usr/tivoli/tsm/client/ba/bin64/inclexcl.list:
exclude.fs /tsm_CL
exclude.fs /u0[2-3]

The include/exclude file for the cluster node should exist in one of the clustered filesystems; it should include just the cluster filesystems and exclude everything else. Include/exclude statements are processed from the bottom up, so the cluster include/exclude file should look like this:

/tsm_CL/inclexcl.CL.list:
exclude "/.../*"
include /tsm_CL/.../*
include /u0[2-3]/.../*

After creating the above include/exclude files, be sure the dsm.sys file on each server is updated so that both contain a stanza for the cluster TSM node and the “Inclexcl” option is correct in every stanza. In fact, you may want to create an identical dsm.sys file on both servers, containing stanzas for both regular TSM nodes and the cluster node.  If you do so, you could also add the dsm.sys file to a PowerHA File Collection to make sure it stays in sync on both servers in the cluster.
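As an illustration only (the TSM server address and port below are placeholders – use your own values), a dsm.sys along those lines might look something like this, with one stanza per TSM node and each stanza pointing at its own include/exclude file:

SErvername  Server_A
*  TCP address/port below are placeholders for your TSM server
   COMMMethod         TCPip
   TCPServeraddress   tsm.example.com
   TCPPort            1500
   NODename           Server_A
   PASSWORDAccess     generate
   Inclexcl           /usr/tivoli/tsm/client/ba/bin64/inclexcl.list

SErvername  Cluster
   COMMMethod         TCPip
   TCPServeraddress   tsm.example.com
   TCPPort            1500
   NODename           Cluster
   PASSWORDAccess     generate
   Inclexcl           /tsm_CL/inclexcl.CL.list

(An identical shared dsm.sys would simply add a matching Server_B stanza.) Each dsm.opt then just needs a SErvername line selecting the stanza to use – Server_A or Server_B in the default dsm.opt, and Cluster in /tsm_CL/dsm.CL.opt – which is also the value that the verification script later in this article reads from the opt file.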

Now, we restart the TSM scheduler for both regular TSM nodes and for the cluster node to pick up the changes.
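If you want a quick sanity check before the next scheduled backup, the client’s “query inclexcl” command displays the include/exclude rules in effect (the paths here are the ones from our example):

For the regular TSM node:
# dsmc query inclexcl

For the cluster TSM node (on the active server):
# dsmc query inclexcl -optfile=/tsm_CL/dsm.CL.opt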

Testing
Next, we need a way to see if it is working as expected.  I created a set of simple scripts to help verify this. I’ll write out the complete scripts at the end of this article, but for now, here is a brief description of each:

mk_date_files.ksh – This script simply creates a new test file in every filesystem on the server for TSM to back up. The file names are in the form “test_<date>” where <date> is the current date in YYYYMMDD format.

verify_date_file_backups.ksh – This script queries TSM to find “test_<date>” files that were backed up to the specified TSM node. Two variables must be set in this script – the date when the mk_date_files.ksh script was run and the dsm.opt file location (pointing to either the regular TSM node or the cluster TSM node).

First, we run the mk_date_files.ksh script on both servers in the cluster to create the date files we’ll use to test our TSM backups. It uses today’s date (May 24, 2016 = 20160524). The output of the script is shown below:

Active Server:

# ./mk_date_files.ksh
Creating date files for oraclevg:
/tsm_CL/test_20160524
/u02/test_20160524
/u03/test_20160524

Creating date files for rootvg:
//test_20160524
/home/test_20160524
/opt/test_20160524
/opt/galileo/test_20160524
/tmp/test_20160524
/usr/test_20160524
/var/test_20160524
/var/hacmp/test_20160524

Creating date files for u01vg:
/u01/test_20160524

Inactive Server:

# ./mk_date_files.ksh
Creating date files for rootvg:
//test_20160524
/home/test_20160524
/opt/test_20160524
/opt/galileo/test_20160524
/tmp/test_20160524
/usr/test_20160524
/var/test_20160524
/var/hacmp/test_20160524

Creating date files for u01vg:
/u01/test_20160524

At this point, we need TSM to run to see if it backs up the files correctly. We could either wait until after its next regularly scheduled backup, or we could force it to perform a backup now so we can continue testing. I’ll choose the latter option and run “dsmc incr” on both servers to back up to the regular TSM nodes and then run “dsmc incr -optfile=/tsm_CL/dsm.CL.opt” on the active server to back up to the cluster TSM node.
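For reference, here is what gets run where (the cluster opt file path is the one from our example):

On both servers, back up to the regular TSM node:
# dsmc incr

On the active server only, back up to the cluster TSM node:
# dsmc incr -optfile=/tsm_CL/dsm.CL.opt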

With the backups complete on both servers, we can query TSM to see which of the test files got backed up to the regular TSM nodes and which got backed up to the cluster node. We’ll use the verify_date_file_backups.ksh script to perform these queries.

I will first check to see which of the test files got backed up to the regular TSM nodes. So I edit the verify_date_file_backups.ksh script on both servers to set the DATE variables to the date that mk_date_files.ksh used (20160524) and set the OPT_FILE variables to point to the regular TSM node opt files.

We will run this script first on the active server. It should find that a dated test file has been backed up in every filesystem within the non-clustered volume groups (rootvg and u01vg), with the exception of /tmp (which is excluded by TSM default). It should also find that there has NOT been a dated test file backed up for any of the filesystems in the clustered volume group (oraclevg). The output is below (note that it traverses the volume groups in alphabetic order):

Active Server Regular TSM Node:

# ./verify_date_file_backups.ksh

Verifying date file backups for all filesystems to TSM Node Server_A.
OPT file is /usr/tivoli/tsm/client/ba/bin64/dsm.opt

Regular output will be logged to /tmp/verify_date_file_backups_Server_A.20160524.out.  Detailed output will be logged to /tmp/verify_date_file_backups_Server_A.20160524.log
...

Verifying date file backups for oraclevg:

/tsm_CL/test_20160524 has NOT been backed up to this TSM NODE
/u02/test_20160524 has NOT been backed up to this TSM NODE
/u03/test_20160524 has NOT been backed up to this TSM NODE
...

Verifying date file backups for rootvg:

//test_20160524 has been backed up to this TSM NODE
/home/test_20160524 has been backed up to this TSM NODE
/opt/test_20160524 has been backed up to this TSM NODE
/opt/galileo/test_20160524 has been backed up to this TSM NODE
/tmp/test_20160524 has NOT been backed up to this TSM NODE
/usr/test_20160524 has been backed up to this TSM NODE
/var/test_20160524 has been backed up to this TSM NODE
/var/hacmp/test_20160524 has been backed up to this TSM NODE
...

Verifying date file backups for u01vg:

/u01/test_20160524 has been backed up to this TSM NODE

Verification complete.  See /tmp/verify_date_file_backups_Server_A.20160524.out for basic output and /tmp/verify_date_file_backups_Server_A.20160524.log for details

TSM has performed as expected. We will next run the same script on the inactive server. It should have the exact same results as it did on the active server, except that the cluster volume group filesystems are not currently mounted there, so they will not be listed:

Inactive Server Regular TSM Node:

# ./verify_date_file_backups.ksh

Verifying date file backups for all filesystems to TSM Node Server_B.
OPT file is /usr/tivoli/tsm/client/ba/bin64/dsm.opt

Regular output will be logged to /tmp/verify_date_file_backups_Server_B.20160524.out.  Detailed output will be logged to /tmp/verify_date_file_backups_Server_B.20160524.log
...

Verifying date file backups for rootvg:

//test_20160524 has been backed up to this TSM NODE
/home/test_20160524 has been backed up to this TSM NODE
/opt/test_20160524 has been backed up to this TSM NODE
/opt/galileo/test_20160524 has been backed up to this TSM NODE
/tmp/test_20160524 has NOT been backed up to this TSM NODE
/usr/test_20160524 has been backed up to this TSM NODE
/var/test_20160524 has been backed up to this TSM NODE
/var/hacmp/test_20160524 has been backed up to this TSM NODE
...

Verifying date file backups for u01vg:

/u01/test_20160524 has been backed up to this TSM NODE

Verification complete.  See /tmp/verify_date_file_backups_Server_B.20160524.out for basic output and /tmp/verify_date_file_backups_Server_B.20160524.log for details

Again, it looks like TSM performed as expected.

Next, we will modify the verify_date_file_backups.ksh script on the active server by setting the OPT_FILE variable to point to the cluster opt file. This will cause it to look for dated test files in every filesystem on the server to see which have been backed up to the cluster TSM node. It should find that dated test files have NOT been backed up in any filesystem within the non-clustered volume groups (rootvg and u01vg), but it should find that dated test files have been backed up for all of the filesystems in the clustered volume group (oraclevg). The output is below:
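In the script itself, that change is just a matter of swapping which OPT_FILE assignment is commented out:

# Regular:
#OPT_FILE="/usr/tivoli/tsm/client/ba/bin64/dsm.opt"
# Cluster:
OPT_FILE="/tsm_CL/dsm.CL.opt"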

Active Server Cluster TSM Node:

# ./verify_date_file_backups.ksh

Verifying date file backups for all filesystems to TSM Node Cluster.
OPT file is /tsm_CL/dsm.CL.opt

Regular output will be logged to /tmp/verify_date_file_backups_Cluster.20160524.out.  Detailed output will be logged to /tmp/verify_date_file_backups_Cluster.20160524.log
...

Verifying date file backups for oraclevg:

/tsm_CL/test_20160524 has been backed up to this TSM NODE
/u02/test_20160524 has been backed up to this TSM NODE
/u03/test_20160524 has been backed up to this TSM NODE
...

Verifying date file backups for rootvg:

//test_20160524 has NOT been backed up to this TSM NODE
/home/test_20160524 has NOT been backed up to this TSM NODE
/opt/test_20160524 has NOT been backed up to this TSM NODE
/opt/galileo/test_20160524 has NOT been backed up to this TSM NODE
/tmp/test_20160524 has NOT been backed up to this TSM NODE
/usr/test_20160524 has NOT been backed up to this TSM NODE
/var/test_20160524 has NOT been backed up to this TSM NODE
/var/hacmp/test_20160524 has NOT been backed up to this TSM NODE
...

Verifying date file backups for u01vg:

/u01/test_20160524 has NOT been backed up to this TSM NODE

Verification complete.  See /tmp/verify_date_file_backups_Cluster.20160524.out for basic output and /tmp/verify_date_file_backups_Cluster.20160524.log for details

Once again, it looks like TSM performed as expected. The cluster verification does not need to be run on the inactive server since the cluster resource group is not active there currently (and the cluster dsm.opt file is not there either, since it lives in a cluster filesystem, so the script would not run anyway).

This concludes our testing, as we have successfully configured TSM to perform filesystem backups within a PowerHA cluster on AIX. See below for the complete scripts that were used above.

Appendix – PowerHA TSM Test Scripts

mk_date_files.ksh:

#!/bin/ksh

## This script creates a test file containing today's date in each filesystem ##
## in each open volume group on the server: /<filesystem>/test_<date>         ##

DATE=`date +"%Y%m%d"`

VG=""
for VG in `lsvg -o | grep -v caavg_private | sort`
do
  print "Creating date files for ${VG}:"
  FS=""
  for FS in `lsvgfs ${VG} | sort`
  do
          FILE=""
          FILE=${FS}/test_${DATE}
          print ${FILE}
          date > ${FILE}
  done
  print
done

exit



verify_date_file_backups.ksh:

#!/bin/ksh

## This script queries TSM file backups to find test_<date> files that were   ##
## created by using the mk_date_files.ksh script.                             ##
## In this script, the DATE and OPT_FILE variables must be manually set.      ##
## DATE is the date when the date files were created by running the           ##
## mk_date_files.ksh script.                                                  ##
## OPT_FILE is the full path to the opt file, and will determine whether the  ##
## script queries backups for the "regular" or "cluster" TSM node.            ##

# Set DATE and OPT_FILE variables:
# DATE should have YYYYMMDD format
DATE=20160524

# OPT_FILE should point to regular TSM optfile, or cluster optfile:
# Regular:
OPT_FILE="/usr/tivoli/tsm/client/ba/bin64/dsm.opt"
# Cluster:
#OPT_FILE="/tsm_CL/dsm.CL.opt"

NODE=""
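# Derive the TSM node label from the SErvername line of the opt file (used in messages and log/output file names)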
NODE=`grep -i servername $OPT_FILE | awk '{print $NF}'`
OUT=/tmp/verify_date_file_backups_$NODE.$DATE.out
> $OUT
LOG=/tmp/verify_date_file_backups_$NODE.$DATE.log
> $LOG

print "\nVerifying date file backups for all filesystems to TSM Node $NODE." | tee -a $OUT
print "OPT file is $OPT_FILE" | tee -a $OUT
print "\nVerifying date file backups for all filesystems to TSM Node $NODE." >> $LOG
print "OPT file is $OPT_FILE" >> $LOG

print "\nRegular output will be logged to $OUT.  Detailed output will be logged to $LOG" | tee -a $OUT

VG=""
for VG in `lsvg -o | grep -v caavg_private | sort`
do
  print "..."
  print "\nVerifying date file backups for ${VG}:\n" | tee -a $OUT
  print "\nVerifying date file backups for ${VG}:\n" >> $LOG
  FS=""
  for FS in `lsvgfs $VG | sort`
  do
          RC=100
          FILE=""
          FILE=${FS}/test_${DATE}
          print "\n# dsmc query backup ${FILE} -optfile=$OPT_FILE" >> $LOG
          dsmc query backup ${FILE} -optfile=$OPT_FILE >> $LOG 2>&1
          RC=$?
          if [[ $RC -eq 0 ]]
          then
                  print "$FILE has been backed up to this TSM NODE" | tee -a $OUT
                  print "$FILE has been backed up to this TSM NODE" >> $LOG
          else
                  print "$FILE has NOT been backed up to this TSM NODE" | tee -a $OUT
                  print "$FILE has NOT been backed up to this TSM NODE" >> $LOG
          fi
  done
done

print "\nVerification complete.  See $OUT for basic output and $LOG for details." | tee -a $OUT

exit

 
