Sunday, June 19, 2011

Troubleshooting Unix MISSED backups

Four part strategy...

1) LOGON AND CHECK FOR A SCHEDULER DAEMON
Log on to the TSM Backup/Archive (TSM B/A) client host via SSH. To find the target IP address, run the following from a TSM administrative command line:
q node xxxx f=d
Look for the TCP/IP Address: value
If the IP address does not show up here, try this command:
q actlog begind=-3 endd=today msgno=0406 search=xxxx
Look for the IP address the host uses to talk to the TSM server
Once SSH’ed into the host, sudo up to root:
sudo su -

Find out what type of Unix OS you are dealing with:

uname -a

Find out if the TSM B/A scheduler daemon is running:

NON-Linux unix OS'es:

ps -ef | grep dsm

Linux unix OS'es:

ps -ef | grep tsm


You should see something similar to the following output:

NON-Linux unix OS’es:

root 2608 1 0 Aug 13 ? 58:58 /opt/tivoli/tsm/client/ba/bin/dsmc schedule


Linux unix OS'es:

You may see multiple daemons returned from your 'ps -ef | grep tsm' command. This is ok, as there should be one 'master' daemon and 4-5 'child' daemons.
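
To pull just the PIDs out of the ps output, a small helper like the one below can be used. This is a minimal sketch: the function name sched_pids is made up for illustration, and the "dsmc sched" pattern is an assumption based on the sample output above, so adjust it if your daemon shows up under a different command line.

```shell
# List the PID of each "dsmc schedule" process found in `ps -ef` output.
# Reads the ps output on stdin, so it also works against saved output.
# sched_pids is a hypothetical helper name, not part of the TSM client.
sched_pids() {
    # $2 is the PID column in `ps -ef`; skip the grep/awk helper processes.
    awk '/dsmc sched/ && !/grep/ && !/awk/ { print $2 }'
}

# Usage:
#   ps -ef | sched_pids
```

If nothing is printed, no scheduler daemon matched the pattern.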


2) CHECK FOR A HUNG SCHEDULER DAEMON

View the dsm.sys config file to see where dsmerror.log and dsmsched.log files are being written (on AIX, replace /opt/ with /usr/):

more /opt/tivoli/tsm/client/ba/bin/dsm.sys


Find the SCHEDLOGName entry - this typically points to /var/adm/dsmsched.log

Find the ERRORLOGName entry - this typically points to /var/adm/dsmerror.log


Wherever the two log files point to, cd to that directory:

cd /var/adm
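
Instead of paging through dsm.sys by eye, the two log paths can be pulled out with awk. This is a sketch only: dsm_option is a hypothetical helper name, and it assumes the option name is spelled out in full and the path contains no spaces. If dsm.sys has several server stanzas, it prints one line per stanza that sets the option.

```shell
# Print the value of an option (e.g. SCHEDLOGName) from a dsm.sys-style file.
#   $1 = option name (matched case-insensitively)
#   $2 = path to the config file
# dsm_option is a hypothetical helper name used for illustration.
dsm_option() {
    awk -v opt="$1" 'tolower($1) == tolower(opt) { print $2 }' "$2"
}

# Usage (on AIX, replace /opt/ with /usr/):
#   dsm_option SCHEDLOGName /opt/tivoli/tsm/client/ba/bin/dsm.sys
#   dsm_option ERRORLOGName /opt/tivoli/tsm/client/ba/bin/dsm.sys
```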


Find out when files were last updated:

ls -ltr | grep dsm


Find out the current time of this host:

date


If the dsmerror.log file has a timestamp pretty close (within a couple of hours) to the current host time, look at the last few entries to see what’s going on:

tail -500 dsmerror.log
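
The "within a couple of hours" check can also be done without comparing timestamps by hand. The sketch below assumes your find supports -mmin (GNU and BSD find do; some older Unix versions may not), and recently_updated is a made-up helper name.

```shell
# Succeed (exit 0) if the file was modified within the last N minutes.
#   $1 = file to check
#   $2 = window in minutes (defaults to 120, i.e. a couple of hours)
# Assumes find supports -mmin; recently_updated is a hypothetical name.
recently_updated() {
    [ -n "$(find "$1" -mmin "-${2:-120}" 2>/dev/null)" ]
}

# Usage:
#   recently_updated /var/adm/dsmerror.log && tail -500 /var/adm/dsmerror.log
```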


Sometimes this shows the TSM B/A client continuously trying, and failing, to establish a connection to the TSM server. If this is the case, the scheduler daemon is probably hung and needs to be killed/restarted.


If no errors indicate that the agent is hung, move on to the next check.


Check dsmsched.log for last few entries

tail -100 dsmsched.log


If the last few entries indicate that a backup is still running, yet the date/time stamps are old (i.e. not near the current time), the scheduler daemon is probably hung and needs to be killed/restarted.
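
To see the most recent date/time stamp in dsmsched.log at a glance, something like the following can help. It is a sketch that assumes log lines start with a MM/DD/YY HH:MM:SS stamp, as in the sample log output in this post; your client may be configured with a different date format. last_log_stamp is a hypothetical helper name.

```shell
# Print the date/time stamp of the last timestamped line in a log file.
# Assumes each entry starts with "NN/NN/NN HH:MM:SS" as its first two fields.
# last_log_stamp is a hypothetical helper name.
last_log_stamp() {
    awk '$1 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ && $2 ~ /^[0-9]+:[0-9]+:[0-9]+$/ {
             stamp = $1 " " $2
         }
         END { print stamp }' "$1"
}

# Usage: compare the two lines printed
#   last_log_stamp /var/adm/dsmsched.log
#   date
```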


3) KILLING/RESTARTING A HUNG SCHEDULER DAEMON

Get the daemon process ID of the TSM B/A scheduler daemon that is running:


NON-Linux unix OS'es:

ps -ef | grep dsm


Linux unix OS'es:

ps -ef | grep tsm


You should see something similar to the following output:


NON-Linux unix OS’es:

root 2608 1 0 Aug 13 ? 58:58 /opt/tivoli/tsm/client/ba/bin/dsmc schedule

Linux unix OS'es:


You may see multiple daemons returned from your 'ps -ef | grep tsm' command. This is ok, as there should be one 'master' daemon and 4-5 'child' daemons.


Kill the TSM B/A scheduler daemon:

kill -9 2608


In this example, 2608 is the PID reported by the ps -ef command above.


Your PID will be different from the one in this example.

Be sure you are killing the correct PID!
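
One way to guard against killing the wrong PID is to confirm that it really belongs to a dsmc process before running kill. The helper below is a sketch; pid_is_dsmc is a made-up name, and it assumes the PID is in the second column of ps -ef output, as in the examples above.

```shell
# Succeed only if the given PID appears in `ps -ef` output (read from stdin)
# on a line whose command contains "dsmc".
#   $1 = PID to check
# pid_is_dsmc is a hypothetical helper name.
pid_is_dsmc() {
    awk -v pid="$1" '$2 == pid && /dsmc/ { found = 1 } END { exit !found }'
}

# Usage (2608 is the example PID from above, yours will differ):
#   ps -ef | pid_is_dsmc 2608 && kill -9 2608
```

The kill only runs if the check passes, so a mistyped PID falls through without doing anything.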

Rerun the ps command from above to verify that the daemon automatically restarted itself. The PID should be different from the one you killed:


NON-Linux unix OS’es:

root 3512 1 0 Aug 13 ? 58:58 /opt/tivoli/tsm/client/ba/bin/dsmc schedule


Linux unix OS'es:


You may see multiple daemons returned from your 'ps -ef | grep tsm' command. This is ok, as there should be one 'master' daemon and 4-5 'child' daemons.


If you do see output similar to the above example, verify that the TSM B/A scheduler daemon successfully retrieved its next job from the TSM server:

tail -20 /var/adm/dsmsched.log

You should see output similar to:

08/17/06 13:41:12 Querying server for next scheduled event.

08/17/06 13:41:12 Node Name: server

08/17/06 13:41:12 Session established with server server: AIX-RS/6000

08/17/06 13:41:12 Server Version 5, Release 2, Level 2.0

08/17/06 13:41:12 Server date/time: 08/17/06 12:24:57 Last access: 08/17/06 12:16:03

08/17/06 13:41:12 --- SCHEDULEREC QUERY BEGIN

08/17/06 13:41:12 --- SCHEDULEREC QUERY END

08/17/06 13:41:12 Next operation scheduled:

08/17/06 13:41:12 ------------------------------------------------------------

08/17/06 13:41:12 Schedule Name: 0000MST

08/17/06 13:41:12 Action: Incremental

08/17/06 13:41:12 Objects:

08/17/06 13:41:12 Options:

08/17/06 13:41:12 Server Window Start: 00:00:00 on 08/18/06

08/17/06 13:41:12 ------------------------------------------------------------

08/17/06 13:41:12 Command will be executed in 11 hours and 36 minutes.
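
The same check can be scripted by looking for the "Next operation scheduled" line in the tail of the log. This is a rough sketch: got_next_schedule is a hypothetical helper name, and the message text is taken from the sample output above, so it may differ on other client versions.

```shell
# Succeed if the last 20 lines of the schedule log show that the daemon
# picked up its next scheduled operation.
#   $1 = path to dsmsched.log
# got_next_schedule is a hypothetical helper name.
got_next_schedule() {
    tail -n 20 "$1" | grep -q 'Next operation scheduled'
}

# Usage:
#   got_next_schedule /var/adm/dsmsched.log && echo "next job retrieved"
```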



4) MANUALLY STARTING A SCHEDULER DAEMON

Ensure TSM B/A client can communicate with TSM server:

dsmc query sched

If the command completes successfully and the output shows no errors, the TSM B/A client can communicate with the TSM server. Proceed to starting the TSM B/A scheduler daemon.
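
If you want to script this check, the exit status of dsmc can be tested instead of reading the output. This is a sketch under a few assumptions: the DSMC variable and the can_reach_server name are made up here, and it assumes dsmc exits with a non-zero status when the query fails. Stdin is redirected from /dev/null so that an unexpected prompt does not leave the command waiting for input.

```shell
# Path to the client binary. DSMC is a hypothetical variable introduced here;
# override it on AIX, where /opt/ is replaced with /usr/.
DSMC="${DSMC:-/opt/tivoli/tsm/client/ba/bin/dsmc}"

# Succeed if "dsmc query sched" exits with status 0.
# Assumes a non-zero exit status means the client could not query the server.
can_reach_server() {
    "$DSMC" query sched </dev/null >/dev/null 2>&1
}

# Usage:
#   can_reach_server && echo "client can talk to the TSM server"
```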


Start TSM B/A client scheduler daemon (on AIX, replace /opt/ with /usr/):

/opt/tivoli/tsm/client/ba/bin/dsmc schedule >/dev/null 2>&1 &

Ensure TSM B/A client scheduler daemon is running:


NON-Linux unix OS'es:


ps -ef | grep -v grep | grep dsm


Linux unix OS'es:


ps -ef | grep -v grep | grep tsm


You should see output similar to:

root 4561 1 0 Aug 13 ? 58:58 /opt/tivoli/tsm/client/ba/bin/dsmc schedule
