Part 1: Live Partition Mobility in IBM Power Servers – Lessons Learned

Live Partition Mobility (LPM) is not a new feature for IBM Power servers, but I recently revisited it and discovered some interesting problems (and solutions) that I want to share to help anyone else who might run into the same issues.

I hadn’t touched Live Partition Mobility for a couple of years, but a customer asked me to test LPM on several AIX LPARs that were giving them errors when they attempted LPM operations. I had successfully performed LPM operations for this customer on different AIX LPARs in their environment several years prior, so I thought it would be a piece of cake.

I was wrong.

In this article (Part 1), I’ll explain the problems that we ran into and how we overcame them. In a future article, I’ll get into the details of IBM’s devscan utility and how it helped us solve the major problem that we faced.

LPM Validation Failure #1
I started my testing with an LPAR named “lpm_test_lpar” by running a very basic LPM validation from the HMC GUI. I left most of the options on the default selections – only choosing the Destination System (LABSRVR2) and specifying a new Destination Profile Name (LPM-TEST). After I hit the “Validate” button, it failed almost immediately with a communication error:

HSCLA246 The management console cannot communicate with partition lpm_test_lpar. Either the network connection is not available or the partition does not have a level of software that is capable of supporting this operation. Verify the correct network and setup of the partition, and try the operation again.

From the LPAR, I ran the following command to verify RMC communication with the two HMCs that it was connected to:

# /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc
Management Domain Status: Management Control Points
  I A  0x1385f9aace3b86bc  0001  192.10.10.34
  I A  0x89ab270a087ada8d  0002  192.10.10.1

That output looks good, so I ran the following command from the HMC command line (as hscroot):

HMC1$ lspartition -dlpar | grep lpm_test_lpar
<#22> Partition:<10*9179-MHB*104A67D, lpm_test_lpar.domain.us, 192.10.10.14>
       Active:<1>, OS:<AIX, 6.1, 6100-09-02-1412>, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<1088>

It has the correct hostname and IP address, so that is good. “Active” is “1”, so that is also good. But “DCaps” is “0x0” – that is NOT good. But what is the problem? Let’s go back to the AIX LPAR and check the status of the RMC subsystem:

# lssrc -a | grep rsct
 ctrmc            rsct             8257538      active
 IBM.DRM          rsct_rm          2883756      inoperative
 IBM.CSMAgentRM   rsct_rm          6619324      active
 IBM.ServiceRM    rsct_rm          8519720      active
 ctcas            rsct                          inoperative
 IBM.ERRM         rsct_rm                       inoperative
 IBM.LPRM         rsct_rm                       inoperative
 IBM.StorageRM    rsct_rm                       inoperative
 IBM.ConfigRM     rsct_rm                       inoperative
 IBM.FSRM         rsct_rm                       inoperative
 IBM.MicroSensorRM rsct_rm                      inoperative
 IBM.SensorRM     rsct_rm                       inoperative
 IBM.WLMRM        rsct_rm                       inoperative
 IBM.AuditRM      rsct_rm                       inoperative
 IBM.HostRM       rsct_rm                       inoperative
 IBM.MgmtDomainRM rsct_rm                       inoperative

They don’t all need to be active, but “IBM.DRM” is inoperative and that is a problem. Let’s restart and see if that fixes it:

Stop the daemons:

# /usr/sbin/rsct/bin/rmcctrl -z
# lssrc -a | grep rsct
 ctcas            rsct                          inoperative
 ctrmc            rsct                          inoperative
 IBM.ERRM         rsct_rm                       inoperative
 IBM.LPRM         rsct_rm                       inoperative
 IBM.StorageRM    rsct_rm                       inoperative
 IBM.ConfigRM     rsct_rm                       inoperative
 IBM.FSRM         rsct_rm                       inoperative
 IBM.MicroSensorRM rsct_rm                      inoperative
 IBM.SensorRM     rsct_rm                       inoperative
 IBM.WLMRM        rsct_rm                       inoperative
 IBM.DRM          rsct_rm                       inoperative
 IBM.CSMAgentRM   rsct_rm                       inoperative
 IBM.ServiceRM    rsct_rm                       inoperative
 IBM.AuditRM      rsct_rm                       inoperative
 IBM.HostRM       rsct_rm                       inoperative
 IBM.MgmtDomainRM rsct_rm                       inoperative

Restart the daemons:

# /usr/sbin/rsct/bin/rmcctrl -A
0513-071 The ctrmc Subsystem has been added.
0513-059 The ctrmc Subsystem has been started. Subsystem PID is 20709438.
# lssrc -a | grep rsct
 ctrmc            rsct             20709438     active
 IBM.DRM          rsct_rm          6619364      active
 IBM.CSMAgentRM   rsct_rm          8519750      active
 IBM.ServiceRM    rsct_rm          15007908     active
 IBM.MgmtDomainRM rsct_rm          24117390     active
 IBM.HostRM       rsct_rm          11665500     active
 IBM.AuditRM      rsct_rm          15073452     active
 ctcas            rsct                          inoperative
 IBM.ERRM         rsct_rm                       inoperative
 IBM.LPRM         rsct_rm                       inoperative
 IBM.StorageRM    rsct_rm                       inoperative
 IBM.ConfigRM     rsct_rm                       inoperative
 IBM.FSRM         rsct_rm                       inoperative
 IBM.MicroSensorRM rsct_rm                      inoperative
 IBM.SensorRM     rsct_rm                       inoperative
 IBM.WLMRM        rsct_rm                       inoperative

That looks better, and IBM.DRM is active again. The final command will enable the daemons for remote client connections:

# /usr/sbin/rsct/bin/rmcctrl -p
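
The stop, start and enable steps above can be collected into a small helper for the next time this comes up. This is only a sketch: the function names drm_is_active and recycle_rmc are my own, and the check assumes the lssrc column layout shown above (subsystem name first, status last). It is meant to be run as root on the AIX LPAR.

```shell
# Succeeds (exit 0) if IBM.DRM is listed as "active" in lssrc output read from stdin.
# Assumes the subsystem name is the first field and the status is the last field.
drm_is_active() {
    awk '$1 == "IBM.DRM" && $NF == "active" { found = 1 } END { exit !found }'
}

# The same three rmcctrl calls used in this article, in order.
recycle_rmc() {
    /usr/sbin/rsct/bin/rmcctrl -z    # stop ctrmc and its resource managers
    /usr/sbin/rsct/bin/rmcctrl -A    # add and start the ctrmc subsystem
    /usr/sbin/rsct/bin/rmcctrl -p    # enable remote client connections
}

# Possible usage: only recycle RMC when IBM.DRM is not active.
#   lssrc -a | drm_is_active || recycle_rmc
```

Keeping the check separate from the restart means the check alone can be run on other LPARs without touching their daemons.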

Now, let’s go back to the HMC command line and see how the “lspartition -dlpar” output looks:

HMC1$ lspartition -dlpar | grep lpm_test_lpar
<#22> Partition:<10*9179-MHB*104A67D, lpm_test_lpar.domain.us, 192.10.10.14>
       Active:<1>, OS:<AIX, 6.1, 6100-09-02-1412>, DCaps:< 0x2c5f>, CmdCaps:< 0x1b, 0x1b>, PinnedMem:< 3540>

Everything looks good now, and “DCaps” is no longer “0x0”. The communication problem should be solved.
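
Rather than grepping for one LPAR at a time, the same output can be filtered for every partition that reports a zero DCaps value. The following is a minimal sketch, assuming the two-line record layout shown above (a “Partition:<...>” line followed by a line containing “DCaps:<...>”). The function name find_zero_dcaps is my own, and since the HMC’s restricted shell may not allow defining functions, I would expect to run it on a workstation and feed it the lspartition output over ssh.

```shell
# Print the "Partition:<...>" details of every partition whose DCaps is 0x0.
# Reads "lspartition -dlpar" output on stdin; assumes each record is a
# Partition line followed by a line that carries the DCaps value.
find_zero_dcaps() {
    awk '
        /Partition:</ {
            part = $0
            sub(/^.*Partition:</, "", part)   # drop everything up to the opening <
            sub(/>.*$/, "", part)             # drop the closing > and anything after it
        }
        /DCaps:< *0x0>/ { print part }
    '
}

# Hypothetical usage from a workstation with ssh access to the HMC:
#   ssh hscroot@HMC1 "lspartition -dlpar" | find_zero_dcaps
```

Any partition it prints is a candidate for the RMC restart described above before an LPM validation is attempted.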

LPM Validation Failure #2
I attempted the LPM Validation again by using the HMC GUI, with the same selections. It failed again, this time on a problem concerning the VSCSI (Virtual SCSI) adapter, which is used on this LPAR to mount a virtual optical drive from one of the VIO Servers. The error:

HSCLA27C The operation to get the physical device location for adapter U9179.MHB.103A23F-V1-C87 on the virtual I/O server partition labvios1 has failed.
The partition command is:
migmgr -f get_adapter -t vscsi -s U9179.MHB.103A23F-V1-C87 -d 0
The partition standard error is:
Null

The error message was a bit cryptic, so I logged onto the VIO Server (as padmin) to look at the VSCSI server adapter:

padmin@labvios1:/home/padmin $ lsmap -vadapter vhost7
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost7          U9179.MHB.103A23F-V1-C87                     0x0000000a
VTD                   lpm_test_lpar-cd
Status                Available
LUN                   0x8100000000000000
Backing device
Physloc
Mirrored              N/A

padmin@labvios1:/home/padmin $ lsvopt | grep lpm_test_lpar-cd
lpm_test_lpar-cd        No Media                                 n/a

So, a virtual optical device (lpm_test_lpar-cd) is mapped to the VSCSI server adapter, but no media is loaded in it. I assumed that was the problem. Then I noticed that clicking the “Detailed information” button in the HMC’s validation error window gave a more informative message:

HSCL400A There was a problem running the VIOS command. HSCLA29A The RMC command issued to partition labvios1 failed.
The partition command is:
migmgr -f get_adapter  -t vscsi -s U9179.MHB.103A23F-V1-C87 -d 1
The RMC return code is:
0
The OS command return code is:
80
The OS standard out is:
Running method '/usr/lib/methods/mig_vscsi'
80
VIOS_DETAILED_ERROR
lpm_test_lpar-cd is backed by a non migratable device: optical
End Detailed Message.

That confirms it – the virtual optical device needed to be removed, so I did:

padmin@labvios1:/home/padmin $ rmdev -dev lpm_test_lpar-cd
lpm_test_lpar-cd deleted

padmin@labvios1:/home/padmin $ lsmap -vadapter vhost7
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost7          U9179.MHB.103A23F-V1-C87                     0x0000000a

VTD                   NO VIRTUAL TARGET DEVICE FOUND
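
Before validating other LPARs, it may save a round trip to list every virtual optical device on a VIO Server up front. This is a rough sketch, assuming lsvopt prints one device per line with the VTD name in the first column and a header line whose first field is “VTD”. The function name list_vopt_vtds is my own. It only lists names; removal is left as a manual step with the same rmdev command used above.

```shell
# Print the VTD name of every virtual optical device in lsvopt output (stdin).
# Assumes the VTD name is the first field; skips blank lines and a header
# line whose first field is "VTD".
list_vopt_vtds() {
    awk 'NF && $1 != "VTD" { print $1 }'
}

# Possible usage on the VIO Server (as padmin):
#   lsvopt | list_vopt_vtds
```

Each name it prints can then be checked with lsmap to see which client LPAR it belongs to before deciding whether to remove it.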

LPM Validation Failure #3
At this point, I decided to switch to using the HMC command line instead of the GUI, so that I could specify more details for how the mobile LPAR would be configured on the destination server – the LPAR ID, slot numbers of the virtual devices, and VSCSI and VFC (Virtual Fibre Channel) mappings. Here is the command that I ran:

HMC1$ migrlpar -o v -m LABSRVR1 -t LABSRVR2 -p lpm_test_lpar -n LPM-TEST -i "dest_lpar_id=51,\"virtual_fc_mappings=10/labvios3//61/fcs2,11/labvios4//61/fcs3\",virtual_scsi_mappings=19/labvios3//108"

I don’t want to go into detail about all of the command line options that are available, but I will offer a brief explanation of what the above command does. This command runs an LPM validation of migrating LPAR “lpm_test_lpar” from server “LABSRVR1” to server “LABSRVR2”, using a new LPAR profile name of “LPM-TEST”. The new LPAR ID on LABSRVR2 will be “51”. The LPAR’s VFC client adapter with adapter ID “10” will be mapped to physical fibre channel adapter “fcs2” on VIO Server “labvios3” using a vfchost adapter with adapter ID “61”. Likewise, the VFC client adapter with adapter ID “11” will be mapped to physical fibre channel adapter “fcs3” on “labvios4” using a vfchost adapter with adapter ID “61”. And the VSCSI client adapter with adapter ID “19” will be mapped to a vhost adapter on “labvios3” with adapter ID “108”.
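
Because the mapping string is easy to mistype, here is a small sketch that assembles the same “-i” value from its parts. The helper name fc_mapping and the variable names are my own. The field order (client adapter ID, VIOS name, an empty field, VIOS adapter ID, physical port) is taken from the command above; as I understand the migrlpar documentation, the empty field is where a VIOS partition ID could go instead of the name, but treat that as an assumption.

```shell
# Build one virtual_fc_mappings entry in the form used above:
#   client-adapter-ID/VIOS-name//vfchost-adapter-ID/physical-port
# The third field is deliberately left empty, as in the original command.
fc_mapping() {
    # $1 = client adapter ID, $2 = VIOS name, $3 = vfchost adapter ID, $4 = physical FC port
    printf '%s/%s//%s/%s' "$1" "$2" "$3" "$4"
}

fc_maps="$(fc_mapping 10 labvios3 61 fcs2),$(fc_mapping 11 labvios4 61 fcs3)"
scsi_map="19/labvios3//108"

# The embedded double quotes keep the comma-separated FC list together
# as a single attribute value inside the larger comma-separated string.
input="dest_lpar_id=51,\"virtual_fc_mappings=${fc_maps}\",virtual_scsi_mappings=${scsi_map}"
echo "$input"
```

The resulting string is what the HMC receives after the shell strips the outer quotes from the “-i” argument in the command above.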

This time it gets past the communication and VSCSI errors, but fails with VFC errors:

HSCLA319 The migrating partition's virtual fibre channel client adapter 10 
cannot be hosted by the existing Virtual I/O Server (VIOS) partitions on the 
destination managed system.  To migrate the partition, set up the necessary 
VIOS host on the destination managed system, then try the operation again.
HSCLA319 The migrating partition's virtual fibre channel client adapter 11 
cannot be hosted by the existing Virtual I/O Server (VIOS) partitions on the 
destination managed system.  To migrate the partition, set up the necessary 
VIOS host on the destination managed system, then try the operation again.

This one turned out to be quite a bugger. My good friend Google indicated that an HSCLA319 error is usually due to a SAN zoning problem. Naturally, our SAN team indicated that there weren’t any problems with the zoning – all primary and secondary (LPM) WWPNs were zoned to the storage via IBM’s SVC (SAN Volume Controller). I was a little skeptical, since the major thing that had changed since my previous successful LPM migrations at this site was that the SAN storage had been migrated from IBM DS8x00 storage to the SVC. So, I had a SAN expert within The ATS Group take a look at it, and he verified that the secondary WWPNs were indeed properly zoned to the SVC storage.

We ended up opening problem tickets with IBM support with both the PowerVM and Storage groups. At first, information gathering and debugging for each ticket was handled separately, with each group finger-pointing and saying that the problem was with the other group. But eventually we were able to get people who understood both the PowerVM and Storage sides, and we were able to get to the bottom of things. To make a long story slightly less long, we verified that our SAN team’s assertion was correct – the zoning of the LPAR to the SVC was indeed done properly. However, the problem was that the primary WWPNs were still zoned to the DS8x00 (in addition to the SVC), but the secondary WWPNs were only zoned to the SVC. Our SAN team hadn’t considered that this might be a problem because there was no longer any storage allocated from the DS8x00. But apparently Live Partition Mobility detected the DS8x00 connection for the primary WWPNs and required that the secondary WWPNs have that same connection, even though no storage was actually assigned through that connection.
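
In hindsight, the mismatch comes down to a set comparison: every storage target reachable from a primary WWPN must also be reachable from its secondary WWPN. The sketch below illustrates that comparison with the standard sort and comm utilities. It assumes two hypothetical text files, one listing the targets zoned to the primary WWPN and one for the secondary WWPN, one target per line. How those lists are gathered (a zoning export from the switches, for example) is outside this sketch, and the function name targets_only_in_primary is my own.

```shell
# Print targets that appear in the primary list but not in the secondary list.
# $1 = file of targets zoned to the primary WWPN   (hypothetical input)
# $2 = file of targets zoned to the secondary WWPN (hypothetical input)
targets_only_in_primary() {
    tmp="${TMPDIR:-/tmp}"
    sort -u "$1" > "$tmp/primary.$$"
    sort -u "$2" > "$tmp/secondary.$$"
    # comm -23 suppresses lines unique to the second file and lines common to both,
    # leaving only lines unique to the first file.
    comm -23 "$tmp/primary.$$" "$tmp/secondary.$$"
    rm -f "$tmp/primary.$$" "$tmp/secondary.$$"
}

# Possible usage:
#   targets_only_in_primary primary_targets.txt secondary_targets.txt
```

In our case the primary list would have contained both the DS8x00 and the SVC while the secondary list contained only the SVC, so the DS8x00 would have been printed as the odd one out.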

The Solution
Since the DS8x00 unit was no longer used to host storage, the SAN team removed the DS8x00 zoning from the primary WWPNs so that only the SVC connection remained. That solved the problem, because the zoning for the primary and secondary WWPNs now matched. The validation command now runs successfully, although it does spit out hundreds of informational/warning messages. To run an actual LPM migration, we just change “-o v” to “-o m”:

HMC1$ migrlpar -o m -m LABSRVR1 -t LABSRVR2 -p lpm_test_lpar -n LPM-TEST -i "dest_lpar_id=51,\"virtual_fc_mappings=10/labvios3//61/fcs2,11/labvios4//61/fcs3\",virtual_scsi_mappings=19/labvios3//108"

The migration itself ran for several minutes and was successful – it completed without any errors or warnings of any kind. After fixing the problem on this LPAR, we were able to remove the old DS8x00 zoning from several other LPARs and perform successful LPM migrations on them as well.

Digging Deep with devscan
Per a recommendation from IBM support, we used IBM’s “devscan” utility in our process of debugging the SAN connectivity, and it was instrumental in helping us uncover the zoning problem. I found this to be a very useful tool that doesn’t seem to have a lot of real-world documentation. So, I am going to go into further detail about the devscan utility in Part 2 of this article… stay tuned!
