Live Partition Mobility (LPM) is not a new feature for IBM Power servers, but I recently revisited it and discovered some interesting problems (and solutions) that I wanted to share to help anyone else who might run into the same issues.
I hadn’t touched Live Partition Mobility for a couple of years, but a customer asked me to test LPM on several AIX LPARs that were producing errors when they attempted LPM operations. I had successfully performed LPM operations for this customer on other AIX LPARs in their environment several years earlier, so I thought it would be a piece of cake.
I was wrong.
In this article (Part 1), I’ll explain the problems that we ran into and how we overcame them. In a future article, I’ll get into the details of IBM’s devscan utility and how it helped us solve the major problem that we faced.
LPM Validation Failure #1
I started my testing with an LPAR named “lpm_test_lpar” by attempting a very basic LPM validation from the HMC GUI. I left most of the options at their default selections, only choosing the Destination System (LABSRVR2) and specifying a new Destination Profile Name (LPM-TEST). When I hit the “Validate” button, the validation failed almost immediately with a communication error:
HSCLA246 The management console cannot communicate with partition lpm_test_lpar. Either the network connection is not available or the partition does not have a level of software that is capable of supporting this operation. Verify the correct network and setup of the partition, and try the operation again.
From the LPAR, I ran the following command to verify RMC communication with the two HMCs that it was connected to:
# /usr/sbin/rsct/bin/rmcdomainstatus -s ctrmc

Management Domain Status: Management Control Points

   I A  0x1385f9aace3b86bc  0001  192.10.10.34
   I A  0x89ab270a087ada8d  0002  192.10.10.1
That output looks good, so I ran the following command from the HMC command line (as hscroot):
HMC1$ lspartition -dlpar | grep lpm_test_lpar
<#22> Partition:<10*9179-MHB*104A67D, lpm_test_lpar.domain.us, 192.10.10.14>
       Active:<1>, OS:<AIX, 6.1, 6100-09-02-1412>, DCaps:<0x0>, CmdCaps:<0x0, 0x0>, PinnedMem:<1088>
It has the correct hostname and IP address, so that is good. “Active” is “1”, so that is also good. But “DCaps” is “0x0”, and that is NOT good. So what is the problem? Let’s go back to the AIX LPAR and check the status of the RMC subsystem:
# lssrc -a | grep rsct
 ctrmc             rsct              8257538   active
 IBM.DRM           rsct_rm           2883756   inoperative
 IBM.CSMAgentRM    rsct_rm           6619324   active
 IBM.ServiceRM     rsct_rm           8519720   active
 ctcas             rsct                        inoperative
 IBM.ERRM          rsct_rm                     inoperative
 IBM.LPRM          rsct_rm                     inoperative
 IBM.StorageRM     rsct_rm                     inoperative
 IBM.ConfigRM      rsct_rm                     inoperative
 IBM.FSRM          rsct_rm                     inoperative
 IBM.MicroSensorRM rsct_rm                     inoperative
 IBM.SensorRM      rsct_rm                     inoperative
 IBM.WLMRM         rsct_rm                     inoperative
 IBM.AuditRM       rsct_rm                     inoperative
 IBM.HostRM        rsct_rm                     inoperative
 IBM.MgmtDomainRM  rsct_rm                     inoperative
They don’t all need to be active, but “IBM.DRM” is inoperative, and that is a problem. Let’s restart the RMC daemons and see if that fixes it.
Stop the daemons:
# /usr/sbin/rsct/bin/rmcctrl -z
# lssrc -a | grep rsct
 ctcas             rsct              inoperative
 ctrmc             rsct              inoperative
 IBM.ERRM          rsct_rm           inoperative
 IBM.LPRM          rsct_rm           inoperative
 IBM.StorageRM     rsct_rm           inoperative
 IBM.ConfigRM      rsct_rm           inoperative
 IBM.FSRM          rsct_rm           inoperative
 IBM.MicroSensorRM rsct_rm           inoperative
 IBM.SensorRM      rsct_rm           inoperative
 IBM.WLMRM         rsct_rm           inoperative
 IBM.DRM           rsct_rm           inoperative
 IBM.CSMAgentRM    rsct_rm           inoperative
 IBM.ServiceRM     rsct_rm           inoperative
 IBM.AuditRM       rsct_rm           inoperative
 IBM.HostRM        rsct_rm           inoperative
 IBM.MgmtDomainRM  rsct_rm           inoperative
Restart the daemons:
# /usr/sbin/rsct/bin/rmcctrl -A
0513-071 The ctrmc Subsystem has been added.
0513-059 The ctrmc Subsystem has been started. Subsystem PID is 20709438.
# lssrc -a | grep rsct
 ctrmc             rsct             20709438   active
 IBM.DRM           rsct_rm           6619364   active
 IBM.CSMAgentRM    rsct_rm           8519750   active
 IBM.ServiceRM     rsct_rm          15007908   active
 IBM.MgmtDomainRM  rsct_rm          24117390   active
 IBM.HostRM        rsct_rm          11665500   active
 IBM.AuditRM       rsct_rm          15073452   active
 ctcas             rsct                        inoperative
 IBM.ERRM          rsct_rm                     inoperative
 IBM.LPRM          rsct_rm                     inoperative
 IBM.StorageRM     rsct_rm                     inoperative
 IBM.ConfigRM      rsct_rm                     inoperative
 IBM.FSRM          rsct_rm                     inoperative
 IBM.MicroSensorRM rsct_rm                     inoperative
 IBM.SensorRM      rsct_rm                     inoperative
 IBM.WLMRM         rsct_rm                     inoperative
That looks better, and IBM.DRM is active again. The final command will enable the daemons for remote client connections:
# /usr/sbin/rsct/bin/rmcctrl -p
Now, let’s go back to the HMC command line and see how the “lspartition -dlpar” command output looks now:
HMC1$ lspartition -dlpar | grep lpm_test_lpar
<#22> Partition:<10*9179-MHB*104A67D, lpm_test_lpar.domain.us, 192.10.10.14>
       Active:<1>, OS:<AIX, 6.1, 6100-09-02-1412>, DCaps:<0x2c5f>, CmdCaps:<0x1b, 0x1b>, PinnedMem:<3540>
Everything looks good now, and “DCaps” is no longer “0x0”. The communication problem should be solved.
LPM Validation Failure #2
I attempted the LPM validation again from the HMC GUI, with the same selections. It failed again, this time with a problem concerning the VSCSI (Virtual SCSI) adapter, which this LPAR uses to mount a virtual optical drive from one of the VIO Servers. The error:
HSCLA27C The operation to get the physical device location for adapter U9179.MHB.103A23F-V1-C87 on the virtual I/O server partition labvios1 has failed.
The partition command is:
migmgr -f get_adapter -t vscsi -s U9179.MHB.103A23F-V1-C87 -d 0
The partition standard error is: Null
The error message was a bit cryptic, so I logged onto the VIO Server (as padmin) to look at the VSCSI server adapter:
padmin@labvios1:/home/padmin $ lsmap -vadapter vhost7
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost7          U9179.MHB.103A23F-V1-C87                     0x0000000a

VTD                   lpm_test_lpar-cd
Status                Available
LUN                   0x8100000000000000
Backing device
Physloc
Mirrored              N/A

padmin@labvios1:/home/padmin $ lsvopt | grep lpm_test_lpar-cd
lpm_test_lpar-cd                 No Media                                n/a
So, a virtual optical device (lpm_test_lpar-cd) is mapped to the VSCSI server adapter, but no media is loaded into it. I assumed that was the problem, and then I noticed that clicking the “Detailed information” button in the HMC’s validation error window revealed a more informative message:
HSCL400A There was a problem running the VIOS command.
HSCLA29A The RMC command issued to partition labvios1 failed.
The partition command is:
migmgr -f get_adapter -t vscsi -s U9179.MHB.103A23F-V1-C87 -d 1
The RMC return code is: 0
The OS command return code is: 80
The OS standard out is:
Running method '/usr/lib/methods/mig_vscsi'
80
VIOS_DETAILED_ERROR
lpm_test_lpar-cd is backed by a non migratable device: optical
End Detailed Message.
That confirms it – the virtual optical device needed to be removed, so I did:
padmin@labvios1:/home/padmin $ rmdev -dev lpm_test_lpar-cd
lpm_test_lpar-cd deleted

padmin@labvios1:/home/padmin $ lsmap -vadapter vhost7
SVSA            Physloc                                      Client Partition ID
--------------- -------------------------------------------- ------------------
vhost7          U9179.MHB.103A23F-V1-C87                     0x0000000a

VTD                   NO VIRTUAL TARGET DEVICE FOUND
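As a side note: if media had still been loaded into the virtual optical device, the rmdev would have been preceded by an unload, and the device can be recreated on the destination VIO Server after the migration if the LPAR still needs it. Here is a rough sketch of both steps using the standard unloadopt and mkvdev commands; the “vhostN” adapter name is a placeholder, since the actual vhost adapter on the destination VIOS will be different:

padmin@labvios1:/home/padmin $ unloadopt -vtd lpm_test_lpar-cd
padmin@labvios3:/home/padmin $ mkvdev -fbo -vadapter vhostN -dev lpm_test_lpar-cd

The unloadopt runs on the source VIOS before the device is removed, and the mkvdev -fbo on the destination VIOS creates a new file-backed virtual optical device that can be mapped to the migrated LPAR’s VSCSI adapter.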
LPM Validation Failure #3
At this point, I decided to switch from the HMC GUI to the HMC command line so that I could specify more details about how the mobile LPAR would be configured on the destination server: the LPAR ID, the slot numbers of the virtual devices, and the VSCSI and VFC (Virtual Fibre Channel) mappings. Here is the command that I ran:
HMC1$ migrlpar -o v -m LABSRVR1 -t LABSRVR2 -p lpm_test_lpar -n LPM-TEST -i "dest_lpar_id=51,\"virtual_fc_mappings=10/labvios3//61/fcs2,11/labvios4//61/fcs3\",virtual_scsi_mappings=19/labvios3//108"
I don’t want to go into detail about every available command-line option, but I will offer a brief explanation of what the above command does. It runs an LPM validation for migrating LPAR “lpm_test_lpar” from server “LABSRVR1” to server “LABSRVR2”, using a new LPAR profile name of “LPM-TEST”. The new LPAR ID on LABSRVR2 will be “51”. The LPAR’s VFC client adapter with adapter ID “10” will be mapped to physical fibre channel adapter “fcs2” on VIO Server “labvios3” using a vfchost adapter with adapter ID “61”. Likewise, the VFC client adapter with adapter ID “11” will be mapped to physical fibre channel adapter “fcs3” on “labvios4” using a vfchost adapter with adapter ID “61”. And the VSCSI client adapter with adapter ID “19” will be mapped to a vhost adapter on “labvios3” with adapter ID “108”.
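If you are not sure what to plug into those mapping fields, the client adapter slot IDs can be read from the HMC with lshwres, and the NPIV-capable physical fibre channel ports on the destination VIO Servers can be listed with lsnports. Here is a sketch of the kind of lookups I mean (the exact output fields vary a bit by HMC and VIOS level, so treat this as a starting point rather than a recipe):

HMC1$ lshwres -r virtualio --rsubtype fc --level lpar -m LABSRVR1 --filter lpar_names=lpm_test_lpar
HMC1$ lshwres -r virtualio --rsubtype scsi --level lpar -m LABSRVR1 --filter lpar_names=lpm_test_lpar
padmin@labvios3:/home/padmin $ lsnports

The lshwres output includes each virtual adapter’s slot number (the “10”, “11”, and “19” used above), and lsnports shows which physical ports on the destination VIOS can host NPIV clients and how many WWPNs they have available.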
This time it gets past the communication and VSCSI errors, but fails with VFC errors:
HSCLA319 The migrating partition's virtual fibre channel client adapter 10 cannot be hosted by the existing Virtual I/O Server (VIOS) partitions on the destination managed system. To migrate the partition, set up the necessary VIOS host on the destination managed system, then try the operation again.

HSCLA319 The migrating partition's virtual fibre channel client adapter 11 cannot be hosted by the existing Virtual I/O Server (VIOS) partitions on the destination managed system. To migrate the partition, set up the necessary VIOS host on the destination managed system, then try the operation again.
This one turned out to be quite a bugger. My good friend Google indicated that an HSCLA319 error is usually due to a SAN zoning problem. Naturally, our SAN team said there weren’t any problems with the zoning: all primary and secondary (LPM) WWPNs were zoned to the storage via IBM’s SVC (SAN Volume Controller). I was a little skeptical, since the major change since my previous successful LPM migrations at this site was that the SAN storage had been migrated from IBM DS8x00 storage to the SVC. So I had a SAN expert within The ATS Group take a look, and he verified that the secondary WWPNs were indeed properly zoned to the SVC storage.
We ended up opening problem tickets with IBM support for both the PowerVM and Storage groups. At first, information gathering and debugging for each ticket was handled separately, with each group pointing the finger at the other. Eventually, though, we got people involved who understood both the PowerVM and storage sides, and we were able to get to the bottom of things. To make a long story slightly less long, our SAN team’s assertion was correct: the zoning of the LPAR to the SVC was done properly. The problem was that the primary WWPNs were still zoned to the DS8x00 (in addition to the SVC), while the secondary WWPNs were zoned only to the SVC. Our SAN team hadn’t considered that this might be a problem, because no storage was allocated from the DS8x00 anymore. But Live Partition Mobility detected the DS8x00 connection for the primary WWPNs and required that the secondary WWPNs have that same connection, even though no storage was actually assigned through it.
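One thing that makes this kind of conversation with a SAN team easier is having both WWPNs of each VFC client adapter in hand, because only the currently active WWPN shows up in lscfg output on the AIX LPAR itself. Something like the following on the HMC should list the WWPN pair for each virtual fibre channel client adapter; the attribute names can differ slightly between HMC levels, so verify against lshwres on your own HMC:

HMC1$ lshwres -r virtualio --rsubtype fc --level lpar -m LABSRVR1 --filter lpar_names=lpm_test_lpar -F lpar_name,slot_num,wwpns

Both WWPNs of each pair need identical zoning (and identical LUN mapping on the storage), because the second WWPN is the one that logs into the fabric from the destination system during the migration.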
The Solution
Since the DS8x00 unit no longer hosted any storage, the SAN team removed the DS8x00 zoning from the primary WWPNs, leaving only the SVC connection. That solved the problem, because the zoning for the primary and secondary WWPNs now matched. The validation command now runs successfully, although it does spit out hundreds of informational/warning messages. To run the actual LPM migration, we just changed “-o v” to “-o m”:
HMC1$ migrlpar -o m -m LABSRVR1 -t LABSRVR2 -p lpm_test_lpar -n LPM-TEST -i "dest_lpar_id=51,\"virtual_fc_mappings=10/labvios3//61/fcs2,11/labvios4//61/fcs3\",virtual_scsi_mappings=19/labvios3//108"
The migration itself ran for several minutes and completed successfully, without any errors or warnings. After fixing the problem on this LPAR, we were able to remove the old DS8x00 zoning from several other LPARs and perform successful LPM migrations on them as well.
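For a quick post-migration sanity check, it is worth confirming that the LPAR is running on the destination frame and that its MPIO paths are all healthy. Here is a small sketch, not taken from the original test notes:

HMC1$ lssyscfg -r lpar -m LABSRVR2 -F name,lpar_id,state --filter lpar_names=lpm_test_lpar
# lspath | grep -v Enabled

The lssyscfg command should report the LPAR as Running on LABSRVR2, and the lspath command (run as root on the migrated AIX LPAR) should return nothing if every disk path is in the Enabled state.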
Digging Deep with devscan
Per a recommendation from IBM support, we used IBM’s “devscan” utility while debugging the SAN connectivity, and it was instrumental in helping us uncover the zoning problem. I found it to be a very useful tool that doesn’t seem to have much real-world documentation, so I am going to go into further detail about the devscan utility in Part 2 of this article… stay tuned!