I recently performed some Live Partition Mobility (LPM) testing for a customer, and ran into a number of problems – see Part 1 of this post for the full story on all of the problems and solutions.
The most complicated problem required us to investigate our SAN storage connectivity. IBM Support introduced us to the devscan utility to aid in this task. I found devscan to be a powerful and useful utility, but I didn't find much real-world documentation on it, so I thought I would document my usage of it in this article.
Introduction to devscan
The devscan utility is a free tool that was developed by IBM to provide information on SAN storage and connectivity and aid in debugging problems. Information about downloading and using devscan can be found at IBM’s AIX Support Center Tools website:
http://www-01.ibm.com/support/docview.wss?uid=aixtools_home
Also, IBM Systems Magazine has an introductory article on devscan that is worth the read:
http://ibmsystemsmag.com/Blogs/AIXchange/July-2015/A-Tool-for-SAN-Troubleshooting/
Problem Recap
After overcoming several minor LPM problems, our attempt to perform an LPM validation from the HMC command line failed with error HSCLA319, indicating that the destination VIO Servers could not host the Virtual Fibre Channel (VFC) adapters required by the LPM client partition. A Google search and IBM Support both pointed to SAN zoning as the likely culprit. Our SAN team disagreed, so we set about proving or disproving that there was a SAN zoning issue.
Running devscan on the LPM Destination Server’s VIO Servers
We began our debugging by running devscan on the VIO Servers of the LPM destination server. On a VIO Server, we run devscan in NPIV mode, which lets us pass it the WWPNs of the LPM client LPAR's VFC adapters and test their SAN connectivity through the physical fibre channel adapters in the VIO Server. In particular, we can specify the secondary (LPM) WWPNs to verify that the destination server has the connectivity required to support the LPM migration.
In NPIV mode, devscan cannot gather information about specific LUNs, but it does gather data regarding the connections on the storage side. IBM Support had us run the following command for each client VFC adapter WWPN (primary and secondary) on all fscsi adapters on both destination VIO Servers:
# devscan -t f -n <WWPN> --dev=fscsi<#>
In retrospect, I would NOT recommend doing this for all primary and secondary WWPNs, as doing so caused our client LPAR to hang and then crash. I'm not sure whether this *should* have caused a problem, but establishing connectivity for the primary WWPNs on the destination server seemed to pull that connectivity away from the client LPAR on the source server. In theory, you should only need to run this for the secondary WWPNs on the destination server, to verify that the zoning will allow the mobility operation to complete successfully. Despite crashing our LPAR, however, running it with both sets of WWPNs turned out to be fortunate, because it surfaced some interesting information that helped us pinpoint the real problem.
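Looping the check over every adapter is easy to script. The sketch below is a dry run: it only prints the devscan command it would issue for each WWPN/adapter combination, so nothing touches the fabric until you run the commands deliberately. The WWPN and adapter lists are placeholders to be replaced with your client LPAR's secondary WWPNs and the VIO Server's actual fscsi devices.

```shell
# Dry-run generator for the per-adapter devscan sweep.  The WWPN and
# adapter lists below are placeholders -- substitute the client LPAR's
# secondary (LPM) WWPNs and the fscsi devices on your VIO Server.
SECONDARY_WWPNS="C0407903A2C8002B C0407903A2C8002D"
ADAPTERS="fscsi0 fscsi1 fscsi2 fscsi3"

for wwpn in $SECONDARY_WWPNS; do
    for adap in $ADAPTERS; do
        # Print the command instead of executing it; pipe this output
        # to sh (or drop the echo) to actually run the scans.
        echo "devscan -t f -n $wwpn --dev=$adap"
    done
done
```

Reviewing the generated list before executing it is a cheap safeguard, given how disruptive an over-broad sweep proved to be in our case.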
Here is an example of the devscan command and output:
# devscan -t f -n C0407903A2C8002B --dev=fscsi2
devscan v1.0.5
Copyright (C) 2010-2012 IBM Corp., All Rights Reserved
cmd: devscan -t f -n C0407903A2C8002B --dev=fscsi2
Current time: 2015-06-03 20:35:27.984569 GMT
Running on host: labvios3
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing FC device:
  Adapter driver:     fcs2
  Protocol driver:    fscsi2
  Connection type:    fabric
  Link State:         up
  Current link speed: 8 Gbps
  Local SCSI ID:      0x5d0072
  Local WWPN:         0x10000000c6d00257
  Local WWNN:         0x20000000c6d00257
  NPIV SCSI ID:       0x5d6318
  NPIV WWPN:          0xC0407903A2C8002B
  Device ID:          0xdf1000f237109a04
  Microcode level:    202307

  SCSI ID   LUN ID            WWPN              WWNN
  -----------------------------------------------------------
  6e0050    0000000000000000  500507630813064b  5005076308ffc64b
  6e0051    0000000000000000  500507630818064b  5005076308ffc64b
  6e5424    0000000000000000  500507680140c742  500507680100c742
  6e5426    0000000000000000  500507680140c257  500507680100c257
  6e5578    0000000000000000  500507680140ed53  500507680100ed53
  6e5579    0000000000000000  500507680140ed58  500507680100ed58

6 targets found, reporting 0 LUNs, 0 of which responded to SCIOLSTART.
Elapsed time this adapter: 00.279705 seconds
Cleaning up...
Total elapsed time: 00.282444 seconds
Completed with error(s)
The devscan command returns a lot of good info, but we focused on the highlighted section. Notice that the LUN IDs are all “0”. As I mentioned, in NPIV mode devscan cannot find actual LUN information. However, it does demonstrate the connectivity by providing info about the 6 target ports that the client VFC adapter’s WWPN can “see” via the SAN fabric that this physical fibre channel adapter is connected to.
In our configuration, we utilize two SAN fabrics, so each AIX LPAR has a pair of VFC adapters, one for each fabric. On the VIO Server side, the server VFC adapters are mapped across four physical fibre channel adapters to spread the load – two of these adapters are cabled to SAN fabric A while the other two are cabled to fabric B. So, devscan commands for a specific WWPN from one of the client VFC adapters returned the 6 SCSI IDs listed above for half of the adapters (for the two cabled to the SAN fabric where that WWPN is included in the zoning), and the other half returned with a message saying “No targets found”, since that WWPN isn’t zoned in the other fabric. For a WWPN from the other client VFC adapter, the opposite adapters returned SAN connection info, but pertaining to the other SAN fabric. Example from the other fabric:
  SCSI ID   LUN ID            WWPN              WWNN
  -----------------------------------------------------------
  780043    0000000000000000  500507630803864b  5005076308ffc64b
  780044    0000000000000000  500507630808864b  5005076308ffc64b
  784c2b    0000000000000000  500507680130c742  500507680100c742
  784c2c    0000000000000000  500507680130c257  500507680100c257
  784f15    0000000000000000  500507680130ed53  500507680100ed53
  784f17    0000000000000000  500507680130ed58  500507680100ed58

6 targets found, reporting 0 LUNs,
To summarize our findings, when we ran devscan with the client VFC adapters’ primary WWPNs, we found the following SCSI IDs:
Fabric A   Fabric B
--------   --------
6e0050     780043
6e0051     780044
6e5424     784c2b
6e5426     784c2c
6e5578     784f15
6e5579     784f17
However, when we ran devscan with the client VFC adapters’ secondary WWPNs, we only found the following SCSI IDs:
Fabric A   Fabric B
--------   --------
6e5424     784c2b
6e5426     784c2c
6e5578     784f15
6e5579     784f17
The secondary WWPNs were missing the following SCSI ID connections:
Fabric A   Fabric B
--------   --------
6e0050     780043
6e0051     780044
We presented this information to the SAN team, but they once again merely confirmed that the secondary WWPNs were indeed zoned properly to the SVC storage. So, we had to continue our debugging.
Finding SCSI IDs with lspath on LPM Client LPAR
We decided to run some commands directly on the client LPAR, to see what the connectivity looked like from there. For each disk/VFC adapter combination we were able to find the SCSI ID info that devscan returned on the VIO Servers by using this lspath command:
# lspath -AHE -l <disk> -p <parent adapter> -w <connection>
First, we needed the “connection” information to use with the “-w” flag. We could get that for each disk by using the “-F” flag with the lspath command. For example, for hdisk1:
# lspath -l hdisk1 -F "connection:parent:path_status:status"
500507680140ed53,2000000000000:fscsi0:Available:Enabled
500507680140ed58,2000000000000:fscsi0:Available:Enabled
500507680130ed53,2000000000000:fscsi1:Available:Enabled
500507680130ed58,2000000000000:fscsi1:Available:Enabled
After gathering all of the connection information, we could run lspath again for each disk/adapter/connection combination to return the SCSI ID, along with some other info. For example:
# lspath -AHE -l hdisk1 -p fscsi0 -w "500507680140ed53,2000000000000"
attribute value description user_settable
scsi_id 0x6e5578 SCSI ID False
node_name 0x500507680100ed53 FC Node Name False
priority 1 Priority True
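Chaining the two lspath forms together avoids typing every disk/adapter/connection combination by hand. The sketch below parses connection:parent pairs and prints the follow-up lspath command for each path; the canned input lines are the hdisk1 output captured earlier, so on a live system you would feed it real `lspath -F` output instead.

```shell
# Build the per-path "lspath -AHE" commands from "lspath -F" output.
# The printf lines below are the hdisk1 paths captured earlier; on a
# live AIX system, pipe in the output of:
#   lspath -l $disk -F "connection:parent:path_status:status"
disk=hdisk1
cmds=$(printf '%s\n' \
    '500507680140ed53,2000000000000:fscsi0:Available:Enabled' \
    '500507680140ed58,2000000000000:fscsi0:Available:Enabled' \
    '500507680130ed53,2000000000000:fscsi1:Available:Enabled' \
    '500507680130ed58,2000000000000:fscsi1:Available:Enabled' |
  while IFS=: read -r conn parent path_status status; do
      # Emit one lspath invocation per path; run them (or pipe to sh)
      # to collect the scsi_id attribute for each connection.
      echo "lspath -AHE -l $disk -p $parent -w \"$conn\""
  done)
echo "$cmds"
```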
I’ll save you from reading the commands and output for each disk/adapter/connection permutation, but here are the summarized results that we compiled for the three disks on the client LPAR:
DISK     CONNECTION                       ADAPTER  SCSI_ID
hdisk0 - 500507680140ed53,0              - fscsi0 - 0x6e5578
hdisk0 - 500507680140ed58,0              - fscsi0 - 0x6e5579
hdisk0 - 500507680130ed53,0              - fscsi1 - 0x784f15
hdisk0 - 500507680130ed58,0              - fscsi1 - 0x784f17
hdisk1 - 500507680140ed53,2000000000000  - fscsi0 - 0x6e5578
hdisk1 - 500507680140ed58,2000000000000  - fscsi0 - 0x6e5579
hdisk1 - 500507680130ed53,2000000000000  - fscsi1 - 0x784f15
hdisk1 - 500507680130ed58,2000000000000  - fscsi1 - 0x784f17
hdisk2 - 500507680140c742,1000000000000  - fscsi0 - 0x6e5424
hdisk2 - 500507680140c257,1000000000000  - fscsi0 - 0x6e5426
hdisk2 - 500507680130c742,1000000000000  - fscsi1 - 0x784c2b
hdisk2 - 500507680130c257,1000000000000  - fscsi1 - 0x784c2c
Now, we compare the SCSI IDs that lspath found to the SCSI IDs that devscan found on the destination VIO Servers:
6e0050 - not found at all by lspath
6e0051 - not found at all by lspath
6e5424 - lspath found with hdisk2
6e5426 - lspath found with hdisk2
6e5578 - lspath found with hdisk0 & hdisk1
6e5579 - lspath found with hdisk0 & hdisk1
780043 - not found at all by lspath
780044 - not found at all by lspath
784c2b - lspath found with hdisk2
784c2c - lspath found with hdisk2
784f15 - lspath found with hdisk0 & hdisk1
784f17 - lspath found with hdisk0 & hdisk1
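The cross-check itself is just a set difference, which comm can compute once both lists are sorted. A minimal sketch using the fabric A IDs from the tables above (fabric B works the same way):

```shell
# SCSI IDs devscan reported for the primary WWPNs on fabric A...
printf '%s\n' 6e0050 6e0051 6e5424 6e5426 6e5578 6e5579 | sort > /tmp/devscan_ids
# ...versus the SCSI IDs lspath found in use on the client LPAR.
printf '%s\n' 6e5424 6e5426 6e5578 6e5579 | sort > /tmp/lspath_ids
# comm -23 prints lines unique to the first file: connections that
# devscan can see but that no hdisk path actually uses.
unused=$(comm -23 /tmp/devscan_ids /tmp/lspath_ids)
echo "$unused"
```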
So the four SCSI IDs that devscan found only with the primary WWPNs on the VIO Servers do not appear to be used on the client LPAR at all.
The plot thickens.
Putting it all Together
We next decided to run devscan on the LPM client LPAR to see what we could learn. We ran the following command on the client to find all of the SCSI IDs:
# devscan --dev=<adapter> --concise | awk -F '|' '{print $2}' | sort -n | uniq
Running it for the two VFC adapters on the client LPAR produced the following results:
# devscan --dev=fscsi0 --concise | awk -F '|' '{print $2}' | sort -n | uniq
 SCSI/SAS ID
00000000006e0050
00000000006e0051
00000000006e5424
00000000006e5426
00000000006e5578
00000000006e5579

# devscan --dev=fscsi1 --concise | awk -F '|' '{print $2}' | sort -n | uniq
 SCSI/SAS ID
0000000000780043
0000000000780044
0000000000784c2b
0000000000784c2c
0000000000784f15
0000000000784f17
This information revealed that ALL of the SCSI IDs are indeed seen on this client LPAR (which isn’t surprising, since it is using the primary WWPNs, which did see all of the connections). But why are some of these connections unused?
Running the same command but looking at the full output revealed the answer. The command produces very wide output, so for the sake of clarity I’ll run the command through awk to just print out several of the most important columns:
# devscan --dev=fscsi0 --concise | awk -F '|' '{print $1"|"$2"|"$3"|"$6"|"$7"|"$13}'
Parent Name| SCSI/SAS ID| LUN ID| Vendor| Device| ODM name
fscsi0|00000000006e0050|0000000000000000|IBM    |2107900|No ODM match
fscsi0|00000000006e0051|0000000000000000|IBM    |2107900|No ODM match
fscsi0|00000000006e5424|0000000000000000|IBM    |2145   |No ODM match
fscsi0|00000000006e5424|0001000000000000|IBM    |2145   | hdisk2
fscsi0|00000000006e5426|0000000000000000|IBM    |2145   |No ODM match
fscsi0|00000000006e5426|0001000000000000|IBM    |2145   | hdisk2
fscsi0|00000000006e5578|0000000000000000|IBM    |2145   | hdisk0
fscsi0|00000000006e5578|0002000000000000|IBM    |2145   | hdisk1
fscsi0|00000000006e5579|0000000000000000|IBM    |2145   | hdisk0
fscsi0|00000000006e5579|0002000000000000|IBM    |2145   | hdisk1

# devscan --dev=fscsi1 --concise | awk -F '|' '{print $1"|"$2"|"$3"|"$6"|"$7"|"$13}'
Parent Name| SCSI/SAS ID| LUN ID| Vendor| Device| ODM name
fscsi1|0000000000780043|0000000000000000|IBM    |2107900|No ODM match
fscsi1|0000000000780044|0000000000000000|IBM    |2107900|No ODM match
fscsi1|0000000000784c2b|0000000000000000|IBM    |2145   |No ODM match
fscsi1|0000000000784c2b|0001000000000000|IBM    |2145   | hdisk2
fscsi1|0000000000784c2c|0000000000000000|IBM    |2145   |No ODM match
fscsi1|0000000000784c2c|0001000000000000|IBM    |2145   | hdisk2
fscsi1|0000000000784f15|0000000000000000|IBM    |2145   | hdisk0
fscsi1|0000000000784f15|0002000000000000|IBM    |2145   | hdisk1
fscsi1|0000000000784f17|0000000000000000|IBM    |2145   | hdisk0
fscsi1|0000000000784f17|0002000000000000|IBM    |2145   | hdisk1
Note that the unused SCSI IDs in question all have a different device type of IBM 2107900, while the “good” SCSI IDs are all IBM 2145. This LPAR is currently allocated storage through an IBM SVC (device type 2145), but it used to get its storage from an IBM DS8x00 (device type 2107). Seemingly, it was still connected to the DS8x00 unit, even though storage was no longer allocated from that unit.
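Spotting the odd device type can be automated as well. This sketch runs awk over the six projected columns shown above (canned here as a here-document, with a representative subset of the rows) and flags any target whose Device field is not the SVC's 2145; on a live system you would feed it the devscan/awk pipeline output directly.

```shell
# Flag devscan targets whose device type is not the expected SVC (2145).
# Input is the 6-column projection shown above, so Device is field 5.
suspect=$(awk -F '|' 'NR > 1 && $5 !~ /2145/ { print $1, $2, "type", $5 }' <<'EOF'
Parent Name| SCSI/SAS ID| LUN ID| Vendor| Device| ODM name
fscsi0|00000000006e0050|0000000000000000|IBM    |2107900|No ODM match
fscsi0|00000000006e5424|0001000000000000|IBM    |2145   | hdisk2
fscsi0|00000000006e5578|0000000000000000|IBM    |2145   | hdisk0
EOF
)
echo "$suspect"
```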
The SAN team confirmed that the primary WWPNs were indeed still zoned to the DS8x00, in addition to the SVC, while the secondary WWPNs had only been zoned to the SVC (since all storage was allocated solely via the SVC at the time that they established the zoning for LPM testing). They didn’t think that would be a problem since no DS8x00 storage was allocated – but apparently LPM requires the secondary WWPNs to have the same connectivity as the primary WWPNs, whether or not any storage is actually allocated via those connections.
We were able to fix our LPM problem by removing the DS8x00 from the zoning of the client VFC adapters’ primary WWPNs. Once the zoning of the primary WWPNs matched the zoning of the secondary WWPNs, LPM worked like a charm.