Part 2: Live Partition Mobility in IBM Power Servers – Debugging with IBM’s devscan Utility

Dec 23, 2015

I recently performed some Live Partition Mobility (LPM) testing for a customer, and ran into a number of problems – see Part 1 of this post for the full story on all of the problems and solutions.

The most complicated problem required us to investigate our SAN storage connectivity. IBM Support introduced us to the devscan utility to aid in this task. I found devscan to be a powerful and useful utility, but I didn't find a whole lot of real-world documentation on it, so I thought I would document my usage of it in this article.

Introduction to devscan
The devscan utility is a free tool that was developed by IBM to provide information on SAN storage and connectivity and aid in debugging problems. Information about downloading and using devscan can be found at IBM’s AIX Support Center Tools website:

http://www-01.ibm.com/support/docview.wss?uid=aixtools_home

Also, IBM Systems Magazine has an introductory article on devscan that is worth the read:

http://ibmsystemsmag.com/Blogs/AIXchange/July-2015/A-Tool-for-SAN-Troubleshooting/

Problem Recap
After overcoming several minor LPM problems, our attempt to perform an LPM validation from the HMC command line failed with error HSCLA319, indicating that the destination VIO Servers could not host the Virtual Fibre Channel (VFC) adapters required by the LPM client partition. A Google search and IBM Support both pointed to SAN zoning as the likely problem. Our SAN team disagreed, so we set about proving or disproving that there was a SAN zoning issue.
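
For reference, an LPM validation can be kicked off from the HMC command line with migrlpar in validate-only mode (-o v); a sketch along these lines, with placeholder managed system and partition names:

$ migrlpar -o v -m <source managed system> -t <destination managed system> -p <client LPAR name>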

Running devscan on the LPM Destination Server’s VIO Servers
We began our debugging by running devscan on the VIO Servers on the LPM destination server. On a VIO Server, we run devscan in NPIV mode, which lets us supply the WWPNs of the LPM client LPAR's VFC adapters so that devscan can test for SAN connectivity via the physical fibre channel adapters in the VIO Server. In particular, we can specify the secondary (LPM) WWPNs to verify that the destination server has the connectivity required to support the LPM migration.

In NPIV mode, devscan cannot gather information about specific LUNs, but it does gather data regarding the connections on the storage side. IBM Support had us run the following command for each client VFC adapter WWPN (primary and secondary) on all fscsi adapters on both destination VIO Servers:

# devscan -t f -n <WWPN> --dev=fscsi<#>

In retrospect, I would NOT recommend doing this for all primary and secondary WWPNs, as doing so caused our client LPAR to hang/crash. I’m not sure if this *should* have caused a problem, but establishing the connectivity of the primary WWPNs on the destination server seemed to remove the connectivity from the client LPAR on the source server. In theory, you should only need to run this for the secondary WWPNs on the destination server – to verify that zoning is correct to allow the mobility operation to complete successfully. However, despite causing our LPAR to crash, it turned out to be fortunate that we ran it with both sets of WWPNs because it turned up some interesting information that helped us figure out where the real problem was.
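
If you do stick to the secondary WWPNs, a small shell loop saves you from typing each devscan invocation by hand. A rough sketch, run on each destination VIO Server; the WWPN values here are placeholders to substitute with your own:

# Secondary (LPM) WWPNs of the client's VFC adapters -- placeholder values
WWPNS="C0407903A2C8002C C0407903A2C8002E"

# Try every secondary WWPN through every fscsi adapter on this VIO Server
for wwpn in $WWPNS; do
    for adapter in $(lsdev -C -F name | grep '^fscsi'); do
        echo "=== $wwpn via $adapter ==="
        devscan -t f -n $wwpn --dev=$adapter
    done
done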

Here is an example of the devscan command and output:

# devscan -t f -n C0407903A2C8002B --dev=fscsi2

devscan v1.0.5
Copyright (C) 2010-2012 IBM Corp., All Rights Reserved

cmd: devscan -t f -n C0407903A2C8002B --dev=fscsi2
Current time: 2015-06-03 20:35:27.984569 GMT
Running on host: labvios3

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing FC device:
    Adapter driver: fcs2
    Protocol driver: fscsi2
    Connection type: fabric
    Link State: up
    Current link speed: 8 Gbps
    Local SCSI ID: 0x5d0072
    Local WWPN: 0x10000000c6d00257
    Local WWNN: 0x20000000c6d00257
    NPIV SCSI ID: 0x5d6318
    NPIV WWPN: 0xC0407903A2C8002B
    Device ID: 0xdf1000f237109a04
    Microcode level: 202307

SCSI ID LUN ID           WWPN             WWNN
-----------------------------------------------------------
6e0050  0000000000000000 500507630813064b 5005076308ffc64b
6e0051  0000000000000000 500507630818064b 5005076308ffc64b
6e5424  0000000000000000 500507680140c742 500507680100c742
6e5426  0000000000000000 500507680140c257 500507680100c257
6e5578  0000000000000000 500507680140ed53 500507680100ed53
6e5579  0000000000000000 500507680140ed58 500507680100ed58

6 targets found, reporting 0 LUNs,
0 of which responded to SCIOLSTART.
Elapsed time this adapter: 00.279705 seconds

Cleaning up...
Total elapsed time: 00.282444 seconds
Completed with error(s)

The devscan command returns a lot of good info, but we focused on the target table at the end of the output. Notice that the LUN IDs are all “0”. As I mentioned, devscan cannot find actual LUN information in NPIV mode. However, it does demonstrate connectivity by listing the 6 target ports that the client VFC adapter’s WWPN can “see” via the SAN fabric to which this physical fibre channel adapter is connected.

In our configuration, we utilize two SAN fabrics, so each AIX LPAR has a pair of VFC adapters, one for each fabric. On the VIO Server side, the server VFC adapters are mapped across four physical fibre channel adapters to spread the load; two of these adapters are cabled to SAN fabric A, while the other two are cabled to fabric B. So devscan commands for a specific WWPN from one of the client VFC adapters returned the 6 SCSI IDs listed above on half of the adapters (the two cabled to the SAN fabric where that WWPN is included in the zoning), while the other half reported “No targets found”, since that WWPN isn’t zoned in the other fabric. A WWPN from the other client VFC adapter produced the opposite results: connection info from the other two adapters, pertaining to the other SAN fabric. Example from the other fabric:

SCSI ID LUN ID           WWPN             WWNN
-----------------------------------------------------------
780043  0000000000000000 500507630803864b 5005076308ffc64b
780044  0000000000000000 500507630808864b 5005076308ffc64b
784c2b  0000000000000000 500507680130c742 500507680100c742
784c2c  0000000000000000 500507680130c257 500507680100c257
784f15  0000000000000000 500507680130ed53 500507680100ed53
784f17  0000000000000000 500507680130ed58 500507680100ed58

6 targets found, reporting 0 LUNs,

To summarize our findings, when we ran devscan with the client VFC adapters’ primary WWPNs, we found the following SCSI IDs:

Fabric A     Fabric B
--------     --------
6e0050       780043
6e0051       780044
6e5424       784c2b
6e5426       784c2c
6e5578       784f15
6e5579       784f17

However, when we ran devscan with the client VFC adapters’ secondary WWPNs, we only found the following SCSI IDs:

Fabric A     Fabric B
--------     --------
6e5424       784c2b
6e5426       784c2c
6e5578       784f15
6e5579       784f17

The secondary WWPNs were missing the following SCSI ID connections:

Fabric A     Fabric B
--------     --------
6e0050       780043
6e0051       780044

We presented this information to the SAN team, but they once again merely confirmed that the secondary WWPNs were indeed zoned properly to the SVC storage. So, we had to continue our debugging.

Finding SCSI IDs with lspath on the LPM Client LPAR

We decided to run some commands directly on the client LPAR to see what the connectivity looked like from there. For each disk/VFC adapter combination, we were able to find the SCSI ID info that devscan returned on the VIO Servers by using this lspath command:

# lspath -AHE -l <disk> -p <parent adapter> -w <connection>

First, we needed the “connection” information to use with the “-w” flag. We could get that for each disk by using the “-F” flag with the lspath command. For example, for hdisk1:

# lspath -l hdisk1 -F "connection:parent:path_status:status"
500507680140ed53,2000000000000:fscsi0:Available:Enabled
500507680140ed58,2000000000000:fscsi0:Available:Enabled
500507680130ed53,2000000000000:fscsi1:Available:Enabled
500507680130ed58,2000000000000:fscsi1:Available:Enabled

After gathering all of the connection information, we could run lspath again for each disk/adapter/connection combination to return the SCSI ID, along with some other info. For example:

# lspath -AHE -l hdisk1 -p fscsi0 -w "500507680140ed53,2000000000000"
attribute value              description  user_settable
scsi_id   0x6e5578           SCSI ID      False
node_name 0x500507680100ed53 FC Node Name False
priority  1                  Priority     True
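
To walk every disk/adapter/connection combination without running each command pair by hand, a loop like the following works. This is just a sketch, assuming the default lspath output formats shown above and the three disks from our LPAR:

for disk in hdisk0 hdisk1 hdisk2; do
    # Each path: connection string and parent fscsi adapter, colon-separated
    lspath -l $disk -F "connection:parent" | while IFS=: read conn parent; do
        # Pull just the scsi_id attribute value out of the lspath -AHE output
        scsi=$(lspath -AHE -l $disk -p $parent -w "$conn" | awk '/^scsi_id/ {print $2}')
        echo "$disk - $conn - $parent - $scsi"
    done
done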

I’ll save you from reading the commands and output for each disk/adapter/connection permutation, but here are the summarized results that we compiled for the three disks on the client LPAR:

DISK     CONNECTION                       ADAPTER  SCSI_ID
hdisk0 - 500507680140ed53,0             - fscsi0 - 0x6e5578
hdisk0 - 500507680140ed58,0             - fscsi0 - 0x6e5579
hdisk0 - 500507680130ed53,0             - fscsi1 - 0x784f15
hdisk0 - 500507680130ed58,0             - fscsi1 - 0x784f17

hdisk1 - 500507680140ed53,2000000000000 - fscsi0 - 0x6e5578
hdisk1 - 500507680140ed58,2000000000000 - fscsi0 - 0x6e5579
hdisk1 - 500507680130ed53,2000000000000 - fscsi1 - 0x784f15
hdisk1 - 500507680130ed58,2000000000000 - fscsi1 - 0x784f17

hdisk2 - 500507680140c742,1000000000000 - fscsi0 - 0x6e5424
hdisk2 - 500507680140c257,1000000000000 - fscsi0 - 0x6e5426
hdisk2 - 500507680130c742,1000000000000 - fscsi1 - 0x784c2b
hdisk2 - 500507680130c257,1000000000000 - fscsi1 - 0x784c2c

Now, we compare the SCSI IDs that lspath found to the SCSI IDs that devscan found on the destination VIO Servers:

6e0050 - not found at all by lspath
6e0051 - not found at all by lspath
6e5424 - lspath found with hdisk2
6e5426 - lspath found with hdisk2
6e5578 - lspath found with hdisk0 & hdisk1
6e5579 - lspath found with hdisk0 & hdisk1

780043 - not found at all by lspath
780044 - not found at all by lspath
784c2b - lspath found with hdisk2
784c2c - lspath found with hdisk2
784f15 - lspath found with hdisk0 & hdisk1
784f17 - lspath found with hdisk0 & hdisk1

So, the four SCSI IDs that devscan found only with the primary WWPNs on the VIO Servers do not appear to be used at all on the client LPAR anyway.
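
With more SCSI IDs in play, comm(1) can do this comparison for you. A minimal sketch, assuming each set of IDs has been collected into a hypothetical text file, one ID per line and in a matching format (the lspath values above would need their 0x prefix stripped first):

# sort -u devscan_ids.txt > devscan.sorted
# sort -u lspath_ids.txt > lspath.sorted
# comm -23 devscan.sorted lspath.sorted

The comm -23 invocation prints only the IDs present in the devscan list but absent from the lspath list, which in our case would be the four orphaned connections.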

The plot thickens.

Putting it all Together
We next decided to run devscan on the LPM client LPAR to see what we could learn. We ran the following command on the client to find all of the SCSI IDs:

# devscan --dev=<adapter> --concise | awk -F '|' '{print $2}' | sort -n | uniq

Running it for the two VFC adapters on the client LPAR produced the following results:

# devscan --dev=fscsi0 --concise | awk -F '|' '{print $2}' | sort -n | uniq
     SCSI/SAS ID
00000000006e0050
00000000006e0051
00000000006e5424
00000000006e5426
00000000006e5578
00000000006e5579

# devscan --dev=fscsi1 --concise | awk -F '|' '{print $2}' | sort -n | uniq
     SCSI/SAS ID
0000000000780043
0000000000780044
0000000000784c2b
0000000000784c2c
0000000000784f15
0000000000784f17

This information revealed that ALL of the SCSI IDs are indeed seen on this client LPAR (which isn’t surprising, since it is using the primary WWPNs, which did see all of the connections). But why are some of these connections unused?

Running the same command but looking at the full output revealed the answer. The command produces very wide output, so for the sake of clarity I’ll run it through awk to print just a few of the most important columns:

# devscan --dev=fscsi0 --concise | awk -F '|' '{print $1"|"$2"|"$3"|"$6"|"$7"|"$13}'
Parent Name|     SCSI/SAS ID|          LUN ID| Vendor| Device|    ODM name
     fscsi0|00000000006e0050|0000000000000000|IBM    |2107900|No ODM match
     fscsi0|00000000006e0051|0000000000000000|IBM    |2107900|No ODM match
     fscsi0|00000000006e5424|0000000000000000|IBM    |2145   |No ODM match
     fscsi0|00000000006e5424|0001000000000000|IBM    |2145   |      hdisk2
     fscsi0|00000000006e5426|0000000000000000|IBM    |2145   |No ODM match
     fscsi0|00000000006e5426|0001000000000000|IBM    |2145   |      hdisk2
     fscsi0|00000000006e5578|0000000000000000|IBM    |2145   |      hdisk0
     fscsi0|00000000006e5578|0002000000000000|IBM    |2145   |      hdisk1
     fscsi0|00000000006e5579|0000000000000000|IBM    |2145   |      hdisk0
     fscsi0|00000000006e5579|0002000000000000|IBM    |2145   |      hdisk1

# devscan --dev=fscsi1 --concise | awk -F '|' '{print $1"|"$2"|"$3"|"$6"|"$7"|"$13}'
Parent Name|     SCSI/SAS ID|          LUN ID| Vendor| Device|    ODM name
     fscsi1|0000000000780043|0000000000000000|IBM    |2107900|No ODM match
     fscsi1|0000000000780044|0000000000000000|IBM    |2107900|No ODM match
     fscsi1|0000000000784c2b|0000000000000000|IBM    |2145   |No ODM match
     fscsi1|0000000000784c2b|0001000000000000|IBM    |2145   |      hdisk2
     fscsi1|0000000000784c2c|0000000000000000|IBM    |2145   |No ODM match
     fscsi1|0000000000784c2c|0001000000000000|IBM    |2145   |      hdisk2
     fscsi1|0000000000784f15|0000000000000000|IBM    |2145   |      hdisk0
     fscsi1|0000000000784f15|0002000000000000|IBM    |2145   |      hdisk1
     fscsi1|0000000000784f17|0000000000000000|IBM    |2145   |      hdisk0
     fscsi1|0000000000784f17|0002000000000000|IBM    |2145   |      hdisk1

Note that the unused SCSI IDs in question all have a device type of IBM 2107900, while the “good” SCSI IDs are all IBM 2145. This LPAR currently gets its storage through an IBM SVC (device type 2145), but it used to get its storage from an IBM DS8x00 (device type 2107). Seemingly, the LPAR was still zoned to the DS8x00 unit, even though storage was no longer allocated from it.
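
Given that column layout, a quick tally of the Vendor and Device columns per adapter makes a mixed-storage situation like this jump out. A sketch, assuming the same --concise field positions used above (NR>1 skips the header line):

# devscan --dev=fscsi0 --concise | awk -F '|' 'NR>1 {print $6, $7}' | sort | uniq -c

For fscsi0 above, that tallies two 2107900 entries against eight 2145 entries, flagging the stale DS8x00 connections immediately.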

The SAN team confirmed that the primary WWPNs were indeed still zoned to the DS8x00, in addition to the SVC, while the secondary WWPNs had only been zoned to the SVC (since all storage was allocated solely via the SVC at the time that they established the zoning for LPM testing). They didn’t think that would be a problem since no DS8x00 storage was allocated – but apparently LPM requires the secondary WWPNs to have the same connectivity as the primary WWPNs, whether or not any storage is actually allocated via those connections.

We were able to fix our LPM problem by removing the DS8x00 from the zoning of the client VFC adapters’ primary WWPNs. Once the zoning of the primary WWPNs matched the zoning of the secondary WWPNs, LPM worked like a charm.
