Where did my disks go?

So now and then you may run into an issue which cannot be explained properly by just looking at the standard events that show up in “/var/log/messages“.

Issues such as

Oct 7 18:24:20 centos8 kernel: lpfc 0000:81:00.0: 0:1305 Link Down Event xc received Data: xc x20 x800110 x0 x0
Oct 7 18:24:24 centos8 kernel: rport-11:0-4: blocked FC remote port time out: removing target and saving binding
Oct 7 18:24:24 centos8 kernel: lpfc 0000:81:00.0: 0:(0):0203 Devloss timeout on WWPN 50:06:0e:80:07:c3:70:00 NPort x01ee40 Data: x0 x8 x2

are fairly common and the above simply shows a Link Down event. These are the most easy to troubleshoot when the remote switchlog tell you

18:26:59.565715 SCN Port Offline;rsn=0x10004,g=0x12 A2,P0 A2,P0 93 NA 
18:26:59.565721 *Removing all nodes from port A2,P0 A2,P0 93 NA 
18:28:07.998318 SCN LR_PORT(0);g=0x12 A2,P0 A2,P0 93 NA 
18:28:08.006029 SCN Port Online; g=0x12,isolated=0 A2,P0 A2,P1 93 NA 
18:28:08.007307 Port Elp engaged A2,P1 A2,P0 93 NA 
18:28:08.007331 *Removing all nodes from port A2,P0 A2,P0 93 NA 
18:28:08.007594 SCN Port F_PORT A2,P1 A2,P0 93 NA 
18:28:08.099107 SCN LR_PORT(0);g=0x12 LR_IN A2,P0 A2,P0 93 NA 
18:28:20.669283 SCN Port Offline;rsn=0x10004,g=0x14 A2,P0 A2,P0 93 NA 
18:28:20.669288 *Removing all nodes from port A2,P0 A2,P0 93 NA

as a result of

Wed Oct 7 18:28:07 2020 admin, FID 43, 10.75.27.192, portenable 4/29
Wed Oct 7 18:28:20 2020 admin, FID 43, 10.75.27.192, portdisable 4/29

Diagnostics becomes more problematic when is just the events that show the links bounce but show no further information. Obtaining extended information from the HBA drivers may then be very helpful.

Update Drivers and Firmware

As you know I’m very picky when it comes to maintenance. If I see cases where System and/or Storage administrators have basically been slacking for a long time the chances are very high that I will tell you that and commence diagnosing issues as soon as these things are all up to date. You don’t want to know the sheer amount of issues that have been resolved in firmware and drivers over any given time-period.

That being said going to the Linux side of the Emulex (or Broadcom) drivers for the LP31000/LP32000 cards which are very popular in many form-factors.

The driver will show as an lpfc module and is by default compiled into a ramfs image when installed. This will allow the card to be used in a boot-from-san variation if needed. The module will load as such and register with the scsi-subsystem

lpfc 978944 81
nvmet_fc 32768 1 lpfc
nvme_fc 45056 1 lpfc
scsi_transport_fc 69632 1 lpfc

With the most recent versions of the driver it will also provide an NVMe_oF initiator and target so that NVM equipment can be utilized when attached to a FC fabric.

Logging

Loggin with an Emulex card can be done on the driver level as well as the HBA firmware. Unless you get some instructions to do so leave the firmware logging as is. Mainly because changing these parameters will require a reload of the driver that basically instructs the firmware logging facility to capture data in some host memory region. Obviously that will involve some engineering efforts to diagnose anyway so that will not be very helpful to yourself or your OEM support-organisation unless it needs escalating to Emulex.

Changing the logging verbosity of the driver itself is much easier but may also incur some performance impact so don’t just flick on the “0xFFFFFFFF debug” button. The driver logging facility is a bitmap value based on the below table:

LOG	Message	Verbose Mask Definition Verbose Bit Verbose Description
LOG_ELS	0x00000001	ELS events
LOG_DISCOVERY	0x00000002	Link discovery events
LOG_MBOX	0x00000004	Mailbox events
LOG_INIT	0x00000008	Initialization events
LOG_LINK_EVENT	0x00000010	Link events
LOG_IP	0x00000020	IP traffic history
LOG_FCP	0x00000040	FCP traffic history
LOG_NODE	0x00000080	Node table events
LOG_TEMP	0x00000100	Temperature sensor events
LOG_BG	0x00000200	BlockGuard events
LOG_MISC	0x00000400	Miscellaneous events
LOG_SLI	0x00000800	SLI events
LOG_FCP_ERROR	0x00001000	Log errors, not underruns
LOG_LIBDFC	0x00002000	Libdfc events
LOG_VPORT	0x00004000	NPIV events
LOG_SECURITY	0x00008000	Security events
LOG_EVENT	0x00010000	CT,TEMP,DUMP, logging
LOG_FIP	0x00020000	FIP events
LOG_FCP_UNDER	0x00040000	FCP underruns errors
LOG_SCSI_CMD	0x00080000	ALL SCSI commands
LOG_NVME	0x00100000	NVME general events
LOG_NVME_DISC	0x00200000	NVME discovery/connect events
LOG_NVME_ABTS	0x00400000	NVME ABTS events
LOG_NVME_IOERR	0x00800000	NVME I/O Error events
LOG_EDIF	0x01000000	External DIF events
LOG_AUTH	0x02000000	Authentication events

If you don’t know what these mean, or have no clue on how to interpret the output, it’s not much use mucking around with these. The output will only confuse you and if you don’t know what the commands and responses should be it’s only a bunch of hex values.

The values as displayed above can be summed depending on which verbose logging needs to be enabled. For instance if your OEM asks you for Link events, ELS and Initialiation events you may get asked to enable verbose logging with either the “hbacmd” or via “sysfs”. The value of the parameter will than be “0x19”

hbacmd or sysfs

If you have hbacmd installed any change done in the logging preferences also automatically kicks of dracut and builds a new boot image. The command has a few additional parameters

hbacmd setdriverparam 10:00:00:90:fa:c7:cd:f9 G P log-verbose 0x135661

The first three a fairly obvious. Command setting driver parameters for PWWN 10:xxxxxx. The G stands for Global basically meaning it is valid for all adapters and the P stands for Permant. That ensures the parameter that follows is also applied after reboots. The log-verbose parameter is basically the configuration what we’re adjusting. The 0x135661 is a combination of values obtained via the table above.

The value can also dynamically be applied via sysfs in the “/sys/class/scsi_host/host<X>” (where <X> is the adapter ID) directory. The LPFC driver will create the system file as appropriate in that folder and one of which is indeed the “lpfc_log_verbose” file. The 0x<123456> value can be echoed to that file and the driver will dynamically pick this up.

[root@centos8 host11]# cat lpfc_log_verbose 
0x0
[root@centos8 host11]# echo 0x135661 > lpfc_log_verbose

The change is immediate logged

Oct 8 17:03:15 centos8 kernel: lpfc 0000:81:00.0: 0:(0):3053 lpfc_log_verbose changed from 0 (x0) to 1267297 (x135661)

When you change all of them with

[root@centos8 scsi_host]# echo 0x135661 > host11/lpfc_log_verbose 
[root@centos8 scsi_host]# echo 0x135661 > host12/lpfc_log_verbose 
[root@centos8 scsi_host]# echo 0x135661 > host13/lpfc_log_verbose 
[root@centos8 scsi_host]# echo 0x135661 > host14/lpfc_log_verbose

The messagelog will show something similar like this

Oct 8 17:28:28 centos8 kernel: lpfc 0000:81:00.0: 0:(0):3053 lpfc_log_verbose changed from -1 (xffffffff) to 1267297 (x135661)
Oct 8 17:28:50 centos8 kernel: lpfc 0000:81:00.1: 1:(0):3053 lpfc_log_verbose changed from 1267297 (x135661) to 1267297 (x135661)
Oct 8 17:28:58 centos8 kernel: lpfc 0000:83:00.0: 2:(0):3053 lpfc_log_verbose changed from 1267297 (x135661) to 1267297 (x135661)
Oct 8 17:29:04 centos8 kernel: lpfc 0000:83:00.1: 3:(0):3053 lpfc_log_verbose changed from 1267297 (x135661) to 1267297 (x135661)

The interesting past is that the paths to the adapter entries are used here. This is reflected in the “0000:81:00.0:”, 0000:83:00:0:” etc entries.

Remember that in normal circumstances you would not need to change these values. The basics are logged anyway and only in specific circumstances you would need to adjust that. Also be aware that using a debug value of 0xFFFFFFFF can incurr a significant performance overhead on busy systems as a lot needs to be logged.

Another thing that I get often queried about is which HBA port belongs to which SCSI number.

Identifcation of the respective HBA’s can be done by looking at the adapter entries int he eventlog as mentioned above. In this case the 81 and 83 values are a reflection of the PCI id and the 00:.0 and 00.1 are the individual ports on those adapters.

81:00.0 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
81:00.1 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
83:00.0 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)
83:00.1 Fibre Channel: Emulex Corporation LPe31000/LPe32000 Series 16Gb/32Gb Fibre Channel Adapter (rev 01)

You can see these entries coming back in the /sys/class/fc_host directory where logical links to the PCI devices are created

lrwxrwxrwx. 1 root root 0 Sep 4 15:08 host11 -> ../../devices/pci0000:80/0000:80:03.0/0000:81:00.0/host11/fc_host/host11
lrwxrwxrwx. 1 root root 0 Sep 4 15:08 host12 -> ../../devices/pci0000:80/0000:80:03.0/0000:81:00.1/host12/fc_host/host12
lrwxrwxrwx. 1 root root 0 Sep 4 15:08 host13 -> ../../devices/pci0000:80/0000:80:03.2/0000:83:00.0/host13/fc_host/host13
lrwxrwxrwx. 1 root root 0 Sep 4 15:08 host14 -> ../../devices/pci0000:80/0000:80:03.2/0000:83:00.1/host14/fc_host/host14

As soon as you know this you can associate the respective WWN of the adapter to the one you see on the switch:

[root@centos8 ~]# cat /sys/class/fc_host/host11/port_name
0x10000090fac7cde8

Sydney_ILAB_X6_43_TEST:FID43:admin> switchshow
switchName: Sydney_ILAB_X6_43_TEST
switchType: 165.0
<snip>

Index Slot Port Address Media Speed State Proto
============================================================
66 4 2 01ef40 id N8 Online FC F-Port 50:06:0e:80:10:13:b5:b8 
<snip>
93 4 29 010000 id 16G Online FC F-Port 10:00:00:90:fa:c7:cd:e8

The above shows you when you see errors happening as part of a SAN attached disk where to look and how to assocaite the Emulex adapters to the respective WWN’s on your SAN.

From there on you can also identify which disks are presented to that adapter. As you’ve seen above the PCI subsystem creates a host interface per FC port. In my case these are host11 to host14.

A simple way to check is to just do an “ls” on /sys/class/scsi_disk/device/block tree.

[root@centos8 scsi_disk]# ls */device/block/
<snip>

'11:0:0:0/device/block/':
sdm

<snip>

'11:0:0:8/device/block/':
sdu

As you can see the 11:xxxxx entries will list the respetive “/dev/sd*” entries that is being used for mounting volumes, MPIO listings etc.

Obviously there are heaps of tools available to ease your troubleshooting efforts. I would advise to install the Emulex OCMananger tools that are provided as a free separate package. It can be installed as an agent and agent-less feature. Other tools like “lsblk”, “blockdev”, sg-tools package and a few more are there to make your life a bit easier so you don’t have to crawl thru the sysfs tree yourself.

Let me know if this was helpful. You feedback is much appreciated.

Regards,

Erwin

EvL Consulting

Enabling Verbose Logging on Linux with Emulex Host Bus Adapters

Where did my disks go?

Update Drivers and Firmware

Logging

hbacmd or sysfs