Lustre MDT mirroring with SRP and ZFS

by Jesse Stroik— last modified Jan 17, 2014 03:47 PM

 

Important Note: The definitive source for Lustre documentation is the Lustre Operations Manual available at https://wiki.hpdd.intel.com/display/PUB/Documentation.

These documents are copied from internal SSEC working documentation that may be useful to some, but we provide no guarantee of accuracy, correctness, or safety. Use at your own risk.

 

At SSEC we tested mirrored metadata targets (MDTs) for our Lustre file systems. The idea is to use ZFS to mirror storage targets across two different MDS nodes, so the data is always available on both servers without using iSCSI or other technologies. The idea is based on Charles Taylor's LUG 2012 presentation "High Availability Lustre Using SRP-Mirrored LUNs".

Instead of LVM, we use ZFS to provide the mirror. The combination of SCST providing the targets over InfiniBand RDMA and ZFS providing the mirror performed well in our testing. We did not have a chance to test it thoroughly enough for production.

Below are notes from our investigation.

Terminology

 
Target
The device to which data will be written. Usually it controls a group of LUNs (think OSS, not OST or individual disk).
 
Initiator
The system or device attempting to access the target. Client system in our case.
 
 

Protocols

 
SRP - SCSI RDMA Protocol.
Despite its name, this can be implemented without RDMA. As we would likely implement it, it is a protocol used to communicate with SCSI devices directly over RDMA.
 
iSER - iSCSI Extensions for RDMA
A layer of abstraction on the iSCSI protocol implemented by a "Datamover Architecture" and with RDMA support. The basic idea is simple: RDMA allows devices to reach each other's memory directly. When an initiator begins an unsolicited write, the target uses the protocol to read the data directly from the initiator while writing it to itself. In effect, the target goes and reads the data off the initiator.


SRP Implementations


LIO - SRP implementation by Datera, a Silicon Valley startup from 2011.

It appears that Datera got this thing into the Linux kernel, but deployment and usage documentation is nonexistent or very hard to actually find.

TargetCLI is a Python CLI management interface for the targets.
 
 
 

SCST - SCSI Target Framework (kernel - not in official tree).
 
This framework includes a few components:
 
  1. Core/Engine software
  2. Target "Drivers" - I put drivers in quotes because this part is implemented as a kernel module and they call it a driver, but it is software that controls the Target (think OSS) and doesn't really provide a hardware driver as far as I understand.
  3. Storage Drivers - This is the part that implements the SCSI commands on the LUN (in our case, attached OSTs).
 
We will likely need to compile/link the target and storage drivers against a kernel version, and install only with that kernel version. We already link kernel versions with Lustre, so this may not be unreasonable.
 


iSER implementations


LIO - See information in SRP implementation. This also implements iSER.

STGT - SCSI Target Framework (userspace)
This doesn't perform as well as SCST according to the research. It's considered obsolete, but I included the definition because it may be mentioned in a lot of documentation.
 

 

Technologies Available w/ Summary

 

  1. TGTD/iSER via scsi-target-utils
  2. LIO
  3. SCST 
  4. Snapshots 
 
 

LIO

 
This seems like it has disadvantages. 
 
 
 

SCST

 
We'll research this and attempt to implement it. It appears we'll need to grab their source, compile and link it against our kernel, and install it. This may be a temporary issue if we need to link against a newer kernel than Lustre currently supports, but Lustre is getting accepted into the kernel as well, so we can plan for this to be our future technology if we cannot use it now.
 
 

Installing the SCST

 

NOTE BEFORE IMPLEMENTING: THIS APPEARS TO USE 128KB INODE SIZES WHEN EXPORTED VIA ZPOOLS

This requires the OFA OFED stack and links against it.

 
Download and install the SCST package and scstadmin: http://sourceforge.net/projects/scst/files/
 
Extract the SCST package, and verify your Makefile lines if necessary:
export KDIR=/usr/src/kernels/2.6.32-431.el6.x86_64/
export KVER=2.6.32-431.el6.x86_64
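If the build is for the kernel that is currently booted, the two variables can be derived instead of typed by hand. This is only a sketch, and it assumes the kernel-devel tree lives under /usr/src/kernels/ as in the example above:

```shell
# Derive the build variables from the running kernel.
# Assumption: SCST is being built for the kernel that is currently booted.
KVER=$(uname -r)
KDIR=/usr/src/kernels/$KVER
export KVER KDIR

# Warn early if the kernel-devel tree for this kernel is missing.
if [ ! -d "$KDIR" ]; then
    echo "warning: $KDIR not found; install the matching kernel-devel package" >&2
fi

echo "KVER=$KVER"
echo "KDIR=$KDIR"
```

When building for a different kernel than the running one, set both variables explicitly as shown earlier.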


Build and install SRPT and SCST
 
make scst && make scst_install
make srpt && make srpt_install

Then load the modules into the kernel, and set it up to start on boot.

/usr/lib/lsb/install_initd scst
chkconfig --add scst
modprobe scst
modprobe ib_srpt
modprobe scst_vdisk
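To confirm the three modules actually loaded, something like the following could be used. It is a minimal sketch: the function name is made up for illustration, and the optional file argument exists only so the check can be run against a saved copy of /proc/modules.

```shell
# Print the modules from the load step that are NOT listed in a modules file.
# The file defaults to /proc/modules.
missing_scst_modules() {
    modules_file=${1:-/proc/modules}
    for mod in scst ib_srpt scst_vdisk; do
        # Each line of /proc/modules starts with the module name and a space.
        grep -q "^$mod " "$modules_file" || echo "$mod"
    done
}
```

Calling missing_scst_modules with no arguments prints nothing when all three modules are loaded.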
 

Setting up the Devices

 
NOTE: we ended up using HW RAID on the bottom, exported that via SRP, and ZFS mirrored at the top level. This very first step can be skipped
 
If you are using ZFS, you need to create a logical volume. For example, let's say we have the pool shps-meta which is comprised of some disks.
 

 

zfs create -V 300G shps-meta/MDT
zfs set canmount=off shps-meta

 

Now you have the device  /dev/zvol/shps-meta/MDT
 
 
 
 
 
On each system, once you have your LUN prepared (with the RAID controller or zpool) it's time to register that device:
 
scstadmin -open_dev MDT1 -handler vdisk_blockio -attributes filename=/dev/zvol/shps-meta/MDT
 
Then list the device and target:
 

 

scstadmin -list_device
scstadmin -list_target
ls -l /sys/kernel/scst_tgt/devices
You should get some info from each: the MDT1 device you just created, and also ib_srpt_target_0. If you don't get that, reload the ib_srpt module.
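The same check can be scripted. The sketch below assumes SCST exposes targets under /sys/kernel/scst_tgt/targets/<driver>/<target>, next to the devices directory listed above. The function name and its optional root argument are illustrative only.

```shell
# Return success if the named ib_srpt target is registered with SCST.
# Assumption: targets appear under <root>/targets/ib_srpt/<target>.
srpt_target_present() {
    target=${1:-ib_srpt_target_0}
    root=${2:-/sys/kernel/scst_tgt}
    [ -d "$root/targets/ib_srpt/$target" ]
}

# Only print advice; the reload itself is left to the operator.
if ! srpt_target_present; then
    echo "ib_srpt_target_0 not found; try reloading ib_srpt" >&2
fi
```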
 
Define a security group (the hosts that can write):
 
scstadmin -add_group MDS -driver ib_srpt -target ib_srpt_target_0
scstadmin -list_group
 
Add initiators to the group: 
 
(for testing, leave this open)
 
 
Assign the LUNs to the target 
scstadmin -add_lun 0 -driver ib_srpt -target ib_srpt_target_0 -group MDS -device MDT1
 
 Now enable the target:
 

scstadmin -enable_target ib_srpt_target_0 -driver ib_srpt
 
And enable the driver:
 scstadmin -set_drv_attr ib_srpt -attributes enabled=1
 
 Pass options to the driver through a modprobe configuration file. This example configures one target per HCA port:
 
# cat /etc/modprobe.d/ib_srpt.conf
options ib_srpt one_target_per_port=1
 
Set up permissions for the LUN (necessary)
 
 scstadmin -add_init '*' -driver ib_srpt -target ib_srpt_target_0 -group MDS
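For repeatability, the target-side commands from this section can be collected into one function. This is a sketch rather than a tested deployment script: the run and setup_srp_target helpers and the DRY_RUN switch are made up for illustration. By default it only prints the commands, in the same order as they appear above.

```shell
# Print (default) or execute the target-side steps in order.
# Set DRY_RUN=0 to run the commands for real.
DRY_RUN=${DRY_RUN:-1}
run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

setup_srp_target() {
    dev=$1       # SCST device name, e.g. MDT1
    backing=$2   # block device to export
    target=ib_srpt_target_0
    group=MDS

    run scstadmin -open_dev "$dev" -handler vdisk_blockio -attributes "filename=$backing"
    run scstadmin -add_group "$group" -driver ib_srpt -target "$target"
    run scstadmin -add_lun 0 -driver ib_srpt -target "$target" -group "$group" -device "$dev"
    run scstadmin -enable_target "$target" -driver ib_srpt
    run scstadmin -set_drv_attr ib_srpt -attributes enabled=1
    # '*' leaves the group open to every initiator (testing only).
    run scstadmin -add_init '*' -driver ib_srpt -target "$target" -group "$group"
}

setup_srp_target MDT1 /dev/zvol/shps-meta/MDT
```

Review the printed commands first, then rerun with DRY_RUN=0 on the target host.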
 
 
 
 

Initiator setup

 
On the target, ensure that this initiator has permission to access the disk (the -add_init step above).
 
 
 
First, load the module ib_srp:
 
modprobe ib_srp
 
Note: This module is part of OFED. OFED also includes the ib_srpt (target) module which is used to host the FS.
 
 
 
Now, search for the available targets:
 
srp_daemon -o -a -c -d /dev/infiniband/umad0
Note: There could be multiple /dev/infiniband/umad devices. (umad bro?)
 
 
Add the target:
#scstadmin -add_target 
 
 
 
 
find /sys -iname add_target -print
echo "id_ext=0002c90300b77f40,ioc_guid=0002c90300b77f40,dgid=fe800000000000000002c90300b77f41,pkey=ffff,service_id=0002c90300b77f40" > /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/infiniband_srp/srp-mlx4_0-1/add_target
 
Note: in the above example, the part echoed is the result of the previous srp_daemon command (there may be multiple devices to add this way), and the redirection is into the result of the find command.
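When several targets are reported, copying each string by hand gets tedious. The helper below is a sketch with a made-up name. It reads one target description per line on stdin and writes each to the add_target file given as its argument. It assumes the input lines are in the format printed by srp_daemon with the -c option.

```shell
# Read target descriptions (one per line) on stdin and write each one to an
# add_target file. Spaces and tabs are stripped so that only the bare
# comma-separated string is written.
add_srp_targets() {
    add_target=$1
    while IFS= read -r line; do
        line=$(printf '%s' "$line" | tr -d ' \t')
        [ -n "$line" ] || continue
        # One write per target description.
        printf '%s\n' "$line" > "$add_target"
    done
}
```

On the initiator it would be used as: srp_daemon -o -a -c -d /dev/infiniband/umad0 | add_srp_targets /sys/devices/pci0000:00/0000:00:03.0/0000:04:00.0/infiniband_srp/srp-mlx4_0-1/add_target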
 
 
 
Be sure to write the config:
 
 
 scstadmin -write_config /etc/scst.conf
 
And then ensure the startup script is in chkconfig:
 
chkconfig --list scst
chkconfig --list srpd
chkconfig --list rdma
/etc/rdma/rdma.conf must contain the line:
 
SRP_LOAD=yes
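Once the remote LUN is visible on the initiator as a block device, the top-level ZFS mirror described in the introduction can be built from the local LUN and the SRP-imported one. The notes above do not record the exact command we used, so this is only a sketch. The pool name and both device paths are hypothetical placeholders, and mirror_cmd is a made-up helper.

```shell
# Build the zpool command that mirrors a local LUN with an SRP-imported LUN.
# All names used here are hypothetical placeholders, not values from our setup.
mirror_cmd() {
    pool=$1
    local_dev=$2
    remote_dev=$3
    echo "zpool create $pool mirror $local_dev $remote_dev"
}

# Print the command for review instead of running it.
mirror_cmd mdtpool /dev/disk/by-id/LOCAL_LUN /dev/disk/by-id/SRP_LUN
```

Printing the command rather than executing it leaves a chance to double-check the device paths, since zpool create overwrites whatever is on them.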

Snapshots

 
This is our backup. We can just use ZFS features to keep the metadata reasonably in sync. This won't be perfectly up to date, sadly, but a sync from an hour or two ago means an hour or two of data may have been lost, which is very acceptable in many cases.
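One way to do this is an incremental zfs send from one snapshot to the next. The sketch below only prints the commands. The snapshot names and the backup host are hypothetical, and snapshot_sync_cmds is a made-up helper rather than something we ran in this form.

```shell
# Print the commands for one incremental snapshot transfer.
# Snapshot names and the backup host are hypothetical examples.
snapshot_sync_cmds() {
    dataset=$1
    prev=$2
    new=$3
    host=$4
    echo "zfs snapshot $dataset@$new"
    echo "zfs send -i $dataset@$prev $dataset@$new | ssh $host zfs receive -F $dataset"
}

snapshot_sync_cmds shps-meta/MDT hourly-01 hourly-02 backup-mds
```

Run from cron every hour or two, this gives the sync interval described above.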
 
 

Metadata Backups

 
We always back up the metadata as well. This is necessary even if we have a backup MDT.