Mellanox OFED for Linux User Manual
Rev 2.2-1.0.1
Last Updated: June 10, 2014
Document Revision History

Table 1 - Document Revision History

Release 2.2-1.0.1, June 10, 2014:
• Added the following sections:
  • Section 1.5.1, “Configuring DAPL over RoCE”, on page 28
  • Section 1.5.2, “A Detailed Example”, on page 29 and its subsections
  • Section 2.8, “UEFI Secure Boot”, on page 50
  • Section 4.3, “Ethernet over IB (EoIB) vNic”, on page 61 and its subsections
  • Section 5.4.6, “Configuring MXM over Different Transports”, on page 151
  • Section 5.4.

Release 2.2-1.0.1, April 30, 2014:
• Added the following sections:
  • Section 2.3.7, “openibd Script”, on page 45
  • Section 4.8, “Ethernet VXLAN”, on page 99
  • Section 4.9.1, “Atomic Operations in mlx5 Driver”, on page 100
  • Section 4.15.7, “Running Network Diagnostic Tools on a Virtual Function”, on page 128
  • Section 4.15.7.
• Updated the following sections:
  • Section 2.3.1, “Pre-installation Notes”, on page 35
  • Section 2.3.2, “Installation Script”, on page 36
  • Section , “Options”, on page 36
  • Section 2.3.3, “Installation Procedure”, on page 39
  • Section 2.3.6, “Installation Logging”, on page 45
  • Section 2.5.1, “Setting up MLNX_OFED YUM Repository”, on page 47
  • Section 4.6.

Release 2.1-1.0.0, December 2013:
• Updated the following sections:
  • Section 1.5, “RDMA over Converged Ethernet (RoCE)”, on page 27
  • Section 2.3.3, “Installation Procedure”, on page 39
  • Section 4.15.2, “Setting Up SR-IOV”, on page 112
  • Section 5.4.1, “Compiling Open MPI with MXM”, on page 149
  • Section 5.4.2, “Enabling MXM in OpenMPI”, on page 150
  • Section 5.4.4, “Configuring Multi-Rail Support”, on page 150
  • Section 4.10.
Rev 2.2-1.0.1 About this Manual This preface provides general information concerning the scope and organization of this User’s Manual. Intended Audience This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.
Table 3 - Glossary (Sheet 2 of 2)

Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch providing the list of ports on which to forward received multicast packets. The table is organized by MLID.
Rev 2.2-1.0.1 Related Documentation Table 4 - Reference Documents Document Name Description InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 The InfiniBand Architecture Specification that is provided by IBTA IEEE Std 802.3ae™-2002 (Amendment to IEEE Std 802.
1 Mellanox OFED Overview

1.1 Introduction to Mellanox OFED

Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers.
• mlx4_en (Ethernet)
• Mid-layer core
  • Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs)
  • IPoIB, RDS*, SRP Initiator and SRP
* NOTE: RDS was not tested by Mellanox Technologies.
1.3 Architecture

Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user space. The application level also shows the variety of markets to which Mellanox OFED applies.

Figure 1: Mellanox OFED Stack for ConnectX® Family Adapter Cards

The following sub-sections briefly describe the various components of the Mellanox OFED stack.

1.3.1 mlx4 Driver
mlx4_en: A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer.

1.3.2 mlx5 Driver

mlx5 is the low-level driver implementation for the Connect-IB™ adapters designed by Mellanox Technologies. Connect-IB™ operates as an InfiniBand adapter. The mlx5 driver is comprised of the following kernel modules:

mlx5_core: Acts as a library of common functions (e.g.
MLX5_SCATTER_TO_CQE
• Small buffers are scattered to the completion queue entry and manipulated by the driver. Valid for the RC transport.
• Enabled by default (1); set to 0 to disable.

1.3.3 Mid-layer Core

Core services include: the management interface (MAD), the connection manager (CM) interface, and the Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications. The core services run in the kernel and expose an interface to user-mode for verbs, CM and management.
1.3.5 MPI

Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks.
Rev 2.2-1.0.1 This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII® switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool). • Debug utilities A set of debug utilities (e.g., itrace, mstdump, isw, and i2c) For additional details, please refer to the MFT User’s Manual docs/. 1.
• GID format can be of 2 types, IPv4 and IPv6. An IPv4 GID is an IPv4-mapped IPv6 address, while an IPv6 GID is the IPv6 address itself.
• VLAN tagged Ethernet frames carry a 3-bit priority field. The value of this field is derived from the IB SL field by taking the 3 least significant bits of the SL field.
• RoCE traffic is not shown in the associated Ethernet device's counters since it is offloaded by the hardware and does not go through the Ethernet network driver.
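The SL-to-priority derivation described above can be sketched with shell arithmetic (an illustrative calculation only; the actual mapping is performed by the driver and hardware, and the SL value here is a placeholder):

```shell
# A RoCE frame's 3-bit VLAN priority is the 3 least significant bits of the IB SL.
sl=11                 # example IB service level (4-bit field, 0-15); placeholder value
prio=$(( sl & 0x7 ))  # keep the 3 least significant bits
echo "$prio"          # prints 3 for sl=11
```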
1.5.2 A Detailed Example

This section provides information on how to use InfiniBand over Ethernet (RoCE).

1.5.2.1 Installing and Loading the Driver

To install and load the driver:
Step 1. Install MLNX_OFED (see Section 2.3, “Installing Mellanox OFED”, on page 35 for details on installation). RoCE is installed as part of mlx4 and mlx4_en and other modules upon the driver’s installation.
Rev 2.2-1.0.1 1.5.2.3 Configuring an IP Address to the mlx4_en Interface To configure an IP address to the mlx4_en interface: Step 1. Configure an IP address to the mlx4_en interface on both sides of the link. # ifconfig eth2 20.4.3.220 # ifconfig eth2 eth2 Link encap:Ethernet HWaddr 00:02:C9:08:E8:11 inet addr:20.4.3.220 Bcast:20.255.255.255 Mask:255.0.0.
1.5.2.6 Adding VLANs

To add VLANs:
Step 1. Make sure that the 8021q module is loaded.
# modprobe 8021q
Step 2. Add a VLAN.
# vconfig add eth2 7
Added VLAN with VID == 7 to IF -:eth2:-
Step 3. Configure an IP address.
# ifconfig eth2.7 7.4.3.220
Step 4. Examine the GID table.
# cat /sys/class/infiniband/mlx4_0/ports/2/gids/0
fe80:0000:0000:0000:0202:c9ff:fe08:e811
# cat /sys/class/infiniband/mlx4_0/ports/2/gids/1
fe80:0000:0000:0000:0202:c900:0708:e811
Rev 2.2-1.0.1 1.5.2.9 Using rdma_cm Tests Step 1. Use rdma_cm test on the server. # ucmatose cmatose: starting server initiating data transfers completing sends receiving data transfers data transfers complete cmatose: disconnecting disconnected test complete return status 0 # Step 2. Use rdma_cm test on the client. # ucmatose -s 20.4.3.
2 Installation

This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed.
Rev 2.2-1.0.1 2.3 Installing Mellanox OFED The installation script, mlnxofedinstall, performs the following: 2.3.
Example
The following command will create a MLNX_OFED_LINUX TGZ package for RedHat 6.3 under the /tmp directory.
# ./MLNX_OFED_LINUX-2.2-1.0.0-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m /tmp/MLNX_OFED_LINUX-2.2-1.0.0-rhel6.3-x86_64/ --make-tgz
Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.3 under /tmp directory.
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
See log file /tmp/mlnx_ofed_iso.21642.
Rev 2.2-1.0.1 --force-fw-update Force firmware update --force Force installation --all|--hpc|--basic|--msm Install all, hpc, basic or Mellanox Subnet manager packages correspondingly --vma|--vma-vpi Install packages required by VMA to support VPI --vma-eth Install packages required by VMA to work over Ethernet --with-vma Set configuration for VMA use (to be used with any installation parameter).
2.3.2.1 mlnxofedinstall Return Codes

Table 2 lists the mlnxofedinstall script return codes and their meanings.

Table 2 - mlnxofedinstall Return Codes

Return Code    Meaning
0              The installation ended successfully
1              The installation failed
2              No firmware was found for the adapter device
22             Invalid parameter
28             Not enough free space
171            Not applicable to this system configuration. This can occur when the required hardware is not present on the system.
Rev 2.2-1.0.1 2.3.3 Installation Procedure Step 1. Login to the installation machine as root. Step 2. Mount the ISO image on your machine host1# mount -o ro,loop MLNX_OFED_LINUX---.iso /mnt Step 3. Run the installation script. Logs dir: /tmp/MLNX_OFED_LINUX-2.2-0.0.9.10694.logs This program will install the MLNX_OFED_LINUX package on your machine. Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Rev 2.2-1.0.1 Installation Installing user level RPMs: Preparing... ofed-scripts Preparing... libibverbs Preparing... libibverbs-devel Preparing... libibverbs-devel-static Preparing... libibverbs-utils Preparing... libmlx4 Preparing... libmlx4-devel Preparing... libmlx5 Preparing... libmlx5-devel Preparing... libibcm Preparing... libibcm-devel Preparing... libibumad Preparing... libibumad-devel Preparing... libibumad-static Preparing... libibmad Preparing... libibmad-devel Preparing...
Rev 2.2-1.0.1 Preparing... dapl Preparing... dapl-devel Preparing... dapl-devel-static Preparing... dapl-utils Preparing... perftest Preparing... mstflint Preparing... mft Preparing... srptools Preparing... rds-tools Preparing... rds-devel Preparing... ibutils2 Preparing... ibutils Preparing... cc_mgr Preparing... dump_pr Preparing... ar_mgr Preparing... ibdump Preparing... infiniband-diags Preparing... infiniband-diags-compat Preparing... qperf Preparing... fca Preparing... mxm Preparing...
Rev 2.2-1.0.1 Installation Preparing... ################################################## mvapich2 ################################################## Preparing... ################################################## hcoll ################################################## Preparing... ################################################## libibprof ################################################## Preparing...
Rev 2.2-1.0.1 In case your machine has the latest firmware, no firmware update will occur and the installation script will print at the end of installation a message similar to the following: Device #1: ---------Device Type: ConnectX3Pro Part Number: MCX354A-FCC_Ax Description: ConnectX-3 Pro VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE;PCIe3.0 x8 8GT/s;RoHS R6 PSID: MT_1090111019 PCI Device Name: 0000:05:00.0 Versions: Current Available FW 2.31.5000 2.31.5000 PXE 3.4.0224 3.4.
Rev 2.2-1.0.1 Installation • Driver version • Number of active HCA ports along with their states • Node GUID Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/. # hca_self_test.ofed ---- Performing Adapter Device Self Test ---Number of CAs Detected ................. 1 PCI Device Check ....................... PASS Kernel Arch ............................ x86_64 Host Driver Version .................... MLNX_OFED_LINUX-2.2-1.0.0 (OFED-2.2-1.0.0): 3.0.76-0.
Rev 2.2-1.0.1 a. You run the installation script in default mode; that is, without the option ‘--without-fw-update’. b. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image If an adapter’s Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image.
Rev 2.2-1.0.1 • Burn a firmware image from a .mlx file using the mlxburn utility (that is already installed on your machine) The following command burns firmware onto the ConnectX device with the device name obtained in the example of Step 2. > flint -d /dev/mst/mt25418_pci_cr0 -i fw-25408-2_1_8000-MCX353A-FCA_A1.bin burn Step 4. Reboot your machine after the firmware burning is completed. 2.5 Installing MLNX_OFED using YUM 2.5.1 Setting up MLNX_OFED YUM Repository Step 1.
Rev 2.2-1.0.1 Installation Step 7. MLNX_OFED YUM repository using the "mlnx_create_yum_repo.sh" script located in the downloaded MLNX_OFED package. # ./mlnx_create_yum_repo.sh --mlnx_ofed /mnt --target /repos Creating MLNX_OFED_LINUX YUM Repository under /repos... See log file /tmp/mlnx_yum.24250.log comps file was not provided, going to build it... Copying RPMS... Building YUM Repository... Creating YUM Repository settings file at: /tmp/mlnx_ofed.repo Done. Copy /tmp/mlnx_ofed.repo to /etc/yum.repos.
Rev 2.2-1.0.1 Step 2. Install the desired group. # yum groupinstall 'MLNX_OFED ALL’ Loaded plugins: product-id, security, subscription-manager This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register. Setting up Group Process Resolving Dependencies --> Running transaction check ---> Package ar_mgr.x86_64 0:1.0-0.11.g22fff4a will be installed ............... ............... rds-devel.x86_64 0:2.0.6mlnx-1 rds-tools.x86_64 0:2.0.6mlnx-1 srptools.
2.8 UEFI Secure Boot

All kernel modules included in MLNX_OFED for RHEL7 and SLES12 are signed with an x.509 key to support loading the modules when Secure Boot is enabled.

2.8.1 Enrolling Mellanox's x.509 Public Key On your Systems

In order to support loading MLNX_OFED drivers when an OS supporting Secure Boot boots on a UEFI-based system with Secure Boot enabled, the Mellanox x.
3 Configuration Files

For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt at the following location: docs/readme_and_user_manual/MLNX_OFED_configuration_files.txt

3.1 Persistent Naming for Network Interfaces

To avoid network interface renaming after boot or driver restart, use the "/etc/udev/rules.d/70-persistent-net.rules" file.
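A typical entry in that file pins an interface name to its MAC address. The MAC address and interface name below are placeholders; on most distributions this file is generated automatically by udev the first time an interface appears:

```text
# /etc/udev/rules.d/70-persistent-net.rules -- example entry (MAC and NAME are placeholders)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{address}=="00:02:c9:08:e8:11", ATTR{type}=="1", KERNEL=="eth*", NAME="eth2"
```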
Rev 2.2-1.0.1 Driver Features 4 Driver Features 4.1 SCSI RDMA Protocol 4.1.1 Overview As described in Section 1.3.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol off-load and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric.
Rev 2.2-1.0.1 allow_ext_sg Default behavior when there are more than cmd_sg_entries S/G entries after mapping; fails the request when false (default false) topspin_workarounds Enable workarounds for Topspin/Cisco SRP target bugs reconnect_delay Time between successive reconnect attempts. Time between successive reconnect attempts of SRP initiator to a disconnected target until dev_loss_tmo timer expires (if enabled), after that the SCSI target will be removed.
Rev 2.2-1.0.1 Driver Features 4.1.2.2 Manually Establishing an SRP Connection The following steps describe how to manually load an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically. • Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
Rev 2.2-1.0.1 ioc_guid A 16-digit hexadecimal number specifying the eight byte I/O controller GUID portion of the 16-byte target port identifier. dgid A 32-digit hexadecimal number specifying the destination GID. pkey A four-digit hexadecimal number specifying the InfiniBand partition key. service_id A 16-digit hexadecimal number specifying the InfiniBand service ID used to establish communication with the SRP target.
Rev 2.2-1.0.1 Driver Features tl_retry_count A number in the range 2..7 specifying the IB RC retry count. 4.1.2.
Rev 2.2-1.0.1 a. To generate output suitable for utilization in the “echo” command of Section 4.1.2.2, add the ‘-c’ option to ibsrpdm: ibsrpdm -c Sample output: id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4, dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1 b.
Rev 2.2-1.0.1 Driver Features • To discover SRP Targets reachable from the HCA device and the port , (and to generate output suitable for 'echo',) you may execute: host1# srp_daemon -c -a -o -i -p To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run ‘ls /sys/class/infiniband’. • To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
Rev 2.2-1.0.1 4.1.2.5 Multiple Connections from Initiator InfiniBand Port to the Target Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA. In case of a single Target IB port, i.e., SRP connections use the same path, the configuration is enabled using a different initiator_ext value for each SRP connection.
Rev 2.2-1.0.1 Driver Features Manual Activation of High Availability Initialization: (Execute after each boot of the driver) 1. Execute modprobe dm-multipath 2. Execute modprobe ib-srp 3. Make sure you have created file /etc/udev/rules.d/91-srp.rules as described above. 4. Execute for each port and each HCA: srp_daemon -c -e -R 300 -i -p This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/ srp_daemon.log.
Rev 2.2-1.0.1 If you manually activated SRP High Availability, perform the following steps: a. Unmount all SRP partitions that were mounted. b. Stop service srpd (Kill the SRP daemon instances). c. Make sure there are no multipath instances running. If there are multiple instances, wait for them to end or kill them. d. Run: multipath -F 3. After Automatic Activation of High Availability If SRP High Availability was automatically activated, SRP shutdown must be part of the driver shutdown ("/etc/init.
service. The InfiniBand UD datagrams encapsulate the entire Ethernet L2 datagram and its payload. To perform this operation, the module performs an address translation from Ethernet layer 2 MAC addresses (48 bits long) to InfiniBand layer 2 addresses made of LID/GID and QPN. This translation is completely invisible to the OS and user, differentiating EoIB from IPoIB, which exposes a 20-byte HW address to the OS.
Rev 2.2-1.0.1 Each vHub belongs to a specific gateway (BridgeX® + eport), and each gateway has one default vHub, and zero or more VLAN-associated vHubs. A specific gateway can have multiple vHubs distinguishable by their unique VLAN ID. Traffic coming from the Ethernet side on a specific eport will be routed to the relevant vHub group based on its VLAN tag (or to the default vHub for that GW if no vLan ID is present). 4.3.1.
Rev 2.2-1.0.1 Driver Features Table 3 - mlx4_vnic.conf file format Field Description ib_port The device name and port number in the form [device name]:[port number]. The device name can be retrieved by running ibv_devinfo and using the output of hca_id field. The port number can have a value of 1 or 2. vid [Optional field] If VLAN ID exists, the vNic will be assigned the specified VLAN ID. This value must be between 0 and 4095.
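Combining the fields above, a single host-administered vNic definition line might look like the sketch below. The field names beyond ib_port and vid (name, mac, bx, eport, vnic_id) and all values are illustrative assumptions; consult the mlx4_vnic.conf shipped with your installation for the authoritative syntax:

```text
# Example mlx4_vnic.conf entry (field set and all values are illustrative)
name=eth44 mac=00:30:48:7d:de:e4 ib_port=mlx4_0:1 vid=3 vnic_id=5 bx=BX001 eport=A10
```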
Rev 2.2-1.0.1 Table 4 - Red Hat Linux mlx4_vnic.conf file format Field Description BXADDR The BridgeX box system GUID or system name string. BXEPORT The string describing the eport name. VNICVLAN [Optional field] If it exists, the vNic will be assigned the VLAN ID specified. This value must be between 0 and 4095 or 'all' for ALL-VLAN feature. VNICIBPORT The device name and port number in the form [device name]:[port number].
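Putting the Table 4 fields into a Red Hat-style interface script might look like the following sketch; the device name, IP address, BridgeX name, eport, and VLAN ID are all placeholders:

```text
# /etc/sysconfig/network-scripts/ifcfg-eth44 (illustrative EoIB vNic definition)
DEVICE=eth44
BOOTPROTO=static
IPADDR=20.4.3.220
NETMASK=255.0.0.0
ONBOOT=yes
BXADDR=BX001
BXEPORT=A10
VNICVLAN=3
VNICIBPORT=mlx4_0:1
```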
Rev 2.2-1.0.1 Driver Features To disable network administered vNics on the host side load mlx4_vnic module with the net_admin module parameter set to 0. 4.3.2.3 VLAN Configuration A vNic instance is associated with a specific vHub group. This vHub group is connected to a BridgeX external port and has a VLAN tag attribute. When creating/configuring a vNic you define the VLAN tag it will use via the vid or the VNICVLAN fields (if these fields are absent, the vNic will not have a VLAN tag).
4.3.2.4 EoIB Multicast Configuration

Configuring multicast for EoIB interfaces is identical to multicast configuration for native Ethernet interfaces. EoIB maps Ethernet multicast addresses to InfiniBand MGIDs (Multicast GIDs) and ensures that different vHubs use mutually exclusive MGIDs, thus preventing vNics on different vHubs from communicating with one another.

4.3.2.5 EoIB and Quality of Service

EoIB enables the use of InfiniBand service levels.
Rev 2.2-1.0.1 4.3.3 Driver Features Retrieving EoIB Information 4.3.3.1 mlx4_vnic_info To retrieve information regarding EoIB interfaces, use the script mlx4_vnic_info. This script provides detailed information about a specific vNic or all EoIB vNic interfaces, such as: BX info, IOA info, SL, PKEY, Link state and interface features. If network administered vNics are enabled, this script can also be used to discover the available BridgeX® boxes from the host side.
Rev 2.2-1.0.1 Advertised link modes: Not reported Advertised auto-negotiation: No Speed: Unknown! (10000) Duplex: Full Port: Twisted Pair PHYAD: 0 Transceiver: internal Auto-negotiation: off Supports Wake-on: d Wake-on: d Current message level: 0x00000000 (0) Link detected: yes 4.3.3.3 Bonding Driver EoIB uses the standard Linux bonding driver. For more information on the Linux Bonding driver please refer to: /Documentation/networking/bonding.txt.
Rev 2.2-1.0.1 Driver Features For example, to configure a host to discover GWs on three partitions 0xffff,0xfff1 and 0x3 add the following line to modprobe configuration file: options mlx4_vnic discovery_pkeys=0xffff,0xfff1,0x3 When using this feature combined with host administrated vnics, each vnic should also be configured with the partition it should be created on.
A gateway that is configured to work in ALL VLAN mode cannot accept login requests from:
• vNics that do not support this mode
• host admin vNics that were not configured to work in ALL VLAN mode by setting the vlan-id value to 'all', as described in Section “Creating vNICs that Support ALL VLAN Mode”, on page 71.

Creating vNICs that Support ALL VLAN Mode

VLANs are created on a vNIC that supports ALL VLAN mode using "vconfig".
Rev 2.2-1.0.1 • Driver Features vNic Support To verify the vNIC is configured to All-VLAN mode. Run: mlx4_vnic_info -i Example: # mlx4_vnic_info -i eth204 NETDEV_NAME eth204 NETDEV_LINK up NETDEV_OPEN yes GW_TYPE LEGACY ALL_VLAN yes For further information on mlx4_vnic_info script, please see Section 4.3.3.1, “mlx4_vnic_info”, on page 68. 4.3.4 Advanced EoIB Settings 4.3.4.1 Module Parameters The mlx4_vnic driver supports the following module parameters.
Rev 2.2-1.0.1 4.3.4.2 vNic Interface Naming The mlx4_vnic driver enables the kernel to determine the name of the registered vNic. By default, the Linux kernel assigns each vNic interface the name eth, where is an incremental number that keeps the interface name unique in the system. The vNic interface name may not remain consistent among hosts or BridgeX reboots as the vNic creation can happen in a different order each time.
For the full list of mlx4_vnic module parameters, run:
# modinfo mlx4_vnic

Network Configuration

PV-EoIB supports both L2 (bridged) and L3 (routed) network models. The 'physical' interfaces that can be enslaved to the Hypervisor virtual bridge are actually EoIB vNics, and they can be created as on a native Linux machine. The PV-EoIB driver supports both host-administrated and network-administrated vNics. Please refer to Section 4.3.
Virtual Guest Tagging (VGT) is not supported. The model explained above applies to Virtual Switch Tagging (VST) only.

Migration

Some Hypervisors provide the ability to migrate a virtual machine from one physical server to another; this feature is seamlessly supported by PV-EoIB. Any network connectivity over EoIB will automatically be resumed on the new physical server. The downtime that may occur during this process is minor.
Rev 2.2-1.0.1 Driver Features 4.4 IP over InfiniBand 4.4.1 Introduction The IP over IB (IPoIB) driver is a network interface implementation over InfiniBand. IPoIB encapsulates IP datagrams over an InfiniBand Connected or Datagram transport service.
Rev 2.2-1.0.1 4.4.3 IPoIB Configuration Unless you have run the installation script mlnxofedinstall with the flag ‘-n’, then IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
Example:
host1# dhcpd ib0 -d

4.4.3.1.2 DHCP Client (Optional)

A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under “Example: Adding an IB Driver to initrd (Linux)”.
In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier.
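As a sketch, such a configuration file for dhclient might look like the following. The file name and the client-identifier bytes are placeholders; on a real system the identifier is derived from the IPoIB interface's 20-byte hardware address:

```text
# /etc/dhclient-ib0.conf (sketch; the client-identifier bytes are placeholders)
interface "ib0" {
  send dhcp-client-identifier ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}
```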
Rev 2.2-1.0.1 4.4.3.2 Static IPoIB Configuration If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the ‘-n’ option) containing the full IP configuration.
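A minimal static configuration file for the first IPoIB port on a Red Hat-style system might look like this sketch; the address and mask are illustrative values, so adjust them to your network:

```text
# /etc/sysconfig/network-scripts/ifcfg-ib0 (illustrative static IPoIB configuration)
DEVICE=ib0
BOOTPROTO=static
IPADDR=11.4.3.175
NETMASK=255.255.0.0
ONBOOT=yes
```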
• The subnet mask that you want to assign to the interface

The following example shows how to configure an IB interface:
host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0

Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument. The following example shows how to verify the configuration:
host1$ ifconfig ib0
ib0  Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
     inet addr:11.4.
Using the example of Step 2:
host1$ ifconfig ib0.8001
ib0.8001 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00
BROADCAST MULTICAST MTU:2044 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:128
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Step 4. As can be seen, the interface does not have IP or network addresses.
4.4.6 Bonding IPoIB Bonding of IPoIB interfaces is accomplished in the same manner as bonding of Ethernet interfaces: via the Linux Bonding Driver. To create an interface configuration script for the ibX and bondX interfaces, use the standard syntax (depending on your OS).
• Network Script files for IPoIB slaves are named after the IPoIB interfaces (e.
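Concretely, on a Red Hat style system a bond over two IPoIB slaves might be sketched as follows (all names and values are illustrative):

```shell
# /etc/sysconfig/network-scripts/ifcfg-bond0  (hypothetical example)
DEVICE=bond0
IPADDR=10.0.0.1
NETMASK=255.255.0.0
ONBOOT=yes
BOOTPROTO=none
BONDING_OPTS="mode=active-backup miimon=100"

# /etc/sysconfig/network-scripts/ifcfg-ib0  (one slave; ifcfg-ib1 is analogous)
DEVICE=ib0
MASTER=bond0
SLAVE=yes
ONBOOT=yes
BOOTPROTO=none
```

Note that active-backup is the bonding mode commonly used with IPoIB slaves.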
Rev 2.2-1.0.1 4.5 Quality of Service InfiniBand 4.5.1 Quality of Service Overview Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. Figure 2: I/O Consolidation Over InfiniBand QoS over Mellanox OFED for Linux is discussed in Chapter 8, “OpenSM – Subnet Manager”.
4.5.2 QoS Architecture QoS functionality is split between the SM/SA, the CMA, and the various ULPs. We take the "chronology approach" to describe how the overall system works.
1. The network manager (human) provides a set of rules (policy) that defines how the network is configured and how its resources are split among different QoS-Levels. The policy also defines how to decide which QoS-Level each application, ULP, or service uses. 2.
Rev 2.2-1.0.1 II. Fabric Setup Defines how the SL2VL and VLArb tables should be setup. In OFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts). III. QoS-Levels Definition This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits. Path Bits are not implemented in OFED. IV.
4.5.5 OpenSM Features The QoS related functionality provided by OpenSM (the Subnet Manager described in Chapter 8) can be split into two main parts: I. Fabric Setup During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements. II. PR/MPR Query Handling OpenSM enforces the provided policy on client requests.
1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
2. ToS is translated into sk_prio using a fixed translation:
ToS 0 <=> sk_prio 0
ToS 8 <=> sk_prio 2
ToS 24 <=> sk_prio 4
ToS 16 <=> sk_prio 6
3. The Socket Priority is mapped to the UP:
• If the underlying device is a VLAN device, the egress_map, controlled by the vconfig command, is used. This is a per-VLAN mapping.
• If the underlying device is not a VLAN device, the tc command is used.
Rev 2.2-1.0.1 Driver Features 4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used. With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping. 4.6.5 Raw Ethernet QP Quality of Service Mapping Applications open a Raw Ethernet QP using VERBs directly. The following is the RoCE QoS mapping flow: 1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP: • Sets qp_attrs.ah_attrs.
be used. mlnx_qos gets a list of mappings between UPs and TCs. For example, mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0 and UPs 4-7 to TC1. 4.6.7 Quality of Service Properties The different QoS properties that can be assigned to a TC are:
• Strict Priority (see "Strict Priority")
• Minimal Bandwidth Guarantee (ETS) (see "Minimal Bandwidth Guarantee (ETS)")
• Rate Limit (see "Rate Limit") 4.6.7.
• Assign a transmission algorithm to each TC (strict or ETS)
• Set a minimal BW guarantee to ETS TCs
• Set a rate limit to TCs
For an unlimited rate limit, set the ratelimit to 0.
Usage: mlnx_qos -i [options]
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-p LIST, --prio_tc=LIST maps UPs to TCs. LIST is 8 comma separated TC numbers.
Set rate limits: 3Gbps for tc0, 4Gbps for tc1, and 2Gbps for tc2:
tc: 0 ratelimit: 3 Gbps, tsa: strict
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
up: 1
up: 2
up: 3
up: 4
up: 5
up: 6
up: 7
Configure QoS: map UPs 0 and 7 to tc0, UPs 1, 2 and 3 to tc1, and UPs 4, 5 and 6 to tc2; set tc0 and tc1 to ETS and tc2 to strict.
up: 1
up: 2
up: 3
tc: 2 ratelimit: 2 Gbps, tsa: strict
up: 4
up: 5
up: 6
4.6.8.2 tc and tc_wrap.py
The 'tc' tool is used to set up the sk_prio to UP mapping, using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs. The 'tc_wrap.py' tool will use either the sysfs or the 'tc' tool to configure the sk_prio to UP mapping. Usage: tc_wrap.
UP 2
UP 3
UP 4
UP 5
UP 6
UP 7
4.6.8.3 Additional Tools
A tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.
• mlnx_qos tool (package: ofed-scripts), requires python >= 2.5
• tc_wrap.py (package: ofed-scripts), requires python >= 2.5
4.7 Ethernet Time-Stamping 4.7.
To enable time stamping for a net device: An admin-privileged user can enable/disable time stamping by calling ioctl (sock, SIOCSHWTSTAMP, &ifreq) with the following values: Send side time stamping:
• Enabled by ifreq.hwtstamp_config.tx_type when
/* possible values for hwtstamp_config->tx_type */
enum hwtstamp_tx_types {
/*
* No outgoing packet will need hardware time stamping;
* should a packet arrive which asks for it, no hardware
* time stamping will be done.
Receive side time stamping:
• Enabled by ifreq.hwtstamp_config.
Rev 2.2-1.0.1 a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket. When time-stamping is enabled, VLAN stripping is disabled. For more info please refer to Documentation/networking/timestamping.txt in kernel.org 4.7.1.
Rev 2.2-1.0.1 Driver Features For example: struct ibv_exp_device_attr attr; ibv_exp_query_device(context, &attr); if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_TIMESTAMP_MASK) { if (attr.timestamp_mask) { /* Time stamping is supported with mask attr.timestamp_mask */ } } if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_HCA_CORE_CLOCK) { if (attr.hca_core_clock) { /* reporting the device's clock is supported. */ /* attr.hca_core_clock is the frequency in MHZ */ } } 4.7.2.
Rev 2.2-1.0.1 4.7.2.4 Querying the Hardware Time Querying the hardware for time is done via the ibv_exp_query_values verb. For example: ret = ibv_exp_query_values(context, IBV_EXP_VALUES_HW_CLOCK, &queried_values); if (!ret && queried_values.comp_mask & IBV_EXP_VALUES_HW_CLOCK) queried_time = queried_values.hwclock; To change the queried time in nanoseconds resolution, use the IBV_EXP_VALUES_HW_CLOCK_NS flag along with the hwclock_ns field.
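Since the device reports hca_core_clock in MHz (i.e. clock cycles per microsecond), a raw hwclock delta between two queries can be converted to elapsed time. A quick sketch of the arithmetic (both values are illustrative, not taken from real hardware):

```shell
# Convert a raw hardware clock delta (cycles) to microseconds,
# given the core clock frequency in MHz (cycles per microsecond).
hca_core_clock_mhz=156      # illustrative frequency from ibv_exp_query_device
cycles=1560000              # illustrative delta between two queried timestamps
usecs=$((cycles / hca_core_clock_mhz))
echo "$usecs"
```

Under these illustrative values this prints 10000, i.e. 10 ms of elapsed time.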
4.8.3 Important Notes
• VXLAN tunneling adds 50 bytes (14-eth + 20-ip + 8-udp + 8-vxlan) to the VM Ethernet frame. Please verify that the MTU of the NIC that sends the packets (e.g. the VM virtio-net NIC, the host side veth device, or the uplink) takes the tunneling overhead into account. That is, the MTU of the sending NIC has to be decremented by 50 bytes (e.g. 1450 instead of 1500), or the uplink NIC MTU has to be incremented by 50 bytes (e.
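The 50-byte overhead arithmetic above can be sketched as a quick check: given an uplink MTU, the inner (VM-facing) MTU must not exceed:

```shell
# Compute the largest inner MTU that fits in a VXLAN-encapsulated frame.
uplink_mtu=1500
vxlan_overhead=$((14 + 20 + 8 + 8))   # eth + ip + udp + vxlan = 50 bytes
inner_mtu=$((uplink_mtu - vxlan_overhead))
echo "$inner_mtu"
```

With a standard 1500-byte uplink this prints 1450, matching the example in the text.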
4.9.2.2 Masked Fetch and Add (MFetchAdd) The MFetchAdd atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length. The atomic add is done independently on each one of these fields. A bit set in the field_boundary parameter specifies the field boundaries.
Rev 2.2-1.0.1 Driver Features The diagram below describes the topology that was created after these steps: The diagram shows how the traffic from the Virtual Machine goes to the virtual-bridge in the Hypervisor and from the bridge to the eIPoIB interface. eIPoIB interface is the Ethernet interface that enslaves the IPoIB interfaces in order to send/receive packets from the Ethernet interface in the Virtual Machine to the IB fabric beneath. 4.10.
For example, on a system with a dual port HCA, the following two interfaces might be created: eth4 and eth5.
cat /sys/class/net/eth_ipoib_interfaces
eth4 over IB port: ib0
eth5 over IB port: ib1
These interfaces can be used to configure the network for the guest.
Rev 2.2-1.0.1 Driver Features Figure 3: An Example of a Virtual Network The example above shows a few IPoIB instances that serve the virtual interfaces at the Virtual Machines. To display the services provided to the Virtual Machine interfaces: # cat /sys/class/net/eth0/eth/vifs Example: # cat /sys/class/net/eth0/eth/vifs SLAVE=ib0.2 MAC=52:54:00:60:55:88 VLAN=N/A In the example above the ib0.2 IPoIB interface serves the MAC 52:54:00:60:55:88 with no VLAN tag for that interface. 4.10.
4.10.4 Setting Performance Tuning
• Use 4K MTU over OpenSM. For further information, please refer to Section 8.4.1, "File Format", on page 173:
Default=0xffff, ipoib, mtu=5 : ALL=full;
• Use a 4K MTU (4092 bytes). In UD mode, the maximum MTU value is 4092 bytes. Make sure that all interfaces (including the guest interface and its virtual bridge) have the same MTU value (4092 bytes). For further information on MTU settings, please refer to the Hypervisor User Manual.
Rev 2.2-1.0.1 Driver Features address field of the struct ibv_mr will hold the address to the allocated memory block. This block will be freed implicitly when the ibv_dereg_mr() is called. The following are environment variables that can be used to control error cases / contiguity: Table 6 - Parameters Used to Control Error Cases / Contiguity Parameters MLX_MR_ALLOC_TYPE Description Configures the allocator type.
Rev 2.2-1.0.1 mode to that MR. The desired access is validated against its given permissions and upon successful creation, the physical pages of the original MR are shared by the new MR. Once the MR is shared, it can be used even if the original MR was destroyed. The request to share the MR can be repeated multiple times and an arbitrary number of Memory Regions can potentially share the same physical memory locations.
Rev 2.2-1.0.1 Driver Features ity, domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules may be inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses a different terminology from the flow attribute (ibv_exp_flow_attr), defined by a combination of specifications (struct ibv_exp_flow_spec_*). 4.14.
Rev 2.2-1.0.1 Input parameters: • struct ibv_qp - the attached QP. • struct ibv_exp_flow_attr - attaches the QP to the flow specified. The flow contains mandatory control parameters and optional L2, L3 and L4 headers.
Rev 2.2-1.0.1 Driver Features All packets that contain the above destination IP address and source port are to be steered into rxring 2. When destination MAC is not given, the user's destination MAC is filled automatically. • ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 vlan 45 m 0xf000 loc 5 action 2 All packets that contain the above destination MAC address and specific VLAN are steered into ring 2. Please pay attention to the VLAN's mask 0xf000. It is required in order to add such a rule.
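Rules inserted this way can later be listed and removed with the same tool; for example (the device name and rule location are illustrative):

```shell
# Show the flow steering rules currently installed on the device
ethtool -u eth5
# Delete the rule previously installed at location 5
ethtool -U eth5 delete 5
```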
• The mlx4 ipoib driver, when it attaches its QP to its configured GIDs
Fragmented UDP traffic cannot be steered. It is treated as 'other' protocol by the hardware (from the first packet) and is not considered UDP traffic.
We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher as of MLNX_OFED v2.0-3.0.0 due to API changes. 4.
4.15.2 Setting Up SR-IOV Depending on your system, perform the steps below to set up your BIOS. The figures used in this section are for illustration purposes only. For further information, please refer to the appropriate BIOS User Manual:
Step 1. Enable "SR-IOV" in the system BIOS.
Step 2. Enable "Intel Virtualization Technology".
Step 3. Install a hypervisor that supports SR-IOV.
Step 4. Depending on your system, update the /boot/grub/grub.
For example, on Intel systems, add:
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.32-36.x86-64)
root (hd0,0)
kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
initrd /initrd-2.6.32-36.x86-64.img
1. Please make sure the parameter "intel_iommu=on" exists when updating the /boot/grub/grub.conf file, otherwise SR-IOV cannot be loaded.
Step 5.
2. Add the above fields to the INI if they are missing.
3. Set the total_vfs parameter to the desired number if you need to change the number of total VFs.
4. Reburn the firmware using the mlxburn tool if the fields above were added to the INI, or if the total_vfs parameter was modified. If mlxburn is not installed, please download it from the Mellanox website http://www.mellanox.com > products > Firmware tools
mlxburn -fw ./fw-ConnectX3-rel.
Rev 2.2-1.0.1 Parameter num_vfs Recommended Value • • • If absent, or zero: no VFs will be available If its value is a single number in the range of 0-63: The driver will enable the num_vfs VFs on the HCA and this will be applied to all ConnectX® HCAs on the host.
Rev 2.2-1.0.1 Driver Features Parameter Recommended Value port_type_array Specifies the protocol type of the ports. It is either one array of 2 port types 't1,t2' for all devices or list of BDF to port_type_array 'bb:dd.f-t1;t2,...'. (string) Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A If only a single port is available, use the N/A port type for port2 (e.g '1,4').
Rev 2.2-1.0.1 Parameter Recommended Value • probe_vf=1,2,3 - The PF driver will activate 1 VF on physical port 1, 2 VFs on physical port 2 and 3 dual port VFs (applies only to dual port HCA when all ports are Ethernet ports). • This applies to all ConnectX® HCAs in the host. probe_vf=00:04.0-5;6;7,00:07.0-8;9;10 - The PF driver will activate: • HCA positioned in BDF 00:04.0 • 5 single VFs on port 1 • 6 single VFs on port 2 • 7 dual port VFs • HCA positioned in BDF 00:07.
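These module parameters are normally made persistent in a modprobe configuration file rather than passed by hand. A hypothetical example enabling 8 single-port VFs with both ports in Ethernet mode (the file name and values are illustrative):

```shell
# /etc/modprobe.d/mlx4_core.conf  (hypothetical example values)
options mlx4_core num_vfs=8 probe_vf=1 port_type_array=2,2
```

After editing the file, reload the driver (e.g. /etc/init.d/openibd restart) for the new values to take effect.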
Rev 2.2-1.0.1 Driver Features Step 10. Load the driver and verify the SR-IOV is supported. Run: lspci | grep Mellanox 03:00.0 InfiniBand: Mellanox / 10GigE] (rev b0) 03:00.1 InfiniBand: Mellanox (rev b0) 03:00.2 InfiniBand: Mellanox (rev b0) 03:00.3 InfiniBand: Mellanox (rev b0) 03:00.4 InfiniBand: Mellanox (rev b0) 03:00.5 InfiniBand: Mellanox (rev b0) Technologies MT26428 [ConnectX VPI PCIe 2.
Rev 2.2-1.0.1 Step 4. Attach a virtual NIC to VM. ifconfig -a … eth6 Link encap:Ethernet HWaddr 52:54:00:E7:77:99 inet addr:13.195.15.5 Bcast:13.195.255.255 Mask:255.255.0.0 inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1 RX packets:481 errors:0 dropped:0 overruns:0 frame:0 TX packets:450 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:22440 (21.9 KiB) TX bytes:19232 (18.7 KiB) Interrupt:10 Base address:0xa000 … 4.15.
Step 7. Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file. The MAC address for every virtual function is configured randomly; therefore, it is not necessary to add it. 4.15.5 Uninstalling SR-IOV Driver To uninstall the SR-IOV driver, perform the following: Step 1. For Hypervisors, detach all the Virtual Functions (VF) from all the Virtual Machines (VM) or stop the Virtual Machines that use the Virtual Functions.
Rev 2.2-1.0.1 Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF.
Rev 2.2-1.0.1 Driver Features 4.15.6.2 Virtual Function InfiniBand Ports Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem, ULPs, and applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments.
• /port//pkey_idx/, where m = 1..2 and n = 0..126 For instructions on configuring pkey_idx, please see below. 4.15.6.2.2 Configuring an Alias GUID (under ports//admin_guids) Step 1. Determine the GUID index of the PCI Virtual Function that you want to pass through to a guest. For example, if you want to pass through PCI function 02:00.3 to a certain guest, you initially need to see which GUID index is used for this function. To do so: cat /sys/class/infiniband/iov/0000:02:00.
Rev 2.2-1.0.1 Driver Features The example below shows the mapping between “entry 0” of to its physical one on port number 1. cat /sys/class/infiniband/mlx4_0/iov//ports/1/gid_idx/0 Initial GUIDs' values depend on the mlx4_ib module parameter 'sm_guid_assign' as follows: Mode Type sm assigned Description Asks SM for values for GUID entry 0 per VF. Other entries will have value 0 in the port GUID table, and ffffffffffffffff under their matching admin_guids entry.
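Once the GUID index is known, an administratively assigned GUID can be written through the matching admin_guids entry. A hedged sketch (the sysfs paths, entry index, and GUID value are all illustrative; the exact layout is as described above):

```shell
# Assign an alias GUID at entry 3 of port 1 (hypothetical values)
echo 0x0002c9030012e732 > /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids/3
# Verify the value was accepted
cat /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids/3
```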
Rev 2.2-1.0.1 This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual PKey index 0, both vm-1s will have the same pkey and both vm-2s will have the same PKey (different from the vm-1's), and the Dom0's will have the default pkey (different from the vm's pkeys at index 0). OpenSM must be used to configure the physical Pkey tables on both hosts.
Rev 2.2-1.0.1 Driver Features Step b. Assuming that on Host1, the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0" On Host1 do the following. cd /sys/class/infiniband/mlx4_0/iov 0000:02:00.0 0000:02:00.1 0000:02:00.2 ...1 1. 0000:02:00.0 contains the virtual-to-physical mapping tables for the physical function. 0000:02:00.X contain the virt-to-phys mapping tables for the virtual functions. Do not touch the Dom0 mapping table (under ::00.0).
Rev 2.2-1.0.1 The feature may be controlled on the Hypervisor from userspace via iprout2 / netlink: ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ] ... [ vf NUM [ mac LLADDR ] [ vlan VLANID [ qos VLAN-QOS ] ] ... [ spoofchk { on | off} ] ] ... use: ip link set dev vf vlan [qos ] • where NUM = 0..max-vf-num • vlan_id = 0..4095 (4095 means "set VGT") • qos = 0..
Rev 2.2-1.0.1 Driver Features When a MAC is ff:ff:ff:ff:ff:ff, the VF is not assigned to the port of the net device it is listed under. In the example above, vf 38 is not assigned to the same port as p1p1, in contrast to vf0. However, even VFs that are not assigned to the net device, could be used to set and change its settings. For example, the following is a valid command to change the spoof check: ip link set dev p1p1 vf 38 spoofchk on This command will affect only the vf 38.
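Putting the syntax above together, a typical per-VF configuration might look like the following sketch (device name, VF number, and values are illustrative):

```shell
# Assign a MAC address to VF 0 (illustrative address)
ip link set dev p1p1 vf 0 mac 00:02:c9:12:34:56
# Put VF 0 in VST mode with VLAN 10 and QoS priority 3
ip link set dev p1p1 vf 0 vlan 10 qos 3
# Switch the same VF back to VGT (VLAN id 4095 means "set VGT" per the text above)
ip link set dev p1p1 vf 0 vlan 4095
# Turn on anti-spoofing for the VF
ip link set dev p1p1 vf 0 spoofchk on
```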
4.15.7.2 Granting SMP Capability to a Virtual Function To enable SMP capability for a VF, one must enable the Subnet Management Interface (SMI) for that VF. By default, the SMI interface is disabled for VFs. To enable SMI MADs for VFs, there are two new sysfs entries per VF per port on the Hypervisor (under /sys/class/infiniband/mlx4_X/iov//ports/<1 or 2>). These entries are displayed only for VFs (not for the PF), and only for IB ports (not ETH ports).
4.16.1 QCN Tool - mlnx_qcn mlnx_qcn is a tool used to configure QCN attributes of the local host. It communicates directly with the driver, and thus does not require setting up a DCBX daemon on the system.
Rev 2.2-1.0.1 --rpg_min_dec_fac=RPG_MIN_DEC_FAC_LIST --rpg_min_rate=RPG_MIN_RATE_LIST --cndd_state_machine=CNDD_STATE_MACHINE_LIST Set value of rpg_min_dec_fac according to priority, use spaces between values and -1 for unknown values. Set value of rpg_min_rate according to priority, use spaces between values and -1 for unknown values. Set value of cndd_state_machine according to priority, use spaces between values and -1 for unknown values.
Rev 2.2-1.0.1 Driver Features rpg_max_rate: 40000 rpg_ai_rate: 10 rpg_hai_rate: 50 rpg_gd: 8 rpg_min_dec_fac: 2 rpg_min_rate: 10 cndd_state_machine: 0 4.16.2 Setting QCN Configuration Setting the QCN parameters, requires updating its value for each priority. '-1' indicates no change in the current value.
Rev 2.2-1.0.1 The following are the ethtool supported options: Table 8 - ethtool Supported Options Options ethtool -i eth Description Checks driver and device information. For example: #> ethtool -i eth2 driver: mlx4_en (MT_0DD0120009_CX3) version: 2.1.6 (Aug 2013) firmware-version: 2.30.3000 bus-info: 0000:1a:00.0 ethtool -k eth Queries the stateless offload status.
Rev 2.2-1.0.1 Driver Features Table 8 - ethtool Supported Options Options ethtool -C eth [rx-usecs N] [rxframes N] Description Sets the interrupt coalescing settings when the adaptive moderation is disabled. Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt. ethtool -a eth Queries the pause frame settings. ethtool -A eth [rx on|off] [tx on|off] Sets the pause frame settings.
4.20 PeerDirect PeerDirect uses an API between IB CORE and peer memory clients (e.g. GPU cards) to give an HCA access to read/write peer memory data buffers. As a result, it allows RDMA-based (over InfiniBand/RoCE) applications to use peer device computing power and the RDMA interconnect at the same time, without copying the data between the P2P devices. For example, PeerDirect is being used for GPUDirect RDMA.
4.22 Ethernet Performance Counters Counters are used to provide information about how well an operating system, an application, a service, or a driver is performing. The counter data helps determine system bottlenecks and fine-tune the system and application performance. The operating system, network, and devices provide counter data that an application can consume to provide users with a graphical view of how well the system is performing.
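These per-interface counters can be read with ethtool; for example (the interface name and counter name are illustrative):

```shell
# Dump all driver statistics counters for the interface
ethtool -S eth2
# Watch a specific counter, e.g. received packets
ethtool -S eth2 | grep rx_packets
```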
Memory Windows API cannot co-work with peer memory clients (PeerDirect). 4.23.1 Query Capabilities Memory Windows are available if and only if the hardware supports them. To verify whether Memory Windows are available, run ibv_exp_query_device. For example:
struct ibv_exp_device_attr device_attr = {.comp_mask = IBV_EXP_DEVICE_ATTR_RESERVED - 1};
ibv_exp_query_device(context, &device_attr);
if (device_attr.exp_device_cap_flags & IBV_EXP_DEVICE_MEM_WINDOW || device_attr.
4.24 pm_qos Usage on Ingress Packet Traffic The pm_qos API is used by mlx4_en to enforce a minimum DMA latency requirement on the system when ingress traffic is detected. Additionally, it decreases packet loss on systems configured with an abundant power state profile when Flow Control is disabled.
5 HPC Features 5.1 HPC-X The Mellanox HPC-X Scalable Software Toolkit provides various acceleration packages to improve both the performance and scalability of popular MPI and SHMEM/PGAS libraries. These packages include MXM (Mellanox Messaging), which accelerates the underlying send/receive (or put/get) messages, and FCA (Fabric Collectives Accelerations), which accelerates the underlying collective operations used by the MPI/PGAS languages.
• "get" operations - data transfer from a different PE, and remote pointers, allowing direct references to data objects owned by another PE
Additional supported operations are collective broadcast and reduction, barrier synchronization, and atomic memory operations. An atomic memory operation is an atomic read-and-update operation, such as a fetch-and-increment, on a remote or local data object. SHMEM libraries implement active messaging.
Rev 2.2-1.0.1 To enable FCA in the shmemrun command line, add the following: -mca scoll_fca_enable=1 -mca scoll_fca_enable_np 0 To disable FCA: -mca scoll_fca_enable 0 -mca coll_fca_enable 0 For more details on FCA installation and configuration, please refer to the FCA User Manual found in the Mellanox website. 5.2.
Rev 2.2-1.0.1 5.2.5 HPC Features Running ScalableSHMEM Application The ScalableSHMEM framework contains the shmemrun utility which launches the executable from a service node to compute nodes. This utility accepts the same command line parameters as mpirun from the OpenMPI package. For further information, please refer to OpenMPI MCA parameters documentation at: http://www.open-mpi.org/faq/?category=running. Run "shmemrun --help" to obtain ScalableSHMEM job launcher runtime parameters.
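Since shmemrun accepts the same command line parameters as mpirun from the OpenMPI package, a launch from a service node might be sketched as follows (the process count, hostfile, and binary name are illustrative):

```shell
# Launch a SHMEM application on 16 processes across the hosts listed in ./hostfile
shmemrun -np 16 -hostfile ./hostfile ./my_shmem_app
```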
Rev 2.2-1.0.1 5.3.2.1 SSH Configuration The following steps describe how to configure password-less access over SSH: Step 1. Generate an ssh key on the initiator machine (host1). host1$ ssh-keygen -t rsa Generating public/private rsa key pair. Enter file in which to save the key (/home//.ssh/id_rsa): Enter passphrase (empty for no passphrase): Enter same passphrase again: Your identification has been saved in /home//.ssh/id_rsa. Your public key has been saved in /home//.
Rev 2.2-1.0.1 5.3.3 HPC Features MPI Selector - Which MPI Runs Mellanox OFED contains a simple mechanism for system administrators and end-users to select which MPI implementation they want to use. The MPI selector functionality is not specific to any MPI implementation; it can be used with any implementation that provides shell startup files that correctly set the environment for that MPI. The Mellanox OFED installer will automatically add MPI selector support for each MPI that it installs.
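A typical mpi-selector session might look like the following sketch (the listed MPI name is illustrative; run mpi-selector --list on your system to see what is actually registered):

```shell
# List the registered MPI implementations
mpi-selector --list
# Query the current default
mpi-selector --query
# Set the per-user default MPI (takes effect in new shells)
mpi-selector --set openmpi-1.6.5 --user
```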
Rev 2.2-1.0.1 • Connection management • Receive side tag matching • Intra-node shared memory communication These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries. The latest MXM software can be downloaded from the Mellanox website. 5.4.1 Compiling Open MPI with MXM Step 1. Install MXM from: • an RPM % rpm -ihv mxm-x.y.z-1.x86_64.rpm • a tarball % tar jxf mxm-x.y.z.
Rev 2.2-1.0.1 HPC Features When upgrading to MXM v0.52, Open MPI compiled with the previous versions of the MXM should be recompiled with MXM v0.52. 5.4.2 Enabling MXM in OpenMPI As of MXM v2.1, MXM is automatically selected by Open MPI (up to v1.6) when the Number of Processes (NP) is higher or equal to 128. To activate MXM for any NP, run: % mpirun -mca mtl_mxm_np 0 <...other mpirun parameters ...> From Open MPI v1.7.x, MXM is selected when the number of processes is higher or equal to 0. i.e.
Rev 2.2-1.0.1 Possible values for MXM_IB_MAP_MODE are: • first - [Default] Maps the first suitable HCA port to all processes • affinity - Distributes the HCA ports evenly among processes based on CPU affinity • nearest - Tries to find the nearest HCA port based on CPU affinity You may also use an asterisk (*) and a question mark (?) to choose the HCA and the port you would like to use.
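Such MXM tuning variables are usually exported to the ranks through mpirun's -x option; for example (process count and binary name are illustrative):

```shell
# Distribute HCA ports among ranks based on CPU affinity
mpirun -x MXM_IB_MAP_MODE=affinity -np 128 ./my_mpi_app
```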
Rev 2.2-1.0.1 HPC Features By default the transports (TLS) used are: MXM_TLS=self,shm,ud 5.4.7 Configuring Service Level Support Service Level Support is currently at alpha level. Please be aware that the content below is subject to change. MXM v3.0 added support for Service Level to enable Quality of Service (QoS). If set, every InfiniBand endpoint in MXM will generate a random Service Level (SL) within the given range, and use it for outbound communication.
Rev 2.2-1.0.1 ware multicast. In FCA 3.0 we also expose the performance and scalability of Mellanox's advanced point-to-point library, MXM 2.x, in the form of the "mlnx_p2p" BCOL. This allows users to take full advantage of new features with minimal effort. FCA 3.0 is a standalone library that can be integrated into any MPI or PGAS runtime. Support for FCA 3.0 is currently integrated into Open MPI versions 1.7.4 and higher. The 3.
Rev 2.2-1.0.1 HPC Features Figure 5: FCA Components After MLNX_OFED installation, FCA can be found at /opt/mellanox/fca folder. For further information on configuration instructions, please refer to the FCA User Manual. 5.6 ScalableUPC Unified Parallel C (UPC) is an extension of the C programming language designed for high performance computing on large-scale parallel machines.The language provides a uniform programming model for both shared and distributed memory hardware.
Rev 2.2-1.0.1 • GasNet library contains MXM conduit which offloads from UPC all P2P operations as well as some synchronization routines. For further information on MXM, please refer to the Mellanox website. Mellanox OFED 1.8 includes ScalableUPC 2.1, which is installed under: /opt/mellanox/bupc. If you have installed OFED 1.8, you do not need to download and install ScalableUPC. Mellanox ScalableUPC is distributed as source RPM as well and can be downloaded from the Mellanox website. 5.6.
Rev 2.2-1.0.1 HPC Features Table 18 - Runtime Parameters Parameter -fca_ops <+/->[op_list] Description op_list - comma separated list of collective operations. • -fca_ops <+/->[op_list] - Enables/disables only the • specified operations -fca_ops <+/-> - Enables/disables all operations By default all operations are enabled. Allowed operation names are: barrier (br), bcast (bt), reduce (rc), allgather (ag).
6 Working With VPI VPI allows ConnectX ports to be independently configured as either IB or Eth. 6.1 Port Type Management ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded.
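Besides the interactive connectx_port_config script, on mlx4 devices the port type can also be changed directly through sysfs. A hedged sketch (the PCI address is illustrative):

```shell
# Show the current type of port 1 on the device
cat /sys/bus/pci/devices/0000:02:00.0/mlx4_port1
# Switch port 1 to Ethernet ("ib", "eth" and "auto" are the accepted values)
echo eth > /sys/bus/pci/devices/0000:02:00.0/mlx4_port1
```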
7 Performance For further information on Linux performance, please refer to the Performance Tuning Guide for Mellanox Network Adapters.
Rev 2.2-1.0.1 8 OpenSM – Subnet Manager 8.1 Overview OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15). 8.
Rev 2.2-1.0.1 OpenSM – Subnet Manager bound to 1 port at a time. If GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port. --lmc, -l This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports.
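The LID-per-port arithmetic described for --lmc can be checked quickly; each port receives 2^LMC LIDs:

```shell
# Number of LIDs assigned per port for each legal LMC value (0-7)
for lmc in 0 1 2 3 4 5 6 7; do
  echo "LMC=$lmc -> $((1 << lmc)) LIDs"
done
```

For example, LMC=3 yields 8 LIDs per port, allowing up to 8 distinct paths to that port.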
Rev 2.2-1.0.1 --do_mesh_analysis This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular cartesian meshes which may reduce the number of SLs required to give a deadlock free routing --lash_start_vl Sets the starting VL to use for the lash routing algorithm. Defaults to 0. --sm_sl Sets the SL to use to communicate with the SM/SA. Defaults to 0.
Rev 2.2-1.0.1 --timeout, -t This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds. --retries This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions. --maxsmps, -n This option specifies the number of VL15 SMP MADs allowed on the wire at any one time.
--port_search_ordering_file, -O
This option provides the means to define a mapping between ports and dimension (order) for controlling Dimension Order Routing (DOR). Moreover, this option provides the means to define a non-default routing port order.

--dimn_ports_file, -O (DEPRECATED)
This option provides the means to define a mapping between ports and dimension (order) for controlling Dimension Order Routing (DOR).
--part_enforce, -Z [both, in, out, off]
This option indicates the partition enforcement type (for switches). Enforcement type can be outbound only (out), inbound only (in), both, or disabled (off). Default is both.

--allow_both_pkeys, -W
This option indicates whether both full and limited membership on the same partition can be configured in the PKeyTable. Default is not to allow both pkeys.

--qos, -Q
This option enables QoS setup.
--consolidate_ipv6_snm_req
Use a shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.

--consolidate_ipv4_mask
Use a mask for IPv4 multicast group multiplexing per MGID scope and P_Key.

--pid_file
Specifies the file that contains the process ID of the opensm daemon. The default is /var/run/opensm.
Rev 2.2-1.0.1 This option sets the log verbosity level. A flags field must follow the -D option.
opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
• guid2lid – stores the LID range assigned to each GUID

8.2.3 Signaling
When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found. Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.
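The SIGUSR1 behavior is what makes standard log rotation possible. A minimal logrotate stanza along these lines is a sketch only; the rotation schedule and the killall path are assumptions, not part of this manual:

```
/var/log/opensm.log {
    weekly
    rotate 4
    missingok
    postrotate
        # Ask a running opensm to reopen its log file
        /usr/bin/killall -USR1 opensm 2>/dev/null || true
    endscript
}
```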
8.3 osmtest
-s, -M, -t, -l, -v, -V, -vf
When no file is specified, osmtest defaults to the file osmtest.dat.
-vf verbosity flags:
0x10 - FUNCS (function entry/exit, very high volume)
0x20 - FRAMES (dumps all SMP and GMP frames)
0x40 - ROUTING (dump FDB routing information)
0x80 - currently unused
Without -vf, osmtest defaults to ERROR + INFO (0x3).
Specifying -vf 0 disables all messages.
Specifying -vf 0xFF enables all messages (see -V).
High verbosity levels may require increasing the transaction timeout with the -t option.

-h, --help
Display this usage info then exit.
where:
PartitionName - string that will be used with logging. When omitted, an empty string will be used.
PKey - P_Key value for this partition. Only the low 15 bits will be used. When omitted, the P_Key will be autogenerated.
flag - used to indicate the IPoIB capability of this partition.
defmember=full|limited - specifies the default membership for the port GUID list. Default is limited.
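Putting these fields together, a partition definition in the style used elsewhere in this manual might look like the following sketch (partition names, P_Key values, and port GUIDs are hypothetical):

```
# Default partition, IPoIB-capable, all ports full members
Default=0x7fff, ipoib : ALL=full;
# A storage partition restricted to two (hypothetical) port GUIDs
Storage=0x8001, ipoib, defmember=limited : 0x0002c90300001234, 0x0002c90300005678;
```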
8.5 Routing Algorithms

OpenSM offers six routing engines:
1. “Min Hop Algorithm” - based on the minimum hops to each node where the path length is optimized.
2. “UPDN Algorithm” - based on the minimum hops to each node, but constrained by ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
The BFS tracks link direction (up or down) and avoids steps that go up after a down step has been taken.
2. Once MinHop matrices exist, each switch is visited and, for each target LID, a decision is made as to which port should be used to get to that LID. This step is common to standard and Up/Down routing. Each port has a counter counting the number of target LIDs going through it.
8.5.3 UPDN Algorithm

The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree, and one of its loops may experience a deadlock (due, for example, to high pressure).
8.5.3.1 UPDN Algorithm Usage

Activation through OpenSM
• Use the '-R updn' option (instead of the old '-u') to activate the UPDN algorithm.
• Use '-a ' to add an UPDN guid file that contains the root nodes for ranking. If the '-a' option is not used, OpenSM uses its auto-detect root nodes algorithm.

Notes on the guid list file:
• A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
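A root-node guid file in the format described above might look like the sketch below; the file path and GUID values are hypothetical. It would then be passed to OpenSM on the command line, e.g. opensm -R updn -a /etc/opensm/root_guids.conf:

```
# /etc/opensm/root_guids.conf - one root-node GUID per line
0x0002c902004685e8
0x0002c902004685ec
```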
Rev 2.2-1.0.1 OpenSM – Subnet Manager root list is provided, the closer the topology to a pure and symmetrical fat-tree, the more optimal the routing will be. The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump) in the same directory where the OpenSM log resides. This ordering file provides the CN order that may be used to create efficient communication pattern, that will match the routing tables. 8.5.4.
When computing the routing function, LASH analyzes the network topology for the shortest-path routes between all pairs of sources/destinations and groups these paths into virtual layers in such a way as to avoid deadlock. LASH analyzes routes and ensures deadlock freedom between switch pairs. The links between an HCA and a switch do not need virtual layers, as deadlock will not arise between a switch and an HCA. In more detail, the algorithm works as follows: 1.
8.5.6 DOR Routing Algorithm

The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension.
Rev 2.2-1.0.1 Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2 QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus. Torus-2 QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure. For example, consider the 2D 6x5 torus below, where switches are denoted by [+a-zA-Z]: For a pristine fabric the path from S to D would be S-n-T-r-D.
because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T. Extending this argument shows that in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition they are adjacent in the last dimension routed by DOR.
Rev 2.2-1.0.1 not arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, where "x" is the root switch and each "+" is a non-root switch: For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn.
Rev 2.2-1.0.1 OpenSM – Subnet Manager Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric.
occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.

8.5.7.4 Quality Of Service Configuration

OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q.
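As a sketch, the equivalent settings can also be placed in the OpenSM options file rather than on the command line. The keyword names below follow common OpenSM configurations and are assumptions here, not taken from this section:

```
# Select the torus-2QoS unicast routing engine
routing_engine torus-2QoS
# Enable QoS setup (equivalent to invoking opensm with -Q)
qos TRUE
```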
Rev 2.2-1.0.1 OpenSM – Subnet Manager 8.5.7.6 Torus-2QoS Configuration File Syntax The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.
Rev 2.2-1.0.1 eter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed. next_seed If any of the switches used to specify a seed were to fail torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification. For maximum resiliency, no seed specification should share a switch with any other seed specification.
8.5.8 Routing Chains

Because it is not possible to configure each part of the fabric to be routed using a different routing engine, a fabric can normally be routed by only one routing engine at a time. The routing chains feature offers a solution that enables one to configure different parts of the fabric and define a different routing engine to route each of them.
Port Group Qualifiers

Unlike the port group's beginning and ending, which do not require a colon, all qualifiers must end with a colon (':'). Also, a colon is a predefined mark that must not be used inside qualifier values. Including a colon in the name or the use of a port group will result in the policy's failure.

Table 19 - Port Group Qualifiers

Parameter: name
Description: Each group must have a name. Without a name qualifier, the policy fails.
Rule Qualifier

There are several qualifiers used to describe a rule that determines which ports will be added to the group. Each port group policy must contain exactly one rule qualifier (if no rules exist, no ports can be chosen; more than one rule, on the other hand, will cause a conflict).

Table 20 - Port Group Qualifiers

Parameter: guid list
Description: Comma-separated list of GUIDs to include in the group.
Rev 2.2-1.0.1 Table 20 - Port Group Qualifiers Parameter Description Example port name One can configure a list of hostnames as a rule. Hosts with a node description that is built out of these hostnames will be chosen. Since the node description contains the network card index as well, one might also specify a network card index and a physical port to be chosen. For example, the given configuration will cause only physical port 2 of a host with the node description ‘kuku HCA-1’ to be chosen.
Predefined Port Groups

There are three predefined, automatically created port groups that are available for use, yet cannot be defined in the policy file (if a group in the policy is configured with the name of one of these predefined groups, the policy fails):
• ALL - a group that includes all nodes in the fabric
• ALL_SWITCHES - a group that includes all switches in the fabric
• ALL_CAS - a group that includes all HCAs in the fabric
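A minimal port-group definition in the style described above might look like the following sketch; the group name and GUIDs are hypothetical, and the exact qualifier set should be checked against the full policy-file reference:

```
port-group
name: storage-nodes
guid list: 0x0002c90300001234, 0x0002c90300005678
end-port-group
```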
For example:
topology
…topology qualifiers…
end-topology

Topology Qualifiers

Unlike topology and end-topology, which do not require a colon, all qualifiers must end with a colon (':'). Also, a colon is a predefined mark that must not be used inside qualifier values. An inclusion of a colon in the qualifier values will result in the policy's failure. All topology qualifiers are mandatory. Absence of any of the below qualifiers will cause the policy parsing to fail.
First Routing Engine in the Chain

The first unicast engine in a routing chain must include all switches and HCAs in the fabric (topology id must be 0). The path-bit parameter value is path-bit 0 and it cannot be changed.

Configuring a Routing Chains Policy

The routing chains policy file details the routing engines (and their fallback engines) used for the fabric's routing.
Rev 2.2-1.0.1 Table 22 - Routing Engine Qualifiers Parameter topology Description Define the topology that this engine uses. • • • fallback-to • • • path-bit topology: 1 Legal value – id of an existing topology that is defined in topologies policy (or zero that represents the entire fabric and not a specific topology). Default value – If unspecified, a routing engine will relate to the entire fabric (as if topology zero was defined).
Rev 2.2-1.0.1 OpenSM – Subnet Manager sl2vl and mcfdbs files are dumped only once for the entire fabric and NOT by every routing engine. • • Each engine concatenates its ID and routing algorithm name in its dump files names, as follows: • opensm-lid-matrix.2.minhop.dump • opensm.fdbs.3.ftree • opensm-subnet.4.updn.lst In case that a fallback routing engine is used, both the routing engine that failed and the fallback engine that replaces it, dump their data.
Rev 2.2-1.0.1 Figure 6: QoS Manager There are two ways to define QoS policy: • Advanced – the advanced policy file syntax provides the administrator various ways to match a PathRecord/MultiPathRecord (PR/MPR) request, and to enforce various QoS constraints on the requested PR/MPR • 8.6.
• Rate limit
• PKey
• Packet lifetime

When a path search is performed, it is done with regard to the restrictions that these QoS Level parameters impose. One QoS Level that is mandatory to define is the DEFAULT QoS Level. It is applied to a PR/MPR query that does not match any existing match rule. Similar to any other QoS Level, it can also be explicitly referred to by any match rule.
Rev 2.2-1.0.1 • Having a QoS Level named "DEFAULT" is a must - it is applied to PR/MPR requests that didn't match any of the matching rules. • Any section/subsection of the policy file is optional. 8.6.5 Examples of Advanced Policy File As mentioned earlier, any section of the policy file is optional, and the only mandatory part of the policy file is a default QoS Level.
Rev 2.2-1.0.1 OpenSM – Subnet Manager end-port-group # using partitions defined in the partition policy port-group name: Partitions partition: Part1 pkey: 0x1234 end-port-group # using node types: CA, ROUTER, SWITCH, SELF (for node that runs SM) # or ALL (for all the nodes in the subnet) port-group name: CAs and SM node-type: CA, SELF end-port-group end-port-groups qos-setup # This section of the policy file describes how to set up SL2VL and VL # Arbitration tables on various nodes in the fabric.
end-qos-levels
# Match rules are scanned in order of their appearance in the policy file.
# First matched rule takes precedence.
end-qos-ulps

Similar to the advanced policy definition, matching of PR/MPR queries is done in order of appearance in the QoS policy file, such that the first match takes precedence, except for the "default" rule, which is applied only if the query did not match any other rule. All other sections of the QoS policy file take precedence over the qos-ulps section.
Rev 2.2-1.0.1 OpenSM – Subnet Manager Note that any of the above ULPs might contain target port GUID in the PR query, so in order for these queries not to be recognized by the QoS manager as SRP, the SRP match rule (or any match rule that refers to the target port guid only) should be placed at the end of the qos-ulps match rules. 8.6.6.5 MPI SL for MPI is manually configured by MPI admin.
Note that the same VLs may be listed multiple times in the High or Low priority arbitration tables and, further, a VL can be listed in both tables. The limit of the high-priority VLArb table (qos__high_limit) indicates the number of high-priority packets that can be transmitted without an opportunity to send a low-priority packet. Specifically, the number of bytes that can be sent is high_limit times 4K bytes. A high_limit value of 255 indicates that the byte limit is unbounded.
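The byte budget implied by high_limit is easy to compute; for example (the value is chosen arbitrarily for illustration):

```shell
# With a high_limit of 4, up to 4 * 4K bytes of high-priority
# traffic may be sent before a low-priority send opportunity.
high_limit=4
max_high_bytes=$(( high_limit * 4096 ))
echo "$max_high_bytes"
```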
Rev 2.2-1.0.1 OpenSM – Subnet Manager Figure 7: Example QoS Deployment on InfiniBand Subnet 8.7 QoS Configuration Examples The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignment and their administration via OpenSM configuration files. 8.7.
Rev 2.2-1.0.1 default :0 # default SL (for MPI) any, target-port-guid OST1,OST2,OST3,OST4:1 # SL for Lustre OST any, target-port-guid MDS1,MDS2 :2 # SL for Lustre MDS end-qos-ulps • OpenSM options file qos_max_vls 8 qos_high_limit 0 qos_vlarb_high 2:1 qos_vlarb_low 0:96,1:224 qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15 8.7.
Rev 2.2-1.0.1 OpenSM – Subnet Manager qos_max_vls 8 qos_high_limit 0 qos_vlarb_high 1:32,2:32 qos_vlarb_low 0:1, qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15 8.7.3 EDC (3-tier): IPoIB, RDS, SRP The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.
Rev 2.2-1.0.1 qos_max_vls 8 qos_high_limit 0 qos_vlarb_high 1:32,2:96,3:96,4:96 qos_vlarb_low 0:1 qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15 • Partition configuration file Default=0x7fff, ipoib : ALL=full; PartA=0x8001, sl=1, ipoib : ALL=full; 8.8 Adaptive Routing 8.8.1 Overview Adaptive Routing is at beta stage. Adaptive Routing (AR) enables the switch to select the output port based on the port's load. AR supports two routing modes: • Free AR: No constraints on output port selection.
Rev 2.2-1.0.1 8.8.3.1 OpenSM – Subnet Manager Enabling Adaptive Routing To enable Adaptive Routing, perform the following: 1. Create the Subnet Manager options file. Run: opensm -c 2. Add 'armgr' to the 'event_plugin_name' option in the file: # Event plugin name(s) event_plugin_name armgr 3.
8.8.5 Adaptive Routing Manager Options File

The default location of the AR Manager options file is /etc/opensm/ar_mgr.conf. To set an alternative location, please perform the following:
1. Add 'armgr --conf_file' to the 'event_plugin_options' option in the file:
# Options string that would be passed to the plugin(s)
event_plugin_options armgr --conf_file
2.
8.8.5.1 General AR Manager Options

Table 23 - Adaptive Routing Manager Options File

Option / Description / Values:
ENABLE: Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by the AR Manager as a device that does not support AR, the AR Manager will not try to enable AR on this switch.
SWITCH { ; ; ... }
The following are the per-switch options:

Table 24 - Adaptive Routing Manager Per-Switch Options File

Option / Description / Values:
ENABLE: Allows you to enable/disable AR on this switch. If the general ENABLE option value is set to 'false', then this per-switch option is ignored. This option can be changed on the fly. Default: true
AGEING_TIME: Applicable to bounded AR mode only.
Rev 2.2-1.0.1 OpenSM – Subnet Manager 8.9 Congestion Control 8.9.1 Congestion Control Overview Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of Mellanox OFED installation.
To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration. For the full list of CC Manager options with all the default values, see “Configuring Congestion Control Manager” on page 216. For further details on the list of CC Manager options, please refer to the IB spec.
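For example, disabling the manager could look like the sketch below; the file path is an assumption, and only the 'enable' option is taken from this section:

```
# /etc/opensm/ccmgr.conf (path assumed)
# Turn Congestion Control OFF on the fabric; run OpenSM once with
# this setting so the CC Manager deactivates CC on switches and HCAs.
enable FALSE
```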
• When the number of send/receive errors or timeouts exceeds 'max_errors' in less than 'error_window' seconds, the CC Manager will abort and will allow OpenSM to proceed. To do so, set the following parameters:
max_errors
error_window
• The values are:
max_errors = 0: zero tolerance - abort configuration on first error
error_window = 0: mechanism disabled - no error checking [0-48K]
• The default is: 5
Table 27 - Congestion Control Manager CA Options File

Option / Description / Values:
ca_control_map: An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified. Values: 0xffff
ccti_increase: Sets the CC Table Index (CCTI) increase. Default: 1
trigger_threshold: Sets the trigger threshold. Default: 2
ccti_min: Sets the CC Table Index (CCTI) minimum. Default: 0
cct: Sets all the CC table entries to a specified value.
Rev 2.2-1.0.1 9 InfiniBand Fabric Utilities InfiniBand Fabric Utilities This section first describes common configuration, interface, and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves including: operation, synopsis and options descriptions, error codes, and examples. 9.1 Common Configuration, Interface and Addressing Topology File (Optional) An InfiniBand fabric is composed of switches and channel adapter (HCA/TCA) devices.
Rev 2.2-1.0.1 The following addressing modes can be used to define the IB ports: • Using a Directed Route to the destination: (Tool option ‘-d’) This option defines a directed route of output port numbers from the local port to the destination. • Using port LIDs: (Tool option ‘-l’): In this mode, the source and destination ports are defined by means of their LIDs. If the fabric is configured to allow multiple LIDs per port, then using any of them is valid for defining a port.
Rev 2.2-1.0.1 InfiniBand Fabric Utilities -g|--guid --vlr -r|--routing -u|--fat_tree -o|--output_path --skip --skip_plugin --pc -P|--counter <=> --pm_pause_time --ber_test --ber_use_data --ber_thresh --extended_speeds --pm_per_lane --ls <2.
Rev 2.2-1.0.1 -V|--version -h|--help -H|--deep_help Prints the version of the tool. Prints help information (without plugins help if exists). Prints deep help information (including plugins help). Output Files Table 29 lists the ibdiagnet output files that are placed under /var/tmp/ibdiagnet2. Table 29 - ibdiagnet (of ibutils2) Output Files Output File Description ibdiagnet2.lst Fabric links in LST format ibdiagnet2.sm Subnet Manager ibdiagnet2.pm Ports Counters ibdiagnet2.
9.4.2 ibdiagnet (of ibutils) - IB Net Diagnostic

Please note that this ibdiagnet is an obsolete package. We recommend using ibdiagnet from ibutils2. This version of ibdiagnet is included in the ibutils package, and it is not run by default after installing Mellanox OFED. To use this ibdiagnet version and not that of the ibutils2 package, you need to specify the full path: /opt/ibutils/bin

9.4.3 ibdiagpath - IB Diagnostic Path

ibdiagpath is located at: /opt/ibutils/bin.
Rev 2.2-1.0.1 InfiniBand Fabric Utilities Options -n <[src-name,]dst-name> -l <[src-lid,]dst-lid> -d -c -v -t -s -i -p -o -lw <1x|4x|12x> -ls <2.
Rev 2.2-1.0.1 9.4.4 ibstat ibstat is a binary which displays basic information obtained from the local IB driver. Output includes LID, SMLID, port state, link width active, and port physical state. Synopsis ibstat [-d(ebug)] [-l(ist_of_cas)] [-s(hort)] [-p(ort_list)] [-V(ersion)] [-h] [portnum] Options The table below lists the various flags of the command. Most OpenIB diagnostics take the following common flags.
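Since ibstat requires live hardware, the sketch below runs against saved sample output instead (the sample text is illustrative, not captured from a real adapter) to show how fields such as the logical port State can be pulled out in scripts:

```shell
# Parse saved ibstat-style output rather than querying hardware.
cat > /tmp/ibstat_sample.txt <<'EOF'
CA 'mlx4_0'
        Port 1:
                State: Active
                Physical state: LinkUp
                Base lid: 3
EOF
# Extract the logical port state (the "State:" line, not "Physical state:")
state=$(awk -F': ' '/State:/ && !/Physical/ {print $2}' /tmp/ibstat_sample.txt)
echo "$state"
```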
Synopsis
ibtracert [-d(ebug)] [-v(erbose)] [-D(irect)] [-L(id)] [-e(rrors)] [-u(sage)] [-G(uids)] [-f(orce)] [-n(o_info)] [-m mlid] [-s smlid] [-C ca_name] [-P ca_port] [-t(imeout) timeout_ms] [-V(ersion)] [--node-name-map ] [-h(elp)] [ [ []]]

Options
The table below lists the various flags of the command. Most OpenIB diagnostics take the following common flags.
• Multicast example: show the multicast path of mlid 0xc000 between lids 4 and 16
ibtracert -m 0xc000 4 16

9.4.6 ibqueryerrors

The default behavior is to report the port error counters which exceed a threshold for each port in the fabric. The default threshold is zero (0). Error fields can also be suppressed entirely.
Table 33 - ibqueryerrors Flags and Options

--switch: Prints data for switches only.
--ca: Prints data for CAs only.
--router: Prints data for routers only.
--clear-errors, -k: Clear error counters after read. -k and -K can be used together to clear both errors and counters.
--clear-counts, -K: Clear data counters after read. CAUTION: clearing data counters will occur regardless of whether they are printed or not.
Rev 2.2-1.0.1 Exit Status If a failure to scan the fabric occurs return -1. If the scan succeeds without errors beyond thresholds return 0. If errors are found on ports beyond thresholds return 1. Files /etc/infiniband-diags/error_threshold Define threshold values for errors. File format is simple "name=val". Comments begin with ’#’ Example: # Define thresholds for error counters SymbolErrorCounter=10 LinkErrorRecoveryCounter=10 VL15Dropped=100 9.4.
Table 34 - iblinkinfo Flags and Options

--load-cache: Loads and uses the cached ibnetdiscover data stored in the specified filename. May be useful for outputting and learning about other fabrics or a previous state of a fabric. Cannot be used if the user specifies a direct route path. See ibnetdiscover for information on caching ibnetdiscover output.

9.4.8 saquery
Rev 2.2-1.0.1 Table 35 - saquery Flags and Options Flags Description -N Gets NodeRecord info. --list | -D Gets NodeDescriptions of CAs only. -S Gets ServiceRecord info. -I Gets InformInfoRecord (subscription) info.
Table 35 - saquery Flags and Options

--node-name-map: Specifies a node name map. The node name map file maps GUIDs to more user-friendly names. See ibnetdiscover(8) for the node name map file format. Only used with the -O and -U options.
Rev 2.2-1.0.1 Table 36 - smpdump Flags and Options Flags Description -d Raises the IB debugging level. Can be used several times (-ddd or -d -d -d). -e Shows send and receive errors (timeouts and others) -h Shows the usage message -v Increases the application verbosity level. Can be used several times (-vv or -v -v -v) -V Shows the version info. Addressing Flags Description -D Uses directed path address arguments. The path is a comma separated list of out ports.
Rev 2.2-1.0.1 InfiniBand Fabric Utilities 9.4.10 ibv_devices Lists InfiniBand devices available for use from userspace, including node GUIDs. Synopsis ibv_devices Examples 1. List the names of all available InfiniBand devices. > ibv_devices device -----mthca0 mlx4_0 node GUID ---------------0002c9000101d150 0000000000073895 9.4.11 ibv_devinfo Queries InfiniBand devices and prints about them information that is available for use from userspace.
Rev 2.2-1.0.1 mthca0 mlx4_0 2. Query the device mlx4_0 and print user-available information for its Port 2. > ibv_devinfo -d mlx4_0 -i 2 hca_id: mlx4_0 fw_ver: node_guid: sys_image_guid: vendor_id: vendor_part_id: hw_ver: board_id: phys_port_cnt: port: 2 state: max_mtu: active_mtu: sm_lid: port_lid: port_lmc: 2.5.944 0000:0000:0007:3895 0000:0000:0007:3898 0x02c9 25418 0xA0 MT_04A0140005 2 PORT_ACTIVE (4) 2048 (4) 2048 (4) 1 1 0x00 9.4.
Rev 2.2-1.0.1 InfiniBand Fabric Utilities mlx4_1 (MT26448 - MT1023X00777) Hawk Dual Port fw 2.7.9400 port 2 (DOWN ) ==> eth3 (Down) sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev mlx4_0 port 1 ==> eth5 (Down) mlx4_0 port 1 ==> ib0 (Down) mlx4_0 port 2 ==> ib1 (Down) mlx4_1 port 1 ==> eth2 (Down) mlx4_1 port 2 ==> eth3 (Down) 9.4.13 ibstatus Displays basic information obtained from the local InfiniBand driver. Output includes LID, SMLID, port state, port physical state, port width and port rate.
Rev 2.2-1.0.1 InfiniBand Fabric Utilities 9.4.14 ibportstate Enables querying the logical (link) and physical port states of an InfiniBand port. It also allows adjusting the link speed that is enabled on any InfiniBand port.
Rev 2.2-1.0.1 In case of multiple channel adapters (CAs) or multiple ports without a CA/port being specified, a port is chosen by the utility according to the following criteria: 1. The first ACTIVE port that is found. 2. If not found, the first port that is UP (physical link state is LinkUp). Examples 1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID – 3 in this case) to obtain additional link information using ibportstate.
Rev 2.2-1.0.1 InfiniBand Fabric Utilities LinkState:.......................Down PhysLinkState:...................Polling LinkWidthSupported:..............1X or 4X LinkWidthEnabled:................1X or 4X LinkWidthActive:.................4X LinkSpeedSupported:..............2.5 Gbps LinkSpeedEnabled:................2.5 Gbps LinkSpeedActive:.................2.5 Gbps 3. Change the speed of a port.
9.4.15 ibroute

Uses SMPs to display the forwarding tables—unicast (LinearForwardingTable or LFT) or multicast (MulticastForwardingTable or MFT)—for the specified switch LID and the optional lid (mlid) range. The default range is all valid entries in the range 1 to FDBTop.

Synopsis
ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s ] \ [-C ] [-P ] [-t ] \ [ [ []]]

Options
Table 39 lists the various flags of the command.
Table 39 - ibroute Flags and Options

Flag / Optional or Mandatory / Description:
-P - Optional - Use the specified port
-t - Optional - Override the default timeout for the solicited MADs [msec]
 - Optional - Destination’s directed path, LID, or GUID
 - Optional - Starting LID in an MLID range
 - Optional - Ending LID in an MLID range

Examples
1.
Rev 2.2-1.0.1 3. Dump all Lids in the range 3 to 7 with valid out ports of the switch with Lid 2.
9.4.16 smpquery

Provides a basic subset of standard SMP queries to query subnet management attributes such as node info, node description, switch info, and port info.

Synopsis
smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s ] [-V] [-C ] [-P ] [-t ] [--node-name-map ] [op params]

Options
Table 40 lists the various flags of the command.
Rev 2.2-1.0.1 Table 40 - smpquery Flags and Options Flag Optional / Mandatory Default (If Not Specified) Description Mandatory Supported operations: nodeinfo nodedesc portinfo [] switchinfo pkeys [] sl2vl [] vlarb [] guids mepi [] Optional Destination’s directed path, LID, or GUID Examples 1. Query PortInfo by LID, with port modifier.
Rev 2.2-1.0.1 InfiniBand Fabric Utilities ProtectBits:.....................0 LMC:.............................0 LinkSpeedActive:.................5.0 Gbps LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps NeighborMTU:.....................2048 SMSL:............................0 VLCap:...........................VL0-7 InitType:........................0x00 VLHighLimit:.....................4 VLArbHighCap:....................8 VLArbLowCap:.....................8 InitReply:.......................0x00 MtuCap:..
Rev 2.2-1.0.1 InboundPartEnf:..................1 OutboundPartEnf:.................1 FilterRawInbound:................1 FilterRawOutbound:...............1 EnhancedPort0:...................0 3. Query NodeInfo by direct route. > smpquery -D nodeinfo 0 # Node info: DR path slid 65535; dlid 65535; 0 BaseVers:........................1 ClassVers:.......................1 NodeType:........................Channel Adapter NumPorts:........................2 SystemGuid:......................0x0002c9030000103b Guid:...
RcvSwRelayErrors:................70
XmtDiscards:.....................488
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................129840354
RcvData:.........................129529906
XmtPkts:.........................1803332
RcvPkts:.........................1799018
3.
Options
Table 42 lists the various switches of the utility, and Table 43 lists its commands.

Table 42 - mstflint Switches (Sheet 1 of 2)

  Switch               Affected/Relevant Commands   Description
  -h                                                Print the help menu
  -hh                                               Print an extended help menu
  -d[evice] <device>   All                          Specify the device to which the Flash is connected.
  -guid <GUID>         burn, sg                     GUID base value.
Table 42 - mstflint Switches (Sheet 2 of 2)

  -qq        burn, query   Run a quick query. When specified, mstflint will not perform full image integrity checks during the query operation. This may shorten execution time when running over slow interfaces (e.g., I2C, MTUSB-1).
  -nofs      burn          Burn image in a non-failsafe manner
  -skip_is   burn          Allow burning the firmware image without updating the invariant sector.
Table 43 - mstflint Commands (Sheet 2 of 2)

  Command              Description
  sg                   Set GUIDs
  ri <file>            Read the firmware image on the Flash into the specified file
  dc <file>            Dump Configuration: Print a firmware configuration file for the given image to the specified output file
  e[rase] <addr>       Erase sector
  rw <addr>            Read one DWORD from Flash
  ww <addr> <data>     Write one DWORD to Flash
  wwne <addr> <data>   Write one DWORD to Flash without sector erase
  wbne
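Putting the switches and commands above together, a typical mstflint session might look like the sketch below. The PCI device address (04:00.0), the file names, and the -i image switch (which appears on the first sheet of the switch table) are assumptions; substitute the values that match your adapter.

```shell
# Device address 04:00.0 and the file names are placeholders.
mstflint -d 04:00.0 query               # query the firmware currently on Flash
mstflint -d 04:00.0 ri orig_fw.bin      # back up the existing image (ri command)
mstflint -d 04:00.0 -i new_fw.bin burn  # burn a new image (failsafe by default)
```

Adding -nofs or -skip_is to the burn line changes the failsafe behavior as described in Table 42 and should only be done when explicitly required.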
9.4.19 ibv_asyncwatch

Displays asynchronous events forwarded to userspace for an InfiniBand device.

Synopsis
ibv_asyncwatch

Examples
1. Display asynchronous events.
> ibv_asyncwatch
mlx4_0: async event FD 4

9.4.20 ibdump

Dumps InfiniBand traffic that flows to and from the InfiniBand ports of Mellanox Technologies ConnectX® family adapters. The dump file can be loaded by the Wireshark tool for graphical traffic analysis.
  -w, --write=<file>       Dump file name (default "sniffer.pcap"). '-' stands for stdout and enables piping to tcpdump or tshark.
  -o, --output=<file>      Alias for the '-w' option. Do not use - kept for backward compatibility.
  -b, --max-burst=<log2>   log2 of the maximal burst size that can be captured with no packet loss.
  -s, --silent
  --mem-mode
  --decap
  -h, --help
  -v, --version
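As noted above, passing '-' to -w sends the capture to stdout, so it can be piped straight into an analyzer. A minimal sketch (the file path and the tshark invocation are assumptions; any pcap-capable reader works):

```shell
# Capture to a file, then open it in Wireshark later:
ibdump -w /tmp/ib_traffic.pcap
# Or stream the capture directly into tshark via stdout:
ibdump -w - | tshark -r -
```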
9.4.22 ibswitches

Traces the InfiniBand subnet topology or uses an already saved topology file to extract the InfiniBand switches.

Synopsis
ibswitches [-h] [<topology-file>]

9.4.23 ibnetsplit

Automatically groups hosts and creates scripts that can be run in order to split the network into sub-networks, each containing one group of hosts.

Synopsis
• Group: ibnetsplit [-v][-h][-g grp-file] -s <.lst|.net|.topo> <-r head-ports|-d max-dist>
• Split: ibnetsplit [-v][-h][-g grp-file] -s <.lst|.net|.topo>
Both stages require a subnet definition file to be provided by the -s flag. The supported formats for subnet definition are:
• *.net - for ibnetdiscover
• *.lst - for opensm-subnet.lst or ibdiagnet.lst
• *.topo - for a topology file

HEAD PORTS FILE
This file is provided by the user and defines the ports by which the grouping of the other host ports is determined.
Format: each line should contain either the name or the GUID of a single port.
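A minimal head-ports file following the format above might look like this; the port name and GUID below are hypothetical examples, not values from a real fabric:

```
H-1/U1/P1
0x0002c903000010f1
```

One entry per line is sufficient; names and GUIDs may be mixed in the same file.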
Operation
CongestionControlTable (CT)

Parameters / Options
  --cckey, -c     CC key
  --config, -z    Use config file, default: /etc/infiniband-diags/ibdiag.conf
  --Ca, -C
  --Port, -P
  --Lid, -L
  --Guid, -G
  --timeout, -t
  --sm_port, -s
  --m_key, -y
  --errors, -e
  --verbose, -v
  --debug, -d
  --help, -h
  --version, -V
Additional Options
  -t|--topology     Topology file [ibdm.topo]. The format is defined in the IBDM user manual.
  -d|--discovered   [subnet.lst] file produced by OpenSM.
  -e|--edge         Start processing from the edges using strict match.
  -s|--start-node   The name of the start node [H-1/U1].
  -p|--port-num     The number of the start port [1].
  -g|--port-guid    The GUID of the start port [none].

9.4.28 ibcongest

Provides static congestion analysis.
  -O <order-file>    Run a simulated annealing and write out the result into the given file.
  -o <order-file>    A file holding a host name on each line, defining their ordering. The order file format is described in the ORDER_FILE section. Default order is by host name sort H- by the N.
  -m
  -R <rep-hsd-file>
  -Q <rep-bw-file>
  -v <lvl>           Verbose mode
  -h
• OpenSM order file style, holding LID and host name:
  ^\s*(0x[0-9a-fA-F]+)\s(\s+)\s+HCA-([0-9]+)
• Port name only: simply declare the order of host names to be used:
  ^\s*(\s+)\s+P?([0-9]+)\s*$
• Dummy place holder:
  ^DUMMY$
  DUMMY stands for an empty rank - just a placeholder, not a real HCA. Traffic from DUMMY or to DUMMY is not generated even if specified in the schedule.

9.4.30 sminfo
Examples:
  sminfo      # show sminfo of SM listed in local portinfo
  sminfo 2    # query SM on port lid 2

9.4.31 ibnetdiscover

Performs InfiniBand subnet discovery and outputs a human-readable topology file. GUIDs, node types, and port numbers are displayed, as well as port LIDs and node descriptions. All nodes (and links) are displayed (full topology). This utility can also be used to list the currently connected nodes. The output is printed to the standard output unless a topology file is specified.
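Since the output is a topology file, a common pattern is to capture the subnet once and feed the saved file to other utilities such as ibswitches (Section 9.4.22). A sketch; the file path is an assumption:

```shell
ibnetdiscover > /tmp/fabric.topo   # full discovery; requires a live subnet
ibswitches /tmp/fabric.topo        # extract the switches from the saved topology
```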
9.4.33 ibsysstat

Uses vendor MADs to validate connectivity between InfiniBand nodes and obtain other information about the InfiniBand node. ibsysstat is run as client/server; the default is to run as a client.
9.4.35 dump_fts

Dumps tables for every switch found in an ibnetdiscover scan of the subnet. The dump file format is compatible with loading into OpenSM using the -R file -U /path/to/dump-file syntax.

Syntax
dump_fts [options] [<startlid> [<endlid>]]

Options
  -a, --all         Show all lids in range, even invalid entries
  -n, --no_dests    Do not try to resolve destinations
  -M, --Multicast   Show multicast forwarding tables. In this case, the range parameters specify the mlid range.
  --node-name-map <map-file>   Specify a node name map. This file maps GUIDs to more user-friendly names. See the FILES section.
  --config, -z <config-file>   Specify an alternate config file. Default: /etc/infiniband-diags/ibdiag.conf

CONFIG FILE
/etc/infiniband-diags/ibdiag.conf
A global config file is provided to set some of the common options for all tools. See the supplied config file for details.
9.5 Performance Utilities

The performance utilities described in this chapter are intended to be used as performance micro-benchmarks.

9.5.1 ib_read_bw
ib_read_bw calculates the BW of RDMA read between a pair of machines. One acts as a server and the other as a client. The client RDMA-reads the server's memory and calculates the BW by sampling the CPU each time it receives a successful completion.
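Like the other perftest utilities, ib_read_bw is started first on the server with no peer argument, then on the client pointing at the server. A sketch; the hostname and the flag choice are assumptions:

```shell
# On the server machine:
ib_read_bw
# On the client machine (server-node is a placeholder hostname):
ib_read_bw --report_gbits server-node
```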
Table 44 - ib_read_bw Flags and Options

  -N, --no peak-bw   Cancel peak-bw calculation (default with peak)
  -o, --outs=<num>   Number of outstanding read/atomic requests (default max of device)
  -O, --dualport     Run test in dual-port mode.
Table 45 - Additional ib_read_bw Flags and Options

  --report-both      Report RX & TX results separately on bidirectional BW tests
  --report_gbits     Report Max/Average BW of the test in Gbit/sec (instead of MB/sec)
  --run_infinitely   Run test forever, print results every <duration> seconds

9.5.2 ib_read_lat
ib_read_lat calculates the latency of an RDMA read operation of message_size between a pair of machines. One acts as a server and the other as a client.
9.5.3 ib_send_bw

Synopsis
Server: ib_send_bw [options]
Client: ib_send_bw [options]

Options
The table below lists the various flags of the command.
Table 48 - ib_send_bw Flags and Options

  -Q, --cq-mod           Generate a CQE only after <--cq-mod> completions
  -r, --rx-depth=<dep>   Rx queue size (default 512).
Rate Limiter
The table below lists the Rate Limiter flags of the command.

Table 50 - Additional Rate Limiter Flags and Options

9.5.4 ib_send_lat
Table 51 - ib_send_lat Flags and Options

  -g, --mcg=<num_of_qps>   Send messages to multicast group with <num_of_qps> qps attached to it.
  -h, --help               Show this help screen.
Table 52 - Additional ib_send_lat Flags and Options

  --output=<units>       Set verbosity output level: bandwidth, message_rate, latency_typical
  --pkey_index=<index>   PKey index to use for QP

9.5.5 ib_write_bw
ib_write_bw calculates the BW of RDMA write between a pair of machines. One acts as a server and the other as a client. The client RDMA-writes to the server's memory and calculates the BW by sampling the CPU each time it receives a successful completion.
Table 53 - ib_write_bw Flags and Options

  -l, --post_list=<list size>   Post list of WQEs of <list size> (instead of a single post)
  -m, --mtu=<mtu>               MTU size: 256 - 4096 (default port MTU)
  -n, --iters=<iters>           Number of exchanges (at least 5, default 5000)
  -N, --no peak-bw              Cancel peak-bw calculation (default with peak)
  -O, --dualport                Run test in dual-port mode.
Table 54 - Additional ib_write_bw Flags and Options

  --pkey_index=<index>   PKey index to use for QP
  --report-both          Report RX & TX results separately on bidirectional BW tests
  --report_gbits         Report Max/Average BW of the test in Gbit/sec (instead of MB/sec)
  --run_infinitely       Run test forever, print results every <duration> seconds

9.5.6 ib_write_lat
ib_write_lat calculates the latency of an RDMA write operation of message_size between a pair of machines.
9.5.7 ib_atomic_bw

Synopsis
Server: ib_atomic_bw [options]
Client: ib_atomic_bw [options]

Options
The table below lists the various flags of the command.
Table 57 - ib_atomic_bw Flags and Options

  -R, --rdma_cm          Connect QPs with rdma_cm and run the test on those QPs
  -S, --sl=<sl>          SL (default 0)
  -t, --tx-depth=<dep>   Size of tx queue (default 128)
  -T, --tos=<tos>        Set <tos> to RDMA-CM QPs. Available only with the -R flag.
9.5.8 ib_atomic_lat

Synopsis
Server: ib_atomic_lat [options]
Client: ib_atomic_lat [options]

Options
The table below lists the various flags of the command.
Table 59 - ib_atomic_lat Flags and Options

  -V, --version             Display version number
  -x, --gid-index=<index>   Test uses GID with GID index (default: IB - no GID, ETH - 0)
  -z, --com_rdma_cm         Communicate with the rdma_cm module to exchange data - use regular QPs

Additional Options
The table below lists the additional flags of the command.

Table 60 - Additional ib_atomic_lat Flags and Options

9.5.9 raw_ethernet_bw
Table 61 - raw_ethernet_bw Flags and Options

  -d, --ib-dev=<dev>       Use IB device <dev> (default first device found)
  -D, --duration           Run test for a customized period of seconds.
  -e, --events             Sleep on CQ events (default poll)
  -f, --margin             Measure results within margins (default = 2 sec)
  -F, --CPU-freq           Do not fail even if the cpufreq_ondemand module is loaded
  -g, --mcg=<num_of_qps>   Send messages to multicast group with <num_of_qps> qps attached to it.
Table 61 - raw_ethernet_bw Flags and Options (continued)

  -V, --version             Display version number
  -w, --limit_bw            Set verifier limit for bandwidth
  -x, --gid-index=<index>   Test uses GID with GID index (default: IB - no GID, ETH - 0)
  -y, --limit_msgrate       Set verifier limit for Msg Rate
  -z, --com_rdma_cm         Communicate with the rdma_cm module to exchange data - use regular QPs

Additional Options
The table below lists the additional flags of the command.
Rate Limiter
The table below lists the Rate Limiter flags of the command.

Table 63 - raw_ethernet_bw Rate Limiter Flags and Options

  --burst_size=<size>    Set the number of messages to send in a burst when using the rate limiter
  --rate_limit=<rate>    Set the maximum rate of sent packets
  --rate_units=<units>   [Mgp] Set the units for the rate limit to MBps (M), Gbps (g), or pps (p)

Raw Ethernet Options
The table below lists the Raw Ethernet flags of the command.
9.5.10 raw_ethernet_lat

raw_ethernet_lat calculates the latency of sending a packet of message_size between a pair of machines. One acts as a server and the other as a client. They perform a ping-pong benchmark in which a packet is sent only after one is received. Each side samples the CPU each time it receives a packet in order to calculate the latency. Using the "-a" flag provides results for all message sizes.
Table 65 - raw_ethernet_lat Flags and Options

  -r, --rx-depth=<dep>   Rx queue size (default 512). If using an SRQ, rx-depth controls the max-wr size of the SRQ
  -R, --rdma_cm          Connect QPs with rdma_cm and run the test on those QPs
  -s, --size=<size>      Size of message to exchange (default 2)
  -S, --sl=<sl>          SL (default 0)
  -T, --tos=<tos>        Set <tos> to RDMA-CM QPs. Available only with the -R flag.
Table 67 - raw_ethernet_lat Raw Ethernet Flags and Options

  -E, --dest_mac    Destination MAC address in the format XX:XX:XX:XX:XX:XX. **MUST** be entered
  -J, --dest_ip     Destination IP address in the format X.X.X.X (used to send packets with an IP header)
  -j, --source_ip   Source IP address in the format X.X.X.X
Appendix A: SRP Target Driver

The SRP Target driver is designed to work directly on top of the OpenFabrics OFED software stack (http://www.openfabrics.org) or the InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the generic SCSI target mid-level driver, SCST (http://scst.sourceforge.net). By interfacing with the SCST driver, it is possible to work with and support many I/O modes on real or virtual devices in the back end:
1. scst_vdisk - fileio and blockio modes.
Example 1: Working with VDISK BLOCKIO mode (using the md0 device, sda, and cciss/c1d0)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
f. echo "add vdisk0 0" > /proc/scsi_tgt/groups/Default/devices
g. echo "add vdisk1 1" > /proc/scsi_tgt/groups/Default/devices
h.
umad1: port 2 of the first HCA
umad2: port 1 of the second HCA
3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered SCSI disks)
Example: Assume that you use port 1 of the first HCA in the system, i.e.
A.3 How-to Unload/Shutdown
1. Unload ib_srpt.
   $ modprobe -r ib_srpt
2. Unload scst and its dev_handlers.
   $ modprobe -r scst_vdisk scst
3. Unload MLNX_OFED kernel modules.
   $ /etc/rc.
Appendix B: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
options mlx4_core parameter=<value>
and/or
options mlx4_ib parameter=<value>
and/or
options mlx4_en parameter=<value>

The following sections list the available mlx4 parameters.

B.1 mlx4_ib Parameters
  sm_guid_assign:   Enable SM alias_GUID assignment if sm_guid_assign > 0 (Default: 1) (int)
  dev_assign_str:   Map device function numbers to IB device numbers (e.g. '0000:04:00.
B.2 mlx4_core Parameters
  log_num_mgm_entry_size:   log mgm size, which defines the number of QPs per MCG; for example, 10 gives 248. Range: 7 <= log_num_mgm_entry_size <= 12.
  high_rate_steer:
  fast_drop:
  enable_64b_cqe_eqe:
  log_num_mac:
  log_num_vlan:
  log_mtts_per_seg:
  port_type_array:
  log_num_qp:
  log_num_srq:
  log_rdmarc_per_qp:
  log_num_cq:
  log_num_mcg:
  log_num_mpt:
  log_num_mtt:
  enable_qos:
  internal_err_reset:

B.3 mlx4_en Parameters
  inline_thold:
  udp_rss:
  pfctx:
  pfcrx:
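The procedure above amounts to appending "options <module> parameter=<value>" lines to /etc/modprobe.conf. The sketch below stages such lines in a temporary file so they can be reviewed before installing; the parameter values shown are illustrative assumptions, not recommendations.

```shell
# Stage mlx4 module options in a temp file before copying to /etc/modprobe.conf.
conf=$(mktemp)
cat > "$conf" <<'EOF'
options mlx4_core log_num_mgm_entry_size=10
options mlx4_ib sm_guid_assign=1
options mlx4_en udp_rss=1
EOF
# Review the staged lines, then install with e.g.: cat "$conf" >> /etc/modprobe.conf
grep -c '^options mlx4_' "$conf"   # prints 3
```

Staging the lines first avoids leaving a half-edited modprobe.conf if the values need a second look.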
Appendix C: mlx5 Module Parameters

The mlx5_ib module supports a single parameter, used to select the profile which defines the number of resources supported. The parameter name for selecting the profile is prof_sel.
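Following the same /etc/modprobe.conf convention as the mlx4 appendix, selecting a profile would look like the fragment below. The value 2 is purely an illustrative assumption; consult the profile descriptions for the value matching your resource needs.

```
options mlx5_ib prof_sel=2
```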
Appendix D: Lustre Compilation over MLNX_OFED

This procedure applies to the RHEL/SLES OSs supported by Lustre. For further information, please refer to the Lustre Release Notes.

To compile Lustre version 2.4.0 and higher:
$ ./configure --with-o2ib=/usr/src/ofa_kernel/default/
$ make rpms

To compile older Lustre versions:
$ EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" .
Appendix E: Using FlexBoot for Booting Various OSs from an iSCSI Target

Below are instructions on how to provision a diskless system (the client) with a fresh SLES11 SP3 or RHEL6.4 installation to remote storage (i.e., a LUN partition on an iSCSI target) and then SAN-boot (iSCSI boot) the client using the Mellanox PXE boot agent (FlexBoot). The iSCSI configuration in this document is very basic (no CHAP authentication, no multipath I/O) and demonstrates basic PXE SAN boot capability.
Step 3. Create the IQN in the ietd configuration file.

Target iqn.2013-10.qalab.com:sqa030.prt9
        Lun 0 Path=/dev/cciss/c0d0p9,Type=fileio,IOMode=wb
        MaxConnections 1    # Number of connections/session.
E.2 Configuring the DHCP Server

Edit a host declaration for your PXE client in the DHCP configuration file, serving it with pxelinux.0, and restart your DHCP server. "pxelinux.0" is part of the "syslinux" RPM (/usr/share/syslinux/pxelinux.0). To install the syslinux RPM:
• on RedHat, run "yum install syslinux"
• on SUSE, run "zypper install syslinux"
To get the latest version of syslinux, please refer to http://www.syslinux.
If the ISO method above is not used, two different PXE server configurations (PXELINUX boot labels) are required for each phase discussed herein (booting the installer and post-installation boot).
• For booting the installer program off the TFTP server, provide the client a path to the initrd and Linux kernel as provided inside SLES11SP3-kISO-VPI/pxeboot-install/ in the tgz above. Below is an example of such a label.
LABEL SLES11.
Step 2. Reboot the client and invoke PXE boot with the Mellanox boot agent.
Step 3. Select the "Install SLES11.3" boot option from the menu (see the pxelinux.cfg example above). After about 30 seconds, the SLES installer will issue the notification below due to the PXELINUX boot label used above.
Step 4. Click OK.
Step 5. Click the Configure iSCSI Disks button.
Step 6. Choose the Connected Targets tab.
Step 7. Click Add.
Step 8. Enter the IP address of the iSCSI storage target.
Step 9. Click Next.
Step 10. Select the relevant target from the table (in our example, only one target exists, so only one was discovered).
Step 11. Click Connect.
Step 12. Select onboot from the drop-down list.
Step 13. Click Next to exit the discovery screen.
Step 14. Go to the Connected Targets tab again to confirm the iSCSI connection with the target.
Step 15. Click OK.
Step 16. Click Next back at the Disk Activation screen.
Step 17. Select New Installation.
Step 18. Click Next.
Step 19. Complete the Clock and Time Zone configuration.
Step 20. Select Physical Machine.
Step 21. Click Next.
Step 22. Click Install. Make sure the "open-iscsi" RPM is selected for the installation under "Software". After the installation is completed, the system will reboot. Make sure you choose the "SLES11.3x64_iscsi_boot" label from the boot menu (see Section E.3, "Configuring the PXE Server", on page 303).
Step 23. Complete the post-installation configuration steps. It is recommended to download and install the latest version of MLNX_OFED_LINUX available from http://www.mellanox.com/page/products_dyn?product_family=26&mtag=linux_sw_drivers

E.4.1 Using PXE Boot Services for Booting the SLES11 SP3 from the iSCSI Target

Once the installation is completed, the system will reboot. At this stage, the client is expected to perform another PXE network boot with FlexBoot®.
Step 6. In the Advanced Storage Options window, perform the following:
   Step a. Select the Add iSCSI Target option.
   Step b. Check the Bind targets to network interfaces checkbox.
   Step c. Click the +Add drive button.
Step 7. Enter the IP address of the iSCSI target. Optionally, you may choose to enter a customized Initiator Name and select the necessary CHAP authentication of choice. Please refer to Section E.5.
Step 9. Click Login. A successful login is mandatory to proceed. A failure at this stage is most likely the result of a target or network configuration error; recovery/troubleshooting is out of the scope of this document.
Step 10. Make sure a new storage LUN appears in the Other SAN Devices tab. A successful LUN discovery is mandatory to proceed.
Step 13. Select the Use All Space option.
Step 14. Click Next and proceed with the installation.
Step 15. Select the Basic Server option. This is only one of the options that can be chosen, not a mandatory one.
Step 16. Check the Customize Now checkbox.
Step 17. Click Next.
Step 18. Select Infiniband Support and iSCSI Storage Client.
Step 19. Click Next. Allow the installation to reach completion.

E.5.1 SAN-Booting the Diskless Client with FlexBoot

When the installation process is completed, the client will ask to reboot. At that point, the DHCP server configuration for that client needs to be changed so that when it PXE-boots again, it will get the root-path IQN and LUN information from the DHCP server. For further information, please refer to the section DHCP Configuration for iSCSI Boot with FlexBoot (PXE SAN Boot).
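The root-path option mentioned above follows the iSCSI boot convention "iscsi:<server>:<protocol>:<port>:<LUN>:<target IQN>". Reusing the target IQN and addresses from the examples in this appendix, the client's host declaration could carry a line like the sketch below; the field values must of course match your own target.

```
option root-path "iscsi:12.7.6.30::3260:0:iqn.2013-10.qalab.com:sqa030.prt9";
```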
Step 2. Configure the Initiator.
[root@sqa070 ~]# vim /etc/iscsi/iscsid.conf
node.startup = automatic
## Optional: for CHAP authentication, uncomment the following lines
#discovery.sendtargets.auth.authmethod = CHAP
#discovery.sendtargets.auth.username = joe
#discovery.sendtargets.auth.password = secret
#node.session.auth.authmethod = CHAP
#node.session.auth.username = jack
#node.session.auth.password = 12charsecret
node.session.timeo.replacement_timeout = 120
node.conn[0].timeo.
Step 6. Verify the remote partition appears to the initiator as a local HDD.
[root@sqa070 ~]# fdisk -l
Disk /dev/sda: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000518f2
   Device Boot   Start   End   Blocks
/dev/sda1   *    1       131   1048576
Partition 1 does not end on cylinder boundary.
subnet 12.7.0.0 netmask 255.255.0.0 {
    option dhcp-server-identifier 12.7.6.30 ;
    option domain-name "pxe030.mtl.com" ;
    option domain-name-servers 12.7.6.30 ;
    default-lease-time 86400 ;    # 1 day
    max-lease-time 86400 ;
    option ntp-servers 12.7.6.30 ;
}
host sqa070 {
    fixed-address 12.7.6.70 ;
    hardware ethernet 00:02:c9:32:e8:80 ;
    next-server 12.7.6.30 ;
    if option client-system-architecture = 00:00 {
        filename "pxelinux.0" ;
    }
}

E.6.3 pxelinux.
For more information, visit http://www.ipxe.org/download
[root@sqa030 ~]# git clone git://git.ipxe.org/ipxe.git
Step 3. Edit a command file named sanbootnchap.ipxe (the name is given as an example, whereas the .ipxe file extension is mandatory) with the following lines. Make sure to enter your own values for username and password per your CHAP configuration. For reasons of simplicity, and coherence with the examples in this document, we gave our CHAP the username joe and the password secret.
For CHAP users: all the CHAP authentication lines mentioned as comments in the iSCSI target and initiator configuration examples in Section E.1, "Configuring the iSCSI Target Machine", on page 301 and Section E.6.