Mellanox OFED for Linux User Manual Rev 2.1-1.0.6 Last Updated: 18 March, 2014 www.mellanox.
Table of Contents
Table of Contents . . . 3
List of Figures . . . 9
List of Tables . . . 10
Document Revision History
3.1 Persistent Naming for Network Interfaces . . . 45
Chapter 4 Driver Features . . . 46
4.1 SCSI RDMA Protocol . . . 46
4.1.1 Overview . . . 46
4.1.
4.12.2 Flow Domains and Priorities . . . 88
4.13 Single Root IO Virtualization (SR-IOV) . . . 91
4.13.1 System Requirements
4.13.2 Setting Up SR-IOV
4.13.3
4.13.4
4.13.5
4.13.6
4.13.7
5.5.2 FCA Runtime Parameters . . . 123
5.5.3 Various Executable Examples . . . 124
Chapter 6 Working With VPI . . . 125
6.1 Port Type Management . . . 125
6.2 Auto Sensing
8.6.5 Examples of Advanced Policy File . . . 170
8.6.6 Simple QoS Policy - Details and Examples . . . 173
8.6.7 SL2VL Mapping and VL Arbitration . . . 175
8.6.8 Deployment Example . . . 176
8.7 QoS Configuration Examples
A.4 Preparing the DHCP Server in Linux Environment
A.5 Subnet Manager – OpenSM
A.6 BIOS Configuration
A.7 Operation
A.8 Diskless Machines
A.9
List of Figures
Figure 1: Mellanox OFED Stack for ConnectX® Family Adapter Cards . . . 20
Figure 2: I/O Consolidation Over InfiniBand . . . 63
Figure 3: An Example of a Virtual Network . . . 84
Figure 4: QoS Manager
List of Tables
Table 1: Document Revision History . . . 12
Table 2: Abbreviations and Acronyms . . . 14
Table 3: Glossary . . . 15
Table 4: Reference Documents
Table 36: ibportstate Flags and Options . . . 206
Table 37: smpquery Flags and Options . . . 209
Table 38: perfquery Flags and Options . . . 212
Table 39: ibcheckerrs Flags and Options
Document Revision History
Table 1 - Document Revision History
Release: 2.1-1.0.6
Date: March 18, 2014
Description:
• Updated the following sections:
  • Section 4.2, “iSCSI Extensions for RDMA (iSER)”, on page 55 - Removed the note regarding iSER being at beta level
  • Section 9.17, “ibdump”, on page 221 - Added clarification on how to use ibdump with RoCE
Release: 2.1-1.0.0
Date: February 18, 2014
Description:
• Updated the following sections:
  • Section 2.3.3, “Installation Procedure”, on page 31
  • Section 4.13.
Table 1 - Document Revision History
Release  Date  Description
• Appendix C.2, “mlx4_core Parameters”, on page 239
• Section 4.1.2.2, “Manually Establishing an SRP Connection”, on page 43
• Section 4.1.2.3, “SRP Tools - ibsrpdm, srp_daemon and srpd Service Script”, on page 45
• Section 4.1.2.4, “Automatic Discovery and Connection to Targets”, on page 47
• Section 4.1.2.5, “Multiple Connections from Initiator InfiniBand Port to the Target”, on page 48
• Section 4.1.2.
Rev 2.1-1.0.6 About this Manual This Preface provides general information concerning the scope and organization of this User’s Manual. Intended Audience This manual is intended for system administrators responsible for the installation, configuration, management and maintenance of the software and hardware of VPI (InfiniBand, Ethernet) adapter cards. It is also intended for application developers.
Table 3 - Glossary (Sheet 2 of 2)
Local Port: The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager: The Subnet Manager that is authoritative, that has the reference configuration information for the subnet. See Subnet Manager.
Multicast Forwarding Tables: A table that exists in every switch, providing the list of ports to which received multicast packets are forwarded. The table is organized by MLID.
Rev 2.1-1.0.6 Related Documentation Table 4 - Reference Documents Document Name Description InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 The InfiniBand Architecture Specification that is provided by IBTA IEEE Std 802.3ae™-2002 (Amendment to IEEE Std 802.
1 Mellanox OFED Overview
1.1 Introduction to Mellanox OFED
Mellanox OFED is a single Virtual Protocol Interconnect (VPI) software stack which operates across all Mellanox network adapter solutions supporting 10, 20, 40 and 56 Gb/s InfiniBand (IB); 10, 40 and 56 Gb/s Ethernet; and 2.5 or 5.0 GT/s PCI Express 2.0 and 8 GT/s PCI Express 3.0 uplinks to servers.
  • mlx4_en (Ethernet)
• Mid-layer core
  • Verbs, MADs, SA, CM, CMA, uVerbs, uMADs
• Upper Layer Protocols (ULPs)
  • IPoIB, RDS*, SRP Initiator and SRP
* NOTE: RDS was not tested by Mellanox Technologies.
1.3 Architecture
Figure 1 shows a diagram of the Mellanox OFED stack, and how upper layer protocols (ULPs) interface with the hardware and with the kernel and user space. The application level also shows the versatility of markets that Mellanox OFED applies to.
Figure 1: Mellanox OFED Stack for ConnectX® Family Adapter Cards
The following sub-sections briefly describe the various components of the Mellanox OFED stack.
1.3.
mlx4_en
A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet specific functions and plugs into the netdev mid-layer
1.3.2 mlx5 Driver
mlx5 is the low level driver implementation for the Connect-IB™ adapters designed by Mellanox Technologies. Connect-IB™ operates as an InfiniBand adapter. The mlx5 driver comprises the following kernel modules:
mlx5_core
Acts as a library of common functions (e.g.
Rev 2.1-1.0.6 Mellanox OFED Overview MLX5_SCATTER_TO_CQE • Small buffers are scattered to the completion queue entry and manipulated by the driver. Valid for RC transport. • Default is 1, otherwise disabled. 1.3.3 Mid-layer Core Core services include: management interface (MAD), connection manager (CM) interface, and Subnet Administrator (SA) interface. The stack includes components for both user-mode and kernel applications.
Rev 2.1-1.0.6 1.3.5 MPI Message Passing Interface (MPI) is a library specification that enables the development of parallel software libraries to utilize parallel computers, clusters, and heterogeneous networks.
Rev 2.1-1.0.6 Mellanox OFED Overview This tool burns a firmware binary image to the EEPROM(s) attached to an InfiniScaleIII® switch device. It includes query functions to the burnt firmware image and to the binary image file. The tool accesses the EEPROM and/or switch device via an I2C-compatible interface or via vendor-specific MADs over the InfiniBand fabric (In-Band tool). • Debug utilities A set of debug utilities (e.g.
Rev 2.1-1.0.6 • GID format can be of 2 types, IPv4 and IPv6. IPv4 GID is a IPv4-mapped IPv6 address1 while IPv6 GID is the IPv6 address itself 1. For the IPv4 address A.B.C.D the corresponding IPv4-mapped IPv6 address is ::ffff.A.B.C.
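The IPv4 GID format described above can be illustrated with a short sketch. It is only an illustration of the IPv4-mapped IPv6 layout (80 zero bits, 16 one bits, then the 32-bit IPv4 address) using Python's standard `ipaddress` module; the function name `ipv4_to_gid` and the sample address are hypothetical and are not part of Mellanox OFED.

```python
import ipaddress


def ipv4_to_gid(ipv4: str) -> ipaddress.IPv6Address:
    """Return the IPv4-mapped IPv6 address for the IPv4 address A.B.C.D."""
    v4 = ipaddress.IPv4Address(ipv4)
    # Layout: 80 zero bits, 16 one bits (0xffff), then the 32-bit IPv4 address.
    return ipaddress.IPv6Address((0xFFFF << 32) | int(v4))


# Hypothetical sample address, printed as the 16 raw GID bytes in hex.
gid = ipv4_to_gid("192.168.1.10")
print(gid.packed.hex())
```

An IPv6 GID needs no such conversion, since it is the IPv6 address itself.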
Rev 2.1-1.0.6 2 Installation Installation This chapter describes how to install and test the Mellanox OFED for Linux package on a single host machine with Mellanox InfiniBand and/or Ethernet adapter hardware installed. 2.
Rev 2.1-1.0.6 2.3 Installing Mellanox OFED The installation script, mlnxofedinstall, performs the following: 2.3.
Example
The following command will create a MLNX_OFED_LINUX TGZ for RedHat 6.3 under the /tmp directory.
# ./MLNX_OFED_LINUX-2.1-1.0.0-rhel6.3-x86_64/mlnx_add_kernel_support.sh -m /tmp/MLNX_OFED_LINUX-2.1-1.0.0-rhel6.3-x86_64/ --make-tgz
Note: This program will create MLNX_OFED_LINUX TGZ for rhel6.2 under /tmp directory.
All Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Do you want to continue?[y/N]:y
See log file /tmp/mlnx_ofed_iso.21642.
--force-fw-update              Force firmware update
--force                        Force installation
--all|--hpc|--basic|--msm      Install all, hpc, basic or Mellanox Subnet manager packages correspondingly
--vma|--vma-vpi                Install packages required by VMA to support VPI
--vma-eth                      Install packages required by VMA to work over Ethernet
--with-vma                     Set configuration for VMA use (to be used with any installation parameter).
2.3.2.1 mlnxofedinstall Return Codes
Table 2 lists the mlnxofedinstall script return codes and their meanings.
Table 2 - mlnxofedinstall Return Codes
Return Code   Meaning
0             The installation ended successfully
1             The installation failed
2             No firmware was found for the adapter device
22            Invalid parameter
28            Not enough free space
171           Not applicable to this system configuration. This can occur when the required hardware is not present on the system.
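A wrapper script that automates installation may want to turn these return codes into readable messages. The following is a minimal sketch that covers only the codes listed in Table 2 above; the names `RETURN_CODES` and `describe_return_code` are hypothetical helpers, not part of the mlnxofedinstall script.

```python
# Return codes and meanings copied from Table 2 above.
RETURN_CODES = {
    0: "The installation ended successfully",
    1: "The installation failed",
    2: "No firmware was found for the adapter device",
    22: "Invalid parameter",
    28: "Not enough free space",
    171: "Not applicable to this system configuration",
}


def describe_return_code(code: int) -> str:
    """Map an mlnxofedinstall exit status to its documented meaning."""
    # Codes not present in the table are reported as unknown rather than guessed.
    return RETURN_CODES.get(code, "Unknown return code %d" % code)


print(describe_return_code(28))
```

In practice the exit status would come from running the installer (for example via `subprocess.run(...).returncode`), which is omitted here so that the sketch runs on any machine.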
2.3.3 Installation Procedure
Step 1. Log in to the installation machine as root.
Step 2. Mount the ISO image on your machine:
host1# mount -o ro,loop MLNX_OFED_LINUX---.iso /mnt
Step 3. Run the installation script.
./mlnxofedinstall
Logs dir: /tmp/MLNX_OFED_LINUX-2.1-0.0.9.10740.logs
This program will install the MLNX_OFED_LINUX package on your machine. Note that all other Mellanox, OEM, OFED, or Distribution IB packages will be removed.
Rev 2.1-1.0.6 Installation Installing user level RPMs: Preparing... ofed-scripts Preparing... libibverbs Preparing... libibverbs Preparing... libibverbs-devel Preparing... libibverbs-devel Preparing... libibverbs-devel-static Preparing... libibverbs-devel-static Preparing... libibverbs-utils Preparing... libmlx4 Preparing... libmlx4 Preparing... libmlx4-devel Preparing... libmlx4-devel Preparing... libmlx5 Preparing... libmlx5 Preparing... libmlx5-devel Preparing... libmlx5-devel Preparing...
Rev 2.1-1.0.6 Preparing... libcxgb4-devel Preparing... libcxgb4-devel Preparing... libnes Preparing... libnes Preparing... libnes-devel-static Preparing... libnes-devel-static Preparing... libipathverbs Preparing... libipathverbs Preparing... libipathverbs-devel Preparing... libipathverbs-devel Preparing... libibcm Preparing... libibcm Preparing... libibcm-devel Preparing... libibcm-devel Preparing... libibumad Preparing... libibumad Preparing... libibumad-devel Preparing... libibumad-devel Preparing...
Rev 2.1-1.0.6 Installation Preparing... libibmad-devel Preparing... libibmad-devel Preparing... libibmad-static Preparing... libibmad-static Preparing... ibsim Preparing... ibacm Preparing... librdmacm Preparing... librdmacm Preparing... librdmacm-utils Preparing... librdmacm-devel Preparing... librdmacm-devel Preparing... opensm-libs Preparing... opensm-libs Preparing... opensm Preparing... opensm-devel Preparing... opensm-devel Preparing... opensm-static Preparing... opensm-static Preparing...
Rev 2.1-1.0.6 IMPORTANT NOTE: =============== - The FCA Manager and FCA MPI Runtime library are installed in /opt/mellanox/fca directory. - The FCA Manager will not be started automatically. - To start FCA Manager now, type: /etc/init.d/fca_managerd start - There should be single process of FCA Manager running per fabric. - To start FCA Manager automatically after boot, type: /etc/init.d/fca_managerd install_service - Check /opt/mellanox/fca/share/doc/fca/README.txt for quick start instructions. Preparing.
Rev 2.1-1.0.6 Installation Preparing... ################################################## ar_mgr ################################################## Preparing... ################################################## ibdump ################################################## Preparing... ################################################## infiniband-diags-compat ################################################## Preparing...
Rev 2.1-1.0.6 Device (06:00.0): 06:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3] Link Width: 8x PCI Link Speed: 5Gb/s Installation finished successfully. Attempting to perform Firmware update... Querying Mellanox devices firmware ... Device #1: ---------Device: Part Number: Description: 40GigE; PCIe3.0 PSID: Versions: FW PXE Status: 0000:06:00.0 MCX354A-FCB_A1 ConnectX-3 VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and x8 8GT/s; RoHS R6 MT_1090110019 Current Available 2.
Rev 2.1-1.0.6 Installation In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates. Error message: Device #1: ---------Device: Part Number: Description: PSID: Versions: FW Status: Step 4. 0000:05:00.0 MT_0DB0110010 Current Available 2.9.
Rev 2.1-1.0.6 Note: For more details on hca_self_test.ofed, see the file hca_self_test.readme under docs/. # hca_self_test.ofed ---- Performing Adapter Device Self Test ---Number of CAs Detected ................. 1 PCI Device Check ....................... PASS Kernel Arch ............................ x86_64 Host Driver Version .................... MLNX_OFED_LINUX-2.1-1.0.0 (OFED-2.1-1.0.0): 3.0.76-0.11-default Host Driver RPM Check .................. PASS Firmware on CA #0 VPI .................. v2.30.
Rev 2.1-1.0.6 Installation b. The firmware version of the adapter device is older than the firmware version included with the Mellanox OFED ISO image If an adapter’s Flash was originally programmed with an Expansion ROM image, the automatic firmware update will also burn an Expansion ROM image. • In case your machine has an unsupported network adapter device, no firmware update will occur and the error message below will be printed. Please contact your hardware vendor for help on firmware updates.
Rev 2.1-1.0.6 Step 1. Start mst. host1# mst start Step 2. Identify your target InfiniBand device for firmware update. 1. Get the list of InfiniBand device names on your machine. host1# mst status MST modules: -----------MST PCI module loaded MST PCI configuration module loaded MST Calibre (I2C) module is not loaded MST devices: -----------/dev/mst/mt25418_pciconf0 /dev/mst/mt25418_pci_cr0 /dev/mst/mt25418_pci_msix0 /dev/mst/mt25418_pci_uar0 - PCI configuration cycles access. bus:dev.fn=02:00.0 addr.
Rev 2.1-1.0.6 Installation 2.5 Installing MLNX_OFED using YUM 2.5.1 Setting up MLNX_OFED YUM Repository Step 1. Download the tarball to your host. The image’s name has the format MLNX_OFED_LINUX--.tgz. You can download it from http://www.mellanox.com > Products > Software> InfiniBand Drivers. Step 2. Extract the MLNX_OFED tarball package to a shared location in your network. # tar xzf MLNX_OFED_LINUX--rhel6.4-x86_64.tgz Step 3.
Step 7. Check that the repository was successfully added.
# yum repolist
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to register.
repo id      repo name                              status
mlnx_ofed    MLNX_OFED Repository                   108
rpmforge     RHEL 6Server - RPMforge.net - dag      4,597
repolist: 8,351
2.5.
2.6 Uninstalling Mellanox OFED
Use the script /usr/sbin/ofed_uninstall.sh to uninstall the Mellanox OFED package. The script is part of the ofed-scripts RPM.
2.7 Uninstalling Mellanox OFED using the YUM Tool
If MLNX_OFED was installed using the yum tool, then it can be uninstalled as follows:
yum groupremove '<group name>'
Note: The “<group name>” must be the same group name that was previously used to install MLNX_OFED.
3 Configuration Files
For the complete list of configuration files, please refer to MLNX_OFED_configuration_files.txt at the following location: docs/readme_and_user_manual/MLNX_OFED_configuration_files.txt
3.1 Persistent Naming for Network Interfaces
To avoid network interface renaming after boot or driver restart, use the "/etc/udev/rules.d/70-persistent-net.rules" file.
Rev 2.1-1.0.6 Driver Features 4 Driver Features 4.1 SCSI RDMA Protocol 4.1.1 Overview As described in Section 1.3.4, the SCSI RDMA Protocol (SRP) is designed to take full advantage of the protocol off-load and RDMA features provided by the InfiniBand architecture. SRP allows a large body of SCSI software to be readily used on InfiniBand architecture. The SRP Initiator controls the connection to an SRP Target in order to provide access to remote storage devices across an InfiniBand fabric.
Rev 2.1-1.0.6 4.1.2.1.
Rev 2.1-1.0.6 Driver Features 4.1.2.2 Manually Establishing an SRP Connection The following steps describe how to manually load an SRP connection between the Initiator and an SRP Target. Section 4.1.2.4 explains how to do this automatically. • Make sure that the ib_srp module is loaded, the SRP Initiator is reachable by the SRP Target, and that an SM is running.
ioc_guid      A 16-digit hexadecimal number specifying the eight byte I/O controller GUID portion of the 16-byte target port identifier.
dgid          A 32-digit hexadecimal number specifying the destination GID.
pkey          A four-digit hexadecimal number specifying the InfiniBand partition key.
service_id    A 16-digit hexadecimal number specifying the InfiniBand service ID used to establish communication with the SRP target.
Rev 2.1-1.0.6 Driver Features tl_retry_count A number in the range 2..7 specifying the IB RC retry count. 4.1.2.
a. To generate output suitable for utilization in the “echo” command of Section 4.1.2.2, add the ‘-c’ option to ibsrpdm:
ibsrpdm -c
Sample output:
id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,dgid=fe800000000000000002c90200402bd5,pkey=ffff,service_id=200400a0b81146a1
b.
• To discover SRP Targets reachable from the HCA device <InfiniBand HCA name> and the port <port number> (and to generate output suitable for 'echo'), you may execute:
host1# srp_daemon -c -a -o -i <InfiniBand HCA name> -p <port number>
To obtain the list of InfiniBand HCA device names, you can either use the ibstat tool or run ‘ls /sys/class/infiniband’.
• To both discover the SRP Targets and establish connections with them, just add the -e option to the above command.
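Before a target description string printed by `ibsrpdm -c` or `srp_daemon -c` is echoed into the SRP add_target file, it can be useful to check that its fields have the documented lengths. The sketch below parses such a string and validates the digit counts given in Section 4.1.2.2 (ioc_guid 16, dgid 32, pkey 4, service_id 16 hexadecimal digits). The function name `parse_srp_target` is a hypothetical helper, and other fields such as id_ext are parsed but not validated.

```python
import re

# Required number of hexadecimal digits per field, as documented in Section 4.1.2.2.
HEX_FIELD_DIGITS = {"ioc_guid": 16, "dgid": 32, "pkey": 4, "service_id": 16}


def parse_srp_target(line: str) -> dict:
    """Split a 'key=value,key=value' target string and validate known fields."""
    fields = {}
    for item in line.strip().split(","):
        key, sep, value = item.strip().partition("=")
        if not sep:
            raise ValueError("malformed item: %r" % item)
        fields[key] = value
    for key, digits in HEX_FIELD_DIGITS.items():
        value = fields.get(key)
        if value is None:
            raise ValueError("missing field: %s" % key)
        if not re.fullmatch(r"[0-9a-fA-F]{%d}" % digits, value):
            raise ValueError("%s must be %d hex digits" % (key, digits))
    return fields


# Sample string taken from the ibsrpdm -c output shown in this section.
sample = ("id_ext=200400A0B81146A1,ioc_guid=0002c90200402bd4,"
          "dgid=fe800000000000000002c90200402bd5,pkey=ffff,"
          "service_id=200400a0b81146a1")
print(parse_srp_target(sample)["dgid"])
```

A string that passes this check is only well formed; whether the target is actually reachable still depends on the fabric and on a running SM.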
Rev 2.1-1.0.6 For the changes in openib.conf to take effect, run: /etc/init.d/openibd restart 4.1.2.5 Multiple Connections from Initiator InfiniBand Port to the Target Some system configurations may need multiple SRP connections from the SRP Initiator to the same SRP Target: to the same Target IB port, or to different IB ports on the same Target HCA. In case of a single Target IB port, i.e.
Manual Activation of High Availability
Initialization: (Execute after each boot of the driver)
1. Execute modprobe dm-multipath
2. Execute modprobe ib-srp
3. Make sure you have created file /etc/udev/rules.d/91-srp.rules as described above.
4. Execute for each port and each HCA:
srp_daemon -c -e -R 300 -i <InfiniBand HCA name> -p <port number>
This step can be performed by executing srp_daemon.sh, which sends its log to /var/log/srp_daemon.log.
When working without High Availability, you should unmount the SRP partitions that were mounted prior to shutting down SRP.
2. After Manual Activation of High Availability
If you manually activated SRP High Availability, perform the following steps:
a. Unmount all SRP partitions that were mounted.
b. Stop service srpd (kill the SRP daemon instances).
c. Make sure there are no multipath instances running. If there are any, wait for them to end or kill them.
d.
• VLAN simulation over an InfiniBand network via child interfaces
• High Availability via Bonding
• Various MTU values:
  • up to 4k in Datagram mode
  • up to 64k in Connected mode
• Uses any ConnectX® IB ports (one or two)
• Inserts IP/UDP/TCP checksum on outgoing packets
• Calculates checksum on received packets
• Supports net device TSO through the ConnectX® LSO capability to segment large datagrams into MTU-sized packets.
4.3.3 IPoIB Configuration
Unless you have run the installation script mlnxofedinstall with the flag ‘-n’, IPoIB has not been configured by the installation. The configuration of IPoIB requires assigning an IP address and a subnet mask to each HCA port, like any other network adapter card (i.e., you need to prepare a file called ifcfg-ib<n> for each port). The first port on the first HCA in the host is called interface ib0, the second port is called ib1, and so on.
Rev 2.1-1.0.6 Driver Features To run the DHCP server from the command line, enter: dhcpd -d Example: host1# dhcpd ib0 -d 4.3.3.1.2 DHCP Client (Optional) A DHCP client can be used if you need to prepare a diskless machine with an IB driver. See Step 8 under “Example: Adding an IB Driver to initrd (Linux)”. In order to use a DHCP client identifier, you need to first create a configuration file that defines the DHCP client identifier.
Rev 2.1-1.0.6 4.3.3.2 Static IPoIB Configuration If you wish to use an IPoIB configuration that is not based on DHCP, you need to supply the installation script with a configuration file (using the ‘-n’ option) containing the full IP configuration.
• The subnet mask that you want to assign to the interface
The following example shows how to configure an IB interface:
host1$ ifconfig ib0 11.4.3.175 netmask 255.255.0.0
Step 2. (Optional) Verify the configuration by entering the ifconfig command with the appropriate interface identifier ib# argument.
The following example shows how to verify the configuration:
host1$ ifconfig ib0
ib0 Link encap:UNSPEC HWaddr 80-00-04-04-FE-80-00-00-00-00-00-00-00-00-00-00
inet addr:11.4.
Rev 2.1-1.0.6 Step 3. Verify the configuration of this interface by running: host1$ ifconfig . Using the example of Step 2: host1$ ifconfig ib0.8001 ib0.8001 Link encap:UNSPEC HWaddr 80-00-00-4A-FE-80-00-00-00-00-00-00-00-00-00-00 BROADCAST MULTICAST MTU:2044 Metric:1 RX packets:0 errors:0 dropped:0 overruns:0 frame:0 TX packets:0 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:128 RX bytes:0 (0.0 b) TX bytes:0 (0.0 b) Step 4.
Rev 2.1-1.0.6 Driver Features 5 packets transmitted, 5 received, 0% packet loss, time 3999ms rtt min/avg/max/mdev = 0.044/0.058/0.079/0.014 ms, pipe 2 4.3.6 Bonding IPoIB To create an interface configuration script for the ibX and bondX interfaces, you should use the standard syntax (depending on your OS). Bonding of IPoIB interfaces is accomplished in the same manner as would bonding of Ethernet interfaces: via the Linux Bonding Driver.
Rev 2.1-1.0.6 4.4 Quality of Service InfiniBand 4.4.1 Quality of Service Overview Quality of Service (QoS) requirements stem from the realization of I/O consolidation over an IB network. As multiple applications and ULPs share the same fabric, a means is needed to control their use of network resources. Figure 2: I/O Consolidation Over InfiniBand QoS over Mellanox OFED for Linux is discussed in Chapter 8, “OpenSM – Subnet Manager”.
4.4.2 QoS Architecture
QoS functionality is split between the SM/SA, CMA and the various ULPs. We take the “chronology approach” to describe how the overall system works.
1. The network manager (human) provides a set of rules (policy) that defines how the network is configured and how its resources are split among different QoS-Levels. The policy also defines how to decide which QoS-Level each application, ULP or service uses.
2.
II. Fabric Setup
Defines how the SL2VL and VLArb tables should be set up. In OFED this part of the policy is ignored. SL2VL and VLArb tables should be configured in the OpenSM options file (opensm.opts).
III. QoS-Levels Definition
This section defines the possible sets of parameters for QoS that a client might be mapped to. Each set holds SL and optionally: Max MTU, Max Rate, Packet Lifetime and Path Bits. Path Bits are not implemented in OFED.
IV.
4.4.5 OpenSM Features
The QoS related functionality that is provided by OpenSM (the Subnet Manager described in Chapter 8) can be split into two main parts:
I. Fabric Setup
During fabric initialization, the Subnet Manager parses the policy and applies its settings to the discovered fabric elements.
II. PR/MPR Query Handling
OpenSM enforces the provided policy on client requests.
1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
2. ToS is translated into the sk_prio using a fixed translation:
TOS 0  <=> sk_prio 0
TOS 8  <=> sk_prio 2
TOS 24 <=> sk_prio 4
TOS 16 <=> sk_prio 6
3. The Socket Priority is mapped to the UP:
• If the underlying device is a VLAN device, egress_map is used, controlled by the vconfig command. This is per VLAN mapping.
• If the underlying device is not a VLAN device, the tc command is used.
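The fixed ToS to sk_prio translation of step 2 can be written down as a lookup table. The sketch below encodes only the four pairs listed above; the names `TOS_TO_SK_PRIO`, `sk_prio_for_tos` and `set_socket_tos` are hypothetical helpers, and rejecting other ToS values is an assumption made for illustration rather than documented kernel behavior.

```python
import socket

# Fixed translation from step 2 above: ToS value -> socket priority (sk_prio).
TOS_TO_SK_PRIO = {0: 0, 8: 2, 24: 4, 16: 6}


def sk_prio_for_tos(tos: int) -> int:
    """Return the sk_prio for one of the four listed ToS values."""
    if tos not in TOS_TO_SK_PRIO:
        # Assumption: values outside the listed set are treated as unsupported here.
        raise ValueError("unsupported ToS value: %d" % tos)
    return TOS_TO_SK_PRIO[tos]


def set_socket_tos(sock: socket.socket, tos: int) -> int:
    """Step 1: set IP_TOS on the socket and return the sk_prio it maps to."""
    prio = sk_prio_for_tos(tos)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return prio


print(sk_prio_for_tos(24))
```

The resulting sk_prio is then mapped to a UP by the egress map or by tc, as described in step 3.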
4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if DCBX is used.
With RoCE, there can only be 4 predefined ToS values for the purpose of QoS mapping.
4.5.5 Raw Ethernet QP Quality of Service Mapping
Applications open a Raw Ethernet QP using verbs directly. The following is the RoCE QoS mapping flow:
1. The application sets the UP of the Raw Ethernet QP during the INIT to RTR state transition of the QP:
• Sets qp_attrs.ah_attrs.
• After mapping the skb_priority to UP, one should map the UP into a TC. This assigns the user priority to a specific hardware traffic class. In order to do that, mlnx_qos should be used. mlnx_qos takes a list that maps UPs to TCs. For example, mlnx_qos -i eth0 -p 0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1.
4.5.
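The UP to TC list passed with `-p` in the example above is positional: entry i is the traffic class assigned to user priority i. The following sketch shows how such a list can be read back into a per-TC view. The function name `parse_up_to_tc` is a hypothetical helper and does not call mlnx_qos; the check for exactly eight entries assumes the eight user priorities (0-7) used throughout this section.

```python
def parse_up_to_tc(prio_tc: str) -> dict:
    """Turn a '-p' style list such as '0,0,0,0,1,1,1,1' into {tc: [ups]}."""
    tcs = [int(item) for item in prio_tc.split(",")]
    if len(tcs) != 8:
        # Assumption: one entry per user priority, UP 0 through UP 7.
        raise ValueError("expected 8 entries, got %d" % len(tcs))
    mapping = {}
    for up, tc in enumerate(tcs):
        # Position in the list is the UP; the value is its traffic class.
        mapping.setdefault(tc, []).append(up)
    return mapping


print(parse_up_to_tc("0,0,0,0,1,1,1,1"))
```

For the list used in the example, UPs 0-3 end up under TC 0 and UPs 4-7 under TC 1, matching the description above.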
Rev 2.1-1.0.6 Driver Features The tool will also display maps configured by TC and vconfig set_egress_map tools, in order to give a centralized view of all QoS mappings. • Set UP to TC mapping • Assign a transmission algorithm to each TC (strict or ETS) • Set minimal BW guarantee to ETS TCs • Set rate limit to TCs For unlimited ratelimit set the ratelimit to 0.
Set ratelimit: 3Gbps for tc0, 4Gbps for tc1 and 2Gbps for tc2:
tc: 0 ratelimit: 3 Gbps, tsa: strict
    up: 0
        skprio: 0
        skprio: 1
        skprio: 2 (tos: 8)
        skprio: 3
        skprio: 4 (tos: 24)
        skprio: 5
        skprio: 6 (tos: 16)
        skprio: 7
        skprio: 8
        skprio: 9
        skprio: 10
        skprio: 11
        skprio: 12
        skprio: 13
        skprio: 14
        skprio: 15
    up: 1
    up: 2
    up: 3
    up: 4
    up: 5
    up: 6
    up: 7
Configure QoS: map UP 0,7 to tc0, 1,2,3 to tc1 and 4,5,6 to tc2. Set tc0 and tc1 as ets and tc2 as strict.
Rev 2.1-1.0.6 up: 1 up: 2 up: 3 tc: 2 ratelimit: 2 Gbps, tsa: strict up: 4 up: 5 up: 6 4.5.8.2 tc and tc_wrap.py The 'tc' tool is used to setup sk_prio to UP mapping, using the mqprio queue discipline. In kernels that do not support mqprio (such as 2.6.34), an alternate mapping is created in sysfs. The 'tc_wrap.py' tool will use either the sysfs or the 'tc' tool to configure the sk_prio to UP mapping. Usage: tc_wrap.
UP 2
UP 3
UP 4
UP 5
UP 6
UP 7
4.5.8.3 Additional Tools
tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is a part of iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is available.
• mlnx_qos tool (package: ofed-scripts) requires python >= 2.5
• tc_wrap.py (package: ofed-scripts) requires python >= 2.5
4.6 Ethernet Time-Stamping
4.6.
SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time stamp
SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time stamp transformed to the system time base
SOF_TIMESTAMPING_SOFTWARE: return system time stamp generated in software
SOF_TIMESTAMPING_TX/RX determine how time stamps are generated.
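The flags above are combined into a single bit mask that is passed to setsockopt with SO_TIMESTAMPING. The sketch below builds such a mask for raw hardware time stamps. The bit values are those of the Linux header linux/net_tstamp.h; the fallback value 37 for SO_TIMESTAMPING is an assumption that holds on x86 Linux, and the function names `raw_hardware_flags` and `enable_timestamping` are hypothetical helpers.

```python
import socket

# Bit values as defined in the Linux header linux/net_tstamp.h.
SOF_TIMESTAMPING_TX_HARDWARE = 1 << 0
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1
SOF_TIMESTAMPING_RX_HARDWARE = 1 << 2
SOF_TIMESTAMPING_RX_SOFTWARE = 1 << 3
SOF_TIMESTAMPING_SOFTWARE = 1 << 4
SOF_TIMESTAMPING_SYS_HARDWARE = 1 << 5
SOF_TIMESTAMPING_RAW_HARDWARE = 1 << 6

# Assumption: 37 is the option number on x86 Linux; it is used only when
# the socket module does not export SO_TIMESTAMPING itself.
SO_TIMESTAMPING = getattr(socket, "SO_TIMESTAMPING", 37)


def raw_hardware_flags(tx: bool = True, rx: bool = True) -> int:
    """Build a mask that reports raw hardware time stamps for TX and/or RX."""
    flags = SOF_TIMESTAMPING_RAW_HARDWARE
    if tx:
        flags |= SOF_TIMESTAMPING_TX_HARDWARE
    if rx:
        flags |= SOF_TIMESTAMPING_RX_HARDWARE
    return flags


def enable_timestamping(sock: socket.socket, flags: int) -> None:
    """Apply the mask to a socket (requires a Linux kernel with time-stamping support)."""
    sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, flags)


print(hex(raw_hardware_flags()))
```

Setting the socket option alone is not sufficient for hardware time stamps: time stamping must also be enabled on the net device through the SIOCSHWTSTAMP ioctl.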
To enable time stamping for a net device:
An admin-privileged user can enable/disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values:
Send side time stamping:
• Enabled by ifreq.hwtstamp_config.tx_type when
/* possible values for hwtstamp_config->tx_type */
enum hwtstamp_tx_types {
    /*
     * No outgoing packet will need hardware time stamping;
     * should a packet arrive which asks for it, no hardware
     * time stamping will be done.
Receive side time stamping:
• Enabled by ifreq.hwtstamp_config.
Rev 2.1-1.0.6 Driver Features a pending bounced packet is ready for reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the first fragment is time stamped and returned to the sending socket. When time-stamping is enabled, VLAN stripping is disabled. For more info please refer to Documentation/networking/timestamping.txt in kernel.org 4.6.1.
For example:
struct ibv_exp_device_attr attr;
ibv_exp_query_device(context, &attr);
if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_TIMESTAMP_MASK) {
    if (attr.timestamp_mask) {
        /* Time stamping is supported with mask attr.timestamp_mask */
    }
}
if (attr.comp_mask & IBV_EXP_DEVICE_ATTR_WITH_HCA_CORE_CLOCK) {
    if (attr.hca_core_clock) {
        /* reporting the device's clock is supported. */
        /* attr.hca_core_clock is the frequency in MHz */
    }
}
4.6.2.
CQs that are opened with the ibv_create_cq_ex verb should always be polled with the ibv_poll_cq_ex verb.
4.6.2.4 Querying the Hardware Time
Querying the hardware for time is done via the ibv_query_values_ex verb. For example:
ret = ibv_query_values_ex(context, IBV_VALUES_HW_CLOCK, &queried_values);
if (!ret && queried_values.comp_mask & IBV_VALUES_HW_CLOCK)
    queried_time = queried_values.
|     *va = (*va & ~(swap_mask)) | (swap & swap_mask)
|
| return atomic_response
The additional operands are carried in the Extended Transport Header. Atomic response generation and packet format for MskCmpSwap is as for standard IB Atomic operations.
4.7.1.2 Masked Fetch and Add (MFetchAdd)
The MFetchAdd Atomic operation extends the functionality of the standard IB FetchAdd by allowing the user to split the target into multiple fields of selectable length.
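The semantics of the two masked atomics can be modeled in software on plain integers. The sketch below is an illustration only, not the hardware implementation. It relies on two assumptions that are not spelled out in the text above: for MskCmpSwap, the swap happens when the compare operand equals the target in the bits selected by a compare mask; for MFetchAdd, a set bit in the mask marks the most significant bit of a field, so a carry out of that bit is discarded. The function names and the `width` parameter are hypothetical.

```python
def masked_cmp_swap(target, compare, compare_mask, swap, swap_mask, width=64):
    """Model MskCmpSwap: return (original value, new value of the target)."""
    full = (1 << width) - 1
    original = target & full
    new_value = original
    # Assumed condition: compare only the bits selected by compare_mask.
    if ((compare ^ original) & compare_mask) == 0:
        # Replace only the bits selected by swap_mask, as in the pseudo-code above.
        new_value = (original & ~swap_mask & full) | (swap & swap_mask)
    return original, new_value


def masked_fetch_add(target, add, boundary_mask, width=64):
    """Model MFetchAdd: add per field; carries do not cross field boundaries."""
    result = 0
    carry = 0
    for i in range(width):
        bit = 1 << i
        total = carry + ((target >> i) & 1) + ((add >> i) & 1)
        if total & 1:
            result |= bit
        # Assumed rule: a set mask bit ends a field, so its carry-out is dropped.
        carry = 0 if (boundary_mask & bit) else (total >> 1)
    return target, result


print(masked_cmp_swap(0xAABB, 0x00BB, 0x00FF, 0x1100, 0xFF00))
print(masked_fetch_add(0x00FF, 0x0001, 0x80))
```

With an all-zero boundary mask, `masked_fetch_add` behaves like an ordinary fetch-and-add truncated to `width` bits, which matches the description of MFetchAdd as an extension of the standard FetchAdd.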
In a virtualization environment, a virtual machine can be exposed to the physical network by performing the following settings:
Step 1. Create a virtual bridge.
Step 2. Attach the para-virtualized interface created by the eth_ipoib driver to the bridge.
Step 3.
4.8.2 Configuring the Ethernet Tunneling Over IPoIB Driver
When eth_ipoib is loaded, a number of eIPoIB interfaces are created, with the following default naming scheme: ethX, where X represents the ETH port available on the system.
To check which eIPoIB interfaces were created:
    cat /sys/class/net/eth_ipoib_interfaces
For example, on a system with a dual-port HCA, the following two interfaces might be created: eth4 and eth5.
The example above shows two eIPoIB interfaces, where eth4 runs traffic over ib0, and eth5 runs traffic over ib1.
Figure 3: An Example of a Virtual Network
The example above shows a few IPoIB instances that serve the virtual interfaces at the Virtual Machines.
To display the services provided to the Virtual Machine interfaces:
    # cat /sys/class/net/eth0/eth/vifs
Example:
    # cat /sys/class/net/eth0/eth/vifs
    SLAVE=ib0.
For example, to create the VLAN tag 3 with pkey 0x8003 over that port in the eIPoIB interface eth4, run:
    # vconfig add eth4 3
    # brctl addif br2 eth4.3
4.8.4 Setting Performance Tuning
• Use 4K MTU over OpenSM.
For further information, please refer to Section 8.4.
Table 3 - Buffer Values
    Possible Value (1)    Description
    ALL                   Try huge pages; if that fails, fall back to contiguous pages; if that fails, fall back to ANON small pages.
(1) Values are NOT case sensitive.
Usage:
The application calls the ibv_reg_mr API with the IBV_ACCESS_ALLOCATE_MR bit turned on and the input address set to NULL. Upon success, the address field of the struct ibv_mr will hold the address of the allocated memory block. This block will be freed implicitly when ibv_dereg_mr() is called.
The underlying physical pages must not be Least Recently Used (LRU) or Anonymous. To ensure this, turn on the IBV_ACCESS_ALLOCATE_MR bit as part of the sharing bits.
Usage:
• Turn on one or more of the sharing access bits via ibv_reg_mr. The sharing bits are described in the ibv_reg_mr man page.
• Turn on the IBV_ACCESS_ALLOCATE_MR bit.
Step 2. Request to register to a shared MR
A new verb called ibv_reg_shared_mr is added to enable sharing an MR.
• ibv_open_qp
Please use ibv_xsrq_pingpong for basic tests and code reference. For detailed information regarding the various options for these verbs, please refer to their appropriate man pages.
4.12 Flow Steering
Flow Steering is applicable to the mlx4 driver only.
Flow steering is a new model which steers network flows based on flow specifications to specific QPs. Those flows can be either unicast or multicast network flows.
• ibv_create_flow
    struct ibv_flow *ibv_create_flow(struct ibv_qp *qp, struct ibv_flow_attr *flow)
Input parameters:
• struct ibv_qp - the attached QP.
• struct ibv_flow_attr - attaches the QP to the flow specified. The flow contains mandatory control parameters and optional L2, L3 and L4 headers.
All packets that contain the above destination MAC address are to be steered into rx-ring 2 (its underlying QP), with priority 5 (within the ethtool domain).
• ethtool -U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2
All packets that contain the above source IP address and destination port are to be steered into rx-ring 2. When destination MAC is not given, the user's destination MAC is filled automatically.
• The mlx4 ipoib driver when it attaches its QP to its configured GIDs
Fragmented UDP traffic cannot be steered. It is treated as 'other' protocol by hardware (from the first packet) and not considered as UDP traffic.
We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher as of MLNX_OFED v2.0-3.0.0 due to API changes.
4.
Step 1. Enable "SR-IOV" in the system BIOS.
Step 2. Enable "Intel Virtualization Technology".
Step 3. Install a hypervisor that supports SR-IOV.
Step 4. Depending on your system, update the /boot/grub/grub.conf file to include a similar command line load parameter for the Linux kernel. For example, on Intel systems, add:
    default=0
    timeout=5
    splashimage=(hd0,0)/grub/splash.xpm.gz
    hiddenmenu
    title Red Hat Enterprise Linux Server (2.6.32-36.
Step 5. Install the MLNX_OFED driver for Linux that supports SR-IOV. Use the '--enable-sriov' installation parameter to burn firmware with SR-IOV support. The number of virtual functions (VFs) will be set to 16.
Step 6. Verify the HCA is configured to support SR-IOV.
    [root@selene ~]# mstflint -dev dc 1.
Parameter: num_vfs
Recommended Value:
• If absent, or zero: no VFs will be available.
• If its value is a single number in the range of 0-63: the driver will enable num_vfs VFs on the HCA, and this will be applied to all ConnectX® HCAs on the host.
• If its format is a string: the string specifies the num_vfs parameter separately per installed HCA. The string format is: "bb:dd.f-v,bb:dd.f-v,…"
    • bb:dd.f = bus:device.
• port_type_array=2,2 (Ethernet, Ethernet)
• port_type_array=1,1 (IB, IB)
• port_type_array=1,2 (VPI: IB, Ethernet)
• NO port_type_array module parameter: ports are IB
Step 9. Reboot the server.
If SR-IOV is not supported by the server, the machine might not come out of boot/load.
Step 10. Load the driver and verify that SR-IOV is supported. Run:
    lspci | grep Mellanox
    03:00.0 InfiniBand: Mellanox / 10GigE] (rev b0)
    03:00.1 InfiniBand: Mellanox (rev b0)
    03:00.
Step 2. Change the related interface (in the example below bridge0 is created over eth5).
    DEVICE=eth5
    BOOTPROTO=none
    STARTMODE=on
    HWADDR=00:02:c9:2e:66:52
    TYPE=Ethernet
    NM_CONTROLLED=no
    ONBOOT=yes
    BRIDGE=bridge0
Step 3. Restart the network service.
Step 4. Attach a virtual NIC to the VM.
    ifconfig -a
    …
    eth6 Link encap:Ethernet HWaddr 52:54:00:E7:77:99
    inet addr:13.195.15.5 Bcast:13.195.255.255 Mask:255.255.0.
Step 4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1).
Step 5. If the Virtual Machine is up, reboot it; otherwise start it.
Step 6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
    lspci | grep Mellanox
    00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
Step 7. Add the device to the /etc/sysconfig/network-scripts/ifcfg-ethX configuration file.
If such an ini file cannot be found in the firmware directory, you may want to dump the configuration file using mstflint. Run:
    # mstflint -dev dc >
Step 4. Edit the ini file that you found in the previous step, and add the following lines to the [HCA] section in order to support 63 VFs (1).
    ;; SRIOV enable
    total_vfs = 63
    num_pfs = 1
    sriov_en = true
(1) Some servers might have issues accepting 63 Virtual Functions or more.
Only the PFs are set via this mechanism. The VFs inherit their port types from their associated PF.
4.13.7.2 Virtual Function InfiniBand Ports
Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the network which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem, ULPs, and applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized) IB deployments.
• ports//admin_guids/ where 0 <= n <= 127 (allows examining or changing the administrative state of a given GUID)
• ports//pkeys/ where 0 <= n <= 126 (displays the contents of the physical pkey table)
• directories - one for Dom0 and one per guest. Here, you may see the mapping between virtual and physical pkey indices, and the virtual to physical gid 0. Currently, the GID mapping cannot be modified, but the pkey virtual to physical mapping can.
If the value under admin_guids/ is different from the value under gids/, the request is still in progress.
4.13.7.2.3 Partitioning IPoIB Communication using PKeys
PKeys are used to partition IPoIB communication between the Virtual Machines and the Dom0 by mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a virtual pkey index other than zero.
The following describes how to set up two hosts, each with 2 Virtual Machines.
(the most significant bit indicates if a PKey is a full PKey). The ",ipoib" causes OpenSM to pre-create the IPoIB broadcast group for the indicated PKeys.
Step 2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.
Step a. Check the PCI ID for the Physical Function and the Virtual Functions.
    lspci | grep Mel
Step b. Assuming that on Host1, the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.
4.13.7.3 Ethernet Virtual Function Configuration when Running SR-IOV
4.13.7.3.1 VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)
When running ETH ports on VGT, the ports may be configured to simply pass through packets as is from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force packets to be associated with a VLAN/QoS (VLAN Switch Tagging).
number of VFs is larger than 56 entries, some of them will have a GID table with only a single entry, which is inadequate if the VF's Ethernet device is assigned an IP address.
When setting num_vfs in the mlx4_core module parameter, it is important to check that the number of assigned IP addresses per VF does not exceed the limit for the GID table size.
4.14 CORE-Direct
4.14.
The following are the ethtool supported options:
Table 6 - ethtool Supported Options
Option: ethtool -i eth
Description: Checks driver and device information. For example:
    #> ethtool -i eth2
    driver: mlx4_en (MT_0DD0120009_CX3)
    version: 2.1.6 (Aug 2013)
    firmware-version: 2.30.3000
    bus-info: 0000:1a:00.0
Option: ethtool -k eth
Description: Queries the stateless offload status.
Table 6 - ethtool Supported Options
Option: ethtool -C eth [rx-usecs N] [rx-frames N]
Description: Sets the interrupt coalescing settings when the adaptive moderation is disabled. Note: usec settings correspond to the time to wait after the *last* packet is sent/received before triggering an interrupt.
Option: ethtool -a eth
Description: Queries the pause frame settings.
Option: ethtool -A eth [rx on|off] [tx on|off]
Description: Sets the pause frame settings.
4.16
(over InfiniBand/RoCE) application to use peer device computing power, and RDMA interconnect at the same time without copying the data between the P2P devices. For example, PeerDirect is being used for GPUDirect RDMA.
A detailed description of that API is included in the MLNX_OFED installation; please see docs/readme_and_user_manual/PEER_MEMORY_API.txt
4.18 Inline-Receive
When Inline-Receive is active, the HCA may write received data into the receive WQE or CQE.
The counter index is a QP attribute given in the QP context. Multiple QPs may be associated with the same counter set. If multiple QPs share the same counter, its value represents the cumulative total.
Table 8 - Port OUT Counters
    Counter                     Description
    tx_gt_1548_bytes_packets    Number of transmitted 1549-or-greater-octet frames
Table 9 - Port VLAN Priority Tagging (where <i> is in the range 0…7)
    Counter                Description
    rx_prio_<i>_packets    Total packets successfully received with priority i.
    rx_prio_<i>_bytes      Total bytes in successfully received packets with priority i.
    rx_novlan_packets      Total packets successfully received with no VLAN priority.
4.20.1 Query Capabilities
Memory Windows are available if and only if the hardware supports them. To verify whether Memory Windows are available, run ibv_query_device. For example:
    struct ibv_device_attr device_attr = {};
    ibv_query_device(context, &device_attr);
    if (device_attr.device_cap_flags & IBV_DEVICE_MEM_WINDOW ||
        device_attr.device_cap_flags & IBV_DEVICE_MW_TYPE_2B) {
        /* Memory window is supported */
4.20.
5 HPC Features
5.1 Shared Memory Access
The Shared Memory Access (SHMEM) routines provide low-latency, high-bandwidth communication for use in highly parallel scalable programs. The routines in the SHMEM Application Programming Interface (API) provide a programming model for exchanging data between cooperating parallel processes. The SHMEM API can be used either alone or in combination with MPI routines in the same parallel program.
5.1.2 Running SHMEM with FCA
The Mellanox Fabric Collective Accelerator (FCA) is a unique solution for offloading collective operations from the Message Passing Interface (MPI) or ScalableSHMEM process onto Mellanox InfiniBand managed switch CPUs. As a system-wide solution, FCA utilizes intelligence on Mellanox InfiniBand switches, Unified Fabric Manager and MPI nodes without requiring additional hardware.
These enhancements significantly increase the scalability and performance of message communications in the network, alleviating bottlenecks within the parallel communication libraries.
5.1.4 Running SHMEM with Contiguous Pages
Contiguous Pages improves performance by allocating user memory regions over contiguous pages. It enables a user application to ask low level drivers to allocate contiguous memory for it as part of ibv_reg_mr.
To activate MLNX_OFED 2.
These MPI implementations, along with MPI benchmark tests such as OSU BW/LAT, Intel MPI Benchmark, and Presta, are installed on your machine as part of the Mellanox OFED for Linux installation. Table 14 lists some useful MPI links.
Table 14 - Useful MPI Links
    MPI Standard     http://www-unix.mcs.anl.gov/mpi
    Open MPI         http://www.open-mpi.org
    MVAPICH 2 MPI    http://mvapich.cse.ohio-state.edu/
    MPI Forum        http://www.mpi-forum.org
This chapter includes the following sections:
5.2.
    -rw-r--r-- 1 root root 404 Mar 5 04:57 id_rsa.pub
Step 3. Check the public key.
    host1$ cat id_rsa.
5.2.4 Compiling MPI Applications
Compiling MVAPICH Applications
Please refer to http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html.
To review the default configuration of the installation, check the default configuration file:
    /usr/mpi//mvapich-/etc/mvapich.conf
Compiling Open MPI Applications
Please refer to http://www.open-mpi.org/faq/?category=mpi-apps.
5.
To upgrade MLNX_OFED v2.0 or later with a newer MXM:
Step 1. Remove MXM v1.1.
    rpm -e mxm
Step 2. Remove the pre-compiled OpenMPI.
    rpm -e mlnx-openmpi_gcc
Step 3. Install the new MXM and compile the OpenMPI with it.
To run OpenMPI without MXM, run:
    % mpirun -mca mtl ^mxm <...>
When upgrading to MXM v0.52, OpenMPI compiled with the previous versions of the MXM should be recompiled with MXM v0.52.
5.3.2 Enabling MXM in OpenMPI
MXM v0.
5.3.4 Configuring Multi-Rail Support
Multi-Rail support enables the user to use more than one of the active ports on the card, making better use of the resources. It provides a combined throughput among the used ports.
To configure dual rail support:
• Specify the list of ports you would like to use to enable multi-rail support.
    -x MXM_RDMA_PORTS=cardName:portNum
or
    -x MXM_IB_PORTS=cardName:portNum
5.3.
Collective communications are isolated from the rest of the traffic in the fabric using a private virtual network (VLane), eliminating contention with other types of traffic.
After MLNX_OFED installation, FCA can be found in the /opt/mellanox/fca folder.
For further information on configuration instructions, please refer to the FCA User Manual.
5.
5.5.1 Installing ScalableUPC
Mellanox ScalableUPC is installed as part of the MLNX_OFED package. Mellanox OFED 1.8.5 includes ScalableUPC Rev 2.2, which is installed under /opt/mellanox/bupc.
If you have installed OFED 1.8.5, you do not need to download and install ScalableUPC.
Mellanox ScalableUPC is also distributed as a source RPM and can be downloaded from the Mellanox website.
Please note, the binary distribution of ScalableUPC is compiled with the following defaults:
5.5.
5.5.2.2 Controlling FCA Offload in ScalableUPC using Environment Variables
To enable the FCA module under ScalableUPC:
    % export GASNET_FCA_ENABLE_CMD_LINE=1
To set the FCA verbose level:
    % export GASNET_FCA_VERBOSE_CMD_LINE=10
To set the minimal number of processes threshold to activate FCA:
    % export GASNET_FCA_NP_CMD_LINE=1
ScalableUPC contains a modules configuration file (http://modules.sf.net) which can be found at /opt/mellanox/bupc/2.2/etc/bupc_modulefile.
5.5.
6 Working With VPI
VPI allows ConnectX ports to be independently configured as either IB or Eth.
6.1 Port Type Management
ConnectX ports can be individually configured to work as InfiniBand or Ethernet ports. By default both ConnectX ports are initialized as InfiniBand ports. If you wish to change the port type, use the connectx_port_config script after the driver is loaded.
6.2 Auto Sensing
Auto Sensing enables the NIC to automatically sense the link type (InfiniBand or Ethernet) based on the link partner and load the appropriate driver stack (InfiniBand or Ethernet).
For example, if the first port is connected to an InfiniBand switch and the second to an Ethernet switch, the NIC will automatically load the first port as InfiniBand and the second as Ethernet.
6.2.1 Enabling Auto Sensing
Upon driver start up:
1.
7 Performance
7.1 General System Configurations
The following sections describe recommended configurations for system components and/or interfaces. Different systems may have different features, thus some recommendations below may not be applicable.
7.1.1 PCI Express (PCIe) Capabilities
Table 16 - Recommended PCIe Configuration
    PCIe Generation    3.
7.1.3.2 Intel® Sandy Bridge Processors
The following table displays the recommended BIOS settings in machines with Intel Sandy Bridge-based processors.
7.1.3.3 Intel® Nehalem/Westmere Processors
The following table displays the recommended BIOS settings in machines with Intel Nehalem-based processors.
Configuring the Completion Queue Stall Delay.
Table 18 - Recommended BIOS Settings for Intel® Nehalem/Westmere Processors
                 BIOS Option                      Values
    General      Operating Mode / Power profile   Maximum Performance
    Processor    C-States                         Disabled
                 Turbo mode                       Disabled
                 Hyper-Threading (1)              Disabled. Recommended for latency and message rate sensitive applications.
Table 19 - Recommended BIOS Settings for AMD Processors
              BIOS Option             Values
    Memory    Memory speed            Max performance
              Memory channel mode     Independent
              Node Interleaving       Disabled / NUMA
              Channel Interleaving    Enabled
              Thermal Mode            Performance
7.2 Performance Tuning for Linux
You can use the Linux sysctl command to modify default system network parameters that are set by the operating system in order to improve IPv4 and IPv6 traffic performance.
• Enable the TCP selective acks option for better CPU utilization:
    sysctl -w net.ipv4.tcp_sack=1
7.2.3 Preserving Your Performance Settings after a Reboot
To preserve your performance settings after a reboot, you need to add them to the file /etc/sysctl.
7.2.4.1 Setting the Scaling Governor
If the following modules are loaded, CPU scaling is supported, and you can improve performance by setting the scaling mode to performance:
• freq_table
• acpi_cpufreq: this module is architecture dependent.
It is also recommended to disable the module cpuspeed; this module is also architecture dependent.
7.2.5 Interrupt Moderation
Interrupt moderation is used to decrease the frequency of network adapter interrupts to the CPU. Mellanox network adapters use an adaptive interrupt moderation algorithm by default. The algorithm checks the transmission (Tx) and receive (Rx) packet rates and modifies the Rx interrupt moderation settings accordingly.
To manually set Tx and/or Rx interrupt moderation, use the ethtool utility.
Example for supported system:
    # cat /sys/class/net/eth3/device/numa_node
    0
Example for unsupported system:
    # cat /sys/class/net/ib0/device/numa_node
    -1
7.2.6.1.1 Improving Application Performance on Remote NUMA Node
Verbs API applications that mostly use polling will see a performance impact when using the remote NUMA node.
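The numa_node check shown in the examples above can also be done from a program. The sketch below is illustrative only: the function names parse_numa_node and read_numa_node are hypothetical, and the return value -2 is an assumed convention for "could not read or parse", chosen so that it does not collide with the -1 that the kernel reports on unsupported systems.

```c
#include <stdio.h>
#include <stdlib.h>

/* Parse the text of a numa_node sysfs file (hypothetical helper).
 * Returns the node number, -1 when the kernel reports no NUMA
 * information, or -2 when the text does not start with a number. */
int parse_numa_node(const char *text)
{
    char *end;
    long node = strtol(text, &end, 10);

    if (end == text)
        return -2;
    return (int)node;
}

/* Read and parse a numa_node file such as
 * /sys/class/net/eth3/device/numa_node. Returns -2 on I/O failure. */
int read_numa_node(const char *path)
{
    char buf[32];
    FILE *f = fopen(path, "r");

    if (!f)
        return -2;
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return -2;
    }
    fclose(f);
    return parse_numa_node(buf);
}
```

A non-negative result identifies the adapter's local NUMA node, which can then be used when choosing the cores an application runs on.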
7.2.6.3.1 Running an Application on a Certain NUMA Node
In order to run an application on a certain NUMA node, the process affinity should be set either on the command line or with an external tool.
For example, if the adapter's NUMA node is 1 and NUMA 1 cores are 8-15, then an application should run with a process affinity that uses cores 8-15 only.
To run an application, run the following commands:
    taskset -c 8-15 ib_write_bw -a
or:
    taskset 0xff00 ib_write_bw -a
7.2.
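The two taskset forms above are equivalent because 0xff00 is the bitmask with bits 8 through 15 set. A small sketch of that conversion follows; the function name cpu_range_mask is hypothetical, and the limit of 64 cores is an assumption made to keep the mask in a single 64-bit word.

```c
#include <stdint.h>

/* Build a taskset-style CPU mask with bits first..last set
 * (hypothetical helper). Returns 0 for an invalid range or for
 * core numbers above 63, which do not fit a 64-bit mask. */
uint64_t cpu_range_mask(unsigned first, unsigned last)
{
    uint64_t mask = 0;
    unsigned cpu;

    if (first > last || last > 63)
        return 0;
    for (cpu = first; cpu <= last; cpu++)
        mask |= (uint64_t)1 << cpu;
    return mask;
}
```

Calling cpu_range_mask(8, 15) yields 0xff00, the value passed to taskset in the second command.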
• Stop
    # mlnx_affinity stop
• Restart
    # mlnx_affinity restart
mlnx_affinity can also be started by driver load/unload.
To enable mlnx_affinity by default:
• Add the line below to the /etc/infiniband/openib.conf file.
    RUN_AFFINITY_TUNER=yes
7.2.7.3 Tuning for Multiple Adapters
When optimizing the system performance for using more than one adapter, it is recommended to separate the adapters' core utilization so that there is no interleaving between interfaces.
7.2.8 Tuning Multi-Threaded IP Forwarding
To optimize NIC usage for IP forwarding:
1. Set the following options in /etc/modprobe.d/mlx4.conf:
• For MLNX_OFED-2.0.x:
    options mlx4_en inline_thold=0
    options mlx4_core high_rate_steer=1
• For MLNX_EN-1.5.10:
    options mlx4_en num_lro=0 inline_thold=0
    options mlx4_core high_rate_steer=1
2. Apply interrupt affinity tuning.
3. Forwarding on the same interface:
    # set_irq_affinity_bynode.sh
4.
8 OpenSM – Subnet Manager
8.1 Overview
OpenSM is an InfiniBand compliant Subnet Manager (SM). It is provided as a fixed flow executable called opensm, accompanied by a testing application called osmtest. OpenSM implements an InfiniBand compliant SM according to the InfiniBand Architecture Specification chapters: Management Model (13), Subnet Management (14), and Subnet Administration (15).
8.
bound to 1 port at a time. If GUID given is 0, OpenSM displays a list of possible port GUIDs and waits for user input. Without -g, OpenSM tries to use the default port.
--lmc, -l
This option specifies the subnet's LMC value. The number of LIDs assigned to each port is 2^LMC. The LMC value must be in the range 0-7. LMC values > 0 allow multiple paths between ports. LMC values > 0 should only be used if the subnet topology actually provides multiple paths between ports, i.e.
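As a quick illustration of the LMC rule stated for the --lmc option (2^LMC LIDs per port, with LMC in the range 0-7), the following sketch computes the LID count. The function name lids_per_port is hypothetical, and returning 0 for an out-of-range LMC is an assumed convention for this example.

```c
/* Number of LIDs assigned to each port for a given LMC value
 * (hypothetical helper). Returns 0 when lmc is outside 0-7. */
unsigned lids_per_port(unsigned lmc)
{
    if (lmc > 7)
        return 0;
    return 1u << lmc;   /* 2^LMC */
}
```

With the largest allowed value, LMC = 7, each port is assigned 128 LIDs.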
--do_mesh_analysis
This option enables additional analysis for the lash routing engine to precondition switch port assignments in regular Cartesian meshes, which may reduce the number of SLs required to give a deadlock-free routing.
--lash_start_vl
Sets the starting VL to use for the lash routing algorithm. Defaults to 0.
--sm_sl
Sets the SL to use to communicate with the SM/SA. Defaults to 0.
--timeout, -t
This option specifies the time in milliseconds used for transaction timeouts. Specifying -t 0 disables timeouts. Without -t, OpenSM defaults to a timeout value of 200 milliseconds.
--retries
This option specifies the number of retries used for transactions. Without --retries, OpenSM defaults to 3 retries for transactions.
--maxsmps, -n
This option specifies the number of VL15 SMP MADs allowed on the wire at any one time.
--port_search_ordering_file, -O
This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR). Moreover, this option provides the means to define a non-default routing port order.
--dimn_ports_file, -O (DEPRECATED)
This option provides the means to define a mapping between ports and dimension (Order) for controlling Dimension Order Routing (DOR).
--part_enforce, -Z [both, in, out, off]
This option indicates the partition enforcement type (for switches). Enforcement type can be outbound only (out), inbound only (in), both or disabled (off). Default is both.
--allow_both_pkeys, -W
This option indicates whether both full and limited membership on the same partition can be configured in the PKeyTable. Default is not to allow both pkeys.
--qos, -Q
This option enables QoS setup.
--consolidate_ipv6_snm_req
Use shared MLID for IPv6 Solicited Node Multicast groups per MGID scope and P_Key.
--consolidate_ipv4_mask
Use mask for IPv4 multicast groups multiplexing per MGID scope and P_Key.
--pid_file
Specifies the file that contains the process ID of the opensm daemon. The default is /var/run/opensm.
This option sets the log verbosity level. A flags field must follow the -D option.
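Because the flags field is a bitmask, several verbosity classes can be combined by OR-ing their bits. The sketch below uses only the bit values that appear in this chapter's verbosity flag list for osmtest (0x08 DEBUG, 0x10 FUNCS, 0x20 FRAMES, 0x40 ROUTING), on the assumption that the same values apply here; the enum and function names are hypothetical and are not OpenSM identifiers.

```c
/* Verbosity bits as listed in this chapter (names are hypothetical). */
enum verbosity_bit {
    VERB_DEBUG   = 0x08,  /* diagnostic, high volume */
    VERB_FUNCS   = 0x10,  /* function entry/exit, very high volume */
    VERB_FRAMES  = 0x20,  /* dumps all SMP and GMP frames */
    VERB_ROUTING = 0x40   /* dump FDB routing information */
};

/* Combine the selected classes into one flags value. Each argument
 * is treated as a boolean: nonzero selects that class. */
unsigned verbosity_flags(int debug, int funcs, int frames, int routing)
{
    unsigned flags = 0;

    if (debug)
        flags |= VERB_DEBUG;
    if (funcs)
        flags |= VERB_FUNCS;
    if (frames)
        flags |= VERB_FRAMES;
    if (routing)
        flags |= VERB_ROUTING;
    return flags;
}
```

Selecting DEBUG and FRAMES, for example, gives 0x08 | 0x20 = 0x28 as the flags value.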
opensm stores certain data to the disk such that subsequent runs are consistent. The default directory used is /var/cache/opensm. The following file is included in it:
• guid2lid – stores the LID range assigned to each GUID
8.2.3 Signaling
When OpenSM receives a HUP signal, it starts a new heavy sweep as if a trap has been received or a topology change has been found.
Also, SIGUSR1 can be used to trigger a reopen of /var/log/opensm.log for logrotate purposes.
8.2.
Rev 2.1-1.0.6 • 8.3.
Rev 2.1-1.0.6 -s, -M, -t, -l, -v, -V -vf received from the SA during testing. If -i is not specified, osmtest defaults to the file osmtest.dat.
    0x08 - DEBUG (diagnostic, high volume)
    0x10 - FUNCS (function entry/exit, very high volume)
    0x20 - FRAMES (dumps all SMP and GMP frames)
    0x40 - ROUTING (dump FDB routing information)
    0x80 - currently unused.
-h, --help
8.3.2
where
    PartitionName       string, will be used with logging. When omitted, an empty string will be used.
    PKey                P_Key value for this partition. Only low 15 bits will be used. When omitted, P_Key will be autogenerated.
    flag                used to indicate IPoIB capability of this partition.
    defmember=full|limited
                        specifies default membership for port guid list. Default is limited.
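The statement that only the low 15 bits of the P_Key are used, together with the full/limited membership setting, can be illustrated with a few bit operations. In this sketch the most significant bit of the 16-bit value marks full membership, as noted earlier in this manual for PKeys such as 0x8003; the function names are hypothetical.

```c
#include <stdint.h>

/* Partition number: the low 15 bits of a 16-bit P_Key. */
uint16_t pkey_base(uint16_t pkey)
{
    return (uint16_t)(pkey & 0x7fff);
}

/* Nonzero when the most significant bit marks full membership. */
int pkey_is_full_member(uint16_t pkey)
{
    return (pkey & 0x8000) != 0;
}

/* Compose a P_Key from a partition number and a membership type. */
uint16_t pkey_make(uint16_t base, int full_member)
{
    return (uint16_t)((base & 0x7fff) | (full_member ? 0x8000 : 0));
}
```

For example, 0x8003 and 0x0003 name the same partition (3), the first with full and the second with limited membership.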
8.5 Routing Algorithms
OpenSM offers six routing engines:
1. “Min Hop Algorithm”
Based on the minimum hops to each node where the path length is optimized.
2. “UPDN Algorithm”
Based on the minimum hops to each node, but it is constrained to ranking rules. This algorithm should be chosen if the subnet is not a pure Fat Tree, and a deadlock may occur due to a loop in the subnet.
3.
Up/Down routing. Each port has a counter counting the number of target LIDs going through it. When there are multiple alternative ports with the same MinHop to a LID, the one with fewer previously assigned LIDs is selected.
If LMC > 0, more checks are added. Within each group of LIDs assigned to the same target port:
a. Use only ports which have the same MinHop
b. First prefer the ones that go to a different systemImageGuid (than the previous LID of the same LMC group)
c.
8.5.3 UPDN Algorithm
The UPDN algorithm is designed to prevent deadlocks from occurring in loops of the subnet. A loop-deadlock is a situation in which it is no longer possible to send data between any two hosts connected through the loop. As such, the UPDN routing algorithm should be used if the subnet is not a pure Fat Tree, and one of its loops may experience a deadlock (due, for example, to high pressure).
The UPDN algorithm is based on the following main stages:
1.
1. A valid guid file specifies one guid in each line. Lines with an invalid format will be discarded.
2. The user should specify the root switch guids. However, it is also possible to specify CA guids; OpenSM will use the guid of the switch (if it exists) that connects the CA to the subnet as a root node.
8.5.4 Fat-tree Routing Algorithm
The fat-tree algorithm optimizes routing for "shift" communication pattern.
8.5.4.1 Routing between non-CN Nodes
The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree. In such a case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes. In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CNs have routes to and from them, there will not necessarily be a route between N1, N2 and N3. Such routes would require using at least one of the switches the wrong way around.
LASH analyzes routes and ensures deadlock freedom between switch pairs. The link between HCA and switch does not need virtual layers, as deadlock will not arise between switch and HCA.
In more detail, the algorithm works as follows:
1. LASH determines the shortest path between all pairs of source / destination switches.
8.5.6 DOR Routing Algorithm
The Dimension Order Routing algorithm is based on the Min Hop algorithm and so uses shortest paths. Instead of spreading traffic out across different paths with the same shortest distance, it chooses among the available shortest paths based on an ordering of dimensions. Each port must be consistently cabled to represent a hypercube dimension or a mesh dimension.
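The idea of choosing among shortest paths by an ordering of dimensions can be shown with a small model. This is a generic sketch of dimension-order routing on a mesh without wraparound links, not OpenSM's implementation: explicit switch coordinates, the constant MESH_DIMS and the function name dor_next_hop are all assumptions made for illustration, whereas OpenSM derives the ordering from how the ports are cabled.

```c
#define MESH_DIMS 3  /* assumed number of mesh dimensions */

/* Pick the next hop from cur toward dst by correcting the lowest
 * differing dimension first (hypothetical helper). Returns the
 * dimension to travel along and sets *step to +1 or -1, or returns
 * -1 with *step = 0 when cur already equals dst. */
int dor_next_hop(const int cur[MESH_DIMS], const int dst[MESH_DIMS], int *step)
{
    int d;

    for (d = 0; d < MESH_DIMS; d++) {
        if (cur[d] != dst[d]) {
            *step = (dst[d] > cur[d]) ? 1 : -1;
            return d;
        }
    }
    *step = 0;
    return -1;
}
```

Because every switch applies the same dimension order, two packets between the same pair of coordinates always take the same shortest path, which is the property the paragraph above describes.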
Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches, torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0) per QoS level to provide deadlock-free routing on a 3D torus.
Torus-2QoS routes around link failure by "taking the long way around" any 1D ring interrupted by a link failure.
because they cannot be used to construct a loop encircling T. The hop I-r uses a separate VL, so it cannot contribute to a credit loop encircling T.
Extending this argument shows that in addition to being capable of routing around a single switch failure without introducing deadlock, torus-2QoS can also route around multiple failed switches on the condition they are adjacent in the last dimension routed by DOR.
not arise from a combination of multicast and unicast path segments. It turns out that it is possible to construct spanning trees for multicast routing that have that property. For the 2D 6x5 torus example above, here is the full-fabric spanning tree that torus-2QoS will construct, where "x" is the root switch and each "+" is a non-root switch:
For multicast traffic routed from root to tip, every turn in the above spanning tree is a legal DOR turn.
Two things are notable about this master spanning tree. First, assuming the x dateline was between x=5 and x=0, this spanning tree has a branch that crosses the dateline. However, just as for unicast, crossing a dateline on a 1D ring (here, the ring for y=2) that is broken by a failure cannot contribute to a torus credit loop. Second, this spanning tree is no longer optimal even for multicast groups that encompass the entire fabric.
occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension as configured does not match the radix of that torus dimension as wired, and many switches/links in the fabric will not be placed into the torus.
8.5.7.4 Quality Of Service Configuration
OpenSM will not program switches and channel adapters with SL2VL maps or VL arbitration configuration unless it is invoked with -Q.
8.5.7.6 Torus-2QoS Configuration File Syntax
The file torus-2QoS.conf contains configuration information that is specific to the OpenSM routing engine torus-2QoS. Blank lines and lines where the first non-whitespace character is "#" are ignored. A token is any contiguous group of non-whitespace characters. Any tokens on a line following the recognized configuration tokens described below are ignored.
eter for a dateline keyword moves the origin (and hence the dateline) the specified amount relative to the common switch in a torus seed.

next_seed

If any of the switches used to specify a seed were to fail, torus-2QoS would be unable to complete topology discovery successfully. The next_seed keyword specifies that the following link and dateline keywords apply to a new seed specification.
8.6 Quality of Service Management in OpenSM

8.6.1 Overview

When Quality of Service (QoS) in OpenSM is enabled (using the ‘-Q’ or ‘--qos’ flags), OpenSM looks for a QoS Policy file. During fabric initialization and at every heavy sweep, OpenSM parses the QoS policy file, applies its settings to the discovered fabric elements, and enforces the provided policy on client requests.
• Partition name, which means that all the ports in the subnet that belong to the partition with the given name belong to this port group
• Node type, where possible node types are: CA, SWITCH, ROUTER, ALL, and SELF (SM's port).
II) QoS Setup (denoted by qos-setup)

This section describes how to set up SL2VL and VL Arbitration tables on various nodes in the fabric. However, this is not supported in OFED. SL2VL and VLArb tables should be configured in the OpenSM options file (default location - /var/cache/opensm/opensm.opts).
8.6.4 Policy File Syntax Guidelines

• Leading and trailing blanks, as well as empty lines, are ignored, so the indentation in the example is just for better readability.
• Comments start with the pound sign (#) and are terminated by EOL.
• Any keyword should be the first non-blank token on the line, unless the line is a comment.
• Keywords that denote section/subsection start have matching closing keywords.
port-group
    name: Virtual Servers
    # The syntax of the port name is as follows:
    #   "node_description/Pnum".
    # node_description is compared to the NodeDescription of the node,
    # and "Pnum" is a port number on that node.
        sl: 1
        mtu-limit: 4
        rate-limit: 5
        pkey: 0x1234
        packet-life: 8
    end-qos-level
end-qos-levels

# Match rules are scanned in order of their appearance in the policy file.
# First matched rule takes precedence.
8.6.6 Simple QoS Policy - Details and Examples

Simple QoS policy match rules are tailored for matching the PR/MPR requests of ULPs (or of an application on top of a ULP). This section has a list of per-ULP (or per-application) match rules and the SL that should be enforced on the matched PR/MPR query.
8.6.6.4 SRP

The Service ID for SRP varies from one storage vendor to another, so an SRP query is matched by the target IB port GUID.
qos_swe_sl2vl 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,7

VL arbitration tables (both high and low) are lists of VL/Weight pairs. Each list entry contains a VL number (values from 0-14) and a weighting value (values 0-255), indicating the number of 64-byte units (credits) which may be transmitted from that VL when its turn in the arbitration occurs. A weight of 0 indicates that this entry should be skipped.
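As a rough illustration of the VL/Weight semantics described above, the following Python sketch parses a comma-separated list of VL:weight pairs and converts each weight into the number of bytes it represents per arbitration turn. The function names are hypothetical, and the sketch only checks the ranges stated in the text; it is not how OpenSM itself validates these options.

```python
CREDIT_BYTES = 64  # one weight unit is a 64-byte credit, per the text above


def parse_vlarb(spec):
    """Parse a 'VL:weight,VL:weight,...' string into (vl, weight) pairs."""
    entries = []
    for item in spec.split(","):
        vl_text, weight_text = item.split(":")
        vl, weight = int(vl_text), int(weight_text)
        if not 0 <= vl <= 14:
            raise ValueError(f"VL out of range 0-14: {vl}")
        if not 0 <= weight <= 255:
            raise ValueError(f"weight out of range 0-255: {weight}")
        entries.append((vl, weight))
    return entries


def bytes_per_turn(entries):
    """Return (vl, bytes) per entry, skipping entries with weight 0."""
    return [(vl, weight * CREDIT_BYTES) for vl, weight in entries if weight > 0]


table = parse_vlarb("0:96,1:224")
print(bytes_per_turn(table))  # prints [(0, 6144), (1, 14336)]
```

With the weights 96 and 224, VL0 may send 6144 bytes and VL1 may send 14336 bytes each time its entry comes up, a 3:7 split.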
Figure 5: Example QoS Deployment on InfiniBand Subnet

8.7 QoS Configuration Examples

The following are examples of QoS configuration for different cluster deployments. Each example provides the QoS level assignments and shows how they are administered via OpenSM configuration files.

8.7.
qos-ulps
    default                                   :0 # default SL (for MPI)
    any, target-port-guid OST1,OST2,OST3,OST4 :1 # SL for Lustre OST
    any, target-port-guid MDS1,MDS2           :2 # SL for Lustre MDS
end-qos-ulps

• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 2:1
qos_vlarb_low 0:96,1:224
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

8.7.
end-qos-ulps

• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:32
qos_vlarb_low 0:1
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

8.7.3 EDC (3-tier): IPoIB, RDS, SRP

The following is an example of QoS configuration for an enterprise data center (EDC), with IPoIB carrying all application traffic, RDS for database traffic, and SRP used for storage.
end-qos-ulps

• OpenSM options file
qos_max_vls 8
qos_high_limit 0
qos_vlarb_high 1:32,2:96,3:96,4:96
qos_vlarb_low 0:1
qos_sl2vl 0,1,2,3,4,5,6,7,15,15,15,15,15,15,15,15

• Partition configuration file
Default=0x7fff, ipoib : ALL=full;
PartA=0x8001, sl=1, ipoib : ALL=full;

8.8 Adaptive Routing

8.8.1 Overview

Adaptive Routing is at beta stage.

Adaptive Routing (AR) enables the switch to select the output port based on the port's load.
8.8.2 Installing the Adaptive Routing

Adaptive Routing Manager is a Subnet Manager plug-in, i.e. it is a shared library (libarmgr.so) that is dynamically loaded by the Subnet Manager. Adaptive Routing Manager is installed as part of the Mellanox OFED installation.

8.8.3 Running Subnet Manager with Adaptive Routing Manager

Adaptive Routing (AR) Manager can be enabled/disabled through the SM options file.

8.8.3.1 Enabling Adaptive Routing

To enable Adaptive Routing, perform the following:
1.
The Adaptive Routing mechanism is automatically disabled once the switch receives the settings of the usual linear forwarding table (LFT). Therefore, no action is required to clear the Adaptive Routing configuration on the switches if you do not wish to use Adaptive Routing.

8.8.4 Querying Adaptive Routing Tables

When Adaptive Routing is active, the content of the usual Linear Forwarding Table on the switch is invalid, thus the standard tools that query LFT (e.g.
8.8.5.1 General AR Manager Options

Table 20 - Adaptive Routing Manager Options File
Option File | Description | Values

ENABLE: Enable/disable Adaptive Routing on fabric switches. Note that if a switch was identified by AR Manager as a device that does not support AR, AR Manager will not try to enable AR on this switch.
SWITCH { ; ; ... }

The following are the per-switch options:

Table 21 - Adaptive Routing Manager Per-Switch Options File
Option File | Description

ENABLE: Allows you to enable/disable the AR on this switch. If the general ENABLE option value is set to 'false', then this per-switch option is ignored. This option can be changed on the fly. Default: true

AGEING_TIME: Applicable to bounded AR mode only.
8.9 Congestion Control

8.9.1 Congestion Control Overview

Congestion Control Manager is a Subnet Manager (SM) plug-in, i.e. it is a shared library (libccmgr.so) that is dynamically loaded by the Subnet Manager. Congestion Control Manager is installed as part of the Mellanox OFED installation.

The Congestion Control mechanism controls traffic entry into a network and attempts to avoid oversubscription of any of the processing or link capabilities of the intermediate nodes and networks.
To turn CC OFF, set 'enable' to 'FALSE' in the Congestion Control Manager configuration file, and run OpenSM once with this configuration.

For the full list of CC Manager options with all the default values, see “Configuring Congestion Control Manager” on page 185.

For further details on the list of CC Manager options, please refer to the IB spec.

8.9.
• When the number of send/receive errors or timeouts exceeds 'max_errors' in less than 'error_window' seconds, the CC MGR aborts and allows OpenSM to proceed. To do so, set the following parameters:
max_errors
error_window
• The values are:
max_errors = 0: zero tolerance - abort configuration on first error
error_window = 0: mechanism disabled - no error checking.[0-48K]
• The default is: 5

8.9.4.
Table 24 - Congestion Control Manager CA Options File
Option File | Description | Values

ca_control_map: An array of sixteen bits, one for each SL. Each bit indicates whether or not the corresponding SL entry is to be modified. Values: 0xffff
ccti_increase: Sets the CC Table Index (CCTI) increase. Default: 1
trigger_threshold: Sets the trigger threshold. Default: 2
ccti_min: Sets the CC Table Index (CCTI) minimum.
9 InfiniBand Fabric Diagnostic Utilities

9.1 Overview

The diagnostic utilities described in this chapter provide means for debugging the connectivity and status of InfiniBand (IB) devices in a fabric.

9.2 Utilities Usage

This section first describes common configuration, interface, and addressing for all the tools in the package. Then it provides detailed descriptions of the tools themselves including: operation, synopsis and options descriptions, error codes, and examples.

9.2.
9.2.3 Addressing

This section applies to the ibdiagpath tool only.

A tool command may require defining the destination device or port to which it applies. The following addressing modes can be used to define the IB ports:
• Using a Directed Route to the destination: (Tool option ‘-d’) This option defines a directed route of output port numbers from the local port to the destination.
Options
-i|--device : Specifies the name of the device of the port used to connect to the IB fabric (in case of multiple devices on the local system).
-p|--port : Specifies the local device's port number used to connect to the IB fabric.
-g|--guid : Specifies the local port GUID value of the port used to connect to the IB fabric. If the GUID given is 0, then ibdiagnet displays a list of possible port GUIDs and waits for user input.
--ber_test : Provides a BER test for each port. Calculates the BER for each port and checks that no BER value exceeds the BER threshold (default threshold="10^-12").
--ber_use_data : Indicates that the BER test will use the received data for calculation.
--ber_thresh : Specifies the threshold value for the BER test. The reciprocal of the BER should be provided. Example: for 10^-12, the value needs to be 1000000000000 or 0xe8d4a51000 (10^12).
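The reciprocal that --ber_thresh expects can be worked out with a few lines of Python. This is only a convenience sketch for a threshold of the form 10^-n; the helper name ber_thresh_value is hypothetical and is not part of ibdiagnet.

```python
def ber_thresh_value(exponent):
    """Return the reciprocal of a BER threshold of 10^-exponent."""
    if exponent < 0:
        raise ValueError("exponent must be non-negative")
    return 10 ** exponent


value = ber_thresh_value(12)
print(value)       # prints 1000000000000
print(hex(value))  # prints 0xe8d4a51000
```

Both printed forms match the decimal and hexadecimal values quoted in the option description above.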
Table 26 - ibdiagnet (of ibutils2) Output Files
Output File
ibdiagnet2.
Options
-c Min number of packets to be sent across each link (default = 10)
-v Enable verbose mode
-r Provides a report of the fabric qualities
-t Specifies the topology file name
-s Specifies the local system name.
Table 27 - ibdiagnet (of ibutils) Output Files
Output File | Description

ibdiagnet.fdbs      A dump of the unicast forwarding tables of the fabric switches
ibdiagnet.mcfdbs    A dump of the multicast forwarding tables of the fabric switches
ibdiagnet.masks     In case of duplicate port/node GUIDs, this file includes the map between masked GUIDs and real GUIDs
ibdiagnet.sm        List of all the SMs (state and priority) in the fabric
ibdiagnet.
Error Codes
1 - Failed to fully discover the fabric
2 - Failed to parse command line options
3 - Failed to interact with IB fabric
4 - Failed to use local device or local port
5 - Failed to use Topology File
6 - Failed to load required Package

9.5 ibdiagpath - IB diagnostic path

ibdiagpath traces a path between two end-points and provides information regarding the nodes and ports traversed along the path.
Options
-n <[src-name,]dst-name> Names of the source and destination ports (as defined in the topology file; source may be omitted -> local port is assumed to be the source)
-l <[src-lid,]dst-lid> -d -c -v -t -s -i -p -o -lw <1x|4x|12x> -ls <2.
Error Codes
1 - The path traced is unhealthy
2 - Failed to parse command line options
3 - More than 64 hops are required for traversing the local port to the "Source" port and then to the "Destination" port
4 - Unable to traverse the LFT data from source to destination
5 - Failed to use Topology File
6 - Failed to load required Package

9.6 ibv_devices

Lists InfiniBand devices available for use from userspace, including node GUIDs.
Table 29 - ibv_devinfo Flags and Options
Flag          Optional / Mandatory   Default (If Not Specified)   Description
-l --list     Optional               Inactive                     Only list the names of InfiniBand devices
-v --verbose  Optional               Inactive                     Print all available information about the InfiniBand device(s)

Examples
1. List the names of all available InfiniBand devices.
> ibv_devinfo -l
2 HCAs found:
        mthca0
        mlx4_0
2. Query the device mlx4_0 and print user-available information for its Port 2.
Options
-v Enable verbose mode. Adds additional information such as: Device ID, Part Number, Card Name, Firmware version, IB port state.
-h Print help messages.

Example:
sw417:~/BXOFED-1.5.2-20101128-1524 # ibdev2netdev -v
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 (Down)
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.9288 (Down)
mlx4_0 (MT26428 - MT1006X00034) FALCON QDR fw 2.7.
Table 30 - ibstatus Flags and Options
Flag Optional / Mandatory Optional, but requires specifying a device name Default (If Not Specified) All ports of the specified device Description Print information for the specified port only (of the specified device)

Examples
1. List the status of all available InfiniBand devices and their ports.
2. List the status of specific ports of specific devices.
> ibstatus mthca0:1 mlx4_0:2
Infiniband device 'mthca0' port 1 status:
        default gid:     fe80:0000:0000:0000:0002:c900:0101:d151
        base lid:        0x0
        sm lid:          0x0
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            10 Gb/sec (4X)

Infiniband device 'mlx4_0' port 2 status:
        default gid:     fe80:0000:0000:0000:0000:0000:0007:3897
        base lid:        0x1
        sm lid:          0x1
        state:           4: ACTIVE
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

9.
Table 31 - ibportstate Flags and Options (Continued)
Flag | Optional / Mandatory | Default (If Not Specified) | Description

-v(erbose)   Optional   Increase verbosity level. May be used several times for additional verbosity (-vvv or -v -v -v)
-V(ersion)   Optional   Show version info
-D(irect)    Optional   Use directed path address arguments. The path is a comma separated list of out ports.
                        Examples:
                        ‘0’ – self port
                        ‘0,1,2,1,4’ – out via port 1, then 2, ...
1. Query the status of Port 1 of CA mlx4_0 (using ibstatus) and use its output (the LID – 3 in this case) to obtain additional link information using ibportstate.
> ibstatus mlx4_0:1
Infiniband device 'mlx4_0' port 1 status:
        default gid:     fe80:0000:0000:0000:0000:0000:9289:3895
        base lid:        0x3
        sm lid:          0x3
        state:           2: INIT
        phys state:      5: LinkUp
        rate:            20 Gb/sec (4X DDR)

> ibportstate -C mlx4_0 3 1 query
PortInfo:
# Port info: Lid 3 port 1
LinkState:...........
LinkSpeedActive:.................2.5 Gbps

3. Change the speed of a port.
# First query for current configuration
> ibportstate -C mlx4_0 -D 0 1
PortInfo:
# Port info: DR path slid 65535; dlid 65535; 0 port 1
LinkState:.......................Initialize
PhysLinkState:...................LinkUp
LinkWidthSupported:..............1X or 4X
LinkWidthEnabled:................1X or 4X
LinkWidthActive:.................4X
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkSpeedEnabled:.............
Synopsis
ibroute [-h] [-d] [-v] [-V] [-a] [-n] [-D] [-G] [-M] [-s ] \
        [-C ] [-P ] [ -t ] \
        [ [ []]]

Output Files
Table 32 lists the various flags of the command.

Table 32 - ibroute Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description

-h(help)    Optional   Print the help menu
-d(ebug)    Optional   Raise the IB debug level.
Table 32 - ibroute Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description

-t Optional Override the default timeout for the solicited MADs [msec]
Optional Destination’s directed path, LID, or GUID
Optional Starting LID in an MLID range
Optional Ending LID in an MLID range

Examples
1. Dump all Lids with valid out ports of the switch with Lid 2.
Unicast lids [0x3-0x7] of switch Lid 2 guid 0x0002c902fffff00a (MT47396 Infiniscale-III Mellanox Technologies):
  Lid  Out   Destination
       Port     Info
0x0003 021 : (Switch portguid 0x000b8cffff004016: 'MT47396 Infiniscale-III Mellanox Technologies')
0x0006 007 : (Channel Adapter portguid 0x0002c90300001039: 'sw137 HCA-1')
0x0007 021 : (Channel Adapter portguid 0x0002c9020025874a: 'sw157 HCA-1')
3 valid lids dumped

4.
9.12 smpquery

Provides a basic subset of standard SMP queries to query subnet management attributes such as node info, node description, switch info, and port info.

Synopsis
smpquery [-h] [-d] [-e] [-v] [-D] [-G] [-s ] [-V] [-C ] [-P ] [-t ] [--node-name-map ] [op params]

Output Files
Table 33 lists the various flags of the command.
LinkSpeedSupported:..............2.5 Gbps or 5.0 Gbps
LinkState:.......................Active
PhysLinkState:...................LinkUp
LinkDownDefState:................Polling
ProtectBits:.....................0
LMC:.............................0
LinkSpeedActive:.................5.0 Gbps
LinkSpeedEnabled:................2.5 Gbps or 5.0 Gbps
NeighborMTU:.....................2048
SMSL:............................0
VLCap:...........................VL0-7
InitType:........................
LifeTime:........................18
StateChange:.....................0
LidsPerPort:.....................0
PartEnforceCap:..................32
InboundPartEnf:..................1
OutboundPartEnf:.................1
FilterRawInbound:................1
FilterRawOutbound:...............1
EnhancedPort0:...................0

3. Query NodeInfo by direct route.
> smpquery -D nodeinfo 0
# Node info: DR path slid 65535; dlid 65535; 0
BaseVers:.......................
Table 34 - perfquery Flags and Options
Flag | Optional / Mandatory | Default (If Not Specified) | Description

-G(uid)   Optional   Use GUID address argument. In most cases, it is the Port GUID.
RcvSwRelayErrors:................0
XmtDiscards:.....................0
XmtConstraintErrors:.............0
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................55178210
RcvData:.........................55174680
XmtPkts:.........................766366
RcvPkts:.........................766315

2.
RcvConstraintErrors:.............0
LinkIntegrityErrors:.............0
ExcBufOverrunErrors:.............0
VL15Dropped:.....................0
XmtData:.........................0
RcvData:.........................0
XmtPkts:.........................0
RcvPkts:.........................0

9.14 ibcheckerrs

Validates an IB port (or node) and reports errors in counters above threshold. It checks the specified port (or node) and reports the errors that surpassed their predefined threshold.
> ibcheckerrs -v -T thresh1 2 1
Error check on lid 2 (MT47396 Infiniscale-III Mellanox Technologies) port 1: OK

9.15 mstflint

Queries and burns a binary firmware-image file on non-volatile (Flash) memories of Mellanox InfiniBand and Ethernet network adapters. The tool requires root privileges for Flash access.

If you purchased a standard Mellanox Technologies network adapter card, please download the firmware image from www.mellanox.com > Downloads > Firmware.
Table 36 - mstflint Switches (Sheet 2 of 3)
Switch | Affected/Relevant Commands | Description

-mac    burn, sg   MAC address base value. Two MACs are automatically assigned to the following values:
                   mac -> port1
                   mac+1 -> port2
                   Note: This switch is applicable only for Mellanox Technologies Ethernet products.
-macs   burn, sg   Two MACs must be specified here. The specified MACs are assigned to port1 and port2, respectively.
Table 36 - mstflint Switches (Sheet 3 of 3)
Switch | Affected/Relevant Commands | Description

-vsd            burn   Write this string of up to 208 characters to VSD upon a burn command.
-use_image_ps   burn   Burn VSD as it appears in the given image - do not keep the existing VSD on Flash.
-dual_image     burn   Make the burn process burn two images on Flash. The current default failsafe burn process burns a single image (in alternating locations).
Possible command return values are:
0 - successful completion
1 - error has occurred
7 - the burn command was aborted because firmware is current

Examples
1. Find Mellanox Technologies' ConnectX® VPI cards with PCI Express running at 2.5GT/s and InfiniBand ports at DDR or Ethernet ports at 10GigE.
> /sbin/lspci -d 15b3:634a
04:00.0 InfiniBand: Mellanox Technologies MT25418 [ConnectX IB DDR, PCIe 2.0 2.5GT/s] (rev a0).
9.16 ibv_asyncwatch

Displays asynchronous events forwarded to userspace for an InfiniBand device.

Synopsis
ibv_asyncwatch

Examples
1. Display asynchronous events.
> ibv_asyncwatch
mlx4_0: async event FD 4

9.17 ibdump

Dumps InfiniBand traffic that flows to and from the InfiniBand ports of Mellanox Technologies ConnectX® family adapters. The dump file can be loaded by the Wireshark tool for graphical traffic analysis.
Output Files
-d, --ib-dev=       use RDMA device (default first device found). The relevant devices can be listed by running the 'ibv_devinfo' command.
-i, --ib-port=      use port of IB device (default 1)
-w, --write=        dump file name (default "sniffer.pcap"). '-' stands for stdout - enables piping to tcpdump or tshark.
-o, --output=
-b, --max-burst= l
-s, --silent
--mem-mode
--decap
-h, --help
-v, --version
Appendix A: Mellanox FlexBoot

A.1 Overview

Mellanox FlexBoot is a multiprotocol remote boot technology. FlexBoot supports remote Boot over InfiniBand (BoIB) and over Ethernet. Using Mellanox Virtual Protocol Interconnect (VPI) technologies available in ConnectX® adapters, FlexBoot gives IT managers the choice to boot from a remote storage target (iSCSI target) or a LAN target (Ethernet Remote Boot Server) using a single ROM image on Mellanox ConnectX products.
You need to install the Mellanox Firmware Tools (MFT) package (version 3.0.0 or later) in order to burn the PXE ROM image. To download MFT, see Firmware Tools under www.mellanox.com > Products > InfiniBand/VPI Drivers > Firmware Tools.

Image Burning Procedure

To burn the composite image, perform the following steps:
1. Obtain the MST device name. Run:
# mst start
# mst status
The device name will be of the form: mt_pci{_cr0|conf0}.1
2. Create and burn the composite image.
The value of the client identifier is composed of a prefix (ff:00:00:00:00:00:02:00:00:02:c9:00) and an 8-byte port GUID, all separated by colons and represented in hexadecimal digits.

Extracting the Port GUID – Method I

To obtain the port GUID:
Step 1. Start mst.
host1# mst start
host1# mst status
The following MFT commands assume that the Mellanox Firmware Tools (MFT) package has been installed on the client machine.
Step 2. Obtain the Port GUID using the device name.
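Once a port GUID is known, the client identifier format described at the start of this section (fixed prefix followed by the 8-byte port GUID, colon-separated hexadecimal) can be assembled mechanically. The Python sketch below shows one way to do this; the function name dhcp_client_identifier is hypothetical, and the GUID used is the one that appears in the ibroute example output in this manual, chosen here only for illustration.

```python
CLIENT_ID_PREFIX = "ff:00:00:00:00:00:02:00:00:02:c9:00"


def dhcp_client_identifier(port_guid):
    """Build the client identifier from a port GUID given as hex text.

    Accepts forms such as '0x0002c90300001039' or '0002:c903:0000:1039'.
    """
    digits = port_guid.lower().replace(":", "")
    if digits.startswith("0x"):
        digits = digits[2:]
    if len(digits) != 16 or any(c not in "0123456789abcdef" for c in digits):
        raise ValueError(f"expected 16 hex digits, got {port_guid!r}")
    # Split the 8-byte GUID into two-digit groups, one per byte.
    guid_bytes = [digits[i:i + 2] for i in range(0, 16, 2)]
    return CLIENT_ID_PREFIX + ":" + ":".join(guid_bytes)


print(dhcp_client_identifier("0x0002c90300001039"))
# prints ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39
```

The resulting string is the value that goes into the dhcp-client-identifier option of the DHCP server configuration.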
option dhcp-client-identifier = ff:00:00:00:00:00:02:00:00:02:c9:00:00:02:c9:03:00:00:10:39;
}

A.5 Subnet Manager – OpenSM

This section applies to ports configured as InfiniBand only.

FlexBoot requires a Subnet Manager to be running on one of the machines in the IB network. OpenSM is part of the Mellanox OFED for Linux software package and can be used to accomplish this. Note that OpenSM may be run on the same host running the DHCP server but it is not mandatory.
A.7.2 Starting Boot

Boot the client machine and enter BIOS setup to configure “MLNX FlexBoot” to be the first on the boot device priority list – see Section A.6.

On dual-port network adapters, the client first attempts to boot from Port 1. If this fails, it switches to boot from Port 2. Note also that the driver waits up to 90 seconds for each port to come up.

If MLNX FlexBoot/iPXE was selected through BIOS setup, the client will boot from FlexBoot.
A.8 Diskless Machines

Mellanox FlexBoot supports booting diskless machines. To enable using an IB/ETH driver, the initrd image must include a device driver module and be configured to load that driver. This can be achieved by adding the device driver module into the initrd image and loading it.

The ‘initrd’ image of some Linux distributions, such as SuSE Linux Enterprise Server and Red Hat Enterprise Linux, cannot be edited prior to or during the installation process.
4. To add an IB driver into initrd, you need to copy the IB modules to the diskless image. Your machine needs to be pre-installed with a Mellanox OFED for Linux ISO image that is appropriate for the kernel version the diskless image will run.

Adding the IB Driver to the initrd File

The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process.
Step 7. If you plan to give your IB device a static IP address, then copy ifconfig. Otherwise, skip this step.
host1$ cp /sbin/ifconfig /tmp/initrd_ib/sbin
Step 8. If you plan to obtain an IP address for the IB device through DHCP, then you need to copy the DHCP client which was compiled specifically to support IB; otherwise, skip this step.
To continue with this step, DHCP client v3.1.3 needs to be already installed on the machine you are working with.
Copy the DHCP client v3.1.
/sbin/insmod /lib/modules/ib/rdma_cm.ko
/sbin/insmod /lib/modules/ib/rdma_ucm.ko
/sbin/insmod /lib/modules/ib/mlx4_core.ko
/sbin/insmod /lib/modules/ib/mlx4_ib.ko
/sbin/insmod /lib/modules/ib/ib_mthca.ko

The following command (loading ipoib_helper.ko) is not required for all OS kernels. Please check the release notes.

/sbin/insmod /lib/modules/ib/ipoib_helper.ko
/sbin/insmod /lib/modules/ib/ib_ipoib.ko
Step 11.
4. To add an Ethernet driver into initrd, you need to copy the Ethernet modules to the diskless image. Your machine needs to be pre-installed with a MLNX_EN Linux Driver that is appropriate for the kernel version the diskless image will run.

Adding the Ethernet Driver to the initrd File

The following procedure modifies critical files used in the boot procedure. It must be executed by users with expertise in the boot process.
Step 10. Close initrd.
host1$ cd /tmp/initrd_en
host1$ find ./ | cpio -H newc -o > /tmp/new_initrd_en.img
host1$ gzip /tmp/new_initrd_en.img
At this stage, the modified initrd (including the Ethernet driver) is ready and located at /tmp/new_initrd_en.img.gz. Copy it to the original initrd location and rename it properly.

A.9 iSCSI Boot

Mellanox FlexBoot enables an iSCSI-boot of an OS located on a remote iSCSI Target.
Appendix B: SRP Target Driver

The SRP Target driver is designed to work directly on top of OpenFabrics OFED software stacks (http://www.openfabrics.org) or the InfiniBand drivers in the Linux kernel tree (kernel.org). It also interfaces with the Generic SCSI target mid-level driver - SCST (http://scst.sourceforge.net).

By interfacing with an SCST driver, it is possible to work with and support many I/O modes on real or virtual devices in the back end.
1. scst_vdisk – fileio and blockio modes.
The scst_disk module (pass-thru mode) of SCST is not supported by Mellanox OFED.

Example 1: Working with VDISK BLOCKIO mode (Using the md0 device, sda, and cciss/c1d0)
a. modprobe scst
b. modprobe scst_vdisk
c. echo "open vdisk0 /dev/md0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
d. echo "open vdisk1 /dev/sda BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
e. echo "open vdisk2 /dev/cciss/c1d0 BLOCKIO" > /proc/scsi_tgt/vdisk/vdisk
f. echo "add vdisk0 0" >/proc/scsi_tgt/groups/Default/devices
g.
1. Run: modprobe ib_srp
2. Run: ibsrpdm -c -d /dev/infiniband/umadX (to discover a new SRP target)
   umad0: port 1 of the first HCA
   umad1: port 2 of the first HCA
   umad2: port 1 of the second HCA
3. echo {new target info} > /sys/class/infiniband_srp/srp-mthca0-1/add_target
4. fdisk -l (will show the newly discovered scsi disks)

Example: Assume that you use port 1 of first HCA in the system, i.e.
echo "add "mgmt"" > /proc/scsi_tgt/trace_level
echo "add "mgmt_dbg"" > /proc/scsi_tgt/trace_level
echo "add "out_of_mem"" > /proc/scsi_tgt/trace_level
*********************** End srpt.sh ****************************

B.3 How-to Unload/Shutdown

1. Unload ib_srpt
$ modprobe -r ib_srpt
2. Unload scst and its dev_handlers first
$ modprobe -r scst_vdisk scst
3. Unload ofed
$ /etc/rc.
Appendix C: mlx4 Module Parameters

In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.conf:
options mlx4_core parameter=
and/or
options mlx4_ib parameter=
and/or
options mlx4_en parameter=

The following sections list the available mlx4 parameters.

C.1 mlx4_ib Parameters

sm_guid_assign: Enable SM alias_GUID assignment if sm_guid_assign > 0 (Default: 1) (int)
dev_assign_str1: Map device function numbers to IB device numbers (e.g.'0000:04:00.
log_num_mgm_entry_size: log mgm size, that defines the num of qp per mcg, for example: 10 gives 248. range: 7 <= log_num_mgm_entry_size <= 12.
high_rate_steer:
fast_drop:
enable_64b_cqe_eqe:
log_num_mac:
log_num_vlan:
log_mtts_per_seg:
port_type_array:
log_num_qp:
log_num_srq:
log_rdmarc_per_qp:
log_num_cq:
log_num_mcg:
log_num_mpt:
log_num_mtt:
enable_qos:
internal_err_reset:

C.3 mlx4_en Parameters

inline_thold:
udp_rss:
pfctx:
pfcrx:
Appendix D: mlx5 Module Parameters

The mlx5_ib module supports a single parameter used to select the profile which defines the number of resources supported. The parameter name for selecting the profile is prof_sel.
Appendix E: Lustre Compilation over MLNX_OFED

This procedure applies to RHEL/SLES OSs only.

To compile Lustre version 2.3.65 and higher:
$ ./configure --with-o2ib=/usr/src/ofa_kernel/default/
$ make rpms

To compile older Lustre versions:
$ EXTRA_LNET_INCLUDE="-I/usr/src/ofa_kernel/default/include/ -include /usr/src/ofa_kernel/default/include/linux/compat-2.6.h" .