Rdma Aware Networks Programming User Manual
Monitoring for RDMA-Enabled Network
Introduction
Basic knowledge on RDMA programming. (Available RDMA communication operations, RDMA transport modes, VPI Verbs API, etc.) Developing experience on RDMA-driven system. (optional) Basic knowledge on Sketch data structure. (optional) Useful References 1 RDMA Aware Networks Programming User Manual. RDMA Aware Networks Programming User Manual Rev 1.7. A Tutorial of the RDMA Model. Gold Sponsors. Silver Sponsors. Supporting Organizations ©2019 HPC Advisory.
As an emerging high-speed network technology, RoCE network has been widely deployed in data centers to dramatically provide high throughput, ultra-low latency and lower CPU overhead. Many research works focus on RoCE-driven distributed system, transactions mechanisms and network protocol optimization in industry and academia. However, most RDMA applications can only guarantee high performance in resource-isolated environments. In a data center or cluster where multiple RDMA applications coexist, efficient and real-time network load measurement, monitoring and balancing from a global perspective for RDMA-enabled network is a big challenge to help to boost RoCE networking performance, efficiency, scalability, reliability, and so on. There is no doubt that the challenge mentioned above is an important topic in 'Measuring and debugging real network systems'.
Problems
Problem-1(Easy): Collect L1/L2 RoCE network information.
Build a framework for collecting various RoCEv1/v2 network traces and replaying the RoCE traces data without real jobs. This will help researchers reproduce the network traffics in a datacenter. Then the network workloads can be located and analyzed for futher load balancing.
Problem-2(Advanced): RDMA-Enabled Traffic Generator for RDMA network experiments.
Build a RoCE-based traffic generator which can generate dynamic, diverse and representative RoCEv1/v2 traffics for RDMA network experiments. This traffic generator contains two critical components: server and client.
The server listens for incoming requests, and replies with a flow with the requested size for each request.
According to a user-defined configuration file, the client establishes multi-types (including RC/UC/UD) RDMA connections and randomly generates two-sided (RDMA SEND/RECV) or one-sided (RDMA WRITE/READ) requests with different sizes at the specified requested rate over RDMA connections.
In the configuration file, the user can specify the list of destination servers with RNICs, the request size distribution, the sending rate distribution with different RDMA verbs, the multi-type (RC/UC/UD) RDMA connections distribution, the different RDMA verbs invoking distribution and so on.
Problem-3(Hard): RDMA Packet based Sketch algorithm for Measurement.
Build a Sketch-based measurement module or framework for RDMA-enabled network. As the mainstream of network measurement research, high-speed traffic network measurement based on sketch is a research hotspot in recent years. The **sketch-based ** high-speed traffic network measurement can detect large stream and abnormal stream without occupying too much computing and space (cache/memory) resources. Sketch is a hash-based data structure which stores traffic characteristics information in a real-time manner for a high-speed network environment. It occupies only small space resources, and has theoretically provable balance of estimation accuracy and memory. Therefore, can we apply the Sketch-based measurement to an RDMA-enabled network for workload monitoring? This is an interesting challenge.
Technical Skills Required
Participants are required to have:
- Basic RDMA (Remote Direct Memory Access) background. (Kernel-bypassing networking, RoCEv1/v2, RDMA verbs, RDMA key components, HCA, L1/L2, control-/data-plane, throughput, RTT)
- Solid C/C++ programming. (memory allocator, C++ template, multiple threads, network, lock, etc.)
- Linux and shell commands.
Participants are encouraged to have the following skills or experience:
- Basic knowledge on RDMA programming. (Available RDMA communication operations, RDMA transport modes, VPI Verbs API, etc.)
- Developing experience on RDMA-driven system. (optional)
- Basic knowledge on Sketch data structure. (optional)
Useful References
[1] RDMA Aware Networks Programming User Manual Rev 1.7. http://www.mellanox.com/related-docs/prod_software/RDMA_Aware_Programming_user_manual.pdf
[2] A Tutorial of the RDMA Model. http://www.hpcwire.com/hpcwire/2006-09-15/a_tutorial_of_the_rdma_model-1.html
[3] RDMA mojo : blog on RDMA technology and programming. http://www.rdmamojo.com/
[4] Cormode G, Muthukrishnan S. An improved data stream summary: the count-min sketch and its applications. [J] Springer Journal of Algorithm, 2004, LNCS 2976,pp. :29-38.
[5] Huang Q, Jin X, P. C. Lee P, et al. SketchVisor: Robust Network Measurement for Software Packet Processing[C]//Proceedings of the 2017 ACM SIGCOMM Conference, 2017: 113-126.
[6] Huang Q , P. C. Lee P, Bao Y. SketchLearn: Relieving User Burdens in Approximate Measurement with Automated Statistical Inference[C] //Proceedins of the 2018 ACM SIGCOMM Conference, 2018: 576-590.
[7] Yang T, Jiang J, Liu P, et al. Elastic Sketch: Adaptive and Fast Network-wide Measurements[C] //Proceeding of the 2018 ACM SIGCOMM Conference, 2018:561-575
[8] YU M, JOSE L, MIAO R. Software defined traffic measurement with OpenSketch[C]//Presented as Part of the 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). 2013: 29-42.
[9] Huang Q, P. C. Lee P. A hybrid local and distributed sketching design for accurate and scalable heavy key detection in network data. streams.[C]//Proceeding of the 2014 IEEE INFOCOMM conference. 2014: 298-315.
Free philips television manual. To view the updated on-screen user manual, download and install the latest User Manual Upgrade software from the ‘User manuals’ section of the Philips support page. Follow the instructions in the Release Notes of the User Manual Up grade on how to Up date the TV on-screen user manual. Philips Customer Care is your resource for product assistance including manuals, FAQ's and software updates.
[10] Dai M, Cheng G. Sketch-based data plane hardware model for software-defined measurement [J] Journal on Communications, 2017, 38 (10): 20172031-20172039.
[11] RDMA Samples. https://github.com/tarickb/the-geek-in-the-corner
Contact us
RDMA (Remote Direct Memory Access) is an emerging networking technology promising higher throughput and lowerlatency compared to traditional TCP/IP based networks. RDMA has been widely used in high performance computing (HPC), and now becomes more popular in today's datacenter environment. This tutorial explains some of the basic concepts of RDMA based programming, and it also covers some commonly applied optimization techniques.
Useful links
This tutorial servers as a brief summary of the contents from the following links, which have more details and code examples:
- Gives great explanations of RDMA interfaces and code snippets on how to use them.
- Example code demonstrating the latest software/hardware features can be found within libibverbs*/examplesand perftest*
Mellanox OFED manual
- Google search for the latest version
RDMA basic concepts
Queue Pair
To draw an analogy from everyday mail service, queue pair (QP) defines the address of the communication endpoints, or equivalently, sockets in traditional socket based programming. Each communication endpoint needs to create aQP in order to talk to each other.
There are three different types of QP: RC (Reliable Connection), UC (Unreliable Connection) and UD (Unreliable Datagram). A more recent optimization from Mellanox introduces DCT (Dynamically Connected Transport) to solve the QP scalability problem.
If QP is connected (RC or UC), each QP can only talk to ONE other QP. Otherwise, if QP is created as UD or DCT, the QP is able to talk to ANY other QPs. A more detailed discussion on how to choose the type of QP can befound later in Choice of QP types.
Verbs
In RDMA based programming, verb is a term that defines the types of communication operations.There are two different communication primitives: channel semantics (send/receive) and memory semantics(read/write). If we only consider how data is delivered to the other end, channel semantics involves bothcommunication endpoints: the receiver needs to pre-post receives and the sender posts sends; while memorysemantics only involves one side of the communication endpoint: the sender can write the data directlyto the receiver's memory region, or the receiver can read from the target's memory region without notifyingthe target.
Generally speaking, memory semantics has less overhead compared to channel semantics and thushas higher raw performance; On the other hand, channel semantics involves less programming effort.
Example 1: use RDMA send/recv
git checkout 65893ec
We start off with a simple example only involves two nodes: a server and a client. Messages are ping-ponged between the server and client, while the client initiates the sending of the messages. This simple example is used to demonstrate the common process of setting up connection using IB. (In this tutorial, we use RDMA and IB interchangeably since IB, short for Infiniband, is one of the most popular hardware implementation of RDMA technology). In this first example, we use RC QP and Send/Receive verbs.
Common steps to setup IB connection
Check the implementation of function setup_ib
in 'setup_ib.c' for more details.
Step 1: Get IB context
struct ibv_context
First, we need to get IB device list by calling
ibv_get_device_list
; Then we can get IB context by calling:ibv_open_device
Step 2: Allocate IB protection domain
struct ibv_pd
Protection domain will later be used to register IB memory region and create Queue Pairs. Protection domain is allocated by calling:
ibv_alloc_pd
Step 3: Register IB memory region
struct ibv_mr
IB memory region is used to store messages exchanged among nodes. Older version of hardware requires these memory regions to be pinned (thus, cannot be swapped out by OS); however, newer hardware that supports on demand paging relaxes such restriction. Memory region is managed by the programmer. The programmer needs to make sure the old message that has not been processed, does not get overwritten by the newly incoming message. Memory region is registered by calling
ibv_reg_mr
Step 4: Create Completion Queue
struct ibv_cq
Completion Queue (CQ) is commonly used along with channel semantics (send/receive). It provides away to notify the programmer that the operation is complete. Completion queue can be created using
ibv_create_cq
Step 5: Create Queue Pair
struct ibv_qp
Queue Pair (QP) can be created by calling
ibv_create_qp
Step 6: Query IB port attribute
struct ibv_port_attr
IB port attribute can be obtained by calling
ibv_query_port
. The fieldlid
from port attribute will later be used to connect QPsStep 7: Connect QP
Now QPs are created, they need to know whom they are talking to. IB needs to use out-of-band communication to connect QP. Commonly, Sockets are used to exchange QP related information. In the case of channel semantics (send/recv), only two pieces of information need to be exchanged in order to connect QP:
lid
fromstruct ibv_port_attr
andqp_num
fromstruct ibv_qp
. Afterlid
andqp_num
has been received, QP state can be changed from INIT to RTR (ready to receive) and then to RTS (ready to send) by callingibv_modify_qp
minor tweaks
git checkout e1570c6
Default setup for Example 1 is to use 64B message and ping-pong one message a time. Now,we modify the command line arguments to introduce two more input parameters: msg_size and num_concurr_msgs. The meaning of msg_size is straightforward. The other parameter, num_concurr_msgs, defines how many messages can be sent at the same time.
As shown in figure above, generally speaking, by using small message size and large number of concurrent messages, it is more likely to saturate the NIC hardware (in terms of operations per second, NOT bytes per second. To have a higher bandwidth number (B/s), large message size should be used), and hence, has a higher throughput number. The downside of having a large number of concurrent messages is that it would potentially consumes more memory.
Example 2: use RDMA write
git checkout f73736e
In this example, we demonstrate how to implement the same echo benchmark as example 1 but using RDMA write. Now the sender needs to know where to write the message on the receiver's end. As a result, two more pieces of information need to be exchanged during QP connection: rkey and raddr. rkey is the key of the receiver's memory region, and raddr is the receiver's memory address the sender will write the message to.
In the previous example, the incoming messages are detected by polling the completion queue: a completed event for a posted receive indicates a newly incoming message. Since RDMA write is one-sided communication primitive, such that the receiver is not notified for the incoming messages. In this example, we demonstrate a different way of detecting incoming messages. We assume all messages are of the same content: 'AA.A', whose length is determined by msg_size. For instance, to detect the first incoming message, one needs to poll two memory locations: raddr and raddr + msg_size - 1. As soon as the content of these two memory locations turn into 'A', the first incoming message is received. You will find the following code snippet in both 'client.c' and 'server.c', which is used to detect incoming messages.
more details
selective signaling
One may notice that, within the post_send
function implementation in 'ib.c', we set the send_flags
to be IBV_SEND_SIGNALED
. As a result, each send operation also generates a completion event when it is finished (when the message has been delivered to the receiver and an acknowledgement is sent back to the sender). Although in example 1, we generally ignore such events, in both 'client.c' and 'server.c', only completion events of posted receives are polled to detect incoming message:
Since the completion events of the sends are not relevant, we can get ride of the notification of such events by not setting send_flags
(the default behavior is not signaled), as shown in post_write_unsignaled
in 'ib.c'. However, we can not use unsignaled sends all the time. Under the current libverbs implementation, a send operation is only considered to be completed when a completion event of this operation is generated and polled from the CQ. If one kept posting unsignaled sends, the send queue will eventually be full because none of these operations would be thought as completed and hence they remain in the send queue; and as a result, no more sends can be posted to the send queue. It also happens when one uses signaled sends but not polling any completion events from the CQ. As a result, a common practice is to use selective signaling, where a signaled send is posted once in a while, and then poll its completion event from the CQ. After the completion event of the signaled send is polled from CQ, all previous sends are considered to be completed and thus are removed from the send queue. Unsignaled sends have less overhead compared to signaled sends, therefore, selective signaling tends to have better performance. A more detailed discussion can be found here
reset receiving memory regions
As we have discussed earlier, the detection of a newly incoming message is by polling memory locations till they turn into the specific content ('A' in our example). In practice, the memory region will be reused once the message has been processed. As a result, as soon as the message has been processed, we need to reset the memory region such that it does not contain the old message.
Example 3: common optimization techniques
In the previous example, we demonstrate one optimization technique: selective signaling. In the following, we will discuss two more: message inlining and batching.
message inlining
git checkout 1b460c2
To post a send (or write), a work request (WR), ibv_send_wr
, needs to be prepared before it is passed on to ibv_post_send
. WR stores a pointer pointing to the memory region containing the message would be sent to the receiver. The successful return of a ibv_post_send
function call only indicates that such work request has been passed down to the NIC and stored within the send queue. When the work request is later being pulled out from the send queue and processed, the NIC follows the pointer to find the memory region and copy out the message to be sent. Therefore, it involves two round trips between the NIC and the host memory: first WR to NIC, then message to NIC.
Home depot air compressor. Most RDMA compatible NICs allow some small messages to be inlined with the WR, such that the message body is copied together with the WR to the NIC; and as a result, saves one round trip of following the pointer to copy out the actual message. Unfortunately, there is no way to determine the max inline message size supported by the hardware. It is normally done by using trial and error.
To enable inlined message. Two places need to be modified: first, add max_inline_data
to ibv_qp_init_attr
when create QP; second, set send_flags
to IBV_SEND_INLINE
.
message batching
git checkout d7dd63e
There are two different ways that CPU communicates with the NIC: MMIO and DMA. When calling ibv_post_send
, the CPU uses MMIO to write WR to the NIC memory; and when the NIC starts to process the WR and needs to copy the message body by following the pointer, it uses DMA. Generally speaking, DMA is more efficient compared to MMIO.
If one needs to send out N messages, a common practice is call ibv_post_send
N times, and each time CPU uses MMIO to write WR to NIC memory. A different approach is to use a link list of WRs where ibv_post_send
is called only once to write the head of the WR list to the NIC using MMIO, and the NIC reads the rest of the WRs using DMA. This second approach is called batching which is more efficient than the first approach. In order to use message batching, a list of ibv_send_wr
needs to be created manually by setting ibv_send_wr.next
pointing to the next WR. Check the implementation of function setup_ib
in 'setup_ib.c' for more details.
However, if one needs to send out multiple messages to different destinations, thus involves more than one QPs, these messages cannot be batched. Under this constraint, messages can be batched must use the same QP. Therefore, there exists an alternative approach: if those outgoing messages are sent to the same QP, packing these messages into one big message at the sender and unpack them at the receiver. Sending one big message is even more efficient than batching. However, it involves packing and unpacking at both ends. We will not discuss this approach further in this tutorial.
A comparison of the throughput performance for different optimization techniques is shown in the following figure. Data are collected using 8B messages (for write batch, batch size is chosen to be 4). Write performs better than send/recv. Write inline and write are similar, while batching has better performance when the number of concurrent messages is small.
In the following figure, we show how the batch size affects throughput performance. The data are collected using 8B messages and maximum of 64 concurrent messages.
Example 4: use SRQ
Rdma Aware Networks Programming User Manual 2016
git checkout ec2a4db
Shared receive queue (SRQ) is introduced to improve memory efficiency by allowing one receive queue being shared among different QPs. Before SRQ is introduced, each QP needs to create its own receive queue. Imaging a case where 100s or even 1000s of QPs need to be created, which is not uncommon for real world applications, the same number of receive queues need to allocated that leads to potential huge memory wastage. The reason is that although one node may need to talk to 100s or 1000s other nodes, it is uncommon for it to talk to all the nodes at the same time, which means among those 100s or 1000s receive queues, only a few being used at a particular time. With the introduction of the SRQ, the number of receive queues no longer grows linearly with the number of QPs.
SRQ is created using ibv_create_srq
. When QP is created, specify the SRQ that QP is intended to use under ibv_qp_init_attr.srq
. Receives are posted by calling ibv_post_srq_recv
When SRQ is used, the source of the incoming message is implicit. There are multiple ways to determine the source of the incoming message, the solution adopted by this tutorial is by embedding the rank of the QP inside the imm_data
, which is a 4B data sent along with the actual message. In order to send message along with a imm_data
field, the ibv_send_wr.op_code
needs to be set to IBV_WR_SEND_WITH_IMM
.
One big limitation of SRQ is that a single SRQ only supports one message size, which is fundamental, because receives are served in FIFO order. To deal messages of variable sizes, either use one SRQ which accommodates the maximum message size or multiple SRQs each accommodates a different message size.
Example 5: use DCT
Due to lack of state of art IB hardware, we hope to include such example using DCT in the future. Meanwhile, reader of this tutorial can find code example for how to use DCT under libibverbs*/examples (with the latest libverbs distribution).
Practical considerations
Choice of QP types
With no experience with DCT, we focus our discussion on choice between RC (Reliable Connection) and UD (Unreliable Datagram). Compared to RC, UD has less overhead hence has better raw performance. However, the performance improvement provided by using UD instead of RC is minimal compared to choose UDP over TCP. It is because the RDMA protocol processing is mostly in hardware, while TCP/UDP uses kernel software. The main advantage of using UD is that a single QP can used to talk to any other QPs, while using RC, one needs to create as many QPs as the number of communication peers. However, with the introduction of DCT, QP scalability problem can be well addressed.
There are a few constrains when using UD:
Maximum message size is limited by MTU. Modern RDMA compatible NICs supports MTU of size 2KB to 4KB, which means message of size larger than 4KB needs to be broken down into smaller pieces at the sender and re-assembled at the receiver. It simply requires more programming effort.
Each message sent using UD has a Global Routing Header (GRH) of 40B long. Even if your message is of size 8B, the total amount of data needs to be sent is at least 48B.
Out-of-order message delivery. Different from RC, the message order is not guaranteed. In case where message order is important, programmers need to keep track of the message order by themselves.
Unreliable message delivery. Although in today's datacenter environment, package drop is rare, unless applications allow message drops, tracking message delivery and re-transmission scheme all need to be implemented using software.
In my opinion, unless the application allows message drops, does not care about the order of message delivery and always has message size less than the MTU, using UD requires re-implement existing hardware logic (used for RC QP) in software, which leads to more programming effort and potentially worse performance.
Polling incoming messages when using RDMA write
Example 2 shows how to poll incoming messages when using RDMA write. However, example 2 only shows the case where only 1 communication peer is involved. If there are 100s of communication peers, which means we need to poll 100s of different memory locations for incoming messages. Straightforward solutions includes polling 100s of different memory locations in a round robin fashion, or spawning 100s of threads, each dedicates to one memory region. Neither seems to be efficient or scalable. Plus for these types of polling method, the application needs to reserve special characters for message termination and reset memory region manually after processing each message.
Introduction To Rdma Programming
An alternative approach involves usage of CQ. Since RDMA write does not use pre-post receives, we cannot poll CQ for receive completion events to detect incoming messages. In order to generate a completion event for incoming RDMA write, we need to set use ibv_send_wr.op_code
to IBV_WR_RDMA_WRITE_WITH_IMM
. The extra immediate data field associated with the write operation generates a completion event which can be polled by the CQ. Plus, one can embed the message length information in imm_data
so that application does not need to reserve special character for message termination or reset memory region after the message being processed.
Rdma Pdf
For a small number of QPs, polling at memory locations directly could outperform using CQ. With a large number of QPs, in my opinion, the alternative approach using RDMA write with imm_data provides a single point polling (polling at CQ) which is more scalable.