US12335142B2 · Active · Utility

Network interface for data transport in heterogeneous computing environments

Assignee: INTEL CORP
Priority: Jun 7, 2019
Filed: Jan 19, 2024
Granted: Jun 17, 2025
Est. expiry: Jun 7, 2039 (~12.9 yrs left) · nominal 20-yr term from priority
Inventors: MAROLIA PRATIK M; SANKARAN RAJESH M; RAJ ASHOK; JANI NRUPAL; SARANGAM PARTHASARATHY; SHARP ROBERT O
Classifications: H04L 45/60; G06F 12/1081; H04L 49/9068; G06F 13/28; H04L 69/321; G06F 2212/1024; G06F 13/385; H04L 45/742
PatentIndex Score: 62 · Cited by: 0 · References: 84 · Claims: 33

Abstract

A network interface controller can be programmed to direct write received data to a memory buffer via either a host-to-device fabric or an accelerator fabric. For packets received that are to be written to a memory buffer associated with an accelerator device, the network interface controller can determine an address translation of a destination memory address of the received packet and determine whether to use a secondary head. If a translated address is available and a secondary head is to be used, a direct memory access (DMA) engine is used to copy a portion of the received packet via the accelerator fabric to a destination memory buffer associated with the address translation. Accordingly, copying a portion of the received packet through the host-to-device fabric and to a destination memory can be avoided and utilization of the host-to-device fabric can be reduced for accelerator bound traffic.
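The receive-path decision described in the abstract can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not the patented implementation: the names `Packet`, `Nic`, `Fabric`, `translations`, and `use_secondary_head` are hypothetical, the address translation is modeled as a plain dictionary lookup, the "secondary head" is reduced to a boolean setting, and the fallback to the host-to-device fabric when either condition fails is an assumption based on the abstract's statement that either fabric can be used.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Fabric(Enum):
    HOST_TO_DEVICE = "host-to-device"
    ACCELERATOR = "accelerator"


@dataclass
class Packet:
    dest_addr: int           # destination memory address carried by the packet
    payload: bytes           # portion of the packet to be copied to memory
    accelerator_bound: bool  # True if the target buffer belongs to an accelerator


@dataclass
class Nic:
    # Hypothetical translation table: destination address -> translated address.
    translations: Dict[int, int] = field(default_factory=dict)
    # Hypothetical setting standing in for "whether to use a secondary head".
    use_secondary_head: bool = True
    # Toy memory keyed by address, standing in for the destination buffers.
    memory: Dict[int, bytes] = field(default_factory=dict)

    def translate(self, addr: int) -> Optional[int]:
        """Return the translated address, or None if no translation is available."""
        return self.translations.get(addr)

    def receive(self, pkt: Packet) -> Fabric:
        """Place the payload and report which fabric carried the copy."""
        if pkt.accelerator_bound:
            translated = self.translate(pkt.dest_addr)
            if translated is not None and self.use_secondary_head:
                # DMA copy over the accelerator fabric; the host-to-device
                # fabric is not used for this payload.
                self.memory[translated] = pkt.payload
                return Fabric.ACCELERATOR
        # Assumed fallback: write over the host-to-device fabric.
        self.memory[pkt.dest_addr] = pkt.payload
        return Fabric.HOST_TO_DEVICE
```

In this sketch, only a packet that is accelerator bound, has an available translation, and arrives while the secondary head is enabled takes the accelerator fabric; every other packet is counted against the host-to-device fabric, which is the traffic the abstract aims to reduce.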

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. Network interface controller circuitry configurable for use in a host node and in association with a device driver, the host node comprising at least one graphics processing unit (GPU)-accessible memory, at least one host memory, and at least one host fabric, the host node to be communicatively coupled via at least one multi-switch fabric to a remote system, the remote system comprising at least one other GPU-accessible memory, at least one other host memory, and at least one other host fabric, the network interface controller circuitry comprising:
 network interface circuitry for use in Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) packet data communication with the remote system via the at least one multi-switch fabric, the ROCE packet data communication to indicate at least one RDMA write to the host node from the remote system and/or at least one RDMA read from the host node to the remote system, the ROCE packet data communication to be initiated in response, at least in part, to at least one host application request; and 
 programmable circuitry to perform operations comprising:
 in event that the ROCE packet data communication indicates the at least one RDMA write, directly writing, via the at least one host fabric, received packet data to the at least one GPU-accessible memory; 
 in event that the ROCE packet data communication indicates the at least one RDMA read, directly reading, via the at least one host fabric, other data from the at least one GPU-accessible memory that is to be provided to the remote system via the ROCE packet data communication; and 
 encryption, decryption, and compression-related host central processing unit (CPU) offload operations; 
 
 wherein:
 the writing and the reading are to be performed in a manner that bypasses both (1) host CPU and/or host operating system (OS) in the writing and the reading, and (2) copying of the received packet data and the other data to the at least one host memory of the host node; 
 the writing and/or the reading are configurable to comprise use of direct data placement (DDP); 
 the writing and/or the reading are configurable to comprise use of address translation; 
 the address translation is to be implemented, at least in part, using the device driver; 
 portions of the received packet data and/or the other data are to be routed to their destinations via respective fabric-associated routings; 
 the respective fabric-associated routings are configurable to be mutually different from each other, at least in part; and 
 the at least one multi-switch fabric is to communicatively couple multiple switches associated with the host node and the remote system. 
 
 
     
     
       2. The network interface controller circuitry of  claim 1 , wherein:
 prior to being received by the programmable circuitry, the received packet data is to be directly read from the at least one other GPU-accessible memory via the at least one other host fabric in a manner that bypasses both (1) remote system CPU and/or remote system OS in the remote system, and (2) copying of the received packet data to the at least one other host memory. 
 
     
     
       3. The network interface controller circuitry of  claim 2 , wherein:
 the at least one host fabric comprises Peripheral Component Interconnect Express (PCIe) interconnect; and 
 the network interface controller circuitry is to be comprised in a circuit board that is to be communicatively coupled to the PCIe interconnect. 
 
     
     
       4. The network interface controller circuitry of  claim 3 , wherein:
 the received packet data and/or the other data are for use in association with artificial intelligence and/or machine learning. 
 
     
     
       5. The network interface controller circuitry of  claim 4 , wherein:
 the host node and the remote system each comprise multiple respective graphics processing units; 
 the at least one GPU-accessible memory is accessible by the multiple respective graphics processing units of the host node; and 
 the at least one other GPU-accessible memory is accessible by the multiple respective graphics processing units of the remote system. 
 
     
     
       6. The network interface controller circuitry of  claim 1 , wherein:
 the programmable circuitry is also for use in association with memory isolation. 
 
     
     
       7. The network interface controller circuitry of  claim 1 , wherein:
 at least one application specific integrated circuit (ASIC) comprises the programmable circuitry; and 
 the at least one host fabric comprises at least one accelerator fabric. 
 
     
     
       8. A method to be implemented using network interface controller circuitry, the network interface controller circuitry to be configured for use in a host node and in association with a device driver, the host node comprising at least one graphics processing unit (GPU)-accessible memory, at least one host memory, and at least one host fabric, the host node to be communicatively coupled via at least one multi-switch fabric to a remote system, the remote system comprising at least one other GPU-accessible memory, at least one other host memory, and at least one other host fabric, the network interface controller circuitry comprising network interface circuitry and programmable circuitry, the method comprising:
 using the network interface circuitry in Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) packet data communication with the remote system via the at least one multi-switch fabric, the ROCE packet data communication to indicate at least one RDMA write to the host node from the remote system and/or at least one RDMA read from the host node to the remote system, the ROCE packet data communication to be initiated in response, at least in part, to at least one host application request; and 
 using the programmable circuitry to perform operations comprising:
 in event that the ROCE packet data communication indicates the at least one RDMA write, directly writing, via the at least one host fabric, received packet data to the at least one GPU-accessible memory; 
 in event that the ROCE packet data communication indicates the at least one RDMA read, directly reading, via the at least one host fabric, other data from the at least one GPU-accessible memory that is to be provided to the remote system via the ROCE packet data communication; and 
 encryption, decryption, and compression-related host central processing unit (CPU) offload operations; 
 
 wherein:
 the writing and the reading are to be performed in a manner that bypasses both (1) host CPU and/or host operating system (OS) in the writing and the reading, and (2) copying of the received packet data and the other data to the at least one host memory of the host node; 
 the writing and/or the reading are configurable to comprise use of direct data placement (DDP); 
 the writing and/or the reading are configurable to comprise use of address translation; 
 the address translation is to be implemented, at least in part, using the device driver; 
 portions of the received packet data and/or the other data are to be routed to their destinations via respective fabric-associated routings; 
 the respective fabric-associated routings are configurable to be mutually different from each other, at least in part; and 
 the at least one multi-switch fabric is to communicatively couple multiple switches associated with the host node and the remote system. 
 
 
     
     
       9. The method of  claim 8 , wherein:
 prior to being received by the programmable circuitry, the received packet data is to be directly read from the at least one other GPU-accessible memory via the at least one other host fabric in a manner that bypasses both (1) remote system CPU and/or remote system OS in the remote system, and (2) copying of the received packet data to the at least one other host memory. 
 
     
     
       10. The method of  claim 9 , wherein:
 the at least one host fabric comprises Peripheral Component Interconnect Express (PCIe) interconnect; and 
 the network interface controller circuitry is to be comprised in a circuit board that is to be communicatively coupled to the PCIe interconnect. 
 
     
     
       11. The method of  claim 10 , wherein:
 the received packet data and/or the other data are for use in association with artificial intelligence and/or machine learning. 
 
     
     
       12. The method of  claim 11 , wherein:
 the host node and the remote system each comprise multiple respective graphics processing units; 
 the at least one GPU-accessible memory is accessible by the multiple respective graphics processing units of the host node; and 
 the at least one other GPU-accessible memory is accessible by the multiple respective graphics processing units of the remote system. 
 
     
     
       13. The method of  claim 8 , wherein:
 the programmable circuitry is also for use in association with memory isolation. 
 
     
     
       14. The method of  claim 8 , wherein:
 at least one application specific integrated circuit (ASIC) comprises the programmable circuitry; and 
 the at least one host fabric comprises at least one accelerator fabric. 
 
     
     
       15. At least one non-transitory machine-readable storage medium storing instructions to be executed by at least one machine associated with network interface controller circuitry, the network interface controller circuitry to be configured for use in a host node and in association with a device driver, the host node comprising at least one graphics processing unit (GPU)-accessible memory, at least one host memory, and at least one host fabric, the host node to be communicatively coupled via at least one multi-switch fabric to a remote system, the remote system comprising at least one other GPU-accessible memory, at least one other host memory, and at least one other host fabric, the network interface controller circuitry comprising network interface circuitry and programmable circuitry, the instructions, when executed by the at least one machine, resulting in performance of operations comprising:
 using the network interface circuitry in Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) packet data communication with the remote system via the at least one multi-switch fabric, the ROCE packet data communication to indicate at least one RDMA write to the host node from the remote system and/or at least one RDMA read from the host node to the remote system, the ROCE packet data communication to be initiated in response, at least in part, to at least one host application request; and 
 using the programmable circuitry to perform a set of operations comprising:
 in event that the ROCE packet data communication indicates the at least one RDMA write, directly writing, via the at least one host fabric, received packet data to the at least one GPU-accessible memory; 
 in event that the ROCE packet data communication indicates the at least one RDMA read, directly reading, via the at least one host fabric, other data from the at least one GPU-accessible memory that is to be provided to the remote system via the ROCE packet data communication; and 
 encryption, decryption, and compression-related host central processing unit (CPU) offload operations; 
 
 wherein:
 the writing and the reading are to be performed in a manner that bypasses both (1) host CPU and/or host operating system (OS) in the writing and the reading, and (2) copying of the received packet data and the other data to the at least one host memory of the host node; 
 the writing and/or the reading are configurable to comprise use of direct data placement (DDP); 
 the writing and/or the reading are configurable to comprise use of address translation; 
 the address translation is to be implemented, at least in part, using the device driver; 
 portions of the received packet data and/or the other data are to be routed to their destinations via respective fabric-associated routings; 
 the respective fabric-associated routings are configurable to be mutually different from each other, at least in part; and 
 the at least one multi-switch fabric is to communicatively couple multiple switches associated with the host node and the remote system. 
 
 
     
     
       16. The at least one non-transitory machine-readable storage medium of  claim 15 , wherein:
 prior to being received by the programmable circuitry, the received packet data is to be directly read from the at least one other GPU-accessible memory via the at least one other host fabric in a manner that bypasses both (1) remote system CPU and/or remote system OS in the remote system, and (2) copying of the received packet data to the at least one other host memory. 
 
     
     
       17. The at least one non-transitory machine-readable storage medium of  claim 16 , wherein:
 the at least one host fabric comprises Peripheral Component Interconnect Express (PCIe) interconnect; and 
 the network interface controller circuitry is to be comprised in a circuit board that is to be communicatively coupled to the PCIe interconnect. 
 
     
     
       18. The at least one non-transitory machine-readable storage medium of  claim 17 , wherein:
 the received packet data and/or the other data are for use in association with artificial intelligence and/or machine learning. 
 
     
     
       19. The at least one non-transitory machine-readable storage medium of  claim 18 , wherein:
 the host node and the remote system each comprise multiple respective graphics processing units; 
 the at least one GPU-accessible memory is accessible by the multiple respective graphics processing units of the host node; and 
 the at least one other GPU-accessible memory is accessible by the multiple respective graphics processing units of the remote system. 
 
     
     
       20. The at least one non-transitory machine-readable storage medium of  claim 15 , wherein:
 the programmable circuitry is also for use in association with memory isolation. 
 
     
     
       21. The at least one non-transitory machine-readable storage medium of  claim 15 , wherein:
 at least one application specific integrated circuit (ASIC) comprises the programmable circuitry; and 
 the at least one host fabric comprises at least one accelerator fabric. 
 
     
     
       22. A host system to be communicatively coupled via at least one multi-switch fabric to a remote system, the host system comprising:
 at least one graphics processing unit (GPU)-accessible memory; 
 at least one host memory; 
 at least one host fabric; and 
 network interface controller circuitry comprising:
 network interface circuitry for use in Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) packet data communication with the remote system via the at least one multi-switch fabric, the ROCE packet data communication to indicate at least one RDMA write to the host system from the remote system and/or at least one RDMA read from the host system to the remote system, the RoCE packet data communication to be initiated in response, at least in part, to at least one host application request; and 
 programmable circuitry to perform operations comprising:
 in event that the ROCE packet data communication indicates the at least one RDMA write, directly writing, via the at least one host fabric, received packet data to the at least one GPU-accessible memory; 
 in event that the ROCE packet data communication indicates the at least one RDMA read, directly reading, via the at least one host fabric, other data from the at least one GPU-accessible memory that is to be provided to the remote system via the ROCE packet data communication; and 
 encryption, decryption, and compression-related host central processing unit (CPU) offload operations; 
 
 
 wherein:
 the writing and the reading are to be performed in a manner that bypasses both (1) host CPU and/or host operating system (OS) in the writing and the reading, and (2) copying of the received packet data and the other data to the at least one host memory of the host system; 
 the writing and/or the reading are configurable to comprise use of direct data placement (DDP); 
 the writing and/or the reading are configurable to comprise use of address translation; 
 the address translation is to be implemented, at least in part, using the device driver; 
 portions of the received packet data and/or the other data are to be routed to their destinations via respective fabric-associated routings; 
 the respective fabric-associated routings are configurable to be mutually different from each other, at least in part; and 
 the at least one multi-switch fabric is to communicatively couple multiple switches associated with the host system and the remote system. 
 
 
     
     
       23. The host system of  claim 22 , wherein:
 the at least one host fabric comprises Peripheral Component Interconnect Express (PCIe) interconnect; and 
 the network interface controller circuitry is to be comprised in a circuit board that is to be communicatively coupled to the PCIe interconnect. 
 
     
     
       24. The host system of  claim 23 , wherein:
 the received packet data and/or the other data are for use in association with artificial intelligence and/or machine learning. 
 
     
     
       25. The host system of  claim 24 , wherein:
 the host system and the remote system each comprise multiple respective graphics processing units; 
 the at least one GPU-accessible memory is accessible by the multiple respective graphics processing units of the host system; and 
 at least one other GPU-accessible memory of the remote system is accessible by the multiple respective graphics processing units of the remote system. 
 
     
     
       26. The host system of  claim 22 , wherein:
 the programmable circuitry is also for use in association with memory isolation. 
 
     
     
       27. The host system of  claim 22 , wherein:
 at least one application specific integrated circuit (ASIC) comprises the programmable circuitry; and 
 the at least one host fabric comprises at least one accelerator fabric. 
 
     
     
       28. A data center system comprising:
 at least one multi-switch fabric; 
 a remote system; and 
 a host system to be communicatively coupled via the at least one multi-switch fabric to the remote system, the host system comprising:
 at least one graphics processing unit (GPU)-accessible memory; 
 at least one host memory; 
 at lea one host fabric; and 
 network interface controller circuitry comprising:
 network interface circuitry for use in Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) packet data communication with the remote system via the at least one multi-switch fabric, the ROCE packet data communication to indicate at least one RDMA write to the host system from the remote system and/or at least one RDMA read from the host system to the remote system, the ROCE packet data communication to be initiated in response, at least in part, to at least one host application request; and 
 programmable circuitry to perform operations comprising:
 in event that the ROCE packet data communication indicates the at least one RDMA write, directly writing, via the at least one host fabric, received packet data to the at least one GPU-accessible memory; 
 in event that the ROCE packet data communication indicates the at least one RDMA read, directly reading, via the at least one host fabric, other data from the at least one GPU-accessible memory that is to be provided to the remote system via the ROCE packet data communication; and 
 encryption, decryption, and compression-related host central processing unit (CPU) offload operations; 
 
 
 
 wherein:
 the writing and the reading are to be performed in a manner that bypasses both (1) host CPU and/or host operating system (OS) in the writing and the reading, and (2) copying of the received packet data and the other data to the at least one host memory of the host system; 
 the writing and/or the reading are configurable to comprise use of direct data placement (DDP); 
 the writing and/or the reading are configurable to comprise use of address translation; 
 the address translation is to be implemented, at least in part, using the device driver; 
 portions of the received packet data and/or the other data are to be routed to their destinations via respective fabric-associated routings; 
 the respective fabric-associated routings are configurable to be mutually different from each other, at least in part; and 
 the at least one multi-switch fabric is to communicatively couple multiple switches associated with the host system and the remote system. 
 
 
     
     
       29. The data center system of  claim 28 , wherein:
 the at least one host fabric comprises Peripheral Component Interconnect Express (PCIe) interconnect; and 
 the network interface controller circuitry is to be comprised in a circuit board that is to be communicatively coupled to the PCIe interconnect. 
 
     
     
       30. The data center system of  claim 29 , wherein:
 the received packet data and/or the other data are for use in association with artificial intelligence and/or machine learning. 
 
     
     
       31. The data center system of  claim 30 , wherein:
 the host system and the remote system each comprise multiple respective graphics processing units; 
 the at least one GPU-accessible memory is accessible by the multiple respective graphics processing units of the host system; and 
 at least one other GPU-accessible memory of the remote system is accessible by the multiple respective graphics processing units of the remote system. 
 
     
     
       32. The data center system of  claim 28 , wherein:
 the programmable circuitry is also for use in association with memory isolation. 
 
     
     
       33. The data center system of  claim 28 , wherein:
 at least one application specific integrated circuit (ASIC) comprises the programmable circuitry; and 
 the at least one host fabric comprises at least one accelerator fabric.

Cited by (0)

No later patents cite this yet.