US11138686B2 · Active · Utility

Compute optimizations for low precision machine learning operations

Assignee: INTEL CORP · Priority: Apr 28, 2017 · Filed: Jun 19, 2019 · Granted: Oct 5, 2021
Est. expiry: Apr 28, 2037 (~10.8 yrs left) · nominal 20-yr term from priority
Inventors: OULD-AHMED-VALL ELMOUSTAPHA; BAGHSORKHI SARA S; YAO ANBANG; NEALIS KEVIN; CHEN XIAOMING; KOKER ALTUG; APPU ABHISHEK R; WEAST JOHN C; MACPHERSON MIKE B; KIM DUKHWAN; HURD LINDA L; ASHBAUGH BEN J; LAKSHMANAN BARATH; MA LIWEI; RAY JOYDEEP; TANG PING T; STRICKLAND MICHAEL S
Classifications: G06N 3/045; G06N 3/044; G06N 3/08; G06N 3/0464; G06F 9/3887; G06F 15/17; G06F 15/167; G06F 7/57; G06T 1/60; G06F 9/3867; G06T 1/20; G06N 3/0895; G06N 3/098; G06N 3/0442; G06N 3/096; G06N 3/09; G06F 2212/401; G06F 12/0811; G06N 3/084; G06F 9/5044; G06T 15/005; G06N 3/063; G06F 9/3863; G06F 9/30185; Y02D 10/00; G06F 7/483; G06F 9/30014; G06F 3/14; G06N 20/00; G06N 3/0445; G06N 3/0454
PatentIndex Score: 62 · Cited by: 0 · References: 83 · Claims: 20

Abstract

Embodiments described herein provide a graphics processor that can perform a variety of mixed and multiple precision instructions and operations. One embodiment provides a streaming multiprocessor that can concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to execute multiple threads for each of multiple instructions. The streaming multiprocessor can perform concurrent integer and floating-point operations and includes a mixed precision core to perform operations at multiple precisions.
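The mixed precision behavior summarized in the abstract (and recited in the claims as a multiply with 16-bit floating-point inputs feeding an accumulate with a 32-bit floating-point input) can be illustrated in software. The following is only a minimal sketch, not the patented hardware: the helper names `to_fp16`, `to_fp32`, `dot_mixed`, and `dot_fp16` are hypothetical, and the rounding is emulated with Python's standard `struct` module. It contrasts an FP32 accumulator with an all-FP16 accumulator on the same inputs.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot_mixed(a, b):
    """Dot product with FP16 multiply inputs and an FP32 accumulator (illustrative)."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(x) * to_fp16(y)   # operands are first narrowed to FP16
        acc = to_fp32(acc + prod)        # running sum is kept at FP32
    return acc

def dot_fp16(a, b):
    """Same dot product, but the product and the accumulator are both kept at FP16."""
    acc = 0.0
    for x, y in zip(a, b):
        prod = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + prod)
    return acc

ones = [1.0] * 4096
mixed = dot_mixed(ones, ones)   # FP32 accumulation reaches the true sum
half = dot_fp16(ones, ones)     # FP16 accumulation stalls once the sum hits 2048
print(mixed, half)              # prints: 4096.0 2048.0
```

In this sketch the FP16 accumulator stops growing at 2048 because 2049 is not representable in binary16 and rounds back to 2048, which is one plausible motivation for accumulating at a higher precision than the multiply inputs.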

Claims

exact text as granted — not AI-modified
What is claimed is: 
     
       1. A graphics processor comprising:
 a memory device; 
 a level-two (L2) cache memory and a raster operations unit (ROP) coupled with the memory device; 
 a compressor to perform lossless compression on data to be written to the memory device; and 
 a streaming multiprocessor coupled with the memory device, the streaming multiprocessor to concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to execute multiple threads for each of multiple instructions; 
 wherein the multiple instructions include a first instruction to cause at least a first portion of the streaming multiprocessor to perform a floating-point operation on multiple floating-point input operands and a second instruction to cause at least a second portion of the streaming multiprocessor to perform an integer operation on multiple integer operands, the first instruction to execute concurrently with the second instruction; and 
 wherein the streaming multiprocessor includes a mixed precision core to perform operations for at least a third instruction of the multiple instructions, the mixed precision core to perform a first operation of the third instruction at a first precision and a second operation of the third instruction at a second precision, the first operation is a multiply having at least one 16-bit floating-point input, and the second operation is an accumulate having a 32-bit floating-point input. 
 
     
     
       2. The graphics processor as in claim 1, wherein the memory device is a memory stack including multiple memory dies.
     
     
       3. The graphics processor as in claim 2, wherein the memory stack is a high-bandwidth memory stack on a same physical package as the streaming multiprocessor.
     
     
       4. The graphics processor as in claim 1, wherein the first portion of the streaming multiprocessor includes a set of logic units configured to perform floating-point operations and the second portion of the streaming multiprocessor includes a set of logic units configured to perform integer operations.
     
     
       5. The graphics processor as in claim 4, wherein the second instruction is to cause the second portion of the streaming multiprocessor to perform an 8-bit integer operation on multiple 8-bit floating-point input operands.
     
     
       6. The graphics processor as in claim 4, wherein the first instruction is to cause the first portion of the streaming multiprocessor to perform a 16-bit floating-point operation on multiple 32-bit floating-point input operands.
     
     
       7. The graphics processor as in claim 6, wherein the set of logic units configured to perform the floating-point operations are to track a loss of precision during execution of the first instruction.
     
     
       8. The graphics processor as in claim 1, wherein the third instruction is a matrix multiply and accumulate operation.
     
     
       9. The graphics processor as in claim 8, wherein, the first operation of the third instruction has two or more 16-bit floating-point inputs.
     
     
       10. The graphics processor as in claim 1, additionally comprising a register file to store operands associated with the multiple instructions.
     
     
       11. The graphics processor as in claim 10, wherein the operands are loaded into the register file from the L2 cache memory.
     
     
       12. A graphics processing system comprising:
 a system bus interface coupled with an internal bus; 
 a graphics memory device coupled with the internal bus; 
 a level-two (L2) cache memory and a raster operations unit (ROP) coupled with the graphics memory device; 
 a compressor to perform lossless compression on data to be written to the graphics memory device; and 
 a streaming multiprocessor coupled with the graphics memory device, the streaming multiprocessor to concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to execute multiple threads for each of multiple instructions; 
 wherein the multiple instructions include a first instruction to cause at least a first portion of the streaming multiprocessor to perform a floating-point operation on multiple floating-point input operands and a second instruction to cause at least a second portion of the streaming multiprocessor to perform an integer operation on multiple integer operands, the first instruction to execute concurrently with the second instruction; and 
 wherein the streaming multiprocessor includes a mixed precision core to perform operations for at least a third instruction of the multiple instructions, the mixed precision core to perform a first operation of the third instruction at a first precision and a second operation of the third instruction at a second precision, the first operation is a multiply having at least one 16-bit floating-point input, and the second operation is an accumulate having a 32-bit floating-point input. 
 
     
     
       13. The graphics processing system as in claim 12, wherein the graphics memory device is a graphics double data rate (GDDR) memory device.
     
     
       14. The graphics processing system as in claim 13, wherein the GDDR memory device includes GDDR6 memory.
     
     
       15. The graphics processing system as in claim 12, wherein the first portion of the streaming multiprocessor includes a set of logic units configured to perform floating-point operations and the second portion of the streaming multiprocessor includes a set of logic units configured to perform integer operations.
     
     
       16. The graphics processing system as in claim 15, wherein the second instruction is to cause the second portion of the streaming multiprocessor to perform an 8-bit integer operation on multiple 8-bit floating-point input operands.
     
     
       17. The graphics processing system as in claim 15, wherein the first instruction is to cause the first portion of the streaming multiprocessor to perform a 16-bit floating-point operation on multiple 32-bit floating-point input operands.
     
     
       18. The graphics processing system as in claim 17, wherein the set of logic units configured to perform the floating-point operations are to track a loss of precision during execution of the first instruction.
     
     
       19. A method comprising:
 decoding a first instruction via an instruction decoder of a graphics processor, the first instruction decoded into a first decoded instruction, wherein the graphics processor includes a streaming multiprocessor coupled to a memory device, a level-two (L2) cache memory and a raster operations unit (ROP) coupled with the memory device, and a compressor to perform lossless compression on data to be written to the memory device, and the streaming multiprocessor includes a single instruction, multiple thread (SIMT); 
 executing multiple threads associated with the first decoded instruction via the streaming multiprocessor, wherein the first decoded instruction causes at least a first portion of the streaming multiprocessor to perform a floating-point operation on multiple floating-point input operands; 
 decoding a second instruction via the instruction decoder of the graphics processor into a second decoded instruction; 
 executing multiple threads associated with the second decoded instruction via the streaming multiprocessor, wherein the second decoded instruction causes at least a second portion of the streaming multiprocessor to perform an integer operation on multiple integer operands and the first decoded instruction executes concurrently with the second decoded instruction; 
 decoding a third instruction via the instruction decoder of the graphics processor into a third decoded instruction; and 
 executing multiple threads associated with the third decoded instruction via the streaming multiprocessor via a mixed precision core of the streaming multiprocessor, wherein the mixed precision core performs a first operation of the third decoded instruction at a first precision and a second operation of the third decoded instruction at a second precision, the first operation is a multiply having at least one 16-bit floating-point input, and the second operation is an accumulate having a 32-bit floating-point input. 
 
     
     
       20. The method as in claim 19, wherein executing the first decoded instruction includes performing a 16-bit floating-point operation on multiple 32-bit floating-point input operands and executing the second decoded instruction includes performing an 8-bit integer operation on multiple 8-bit floating-point input operands.
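
Several dependent claims recite a 16-bit floating-point operation on 32-bit floating-point input operands, with the logic units tracking a loss of precision. The claims do not say how that tracking is done, so the following is only a hedged software sketch of one possible reading: the function `narrow_tracked` and its "maximum absolute error" metric are assumptions made for illustration, not the mechanism in the patent. Rounding is emulated with Python's standard `struct` module.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def narrow_tracked(values):
    """Narrow FP32 operands to FP16 and report the largest absolute rounding error.

    Hypothetical helper: the error metric is an assumption for illustration.
    """
    narrowed = []
    max_err = 0.0
    for v in values:
        v32 = to_fp32(v)              # treat the input as an FP32 operand
        v16 = to_fp16(v32)            # narrow it for the 16-bit operation
        narrowed.append(v16)
        max_err = max(max_err, abs(v32 - v16))
    return narrowed, max_err

# Values that FP16 represents exactly lose nothing when narrowed.
exact, err_exact = narrow_tracked([0.5, 1.0, 2048.0])
# 2049.0 is not representable in FP16 and rounds to 2048.0, an error of 1.0.
lossy, err_lossy = narrow_tracked([0.1, 2049.0])
print(err_exact, err_lossy)           # prints: 0.0 1.0
```

A caller could compare the reported error against a tolerance to decide whether an operation is safe to run at the lower precision; that policy is likewise an assumption, not something stated in the claims.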
