US11138686B2 · Active · Utility · PatentIndex 62
Compute optimizations for low precision machine learning operations
Est. expiry: Apr 28, 2037 (~10.8 yrs left) · nominal 20-yr term from priority
Inventors: OULD-AHMED-VALL ELMOUSTAPHA; BAGHSORKHI SARA S; YAO ANBANG; NEALIS KEVIN; CHEN XIAOMING; KOKER ALTUG; APPU ABHISHEK R; WEAST JOHN C; MACPHERSON MIKE B; KIM DUKHWAN; HURD LINDA L; ASHBAUGH BEN J; LAKSHMANAN BARATH; MA LIWEI; RAY JOYDEEP; TANG PING T; STRICKLAND MICHAEL S
Classifications: G06N 3/045; G06N 3/044; G06N 3/08; G06N 3/0464; G06F 9/3887; G06F 15/17; G06F 15/167; G06F 7/57; G06T 1/60; G06F 9/3867; G06T 1/20; G06N 3/0895; G06N 3/098; G06N 3/0442; G06N 3/096; G06N 3/09; G06F 2212/401; G06F 12/0811; G06N 3/084; G06F 9/5044; G06T 15/005; G06N 3/063; G06F 9/3863; G06F 9/30185; Y02D10/00; G06F 7/483; G06F 9/30014; G06F 3/14; G06N 20/00; G06N 3/0445; G06N 3/0454
PatentIndex Score: 62 · Cited by: 0 · References: 83 · Claims: 20
Abstract
Embodiments described herein provide a graphics processor that can perform a variety of mixed and multiple precision instructions and operations. One embodiment provides a streaming multiprocessor that can concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to execute multiple threads for each of multiple instructions. The streaming multiprocessor can perform concurrent integer and floating-point operations and includes a mixed precision core to perform operations at multiple precisions.
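The mixed precision behavior summarized in the abstract (and recited in the claims as a multiply with 16-bit floating-point inputs feeding a 32-bit floating-point accumulate) can be illustrated numerically. The following is only a minimal software sketch, not the patented hardware or any vendor API: the helper names `to_fp16`, `to_fp32`, `mixed_precision_dot`, and `fp16_only_dot` are hypothetical, and Python's `struct` module is used here merely to emulate IEEE 754 binary16 and binary32 rounding.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 binary16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round x to the nearest IEEE 754 binary32 value (hypothetical helper)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def mixed_precision_dot(a, b, acc=0.0):
    """Multiply 16-bit inputs, accumulate the products at 32-bit precision."""
    acc = to_fp32(acc)
    for x, y in zip(a, b):
        # The product of two binary16 values fits exactly in binary32.
        product = to_fp16(x) * to_fp16(y)
        acc = to_fp32(acc + product)
    return acc


def fp16_only_dot(a, b, acc=0.0):
    """Same loop, but every intermediate is rounded back to 16 bits."""
    acc = to_fp16(acc)
    for x, y in zip(a, b):
        product = to_fp16(to_fp16(x) * to_fp16(y))
        acc = to_fp16(acc + product)
    return acc


ones = [1.0] * 4096
print(mixed_precision_dot(ones, ones))  # 4096.0
print(fp16_only_dot(ones, ones))        # 2048.0 (the 16-bit sum stalls at 2048)
```

In this sketch the 16-bit accumulator stops growing at 2048 because 2049 is not representable in binary16, while the 32-bit accumulator reaches the exact sum. That difference is one plausible motivation for accumulating at a higher precision than the multiply inputs.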
Claims
(Exact text as granted; not AI-modified.)
What is claimed is:
1. A graphics processor comprising:
a memory device;
a level-two (L2) cache memory and a raster operations unit (ROP) coupled with the memory device;
a compressor to perform lossless compression on data to be written to the memory device; and
a streaming multiprocessor coupled with the memory device, the streaming multiprocessor to concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to execute multiple threads for each of multiple instructions;
wherein the multiple instructions include a first instruction to cause at least a first portion of the streaming multiprocessor to perform a floating-point operation on multiple floating-point input operands and a second instruction to cause at least a second portion of the streaming multiprocessor to perform an integer operation on multiple integer operands, the first instruction to execute concurrently with the second instruction; and
wherein the streaming multiprocessor includes a mixed precision core to perform operations for at least a third instruction of the multiple instructions, the mixed precision core to perform a first operation of the third instruction at a first precision and a second operation of the third instruction at a second precision, the first operation is a multiply having at least one 16-bit floating-point input, and the second operation is an accumulate having a 32-bit floating-point input.
2. The graphics processor as in claim 1, wherein the memory device is a memory stack including multiple memory dies.
3. The graphics processor as in claim 2, wherein the memory stack is a high-bandwidth memory stack on a same physical package as the streaming multiprocessor.
4. The graphics processor as in claim 1, wherein the first portion of the streaming multiprocessor includes a set of logic units configured to perform floating-point operations and the second portion of the streaming multiprocessor includes a set of logic units configured to perform integer operations.
5. The graphics processor as in claim 4, wherein the second instruction is to cause the second portion of the streaming multiprocessor to perform an 8-bit integer operation on multiple 8-bit floating-point input operands.
6. The graphics processor as in claim 4, wherein the first instruction is to cause the first portion of the streaming multiprocessor to perform a 16-bit floating-point operation on multiple 32-bit floating-point input operands.
7. The graphics processor as in claim 6, wherein the set of logic units configured to perform the floating-point operations are to track a loss of precision during execution of the first instruction.
8. The graphics processor as in claim 1, wherein the third instruction is a matrix multiply and accumulate operation.
9. The graphics processor as in claim 8, wherein, the first operation of the third instruction has two or more 16-bit floating-point inputs.
10. The graphics processor as in claim 1, additionally comprising a register file to store operands associated with the multiple instructions.
11. The graphics processor as in claim 10, wherein the operands are loaded into the register file from the L2 cache memory.
12. A graphics processing system comprising:
a system bus interface coupled with an internal bus;
a graphics memory device coupled with the internal bus;
a level-two (L2) cache memory and a raster operations unit (ROP) coupled with the graphics memory device;
a compressor to perform lossless compression on data to be written to the graphics memory device; and
a streaming multiprocessor coupled with the graphics memory device, the streaming multiprocessor to concurrently execute multiple thread groups, wherein the streaming multiprocessor includes a single instruction, multiple thread (SIMT) architecture and the streaming multiprocessor is to execute multiple threads for each of multiple instructions;
wherein the multiple instructions include a first instruction to cause at least a first portion of the streaming multiprocessor to perform a floating-point operation on multiple floating-point input operands and a second instruction to cause at least a second portion of the streaming multiprocessor to perform an integer operation on multiple integer operands, the first instruction to execute concurrently with the second instruction; and
wherein the streaming multiprocessor includes a mixed precision core to perform operations for at least a third instruction of the multiple instructions, the mixed precision core to perform a first operation of the third instruction at a first precision and a second operation of the third instruction at a second precision, the first operation is a multiply having at least one 16-bit floating-point input, and the second operation is an accumulate having a 32-bit floating-point input.
13. The graphics processing system as in claim 12, wherein the graphics memory device is a graphics double data rate (GDDR) memory device.
14. The graphics processing system as in claim 13, wherein the GDDR memory device includes GDDR6 memory.
15. The graphics processing system as in claim 12, wherein the first portion of the streaming multiprocessor includes a set of logic units configured to perform floating-point operations and the second portion of the streaming multiprocessor includes a set of logic units configured to perform integer operations.
16. The graphics processing system as in claim 15, wherein the second instruction is to cause the second portion of the streaming multiprocessor to perform an 8-bit integer operation on multiple 8-bit floating-point input operands.
17. The graphics processing system as in claim 15, wherein the first instruction is to cause the first portion of the streaming multiprocessor to perform a 16-bit floating-point operation on multiple 32-bit floating-point input operands.
18. The graphics processing system as in claim 17, wherein the set of logic units configured to perform the floating-point operations are to track a loss of precision during execution of the first instruction.
19. A method comprising:
decoding a first instruction via an instruction decoder of a graphics processor, the first instruction decoded into a first decoded instruction, wherein the graphics processor includes a streaming multiprocessor coupled to a memory device, a level-two (L2) cache memory and a raster operations unit (ROP) coupled with the memory device, and a compressor to perform lossless compression on data to be written to the memory device, and the streaming multiprocessor includes a single instruction, multiple thread (SIMT);
executing multiple threads associated with the first decoded instruction via the streaming multiprocessor, wherein the first decoded instruction causes at least a first portion of the streaming multiprocessor to perform a floating-point operation on multiple floating-point input operands;
decoding a second instruction via the instruction decoder of the graphics processor into a second decoded instruction;
executing multiple threads associated with the second decoded instruction via the streaming multiprocessor, wherein the second decoded instruction causes at least a second portion of the streaming multiprocessor to perform an integer operation on multiple integer operands and the first decoded instruction executes concurrently with the second decoded instruction;
decoding a third instruction via the instruction decoder of the graphics processor into a third decoded instruction; and
executing multiple threads associated with the third decoded instruction via the streaming multiprocessor via a mixed precision core of the streaming multiprocessor, wherein the mixed precision core performs a first operation of the third decoded instruction at a first precision and a second operation of the third decoded instruction at a second precision, the first operation is a multiply having at least one 16-bit floating-point input, and the second operation is an accumulate having a 32-bit floating-point input.
20. The method as in claim 19, wherein executing the first decoded instruction includes performing a 16-bit floating-point operation on multiple 32-bit floating-point input operands and executing the second decoded instruction includes performing an 8-bit integer operation on multiple 8-bit floating-point input operands.
Cited by (0)
No later patents cite this yet.
References (0)
No backward citations on record.