US9916162B2 · Active · Utility

Using a global barrier to synchronize across local thread groups in general purpose programming on GPU

Assignee: INTEL CORP
Priority: Dec 26, 2013
Filed: Dec 8, 2014
Granted: Mar 13, 2018
Est. expiry: Dec 26, 2033 (~7.5 yrs left) · nominal 20-yr term from priority
Inventors: GUPTA NIRAJ
Classifications: G06F 9/48, G06F 9/3851, G06F 9/522
PatentIndex Score: 84 · Cited by: 9 · References: 3 · Claims: 14

Abstract

Methods and systems may synchronize workloads across local thread groups. The methods and systems may provide for receiving, at a graphics processor, a workload from a host processor and receiving, at a plurality of processing elements, a plurality of threads that form one or more local thread groups. Additionally, the processing of the workload may be synchronized across the one or more thread groups. In one example, the global barrier determines that all threads across the thread groups have been completed without polling.
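The synchronization described in the abstract can be illustrated with a host-side analogy. The sketch below is not the patented hardware mechanism: it uses Python's standard `threading.Barrier` to stand in for a global barrier that spans two local thread groups. The names `GROUPS`, `THREADS_PER_GROUP`, `run_workload`, and `worker` are hypothetical and chosen only for illustration. Waiting threads block inside `Barrier.wait()` instead of repeatedly checking a flag, which loosely mirrors the "without polling" idea.

```python
import threading

GROUPS = 2             # hypothetical number of local thread groups
THREADS_PER_GROUP = 4  # hypothetical number of threads in each group


def run_workload():
    """Run a two-phase workload with one barrier shared by every group."""
    total = GROUPS * THREADS_PER_GROUP
    # One barrier spanning all threads in all groups (the "global" barrier).
    global_barrier = threading.Barrier(total)
    partial = [[0] * THREADS_PER_GROUP for _ in range(GROUPS)]
    results = [None] * total

    def worker(group, lane):
        index = group * THREADS_PER_GROUP + lane
        # Phase 1: each thread writes only its own slot.
        partial[group][lane] = index + 1
        # Block until every thread in every group has finished phase 1.
        global_barrier.wait()
        # Phase 2: values written by other groups are now safe to read.
        results[index] = sum(sum(row) for row in partial)

    threads = [
        threading.Thread(target=worker, args=(g, lane))
        for g in range(GROUPS)
        for lane in range(THREADS_PER_GROUP)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Without the barrier, a thread in one group could begin phase 2 before a thread in the other group had written its phase 1 value, so the sums could differ between threads.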

Claims

exact text as granted — not AI-modified
We claim: 
     
       1. A system comprising:
 one or more transceivers; 
 a host processor in communication with the one or more transceivers; 
 a system memory associated with the host processor; 
 a processor, in communication with the system memory, to receive a workload from the host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group, the processor including:
 a first plurality of processing elements to receive and process a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, 
 a second plurality of processing elements to receive and process a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, and 
 a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the global barrier to enable the workload to be partitioned into the plurality of kernels and to synchronize the processing of the workload across the first thread group and the second thread group, wherein the plurality of kernels are to be processed concurrently and in parallel. 
 
 
     
     
       2. The system of claim 1, wherein the first thread group and the second thread group form a global thread group.

       3. The system of claim 1, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.

       4. The system of claim 3, wherein the determination is made without polling.

       5. The system of claim 1, wherein the processor is a graphics processor.
     
     
       6. An apparatus comprising:
 a graphics processor to receive a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group, the graphics processor including:
 a first plurality of processing elements to receive and process a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, 
 a second plurality of processing elements to receive and process a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, and 
 a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the global barrier to enable the workload to be partitioned into the plurality of kernels and to synchronize the processing of the workload across the first thread group and the second thread group, wherein the plurality of kernels are to be processed concurrently and in parallel. 
 
 
     
     
       7. The apparatus of claim 6, wherein the first thread group and the second thread group form a global thread group.

       8. The apparatus of claim 6, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.
     
     
       9. A method comprising:
 receiving, at a graphics processor, a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group; 
 receiving, at a first plurality of processing elements, a first thread group having plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, 
 receiving, at a second plurality of processing elements, a second thread group having plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, 
 synchronizing, at a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the processing of the workload across the first thread group and the second thread group, wherein the global barrier enables the workload to be partitioned into the plurality of kernels and the plurality of kernels are to be processed concurrently and in parallel. 
 
     
     
       10. The method of claim 9, wherein the first thread group and the second thread group form a global thread group.

       11. The method of claim 9, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.
     
     
       12. At least one non-transitory computer readable storage medium comprising a set of instructions which, if executed by a graphics processor, cause a computer to:
 receive, at a processor, a workload from a host processor, wherein the workload is partitioned into a plurality of kernels each containing a thread group; 
 receive, at a first plurality of processing elements, a first thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, 
 receive, at a second plurality of processing elements, a second thread group having a plurality of threads, wherein each of the plurality of processing elements processes at least one of the plurality of threads, 
 synchronize, at a global barrier in communication with the first plurality of processing elements and the second plurality of processing elements, the processing of the workload across the first thread group and the second thread group, wherein the global barrier enables the workload to be partitioned into the plurality of kernels and the plurality of kernels are to be processed concurrently and in parallel. 
 
     
     
       13. The at least one non-transitory computer readable storage medium of claim 12, wherein the first thread group and the second thread group form a global thread group.

       14. The at least one non-transitory computer readable storage medium of claim 12, wherein the global barrier determines that all threads across the first thread group and the second thread group have been completed.
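
Several dependent claims recite that the global barrier determines when all threads across both thread groups have completed, and claim 4 adds that this determination is made without polling. As a rough software analogy only, and not the claimed hardware, the sketch below counts completions and raises a single notification when the count reaches zero, so the waiting side blocks on an event instead of repeatedly reading a status value. `CompletionTracker`, `thread_finished`, and `run_groups` are hypothetical names introduced for this illustration.

```python
import threading


class CompletionTracker:
    """Hypothetical helper: signals once every expected thread has finished."""

    def __init__(self, expected):
        self._remaining = expected
        self._lock = threading.Lock()
        self.all_done = threading.Event()

    def thread_finished(self):
        # Each finishing thread decrements the count; the last one notifies.
        with self._lock:
            self._remaining -= 1
            if self._remaining == 0:
                self.all_done.set()


def run_groups(group_sizes):
    """Start one thread per (group, lane) and wait for a single notification."""
    tracker = CompletionTracker(sum(group_sizes))
    finished = []

    def worker(group, lane):
        finished.append((group, lane))  # stand-in for the thread's real work
        tracker.thread_finished()

    threads = [
        threading.Thread(target=worker, args=(g, lane))
        for g, size in enumerate(group_sizes)
        for lane in range(size)
    ]
    for t in threads:
        t.start()
    # Block until notified; no loop that re-checks per-thread status.
    signalled = tracker.all_done.wait(timeout=5.0)
    for t in threads:
        t.join()
    return signalled, sorted(finished)
```

Calling `run_groups([3, 2])` models a first thread group of three threads and a second group of two; the event fires only after the fifth thread reports completion.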
