US11501157B2 · Active · Utility · PatentIndex 59

Action shaping from demonstration for fast reinforcement learning

Assignee: IBM · Priority: Jul 30, 2018 · Filed: Jul 30, 2018 · Granted: Nov 15, 2022

Est. expiry: Jul 30, 2038 (~12.1 yrs left) · nominal 20-yr term from priority

Inventors: PHAM TU-HOA; AGRAVANTE DON JOVEN RAVOY; DE MAGISTRIS GIOVANNI; TACHIBANA RYUKI

G06N 3/045 · G06N 3/08 · G06N 3/04 · G06N 3/092 · G06N 3/0499 · G06N 3/09

Abstract

A method is provided for reinforcement learning. The method includes obtaining, by a processor device, a first set and a second set of state-action tuples. Each of the state-action tuples in the first set represents a respective good demonstration. Each of the state-action tuples in the second set represents a respective bad demonstration. The method further includes training, by the processor device using supervised learning with the first set and the second set, a neural network which takes as input a state to provide an output. The output is parameterized to obtain each of a plurality of real-valued constraint functions used for evaluation of each of a plurality of action constraints. The method also includes training, by the processor device, a policy using reinforcement learning by restricting actions predicted by the policy according to each of the plurality of action constraints with each of the plurality of real-valued constraint functions.
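
The supervised step described in the abstract can be illustrated with a small loss sketch. It is not taken from the patent: it assumes, purely for illustration, that each constraint is a linear inequality of the form W[i] @ action + b[i] <= 0, that W and b stand in for the network output for one state, and that a hinge-style penalty is used. The names `demonstration_loss`, `satisfaction_margin`, and `violation_margin` are hypothetical.

```python
import numpy as np

def constraint_values(W, b, action):
    """Evaluate g_i(a) = W[i] @ a + b[i]; constraint i holds when g_i(a) <= 0.

    W and b are placeholders for what the network would output for one state
    (an assumption of this sketch, not the patent's parameterization).
    """
    return W @ action + b

def demonstration_loss(W, b, action, is_good,
                       satisfaction_margin=0.1, violation_margin=0.1):
    """Hinge-style loss for a single state-action demonstration."""
    g = constraint_values(W, b, action)
    if is_good:
        # Good demonstration: every constraint should hold with some slack.
        return float(np.sum(np.maximum(0.0, g + satisfaction_margin)))
    # Bad demonstration: at least one constraint should be violated,
    # so only the most-violated constraint is checked.
    return float(max(0.0, violation_margin - np.max(g)))

# Two toy constraints: a[0] <= 1 and a[1] <= 1.
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([-1.0, -1.0])
good_action = np.array([0.2, 0.3])   # inside both constraints
bad_action = np.array([1.5, 0.0])    # violates the first constraint

loss_good = demonstration_loss(W, b, good_action, is_good=True)
loss_bad = demonstration_loss(W, b, bad_action, is_good=False)
loss_mislabeled = demonstration_loss(W, b, bad_action, is_good=True)
```

With these toy constraints both correctly labeled demonstrations incur zero loss, while treating the bad action as a good one is penalized. In a full pipeline this per-sample loss would be summed over both sets and minimized with respect to the network weights.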

Claims

exact text as granted — not AI-modified

What is claimed is: 
     
       1. A computer-implemented method for reinforcement learning, comprising:
 obtaining, by a processor device, a first set and a second set of state-action tuples, each of the state-action tuples in the first set representing a respective good demonstration, and each of the state-action tuples in the second set representing a respective bad demonstration; 
 training, by the processor device using supervised learning with the first set and the second set to minimize a neural network loss, a neural network which takes as input a state to provide an output, the output being parameterized to obtain each of a plurality of real-valued constraint functions used for evaluation of each of a plurality of action constraints; and 
 training, by the processor device, a policy using reinforcement learning by restricting actions during exploration which are predicted by the policy according to each of the plurality of action constraints with each of the plurality of real-valued constraint functions such that restricted actions result in a goal being reached faster than unrestricted actions by bypassing a computer operation unlikely to improve a computer output relating to the goal while avoiding wasting computer resources consumed by performing the bypassed computer operation, 
 wherein the neural network loss is minimized by satisfying all constraints for good demonstrations in the first set while violating at least one constraint for bad demonstrations in the second set. 
 
     
     
       2. The computer-implemented method of  claim 1 , wherein the neural network is trained such that the first set satisfies each of the plurality of action constraints and the second set violates at least one of the plurality of action constraints, evaluated with each of the plurality of real-valued constraint functions. 
     
     
       3. The computer-implemented method of  claim 1 , wherein training the policy comprises calculating, by using each of the plurality of real-valued constraint functions, an action closest to the action predicted by the policy among actions which satisfy each of the plurality of action constraints and executing the calculated action on an environment to obtain a reward for the reinforcement learning. 
     
     
       4. The computer-implemented method of  claim 1 , wherein each of the plurality of action constraints is an inequality constraint. 
     
     
       5. The computer-implemented method of  claim 1 , wherein the first set is relaxed to allow non-optimal demonstrations that are directed closer towards succeeding than failing. 
     
     
       6. The computer-implemented method of  claim 1 , wherein the evaluation of each of the plurality of action constraints is performed relative to a violation margin and a satisfaction margin, wherein for a given one of the restricted actions, the violation margin represents a margin of violation between the action and the plurality of action constraints, and the satisfaction margin represents a margin of satisfaction between the action and the plurality of action constraints. 
     
     
       7. The computer-implemented method of  claim 1 , wherein the first set and the second set of state-action tuples are used as action ranges during the exploration in the reinforcement learning. 
     
     
       8. A computer program product for reinforcement learning, the computer program product comprising a non-transitory computer readable storage medium having program instructions embodied therewith, the program instructions executable by a computer to cause the computer to perform a method comprising:
 obtaining, by a processor device, a first set and a second set of state-action tuples, each of the state-action tuples in the first set representing a respective good demonstration, and each of the state-action tuples in the second set representing a respective bad demonstration; 
 training, by the processor device using supervised learning with the first set and the second set, to minimize a neural network loss, a neural network which takes as input a state to provide an output, the output being parameterized to obtain each of a plurality of real-valued constraint functions used for evaluation of each of a plurality of action constraints; and 
 training, by the processor device, a policy using reinforcement learning by restricting actions during exploration which are predicted by the policy according to each of the plurality of action constraints with each of the plurality of real-valued constraint functions such that restricted actions result in a goal being reached faster than unrestricted actions by bypassing a computer operation unlikely to improve a computer output relating to the goal while avoiding wasting computer resources consumed by performing the bypassed computer operation, 
 wherein the neural network loss is minimized by satisfying all constraints for good demonstrations in the first set while violating at least one constraint for bad demonstrations in the second set. 
 
     
     
       9. The computer program product of  claim 8 , wherein the neural network is trained such that the first set satisfies each of the plurality of action constraints and the second set violates at least one of the plurality of action constraints, evaluated with each of the plurality of real-valued constraint functions. 
     
     
       10. The computer program product of  claim 8 , wherein training the policy comprises calculating, by using each of the plurality of real-valued constraint functions, an action closest to the action predicted by the policy among actions which satisfy each of the plurality of action constraints and executing the calculated action on an environment to obtain a reward for the reinforcement learning. 
     
     
       11. The computer program product of  claim 8 , wherein each of the plurality of action constraints is an inequality constraint. 
     
     
       12. The computer program product of  claim 8 , wherein the first set is relaxed to allow non-optimal demonstrations that are directed closer towards succeeding than failing. 
     
     
       13. The computer program product of  claim 8 , wherein the evaluation of each of the plurality of action constraints is performed relative to a violation margin and a satisfaction margin, wherein for a given one of the restricted actions, the violation margin represents a margin of violation between the action and the plurality of action constraints, and the satisfaction margin represents a margin of satisfaction between the action and the plurality of action constraints. 
     
     
       14. The computer program product of  claim 8 , wherein the first set and the second set of state-action tuples are used as action ranges during the exploration in the reinforcement learning. 
     
     
       15. A computer processing system for reinforcement learning, comprising:
 a memory for storing program code; and 
 a processor device operatively coupled to the memory for running the program code to
 obtain a first set and a second set of state-action tuples, each of the state-action tuples in the first set representing a respective good demonstration, and each of the state-action tuples in the second set representing a respective bad demonstration; 
 train, using supervised learning with the first set and the second set to minimize a neural network loss, a neural network which takes as input a state to provide an output, the output being parameterized to obtain each of a plurality of real-valued constraint functions used for evaluation of each of a plurality of action constraints; and 
 
 train a policy using reinforcement learning by restricting actions during exploration which are predicted by the policy according to each of the plurality of action constraints with each of the plurality of real-valued constraint functions such that restricted actions result in a goal being reached faster than unrestricted actions by bypassing a computer operation unlikely to improve a computer output relating to the goal while avoiding wasting computer resources consumed by performing the bypassed computer operation, 
 wherein the neural network loss is minimized by satisfying all constraints for good demonstrations in the first set while violating at least one constraint for bad demonstrations in the second set. 
 
     
     
       16. The computer processing system of  claim 15 , wherein the processor device trains the neural network such that the first set satisfies each of the plurality of action constraints and the second set violates at least one of the plurality of action constraints, evaluated with each of the plurality of real-valued constraint functions. 
     
     
       17. The computer processing system of  claim 15 , wherein the processor device trains the policy by calculating, by using each of the plurality of real-valued constraint functions, an action closest to the action predicted by the policy among actions which satisfy each of the plurality of action constraints and executing the calculated action on an environment to obtain a reward for the reinforcement learning. 
     
     
       18. The computer processing system of  claim 15 , wherein each of the plurality of action constraints is an inequality constraint. 
     
     
       19. The computer processing system of  claim 15 , wherein the first set is relaxed to allow non-optimal demonstrations that are directed closer towards succeeding than failing. 
     
     
       20. The computer processing system of  claim 15 , wherein the evaluation of each of the plurality of action constraints is performed relative to a violation margin and a satisfaction margin, wherein for a given one of the restricted actions, the violation margin represents a margin of violation between the action and the plurality of action constraints, and the satisfaction margin represents a margin of satisfaction between the action and the plurality of action constraints.
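
Claims 3, 10, and 17 describe replacing the policy's predicted action with the closest action that satisfies every constraint. A minimal sketch of that step follows, under assumptions that are not in the claims: the constraints are taken to be linear inequalities W[i] @ action + b[i] <= 0 with a non-empty feasible set, distance is Euclidean, and the projection is computed with Dykstra's alternating-projection algorithm (a generic technique chosen here for illustration, not the patent's stated procedure). The function names are hypothetical.

```python
import numpy as np

def project_halfspace(a, w, b):
    """Euclidean projection of a onto the half-space {x : w @ x + b <= 0}."""
    violation = w @ a + b
    if violation <= 0.0:
        return a  # already feasible for this constraint
    return a - (violation / (w @ w)) * w

def closest_feasible_action(action, W, b, sweeps=100):
    """Approximate the closest action satisfying all half-space constraints.

    Uses Dykstra's algorithm: cyclic projections with per-constraint
    correction terms, which converge to the projection onto the intersection.
    """
    x = np.asarray(action, dtype=float).copy()
    corrections = np.zeros((len(b), x.size))
    for _ in range(sweeps):
        for i in range(len(b)):
            y = x + corrections[i]
            x = project_halfspace(y, W[i], b[i])
            corrections[i] = y - x
    return x

# Toy constraints: a[0] <= 1 and a[1] <= 1.
W = np.array([[1.0, 0.0], [0.0, 1.0]])
b = np.array([-1.0, -1.0])

predicted = np.array([1.5, 2.0])          # policy output violating both
restricted = closest_feasible_action(predicted, W, b)
unchanged = closest_feasible_action(np.array([0.2, 0.3]), W, b)
```

In this toy case the infeasible prediction is pulled back to the corner of the feasible region, and an already feasible action is returned unchanged. During reinforcement learning, the restricted action would be the one executed on the environment to obtain the reward.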

Cited by (0)

No later patents cite this yet.

References (0)

No backward citations on record.