US11042640B2 · Active · Utility · PatentIndex 50

Safe-operation-constrained reinforcement-learning-based application manager

Assignee: VMWARE INC · Priority: Aug 27, 2018 · Filed: Jul 3, 2019 · Granted: Jun 22, 2021
Est. expiry: Aug 27, 2038 (~12.1 yrs left) · nominal 20-yr term from priority
Inventors: NAG DEV · BURK GREGORY T · YANKOV YANISLAV · STEPHEN NICHOLAS MARK GRANT · WANG DONGNI
G06N 7/01 · G06F 21/604 · G06F 2221/034 · G06N 20/00 · G06F 21/57
PatentIndex Score: 50 · Cited by: 0 · References: 9 · Claims: 20

Abstract

The current document is directed to a safe-operation-constrained reinforcement-learning-based application manager that can be deployed in various different computational environments, without extensive manual modification and interface development, to manage the computational environments with respect to one or more reward-specified goals. Control actions undertaken by the safe-operation-constrained reinforcement-learning-based application manager are constrained, by stored action filters, to constrain state/action-space exploration by the safe-operation-constrained reinforcement-learning-based application manager to safe actions and thus prevent deleterious impact to the managed computational environment.

Claims

exact text as granted — not AI-modified
The invention claimed is: 
     
       1. A safe-operation-constrained reinforcement-learning-based application manager that manages one or more applications and a computing environment, within which the applications run, comprising one or more of a distributed computing system having multiple computer systems interconnected by one or more networks, a standalone computer system, and a processor-controlled user device, the modular reinforcement-learning based application manager comprising:
 a safe-operation-constrained reinforcement-learning-based application manager that receives rewards and observations from the computing environment and issues actions, indicated by an internally maintained policy π, to the computing environment; and 
 one or more filtering subsystems that apply one or more filters to actions indicated by an internally maintained policy π to prevent the safe-operation-constrained reinforcement-learning-based application manager from issuing actions that, if executed by the computing environment, would lead to harmful and undesired results. 
 
     
     
       2. The safe-operation-constrained reinforcement-learning-based application manager of  claim 1 
 wherein each action is represented as a vector of values and specifies one or more actions to be carried out by the computing environment; and 
 wherein the observations are represented as a vector of values that include metric values, configurations parameters, operational parameters, operation characteristics, and other values indicative of the current application and computing-environment state. 
 
     
     
       3. The safe-operation-constrained reinforcement-learning-based application manager of  claim 2  wherein the safe-operation-constrained reinforcement-learning-based application manager maintains:
 the policy π; 
 a current belief distribution b; 
 an action-value-update function; 
 a belief-distribution-update function; and 
 termination conditions. 
 
     
     
       4. The safe-operation-constrained reinforcement-learning-based application manager of  claim 2  wherein the safe-operation-constrained reinforcement-learning-based application manager:
 continuously
 receives a reward and an observation vector from the computing environment; 
 determines a new belief distribution b′ using the belief-distribution-update function and observation vector; 
 generates a next action a′ by applying the policy π to the new belief distribution b′; 
 applies one or more filter subsystems to the next action a′; and 
 delivers the next action a′ to the computing environment. 
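
Illustrative note (not part of the granted claim text): the loop recited in claim 4 can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation. The two-state belief, the `toy_likelihood` model, the `toy_policy` rule, and the `cap_step` filter are all hypothetical, and `None` stands in for the NULL action vector. The reward is accepted but the learning update is omitted.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Action = Optional[Vector]  # None stands in for the NULL action vector


def update_belief(b: Vector, likelihoods: Sequence[float]) -> Vector:
    """Bayes-style update: b'(s) is proportional to P(observation | s) * b(s)."""
    unnormalized = [l * p for l, p in zip(likelihoods, b)]
    total = sum(unnormalized)
    # Keep the old belief if the observation is impossible under every state.
    return [u / total for u in unnormalized] if total > 0 else list(b)


def manager_step(
    b: Vector,
    reward: float,
    observation: Vector,
    likelihood: Callable[[Vector], Sequence[float]],
    policy: Callable[[Vector], Vector],
    filters: Sequence[Callable[[Vector], Action]],
) -> Tuple[Vector, Action]:
    """One pass: new belief b', next action a' = policy(b'), then filtering."""
    # `reward` would feed an action-value update; learning is omitted here.
    b_new = update_belief(b, likelihood(observation))
    action: Action = policy(b_new)
    for f in filters:
        if action is None:
            break  # a NULL action is not passed to further filters
        action = f(action)
    return b_new, action


# Hypothetical toy pieces: states are [healthy, overloaded].
def toy_likelihood(observation: Vector) -> Sequence[float]:
    cpu = observation[0]
    return [1.0 - cpu, cpu]


def toy_policy(b: Vector) -> Vector:
    # Ask for four more instances when overload is the likelier state.
    return [4.0] if b[1] > 0.5 else [0.0]


def cap_step(action: Vector) -> Action:
    # Limit the change to at most two instances per step.
    return [max(-2.0, min(2.0, action[0]))]


belief, action = manager_step(
    [0.5, 0.5], 0.0, [0.9], toy_likelihood, toy_policy, [cap_step]
)
```

In this toy run the high CPU reading shifts the belief toward the overloaded state, the policy proposes adding four instances, and the filter reduces that to two before the action would be delivered to the environment.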
 
 
     
     
       5. The safe-operation-constrained reinforcement-learning-based application manager of  claim 1 
 wherein the one or more filtering subsystems each comprises one or more filter stacks; and 
 wherein a filter stack comprises multiple filters. 
 
     
     
       6. The safe-operation-constrained reinforcement-learning-based application manager of  claim 5  wherein a filter receives an input action vector or an input action vector and an observation prediction and returns one of the input action vector, a modified version of the input action vector, or a NULL action vector. 
     
     
       7. The safe-operation-constrained reinforcement-learning-based application manager of  claim 6  wherein a first type of filter contains logic that analyzes an input action vector to
 return the input action vector when the action vector represents a safe action; and 
 when the input action vector represents an unsafe or deleterious action,
 when the input action vector can be modified to represent a related, safe action, modifies the input action vector and returns the modified action vector, and 
 otherwise returns a NULL action vector. 
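
Illustrative note (not part of the granted claim text): the three outcomes of the first filter type in claim 7 (pass through, modify, or return NULL) might look like the sketch below. The action layout (first element is a change in instance count), the `MAX_STEP` limit, and the rule that a non-finite value cannot be repaired are assumptions made for the example. `None` stands in for the NULL action vector.

```python
import math
from typing import List, Optional

MAX_STEP = 2.0  # hypothetical limit on instances added or removed per action


def scale_limit_filter(action: List[float]) -> Optional[List[float]]:
    """Pass safe actions, clamp oversized ones, reject unrepairable ones."""
    delta = action[0]
    if not math.isfinite(delta):
        # No related safe action can be derived from NaN or infinity.
        return None
    if abs(delta) <= MAX_STEP:
        # Safe: return the input action vector unchanged.
        return action
    # Unsafe but repairable: return a modified, related safe action.
    clamped = max(-MAX_STEP, min(MAX_STEP, delta))
    return [clamped] + action[1:]
```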
 
 
     
     
       8. The safe-operation-constrained reinforcement-learning-based application manager of  claim 6  wherein a second type of filter contains logic that analyzes an input action vector and an observation prediction to
 return the input action vector when the action vector represents a safe action; and 
 when the input action vector represents an unsafe or deleterious action,
 when the input action vector can be modified to represent a related, safe action, modifies the input action vector and returns the modified action vector, and 
 otherwise returns a NULL action vector. 
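
Illustrative note (not part of the granted claim text): the second filter type in claim 8 differs from the first in that it also receives an observation prediction. A minimal sketch, assuming a hypothetical prediction vector whose first element is the predicted utilization after the action, and a hypothetical threshold `MAX_PREDICTED_UTILIZATION`:

```python
from typing import List, Optional

MAX_PREDICTED_UTILIZATION = 0.9  # hypothetical safety threshold


def predicted_load_filter(
    action: List[float], prediction: List[float]
) -> Optional[List[float]]:
    """Judge an action by the observation predicted to follow it."""
    delta = action[0]                  # assumed: change in instance count
    predicted_utilization = prediction[0]
    if predicted_utilization <= MAX_PREDICTED_UTILIZATION:
        # Safe: return the input action vector unchanged.
        return action
    if delta < 0:
        # A scale-down predicted to overload is replaced by "no scaling".
        return [0.0] + action[1:]
    # No related safe action is known to this filter: return NULL.
    return None
```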
 
 
     
     
       9. The safe-operation-constrained reinforcement-learning-based application manager of  claim 5  wherein a filter stack
 applies the first filter in the filter stack to an input action vector; 
 successively applies each remaining filter to the vector output from the preceding stack, short-circuiting successive application of the remaining filters when the preceding filter outputs a NULL vector; and 
 returns either a NULL action vector, the input action vector, or a modified action vector. 
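
Illustrative note (not part of the granted claim text): the short-circuiting filter stack of claim 9 can be sketched as a simple fold over the filters. The sketch reads "the vector output from the preceding stack" as the output of the preceding filter, which is an interpretation. The three example filters and the `calls` log are hypothetical, and `None` stands in for the NULL vector.

```python
from typing import Callable, List, Optional, Sequence

Action = Optional[List[float]]  # None stands in for the NULL action vector
Filter = Callable[[List[float]], Action]


def apply_filter_stack(filters: Sequence[Filter], action: List[float]) -> Action:
    """Apply filters in order; stop as soon as one returns NULL."""
    current: Action = action
    for f in filters:
        current = f(current)
        if current is None:
            return None  # short-circuit: remaining filters are skipped
    return current


calls: List[str] = []  # records which filters actually ran


def clamp(a: List[float]) -> Action:
    calls.append("clamp")
    return [max(-2.0, min(2.0, a[0]))]


def reject_scale_down(a: List[float]) -> Action:
    calls.append("reject_scale_down")
    return None if a[0] < 0 else a


def audit(a: List[float]) -> Action:
    calls.append("audit")
    return a
```

With the stack `[clamp, reject_scale_down, audit]`, an input of `[5.0]` passes through all three filters and comes back modified, while an input of `[-5.0]` is rejected by the second filter and `audit` never runs.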
 
     
     
       10. The safe-operation-constrained reinforcement-learning-based application manager of  claim 5  wherein a filtering subsystem
 receives input comprising one of an input action vector and an observation prediction; 
 determines a filter stack to which to direct the received input; 
 directs the input to the determined filter stack; 
 receives an output from the filter stack; and 
 when the input is determined to require additional processing,
 repeats filter-stack determination to determine a next filter stack and directs the output to the next filter stack to generate a next output, and 
 otherwise returns the output. 
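
Illustrative note (not part of the granted claim text): the routing behavior of claim 10 (pick a stack, run it, and repeat while further processing is needed) might be sketched as below. The claim does not say how a stack is chosen or how the need for additional processing is decided, so the `select` callback, the `visited` set, the two stack names, and the two-element action layout `[instance_delta, restart_flag]` are all assumptions.

```python
from typing import Callable, Dict, List, Optional, Sequence, Set

Action = Optional[List[float]]  # None stands in for the NULL action vector
Filter = Callable[[List[float]], Action]
Selector = Callable[[List[float], Set[str]], Optional[str]]


def run_stack(filters: Sequence[Filter], action: List[float]) -> Action:
    current: Action = action
    for f in filters:
        current = f(current)
        if current is None:
            return None
    return current


def filtering_subsystem(
    stacks: Dict[str, Sequence[Filter]], select: Selector, action: List[float]
) -> Action:
    """Route the action through stacks until none is selected or it is NULL."""
    current: Action = action
    visited: Set[str] = set()
    while current is not None:
        name = select(current, visited)  # filter-stack determination
        if name is None:
            break                        # no additional processing required
        visited.add(name)
        current = run_stack(stacks[name], current)
    return current


# Hypothetical stacks for an action laid out as [instance_delta, restart_flag].
def clamp_delta(a: List[float]) -> Action:
    return [max(-2.0, min(2.0, a[0])), a[1]]


def no_restart_while_scaling(a: List[float]) -> Action:
    # Clear the restart flag if the same action also changes instance count.
    return [a[0], 0.0] if a[0] != 0 and a[1] != 0 else a


def select(a: List[float], visited: Set[str]) -> Optional[str]:
    if a[0] != 0 and "scaling" not in visited:
        return "scaling"
    if a[1] != 0 and "restart" not in visited:
        return "restart"
    return None


stacks = {"scaling": [clamp_delta], "restart": [no_restart_while_scaling]}
result = filtering_subsystem(stacks, select, [5.0, 1.0])
```

Tracking visited stacks is one simple way to guarantee the loop terminates; the claim itself leaves that design choice open.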
 
 
     
     
       11. A method constraining a reinforcement-learning-based application manager to issue safe actions, the method comprising:
 including, in the reinforcement-learning-based application manager that manages one or more applications and a computing environment, within which the applications run, comprising one or more of a distributed computing system having multiple computer systems interconnected by one or more networks, a standalone computer system, and a processor-controlled user device, one or more action filtering subsystems that apply one or more filters to actions indicated by a policy π internally maintained by the reinforcement-learning-based application manager; and 
 applying, by the reinforcement-learning-based application manager, actions, indicated by an internally maintained policy π, to one or more action filtering subsystems. 
 
     
     
       12. The method of  claim 11 
 wherein each action is represented as a vector of values and specifies one or more actions to be carried out by the computing environment; and 
 wherein the observations are represented as a vector of values that include metric values, configurations parameters, operational parameters, operation characteristics, and other values indicative of the current application and computing-environment state. 
 
     
     
       13. The method of  claim 12  wherein the reinforcement-learning-based application manager maintains:
 the policy π; 
 a current belief distribution b; 
 an action-value-update function; 
 a belief-distribution-update function; and 
 termination conditions. 
 
     
     
       14. The method of  claim 13  wherein the reinforcement-learning-based application manager:
 continuously
 receives a reward and an observation vector from the computing environment; 
 determines a new belief distribution b′ using the belief-distribution-update function and observation vector; 
 generates a next action a′ by applying the policy π to the new belief distribution b′; 
 applies one or more filter subsystems to the next action a′; and 
 delivers the next action a′ to the computing environment. 
 
 
     
     
       15. The method of  claim 11 
 wherein the one or more filtering subsystems each comprises one or more filter stacks; and 
 wherein a filter stack comprises multiple filters. 
 
     
     
       16. The method of  claim 15  wherein a filter receives an input action vector or an input action vector and an observation prediction and returns one of the input action vector, a modified version of the input action vector, or a NULL action vector. 
     
     
       17. The method of  claim 16  wherein a first type of filter contains logic that analyzes an input action vector to
 return the input action vector when the action vector represents a safe action; and 
 when the input action vector represents an unsafe or deleterious action,
 when the input action vector can be modified to represent a related, safe action, modifies the input action vector and returns the modified action vector, and 
 otherwise returns a NULL action vector. 
 
 
     
     
       18. The method of  claim 16  wherein a second type of filter contains logic that analyzes an input action vector and an observation prediction to
 return the input action vector when the action vector represents a safe action; and 
 when the input action vector represents an unsafe or deleterious action,
 when the input action vector can be modified to represent a related, safe action, modifies the input action vector and returns the modified action vector, and 
 otherwise returns a NULL action vector. 
 
 
     
     
       19. The method of  claim 15  wherein a filter stack
 applies the first filter in the filter stack to an input action vector; 
 successively applies each remaining filter to the vector output from the preceding stack, short-circuiting successive application of the remaining filters when the preceding filter outputs a NULL vector; and 
 returns either a NULL action vector, the input action vector, or a modified action vector. 
 
     
     
       20. The method of  claim 15  wherein a filtering subsystem
 receives input comprising one of an input action vector and an input action vector and observation prediction; 
 determines a filter stack to which to direct the received input; 
 directs the input to the determined filter stack; 
 receives an output from the filter stack; and 
 when the input is determined to require additional processing,
 repeats filter-stack determination to determine a next filter stack and directs the output to the next filter stack to generate a next output, and 
 otherwise returns the output.
