Safe-operation-constrained reinforcement-learning-based application manager
Abstract
The current document is directed to a safe-operation-constrained reinforcement-learning-based application manager that can be deployed in a variety of computational environments, without extensive manual modification and interface development, to manage those environments with respect to one or more reward-specified goals. Control actions undertaken by the application manager are constrained by stored action filters, which limit the manager's state/action-space exploration to safe actions and thus prevent deleterious impact on the managed computational environment.
Claims
The invention claimed is:
1. A safe-operation-constrained reinforcement-learning-based application manager that manages one or more applications and a computing environment, within which the applications run, comprising one or more of a distributed computing system having multiple computer systems interconnected by one or more networks, a standalone computer system, and a processor-controlled user device, the modular reinforcement-learning based application manager comprising:
a safe-operation-constrained reinforcement-learning-based application manager that receives rewards and observations from the computing environment and issues actions, indicated by an internally maintained policy π, to the computing environment; and
one or more filtering subsystems that apply one or more filters to actions indicated by an internally maintained policy π to prevent the safe-operation-constrained reinforcement-learning-based application manager from issuing actions that, if executed by the computing environment, would lead to harmful and undesired results.
2. The safe-operation-constrained reinforcement-learning-based application manager of claim 1
wherein each action is represented as a vector of values and specifies one or more actions to be carried out by the computing environment; and
wherein the observations are represented as a vector of values that include metric values, configurations parameters, operational parameters, operation characteristics, and other values indicative of the current application and computing-environment state.
3. The safe-operation-constrained reinforcement-learning-based application manager of claim 2 wherein the safe-operation-constrained reinforcement-learning-based application manager maintains:
the policy π;
a current belief distribution b;
an action-value-update function;
a belief-distribution-update function; and
termination conditions.
4. The safe-operation-constrained reinforcement-learning-based application manager of claim 2 wherein the safe-operation-constrained reinforcement-learning-based application manager:
continuously
receives a reward and an observation vector from the computing environment;
determines a new belief distribution b′ using the belief-distribution-update function and observation vector;
generates a next action a′ by applying the policy π to the new belief distribution b′;
applies one or more filter subsystems to the next action a′; and
delivers the next action a′ to the computing environment.
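Illustrative note (not part of the claims): the loop recited in claim 4 can be sketched as a single step function. This is a minimal sketch under stated assumptions, not the patented implementation. The names `control_step`, `average_update`, `toy_policy`, and `cap_first` are hypothetical, the belief distribution is reduced to a plain list of floats, and `None` stands in for the NULL action vector.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Action = Optional[Vector]  # None stands in for the NULL action vector


def control_step(
    belief: Vector,
    reward: float,
    observation: Vector,
    update_belief: Callable[[Vector, Vector], Vector],
    policy: Callable[[Vector], Vector],
    filter_subsystems: Sequence[Callable[[Vector], Action]],
) -> Tuple[Vector, Action]:
    """One pass of the loop: observe, update the belief, choose, filter."""
    # The reward would feed an action-value update; that step is omitted here.
    _ = reward
    new_belief = update_belief(belief, observation)  # b -> b'
    action: Action = policy(new_belief)              # a' = pi(b')
    for subsystem in filter_subsystems:
        if action is None:  # a NULL action needs no further filtering
            break
        action = subsystem(action)
    return new_belief, action


# Toy stand-ins (all hypothetical) used only to exercise the step.
def average_update(b: Vector, o: Vector) -> Vector:
    return [0.5 * (x + y) for x, y in zip(b, o)]


def toy_policy(b: Vector) -> Vector:
    return [10.0 * b[0], b[1]]


def cap_first(a: Vector) -> Action:
    # Assumed safety rule: the first component may not exceed 4.0.
    return [min(a[0], 4.0), a[1]]


new_b, a = control_step(
    [0.0, 1.0], 1.0, [1.0, 0.0], average_update, toy_policy, [cap_first]
)
print(new_b, a)  # [0.5, 0.5] [4.0, 0.5]
```

In this sketch the caller decides what to do with a `None` result, for example skipping the delivery step for that iteration.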
5. The safe-operation-constrained reinforcement-learning-based application manager of claim 1
wherein the one or more filtering subsystems each comprises one or more filter stacks; and
wherein a filter stack comprises multiple filters.
6. The safe-operation-constrained reinforcement-learning-based application manager of claim 5 wherein a filter receives an input action vector or an input action vector and an observation prediction and returns one of the input action vector, a modified version of the input action vector, or a NULL action vector.
7. The safe-operation-constrained reinforcement-learning-based application manager of claim 6 wherein a first type of filter contains logic that analyzes an input action vector to
return the input action vector when the action vector represents a safe action; and
when the input action vector represents an unsafe or deleterious action,
when the input action vector can be modified to represent a related, safe action, modifies the input action vector and returns the modified action vector, and
otherwise returns a NULL action vector.
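Illustrative note (not part of the claims): a filter of the first type, as recited in claim 7, inspects only the action vector and has three outcomes. The sketch below assumes a hypothetical two-element action vector `[instance_delta, memory_mb]` and a hypothetical limit `MAX_SCALE_STEP`. Neither comes from the patent text, and `None` again stands in for the NULL action vector.

```python
import math
from typing import List, Optional

ActionVector = List[float]

MAX_SCALE_STEP = 2.0  # assumed limit on instances added or removed per action


def scale_step_filter(action: ActionVector) -> Optional[ActionVector]:
    """Return the action, a clamped copy of it, or None (NULL)."""
    # Malformed vectors cannot be repaired, so they map to the NULL action.
    if len(action) != 2 or any(math.isnan(v) for v in action):
        return None
    step = action[0]
    # Safe action: pass it through unchanged.
    if abs(step) <= MAX_SCALE_STEP:
        return action
    # Unsafe but repairable: clamp the step, keeping its sign.
    return [math.copysign(MAX_SCALE_STEP, step), action[1]]


print(scale_step_filter([1.0, 512.0]))          # [1.0, 512.0]
print(scale_step_filter([5.0, 512.0]))          # [2.0, 512.0]
print(scale_step_filter([float("nan"), 1.0]))   # None
```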
8. The safe-operation-constrained reinforcement-learning-based application manager of claim 6 wherein a second type of filter contains logic that analyzes an input action vector and an observation prediction to
return the input action vector when the action vector represents a safe action; and
when the input action vector represents an unsafe or deleterious action,
when the input action vector can be modified to represent a related, safe action, modifies the input action vector and returns the modified action vector, and
otherwise returns a NULL action vector.
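Illustrative note (not part of the claims): a filter of the second type, as recited in claim 8, also consults an observation prediction. The sketch below assumes a hypothetical action whose first element is a memory limit in MB and a hypothetical prediction whose first element is the predicted memory use in MB. `HEADROOM` and `HOST_MEMORY_MB` are made-up constants, and `None` stands in for the NULL action vector.

```python
from typing import List, Optional

Vector = List[float]

HEADROOM = 1.25          # assumed safety margin over predicted memory use
HOST_MEMORY_MB = 8192.0  # assumed physical memory available on the host


def memory_limit_filter(action: Vector, prediction: Vector) -> Optional[Vector]:
    """Check a proposed memory limit against predicted memory use."""
    limit_mb = action[0]
    required_mb = prediction[0] * HEADROOM
    # Safe action: the proposed limit already covers the predicted need.
    if limit_mb >= required_mb:
        return action
    # Unsafe but repairable: raise the limit if the host can supply it.
    if required_mb <= HOST_MEMORY_MB:
        return [required_mb] + action[1:]
    # No related safe action exists, so return the NULL action.
    return None


print(memory_limit_filter([2048.0], [1000.0]))  # [2048.0]
print(memory_limit_filter([1024.0], [1000.0]))  # [1250.0]
print(memory_limit_filter([1024.0], [8000.0]))  # None
```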
9. The safe-operation-constrained reinforcement-learning-based application manager of claim 5 wherein a filter stack
applies the first filter in the filter stack to an input action vector;
successively applies each remaining filter to the vector output from the preceding stack, short-circuiting successive application of the remaining filters when the preceding filter outputs a NULL vector; and
returns either a NULL action vector, the input action vector, or a modified action vector.
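Illustrative note (not part of the claims): the short-circuiting behavior recited in claim 9 amounts to a fold over the filters that stops at the first NULL result. The following is a minimal sketch, assuming filters are plain callables and `None` represents the NULL action vector. The function `apply_stack` and the example filters are hypothetical.

```python
from typing import Callable, List, Optional, Sequence

Vector = List[float]
Filter = Callable[[Vector], Optional[Vector]]


def apply_stack(filters: Sequence[Filter], action: Vector) -> Optional[Vector]:
    """Apply filters in order, stopping as soon as one returns None (NULL)."""
    current: Optional[Vector] = action
    for f in filters:
        current = f(current)
        if current is None:
            return None  # short-circuit: the remaining filters never run
    return current


calls: List[str] = []  # records which filters actually ran


def clamp(a: Vector) -> Optional[Vector]:
    calls.append("clamp")
    return [min(v, 10.0) for v in a]


def reject_negative(a: Vector) -> Optional[Vector]:
    calls.append("reject_negative")
    return None if any(v < 0 for v in a) else a


def double(a: Vector) -> Optional[Vector]:
    calls.append("double")
    return [2 * v for v in a]


print(apply_stack([clamp, reject_negative, double], [12.0, 3.0]))  # [20.0, 6.0]
calls.clear()
print(apply_stack([clamp, reject_negative, double], [-1.0, 3.0]))  # None
print(calls)  # ['clamp', 'reject_negative']
```

The second call shows the short circuit: `double` is never invoked once `reject_negative` returns `None`.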
10. The safe-operation-constrained reinforcement-learning-based application manager of claim 5 wherein a filtering subsystem
receives input comprising one of an input action vector and an observation prediction;
determines a filter stack to which to direct the received input;
directs the input to the determined filter stack;
receives an output from the filter stack; and
when the input is determined to require additional processing,
repeats filter-stack determination to determine a next filter stack and directs the output to the next filter stack to generate a next output, and
otherwise returns the output.
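Illustrative note (not part of the claims): the routing behavior recited in claim 10 can be sketched as a loop that repeatedly picks a filter stack until no further processing is required. How a stack is chosen is not specified in the claim, so the `choose_stack` callback, the stack names, and the `visited` bookkeeping below are all assumptions made for illustration. Each stack is modeled as a single callable, and `None` stands in for the NULL action vector.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]
Stack = Callable[[Vector], Optional[Vector]]
Chooser = Callable[[Vector, List[str]], Optional[str]]


def run_subsystem(
    action: Vector, stacks: Dict[str, Stack], choose_stack: Chooser
) -> Optional[Vector]:
    """Route an action through stacks until the chooser reports completion."""
    output: Optional[Vector] = action
    visited: List[str] = []
    while output is not None:
        name = choose_stack(output, visited)
        if name is None:  # no additional processing required
            break
        visited.append(name)
        output = stacks[name](output)
    return output


# Hypothetical stacks: one caps a scaling step, one enforces a memory floor.
example_stacks: Dict[str, Stack] = {
    "scaling": lambda a: [min(a[0], 2.0), a[1]],
    "memory": lambda a: [a[0], max(a[1], 256.0)],
}


def example_chooser(a: Vector, visited: List[str]) -> Optional[str]:
    # Assumed rule: visit "scaling" only for nonzero steps, then "memory" once.
    if a[0] != 0.0 and "scaling" not in visited:
        return "scaling"
    if "memory" not in visited:
        return "memory"
    return None


print(run_subsystem([5.0, 100.0], example_stacks, example_chooser))  # [2.0, 256.0]
```

Tracking visited stacks is one simple way to guarantee that this loop terminates. A real subsystem might instead use a fixed routing table or an iteration cap.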
11. A method constraining a reinforcement-learning-based application manager to issue safe actions, the method comprising:
including, in the reinforcement-learning-based application manager that manages one or more applications and a computing environment, within which the applications run, comprising one or more of a distributed computing system having multiple computer systems interconnected by one or more networks, a standalone computer system, and a processor-controlled user device, one or more action filtering subsystems that apply one or more filters to actions indicated by a policy π internally maintained by the reinforcement-learning-based application manager; and
applying, by the reinforcement-learning-based application manager, actions, indicated by an internally maintained policy π, to one or more action filtering subsystems.
12. The method of claim 11
wherein each action is represented as a vector of values and specifies one or more actions to be carried out by the computing environment; and
wherein the observations are represented as a vector of values that include metric values, configurations parameters, operational parameters, operation characteristics, and other values indicative of the current application and computing-environment state.
13. The method of claim 12 wherein the reinforcement-learning-based application manager maintains:
the policy π;
a current belief distribution b;
an action-value-update function;
a belief-distribution-update function; and
termination conditions.
14. The method of claim 13 wherein the reinforcement-learning-based application manager:
continuously
receives a reward and an observation vector from the computing environment;
determines a new belief distribution b′ using the belief-distribution-update function and observation vector;
generates a next action a′ by applying the policy π to the new belief distribution b′;
applies one or more filter subsystems to the next action a′; and
delivers the next action a′ to the computing environment.
15. The method of claim 11
wherein the one or more filtering subsystems each comprises one or more filter stacks; and
wherein a filter stack comprises multiple filters.
16. The method of claim 15 wherein a filter receives an input action vector or an input action vector and an observation prediction and returns one of the input action vector, a modified version of the input action vector, or a NULL action vector.
17. The method of claim 16 wherein a first type of filter contains logic that analyzes an input action vector to
return the input action vector when the action vector represents a safe action; and
when the input action vector represents an unsafe or deleterious action,
when the input action vector can be modified to represent a related, safe action, modifies the input action vector and returns the modified action vector, and
otherwise returns a NULL action vector.
18. The method of claim 16 wherein a second type of filter contains logic that analyzes an input action vector and an observation prediction to
return the input action vector when the action vector represents a safe action; and
when the input action vector represents an unsafe or deleterious action,
when the input action vector can be modified to represent a related, safe action, modifies the input action vector and returns the modified action vector, and
otherwise returns a NULL action vector.
19. The method of claim 15 wherein a filter stack
applies the first filter in the filter stack to an input action vector;
successively applies each remaining filter to the vector output from the preceding stack, short-circuiting successive application of the remaining filters when the preceding filter outputs a NULL vector; and
returns either a NULL action vector, the input action vector, or a modified action vector.
20. The method of claim 15 wherein a filtering subsystem
receives input comprising one of an input action vector and an input action vector and observation prediction;
determines a filter stack to which to direct the received input;
directs the input to the determined filter stack;
receives an output from the filter stack; and
when the input is determined to require additional processing,
repeats filter-stack determination to determine a next filter stack and directs the output to the next filter stack to generate a next output, and
otherwise returns the output.