Abstract
This paper presents a data-driven Model Predictive Controller (MPC) whose cost function weights are updated by an off-policy Reinforcement Learning (RL) method called Deep Double Expected Sarsa. The parameterized MPC cost function serves as the estimator of the current action-value function, while a Neural Network approximates the subsequent action-value function. This target Neural Network is trained on inputs and outputs of the primary MPC obtained at previous sampling times. Training is performed either within each sampling time, sharing the time slot with the main algorithm, or entirely in parallel to the main algorithm; the latter reduces the real-time computation required per time slot. Two strategies are employed to compute the action of the target policy: (i) a greedy policy that minimizes the Neural Network model with respect to the action, and (ii) the second element of the MPC vector obtained at the previous sampling time. Results show no significant difference between the two strategies in final control performance or training speed, whereas the real-time computational cost is significantly lower for the second strategy, since the optimization over the Neural Network can be omitted.
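The two ways of choosing the target-policy action can be illustrated with a minimal sketch, which is not the paper's implementation. Everything named here is an assumption made for illustration: the functions `q_target`, `greedy_action`, `shifted_mpc_action`, and `td_target`, the quadratic stand-in for the target Neural Network, the scalar action with a grid search in place of a numerical optimizer, the discount factor `gamma`, and the reading of "MPC vector" as the predicted action sequence. Action values are treated as costs, so the greedy action is a minimizer. For a deterministic target policy, the expectation in Expected Sarsa reduces to evaluating the target model at a single action, which is what `td_target` does.

```python
def q_target(state, action):
    """Hypothetical stand-in for the target Neural Network Q(s, a), read as a cost."""
    return (state + action) ** 2 + 0.1 * action ** 2


def greedy_action(state, candidates):
    """Strategy (i): minimize the target model with respect to the action.

    A grid search over candidate actions replaces the numerical optimization
    that a real Neural Network model would require (assumption).
    """
    return min(candidates, key=lambda a: q_target(state, a))


def shifted_mpc_action(previous_mpc_sequence):
    """Strategy (ii): reuse the second element of the action sequence that the
    MPC computed at the previous sampling time. No optimization is needed."""
    return previous_mpc_sequence[1]


def td_target(stage_cost, next_state, next_action, gamma=0.99):
    """Temporal-difference target: stage cost plus discounted target-model value."""
    return stage_cost + gamma * q_target(next_state, next_action)


# Usage: both strategies feed the same target computation.
candidates = [i / 10 for i in range(-20, 21)]      # actions in [-2, 2]
previous_mpc_sequence = [0.5, -0.3, 0.1]           # hypothetical MPC output
a_greedy = greedy_action(1.0, candidates)
a_shifted = shifted_mpc_action(previous_mpc_sequence)
y_greedy = td_target(0.2, 1.0, a_greedy)
y_shifted = td_target(0.2, 1.0, a_shifted)
```

In this sketch, strategy (ii) is a single list lookup, whereas strategy (i) evaluates the target model once per candidate action, which mirrors the difference in real-time cost described in the abstract.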