Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. There exist several convergent and consistent RL algorithms which have been intensively studied. In their original form, these algorithms require that the environment states and agent actions take values in a relatively small discrete set. Fuzzy representations for approximate, model-free RL have been proposed in the literature for the more difficult case where the state-action space is continuous. In this work, we propose a fuzzy approximation architecture similar to those previously used for Q-learning, but we combine it with the model-based Q-value iteration algorithm. We prove that the resulting algorithm converges. We also give a modified, asynchronous variant of the algorithm that converges at least as fast as the original version. An illustrative simulation example is provided.

Abstract: Reinforcement learning (RL) is a widely used learning paradigm for adaptive agents. Well-understood RL algorithms with good convergence and consistency properties exist. In their original form, these algorithms require that the environment states and agent actions take values in a relatively small discrete set. Fuzzy representations for approximate, model-free RL have been proposed in the literature for the more difficult case where the state-action space is continuous. In this work, we propose a fuzzy approximation structure similar to those previously used for Q-learning, but we combine it with the model-based Q-value iteration algorithm. We show that the resulting algorithm converges. We also give a modified, serial variant of the algorithm that converges at least as fast as the original version. An illustrative simulation example is provided.

Abstract: Reinforcement learning (RL) is a learning control paradigm that provides well-understood algorithms with good convergence and consistency properties. Unfortunately, these algorithms require that process states and control actions take only discrete values. Approximate solutions using fuzzy representations have been proposed in the literature for the case when the states and possibly the actions are continuous. However, the link between these mainly heuristic solutions and the larger body of work on approximate RL, including convergence results, has not been made explicit. In this paper, we propose a fuzzy approximation structure for the Q-value iteration algorithm, and show that the resulting algorithm is convergent. The proof is based on an extension of previous results in approximate RL. We then propose a modified, serial version of the algorithm that is guaranteed to converge at least as fast as the original algorithm. An illustrative simulation example is also provided.
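Stripped of the fuzzy approximation layer, the Q-value iteration algorithm that these abstracts build on can be sketched on a small discrete MDP. This is a minimal sketch, assuming a toy two-state, two-action model: the transition and reward tables below are illustrative stand-ins, not taken from the papers.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative assumption, not from the papers).
# P[s, a, s2] = transition probability, R[s, a] = immediate reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9  # discount factor

# Q-value iteration: repeatedly apply the Bellman optimality backup
#   Q(s, a) <- R(s, a) + gamma * sum_s2 P(s, a, s2) * max_a2 Q(s2, a2)
Q = np.zeros((2, 2))
for _ in range(1000):
    Q_new = R + gamma * P @ Q.max(axis=1)  # contraction in the sup norm
    if np.abs(Q_new - Q).max() < 1e-8:
        Q = Q_new
        break
    Q = Q_new

# A greedy policy is then extracted from the converged Q-function.
policy = Q.argmax(axis=1)
```

Because the backup is a contraction with factor gamma, the loop converges geometrically regardless of the initial Q; the fuzzy versions in the abstracts replace the exact table Q with a fuzzy-rule-based interpolator while keeping this iteration scheme.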

Abstract: Planning in single-agent models like MDPs and POMDPs can be carried out by resorting to Q-value functions: a (near-) optimal Q-value function is computed in a recursive manner by dynamic programming, and then a policy is extracted from this value function. In this paper we study whether similar Q-value functions can be defined in decentralized POMDP models (Dec-POMDPs), what the cost of computing such value functions is, and how policies can be extracted from such value functions. Using the framework of Bayesian games, we argue that searching for the optimal Q-value function may be as costly as exhaustive policy search. Then we analyze various approximate Q-value functions that allow efficient computation. Finally, we describe a family of algorithms for extracting policies from such Q-value functions.

Abstract: In this technical report we treat some properties of the recently introduced QBG-value function. In particular we show that it is a piecewise linear and convex function over the space of joint beliefs. Furthermore, we show that there exists an optimal infinite-horizon QBG-value function, as the QBG backup operator is a contraction mapping. We conclude by noting that the optimal Dec-POMDP Q-value function cannot be defined over joint beliefs.
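The contraction argument invoked above has the standard dynamic-programming shape. As a generic sketch (the exact QBG backup from the report is not reproduced; H below is a generic Bellman-style backup over joint beliefs b with discount factor γ and updated beliefs b'_{ao}):

```latex
\[
\bigl\| H Q_1 - H Q_2 \bigr\|_\infty
  \le \gamma \, \sup_{b,a} \sum_{o} P(o \mid b, a)
      \Bigl| \max_{a'} Q_1(b'_{ao}, a') - \max_{a'} Q_2(b'_{ao}, a') \Bigr|
  \le \gamma \, \bigl\| Q_1 - Q_2 \bigr\|_\infty,
\qquad 0 \le \gamma < 1,
\]
```

so by the Banach fixed-point theorem the backup operator admits a unique fixed point, the optimal infinite-horizon value function.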


Abstract: Reinforcement learning (RL) comprises an array of techniques that learn a control policy so as to maximize a reward signal. When applied to the control of elevator systems, RL has the potential of finding better control policies than classical heuristic, suboptimal policies. On the other hand, elevator systems offer an interesting benchmark application for the study of RL. In this paper, RL is applied to a single-elevator system. The mathematical model of the elevator system is described in detail, making the system easy to re-implement and re-use. An experimental comparison is made between the performance of the Q-value iteration and Q-learning RL algorithms, when applied to the elevator system.
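The model-free Q-learning algorithm compared in that experiment uses the standard tabular temporal-difference update. The sketch below runs it on a hypothetical two-state MDP (the elevator model itself is not reproduced here); the tables P and R only simulate the environment, while the update rule sees nothing but sampled transitions and observed rewards.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP standing in for the environment
# (illustrative assumption; not the elevator model from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.7, 0.3]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, alpha, eps = 0.9, 0.1, 0.2  # discount, learning rate, exploration

rng = np.random.default_rng(0)
Q = np.zeros((2, 2))
s = 0
for _ in range(20000):
    # epsilon-greedy action selection
    a = int(rng.integers(2)) if rng.random() < eps else int(Q[s].argmax())
    s2 = int(rng.choice(2, p=P[s, a]))  # environment samples the next state
    r = R[s, a]                         # observed immediate reward
    # model-free TD update: the rule itself never consults P directly
    Q[s, a] += alpha * (r + gamma * Q[s2].max() - Q[s, a])
    s = s2

greedy_policy = Q.argmax(axis=1)
```

Unlike the model-based Q-value iteration sweep, each update here touches a single state-action pair along a sampled trajectory, which is why Q-learning typically needs far more updates to reach comparable accuracy in such comparisons.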