Authors: Gema Parreño, David Suarez (Apiumhub).
Thanks to: Alberto Hernandez (BBVA Innovation Labs)
This blog post introduces Cooperative MARL and walks through innovations from S. Whiteson's lab: QMIX (2018) and their recent contribution to #NeurIPS2021. The article assumes some familiarity with the fundamentals of Reinforcement Learning.
A multi-agent system describes multiple distributed entities (so-called agents) which take decisions autonomously and interact within a shared environment (Weiss 1999). MARL (Multi-Agent Reinforcement Learning) can be understood as a field related to RL in which a system of agents interacts within an environment to achieve a goal. The goal of each of these agents, or learnable units, is to learn a policy that maximizes its long-term reward: each agent discovers a strategy alongside other entities in a common environment and adapts its policy in response to the behavioural changes of others.
Taxonomy
Several properties of MARL systems are key to how they are modelled, and depending on these properties we branch into specific areas of research.
Table 1. This taxonomic schema (Weiss 1999) situates the kind of MARL explored in this post. In cooperative MARL, agents cooperate to achieve a common goal.
Challenges
From the environment perspective, we can list several challenges:
- Non-stationarity: each agent faces a moving-target problem, because the transition probability function it experiences changes as the other agents update their policies.
- Credit assignment problem: an agent cannot easily tell the impact of its own action on the team's success.
- Partial observability: most real-world use cases and applications take place in partially observable environments, which are modelled as a Partially Observable Markov Decision Process (POMDP).
When we branch from MARL into Cooperative MARL, we reformulate the challenge as a system of agents that interact within an environment to achieve a common goal. How much each challenge matters depends on the type of behaviour and environment. From the perspective of agent interaction and performance under cooperation, we can think of the following conceptual challenges:
- Coordination: accomplishing a joint goal in cooperative settings requires agents to reach a consensus.
- Communication: The learning of meaningful communication protocols in cooperative tasks.
- Commitment: Constructing cooperative commitments, so as to overcome incentives to neglect a cooperative arrangement.
- Scalability: MARL algorithms are hard to train: the joint action space grows exponentially with the number of agents, and a potentially high number of agents with heterogeneous action spaces entails a steep growth of computational effort.
From now on, we focus on centralized cooperative MARL and on the definition, notation and description of QMIX.
Fig 1. Visual representation of MARL properties and some of the challenges attached to the taxonomy. The zoomed area covers topics within Cooperative AI discussed in the Open Problems in Cooperative AI and QMIX papers.
Centralized Cooperative Multi-Agent
Centralized Cooperative Multi-Agent RL Notations and Formulation for the coordination problem
- N = {1, …, N} denotes the set of N > 1 interacting agents
- S is the state space shared by all agents
- U = {U1, …, UN} is the joint action set, the collection of the individual action spaces of the N agents
- R is the reward
- P : S × U → P(S) is the state transition function, which gives the probability distribution over next states for a given state and joint action
- O = {O1, …, ON} is the set of observations for all agents
- γ ∈ [0, 1) is the discount factor
Notation 1. Notation for a fully cooperative setup. Colored letters mark the differences with respect to the traditional Reinforcement Learning approach.
QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning
Qtot(τ, u) = Σi Qi(τi, ui), for i = 1, …, N
- Qtot -> global action-value function
- Qi -> Action-value function for each one of the agents
- τ -> joint action-observation history
- u -> joint action
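Notation 2. Global action-value function as a sum of individual action-value functions, one for each agent. This additive form is the factorisation used by Value Decomposition Networks (VDN), which QMIX generalises.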
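The additive form above can be checked on a toy case. The sketch below is a minimal illustration in plain Python, assuming two agents with three actions each and made-up per-agent values (`q_agents` and the helper names are hypothetical, not taken from any paper). It sums the per-agent values into Qtot and compares, by brute force, the joint action that maximises Qtot with the actions each agent would pick greedily on its own.

```python
from itertools import product

# Hypothetical per-agent action values Qi(τi, ·) for one fixed history.
q_agents = [
    [0.2, 1.5, -0.3],   # agent 1, three actions
    [0.9, 0.1, 0.4],    # agent 2, three actions
]

def q_tot_additive(q_agents, joint_action):
    """Qtot as the sum of each agent's value for its own action."""
    return sum(q[a] for q, a in zip(q_agents, joint_action))

def greedy_decentralised(q_agents):
    """Each agent picks its own argmax, ignoring the others."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_agents)

def greedy_centralised(q_agents):
    """Brute-force argmax of Qtot over the whole joint action space."""
    joint_space = product(*(range(len(q)) for q in q_agents))
    return max(joint_space, key=lambda u: q_tot_additive(q_agents, u))

decentralised = greedy_decentralised(q_agents)
centralised = greedy_centralised(q_agents)
```

With a sum, both searches agree, which is why the per-agent greedy choice is enough at execution time. The brute-force search is only there as a check: it enumerates every joint action and would not scale beyond toy sizes.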
The QMIX paper, published in 2018 by T. Rashid et al., explores a hybrid value-based multi-agent reinforcement learning method. It adds a constraint and a mixing network structure in order to make learning more stable, faster and ultimately better in a controlled setup.
A key conceptual idea behind QMIX is centralized training ( Qtot ) with decentralized execution ( Qi ), also known as CTDE: agents are trained in a centralized way, with access to the overall action-observation history ( τ ) and the global state, but during execution each agent has access only to its own local action-observation history ( τi ).
One of the first main ideas is to impose a constraint that enforces a monotonic relationship between the global action-value function Qtot and the action-value function Qi of each agent.
∂Qtot / ∂Qi ≥ 0, ∀i ∈ {1, …, N}
Notation 3. The partial derivative of the global action-value function Qtot with respect to the action-value function Qi of each agent is 0 or higher, for every agent.
Because Qtot never decreases when any Qi increases, the joint action that maximises Qtot is made of the actions that maximise each Qi. This constraint therefore allows each agent to participate in a decentralized execution by choosing greedy actions with respect to its own action-value function.
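To see why the sign of the mixing weights matters, here is a small sketch in plain Python. It assumes a toy nonlinear mixer (a weighted sum of tanh(Qi)) standing in for the real mixing network; the mixer, the weights and the values in `q_agents` are all hypothetical. It compares the brute-force argmax of Qtot with the per-agent greedy choice, once with non-negative weights and once with a negative weight.

```python
import math
from itertools import product

def mix(q_values, weights):
    """Toy nonlinear mixer: weighted sum of tanh(Qi).

    tanh is increasing, so the mixer is non-decreasing in every Qi
    exactly when all weights are non-negative.
    """
    return sum(w * math.tanh(q) for w, q in zip(weights, q_values))

def joint_argmax(q_agents, weights):
    """Brute-force argmax of Qtot over all joint actions."""
    joint_space = product(*(range(len(q)) for q in q_agents))
    return max(
        joint_space,
        key=lambda u: mix([q[a] for q, a in zip(q_agents, u)], weights),
    )

def per_agent_argmax(q_agents):
    """Decentralised greedy choice: each agent maximises its own Qi."""
    return tuple(max(range(len(q)), key=q.__getitem__) for q in q_agents)

q_agents = [[0.2, 1.5, -0.3], [0.9, 0.1, 0.4]]  # hypothetical values

monotonic = joint_argmax(q_agents, weights=[0.7, 2.0])       # all weights >= 0
non_monotonic = joint_argmax(q_agents, weights=[0.7, -2.0])  # one weight < 0
local = per_agent_argmax(q_agents)
```

In this toy case `monotonic` matches `local`, while `non_monotonic` does not: with a negative weight the best joint action asks agent 2 to pick its lowest-valued action, which it cannot know from its own Qi alone.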
The overall QMIX architecture has two main parts:
- Agent Networks: for each agent Ai, there is an agent network that represents its action-value function. At each time step it receives the current observation and the last action as input and returns the action values Qi. The network topology belongs to the DRQN family and uses a GRU, which facilitates learning over longer timescales and may converge faster. This means that in an environment with, for example, two Colossi agents, we would have two agent networks, one per agent.
- Mixing Network: a feedforward network that takes the agents' outputs (Qi for every agent) and outputs the total action-value function Qtot. This is the creative and innovative part of the architecture: the weights of the mixing network are produced by separate hypernetworks, meaning that one neural network generates the weights for another. Each hypernetwork output is a vector forced to be non-negative, which is how the mixing weights are made to satisfy the monotonicity constraint.
Fig 3. Overall architecture of QMIX as proposed in the QMIX paper, with its main components: the agent networks and the mixing network, whose hypernetworks enforce monotonicity.
In summary, QMIX:
- Satisfies a monotonicity condition so that each agent can choose greedy actions with respect to its own action-value function.
- Gives each agent an agent network that calculates its action-value function.
- Uses a mixing network, whose weights are generated from the state and forced to be non-negative, to calculate the joint action-value function Qtot.
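The mixing step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: every hypernetwork is reduced to a single randomly initialised linear layer, nothing is trained, and `ToyMixer`, `embed_dim` and the helper functions are hypothetical names. It shows the state going through the hypernetworks, the absolute value that keeps the mixing weights non-negative, and the two mixing layers that turn the per-agent Qi into Qtot.

```python
import math
import random

def linear(x, weight, bias):
    """Dense layer: weight is a list of rows, one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def elu(x):
    """ELU activation, which is monotonically increasing."""
    return x if x > 0 else math.exp(x) - 1.0

def random_layer(rng, n_in, n_out):
    """Randomly initialised (weight, bias) pair for a dense layer."""
    weight = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [rng.uniform(-1, 1) for _ in range(n_out)]
    return weight, bias

class ToyMixer:
    """Two-layer mixing network whose weights come from hypernetworks."""

    def __init__(self, n_agents, state_dim, embed_dim, seed=0):
        rng = random.Random(seed)
        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: map the global state to the mixer's parameters.
        self.hyper_w1 = random_layer(rng, state_dim, n_agents * embed_dim)
        self.hyper_b1 = random_layer(rng, state_dim, embed_dim)
        self.hyper_w2 = random_layer(rng, state_dim, embed_dim)
        self.hyper_b2 = random_layer(rng, state_dim, 1)

    def q_tot(self, agent_qs, state):
        # Absolute value forces the mixing weights to be non-negative.
        w1 = [abs(w) for w in linear(state, *self.hyper_w1)]
        b1 = linear(state, *self.hyper_b1)  # biases are left unconstrained
        w2 = [abs(w) for w in linear(state, *self.hyper_w2)]
        b2 = linear(state, *self.hyper_b2)[0]
        # First mixing layer: hidden[j] = elu(sum_i Qi * w1[i][j] + b1[j]).
        hidden = [
            elu(sum(agent_qs[i] * w1[i * self.embed_dim + j]
                    for i in range(self.n_agents)) + b1[j])
            for j in range(self.embed_dim)
        ]
        # Second mixing layer collapses the hidden vector into a scalar.
        return sum(h * w for h, w in zip(hidden, w2)) + b2

mixer = ToyMixer(n_agents=2, state_dim=3, embed_dim=4)
state = [0.5, -1.0, 0.25]
low = mixer.q_tot([0.0, 1.0], state)
high = mixer.q_tot([0.8, 1.0], state)  # only the first agent's Qi increased
```

Because the weights are non-negative and ELU is increasing, raising one agent's Qi can never lower Qtot, so `high` is at least `low` whatever the random initialisation. Only the weights go through the absolute value; the biases can take any sign without affecting monotonicity.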
Regularized Softmax Deep Multi-Agent Q-learning at NeurIPS 2021
Overestimation is an important challenge because it can accumulate during learning and hurt the performance of value-based algorithms. In addition, with multiple agents in a MARL scenario the joint action space grows exponentially with the number of agents, and this is an issue in its own right. In the case of QMIX, overestimation can come not only from the calculation of Qi but also from the mixing network.
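The overestimation effect can be reproduced with a short simulation. The sketch below is a simplified illustration, not an experiment from the paper: it assumes every action has a true value of 0 and that each estimate carries independent Gaussian noise, which is a strong simplification of what happens inside QMIX. The function name `max_bias` and all parameters are hypothetical. It measures how far the max over noisy estimates sits above the true maximum, for 3 options (one agent with three actions) and 9 options (the joint actions of two such agents).

```python
import random

def max_bias(n_actions, noise_std, n_trials, seed=0):
    """Average of the max over noisy estimates when every true value is 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        noisy = [rng.gauss(0.0, noise_std) for _ in range(n_actions)]
        total += max(noisy)  # the true maximum is 0, so this is pure bias
    return total / n_trials

bias_few = max_bias(n_actions=3, noise_std=1.0, n_trials=20000)
bias_many = max_bias(n_actions=9, noise_std=1.0, n_trials=20000)
```

Both averages come out above 0 even though no action is actually better than another, and the bias is larger when the max is taken over more options.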
The paper first presents experimental results for some initial approaches to the challenge that did not show the desired outcomes: a gradient regularization of the mixing network, and a baseline for Qtot, added to the loss as a regularization term λ (Qtot(s,u) − b(s,u))², where the mean squared error loss is used and λ is the regularization coefficient.
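The regularization term can be written out as a small loss function. This is a minimal sketch in plain Python, assuming the main loss is the squared TD error and that the baseline values b(s,u) are simply given as data; how the baseline is computed is not covered here. The function name `regularized_loss` and the numbers are hypothetical.

```python
def regularized_loss(q_tot, td_target, baseline, lam):
    """Batch mean of (TD error)^2 + lam * (Qtot - baseline)^2."""
    assert len(q_tot) == len(td_target) == len(baseline)
    total = 0.0
    for q, y, b in zip(q_tot, td_target, baseline):
        td_term = (y - q) ** 2         # usual squared TD error
        penalty = lam * (q - b) ** 2   # pulls Qtot towards the baseline
        total += td_term + penalty
    return total / len(q_tot)

# Hypothetical batch of two transitions.
q_tot = [2.0, 1.0]
td_target = [1.0, 1.0]
baseline = [0.0, 1.0]

plain = regularized_loss(q_tot, td_target, baseline, lam=0.0)
penalised = regularized_loss(q_tot, td_target, baseline, lam=0.5)
```

With `lam=0.0` the function reduces to the plain mean squared TD error; a larger `lam` penalises joint action values that drift away from the baseline.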
As the final proposal, which showed better empirical results, they apply a softmax to the joint action-value function, softmax(Qtot(s,u)), with principles from Deep Q-Learning, using the state s and not the action-observation history τ as in the QMIX and Value Decomposition Networks approach.
To learn more about this contribution, don't hesitate to read their paper, listed in the references below.
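One common reading of a softmax over action values is the Boltzmann softmax operator: a softmax-weighted average of the values, used where a hard max would otherwise appear. The sketch below assumes that reading, which goes beyond what is stated above. The inverse temperature `beta`, the function name `softmax_value` and the toy values are hypothetical, and the joint actions are enumerated directly, which would not scale to real joint action spaces.

```python
import math

def softmax_value(q_values, beta):
    """Boltzmann-weighted average of q_values with inverse temperature beta."""
    q_max = max(q_values)
    # Subtract the max before exponentiating for numerical stability.
    exps = [math.exp(beta * (q - q_max)) for q in q_values]
    norm = sum(exps)
    return sum(e / norm * q for e, q in zip(exps, q_values))

joint_q = [1.0, 2.0, 4.0]  # hypothetical Qtot(s, u) for three joint actions
hard_max = max(joint_q)
soft_low_beta = softmax_value(joint_q, beta=0.0)    # plain mean
soft_mid_beta = softmax_value(joint_q, beta=1.0)
soft_high_beta = softmax_value(joint_q, beta=50.0)  # close to the max
```

The softmax value never exceeds the hard max and moves from the mean towards the max as `beta` grows, which is the property that makes it a candidate for tempering overestimation.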
References
- Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, Weiss (1999)
- Review of Multi-Agent Deep Reinforcement Learning based on the work, A. Oroojlooy and D. Hajinezhad (2020)
- Open Problems in Cooperative AI, A. Dafoe et al. (2020)
- QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning, T. Rashid et al. (2018)
- SMAC: The StarCraft Multi-Agent Challenge, M. Samvelyan et al. (2019)
- Regularized Softmax Deep Multi-Agent Q-Learning (2021)