Cooperative Multi-Agent Reinforcement Learning and QMIX at NeurIPS 2021


Authors: Gema Parreño, David Suarez (Apiumhub). 

Thanks to: Alberto Hernandez (BBVA Innovation Labs)

This blog post introduces Cooperative MARL and walks through innovations from S. Whiteson's lab: QMIX (2018) and their current contributions to #Neurips2021. The article assumes some familiarity with the fundamentals of Reinforcement Learning.

A multi-agent system describes multiple distributed entities, so-called agents, which take decisions autonomously and interact within a shared environment (Weiss 1999). MARL (Multi-Agent Reinforcement Learning) can be understood as a field related to RL in which a system of agents interacts within an environment to achieve a goal. The goal of each of these agents, or learnable units, is to learn a policy that maximizes its long-term reward: each agent discovers a strategy alongside other entities in a common environment and adapts its policy in response to the behavioural changes of the others.


Several properties of MARL systems are key to modeling them, and depending on these properties the field branches into specific areas of research.

Table 1. This taxonomic schema (Weiss 1999) situates the kind of MARL explored in this post. In cooperative MARL, agents cooperate to achieve a common goal.


From the environment perspective, we can state several challenges:

  • Non-stationarity: each agent faces a moving-target problem, because the transition probabilities it experiences change as the other agents update their policies.
  • Credit assignment problem: an agent cannot easily tell how much its own action contributed to the team's success.
  • Partial observability: the problem is modelled as a Partially Observable Markov Decision Process (POMDP). Most real-world use cases and applications take place in partially observable environments.

When we branch from MARL into Cooperative MARL, we reformulate the challenge as a system of agents that interact within an environment to achieve a common goal. The challenges above gain or lose importance depending on the type of behaviour and environment. From the perspective of agent interaction and performance under cooperation, we can think of the following additional conceptual challenges:

  • Coordination: accomplishing a joint goal in cooperative settings requires agents to reach a consensus.
  • Communication: agents need to learn meaningful communication protocols for cooperative tasks.
  • Commitment: agents need to construct cooperative commitments, so as to overcome incentives to abandon a cooperative arrangement.
  • Scalability: MARL algorithms are hard to train, because the joint action space grows exponentially with the number of agents, and heterogeneous action spaces add further computational effort.
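
To make the scalability challenge concrete, here is a minimal arithmetic sketch of how the joint action space grows with the number of agents. The helper name `joint_action_space_size` and the choice of 10 actions per agent are illustrative assumptions, not values taken from any paper.

```python
# Size of the joint action space when every agent has the same number of
# individual actions (an assumption made only to keep the arithmetic simple).
def joint_action_space_size(n_agents: int, actions_per_agent: int) -> int:
    return actions_per_agent ** n_agents


# Hypothetical setting: 10 individual actions per agent.
sizes = {n: joint_action_space_size(n, 10) for n in (1, 2, 4, 8)}

for n, size in sizes.items():
    print(f"{n} agents -> {size} joint actions")
```

A learner that reasons over joint actions directly has to cope with this growth, which is one motivation for factorising the joint value function into per-agent pieces.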

From now on, we will focus on centralized cooperative MARL and on the definition, notation and description of QMIX.


Fig 1. Visual representation of MARL properties, with some of the challenges placed on the taxonomy. The zoomed area highlights the topics inside Cooperative AI that are discussed in the Open Problems in Cooperative AI and QMIX papers.

Centralized Cooperative Multi-Agent

Centralized Cooperative Multi-Agent RL Notations and Formulation for the coordination problem

Regarding notation, the main differences with respect to single-agent RL are that the tuple now includes N, the set of agents; O = {O1, …, On}, the set of observations for all agents (if different agents have different observation sets, all of them are represented in this set); and U = {U1, …, Un}, the joint action set for all agents, meaning that actions are taken jointly even though each agent may take a different action. Therefore, taking into account the tuple < N, S, U, R, P, O, γ >:
  • N = {1, …, N} denotes the set of N > 1 interacting agents
  • S is the state space shared by all agents
  • U = {U1, …, Un} is the joint action set for all agents, that is, the collection of individual action spaces of the N agents
  • R is the reward
  • P : S × U → P(S) is the transition function, which gives the probability distribution over next states given the current state and the joint action
  • O = {O1, …, On} is the set of observations for all agents
  • γ is the discount factor, in [0, 1)

Notation 1. Notation for a fully cooperative setup. Compared with the traditional single-agent Reinforcement Learning tuple, the new elements are the set of agents N, the joint action set U and the observation set O.
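
The tuple above can be written down as a small data structure. The following is only a sketch: the class name `CooperativeProblem`, its field names and the two-agent toy task are hypothetical, chosen to mirror the symbols N, S, U, R, P, O and the discount factor.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

JointAction = Tuple[int, ...]


@dataclass(frozen=True)
class CooperativeProblem:
    n_agents: int                                               # N
    states: Sequence[int]                                       # S
    action_sets: Sequence[Sequence[int]]                        # U = {U1, ..., Un}
    reward: Callable[[int, JointAction], float]                 # R, shared by all agents
    transition: Callable[[int, JointAction], Dict[int, float]]  # P: next-state distribution
    observe: Callable[[int, int], int]                          # O: observation of agent i
    gamma: float                                                # discount factor

    def __post_init__(self) -> None:
        if self.n_agents < 2:
            raise ValueError("a multi-agent problem needs N > 1 agents")
        if len(self.action_sets) != self.n_agents:
            raise ValueError("one action set per agent is required")
        if not 0.0 <= self.gamma < 1.0:
            raise ValueError("gamma must lie in [0, 1)")


# Toy task (hypothetical): two agents are rewarded when they pick the same action.
toy = CooperativeProblem(
    n_agents=2,
    states=[0, 1],
    action_sets=[[0, 1], [0, 1]],
    reward=lambda s, u: 1.0 if u[0] == u[1] else 0.0,
    transition=lambda s, u: {(s + sum(u)) % 2: 1.0},
    observe=lambda s, i: s,  # simplification: every agent sees the full state
    gamma=0.99,
)
```

In a partially observable task the `observe` function would return only part of the state for each agent; the toy above skips that to stay short.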

Fig 2. Visual representation of a fully cooperative and partially observable multi-agent environment (Dec-POMDP). The example uses the SMAC environment 2c_vs_64zg: at each time step t, the environment sends each agent (2 Colossi) observations about enemy positions and about the actions of the enemies and of the other agent, and each agent (Colossus) produces an action based on the Qtot value function. All agents share the same reward.

QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning

In a fully cooperative setting with centralized learning and decentralized execution, the joint action-value function Qtot can be decomposed into N Q-functions, one per agent, where each Q-function Qi measures how good each action is for agent i, given its local action-observation history and the policy being followed:

Qtot(τ, u) = Σ_{i=1}^{N} Qi(τi, ui)

  • Qtot -> global action-value function
  • Qi -> action-value function of agent i
  • τ -> joint action-observation history (τi for agent i)
  • u -> joint action (ui for agent i)

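Notation 2. Global action-value function as a sum of individual action-value functions, one for each agent. This additive form is the one used by Value Decomposition Networks (VDN); QMIX relaxes it to a monotonic mixing of the individual values.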
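
A useful consequence of the additive decomposition is that each agent can pick its own best action and the resulting joint action also maximizes Qtot. The sketch below checks this by brute force on made-up numbers; the values in `agent_qs` are hypothetical and stand for Qi(τi, ·) at one fixed history.

```python
from itertools import product

# Hypothetical per-agent action values for one fixed history:
# one row per agent, one column per individual action.
agent_qs = [
    [0.2, 1.5, -0.3],  # agent 1
    [0.9, 0.1, 0.4],   # agent 2
]


def q_tot_additive(joint_action):
    """Sum of the individual action values selected by the joint action."""
    return sum(q[u] for q, u in zip(agent_qs, joint_action))


# Decentralized choice: every agent takes the argmax of its own values.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in agent_qs)

# Centralized choice: brute-force search over all joint actions.
all_joint_actions = product(*(range(len(q)) for q in agent_qs))
best_joint = max(all_joint_actions, key=q_tot_additive)

print(greedy, best_joint)
```

Both tuples select the same joint action, so no agent needs to see the others' values at execution time.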

The QMIX paper, published in 2018 by T. Rashid et al., explores a hybrid value-based multi-agent reinforcement learning method that adds a constraint and a mixing network structure in order to make learning more stable, faster and ultimately better in a controlled setup.

A key conceptual idea behind QMIX is the paradigm of centralized training (Qtot) with decentralized execution (Qi), also known as CTDE: agents are trained in a centralized way, with access to the overall action-observation history (τ) and the global state during training, but during execution they only have access to their own local action-observation histories (τi).

One of the first main ideas is to enforce a constraint that keeps the relationship between the global action-value function Qtot and the action-value function Qi of each agent monotonic. This constraint allows each agent to take part in decentralised execution by choosing greedy actions with respect to its own action-value function:

∂Qtot / ∂Qi ≥ 0, ∀i ∈ N

Notation 3. The partial derivative of the global action-value function Qtot with respect to the action-value function Qi of each agent is zero or higher, for every agent i.

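Because Qtot never decreases when any Qi increases, the joint action that maximizes Qtot is the one in which every agent picks the action that maximizes its own Qi. This is what allows each agent to act greedily with respect to its own value function during decentralized execution.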
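
The effect of the monotonicity constraint can be illustrated with a toy mixer. This is a sketch under stated assumptions: the function `mix` (a weighted sum passed through tanh), the weights and the values in `agent_qs` are all hypothetical, and the real QMIX mixer is a neural network rather than this one-liner.

```python
import math
from itertools import product

# Hypothetical per-agent action values for one fixed history.
agent_qs = [
    [0.2, 1.5, -0.3],  # agent 1
    [0.9, 0.1, 0.4],   # agent 2
]


def mix(q_values, weights):
    """Toy mixer: weighted sum followed by tanh, which is an increasing function."""
    return math.tanh(sum(w * q for w, q in zip(weights, q_values)))


def joint_argmax(weights):
    """Brute-force search for the joint action with the highest mixed value."""
    joint_actions = product(*(range(len(q)) for q in agent_qs))
    return max(
        joint_actions,
        key=lambda u: mix([q[a] for q, a in zip(agent_qs, u)], weights),
    )


# What the agents would do on their own: each takes its individual argmax.
greedy = tuple(max(range(len(q)), key=q.__getitem__) for q in agent_qs)

monotonic = joint_argmax([0.7, 0.3])       # all weights non-negative
non_monotonic = joint_argmax([0.7, -0.3])  # one negative weight breaks the constraint

print(greedy, monotonic, non_monotonic)
```

With non-negative weights the centralized maximizer agrees with the decentralized greedy choice. With a negative weight it does not, so agents acting greedily on their own values would no longer maximize the mixed value.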

The overall QMIX architecture has two main parts:

  • Agent networks: for each agent Ai, there is an agent network that represents its action-value function. At each time step it receives the current observation and the last action as input and returns the action values Qi. The network belongs to the DRQN family and uses a GRU, which facilitates learning over longer timescales and probably converges faster. This means that in an environment with, for example, two Colossi agents, we would have two agent networks, one per agent.
  • Mixing network: a feedforward network that takes the agents' outputs (Qi for every agent) and outputs the total action-value function Qtot. This is the creative and innovative part of the architecture: the weights of the mixing network are produced by separate hypernetworks, that is, neural networks that generate the weights of another network. The output of each weight hypernetwork is a vector forced to be non-negative, which is how the mixing weights enforce the monotonicity constraint.

Fig 3. Overall architecture of QMIX as proposed in the QMIX paper, with its main components: the agent networks and the mixing network, whose hypernetworks enforce monotonicity.
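
The mixing network with hypernetworks can be sketched in a few lines of plain Python. This is a simplified illustration, not the authors' implementation: the class name `ToyQmixMixer`, the layer sizes, the random initialisation and the use of single bias-free linear hypernetworks are assumptions made to keep the example short.

```python
import math
import random


def linear(rows, x):
    """Bias-free dense layer: one output per row of weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in rows]


def elu(x):
    """ELU activation, an increasing function."""
    return x if x > 0 else math.exp(x) - 1.0


class ToyQmixMixer:
    """Two-layer mixer whose weights are generated from the global state."""

    def __init__(self, n_agents, state_dim, embed_dim=4, seed=0):
        rng = random.Random(seed)

        def init(n_out):
            return [[rng.uniform(-1.0, 1.0) for _ in range(state_dim)] for _ in range(n_out)]

        self.n_agents, self.embed_dim = n_agents, embed_dim
        # Hypernetworks: each maps the state to parameters of the mixer.
        self.hyper_w1 = init(n_agents * embed_dim)
        self.hyper_b1 = init(embed_dim)
        self.hyper_w2 = init(embed_dim)
        self.hyper_b2 = init(1)

    def q_tot(self, agent_qs, state):
        # abs() forces the mixing weights to be non-negative (monotonicity).
        w1 = [abs(v) for v in linear(self.hyper_w1, state)]
        w2 = [abs(v) for v in linear(self.hyper_w2, state)]
        # Biases are left unconstrained.
        b1 = linear(self.hyper_b1, state)
        b2 = linear(self.hyper_b2, state)[0]
        hidden = [
            elu(sum(w1[j * self.n_agents + i] * agent_qs[i] for i in range(self.n_agents)) + b1[j])
            for j in range(self.embed_dim)
        ]
        return sum(w * h for w, h in zip(w2, hidden)) + b2


mixer = ToyQmixMixer(n_agents=2, state_dim=3)
state = [0.5, -1.0, 2.0]
low = mixer.q_tot([0.0, 1.0], state)
high = mixer.q_tot([0.5, 1.0], state)  # only agent 1's value was raised
print(high >= low)
```

Raising one agent's value never lowers the mixed output, whatever the state, because every weight that multiplies an agent value or a hidden unit is non-negative and the activation is increasing.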

Key ideas of the QMIX algorithm:
  • A monotonicity condition is satisfied so that each agent can choose greedy actions with respect to its own action-value function.
  • Each agent has an agent network that calculates its action-value function.
  • A mixing network, whose weights are generated from the state and forced to be non-negative, calculates the joint action-value function Qtot.

Regularized Softmax Deep Multi-Agent Q-learning at NeurIPS 2021


At #Neurips2021, the lab presents the severe overestimation problem that QMIX shows in practice, and proposes a regularization-based update scheme that penalizes large Qtot values, which stabilizes learning, together with a softmax operator that reduces overestimation bias.

Overestimation is an important challenge because it can accumulate and hurt the performance of value-based algorithms. In addition, having multiple agents in a MARL scenario makes the joint action space grow exponentially with the number of agents, which aggravates the issue. In the case of QMIX, the overestimation phenomenon can come not only from the calculation of Qi but also from the mixing network.
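
Why taking a maximum over noisy estimates leads to overestimation can be seen in a small simulation. This is a generic illustration with hypothetical settings (5 actions, Gaussian noise, all true values equal to zero), not an experiment from the paper.

```python
import random

rng = random.Random(0)
n_actions, n_trials, noise_std = 5, 10_000, 1.0

# All true action values are zero, so the true maximum is exactly 0.
total = 0.0
for _ in range(n_trials):
    noisy_estimates = [0.0 + rng.gauss(0.0, noise_std) for _ in range(n_actions)]
    total += max(noisy_estimates)

mean_max_estimate = total / n_trials
print(f"true max = 0.0, average estimated max = {mean_max_estimate:.2f}")
```

The average of the estimated maximum comes out clearly above zero even though every individual estimate is unbiased. The more actions the maximum ranges over, the larger this gap becomes, which is why a large joint action space makes the problem worse.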

The paper first reports key experimental results for earlier ideas that did not deliver the desired outcomes, such as gradient regularization of the mixing network. It also studies a baseline on Qtot, adding a regularization term λ (Qtot(s,u) − b(s,u))² to the mean squared error loss, where b(s,u) is the baseline and λ is the regularization coefficient.
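
The regularized loss described above can be sketched as a squared TD error plus the penalty term. The function name `regularized_loss`, the batch averaging and the example numbers are assumptions for illustration; how the baseline b(s,u) is computed is not covered here, so it is passed in as plain numbers.

```python
def regularized_loss(q_tot, td_target, baseline, lam):
    """Mean squared TD error plus lam * mean((q_tot - baseline)^2) over a batch."""
    n = len(q_tot)
    td_error = sum((q - y) ** 2 for q, y in zip(q_tot, td_target)) / n
    penalty = sum((q - b) ** 2 for q, b in zip(q_tot, baseline)) / n
    return td_error + lam * penalty


# Hypothetical batch of two transitions.
q_tot = [2.0, 4.0]
td_target = [1.0, 3.0]
baseline = [1.0, 1.0]

loss = regularized_loss(q_tot, td_target, baseline, lam=0.1)
plain = regularized_loss(q_tot, td_target, baseline, lam=0.0)
print(loss, plain)
```

The penalty grows with the distance between Qtot and the baseline, so the second sample (Qtot = 4.0) contributes most of it. Setting λ to zero recovers the plain mean squared error loss.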

As the final proposal, which showed better empirical results, they apply a softmax operator to the joint action-value function, softmax(Qtot(s,u)), following principles from Deep Q-Learning and conditioning on the state s rather than on the action-observation history τ used in the QMIX and Value Decomposition Networks approach.
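
A softmax operator replaces the hard maximum over action values with a softmax-weighted average. The sketch below uses the standard form of that operator; the inverse temperature `beta`, the function name and the example values are assumptions, and the exact variant used in the paper (for instance, which joint actions it ranges over) is not reproduced here.

```python
import math


def softmax_operator(q_values, beta):
    """Softmax-weighted average of q_values; beta is an inverse temperature."""
    m = max(q_values)
    # Subtracting the maximum keeps the exponentials numerically stable.
    weights = [math.exp(beta * (q - m)) for q in q_values]
    z = sum(weights)
    return sum(w * q for w, q in zip(weights, q_values)) / z


# Hypothetical joint action values Qtot(s, u) for three joint actions.
joint_q = [1.0, 2.0, 4.0]

soft = softmax_operator(joint_q, beta=1.0)
hard = max(joint_q)
print(soft < hard)
```

The softmax value never exceeds the hard maximum, which is why using it in the bootstrapped target gives a more conservative estimate. A beta of zero gives the plain average, and a large beta approaches the maximum.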

To learn more about this contribution, don't hesitate to read their paper (reference 6 below).


  1. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence, Weiss (1999)
  2. Review of Multi-Agent Deep Reinforcement Learning, based on the work of A. Oroojlooy and D. Hajinezhad (2020)
  3. Open Problems in Cooperative AI, A. Dafoe et al. (2020)
  4. QMIX: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning, T. Rashid et al. (2018)
  5. SMAC: The StarCraft Multi-Agent Challenge, Mikayel Samvelyan et al. (2019)
  6. Regularized Softmax Deep Multi-Agent Q-Learning (2021)
