This is a condensed version of my Master’s thesis, presented in a simplified form to make it more accessible. The complete version can be downloaded here.
This study explores how heterogeneous agents develop specialised roles in a capture the flag setting. Agents are trained to play the game using reinforcement learning with simple reward signals and self-play.
To achieve this, I created a custom environment and introduced four distinct agent types, each with different attributes. Using Proximal Policy Optimisation, the agents were trained through self-play across nine experimental scenarios.
The analysis focuses on whether these built-in differences lead to emergent specialisation. To evaluate this, I developed tailored metrics and combined quantitative results with qualitative observations to better understand agent behaviour.
This study finds that heterogeneous agents trained via reinforcement learning and self-play reliably develop scalable, specialised roles that adapt dynamically in response to opposing team strategies over the course of training.
Reinforcement learning (RL) focuses on how agents learn to map actions to rewards through trial and error, considering both immediate and long-term outcomes. Agents must balance trying new actions to gain new knowledge against exploiting what they already know works well from prior experience.
RL problems are commonly framed in terms of an agent-environment interaction cycle: an agent perceives the current state of the environment and uses this information to take an action, which leads to a new state of the environment and a reward signal for the agent.
The main goal in RL is to find the optimal policy that maximises the expected discounted sum of rewards. In capture the flag, if an agent is near the opposing team’s flag (the state), the agent should pick up the flag (the action). An example of a policy rule is therefore: when near the opposing team’s flag, pick up the flag.
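In standard RL notation (not specific to this thesis), with policy $\pi$, discount factor $\gamma \in [0, 1)$, and reward $r_t$ received at timestep $t$, this objective can be written as:

$$
\pi^* = \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^t r_t \right]
$$

The discount factor $\gamma$ makes rewards received sooner worth more than rewards received later, which is how the trade-off between immediate and long-term outcomes is expressed formally.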
Capture the Flag (CTF) is a team-based game where two teams compete to steal the opponent’s flag and return it to their own territory while avoiding being tagged. Players can tag opponents to send them back to their base. The game ends after a time limit, with the highest-scoring team winning.
In this study, CTF is used as an example of a mixed multi-agent problem, combining cooperation (within teams) and competition (between teams). Although simplified, it provides a useful environment for studying how agents learn strategies, make decisions, and coordinate over time.
I designed four types of agents, each with different abilities based on mobility, strength, damage, and how they interact with the environment.
All agents can move in four directions (or stay still), capture the flag, and tag opponents. This ensures the game can function with any mix of agent types.
This environment can be thought of as a game where multiple agents interact over a series of steps. Each game runs for 500 timesteps, and at every timestep, all agents observe the situation, take an action, and the game updates. The order in which agents act at each turn is randomised to introduce uncertainty into the environment.
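A minimal, runnable sketch of this loop is below. The class and method names (`ToyCTFEnv`, `get_observation`, `apply_action`) are stand-ins for illustration, not the actual implementation.

```python
import random

MAX_TIMESTEPS = 500

class ToyCTFEnv:
    """Stand-in for the real environment: tracks only a step counter."""
    def reset(self):
        self.t = 0

    def get_observation(self, agent):
        return self.t  # placeholder; the real env returns the agent's view

    def apply_action(self, agent, action):
        pass           # placeholder; the real env updates the game grid

class ToyAgent:
    def act(self, obs):
        return random.choice(["up", "down", "left", "right", "stay"])

def run_episode(env, agents):
    env.reset()
    for _ in range(MAX_TIMESTEPS):
        order = list(agents)
        random.shuffle(order)  # randomised turn order at every timestep
        for agent in order:
            obs = env.get_observation(agent)
            action = agent.act(obs)
            env.apply_action(agent, action)

run_episode(ToyCTFEnv(), [ToyAgent() for _ in range(4)])
```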
At any moment, each agent understands the game through two types of information:
Together, these give the agent a complete picture of the game at that moment.
Every agent can move up, down, left, or right, stay still, capture the flag, and tag opponents. Some agents have extra abilities: for example, the Vaulter can vault over block tiles, and the Miner can mine and place destructible tiles.
Not all actions are available to all agents. To make learning faster, agents are prevented from choosing actions they can’t actually perform.
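A common way to implement this is invalid-action masking: before an action is sampled, the logits of unavailable actions are pushed to negative infinity so their probability becomes zero. A sketch, assuming a PyTorch policy (the action count and indices below are illustrative):

```python
import torch

def masked_action_distribution(logits, action_mask):
    """Remove unavailable actions from the policy's distribution.

    logits:      raw policy outputs, shape (num_actions,)
    action_mask: 1 for actions the agent can perform, 0 otherwise
    """
    masked_logits = logits.masked_fill(action_mask == 0, float("-inf"))
    return torch.distributions.Categorical(logits=masked_logits)

# Illustrative: 9 actions in total, where the last two (mine and place
# block, available only to the Miner) are masked out for other agents.
logits = torch.randn(9)
mask = torch.tensor([1, 1, 1, 1, 1, 1, 1, 0, 0])
action = masked_action_distribution(logits, mask).sample()  # never a masked action
```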
Agents choose their actions using a learning method called Proximal Policy Optimisation (PPO), which gradually improves their decisions over time based on experience.
Agents learn by receiving rewards based on what happens in the game, such as successfully capturing the opposing team’s flag.
The reward system is intentionally simple. Instead of telling agents exactly what to do, it encourages agents to figure out good strategies on their own.
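As an illustration of what such a sparse scheme can look like (the events and values here are placeholder assumptions, not the thesis’s actual reward function):

```python
def compute_reward(events):
    """Illustrative sparse reward: a few game events map to a scalar signal."""
    reward = 0.0
    if events.get("captured_flag"):    # returned the opposing flag to base
        reward += 1.0
    if events.get("tagged_opponent"):  # sent an opponent back to their base
        reward += 0.1
    if events.get("was_tagged"):       # got sent back to own base
        reward -= 0.1
    return reward
```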
After all agents take an action, the game updates to a new state. This update isn’t perfectly predictable, due to the randomised order in which agents take actions at each turn.
To train the agents, I used Proximal Policy Optimisation (PPO), a widely used reinforcement learning method known for being stable and effective across many environments, and I adapted the core algorithm to support multiple agents and teams.
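At the heart of PPO is a clipped surrogate objective that stops each update from moving the policy too far from its previous version. A minimal sketch of that loss (standard PPO, independent of my multi-agent modifications):

```python
import torch

def ppo_policy_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Clipped surrogate objective from the PPO paper (Schulman et al., 2017)."""
    ratio = torch.exp(new_log_probs - old_log_probs)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Maximise the objective, so minimise its negative.
    return -torch.min(unclipped, clipped).mean()
```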
Each agent is powered by a neural network that decides what action to take. Its input is the two types of information described earlier.
Before processing, the game board is adjusted so that everything is viewed from the agent’s own perspective. The network produces two outputs: a probability distribution over the available actions, and a value estimate of the current state (used during training, as is standard for PPO).
Once trained, the agent chooses its action by sampling from this probability distribution.
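A minimal PyTorch sketch of such a two-headed (actor-critic) network. The encoder, layer sizes, grid dimensions, and action count here are assumptions for illustration, not the thesis’s exact architecture:

```python
import torch
import torch.nn as nn

class AgentNetwork(nn.Module):
    """Shared encoder feeding a policy head and a value head."""
    def __init__(self, grid_channels, grid_size, num_actions):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(grid_channels, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * grid_size * grid_size, 128),
            nn.ReLU(),
        )
        self.policy_head = nn.Linear(128, num_actions)  # action logits
        self.value_head = nn.Linear(128, 1)             # state-value estimate

    def forward(self, grid_obs):
        features = self.encoder(grid_obs)
        return self.policy_head(features), self.value_head(features)

# Sample a move from the policy distribution, as described above:
net = AgentNetwork(grid_channels=4, grid_size=10, num_actions=9)
logits, value = net(torch.zeros(1, 4, 10, 10))
action = torch.distributions.Categorical(logits=logits).sample()
```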
Agents improve by repeatedly playing against each other. Each team trains against the previous version of its opponent, creating a constantly adapting challenge.
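One simple way to set this up is to train against a frozen snapshot of the opponent and refresh the snapshots each generation. A sketch, where `train_one_generation` is a hypothetical training routine:

```python
import copy

def self_play(team1, team2, num_generations, train_one_generation):
    """Each team trains against a frozen copy of the opponent's previous
    version, so the challenge adapts generation by generation."""
    for generation in range(num_generations):
        frozen_t2 = copy.deepcopy(team2)  # opponent does not learn this round
        train_one_generation(learner=team1, opponent=frozen_t2)

        frozen_t1 = copy.deepcopy(team1)
        train_one_generation(learner=team2, opponent=frozen_t1)
```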
To evaluate whether agents develop specialised roles, I created a set of custom metrics and used both data analysis and direct observation.
The metrics track things like flag captures, tagging, positioning on the map, teamwork, and (for miners) how they modify the environment. By comparing these behaviours across agents, it’s possible to identify roles — for example, attackers, defenders, or supportive teammates.
These metrics are collected by running multiple matches and averaging the results. Statistical tests are then used to check whether differences in behaviour are meaningful, both between individual agents and between teams.
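A sketch of that evaluation pipeline, assuming a hypothetical `play_match` that returns per-agent metrics; Welch’s t-test is one suitable significance check:

```python
import numpy as np
from scipy import stats

def compare_flag_captures(play_match, num_matches=100):
    """Average a behavioural metric over many matches for two agents,
    then test whether the difference between them is meaningful."""
    captures_a, captures_b = [], []
    for _ in range(num_matches):
        result = play_match()  # hypothetical: {agent_id: {metric: value}}
        captures_a.append(result["T1-V"]["flag_captures"])
        captures_b.append(result["T1-S"]["flag_captures"])

    t_stat, p_value = stats.ttest_ind(captures_a, captures_b, equal_var=False)
    print(f"mean captures: {np.mean(captures_a):.2f} vs {np.mean(captures_b):.2f}, "
          f"p = {p_value:.3f}")
```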
In addition to this, I analysed how behaviours change over time, looked at movement patterns across the map, and observed actual gameplay. This helped capture more subtle strategies and provide a fuller picture of how specialisation emerges.
Description: This experiment tests whether the Vaulter learns to leverage its attacking attributes. Team 1 consists of a Scout and a Vaulter (T1-S and T1-V), while Team 2 has two Scouts. A horizontal wall blocks the direct path between the flags, forcing agents to take a longer route between flags; the Vaulter, however, can pass through the wall.
Hypothesis: The T1 Vaulter will learn to vault the block tiles, creating a shorter route for flag capture and becoming a specialist in flag captures.
Results: T1-V learns to traverse the barrier in the middle of the grid to expedite flag captures. However, T1-V does not become a flag-capture specialist, as T2 learns to mitigate T1-V’s superior mobility over the course of training.
Description: This experiment tests whether a Miner agent can learn to manipulate the environment in order to free a trapped teammate and gain a numerical advantage. T1 consists of a Miner and a Scout (T1-M and T1-S), and T2 consists of two Scouts. One Scout from each team is trapped by destructible tiles, which can only be removed by a Miner agent.
Hypothesis: The T1 Miner will learn to remove the destructible tiles surrounding the T1 Scout, gaining a numerical advantage over the opposing team and creating more flag captures.
Results: T1-M learns to mine blocks to effectively free its T1-S teammate, creating a numerical advantage over T2. This results in total game dominance and is consistent with the hypothesis. Once freed, T1-S and T1-M share similar behaviours.
Description: This experiment examines whether the Miner agent can learn to mine and place blocks in order to trap the opposing team. T1 consists of a Miner and a Scout (T1-M and T1-S), and T2 consists of two Scouts. The spawning zones for each team have a narrow opening from which agents can enter and exit. Several destructible tiles are placed throughout the map, including a row of tiles between the two flags, forcing a longer route for capture.
Hypothesis: T1-M will learn to mine tiles and block the T2 spawning zone, preventing T2 from exiting their spawning zone when tagged and effectively nullifying their threat in the game.
Results: Overall results indicate that T1-M learns to mine and place blocks to trap T2 agents, consistent with my hypothesis. Adaptations by T2 to counter T1’s strategy are observed in later training generations.
Description: This experiment tests whether the Guardian learns to leverage its defensive abilities. Team 1 consists of a Scout and a Guardian (T1-S and T1-G), while Team 2 has two Scouts. A single diagonal row of block tiles separates the two teams, leaving a single area of movement in the top-right quadrant of the game grid and forcing confrontation between teams.
Hypothesis: The Guardian will learn to defend its flag, whilst the T1 Scout will assume a more attacking role.
Results: Overall results indicate that T1-G learns to defend its flag whilst T1-S assumes a more attacking role, consistent with the hypothesis.
This study finds strong evidence that agents with different abilities naturally develop specialised roles when trained using reinforcement learning and self-play in a capture the flag setting.
Each agent type adopts behaviours suited to its strengths: Scouts balance attack and defence, Guardians focus on protecting their flag, Vaulters use speed to capture flags, and Miners reshape the environment to support teammates or hinder opponents. These roles remain consistent even as the environment becomes more complex.
Agents also adapt over time, responding to opponent strategies and sometimes changing or losing their specialisation. In some cases, strong imbalances between teams can limit the need for specialisation altogether.
Overall, the results show that reinforcement learning can produce coordinated, specialised behaviours without explicit instructions, simply by rewarding success in the game. This suggests broader potential for using RL to design effective teams of diverse agents in real-world applications, without needing to manually define roles or strategies.