Game Theory and Reinforcement Learning

by Karthick Gopalswamy
Ph.D. Candidate in Industrial and Systems Engineering, North Carolina State University

This article is intended to introduce game theory and reinforcement learning to students in the OR/MS field; it is not a comprehensive review of these topics.

Reinforcement Learning (RL) and Game Theory (GT) are two streams of mathematics with significant applications to real-life problems. Despite different origins, these methods share common traits in how problems are defined in a game environment, i.e., states, agents, and strategies (or policies). Reinforcement learning, a field of machine learning, is a ‘trial and error’ approach: an agent acts on an unknown environment and, based on its observations, adapts its behavior to maximize reward. Game theory is a mathematical way of defining the logical intricacies inherent to any rational analysis of conflict. Although the terminology ‘games’ sounds naive, an investigation of historical games found in economics, international trade, sociology, psychology, political policy, and warfare, and of their origins, is valuable in understanding the evolution of the human thinking process (evolutionary biology). Problems in RL and GT complement each other in the sense that RL provides efficient algorithms to solve more complex games using mathematical inspiration from GT. While the former is more of an art, the latter could well explain the science behind this art. Both fields have a long-standing history and, despite their similarities, have largely evolved as parallel domains.
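To make the ‘trial and error’ loop concrete, here is a minimal sketch of tabular Q-learning. The environment object with reset(), step(), and an actions list is hypothetical and not from this article; this is an illustrative sketch under those assumptions, not a definitive implementation.

```python
# A minimal sketch of the RL trial-and-error loop, assuming a hypothetical
# environment `env` with reset(), step(action), and a list of actions.
import random
from collections import defaultdict

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: act, observe the reward, and update value estimates."""
    Q = defaultdict(float)                      # Q[(state, action)] -> estimated return
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # Trial and error: explore with probability epsilon, otherwise exploit.
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Move the estimate toward the observed reward plus discounted future value.
            best_next = max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```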

The success of AlphaGo and OpenAI Five has sparked significant interest among researchers in the field of RL. RL was originally developed for solving Markov Decision Processes (MDPs), stochastic processes in which the system is fully characterized by the current state, independent of the past. Games that are not exactly MDPs can often be converted into MDPs by defining the history as the state; many games fall under this category. Chess is an example of a fully observable MDP where, given a state and the action to be taken, the next state is known with certainty. While the underlying theory in AlphaGo was a two-person zero-sum game, for which MDP theory is well understood, this is not the case with OpenAI Five. The years between AlphaGo and OpenAI Five certainly reflect the complexity of extending two-player games to the multi-player setting. Von Neumann and Morgenstern managed to define the concept of equilibrium only for a two-person zero-sum game, a pure competition in which one player gains from the loss of the other. John Nash addressed the general case of competition with the Nash equilibrium which, despite being a highly useful concept, rests on the fundamental assumption of rationality. This makes the static Nash equilibrium difficult to extend to real-world dynamic problems. A Nash equilibrium is defined as a set of strategies, one per player, with the property that no player can increase their payoff by changing their strategy alone.
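As a concrete illustration of the two-person zero-sum case, the sketch below computes the game value and an optimal mixed strategy for the row player by linear programming. The choice of game (matching pennies) and its payoff matrix are assumptions made purely for illustration, not taken from the article.

```python
# A minimal sketch: value and optimal row strategy of a 2-person zero-sum game
# via linear programming (matching pennies payoffs assumed for illustration).
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, -1.0],      # row player's payoffs
              [-1.0, 1.0]])
m, n = A.shape

# Variables: x_1..x_m (row player's mixed strategy) and v (game value).
# Maximize v  <=>  minimize -v.
c = np.concatenate([np.zeros(m), [-1.0]])

# For every column j:  sum_i x_i * A[i, j] >= v   <=>   -A[:, j]^T x + v <= 0
A_ub = np.hstack([-A.T, np.ones((n, 1))])
b_ub = np.zeros(n)

# Probabilities sum to one.
A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)
b_eq = np.array([1.0])

bounds = [(0, None)] * m + [(None, None)]   # x_i >= 0, v unrestricted
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)

print("optimal row strategy:", res.x[:m])   # ~[0.5, 0.5]
print("value of the game:  ", res.x[m])     # ~0.0
```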

Current AI systems are based on either a single agent tackling a task or a couple of agents competing (AlphaGo). AGI (Artificial General Intelligence) can materialize only by understanding how humans behave in everyday life. For example, it can be argued that the reason for helping or greeting others is the resulting reward: rewards like the satisfaction of helping a person or the general etiquette of greeting people. While it is easy for us to understand the results of our own actions and improve, we constantly fail to understand others. Why? We try to reason with rationality, yet we know people, including ourselves, are not rational. Irrationality leads to games of imperfect information, which are not amenable to simple analysis. In my opinion, the current state of AGI is far from successful in the sense that humans tend to evolve over time and are dynamic by nature. Depending on the situation, humans choose to compete, cooperate, or remain neutral. These choices increase the complexity of conceptualizing an AGI when using only the classic results on MDPs, GT, etc. The classic prisoner’s dilemma (shown in the figure below) may not necessarily converge to the equilibrium solution when the prisoners learn from their history.

[Figure: Prisoner's dilemma payoff matrix]
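The small simulation sketch below illustrates this point: two independent Q-learners repeatedly play the prisoner's dilemma, each conditioning on the opponent's previous action (i.e., learning from history). The payoff numbers are the usual textbook values, assumed here for illustration only; the outcome depends on the learning parameters and need not match the one-shot Nash prediction of mutual defection.

```python
# Two independent Q-learning "prisoners" playing the iterated prisoner's dilemma.
# Payoffs (higher is better) are standard textbook values, assumed for illustration.
import random

# Actions: 0 = cooperate, 1 = defect. PAYOFF[(mine, theirs)] -> my payoff.
PAYOFF = {(0, 0): 3, (0, 1): 0, (1, 0): 5, (1, 1): 1}

def iterated_pd(rounds=20000, alpha=0.1, gamma=0.9, epsilon=0.1):
    # Q[player][opponent's last action][my action]
    Q = [[[0.0, 0.0], [0.0, 0.0]] for _ in range(2)]
    last = [0, 0]                                   # start from mutual cooperation
    for _ in range(rounds):
        acts = []
        for p in (0, 1):
            state = last[1 - p]                     # opponent's previous move
            if random.random() < epsilon:
                acts.append(random.randint(0, 1))   # explore
            else:
                acts.append(0 if Q[p][state][0] >= Q[p][state][1] else 1)
        for p in (0, 1):
            state, nxt = last[1 - p], acts[1 - p]
            r = PAYOFF[(acts[p], acts[1 - p])]
            Q[p][state][acts[p]] += alpha * (r + gamma * max(Q[p][nxt]) - Q[p][state][acts[p]])
        last = acts
    return Q

# Depending on the learning dynamics, cooperative patterns can persist rather
# than collapsing to the static equilibrium of mutual defection.
print(iterated_pd())
```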

Fortunately, Evolutionary Game Theory (EGT) can incorporate dynamic behavior and adaptive learning, offering a solid basis for understanding dynamic, iterative situations in the context of strategic games. Further adaptation of EGT to formulate agent dynamics resulted in Multi-Agent Reinforcement Learning (MARL), which handles more complex interactions. To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of MARL. Problems like negotiation among different product teams in an organization, social interactions, strategy games, and consumer markets can be addressed more accurately using MARL. With the unprecedented success of Deep Reinforcement Learning, there is renewed interest in Deep MARL, the method OpenAI used successfully to play Dota 2 against human experts. Dota 2 is played in matches between two teams of five players, each team occupying and defending its own base on the map. Each of the ten players independently controls a powerful character, known as a "hero", each with unique abilities and a differing style of play. During a match, players collect experience points and items for their heroes in order to defeat the opposing team's heroes in player-versus-player combat. The AI was able to learn and act in a continuously changing environment with multiple interacting agents over a long time horizon. The significant leap in intelligence from AlphaGo to OpenAI Five, in my opinion, is the AI's ability to reason: strategizing with other agents (cooperating) while competing against the enemy under a partially observable state (a human-like behavior).
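To give a flavour of the EGT machinery mentioned above, here is a minimal replicator-dynamics sketch for a two-strategy (hawk-dove) population game. The payoff parameters (resource V, fighting cost C) are assumptions chosen for illustration and are not from the article; the sketch simply shows strategies growing when they outperform the population average.

```python
# Replicator dynamics for a two-strategy hawk-dove game (V = 2, C = 3 assumed).
import numpy as np

V, C = 2.0, 3.0
A = np.array([[(V - C) / 2, V],        # hawk vs hawk, hawk vs dove
              [0.0,         V / 2]])   # dove vs hawk, dove vs dove

def replicator(x, A, dt=0.01, steps=5000):
    """x: population shares of each strategy; shares grow when they beat the average payoff."""
    for _ in range(steps):
        fitness = A @ x                      # expected payoff of each strategy
        avg = x @ fitness                    # population-average payoff
        x = x + dt * x * (fitness - avg)     # replicator update
        x = x / x.sum()                      # guard against numerical drift
    return x

print(replicator(np.array([0.9, 0.1]), A))   # tends toward the mixed ESS, ~V/C hawks
```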

Finally, having said all of the above, one must consider the moral and ethical issues surrounding the advancement of these algorithms and of AGI. As we move towards a more autonomous mode of living, we should always question ourselves about accountability. AI decisions are algorithmic (unless we believe that an AI reasons like us, which is irrational), so we need to decide where accountability falls. On the flip side, a person, despite not being optimal, will always be held accountable for the consequences of their decisions.