Introduction to Multi-Armed Bandits (MABs)
The multi-armed bandit (MAB) is a fundamental framework for sequential decision-making problems. In this setting, a decision-maker (or player) faces a set of arms, each with an unknown reward distribution. At each time step, the player must choose an arm to pull and receives a reward drawn from that arm's distribution. The goal is to maximize the cumulative reward over time. MABs are a versatile tool with applications in various fields, including the following (a minimal simulation of this interaction protocol appears after the list):
- Recommender Systems: Optimizing content recommendations based on user preferences.
- Clinical Trials: Determining the most effective treatment for a disease.
- Online Advertising: Optimizing ad placements to maximize click-through rates.
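To make the protocol concrete, here is a minimal Python sketch of the bandit loop. The Bernoulli arms and their means are illustrative assumptions, not values from any real system, and the player here pulls uniformly at random, which is deliberately naive; smarter policies appear in the sections below.

```python
import numpy as np

rng = np.random.default_rng(0)
arm_means = [0.2, 0.5, 0.7]        # unknown to the player (illustrative values)
K, T = len(arm_means), 1000

cumulative_reward = 0.0
for t in range(T):
    arm = rng.integers(K)                     # naive player: pull uniformly at random
    reward = rng.binomial(1, arm_means[arm])  # Bernoulli reward drawn from the arm
    cumulative_reward += reward

print(f"cumulative reward after {T} rounds: {cumulative_reward}")
```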
MABs can be categorized based on the nature of the reward process:
- Stochastic MABs: The reward distribution for each arm is stationary, stochastic, and independent over time.
- Adversarial MABs: The rewards are chosen by an adversary, potentially aiming to mislead the player.
- Markovian MABs: The reward distribution of each arm depends on the state of an underlying Markov process, which evolves with the player's selections.
- Contextual MABs: The player receives contextual information at each time step, which can influence the reward distribution.
The MAB problem has attracted significant attention from researchers in various fields, including computer science, mathematics, and statistics. Notable pioneers in the field include:
- Richard Sutton: Known for his work on reinforcement learning, which has strong connections to MABs.
- Peter Auer: A leading researcher in the development of algorithms for stochastic MABs.
- John Langford: Known for his contributions to contextual MABs and online learning.
- Sébastien Bubeck: A prominent researcher in the theoretical analysis of MAB algorithms.
- Lihong Li: Known for his work on bandit algorithms for large-scale applications.
- …
These researchers, along with many others, have contributed to the development of theoretical foundations, algorithms, and applications of MABs. This has made the field an active and growing area of research with ongoing advancements in both theory and practice.
Theoretical Parts
1. Stochastic MABs
In this problem, each machine provides a random reward drawn from a probability distribution specific to that machine, which is not known a priori. The objective of the gambler is to maximize the sum of rewards earned through a sequence of lever pulls. The crucial trade-off the gambler faces at each trial is between "exploitation" of the machine that has the highest expected payoff and "exploration" to gather more information about the expected payoffs of the other machines. The same trade-off between exploration and exploitation arises throughout machine learning. In practice, multi-armed bandits have been used to model problems such as managing research projects in a large organization, like a science foundation or a pharmaceutical company. In early versions of the problem, the gambler begins with no initial knowledge about the machines.
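As one concrete way to balance this trade-off, the sketch below implements the classical UCB1 index policy, assuming Bernoulli arms with rewards in [0, 1]; the arm means are illustrative values, not data from any real system.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
arm_means = [0.2, 0.5, 0.7]            # unknown to the player (illustrative)
K, T = len(arm_means), 10_000

counts = np.zeros(K)                   # number of pulls per arm
values = np.zeros(K)                   # empirical mean reward per arm

for t in range(1, T + 1):
    if t <= K:
        arm = t - 1                    # pull each arm once to initialize
    else:
        # UCB1 index: empirical mean plus an exploration bonus that
        # shrinks as an arm is pulled more often.
        ucb = values + np.sqrt(2 * math.log(t) / counts)
        arm = int(np.argmax(ucb))
    reward = rng.binomial(1, arm_means[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]   # running mean

print("pulls per arm:", counts)        # the best arm should dominate
```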
2. Adversarial MABs
The adversarial multi-armed bandit (MAB) is a variation of the classic MAB problem in which an adversary controls the reward assigned to each arm and can change it arbitrarily over time. In other words, the adversary has full control over the rewards that the decision-maker receives at each time step, and its goal is to minimize the decision-maker's total expected reward.
The adversarial MAB problem is a worst-case scenario in which the decision-maker cannot make any assumptions about the reward distributions and must adaptively respond to the adversary's actions. This problem arises in many real-world applications, such as online advertising, where the adversary may manipulate the user's preferences or behavior to reduce the effectiveness of the advertising.
Various algorithms and strategies have been developed to solve the adversarial MAB problem, including the EXP3 (Exponential-weight algorithm for Exploration and Exploitation) algorithm, which mixes exponential weighting with randomized uniform exploration so that no arm is abandoned prematurely, and the follow-the-leader algorithm, which simply chooses the arm with the highest cumulative observed reward so far.
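The following is a minimal sketch of EXP3, assuming rewards bounded in [0, 1]. The reward_fn callback that plays the role of the adversary is a hypothetical interface introduced here for illustration.

```python
import numpy as np

def exp3(K, T, gamma, reward_fn, seed=0):
    """EXP3: exponential weighting with uniform exploration (rewards in [0, 1])."""
    rng = np.random.default_rng(seed)
    weights = np.ones(K)
    total = 0.0
    for t in range(T):
        # Mix the exponential-weights distribution with uniform exploration.
        probs = (1 - gamma) * weights / weights.sum() + gamma / K
        arm = rng.choice(K, p=probs)
        r = reward_fn(t, arm)
        total += r
        # Importance weighting keeps the reward estimate unbiased,
        # even though only the chosen arm's reward is observed.
        x_hat = r / probs[arm]
        weights[arm] *= np.exp(gamma * x_hat / K)
        weights /= weights.max()       # renormalize to avoid overflow
    return total

# Example: an oblivious adversary that periodically swaps the good arm.
total = exp3(K=2, T=5000, gamma=0.1,
             reward_fn=lambda t, a: float(a == (t // 1000) % 2))
print("total reward:", total)
```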
The adversarial MAB problem is more challenging than the classic MAB problem because it requires the decision-maker to adaptively respond to the actions of an intelligent adversary who is actively trying to manipulate the decision-maker's choices. However, it is also more relevant to real-world applications in which the decision-maker may face strategic opponents who are trying to exploit or deceive them.
3. Markovian MABs
The Markovian multi-armed bandit (MAB) is a variation of the classic MAB problem in which the reward distributions for each arm depend on the state of a Markov decision process (MDP). In an MDP, the system's state evolves over time according to a set of probabilistic transition rules, and the decision-maker's actions can influence both the transition probabilities and the rewards received at each time step.
In the Markovian MAB problem, the decision-maker must learn a policy that maps each state to the best arm to pull at each time step in order to maximize the total expected reward over a sequence of pulls. This problem arises in many real-world applications, such as resource allocation in wireless communication networks, where the decision-maker must adaptively allocate resources to different users based on their channel conditions, which can vary over time.
Various algorithms and strategies have been developed to solve the Markovian MAB problem, including MDP-based bandit algorithms, which use an MDP model to estimate the expected reward of each arm in each state, and the Q-learning algorithm, which learns a Q-function representing the expected total reward of each state-action pair.
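As a sketch of the second approach, the toy example below runs tabular Q-learning on a hypothetical two-state, two-action MDP; the transition and reward structure is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy MDP (illustrative): action a moves to state a with probability 0.9,
# and reward 1 is earned only by taking action 1 while in state 1.
def step(s, a):
    s_next = a if rng.random() < 0.9 else 1 - a
    reward = 1.0 if (s == 1 and a == 1) else 0.0
    return s_next, reward

n_states, n_actions = 2, 2
alpha, gamma, eps = 0.1, 0.9, 0.1
Q = np.zeros((n_states, n_actions))

s = 0
for t in range(20_000):
    # Epsilon-greedy action selection balances exploration and exploitation.
    a = rng.integers(n_actions) if rng.random() < eps else int(np.argmax(Q[s]))
    s_next, r = step(s, a)
    # Q-learning update toward the bootstrapped one-step target.
    Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
    s = s_next

print(Q)   # action 1 should score highest in both states
```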
The Markovian MAB problem is more complex than the classic MAB problem because it requires the decision-maker to reason about the system's dynamics and about how their actions influence future states and rewards. However, it is also more powerful because it allows the decision-maker to exploit the system's temporal structure to improve decision-making performance.
4. Contextual MABs
The contextual multi-armed bandit (MAB) is an extension of the classic MAB problem in which each arm is associated with a context, or set of features, that can provide additional information to the decision-maker. In other words, the reward distribution for each arm depends on the context or features associated with that arm.
The goal of the contextual MAB problem is to learn a policy that maps each context to the best arm to pull at each time step in order to maximize the total expected reward over a sequence of pulls. This problem arises in many real-world applications, such as online advertising, personalized medicine, and recommendation systems, where the decision-maker has access to contextual information about the user or environment.
Various algorithms and strategies have been developed to solve the contextual MAB problem, including algorithms such as LinUCB, which model the expected reward of each arm as a function of the observed context, and contextual Thompson sampling, which extends the classic Thompson sampling algorithm to incorporate contextual information.
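As an illustration of the first approach, here is a minimal sketch of the disjoint LinUCB variant, which fits one ridge-regression model per arm and assumes the expected reward is linear in a shared d-dimensional context vector; the synthetic data in the usage snippet is purely illustrative.

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha                             # exploration scale
        self.A = [np.eye(d) for _ in range(n_arms)]    # X^T X + I per arm
        self.b = [np.zeros(d) for _ in range(n_arms)]  # X^T y per arm

    def choose(self, x):
        scores = []
        for A, b in zip(self.A, self.b):
            # Recomputing the inverse each round is simple but O(d^3);
            # Sherman–Morrison updates are the usual optimization.
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                          # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)
            scores.append(theta @ x + bonus)           # UCB score
        return int(np.argmax(scores))

    def update(self, arm, x, reward):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x

# Usage on synthetic data: 3 arms, 5-dimensional contexts.
rng = np.random.default_rng(3)
true_theta = rng.normal(size=(3, 5))                   # hidden per-arm parameters
bandit = LinUCB(n_arms=3, d=5, alpha=1.0)
for t in range(2000):
    x = rng.normal(size=5)
    arm = bandit.choose(x)
    reward = true_theta[arm] @ x + 0.1 * rng.normal()  # noisy linear reward
    bandit.update(arm, x, reward)
```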
The contextual MAB problem is more challenging than the classic MAB problem because it requires the decision-maker to learn a policy that effectively uses contextual information to make informed decisions about which arm to pull at each time step. However, it is also more powerful because it allows the decision-maker to take advantage of additional information to improve decision-making performance.
Application Parts
1. MAB in Wireless Communications
Resource management has been a perpetual theme in wireless network design, deployment, and operations. In today's wireless systems, data demands continue to grow across a diverse range of applications, from bandwidth-hungry multimedia streaming, delay-sensitive instant messaging, and online gaming to bulk data transfer. The ever-increasing need of mobile users for high-speed, ubiquitous network access is further aggravated by emerging machine-to-machine communication for home and industrial automation, wide-area sensing and monitoring, autonomous vehicles, etc. Delivery of this rich set of applications is fundamentally limited by resource scarcity in wireless networks, which manifests at various levels. For instance, spectrum scarcity has emerged as a primary problem when launching new wireless services. Vendors and operators are increasingly looking into millimeter-wave radio bands for the 5G cellular standard, though this spectrum was previously considered unsuitable for wide-area applications. Interference among wireless transceivers in close proximity continues to pose challenges to the delivery of reliable and timely services. Mobile devices are inherently power-constrained, demanding efficient communication schemes and protocols.
Many resource management solutions in wireless networks operate on the assumption that the decision-makers have complete knowledge of system states and parameters (e.g., channel states, network topology, and user density). When such information is unavailable or incomplete, probing or learning has to be conducted prior to making resource management decisions. As an example, in orthogonal frequency-division multiplexing (OFDM) systems, pilot signals are transmitted either along a dedicated set of subcarriers or over a specific period across all subcarriers for channel estimation. This allows subsequent transmissions to adapt to current channel conditions. Sequential learning, in contrast, is a paradigm in which learning and decision-making are performed concurrently. The framework is applicable in a variety of scenarios where the utility of resources follows single-parameter independent and identically distributed (i.i.d.), Markovian, or unknown adversarial processes, making it a powerful tool in wireless resource management. Sequential learning and decision-making have been successfully applied to resource management in cognitive radio networks, wireless LANs, and mesh networks.
However, there are significant barriers to the wider adoption of the framework for addressing resource management problems in the wireless community. We believe this can be attributed to two reasons. First, sequential learning theory builds on sophisticated stochastic concepts, posing a significant learning curve; effective algorithms often rely on underlying assumptions such as stationarity and independence of the stochastic processes, so identifying a suitable solution to a specific problem can be a tall order for beginners. Second, there is a disconnect between theory and practical constraints in real-world settings. For example, in practice, the timeliness of decision-making often trumps optimality, whereas the primary concerns of the sequential learning literature are the long-run convergence rate and the optimality of the sequence of actions.
This book is a first attempt to bridge the aforementioned gaps. Though the literature on sequential learning is abundant, a comprehensive treatment of its applications in wireless networks is lacking. In this book, we aim to lay out the theoretical foundation of so-called multi-armed bandit (MAB) problems and put it in the context of resource management in wireless networks. Part I presents the formulations, algorithms, and performance of three forms of MAB problems, namely stochastic, Markovian, and adversarial. To the best of our knowledge, this is the first work that covers all three forms of MAB problems. Part II provides detailed discussions of representative applications of the sequential learning framework in wireless resource management. These serve as case studies, both to illustrate how the theoretical framework and tools in Part I can be applied and to demonstrate how existing algorithms can be extended to address practical concerns in operational wireless networks. We believe both industry and the wireless research community can benefit from a comprehensive and timely treatment of these topics.
Study Materials for MAB

- Sébastien Bubeck and Nicolò Cesa-Bianchi, "Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems," Foundations and Trends in Machine Learning, 2012.
- Aleksandrs Slivkins, "Introduction to Multi-Armed Bandits," Foundations and Trends in Machine Learning, 2019.