The beach bar process is a Markov Decision Process with \(|X|\)
states arranged on a one-dimensional torus (\(X = \{0, \ldots, |X|-1\}\)), which
represents a beach. A bar is located in one of the states. Because the
weather is very hot, players want to be as close as possible to the bar
while staying away from overly crowded areas.[#1bb]_
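To make the trade-off between proximity and crowding concrete, here is a minimal sketch of one possible reward. The functional form (normalized torus distance plus a logarithmic crowd penalty), the names `torus_distance` and `beach_bar_reward`, and the `crowd_weight` parameter are illustrative assumptions, not the library's definition.

```python
import math


def torus_distance(x: int, bar: int, n_states: int) -> int:
    """Shortest distance between states x and bar on a ring of n_states states."""
    d = abs(x - bar) % n_states
    return min(d, n_states - d)


def beach_bar_reward(x: int, bar: int, mu: list, crowd_weight: float = 1.0) -> float:
    """Hypothetical reward: closer to the bar is better, crowded states are worse.

    mu is the population distribution over states (the mean field).
    """
    n_states = len(mu)
    closeness = -torus_distance(x, bar, n_states) / n_states
    # Small constant avoids log(0) for empty states (an assumption of this sketch).
    crowd_penalty = crowd_weight * math.log(mu[x] + 1e-12)
    return closeness - crowd_penalty
```

With a uniform population the bar's own state is the most attractive; once most of the population sits at the bar, a neighboring state becomes preferable.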
In this problem, a crowd occupies a multilevel building, and each agent
wants to go downstairs as quickly as possible while favoring
social distancing. On each floor, two staircases are located at two
opposite corners, so that the crowd has to cross the whole floor to reach
the next staircase. Each agent can remain in place, move in one of 4
directions (up, down, right, left), or go up or down a floor when at a
staircase location.[#be]_
In this problem, a large number of homogeneous firms producing the same
product under perfect competition are considered. The price of the
product is determined endogenously by the supply-demand equilibrium.
Each firm, meanwhile, maintains a certain inventory level of the raw
materials for production, and decides how much raw material to consume
for production and how much to order to replenish the inventory.
Guo, X., Hu, A., Xu, R., & Zhang, J. (2022).
A general framework for learning mean-field games.
Mathematics of Operations Research.
A large number of agents choose simultaneously between going left (L) or
right (R). Afterwards, each agent is penalized in proportion to the
number of agents that chose the same action, with a larger penalty for
choosing right than for choosing left.
Cui, Kai, and Heinz Koeppl. “Approximately solving mean field games via
entropy-regularized deep reinforcement learning.” International Conference
on Artificial Intelligence and Statistics. PMLR, 2021.
https://proceedings.mlr.press/v130/cui21a.html
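The left-right penalty can be sketched in a few lines. The function name `left_right_reward` and the multiplier `right_penalty=2.0` are assumptions chosen for illustration; the description only states that choosing right is penalized more than choosing left.

```python
def left_right_reward(action: str, frac_left: float, right_penalty: float = 2.0) -> float:
    """Hypothetical reward for one agent given the fraction of agents going left.

    The penalty grows with the share of agents that picked the same action,
    and is scaled up by right_penalty (> 1) for the action "R".
    """
    if action == "L":
        return -frac_left
    if action == "R":
        return -right_penalty * (1.0 - frac_left)
    raise ValueError(f"unknown action: {action!r}")
```

Under these assumed numbers, an agent is indifferent between the two actions when two thirds of the population goes left.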
Perrin, Sarah, et al. “Fictitious play for mean field games: Continuous time
analysis and applications.” Advances in Neural Information Processing
Systems 33 (2020): 13199-13213.
A custom environment in which the rewards and transition probabilities
are random affine functions of the mean-field. For transition
probabilities to be valid, a softmax function is applied on top of the
corresponding affine function.
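The softmax step can be illustrated as follows. This is a simplified sketch: it treats the mean field as a distribution over states only, draws Gaussian coefficients, and uses the hypothetical names `softmax` and `random_affine_transition`; the library's actual parameterization may differ.

```python
import math
import random


def softmax(logits: list) -> list:
    """Numerically stable softmax: positive entries that sum to one."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


def random_affine_transition(mu: list, rng: random.Random) -> list:
    """Next-state distribution built from a random affine function of mu."""
    n = len(mu)
    # Random affine map: logits = W @ mu + b (coefficients are assumptions).
    weights = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(n)]
    bias = [rng.gauss(0.0, 1.0) for _ in range(n)]
    logits = [
        sum(w * m for w, m in zip(row, mu)) + b
        for row, b in zip(weights, bias)
    ]
    # The affine output is unconstrained, so softmax maps it to a valid
    # probability vector.
    return softmax(logits)
```

Whatever the sampled coefficients are, the output is a valid probability vector, which is the purpose of applying softmax.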
This game is inspired by Shapley (1964) and his generalized non-zero-sum
version of Rock-Paper-Scissors, for which classical fictitious play does not
converge. Each agent chooses between rock, paper, and scissors, and
obtains a reward proportional to twice the number of agents it beats minus the
number of agents that beat it.
Cui, Kai, and Heinz Koeppl. “Approximately solving mean field games via
entropy-regularized deep reinforcement learning.” International Conference
on Artificial Intelligence and Statistics. PMLR, 2021.
https://proceedings.mlr.press/v130/cui21a.html
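The reward rule for this game translates directly into code. The sketch below takes the proportionality constant to be 1 and represents the population as a dictionary of action fractions; both choices, and the name `rps_reward`, are assumptions made for illustration.

```python
# Maps each action to the action it beats.
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}


def rps_reward(action: str, mu: dict) -> float:
    """Twice the fraction of agents beaten minus the fraction beating the agent."""
    beaten = BEATS[action]
    # The action whose "beats" entry is the agent's own action.
    beaten_by = next(a for a, loser in BEATS.items() if loser == action)
    return 2.0 * mu[beaten] - mu[beaten_by]
```

Because a win counts twice as much as a loss, the rewards do not sum to zero across the population, which is what makes the game non-zero-sum.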
In this problem, a large number of agents can choose between social
distancing (D) or going out (U). If a susceptible (S) agent chooses social
distancing, they cannot become infected (I). Otherwise, the agent becomes
infected with a probability proportional to the number of infected agents.
If infected, an agent recovers with a fixed probability at every time step. Both
social distancing and being infected have an associated cost.
Cui, Kai, and Heinz Koeppl. “Approximately solving mean field games via
entropy-regularized deep reinforcement learning.” International Conference
on Artificial Intelligence and Statistics. PMLR, 2021.
https://proceedings.mlr.press/v130/cui21a.html
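The infection dynamics can be sketched as a single transition function. The rates `infection_rate=0.8` and `recovery_prob=0.2`, the function name, and the reading that social distancing gives zero infection probability are assumptions of this sketch, not values taken from the library.

```python
def sis_next_infected_prob(
    state: str,
    action: str,
    frac_infected: float,
    infection_rate: float = 0.8,
    recovery_prob: float = 0.2,
) -> float:
    """Probability that the agent is infected ("I") at the next time step."""
    if state == "I":
        # An infected agent recovers with a fixed probability each step.
        return 1.0 - recovery_prob
    if action == "D":
        # A susceptible agent who social-distances is assumed fully protected.
        return 0.0
    # Going out: infection risk scales with the infected share of the population.
    return min(1.0, infection_rate * frac_infected)
```

The mean field enters only through `frac_infected`, so an agent's risk when going out depends on how many others are currently infected.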
The implementation is based on damped Fictitious Play.
When alpha=None, the algorithm reduces to the original Fictitious Play
algorithm. When alpha=1, it reduces to the Fixed Point Iteration
algorithm.
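The role of alpha can be seen in a sketch of a single damped averaging step. It assumes the usual convention that classical Fictitious Play weights the newest best-response mean field by 1/(n+1) at iteration n; the function name `damped_update` and the list-based mean field are hypothetical simplifications.

```python
from typing import List, Optional


def damped_update(
    current: List[float],
    best_response_mf: List[float],
    iteration: int,
    alpha: Optional[float] = None,
) -> List[float]:
    """Mix the current mean field with the one induced by the best response.

    alpha=None -> weight 1/(iteration + 1), i.e. a running average (Fictitious Play).
    alpha=1    -> the best-response mean field replaces the current one
                  (Fixed Point Iteration).
    """
    weight = 1.0 / (iteration + 1) if alpha is None else alpha
    return [
        (1.0 - weight) * c + weight * b
        for c, b in zip(current, best_response_mf)
    ]
```

Values of alpha strictly between 0 and 1 give a constant-step interpolation between the two extremes.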
Tune the algorithm over a given environment suite.
Parameters:
env_suite – A list of environment instances.
max_iter – The number of iterations to run the algorithm on each environment
instance.
atol – Absolute tolerance criterion for early stopping.
rtol – Relative tolerance criterion for early stopping.
metric – The metric used to score a trial. Either
shifted_geo_mean or failure_rate.
n_trials – The number of trials. If this argument is not given, as many
trials are run as possible.
timeout – Stop tuning after the given number of seconds on each
environment instance. If this argument is not given, as many trials are
run as possible.
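As a rough illustration of how the tolerances and the two metrics could fit together, the sketch below uses common textbook definitions. The stopping rule `value <= atol + rtol * initial`, the `shift` constant, and all three function names are assumptions made here; they are not confirmed details of the library.

```python
import math
from typing import List


def has_converged(value: float, initial: float, atol: float, rtol: float) -> bool:
    """Assumed early-stopping test combining absolute and relative tolerance."""
    return value <= atol + rtol * initial


def shifted_geo_mean(values: List[float], shift: float = 10.0) -> float:
    """Shifted geometric mean: exp(mean(log(v + shift))) - shift."""
    logs = [math.log(v + shift) for v in values]
    return math.exp(sum(logs) / len(logs)) - shift


def failure_rate(solved: List[bool]) -> float:
    """Fraction of environment instances on which the run did not converge."""
    return sum(1 for ok in solved if not ok) / len(solved)
```

A shifted geometric mean damps the influence of very small and very large iteration counts, while the failure rate ignores speed and only counts unsolved instances.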