Environments¶

Built-In¶

MFGLib comes with 10 pre-implemented environments which can be accessed by calling the corresponding classmethods of Environment. The pre-implemented environments are listed below:

classmethod Environment.beach_bar(T: int = 2, n: int = 4, bar_loc: int = 2, log_eps: float = 1e-20, p_still: float = 0.5, mu0: Literal['uniform'] | Tensor = 'uniform') → Environment¶

Instantiate the Beach Bar environment.

The beach bar process is a Markov Decision Process with \(|X|\) states arranged on a one-dimensional torus (\(X = \{0, \ldots, |X|-1\}\)), which represents a beach. A bar is located in one of the states. As the weather is very hot, players want to be as close as possible to the bar while keeping away from overly crowded areas.

See also

Refer to Perrin et al.⁹ for further details.

classmethod Environment.building_evacuation(T: int = 3, n_floor: int = 5, floor_l: int = 10, floor_w: int = 10, log_eps: float = 1e-20, eta: float = 1.0, evac_r: float = 10.0, mu0: Literal['uniform'] | Tensor = 'uniform') → Environment¶

Instantiate the Building Evacuation environment.

In this problem, there is a multilevel building, and each agent in the crowd wants to go downstairs as quickly as possible while favoring social distancing. On each floor, two staircases are located at two opposite corners, so that the crowd has to cross the whole floor to take the next staircase. Each agent can remain in place, move in one of the 4 directions (up, down, right, left), or go up or down when at a staircase location.

See also

Refer to Perolat et al.⁸ for further details.

classmethod Environment.conservative_treasure_hunting(T: int = 5, n: int = 3, r: tuple[float, ...] = (1.0, 1.0, 1.0), c: tuple[float, ...] = (1.0, 1.0, 1.0, 1.0, 1.0), mu0: Literal['uniform'] | Tensor = 'uniform') → Environment¶

Instantiate the Conservative Treasure Hunting environment.

See also

Refer to Guo et al.⁶ for further details.

classmethod Environment.crowd_motion(T: int = 3, torus_l: int = 20, torus_w: int = 20, loc_change_freq: int = 2, c: float = 10.0, log_eps: float = 1e-10, p_still: float = 0.5, seed: int = 0, mu0: Literal['uniform'] | Tensor = 'uniform') → Environment¶

Instantiate the Crowd Motion environment.

An adaptation of the Crowd Motion environment, which extends the Beach Bar environment to 2 dimensions.

See also

Refer to Perolat et al.⁸ for further details.

classmethod Environment.equilibrium_price(T: int = 4, s_inv: int = 3, Q: int = 2, H: int = 2, d: float = 1.0, e0: float = 1.0, sigma: float = 1.0, c: tuple[float, float, float, float, float] = (1.0, 1.0, 1.0, 1.0, 1.0), mu0: Literal['uniform'] | Tensor = 'uniform') → Environment¶

Instantiate the Equilibrium Price environment.

In this problem, a large number of homogeneous firms producing the same product under perfect competition are considered. The price of the product is determined endogenously by the supply-demand equilibrium. Each firm, meanwhile, maintains a certain inventory level of the raw materials for production, and decides about the quantity of raw materials to consume for production and the quantity of raw materials to replenish the inventory.

See also

Refer to Guo et al.⁶ for further details.

classmethod Environment.left_right(mu0: tuple[float, float, float] = (1.0, 0.0, 0.0)) → Environment¶

Instantiate the Left Right environment.

A large number of agents simultaneously choose between going left (L) or right (R). Afterwards, each agent is penalized in proportion to the number of agents that chose the same action, but more so for choosing right than left.

See also

Refer to Cui and Koeppl³ for further details.

classmethod Environment.linear_quadratic(T: int = 3, el: int = 5, m: int = 2, sigma: float = 3.0, delta: float = 0.1, k: float = 1.0, q: float = 0.01, kappa: float = 0.5, c_term: float = 1.0, mu0: Literal['uniform'] | Tensor = 'uniform') → Environment¶

Instantiate the Linear Quadratic environment.

See also

Refer to Perrin et al.⁹ for further details.

classmethod Environment.random_linear(T: int = 3, n: int = 5, m: float = 10.0, seed: int = 0, mu0: Literal['uniform'] | Tensor = 'uniform') → Environment¶

Instantiate the Random Linear environment.

A custom environment in which the rewards and transition probabilities are random affine functions of the mean-field. For transition probabilities to be valid, a softmax function is applied on top of the corresponding affine function.

classmethod Environment.rock_paper_scissors(T: int = 1, mu0: tuple[float, float, float, float] = (1.0, 0.0, 0.0, 0.0)) → Environment¶

Instantiate the Rock Paper Scissors environment.

This game is inspired by the generalized non-zero-sum version of Rock-Paper-Scissors from Shapley (1964), for which classical fictitious play would not converge. Each agent chooses between rock, paper, and scissors, and obtains a reward proportional to twice the number of agents it beats minus the number of agents that beat it.

See also

Refer to Cui and Koeppl³ for further details.

classmethod Environment.susceptible_infected(T: int = 50, mu0: tuple[float, float] = (0.4, 0.6)) → Environment¶

Instantiate the Susceptible Infected environment.

In this problem, a large number of agents can choose between social distancing (D) or going out (U). If a susceptible (S) agent chooses social distancing, they cannot become infected (I). Otherwise, the agent may become infected with a probability proportional to the number of infected agents. Once infected, an agent recovers with a fixed probability at each time step. Both social distancing and being infected have an associated cost.

See also

Refer to Cui and Koeppl³ for further details.

All built-in environments are parameterized so that you can control the size of the state space, action space, and time horizon. In the following example, we create two distinct buildings: one with 10 floors of size 20 by 20 each, and another with 100 floors of size 50 by 5 each.

from mfglib.env import Environment

env_1 = Environment.building_evacuation(n_floor=10, floor_l=20, floor_w=20)
env_2 = Environment.building_evacuation(n_floor=100, floor_l=50, floor_w=5)

User-Defined¶

Any environment defined in this library has the following attributes:

T: Sets the time horizon of the environment from 0 to T (inclusive, integer steps).
S: State space shape. For example, if the state space is all the integers from 1 to 100, then S=(100,), and if the state space is all the integer grid points \((x, y)\) such that \(1 \leq x,y \leq 100\), then S=(100, 100).
A: Action space shape.
mu0: Initial state distribution.
r_max: The supremum of the absolute value of rewards. This parameter is only used in the Mean-Field Occupation Measure Optimization algorithm and does not need to be exact; a loose upper bound is sufficient.
reward_fn: Defines the reward function.
transition_fn: Defines the transition probability function.

Note

In the integer grid example, we could flatten the state space and represent it as a one-dimensional vector of size 10,000. However, the convention in this library is to keep the state (and action) space multi-dimensional whenever possible. This convention makes policies, mean-fields, rewards, etc. easier to interpret.
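To make the trade-off concrete, the following sketch maps between a grid state (x, y) and its flattened index for a 100 by 100 state space. The helpers `to_flat` and `to_grid` are hypothetical names used only for illustration and are not part of MFGLib; indices are zero-based, as in tensor indexing.

```python
# Hypothetical helpers (not part of MFGLib) relating a 2-D state to a flat index.
S = (100, 100)  # multi-dimensional state space shape


def to_flat(x: int, y: int) -> int:
    """Row-major flat index of the zero-based grid point (x, y)."""
    return x * S[1] + y


def to_grid(index: int) -> tuple[int, int]:
    """Inverse of to_flat: recover (x, y) from a flat index."""
    return divmod(index, S[1])


flat = to_flat(37, 82)
print(flat)           # 3782
print(to_grid(flat))  # (37, 82)
```

With the multi-dimensional convention, this bookkeeping is unnecessary: an entry can be read directly at the index (x, y).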

Policy and Mean-Field Tensors. Given T, S, and A, the shape of policy and mean-field tensors will be (T+1,) + S + A. For example, if T=10, S=(20, 20), A=(5,), the policy and mean-field tensors will be of size (11, 20, 20, 5). In general, let S=(S_1, S_2, ..., S_n) and A=(A_1, A_2, ..., A_m), and let pi and L be a policy and a mean-field tensor, respectively. Then, pi[t, s_1, s_2, ..., s_n, a_1, a_2, ..., a_m] is the probability of choosing action a = (a_1, a_2, ..., a_m) conditional on being at the state s = (s_1, s_2, ..., s_n) at time t, and L[t, s_1, s_2, ..., s_n, a_1, a_2, ..., a_m] is the portion of players that are in state s = (s_1, s_2, ..., s_n) and choose action a = (a_1, a_2, ..., a_m) at time t.
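As a quick check of the shape rule, the sketch below recomputes the example shape with plain Python tuples and derives the entry values of a uniform policy and a uniform mean-field. It relies only on the normalization implied by the definitions above (a policy sums to one over actions for each time and state; a mean-field sums to one over state-action pairs for each time). The variable names are illustrative.

```python
from math import prod

T = 10
S = (20, 20)
A = (5,)

# Policy and mean-field tensors share the shape (T+1,) + S + A.
tensor_shape = (T + 1,) + S + A
print(tensor_shape)  # (11, 20, 20, 5)

# Uniform policy: in every state, each action has probability 1 / |A|.
uniform_policy_entry = 1 / prod(A)

# Uniform mean-field: at each time, mass is spread evenly over all
# state-action pairs.
uniform_mean_field_entry = 1 / (prod(S) * prod(A))

print(uniform_policy_entry)      # 0.2
print(uniform_mean_field_entry)  # 0.0005
```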

Reward Function. We define the reward function via the argument reward_fn. The user is allowed to pass either a function or a class implementing __call__. The inputs of the reward function must be env (an environment instance), t (a specific time less than or equal to the time horizon), and L_t (the mean-field tensor at time t). The output will be a tensor of shape S + A. Let r be the output tensor, and assume S=(S_1, S_2, ..., S_n) and A=(A_1, A_2, ..., A_m). Then, r[s_1, s_2, ..., s_n, a_1, a_2, ..., a_m] is the reward that an agent receives for choosing action a = (a_1, a_2, ..., a_m) conditional on being at state s = (s_1, s_2, ..., s_n).
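The callable interface can be sketched as follows. `CrowdAversionReward` is a hypothetical reward invented for this illustration, not one of the library's environments, and nested lists stand in for tensors so that the snippet runs without dependencies; with MFGLib, the same logic would operate on torch tensors of shape S + A.

```python
class CrowdAversionReward:
    """Hypothetical reward: penalize agents for being in a crowded state."""

    def __init__(self, weight: float) -> None:
        self.weight = weight

    def __call__(self, env, t, L_t):
        # L_t[s][a] is the fraction of players in state s choosing action a.
        rewards = []
        for row in L_t:
            crowd = sum(row)  # total population mass in this state
            rewards.append([-self.weight * crowd for _ in row])
        return rewards  # same layout as L_t, i.e. shape S + A


reward_fn = CrowdAversionReward(weight=2.0)

# Two states and two actions; the entries sum to 1.
L_t = [[0.5, 0.25], [0.25, 0.0]]
r = reward_fn(env=None, t=0, L_t=L_t)
print(r)  # [[-1.5, -1.5], [-0.5, -0.5]]
```

A class is convenient when the reward needs parameters or precomputed data, as here with `weight`; a plain function or lambda with the same three arguments works equally well for stateless rewards.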

Transition Function. We define the transition probability function via the argument transition_fn. The user is allowed to pass either a function or a class implementing __call__. The inputs of the transition probability function must be env (an environment instance), t (a specific time less than or equal to the time horizon), and L_t (the mean-field tensor at time t). The output will be a tensor of shape S + S + A. Let p be the output tensor, and assume S=(S_1, S_2, ..., S_n) and A=(A_1, A_2, ..., A_m). Then, p[s2_1, s2_2, ..., s2_n, s1_1, s1_2, ..., s1_n, a_1, a_2, ..., a_m] is the probability of going to the state s2 = (s2_1, s2_2, ..., s2_n) conditional on being at the state s1 = (s1_1, s1_2, ..., s1_n) and choosing the action a=(a_1, a_2, ..., a_m).
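Since the next-state indices come first in the output, each slice p[:, s1, a] should be a probability distribution over next states. The dependency-free sketch below builds such a tensor for S=(3,) and A=(2,) from arbitrary, hypothetical scores by applying a softmax over the next-state index, then checks the normalization. Nested lists stand in for tensors here.

```python
import math

n_states, n_actions = 3, 2


def softmax(values):
    """Numerically stable softmax of a list of floats."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical unnormalized scores, laid out as scores[s2][s1][a].
scores = [[[0.1 * (s2 + 1) * (s1 - a) for a in range(n_actions)]
           for s1 in range(n_states)]
          for s2 in range(n_states)]

# Normalize over the next-state index s2 for every (s1, a) pair.
p = [[[0.0] * n_actions for _ in range(n_states)] for _ in range(n_states)]
for s1 in range(n_states):
    for a in range(n_actions):
        column = softmax([scores[s2][s1][a] for s2 in range(n_states)])
        for s2 in range(n_states):
            p[s2][s1][a] = column[s2]

# Each p[:, s1, a] must sum to 1 over the next states.
column_sums = [sum(p[s2][s1][a] for s2 in range(n_states))
               for s1 in range(n_states) for a in range(n_actions)]
print(all(abs(total - 1.0) < 1e-12 for total in column_sums))  # True
```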

Custom Environment Example¶

In order to create a custom environment, you can define each one of the above-mentioned attributes and pass them to Environment. Let’s take a look at the environment Random Linear, which is a custom environment already implemented in the library.

We first define the states and actions. We want to have n states and n actions. Therefore, S=(n,) and A=(n,). Also, we use a uniform initial state distribution. To get a specific instance, we consider n=5.

import torch

# Define the state and action space shape
n = 5
S = (n,)
A = (n,)

# Initial state distribution
mu0 = torch.ones(n) / n

Now, we define the reward and transition functions. As the name of the environment suggests, we want the reward and transition probabilities to be random linear (more precisely, affine) functions of the mean-field. That is, given the mean-field \(L\), the reward and transition probabilities should be equal to \(M_1 \times L + M_2\) for some randomly generated matrices \(M_1, M_2\). We generate different pairs of matrices for the reward and transition functions.

Note that in order for transition probabilities to be well-defined, we apply a softmax function to the output of the affine function. Furthermore, we restrict all the entries of the randomly generated matrices to be in \([-m, m]\). With this constraint, it is fairly straightforward to see that the absolute value of rewards cannot be larger than \(2m\) implying that we should set r_max equal to \(2m\). To get an environment instance, we set m=1. Putting it all together,

from mfglib.env import Environment
import torch

n = 5
m = 1

torch.manual_seed(0)
softmax = torch.nn.Softmax(dim=0)  # normalize over the next-state (first) axis

r1 = 2 * m * torch.rand(n, n) - m  # M_1 for reward_fn
r2 = 2 * m * torch.rand(n, n) - m  # M_2 for reward_fn

p1 = 2 * m * torch.rand(n, n, n) - m  # M_1 for transition_fn
p2 = 2 * m * torch.rand(n, n, n) - m  # M_2 for transition_fn

user_defined_random_linear = Environment(
    T=4,
    S=(n,),
    A=(n,),
    mu0=torch.ones(n) / n,
    r_max=2 * m,
    reward_fn=lambda env, t, L_t: r1 @ L_t + r2,
    transition_fn=lambda env, t, L_t: softmax(p1 @ L_t + p2),
)
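The choice of r_max equal to \(2m\) can be checked with a short estimate. This is a sketch that assumes, as in the example above, that the reward is \(M_1 \times L + M_2\) with all entries of \(M_1, M_2\) in \([-m, m]\), and that the entries of the mean-field \(L\) are non-negative and sum to one. For any state \(i\) and action \(j\),

\[
\left| (M_1 L + M_2)_{ij} \right| \leq \sum_{k} \left| (M_1)_{ik} \right| L_{kj} + \left| (M_2)_{ij} \right| \leq m \sum_{k} L_{kj} + m \leq 2m.
\]

The last step uses \(\sum_{k} L_{kj} \leq 1\), which holds because column \(j\) of \(L\) contains only part of the total population mass.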

Refer to the MFGLib implementation of Random Linear for an alternative class-based implementation.