Environments
Pre-implemented
MFGLib comes with 10 pre-implemented environments which can be accessed by calling the corresponding classmethods
of Environment. The pre-implemented environments are listed below:
Beach Bar (Perrin et al.[1]) – Environment.beach_bar()
Building Evacuation (Perolat et al.[2]) – Environment.building_evacuation()
Conservative Treasure Hunting (Guo et al.[3]) – Environment.conservative_treasure_hunting()
Crowd Motion (Perolat et al.[2]) – Environment.crowd_motion()
Equilibrium Price (Guo et al.[3]) – Environment.equilibrium_price()
Left Right (Cui and Koeppl[4]) – Environment.left_right()
Linear Quadratic (Perrin et al.[1]) – Environment.linear_quadratic()
Random Linear – Environment.random_linear()
Rock Paper Scissors (Cui and Koeppl[4]) – Environment.rock_paper_scissors()
Susceptible Infected (Cui and Koeppl[4]) – Environment.susceptible_infected()
Each pre-implemented environment takes initialization parameters that change the resulting instance (state and action space, underlying rewards and transition probabilities, etc.). Take the Building Evacuation environment as an example: we can create distinct buildings (distinct environment instances) by changing the number of floors, the size of each floor, and so on. Below, we create two distinct buildings, one with 10 floors of 20 by 20 each, and another with 100 floors of 50 by 5 each.
from mfglib.env import Environment
building_evacuation_1 = Environment.building_evacuation(n_floor=10, floor_l=20, floor_w=20)
building_evacuation_2 = Environment.building_evacuation(n_floor=100, floor_l=50, floor_w=5)
User-defined
Any environment defined in this library has the following attributes:
T: Time horizon of the environment; time runs from 0 to T (inclusive, integer steps).
S: State space shape. For example, if the state space is all the integers from 1 to 100, then S=(100,), and if the state space is all the integer grid points \((x, y)\) such that \(1 \leq x, y \leq 100\), then S=(100, 100).
A: Action space shape.
mu0: Initial state distribution.
r_max: The supremum of the absolute value of rewards. This parameter is only used in the Mean-Field Occupation Measure Optimization algorithm and does not need to be exact; a loose upper bound is sufficient.
reward_fn: Defines the reward function.
transition_fn: Defines the transition probability function.
Note
Notice that in the integer grid points case, we could flatten the state space and represent it as a one-dimensional vector of size 10,000. However, the convention used in this library is to keep the state (and action) space multi-dimensional whenever possible. This convention makes policies, mean-fields, rewards, etc. easier to interpret.
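To make the trade-off concrete, here is a minimal sketch of the index bookkeeping that a flattened state space would require for the 100 by 100 grid. The helper names `flat_to_grid` and `grid_to_flat` are hypothetical and not part of MFGLib, and indices are 0-based, as in Python tensors.

```python
# Hypothetical illustration: the 100 x 100 grid from the example above,
# with 0-based indices.
GRID = (100, 100)


def flat_to_grid(k: int) -> tuple:
    """Map a flat index in [0, 10000) to its (x, y) grid index."""
    return divmod(k, GRID[1])


def grid_to_flat(x: int, y: int) -> int:
    """Map an (x, y) grid index back to the flat index."""
    return x * GRID[1] + y


k = grid_to_flat(12, 34)
print(k, flat_to_grid(k))
```

With S=(100, 100), none of this conversion is needed: a tensor entry is addressed directly by its grid coordinates.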
Policy and Mean-Field Tensors. Given T, S, and A, the shape of policy and mean-field tensors will be
(T+1,) + S + A. For example, if T=10, S=(20, 20), A=(5,), the policy and mean-field tensors will be of size
(11, 20, 20, 5). In general, let S=(S_1, S_2, ..., S_n) and A=(A_1, A_2, ..., A_m), and let pi and
L be a policy and a mean-field tensor, respectively. Then, pi[t, s_1, s_2, ..., s_n, a_1, a_2, ..., a_m] is
the probability of choosing action a = (a_1, a_2, ..., a_m) conditional on being at the state
s = (s_1, s_2, ..., s_n) at time t, and L[t, s_1, s_2, ..., s_n, a_1, a_2, ..., a_m] is the fraction of
players that are in state s = (s_1, s_2, ..., s_n) and choose action a = (a_1, a_2, ..., a_m) at time t.
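The shape convention above can be sketched with plain PyTorch, without creating an environment. The uniform policy and the uniform initial distribution below are assumptions chosen for illustration. The relation L_0[s, a] = mu0[s] * pi[0, s, a] follows from the definitions above: the fraction of players in state s choosing action a at time 0.

```python
import math

import torch

T = 10
S = (20, 20)
A = (5,)

# Policy and mean-field tensors share the shape (T+1,) + S + A.
shape = (T + 1,) + S + A

# Uniform policy: every action is equally likely in every state at every time.
pi = torch.ones(shape) / math.prod(A)

# Assumed uniform initial state distribution over the 20 x 20 grid.
mu0 = torch.ones(S) / math.prod(S)

# Mean-field at time 0: L_0[s, a] = mu0[s] * pi[0, s, a].
# mu0 is reshaped to S + (1,) * len(A) so that it broadcasts over the action axes.
L_0 = mu0.reshape(S + (1,) * len(A)) * pi[0]

# A policy sums to 1 over the action axes; a mean-field slice sums to 1 overall.
action_dims = tuple(range(-len(A), 0))
print(tuple(pi.shape), tuple(L_0.shape))
```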
Reward Function. We define the reward function via the argument reward_fn. The user is allowed to pass either
a function or a class implementing __call__. The inputs of the reward function must be env (an environment
instance), t (a specific time less than or equal to the time horizon), and L_t (the mean-field tensor at time
t). The output will be a tensor of shape S + A. Let r be the output tensor, and assume
S=(S_1, S_2, ..., S_n) and A=(A_1, A_2, ..., A_m). Then, r[s_1, s_2, ..., s_n, a_1, a_2, ..., a_m] is the
reward that an agent receives for choosing action a = (a_1, a_2, ..., a_m) conditional on being at state
s = (s_1, s_2, ..., s_n).
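As a minimal sketch of the two accepted forms, the following defines a hypothetical crowd-averse reward first as a plain function and then as a class implementing __call__. The names `crowd_averse_reward` and `ScaledReward` and the reward itself are illustrative assumptions, not part of MFGLib; only the (env, t, L_t) signature and the S + A output shape come from the description above.

```python
import torch

S = (4,)
A = (3,)


def crowd_averse_reward(env, t, L_t):
    """Hypothetical reward: penalize being in a crowded state.

    `env` and `t` are accepted to match the required signature but are unused.
    """
    # Fraction of players in each state: sum out the action axes.
    state_mass = L_t.sum(dim=tuple(range(len(S), L_t.dim())))
    # Broadcast the per-state penalty across all actions, giving shape S + A.
    return -state_mass.reshape(S + (1,) * len(A)).expand(S + A).clone()


class ScaledReward:
    """The same reward as a class implementing __call__, so it can hold parameters."""

    def __init__(self, scale: float) -> None:
        self.scale = scale

    def __call__(self, env, t, L_t):
        return self.scale * crowd_averse_reward(env, t, L_t)


# Uniform mean-field over (state, action) pairs.
L_t = torch.ones(S + A) / (4 * 3)
# env=None is passed only because this particular reward ignores env.
r = crowd_averse_reward(None, 0, L_t)
r_scaled = ScaledReward(2.0)(None, 0, L_t)
print(tuple(r.shape))
```

The class form is convenient when the reward depends on parameters or pre-generated tensors that should be stored once rather than captured from the enclosing scope.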
Transition Function. We define the transition probability function via the argument transition_fn. The user is
allowed to pass either a function or a class implementing __call__. The inputs of the transition probability
function must be env (an environment instance), t (a specific time less than or equal to the time horizon),
and L_t (the mean-field tensor at time t). The output will be a tensor of shape S + S + A. Let p be the
output tensor, and assume S=(S_1, S_2, ..., S_n) and A=(A_1, A_2, ..., A_m). Then,
p[s2_1, s2_2, ..., s2_n, s1_1, s1_2, ..., s1_n, a_1, a_2, ..., a_m] is the probability of going to the state
s2 = (s2_1, s2_2, ..., s2_n) conditional on being at the state s1 = (s1_1, s1_2, ..., s1_n) and choosing the
action a=(a_1, a_2, ..., a_m).
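The index order (next state first, then current state, then action) is easy to get wrong, so a small sketch may help. The dynamics below, in which choosing action a moves the player to state a, and the helper `is_valid_transition` are hypothetical and not part of MFGLib. The check relies only on the convention stated above: for every fixed (s1, a), the probabilities over s2 sum to 1.

```python
import torch

n = 4
S = (n,)
A = (n,)


def jump_transition(env, t, L_t):
    """Hypothetical dynamics: action a moves the player to state a.

    Independent of the mean-field; env, t, and L_t only match the signature.
    """
    p = torch.zeros(S + S + A)  # indexed as p[s2, s1, a]
    for a in range(n):
        p[a, :, a] = 1.0  # the next state s2 equals the chosen action
    return p


def is_valid_transition(p, n_state_dims):
    """Check that p sums to 1 over the next-state axes for every (s1, a)."""
    totals = p.sum(dim=tuple(range(n_state_dims)))
    return bool(torch.allclose(totals, torch.ones_like(totals)))


p = jump_transition(None, 0, torch.ones(S + A) / n**2)
print(tuple(p.shape), is_valid_transition(p, len(S)))
```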
Custom Environment Example
In order to create a custom environment, you can define each one of the above-mentioned attributes and pass them to
Environment. Let’s take a look at the environment Random Linear, which is a custom environment already
implemented in the library.
We first define the states and actions. We want to have n states and n actions. Therefore, S=(n,) and
A=(n,). Also, we use a uniform initial state distribution. To get a specific instance, we consider n=5.
import torch
# Define the state and action space shape
n = 5
S = (n,)
A = (n,)
# Initial state distribution
mu0 = torch.ones(n) / n
Now, we define the reward and transition functions. As the name of the environment suggests, we want the reward and transition probabilities to be a random linear (more precisely, affine) function of the mean-field. That is, given the mean-field \(L\), the reward and transition probabilities should be equal to \(M_1 \times L + M_2\) for some randomly generated matrices \(M_1, M_2\). We generate different pairs of matrices for the reward and transition functions.
Note that in order for transition probabilities to be well-defined, we apply a softmax function to the output of the
affine function. Furthermore, we restrict all the entries of the randomly generated matrices to be in \([-m, m]\).
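With this constraint, and because the entries of the mean-field are nonnegative and sum to one, each entry of \(M_1 \times L\) has absolute value at most \(m\), as does each entry of \(M_2\). Hence the
absolute value of rewards cannot be larger than \(2m\), so we set r_max equal to \(2m\).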
To get an environment instance, we set m=1. Putting it all together,
from mfglib.env import Environment
import torch
n = 5
m = 1
torch.manual_seed(0)
# Normalize over the next-state axis (dim 0), so that p[:, s1, a] sums to 1
soft_max = torch.nn.Softmax(dim=0)
r1 = 2 * m * torch.rand(n, n) - m # M_1 for reward_fn
r2 = 2 * m * torch.rand(n, n) - m # M_2 for reward_fn
p1 = 2 * m * torch.rand(n, n, n) - m # M_1 for transition_fn
p2 = 2 * m * torch.rand(n, n, n) - m # M_2 for transition_fn
user_defined_random_linear = Environment(
    T=4,
    S=(n,),
    A=(n,),
    mu0=torch.ones(n) / n,
    r_max=2 * m,
    reward_fn=lambda env, t, L_t: r1 @ L_t + r2,
    transition_fn=lambda env, t, L_t: soft_max(p1 @ L_t + p2),
)
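As a quick sanity check on the choice r_max = 2m, the sketch below regenerates reward matrices in the same way and evaluates the affine reward on randomly sampled mean-fields. The sampling scheme (normalizing a uniform random tensor) is an assumption made for illustration; any nonnegative tensor of shape S + A that sums to one would do. The check uses only PyTorch and does not need an Environment instance.

```python
import torch

n, m = 5, 1
torch.manual_seed(0)

r1 = 2 * m * torch.rand(n, n) - m  # entries in [-m, m)
r2 = 2 * m * torch.rand(n, n) - m  # entries in [-m, m)

worst = 0.0
for _ in range(1000):
    # Random mean-field: nonnegative entries of shape S + A that sum to 1.
    L_t = torch.rand(n, n)
    L_t = L_t / L_t.sum()
    reward = r1 @ L_t + r2
    worst = max(worst, reward.abs().max().item())

# The largest observed |reward| should never exceed the bound 2 * m.
print(worst <= 2 * m)
```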
Refer to the MFGLib implementation of Random Linear for an alternative class-based implementation.