Environments
============

Built-In
--------

MFGLib comes with 10 pre-implemented environments which can be accessed by calling the corresponding classmethods
of ``Environment``. The pre-implemented environments are listed below:

.. automethod:: mfglib.env::Environment.beach_bar

.. automethod:: mfglib.env::Environment.building_evacuation

.. automethod:: mfglib.env::Environment.conservative_treasure_hunting

.. automethod:: mfglib.env::Environment.crowd_motion

.. automethod:: mfglib.env::Environment.equilibrium_price

.. automethod:: mfglib.env::Environment.left_right

.. automethod:: mfglib.env::Environment.linear_quadratic

.. automethod:: mfglib.env::Environment.random_linear

.. automethod:: mfglib.env::Environment.rock_paper_scissors

.. automethod:: mfglib.env::Environment.susceptible_infected

All implemented algorithms are parameterized so that you can control the size of the state space, action space, and time
horizon. In the following example, we create two distinct buildings, one with 10 floors each 20 by 20, and another with
100 floors each 50 by 5.

.. code-block:: python

   from mfglib.env import Environment
   
   env_1 = Environment.building_evacuation(n_floor=10, floor_l=20, floor_w=20)
   env_2 = Environment.building_evacuation(n_floor=100, floor_l=50, floor_w=5)


User-Defined
------------

Any environment defined in this library has the following attributes:

* ``T``: Sets the time horizon of the environment from 0 to ``T`` (inclusive, integer steps).
* ``S``: State space shape. For example, if the state space is all the integers from 1 to 100, then ``S=(100,)``, and if the state space is all the integer grid points :math:`(x, y)` such that :math:`1 \leq x,y \leq 100`, then ``S=(100, 100)``.
* ``A``: Action space shape.
* ``mu0``: Initial state distribution.
* ``r_max``: The supremum of the absolute value of rewards. This parameter is only used in **Mean-Field Occupation Measure Optimization** algorithm and does not necessarily need to be exact. Even a loose upper bound would be sufficient.
* ``reward_fn``: Defines the reward function.
* ``transition_fn``: Defines the tranistion probability function.

.. note::
    Notice that in the integer grid points case, we could flatten the state space and show it using a one dimensional
    vector of size 10,000. But keeping the state (and action) space multi-dimensional, whenever it is possible, is the
    convention used in this library. This convention results in easier to interpret policies, mean-fields, rewards, etc.

**Policy and Mean-Field Tensors.** Given ``T``, ``S``, and ``A``, the shape of policy and mean-field tensors will be
``(T+1,) + S + A``. For example, if ``T=10, S=(20, 20), A=(5,)``, the policy and mean-field tensors will be of size
``(11, 20, 20, 5)``. In general, let ``S=(S_1, S_2, ..., S_n)`` and ``A=(A_1, A_2, ..., A_m)``, and let ``pi`` and
``L`` be a policy and a mean-field tensor, respectively. Then, ``pi[t, s_1, s_2, ..., s_n, a_1, a_2, ..., a_m]`` is
the probability of choosing action ``a = (a_1, a_2, ..., a_m)`` conditional on being at the state
``s = (s_1, s_2, ..., s_n)`` at time ``t``, and ``L[t, s_1, s_2, ..., s_n, a_1, a_2, ..., a_m]`` is the portion of
players that are in state ``s = (s_1, s_2, ..., s_n)`` and choose action ``a = (a_1, a_2, ..., a_m)`` at time ``t``.

**Reward Function.** We define the reward function via the argument ``reward_fn``. The user is allowed to pass either
a function or a class implementing ``__call__``. The inputs of the reward function must be ``env`` (an environment
instance), ``t`` (a specific time less than or equal to the time horizon), and ``L_t`` (the mean-field tensor at time
``t``). The output will be a tensor of shape ``S + A``. Let ``r`` be the output tensor, and assume
``S=(S_1, S_2, ..., S_n)`` and ``A=(A_1, A_2, ..., A_m)``. Then, ``r[s_1, s_2, ..., s_n, a_1, a_2, ..., a_m]`` is the
reward that agent gets from choosing action ``a=(a_1, a_2, ..., a_m)`` conditional on being at state
``s = (s_1, s_2, ..., s_n)``.

**Transition Function.** We define the transition probability function via the argument ``transition_fn``. The user is
allowed to pass either a function or a class implementing ``__call__``. The inputs of the transition probability
function must be ``env`` (an environment instance), ``t`` (a specific time less than or equal to the time horizon),
and ``L_t`` (the mean-field tensor at time ``t``). The output will be a tensor of shape ``S + S + A``. Let ``p`` be the
output tensor, and assume ``S=(S_1, S_2, ..., S_n)`` and ``A=(A_1, A_2, ..., A_m)``. Then,
``p[s2_1, s2_2, ..., s2_n, s1_1, s1_2, ..., s1_n, a_1, a_2, ..., a_m]`` is the probability of going to the state
``s2 = (s2_1, s2_2, ..., s2_n)`` conditional on being at the state ``s1 = (s1_1, s1_2, ..., s1_n)`` and choosing the
action ``a=(a_1, a_2, ..., a_m)``.

Custom Environment Example
^^^^^^^^^^^^^^^^^^^^^^^^^^

In order to create a custom environment, you can define each one of the above-mentioned attributes and pass them to
``Environment``. Let's take a look at the environment **Random Linear**, which is a custom environment already
implemented in the library.

We first define the states and actions. We want to have ``n`` states and ``n`` actions. Therefore, ``S=(n,)`` and
``A=(n,)``. Also, we use a uniform initial state distribution. To get a specific instance, we consider ``n=5``.

.. code-block:: python

    import torch

    # Define the state and action space shape
    n = 5
    S = (n,)
    A = (n,)

    # Initial state distribution
    mu0 = torch.ones(n) / n

Now, we define the reward and transition functions. As the name of the environment suggests, we want the reward and
transition probabilities to be a random linear (affine indeed) function of the mean-field, that is given the mean
field :math:`L`, the reward and transition probabilities should be equal to :math:`M_1 \times L + M_2` for some
randomly generated matrices :math:`M_1, M_2`. We generate different pairs of matrices for reward and transition
functions.

Note that in order for transition probabilities to be well-defined, we apply a softmax function to the output of the
affine function. Furthermore, we restrict all the entries of the randomly generated matrices to be in :math:`[-m, m]`. 
With this constraint, it is fairly straightforward to see that the
absolute value of rewards cannot be larger than :math:`2m` implying that we should set ``r_max`` equal to :math:`2m`.
To get an environment instance, we set ``m=1``.  Putting it all together,

.. code-block:: python

    from mfg.env import Environment
    import torch

    n = 5
    m = 1

    torch.manual_seed(0)
    soft_max = torch.nn.Softmax(dim=-1)

    r1 = 2 * m * torch.rand(n, n) - m  # M_1 for reward_fn
    r2 = 2 * m * torch.rand(n, n) - m  # M_2 for reward_fn

    p1 = 2 * m * torch.rand(n, n, n) - m  # M_1 for transition_fn
    p2 = 2 * m * torch.rand(n, n, n) - m  # M_2 for transition_fn

    user_defined_random_linear = Environment(
        T=4,
        S=(n,),
        A=(n,),
        mu0=torch.ones(n) / n,
        r_max=2 * m,
        reward_fn=lambda env, t, L_t: r1 @ L_t + r2,
        transition_fn=lambda env, t, L_t: softmax(p1 @ L_t + p2),
    )

Refer to the MFGLib implementation of **Random Linear** for an alternative class-based implementation.