Automatic PACMAN Based on Markov Decision Process

A Markov Decision Process (MDP) is an optimal decision process for a stochastic dynamic system, based on Markov theory. An MDP has the Markov property (no after-effect: the next state of the system depends only on the current state and not on any earlier state), but unlike a plain Markov chain it also takes actions into account, so the next state depends both on the current state and on the action currently taken.
An MDP can be described by a five-element tuple (S, A, P, R, γ): the set of states S, the set of actions A, the transition probabilities P, the reward function R, and the discount factor γ.
2. Description
This coursework asks us to create an MDP solver that works in the Pacman environment. Pacman knows the locations of all objects in the world (walls, food, capsules, ghosts).
2.1 Flow chart

2.2 Value Map Initialization
At every step, before choosing an action, we initialize the value map: from the Pacman state we get the locations of all objects and assign a reward to each state (i.e. each location).
For example, after initialization each reachable square holds a reward: positive values for food and capsules, a strongly negative value for ghosts, and a neutral value for empty squares.
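A minimal sketch of such an initialization, assuming api helpers like api.walls(state), api.food(state), api.capsules(state) and api.ghosts(state) in addition to the api.corners(state) call used in Section 2.3; the reward values here are illustrative placeholders rather than the ones actually used:

    import api  # coursework helper module; only api.corners is confirmed in this report

    def initialise_value_map(state):
        # Grid size from the corner coordinates.
        corners = api.corners(state)
        max_x = max(x for (x, y) in corners)
        max_y = max(y for (x, y) in corners)

        walls = set(api.walls(state))            # assumed helper
        values = {}
        for x in range(max_x + 1):
            for y in range(max_y + 1):
                if (x, y) not in walls:
                    values[(x, y)] = 0.0         # empty square: neutral reward

        for pos in api.food(state):              # assumed helper
            values[pos] = 10.0                   # food: positive reward (placeholder)
        for pos in api.capsules(state):          # assumed helper
            values[pos] = 20.0                   # capsule: larger reward (placeholder)
        for (gx, gy) in api.ghosts(state):       # assumed helper
            values[(int(gx), int(gy))] = -100.0  # ghost: large penalty (placeholder)
        return values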

2.3 Iteration with Bellman Equation
i. Judge the layout type
Through the api function api.corners(state) we can get the maximum row and column indices of the layout. From these we can tell whether the layout is smallGrid or mediumClassic, because smallGrid is a much smaller layout than mediumClassic.
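A sketch of this check, reusing api.corners(state); the size threshold is an illustrative assumption:

    def is_small_grid(state):
        # The corners give the maximum row and column indices of the layout.
        corners = api.corners(state)
        max_x = max(x for (x, y) in corners)
        max_y = max(y for (x, y) in corners)
        # smallGrid is much smaller than mediumClassic, so a simple size
        # threshold (illustrative value) is enough to tell them apart.
        return max_x < 10 and max_y < 10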
ii. Add negative values near the ghost positions
To make Pacman more afraid of the ghosts, I used a for loop to set a negative value at the positions near each ghost.
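A sketch of that loop, continuing from the value map above; the penalty value and the one-square radius are assumptions, since the exact numbers are not given here:

    def penalise_near_ghosts(state, values, penalty=-50.0):
        # penalty is an illustrative placeholder.
        for (gx, gy) in api.ghosts(state):
            gx, gy = int(gx), int(gy)
            for dx in (-1, 0, 1):
                for dy in (-1, 0, 1):
                    pos = (gx + dx, gy + dy)
                    if pos in values:            # skip walls and off-grid squares
                        values[pos] = min(values[pos], penalty)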

iii. Positions that need value updates
To improve the running speed of the code, I only use the Bellman equation to update values at positions near the ghosts, i.e. inside a square window centred on each ghost.
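A sketch of how that restricted set of positions could be collected; the window radius is an illustrative placeholder:

    def positions_to_update(state, values, radius=3):
        # radius is an illustrative placeholder for the window size.
        targets = set()
        for (gx, gy) in api.ghosts(state):
            gx, gy = int(gx), int(gy)
            for dx in range(-radius, radius + 1):
                for dy in range(-radius, radius + 1):
                    pos = (gx + dx, gy + dy)
                    if pos in values:            # only real, non-wall squares
                        targets.add(pos)
        return targets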

iv. Iteration by Bellman equation
To find the best policy for the given set of rewards from Section 2.2, the utility of each state can be computed with the Bellman equation:

U(s) = R(s) + γ · max_a Σ_{s'} P(s' | s, a) · U(s')
In my program this equation is applied as a value-iteration update over the value map.
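A minimal sketch of such an update; the discount factor and iteration count are illustrative assumptions, and expected_utility is the helper sketched at the end of part v below:

    def value_iteration(state, rewards, values, iterations=20, gamma=0.9):
        # gamma and the iteration count are illustrative placeholders.
        directions = ['North', 'South', 'East', 'West']
        for _ in range(iterations):
            updated = dict(values)
            for pos in positions_to_update(state, values):
                # Bellman update: reward plus discounted best expected utility.
                updated[pos] = rewards[pos] + gamma * max(
                    expected_utility(pos, d, values) for d in directions)
            values.update(updated)
        return values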
v. The utilities calculation
Following the Bellman equation, we take the maximum expected utility over all moves as the value of a coordinate, which means we have to find the optimal action for each position. Assume the Pacman is an AI (Artificial Idiot) rather than an AI (Artificial Intelligence): 70% of the time the agent moves as intended, and 10% of the time it moves in each of the other three directions. There are therefore four candidate actions whose expected utilities must be compared at each state.
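A sketch of the expected-utility calculation under this 70% / 10% transition model; the direction-to-offset mapping and the wall handling (bumping into a wall leaves Pacman in place) are assumptions:

    MOVES = {'North': (0, 1), 'South': (0, -1), 'East': (1, 0), 'West': (-1, 0)}

    def expected_utility(pos, intended, values):
        # 70% chance of moving as intended, 10% for each of the other three
        # directions; moving into a wall leaves Pacman where it is (assumed).
        total = 0.0
        for direction, (dx, dy) in MOVES.items():
            prob = 0.7 if direction == intended else 0.1
            nxt = (pos[0] + dx, pos[1] + dy)
            if nxt not in values:        # wall or off-grid square
                nxt = pos
            total += prob * values[nxt]
        return total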
2.4 Best Policy and The Decision
After updating the value map by iteration, the last thing we have to do is determine the policy at Pacman's current position. Using the same equation as in the last section, we find the action with the largest expected utility; the action that leads to the largest utility is the optimal policy for Pacman's current state, and it becomes Pacman's movement decision.
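A sketch of this final decision step, tying the pieces above together; api.legalActions, api.whereAmI and api.makeMove are assumed helpers from the coursework api module, and the Agent base class is assumed to come from the framework's game module:

    from game import Agent   # assumed framework import

    class MDPAgent(Agent):
        def getAction(self, state):
            # Build and solve the value map as described above, then choose the
            # direction with the highest expected utility at Pacman's position.
            values = initialise_value_map(state)
            penalise_near_ghosts(state, values)
            rewards = dict(values)
            value_iteration(state, rewards, values)

            legal = [d for d in api.legalActions(state) if d in MOVES]  # assumed helper
            pacman = api.whereAmI(state)                                # assumed helper
            best = max(legal, key=lambda d: expected_utility(pacman, d, values))
            return api.makeMove(best, legal)                            # assumed helper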
3. Results
3.1 Evaluation
I ran the MDPAgent 100 times on mediumClassic, where Pacman won 57 times with an average score of 734.36, and 100 times on smallGrid, where it won 75 times with an average score of 247.62.