Due Date:
March 26, 2018 23:59 hours
Provide a short answer to each of the questions in Parts I, II, and III.
Part I: Policy Interpretation
1. Run this simulation, press Enter to see the resulting policy, and describe what is happening. Run each of these simulations more than once and check whether you get a consistent result. What does the final policy mean in plain English?
python gridworld.py -a q --livingReward .5 --episodes 5 -s 100
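Hint: the -a q agent updates a table of Q-values after every transition. The following is a minimal sketch of the standard tabular Q-learning update (illustrative names only, not the actual gridworld.py code), which may help you interpret what the displayed values are converging toward:

# Minimal tabular Q-learning update (illustrative sketch, not gridworld.py internals).
# Q maps (state, action) pairs to values; alpha is the learning rate, gamma the
# discount, and (s, a, r, s_next) is one observed transition.
def q_update(Q, s, a, r, s_next, actions, alpha=0.5, gamma=0.9):
    best_next = max((Q.get((s_next, a2), 0.0) for a2 in actions), default=0.0)
    sample = r + gamma * best_next                  # one-step target
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample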
2. Run this simulation and describe what the final policy means:
python gridworld.py -g CliffGrid -a q --livingReward -.5 --episodes 50 -s 100
3. Run this simulation and describe what the final policy means. Look at the Q-values as well as the policy:
python gridworld.py -g CliffGrid -a q --discount .1 --episodes 10 -s 100
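Hint: with a discount this small, a reward that is k steps away is scaled by 0.1**k before it reaches the current state's Q-value, so distant rewards are nearly invisible. A quick check (illustrative only):

# With gamma = 0.1, the present value of a distant reward collapses quickly.
gamma = 0.1
for k in range(1, 6):
    print(f"reward {k} steps away is scaled by {gamma ** k:g}")   # 0.1 ... 1e-05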
4. Run this simulation and describe what the final policy means:
python gridworld.py -g MazeGrid -a q --discount .9 --episodes 30 -s 100
5. Run this simulation and compare #4 and #5. Which technique seems to be more efficient: the discount or the living penalty? Why?
python gridworld.py -g MazeGrid -a q --livingReward -.1 --episodes 30 -s 100
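Hint for the comparison: the two knobs shape the return differently. The discount multiplies each future reward by gamma per step, while the living reward is simply added to the reward of every non-terminal step. A rough sketch under that standard reading (illustrative code, not gridworld.py internals):

# Illustrative comparison of how each knob enters the return of one trajectory.
def discounted_return(rewards, gamma):
    # r_0 + gamma*r_1 + gamma^2*r_2 + ...  (future rewards shrink geometrically)
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def return_with_living_reward(rewards, living_reward):
    # every step's reward is shifted by the living reward (a penalty if negative)
    return sum(r + living_reward for r in rewards)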
6. Noise refers to the probability that an action has an unintended outcome (the agent moves in a direction other than the one it chose). Run the following simulation, which results in a non-optimal policy. Reducing the noise to zero returns the optimal policy. What is the maximum amount of noise that still results in the optimal policy?
python gridworld.py -a value -i 100 -g BridgeGrid --discount 0.9 --noise 0.2
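Hint: with --noise 0.2 the agent moves in its intended direction 80% of the time. Assuming (as in the usual gridworld transition model) that the remaining probability is split evenly between the two perpendicular directions, the expected value of an action looks like this sketch:

# Expected value of one action under noise, assuming the unintended probability
# mass is split evenly between the two perpendicular outcomes (an assumption
# for illustration, not a statement about gridworld.py internals).
def expected_action_value(v_intended, v_left, v_right, noise=0.2):
    return (1 - noise) * v_intended + (noise / 2) * (v_left + v_right)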
7. If you change the discount rate in the previous question to 1.0, does this increase or decrease noise tolerance? Why? What is the new maximum noise that will still result in the optimal policy?
8. If you use only a living penalty, does this increase or decrease noise tolerance, and why? Feel free to experiment with larger living-reward penalties (more negative values of --livingReward):
python gridworld.py -a value -i 100 -g BridgeGrid --livingReward -.1 --noise 0.2