In the last post on policy iteration, we proved that, starting from an initial policy, the iteration process “evaluation -> greedy improvement -> evaluation -> greedy improvement …” is guaranteed to reach an optimal policy. We noted that at each evaluation step, we have to iterate many times to get...