# Artificial Intelligence: A Q-Learning Example in Python

Q-learning is a basic form of reinforcement learning that uses Q-values (also called action-values) to iteratively improve the behavior of the learning agent.

Q-values (action-values): Q-values are defined for states and actions. Q(S, A) is an estimate of how good it is to take action A at state S. This estimate of Q(S, A) is computed iteratively using the TD-update rule described below.

TD-update rule:

Q(S, A) ← Q(S, A) + α · (R + γ · Q(S′, A′) − Q(S, A))

- S: the current state of the agent.
- A: the current action, picked according to some policy.
- S′: the next state the agent ends up in.
- A′: the next best action, chosen using the current Q-value estimate, i.e. the action with the maximum Q-value in the next state.
- R: the current reward observed from the environment in response to the current action.
- γ (> 0 and ≤ 1): the discounting factor for future rewards. Future rewards are less valuable than current rewards, so they must be discounted. Since a Q-value is an estimate of the expected reward from a state, the discounting rule applies here as well.
- α: the step size (learning rate) used to update the estimate of Q(S, A).
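As a concrete illustration, one application of the TD-update rule can be traced with made-up numbers (the values of R, γ, α, and both Q-estimates below are hypothetical, chosen only to make the arithmetic easy to follow):

```python
# One hypothetical TD update for a single (S, A) pair.
alpha = 0.6          # step size
gamma = 1.0          # discounting factor for future rewards
reward = -1.0        # R, observed after taking A in S
q_sa = 0.0           # current estimate of Q(S, A)
q_next_best = -3.0   # Q(S', A') = max over a of Q(S', a)

# TD target: R + gamma * Q(S', A')
td_target = reward + gamma * q_next_best   # -4.0
# TD error: how far the current estimate is from the target
td_delta = td_target - q_sa                # -4.0
# Move the estimate a fraction alpha toward the target
q_sa += alpha * td_delta
# q_sa is now approximately -2.4
```

Repeating this update over many transitions drives Q(S, A) toward the expected return of taking A in S.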

ε-greedy policy:

The ε-greedy policy is a very simple policy for choosing actions using the current Q-value estimates. It goes as follows:

- With probability (1 − ε), choose the action that has the highest Q-value.
- With probability ε, choose any action at random.
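These two cases combine into a single probability distribution over actions. A minimal sketch, assuming ε = 0.1 and four actions with made-up Q-values:

```python
import numpy as np

epsilon = 0.1
q_values = np.array([0.0, 2.0, 1.0, -1.0])  # hypothetical Q(S, a) for each action
num_actions = len(q_values)

# Every action receives epsilon / num_actions exploration mass...
probs = np.ones(num_actions) * epsilon / num_actions
# ...and the greedy action receives the remaining (1 - epsilon) mass.
probs[np.argmax(q_values)] += 1.0 - epsilon
# probs is approximately [0.025, 0.925, 0.025, 0.025] and sums to 1
```

Sampling an action from `probs` then realizes exactly the two cases above: the greedy action most of the time, a uniformly random action with probability ε.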

First, install the `gym` library:

```shell
pip install gym
```

```python
import gym
import itertools
import matplotlib
import matplotlib.style
import numpy as np
import pandas as pd
import sys

from collections import defaultdict
from windy_gridworld import WindyGridworldEnv
import plotting

matplotlib.style.use('ggplot')
```

Create the gym environment:

```python
env = WindyGridworldEnv()
```

Build the ε-greedy policy:

```python
def createEpsilonGreedyPolicy(Q, epsilon, num_actions):
    """
    Creates an epsilon-greedy policy based
    on a given Q-function and epsilon.

    Returns a function that takes the state
    as an input and returns the probabilities
    for each action in the form of a numpy array
    of length of the action space (set of possible actions).
    """
    def policyFunction(state):
        # Spread the exploration probability epsilon
        # uniformly over all actions
        action_probabilities = np.ones(num_actions, dtype=float) * epsilon / num_actions

        # Give the greedy action the remaining probability mass
        best_action = np.argmax(Q[state])
        action_probabilities[best_action] += (1.0 - epsilon)
        return action_probabilities

    return policyFunction
```

Build the Q-learning model:

```python
def qLearning(env, num_episodes, discount_factor=1.0, alpha=0.6, epsilon=0.1):
    """
    Q-Learning algorithm: Off-policy TD control.
    Finds the optimal greedy policy while
    following an epsilon-greedy policy.
    """
    # Action value function
    # A nested dictionary that maps
    # state -> (action -> action-value).
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    # Keeps track of useful statistics
    stats = plotting.EpisodeStats(
        episode_lengths=np.zeros(num_episodes),
        episode_rewards=np.zeros(num_episodes))

    # Create an epsilon-greedy policy function
    # appropriate for the environment's action space
    policy = createEpsilonGreedyPolicy(Q, epsilon, env.action_space.n)

    # For every episode
    for ith_episode in range(num_episodes):

        # Reset the environment to get the first state
        state = env.reset()

        for t in itertools.count():

            # Get probabilities of all actions from current state
            action_probabilities = policy(state)

            # Choose action according to
            # the probability distribution
            action = np.random.choice(
                np.arange(len(action_probabilities)),
                p=action_probabilities)

            # Take action and get reward, transit to next state
            next_state, reward, done, _ = env.step(action)

            # Update statistics
            stats.episode_rewards[ith_episode] += reward
            stats.episode_lengths[ith_episode] = t

            # TD Update
            best_next_action = np.argmax(Q[next_state])
            td_target = reward + discount_factor * Q[next_state][best_next_action]
            td_delta = td_target - Q[state][action]
            Q[state][action] += alpha * td_delta

            # done is True if the episode terminated
            if done:
                break

            state = next_state

    return Q, stats
```

Train the model:

```python
Q, stats = qLearning(env, 1000)
```
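Once training finishes, the learned Q-table can be turned into a greedy policy by picking the highest-valued action in each state. A minimal sketch, using a tiny stand-in Q-table (the state key and Q-values below are made up so the snippet runs on its own):

```python
import numpy as np
from collections import defaultdict

# Stand-in for a learned Q-table: state -> array of action-values.
Q = defaultdict(lambda: np.zeros(4))
Q[(0, 0)] = np.array([-3.0, -2.0, -4.0, -5.0])  # hypothetical values

def greedy_action(Q, state):
    # After training, the optimal policy simply picks the action
    # with the highest Q-value in each state (no exploration).
    return int(np.argmax(Q[state]))

best = greedy_action(Q, (0, 0))  # action 1, since -2.0 is the maximum
```

This is the "off-policy" aspect of Q-learning: the agent explores with an ε-greedy policy while learning values for the purely greedy policy.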

Plot the training statistics:

```python
plotting.plot_episode_stats(stats)
```
