Reinforcement learning algorithm seems to be learning, but the script gets stuck and the agent doesn't reset

Problem description

I am currently working on a reinforcement learning algorithm that uses a Q-table and turtle graphics. The agent lives on a grid of 6 squares and its goal is to reach the rightmost square. I have built this environment and then run my algorithm so the agent can learn, but I am facing the following problems. The script eventually gets stuck, and it turns out I only ever seem to get through a single episode. The agent (the blue marker) flickers around the (0, 0) coordinate even though I have given it specific coordinates to move to. Finally, the agent leaves behind a trail of markers for every step it has taken. My logic seems fine, but I cannot work out what is causing these problems.

""" Basic Reinforcement Learning environment using Turtle Graphics """

#imported libraries required for this project
import turtle
import pandas as pd
import numpy as np
import time
#import numpy as np


""" Environment """

#initialise the screen using a turtle object
wn = turtle.Screen()
wn.bgcolor("black")
wn.title("Basic_Reinforcement_Learning_Environment")
#wn.bgpic("game_background.gif")

#this function initializes the 2D environment
def grid(size): 
    #this function creates one square
    def create_square(size,color="white"):
        greg.color(color)
        greg.pd()
        for i in range(4):
            greg.fd(size)
            greg.lt(90)
        greg.pu()
        greg.fd(size)
    #this function creates a row of squares by repeating the single-square helper
    def row(size,color="white"):
        for i in range(6):
            create_square(size)
        greg.hideturtle()

    row(size)       

greg = turtle.Turtle()
greg.speed(0)
greg.setposition(-150,0)
grid(50)


def player_set(S):
    player = turtle.Turtle()
    player.color("blue")
    player.shape("circle")
    player.penup()
    player.speed(0)
    player.setposition(S)
    player.setheading(90)

N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy (probability of acting greedily)
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 13   # maximum episodes
FRESH_TIME = 0.3    # fresh time for one move

#this function builds a Q-table and initializes all values to 0
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table

def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    # act non-greedily, or when all action values for this state are still zero
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()): 
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        # use idxmax instead of argmax, since argmax means a different function in newer pandas
        action_name = state_actions.idxmax()    
    return action_name


def get_env_feedback(S, A):
    # This is how the agent interacts with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R

def update_env(S, episode, step_counter):            
    coords = [(-125,25),(-75,25),(-25,25),(25,25),(75,25),(125,25)]

    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' %(episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r', end='')
    else:
        player_set(coords[S])
        time.sleep(FRESH_TIME)


def rl():
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S,A)
            q_predict = q_table.loc[S,A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max() 
            else:
                q_target = R
                is_terminated = True

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            S = S_
            update_env(S, episode, step_counter+1)
            step_counter += 1
        return q_table

rl()

Change: I updated the return statement and the algorithm now works, so it runs through all 13 episodes!!! However, I can't seem to get the player token (the agent) to stop leaving a trail of every step it has taken, and I would like it to reset after every episode. This may have something to do with scope:

Final solution:

""" Basic Reinforcement Learning environment using Turtle Graphics """

#imported libraries required for this project
import turtle
import pandas as pd
import numpy as np
import time
#import numpy as np


""" Environment """

#initialise the screen using a turtle object
wn = turtle.Screen()
wn.bgcolor("black")
wn.title("Basic_Reinforcement_Learning_Environment")
#wn.bgpic("game_background.gif")

#this function initializes the 2D environment
def grid(size): 
    #this function creates one square
    def create_square(size,color="white"):
        greg.color(color)
        greg.pd()
        for i in range(4):
            greg.fd(size)
            greg.lt(90)
        greg.pu()
        greg.fd(size)
    #this function creates a row of squares by repeating the single-square helper
    def row(size,color="white"):
        for i in range(6):
            create_square(size)
        greg.hideturtle()

    row(size)       

greg = turtle.Turtle()
greg.speed(0)
greg.setposition(-150,0)
grid(50)

player = turtle.Turtle()
player.color("blue")
player.shape("circle")
player.penup()
player.speed(0)
player.setheading(90)

def player_set(S):
    player.setposition(S)



N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy (probability of acting greedily)
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 13   # maximum episodes
FRESH_TIME = 0.3    # fresh time for one move

#this function builds a Q-table and initializes all values to 0
def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # action names
    )
    # print(table)    # show table
    return table

def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    # act non-greedily, or when all action values for this state are still zero
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()): 
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        # use idxmax instead of argmax, since argmax means a different function in newer pandas
        action_name = state_actions.idxmax()    
    return action_name


def get_env_feedback(S, A):
    # This is how the agent interacts with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R

def update_env(S, episode, step_counter):            
    coords = [(-125,25),(-75,25),(-25,25),(25,25),(75,25),(125,25)]

    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' %(episode+1, step_counter)
        print('\n{}'.format(interaction), end='')
        time.sleep(2)
        print('\r', end='')
    else:
        player_set(coords[S])
        time.sleep(FRESH_TIME)


def rl():
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S,A)
            q_predict = q_table.loc[S,A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max() 
            else:
                q_target = R
                is_terminated = True

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)
            S = S_
            update_env(S, episode, step_counter+1)
            step_counter += 1
    return q_table

rl()

Tags: python, algorithm, turtle-graphics, reinforcement-learning

Solution


In the following code snippet, copied from your question:

def rl():
    q_table = build_q_table(N_STATES, ACTIONS)
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:
            # ...
            # <snip> 
            # ...
        return q_table

Your rl() function has a return statement inside the for loop that iterates over episodes, placed right after the while loop that iterates over the time steps of a single episode. This means the function effectively completes only one episode and then already returns (i.e. the function terminates) before it ever gets the chance to start a second episode.
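
To make the difference concrete, here is a small, self-contained sketch (the function names one_episode_only and all_episodes are made up purely for illustration, and nothing in it depends on turtle or the Q-table):

def one_episode_only():
    for episode in range(13):
        print("episode", episode)
        return episode          # returns during the very first iteration

def all_episodes():
    for episode in range(13):
        print("episode", episode)
    return episode              # runs only after the whole loop has finished

one_episode_only()   # prints "episode 0" and stops
all_episodes()       # prints episodes 0 through 12

Moving the return in rl() out by one indentation level, exactly as in the "final solution" above, is therefore enough to let all MAX_EPISODES episodes run.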


Update regarding the question:

Change: I updated the return statement and the algorithm now works, so it runs through all 13 episodes!!! However, I can't seem to get the player token (the agent) to stop leaving a trail of every step it has taken, and I would like it to reset after every episode. This may have something to do with scope

I'm not 100% sure, since I'm not familiar with the turtle-graphics framework. However, I did notice that update_env() calls player_set(coords[S]) whenever the player's position needs to be updated. That function has the following implementation:

def player_set(S):
    player = turtle.Turtle()
    player.color("blue")
    player.shape("circle")
    player.penup()
    player.speed(0)
    player.setposition(S)
    player.setheading(90)

To me it looks like this creates a completely new player object at the new position every time the function is called, rather than updating the position of the player object that already exists. So whenever the state is updated, a brand-new object appears on screen while the old player object stays where it was, which is exactly the trail of markers you are seeing. A solution could consist of creating the player object only once, and then having a separate function that merely updates its position without creating a new object again.
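
A minimal sketch of that suggestion (this mirrors the "final solution" the question was updated with: the player turtle is created once at module level, and player_set only moves it):

import turtle

# create the player turtle exactly once, at module level
player = turtle.Turtle()
player.color("blue")
player.shape("circle")
player.penup()         # pen up, so moving the turtle does not draw a line
player.speed(0)
player.setheading(90)

def player_set(position):
    # only reposition the existing turtle; no new Turtle object is created,
    # so no trail of stale blue markers is left behind
    player.setposition(position)

Because rl() calls update_env() with S = 0 at the start of every episode, this single turtle is also moved back to the first square automatically, which gives you the per-episode reset you were asking about.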

