6-3. Q-Network Implementation (Cart Pole)


건조젤리 2019. 11. 21. 19:07

These notes summarize the lectures by Professor Sung Kim (김성훈).

Source: http://hunkim.github.io/ml/ (모두를 위한 머신러닝/딥러닝 강의, "Machine Learning/Deep Learning for Everyone")

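Cart Pole is a game in which you keep a pole balanced by moving its base left and right.

The state consists of four values (cart position, cart velocity, pole angle, and pole angular velocity).

As long as the pole has not fallen, the reward is always 1.

The episode ends when the pole falls.

When the pole falls, we set the reward to -100 instead.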

Unlike Frozen Lake, the input is not one-hot encoded! (The state is fed in as-is.)
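
To make the difference concrete, here is a minimal sketch in NumPy. It assumes a 16-state Frozen Lake, and the helper names one_hot and as_network_input are made up for illustration (they are not from the lecture code).

```python
import numpy as np

def one_hot(state_index, num_states):
    """Frozen Lake style: a discrete state index becomes a one-hot row vector."""
    return np.identity(num_states)[state_index:state_index + 1]

def as_network_input(observation):
    """Cart Pole style: the continuous observation is only reshaped to [1, n]."""
    return np.reshape(observation, [1, len(observation)])

# A discrete state (index 3 of 16) versus a 4-value continuous observation
frozen_lake_input = one_hot(3, 16)
cart_pole_input = as_network_input([0.02, -0.01, 0.03, 0.04])

print(frozen_lake_input.shape)  # (1, 16)
print(cart_pole_input.shape)    # (1, 4)
```

The Cart Pole observation is already a small vector of real numbers, so there is nothing to encode; it only needs the batch dimension.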

Feeding in the 4 state values produces a Q-value for each of the 2 actions, and one of those actions is chosen.
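
As a rough sketch of that mapping, the single-layer network is just a 4x2 matrix multiplication followed by an argmax. The weights below are random stand-ins rather than trained values, and q_values and greedy_action are hypothetical helper names.

```python
import numpy as np

input_size, output_size = 4, 2  # Cart Pole: 4 state values, 2 actions

# Stand-in weights for illustration only (a trained network would learn these)
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(input_size, output_size))

def q_values(state, weights):
    """Return one Q-value per action as a [1, 2] array."""
    x = np.reshape(state, [1, input_size])
    return x @ weights

def greedy_action(state, weights):
    """Pick the action with the largest predicted Q-value."""
    return int(np.argmax(q_values(state, weights)))

state = [0.02, -0.01, 0.03, 0.04]
print(q_values(state, W1).shape)  # (1, 2)
action = greedy_action(state, W1)
```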

Let's try xavier_initializer() when creating the variables!
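
The idea behind Xavier (Glorot) initialization can be sketched in plain NumPy. This assumes the uniform variant, where weights are drawn from [-limit, limit) with limit = sqrt(6 / (fan_in + fan_out)); xavier_uniform is a hypothetical helper written for this note, not the TensorFlow function itself.

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng):
    """Draw a [fan_in, fan_out] weight matrix with Xavier/Glorot uniform scaling."""
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W1 = xavier_uniform(4, 2, rng)  # for 4 inputs and 2 outputs, limit = sqrt(6 / 6) = 1

print(W1.shape)  # (4, 2)
```

The point of scaling by the layer sizes is to keep the initial outputs from being too large or too small regardless of how wide the layer is.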

The cost is the squared error, and the optimizer is Adam.
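
Here is a small NumPy sketch of how the training target and the squared-error cost fit together, following the same rule as the implementation code: the chosen action's entry becomes reward + dis * max Q(next state), or -100 on the terminal step. The function names build_target and squared_error are mine, and the Adam update itself is left out.

```python
import numpy as np

def build_target(q_pred, action, reward, q_next, done,
                 discount=0.9, fail_value=-100.0):
    """Copy the predicted Q-values and overwrite only the chosen action's entry."""
    target = q_pred.copy()
    if done:
        target[0, action] = fail_value
    else:
        target[0, action] = reward + discount * np.max(q_next)
    return target

def squared_error(target, q_pred):
    """Sum of squared differences, as in the loss of the implementation code."""
    return float(np.sum(np.square(target - q_pred)))

q_pred = np.array([[1.0, 2.0]])   # Q(s, .) predicted for the current state
q_next = np.array([[0.5, 3.0]])   # Q(s', .) predicted for the next state

target = build_target(q_pred, action=0, reward=1.0, q_next=q_next, done=False)
loss = squared_error(target, q_pred)
```

Because the other action's entry is left equal to the prediction, only the action that was actually taken contributes to the cost.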

Full network setup code

Training code
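
The training loop in the implementation code explores with a decaying epsilon-greedy rule, e = 1 / ((episode / 10) + 1). Below is a standalone sketch of that rule using only the standard library; epsilon_for and choose_action are illustrative names, not part of the original code.

```python
import random

def epsilon_for(episode):
    """Exploration rate: 1.0 at episode 0, 0.5 at episode 10, and so on."""
    return 1.0 / ((episode / 10) + 1)

def choose_action(q_row, epsilon, num_actions=2, rng=random):
    """With probability epsilon act randomly, otherwise take the best Q-value."""
    if rng.random() < epsilon:
        return rng.randrange(num_actions)
    return max(range(num_actions), key=lambda a: q_row[a])

for episode in (0, 10, 90):
    print(episode, epsilon_for(episode))

# With epsilon = 0 the choice is purely greedy
best = choose_action([0.1, 0.9], epsilon=0.0)
```

Early episodes are almost entirely random, and the agent relies more and more on the network's Q-values as training goes on.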

This code computes the results and tests the trained network.

Training ends once the pole stays up for more than 500 frames on average.
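
The stopping rule can be written as a small helper. This is a sketch that mirrors the check in the implementation code (more than 10 recorded episodes, and a mean above 500 over the last 10); should_stop, window, and threshold are names I chose for illustration.

```python
import numpy as np

def should_stop(step_history, window=10, threshold=500):
    """True once more than `window` episodes exist and the recent mean beats `threshold`."""
    if len(step_history) <= window:
        return False
    return bool(np.mean(step_history[-window:]) > threshold)

print(should_stop([600] * 10))              # False: needs more than 10 episodes
print(should_stop([600] * 11))              # True
print(should_stop([20] * 5 + [600] * 10))   # True: only the last 10 count
```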

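The result... is dismal!

Why is that?

The neural network is too small
The samples are too similar to each other (correlated)
The target is unstable (it keeps moving)

I will explain these in detail in the next chapter!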

Implementation code (environment: Ubuntu 16.04, Python 3.6)


import numpy as np
import tensorflow as tf
from collections import deque
 
import gym
env = gym.make('CartPole-v0')
 
# Constants defining our neural network
learning_rate = 1e-1
input_size = env.observation_space.shape[0]
output_size = env.action_space.n
 
X = tf.placeholder(tf.float32, [None, input_size], name="input_x")
 
# First layer of weights
W1 = tf.get_variable("W1", shape=[input_size, output_size],
                     initializer=tf.contrib.layers.xavier_initializer())
Qpred = tf.matmul(X, W1)
 
# We need to define the parts of the network needed for learning a policy
Y = tf.placeholder(shape=[None, output_size], dtype=tf.float32)
 
# Loss function
loss = tf.reduce_sum(tf.square(Y - Qpred))
# Learning
train = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(loss)
 
# Values for q learning
max_episodes = 5000
dis = 0.9
step_history = []
 
 
# Setting up our environment
init = tf.global_variables_initializer()
sess = tf.Session()
sess.run(init)
 
for episode in range(max_episodes):
    e = 1. / ((episode / 10) + 1)
    step_count = 0
    state = env.reset()
    done = False
 
    # The Q-Network training
    while not done:
        step_count += 1
        x = np.reshape(state, [1, input_size])
        # Choose an action greedily from the Q-network's output,
        # with probability e of taking a random action instead
        Q = sess.run(Qpred, feed_dict={X: x})
        if np.random.rand(1) < e:
            action = env.action_space.sample()
        else:
            action = np.argmax(Q)
 
        # Get new state and reward from environment
        next_state, reward, done, _ = env.step(action)
        if done:
            Q[0, action] = -100
        else:
            x_next = np.reshape(next_state, [1, input_size])
            # Obtain the Q' values by feeding the new state through our network
            Q_next = sess.run(Qpred, feed_dict={X: x_next})
            Q[0, action] = reward + dis * np.max(Q_next)
 
        # Train the network on the target (Y) and predicted Q-values at every step
        sess.run(train, feed_dict={X: x, Y: Q})
        state = next_state
 
    step_history.append(step_count)
    print("Episode: {}  steps: {}".format(episode, step_count))
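    # Stop early once the average over the last 10 episodes exceeds 500 steps
    # (CartPole-v0 normally ends an episode at 200 steps, so this is only
    # reachable if that limit is raised)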
    if len(step_history) > 10 and np.mean(step_history[-10:]) > 500:
        break
 
# See our trained network in action
observation = env.reset()
reward_sum = 0
while True:
    #env.render()
 
    x = np.reshape(observation, [1, input_size])
    Q = sess.run(Qpred, feed_dict={X: x})
    action = np.argmax(Q)
 
    observation, reward, done, _ = env.step(action)
    reward_sum += reward
    if done:
        print("Total score: {}".format(reward_sum))
        break