Purpose

This repository is for studying reinforcement learning. We then apply these techniques to control a self-discipline system.

1762658300029

Contents

  1. basic: code to check fundamental reinforcement learning theory.
  2. documents: notes on reinforcement learning.
  3. pole-problem: code to try out reinforcement learning.
  4. multi-agent environments.

Using Environment

my work

I am writing a book on reinforcement learning.

The book is designed for beginners to learn reinforcement learning step by step, covering everything from the basics to practical applications.

It is published on Amazon Kindle, so please feel free to check it out if you are interested.

Amazon.co.jp: 実践!強化学習入門: Pythonで動かしながら理解する AI学習書 ("Practical Introduction to Reinforcement Learning: An AI study book you understand by running Python") eBook: 3 Sons Lover: Kindle Store

q-learn

My Codes

My scratch code is stored in the following repository:

Shinichi0713/Reinforce-Learning-Study: this is the codes which is in accordance with reinforcement-learning

I also have a repository on LLMs.

Shinichi0713/LLM-fundamental-study: this site is the fundamental page of LLM-mechanism

Please take a look.

Problems

Grid World with Dyna-Q

Using Dyna-Q, the agent is trained while also learning a model of the environment.

The plot below shows the transition of reward vs. episode.

1762658300029
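The Dyna-Q loop (direct update from real experience, model learning, then planning from the model) can be sketched as follows. The 1-D chain environment, action set, and hyperparameters here are illustrative stand-ins, not the repository's Grid World.

```python
import random
from collections import defaultdict

# Minimal tabular Dyna-Q sketch on a toy 1-D chain (illustrative only; the
# repository's Grid World environment and hyperparameters will differ).
def dyna_q(n_states=5, n_planning=10, episodes=50,
           alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)       # Q[(state, action)]
    model = {}                   # model[(state, action)] = (reward, next_state)
    actions = [0, 1]             # 0: left, 1: right
    for _ in range(episodes):
        s = 0
        while s != n_states - 1:            # rightmost state is terminal
            qs = [Q[(s, a_)] for a_ in actions]
            # epsilon-greedy with random tie-breaking
            if rng.random() < epsilon or qs[0] == qs[1]:
                a = rng.choice(actions)
            else:
                a = actions[qs.index(max(qs))]
            s2 = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
            r = 1.0 if s2 == n_states - 1 else 0.0
            # (1) direct RL update from the real transition
            best_next = max(Q[(s2, a_)] for a_ in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # (2) learn the deterministic model from real experience
            model[(s, a)] = (r, s2)
            # (3) planning: extra updates from transitions replayed by the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                best = max(Q[(ps2, a_)] for a_ in actions)
                Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
            s = s2
    return Q
```

Thanks to the planning steps, the value of reaching the goal propagates backward along the chain much faster than with vanilla Q-learning.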

Ball Catcher

This is the behavior of the Q-learning agent.

q-learn

Cart Pole

Using DQN, the agent completes the balancing motion.

q-learn
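For reference, the core of DQN is the Bellman target computed from a frozen target network over sampled transitions. The sketch below replaces the networks with lookup tables for brevity; the state names and values are made up for illustration.

```python
# Sketch of DQN's Bellman target computation. The "networks" are replaced by
# plain lookup tables for brevity; states, values, and gamma are illustrative.
def dqn_targets(batch, q_target, gamma=0.99):
    """batch: list of (s, a, r, s2, done); q_target[s2] -> action-value list."""
    targets = []
    for s, a, r, s2, done in batch:
        if done:
            y = r                              # no bootstrapping past episode end
        else:
            y = r + gamma * max(q_target[s2])  # bootstrap from the frozen target net
        targets.append(y)
    return targets

# Toy usage: the target "network" is a table of per-action values.
q_target = {"s1": [0.5, 1.0], "s2": [0.0, 0.2]}
batch = [("s0", 0, 1.0, "s1", False), ("s0", 1, 0.0, "s2", True)]
print(dqn_targets(batch, q_target))  # first target is about 1.99, second is 0.0
```

In a real agent, the online network is regressed toward these targets, and the target network is periodically synchronized with the online one.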

Pendulum

This is the behavior of the SAC agent.

q-learn
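For reference, SAC (Soft Actor-Critic) maximizes the expected return augmented with the policy's entropy, which encourages exploration:

$$
J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t} \gamma^{t}\left(r(s_t, a_t) + \alpha \, \mathcal{H}\big(\pi(\cdot \mid s_t)\big)\right)\right]
$$

where $\alpha$ is the temperature coefficient trading off reward against entropy.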

Lunar Lander

This is the behavior of the SAC agent. DDPG did not work well on this task.

sac

Robot Walking

This is the behavior of the actor-critic agent.

sac

BipedalWalkerHardcore

Using SAC, the agent gradually learns to walk…

This walker's agent is based on a plain feed-forward (FNN) model; essentially, training does not progress well.

sac

Next, the agent was rebuilt on a Transformer architecture. The agent is not large, but training progresses as expected, which shows that in RL the architecture is important. Unfortunately, the agent does not use both legs; this is likely due to a lack of exploration.

sac

sac

TSP

This is the behavior of the Pointer Network. The results are not good yet…

sac

This is the result of comparing three methods.

sac

JSSP

Using an actor-critic framework with a Transformer network model, the agent learns to minimize the total job completion time (makespan).

jssp-1

jssp-2

jssp-3

Imitation Learning - Behavior Cloning

I checked the effect of adding imitation learning. In this case, the reward improves when imitation learning is used.

jssp-1

jssp-2

jssp-3
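Behavior cloning is plain supervised learning on expert state-action pairs. The count-based sketch below illustrates the idea for discrete states; the actual experiment above trains a Transformer policy with a cross-entropy loss instead.

```python
from collections import defaultdict, Counter

# Minimal behavior-cloning sketch for discrete states/actions (illustrative;
# a neural policy would be fit with a cross-entropy loss instead of counts).
def behavior_clone(expert_demos):
    """expert_demos: list of (state, action) pairs collected from the expert."""
    counts = defaultdict(Counter)
    for s, a in expert_demos:
        counts[s][a] += 1
    # the cloned policy imitates the expert's most frequent action per state
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

# toy demonstrations
demos = [("s0", "left"), ("s0", "left"), ("s0", "right"), ("s1", "right")]
policy = behavior_clone(demos)
print(policy["s0"])  # left
```

Pretraining the policy this way gives RL a sensible starting point, which is one plausible reason the reward curve improves above.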

IRL-GAIL

Using stable_baselines3 and the imitation library, the agent is trained with GAIL. The rewards are shown below.

| Trial | Reward |
| --- | --- |
| 1st | 289.0 |
| 2nd | 295.0 |
| 3rd | 278.0 |

Arranging Boxes

Using DDQN, the AI agent learns to arrange boxes within a restricted space.

alt text

Multi-Agent example

We implemented DQN training using two Rock-Paper-Scissors agents as a multi-agent example problem. We visualized two aspects:

The trend of Agent 1’s average reward (Learning stability).

The trend of Agent 1’s final Q-values (Learned action strategy).

<img src="image/README/1763209192634.png" alt="jssp-3" width="500px" height="auto">
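The two-agent setup can be sketched as independent stateless Q-learning in the Rock-Paper-Scissors matrix game; the payoff encoding and hyperparameters below are illustrative, not the repository's exact implementation.

```python
import random

# Independent Q-learning sketch for two Rock-Paper-Scissors agents
# (stateless matrix game; hyperparameters are illustrative).
BEATS = {0: 2, 1: 0, 2: 1}   # 0=rock beats 2=scissors, 1=paper beats 0=rock, ...

def payoff(a1, a2):
    """Agent 1's reward: +1 win, 0 draw, -1 loss (zero-sum game)."""
    if a1 == a2:
        return 0.0
    return 1.0 if BEATS[a1] == a2 else -1.0

def train_rps(rounds=5000, alpha=0.1, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q1, q2 = [0.0] * 3, [0.0] * 3
    for _ in range(rounds):
        # each agent acts epsilon-greedily on its own Q-values
        a1 = rng.randrange(3) if rng.random() < epsilon else q1.index(max(q1))
        a2 = rng.randrange(3) if rng.random() < epsilon else q2.index(max(q2))
        r1 = payoff(a1, a2)
        # one-shot game: no next state, so the target is just the reward
        q1[a1] += alpha * (r1 - q1[a1])
        q2[a2] += alpha * (-r1 - q2[a2])
    return q1, q2
```

Because each agent's reward depends on the other's changing policy, the Q-values keep chasing each other; this non-stationarity is what the reward and Q-value plots above visualize.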

The MARL implementation is described at the following URL:

Reinforce-Learning-Study/miulti-agent/readme.md at main · Shinichi0713/Reinforce-Learning-Study

MARL warehouse

A new theme is under consideration. The environment is shown below.

I have been working on the Warehouse Problem, where the task is to have two agents deliver items to designated locations within a warehouse.

Initially, I approached this using QMIX, but I encountered a situation where either both agents would fail to move, or only one of them would operate. I concluded that a lack of exploration was the primary cause.

After switching to HASAC (Heterogeneous-Agent Soft Actor-Critic), the agents began to cooperate and function properly. This experience has truly highlighted the critical importance of exploration in reinforcement learning.

With QMIX, the agents do not work well.

jssp-3

With HASAC, the multi-agent system has started to operate in coordination.

jssp-3

MARL adventure

This is a cooperative Multi-Agent Reinforcement Learning (MARL) example focusing on information sharing and continuous coordination. The core challenge is to efficiently cover an unknown area by pooling decentralized knowledge.

Environment and Setup

| Item | Details |
| --- | --- |
| Environment | An unknown grid map representing a disaster site where critical targets are hidden. |
| Observation | Each drone has a very narrow sensor range (e.g., only adjacent cells), leading to significant local partial observability. |
| Agents | Multiple search drones (or mobile sensor robots). |
| Actions | Movement (Up, Down, Left, Right, Stay). |
| Goal | Maximize map coverage efficiency by minimizing the time required to fully explore the entire map (minimizing unexplored area). |

Learning Objectives and Cooperation Points

  1. Information Sharing and Distributed Knowledge
  2. Optimal Coverage and Spatial Load Balancing

jssp-3
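The environment's step logic might look like the following sketch. The grid size, move set, and shared coverage reward are assumptions for illustration, not the repository's exact implementation.

```python
# Minimal sketch of a multi-drone coverage step (grid size, moves, and the
# shared +1-per-new-cell reward are illustrative assumptions).
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1), "stay": (0, 0)}

def step(positions, actions, visited, size=8):
    """Move each drone, mark its cell visited, and reward newly covered cells."""
    reward = 0
    new_positions = []
    for (r, c), act in zip(positions, actions):
        dr, dc = MOVES[act]
        r2 = min(max(r + dr, 0), size - 1)   # clip to grid bounds
        c2 = min(max(c + dc, 0), size - 1)
        if (r2, c2) not in visited:
            reward += 1                       # shared team reward for new coverage
            visited.add((r2, c2))
        new_positions.append((r2, c2))
    done = len(visited) == size * size        # full map covered
    return new_positions, reward, done

# toy usage: two drones starting in opposite corners
visited = {(0, 0), (7, 7)}
pos, r, done = step([(0, 0), (7, 7)], ["right", "up"], visited)
print(pos, r)  # [(0, 1), (6, 7)] 2
```

Because the coverage reward is shared, agents are pushed to spread out rather than re-explore each other's cells, which is the spatial load balancing objective above.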

Fundamental Knowledge

Roles of Memory in Deep Reinforcement Learning (DRL)

Memory, often referred to as a Replay Buffer or Experience Replay, is essential in DRL for three primary reasons, each of which stabilizes training or improves efficiency:

1. Breaking Data Correlation (I.I.D. Assumption)

Deep learning models (neural networks) are optimized under the assumption that training data is Independent and Identically Distributed (i.i.d.). Consecutive experiences in RL are strongly correlated in time, so sampling random minibatches from the buffer breaks this correlation and restores the assumption.

2. Reuse of Valuable Experiences (Improving Sample Efficiency)

In RL, obtaining a “reward” often requires many steps, making experiences with significant rewards extremely valuable. Storing them in a buffer lets the agent reuse such experiences across many updates instead of discarding them after a single use.

3. Enabling Off-policy Learning

Algorithms like DQN (Deep Q-Network) utilize Off-policy learning, which allows the agent to learn from data generated by a version of itself that is different from its current policy (i.e., its past self or other agents).
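The three roles above are all served by a simple buffer like the minimal sketch below: bounded storage enables reuse, uniform random sampling breaks correlation, and the stored transitions come from past (off-policy) behavior.

```python
import random
from collections import deque

# Minimal replay buffer sketch (capacity and batch size are illustrative).
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest experiences evicted first

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        # uniform random sampling decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# toy usage: store 150 transitions in a capacity-100 buffer
buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.push(t, 0, 0.0, t + 1, False)
print(len(buf))  # 100
```

Because sampled transitions were generated by older versions of the policy, any learner using this buffer is necessarily off-policy, which is why DQN-style algorithms pair naturally with it.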

MARL

Centralized vs. Decentralized Learning

In Reinforcement Learning, “Centralized Learning” and “Decentralized Learning” refer to structural differences in “who processes the information and where the command center for learning is located.”

The following outlines their characteristics within the context of Multi-Agent Reinforcement Learning (MARL).

1. Centralized Training

This style involves collecting data (observations, actions, rewards) from all agents into a single location to train a single, massive intelligence.

2. Decentralized Training

This is a style where each agent learns independently based solely on its own experience.

3. Hybrid: Centralized Training, Decentralized Execution (CTDE)

Currently the most popular approach for tasks like cooperative drone control, CTDE combines the best of both worlds.

Summary Comparison

| Feature | Centralized | Decentralized | CTDE (Hybrid) |
| --- | --- | --- | --- |
| Data Aggregation | Always centralized | Distributed per agent | Centralized only during training |
| Learning Stability | High (global view) | Low (others are moving) | Medium to High (balanced) |
| Execution Autonomy | Low (requires comms) | High (operates alone) | High (operates alone) |
| Primary Use Cases | Small-scale precision control | Large-scale independent envs | Multi-drone coordination |

Conclusion: Regarding the choice between centralized or decentralized, the modern MARL consensus is that CTDE (Centralized Training, Decentralized Execution) is the most efficient and practical solution.

MARL Learning Methodologies

The major frameworks for cooperative learning in MARL are categorized into three types based on how they handle information and the learning process. While CTDE is the current standard, here are the characteristics of each:

1. Decentralized Training, Decentralized Execution (DTDE)

The simplest form where each agent treats others as “part of the environment” (like moving walls) and learns/executes independently.

2. Centralized Training, Centralized Execution (CTCE)

Treats all agents as one “giant AI,” processing all observations and actions collectively.

3. Centralized Training, Decentralized Execution (CTDE)

The current de facto standard. It shares information only during training and maintains independence during execution.

4. Communication-based Learning

A framework where agents encourage cooperation by sending “messages” to one another.

5. Hierarchical / Role-based Learning

Dividing agents into a “Manager” (commander) and “Workers” (executors), or assigning specific “Roles.”

For more detail, please see here.

Cite

In this repository, the OSS 'PyGame Learning Environment' is used: https://github.com/ntasfi/PyGame-Learning-Environment

DeepMind archives!

very nice site!

google-deepmind/deepmind-research: This repository contains implementations and illustrative code to accompany DeepMind publications

References

When studying RL, I refer to several other websites. The reference sites are listed below.

AI compass: this site presents a wide range of AI knowledge with insight.

星の本棚: this site shows nice tips about reinforcement learning.

Blog

I publish technical articles focused on reinforcement learning techniques on my blog. Feel free to visit and have a read.

writer's blog