This repository is for studying reinforcement learning, with the goal of applying the techniques to the control of autonomous systems.

Environments Used
I am writing a book on reinforcement learning.
The book is designed for beginners to learn reinforcement learning step by step, covering everything from the basics to practical applications.
It is published on Amazon Kindle, so please feel free to check it out if you are interested.
Amazon.co.jp: 実践!強化学習入門: Pythonで動かしながら理解する AI学習書 (Practical Reinforcement Learning: an AI textbook you understand by running Python) eBook : 3 Sons Lover: Kindle Store

My from-scratch implementation code is stored in the following repository:
I also have a repository on LLMs:
Shinichi0713/LLM-fundamental-study: a fundamental study of the mechanisms of LLMs
Please take a look.
Using Dyna-Q, the agent is trained while also learning and updating a model of the environment.
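As a minimal illustration of the idea (not the repository's actual code), here is a tabular Dyna-Q sketch: each real transition both updates the Q-table directly and trains a model of the environment, which is then replayed for extra planning updates. The 5-state chain environment at the bottom is a hypothetical example.

```python
import random
from collections import defaultdict

def dyna_q(env_step, n_actions, episodes=50, alpha=0.1,
           gamma=0.95, eps=0.1, planning_steps=5, max_steps=200):
    """Tabular Dyna-Q: every real transition also trains a model of the
    environment, which is replayed for extra 'planning' updates."""
    Q = defaultdict(lambda: 1.0)   # optimistic init encourages exploration
    model = {}                     # model[(s, a)] = (reward, next_state, done)
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            if random.random() < eps:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda a_: Q[(s, a_)])
            r, s2, done = env_step(s, a)
            # (1) direct RL update from the real transition
            target = r if done else r + gamma * max(Q[(s2, b)] for b in range(n_actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (2) model learning (deterministic environment assumed)
            model[(s, a)] = (r, s2, done)
            # (3) planning: replay randomly chosen stored transitions
            for _ in range(planning_steps):
                (ps, pa), (pr, ps2, pdone) = random.choice(list(model.items()))
                pt = pr if pdone else pr + gamma * max(Q[(ps2, b)] for b in range(n_actions))
                Q[(ps, pa)] += alpha * (pt - Q[(ps, pa)])
            if done:
                break
            s = s2
    return Q

# Hypothetical 5-state chain: action 1 moves right; reaching state 4 gives reward 1.
def chain_step(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return (1.0, s2, True) if s2 == 4 else (0.0, s2, False)
```

Calling `dyna_q(chain_step, n_actions=2)` quickly learns to prefer moving right; the planning updates propagate the terminal reward backwards much faster than plain Q-learning would.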
The figure below shows the transition of reward versus episode.

This is the behavior of the Q-learning agent.

Using DQN, the motion task is completed successfully.

this is the behavior of SAC.

This is the behavior of SAC; DDPG, by contrast, did not work well.

This is the behavior of the actor-critic agent.

Using SAC, the agent gradually learns to walk.
This walker agent is based on a plain feed-forward network (FNN); with it, training essentially does not progress well.

Next, the agent was rebuilt on a Transformer. The model is not large, but training progressed as expected, so I found that in RL the network architecture matters. Unfortunately, the agent does not use both legs, which is probably due to a lack of exploration.


This is the behavior of PointerNet. The result is not good.

These are the results of the three methods.

Using an actor-critic framework, with the model built on a Transformer network, the agent learns to minimize the total job completion time.



I checked the effect of imitation learning: in this case, the reward improved when imitation learning was used.



Using stable_baselines3 and the imitation library, the agent was trained with GAIL. The rewards are given below.
| Trial | Reward |
|---|---|
| 1st | 289.0 |
| 2nd | 295.0 |
| 3rd | 278.0 |
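Under the hood, GAIL trains a discriminator to distinguish expert transitions from policy transitions and uses its output as the reward signal. The following is a minimal numpy sketch of that core mechanism only, over hypothetical state-action feature vectors; it is not the stable_baselines3/imitation implementation used above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def discriminator_step(w, expert_sa, policy_sa, lr=0.1):
    """One gradient step for the GAIL discriminator D(s,a) = sigmoid(x @ w),
    trained to output 1 on expert pairs and 0 on policy pairs."""
    d_exp = sigmoid(expert_sa @ w)
    d_pol = sigmoid(policy_sa @ w)
    # gradient of the binary cross-entropy loss with respect to w
    grad = (-(expert_sa.T @ (1.0 - d_exp)) / len(expert_sa)
            + (policy_sa.T @ d_pol) / len(policy_sa))
    return w - lr * grad

def gail_reward(w, sa):
    """Reward handed to the policy: -log(1 - D); higher when D is fooled."""
    return -np.log(1.0 - sigmoid(sa @ w) + 1e-8)
```

In the actual experiment, the imitation library's GAIL trainer wraps this discriminator logic around a stable_baselines3 learner; the sketch only shows the adversarial reward idea.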
Using DDQN, the AI agent can arrange boxes within a restricted space.

We implemented DQN training using two Rock-Paper-Scissors agents as a multi-agent example problem. We visualized two aspects:
- The trend of Agent 1's average reward (learning stability).
- The trend of Agent 1's final Q-values (learned action strategy).
<img src="image/README/1763209192634.png" alt="jssp-3" width="500px" height="auto">
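As a minimal sketch of the underlying setup (not the repository's actual code), the Rock-Paper-Scissors payoff and a stateless Q-learning update for Agent 1 can be written as:

```python
import random

def rps_reward(a1, a2):
    """Rock=0, Paper=1, Scissors=2; returns Agent 1's reward (+1/0/-1)."""
    if a1 == a2:
        return 0
    return 1 if (a1 - a2) % 3 == 1 else -1

def train_vs_biased(opponent_action=0, episodes=2000, alpha=0.1, eps=0.1):
    """Stateless (bandit-style) Q-learning for Agent 1 vs. a fixed opponent."""
    Q = [0.0, 0.0, 0.0]
    for _ in range(episodes):
        a1 = random.randrange(3) if random.random() < eps else Q.index(max(Q))
        r = rps_reward(a1, opponent_action)
        Q[a1] += alpha * (r - Q[a1])  # no next state in a one-shot game
    return Q
```

Against an opponent that always plays Rock, the Q-values converge so that Paper is preferred. In the actual self-play experiment both agents learn simultaneously, so the Q-values drift rather than converge, which is what the plotted trends show.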
The MARL implementation is described at the following URL:
Reinforce-Learning-Study/miulti-agent/readme.md at main · Shinichi0713/Reinforce-Learning-Study
A new theme is under consideration. The environment is shown below.
I have been working on the Warehouse Problem, where the task is to have two agents deliver items to designated locations within a warehouse.
Initially, I approached this using QMIX, but I encountered a situation where either both agents would fail to move, or only one of them would operate. I concluded that a lack of exploration was the primary cause.
After switching to HASAC (Heterogeneous-Agent Soft Actor-Critic), the agents began to cooperate and function properly. This experience has truly highlighted the critical importance of exploration in reinforcement learning.
With QMIX, the agents do not work.

With HASAC, the multi-agent system has started to operate in coordination.

This is a cooperative Multi-Agent Reinforcement Learning (MARL) example focusing on information sharing and continuous coordination. The core challenge is to efficiently cover an unknown area by pooling decentralized knowledge.
Environment and Setup
| Item | Details |
|---|---|
| Environment | An unknown grid map representing a disaster site where critical targets are hidden. |
| Observation | Each drone has a very narrow sensor range (e.g., only adjacent cells), leading to significant local partial observability. |
| Agents | Multiple search drones (or mobile sensor robots). |
| Actions | Movement (Up, Down, Left, Right, Stay). |
| Goal | Maximize map coverage efficiency by minimizing the time required to fully explore the entire map (minimizing unexplored area). |
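A toy version of this environment can be sketched as follows; the grid size, start positions, and the +1-per-newly-covered-cell reward shaping are hypothetical choices for illustration, not the repository's actual settings.

```python
import numpy as np

class CoverageGrid:
    """Toy multi-drone coverage environment on an H x W grid."""
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1), 4: (0, 0)}  # U, D, L, R, Stay

    def __init__(self, h=5, w=5):
        self.h, self.w = h, w
        self.pos = [(0, 0), (h - 1, w - 1)]        # two drones, opposite corners
        self.visited = np.zeros((h, w), dtype=bool)
        for p in self.pos:
            self.visited[p] = True

    def step(self, actions):
        """actions: one move per drone. Reward +1 only for newly covered cells."""
        rewards = []
        for i, a in enumerate(actions):
            dr, dc = self.MOVES[a]
            r = min(max(self.pos[i][0] + dr, 0), self.h - 1)
            c = min(max(self.pos[i][1] + dc, 0), self.w - 1)
            self.pos[i] = (r, c)
            rewards.append(0.0 if self.visited[r, c] else 1.0)
            self.visited[r, c] = True
        return self.pos, rewards, bool(self.visited.all())
```

The episode ends when every cell has been visited, so minimizing episode length is equivalent to the coverage-efficiency goal in the table.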
Learning Objectives and Cooperation Points

Memory, often referred to as a Replay Buffer or Experience Replay, is essential in DRL for three primary reasons, all of which stabilize training and improve sample efficiency:
1. Deep learning models (neural networks) are optimized under the assumption that training data is independent and identically distributed (i.i.d.); random sampling from a buffer breaks the temporal correlation of consecutive transitions.
2. In RL, obtaining a reward often requires many steps, making experiences with significant rewards extremely valuable; a buffer allows them to be reused many times.
3. Algorithms like DQN (Deep Q-Network) use off-policy learning, which allows the agent to learn from data generated by a policy different from the current one (i.e., its past self or other agents).
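These three points (approximately i.i.d. batches, reuse of rare rewarding experiences, and off-policy reuse of old data) are exactly what a replay buffer provides. A minimal sketch:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size FIFO store of transitions. Uniform random sampling breaks
    the temporal correlation of consecutive steps (restoring an approximately
    i.i.d. batch) and lets rare, high-reward transitions be reused many times."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest entries evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```

A DQN-style loop pushes every transition and, once the buffer is warm, samples a batch each step to update the network.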
In Reinforcement Learning, “Centralized Learning” and “Decentralized Learning” refer to structural differences in “who processes the information and where the command center for learning is located.”
The following outlines their characteristics within the context of Multi-Agent Reinforcement Learning (MARL).
**Centralized learning**: This style involves collecting data (observations, actions, rewards) from all agents into a single location to train a single, large model. Theoretically, it is the most likely to reach a globally optimal solution.
**Decentralized learning**: Each agent learns independently, based solely on its own experience. It is resilient to privacy concerns and communication limits, since interaction with others is not required for training.
**CTDE**: Currently the most popular approach for tasks like cooperative drone control, CTDE combines the best of both worlds.
Training: Centralized training gives the critic access to global information (e.g., all agents' observations).
Execution: Decentralized execution is performed, where each "Actor" acts quickly based only on its own local sensor data.
| Feature | Centralized | Decentralized | CTDE (Hybrid) |
|---|---|---|---|
| Data Aggregation | Always centralized | Distributed per agent | Centralized only during training |
| Learning Stability | High (global view) | Low (other agents are non-stationary) | Medium to High (balanced) |
| Execution Autonomy | Low (Requires comms) | High (Operates alone) | High (Operates alone) |
| Primary Use Cases | Small-scale precision control | Large-scale independent envs | Multi-drone coordination |
Conclusion: Regarding the choice between centralized or decentralized, the modern MARL consensus is that CTDE (Centralized Training, Decentralized Execution) is the most efficient and practical solution.
The major frameworks for cooperative learning in MARL are categorized into three types based on how they handle information and the learning process. While CTDE is the current standard, here are the characteristics of each:
**Independent learning**: The simplest form, where each agent treats the others as part of the environment (like moving walls) and learns and executes independently.
**Fully centralized**: Treats all agents as one giant AI, processing all observations and actions jointly.
**CTDE**: The current de facto standard. It shares information only during training and maintains independence during execution.
Execution (decentralized): Each agent uses only its own network (Actor) to decide actions based on local information.
**Communication-based**: A framework where agents encourage cooperation by sending "messages" to one another.
**Hierarchical / role-based**: Divides agents into a "Manager" (commander) and "Workers" (executors), or assigns specific roles.
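The CTDE split can be sketched structurally as follows. All dimensions and the linear "networks" here are hypothetical placeholders: the point is that decentralized actors see only their own local observation at execution time, while a centralized critic sees the joint observation and all actions, and exists only during training.

```python
import numpy as np

rng = np.random.default_rng(0)
N_AGENTS, OBS_DIM, N_ACTIONS = 2, 4, 3

# Decentralized actors: one small policy per agent, local observation only.
actor_weights = [rng.normal(size=(OBS_DIM, N_ACTIONS)) for _ in range(N_AGENTS)]

def act(agent_id, local_obs):
    """Execution path: uses ONLY this agent's own observation (greedy here)."""
    logits = local_obs @ actor_weights[agent_id]
    return int(np.argmax(logits))

# Centralized critic: scores the JOINT observation plus all agents' actions.
# It is available only during training (e.g., inside a simulator).
critic_w = rng.normal(size=N_AGENTS * OBS_DIM + N_AGENTS)

def centralized_value(joint_obs, joint_actions):
    """Training-time value estimate computed from global information."""
    x = np.concatenate([np.ravel(joint_obs), np.asarray(joint_actions, float)])
    return float(x @ critic_w)
```

At deployment, only `act` and the per-agent actor weights are needed, which is why CTDE keeps execution autonomous while still using global information to stabilize learning.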
For more detail, please see here.
In this repository, the OSS package 'PyGame Learning Environment' is used: https://github.com/ntasfi/PyGame-Learning-Environment
The DeepMind archives: a very nice site!
When studying RL, I also refer to other websites. The reference sites are listed below.
AI compass: this site presents a wide range of AI knowledge with insight.
星の本棚 (Hoshi no Hondana, "Star Bookshelf"): this site shows nice tips about reinforcement learning.
I publish technical articles focused on reinforcement learning techniques on my blog. Feel free to visit and have a read.