Reinforcement Learning: How Agents Learn by Doing
How a model discovers patterns through experience
Instead of being told the answer, the agent discovers it. It takes actions, observes rewards, and gradually learns which decisions lead to better outcomes.
In the previous lesson, a machine learning model learned from labeled data — “here's what happened, learn the pattern.” Reinforcement learning (RL) works differently: there are no labels. The agent interacts with an environment, receives rewards for good decisions, and learns through experience.
Think of it like learning to play chess. Nobody gives you a rule book that says “in this position, move the knight here.” Instead, you play games, win some, lose some, and gradually develop intuition about which moves tend to lead to wins.
What is Reinforcement Learning?
Every RL system has four components:
Agent
The decision-maker. In this simulation, a neural network that estimates the best action to take.
Environment
The market. It provides the agent with data (price history) and responds to the agent's actions.
Action
What the agent can do. Here: predict “up” (go long) or predict “down” (go short).
Reward
Feedback from the environment. The agent gets a reward of 1 if its prediction was correct, 0 if wrong.
The agent's goal is simple: maximize total reward over time. It starts knowing nothing and gradually learns which patterns in the data tend to precede upward vs downward moves.
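The four components and the reward-maximizing loop can be boiled down to a toy sketch. This is not the platform's code — it's a minimal two-action example with a made-up environment where “up” is the correct call 60% of the time, so you can see value estimates emerge from feedback alone:

```python
import random

random.seed(0)

actions = ["up", "down"]
q = {a: 0.0 for a in actions}       # estimated value of each action
counts = {a: 0 for a in actions}

def reward(action):
    # Hypothetical environment: "up" is correct 60% of the time.
    correct = "up" if random.random() < 0.6 else "down"
    return 1 if action == correct else 0

for step in range(1000):
    # epsilon-greedy: explore 10% of the time, otherwise exploit
    if random.random() < 0.1:
        a = random.choice(actions)
    else:
        a = max(q, key=q.get)
    r = reward(a)
    counts[a] += 1
    q[a] += (r - q[a]) / counts[a]  # incremental average of observed rewards

print(q)  # q["up"] typically settles near 0.6, q["down"] near 0.4
```

The agent never sees the 60% probability — it recovers it from rewards, which is the core idea the rest of this lesson builds on.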
RL in financial markets
Applying RL to finance is conceptually elegant but practically challenging. Here, the agent uses a value-based deep learning approach:
- The agent observes 15 technical features per day — price action, moving averages, trend strength, and volume patterns
- A neural network estimates the “value” of each possible action (long or short)
- The agent picks the action with the highest estimated value — or explores randomly to discover new strategies
- After each decision, it stores the experience (state, action, reward, next state) in memory
- Periodically, it replays a batch of past experiences to train the neural network — learning from its own history
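The steps above can be sketched in a few lines. This is an illustrative skeleton only — the lesson's agent uses a neural network, but a linear value function stands in here so the store-then-replay loop stays short. All names and hyperparameters (GAMMA, LR, BATCH) are assumptions for the sketch:

```python
import random
from collections import deque
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES, N_ACTIONS = 15, 2           # 15 technical features; long/short
W = np.zeros((N_ACTIONS, N_FEATURES))   # one weight row per action (stand-in for a network)
memory = deque(maxlen=10_000)           # replay buffer of past experiences
GAMMA, LR, BATCH = 0.95, 0.01, 32       # assumed hyperparameters

def q_values(state):
    return W @ state                    # estimated value of each action

def act(state, epsilon):
    # explore with probability epsilon, otherwise exploit the estimates
    if rng.random() < epsilon:
        return int(rng.integers(N_ACTIONS))
    return int(np.argmax(q_values(state)))

def replay():
    # periodically train on a random batch of stored experiences
    if len(memory) < BATCH:
        return
    for s, a, r, s_next in random.sample(list(memory), BATCH):
        target = r + GAMMA * np.max(q_values(s_next))   # bootstrapped value target
        td_error = target - q_values(s)[a]
        W[a] += LR * td_error * s       # gradient step for the action taken

# One simulated interaction step (random stand-in "market" states):
state = rng.standard_normal(N_FEATURES)
action = act(state, epsilon=1.0)        # fully exploratory at the start
r, next_state = 1, rng.standard_normal(N_FEATURES)
memory.append((state, action, r, next_state))
replay()
```

Sampling the batch at random — rather than replaying experiences in order — breaks the correlation between consecutive market days, which is why replay memory helps training stability.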
Exploration vs exploitation: The agent starts by exploring randomly (high epsilon) and gradually shifts to exploiting what it has learned (low epsilon). This balance is critical — too much exploration wastes time, too little means the agent may never discover better strategies.
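A common way to implement this shift is multiplicative epsilon decay with a floor. The values below are assumptions for illustration, not the simulation's defaults:

```python
# Start fully random, decay toward a small exploration floor.
eps, eps_min, decay = 1.0, 0.05, 0.995

schedule = []
for episode in range(1000):
    schedule.append(eps)
    eps = max(eps_min, eps * decay)  # shrink epsilon, never below the floor

print(schedule[0], schedule[-1])     # prints 1.0 0.05
```

A decay factor closer to 1.0 stretches out the exploration phase; a lower floor lets the agent commit harder to what it has learned. Neither choice is free — that's the trade-off the paragraph above describes.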
When it works, when it doesn't
Tends to work when
The environment has stable, repeatable patterns. RL excels in games (chess, Go) where the rules don't change. In markets, this is rare — but the learning process itself is instructive.
Tends to fail when
The environment is non-stationary (market regimes change), the reward signal is noisy (daily returns are very noisy), or the agent overfits to training data patterns that don't persist.
Worth noting: RL has achieved superhuman performance in games like Go and Atari. Financial markets are fundamentally different — the rules change, other players adapt, and the signal-to-noise ratio is extremely low. The purpose is to understand how the technique works — not to suggest it consistently generates profits.
See it in action
Pick a ticker, adjust the parameters, and watch the agent train in real time. The learning curve shows how the agent's reward improves (or doesn't) across episodes. The equity chart compares the trained agent's strategy to buy & hold.
The agent learns as you watch — nothing is pre-computed. Training takes a few seconds.
What to notice:
- The learning curve — does the agent's reward increase over episodes, or does it plateau?
- In-sample vs out-of-sample — the shaded area shows the training period. Does the strategy hold up on unseen data?
- Try different epsilon settings — higher start means more random exploration early on. Higher decay means slower transition to exploitation.
- Run it multiple times — because training involves randomness, results will differ each run. Consistency matters more than any single result.
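The in-sample vs out-of-sample check above comes down to a chronological split: train on the earlier portion, evaluate on the later one. A minimal sketch with placeholder data (the 70% ratio is an assumption, not the platform's setting):

```python
# Placeholder daily price series; in practice this is the ticker's history.
prices = list(range(100))

split = int(len(prices) * 0.7)        # first 70% = training (shaded) period
train, test = prices[:split], prices[split:]

# The agent trains only on `train`; its equity curve over `test`
# reveals whether what it learned generalizes to unseen data.
print(len(train), len(test))          # prints 70 30
```

Note the split is by time, never random: shuffling would leak future information into training, which is exactly the mistake out-of-sample evaluation exists to catch.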
Your turn
Think about how you learn from your own decisions. When a financial decision works out, do you attribute it to skill or luck? When it goes poorly, do you update your approach or blame the market?
The RL agent has no ego. It updates its strategy based purely on outcomes. Humans are different — people have biases, emotions, and narratives that make it harder to learn from experience objectively. Recognizing this is the first step toward more disciplined decision-making.
What you've learned
- Reinforcement learning agents discover patterns through experience — no labeled data, just actions, rewards, and feedback.
- The agent's deep learning approach uses a neural network to estimate the value of each action in a given state — the network trains directly in your browser.
- The exploration-exploitation trade-off is fundamental: too much exploration wastes time, too little means potentially missing better strategies.
- RL excels in stable environments with clear rules (games). Financial markets are non-stationary — the patterns the agent learns may not persist.
- The most important observation: does the agent's performance on training data carry over to unseen data? If not, it has memorized noise, not learned signal.
Want to test this?
Many experienced investors suggest practicing with a paper-money (simulated) account at a reputable broker before risking real capital. Many brokers offer free simulated trading environments where you can test strategies with real market data and no financial risk.
Paper trading lets you build confidence, understand execution, and see how a strategy behaves in real time — without the emotional weight of real money on the line.
Important
Everything on this platform is educational and didactic in nature. We do not provide investment advice, financial advisory, or recommendations to buy or sell any financial instrument. Past performance is not indicative of future results. All strategies shown are historical simulations for learning purposes only. Always do your own research and consult a qualified financial advisor before making investment decisions.