<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://yonakalabs.net/feed.xml" rel="self" type="application/atom+xml" /><link href="https://yonakalabs.net/" rel="alternate" type="text/html" /><updated>2026-04-07T20:25:04+09:00</updated><id>https://yonakalabs.net/feed.xml</id><title type="html">Yonaka Research</title><subtitle>Developing reinforcement learning to let Yonaka play games</subtitle><author><name>Kazumi</name></author><entry><title type="html">awawawawawawawawawawa</title><link href="https://yonakalabs.net/s/643f46d5f67711e0/" rel="alternate" type="text/html" title="awawawawawawawawawawa" /><published>2026-04-06T22:00:00+09:00</published><updated>2026-04-06T22:00:00+09:00</updated><id>https://yonakalabs.net/s/test-post</id><content type="html" xml:base="https://yonakalabs.net/s/643f46d5f67711e0/"><![CDATA[]]></content><author><name>Kazumi</name></author><category term="test" /></entry><entry><title type="html">Q-Learning with Multiple Subactions</title><link href="https://yonakalabs.net/subactions/" rel="alternate" type="text/html" title="Q-Learning with Multiple Subactions" /><published>2025-07-31T22:00:00+09:00</published><updated>2025-07-31T22:00:00+09:00</updated><id>https://yonakalabs.net/subactions</id><content type="html" xml:base="https://yonakalabs.net/subactions/"><![CDATA[<p>In my last post, I showed how to handle continuous actions in Q-learning using cubic splines.
That solved one major limitation, but there’s still another one that keeps people away from using DQN.</p>

<p><img src="/assets/images/posts/subactions/stable-baselines-comparison.png" alt="Stable Baselines 3 Algorithm Comparison" /></p>

<p>DQN is still missing support for <a href="https://gymnasium.farama.org/api/spaces/fundamental/#gymnasium.spaces.MultiDiscrete">MultiDiscrete</a> action spaces. That’s the other major limitation I want to address.
The table also lists <a href="https://gymnasium.farama.org/api/spaces/fundamental/#gymnasium.spaces.MultiBinary">MultiBinary</a> as missing, but that’s just a MultiDiscrete with two choices per dimension, so I’ll ignore it.</p>

<p>Box is also only half solved: the previous post handled a single continuous action, but the full Box action space also covers multiple actions at once.</p>

<h2 id="actions-with-multiple-decisions">Actions with multiple decisions</h2>

<p>MultiDiscrete actions represent situations where you need to make multiple independent decisions simultaneously in each step.</p>

<p>Think about controlling a character in a game where you need to be doing multiple things at once: you might need to decide movement direction, where to aim, and whether to use special abilities at any given moment. Each of these might have different buttons controlling it.</p>

<p>Let me use an example. The agent has an Xbox controller as the action space. At any timestep, the agent needs to decide to press:</p>

<ul>
  <li><strong>Face buttons</strong>: Press A, B, X, Y or not (4 binary decisions)</li>
  <li><strong>Joysticks</strong>: Left stick direction, right stick direction (two 2D continuous inputs)</li>
  <li><strong>Triggers/Bumpers</strong>: Left trigger, right trigger, left bumper, right bumper (4 more binary decisions, or 2 binary and 2 continuous actions)</li>
</ul>

<p>I’ll call each of these individual decisions a subaction. The complete action for that timestep is the combination of all subaction choices. 
In general, subactions form an unordered set: they are executed simultaneously.</p>

<p>In this post, I’ll be going over how to handle composite actions.</p>

<h2 id="the-standard-approach">The standard approach</h2>

<p>The easiest way, and what many Q-learning tutorials do, is to flatten everything into a single action space by taking the Cartesian product of the subactions, or a hand-selected subset of it.
This means the number of possible actions the agent needs to consider each step grows exponentially with the number of subactions.</p>

<p>For the controller case, it would be</p>

<ul>
  <li><strong>Face buttons</strong>: \(2^4 = 16\) combinations</li>
  <li><strong>Joysticks</strong>: \(8^2 = 64\) combinations (if discretized to 8 directions each)</li>
  <li><strong>Triggers/bumpers</strong>: \(2^4 = 16\) combinations</li>
</ul>

<p>For a total of \(16 \times 64 \times 16 = 16{,}384\) possible actions.
If you let the joysticks and triggers have variable strengths instead of discretizing them, this grows even further.</p>
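To make the blow-up concrete, here’s a tiny sketch using the hypothetical discretization above:

```python
from itertools import product

# Hypothetical discretization of the controller from above:
# 4 binary face buttons, 2 sticks with 8 directions each, 4 binary triggers/bumpers
subaction_choices = [2] * 4 + [8] * 2 + [2] * 4

# The standard approach flattens everything into one Cartesian product,
# so every combination becomes its own unrelated action
flat_actions = list(product(*(range(n) for n in subaction_choices)))
print(len(flat_actions))  # 2**4 * 8**2 * 2**4 = 16384
```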

<p>The size of the action space isn’t the only problem, though; the bigger problem is that every action is treated as completely distinct from every other.
To the agent, an action that differs in a single component is represented just as differently as an entirely unrelated action. This makes training take much longer, because the agent can’t generalize between similar actions.</p>

<p>It would be much better to keep subactions separate, sample them individually, and combine them into the final action. That way, it would reduce the represented actions, and also relate similar actions together better. This is how every other RL algorithm does it anyway.</p>

<h2 id="how-to-make-sampling-subactions-easier">How to make sampling subactions easier</h2>

<p>Q-values represent expected future reward for taking a particular action. But what should Q-values mean for multiple simultaneous actions?</p>

<p>Q-values only make sense for a complete action that you’ve decided to take, which is why the standard approach keeps a full table of every combination. 
Sampling becomes much easier if each subaction has its own function to sample from, with a predictable effect on the overall Q-value. How could we do this?
The simplest way might be to have a function for each subaction, and let the Q-value be their sum:</p>

\[Q(\mathbf{a}; s) = \displaystyle\sum_{i = 1}^n Q(a_i; s)\]

<p>I tried this initially and it kind of worked, but there’s an identifiability problem during training: even if you’ve identified \(Q(\mathbf{a}; s)\) for every possible \(\mathbf{a}\) in a given state, you still can’t uniquely reconstruct each \(Q(a_i; s)\).
If one subaction’s Q-values have a constant added while another’s have the same constant subtracted, the sum remains unchanged.
This is a problem for gradient-based training, because there isn’t a unique minimum: there’s an entire symmetry group of parameterizations with the same minimum loss, and some of those minima have worse training dynamics.</p>

<p>But this is the exact same problem that <a href="https://arxiv.org/abs/1511.06581">Dueling Networks</a> faced, which means we can use the same trick!</p>

<p>They solved this by adding a state value that is independent of actions, and replacing action values with action advantages where the average over possible actions is subtracted. We can apply the same fix: add an action-independent state value, and replace all subaction values with subaction advantage.</p>

\[Q(\mathbf{a}; s) = V(s) + \displaystyle\sum_{i = 1}^n \left ( Q(a_i; s ) - \displaystyle\sum_{a'_i \in A_i} \frac{1}{|A_i|} Q(a'_i; s) \right)\]

<p>This approach works well, and it’s essentially what the <a href="https://arxiv.org/abs/1711.08946">Action Branching Architectures</a> paper implements. They create independent advantages for each subaction and add them to a state value to make the Q-value.</p>
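As a sanity check of why the dueling-style fix works, here’s a small NumPy sketch with toy tabular values (not the paper’s network): centering each subaction’s values into advantages makes the decomposition invariant to exactly the constant shifts that caused the identifiability problem.

```python
import numpy as np

def combined_q(state_value, subaction_values):
    """Q(a; s) = V(s) + sum_i (Q(a_i; s) - mean over a'_i of Q(a'_i; s)),
    expanded over every combination of subactions (toy sizes only)."""
    advantages = [q - q.mean() for q in subaction_values]  # center each branch
    q = np.array(state_value, dtype=float)
    for i, adv in enumerate(advantages):  # broadcast-sum over the product space
        shape = [1] * len(advantages)
        shape[i] = len(adv)
        q = q + adv.reshape(shape)
    return q

rng = np.random.default_rng(0)
v = 1.5
branches = [rng.normal(size=3), rng.normal(size=4)]
table = combined_q(v, branches)
print(table.shape)                  # (3, 4)
print(np.isclose(table.mean(), v))  # True: advantages average to zero
# Shifting one branch by a constant no longer changes the Q-table
shifted = combined_q(v, [branches[0] + 10.0, branches[1]])
print(np.allclose(table, shifted))  # True
```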


<h2 id="the-independence-problem">The Independence Problem</h2>

<p>While the Action Branching paper treats independent subaction sampling as a feature, I think it’s actually a limitation.</p>

<p>Consider an agent learning to use an art program. For each brush stroke, it needs to simultaneously decide:</p>

<ul>
  <li><strong>What to draw</strong>: Sun, tree, lake, or cloud</li>
  <li><strong>Where to place it</strong>: Top, middle, bottom of canvas</li>
  <li><strong>Which color</strong>: Yellow, green, blue, or white</li>
</ul>

<p>You could draw a sun on the top in yellow, a lake in the middle in blue, and those combinations would make sense.</p>

<p>But with independent sampling, each subaction would have no idea on what it already chose, so you might get “draw sun + use blue + place at bottom” resulting in a blue sun underwater, or “draw lake + use yellow + place at top” giving you a floating yellow lake in the sky!</p>

<p>In Action Branching, the authors argue that the shared state representation can coordinate decisions. The idea is that the state embedding would have already decided
which action it should take, and the independent subactions will all agree to take the action according to the decision.</p>

<p>But it’s not obvious how this could work. Unlike policy gradient methods, where you only make one choice per subaction, in DQN you construct a Q-function that is defined over all possible actions.
If the subaction advantage values are generated independently, the resulting Q-function necessarily has no correlation between
different subactions for a given environment state.</p>

<div class="yonaka-quote-custom" data-image="/assets/images/characters/yonaka-loading.png" data-artist="Crescend Cinnamon" data-artist-link="https://bsky.app/profile/crescend.bsky.social">
  "I was trying to pour milk from the fridge for my cereal while being distracted, but somehow I ended up putting my phone in the fridge and trying to pour milk on my cereal from my empty hoof."
</div>

<h2 id="autoregressive-action-sampling">Autoregressive Action Sampling</h2>

<p>To handle actions being dependent on each other while keeping the sampling easy, I propose an autoregressive action sampling method, where each subaction is conditioned on the subactions already sampled within the same step.</p>

<p>To explain this, let me use the currying interpretation that helped last time. The Q-value function \(Q(s, A)\) could be written as \(Q(s)(a_0)(a_1)...(a_n)\) for some <a href="#action-order">order of subactions</a>.
The idea is to sample one subaction at a time to build up the final Q-value, but we need to be a bit careful, because each step is actually doing two things at once.</p>

<p>When I write \(Q(s)\) or \(Q(s)(a_0)...(a_k)\), the result is a function. What we want is for each step to also produce a value, so that each subaction can be sampled.
I want a Q-advantage function \(F\) that takes any of these intermediates and produces the advantage function for the next subaction. For discrete actions this looks like a table, while for continuous actions it looks like a spline curve.</p>

<p>So using these, the steps to sample the full action for a timestep are:</p>

<ol>
  <li>Make the embedding \(Q(s)\) and a state value that does not depend on this step’s action</li>
  <li>Make \(F(Q(s))\), which is a function of the first subaction, and sample \(a_0\) from it</li>
  <li>Make the next embedding \(Q(s)(a_0)\) from the sampled subaction</li>
  <li>Sample from \(F(Q(s)(a_0))\) to get \(a_1\), and repeat from step 3 until finished</li>
</ol>

<p>Then the Q-value for the state is the sum of state value and the subaction advantages.</p>
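The sampling loop above can be sketched like this, with a hypothetical `advantage_fns` interface standing in for \(F\) applied to each intermediate embedding:

```python
import numpy as np

def sample_action(state_embed, advantage_fns):
    """Greedy autoregressive sampling sketch. advantage_fns[i](state, prefix)
    stands in for F applied to the embedding built from the state and the
    already-sampled prefix; it returns the advantage table for subaction i."""
    prefix = ()
    total_advantage = 0.0
    for f in advantage_fns:
        adv = f(state_embed, prefix)   # advantages for the next subaction
        a_i = int(np.argmax(adv))      # greedy here; use epsilon-greedy in training
        total_advantage += adv[a_i] - adv.mean()  # centered, dueling-style
        prefix += (a_i,)
    return prefix, total_advantage     # Q = V(s) + total_advantage

# Toy example where the second subaction depends on the first choice
f0 = lambda s, p: np.array([0.0, 1.0])
f1 = lambda s, p: np.array([2.0, 0.0]) if p[0] == 1 else np.array([0.0, 2.0])
action, adv_sum = sample_action(None, [f0, f1])
print(action)  # (1, 0)
```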

<p><img src="/assets/images/posts/subactions/sampling.png" alt="Autoregressive Sampling" class="diagram" /></p>

<p>This way, each subaction decision can account for what was already decided. In the earlier drawing example, if it decided to draw a tree as the first subaction, it could then decide to start from the trunk, and then choose brown as the color.</p>

<h2 id="model-architecture">Model Architecture</h2>

<p>Autoregressive action sampling needs to do two things: sample actions, and evaluate a value given the state and the action taken.
Sampling autoregressively is a bit slow, since before a new subaction can be sampled, all the subactions before it need to be sampled first.
Evaluation, on the other hand, can be faster: you already have the full action that was taken, so the overhead of sampling one subaction at a time disappears.</p>

<p>For a recurrent network, these two are done exactly the same way, but transformers can take advantage of the faster evaluation.
Transformers can also use sampling optimizations like KV caching for each sampling step, or attention variants like Grouped Query Attention and Multi-head Latent Attention.</p>

<p>At first I tried full self-attention with the state and action embeddings concatenated together, but this performed worse than Action Branching.
I spent a lot of time trying to figure out why, and it seems there needs to be a clear separation between the states and the actions.
Cross-attention works much better, with the states as the query and the actions as the key and value. In the worst case, it behaves like Action Branching.</p>

<p><img src="/assets/images/posts/subactions/cross-attention.png" alt="Cross Attention" class="diagram" /></p>

<p>What I like to do is add a beginning-of-sequence token before the first action token, so that there is something for the state embedding to attend to during the first sampling step.</p>
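A stripped-down sketch of that cross-attention step, with the state as query and the action tokens (BOS first) as key and value. This omits the learned query/key/value projections and multiple heads a real implementation would have:

```python
import numpy as np

def cross_attention(state_query, action_tokens):
    """Single-head cross attention: the state embedding attends over the
    action-token sequence (no learned projections in this sketch)."""
    d = state_query.shape[-1]
    scores = state_query @ action_tokens.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ action_tokens

d = 8
rng = np.random.default_rng(0)
state = rng.normal(size=(1, d))
bos = np.zeros((1, d))              # beginning-of-sequence token to attend to
sampled = rng.normal(size=(2, d))   # embeddings of subactions sampled so far
out = cross_attention(state, np.concatenate([bos, sampled]))
print(out.shape)  # (1, 8)
```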

<h2 id="action-order">Subaction Ordering Problem</h2>

<p>With autoregressive sampling, there needs to be an order in which to sample the subactions, but how do you decide that order?</p>

<p>In general, subactions are a set with no inherent order. There might be a more natural order for a given environment, but without some domain knowledge
it’s impossible to know which orders are better. You need to already know something about the environment.</p>

<p>Right now I pick a predetermined order at random, and this works, but it feels unsatisfying.
I have two ideas on how to do better, but I haven’t been able to make them work yet; they’re in the <a href="#action-order-method">Action Order</a> section.</p>

<h2 id="conclusion">Conclusion</h2>

<p>I couldn’t really get a clear <a href="#experiment-results">result</a> after much trial and error, and I underestimated just how well the prior work, Action Branching, performs in practice.
I can come up with cases where Action Branching would definitely fail, but I don’t yet have any environments where that’s actually a problem.</p>

<p>Q-learning can handle actions composed of multiple subactions just fine, without hitting the combinatorial explosion people typically face.
I’ve shown that, combined with the spline action space for continuous actions, Q-learning can be used in any environment, just like policy gradient methods.</p>

<div class="kazumi-quote">
When writing this blog post, I've held off comparing against a baseline, and when I finally did compare, found out my method was much worse. That was kind of a good experience though, because it gave me clues on how to improve it, and also taught me that when trying out something new, I should test it against a baseline as soon as possible.
</div>

<h1 id="addendum">Addendum</h1>

<details class="collapse-section">
  <summary>Prior Works</summary>
  <div class="collapse-content">
    
<p>The core idea of handling multiple subactions with Q-learning was introduced in <a href="https://arxiv.org/abs/1711.08946">Action Branching Architectures for Deep Reinforcement Learning</a>. They use independent Q-functions for each subaction and combine them using the dueling network architecture.</p>

<p>I had a similar idea independently, but found their paper while researching for this blog and realized they had already solved the basic version of this problem. Their approach works well for many cases, but struggles with scenarios that require coordinated subaction strategies.</p>

<p>The autoregressive approach I’m proposing builds on their foundation but adds the ability to handle action dependencies through sequential conditioning.</p>


  </div>
</details>

<h3 id="action-order-method">Action Order</h3>

<details class="collapse-section">
  <summary>Dynamic Action Order</summary>
  <div class="collapse-content">
    
<p>If I could try different orderings and learn which performs better dynamically, that could beat picking a random order.</p>

<p>To make dynamic action ordering work, I imagined sampling a permutation matrix that determines the sampling order, and learning orders that generally improve rewards as the model trains. This needs some algorithm to decide which order should be used.</p>

<p>To get to an algorithm, I made a few assumptions: if one subaction order is better than another, the better order should give less uncertain predictions, and the temporal difference loss can be used as a proxy for that uncertainty.</p>

<p>To simplify, I’ll also assume that the TD loss scale is determined only by the order of subactions, and is the sum of pairwise values associated with consecutive subactions in that order.</p>

<p>We can’t just compare the TD loss per subaction, for the same reason the TD loss needed to be summed over the state value and action value: a subaction sampled earlier might give a subaction sampled later better options.</p>

<p>Given these assumptions, I could formulate this as an optimization problem:</p>

<ul>
  <li>$n$ vertices are connected by directed edges with unknown weights. Given sampled Hamiltonian paths together with their total edge weights, what is the path that most likely minimizes the total weight, based on the information gathered from sampling?</li>
</ul>

<p>Each vertex represents a subaction, the directed edges between them represent the sampling order, and the weights are how much each consecutive pair contributes to the total TD loss.</p>

<p>If the edge weights were known, this would just be a traveling salesman problem. You could maybe set up a linear system to estimate the weights as you sample, but that means solving an $n^2$ by $n^2$ system, which is already $O(n^6)$ even before the TSP step.</p>
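As a sketch of the linear-system idea (using least squares instead of an exact solve; purely illustrative):

```python
import numpy as np

def recover_edge_weights(paths, totals, n):
    """Least-squares estimate of directed edge weights from sampled
    Hamiltonian paths and their observed total weights."""
    rows = []
    for path in paths:
        row = np.zeros(n * n)
        for u, v in zip(path, path[1:]):
            row[u * n + v] = 1.0  # edge u -> v appears once in this path
        rows.append(row)
    est, *_ = np.linalg.lstsq(np.array(rows), np.array(totals), rcond=None)
    return est.reshape(n, n)

# Synthetic check: known weights on 4 subactions, randomly sampled orders
rng = np.random.default_rng(0)
n = 4
true_w = rng.normal(size=(n, n))
paths = [tuple(rng.permutation(n)) for _ in range(200)]
totals = [sum(true_w[u, v] for u, v in zip(p, p[1:])) for p in paths]
est = recover_edge_weights(paths, totals, n)
# The estimate reproduces every observed path total
pred = [sum(est[u, v] for u, v in zip(p, p[1:])) for p in paths]
print(np.allclose(pred, totals))  # True
```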

<p>Maybe there is a better way to figure out the weights, or maybe there is a way that doesn’t even need to figure out the weight? I got stuck trying to figure this out.</p>

<p>The assumption that the TD loss is determined only by consecutive pairs probably doesn’t hold up: how much a subaction contributes likely depends on which subactions were taken before it, regardless of how far back they were sampled.</p>

<p>So I could try another assumption. Instead of only consecutive subactions interacting, suppose every pair of subactions has an ordering preference: a value assigned based on whether one subaction comes before another, regardless of how far apart they are. The TD loss would be the sum of all ordering preferences.</p>

<p>This changes the problem into a Linear Ordering Problem when the values are known, which is also NP-hard.</p>

<p>I couldn’t figure out a good way to do this, and this is about the time I started thinking about Action Latents so I’ve put it on indefinite hold.</p>


  </div>
</details>

<details class="collapse-section">
  <summary>Latent Action Space</summary>
  <div class="collapse-content">
    
<p>This is the other idea I came up with while writing this blog post, and I’m still exploring it.</p>

<p>If picking an order is a problem because I don’t know which order would be better, what if I make my own representation of actions, and sample from that?</p>

<p>Imagine you’re riding a bicycle. You don’t consciously think of what muscles to pull at what moment, you have a conceptual understanding of what to do, and let your muscle memory figure out the small movements. In order for your body to figure that out, it takes a while of trying.</p>

<p>A latent action space is kind of like that: instead of dealing with the raw actions, the agent could learn to conceptualize actions in meaningful ways, and act in that space instead.</p>

<p>The learned representations would still be sampled in some order, but the representation could be learned such that the order in which it gets sampled is the optimal order to sample in.</p>

<p>Looking for papers that implement this, I came across <a href="https://arxiv.org/pdf/2103.15793">LASER</a>, Learning a Latent Action Space for Efficient Reinforcement Learning.</p>

<p><img src="/assets/images/posts/subactions/laser.png" alt="Laser Overview" /></p>

<p>The idea is to abstract away actions by dealing with them in a latent space that might be easier to reason about. The properties we want from the latent space are:</p>

<ul>
  <li>Latent actions should preserve the important information of the original action, and allow it to be uniquely reconstructed</li>
  <li>Latent actions should make it easier to reason about the state, and make the future more easily predictable</li>
  <li>Actions that result in similar outcomes should be close together in latent space</li>
  <li>Sampling from a typical latent action distribution and acting on it should naturally give higher rewards than sampling from an atypical latent distribution</li>
</ul>

<p>To achieve this, the collected state \(s\), action \(a\) and state transition \(s'\) from interacting with the environment are used to train an Encoder \(E(a, s)\) to encode actions into latent action \(\overline{a}\), Decoder \(D(\overline{a}, s)\) that decodes latent action into action as a variational autoencoder pair, and a latent state transition function \(\overline{T}(s, \overline{a})\) that predicts next state given previous state and latent action.</p>

<p>Then these are trained using</p>

<ul>
  <li><strong>Action Reconstruction loss</strong>: Squared error \(\| a - D(E(a, s), s) \|^2_2\) if continuous, or Cross entropy \(-a \log( D(E(a, s), s))\) if discrete</li>
  <li><strong>Dynamics loss</strong>: Squared error \(\| s' - \overline{T}(s, E(a, s)) \|^2_2\)</li>
  <li><strong>Regularization loss</strong>: KL divergence \(KL(N(\mu, \sigma) \| N(0, I))\) where \(\mu, \sigma\) are output by \(E(a, s)\)</li>
  <li><strong>Policy loss</strong>: \(- Q_{policy}(E(a, s), s)\)</li>
</ul>
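A minimal sketch of the first three losses for a continuous action, with `encode`, `decode`, and `transition` as stand-ins for the learned networks (the policy loss is omitted since it needs the Q-net, and I skip the reparameterization noise to keep it deterministic):

```python
import numpy as np

def laser_losses(s, a, s_next, encode, decode, transition):
    """Reconstruction, dynamics, and KL losses from the list above.
    encode returns (mu, log_sigma); we use the mean as the latent action."""
    mu, log_sigma = encode(a, s)
    z = mu                                                # latent action
    recon = np.sum((a - decode(z, s)) ** 2)               # action reconstruction
    dynamics = np.sum((s_next - transition(s, z)) ** 2)   # latent dynamics
    sigma2 = np.exp(2 * log_sigma)
    kl = 0.5 * np.sum(sigma2 + mu ** 2 - 1 - 2 * log_sigma)  # KL(N(mu,sigma) || N(0,I))
    return recon, dynamics, kl

# Toy stand-ins: identity encoder/decoder, transition predicts s + z
encode = lambda a, s: (a, np.zeros_like(a))
decode = lambda z, s: z
transition = lambda s, z: s + z
a, s = np.array([0.5, -0.5]), np.zeros(2)
recon, dyn, kl = laser_losses(s, a, s + a, encode, decode, transition)
print(recon, dyn)  # 0.0 0.0: perfect reconstruction and dynamics
```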

<p>I’m still trying to figure out how to make this work with Q-learning; I’m hoping it will solve the action order problem.
It actually kind of works already, just not as well as I’d like. This was mostly me throwing ideas around to see what sticks, and I’d like to spend more time on it.</p>


  </div>
</details>

<h3 id="experiment-results">Results</h3>

<details class="collapse-section">
  <summary>Results</summary>
  <div class="collapse-content">
    
<p>Honestly, I’m not sure what conclusion to draw from the final results, since everything ended up performing pretty much the same as everything else.
Action Branching is actually pretty good, it turns out. It’s also still faster, even though autoregressive sampling has many points left to optimize.
For most environments, it’s probably fine to just use Action Branching first, and then see if my method does any better.</p>

<p>In some environments, Action Latents do much worse than not using them, while in others it’s about the same.
Action Latents need some warmup time where only the latent encoder, decoder, and dynamics model are trained.
During that time the Q-net doesn’t learn anything, yet the agent seems to get better scores than random actions, which is kind of weird.
My guess is that, even without knowing which actions are better or worse, doing things that lead to more interesting states generally earns higher reward, so maybe it learns which actions achieve nothing and avoids them.</p>

<p>I think right now, I need more environments to test on. If every model reaches the same performance at the same time, it usually means you need a better test to tell which is actually better.</p>

<p><img src="/assets/images/results/subactions/walker.png" alt="Walker Result" /></p>


  </div>
</details>]]></content><author><name>Kazumi</name></author><category term="reinforcement-learning" /><category term="dqn" /></entry><entry><title type="html">Extending DQN to Continuous Action Spaces with Cubic Splines</title><link href="https://yonakalabs.net/DQN-spline/" rel="alternate" type="text/html" title="Extending DQN to Continuous Action Spaces with Cubic Splines" /><published>2025-04-18T22:00:00+09:00</published><updated>2025-04-18T22:00:00+09:00</updated><id>https://yonakalabs.net/DQN-spline</id><content type="html" xml:base="https://yonakalabs.net/DQN-spline/"><![CDATA[<p>One of the main things that turns people away from using Deep Q-Learning is its inability to handle continuous actions or multiple sub-actions. In <a href="https://stable-baselines3.readthedocs.io/en/master/guide/algos.html">Stable Baselines 3</a>, they have a table of reinforcement learning algorithms and what kind of action spaces they each work in.</p>

<p><img src="/assets/images/posts/spline/stable-baselines-comparison.png" alt="Stable Baselines 3 Algorithm Comparison" /></p>

<p>In their table, DQN only has a tick on the Discrete actions box. That is very limiting! It would be nice if there was an easy and cheap way of allowing DQN to work with continuous and multiple actions. But for now, let’s focus on how to make the first one work.</p>

<h2 id="the-problem-with-discrete-only-actions">The Problem with Discrete-Only Actions</h2>

<p>In games such as fighting games, where an agent selects from a set of actions (move left, jump, shoot), a normal DQN works wonderfully. But what about games that need more precise control? Think about:</p>

<ul>
  <li>A car adjusting its steering angle</li>
  <li>Twinstick shooter like Binding of Isaac</li>
  <li>A game like Minecraft where you need both discrete actions (moving with WASD keys, mining with click) and continuous control (moving the camera around)</li>
</ul>

<p>Eventually I would have to build an agent that works with continuous control, but I knew DQN wouldn’t work out of the box. The standard approach of discretizing the action space into bins technically works, but produces jerky, unnatural movement. Imagine a car that can only turn its steering wheel in 10-degree increments instead of smoothly!</p>

<p>Most practitioners simply avoid DQN altogether for these tasks, moving to algorithms specifically designed for continuous control like DDPG or SAC. But I wondered: could we adapt DQN to handle continuous actions elegantly?</p>

<h2 id="why-cant-dqn-handle-continuous-actions">Why Can’t DQN Handle Continuous Actions?</h2>

<p>To understand the problem, we need to revisit how Q-learning actually works.</p>

<p>In DQN, the Q-function represents the expected future reward when taking action $a$ in state $s$, then following the policy afterward. This is written as $Q(s, a)$.</p>

<p>For an agent to act, it needs to find the action that maximizes this Q-function:</p>

\[a^* = \arg\max_a Q(s, a)\]

<p>For discrete actions, this is straightforward. If you have 4 possible actions, you calculate a Q-value for each one and pick the highest. Done!</p>

<p>But what happens with continuous actions? If an action can be any value between, say, 0 and 1, we can’t simply enumerate all possibilities.</p>

<p><img src="/assets/images/posts/spline/discrete-vs-continuous.png" alt="Discrete vs Continuous Action Space" class="diagram" /></p>

<h2 id="the-standard-solution-discretization">The Standard Solution: Discretization</h2>

<p>The most common approach is to simply chop up (discretize) the continuous action space into a finite set of actions.</p>

<p>For example, if your action space is $[0, 1]$, you might use $\{0, 0.1, 0.2, \dots, 0.9, 1.0\}$ as your discrete approximation.</p>
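With a made-up Q-function over $[0, 1]$ and the 11 bins above, the resolution limit is easy to see:

```python
import numpy as np

# A made-up Q_s(a) over the action interval [0, 1]; the true best action is 0.37
q_s = lambda a: -(a - 0.37) ** 2

bins = np.linspace(0.0, 1.0, 11)  # {0, 0.1, ..., 1.0}
best = bins[np.argmax(q_s(bins))]
print(best)  # 0.4: the agent can never act closer than the bin resolution allows
```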

<p><img src="/assets/images/posts/spline/discretization-diagram.png" alt="Discretization Diagram" class="diagram" /></p>

<p>This works, but has significant drawbacks:</p>

<ol>
  <li><strong>Resolution problems</strong>: Too few points and your agent can’t make fine adjustments; too many and learning becomes inefficient</li>
  <li><strong>No knowledge transfer</strong>: Learning that an action is good doesn’t tell the agent whether a similar action would also be good</li>
  <li><strong>Curse of dimensionality</strong>: Discretizing multiple continuous actions leads to combinatorial explosion (more on this in next post!)</li>
</ol>

<div class="yonaka-quote-custom" data-image="/assets/images/characters/yonaka-confused.png" data-artist="sroka001" data-artist-link="https://bsky.app/profile/sroka001.bsky.social">
  "I tried moving 45 degrees to the left and it worked well... but should I try 44? 46?"
</div>

<h2 id="a-different-way-of-looking-at-q-functions">A Different Way of Looking at Q-Functions</h2>

<p>Let’s think about what happens when we’re trying to select an action. Notice something important:</p>

<p>For a given state $s$, the argmax operation over actions doesn’t depend on the state anymore. We’ve essentially “locked in” our state and now just need to find the best action for that particular state.</p>

<p>This means, to make the argmax operation easier, we could curry the state into the Q-function $Q(s, a)$ to make a simpler function that only depends on the action $Q_s(a)$, and then take the maximum over the action:</p>

\[Q_s(a) = Q(s, a) \text{ where } s \text{ is fixed}\]

<p>For discrete actions, $Q_s(a)$ is just a lookup table! Finding the maximum value in a table is trivial.</p>

<p>But for continuous actions, $Q_s(a)$ becomes a continuous function over the action space. Finding the maximum of an arbitrary continuous function is much harder.</p>

<h2 id="what-we-need-in-a-continuous-q-function">What We Need in a Continuous Q-Function</h2>

<p>If we want to use Q-learning with continuous actions, our representation of the Q-function needs to support several operations:</p>

<ol>
  <li><strong>Evaluation</strong>: We need to compute $Q(s, a)$ for any action $a$</li>
  <li><strong>Maximization</strong>: We need to efficiently find the action $a$ that maximizes $Q(s, a)$</li>
  <li><strong>Integration</strong>: For some advanced techniques like Dueling Networks, we need to compute the average Q-value across all actions</li>
  <li><strong>Addition</strong>: We need to be able to add Q-functions together (useful for ensemble methods)</li>
</ol>

<p><img src="/assets/images/posts/spline/continuous-values-diagram.png" alt="Continuous Q Value Diagram" class="diagram" /></p>

<p>Many function approximators can handle evaluation, but maximization and integration are trickier. Neural networks, for instance, make evaluation easy but finding the global maximum is very difficult.</p>

<p>So what kind of mathematical construct could satisfy all these requirements?</p>

<h2 id="using-natural-cubic-splines">Using Natural Cubic Splines</h2>

<p>A cubic spline is a piecewise function made up of cubic polynomials that are smoothly connected at specific points called knots.</p>

<p><img src="/assets/images/posts/spline/cubic-spline-diagram.png" alt="Cubic Spline Diagram" class="diagram" /></p>

<p>Cubic splines have several properties that make them perfect for our needs:</p>

<ol>
  <li>They’re smooth and continuous</li>
  <li>They can approximate any continuous function (with enough knots)</li>
  <li>We can analytically find their maximums and compute their integrals</li>
  <li>They’re closed under addition (adding two cubic splines gives you another cubic spline)</li>
</ol>

<h3 id="how-cubic-splines-work">How Cubic Splines Work</h3>

<p>A cubic spline is defined by a set of control points (or knots) $(x_0, y_0), \dots, (x_n, y_n), (x_{n+1}, y_{n+1})$ where the $x$ values are positions in our action space and the $y$ values are our estimated Q-values at those actions.</p>

<p>Given these knots, the spline is</p>

\[\begin{align}S(x) &amp;= a_i t^3 + b_i t^2 + c_i t + d_i &amp; \text{where} &amp; &amp; x_i \leq x \leq x_{i+1} &amp; \text{,} &amp; t = \frac{x-x_i}{x_{i+1} - x_i}  \end{align}\]

<p>These polynomials are crafted to ensure that:</p>

<ul>
  <li>The spline passes through all control points</li>
  <li>The first and second derivatives match at each interior control point</li>
  <li>Specific boundary conditions are met at the endpoints</li>
</ul>

<p>I find that it’s much easier to handle if the internal coordinates of each polynomial goes from 0 to 1, and we translate when using them.</p>

<p>Check out <a href="https://mathworld.wolfram.com/CubicSpline.html">WolframMathWorld</a> for the cubic polynomial formula when the knots are equidistant, and the <a href="#spline-formula">Addendum</a> for non-equidistant knots.</p>
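<p>As a quick sanity check of the properties above, here is a minimal sketch using SciPy’s <code>CubicSpline</code> (not the PyTorch implementation in the Addendum) with natural boundary conditions; the knot values are made up purely for illustration:</p>

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Knot positions (actions) and Q-value estimates at those actions;
# the y values here are arbitrary, purely for illustration.
x = np.linspace(0.0, 1.0, 5)                 # equidistant knots in [0, 1]
y = np.array([0.0, 0.8, 0.3, 0.9, 0.2])

# 'natural' boundary conditions: zero second derivative at both endpoints.
spline = CubicSpline(x, y, bc_type="natural")

# The spline passes through every knot exactly...
assert np.allclose(spline(x), y)
# ...and its second derivative vanishes at the boundaries.
assert abs(spline(x[0], 2)) < 1e-8 and abs(spline(x[-1], 2)) < 1e-8
```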

<h2 id="operations-on-cubic-splines">Operations on Cubic Splines</h2>

<p>Now let’s see how cubic splines handle all the operations we need:</p>

<h3 id="1-evaluation">1. Evaluation</h3>

<p>To evaluate a cubic spline at a particular action value:</p>

<ol>
  <li>Find which segment the action falls into</li>
  <li>Evaluate the cubic polynomial for that segment</li>
</ol>
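<p>The two steps above can be sketched in plain NumPy, assuming the per-segment coefficients are stored as rows $(a, b, c, d)$ with the local coordinate $t$ running from 0 to 1, mirroring the layout used in the implementation in the Addendum (the helper name is hypothetical):</p>

```python
import numpy as np

def eval_spline(coeffs, x, x_min=0.0, x_max=1.0):
    """Evaluate a spline stored as per-segment cubic coefficients.

    coeffs has shape (num_segments, 4): each row is (a, b, c, d) for
    S_i(t) = a t^3 + b t^2 + c t + d, with local t in [0, 1].
    """
    num_segments = coeffs.shape[0]
    # Step 1: find which segment x falls into, and its local coordinate t.
    u = (x - x_min) / (x_max - x_min) * num_segments
    seg = int(np.clip(np.floor(u), 0, num_segments - 1))
    t = u - seg
    # Step 2: evaluate that segment's cubic (Horner form).
    a, b, c, d = coeffs[seg]
    return ((a * t + b) * t + c) * t + d

# Two segments that together represent f(x) = 2x on [0, 1].
coeffs = np.array([[0.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
assert eval_spline(coeffs, 0.25) == 0.5
assert eval_spline(coeffs, 0.75) == 1.5
```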

<h3 id="2-finding-the-maximum">2. Finding the Maximum</h3>

<p>We can use the derivative test on each segment to find all the candidate points, and then take the maximum over those.</p>

<p>For each cubic polynomial segment:</p>

<ol>
  <li>Calculate its derivative curve</li>
  <li>Find the roots of the derivative (1st derivative test)</li>
  <li>Evaluate the spline at these points and at the boundaries</li>
  <li>Take the maximum of all these values</li>
</ol>

<p><img src="/assets/images/posts/spline/maximization-diagram.png" alt="Maximization of Spline" class="diagram" /></p>

<p>Since we’re dealing with cubic polynomials, the derivative is quadratic, and finding the roots of a quadratic equation is trivial using the quadratic formula.</p>

<p>And we can even use the 2nd derivative test to discard the local minima, halving the number of candidate points to search!</p>
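<p>For a single segment, the whole procedure fits in a few lines. This is a sketch under the same $(a, b, c, d)$ coefficient convention as before, with local $t \in [0, 1]$ (function name hypothetical):</p>

```python
import numpy as np

def segment_max(a, b, c, d):
    """Maximum of S(t) = a t^3 + b t^2 + c t + d over t in [0, 1].

    Candidates are the two endpoints plus any real roots of the
    quadratic derivative S'(t) = 3a t^2 + 2b t + c.
    """
    candidates = [0.0, 1.0]
    disc = b * b - 3.0 * a * c        # discriminant of S'(t), up to a factor of 4
    if a != 0.0 and disc >= 0.0:
        r = np.sqrt(disc)
        candidates += [(-b + r) / (3.0 * a), (-b - r) / (3.0 * a)]
    elif a == 0.0 and b != 0.0:
        candidates.append(-c / (2.0 * b))   # S' is linear in this segment
    S = lambda t: ((a * t + b) * t + c) * t + d
    return max(S(t) for t in candidates if 0.0 <= t <= 1.0)

# S(t) = -(t - 0.5)^2 peaks at t = 0.5 with value 0.
assert abs(segment_max(0.0, -1.0, 1.0, -0.25)) < 1e-12
```

The full-spline maximum is then just the max of <code>segment_max</code> over all segments.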

<h3 id="3-computing-the-mean">3. Computing the Mean</h3>

<p>Taking the mean of the Q-function over the action space is needed in methods like Dueling Networks and a few others.</p>

<p>The mean value of a function over its entire input space can be computed by taking the integral and dividing by the size of the input space.</p>

\[\mu = \int_{\min}^{\max} \frac{ S(x)}{\max - \min} dx\]

<p>For our cubic spline, we just need to integrate each cubic polynomial, weight it by the length of its segment, and sum the results.</p>

<p>If we made the internal coordinates go from 0 to 1, the integrals reduce to fixed weights, and everything simplifies to a single einsum expression:</p>

\[\frac{ \left[\frac{1}{4}, \frac{1}{3}, \frac{1}{2}, 1 \right]_i Coeff_{ij} \Delta x_j}{x_{n+1} - x_0}\]

<p><a href="#spline-mean">Derivation</a></p>
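<p>In code, the expression above is just a dot product with fixed weights. Here is a NumPy sketch of the general, possibly non-equidistant case (the function name is hypothetical; the equidistant case simply divides by the number of segments instead):</p>

```python
import numpy as np

def spline_mean(coeffs, seg_lengths):
    """Mean of a spline given per-segment (a, b, c, d) coefficients.

    With local t in [0, 1], each segment integrates to
    a/4 + b/3 + c/2 + d; weight each integral by its segment's
    length in x and divide by the total domain length.
    """
    weights = np.array([1.0 / 4.0, 1.0 / 3.0, 1.0 / 2.0, 1.0])
    per_segment = coeffs @ weights        # integral of each segment over t
    return per_segment @ seg_lengths / seg_lengths.sum()

# Two segments representing f(x) = 2x on [0, 1]; its mean is 1.
coeffs = np.array([[0.0, 0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0, 1.0]])
assert abs(spline_mean(coeffs, np.array([0.5, 0.5])) - 1.0) < 1e-12
```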

<h3 id="4-adding-splines">4. Adding Splines</h3>

<p>Adding Q-functions together is needed in some methods, such as extended Dueling Networks or multi-goal learning.</p>

<p>Adding two cubic splines is straightforward:</p>

<ol>
  <li>Combine all unique knot points</li>
  <li>For each segment in the combined domain, add the corresponding polynomial coefficients</li>
</ol>
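<p>When the two splines happen to share the same knots, step 1 is a no-op and step 2 is literally coefficient addition. A sketch with SciPy, using <code>PPoly.construct_fast</code> to build a piecewise polynomial directly from summed coefficients:</p>

```python
import numpy as np
from scipy.interpolate import CubicSpline, PPoly

x = np.linspace(0.0, 1.0, 6)
s1 = CubicSpline(x, np.sin(2.0 * np.pi * x), bc_type="natural")
s2 = CubicSpline(x, np.cos(2.0 * np.pi * x), bc_type="natural")

# With identical knots, adding the per-segment coefficients adds the splines.
s_sum = PPoly.construct_fast(s1.c + s2.c, s1.x)

grid = np.linspace(0.0, 1.0, 101)
assert np.allclose(s_sum(grid), s1(grid) + s2(grid))
```

Note the sum is a cubic spline interpolating the pointwise sums of the knot values, which is why closure under addition holds.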

<h2 id="advantages-of-spline">Advantages of Spline</h2>

<p>Using cubic splines to represent our Q-function gives us several advantages:</p>

<ol>
  <li><strong>Smooth approximation</strong>: Unlike discretization, splines provide a continuous representation with few points</li>
  <li><strong>Knowledge transfer</strong>: Learning about the Q-value at one action informs us about nearby actions</li>
  <li><strong>Analytical maximization</strong>: The optimal action can be found precisely and efficiently, without needing to evaluate the entire space</li>
<li><strong>Circular action spaces</strong>: Spline curves can have their endpoints joined with continuous derivatives, handling angular actions well</li>
</ol>

<p>I am not aware of any easily usable environments with circular action spaces to experiment in yet; let me know if you know of one.</p>

<h2 id="conclusion">Conclusion</h2>

<p>DQN doesn’t have to be limited to discrete action spaces. By representing the Q-function as a cubic spline, we can enable DQN to work with continuous actions, without adding too much overhead.</p>

<p>Since splines are controlled by knots, they take exactly the same shape of network outputs as you would have used for discretized actions, making them pretty much a drop-in replacement.</p>

<p>In the next post, I’ll show how to solve the other limitation of Q-learning: handling multiple subactions in a step without getting hit by the curse of dimensionality.</p>

<div class="yonaka-quote">
"I used to be limited to jumping between discrete steps, but with splines, I can slide smoothly through the action space! No more awkward robot movements - now my actions can be as fluid as a human player's!"
</div>

<h1 id="addendum">Addendum</h1>

<details class="collapse-section">
  <summary>Prior Works</summary>
  <div class="collapse-content">
    
<p>I have found some prior work on this. One is <a href="https://arxiv.org/abs/1909.12397">CAQL: Continuous Action Q-Learning</a>, where they build a tiny ReLU network and use Mixed-Integer Programming (essentially linear programming) to find the maximum.</p>

<p>The method has limitations: it is slow, since each forward pass needs to solve an optimization problem, and it only works when the action space is small; they only test on environments with very limited action spaces.</p>

<p>I also found a work from 1993 that uses spline curves to handle continuous action spaces, called <a href="https://apps.dtic.mil/sti/tr/pdf/ADA280844.pdf">Wire Fitting</a>. This is what the front page looks like:</p>

<p><img src="/assets/images/posts/spline/wirefitting.png" alt="REINFORCEMENT LEARNING WITH HIGHDIMENSIONAL, CONTINUOUS ACTIONS" /></p>


  </div>
</details>

<h2 id="experimental-results">Experimental Results</h2>

<details class="collapse-section">
  <summary>Results</summary>
  <div class="collapse-content">
    
<p>Here I put Weights &amp; Biases plots of rewards for runs done with discretized actions and spline actions.</p>

<p>The environments are Reacher from <a href="https://gymnasium.farama.org/environments/mujoco/reacher/">MuJoCo</a>, and the Walker and Finger environments from DeepMind’s <a href="https://github.com/google-deepmind/dm_control/tree/main">Control Suite</a>.</p>

<p>(click to enlarge)</p>

<div class="grid">
    <img src="/assets/images/results/spline/fingerspin.png" alt="fingerspin" title="fingerspin" />
    <img src="/assets/images/results/spline/walker.png" alt="walker" title="walker" />
    <img src="/assets/images/results/spline/reacher.png" alt="reacher" title="reacher" />
</div>

<p>Comparison of Walker performance between discretized action and spline action.</p>

<div class="grid">
    <img src="/assets/images/results/spline/discretized.gif" alt="buttslide" title="buttslide" />
    <img src="/assets/images/results/spline/spline.gif" alt="I'm late!" title="I'm late!" />
</div>

<p>The discretized agent is doing a butt slide; this strategy seems to be a very stable way of moving that won’t fall over, but it limits how fast the agent can move.</p>

<p>The spline agent is running as fast as it can, losing balance but quickly standing up, only to fall again. It seems to be prioritizing short-term gain over long-term gain.</p>

<div class="kazumi-quote">
Side note, have you ever noticed that some agents with less capacity to learn will converge to a very safe strategy that's hard to mess up, while some agents with high capacity might not even learn a strategy, just have good execution? There is sometimes a sort of Strategy vs Execution trade off that happens.
</div>

<p>Real-time Q-function graph of the Reacher environment:</p>

<p><img src="/assets/images/results/spline/reacher.gif" alt="reacher" title="reacher" /></p>


  </div>
</details>

<h2 id="code-implementation">Code Implementation</h2>

<details class="collapse-section">
  <summary>Spline Implementation</summary>
  <div class="collapse-content">
    
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="n">torch</span>
<span class="kn">import</span> <span class="n">einops</span>

<span class="k">class</span> <span class="nc">SplineLayer</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">num_points</span><span class="p">,</span> <span class="nb">min</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">max</span> <span class="o">=</span> <span class="mi">1</span><span class="p">):</span>
        <span class="nf">super</span><span class="p">().</span><span class="nf">__init__</span><span class="p">()</span>
        <span class="n">self</span><span class="p">.</span><span class="n">num_points</span> <span class="o">=</span> <span class="n">num_points</span>
        <span class="n">self</span><span class="p">.</span><span class="nb">min</span> <span class="o">=</span> <span class="nb">min</span>
        <span class="n">self</span><span class="p">.</span><span class="nb">max</span> <span class="o">=</span> <span class="nb">max</span>

        <span class="n">self</span><span class="p">.</span><span class="nf">register_buffer</span><span class="p">(</span><span class="sh">"</span><span class="s">inverse</span><span class="sh">"</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="nf">_precompute_inverse</span><span class="p">(),</span> <span class="n">persistent</span> <span class="o">=</span> <span class="bp">False</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">_precompute_inverse</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="n">n</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">num_points</span>
        
        <span class="n">diag</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">ones</span><span class="p">(</span><span class="n">n</span><span class="p">)</span> <span class="o">*</span> <span class="mi">4</span>
        <span class="n">diag</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="n">diag</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">2</span>
        
        <span class="n">off_diag</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">ones</span><span class="p">(</span><span class="n">n</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">A</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">diag</span><span class="p">(</span><span class="n">diag</span><span class="p">)</span> <span class="o">+</span> <span class="n">torch</span><span class="p">.</span><span class="nf">diag</span><span class="p">(</span><span class="n">off_diag</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">torch</span><span class="p">.</span><span class="nf">diag</span><span class="p">(</span><span class="n">off_diag</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>

        <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="nf">inv</span><span class="p">(</span><span class="n">A</span><span class="p">)</span> <span class="o">*</span> <span class="mi">3</span>

    <span class="k">def</span> <span class="nf">_compute_coefficients</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="o">*</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span>

        <span class="n">rhs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">zeros</span><span class="p">((</span><span class="o">*</span><span class="n">batch_size</span><span class="p">,</span> <span class="n">n</span><span class="p">),</span> <span class="n">device</span><span class="o">=</span><span class="n">y</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>
        <span class="n">rhs</span><span class="p">[...,</span> <span class="mi">0</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="p">[...,</span><span class="mi">1</span><span class="p">]</span>  <span class="o">-</span> <span class="n">y</span><span class="p">[...,</span><span class="mi">0</span><span class="p">])</span>
        <span class="n">rhs</span><span class="p">[...,</span> <span class="mi">1</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="p">[...,</span><span class="mi">2</span><span class="p">:]</span><span class="o">-</span> <span class="n">y</span><span class="p">[...,:</span><span class="o">-</span><span class="mi">2</span><span class="p">])</span>
        <span class="n">rhs</span><span class="p">[...,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="p">(</span><span class="n">y</span><span class="p">[...,</span> <span class="o">-</span><span class="mi">1</span><span class="p">]</span>  <span class="o">-</span> <span class="n">y</span><span class="p">[...,</span><span class="o">-</span><span class="mi">2</span><span class="p">])</span>

        <span class="n">D</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">matmul</span><span class="p">(</span><span class="n">rhs</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">inverse</span><span class="p">)</span>
        
        <span class="n">yi</span> <span class="o">=</span> <span class="n">y</span><span class="p">[...,</span> <span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">yi1</span> <span class="o">=</span> <span class="n">y</span><span class="p">[...,</span> <span class="mi">1</span><span class="p">:]</span>
        <span class="n">di</span> <span class="o">=</span> <span class="n">D</span><span class="p">[...,</span> <span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">di1</span> <span class="o">=</span> <span class="n">D</span><span class="p">[...,</span> <span class="mi">1</span><span class="p">:]</span>

        <span class="n">coeffs</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">stack</span><span class="p">([(</span><span class="mi">2</span><span class="o">*</span><span class="p">(</span><span class="n">yi</span><span class="o">-</span><span class="n">yi1</span><span class="p">)</span><span class="o">+</span><span class="n">di</span><span class="o">+</span><span class="n">di1</span><span class="p">),</span>  <span class="p">(</span><span class="mi">3</span> <span class="o">*</span> <span class="p">(</span><span class="n">yi1</span><span class="o">-</span><span class="n">yi</span><span class="p">)</span><span class="o">-</span><span class="mi">2</span><span class="o">*</span><span class="n">di</span><span class="o">-</span><span class="n">di1</span><span class="p">),</span> <span class="n">di</span><span class="p">,</span> <span class="n">yi</span><span class="p">],</span> <span class="n">dim</span><span class="o">=-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">coeffs</span>
    
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">points</span><span class="p">):</span>
        <span class="n">coefficients</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="nf">_compute_coefficients</span><span class="p">(</span><span class="n">points</span><span class="p">)</span>
        <span class="k">return</span> <span class="nc">Spline</span><span class="p">(</span><span class="n">points</span><span class="p">,</span> <span class="n">coefficients</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="nb">min</span><span class="p">,</span><span class="n">self</span><span class="p">.</span><span class="nb">max</span><span class="p">)</span>

<span class="k">class</span> <span class="nc">Spline</span><span class="p">:</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">points</span><span class="p">,</span> <span class="n">coefficients</span><span class="p">,</span> <span class="nb">min</span> <span class="o">=</span> <span class="mi">0</span><span class="p">,</span> <span class="nb">max</span> <span class="o">=</span> <span class="mi">1</span><span class="p">):</span>
        <span class="n">self</span><span class="p">.</span><span class="n">points</span> <span class="o">=</span> <span class="n">points</span>
        <span class="n">self</span><span class="p">.</span><span class="n">coefficients</span> <span class="o">=</span> <span class="n">coefficients</span>
        <span class="n">self</span><span class="p">.</span><span class="n">num_segments</span> <span class="o">=</span> <span class="n">coefficients</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="o">-</span><span class="mi">2</span><span class="p">]</span>
        <span class="n">self</span><span class="p">.</span><span class="n">batch_shape</span> <span class="o">=</span> <span class="n">points</span><span class="p">.</span><span class="n">shape</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
        <span class="n">self</span><span class="p">.</span><span class="nb">max</span> <span class="o">=</span> <span class="nb">max</span>
        <span class="n">self</span><span class="p">.</span><span class="nb">min</span> <span class="o">=</span> <span class="nb">min</span>
        
    <span class="k">def</span> <span class="nf">evaluate</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">t</span><span class="p">):</span>
        <span class="k">if</span> <span class="nf">isinstance</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="p">(</span><span class="nb">int</span><span class="p">,</span> <span class="nb">float</span><span class="p">)):</span>
            <span class="n">t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">tensor</span><span class="p">([</span><span class="n">t</span><span class="p">],</span> <span class="n">device</span><span class="o">=</span><span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="nf">to</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">.</span><span class="n">device</span><span class="p">)</span>

        <span class="n">t</span> <span class="o">=</span> <span class="p">(</span><span class="n">t</span> <span class="o">-</span> <span class="n">self</span><span class="p">.</span><span class="nb">min</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nb">max</span> <span class="o">-</span> <span class="n">self</span><span class="p">.</span><span class="nb">min</span><span class="p">)</span> <span class="o">*</span> <span class="n">self</span><span class="p">.</span><span class="n">num_segments</span>
        <span class="n">t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">clamp</span><span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="mf">1e-6</span><span class="p">,</span> <span class="n">self</span><span class="p">.</span><span class="n">num_segments</span><span class="o">-</span><span class="mf">1e-6</span><span class="p">)</span>

        <span class="n">seg</span> <span class="o">=</span> <span class="n">t</span><span class="p">.</span><span class="nf">long</span><span class="p">()</span>
        <span class="n">t</span> <span class="o">=</span> <span class="p">(</span><span class="n">t</span> <span class="o">-</span> <span class="n">seg</span><span class="p">.</span><span class="nf">float</span><span class="p">())</span>
        
        <span class="n">abcd</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="n">seg</span><span class="p">[...,</span><span class="bp">None</span><span class="p">].</span><span class="nf">expand</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">*</span><span class="nf">len</span><span class="p">(</span><span class="n">seg</span><span class="p">.</span><span class="n">shape</span><span class="p">),</span> <span class="mi">4</span><span class="p">))</span>

        <span class="nf">return </span><span class="p">((</span><span class="n">abcd</span><span class="p">[...,</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">abcd</span><span class="p">[...,</span><span class="mi">1</span><span class="p">])</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">abcd</span><span class="p">[...,</span><span class="mi">2</span><span class="p">])</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">abcd</span><span class="p">[...,</span><span class="mi">3</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">maximum</span><span class="p">(</span><span class="n">self</span><span class="p">,</span> <span class="n">dim</span> <span class="o">=</span> <span class="bp">None</span><span class="p">,</span> <span class="n">weight</span> <span class="o">=</span> <span class="mi">1</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">dim</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">d</span> <span class="o">=</span> <span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">coefficients</span> <span class="o">*</span> <span class="n">weight</span><span class="p">).</span><span class="nf">sum</span><span class="p">(</span><span class="n">dim</span><span class="p">).</span><span class="nf">tensor_split</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="n">a</span><span class="p">,</span><span class="n">b</span><span class="p">,</span><span class="n">c</span><span class="p">,</span><span class="n">d</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">.</span><span class="nf">tensor_split</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span><span class="mi">2</span><span class="p">,</span><span class="mi">3</span><span class="p">),</span> <span class="o">-</span><span class="mi">1</span><span class="p">)</span>
        <span class="n">aa</span> <span class="o">=</span> <span class="o">-</span><span class="mi">3</span> <span class="o">*</span> <span class="n">a</span> 

        <span class="n">dd</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">sqrt</span><span class="p">(</span><span class="n">b</span><span class="p">.</span><span class="nf">square</span><span class="p">()</span> <span class="o">+</span> <span class="n">aa</span> <span class="o">*</span> <span class="n">c</span><span class="p">)</span>
        
        <span class="n">t</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">clamp</span><span class="p">((</span><span class="n">b</span> <span class="o">+</span> <span class="n">dd</span><span class="p">)</span> <span class="o">/</span> <span class="n">aa</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
        <span class="n">t</span><span class="p">[...,:</span><span class="o">-</span><span class="mi">1</span><span class="p">,:]</span> <span class="o">=</span> <span class="n">t</span><span class="p">[...,:</span><span class="o">-</span><span class="mi">1</span><span class="p">,:].</span><span class="nf">nan_to_num</span><span class="p">(</span><span class="mf">0.</span><span class="p">)</span>
        <span class="n">t</span><span class="p">[...,</span><span class="o">-</span><span class="mi">1</span><span class="p">,:]</span> <span class="o">=</span><span class="n">t</span><span class="p">[...,</span><span class="o">-</span><span class="mi">1</span><span class="p">,:].</span><span class="nf">nan_to_num</span><span class="p">(</span><span class="mf">1.</span><span class="p">)</span>
        <span class="n">m</span> <span class="o">=</span> <span class="p">((</span><span class="n">a</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">b</span><span class="p">)</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">c</span><span class="p">)</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">d</span>
        
        <span class="n">p</span> <span class="o">=</span> <span class="n">m</span><span class="p">.</span><span class="nf">max</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">)</span>

        <span class="k">if</span> <span class="n">dim</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span><span class="p">:</span>
            <span class="n">indices</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="nf">ones</span><span class="p">(</span><span class="nf">len</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">.</span><span class="n">shape</span><span class="p">),</span> <span class="n">dtype</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">int64</span><span class="p">)</span>
            <span class="n">indices</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span> <span class="o">=</span> <span class="mi">4</span>
            <span class="n">indices</span><span class="p">[</span><span class="n">dim</span><span class="p">]</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">.</span><span class="n">shape</span><span class="p">[</span><span class="n">dim</span><span class="p">]</span>
            <span class="n">abcd</span> <span class="o">=</span> <span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">.</span><span class="nf">gather</span><span class="p">(</span><span class="o">-</span><span class="mi">2</span><span class="p">,</span><span class="n">p</span><span class="p">.</span><span class="n">indices</span><span class="p">[...,</span><span class="bp">None</span><span class="p">,</span> <span class="bp">None</span><span class="p">].</span><span class="nf">expand</span><span class="p">(</span><span class="o">*</span><span class="n">indices</span><span class="p">))[...,</span> <span class="mi">0</span><span class="p">,</span> <span class="p">:]</span> 
            <span class="n">t</span> <span class="o">=</span> <span class="n">t</span><span class="p">[...,</span> <span class="mi">0</span><span class="p">].</span><span class="nf">gather</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span><span class="n">p</span><span class="p">.</span><span class="n">indices</span><span class="p">)</span>
            <span class="nf">return </span><span class="p">((</span><span class="n">abcd</span><span class="p">[...,</span><span class="mi">0</span><span class="p">]</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">abcd</span><span class="p">[...,</span><span class="mi">1</span><span class="p">])</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">abcd</span><span class="p">[...,</span> <span class="mi">2</span><span class="p">])</span> <span class="o">*</span> <span class="n">t</span> <span class="o">+</span> <span class="n">abcd</span><span class="p">[...,</span> <span class="mi">3</span><span class="p">],</span> <span class="p">(</span><span class="n">t</span><span class="o">+</span><span class="n">p</span><span class="p">.</span><span class="n">indices</span><span class="p">)</span><span class="o">*</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nb">max</span> <span class="o">-</span> <span class="n">self</span><span class="p">.</span><span class="nb">min</span><span class="p">)</span><span class="o">/</span><span class="n">self</span><span class="p">.</span><span class="n">num_segments</span> <span class="o">+</span> <span class="n">self</span><span class="p">.</span><span class="nb">min</span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">return</span> <span class="n">p</span><span class="p">.</span><span class="n">values</span><span class="p">,</span> <span class="p">(</span><span class="n">t</span><span class="p">[...,</span> <span class="mi">0</span><span class="p">].</span><span class="nf">gather</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">indices</span><span class="p">))</span><span class="o">*</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="nb">max</span> <span class="o">-</span> <span class="n">self</span><span class="p">.</span><span class="nb">min</span><span class="p">)</span> <span class="o">+</span> <span class="n">self</span><span class="p">.</span><span class="nb">min</span>

    <span class="k">def</span> <span class="nf">mean</span><span class="p">(</span><span class="n">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="n">einops</span><span class="p">.</span><span class="nf">einsum</span><span class="p">(</span><span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">,</span> <span class="n">torch</span><span class="p">.</span><span class="nf">tensor</span><span class="p">([</span><span class="mi">1</span><span class="o">/</span><span class="mi">4</span><span class="p">,</span> <span class="mi">1</span><span class="o">/</span><span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="o">/</span><span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">],</span><span class="n">device</span><span class="o">=</span><span class="n">self</span><span class="p">.</span><span class="n">coefficients</span><span class="p">.</span><span class="n">device</span><span class="p">),</span> <span class="sh">'</span><span class="s">... s f, f -&gt; ...</span><span class="sh">'</span><span class="p">)</span> <span class="o">/</span> <span class="n">self</span><span class="p">.</span><span class="n">num_segments</span>

</code></pre></div></div>


  </div>
</details>

<h3 id="spline-formula">Non-equidistant Knot Case</h3>

<details class="collapse-section">
  <summary>Spline Formula</summary>
  <div class="collapse-content">
    
<p><a href="https://mathworld.wolfram.com/CubicSpline.html">WolframMathWorld</a> goes through the derivation of the spline coefficients when the knots are equally spaced, but I had a hard time finding the derivation for arbitrary knots.</p>

<p>The derivation is very similar, with only a few changes. First, we need to set up the problem.</p>

<p>Given a list of $ n+1 $ knots $(x_i, y_i)$, $ i = 0, \dots, n $, where $ 0 = x_0 &lt; x_1 &lt; … &lt; x_{n-1} &lt; x_n = 1$</p>

<p>We want to define the spline curve $S(x)$ where $ 0 \leq x \leq 1 $.</p>

<p>We define</p>

<ul>
  <li>
    <p>Segment lengths $ \Delta x_i = x_{i+1} - x_i $</p>
  </li>
  <li>
    <p>$ S(x) = S_i(t) = a_i t^3 + b_i t^2 + c_i t + d_i $ where $ x_i \leq x \leq x_{i+1}$, $t = \frac{x-x_i}{\Delta x_i}$</p>
  </li>
</ul>
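<p>To make the segment/local-variable setup concrete, here is a minimal NumPy sketch (the helper name is mine, not from the actual codebase) that evaluates $S(x)$ by locating the segment for each query point and applying Horner's rule in $t$, the same way the PyTorch code above does:</p>

```python
import numpy as np

def eval_spline(x_query, x, coeffs):
    """Evaluate the piecewise cubic S(x) given knot positions x and
    per-segment coefficients (a, b, c, d) in the local variable
    t = (x - x_i) / dx_i. Illustrative sketch only."""
    a, b, c, d = coeffs
    # find the segment index i with x_i <= x_query < x_{i+1}
    i = np.clip(np.searchsorted(x, x_query, side='right') - 1, 0, len(x) - 2)
    t = (x_query - x[i]) / (x[i + 1] - x[i])
    # Horner's rule: ((a t + b) t + c) t + d
    return ((a[i] * t + b[i]) * t + c[i]) * t + d[i]
```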

<p>Then</p>

\[\begin{align}
S_i(0) &amp; = y_i &amp; = d_i \\
S_i(1) &amp; = y_{i+1} &amp; = a_i + b_i + c_i + d_i \\
\end{align}\]

<p>When taking the derivative, we want it with respect to $x$ instead of $t$</p>

\[\begin{align}
S'_i(0) &amp; = D_i &amp; = \frac{c_i}{\Delta x_i} \\
S'_i(1) &amp; = D_{i+1} &amp; = \frac{(3a_i + 2b_i + c_i)}{\Delta x_i}
\end{align}\]

<p>Solving for the coefficients gives</p>

\[\begin{align}
d_i &amp; = y_i \\
c_i &amp; = D_i \Delta x_i \\
b_i &amp; = 3(y_{i+1} - y_i) - (2 D_i + D_{i+1}) \Delta x_i\\
a_i &amp; = 2(y_i - y_{i+1}) + (D_i + D_{i+1}) \Delta x_i
\end{align}\]

<p>We now impose the second-derivative conditions of a natural spline</p>

\[\begin{align}
S''_0(0)  = &amp; 0 \\
S''_{i-1}(1)  = &amp; S''_i(0) \\
S''_{n-1}(1)  = &amp; 0 
\end{align}\]

<p>Substituting the coefficients, we get</p>

\[3y_1 - 3y_0 = \Delta x_0 D_1  +2 \Delta x_0 D_0\]

\[\begin{multline}
 3 \frac{\Delta x_{i-1}}{\Delta x_i} y_{i+1} + 3( \frac{\Delta x_i}{\Delta x_{i-1}} - \frac{\Delta x_{i-1}}{\Delta x_i}) y_i - 3 \frac{\Delta x_i}{\Delta x_{i-1}} y_{i-1} \\
 =  \Delta x_{i-1} D_{i+1} + 2 ( \Delta x_i + \Delta x_{i-1}) D_i + \Delta x_i D_{i-1}
\end{multline}\]

\[3y_n - 3y_{n-1} = \Delta x_{n-1} D_{n-1} +2 \Delta x_{n-1} D_n\]

<p>This can be written in matrix form as</p>

\[\begin{split}
\begin{bmatrix}
2 \Delta x_0 &amp; \Delta x_0 &amp;  &amp;  &amp; \cdots \\
\Delta x_1 &amp; 2(\Delta x_0 + \Delta x_1) &amp; \Delta x_0 &amp;  &amp; \cdots \\
 &amp; \Delta x_2 &amp; 2(\Delta x_1 + \Delta x_2) &amp; \Delta x_1 &amp; \cdots \\
\vdots &amp; \vdots &amp; \vdots &amp; \vdots &amp; \ddots \\
 &amp; &amp; \Delta x_{n-1} &amp; 2(\Delta x_{n-2} + \Delta x_{n-1}) &amp; \Delta x_{n-2} \\
 &amp; &amp; &amp; \Delta x_{n-1} &amp; 2 \Delta x_{n-1}
\end{bmatrix}

\begin{bmatrix}
D_0 \\
D_1 \\
D_2 \\
\vdots \\
D_{n-1} \\
D_n
\end{bmatrix}
\\
= 3 
\begin{bmatrix}
y_1 - y_0 \\
\frac{\Delta x_0}{\Delta x_1} y_2 + ( \frac{\Delta x_1}{\Delta x_0} - \frac{\Delta x_0}{\Delta x_1}) y_1 - \frac{\Delta x_1}{\Delta x_0} y_0 \\
\frac{\Delta x_1}{\Delta x_2} y_3 + ( \frac{\Delta x_2}{\Delta x_1} - \frac{\Delta x_1}{\Delta x_2}) y_2 - \frac{\Delta x_2}{\Delta x_1} y_1 \\
\vdots \\
\frac{\Delta x_{n-2}}{\Delta x_{n-1}} y_n + ( \frac{\Delta x_{n-1}}{\Delta x_{n-2}} - \frac{\Delta x_{n-2}}{\Delta x_{n-1}}) y_{n-1} - \frac{\Delta x_{n-1}}{\Delta x_{n-2}} y_{n-2} \\
y_n - y_{n-1}
\end{bmatrix}
\end{split}\]
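
<p>Assembling and solving this system is straightforward. Below is an illustrative NumPy sketch (names are mine; a dense solve is used for clarity, though the matrix is tridiagonal and a banded solver would be more efficient):</p>

```python
import numpy as np

def natural_spline_slopes(x, y):
    """Solve the tridiagonal system above for the knot derivatives
    D_0..D_n of a natural cubic spline with arbitrary knot spacing."""
    n = len(x) - 1                      # knots x_0..x_n, segments 0..n-1
    dx = np.diff(x)
    A = np.zeros((n + 1, n + 1))
    r = np.zeros(n + 1)
    A[0, 0], A[0, 1] = 2 * dx[0], dx[0]          # S''(x_0) = 0
    r[0] = 3 * (y[1] - y[0])
    A[n, n - 1], A[n, n] = dx[-1], 2 * dx[-1]    # S''(x_n) = 0
    r[n] = 3 * (y[n] - y[n - 1])
    for i in range(1, n):                        # C2 continuity at x_i
        A[i, i - 1] = dx[i]
        A[i, i] = 2 * (dx[i - 1] + dx[i])
        A[i, i + 1] = dx[i - 1]
        r[i] = 3 * (dx[i - 1] / dx[i] * y[i + 1]
                    + (dx[i] / dx[i - 1] - dx[i - 1] / dx[i]) * y[i]
                    - dx[i] / dx[i - 1] * y[i - 1])
    return np.linalg.solve(A, r)
```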


  </div>
</details>

<h3 id="spline-mean">Mean Value of the Cubic Spline</h3>

<details class="collapse-section">
  <summary>Spline Mean</summary>
  <div class="collapse-content">
    
\[\begin{align}
\mu &amp; = \int_{\min}^{\max} \frac{S(x)}{(\max - \min)} dx\\
 &amp; = \frac{1}{x_n-x_0}\sum_{i=0}^{n-1} \int_{x_i}^{x_{i+1}} (a_i t^3 + b_i t^2 + c_i t + d_i) dx &amp; \text{where } &amp; t = \frac{x-x_i}{x_{i+1} - x_i} \text{,} \\ &amp; &amp; &amp; dx = (x_{i+1}-x_i) dt\\
 &amp; = \frac{1}{x_n-x_0}\sum_{i=0}^{n-1} \int_{0}^{1} (a_i t^3 + b_i t^2 + c_i t + d_i) (x_{i+1} - x_i) dt \\
 &amp; = \frac{1}{x_n-x_0}\sum_{i=0}^{n-1} \left[\frac{a_i t^4}{4} + \frac{b_i t^3}{3} + \frac{c_i t^2}{2} + d_i t \right]^1_0 (x_{i+1} - x_i) \\
 &amp; = \frac{1}{x_n-x_0}\sum_{i=0}^{n-1} (\frac{a_i}{4} + \frac{b_i}{3} + \frac{c_i}{2} + d_i) (x_{i+1} - x_i) \\
 &amp; = \frac{ \left[\frac{1}{4}, \frac{1}{3}, \frac{1}{2}, 1 \right]_j \left[ a, b, c, d \right]_{ij}  \Delta x_i}{x_n - x_0}
\end{align}\]
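
<p>The last line is the weighted contraction the <code>mean</code> method implements. As a standalone sanity check, here is a sketch (helper name is mine) of the closed form, verified against brute-force quadrature of each segment:</p>

```python
import numpy as np

def spline_mean(a, b, c, d, x):
    """Closed-form mean of a piecewise cubic over [x_0, x_n]:
    sum_i (a_i/4 + b_i/3 + c_i/2 + d_i) dx_i / (x_n - x_0)."""
    dx = np.diff(x)
    return ((a / 4 + b / 3 + c / 2 + d) * dx).sum() / (x[-1] - x[0])
```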


  </div>
</details>]]></content><author><name>Kazumi</name></author><category term="reinforcement-learning" /><category term="dqn" /></entry></feed>