Stuck Between 24 Rocks and an NP-hard Place
I became interested in the 24 game after my friend, who has incredible number sense, told me a story1 about how he dominated the game when he played it in elementary school. Thinking it would be a fun and quick exercise, I decided to make myself a toy version of the game. Unbeknownst to me, I was embarking down a path that would lead me to a new appreciation for some of the core principles of computer science.
The rules of the 24 game are simple. Given four numbers in the range [1, 13], combine all four of them using the basic arithmetic operators (+, -, x, ÷) to get a total of 24. For example, given the set 4 7 8 8, one solution is (7 - 8 ÷ 8) x 4 = 24.
The crux of making the game is generating problems (sets of four numbers) to be solved. Since we don't want to present unsolvable sets2, it's not as simple as just generating four random numbers in the specified range. So how do we generate a set that is guaranteed to have a solution?
The first thought that popped into my head was to generate four random numbers in the specified range, and then try every combination of them to see if one works. However, I quickly rejected this idea, more for aesthetic reasons than any deep consideration of its time complexity or implementation. Something about this "brute force" solution was unappealing to some sense of style in the back of my head, so I figured there must be a better way.
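For reference, the rejected brute-force check might look something like the sketch below, which tries every operator choice for a single association order. A complete checker would also cover the permutations of the numbers and the other parenthesizations, and none of the names here (safe-apply, solvable?) come from my actual implementation.

(define ops (list + - * /))

;; Apply op to x and y, returning #f instead of erroring on division by zero.
(define (safe-apply op x y)
  (if (and (eq? op /) (zero? y))
      #f
      (op x y)))

;; Can some choice of operators make (op1 a (op2 b (op3 c d))) equal 24?
;; Only #f is false in Scheme, so an intermediate result of 0 flows through the ands.
(define (solvable? a b c d)
  (define found #f)
  (for-each
   (lambda (op1)
     (for-each
      (lambda (op2)
        (for-each
         (lambda (op3)
           (let* ((r3 (safe-apply op3 c d))
                  (r2 (and r3 (safe-apply op2 b r3)))
                  (r1 (and r2 (safe-apply op1 a r2))))
             (if (and r1 (= r1 24))
                 (set! found #t))))
         ops))
      ops))
   ops)
  found)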
In search of a more "elegant" solution, my next idea was an algorithm that "works backwards" from 24. Starting with total=24, pick a random number n in [1, 13] and a random operator op, then let total = (total op n) and repeat the process on the new total. Once we've done this for three numbers and three operators, the remaining total will be the fourth number of the set. Another way to think about this approach is that it builds a tree from the top down, with the leaves of the tree being a valid set for the game. This algorithm does guarantee that the set of four numbers produced can be combined to make 24, but it has a glaring flaw: we have no way to ensure that the fourth number is in the range [1, 13]. For example, a valid run of the algorithm could look like 24 x 2 = 48, then 48 ÷ 2 = 24, then 24 x 2 = 48, leaving 48 as the fourth number. You can use the set 2 2 2 48 to make 24 (48 ÷ 2 x 2 ÷ 2 = 24), but 48 is not in [1, 13], so this is not a valid set. Some finagling can be done to get around this issue and constrain the fourth number to the desired range, but it involves not allowing certain operators at certain parts of the tree. This means that if we want this "top down" algorithm to only generate valid sets, we have to constrain it so that it can only generate a subset of all the possible valid sets.
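For reference, a minimal sketch of the basic top-down generator (without any of the finagling) could look like this; I'm assuming a random procedure that returns a random integer below its argument, and the names are my own for illustration:

(define nums '(1 2 3 4 5 6 7 8 9 10 11 12 13))
(define ops (list + - * /))

(define (random-choice lst)
  (list-ref lst (random (length lst))))

;; Work "backwards" from 24, folding three random numbers and operators into
;; the running total. The final total becomes the fourth number of the set,
;; with no guarantee that it lands in [1, 13] (or is even an integer).
(define (gen-set)
  (let loop ((total 24) (picked '()) (steps 3))
    (if (zero? steps)
        (cons total picked)
        (let ((n (random-choice nums))
              (op (random-choice ops)))
          (loop (op total n) (cons n picked) (- steps 1))))))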
At this point, the ideal solution felt tantalizingly close. The question now is, can we create an "elegant" algorithm like the "top down" approach that has full coverage of the problem space like the "brute force" approach?
The answer is...no. To see why, let's first reframe the problem a bit. At its core, what we have here is a decision problem. Given four integers in the range [1, 13], is it possible to combine them with the four basic arithmetic operators to get a total of 24? It's also a decision problem that's very easy to check. To verify if a candidate solution (an expression using the four numbers and three operators) is correct, we just evaluate the expression and see if it's equal to 24. Wait a minute...decision problem...easy to verify...hard to solve...is this thing NP-complete? Turns out that a generalization of this problem has been proven to be NP-complete.
So the decision problem is NP-complete, but why does that mean our fantasized set generating algorithm can't exist? Well, what exactly are we asking the algorithm to do? We are asking it to enumerate the entire solution space, and only the solution space, without evaluating every member of the candidate space. This implies that we know the full structure of the solution space without actually computing it. If that were the case, we could use this knowledge to determine if a set of four numbers is a valid solution without having to search the candidate space. However, deciding whether a set is a valid solution without any search amounts to a quick solution to the 24 game decision problem, which implies that P=NP.
I guess this means I should walk back my claim that the dream set generating algorithm doesn't exist. It might exist, if P=NP, but I don't think I'm the guy who's going to prove that.
Ok, so if we want access to the full solution space there is no way around doing some good ol' fashioned search. But now we're back to the original problem: doing straight up brute force search is ugly. With no other options, it looks like we'll have to put some lipstick on this pig.
My lipstick of choice will be non-deterministic programming. The core idea of non-deterministic programming is that a non-deterministic expression can have more than one possible value. The program evaluator will automatically try different choices of possible values for that expression, backtracking and trying a new choice if the current one causes a failure later on in the program. In its simplest form, this really just amounts to doing depth first search over the possible values of the non-deterministic expressions.
This non-determinism will allow us to frame the set generating algorithm as a constraint programming problem. Constraint programming is a paradigm where the user declares constraints on valid solutions for a set of decision variables.
I decided to implement this non-deterministic constraint solver as a mini DSL in Scheme3 using macros. I found a great chapter in SICP about implementing a non-deterministic evaluator that I used as a reference.
To start, let's define the non-deterministic expressions we need in order to express the problem of generating sets for the 24 game as a constraint solving problem. For this problem, we want to declare possible choices for the components of the solution (the numbers and operators) and declare a constraint (equals 24) that a solution must meet to be considered valid.
Let's call the first expression restrict, and use it to declare the domain of our non-deterministic variables as follows:
(restrict var domain)
We can use this expression to restrict both our number variables to numbers in [1, 13], and to restrict the operators used to combine them to the four basic arithmetic operators.
Let's call the second expression require, and use it to declare a constraint that must be satisfied for the non-deterministic program to evaluate successfully as follows:
(require cond)
Here, cond can be any expression that evaluates to a boolean.
Using these expressions, we might express generating sets for the 24 game like so:
(restrict a nums)
(restrict b nums)
(restrict c nums)
(restrict d nums)
(restrict op1 ops)
(restrict op2 ops)
(restrict op3 ops)
(require (= (op1 a (op2 b (op3 c d))) 24))
(print `(,a ,op1 ,b ,op2 ,c ,op3 ,d))
Since this is a DSL, we want to turn this new syntax into valid Scheme functions that can be evaluated by the normal Scheme interpreter. To do this we'll use macros, which perform these transformations at expansion time, before the program is evaluated. But how do we transform these expressions into functions that will automatically do search and backtracking?
The answer given in SICP is to use continuation passing. Specifically, we transform expressions into functions that take two other functions as arguments, a success and a failure continuation. The idea is that the function does some computation, then calls the success continuation if that computation succeeds, or otherwise calls the failure continuation.
For a simple example, let's look at the macro for require:
(define-macro (require cond)
  `(lambda (succ fail)
     (if ,cond
         (succ 'done fail)
         (fail))))
The expression (require cond) is transformed into a function (lambda) that takes two arguments (succ and fail). The function calls the success continuation if the cond passed to require is true; otherwise it calls the failure continuation. Note that succ itself takes two arguments: the first is a value produced by the computation (here just the symbol 'done), and the second is the failure continuation that should be called if the current branch of the search fails later on. We pass the same failure continuation down the branch because if it fails later on we want to return to the top of the branch and start down a new branch. In this specific case, if the require constraint is met then we are done, but if it fails we call the failure continuation that was passed to require and keep searching.
The calling of these continuations is how we propagate information up and down branches of our search. Every time a non-deterministic expression like restrict or require is evaluated a "decision point" is created. We can think of a "decision point" as the top of a branch in the search. If the choice we make at this decision point does not lead to a successful search, we can return to the decision point and try a new choice by calling the failure continuation that was generated at that point.
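The restrict expression is what actually creates these decision points. Here's a rough sketch of what its macro might look like in the same continuation-passing style (the real implementation, linked at the end, differs in the details):

(define-macro (restrict var domain)
  `(lambda (succ fail)
     (define (try vals)
       (if (null? vals)
           (fail)                    ; domain exhausted: fail back to the previous decision point
           (begin
             (set! ,var (car vals))  ; commit to a choice
             ;; The new failure continuation is the decision point itself:
             ;; calling it resumes the search with the rest of the domain.
             (succ 'done (lambda () (try (cdr vals)))))))
     (try ,domain)))

Each time the failure continuation built here is called, restrict retries with the next value in its domain; once the domain runs out, it fails upward to the decision point before it.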
The last tricky bit of making this system work is to nest these continuation passing functions so that their continuations call each other, creating a chain of control flow that can be navigated. I chose to do this with another macro constrain:
(define-macro (constrain &rest exprs)
  (if (null? exprs)
      '(lambda (succ fail) (succ 'done fail))
      `(lambda (succ fail)
         (,(car exprs)
          (lambda (_ fail2) ((constrain ,@(cdr exprs)) succ fail2))
          fail))))
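Before unpacking it, it helps to see what this produces. Hand-expanding a two-expression (constrain e1 e2), with e1 and e2 standing in for any two non-deterministic expressions, and simplifying away the intermediate lambdas gives roughly:

(lambda (succ fail)
  (e1 (lambda (_ fail2)
        (e2 (lambda (_ fail3) (succ 'done fail3))
            fail2))
      fail))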
This macro takes all the non-deterministic expressions and nests them all within one big function. We can then pass that big function some continuations at the top level to set off the search. Depending on how we structure the top level continuations, we can have the program return just one possible solution or have it loop to return all possible solutions. Here's a small example to illustrate this in context:
(define (succ val fail)
  (print `(,a ,b))
  (fail))
(define (fail) (print 'no-more-solutions))
(define a 'null)
(define b 'null)
(define prog (constrain
              (restrict a '(1 2 3 4))
              (restrict b '(5 6 7))
              (require (= (+ a b) 9))))
(prog succ fail)
Here, the top level success continuation prints the values of a and b, and then calls the failure continuation that it was passed. Calling that failure continuation returns control flow to the most recent decision point, where the system continues looking for more solutions. This loop continues until the entire space has been searched and the top level failure continuation is finally called, printing no-more-solutions.
Using this system to generate sets for the 24 game looks like this:
(define (succ val fail)
  (pretty24 a b c d op1 op2 op3)
  (fail))
(define (fail) (print 'no-more-solutions))
(define prog (constrain
              (restrict a nums)
              (restrict b nums)
              (restrict c nums)
              (restrict d nums)
              (restrict op1 ops)
              (restrict op2 ops)
              (restrict op3 ops)
              (require (= (eval24 a b c d op1 op2 op3) 24))))
(prog succ fail)
The helper functions eval24 and pretty24 are not interesting, but the full implementation, including the macro for restrict (which is interesting), is up on my GitHub.
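To give a flavor of the former, eval24 could be as minimal as evaluating a single association order, mirroring the require expression above (this is a guess at a minimal version, not necessarily what's in the repo):

;; Evaluate one association order; a robust version would also guard
;; against division by zero during the search.
(define (eval24 a b c d op1 op2 op3)
  (op1 a (op2 b (op3 c d))))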
Throughout the process of wrangling these continuations, a question lingered in the background, taunting me from the shadows. What was the point of all this? Was the extra trouble really worth it to create a solution that is more "aesthetic"?
Yes, yes it was...but not because of aesthetics. The DSL is a better solution not because it is prettier, but because it is more expressive and extensible than the more straightforward alternative. The structure and declarative style of the syntax allows us to express the problem in a way much closer to how we reason about it in our heads. We can also extend the DSL to include expressions beyond the two basic ones that we used for this problem. We could include non-deterministic conditionals, for example, allowing us to do more complex computation without increasing the complexity needed to reason about it. While this specific use case is simple enough that the value of this choice might feel marginal, for sufficiently complex problems the expressive power and extensibility of a language become necessary for a person (or AI?) to even be able to implement a solution.
What I originally thought was my sense of aesthetics guiding me turned out to be my desire for more expressiveness. The first idea I thought of when I encountered this problem, doing search to find a solution, ended up being what I used to solve it. However, the feeling that a better solution existed eventually led me to the non-deterministic constraint solver. The DSL does not prove that P=NP; in fact, it's no faster than any other implementation of depth first search. What it does do, though, is open up the possibility of solving other problems that I could not have imagined when I started.
My friend played a competitive version of the game in his elementary school. The setup: four competitors sit around a table, and a card with four numbers is placed on the table. First to slap the card gets to give the answer, and the answer has to be given within a second or two of slapping the card. According to my friend, every round he would slap the card immediately when it was put down, trusting in his ability to solve the problem in the seconds immediately after slapping it. He apparently dominated the school and district-wide tournaments every year. Considering some of the properties of the game discussed later on, this seems to suggest some pretty remarkable things about the human brain.↩
Is 4 4 6 7 solvable? How about 2 3 5 12? One is and one isn't, and I personally would go crazy if unsolvable sets were part of the game.↩
The version of Scheme I used for this project uses this interpreter which is another project of mine. As such, it is very bare bones and does not have access to the broader Scheme ecosystem.↩