A Look at Tree-of-Thought Prompting

How do you get an AI to think slowly? Not instinctively, but deliberately — like solving a logic puzzle instead of guessing an answer?

This question touches on a fundamental difference in how decisions are made: quickly vs. carefully. Psychologist Daniel Kahneman called these two modes System 1 (fast, automatic) and System 2 (slow, effortful). Most large language models, like GPT-4o, default to System 1—they respond with speed and confidence. But recent techniques in prompting are designed to nudge them into System 2 territory.

One of the most promising methods is called Tree-of-Thought (ToT) prompting. Instead of moving straight from question to answer, ToT encourages models to explore a range of reasoning paths before making a decision.

Let’s explore this idea through an interactive example: the game Minesweeper.

From Simple Chains to Expanding Trees

Before jumping into ToT, it’s worth mentioning a related strategy: Chain-of-Thought (CoT) prompting. This method asks the model to explain its reasoning step by step before arriving at a final answer. It’s a simple yet powerful tool that improves accuracy in tasks like math, logic, and deduction.

Here’s a CoT example:

Q: I had 10 apples. I gave 2 to a neighbor and 2 to my pigs. Then I bought 5 more, but one was rotten. How many good apples do I have now?
A: Started with 10 → gave away 4 → now 6. Bought 5 → now 11. One is bad → 10 good apples.

The model breaks down the scenario instead of jumping to a guess. But CoT follows a single line of reasoning. What if we allowed multiple lines in parallel?

That’s what Tree-of-Thought brings to the table.

What Makes ToT Different?

With ToT, the model reasons like a decision tree:

  • Each node represents a thought or sub-solution.
  • Each branch represents an alternative path the model can explore.
  • Using search strategies like Breadth-First Search (BFS) or Depth-First Search (DFS), the system can explore many possibilities, compare outcomes, and backtrack if needed (a minimal code sketch follows the example below).

Let’s use a non-technical example:

Problem: A meeting room fits 20 people. You have 28 attendees. What are the options?

  • Option 1: Move 8 people elsewhere
    → Is there another room? Yes. Enough space? Yes. Easy to coordinate? Maybe.
    → Verdict: Likely workable.
  • Option 2: Expand the room
    → Can it be done physically? Maybe. Is it allowed? No. Exceptions possible? No.
    → Verdict: Not practical.
  • Option 3: Split the group
    → Can they attend on different days? Possibly.
    → Verdict: Worth exploring.

This kind of flexible evaluation is exactly what ToT aims to replicate in language models.
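In code, the core of ToT is a search over candidate thoughts. Here is a minimal BFS-style sketch, where propose and evaluate are hypothetical stand-ins for "ask the model for k next thoughts" and "ask the model to score a partial solution":

def tree_of_thought_bfs(root, propose, evaluate, beam_width=3, depth=3):
    # propose(state, k): returns k candidate next thoughts (hypothetical helper)
    # evaluate(state): returns a promise score in [0.0, 1.0] (hypothetical helper)
    frontier = [root]
    for _ in range(depth):
        # Expand every state on the frontier into candidate next thoughts.
        candidates = [child for state in frontier for child in propose(state, beam_width)]
        if not candidates:
            break
        # Keep only the most promising branches (the beam); the rest are pruned.
        candidates.sort(key=evaluate, reverse=True)
        frontier = candidates[:beam_width]
    # Return the most promising state reached.
    return max(frontier, key=evaluate)

Swapping the beam for a stack gives the DFS variant, which commits to one branch at a time and backtracks when a dead end is detected.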

Applying ToT to Minesweeper

Minesweeper is a logic game where you uncover cells on a grid without triggering hidden mines. Each revealed number tells you how many adjacent mines surround that square.

It’s a great testing ground for reasoning-based AI: most moves can be deduced from the revealed numbers, though some board states do force an educated guess.

Step 1: Create the Game Board

Here’s how we generate the board with Python:

import random

import numpy as np

BOARD_SIZE = 8   # 8x8 grid
NUM_MINES = 10   # mines hidden on the board

def generate_board():
    board = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=int)
    mines = set()
    # Place mines at unique random positions; -1 marks a mine.
    while len(mines) < NUM_MINES:
        r, c = random.randint(0, BOARD_SIZE - 1), random.randint(0, BOARD_SIZE - 1)
        if (r, c) not in mines:
            mines.add((r, c))
            board[r][c] = -1

    # For every non-mine cell, count mines among the eight neighbors.
    for r in range(BOARD_SIZE):
        for c in range(BOARD_SIZE):
            if board[r][c] == -1:
                continue
            count = sum(
                1 for dr in [-1, 0, 1] for dc in [-1, 0, 1]
                if 0 <= r+dr < BOARD_SIZE and 0 <= c+dc < BOARD_SIZE and board[r+dr][c+dc] == -1
            )
            board[r][c] = count
    return board, mines

This function builds an 8×8 board with 10 randomly placed mines and precomputes, for each safe cell, how many mines border it.
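The prompt in the next step calls a board_to_text helper that isn't shown in the original listing. A minimal sketch, assuming revealed is a boolean array of the same shape as board:

def board_to_text(board, revealed):
    # Render revealed cells as their numbers and hidden cells as '?'.
    rows = []
    for r in range(BOARD_SIZE):
        cells = [str(board[r][c]) if revealed[r][c] else "?" for c in range(BOARD_SIZE)]
        rows.append(" ".join(cells))
    return "\n".join(rows)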

Step 2: Prompting the Model to Think in Branches

To apply ToT, we prompt the model to generate multiple possible actions based on the current board, along with justifications and confidence scores for each move.

import json

from openai import OpenAI  # assumes the official OpenAI Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def llm_generate_thoughts(board, revealed, flagged_mines, known_safe, k=3):
    # Hidden cells are the only legal moves; the original listing left
    # valid_moves undefined, so we derive it here.
    valid_moves = [
        (r, c) for r in range(BOARD_SIZE) for c in range(BOARD_SIZE)
        if not revealed[r][c]
    ]

    prompt = f"""
You are playing Minesweeper (8x8). Numbers show adjacent mines. '?' means hidden.

Mines flagged: {flagged_mines}
Cells marked safe: {known_safe}

Goal: Pick one hidden cell least likely to contain a mine.

Rules:
- If a '1' is next to only one '?', that '?' is a mine.
- If a '1' touches one flagged mine, others nearby are safe.
- Cells next to '0' are typically safe.

Board view:
{board_to_text(board, revealed)}

Available cells:
{valid_moves}

Instructions:
1. Propose {k} cell options.
2. Explain each one using logic.
3. Rate safety confidence (0.0–1.0).

Format your answer like this:
[
  {{ "cell": [r, c], "reason": "...", "score": 0.95 }},
  ...
]
"""
    # The original listing stopped at the prompt; we complete it here with
    # a call to an OpenAI-style chat API (an assumption about the setup).
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    try:
        # The model is asked to reply with a JSON list of candidate moves.
        return json.loads(response.choices[0].message.content)
    except (json.JSONDecodeError, TypeError):
        return []  # malformed output; the caller falls back to the baseline

This creates multiple “thoughts” the model can evaluate—forming the branches of a reasoning tree.
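For illustration, a parsed response might look like this (the cells, reasons, and scores are hypothetical):

thoughts = [
    {"cell": [0, 7], "reason": "Adjacent to a revealed 0, so it cannot be a mine.", "score": 0.98},
    {"cell": [3, 2], "reason": "The neighboring '1' already has its mine flagged.", "score": 0.91},
    {"cell": [5, 5], "reason": "No revealed neighbors; only a uniform prior.", "score": 0.55},
]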

Step 3: Executing the Decision

The final step is to choose the most confident, valid move from the model’s output.

def tot_llm_agent(board, revealed, flagged_mines, known_safe):
    thoughts = llm_generate_thoughts(board, revealed, flagged_mines, known_safe, k=5)

    if not thoughts:
        print("[ToT] No valid responses. Defaulting to backup strategy.")
        return baseline_agent(board, revealed)

    # Discard candidates that are out of bounds or already revealed.
    thoughts = [
        t for t in thoughts
        if 0 <= t["cell"][0] < BOARD_SIZE and 0 <= t["cell"][1] < BOARD_SIZE
        and not revealed[t["cell"][0]][t["cell"][1]]
    ]
    # Try the most confident branch first.
    thoughts.sort(key=lambda x: x["score"], reverse=True)

    for t in thoughts:
        if t["score"] >= 0.9:
            print(f"[ToT] Selecting move {t['cell']} with confidence {t['score']}")
            return tuple(t["cell"])

    print("[ToT] No high-confidence move found. Falling back.")
    return baseline_agent(board, revealed)

Only the most confident decisions (score ≥ 0.9) are taken seriously. Everything else is left to a baseline agent for safety.
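The baseline_agent fallback is referenced but never defined in the listings above. A minimal sketch, assuming the backup strategy simply guesses a random hidden cell:

def baseline_agent(board, revealed):
    # Fallback heuristic (an assumption): pick any hidden cell at random.
    hidden = [
        (r, c) for r in range(BOARD_SIZE) for c in range(BOARD_SIZE)
        if not revealed[r][c]
    ]
    return random.choice(hidden)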

Experiment Outcome

We ran this agent through 10 games of Minesweeper (8×8 board, 10 hidden mines). The result: 10 wins out of 10, with no mines triggered. It's a small sample, but it suggests that deliberate, branching reasoning can beat naive guessing and even single-chain logic on this task.
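For context, here is a sketch of the kind of driver loop the experiment implies; the original harness isn't shown, so the win/loss bookkeeping below is an assumption (flagged_mines and known_safe stay empty in this simplified version):

def play_game():
    board, mines = generate_board()
    revealed = np.zeros((BOARD_SIZE, BOARD_SIZE), dtype=bool)
    flagged_mines, known_safe = set(), set()
    safe_cells = BOARD_SIZE * BOARD_SIZE - NUM_MINES

    # Keep revealing until every safe cell is open (win) or a mine is hit (loss).
    while revealed.sum() < safe_cells:
        r, c = tot_llm_agent(board, revealed, flagged_mines, known_safe)
        if (r, c) in mines:
            return False
        revealed[r][c] = True
    return True

wins = sum(play_game() for _ in range(10))
print(f"Win rate: {wins}/10")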

Wrapping Up

Tree-of-Thought prompting is a meaningful step forward in LLM reasoning. Instead of guessing or rushing to conclusions, models are prompted to evaluate possibilities, compare outcomes, and make calculated decisions, much like humans engaging System 2.

This technique transforms language models into structured thinkers, capable of navigating uncertainty, solving puzzles, and planning strategically. The future of prompting isn’t just about clarity — it’s about designing prompts that teach the model to think.

Sources

  • Kahneman, D. (2011). Thinking, Fast and Slow.
  • Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601
  • OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774
  • Wikipedia: Minesweeper (video game)