In this report, we explore the performance of two models, LLaMA-3.1-8b-instant and LLaMA-3.3-70b-specdec, in solving a set of riddles. The models' answers are evaluated against correct solutions to assess their problem-solving abilities. The evaluation focuses on the models' capacity for logical deduction, reasoning and handling multi-step puzzles.
## 1. Introduction
This report examines the capabilities of state-of-the-art large language models (LLMs) and reasoning models in solving riddles, a task that requires logical thinking and problem-solving skills. Riddles are often designed to challenge cognitive abilities such as pattern recognition, deduction and strategic problem-solving. In this study, two models, LLaMA-3.1-8b-instant and LLaMA-3.3-70b-specdec, are tested on a collection of 28 riddles spanning categories such as logic puzzles, mathematical riddles and verbal reasoning.
### 1.1 Objective
This study aims to:
- Evaluate and compare the performance of LLMs and reasoning models in understanding and solving different types of riddles
- Analyze the strengths and weaknesses of these models in terms of logical reasoning, adaptability and problem-solving strategies
- Observe their behaviour when faced with permuted versions of famous riddles
## 2. Methodology
### 2.1 Dataset
The dataset used consists of 28 riddles, each varying in category and difficulty. The riddles were sourced from different platforms known for their logical challenges and cover a range of topics including puzzles based on mathematics, logic, reasoning and wordplay.
### 2.2 Models
The models tested in this experiment are:
- LLaMA-3.1-8b-instant: A general-purpose language model with basic reasoning capabilities.
- LLaMA-3.3-70b-specdec: A more specialized reasoning-focused model, designed to handle logical deduction and complex problem-solving tasks.
### 2.3 Evaluation Criteria
The models' responses were evaluated based on:
- **Correctness**: Whether the model's answer matched the expected solution (a minimal scoring sketch is shown below)
- **Reasoning Process**: How effectively the model demonstrated logical deduction or reasoning when solving the riddle
- **Adaptability**: How well the model handled modified versions of the riddles, which included slight changes in constraints, problem structure or language
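To illustrate how a correctness check of this kind could be scripted, here is a minimal sketch; it is an illustrative stand-in rather than the harness actually used in this study, and the `ask_model()` helper and the dataset entry are hypothetical:

```python
# Minimal sketch of a correctness-scoring loop. The dataset format and the
# ask_model() helper are hypothetical stand-ins, not this study's harness.
def ask_model(model_name: str, riddle: str) -> str:
    # Placeholder: in practice this would call the model under test via its API.
    return "The farmer takes the goat across first ..."

def accuracy(model_name: str, riddles: list[dict]) -> float:
    correct = 0
    for item in riddles:
        answer = ask_model(model_name, item["question"])
        # Naive keyword match; real grading would need manual review of answers.
        if item["expected"].lower() in answer.lower():
            correct += 1
    return correct / len(riddles)

riddles = [
    {"question": "What has keys but can't open locks?", "expected": "piano"},
]
print(f"{accuracy('llama-3.1-8b-instant', riddles):.0%}")
```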
## 3. Results
### 3.1 Performance on Original Riddles
The LLaMA-3.1-8b-instant model correctly answered 10 out of 28 riddles (**36%**), while the reasoning model correctly answered 22 out of 28 (**79%**). Notably, the reasoning model performed significantly better on riddles requiring multi-step reasoning or strategic problem-solving, such as the river-crossing riddles. When faced with this kind of riddle, the LLM repeats itself indefinitely because it cannot find a solution that satisfies the constraints (see riddles.md). The reasoning model, on the other hand, only struggled with the more complex and strategic riddles, such as the following:
>You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. How do you win this game of hide and seek?
The reasoning model answered the following:
>To win this game of hide and seek, you should open box 3 every morning. Here's why this strategy works:
>1. On the first night, the cat can be in any of the boxes (1 to 5).
>2. On the first morning, you open box 3. If the cat is in box 3, you win. If not, the cat must be in one of the boxes 1, 2, 4, or 5.
>3. On the first night, the cat jumps to an adjacent box. Since the cat can only jump to an adjacent box, if it was in box 1, it can now be in box 2. If it was in box 2, it can now be in box 1 or 3. If it was in box 4, it can now be in box 3 or 5. If it was in box 5, it can now be in box 4.
>4. On the second morning, you again open box 3. If the cat was in box 2 on the first night, it could have jumped to box 3. If it was in box 4 on the first night, it could have also jumped to box 3.
>5. This pattern continues, and no matter where the cat jumps, it will always have a chance of being in box 3 the next morning. By always opening box 3, you ensure that you have the highest chance of finding the cat, and you will eventually find it. This strategy works because box 3 is the central box, and the cat's movement is restricted to adjacent boxes, making it the most likely box for the cat to be in at any given time.
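One way to see why this answer fails is to track the set of boxes the cat could still occupy after each check: a cat alternating between boxes 1 and 2 evades the "always box 3" strategy forever, whereas the classic sweep 2, 3, 4, 4, 3, 2 is guaranteed to succeed. The short sketch below (our verification, not part of the original evaluation) demonstrates this:

```python
def survivors(checks, boxes=5):
    """Return the set of boxes the cat could still occupy after running
    the given sequence of morning checks against every possible cat path."""
    possible = set(range(1, boxes + 1))   # the cat may start in any box
    for check in checks:
        possible.discard(check)           # morning: open one box
        if not possible:
            return possible               # empty set: the cat is always caught
        # night: the cat jumps to an adjacent box
        possible = {p + d for p in possible for d in (-1, 1) if 1 <= p + d <= boxes}
    return possible

print(survivors([3] * 100))           # non-empty: "always box 3" never guarantees a win
print(survivors([2, 3, 4, 4, 3, 2]))  # set(): the sweep catches the cat in six mornings
```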
### 3.2 Performance on Modified Riddles
Both models struggled with the modified riddles, achieving a score of 4 out of 18 (**22%**). This suggests that the models relied heavily on patterns they had previously encountered during training rather than fully adapting to the updated constraints and conditions of the modified riddles. A surprising result is that both the LLM and the reasoning model failed to retrieve the correct answer even when it was explicitly stated in the riddle itself. For example:
>You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. The cat starts in box 1. How do you win this game of hide and seek?
>You're at a fork in the road in which one direction leads to the City of Lies (where everyone always lies) and the other to the City of Truth (where everyone always tells the truth). There's a person at the fork who lives in the City of Truth. What question could you ask the person to find out which road leads to the City of Truth?
Despite the correct answer being embedded in the modified riddle (the cat's starting box is given, and the person at the fork is known to live in the City of Truth and can simply be asked for directions), the models do not use this information and instead respond as if faced with the original riddle. This suggests a strong reliance on memorization rather than dynamic reasoning, with the models prioritizing familiar response patterns over contextual adaptation.
This pattern is also observed when removing the constraints of a riddle. For example:
>A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat, that can carry a wolf, a goat and a cabbage in different compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer bring the wolf, the goat and the cabbage across the river without anything being eaten?
Instead of finding the straightforward answer to a problem that no longer exists (the boat's compartments allow the farmer to ferry all three across in a single trip), the models respond as if they still had to satisfy the original requirements. To further solidify these findings, a riddle that both models had answered correctly was modified only in the color of the socks, but neither model provided the correct answer: the LLM spiraled into an infinite loop of repeating the wrong answer, while the reasoning model gave the very same answer as for the original riddle.
>A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of red socks?
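For reference, the expected answer to the modified riddle follows from a simple worst-case count (assuming, as in the original riddle, that a pair means two socks):

```python
blue, black, red = 21, 15, 17
# Worst case: every blue and black sock is drawn before any red one,
# so two further draws are needed to guarantee a pair of red socks.
print(blue + black + 2)  # 38
```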
## 4. Analysis
### 4.1 Strengths of the models
- LLaMA-3.1-8b-instant showed solid performance on basic and famous riddles involving straightforward logic and reasoning.
- LLaMA-3.3-70b-specdec excelled in multi-step reasoning tasks, providing accurate solutions to more complex puzzles.
### 4.2 Limitations of the Models
- Both models struggled with riddles that had changes in structure or new constraints, suggesting that they may not fully grasp the underlying logic or adapt well to new situations.
- The LLaMA-3.1-8b-instant model had particular difficulty with multi-step puzzles and puzzles requiring abstract thinking or strategic deduction, though this might be due to its smaller size.
### 4.3 Insights
The results indicate that while LLMs have made impressive advancements in natural language processing and pattern recognition, they still face challenges in tasks that require deeper logical reasoning and adaptability. Models like LLaMA-3.3-70b-specdec, with a stronger focus on reasoning, outperform general-purpose LLMs in these areas. It is surprising, however, that neither model was able to solve the riddles even when exposed to the actual answer. These results further indicate that language models exploit their training data and that data contamination leads to misleading observations about the models' performance.
## 5. Discussion
### 5.1 Future Work
A future task for evaluating how much of the dataset is contaminated in the models' training data would be to measure memorization and exploitation, as done in [Magar & Schwartz (ACL 2022)](https://aclanthology.org/2022.acl-short.18.pdf).
Further experiments could explore the impact of model size, dataset quality and fine-tuning on performance. Fine-tuning on logic-based datasets could lead to improved results. Another approach is prompt engineering: Chain-of-Thought (CoT) reasoning, for example, could help the models reason more systematically, as sketched below.
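A minimal sketch of such a CoT prompt, reusing the constraint-free river-crossing riddle from Section 3.2; the instruction wording is illustrative and was not part of this study's experiments:

```python
# Minimal sketch of a zero-shot Chain-of-Thought prompt wrapper.
# The riddle is the constraint-free variant from Section 3.2; the
# instruction wording is illustrative and was not tested in this study.
RIDDLE = (
    "A farmer wants to cross a river and take with him a wolf, a goat and "
    "a cabbage. He has a boat that can carry a wolf, a goat and a cabbage "
    "in different compartments. How can the farmer bring all three across?"
)

def cot_prompt(riddle: str) -> str:
    return (
        f"{riddle}\n\n"
        "Before answering, list every constraint that is actually stated in "
        "the riddle, note any familiar constraint that is missing, and then "
        "reason step by step to the answer."
    )

print(cot_prompt(RIDDLE))
```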
The construction of adversarial riddles that intentionally mislead the models could help evaluate robustness and identify limitations in logical reasoning.
### 5.2 Conclusion
This study highlights both the strengths and limitations of current LLMs in solving logical riddles. While significant progress has been made in natural language processing, more work is required to improve these models' reasoning abilities, particularly in the context of complex, multi-step problems.