|1|There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?|Correct|Correct|
|1|There are two ducks in front of a duck, two ducks behind a duck and a duck in the middle. How many ducks are there?|False|Correct|
|2|Five people were eating apples, A finished before B, but behind C. D finished before E, but behind B. What was the finishing order?|Correct|Correct|
|3|Jack is looking at Anne. Anne is looking at George. Jack is married, George is not, and we don't know if Anne is married. Is a married person looking at an unmarried person?|False|Correct|
|3|Jack is looking at Anne. Anne is looking at George. Jack is married, George is not, and we don't know if Anne is married. Is a married person looking at an unmarried person?|Correct|Correct|
|4|A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of black socks?|False|Correct|
|5|This "burning rope" problem is a classic logic puzzle. You have two ropes that each take an hour to burn; however, they burn at inconsistent rates. How can you measure 45 minutes? (You can light one or both ropes at one or both ends at the same time.)|False|Correct|
|6|You're at a fork in the road in which one direction leads to the City of Lies (where everyone always lies) and the other to the City of Truth (where everyone always tells the truth). There's a person at the fork who lives in one of the cities, but you're not sure which one. What question could you ask the person to find out which road leads to the City of Truth?|False|Correct|
|7|A girl meets a lion and unicorn in the forest. The lion lies every Monday, Tuesday and Wednesday, and the other days, he speaks the truth. The unicorn lies on Thursdays, Fridays and Saturdays, and the other days of the week, he speaks the truth. "Yesterday, I was lying," the lion told the girl. "So was I," said the unicorn. What day is it?|False|Correct|
|8|There are three people (Alex, Ben and Cody), one of whom is a knight, one a knave and one a spy. The knight always tells the truth, the knave always lies and the spy can either lie or tell the truth. Alex says: "Cody is a knave." Ben says: "Alex is a knight." Cody says: "I am the spy." Who is the knight, who is the knave and who is the spy?|False|Correct|
|9|A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat, but it can only fit himself, plus either the wolf, the goat or the cabbage. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer bring the wolf, the goat and the cabbage across the river without anything being eaten?|False|Correct|
|8|There are three people (Alex, Ben and Cody), one of whom is a knight, one a knave and one a spy. The knight always tells the truth, the knave always lies and the spy can either lie or tell the truth. Alex says: "Cody is a knave." Ben says: "Alex is a knight." Cody says: "I am the spy." Who is the knight, who is the knave and who is the spy?|False|False|
|9|A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat, but it can only fit himself, plus either the wolf, the goat or the cabbage. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer bring the wolf, the goat and the cabbage across the river without anything being eaten?|Correct|Correct|
|10|Let's pretend we're on the metric system and use kilograms instead of pounds to give us a starting base number of 100. Four people (Alex, Brook, Chris and Dusty) want to cross a river in a boat that can only carry 100kg. Alex weighs 90kg, Brook weighs 80kg, Chris weighs 60kg and Dusty weighs 40kg, and they have 20kg of supplies. How do they get across?|False|False|
|11|This famous river crossing problem is known as the "bridge and torch" puzzle. Four people are crossing a bridge at night, so they all need a torch, but they just have one that only lasts 15 minutes. Alice can cross in one minute, Ben in two minutes, Cindy in five minutes and Don in eight minutes. No more than two people can cross at a time; and when two cross, they have to go at the slower person's pace. How do they get across in 15 minutes?|False|Correct|
|12|A bad guy is playing Russian roulette with a six-shooter revolver. He puts in one bullet, spins the chambers and fires at you, but no bullet comes out. He gives you the choice of whether or not he should spin the chambers again before firing a second time. Should he spin again?|False|Correct|
|13|A man is caught on the king's property. He is brought before the king to be punished. The king says, "You must give me a statement. If it is true, you will be killed by lions. If it is false, you will be killed by the trampling of wild buffalo. If I can't figure it out, I'll have to let you go." Sure enough, the man was released. What was the man's statement?|False|Correct|
|14|Susan and Lisa decided to play tennis against each other. They bet $1 on each game they played. Susan won three bets and Lisa won $5. How many games did they play?|Correct|Correct|
|15|If five cats can catch five mice in five minutes, how long will it take one cat to catch one mouse?|Correct|Correct|
|16|There are three bags, each containing two marbles. Bag A contains two white marbles, Bag B contains two black marbles and Bag C contains one white marble and one black marble. You pick a random bag and take out one marble, which is white. What is the probability that the remaining marble from the same bag is also white?|False|Correct|
|16|There are three bags, each containing two marbles. Bag A contains two white marbles, Bag B contains two black marbles and Bag C contains one white marble and one black marble. You pick a random bag and take out one marble, which is white. What is the probability that the remaining marble from the same bag is also white?|Correct|Correct|
|17|Three men are lined up behind each other. The tallest man is in the back and can see the heads of the two in front of him; the middle man can see the one man in front of him; the man in front can't see anyone. They are blindfolded, and hats are placed on their heads, picked from three black hats and two white hats. The extra two hats are hidden, and the blindfolds are removed. The tallest man is asked if he knows what color hat he's wearing; he doesn't. The middle man is asked if he knows; he doesn't. But the man in front, who can't see anyone, says he knows. How does he know, and what color hat is he wearing?|False|False|
|18|There are three crates, one with apples, one with oranges and one with both apples and oranges mixed. Each crate is closed and labeled with one of three labels: Apples, Oranges or Apples and Oranges. The label maker broke and labeled all of the crates incorrectly. How could you pick just one fruit from one crate to figure out what's in each crate?|False|Correct|
|19|You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. How do you win this game of hide and seek?|False|False|
...
...
@@ -29,11 +29,11 @@ The dataset consists of 28 riddles, of varying categories, collected from differ
|23|A grandmother, two mothers and two daughters went shopping together and everyone bought one purse each. How many purses did they bring home all together?|False|Correct|
|24|You are all alone in a dark room with a match and matchbox. Nearby you have 3 objects: a candle, an oil lamp and a log of firewood. Which thing do you light first?|Correct|Correct|
|25|A spider was given $28, an ant was given $21 and a chicken was given $7. How much money does the dog get?|False|False|
|26|You are participating in the swimming finals at the Olympics. In the final few seconds of the race you narrowly pass the swimmer who was in third place. What place did you get?|Correct|Correct|
|26|You are participating in the swimming finals at the Olympics. In the final few seconds of the race you narrowly pass the swimmer who was in third place. What place did you get?|False|Correct|
|27|You are walking through a long train tunnel. When you're one-third of the way through, you hear a train coming from behind. You know that if you run back to the entrance, you'll make it out just in time, and if you run forward to the exit, you'll also make it out just in time. Which way should you run to survive?|False|False|
|28|Two brothers are born on the same day, to the same parents, yet they are not twins. How is this possible?|Correct|Correct|
When comparing the answers from the LLM and the reasoning model, we observe that the reasoning model outperforms the LLM in solving riddles, achieving a score of 23/28 (**82%**). Meanwhile the LLM only solves 9 out of the 28 riddles (**32%**). A key challenge for the LLM is its difficulty in logical deduction and strategic problem-solving, as demonstrated by incorrect answers to riddles (3), (12), (16), (18), (21), and (23). In these cases, the LLM either fails to follow the riddle’s constraints, makes incorrect assumptions, or generates responses that do not align with the problem’s logic. Where the LLM particularly struggles are multi-step puzzles, like the river crossing riddles. The model repeats itself infinitely because it cannot come up with a solution for the problem. In most of those cases, the reasoning model deduces a correct answer, however it could not find the solution to very complex and strategic riddles, like the one where it has to deduce why the man knows his hat color without seeing it (17).
When comparing the answers from the LLM and the reasoning model, we observe that the reasoning model outperforms the LLM in solving riddles, achieving a score of 22/28 (**79%**), while the LLM solves only 10 of the 28 riddles (**36%**). A key challenge for the LLM is logical deduction and strategic problem-solving, as demonstrated by its incorrect answers to riddles (1), (12), (18), (21), and (23). In these cases, the LLM either fails to follow the riddle’s constraints, makes incorrect assumptions, or generates responses that do not align with the problem’s logic. The LLM struggles particularly with multi-step puzzles such as the river crossing riddles, where it repeats itself indefinitely because it cannot find a solution. In most of those cases the reasoning model deduces a correct answer. However, it could not solve very complex, strategic riddles, such as the one in which it must deduce why the man knows his hat color without seeing it (17).
@@ -44,7 +44,7 @@ When comparing the answers from the LLM and the reasoning model, we observe that
|4|There are three people (Alex, Ben and Cody), Ben is a knight, Alex is a knave and Cody a spy. The knight always tells the truth, the knave always lies and the spy can either lie or tell the truth. Alex says: "I am not a knave.". Ben says: "I am a knight.". Cody says: "I am a spy.". Who is the knight, who is the knave and who is the spy?|False|Correct|
|5|A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat, that can carry a wolf, a goat and a cabbage in different compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer bring the wolf, the goat and the cabbage across the river without anything being eaten?|False|False|
|6|A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat, that can only fit himself. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer bring the wolf, the goat and the cabbage across the river without anything being eaten?|False|False|
|7|Let's pretend we're on the metric system and use kilograms instead of pounds to give us a starting base number of 100. Four people (Alex, Brook, Chris and Dusty) want to cross a river in a boat that can carry 300kg. Alex weighs 90kg, Brook weighs 80kg, Chris weighs 60kg and Dusty weighs 40kg, and they have 20kg of supplies. How do they get across?|False|Correct|
|7|Let's pretend we're on the metric system and use kilograms instead of pounds to give us a starting base number of 100. Four people (Alex, Brook, Chris and Dusty) want to cross a river in a boat that can carry 300kg. Alex weighs 90kg, Brook weighs 80kg, Chris weighs 60kg and Dusty weighs 40kg, and they have 20kg of supplies. How do they get across?|Correct|Correct|
|8|This famous river crossing problem is known as the "bridge and torch" puzzle. Four people are crossing a bridge at night and they have four torches that only last 15 minutes each. They can all cross at the same time. Alice can cross in one minute, Ben in two minutes, Cindy in five minutes and Don in eight minutes. They have to go at the slower person's pace. How do they get across in 15 minutes?|False|False|
|9|A bad guy is playing Russian roulette with a six-shooter revolver. He puts in no bullet, spins the chambers and fires at you, but no bullet comes out. He gives you the choice of whether or not he should spin the chambers again before firing a second time. Should he spin again?|Correct|False|
|10|A bad guy is playing Russian roulette with a six-shooter revolver. He puts in six bullets, spins the chambers and fires at you. You die. He gives you the choice of whether or not he should spin the chambers again before firing a second time. Should he spin again?|Kinda|False|
...
...
@@ -55,4 +55,20 @@ When comparing the answers from the LLM and the reasoning model, we observe that
|15|David's grandfather has three sons: Snap, Crackle, and _____?|False|False|
|16|A grandmother, her daughter and the daughters daughter went shopping together and everyone bought one purse each. How many purses did they bring home all together?|Correct|Correct|
|17|You are all alone in a dark room with a match and lighter. Nearby you have 3 objects: a candle, an oil lamp and a log of firewood. Which thing do you light first?|False|False|
|18|You are participating in a race at the Olympics. In the final few seconds of the race you narrowly pass the runner who was in last place. What place did you get?|False|False|
\ No newline at end of file
|18|You are participating in a race at the Olympics. In the final few seconds of the race you narrowly pass the runner who was in last place. What place did you get?|False|False|
The modified riddles were derived from the original set: some were simplified by explicitly including the answer, and others were made intentionally more difficult or unsolvable to further test the reasoning model. Both models performed worse on the modified riddles than on the original ones, each achieving a score of 4/18 (**22%**).
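The reported scores can be recomputed from the result tables with a few lines of Python. The sketch below is illustrative only: it assumes the last two cells of each row are the LLM verdict and the reasoning-model verdict, in that order, and that only the literal value `Correct` counts as solved (so `Kinda` does not). The helper names `tally` and `percent` and the shortened sample rows are hypothetical.

```python
def tally(rows):
    """Count 'Correct' verdicts in the last two cells of each table row.

    Assumed row layout: |id|riddle text|LLM verdict|reasoning verdict|
    """
    llm_correct = reasoning_correct = 0
    for row in rows:
        # Split from the right so the riddle text is never cut apart.
        _, llm, reasoning = row.strip().rstrip("|").rsplit("|", 2)
        llm_correct += llm.strip() == "Correct"
        reasoning_correct += reasoning.strip() == "Correct"
    return llm_correct, reasoning_correct, len(rows)


def percent(correct, total):
    """Score as a whole-number percentage."""
    return round(100 * correct / total)


# Shortened, hypothetical stand-ins for real table rows.
sample_rows = [
    "|9|Russian roulette, no bullet ...|Correct|False|",
    "|10|Russian roulette, six bullets ...|Kinda|False|",
    "|16|Grandmother, daughter ...|Correct|Correct|",
]

if __name__ == "__main__":
    print(tally(sample_rows))
    print(percent(4, 18), percent(22, 28), percent(10, 28))
```

Rounding to whole percentages matches the figures quoted in the text: 4/18, 22/28, and 10/28 give 22, 79, and 36.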
By examining specific examples, we observe that the models fail to adapt to changes in the riddle structure and often answer as if the riddle remained in its original form. For instance:
- In the river crossing riddle (with the wolf, goat, and cabbage), we explicitly state that the boat can fit everything at once. However, both models ignore this information and attempt to solve it using the traditional constraints.
- In the bridge crossing riddle, the modified version allows everyone to cross the bridge simultaneously, yet both models continue solving it using the original restriction that only two people can cross at a time.
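For the bridge crossing case, the arithmetic shows how far the modified riddle departs from the original. The sketch below is not part of the evaluation code; it works only from the riddle text. It assumes the well-known five-trip schedule for the original puzzle, and for the modified one it assumes the whole group crosses once at the slowest person's pace. The function names are hypothetical.

```python
# Crossing times in minutes, as stated in the riddle.
TIMES = {"Alice": 1, "Ben": 2, "Cindy": 5, "Don": 8}
TORCH_MINUTES = 15


def original_schedule_time(times):
    """Total time of the classic schedule: one torch, at most two per trip."""
    trips = [
        ("Alice", "Ben"),   # cross together
        ("Alice",),         # Alice brings the torch back
        ("Cindy", "Don"),   # the two slowest cross together
        ("Ben",),           # Ben brings the torch back
        ("Alice", "Ben"),   # cross together again
    ]
    # Each trip takes as long as its slowest walker.
    return sum(max(times[p] for p in trip) for trip in trips)


def modified_crossing_time(times):
    """Everyone has a torch and all may cross at once: a single trip."""
    return max(times.values())


if __name__ == "__main__":
    print(original_schedule_time(TIMES))   # 2 + 1 + 8 + 2 + 2
    print(modified_crossing_time(TIMES))   # pace of the slowest person
```

Under these assumptions the original puzzle needs the full 15 minutes, whereas the modified version is solved by a single 8-minute crossing, so an answer that replays the two-at-a-time schedule ignores the changed constraints.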
This suggests that the models may rely heavily on pattern recognition rather than true reasoning. A likely explanation is training data contamination and memorization, where the models have encountered these riddles (or similar ones) during training and reproduce memorized solutions rather than processing the updated information. This behavior indicates a lack of deep semantic understanding and highlights a key limitation of current LLMs in logical reasoning and adaptability.
The following example further supports this interpretation:
>A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of red socks?
The only change made to this riddle is the target color, from black to red. However, the reasoning model, which had solved the original riddle correctly, still returns the same answer as before, even though the correct answer is now different.
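A short worst-case calculation shows why the color matters. The sketch below assumes the standard reading of the riddle: in the worst case the man draws every sock of the other colors first and then needs two socks of the target color. The function name `draws_to_guarantee_pair` is hypothetical.

```python
def draws_to_guarantee_pair(counts, target):
    """Worst-case number of draws that guarantees a pair of `target` socks."""
    if counts.get(target, 0) < 2:
        raise ValueError(f"no pair of {target} socks exists in the drawer")
    # Worst case: every sock of the other colors comes out first.
    others = sum(n for color, n in counts.items() if color != target)
    return others + 2


SOCKS = {"blue": 21, "black": 15, "red": 17}

if __name__ == "__main__":
    print(draws_to_guarantee_pair(SOCKS, "black"))  # 21 + 17 + 2
    print(draws_to_guarantee_pair(SOCKS, "red"))    # 21 + 15 + 2
```

With the counts from the riddle this gives 40 draws for black and 38 for red, so an answer that was correct for the original riddle cannot also be correct for the modified one.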