Commit 7e8f5d7e authored by monteiro's avatar monteiro

Add final report and license

parent 4ca86c56
@@ -2,46 +2,85 @@
## Abstract
In this report, we explore the performance of two models - LLaMa-3.1-8b-instant and LLaMa-3.3-70b-specdec - in solving a set of riddles. The models' answers are evaluated against the correct solutions to assess their problem-solving abilities. The evaluation focuses on the models' capacity for logical deduction, reasoning and handling multi-step puzzles.
## 1. Introduction
This report examines the capabilities of state-of-the-art large language models (LLMs) and reasoning models in solving riddles, a task that requires logical thinking and problem-solving skills. Riddles are often designed to challenge cognitive abilities such as pattern recognition, deduction and strategic problem-solving. In this study, two models - LLaMa-3.1-8b-instant and LLaMa-3.3-70b-specdec - are tested on a collection of 28 riddles, spanning categories such as logic puzzles, mathematical riddles and verbal reasoning.
### 1.1 Objective
---
The objective of this study is to:
- Evaluate and compare the performance of LLMs and reasoning models in understanding and solving different types of riddles
- Analyze the strengths and weaknesses of these models in terms of logical reasoning, adaptability and problem-solving strategies
- Observe behaviour when faced with permutated versions of famous riddles
This study aims to:
- evaluate and compare the performance of LLMs and reasoning models in understanding and solving different types of riddles
- analyze the strengths and weaknesses of these models in terms of logical reasoning, adaptability and problem-solving strategies
- observe behaviour when faced with permuted versions of famous riddles
## 2. Methodology
### 2.1 Dataset
The dataset used consists of 28 riddles, each varying in category and difficulty. The riddles were sourced form different platforms known for their logical challenges. The riddles cover a range of topics including puzzles based on mathematics, logic, reasoning and wordplay.
---
The dataset used consists of 28 riddles, each varying in category and difficulty. The riddles were sourced from different platforms known for their logical challenges and cover a range of topics including puzzles based on mathematics, logic, reasoning and wordplay.
### 2.2 Models
---
The models tested in this experiment are:
- LLaMa-3.1-8b-instant: A general-purpose language model with basic reasoning capabilities
- LLaMA-3.3-70b-specdec: A more specialized reasoining-focused model, designed to handle logical deduction and complex problem-solving tasks.
- LLaMA-3.3-70b-specdec: A more specialized reasoning-focused model, designed to handle logical deduction and complex problem-solving tasks.
### 2.3 Evaluation Criteria
---
The models' responses were evaluated based on:
- **Correctness**: Whether the model's answer matched the expected solution
- **Reasoning Process**: How well the model demonstrated logical deduction or reasoning for solving the riddle
- **Reasoning Process**: How effectively the model demonstrated logical deduction or reasoning when solving the riddle
- **Adaptability**: How well the model handled modified versions of the riddles, which included slight changes in constraints, problem structure or language.
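
The correctness criterion reduces to a simple accuracy ratio. The following is a minimal sketch of that arithmetic using the counts reported in Section 3; the function name `accuracy_percent` and the dictionary labels are illustrative assumptions, not part of the original evaluation code, and the report does not state how answers were matched against the expected solutions.

```python
def accuracy_percent(correct: int, total: int) -> int:
    """Return the share of correctly answered riddles as a rounded percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100 * correct / total)


# Counts as reported in Section 3 of this report (labels are illustrative).
reported = {
    "LLaMa-3.1-8b-instant, original riddles": (10, 28),
    "LLaMA-3.3-70b-specdec, original riddles": (22, 28),
    "modified riddles": (4, 18),
}

for label, (correct, total) in reported.items():
    print(f"{label}: {correct}/{total} = {accuracy_percent(correct, total)}%")
```

Rounding to whole percentages reproduces the figures quoted in the results (36%, 79% and 22%).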
## 3. Results
### 3.1 Performance on Original Riddles
The LLaMa-3.1-8b-instant model correctly answered 10 out of 28 riddles (**36%**), while the reasoning model correctly answered 22 out of 28 riddles (**79%**). Notably, the reasoning model performed significantly better, particularly on riddles requiring multi-step reasoning or strategic problem-solving.
---
The LLaMa-3.1-8b-instant model correctly answered 10 out of 28 riddles (**36%**), while the reasoning model correctly answered 22 out of 28 riddles (**79%**). The reasoning model performed considerably better, particularly on riddles requiring multi-step reasoning or strategic problem-solving, such as river-crossing riddles. When faced with this kind of riddle, the LLM repeats itself indefinitely because it cannot come up with a solution that satisfies the constraints (see riddles.md). The reasoning model, on the other hand, only struggled with the more complex and strategic riddles, like the following:
>You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. How do you win this game of hide and seek?
The reasoning model answered the following:
>To win this game of hide and seek, you should open box 3 every morning. Here's why this strategy works:
>1. On the first night, the cat can be in any of the boxes (1 to 5).
>2. On the first morning, you open box 3. If the cat is in box 3, you win. If not, the cat must be in one of the boxes 1, 2, 4, or 5.
>3. On the first night, the cat jumps to an adjacent box. Since the cat can only jump to an adjacent box, if it was in box 1, it can now be in box 2. If it was in box 2, it can now be in box 1 or 3. If it was in box 4, it can now be in box 3 or 5. If it was in box 5, it can now be in box 4.
>4. On the second morning, you again open box 3. If the cat was in box 2 on the first night, it could have jumped to box 3. If it was in box 4 on the first night, it could have also jumped to box 3.
>5. This pattern continues, and no matter where the cat jumps, it will always have a chance of being in box 3 the next morning. By always opening box 3, you ensure that you have the highest chance of finding the cat, and you will eventually find it. This strategy works because box 3 is the central box, and the cat's movement is restricted to adjacent boxes, making it the most likely box for the cat to be in at any given time.
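
The flaw in this answer can be checked mechanically by tracking every box in which an uncaught cat could still be hiding. The sketch below is an illustration under stated assumptions, not the evaluation procedure used in this study: the helper names are hypothetical, and the guess sequence 2, 3, 4, 2, 3, 4 is a commonly cited solution to this puzzle rather than one quoted from riddles.md.

```python
from typing import Iterable, Optional, Set

BOXES = range(1, 6)  # boxes numbered 1 to 5


def neighbours(box: int) -> Set[int]:
    """Boxes the cat can jump to from `box` during one night."""
    return {b for b in (box - 1, box + 1) if b in BOXES}


def surviving_positions(guesses: Iterable[int],
                        start: Optional[Set[int]] = None) -> Set[int]:
    """Boxes where the cat could be after all guesses have failed.

    An empty set means the cat cannot have escaped every guess,
    i.e. the strategy is guaranteed to find it.
    """
    possible = set(BOXES) if start is None else set(start)
    for guess in guesses:
        possible.discard(guess)  # morning: this box was opened without success
        if not possible:
            return set()         # no hiding place left: the cat was caught
        # night: the cat must jump to an adjacent box
        possible = set().union(*(neighbours(b) for b in possible))
    return possible


# Always opening box 3 never narrows the possibilities down.
print(surviving_positions([3] * 50))

# A six-morning sequence that leaves the cat nowhere to hide.
print(surviving_positions([2, 3, 4, 2, 3, 4]))
```

Under these rules, a cat that alternates between boxes 1 and 2 is never found by opening box 3, so the model's claim that it "will eventually find it" does not hold. The optional `start` argument covers variants in which the starting box is known.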
### 3.2 Performance on Modified Riddles
Both models struggled with the modified riddles, achieving a score of 4 out of 18 riddles (**22%**). This suggests that the models relied heavily on patterns they had previously encountered during training rather than fully adapting to the updated constraints and conditions of the modified riddles.
---
Both models struggled with the modified riddles, achieving a score of 4 out of 18 riddles (**22%**). This suggests that the models relied heavily on patterns they had previously encountered during training rather than fully adapting to the updated constraints and conditions of the modified riddles. A surprising result is that both the LLM and the reasoning model failed to pick up the correct answer even when it was explicitly added to the riddle. For example:
>You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. The cat starts in box 1. How do you win this game of hide and seek?
>You're at a fork in the road in which one direction leads to the City of Lies (where everyone always lies) and the other to the City of Truth (where everyone always tells the truth). There's a person at the fork who lives in the City of Truth. What question could you ask the person to find out which road leads to the City of Truth?
Despite the correct answer being in the modified riddle, the models do not use this information to solve it and instead respond as if faced with the original riddle. This suggests a strong reliance on memorization rather than dynamic reasoning, with the models prioritizing familiar response patterns over contextual adaptation.
This pattern is also observed when the constraints of a riddle are removed. For example:
>A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat, that can carry a wolf, a goat and a cabbage in different compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer bring the wolf, the goat and the cabbage across the river without anything being eaten?
Instead of giving the straightforward answer to a problem that no longer exists, the models respond as if the original requirements still had to be met. To further test this finding, a riddle that both models had answered correctly was changed only in the color of the socks asked for. Neither model then provided the correct answer: the LLM spiraled into an infinite loop of repeating the wrong answer, and the reasoning model gave the very same answer as for the original riddle.
>A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of red socks?
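
The worst-case reasoning behind this riddle can be written out in a few lines. This is a sketch under the usual reading of the puzzle (the man may draw every sock of the other colors before the first red one); the function name is a hypothetical helper, and the expected answer recorded in the project's riddle file is not quoted in this section.

```python
from typing import Dict


def draws_to_guarantee_pair(counts: Dict[str, int], colour: str) -> int:
    """Worst-case number of draws needed to be certain of two socks of `colour`."""
    if counts.get(colour, 0) < 2:
        raise ValueError(f"fewer than two {colour} socks in the drawer")
    # Worst case: every sock of every other colour is drawn first.
    others = sum(n for c, n in counts.items() if c != colour)
    return others + 2


drawer = {"blue": 21, "black": 15, "red": 17}
print(draws_to_guarantee_pair(drawer, "red"))  # 21 + 15 + 2 = 38
```

Because the result depends on which color is requested, reusing the answer to the original riddle is only correct if the two colors happen to have the same count.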
## 4. Analysis
### 4.1 Strenghts of the models
- LLaMa-3.1-8b-instant showed solid performance on basic riddles involving straightforward logic and reasoning.
### 4.1 Strengths of the models
---
- LLaMa-3.1-8b-instant showed solid performance on basic and famous riddles involving straightforward logic and reasoning.
- LLaMA-3.3-70b-specdec excelled in multi-step reasoning tasks, providing accurate solutions to more complex puzzles.
### 4.2 Limitations of the Models
---
- Both models struggled with riddles that had changes in structure or new constraints, suggesting that they may not fully grasp the underlying logic or adapt well to new situations.
- The LLaMa-3.1-8b-instant model had particular difficulty with multi-step puzzles and puzzles requiring abstract thinking or strategic deduction.
- The LLaMa-3.1-8b-instant model had particular difficulty with multi-step puzzles and puzzles requiring abstract thinking or strategic deduction, though this may be due to its smaller size.
### 4.3 Insights
The results indicate that while LLMs have made impressive advancements in natural language processing and pattern recognition, they still face challenges in tasks that require deeper logical reasoning and adaptability. Models like LLaMA-3.3-70b-specdec, with a stronger focus on reasoning, outperform general-purpose LLMs in these areas.
---
The results indicate that while LLMs have made impressive advancements in natural language processing and pattern recognition, they still face challenges in tasks that require deeper logical reasoning and adaptability. Models like LLaMA-3.3-70b-specdec, with a stronger focus on reasoning, outperform general-purpose LLMs in these areas. It is surprising, however, that neither model was able to solve the riddles even when the actual answer was given in the riddle text. These results further suggest that language models exploit their training data and that data contamination can lead to misleading conclusions about the models' performance.
## 5. Discussion
### 5.1 Future Work
Future experiments could focus on enhancing the models' reasoning capabilities by introducing more diverse types of puzzles or integrating additional reasoning frameworks. Additionally, further research is needed to understand how LLMs can be trained to better adapt to novel constraints in problem-solving.
---
A future task for estimating how much of the dataset is contained in the models' training data would be to measure memorization and exploitation, as in the work of [Magar & Schwartz (ACL 2022)](https://aclanthology.org/2022.acl-short.18.pdf).
Further experiments could explore the impact of model size, dataset quality and fine-tuning on performance. Fine-tuning on logic-based datasets could lead to improved results. Another approach is prompt engineering: for example, Chain-of-Thought (CoT) prompting could help models reason more systematically.
The construction of adversarial riddles that intentionally mislead the models could help evaluate robustness and identify limitations in logical reasoning.
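
As an illustration of the Chain-of-Thought suggestion, a riddle could be wrapped in a prompt that asks the model to restate the constraints before answering. The sketch below is hypothetical: the function name and the prompt wording are assumptions, no such prompt was used or tested in this study, and it only builds a string without calling any model API.

```python
def build_cot_prompt(riddle: str) -> str:
    """Wrap a riddle in a hypothetical Chain-of-Thought style prompt."""
    return (
        "Solve the following riddle.\n"
        "First restate every constraint exactly as it is written, "
        "then reason step by step, "
        "and finish with a line starting with 'Answer:'.\n\n"
        f"Riddle: {riddle.strip()}\n"
    )


print(build_cot_prompt(
    "A man has 53 socks in his drawer: 21 identical blue, 15 identical black "
    "and 17 identical red. The lights are out, and he is completely in the dark. "
    "How many socks must he take out to make 100 percent certain he has "
    "at least one pair of red socks?"
))
```

Asking for the constraints to be restated targets the failure observed with the modified riddles, where the models answered as if the original wording were still in place; whether it actually helps would need to be measured.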
### 5.2 Conclusion
This study highlights the strengths and limitations of current LLMs in solving logical riddles. While significant progress has been made in natural language processing, more work is required to improve these models' reasoning abilities, particularly in the context of complex, multi-step problems.
\ No newline at end of file
---
This study highlights both the strengths and limitations of current LLMs in solving logical riddles. While significant progress has been made in natural language processing, more work is required to improve these models' reasoning abilities, particularly in the context of complex, multi-step problems.
\ No newline at end of file
META LLAMA 3 COMMUNITY LICENSE AGREEMENT
Meta Llama 3 Version Release Date: April 18, 2024
“Agreement” means the terms and conditions for use, reproduction, distribution and modification of the
Llama Materials set forth herein.
“Documentation” means the specifications, manuals and documentation accompanying Meta Llama 3
distributed by Meta at https://llama.meta.com/get-started/.
“Licensee” or “you” means you, or your employer or any other person or entity (if you are entering into
this Agreement on such person or entity’s behalf), of the age required under applicable laws, rules or
regulations to provide legal consent and that has legal authority to bind your employer or such other
person or entity if you are entering in this Agreement on their behalf.
“Meta Llama 3” means the foundational large language models and software and algorithms, including
machine-learning model code, trained model weights, inference-enabling code, training-enabling code,
fine-tuning enabling code and other elements of the foregoing distributed by Meta at
https://llama.meta.com/llama-downloads.
“Llama Materials” means, collectively, Meta’s proprietary Meta Llama 3 and Documentation (and any
portion thereof) made available under this Agreement.
“Meta” or “we” means Meta Platforms Ireland Limited (if you are located in or, if you are an entity, your
principal place of business is in the EEA or Switzerland) and Meta Platforms, Inc. (if you are located
outside of the EEA or Switzerland).
By clicking “I Accept” below or by using or distributing any portion or element of the Llama Materials,
you agree to be bound by this Agreement.
1. License Rights and Redistribution.
a. Grant of Rights. You are granted a non-exclusive, worldwide, non-transferable and royalty-free
limited license under Meta’s intellectual property or other rights owned by Meta embodied in the Llama
Materials to use, reproduce, distribute, copy, create derivative works of, and make modifications to the
Llama Materials.
b. Redistribution and Use.
i. If you distribute or make available the Llama Materials (or any derivative works
thereof), or a product or service that uses any of them, including another AI model, you shall (A) provide
a copy of this Agreement with any such Llama Materials; and (B) prominently display “Built with Meta
Llama 3” on a related website, user interface, blogpost, about page, or product documentation. If you
use the Llama Materials to create, train, fine tune, or otherwise improve an AI model, which is
distributed or made available, you shall also include “Llama 3” at the beginning of any such AI model
name.
ii. If you receive Llama Materials, or any derivative works thereof, from a Licensee as part
of an integrated end user product, then Section 2 of this Agreement will not apply to you.
iii. You must retain in all copies of the Llama Materials that you distribute the following
attribution notice within a “Notice” text file distributed as a part of such copies: “Meta Llama 3 is
licensed under the Meta Llama 3 Community License, Copyright © Meta Platforms, Inc. All Rights
Reserved.”
iv. Your use of the Llama Materials must comply with applicable laws and regulations
(including trade compliance laws and regulations) and adhere to the Acceptable Use Policy for the Llama
Materials (available at https://llama.meta.com/llama3/use-policy), which is hereby incorporated by
reference into this Agreement.
v. You will not use the Llama Materials or any output or results of the Llama Materials to
improve any other large language model (excluding Meta Llama 3 or derivative works thereof).
2. Additional Commercial Terms. If, on the Meta Llama 3 version release date, the monthly active users
of the products or services made available by or for Licensee, or Licensee’s affiliates, is greater than 700
million monthly active users in the preceding calendar month, you must request a license from Meta,
which Meta may grant to you in its sole discretion, and you are not authorized to exercise any of the
rights under this Agreement unless or until Meta otherwise expressly grants you such rights.
3. Disclaimer of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE LLAMA MATERIALS AND ANY
OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN “AS IS” BASIS, WITHOUT WARRANTIES OF
ANY KIND, AND META DISCLAIMS ALL WARRANTIES OF ANY KIND, BOTH EXPRESS AND IMPLIED,
INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT,
MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR
DETERMINING THE APPROPRIATENESS OF USING OR REDISTRIBUTING THE LLAMA MATERIALS AND
ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE LLAMA MATERIALS AND ANY OUTPUT AND
RESULTS.
4. Limitation of Liability. IN NO EVENT WILL META OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF
LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING
OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY INDIRECT, SPECIAL, CONSEQUENTIAL,
INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF META OR ITS AFFILIATES HAVE BEEN ADVISED
OF THE POSSIBILITY OF ANY OF THE FOREGOING.
5. Intellectual Property.
a. No trademark licenses are granted under this Agreement, and in connection with the Llama
Materials, neither Meta nor Licensee may use any name or mark owned by or associated with the other
or any of its affiliates, except as required for reasonable and customary use in describing and
redistributing the Llama Materials or as set forth in this Section 5(a). Meta hereby grants you a license to
use “Llama 3” (the “Mark”) solely as required to comply with the last sentence of Section 1.b.i. You will
comply with Meta’s brand guidelines (currently accessible at
https://about.meta.com/brand/resources/meta/company-brand/ ). All goodwill arising out of your use
of the Mark will inure to the benefit of Meta.
b. Subject to Meta’s ownership of Llama Materials and derivatives made by or for Meta, with
respect to any derivative works and modifications of the Llama Materials that are made by you, as
between you and Meta, you are and will be the owner of such derivative works and modifications.
c. If you institute litigation or other proceedings against Meta or any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Llama Materials or Meta Llama 3 outputs or
results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other
rights owned or licensable by you, then any licenses granted to you under this Agreement shall
terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold
harmless Meta from and against any claim by any third party arising out of or related to your use or
distribution of the Llama Materials.
6. Term and Termination. The term of this Agreement will commence upon your acceptance of this
Agreement or access to the Llama Materials and will continue in full force and effect until terminated in
accordance with the terms and conditions herein. Meta may terminate this Agreement if you are in
breach of any term or condition of this Agreement. Upon termination of this Agreement, you shall delete
and cease use of the Llama Materials. Sections 3, 4 and 7 shall survive the termination of this
Agreement.
7. Governing Law and Jurisdiction. This Agreement will be governed and construed under the laws of
the State of California without regard to choice of law principles, and the UN Convention on Contracts
for the International Sale of Goods does not apply to this Agreement. The courts of California shall have
exclusive jurisdiction of any dispute arising out of this Agreement.
\ No newline at end of file