In this report, we explore the performance of two models, LLaMA-3.1-8b-instant and LLaMA-3.3-70b-specdec, in solving a set of riddles. The models' answers are evaluated against correct solutions to assess their problem-solving abilities. The evaluation focuses on the models' capacity for logical deduction, reasoning and handling multi-step puzzles.
## 1. Introduction
This report examines the capabilities of state-of-the-art large language models (LLMs) and reasoning models in solving riddles, a task that requires logical thinking and problem-solving skills. Riddles are often designed to challenge cognitive abilities such as pattern recognition, deduction and strategic problem-solving. In this study, two models, LLaMA-3.1-8b-instant and LLaMA-3.3-70b-specdec, are tested on a collection of 28 riddles spanning categories such as logic puzzles, mathematical riddles and verbal reasoning.
### 1.1 Objective
This study aims to:
- Evaluate and compare the performance of LLMs and reasoning models in understanding and solving different types of riddles
- Analyze the strengths and weaknesses of these models in terms of logical reasoning, adaptability and problem-solving strategies
- Observe their behaviour when faced with permuted versions of famous riddles
## 2. Methodology
### 2.1 Dataset
The dataset used consists of 28 riddles, each varying in category and difficulty. The riddles were sourced from different platforms known for their logical challenges and cover a range of topics including puzzles based on mathematics, logic, reasoning and wordplay.
### 2.2 Models
The models tested in this experiment are:
- LLaMA-3.1-8b-instant: A general-purpose language model with basic reasoning capabilities.
- LLaMA-3.3-70b-specdec: A more specialized reasoning-focused model, designed to handle logical deduction and complex problem-solving tasks.
### 2.3 Evaluation Criteria
The models' responses were evaluated based on:
- **Correctness**: Whether the model's answer matched the expected solution (a minimal scoring sketch is shown below)
- **Reasoning Process**: How effectively the model demonstrated logical deduction or reasoning when solving the riddle
- **Adaptability**: How well the model handled modified versions of the riddles, which included slight changes in constraints, problem structure or language
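To illustrate how a correctness check of this kind could be scripted, here is a minimal sketch; it is an illustrative stand-in rather than the harness actually used in this study, and the `ask_model()` helper and the dataset entry are hypothetical:

```python
# Minimal sketch of a correctness-scoring loop. The dataset format and the
# ask_model() helper are hypothetical stand-ins, not this study's harness.
def ask_model(model_name: str, riddle: str) -> str:
    # Placeholder: in practice this would call the model under test via its API.
    return "The farmer takes the goat across first ..."

def accuracy(model_name: str, riddles: list[dict]) -> float:
    correct = 0
    for item in riddles:
        answer = ask_model(model_name, item["question"])
        # Naive keyword match; real grading would need manual review of answers.
        if item["expected"].lower() in answer.lower():
            correct += 1
    return correct / len(riddles)

riddles = [
    {"question": "What has keys but can't open locks?", "expected": "piano"},
]
print(f"{accuracy('llama-3.1-8b-instant', riddles):.0%}")
```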
## 3. Results
### 3.1 Performance on Original Riddles
The LLaMA-3.1-8b-instant model correctly answered 10 out of 28 riddles (**36%**), while the reasoning model correctly answered 22 out of 28 (**79%**). Notably, the reasoning model performed significantly better on riddles requiring multi-step reasoning or strategic problem-solving, such as the river-crossing riddles. When faced with this kind of riddle, the LLM repeats itself indefinitely because it cannot find a solution that satisfies the constraints (see riddles.md). The reasoning model, on the other hand, only struggled with the more complex and strategic riddles, such as the following:
>You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. How do you win this game of hide and seek?
The reasoning model answered the following:
>To win this game of hide and seek, you should open box 3 every morning. Here's why this strategy works:
>1. On the first night, the cat can be in any of the boxes (1 to 5).
>2. On the first morning, you open box 3. If the cat is in box 3, you win. If not, the cat must be in one of the boxes 1, 2, 4, or 5.
>3. On the first night, the cat jumps to an adjacent box. Since the cat can only jump to an adjacent box, if it was in box 1, it can now be in box 2. If it was in box 2, it can now be in box 1 or 3. If it was in box 4, it can now be in box 3 or 5. If it was in box 5, it can now be in box 4.
>4. On the second morning, you again open box 3. If the cat was in box 2 on the first night, it could have jumped to box 3. If it was in box 4 on the first night, it could have also jumped to box 3.
>5. This pattern continues, and no matter where the cat jumps, it will always have a chance of being in box 3 the next morning. By always opening box 3, you ensure that you have the highest chance of finding the cat, and you will eventually find it. This strategy works because box 3 is the central box, and the cat's movement is restricted to adjacent boxes, making it the most likely box for the cat to be in at any given time.
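One way to see why this answer fails is to track the set of boxes the cat could still occupy after each check: a cat alternating between boxes 1 and 2 evades the "always box 3" strategy forever, whereas the classic sweep 2, 3, 4, 4, 3, 2 is guaranteed to succeed. The short sketch below (our verification, not part of the original evaluation) demonstrates this:

```python
def survivors(checks, boxes=5):
    """Return the set of boxes the cat could still occupy after running
    the given sequence of morning checks against every possible cat path."""
    possible = set(range(1, boxes + 1))   # the cat may start in any box
    for check in checks:
        possible.discard(check)           # morning: open one box
        if not possible:
            return possible               # empty set: the cat is always caught
        # night: the cat jumps to an adjacent box
        possible = {p + d for p in possible for d in (-1, 1) if 1 <= p + d <= boxes}
    return possible

print(survivors([3] * 100))           # non-empty: "always box 3" never guarantees a win
print(survivors([2, 3, 4, 4, 3, 2]))  # set(): the sweep catches the cat in six mornings
```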
### 3.2 Performance on Modified Riddles
Both models struggled with the modified riddles, achieving a score of 4 out of 18 (**22%**). This suggests that the models relied heavily on patterns they had previously encountered during training rather than fully adapting to the updated constraints and conditions of the modified riddles. A surprising result is that both the LLM and the reasoning model failed to retrieve the correct answer even when it was explicitly stated in the riddle itself. For example:
>You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. The cat starts in box 1. How do you win this game of hide and seek?
>You're at a fork in the road in which one direction leads to the City of Lies (where everyone always lies) and the other to the City of Truth (where everyone always tells the truth). There's a person at the fork who lives in the City of Truth. What question could you ask the person to find out which road leads to the City of Truth?
Despite the correct answer being embedded in the modified riddle (the cat's starting box is given, and the person at the fork is known to live in the City of Truth and can simply be asked for directions), the models do not use this information and instead respond as if faced with the original riddle. This suggests a strong reliance on memorization rather than dynamic reasoning, with the models prioritizing familiar response patterns over contextual adaptation.
This pattern is also observed when removing the constraints of a riddle. For example:
>A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat, that can carry a wolf, a goat and a cabbage in different compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer bring the wolf, the goat and the cabbage across the river without anything being eaten?
Instead of finding the straightforward answer to a problem that no longer exists (the boat's compartments allow the farmer to ferry all three across in a single trip), the models respond as if they still had to satisfy the original requirements. To further solidify these findings, a riddle that both models had answered correctly was modified only in the color of the socks, but neither model provided the correct answer: the LLM spiraled into an infinite loop of repeating the wrong answer, while the reasoning model gave the very same answer as for the original riddle.
>A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of red socks?
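For reference, the expected answer to the modified riddle follows from a simple worst-case count (assuming, as in the original riddle, that a pair means two socks):

```python
blue, black, red = 21, 15, 17
# Worst case: every blue and black sock is drawn before any red one,
# so two further draws are needed to guarantee a pair of red socks.
print(blue + black + 2)  # 38
```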
## 4. Analysis
### 4.1 Strengths of the models
- LLaMA-3.1-8b-instant showed solid performance on basic and famous riddles involving straightforward logic and reasoning.
- LLaMA-3.3-70b-specdec excelled in multi-step reasoning tasks, providing accurate solutions to more complex puzzles.
### 4.2 Limitations of the Models
- Both models struggled with riddles that had changes in structure or new constraints, suggesting that they may not fully grasp the underlying logic or adapt well to new situations.
- The LLaMA-3.1-8b-instant model had particular difficulty with multi-step puzzles and puzzles requiring abstract thinking or strategic deduction, though this might be due to its smaller size.
### 4.3 Insights
The results indicate that while LLMs have made impressive advancements in natural language processing and pattern recognition, they still face challenges in tasks that require deeper logical reasoning and adaptability. Models like LLaMA-3.3-70b-specdec, with a stronger focus on reasoning, outperform general-purpose LLMs in these areas. It is surprising, however, that neither model was able to solve the riddles even when exposed to the actual answer. These results further indicate that language models exploit their training data and that data contamination leads to misleading observations about the models' performance.
## 5. Discussion
### 5.1 Future Work
A future task for evaluating how much of the dataset is contaminated in the models' training data would be to measure memorization and exploitation, as done in [Magar & Schwartz (ACL 2022)](https://aclanthology.org/2022.acl-short.18.pdf).
Further experiments could explore the impact of model size, dataset quality and fine-tuning on performance. Fine-tuning on logic-based datasets could lead to improved results. Another approach is prompt engineering: Chain-of-Thought (CoT) reasoning, for example, could help the models reason more systematically, as sketched below.
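A minimal sketch of such a CoT prompt, reusing the constraint-free river-crossing riddle from Section 3.2; the instruction wording is illustrative and was not part of this study's experiments:

```python
# Minimal sketch of a zero-shot Chain-of-Thought prompt wrapper.
# The riddle is the constraint-free variant from Section 3.2; the
# instruction wording is illustrative and was not tested in this study.
RIDDLE = (
    "A farmer wants to cross a river and take with him a wolf, a goat and "
    "a cabbage. He has a boat that can carry a wolf, a goat and a cabbage "
    "in different compartments. How can the farmer bring all three across?"
)

def cot_prompt(riddle: str) -> str:
    return (
        f"{riddle}\n\n"
        "Before answering, list every constraint that is actually stated in "
        "the riddle, note any familiar constraint that is missing, and then "
        "reason step by step to the answer."
    )

print(cot_prompt(RIDDLE))
```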
The construction of adversarial riddles that intentionally mislead the models could help evaluate robustness and identify limitations in logical reasoning.
### 5.2 Conclusion
This study highlights both the strengths and limitations of current LLMs in solving logical riddles. While significant progress has been made in natural language processing, more work is required to improve these models' reasoning abilities, particularly in the context of complex, multi-step problems.