Commit aafce621 authored by monteiro: add Report
# Solving Puzzles with LLMs and reasoning models
## Abstract
In this report, we explore the performance of two models - LLaMa-3.1-8b-instant and LLaMa-3.3-70b-specdec - in solving a set of riddles. The models' answers are evaluated against correct solutions to assess their problem-solving abilities. The evaluation focuses on the models' capacity for logical deduction, reasoning and handling multi-step puzzles.
## 1. Introduction
This report examines the capabilities of state-of-the-art large language models (LLMs) and reasoning models in solving riddles, a task that requires logical thinking and problem-solving skills. Riddles are often designed to challenge cognitive abilities such as pattern recognition, deduction and strategic problem-solving. In this study, two models - LLaMa-3.1-8b-instant and LLaMa-3.3-70b-specdec - are tested on a collection of 28 riddles, spanning categories such as logic puzzles, mathematical riddles and verbal reasoning.
### 1.1 Objective
The objective of this study is to:
- Evaluate and compare the performance of LLMs and reasoning models in understanding and solving different types of riddles
- Analyze the strengths and weaknesses of these models in terms of logical reasoning, adaptability and problem-solving strategies
- Observe model behaviour when faced with permuted versions of well-known riddles
## 2. Methodology
### 2.1 Dataset
The dataset consists of 28 riddles, varying in category and difficulty. The riddles were sourced from different platforms known for their logical challenges and cover a range of topics, including puzzles based on mathematics, logic, reasoning and wordplay.
### 2.2 Models
The models tested in this experiment are:
- LLaMa-3.1-8b-instant: A general-purpose language model with basic reasoning capabilities
- LLaMA-3.3-70b-specdec: A more specialized reasoning-focused model, designed to handle logical deduction and complex problem-solving tasks.
### 2.3 Evaluation Criteria
The models' responses were evaluated based on:
- **Correctness**: Whether the model's answer matched the expected solution
- **Reasoning Process**: How well the model demonstrated logical deduction or reasoning for solving the riddle
- **Adaptability**: How well the model handled modified versions of the riddles, which included slight changes in constraints, problem structure or language.
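The correctness check above can be sketched as a small scoring routine. This is an illustrative stand-in, not the actual experimental harness: `normalize` and `score_answers` are hypothetical helper names, and a lenient exact match against the expected solution is assumed.

```python
def normalize(text: str) -> str:
    """Lowercase and strip punctuation so minor formatting differences
    (capitalization, trailing periods) do not count as wrong answers."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.strip()

def score_answers(answers, solutions):
    """Compare model answers to expected solutions pairwise and
    return (number correct, accuracy as a whole percentage)."""
    correct = sum(normalize(a) == normalize(s) for a, s in zip(answers, solutions))
    return correct, round(100 * correct / len(solutions))
```

For example, `score_answers(["A man.", "echo"], ["a man", "shadow"])` counts the first answer as correct despite the punctuation difference, yielding `(1, 50)`.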
## 3. Results
### 3.1 Performance on Original Riddles
The LLaMa-3.1-8b-instant model correctly answered 10 out of 28 riddles (**36%**), while the reasoning model correctly answered 22 out of 28 riddles (**79%**). Notably, the reasoning model performed significantly better, particularly on riddles requiring multi-step reasoning or strategic problem-solving.
### 3.2 Performance on Modified Riddles
Both models struggled with the modified riddles, each correctly answering only 4 of the 18 (**22%**). This suggests that the models relied heavily on patterns they had previously encountered during training rather than fully adapting to the updated constraints and conditions of the modified riddles.
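The reported percentages follow from simple rounding of correct answers over total riddles, as a quick check shows:

```python
def accuracy_pct(correct: int, total: int) -> int:
    """Accuracy as a percentage, rounded to the nearest whole number."""
    return round(100 * correct / total)

print(accuracy_pct(10, 28))  # 8b model, original riddles -> 36
print(accuracy_pct(22, 28))  # 70b model, original riddles -> 79
print(accuracy_pct(4, 18))   # both models, modified riddles -> 22
```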
## 4. Analysis
### 4.1 Strengths of the Models
- LLaMa-3.1-8b-instant showed solid performance on basic riddles involving straightforward logic and reasoning.
- LLaMA-3.3-70b-specdec excelled in multi-step reasoning tasks, providing accurate solutions to more complex puzzles.
### 4.2 Limitations of the Models
- Both models struggled with riddles that had changes in structure or new constraints, suggesting that they may not fully grasp the underlying logic or adapt well to new situations.
- The LLaMa-3.1-8b-instant model had particular difficulty with multi-step puzzles and puzzles requiring abstract thinking or strategic deduction.
### 4.3 Insights
The results indicate that while LLMs have made impressive advancements in natural language processing and pattern recognition, they still face challenges in tasks that require deeper logical reasoning and adaptability. Models like LLaMA-3.3-70b-specdec, with a stronger focus on reasoning, outperform general-purpose LLMs in these areas.
## 5. Discussion
### 5.1 Future Work
Future experiments could focus on enhancing the models' reasoning capabilities by introducing more diverse types of puzzles or integrating additional reasoning frameworks. Additionally, further research is needed to understand how LLMs can be trained to better adapt to novel constraints in problem-solving.
### 5.2 Conclusion
This study highlights the strengths and limitations of current LLMs in solving logical riddles. While significant progress has been made in natural language processing, more work is required to improve these models' reasoning abilities, particularly in the context of complex, multi-step problems.