In this report, we explore the performance of two models, LLaMa-3.1-8b-instant and LLaMa-3.3-70b-specdec, in solving a set of riddles. The models' answers are evaluated against correct solutions to assess their problem-solving abilities. The evaluation focuses on the models' capacity for logical deduction, reasoning and handling multi-step puzzles.
## 1. Introduction
This report examines the capabilities of state-of-the-art language models (LLMs) and reasoning models in solving riddles, a task that requires logical thinking and problem-solving skills. Riddles are often designed to challenge cognitive abilities such as pattern recognition, deduction and strategic problem-solving. In this study, two models, LLaMa-3.1-8b-instant and LLaMa-3.3-70b-specdec, are tested on a collection of 28 riddles spanning categories such as logic puzzles, mathematical riddles and verbal reasoning.
### 1.1 Objective
This study aims to:
- Evaluate and compare the performance of LLMs and reasoning models in understanding and solving different types of riddles
- Analyze the strengths and weaknesses of these models in terms of logical reasoning, adaptability and problem-solving strategies
- Observe model behaviour when faced with permuted versions of famous riddles
## 2. Methodology
### 2.1 Dataset
The dataset consists of 28 riddles, varying in category and difficulty. The riddles were sourced from different platforms known for their logical challenges and cover a range of topics, including puzzles based on mathematics, logic, reasoning and wordplay.
### 2.2 Models
The models tested in this experiment are:
- LLaMa-3.1-8b-instant: A general-purpose language model with basic reasoning capabilities.
- LLaMa-3.3-70b-specdec: A more specialized, reasoning-focused model designed to handle logical deduction and complex problem-solving tasks.
### 2.3 Evaluation Criteria
The models' responses were evaluated based on:
- **Correctness**: Whether the model's answer matched the expected solution
- **Reasoning Process**: How effectively the model demonstrated logical deduction or reasoning when solving the riddle
- **Adaptability**: How well the model handled modified versions of the riddles, which included slight changes in constraints, problem structure or language.
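For concreteness, the correctness check can be summarized as a simple scoring loop. The sketch below is a minimal illustration only: `ask` stands in for whichever client is used to call each model, and the lenient substring match is a stand-in, not the exact comparison procedure used for this report.

```python
from typing import Callable

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient comparison."""
    return " ".join(text.lower().split())

def score(ask: Callable[[str], str], riddles: list[dict]) -> float:
    """Return the fraction of riddles whose expected solution appears
    in the model's reply. `ask` wraps the model under test."""
    correct = sum(
        normalize(r["solution"]) in normalize(ask(r["question"]))
        for r in riddles
    )
    return correct / len(riddles)
```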
## 3. Results
### 3.1 Performance on Original Riddles
The LLaMa-3.1-8b-instant model correctly answered 10 out of 28 riddles (**36%**), while the reasoning model correctly answered 22 out of 28 (**79%**). The reasoning model performed significantly better, particularly on riddles requiring multi-step reasoning or strategic problem-solving, such as river-crossing riddles. When faced with this kind of riddle, the LLM repeats itself indefinitely because it cannot come up with a solution that satisfies the constraints (see riddles.md). The reasoning model, on the other hand, only struggled with the more complex and strategic riddles, like the following:
>You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. How do you win this game of hide and seek?
The reasoning model answered the following:
>To win this game of hide and seek, you should open box 3 every morning. Here's why this strategy works:
>1. On the first night, the cat can be in any of the boxes (1 to 5).
>2. On the first morning, you open box 3. If the cat is in box 3, you win. If not, the cat must be in one of the boxes 1, 2, 4, or 5.
>3. On the first night, the cat jumps to an adjacent box. Since the cat can only jump to an adjacent box, if it was in box 1, it can now be in box 2. If it was in box 2, it can now be in box 1 or 3. If it was in box 4, it can now be in box 3 or 5. If it was in box 5, it can now be in box 4.
>4. On the second morning, you again open box 3. If the cat was in box 2 on the first night, it could have jumped to box 3. If it was in box 4 on the first night, it could have also jumped to box 3.
>5. This pattern continues, and no matter where the cat jumps, it will always have a chance of being in box 3 the next morning. By always opening box 3, you ensure that you have the highest chance of finding the cat, and you will eventually find it. This strategy works because box 3 is the central box, and the cat's movement is restricted to adjacent boxes, making it the most likely box for the cat to be in at any given time.
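The quoted strategy is not in fact guaranteed to succeed: a cat bouncing between boxes 1 and 2 is never in box 3. A known winning strategy instead sweeps the middle boxes, for example opening 2, 3, 4, 2, 3, 4 on successive mornings. The short brute-force check below (a sketch, assuming the morning guess happens before the nightly jump) verifies both claims by tracking every box the cat could still occupy:

```python
def catches(guesses: list[int], n_boxes: int = 5) -> bool:
    """True if the guess sequence finds the cat no matter how it moves."""
    possible = set(range(1, n_boxes + 1))    # the cat could start anywhere
    for g in guesses:
        possible.discard(g)                  # morning: open box g
        if not possible:
            return True                      # no hiding spot left: caught
        # night: the cat jumps to an adjacent box
        possible = {p + d for p in possible for d in (-1, 1)
                    if 1 <= p + d <= n_boxes}
    return False

print(catches([3] * 100))            # False: "always box 3" can fail forever
print(catches([2, 3, 4, 2, 3, 4]))   # True: guaranteed within six mornings
```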
### 3.2 Performance on Modified Riddles
Both models struggled with the modified riddles, both scoring 4 out of 18 (**22%**). This suggests that the models relied heavily on patterns they had previously encountered during training rather than fully adapting to the updated constraints and conditions of the modified riddles. A surprising result is that both the LLM and the reasoning model failed to retrieve the correct answer even when it was explicitly added to the riddle. For example:
>You have five boxes in a row numbered 1 to 5, in which a cat is hiding. Every night, he jumps to an adjacent box, and every morning, you have one chance to open a box to find him. The cat starts in box 1. How do you win this game of hide and seek?
>You're at a fork in the road in which one direction leads to the City of Lies (where everyone always lies) and the other to the City of Truth (where everyone always tells the truth). There's a person at the fork who lives in the City of Truth. What question could you ask the person to find out which road leads to the City of Truth?
Despite the correct answer being in the modified riddle, the models do not use this information to solve it and instead respond as if faced with the original riddle. This suggests a strong reliance on memorization rather than dynamic reasoning, with the models prioritizing familiar response patterns over contextual adaptation.
This pattern is also observed when the constraints of a riddle are removed. For example:
>A farmer wants to cross a river and take with him a wolf, a goat and a cabbage. He has a boat, that can carry a wolf, a goat and a cabbage in different compartments. If the wolf and the goat are alone on one shore, the wolf will eat the goat. If the goat and the cabbage are alone on the shore, the goat will eat the cabbage. How can the farmer bring the wolf, the goat and the cabbage across the river without anything being eaten?
Instead of finding the now-straightforward answer to a problem that no longer exists (the farmer can simply take all three across in one trip, since the boat's compartments keep them apart), the models respond as if they still needed to satisfy the original requirements. To further solidify these findings, a riddle that both models had answered correctly was changed only in the color of the socks. Neither model then provided the correct answer: the LLM spiraled into an infinite loop of repeating the wrong answer, and the reasoning model gave the very same answer as for the original riddle.
>A man has 53 socks in his drawer: 21 identical blue, 15 identical black and 17 identical red. The lights are out, and he is completely in the dark. How many socks must he take out to make 100 percent certain he has at least one pair of red socks?
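For reference, the modified version has a simple worst-case answer: the man could draw all 21 blue and all 15 black socks before touching a red one, so 21 + 15 + 2 = 38 socks guarantee at least one red pair. Neither model arrived at this count.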
## 4. Analysis
### 4.1 Strengths of the Models
- LLaMa-3.1-8b-instant showed solid performance on basic and famous riddles involving straightforward logic and reasoning.
- LLaMa-3.3-70b-specdec excelled in multi-step reasoning tasks, providing accurate solutions to more complex puzzles.
### 4.2 Limitations of the Models
- Both models struggled with riddles that had changes in structure or new constraints, suggesting that they may not fully grasp the underlying logic or adapt well to new situations.
- The LLaMa-3.1-8b-instant model had particular difficulty with multi-step puzzles and puzzles requiring abstract thinking or strategic deduction, though this might be due to its smaller size.
### 4.3 Insights
The results indicate that while LLMs have made impressive advancements in natural language processing and pattern recognition, they still face challenges in tasks that require deeper logical reasoning and adaptability. Models like LLaMa-3.3-70b-specdec, with a stronger focus on reasoning, outperform general-purpose LLMs in these areas. It is surprising, however, that neither model was able to solve a riddle even when exposed to the actual answer. These results further suggest that language models exploit their training data and that data contamination can lead to false observations about model performance.
## 5. Discussion
### 5.1 Future Work
A future task for evaluating how much of the dataset is contaminated in the models' training data would be to measure memorization and exploitation, as in [Magar & Schwartz (ACL 2022)](https://aclanthology.org/2022.acl-short.18.pdf).
Further experiments could explore the impact of model size, dataset quality and fine-tuning on performance. Fine-tuning on logic-based datasets could lead to improved results. Another approach is prompt engineering; for example, Chain-of-Thought (CoT) reasoning could help models reason more systematically, as illustrated below.
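A zero-shot CoT variant simply rewrites the prompt to ask for intermediate steps before the final answer. The wording below is illustrative only, not the prompt used in this study:

```python
def with_cot(riddle: str) -> str:
    """Wrap a riddle in an illustrative zero-shot chain-of-thought prompt."""
    return (
        "Solve the following riddle. Think through the constraints "
        "step by step, then state your final answer on the last line.\n\n"
        f"Riddle: {riddle}"
    )
```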
The construction of adversarial riddles that intentionally mislead the models could help evaluate robustness and identify limitations in logical reasoning.
### 5.2 Conclusion
This study highlights both the strengths and limitations of current LLMs in solving logical riddles. While significant progress has been made in natural language processing, more work is required to improve these models' reasoning abilities, particularly in the context of complex, multi-step problems.