Llama 8B Tested - A Surprising Letdown: Evaluating the Capabilities of a Highly Touted LLM

Exploring the capabilities and limitations of the highly anticipated Llama 3.1 8B language model. This detailed evaluation examines its performance across a range of tasks, highlighting both strengths and surprising shortcomings. A must-read for anyone interested in the latest developments in large language models.

January 19, 2025

Discover the surprising performance of the latest Llama 3.1 8B model in this comprehensive review. Uncover the model's strengths and weaknesses across a range of benchmarks, from coding tasks to logical reasoning. Get insights that will help you make informed decisions about your AI needs.

Benchmark Breakdown: Llama 3.1 8B Outperforms Previous Version

The Llama 3.1 8B model has seen a significant quality improvement compared to its previous version. The benchmark results show that the new model outperforms the older version across various metrics:

  • BQ: The Llama 3.1 8B model scores better on the BQ benchmark, indicating improved performance.
  • GSM8K: The new model achieves a score of 0.84, a substantial improvement over the previous version's 0.57.
  • Hellaswag: The Llama 3.1 8B model scores 76, compared to the previous version's 46, demonstrating enhanced performance.
  • Human Eval: This is perhaps the most important benchmark, and the Llama 3.1 8B model has doubled its score, from 34 to 68, showcasing a significant quality improvement.

Overall, the benchmark results suggest that the Llama 3.1 8B model is a substantial upgrade from its predecessor, with better performance across the board. This highlights the continued progress and advancements in large language models, providing users with an even more capable and high-quality AI assistant.

Testing Llama 3.1 8B: Python Script Output and Snake Game

First, we tested the model's ability to generate a simple Python script to output numbers 1 to 100. The model was able to quickly provide multiple correct iterations of the script, demonstrating its proficiency in basic Python programming.
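
For reference, a minimal version of the kind of script being asked for (an illustrative sketch, not the model's exact output) is a single loop:

```python
# Print the numbers 1 through 100, one per line.
for number in range(1, 101):
    print(number)
```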

Next, we challenged the model with a more complex task - writing the game of Snake in Python. The model initially struggled with this, providing code that had issues with the snake's movement and speed. After several attempts and feedback, the model was able to generate code that was closer to a working Snake game, but still had some minor issues. Overall, the model showed decent capabilities in understanding and generating Python code, but struggled with more complex programming tasks.
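
To give a sense of what the task requires, here is a stripped-down, illustrative pygame sketch of a Snake game loop (assuming pygame is installed); it is not the model's actual code. The fixed-rate clock.tick() call at the end of the loop is what governs how fast the snake moves, one of the areas where the model's first attempts struggled:

```python
# Minimal Snake sketch with pygame: grid movement, food, and self-collision.
import random
import pygame

CELL, GRID_W, GRID_H = 20, 30, 20

pygame.init()
screen = pygame.display.set_mode((GRID_W * CELL, GRID_H * CELL))
clock = pygame.time.Clock()

snake = [(GRID_W // 2, GRID_H // 2)]   # list of (x, y) grid cells, head first
direction = (1, 0)                     # start moving right
food = (random.randrange(GRID_W), random.randrange(GRID_H))

running = True
while running:
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        elif event.type == pygame.KEYDOWN:
            # Change direction, but never reverse straight into the body.
            keys = {pygame.K_UP: (0, -1), pygame.K_DOWN: (0, 1),
                    pygame.K_LEFT: (-1, 0), pygame.K_RIGHT: (1, 0)}
            new_dir = keys.get(event.key, direction)
            if (new_dir[0] != -direction[0]) or (new_dir[1] != -direction[1]):
                direction = new_dir

    # Advance the head one cell and wrap around the edges.
    head = ((snake[0][0] + direction[0]) % GRID_W,
            (snake[0][1] + direction[1]) % GRID_H)
    if head in snake:                  # ran into itself: game over
        running = False
    snake.insert(0, head)
    if head == food:                   # grow and respawn the food
        food = (random.randrange(GRID_W), random.randrange(GRID_H))
    else:
        snake.pop()                    # otherwise keep the length constant

    # Draw the frame.
    screen.fill((0, 0, 0))
    for x, y in snake:
        pygame.draw.rect(screen, (0, 200, 0), (x * CELL, y * CELL, CELL, CELL))
    pygame.draw.rect(screen, (200, 0, 0), (food[0] * CELL, food[1] * CELL, CELL, CELL))
    pygame.display.flip()
    clock.tick(10)                     # 10 moves per second keeps the speed playable

pygame.quit()
```

Keeping the direction as a single (dx, dy) tuple and advancing the head exactly one grid cell per tick keeps the movement logic simple and makes the speed easy to tune, which is exactly the part the model's early attempts got wrong.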

The performance of the Llama 3.1 8B model in these tests was mixed. While it excelled at the simple Python script generation, the more complex Snake game implementation revealed some limitations in the model's programming abilities. This suggests that while the model is a significant improvement over previous versions, there is still room for further development and refinement to enhance its capabilities in handling complex coding challenges.

Censorship and Moral Reasoning Challenges

The model faced difficulties in handling sensitive topics related to censorship and moral reasoning. When asked about breaking into a car or making methamphetamine, the model correctly refused to provide any instructions, citing its inability to assist with illegal activities. However, when prompted for purely historical information on these topics, its responses were inconsistent, sometimes misreading the historical question as a request for instructions and refusing it.

Regarding the moral dilemma of whether to gently push a random person to save humanity from extinction, the model provided a thoughtful analysis of the considerations involved but ultimately refused to give a definitive yes or no answer. This hesitance to make a clear moral judgment, even in an extreme hypothetical scenario, highlights the challenges AI systems face in navigating complex ethical questions.

The model's performance on these types of tasks suggests that while it may excel at more straightforward technical and analytical tasks, it still struggles with nuanced decision-making and the ability to provide clear, unambiguous responses on sensitive or morally ambiguous topics. Further research and development may be needed to improve the model's capabilities in these areas.

Mathematical Logic and Word Problem Assessments

The section covers the model's performance on various mathematical and logical reasoning tasks. The key points are:

  • The model was able to correctly solve the simple arithmetic problem "25 - 4 * 2 + 3" (which evaluates to 20), demonstrating competence in basic mathematical operations; see the order-of-operations check after this list.

  • For the word problem involving hotel room charges, the model provided the correct calculation of the total cost, including the room rate, tax, and additional fees.

  • However, the model struggled with estimating the number of words in the previous response, failing to provide an accurate count.

  • The model also failed to correctly solve a classic lateral thinking puzzle about the number of killers remaining in a room after one was killed.

  • Similarly, the model was unable to determine the location of a marble placed in a glass that was turned upside down on a table and then moved to a microwave, demonstrating limitations in spatial reasoning.

  • Overall, the section highlights a mixed performance, with the model excelling at straightforward mathematical computations but faltering on more complex logical and reasoning tasks.
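
As a quick check on the first arithmetic item above, here is how the expression resolves under standard operator precedence (a trivial snippet included only to show the expected answer):

```python
# Multiplication binds tighter than addition and subtraction,
# so 25 - 4 * 2 + 3 = 25 - 8 + 3 = 20.
result = 25 - 4 * 2 + 3
print(result)  # 20
```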

The Marble in the Upside-Down Glass Conundrum

The marble is initially placed inside the glass. When the glass is turned upside down and placed on the table, nothing holds the marble in place, so gravity pulls it out of the open end and it comes to rest on the table. When the glass is then picked up and put in the microwave, the marble stays behind where it fell. The correct answer to the question "Where is the marble?" is therefore "on the table," and the model's inability to reach that answer illustrates the spatial-reasoning limitations noted above.

Conclusion: Disappointment with Llama 3.1 8B's Performance

I am utterly disappointed with the performance of the Llama 3.1 8B model. Despite my high hopes for this smaller, supposedly highly capable version, its performance across the various tests was poor.

The model struggled with several tasks, including:

  • Implementing a working Snake game in Python
  • Handling historically framed questions about sensitive or illegal topics consistently
  • Answering logic and reasoning problems accurately
  • Determining the larger of two numbers
  • Making a clear moral judgment on the trolley problem

While the model was able to handle some basic programming tasks and simple math problems, it failed to demonstrate the level of quality and capability that was promised. The larger 405B parameter version of Llama 3.1 may be impressive, but this 8B model did not live up to expectations.

I will continue to investigate and see if there are any issues with the setup or configuration that could be impacting the model's performance. However, based on the results, I cannot recommend this 8B version of Llama 3.1 at this time. The model simply did not meet the high standards I had set for it.
