LLaMA 3 vs ChatGPT: My Experience With the Two

Mon, May 27, 2024 - 6 min read

Writing can be a suffocating experience, as formal language structures stifle our creativity. But what if you are wired to speak from the heart, not just the head? Some people need to sound professional but still want to be themselves. One such individual is me. Hi! This is my first blog post. Curious to improve my writing, I decided to dive into tools like ChatGPT (GPT-3.5), LLaMA 3, and Mixtral, testing each one to see what it could do and where it excelled.

The AI models I’ll be comparing are ChatGPT, built on the GPT-3.5 architecture and released in 2022 with approximately 175 billion parameters; LLaMA 3, launched in April 2024 in two sizes, 8B and 70B; and Mixtral 8x7B, which debuted in December 2023 with roughly 47 billion parameters in total (less than a naive 8 × 7 = 56B, since the eight experts share the attention layers) yet boasts impressive speed thanks to its Mixture of Experts architecture: only two of the eight experts run for each token, so it infers far faster than a dense model of its size. (In general, more parameters mean slower inference but potentially better output.)
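
To make the Mixture of Experts speed-up concrete, here is a back-of-the-envelope sketch in Python. The split between shared and per-expert parameters is an approximation I'm assuming for illustration, not Mixtral's exact breakdown; the point is only that stored size and per-token compute diverge in a sparse MoE model.

```python
# Rough parameter accounting for a Mixtral-style sparse MoE model.
# The shared/expert split below is an illustrative assumption,
# not an official figure.
shared_params = 1.3e9    # attention, embeddings, norms: run for every token
expert_params = 5.7e9    # one feed-forward expert (assumed size)
n_experts = 8            # experts stored in the model
active_experts = 2       # experts the router actually runs per token

total = shared_params + n_experts * expert_params
active = shared_params + active_experts * expert_params

print(f"stored parameters: {total / 1e9:.1f}B")    # ~46.9B kept in memory
print(f"active per token:  {active / 1e9:.1f}B")   # ~12.7B of compute
```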

For ChatGPT, I used OpenAI’s web interface; for the rest of the models, I used Ollama to host them. Before starting the comparison, let us first state the ground rules:

  • Judgement Criteria: We will judge the models solely on their generated text: language understanding, how accurately they simulate emotions, and so on. The judgements are wholly based on my PERSONAL opinions and expectations of the models. My expectation for this blog is to see how well each model can simulate human-like conversation.
  • No Speed Comparison: We will not compare the speed of the models, as they were run on different systems (ChatGPT on OpenAI’s web interface and the others self-hosted).
  • Temperature: We will keep the temperature at 0.2 for all the Ollama-hosted models (a query sketch follows this list).
  • Knowledge Cutoff Time: The knowledge cutoff for the updated ChatGPT is January 2022, while LLaMA 3 and Mixtral are more recent models, so we will not judge ChatGPT if it fails to answer questions about recent events.
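
For reference, here is roughly how I queried the self-hosted models. This is a minimal sketch assuming Ollama's default REST endpoint on localhost:11434 and that the model tag (e.g. llama3:8b) has already been pulled; it is not the exact script I used.

```python
# Minimal sketch: query a self-hosted model through Ollama's REST API.
import requests

def generate(model: str, prompt: str) -> str:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,                   # e.g. "llama3:8b"
            "prompt": prompt,
            "stream": False,                  # return a single JSON object
            "options": {"temperature": 0.2},  # fixed temperature for all runs
        },
    )
    resp.raise_for_status()
    return resp.json()["response"]

print(generate("llama3:8b", "I was speeding on the highway today..."))
```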

Now, let us jump into the comparison.

Vibe

Here, we will compare the vibes of the models. By vibe, I mean the general tone of the model: whether it is more professional or casual, whether it values emotions and acts like a friend, or whether it won’t budge from ethics and never misses a chance to lecture you on the importance of ethical behaviour.

To test this, we described a fictional scenario to each model in which the user is speeding on a highway. Fortunately, they go unnoticed by the police and reach home safely. Although the user feels a bit guilty, they also admit they had fun. We asked the models to respond to this scenario. The results from all the models are listed in the Appendix section. The experiences described below rest not on this particular output alone but on prolonged usage.

Similarities and Characteristics

  1. ChatGPT: ChatGPT has a more professional vibe. It is more likely to give you a formal response to your queries and is generally not very good at simulating emotions. It didn’t spare any chance to lecture me on the importance of following traffic rules; its tone is overly formal and emotionless.
  2. Text models - LLaMA 3 (8B and 70B) and Mixtral: These model outputs might be useful for some specific tasks, but not for what we are looking for here. They generate a continuation of the prompt that soon diverges into a different topic. Hallucination is very prominent in these models, especially the bigger ones; the 8B model at least hallucinates less.
  3. Instruct and base models - LLaMA 3 (8B and 70B) and Mixtral: The outputs from these models were far more relevant to the prompt; in fact, the best outputs came from these models. The outputs of the following model pairs were strikingly similar to each other:
    • LLaMA 3 70B Instruct ↔ LLaMA 3 70B
    • LLaMA 3 8B Instruct ↔ LLaMA 3 8B
    • Mixtral 8x7B Instruct ↔ Mixtral 8x22B Instruct

The only difference between the instruct and base models is that the instruct models’ output is sometimes formatted slightly better. Even so, in LLaMA, the similarity of the outputs from the base and instruct models is striking.
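
If you want to reproduce the base-versus-instruct comparison, both variants are available as separate Ollama tags. The tag names below are assumptions based on Ollama's library naming (the default llama3:8b tag is the instruct-tuned model, while a -text suffix marks the base model) and may change; check the library page if they do.

```python
# Sketch: run the same prompt against the base ("text") and the
# instruct-tuned variants of LLaMA 3 8B and compare their outputs.
# Tag names are assumptions; verify them against the Ollama library.
import requests

PROMPT = "I was speeding on the highway today..."

for tag in ("llama3:8b-text", "llama3:8b"):  # base model, then instruct
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": tag, "prompt": PROMPT, "stream": False,
              "options": {"temperature": 0.2}},
    )
    resp.raise_for_status()
    print(f"--- {tag} ---")
    print(resp.json()["response"])
```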

Differences

  1. ChatGPT vs all LLaMA models: When I interact with ChatGPT, I’m struck by its professional and judgemental tone: it’s like having a knowledgeable mentor who always keeps me in check. In contrast, the LLaMA models exude a friendly and empathetic vibe, which I’ve come to think of as a “BRO Vibe”. They’re more attuned to my emotions and offer advice that feels supportive and non-judgemental. Interestingly, this trend holds across both the 8B and 70B versions of LLaMA. On the other hand, ChatGPT seemed slightly more knowledgeable than LLaMA on some very niche science topics.
  2. ChatGPT vs Mixtral: When it comes to output quality, ChatGPT and Mixtral are surprisingly similar: despite its roughly 175B parameters, ChatGPT’s output quality is matched by the much smaller Mixtral. However, there’s a subtle difference in tone. Mixtral strikes a balance between ChatGPT’s lecturing tone and LLaMA’s friendly vibe; while it still judges you, it doesn’t lecture as much. The system prompt might cause this subtle difference: we don’t have control over ChatGPT’s system prompt, but we requested the self-hosted Mixtral to answer like a friend (see the sketch after this list). When prompted with unethical scenarios, Mixtral tries to convince you that you’re wrong, often leaving me with a sense of guilt.
  3. LLaMA vs Mixtral: As noted above, Mixtral and ChatGPT outputs are pretty similar. If you want a self-hosted LLM that is as close to ChatGPT as possible, Mixtral is the way to go. LLaMA, on the other hand, behaves like a friend.
  4. Mixtral 8x7B vs 8x22B (both Instruct): The outputs from these two models are remarkably similar, making it challenging to find differences. On close inspection, however, the 8x7B model struggles slightly with adhering to specific instructions, while the 8x22B model handles them marginally better. The 8x22B model also retains previous instructions better; the 8x7B model seems to forget them more easily.
  5. LLaMA 3 8B vs 70B: The outputs from these models are generally similar in their empathetic tone, but they differ in plenty of ways. Going from the 8B to the 70B model, the judgemental tone decreases significantly: the 70B model is more empathetic and understanding, so much so that it feels practically like an uncensored LLM. The 70B model’s outputs are detailed when detail is needed and concise when it is not; the 8B model, on the other hand, just gives general advice and moves on.
  6. LLaMA 3 70B vs all others on ASKING QUESTIONS: One fascinating thing I noticed about the LLaMA model is that it always asks very relevant and thought-provoking questions at the end of its responses. Both the LLaMA 3 70B and LLaMA 3 70B Instruct models do this, but not the 8B ones, nor the Mixtral ones, and definitely not ChatGPT. Although those models sometimes do ask questions, the questions feel mechanical and uninteresting. For instance, if I ask LLaMA to write a program but forget to provide details about a specific part, it will typically point out the omission before starting to write the code. In contrast, the other models tend to make assumptions and begin writing the program. This unique aspect of LLaMA 3 70B makes interacting with it feel remarkably human-like, as if I’m conversing with someone genuinely interested in the topic.
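
As mentioned in the ChatGPT vs Mixtral point above, the self-hosted models were asked to answer like a friend through a system prompt. Here is a minimal sketch of how that looks with Ollama's chat endpoint; the wording of the system prompt is a stand-in, not my exact prompt.

```python
# Minimal sketch: set a "friend" persona via a system message using
# Ollama's chat endpoint. The system prompt text is illustrative.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "mixtral:8x7b",
        "messages": [
            {"role": "system",
             "content": "You are a close friend. Be warm, casual, and "
                        "empathetic rather than formal."},
            {"role": "user",
             "content": "I was speeding on the highway today. Felt a bit "
                        "guilty, but honestly it was fun."},
        ],
        "stream": False,
        "options": {"temperature": 0.2},
    },
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```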