Hands-On Testing of Claude 3 – GPT-4’s Era Is Coming to an End

In the middle of the night, a stone stirs up a thousand waves.

Claude 3 has officially launched.

Anthropic, the sibling company founded by people who left OpenAI, quietly released Claude 3.

There was no launch event and no big PR push, just a post on X.

I find it quite interesting that these AI companies now treat X as their main release platform…

Few words, but a big deal.

They released three models at once: Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku.

There is a story behind the names.

Opus roughly means a major, epic-scale work, so it is the powerhouse.

Sonnet is a fourteen-line poem.

Haiku is a three-line Japanese poem.

So you can simply think of them as: Opus (extra-large cup), Sonnet (large cup), Haiku (medium cup).

There isn’t much to say about the differences among the three; just check the screenshots at the end of the article.

The important part is the other chart they published.

Claude 3 Opus beats GPT-4 across the board.

And on several tasks it does so in a zero-shot setting.

For example, on MGSM, a multilingual mathematical reasoning benchmark, Claude 3 Opus reached 90.7% accuracy zero-shot, while GPT-4 reached only 74.5% with eight-shot prompting.

Zero-shot means the large model is asked to complete a task without any examples in the prompt. In contrast, eight-shot means eight examples were given before the task.
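
To make the zero-shot versus eight-shot distinction concrete, here is a minimal sketch of how the two prompts differ. The `build_prompt` helper and the arithmetic examples are made up for illustration; they are not MGSM items and not the format any benchmark actually uses.

```python
from typing import Sequence, Tuple


def build_prompt(question: str, examples: Sequence[Tuple[str, str]] = ()) -> str:
    """Build a k-shot prompt: k worked examples first, then the real question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


# Hypothetical worked examples, for illustration only.
EXAMPLES = [(f"What is {i} + {i}?", str(2 * i)) for i in range(1, 9)]

zero_shot = build_prompt("What is 12 * 3?")             # no examples at all
eight_shot = build_prompt("What is 12 * 3?", EXAMPLES)  # eight examples first

print(zero_shot)
print(eight_shot.count("Q:"))  # 9: eight examples plus the real question
```

The only difference between the two prompts is the block of solved examples in front; the final question is identical.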

You can see the difference here… one goes straight in without examples, while the other, with eight examples, still can’t outperform Claude 3.

In complex reasoning tasks, Claude 3 can be said to completely dominate GPT-4.

In other tests like MMLU and GSM8K, it is roughly on par with GPT-4, so the core improvement lies in reasoning ability.

Anyway, Claude 3 Opus is quite appealing to me…

But Claude, that rascal, has clearly borrowed OpenAI’s playbook: the free version only lets you use Sonnet, while Opus requires a $20 subscription…

Pah… what a rascal…

After I got eight accounts banned…

So what can you do? Just hand over the $20…

After several hours of tinkering… I also tested plenty of cases dated after August 2023.

I summarize three characteristics of Claude:

1. Unique reasoning ability, 2. Multimodal on par with GPT-4V, 3. 200K long text optimization.

1. Unique Reasoning Ability

As the numbers above show, the biggest step forward in Claude 3 is reasoning, specifically logic.

However, benchmark scores alone won’t give you a feel for this, so here are a few representative examples.

Explain the concept of the complement method and use it to calculate the probability of this problem: “A company has two departments, Department A with 3 boys and 2 girls, and Department B with 4 boys and 6 girls. Now, we need to send 3 people on a business trip, ensuring at least one person from each department. What is the probability that at least one girl is sent?”
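
For reference, here is a short sketch of the complement-method calculation itself. It assumes that every valid group of 3 (at least one person from each department) is equally likely, which the problem statement does not spell out; the `groups` helper is a name chosen here for illustration.

```python
from fractions import Fraction
from math import comb


def groups(a: int, b: int, size: int = 3) -> int:
    """Count groups of `size` people with at least one from A (a people) and one from B (b people)."""
    return sum(comb(a, k) * comb(b, size - k) for k in range(1, size))


total = groups(5, 10)    # all staff: A has 3 + 2 = 5, B has 4 + 6 = 10
all_boys = groups(3, 4)  # complement event: boys only (3 in A, 4 in B)

p_no_girl = Fraction(all_boys, total)
p_at_least_one_girl = 1 - p_no_girl  # complement method: P(event) = 1 - P(not event)

print(total, all_boys, p_at_least_one_girl)  # 325 30 59/65
```

Under that assumption the answer is 59/65, about 90.8%. The trap is the "at least one from each department" constraint, which has to be applied to both the total count and the all-boys count.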

This is a tricky problem. Even when it clearly understands the complement method, GPT-4 still gets it wrong about 50% of the time. Claude 3 Opus, across 10 runs, got it right 90% of the time, which is quite satisfying.

“Zhang San is a salesperson. She sold one-third of the vacuum cleaners in the green house, sold 2 more in the red house, and sold half of the remaining vacuum cleaners in the orange house. If Zhang San has 5 vacuum cleaners left, how many did she start with?”
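
This one can be checked by working backwards from the 5 vacuum cleaners left over. The sketch below undoes each house in reverse order and then replays the story forwards as a sanity check; the function names are illustrative.

```python
from fractions import Fraction


def solve_backwards(left: int = 5) -> Fraction:
    """Undo each sale in reverse order to recover the starting stock."""
    before_orange = left * 2         # orange house: sold half of what remained
    before_red = before_orange + 2   # red house: sold 2
    return Fraction(before_red) * Fraction(3, 2)  # green house: sold one third


def remaining(start: int) -> Fraction:
    """Replay the story forwards from a given starting stock."""
    after_green = start - Fraction(start, 3)
    after_red = after_green - 2
    return after_red / 2


start = solve_backwards(5)
print(start, remaining(int(start)))  # 18 5
```

Starting with 18, she sells 6, then 2, then 5, and 5 are left.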

Of course, you can also throw physics problems at it directly by uploading an image. It got them all correct.

Chemistry, too.

Even some logical traps in the Chinese context are no problem.

Overall, Claude’s progress in logic and reasoning is enormous. It handles basic middle school science problems easily, but it still generally struggles with high school problems.

Still, simpler problems and semantic logic puzzles do not stump Claude 3.

2. Multimodal on Par with GPT-4V

GPT-4V has been out for quite some time; multimodality is definitely one of those features that people can’t live without.

This time, Claude 3 has finally filled in its vision capability, so images can be fed in directly.

After playing for several hours, my overall evaluation is that it is roughly on par with GPT-4V.

Official data also tends to support this.

Aside from being slightly better on scientific diagrams, there is basically no difference between the two.

Here is a case with a scientific diagram, where it is indeed quite strong.

It can reconstruct a website’s source code from a screenshot~

Guessing a location from a photo is naturally a piece of cake.

Guessing an artist based on a work? OK.

Of course, it can also do some creative work. For example, this photo.

Claude 3 Opus gave textbook answers. Perfect.

Overall, it is roughly on par with GPT-4V, and its Chinese support is also good, which makes up for one of Claude’s long-standing weaknesses.

3. 200K Long Text Optimization

I previously wrote an article angrily criticizing Claude 2.1…

I spent 7,000 yuan testing Claude 2.1’s performance at 200K tokens, and it was terrible…

It turned completely red, looking like this.

This time, they finally made significant improvements.

It finally reached 99%. Well, still not 100%.
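
The 99% figure reads like a recall score from a "needle in a haystack" style test, where one fact is buried at different depths of a long context and the model is asked to retrieve it. That reading is an assumption on my part. Below is a minimal sketch of such a harness; `ask_model` is a hypothetical callable standing in for a real model call, and the filler sentence, needle, and `fake_model` stub are all invented for illustration.

```python
import re
from typing import Callable, Iterable


def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Repeat `filler` n_sentences times and insert `needle` at relative depth (0.0 = start, 1.0 = end)."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def recall_rate(ask_model: Callable[[str], str], lengths: Iterable[int],
                depths: Iterable[float], needle: str, expected: str) -> float:
    """Fraction of (length, depth) cells in which the answer contains `expected`."""
    depths = list(depths)
    hits = trials = 0
    for n in lengths:
        for d in depths:
            context = build_haystack("The sky was clear that day.", needle, n, d)
            prompt = f"{context}\n\nQuestion: What is the secret code?"
            hits += expected in ask_model(prompt)
            trials += 1
    return hits / trials


def fake_model(prompt: str) -> str:
    """Stand-in for a real model: finds the code by pattern matching."""
    match = re.search(r"secret code is (\w+)", prompt)
    return match.group(1) if match else "unknown"


score = recall_rate(fake_model, lengths=[10, 100, 1000], depths=[0.0, 0.5, 1.0],
                    needle="The secret code is PINEAPPLE42.", expected="PINEAPPLE42")
print(score)  # 1.0 for this stub; a real model may miss some cells
```

In a real run, each (length, depth) cell becomes one square of the heat map, and a mostly red chart means the model missed the needle in most cells.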

I uploaded a PDF collection of my own articles and tested it with the classic question I used when I wrote about Kimi:

“When you wrote the article on the Wonderful Duck camera, whose photo did you use as a case?”

After a long time, it finally responded…

The content was correct, no issues.

However, the speed was simply too slow; I had to wait about a minute.

But better late than never.

Here is another case: a query whose supporting passages sit far apart within a document.

Overall accuracy and semantic understanding are quite good.

With Claude 3, the ability to hold conversations, write summaries, and answer queries over ultra-long texts has finally been filled in. But it has only caught up: Kimi has been doing this for almost half a year, and Claude 3 is just now reaching Kimi’s level on long text.

Taken as a whole, though, Claude 3 Opus is still the most well-rounded large model available.

Or, put another way, it is currently No. 1.

In Conclusion

Of course, this update of Claude 3 also has some other features.

For example, reducing unnecessary refusals, higher accuracy, etc., but I don’t feel like elaborating on those.

Finally, let me post three images showing the differences between Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku.

These three images make it clear at a glance: the more powerful, the more expensive; the cheaper, the faster.

To summarize, after this update, Claude 3 has unique reasoning ability, multimodality on par with GPT-4V, and 200K long text optimization.

It can be said without a doubt that it is currently the strongest large model on the market.

However, knowing OpenAI and Sam Altman (“Ultraman,” as Chinese netizens call him), they probably won’t put up with this.

So, as netizens said in the comments:

Ultraman, hurry up and release GPT-5 to counter Claude 3, don’t be afraid.

Let’s get it on.

That way, we can quickly welcome the accelerating future.

In the end, since you’ve read this far, if you find it good, please give a like, a view, and share. If you want to receive notifications as soon as possible, you can also star me ⭐~ Thank you for reading my article.
