Hands-On Testing of Claude 3 - GPT-4's Era Is Coming to an End

In the middle of the night, a single stone set off a thousand waves.

Claude 3 has officially launched.

Anthropic, the company founded by former OpenAI employees, quietly released Claude 3.

There was no launch event and no big PR push, just a post on X.

I find it quite interesting that these AI companies now treat X as their main release platform…

Few words, but a big deal.

They released three models at once: Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku.

The names have a story behind them.

Opus roughly means a grand, epic-scale composition, the powerhouse of the three.

Sonnet is a fourteen-line poem.

Haiku is a three-line Japanese poem.

So you can think of them as cup sizes: Opus (extra large), Sonnet (large), Haiku (medium).

There isn’t much to say about the differences among these three; just check the screenshots at the end of the article.

The real highlight is the other image they published, the benchmark comparison.

On the published benchmarks, Claude 3’s Opus model surpasses GPT-4 across the board.

And on several of those tasks, it does so in a zero-shot setting.

For example, on MGSM, a multilingual mathematical reasoning test set, Claude 3 Opus scored 90.7% zero-shot, while GPT-4 managed only 74.5% with eight-shot prompting.

Zero-shot means the large model is asked to complete a task without any examples in the prompt. In contrast, eight-shot means eight examples were given before the task.

You can see the difference here… one goes straight in without examples, while the other, with eight examples, still can’t outperform Claude 3.
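To make the zero-shot versus eight-shot distinction concrete, here is a minimal sketch of how the two prompts differ in shape. The `build_prompt` helper and the arithmetic examples are made up for illustration; they are not taken from MGSM or from any real evaluation harness.

```python
# Hypothetical helper for illustration only; not part of any benchmark harness.
def build_prompt(question, examples=()):
    """Build a k-shot prompt: k worked examples first, then the real question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

# Made-up arithmetic examples, purely to show the shape of the prompt.
examples = [(f"What is {n} + {n}?", str(2 * n)) for n in range(1, 9)]

zero_shot = build_prompt("What is 12 + 30?")             # no examples at all
eight_shot = build_prompt("What is 12 + 30?", examples)  # eight solved examples first

print(zero_shot)
print(eight_shot.count("Q:"))  # 9: eight examples plus the real question
```

The zero-shot prompt is just the bare question, so the model gets no hint about the expected format or reasoning style. The eight-shot prompt ends with the same question but shows eight solved cases first.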

In complex reasoning tasks, Claude 3 can be said to completely dominate GPT-4.

In other tests like MMLU and GSM8K, it is roughly on par with GPT-4, so the core improvement lies in reasoning ability.

Anyway, Claude 3 Opus is quite appealing to me…

But Claude, that rascal, has taken a page from OpenAI’s playbook: the free version only gives you Sonnet, while Opus requires a $20-a-month subscription…

Pah… what a rascal…

After I got eight accounts banned…

So what can you do? Just hand over the $20…

I played around with it for several hours… and also tested many cases dated after August 2023.

I’d sum up Claude 3 in three characteristics:

1. Unique reasoning ability, 2. Multimodal on par with GPT-4V, 3. 200K long text optimization.

–

1. Unique Reasoning Ability

As the benchmarks above show, Claude 3’s biggest leap is in reasoning, specifically logic.

But the numbers alone won’t give you a feel for it, so here are a few representative examples.

Explain the concept of the complement method and use it to calculate the probability of this problem: “A company has two departments, Department A with 3 boys and 2 girls, and Department B with 4 boys and 6 girls. Now, we need to send 3 people on a business trip, ensuring at least one person from each department. What is the probability that at least one girl is sent?”

This is a tricky problem. Even though GPT-4 clearly understands the complement method, it still gets the answer wrong about 50% of the time. Claude 3 Opus, over 10 runs, got it right 90% of the time, which is quite satisfying.
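For reference, here is my own working of the problem with the complement method, as a short sketch. It assumes every valid group of 3 (at least one person from each department) is equally likely, which the problem statement implies but does not spell out. The `groups` helper is just a name I made up.

```python
from fractions import Fraction
from math import comb

def groups(a, b, size=3):
    """Ways to pick `size` people from pools of a and b, with at least one from each pool."""
    return comb(a + b, size) - comb(a, size) - comb(b, size)

total = groups(5, 10)     # all valid groups: 5 people in Department A, 10 in Department B
all_boys = groups(3, 4)   # complement event: boys only, 3 in A and 4 in B
p_at_least_one_girl = 1 - Fraction(all_boys, total)

print(total, all_boys, p_at_least_one_girl)  # 325 30 59/65
```

So under that assumption, the answer is 1 - 30/325 = 59/65, about 90.8%. The trap is the "at least one from each department" condition, which has to be applied to both the total count and the all-boys count.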

“Zhang San is a salesperson. She sold one-third of the vacuum cleaners in the green house, sold 2 more in the red house, and sold half of the remaining vacuum cleaners in the orange house. If Zhang San has 5 vacuum cleaners left, how many did she start with?”
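This one can be checked by working backward from the 5 vacuum cleaners left. The sketch below is my own verification, not Claude's output; `left_after_route` is a made-up helper that replays the three sales in order.

```python
from fractions import Fraction

def left_after_route(start):
    """Apply the three sales from the problem in order; return how many are left."""
    after_green = start - Fraction(start, 3)   # sold one-third at the green house
    after_red = after_green - 2                # sold 2 more at the red house
    return after_red / 2                       # sold half of the remainder at the orange house

# Work backward from the 5 left: undo the halving, the 2 sold, then the one-third.
start = (5 * 2 + 2) * Fraction(3, 2)

print(start, left_after_route(start))  # 18 5
```

Starting with 18, she sells 6, then 2, then 5, and has 5 left, so the answer is 18.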

Of course, you can also throw physics problems at it directly; just upload the image. All correct.

Chemistry, too.

Even some logical traps in the Chinese context are no problem.

Overall, Claude’s evolution in logic and reasoning is enormous. It handles basic middle school science problems easily, but it still generally struggles with high school problems.

That said, simpler problems and semantic logic puzzles do not stump Claude 3.

2. Multimodal on Par with GPT-4V

GPT-4V has been out for quite some time; multimodality is definitely one of those features that people can’t live without.

This time, Claude 3 has finally filled in its vision capabilities, so images can be input directly.

After playing for several hours, my overall evaluation is that it is roughly on par with GPT-4V.

Official data also tends to support this.

Aside from being slightly better in the area of scientific diagrams, there is basically no difference.

Here is a case with a scientific diagram, where it is quite strong.

It can reconstruct a website’s source code from a screenshot~

Guessing a place name from a photo is naturally a piece of cake.

Guessing an artist based on a work? OK.

Of course, it can also do some creative work. For example, this photo.

Claude 3 Opus gave a textbook answer. Perfect.

Overall, it is roughly the same as GPT-4V, and its Chinese support is also good, which makes up for one of Claude’s long-standing weaknesses.

3. 200K Long Text Optimization

I previously wrote an article angrily criticizing Claude 2.1…

I spent 7000 yuan testing Claude 2.1’s performance at 200K tokens, and it was terrible…

The results chart came out almost completely red, looking like this.

This time, they finally made significant improvements.

Retrieval accuracy finally reached 99%. Well, still not 100%.

I uploaded the PDF collection of my own articles and tested it on the classic case I used when I wrote about Kimi:

“When you wrote the article on the Wonderful Duck camera, whose photo did you use as a case?”

After a long time, it finally responded…

The content was correct, no issues.

However, the speed was simply too slow; I had to wait about a minute.

But better late than never.

Here is another case: a query whose answer spans widely separated parts of a document.

Overall accuracy and semantic understanding are quite good.

Claude 3 can finally hold conversations, write summaries, and answer queries over ultra-long texts. But it has only caught up: Kimi has been doing this for almost half a year, and Claude 3 is just now reaching Kimi’s level on long text.

However, overall, Claude 3 Opus is still the most versatile large model available.

Or, put simply, it is currently No. 1.

In Conclusion

Of course, this update of Claude 3 also has some other features.

For example, reducing unnecessary refusals, higher accuracy, etc., but I don’t feel like elaborating on those.

Finally, let me post three images showing the differences between Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku.

These three images make it clear at a glance: the more powerful, the more expensive; the cheaper, the faster.

To summarize, after this update, Claude 3 has unique reasoning ability, multimodality on par with GPT-4V, and 200K long text optimization.

It can be said without a doubt that it is currently the strongest large model on the market.

However, knowing OpenAI and Sam Altman, they probably won’t put up with this.

So, as netizens said in the comments:

Altman, hurry up and release GPT-5 to counter Claude 3, don’t be afraid.

Let’s get it on.

That way, we can quickly welcome the accelerating future.

In the end, since you’ve read this far, if you find it good, please give a like, a view, and share. If you want to receive notifications as soon as possible, you can also star me ⭐~ Thank you for reading my article.
