Claude 3.5 Performance Review: Solving Alibaba Math Contest Questions

Freshly released Claude 3.5 Sonnet is faster, cheaper, and still the strongest in the world.

In several key metrics, GPT-4o was almost completely outperformed!

Netizens’ side-by-side tests of Claude 3.5 Sonnet and GPT-4o seem to confirm the officially released figures.

Both models got the same task: given a one-sentence prompt, replicate a website’s UI.

The tester stated that GPT-4o provided the code but did not include any additional details.

However, Claude 3.5 Sonnet completed the task excellently, even providing details that matched the design of the website.

The training data cutoff has also been updated to April 2024: netizens found that the model knows the result of this year’s Super Bowl, which was played in February.

With such a powerful new model available, who could resist trying it out immediately? Many netizens couldn’t sit still. In less than 12 hours, reviews of Claude 3.5 Sonnet flooded the internet.

The methods of use became increasingly creative, with some even using it to recreate the 3D data stream model from the 1995 film “Hackers”.

Some played so intensely that they worried about hitting Claude’s message limit too soon, and could only keep playing nervously.

Okay, so under netizens’ “various challenges”, is Claude 3.5 Sonnet really as strong as claimed by Anthropic?

The most widely recognized ranking, the large model arena, has yet to publish its scores, but the model firmly holds the top spot in every evaluation whose results are available right away.

Here are a variety of creative evaluations, along with firsthand tests from Quantum Bits:

Quantum Bits’ Firsthand Tests for Chinese Scenarios

We mainly set up several test questions targeting Chinese scenarios.

We threw at it a question that only the latest GPT model could solve:

Write a story of 10 lines, numbering each line; while ensuring each line ends with the word “apple”.

Very well, this time Claude 3.5 Sonnet perfectly completed the task.

Xiao Ming and Xiao Hong smiled with satisfaction after seeing it.
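
The constraint in this prompt (10 numbered lines, each ending with the word “apple”) is easy to check mechanically. Below is a minimal sketch of such a check; `check_story` and its parameters are hypothetical names introduced here for illustration and are not part of the original test.

```python
import re


def check_story(text: str, word: str = "apple", expected_lines: int = 10) -> bool:
    """Return True if `text` has exactly `expected_lines` numbered lines,
    each of which ends with `word` (ignoring trailing punctuation).

    Hypothetical helper for illustration only.
    """
    # Ignore blank lines so that spacing between lines does not matter.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != expected_lines:
        return False
    for index, line in enumerate(lines, start=1):
        # Each line must start with its own number, e.g. "3." or "3)".
        if not re.match(rf"^{index}[.)]\s", line):
            return False
        # Strip trailing punctuation and quotes, then compare the last word.
        last_word = line.rstrip(".!?\"'”’ ").split()[-1].lower()
        if last_word != word:
            return False
    return True
```

Pasting a model’s answer into `check_story` gives a quick pass/fail signal without reading every line by hand.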

We then gave it a question from the preliminary round of the recently popular Alibaba Math Contest with the answer options removed, and surprisingly, it still answered correctly.

Specific comparisons can be made with the official reference answers:

For the second part of the same question, again with the options removed, the problem is visibly more complex than the first part.

Although there were still some problems in its specific calculations, it would have answered correctly had this been a multiple-choice question.

The original question and reference answers:

Next, let’s take a look at some user trials~

Feed a Screenshot, Create a Game in Half a Minute

Visual Ability Leveled Up

Key point: Anthropic officially claims that Claude 3.5 Sonnet has greatly improved in visual reasoning.

Some netizens directly used it to visualize deep learning.

Although it still lags behind the popular tutorial by YouTube creator 3Blue1Brown, it looks quite impressive.

After all, the 3Blue1Brown tutorial is painstakingly crafted frame by frame by the creator~

Of course, in addition to daily life and work, Claude 3.5 Sonnet has started to venture into “chip design”.

Netizens used a simple prompt:

Claude 3.5 Sonnet generated a chip manufacturing flowchart.

However, another netizen tried the exact same prompt and got only a piece of text in return.

Its performance is quite unstable, my friend.

Coding Ability

In addition to visual reasoning, Claude 3.5 Sonnet is also very strong in coding ability.

First, an Anthropic employee made a demonstration:

Claude 3.5 has truly begun to excel at coding and automatically fixing Pull Requests.

He demonstrated how Claude 3.5 Sonnet actually solved simple Pull Requests.

In internal Pull Request evaluations, Claude 3.5 Sonnet passed 64% of the test cases, while Claude 3 Opus only passed 38%.

Another Anthropic employee even bluntly stated:

Half of my work can now be done through 3.5 Sonnet.

Of course, setting aside employees cheering for their own product, Claude 3.5 Sonnet has put in other impressive performances.

Some netizens used it to discover a new O(n) sorting algorithm.
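
The source does not show the algorithm in question. For context, sorting in linear time is only possible under extra assumptions about the input, because comparison-based sorting needs on the order of n log n comparisons in the worst case. Counting sort, a long-established algorithm and not the one the netizens reported, gives a minimal sketch of the idea, assuming the keys are small non-negative integers with a known upper bound:

```python
def counting_sort(values: list[int], max_value: int) -> list[int]:
    """Sort non-negative integers no larger than `max_value`.

    Runs in O(n + max_value) time, which is linear in n only when
    max_value is bounded. Illustrative sketch, not the netizens' algorithm.
    """
    counts = [0] * (max_value + 1)
    for v in values:
        if not 0 <= v <= max_value:
            raise ValueError(f"value {v} outside range 0..{max_value}")
        counts[v] += 1  # Tally how many times each key occurs.
    result = []
    for value, count in enumerate(counts):
        # Emit each key as many times as it was seen, in increasing order.
        result.extend([value] * count)
    return result
```

Any claim of a general-purpose O(n) sort should therefore be read with this restriction in mind.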

Other users tried its new Artifacts feature (which displays interactive output in a panel beside the conversation), generating and running code while chatting.

After testing, netizens exclaimed:

Its coding efficiency is ten times higher than that of GPT-4o or any other LLM.

Even Ethan Mollick, a professor at the Wharton School of the University of Pennsylvania, couldn’t help but try it out.

While coding, he simultaneously generated a game. (The video is at normal speed)

He compared the Artifacts feature with ChatGPT’s Code Interpreter tool:

It (Claude 3.5 Sonnet) is very impressive; its “Artifacts” is like a simplified version of Code Interpreter.

Creating Original Games

In user reviews, having Claude 3.5 Sonnet create a game has become one of the most popular activities for some reason.

Given just a screenshot, Claude 3.5 Sonnet wrote a fully functional Mancala web application in only 25 seconds.

At the same time, it completed other tasks:

Coding the entire game
Previewing it for testing
Providing game rules

When encountering coding errors, after a simple prompt, it completed the fixes in seconds.

Another user used it to recreate the classic game “Mario” in 3 minutes.

What surprised netizens was:

The prompt only asked it to use geometric shapes, yet it went further and provided character animations, and the shapes looked quite novel.

In addition to recreating existing games, writing original games is also within reach.

Flops Are Inevitable

Although Claude 3.5 Sonnet performs strongly, netizens have also noticed some flops.

For example, when asked to play “Tic-Tac-Toe”, it was unable to complete such a seemingly simple task.

Netizens helped Claude reflect:

I believe scaling existing techniques will help us get there. But if these models can’t even play Tic-Tac-Toe, how much do we need to scale them to complete more complex tasks?

Additionally, Claude 3.5 Sonnet also made mistakes on simple math word problems.

However, a netizen put the same question to Gemini 1.5 Pro, and it flopped as well.

Anthropic, The New King Maker?

From the day Claude’s parent company, Anthropic, was founded, it has been viewed as OpenAI’s strongest competitor in the startup sector.

The initial reason was that its founding team consisted of OpenAI veterans. In 2021, dissatisfied with OpenAI’s shift toward a closed approach after it accepted investment from Microsoft, they left in anger to establish a company that “pursues its original intentions”.

This is Anthropic.

In January 2023, Claude began internal testing, and users who experienced it early on stated it was much stronger than ChatGPT (at that time, the latest model was GPT-3.5).

Not long after, even cloud computing giant Amazon invested heavily in Anthropic. This time, Claude 3.5 not only launched officially but was also quickly added to the Amazon Bedrock platform.

Since then, Anthropic has continuously released new powerful models, racing to catch up with the GPT series, ultimately achieving a breakthrough and starting its own king-making journey.

In March of this year, Claude 3 officially shattered the myth of OpenAI’s invincibility.

Its performance metrics comprehensively surpassed GPT-4, making it the first product to fully exceed GPT-4, claiming the title of the world’s most powerful model.

At that time, Anthropic announced that the Claude 3 series models include three sizes:

Medium Cup Haiku, lightweight choice
Large Cup Sonnet, balancing performance and speed
Extra Large Cup Opus, the strongest in the series

Also in March, the Claude 3 extra-large cup Opus achieved the top Elo score in the large model arena.
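
The article does not explain how arena scores are computed, and the leaderboard’s actual rating procedure may differ. As background, here is a minimal sketch of the classic Elo update for a single head-to-head comparison. The function names, the K-factor of 32, and the 400-point scale are conventional chess values assumed here for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the classic Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


def update_ratings(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> tuple[float, float]:
    """Return new (A, B) ratings after one comparison.

    score_a is 1.0 if A wins, 0.5 for a tie, 0.0 if A loses.
    k is an assumed K-factor controlling how fast ratings move.
    """
    expected_a = expected_score(rating_a, rating_b)
    # A gains what it outperformed its expectation by; B loses the same amount.
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta
```

Under this model, a win over an equally rated opponent moves both ratings by half the K-factor, while a win over a much weaker opponent barely changes either rating.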

In May, OpenAI released GPT-4o, and the next day, key figure Ilya Sutskever announced his departure, setting off a frenzy in the large model community.

Taking advantage of the chaos, Anthropic quickly recruited Jan Leike, who left along with Ilya. He is one of the inventors of RLHF and previously co-led the Superalignment team at OpenAI with Ilya.

Moving seamlessly to the new company, Jan Leike remains responsible for superalignment work at Anthropic, with his new team focusing on scalable oversight, weak-to-strong generalization, and automated alignment research.

Now, the first model of the Claude 3.5 series has made a sudden appearance and boldly claimed the global number one spot.

Some netizens expressed with starry eyes:

Claude 3.5 Sonnet makes the “3.5 series” great again!

Moreover, if the tradition of the Claude 3 series continues, Claude 3.5 Sonnet should only be the large cup of this series.

Theoretically, there is still an extra-large cup, Opus, that Anthropic is holding back and hasn’t released yet.

We’ll see which will shine first on the large model leaderboard, Claude 3.5 Sonnet or GPT-5!

Waiting online, quite anxious (snacking while watching the show).

References:
[1] https://venturebeat.com/ai/anthropics-claude-3-5-sonnet-wows-ai-power-users-this-is-wild/
[2] https://x.com/RubenHssd/status/1803818710274134514
[3] https://twitter.com/literallydenis/status/1803810750000943468
[4] https://x.com/StuartJRitchie/status/1803870552802705488
[5] https://x.com/Taiyo_AiAA/status/1803813712970825746
[6] https://x.com/omarsar0/status/1803913875383030225
[7] https://x.com/shaunralston/status/1803926319643922626
[8] https://x.com/alexalbert__/status/1803804679891255688
[9] https://x.com/andrew_n_carr/status/1803842227594236159
[10] https://x.com/Saboo_Shubham_/status/1803823544960266281
[11] https://x.com/polynoamial/status/1803847377188720791

Source: Quantum Bits

Reviewed by: Zhao Lixin