Is Wenxin Yiyan 4.0 Really Comparable to GPT-4?

Let’s get straight to the point today: we are testing the Wenxin Yiyan 4.0 large model, which was released just yesterday.

The reason for this test is that Li Yanhong said at yesterday’s conference:

The comprehensive level of the Wenxin large model 4.0 is already comparable to GPT-4.

The statement immediately caused a stir.

According to Li Yanhong, Wenxin 4.0 has made rapid progress in memory, understanding, logic, and generation.

Although he personally demonstrated many cases on site, many users were not convinced at all.

Many people joked: “It’s fine to fool your brothers, but don’t fool yourself.”

So is it really comparable, or just bragging? Let’s test it ourselves to find out.

This time, thanks to his connections, Shichao was lucky enough to get early test access.

Since Wenxin 4.0 is claimed to be comparable to GPT-4, let’s pit the two against each other and see which is better.

Shichao spent a whole day testing from the moment he got access. He won’t keep everyone in suspense this time, so here is the conclusion up front:

Overall, GPT-4 is the solid winner, but Wenxin Yiyan 4.0 surprisingly outperforms it in certain areas.

Shichao tested from several common evaluation angles this time, to give a more comprehensive and realistic picture. The difficulty was kept in line with that of his earlier GPT-4 evaluation.

For the first round, let’s test something everyone enjoys.

We start with some easy questions and semantic traps, which also let us examine logic and comprehension.

Many large models have been specially trained on this kind of question, so most of them did not catch the two off guard. After some persistence, though, Shichao did find a loophole.

I asked a very classic silly question: Is there really a “dragon” in the world? I have been served by “a dragon” somewhere.

To my surprise, neither AI got this question right…

GPT-4 first: because it did not understand what the two “dragons” meant, it started fabricating historical anecdotes.

Wenxin didn’t do much better either, fabricating a kind of “humorous” explanation.

Shichao even gave it another chance afterwards and asked: Are the two dragons the same dragon?

Wenxin still confidently gave a completely wrong answer.

On the second question, however, GPT-4 rose to the occasion.

I asked: The company is one big warm family, no wonder I always play the role of the grandson.

Wenxin kept talking about the “warm company” and “no hierarchy”.

GPT-4, a foreign AI, had already grasped the implied meaning in the Chinese: warm on the surface, but actually cold.

However, when Shichao added a question about workplace leaders, the situation suddenly reversed, and Wenxin won decisively.

Shichao gave them a popular joke, “When the leader serves food, you turn the table; when the leader drinks water, you hit the brakes,” and asked them to generate a few similar sentences.

Getting this right is not easy. It requires not only a precise understanding of the question but also the ability to infer the pattern and emotional tone of the sentences.

Both AIs gave me sentences with neat parallelism, but GPT-4 completely misread the semantics. Its flattery of the leader was flawless, but unfortunately that made every answer wrong.

Wenxin’s answers, by contrast, were right in line with how today’s young people joke about leadership culture.

A friendly reminder, though: in real life, you are better off going with GPT-4’s version.

The first round ended in a tie: Wenxin 1, GPT-4 1.

It seems that Wenxin’s claim of rapid progress is not entirely bragging.

In the second round of competition, Shichao wanted to continue with something interesting and test the AI’s ability to interpret memes.

When GPT-4 launched, its ability to interpret meme images impressed people for a long time.

This time, Shichao not only had them read meme images but also raised the bar to see whether they could handle the newest memes on the internet.

Since the previous tests hinged on Chinese semantics, Shichao felt that was a bit unfair to GPT-4, so he specifically chose a meme image with both Chinese and English captions.

Just like my life

I don’t know what I’m busy with

I’m not sure whether the English helped, but this time GPT-4 interpreted the meme noticeably better.

Not only could it understand that the dog was the key character in this meme, but it also grasped that the humor lies in the contrast between “seriously helping” and “having no effect”.

However, Wenxin was still treating the meme as a reading comprehension question…

And it was quite adamant, saying that there was nothing funny about this image and insisted: I don’t understand what you’re laughing about.

However, although Wenxin is not good at explaining meme images, it quickly turned the tables when it came to Chinese internet memes.

Shichao asked about the recently popular internet celebrity Wanyan Huide’s “lonely” meme.

If you are not a level 10 surfer, you would probably be confused when you see this sentence.

As a result, Wenxin not only pointed out the source of the meme but also correctly explained that this is a homophone meme.

Regrettably, it ended up misreading “ethics” as “theory”, which left it just one step short of full marks.

But if Wenxin fell short of a perfect score, GPT-4 simply failed…

Not only did it fail to understand the meme, but it also got the source wrong, suggesting you look for answers in the large documentary The Legend of Wanyan Huide.

After these two small tests in the second round, each side had its strengths and weaknesses, so the round was a tie. Wenxin keeps up with trending memes very quickly, while GPT-4 is stronger at interpreting images.

After two rounds of competition, there is still no clear winner, and the score remains tied at 2 to 2.

Next, to widen the score gap, we need to bring out some heavyweights.

The previous two rounds focused on basic semantic understanding, so in the third round we test professional ability and go straight at GPT-4’s strongest point: coding problems.

I wonder if anyone remembers back in the day when GPT-4 took 60 seconds to create a complete Snake game, shocking the whole community.

Now we will use the same test to see how Wenxin performs.

Since the code is relatively long, we won’t display it all here. Let’s go straight to the end to see the final effect.

First, let’s look at GPT-4, which was as steady as ever. In a few dozen seconds, it produced a complete, playable Snake game, including the movement of the snake, food appearing at random positions, and the snake growing after eating.
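For readers curious what those three behaviors involve, here is a minimal headless sketch in Python. To be clear, this is not the code either model generated (that code is not reproduced in this article). The grid size and the names `spawn_food` and `step` are assumptions made up for illustration, and there is no rendering or keyboard input, only the core update logic.

```python
import random
from collections import deque

# Hypothetical grid size; the article does not state one.
WIDTH, HEIGHT = 10, 10


def spawn_food(snake, rng):
    """Pick a random free cell for the next piece of food."""
    free = [(x, y) for x in range(WIDTH) for y in range(HEIGHT)
            if (x, y) not in snake]
    return rng.choice(free)


def step(snake, direction, food, rng):
    """Advance the game by one tick and return (snake, food, alive)."""
    head_x, head_y = snake[0]
    dx, dy = direction
    new_head = (head_x + dx, head_y + dy)
    in_bounds = 0 <= new_head[0] < WIDTH and 0 <= new_head[1] < HEIGHT
    # Simplification: hitting a wall or any body cell ends the game.
    if not in_bounds or new_head in snake:
        return snake, food, False
    snake.appendleft(new_head)         # movement: a new head cell appears
    if new_head == food:
        food = spawn_food(snake, rng)  # growth: keep the tail, respawn food
    else:
        snake.pop()                    # nothing eaten: drop the tail cell
    return snake, food, True


rng = random.Random(0)
snake = deque([(2, 0), (1, 0), (0, 0)])  # head first
snake, food, alive = step(snake, (1, 0), (3, 0), rng)
print(len(snake), snake[0], alive)  # 4 (3, 0) True
```

A real game wraps `step` in a timed loop and draws the grid every tick. A Snake that “doesn’t move” usually means that loop, or the redraw inside it, never runs.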

Wenxin’s performance, however, was far from satisfactory.

Its Snake game didn’t even move, and when I asked Wenxin to fix the code, it only got worse.

This isn’t a GIF that isn’t moving

It’s that Wenxin didn’t produce any moving effects at all

This doesn’t mean Wenxin is terrible, though; the large gap here comes down to GPT-4’s coding ability being freakishly strong.

If we lower the difficulty a little and have them build a website from a sketch, Wenxin handles it easily as well.

Even so, comparing the two resulting websites, GPT-4’s is still more polished and complete.

Wenxin Yiyan

GPT-4

In this third round, GPT-4 was undoubtedly dominant across the board. The score has now opened up: Wenxin vs GPT-4 = 2:3.

To keep things fair, since we just tested one of GPT-4’s strong points, we will now test a capability that Wenxin claims to be good at: memory.

Shichao found an interview document about guide dogs that runs to over 13,000 words.

After handing this large document to both AIs, I asked a very simple question:

Why is it said that guide dogs are a scam?

Surprisingly, although nothing GPT-4 said was wrong in itself, its analysis was completely off-topic.

I asked why guide dogs were called a scam, and it talked about training difficulty and the dogs’ guiding ability…

On the contrary, Wenxin understood it accurately, answering that the costs are high, the promotion is exaggerated, and that guide devices have better prospects, etc., which are the key pieces of information.

Wenxin has indeed shown solid performance in memory and understanding. It successfully turned the score back to a tie at 3:3.

Since the situation is so tight, for this final round, let’s try a more interesting question.

As mentioned earlier, the GPT-4 Vision version has very strong image recognition: it can label individuals in group photos, sort images, and so on.

Since several earlier questions showed that Wenxin’s image recognition is not weak either, we will use an image for this final question to decide the winner.

Shichao submitted a dental X-ray image and asked both to act as doctors and diagnose the condition.

Both AIs diagnosed the existing wisdom tooth impaction issue, and GPT-4 even pointed out that there were alignment issues with the upper teeth, with three teeth overlapping.

Although Wenxin also identified the wisdom tooth impaction problem and pointed out other possible issues, GPT-4’s answer was still more accurate and relevant.

After five rounds of competition, Wenxin Yiyan lost to GPT-4 by a score of 3:4, with an especially heavy defeat in coding.
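In case the arithmetic is not obvious (five rounds, seven points), here is a small Python tally of the scores as reported above. The two tied rounds award a point to each side. The round labels are shorthand chosen here for illustration, not official test names.

```python
# Points per round as reported in this article: (label, Wenxin, GPT-4).
rounds = [
    ("semantic traps", 1, 1),        # tie
    ("memes", 1, 1),                 # tie
    ("coding", 0, 1),                # GPT-4
    ("long-document memory", 1, 0),  # Wenxin
    ("image diagnosis", 0, 1),       # GPT-4
]

wenxin = sum(w for _, w, _ in rounds)
gpt4 = sum(g for _, _, g in rounds)
print(f"Wenxin vs GPT-4 = {wenxin}:{gpt4}")  # Wenxin vs GPT-4 = 3:4
```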

However, in terms of Chinese semantic understanding and memory, it has indeed improved significantly, as claimed by Baidu.

In addition to the basic capabilities tested above, Wenxin Yiyan has also launched several plugin features.

These include Yijing Liuying (video generation), Shuo Tu Jie Hua (image interpretation), and E Yan Yi Tu (visual data analysis).

For example, ask it in one sentence to make a video of a golden retriever climbing stairs, and a few minutes later a video with sound is ready.

It is still rough at present, though, and often reports that there is not enough material to generate a video.

As a toy, it’s quite fun to try, but as a productivity tool, it still falls short.

Despite this, Wenxin 4.0’s performance has already impressed me.

To be honest, Shichao originally did not have high hopes for Wenxin, because GPT-4’s power is evident to all.

In the face of such a strong opponent, it is easy to feel that all your efforts are in vain…

This time, although it still lost, you can at least see where it has progressed and which fields it is better at.

It must be emphasized, however, that Shichao’s tests only offer a simple comparison of the two large models from a conventional perspective. They are just an early taste for everyone and cannot fully represent the strength of either model.

To truly understand their capabilities, we will have to wait until they are fully open and everyone can try them firsthand.

Written by: Shida Edited by: Mianxian & Jiangjiang Cover by: Xuanxuan

Image and data sources:

Wenxin Yiyan, GPT-4

Baidu World 2023 Conference
