Devin's First-Hand Experience: High Completion Rate But Far From Replacing Programmers

New Intelligence Report

Editor: Run

[New Intelligence Guide] Is Devin just a nice demo or an intelligent agent that can already replace programmers? How is the user experience? A netizen who obtained the testing qualification shared their experience right away.

Developed by the startup team Cognition AI, whose members hold 10 IOI gold medals between them, Devin is billed as the world’s first AI programmer agent. Its release has caused quite a stir in tech circles.

In the demonstration, Devin can almost independently complete many tasks that human programmers would take a lot of time to finish, with results no worse than those of ordinary programmers.

However, where do the product’s capabilities end? Real-world use often differs from a demo, so the answer has to come from hands-on testing.

A Stanford student contacted the team as soon as Devin was released and obtained early access.

He asked Devin to work on several projects of varying difficulty, recorded a video, and shared his impressions on Twitter.

First, he asked Devin to create a program that retrieves stock prices using an API.

The next task was to have Devin create a website that allows ordinary users to play chess directly with a large model.

Complex programming tasks are still out of reach

The user’s next move is translated into a prompt for GPT-4, GPT-4 responds, and the response is converted into a specific move on the chessboard.

To meet the student’s requirements, the system needs quite a few components.

He was particularly interested in whether Devin could do the following while developing the system:

Know how to use the GPT-4 API accurately, as most LLMs do not actually know how to use it and there are version conflicts in API calls.

Request the API key correctly and handle it securely.

Handle package errors.

Understand how to prompt an LLM to play chess and return its responses accurately.
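
To make these points concrete, here is a minimal sketch of the glue such an app needs: reading the key from the environment rather than hard-coding it, turning the game so far into a prompt, and pulling a move out of the model's reply. It is not Devin's code. The environment variable name, the prompt wording, and the function names are all assumptions made for illustration, and the actual API call is left out.

```python
import os
import re

# Hypothetical variable name; the article does not say how Devin stored the key.
API_KEY_ENV = "OPENAI_API_KEY"

# Loose pattern for algebraic notation, e.g. "e4", "Nf3", "exd5", "O-O", "e8=Q+".
SAN_PATTERN = re.compile(
    r"\b(O-O-O|O-O|[KQRBN]?[a-h]?[1-8]?x?[a-h][1-8](?:=[QRBN])?)[+#]?"
)


def load_api_key(env=os.environ):
    """Read the key from the environment instead of hard-coding it in source."""
    key = env.get(API_KEY_ENV)
    if not key:
        raise RuntimeError(f"Set {API_KEY_ENV} before starting the app")
    return key


def build_prompt(moves_so_far):
    """Turn the game so far into a text prompt asking the model for one move."""
    history = " ".join(moves_so_far) if moves_so_far else "(no moves yet)"
    return (
        "We are playing chess. Moves so far in algebraic notation: "
        f"{history}. Reply with your next move only, in algebraic notation."
    )


def parse_move(reply):
    """Return the first substring that looks like a move, or None."""
    match = SAN_PATTERN.search(reply)
    return match.group(0) if match else None
```

The pattern only checks that the reply looks like a move. A real app would also have to verify that the move is legal on the current board before applying it.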

What surprised the student was that Devin not only asked him to provide the API key but also protected it correctly during the trial.

However, Devin is still quite slow to respond; the student speculated that far more prompting happens in the background than the user can see.

From the moment the student submitted the request, it took Devin about 19 minutes to ask for the API key.

The student guessed that if the delay was due to a large number of prompts running in the background, it should shrink over time, because the team may later get access to dedicated GPUs or work with Anthropic or OpenAI to reduce latency (the underlying model is estimated to be GPT-4 or Claude Opus).

Devin first made a plan.

In the upper right corner, users can toggle the “Follow” status, which makes the screen automatically switch to the tab Devin is currently active in.

The student did not enable Follow because he wanted to be free to watch changes in different places at any time.

The planner keeps the status of the current task up to date at all times.

The shell looks no different from a regular shell, but it’s really fun to use!

Devin opens multiple shells while it works, and at the bottom of the shell, users can drag the blue slider to review the commands Devin has written.

At one point, it was trying to debug content that was not rendering on the chessboard.

Meanwhile, the student asked it to perform another data analysis task.

The student asked Devin to “create a map of sea temperatures in Antarctica over the past fifty years.”

For this request, the student felt that there were two aspects that might be quite challenging:

Handling spatial data plotting/visualization.

Knowing where to download the data and understanding how to use the data source, as geospatial data is quite tricky to handle.

Devin can read the documentation like an excellent programmer and also performs some basic exploratory data analysis (EDA) to understand the data structure.

The data turned out to be an ASCII file, which the student found a bit strange.
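
As a rough illustration of what "basic EDA" on an ASCII data file can look like, the sketch below counts rows and reports the minimum and maximum of each column. The article does not describe the file's layout, so the whitespace-separated numeric columns and `#` comment lines assumed here are hypothetical, as is the function name.

```python
def summarize_ascii_table(lines):
    """Count rows and report min/max per column of a whitespace-separated table.

    Assumes (hypothetically) that every data line holds only numbers and that
    lines starting with '#' are comments.
    """
    rows = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        rows.append([float(value) for value in line.split()])

    if not rows:
        return {"rows": 0, "columns": 0, "min": [], "max": []}

    columns = list(zip(*rows))  # transpose rows into columns
    return {
        "rows": len(rows),
        "columns": len(columns),
        "min": [min(col) for col in columns],
        "max": [max(col) for col in columns],
    }
```

Passing the lines of a file opened with `open(path)` works directly, since a file object yields one line at a time.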

When the student clicked on one of the steps in the “Debug Python script…” dialog, it opened the relevant part of the codebase, allowing him to track what happened at a specific point in time.

What concerned the student more was that when Devin does not need to ask for an API key, it seems to keep coding non-stop.

So he tried to see whether he could change his previous request or specify something else to interrupt Devin’s coding process.

Most users will change their minds or think of new things to add while the code is being written, so being able to handle such changes is essential.

Then the student made another request for data visualization, asking the system to set high temperatures to blue and low temperatures to red.
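
The requested color scheme inverts the usual convention, so it is worth spelling out. The sketch below maps a temperature linearly onto a red-to-blue scale with the coldest value red and the warmest blue. The function name and the choice of a plain linear RGB blend are assumptions for illustration; the article does not show how Devin implemented it.

```python
def temperature_to_rgb(temp, t_min, t_max):
    """Map a temperature to an (R, G, B) tuple: coldest is red, warmest is blue."""
    if t_max <= t_min:
        raise ValueError("t_max must be greater than t_min")

    # Position of temp within the range, clamped to [0, 1].
    frac = (temp - t_min) / (t_max - t_min)
    frac = max(0.0, min(1.0, frac))

    red = round(255 * (1.0 - frac))  # fades out as temperature rises
    blue = round(255 * frac)         # fades in as temperature rises
    return (red, 0, blue)
```

Values outside the range are clamped, so an outlier reading gets the color of the nearest end of the scale rather than raising an error.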

In order not to interrupt the coding process, it seems Devin opened another working thread to record the student’s temporary request.

Ultimately, Devin deployed the app on Netlify, and one application is already online.

Link to the webpage: https://t.co/wTbtz2waDn

Just like programs written by humans, the first version is bound to have bugs.

The student had requested temperature records for Antarctica, and Devin seemed to have some trouble with that request.

So the student changed the requested display location to North America.

Summary

The student did not show whether Devin fixed the bugs; he only gave a preliminary summary of his experience using Devin to develop the first website.

First, the advantages:

Devin’s productization is very well done; it gives the user experience of a complete product rather than just a simple dialog box.

The AI is the most critical part of the system, but the product structure supporting AI functionality is the highlight of Devin.

Devin handles a range of useful features, such as automatic deployment, API key protection, and accepting modified or added requirements at any time.

The product’s level of completion is already very high, far exceeding that of a typical demo.

Now for the disadvantages:

Devin is still quite slow to respond; that said, the student mentioned that he was on a 1M Starlink connection, so the slowness could well be down to his own network.

Secondly, it still does not allow users to edit code directly, nor can they collaborate on tasks.

Finally, the initial chess application stumped Devin, which ultimately failed to deploy it, and the data visualization task seems to have some bugs as well.

In the end, the student created a Chrome extension with Devin that can help users convert GitHub repos into Claude prompts.

Plugin download link: https://t.co/k3l8JTWK7Z
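
For readers wondering what "converting a repo into a prompt" involves, the core idea can be sketched in a few lines: walk the files, keep the text ones, and join them with a marker showing each path. This is not the extension's code, which is not shown in the article. It works on a local folder rather than GitHub, and the suffix list, the `<file>` wrapper format, and the function name are all assumptions.

```python
from pathlib import Path

# Hypothetical allow-list of file types worth including in a prompt.
TEXT_SUFFIXES = {".py", ".md", ".txt", ".js", ".ts", ".json"}


def repo_to_prompt(root, max_chars_per_file=20_000):
    """Concatenate the text files under root into one prompt string."""
    root = Path(root)
    parts = []
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in TEXT_SUFFIXES:
            continue
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue  # skip version-control internals
        # Truncate very large files so one file cannot dominate the prompt.
        text = path.read_text(encoding="utf-8", errors="replace")
        text = text[:max_chars_per_file]
        parts.append(f'<file path="{relative.as_posix()}">\n{text}\n</file>')
    return "\n\n".join(parts)
```

Sorting the paths keeps the output stable between runs, which makes it easier to compare prompts generated from two versions of the same repo.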

Netizen Feedback

After watching this hands-on test, netizens felt a bit disappointed; after all, a junior programmer could complete these tasks, yet Devin’s visualization project only produced a webpage with bugs.

To them, Devin seems to be essentially a large model that can go online, and for now it still has difficulty solving practical problems.

References:

https://twitter.com/itsandrewgao/status/1768012781083566217?s=20

https://twitter.com/varunshenoy_/status/1767591341289250961?s=20
