Machine Heart Reports
Machine Heart Editorial Team
After a month of working with Devin, these researchers came away with some less-than-optimistic feedback.
In the field of AI programming, you may have heard of Devin, a product released by the startup Cognition. Devin acts like a super-intelligent assistant, helping engineers complete their work faster and better. Upon its release, many praised Devin as the world’s first AI software engineer. It seems to have the ability to learn new technologies, debug mature codebases, deploy complete applications, and even train AI models.
But is that really the case? The answer: not necessarily. Recently, researchers at the AI R&D lab Answer.AI documented their experience using Devin.
They wrote up what they found in a blog post titled “Thoughts On A Month With Devin,” detailing their impressions after assigning Devin more than 20 tasks.
Blog link: https://www.answer.ai/posts/2025-01-08-devin.html
Here is the content of the blog:
Simple Tasks Performed Acceptably
The first task was simple but real: pulling data from a Notion database into Google Sheets. Devin completed this task with surprising ability. It browsed the Notion API documentation, understood what was needed, and guided me in setting up the necessary credentials in Google Cloud Console. It didn’t just dump the API instructions but led me through each menu and button click—saving me the hassle of searching through documentation. The entire process took about an hour (but only a few minutes of manual interaction). In the end, Devin shared a link to a perfectly formatted Google Sheet containing our data.
The code it generated was somewhat verbose but runnable. This felt like a glimpse into the future—an AI that can handle “glue code” tasks that consume a lot of developer time. Johno used Devin to create a planet tracker to debunk misconceptions about the historical positions of Jupiter and Saturn, achieving similar success. Impressively, he did this entirely through his phone, with Devin handling all the heavy lifting of setting up the environment and writing code.
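To give a sense of what this kind of glue code looks like, here is a minimal sketch using the notion-client and gspread packages. The database property names, credential paths, and sheet name are placeholders for illustration, not details from the blog.

```python
import os

import gspread  # pip install gspread
from notion_client import Client  # pip install notion-client

# Authenticate against both services; the token, credentials file, and
# sheet name below are placeholders, not details from the blog post.
notion = Client(auth=os.environ["NOTION_TOKEN"])
gc = gspread.service_account(filename="google-service-account.json")

# Page through every row of the Notion database.
rows, cursor = [], None
while True:
    kwargs = {"database_id": os.environ["NOTION_DATABASE_ID"]}
    if cursor:
        kwargs["start_cursor"] = cursor
    resp = notion.databases.query(**kwargs)
    for page in resp["results"]:
        props = page["properties"]
        # "Name" and "Status" are hypothetical column names.
        title = "".join(t["plain_text"] for t in props["Name"]["title"])
        status = (props["Status"].get("select") or {}).get("name", "")
        rows.append([title, status])
    if not resp["has_more"]:
        break
    cursor = resp["next_cursor"]

# Append a header row plus the data to the first worksheet.
ws = gc.open("Notion Export").sheet1
ws.append_rows([["Name", "Status"]] + rows)
```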
Fatal Flaw: Wasting Time on Impossible Tasks
Building on early successes, we relied on Devin’s asynchronous capabilities. We imagined having Devin write documentation during meetings or debug issues while we focused on design work. But as we expanded the scope of testing, problems emerged. Seemingly simple tasks often took days instead of hours, with Devin getting stuck in technical dead ends or producing overly complex, unusable solutions.
More concerning was Devin’s tendency to push forward on tasks that were actually impossible to complete. When asked to deploy multiple applications to a single Railway deployment (which Railway does not support), Devin failed to recognize this limitation and spent over a day trying various methods while fantasizing about nonexistent features.
The most frustrating part was not the failures themselves—every tool has limitations—but how much time we wasted trying to salvage these attempts.
Delving into Where Things Went Wrong
What puzzled us during the exploration was that Devin could skillfully handle API integrations and build functional applications but struggled with some seemingly simpler tasks.
We began to wonder if it was just bad luck or if we were using it incorrectly. To answer this, we systematically documented attempts across the following categories of tasks over nearly a month:
- Creating new projects from scratch
- Conducting research tasks
- Analyzing and modifying existing projects
The results were shocking: of the 20 tasks, Devin failed 14 times, succeeded 3 times, and the results were uncertain 3 times.
Even more concerning was that we could not find any patterns to predict which tasks would succeed; even tasks similar to early successes would fail in unexpected ways.
Here are some of our experiences summarized across the task categories.
Creating New Projects from Scratch
Creating new projects from scratch was supposed to be Devin’s strong suit. After all, Devin’s initial demo video showcased its ability to autonomously complete Upwork tasks. However, reality was more complex.
Here, we attempted a project integrating an observability platform. The task was clear: generate synthetic data and upload it. However, Devin did not provide a concise solution but instead generated a pile of code soup—layers of abstraction made simple operations unnecessarily complex.
We eventually gave up on Devin and turned to Cursor to build the integration step by step, which proved far more efficient.
Other attempts fared similarly. When we asked Devin to create an integration between an AI note-taking tool and Spiral.computer, one team member described the generated code as “spaghetti code, more confusing to read than writing from scratch.” Even though Devin had access to the documentation for both systems, it seemed to complicate every aspect of the integration.
However, the most telling issue arose when we asked Devin to do some web scraping. We asked it to follow links on Google Scholar and scrape an author's 25 most recent papers, a task that should be straightforward with a tool like Playwright.
Given that Devin has the capability to browse the web and write code, this should have been particularly easy to accomplish. However, it got stuck in an endless HTML parsing loop, unable to extricate itself from its own confusion.
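For a sense of how short the straightforward version is, here is a minimal Playwright sketch. The profile URL pattern and CSS selectors are assumptions based on Google Scholar's public markup, which can change (and Scholar rate-limits automated access aggressively).

```python
from playwright.sync_api import sync_playwright  # pip install playwright

# Scrape the most recent paper titles from a Google Scholar profile.
# The URL pattern and selectors below are assumptions for illustration.
AUTHOR_URL = "https://scholar.google.com/citations?user=AUTHOR_ID&sortby=pubdate"

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(AUTHOR_URL)
    # On profile pages, each publication's title link carries the
    # class gsc_a_at (an assumption about Scholar's current markup).
    titles = page.eval_on_selector_all(
        "a.gsc_a_at", "els => els.map(e => e.textContent)"
    )
    for title in titles[:25]:
        print(title)
    browser.close()
```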
Conducting Research Tasks
If Devin performed poorly on specific coding tasks, perhaps it would do better on research tasks?
However, the results were at best mixed. While it could handle basic document lookups, it faced challenges with more complex research tasks.
For example, when we asked Devin to produce a transcript summary with accurate timestamps, it simply restated information unrelated to the core issue instead of genuinely engaging with the problem. It failed to explore potential solutions or identify the key technical challenges, offering only generic code examples that did not address the fundamental issues.
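For contrast, timestamped transcription itself is well-trodden ground. The sketch below uses the openai-whisper package, which is our choice for illustration rather than anything named in the blog; the model size and file path are placeholders too.

```python
import whisper  # pip install openai-whisper

# Transcribe an audio file and print each segment with start/end times.
# "base" and "meeting.mp3" are placeholders for illustration.
model = whisper.load_model("base")
result = model.transcribe("meeting.mp3")

for seg in result["segments"]:
    print(f"[{seg['start']:7.1f}s -> {seg['end']:7.1f}s] {seg['text'].strip()}")
```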
Even when Devin seemed to make progress, the results were often unsatisfactory. For instance, when asked to create a minimal DaisyUI theme, it generated a seemingly viable solution. However, upon closer inspection, we found that the theme actually did not work at all—the colors we saw came from the default theme, not our custom settings.
Analyzing and Modifying Existing Code
Devin’s most concerning failures appeared when handling existing codebases. These tasks require understanding context and maintaining consistency with existing patterns—skills that should be core competencies for an AI software engineer.
Our experience trying to get Devin to work with nbdev was particularly enlightening. When asked to migrate a Python project to nbdev, Devin could not even grasp the basic nbdev setup, despite having been given comprehensive access to the documentation. More puzzling still was its approach to notebooks: rather than editing them directly, it wrote Python scripts to modify them, adding unnecessary complexity to simple tasks. While it occasionally offered useful comments or ideas, the code it generated was consistently problematic.
Security audits also exposed similar issues. When we asked Devin to assess a GitHub repository (with fewer than 700 lines of code) for security vulnerabilities, it overreacted, flagging numerous false positives and even fabricating issues that did not exist. Such analysis might be better suited for a simple, targeted LLM call rather than Devin’s more complex approach.
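As a sketch of what a simple, targeted LLM call could look like here, the snippet below concatenates a small repository and asks one focused question via the OpenAI Python client. The model name, prompt, and repository layout are our assumptions, not what Answer.AI actually did.

```python
from pathlib import Path

from openai import OpenAI  # pip install openai

# A repo under 700 lines fits comfortably in a single context window,
# so just concatenate it. "repo" is a hypothetical directory name.
source = "\n\n".join(
    f"# {path}\n{path.read_text()}"
    for path in sorted(Path("repo").rglob("*.py"))
)

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",  # model choice is an assumption for illustration
    messages=[
        {
            "role": "system",
            "content": "You are a security reviewer. Report only concrete, "
                       "verifiable vulnerabilities, citing file and line.",
        },
        {"role": "user", "content": source},
    ],
)
print(response.choices[0].message.content)
```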
This pattern persisted in debugging tasks. When investigating why SSH key forwarding was not working in a setup script, Devin focused solely on the script itself, never considering that the issue might lie elsewhere. Similarly, when asked to add conflict checks between user input and database values, a team member spent hours reviewing Devin’s attempts, ultimately giving up and completing the functionality themselves in about 90 minutes.
Team Sentiments: No Scenarios That Made Us Want to Use Devin
After a month of intensive testing, our team members expressed the following sentiments:
“The tasks it can complete are the very small, well-defined ones that I could probably do faster my own way. The complex tasks that are supposed to save time are the ones likely to fail. So there’s no specific scenario where I feel I’d really want to use it.” —Johno Whitaker
“Initially, I was excited that it was so close to what I wanted because I felt I just needed to tweak a few things. Then, as I had to change more and more, I became frustrated and ultimately found it better to start from scratch and build step by step.” —Isaac Flath
“Devin struggled with key internal tools used at AnswerAI, along with some other issues that made it difficult to use. Despite providing Devin with a wealth of documentation and examples, it still encountered these problems. However, I didn’t find such issues when using tools like Cursor. With Cursor, there are more opportunities to guide things in the right direction step by step.” —Hamel Husain
In contrast to our experience with Devin, we found that more developer-driven workflows (like those offered by tools such as Cursor) avoided most of the issues we ran into while working with Devin.
Working with Devin showed us what autonomous AI development aspires to be. The user experience is polished: you chat with it through Slack, watch it work asynchronously, and see it set up environments and handle dependencies on its own.
But the problem is, it often doesn’t work well. Of the 20 tasks we attempted, we saw 14 failures, 3 uncertain results, and only 3 successes. Even more concerning, we could not predict which tasks would succeed. Even tasks similar to our early successes would fail in complex, time-consuming ways. What seemed like promising autonomy became a burden—Devin would spend days pursuing impossible solutions rather than recognizing fundamental barriers.
This reflects a pattern we have seen repeatedly in AI tools: the excitement on social media and the size of company valuations have little to do with real-world usefulness. The most reliable signal comes from detailed accounts of users actually delivering products and services.
One More Thing: Is the New Version Here to Solve Problems?
A lengthy blog from the Answer.AI team exposed the issues Devin encountered. The original blog’s appendix also showcased specific tasks researchers undertook with Devin.
Perhaps everyone is eagerly awaiting the arrival of the new version, hoping these issues will be resolved.
Unfortunately, the new release turns out to be only an incremental update. The latest release, Devin 1.2, mainly upgrades Devin's ability to reason about context within a repository.
New version updates can be summarized as follows:
1: The updated Devin is more likely to find relevant files that need editing, reuse existing code and patterns, and overall generate more accurate Pull Requests. These improvements will roll out gradually to all users.
2: Devin can now respond to audio messages. Try verbally explaining your tasks and feedback to Devin to receive a response.
3: Enterprise accounts launched. Administrators of enterprise accounts can:
- Manage members and access permissions for all organizations;
- Centralize billing for all organizations.
Currently, the enterprise account features are only available to Devin’s enterprise customers.
4: Pay-as-you-go billing launched. Starting this month, users can pay as they go, up to an additional usage budget that they set themselves.
Users can set their additional usage budget in “Settings > Plans > Manage Plan Limits” or “Settings > Usage and Limits > Manage Additional Usage Budget.”
Looking at this, while Devin has moved up to version 1.2, the update does not address the many issues users have encountered, such as the string of problems Answer.AI describes above.
What problems have you encountered while using Devin? Feel free to vent in the comments section.
Product update link: https://www.cognition.ai/blog/jan-25-product-update