DS-Agent: Case-Based Reasoning for Data Science Automation

Machine Heart Column

Machine Heart Editorial Team

Case-based reasoning empowers large model agents to tackle automated data science tasks. Researchers from Jilin University, Shanghai Jiao Tong University, and Wang Jun’s team at UCL have released DS-Agent, a framework focused on data science.

In the era of big data, data science encompasses the entire cycle of extracting insights from data, including key processes such as data collection, processing, modeling, and prediction. Given the complex nature of data science projects and their deep reliance on human expert knowledge, automation has tremendous potential to change the paradigm of data science. With the rise of generative pre-trained language models, it has become increasingly important for large language model agents to handle complex tasks.

Traditional data processing and analysis mostly rely on specialized data scientists, which is time-consuming and labor-intensive. If large language model agents could take on the role of data scientists, they could provide more efficient insights and analyses, opening up unprecedented industrial models and research paradigms.

Given the requirements of a data task, a data science-focused agent can autonomously process massive amounts of data and uncover the patterns and trends hidden in it. More broadly, it can provide clear strategies and code for model building, run model deployment and inference, and finally use data visualization to make complex data relationships clear.

Recently, the team from Jilin University, Shanghai Jiao Tong University, and UCL led by Wang Jun proposed DS-Agent, positioning this agent as a data scientist aimed at handling complex machine learning modeling tasks in automated data science. Technically, the team adopted a classic artificial intelligence strategy—Case-Based Reasoning (CBR)—which endows the agent with the ability to “reference” successful past solutions, enabling it to utilize previous experiences in solving similar problems to address new issues.

Paper link: https://arxiv.org/pdf/2402.17453.pdf
Code link: https://github.com/guosyjlu/DS-Agent
Paper title: DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning

Research Background

In the open-ended decision-making setting of automated data science, current large model agents (e.g., AutoGPT, LangChain, ResearchAgent) struggle to achieve a high success rate, even when paired with GPT-4. The main challenge is that these agents cannot reliably generate sound machine learning solutions and are prone to hallucinated outputs. Fine-tuning large models for this specific data science scenario seems a feasible strategy, but it introduces two new problems: (1) Obtaining an effective feedback signal requires training a machine learning model, so accumulating sufficient fine-tuning data takes a lot of time. (2) Fine-tuning requires backpropagation, which adds computational overhead and significantly raises the demand for computational resources.

In this context, the team decided to use Kaggle as a key resource. As the largest data science competition platform in the world, it has a wealth of technical reports and code contributed by an experienced community of data scientists. To enable large model agents to efficiently leverage this expert knowledge, the team adopted a classic artificial intelligence problem-solving paradigm—Case-Based Reasoning.

The core working mechanism of case-based reasoning is to maintain a case library that continuously stores past experiences. When a new problem arises, CBR retrieves similar past cases from the case library and attempts to reuse their solutions to solve the new problem. Subsequently, CBR evaluates the effectiveness of the solution and revises it based on feedback; successful solutions during this process are added to the case library for future reuse.
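The cycle described above (retrieve, reuse, evaluate, revise, retain) can be sketched in a few lines. This is only an illustrative sketch, not DS-Agent's implementation: the names `Case`, `CaseLibrary`, `jaccard`, and `solve` are hypothetical, the token-overlap similarity is a stand-in for whatever retriever a real system would use, and `evaluate` and `revise` are callbacks the caller must supply.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Case:
    problem: str   # textual description of a past problem
    solution: str  # the solution that worked for it


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a toy stand-in for a real retriever."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


@dataclass
class CaseLibrary:
    cases: list = field(default_factory=list)

    def retrieve(self, problem: str) -> Optional[Case]:
        """Return the stored case most similar to the new problem."""
        if not self.cases:
            return None
        return max(self.cases, key=lambda c: jaccard(c.problem, problem))

    def retain(self, problem: str, solution: str) -> None:
        """Store a successful solution for future reuse."""
        self.cases.append(Case(problem, solution))


def solve(library, problem, evaluate, revise, max_rounds=3):
    """Retrieve -> reuse -> evaluate -> revise -> retain."""
    case = library.retrieve(problem)
    solution = case.solution if case else ""    # reuse the past solution
    for _ in range(max_rounds):
        ok, feedback = evaluate(solution)       # evaluate the candidate
        if ok:
            library.retain(problem, solution)   # retain the success
            return solution
        solution = revise(solution, feedback)   # revise using the feedback
    return None                                 # no acceptable solution found
```

A solution is written back to the library only after it passes evaluation, so the library grows with verified cases rather than with every attempt.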

Based on this, the team proposed DS-Agent, which utilizes CBR to enable large model agents to analyze, extract, and reuse human expert insights from Kaggle, iteratively revising solutions based on actual execution feedback, thus achieving continuous performance improvement for data science tasks.

Framework Details

Overall, DS-Agent implements two modes to adapt to different application stages and resource requirements.

Standard Mode (Development Stage): DS-Agent adopts CBR to build an automated iterative process that simulates the continuous exploration process of data scientists when building and adjusting machine learning models, seeking the best solution through continuous experimentation and optimization.
Low Resource Mode (Deployment Stage): DS-Agent reuses successful cases accumulated during the development stage to generate code, significantly reducing the demand for computational resources and base model inference capabilities, making it possible for open-source large models to solve automated data science tasks.

During the development stage, given a new data science task, DS-Agent first retrieves relevant human expert knowledge from Kaggle and constructs a preliminary solution based on it. It then enters an iterative loop of writing and debugging code to train the machine learning model, obtaining performance metrics on the test set. These metrics serve as the feedback used to evaluate and improve the solution: DS-Agent revises the model design based on them in search of the best design. The best machine learning solutions found in this process are stored in the case library as references for similar tasks in the future.
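The development-stage loop can be pictured roughly as follows. This is a hedged sketch rather than the project's code: `develop`, `propose`, and `run_experiment` are hypothetical names, `propose` stands in for a large model call, and `run_experiment` stands in for training and evaluating a model, returning a score where higher is assumed to be better.

```python
from typing import Callable, Optional, Tuple


def develop(
    task: str,
    insights: str,
    propose: Callable[[str, str, str], str],
    run_experiment: Callable[[str], float],
    n_iters: int = 5,
) -> Tuple[Optional[str], float]:
    """Iteratively propose, execute, and keep the best-scoring solution.

    propose(task, insights, feedback) returns a candidate solution;
    run_experiment(solution) returns its score or raises on failure.
    """
    best_solution, best_score = None, float("-inf")
    feedback = ""
    for _ in range(n_iters):
        solution = propose(task, insights, feedback)
        try:
            score = run_experiment(solution)
        except Exception as err:
            # Execution failed: pass the error back so the next proposal can fix it.
            feedback = f"error: {err}"
            continue
        # Execution succeeded: the metric becomes the feedback for the next round.
        feedback = f"score: {score:.4f}"
        if score > best_score:
            best_solution, best_score = solution, score
    return best_solution, best_score
```

The returned best solution is what would then be written to the case library for later reuse.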

In the deployment stage, DS-Agent works more directly and efficiently. It retrieves verified successful cases and reuses them to generate code, without restarting the exploration from scratch. This reduces the demand for computational resources, allowing DS-Agent to respond quickly to user needs, and it significantly lowers the capability requirements on the underlying large model, so high-quality machine learning models can be delivered in a low-resource manner.
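One way to picture this retrieve-and-reuse step is as prompt construction: pick the most similar solved task and show its code to the model as an example. The sketch below is an assumption-laden illustration, not DS-Agent's actual retrieval or prompt format; `cosine`, `build_prompt`, and the prompt layout are hypothetical, and bag-of-words cosine similarity stands in for a real retriever.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity between two task descriptions."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_prompt(task: str, solved_cases: dict, k: int = 1) -> str:
    """Select the k most similar solved tasks and present their code as examples.

    solved_cases maps a task description to the code that solved it.
    """
    ranked = sorted(solved_cases, key=lambda t: cosine(t, task), reverse=True)[:k]
    examples = "\n\n".join(
        f"# Example task: {t}\n{solved_cases[t]}" for t in ranked
    )
    return f"{examples}\n\n# New task: {task}\n# Write the solution code below.\n"
```

Because the model only has to adapt a known-good example in a single pass, no iterative training runs are needed at this stage.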

Experimental Setup

We collected 30 different data science tasks covering three main data modalities (text, tabular, and time series) and two core machine learning problems (classification and regression), designing different evaluation metrics to ensure task diversity.

Development Stage Experimental Results

In the development stage, DS-Agent with GPT-4 achieved a 100% success rate on the data science tasks for the first time. Even with GPT-3.5, DS-Agent had a higher success rate than the strongest baseline, ResearchAgent, with GPT-4.

Moreover, when using GPT-4 and GPT-3.5, DS-Agent achieved first and second place respectively in the test set evaluation metrics, significantly outperforming the strongest baseline ResearchAgent.

Deployment Stage Experimental Results

In the deployment stage, DS-Agent achieved nearly 100% first-time success rate using GPT-4, while the open-source model Mixtral-8x7b-Instruct’s first-time success rate increased from 6.11% to 31.11%.

In the test set evaluation metrics, DS-Agent took first and second place with GPT-4 and GPT-3.5 respectively. Unfortunately, the open-source large model Mixtral-8x7b-Instruct, even with the support of DS-Agent, still did not surpass GPT-3.5.

Finally, we analyzed the API call costs of DS-Agent in the two modes. In the development stage, the cost per call for DS-Agent was $1.60 with GPT-4 and $0.06 with GPT-3.5. In the deployment stage, costs dropped sharply: to just $0.13 per call with GPT-4 and to less than $0.01 with GPT-3.5. This means that the deployment stage achieved over 90% cost savings compared to the development stage.
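The savings figure can be checked directly from the rounded costs quoted above. The helper name `savings` is ours, and because the GPT-3.5 deployment cost is given only as an upper bound (less than 1 cent), the second number is a lower bound on the saving rather than an exact value.

```python
def savings(dev_cost: float, deploy_cost: float) -> float:
    """Fraction of cost saved when moving from development to deployment."""
    return 1.0 - deploy_cost / dev_cost


# GPT-4: $1.60 per call in development, $0.13 in deployment.
gpt4 = savings(1.60, 0.13)

# GPT-3.5: $0.06 in development, less than $0.01 in deployment,
# so using $0.01 gives a lower bound on the saving.
gpt35_lower_bound = savings(0.06, 0.01)

print(f"GPT-4 saving: {gpt4:.1%}")                            # 91.9%
print(f"GPT-3.5 saving: more than {gpt35_lower_bound:.1%}")   # 83.3%
```

With GPT-4 the saving is about 92%. For GPT-3.5, the rounded figures alone guarantee only a saving above 83%; the "over 90%" figure holds for that model if its deployment cost is below $0.006 per call.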

With DS-Agent, even if you do not understand programming or have not studied machine learning, you can tackle complex data analysis challenges: gaining business insights quickly, supporting decisions, optimizing strategies, and predicting future trends, thereby significantly improving the efficiency of enterprise data departments. Imagine marketing personnel describing their needs in natural language and the agent quickly generating user profiles and marketing strategy analyses, or financial analysts saying goodbye to tedious manual modeling and discussing market trends with the agent… All of this may soon become a reality.

Of course, automated data science is still in its infancy, and large-scale applications are yet to come. But the emergence of DS-Agent points to an exciting future. As artificial intelligence continues to evolve, tedious data analysis work may one day be taken over by AI, allowing humans to focus more on insights and innovative decision-making.
