DS-Agent: Case-Based Reasoning for Near 100% Success in Data Science Tasks

MLNLP community is a well-known machine learning and natural language processing community in China and abroad, covering NLP graduate students, university teachers, and industry researchers.

The vision of the community is to promote communication and progress between the academic and industrial circles of natural language processing and machine learning, especially for beginners.

Reprinted from | Machine Heart

In the era of big data, data science covers the entire cycle of extracting insights from data, including key aspects such as data collection, processing, modeling, and prediction. Given the complex nature of data science projects and their deep reliance on human expert knowledge, automation has great potential to change the paradigm of data science. With the rise of generative pre-trained language models, it has become increasingly important to enable large language model agents to handle complex tasks.

Traditional data processing and analysis rely mostly on professional data scientists, which is time-consuming and labor-intensive. If large language model agents could take on the role of data scientists, we could obtain insights and analyses more efficiently and also open up new industrial models and research paradigms.

Given only a description of the data task, agents focused on data science could then autonomously process massive amounts of data and discover the hidden patterns and trends behind it. More broadly, they could provide clear strategies and code for model building, run model deployment and inference on available machines, and finally use data visualization to make complex data relationships clear.

Recently, a team from Jilin University, Shanghai Jiao Tong University, and University College London proposed DS-Agent, an agent cast in the role of a data scientist that aims to automate complex machine learning modeling tasks. Technically, the team adopted a classic artificial intelligence strategy, Case-Based Reasoning (CBR), which gives the agent the ability to “reference” successful past cases so that it can use the experience of solving similar problems to tackle new ones.

Paper link: https://arxiv.org/pdf/2402.17453.pdf
Code link: https://github.com/guosyjlu/DS-Agent
Paper title: DS-Agent: Automated Data Science by Empowering Large Language Models with Case-Based Reasoning

Research Background

In the open-ended decision-making setting of automated data science, existing large model agents (such as AutoGPT, LangChain, and ResearchAgent) struggle to achieve a high success rate even when paired with GPT-4. The main challenge is that these agents cannot reliably generate sound machine learning solutions and are prone to hallucinated outputs. Fine-tuning large models for this specific scenario may seem feasible, but it introduces two new problems: (1) each feedback signal requires actually training a machine learning model, so accumulating enough fine-tuning data takes a long time; (2) fine-tuning requires backpropagation, which adds computational overhead and significantly raises the demand for computational resources.

In this context, the team decided to use Kaggle as a key resource. As the world’s largest data science competition platform, it has a wealth of technical reports and code contributed by an experienced community of data scientists. To enable large model agents to efficiently utilize this expert knowledge, the team adopted a classic artificial intelligence problem-solving paradigm – Case-Based Reasoning.

The core working mechanism of Case-Based Reasoning is to maintain a case library that continuously stores past experiences. When a new problem arises, CBR retrieves similar past cases from the case library and attempts to reuse these cases’ solutions to solve the new problem. Subsequently, CBR evaluates the effectiveness of the solution and revises it based on feedback; successful solutions in this process are added to the case library for future reuse.
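The retrieve, reuse, revise, and retain cycle described above can be sketched in a few lines of Python. This is a minimal illustration, not DS-Agent's implementation: the `Case` and `CaseLibrary` classes, the word-overlap `similarity` function, and the `evaluate` and `revise` callbacks are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    problem: str   # description of a past task
    solution: str  # the solution that worked for it


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets (a stand-in for a real retriever)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


@dataclass
class CaseLibrary:
    cases: list = field(default_factory=list)

    def retrieve(self, problem: str) -> Case:
        # Retrieve: pick the stored case most similar to the new problem.
        return max(self.cases, key=lambda c: similarity(c.problem, problem))

    def retain(self, problem: str, solution: str) -> None:
        # Retain: store a successful solution for future reuse.
        self.cases.append(Case(problem, solution))


def solve(library, problem, evaluate, revise, max_rounds=3):
    """Run one retrieve -> reuse -> evaluate -> revise -> retain cycle."""
    solution = library.retrieve(problem).solution  # reuse the retrieved solution
    for _ in range(max_rounds):
        ok, feedback = evaluate(solution)
        if ok:
            library.retain(problem, solution)
            return solution
        solution = revise(solution, feedback)
    return None  # no acceptable solution within the round budget


# Toy usage with hypothetical cases and callbacks.
library = CaseLibrary([
    Case("classify movie reviews sentiment", "tfidf+logreg"),
    Case("forecast hourly electricity load", "lstm"),
])
evaluate = lambda s: (s.endswith("+tuned"), "tune hyperparameters")
revise = lambda s, feedback: s + "+tuned"
result = solve(library, "classify product reviews sentiment", evaluate, revise)
print(result, len(library.cases))
```

In this toy run the sentiment case is retrieved, revised once after failing evaluation, and then stored back in the library, so later queries can reuse it. Note that `retrieve` assumes a non-empty library.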

Building on this, the team proposed DS-Agent, which uses CBR to enable large model agents to analyze, extract, and reuse human expert insights from Kaggle, iteratively revising solutions based on actual execution feedback, thereby achieving continuous performance improvement for data science tasks.

Framework Details

Overall, DS-Agent implements two modes to adapt to different application stages and resource requirements.

Standard Mode (Development Phase): DS-Agent uses CBR to build an automated iterative process that simulates the continuous exploration process of data scientists when building and adjusting machine learning models, seeking the best solution through constant experimentation and optimization.
Low Resource Mode (Deployment Phase): DS-Agent reuses successful cases accumulated during the development phase to generate code, significantly reducing the demand for computational resources and base model inference capabilities, making it possible to use open-source large models for automated data science tasks.

In the development phase, given a new data science task, DS-Agent first retrieves relevant human expert knowledge from Kaggle and builds a preliminary solution based on it. It then enters an iterative loop in which it writes and debugs the machine learning code to obtain performance metrics on the test set. These metrics serve as the feedback for evaluating and improving the solution: DS-Agent adjusts the model design based on them in search of a better design. Along the way, the best machine learning solutions are stored in the case library, providing references for similar tasks in the future.
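The iterative loop in the development phase can be illustrated roughly as follows. This is a sketch under stated assumptions, not the authors' code: `write_code`, `debug_code`, and `revise_plan` are hypothetical stand-ins for LLM calls, the generated script is assumed to leave its score in a variable named `metric`, and a higher metric is assumed to be better. Running generated code with `exec` is used only to keep the example short; a real system would execute it in an isolated process.

```python
import traceback


def run_script(source: str):
    """Execute candidate code; return (metric, None) or (None, error_text)."""
    scope = {}
    try:
        exec(source, scope)  # illustration only: not safe for untrusted code
        return float(scope["metric"]), None
    except Exception:
        return None, traceback.format_exc(limit=1)


def develop(plan, write_code, debug_code, revise_plan, steps=3, max_debug=2):
    """Iterate plan -> code -> debug -> metric -> revised plan; keep the best run."""
    best = None  # (metric, plan, code)
    for _ in range(steps):
        code = write_code(plan)
        metric, error = run_script(code)
        for _ in range(max_debug):
            if error is None:
                break
            code = debug_code(code, error)  # feed the error back for a fix
            metric, error = run_script(code)
        if error is None and (best is None or metric > best[0]):
            best = (metric, plan, code)
        plan = revise_plan(plan, metric)  # use the metric as feedback
    return best


# Hypothetical stand-ins for the LLM calls.
def write_code(plan):
    # The first draft has a deliberate bug: `base` is undefined.
    return f"metric = base + {plan} * 0.05"


def debug_code(code, error):
    # Pretend the model reads the NameError and defines the missing variable.
    return "base = 0.70\n" + code


def revise_plan(plan, metric):
    return plan + 1


best = develop(0, write_code, debug_code, revise_plan)
print(best)
```

The tuple returned by `develop` is what would be stored in the case library: the best-scoring plan together with the code that realized it.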

In the deployment phase, DS-Agent’s working mode becomes more direct and efficient. It directly retrieves and reuses verified successful cases to generate code, eliminating the need to start exploring from scratch. This reduces the demand for computational resources, allowing DS-Agent to respond quickly to user needs, and it also significantly lowers the requirements on the base model’s capabilities, so that high-quality machine learning models can be delivered in a low-resource manner.
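A rough sketch of this one-pass reuse is shown below: the closest stored case is retrieved and wrapped into a single prompt for code generation, with no iterative loop. The case library contents, the bag-of-words `cosine` similarity, and the prompt wording are all assumptions made for illustration; the article does not specify how DS-Agent ranks cases or phrases its prompts.

```python
import math
from collections import Counter

# Hypothetical case library: task descriptions paired with code that worked before.
CASE_LIBRARY = [
    {"task": "tabular classification of customer churn",
     "code": "model = GradientBoostingClassifier().fit(X_train, y_train)"},
    {"task": "time series regression for electricity load",
     "code": "model = Ridge().fit(lag_features, y_train)"},
]


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words counts of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def build_prompt(task: str, library: list) -> str:
    """Retrieve the closest past case and wrap it into one code-generation prompt."""
    case = max(library, key=lambda c: cosine(task, c["task"]))
    return (
        f"Past task: {case['task']}\n"
        f"Past solution:\n{case['code']}\n\n"
        f"New task: {task}\n"
        "Adapt the past solution to the new task and return complete code."
    )


prompt = build_prompt("binary classification on tabular customer churn data", CASE_LIBRARY)
print(prompt)
```

Because the prompt already carries a verified solution, the model only has to adapt it, which is one plausible reason a weaker base model can still succeed in this mode.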

Experimental Setup

We collected 30 data science tasks covering three main data modalities (text, tabular, and time series) and two core machine learning problems (classification and regression), with a variety of evaluation metrics to ensure task diversity.

Development Phase Experimental Results

In the development phase, DS-Agent achieved a 100% success rate on the data science tasks for the first time when using GPT-4; notably, even with GPT-3.5, DS-Agent achieved a higher success rate than the strongest baseline, ResearchAgent, with GPT-4.

Additionally, DS-Agent ranked first and second on the test set evaluation metrics when using GPT-4 and GPT-3.5, respectively, significantly outperforming the strongest baseline, ResearchAgent.

Deployment Phase Experimental Results

In the deployment phase, DS-Agent achieved a near 100% first-attempt success rate when using GPT-4, while the first-attempt success rate of the open-source model Mixtral-8x7b-Instruct jumped from 6.11% to 31.11%.

On the test set evaluation metrics, DS-Agent ranked first and second with GPT-4 and GPT-3.5, respectively; unfortunately, the open-source model Mixtral-8x7b-Instruct still did not surpass GPT-3.5, even with the support of DS-Agent.

Finally, we analyzed the API call costs of DS-Agent in the two modes. In the development phase, a single run of DS-Agent cost $1.60 with GPT-4 and $0.06 with GPT-3.5. In the deployment phase, costs dropped sharply: a single run cost just $0.13 with GPT-4 and less than one cent with GPT-3.5. In other words, the deployment phase saves over 90% of the cost of the development phase.
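The saving can be checked directly from the quoted figures. The short calculation below uses only the numbers given in the text; because the GPT-3.5 deployment cost is stated only as an upper bound, the script computes the cost threshold at which a 90% saving would hold rather than assuming an exact value.

```python
# Per-run API costs in US dollars, as quoted in the text.
dev_cost = {"GPT-4": 1.60, "GPT-3.5": 0.06}
deploy_cost_gpt4 = 0.13


def saving(before: float, after: float) -> float:
    """Fraction of cost saved when moving from `before` to `after`."""
    return 1.0 - after / before


gpt4_saving = saving(dev_cost["GPT-4"], deploy_cost_gpt4)
# Largest GPT-3.5 deployment cost that still yields a 90% saving.
gpt35_threshold = dev_cost["GPT-3.5"] * (1.0 - 0.90)

print(f"GPT-4 saving: {gpt4_saving:.1%}")
print(f"GPT-3.5 needs a deployment cost below ${gpt35_threshold:.3f} per run")
```

With these figures the GPT-4 saving comes to roughly 92%. For GPT-3.5, a saving above 90% requires a deployment cost below $0.006 per run, which is consistent with, though not implied by, the "less than one cent" figure.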

With DS-Agent, even people who cannot program or have never studied machine learning could tackle complex data analysis challenges, quickly gain business insights, support decisions, optimize strategies, and predict future trends, significantly improving the efficiency of enterprise data departments. Imagine marketers generating user profiles and marketing strategy analyses just by describing their needs in natural language, or financial analysts leaving tedious manual modeling behind and instead discussing market trends with the agent. All of this may soon become a reality. Of course, automated data science is still in its infancy, and large-scale application will take time. Still, the emergence of DS-Agent points to an exciting future. As artificial intelligence continues to develop, the tedious parts of data analysis may one day be taken over by AI, leaving humans more time for insightful thinking and innovative decision-making.

