

Abstract. This paper introduces a modular, Spark-based LangGraph framework aimed at enhancing machine learning workflows through scalability, visualization, and intelligent process optimization. At the core of the framework is an AI agent, a key innovation that leverages Spark's distributed computing capabilities and integrates with LangGraph for workflow orchestration.
The AI agent automates data preprocessing, feature engineering, and model evaluation while interacting dynamically with data through Spark SQL and DataFrame agents. Through LangGraph's graph-structured workflows, agents execute complex tasks, adapt to new inputs, and provide real-time feedback, ensuring seamless decision-making and execution in distributed environments. The system simplifies machine learning by allowing users to design workflows visually; these designs are then converted into Spark-compatible code for high-performance execution.
The framework also incorporates large language models through the LangChain ecosystem, enhancing interaction with unstructured data and enabling advanced data analysis. Experimental evaluations demonstrate significant improvements in process efficiency and scalability, as well as accurate data-driven decision-making in diverse application scenarios.
This paper emphasizes the integration of Spark with intelligent agents and graph-based workflows to redefine the development and execution of machine learning tasks in big data environments, paving the way for scalable and user-friendly AI solutions.
Keywords: Large Language Model · Agent · LangChain · LangGraph · ChatGPT · ERNIE-4 · GLM-4 · Big Data · Machine Learning · Apache Spark · Data Analysis.
Introduction
The development of information technology has brought convenience to daily life and rapidly growing volumes of data. With the maturing of big data analysis technologies, represented by machine learning, big data has had a tremendous effect on social and economic life and provides substantial support for business decision-making. For example, in e-commerce, Taobao recommends suitable goods to users after analyzing large volumes of transaction data; in advertising, online advertising systems predict users' preferences by tracking their clicks, improving the user experience.
However, traditional relational database management systems cannot cope with the characteristics of big data: large volume, diversity, and high dimensionality. To address big data analysis, distributed computing has been widely adopted; Apache Hadoop is one of the most widely used distributed systems of recent years. Hadoop adopts MapReduce as a rigid computing framework, and its emergence promoted the popularity of large-scale data processing platforms. Spark, a big data architecture developed by AMPLab at the University of California, Berkeley, is also widely used. Spark integrates batch analysis, stream analysis, SQL processing, graph analysis, and machine learning applications. Compared with Hadoop, Spark is fast, flexible, and fault-tolerant, making it an ideal choice for running machine learning analysis programs. However, Spark is a tool for developers: it requires analysts to have a certain level of computer skill and to spend considerable time creating, deploying, and maintaining systems.
The results of machine learning depend heavily on data quality and model logic, so this paper designs and implements a Spark-based flow-oriented machine learning analysis tool that lets analysts concentrate on the process itself rather than on compiling, running, and parallelizing the analysis program. In form, each machine learning analysis task is decomposed into stages and composed from components, which reduces the user's learning cost. In implementation, general algorithms are encapsulated into reusable component packages, and training behavior is differentiated by setting parameters, which reduces the time cost of creating machine learning analysis programs. Users can flexibly organize their own analysis processes by dragging and dropping algorithm components, improving the efficiency of application creation and execution.
Spark, as a powerful distributed computing system, offers significant advantages for big data processing and analysis. With the rapid development of large-model technology, combining Spark with large-model agents has become a research hotspot. Large-model agents can perceive the environment, react, form judgments, and execute decisions more accurately. Supported by Spark, they can process large-scale datasets, achieving efficient data analysis and decision-making. LangGraph is an agent workflow framework officially launched by LangChain; it defines workflows as graph structures, supports complex loops and conditional branches, and provides fine-grained agent control. Integrating Spark large-model agents with the tools and workflow management of the LangChain and LangGraph frameworks is driving innovative applications in many fields, including large-model data analysis.
This paper first shows the characteristics of the tool by comparing it with existing products, then describes in detail the system architecture design, the business model with use cases, and the operation of each system module. Finally, it provides a technical summary and an outlook on future research directions for LangGraph-based Spark agents.
Related Technologies
AML Machine Learning Service
Azure Machine Learning (AML) is a web-based machine learning service launched by Microsoft on its public cloud Azure. It contains more than 20 supervised and unsupervised algorithms for classification, regression, and clustering, and the number is still growing. However, AML is built on Hadoop and can only be used within Azure. Unlike AML, the tool in this paper is designed and implemented on Spark and can therefore be deployed on different virtual machines or cloud environments.
Apache Zeppelin is a Spark-based responsive data analysis system whose goal is to build interactive, visualized, and shareable web applications that integrate multiple algorithm libraries. It has become an open-source, notebook-based analysis tool that supports a large number of algorithm libraries and languages. However, Zeppelin does not provide a user-friendly graphical interface, and every analysis requires the user to write and submit scripts, which raises the programming skill requirements for users. The tool described in this paper provides graphical components and a large number of machine learning algorithms, letting users define a machine learning process simply and quickly and run it to obtain results.
Reference [6] introduces Haflow, a big data analysis service platform. The system uses a component-based design that lets users build analysis programs by drag and drop, and it exposes an extension interface through which developers can create custom analysis algorithm components. At present, Haflow only supports MapReduce algorithm components on the Hadoop platform. The tool described in this paper builds on Haflow, adding support for Spark component applications and providing a large number of machine learning algorithms that run in the Spark environment.
With the rapid advancement of artificial intelligence technology, large models have become a central topic in the field of AI. In the realm of data management and analysis, the emergence of large language models (LLMs) such as GPT-4o, Llama 3.2, ERNIE-4, and GLM-4 has opened a new era full of challenges. These models possess powerful semantic understanding and efficient dialogue management capabilities, with a profound impact on data ingestion, data lakes, and data warehouses. Natural language understanding based on large language models simplifies data management tasks and enables advanced analysis of big data, promoting innovation and efficiency in the field.
LangChain is a large-model application development framework designed for creating applications based on Large Language Models (LLMs). It is compatible with a variety of language models and integrates seamlessly with OpenAI ChatGPT. The LangChain framework offers tools and agents that "chain" components such as prompt templates, LLMs, agents, and memory together to create more advanced LLM use cases, helping users work efficiently and generate meaningful responses.
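To make the chaining idea concrete, the following minimal sketch pipes a prompt template into a chat model with LangChain's composition operator; the model name, log text, and task are illustrative assumptions rather than part of the system described here.

```python
# Minimal sketch of LangChain "chaining": a prompt template piped into an LLM.
# Model name and input text are illustrative assumptions.
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_template(
    "Summarize the following Spark job log in one sentence:\n{log}"
)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

chain = prompt | llm  # LCEL: compose components with the | operator
result = chain.invoke({"log": "Stage 3 failed: executor lost ..."})
print(result.content)
```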
LangGraph is a framework within the LangChain ecosystem that allows developers to define cyclic edges and nodes within a graph structure and provides state management capabilities. Each node in the graph makes decisions and performs actions based on the current state, enabling developers to design agents with complex control flows, including single agents, multi-agents, hierarchical structures, and sequential control flows, which can robustly handle complex scenarios in the real world.
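The sketch below illustrates how such a graph might be declared with LangGraph: a state schema, two nodes, and a conditional edge that can loop back to planning. The node logic and state fields are illustrative assumptions.

```python
# Minimal sketch of a LangGraph workflow with a conditional (cyclic) edge.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class AgentState(TypedDict):
    query: str
    attempts: int
    answer: str

def plan(state: AgentState) -> AgentState:
    # Decide what to do next based on the current state.
    return {**state, "attempts": state["attempts"] + 1}

def act(state: AgentState) -> AgentState:
    # Execute the planned action (e.g., run a Spark SQL query).
    return {**state, "answer": f"result for {state['query']}"}

def should_retry(state: AgentState) -> str:
    # Loop back to planning until an answer exists or retries run out.
    return END if state["answer"] or state["attempts"] >= 3 else "plan"

graph = StateGraph(AgentState)
graph.add_node("plan", plan)
graph.add_node("act", act)
graph.add_edge("plan", "act")
graph.add_conditional_edges("act", should_retry)
graph.set_entry_point("plan")

app = graph.compile()
print(app.invoke({"query": "top crimes by area", "attempts": 0, "answer": ""}))
```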
The combination of Spark with LangChain and LangGraph allows data enthusiasts from various backgrounds to engage effectively in data-driven tasks. Through Spark SQL agents and Spark DataFrame agents, users can interact with, explore, and analyze data using natural language.
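As a hedged example of this interaction style, the sketch below wires a Spark DataFrame to LangChain's experimental DataFrame agent; the file path, model, and question are assumptions for illustration.

```python
# Sketch of natural-language data exploration via a Spark DataFrame agent.
from pyspark.sql import SparkSession
from langchain_openai import ChatOpenAI
from langchain_experimental.agents import create_spark_dataframe_agent

spark = SparkSession.builder.appName("nl-analysis").getOrCreate()
df = spark.read.csv("/data/crime.csv", header=True, inferSchema=True)

agent = create_spark_dataframe_agent(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    df=df,
    verbose=True,
    allow_dangerous_code=True,  # required by recent langchain_experimental versions
)
# The agent translates the question into DataFrame operations and runs them.
agent.run("How many records are there for each crime category?")
```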
This paper aims to design a process-oriented machine learning tool for data analysts, so it must implement the functionality of common machine learning processes. Machine learning can be divided into supervised and unsupervised learning, depending mainly on whether specific labels exist. Labels are the prediction targets of the observation data; observation data are the samples used to train and test machine learning models; features are the attributes of the observation data. A machine learning algorithm learns prediction rules from the features of the observation data.
In practice, a machine learning process is composed of a series of stages, including data preprocessing, feature processing, model fitting, and result validation or prediction. For example, classifying a set of text documents involves word segmentation, cleaning, feature extraction, training a classification model, and outputting the classification results.
These stages can be treated as black boxes and packaged into components. Although many algorithm libraries and software packages provide programs for each stage, these programs are rarely designed for large-scale datasets or distributed environments, and they do not naturally support process orientation, requiring developers to connect the stages manually to form a complete process.
Therefore, in addition to providing a large number of machine learning algorithm components, the system also needs to execute processes automatically while taking their operational efficiency into account.
The system provides components to users as its main business functions. Analysts can freely combine the existing components into different analysis processes. To cover commonly used machine learning processes, the system provides the following categories of business modules: an input/output module, a data preprocessing module, a feature processing module, a model fitting module, and a result prediction module. Unlike other systems, this tool defines its business modules around the individual stages of the process.
1. Input/output module. This module implements data collection and writing and handles heterogeneous data sources; it is the starting and ending point of the machine learning process. To handle different types of data, the system provides input and output functionality for structured data (such as CSV), unstructured data (such as TXT), and semi-structured data (such as HTML).
2. Data preprocessing module. This module includes data cleaning, filtering, joining/forking, type conversion, and so on. Data quality determines the upper limit of a machine learning model's accuracy, so the preprocessing stage before feature extraction is essential. This module can clean null values and outliers, change data types, and filter out unqualified data.
3. Feature processing module. Feature processing is the most important step before data modeling, including feature selection and feature extraction. The system currently includes 25 commonly used feature processing algorithms.
Feature selection chooses the most valuable features from the multi-dimensional original features; the selected features form a subset of the original features. According to the algorithm used, selectors include the information gain selector, the chi-square selector, and the Gini coefficient selector.
Feature extraction transforms the features of the observed data into new variables according to a given algorithm. Compared with data preprocessing, the transformation rules are more complex. The extracted features are mappings of the original features and fall into the following categories:
1. Standardization component. Standardization maps the numerical features of data to a unified scale. The standardized features share the same reference frame, making the trained model more accurate and converge faster during training. Different standardization components use different statistics for the mapping, such as the normalization component, the standard scaler component, and the min-max scaler component.
2. Text processing component. Text features cannot be computed directly and must be mapped to new numerical variables. Common algorithms include the TF-IDF component for indexing segmented text, the Tokenizer component for word segmentation, and the One-Hot Encoder component for one-hot encoding.
3. Dimensionality reduction component. These components compress the original feature information through some algorithms and represent it with fewer features, such as Principal Component Analysis (PCA) components.
4. Custom UDF components. Users can input SQL to customize feature processing functions.
5. Model fitting module. Model training applies specific algorithms to learn from the data, and the resulting models can be used for subsequent prediction. The system currently provides a large number of supervised learning model components, divided into classification and regression models according to the nature of the observation data labels.
6. Result prediction module. This module includes two functions: result prediction and validation.
Through the above general business modules, users can create a variety of common machine learning analysis processes in the system environment; a sketch of how several of these modules map onto Spark MLlib follows.
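As a rough illustration, assuming toy data and column names of our own choosing, the pipeline below chains text processing, standardization, and dimensionality reduction components in Spark MLlib.

```python
# Sketch of how several business modules map onto Spark MLlib pipeline stages.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF, IDF, StandardScaler, PCA

spark = SparkSession.builder.appName("feature-demo").getOrCreate()
train_df = spark.createDataFrame(
    [("burglary on 5th street",), ("vehicle theft downtown",)], ["description"]
)

tokenizer = Tokenizer(inputCol="description", outputCol="words")        # text processing
tf = HashingTF(inputCol="words", outputCol="tf", numFeatures=1 << 10)   # text processing
idf = IDF(inputCol="tf", outputCol="tfidf")                             # text processing
scaler = StandardScaler(inputCol="tfidf", outputCol="scaled")           # standardization
pca = PCA(k=2, inputCol="scaled", outputCol="features")                 # dimensionality reduction

model = Pipeline(stages=[tokenizer, tf, idf, scaler, pca]).fit(train_df)
model.transform(train_df).select("features").show(truncate=False)
```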
The system provides its user interface through the web, and the overall architecture is based on the MVC pattern. It also provides the machine learning business modules and the process execution modules. The system architecture is shown in Figure 2.
Users create machine learning processes through the web interface provided by the system and submit them. The system converts the received raw process into a logical flow chart and verifies its validity. Validating the process before execution is a necessary step: when a process contains obvious errors, such as logic or data mismatches, the system can return the error immediately rather than waiting for the corresponding component to fail at runtime, which improves efficiency.
The execution engine is the system's key module, implementing multi-user, multi-task process execution. It translates the validated logical flow chart into a corresponding execution model, a data structure the system can recognize and use to schedule the corresponding business components. The translation of the execution model is a complex process, introduced in detail in Section 4.3.
As shown in Figure 3, the architecture of the agent implementation that combines Apache Spark with LangChain and LangGraph is designed to enhance the level of intelligence in data processing and decision-making. This architectural diagram displays a Spark-based agent system capable of accomplishing complex tasks by integrating various technological components.
LangChain and LangGraph frameworks are introduced on top of Spark. LangChain is a framework for building and deploying large language model applications, enabling agents to interact with users or other systems through natural language. LangGraph provides graph-structured workflows, allowing agents to plan and execute tasks more flexibly. The core of the agent is the LLM (Large Language Model), which is responsible for understanding and generating natural language. The LLM receives instructions (Instruction), observations (Observation), and feedback (Feedback) from LangChain and integrates these with data insights from Spark to form thoughts (Thought). These thoughts are then translated into concrete action plans (Plan and Action), which are executed by the action executor (Action Executor).
The action executor is tightly integrated with Spark, processing structured data through Spark DataFrame Agent and Spark SQL Agent, performing complex data analysis and data processing tasks, and feeding the results back to the LLM. The LLM adjusts its action plans based on this feedback and new observations, forming a closed-loop learning and optimization process that further refines its performance and decision-making quality.
Intermediate data requires different management at different stages of its life cycle. While a component processes its upstream data, that is, during the generation phase of the intermediate data, the system records where the data is generated and passes it to the next component. After the process finishes executing, all intermediate data generated by the process is no longer needed and is deleted by the system. In addition, the intermediate data storage of a single process has a specified upper limit; when too much intermediate data accumulates, the process's resource manager evicts data using the Least Recently Used (LRU) algorithm, preventing memory overflow caused by excessive intermediate data.
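A minimal sketch of this eviction policy is shown below, using an ordered dictionary as the recency queue; the class name, byte-based capacity, and deletion hook are illustrative assumptions rather than the system's actual implementation.

```python
# Sketch of LRU eviction for intermediate data under a per-process size cap.
from collections import OrderedDict

class IntermediateDataManager:
    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()  # path -> size, least recently used first

    def record(self, path: str, size: int) -> None:
        # Register newly generated intermediate data for a component.
        self.entries[path] = size
        self.entries.move_to_end(path)
        self.used += size
        while self.used > self.capacity:
            old_path, old_size = self.entries.popitem(last=False)
            self.used -= old_size
            self._delete(old_path)  # evict the least recently used data

    def touch(self, path: str) -> None:
        # Mark data as recently used when a downstream component reads it.
        self.entries.move_to_end(path)

    def _delete(self, path: str) -> None:
        print(f"evicting intermediate data at {path}")
```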
To ensure the I/O efficiency of intermediate data, Alluxio is used as the intermediate storage layer, holding all intermediate data in memory. Alluxio is a memory-based virtual distributed storage system that greatly accelerates data reading and writing.
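In Spark, routing intermediate results through Alluxio amounts to reading and writing alluxio:// URIs, as in the sketch below; the master host, port, and paths are assumptions, and the Alluxio client must already be configured on the cluster.

```python
# Sketch of persisting intermediate data in Alluxio's memory tier.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("alluxio-io").getOrCreate()
df = spark.range(1000)  # stand-in for a component's output

df.write.mode("overwrite").parquet("alluxio://alluxio-master:19998/intermediate/stage3")
next_input = spark.read.parquet("alluxio://alluxio-master:19998/intermediate/stage3")
```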
Machine Learning Analysis Components Based on Spark MLlib. Section 3.2 described the design of the machine learning modules in detail. These modules implement the main data processing and modeling functions as components. To provide as many algorithm components as possible quickly, only a small portion of the components, such as input/output and data cleaning components, are programmed by hand according to the characteristics of the machine learning process; most component functions are converted automatically into Spark jobs using Spark MLlib, Spark's built-in machine learning algorithm library, which contains a large number of classification, regression, clustering, dimensionality reduction, and other algorithms. For example, to classify with a Random Forest, the system's execution engine instantiates a RandomForestClassifier object with the parameters from the corresponding process node, calls its fit method on the input data to produce a Model object, and then serializes and saves the model through the intermediate data management module for subsequent prediction or validation components to use.
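A minimal sketch of this Random Forest flow in PySpark follows; the parameter values, column names, and save path are assumptions, and train_df stands for the DataFrame produced by upstream components.

```python
# Sketch: instantiate the classifier from node parameters, fit, and persist
# the model for the downstream prediction component.
from pyspark.ml.classification import RandomForestClassifier

params = {"numTrees": 100, "maxDepth": 8}  # taken from the process node (assumed values)
rf = RandomForestClassifier(featuresCol="features", labelCol="label", **params)

# train_df: DataFrame with "features" (vector) and "label" columns from upstream components
model = rf.fit(train_df)
model.write().overwrite().save("alluxio://alluxio-master:19998/models/rf_node_7")
```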
There are two ways to run components in a process. The first is to invoke each component as an independent Spark program, starting a SparkContext for every run. When a Spark program starts, it creates a context environment that determines resource allocation, such as how many threads and how much memory to use, and then schedules tasks accordingly. Since a typical machine learning process is composed of many components, starting and switching contexts would consume a lot of running time. The second way is for all components of a process to share the same context, so the whole process runs as one large Spark program. In this case the system's execution engine must create and manage a context for each process and release the context object to reclaim resources when the process ends.
To achieve context sharing, each component inherits from SparkJobLife or one of its subclasses and implements the methods createInstance and execute. Figure 4 shows the design and inheritance diagram of the component classes, in which Transformers, Models, and Predictors are the parents of the data cleaning and preprocessing components, the learning and training components, and the validation and prediction components, respectively.
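The paper does not list the class's code, but a sketch consistent with the description might look as follows; details beyond the names SparkJobLife, createInstance, and execute are assumptions.

```python
# Sketch of the component base class for context sharing.
from abc import ABC, abstractmethod
from pyspark.sql import SparkSession, DataFrame

class SparkJobLife(ABC):
    def __init__(self, spark: SparkSession, params: dict):
        self.spark = spark    # the shared context for the whole process
        self.params = params  # node parameters from the logical flow chart

    @abstractmethod
    def create_instance(self):
        """Build the underlying MLlib object from the node parameters."""

    @abstractmethod
    def execute(self, inputs: list[DataFrame]) -> DataFrame:
        """Run this component on the outputs of its upstream components."""

class Transformers(SparkJobLife):
    # parent of data cleaning and preprocessing components (Models and
    # Predictors would be defined analogously)
    def create_instance(self): ...
    def execute(self, inputs): ...
```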
After the user designs and submits a machine learning analysis process through the graphical interface, the system creates the logical analysis process. First, it performs a topological analysis of the raw process and generates a logical flow chart expressed as a Directed Acyclic Graph (DAG). The logical flow chart captures the dependencies and parallelism of the components, along with their input/output information and parameters.
After the logical structure of the process is generated, the validity of the overall process is verified. The specific steps are as follows: 1. Check the input, output, and other required parameters of each node in the graph; if any are missing, return an error (for example, a feature processing component requires the user to define input and output columns). 2. Check the integrity of the entire process to ensure there is at least one input component and one output component serving as start and end; otherwise, return an error. 3. Check for cycles or self-loops in the flow chart; if any exist, return an error. 4. Check that each component respects the ordering dependencies of the machine learning process (for example, feature processing must occur before model fitting); if not, return an error. A sketch of these checks is given below.
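The following compact sketch runs the four checks over a DAG; the node encoding, stage names, and error messages are illustrative assumptions.

```python
# Sketch of process validation over a DAG given as a predecessor map.
from graphlib import TopologicalSorter, CycleError

STAGE_ORDER = {"input": 0, "preprocess": 1, "feature": 2, "fit": 3,
               "predict": 4, "output": 5}

def validate(nodes: dict, preds: dict) -> list[str]:
    """nodes: name -> {"stage": ..., "input_cols": [...]}; preds: name -> upstream names."""
    errors = []
    for name, node in nodes.items():                      # check 1: required parameters
        if node["stage"] != "input" and not node.get("input_cols"):
            errors.append(f"{name}: missing input/output column definition")
    stages = {n["stage"] for n in nodes.values()}         # check 2: process integrity
    if "input" not in stages or "output" not in stages:
        errors.append("process needs at least one input and one output component")
    try:                                                  # check 3: cycles and self-loops
        list(TopologicalSorter(preds).static_order())
    except CycleError:
        errors.append("flow chart contains a cycle")
        return errors
    for name, ps in preds.items():                        # check 4: stage ordering
        for p in ps:
            if STAGE_ORDER[nodes[p]["stage"]] > STAGE_ORDER[nodes[name]["stage"]]:
                errors.append(f"{p} -> {name}: violates machine learning stage order")
    return errors
```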
After validation, the flow chart is submitted to the execution engine. First, the system represents the logical flow chart as a directly executable model, a data structure the system can interpret, and then maps it onto Spark MLlib-based machine learning components that can be executed serially or in parallel. This step is called process translation and execution.
MLlib is Spark's built-in distributed machine learning algorithm library, optimized for the parallel storage and computation of large-scale data and models. With Spark MLlib, a large number of efficient component programs can be developed quickly. This section focuses on how the system translates a process into an executable model to speed up the machine learning analysis process.
The previous section introduced the translation method used when multiple join/fork parallel tasks occur in a process. A real machine learning process, however, is not simply serial or parallel but a combination of serial and parallel tasks, making it more complex. The difficulty in converting a complex process for the execution engine lies in executing it with as much parallelism as possible without violating the data dependencies between components. The translation method for composite processes is as follows:
1. The flow chart is traversed in breadth-first order to determine the topological relationships between business components (a sketch of this step follows the list).
2. Subprocesses of the same stage are partitioned according to the stages of data preprocessing, feature processing, model fitting, and prediction.
3. The critical path algorithm is used to determine the execution order within each subprocess, establishing the hierarchical relationships of its branches.
4. The same-level branches obtained in the previous step are optimized according to the algorithm in the previous section.
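The following sketch illustrates the breadth-first layering of step 1: nodes with no remaining upstream dependencies form one level, and each level's components can be submitted to Spark concurrently. The DAG representation is an assumption.

```python
# Sketch of breadth-first layering (Kahn-style) for parallel execution.
def parallel_levels(preds: dict[str, set]) -> list[list[str]]:
    """Group DAG nodes into levels; nodes in the same level share no dependencies."""
    indegree = {n: len(ps) for n, ps in preds.items()}
    succ: dict[str, list] = {n: [] for n in preds}
    for n, ps in preds.items():
        for p in ps:
            succ[p].append(n)
    level = [n for n, d in indegree.items() if d == 0]
    levels = []
    while level:
        levels.append(level)
        nxt = []
        for n in level:
            for s in succ[n]:
                indegree[s] -= 1
                if indegree[s] == 0:
                    nxt.append(s)
        level = nxt
    return levels

# Example: B and C depend on A, D depends on both -> [["A"], ["B", "C"], ["D"]]
print(parallel_levels({"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}))
```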
The implementation of Spark agents based on LangGraph involves the following key steps: 1. Within the LangChain framework, detailed interaction prompts are defined for Spark SQL, clarifying the agent's role and functions. The agent is designed to interact with the Spark SQL database: it receives questions, constructs and executes syntactically correct Spark SQL queries, analyzes the results, and provides accurate answers based on them. To improve efficiency and relevance, query results are limited to at most the top k records and can be sorted by relevant columns to surface representative examples from the database. The agent queries only the columns directly related to the question, uses specific database interaction tools, and relies on the information those tools return to construct its final response. Before executing a query, the agent strictly checks the statement for correctness; if issues arise, it rewrites the query and tries again. Moreover, the agent is strictly prohibited from executing any DML (Data Manipulation Language) statements, such as INSERT, UPDATE, DELETE, or DROP, to preserve the integrity of the database. For questions unrelated to the database, the agent responds clearly with "I don't know".
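A sketch of such a prompt prefix is shown below; the wording paraphrases the constraints above (top-k limits, query checking, no DML, "I don't know") and is abridged for illustration rather than quoted from the system.

```python
# Sketch of an agent prompt prefix encoding the interaction rules above.
SPARK_SQL_PREFIX = """You are an agent designed to interact with Spark SQL.
Given an input question, create a syntactically correct Spark SQL query to run,
then look at the results and return the answer. Unless the user specifies a
number of examples, limit your query to at most {top_k} results, ordered by a
relevant column. Never query all columns; only those relevant to the question.
Double-check your query before executing it; if an error occurs, rewrite the
query and try again. DO NOT make any DML statements (INSERT, UPDATE, DELETE,
DROP, etc.). If the question is unrelated to the database, return "I don't know".
"""
```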
2. An agent executor instance is constructed by instantiating the Spark SQL toolkit and passing the llm and toolkit as parameters to the create_spark_sql_agent method. This instance integrates four types of Spark SQL tools: QuerySparkSQLTool, InfoSparkSQLTool, ListSparkSQLTool, and QueryCheckerTool, which give the AI agent the ability to interact efficiently with the Spark SQL database.
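A minimal sketch of this construction with LangChain's community toolkit follows; the schema name, model, and question are assumptions. The toolkit bundles the four tools listed above.

```python
# Sketch of building and invoking a Spark SQL agent executor.
from pyspark.sql import SparkSession
from langchain_openai import ChatOpenAI
from langchain_community.utilities.spark_sql import SparkSQL
from langchain_community.agent_toolkits import SparkSQLToolkit, create_spark_sql_agent

spark = SparkSession.builder.appName("sql-agent").enableHiveSupport().getOrCreate()
spark_sql = SparkSQL(schema="crime_db")  # wraps the session's catalog (assumed schema)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

toolkit = SparkSQLToolkit(db=spark_sql, llm=llm)
agent_executor = create_spark_sql_agent(llm=llm, toolkit=toolkit, verbose=True)

agent_executor.run("Which crime category occurred most often in 2014?")
```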
3. The agent executor's calling method is used to respond to user questions. In the LangChain framework, the agent performs tasks through three core components: Thought, Action, and Observation. First, upon receiving input, the agent performs in-depth analysis and reasoning to determine the best course of action. Next, it executes specific operations based on the result of that reasoning. Finally, it evaluates the outcome of its actions and records the observation as input for the next round of thinking. Through this iterative three-step loop, the agent can dynamically handle complex tasks and continuously optimize its behavior toward its goals.
Experimental Analysis
At present, the system is still at the prototype stage. To test its functionality, this paper deploys a pseudo-distributed environment on a machine with a four-core processor, 8 GB of memory, and a 64-bit Ubuntu system.
The experimental data come from a public Kaggle dataset of Los Angeles crime records from 2003 to 2015; the task is to model the crime category. To simplify the process description, three original features are selected, and a common machine learning analysis method is used to create the process. The characteristics of the features and labels are shown in Table 1. In summary, the features and labels are mainly character strings, which require data preprocessing to extract features and map them to numerical values.
To convert the original features into numerical feature vectors that the training model can compute, a series of data preprocessing tasks must be carried out. Table 2 illustrates each feature processing method. All parameters are set to their defaults, and any changes are noted.
The features obtained after preprocessing are merged into feature vectors by Join components. After TF-IDF, the feature vectors are high-dimensional but sparse, so a ChiSqSelector is used to keep the 100 features with the highest chi-square scores for model fitting. Logistic Regression with L-BFGS is used to fit the multi-class model; the trained model is then used to predict the test data, and the output is saved as a CSV file. The interface of this analysis process after creation in the system is shown in Fig. 6. A sketch of the equivalent Spark MLlib pipeline follows.
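In the sketch below, the column names are assumptions, train_df and test_df stand for the preprocessed datasets, and spark.ml's LogisticRegression is optimized with L-BFGS internally.

```python
# Sketch of the experimental analysis pipeline described above.
from pyspark.ml import Pipeline
from pyspark.ml.feature import (Tokenizer, HashingTF, IDF, StringIndexer,
                                VectorAssembler, ChiSqSelector)
from pyspark.ml.classification import LogisticRegression

stages = [
    Tokenizer(inputCol="description", outputCol="words"),
    HashingTF(inputCol="words", outputCol="tf"),
    IDF(inputCol="tf", outputCol="tfidf"),                        # TF-IDF features
    StringIndexer(inputCol="category", outputCol="label"),        # label encoding
    VectorAssembler(inputCols=["tfidf"], outputCol="assembled"),  # the Join step
    ChiSqSelector(numTopFeatures=100, featuresCol="assembled",
                  labelCol="label", outputCol="selected"),        # top-100 chi-square features
    LogisticRegression(featuresCol="selected", labelCol="label",
                       family="multinomial"),                     # solved with L-BFGS internally
]

model = Pipeline(stages=stages).fit(train_df)   # train_df from upstream preprocessing
predictions = model.transform(test_df)
predictions.select("prediction", "label") \
    .write.mode("overwrite").csv("/output/predictions", header=True)
```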
Comparing the predicted values of the test data with the actual labels yields an accuracy of about 72.54%. Adding more features to the process increases the model's complexity and also raises its accuracy. With this system, machine learning processes can be created easily and quickly, letting users focus on improving the analysis method itself.
Section 4 introduced the parallel execution optimization of processes. To test the effectiveness of this optimization, the experimental data are randomly sampled into ten groups containing 10%, 20%, 30%, ..., 100% of the data, and the analysis process is executed separately on each group with and without optimization. "No optimization" refers to executing the components of the process sequentially; the running time of each process is recorded in ms.
The results show that as the data volume grows linearly, the execution time of the non-optimized process increases faster, with the growth rate rising in the later stages, whereas the execution time of the optimized scheme grows relatively slowly. This demonstrates the effectiveness of the system's optimization scheme.
The Spark agent based on LangGraph provides a powerful tool for big data analysis systems, which not only simplifies the data analysis process but also enhances the scalability of the system. With the continuous advancement of Spark agent technology, it will play an even more critical role in the fields of data analysis and machine learning in the future.