Research on the Construction and Application of Teaching Intelligent Agents Based on Large Models

1. Introduction
With the rapid evolution of generative artificial intelligence, multimodal large models increasingly demonstrate their advantages in multimodal content understanding and generation. Multimodal large models (hereinafter referred to as “large models”) refer to artificial intelligence models capable of processing and understanding various modalities of data inputs, including text, images, and audio-visual content. Artificial intelligence models represented by GPT-4 belong to the category of multimodal large models. Large models typically possess extremely large-scale parameters, supporting reasoning and decision-making through methods such as prompt engineering and fine-tuning, and exhibit outstanding performance in various tasks such as natural language processing and audio-visual analysis. To further unleash the application potential of large models, researchers in the field of artificial intelligence have begun to attempt to construct intelligent agents based on large models. Intelligent agents, also known as Autonomous Agents, refer to adaptive systems that can perceive the environment and react to it to achieve their own goals [1]. Since the 20th century, the design and implementation of intelligent agents have become one of the main objectives of research in the field of artificial intelligence, but this research has long been limited by the intelligence level of core models. The emergence of large models provides a feasible technological pathway for the construction and realization of intelligent agents, and the interaction capability of agents with external environments can also facilitate the adaptation of large models to downstream tasks. Therefore, the construction of intelligent agents based on large models is receiving increasing attention in the field of artificial intelligence and various vertical domains [2], but there is still a lack of systematic design and comprehensive discussion in the educational field.
In the field of education, research on intelligent agents by scholars in China and abroad is still in its infancy. Early studies focused on the construction of intelligent agents aimed at optimizing knowledge sharing and organization management models [3], and some research attempted to design educational intelligent agents that can interact with learners to achieve student-centered personalized online learning [4]. In recent years, researchers have conducted studies on intelligent agents from the perspectives of cognitive learning mechanisms [5] and social cue design [6], emphasizing the high-level meaning-making and interactive innovation of learning subjects [7]. As generative artificial intelligence technologies such as large models mature, researchers have begun to construct general intelligent agents based on large language models (LLMs), for example using them as user interaction assistants for mathematical formula formatting [8]. Additionally, researchers have attempted to use multiple general intelligent agents to simulate classroom interactions between teachers and students, with experimental results indicating a high similarity to real classroom environments in dimensions such as teaching methods, curriculum, and student performance [9].
Current research in the educational field mostly remains at the stage of attempts to apply general intelligent agents in education, lacking clear definitions and systematic architectural designs. Therefore, this study proposes the concept of “teaching intelligent agents,” aiming to fully utilize the autonomous adaptation capabilities of generative artificial intelligence in environmental perception, reasoning and planning, learning improvement, and decision-making in various educational scenarios such as classroom teaching, after-school tutoring, teacher training, and home-school cooperation, to provide intelligent services for teaching and learning to all stakeholders in education. This study will conduct research on the design and implementation pathways of teaching intelligent agents based on the latest technological advancements in the field of generative artificial intelligence, combined with the practical needs of the current educational field.
2. Intelligent Agents Based on Large Models
(1) Basic Concepts
In the field of artificial intelligence, agents need to be able to perceive their external environment and take actions to influence it. Typically, the perception and action between agents and the external environment continuously cycle, forming close interactions to achieve specific task objectives. Figure 1 illustrates the basic framework of intelligent agents based on large models, where the large model serves as the foundational support and core of the agent. For a given task, the agent first collects information through environmental perception capabilities to understand the dynamically changing external environment. Subsequently, the agent can utilize reasoning and planning methods to solve specific problems within the given task step by step based on logical thinking. During this process, a memory retrieval mechanism supports the storage and retrieval of past experiences, which helps improve the quality of reasoning and planning. At the same time, the agent can also leverage massive data to find objective patterns for learning improvements, thereby enhancing its performance and storing it in memory. On this basis, the agent weighs the pros and cons of actions in reasoning and planning, makes reasonable decisions, and selects execution tools, transforming decisions into actual actions and impacts on the external environment to achieve task objectives. Compared to traditional intelligent agents in the field of artificial intelligence, intelligent agents based on large models have several core capabilities.
(2) Core Capabilities
1. Multimodal Perception and Generation
Multimodal perception and generation integrate various data channels such as vision, speech, and text to achieve external information acquisition in either isolated or combined ways, thus endowing the agent with the ability to understand and analyze the environment it is in. Relying on visual and linguistic understanding and generation capabilities, agents can perform various image-text perception and generation tasks such as Visual Question Answering (VQA) and Text-to-Image Generation [10]. For example, the visual question answering task requires the agent to answer scene understanding questions such as “counting,” “attribute recognition,” and “object detection” based on image information. On this basis, the external knowledge-based visual question answering task (Outside Knowledge VQA) requires the agent to answer questions that cannot be understood solely through image information and encourages it to generate answers by retrieving from external knowledge bases in combination with image understanding [11]. Currently, the multimodal perception and generation capabilities of agents have expanded into the video domain, supporting the interpretation and construction of continuous image frames, thereby accurately answering questions about video content or generating logically coherent high-quality videos. Furthermore, for embodied operational physical robots or virtual environment characters, multimodal perception can support real-time information updates for environmental interaction, assisting agents in completing complex tasks such as task planning, path planning, reasoning, and decision-making.
2. Retrieval-Augmented Generation
To alleviate issues such as the “hallucination” of large models generating factually incorrect information and the untimely updating of data, Retrieval-Augmented Generation (RAG) technology has gradually been adopted widely. RAG technology can provide large models with reliable external knowledge relevant to the problems to be solved based on a knowledge base retriever. This technology mainly consists of three basic steps: index establishment, question retrieval, and content generation [12]. The index establishment step requires constructing a knowledge base based on specified task information, where the types of knowledge resources can include documents, web pages, knowledge graphs, etc., and utilizing methods such as word embedding from language models to extract the semantic features of resources, which are then stored in the knowledge base either online or offline. The question retrieval step treats the user’s question content as a retrieval request, extracts its text semantic features, and retrieves matching data content from the knowledge base as auxiliary information for the question. In the content generation step, the large model generates answers to the question based on the user’s inquiry information and the retrieved auxiliary information using methods such as data augmentation or constructing attention mechanisms [13]. The Carnegie Mellon University team provided an interface document query function for intelligent agents based on the RAG concept. By querying the interface document of a liquid processor, the intelligent agent can achieve precise control over the liquid processor using a UV-visible spectrometer and Python computing tools based on the user’s textual instructions, which can be used for the automatic design, planning, and execution of scientific experiments [14].
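The three steps described above (index establishment, question retrieval, content generation) can be illustrated with a minimal, self-contained sketch. This is not the implementation used in the cited work: the bag-of-words "embedding," the sample knowledge base, and names such as `retrieve` and `generate_answer` are illustrative assumptions standing in for language-model embeddings and a real large model.

```python
import math
from collections import Counter

def embed(text):
    # Toy bag-of-words "embedding"; real RAG systems use language-model embeddings.
    return Counter(text.lower().replace("?", " ").replace(".", " ").split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Step 1: index establishment -- embed each knowledge resource and store it.
knowledge_base = [
    "The liquid handler exposes move_to and dispense commands.",
    "UV-visible spectrometers measure absorbance between 190 and 900 nm.",
]
index = [(doc, embed(doc)) for doc in knowledge_base]

def retrieve(question, k=1):
    # Step 2: question retrieval -- rank resources by semantic similarity to the query.
    q = embed(question)
    ranked = sorted(index, key=lambda d: cosine(q, d[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

def generate_answer(question, llm):
    # Step 3: content generation -- augment the model's prompt with retrieved context.
    context = "\n".join(retrieve(question))
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    return llm(prompt)

# Stub "LLM" that simply echoes the first retrieved context line.
answer = generate_answer("What absorbance range do UV-visible spectrometers measure?",
                         llm=lambda p: p.splitlines()[1])
```

In a production system the retriever would index documents offline with dense embeddings and the final call would go to an actual large model; the control flow, however, follows the same three steps.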
3. Reasoning and Planning
The task planning of the agent aims to decompose complex tasks into multiple executable subtasks, relying primarily on the reasoning capabilities of large models. By constructing prompt information that stimulates these reasoning capabilities, the agent can autonomously complete multi-step reasoning, breaking complex tasks down into executable subtasks and thereby autonomously planning the specific behavior sequences needed to achieve task objectives. Various effective prompt construction methods have been validated, including single-path reasoning methods such as Chain of Thought (CoT), multi-path reasoning methods such as Self-Consistency Chain of Thought (CoT-SC) and Tree of Thought (ToT), and planning methods that integrate behavioral reflection, such as ReAct.
Chain of Thought [15], one of the earliest proposed reasoning methods, stimulates large models to think in multiple steps during reasoning, guiding the model to execute a specific task along a single path. Building on this, Self-Consistency Chain of Thought [16] accounts for the randomness of large-model outputs by repeatedly executing single-path reasoning to construct multiple parallel chains of thought, ultimately selecting the most consistent result as the answer. Tree of Thought [17] further refines the reasoning process and combines the advantages of the two preceding methods by constructing a tree-structured problem-solving scheme: each path in the tree represents a solution, and each node an intermediate step. Tree of Thought decomposes the intermediate steps of task reasoning according to the attributes of the problem, keeping each step small and relatively reliable, and generates candidate next steps at the current node. By setting an independent quality-assessment mechanism for each node, or a voting mechanism across multiple nodes, it provides a heuristic basis for search algorithms to select promising problem-solving paths; depending on the tree structure, breadth-first or depth-first search is then employed to select the best path. The ReAct [18] planning method further integrates feedback from task execution into the reasoning process, enabling the agent to interact autonomously with external resources, update its task planning, and handle anomalies. In a concrete planning process, ReAct cycles through three basic behaviors (Thought, Action, and Observation of action results) to produce a final solution containing multiple rounds of reasoning steps.
The ReAct method has demonstrated good performance in various language and decision-making tasks, including question answering, fact verification, text games, and web navigation, with the generated task trajectories exhibiting interpretability and credibility.
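The Thought-Action-Observation cycle described above can be sketched as a simple control loop. The scripted "model" and the `calculator` tool below are toy stand-ins (assumptions for illustration, not the API of any framework named in this article): a real agent would call a large model at each step and parse its free-text output.

```python
# Minimal ReAct-style loop: the model proposes an Action, the agent executes
# the named tool, and the tool's Observation is appended to the transcript
# before the next model call, until a final answer is emitted.
def calculator(expression):
    return str(eval(expression))  # toy tool; never eval untrusted input in practice

TOOLS = {"calculator": calculator}

def react(task, model, max_steps=5):
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        step = model(transcript)            # model proposes the next step
        transcript += step + "\n"
        if step.startswith("Final Answer:"):
            return step.removeprefix("Final Answer:").strip()
        if step.startswith("Action:"):
            name, arg = step.removeprefix("Action:").strip().split(" ", 1)
            observation = TOOLS[name](arg)  # execute the tool, feed the result back
            transcript += f"Observation: {observation}\n"
    return None

# Scripted stand-in for a large model: acts once, then answers from the observation.
def scripted_model(transcript):
    if "Observation:" not in transcript:
        return "Action: calculator 6*7"
    return "Final Answer: 42"

result = react("What is 6*7?", scripted_model)
```

The interpretability noted above comes directly from this transcript: every intermediate Thought, Action, and Observation is recorded in plain text and can be audited after the fact.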
4. Interaction and Evolution
Agents can collaborate and complete complex tasks through interactions with external environments, humans, and other intelligent agents, thereby achieving autonomous evolution. In interactions with external environments, agents can autonomously set tasks and explore the external world based on their own data, available tools, inventory resources, and environmental descriptions. During exploration, agents can build skill libraries that accumulate skills validated through code-level self-verification mechanisms, achieving complex and meaningful autonomous evolutionary behaviors. For example, Voyager, a large model-driven intelligent agent, has achieved autonomous world exploration and skill evolution in a sandbox survival game [19].
In interactions with humans, agents support users in customizing the integration of various large language models and tools, engaging in autonomous planning and human-machine collaboration to effectively solve tasks across multiple fields, including programming, mathematics, experimental operations, online decision-making, and Q&A. For instance, agents can solve complex number theory problems through automatic problem-solving, and when an answer is incorrect, they can obtain human feedback through human-machine collaboration to improve and correct errors in automatic answers [20]. Agents can complete tasks such as organic synthesis, drug discovery, and materials design based on user input instructions, achieving the automatic synthesis of a mosquito repellent and three organic catalysts during testing, while guiding humans to discover new colorants through human-machine collaboration [21].
Different intelligent agents can also interact and collaborate to advance the resolution of complex tasks and self-evolution. For example, by constructing a Plan, Execute, Inspect, and Learn (PEIL) guiding mechanism, multi-agent task planning and tool execution, visual perception and memory management, proactive learning, and solution optimization can be realized, demonstrating outstanding performance in visual question answering and reasoning tasks [22]. Meanwhile, debate-style interactions among multiple agents can enhance their problem-solving capabilities for complex reasoning issues and have shown effectiveness in common-sense machine translation and counterintuitive arithmetic reasoning tasks [23]. Additionally, agents can simulate and realize specific business processes and task objectives through role division and collaboration among multiple agents. For example, the MetaGPT multi-agent collaboration framework can assign role divisions (such as product managers and software development engineers) to multiple agents and set workflows, achieving automated software development processes through the introduction of human work mechanisms to facilitate the linking of multi-role agents and the transfer of data among them [24].
Furthermore, the memory mechanism of agents plays a crucial role in interaction and evolution, supporting the review of interaction history, knowledge acquisition, experience reflection, skill accumulation, and self-evolution. The Generative Agents proposed by Stanford University built a virtual town scene in a sandbox game engine, enabling dynamic behavior planning of virtual individuals and simulating credible human behavior [25]. This generative agent constructed a memory flow mechanism that can store perceived virtual environmental information and individual experiences in the memory flow. Agents can make behavioral decisions based on individual memories, and this can also be used to form long-term behavioral planning and high-level reflection, providing memory reserves for subsequent behavioral decision-making. For example, in the behavioral decision of whether to attend a party, the agent first retrieves relevant memory records from the memory flow, calculates the comprehensive score of each memory based on its timeliness, relevance, and importance to the decision task, and ranks the memories, with the top-ranked memories serving as decision-making references to be incorporated into the prompt information to assist its behavioral decision-making.
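The memory-stream ranking described in the party example can be sketched as a weighted score over recency, relevance, and importance. The exponential decay constant, equal weights, and keyword-based relevance function below are illustrative assumptions, not the exact formulation of the cited work, which scores relevance with embedding similarity.

```python
# Each memory receives a combined score of recency (exponential decay since
# last access), relevance to the current decision, and intrinsic importance;
# the top-ranked memories are folded into the prompt as decision references.
def recency(hours_since_access, decay=0.995):
    return decay ** hours_since_access

def score(memory, now, query_relevance, w=(1.0, 1.0, 1.0)):
    r = recency(now - memory["last_access"])
    return w[0] * r + w[1] * query_relevance(memory["text"]) + w[2] * memory["importance"]

def top_memories(memories, now, query_relevance, k=2):
    ranked = sorted(memories, key=lambda m: score(m, now, query_relevance), reverse=True)
    return [m["text"] for m in ranked[:k]]

memories = [
    {"text": "Isabella invited me to her party.", "last_access": 95, "importance": 0.8},
    {"text": "I ate breakfast this morning.", "last_access": 99, "importance": 0.1},
    {"text": "I enjoy talking with Isabella.", "last_access": 50, "importance": 0.5},
]
# Toy relevance function; a real agent would use embedding similarity here.
relevance = lambda text: 1.0 if "party" in text or "Isabella" in text else 0.0
refs = top_memories(memories, now=100, query_relevance=relevance)
```

Note how the scoring balances the three factors: the breakfast memory is the most recent but is neither relevant nor important, so it is outranked by both party-related memories.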
(3) Implementation Methods
Agents take large models as the core controller, emphasizing the dynamic interaction between agents and information, the integration of reasoning and planning capabilities, the establishment of memory and reflection mechanisms, the realization of tool usage and task execution capabilities, and the continuous evolution of capabilities during interactions with the external environment. These characteristics collectively endow agents with high-level information understanding and processing capabilities, making their decision-making approaches closer to humans and demonstrating an understanding and handling of complex situations [26]. To support the practical implementation of intelligent agents based on large models, several engineering frameworks have been developed and open-sourced, such as LangChain [27] and Auto-GPT [28] for single-agent implementation, and AutoGen [29], BabyAGI [30], and CAMEL [31] for multi-agent collaboration. These frameworks provide important resources for researchers and developers, facilitating the development and testing of multi-scenario applications of intelligent agents.
Among the aforementioned implementation frameworks, LangChain and AutoGen are widely used single-agent and multi-agent frameworks, respectively. LangChain provides a structured application process for large models, facilitating the engineering implementation of agents, with its technical components including model I/O, retrieval, agents, chains, and memory. LangChain supports rich tool and toolkit calls and can realize multiple core capabilities of agents, including retrieval-augmented generation and ReAct planning. AutoGen, on the other hand, allows users to flexibly define interaction patterns and human-machine collaboration modes between multiple agents according to their needs, such as a dynamic group discussion mode where one agent leads and humans participate in multi-agent interactions, or a collaborative coding mode where two agents are responsible for coding and debugging, respectively. AutoGen supports the interactive memory reading and writing of multiple agents, can utilize third-party Python toolkits for tool usage (such as calling Matplotlib for mathematical plotting), and supports converting tasks into code-based solutions, for example executing tasks step by step through code and ensuring successful program execution through inter-agent code execution and debugging.
Both LangChain and AutoGen provide feasible solutions for agent implementation and can leverage each other’s advantages through complementary use. For instance, AutoGen can be used to flexibly construct and implement the interaction framework and machine language-based task execution of agents, while LangChain can assist in connecting external rich tool libraries (such as ArXiv, Office365, Wolfram Alpha, etc.) and custom tools (by providing descriptions of tool functions, implementation codes, input-output formats, etc.) to expand the capability boundaries of agents.
3. Construction of Teaching Intelligent Agents Based on Large Models
Based on the rapid evolution and development of large models in the current educational field [32], this study proposes to construct teaching intelligent agents based on large models. As shown in Figure 2, teaching intelligent agents center around the “large model,” with their main functional modules including “educational task setting,” “educational task planning,” “implementation and expansion of educational capabilities,” and “educational content memory and reflection.” At the same time, teaching intelligent agents support interaction with multiple types of objects and achieve dynamic evolution, covering human-machine interaction, multi-agent interaction, and environmental interaction.
(1) Educational Task Setting
The “educational task setting” module encompasses the provision of key information such as educational scenario setting, educational demand setting, and educational role setting. Among them, the setting of educational scenarios provides contextual information for the agent’s educational tasks, such as project-based learning scenarios centered around students, online self-directed learning scenarios, and traditional classroom teaching scenarios; the educational demand setting provides specific goal descriptions for educational tasks, such as providing strategic scaffolding for project-based learning, evaluating learners’ problem-solving abilities, and coordinating group collaborative learning; the educational role setting endows the agent with specific role information to be played in educational tasks, such as teaching assistants, learning partners, training assistants, and home assistants. The setting of educational roles helps the agent interact more effectively with educational users, providing personalized interactive experiences and enhancing support effectiveness. Multiple teaching intelligent agents can also collaborate by playing different educational roles, fulfilling key teaching needs in specific educational scenarios through division of labor, debate, and human-machine collaboration.
(2) Educational Task Planning
Based on the established educational task information, teaching intelligent agents can achieve autonomous task planning, with the basic steps, in sequence, being “task scheme thinking,” “scheme decomposition planning,” and “execution result perception.” First, the “task scheme thinking” step reasons out and generates an overall plan based on the established key information such as educational scenarios, demands, and roles, combined with relevant educational standards or frameworks, educational resources, and auxiliary tools. The “scheme decomposition planning” step then breaks the overall plan down into multiple executable and manageable subtasks, including planning specific teaching activities, teaching resources, teaching tools, and educational evaluation methods; teaching intelligent agents can also adjust the subtasks in real time based on feedback from teachers or learners to ensure the adaptability and effectiveness of the planning. After the planned subtasks have been executed, the “execution result perception” step acquires execution results and multidimensional interaction information. By introducing an evaluation feedback mechanism, the agent can autonomously reason about, or receive human evaluations of, the quality of subtask completion based on task execution results. If problems are identified or planning goals are not achieved, the agent restarts the “task scheme thinking” step, repeating the loop until the goals are met. Through this planning process, the agent can iteratively optimize its execution process and strategies for educational tasks to meet the need for efficient, personalized education.
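The think-decompose-execute-evaluate loop above can be sketched as a re-planning cycle in which evaluation feedback is folded back into the task description. All callables below (`think`, `decompose`, `execute`, `evaluate`) are toy stand-ins for large-model calls; the subtask names are illustrative.

```python
# Sketch of the educational task planning cycle: generate an overall plan,
# decompose it into subtasks, execute them, evaluate the results, and
# restart planning with feedback until the goal is judged to be met.
def plan_and_execute(task, think, decompose, execute, evaluate, max_rounds=3):
    for round_no in range(1, max_rounds + 1):
        overall_plan = think(task)                 # "task scheme thinking"
        subtasks = decompose(overall_plan)         # "scheme decomposition planning"
        results = [execute(s) for s in subtasks]   # run each executable subtask
        ok, feedback = evaluate(task, results)     # "execution result perception"
        if ok:
            return {"rounds": round_no, "results": results}
        task = f"{task} (feedback: {feedback})"    # fold feedback into re-planning
    return None

# Toy stand-ins: the evaluator rejects the first round to demonstrate the loop.
state = {"attempts": 0}
def evaluate(task, results):
    state["attempts"] += 1
    return (state["attempts"] >= 2, "add an assessment rubric")

outcome = plan_and_execute(
    task="design a project-based lesson on waste classification",
    think=lambda t: f"plan for: {t}",
    decompose=lambda p: ["select resources", "design activities", "define assessment"],
    execute=lambda s: f"done: {s}",
    evaluate=evaluate,
)
```

The `max_rounds` bound is a practical safeguard: without it, an evaluator that never accepts would loop indefinitely, which matters when each round consumes model calls.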
(3) Implementation and Expansion of Educational Capabilities
Teaching intelligent agents can realize and expand multiple basic capabilities to execute the specific educational tasks planned. First, teaching intelligent agents can call external professional teaching tools and their operating environments, including but not limited to mathematical calculation tools, educational software, and collaborative learning tools. For example, teaching intelligent agents can call the mathematical calculation tool Wolfram Alpha [33] to answer precise calculation problems required across multiple disciplines through interactions based on natural language or mathematical formulas. These external tools can provide teaching intelligent agents with professional capabilities that the large model does not possess, thereby assisting them in solving professional problems within the planned educational tasks.
Teaching intelligent agents can also avoid outputting erroneous educational information and expand their knowledge and capability boundaries through methods such as retrieval-augmented generation. Providing educational services typically requires high accuracy and interpretability, thus necessitating real-time updates and reliable information sources for the agents. For example, agents can obtain and integrate the latest educational resources and real-time educational data from channels such as national educational resource public service platforms, professional educational academic journals, and educational news websites, achieving retrieval-augmented generation of educational content and clearly explaining the basis for the content provided, ensuring the timeliness and accuracy of the educational services offered.
Additionally, after perceiving and understanding the elements of educational scenarios, teaching intelligent agents can automatically generate educational content and products in various forms, including teaching text dialogues and audio-visual teaching resources, providing full-process support for the established educational roles. For instance, when the execution of educational tasks involves programming and logical reasoning, the agent can leverage the code generation and debugging capabilities of large models to translate tasks into programming languages such as Python and assist learners in completing programming tasks. For educational tasks requiring embodied operations, teaching intelligent agents can automatically generate operational procedures and control hardware and software in real time based on physical environment perception capabilities and user instructions.
(4) Educational Content Memory and Reflection
The educational content memory of teaching intelligent agents is primarily used to store and retrieve important data during the planning and execution processes of educational tasks, supporting the agents’ self-reflection. Specifically, educational content memory can store foundational data from all steps of educational task planning and execution, such as educational task solution data, interaction Q&A data between agents and learners, the processes and results of external tool calls, and hardware and software control and operation data. Based on the stored foundational data, agents can reflect and process educational knowledge and experiences into higher-level information through self-questioning or summarization methods of large models. For example, teaching intelligent agents can reflect on the personalized characteristics of learners they serve and the effectiveness of their teaching interactions. Combined with trial-and-error mechanisms or interaction feedback, teaching intelligent agents can summarize failed or inefficient teaching experiences, using them as references for autonomous optimization and improvement of teaching strategies when encountering similar educational tasks again. Moreover, the rich educational memories and reflections stored by the agents can serve as important reference knowledge and resources to support the expansion of their teaching capabilities.
Based on permission dimensions, educational content memory can be divided into public memory and private memory. Public memory refers to the educational knowledge and teaching resources accumulated by teaching intelligent agents, including subject knowledge graphs, teaching method knowledge, curriculum standards, teaching materials, and auxiliary materials; private memory refers to individual information closely related to educational users and their roles, such as historical interaction and learning evaluation data of individual learners, teaching videos, teaching plans, and teaching evaluation data of individual teachers. Teaching intelligent agents need to reasonably utilize memory data with different permissions, respect the privacy of educational users, and establish corresponding educational data usage norms.
(5) Interactive Collaboration and Dynamic Evolution
Teaching intelligent agents can achieve collaborative planning and execution of educational tasks through interactions with educational users in different roles, other agents, and educational environments, promoting their dynamic evolution. In interactions with educational users, teaching intelligent agents can fully understand the intentions of users in different roles, thereby providing human-machine interaction services in various forms and modalities. For instance, in online self-directed learning scenarios, they can provide scaffolded intelligent guidance, supporting the recommendation of multimodal teaching content including text, video, and audio resources, and providing real-time progress evaluation and feedback. In interactions with other agents, agents can implement interaction modes such as supervision and guidance, discussion and exchange, division of labor and collaboration, and even structured debate, based on the role-playing of different agents and the educational tasks at hand. For example, debate-style interactions among multiple agents can achieve scientific decomposition and reasonable planning of complex educational tasks. Additionally, human participants can be introduced into the multi-agent interaction process to achieve educational goals in a human-machine collaboration mode. For example, during collaborative exam paper compilation, based on the subject, knowledge points, and differentiation requirements provided by teachers, different agents can serve as question setters, test takers, and reviewers to complete the exam paper, which is ultimately reviewed by the teacher for quality assurance.
In interactions with educational environments, agents can fully utilize external hardware and software tools and their human-machine interaction capabilities based on a comprehensive perception and understanding of educational scenarios and environmental elements to achieve embodied operations and human-machine collaboration. For example, agents can collaborate with learners to complete complex experimental operations or scientific inquiry processes in subjects such as physics and chemistry through real-time perception of experimental instrument statuses and precise control of robotic arms.
Through continuous collection and analysis of process and result data and feedback information during interactions with educational users, other agents, and educational environments, teaching intelligent agents can gradually form educational experiences and reflective knowledge. These experiences and knowledge can be stored and retrieved in the agents’ memories and used for future educational task planning and execution of educational capabilities, thereby achieving dynamic enhancement and evolution of their problem-solving abilities. For instance, by reflecting on information from human-machine interaction processes and summarizing scientific instrument control procedures, teaching intelligent agents can more efficiently plan scientific experimental steps and provide real-time scaffolding for learners’ experimental operations and collaborative scientific inquiry services.
4. Applications of Teaching Intelligent Agents Based on Large Models
Based on the framework proposed above, this study uses project-based learning scenarios as an example to elaborate on the applications of teaching intelligent agents. Project-based learning is an effective teaching model for cultivating students’ core competencies and higher-order abilities [34]. In a typical project-based learning process, learners often require continuous support from teachers and peers to complete project outputs. Teaching intelligent agents can set project-based learning tasks, conduct specific task planning for project outputs, support memory and reflection related to project-based learning content, and provide capabilities such as multi-modal project resource generation, retrieval-augmented generation scaffolding for learning, high-quality code generation and feedback, while also supporting human-machine interaction and multi-agent interaction modes. As shown in Figure 3, teaching intelligent agents can assume two educational roles: “teaching assistant agent” and “peer agent,” with different task settings, expansion capabilities, and individual memories in different project phases, thereby demonstrating capability and functional differences to provide learners with various interactive supports. We take the interdisciplinary theme of “waste classification” commonly found in information technology or artificial intelligence courses as an example to illustrate the role of teaching intelligent agents in various stages of project-based learning.
(1) Proposing Personalized Driving Questions
Project-based learning requires proposing driving questions grounded in real situations, enabling learners to genuinely feel the urgency and feasibility of solving the problem, thereby stimulating their intrinsic motivation for in-depth exploration and project completion. Therefore, in the question-proposal phase, the “teaching assistant agent” can first establish a driving-question guidance framework based on the preset learning scenario. Building on this, the “teaching assistant agent” can engage in multi-modal online discussions with learners and, based on their characteristics and learning intentions, adopt personalized dialogue paths and interaction strategies to ultimately guide learners to propose the project’s driving question on their own. The “teaching assistant agent” can realize this function with the intelligent agent module (Agents Module) of the aforementioned LangChain open-source framework: the preset guidance framework is set as the target question for each round of dialogue, and the learner is treated as a required consulting “tool” in each round of task planning, so that the agent actively questions and discusses with learners.
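The round-by-round guidance logic just described can be sketched without any framework. The staged framework below and its keyword triggers are illustrative assumptions; in practice each round's question would be produced by the large model through the LangChain Agents Module rather than chosen from a fixed list.

```python
# Each stage: (stage goal, guiding question, keywords signalling the stage is met).
GUIDANCE_FRAMEWORK = [
    ("perceive the problem",
     "What environmental problems do these images show?",
     {"garbage", "waste", "pollution"}),
    ("feel the urgency",
     "Why does this problem need to be solved now?",
     {"health", "ocean", "harm"}),
    ("propose a driving question",
     "What project could you undertake to help?",
     {"classify", "promote", "build"}),
]

def next_guiding_question(learner_reply: str, stage: int) -> tuple[str, int]:
    """Advance to the next stage once the learner's reply hits the stage's keywords."""
    _, _, done_words = GUIDANCE_FRAMEWORK[stage]
    if any(w in learner_reply.lower() for w in done_words) and stage + 1 < len(GUIDANCE_FRAMEWORK):
        stage += 1
    return GUIDANCE_FRAMEWORK[stage][1], stage

# The learner's reply mentions "garbage", so the dialogue moves to the urgency stage.
question, stage = next_guiding_question("The ocean is full of plastic garbage.", 0)
```

The fixed question list stands in for the "target question of each round"; a deployed agent would generate each question on the fly while keeping the same stage-advancement structure.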
For instance, regarding the environmental theme of “waste classification,” the “teaching assistant agent” can create realistic scenarios for learners, using the image generation capability of the ERNIE-ViLG multimodal large model [35] to illustrate scenes such as an “ocean garbage vortex” and “non-biodegradable plastic waste.” Simultaneously, the “teaching assistant agent” can use the large model’s dialogue capabilities to discuss the urgency of waste management with learners online. Drawing on learners’ specific feedback, it can then propose possible steps and methods for waste management, guiding students to think autonomously and settle on concrete project activities, such as “how to promote the concept of waste classification” or “how to create a smart trash can for waste classification.”
(2) Collaborative Design of Project Solutions
To solve the personalized driving questions proposed by learners, teaching intelligent agents can build dynamic discussion groups between learners and agents, helping learners determine specific solutions based on their educational task planning capabilities and decompose and plan those solutions. Group discussions can adopt either an “agent-led” or “learner-led” mode depending on project goals and learners’ preferences. In the “agent-led” mode, teaching intelligent agents can use the aforementioned AutoGen open-source framework to construct multiple “peer agents,” simulating the different roles of human group members in the project-based learning process and enabling multi-role interaction between human learners and multiple “peer agents.” In this process, the “teaching assistant agent” is primarily responsible for selecting the speaker for each round (either the human learner or a “peer agent”) based on the group dialogue history and project goals, and broadcasting the spoken content to all group members, thus achieving collaborative design of the project implementation plan through multiple rounds of speaking and information transmission. In the “learner-led” mode, learners can directly choose to engage in dialogue with different “peer agents.”
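The "agent-led" select-and-broadcast round can be sketched as follows. In practice AutoGen's GroupChat/GroupChatManager implements this loop with a large model choosing the speaker; here the selection rule (matching the latest message's topic to an agent's expertise keywords) and the agent names are illustrative stand-ins.

```python
# Hypothetical peer agents and their expertise keywords.
AGENTS = {
    "poster_peer": {"poster", "design"},
    "web_peer": {"web", "website", "html"},
    "rules_peer": {"rules", "classification", "standards"},
}

def select_speaker(history: list[str]) -> str:
    """Teaching-assistant role: pick the peer whose expertise best matches the latest message."""
    last = set(history[-1].lower().split())
    return max(AGENTS, key=lambda name: len(last & AGENTS[name]))

def broadcast(history: list[str], speaker: str, message: str) -> None:
    """Append the selected speaker's turn so every group member sees it."""
    history.append(f"{speaker}: {message}")

history = ["learner: I would prefer a website to promote the idea"]
speaker = select_speaker(history)  # the web-focused peer is chosen
broadcast(history, speaker, "Let's start with the site structure.")
```

Each round repeats the same two steps; swapping the keyword rule for an LLM call yields the AutoGen-style manager described in the text.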
Specifically, in the “agent-led” mode, to solve the driving question of “how to promote the concept of waste classification,” teaching intelligent agents can first utilize their task planning capabilities to decompose the solution into multiple executable subtasks, such as “understanding waste classification rules,” “collecting typical examples of various types of waste,” and “creating promotional materials and carriers.” Based on the planned subtasks, teaching intelligent agents can construct multiple “peer agents” to discuss specific project solutions with learners, providing strategic scaffolding, understanding learners’ opinions in real time, and giving timely feedback, gradually guiding learners to collaboratively complete the design of the project solution. For instance, regarding the subtask of “creating promotional materials and carriers,” multiple “peer agents” in the group discussion can propose solutions in various promotional forms such as posters, web pages, or WeChat mini-programs. If learners propose a web-based format based on their interests and expertise, the “teaching assistant agent” will select a “peer agent” with relevant capabilities to speak, focusing the discussion on the design of the “waste classification promotional website.” Subsequently, the “teaching assistant agent” can broadcast the proposals obtained within the group and select other “peer agents” to refine them, for example by first clarifying the rules of “waste classification” and displaying them prominently on the website.
(3) Collaborative Completion of Project Outputs
Based on the designed project solution, teaching intelligent agents can construct corresponding “peer agents” to collaboratively complete the production of project outputs. This first requires collecting relevant materials and information. For instance, in the subtask of “understanding waste classification rules,” learners need to gather the latest waste classification standards in their locality. Since these standards vary worldwide and are subject to change, the “teaching assistant agent” can employ retrieval-augmented generation (RAG) to provide learners with accurate content. As shown in Figure 4, the “peer agents” can utilize functions provided by the LangChain framework to implement the RAG process. First, in the index establishment step, the agents automatically crawl or manually filter resources from government environmental departments’ official websites, using LangChain’s Document Loaders to collect reliable information and its Text Splitter to segment long texts into semantically related short passages; the large model then extracts feature vectors for these passages, which are stored in the Chroma vector database to construct a feature retrieval knowledge base for “waste classification standards.” Subsequently, in the question retrieval step, using a retrieval-based question-answering method (Retrieval QA), the agents extract the feature vector of the user’s question and retrieve the information most relevant to the inquiry from the vector database based on feature similarity. Finally, in the content generation step, the retrieved information and the user’s inquiry are inserted into a prompt template to construct the complete prompt, allowing the large model to generate the latest and accurate waste classification rules.
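The index → retrieve → prompt → generate pipeline above can be followed end to end in a runnable toy. A real deployment would use LangChain's Document Loaders, Text Splitter, a Chroma vector store, and Retrieval QA; in this sketch retrieval is plain word overlap and "generation" stops at filling the prompt template, so no LLM or embedding model is required.

```python
# Stand-in for text collected from official waste-classification sources.
source_text = (
    "Recyclable waste includes paper, glass, and metal. "
    "Hazardous waste includes batteries and expired medicine. "
    "Kitchen waste includes food scraps and vegetable peels."
)

# 1. Index establishment: segment the collected text into short chunks.
chunks = [s.strip().rstrip(".") + "." for s in source_text.split(". ") if s.strip()]

# 2. Question retrieval: rank chunks by word overlap with the query
#    (a toy substitute for feature-vector similarity search in Chroma).
def retrieve(query: str, k: int = 1) -> list[str]:
    q = set(query.lower().split())
    return sorted(chunks,
                  key=lambda c: len(q & set(c.lower().rstrip(".").split())),
                  reverse=True)[:k]

# 3. Content generation: fill retrieved context and the query into a prompt template.
PROMPT = "Answer using only this context:\n{context}\n\nQuestion: {question}"

def build_prompt(question: str) -> str:
    return PROMPT.format(context="\n".join(retrieve(question)), question=question)

# The hazardous-waste chunk is retrieved into the prompt for the model to answer from.
prompt = build_prompt("Which category do batteries belong to?")
```

Passing `prompt` to a large model completes the final generation step described in the text.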
Once the relevant materials have been collected, the “teaching assistant agent” can further assist learners in creating the “waste classification promotional website.” In this process, learners can communicate with the agent in multi-modal ways, showing hand-drawn drafts of the website’s front-end design or describing their back-end design concepts in text. The “teaching assistant agent” can call multiple external web development language libraries to automatically generate the corresponding website code. Meanwhile, the teaching intelligent agent can use the code execution environment embedded in the AutoGen framework to run the generated code directly and feed execution results and error messages back to the “teaching assistant agent,” guiding it to further modify and improve the code automatically. Learners can also give feedback on the generated pages through screenshots or natural language, allowing the “teaching assistant agent” to adjust and optimize the website accordingly.
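The execute-and-feed-back-errors loop that AutoGen's embedded execution environment provides can be sketched with the standard library alone. The "regenerate" step is a placeholder here: the list of candidate code strings stands in for successive revisions that the large model would produce from the error messages.

```python
import io, contextlib, traceback

def run_generated_code(code: str) -> tuple[bool, str]:
    """Execute candidate code, returning (success, captured output or traceback)."""
    buf = io.StringIO()
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, {})
        return True, buf.getvalue()
    except Exception:
        return False, traceback.format_exc()

def refine_until_ok(candidates: list[str]) -> str:
    """Try each candidate (standing in for successive LLM revisions) until one runs."""
    feedback = ""
    for code in candidates:
        ok, feedback = run_generated_code(code)
        if ok:
            return feedback
    raise RuntimeError(f"all candidates failed; last error:\n{feedback}")

# The first "generation" has a NameError; the revision fixes it and runs cleanly.
output = refine_until_ok([
    "print(title)",
    "title = 'Waste Sorting'\nprint(title)",
])
```

In AutoGen the same loop runs inside a sandboxed executor, with the traceback passed back to the model as the next conversation turn.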
(4) Multi-role Evaluation of Project Outputs
During the project output presentation and evaluation phase, the “teaching assistant agent” and “peer agents” can evaluate the project outputs from their respective perspectives, providing teacher-style and peer-style evaluations. The agents generate corresponding process and result evaluation rubrics in advance, based on the personalized driving questions and project solutions. During the learners’ presentation of the collaboratively produced outputs, the “teaching assistant agent” and “peer agents” can evaluate the presentation using the process information stored in their memory modules, including each learner’s contributions during group discussions and website production. For instance, regarding the “waste classification promotional website,” the agents can provide objective evaluations based on those contributions. Additionally, the agents can utilize their environmental interaction capabilities to access the website, conducting interactive tests and quantitative statistics on its design, such as the number of elements on each page, color choices, and multimedia usage, to derive a result evaluation of the project output. Furthermore, the agents can take learners’ presentations as video input, evaluating aspects such as fluency of speech, logical coherence, and completeness of content.
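The quantitative page statistics mentioned above can be gathered with the standard library alone. A deployed agent would fetch the live site; here the HTML is inlined, and the chosen metrics (total element count, multimedia tag count) are illustrative assumptions rather than a prescribed rubric.

```python
from html.parser import HTMLParser
from collections import Counter

class PageStats(HTMLParser):
    """Count every element on the page, distinguishing multimedia tags."""
    MULTIMEDIA = {"img", "video", "audio"}

    def __init__(self):
        super().__init__()
        self.tags = Counter()

    def handle_starttag(self, tag, attrs):
        self.tags[tag] += 1

# Stand-in for the learners' "waste classification promotional website".
html = """
<html><body>
  <h1>Waste Classification</h1>
  <img src="rules.png"><img src="bins.png">
  <video src="howto.mp4"></video>
  <p>Sort your waste into four categories.</p>
</body></html>
"""

stats = PageStats()
stats.feed(html)
total_elements = sum(stats.tags.values())
multimedia_count = sum(stats.tags[t] for t in PageStats.MULTIMEDIA)
```

Counts like these can then be scored against the rubric the agents generated for the result evaluation.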
Based on the evaluation information from this round of project outputs, the “teaching assistant agent” and “peer agents” can reflect and ask questions based on their respective memories regarding learners’ knowledge mastery, skill acquisition, and interaction effectiveness, promoting the simultaneous enhancement of their educational task planning, teaching, and interaction capabilities. Thus, in the next round of project-based learning, the agents can more effectively conduct project-based learning under the same theme for new groups of learners, achieving the evolution of the agents’ educational capabilities.
5. Summary and Outlook
Teaching intelligent agents based on large models represent one of the important future research directions and application forms of generative artificial intelligence in education, as well as a core technological pathway for realizing human-machine collaborative models in the educational domain. The teaching intelligent agent architecture proposed in this study centers on large models and their various capabilities, combining the multi-scenario needs and multi-role service characteristics of the educational field, aiming to inspire and assist the design and realization of future highly intelligent educational systems. On this basis, the roles, functions, and collaborative practice paths of teaching intelligent agents in project-based learning scenarios have been discussed in detail. Research on teaching intelligent agents is still in its infancy, and this paper proposes the following outlook for their future development:
1. The design and development of teaching intelligent agents require urgent attention to ensure that the educational field can fully utilize cutting-edge technologies such as generative artificial intelligence to rapidly enhance the intelligence and interactivity of various educational products and services. The application of multi-agent technology should be emphasized, utilizing agents to simulate and play different key educational roles to achieve efficient teaching interaction processes through various modes such as “discussion-practice-reflection.” At the same time, the respective advantages of both agents and humans should be fully leveraged to achieve a more reasonable and effective human-machine collaborative educational model.
2. Compared to general-purpose or other vertical domain intelligent agents, the construction of educational agents has unique special needs and characteristics that require careful consideration of the complexity of educational scenarios and teaching subjects, designing proprietary educational large models and their core educational capabilities. Educational large models need to deeply understand educational resources, teaching subjects, and teaching processes, supported by relevant educational theories and learning science theories.
3. The design of teaching intelligent agents needs to fully consider their impact on learners’ values and ethical concepts, ensuring that the behaviors of the agents align with social moral standards and educational objectives. During the execution of educational tasks, teaching intelligent agents need to possess the ability for continuous learning and self-optimization, continually accumulating experiences through interactions with educational stakeholders to enhance the reliability and credibility of their educational services, providing inclusive educational resources and teaching strategies, and avoiding biases and discrimination.
References:
[1] Franklin S,Graesser A.Is it an Agent,or just a Program?:A Taxonomy for Autonomous Agents [A].International Workshop on Agent Theories, Architectures, and Languages [C].Berlin,Heidelberg:Springer Berlin Heidelberg,1996.21-35.
[2] Xi Z,Chen W,Guo X,et al.The Rise and Potential of Large Language Model Based Agents:A Survey [DB/OL].https://arxiv.org/abs/2309.07864,2023-09-19.
[3] Soller A,Busetta P.An Intelligent Agent Architecture for Facilitating Knowledge Sharing Communication [A].Rosenschein S J,Wooldridge M.Proceedings of the Workshop on Humans and Multi-Agent Systems at the 2nd International Joint Conference on Autonomous Agents and Multi-Agent System [C].New York:Association for Computing Machinery,2003.94-100.
[4] Woolf B P.Building Intelligent Interactive Tutors: Student-centered Strategies for Revolutionizing e-Learning [M].Burlington:Morgan Kaufmann,2010.
[5] Liu Qingtang, Ba Shen et al. A Review of the Mechanism of Educational Agents on Cognitive Learning [J]. Distance Education Journal, 2019, 37(5):35-44.
[6] Liu Qingtang, Ba Shen et al. Research on the Design of Social Cues for Educational Agents in Video Courses [J]. Research on Educational Technology, 2020, 41(9):55-60.
[7] Liu Sannuy, Peng Zhan et al. Intelligent Education from the Perspective of New Data Elements: Models, Paths, and Challenges [J]. Research on Educational Technology, 2021, 42(9):5-11+19.
[8] Swan M., Kido T., Roland E., Santos R.P.D. Math Agents: Computational Infrastructure, Mathematical Embedding, and Genomics [DB/OL].https://arxiv.org/abs/2307.02502,2023-07-04.
[9] Jinxin S, Jiabao Z, et al. CGMI: Configurable General Multi-agent Interaction Framework [DB/OL].https://arxiv.org/abs/2308.12503,2023-08-28.
[10] Durante Z, Huang Q, et al. Agent AI: Surveying the Horizons of Multimodal Interaction [DB/OL].https://arxiv.org/abs/2401.03568,2024-01-25.
[11] Marino K, Rastegari M, et al. Ok-vqa: A Visual Question Answering Benchmark Requiring External Knowledge [A]. Robert S.and Florian K..Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition [C].Piscataway:IEEE Computer Society,2019.3195-3204.
[12] Gao Y, Xiong Y, et al. Retrieval-augmented Generation for Large Language Models: A Survey [DB/OL].https://arxiv.org/abs/2312.10997,2024-01-05.
[13] Li H, Su Y, et al. A Survey on Retrieval-augmented Text Generation [DB/OL].https://arxiv.org/abs/2202.01110,2022-02-13.
[14] Boiko D A, MacKnight R, Gomes G. Emergent Autonomous Scientific Research Capabilities of Large Language Models [DB/OL].https://arxiv.org/abs/2304.05332,2023-04-11.
[15] Wei J, Wang X, et al. Chain-of-thought Prompting Elicits Reasoning in Large Language Models [J]. Advances in Neural Information Processing Systems, 2022, 35:24824-24837.
[16] Wang X, Wei J, et al. Self-consistency Improves Chain of Thought Reasoning in Language Models [DB/OL].https://arxiv.org/abs/2203.11171,2023-03-07.
[17] Yao S, Yu D, et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models [J]. Advances in Neural Information Processing Systems, 2024, 36:1-11.
[18] Yao S, Zhao J, et al. ReAct: Synergizing Reasoning and Acting in Language Models [DB/OL].https://arxiv.org/abs/2210.03629,2023-03-10.
[19] Wang G, Xie Y, et al. Voyager: An Open-Ended Embodied Agent with Large Language Models [A]. Colas C, Teodorescu L, Ady N, Sancaktar C, Chu J. Intrinsically-Motivated and Open-Ended Learning Workshop@ NeurIPS2023 [C]. Cambridge, MA: MIT Press, 2023.
[20] Wu Q, Bansal G, et al. Autogen: Enabling Next-gen LLM Applications via Multi-agent Conversation Framework [DB/OL].https://arxiv.org/abs/2308.08155,2023-10-03.
[21] Bran A M, Cox S, et al. ChemCrow: Augmenting Large-language Models with Chemistry Tools [DB/OL].https://arxiv.org/abs/2304.05376,2023-10-02.
[22] Gao D, Ji L, et al. AssistGPT: A General Multi-modal Assistant that Can Plan, Execute, Inspect, and Learn [DB/OL].https://arxiv.org/abs/2306.08640,2023-06-28.
[23] Liang T, He Z, et al. Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [DB/OL].https://arxiv.org/abs/2305.19118,2023-05-30.
[24] Hong S, Zheng X, et al. Metagpt: Meta Programming for Multi-agent Collaborative Framework [DB/OL].https://arxiv.org/abs/2308.00352,2023-11-06.
[25] Park J S, O’Brien J, et al. Generative Agents: Interactive Simulacra of Human Behavior [A]. Follmer S, Han J, Steimle J, Riche N H. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology [C]. New York: Association for Computing Machinery, 2023.1-22.
[26] Wang L, Ma C, et al. A Survey on Large Language Model Based Autonomous Agents [J]. Frontiers of Computer Science, 2024, 18(6):1-26.
[27] LangChain. LangChain [EB/OL].https://python.langchain.com/docs/get_started/introduction,2023-11-12.
[28] Auto-GPT. Auto-GPT [EB/OL].https://docs.agpt.co/,2023-12-29.
[29] AutoGen. AutoGen [EB/OL].https://microsoft.github.io/autogen/,2023-12-28.
[30] BabyAGI. BabyAGI [DB/OL].https://github.com/yoheinakajima/babyagi,2023-12-28.
[31] Li G, Hammoud H, et al. Camel: Communicative Agents for “mind” Exploration of Large Language Model Society [J]. Advances in Neural Information Processing Systems, 2024, 36:1-34.
[32] Lu Yu, Yu Jinglei et al. Research and Outlook on the Educational Applications of Multimodal Large Models [J]. Research on Educational Technology, 2023, 44(6):38-44.
[33] Wolfram. WolframAlpha [EB/OL].https://www.wolframalpha.com/,2023-11-11.
[34] Ma Ning, Guo Jiahui et al. Evidence-based Project-based Learning Model and System under Big Data Background [J]. China Distance Education, 2022,(2):75-82.
[35] Zhang Z, Han X, et al. ERNIE: Enhanced Language Representation with Informative Entities [A]. Korhonen A, Traum D, Màrquez L. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics [C]. Stroudsburg: Association for Computational Linguistics, 2019.1441-1451.
Author Biography:
Lu Yu: Associate professor, doctoral supervisor, research direction in artificial intelligence and its educational applications.
Yu Jinglei: PhD candidate, research direction in artificial intelligence and its educational applications.
Chen Penghe: Lecturer, PhD, research direction in artificial intelligence and its educational applications.