LLMs之Agent: Translation and Commentary on "Agent S: An Open Agentic Framework that Uses Computers Like a Human"

Overview: Agent S is a pioneering open agentic framework that automates graphical user interface (GUI) tasks by emulating how humans use computers. It tackles three major challenges facing current agents, namely acquiring domain knowledge, planning over long horizons, and navigating dynamic interfaces, by introducing "experience-augmented hierarchical planning", which fuses online web knowledge with the agent's internal narrative and episodic memory to drive efficient task decomposition and subtask execution. At the framework's core is an innovative Agent-Computer Interface (ACI), a language-centric abstraction layer for multimodal large language models (MLLMs) that ensures precise perception and a constrained action space, greatly improving the agent's grounding ability, safety, and operational efficiency. Agent S also achieves continual learning and self-improvement through self-supervised exploration and continual memory updates. Experiments show that Agent S delivers a significant performance gain on the OSWorld benchmark and generalizes well to other operating systems such as those in WindowsAgentArena. This work sets a new technical benchmark for GUI agents and offers valuable insights for future zero-shot, agentic learning methods and for optimizing agents' time efficiency and model scale.

Background and pain points:
● Acquiring and updating domain knowledge: Facing constantly evolving applications and websites, agents need specialized, up-to-date domain knowledge and the ability to learn continually from open-world experience.
● Long-horizon, multi-step task planning: Complex desktop tasks typically involve long-horizon, multi-step plans whose actions are interdependent and must be executed in a specific order. The agent must create a clear plan, set intermediate subgoals, and track task progress.
● Navigating dynamic, non-uniform interfaces: GUI agents must process large volumes of visual and textual information and operate within a vast action space. This requires distinguishing relevant from irrelevant elements, accurately interpreting graphical cues, and responding to visual feedback during task execution.
● Limitations of MLLM agents interacting with GUIs: Existing computer interfaces are designed for human users or software programs and are ill-suited to MLLM agents controlling GUIs at the keyboard-and-mouse level. MLLM agents respond slowly and discretely, lack an internal coordinate system, and cannot effectively process fine-grained feedback after every small mouse movement or keystroke.

Proposed solutions:
● The Agent S framework: an open agentic framework that interacts with computers autonomously through the graphical user interface (GUI), aimed at automating complex, multi-step tasks.
● Experience-augmented hierarchical planning: a planning method that learns from external knowledge search and internal experience retrieval, enabling efficient task planning and subtask execution.
● Agent-Computer Interface (ACI): a language-centric abstraction layer that improves the grounding ability, safety, and efficiency of GUI agents based on multimodal large language models (MLLMs).
● Memory construction and continual update: initial memory is bootstrapped through self-supervised exploration, and narrative and episodic memory are updated continually as the agent interacts with new tasks, enabling continual improvement.

Core approach and steps:
(1) Experience-augmented hierarchical planning:
● Manager: fuses external knowledge and internal experience for planning.
●● Observation and query formulation: receives the user task and the initial environment observation (a screenshot with an annotated accessibility tree) and generates a query in the form "How to do X".
●● Knowledge retrieval: uses the query to run an online web search via the Perplexica search engine (external knowledge) and to retrieve summaries of similar task experience from the Manager's narrative memory (internal experience).
●● Knowledge fusion: merges the retrieved external web knowledge and internal narrative-memory experience into a single unified guideline.
●● Subtask planning: uses the fused knowledge to produce a detailed, topologically sorted queue of subtasks, each with its associated context.
● Worker: learns from subtask experience and trajectory reflection.
●● Subtask experience retrieval: combines the user task, the current subtask, and its context into a query to retrieve detailed, step-by-step subtask experience from the Worker's episodic memory.
●● Trajectory reflection: each Worker is paired with a trajectory reflector that observes the Worker's full subtask execution and offers reflective advice, helping the agent consider alternative strategies and avoid repeated actions.
●● Action generation: using the retrieved subtask experience and the reflection, the action generator produces a structured response (a status check of the previous action, an analysis of the observation, a semantic next action, and a grounded next action), finally outputting a concrete grounded action.
●● Subtask completion signal: when the Worker judges a subtask complete, it emits a "DONE" signal; on failure it emits "FAIL", whereupon the hierarchy is reset and the Manager replans from the intermediate environment configuration.
● Self-Evaluator: summarizes experience into textual rewards.
●● Generating textual rewards: produces experience summaries for the Manager and Worker modules as textual rewards.
●● Episodic experience update: when a Worker emits "DONE" (subtask succeeded), the evaluator observes the entire subtask trajectory, generates a summary of the strategy used to complete the subtask, and feeds it back into the Worker's episodic memory.
●● Narrative experience update: when the whole user task finishes (either all subtasks succeed or the maximum step limit is reached), the evaluator generates a summary of the full task trajectory and saves it into the Manager's narrative memory.
(2) Memory construction and update:
● Initial memory construction (self-supervised exploration): narrative and episodic memory are bootstrapped through self-supervised exploration on randomly generated tasks, including environment-independent tasks (general application tasks) and environment-aware tasks (generated based on the specific environment).
● Continual memory update: the agent keeps updating narrative and episodic memory as it interacts with new tasks, so it continues to learn at inference time and adapts to new, more complex tasks.
(3) Agent-Computer Interface (ACI):
● Perception and grounding: a dual-input strategy is used. Image input serves to understand salient details of the environment (e.g., pop-ups, button states), while accessibility-tree input serves to reason about the next step and precisely ground specific UI elements. An OCR module processes the image and adds text blocks to the accessibility tree as interactable UI elements to strengthen grounding.
● Bounded action space with concurrent feedback: defines a bounded space of language primitives (e.g., click, type, hotkey, drag_and_drop); the agent references elements by tag ID. Only one discrete action may be executed at a time, ensuring the agent can observe the environment's immediate feedback.

Advantages:
● Significant performance gains: on the OSWorld benchmark, Agent S improves the success rate over the baseline by 9.37% (an 83.6% relative improvement), reaching a new state of the art, with especially strong results in the "Daily" and "Professional" task categories.
● Broad generalizability: on the WindowsAgentArena benchmark, Agent S improves performance from 13.3% to 18.2% without any explicit adaptation, demonstrating broad generalizability across operating systems.
● Enhanced domain-knowledge acquisition: the experiential learning process, which includes online web knowledge search plus narrative- and episodic-memory retrieval, substantially strengthens the GUI agent's domain knowledge and adaptability.
● Eliciting MLLM reasoning: by providing precise perception and a constrained action space, the ACI effectively elicits better reasoning from the MLLM, improving the agent's efficiency and reliability.
● Support for long-horizon workflows: hierarchical planning is essential for modeling and executing long-horizon, multi-step tasks; combined with experiential learning, the Manager generates more detailed and accurate plans.
● Efficient memory mechanisms: self-supervised exploration, continual memory updates, and the Self-Evaluator's summarization of experience into textual rewards are crucial to memory construction and refinement; experiments show that storing summarized experience outperforms storing raw trajectories.

Conclusions and future outlook:
Conclusions:
● Agent S is a novel framework for developing fully autonomous graphical user interface (GUI) agents that execute diverse user queries by directly controlling the keyboard and mouse.
● It highlights the benefits of experiential learning for task-oriented GUI agents, demonstrating the potential of MLLM agents to learn from external sources and through direct interaction with the environment, without human or environmental feedback.
● It proposes the concept of an Agent-Computer Interface (ACI) for the GUI domain, advocating an abstraction layer that lets MLLM agents perceive and reason at the language level with rich, continuous feedback.
● Through experience-augmented hierarchical planning, online web knowledge, and the ACI, Agent S achieves state-of-the-art performance on the OSWorld benchmark and generalizes across operating systems.
● The work opens a new chapter on zero-shot, agentic learning methods for GUI agents.
Future outlook:
● Time efficiency and accuracy: future work can examine the number of agent steps and the wall-clock time needed for task completion, framing GUI control as a shortest-path navigation problem and evaluating agents' Pareto-optimality along the dimensions of time and accuracy.
● Applicability to smaller open-source MLLMs: extending the ideas of experiential learning and the ACI to smaller, open-source MLLMs, fine-tuned to close the performance gap with current state-of-the-art models such as GPT-4o.

Contents:
Translation and Commentary on "Agent S: An Open Agentic Framework that Uses Computers Like a Human"
Abstract
1 Introduction
6 Conclusion

"Agent S: An Open Agentic Framework that Uses Computers Like a Human"
Link: https://arxiv.org/abs/2410.08164
Date: October 10, 2024
Authors: Simular Research

Abstract
We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at this https URL.

Figure 1: Agent S uses a computer like a human to solve diverse desktop tasks on different systems.

"The digital revolution is far more significant than the invention of writing or even of printing."
— Douglas Engelbart, the inventor of the computer mouse

1 Introduction
Since its invention, the mouse has been controlled by humans for interacting with computers. But does it really have to be? Autonomous Graphical User Interface (GUI) agents offer the promise of solving very specific and highly varied user queries—such as data entry, scheduling, and document creation for individual users, and streamlining operations in commercial settings—in the most general way: through direct UI interaction using the mouse and keyboard. Moreover, by eliminating the need for constant manual interaction, these agents not only boost efficiency but also improve accessibility, empowering individuals with disabilities to interact with technology in new, transformative ways. Recent advancements in Multimodal Large Language Models (MLLMs), such as GPT-4o (OpenAI, 2023) and Claude (Anthropic, 2024), have laid the foundation for the development of GUI agents for human-centred interactive systems like desktop OS (Xie et al., 2024; Bonatti et al., 2024).

However, automating computer tasks presents significant challenges. First, the vast range of constantly-evolving applications and websites requires the agent to possess specialized and up-to-date domain knowledge and the ability to learn from open-world experience. Second, complex desktop tasks often involve long-horizon, multi-step planning with interdependent actions that must be executed in a specific sequence. The agent must, therefore, create a clear plan with intermediate subgoals and track task progress. Third, GUI agents must navigate dynamic, non-uniform interfaces, processing large volumes of visual and textual information while operating within a vast action space. This involves distinguishing between relevant and irrelevant elements, accurately interpreting graphical cues, and responding to visual feedback during task execution.

In this paper, we present Agent S, a new agentic framework that tackles these challenges towards the goal of using computers like a human. First, to enhance the GUI agent's capabilities in solving diverse, long-horizon desktop tasks with specific domain knowledge, we propose an Experience-Augmented Hierarchical Planning method. This approach leverages Online Web Knowledge and past experiences stored in Narrative Memory to decompose the complex, long-horizon task into a structured plan of manageable subtasks (see Figure 1). Online Web Knowledge provides up-to-date external knowledge about specific applications, allowing the agent to adapt to frequently changing software and websites. Narrative Memory contains high-level, abstractive task experiences from past interactions, equipping the agent with contextual understanding for effective task planning. The agent monitors task completion progress, and during each subtask execution, it retrieves detailed, step-by-step subtask experience from Episodic Memory to dynamically refine its actions and continuously improve its planning ability. Successful subtasks and the full task experience are evaluated, summarized, and stored in episodic and narrative memory to enable continual improvement.

Furthermore, we introduce a specific language-centric Agent-Computer Interface (ACI) (Lieberman & Selker, 2003) as an abstraction layer to improve grounding, safety, and efficiency for MLLM-based GUI agents. The ACI defines an interaction paradigm by (1) a dual-input strategy using visual input for understanding environmental changes together with an image-augmented accessibility tree for precise element grounding; (2) a bounded action space of language-based primitives (e.g., click(element_id)) that are conducive to MLLM common-sense reasoning and generate environment transitions at the right temporal resolution for the agent to observe immediate and task-relevant environment feedback.

Our approach shows a remarkable improvement in the overall performance of Agent S on the OSWorld benchmark (Xie et al., 2024) (from 11.21% to 20.58%, with a relative improvement of 83.6%), establishing the new state-of-the-art results. The detailed comparison is shown in Figure 2, which demonstrates consistent improvements by Agent S across five broad computer task categories over the OSWorld agent.
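To make the ACI's bounded, language-centric action space more concrete, here is a minimal Python sketch. It is illustrative only, not the paper's implementation: the primitive names (click, type, hotkey) follow the paper's description, but the Element and ACI classes, their fields, and the dispatch logic are hypothetical stand-ins for whatever Agent S actually uses. The key idea shown is that the MLLM references elements by tag ID and emits exactly one discrete primitive per step, so every action produces an observable transition.

```python
from dataclasses import dataclass

@dataclass
class Element:
    """One interactable node from the (image-augmented) accessibility tree."""
    element_id: int
    role: str          # e.g. "button", "text_field"
    name: str          # accessible label, possibly recovered by OCR
    center: tuple      # (x, y) screen coordinates, hidden from the MLLM

class ACI:
    """Minimal sketch of a language-centric Agent-Computer Interface.

    The agent only sees element IDs and language primitives; pixel
    coordinates stay inside the interface layer.
    """
    def __init__(self, elements):
        self.elements = {e.element_id: e for e in elements}
        self.log = []  # executed low-level commands, for inspection

    def _resolve(self, element_id):
        if element_id not in self.elements:
            raise ValueError(f"unknown element id {element_id}")
        return self.elements[element_id]

    # --- bounded language primitives, one discrete action per step ---
    def click(self, element_id):
        e = self._resolve(element_id)
        self.log.append(("click", e.center))
        return f"clicked {e.role} '{e.name}'"

    def type(self, element_id, text):
        e = self._resolve(element_id)
        self.log.append(("type", e.center, text))
        return f"typed {text!r} into {e.role} '{e.name}'"

    def hotkey(self, *keys):
        self.log.append(("hotkey", keys))
        return f"pressed {'+'.join(keys)}"

# The agent refers to elements by tag ID and observes feedback per action:
ui = ACI([Element(7, "text_field", "Search", (400, 120)),
          Element(8, "button", "Go", (520, 120))])
feedback = [ui.type(7, "quarterly report"), ui.click(8)]
```

In a real system the `log` entries would be translated into actual mouse and keyboard events, and the returned strings would be folded into the next observation given to the MLLM.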
We also evaluate our Agent S on a concurrent work, WindowsAgentArena (Bonatti et al., 2024), where we observe a performance improvement from 13.3% to 18.2% on an equivalent setup without any explicit adaptation. The improvement demonstrates the broad generalizability of Agent S to different operating systems. We detail the component-wise improvements introduced by the proposed strategies through ablation studies and present a comprehensive error analysis of our Agent S framework. In summary, our contributions are four-fold:
• We introduce Agent S, a new agentic framework that integrates experience-augmented hierarchical planning, self-supervised continual memory update, and an Agent-Computer Interface for MLLM-based GUI agents to perform complex computer tasks.
• We propose an experience-augmented hierarchical planning method that uses experience from external web knowledge and the agent's internal memory to decompose complex tasks into executable subtasks.
• We extend the concept of an ACI to GUI agents, allowing MLLM-based agents to operate computers more precisely using a set of high-level, predefined primitive actions.
• We conduct extensive experiments on OSWorld to show the effectiveness of individual components of Agent S, establishing new state-of-the-art on automating computer tasks. Besides, we demonstrate its generalizability across different operating systems on WindowsAgentArena.

Figure 2: Agent S vs. OSWorld Agent results across five broad computer task categories.

Figure 3: Overview of the Agent S framework. Given task T_u and initial environment observation o_0, the Manager conducts experience-augmented hierarchical planning using web knowledge and narrative memory to produce subtasks s_0, …, s_n. For each s_i, Worker w_i draws from episodic memory to generate an action a_t at time t, which is executed by the ACI to return the next immediate observation o_{t+1}. A self-evaluation module closes the loop by storing the summarized subtask and full-task trajectories in narrative and episodic memory.

Figure 4: The pipeline of memory construction and update, which contains two phases: Self-supervised Exploration and Continual Memory Update. The initial Narrative and Episodic Memory is constructed through some randomly curated tasks during the exploration phase, and then it is updated based on the inference tasks continually.

6 Conclusion
In this work, we present Agent S, a novel framework for developing fully Autonomous Graphical User Interface (GUI) agents that can perform a wide range of user queries by directly controlling the keyboard and mouse. Through the Agent S framework, we show the benefits of Learning from Experience for Task-oriented GUI agents. We also discuss the concept of an Agent-Computer Interface for the GUI domain, arguing in favour of an abstraction layer that allows MLLM agents to perceive and reason at a language level with rich and continuous feedback. By leveraging Experience-Augmented Hierarchical Planning, Online Web Knowledge, and an Agent-Computer Interface (ACI), Agent S demonstrates SOTA performance on the OSWorld benchmark and generalizability across different operating systems. We demonstrate the potential of MLLM agents to learn from external sources and through direct interaction with the environment, without any human or environmental feedback in the GUI agents domain, thus opening a discourse on zero-shot, agentic methods for GUI agents.

Future Work. A key metric that has been unaddressed in existing work on MLLM agents for computer control, including ours, is the number of agent steps and wall-clock time required for task completion. While our work focuses on achieving significant improvement in task performance, future work can consider a shortest-path navigation formulation of GUI control and evaluate the Pareto-optimality of various agents on the dimensions of time and accuracy. In our work, we use the state-of-the-art GPT-4o and Claude-3.5-Sonnet models. However, future work can extend the ideas of experiential learning and the Agent-Computer Interface to smaller, open-source MLLMs, which could be fine-tuned to bridge the gap.
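As a closing illustration of the commentary above, the experience-augmented loop (Manager decomposes the task, Workers execute subtasks with episodic hints, and the Self-Evaluator summarizes results back into memory) can be sketched as a skeleton. Everything here is an assumed, toy interface: the word-overlap retrieval, the `plan_fn`/`act_fn`/`evaluate_fn` stand-ins for MLLM calls, and the dictionary-backed memories are placeholders, not the authors' code.

```python
class Memory:
    """Toy narrative/episodic store: keyed summaries with naive retrieval."""
    def __init__(self):
        self.store = {}  # query text -> experience summary

    def retrieve(self, query):
        # Placeholder for embedding similarity: pick the stored key
        # sharing the most words with the query.
        best, best_score = "", 0
        qwords = set(query.lower().split())
        for key, summary in self.store.items():
            score = len(qwords & set(key.lower().split()))
            if score > best_score:
                best, best_score = summary, score
        return best

    def update(self, query, summary):
        self.store[query] = summary

def run_task(task, plan_fn, act_fn, evaluate_fn, narrative, episodic):
    """One pass of Manager -> Workers -> Self-Evaluator (no replanning)."""
    guideline = narrative.retrieve(task)             # internal experience
    subtasks = plan_fn(task, guideline)              # Manager: decompose
    trajectory = []
    for sub in subtasks:
        hint = episodic.retrieve(sub)                # Worker: recall steps
        actions = act_fn(sub, hint)                  # bounded ACI actions
        trajectory.append((sub, actions))
        episodic.update(sub, evaluate_fn(sub, actions))    # subtask summary
    narrative.update(task, evaluate_fn(task, trajectory))  # full-task summary
    return trajectory

# Toy stubs in place of MLLM planning/acting/evaluating:
narrative, episodic = Memory(), Memory()
log = run_task(
    "export report as pdf",
    plan_fn=lambda t, g: ["open file menu", "choose export pdf"],
    act_fn=lambda s, h: [f"click:{s}"],
    evaluate_fn=lambda s, tr: f"summary of {s}",
    narrative=narrative, episodic=episodic)
```

A second run of a similar task would now retrieve the stored summaries, which is the continual-improvement effect the paper attributes to summarized (rather than raw-trajectory) experience storage.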