[2025-10-08 AI Daily] Google's AI can now surf the web for you, click on buttons, and fill out forms with Gemini 2.5 Computer Use

Today's News · 10-07

Word count: about 5,378; estimated reading time: 18 minutes

Google's AI can now surf the web for you, click on buttons, and fill out forms with Gemini 2.5 Computer Use

Google has unveiled a fine-tuned, custom-trained version of its Gemini 2.5 Pro LLM, called "Gemini 2.5 Computer Use." From a single text prompt, the model can drive a virtual browser to surf the web, retrieve information, fill out forms, and take actions on websites. It is not offered directly to consumers, but it can be tried through Browserbase, a startup founded by former Twilio engineer Paul Klein. Browserbase offers a "headless" web browser designed specifically for AI agents and applications, and lets users interact with the model through a graphical interface.

For AI builders and developers, Gemini 2.5 Computer Use is available through the Gemini API in Google AI Studio and in Vertex AI, Google Cloud's model selection and application-building platform. The model is designed to let AI agents interact directly with user interfaces, including browsers and mobile applications. It supports built-in UI actions such as click_at, type_text_at, scroll_document, and drag_and_drop.

In brief, unscientific initial tests, Gemini 2.5 Computer Use successfully navigated to Taylor Swift's official website and summarized the top promotions. It also got through a Google Search CAPTCHA and searched for highly rated solar lights on Amazon. However, it struggled to complete some tasks, such as filling out forms. According to Google, the model has posted leading results on multiple interface control benchmarks compared with other major AI systems.

The model employs a specialized tool called computer_use and can be integrated into custom environments using tools like Playwright or via the Browserbase demo sandbox. It operates in an interaction loop, receiving a user task prompt, a screenshot of the interface, and a history of past actions. The model analyzes this input and produces a recommended UI action. If needed, it can request confirmation from the end user for riskier tasks, such as making a purchase. The model also includes built-in safeguards to avoid actions that might compromise security or violate Google’s prohibited use policies.
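
The interaction loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Google's implementation and not the Gemini API: `Action`, `run_agent`, `scripted_model`, and the callback parameters are all hypothetical names, and the "model" here is a fixed script rather than a real LLM call. Only the action names `click_at` and `type_text_at` come from the text.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Action:
    """One suggested UI action (hypothetical structure, for illustration)."""
    name: str                         # e.g. "click_at" or "type_text_at"
    args: dict
    needs_confirmation: bool = False  # set for riskier steps such as a purchase


History = List[Tuple[str, Action]]


def run_agent(
    task: str,
    model: Callable[[str, bytes, History], Optional[Action]],
    take_screenshot: Callable[[], bytes],
    execute: Callable[[Action], None],
    confirm: Callable[[Action], bool],
    max_steps: int = 10,
) -> History:
    """Loop: screenshot -> model suggests an action -> confirm if risky -> execute."""
    history: History = []
    for _ in range(max_steps):
        # The model sees the task, the current screenshot, and past actions.
        action = model(task, take_screenshot(), history)
        if action is None:  # the model signals that the task is finished
            break
        if action.needs_confirmation and not confirm(action):
            history.append(("declined", action))
            break
        execute(action)
        history.append(("executed", action))
    return history


def scripted_model(task: str, screenshot: bytes, history: History) -> Optional[Action]:
    """Stand-in for the real model: replays a fixed script of actions."""
    script = [
        Action("click_at", {"x": 120, "y": 40}),
        Action("type_text_at", {"x": 120, "y": 40, "text": "solar lights"}),
        Action("click_at", {"x": 300, "y": 500}, needs_confirmation=True),
    ]
    return script[len(history)] if len(history) < len(script) else None


executed: List[Action] = []
log = run_agent(
    task="Find highly rated solar lights and buy one",
    model=scripted_model,
    take_screenshot=lambda: b"",    # a real environment would capture the page
    execute=executed.append,        # a real environment would drive the browser
    confirm=lambda action: False,   # the end user declines the risky step
)
print([status for status, _ in log])  # ['executed', 'executed', 'declined']
```

In a real integration, `take_screenshot` and `execute` would be backed by a browser automation tool such as Playwright, and `model` would wrap the actual API call. Keeping them as plain callbacks makes the loop easy to test without a browser.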

Gemini 2.5 Computer Use is being used in various domains, including Google's payments platform, third-party AI agent platforms, and proactive AI assistant providers. It has shown significant performance improvements in areas such as test execution recovery, complex data parsing, and interface interaction speed.

Pricing for Gemini 2.5 Computer Use closely tracks that of the standard Gemini 2.5 Pro model and is billed per token. The model's output is limited to suggested UI actions or chatbot-style text responses. Any structured output, such as a document or file, must be handled separately by the developer, often through custom code or third-party integrations.


Has this stealth startup finally cracked the code on enterprise AI agent reliability? Meet AUI's Apollo-1

Augmented Intelligence (AUI) Inc., a New York City-based startup co-founded by Ohad Elhelo and Ori Cohen, has unveiled a new foundation model called Apollo-1, designed to boost the reliability of AI agents in enterprise settings. Apollo-1 is built on a principle called stateful neuro-symbolic reasoning, which combines the strengths of symbolic and neural architectures to ensure consistent and policy-compliant outcomes in every customer interaction.

The model is designed to address the critical category of interaction where AI agents must reliably complete tasks outside of chat. Apollo-1 achieves a 92.5% pass rate on TAU-Bench Airline, a benchmark for finding and booking flights, compared to a 56% pass rate for the top-performing agents and models. This significant improvement is attributed to Apollo-1’s ability to predict the next action in a conversation based on a typed symbolic state, rather than generating text based on probabilistic predictions.

AUI’s approach involves encoding millions of real task-oriented conversations to identify universal procedural patterns that can be modeled explicitly. The model then uses a neuro-symbolic reasoner to decide the next action from the symbolic state, so the same state always yields the same action rather than a sampled one. According to AUI, this architecture lets organizations define exact behaviors through a System Prompt, and those behaviors then execute deterministically.
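
To make the idea of choosing the next action from a typed symbolic state concrete, here is a small Python sketch. It is not AUI's architecture or API, which the article does not describe in detail. `Intent`, `BookingState`, `next_action`, and the action names are all hypothetical, and the neural half (turning free-form conversation into the typed state) is left out entirely.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Intent(Enum):
    BOOK_FLIGHT = "book_flight"
    CANCEL_FLIGHT = "cancel_flight"


@dataclass(frozen=True)
class BookingState:
    """Typed symbolic state of a conversation (hypothetical fields)."""
    intent: Intent
    destination: Optional[str] = None
    date: Optional[str] = None
    payment_confirmed: bool = False


def next_action(state: BookingState) -> str:
    """Pure function of the state: identical states always give the same action."""
    if state.intent is Intent.CANCEL_FLIGHT:
        return "escalate_to_human"  # example of a fixed business policy
    if state.destination is None:
        return "ask_destination"
    if state.date is None:
        return "ask_date"
    if not state.payment_confirmed:
        return "request_payment_confirmation"
    return "book_flight"


state = BookingState(Intent.BOOK_FLIGHT, destination="JFK")
print(next_action(state))  # ask_date
```

Because `next_action` reads only the typed state and involves no sampling, a policy such as "never book before payment is confirmed" holds on every run, which is the kind of guarantee the article attributes to Apollo-1.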

Apollo-1 is already running in limited pilots with Fortune 500 companies across sectors such as finance, travel, and retail. The model is set for general availability in November 2025, when AUI will open its APIs and release full documentation. AUI’s strategic partnership with Google aims to bring Apollo-1 to a wider audience, with voice and image capabilities planned for the future.

The company’s pitch is simple: make AI that businesses can trust to act, not just talk. With its promise of order-of-magnitude reliability differences, Apollo-1 could potentially bridge the gap between chatbots that sound human and agents that reliably perform human tasks.


IBM claims 45% productivity gains with Project Bob, its multi-model IDE that orchestrates LLMs with full repository context

IBM has unveiled Project Bob, an AI-first IDE designed to automate application modernization by orchestrating multiple Large Language Models (LLMs) with full repository context. Project Bob is an enterprise-focused tool that maintains full-repository context across editing sessions, automating complex tasks like upgrading Java versions and refactoring frameworks. It integrates DevSecOps practices like vulnerability detection and compliance checks directly into the IDE, enabling developers to design, debug, refactor, and modernize code without breaking their workflow.

Project Bob orchestrates between Anthropic’s Claude, Mistral, Meta’s Llama, and IBM’s Granite 4 models, using a data-driven model selection approach to balance accuracy, latency, and cost in real-time. Among 6,000 early adopters within IBM, 95% used Bob for task completion rather than code generation. The tool has achieved an average productivity gain of 45% and a 22-43% increase in code commits.
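
One simple way to balance accuracy, latency, and cost when routing a request is a weighted score per candidate model. The sketch below illustrates that general idea only. IBM has not published Bob's selection logic in this article, so `ModelProfile`, `pick_model`, the weights, and the placeholder model names and numbers are all assumptions made up for the example, not measurements.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ModelProfile:
    """Estimated characteristics of one candidate model (illustrative values)."""
    name: str
    accuracy: float   # 0..1, higher is better
    latency_s: float  # seconds per request, lower is better
    cost_usd: float   # dollars per request, lower is better


def pick_model(
    profiles: Sequence[ModelProfile],
    w_accuracy: float = 1.0,
    w_latency: float = 0.1,
    w_cost: float = 10.0,
) -> ModelProfile:
    """Return the profile with the best weighted trade-off."""
    def score(p: ModelProfile) -> float:
        # Reward accuracy, penalize latency and cost.
        return w_accuracy * p.accuracy - w_latency * p.latency_s - w_cost * p.cost_usd
    return max(profiles, key=score)


# Placeholder candidates; the numbers are invented for the example.
PROFILES = [
    ModelProfile("model_a", accuracy=0.90, latency_s=4.0, cost_usd=0.020),
    ModelProfile("model_b", accuracy=0.80, latency_s=1.0, cost_usd=0.005),
    ModelProfile("model_c", accuracy=0.60, latency_s=0.5, cost_usd=0.001),
]

print(pick_model(PROFILES).name)                                # model_b
print(pick_model(PROFILES, w_latency=0.0, w_cost=0.0).name)     # model_a
```

With the default weights the balanced model wins. Dropping the latency and cost penalties sends the request to the most accurate model instead. A data-driven router would presumably learn such profiles and weights from observed usage rather than hard-coding them.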

Project Bob benefits from a new partnership between IBM and Anthropic, integrating Claude models directly into the watsonx portfolio. This partnership also includes a guide for enterprise AI agent deployment, focusing on the Agent Development Lifecycle (ADLC) and Model Context Protocol (MCP). Additionally, IBM is extending its watsonx Orchestrate technology to integrate the open-source Langflow visual agent builder, addressing the prototype-to-production chasm.

The integration of Langflow with watsonx Orchestrate brings critical capabilities, including agent lifecycle management, integrated AI governance, enterprise infrastructure, no-code and pro-code options, pre-built domain agents, and production observability. Agentic Workflows and AgentOps further enhance governance and observability, providing real-time monitoring and policy-based controls across the full agent lifecycle.

Project Bob is now available in private tech preview, with broader availability expected in the future. IBM is accepting access requests through its developer portal, while its AgentOps and agentic workflows integrations are now available in watsonx Orchestrate.


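Sora 2 Launches: Tool, Toy, or Industry Killer?

The launch of Sora 2 marks not only a technical advance but also a carefully planned product and market strategy. Beyond its strong capabilities, the product takes a comprehensive approach to marketing and user targeting. This makes Sora 2 a significant market player with the potential to become a leading tool in the industry.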


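Sales Surge 107% Yet Ordinary Households Remain Out of Reach: Is the AR Glasses Boom Real or Hype?

The surge in AR glasses sales has drawn broad market attention. Despite the strong sales figures, however, AR glasses still face many obstacles to reaching ordinary households. Technical maturity, user experience, and price could all affect how widely they are adopted. The outlook is optimistic, but mass adoption will take time.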


EU Plans to Cut Steel Quotas by Nearly a Half and Hike Tariffs to 50%

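The EU plans to cut steel quotas by nearly half and raise tariffs to 50%. The move is intended to protect the EU steel industry and replaces a temporary mechanism that expires in June. The current mechanism imposes a 25% tariff on most imported steel products. The new mechanism should help stabilize the EU's domestic steel market, but it could affect international trade relations.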


AMD Stock Soars Almost 24% on Deal with OpenAI to Deliver 6GW Chips

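AMD's share price rose sharply after the company reached a deal with OpenAI under which it will supply OpenAI with 6GW of chips. The partnership gives AMD a major market opportunity and further strengthens its position in the AI chip market. OpenAI also received warrants to purchase AMD shares, which would make it one of AMD's significant shareholders.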


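2025 Nobel Prize in Physics Goes to the Builders of Google's Quantum Computer

The 2025 Nobel Prize in Physics was awarded to three scientists working in quantum mechanics. These scientists played an important role in the development of Google's quantum computer, and their contributions laid the groundwork for the development of quantum computing. The award recognizes not only the individuals but also the prospects of quantum computing technology.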

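Summary

Today's AI news centers on technical progress and enterprise applications. Google's Gemini 2.5 Computer Use shows what AI can do in tasks such as web browsing and form filling, AUI's Apollo-1 focuses on making enterprise AI agents more reliable, and IBM's Project Bob aims to speed up application modernization with a multi-model IDE. These advances show both the practical value of AI in everyday tasks and its large potential in enterprise applications. In addition, the AMD and OpenAI partnership brings new opportunities to the AI chip market.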


Author: Qwen/Qwen2.5-32B-Instruct
Sources: 量子位 (QbitAI), 钛媒体 (TMTPost), VentureBeat
Editor: 小康
