[AI Daily 20251119] Musk's xAI launches Grok 4.1 with lower hallucination rate on the web and apps

Article length: about 5,400 characters. Estimated reading time: 20 minutes.
Musk's xAI launches Grok 4.1 with lower hallucination rate on the web and apps — no API access (for now)
In what appeared to be a bid to soak up some of Google's limelight ahead of the launch of its new Gemini 3 flagship AI model (now rated the most powerful LLM in the world by multiple independent evaluators), Elon Musk's rival AI startup xAI last night unveiled its newest large language model, Grok 4.1. The model is now live for consumer use on Grok.com, social network X (formerly Twitter), and the company's iOS and Android mobile apps, and it arrives with major architectural and usability enhancements, among them faster reasoning, improved emotional intelligence, and significantly reduced hallucination rates. xAI also commendably published a white paper on its evaluations, including a brief section on the training process. Across public benchmarks, Grok 4.1 has vaulted to the top of the leaderboard, outperforming rival models from Anthropic, OpenAI, and Google, or at least Google's pre-Gemini 3 model (Gemini 2.5 Pro). It builds upon the success of xAI's Grok 4 Fast, which VentureBeat covered favorably shortly after its release in September 2025. However, enterprise developers looking to integrate the new and improved Grok 4.1 into production environments will find one major constraint: it is not yet available through xAI's public API. Despite its high benchmark scores, Grok 4.1 remains confined to xAI's consumer-facing interfaces, with no announced timeline for API exposure. At present, only older models are available for programmatic use via the xAI developer API: Grok 4 Fast (reasoning and non-reasoning variants), Grok 4 0709, and legacy models such as Grok 3, Grok 3 Mini, and Grok 2 Vision. These support up to 2 million tokens of context, with token pricing ranging from $0.20 to $3.00 per million depending on the configuration. For now, this limits Grok 4.1's utility in enterprise workflows that rely on backend integration, fine-tuned agentic pipelines, or scalable internal tooling.
While the consumer rollout positions Grok 4.1 as the most capable LLM in xAI's portfolio, production deployments in enterprise environments remain on hold.
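To make the quoted pricing range concrete, here is a minimal sketch of the arithmetic. It assumes a single flat per-million-token rate and ignores any split between input and output tokens, caching discounts, or tiering, none of which the article specifies. The function name `token_cost_usd` and the constants are illustrative, not part of any xAI SDK.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)

def token_cost_usd(tokens: int, price_per_million_usd: str) -> Decimal:
    """Cost in USD of `tokens` tokens at a flat per-million-token price."""
    if tokens < 0:
        raise ValueError("tokens must be non-negative")
    return Decimal(tokens) * Decimal(price_per_million_usd) / TOKENS_PER_MILLION

# Figures quoted in the article; how they map to specific models is not stated.
LOW_PRICE = "0.20"        # USD per million tokens, low end of the range
HIGH_PRICE = "3.00"       # USD per million tokens, high end of the range
FULL_CONTEXT = 2_000_000  # maximum context length mentioned for the API models

low = token_cost_usd(FULL_CONTEXT, LOW_PRICE)
high = token_cost_usd(FULL_CONTEXT, HIGH_PRICE)
print(f"One full 2M-token context: ${low:.2f} to ${high:.2f}")
```

Under these assumptions, processing one full 2-million-token context would cost between $0.40 and $6.00, so the choice of configuration changes the bill by a factor of 15.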

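Google's Gemini 3.0 makes a late-night splash: the strongest AI, with no suspense
It's here. After nearly a month of teasers, Gemini 3.0 Pro has just launched in preview on Google AI Studio, with API access opening at the same time. It will roll out across Google's products in the coming weeks. No filler needed: open the Model Card and one word jumps out, domination. According to test data disclosed by Google, Gemini 3 Pro is, without any suspense, the strongest AI at math on the planet right now. On the AIME 2025 math test, with code execution, it scored a perfect 100%. And on MathArena, the "hell mode" of math competitions, where other large models including GPT-5.1 are still struggling at around 1%, Gemini 3 Pro reached 23.4%. On coding, it did not take SOTA on SWE-Bench, but it sits firmly in the top tier. Its Live Code Bench Elo score exceeds 2,400, and it ranks first on the tool-calling and terminal-operation benchmarks. The real standout is its "visual intelligence." Its score for understanding screenshots reaches 72.7%, double the prior state of the art. This means agents are no longer blind, and it will reshape how AI operates computers. But that is not all: Google also casually dropped a small bombshell tonight, its own agentic coding platform, Google Antigravity. Earlier rumors claimed that Gemini 3 could do "end-to-end programming," and people assumed the model itself had become magical. It now appears that the model has not become magical; rather, Google is exploring how to achieve end-to-end programming through better systems engineering. If Cursor is today's strongest "exoskeleton," helping you write code faster through AI completion, then Antigravity is aiming for "self-driving." It is no longer just an editor but an agent-first development environment. Integrating Gemini 3.0 and the browser-controlling Gemini 2.5 Computer Use model, its agent can write code, open a terminal to run tests, and even open a browser to verify the UI, then fix the errors it finds, all on its own. No storytelling, just muscle. With this hardcore release, Google declares that the new king has arrived. Interestingly, even Sam Altman gave it a like this time.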

Microsoft remakes Windows for an era of autonomous AI agents
Microsoft is fundamentally restructuring its Windows operating system to become what executives call the first "agentic OS," embedding the infrastructure needed for autonomous AI agents to operate securely at enterprise scale — a watershed moment in the evolution of personal computing that positions the 40-year-old platform as the foundation for a new era of human-machine collaboration. The company announced Tuesday at its Ignite conference that it is introducing native agent infrastructure directly into Windows 11, allowing AI agents — autonomous software programs that can perform complex, multi-step tasks on behalf of users — to discover tools, execute workflows, and interact with applications through standardized protocols while operating in secure, policy-controlled environments separate from user sessions. The shift is Microsoft's most significant architectural evolution of Windows since the introduction of the modern security model, transforming the operating system from a platform where users manually orchestrate applications into one where they can "simply express your desired outcome, and agents handle the complexity," according to Pavan Davuluri, President of Windows & Devices at Microsoft. "Windows 11 starts with this notion of secure by design, secure by default," Davuluri said in an exclusive interview with VentureBeat. "And a lot of the work that we're doing today, when we think about the engagement we have with our customers, the expectations they have with us is making sure we are building upon the fact that Windows is the most secure platform for them and is the most resilient platform as well." The announcements arrive as enterprises are experimenting with AI agents but struggling with fragmented tooling, security concerns, and lack of centralized management — challenges that Microsoft believes only operating system-level integration can solve. 
The stakes are enormous: with Windows running on an estimated 1.4 billion devices globally, Microsoft's architectural choices will likely shape how organizations deploy autonomous AI systems for years to come.

Google unveils Gemini 3 claiming the lead in math, science, multimodal and agentic AI benchmarks

After more than a month of rumors and feverish speculation, including Polymarket wagering on the release date, Google today unveiled Gemini 3, its newest proprietary frontier model family and the company's most comprehensive AI release since the Gemini line debuted in 2023. The models are proprietary (closed-source), available exclusively through Google products, developer platforms, and paid APIs, including Google AI Studio, Vertex AI, the Gemini command line interface (CLI) for developers, and third-party integrations across the broader integrated development environment (IDE) ecosystem. Gemini 3 arrives as a full portfolio, including: Gemini 3 Pro, the flagship frontier model; Gemini 3 Deep Think, an enhanced reasoning mode; generative interface models powering Visual Layout and Dynamic View; Gemini Agent for multi-step task execution; and the Gemini 3 engine embedded in Google Antigravity, the company's new agent-first development environment. "This is the best model in the world, by a crazy wide margin!" wrote Google DeepMind Research Scientist Yi Tay on X. Independent AI benchmarking and analysis organization Artificial Analysis has already crowned Gemini 3 Pro the "new leader in AI" globally after it achieved the top score of 73 on the organization's index. That lifts Google from ninth place overall with the preceding Gemini 2.5 Pro model, which scored 60, behind OpenAI, Moonshot AI, xAI, Anthropic and MiniMax models. As Artificial Analysis wrote on X: "For the first time, Google has the most intelligent model." Another independent leaderboard site, LMArena, reported that Gemini 3 Pro ranked first in the world across all of its major evaluation tracks, including text reasoning, vision, coding, and web development.
In a public post, the @arena account on X said the model surpassed even the newly released (hours old) Grok-4.1, as well as Claude 4.5, and GPT-5-class systems in categories such as math, long-form queries, creative writing, and several occupational benchmarks. The post also highlighted the scale of gains over Gemini 2.5 Pro, including a 50-point jump in text Elo, a 70-point increase in vision, and a 280-point rise in web-development tasks. While these results reflect live community voting and remain preliminary, they signal unusually broad performance improvements across domains where previous Gemini models trailed competitors.
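As a rough guide to what those point gains mean, the sketch below converts a rating gap into an expected head-to-head win rate. It assumes the arena scores behave like conventional Elo ratings with the standard 400-point logistic scale; the leaderboard's actual scoring method may differ, so the outputs are illustrative only. The function name `expected_win_rate` is hypothetical.

```python
def expected_win_rate(elo_gap: float, scale: float = 400.0) -> float:
    """Expected win probability for the higher-rated side under the
    conventional Elo logistic model (assumed here, not confirmed by the source)."""
    return 1.0 / (1.0 + 10.0 ** (-elo_gap / scale))

# Gains over Gemini 2.5 Pro as reported in the @arena post.
REPORTED_GAINS = {"text": 50, "vision": 70, "web development": 280}

for track, gap in REPORTED_GAINS.items():
    rate = expected_win_rate(gap)
    print(f"{track}: +{gap} points -> {rate:.1%} expected win rate vs. the older model")
```

Under that assumption, the 50- and 70-point gains correspond to the newer model being preferred in roughly 57% and 60% of pairwise votes, while the 280-point gain in web development corresponds to about 83%.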

Writer's AI agents can actually do your work—not just chat about it

Writer, a San Francisco-based artificial intelligence startup, is launching a unified AI agent platform designed to let any employee automate complex business workflows without writing code — a capability the company says distinguishes it from consumer-oriented tools like Microsoft Copilot and ChatGPT. The platform, called Writer Agent, combines chat-based assistance with autonomous task execution in a single interface. Starting Tuesday, enterprise customers can use natural language to instruct the AI to create presentations, analyze financial data, generate marketing campaigns, or coordinate across multiple business systems like Salesforce, Slack, and Google Workspace—then save those workflows as reusable "Playbooks" that run automatically on schedules. The announcement comes as enterprises struggle to move AI initiatives beyond pilot programs into production at scale. Writer CEO May Habib has been outspoken about this challenge, recently revealing that 42% of Fortune 500 executives surveyed by her company said AI is "tearing their company apart" due to coordination failures between departments. "We're delivering an agent interface that is both incredibly powerful and radically simple to transform individual productivity into organizational impact," Habib said in a statement. "Writer Agent is the difference between a single sales rep asking a chatbot to write an outreach email and an enterprise ensuring that 1,000 reps are all sending on-brand, compliant, and contextually-aware messages to target accounts."

How AI tax startup Blue J torched its entire business model for ChatGPT—and became a $300 million company

In the winter of 2022, as the tech world was becoming mesmerized by the sudden, explosive arrival of OpenAI’s ChatGPT, Benjamin Alarie faced a pivotal choice. His legal tech startup, Blue J, had a respectable business built on the AI of a bygone era, serving hundreds of accounting firms with predictive models. But it had hit a ceiling. Alarie, a tenured tax law professor at the University of Toronto, saw the nascent, error-prone, yet powerful capabilities of large language models not as a curiosity, but as the future. He made a high-stakes decision: to pivot his entire company, which had been painstakingly built over nearly a decade, and rebuild it from the ground up on this unproven technology. That bet has paid off handsomely. Blue J has since quietly secured a $122 million Series D funding round co-led by Oak HC/FT and Sapphire Ventures, placing the company's valuation at over $300 million. The move transformed Blue J from a niche player into one of Canada's fastest-growing legal tech firms, multiplying its revenue roughly twelve-fold and attracting 10 to 15 new customers every day. The company now serves more than 3,500 organizations, including global accounting giant KPMG and several Fortune 500 companies. It is tackling a critical bottleneck in the professional services industry: a severe and worsening talent shortage. The U.S. has 340,000 fewer accountants than it did five years ago, and with 75% of current CPAs expected to retire in the next decade, firms are desperate for tools that can amplify the productivity of their remaining experts.

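NuwaAI V1.0 released: generate a digital human from a single sentence, a full upgrade for digital productivity

Bangyan Technology (邦彦技术) releases NuwaAI 1.0, building the industry's first agentic digital-human platform that can execute tasks

Press conference for the 2025-2026 season VEX Robotics Asia Open international Signature Event held in Beijing

The theme: "Building a harmonious and prosperous era with wisdom and technology"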

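Astribot (星尘智能) advances on both funding and partnerships, accelerating large-scale robot deployment

According to Leiphone's AI Technology Review, Astribot has been busy lately. On November 14, at the China Hi-Tech Fair, it signed orders and a strategic cooperation agreement with Baidu AI Cloud and Jishu Diedai (极数迭代). The next day it reached a strategic partnership with Jinma Entertainment (金马游乐), and on November 18 it closed a Series A++ round worth several hundred million yuan, co-led by Guoke Investment (国科投资) and Ant Group, with several well-known financial investors and industrial investors following on. The cooperation announced at the fair builds on each party's core strengths, which complement one another closely. The ultimate goal is to create replicable, industry-grade solutions and deploy them at scale in research, industrial, commercial, and public-service settings. The new funding will go mainly toward building the R&D talent pipeline, preparing for volume manufacturing of the cable-driven robot body, deepening multi-scenario solutions, and strengthening industrialization capabilities, further advancing the engineering and commercialization of embodied AI. As the first company in the world to mass-produce cable-driven AI robots, Astribot has won numerous partnerships thanks to its distinctive cable-driven transmission design. In September this year, it closed a deal with SEER Robotics (上海仙工智能科技股份有限公司) that ranks among the earliest thousand-unit commercial orders for humanoid robots in China's industrial sector this year. The deal gives Astribot a supply-chain advantage, and SEER's existing customer base may become a source of future buyers for Astribot's robots. With the new partnership, Astribot is no longer confined to industrial settings and is bringing robots into entertainment. Astribot and Jinma Entertainment plan to jointly launch a new generation of robots for cultural tourism and entertainment, which represents one of the earliest large-scale orders for humanoid robots in China's cultural-tourism, entertainment, and commercial-service sectors. Wan Lin (万琳), Astribot's marketing director, said that commercial-service settings require interaction between machines and people to be safe, and that Jinma Entertainment chose Astribot precisely because of the high safety of cable-driven technology. Technically, a cable drive can absorb impact forces, which makes the robot less likely to be damaged in a collision. That reflects good force control, and good force control in turn means high safety. In addition, the cable drive delivers high power with low friction and low resistance, and its transmission efficiency can reach 80% to 90%, which translates into high dexterity. These two properties made Astribot a natural choice for Jinma Entertainment. Jinma Entertainment is reported to be a large, modern cultural-tourism group whose business spans the R&D, manufacturing, and installation of high-tech virtual immersive rides as well as the planning, design, construction, investment, and operation of creative cultural-tourism projects. It maintains stable partnerships with Universal Studios and many other large resorts and theme parks. The tie-up is therefore both a vertical technology exploration in the theme-park robot segment and a boost to Astribot's brand influence. Taken as a whole, every step Astribot takes is aimed at large-scale deployment. The company borrows the "L2 and L4 dual-track" commercialization logic from autonomous driving to work through the challenges of deploying robots at scale, step by step. From the thousand-unit industrial order to several hundred million yuan in new funding, its multi-scenario deployment strategy built around cable-driven technology is beginning to show results. (Leiphone)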

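First national standard for digital humans, led by SenseTime, officially released

From defining industry standards to defining national standards

Musk quietly releases Grok 4.1, topping every LMArena leaderboard

Its non-thinking mode beats the full reasoning modes of every other model on the public leaderboards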

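Summary

Today's main developments in AI center on the release and technical upgrading of large language models (LLMs). Musk's xAI released Grok 4.1, which reached the top tier on multiple public benchmarks but does not yet offer API access, limiting its use in enterprise production environments. Meanwhile, Google released Gemini 3 Pro, which performed strongly on multiple key benchmarks, particularly in mathematical reasoning, visual intelligence, and tool use, establishing its leading position in AI. In addition, Microsoft announced a new architecture for Windows 11 that aims to make it a secure platform for autonomous AI agents, showing its determination to drive AI integration at the operating-system level. These developments reflect continued progress in model performance, application scenarios, and infrastructure, while also revealing the challenges and opportunities AI faces in real-world use.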

Author: Qwen/Qwen2.5-32B-Instruct
Sources: TMTPost (钛媒体), GeekPark (极客公园), VentureBeat, Leiphone (雷锋网), QbitAI (量子位)
Editor: Xiaokang (小康)
