Word count: about 4,500 words; estimated reading time: 15 minutes
> ## Databricks research reveals that building better AI judges isn't just a technical concern; it's a people problem
The intelligence of AI models isn't what's blocking enterprise deployments. It's the inability to define and measure quality in the first place. That's where AI judges are now playing an increasingly important role. In AI evaluation, a "judge" is an AI system that scores outputs from another AI system. Judge Builder is Databricks' framework for creating judges and was first deployed as part of the company's Agent Bricks technology earlier this year. The framework has evolved significantly since its initial launch in response to direct user feedback and deployments.
Early versions focused on technical implementation but customer feedback revealed the real bottleneck was organizational alignment. Databricks now offers a structured workshop process that guides teams through three core challenges: getting stakeholders to agree on quality criteria, capturing domain expertise from limited subject matter experts, and deploying evaluation systems at scale.
Judge Builder addresses what Pallavi Koppol, a Databricks research scientist who led the development, calls the "Ouroboros problem," named for the ancient symbol of a snake eating its own tail: using AI systems to evaluate AI systems creates a circular validation challenge.
The solution is measuring "distance to human expert ground truth" as the primary scoring function. By minimizing the gap between how an AI judge scores outputs versus how domain experts would score them, organizations can trust these judges as scalable proxies for human evaluation. This approach differs fundamentally from traditional guardrail systems or single-metric evaluations.
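As a minimal sketch of the idea (with hypothetical toy scores, not Databricks' actual scoring function), "distance to human expert ground truth" can be read as an agreement metric between the judge's scores and the experts' scores on the same outputs:

```python
# Hypothetical 1-5 quality scores on five sample outputs.
expert_scores = [5, 3, 4, 2, 5]  # human subject-matter experts
judge_scores = [4, 3, 5, 2, 4]   # AI judge on the same outputs

# Mean absolute distance: lower means the judge is a better
# proxy for human evaluation.
distance = sum(abs(j - e) for j, e in zip(judge_scores, expert_scores)) / len(expert_scores)
print(distance)  # 0.6 on this toy data
```

Minimizing this distance, rather than optimizing a single fixed metric, is what lets a judge stand in for scarce expert time at scale.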
Judge Builder integrates with Databricks' MLflow and prompt optimization tools and can work with any underlying model. Teams can version control their judges, track performance over time, and deploy multiple judges simultaneously across different quality dimensions.
Databricks' work with enterprise customers revealed three critical lessons that apply to anyone building AI judges:
- Your experts don't agree as much as you think. Organizations discover that even their own subject matter experts disagree on what constitutes acceptable output.
- Break down vague criteria into specific judges. Instead of one judge evaluating whether a response is "relevant, factual, and concise," create three separate judges.
- You need fewer examples than you think. Teams can create robust judges from just 20-30 well-chosen examples.
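The second lesson can be sketched in code. The names below are hypothetical and this is not Databricks' actual Judge Builder API; it only illustrates the structure of "one focused judge per criterion" instead of one compound judge:

```python
# Hypothetical sketch: split a compound criterion into focused judges.
CRITERIA = ("relevant", "factual", "concise")

def make_judge(criterion: str):
    """Build a judge that scores exactly one quality dimension."""
    def judge(response: str) -> str:
        # In practice this prompt would be sent to an LLM and the
        # returned score parsed; here we only construct the prompt.
        return (f"Score 1-5: is the following response {criterion}?\n"
                f"{response}")
    return judge

# Three focused judges instead of one judge asked to evaluate
# "relevant, factual, and concise" all at once.
judges = {c: make_judge(c) for c in CRITERIA}
```

Separate judges make disagreements diagnosable: a low score now points at one dimension rather than an entangled verdict.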
The business impact is clear: multiple customers who went through the workshop have since become seven-figure GenAI spenders at Databricks, where they were not before.
> ## Attention ISN'T all you need?! New Qwen3 variant Brumby-14B-Base leverages Power Retention technique
When the transformer architecture was introduced in 2017 in the now seminal Google paper "Attention Is All You Need," it became an instant cornerstone of modern artificial intelligence. Every major large language model (LLM) — from OpenAI's GPT series to Anthropic's Claude, Google's Gemini, and Meta's Llama — has been built on some variation of its central mechanism: attention, the mathematical operation that allows a model to look back across its entire input and decide what information matters most.
Eight years later, the same mechanism that defined AI's golden age is now showing its limits. Attention is powerful, but it is also expensive — its computational and memory costs scale quadratically with context length, creating an increasingly unsustainable bottleneck for both research and industry.
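The quadratic scaling is easy to see in back-of-the-envelope terms: standard attention materializes a score for every (query, key) pair of tokens, so doubling the context length quadruples the score matrix:

```python
def attention_scores(n_tokens: int) -> int:
    # Standard attention computes one score per (query, key) pair,
    # i.e. an n x n matrix per head.
    return n_tokens * n_tokens

# Doubling the context quadruples the work and memory for scores.
ratio = attention_scores(8192) / attention_scores(4096)
print(ratio)  # 4.0
```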
On October 28, 2025, the little-known AI startup Manifest AI introduced a radical alternative. Their new model, Brumby-14B-Base, is a retrained variant of Qwen3-14B-Base, one of the leading open-source transformer models. But while many variants of Qwen have been trained already, Brumby-14B-Base is novel in that it abandons attention altogether, replacing the attention layers with a novel mechanism called Power Retention: a recurrent, hardware-efficient architecture that stores and updates information over arbitrarily long contexts without the quadratic memory growth of attention.
Retrained at a stated cost of just $4,000, the 14-billion-parameter Brumby model performs on par with established transformer models like Qwen3-14B and GLM-4.5-Air, achieving near-state-of-the-art accuracy on a range of reasoning and comprehension benchmarks.
Brumby's Power Retention design offers another major advantage: hardware efficiency. Because the state update involves only local matrix operations, inference can be implemented with linear complexity in sequence length. Manifest AI reports that its fastest kernels, developed through its in-house CUDA framework Vidrial, can deliver speedups of hundreds of times over attention on very long contexts.
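Manifest AI's exact update rule is not given in this article; as a rough illustration of the recurrent, linear-complexity idea (a retention-style state update in the spirit of linear attention, with a hypothetical tiny head dimension), a fixed-size state can be carried across an arbitrarily long sequence:

```python
import random

D = 4  # hypothetical head dimension

def retention_step(state, q, k, v):
    """One recurrent step: fold (k, v) into a fixed D x D state via an
    outer-product update, then read out with q. Cost per step is O(D^2),
    independent of sequence length, so a whole sequence is linear in
    its length and memory stays constant."""
    for i in range(D):
        for j in range(D):
            state[i][j] += k[i] * v[j]  # local matrix (outer-product) update
    y = [sum(q[i] * state[i][j] for i in range(D)) for j in range(D)]
    return state, y

random.seed(0)
state = [[0.0] * D for _ in range(D)]
outputs = []
for _ in range(1000):  # long sequence, but the state never grows
    q, k, v = ([random.gauss(0, 1) for _ in range(D)] for _ in range(3))
    state, y = retention_step(state, q, k, v)
    outputs.append(y)
```

Contrast this with attention, where each new token's readout touches every previous token: here the past is compressed into a constant-size state, which is what makes long-context inference cheap.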
The release of Brumby-14B-Base is more than an engineering milestone; it is a proof of concept that the transformer's dominance may finally face credible competition. By replacing attention with power retention, Manifest AI has demonstrated that performance parity with state-of-the-art transformers is possible at a fraction of the computational cost—and that the long-context bottleneck can be broken without exotic hardware.
> ## 98% of market researchers use AI daily, but 4 in 10 say it makes errors, revealing a major trust problem
Market researchers have embraced artificial intelligence at a staggering pace, with 98% of professionals now incorporating AI tools into their work and 72% using them daily or more frequently. According to a new industry survey, the technology's transformative promise is coupled with persistent reliability problems.
The findings, based on responses from 219 U.S. market research and insights professionals, reveal both productivity gains and trust issues. While more than half of researchers — 56% — report saving at least five hours per week using AI tools, nearly four in ten say they've experienced "increased reliance on technology that sometimes produces errors."
The disconnect between productivity gains and trustworthiness has created a grand bargain in the research industry: professionals accept time savings and enhanced capabilities in exchange for constant vigilance over AI's mistakes. This dynamic may fundamentally reshape how insights work gets done.
The survey highlights the industry's reliance on AI for tasks like analyzing multiple data sources, summarizing findings, and automating insight reports. However, the technology's reliability remains a significant concern, with researchers citing issues around data quality, accuracy, and validation work.
The experience of researchers, early AI adopters who have integrated the technology into daily workflows, offers lessons about both opportunities and pitfalls. Speed genuinely matters, but productivity gains are uneven. The skills required for research are changing, with an emphasis on cultural fluency, strategic storytelling, ethical stewardship, and inquisitive insight advocacy.
The survey data suggests researchers are navigating this uncertainty by developing a form of professional muscle memory—learning which tasks AI handles well, where it tends to fail, and how much oversight each type of output requires. This tacit knowledge, accumulated through daily use and occasional failures, may become as important to the profession as statistical literacy or survey design principles.
> ## Rokid teams up with BOLON eyewear: eyewear giant EssilorLuxottica's China gambit
As 2025 enters a cold November, the AI glasses industry's restlessness is rising rather than cooling. On October 30, BOLON officially announced that pre-orders for its AI smart glasses were open, with the product going on sale at 8:00 PM that evening. This much-anticipated product was jointly developed by Rokid and BOLON, a fashion eyewear brand under the EssilorLuxottica group.
According to Rokid, the BZ5000 AI smart glasses combine a camera (12-megapixel), Bluetooth earphones (up to 6 hours of continuous music playback), AI translation, and AI Q&A in a single wearable weighing just 38g, controlled by voice or buttons.
Notably, the new product is a pair of AI glasses without a display. It goes head-to-head at home and abroad with heavyweight products such as Ray-Ban Meta and Xiaomi's AI glasses, and with several display-equipped smart glasses due to launch in a cluster this November, EssilorLuxottica's move looks especially distinctive.
EssilorLuxottica's choice in China has led many observers to ask a deeper question: with the global Ray-Ban Meta partnership already in hand, why is the company "starting fresh" in China and betting heavily on Rokid, a homegrown AI glasses company?
> ## A conversation with a Japanese BYD owner revealed how naive our idea of "going global" is
China's new-energy vehicle penetration rate is now approaching 60%, and domestic brands have fought the home market to a brutal standstill. For ambitious Chinese automakers, "going global" stopped being a choice long ago; it is a must-answer question for survival.
At the Japan Mobility Show 2025, BYD's performance in Japan, as a representative of Chinese automakers, became a microcosm of every Chinese automaker's road overseas. Each step they take abroad is far more complex, and far more challenging, than imagined.
> ## CATL pockets 18.5 billion yuan, but automakers no longer want to work for the "Battery King"
CATL is "killing off" its former self. On October 20, CATL released its Q3 2025 earnings report, one full of contradictions yet thought-provoking. On one hand, CATL's ability to make money is astonishing: at a time when everyone feels the market is weak and business is hard, its net profit grew more than 40%.
On the other hand, CATL's Q3 revenue was 104.186 billion yuan, up only 12.9% year over year, a far cry from the "rocket-like" doubling it used to post routinely. That feels like hitting the brakes, and many have begun to worry: has CATL hit a bottleneck and run out of steam?
> ## Chando, No. 3 in Chinese beauty, lives in the shadow of Proya and Chicmax
As the third-ranked domestic beauty brand, Chando lives in the shadow of Proya and Chicmax (Shangmei). Capital markets have no faith in mediocrity, and how Chando breaks out of fierce competition has become a focus of market attention.
> ## The next leg for China's semiconductors
Next, the semiconductor industry will enter the deep waters of "strengthening and filling gaps in the supply chain." Facing the global competitive landscape, how China's semiconductor industry seeks a breakthrough in the next leg has become a key focus for the sector.
> ## Summary
Among the latest developments in AI, Databricks' Judge Builder framework illustrates the complexity of AI evaluation systems, which involve not only technical challenges but also communication and collaboration problems within organizations. The release of Brumby-14B-Base marks an architectural innovation: its Power Retention technique challenges the limits of the traditional transformer model and demonstrates new hardware efficiency and cost-effectiveness. Meanwhile, the market research field's reliance on AI keeps deepening; although AI tools bring significant efficiency gains, their reliability remains a problem that cannot be ignored. Together, these developments reveal the multidimensional challenges and opportunities AI technology faces in real-world applications.
Author: Qwen/Qwen2.5-32B-Instruct
Sources: VentureBeat, 钛媒体 (TMTPost), 极客公园 (GeekPark)
Editor: 小康