Did xAI lie about Grok 3?

February 24, 2025

Welcome back everyone! Hope you had a great weekend. Here’s what you need to know about AI today:

👉 xAI is accused of lying about Grok 3

👉 1X introduces its new humanoid robot, Neo Gamma

👉 Nvidia CEO says the market misunderstood the impact of DeepSeek-R1

and many more!

📧 Did someone forward you this email? Subscribe here for free to get the latest AI news every day!

Read time: 5 minutes

XAI
xAI is being accused of misrepresenting Grok 3's benchmarks

Source: xAI | Performance of Grok 3 in AIME and GPQA benchmarks

What’s going on: xAI is facing accusations that it overstated Grok 3’s performance. The company recently published a blog post with a graph of Grok 3’s results on AIME 2025, a set of challenging math problems often used to evaluate AI models. The graph showed two variants, “Grok 3 Beta (Think)” and “Grok 3 mini Beta (Think),” outperforming OpenAI’s top model, o3-mini-high. OpenAI researchers quickly pushed back in an X post, accusing xAI of presenting misleading data by omitting key details that would give a fuller picture of how the models compare.

What does it mean: The dispute is about how the benchmark scores were reported. OpenAI pointed out that xAI’s graph left out o3-mini-high’s AIME 2025 score under a metric known as “consensus@64,” which gives a model 64 attempts at each problem and takes the most frequent answer as the final result, a method that can significantly boost scores. Judged on the first attempt alone, both Grok 3 variants scored lower than o3-mini-high, and Grok 3 Beta (Think) even fell slightly behind OpenAI’s o1 model under medium compute settings.
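
To see why consensus@64 can lift a score so much, here is a minimal Python sketch of the majority-vote idea described above. Everything in it is hypothetical: the 40% per-attempt success rate, the four wrong answers, and the function names are illustrative assumptions, not data from xAI or OpenAI.

```python
import random
from collections import Counter


def consensus_answer(answers):
    """Return the most frequent answer among the sampled attempts."""
    return Counter(answers).most_common(1)[0][0]


def simulate(p_correct=0.4, n_attempts=64, n_problems=1000, seed=0):
    """Compare first-attempt accuracy with consensus accuracy.

    Assumption (hypothetical): each attempt is correct with probability
    p_correct; otherwise it lands on one of four wrong answers at random.
    """
    rng = random.Random(seed)
    wrong_answers = ["A", "B", "C", "D"]
    first_hits = 0
    consensus_hits = 0
    for _ in range(n_problems):
        answers = [
            "OK" if rng.random() < p_correct else rng.choice(wrong_answers)
            for _ in range(n_attempts)
        ]
        first_hits += answers[0] == "OK"
        consensus_hits += consensus_answer(answers) == "OK"
    return first_hits / n_problems, consensus_hits / n_problems


first_attempt_score, consensus_score = simulate()
print(f"first attempt: {first_attempt_score:.2f}, consensus@64: {consensus_score:.2f}")
```

In this toy setup the correct answer wins the vote far more often than it shows up on any single attempt, because the wrong attempts are split across several different answers. That is why a chart that mixes first-attempt scores and consensus@64 scores is not a like-for-like comparison.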

More details:

In response, xAI’s Igor Babuschkin countered on X that OpenAI had skewed its own benchmark presentations in similar ways in the past.
Beyond the back-and-forth, the episode points to a deeper problem in AI development: the lack of standardized, transparent metrics for evaluating model performance.
While benchmark scores dominate headlines, they often hide critical factors like the computational and financial costs behind those results.

ROBOTICS
1X is building a humanoid robot for the home

Source: 1X | Neo Gamma

What’s going on: Norwegian robotics company 1X has introduced its latest product, the Neo Gamma, a humanoid robot designed specifically for home use. Unlike its predecessors, the Neo Gamma is built to perform everyday household tasks such as making coffee, doing laundry, and vacuuming, with the goal of eventually being tested in real home environments.

What does it mean: While still in the early stages, 1X aims to differentiate itself by focusing on domestic applications rather than the industrial settings favored by many competitors such as Figure AI, Agility, Apptronik, and Tesla. The Neo Gamma embodies a softer, safer design compared to typical industrial robots. Featuring a friendly appearance and a knitted nylon suit, the robot is engineered to minimize injury risks during human interactions.

More details:

1X is backed by OpenAI and recently acquired Kind Humanoid, a Bay Area startup. This move has fueled speculation about the origins of Neo Gamma’s enhanced features, particularly its improved speech and body language.
For now, the robot remains a proof-of-concept, with limited in-home testing planned and full commercial deployment still years away. If you want to be updated about Neo Gamma’s latest status, join their waitlist.
Interested in 1X’s journey and the first generation of its robots? Visit their YouTube Channel.

🖥 OpenAI plans to shift the majority of its data center capacity from Microsoft to SoftBank by 2030.

🤔 Nvidia CEO Jensen Huang said the market misread the impact of DeepSeek’s open-source R1 reasoning model. Investors believed it would reduce the need for computing resources, but Huang argues it will accelerate AI adoption and create opportunities for more efficient models, ultimately benefiting Nvidia.

⚡ Microsoft has canceled leases for a substantial amount of data center capacity in the US, totaling a couple of hundred megawatts, possibly due to concerns about overbuilding AI computing resources.

💰 Alibaba has committed to investing over 380 billion yuan (approximately $53 billion) in AI infrastructure, including data centers, over the next three years.

🔍 Genspark, an AI search startup, has raised $100 million at a $530 million valuation. It aims to challenge Google's dominance in search by offering AI-generated results with direct answers and citations. Check out their product.

📽 Veo 2, Google's new AI video model, will cost users 50 cents per second of video generated, which means $30 per minute or $1,800 per hour, while OpenAI's Sora is available to ChatGPT Pro subscribers for $200 a month.
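
The pricing math above is easy to check. The short Python sketch below uses only the two prices quoted here; the function and constant names are hypothetical. The break-even figure is a rough comparison only, since this item does not say how much Sora video a ChatGPT Pro subscription covers.

```python
VEO2_USD_PER_SECOND = 0.50      # price quoted above
SORA_PRO_USD_PER_MONTH = 200.0  # ChatGPT Pro price quoted above


def veo2_cost(seconds):
    """Cost in USD of generating `seconds` of video with Veo 2."""
    return seconds * VEO2_USD_PER_SECOND


per_minute = veo2_cost(60)    # 30.0
per_hour = veo2_cost(3600)    # 1800.0

# Seconds of Veo 2 video that cost the same as one month of ChatGPT Pro.
break_even_seconds = SORA_PRO_USD_PER_MONTH / VEO2_USD_PER_SECOND  # 400.0

print(per_minute, per_hour, break_even_seconds)
```

In other words, $200 buys 400 seconds of Veo 2 output, a little under seven minutes of video.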

🚀 Microsoft launched Azure AI Foundry Labs, a new hub for developers, startups, and enterprises to explore cutting-edge AI innovations, such as Muse, a World and Human Action Model (WHAM), Aurora for weather forecasting, ExACT for improved search efficiency, and TamGen for drug design.

Krisp

Krisp cancels background noise, records, transcribes, and summarizes meetings and calls.

krisp.ai

Stockimg

Stockimg is an all-in-one design and content creation tool powered by AI. You can easily generate logos, illustrations, wallpapers, posters, and more.

stockimg.ai

Suno

Make music with AI.

suno.com/home

AI + Work-life balance

Draft a plan to improve my work-life balance over the next month. Include specific actions I can take to reduce stress at work while ensuring I dedicate time to personal interests and family activities.

Gemini 2.0 Flash’s answer

Meta - AI Research Scientist, Embodied AI (PhD)

VISA - Data Scientist (Visa Predictive Models)

Apple - Applied Machine Learning Engineer - Customer Feedback

TikTok - Machine Learning Engineer - Search Engine - E-Commerce Alliance

Thank you for staying with us, as always! If you are not subscribed, subscribe here for free to get more of these emails in your inbox! Cheers!