Hi Everyone! And welcome to many new subscribers.
Each week I send out a succinct article highlighting events and sharing resources ML engineers should know about. Please support the authors of the resources below. A huge thanks to all my supporters! If you would like, you can support my writing for just $3/mo.
If you want to learn AI/machine learning, I created a roadmap to do it all entirely for free. Enjoy this week’s resources!
Always be (machine) learning,
Logan
What Happened Last Week
Helix: a vision-language-action model for humanoid control. Robots are really getting somewhere and it’s exciting.
Gemini Code Assist: use Gemini in your favorite code editors as an assistant. Gemini Code Assist includes a free tier for individuals.
Claude 3.7 Sonnet (and thinking): Anthropic’s next AI model is out with advanced reasoning capabilities. 3.5 Sonnet has been the best model for code for a long time, so this one has a lot of people excited. Go try it out.
Claude Code: Anthropic released an active collaborator called Claude Code that can build with you. Basically, it’s a Claude coding agent.
GPT-4.5: a long-awaited release by OpenAI that has been met with some disappointment. Users expected a more performant model, but OpenAI says performance isn’t the focus. Instead, it has improved character and more natural interactions.
Scribe: ElevenLabs’ new speech-to-text model, capable of transcribing 99 languages with high precision.
Gemini 2.0 Flash-Lite: now available via the Gemini API, making Gemini models even more affordable.
HP acquires Humane: HP acquired Humane and gave Humane execs senior roles working on HP’s AI offerings. The pin isn’t the big deal here (it flopped, as many predicted), but HP did acquire a talented engineering team.
The main takeaway from this week has been about coding with AI. Is it useful? Does it cause software engineering skills to atrophy? Which newly released models are the best?
There have been so many demos on social media touting the newest AI coding capabilities as replacements for software engineers. There have also been many takes claiming AI coding ruins a software engineer’s skill set. Both have merit, but the situation is very nuanced.
I’m going to release an article about this on Tuesday, so make sure to subscribe and stay tuned. If you want a tl;dr right now: used properly, AI coding has an enormous productivity upside. Used improperly, it becomes an extreme negative.
Last week’s update, in case you missed it
The most important resources this week
MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents
MedAgentBench is a new evaluation suite designed to benchmark the capabilities of large language models (LLMs) in medical environments, consisting of 300 tasks and 100 patient profiles. It provides a realistic, interactive setting to assess LLMs' performance on complex clinical tasks beyond traditional question answering. This tool aims to improve the integration of AI agents into healthcare, helping to reduce physician burnout and enhance clinical workflows.
How I use LLMs
By Andrej Karpathy
Andrej Karpathy shares his approach to using large language models (LLMs) for various tasks. He emphasizes the importance of prompt engineering to get effective results. Karpathy also discusses the benefits of LLMs in enhancing productivity and creativity.
Character training and the secret arts of post-training
Most evaluations for post-training in AI focus on internal metrics rather than widely recognized assessments, leaving character training largely unexplored. Character training aims to develop specific personality traits in models, enhancing user interaction, but remains undocumented in academic literature. As AI models evolve, understanding and measuring these changes becomes crucial for both developers and users.
Robotics for software engineers
Robotics is an exciting field growing rapidly with significant investment from companies like Tesla and Boston Dynamics. The process of building robots involves a mix of software and hardware, requiring careful planning and execution to avoid common pitfalls. As robotics technology advances, it holds the potential to transform industries and address labor shortages.
The $50 Billion Fraud Machine: How UnitedHealth and QuantaFlo Are Gaming Medicare [Guest]
UnitedHealth and Semler Scientific exploit the QuantaFlo device to inflate Medicare reimbursements through questionable diagnosis practices, particularly for Peripheral Artery Disease. This system leads to billions in overpayments while prioritizing profits over genuine patient care. Investigations highlight a widespread fraud scheme involving insurers and diagnostic tools that fail to meet valid healthcare standards.
Claude 3.7 thonks and what's next for inference-time scaling
Anthropic's new model, Claude 3.7 Sonnet, enhances performance by using more inference time tokens and introduces Claude Code for coding tasks. While it shows solid improvements over previous versions, it reflects gradual advancements rather than a major leap in the AI industry. The model's design allows for better control over reasoning and response phases, indicating a shift towards more user-friendly AI tools.
A new generation of AIs: Claude 3.7 and Grok 3
Claude 3.7 and Grok 3 represent a new generation of AI models that are significantly more powerful than previous versions, thanks to advanced training methods and increased computing power. These models can handle complex tasks and creative work more effectively, positioning themselves as valuable partners in intellectual endeavors rather than just automation tools. As AI capabilities improve and costs decrease, organizations need to rethink their strategies for integrating AI into their workflows.
Ranking the Top AI Models of 2025
The ranking of AI models in 2025 focuses on three key factors: input size, cost, and additional features. Gemini 2 Flash stands out for its large input capacity and low cost, making it suitable for business applications. While open-source models are cost-effective, premium models like ChatGPT excel in features and usability for various tasks.
Open Deep Research
Open Deep Research focuses on enhancing research capabilities using advanced AI tools. It emphasizes the importance of accessibility and collaboration in academic work. The platform aims to streamline the research process for users.
Training code generation models to debug their own outputs
Training large language models (LLMs) with both supervised fine-tuning and reinforcement learning significantly enhances their ability to debug code, improving success rates by up to 39%. The researchers created a synthetic debugging dataset by generating buggy code and prompting LLMs to diagnose and fix errors. Their experiments showed that the updated models consistently outperformed those relying solely on prompt engineering.
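To make the dataset idea concrete, here’s a minimal sketch of what one synthetic debugging example might look like. This is my own toy stand-in, not the paper’s pipeline: the actual work prompts an LLM to introduce and repair bugs, while this sketch injects a bug by string substitution and captures the failing test’s traceback as the signal a model would learn to reason over. All names (`make_debug_example`, the `mean` function) are hypothetical.

```python
import traceback

def make_debug_example(correct_src: str, bug: tuple[str, str], test_call: str) -> dict:
    """Build one (buggy code, error trace, fixed code) training triple."""
    # Inject a bug by swapping one snippet for another (toy stand-in
    # for LLM-generated bugs in the real pipeline).
    buggy_src = correct_src.replace(*bug)

    # Run the buggy code against a test to capture the failure signal.
    ns: dict = {}
    exec(buggy_src, ns)
    try:
        exec(test_call, ns)
        trace = ""  # the bug did not surface on this particular test
    except Exception:
        trace = traceback.format_exc(limit=1)

    return {"buggy": buggy_src, "trace": trace, "fixed": correct_src}

correct = "def mean(xs):\n    return sum(xs) / len(xs)\n"
example = make_debug_example(
    correct,
    bug=("sum(xs)", "max(xs)"),           # injected logic bug
    test_call="assert mean([1, 2, 3]) == 2",
)
```

A real pipeline would scale this up with LLM-written bugs and fixes across many programs, but the shape of each training example (buggy code, observed failure, corrected code) is the same.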
I’ve got ~30 more resources I found interesting this week below for paid subscribers. Thanks for supporting Society’s Backend 😊. If you want to support Society’s Backend, you can do so for just $3/mo. You’ll get extra resources and articles.