CodeWithLLM-Updates
-
🤖 AI tools for smarter coding: practical examples, step-by-step instructions, and real-world LLM applications. Learn to work efficiently with modern code assistants.

Superset - a multiterminal for Agents
https://superset.sh/
Currently Mac only, with Windows and Linux versions planned. It is an Electron-based app, a terminal with tabs specifically adapted to manage multiple agents like Claude Code, OpenCode, OpenAI Codex, and others simultaneously.

It automatically creates isolated git worktrees (best practice), sets up environments, isolates tasks to avoid conflicts, adds notification hooks, and includes a built-in diff-viewer for quick review of changes and PR creation. Future plans include cloud workspaces, context sharing between agents, and orchestration.
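
For reference, the worktree isolation it automates comes down to a couple of git commands; here is a minimal sketch of the idea (the branch and path naming is my own, and the actual setup steps Superset performs are not documented here):

```python
import subprocess
from pathlib import Path

def create_agent_worktree(repo: Path, task: str) -> Path:
    """Create an isolated git worktree + branch for one agent task.
    Branch/path naming is illustrative, not Superset's actual scheme."""
    branch = f"agent/{task}"
    worktree = repo.parent / f"{repo.name}-{task}"
    # A new branch checked out in its own directory, so parallel agents
    # never touch each other's working files.
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(worktree)],
        check=True,
    )
    return worktree

# Example: three agents, three isolated checkouts of the same repository
# for task in ("fix-auth-bug", "update-docs", "refactor-api"):
#     create_agent_worktree(Path("~/projects/myapp").expanduser(), task)
```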

Logic-wise, it is similar to https://github.com/tmux/tmux, the well-known "terminal of terminals" for Unix-like systems (Linux, macOS, BSD, etc.), which allows creating and managing multiple sessions in a single window using panes and windows.

Mysti as a team of agents
https://github.com/DeepMyst/Mysti
A VS Code extension that lets you combine any two different models (with a shared context) in Brainstorm Mode to get higher-quality advice. It removes the need to hop between different paid AI subscriptions just to get a second opinion on complex architectural decisions. It currently supports models from Claude, Codex, Gemini, and GitHub Copilot CLI.
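
Under the hood, "two models, shared context" can be approximated with two API calls where the second model sees the first one's answer. A rough sketch using the OpenAI and Anthropic Python SDKs (the model names and exchange format are illustrative; this is not Mysti's actual implementation):

```python
from openai import OpenAI          # pip install openai
import anthropic                   # pip install anthropic

question = "Should we split the billing service out of the monolith?"

# Opinion 1: an OpenAI model (model name illustrative)
gpt_answer = OpenAI().chat.completions.create(
    model="gpt-5.1",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# Opinion 2: a Claude model, shown the first answer as shared context
claude_answer = anthropic.Anthropic().messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{question}\n\nAnother model suggested:\n{gpt_answer}\n\n"
                   "Where do you disagree, and what did it miss?",
    }],
).content[0].text

print(claude_answer)
```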

HN Discussion
https://news.ycombinator.com/item?id=46365105
The community shows significant interest in the idea of multi-agent collaboration, actively sharing personal workflows and alternative tools. Many participants experiment with similar approaches manually (e.g., via Tmux panes with several CLI agents) and believe that debates between models help identify weak ideas and improve solutions, especially when one model gets "stuck."

Regarding Mysti, there is criticism of its dependency on VS Code, as many users prefer a pure CLI experience.

New Models

Gemini 3 Flash
https://blog.google/products/gemini/gemini-3-flash/
Google is gradually rolling out the smaller multimodal agentic model of the new series; benchmarks suggest it sits closer to Gemini 3 Pro than to Gemini 2.5 Flash. The model outperforms Gemini 2.5 Pro in many tests while being three times faster and significantly cheaper. In some benchmarks, it even surpasses flagship models from other companies.

Since its release, Gemini 3 Flash has become the default model in the Gemini mobile app (replacing 2.5 Flash) and in Google Search's AI Mode. In my Gemini CLI, neither 3 Flash nor 3 Pro has appeared yet—they can be accessed via Google AI Studio.

GLM-4.7
https://z.ai/blog/glm-4.7
Zhipu AI has updated its GLM model. Version 4.7 shows significant progress over GLM-4.6 in multilingual code generation scenarios. It supports "thinking before acting" in frameworks like Claude Code, Kilo Code, Cline, and Roo Code, ensuring stability in complex tasks. Interface generation quality has also been improved.

Model weights (MoE architecture, up to 200K token context) are publicly available on Hugging Face and ModelScope for local deployment. Access is available via Z.ai API, OpenRouter, the z.ai chat interface, and a special GLM Coding Plan ($3 for the first month, then $6).
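
Since OpenRouter exposes an OpenAI-compatible endpoint, trying the model takes only a few lines; a sketch (the model slug is my assumption, check the OpenRouter catalog for the exact id):

```python
from openai import OpenAI  # OpenRouter speaks the OpenAI-compatible API

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

# The model slug is an assumption -- verify it in the OpenRouter catalog.
resp = client.chat.completions.create(
    model="z-ai/glm-4.7",
    messages=[{"role": "user",
               "content": "Write a Rust function that parses an ISO-8601 date."}],
)
print(resp.choices[0].message.content)
```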

MiniMax M2.1
https://www.minimax.io/news/minimax-m21
The Chinese company MiniMax has released M2.1, an improved version of its M2 model focused on practical development and agentic systems. The model is reported to be significantly better at non-Python programming languages (Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, JavaScript, etc.), outperforming Claude Sonnet 4.5 and approaching Claude Opus 4.5 in multilingual scenarios.

The model is open-source. API costs are quite low, about 10% of Claude Sonnet. It is compatible with popular agents like Claude Code, Droid (Factory AI), Cline, Kilo Code, Roo Code, and BlackBox, and supports context mechanisms (Skill.md, agent.md, etc.).

They also have a web platform at https://agent.minimax.io/ where you can test how the model builds applications.

https://www.youtube.com/watch?v=kEPLuEjVr_4

SWE-bench Verified comparison: Gemini 3 Flash 78%, MiniMax M2.1 74%, GLM-4.7 73.8%.

MCP as an independent standard
https://aaif.io/
https://openai.com/index/agentic-ai-foundation/
In December 2025, Anthropic transferred the Model Context Protocol (MCP) to the Agentic AI Foundation (AAIF), a dedicated foundation managed by the Linux Foundation. MCP is one of the founding projects of the new organization, alongside goose by Block and AGENTS.md by OpenAI.

Agent Skills as an open standard
https://claude.com/blog/organization-skills-and-directory
https://agentskills.io and https://claude.com/connectors
Agent Skills was announced as an independent open standard on December 18, 2025, with a specification and SDK; it was not transferred to the Linux Foundation or AAIF. Microsoft has already adopted Agent Skills in VS Code and GitHub Copilot; it is also supported by Cursor, Goose, Amp, and OpenCode where Anthropic models are available.
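
The core of the standard is simple: each skill is a folder with a SKILL.md whose frontmatter (name, description) is all the agent sees until the skill is actually invoked. A simplified sketch of that discovery step (the directory layout follows the spec, the code and its naive frontmatter parsing are mine; a real implementation would use a YAML parser):

```python
import re
from pathlib import Path

def skill_index(skills_dir: Path) -> list[dict]:
    """Collect just the name/description frontmatter of each SKILL.md.
    This mirrors the standard's progressive disclosure: only this tiny
    index sits in context until a skill is actually needed."""
    index = []
    for skill_md in skills_dir.glob("*/SKILL.md"):
        text = skill_md.read_text(encoding="utf-8")
        m = re.search(r"^---\n(.*?)\n---", text, re.DOTALL)
        meta = dict(
            line.split(":", 1) for line in m.group(1).splitlines() if ":" in line
        ) if m else {}
        index.append({
            "name": meta.get("name", skill_md.parent.name).strip(),
            "description": meta.get("description", "").strip(),
            "path": str(skill_md),   # full body is loaded only on activation
        })
    return index
```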

Agent Skills Playground
https://skillsplayground.com/
On this site, by entering your API key, you can experiment with how different models utilize various skills.

Claude Code 2.0.74
Added the LSP tool (Language Server Protocol) for code intelligence features such as go-to-definition, find references, and hover tooltips. This significantly improves the development experience, making code navigation faster and more convenient. For now, the agent rarely uses LSP on its own. The open-source project OpenCode has had LSP support for about six months, so the slow pace of the proprietary tool is surprising.
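
For context, LSP is just JSON-RPC: a go-to-definition query is one small message, which is why it is so much cheaper than having the agent grep the whole repository into its context. A sketch of what the request looks like on the wire (the file path and position are made up):

```python
import json

# A go-to-definition request as it travels over stdio (JSON-RPC with a
# Content-Length header). The agent's LSP tool wraps messages like this.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///work/app/src/billing.py"},
        "position": {"line": 41, "character": 17},  # 0-based
    },
}
body = json.dumps(request)
message = f"Content-Length: {len(body.encode('utf-8'))}\r\n\r\n{body}"
# The language server replies with the Location(s) of the symbol's definition.
```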

Cursor Visual Editor
https://cursor.com/blog/browser-visual-editor
Cursor introduced a visual editor with a "Point and Prompt" feature: you can simply click on any interface element and describe in text what needs to be changed. It also allows manipulating the site structure using drag-and-drop elements in the DOM tree, changing button order or grid settings.

https://www.youtube.com/watch?v=1S8S89X-xbs

The editor's sidebar provides visual control over component properties (props) and styles: from typography sliders to a color palette. The update aims to blur the line between design and programming, allowing developers to focus on ideas rather than mechanical code work.

Claude Code Plugins
https://code.claude.com/docs/en/plugin-marketplaces
Anthropic launched a plugins marketplace, seemingly in response to a similar one in Gemini CLI. It is not a separate website with an App Store-like interface. It's a system within Claude Code itself, where marketplaces are plugin catalogs (often based on GitHub repositories) that are added and managed via slash commands.

https://www.youtube.com/watch?v=1uWJC2r6Sss

Also, there are now prompt suggestion variants and a hotkey for switching models during a prompt. Subagents can work in parallel. Improved usage statistics and a visual fill-indicator for the context window have been added.

You can now run Claude Code tasks directly from the Claude Android mobile app. This is not a full-fledged terminal on the phone, but an asynchronous integration where Claude runs in the cloud.

Kiro Powers
https://kiro.dev/docs/powers/
Kiro is testing a concept called Powers, which tackles context-window clutter through dynamic tool activation: the system analyzes the user's query and enables only the necessary "knowledge pack." This is very similar to "Skills" in Anthropic's models.

When many tools (MCP servers) are connected to an agent, it is forced to load hundreds of function descriptions at once. This can eat up to 40% of the limit before work even begins and leads to irrelevant advice. Instead, each Power is a ready-made pack containing instructions (how and when to use the tools), server configuration, and automated scenarios.

For example, if you mention "payment," the Power for Stripe is activated, providing specific knowledge about the API and security. As soon as you move to working with a database, Stripe tools are disabled, and instead, the Power for Supabase or Neon is loaded. This allows the agent to remain fast, focus on a specific topic, and produce higher quality code.
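
A toy version of that activation logic might look like this (Kiro presumably does the matching with the model rather than keywords, and the pack contents below are invented):

```python
# Map trigger topics to "knowledge packs" and expose only the matching
# pack's tools to the agent; everything else stays out of the context.
POWERS = {
    "stripe":   {"triggers": {"payment", "invoice", "checkout"},
                 "tools": ["stripe_create_charge", "stripe_list_invoices"]},
    "supabase": {"triggers": {"database", "table", "sql"},
                 "tools": ["supabase_query", "supabase_migrate"]},
}

def active_tools(user_query: str) -> list[str]:
    words = set(user_query.lower().split())
    tools: list[str] = []
    for power in POWERS.values():
        if power["triggers"] & words:      # activate only on a topic match
            tools.extend(power["tools"])
    return tools

print(active_tools("Add a payment retry to the checkout flow"))
# -> only the Stripe tools get loaded; Supabase stays dormant
```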

The system offers an open ecosystem with one-click installation for popular services (AWS, Figma, Stripe, etc.).

Mistral Devstral 2 and Vibe
https://mistral.ai/news/devstral-2-vibe-cli
The European company Mistral AI is known for building LLMs independent of the US and China. It has updated its coding model and finally released its own CLI. These announcements matter a great deal for the open-source AI ecosystem in software development.

https://openrouter.ai/mistralai/devstral-2512:free
The new generation is called Devstral 2 (123B) and Devstral Small 2 (24B), released under flexible licenses: a modified MIT for Devstral 2 and Apache 2.0 for Devstral Small 2. Devstral 2 scores an impressive 72.2% on SWE-bench, a strong result for an open model.

The Small version can run locally on NVIDIA hardware, while the larger model (being a dense model rather than MoE) requires serious hardware such as a Mac Studio or several 3090/4090 GPUs.

Currently, Devstral 2 is offered for free via API. The model is already available in Kilo Code and Cline. According to feedback, it is quite mediocre at generating websites, frontend, and animation — it works better with small tasks involving local Python scripts.

https://help.mistral.ai/en/articles/496007-get-started-with-mistral-vibe
Mistral Vibe CLI is an open-source command-line tool in the vein of Claude Code; it runs on Windows, macOS, and Linux and is based on the Devstral models. It can also be run in Zed. It offers interface themes, Git integration, MCP support, and agents with custom settings, and it supports both interactive and autonomous operation.

https://news.ycombinator.com/item?id=46205437
Commenters noted that the name "Vibe" makes the product sound geared toward vibe-coding ("play around with an agent and let it churn something out") rather than controlled work by a professional programmer. Some call this message the opposite of what real work needs: augmenting humans, not replacing the process with "chat + tools, good luck."

Mintlify Autopilot
https://www.mintlify.com/blog/autopilot
An AI-powered system that monitors changes in your repository. On every push, it analyzes what needs to be updated in the documentation (both for humans and for AI agents). In the Autopilot dashboard, it shows which changes might require documentation updates. Then the Mintlify agent automatically generates a draft that you can review and refine. It takes into account the code context and the existing tone/style of your documentation.
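
This is not Mintlify's implementation, but the underlying check is easy to picture: compare what changed in the push against a map of which docs describe which parts of the code, then draft updates for the stale ones. A minimal sketch (the mapping and paths are hypothetical):

```python
import subprocess

def changed_files(base: str = "origin/main") -> list[str]:
    """Files touched since the base branch, straight from git."""
    out = subprocess.run(["git", "diff", "--name-only", base, "HEAD"],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

# Hypothetical mapping from source areas to the docs that describe them
DOC_MAP = {"src/api/": "docs/api-reference.md", "src/cli/": "docs/cli.md"}

stale_docs = {doc for path in changed_files()
              for prefix, doc in DOC_MAP.items() if path.startswith(prefix)}
print("Docs that may need an update:", stale_docs or "none")
```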

Code Wiki
https://codewiki.google/
Google launched Code Wiki (currently in public preview) — a platform designed to solve the problem of AI (and humans) reading and understanding existing codebases. The system creates and continuously maintains a structured wiki page for the entire repository.

Key features: full automation, Gemini-powered chat, hyperlinked answers that point directly to code files. The system automatically generates and keeps up-to-date architectural diagrams, class diagrams, sequence diagrams, and detailed descriptions.

There is a waitlist for the upcoming version of Code Wiki that will allow teams to run the exact same system locally and securely on internal, private repositories.

Qoder Repo Wiki
https://docs.qoder.com/user-guide/repo-wiki
A feature inside the Qoder IDE that automatically generates structured documentation for a project (up to 10,000 files per project, in English and Chinese) and continuously tracks changes in both the code and the documentation itself.

It deeply analyzes project structure and implementation details, providing rich context that helps AI agents work more effectively. Wiki generation is fully dynamic.

Full Git synchronization is supported. Generated content is stored in language-specific directories (e.g., repowiki/zh/, repowiki/en/), which can be committed and pushed like regular code. The initial wiki is created with one click (up to ~120 minutes for 4,000 files). After that, the system continuously watches for code changes and can update only the affected sections when modifications are detected.

Originally the feature worked only with Git repositories, but as of December 2, 2025, they added support for generating wikis from local projects without Git.

DeepWiki (by Cognition AI)
https://deepwiki.com/
A free AI tool that turns any GitHub repository (public or private) into a Wikipedia-style knowledge base. It analyzes code, READMEs, and configs, then creates structured pages with architectural and flow diagrams, interactive code hyperlinks, and a natural-language chat interface for asking questions.

Already supports >30,000 open repositories with automatic updates after new commits. An open-source version is available for local/self-hosted deployment.

MCP Standard Year in Review
https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/
The blog post describes how in one year, MCP transformed from a small open-source experiment into a de facto standard in the industry. Major companies like Notion, Stripe, GitHub, OpenAI, Microsoft, and Google have created their own servers to automate workflows. For centralized discovery and management of these servers, the MCP Registry was launched, becoming a single catalog for the entire ecosystem.

Coinciding with the anniversary, the team is releasing a new version of the MCP specification (November 2025). Key innovations include support for task-based workflows (for long-running operations), simplified and more secure authorization mechanisms, and an extensions system that allows adding specific functionality without changing the core protocol.

MCP Container Catalog
https://hub.docker.com/mcp
The site hosts a large library of ready-to-use, containerized MCP servers created by the developer community and powered by Docker technology. The servers are grouped by categories. The platform's goal is to simplify the use of MCP tools.

MCP Problems
https://www.youtube.com/watch?v=4h9EQwtKNQ8

The author argues that while MCP is a great idea in theory, in practice it has a serious problem that makes it ineffective. This problem is poor context management.

When an AI agent connects to MCP servers, all descriptions of available tools (tool definitions) are loaded into the language model's "context window." When the agent uses a tool, all results of its work (including intermediate data that may be unnecessary) are also sent to the context.

This consumes a huge number of tokens. The author gives an example where just two connected servers take up 20k tokens. It only gets worse with each iteration. The author calls this problem "context rot".
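
The arithmetic is easy to reproduce: tool definitions are JSON schemas, and at a rough ~4 characters per token a couple of chatty servers reach the ~20k figure quickly. The numbers below are invented, sized only to illustrate the scale:

```python
import json

def fake_tools(prefix: str, n: int) -> list[dict]:
    # Invented tool schemas, padded to a realistic description length.
    return [{"name": f"{prefix}_tool_{i}", "description": "x" * 1200,
             "inputSchema": {"type": "object", "properties": {}}} for i in range(n)]

def estimate_tokens(schemas: list[dict]) -> int:
    return sum(len(json.dumps(s)) for s in schemas) // 4   # ~4 chars per token

loaded = fake_tools("github", 35) + fake_tools("browser", 25)
print(estimate_tokens(loaded), "tokens consumed before the first user message")
```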

Agent Skills as an Alternative
https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
https://www.youtube.com/watch?v=fOxC44g8vig

The solution Cloudflare once proposed: after finding the right tool, the agent generates code (e.g., TypeScript) that calls the API directly. Anthropic later built on this idea with Agent Skills - a deep dive into the technology is available at https://leehanchung.github.io/blogs/2025/10/26/claude-skills-deep-dive/
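
The shape of that idea, sketched here in Python rather than TypeScript: the agent writes a short script against the API, runs it in a sandbox, and only the printed summary returns to the context window (the endpoint and fields are made up):

```python
import json
import urllib.request

def fetch(url: str) -> list[dict]:
    # Hypothetical API call made from agent-generated code.
    with urllib.request.urlopen(url) as r:
        return json.load(r)

issues = fetch("https://api.example.com/repo/issues?state=open")
stale = [i["id"] for i in issues if i["comments"] == 0]

# The hundreds of raw issue objects never hit the context window --
# the model only ever sees this one summary line.
print(f"{len(stale)} open issues with no comments: {stale[:10]}")
```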

Opus 4.5 in Claude Code
https://www.anthropic.com/news/claude-opus-4-5
Following Sonnet and Haiku, Anthropic's largest model has also been updated to version 4.5. The problem with version 4 was its very high token price - it is now 3 times (!) cheaper. Accordingly, overall usage limits have been raised significantly, and Opus 4.5 has been added to Claude Code as an enhanced planning mode. It is capable of "creative problem-solving" - finding non-standard but legitimate solutions.

The main point is that the model uses significantly fewer tokens to achieve results, making it faster and cheaper to use.

Of course, in their programming and tool use tests, Opus 4.5 surpasses both Gemini 3 Pro and GPT-5.1. In one of the tests, the model handled a complex engineering task better than any human candidate. The model manages context and memory more effectively and can coordinate the work of several "sub-agents" to perform complex tasks.

https://www.youtube.com/watch?v=5O39UDNQ8DY

Claude Code is now available in a desktop application (research preview). You can run several local and remote Claude Code sessions in parallel: one agent fixes bugs, another explores GitHub, and a third updates documentation. Git worktrees are used for parallel work with repositories.

Now we wait for the model to show up in most of the modern AI coding applications.

https://news.ycombinator.com/item?id=46037637
Several users emphasized that it's not the price per token that matters as much as the "cost per successful task." A smarter model, like Opus 4.5, makes fewer mistakes and requires fewer tokens to solve a problem, which can ultimately make it cheaper than "cheaper" but less intelligent models.
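
The point is easy to put into numbers. All figures below are invented; the takeaway is only that retries and wasted tokens dominate the bill, not the per-token price:

```python
# Back-of-the-envelope "cost per successful task" comparison (invented numbers).
cheap  = {"cost_per_mtok": 1.0, "tokens_per_attempt": 150_000, "success_rate": 0.30}
strong = {"cost_per_mtok": 5.0, "tokens_per_attempt": 30_000,  "success_rate": 0.90}

def cost_per_success(m: dict) -> float:
    attempt_cost = m["cost_per_mtok"] * m["tokens_per_attempt"] / 1_000_000
    return attempt_cost / m["success_rate"]   # expected attempts = 1 / success_rate

print(f"cheap model:  ${cost_per_success(cheap):.3f} per solved task")
print(f"strong model: ${cost_per_success(strong):.3f} per solved task")
```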

Clad Labs
https://www.cladlabs.ai/blog/introducing-clad-labs
If Y Combinator is giving money for this, then we are probably close to the end of the AI bubble (see https://aibubblemonitor.com/). I couldn't download it because I didn't pass the Private Beta Access code verification :)

The website seems to be a speculative design, featuring an announcement of a development environment (IDE) called Chad IDE from Clad Labs. The article is written as a typical tech startup announcement and contains fabricated (?) positive reviews, impressive metrics (e.g., "saved 2.3 hours per day"), and an absurd development plan that includes "personalized casinos" and "algorithms for developer dating."

Chad IDE proposes not to fight distractions but to integrate them directly into the IDE, keeping the developer's attention inside the work environment even while waiting. The feature list includes integration with the online casino Stake.us for placing bets during code compilation, built-in mini-games, the ability to watch YouTube Shorts, Instagram Reels, and TikTok directly in the code editor window, and Tinder integration for swiping profiles while code is being generated.

https://www.youtube.com/watch?v=1uR4QPtXF4Y

To explain Y Combinator's decision, the video author delves into their investment philosophy. He emphasizes that YC bets not on ready-made ideas, but on founders, and consciously makes many risky investments. Many of their most successful companies, such as Twitch or DoorDash, initially seemed absurd or even doomed. YC understands that the best ideas often disguise themselves as strange or incomprehensible, so they are willing to take risks with teams that show potential, even if their current product looks like a joke.

Almost all major AI coding projects have already added the Gemini 3 Pro model. Antigravity itself often throws connection errors and works very slowly.

GPT 5.1 Codex
https://openai.com/index/gpt-5-1-codex-max/
After updating the regular model to GPT 5.1, its Codex variant was also updated. To respond to Google, they also announced GPT-5.1-Codex-Max. Everything, of course, got even better. And it will be able to work even longer.

Compaction technology allows the model to work with millions of tokens without losing important information. In practice, this opens up possibilities for refactoring entire projects and autonomous work for many hours straight.
The model is already available in the Codex tool, and API access will appear soon.
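
Schematically, compaction means folding older turns into a model-written summary once the transcript nears the limit so the session can keep going; this sketch is my reading of the idea, not OpenAI's actual mechanism:

```python
def compact(history: list[str], limit_tokens: int, summarize) -> list[str]:
    """Replace older turns with a summary when the transcript gets too long.
    `summarize` would be a call to the model itself; token counting is crude."""
    def tokens(msgs: list[str]) -> int:
        return sum(len(m) // 4 for m in msgs)        # ~4 chars per token

    if tokens(history) > limit_tokens:
        older, recent = history[:-4], history[-4:]   # keep the freshest turns verbatim
        history = [summarize("\n".join(older))] + recent
    return history
```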

https://news.ycombinator.com/item?id=45982649
In the discussion, people share success stories. For example, one developer told how Codex completely rewrote key parts of his hobby project (a flight simulator) in 45 minutes, performing complex refactoring that would have taken a human much longer.

Users also compare the new product with Google Gemini and, while noting certain advantages of the latter, most agree that for code generation, Codex still provides consistently better results, "hallucinates" less often, and integrates better into existing code.

An interesting conversation about the "folk magic" of prompt engineering: people add special "canary words" to their instructions (e.g., asking the model to always address them as "Mr. Tinkleberry") to check if the AI is still carefully following the initial instructions in long sessions.
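
Automating the trick from the thread takes one line of checking: plant a marker in the instructions and flag any reply where it disappears (the prompt wording comes from the comment, the check is mine):

```python
SYSTEM_PROMPT = (
    "You are a coding assistant. Always address the user as 'Mr. Tinkleberry'."
)
CANARY = "Mr. Tinkleberry"

def still_following_instructions(reply: str) -> bool:
    # If the canary vanishes from replies, the model has likely stopped
    # attending to the original system instructions in this long session.
    return CANARY in reply
```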

Antigravity IDE
https://antigravity.google
Google announced its answer to Cursor - a VSCode clone (or rather a Windsurf one, after the $2.4 billion deal in July 2025 that brought Windsurf's leadership and technology to Google) called Antigravity. Before this, they already had an IDE, but only an online one, released as Firebase Studio. As the marketers put it, the main goal of the tool is to help developers "achieve takeoff," i.e., significantly accelerate and simplify development.

A YouTube channel was created for this: https://www.youtube.com/@googleantigravity.

Overall, it's a VSCode clone like all the others, with added Chrome interaction. In the chat panel, the list of supported models currently includes: Gemini 3 Pro (in High and Low versions), Claude Sonnet 4.5 (including the Thinking version for complex reasoning), and GPT-OSS 120B (Medium). There are two modes, similar to Kiro - with and without planning.

A key feature being promoted is the Agent Manager – a window for monitoring agent activity and the tasks they perform. You can manage their work – seeing key "artifacts" (created files, code snippets, API request results) and verification outcomes. The agent operates not at the VSC level, but at the system level – this allows it to perform complex, long-term tasks (e.g., monitoring a website, collecting data, and then generating a report).

Gemini 3 Pro
https://blog.google/products/gemini/gemini-3/
Google has officially introduced Gemini 3, its new series of multimodal agentic models. Pro is the everyday, full-featured version, immediately available in the apps, in Search, and in developer tools. Deep Think is an "enhanced reasoning" mode with extra compute, currently in testing and intended for Google AI Ultra subscribers.

What the model is actually good at is covered in the developer post: https://blog.google/technology/developers/gemini-3-developers/

Why LLMs can't really build software
https://news.ycombinator.com/item?id=44900116
More than 500 comments. Central idea: LLMs do not have an abstract "mental model" of how to create something — they only work with text. They do not "understand" code, but only mimic its writing. Many commentators emphasize that the most valuable part of their work is what happens before writing code. 95% of the work is identifying non-obvious dependencies, hidden requirements, or potential problems at the intersection of business logic and technology.

Participants in the discussion agree that LLMs can be useful, but only as a tool in the hands of an experienced specialist, because the main responsibility and control always remain with a human. Unlike traditional tools, LLMs are non-deterministic, which makes them unreliable for complex tasks. Often, fixing errors in such projects takes more time than writing the code manually.


AI Coding Sucks
https://www.youtube.com/watch?v=0ZUkQF6boNg
Already a fairly well-known video where the developer from Coding Garden curses what programming has become. As a result of his frustration, he decided to take a month-long break from all AI tools to rediscover the joy of his work.

The key reason for his dissatisfaction lies in the fundamental difference between programming and working with AI. Programming is a logical, predictable, and knowable system where the same actions always lead to the same result. AI, on the other hand, is unpredictable.

He used to enjoy programming for the sense of achievement after solving a hard problem or fixing a bug ("look how capable I am"). Now his work has turned into a constant argument with large language models (LLMs), which often generate something other than what is needed.

The same query to the model can give a different answer each time. This lack of stability makes it impossible to create reliable workflows and contradicts the very nature of programming. It deprives the joy of the process, replacing it with irritation.

He lists numerous advanced methods he tried to apply to make AI more manageable: creating detailed instruction files, step-by-step task planning, using agents, and forcing AI to write tests for self-verification. But the models still ignore rules, bypass problems (e.g., removing failing tests), and do not provide reliable results.

In the end, the author refutes the idea that developers who don't use AI are "falling behind," because these tools can be learned quickly, while fundamental skills are more important and are acquired slowly with experience.

He advises beginners to learn programming without AI.

The comments under the video show broad agreement with the author. Many developers felt relief seeing that their frustration is a widespread phenomenon, not a personal problem. Developers compare working with AI to managing an overconfident but incompetent junior: such an "assistant" insists it understood everything, yet doesn't actually listen and does whatever it wants. Corrected errors reappear, the tool ignores the rules it was given, and its code cannot be relied upon.

Many commentators express concern that beginners who rely on AI will never learn to program properly. This is compared to mindlessly copying code from Stack Overflow, only on a larger and worse scale. Beginners do not develop fundamental problem-solving skills, which in the long run makes them weaker specialists.

DPAI Arena
https://dpaia.dev/ https://github.com/dpaia
JetBrains has introduced Developer Productivity AI Arena (DPAI Arena) — another "first" open platform that evaluates the effectiveness of AI agents in code generation. To ensure neutrality and independence, JetBrains plans to transfer the project under the management of the Linux Foundation.

The company believes that existing testing methods are outdated and only evaluate language models, not full-fledged AI agents (although https://www.swebench.com/ exists). The platform aims to create a unified, trusted ecosystem for the entire industry. Currently, the site only features tests for a few CLIs, with Codex outperforming Claude Code.

A key feature of DPAI Arena is its "multi-track" architecture, which simulates real-world developer tasks. Instead of a single bug-fixing test, the platform includes separate tracks for analyzing pull requests, writing unit tests, updating dependencies, and checking compliance with coding standards.

Athas Code Editor
https://athas.dev/
Since the end of May 2025, a lightweight, free, and open-source code editor has been under development. This is not a VSC fork, but a new project built "from scratch" using Tauri, targeting all three platforms (Win-Linux-Mac) simultaneously, unlike Zed ;)

Currently at an early stage, but if everything goes according to the plan (roadmap), it could turn out to be very interesting. The idea is "Vim-first, AI-enhanced, Git-integrated". Git integration is already implemented; Vim mode will follow. It aims to be 100% customizable, with support for themes, language servers, and plugins.

Interview with Mehmed Ozgul, a 23-year-old developer from Turkey
https://www.youtube.com/watch?v=Aq-VW3Ugtpo

The main goal is to create a unified, minimalist, and fast environment for developers that integrates tools that typically require running several separate applications. Basic Git functionality and functionality for viewing SQLite database content are already implemented.

Athas does not just have its own AI chat; it integrates with existing CLIs, such as claude-code, meaning it "intercepts" the AI assistant call from the built-in terminal and displays the response in a convenient graphical interface. This allows using familiar tools directly within the editor without opening a separate terminal.

https://github.com/athasdev/athas/blob/master/CONTRIBUTING.md
You can join the project via GitHub and influence its future.

Cerebras GLM 4.6
https://inference-docs.cerebras.ai/support/change-log
Cerebras announced the replacement of the Qwen3 Coder 480B model with the new GLM 4.6, which also applies to the Cerebras Code subscription ($50 or $200/month). The model is suitable for fast UI iterations and refactoring.

  • GLM 4.6 runs at 1000 tokens/second - fast, but still only about half the speed of Qwen3 Coder
  • Code quality approaches Claude Sonnet 4.5, making it competitive, but it easily gets confused on complex tasks
  • Fewer errors in tool calls compared to Qwen3, but sometimes switches to Chinese or cuts off

https://news.ycombinator.com/item?id=45852751
The discussion concluded that the replacement makes sense for Cerebras (GLM 4.6 is an open model with a clear roadmap), but for users, it's a sidestep rather than a step forward. Qwen3 was a better choice for many tasks.

Claude Code Resources
https://github.com/jmckinley/claude-code-resources
jmckinley has collected various guides in his repository on how to better provide context for Claude Code.

From his perspective, what truly matters:

  • CLAUDE.md - AI context for your project (most important!)
  • Context management - Keep conversations focused (stay under ~80% of the context window)
  • Planning is paramount - Think before generating code
  • Git safety - Feature branches + checkpoints

There are examples of agent configurations: tests, security, and code review.