CodeWithLLM-Updates
-
🤖 AI tools for smarter coding: practical examples, step-by-step instructions, and real-world LLM applications. Learn to work efficiently with modern code assistants.

Superset - a multiterminal for Agents
https://superset.sh/
Currently Mac only, with Windows and Linux versions planned. It is an Electron-based app, a terminal with tabs specifically adapted to manage multiple agents like Claude Code, OpenCode, OpenAI Codex, and others simultaneously.

It automatically creates isolated git worktrees (best practice), sets up environments, isolates tasks to avoid conflicts, adds notification hooks, and includes a built-in diff-viewer for quick review of changes and PR creation. Future plans include cloud workspaces, context sharing between agents, and orchestration.
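
For reference, the worktree isolation it automates comes down to a couple of git commands; here is a minimal sketch of the idea (the branch and path naming is my own, and the actual setup steps Superset performs are not documented here):

```python
import subprocess
from pathlib import Path

def create_agent_worktree(repo: Path, task: str) -> Path:
    """Create an isolated git worktree + branch for one agent task.
    Branch/path naming is illustrative, not Superset's actual scheme."""
    branch = f"agent/{task}"
    worktree = repo.parent / f"{repo.name}-{task}"
    # A new branch checked out in its own directory, so parallel agents
    # never touch each other's working files.
    subprocess.run(
        ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(worktree)],
        check=True,
    )
    return worktree

# Example: three agents, three isolated checkouts of the same repository
# for task in ("fix-auth-bug", "update-docs", "refactor-api"):
#     create_agent_worktree(Path("~/projects/myapp").expanduser(), task)
```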

Logic-wise, it is similar to https://github.com/tmux/tmux, the well-known "terminal of terminals" for Unix-like systems (Linux, macOS, BSD, etc.), which allows creating and managing multiple sessions in a single window using panes and windows.

Mysti as a team of agents
https://github.com/DeepMyst/Mysti
A VS Code extension that lets you combine any two different models (with a shared context) in Brainstorm Mode to get higher-quality advice. It removes the need to hop between different paid AI subscriptions just to get a second opinion on complex architectural decisions. It currently supports models from Claude, Codex, Gemini, and GitHub Copilot CLI.
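
Under the hood, "two models, shared context" can be approximated with two API calls where the second model sees the first one's answer. A rough sketch using the OpenAI and Anthropic Python SDKs (the model names and exchange format are illustrative; this is not Mysti's actual implementation):

```python
from openai import OpenAI          # pip install openai
import anthropic                   # pip install anthropic

question = "Should we split the billing service out of the monolith?"

# Opinion 1: an OpenAI model (model name illustrative)
gpt_answer = OpenAI().chat.completions.create(
    model="gpt-5.1",
    messages=[{"role": "user", "content": question}],
).choices[0].message.content

# Opinion 2: a Claude model, shown the first answer as shared context
claude_answer = anthropic.Anthropic().messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": f"{question}\n\nAnother model suggested:\n{gpt_answer}\n\n"
                   "Where do you disagree, and what did it miss?",
    }],
).content[0].text

print(claude_answer)
```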

HN Discussion
https://news.ycombinator.com/item?id=46365105
The community shows significant interest in the idea of multi-agent collaboration, actively sharing personal workflows and alternative tools. Many participants experiment with similar approaches manually (e.g., via Tmux panes with several CLI agents) and believe that debates between models help identify weak ideas and improve solutions, especially when one model gets "stuck."

Regarding Mysti, there is criticism of its dependency on VS Code, as many users prefer a pure CLI experience.

New Models

Gemini 3 Flash
https://blog.google/products/gemini/gemini-3-flash/
Google is gradually rolling out the smaller multimodal agentic model of the new series; benchmarks suggest it sits closer to Gemini 3 Pro than to Gemini 2.5 Flash. The model outperforms Gemini 2.5 Pro in many tests while being three times faster and significantly cheaper. In some benchmarks, it even surpasses flagship models from other companies.

Since its release, Gemini 3 Flash has become the default model in the Gemini mobile app (replacing 2.5 Flash) and in Google Search's AI Mode. In my Gemini CLI, neither 3 Flash nor 3 Pro has appeared yet—they can be accessed via Google AI Studio.

GLM-4.7
https://z.ai/blog/glm-4.7
Zhipu AI has updated its GLM model. Version 4.7 shows significant progress over GLM-4.6 in multilingual code generation scenarios. It supports "thinking before acting" in frameworks like Claude Code, Kilo Code, Cline, and Roo Code, ensuring stability in complex tasks. Interface generation quality has also been improved.

Model weights (MoE architecture, up to 200K token context) are publicly available on Hugging Face and ModelScope for local deployment. Access is available via Z.ai API, OpenRouter, the z.ai chat interface, and a special GLM Coding Plan ($3 for the first month, then $6).
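
Since OpenRouter exposes an OpenAI-compatible endpoint, trying the model takes only a few lines; a sketch (the model slug is my assumption, check the OpenRouter catalog for the exact id):

```python
from openai import OpenAI  # OpenRouter speaks the OpenAI-compatible API

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

# The model slug is an assumption -- verify it in the OpenRouter catalog.
resp = client.chat.completions.create(
    model="z-ai/glm-4.7",
    messages=[{"role": "user",
               "content": "Write a Rust function that parses an ISO-8601 date."}],
)
print(resp.choices[0].message.content)
```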

MiniMax M2.1
https://www.minimax.io/news/minimax-m21
The Chinese company MiniMax has released M2.1, an improved version of its M2 model focused on practical development and agentic systems. The model is reported to be significantly better at non-Python programming languages (Rust, Java, Golang, C++, Kotlin, Objective-C, TypeScript, JavaScript, etc.), outperforming Claude Sonnet 4.5 and approaching Claude Opus 4.5 in multilingual scenarios.

The model is open-source. API costs are quite low, about 10% of Claude Sonnet. It is compatible with popular agents like Claude Code, Droid (Factory AI), Cline, Kilo Code, Roo Code, and BlackBox, and supports context mechanisms (Skill.md, agent.md, etc.).

They also have a web platform at https://agent.minimax.io/ where you can test how the model builds applications.

https://www.youtube.com/watch?v=kEPLuEjVr_4

SWE-bench Verified comparison: Gemini 3 Flash 78%, MiniMax M2.1 74%, GLM-4.7 73.8%.

MCP as an independent standard
https://aaif.io/
https://openai.com/index/agentic-ai-foundation/
In December 2025, Anthropic transferred the Model Context Protocol (MCP) to the Agentic AI Foundation (AAIF), a dedicated foundation managed by the Linux Foundation. MCP is one of the founding projects of the new organization, alongside goose by Block and AGENTS.md by OpenAI.

Agent Skills as an open standard
https://claude.com/blog/organization-skills-and-directory
https://agentskills.io and https://claude.com/connectors
Agent Skills was announced as an independent open standard on December 18, 2025, with a specification and SDK; it was not transferred to the Linux Foundation or AAIF. Microsoft has already adopted Agent Skills in VS Code and GitHub Copilot; it is also supported by Cursor, Goose, Amp, and OpenCode where Anthropic models are available.
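
The core of the standard is simple: each skill is a folder with a SKILL.md whose frontmatter (name, description) is all the agent sees until the skill is actually invoked. A simplified sketch of that discovery step (the directory layout follows the spec, the code and its naive frontmatter parsing are mine; a real implementation would use a YAML parser):

```python
import re
from pathlib import Path

def skill_index(skills_dir: Path) -> list[dict]:
    """Collect just the name/description frontmatter of each SKILL.md.
    This mirrors the standard's progressive disclosure: only this tiny
    index sits in context until a skill is actually needed."""
    index = []
    for skill_md in skills_dir.glob("*/SKILL.md"):
        text = skill_md.read_text(encoding="utf-8")
        m = re.search(r"^---\n(.*?)\n---", text, re.DOTALL)
        meta = dict(
            line.split(":", 1) for line in m.group(1).splitlines() if ":" in line
        ) if m else {}
        index.append({
            "name": meta.get("name", skill_md.parent.name).strip(),
            "description": meta.get("description", "").strip(),
            "path": str(skill_md),   # full body is loaded only on activation
        })
    return index
```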

Agent Skills Playground
https://skillsplayground.com/
On this site, by entering your API key, you can experiment with how different models utilize various skills.

Claude Code 2.0.74
Added the LSP tool (Language Server Protocol) for code intelligence features such as go-to-definition, find references, and hover tooltips. This significantly improves the development experience, making code navigation faster and more convenient. For now, the agent rarely uses LSP on its own. The open-source project OpenCode has had LSP support for about six months, so the slow pace of the proprietary tool is surprising.
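
For context, LSP is just JSON-RPC: a go-to-definition query is one small message, which is why it is so much cheaper than having the agent grep the whole repository into its context. A sketch of what the request looks like on the wire (the file path and position are made up):

```python
import json

# A go-to-definition request as it travels over stdio (JSON-RPC with a
# Content-Length header). The agent's LSP tool wraps messages like this.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "textDocument/definition",
    "params": {
        "textDocument": {"uri": "file:///work/app/src/billing.py"},
        "position": {"line": 41, "character": 17},  # 0-based
    },
}
body = json.dumps(request)
message = f"Content-Length: {len(body.encode('utf-8'))}\r\n\r\n{body}"
# The language server replies with the Location(s) of the symbol's definition.
```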

Cursor Visual Editor
https://cursor.com/blog/browser-visual-editor
Cursor introduced a visual editor with a "Point and Prompt" feature: you can simply click on any interface element and describe in text what needs to be changed. It also allows manipulating the site structure using drag-and-drop elements in the DOM tree, changing button order or grid settings.

https://www.youtube.com/watch?v=1S8S89X-xbs

The editor's sidebar provides visual control over component properties (props) and styles: from typography sliders to a color palette. The update aims to blur the line between design and programming, allowing developers to focus on ideas rather than mechanical code work.

Claude Code Plugins
https://code.claude.com/docs/en/plugin-marketplaces
Anthropic launched a plugins marketplace, seemingly in response to a similar one in Gemini CLI. It is not a separate website with an App Store-like interface. It's a system within Claude Code itself, where marketplaces are plugin catalogs (often based on GitHub repositories) that are added and managed via slash commands.

https://www.youtube.com/watch?v=1uWJC2r6Sss

Also, there are now prompt suggestion variants and a hotkey for switching models during a prompt. Subagents can work in parallel. Improved usage statistics and a visual fill-indicator for the context window have been added.

You can now run Claude Code tasks directly from the Claude Android mobile app. This is not a full-fledged terminal on the phone, but an asynchronous integration where Claude runs in the cloud.

Kiro Powers
https://kiro.dev/docs/powers/
Kiro is testing a concept called Powers, which tackles context-window clutter through dynamic tool activation: the system analyzes the user's query and enables only the necessary "knowledge pack." This is very similar to "Skills" in Anthropic's models.

When many tools (MCP servers) are connected to an agent, it is forced to load hundreds of function descriptions at once. This can eat up to 40% of the limit before work even begins and leads to irrelevant advice. Instead, each Power is a ready-made pack containing instructions (how and when to use the tools), server configuration, and automated scenarios.

For example, if you mention "payment," the Power for Stripe is activated, providing specific knowledge about the API and security. As soon as you move to working with a database, Stripe tools are disabled, and instead, the Power for Supabase or Neon is loaded. This allows the agent to remain fast, focus on a specific topic, and produce higher quality code.
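
A toy version of that activation logic might look like this (Kiro presumably does the matching with the model rather than keywords, and the pack contents below are invented):

```python
# Map trigger topics to "knowledge packs" and expose only the matching
# pack's tools to the agent; everything else stays out of the context.
POWERS = {
    "stripe":   {"triggers": {"payment", "invoice", "checkout"},
                 "tools": ["stripe_create_charge", "stripe_list_invoices"]},
    "supabase": {"triggers": {"database", "table", "sql"},
                 "tools": ["supabase_query", "supabase_migrate"]},
}

def active_tools(user_query: str) -> list[str]:
    words = set(user_query.lower().split())
    tools: list[str] = []
    for power in POWERS.values():
        if power["triggers"] & words:      # activate only on a topic match
            tools.extend(power["tools"])
    return tools

print(active_tools("Add a payment retry to the checkout flow"))
# -> only the Stripe tools get loaded; Supabase stays dormant
```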

The system offers an open ecosystem with one-click installation for popular services (AWS, Figma, Stripe, etc.).

Mistral Devstral 2 and Vibe
https://mistral.ai/news/devstral-2-vibe-cli
The European company Mistral AI is known for building LLMs independent of the US and China. It has updated its coding model and finally released its own CLI. These announcements matter a great deal for the open-source AI ecosystem in software development.

https://openrouter.ai/mistralai/devstral-2512:free
The new generation is called Devstral 2 (123B) and Devstral Small 2 (24B), released under flexible licenses: a modified MIT for Devstral 2 and Apache 2.0 for Devstral Small 2. Devstral 2 scores an impressive 72.2% on SWE-bench, a strong result for an open model.

The Small version can run locally on NVIDIA hardware, while the larger model (being a dense model rather than MoE) requires serious hardware such as a Mac Studio or several 3090/4090 GPUs.

Currently, Devstral 2 is offered for free via API. The model is already available in Kilo Code and Cline. According to feedback, it is quite mediocre at generating websites, frontend, and animation — it works better with small tasks involving local Python scripts.

https://help.mistral.ai/en/articles/496007-get-started-with-mistral-vibe
Mistral Vibe CLI is an open-source command-line tool in the vein of Claude Code; it runs on Windows, macOS, and Linux and is based on the Devstral models. It can also be run in Zed. It offers interface themes, Git integration, MCP support, and agents with custom settings, and it supports both interactive and autonomous operation.

https://news.ycombinator.com/item?id=46205437
Commenters noted that the name "Vibe" makes the product sound geared toward vibe-coding ("play around with an agent and let it churn something out") rather than controlled work by a professional programmer. Some call this message the opposite of what real work needs: augmenting humans, not replacing the process with "chat + tools, good luck."

Mintlify Autopilot
https://www.mintlify.com/blog/autopilot
An AI-powered system that monitors changes in your repository. On every push, it analyzes what needs to be updated in the documentation (both for humans and for AI agents). In the Autopilot dashboard, it shows which changes might require documentation updates. Then the Mintlify agent automatically generates a draft that you can review and refine. It takes into account the code context and the existing tone/style of your documentation.
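
This is not Mintlify's implementation, but the underlying check is easy to picture: compare what changed in the push against a map of which docs describe which parts of the code, then draft updates for the stale ones. A minimal sketch (the mapping and paths are hypothetical):

```python
import subprocess

def changed_files(base: str = "origin/main") -> list[str]:
    """Files touched since the base branch, straight from git."""
    out = subprocess.run(["git", "diff", "--name-only", base, "HEAD"],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()

# Hypothetical mapping from source areas to the docs that describe them
DOC_MAP = {"src/api/": "docs/api-reference.md", "src/cli/": "docs/cli.md"}

stale_docs = {doc for path in changed_files()
              for prefix, doc in DOC_MAP.items() if path.startswith(prefix)}
print("Docs that may need an update:", stale_docs or "none")
```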

Code Wiki
https://codewiki.google/
Google launched Code Wiki (currently in public preview) — a platform designed to solve the problem of AI (and humans) reading and understanding existing codebases. The system creates and continuously maintains a structured wiki page for the entire repository.

Key features: full automation, Gemini-powered chat, hyperlinked answers that point directly to code files. The system automatically generates and keeps up-to-date architectural diagrams, class diagrams, sequence diagrams, and detailed descriptions.

There is a waitlist for the upcoming version of Code Wiki that will allow teams to run the exact same system locally and securely on internal, private repositories.

Qoder Repo Wiki
https://docs.qoder.com/user-guide/repo-wiki
A feature inside the Qoder IDE that automatically generates structured documentation for a project (up to 10,000 files per project, in English and Chinese) and continuously tracks changes in both the code and the documentation itself.

It deeply analyzes project structure and implementation details, providing rich context that helps AI agents work more effectively. Wiki generation is fully dynamic.

Full Git synchronization is supported. Generated content is stored in language-specific directories (e.g., repowiki/zh/, repowiki/en/), which can be committed and pushed like regular code. The initial wiki is created with one click (up to ~120 minutes for 4,000 files). After that, the system continuously watches for code changes and can update only the affected sections when modifications are detected.

Originally the feature worked only with Git repositories, but as of December 2, 2025, they added support for generating wikis from local projects without Git.

DeepWiki (by Cognition AI)
https://deepwiki.com/
A free AI tool that turns any GitHub repository (public or private) into a Wikipedia-style knowledge base. It analyzes code, READMEs, and configs, then creates structured pages with architectural and flow diagrams, interactive code hyperlinks, and a natural-language chat interface for asking questions.

Already supports >30,000 open repositories with automatic updates after new commits. An open-source version is available for local/self-hosted deployment.

MCP Standard Year in Review
https://blog.modelcontextprotocol.io/posts/2025-11-25-first-mcp-anniversary/
The blog post describes how in one year, MCP transformed from a small open-source experiment into a de facto standard in the industry. Major companies like Notion, Stripe, GitHub, OpenAI, Microsoft, and Google have created their own servers to automate workflows. For centralized discovery and management of these servers, the MCP Registry was launched, becoming a single catalog for the entire ecosystem.

Coinciding with the anniversary, the team is releasing a new version of the MCP specification (November 2025). Key innovations include support for task-based workflows (for long-running operations), simplified and more secure authorization mechanisms, and an extensions system that allows adding specific functionality without changing the core protocol.

MCP Container Catalog
https://hub.docker.com/mcp
The site hosts a large library of ready-to-use, containerized MCP servers created by the developer community and powered by Docker technology. The servers are grouped by categories. The platform's goal is to simplify the use of MCP tools.

MCP Problems
https://www.youtube.com/watch?v=4h9EQwtKNQ8

The author argues that while MCP is a great idea in theory, in practice it has a serious problem that makes it ineffective. This problem is poor context management.

When an AI agent connects to MCP servers, all descriptions of available tools (tool definitions) are loaded into the language model's "context window." When the agent uses a tool, all results of its work (including intermediate data that may be unnecessary) are also sent to the context.

This consumes a huge number of tokens. The author gives an example where just two connected servers take up 20k tokens. It only gets worse with each iteration. The author calls this problem "context rot".
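
The arithmetic is easy to reproduce: tool definitions are JSON schemas, and at a rough ~4 characters per token a couple of chatty servers reach the ~20k figure quickly. The numbers below are invented, sized only to illustrate the scale:

```python
import json

def fake_tools(prefix: str, n: int) -> list[dict]:
    # Invented tool schemas, padded to a realistic description length.
    return [{"name": f"{prefix}_tool_{i}", "description": "x" * 1200,
             "inputSchema": {"type": "object", "properties": {}}} for i in range(n)]

def estimate_tokens(schemas: list[dict]) -> int:
    return sum(len(json.dumps(s)) for s in schemas) // 4   # ~4 chars per token

loaded = fake_tools("github", 35) + fake_tools("browser", 25)
print(estimate_tokens(loaded), "tokens consumed before the first user message")
```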

Agent Skills as an Alternative
https://platform.claude.com/docs/en/agents-and-tools/agent-skills/overview
https://www.youtube.com/watch?v=fOxC44g8vig

The solution Cloudflare once proposed: after finding the right tool, the agent generates code (e.g., TypeScript) that calls the API directly. Anthropic later built on this idea with Agent Skills - a deep dive into the technology is available at https://leehanchung.github.io/blogs/2025/10/26/claude-skills-deep-dive/
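
The shape of that idea, sketched here in Python rather than TypeScript: the agent writes a short script against the API, runs it in a sandbox, and only the printed summary returns to the context window (the endpoint and fields are made up):

```python
import json
import urllib.request

def fetch(url: str) -> list[dict]:
    # Hypothetical API call made from agent-generated code.
    with urllib.request.urlopen(url) as r:
        return json.load(r)

issues = fetch("https://api.example.com/repo/issues?state=open")
stale = [i["id"] for i in issues if i["comments"] == 0]

# The hundreds of raw issue objects never hit the context window --
# the model only ever sees this one summary line.
print(f"{len(stale)} open issues with no comments: {stale[:10]}")
```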

Opus 4.5 in Claude Code
https://www.anthropic.com/news/claude-opus-4-5
Following Sonnet and Haiku, Anthropic's largest model has also been updated to version 4.5. The problem with version 4 was its very high token price - it is now 3 times (!) cheaper. Accordingly, overall usage limits have been raised significantly, and Opus 4.5 has been added to Claude Code as an enhanced planning mode. It is capable of "creative problem-solving" - finding non-standard but legitimate solutions.

The main point is that the model uses significantly fewer tokens to achieve results, making it faster and cheaper to use.

Of course, in their programming and tool use tests, Opus 4.5 surpasses both Gemini 3 Pro and GPT-5.1. In one of the tests, the model handled a complex engineering task better than any human candidate. The model manages context and memory more effectively and can coordinate the work of several "sub-agents" to perform complex tasks.

https://www.youtube.com/watch?v=5O39UDNQ8DY

Claude Code is now available in a desktop application (research preview). You can run several local and remote Claude Code sessions in parallel: one agent fixes bugs, another explores GitHub, and a third updates documentation. Git worktrees are used for parallel work with repositories.

Now we wait for the model to show up in most of the modern AI coding applications.

https://news.ycombinator.com/item?id=46037637
Several users emphasized that it's not the price per token that matters as much as the "cost per successful task." A smarter model, like Opus 4.5, makes fewer mistakes and requires fewer tokens to solve a problem, which can ultimately make it cheaper than "cheaper" but less intelligent models.
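
The point is easy to put into numbers. All figures below are invented; the takeaway is only that retries and wasted tokens dominate the bill, not the per-token price:

```python
# Back-of-the-envelope "cost per successful task" comparison (invented numbers).
cheap  = {"cost_per_mtok": 1.0, "tokens_per_attempt": 150_000, "success_rate": 0.30}
strong = {"cost_per_mtok": 5.0, "tokens_per_attempt": 30_000,  "success_rate": 0.90}

def cost_per_success(m: dict) -> float:
    attempt_cost = m["cost_per_mtok"] * m["tokens_per_attempt"] / 1_000_000
    return attempt_cost / m["success_rate"]   # expected attempts = 1 / success_rate

print(f"cheap model:  ${cost_per_success(cheap):.3f} per solved task")
print(f"strong model: ${cost_per_success(strong):.3f} per solved task")
```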

Clad Labs
https://www.cladlabs.ai/blog/introducing-clad-labs
If Y Combinator is giving money for this, then we are probably close to the end of the AI bubble (see https://aibubblemonitor.com/). I couldn't download it because I didn't pass the Private Beta Access code verification :)

The website seems to be a speculative design, featuring an announcement of a development environment (IDE) called Chad IDE from Clad Labs. The article is written as a typical tech startup announcement and contains fabricated (?) positive reviews, impressive metrics (e.g., "saved 2.3 hours per day"), and an absurd development plan that includes "personalized casinos" and "algorithms for developer dating."

Chad IDE proposes not to fight distractions but to integrate them directly into the IDE, keeping the developer's attention inside the work environment even while waiting. The feature list includes integration with the online casino Stake.us for placing bets during code compilation, built-in mini-games, the ability to watch YouTube Shorts, Instagram Reels, and TikTok directly in the code editor window, and Tinder integration for swiping profiles while code is being generated.

https://www.youtube.com/watch?v=1uR4QPtXF4Y

To explain Y Combinator's decision, the video author delves into their investment philosophy. He emphasizes that YC bets not on ready-made ideas, but on founders, and consciously makes many risky investments. Many of their most successful companies, such as Twitch or DoorDash, initially seemed absurd or even doomed. YC understands that the best ideas often disguise themselves as strange or incomprehensible, so they are willing to take risks with teams that show potential, even if their current product looks like a joke.

Almost all major AI coding projects have already added the Gemini 3 Pro model. Antigravity itself often throws connection errors and works very slowly.

GPT 5.1 Codex
https://openai.com/index/gpt-5-1-codex-max/
After updating the regular model to GPT 5.1, its Codex variant was also updated. To respond to Google, they also announced GPT-5.1-Codex-Max. Everything, of course, got even better. And it will be able to work even longer.

Compaction technology allows the model to work with millions of tokens without losing important information. In practice, this opens up possibilities for refactoring entire projects and autonomous work for many hours straight.
The model is already available in the Codex tool, and API access will appear soon.
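
Schematically, compaction means folding older turns into a model-written summary once the transcript nears the limit so the session can keep going; this sketch is my reading of the idea, not OpenAI's actual mechanism:

```python
def compact(history: list[str], limit_tokens: int, summarize) -> list[str]:
    """Replace older turns with a summary when the transcript gets too long.
    `summarize` would be a call to the model itself; token counting is crude."""
    def tokens(msgs: list[str]) -> int:
        return sum(len(m) // 4 for m in msgs)        # ~4 chars per token

    if tokens(history) > limit_tokens:
        older, recent = history[:-4], history[-4:]   # keep the freshest turns verbatim
        history = [summarize("\n".join(older))] + recent
    return history
```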

https://news.ycombinator.com/item?id=45982649
In the discussion, people share success stories. For example, one developer told how Codex completely rewrote key parts of his hobby project (a flight simulator) in 45 minutes, performing complex refactoring that would have taken a human much longer.

Users also compare the new product with Google Gemini and, while noting certain advantages of the latter, most agree that for code generation, Codex still provides consistently better results, "hallucinates" less often, and integrates better into existing code.

An interesting conversation about the "folk magic" of prompt engineering: people add special "canary words" to their instructions (e.g., asking the model to always address them as "Mr. Tinkleberry") to check if the AI is still carefully following the initial instructions in long sessions.
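
Automating the trick from the thread takes one line of checking: plant a marker in the instructions and flag any reply where it disappears (the prompt wording comes from the comment, the check is mine):

```python
SYSTEM_PROMPT = (
    "You are a coding assistant. Always address the user as 'Mr. Tinkleberry'."
)
CANARY = "Mr. Tinkleberry"

def still_following_instructions(reply: str) -> bool:
    # If the canary vanishes from replies, the model has likely stopped
    # attending to the original system instructions in this long session.
    return CANARY in reply
```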

Antigravity IDE
https://antigravity.google
Google announced its answer to Cursor - a VSCode clone (or rather a Windsurf one, after the $2.4 billion deal in July 2025 that brought Windsurf's leadership and technology to Google) called Antigravity. Before this, they already had an IDE, but only an online one, released as Firebase Studio. As the marketers put it, the main goal of the tool is to help developers "achieve takeoff," i.e., significantly accelerate and simplify development.

A YouTube channel was created for this: https://www.youtube.com/@googleantigravity.

Overall, it's a VSCode clone like all the others, with added Chrome interaction. In the chat panel, the list of supported models currently includes: Gemini 3 Pro (in High and Low versions), Claude Sonnet 4.5 (including the Thinking version for complex reasoning), and GPT-OSS 120B (Medium). There are two modes, similar to Kiro - with and without planning.

A key feature being promoted is the Agent Manager – a window for monitoring agent activity and the tasks they perform. You can manage their work – seeing key "artifacts" (created files, code snippets, API request results) and verification outcomes. The agent operates not at the VSC level, but at the system level – this allows it to perform complex, long-term tasks (e.g., monitoring a website, collecting data, and then generating a report).

Gemini 3 Pro
https://blog.google/products/gemini/gemini-3/
Google has officially introduced Gemini 3, its new series of multimodal agentic models. Pro is the everyday, full-featured version, immediately available in the apps, in Search, and in developer tools. Deep Think is an "enhanced reasoning" mode with extra compute, currently in testing and intended for Google AI Ultra subscribers.

What the model is actually good at is covered in the developer post: https://blog.google/technology/developers/gemini-3-developers/

Why LLMs can't really build software
https://news.ycombinator.com/item?id=44900116
More than 500 comments. Central idea: LLMs do not have an abstract "mental model" of how to create something — they only work with text. They do not "understand" code, but only mimic its writing. Many commentators emphasize that the most valuable part of their work is what happens before writing code. 95% of the work is identifying non-obvious dependencies, hidden requirements, or potential problems at the intersection of business logic and technology.

Participants in the discussion agree that LLMs can be useful, but only as a tool in the hands of an experienced specialist, because the main responsibility and control always remain with a human. Unlike traditional tools, LLMs are non-deterministic, which makes them unreliable for complex tasks. Often, fixing errors in such projects takes more time than writing the code manually.


AI Coding Sucks
https://www.youtube.com/watch?v=0ZUkQF6boNg
Already a fairly well-known video where the developer from Coding Garden curses what programming has become. As a result of his frustration, he decided to take a month-long break from all AI tools to rediscover the joy of his work.

The key reason for his dissatisfaction lies in the fundamental difference between programming and working with AI. Programming is a logical, predictable, and knowable system where the same actions always lead to the same result. AI, on the other hand, is unpredictable.

He used to enjoy programming for the sense of achievement after solving a hard problem or fixing a bug ("look how capable I am"). Now his work has turned into a constant argument with large language models (LLMs), which often generate something other than what is needed.

The same query to the model can give a different answer each time. This lack of stability makes it impossible to create reliable workflows and contradicts the very nature of programming. It deprives the joy of the process, replacing it with irritation.

He lists numerous advanced methods he tried to apply to make AI more manageable: creating detailed instruction files, step-by-step task planning, using agents, and forcing AI to write tests for self-verification. But the models still ignore rules, bypass problems (e.g., removing failing tests), and do not provide reliable results.

In the end, the author refutes the idea that developers who don't use AI are "falling behind," because these tools can be learned quickly, while fundamental skills are more important and are acquired slowly with experience.

He advises beginners to learn programming without AI.

The comments under the video show broad agreement with the author. Many developers felt relief seeing that their frustration is a widespread phenomenon, not a personal problem. Developers compare working with AI to managing an overconfident but incompetent junior: such an "assistant" insists it understood everything, yet doesn't actually listen and does whatever it wants. Corrected errors reappear, the tool ignores the rules it was given, and its code cannot be relied upon.

Many commentators express concern that beginners who rely on AI will never learn to program properly. This is compared to mindlessly copying code from Stack Overflow, only on a larger and worse scale. Beginners do not develop fundamental problem-solving skills, which in the long run makes them weaker specialists.

DPAI Arena
https://dpaia.dev/ https://github.com/dpaia
JetBrains has introduced Developer Productivity AI Arena (DPAI Arena) — another "first" open platform that evaluates the effectiveness of AI agents in code generation. To ensure neutrality and independence, JetBrains plans to transfer the project under the management of the Linux Foundation.

The company believes that existing testing methods are outdated and only evaluate language models, not full-fledged AI agents (although https://www.swebench.com/ exists). The platform aims to create a unified, trusted ecosystem for the entire industry. Currently, the site only features tests for a few CLIs, with Codex outperforming Claude Code.

A key feature of DPAI Arena is its "multi-track" architecture, which simulates real-world developer tasks. Instead of a single bug-fixing test, the platform includes separate tracks for analyzing pull requests, writing unit tests, updating dependencies, and checking compliance with coding standards.

Athas Code Editor
https://athas.dev/
Since the end of May 2025, a lightweight, free, and open-source code editor has been under development. This is not a VSC fork, but a new project built "from scratch" using Tauri, targeting all three platforms (Win-Linux-Mac) simultaneously, unlike Zed ;)

Currently at an early stage, but if everything goes according to the plan (roadmap), it could turn out to be very interesting. The idea is "Vim-first, AI-enhanced, Git-integrated". Git integration is already implemented; Vim mode will follow. It aims to be 100% customizable, with support for themes, language servers, and plugins.

Interview with Mehmed Ozgul, a 23-year-old developer from Turkey
https://www.youtube.com/watch?v=Aq-VW3Ugtpo

The main goal is to create a unified, minimalist, and fast environment for developers that integrates tools that typically require running several separate applications. Basic Git functionality and functionality for viewing SQLite database content are already implemented.

Athas does not just have its own AI chat; it integrates with existing CLIs, such as claude-code, meaning it "intercepts" the AI assistant call from the built-in terminal and displays the response in a convenient graphical interface. This allows using familiar tools directly within the editor without opening a separate terminal.

https://github.com/athasdev/athas/blob/master/CONTRIBUTING.md
You can join the project via GitHub and influence its future.

Cerebras GLM 4.6
https://inference-docs.cerebras.ai/support/change-log
Cerebras announced the replacement of the Qwen3 Coder 480B model with the new GLM 4.6, which also applies to the Cerebras Code subscription ($50 or $200/month). The model is suitable for fast UI iterations and refactoring.

  • GLM 4.6 runs at 1000 tokens/second - fast, but still only about half the speed of Qwen3 Coder
  • Code quality approaches Claude Sonnet 4.5, making it competitive, but it easily gets confused on complex tasks
  • Fewer errors in tool calls compared to Qwen3, but sometimes switches to Chinese or cuts off

https://news.ycombinator.com/item?id=45852751
The discussion concluded that the replacement makes sense for Cerebras (GLM 4.6 is an open model with a clear roadmap), but for users, it's a sidestep rather than a step forward. Qwen3 was a better choice for many tasks.

Claude Code Resources
https://github.com/jmckinley/claude-code-resources
jmckinley has collected various guides in his repository on how to better provide context for Claude Code.

From his perspective, what truly matters:

  • CLAUDE.md - AI context for your project (most important!)
  • Context management - Keep conversations focused (stay under ~80% of the context window)
  • Planning is paramount - Think before generating code
  • Git safety - Feature branches + checkpoints

There are examples of agent configurations: tests, security, and code review.