CodeWithLLM-Updates
-
🤖 AI tools for smarter coding: practical examples, step-by-step instructions, and real-world LLM applications. Learn to work efficiently with modern code assistants.

DeepSeek updated their R1 thinking model

https://api-docs.deepseek.com/news/news250528
Besides reducing hallucinations and improving speed, they added function calling and JSON output. The model is open-source and, according to their own benchmarks, performs at the level of top closed models.
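
Since the API is OpenAI-compatible, the new JSON output mode can be tried with the regular openai client. A minimal sketch, assuming the base URL and the deepseek-reasoner model name from DeepSeek's docs (verify both before relying on them):

    # Minimal sketch: JSON output from the updated R1 via DeepSeek's
    # OpenAI-compatible API. Endpoint and model name are assumptions
    # taken from the DeepSeek docs.
    from openai import OpenAI

    client = OpenAI(
        api_key="YOUR_DEEPSEEK_API_KEY",
        base_url="https://api.deepseek.com",
    )

    response = client.chat.completions.create(
        model="deepseek-reasoner",  # the R1 thinking model
        response_format={"type": "json_object"},  # new JSON-output mode
        messages=[
            {"role": "system",
             "content": "Reply with a JSON object: {\"language\": ..., \"framework\": ...}"},
            {"role": "user",
             "content": "Suggest a stack for a small realtime dashboard."},
        ],
    )

    print(response.choices[0].message.content)  # a JSON string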

In addition, they improved the quality of front-end generation - now the code is even better. Examples are in the video, especially the ball physics:

https://www.youtube.com/watch?v=lWd1UFtbSZ0

I haven't seen testing yet on how well it will work as a background agent, but I think this is still a model for pair programming.

The API price is the same, and there are still night-time discounts (China time). In the web version https://chat.deepseek.com/ usage is free, but this is the only SOTA model whose chat currently lacks a canvas interface. By the way, they finally added a setting ("Improve the model for everyone") that lets you disable data collection for training their models.

It can also be used via third-party hosting providers:
https://openrouter.ai/deepseek/deepseek-r1-0528
https://openrouter.ai/deepseek/deepseek-r1-0528:free (free hosting from https://chutes.ai/tos)

Vercel presented its model
https://vercel.com/docs/v0/api

The v0-1.0-md model is designed to create modern web applications. It supports text and image input, provides fast streaming responses, and is compatible with the OpenAI Chat Completions API format, meaning it can be connected to Cursor in the available models settings.

https://www.youtube.com/watch?v=0KYWJWY62d4

The model also knows how to call functions, is well-versed in modern frontend and full-stack frameworks (like Next.js), and can correct its errors. The context window is 128k for input, 32k for output.
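
Since the model speaks the OpenAI Chat Completions format, it can also be tried outside Cursor with the regular openai client. A minimal sketch; the endpoint URL below is an assumption taken from Vercel's docs, so verify it:

    # Minimal sketch: streaming a response from v0-1.0-md through the
    # OpenAI-compatible endpoint. The base URL is an assumption from the docs.
    from openai import OpenAI

    client = OpenAI(
        api_key="YOUR_V0_API_KEY",
        base_url="https://api.v0.dev/v1",
    )

    stream = client.chat.completions.create(
        model="v0-1.0-md",
        stream=True,  # the model is optimized for fast streaming responses
        messages=[
            {"role": "user",
             "content": "Create a Next.js pricing page with three tiers."},
        ],
    )

    for chunk in stream:
        delta = chunk.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)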

It is currently in beta testing and requires a Premium or Team plan with usage-based billing enabled. There is a limit of 200 messages per day.

Wow...

https://x.com/AnthropicAI/status/1925926102725202163
News "THE WAY OF CODE, a project by @rickrubin in collaboration with Anthropic" -- Rick Rubin together with Anthropic released a book on vibe coding...

Rubin, known for his non-technical approach to music production, sees this method as a way to democratize software creation, allowing people without programming skills to bring their ideas to life.

https://www.thewayofcode.com/
"The Way of Code" is an experimental digital book combining Daoist philosophy with artificial intelligence. The project contains 81 meditative chapters, inspired by the Tao Te Ching, each accompanied by generative art created by Anthropic's Claude AI model.

Readers can see the code and modify these snippets with the help of Claude. The work explores how AI-assisted vibe coding – where users describe ideas in natural language and AI generates the code – aligns with Rubin's emphasis on intuition and simplicity in the creative process.

Next up this week was a presentation from Anthropic. The event was called "Code w/ Claude", so the focus was squarely on programming.

Google, citing Cursor statistics, said that in recent months many Cursor users had switched to Gemini 2.5 Pro. It will be interesting to see whether this update brings people back to Sonnet (the model is already available in Cursor's model settings).

https://www.anthropic.com/news/claude-4
They presented version 4 of Opus and Sonnet. The models have all the modern features: reasoning, web search, code execution, tool use and MCP, and local file editing (which Opus can use as memory). The prompt cache lifetime was extended from minutes to an hour. The default context window (200k) is smaller, but for an additional fee it reaches Gemini's 1 million tokens.
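
As a small illustration of how the reasoning side is exposed, here is a hedged sketch of calling Opus 4 with extended thinking via the Anthropic SDK; the dated model ID is an assumption, so check it against the current model list:

    # Minimal sketch: Claude Opus 4 with extended thinking enabled.
    # The model ID below is assumed from the launch announcement.
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    message = client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=8000,
        thinking={"type": "enabled", "budget_tokens": 4000},  # reasoning budget
        messages=[
            {"role": "user",
             "content": "Refactor this recursive function into an iterative one: ..."},
        ],
    )

    # The response interleaves "thinking" blocks with the final "text" blocks.
    for block in message.content:
        if block.type == "text":
            print(block.text)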

They say Opus can work for 7 hours as a background autonomous agent.

In light of the recent wave of specialized models for background-agent behavior, it's worth noting that these new models are essentially that, except they are not derivatives of other models the way OpenAI's Codex-1 is derived from o3. Anthropic seems to have shifted focus: having clearly lost the battle with ChatGPT, Gemini, and Grok for the everyday consumer chat app market, they simply released agentic models and concentrated on programming.

They also updated their Claude Code tool. It now supports background tasks via GitHub Actions and integrations with VS Code and JetBrains, reflecting changes directly in the IDE.

This means they also created a background agent that can be assigned tasks from repositories and then the results can be checked: "Mention Claude Code in pull requests to respond to reviewer comments, fix CI errors, or change code. To install, run /install-github-app from Claude Code."

Interestingly, a GitHub representative spoke, and it seems their background agent on the site also runs on an Anthropic model, not OpenAI Codex as I initially assumed when it was announced. In GitHub Copilot's free plan it's still only Claude 3.5 Sonnet, but the Pro plan added Claude 4.0 Sonnet (Preview). To use Opus, you need a Pro+ subscription.

I remember back in the GPT-4 era, many custom models appeared, specifically "tuned" for programming. There were even separate models for Python. Phind.com was doing cool stuff. Then it all somehow subsided, and most universal models became good at writing code anyway.

https://windsurf.com/blog/windsurf-wave-9-swe-1
Windsurf recently released their SWE-1 models, but I think this is more of a step to reduce external API costs.

The Mistral company still provides API access to the closed Codestral model, last updated January 2025.


And now there is a new turn: models are being tuned for independent background coding of a set of tasks pulled from a git repository. OpenAI has just re-released Codex, now based on o3. GitHub has updated its agent, adding a background-work mode.

https://mistral.ai/news/devstral
Mistral's answer is the Devstral model, developed jointly with All Hands AI (makers of an open-source clone of the AI developer Devin). Unlike Codestral, the license here is Apache 2.0, meaning free use and modification. The model is also available via API under the name devstral-small-2505.

What the model does better:

  • Parses large repositories
  • Finds connections between components
  • Scans code for errors
  • Is trained to solve real problems from GitHub

According to All Hands AI 🙌, Devstral outperforms significantly larger models such as Deepseek-V3-0324 (671B) and Qwen3 235B-A22B. At the same time, Devstral is light enough to run on a single RTX 4090 or a Mac with 32 GB of RAM, making it an ideal choice for background local use.
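
For those who want to try it over the API rather than locally, here is a minimal sketch using the official mistralai Python SDK (the v1-style interface and response shape are assumed; the model name is the one from the announcement):

    # Minimal sketch: calling Devstral via Mistral's API with the mistralai SDK.
    import os
    from mistralai import Mistral

    client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

    response = client.chat.complete(
        model="devstral-small-2505",
        messages=[
            {"role": "user",
             "content": "Scan this function for bugs and explain what you find:\n"
                        "def mean(xs): return sum(xs) / len(xs)"},
        ],
    )

    print(response.choices[0].message.content)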

GitHub Copilot is also not great with naming: the name "agent" now also covers their cloud agent, released in response to the agent from OpenAI.

https://github.blog/changelog/2025-05-19-github-copilot-coding-agent-in-public-preview/
The cloud Copilot agent can automatically solve tasks in the repository: assign it an issue (or several) and it will analyze the code, make changes, run the tests, and open a PR for review. Copilot works in the background, using a secure cloud environment (based on GitHub Actions). It is available only for Copilot Pro+ and Enterprise and consumes GitHub Actions minutes.

Apparently, GitHub Copilot for VS Code is not developing as quickly and as well as its competitors, so Microsoft decided to open-source its code for everyone. The Grok 3 model was also added.

https://jules.google/
Google's announcements also include a similar cloud agent, Jules, but the website only offers a waitlist. Also, for some reason, the design looks like a pixel game.
UPD: During I/O, they announced beta access for users from the USA (5 total tasks per day)

https://docs.anthropic.com/en/docs/claude-code/sdk
Claude Code SDK. Anthropic announced an SDK for their console-based agentic programming system. It doesn't really look like a typical SDK, where you connect a product to your code and interact with it programmatically. More precisely, it doesn't look like that yet: the docs say "The SDK currently supports command line usage". In other words, they have mostly expanded the ways of interacting with it from the console.
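
In practice that means driving the claude binary non-interactively. A hedged sketch of what "SDK via the command line" looks like from Python; the -p/--print and --output-format flags are the ones described in the docs, so verify them against your installed version:

    # Minimal sketch: calling Claude Code non-interactively from Python.
    import json
    import subprocess

    result = subprocess.run(
        ["claude", "-p", "Summarize what this repository does",
         "--output-format", "json"],
        capture_output=True,
        text=True,
        check=True,
    )

    payload = json.loads(result.stdout)
    # The exact JSON structure is version-dependent; "result" holds the answer
    # in current builds.
    print(payload.get("result", payload))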

OpenAI did it
https://openai.com/index/introducing-codex/

They presented a cloud-based software engineering agent called Codex, powered by Codex-1 (a specialized version of o3), which should not be confused with the 2021 Codex model or the Codex CLI agent programming tool released last month.

Seriously, I recently wrote that it's currently very important to solve the problem of orchestrating AI programming agents' tasks, and it seems from the video presentation that they have done just that. It's not yet available in the standard Plus plan, only in Pro ($200/month), so not everyone will be able to try it.

Codex handles small, well-defined tasks well, but according to user feedback, it still struggles with follow-up requests in the chat. This means you first need to break the work down into a set of tasks that won't change afterward.

Codex is not intended for "vibe coding" and is best suited for experienced engineers working with stable repositories: adding features or fixing bugs. It has a simple interface, similar to the familiar ChatGPT, with a text field for describing the task and "Ask" and "Code" buttons.

https://www.youtube.com/watch?v=utujQfglbk8

There's a button similar to "play" that sends the task to the agent in the cloud in the background. It queues the task, then shows a detailed execution log. In the video presentation, it looks like a significant achievement for the field of AI programming agents.

By the way, Cursor also added a preview of the background agent feature for a limited number of users in the new version 0.50.

Amp available to everyone since May 15
https://ampcode.com/how-i-use-amp

Sourcegraph decided to take an interesting marketing approach. They already have a VS Code AI coding-agent plugin (Cody) positioned for business; now they have created a new, separate website in a strange, informal, conversational style and are selling a new AI agent plugin, named Amp, through it.

It has a manual that already looks like yet another site, https://ampcode.com/manual, where they write about their principles, one of which is "No model selection, always the best models. You don't choose the models, we do". They currently use Claude 3.7 Sonnet with extended thinking, which is certainly good, but according to the leaderboards the best model is Gemini 2.5 Pro.

Currently, they give 1000 free credits (from my usage, it's about 700k tokens), then packages are $5 for 500.

The system instructions file here is AGENT.md. It's unclear when we will all agree on a single name; for now, repositories will carry ten copies, one for each AI agent.

Based on my observations, by the end of 2024, few people took Codeium Windsurf seriously.

Here's a Hacker News thread from 70 days ago comparing Windsurf and Cursor, which didn't attract much engagement https://news.ycombinator.com/item?id=43288745. Cursor is mentioned as one of the first AI IDEs users tried; it's well-configured and 'just works'. Windsurf's positives include a free autocompletion feature and greater versatility. Github Copilot lags behind Cursor and Windsurf in functionality.

When the topic of vibe coding came up, Windsurf, being a simpler system compared to Cursor, started attracting more users. Subsequently, they rebranded, and the company's focus improved. News about a potential acquisition by OpenAI has been circulating for several weeks, further increasing interest.

In a new comparison poll on Hacker News https://news.ycombinator.com/item?id=43959710, significantly more people participated.

People note that the AI IDE market is changing rapidly. Developers are constantly releasing new features, and tools borrow ideas from each other. This leads to the 'leader' often changing.

Discussion on "Agentic / Vibe Coding":

  • people see the potential in "agentic mode" for automating routine tasks (e.g., adding types, creating boilerplate), but emphasize the need for careful review of generated code.
  • there's a significant range of opinions on the effectiveness and safety of "agentic / vibe coding" where the AI independently makes changes across any files in the repository.
  • some experienced developers believe that AI helps non-experts more, while for experienced users, it's more like 'smarter autocompletion'.

Cursor Pros:

  • excellent autocompletion ("tab-complete"), better than competitors
  • the Cmd-K feature (inline editing) broadly made the IDE known and continues to be liked by users
  • clear pricing ($20 per month) which is quite cheap for access to the best models

Cursor Cons:

  • Cursor limits context to save costs - the system tries to use as few tokens as possible
  • the "Agent mode" is still quite imperfect and tends to jump too far ahead

Windsurf Pros:

  • repository code awareness seems better
  • feels faster in some aspects

Windsurf Cons:

  • problems with large files and a similar context limitation, where only a small piece of code is sent to the model
  • the interface is more suited for vibe coding, making it harder to work "manually"
  • pricing - some find it more expensive than Cursor in agent mode, because with active use, on top of the base plan (up to $15 per month), you need to buy credit packages at $10 per 250 credits

Thread participants express positive feedback about Zed as a fast, efficient, and 'uncluttered' editor. But AI autocompletion and 'intelligence' in Zed are not yet at Cursor's level. Additionally, it doesn't support Windows.

They are also compared with Aider, Cline, GitHub Copilot, JetBrains IDEs (IntelliJ, PyCharm, Rider, etc.). Quite a few other AI tools are also mentioned: Claude Code (very expensive), Amazon Q (good for AWS), Machtiani, Brokk (an Aider alternative), Repomix, Void (an open-source Cursor alternative), Nonbios.ai, Amp.

Many participants recommend trying multiple tools, as the situation is changing rapidly, and what works today may change tomorrow.

https://deepmind.google/discover/blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/

Google DeepMind AlphaEvolve
An AI agent for algorithm design based on Gemini (a combination of Flash and Pro), available to academic researchers. It combines the creativity of large language models (LLMs) with automated evaluators that score candidates against defined metrics, and uses an evolutionary approach to discover and refine the best ideas.
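
AlphaEvolve itself is not public, but the loop it describes is easy to picture. A toy sketch (purely illustrative, with a placeholder instead of the real LLM call): candidate programs are mutated by a model, scored by an automated evaluator, and the best survivors seed the next generation:

    # Toy illustration of an evolutionary code-search loop (not AlphaEvolve).
    import random

    def evaluate(program: str) -> float:
        """Automated evaluator: run the candidate and return a metric (higher is better)."""
        namespace = {}
        try:
            exec(program, namespace)            # candidate must define solve()
            return float(namespace["solve"]())  # the metric being optimized
        except Exception:
            return float("-inf")                # broken candidates are discarded

    def propose_variant(parent: str) -> str:
        """Placeholder for an LLM call that rewrites the parent program."""
        return parent  # a real system would ask the model for a targeted mutation

    def evolve(seed: str, generations: int = 10, population: int = 8) -> str:
        best = [seed]
        for _ in range(generations):
            candidates = best + [propose_variant(random.choice(best))
                                 for _ in range(population)]
            candidates.sort(key=evaluate, reverse=True)
            best = candidates[:2]               # top candidates become parents
        return best[0]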

Where is it already used?

1. Google Data Center Optimization 🖥️

  • AlphaEvolve found a more efficient algorithm for resource allocation in Borg (Google's data center management system).
  • Result: +0.7% of Google's global computing resources are now used more efficiently.

2. Hardware Design 💻

  • Optimized matrix multiplications in TPUs (Google's specialized chips for AI).
  • Accelerated the operation of arithmetic circuits while maintaining correctness.

3. Accelerating AI Training ⚡

  • Reduced Gemini training time by 1% by optimizing matrix operations.
  • Accelerated FlashAttention (a core algorithm for transformers) by 32.5%.

It improved on Strassen's algorithm (1969) for 4x4 matrices, reducing the number of scalar multiplications needed (48 instead of 49, for complex-valued matrices). It also improved the best known solutions for 20% of open problems in mathematical analysis, geometry, and combinatorics.

Interestingly, AlphaEvolve was used to optimize components involved in training the Gemini models themselves. This raises questions about the potential for recursive AI self-improvement and the approach towards a "singularity".

It seems that working with Claude Code, Cursor, and the rest has become largely repetitive. The workflow usually involves planning the task (a roadmap file), then giving the agent commands to implement the plan in code.

Thus, task orchestration is the next necessary thing for every agentic AI solution.

I have already mentioned https://www.task-master.dev/, which is currently a popular solution due to MCP.


aider
https://aider.chat/docs/scripting.html
Aider natively supports simple scripting from the terminal for performing repetitive actions. There is also a Python API for scripting, but it is not officially supported or documented.
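
For the record, here is roughly what that unofficial Python API looks like (a hedged sketch based on the scripting page; since it is unsupported, expect it to change between versions):

    # Sketch of aider's unofficial Python scripting API.
    from aider.coders import Coder
    from aider.models import Model

    fnames = ["greeting.py"]   # files aider is allowed to edit
    model = Model("gpt-4o")    # any model aider supports

    coder = Coder.create(main_model=model, fnames=fnames)

    # Each run() call is one instruction, like one message in aider's chat.
    coder.run("write a greet(name) function and a small __main__ demo")
    coder.run("add type hints")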

Roo Code | Boomerang Orchestrator (since ver 3.14.3)
https://docs.roocode.com/features/boomerang-tasks
They added "🪃 Orchestrator" as a built-in mode. It allows breaking down complex projects into smaller, manageable parts. Each sub-task is then executed in its own context, often using a different mode tailored for that specific task.


Code Claude Code
https://github.com/RVCA212/codesys
A project developing a Python SDK for interacting with the Claude CLI tool. The most effective way to use it is by mimicking your actual workflow. Supports resuming specific conversations by ID.

Cloud Code SDK
https://cloudcoding.ai/
A programmable AI coder SDK in Python, running both locally and in a cloud sandbox. You can think of it as a low-level, highly controllable way of interacting with something like Cursor or Claude Code, except that instead of those applications the project uses its own agent, which can modify code and use its own built-in tools. It currently supports only OpenAI and Anthropic models and works with or without Git repositories.

GitHub has posted a large tutorial on the new GitHub Copilot.

https://www.youtube.com/watch?v=0Oz-WQi51aU

Three Modes (now similar to Cursor 😉):

  • Ask Mode 💬 – for discussing changes and getting answers.
  • Edit Mode ✏️ – for precise edits and refactoring.
  • Agent Mode 🤖 – automated task execution (e.g., code generation from README).

Example: Creating a hotel booking application using different models (Claude 3.5, Gemini 2.5 Pro, GPT-4).

🔧 Working Techniques

Structured README file 📄: A clear description of the project, tech stack, and file structure helps the agent generate code more accurately.

Copilot Instructions 📌: A file with global guidelines (e.g., code style requirements, security, logs).

Visual Prompting 🖼️: Some models support uploading screenshots for UI analysis.

🛠️ Problem Solving

  • Browser Caching: Copilot can suggest clearing the cache or a fix for templates.
  • Testing: Automated test generation (e.g., for Flask endpoints) using the /test command.
  • Documentation: Updating the README file via Gemini 2.5 Pro with Mermaid diagrams.

🚀 Tips

Claude 3.5 – balances speed and quality.
Gemini 2.5 Pro – powerful documentation generation.
GPT-4 – for complex tasks with context.

Security: Always ask Copilot for a code audit (e.g., How can I make this app more secure?).

Windsurf is in talks to be acquired by OpenAI for about $3 billion.

Apple and Anthropic are teaming up to build a “vibe-coding” software platform that will use generative AI to write, edit, and test code for programmers.

https://developers.googleblog.com/en/gemini-2-5-pro-io-improved-coding-performance/

Google released Gemini 2.5 Pro Preview (I/O edition). This update features even stronger coding capabilities. Expect meaningful improvements for front-end and UI development, alongside improvements in fundamental coding tasks such as transforming and editing code, and creating sophisticated agentic workflows.

https://windsurf.com/
https://lovable.dev/

Windsurf and Lovable have improved the design of their products and pricing strategy.

Windsurf has a new logo and more transparent use of "credits" for the AI chat. The free tier now has new, higher limits, plus unlimited Fast Tab and Cascade Base.

https://lovable.dev/blog/lovable-2-0

Lovable 2.0 introduces key innovations: switching the agent to chat mode for better understanding and planning, workspaces for collaborative development, and a security scanning function to detect vulnerabilities.

In addition to major functional updates, Lovable 2.0 has updated its brand and interface, added the ability to visually edit styles, and simplified the process of connecting custom domains.

Changes in pricing plans, which now include Pro and Teams, are aimed at better meeting the needs of both individual developers and teams.

https://docs.cursor.com/guides/advanced/large-codebases

Cursor developers shared tips and techniques for effectively working with large and complex codebases.

They highlighted key aspects that help in navigating unfamiliar code faster. Key recommendations include:

  • Using Chat for Code Understanding: Via the chat mode, you can quickly get explanations on how certain parts of the code work. It is also recommended to enable the "Include Project Structure" feature for better understanding of the project structure.
  • Writing Rules: Creating rules allows emphasizing important project information and ensures better understanding for the Cursor agent.
  • Detailed Planning of Changes: For large tasks, it's worth spending time creating an accurate and well-structured plan of action steps.
  • Choosing the Right Tool: Cursor offers various tools (Tab, Cmd K, Chat), each with its advantages for specific tasks – from quick fixes to large-scale changes across multiple files.

They emphasize the importance of breaking down large tasks into smaller parts, including relevant context, and frequently creating new chats to maintain focus.

https://memex.tech/blog/introducing-memex-the-everything-builder-for-your-computer

Memex has officially announced the launch of its platform, which allows you to create any software, from web applications to 3D designs. It's worth noting that they chose a rather unfortunate name: firstly, "memex" is the term coined by the inventor Vannevar Bush, and secondly, many existing projects already use it.

Memex is positioned as "The Everything Builder" for your computer. The platform supports any technology stack and programming language. Memex runs on Windows/Mac/Linux (it is built on the Tauri framework) and allows anyone, regardless of technical experience, to explore, build, and deploy software solutions by talking to AI.

The agent uses Claude models (a combination of Sonnet 3.7 and Haiku) and has Internet access. It creates checkpoints via a built-in shadow git. Support for Gemini 2.5 and MCP is planned.