- solving tasks that just require applying knowledge ("here's a paste of my python import structure. I don't write Python often and I'm aware I'm doing something wrong here because I get this error, tell me the proper way to organise the package").
- writing self-contained throwaway pieces of code ("here's a paste of my DESCRIBE TABLE output, write an SQL query to show the median [...]").
- as a debugging partner ("I can SSH to this host directly, but Ansible fails to connect with this error, what could be causing this difference").
All these use cases work great, I save a lot of time. But with the core work of writing the code that I work on, I've almost never had any success. I've tried:
- Cursor (can't remember which model, the default)
- Google's Jules
- OpenAI Codex with o4
I found in all cases that the underlying capability is clearly there (the model can understand and write code) but the end-to-end value is not there at all. It could write code that _worked_, but trying to get it to generate code that I am willing to maintain and "put my name on" took longer than writing the code would have.
I had to micromanage them infinitely ("be sure to rerun the formatter, make sure all tests pass", "please follow the coding style of the repository", "you've added irrelevant comments, remove those", "you've refactored most of the file but forgot a single function"). It would take many, many iterations on trivial issues, and because these iterations are slow, that just meant I had to context switch a lot, which is also exhausting.
Basically it was like having an intern who has successfully learned the core skill of programming but is not really capable of good collaboration and needs to be babysat all the time.
I asked friends who are enthusiastic vibe coders and they basically said "your standards are too high".
Is the model for success here that you just say "I don't care about code quality because I don't have to maintain it because I will use LLMs for that too?" Am I just not using the tools correctly?
Those who can’t stop raving about how much of a superpower LLMs are for coding, how it’s made them 100x more productive, and is unlocking things they could’ve never done before.
And those who, like you, find it to be an extremely finicky process that requires an extreme amount of coddling to get average results at best.
The only thing I don’t understand is why people from the former group aren’t all utterly dominating the market and obliterating their competitors with their revolutionary products and blazing fast iteration speed.
For example, using an LLM to help you write a Dockerfile when you write Dockerfiles once a project and don't have a dedicated expert like a Deployment Engineer in your company is fantastic.
Or using an LLM to get answers faster than google for syntax errors and other minor issues is nice.
Even using LLM with careful prompting to discuss architecture tradeoffs and get its analysis (but make the final decision yourself) can be helpful.
Generally, you want to be very careful about how you constrain the LLM through prompts so that you keep it on a very narrow path and it doesn't do something stupid (as LLMs are prone to do). You also often have to iterate, because LLMs will occasionally do things like hallucinate APIs that don't actually exist. But even with iteration it can often make you faster.
I use a wide variety of tools. For more private or personal tasks, I mostly rely on Claude and OpenAI; sometimes I also use Google or Perplexity—whichever gives the best results. For business purposes, I either use Copilot within VSCode or, via an internal corporate platform, Claude, OpenAI, and Google. I’ve also experimented a bit with Copilot Studio.
I’ve been working like this for about a year and a half now, though I haven’t had access to every tool the entire time.
So far, I can say this:
Yes, LLMs have increased my productivity. I’m experimenting with different programming languages, which is quite fun. I’m gaining a better understanding of various topics, and that definitely makes some things easier.
But—regardless of the model or its version—I also find myself getting really, really frustrated. The more complex the task, the more I step outside of well-trodden paths, and the more it's not just about piecing together simple components… the more they all tend to fail. And if that’s not enough: in some cases, I’d even say it takes more time to fix the mess an LLM makes than it ever saved me in the first place.
Right now, my honest conclusion is this: LLMs are useful for small code completion tasks, troubleshooting, and explaining, but that's about it. They're not taking our jobs anytime soon.
I reflected once that very little of my time as a senior engineer is actually spent just banging out code. The actual writing of the code is never the hard part or time-consuming part for me - it's figuring out the right architecture, figuring out how to properly refactor someone else's hairball, finding performance issues, debugging rare bugs, etc. Yes, LLMs accelerate the process of writing the boilerplate, but unless you're building brand new products from scratch every 2nd week, how much boilerplate are you really writing? If the answer is "a lot", you might consider how to solve that problem without relying on LLMs!
It is amazing how in our field we repeatedly forget this simple advice from Fred Brooks.
In my experience, LLMs are way more useful for coding and less problem-prone when you use them without exaggerated expectations and understand that it was trained on buggy code, and that of course it is going to generate buggy code. Because almost all code is buggy.
Don't delegate design to it; use functional decomposition, do your homework, and then use LLMs to eliminate toil, to deal with the boring stuff, to guide you on unfamiliar territory. But LLMs don't eliminate the need for you to understand the code that goes out under your name. And if you think a piece of LLM-generated code is perfect, remember that the defects may still be there and you need to improve your own knowledge and skills to find them. Always be suspicious; don't trust it blindly.
As you say, it's great for automating away boring things; as a more complicated search & replace, for instance. Or, "Implement methods so that it satisfies this interface", where the methods are pretty obvious. Or even "Fill out stub CRUD operations for this set of resources in the API".
I've recently started asking Claude Opus 4 to review my patches when I'm done, and it's occasionally caught errors, and sometimes has been good at prompting me to do something I know I really should be doing.
But once you get past a certain complexity level -- which isn't really that far -- it just stops being useful.
For one thing, the changes which need to be made often span multiple files, each of which is fairly large; so I try to think carefully about which files would need to be touched to make a change; after which point I find I have an idea what needs to be changed anyway.
That said, using the AI like a "rubber duck" programmer isn't necessarily bad. Basically, I ask it to make a change; if it makes it and it's good, great! If it's a bit random, I just take over and do it myself. I've only wasted the time of reviewing the LLM's very first change, as nearly everything else I'd have had to do anyway if I'd written the patch myself from scratch.
Furthermore, I often find it much easier to take a framework that's mostly in the right direction and modify it the way that I want, than to code up everything from scratch. So if I say, "Implement this", and then end up modifying nearly everything, it still seems like less effort than starting from scratch myself.
The key thing is that I don't work hard at trying to make the LLM do something it's clearly having trouble with. Sometimes the specification was unclear and it made a reasonable assumption; but if I tell it to do something and it's still having trouble, I just finish the task myself.
- I find it is pretty good at making fairly self-contained react components or even pages especially if you are using a popular UI library
- It is pretty reliable at making well-defined pure functions and I find it easier to validate that these are correct
- It can be good for boilerplate in popular frameworks
I sometimes feel like I am losing my mind because people report these super powerful end to end experiences and I have yet to see anything close in my day to day usage despite really trying. I find it completely falls over on a complete feature. I tried using aider and people seem to love it but it was just a disaster for me. I wanted to implement a fairly simple templated email feature in a Next.js app. The kind of thing that would take me about a day. This is one of the most typical development scenarios I can imagine. I described the feature in its entirety and aider completely failed, not even close. So I started describing sub-features one by one and it seemed to work better. But as I added more and more, existing parts began to break, I explained the issues to aider and it just got worse and worse with every prompt. I tried to fix it manually but the code was a mess.
Sure, vibe coders by definition can't have any standards for the code they're generating because by definition they never look at it.
> Is the model for success here that you just say "I don't care about code quality because I don't have to maintain it because I will use LLMs for that too?"
Vibe coding may work for some purposes, but if it were currently a successful strategy in all cases, or even narrowly for improving AI, Google AI or DeepSeek or somebody would be improving their product far faster than mere humans could, by virtue of having more budget for GPUs and TPUs than you do, and more advanced AI models, too. If and when this happens you should not expect to find out by your job getting easier; rather, you'll be watching the news and extremely unexpected things will be happening. You won't find out that they were caused by AI until later, if ever.
Fast food, assembly lines, and factories may be examples, but there is a HUGE catch: when a machine with a good setup makes your burger, car, or wristwatch, you can be 99.99% sure it is as specified. You trust the machine.
With LLMs, you have to verify each single step, and if you don't, it simply doesn't work. You cannot trust them to work autonomously 24/7.
That's why you ain't losing your job, yet.
What LLMs are good at, and their main value I'd argue, is nudging you along and removing the need to implement things that "just take time".
Like some days back I needed to construct a string with some information for a log entry, and the LLM that we have suggested a solution that was both elegant and provided a nicer formatted string than what I had in mind. Instead of spending 10-15 minutes on it, I spent 30 seconds and got something that was nicer than what I would have done.
It's these little things that add up and create value, in my opinion.
It finally clicked for me when I tried Gemini and ChatGPT side by side. I found that my style of working is more iterative than starting with a fully formed plan. Gemini did well on oneshots, but my lack of experience made the output messy. This made it clear to me that the more chatty ChatGPT was working for me since it seems to incorporate new stuff better. Great for those "Oh, crap I didn't think of that" moments that come up for inexperienced devs like me.
With ChatGPT I use a modular approach. I first plan a high level concept with o3, then we consider best practices for each part. After that I get best results with 4o and Canvas since that model doesn't seem to overthink and change direction as much. Granted, my creations are not pushing up against the limits of human knowledge, but I consistently get clean maintainable results this way.
Recently I made a browser extension to show me local times when I hover over text on a website that shows an international time. It uses regex to find the text, and I would never have been able to crank this out myself without spending considerable time learning it.
This weekend I made a Linux app to help rice a spare monitor so it shows scrolling cheat sheets to help me memorize stuff. This turned out so well, that I might put it up on GitHub.
For dilettantes like me this opens up a whole new world of fun and possibilities.
Great for annoying ad-hoc programming where the objective is clear but I lack the time or motivation to do it.
Example: After benchmarking an application on various combinations of OS/arch platforms, I wanted to turn the barely structured notes into nice graphs. Claude Code easily generated Python code that used a cursed regex parser to extract the raw data and turned it into a bunch of grouped bar charts via matplotlib. Took just a couple minutes and it didn't make a single mistake. Fantastic time saver!
This is just an ad-hoc script. No need to extend or maintain it for eternity. It has served its purpose and if the input data will change, I can just throw it away and generate a new script. But if Claude hadn't done it, the graphs simply wouldn't exist.
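For flavour, a minimal sketch of the kind of script involved; the note format, benchmark names, and output file are made up for illustration, since I obviously can't reproduce the original:

```python
import re
from collections import defaultdict
import matplotlib.pyplot as plt

# Hypothetical note format: one result per line, e.g. "linux/amd64 startup 12.3s".
LINE_RE = re.compile(r"(?P<os>\w+)/(?P<arch>\w+)\s+(?P<bench>\w+)\s+(?P<secs>[\d.]+)s")

def parse_notes(text: str) -> dict[str, dict[str, float]]:
    """Group timings as results[benchmark][platform] = seconds."""
    results: dict[str, dict[str, float]] = defaultdict(dict)
    for m in LINE_RE.finditer(text):
        results[m["bench"]][f"{m['os']}/{m['arch']}"] = float(m["secs"])
    return results

def plot(results: dict[str, dict[str, float]], out: str = "benchmarks.png") -> None:
    """Draw one group of bars per benchmark, one bar per platform."""
    benchmarks = sorted(results)
    platforms = sorted({p for r in results.values() for p in r})
    width = 0.8 / len(platforms)
    fig, ax = plt.subplots()
    for i, platform in enumerate(platforms):
        xs = [b + i * width for b in range(len(benchmarks))]
        ys = [results[bench].get(platform, 0.0) for bench in benchmarks]
        ax.bar(xs, ys, width=width, label=platform)
    ax.set_xticks([b + 0.4 - width / 2 for b in range(len(benchmarks))])
    ax.set_xticklabels(benchmarks)
    ax.set_ylabel("seconds")
    ax.legend()
    fig.savefig(out)

if __name__ == "__main__":
    with open("notes.txt") as f:
        plot(parse_notes(f.read()))
```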
Update: Sorry, missed "writing self-contained throwaway pieces of code"... well for core development I too haven't really used it.
https://newsroom.ibm.com/2025-05-06-ibm-study-ceos-double-do...
However, only maybe 10% of that is agentic coding. Thus, my recommendation would be - try non-agentic tools.
My primary workflow is something that works with the Zed editor, and which I later ported as a custom plugin to Goland. Basically, you first chat with the AI in a sidebar possibly embedding a couple of files in the discussion (so far nothing new), and then (this is the new part) you use contextual inline edits to rewrite code "surgically".
Importantly, the inline edits have to be contextual, they need to know both the content of the edited file, and of the conversation so far, so they will usually just have a prompt like "implement what we discussed". From all I know, only Zed's AI assistant supports this.
With this I've had a lot of success. I still effectively make all architectural decisions, it just handles the nitty-gritty details, and with enough context in the chat from the current codebase (in my case usually tens of thousands of tokens worth of embedded files) it will also adhere very well to your code-style.
What works for me is collecting it manually and going one implementation chunk at a time. If it fails, I either do it myself or break it down into smaller chunks. As models got better these chunks got larger and larger.
Collecting context manually forces me to really consider what information is necessary to solve the problem, and it's much easier to then jump in to fix issues or break it down compared to sending it off blind. It also makes it a lot faster, since I shortcircuit the context collection step and it's easier to course-correct it.
Collecting manually is about 10 seconds of work as I have an extension that copies all files I have opened to the clipboard.
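I don't have that extension handy to share, but a minimal sketch of that kind of helper is easy enough; this version assumes you pass the file paths yourself and have pbcopy (macOS) or xclip (X11 Linux) installed:

```python
import subprocess
import sys
from pathlib import Path

def copy_files_to_clipboard(paths: list[str]) -> None:
    """Concatenate the given files with header lines and put the result on the clipboard."""
    chunks = [f"===== {p} =====\n{Path(p).read_text()}" for p in paths]
    payload = "\n\n".join(chunks)
    # macOS uses pbcopy; on X11 Linux fall back to xclip.
    cmd = ["pbcopy"] if sys.platform == "darwin" else ["xclip", "-selection", "clipboard"]
    subprocess.run(cmd, input=payload.encode(), check=True)

if __name__ == "__main__":
    copy_files_to_clipboard(sys.argv[1:])
```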
I’ve had great success using LLMs for things that I haven’t done in a while or never before. They allow me to build without getting too bogged down in the details of syntax.
Yes, they require constant attention, they are not fully independent or magical. And if you are building a project for the longer run, LLM-driven coding slows down a lot once the code base grows beyond just a couple of basic files (or when your files start getting to about 500-800+ lines)
I’ve tried several agentic editors and tools, including Cursor; they can def be helpful, but I’d rather just manually loop between ChatGPT (o4-mini-high for the most part) and the editor. I get a very quick and tight feedback loop in which I get plenty of control.
Git is essential for tracking changes, and tests are gold once you are at a certain size
> Am I just not using the tools correctly?
No, there is no secret sauce and no secret prompting. If LLMs were capable, we'd see lots of new software generated by them, given how fast LLMs are at writing code. Theoretically, assuming a conservative 10 tokens/s and roughly 100M tokens for the Chromium code base, you could write a new browser with LLMs in only 115 days (100M / 10 = 10M seconds ≈ 115 days).
1. Use a high quality model with big context windows via API (I recommend Openrouter). E.g. Google Gemini 2.5 Pro is one of the best and keeps consistently good quality (OpenAI reasoning models can be better at problem solving, but they're kind of a mixed bag). Other people swear by the Claude Sonnet models.
2. Upgrade the coding tools you combine with these high quality models. Google Jules and OpenAI Codex are brand new and have a totally different aim than Cursor. Don't use them (yet); maybe they will get good enough in future. I would focus on established tools like aider (steepest learning curve) or roo code (easier), paired with Openrouter, and if you want to have it really easy, claude code (only useful with a 100-200 USD Anthropic subscription IMHO). On average you will get better results with aider, roo, or claude code than with Cursor or Windsurf.
Btw. I think Cursor and Windsurf are great as a starter because you buy a subscription for 15-20 USD and are set. It may well be that the higher quality tools burn more tokens and you spend more per month, but you also get better quality back in return.
Last but not least and can be applied to every coding assistant: Improve your coding prompts (be more specific in regards to files or sources), do smaller and more iterations until reaching your final result.
I find them to be super useful for things that I don't already know how to do, e.g. a framework or library that I'm not familiar with. It can then give me approximate code that I will probably need to modify a fair bit, but that I can use as the basis for my work. Having an LLM code a preliminary solution is often more efficient than jumping to reading the docs immediately. I do usually need to read the docs, but by the time I look at them, I already know what I need to look up and have a feasible approach in my head.
If I know exactly how I would build something, an LLM isn't as useful, although I will admit that sometimes an LLM will come up with a clever algorithm that I wouldn't have thought up on my own.
I think that, for everyone who has been an engineer for some time, we already have a way that we write code, and LLMs are a departure. I find that I need to force myself to try them for a variety of different tasks. Over time, I understand them better and become better at integrating them into my workflows.
> - Cursor (can't remember which model, the default)
> - Google's Jules
> - OpenAI Codex with o4
Cursor's "default model" rarely works for me. You have to choose one of the models yourself. Sonnet 4, Gemini 2.5 Pro, and for tricky problems, o3.
There is no public release of o4; you used o4-mini, a model with poorer performance than any of the frontier models (Sonnet 4, Gemini Pro 2.5, o3).
Jules and Codex, if they're like Claude Code, do not work well with "Build me a Facebook clone"-type instructions. You have to break everything down and make your own tech stack decisions, even if you use these tools to do so. Yes they are not perfect and make regressions or forget to run linters or check their work with the compiler, but they do work extremely well if you learn to use them, just like any other tool. They are not yet magic that works without you having to put in any effort to learn them.
It's easy to get good productivity out of LLMs in complex apps; here are my tips:
Create a directory in the root of your project called /specs (a scaffolding sketch follows at the end of this list)
Chat with an LLM to drill into ideas, having it play the role of a Startup Advisor; work through problem definitions, what your approach is, and build a high level plan
If you are happy with the state of your direction, ask the LLM to build a product-strategy.md file with the sole purpose of describing to an AI Agent what the goal of the project is.
Discuss with an LLM all sorts of issues like:
Components of the site
Mono Repo vs targeted repos
Security issues and your approach to them
High level map of technologies you will use
Strong references to KISS rules; don't overcomplicate
A key rule is do not build features for future use
Wrap that up in a spec md file
Continue this process until you have a detailed spec, with smaller .md files indexed from your main README.md spec file
Continue prompting the roles of AI Developer, AI Consumer, AI Advisor, End User
Break all work into Phase 1 (MVP), Phase 2, and future phases, don't get more granular (only do Phase 2 if needed)
Ask the LLM to document best-practice development standards in your CLAUDE.md or whatever you use. Discuss the standards; err to industry standard if you are lost
Challenge the LLM while building standards, keep looping back and adjusting earlier assumptions
Instruct AI Agents like Claude Code to target specific spec files and implement only Phase 1. If you get stuck on how to do that, ask an LLM how to prompt your coding agent to focus; you will learn how they operate.
Always ask the coding agent to review any markdown files used to document your solution and update with current features, progress, and next issues.
Paste all .md files back into other AIs, e.g. the higher-end ChatGPT models, and ask them to review and identify missing areas / issues
Don't believe everything the agents say, challenge them and refuse to let them make you happy or confirm your actions, that is not their job.
Always provide context around errors that you want to solve, read the error, read the line number, paste in the whole function or focus your Cursor.ai prompt to that file.
Work with all the AIs; each has its strengths.
Don't use the free models, pay, it's like running a business with borrowed tools, don't.
Learn like crazy, there are so many tips I'm nowhere near learning.
Be kind to your agent
(Edited: formatting)
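To make the shape of this concrete, here is a minimal scaffolding sketch of the /specs layout described above. Everything except product-strategy.md and the phased breakdown is my own guess at file names, so adjust to taste:

```python
from pathlib import Path

# Hypothetical spec files; only product-strategy.md and the phase split are named above.
SPEC_FILES = {
    "README.md": "# Spec index\n\nLinks to every spec file live here.\n",
    "product-strategy.md": "# Product strategy\n\nWhat the project is for, written for an AI agent.\n",
    "architecture.md": "# Architecture\n\nComponents, repo layout, technology map, KISS constraints.\n",
    "security.md": "# Security\n\nIssues considered and the chosen approach to each.\n",
    "phase-1-mvp.md": "# Phase 1 (MVP)\n\nThe only phase the coding agent should implement for now.\n",
}

def scaffold(root: str = "specs") -> None:
    """Create the /specs directory and stub files if they don't already exist."""
    base = Path(root)
    base.mkdir(exist_ok=True)
    for name, body in SPEC_FILES.items():
        path = base / name
        if not path.exists():
            path.write_text(body)

if __name__ == "__main__":
    scaffold()
```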
Aider, in my humble opinion, has some issues with its loop. It sometimes works much better just to head over to AI studio and copy and paste. Sometimes it feels like aider tries to get things done as cheaply as possible, and the AI ends up making the same mistakes over again instead of asking for more information or more context.
But it is a tool and I view it as my job to get used to the limitations and strengths of the tool. So I see my role as adapting to a useful but quirky coworker so I can focus my energy where I'm most useful.
It may help that I'm a parent of intelligent and curious little kids. So I'm used to working with smart people who aren't very experienced and I'm patient about the long term payoff of working at their level.
Particularly, with data structures it is garbage: it never understands the constraints that justify writing a new one instead of relying on the ones from the standard library.
And finally, it is incapable of understanding changes of mind. It will go back to stuff already discarded or replaced.
The worst part of all is that it insists in introducing its own "contributions". For example, recently I have been doing some work on ML and I wanted to see the effect of some ablations. It destroyed my code to add again all the stuff I had removed on purpose.
Overall, it provides small typing/search savings, but it cannot be trusted at all yet.
That’s immensely valuable and pretty game changing
Firstly, there absolutely are people popping up in certain domains with LLM-assisted products they could not have managed otherwise, with results you would not suspect were made that way if you were not told.
However, I share the same problem myself. The root of it is "analysis is harder than synthesis". i.e. if you have to be sure of the correctness of the code it's far easier to write it yourself than establish that an LLM got it right. This probably means needing to change how to split things out to LLMs in ways human co-workers would find intolerable.
I tried to use an LLM to write a simple curses app - something where there's a lot of code out there, but most of the code is bad, and of course it doesn't work and there's lots of quirks. I then asked it to see if there are libraries out there that are better than curses, it gave me 'textual' which at first seemed like an HTML library, but is actually a replacement for curses. It did work, and I had some working code at the end, but I had to work around platform inconsistencies and deal with the LLM including outdated info like inline styles that are unsupported in the current version of the library. That said, I don't quite understand the code that it produced, I know it works and it looks nice, but I need to write the code myself if I want a deeper understanding of the library, so that I can support it. You won't get that from asking an LLM to write your code for you, but from you using what you learn. It's like any language learning. You could use google translate to translate what you want, and it may seem correct at first glance, but ultimately won't convey what you want, with all the nuance you want, if you just learned the language yourself.
But just wait for the next doubling of long task capacity (https://metr.org/blog/2025-03-19-measuring-ai-ability-to-com...). Or the doubling after that. AI will get there.
I have collected similar requests over time and I don't have to remind GH copilot/Claude as much anymore.
But then resources became cheap and it stopped mattering. Yeah, tight, well designed machine code is still some sort of art expression, but for practical purposes it makes sense to write a program in a higher level language and waste a few MB...
Write tests for x in the style of this file to cover a, b, c.
Help me find a bug here within this pseudo code that covers three classes and a few functions. Here's the behavior I see, here's what I think could be happening.
I rarely give it access to all the code; I usually give it small portions of code and ask for small things. I basically treat it as if I was reaching out to another senior developer in a large company or on SO. They don't care to learn about all the details that don't matter, and want a well-formulated question that's not wasting their time and that they can help with.
Using it this way I absolutely see the benefits and I'd say an arbitrary 1.25x sounds right (and I'm an experienced engineer in my field).
I'll just quietly keep using it this way and ignore the overwhelming hype on both sides (the "it's not a speed up" camp and the "it's 100x" camp). IMO both are wrong, but the "it's not a speed up" camp makes me question how they're using it the most.
It’s amazing. Better design in terms of UI / UX than I could have fathomed and so much more.
There’s a lot of duplicated code that I’ll clean up, but the site functions and will be launched for clients to start using soon.
For my day job, it’s also helping me build out the software at a faster pace than before and is an amazing rubber duck.
I've been writing code for 36 years, so I don't take any of the criticism to heart. If you know what you are doing, you can ship production quality code written by an LLM. I'm not going to label it "made by an AI!" because the consumer doesn't care so long as it works and who needs the "never AI!" backlash anyway?
But to the OP: your standards are too high. AI is like working with a bright intern, they are not going to do everything exactly the way that you prefer, but they are enthusiastic and can take direction. Choose your battles and focus on making the code maintainable in the long term, not perfect in the short term.
As the project becomes non-trivial (>1000 lines), they get increasingly likely to get confused. They can still seem helpful, but they may be confidently incorrect. This makes checking their outputs harder. Eventually silly bugs slip through and cost me more time than all of the time LLMs saved previously.
Perhaps one day I'll 'incorporate myself' and start posting my solutions and perhaps make some dough... but I benefit far more than the $20 a month I am paying.
The right 'prompt' (with plenty of specs and controls) saves me from the (classic!) swing-on-tree example: https://fersys.cloud/wp-content/uploads/2023/02/4.jpg
> "You've refactored most of the file but forgot a single function"). It would take many many iterations on trivial issues, and because these iterations are slow that just meant I had to context switch a lot, which is also exhausting.
Try prompts like this:
"Given these data structures: (Code of structs and enums), please implement X algorithm in this function signature. (Post exact function signature)."
Or: "This code is repetitive. Please write a macro to simplify the syntax. Here's what calling it should look like (Show macro use syntax)"
Or: "I get X error on this function call. Please correct it."
Or: "I'm copying these tables into native data structures/match arms etc. Here's the full table. Here's the first few lines of the native structures: ...""
I find that so far their quality is horizontal, not vertical.
A project that involves small depth across 5 languages/silos? Extremely useful.
A long project in a single language? Nearly useless.
I feel like it's token memory. And I also feel like the solution will be deeper code modularisation.
In Plato's Republic Socrates' compares the ability to produce a piece of furniture with the ability to produce the image of a cabient or so-forth with a small compact mirror; what is the difference if a deceivable crowd doesn't know the difference?
- Using an AI for strange tasks like using a TTS model to turn snippets of IPA text (for a constructed language) into an audio file (via CLI) - much of the task turned out to be setting up stuff. Gemini was not very good when it came to giving me instructions for doing things in the GCP and Google Workspace browser consoles. ChatGPT was much clearer with instructions for setting up AWS CLI locally and navigating the AWS browser console to create a dedicated user for the task etc. The final audio results were mixed, but then that's what you get when trying to push a commercial TTS AI into doing something it really thinks you're mad to try.
- Working with ChatGPT to interrogate a Javascript library to produce a markdown file summarising the library's functionality and usage, to save me the time of repeating the exercise with LLMs during future sessions. Sadly the exercise didn't help solve the truly useless code LLMs generate when using the library ... but it's a start.
- LLMs are surprisingly good at massaging my ego - once I learned how to first instruct them to take on a given persona before performing a task: I still fear LLMs, but now I fear them a little less ...
It started out promising, renaming the symbols according to my instructions. Slower than if I had done it myself, but not horribly slow. It skipped over a few renames so I did them manually. I had to tell it to continue every 2 minutes, so I could not really do anything else in the meantime.
I figured it’s quicker if I find the files in question (simple ripgrep search) and feed them to copilot. So I don’t have to wait for it to search all files.
Cool, now it started to rename random other things and ignored the naming scheme I taught it before. It took quite some time to manually fix its mess.
Maybe I should have just asked it to write a quick script to do the rename in an automated way instead :)
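For what it's worth, a minimal sketch of such a rename script, assuming a plain old-name -> new-name mapping and that whole-word regex replacement is safe enough for the codebase in question:

```python
import re
from pathlib import Path

# Hypothetical renames; in practice this would follow whatever naming scheme you have in mind.
RENAMES = {
    "old_symbol": "new_symbol",
    "OldClassName": "NewClassName",
}

def rename_in_tree(root: str, extensions: tuple[str, ...] = (".py",)) -> None:
    """Apply whole-word renames to every matching file under root."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")
    for path in Path(root).rglob("*"):
        if path.suffix not in extensions or not path.is_file():
            continue
        text = path.read_text()
        new_text = pattern.sub(lambda m: RENAMES[m.group(0)], text)
        if new_text != text:
            path.write_text(new_text)
            print(f"rewrote {path}")

if __name__ == "__main__":
    rename_in_tree(".")
```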
Now with webui, what's important is to constantly add tests around the code base; also, if it gets stuck, go through the logs and understand why.
It's more of a management role of "unblocking" the LLM if it gets stuck and working with it, rather than fitting it to my previous workflow.
Also there's a lot of marketing, it's cool to hype LLMs, and I guess people like to see content about what they can do on YouTube and Instagram.
As a developer buddy - no. LLMs don't actually think and don't actually learn like people do. That part of the overinflated expectations is gonna hit hard some companies one day.
What matters:
- the model -> choose the SOTA (currently Claude 4 Opus). I use it mostly in Cursor.
- the prompt: give it enough context to go by, reference files (especially the ones where it can start delving deeper from), be very clear in your intentions. Do bullet points.
- for a complex problem: ask it to break down its plan for you first. Then have a look to make sure it’s ok. If you need to change anything in your plan, now’s the time. Only then ask it to build the code.
- be patient: SOTA models currently aren’t very fast
I work at a company with millions of MAU, as well as do side projects - for the company, I do spend a bit more time checking and cleaning code, but lately with the new models less and less.
For my side projects, I just bang through with the flow above.
Good luck!
It has been unimaginably helpful in getting me up to speed in large existing codebases.
First thing to do in a codebase is to tell it to "analyze the entire codebase and generate md docs in a /llmdocs directory". Do this manually in a loop a few times and try a few different models. They'll build on each other's output.
Chunk embed and index those rather than the code files themselves. Use those for context. Get full code files through tool calls when needed
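A minimal sketch of that chunk-and-index step, with the embedding call left as a stand-in since the actual model/provider is whatever you already use (the chunk sizes are arbitrary):

```python
from pathlib import Path
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for whatever embedding model/API you already use."""
    raise NotImplementedError

def chunk(text: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-size character chunks with a little overlap."""
    return [text[i:i + size] for i in range(0, len(text), size - overlap)]

def build_index(docs_dir: str = "llmdocs") -> list[tuple[str, np.ndarray]]:
    """Embed every chunk of every generated doc and keep (chunk, vector) pairs."""
    index = []
    for path in Path(docs_dir).glob("*.md"):
        for piece in chunk(path.read_text()):
            index.append((piece, embed(piece)))
    return index

def search(index: list[tuple[str, np.ndarray]], query: str, top_k: int = 5) -> list[str]:
    """Return the top_k chunks by cosine similarity to the query."""
    q = embed(query)
    def score(vec: np.ndarray) -> float:
        return float(np.dot(vec, q) / (np.linalg.norm(vec) * np.linalg.norm(q)))
    ranked = sorted(index, key=lambda item: -score(item[1]))
    return [text for text, _ in ranked[:top_k]]
```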
I instantly decided to review the frontend and backend code with AI (used cursor and GitHub copilot)
It reported a dozen more issues which otherwise would have taken a few weeks to find.
We asked the AI to generate code that would help with security, providing rules informing it about the technology stack, coding guidelines, project structure, and product description.
We got good recommendations, but couldn't implement the suggestions as-is.
However, we took the advice and hand-coded the suggestions across all code files.
The entire exercise took a week for a fairly large project.
As per my tech lead, it would have taken minimum 2 months.
So it works.
Most of the time, the output isn't perfect, but it's good enough to keep moving forward. And since I’ve already written most of the code, Jules tends to follow my style. The final result isn’t just 100%, it’s more like 120%. Because of those little refactors and improvements I’d probably be too lazy to do if I were writing everything myself.
1. Improve the code that I already have. Waste of time, it never works. This is not because my code is too good, but because it is SQL with complex context and I get more hallucinations than usable code; the usable code it does produce is good for basic tasks and not for more.
2. Areas I rarely use and I don't maintain an expertise on. This is where it is good value, I get 80% of what I need in 20% of the time, I take it and complete the work. But this does not happen too often, so the overall value is not there yet.
In a way it's like RPA: it does something, not great, but it saves some time.
I think it’s great for coders of all levels, but junior programmers will get lost once the LLM inevitably hallucinates, and the expert will get gains, but not like those who are in the middle.
On the other side, getting a good flow is not trivial. I had to tweak rules, how to describe problem, how to plan the work and how to ask the agent. It takes time to be productive.
E.g. asking the agent to create a script to do string manipulation is better than asking it to do an in-place edit, as it's easier to debug and repeat.
I don't suppose there's any solution where you can somehow further train a LLM on your code base to make it become part of the neural net and not part of the prompt?
This could be useful on a large ish code base for helping with onboarding at the least.
Of course you'd have to do both the running and training locally, so there's no incentive for the LLM peddlers to offer that...
Also, the big difference with this tool is that you spend more time planning; don't expect it to one-shot. You need to think about how you go from epic to task first, THEN you let it execute.
I expect they soon will be able to help me with basic refactoring that needs to be performed across a code base. Luckily my code uses strong types: type safety quickly shows where the LLM was tripping/forgetting.
And given the output I’ve seen when I’ve tried to make it do more, I seriously doubt any of this magic generated software actually works.
My belief is that true utility will make itself apparent and won't have to be forced. The usages of LLMs that provide immense utility have already spread across most of the industry.
AI tools can do things faster, but at lower quality. They can't do the _same_ thing faster.
So AI is fine for specifically low quality, simple things. But for anything that requires any level of customization or novelty (which is most software), it's useless for me.
It writes JUnit tests with mocks, chooses to run them, and fixes the test or sometimes my (actually broken) code.
It’s not helpful for 90% of my work, but like having a hammer, it’s good to have when you know that you have a nail.
I'm not building commercial software and don't have a commercial job at the moment, so I'm kind of struggling with credits; otherwise I would probably blow $40-100 a day.
They're good at coming up with new code.
Give it a function signature with types and it will give a pretty good implementation.
Tell it to edit something, and it will lose track.
The write-lint-fix workflow with LLMs doesn't work for me - the LLM goes monkey-brain and edits unrelated parts of the code.
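A hypothetical example of that "signature first" pattern: paste only the signature and docstring and let the model fill in the body; the function below is made up for illustration, but it's the kind of self-contained, typed problem models tend to get right.

```python
def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping (start, end) intervals and return them sorted by start."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```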
I am using:
"response_format": { "type": "json_object" }
And with Vertex:
"generationConfig": { "responseMimeType": "application/json" }
And even:
"response_format": { "type": "json_schema", "json_schema": { ...
And with Vertex:
"generationConfig": { "responseMimeType": "application/json", "responseSchema": { ...
Neither of them is reliable. It always gives me json in the format of a markup document with a single json code block:
```json
{}
```
Sure I can strip the code fence, but it's mighty suspicious I asked for json and got markup. I am getting a huge number of json syntax errors, so it's not even getting to the schemas.
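The stripping itself is trivial; a minimal defensive sketch of what I mean (though it obviously doesn't fix the underlying problem):

```python
import json

def parse_maybe_fenced_json(raw: str):
    """Strip a markdown code fence if the model wrapped the JSON in one, then parse."""
    text = raw.strip()
    if text.startswith("```") and text.endswith("```"):
        lines = text.splitlines()
        # Drop the opening ```json line and the closing ``` line.
        text = "\n".join(lines[1:-1])
    return json.loads(text)
```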
When I did get to the schemas, it was occasionally leaving out fields that I'd declared were required (even if, e.g., null or an empty array). So I had to mark them as not required, since the strict schema wasn't guiding it to produce correct output, just catching it when it did.
I admit I'm challenging it by asking it to produce json that contains big strings of markup, which might even contain code blocks with nested json.
If that's a problem, I'll refactor how I send it prompts so it doesn't nest different types.
But that's not easy or efficient, because I need it to return both json and markup in one call, so if I want to use "responseMimeType": "application/json" and "responseSchema", then it can ONLY be json, and the markup NEEDS to be embedded in the json, not the other way around, and there's no way to return both while still getting json and schema validation. I'd hate to have to use tool calls as "out parameters".
But I'm still getting a lot of json parsing problems and schema validation problems that aren't related to nested json formatting.
Are other people regularly seeing markup json code blocks around what's supposed to be pure json, and getting a lot of json parsing and schema validation issues?
So the question really comes down to what kind of project you are developing:
Get an MVP fast? LLM is great!
Proof of Concept? LLM rules!
Big/Complex project? LLM junior developer is not up to the task.
You are missing a crucial part of the process - writing rules
Getting the best possible results requires:
- an LLM trained to have the "features" (in the ML/DL sense of the word) required to follow instructions to complete your task
- an application that manages the context window of the LLM
- strategically stuffing the context window with preferences/conventions, design information, documentation, examples, and your git repo's repo map
Make sure you actually use rules and conventions files for projects. Do not assume the LLM will be able to retrieve or conjure up all of that for you. Treat it like a junior dev and lead it down the path you want it to take. It is true, there is a bit of micromanagement required, but Aider makes that very simple to do. Aider even makes it possible to scrape a docs page to markdown for use by the LLM. Hooking up an LLM to search is a great way to stuff the context window, BTW; it makes things much simpler. You can use the Perplexity API with Aider to quickly write project plans and fetch necessary docs this way; just turn that into markdown files you'll load up later, after you switch to a proper code gen model. Assume that you may end up editing some code yourself; Aider makes launching your editor easy though.
This mostly just works. For fun the first thing I did with Aider was to write a TUI chat interface for ollama and I had something I could post to github in about an hour or two.
I really think Aider is the missing ingredient for most people. I have used it to generate documentation for projects I wrote by hand, I have used it to generate code (in one of my choice languages) for projects written in a language I didn't like. It's my new favorite video game.
Join the Aider discord, read the docs, and start using it with Gemini and Sonnet. If you want local, there's more to that than what I'm willing to type in a comment here but long story short you also need to make a series of correct decisions to get good results from local but I do it on my RTX4090 just fine.
I am not a contributor or author of Aider, I'm just a fanatical user and devotee to its way of doing things.
Why don’t you consider that the AI will be the one maintaining it?
I'm not yet a fan of Windsurf or Cursor, but honestly Roo Code's out-of-the-box personas for architect, and its orchestration to spin up focused subtasks, work well for me.
I am kinda treating it how I would a junior, to guide it there, give it enough information to do the work, and check it afterwards, ensuring it didn't do things like BS test coverage or write useless tests / code.
It works pretty well for me, and I've been treating prompting these bots just as a skill I improve as I go along.
Frankly it saves me a lot of time: I knocked out some work Friday afternoon that I'd estimate was probably 5 points of effort in 3 hours. I'll take the efficiency any day, as I've had less actual focus time for coding implementations than I used to in my career, due to other responsibilities.
So why are you complaining? I use AI all the time to give me suggestions and ideas. But I write the perfect code myself.
but I think this is solvable when context length goes way higher than current length
I tried to use many LLM tools. They are generally not capable of doing anything useful in a real project.
Maybe solutions like MCP, which allow the LLM to access the git history, will make the LLM useful for someone who actually works on a project.
TLDR; it works for a codebase of 1M LoC. AI writes code a lot faster, completing tasks in days instead of sprints. Tasks can be parallelized. People code less, but they need to think more often.
(1) Maintain clear and structured architecture documentation (README, DDD context/module description files, AGENTS.md).
(2) Create detailed implementation plans first - explicitly mapping dependencies, tests, and potential challenges.
(3) Treat the implementation plan as the single source of truth until execution finishes. Review it manually and with LLM assistance to detect logical inconsistencies. A plan is easier to change than a scattered diff.
(4) In complex cases - instruct AI agents about relevant documents and contexts before starting tasks.
(5) Approve implementation plans before allowing AI to write code
(6) Results are better if code agent can launch automated full-stack tests and review their outputs in the process.
The same works for me in smaller projects. Less ceremony is needed there.
Core Development Capabilities:
- File Discovery & Navigation: file_explorer with pattern matching and recursive search
- Intelligent Code Search: search_in_file_fuzzy with similarity thresholds for finding relevant code sections
- Advanced Code Editing: file_diff_writer with fuzzy matching that can handle code changes even after refactoring
- Backups: backups and restores of any files at any state of change.
- System Monitoring: Real-time log analysis and container management
- Hot Deployment: docker_rebuild for instant container updates (Claude can do the rebuild)
The Agentic Workflow:
- Claude searches your codebase to understand current implementation
- Uses fuzzy search to find related code patterns and dependencies
- Makes intelligent edits using fuzzy replacement (handles formatting changes)
- Monitors logs to verify changes work correctly
- Restarts containers as needed for testing
- Iterates based on log feedback
- Error handling requires analyzing logs and adjusting parsing strategies
- Performance tuning benefits from quick deploy-test-analyze cycles
I've not had any issues with Claude being able to handle changes, even doing things like refactoring overly large HTML files with inline CSS and JS. Had it move all that to a more manageable layout and helped out by deleting large blocks when necessary.
The fuzzy matching engine is the heart of the system. It uses several different strategies working in harmony. First, it tries exact matching, which is straightforward. If that fails, it normalizes whitespace by collapsing multiple spaces, removing trailing whitespace, and standardizing line breaks, then attempts to match again. This handles cases where code has been reformatted but remains functionally identical.
When dealing with multi-line code blocks, the system gets particularly clever. It breaks both the search text and the target content into individual lines, then calculates similarity scores for each line pair. If the average similarity across all lines exceeds the threshold, it considers it a match. This allows it to find code blocks even when individual lines have been slightly modified, variable names changed, or indentation adjusted.
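For anyone curious, a rough sketch of that cascade (exact match, then whitespace-normalised match, then averaged per-line similarity) using only the Python standard library; the threshold is made up and the real implementation surely differs:

```python
import re
from difflib import SequenceMatcher

def normalize(text: str) -> str:
    """Collapse runs of whitespace and strip trailing spaces on every line."""
    return "\n".join(re.sub(r"\s+", " ", line).rstrip() for line in text.splitlines())

def block_similarity(needle: str, candidate: str) -> float:
    """Average per-line similarity between two blocks with the same line count."""
    a, b = needle.splitlines(), candidate.splitlines()
    if not a or len(a) != len(b):
        return 0.0
    scores = [SequenceMatcher(None, x, y).ratio() for x, y in zip(a, b)]
    return sum(scores) / len(scores)

def find_block(needle: str, haystack: str, threshold: float = 0.85) -> int | None:
    """Return the starting line index of the best fuzzy match, or None."""
    if needle in haystack:                        # 1. exact match
        return haystack[:haystack.index(needle)].count("\n")
    norm_needle = normalize(needle)
    needle_lines = norm_needle.splitlines()
    norm_lines = normalize(haystack).splitlines()
    best, best_score = None, threshold
    for i in range(len(norm_lines) - len(needle_lines) + 1):
        window = "\n".join(norm_lines[i:i + len(needle_lines)])
        if window == norm_needle:                 # 2. whitespace-normalised match
            return i
        score = block_similarity(norm_needle, window)  # 3. per-line similarity
        if score >= best_score:
            best, best_score = i, score
    return best
```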
TLDR, to use those tools effectively you need to change yourself a bit but in a fairly good direction.
I write compilers. Good luck getting an LLM to be helpful in that domain. It can be helpful for breaking down the docs for something like LLVM, but not for writing passes or codegen etc.
I speak in thoughts in my head, and it is better to just translate those thoughts to code directly.
Putting them into language for LLMs to make sense of, and understanding the output, is oof... too much overhead. And yeah, the micromanagement, correcting mistakes, miscommunications, it's shit.
I just code like the old days, and if I need any assistance, I use ChatGPT.