Best approach to inheriting an out of control system?

Question

I can't find any good advice for managing what is a reasonably common situation, "how to successfully fire fight"On-prem, cobbled together system, multiple parts that are completely unknown, everyone is afraid of making a change, if it goes down it might not come back up.Two devs were building a little project on top of a system and adding bits to it as they got new ideas (both quite junior quick and dirty types). Much of the codebase they have never touched, it's just there running with the stuff they have on top.Deployed it to an OVH remote managed rack (yeah I know!) and offered it to a client. Client loved it, hockey stick growth, huge demand, totally not production ready but it's throwing off cash (lots and lots and lots).Devs are burnt out and one is taking an offline holiday, they aren't mature about the situation, they are also a self contained business unit away from the rest of the company.Emergency consultant and senior devs from parent company started to take a look over the evenings this week. Emergency plan is to get a copy deployed and running in azure so there are 2 instances and if the OVH one goes down again we can swap traffic (pending P&L sign off)Potential for a great product to die and cause reputational damage. Potential to beat it into shape and turn it into a production product and multi million revenue.Thoughts?

austincheney · Accepted Answer

This is more a problem of failed leadership, possibly at multiple levels.
If you are the guy tapped to fix it then own it. The problems occurred before you got there and you are dealing with a legacy mess, but real leadership means having the humility to own that problem like it is all your own.
Don’t blame the junior developers for their lack of direction. Cover for them and protect them. Earn their trust and guide them in a better direction. This is where you will begin to turn the ship around.
Numerous people will be quick to bias you with their opinions. Always bear in mind that bias is misleading. Be firm in forming original opinions from your own observations.
Set a valid plan with realistic expectations. This problem didn’t happen in a day and it won’t be fixed in a day. Define primitive metrics to identify progress and business health. Use those metrics to determine if you are moving in the proper direction and the speed of success. Be very transparent about your measures to both your management and your team.
Frequently praise your people, work your ass off, and set a positive example.

rawgabbit · Answer

Your leadership will view the situation from a Risk-Reward lens. They know the reward which is the revenue it brings in. They will ask you what is the worst that can happen and then they will decide to continue or shut it down.I personally think that lifting and shifting to Azure is a good thing. Next you should get Microsoft Premier Support/Premier Field Engineering to analyze what you have. They will give you plan on how to improve reliability and probably reduce your costs as well. You take that plan and explain to your client the good news/bad news. The bad news is that Microsoft says we should perform the following to improve reliability; the good news is that your company will not charge the client extra; you will improve reliability for "free". As reliability improves, your company can direct the senior devs to start designing version 2.0 (maybe).

atsaloli · Answer

Max's answer to the InfoQ interview[1] question "What is the most terrible code that you ever encountered, and what was your approach to refactoring it?" may be relevant.
In his book "Code Simplicity", Max has a checklist (summarized in point 4 of https://techbeacon.com/app-dev-testing/rewrites-vs-refactori...) -- and that's the checklist referred to in the InfoQ interview.
1: https://www.infoq.com/news/2018/01/q-a-max-kanat-alexander/

backslash_16 · Answer

It sounds like you are understandably worried about it going down. Standing up another copy is a great idea, and a great exercise in making sure your company understands how to build and deploy it from scratch.
When you do that you also need to figure out the data persistence layer. You probably want to either share it between both instances of the application or setup a backup/copy system so the version in the cloud has up to date data and is more of a hot spare than a second set of infra lying around with an empty DB.
Moving on from there, if your only goal is to keep it alive while the devs are on vacation, you should probably implement a deployment freeze. Yes in an ideal world you would make any and all changes to an infrastructure as code template and re-deploy, or at least change config files in a repo and re-deploy those but it sounds like the application isn't that modular.
Most incidents are caused by change, so minimize that until the whole team is back together, including the devs who wrote the service, to start making improvements.
At the same time you need to know or figure out how you can change config to keep it running. Ex: You need to update a cert thumbprint or change a timeout value. It sounds like it's either running on bare metal or on VMs on a physical server you own? If that's the case maybe sshing into the boxes to manually edit config is the least bad way of updating it (again - only until this emergency situation is over). If you go the route of ssh-ing, at least build a tool or script to update the key value pairs in your config store so you don't missing an angle bracket or quote and set your whole system on fire.
If the developers weren't going on vacation my advice would be a lot different. What I have written above is purely tactics to keep the system alive and the business making money until it's a better time, personnel wise, to improve the system.
Lastly for some dev advice, get started on some end to end or "pinning" tests. Yes they are typically the most fragile and slow type of tests but you can get a lot of safety and peace of mind from just a few of them. I personally feel that in situations like these they are the best value per dev hour spent.
If you're using Python, introduce mypy immediately. Same for JavaScript (use TypeScript). Being able to lean on a type system (or if you're in a static language - compiler) when refactoring or making sweeping changes it incredible helpful and lowers the chance of making a mistake by a large amount.
If you want more, I have tons more I can write on this subject but this is getting quite long :)

nhayfield · Answer

rewrite it from scratch

thedevindevops · Answer

Step 1) Document the system (what currently exists) in more detail than you think you'll need, paying specific focus to the system and module boundaries (if the Interfaces are done well this should be a fairly easy step)
Step 2) Modularise all the things, tuck all those fiddly bits behind the abstractions you pulled out in Step 1
Step 3) Mock and unit test all the things, this is actually crucial because this is where you clarify all the crud that built up due to developer assumptions, you test and check all those Modules and verify the system is doing not only what you think it's doing but what it's supposed to be doing
Step 4) Introduce a good Dependency Injection Framework
Step 5) (This is where you actually fix things) Now you can break up those Modules and refactor their internals - even swapping out entirely freshly written modules thanks to that nice DI framework - with confidence that you won't break the overall system.