HACKER Q&A
📣 thr0waway998877

How do you make sure your servers are up as a single founder?


I'm running a small business on AWS as a solo founder. It's just me. Yesterday I had a service interruption while I was in the London subway. Luckily, I was able to sign in to the AWS console and resolve the issue.

But it does (again) raise the question I'd rather not think about. What if something happens to me and there's another outage that I can't fix?

So - how do you make sure that your servers stay up as a one-person founder? Can I pay someone to monitor my AWS deployment and make sure it's healthy?


  👤 jmstfv Accepted Answer ✓
I am a solo founder of a website monitoring SaaS [0]. In theory, my uptime should be higher than my customers'. Here are a few things I've found helpful in the course of running my business:

* Redundancy. If you process background jobs, have multiple workers listening on the same queues (preferably in different regions or availability zones). Run multiple web servers and put them behind a load balancer. If you use AWS RDS or Heroku Postgres, use Multi-AZ deployment. Be mindful of your costs though, because they can skyrocket fast.

* Minimize moving parts (e.g. databases, servers, etc.). If possible, separate your marketing site from your web app. Prefer static sites over dynamic ones.

* Don't deploy within 2 hours of going to sleep (or leaving your desk). 2 hours is usually enough to spot a botched deploy.

* Try to use managed services as much as possible. As a solo founder, you probably have better things to focus on. As I mentioned before, keep an eye on your costs.

* Write unit/integration/system tests. Have good coverage, but don't beat yourself up for not having 100%.

* Monitor your infrastructure and set up alerts. Whenever my logs match a predefined regex pattern (e.g. "fatal" OR "exception" OR "error"), I get notified immediately. To be sure that alerts reach you, route them to multiple channels (e.g. email, SMS, Slack, etc.). Obviously, I'm biased here.
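For anyone curious what that log-watching idea can look like, here's a rough sketch using only the Python standard library. This is illustrative, not my actual setup: the log path, webhook URL, and pattern are placeholders.

    import json
    import re
    import time
    import urllib.request

    LOG_PATH = "/var/log/app/production.log"           # placeholder
    WEBHOOK_URL = "https://example.com/alert-webhook"   # placeholder (Slack-style incoming webhook)
    PATTERN = re.compile(r"fatal|exception|error", re.IGNORECASE)

    def notify(line: str) -> None:
        """POST the offending log line to a webhook as JSON."""
        payload = json.dumps({"text": f"Log alert: {line.strip()}"}).encode()
        req = urllib.request.Request(
            WEBHOOK_URL, data=payload, headers={"Content-Type": "application/json"}
        )
        urllib.request.urlopen(req, timeout=10)

    def follow(path: str):
        """Yield lines appended to the file, like `tail -f`."""
        with open(path) as f:
            f.seek(0, 2)  # jump to the end of the file
            while True:
                line = f.readline()
                if not line:
                    time.sleep(1)
                    continue
                yield line

    if __name__ == "__main__":
        for line in follow(LOG_PATH):
            if PATTERN.search(line):
                notify(line)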

I'm not gonna lie, these things make me anxious, even to this day (it used to be worse). I take my laptop everywhere I go and make sure that my phone is always charged.

[0] https://tryhexadecimal.com


👤 apankrat
> I was able to sign in to the AWS console and resolve the issue

Kids these days.

I had a RAM stick fry in one of the physical machines sitting in a colo an hour's drive away. It didn't die, it just started flipping bits here and there, triggering the most bizarre alerts you can imagine. On the night of December 24th. Now, that was fun.

--- To add ---

If you are a single founder - expect downtime and expect it to be stressful. Inhale, exhale, fix it, explain, apologize and then make changes to try and prevent it from happening again. Little by little, weak points will get fortified or eliminated and the risk of "incidents" will go down. There's no silver bullet, but with experience things become easier and less scary.


👤 __d
You might also want to consider some additional risks that are often overlooked:

Billing issues. What happens if the credit card you use to pay for everything gets hijacked, and you're stuck with a blocked card trying to clean it up while your bank takes their sweet time and won't give you another card until it's sorted? ALWAYS have a backup credit card.

DNS Registrar. There's a hard SPOF in the DNS, where your registrar essentially holds your domain name hostage. If your DNS gets hijacked, but your registrar is taking a few days to sort out who actually owns it, you're down hard. There's no mitigation for this one, except paying for a registrar with proper security processes. If you do 3FA anywhere, make it here.

AppStore. If your app gets banned, or a critical update blocked, what do you do? Building in a fallback URL (using a different domain name, with a different registrar) can help work around any backend issues. There's not much you can do for the frontend functionality, except shipping a web app.

It can be worthwhile looking at risks and possible mitigations beyond just server and database issues, especially when it's just you.


👤 thrownaway954
I know this is going to be down-voted into nonexistence since everyone nowadays wants serverless, AWS, and whatnot. Personally, I've always used either hosting.com or inmotionhosting.com. Yes, they are more expensive than AWS, but they both have support staff available 24/7/365. I can call whenever I need to and have someone remote into my server and fix whatever is wrong. Furthermore, I can even have the server alert emails routed not only to me but to them as well, so they know about the problem and are on it, and I don't have to do a thing.

👤 hellcow
The only way to achieve high availability is to have redundancy of all things.

Random things will go wrong that you can't predict. Boxes will die suddenly and without reason, even after months of working fine without changes, and always at the worst possible moment. Your system needs to be built to withstand that.

I'll take the opposite approach of everyone here and recommend against serverless, kubernetes, and Heroku/PAAS.

You are a solo founder. You should understand your infra from the ground up (note: not understand an API, or a config syntax, but how the underlying systems actually work in great detail). It needs to be simple conceptually for you to do that. If anything goes wrong, you need to be able to identify the cause and fix it quickly.

I've gone through this first-hand and know all the trade-offs. If you'd like, I'm happy to discuss architecture decisions on a call. Email is in my profile.


👤 jasonkester
I build my stuff on top of a stack that hardly ever goes down.

All my SaaS products run on a Windows server, with SQL Server as a database and ASP.NET on IIS running the public sites. You can probably come up with a lot of uncharitable things to say about those technologies, but "flimsy" and "fragile" likely aren't in the list.

As a result, when things go seriously wrong, the application pool will recycle itself and the site will spring back to life a few seconds later. Actual "downtime", of the sort that I learn about before it has fixed itself, might happen maybe once every couple of years. At least, I seem to remember it having happened once or twice in the last 15 years of running this way.

There's a Staging box in the cage, spun up and ready to go at a moment's notice, in case that ever changes. But thus far it has led a very lonely life.


👤 aspectmin
FWIW - if you just want to make sure your services are up - consider:

1) pagerduty.com or uptimerobot.com for remote monitoring to make sure your site(s) are up (and get alerts when they're not).

2) Datadog or New Relic if you want deeper monitoring (application performance, database performance, diagnostics/debugging).

3) Rollbar.com (site doesn't seem to respond) for site performance/errors.

4) Roll your own with Prometheus (https://prometheus.io/), Nagios (https://www.nagios.org/), or Icinga (a tiny example exporter is sketched below this list). Or... strangely, I still use MRTG for a few perf monitoring things: https://oss.oetiker.ch/mrtg/

5) If you want to monitor the status of deploys/builds - I love integrating CI/CD systems with Slack - very helpful.
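For item 4, a roll-your-own probe can be tiny. Here's a rough sketch of a custom exporter using the prometheus_client Python package; the URL and port are placeholders, and you'd still need a Prometheus server scraping it plus alert rules on top:

    import time
    import urllib.request

    from prometheus_client import Gauge, start_http_server  # pip install prometheus_client

    SITE_URL = "https://example.com/health"  # placeholder
    PORT = 9100                              # placeholder scrape port

    site_up = Gauge("site_up", "1 if the health endpoint answered with HTTP 200, else 0")
    probe_seconds = Gauge("site_probe_duration_seconds", "How long the last probe took")

    def probe() -> None:
        start = time.time()
        try:
            with urllib.request.urlopen(SITE_URL, timeout=10) as resp:
                site_up.set(1 if resp.status == 200 else 0)
        except Exception:
            site_up.set(0)
        probe_seconds.set(time.time() - start)

    if __name__ == "__main__":
        start_http_server(PORT)  # exposes /metrics for Prometheus to scrape
        while True:
            probe()
            time.sleep(30)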

Hope that helps - I've spent a lot of my career monitoring things, and have this mantra that I need to know about services being down before customers call to tell me.

(a lot of these have free tiers)


👤 cddotdotslash
Back when I was working on everything myself, I deployed everything through AWS Lambda and API Gateway, with all my static assets on S3 and CloudFront. I had exactly zero infrastructure issues over the course of two years and never dealt with security patches, SSH'ing, etc. If I were doing billions of requests, it may not have been the most cost effective, but it helped me scale without worrying about typical devops issues. Updates, testing, rollbacks, etc were also extremely easy.

👤 idlewords
I've run a one-person business on my own servers for about eight years.

Honestly, the answer is learning how to manage anxiety and stress, particularly doing potentially destructive things under pressure. I think the psychological aspects of this are much more difficult than the technical ones.

If it helps, people are generally very understanding if you explain that you are a solo founder, and take reasonable steps to fix issues in a timely way. Most customers assume every company is a faceless organization; their attitude is much more forgiving when they learn they're dealing with a fellow person.

You cannot be on call 24/7 forever. You will burn out. If you can't hire someone you trust to take over part of this burden, then you have to accept the risk of sometimes not being able to log in for N hours if there is an outage (because you're camping with your spouse, etc.)

For very high-stress situations (database crash, recovery from backup) working from a checklist that you have tested is very valuable.

Good luck to you, and I hope you found useful answers in this thread!


👤 freetonik
Heroku. I just pay for the privilege of not thinking (almost) about such issues.

👤 nlg
I agree in general with the responses encouraging better usage of managed platforms. I've run a SaaS app for a couple of years using a combination of AWS Elasticbeanstalk (Flask and Django) and AWS Lambda. Server resource related downtime has been minimal and recovery is quick/automated. Even hosting on Lambda you can run into issues without layers of redundancy (Lambda may be fine but a Route 53 outage would prevent you from hitting that endpoint if you're using that for DNS).

Before thinking about handing over management of the deployment, I would encourage you to think about what the root cause of the outage is and whether something in the app will create that situation again. I invested in setting up DataDog monitoring for all hosts with alerts on key resource metrics that were causing issues (CPU was biggest issue for me).

The other thing that's worked well for me is just keeping things simple. As a solo founder, time spent with customers is more valuable than time spent on infrastructure (assuming all is running well). It's a little dated, but I still think this is a good path to follow as you're building your customer base. A simple stack will let you spend more time learning how your product can help your customers best.

http://highscalability.com/blog/2016/1/11/a-beginners-guide-...


👤 dkersten
Most of the suggestions here are ways of restarting services when they go down, which is a good start, but that doesn't actually solve the issue I hit last night...

My system integrates with an external system, and what happened is that this external system started sending me unexpected data, which my system wasn't able to handle, because I didn't expect it and so never thought to test for it. The issue was that I was trying to insert IDs into a uuid database field, but this new data had non-uuid IDs. Because the original IDs were always generated by me, I could guarantee that the data was correct, but this new data was not generated by me. Of course, sufficient defensive programming would have avoided this, since the database error shouldn't have prevented other things from working; my point is that mistakes get made (we're human, after all) and things do get overlooked.

The problem is, restarting my service doesn't prevent this external data from getting received again, so it would simply break again as soon as more is sent and the system would be in this endless reboot loop until a human fixes the root cause.

That's a problem that I worry about, no matter how hard I try to make my system auto-healing and resilient (I don't know of any way to fix it other than putting great care into programming defensively), but again, we're human, so something will always slip through eventually...
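For the uuid case specifically, the defensive check I should have had is small. A rough sketch of the idea (names and sample data are made up, and the "quarantine" step is just a placeholder for whatever dead-letter handling you prefer):

    import uuid

    def is_valid_uuid(value) -> bool:
        """Return True only if the value parses as a UUID."""
        try:
            uuid.UUID(str(value))
            return True
        except (ValueError, TypeError):
            return False

    def partition_records(records):
        """Split incoming records into insertable ones and ones to quarantine."""
        good, quarantined = [], []
        for record in records:
            (good if is_valid_uuid(record.get("id")) else quarantined).append(record)
        return good, quarantined

    if __name__ == "__main__":
        incoming = [{"id": "123e4567-e89b-12d3-a456-426614174000"}, {"id": "ext-42"}]
        good, bad = partition_records(incoming)
        print(f"insert {len(good)}, quarantine {len(bad)}")  # bad rows go to a dead-letter table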

Some people are suggesting outsourcing an on-call person. That seems to me like the only way around this particular case. (The other suggestions can still be used to reduce the number of times this person gets paged, though.)


👤 outime
If you’re using AWS I want to assume you didn’t go for a cheaper solution (e.g. VPS from a reputable company) because you like the managed solutions that they provide, among other reasons.

I assume also you want a simple way to increase reliability while keeping costs within reasonable limits.

Well, AWS can give you all that if you don't want to go super fancy. Check out Beanstalk to get something simple and reliable. Monitor using CloudWatch. Make sure to leverage redundancy options (multi-AZ, multi-region if it's worth it, etc.). These are some general tips, but with the information you provide that's all I can say.
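To make the CloudWatch part concrete, here's a rough boto3 sketch that alarms on high CPU for one instance and notifies an SNS topic. The instance ID and topic ARN are placeholders; adjust the thresholds to taste.

    import boto3  # pip install boto3; assumes AWS credentials are configured

    cloudwatch = boto3.client("cloudwatch")

    cloudwatch.put_metric_alarm(
        AlarmName="web-server-high-cpu",
        AlarmDescription="CPU above 80% for 10 minutes",
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}],  # placeholder
        Statistic="Average",
        Period=300,               # 5-minute datapoints
        EvaluationPeriods=2,      # 2 consecutive periods = 10 minutes
        Threshold=80.0,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-alerts"],  # placeholder SNS topic
    )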

You can also pay a consultant to get a review of your setup and get some recommendations. It won’t be cheap but it depends how much you value your time and your product.


👤 lacker
Some of the comments are suggesting totally different technologies. Don’t do that. You can stay on AWS and achieve the reliability you need. This isn’t the sort of problem that should lead you to rebuild your whole stack.

The question you should be asking is: how can I make my service automatically recover from this problem? It depends on why exactly it crashed. If a simple restart fixes the problem, there are different ways you can automate that, like Kubernetes or just writing scripts.

I’m happy to give more detailed advice if you would like, my email is in my profile.


👤 faeyanpiraat
Your main concern is of course limited time/resources, so you'll have to make compromises.

The question is not whether your system will fail, the question is when.

Have proper monitoring and alerting in place.

But don't over-engineer it; sometimes everything seems technically fine, but your support inbox will start getting user complaints.

Resolve the issue, figure out the root cause, make sure this or similar stuff won't happen, apologise to the affected users if necessary, and move on.

You'll learn waaay more failure modes from your application running in the wild than from just thinking about "what could go wrong".

It's a long game of becoming a better developer/devops guy, and not repeating the same mistakes in the future.


👤 charlesju
I would say that as a one-person founder, know that you cannot ever get 100% uptime, and live with it. In the simplest sense, you need to sleep 8 hours a day; you cannot live your life constantly stressed about uptime. Just make sure you generally have internet access, and accept that sometimes your service will go down.

On the setup side, try your best to solve issues and use tried-and-true hardware, but things go down sometimes; even big sites like Google and Facebook go down. There is no silver bullet; you can only improve on your past mistakes.

Last, try to find some remote help on a contract basis. It's not that expensive and it can help alleviate a lot of your stress.


👤 jarl-ragnar
I use Uptime Robot (http://uptimerobot.com) for monitoring; they have a free plan, or a paid one if you want faster checks.

If it's truly critical to have no downtime, then you probably need to build that resilience into your architecture.


👤 CoolGuySteve
I currently run a batch of trading servers solo. The trading system is a C++ process with an asynchronous logger that prints log levels and times. One of the issues with trading is that you're dependent on your datafeed and exchange connections working which is out of your control.

I use a python monitoring script that tails logs watching for ALERT level log lines and constant order activity combined with a cron watchjob to ensure the process is alive during trading hours. The exception handler in the monitoring script sends alerts if the script itself dies.

If there are any issues I use twilio to text me the exception text/log line. I also use AWS SES to email myself but getting gmail to permanently not block SES is a pain in the ass. By design Twilio + AWS SES are the only external dependencies I have for the monitoring system (too bad SES sucks).
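A stripped-down version of that watchjob, in case it helps anyone: check that the process is alive and text yourself if it isn't, run from cron during trading hours. The process name, phone numbers, and credentials are placeholders, and this assumes the official twilio Python package.

    import os
    import subprocess

    from twilio.rest import Client  # pip install twilio

    PROCESS_NAME = "trading_engine"   # placeholder
    ALERT_TO = "+15555550100"         # placeholder: your phone
    ALERT_FROM = "+15555550101"       # placeholder: your Twilio number

    def process_running(name: str) -> bool:
        """Return True if pgrep finds a process whose command line matches name."""
        return subprocess.run(["pgrep", "-f", name],
                              stdout=subprocess.DEVNULL).returncode == 0

    def send_sms(body: str) -> None:
        client = Client(os.environ["TWILIO_ACCOUNT_SID"], os.environ["TWILIO_AUTH_TOKEN"])
        client.messages.create(body=body, from_=ALERT_FROM, to=ALERT_TO)

    if __name__ == "__main__":
        if not process_running(PROCESS_NAME):
            send_sms(f"ALERT: {PROCESS_NAME} is not running")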

On my phone I have Termius SSH setup so I can log in and check/fix things. I have a bunch of short aliases in my .profile on the trading server to do the most common stuff so that I can type them easily from my phone.

I also do all my work through a compressed SSH tmux including editing and compiling code. So if things get hairy I can pair my phone with my laptop, attach to the tmux right where I left off, and fix things over even a 3G connection.

This compressed SSH trick is a huge quality of life improvement compared to previous finance jobs I've worked where they use Windows + Citrix/RDP just to launch a Putty session into a Linux machine. It's almost like finance IT has never actually had to fix anything while away from work.


👤 ElFitz
I basically don't manage any servers. Everything runs on AWS Lambda & co (DynamoDB, S3,...)

It doesn't prevent an app-level outage (corrupted data in the database, bad architecture,...) but at least I don't have to worry about servers going down anymore.

As for the rest, unit & extensive integration tests along with continuous integration and linting. Oh, and a typed language. Moving from Javascript to Typescript was a blessing. But I still miss Swift.


👤 asadlionpk
We are a very small team at https://codeinterview.io. We recently achieved a respectable level of reliability with a tiny team. Some things you should do:

- At least have a pool of 2 instances (ideally per service) running under an auto-scaler or a managed K8s (GKE is best) with a LB in front. You may also want to explore EBS and Google Cloud Run. If you can use them, use them!

- Uptime alerts. Pingdom (or New Relic alerts) with PagerDuty added.

- Health checks! The trick is to recover the failed container/pod/service before you get that PagerDuty call. Ideally, if you have 2 of each service running, #2 will handle the requests until #1 is recreated. (A minimal endpoint sketch follows after this list.)

- Sentry + New Relic APM + infra: you should monitor all error stack traces, request throughput, and average response time. For infra, you mainly need to watch memory and CPU usage. On each downtime you should also have greater visibility into what caused it. Set alerts on higher-than-normal memory usage so you can prevent the crash.

- Logs: your server logs should be stored somewhere (Stackdriver on GCP or CloudWatch on AWS).
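A minimal health-check endpoint for the LB or orchestrator to hit might look like the sketch below. Flask is just an example framework here, and the dependency check is a stand-in for whatever you actually care about (DB, queue, etc.):

    from flask import Flask, jsonify  # pip install flask

    app = Flask(__name__)

    def database_ok() -> bool:
        """Stand-in dependency check; swap in e.g. a `SELECT 1` against your real DB."""
        return True

    @app.route("/health")
    def health():
        if database_ok():
            return jsonify(status="ok"), 200
        # A non-200 response tells the load balancer / orchestrator to pull or recycle this instance.
        return jsonify(status="degraded"), 503

    if __name__ == "__main__":
        app.run(port=8080)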

This might sound overwhelming for a single person, but these are one-time efforts, after which they are mostly automatic.


👤 ransom1538
1. Stay on AWS only.

2. Pay for a Business Support plan. https://aws.amazon.com/premiumsupport/pricing/

3. Call business support about something simple ("how do I restart my server?") so you know how to file a ticket and get a feel for how quick the response is and how it works.

Do not overthink this (e.g. with Terraform templates).


👤 xwdv
For your case I recommend the Poor Man's High Availability method: an auto-scaling group of size 1.
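If anyone wants to see what that looks like, here's a rough boto3 sketch; the launch template, subnets, and names are placeholders, and it assumes you already have an AMI/launch template for your app. With min = max = desired = 1, a failed EC2 health check gets the instance replaced automatically:

    import boto3  # assumes AWS credentials and a pre-built launch template for your app

    autoscaling = boto3.client("autoscaling")

    autoscaling.create_auto_scaling_group(
        AutoScalingGroupName="solo-app-asg",                          # placeholder
        LaunchTemplate={"LaunchTemplateId": "lt-0123456789abcdef0",   # placeholder
                        "Version": "$Latest"},
        MinSize=1,
        MaxSize=1,
        DesiredCapacity=1,
        VPCZoneIdentifier="subnet-aaaa1111,subnet-bbbb2222",  # placeholder subnets in two AZs
        HealthCheckType="EC2",        # or "ELB" if you put a load balancer in front
        HealthCheckGracePeriod=120,   # seconds to let the instance boot before checking
    )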

👤 kerberos84
Yes, there are plenty of start-ups doing this. You can also use AWS's built-in functionality to achieve it. You can write a Lambda function which checks your server status, or even better, one which calls your health-check endpoint if you want more detailed monitoring.
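A sketch of what such a Lambda could look like (schedule it with a CloudWatch Events/EventBridge rule, say every 5 minutes). The URL and SNS topic ARN are placeholders:

    import urllib.request

    import boto3  # available by default in the AWS Lambda Python runtime

    SITE_URL = "https://example.com/health"                      # placeholder
    TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ops-alerts"  # placeholder

    sns = boto3.client("sns")

    def handler(event, context):
        """Ping the health endpoint; publish to SNS (email/SMS subscribers) if it looks down."""
        try:
            with urllib.request.urlopen(SITE_URL, timeout=10) as resp:
                if resp.status == 200:
                    return {"healthy": True}
                reason = f"HTTP {resp.status}"
        except Exception as exc:
            reason = repr(exc)
        sns.publish(TopicArn=TOPIC_ARN,
                    Subject="Site health check failed",
                    Message=f"{SITE_URL} looks down: {reason}")
        return {"healthy": False}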

👤 conductr
I solo-ran a web hosting service way back in 2000-2003, well before cloud, when it was mostly LAMP and cPanel. It was super mission-critical stuff for 20,000 sites and I was totally winging it. As it grew I got totally paranoid about uptime. Long story short, at some point there's no substitute for getting a human to back you up. I paid a company $250 a month to help monitor and jump into my servers to troubleshoot if I was unavailable. They were rarely needed, and when they were it was usually just an Apache restart or similar. Best money I ever spent.

👤 eps
Another option is just to not tackle systems that require 24/7 uptime IF you are just one person. Instead, make an installable product or do a service that's not interactive or real-time.

I've been in the game for a while and every time I run across an idea for a service, there's always a question of whether I'd be OK with sleeping with a pager, remoting to the servers at 4 am on Saturday and generally be slaved to the business. The answer, upon some reflection, is inevitably No. This is the domain of teams.


👤 dougb5
I wrote up a "technical continuity plan" that describes how to keep my web sites and APIs in maintenance mode in the event of my untimely demise. It has a list of bare-minimum things to do in the following week, month, and year, and describes the various third-party relationships and how to go about hiring a replacement administrator. I shared the doc with a few close friends. I hope it's not needed in the future, but just writing the doc was a useful exercise for me in the present.

👤 foxhop
You have identified a single point of failure (yourself), you either need to accept the risk or hire a person on retainer.

I'm in the same boat with my solo founder projects (links in profile).


👤 joshmn
I have a few things in production — two SaaS, one customer-facing subscription site. I run these all myself with no staff or contractors.

The short answer: I'm married to my phone/laptop.

My test coverage is good. I use managed services when possible so I don't need to play sysadmin. I don't deploy before I leave for something (dinner, shower), and I have some pretty good redundancy across all my services. If one node goes down, I'm safe. If four go down (incredibly unlikely), well, fuck, at least my database was backed up and verified an hour ago.

I invested a large amount of time into admin-y stuff. My admin-y stuff is solid and I can tweak/configure/CRUD anything on the fly. I credit being able to relax to my admin-y stuff. Obviously, if shit really hits the fan with hardware or an OS bug, I need to get to my laptop. But over the last six years, I haven't had to do that yet, and hopefully I won't have to.

I've explored adding staff — mainly for day-to-day operations — but I like the idea of interfacing with my customers and I credit growing things to where I have because I'm in the trenches with them. Things haven't always gone smoothly, and my customers always let me know, but any issues are normally swiftly-resolved.

The scale of one of my products is non-trivial and it has a ton of moving parts, some of which I have no control over and which could change at any time and break _everything_. It sounds terrifying, and it is, but I've made a habit of checking things before peak hours. If something's amiss, a quick fix is usually all it takes.


👤 bArray
I'm not a solo founder, but I run a number of servers that are heavily used - all with different software with varying amounts of reliability. I also allow other people to deploy code without checking with me first, just to keep things fun.

I have a few pieces of advice:

1. Make sure your service can safely fail and be restarted. What I mean is, if somebody is POST'ing data or making database changes, make sure you handle this safely and attempt some recovery. Something not being fully processed is okay as long as you are able to handle it.

2. Self-monitoring. I run all my systems inside a simple bash loop that restarts them and pops me an email (i.e. "X restarted at Y", and then "X is failing to start" if it continues). A rough Python take on the same idea follows after this list.

3. External monitoring via a machine at home that rolls the server back to a previous binary (also on the server). It also pulls various logs from the server, as well as the binaries, so they can be analyzed. Okay, it has some reduced functionality, but it's stable and will keep things going until the problem is fixed.

4. Make sure your service fails gracefully - i.e. returns a `{"status":"bad"}` string or something, or defaults to an "Under maintenance, please come back soon" page. Your service going down is one thing, but becoming completely unresponsive is quite another.
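My loop is bash, but here's roughly the same supervise-and-notify idea in Python, for illustration. The command, addresses, and thresholds are placeholders, and it assumes a local mail relay is available:

    import smtplib
    import subprocess
    import time
    from email.message import EmailMessage

    COMMAND = ["/usr/local/bin/myservice"]   # placeholder
    MAIL_TO = "me@example.com"               # placeholder
    MAIL_FROM = "monitor@example.com"        # placeholder

    def send_mail(subject: str) -> None:
        msg = EmailMessage()
        msg["Subject"] = subject
        msg["From"] = MAIL_FROM
        msg["To"] = MAIL_TO
        msg.set_content(subject)
        with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA/relay
            smtp.send_message(msg)

    if __name__ == "__main__":
        failures = 0
        while True:
            started = time.time()
            subprocess.run(COMMAND)  # blocks until the service exits/crashes
            # Treat anything that died within a minute as "failing to start".
            failures = failures + 1 if time.time() - started < 60 else 0
            send_mail(f"{COMMAND[0]} restarted"
                      if failures < 3 else f"{COMMAND[0]} is failing to start")
            time.sleep(min(60, 5 * (failures + 1)))  # back off a little between restarts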

One thing I can't prepare for (which happens more than you think) is the server itself crashing, which as you say, means I'm randomly logging into a VPS console and rebooting. I use a bunch of different VPS providers and every one of them has a slightly different console.


👤 RantyDave
Just to add to the voices that are saying "by not having any". If you can get away with edge, Lambda, or Heroku, even if only in the short term, do.

👤 deif
Other people are suggesting alternative platforms when you could simply use an AWS autoscaling group: if a server goes down, it relaunches a saved image.

👤 pinacarlos90
It boils down to sending some sort of notification so first responders know about the issue ASAP.

You can do it at the OS level: on Windows, for example, you can use Event Viewer and assign a task to a specific type of log captured by the OS; this task can then invoke a small app that sends an email when an error log occurs, or something like that.

Application-specific issues: you can manually capture exceptions raised within the app and send notifications. There are many clever ways to do this without hindering performance or polluting your code base with exception handling, e.g. spawning "fire and forget" threads that send the notifications. Let me know if you need more ideas here.

Integration tests: given that you've built a strong suite of integration tests covering all the functionality in your app, you can have them run every 15 minutes or so and send notifications if any tests fail (see the sketch below).

You can also use monitoring tools. I know Azure offers ways to help with this. Reach out if you want more ideas or more specific solutions.
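For the integration-test idea, a rough sketch of something you could run from cron every 15 minutes. The test directory and webhook URL are placeholders, and it assumes your tests run with pytest:

    import json
    import subprocess
    import urllib.request

    WEBHOOK_URL = "https://example.com/alert-webhook"  # placeholder
    SMOKE_TESTS = "tests/smoke"                        # placeholder directory of fast tests

    def notify(text: str) -> None:
        payload = json.dumps({"text": text}).encode()
        req = urllib.request.Request(WEBHOOK_URL, data=payload,
                                     headers={"Content-Type": "application/json"})
        urllib.request.urlopen(req, timeout=10)

    if __name__ == "__main__":
        result = subprocess.run(["pytest", "-q", SMOKE_TESTS],
                                capture_output=True, text=True)
        if result.returncode != 0:
            # Send the tail of the output so the alert says *what* failed.
            notify("Smoke tests failed:\n" + result.stdout[-1000:])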

👤 peterwwillis
Yes, you can pay managed hosting providers different amounts of money for different levels of support.

Managed support will usually only monitor and fix basic infrastructure and respond to support requests from you. They often won't monitor or fix your applications/services; for that you can set up your own application monitoring and tests. NewRelic is a good all-in-one choice, but there are plenty more out there. To call you during an incident, you'd also adopt PagerDuty.

In order to avoid service outages in general, you want to hook up some kind of monitor to something that restarts your services and infrastructure. This will only fix crashes; it won't fix issues like disks filling up, network outages, application bugs, too many hits to your service, etc.

You should be able to find small businesses who specialize in selling support contracts for all levels of support. By signing a contract and on-boarding a 24/7 support technician, you can get them to fix basically whatever you need fixed when it goes down. I don't have suggestions for these, but maybe someone else does (it used to be common for SMBs in the 2000s).


👤 samvher
It really depends on the failure mode and the cost of failure. As mentioned by others you can encounter issues in external services which you have no control over and the best you can do in that case is fail gracefully until you're able to deal with the issue. If it's easy to detect failure, and a restart fixes the problem, it can be quite straightforward to set up some monitoring scripts that take care of this for you, and even if it's more complicated than a restart some monitoring can at least notify you by email or SMS. Keeping your tech simple and/or having high test coverage or formal verification can reduce your error rate. Similarly you can introduce fault tolerance into the system with something like Erlang's OTP or monitored containers in an orchestrator (K8s, Docker Swarm, some cloud solution). If failures are expensive you might want to take on staff to deal with them, if the cost is low you might just want to accept occasional downtime (though you'll want to think about how you report that to your users).

👤 Igor_kh
Operations guy here. I'm probably biased.

If I were you, I would use free monitoring services like uptimerobot; there are other options available too. Typically these services provide some basic functionality for free, which would be enough for a small business.

On AWS it is quite easy to create your own external probes for a reasonable price. However, it would require some basic programming skills.


👤 peterburkimsher
Are you using a Node.js backend? This is a little script I set up with cron on a second instance; it logs in and restarts the Node server if it's down.

    #!/bin/bash
    # Fetch the site and check for the expected title; if it's missing,
    # kill and restart the Node process over SSH, then log the restart time.
    thisHtml=`curl -s "[your site's web address]"`

    if [[ $thisHtml != *"[your site's title]"* ]]; then
        #echo "Server is down"
        ssh -i "[your pem file]" -t ec2-user@[ip address] 'sudo /bin/bash -c "killall -9 node"'
        ssh -i "[your pem file]" -t ec2-user@[ip address] 'sudo /bin/bash -c "export PATH=/root/.nvm/versions/node/v8.11.2/bin:/usr/lib/node:/usr/local/bin/node:/usr/bin/node:$PATH && forever start /var/www/html/[...]/index.js"'

        rebootDate=`TZ=":[your time zone]" date '+%Y-%m-%d %H:%M:%S'`
        echo "$rebootDate" >> "/home/ec2-user/serverMonitoring/devRestarts.txt"
    fi


👤 drubenstein
> Can I pay someone to monitor my AWS deploy and make sure it's healthy?

Yes. There are consulting shops that will do this, as will many of the monitoring tools listed in the thread (though these tools will not fix the problem for you). Broadly speaking, there is a cost associated with this, as well as the cost associated with your downtime. If the cost of your downtime (reputational risk, SLA credits, etc.) outweighs the cost of hiring someone to cut your MTTR to 5 minutes (assuming you can playbook out all of the relevant scenarios), plus some value in stress reduction, then you should do this. If you've been doing this a while, you can math it out. In my experience, though, an outside person is unlikely to be able to fix an "unknown unknown"; they just won't know your environment as well as you do.

All that said, one hour of service interruption a year is still better than most.


👤 Beltiras
Redundancy. Failure should always be an option. Specific answers will depend on your stack. Nobody will be able to monitor and react like you will because all IT solutions are their own species of butterfly with their own intricacies. If uptime is really that important you might be at the stage where you need to take on an employee.

👤 gatherhunterer
I highly recommend Kubernetes as infrastructure. It has a reputation for being too complex to use on your own or with simple projects but that reputation is undeserved. Self-healing container orchestration has been eye-opening for me. Many people groan at the prospect of learning something new but it is remarkably easy to use, the only barrier to entry being the high cost of cloud solutions and the unwillingness of many engineers to work with hardware (which would nullify the cost of cloud services). You can easily develop and test on local hardware and deploy to the cloud with the exact same configuration.

The idea that your server does not perform regular health checks or spin itself back up when it fails just seems weird to me now. I like being spoiled.


👤 colinjfw
I've been thinking about this quite a bit lately. I've run DevOps for a few organizations and learned quite a bit through that.

Ultimately you can engineer your systems, even if they are quite complex, to be manageable by a single person. It's not one thing though. It's years of experience and gut feel. It's also totally distinct from technology.

Some things that come to mind:

- use queues for background tasks that may need to be retried. If things go down and you have liberal retry policies, things should recover.

- use boring databases. Just stay away from Mongo and use something like RDS, which is proven and reliable.

- be careful in your code about what an error is. Log only things at the error level you need to look at.

- test driven development. Saves a ton of time.


👤 tachion
You start by identifying the reasons why your application/service may fail, and then design and implement infrastructure that can withstand certain failures for a cost you can bear. If a failure of a piece of infrastructure costs you £1 per day, you might be OK with paying £5/day for the infrastructure to handle such a failure. But would you be OK with paying £50 for the same thing?

It's all the matter of defining requirements, then solutions and tradeoffs of those solutions and then implementing it with best practices in mind (automation, testing, monitoring, backups, etc.).

Hit me up if you want to discuss it over a pint! :)


👤 waxzce
Very simple: you need to buy a higher level of service; instead of paying for servers, you need to pay for uptime. That's what PaaS, managed services, or serverless do: manage servers for you, at scale. To have something online you need:

- servers

- VM/OS management

- a scalability system

- monitoring (hardware, OS, application-level & functional)

- action on monitoring and escalation management

- updates every week

- observability

That's what we provide at Clever Cloud BTW https://www.clever-cloud.com/


👤 cpursley
Use Heroku as long as it's cost-effective. Every time I've moved from Heroku to another platform for "cost savings", I've ended up spending much more time than I'd planned just maintaining it.

👤 dwild
Does your service actually need to have incredible uptime? What would be the worst that would happen if the service was down for, let's say, 24 hours?

I feel like we over-engineer that part. Sure, there are plenty of services where you don't want any downtime and it makes sense to over-engineer it (like any monitoring service), but for many SaaS, the worst that will happen is a few emails.

Maybe write a simple SLA, something with an 8-hour response time for these kinds of outages. If some clients require more, then sell them a better SLA at a higher cost. That should let you invest in better response times for sure.


👤 nh2
I rent dedicated servers at Hetzner.

No cloud machines, no hosted cloud services for production beyond DNS.

* 3 machines in separate data centers (equivalent of AWS AZs) for >= 30 EUR/month each. ECC RAM.

* These machines are /very/ reliable. Uptimes of > 300 days are common; reboots happen only for the relevant kernel updates.

* Triple-redundancy Postgres synchronous replication with automatic failover (using Stolon), CephFS as distributed file system. I claim this is the only state you need for most businesses at the beginning. Anything that's not state is easy to make redundant.

* Failure of 1 node can be tolerated, failure of 2 nodes means I go read-only.

* Almost all server code is in Haskell. 0 crash bugs in 4 years.

* DNS based failover using multi-A-response Route53 health checks. If a machine stops serving HTTP, it gets removed from DNS within 10 seconds.

* External monitoring: StatusCake that triggers Slack (vibrates my phone), and after short delay PagerDuty if something is down from the perspective of site visitors.

* Internal monitoring: Consul health checks with consul-alerts that monitor every internal service (each of the 3 Postgres, CephFS, web servers) and ping on Slack if one is down. This is to notice when the system falls into 2-redundancy which is not visible to site visitors.

* I regularly test that both forms of monitoring work and send alerts.

* Everything is configured declaratively with NixOS and deployed with NixOps. Config changes and rollbacks deploy within 5 seconds.

* In case of total disaster at Hetzner, the entire production infrastructure can be deployed to AWS within 15 minutes, using the same NixOps setup but with a different backend. All state is backed up regularly into 2 other countries.

* DB, CephFS and web servers are plain processes supervised by systemd. No Docker or other containers, which allows for easier debugging using strace etc. All systemd services are overridden to restart without systemd's default restart limit, to come back reliably after network failures or out-of-memory situations.

* No proprietary software or hosted services that I cannot debug.

* I set up PagerDuty on Android to override any phone silencing. If it triggers at night, I have to wake up. This motivated me to bring the system to zero alerts very quickly. In the beginning it was tough, but I think it paid off given that now I get alerts only every couple of months at worst.

* I investigate any downtime or surprising behaviour until a reason is found. "Tire kicking" restarts that magically fix things are not accepted. In the beginning that takes time but after a while you end up with very reliable systems without surprises.

Result: Zero observable downtimes in the last years that were not caused by me deploying wrong configurations.

The total cost of this can be around 100 EUR/month, or 400 EUR/month if you want really beefy servers that have all of: fast SSDs, large HDDs, and GPUs.

There are a few ways I'd like to improve this setup in the future, but it's enough for the current needs.

I still take my laptop everywhere to be safe, but didn't have to make use of that for a while.


👤 ezekg
I use Heroku for https://keygen.sh. Sometimes it pisses me off how big the bill is (~$1.5k/mo atm), but the net time savings are still worth it to me. I usually spend a total of 0 hours a month on managing servers/infra, and less than an hour a day on support. I'm thinking I'll move to AWS eventually to maximize margins, but right now this really works for me.

👤 more_corn
Sure, the person you pay is AWS.

You enable them to do it for you by creating HA infrastructure. Start by creating an auto-scaling group that enforces a certain number of working application endpoints. You probably need an ALB too. An app endpoint that fails its health check causes the ASG to spin up another instance and auto-register it with the ALB. (You can snapshot your configured and working app endpoint as the base image.)


👤 izendejas
I'd love to recommend pingdom, or a service like it. I'm in no way affiliated with them, just a very happy customer and one of those products where I'm jelly I didn't come up with the idea. It integrates very nicely with pagerduty and slack/sms, etc.

It's just extra redundancy in case something like cloudwatch (which you should use -- with ELBs) also goes down.


👤 SkyPuncher
I used AWS Cloudwatch and some simple server side checks (ianheggie/health_check for Rails is great) for a very long time.

It's not perfect, but it's (1) cheap, (2) easy, (3) quick (the mythical trifecta). It misses some issues caused by high load (when the site is slow but still technically available), but works perfectly when things actually crash (like queue workers deciding to turn off).


👤 brentis
I've struggled with this for years. AWS is not foolproof, and with environments for web, Android, and iOS, availability gremlins have killed much of my spirit, despite users proclaiming they've been looking for a service like mine for years.

Docker, Elastic Beanstalk, SNS, and the hidden world of AWS instance performance are all a PITA. Oh yea, certs...

I'd welcome help as well.


👤 owaislone
I've used runscope.com and I love it. I don't know about their pricing so can't tell if it's suitable for someone in your situation but I'm sure there are tons of similar services. You could also build your own with Lambda and hope AWS is reliable enough to keep Lambda running. (Who monitors the monitoring tools? :) )

👤 atmosx
Services like heroku try to solve this problem.

👤 nathan_f77
I'm working on FormAPI [1] as a solo founder. I started on Heroku, but Heroku was a bit unreliable and I had some random outages that I couldn't predict or control. (This was even while using dynos in their professional tier.)

I also had a lot of free AWS credits, so I migrated to AWS. I didn't want to write all my terraform templates from scratch, so I spent a lot of time looking for something that already existed, and I found Convox [2].

Convox provides an open source PaaS [3] that you can install into your own AWS account, and it works amazingly well. They use a lot of AWS services instead of re-inventing the wheel (CloudFormation, ECS, Fargate, EC2, S3). It also helps you provision any resources (S3 buckets, RDS, ElastiCache), and everything is set up with production-ready defaults.

I've been able to achieve 100% uptime for over 12 months, and I barely need to think about my infrastructure. There have even been a few failed deployments where I needed to go into CloudFormation manually and roll something back (which were totally my fault), but ECS keeps the old version running without any downtime. Convox is also rolling out support for EKS, so I'm planning to switch from ECS to Kubernetes in the near future (and Convox should make that completely painless, since they handle everything behind the scenes).

[1] https://formapi.io

[2] https://convox.com

[3] https://github.com/convox/rack


👤 bullen
I made my own (every-minute) monitor: http://monitor.rupy.se

It also warns me if the CPU load goes over 80%.

For the first two years of being live I had this hardwired to my Pebble via real-time mail, but now I know my platform is robust, so I can choose to worry about other things.


👤 gumby
Sounds like there is a business opportunity here. A kind of DevOps-as-a-service, though to make it scale you'd probably need the customer to architect their system in a certain way.

(though this is essentially a single-line comment, it's earnest, not intended to be sarcastic)


👤 peterk5
I build my projects on Google App Engine and it has been stable and reliable without much administration. The platform is not without its challenges, especially with the Gen 2 rollout, but no issues related to administration/interruption. PaaS could be a good place to explore...

👤 vandershraaf
My applications, which are built with Laravel, are deployed through Laravel Forge. There is definitely an extra charge for it, but having Forge simplify deployment really saves me time, especially when there's an issue.

For monitoring, I am using Stackdriver which has easy-to-use health check.


👤 davecap1
Have you thought about hiring someone remote in the same or different timezone to be on-call for outages? I'm sure there are many people around that would be able to help with this. You could hire someone on a retainer who can be on-call via PagerDuty or something.

👤 brokenkebab
There are a lot of whatever-as-a-service offers which can relieve you of updating, patching, and restarting. But if your troubles originate as bugs in software, poorly formatted data, or something along those lines, then human supervision is probably the only solution.

👤 telesilla
I've been using Linode's managed service, about $100 a month per server. If something goes wrong they have access and can triage, or let me know if they can't fix it. It's been very helpful, especially since they have (excellent) phone support.

👤 exabrial
TICK stack. Literally everything handling or supporting production traffic should be monitored.

👤 elamje
I think another consideration that might not be an obvious risk is your use of two factor auth.

It’s important for critical services, yet if you lose your 2FA device, like a phone, you will be locked out for a while. Like many things, it will happen at a bad time.


👤 parliament32
AWS is not "your servers", it's "your services". How you monitor, manage, and set up redundancy/recovery is going to be very very different between running real servers or just paying AWS for semi-managed services.

👤 r0rshrk
For a single-server setup, write bash scripts that check whether the server is down and bring it back up if it's down.

Also, send errors through chat platforms like Telegram so you're notified of any errors and can monitor the servers.
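Sending to Telegram from a script is just one HTTP call to the Bot API's sendMessage method. A rough sketch (the bot token and chat ID are placeholders; the token comes from @BotFather):

    import urllib.parse
    import urllib.request

    BOT_TOKEN = "123456:ABC-placeholder"  # placeholder, issued by @BotFather
    CHAT_ID = "123456789"                 # placeholder chat to send alerts to

    def telegram_alert(text: str) -> None:
        """Send a message via the Telegram Bot API sendMessage method."""
        url = f"https://api.telegram.org/bot{BOT_TOKEN}/sendMessage"
        data = urllib.parse.urlencode({"chat_id": CHAT_ID, "text": text}).encode()
        urllib.request.urlopen(url, data=data, timeout=10)

    if __name__ == "__main__":
        telegram_alert("ALERT: web server was down and has been restarted")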


👤 aarreedd
Use uptimerobot to monitor your site. Have a scheduled job that pings healthchecks.io every 5 minutes. Configure both to email you if anything goes down.

These are both reactive, but at least you'll know when things break.
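The healthchecks.io part works as a dead man's switch: your scheduled job pings a URL when it succeeds, and you get emailed when the pings stop arriving. A rough sketch of wrapping a job that way (the ping URL and job path are placeholders; the real URL comes from your healthchecks.io dashboard):

    import subprocess
    import urllib.request

    PING_URL = "https://hc-ping.com/your-check-uuid"   # placeholder from the healthchecks.io dashboard
    JOB = ["/usr/local/bin/nightly_backup.sh"]         # placeholder scheduled job

    if __name__ == "__main__":
        result = subprocess.run(JOB)
        if result.returncode == 0:
            # Only ping on success; if the job breaks, the missed ping triggers the alert email.
            urllib.request.urlopen(PING_URL, timeout=10)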


👤 dubcanada
I guess it really depends on what AWS services you use. There are companies that can manage your AWS 24/7 for a fee.

Other options include using another service that offers 24/7 uptime. Obviously you pay more for that.


👤 throw03172019
I have been using Convox for the last 3 years and it has been super reliable. Convox is essentially bring your own infra Heroku built on top of AWS ECS (Docker). I believe one of the founders is from Heroku.

👤 masternda
Yes, you can, by outsourcing it to someone or scaling up your team of one as others have mentioned. I am currently in the process of rolling out a service that does this. Hit me up if you are still keen.

👤 techscruggs
1) Set up a status page for what you want to monitor (/health/queuelength). 2) Point StatusCake at that URL. 3) Connect StatusCake to PagerDuty.

This approach is easy to implement and scale.


👤 riffic
Your bus factor is 1. Use managed services.

https://en.wikipedia.org/wiki/Bus_factor


👤 lazyeye
Recommend doing the AWS certification training. AWS has redundancy built in with health monitoring and auto scaling groups etc. The training covers all this.

👤 lepah
By not managing servers: use a PaaS such as Heroku. It significantly reduces devops, allowing you more time to focus on what matters, i.e. product-market fit.

👤 m00dy
You can try to use AWS Elasticbeanstalk. It can recover from failures automatically by spawning new nodes right behind the load balancer.

👤 dhimes
I outsource to Tummy.com. They've been terrific, and I don't have to worry about anything I don't want to.

👤 nwilkens
I run a company that helps with this single founder scenario. We monitor your infrastructure, and resolve issues 24x7, along with other proactive items.

https://www.mnxsolutions.com/services/linux-server-managemen...

I’d be happy to chat with anyone, even if to provide some feedback or a quick audit to help you avoid the next outage.

- nick at mnxsolutions com


👤 motakuk
1) Have monitoring in place. 2) Never miss alerts; use something with multi-channel escalation, like amixr.io.

👤 soulchild37
Pieter Levels (founder of remoteok.io) hired a guy to monitor his server for $2k a month.

👤 xchaotic
The direct question was, can I pay someone to monitor my AWS. And the answer is yes. You want redundancy at every layer including the human one. For 24x7 coverage, long term you need a team but for now two people will do.

Funny enough I was just talking to someone who passed all his AWS certifications and was looking for some AWS work.


👤 wlycdgr
Don't sell things that require high availability as a solo founder.

👤 slipwalker
I used to have a Telegram bot sending me events from supervisord: http://supervisord.org/events.html

👤 FearNotDaniel
After reading another recent post... Tinder for Founders, anyone?

👤 jillesvangurp
Short answer: promising 5 nines of uptime is not a thing for startups. Downtime is going to happen and you are going to be asleep, drunk, or otherwise not fit for doing any emergency ops. It's not the end of the world. Happens to the best of us.

So given that, just do the right things to prevent things going down and get to a reasonable level of comfort.

I recently shut down the infrastructure for my (failed) startup. Some parts of that had been up and running for close to four years. We had some incidents over the years of course but nothing that impacted our business.

Simple things you can do:

- CI & CD + deployment automation. This is an investment, but having a reliable CI & CD pipeline means your deployments are automated and predictable. Easier if you do it from day 1.

- Have good tests. Sounds obvious, but you can't do CD without good tests. Writing good tests is a good skill to have. Many startups just wing it here, and if you don't get the funding to rewrite your software it may kill your startup.

- Have redundancy. I.e. two app servers instead of 1. Use availability zones. Have a sane DB that can survive a master outage.

- Have backups (verified ones) and a well-tested procedure & plan for restoring them.

- Pick your favorite cloud provider and go for hosted solutions for infrastructure that you need rather than saving a few pennies hosting shit yourself on some cheap rack server. I.e. use Amazon RDS or equivalent and don't reinvent the wheels of configuring, deploying, monitoring, operating, and backing that up. Your time (even if you had some, which you don't) is worth more than the cost of several years of using that, even if you only spend a few days on this. There's more to this stuff than apt-get install whatever and walking away.

- Make conservative/boring choices for infrastructure. I.e. use PostgreSQL instead of some relatively obscure NoSQL thingy. They both might work. PostgreSQL is a lot less likely to not work, and when that happens it's probably because of something you did. If you take risks with some parts, make a point of not taking risks with other parts. I.e. balance the risks.

- When stuff goes wrong, learn from it and don't let it happen again.

- Manage expectations for your users and customers. Don't promise them anything you can't deliver, like 5 nines. When shit goes wrong, be honest and open about it.

- Have a battle plan for when the worst happens. What do you do if some hacker gets into your system or your data-center gets taken out by a comet or some other freak accident? Who do you call? What do you do? How would you find out? Hope for the best but definitely plan for the worst. When your servers are down, improvising is likely to cause more problems.


👤 taf2
Monit, keepalived, StatusCake; with more money, Datadog and New Relic help as well, along with PagerDuty.

👤 listenallyall
hetrixtools.com is a great option for monitoring and real-time notifications

👤 armatav
Heroku

👤 janee
Imo you will have to get outsourced on-call if your downtime tolerance is very very low.

Otherwise I'd suggest religiously documenting your outage root causes and contemplating hard what could've avoided that outcome.

Then lastly for monitoring on the cheap:

Sentry.io - alerts.

Opsgenie - on-call management.

Heroku+new relic - heartbeat & performance.

Tldr; Keep your stack small and nimble and try to learn from past outages


👤 mister_hn
Yes, you can. Or try to automate as much as possible:

- add health-check mechanisms

- if a health check is broken => restart the service

- if restarting the service doesn't help after X retries => redeploy the previous state (if any is available)

Try to use Kubernetes or Docker Swarm if possible, combined with Terraform


👤 F117-DK
One word: serverless. It's a bit more pricey, but the peace of mind is worth it.

👤 verdverm
GKE

👤 _tkzm
I know the situation. I haven't gotten to the production stage yet, but I totally get it. Besides using Kubernetes, Nomad, or some other scheduler, you will always have to invest your own time to resolve issues manually. You could have triggers that invoke Ansible playbooks if you don't want to handle any of the aforementioned, but in the end this type of business simply requires maintenance; there is no way around that. A real human being has to keep an eye on the entire architecture and make sure it is running as it is supposed to.