Jon Christensen and Chris Hickman of Kelsus discuss health checks for services, containers, and daemons. They use them to keep Kelsus's distributed systems and services functioning.
Some of the highlights of the show include:
- Health Checks: A first line of defense when running any software in production from an operational standpoint to detect errors and identify when a service needs to be recreated
- Health checks involve something hitting an endpoint that executes application code, to determine not only that the service is responding on that port but that the code behind it is actually running and responding
- Two main types of health checks: Shallow: Most common; create a dedicated endpoint in your service code that goes through the frontend, routes to your code, executes it, and returns a response that signifies success; it doesn't test any dependencies, only that the service is running and responding to requests. Deep: Less common, with several pros and cons; also verifies that the service's dependencies (other microservices, the database) are up
- Do a deep health check if your service can't function without its dependencies
- Design your system to be able to gracefully degrade; if a deep health check hits your database, you're making sure it's up and running and that you can connect to it
- Considerations for deep health checks:
- Expensive
- Startup latency issue
- Domino effect
- Rotation: The load balancer put in front of your instances starts routing requests to an instance once it passes its health checks
- Parameters to remember when configuring a health check include: (1) Set a delay before the first check, because health checks run periodically, not in real time; a 30- to 60-second delay gives the system time to start up. (2) Determine how often you want to send health checks (e.g., every second, every half second, every 20 seconds, every five minutes); this depends on the type of service. (3) Decide how long you give each health check to succeed or fail; does it fail after 100 milliseconds or after 30 seconds, depending on your latency requirements
- Implementation Styles: Where are you using the health checks?
- ELB level
- Background jobs or daemons
- Synthetics: Health checks from the caller's standpoint that verify your service is up and running and that what the end user is expecting is correct
Links and Resources
Amazon Elastic Container Service (ECS)
Amazon Elastic Compute Cloud (EC2)
Amazon Elastic Load Balancer (ELB)
Kelsus
Secret Stache Media
Transcript
Rich: In episode 53 of Mobycast, Jon and Chris discuss health checks for services, containers, and daemons. Welcome to Mobycast, a weekly conversation about cloud-native development, AWS, and building distributed systems. Let's jump right in.
Jon: Welcome Chris, it's another episode of Mobycast. This time, you're sitting right next to me.
Chris: This is a first.
Jon: Yes.
Chris: Good to be back, Jon.
Jon: I think the only thing that will change with us sitting right next to each other here in beautiful Florianopolis, Brazil is maybe there'll be a little less interrupting each other. But I expect otherwise, the episode will be just like any others. This week, we're going to talk about health checks, how we use those on our team, and how we may want to change how we use them for keeping our distributed systems and services up and running. Before we get started, since this is sort of a special week, what have you been up to this week, Chris?
Chris: I have been basically living on a plane, it feels like. I traveled for about 32 hours from snowy, cold Seattle halfway across the world to beautiful, sunny Brazil. We're here at our company retreat with the entire team and it's amazing to see just how big this team has grown. There's a lot of new faces since the last company retreat, pretty interesting.
Jon: Yeah, it's really fun. It's big enough that I feel like I'm not getting quality time with every single person on the team this time around, everybody started to form their own groups and social circles, and it's really interesting to see that grow. It's so fun and Chris, this is your second time here in Brazil. I've been here six times and I just love this place. I was just thinking earlier today that it's a place that doesn't change. You come back and back, and you can kind of expect things to be the way they were. There's something about that after living in fast-paced America that I can really get behind.
Chris: Absolutely.
Jon: On to health checks. Maybe we should start like we do with many episodes, and you can give us a definition.
Chris: Sure. Health checks are kind of a cornerstone of running any kind of software in production from an operational standpoint. It's the basic check of just making sure that your code, your service, is up and running and able to service requests. This is one of the first lines of defense to detect errors when things have gone wrong. Typically, these are used to identify when a service needs to be recreated. In the past, you might have rebooted the machine; in the cloud, you basically just shoot it in the head and spin up a new one. Health checks give us that core capability of identifying when things go wrong and basically just restarting.
Jon: Thinking back in my career, I think the first time that I got exposed to doing health checks was when we were configuring load balancers for clusters of Java web application servers. I think at the time, the health checks that we did weren't really able to tell much about what was going on inside the application, because the load balancer itself wasn't able to load balance at that application level. It was only able to tell whether the computer and network were responding, not whether the application was happy. But it was still sending pings or network requests to each of the machines in the cluster all the time.
Whenever I saw one wasn't available, I would just stop routing traffic in that direction. I think things have come a long way since then. The health checks are way more sophisticated, but at the end of the day, it's still the same basic premise. If you see a system or a service that's not available, stop sending traffic to it.
Chris: Right. I think what you're describing there is basically the port-level, network-level health checks. Your service is running on port 80 or 443, so ping that port, do you get a response back? That's a basic-level check. Now, normally a health check involves something hitting an endpoint per se. You're actually executing application code to determine not only is it responding on that port, but is the code actually running to respond back to it as well.
Jon: I think that's a nice way to enter into the idea of the types of health checks that you might run. If you're going to hit an endpoint and you're expecting a response from that endpoint, what might you do in order to see if the system that you're looking at is healthy?
Chris: Yeah. Maybe this is a good time to just explain broadly the two main types of health checks here: shallow versus deep. Shallow is definitely the most common. This is typically what you do. We have our microservice, or some API service, whatever it may be, exposing an endpoint, and usually it's very simple. We'll get into this a little bit more about how shallow versus deep health checks are different and why you want to take this consideration into account. Again, the basic thing is, with these shallow health checks, keep it tight. It's something that's very quick. It's exercising your actual service code. It's an endpoint in there that is responding back, so it's going through all the frontend, the routing to your code, executing that code, and returning back a response that signifies success.
Jon: A shallow health check with that, let's just try to think in terms of examples. Maybe we have a blog service. You can do CRUD on posts. You can create a post, update a post, get posts. With a shallow health check, would that actually get a particular post or would you try to find something more shallow than that?
Chris: Yes. For a shallow check, you definitely want something more shallow than that. Typically, your shallow health checks are something that are going to be executed quite frequently.
Jon: Do you want to do a HEAD on getting a post, or just, tell me that this endpoint exists and I can send stuff to it, like HEAD instead of GET? Do you see what I'm saying? What if all you can do is create, update, read, and delete posts? That's a very small, tiny microservice and you want to do a shallow health check on it, and you don't want to hit the database. That's what I'm hearing with the shallow health check. Can you do just some lighter-weight HTTP request like HEAD or something like that to keep it shallow?
Chris: Yeah. I'd just recommend you create a new endpoint for your shallow health check. Call it /status. You create a new endpoint, /status, and basically all it does is just echo back something like, "Just return a 200."
Jon: And that tells you that your service is alive, because if it wasn't answering, then the whole service is dead.
Chris: Exactly. It's very lightweight. It's very quick. You're not testing anything else. You're not testing upstream dependencies. You're not taxing your service with any load. It's just verifying basically that my process is up and running, and that at a top level, everything is working. The requests are coming in, they're getting routed, code is being executed, and the responses are coming back.
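To make that concrete, here is a minimal sketch of the kind of /status endpoint Chris describes, written in Go with only the standard library. The route name and port are assumptions for the example, not anything prescribed in the episode.

```go
package main

import (
	"net/http"
)

func main() {
	// Shallow health check: no dependencies are touched. If this handler
	// answers with a 200, the process is up, requests are being routed,
	// and application code is executing.
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.Write([]byte("OK"))
	})

	// Port 8080 is an arbitrary choice for this sketch.
	http.ListenAndServe(":8080", nil)
}
```

Pointing the load balancer's health check path at /status is then just configuration on the load balancer side.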
Jon: Yeah, and you can learn quite a bit from that, because if your process is thrashing, it doesn't have enough memory, it's got problems, then even that shallow health check will probably have a problem, or in many cases it will.
Chris: Absolutely. Shallow health checks, despite the name, are not a bad thing. This is what you want to do. They're very useful and they will give you that indication like, something is wrong here, we're not able to satisfy a request.
Jon: Yeah. Because if your shallow health check doesn't pass, then you definitely want to stop routing traffic immediately to that particular instance of your service.
Chris: Yeah. In the words of our friend, Chrome, "It's dead, Jim. Oh, snap."
Jon: Right. Then there's the other kind of check, which obviously is deep. Let's talk about those a little bit, when you might use them, what they are.
Chris: Deep health checks are kind of interesting. Definitely much less common, lots of pros and cons, and lots of considerations around them. We talked about the shallow health check. It's very quick. It's not testing any dependencies, very lightweight. It just basically says, "Hey, this service is up and running and it's responding to your request."
Deep health checks: say I have this microservice architecture. My service is a consumer of other services. Maybe my service is completely unusable if some of its dependencies are not up and running. A deep health check would be something that's a bit more advanced and exhaustive in its checking. You're not just checking that your service is running, you're checking that your dependencies are running as well. Your dependencies again could be other microservices that you make calls on that you really depend upon. It could be your database. Your example before of, "Hey, do we hit a database with this call?" would definitely be something to consider with a deep health check.
Jon: Okay, great. It makes sense what they are. We talked about how it's important and good to use shallow health checks. If our main idea is that we just want to stop routing traffic to a process that's thrashing or dead, then shallow makes sense. When does deep make sense? You said maybe other services are running that you depend on, but can you just give an example? Have we ever used a deep health check? When does it really make sense? When would you do it?
Chris: I think where this really makes sense is if your service just can't function without its dependencies. In that case, you have to have a deep health check. Let's just say you have a database that your microservice talks to and there's just no way your service is going to run without being able to talk to that database. You may very well change your health check to test that. Again, if the requirement you have is that your service just can't operate without that dependency being up, then you want to look at the deep health check.
Of course, it's also a really good thing to design your systems so that they can gracefully degrade. Something like a database, you're probably not going to be able to degrade too gracefully from, perhaps, but you may very well decide, "Rather than failing over this, I'm going to display an error message, or I'm going to switch over to a system like…"
Jon: AOL?
Chris: Yeah, exactly. AOL. That may be your strategy there while you have the alerting going on for the dependent service that, "Hey, this thing needs to be fixed."
Jon: Earlier you had said, to keep your shallow health check really lightweight and to keep it out of the way of other processing that's more important, you would create your own endpoint just for that. Would you do something similar? Say your deep health check needed to just make sure the database was there. Would you maybe make a status table that just has one row in it, and then you just go get that one row, and then that's an easy way of making sure your database is alive, and it's not doing anything to any other tables, and it's super easy on the database?
Chris: That's a great point. If you are doing a deep health check and it's going to hit your database, you're making sure that the database is up and running and that you can connect to it. If you have a table with millions of records, don't go select on that table as part of your health check. That's not what you're testing with this. You're just testing basic connectivity. Go hit a table with a single record in it. Even though this is a deep health check, you still want to keep it light.
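As a rough illustration of that "keep it light even when it's deep" advice, here is a hedged Go sketch of a deep check that only verifies database connectivity by reading a hypothetical one-row "status" table. The table name, query, and timeout are assumptions for the example, not anything from the episode.

```go
package health

import (
	"context"
	"database/sql"
	"net/http"
	"time"
)

// NewDeepStatusHandler returns a /status handler that also verifies the
// database dependency. It reads from an assumed one-row "status" table so
// the check stays cheap: we are testing connectivity, not query load.
func NewDeepStatusHandler(db *sql.DB) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		// Bound the deep check so a slow database can't make the health
		// check hang past the load balancer's timeout.
		ctx, cancel := context.WithTimeout(r.Context(), 2*time.Second)
		defer cancel()

		var one int
		err := db.QueryRowContext(ctx, "SELECT 1 FROM status LIMIT 1").Scan(&one)
		if err != nil {
			http.Error(w, "dependency check failed", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	}
}
```

A 503 here tells the load balancer to pull the instance, even though the process itself is still up, which is exactly the trade-off the deep check is making.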
Jon: Right on. Where do we go from here? We know what shallow health checks are, deep health checks are. Maybe we can get into some implementation styles?
Chris: Yeah. Maybe something to talk about more with the deep health checks is some of those considerations. Again, they're expensive, so you need to take that into account. You also have this issue of startup latency. Typically, when you're spinning up a new instance of the service, one of the things that is going on is a health check being performed. Once that is successfully passed, then the system knows that this new service is spun up correctly and it can now be put into rotation, effectively.
Jon: Do you mind if I say that in just a little bit of a different way? I mean, that was absolutely correct, but it just felt a little complex to me when I heard it. I just want to say, you've got different containers if you're using ECS or different machines if you're using EC2, who knows what if you're using another service. They're all available. You've got a cluster. You've got a lot of things running. You want each of them to be able to service your requests. When you said put it into the rotation, that's what you mean. Your load balancer that's in front of all these things is going to be able to start routing requests to that thing. Now it's in the rotation. I just had to clarify that one term.
Chris: Yeah, absolutely. You have that startup latency to consider. A lot of systems like the ELBs, or whatever cluster mechanism you're using, are going to have a certain amount of time before they fail. They're going to hit a health check, and if it doesn't respond within five seconds, then it failed the health check. It's not going to sit there forever. There's going to be some time associated with it. If you have a deep health check that's very expensive and can't respond in that time, then you're going to have a big problem, because you're never going to pass your health check.
Even though everything is perhaps okay with your system, it's never going to pass the test. You need to take that into account. It could be a good idea to have an initial health check that's pretty deep and expensive, and then after that, you switch over to shallow. A lot of times it's like, when you start up, you want to make sure that everything is up and running. Once that's done, then you can switch over to a shallow check. That's definitely a bit more complicated and advanced to do, but definitely something to take into account.
The other problem or consideration to take into account with the deep health checks is this concept of a domino effect. Your deep health check is hitting multiple services. Imagine the health check on your main service takes some amount of time and then it hits a dependent service for a health check. Maybe it's actually another microservice, not something like a relational database. Well, what if that microservice's health check goes and hits another one? You start chaining together these requests and you have to take into account all of that time. Then also, what happens if it's one of those in the middle that's failing, or it's the tail end? It gets much more complicated with the deep health check. This is typically why, almost all the time, you're going to stay with the shallow health checks and you're going to rely on your monitoring and your alarms independently for your upstream dependencies.
Jon: Yeah, that makes sense. I was just trying to think about how that domino effect would work, because when the microservice that you call calls the microservice that it depends on, if it just happens to hit the one that's down out of 100 that are available, then it thinks the whole thing is down and it's going to say, "I've got to be down too. I'm not working either," when really it was just unlucky. That potential for unluckiness keeps getting multiplied as you go deeper and deeper in your list of dependencies of microservices. Yeah, it seems like with microservice deep health checks, really think hard before you put those in place.
Chris: Yeah, and in that particular case, hopefully everything else is working, so your upstream services should have their own health check and they're in a cluster. Ideally, you shouldn't even be hitting it. If there is one out of however many nodes in your cluster that's bad, it should have failed its health check with its cluster and then been pulled. Hopefully, that has happened before your service even tries to go and connect to it, so it's not even in the routing. But it could be that it's failing and hasn't been caught yet. Health checks run periodically. They're not running in real time. You have to use some interval for these health checks. You are going to have failures, and there will be a window before they're caught. Plan around that.
Jon: You just talked about the interval, and you had talked about a few parameters that you need to think about when you're setting up a health check. I think we might as well just make that concrete in terms of AWS. I'm just going to go out on a limb. I can't remember for sure, but I think that there are basically three parameters that you have to keep in mind. One is the delay before you start doing health checks. I want to wait 30 seconds, 45 seconds, a minute before I even send my first health check to this system, to give it time to start up. Another one is how often I want to send health checks. I want to send health checks every second, every half a second, every 20 seconds, every five minutes, and that really depends on the type of service. There's no best practice there, it just absolutely depends on the type of service and how flaky or finicky it is. I think the third one is how long you want to give your health check to succeed or fail. Does it fail after 100 milliseconds, because you have a super low latency requirement, or does it fail after 1 second, 5 seconds, or 30 seconds? I think those are the three parameters, but please correct me if I'm wrong.
Chris: In general, I think those are the categories. For AWS in particular, for ELB health checks, some of the parameters are, imagine: how many times does it have to pass a health check before it's determined healthy? That's a parameter. You can set it to one; I think the default is two. It's got to successfully pass two health checks before it will get put into rotation. You also have how many health checks it needs to fail before it's considered unhealthy. And you have your health check interval: how often are you running these health checks.
Jon: That one I got.
Chris: Yes. Then, once something has been marked unhealthy, how many successful health checks does it need to pass before it gets put back into rotation? The delay for starting that initial health check, that's not on the ELB, that's usually on some other service. ECS has this. ECS will spin up a task, and you can specify a delay before it registers it as part of the target group.
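None of this is the actual ELB or ECS API. As a way to make the parameters concrete, here is a small Go sketch of a generic poller that applies a startup delay, an interval, a per-check timeout, and healthy/unhealthy thresholds, the same knobs Jon and Chris list; all the names, types, and numbers are invented for the example.

```go
package health

import (
	"context"
	"net/http"
	"time"
)

// CheckConfig mirrors the kinds of knobs discussed in the episode.
// The field names and semantics are illustrative, not AWS's.
type CheckConfig struct {
	StartupDelay       time.Duration // wait before the first check (ECS-style grace period)
	Interval           time.Duration // how often to run the check
	Timeout            time.Duration // how long a single check may take
	HealthyThreshold   int           // consecutive passes before "in rotation"
	UnhealthyThreshold int           // consecutive failures before "out of rotation"
}

// Poll runs shallow HTTP checks against url and calls onChange whenever the
// target crosses a threshold. It stops when ctx is cancelled.
func Poll(ctx context.Context, url string, cfg CheckConfig, onChange func(healthy bool)) {
	time.Sleep(cfg.StartupDelay)

	client := &http.Client{Timeout: cfg.Timeout}
	passes, failures := 0, 0
	ticker := time.NewTicker(cfg.Interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			resp, err := client.Get(url)
			ok := err == nil && resp.StatusCode == http.StatusOK
			if resp != nil {
				resp.Body.Close()
			}
			if ok {
				passes++
				failures = 0
				if passes == cfg.HealthyThreshold {
					onChange(true) // e.g. put the target back into rotation
				}
			} else {
				failures++
				passes = 0
				if failures == cfg.UnhealthyThreshold {
					onChange(false) // e.g. take the target out of rotation
				}
			}
		}
	}
}
```

An ELB does the equivalent of this for you; the sketch is only meant to show how delay, interval, timeout, and the two thresholds interact.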
Jon: That's a little complicated. Wouldn't it be nice if you could just configure this all in one place? But ECS is the one telling the load balancer, "Now I'm running for you," as opposed to the load balancer saying, "I see you there. I'm going to give you time to be ready," where I could configure all my health check stuff in one place. But it makes sense. ECS is aware of the fact that it takes time to start up; the load balancer has no idea. The load balancer is like, "Just let me route traffic somewhere."
Chris: And the load balancer is very generic. There's tons of different things that you could put behind it.
Jon: Shall we move on into implementation styles? I jumped the gun on that one before.
Chris: Yeah. That actually gets into implementations, like where are you using these health checks? The most common, without a doubt, as far as services and containers go, is doing health checks at the ELB level. This is just built into the load balancer. It's just part of it. Load balancers are keeping a membership set of all the different hosts, computers, targets, whatever it is that it's managing as a set. This is the set of nodes in my cluster that can answer requests. It has a membership set, and those health checks are built right into load balancers.
The load balancer knows, "Hey, for every single one of the nodes in here that's in the membership set, go periodically and hit its health check, and if it fails, take it out of the membership set, mark it as unhealthy. Continue hitting it, and if I get it back, so it's now healthy again, then I'll put it back into the membership set." All that's done at the ELB. You get it for free. You don't have to do anything other than configure that health check, have an appropriate health check implemented by your service, and away you go. This is just the simple, common routine, if you will. Any microservices that have inbound traffic and are fronted by an ELB, this is a great pattern for.
It gets more complicated when you have services that aren't fronted by an ELB, because now you have to ask yourself what's going to do the health check. Common examples of this are background jobs or daemons. Basically, you can think of it as push versus pull. The ELB-fronted services have their requests pushed to them, coming in the front, versus these daemon services that are typically pulling. They're the ones that are going out and pulling, looking for work to do periodically. They don't have inbound requests coming into them. Instead, they're doing work, and they're basically a client, and they're probably hitting something else that's fronted by an ELB. For those, you're going to need a custom implementation to do these health checks. It gets a bit more complicated, but it's also very important to do. There are tons of different ways that you can do this.
Jon: I don't think I've ever worked for a software company that had some daemon processes running where one didn't go down without anybody knowing about it. That always happens when you have a startup company: you're building systems, you're going fast, you build the daemon, and you're like, "Wait, how come we haven't seen any PDFs generated in a while? Ah, the daemon process isn't running."
Chris: Typically in those cases, you'll notice minutes, hours, days later and you're like, "Uh-oh, this thing was down the whole weekend," and you never knew. That's a big bad. Health checks for these kinds of things are super important. But again, it's more complicated. You've got to figure out how you implement it.
Jon: Yeah, you definitely want to find out about it on Sunday mornings or Saturday afternoons, not like Monday morning at nine. You show up and, "I can fix it, that's my job," never that convenient.
Chris: Absolutely. Again, there are various different plans of attack there. You can do something really simple, and maybe even a little bit silly, but go ahead and put an ELB in front of those daemon jobs and just have them have one inbound route: your status check. Have a private-facing ELB that basically has one job and one job only, and that is to check whether these things are up and running.
Jon: That's great. Is it then really easy to just configure some sort of an alert, a CloudWatch alert, so that you can be notified when the things are not available?
Chris: In this particular case, you do not have to do anything, because the ELB…
Jon: Oh, it will start it over.
Chris: Yeah. You have your simple health check. It's kind of weird because you're only putting an ELB in front of it for the health check. Thinking about it, it's like, "I don't know, why not?" It's a minimal amount of code and you can leverage what you're used to.
Jon: Just to make sure I understand this: because the ELB will signal to ECS, and you're supposed to keep the service running, at least one instance of the service running, the ELB is what tells ECS, "This thing is down. Restart it." Is that why you would put an ELB in front of it?
Chris: The reason I put an ELB in front of it is because you need something to do the health check. This is actually how a normal service running on ECS, like a microservice, works. It's not that ECS detects that it's down, it's the load balancer doing the health check that figures out that it's down. It marks it as unhealthy. ECS subscribes to that event, it sees that, and then it kills the task and spins up a new one. Then that comes back up and inserts itself into the membership set. The ELB then takes over and performs its health checks. If it passes, it goes back in the membership set. It's this dance back and forth between them.
You can do the same thing with these background, these daemon jobs. Really, the only extra work you have to do is to update that daemon to accept some HTTP traffic. It's like you have to have a little micro HTTP server just listening and able to satisfy requests that are coming in over HTTP or HTTPS, whatever it may be.
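Here is a rough Go sketch of that pattern: a pull-based daemon whose only concession to health checking is a tiny embedded HTTP listener that a private ELB (or anything else) can probe. The work-loop contents, the "stale worker" rule, and the port are invented for the example.

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

func main() {
	// lastWork records (as a Unix timestamp) the last time the daemon's
	// polling loop completed a pass.
	var lastWork atomic.Int64
	lastWork.Store(time.Now().Unix())

	// The daemon's real job: periodically pull work from somewhere.
	go func() {
		for {
			// doWork stands in for whatever this daemon actually does:
			// generating PDFs, draining a queue, and so on.
			doWork()
			lastWork.Store(time.Now().Unix())
			time.Sleep(30 * time.Second)
		}
	}()

	// Tiny embedded health endpoint so something (the private ELB in the
	// episode's example) has an inbound route to probe.
	http.HandleFunc("/status", func(w http.ResponseWriter, r *http.Request) {
		// Arbitrary rule for this sketch: unhealthy if the loop has not
		// finished a pass in the last 5 minutes.
		if time.Since(time.Unix(lastWork.Load(), 0)) > 5*time.Minute {
			http.Error(w, "work loop stalled", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}

func doWork() {
	// Placeholder for the daemon's actual pull-based work.
}
```

The nice property is that the daemon now looks, to the ELB and ECS, just like any other service: fail the check and the task gets recycled.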
Jon: I love that hack. It's small and it just takes advantage of all this heavy lifting that AWS already knows how to do.
Chris: Yeah. It feels like cheating to me and it doesn't feel right. But again, it's simple and fast. You can do other things that are perhaps a bit more sophisticated. There's a bunch of different techniques that you can use, but basically, you just have to have something that is running at regular intervals that can go, reach out, talk to these things, and figure out whether or not they're running.
Whether it's a CloudWatch alarm that's triggering a Lambda to go figure out if something's running or not, then you can deal with removing these things, or failing them and marking them, and letting ECS then kill it and restart it. That's a more sophisticated approach and there are some pros to that but, again, a lot more heavy lifting.
Jon: Right. Great. I think we're on our last bullet point here of the day. Synthetics, what's this?
Chris: Yeah. The last thing I want to talk about is this concept of synthetics. Health checks are verifying that your service is up and running and can respond to requests, but that doesn't necessarily mean that things are going swimmingly well and that what users are seeing is actually correct. A really great example of this would be: you have a website and maybe there's a login page, and it has to go and talk to a database or maybe some other dependent microservice, and something goes wrong, and there's a bug in the code where it's just not rendering the login box, the username and password fields on the login screen. It's a broken web page.
Your health checks aren't going to catch this. The health check did its job when it hit port 80 or port 443; it passed, but really your site doesn't work. Synthetics are basically health checks from the caller's standpoint, from the end user's standpoint. In that particular example, you would have a synthetic that's going out and it's not just testing whether it gets back a 200. It's actually examining the response and verifying that the response is correct. You can really think of this as a test case.
Jon: I guess a production integration test.
Chris: Yeah. We had a great talk today with the team about doing UI testing for mobile apps and how you actually create those tests. That's something you could use a synthetic for, which would be a useful thing. It would go and fetch the HTML, then load that into the DOM and check that the elements they are expecting are actually there and have the right text labels on them, or whatnot. Again, it's another one of those things that is really useful to have. You're verifying not only that your service is up, but also that what the end user is expecting is indeed correct.
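As a sketch of the kind of synthetic Chris describes, here is a small Go program that hits a page and verifies not just a 200 but that an expected fragment (here, a hypothetical login-form marker) is actually in the response body. The URL and the marker string are assumptions for the example; a real setup might run something like this on a schedule, for instance from the CloudWatch-triggered Lambda mentioned earlier, and alert on failures.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
	"time"
)

func main() {
	// Hypothetical target and expectation; a real synthetic would check
	// whatever the end user actually needs to see on the page.
	const url = "https://example.com/login"
	const expected = `name="password"` // marker for the login form's password field

	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(url)
	if err != nil {
		fmt.Fprintln(os.Stderr, "synthetic failed: request error:", err)
		os.Exit(1)
	}
	defer resp.Body.Close()

	body, err := io.ReadAll(resp.Body)
	if err != nil || resp.StatusCode != http.StatusOK {
		fmt.Fprintln(os.Stderr, "synthetic failed: bad status or body:", resp.StatusCode, err)
		os.Exit(1)
	}

	// A shallow health check stops at "200 OK"; the synthetic also asserts
	// that the content the user depends on actually rendered.
	if !strings.Contains(string(body), expected) {
		fmt.Fprintln(os.Stderr, "synthetic failed: login form not found in page")
		os.Exit(1)
	}

	fmt.Println("synthetic passed")
}
```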
Jon: The thing that makes me think of is, as we were talking about health checks, shallow and deep, I was thinking that at some level, we also want our code to just react when things aren't going well. We want our code to catch errors and start to behave in a way that is appropriate for the types of errors that are seen. If things aren't going the way they should with a database or another service that we depend on, the code should start to be able to shut things down a little bit, but that's really difficult.
A lot of times when you write systems, you don't have the time to make your code go that deep. This is a way of saying the code may return some bad stuff, and the whole system might get into some bad states, but if we can write these synthetics to go in and test that the world looks right from a user's point of view, then we at least have a fighting chance of finding out if things in the world are not the way they're supposed to be before users do, or very early on in that process. Even if not all of our error handling, and thinking about every single microservice and what it's supposed to do in the case of an error, is totally vetted, baked, and perfect, because it never is.
Chris: Yeah, absolutely. You can think of it this way: the combination of shallow health checks plus synthetics kind of gives you the advantage of those deep health checks. That's really what they're doing. If your end user is not getting the response they expect, or your caller that's an API is not getting the kind of response it expects, whether there are bugs in the code or one of the upstream dependencies is down, your synthetic will catch that. It gives you the best of both worlds there.
Don't use deep health checks for your standard health check that's hitting at a more frequent pace; instead, use synthetics, and you can judge what interval you want to use for those as well. It's something that can be very useful in concert with the shallow health checks.
Jon: Very cool. Well, this has been fascinating for me. I haven't done a lot of work in this area recently, so I learned a lot. Thank you.
Chris: Awesome. Go check your status.
Jon: Right on.
Chris: Alright.
Jon: Talk to you next week.
Chris: All right. Thanks, Jon. See you.
Rich: Well, dear listener, you made it to the end. We appreciate your time and invite you to continue the conversation with us online. This episode, along with the show notes and other valuable resources, is available at mobycast.fm/53. If you have any questions or additional insights, we encourage you to leave us a comment there. Thank you and we'll see you again next week.