What’s Behind Your Load Balancer Matters
Web applications require some level of monitoring to ensure their success. If you run a web app of any size, you more than likely use some form of load balancer (LB). There are plenty of LB options available: an ELB in AWS, an F5 in a data center, DNS-based distribution, or even something as simple as Nginx running as a reverse proxy. The best load balancer solutions all have one essential feature in common: a health check that can ensure the nodes behind the LB are healthy. Typically, this involves a single test configured at the LB, defined along these lines: "If you make a certain request of the node and receive a certain result, then consider the node healthy and send traffic to it." This test is usually sufficient for routing traffic, but not always sufficient for giving you visibility into the performance and stability of all the applications you have running behind the LB.
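For a concrete picture, here is a minimal sketch (not part of our project) of the kind of heartbeat endpoint an LB health check might poll. The /health path and the --OK-- body are illustrative assumptions:

const http = require("http");

// A minimal heartbeat endpoint of the kind an LB health check polls.
// The path and response body here are illustrative assumptions.
http
  .createServer((req, res) => {
    if (req.url === "/health") {
      res.writeHead(200, { "Content-Type": "text/plain" });
      res.end("--OK--"); // the LB marks the node healthy on this response
      return;
    }
    res.writeHead(404);
    res.end();
  })
  .listen(8080);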
Full visibility into application stability is important. Our legacy application often reports a healthy status to the load balancer on the basic heartbeat health check page; however, on deeper inspection, we've often found that some of the nodes were not performing as they should despite that status. We needed something a little more advanced to observe node behavior and maintain a consistent experience for our customers. In addition to the insights our monitoring system (New Relic) gave us, we wanted to probe other HTTP endpoints ourselves and receive quick, accurate notifications if something was off.
Our application has more than one cluster that uses the same configuration; to streamline the process and allow for more efficiency, we wanted to apply the same check against all of them. At Teaching Strategies, our team is fairly polyglot (Python, Node, and Go), but I chose Node for this particular AWS Lambda function to handle the many asynchronous calls and to provide a low barrier to entry for others who might want to leverage the project using a "fast enough" monitoring approach. If you aren't using AWS and/or Lambda, no worries: you can easily modify the code and call it from a Node.js server or anything that can run CRON-like jobs. To see our full project, you can visit our GitHub: https://github.com/teachingstrategies/lb-node-watcher.
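For instance, here is a sketch of running the same check from a plain Node.js process on a schedule. The node-cron package and the ./index entry point are assumptions here; any CRON-like scheduler would work:

const cron = require("node-cron"); // assumed scheduler; any CRON-like tool works
const { handler } = require("./index"); // assumes the Lambda handler is exported here

// Every five minutes, invoke the handler with an empty event,
// which makes it check all clusters.
cron.schedule("*/5 * * * *", async () => {
  const results = await handler({}, {});
  console.log(JSON.stringify(results, null, 2));
});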
Examining the Code
I broke down the overall project into six parts:
config.js – Configures the app.
checkNode() function – Validates a node in the load balancer.
checkCluster() function – Checks the nodes in the cluster.
checkClusters() function – Checks all of the clusters.
The Lambda handler – Mostly boilerplate code for the AWS Lambda entry point.
resultsHandler() function – Processes the response from the check functions.
The checkNode function is probably the most interesting. I wanted the function to be configurable so that we could test response metadata, the returned content, or any errors in the call. To ensure that we could configure it externally, I decided to allow for functions to be passed in to test the response.
Let’s look at the code for checkNode():
// fetchUrl comes from the "fetch" npm package
const checkNode = (url, testFunc) => {
  return new Promise((resolve, reject) => {
    fetchUrl(
      url,
      { agentHttps: httpsAgent, timeout: config.requestTimeout * 1000 },
      function (error, meta, body) {
        // package the response into an object to pass to the test func
        const response = { error: error, meta: meta, body: body };
        // if there is an error in the response, resolve false and stop
        if (response.error) {
          resolve(false);
          return;
        }
        // test the node response using the function passed in
        try {
          resolve(testFunc(response));
        } catch (err) {
          resolve(false);
        }
      }
    );
  });
};
checkNode() takes the URL of the node and a function to test the response as its parameters. It resolves the promise with either true or false, deliberately never rejecting.
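For example, inside an async function you could run a one-off check with an ad-hoc test function (the URL here is hypothetical):

// Hypothetical one-off check: resolves true if the node answers with a 200.
const healthy = await checkNode(
  "https://192.168.1.1/health",
  (response) => response.meta.status == 200
);
console.log(healthy); // true or false, never a rejection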
For examples of tests you can run with testFunc, look in config.js. There are four functions: two for computing the URLs and two for performing the actual tests. Let's look at a test function.
const inLbTest = (response) => {
  if (response.body.toString().substr(0, 6) == "--OK--") return true;
  return false;
};
This particular test just checks whether the first six characters of the HTTP response body are --OK--.
Here’s another example of a test looking at the response code in the health check:
function healthCheckTest(response) {
  if (response.meta.status == 200) return true;
  return false;
}
This test is simply looking for a "200" response. One thing to note is that healthCheckTest is a bit of a misnomer, since it is the actual application health check rather than the health check performed by the load balancer.
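The other two functions in config.js build the URLs that get passed to checkNode(). The repo defines inLbUrl() and healthCheckUrl(); a minimal sketch could look like this, though the exact path segments here are assumptions:

// Sketch of the URL builders; the paths shown are assumptions.
const inLbUrl = (node) => `https://${node.host}/lb-status`;
const healthCheckUrl = (node) => `https://${node.host}/health`;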
checkCluster() loops over the nodes like so:
const checkCluster = async (clusterName, nodes) => {
  let results = [];
  for (let index = 0; index < nodes.length; index++) {
    let node = nodes[index];
    if (node.cluster == clusterName) {
      node.isInLb = await checkNode(config.inLbUrl(node), config.inLbTest);
      node.apiStatus = await checkNode(
        config.healthCheckUrl(node),
        config.healthCheckTest
      );
      results.push(node);
    }
  }
  return results;
};
In the main loop, we pass the function inLbTest() to checkNode() to test whether the node is active in the load balancer, then do the same with healthCheckTest() on the next lines to test the application itself.
Similar to checkCluster(), which loops over the nodes, checkClusters() loops over the clusters, and you can have as many clusters as you'd like. The nodes are configured in the config.js file as an array of objects like the one below, with each node tagged with the cluster it belongs to.
const nodes = [
  { "hostName": "host1", "host": "192.168.1.1", "cluster": "cluster1" },
  { "hostName": "host2", "host": "192.168.1.2", "cluster": "cluster2" }
];
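checkClusters() itself isn't shown above; a sketch consistent with how the handler calls it might look like the following, assuming config.clusters is an array of cluster names:

// Sketch of checkClusters(); assumes clusters is an array of names
// such as ["cluster1", "cluster2"], and "all" means check everything.
const checkClusters = async (clusters, nodes, clusterName) => {
  let results = [];
  for (const cluster of clusters) {
    if (clusterName === "all" || clusterName === cluster) {
      results = results.concat(await checkCluster(cluster, nodes));
    }
  }
  return results;
};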
Configuring the Lambda
Finally, we have the Lambda handler:
exports.handler = async function (event, context) {
  let clusterHealth = [];
  let resultsHandlerArgs = { displayAll: true, slackStatus: false };
  // if the Slack status environment variable is defined and true, Slack the status
  if (process.env.SLACK_STATUS == "true") {
    resultsHandlerArgs = { displayAll: false, slackStatus: true };
    clusterHealth = await checkClusters(config.clusters, config.nodes, "all");
  } else if (
    typeof event.queryStringParameters === "undefined" ||
    Object.keys(event.queryStringParameters).length === 0
  ) {
    // no query params: check every cluster
    clusterHealth = await checkClusters(config.clusters, config.nodes, "all");
  } else if (typeof event.queryStringParameters.cluster !== "undefined") {
    // if a cluster name is passed in the query params, only check that cluster
    clusterHealth = await checkClusters(
      config.clusters,
      config.nodes,
      event.queryStringParameters.cluster
    );
  } else if (typeof event.queryStringParameters.badnodes !== "undefined") {
    // if "badnodes" is passed as a query param, only return the bad nodes
    clusterHealth = await checkClusters(config.clusters, config.nodes, "all");
    resultsHandlerArgs.displayAll = false;
  }
  return config.resultsHandler(clusterHealth, resultsHandlerArgs);
};
I configured this Lambda to be invoked through a Lambda function URL, so I added the queryStringParameters check in order to review a single cluster. If this parameter is defined, the function only checks the cluster with that name; otherwise, it checks them all. One thing to note is that you will want to adjust your Lambda timeout to cover the entire length of time the node checks can take: because checkCluster() awaits each check in sequence, the worst case is roughly two checks per node multiplied by the per-request timeout, which is set by a config param in config.js.
const requestTimeout = 10; // this is in seconds
Setting Up Slack Messages
After we let this run for a while and reviewed the results, we added a resultsHandler. As configured in the GitHub repo, it can send a Slack message when there is a failure; however, you can configure it to do almost anything you want here. The args object allows you to have the function return only the bad nodes and to set criteria for when you want to receive a Slack status message.
function resultsHandler(results, args) {
  if (args.displayAll == false) {
    results = resultHelpers.removeHealthyNodes(results);
  }
  if (args.slackStatus == true && results.length > 0) {
    let msg = "";
    // compose message with the status of each bad node
    results.forEach((node) => {
      msg += `Server name: ${node.hostName}, LB test: ${node.isInLb}, API test: ${node.apiStatus}. \r`;
    });
    // hookUrl, channel, and sender are defined in the surrounding module scope
    resultHelpers.sendSlackMessage(hookUrl, channel, msg, sender);
  }
  return results;
}
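The resultHelpers module isn't shown here; removeHealthyNodes() plausibly keeps only the nodes that failed at least one of the two checks, along these lines (a sketch, not the repo's exact code):

// Sketch: keep only nodes that failed the LB test or the API test.
const removeHealthyNodes = (results) =>
  results.filter((node) => !node.isInLb || !node.apiStatus);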
The final bit of code to note:
const https = require("https");

const httpsAgent = new https.Agent({
  rejectUnauthorized: false,
});
Since SSL terminates at our LB, we wanted to allow lb-node-watcher to connect to the nodes without a valid SSL certificate. To do so, we configured the httpsAgent to skip certificate verification.
The Bottom Line: Check Behind Your LB
In closing, no matter what load balancer you're using, it is essential to find ways to test the health of the nodes behind it. The bottom line: if the standard health test on your load balancer cannot, for whatever reason, thoroughly test your API or the applications running on the nodes behind the LB, make sure you are still monitoring what's behind the LB some other way. The small Lambda function outlined above can do the trick. Now, all that's left is to leverage this API to take other actions on your nodes beyond simply sending Slack status messages. That part is entirely up to you!
About the Author
Bill Davidson spent over a decade as a tech executive for a media company, then went on to various startups and federal agencies. He is now Director of Engineering Enablement, focused on supporting Teaching Strategies' Technical Operations team.