As part of an ongoing effort to migrate a significant part of our stack to the cloud, we recently moved our job scheduling system, Rundeck, from an on-premise bare-metal machine to AWS. With the exception of a few minor operational annoyances, pretty much everything worked. It didn’t take long, however, to notice jobs weren’t running on time: minutely jobs were starting 30-50 seconds late, skipping minutes, and so on. As our configuration was near-identical, we weren’t really sure what the issue was. Given the lack of documentation on Rundeck’s architecture, and having seen similar unanswered questions across various message boards, after three of us had spent a couple of days on the issue, we wanted to write up the steps we took to troubleshoot and resolve it.
Job Concurrency Limitations
The most common cause of jobs not running on time in a scheduling system is the permitted level of concurrency, and since we saw no CPU-related run queue issues, we first suspected the delay was caused by a shortage of available worker threads. So, per the Rundeck administration tuning guide, we increased the Quartz scheduler’s thread pool size (org.quartz.threadPool.threadCount) and the related JVM heap settings, then restarted the service. This had no effect on the jobs, and we verified the thread pool was operating as expected.
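For reference, the thread pool tuning lives in Quartz’s properties configuration. The values below are illustrative, not our exact settings:

```properties
# Illustrative quartz.properties tuning -- actual values depend on workload.
# Number of worker threads available to execute jobs concurrently.
org.quartz.threadPool.threadCount = 50
# Thread priority for the pool (5 = Thread.NORM_PRIORITY).
org.quartz.threadPool.threadPriority = 5
```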
AWS Instance Sizing/Performance
At this point, we knew the Quartz scheduler was not the issue. As everything else in the configuration was identical, and although we hadn’t seen any CPU-related issues, we figured it could simply be virtualization slowness. So, just to rule it out, we provisioned the next-largest instance. As expected, given we had seen nothing to indicate CPU issues, we saw no difference in performance on a larger instance; jobs were still running 30-50 seconds behind.
Java Version Differences
After comparing packages between our on-premise box and the AWS box, we found the on-prem box was running Java 1.8.0 while the AWS instance was running 1.7.0. We upgraded the AWS instance to 1.8.0 and saw roughly a five-second improvement; the UI was much faster, but there was no huge improvement in job start times.
Waiting for Execution
At one point we launched a job manually and noticed it was queued immediately but sat in the UI for over 30 seconds saying, “Waiting for execution to start.” We thought this was odd, as the box wasn’t exhibiting any CPU-related issues and there were hundreds of free threads in the scheduler to execute the job. We searched and came across this thread, which described a very similar problem and was unanswered.
Local vs SSH Job
To make sure SSH wasn’t the issue, we created a local job that simply echoed the current timestamp into a file in /tmp. This job, too, started 30-50 seconds behind schedule, so clearly this wasn’t an SSH issue.
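The test job was nothing more than a one-line script along these lines (the path and filename are illustrative):

```shell
#!/bin/sh
# Append the current timestamp to a file; comparing these timestamps
# against the scheduled minute boundaries shows the scheduling lag.
date '+%Y-%m-%d %H:%M:%S' >> /tmp/rundeck-timestamp-test.log
```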
MySQL Performance
As we’re using MySQL as the backend of Rundeck, we figured there might be something wrong with database performance. Sure enough, we noticed several queries taking multiple seconds to complete, and found the tables weren’t indexed properly for those queries. Rather than creating indices ourselves, we searched and found a ticketed performance issue specifically related to missing MySQL indices. We created those indices and, while we no longer saw any slow queries, job execution didn’t improve.
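For illustration only — these are not the exact statements from the ticket, and the table and column names here are hypothetical — the fix amounts to indexing the columns Rundeck’s slow queries filter on:

```sql
-- Hypothetical example: index the columns used by the slow execution-lookup
-- queries so MySQL no longer performs full table scans to answer them.
CREATE INDEX EXEC_IDX_1 ON execution (date_completed);
CREATE INDEX REPORT_IDX_1 ON base_report (jc_exec_id);
```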
Reducing the Number of Concurrent Jobs
To see whether the issue occurred with the same frequency under a lighter load, we disabled all but one minutely job. It still started ~6 seconds late, but that was far better than 30-50 seconds, and as we re-enabled the other jobs one by one, the start delays grew. Still, at 6 seconds off with a single job, something else had to be going on.
Watching the Filesystem
While watching the logs for this single job, we noticed a new job log file was being created at EXACTLY the time the job was supposed to start: the first second of each minute. Unfortunately, when we looked at all the other logs, none indicated the job had even started. So how was this file created on time when no scheduler log showed the job running yet? Time to eliminate the variables.
Eliminating Outside Variables
Our Rundeck configuration includes multiple servers, whose nodes are sourced both from a local XML file (resources.source.N.config.file) and from multiple servers via URLs in the project resource configuration (resources.source.N.config.url). We disabled all of the URL-based sources and our job ran perfectly on time, over and over again. So, something was up with the URL-based configs. We noticed that, for some reason, caching was disabled (resources.source.N.config.cache=false), so we set it to true, re-enabled our URL-based sources, and restarted. Same old performance issues. Looking at the code, that setting merely adds cache headers to the HTTP request; as everyone who knows HTTP knows, a dynamic endpoint still has to do all the work, even if it returns no results.

So, we tried fetching the resource directly via cURL. It took ~2s to fetch the resource file from each server. Yikers! We then researched the resources.source.N.config.timeout option, only to find it wouldn’t really help in this case. Lastly, we decided to try saving those resource files locally and using them directly. After that, all jobs ran perfectly on time.
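The final change amounted to swapping each URL-based resource model source for a file-based one in the project configuration. Roughly, with the source index, hostname, and paths being illustrative:

```properties
# Before (illustrative): nodes fetched over HTTP on every execution.
#resources.source.2.type = url
#resources.source.2.config.url = https://other-rundeck.example.com/project/resources.xml
#resources.source.2.config.cache = false

# After: the same resource document saved locally and read from disk.
resources.source.2.type = file
resources.source.2.config.file = /var/rundeck/projects/myproject/etc/resources-other.xml
resources.source.2.config.format = resourcexml
```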
So, when Rundeck kicks off a job, it appears the job isn’t officially “started” until after the resource configuration is fetched. With multiple jobs running per minute, each fetching N URL-based resources, this caused a thundering herd against the other Rundeck servers and, accordingly, a significant lag in execution. It also explains why, in the more-jobs-per-minute case, we saw greater lag with each job we added.
Update: Response from Rundeck
The URL source is a blocking call so unfortunately it can lead to this kind of problem. We have discussed the idea to enable an asynchronous URL request to help avoid this kind of lag problem.
With the exception of making the calls to multiple URL sources concurrently, thereby reducing startup time from the sum of all response times to the maximum response time, I’m not sure how this could be improved in any asynchronous manner. As all nodes are required at the start of a job, node retrieval must, to me, remain synchronous, but it could be improved with smarter caching on the HTTP server. ¯\_(ツ)_/¯
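The sum-versus-max point can be sketched in plain shell, with sleep standing in for each ~1-second resource fetch:

```shell
#!/bin/sh
# Simulate a resource fetch that takes about one second.
fetch() { sleep 1; }

# Sequential fetches: total wait is the SUM of the response times (~3s).
seq_start=$(date +%s)
fetch; fetch; fetch
seq_elapsed=$(( $(date +%s) - seq_start ))

# Concurrent fetches: total wait is the MAX of the response times (~1s).
par_start=$(date +%s)
fetch & fetch & fetch &
wait
par_elapsed=$(( $(date +%s) - par_start ))

echo "sequential=${seq_elapsed}s concurrent=${par_elapsed}s"
```

The job still blocks until every source has answered; concurrency only shrinks the wait to the slowest single fetch.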