Troubleshooting Rundeck Remote Resource Issues

As part of an ongoing effort to migrate a significant part of our stack into the cloud, we recently moved our job-scheduling system, Rundeck, from an on-premises bare-metal server to AWS. With the exception of a few minor operational annoyances, pretty much everything worked. It didn’t take long, however, to notice jobs weren’t running on time: minutely jobs were starting 30-50 seconds late, skipping minutes, and so on. As we had a near-identical configuration, we weren’t really sure what the issue was. Given the lack of documentation on Rundeck architecture, and having seen similar unanswered questions across various message boards, we wanted to post the steps we took to troubleshoot and resolve the issue after three of us had spent a couple of days on it.

Job Concurrency Limitations
As the most common cause of jobs not starting on time in a scheduling system is a limit on permitted concurrency, and we saw no CPU-related run-queue issues, we first suspected the delay was caused by a shortage of available worker threads. So, per the Rundeck Administration tuning guide, we increased the Quartz scheduler’s thread pool size (org.quartz.threadPool.threadCount) and the related JVM heap settings, then restarted the service. This had zero effect on the jobs, and we verified the thread pool was operating as expected.
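For reference, the changes amounted to something like the following sketch; the file locations and values are illustrative and will vary by Rundeck version and installation:

    # Quartz scheduler thread pool size; set wherever your install reads
    # Quartz properties (file location varies by version). Value is illustrative.
    org.quartz.threadPool.threadCount = 50

    # JVM heap settings; how these are passed depends on how Rundeck is launched
    # in your environment (e.g., a profile script). Sizes are illustrative.
    RDECK_JVM="$RDECK_JVM -Xms1g -Xmx4g"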

AWS Instance Sizing/Performance
At this point, we knew the Quartz scheduler was not the issue. As everything else in the configuration was identical, and even though we hadn’t seen any CPU-related issues, we figured it could simply be related to virtualization slowness. So, just to rule it out, we spun up the next-largest instance size. As expected, given we had seen nothing to indicate CPU issues, we saw no difference in performance on the larger instance; jobs were still running 30-50 seconds behind.

Java Version Differences
After comparing packages between our on-premises box and the AWS box, we found the on-prem box was running Java 1.8.0 while the AWS instance was running 1.7.0. We upgraded the AWS instance to 1.8.0 and noticed a five-or-so-second improvement: the UI was much faster, but there was no major improvement in job start times.

Waiting for Execution
At one point, while launching a job manually, we noticed it was queued immediately but then sat in the UI for over 30 seconds saying, “Waiting for execution to start.” We thought this was odd, as the box wasn’t exhibiting any CPU-related issues and there were hundreds of free threads in the scheduler to execute the job. We searched and came across this thread, which seemed very similar and was unanswered.

Local vs SSH Job
To make sure SSH wasn’t the issue, we created a local job that simply echoed the current timestamp into a file in /tmp. This job, too, ran 30-50 seconds behind its scheduled start time. Obviously, this wasn’t an SSH issue.
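The test step was roughly the following one-liner (the file name is illustrative); comparing the appended timestamps against the scheduled start times makes the delay obvious with SSH completely out of the picture:

    # Append the current epoch timestamp each time the job runs
    date +%s >> /tmp/rundeck-timing-test.log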

MySQL Optimization
As we’re using MySQL as the backend for Rundeck, we figured there might be something up with database performance. After taking a look, we noticed several queries taking multiple seconds to complete and realized the tables weren’t indexed properly for those queries. Rather than creating indices ourselves, we searched and found a ticketed performance issue specifically related to missing MySQL indices. We created those indices and, while we no longer saw any slow queries, we saw no improvement in job execution.
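The actual definitions came from the ticket and from our slow-query log; purely as a hypothetical illustration (the table and column names below are made up for the example), they were plain secondary indices along these lines:

    -- Hypothetical illustration only; use the columns your slow-query log
    -- actually points at.
    CREATE INDEX idx_execution_date_started ON execution (date_started);
    CREATE INDEX idx_execution_date_completed ON execution (date_completed);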

Reducing the Number of Concurrent Jobs
To see whether the issue occurred with the same frequency under a lighter load, we disabled all but one minutely job. While that job was still starting ~6 seconds late, it was far better than 30-50 seconds, and as we re-enabled the other jobs one by one, the start delays grew again. Since even a single job was still 6 seconds off, we figured something else must be going on.

Watching the Filesystem
While watching the logs for that single job, we noticed a new job log file being created at exactly the time the job was supposed to start, i.e., the first second of each minute. Yet none of the other logs indicated the job had even started. So how was this file created at the right time when no scheduler log showed the job had actually started yet? Time to eliminate all the variables.
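For the curious, this was nothing fancier than watching directory timestamps, something along these lines (the log path is illustrative and depends on your installation):

    # Show the most recently modified log files once a second (path is illustrative)
    watch -n 1 'ls -lt /var/lib/rundeck/logs | head'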

Eliminating Outside Variables
Our Rundeck configuration includes multiple servers, which source nodes both from a local XML file (resources.source.N.config.file) and from several other servers via URLs in the project resource configuration (resources.source.N.config.url). We disabled all of the URL-based sources and our job ran perfectly on time, over and over again. So, something was up with the URL-based configs. We noticed that, for some reason, caching was disabled (resources.source.N.config.cache=false), so we set it to true, re-enabled the URL-based resources, and restarted. Same old performance issues. Looking at the code, that’s because the option only adds cache-related headers to the HTTP request; as everyone who knows HTTP knows, a dynamic endpoint still has to do all the work, even if it ultimately returns nothing new.

So, we tried fetching the resource directly via cURL. It took ~2 seconds to fetch the resource file from each server. Yikers! We then researched the resources.source.N.config.timeout option, only to find it wouldn’t really help in this case. Lastly, we decided to try saving those resource files locally and using them directly. After that, all jobs ran perfectly on time.
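To make the before/after concrete, here is a hedged sketch of the relevant project.properties entries; the source indices, paths, and hostnames are illustrative, not our actual configuration:

    # Local XML source (this is what we ended up using exclusively)
    resources.source.1.type=file
    resources.source.1.config.file=/var/rundeck/projects/myproject/etc/resources.xml

    # URL-based source of the kind we disabled (hostname/path are placeholders)
    #resources.source.2.type=url
    #resources.source.2.config.url=https://rundeck-02.example.com/resources.xml
    #resources.source.2.config.cache=true

Timing the fetch itself is a one-liner, e.g. curl -s -o /dev/null -w '%{time_total}\n' <resource URL>, which makes the per-server response time easy to see.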

So, when a job gets kicked off by Rundeck, it appears it isn’t officially “started” until after the resource configuration is fetched. As multiple jobs run per minute, each fetching N URL-based resources, this caused a thundering herd against the other Rundeck servers and, accordingly, a significant lag in execution. It also explains why, in the more-jobs-per-minute case, we saw greater lag with each added job.

Update: Response from Rundeck

The URL source is a blocking call so unfortunately it can lead to this kind of problem. We have discussed the idea to enable an asynchronous URL request to help avoid this kind of lag problem.

With the exception of making calls to multiple URL sources concurrently, thereby reducing startup time from the sum of all response times to the maximum response time, I’m not sure how this could be improved in any asynchronous manner. As all nodes are required at the start of a job, node retrieval, to me, must remain synchronous; it could, however, be improved with smarter caching on the HTTP server. ¯\_(ツ)_/¯
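To illustrate the sum-versus-max point, here is a rough shell sketch of fetching the sources in parallel (hostnames and paths are placeholders); the wall-clock time is bounded by the slowest fetch rather than the total of all of them:

    # Fetch each resource URL in the background, then wait for all of them
    i=0
    for url in https://rundeck-02.example.com/resources.xml \
               https://rundeck-03.example.com/resources.xml; do
        i=$((i+1))
        curl -s -o "/tmp/resources-$i.xml" "$url" &
    done
    wait  # completes when the slowest fetch finishes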

Reflecting on Node 2015

After an information-fueled three days, our team has returned from Mohonk Mountain House, where we held our sixth annual internal engineering conference, Node.

While Node started out as a development-only retreat, it has since become an engineering-wide conference consisting of our development, data science, product, quality assurance, and operations teams. As everyone in the technology industry has experienced at one time or another, it’s hard to find time to get to know peers, learn new things, and stay on top of organizational objectives when one is heads-down on a project; this is why we created Node.

Node is designed to provide our organization with a dedicated event, away from the office, to share knowledge, learn something new, and get to know each other in a collaborative environment. The first of these, knowledge sharing, is accomplished via conference-style sessions and workshops presented by our own colleagues.

This year, we had a number of internal, team-oriented tracks. Within each track was a presentation or workshop related to that technology or team. As an example, Corky Brown and Bill Sykes gave a kick-ass workshop on development with our new client-side-rendered web codebase, Phoenix. Likewise, our new Director of Data Science, Anton Slutsky, Ph.D., presented “Data Science @ MeetMe,” a session designed to give everyone a basic understanding of data science and big data, their importance, and their role in our organization. In addition to learning more about the work other teams are doing and how to use some of our new tools, we also bring in external speakers, who provide experience and insight from outside our day-to-day work.

As Node is composed of individuals with varying degrees of technical depth, we try our best to attract external speakers who provide experience and content beneficial to everyone. Oftentimes, these speakers are aligned with either a task or a technology we’re working on, or one we plan to put into effect in the coming year. Given the significant effort our engineering team has put into Cucumber this year, we felt it imperative to spend a considerable amount of time focused on quality. For that reason, we invited Seb Rose, co-author of The Cucumber for Java Book, to spend two days with our team.

Seb covered agile development concepts, BDD/TDD, and detailed aspects of feature/scenario development with Cucumber. Unlike presenters who lecture for hours on end, Seb took a group-oriented, hands-on approach that was well received and left many with a better understanding not only of the concepts surrounding Cucumber, but also of the best ways to use the technology itself.

Alternatively, for those who wanted something a little more technical, we asked William Kennedy, author of Go in Action, to give us an overview of Go. Bill crammed two days’ worth of Go programming language training into only eight hours; to accomplish this, we skipped the hands-on examples.

In addition to the content, we spent the remaining hours sleeping or in spontaneous team-building activities. Whether it was sitting together watching the Philadelphia Eagles vs. Dallas Cowboys game, hanging out at lunch, laughing at dinner, sightseeing, group photography, relaxing in the lounge, playing Heads Up!, or participating in a highly-competitive game of Texas Hold’em, we spent a good amount of time getting to know and collaborate with each other. This was especially useful for our new team members.

Lastly, every Node is a learning experience, and improving it is an iterative process. What did we learn this year? For one, programming-language training should be less intense: some thought the pace of the Go training was fine, while others felt it was a bit too rushed. Likewise, everyone wants a little time after lunch to explore the venue. We’ll be sure to take that into account next year.

All in all, it was another great experience for the team. I’d like to thank everyone who attended, presented, and helped make it happen. Specifically, on the engineering side, with input and feedback from multiple teams, Node was planned primarily by our own DevOps Lead Architect, Jason Lotito, who did an outstanding job. With everyone reinvigorated, on the same page, and understanding what our 2016 goals are, I’m excited to see what we will accomplish!

Upcoming Meetup: How Not To Do Postgres Connection Pooling

In May 2014, MeetMe engineers will present a talk, Lessons Learned: How Not To Do Postgres Connection Pooling, for the New York City PostgreSQL User Group. In this talk, we’ll discuss real-world connection-management issues encountered while running Postgres as the primary database for a top-ten social network, including the mistakes commonly made around Postgres connection management, the disasters that follow, and how to do things properly. An overview of the topics involved is as follows; a minimal, illustrative PgBouncer configuration sketch appears after the list:

  • Postgres connection management
  • Performance
  • Queueing
  • Limitations
  • PGBouncer usage and configuration
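As a taste of the configuration portion, a minimal pgbouncer.ini might look something like the sketch below; the database name, paths, and pool sizes are placeholders, not recommendations from the talk:

    [databases]
    ; placeholder entry pointing PgBouncer at a local Postgres instance
    appdb = host=127.0.0.1 port=5432 dbname=appdb

    [pgbouncer]
    listen_addr = 127.0.0.1
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction
    default_pool_size = 20
    max_client_conn = 1000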

The talk, which will be held at 6:30 PM on Wednesday, May 14, 2014 and hosted at the Yodle offices, will be presented by long-time MeetMe engineers and Postgres contributors Jonah H. Harris and Michael Glaesemann, respectively the VP of Architecture and Sr. Data Architect. If you’d like to attend, please join the Meetup! You can also get the ICS file here: New York PostgreSQL User Group Meetup.