Thoughts on…

Java Middleware & Systems Management

Profiling and Code Optimizations


Optimizing code is an art form. What is beautiful and elegant to one developer may seem downright wasteful or extravagant to another. And aside from the fact that optimization takes time – time that could be spent fixing bugs or working on enhancements to the product – the speedup may not justify the time it took to get there.

So how does one decide whether – and, more importantly, what – to optimize? Profiling the application is always a good start. If you attach a profiler to your application, you can see which parts of the code run the slowest and mark those as candidates for improvement. But what if that code is only called once every hour, or once a day? You say you can get a 10x improvement? Wow, that sounds impressive…until I hear that the slow code only takes 10 seconds to run. Granted, 1 second is a lot faster than 10 seconds, but if that code runs once an hour that’s only a 0.25% improvement overall; if it runs once a day, that’s only a ~0.01% gain.
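To make that arithmetic concrete, here is a small back-of-the-envelope sketch (using the hypothetical 10-second and 1-second figures from above):

```java
public class SpeedupMath {
    public static void main(String[] args) {
        double before = 10.0; // seconds per invocation, before optimization
        double after  = 1.0;  // seconds per invocation, after a 10x speedup
        double saved  = before - after; // 9 seconds saved per invocation

        // The overall gain depends on how often the code runs, not just how
        // much faster it got: seconds saved divided by the wall-clock window.
        double hourlyGain = saved / 3600.0;  // runs once an hour -> ~0.25%
        double dailyGain  = saved / 86400.0; // runs once a day   -> ~0.01%

        System.out.printf("hourly: %.2f%%, daily: %.3f%%%n",
                hourlyGain * 100, dailyGain * 100);
    }
}
```

A 10x micro-benchmark win melts to a fraction of a percent once you weigh it by how rarely the code actually runs.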

OK, so just run the profiler for a longer period of time, so you can collect metrics across many runs and get better averages. This way you’ll not only know which parts of the code are slow, but also how frequently they’re called in a typical run of the product.

But what if your application isn’t an application at all? What if it’s a framework – a generic tool or service that is a means to an end, but is not a usable product by itself? Depending on how the platform is leveraged or extended, its runtime profile can look vastly different. What may be a perfectly reasonable feature for one user may be so slow it becomes unusable for another.

Granted, the people writing the framework *should* have an idea of how it will be used, and understand a moderate set of the most common scenarios and use cases, but it becomes impractical to respond to every conceivable usage of a product ahead of time. And this is where customers come in.

Recently, a fairly large financial firm approached us with a performance problem. Their RHQ Server was taking a long time to start up, and because of that lag other subsystems started throwing errors. After some investigation, the problem turned out to be the initial load of data into the alerts engine cache. If the cache didn’t load in a timely fashion, the reports coming up to the server that needed to be checked against the cache data would eventually give up (time out) and complain.

In previous versions of the product (1.0.1 or earlier), the server attempted to load the entire cache at startup. Up until that point, not a single customer had run into performance problems with cache loading, but this customer had a fairly large environment (~100K resources). And it wasn’t just the large environment that hurt them: they also used large numbers of alert templates across many resource types, which created thousands of alert conditions.

Upon initial inspection, the loading code didn’t look all that expensive. There weren’t any unnecessary operations being done. It was a straight load out of the database: find the alert definition, load all conditions for each definition, and load any necessary relational data for each condition. The problem here, as some astute readers may already be guessing, is database round trips.

The problem actually arose from the ease and speed with which the data access layer could be written. RHQ uses EJB3/Hibernate for a large portion of its data access needs, which makes development rapid but also makes it easy to introduce inefficiencies if you aren’t careful. And that’s precisely what happened here. Loading the alert definitions was inexpensive, but for each definition the set of alert conditions under it was loaded in a separate query, and for each condition any necessary relational data was loaded in yet more separate queries. This model explodes quickly, turning the load of a few thousand definitions into tens of thousands of queries.
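The explosion is easy to quantify. The sketch below simply counts round trips for the per-row loading pattern described above (the definition and condition counts are illustrative, not this customer’s actual numbers):

```java
public class RoundTripCount {
    // Naive pattern: 1 query for all definitions, then 1 query per definition
    // for its conditions, then 1 query per condition for its relational data.
    static long naiveQueries(long definitions, long conditionsPerDefinition) {
        return 1 + definitions + definitions * conditionsPerDefinition;
    }

    public static void main(String[] args) {
        // A couple thousand definitions with a handful of conditions each...
        long naive = naiveQueries(2000, 5); // 1 + 2000 + 10000 = 12001 round trips
        long bulk  = 3;                     // vs. roughly one bulk query per table (illustrative)
        System.out.println(naive + " round trips vs " + bulk);
    }
}
```

Each round trip is cheap on its own; it’s the multiplication across the object graph that kills you, especially over a network link to the database.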

The solution here was to refactor the code to load all necessary data in one fell swoop (i.e., use more complex queries). It only took about 2 hours to write the new queries, and a few more hours to fully test for regressions, but the results were dramatic. I tested the scenario using an RHQ Server instance that was remote across a 100MB line to a Postgres 8.3.x database. The new code took only 1/7 of the time the old code did, used *significantly* fewer database resources, and cut the number of round trips from tens of thousands to a few dozen.
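With Hibernate/JPA, this kind of refactor typically means replacing the implicit per-entity loads with an explicit fetch join, something along these lines (JPQL; the entity and field names here are illustrative, not RHQ’s actual schema):

```sql
SELECT DISTINCT def
  FROM AlertDefinition def
  LEFT JOIN FETCH def.conditions cond
 WHERE def.enabled = true
```

One query like this pulls each definition together with all of its conditions in a single round trip, instead of issuing a follow-up query per definition.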

We further improved things by loading NO cache data at startup (which on moderately powerful machines now results in sub-minute startup times). Instead, data is loaded on an agent-by-agent basis, and *only* when an agent connects to the server. We also added new logic to the communications layer so that an agent will not send any reports to the server until its corresponding cache data is fully loaded. This way, reports don’t time out waiting for cache data to load – the data is already loaded by the time the report reaches the server.
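The report-gating idea can be sketched with a simple latch. This is an illustration of the concept, not RHQ’s actual communications layer; the class and method names are hypothetical:

```java
import java.util.concurrent.CountDownLatch;

public class ReportGate {
    // Opens once the server finishes loading this agent's cache data.
    private final CountDownLatch cacheLoaded = new CountDownLatch(1);

    void onCacheLoadComplete() {
        cacheLoaded.countDown(); // server signals: cache for this agent is ready
    }

    String sendReport(String report) {
        try {
            cacheLoaded.await(); // block report delivery until the cache is loaded
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return "interrupted";
        }
        return "sent: " + report;
    }

    public static void main(String[] args) {
        ReportGate gate = new ReportGate();
        new Thread(() -> {
            try { Thread.sleep(100); } catch (InterruptedException ignored) {}
            gate.onCacheLoadComplete(); // simulate the cache load finishing
        }).start();
        System.out.println(gate.sendReport("availability report"));
    }
}
```

The key property is that the report simply waits at the gate rather than racing ahead and timing out against a half-loaded cache.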

All in all, we learned a lot about how the product would be used in a larger enterprise, which played perfectly into the high availability features made available in the 1.1.0 release. And we did it without burning precious development time profiling the platform blindly, before we understood how customers with big environments would actually use it. The upside: a single customer that wanted to push the 1.0.1 version of the framework to its limits helped us make it better for everyone who will use RHQ in larger environments from here on out.


Written by josephmarques

October 13, 2008 at 10:31 pm

Posted in rhq
