Thoughts on…

Java Middleware & Systems Management

Archive for October 2008

Cluster Management


For the better part of the past year, the RHQ team has worked diligently to deliver a stable, extensible, and fault-tolerant infrastructure for managing and monitoring your enterprise. Most of the focus has been on providing services at the individual resource level – a singular Apache install, a sole IIS instance, a solitary JBoss Application Server. Nowadays, however, the name of the game is clustering. Redundant servers, clustered services, data replication, cloud computing – all have a slightly different role in the game.

There are plenty of companies out there trying to get in on the action, with high hopes of becoming formidable players. In a similar vein, the next release of RHQ promises to make its platform technology a force to be reckoned with when it comes to managing and monitoring clustered resources. There are several key facets to this (in no particular order):

• aggregate and average views of metric / measurement data
• operational scheduling and execution
• fine-grained configuration control

The first two of these are actually partially implemented. Today, if you create a compatible group (a resource group that only contains a single type of resource, e.g. only JBossAS instances) you can schedule and execute operations against the members in the group, in rolling fashion or simultaneously against all of them, as well as view aggregate and average metrics.

The shortcoming here, however, is that you need to explicitly add resources to a compatible group before you can perform these aggregate business services across them. Manually adding resources to the group would be absurd for anything but the smallest inventories. This is largely why DynaGroups exist: they were created precisely to generate vast numbers of groups according to flexible rules for partitioning your resources.
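To give a flavor of those rules, a definition like the following – using the same expression language shown later in this archive – would generate one compatible group per resource type in inventory:

groupby resource.type.plugin
groupby resource.type.name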

So why isn’t this good enough? Because DynaGroups cannot create hierarchies of grouped resources, such as a group of HTTP Connectors under a group of Embedded Tomcat Servers under a group of JBoss Application Servers. And this group-wise, hierarchical navigation is where a lot of the power will come from.

If you could aggregate your JBossAS instances into a cluster group, and then have logically clustered servers and services automatically grouped beneath that, you could navigate the resource hierarchy very quickly and efficiently. Instead of having to view data from multiple contexts, a “cluster view” would aggregate the data into a single, navigable tree structure – a single context. You wouldn’t be bouncing back and forth between different pages; everything would be at your fingertips from a single landing page.

A secondary benefit of building cluster views – instead of explicitly using compatible groups – is that you won’t clog up your resource browser with thousands of groups you might use only rarely, if ever.

Think about this: let’s say we have 60 physical hosts, 5 JBossAS instances on each of them, 3 instances to a logical cluster, which creates 100 (60*5/3) cluster groups. Let’s further say that on each instance we have 250 unique business services (this includes enterprise applications, session & entity beans, connectors, virtual hosts, datasources, queues / topics, etc). If we were to create explicit compatible groups for all nested servers and services under each of these 100 cluster groups, it would be another 25K (250*100) groups.

The cost of group creation is not the worry here; it’s the sheer volume of groups that would be shown in the application’s user interface. If you had a handful of compatible groups that you were previously using to manage your inventory, say a dozen or two, it would be much more difficult and frustrating to sift through one- to two-thousand times as much data. Granted, a very nice search interface could mitigate the situation, but it wouldn’t eliminate the underlying problem of unnecessary group creation.

So the solution migrates back to our cluster view concept. The team has come together several times over the past 2 weeks to flesh out the details, and the current functional and user interface requirements have been kept updated on the RHQ Project site here. If you have ideas above and beyond what you see there, we encourage you to speak up and let us know how we can make the new features and product improvements around resource clustering even better. If you want to be part of the action and you’re interested in becoming a contributor to the RHQ Project, look for my handle – joseph42 – in #rhq on freenode.

As always, post backs are greatly appreciated.

Written by josephmarques

October 24, 2008 at 7:59 am

Posted in rhq


One Step Closer…


I played a little word-association game with my team today. I asked them the following:

With respect to RHQ, when you hear the word ‘configuration’, what is the first word that comes to mind?

Though the answers were varied – resource, edit, product, settings, and properties – they were all perfectly in line with my suspicions. All were either synonyms of configuration, or actions you would take against something that is configurable.

So what’s wrong with that? Well, RHQ is a platform that, in a nutshell, performs systems management AND monitoring. However, these results show that people initially, predominantly, and perhaps only think of ‘management’ when they hear ‘configuration’.

Maybe the question wasn’t fair. I DID ask them to only give me one-word responses. Maybe they would’ve also mentioned monitoring had they had the ability to answer less pointedly.

Or…maybe not. More than half of the responses I got were NOT one-word answers. They were short phrases, or even multiple full sentences explaining why certain words came to mind. Granted, there is a chance the results could be skewed, because the mailing list I asked this question on has only a few dozen people subscribed to it; but it’s enough evidence to show me that the common association is management, not monitoring.

The next version of RHQ will close this gap and bring to the fore monitoring capabilities around configuration. This solution is actually twofold:

1) Detecting agent-side configuration changes

The RHQ agent already knows how to discover the configuration of a managed product on demand, so it simply has to keep a record of the last known configuration. From there, it needs a mechanism to periodically scan for the current configuration and test whether it differs from the last known one. If it does, the agent sends the results up to the server, which persists them as the new configuration for that resource and, at the same time, adds an entry to the configuration audit trail so that administrators can see what changed over time.
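As a rough sketch of that loop (the class and method names below are hypothetical, not the actual RHQ agent API), the periodic scan boils down to a compare-and-report:

import java.util.Map;

// Hypothetical sketch of agent-side configuration change detection.
public class ConfigurationDriftDetector {

    public interface Resource {
        int getId();
        Map<String, String> discoverConfiguration(); // the on-demand discovery the agent already supports
    }

    public interface ServerEndpoint {
        // The server persists the new configuration and appends an audit-trail entry.
        void reportConfigurationChange(int resourceId, Map<String, String> newConfiguration);
    }

    private Map<String, String> lastKnown; // last configuration reported to the server

    // Invoked on the agent's periodic scan interval.
    public void scan(Resource resource, ServerEndpoint server) {
        Map<String, String> current = resource.discoverConfiguration();
        if (lastKnown != null && !current.equals(lastKnown)) {
            server.reportConfigurationChange(resource.getId(), current);
        }
        lastKnown = current;
    }
}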

This development work was committed last week, and the QA for it (RHQ-988) can be tracked in JIRA here.

2) Alerting against changed configurations

This logic is completely server-side, and deals with the ability to set up alert definitions against resources that support configuration. As a consequence, alert templates will be able to create “monitors” across a large segment of the inventory quickly, making it easy to receive notifications when any managed resource in your enterprise has its configuration changed outside of the RHQ infrastructure.

This feature has been on the docket for nearly 6 months, but depended on configuration change detection (see above) being written first. So, once I saw the bits for RHQ-988 in SVN, I wasted no time implementing RHQ-342. That development work was completed this past weekend, and the QA can be tracked here.

So what does this all mean? Well, it means that any plugin that implements configuration support for the resources it defines can now be both managed AND monitored.

Below is a short list of the various configurations provided by the base plugins found in the RHQ project:

* APT repository locations
* GRUB kernel entries
* hosts file mapping of IP to canonical names
* SSHD settings, advanced configuration, and X11 properties
* PostgreSQL configuration files as well as runtime properties; database user settings, passwords, and privileges; and table schemas
* RHQ agent configurations

At the time of writing, this author knows of at least one other project that builds extensions to the RHQ platform: Jopr. Its primary focus is to provide plugins for the JBoss Application Server and related services. Simply by dropping the Jopr plugins into your RHQ distribution, you extend the configuration monitoring capabilities to the following items:

* datasource configuration and advanced settings
* connection factory properties
* JMS queue & topic information

Configuration monitoring just scratches the surface of some of the feature enhancements targeted at the 1.2.0 release of RHQ, but it does bring the platform one step closer to being a complete, end-to-end management and monitoring solution for your enterprise.

If you’re interested in helping to improve the base platform, have ideas for new plugins or extensions to existing plugins, or just want to be closer to the action, please visit the development team in #rhq on irc.freenode.net – my handle is joseph42.

Written by josephmarques

October 21, 2008 at 5:21 am

Posted in rhq


Resource Group Versatility


When you first download and install RHQ, you’ll log in to the web console and notice that there are two different grouping constructs for resources – mixed and compatible. In short, compatible groups must contain resources of the same type, whereas mixed groups need not. Under the covers, these are implemented by the exact same construct, but the meaning applied to each, and what you can do with it, is why this post got the title it did.

Mixed groups are predominantly used for security, in particular, authorization. With them you can put all sorts of resources together – Windows and Linux platforms, IIS and Apache servers, etc. Then, you can attach that mixed group to a role, and any users in that role will be able to see those resources.

If you want to be able to give someone access to an entire box, then create a mixed group with the “recursive” option enabled. By turning that option on, any resource you add to the group automatically adds all descendant resources to the group as well. For instance, if you add a platform, it will indirectly add all servers under that platform, as well as all services under all of those servers, and so on.

While mixed groups have one thing they’re good at, compatible groups have an array of functionality they excel at providing. First and foremost is their “compatibility” with all of the other subsystems RHQ provides: monitoring, configuration, operations, etc.

For monitoring, RHQ shows aggregate and average metrics across the group members. For configuration, RHQ enables you to change the configured connection properties across everybody in the group at the same time. For operations, RHQ allows you to execute the same operation against all resources in the group – at the same time, or serially (one after the other, in rolling fashion).

Very recently, a customer pointed out to me how groups – mixed and compatible – can be used in a novel way. Their question was simple: what’s the easiest method to see all of the resources in their environment that are down?

In order to do this today, you have to use the Browse Resources page, go to each tab in turn – platforms, servers, and services – and sort on the availability column. Granted, this is fairly easy to do and doesn’t take all that long, but wouldn’t it be nice to be able to automatically create a group that contained any and all resources that were down in the system?

OK, maybe your initial thought is “why not just use the Problem Resources portlet?” Well, a ‘problem’ resource isn’t necessarily one that is down. If a resource has ANY alerts, or has metrics that are more than 5% outside of their baseline range (a running average calculated over time automatically by RHQ), it will also show up in this portlet. This customer JUST wanted the unavailable resources.

Alright, and maybe your second thought was “well, why not use alerts?” Today, we can fire alerts when a resource goes down, and you CAN use the notification mechanism so that you get an email when this happens. However, there are at least two problems with this strategy:

Problem 1

Alerts are only good at telling you what JUST happened in the system. Alerts will be created as the result of some agent sending data up to the server, such as an availability report or the results of an operation. So, if you already have resources that are down before you set up your alert definitions, you will not be notified because those resources were already down.

Problem 2

Setting up availability alerts across ALL resources in the system will take a while. A lot of time could be saved by using the alert templates feature (Administration > Monitoring Defaults), which makes sure that all existing resources (and any resources imported in the future) automatically have alert definitions created for them. However, you’d still have to set up one template for every single resource type in the system and, depending on how many plugins you have installed, that could be several dozen templates. Also, for each of those alert templates you’d have to set up identical notification rules, which takes more time still.

Interestingly enough, before I could even reply to the customer, they suggested a solution – a feature enhancement, to be precise – which would do the trick. They wanted to extend DynaGroups to be able to aggregate resources by availability.

I was floored by the simplicity of this suggestion. In fact, I recall rubbing my eyes, trying to wake up from a dream, because I found it incredible that the development team hadn’t thought of this before. And I wasted no time creating the issue in JIRA to track this request.

Anyone who knows me probably already guessed I had the fix locally within an hour, but because the request came in during the final seconds before the 1.1 release, I held off on committing it. As soon as SVN was unlocked for 1.2 development, though, it was one of the first commits.

If you’re building off of trunk (or running anything rev 1730 or greater), it’s easy to create a Group Definition that will always keep a DynaGroup populated with the resources that are unavailable:

resource.availability = DOWN

But let’s say you are monitoring a very large inventory, and want to break things down further to keep the groups more granular. For example, let’s say you wanted to create different DynaGroups for each type of resource that’s down. This way you can look at your IIS servers that have failed, independent from your Apache vhosts that aren’t up, separate from your File Systems that aren’t at their expected mount points. That expression set would be as follows:

resource.availability = DOWN
groupby resource.type.plugin
groupby resource.type.name

But maybe that creates too many groups, or gives you results for resource types you aren’t interested in. Let’s say you want to focus your search because you only care about one specific type of resource failing, maybe just your Apache servers. Instead of grouping by the plugin and resource type, specify those pieces of information exactly:

resource.availability = DOWN
resource.type.plugin = Apache
resource.type.name = Apache HTTP Server

Thus, in a roundabout way, resource groups can actually be used as indirect tools for monitoring the health of your platforms, servers, and services.

This, however, just scratches the surface in terms of how groups can be used to monitor your enterprise. One major focus for the 1.2 release of RHQ is going to be on cluster management. Remember, compatible groups serve as a natural way of exposing RHQ subsystems at the group-level. So expect to see lots of new group-level services and UI functionality.

At the time of this writing, the requirements for cluster support are still in their infancy, but we encourage you to read the latest requirements and post your ideas back to the resource clustering thread in the forums.

Written by josephmarques

October 17, 2008 at 7:28 pm

Posted in rhq


The Software Dinner Party


Running a successful project and putting out a successful software product that sells is much like organizing a dinner party. Your guests are going to have a wide variety of personalities and experiences, which will undoubtedly lead to a range of different tastes in music and cuisine.

If you focus solely on your favorite food and entertainment, you’re very likely going to find at least one or two people that don’t enjoy themselves. Instead, you should concentrate mostly on your guests. Learn their likes and dislikes, their preferences, and what their expectations are – then try to satisfy as many of them as possible, while delivering just a bit more than they expected.

Software development can benefit from following a similarly balanced plan: learn the likes and dislikes of your user community, what they want to see in terms of bugs fixed and new features added, and what their prioritizations are as far as the most important things to see get done in the next release cycle – then try to satisfy as many of them as possible, while delivering just a bit more than they expected.

One of the more important parts of becoming a seasoned and well-rounded software developer is not your ability to write code; it’s the ability to recognize that your personal interests and stake in a project / product are not always what’s best for the community at large. This author makes no attempt whatsoever to hide that his interests lie mostly in and around the platform, as opposed to plugin development / refinement, or writing an abundance of documentation, or working on audio-visual demos.

That said, and even though I realize I’m just a small slice of a much larger pie, I try to advocate for what I feel is right for the product on the off-chance that some customer simply forgot to mention it or, more likely, didn’t realize it was something they actually wanted in the first place. At the same time, I still appreciate and respect the dinner party analogy, and the fact that the software I write is primarily to serve others – not myself. A perfect example of this happened just recently.

The 1.1.0 release of RHQ brought with it the long-awaited arrival of the high availability and failover feature set. Despite having taken months and months of coordinated, distributed effort mucking around with some of the lowest-level APIs of the platform, it, in my eyes, only scratched the surface of what still needed to be done: higher scale, better isolation of services, visualization of the agent-server runtime topology, and greater visibility into data flow patterns across the infrastructure.

However, perhaps more than any of the aforementioned, I really wanted to see us simplify the configuration of the communication layer. Today, regardless of which RHQ release you use, you need to have two endpoints exposed: the agent needs to contact the server on some address/port combo, and the server needs to contact the agent on some address/port combo. There’s no technical reason why there needs to be a full, bidirectional link between these two endpoints. The communication, in theory, could be rewritten to open a unidirectional link, and then have responses piggybacked on the open connection.

The telephony industry solved this problem a long time ago. You pick up a phone, dial someone, then you can both talk back and forth across the line even though only one of you initiated the call. It seems rather silly and overly complicated, in retrospect, to even think of doing it any other way. Just imagine how awkward that would have been: I call you and talk on that line, but you have to call on a second line if you want to talk back to me, and we each need to use two hands to hold both our phones up.

With telephones, any person can initiate the call because both people have telephone numbers. Likewise, the unidirectional link between the server and agent could be established at either endpoint, but it makes the most practical sense to have the agent initiate the connection. From a security standpoint, under the assumption that your network is locked down, you’d only have to punch holes for incoming communication on the handful of RHQ servers you have in your infrastructure, as opposed to the hundreds (or perhaps, in the future, thousands) of agents you would have installed if the connection was initiated in the reverse order.
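To make the idea concrete, here is a purely hypothetical sketch (none of these types exist in RHQ today): the agent owns the only connection, and any server-to-agent requests ride back piggybacked on the agent’s own calls.

import java.util.List;

// Conceptual sketch of a unidirectional, agent-initiated channel.
public class AgentCommLoop {

    public interface Report { }
    public interface Command { void execute(); }

    public interface ServerConnection {
        // The agent pushes a report and receives any queued server-to-agent
        // commands piggybacked on the same response.
        List<Command> send(Report report);
    }

    private final ServerConnection connection; // opened by the agent, never by the server

    public AgentCommLoop(ServerConnection connection) {
        this.connection = connection;
    }

    public void exchange(Report report) {
        for (Command command : connection.send(report)) {
            command.execute(); // e.g., run an operation, update a configuration
        }
    }
}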

Unfortunately, adding this feature complicates the semantics required to properly deliver some business services. As it stands today, RHQ has many services that were written under the assumption that every single agent is visible to the server performing the business service workflow. And since any server can initiate that workflow, it really implies that every agent must be reachable from every server. But this runs counter to the unidirectional channel idea, which states that servers can only piggyback messages down to agents that initiated connections to them.

Thus, these business services need to be refactored. However, rewriting each service in an isolated fashion would only create havoc within the code and make things rather difficult to understand and maintain over time. Instead, there needs to be something that can, in a generic fashion, distribute a single business workflow across a range of servers, as dictated by the need to communicate with specific agents.

The solution I’m hinting at is what I’ve termed a fully partitioned services framework – a mechanism by which servers can indirectly communicate with one another when they need to send a request to or get data from an agent that isn’t connected to them. By writing this logic as its own abstract mechanism, the framework can expose itself to programmers via a simple API, and any business service that needs to be partitioned would thus be written in a consistent way. The programmer wouldn’t even have to care how the request is being carried out, just that it delivers on its promise.
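To illustrate, here is a minimal sketch of what such an API might look like (hypothetical names and signatures; at the time of writing, this framework was a proposal, not shipping code):

import java.util.Collection;
import java.util.List;

// Hypothetical API for a fully partitioned services framework.
public interface PartitionedExecutor {

    // A unit of business work that must execute against a particular agent.
    interface AgentTask<T> {
        T run(int agentId);
    }

    // The framework routes each task to whichever server holds the live
    // connection to that agent, runs it there, and gathers the results.
    // Callers never need to know which server did the work.
    <T> List<T> executeAcrossAgents(Collection<Integer> agentIds, AgentTask<T> task);
}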

With these two devices in place – unidirectional communication and full partitioning of business services – the platform architecture simplifies into an easy to understand, easy to visualize, and incredibly easy to paraphrase topology: agents talk to servers, servers talk to the database. It simplifies configuration, and thus makes installation simpler, reduces the maintenance burden (when you want to add more agents), and even makes upgrading easier too.

So why, with all of these benefits, have we not already done this? Well, it goes back to the dinner party. As with most things in life, there needs to be a balance of priorities. If we spent all of our time focusing on the architecture, the platform wouldn’t have lots of different business services. If we spent all of our time focusing on adding new business services, there would be no plugins that took advantage of the new services. If we spent all of our time writing new plugins (or improving existing ones), they might not have a solid base into which to be installed.

Being a platform guy at heart, it sometimes pains me to see that I can’t spend every waking moment improving JUST the platform. However, my role as technical lead constantly reminds me of this balance that must be maintained for the greater overall success of the product. And, when I take all of that together, my personal preferences are a concession easily made…because I want to make my guests happy. Nonetheless, I’m confident there will come a time when it’s right to work on these – it’s just not today.

Written by josephmarques

October 16, 2008 at 12:51 am

Posted in rhq


86 the Syntax!


For developers who know multiple programming languages, who can understand Java generics, who’ve played around with ANTLR, who have read about closures, and who generally automate a lot of their day-to-day work through various scripting languages, picking up a new syntax is a piece of cake. But this is *certainly* not the case for everyone.

If a new feature in a software product, for example, is driven through familiar controls such as drop-down menus, checkboxes, and input text fields, then chances are pretty high that users will figure out how to use it naturally (or eventually, by way of trial and error). However, if you were to provide a text-only interface that required knowing some not-so-obvious syntax, the user has no choice but to turn to the manual, and to continually reference it for details, pointers, guidelines, and grammar.

As in most things, “know your audience” comes into play here too. It is absolutely crucial to take into consideration who your user population is when deciding how to provide interfaces for new software functionality. The wrong interface, no matter how great the feature, could render it absolutely useless to some. And, as experience dictates, first impressions tend to stay with those users even in the face of improved controls in newer versions of the product.

Despite knowing all of this, yours truly (due to time constraints in the development cycle) had to introduce a less than ideal interface to what is otherwise a rock-solid and incredibly powerful feature. Prior to the 1.1.0 release of RHQ, the group definition / dynagroups feature could only be manipulated via a raw syntax.

Not only did this make using the feature cumbersome (even for skilled users), but it made for a rather steep learning curve. I had an inkling of this as development was commencing. I sort of knew it as QA was underway. And I definitively knew it when it came time to write the end-user documentation. Despite that knowledge, there simply wasn’t enough time in the cycle to provide a better interface.

Well, as expected, we took flak for it through our customer support channels. More than once, people had trouble understanding what a group definition was and balked at using the feature. Some of these were customers who had read all of the provided documentation, but still didn’t quite understand how to use it properly. It just goes to show that no matter how great you think a feature is, its interface is crucial to the feature penetrating your user community and becoming a tool that lessens users’ burden, as opposed to another “system” they have to learn.

Luckily, a solid support team coupled with a rather understanding and forgiving customer base helped to eventually smooth things over. The more examples we threw their way, the better they understood what group definitions were trying to accomplish. In fact, two customers eventually liked the feature so much that they started suggesting improvements, so as to make dynagroups even more useful in the future.

So, perhaps we got lucky. Maybe the utility of the feature, in this case, outweighed the annoyance of having to learn its syntax. The feature *was* introduced to automate a large portion of what was once dozens and dozens of hours of manual work. But providing a poor interface just because we can get away with it doesn’t make it right, and it doesn’t give us an excuse to sit on our hands either.

To remedy this, we made providing a newer and better interface for group definitions / dynagroups a priority in the 1.1.0 release of RHQ. As a result, it was shipped with a wizard formally titled the DynaGroup Expression Builder. Now when you create a new group definition (or edit an existing one) you’ll see a little icon at the upper right-hand portion of the input box. Click it. You’ll find it provides all of the familiar controls we should have started with in the first place: drop-down menus, checkboxes, and input text fields. Several non-developers used the interface before it was released, and commented on how much easier it was to understand.

So does this mean, assuming you’ve already gotten used to the raw syntax from pre-1.1.0 versions, that you’ll now need to learn yet another new interface? Nope. We’ve left the old interface there for two reasons: 1) backwards compatibility (for those already familiar with the syntax, as just discussed) and 2) to provide an expert interface. See, we feel that once you use the expression builder enough times, you’re going to learn the syntax naturally by example and, once you know it sufficiently well, you might want a more direct interface instead of being forced to use the wizard every time.

With the newer and better interface firmly in place, there’s no reason NOT to use dynagroups. So try ‘em out, and let us know what you think, either through the customer support portal, a user thread in the forums, a comment on a JIRA, or even (as this author would prefer) a post back to this blog.

* For more information about the Dynagroups feature, please visit the docs.
* RHQ is an extensible, open source management and monitoring infrastructure. For more information see the project site.

Written by josephmarques

October 14, 2008 at 4:39 am

Posted in rhq


Profiling and Code Optimizations


Optimizing code is an art form. What may be seen as beautiful and elegant to some may seem downright wasteful, improvident, or prodigal to others. Aside from the fact that it takes time to optimize code – time that could be spent fixing bugs or working on enhancements to the product – the speedup may not justify the time it took to get there.

So how does one decide whether to optimize and, more importantly, what to optimize? Profiling an application is always a good start. If you attach a profiler to your application, you can see which parts of the code run the slowest and mark those as candidates for improvement. But what if that code is only called once an hour, or once a day? You say you can get a 1000% improvement? Wow, that sounds impressive…until I hear that the slow code only takes 10 seconds to run. Granted, 1 second is a lot faster than 10 seconds, but if that code runs once an hour, you’ve saved 9 seconds out of 3,600 – only a 0.25% improvement; if it runs once a day, that’s only a ~0.01% gain.

OK, so just run the profiler for a long period of time, so that you can collect metrics over a wide range of times and get better averages. This way, you’ll not only know which parts of the code are slow, but you’ll also know how frequently they are called in a regular run of the product.

But what if your application isn’t an application at all? What if it’s a framework – a generic tool or service that is a means to an end, but is not a usable product by itself? Depending on how the platform is leveraged or extended, its runtime profile can look vastly different. In terms of program performance, what may be a perfectly reasonable feature for one user may be so slow it becomes unusable for another.

Granted, the people writing the framework *should* have an idea of how it will be used, and understand a moderate set of the most common scenarios and use cases, but it becomes impractical to respond to every conceivable usage of a product ahead of time. And this is where customers come in.

Recently, a fairly large financial firm approached us with a performance problem. Their RHQ Server was taking a long time to start up, and due to that lag other subsystems started throwing back errors. After some investigation, the problem appeared to be the initial load of data into the alerts engine cache. If the cache didn’t load in a timely fashion, the reports coming up to the server that needed to be checked against the cache data would eventually give up (time out) and complain.

In previous versions of the product (1.0.1 or earlier), the server would attempt to load the entire cache at startup. There had never been a problem with this approach before; up until that point, not a single customer had run into performance problems with cache loading. But this customer had a fairly large environment (~100K resources). And it wasn’t just the large environment that hurt them – it was the fact that they used large numbers of alert templates across many resource types, which created thousands of alert conditions.

Upon initial inspection, the loading code didn’t look all that expensive. There weren’t any unnecessary operations being done. It was a straight load out of the database: find the alert definition, load all conditions for each definition, and load any necessary relational data for each condition. The problem here, as some astute readers may already be guessing, is database round trips.

The problem actually arose from the relative ease and speed with which the data access layer could be written. Since RHQ uses EJB3/Hibernate for a large portion of its data access needs, which makes development rather rapid, it’s easy to introduce inefficiencies if you aren’t careful. And that’s precisely what happened here. The load of the alert definitions was inexpensive, but for each definition the set of alert conditions under it was loaded in a separate query, and for each condition any necessary relational data was loaded in yet more separate queries. This model explodes quickly, turning the load of a few thousand definitions into tens of thousands of queries.

The solution was to refactor the code to load all of the necessary data in one fell swoop (i.e., use more complex queries). It only took about 2 hours to write the new queries, and a few more hours to fully test against regressions, but the results were dramatic. I tested the scenario using an RHQ Server instance that was remote across a 100Mbps line to a Postgres 8.3.x database. The new code took only 1/7 of the time the old code used to, used *significantly* fewer database resources, and decreased the number of round trips from tens of thousands to a few dozen.
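As a sketch of the query-side fix (the entity and association names are illustrative, not the exact RHQ model), a Hibernate/JPQL fetch join pulls each definition and its conditions back in a single round trip instead of one query per definition:

import java.util.List;
import javax.persistence.EntityManager;

public class AlertCacheLoader {

    // Illustrative only: the fetch join collapses the 1 + N condition
    // queries into a single database round trip.
    public List<?> loadDefinitionsWithConditions(EntityManager entityManager) {
        return entityManager
            .createQuery("SELECT DISTINCT def FROM AlertDefinition def "
                       + "LEFT JOIN FETCH def.conditions")
            .getResultList();
    }
}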

We further improved things by loading NO cache data at startup (which, on moderately powerful machines, now results in sub-minute startup times). Instead, data is loaded on an agent-by-agent basis, and *only* when an agent connects to the server. We also added new logic to the communications layer so that an agent will not send any reports up to the server until its corresponding cache data is fully loaded. This way, reports don’t sit there and time out waiting for cache data to load – it is already loaded by the time the report is received by the server.
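A rough sketch of that gating idea (hypothetical types; the real logic lives in RHQ’s communications layer):

import java.util.ArrayDeque;
import java.util.Queue;

// Conceptual sketch: reports queue up until the server's cache is ready.
public class GatedReportSender {

    public interface Report { }

    public interface ServerEndpoint {
        boolean isCacheLoadedFor(int agentId); // server-side readiness check
        void send(Report report);
    }

    private final Queue<Report> pending = new ArrayDeque<Report>();

    public void submit(Report report, ServerEndpoint server, int agentId) {
        pending.add(report);
        if (server.isCacheLoadedFor(agentId)) {
            // Flush everything; nothing times out waiting on the cache.
            while (!pending.isEmpty()) {
                server.send(pending.poll());
            }
        }
    }
}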

All in all, we learned a lot about how the product would be used in a larger enterprise, which played perfectly into the high availability features made available in the 1.1.0 release. And we did it without losing precious development time profiling the platform before we understood how customers with big environments might use it. The upside is that a single customer who wanted to push the 1.0.1 version of the framework to its limits helped us make it better for everyone who will use RHQ in larger environments from here on out.

Written by josephmarques

October 13, 2008 at 10:31 pm

Posted in rhq
