Archive for October 2008
For a majority of the past year the RHQ team has worked diligently to deliver a stable, extensible, and fault tolerant infrastructure for managing and monitoring your enterprise. Most of the focus has been on providing services at the individual resource level – a singular Apache install, a sole IIS instance, a solitary JBoss Application Server. Nowadays, however, the name of the game is clustering. Redundant servers, clustered services, data replication, cloud computing – all have a slightly different role in the game.
There are plenty of companies out there trying to get in on the action, who have high hopes of becoming formidable players. In a similar vein, the next release of RHQ promises to make its platform technology a force to reckon with when it comes to managing and monitoring clustered resources. There are several key facets to this (in no particular order):
• aggregate and average views of metric / measurement data
• operational scheduling and execution
• fine-grained configuration control
The first two of these are actually partially implemented. Today, if you create a compatible group (a resource group that only contains a single type of resource, e.g. only JBossAS instances) you can schedule and execute operations against the members in the group, in rolling fashion or simultaneously against all of them, as well as view aggregate and average metrics.
The shortcoming here, however, is that you need to explicitly add resources to a compatible group before you can perform these aggregate business services across them. Manually adding resources to the group would be absurd for anything but the smallest inventories. This is largely why DynaGroups exist. They were created with the sole purpose of being able to create vast numbers of groups according to flexible rules for how to partition your resources.
So why isn’t this good enough? It’s because DynaGroups can not create hierarchies of grouped resources, such as a group of HTTP Connectors under a group of Embedded Tomcat Servers under a group of JBoss Application Servers. And this group-wise, hierarchical navigation is where a lot of the power will be derived from.
If you could aggregate your JBossAS instances into a cluster group, and then have logically clustered servers and services automatically grouped beneath that, you could navigate the resource hierarchy very quickly and efficiently. Instead of having to view data from multiple contexts, a “cluster view” would aggregate the data into a single, navigable tree structure – a single context. You wouldn’t be bouncing back and forth between different pages; everything would be at your fingertips from a single, landing page.
A secondary benefit of building cluster views – instead of explicitly using compatible groups – is that you won’t clog up your resource browser with thousands of groups you might use only rarely, if ever.
Think about this: let’s say we have 60 physical hosts, 5 JBossAS instances on each of them, 3 instances to a logical cluster, which creates 100 (60*5/3) cluster groups. Let’s further say that on each instance we have 250 unique business services (this includes enterprise applications, session & entity beans, connectors, virtual hosts, datasources, queues / topics, etc). If we were to create explicit compatible groups for all nested servers and services under each of these 100 cluster groups, it would be another 25K (250*100) groups.
The cost of group creation is not the worry here; it’s the sheer volume of groups that would be shown in the application’s user interface. If you had a handful of compatible groups that you were previously using to manage your inventory, say a dozen or two, it would be much more difficult and frustrating to sift through one- to two-thousand times as much data. Granted, a very nice search interface could mitigate the situation, but it wouldn’t eliminate the underlying problem of unnecessary group creation.
So the solution migrates back to our cluster view concept. The team has come together several times over the past 2 weeks to flesh out the details, and the current functional and user interface requirements have been kept updated on the RHQ Project site here. If you have ideas above and beyond what you see there, we encourage you to speak up and let us know how we can make the new features and product improvements around resource clustering even better. If you want to be part of the action and you’re interested in becoming a contributor to the RHQ Project, look for my handle – joseph42 – in #rhq on freenode.
As always, post backs are greatly appreciated.
I played a little word association game with my team today. I asked them the following,
With respect to RHQ, when you hear the word ‘configuration’, what is the first word that comes to mind?
Though the answers were varied – resource, edit, product, settings, and properties – they were all perfectly in line with my suspicions. All were either synonyms of configuration, or actions you would take against something that is configurable.
So what’s wrong with that? Well, RHQ is a platform that, in a nutshell, performs systems management AND monitoring. However, these results show that people initially, predominantly, and perhaps only think of ‘management’ when they hear ‘configuration’.
Maybe the question wasn’t fair. I DID ask them to only give me one-word responses. Maybe they would’ve also mentioned monitoring had they had the ability to answer less pointedly.
Or…maybe not. More than half of the responses I got were NOT one-word answers. They were short phrases, or even multiple full sentences explaining why certain words came to mind. Granted, there is a chance the results could be skewed, because the mailing list I asked this question on has only a few dozen people subscribed to it; but it’s enough evidence to show me that the common association is management, not monitoring.
The next version of RHQ will close this gap and bring to the fore monitoring capabilities around configuration. This solution is actually twofold:
1) Detecting agent-side configuration changes
The RHQ agent, since it already knows how to discover the configuration for some managed product on demand, simply has to keep a record of what the last known configuration was. After that, it would need a mechanism to periodically scan for the current configuration, and test whether or not the last known was different than the current. If it was, the RHQ agent would send the results up to the server, which would persist it as the new configuration for that resource and at the same time add an element to the configuration audit trail, so that administrators can see what changed over time.
This development work was committed last week, and the QA for it (RHQ-988) can be tracked in JIRA here.
2) Alerting against changed configurations
This logic is completely server-side, and deals with the ability to set up alert definitions against resources that support configuration. As a consequence of doing this, alert templates will be able to create “monitors” across a large segment of the inventory quickly, thus making it easy to receive notifications when any managed resource in your enterprise has its configuration changed external to the RHQ infrastructure.
This feature has been on the docket for nearly 6 months, but depended on configuration change detection (see above) to be written first. So, once I saw the bits for RHQ-988 in SVN, I wasted no time implementing RHQ-342. That development work was completed this past weekend, and the QA can be tracked here.
So what does this all mean? Well, it means that any plugins that have written configuration support for the resources they define can now be managed AND monitored.
Below is a short list of the various different configurations provided by the base plugins found in the RHQ project:
* APT repository locations
* GRUB kernel entries
* hosts file mapping of IP to canonical names
* SSHD settings, advanced configuration, and X11 properties
* PostgreSQL configuration files as well as runtime properties; database user settings, passwords, and privileges; and table schemas
* RHQ agent configurations
At the time of writing, this author knows of at least one other project that builds extensions to the RHQ platform, and it is called Jopr. Its primary focus is to provide plugins for JBoss Application Server and related services. Simply by dropping the Jopr plugins into your RHQ distribution, you would extend the configuration monitoring capabilities to the following items:
* datasource configuration and advanced settings
* connection factory properties
* JMS queue & topic information
Configuration monitoring just scratches the surface of some of the feature enhancements targeted at the 1.2.0 release of RHQ, but it does bring the platform one step closer to being a complete, end-to-end management and monitoring solution for your enterprise.
If you’re interested in helping to improve the base platform, have ideas for new plugins or extensions to existing plugins, or just want to be closer to the action, please visit the development team in #rhq on irc.freenode.net – my handle is joseph42.
When you first download and install RHQ, you’ll log in to the web console and notice that there are two different types of grouping constructs for resources – mixed and compatible. In short, compatible groups must contain the same types of resources, whereas mixed groups do not. Under the covers, these are implemented by the exact same construct, but how meaning has been applied to them, and what you can do with each of them, is why this blog got the title it did.
Mixed groups are predominantly used for security, in particular, authorization. With them you can put all sorts of resources together – Windows and Linux platforms, IIS and Apache servers, etc. Then, you can attach that mixed group to a role, and any users in that role will be able to see those resources.
If you want to be able to give someone access to an entire box, then create a mixed group with the “recursive” option enabled. By turning that option on, any resource you add to the group automatically adds all descendant resources to the group as well. For instance, if you add a platform, it will indirectly add all servers under that platform, as well as all services under all of those servers, and so on.
While mixed groups have one thing they’re good at, compatible groups have an array of functionality they excel at providing. First and foremost is their “compatibility” with all of the other subsystems RHQ provides: monitoring, configuration, operations, etc.
For monitoring, RHQ shows aggregate and average metrics across the group members. For configuration, RHQ enables you to change the configured connection properties across everybody in the group at the same time. For operations, RHQ allows you to execute the same operation against all resources in the group – at the same time, or serially (one after the other, in rolling fashion).
Very recently, a customer pointed out to me how groups – mixed and compatible – can be used in a novel way. Their question was simple: what’s the easiest method to see all of the resources in their environment that are down?
In order to do this today, you have to use the Browse Resources page, go to the each tab in turn – platforms, servers, and services – and sort on the availability column. Granted, this is fairly easy to do and doesn’t take all that long, but wouldn’t it be nice to be able to automatically create a group that contained any and all resources that were down in the system.
OK, maybe you’re initial thought is “why not just use the Problem Resources portlet?” Well, a ‘problem’ resource isn’t necessarily one that is down. If you have ANY alerts, or if you have metrics that are more than 5% outside of their baseline range (a running average calculated over time automatically by RHQ), the resource will also show up in this portlet. This customer JUST wanted the unavailable resources.
Alright, and maybe your second thought was “well, why not use alerts?” Today, we can fire alerts when a resource goes down, and you CAN use the notification mechanism so that you get an email when this happens. However, there are at least two problems with this strategy:
Alerts are only good at telling you what JUST happened in the system. Alerts will be created as the result of some agent sending data up to the server, such as an availability report or the results of an operation. So, if you already have resources that are down before you set up your alert definitions, you will not be notified because those resources were already down.
Setting up availability alerts across ALL resources in the system will take a while. A lot of time could be saved by using the alert templates feature (Administration > Monitoring Defaults), which would make sure that all existing resources (and any resources that are imported in the future) automatically have alert definitions created for them. However, you’d still have to set up one template across every single resource type in the system, and so depending on how many plugins you have installed could be several dozen templates to create. Also, for each of those alert templates, you’d have to setup identical notification rules too, which takes more time still.
Interestingly enough, before I could even reply to the customer, they suggested a solution – a feature enhancement, to be precise – which would do the trick. They wanted to extend DynaGroups to be able to aggregate resources by availability.
I was floored by the simplicity of this suggestion. In fact, I sort of recall rubbing my eyes looking to wake up from a dream, because I thought it was so incredible that the development team hadn’t thought of this before. And I wasted no time creating the issue in JIRA to track this request.
Anyone that knows me probably already guessed I had the fix locally within an hour, but because the request came in during the final seconds just before the 1.1 release I held off on committing it. Though, as soon as SVN was unlocked for 1.2 development it was one of the first commits.
If you’re building off of trunk (or running anything rev1730 or greater), it’s easy to create a Group Definition that will always keep a DynaGroup populated with the resources that are unavailable.
resource.availability = DOWN
But let’s say you are monitoring a very large inventory, and want to break things down further to keep the groups more granular. For example, let’s say you wanted to create different DynaGroups for each type of resource that’s down. This way you can look at your IIS servers that have failed, independent from your Apache vhosts that aren’t up, separate from your File Systems that aren’t at their expected mount points. That expression set would be as follows:
resource.availability = DOWN
But maybe that creates too many groups, or gives you results for resource types you aren’t interested in. Let’s say you want to focus your search because you only care about one specific type of resource failing, maybe just your Apache servers. Instead of grouping by the plugin and resource type, specify those pieces of information exactly:
resource.availability = DOWN
resource.type.plugin = Apache
resource.type.name = Apache HTTP Server
Thus, in a roundabout way, resources groups can actually be used as indirect tools for monitoring the health of your platforms, servers, and services.
This, however, just scratches the surface in terms of how groups can be used to monitor your enterprise. One major focus for the 1.2 release of RHQ is going to be on cluster management. Remember, compatible groups serve as a natural way of exposing RHQ subsystems at the group-level. So expect to see lots of new group-level services and UI functionality.
At the time of this writing, the requirements for cluster support were in their infancy, but we encourage you to read the latest requirements and post your ideas back to the resource clustering thread in the forums.