Thoughts on…

Java Middleware & Systems Management

Posts Tagged ‘gwt’

Debugging GWT on Linux

with 2 comments

After waiting with bated breath for what seemed like forever, the GWT Developer Plugin for Chrome on Linux has finally arrived!

Now I can get rid of the custom prebuilts I keep pulling from svn trunk. ; )

Written by josephmarques

December 8, 2010 at 4:44 pm

Posted in gwt

GWT Performance Monitoring

with 12 comments

Profiling a web application for performance bottlenecks requires a balance between focusing on the minutia and taking a step back and looking at the bigger picture. Since nearly all software projects deal with time constraints, a team must wisely choose where to spend its efforts tweaking an application to yield the best experience for its end users.

However, before diving into the code looking for optimizations, there are a few key things that one should understand about the performance profile of an application: Critical Path, Outliers, Variances, and Environment.

Critical Path

In a typical user session, which are the most frequented pages and/or business services? Determining these can help generate a prioritized list of services requiring optimization. For instance, decreasing the running time of a service that is called 50 times during a typical browsing experience is likely to yield significantly more perceived benefit than it would if the service were only called once.

Outliers

It may be the case that one service takes significantly more time to load than any other service on the page. If any of those other services were written in such a way that their execution is contingent on the data returned by the slow service, then the effects of the slow service become magnified. Paying attention to outliers will help one understand whether the perceived performance of one or more pages can be significantly improved by the optimization of just a single service.

Variances

Typically, a service will not execute in the exact same number of milliseconds each time it is invoked; rather, it will execute within some tolerance, e.g. 1100-1300ms 98% of the time. Tracking these variances will help determine whether some service causes random lags in your application that may otherwise be difficult to track down because they occur sporadically. A service that takes 5 times as long to execute in the worst case may indicate the need to rework its implementation.
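As a toy illustration of that last point (this helper is not part of RHQ; the class and method names are made up for the example), a worst-case-to-average check over a list of timings might look like:

```java
import java.util.Collections;
import java.util.List;

// Illustrative helper (not RHQ code): flags a service whose worst-case
// running time exceeds some multiple of its average, e.g. the 5x
// threshold mentioned above.
public class VarianceCheck {
    public static boolean needsRework(List<Long> timingsMillis, double worstToAverageLimit) {
        long worst = Collections.max(timingsMillis);
        long total = 0;
        for (long t : timingsMillis) {
            total += t;
        }
        double average = total / (double) timingsMillis.size();
        return worst > average * worstToAverageLimit;
    }
}
```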

Environment

Oftentimes, programmers are tempted to introduce optimizations because of something they observed on their local box. However, an application’s performance profile in a development environment often differs significantly from its behavior once it’s deployed to production or shipped to a customer. Application load is another important factor to consider; an application with one logged-in user will not necessarily have the same bottleneck(s) as it would with dozens, hundreds, or thousands of logged-in users. Thus, one must consider where an application might be deployed, and use a similar test environment when profiling it for performance issues.


With these principles in mind, I wrote an extension to the GWT client-side RPC mechanism to capture data that will assist in making better informed decisions about which services need optimization. The first thing we have to do is inject a custom ProxyGenerator into the GWT compilation process.

<generate-with class="...TrackingServiceInterfaceProxyGenerator">
    <when-type-assignable class="com.google.gwt.user.client.rpc.RemoteService"/>
</generate-with>    

The implementation for that custom generator is as follows:

package org.rhq.enterprise.gui.coregui.user.rebind.rpc;

import com.google.gwt.core.ext.typeinfo.JClassType;
import com.google.gwt.user.rebind.rpc.ProxyCreator;
import com.google.gwt.user.rebind.rpc.ServiceInterfaceProxyGenerator;

public class TrackingServiceInterfaceProxyGenerator extends ServiceInterfaceProxyGenerator {
    @Override
    protected ProxyCreator createProxyCreator(JClassType remoteService) {
        return new TrackingProxyCreator(remoteService);
    }
}

I sub-classed GWT’s default ServiceInterfaceProxyGenerator implementation so that the only thing I had to do was override the single method that returns the ProxyCreator. In this case, it returns a custom implementation that is responsible for reassigning the superclass of all generated proxies:

package org.rhq.enterprise.gui.coregui.user.rebind.rpc;

import com.google.gwt.core.ext.typeinfo.JClassType;
import com.google.gwt.user.client.rpc.RemoteService;
import com.google.gwt.user.client.rpc.impl.RemoteServiceProxy;
import org.rhq.enterprise.gui.coregui.client.util.rpc.TrackingRemoteServiceProxy;

public class TrackingProxyCreator extends ProxyCreator {
    public TrackingProxyCreator(JClassType type) {
        super(type);
    }

    @Override
    protected Class<? extends RemoteServiceProxy> getProxySupertype() {
        return TrackingRemoteServiceProxy.class;
    }
}

Note: make sure to keep your custom ProxyCreator and ProxyGenerator classes outside of the client source directory, because these classes cannot be cross-compiled to JavaScript. If you forget to do this and attempt to compile anyway, module validation will fail with a message like: “[ERROR] Line XX: No source code is available for type com.google.gwt.user.rebind.rpc.ProxyCreator; did you forget to inherit a required module?”
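For context, a GWT module only cross-compiles packages listed in its <source> elements, so keeping the rebind classes in a package that is not listed there avoids the error above. A sketch of the relevant module descriptor entries (the source path shown is illustrative; RHQ’s actual descriptor may differ):

```xml
<!-- Only packages under <source> paths are translated to JavaScript.      -->
<!-- The rebind package is deliberately absent: it runs only in the        -->
<!-- compiler's JVM during generation, never in the browser.               -->
<source path="client"/>

<generate-with class="org.rhq.enterprise.gui.coregui.user.rebind.rpc.TrackingServiceInterfaceProxyGenerator">
    <when-type-assignable class="com.google.gwt.user.client.rpc.RemoteService"/>
</generate-with>
```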

The TrackingRemoteServiceProxy is responsible for the actual RPC instrumentation. This class will wrap the original asynchronous callback method in one that performs statistics tracking.

package org.rhq.enterprise.gui.coregui.client.util.rpc;

import com.google.gwt.http.client.RequestCallback;
import com.google.gwt.user.client.rpc.AsyncCallback;
import com.google.gwt.user.client.rpc.impl.RemoteServiceProxy;
import com.google.gwt.user.client.rpc.impl.Serializer;
import com.google.gwt.user.client.rpc.impl.RequestCallbackAdapter.ResponseReader;

public class TrackingRemoteServiceProxy extends RemoteServiceProxy {
    ...
    @Override
    protected <T> RequestCallback doCreateRequestCallback(ResponseReader reader,
        String methodName, int invocationCount, AsyncCallback<T> callback) {
        RequestCallback original = 
            super.doCreateRequestCallback(reader, methodName, invocationCount, callback);
        TrackingRequestCallback trackingCallback = 
            new TrackingRequestCallback(methodName, original);
        RPCTracker.getInstance().register(trackingCallback);
        return trackingCallback;
    }

}
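The RPCTracker referenced above is not shown in this post; its job is to keep track of in-flight requests. A minimal, hypothetical sketch of what such a registry might look like (the names and behavior here are assumptions, not the actual RHQ class):

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of an RPC registry (the real RPCTracker is not
// shown in this post): a singleton that records in-flight callbacks,
// which a UI could use to drive a "busy" indicator.
public class RPCTracker {
    private static final RPCTracker INSTANCE = new RPCTracker();
    private final Set<Object> inFlight = new HashSet<Object>();

    public static RPCTracker getInstance() {
        return INSTANCE;
    }

    // called when a tracked request is created
    public void register(Object callback) {
        inFlight.add(callback);
    }

    // called when a response or error is received
    public void complete(Object callback) {
        inFlight.remove(callback);
    }

    public int getInFlightCount() {
        return inFlight.size();
    }
}
```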

The implementation of the TrackingRequestCallback is fairly straightforward – all it has to do is notify the class responsible for statistics tracking each time a response is received for any service, whether that invocation was successful or returned an error:

package org.rhq.enterprise.gui.coregui.client.util.rpc;

import com.google.gwt.http.client.Request;
import com.google.gwt.http.client.RequestCallback;
import com.google.gwt.http.client.Response;

public class TrackingRequestCallback implements RequestCallback {

    private String name;
    private long start = System.currentTimeMillis();
    private RequestCallback callback;

    public TrackingRequestCallback(String name, RequestCallback callback) {
        this.name = name;
        this.callback = callback;
    }

    public void onError(Request request, Throwable exception) {
        RemoteServiceStatistics.record(getName(), getAge());
        callback.onError(request, exception);

    }

    public void onResponseReceived(Request request, Response response) {
        RemoteServiceStatistics.record(getName(), getAge());
        callback.onResponseReceived(request, response);
    }

    public String getName() {
        return name;
    }

    public long getAge() {
        return System.currentTimeMillis() - start;
    }
}

Finally we have the RemoteServiceStatistics singleton, which is responsible for keeping track of all invocations for all services across the application. When the statistics are queried, it can use this full audit trail of invocations to generate things like min/max/avg running times and standard deviations:

package org.rhq.enterprise.gui.coregui.client.util.rpc;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.rhq.enterprise.gui.coregui.client.util.rpc.RemoteServiceStatistics.Record.Summary;

public class RemoteServiceStatistics {

    public static class Record {
        private String serviceName;
        private String methodName;
        private List<Long> data = new ArrayList<Long>();

        private Record(String remoteService, long millis) {
            // remoteService format "{ServiceName}_Proxy.{MethodName}"
            this.serviceName = remoteService.substring(0, remoteService.indexOf("_Proxy"));
            this.methodName = remoteService.substring(remoteService.indexOf('.') + 1);
            this.data.add(millis);
        }

        private void record(long millis) {
            this.data.add(millis);
        }

        public Summary getSummary() {
            return new Summary(serviceName, methodName, data);
        }

        public static class Summary {
            public final String serviceName;
            public final String methodName;
            public final int count;
            public final long slowest;
            public final long average;
            public final long fastest;
            public final long stddev;

            private Summary(String serviceName, String methodName, List<Long> data) {
                this.serviceName = serviceName;
                this.methodName = methodName;
                this.count = data.size();

                if (!data.isEmpty()) {
                    int i = 0;
                    long total = 0, slow = 0, fast = 0;
                    for (long next : data) {
                        if (i++ == 0) {
                            total = slow = fast = next;
                        } else {
                            total += next;
                            if (next > slow) {
                                slow = next;
                            } else if (next < fast) {
                                fast = next;
                            }
                        }
                    }
                    slowest = slow;
                    fastest = fast;
                    double avg = (total) / (double) count;

                    double sumOfSquares = 0;
                    for (long next : data) {
                        sumOfSquares += Math.pow(next - avg, 2);
                    }
                    stddev = (long) Math.pow(sumOfSquares / count, 0.5);
                    average = (long) avg;
                } else {
                    slowest = average = fastest = stddev = 0;
                }
            }
        }

    }

    private static Map<String, Record> statistics = new HashMap<String, Record>();

    private RemoteServiceStatistics() {
        // static access only
    }

    public static void record(String remoteService, long millis) {
        Record record = statistics.get(remoteService);
        if (record == null) {
            record = new Record(remoteService, millis);
            statistics.put(remoteService, record);
        } else {
            record.record(millis);
        }
    }

    public static Summary get(String remoteService) {
        Record stat = statistics.get(remoteService);
        if (stat == null) {
            stat = new Record(remoteService, 0);
        }
        return stat.getSummary();
    }

    public static List<Summary> getAll() {
        List<Summary> stats = new ArrayList<Summary>();

        for (String remoteService : statistics.keySet()) {
            stats.add(get(remoteService));
        }

        return stats;
    }
}
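To sanity-check the summary math in isolation, here is a framework-free restatement of the same min/max/average/standard-deviation computation (population standard deviation, i.e. dividing by count, matching the code above):

```java
import java.util.List;

// Framework-free restatement of Summary's math, for sanity checking:
// slowest, fastest, mean, and population standard deviation over timings.
public class TimingSummary {
    public final long slowest;
    public final long fastest;
    public final long average;
    public final long stddev;

    public TimingSummary(List<Long> data) {
        long slow = Long.MIN_VALUE, fast = Long.MAX_VALUE, total = 0;
        for (long next : data) {
            total += next;
            if (next > slow) slow = next;
            if (next < fast) fast = next;
        }
        slowest = slow;
        fastest = fast;
        double avg = total / (double) data.size();
        double sumOfSquares = 0;
        for (long next : data) {
            sumOfSquares += (next - avg) * (next - avg);
        }
        average = (long) avg;
        stddev = (long) Math.sqrt(sumOfSquares / data.size());
    }
}
```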

I wrote a quick test page in the user interface to expose the statistics data. [Screenshot: cropped summary table of per-service invocation statistics]

Putting the statistics into a table with sortable columns makes analyzing the data much simpler. Below is a quick list of the indicators that can be derived from the data set above:

  • ResourceBossGWTService.getInventorySummaryForLoggedInUser – a high average with low variance implies the service is consistently slow; look for possible improvements such as query rewrites, database indexes, or pre-computation.
  • AlertGWTService.findAlertCountByCriteria/findAlertsByCriteria – high variance; look for either database table contention that could lead to periodic query slowness, or eagerly/conditionally loaded data that is not necessary to render the view.
  • TagGWTService.findTagReportComositesByCriteria – since this is on the critical path (invoked every time a user looks at a resource, group, bundle, etc.), make sure page rendering isn’t contingent on this data; in other words, it should be OK to lazy load this data onto the view.

Once you have a prioritized list of services that might require optimization, open up Speed Tracer to see just how these network figures relate to pure client-side figures such as parsing, layout, style calculations, timer firings, events, etc.

It may be the case that most of the perceived end-user time is spent taking the RPC results and rendering the DOM, not in the network request itself. If that’s the case, then Speed Tracer will very likely help you to pinpoint the exact client-side issue(s) you’re dealing with. On the other hand, if you determine that the bottleneck is on the server, you can follow the steps in my other blog entry on Hibernate Performance Monitoring to help get to the bottom of those issues.

Keep in mind, this solution is only meant to be a starting point. It doesn’t tell you precisely why the services are slow, but it does give you an indication of which are the worst offenders and how frequently they are invoked. One thing I’ll probably implement in the near future is response payload tracking. Tracking this data and correlating it with the rest of the remote service statistics would help distinguish between network slowness caused by slow queries on the server versus too much data being returned in the response to the client. I encourage you to experiment with this solution, and to expose other statistics that you feel are useful in getting a quick overview of where your GWT-based application’s performance bottlenecks may lie.

Written by josephmarques

November 29, 2010 at 5:18 pm

Posted in gwt

GWT Compilation Performance

with 2 comments

A few weeks ago I noticed that the coregui module of RHQ, the next-generation web interface written in GWT (SmartGWT, to be precise), had started to compile noticeably more slowly. Not only did the overall time to compile the module increase, but during the build my box seemed to more or less lock up; even mouse gestures were choppy. So…I decided to investigate.

I started by writing a script that would compile the coregui module for me. It was important to gauge how the different arguments to the maven gwt:compile goal (http://mojo.codehaus.org/gwt-maven-plugin/compile-mojo.html) would affect the system during the build. After reading through the documentation, I decided that ‘localWorkers’ and ‘extraJvmArgs’ were the two most important parameters to test different permutations for. Here is the script I came up with:


#!/bin/sh

# INTERVAL - the minimum amount of memory to use for gwt:compile
# STEPS    - the number of increments of interval
#            e.g. 4 steps of 256 interval = tests of 256, 512, 768, 1024
#            e.g. 5 steps of 200 interval = tests of 200, 400, 600, 800, 1000
# RUNS     - number of iterations at each (INTERVAL,STEP)

INTERVAL=256
STEPS=4
RUNS=10

declare -a AVERAGES
declare -a VARIANCES

function build
{
    workers=$1
    memory=$2

    dirName="workers-${workers}-memory-${memory}"
    vmArgs="-Xms${memory}M -Xmx${memory}M"

    mvnArgs="-Pdev -DskipTests -o"
    gwtWorkers="-Dgwt-plugin.localWorkers=\"$workers\"" 
    gwtJvmArgs="-Dgwt-plugin.extraJvmArgs=\"$vmArgs\""
    gwtArgs="$gwtWorkers $gwtJvmArgs"

    cmd="mvn $mvnArgs $gwtArgs gwt:clean install"
    echo "Executing $cmd"

    total_seconds=0
    rm -rf "target/runs/$dirName"
    mkdir -p "target/runs/$dirName"
    declare -a raw
    for run in `seq $RUNS`
    do
        outputFile="target/runs/$dirName/output.${run}.log"

        before=$(date +%s)
        eval "$cmd" > $outputFile
        after=$(date +%s)

        elapsed_seconds=$(($after - $before))
        raw[$run]=$elapsed_seconds
        total_seconds=$(($total_seconds + $elapsed_seconds))

        echo "Run $run Took $elapsed_seconds seconds"
        echo "Execution details written to $outputFile"
    done 

    average_seconds=$(expr $total_seconds / $RUNS)

    let "index = (($workers - 1) * $STEPS) + ($memory / $INTERVAL) - 1"
    AVERAGES[$index]=$average_seconds

    sum_of_square_deltas=0
    for run in `seq $RUNS`
    do
       let "sum_of_square_deltas += (${raw[$run]} - $average_seconds)**2"
    done
    let "variance = $sum_of_square_deltas / ($RUNS - 1)"
    VARIANCES[$index]=$variance

    echo "Run Total: $total_seconds seconds"
    echo "Run Average: $average_seconds seconds"
    echo "Run Variance: $variance"
}

function run
{
    for workers in `seq 4`
    do
        for offset in `seq $STEPS`
        do
            memory=$(($INTERVAL * $offset))
            build $workers $memory
        done
    done
}

function print
{
    echo "Results"
    printf "          "
    for headerIndex in `seq 4`
    do 
        printf "% 12d" $headerIndex
    done
    echo ""

    for offset in `seq $STEPS`
    do
        let "memory = $offset * $INTERVAL"
        printf "% 10d" $memory
        for workers in `seq 4`
        do
            let "index = ($workers - 1) * $STEPS + ($offset - 1)"
            printf "% 6d" ${AVERAGES[index]}
            printf "% s" "("
            printf "% 4d" ${VARIANCES[index]}
            printf "% s" ")"
        done
        echo ""
    done
}

if [[ $# -ne 1 ]]
then
    echo "Usage: $0 <path-to-rhq-repository>"
    exit 127
fi

rhq_repository_path=$1
cd $rhq_repository_path/modules/enterprise/gui/coregui

echo "Performing timings..."
run
print
echo "Performance capture complete"


In short, this will build the coregui module over and over again, passing different parameters to it for memory (JVM arguments Xms/Xmx) and threads (GWT compiler argument ‘localWorkers’). And here are the results:

memory \ localWorkers          1            2            3            4
   256MB              229 (6008)      124 (4)     124 (27)     124 (12)
   512MB               141 (193)     111 (76)     113 (82)     114 (25)
   768MB              201 (5154)     115 (98)     123 (57)   195 (2317)
  1024MB              200 (2352)     125 (83)    199 (499)    270 (298)

The columns refer to the number of ‘localWorkers’, effectively the number of concurrent compilation jobs (one for each browser/locale). The rows refer to the amount of Xms and Xmx given to each job. Each row represents statistics for the build being executed ten times using the corresponding number of localWorkers and memory parameters. The primary number represents the average time in seconds it took to compile the entire maven module. The parenthetical element represents the variance (the square of the standard deviation).
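For reference, the parenthetical variance is the unbiased sample variance the script computes over the RUNS iterations:

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
\quad\text{where } \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,\; n = \mathrm{RUNS}
```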

So what does this grid tell us? Actually, a lot:

  • End-to-end compilation time suffers when using a single localWorker (i.e., only one permutation of browser/locale is compiled at a time). Not only does it suffer, but it has a high variance, meaning that sometimes the build is fast, and sometimes it isn’t. This makes sense because the compilation processes are relatively independent and generally don’t contend for shared resources. This implies the compilation is a natural candidate for a concurrent/threaded solution, and thus forcing serialized semantics can only slow it down.
  • The variance for 2 or more localWorkers is generally low, except in the lower right-hand portion of the grid, where it starts to increase again. This also makes sense: with 4 threads at 1GB of memory each, the concurrent jobs exhaust the available system memory. This box only had 4GB of RAM, so all of the physical memory was used, spilling over into swap, which in turn made the disk thrash (verified using ‘free’ and ‘iostat’). Whatever other activities were occurring on the box at the time, and however the 4 threads were interleaved, the build suffered because approximately ½ GB was being swapped to my 7200rpm disk during the compilation. Granted, I could mitigate this variance by replacing my spindles with an SSD drive (which I may end up doing), but it is nonetheless important to be mindful of the amount of memory given to the entire compilation process (all worker threads) relative to the amount of available physical RAM.
  • The 512MB row has values which are relative minimums with respect to the other timings for the same number of localWorkers given different memory parameters. This indicates that 512 is the sweet spot in terms of how much memory is needed to compile the modules. I would surmise that the cost of allocating more memory (256MB or 512MB for each of 2, 3, or 4 worker threads) and/or the cost of GC largely accounts for the slightly decreased performance with other memory settings. And then, as mentioned above, with lots of threads and very high memory, swapping to disk starts to dominate the performance bottleneck.
  • Aside from end-to-end running time, it’s also important on a developer system to be able to continue working while things are building in the background. It should be noted that each worker thread was pegging the core it ran on. For that reason, I’m going to avoid using 3 or 4 workers on my quad-core box because my entire system crawls due to all cores being pegged simultaneously…which is exacerbated in the case of high total memory with 3 or 4 workers causing swapping and disk thrashing alongside all my cores being pegged.

Conclusions:

On my quad-core box with 4 gigs of RAM, the ideal combination is 2 localWorkers with 512MB. The end-to-end build will consistently (recall the low variance) complete just as quickly as any other combination (recall the low average running time), and it won’t peg my box, because only half of my cores will be used for the GWT compilation, leaving the other half for the other processes I have running: Eclipse, Thunderbird, Chrome, etc.

So what happens now if I take and move this script over to a larger box? This new box is also a quad core, but has a faster processor, and more than twice the amount of RAM (9 gigs). From everything deduced thus far, can you make an intelligent guess as to what might happen?

….

I anticipated that, with more than twice the RAM, the worst-case permutation (4 localWorkers with 1GB of Xms/Xmx each) would cause little to no swapping, and that with a faster processor the average running times would come down. And that is exactly what happened:

memory \ localWorkers          1            2            3            4
   256MB                153 (24)       98 (0)       96 (1)       95 (1)
   512MB                150 (23)       96 (1)       96 (2)       95 (1)
   768MB                149 (32)       97 (1)       96 (1)       95 (0)
  1024MB                149 (24)       96 (0)       96 (0)       95 (1)

With nearly non-existent variance, it’s easy to see the net effect of having plenty of physical resources. When the processes never have to go to swap, they can execute everything within physical RAM, and they have much more consistent performance profiles.

As you can see, adding more cores does not necessarily yield large decreases for the end-to-end compilation time. This can be explained because (at the time of this writing) RHQ is only compiling 6 different browser/locale permutations. As more localizations are added in the future, the number of permutations will naturally increase, which will make the effect of using more cores that much more dramatic (in terms of decreasing end-to-end compilation times).

Unfortunately, the faster processor on the larger box still got pegged when compiling coregui. So, since 3 or 4 localWorkers doesn’t result in dramatically improved end-to-end compilation times, it’s still best to use 2 localWorkers on this larger box. I could, however, get away with dropping the memory requirement to 256MB for each worker, since the faster hardware seems to even out the performance profile across different memory parameters, holding localWorkers constant.

Lessons learned:

  • In any moderately sized GWT-based web application, the compilation may take a considerable portion of each core it executes on, perhaps even pegging them at 100% for a portion of the compile. Thus, if you want your development box to remain interactive and responsive during the compilation, make sure to set the localWorkers parameter to something less than the number of cores in your system.
  • Don’t throw oodles of memory at the GWT compilation process and expect a speedup. Too much Xms/Xmx will cause the worker threads to spill out of main memory and into swap, which can have a detrimental effect on the compilation process itself as well as other tasks on the box, since I/O requests explode during that time. Modify the script presented here to work with your build, and obtain reasonable timings for your GWT-based web application at different memory levels. Then use only as much RAM as is required to keep the end-to-end time reasonable while avoiding swap.
  • Provide reasonably conservative default values for localWorkers and Xms/Xmx so as not to give a negative impression to your project’s contributors. Err on the side of a slow build rather than a build that crushes the machine it runs on.
  • Parameterize your builds so that each developer can easily override (in a local configuration file) his or her preferred values for localWorkers and Xms/Xmx. Provide a modified version of the script outlined here, to allow developers to determine the ideal parameter values on each of their development boxes.
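For example, with the gwt-maven-plugin linked above, both knobs can be wired to Maven properties that developers override locally. The property names below match the ones used in the timing script; your build may name them differently:

```xml
<!-- Conservative defaults; override per-developer, e.g.            -->
<!--   mvn -Dgwt-plugin.localWorkers=4 ... install                  -->
<properties>
    <gwt-plugin.localWorkers>2</gwt-plugin.localWorkers>
    <gwt-plugin.extraJvmArgs>-Xms512M -Xmx512M</gwt-plugin.extraJvmArgs>
</properties>

<!-- ...and in the plugin configuration: -->
<plugin>
    <groupId>org.codehaus.mojo</groupId>
    <artifactId>gwt-maven-plugin</artifactId>
    <configuration>
        <localWorkers>${gwt-plugin.localWorkers}</localWorkers>
        <extraJvmArgs>${gwt-plugin.extraJvmArgs}</extraJvmArgs>
    </configuration>
</plugin>
```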

Other tips / tricks to keep in mind:

  • Only compile your GWT application for the browser/locale permutation you’re currently developing and testing on. There’s no need to compile to other languages/locales if you’re only going to be testing one of them at a time. And there’s no need to compile the applications for all browser flavors if the majority of early testing is only going to be performed against one of them.
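In practice this is done in the module descriptor, for example by pinning the compile to a single browser permutation (the value shown is illustrative; GWT 2.x user.agent values include ie6, ie8, gecko1_8, safari, and opera):

```xml
<!-- Development-only override: compile a single browser permutation. -->
<set-property name="user.agent" value="gecko1_8"/>
<!-- If the module defines extra locales, pinning one similarly       -->
<!-- reduces the number of permutations.                              -->
```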

Recall that at the start of this article I was investigating why my box seemed non-responsive during the compilation of coregui. It turns out that a developer had inadvertently committed very high memory settings, which, as you’ve now seen, can cause large amounts of swapping when localWorkers is also high. The fix was to lower the default localWorkers to 2 and reduce the memory defaults to 512MB. As suggested above, both values are parameterized in the current RHQ build, so developers can easily override either or both to their liking.

Written by josephmarques

July 30, 2010 at 1:42 am

Posted in gwt, java
