Apex cold starts and class caching misses

In the last couple of posts I have looked at class deploy times and class recompile times. To round this off, it's time to look at class caching on Salesforce.

You have almost certainly seen the impact of class caching at some point, or rather of cache misses, when some page takes an unusually long time to load the first time you try to use it. At work we generally talk of these events as 'cold starts', a play on a cold cache. However, they can also be caused by invalid class recompilation, so to avoid any confusion in this post I am only going to look at cold starts caused by a cold cache. None of the tests in this post involve invalid class recompilation.

What do they look like?

It turns out cold starts are actually quite hard to capture. I had to run quite a lot of tests to get a few samples, nine days' worth of tests in fact:

This is showing the execution time for an anonymous Apex call measured every hour over those nine days. The spikes are what we are looking for: calls that take significantly longer than the average. There is no real pattern to when they occurred; some were in the middle of the night, others during work hours. Some clustered close together, others appeared in isolation.

I am using anonymous Apex just because it was easy to run from my test harness. Customers normally see this type of behaviour when trying to access pages, but the real cause is somewhere in the class cache system: this org was not being used by anyone while the test was running, so no classes could have been invalidated. A number of these cold starts may have been caused by maintenance work on the instance. There is really no way to tell the root cause, but the impact is obvious.
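For reference, the timed probe was shaped roughly like this (a minimal sketch of my own, not the exact harness code; the real harness issues the call via the Tooling API, and Node0 stands in for the root class of the tree described in the next section):

```apex
// Hourly anonymous Apex probe: time a call into the root of the class tree,
// which forces every class in the tree to be loaded (from cache or otherwise).
Long started = System.currentTimeMillis();
new Node0().callChildren();
System.debug('Elapsed: ' + (System.currentTimeMillis() - started) + ' ms');
```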

If you have watched the new compiler talk you might recall that there are two caches used for Apex classes: a level 1 cache on each application server and a level 2 cache shared between application servers. It's not clear from this data which cache is causing these spikes. I would like to understand that, but pragmatically it does not really matter to our customers; if they see this kind of behaviour often enough they are going to start questioning our ability to write good software.

Parallel Tests

The test results above are from a binary tree of classes, with a call to a method on the root node being timed. This method calls the same method on its two child node classes, which in turn call the same method on their children. This of course requires every class in the tree to be loaded for the call to complete. For this test I used a tree with 2048 classes and added 1 kB of unrelated Apex code to each.
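Concretely, each generated node class looked something like this sketch (names and padding are illustrative, the real classes were generated):

```apex
// One generated node in the binary tree. Node1 and Node2 are its children;
// leaf nodes simply return. Calling callChildren() on the root forces the
// whole tree of classes to load.
public class Node0 {
    public void callChildren() {
        new Node1().callChildren();
        new Node2().callChildren();
    }

    // ~1 kB of unrelated Apex, referencing only platform types, as padding.
    private void padding() {
        String s = '';
        for (Integer i = 0; i < 10; i++) {
            s += String.valueOf(Math.abs(i));
        }
    }
}
```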

As I knew this was going to take some time to run, I ran the same tests on four other trees of different sizes at the same time so we could compare the impact. Each of these trees has the same total amount of code spread over its classes, so that any costs due to overall code size could be ruled out. Looking at one day we get this:

Here we have a couple of cold start issues early in the morning. It looked very much like the primary cost was related to the number of classes, so I used the data to calculate a cost per class when a cold start happens, giving this:

What I think this is telling us is that the cache miss cost is mostly a factor of the number of classes, but bigger classes do have some impact: part of the cost is proportional to the amount of code you put in the class, which is consistent with the description of 'inflation' from the new compiler talk.

Should I worry about this?

This is a hard question to answer. The pages & triggers of your product need to pull in quite a lot of classes for a cold start to be significant, so this is not really a problem for smaller products, but it also depends on how tolerant your customers are of response times.

What you should be wary of, if you already have or suspect you will have response time concerns, is how you architect and design your product. Using lots of small classes will help your deploy times but could also increase the impact of cold starts. What I can't really shed much light on is whether the same sort of patterns are common across instances, or whether the incidence varies much over the medium to long run of months; this is just a snapshot of nine days on one instance.

Next Steps

My takeaway from the results in this post and the last two is that there are things I do when coding whose impact on my own developer experience and on customer response times I have not thought through. This feels poor, but as yet I don't know how to correct it by giving myself a new mental model of what I should be aiming for. If I find a happy place I will let you know…

Invalid Apex Class Recompile Times

In my last post some testing showed how Apex class deploy times are proportional to the size of code in the class being deployed and in all its dependent classes. If you read that alone you might think that best practice would be to always write very small classes. What we also have to consider, though, is what impact that would have on our customers' experience. In this post I am going to focus on one aspect of this: what happens to response times when there are invalid classes in an org.

Let's start with the headline result:

This graph shows how long it takes to recompile classes of various sizes once they have become invalid. By recompile here I really mean the process that converts an invalid class back to a valid one on demand, as some execution context attempts to use it. I think it's fair to assume this is a recompile of the class source code into the byte code format used by Salesforce, given we have a 'Compile All Classes' link that appears to do exactly this operation.

As you can see there is a fairly high initial cost of ~20 ms/class plus a variable component depending on class size. The time here is measured as the additional time needed to make an anonymous Apex request when that request uses classes which are invalid. So if the request required ten 8 kB classes to be recompiled, we would expect it to take about 0.5 seconds longer than if those classes were not invalid. You could also likely time this on the platform, as calling Type.forName on a class causes it to become valid, but since I am using the Tooling API for these tests, timing some anonymous Apex that invokes a method on the class is more convenient.
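If you did want to take the on-platform route, a sketch might look like this ('InvalidClass1' is a placeholder for whichever class you have invalidated):

```apex
// Force an invalid class back to a valid state and time the revalidation.
// Type.forName triggers the on-demand recompile when the class is invalid.
Long started = System.currentTimeMillis();
Type revalidated = Type.forName('InvalidClass1');
System.debug('Revalidation: ' + (System.currentTimeMillis() - started) + ' ms');
```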

To calculate this I used the same binary tree of classes approach from this post but made a few adjustments. When the tree is initially constructed all the classes are in a valid state; to invalidate part of the tree, the test updates all the leaf node classes, which causes all the non-leaf classes to become invalid. The test then times an anonymous Apex call to the root node class, which in turn calls methods on all classes in the tree, requiring any invalid classes to be recompiled. To get the additional cost of the recompile I ran the test with and without the code that invalidates part of the tree and subtracted the results, before dividing by the known number of invalid classes in the tree. The number reported is the geometric mean of 10 runs.
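The geometric mean step is the only slightly unusual part, so here it is as a sketch (the per-run numbers are made-up example values; only the shape of the calculation matters):

```apex
// Geometric mean of the per-class costs from 10 runs (illustrative values).
List<Double> runsMs = new List<Double>{ 21.5, 19.8, 20.2, 22.1, 18.9,
                                        20.7, 21.0, 19.5, 20.3, 21.8 };
Double logSum = 0;
for (Double ms : runsMs) {
    logSum += Math.log(ms);
}
Double geoMean = Math.exp(logSum / runsMs.size());
System.debug('Geometric mean: ' + geoMean + ' ms/class');
```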

This was actually the second test I ran; the earlier test showed that the cost to recompile invalid classes appears to be linear in the number of invalid classes. I tested that up to ~512 invalid classes, so there is a possibility that the linear relationship breaks down somewhat beyond that. I wasn't really expecting this to show anything interesting, but you never know.

Ideal size & compile on deploy

This result does leave us with a bit of a problem. We would really like class size to be driven by 'clean code' considerations, not the implementation details of Salesforce orgs, but that may well hurt either our productivity, via excessive class deploy times, or our user experience, via poor responsiveness, depending on what turns out to be a typical class size for the type of product we are developing.

Salesforce have recently introduced 'Compile on deploy' support to try and address the responsiveness side of this problem. My experience with this so far has been pretty disastrous: when enabled, developer deploys of single classes can go from a worst case of around 30 seconds to many minutes. Worse still, on some org types you don't have the ability to disable it via Apex Settings; it's locked on.

The positive side of compile-on-deploy is that it should mean the cost of invalidation recompiles no longer matters on customer orgs, and so using the smallest possible classes looks desirable to improve developer experience. There is, however, another fly in the ointment: the class caching used to improve runtime access to recently used classes. Studying caching behaviour is always difficult, but maybe I can find a way to look at what is happening and get a more complete picture.

Inner classes to the rescue?

Thinking about how these results might change my coding style, it feels obvious that my own best interest is served by making classes as small as possible, in an attempt to reduce class update times to the minimum and so improve my own productivity. Doing that, though, may well cause issues with system response times, depending on the compile-on-deploy setting and class caching behaviours.

Perhaps a way around this is to write outer classes as though they were inner classes (i.e. no inner classes or statics of their own). That would allow a very easy route to automatically convert a set of outer classes into inner classes that can be combined together to reduce the class count, as sketched below. In a DX project, where we can at last use directories to organise code, you could place all the related classes together and auto-merge them for packaging. Just a thought…
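A sketch of what that auto-merge could look like (all names are hypothetical, and the 'before' classes live in separate files):

```apex
// Before (two separate files): outer classes with no statics or inner classes.
public class AccountLoader {
    public List<Account> load() {
        return [SELECT Id, Name FROM Account LIMIT 100];
    }
}

public class AccountNamer {
    public List<String> names(List<Account> accounts) {
        List<String> result = new List<String>();
        for (Account a : accounts) {
            result.add(a.Name);
        }
        return result;
    }
}

// After auto-merging for packaging (one file): the same classes become inner
// classes, reducing the class count. Call sites change from
// 'new AccountLoader()' to 'new Accounts.AccountLoader()'.
public class Accounts {
    public class AccountLoader { /* body unchanged */ }
    public class AccountNamer { /* body unchanged */ }
}
```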

Understanding Class Deploy Times for Developers

Update 5th Feb 2019: Tests for this can be found at https://github.com/nawforce/ClassDeploymentTests

When working with multiple managed packages I have often wondered why single file deploy times vary so much. Classes in some packages take only a few seconds to deploy, while in other packages classes with a similar purpose can consistently take much longer. The difference can be very large; it often feels like up to 10 times worse.

I have performed some experiments to try and understand what is happening, but before we get there I should mention this is more than casual interest for me. If you do a bit of research you will find the link between poor system response times and developer productivity has been studied a few times. While the evidence is not entirely conclusive (for me), it's pretty strong. How much time developers lose due to long deploy times is always going to be hard to gauge, but if I ask my colleagues at work about this link, no one argues it does not exist.

Show me the data!

Ok, fair enough. Let’s start with this:

This is a log of the time taken to deploy individual classes, all from the same managed package, via the Tooling API. Each dot represents one class, with the y-axis showing the average time in milliseconds to deploy over 10 attempts. The data is presented along the x-axis in the order it was collected.

The first thing you might notice is that around the 100th class tested there was some kind of brownout. The deploys continued to work but were much slower for a few minutes. I didn't investigate the cause of that as it's not what interested me; I was really interested in the time distribution of the other deploys.

There is a clear indication here of two clusters: classes that deploy in ~3.5 seconds and those that typically take 7 seconds. Repeating this test on a different package shows that 3.5 seconds is virtually always the best case but the location of the higher cluster varies; in the other package I tested it was closer to 20 seconds. Another observation is that the percentage of classes in each cluster can vary. In the graph above the two clusters are roughly equal in size, but in another case approximately 90% of the classes tested were in the slower group.

This is clearly quite unusual behaviour; I was expecting to find some kind of distribution of deploy times, but not quite this. Understanding why we see these clusters requires delving into code patterns, which I will likely do in a future post, but for now we can learn more by studying the deploy time behaviour of some generated classes.

Class Invalidation

My first thought on seeing the graph above was that the deploy time differences were being caused by the need to invalidate other classes. If you have spent much time developing with Apex you will know classes can become 'invalid'. To see this, go to Setup -> Apex Classes and add the IsValid flag to the view. The flag is used to indicate that although the class was valid when deployed (all are), some metadata has changed that might mean it is no longer syntactically valid. The flag tells the runtime that the class will need recompiling/rechecking before next use. You can get rid of all the invalid classes by hitting the 'Compile all classes' action on that page.
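You can also look at the same flag programmatically; ApexClass is queryable, so a bit of anonymous Apex along these lines lists what is currently invalid in an org:

```apex
// Count and list the classes currently flagged as invalid.
List<ApexClass> invalid = [SELECT Name FROM ApexClass WHERE IsValid = false];
System.debug(invalid.size() + ' invalid classes');
for (ApexClass cls : invalid) {
    System.debug(cls.Name);
}
```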

To test this I generated long chains of classes, each calling a method on the next class in the chain so that they became dependent. If you then update the class at the end of the chain, you can force a large number of classes to become invalid during the update. This test did show that part of the deploy time was coming from the need to perform invalidation, but even with very long chains (up to 1000 classes) the impact on deploy times was not large enough to explain what we see with package code.
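The generated chains looked something like this sketch (illustrative names; each class lived in its own file):

```apex
// Chain0 depends on Chain1, which depends on Chain2, and so on.
// Updating the last class in the chain invalidates everything above it.
public class Chain0 {
    public Integer depth() { return 1 + new Chain1().depth(); }
}

public class Chain1 {
    public Integer depth() { return 1 + new Chain2().depth(); }
}

// ... continuing down to the final class, which ends the chain:
public class Chain999 {
    public Integer depth() { return 1; }
}
```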

Use the trees

Having failed to identify a cause by looking at invalidation, the next step was to look at class dependants, as logically this is the inverse: if Class A calls a method on Class B, we might say updating B invalidates A, or equivalently that A depends on B. To test dependencies I created binary trees of classes in various sizes, to get this result.

The simplest 1-layer tree consisted of a root node class that calls methods in two other classes. In a 2-layer tree the root node class calls methods in two other classes, each of which calls methods in two more classes. If you run this test with no other code in each class you can see an increase in deploy times, but it's not that clear. I found I could produce the result above by adding additional code to each class that depends only on platform types.

What this graph is showing is a near linear increase in the time to update the root class of the tree as the tree size grows. There could be a few reasons for this, so it's time to dig a bit deeper with another experiment.

For this test I added 'weight' to the classes by using a number of identical small blocks of code. This means that when the number of classes doubles I can halve the number of code blocks per class, keeping the overall amount of Apex code in the tree roughly constant.
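Each 'weight' block was a small self-contained method of the following sort (my own illustration, not the exact generated code); repeating it N times per class gives precise control over total code size:

```apex
// One repeatable unit of 'weight': self-contained, platform types only.
private void weight42() {
    List<String> parts = new List<String>();
    for (Integer i = 0; i < 5; i++) {
        parts.add(String.valueOf(i * i));
    }
    String joined = String.join(parts, ',');
}
```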

What the data is showing is that for small-to-medium class counts, the time to update the root class of the tree is proportional to the amount of code in the class and in all its dependent classes.

With high numbers of classes we can see that deploy times start getting worse again, but it's not the linear relationship we saw earlier, so it's a smaller factor than the amount of code in all dependent classes.

Back to invalidation

With the Salesforce Apex runtime being essentially a black box, understanding why something happens, as opposed to what happens, can be very difficult if it's not documented. In this case I am going to speculate a bit and try to explain why we might be seeing this behaviour.

The invalidation result earlier got me thinking about what kind of data structure could be being used to invalidate classes very quickly. In this context I think it is important to understand that class references in an org support some very dynamic behaviours. For example, I can replace a platform class with my own class of the same name (don't do this, it's a horrible anti-pattern). If I did do this, how would existing class references be handled?
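As a quick illustration of that shadowing behaviour (do not ship anything like this):

```apex
// A horrible anti-pattern: this class shadows the platform Math class.
// Once deployed, unqualified references to Math.max(...) in the org resolve
// to this class; the platform version needs System.Math.max(...) instead.
public class Math {
    public static Integer max(Integer a, Integer b) {
        return 42; // obviously wrong, which is the point
    }
}
```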

My guess (and that's all this is) is that class references are always resolved during class loading. This would mean the only way to perform quick invalidation would be to store with each class a list of class names to invalidate when it is updated, so that there is no need to analyse class dependencies at that moment. If that is the case, then when you update a class you may need to recompute the invalidation list for all the dependent classes by analysing the code in the dependency tree.