I’ve been meaning to write a blog entry about Varnish for years now. The closest I’ve come is to write a blog about how to make Varnish cache your debian repos, make you a WikiLeaks cache and I’ve released Varnish Secure Firewall, but that without a word on this blog. So? SO? Well, after years it turns out there is a thing or two to say about Varnish. Read on to find out what annoys me and people I meet the most.
Although you could definitely call me a “Varnish expert” and even a sometimes contributor, and I do develop programs, I cannot call myself a Varnish developer because I’ve shamefully never participated in a Monday evening bug wash. My role in the Varnish world is more… operative. I am often tasked with helping ops people use Varnish correctly, justify its use and cost to their bosses, defend it from expensive and inferior competitors, and sit up long nights with load tests just before launch days. I’m the guy that explains the low risk and high reward of putting Varnish in front of your critical site, and the guy that makes it actually be low risk, with long nights on load tests, and I’ll be the first guy on the scene when the code has just taken a huge dump on the CEO’s new pet Jaguar. I am also sometimes the guy who tells these stories to the Varnish developers, although of course they also have other sources. The consequence of this .. lifestyle choice .. is that what code I do write is either short and to the point or .. incomplete.
I know we all love Varnish, which is why after nearly 7 years of working with this software I’d like to share with you my pet peeves about the project. There aren’t many problems with this lovely and lean piece of software, but those that are there are sharp edges that pretty much everyone stubs a toe or snags their head on. Some of them are specific to a certain version, while others are “features” present in nearly all versions.
And for you Varnish devs who will surely read this, I love you all. I write this critique of the software you contribute to, knowing full well that I haven’t filed bug reports on any of these issues and therefore I too am guilty in contributing to the problem and not the solution. I aim to change that starting now :-) Also, I know that some of these issues are better lived with than fixed, the medicine being more hazardous than the disease, so take this as all good cooking; with a grain of salt.
Silent error messages in init scripts
Some genius keeps inserting `1>/dev/null 2>&1` into the startup scripts on most Linux distros. This might be in line with some wacko distro policy, but it makes conf errors, and in particular VCL errors, way harder to debug for the common man. Even worse, the `service varnish reload` script calls `varnish-vcl-reload -q`, that’s q for please-silence-my-fatal-conf-mistakes, and the best way to fix this is to *edit the init script and remove the offender*. Mind your p’s and q’s, eh? It makes me sad every time, but where do I file this particular bug report?
debug.health still not adequately documented
People go YEARS using Varnish without discovering `watch varnishadm debug.health`. Not to mention that it’s anyone’s guess that this has to do with probes, and that there are no other `debug.*` parameters, except for the totally unrelated `debug` parameter. Perhaps this was decided to be dev-internal at some point, but the probe status is actually really useful in precisely this form. `debug.health` is still absent from the `param.show` list and the man pages, while in 4.0 some probe status and backend info has been put into varnishstat, for which I am surely not the only one to be very thankful indeed.
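For context, the health state that `debug.health` reports comes from backend probes defined in VCL. Here is a minimal sketch of such a probe in Varnish 4.0 syntax; the backend address, the probed URL and all the numbers are made-up placeholders, not recommendations.

```vcl
vcl 4.0;

# Hypothetical backend; host, port and probe values are placeholders.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = {
        .url = "/";        # URL requested on every health check
        .interval = 5s;    # time between checks
        .timeout = 1s;     # a check slower than this counts as failed
        .window = 5;       # number of recent checks to consider
        .threshold = 3;    # checks in the window that must pass for "healthy"
    }
}
```

With a probe like this loaded, `watch varnishadm debug.health` shows the rolling window of check results for the backend.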
Designing a language is tricky.
Explaining why purge is now ban, and why what is now purge is something else, is mind-boggling. This issue will be fixed in 10 years when people are no longer running Varnish 2.1 anywhere. Explaining all the three-letter acronyms that start with V is just a gas.
Showing people `ban("req.url == " + req.url)` for the first time is bound to make them go “oh” like a raccoon just caught sneaking through your garbage.
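To make the purge/ban distinction a bit more concrete, here is a rough sketch in Varnish 4.0 syntax. The `PURGE` and `BAN` request methods, the ACL name `invalidators` and the backend address are conventions and placeholders I picked for illustration; nothing in Varnish mandates them.

```vcl
vcl 4.0;

# Placeholder backend.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Hypothetical ACL: who is allowed to invalidate content.
acl invalidators {
    "localhost";
}

sub vcl_recv {
    if (req.method == "PURGE" || req.method == "BAN") {
        if (!client.ip ~ invalidators) {
            return (synth(405, "Not allowed"));
        }
    }
    if (req.method == "PURGE") {
        # Purge: remove the object matching this exact request right away.
        return (purge);
    }
    if (req.method == "BAN") {
        # Ban: add an expression to the ban list; cached objects are
        # tested against it later and discarded if they match.
        ban("req.http.host == " + req.http.host +
            " && req.url == " + req.url);
        return (synth(200, "Ban added"));
    }
}
```

Note that the ban expression is a string assembled at request time, which is exactly the part that tends to produce the raccoon face.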
Grace and Saint mode… that’s biblical, man. Understanding what it does and how to demonstrate the functionality is still for Advanced Users, explaining this to noobs is downright futile, and I am still unsure whether we wouldn’t all be better off for just enabling it by default and forgetting about it.
I suppose if you’re going to be awesome at architecting and writing software, it’s going to get in the way of coming up with really awesome names for things, and I’m actually happy that’s still the way they prioritize what gets done first.
Only for people who grok regex
Sometimes you’ll meet Varnish users who do code but just don’t grok regex. It’s weak, I know, but this language isn’t for them.
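To give an idea of how much regex an everyday rule carries, here is a small sketch in Varnish 4.0 syntax. The list of file extensions and the backend address are arbitrary examples, not anything the language prescribes.

```vcl
vcl 4.0;

# Placeholder backend.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    # Case-insensitive match on static assets, with or without a query string.
    if (req.url ~ "(?i)\.(png|jpe?g|gif|css|js)(\?.*)?$") {
        # Strip the query string so all variants share one cache entry.
        set req.url = regsub(req.url, "\?.*$", "");
    }
}
```

If that condition reads like line noise to you, most real-world VCL will too.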
Uncertain current working directory
This is a problem on some rigs which have VCL code in stacked layers, or really anywhere where it’s more appropriate to call the VCL a Varnish program, as in “a program written for the Varnish runtime”, rather than simply a configuration for Varnish.
You’ll typically want to organize your VCL in such a way that each VCL file is standalone with if-wrapped rules, and they’re all included from one main VCL file, stacking all the vcl_recv’s and vcl_fetch’s.
Because distros don’t agree on where to put varnishd’s current working directory, which happens to be wherever it was launched from instead of always being `chdir $(dirname $CURRENT_VCL_FILE)`, you can’t reliably specify `include` statements with relative paths. This forces us to use hardcoded absolute paths in includes, which is neither pretty nor portable.
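In practice the main VCL file ends up looking something like the sketch below (Varnish 4.0 syntax). The directory layout and file names are hypothetical; the point is only that every path has to be absolute.

```vcl
vcl 4.0;

# Placeholder backend.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

# Hypothetical layer files, each contributing its own vcl_recv etc.
# Relative paths would be resolved against wherever varnishd was started,
# so absolute paths are the only thing that works everywhere.
include "/etc/varnish/conf.d/security.vcl";
include "/etc/varnish/conf.d/static.vcl";
include "/etc/varnish/conf.d/site.vcl";
```

Move the tree to another prefix, or to a distro with a different layout, and every one of those lines needs editing.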
Missing default director in 4.0
When translating VCL to 4.0 there is no longer any language for director definitions, which means they are done in `vcl_init()`, which means your default backend is no longer the director you specified at the top, which means you’ll have to rewrite some logic lest it bite you in the ass.
`director.backend()` is without string representation, unlike `backend_hint`, so you cannot do old-style name comparisons; i.e. backends are first-class objects, but directors are another class of objects.
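A minimal sketch of what the 4.0 translation looks like, using the `directors` vmod that ships with Varnish 4.0. The backend names, addresses and the director name `cluster` are placeholders of mine.

```vcl
vcl 4.0;

import directors;

# Placeholder backends. The first one defined becomes the default backend.
backend web1 {
    .host = "192.0.2.10";
    .port = "8080";
}

backend web2 {
    .host = "192.0.2.11";
    .port = "8080";
}

sub vcl_init {
    # Directors are now objects created at VCL load time.
    new cluster = directors.round_robin();
    cluster.add_backend(web1);
    cluster.add_backend(web2);
}

sub vcl_recv {
    # Without this line, requests go to web1 (the first backend defined),
    # not to the director.
    set req.backend_hint = cluster.backend();
}
```

Forgetting the `set req.backend_hint` line is the easy mistake: the VCL compiles and serves traffic, just not from the director you thought was the default.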
VCL doesn’t allow unused backends or probes
Adding and removing backends is a routine ordeal in Varnish.
Quite often you’ll find it useful to keep backup backends around that aren’t enabled, either as manual failover backups, because you’re testing something or just because you’re doing something funky. Unfortunately, the VCC is a strict and harsh mistress on this matter: you are forced to comment out or delete unused backends :-(
Workarounds include using the backends inside some dead code or constructs like
set req.backend_hint = unused;
set req.backend_hint = default;
It’s impossible to determine how many bugs this error message has avoided by letting you know that backend you just added, er yes that one isn’t in use sir, but you can definitely count the number of Varnish users inconvenienced by having to “comment out that backend they just temporarily removed from the request flow”.
I am sure it is wise to warn about this, but couldn’t it have been just that, a warning? Well, I guess maybe not, considering distro packaging is silencing error messages in init and reload scripts..
To be fair, this is now configurable in Varnish by setting `vcc_err_unref` to false, but couldn’t this be the default?
saintmode_threshold default considered harmful
If many different URLs keep returning bad data or error codes, you might conceivably want the whole backend to be declared sick instead of growing some huge list of sick URLs for this backend. What if I told you your developers just deployed an application which generates 50x error codes, triggering your saintmode for an infinite number of URLs? Well, then you have just DoSed yourself because you hit this threshold. I usually enable saintmode only after giving my clients a big fat warning about this one, because quite frankly this comes straight out of left field every time. Either saintmode is off, or the threshold is Really Large™ or even ∞, and only in some special cases do you actually want this set to an actual number.
Then again, maybe it is just my clients and the wacky applications they put behind Varnish.
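For readers who have never switched it on, this is roughly what triggering saintmode looks like in Varnish 3 VCL (note: Varnish 3 syntax, so no `vcl 4.0;` marker). The 10 second duration and the backend address are arbitrary placeholders.

```vcl
# Varnish 3.x syntax. Placeholder backend.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_fetch {
    if (beresp.status >= 500) {
        # Put this URL on the backend's saintmode list for 10 seconds
        # and retry the request.
        set beresp.saintmode = 10s;
        return (restart);
    }
}
```

Every URL that takes this branch adds an entry to the list, and once the number of entries reaches `saintmode_threshold` the whole backend is treated as sick.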
What is graceful about the saint in V4?
While we are on the subject, grace mode being the most often misunderstood feature of Varnish, the thing has changed so radically in Varnish 4 that it is no longer recognizable by users, and they often make completely reasonable but devastating mistakes trying to predict its behavior.
To be clear on what has happened: saint mode is deprecated as a core feature in V4.0, while the new architecture now allows a type of “stale-while-revalidate” logic. A saintmode vmod is slated for Varnish 4.1.
But as of 4.0, say you have a bunch of requests hitting a slow backend. They’ll all queue up while we fetch a new one, right? Well yes, and then they all error out when that request times out, or if the backend fetch errors out. That sucks. So let’s turn on grace mode, and get “stale-while-revalidate” and even “stale-if-error” logic, right? And send If-Modified-Since headers too, sweet as.
Now that’s gonna work when the request times out, but you might be surprised that it does not when the request errors out with 50x errors. Since `beresp.saintmode` isn’t a thing anymore in V4, those error codes are actually going to knock the old object outta cache, and each request is going to break your precious stale-if-error until the backend probe declares the backend sick and your requests become grace candidates.
Ouch, you didn’t mean for it to do that, did you?
And if, gods forbid, your apphost returns 404s when some backend app is not resolving, bam, you are in a cascading hell fan fantasy.
What did you want it to do, behave sanely? A backend response always replaces another backend response for the same URL (not counting Vary headers). To get a poor man’s saint mode back in Varnish 4.0, you’ll have to `return (abandon)` those erroneous backend responses.
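A minimal sketch of that poor man’s saint mode in Varnish 4.0 syntax. The `>= 500` cutoff, the one hour grace and the backend address are assumptions for illustration; pick whatever matches your application.

```vcl
vcl 4.0;

# Placeholder backend.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_backend_response {
    if (beresp.status >= 500) {
        # Drop the error response instead of letting it replace
        # the object we already have in cache.
        return (abandon);
    }
    # Keep objects around past their TTL so there is something stale to serve.
    set beresp.grace = 1h;
}
```

Keep in mind that when there is no older object to fall back on, abandoning the fetch leaves the client with an error from Varnish instead of the backend’s own error page.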
Evil grace on unloved objects
For frequently accessed URLs grace is fantastic, and will save you loads of grief, and those objects could have large grace times. However, rarely accessed URLs suffer a big penalty under grace, especially when they are dynamic and meant to be updated from backend. If that URL is meant to be refreshed from backend every hour, and Varnish sees many hours between each access, it’s going to serve up that many-hour-old stale object while it revalidates its cache.
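One way to soften this, sketched here along the lines of the grace example in the Varnish 4.0 documentation, is to honor the full grace period only when the backend is sick and to cap staleness while it is healthy. The 10 second cap and the backend address are arbitrary placeholders, and `return (fetch)` from `vcl_hit` is 4.0 behavior that I would not assume carries over unchanged to later versions.

```vcl
vcl 4.0;

import std;

# Placeholder backend; it needs a probe for std.healthy() to be meaningful.
backend default {
    .host = "127.0.0.1";
    .port = "8080";
    .probe = {
        .url = "/";
        .interval = 5s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}

sub vcl_hit {
    if (obj.ttl >= 0s) {
        # Fresh object: plain cache hit.
        return (deliver);
    }
    if (std.healthy(req.backend_hint)) {
        # Backend is up: tolerate at most 10s of staleness.
        if (obj.ttl + 10s > 0s) {
            return (deliver);
        }
        # Too stale: make this client wait for a fresh copy.
        return (fetch);
    }
    if (obj.ttl + obj.grace > 0s) {
        # Backend is down: serve stale for the full grace period.
        return (deliver);
    }
    return (fetch);
}
```

The trade-off is that the first visitor to an unloved URL waits for the backend instead of getting a many-hour-old page instantly.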
This diagram might help you understand what happens in the “200 OK” and “50x error” cases of graceful request flow through Varnish 4.0.
Language breaks on major versions
This is a funny one because the first major language break I remember was the one that I caused myself. We were making security.vcl and I was translating rules from mod_security and having trouble with it because Varnish used POSIX regexes at the time, and I was writing this really godawful script to translate PCRE into POSIX when Kristian, who conceived of security.vcl, went to Tollef (the two were working in the same department at the time) and asked in his classic brook-no-argument kind of way, "why don’t we just support Perl regexes?".
Needless to say, (?i) spent a full 12 months afterwards cursing myself while rewriting tons of nasty client VCL code from POSIX to PCRE and fixing occasional site-devastating bugs related to case-sensitivity.
Of course, Varnish is all the better for the change, and would get nowhere fast if the devs were to hang on to legacy, but there is a lesson in here somewhere.
So what's a couple of `sed 's/req.request/req.method/'`s every now and again?
This is actually the main reason I created the VCL.BNF. For one, it got the devs thinking about the grammar itself as an actual thing (which may or may not have resulted in the cleanups that make VCL a very regular and clean language today), but my intent was to write a parser that could parse any version of VCL and spit out any other version of VCL, optionally pruning and pretty-printing of course. That is still really high on my todo list. Funny how my clients will book all my time to convert their code for days but will not spend a dime on me writing code that would basically make the conversion free and painless for everyone forever.
Indeed, most of these issues are really hard-to-predict consequences of implementation decisions, and I am unsure whether it would be possible to predict these consequences without actually getting snagged by the issues in the first place. So again: Varnish devs, I love you, what are your pet peeves? Varnish users, what are your pet peeves?
vcc_err_unref has existed since Varnish 3.