Tuesday, January 10, 2017

Using varnish in front of a webhosting canada or other shared hosting site? Pay attention to your probe!

I've had a varnish installation protecting http://www.fairvote.ca -- a wordpress site hosted at webhosting canada (whc.ca) -- for about a year. Last Friday evening, it started spitting out 503 errors, for no obvious reason. I spent a few hours yesterday trying in vain to get the hosts to tell me what, if anything, might have changed at their end.

What I eventually figured out is that they changed how their servers respond to http requests without a domain - previously it was a 200, and it became something else (a redirect, I believe). That's not unreasonable (though relatively pointless to change at this stage, I would imagine), except that, by default, varnish probes its backends with requests that don't have a domain. What that meant is that varnish started thinking that the backend was broken, so stopped talking to it (even though it was perfectly capable of doing so).

The fix was just to add a customized probe that included a valid domain to the request.
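
For anyone in the same spot, here is a minimal sketch of what such a probe can look like. The probe name, the backend address (192.0.2.10) and the timing values are placeholders I've made up for illustration, not the settings from my actual configuration.

```vcl
# Health probe that sends a Host header, so a name-based virtual host
# answers for the real site instead of its no-domain default.
probe wordpress_probe {
    .request =
        "HEAD / HTTP/1.1"
        "Host: www.fairvote.ca"
        "Connection: close";
    .interval = 10s;   # illustrative: probe every 10 seconds
    .timeout = 5s;     # illustrative: give up on a probe after 5 seconds
    .window = 5;       # look at the last 5 probes...
    .threshold = 3;    # ...and require 3 good ones to call the backend healthy
}

backend default {
    .host = "192.0.2.10";   # placeholder address for the shared host
    .port = "80";
    .probe = wordpress_probe;
}
```

By default the probe counts only a 200 response as healthy, so it's worth confirming that the probed URL really returns a 200 for that domain rather than a redirect.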

Monday, November 14, 2016

Varnish saves the day, in unexpectedly awesome ways.

Four and a half years ago, I wrote a blog post about Varnish, a 'front-end proxy' for webservers. My best description of it then was as a protective bubble, analogous to how its namesake is used to protect furniture. I've been using it happily ever since.

How Americans Voted poster
But last week, I got to really put Varnish through a test when the picture here, posted by Fair Vote Canada (one of my clients), went viral on Facebook. And Varnish saved the server and the client in ways I didn't even expect.

1. Throughput

Varnish prides itself on efficiently delivering http requests. As the picture went viral, the number of requests was up to about 1000 per minute, which Varnish had no trouble delivering - the load was still below 1, and I saw only a small increase in memory and disk usage. Of course, delivering a single file is exactly what Varnish does best.

2. Emergency!

Unfortunately, Varnish was not able to solve a more fundamental limitation, which was the 100Mb/s network connection. Because the poster was big (760Kb), the network usage, which is usually somewhere in the 2-5 Mb/s range, went up to 100Mb/s, and even a bit beyond. That meant the site (and others sharing that network connection) started suffering slow connections, and I got a few inquiries about whether the server had 'crashed'.

At that stage, I had no idea what was actually going on, just that requests for this one file were about to cause the site as a whole to stop responding. I could see that the referrer was almost exclusively facebook. I also noticed that the poster on its own wasn't really helping their cause, and the client had no idea that it was happening either - they had uploaded the poster to facebook, so facebook shouldn't have been requesting it from their site.

Fortunately, because the limitation was in the outgoing network, there was a simple solution - stop sending the poster out. With a few lines in my varnish VCL, the server was now responding with a simple 'permission denied', and within a few seconds, everything settled down.
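
A sketch of the kind of rule involved, assuming Varnish 4 syntax and a made-up path for the poster (the real URL isn't important here):

```vcl
sub vcl_recv {
    # Hypothetical path: stands in for the poster's real URL.
    if (req.url ~ "^/files/poster\.jpg") {
        # Varnish 4 syntax; on Varnish 3 the equivalent is:
        #   error 403 "Permission denied";
        return (synth(403, "Permission denied"));
    }
}
```

The point is that the synthetic response is generated by Varnish itself and is only a few hundred bytes, so each request costs almost nothing in outgoing bandwidth compared with sending the full image.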

In fact, the requests kept coming in, at ever higher numbers, for the rest of the day, but Varnish was able to deflect them without any serious blip in the performance of the server.

3. And Better

The next day, after some more diagnostics, we discovered that the viral effect had actually come from someone else's facebook post, which shared the poster as it had gone out in an email. Although the poster on its own wasn't going to help the cause of PR directly, we didn't really want to stem whatever people were getting out of it, so I uploaded the poster to an Amazon S3 bucket (an industrial file service) and modified my varnish vcl to give a redirect to the amazon copy instead of a permission denied.

Now the poster could go safely viral.

4. And Best

After some more discussion, Fair Vote noted it would be better if people ended up on the facebook campaign url here  rather than just the poster. So I updated the varnish vcl so that if the poster request comes from a facebook referrer, then it redirects them instead to the above url.
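
The two redirect steps can be sketched roughly like this, again assuming Varnish 4 syntax. The poster path, the S3 bucket name and the campaign URL are all placeholders, and 750 is just an arbitrary private status code used to pass the target URL from one subroutine to the other.

```vcl
sub vcl_recv {
    # Hypothetical path: stands in for the poster's real URL.
    if (req.url ~ "^/files/poster\.jpg") {
        if (req.http.Referer ~ "^https?://([^/]+\.)?facebook\.com/") {
            # Placeholder for the facebook campaign URL.
            return (synth(750, "https://www.facebook.com/example-campaign"));
        }
        # Everyone else gets the copy on S3 (placeholder bucket name).
        return (synth(750, "https://example-bucket.s3.amazonaws.com/poster.jpg"));
    }
}

sub vcl_synth {
    # 750 is a private marker status; it is rewritten before delivery.
    if (resp.status == 750) {
        set resp.http.Location = resp.reason;
        set resp.status = 302;
        set resp.reason = "Found";
        return (deliver);
    }
}
```

Like the permission denied response, the redirect never touches the backend, so the server only pays for a tiny response per request.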

Four days later now, it seems like it's worked - the poster is still pretty viral, and even the requests for the original url are still going strong (3.4 million requests in the 48 hours ending at 3 am this morning).

Without Varnish, my server would have crashed and been unable to get back up, even now. Instead, the poster is still being shared, the rest of the site is still working, and the facebook share is even more effective than it would have been.

Tuesday, July 05, 2016

Distracted parenting

Do one thing each day that scares you? How about, watch your toddler ...

Tuesday, December 09, 2014

Lily's New Hat

My cousin Vikki has been helping us for the past month and is leaving to head back home in a couple of days. We had her over for dinner last night and she brought Lily a new hat!

Thursday, November 06, 2014

Varnish, Drupal and ical (.ics): expiry issue fixed

My normal configuration of a public site on my servers involves using varnish for the page cache and setting expire page to 1 day. This mostly works quite well (the varnish module in Drupal takes care of clearing the varnish cache when you're creating/editing content).

We recently launched a new Drupal version of the Calgary French & International School (okay, I was just along for the tail end to help with the launch, Karin and Rob get the credit for all the work), which includes an ical feed for parents (generated from views of course).

That's an excellent thing - parents can subscribe to the feed and have all the upcoming events on their mobile device (or google calendar, or both). But we discovered that although it works great on the Mac desktop, it wasn't working well for iOS (i.e. the iPhone). It would poll frequently enough, but only actually update once a day.

It turned out that these two devices are interpreting the http header 'cache-control' differently - the iphone appeared to interpret it to say don't bother looking for fresh data more than once a day. The header is not very well defined unfortunately, but it is used by Drupal/Varnish to control the maximum expiry date, so we didn't want to crank it too low (or risk a badly performing site, since most access is anonymous).

The solution was actually simple: a little help in the varnish vcl file, in my vcl_deliver function, below. The piece I added was the second if, and it's just modifying the cache-control header on output if it's delivering a file with extension 'ics'.

sub vcl_deliver {
  if (resp.http.magicmarker) {
    unset resp.http.magicmarker;
    set resp.http.age = "0";
  }
  # Shorten client-side caching for ical feeds only
  if (req.url ~ "\.ics$") {
    set resp.http.cache-control = "public, max-age=60";
  }
}

Monday, September 08, 2014

Lily, is it time to make some garlic bread?

Lily is my daughter, born a couple of days after my previous post, now almost 5 months old.

Tuesday, April 15, 2014

CiviCRM and Accounting in 4.3

The 4.3 version of CiviCRM that first came out in April 2013 addresses a key problem with CiviCRM for large organizations: namely, accounting integration.

So what exactly does that mean, and how does it work? Since I'm working on a big migration to CiviCRM, and the client has "accounting integration" needs, I've been diving in and trying to understand the nitty gritty. Since I started, 4.4 is now out, and 4.5 is almost out, and I understand they've made some improvements, so this is now a bit dated, but still might be helpful.

First off, "accounting integration" doesn't mean that CiviCRM will now replace your accounting package, but the goal is to make it play nicer. The key issue for my client is that reports coming out of their current system are used as input for their accounting system, so it needs to speak the same language - i.e. things like GL account codes, and double entry accounting for things like contributions where the donor is promising to send a cheque. I like to describe it as: making CiviCRM a trusted entity within the accounting system, instead of its current status where reports generally involve some wishful thinking, caveats, and a checklist of adjustments.

Initially I've just been going along assuming everything will take care of itself under the hood as I use Eileen's brilliant civimigrate module in combination with the Drupal migration module. But after a few failed attempts at getting what I want, I've been forced to try and understand what's really happening, and so that's what I'm sharing here.

The most useful documentation I've found so far is here:


The aha moment came after looking at the new civicrm_financial_* tables and reading that page.

My key insight was:

In the past, a row in the contribution table was playing too many roles: it was the transaction record as well as the accounting record. In this new, more sophisticated world of accounting, you still get one row in the contribution table per "contribution", but that really serves as a simplification of what are two collections of things: the collection of financial transactions that go into paying for it, and the collection of items that describe how that contribution is accounted for - where it went. In the simplest case, you get one of each (e.g. a check for a simple donation). But at its most complicated, you might have a series of partial payments for an event ticket, some of which is receiptable as a tax donation and some of which goes towards the cost of the event. Yes, that's why they call it 'double entry'.

Implementing a system like this over top of an existing system is a bit tricky, and the developers seem to have been a little bit worried about the complexity implication for users that didn't want to know about this. So you have to dig a bit to see what's going on, and I'm probably missing some details. Patches welcome ...

One way of describing what we're doing is 'splitting' the contribution. The contribution becomes a container for a collection of inputs (financial transactions) and a collection of outputs (attribution of the contribution to accounting codes). The original contribution table still contains a single entry per contribution, but the possibly multiple transactions that pay for it, and the possibly multiple attributions of that income, need to live in related tables.

One trick the developers used was to create something they call a 'financial type'. The point of this entity is to allow administrators to delegate the accounting to a set of rules for each 'financial type' - meaning, the way a contribution gets allocated to the account codes is determined by the nature of the transaction (i.e. income, expense, cost of sales, etc.), which is then looked up for the 'financial type', and that determines the accounting code. Fortunately, this is just a mechanism for calculating the actual accounting - the data gets stored fairly directly.

Now let's check out the new civicrm_financial_* tables.

civicrm_financial_item - this looks like the accounting for each entry in the contributions table. It includes entity_table and entity_id fields that link it to more information, e.g. an entry in civicrm_line_item. It doesn't provide you the accounting directly, but it gives you the financial_account_id, and you can look up the accounting code directly from the civicrm_financial_account table.

civicrm_financial_trxn - these are the financial transactions that make up the contributions. You'll see it has things like the from and to account fields (to allow for both external and internal transactions, like when a check is received and the transaction is transferred from accounts receivable to your bank account), as well as transaction IDs, currency, amounts, dates, payment instruments, etc., i.e. everything you need to follow the money.

civicrm_entity_financial_trxn - this is the table that joins the above two tables to the contributions table. A simple typical contribution will have two entries, one pointing to the financial item, and the other to the financial transaction.

Okay, now let's dig a little deeper:

The financial_item table, which holds the accounting, also has a reference to a 'more information' record, via its entity_table and entity_id fields. In my install, it's pointing at the civicrm_line_item table most of the time, except for old entries imported from 4.2, which point at the civicrm_financial_trxn table.

civicrm_line_item - I'm not sure why you'd need a reference to this, but I guess it does help track back to how the financial_item got created. Specifically, it has a 'financial type id' field, which in combination with the transaction, could be used to calculate the financial account id that ends up in the financial item.

civicrm_financial_trxn - I'm guessing that the only time a financial_item references this table is when there's a direct correspondence between the transaction and the accounting. For an install that was migrated from 4.2, for example, that's the case for all the old transactions that assumed this and for which there is no intervening line item to split it up. Maybe "backend" administrative entries and adjustments end up here as well?

And now back to the other financial tables:

civicrm_financial_type - a list of these 'financial type' abstractions. There are no accounting codes in here; you have to find the connection to the account id using something like:

select * from civicrm_entity_financial_account where entity_table = 'civicrm_financial_type';

civicrm_financial_account - the list of the accounting codes for each 'account', i.e. what you want to get from your bookkeeper when you set things up.

Conclusion: it's pretty complicated, and obviously you don't want to be manually mucking with these tables. In theory, the structure allows for some nice splitting up of income into different accounting categories, but at this stage, the interface is trying to hide most of the details to keep things simple for users.

Thursday, March 13, 2014

Transport layer security on the Internet

Yesterday I posted this:


and sent the link off to some friends and family. They had some good things to say, and some of that helped me clean it up a bit. But the feedback and discussions I had also helped me to step back a bit from the specifics of that proposal and think more generally about the problem.

The problem I'm talking about is a mash-up of technical detail, privacy concerns, security concerns and good old fashioned apocalypse with a dash of conspiracy anti-government kind of stuff. So there's definitely more than one way to look at it. I like to think of it as "collapse of trust on the Internet as we know it".

Here's the scenario: at some point in the next 5 years, a method is discovered that allows people with enough computer power to decrypt 'secure' https connections. Once this is generally known to the public (e.g. via a leak like that of Mr. Snowden), no one will 'trust' that any communication on the Internet is safe. Banks and credit card companies will stop accepting any transactions from the Internet, and e-commerce will collapse. How that will impact the world, I'll leave to your imagination, but I don't think it will be pretty.

The anti-establishment rogue in me gets some satisfaction from that scenario, but I also know that in a crisis, it's the people at the bottom of the ladder that get crushed, and mass human suffering isn't something I'd like to encourage.

So here are some follow-up notes to my post:

What problem are we trying to solve?

Avoiding a disaster is a nice big picture goal, but not one that lends itself to a specific solution. One way of framing the problem is narrowly, which is what I suggested in my post - i.e. focus on the mathematics behind the encryption problem.

On the other hand, perhaps that's not the right problem to solve? It's not a new problem, and it's been around for about 20 years and there hasn't been a whole lot of progress or change.

The mathematical piece of the problem as it is currently framed is about how to provide a "Public Key Infrastructure" (PKI) using mathematics. A PKI is a way of solving the more abstract problem of 'how do you establish trust between two parties on the Internet', where the only communication between them is this stream of bytes that appear to be coming from a source that is reliably identifiable only as a number? What if that doesn't have a reliable solution?

The short version of what suddenly got quite complicated is this: this part of internet security was designed for e-commerce, in a bit of a hurry, back in the early days of the Internet when machines were less powerful and e-commerce was a dream. Then the dream actually came true (after the Internet bubble and collapse) but those emperor's clothes are pretty skimpy.

So "who do you trust and why" is the bigger, more abstract problem, and treads on some scary ground. Is there a different solvable technical problem somewhere in here, bigger than the mathematical problem of a PKI but smaller than the completely abstract one?

What problems are already solved?

My smarter older brother pointed me to these:

a. http://en.wikipedia.org/wiki/Advanced_Encryption_Standard

A smaller, more tractable problem is 'symmetric encryption' (which isn't a mathematical solution to a PKI on its own), and this solution has been adopted as a new standard. In other words, if you have a prior relationship with someone and a way of sharing secrets outside of the Internet, then a secure private channel is not all that difficult.

b. http://en.wikipedia.org/wiki/Quantum_key_distribution

This appears to be a solution to negotiating a shared random secret key, which solves part of the PKI problem (it helps provide a secure channel with your correspondent, it doesn't help prove who they are).

c. Human nature

Yeah, just kidding. Just to be clear though - none of this solves the general problems of fraud and how humans have built a glorious, terrible thing built on machines and social interaction, and how fragile it is. Perhaps that part of the problem (who do you trust) is not going to have a technical solution.

Tuesday, December 17, 2013

The outsourcing question

I run a web development business, and am always engaged in a question about how many of my supporting services should be contracted out or done myself. And for what I don't do myself, who I can trust to deliver that service reliably to my clients. And what to do when that service fails.

This is not an academic debate this week for me.

On Sunday, my server-hardware supplier failed me miserably. On Friday, I notified them of errors showing up in my log related to one of my disks (the one that held the database data and backup files). They diagnosed it as a controller issue and scheduled a replacement for Sunday early morning. So far so good. It took longer than they had expected, but it came back and seemed to check out on first report, so I thought we were done. It was Sunday morning and I wasn't going to dig too deep into what I thought was a responsible service provider's area of responsibility.

On Sunday evening, Karin (my business associate at Blackfly) called me at home (which she normally never does) to alert me that the server was failing. By that point, the disk was unreadable, so we scheduled a disk replacement and I resigned myself to using my offsite backups, which were now a day older than normal because the hardware replacement had run over the hour when the backup schedule runs normally (why didn't I manually run it after the hardware "upgrade"? yes).

That server has been much too successful of late, and loading all the data from my offsite server was much slower than I'd anticipated (i.e. 2 hours), and then running all the database restores took a while. To make it worse, I decided that it was a good opportunity to update my MariaDB (MySQL) version from 5.2 to 5.5. That added unexpected extra stress and complications (beware the character configuration changes!!!!), which I can mostly only blame myself for, but at least I suffered for it correspondingly with lack of sleep.

But then on Monday, after sweeping up a bit, I discovered that the hardware swap done on Sunday morning had not only failed to address the problem, and made it much harder by postponing what could have been a simple backup to the other disk; they had actually swapped good hardware for older hardware of lesser capacity - in other words, the response to the problem had been to make it considerably worse. I had a few words with them, and I'll give them an opportunity to come up with something before I shame them publicly.

Now it's Tuesday morning and the one other major piece of infrastructure that I outsource (DNS/Registration, to hover.com) is down, and has been for the last hour.

In cases like this, my instinct is to circle the wagons and start hosting out of my basement (just kidding!) and run my own dns service (also kidding, though less so). On the other hand, the advantage of not being responsible is that it gives me time to write on my blog when they're messed up.

Conclusion: there are no easy answers to the outsourcing question. By nature, I take my responsibilities a little bit too close to heart, and have a corresponding outlook on what healthy 'growth' looks like. Finding a reliable partner is tough. It's what I try to be.

Update: here's an exchange with my server host after they asked when they could schedule time to put the right cpu back in, and whether I wanted to keep the same ticket or open a different one:


Thanks for this. I don't care if it's this ticket or another one. Having a senior technician to help sounds good, and I wonder if you could also tell me what you plan to do - are you going to put back my original chassis + cpu or try to swap in my old cpus into this chassis? Or are you just going to see what's available at the time?

The cavalier swapping of mislabelled parts after a misdiagnosis of the original problem points to more than a one-off glitch, particularly in light of previous errors I've had with this server - it sounds to me like you've got a bigger problem and having a few extra hands around doesn't convince me that you've addressed it.

What I have experienced is that you are claiming and charging for a premium service and delivering it like a bargain basement shop.


We will check available options prior to starting work during the maintenance window. 
We are currently thinking we would like to avoid the old chassis in case there are any SCSI issues and move the disks to another, tested chassis. As an option, we could add a CPU to the current server. 
If you have any preference on these options, we will make it the priority. 
I apologize again for the mistakes made, and the resulting downtime you have experienced. 

Is it just me, or did they just confirm what I was afraid of?

Saturday, October 19, 2013

Me and varnish win against a DDOS attack.

This past month one of my servers experienced her first DDOS - a distributed denial of service attack. A denial of service attack (or DOS) just means an attempt to shut down an internet-based service by overwhelming it with requests. A simple DOS attack is usually relatively easy to deal with using the standard linux firewall called iptables. The way iptables works is by filtering the traffic based on the incoming request source (i.e., the IP of the attacking machine). The attacking machine's IP can be added into your custom iptables 'blacklist' to block all traffic from it, and it's quite scalable so the only thing that can be overwhelmed is your actual internet connection, which is hard to do.

A distributed DOS is harder because the attack comes from multiple machines. I first noticed an increase in my traffic about a day after it had started - it wasn't slowing down my machine, but it did show up as a spike in traffic. I quickly saw that a big chunk of traffic was all of the same form - a POST to a domain that wasn't actually in use except as a redirect. There were several requests per second, and each attacking machine would do the same request about 8 times. So it was coming from a lot of different machines, which made it infeasible to keep manually adding these IPs to my blacklist.

It certainly could have been a lot worse. Because it was attacking a domain that was being redirected, it was using up an apache process, but no php, so it was getting handled very easily without making a noticeable dent in regular services. But it was worrisome, in case the traffic picked up. It was also a curious attack - why make an attack on an old domain that wasn't even in use? My best guess is that it was either a mistake, or a way of keeping someone's botnet busy. I have heard that there are a number of these networks of "zombie" machines, presumably a kind of mercenary force for hire, and maybe if there are no contracts, they get sent out on scurrilous missions to keep them busy.

In any case, I also thought a bit about why Varnish wasn't being useful here. Varnish is my reverse-proxy protective bubble for my servers (yes, kind of like how a layer of varnish protects your furniture). The requests weren't getting cached by Varnish because in general, it's not possible to responsibly cache POST requests (which is presumably why a DDOS would favour this kind of traffic). To see why, just imagine a login request, which is a POST - each request will have a unique user/pass and the results of the request will need to get handled by the underlying CMS (Drupal in my case).

But, in this case, I wasn't getting any valid POST requests to that domain anyway, so that made it relatively easy to add the following stanza to my varnish configuration:

 # Treat POSTs to the unused domain like cacheable requests
 if (req.http.host ~ "example.com" && req.request == "POST") {
   return (lookup);
 }

And indeed, now all the traffic is bouncing off my varnish and I'm not worrying. In case it was a domain that was actively in use, I could have added an extra path condition (since no one should be POST'ing to the front page of most of my domains anyway), but it would have started getting trickier. Which is why you won't find Varnish too helpful for DDOS POST attacks in general. As usual, the details matter, and in this case, since I was being attacked by a collection of mindless machines, the good guys won.
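
For what it's worth, a sketch of that extra path condition might look like the following. It uses the same Varnish 3 syntax as the stanza above, example.com is still a stand-in for the real domain, and I haven't had to run this variant in production, so treat it as a starting point only.

```vcl
sub vcl_recv {
    # Only POSTs to the front page of the domain are forced through
    # the cache; POSTs to any other path (logins, forms) are untouched.
    if (req.http.host ~ "example\.com$" &&
        req.request == "POST" &&
        req.url == "/") {
        return (lookup);
    }
}
```

The trade-off is that any legitimate form that really does post to the front page would break, which is exactly where it starts getting trickier.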