Thursday, November 06, 2008

Eating my dog food

I was recently carrying home a bag of food for my dogs when a neighbour made jokes about eating dog food and the coming recession. I think recessions are like winter - you know one will come eventually, but it's hard to imagine in the depths of summer.

But my point is really about dog food, and eating it. The woman who sells me Nutromax claims the salespeople eat it to prove it's good. As a computer-geeky guy, I'm familiar with the expression "eating your own dog food" meaning using your own software. I just looked it up on Wikipedia and discovered that the idea did indeed come from an advertisement about dog food, and that it's now used mainly about software. Here's what Wikipedia says about the idea:

Using one's own products has four primary benefits:

1. The product's developers are familiar with using the products they develop.
2. The company's members have direct knowledge and experience with its products.
3. Users see that the company has confidence in its own products.
4. Technically savvy users in the company, with perhaps a very wide set of business requirements and deployments, are able to discover and report bugs in the products before they are released to the general public.

A disadvantage is that if taken to an extreme, a company's desire to eat its own dog food can turn into Not Invented Here syndrome, in which the company refuses to use any product which was not developed in-house.

So, that's my introduction to say that I've finally created myself a Drupal site for my business. It had previously been hosted at googlepages, because it was free and easy and I thought Web 2.0 was cool (just kidding about that last one). Also because I didn't have a server or domain name, because I thought I'd just be a consultant.

After three years, I'm still working as an independent consultant. What I've changed is:

  1. I've got my hands full with Drupal and CiviCRM for Canadian non-profits. I may do some projects outside that scope, but I've now got a more specific niche.
  2. I'm not just a "consultant", but a full service shop - i.e. websites from beginning to end, even mail. I use the "keep it as simple as possible, but no simpler" rule, and working on other people's servers turned out to be more complicated than running my own server (no, not in my basement, I use a commercial Canadian service for the hardware and network).
  3. I'm committed to remaining "aggressively small" [credits to Mark Surman and Phillip Smith]. There's an assumption in the technical world that you have to "grow" your business to be competitive (yes, not just the technical world). I think that ideology is wrong in a general way from economic and environmental points of view, but specifically wrong for most Drupal websites. Big shops with layers of management do not make better websites, and certainly not cheaper ones - the big shops are driven not by real 'economies of scale' but by the owners' delusions of money and/or fame. You know who you are ...

That's my story so far, now go visit my new site.

Friday, July 04, 2008

Infrastructure projects

I've been running my own server for a year and a half now, and have been surprised at how trouble free it's been. I attribute this to:

  1. luck
  2. good planning
  3. a decent upstream provider
  4. the maturity of linux distribution maintenance tools (e.g. yum)

In this case, good planning means:

  1. keeping it as simple as possible
  2. doing things one at a time
  3. I'm the only one mucking about on it

And so this month, inspired by some Drupal camp sessions, I decided to take some time to make a good thing better. My goals were:

  1. Optimizing my web servicing for more traffic.
  2. Simplifying my Drupal maintenance.
  3. Automating my backups.

And here's the results ...

Web Servicing Optimizations

This was relatively easy - I just finished off the work from here:

Specifically, I discovered that I hadn't actually set up a MySQL query cache, so I did that. And then I discovered that it was pretty easy and not dangerous to remove a bunch of the default apache modules - all I had to do was comment out the corresponding lines in the httpd.conf file. I also took out some other gunk in there that isn't useful for Drupal sites (multilingual icons, auto-indexing).
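
For reference, enabling the query cache was just a few lines in my.cnf; a minimal sketch, with sizes that are illustrative rather than tuned recommendations:

```ini
[mysqld]
query_cache_type  = 1      # cache SELECT results unless SQL_NO_CACHE is given
query_cache_size  = 32M    # total memory reserved for the cache
query_cache_limit = 1M     # don't cache individual results larger than this
```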

I like to think that between those two changes, the response time is even better, though the difference is relatively marginal without much load. The real reason to do this is to increase the number of available Apache server processes without the risk of going into swap death. So I can now add more sites without fear.

Simplifying Drupal Maintenance

I was converted to SVN (a version control program) 3 years ago and still love it. I've been using it to methodically track all the code, with individual repositories for each of my major projects, using the full trunk and vendor branch setup and the magic of svn_load_dirs.

But after a project starts using a lot of contributed modules, or when there are several code security updates each year and you have several projects, this starts getting time consuming.

So I've started NOT putting Drupal core or contributed modules into my svn projects, and I'm using one multi-site install for most of my sites. Along with the fabulous update_status module for Drupal 5 (which is in core for Drupal 6), keeping up-to-date is now much more manageable. It's also a change of mindset - I'm now more committed (pun intended) to the Drupal community. It means I can no longer hack core (at least not without a lot of work).

And so I tested this whole scheme out by moving all my simple projects to a new document root that's controlled entirely via svn on the server, with symlinks out to my individual site roots (which still go in svn, so I can keep track of themes, files and custom modules), and it worked well. There's actually a performance benefit here as well - by keeping all my sites on the same document root, the PHP cache doesn't fill up as fast, because there's less code running. And it's more easily kept secure.
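
The layout above can be sketched in a scratch directory; on the real server these would be the shared Drupal install and a per-site svn working copy (all the names here are hypothetical):

```shell
BASE=$(mktemp -d)                        # scratch stand-in for the server filesystem
mkdir -p "$BASE/drupal/sites"            # shared multi-site document root
mkdir -p "$BASE/projects/example.com"    # per-site svn checkout: theme, files, custom modules
# symlink the per-site directory into the shared multi-site install
ln -s "$BASE/projects/example.com" "$BASE/drupal/sites/example.com"
```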

And as a final hurrah, I upgraded to Drupal 6. In the process, I've given up on the 'links' module, which I thought had some promise, and am now just using the 'link' module, which defines link fields for CCK. I also started learning about the famed Drupal 6 theming, and tweaked the theme for fun.


Automating My Backups

I back up to an offsite server using rsync, which seems to be a common and highly efficient way to do things for a server like this. rsync is clever enough to send only file diffs, so load and bandwidth are kept to a minimum. My backups are not for users, they're only for emergencies, so I don't need hourly snapshots, only daily rsyncs.

Well, this works well for code, but not so much for MySQL. I'd been doing full mysqldumps and then copying them to my backup server, but this was not very efficient. So finally this week, with help from some simple scripts, I've set it up to use the --tab parameter to mysqldump, which dumps the tables in each database to separate files. This means that now when I run rsync on them, it's clever enough to only worry about the tables that have changed, which are relatively few each day. So now I've got daily mysql backups as well, without huge load/bandwidth!
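
A sketch of the nightly dump script - the database names, the /backups directory, and the backup hostname are all hypothetical, and it has to run on the database host itself, since with --tab the MySQL server writes the per-table files:

```shell
for db in drupal civicrm; do
    mkdir -p /backups/mysql/$db
    # --tab writes one .sql (schema) and one .txt (tab-delimited data)
    # file per table, so the following rsync only re-sends changed tables
    mysqldump --tab=/backups/mysql/$db $db
done
rsync -az /backups/mysql/ backup.example.com:/backups/mysql/
```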

And that also means, I can now use my backup as a place to pull copies of code and database when I want to setup a development environment.


Which takes me almost to a new topic, but it's also about infrastructure, so here it is. I've been running little development servers for several years. My main one I actually found being thrown out (it was a Pentium II). They have served me well, but I was rethinking my strategy, mainly over power: I'm not happy about using so much electricity for them (and, as older servers, their power supplies aren't very efficient), and since one of them is actually in my office, it's fine in the winter when my office is cold, but really not good in the summer when I'm trying to stay cool.

And so the promise of virtualization lured me into believing I could run a little virtual server off my desktop. I tried Xen, but it broke my wireless card (because I have to run it using ndiswrapper), so I finally gave up and installed VMware (because it was in an Ubuntu-compatible repository), even though it's not really open source.

Does it work? Well, so far so good.

Wednesday, May 14, 2008

Toronto Drupal Camp 2008

I thought I'd have some time for some house renovations before Drupal Camp this year, but planning Drupal projects is always harder than you'd think. In any case, I'm also helping plan Drupal Camp, and I've even got a couple of session proposals that have to do with planning Drupal websites. So come find out what all the fuss is about.

Friday, April 18, 2008

CiviCRM Case Study: Fair Vote Canada

These are my notes from a CiviCRM data import for Fair Vote Canada I did on April 16/17, 2008.

Fair Vote Canada is a small NGO, has been around for about 7 years, and is a public interest lobby group for proportional representation-type voting systems in Canada. If you care about democracy, then they're worth supporting. One thing I find particularly interesting and important is that they're cross-party. Obviously, depending on whether they're in power or not, parties have a very biased opinion about proportional representation, and regardless of their statements of principles, that's not going to change with any change of government, since parties exist to win power, or they don't last long. So Fair Vote Canada decided early on to be strictly non-partisan, and they have some energetic and high-profile supporters from across the political spectrum.

On the technical side of things, they've had a Drupal site for a while, but were still using Excel spreadsheets to manage their relationships with their members (about 3000 of them), which was getting unwieldy and time-consuming.

They had tried to set up CiviCRM and import the data earlier this year, but the import had been done as if CiviCRM were a custom relational database (like the thousands of FoxPro/FileMaker desktop installs out there) - so it wasn't very useful. For example, donations and householding data were imported as custom fields. The installation did have some customization (fields, profiles) that needed to be kept, but the data was all considered suspect.

1. Sample Imports

Before I did anything on the live server, I created a vanilla CiviCRM site on my development server and imported a sample Excel sheet provided by Fair Vote, testing my ideas about how to do this. The data was saved as one household per row, with multiple columns detailing date/amount of donations, as well as one or two individuals associated with the household.

The key idea was to generate 'external ids', in order to maintain the relationships between the contact information and the donation and membership data. Then I could import the same sheet several times - both as contact information (possibly multiple times for households) and then as donation information, retaining the relationship through the use of this external id key, which is well supported by CiviCRM.

2. Server survey and backup

I looked at all the existing server code and backed up the relevant databases.

3. CiviCRM Install

I started out by creating a staging site in the /sites directory of the current install and cloning the live site to it. I then edited the two public CiviCRM-related pages to say 'coming soon' and turned off the CiviCRM module on the live site. Then I edited the settings in the staging site so that it used the old CiviCRM database - so I had full access to the old CiviCRM data while I rebuilt the new one on a clean install. I installed v2.0.2 in /sites/all/modules, where it's happiest, ran the usual new-install routines, and then copied over the global configuration (locale, etc.) from the old install.

4. Global spreadsheet cleanup

My sample import exercise had provided me with a few global spreadsheet cleanups that I knew I had to do. These were:
a. convert dates to ISO 8601 (yyyy-mm-dd) - using the cell formatting feature in OpenOffice, with some manual and automated cleanup when dates had been entered erratically.
b. remove dollar signs from currency (simple format)
c. generate "household names" for spreadsheet rows with more than one contact ID per address. I used a macro that combined the last names.
d. fix various misspelled country/provinces (e.g. USA -> US, NF -> NL, etc.)
e. modify gender from "m" and "f" to "Male" and "Female" (using a spreadsheet macro). I did this with some other columns as well (e.g. French).
f. add a dummy column that has just the word "Donation" in it for when I import the donation columns of a sheet.
g. after all that, I had to split the membership spreadsheet because it included rows with 1 membership and 1 name, 1 membership and 2 names, and 2 memberships with 2 names. It also had some other special membership entries that I wanted to mark separately. So I ended up with 4 spreadsheets from this one (more details below about this).
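
A few of these cleanups (b, d and e) can be sketched as stream edits on a CSV export - the actual work was done with OpenOffice formatting and macros, and this one-row sample is hypothetical:

```shell
cd "$(mktemp -d)"
# one hypothetical row: name, gender, donation, date, country
printf 'Jane Doe,f,$25,2003-05-01,USA\n' > members.csv
sed -e 's/\$//g' \
    -e 's/\bUSA\b/US/g' \
    -e 's/,m,/,Male,/g' \
    -e 's/,f,/,Female,/g' \
    members.csv > members-clean.csv
cat members-clean.csv   # -> Jane Doe,Female,25,2003-05-01,US
```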

5. CiviCRM Customization

I created a few custom fields after looking through the old installation and the data I was importing. Not all of the old customizations were useful - some looked like accumulated cruft and I had no corresponding data in my spreadsheets. With my external id trick, I could also rely on being able to re-import any data that I didn't import the first time (at least, for the custom fields - re-importing relationships wasn't going to be as easy).
I also had the two public CiviCRM-related pages: the newsletter signup and the petition - they needed their own custom fields and profiles and groups.

6. Data Import

The bulk of the work now should have been relatively straightforward, but ended up being fiddly.
a. members spreadsheet
This was the hardest and most important, so I started with it. It had 2747 entries. As per the above note, I split it into:
mv - 'vip', steering committee memberships (15)
m22 - rows with 2 names and 2 membership (156)
m12 - rows with 2 names and 1 membership (87)
m11 - rows with 1 name and 1 membership (2489)
For the householding sheets (m12 & m22) I created 3 extra columns in which I generated ("external") ids for the household and individuals, looking like:
m22-h-140, m22-i1-140, m22-i2-140
i.e. <sheet code>-<h for the household, or i1/i2 for the two individuals>-<row number>
Fortunately, the other sheets later could all be simpler with just one external id per row, since there was no householding involved.
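
The id generation itself was done with spreadsheet formulas, but the pattern can be sketched on a CSV export (the file name and columns here are hypothetical):

```shell
cd "$(mktemp -d)"
# two hypothetical m22 rows: last name of individual 1, last name of individual 2
printf 'Smith,Jones\nLee,Chan\n' > m22.csv
# emit one household id and two individual ids per row, keyed by row number
awk -F, '{ printf "m22-h-%d,m22-i1-%d,m22-i2-%d\n", NR, NR, NR }' m22.csv > m22-ids.csv
cat m22-ids.csv
```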
Each of these sheets was then exported to CSV format, and then I did the imports.
First, each sheet got imported at least once for the contact information, and 3 times in the case of m12 and m22 (once for the household and twice for the two individuals in the household). When importing the individuals with households (i.e. m12 and m22), I chose not to import the mailing address of their household, to avoid duplicated mailings, but did import the phone number to all three. Instead, I used my 'external id' trick to relate the individuals to the household, which does contain their mailing address info. In these imports, I also imported the recurring donation information into a custom field of the first individual per record.
For each of these imports, I generated a new 'group' for the import, using the codes above. This is somewhat redundant, because you can regenerate these groups based on the external id, but I've left them in temporarily so you can check over the data more easily. Since they're ugly and distracting, they should be deleted eventually.
Then I imported all the donation information by importing each sheet up to 8 times - once for each donation amount/date pair. I imported the date as the 'received date' and the amount as the 'total amount', and set the donation type as 'donation' - i.e. only three fields, plus I used the external id to relate the donation to the first individual of each row.
Then I imported the membership data - which was just the 'date entered' as 'membership since', and the max of date entered and date renewed as 'membership start'. I imported the m22 sheet twice - once for each individual. There is some automation that renews a membership when a donation comes in, but it didn't do anything during the import. Subsequent donations (manually input) should automatically update the membership status.
The rest of the sheets were similar, but much simpler; notes follow.
b. Non-member donors sheet
744 records. Here I used the external id format d-<row number>. I imported the donations as a special 'MMP-Donation' type since they were marked specially on the sheet and didn't seem to bestow membership like a normal donation. I didn't generate a group for them.
c. non-member volunteers
Originally 749 records; only 721 imported after cleaning out ones with bad addresses. No external id - I just tagged all imports with the 'volunteer' tag that already existed.
d. newsletter list - non-members
Originally 862 records, only 859 valid; added to the group 'FVC Newsletter' and used the external id format n-<row number>.
e. organizations
Originally 61, imported 60, no external id.
f. petition - online signers - non-members
4501 records - put into the FVC Petition group, no external id. Also imported the 'Email newsletter?' custom field, petition sign date and party fields. Didn't put them into the newsletter list!
Final tally: 9,831 contacts imported.

7. Conclusion

CiviCRM and its import facility were impressive, fully capturing all the variety of data in these spreadsheets. It's now all there, with excellent functionality that wasn't in the original sheets.
I encountered a number of little bugs along the way, but the biggest one to note was that a few times the import would claim success but not do anything. That caused me hours of grief as I tried various ways of tricking it into thinking it was a new import (believing the problem to be a caching issue), but eventually I looked into files/civicrm/upload and discovered a log file containing a fatal PHP error that wasn't reported on the screen (related to an invalid value for a custom field).
Like all projects of this kind, it took longer than I'd hoped, but the result is actually better than I'd feared - very little was lost in translation. The total time was about 3 days.
Here's hoping that the tool helps the cause.

Tuesday, February 26, 2008

Drupal + CentOS + optimization

I've been working through various optimization issues today and thought I'd share them with my future self and anyone else who reads this.


mod_deflate (gzip compression)

I'd heard that getting apache to gzip your non-compressed data was a good idea and thought I was probably already doing that with my default apache2 setup on CentOS 4.4. What I learned was that:
  1. For apache2, the relevant module is mod_deflate (mod_gzip was its apache 1.3 predecessor)
  2. My CentOS included the module by default, but didn't enable it. I enabled it according to the excellent documentation on the apache web site.
  3. I found a test site, which says that the html is now about 25% of what it was, saving me bandwidth and improving the apparent response of my sites.
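
Enabling it amounted to a couple of lines in httpd.conf; a minimal sketch for Apache 2.0 (which MIME types to compress is a judgment call):

```apacheconf
LoadModule deflate_module modules/mod_deflate.so
# compress text responses; images and archives are already compressed
AddOutputFilterByType DEFLATE text/html text/plain text/css application/x-javascript
```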

Wim Leers

I found a great article about drupal optimization here: improving drupal's page loading performance. He refers to a firefox plug-in developed by Yahoo that looks like a great tool, as well as a list of key issues to analyse for any site, and how those can be addressed in various ways by Drupal. My key understanding here is that php code + mysql optimization is only a small part of the user experience of a fast site.

APC + mysql query cache

Yes, I use APC as my PHP cache and love it. And I've tuned MySQL somewhat to have a reasonable query cache. For handling sudden bursts of traffic (e.g. during the election), this combination is awesome - it means most traffic, even for complex pages, is handled by a bare minimum of cached PHP and cached MySQL calls. Great for server scaling, anyway.
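
The APC side is just a php.ini fragment; a minimal sketch (the shared-memory size is illustrative, not a recommendation):

```ini
extension = apc.so
apc.enabled  = 1
apc.shm_size = 64    ; MB of shared memory for the opcode cache
```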

css + javascript

Wim's article above refers to this issue, but it's worth thinking about on its own. I'd like to use the javascript compressor for all my custom and contrib module javascript and stick it at the end of the page html. And using Drupal's built-in css combine/compress mechanism seems more important now - I'd been ignoring it.


apache modules

The default apache2 setup for CentOS isn't optimized for Drupal - it comes with a lot of extra modules. I haven't done a rigorous paring yet, but I'd like to report on what I can remove when I do eventually manage that. I'd also like to consider having separate apache instances for https and CiviCRM so that each is more streamlined. I'm also using fastcgi for sympa, which would be nice to split out.