Skip to main content

CiviCRM Import: Advanced Techniques

I'm almost finished a big new CiviCRM import/installation that I've been working on for longer that I'd planned. That's pretty normal, and of course there were a number of things that should have been warnings:
  1. it's a huge mass of data from an established non-profit: more than 5000 constituents, more than 50,000 contributions, from more than 10 years.
  2. it's being imported from Raiser's Edge, the cadillac of desktop fundraising software.
  3. the clients are very attached to the detailed minutae they have accumulated about their constituents, all of which were faithfully entered into Raiser's Edge.
After a round of a sample import and looking at what the data might look like in CiviCRM, we were able to whittle it down a little bit, but the last month I've been working on the big final import and have developed some techniques that I expect would be generally useful with big CiviCRM imports.

Two Scripting Techniques

For my first test import, I pulled the CSV files into an OpenOffice spreadsheet and manually cleaned things up [e.g. dates, bad characters, etc.]. That works for small imports, but not for large ones, because:
  1. it's not reliably reproduceable [and/or it takes a long time]
  2. there are some things it doesn't do well [like splitting fields off to separate imports].
For my scripts in this project, I used two simple techniques:

a. add a column type row

before processing the CSV files, I added a row underneath the header, and inserted one of several 'column types' that my script could use to parse the data. For example, "Date" columns need to be converted into ISO-8601 format. "Select" columns are custom fields with option groups - so the labels in this column get cleaned up into values and those pairs get added into the database as valid options. "Note" fields get written to a different file and then imported as activities. I wrote separate scripts for each of the client files to process [i.e. constituents, notes, gifts, and extra attributes]. Some of the scripts also refer to the column header.

b. chunking

i still use the standard civicrm import technique [because I'm too lazy to script to write directly to the database, and also I'd like to not have to reinvent my wheels each time the database structures change], but the files I got from the client are way to huge - it would time out. I probably should have checked out the option of importing from a database via sql, but instead, my script just breaks the files up into no more than 1000 0r 2000 records at a time [reinsterting the header each time]. That means manually uploading a lot of files [e.g. 17 contribution files], so I have a little piece of paper handy on which i'm writing the upload file number and the time beside it so I can keep track while I'm editing this blog between uploads.
I should also say that the first time I did the sample import, I was really stretching my little staging machine. I'm now doing this import on a virtual server running off my desktop which I upgraded in the interim so it's a little less painful, though I still get a fair amount of time to edit my blog. You certainly don't want to do big imports on a live server!

Documentation

Importing into CiviCRM usually involves some degree of judgement calls. Some of the fields migrate naturally into CiviCRM fields [last name, province, ...] but all imports I've done include fields that have no built-in CiviCRM equivalent. For that, the standard advice is to turn it into a custom field, which can then be manipulated further after the import [e.g. turn it into a group, create equivalent functionality via a smart group, etc.].
This is where it usually pays to hire a professional - you have to consider not only how to get the data into CiviCRM, but you have to understand what kind of functionality you need from that data and import it somewhere where you can use it properly.
But more than that - what happens in a couple of months? In this case, there are so many fields and functions being imported that the mass of detail is almost overwhelming. So documenting what export field goes into what field/function in CiviCRM is essential not only for me [since I have to redo things with gaps of a month or two between imports], but even more for the staff and future staff of the client who are going to say "where did such and such go"?.
So my second technique is a simple one for documentation. I set up a new node type [I'm using Drupal w/ CCK and views of course], with these fields:

$content['type']  = array (
 'name' => 'CiviCRM Field Migration',
 'type' => 'civimigrate',
 'description' => 'Description of mapping of raiser\'s edge fields to civicrm fields.',
 'title_label' => 'Raiser\'s Edge Field',
 'body_label' => 'Description',
 'min_word_count' => '0',
 'help' => '',
 'node_options' =>
 array (
   'status' => true,
   'promote' => false,
   'sticky' => false,
   'revision' => false,
 ),
 'language_content_type' => 0,
 'upload' => '0',
 'scheduler' => 0,
 'scheduler_touch' => 0,
 'i18n_node' => '0',
 'old_type' => 'civimigrate',
 'orig_type' => '',
 'module' => 'node',
 'custom' => '1',
 'modified' => '1',
 'locked' => '0',
 'image_attach' => '0',
 'image_attach_size_teaser' => 'thumbnail',
 'image_attach_size_body' => 'thumbnail',
);
$content['fields']  = array (
 0 =>
 array (
   'label' => 'civicrm_field',
   'field_name' => 'field_civicrm_field',
   'type' => 'text',
   'widget_type' => 'text_textfield',
   'change' => 'Change basic information',
   'weight' => '1',
   'rows' => '1',
   'size' => 60,
   'description' => '',
   'default_value' =>
   array (
     0 =>
     array (
       'value' => '',
     ),
   ),
   'default_value_php' => '',
   'default_value_widget' => NULL,
   'group' => false,
   'required' => 0,
   'multiple' => '0',
   'text_processing' => '0',
   'max_length' => '',
   'allowed_values' => '',
   'allowed_values_php' => '',
   'op' => 'Save field settings',
   'module' => 'text',
   'widget_module' => 'text',
   'columns' =>
   array (
     'value' =>
     array (
       'type' => 'text',
       'size' => 'big',
       'not null' => false,
       'sortable' => true,
       'views' => true,
     ),
   ),
   'display_settings' =>
   array (
     'label' =>
     array (
       'format' => 'inline',
     ),
     'teaser' =>
     array (
       'format' => 'hidden',
       'exclude' => 0,
     ),
     'full' =>
     array (
       'format' => 'default',
       'exclude' => 0,
     ),
   ),
 ),
 1 =>
 array (
   'label' => 'civicrm_note',
   'field_name' => 'field_civicrm_note',
   'type' => 'text',
   'widget_type' => 'text_textfield',
   'change' => 'Change basic information',
   'weight' => '2',
   'rows' => '1',
   'size' => 60,
   'description' => '',
   'default_value' =>
   array (
     0 =>
     array (
       'value' => '',
     ),
   ),
   'default_value_php' => '',
   'default_value_widget' =>
   array (
     'field_civicrm_note' =>
     array (
       0 =>
       array (
         'value' => '',
         '_error_element' => 'default_value_widget][field_civicrm_note][0][value',
       ),
     ),
   ),
   'group' => false,
   'required' => 0,
   'multiple' => '0',
   'text_processing' => '0',
   'max_length' => '',
   'allowed_values' => '',
   'allowed_values_php' => '',
   'op' => 'Save field settings',
   'module' => 'text',
   'widget_module' => 'text',
   'columns' =>
   array (
     'value' =>
     array (
       'type' => 'text',
       'size' => 'big',
       'not null' => false,
       'sortable' => true,
       'views' => true,
     ),
   ),
   'display_settings' =>
   array (
     'label' =>
     array (
       'format' => 'inline',
     ),
     'teaser' =>
     array (
       'format' => 'hidden',
       'exclude' => 0,
     ),
     'full' =>
     array (
       'format' => 'default',
       'exclude' => 0,
     ),
   ),
 ),
 2 =>
 array (
   'label' => 'civicrm_transformation',
   'field_name' => 'field_civicrm_transformation',
   'type' => 'text',
   'widget_type' => 'text_textfield',
   'change' => 'Change basic information',
   'weight' => '3',
   'rows' => '1',
   'size' => 60,
   'description' => '',
   'default_value' =>
   array (
     0 =>
     array (
       'value' => '',
     ),
   ),
   'default_value_php' => '',
   'default_value_widget' =>
   array (
     'field_civicrm_transformation' =>
     array (
       0 =>
       array (
         'value' => '',
         '_error_element' => 'default_value_widget][field_civicrm_transformation][0][value',
       ),
     ),
   ),
   'group' => false,
   'required' => 0,
   'multiple' => '0',
   'text_processing' => '0',
   'max_length' => '',
   'allowed_values' => '',
   'allowed_values_php' => '',
   'op' => 'Save field settings',
   'module' => 'text',
   'widget_module' => 'text',
   'columns' =>
   array (
     'value' =>
     array (
       'type' => 'text',
       'size' => 'big',
       'not null' => false,
       'sortable' => true,
       'views' => true,
     ),
   ),
   'display_settings' =>
   array (
     'label' =>
     array (
       'format' => 'inline',
     ),
     'teaser' =>
     array (
       'format' => 'hidden',
       'exclude' => 0,
     ),
     'full' =>
     array (
       'format' => 'default',
       'exclude' => 0,
     ),
   ),
 ),
 3 =>
 array (
   'label' => 'civicrm_todo',
   'field_name' => 'field_civicrm_todo',
   'type' => 'text',
   'widget_type' => 'optionwidgets_onoff',
   'change' => 'Change basic information',
   'weight' => '4',
   'description' => '',
   'default_value' =>
   array (
     0 =>
     array (
       'value' => 0,
     ),
   ),
   'default_value_php' => '',
   'default_value_widget' =>
   array (
     'field_civicrm_todo' =>
     array (
       'value' => false,
     ),
   ),
   'group' => false,
   'required' => 0,
   'multiple' => '0',
   'text_processing' => '0',
   'max_length' => '',
   'allowed_values' => '0|
1|*',
   'allowed_values_php' => '',
   'op' => 'Save field settings',
   'module' => 'text',
   'widget_module' => 'optionwidgets',
   'columns' =>
   array (
     'value' =>
     array (
       'type' => 'text',
       'size' => 'big',
       'not null' => false,
       'sortable' => true,
       'views' => true,
     ),
   ),
   'display_settings' =>
   array (
     'label' =>
     array (
       'format' => 'hidden',
     ),
     'teaser' =>
     array (
       'format' => 'hidden',
       'exclude' => 0,
     ),
     'full' =>
     array (
       'format' => 'default',
       'exclude' => 0,
     ),
   ),
 ),
);
And then I imported the headers of my spreadsheets, grouping them with a taxonomy term corresponding to the spreadsheet name. Then a simple view allows me to list all the fields in each spreadsheet, and document the following information for each input field:

a. Input field name [the node title]
b. Note - description of the use of the field
c. CiviCRM field - name of the field in CiviCRM it's being migrated to
d. CiviCRM note - additional information about the civicrm field [e.g. custom, etc.]
e. CiviCRM transformation - processing of the original field before putting it into CiviCRM
f. Todo [bonus field - just use it as you need to for keeping track of loose ends].
At the end of this, I've got a nice table for each input spreadsheet documenting where that value is ending up in CiviCRM and what happens to it along the way.

Comments

Anonymous said…
Thanks, that was very helpful.

Popular posts from this blog

Varnish saves the day, in unexpectedly awesome ways.

Four and half years ago, I wrote a blog post about Varnish, a 'front-end proxy' for webservers. My best description of it then was as a protective bubble, analogous to how it's namesake is used to protect furniture. I've been using it happily ever since.

But last week, I got to really put Varnish through a test when the picture here, posted by Fair Vote Canada (one of my clients), went viral on Facebook. And Varnish saved the server and the client in ways I didn't even expect.

1. Throughput

Varnish prides itself on efficiently delivering http requests. As the picture went viral, the number of requests was up to about 1000 per minute, which Varnish had no trouble delivering - the load was still below 1, and I saw only a small increase in memory and disk usage. Of course, delivering a single file is exactly what Varnish does best.

2. Emergency!

Unfortunately, Varnish was not able to solve a more fundamental limitation, which was the 100Mb/s network connection. Becaus…

What to do in the age of Trump?

Well, that's the question of the day. If you're part of an organization that does advocacy work, rather than waiting to see what happens first, might as well get yourself ready, even if the details are sketchy still. Here's one opportunity that's ready for you now, message courtesy of Steve Anderson of OpenMedia.

OpenMedia, David Suzuki Foundation, SumOfUs and a range of other organizations are supporting a new shared set of civic engagement tools.

Vancity Community Foundation is providing some support to subsidize some of the cost of the tools to select values-aligned organizations that sign up before February 28th.

Interested? You can learn more or book a demo from here: http://tools.newmode.net/

Here's some live examples of the tools you can take a look at:

1. Click to Call: http://www.davidsuzuki.org/blogs/healthy-oceans-blog/2016/11/to-help-protect-canadas-oceans-weve-made-it-easy-to-call-your-mp/#newmode-embed-4-266

Check out this video of David Suzuki's d…

Me and varnish win against a DDOS attack.

This past month one of my servers experienced her first DDOS - a distributed denial of service attack. A denial of service attack (or DOS) just means an attempt to shut down an internet-based service by overwhelming it with requests. A simple DOS attack is usually relatively easy to deal with using the standard linux firewall called iptables.  The way iptables works is by filtering the traffic based on the incoming request source (i.e., the IP of the attacking machine). The attacking machine's IP can be added into your custom ip tables 'blacklist' to block all traffic from it, and it's quite scalable so the only thing that can be overwhelmed is your actual internet connection, which is hard to do.

The reason a distributed DOS is harder is because the attack is distributed from multiple machines. I first noticed an increase in my traffic about a day after it had started - it wasn't slowing down my machine, but it did show up as a spike in traffic. I quickly saw that…