DAC CRS code lists: multiple versions?

I noticed that an update of the IATI Sector code list is upcoming, as the source list - the DAC CRS Purpose Codes - has been updated.

This is where I run into a slight difficulty. I can access six versions (in English) of this list, via the OECD website, at different URLs:

Via a page called DAC and CRS code lists:

And then, at “Purpose Codes: sector classification”:

I think these might be the same list, available in different formats (PDF, xls, XML), which could be acceptable. But, the XML version looks to be very out of date (last updated 2015-06-23T16:42:28), whilst the budget identifier voluntary codes do not appear on the 2015 or 2016 purpose codes PDF…

The fact that a) all these documents are at different URLs and b) there’s a lack of consistent or accessible changelog information are two reasons that slow me down when trying to answer the following:

  • which list is definitive?
  • when was it last updated?
  • what was added in the last update?

It seems vital that where IATI relies on external code lists, people can ask these questions, and get some answers. For lists published via IATI, that seems to be the case. Is it reasonable to ask the same of DAC and CRS code lists?

NB: there’s a chance that I’m looking in the wrong place, or have missed a clarification document (which is a further issue), so please correct me if so

@matmaxgeds @YohannaLoucheur bearing in mind our discussion at Sectors & sector mappings , I’d value any insight you have here…

Hi Steven,

TL;DR I don’t think you’ve missed anything here, and I would also like this to be improved!

This is a very timely post. I’ve been wrestling with this for a while, and it is quite difficult to know how best to rationalise the Pupose codes, as well as others.

From what I can tell, the most consistently up-to-date record of the codes is the .xls file you’ve linked above. The only approach which has worked for me is to get the excel sheet and spend quite a long time wrangling it into a useable format.

This becomes very difficult given the many quirks of this spreadsheet, including:

Side-by-side tables:

Colour coded data:

Repeated heading rows:

There is also the occasional merged cell, trailing space, and other similar things.

None of this makes processing the codes impossible, but it does make the task time consuming and not one that can easily be solved once and for all with software.

It seems to me that the most simple solution by far would be to make the XML version the authoritative one and have the excel sheet generated from it.

Similarly, I’d be very interested to see if I’m missing something obvious here!

Super useful thread. @rory_scott, I think it is quite unlikely that the DAC would release the codelists in XML format by default.

I did some work on converting the CRS codelists from the source Excel to the IATI codelists XML format, for some other work on a simple IATI-compatible projects database I built:

My main aim was to generate bilingual EN/FR CSV files. You run the scripts in the following order:

./get.sh
python crs_convert.py

It follows the pattern outlined in this pull request from @bjwebb (which I really think should be merged)

The crs_lib.py file provides patterns to interpret the Excel spreadsheets.

NB, it does not handle historical codes (i.e codes that previously existed, but no longer do) – these are just removed at the moment, but should instead be marked as withdrawn.

Hi there

I won’t wade into the technical discussion here, but am wondering whether anything could be done on the governance side to make things easier.

For instance, Mark you say it’s unlikely that the DAC would release codelists in XML. Does anyone know if this was ever brought up with the DAC? Would it be technically difficult for them to release XML codelists? Are there other solutions that could be explored with the DAC? I wonder if this may be something to bring through the WP-STAT.

Hi @YohannaLoucheur ,

Thank you for this suggestion. I actually think the the governance side is the only viable avenue for substantial progress on this topic. Some progress was made on releasing XML codelists, as evidenced by the existence of the (now very outdated) files mentioned by @stevieflow above (see the penultimate link here), but it seems that this has gone cold in 2016.

For me, the hierarchy of preference for solutions to this is as follows (high preference to low):

  1. There’s a proper API for CRS++, meaning that codelists are accessible through queries, and have proper metadata (lots of investment, but would reap huge benefits for CRS over time);
    Or,
  2. Codes are stored in structured data primarily, and then presented in graphical formats as needed. In this case, XML files can be used as the authoritative source, and .xls files are generated from those XML files (shouldn’t be that difficult, and could bring lots of short term benefit for introducing changelogs / notifications etc.);
    Or,
  3. There is at least a commitment to update the structured data version of the codelists on a predictable and frequent basis (unless the original file was compiled entirely manually, this should be very easy - just pressing ‘go’ on whatever script made the last one).

In all of these cases, three things are completely necessary:

  • A changelog;
  • Notifications;
  • A real commitment to never reuse codes that have been deprecated;

I have tried to contact people whom I believe to be relevant, but with little success.

@markbrough thank you for directing me to your code. I’ve written some similar scripts in R and Pandas, but the issue is that no matter how sophisticatedly a script has been written, there’s no guarantee that the next spreadsheet will have the same structure as the one it was written for. For instance, having just cloned your script and updated the URLs (which have been changed), I can see that there’s been an arbitrary change which will stop the script from working at the first hurdle:


Rorys-DI-MBP:IATI-Codelists-NonEmbedded roryscott$ python crs_convert.py
Getting mapping Sector
Traceback (most recent call last):
  File "crs_convert.py", line 51, in <module>
  #### verbose debugging removed #### 
xlrd.biffh.XLRDError: No sheet named <'Purpose code'>

That sheet is now called ‘Purpose codes’. Clearly this isn’t an insurmountable problem, but what if there is a more subtle difference which just means that the script runs incorrectly but doesn’t halt? Things start to become more complicated when we put our trust in non-deterministic procedures.

Now consider their XML version:

<Codelist name="Sector" xml:lang="en" category-codelist="SectorCategory" complete="1">
    <metadata>
        <url>http://www.oecd.org/dac/stats/dacandcrscodelists.htm</url>
        <name>"DAC 5 Digit Sector"</name>
    </metadata>
    <codelist-items>
        <codelist-item>
            <code>32164</code>
            <category>321</category>
            <name>Chemicals</name>
            <name xml:lang="fr">Produits chimiques</name>
            <description>Industrial and non-industrial production facilities; includes pesticides production.</description>
            <description xml:lang="fr">Production industrielle et non industrielle ; y compris fabrication des pesticides.</description>
        </codelist-item>
        <!-- more codelist items -->
    <codelist-items>
</Codelist>

This would make makes the lives myself and other IATI/CRS users significantly easier, but it would also allow a much more responsive and rapid effort in joining up CRS codes with others, helping to make data much more interoperable. Minimally, I could just use a diff-checker to make sure none of the element or attribute names have changed, and we could even introduce a schema to start to standardise.

I recognise that there may be serious counter-arguments to the points I’m making but I would be very interested to start a dialogue about them and I’d be interested to hear what could potentially be achieved from the governance angle.

@rory_scott you make some excellent points, I agree with everything you say. However I’m still not all that optimistic at the likelihood of success of getting the codes released in XML consistently and by default. Would it not be possible to take a twin track approach, something like:

  1. Short-term technical fix: Merge Ben’s pull request and look at running something similar to your/my code to update the CRS codelists periodically. The codelists are not updated that frequently, maybe max every few months, so some manual checking (diffing, or whatever) would probably not be too labour intensive. This would be an improvement on the current situation where codelists have to be manually edited, where IATI and CRS codelists are out of sync for lengthy periods, and where French translations of CRS codelists are not available.
  2. Medium-term governance fix: see whether it would be possible for DCD/DAC to release the codelists in a more structured format and work with them to implement that. Is this something the IATI Secretariat could take up (or already has taken up?) with DCD?
1 Like

@YohannaLoucheur @rory_scott I agree that a technical solution driven through a political commitment in WP-STAT is reasonable and possible.

Development Initiatives has for the past 6 months been collecting all data from the OECD DAC via its API. We appear to be the first rigorous users of this service as the system has had numerous bugs. The technical team at the DAC have leant over backwards to fix the glitches and improve the service. There is no reason why they wouldn’t respond in a similar spirit if this was recognised to be a priority.

IATI’s commitment to encourage non-DAC reporters to use DAC codes is an important contribution to making a wider pool of data on resource flows compatible and I’m sure we can persuade the WP-STAT that providing a robust service is in everyone’s interest.

@markbrough I think at this point in time we are better off limping on in our present way while engaging with WP-STAT rather than investing in sticking plasters.