Planning for machine readable, version controlled OECD-DAC codelists

Some of us (OK, @andylolz!) noticed that the DAC CRS codelists were updated by the OECD on May 9th.

It’s great that we have something from the community (also @andylolz!) in place to try and track this.

However, it would be great to know when and how the OECD-DAC will move these codelists to a platform that ensures scalability, version control and machine readability. I think there’s an upcoming WP-STAT meeting - can anyone inform us (maybe @OJ_ or @Herman ?)?

A while back, I posted something about characteristics we might want to see from an external codelist. Perhaps we could start to list and detail what we’d expect from CRS DAC codelists in the IATI community, given that we are so reliant on their effective management?

Dear Steven

Yes, that’s the spirit. I have previously shared a few items for such a list, the basic point being the need for a historical dimension.

This will include from-to dates, but we must remember that these have nothing to do with the date on which updates are released: all codes are valid for the statistical reporting of particular calendar years, and changes made e.g. at the June meeting this year will take effect for the final reporting on 2016 flows. Thus, codes currently on the list may end up with 31-12-2015 as ‘end date’, and some codes that are not on the list yet will be published in June on the understanding that their start date was actually 01-01-2016.

The current, specific channel codes should be considered ‘serial numbers’, since they no longer carry any information about the category or parent relationship. But if we intend to establish historical data for more than just a couple of years, we reach back to a time when the specific channel codes did carry that meaning, i.e. when organisations got new IDs if they were moved from one category to another. Thus, a long-reaching historical dataset would require a ‘previous code’ option.
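To make the two requirements above concrete, here is a minimal sketch of what one record with a historical dimension might look like. The field names (`start_date`, `end_date`, `previous_code`) and the sample codes are hypothetical, invented for illustration; they are not taken from any DAC format.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ChannelCode:
    code: str
    name: str
    start_date: date                     # start of the first reporting year covered
    end_date: Optional[date] = None      # end of the last reporting year, None if open
    previous_code: Optional[str] = None  # hypothetical link to an earlier id

    def valid_for(self, reporting_year: int) -> bool:
        """True if the code covers the whole of the given reporting year."""
        starts_in_time = self.start_date <= date(reporting_year, 1, 1)
        still_open = self.end_date is None or self.end_date >= date(reporting_year, 12, 31)
        return starts_in_time and still_open


# Invented sample data: an organisation that was re-coded for 2016 reporting.
records = [
    ChannelCode("11111", "Example organisation", date(2010, 1, 1), date(2015, 12, 31)),
    ChannelCode("22222", "Example organisation", date(2016, 1, 1), previous_code="11111"),
]


def codes_for_year(items, reporting_year):
    """Return the codes that are valid for the given reporting year."""
    return [r.code for r in items if r.valid_for(reporting_year)]
```

Note that the dates describe the reporting years a code applies to, not the day an update was released, so a lookup is always made by reporting year.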

The Secretariat has responded very positively, including to the idea of providing the rich metadata as well (e.g. documentation of when each change to the list was decided).

So I do believe this could end up as a successful cooperation.

Yours OJ

This is a timely topic, particularly given that new updates to DAC codelists keep on coming! Additionally, representatives from the IATI Technical Team met with the OECD DAC last week to discuss how we can better support each other’s work.

Our use of replicated codelists was a key topic as part of this. We gained a greater understanding of how DAC codelists are reproduced, and shared our experiences of working with them.

There were a number of issues which will be of interest to the IATI and development data communities:

Machine readability of codelists

As we know, the DAC codelists are made available in XLS (spreadsheet) and XML (fully machine-readable) formats. However, the DAC recognise that in the past there have been occasions where the content of the XLS and XML versions has been inconsistent. In part, this appears to be related to the resources available to manage the codelists. Currently, codes are stored at source in an internal (SQL) database, with the output XLS and XML versions curated manually.

However, by Autumn 2017, the DAC are planning to implement an automated system to generate both the XLS and XML versions, which should see an end to inconsistencies between versions and make it possible to work with DAC codelists in a fully machine-readable way.

Changelog

The DAC have been responsive to requests from their user community and have added a summary sheet to the main spreadsheet detailing changes since the last version. This is of great help for highlighting modifications, as it removes the need to identify these differences manually.

Example from the latest spreadsheet of codes:

With full machine readability of codes, it will be possible to generate changelogs more easily anyway. I would suggest that a daily script is created to identify changes to the XML version and store each new version in a git repository.
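As a rough illustration of the change-detection step of such a script, the sketch below compares two snapshots of a codelist. The XML layout (`<codelist>` containing `<code id="...">` elements with a `<name>` child) is an assumption made for the example, not the DAC's actual schema, and the sample codes are invented.

```python
import xml.etree.ElementTree as ET


def load_codes(xml_text):
    """Map code -> name, assuming a hypothetical <code id=".."><name>..</name></code> layout."""
    root = ET.fromstring(xml_text)
    return {el.get("id"): el.findtext("name", default="") for el in root.iter("code")}


def diff_codes(old, new):
    """Summarise added, removed and renamed codes between two snapshots."""
    shared = set(old) & set(new)
    return {
        "added": sorted(set(new) - set(old)),
        "removed": sorted(set(old) - set(new)),
        "renamed": sorted(code for code in shared if old[code] != new[code]),
    }


# Invented sample snapshots, standing in for yesterday's and today's download.
OLD = """<codelist>
  <code id="10001"><name>Original name</name></code>
  <code id="10002"><name>Code that disappears</name></code>
</codelist>"""

NEW = """<codelist>
  <code id="10001"><name>Revised name</name></code>
  <code id="10003"><name>Newly added code</name></code>
</codelist>"""

changes = diff_codes(load_codes(OLD), load_codes(NEW))
print(changes)
```

A scheduled job could write the freshly downloaded file into a git working copy and run `git add` and `git commit` only when the diff is non-empty, so that each commit corresponds to a real change.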

Reuse of codes

From our discussion, it was clear that the DAC share our frustration when codes are re-used. It seems that they feel they are forced into reusing codes when member organisations supply data. Nonetheless, this is something that they are striving to avoid. Additionally, we are grateful to DAC staff, who have been quick and helpful in answering queries about the scope of changes to code names (for example relating to DAC sector code 15114).

Making old codes available

The DAC are responsive to this ask and are planning to modify their source database to include metadata fields for code introduction (and presumably withdrawal) dates. Our understanding is that they will seek to make this public in output versions too. The above idea of storing codelists in a git repository will provide a way to generate this metadata even if it is not fed through to the output versions.
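A minimal sketch of how such metadata could be derived from stored versions, assuming each version has been reduced to a snapshot date and a set of codes (both invented here). These are observation dates only, i.e. when a code was first seen or first found missing, and not the statistical validity dates discussed earlier in the thread.

```python
from datetime import date


def observed_lifespans(snapshots):
    """Derive code -> (first_seen, first_missing) from dated snapshots.

    snapshots: iterable of (snapshot_date, set_of_codes), in any order.
    first_missing stays None while the code has never been found missing.
    A code that reappears keeps its first recorded disappearance; handling
    reuse properly would need a list of intervals rather than one pair.
    """
    lifespans = {}
    for snap_date, codes in sorted(snapshots, key=lambda s: s[0]):
        for code in codes:
            if code not in lifespans:
                lifespans[code] = (snap_date, None)
        for code, (first_seen, gone) in list(lifespans.items()):
            if code not in codes and gone is None:
                lifespans[code] = (first_seen, snap_date)
    return lifespans


# Invented history, deliberately out of order to show that sorting is handled.
history = [
    (date(2017, 6, 1), {"A", "B"}),
    (date(2017, 5, 1), {"A", "C"}),
    (date(2017, 7, 1), {"A", "B"}),
]

lifespans = observed_lifespans(history)
```

The granularity of the result depends on how often snapshots are taken, which is one argument for running the tracking script frequently.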

Summary

All in all, this was an encouraging meeting and a good insight into working practices. The publication of these lists in fully machine-readable XML format will be a game-changer for the management of the codes.

We would continue to encourage the DAC user community to make visible the positive impact that the production of DAC codelists makes to the wider development data community, as well as the importance of codelist management policies and our excitement about the potential of full machine-readability to improve efficiency and effectiveness.

It’s worth explicitly noting that this is not limited to the past, but in fact also applies to the present! The date-last-modified for the XML version says Jan 2016, and a quick eyeball suggests this is probably accurate. I believe @stevieflow has previously called for the link to the XML version to be removed from the OECD site, at least until it points to something up-to-date.


That said… It seems to me there’s a bit of complexity here, but the machine readable bit (i.e. XLS->XML) is the least of the problems. We know from experience that there’s XML… And then there’s “XML”. If IATI were to work with the XLS codelists that the DAC publishes right now, and help iron out issues that occur with those, it could help better inform the DAC about the requirements users (e.g. IATI) have.

Of course it would be useful if the DAC produced and maintained machine-readable codelists… But IATI would still need to (at least for a period) maintain some infrastructure to e.g. track changes (as @dalepotter mentions). Most of that can be set up right now (in fact, it’s exactly what https://andylolz.github.io/dac-crs-codes/ does). I don’t believe that would undermine efforts to get the DAC producing machine-readable codelists. IATI non-embedded codelists being months out-of-date is (in my opinion) largely independent of the DAC not publishing machine-readable codelists.

Thanks for these comments @andylolz

Following our face-to-face discussions with the OECD DAC in May, we have an open channel for dialogue and are next due to speak with our contact there this coming Friday. We will seek to find out about the status of the current XML version of their codelists that you mentioned. We will also be finding out more about how they are progressing with plans to automate the release of fully machine-readable XML, and will report back here. Aside from the issues that I mentioned in the above post and those raised by @rory_scott here, do let us know if there are other usability issues that you would like us to highlight in our ongoing dialogue.

We accept that some recent changes to non-embedded codelists have taken some time to push through (example), often where there have been queries about recycled codes that may impact the meaning of published data. We have been discussing our protocols for this amongst the IATI Technical Team and have been looking to find better ways to handle these.

Pending full agreement within the team (and corresponding clarification on the IATI Codelist Management page), we are moving towards a situation where replicated non-embedded codelists (i.e. those from the OECD DAC, ISO, IANA and the US National Geospatial-Intelligence Agency) are agreed to be replicated ‘as is’, after at least 7 days notice has been given on the Non-embedded codelists category. We will also be seeking to include withdrawn codes as part of the source codelists, using the status, activation-date and withdrawal-date attributes.
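For illustration, a withdrawn code carrying those three attributes could be serialised along the following lines. Only the attribute names `status`, `activation-date` and `withdrawal-date` come from the paragraph above; the element names (`codelist-item`, `code`, `name`, `narrative`) are my assumption about the codelist layout, and the sample values are invented.

```python
import xml.etree.ElementTree as ET


def codelist_item(code, name, status="active", activation_date=None, withdrawal_date=None):
    """Build one codelist entry; element names are assumed, not confirmed."""
    item = ET.Element("codelist-item", {"status": status})
    if activation_date:
        item.set("activation-date", activation_date)
    if withdrawal_date:
        item.set("withdrawal-date", withdrawal_date)
    ET.SubElement(item, "code").text = code
    name_el = ET.SubElement(item, "name")
    ET.SubElement(name_el, "narrative").text = name
    return item


# Invented example of a withdrawn code kept in the list rather than deleted.
item = codelist_item(
    "10002",
    "Retired example code",
    status="withdrawn",
    activation_date="2010-01-01",
    withdrawal_date="2015-12-31",
)
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
```

Keeping withdrawn entries in the file means that data published against an old code can still be resolved to a name.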

Hello

Following our face-to-face discussions with the OECD DAC in May, we have an open channel for dialogue and are next due to speak with our contact there this coming Friday. We will seek to find out about the status of the current XML version of their codelists that you mentioned. We will also be finding out more about how they are progressing with plans to automate the release of fully machine-readable XML, and will report back here.

Forgive me, but this strikes me as creating further complexity. It’s no doubt appreciated that the IATI tech team are talking to the OECD DAC team, but surely the OECD DAC representatives can signal their intentions and plans in an open and public channel, for us all to engage around? It’s 2017 - we have the internet.

Apologies - I do not wish to sound flippant, but I do think this issue could be dealt with if the party controlling and managing these codes could interact with the community that (in)directly use them.

We were indeed planning to encourage the OECD DAC to do this in our meeting tomorrow, with the person responsible for managing their codelists. Unless we plan to find an alternative source for the codelists currently provided by the OECD DAC, and given that we are a separate entity, ultimately all we can do is share our experiences and offer encouragement.

As an update to the posts above, we had a follow-up call with representatives from the OECD DAC last Friday (16th June).

Updates to OECD DAC codelists in XML format
It was confirmed that the current XML version of the OECD codelists has not been updated for some time. We understand that generating this XML file is, at present, a manual process, and the colleague tasked with this is hoping to undertake the work within a couple of weeks. They said they would let us know when the file has been updated, and we will post here when we receive this notification.

Progress on fully machine-readable (XML) codelists
There are no updates on work to automatically generate this file at this time. However, the OECD DAC are hoping to work on this over the summer, with automatically generated and fully machine-readable XML codelists available in the Autumn.

We have advised that all OECD DAC codelists should be made available in XML format (not just those used by IATI), for the benefit of the wider open and development data communities. Additionally, we have highlighted the importance of keeping the XML codelist file at the same URL, which they are happy to do.

Interest and involvement from the wider IATI community
We have highlighted the interest in the above-mentioned topics, such as in this thread. Given this, we encouraged the DAC to consider setting up a (perhaps quarterly) user group meeting/consultation call to speak directly with codelist users and other interested parties in the open and development data communities. However, given resourcing constraints, there is no capacity to do this, and it is felt that an informal relationship is most sustainable, given that they see their primary remit as serving DAC members. They have said that interested parties and data users are nonetheless welcome to get in touch using the contact information on their website.

Overall
We will continue to engage on the above matters and would also encourage others with an interest in OECD data to get in touch directly. This way, data producers and data users can work together to ensure that relevant data is produced in a way that best meets user needs and supports economic development and decision-making.

Maintained, machine readable versions of these codelists are now provided as a frictionless data core dataset, in CSV and JSON.

I’ve blogged about it here (please like and share etc etc).

It’s trivial to automatically replicate these as IATI codelists. It’s slightly less trivial if you want to keep track of removed (‘withdrawn’) codes, but still ultimately doable. I’ve added some comments about how I’d do it.
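One possible way to handle the ‘withdrawn’ part, sketched under my own assumptions rather than as the approach described in the linked comments: compare the latest CSV against the list produced by the previous run, and carry forward any code that has gone missing with a withdrawn status. The `code` and `name` column names and the sample rows are invented for the example.

```python
import csv
import io


def read_codes(csv_text):
    """Read code -> name from CSV text with (assumed) 'code' and 'name' columns."""
    return {row["code"]: row["name"] for row in csv.DictReader(io.StringIO(csv_text))}


def merge_keeping_withdrawn(previous, latest_csv, today):
    """Merge the latest CSV with the previous run, keeping vanished codes.

    previous: code -> {"name", "status", "withdrawal_date"} from the last run.
    today: ISO date string recorded for codes first found missing in this run.
    """
    latest = read_codes(latest_csv)
    merged = {
        code: {"name": name, "status": "active", "withdrawal_date": None}
        for code, name in latest.items()
    }
    for code, entry in previous.items():
        if code not in latest:
            merged[code] = {
                "name": entry["name"],
                "status": "withdrawn",
                # Keep an earlier withdrawal date if one was already recorded.
                "withdrawal_date": entry["withdrawal_date"] or today,
            }
    return merged


# Invented state from the previous run and an invented fresh download.
previous_run = {
    "10000": {"name": "Long gone", "status": "withdrawn", "withdrawal_date": "2016-01-01"},
    "10001": {"name": "Kept", "status": "active", "withdrawal_date": None},
    "10002": {"name": "Dropped", "status": "active", "withdrawal_date": None},
}
latest_download = "code,name\n10001,Kept\n10003,New\n"

merged = merge_keeping_withdrawn(previous_run, latest_download, "2017-09-01")
```

The merged result can then be written out in whatever codelist format is needed, with the withdrawn entries retained alongside the active ones.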
