Version 1 Files in DataStore

Really useful bit of analysis – thanks @IATI-techteam.

The above post doesn’t say whether v1.0x datasets will be included in the v2 (“new”) IATI Datastore. Can I take this opportunity to again flag IATI Transformer (described in this discuss post), which I think could be of use here? One option would be for the v2 (“new”) IATI Datastore to run any v1.0x IATI data through IATI Transformer as a pre-processing step. This is really trivial to do – you can even do it on the fly. So for instance, instead of using:

http://aidstream.org/files/xml/icauk-tg.xml

You’d use:

https://iati-transformer.herokuapp.com/transform.xml?url=http%3A%2F%2Faidstream.org%2Ffiles%2Fxml%2Ficauk-tg.xml

Does that sound like a viable option? That way, every effort would be made to ensure v1.0x data isn’t dropped completely.
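For anyone who wants to try this, here’s a rough Python sketch of that pre-processing step. It only assumes the transformer accepts a URL-encoded `url` parameter, as in the example above:

```python
import urllib.parse
import urllib.request

# Base endpoint of the IATI Transformer service (from the example above).
TRANSFORMER = "https://iati-transformer.herokuapp.com/transform.xml"

def transform_url(dataset_url):
    """Build a Transformer URL that upgrades a v1.0x dataset on the fly."""
    return TRANSFORMER + "?url=" + urllib.parse.quote(dataset_url, safe="")

def fetch_transformed(dataset_url):
    """Fetch the transformed (v2) XML for a v1.0x dataset URL."""
    with urllib.request.urlopen(transform_url(dataset_url)) as response:
        return response.read()

# Example: the AidStream file mentioned above.
xml = fetch_transformed("http://aidstream.org/files/xml/icauk-tg.xml")
```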

2 Likes

Hi Andy,

Apologies if it was not clear from the above post; we can confirm that v1.0x datasets will not be consumed by the new DataStore.

It is important to bear in mind that the new Datastore is not a database; rather it is a curation and display of files that are hosted on the IATI Registry. As such, unless publishers themselves host the transformed file after using your transformer tool and link back to the Registry, the dataset will not be picked up by the DataStore.

It is for this reason that we cannot ourselves run all v1 data through the transformer tool. We do however appreciate its availability as a public tool that can be used by publishers who are undergoing the v2 transition and we have certainly raised it as a potential transition option for the publishers that we are in touch with.

Apologies, you’re absolutely right. It was clear – I’m afraid I didn’t read it carefully enough.

Okay, great! This is a good decision in my opinion. I don’t see how the one-off upload would have worked (in terms of data ownership) so I’m relieved to hear this has changed.

I’m not sure I understand what this means. The datastore performs an ETL (Extract, Transform, Load) cycle. I think the “L” bit means it is a database. The “T” bit means it already performs a transformation/normalisation step. I’m just suggesting adding to that transformation/normalisation step.

Is there an easy (or even a known) way to check how many of those v1 files contain active activities, i.e. activities that have not yet reached their predicted end date? Those are the ones it would be most important for me to keep in the new datastore. If you could share a list of the URLs from the great analysis above, I might be able to figure this out myself.

Hi @matmaxgeds, if I were to do the above (someone more technical than me probably has a better approach), I’d use the current IATI datastore:

  1. Download the activities for all publishers we know are using v1, where those activities have an end date later than today.
  2. Filter out all v2 activities (the ones with numeric codes).

That should provide you with a list of v1 activities that are still active. Alternatively, use the activity-status attribute as a filter – there’s some debate over whether it’s updated more regularly than the activity dates. A rough sketch of this filtering is below.
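For illustration, here’s a rough sketch of step 2 plus the end-date check, assuming the activities for the v1 publishers have already been downloaded (step 1) into a single XML file. The filename and the exact date-attribute handling are placeholders:

```python
from datetime import date
from lxml import etree

# Placeholder filename: activities downloaded (step 1) for the
# publishers we know are still on v1.
doc = etree.parse("v1_publisher_activities.xml")
today = date.today().isoformat()

still_active = []
for activity in doc.findall(".//iati-activity"):
    # Step 2: v2 uses numeric activity-status codes (e.g. "2"), while v1
    # uses text codes ("Implementation"), so a numeric code suggests v2.
    status = activity.find("activity-status")
    code = (status.get("code") or "") if status is not None else ""
    if code.isdigit():
        continue
    # Keep activities with no end date, or an end date still in the future.
    end_dates = [d.get("iso-date") or (d.text or "").strip()
                 for d in activity.findall("activity-date")
                 if d.get("type") in ("end-planned", "end-actual", "3", "4")]
    end_dates = [d for d in end_dates if d]
    if not end_dates or any(d[:10] >= today for d in end_dates):
        still_active.append(activity.findtext("iati-identifier"))

print(len(still_active), "v1 activities look active")
```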

1 Like

Thanks @amys - If I get time to have a go I will be sure to share

@matmaxgeds you can use iatikit to get the answers you’re after.

Here’s a Jupyter notebook demonstrating how.

In summary:

  • Total v1.0x organisation datasets: 116
  • Total v1.0x activity datasets: 594
  • Total v1.0x activities: 68,698
  • Total v1.0x activities with activity-status of “pipeline/identification” or “implementation”: 24,125
  • Total v1.0x activities with no end date, or an end date in the future: 28,057
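For anyone who would rather not open the notebook, the counts come from a query along these lines. This is only a rough sketch: the iatikit attribute names used here (`version`, `filetype`, `activities`) are assumptions on my part, so check the notebook or the iatikit docs for the exact API:

```python
import iatikit

# One-off: fetch a local snapshot of all the data on the registry.
# iatikit.download.data()

registry = iatikit.data()

v1_org_datasets = 0
v1_activity_datasets = 0
v1_activities = 0

for dataset in registry.datasets:
    # Assumed attributes: `version` (e.g. "1.05") and `filetype`
    # ("activity" or "organisation").
    if not str(dataset.version).startswith("1."):
        continue
    if dataset.filetype == "organisation":
        v1_org_datasets += 1
    else:
        v1_activity_datasets += 1
        # Assumed: `dataset.activities` is iterable over activities.
        v1_activities += sum(1 for _ in dataset.activities)

print("v1.0x organisation datasets:", v1_org_datasets)
print("v1.0x activity datasets:", v1_activity_datasets)
print("v1.0x activities:", v1_activities)
```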

Hope that helps.

1 Like

Hah, @andylolz, ‘hope that helps’ is underselling it somewhat :slight_smile:

Based on this, I feel that not having v1 data in the new datastore would be a substantial loss until this figure comes down considerably.

I am aware that DI are working hard with many of the key publishers still using v1.

Suggested ways forward that I am aware of in the interim (until the publishers update their systems):

  1. Not implement the v1 aspects of validation yet
  2. Keep the existing datastore online
  3. Have the new DS use the iati-transform tool to convert them to v2 data
  4. Republish them all using iati-transform and a secondary publisher (although they still wouldn’t be available in d-portal and several other tools)
  5. Explore how many of these activities are actually closed but where the data has not been updated

In light of my recent commitments to being a good discuss citizen, how about we have a Skype (or similar) call for anyone who is interested? We could discuss:

  1. What are the strategic/political decisions that are affecting this?
  2. How this fits in/alters current Secretariat/DI workplans - and what scope there is for tweaking them?
  3. What is a reasonable number to aim for in terms of the abandoned data - or whether we should just drop this data?
  4. What the community can contribute to making this work - and how best to do that in a coordinated way?
  5. Whether the different technical options really work?

So, if keen, please let me know on this thread and I will set up a Doodle to find a time that works for everyone. Or maybe this process is better managed by DI/UNDP – please say if so.

Hi Matt,

Thanks for your thoughts; our response is below. We’re happy to set up a call if that would be helpful.

A few specific responses to your points:

The way the system has been built is that the Validator checks whether a file is 2.0x compliant; if it isn’t, the file is not parsed into the DS. This means there is no way to “turn it off”.

We would have to turn off all validation to allow v1 data into the DS or we’d have to delay the launch of the validator and the DS in order to build a v1 validation service, which would add considerable cost.

There is an agreement to keep the existing DS in place until the end of 2019.

We discussed with our board focal points the problems of having data that doesn’t have an owner; this gets into the core of how IATI is designed.

The way the whole IATI data system functions is that a publisher owns and hosts their own data. If we (as the tech team) started to transform and host people’s data, this would be a fundamental change to how IATI publishing works – a big change of mandate for the IATI secretariat, and something the IATI Board focal points agreed wasn’t the right direction for us to take.

One example of where it would get complicated is when a publisher chooses to update their data. We would then be left with two versions in the system: the transformed data held under a fake account, as well as the data owned by the publisher that they have since updated.

The IATI-transform tool will work in some cases but not in all. Not all mandatory elements and attributes in version 2 of the IATI Standard are required in version 1. The IATI-transform tool can turn existing version 1 data into version 2, but it cannot create data.

For example, it cannot provide an activity status, or an iso-date for an activity where none was originally given. That information needs to be added by the publisher.
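To illustrate the gap, here is a rough sketch of the kind of check a publisher could run over a v1 file to find activities that a transformer could not upgrade on its own. Only the two example fields mentioned above are checked, and the file path is a placeholder:

```python
from lxml import etree

def missing_v2_mandatory_info(path):
    """List activities lacking information that v2 requires and that a
    transformer cannot invent (activity-status, an iso-date on dates)."""
    doc = etree.parse(path)
    problems = []
    for activity in doc.findall(".//iati-activity"):
        missing = []
        if activity.find("activity-status") is None:
            missing.append("activity-status")
        if not any(d.get("iso-date") for d in activity.findall("activity-date")):
            missing.append("activity-date/@iso-date")
        if missing:
            problems.append((activity.findtext("iati-identifier"), missing))
    return problems

# Placeholder filename: point this at your own v1.0x activity file.
for identifier, fields in missing_v2_mandatory_info("my-v1-activities.xml"):
    print(identifier, "is missing:", ", ".join(fields))
```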

Hopefully the more detailed explanation above makes it a bit clearer as to why the IATI secretariat can’t re-publish data that we don’t own and how there isn’t something we can “turn off”.

In terms of what we are doing behind the scenes, we have been in touch with all the publishers that this pertains to, offering our assistance with the upgrade.

From our analysis, only 61 publishers on v1 are still active. We have reached out to all of these organisations and have been pleased with the level of engagement so far. The vast majority of these publishers only have their Organisation file in v1 and/or use AidStream to publish their data, meaning that the transition process is not particularly strenuous.

Getting the organisations to engage is the biggest barrier! We would really welcome your thoughts and ideas on how we can get better engagement.

It’s also important to note that version 1 data will still be available from the Registry.

As above, we’re happy to have a call with you and any other community members if you think it would be helpful.

2 Likes

Agreed that neither of the approaches described here is desirable. One alternative approach would be: if the dataset is v1.0x, attempt to transform it (i.e. pass the dataset through an XSLT), then continue as before, i.e. pass the transformed version through the validator.
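As a minimal sketch of that step (the stylesheet filename is a placeholder – I’m not suggesting such an official XSLT already exists):

```python
from lxml import etree

# Placeholder stylesheet that upgrades v1.0x activity files to v2.0x.
upgrade = etree.XSLT(etree.parse("iati-1.0x-to-2.03.xslt"))

def prepare_for_validation(path):
    """If a dataset is v1.0x, transform it first; then hand the result
    to the validator exactly as happens today."""
    doc = etree.parse(path)
    version = doc.getroot().get("version", "1.01")
    if version.startswith("1."):
        doc = upgrade(doc)  # returns the transformed (v2.0x) document
    return doc
```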

Thanks for this great response, @IATI-techteam!

Personally, I am pretty happy with this change. The analysis indicates that all major (“targeted”) organisations will hopefully be migrating to v2, so we will not miss their data. There are two exceptions: Scottish Government and JICA. I think it is OK that JICA’s data won’t be migrated, as my understanding is that it is CRS data anyway – so it would be preferable to move them to a better publication process instead. That leaves only the Scottish Government.

On @matmaxgeds’ point, I think the number of activities somewhat overstates the issue here. Thanks to @andylolz’ iatikit, I did a quick analysis based on the great analysis spreadsheet @IATI-techteam shared at the top of this thread – see here for details (here for code) – but in summary, looking only at activities:

  • 84% of activities are from targeted publishers
  • of the 62,056 activities from targeted publishers, all but 2,886 (for Japan and Scotland) should become available in v2.
  • there are only 12 untargeted V1 publishers with over 100 activities
  • I guess there could be other effects in terms of enabling traceability if the data from some of these smaller organisations is no longer so easily accessible, but I think that is OK if the data is normally quite stale.

Maybe I have got something wrong in my analysis, but I think this indicates that the vast majority of activities, and almost all from targeted organisations, should be covered by organisations that will convert soon.

So we may need to keep an eye on the remaining targeted publishers, but in the meantime, I think this looks good @IATI-techteam!


(As an aside: I think it might be good to have a conversation about what we do in the future in similar cases, where there are a large number of organisations with stale data.)

3 Likes

Thanks @IATI-techteam for the thorough analysis and explanations. Thanks @matmaxgeds @andylolz & @markbrough for your support analysis, commentary and pivot tables :slight_smile:

We should remind ourselves of how productive it has been to drill into the issue we first started to discuss in January, whilst also being mindful of the need for this key infrastructure to be delivered and available to us as planned.

Thanks to this latest discussion, I think we now have further evidence to advise the @IATI-techteam that this decision is productive. As @markbrough also highlights, there are further issues to discuss and deliberate on – but let’s make sure we consider those separately from this decision.

@amys I’ve created some guidance for AidStream users who need to update their activity or organisation file here: https://docs.google.com/document/d/1xs02GhqTuX3RhDNOV2P4ihGQ7z_PxnRECxRrUyvP3Hg/edit?usp=sharing

Most of these will be organisations that are not your targets, but if you are able to share the link, that’d be great.

2 Likes

Hi all,

Please see the following news-post about v1 deprecation:

Josh

1 Like

Thanks @JoshStanley, great to get the update – are you able to update progress against the targeted organisations on your useful Google doc, so we can see who we are still at risk of losing?

Hi Matt,

Of our initial list of target publishers, the following have already fully transitioned:

Asian Development Bank
Catholic Agency for Overseas Development (CAFOD)
Climate Investment Funds
European Commission - Humanitarian Aid & Civil Protection
The Global Fund to Fight AIDS, Tuberculosis and Malaria
European Commission - Service for Foreign Policy Instruments

Several more are in the process of transitioning and some have advised that they will upgrade alongside their next publication.

We will provide another update closer to the deprecation date (before the end of June); if you are interested in the progress of any specific publishers then do feel free to get in touch with the publisher directly (I’d be happy to link you with the contacts that we have).

Thanks,
Josh

2 Likes

Super useful update – thanks @JoshStanley – especially great that, with a push and a shove, people are transitioning.

1 Like

Hi all,

The following is an update on publishers’ transition to version 2 of the Standard, now that the v1 deprecation date has passed:

‘Targeted publishers’

The following is a list of ‘targeted’ publishers which have successfully transitioned all of their data to v2 of the Standard:

Asian Development Bank
Catholic Agency for Overseas Development (CAFOD)
CDC Group plc
Climate Investment Funds
European Commission – Humanitarian Aid & Civil Protection
European Commission – Service for Foreign Policy Instruments
The Global Fund to Fight AIDS, Tuberculosis and Malaria
UK – Home Office
UNITAID

We are also still actively engaging with most of the other ‘targeted’ publishers as they continue to go through the process. Some of these require just a few more files to be updated before they re-publish at v2, while others have completed the transition process and plan to publish at v2 in their next publication.

AidData are a slight anomaly in that there were limitations to their data (geo-location data was their primary focus when they first published), which meant that some activities were missing mandatory fields required to upgrade to v2. In this instance, we worked with them to ensure that the activities that do have the mandatory information were transformed, while the rest remain at version 1.

‘Non-targeted publishers’

In the lead-up to the deprecation date (30 June 2019), we sent a number of bulk emails to both the ‘active’ and the ‘non-active’ non-targeted publishers. The ‘active’ non-targeted publishers received emails tailored to certain features of their data (e.g. only the Org file at v1 vs. both Org and Activity files at v1; AidStream users vs. internally generated XML, etc.), while the ‘non-active’ non-targeted publishers were invited to contact the technical team service desk for advice and help with transitioning.

We liaised with a number of non-targeted publishers following this (and will continue to do so) and will provide a more detailed update about the number of v1 datasets/activities as a whole in the coming weeks.

Thanks,
Josh

3 Likes

Thanks loads for the continued updates @JoshStanley - are the ones who have not changed going to be removed from the dashboard? And if so, are we under 1,000 again?

Since Josh’s last data pull, 418 files have been moved from v1, equating to 30,213 activities. You can see the details on the July 2019 - Data tab.

@matmaxgeds the plan in the future is to remove v1 data from the Dashboard. However, this won’t change the count of IATI publishers.