IATI Datastore - what data should go in?

Hi @SJohns - thanks, it’s a very valid question :slight_smile:

In terms of a specific file having a mix of 1.0x and 2.0x activities within it, I don’t think this is actually possible. The version attribute is only applicable at the <iati-activities> element, not <iati-activity>, so it can only be declared once per file. It used to be different (in version 1.0x) - but this was changed in the move to 2.01 (see the changelog). @bill_anderson @IATI-techteam do you agree?
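To make that concrete, here’s a minimal sketch (with made-up identifiers) showing that in a 2.0x file the version lives once on the root element, and there is nothing to read per activity:

```python
from lxml import etree

# Minimal sketch: @version sits once on <iati-activities>, so every activity
# in the file shares it. Identifiers below are made up.
xml = b"""
<iati-activities version="2.03">
  <iati-activity><iati-identifier>XM-EXAMPLE-001</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XM-EXAMPLE-002</iati-identifier></iati-activity>
</iati-activities>
""".strip()

root = etree.fromstring(xml)
print(root.get("version"))     # "2.03" - declared once, file-wide
print(root[0].get("version"))  # None   - activities carry no version of their own in 2.0x
```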

However, the point still remains that it could be possible to publish a file with a mix of valid and invalid activities (in the same version). I think @andylolz did some stats on this too…

@SJohns: pragmatically, I’d suggest any publisher that can’t go back and update old v1.0x data should ensure all new activities are created in a brand new v2.03 activity file. This means all future data will be “datastore compliant”. And perhaps at some point, the old v1.0x data could be one-off converted.

That’s true – in the stats above, schema validation was performed at activity level (i.e. rather than validate each dataset, I validated each activity.) So in practice this means the “activity count” is a count of invalid activities, rather than a count of all activities inside invalid datasets.
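For anyone curious, the per-activity check can be done roughly like the sketch below - the file paths and the wrapping approach are assumptions on my part, not necessarily how the stats above were produced:

```python
import copy
from lxml import etree

# Rough sketch of validating each activity on its own (hypothetical paths).
schema = etree.XMLSchema(etree.parse("iati-activities-schema.xsd"))
root = etree.parse("some-dataset.xml").getroot()

invalid = []
for activity in root.findall("iati-activity"):
    # Wrap a copy of the activity in a minimal root that carries the file's
    # version attribute, so it can be validated in isolation.
    wrapper = etree.Element("iati-activities", version=root.get("version", "2.03"))
    wrapper.append(copy.deepcopy(activity))
    if not schema.validate(wrapper):
        invalid.append(activity.findtext("iati-identifier"))

print(f"{len(invalid)} invalid activities out of {len(root.findall('iati-activity'))}")
```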

Slightly off-topic, but does IATI give a ‘lifespan’ estimate when new versions of the standard are created? i.e. as with an operating system, where updates are guaranteed for X number of years. It seems like it might be helpful/standard practice to say with a new version that it will not be deprecated (dropped from the registry/core tools) for X years, or until X date?

@stevieflow @andylolz thanks for clarifying. I was really thinking of this scenario - older, poorer-quality activities within the same datafile as activities of good quality - but I didn’t express it well!! So just thinking about Andy’s suggestion - how would it work for AidStream users?

AidStream users (who are using the full version) should click on the button to upgrade to version 2.03 and then continue to add in their data for the current activities. If they have older, closed activities on AidStream that are poor quality (data missing/incomplete), then they can convert them to draft activities in AidStream by editing them. This means that when they publish the datafile, only the current activities will show up in a datafile that is tagged as iati-activities version=“2.03”.

This 2.03 datafile should (if no other issues) get pulled through to the new database without the older activities, which will no longer be publicly available. This should not therefore impact their funding (because the current activities are published) but will shorten their track record.

Then if an organisation has extra resources, they can go back and fix the older files if they want to show a longer track record.

For organisations with a smaller number of activities, this will be feasible to do. For organisations that use AidStream to publish many activities, for multiple donors, it’s going to be a headache, so the more time and warning you can give, the better.

Unfortunately, as soon as funders link an open, public good like IATI to withdrawing funding which an organisation receives to run their programmes (which vulnerable people depend on), it gets a lot more complicated than just excluding data and telling organisations to update it as and when.

@SJohns: I’ve replied on twitter with a suggested approach that doesn’t involve removing existing IATI data.

Thanks again @SJohns

I think we are getting into some of the implementation details, based on the agreement of the principles above.

@andylolz would it be possible to share your twitter feedback in a new thread, where we can discuss this in a dedicated space? @SJohns by no means am I saying we should ignore this - but I want to keep this thread to our shared three principles. In the same way we have a new discussion on follow-ups for licensing, we should detail the support needed for AidStream publishers in a concentrated channel.

Hi everyone

I’m just flagging that our technical advice to the @IATI-techteam & partners via @siemvaessen looks to be a clear line on the datastore initially ingesting data that is:

  • Valid to the relevant schema
  • Openly licensed
  • Version 2.0x (but actively checking valid/open 1.0x data alongside this)

As we can see, there are follow-ups and actions elsewhere, but I wanted to thank everyone for their input here, and pass this on to @KateHughes for the implementation of the datastore. Thanks!
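Purely as an illustration (and not the actual Datastore implementation), those three checks could be sketched roughly as below; the Registry metadata field name is an assumption:

```python
from lxml import etree

def should_ingest(dataset_path, registry_metadata, schema):
    """Rough sketch of the three agreed criteria. `registry_metadata` is an
    assumed dict of Registry fields; the real Datastore logic will differ."""
    doc = etree.parse(dataset_path)
    version = doc.getroot().get("version", "")
    return (
        schema.validate(doc)                                 # 1. valid to the relevant schema
        and registry_metadata.get("license_is_open", False)  # 2. openly licensed
        and version.startswith("2.")                         # 3. version 2.0x
    )
```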

Could you expand on this, @stevieflow? It’s unclear what this would mean for v1.0x publishers.

Thanks

@stevieflow I agree with @andylolz: it needs to be very clear what is going to happen with valid 1.0x data. I would expect the DS to process this data.

Agree?

See ^^. It has already been agreed that …

The DS spec does not require deprecated versions of the standard to be processed.

It was decided, pragmatically, that although DS will come online before V1.0 is deprecated, we are talking about a couple of months and it was not worth the effort to complicate the load.

Hopefully there will then be no active 1.x data publishers anymore after the deprecation date in June this year.

Thanks all

It’s useful for us to reaffirm our role as a community here. We’re giving technical advice (the TA of TAG!) to the datastore project. We’re not in a position to project manage the datastore, for example. For this reason, it’ll be great to hear a progress update from the @IATI-techteam

In terms of the discussion we’ve had so far on 1.0x, then apologies if I left that vague. My understanding is that we’d leave the door open for valid 1.0x data, but that other factors instigated by the @IATI-techteam may mean this becomes less of an issue:

  • Existing active 1.0x publishers shift to 2.0x before June
  • There’s a process in place to support the AidStream users, who may have a mix of versions

Jumping on this thread a little late – I think it would be great to ensure that the needs of the ultimate users of the data are factored in here. There are currently some big donors publishing v1.x data to IATI (see below). It would be really unfortunate if the data from these organisations, which is currently available through the Datastore, became no longer available.

I don’t really understand the suggestion of loading all v1.x data into the Datastore once, and then never again – I would assume the development time required would be more or less the same, and presenting out of date data to users from a Datastore that is supposed to update nightly would arguably be misleading. Perhaps a better approach would be to gracefully degrade for older versions – i.e., trying to load v1.x data on a “best efforts” basis, but not focusing development or maintenance time on this.

Here are a few suggestions about how to avoid losing access to all these organisations’ data:

  1. IATI tech team works intensively with priority organisations to support/encourage them to begin publishing v2.x data. I would argue that prioritisation should be based primarily on size of organisation.
  2. If there are still a large number of organisations (especially large ones) publishing v1.x data, then have a policy of gracefully degrading for no longer supported versions.
  3. The Datastore importer could perhaps use something like @andylolz’s v1.x to v2.03 transformer where possible to simplify the import process.

IATI Dashboard - Versions

v1.03

  • AFD (France)
  • AsDB
  • Finland
  • France (Ministry of Foreign Affairs)
  • Switzerland (SDC)
  • UNOPS

v1.04

  • European Commission (FPI)

v1.05

  • Germany (Environment Ministry)
  • Climate Investment Funds
  • New Zealand
  • The Global Fund to Fight AIDS, Tuberculosis and Malaria

The TAG consensus to deprecate v1 in June was based on the realistic expectation (based on the ongoing work of the Tech Team) that all big publishers will upgrade. Your Suggestion 1 has been going on for some time.

@bill_anderson that’s great to hear. Then perhaps we can just revisit this question around June, once we know how much progress these publishers have made.

Hi all,

Please note that a new topic has been created which outlines the technical team’s plans for version 1 files in the context of the new Datastore: Version 1 Files in DataStore

I would like to reopen this discussion.

I am pulling together data on resource flows going to Kenya in response to COVID-19. I know that the World Bank has given Kenya a $50m loan.

This can’t be right. In the interests of providing users with the maximum amount of usable data, we surely need to change the validation and datastore ingestion guidelines to operate at the activity level, not the file level.

I agree, but this is not only as per the specification for the Datastore, it is also expected behaviour from a data (standards) perspective.

How would you propose we solve this, while keeping only schema-valid data in the Datastore and at the same time providing all the data available in the raw (XML) files? I believe we spoke of the option where no data gets left behind (= the Datastore accepts all) with a default view from the Datastore only showing schema-valid data. In my opinion this should perhaps be addressed at the Registry level: the user will get notified (bombarded) if data is not schema valid, it will not become available in the Datastore, and it will perhaps become invisible on the Registry after some time? This is a data quality issue after all, and one of the reasons the new Validator should help out, right?

But I do agree this needs some sort of better solution than what is currently offered; the real issue, though, is data quality, which is probably best solved at the root cause and not by some other tool further down the data pipeline. So it’s back to the organisation that is responsible for its information dissemination, I’d argue.
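For what it’s worth, the “no data gets left behind” option could look something like the sketch below, with made-up field and parameter names:

```python
# Hypothetical sketch of "the Datastore accepts all, the default view shows
# only schema-valid data". Field and parameter names are made up.
activities = [
    {"iati_identifier": "XM-EXAMPLE-001", "schema_valid": True},
    {"iati_identifier": "XM-EXAMPLE-002", "schema_valid": False},
]

def query(include_invalid=False):
    """Default view filters to schema-valid activities; callers can opt in
    to seeing everything that was ingested."""
    return [a for a in activities if include_invalid or a["schema_valid"]]

print(len(query()))                      # 1 - default, schema-valid only
print(len(query(include_invalid=True)))  # 2 - everything ingested
```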

Just bumping this @bill_anderson as I could not resist starting to reread the full thread…

I agree with @bill_anderson on this – I also noticed recently that some files were not validating and therefore not entering the datastore just because some elements were ordered incorrectly.

I would go a couple of steps further than Bill and suggest:

  1. relaxing the strong requirement for every file to pass schema validation, in favour of a weaker “best efforts” attempt to import every activity (even if that activity fails validation), and alerting if particular activities could not be imported. For example, having elements in the wrong order shouldn’t present a major obstacle to importing data (a rough sketch of handling the element-ordering point follows this list).
  2. making it more visible (and actively contacting publishers) when datasets/activities fail validation, or cannot be downloaded (e.g. files were accidentally moved around on the server and became inaccessible through the registered URLs). Perhaps some combination of IATI Canary and the IATI Dashboard could be used for this.
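To illustrate the element-ordering point in suggestion 1, here is a very rough sketch of how an importer could reorder an activity’s children before validating. The ELEMENT_ORDER list is a truncated, assumed subset of the real schema sequence, which lives in iati-activities-schema.xsd:

```python
from lxml import etree

# Truncated, assumed subset of the element sequence the 2.0x schema expects.
ELEMENT_ORDER = [
    "iati-identifier", "reporting-org", "title", "description",
    "participating-org", "activity-status", "activity-date",
    "recipient-country", "recipient-region", "sector",
    "budget", "transaction", "result",
]

def reorder_children(activity):
    """Stable-sort an <iati-activity>'s children into schema order, leaving
    unknown elements (and comments) after the known ones."""
    def rank(child):
        if not isinstance(child.tag, str):  # comments / processing instructions
            return len(ELEMENT_ORDER)
        tag = etree.QName(child).localname
        return ELEMENT_ORDER.index(tag) if tag in ELEMENT_ORDER else len(ELEMENT_ORDER)
    for child in sorted(activity, key=rank):
        activity.append(child)  # re-appending an existing child moves it into place
    return activity
```

Whether that kind of normalisation belongs in the Datastore importer or in publishers’ own tooling is, of course, part of the debate above.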