IATI Datastore - what data should go in?

Then hopefully there will be no active 1.x data publishers anymore after the deprecation date in June this year.

Thanks all

It’s useful for us to reaffirm our role as a community here. We’re giving technical advice (the TA of TAG!) to the datastore project. We’re not in a position to project manage the datastore, for example. For this reason, it’ll be great to hear a progress update from the @IATI-techteam

In terms of the discussion we’ve had so far on 1.0x, apologies if I left that vague. My understanding is that we’d leave the door open for valid 1.0x data, but that other factors instigated by the @IATI-techteam may mean this becomes less of an issue:

  • Existing active 1.0x publishers shift to 2.0x before June
  • There’s a process in place to support the Aidstream users, who may have a mix of versions

Jumping on this thread a little late – I think it would be great to ensure that the needs of the ultimate users of the data are factored in here. There are currently some big donors publishing v1.x data to IATI (see below). It would be really unfortunate if the data from these organisations, which is currently available through the Datastore, were no longer available.

I don’t really understand the suggestion of loading all v1.x data into the Datastore once, and then never again – I would assume the development time required would be more or less the same, and presenting out of date data to users from a Datastore that is supposed to update nightly would arguably be misleading. Perhaps a better approach would be to gracefully degrade for older versions – i.e., trying to load v1.x data on a “best efforts” basis, but not focusing development or maintenance time on this.

Here are a few suggestions about how to avoid losing access to all these organisations’ data:

  1. IATI tech team works intensively with priority organisations to support/encourage them to begin publishing v2.x data. I would argue that prioritisation should be based primarily on size of organisation.
  2. If there are still a large number of organisations (especially large ones) publishing v1.x data, then have a policy of gracefully degrading for no longer supported versions.
  3. The Datastore importer could perhaps use something like @andylolz’s v1.x to v2.03 transformer where possible to simplify the import process (see the sketch below this list).
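
To make the “graceful degradation” idea in suggestions 2–3 concrete, here is a minimal Python/lxml sketch of a best-efforts loader. The `transform_v1_to_v203` argument is a hypothetical hook (e.g. a wrapper around a v1.x-to-2.03 transformer), not an existing Datastore or transformer API, and the version-handling details are illustrative only.

```python
from lxml import etree

def load_dataset(path, transform_v1_to_v203=None):
    """Best-efforts loader: take 2.0x data as-is; try to upgrade 1.x data
    if an upgrade step is available; otherwise skip with a warning instead
    of failing the whole import run."""
    doc = etree.parse(path)
    # Files published before 2.x may omit @version; treat a missing value as 1.x.
    version = doc.getroot().get("version", "1.0x")
    if version.startswith("2."):
        return doc
    if transform_v1_to_v203 is not None:
        return transform_v1_to_v203(doc)  # hypothetical upgrade hook
    print(f"Skipping {path}: version {version} not supported, no transformer configured")
    return None
```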

IATI Dashboard - Versions

v1.03

  • AFD (France)
  • AsDB
  • Finland
  • France (Ministry of Foreign Affairs)
  • Switzerland (SDC)
  • UNOPS

v1.04

  • European Commission (FPI)

v1.05

  • Germany (Environment Ministry)
  • Climate Investment Funds
  • New Zealand
  • The Global Fund to Fight AIDS, Tuberculosis and Malaria

The TAG consensus to deprecate v1 in June was based on the realistic expectation (given the ongoing work of the Tech Team) that all big publishers would upgrade. Your Suggestion 1 has been going on for some time.

@bill_anderson that’s great to hear, then perhaps we can just revisit this question around June, once we know how much progress these publishers have made.

Hi all,

Please note that a new topic has been created which outlines the technical team’s plans for version 1 files in the context of the new Datastore: Version 1 Files in DataStore

I would like to reopen this discussion.

I am pulling together data on resource flows going to Kenya in response to COVID-19. I know that the World Bank has given Kenya a $50m loan.

This can’t be right. In the interests of providing users with the maximum amount of usable data we surely need to change the validation and datastore ingestion guidelines to operate at the activity level, not the file level.

I agree, but this is not only as per the specification for the Datastore, but also expected behaviour from a data (standards) perspective.

How would you propose we solve this, keeping only schema-valid data in the Datastore while at the same time providing all the data available in the raw (XML) files? I believe we spoke of the option where no data gets left behind (i.e. the Datastore accepts everything), with the default Datastore view only showing schema-valid data. In my opinion this should perhaps be addressed at the Registry level: the user gets notified (bombarded) if data is not schema-valid, the data does not become available in the Datastore, and it perhaps becomes invisible on the Registry after some time? This is a data quality issue after all, and one of the reasons the new Validator should help out, right?

I do agree this needs a better solution than what is currently offered, but the real issue is data quality, which is probably best solved at the root cause rather than by some other tool further down the data pipeline. So it’s back to the organisation that is responsible for its information dissemination, I’d argue.

Just bumping this @bill_anderson as I could not resist rereading the full thread…

I agree with @bill_anderson on this – I also noticed recently that some files were not validating and therefore not entering the datastore just because some elements were ordered incorrectly.

I would go a couple of steps further than Bill and suggest:

  1. relaxing the strong requirement for every file to pass schema validation, in favour of a weaker “best efforts” attempt to import every activity (even if that activity fails validation), and alerting if particular activities could not be imported (see the sketch after this list). For example, having elements in the wrong order shouldn’t present a major obstacle to importing data.
  2. making it more visible (and actively contacting publishers) when datasets/activities fail validation, or cannot be downloaded (e.g. files were accidentally moved around on the server and became inaccessible through the registered URLs). Perhaps some combination of IATI Canary and the IATI Dashboard could be used for this.
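
To illustrate one part of suggestion 1, here is a minimal Python/lxml sketch of activity-level validation, so that one invalid activity does not block the rest of a file from being imported (a fuller best-efforts importer could go further and try to import the invalid activities too). The schema path and file names are illustrative; the wrapper reuses the original root element’s attributes so a single activity can be validated as a stand-alone document.

```python
import copy
from lxml import etree

# Illustrative path; the real schema ships with the IATI standard / Validator.
schema = etree.XMLSchema(etree.parse("iati-activities-schema.xsd"))

def split_valid_invalid(path):
    """Validate each activity on its own, so one bad activity does not
    block the rest of the file from being imported."""
    root = etree.parse(path).getroot()
    valid, invalid = [], []
    for activity in root.findall("iati-activity"):
        # Wrap a copy of the activity in a copy of the original root element
        # (keeping its version attribute etc.) so it validates as a document.
        wrapper = etree.Element(root.tag, attrib=dict(root.attrib))
        wrapper.append(copy.deepcopy(activity))
        if schema.validate(wrapper):
            valid.append(activity)
        else:
            invalid.append((activity.findtext("iati-identifier"),
                            str(schema.error_log)))
    return valid, invalid

valid, invalid = split_valid_invalid("publisher-file.xml")
print(f"{len(valid)} activities importable, {len(invalid)} to flag back to the publisher")
```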

where, which, when?

Let’s not relax these requirements; they have been created for a valid reason. Relaxing them goes against the consensus reached (in this thread, nota bene) and avoids the real issue: solving data quality first.

We should first and foremost address the latter, not start relaxing parts further down the data pipeline. It feels like going backwards, tbh, and invalidates the efforts and consensus in this thread; perhaps more importantly, it goes against what the MA has decided upon in the IATI Strategic Plan 2020-2025.

Exactly. @JoshStanley @petyakangalova & others do this very often. Perhaps they can chip in on some stats here or provide some aggregate figures from the Validator on schema-invalid files?

Thanks Siem - I would suggest relaxing this requirement as I don’t think it really leads to an improvement in data quality. It just means that more activities are not available through the Datastore, so you have to use an alternative if you want to access a more complete set of data (see Bill’s comparison between D-Portal and the Datastore, above).

Practically speaking: over 10% of publishers currently have invalid files, including a number of large organisations. When the Datastore is launched, perhaps increased use will create stronger incentives for these publishers to begin to fix their data, but in the meantime, we are missing a lot of data.

I know. This is why I’ve come back to this thread because it was community consensus that led to that being in the spec.

Personally I have always held the opinion that the datastore’s only clients are users - developers and savvy analysts - who should be provided with as much usable data as possible. The discussion that this thread cut short at the outset was how we could define ‘usable data’ without compromising the overall integrity of the standard. Feedback loops for improving data quality and other supply side issues are important but should have nothing to do with the datastore.

By spending a bit of time across the community agreeing on a set of rules defining “usable data”.

Seems to me like this is another unresolved discussion of (huge simplification here to make a point):

  1. IATI is primarily a data interface standard…therefore defend the standard working as it should…do not allow validator rejected files in the DSv2
    vs
  2. IATI is a transparency initiative…so show more data, even if it makes it harder to operate the standard…

Of course the two are linked, and do not exclude each other, but I think it would help us to take a lot of decisions if there was a clearer answer to the first question… theory of change… what have we learnt from 10 years of the original Theory of Change, etc.

Ok, but perhaps we should first look back at the rationale for following the schema, as @stevieflow pointed out last year:

My argument here is simple: if we start to support data that is not valid against the schema, why have a schema? Or even - what support are we giving to data users, if we supply invalid data?

We first need to come up with a solid answer here, prior to opening up the Datastore to data that does not follow its schema. Relaxing this requirement also introduces schema fatigue: “well, the Datastore accepts everything anyway, so just leave our non-schema-valid IATI data invalid”. Pretty slippery slope.

I do get your point, @bill_anderson, on invalidation at the dataset level though, like the WB example you pointed out, so perhaps we could look at activity-level schema validation rather than dataset-level validation? I do agree that rejecting a complete dataset because a single transaction-level element is schema-invalid should not be considered best practice.

Funny, we actually fixed the sequence of elements within an activity as part of the upgrade to 2.01…

A standard is for both producers and consumers, to make the exchange of information easier. We try to make it easier for producers by offering a schema (and hopefully a ruleset) that you can use to check your data before publishing. All with the intent to make it easier for more data consumers to use what is published.

The idea that the datastore just tries its best to process “anything” as a solution shifts the problem from the producer to the consumer. And it basically says: don’t try to develop your own IATI-consuming application, and feel free to publish just about anything.

We need to fix this by making data quality part of the publisher’s process. And so it needs to be adequately resourced and prioritised. Bombarding a technical department with emails won’t change a thing until management sees that this is a problem. It helps if they see that their data is missing.

This is what’s happening with grantees of the Dutch government: programme staff get called by grant managers to say they are missing from the dashboard and need to fix their data.

If an organisation like the World Bank is able to regularly update ~500MB across nearly 150 files, they should be able to do a simple schema validation step as part of their QA when publishing.

If it’s a matter of ordering the elements in the right way, I’d be happy to work with them on a simple XSLT script to do just that.
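
For what it’s worth, the same reordering step can be sketched in Python/lxml (Rolf suggests XSLT; this is just an alternative illustration). The element order below is abbreviated and illustrative; the authoritative order is the xsd:sequence in iati-activities-schema.xsd.

```python
from lxml import etree

# Abbreviated, illustrative order of iati-activity children in 2.0x;
# copy the full, authoritative sequence from iati-activities-schema.xsd.
ELEMENT_ORDER = [
    "iati-identifier", "reporting-org", "title", "description",
    "participating-org", "other-identifier", "activity-status",
    "activity-date", "contact-info", "activity-scope",
    "recipient-country", "recipient-region", "location", "sector",
    # ... remaining elements omitted here for brevity ...
    "budget", "planned-disbursement", "capital-spend", "transaction",
    "document-link", "related-activity", "legacy-data", "conditions",
    "result", "crs-add", "fss",
]
RANK = {name: i for i, name in enumerate(ELEMENT_ORDER)}

def sort_key(el):
    if not isinstance(el.tag, str):      # comments / processing instructions go last
        return len(RANK) + 1
    return RANK.get(etree.QName(el).localname, len(RANK))

def reorder_activity(activity):
    """Re-append children in schema order; the sort is stable, so the
    relative order of repeated elements (e.g. transactions) is kept."""
    for child in sorted(list(activity), key=sort_key):
        activity.append(child)           # appending an existing child moves it

doc = etree.parse("activities.xml")
for activity in doc.getroot().findall("iati-activity"):
    reorder_activity(activity)
doc.write("activities-reordered.xml", encoding="utf-8", xml_declaration=True)
```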

But I assume their technical staff is already well aware of this.

My guess is: it’s not a priority, you can be #2 in the Aid Transparency Index even though you publish schema-invalid files. And the IATI Secretariat is happy to push data consumers to accept your data, you don’t even have to do that yourself.

To echo Matt:

  • Is IATI still a data standard to make it easier to exchange information between all kinds of parties?
  • Or is it a database offered by DI to please some users, and we don’t care that the EU, USAID, governments, multilateral NGO networks, project management software platforms, etc, also need or want to exchange information between systems?

Making sure you have schema-valid XML was solved over 20 years ago. We need to push publishers to make that part of their production systems, so that we can move on to including business-rules compliance as well, and discuss actual business rules as part of the standard, instead of still being stuck at this basic level.

Agree. But you go on to focus solely on the supply side.

As a consumer I’m not interested in what producers should or shouldn’t be capable of doing. I just want the data. I’m not bothered whether my banana is malformed so long as it is edible.

I’m quite happy to hold my hand up and admit that for the best part of ten years I was part of a machinery (and community) that paid insufficient attention to users outside of our immediate supply chain. Now that I’m on the other side of the fence things look different …

This kind of (much-used) argument is fundamentally flawed. Improving data quality and maximising the use of what currently exists are two very separate ideas that actually reinforce each other.

Thanks for this discussion!

I think part of the issue is that 2.01 made it much easier to fail schema validation by requiring elements to be in a particular order (it did this in order to make certain fields “mandatory”, which I think was the wrong way of enforcing compliance, and a mistake). That didn’t matter much until now, because everyone could continue to use data even though it failed validation, but obviously it would begin to make much more of a difference if we stick to this approach.

I don’t think making it impossible to access schema-invalid data through the IATI Datastore shifts any problem from a consumer to a producer. At the moment, it just makes it much more difficult for the consumer to access the data (even if it’s just a question of one element in one activity in one file being in the wrong order). If publishers quickly resolved data validation issues, that would be fine. However, the evidence suggests that around 10% of publishers have invalid files, and the number has remained fairly stable for the last three years – see these charts.

As various people have mentioned, one way of squaring this circle might be for publishers to be automatically notified (or politely bombarded) when their data fails validation.

If you’re a publisher reading this thread – you can sign up for alerts from IATI Canary!

I agree. I’m not an XML expert, but isn’t there another way of checking mandatory fields without ordinality?
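
For illustration: in XSD 1.0, xs:sequence is the usual way to make child elements mandatory, and it fixes their order as a side effect; xs:all removes the ordering constraint but cannot express repeatable elements such as transaction, which is presumably why the schema uses a sequence. Presence checks that ignore order are typically done with rule-based validation (e.g. Schematron) or a small script. A minimal Python/lxml sketch, using an assumed (illustrative) subset of the mandatory elements:

```python
from lxml import etree

# Illustrative subset only; the authoritative list of mandatory elements
# is in the 2.03 activity standard.
MANDATORY = ["iati-identifier", "reporting-org", "title", "description", "activity-status"]

def missing_mandatory(activity):
    """Return the mandatory child elements that are absent, ignoring order."""
    present = {etree.QName(el).localname for el in activity if isinstance(el.tag, str)}
    return [name for name in MANDATORY if name not in present]

doc = etree.parse("activities.xml")
for activity in doc.getroot().findall("iati-activity"):
    gaps = missing_mandatory(activity)
    if gaps:
        print(activity.findtext("iati-identifier") or "(no identifier)", "is missing:", gaps)
```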

Flagging this from earlier in this thread:

I checked again today, and the number of schema-valid activities in schema-invalid datasets is now 74,752. It’s possible to validate at activity level and still provide access to raw XML, by excluding the invalid activities.
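
To illustrate that last point, a short Python/lxml sketch (using the same per-activity wrapper check as the earlier import sketch) that drops schema-invalid activities from a dataset so the remaining raw XML can still be served. It assumes the problems sit inside individual activities rather than on the root element, and the file paths are illustrative.

```python
import copy
from lxml import etree

schema = etree.XMLSchema(etree.parse("iati-activities-schema.xsd"))  # illustrative path

def strip_invalid_activities(in_path, out_path):
    """Write a copy of the dataset containing only schema-valid activities."""
    doc = etree.parse(in_path)
    root = doc.getroot()
    removed = 0
    for activity in list(root.findall("iati-activity")):
        wrapper = etree.Element(root.tag, attrib=dict(root.attrib))
        wrapper.append(copy.deepcopy(activity))
        if not schema.validate(wrapper):
            root.remove(activity)
            removed += 1
    doc.write(out_path, encoding="utf-8", xml_declaration=True)
    return removed

print(strip_invalid_activities("dataset.xml", "dataset-schema-valid.xml"), "activities excluded")
```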
