IATI Datastore - what data should go in?

I know. This is why I’ve come back to this thread because it was community consensus that led to that being in the spec.

Personally I have always held the opinion that the datastore’s only clients are users - developers and savvy analysts - who should be provided with as much usable data as possible. The discussion that this thread cut short at the outset was how we could define ‘usable data’ without compromising the overall integrity of the standard. Feedback loops for improving data quality and other supply-side issues are important, but should have nothing to do with the datastore.

By spending a bit of time across the community agreeing on a set of rules defining “usable data”.

Seems to me like this is another unsolved discussion of (huge simplification here to make a point):

  1. IATI is primarily a data interface standard…therefore defend the standard working as it should…do not allow validator rejected files in the DSv2
    vs
  2. IATI is a transparency initiative…so show more data, even if it makes it harder to operate the standard…

Of course the two are linked, and do not exclude each other, but I think it would help us make a lot of decisions if there were a clearer answer to the first question…theory of change…what have we learnt from 10 years of the original Theory of Change, etc.

Ok, but perhaps we should first look back at the rationale for following the schema, as @stevieflow pointed out last year:

My argument here is simple: if we start to support data that is not valid against the schema, why have a schema? Or even - what support are we giving to data users, if we supply invalid data?

We first need to come up with a solid answer here, prior to opening up the Datastore to data that does not follow its data schema. Relaxing this requirement also introduces schema fatigue: “well, the Datastore accepts everything anyway, so just leave our non-schema-valid IATI data as it is”. Pretty slippery slope.

I do get your point, @bill_anderson, about invalidating at dataset level though, like the WB example you pointed out, so perhaps we could look at activity-level schema validation rather than dataset-level validation? I do agree that rejecting a complete dataset because of a single schema-invalid transaction should not be considered best practice.
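
To illustrate what activity-level validation could look like, here is a minimal sketch (not the Datastore’s actual ingestion code; the file paths and the wrap-and-validate approach are my own assumptions): each `iati-activity` is wrapped in a one-activity document and checked against the XSD on its own, so one broken activity no longer rejects the whole file.

```python
# Minimal sketch of per-activity schema validation (illustrative only).
# Assumes lxml plus local copies of the IATI activities schema and a publisher file.
from copy import deepcopy
from lxml import etree

schema = etree.XMLSchema(etree.parse("iati-activities-schema.xsd"))  # hypothetical path
dataset = etree.parse("publisher-file.xml")                          # hypothetical path
root = dataset.getroot()

valid, invalid = [], []
for activity in root.findall("iati-activity"):
    # Wrap the single activity in a copy of the original root element so the
    # document-level attributes (version, generated-datetime) are preserved.
    wrapper = etree.Element(root.tag, attrib=dict(root.attrib))
    wrapper.append(deepcopy(activity))
    (valid if schema.validate(wrapper) else invalid).append(activity)

print(f"{len(valid)} schema-valid activities, {len(invalid)} rejected")
```

Only the activities that end up in `invalid` would be excluded; everything else could flow into the Datastore as it does today.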

2 Likes

Funny, we actually fixed the sequence of elements within an activity as part of the upgrade to 2.01…

A standard is for both producers and consumers, to make the exchange of information easier. We try to make it easier for producers by offering a schema (and hopefully a ruleset) that you can use to check your data before publishing. All with the intent to make it easier for more data consumers to use what is published.

The idea that the datastore just tries its best to process “anything” as a solution shifts the problem from the producer to the consumer. And it basically says: don’t try to develop your own IATI-consuming application, and feel free to publish just about anything.

We need to fix this by making data quality part of the publisher’s process. And so it needs to be adequately resourced and prioritised. Bombarding a technical department with emails won’t change a thing until management sees that this is a problem. It helps if they see that their data is missing.

This is what’s happening with grantees of the Dutch government: programme staff get calls from grant managers telling them they are missing from the dashboard and need to fix their data.

If an organisation like the World Bank is able to regularly update ~500MB across nearly 150 files, they should be able to do a simple schema validation step as part of their QA when publishing.

If it’s a matter of ordering the elements in the right way, I’d be happy to work with them on a simple XSLT script to do just that.
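
Not the XSLT itself, but just to show how small the fix is, a rough sketch in Python/lxml of the same idea (the `ORDER` list below is deliberately truncated and illustrative; the authoritative ordering is the xsd:sequence in the 2.0x schema, and the file path is made up):

```python
# Illustrative sketch: reorder the children of each iati-activity into the
# sequence the 2.0x schema expects. ORDER is intentionally incomplete here;
# fill it from the schema's xsd:sequence before using anything like this.
from lxml import etree

ORDER = ["iati-identifier", "reporting-org", "title", "description",
         "participating-org"]  # ...continue with the rest of the sequence

def sort_key(element):
    # Elements not listed in ORDER keep their relative position at the end.
    return ORDER.index(element.tag) if element.tag in ORDER else len(ORDER)

tree = etree.parse("publisher-file.xml")          # hypothetical path
for activity in tree.getroot().findall("iati-activity"):
    activity[:] = sorted(activity, key=sort_key)  # stable sort keeps ties in place

tree.write("reordered.xml", encoding="utf-8", xml_declaration=True)
```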

But I assume their technical staff is already well aware of this.

My guess is: it’s not a priority; you can be #2 in the Aid Transparency Index even though you publish schema-invalid files. And the IATI Secretariat is happy to push data consumers to accept your data, so you don’t even have to do that yourself.

To echo Matt:

  • Is IATI still a data standard to make it easier to exchange information between all kinds of parties?
  • Or is it a database offered by DI to please some users, and we don’t care that the EU, USAID, governments, multilateral NGO networks, project management software platforms, etc. also need or want to exchange information between systems?

Making sure you have schema-valid XML was solved over 20 years ago. We need to push publishers to make that part of their production systems, so that we can move on to including business-rule compliance as well, and discuss actual business rules as part of the standard instead of still being stuck at this basic level.

4 Likes

Agree. But you go on to focus solely on the supply side.

As a consumer I’m not interested in what producers should or shouldn’t be capable of doing. I just want the data. I’m not bothered whether my banana is malformed so long as it is edible.

I’m quite happy to hold my hand up and admit that for the best part of ten years I was part of a machinery (and community) that paid insufficient attention to users outside of our immediate supply chain. Now that I’m on the other side of the fence things look different …

This kind of (much-used) argument is fundamentally flawed. Improving data quality and maximising the use of what currently exists are two very separate ideas that actually reinforce each other.

Thanks for this discussion!

I think part of the issue is that 2.01 made it much easier to fail schema validation by requiring elements to be in a particular order (it did this in order to make certain fields “mandatory”, which I think was the wrong way of enforcing compliance). That didn’t matter much until now, because everyone could continue to use data even though it failed validation, but obviously it begins to matter much more if we stick to this approach.
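
(Just to illustrate the distinction: checking that a field is present doesn’t have to depend on where it sits. A rough, hypothetical sketch of a presence check that ignores order entirely; the element list is only an example of the kind of fields that were made mandatory, not an authoritative rule.)

```python
# Hypothetical sketch: flag activities that are missing commonly mandatory
# elements, without caring about element order. The REQUIRED list is
# illustrative; the real requirements live in the schema and rulesets.
from lxml import etree

REQUIRED = ["iati-identifier", "reporting-org", "title", "activity-status"]

tree = etree.parse("publisher-file.xml")  # hypothetical path
for activity in tree.getroot().findall("iati-activity"):
    missing = [name for name in REQUIRED if activity.find(name) is None]
    if missing:
        ident = activity.findtext("iati-identifier", default="(no identifier)")
        print(f"{ident}: missing {', '.join(missing)}")
```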

I don’t think making it impossible to access schema-invalid data through the IATI Datastore shifts any problem from a consumer to a producer. At the moment, it just makes it much more difficult for the consumer to access the data (even if it’s just a question of one element in one activity in one file being in the wrong order). If publishers quickly resolved data validation issues, that would be fine. However, the evidence suggests that around 10% of publishers have invalid files, and the number has remained fairly stable for the last three years – see these charts.

As various people have mentioned, one way of squaring this circle might be for publishers to be automatically notified (or politely bombarded) when their data fails validation.

If you’re a publisher reading this thread – you can sign up for alerts from IATI Canary!

2 Likes

I agree. I’m not an XML expert, but isn’t there another way of checking mandatory fields without ordinality?

1 Like

Flagging this from earlier in this thread:

I checked again today, and the number of schema-valid activities in schema-invalid datasets is now 74,752. It’s possible to validate at activity level and still provide access to raw XML, by excluding the invalid activities.

3 Likes

From a user perspective: if there is activity-level validation and some activities are dropped for failing it, it would be great to have some unmissable notification that the activities shown are not all of what the publisher intended to publish, because users quite often take the existence of data for a publisher as implying a comprehensive dataset from that publisher.

3 Likes

This is a huge number of activities that should reside in the Datastore. Let me check what the effort would be to validate at activity level. Perhaps @AlexLydiate could chime in?

Bumping this as it’s been dormant for a month and needs urgent attention imo.

I would like to revisit the initial IATI community consensus on the original RfP specification for the Datastore, where agreement was reached (not sure when/where anymore) on blocking non-schema-valid datasets from entering the new datastore.

With the knowledge we all have today, with both the 2020 validator and the datastore moving towards production, I propose the IATI community act on @bill_anderson’s suggestion and amend that requirement: move away from dataset-level schema validation towards activity-level schema validation. The initially proposed requirement omits huge amounts of data from the new datastore and is not in line with the IATI strategy on data use and data quality. Activity-level validation is a more granular approach and will ensure that all IATI schema-valid activities are stored in the Datastore, which is currently not the case and never will be if the current consensus remains in place.

If we make this amendment IATI will massively increase the availability of IATI data.

The current policy of blocking every activity in a schema-invalid dataset (even if 999 out of 1,000 activities are actually schema-valid!) seems unsustainable and should be revisited urgently.

cc @bill_anderson @IATI-techteam @stevieflow @Mhirji @matmaxgeds @markbrough @andylolz @David_Megginson @Herman @rolfkleef

Could I suggest that anyone objecting to this proposal speak up. Otherwise silence should be accepted as community consent for this amendment, thus allowing the secretariat to take the necessary steps to make this happen.

1 Like

Agree with you Mark: data order should be irrelevant. Not so sure about other examples such as missing transaction type codes, currency codes, etc. Those activities should i.m.o. not be in the data-store (these are the really inedible, rotten bananas).

Accepting files with schema errors and doing ‘activity level’ validation only would make file-level schema validation unnecessary.

But then the question is how you are going to do ‘activity level’ validation. If this is done only as part of the data-store ingestion process (validating and cleaning IATI data there), every existing IATI-XML-consuming application would be forced to use the data-store if it wanted validated IATI data.

The data-store would become the only source of validated IATI data, since validating the publisher’s raw IATI data against the XSD would lose its meaning. This is i.m.o. only acceptable if the data-store also provides fully IATI-standard-compliant XML output (which validates against the XSD), with just the erroneous activities removed.

Correct, it can still stay in place, but for datastore entry the policy would be revised towards activity-level validation.

Validator and Datastore will update their (technical exchange) arrangements to fit this amendment.

Hold on. Neither the Validator nor the Datastore will clean IATI data. This thread is about increasing IATI data availability by moving towards activity-level schema validation.

This is the case today, since only schema-validated datasets arrive in the datastore.

Correct, the datastore will serve fully IATI-standard-compliant XML output (alongside CSV and JSON formats).
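
For what it’s worth, a rough sketch of how that could work (not the actual Datastore implementation; it reuses the per-activity check sketched earlier in this thread, with hypothetical file paths): drop the schema-invalid activities and re-serialise the rest, so the output itself still validates against the XSD.

```python
# Rough sketch: remove schema-invalid activities and write out XML that should
# still validate against the XSD. Paths are illustrative.
from copy import deepcopy
from lxml import etree

schema = etree.XMLSchema(etree.parse("iati-activities-schema.xsd"))
tree = etree.parse("publisher-file.xml")
root = tree.getroot()

for activity in list(root.findall("iati-activity")):
    wrapper = etree.Element(root.tag, attrib=dict(root.attrib))
    wrapper.append(deepcopy(activity))
    if not schema.validate(wrapper):
        root.remove(activity)  # drop only the erroneous activity

print("document still schema-valid:", schema.validate(tree))
tree.write("cleaned.xml", encoding="utf-8", xml_declaration=True)
```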

But back on topic a bit: do you have fundamental objections to the proposed amendment of moving to schema validation at activity level?

I do not object to doing activity-level schema validation as long as the publisher is responsible for doing this. The above proposal shifts this responsibility from the publisher to the IATI data-store, thereby removing any incentive to act on bad data. The problem with that is that it lets the publisher off the hook. Providing usable data becomes a mostly technical responsibility.

Ultimately this proposal is about who is responsible for the quality of the data in the IATI ecosystem, and it partly shifts that responsibility. I am not sure that in the long term the benefits will outweigh the costs. Reading the thread I see different viewpoints on this. In addition, I would say this should not be decided solely by the tech team and the members of the community who are not on vacation right now; it warrants a broader audience (maybe an IATI WG?).

Nonetheless this should not be a showstopper for the further development of the data-store now. We can always decide to add this functionality in the next version of the data-store.

Correct me if I am wrong, but the raw XML at the publisher’s URL, validated against the XSD, is i.m.o. the primary source of validated (or rejected) IATI XML.

The reason for this amendment is that in some cases (as pointed out by @bill_anderson earlier) an error in a single transaction results in all the other, schema-valid activities being rejected under the current policy.

We all agree publishers are responsible for good data; that’s not really the point of this amendment.

This is about getting all schema-valid activities out to users. That’s the point of having a datastore in the first place. The current policy omits clearly valid data and therefore must be revised.

This amendment does not in any way shift responsibility from the publisher to the IATI data-store; it merely makes available for use the schema-valid activities that the current policy actively rejects.

Not sure what the exact work of changing to activity-level schema validation would entail, but from the datastore perspective it looks minimal. I’d argue that rejecting good IATI data is far worse.

Sure, we’re in the middle of summer, so no immediate action required. But actively blocking schema-valid activities from the datastore is not a healthy data policy and should not continue.

It implicitly does, i.m.o., because you will change the content of the publisher’s IATI XML (by omitting bad activities) without the publisher being aware.

What I meant here was not the cost of the technical implementation, but the cost of increasingly sloppy publisher data caused by implicitly omitting bad activities from a publication.

It would i.m.o. be better to provide feedback to the publisher that they have data quality problems instead of processing the XML anyway. My experience is that data quality improves when providing feedback.

Could you provide (automated) feedback to publishers when activities are rejected and inform the publisher that their IATI data will not be available in the data-store until they correct the errors?

(B.t.w., did anyone in the above examples try providing feedback to the publishers? If so, what was the response?)

Maybe I am not familiar enough with the data-store, but I was not able to retrieve the original publisher’s XML with all data elements. Is there an endpoint for that?

I’m late to the party, @siemvaessen, but I’m in full agreement. The packaging of IATI activities is arbitrary (a data provider could choose anything from one XML file per activity to one XML file for all their activities, ever), so an error in one activity in the package shouldn’t block the others from the data store.

4 Likes

@Herman I disagree with much of your approach. As an increasingly heavy user** (and champion) of IATI data I want access to as much usable data as possible. That’s what I expect from the datastore. Being told that I can’t have access because the datastore is on a mission to produce ‘pure’ data won’t wash.

In my particular use case my biggest problem is ensuring that both geographic and sector percentage splits add up to 100 so that I can do reliable transaction arithmetic. There’s a whole load of other QA issues that I’m not the slightest bit interested in. I would rather deal with my particular problem myself if I know that the Datastore is doing as little as possible to get in the way.
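
(For instance, a rough sketch of the sector-percentage check on my side; the tolerance, defaults and paths are just illustrative, not rules from the standard:)

```python
# Rough consumer-side sketch: flag activities whose sector percentages do not
# sum to 100 within a vocabulary. Recipient-country/region splits can be
# checked the same way. Defaults and tolerance are my own assumptions.
from collections import defaultdict
from lxml import etree

tree = etree.parse("publisher-file.xml")  # hypothetical path
for activity in tree.getroot().findall("iati-activity"):
    totals = defaultdict(float)
    for sector in activity.findall("sector"):
        vocab = sector.get("vocabulary", "1")            # assumed default vocabulary
        totals[vocab] += float(sector.get("percentage", "100"))
    for vocab, total in totals.items():
        if abs(total - 100) > 0.01:
            ident = activity.findtext("iati-identifier", default="(no identifier)")
            print(f"{ident}: sector vocabulary {vocab} sums to {total}")
```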

This has got nothing to do with letting the publisher off the hook. That’s got nothing to do with me (or the datastore). If we can get useful information based on usable data in front of a lot of people (not just those responsible for supply chain accountability) the incentives for publishers to improve their data will far outweigh our moral sticks.

(** For the record I have nothing to do with the supply side or governance any more)

Thanks to everyone for engaging with this conversation.

The Tech Team is currently focussing on ensuring that the original TOR are delivered as we prepare for launch. Any consideration of requirements outside of the current TOR will be led by the Tech Team after launch; we will engage with the community to make sure this complex issue in particular is fully explored. In the meantime, the Tech Team is contacting publishers with schema invalid files this week to urge them to address their issues.

Thanks again for all your good input above; we look forward to discussing this further after we launch.

2 Likes