IATI Datastore - what data should go in?

Could you expand on this, @stevieflow? It’s unclear what this would mean for v1.0x publishers.

Thanks

1 Like

@stevieflow i agree with @andylolz: it needs to be very clear what is going to happen with valid 1.0x data. I would expect the DS to process this data.

Agree?

2 Likes

See ^^. It has already been agreed that …

The DS spec does not require deprecated versions of the standard to be processed.

It was decided, pragmatically, that although the DS will come online before v1.0x is deprecated, we are only talking about a couple of months, and it was not worth the effort to complicate the load for that period.

Hopefully there will then be no active 1.x data publishers after the deprecation date in June this year.

1 Like

Thanks all

It’s useful for us to reaffirm our role as a community here. We’re giving technical advice (the TA of TAG!) to the datastore project. We’re not in a position to project manage the datastore, for example. For this reason, it’ll be great to hear a progress update from the @IATI-techteam

In terms of the discussion we’ve had so far on 1.0x, then apologies if I left that vague. My understanding is that we’d leave the door open for valid 1.0x data, but that other factors instigated by the @IATI-techteam may mean this becomes less of an issue:

  • Existing active 1.0x publishers shift to 2.0x before June
  • There’s a process in place to support the Aidstream users, who may have a mix of versions

Jumping on this thread a little late – I think it would be great to ensure that the needs of the ultimate users of the data are factored in here. There are currently some big donors publishing v1.x data to IATI (see below). It would be really unfortunate if the data from these organisations, which is currently available through the Datastore, became no longer available.

I don’t really understand the suggestion of loading all v1.x data into the Datastore once, and then never again – I would assume the development time required would be more or less the same, and presenting out of date data to users from a Datastore that is supposed to update nightly would arguably be misleading. Perhaps a better approach would be to gracefully degrade for older versions – i.e., trying to load v1.x data on a “best efforts” basis, but not focusing development or maintenance time on this.

Here are a few suggestions for how to avoid losing access to all these organisations’ data:

  1. IATI tech team works intensively with priority organisations to support/encourage them to begin publishing v2.x data. I would argue that prioritisation should be based primarily on size of organisation.
  2. If there are still a large number of organisations (especially large ones) publishing v1.x data, then have a policy of gracefully degrading for no longer supported versions.
  3. The Datastore importer could perhaps use something like @andylolz’s v1.x to v2.03 transformer where possible to simplify the import process.
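To illustrate the transformer idea in point 3: a pre-import step could detect the declared standard version of each file and route v1.x files through an upgrade transform before loading. This is a minimal sketch with illustrative function names, not the actual Datastore importer; note that, per the standard, a missing `version` attribute is treated as 1.01.

```python
# Hypothetical pre-import version check for a "best efforts" importer.
# v1.x files would be handed to an upgrade transformer; v2.x files load as-is.
from xml.etree import ElementTree as ET

def declared_version(xml_text: str) -> str:
    """Read the version attribute from <iati-activities>; absent means 1.01."""
    root = ET.fromstring(xml_text)
    return root.get("version", "1.01")

def needs_upgrade(version: str) -> bool:
    """True for any v1.x file, which would be routed through a transformer."""
    return version.startswith("1.")

sample = '<iati-activities version="1.05"><iati-activity/></iati-activities>'
print(needs_upgrade(declared_version(sample)))  # prints True: route to transformer
```

A real importer would then apply the transform and re-validate the result against the 2.03 schema before ingesting it.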

IATI Dashboard - Versions

v1.03

  • AFD (France)
  • AsDB
  • Finland
  • France (Ministry of Foreign Affairs)
  • Switzerland (SDC)
  • UNOPS

v1.04

  • European Commission (FPI)

v1.05

  • Germany (Environment Ministry)
  • Climate Investment Funds
  • New Zealand
  • The Global Fund to Fight AIDS, Tuberculosis and Malaria
2 Likes

The TAG consensus to deprecate v1 in June was based on the realistic expectation (based on the ongoing work of the Tech Team) that all big publishers will upgrade. Your Suggestion 1 has been going on for some time.

@bill_anderson that’s great to hear, then perhaps we can just revisit this question around June, once we know how much progress these publishers have made.

Hi all,

Please note that a new topic has been created which outlines the technical team’s plans for version 1 files in the context of the new Datastore: Version 1 Files in DataStore

I would like to reopen this discussion.

I am pulling together data on resource flows going to Kenya in response to COVID-19. I know that the World Bank has given Kenya a $50m loan, yet this activity is missing from the Datastore because the file containing it fails schema validation.

This can’t be right. In the interests of providing users with the maximum amount of usable data we surely need to change the validation and datastore ingestion guidelines to operate at the activity level, not the file level.

4 Likes

I agree, but note that this is both as per the specification for the Datastore and, from a data (standards) perspective, expected behaviour.

How would you propose we solve this, while keeping only schema-valid data in the Datastore and at the same time providing all the data available in the raw (XML) files? I believe we spoke of the option where no data gets left behind (i.e. the Datastore accepts everything), with the default Datastore view only showing schema-valid data. In my opinion this should perhaps be addressed at Registry level: publishers get notified (bombarded) if their data is not schema-valid, it does not become available in the Datastore, and it perhaps becomes invisible on the Registry after some time. This is a data quality issue after all, and one of the reasons the new Validator should help out, right?

But I do agree this needs a better solution than what is currently offered. The real issue, though, is data quality, which is probably best solved at the root cause rather than by some other tool further down the data pipeline. So it’s back to the organisation that is responsible for its own information dissemination, I’d argue.

1 Like

Just bumping this @bill_anderson as I could not resist rereading the full thread…

I agree with @bill_anderson on this – I also noticed recently that some files were not validating and therefore not entering the datastore just because some elements were ordered incorrectly.

I would go a couple of steps further than Bill and suggest:

  1. relaxing the strong requirement for every file to pass schema validation, in favour of a weaker “best efforts” attempt to import every activity (even if that activity fails validation), alerting when particular activities could not be imported. For example, having elements in the wrong order shouldn’t present a major obstacle to importing the data.
  2. making more visible (and actively contacting publishers) when datasets/activities fail validation, or cannot be downloaded (e.g. files were accidentally moved around on the server and become inaccessible through the registered URLs). Perhaps some combination of IATI Canary and the IATI Dashboard could be used for this.
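The “best efforts” import in point 1 could be sketched like this. It is purely illustrative: a stub check (an `iati-identifier` must be present) stands in for real XSD validation, and the function names are hypothetical, not the actual Datastore code.

```python
# Per-activity "best efforts" ingestion: instead of rejecting a whole file
# when one activity is invalid, validate each <iati-activity> separately,
# import the ones that pass, and flag the rest for the publisher.
from xml.etree import ElementTree as ET

def activity_is_valid(activity: ET.Element) -> bool:
    # Stand-in for full schema validation of a single activity.
    return activity.find("iati-identifier") is not None

def best_effort_import(xml_text: str):
    root = ET.fromstring(xml_text)
    imported, rejected = [], []
    for activity in root.findall("iati-activity"):
        (imported if activity_is_valid(activity) else rejected).append(activity)
    return imported, rejected

sample = (
    '<iati-activities version="2.03">'
    "<iati-activity><iati-identifier>XM-EX-1</iati-identifier></iati-activity>"
    "<iati-activity/>"  # invalid: no identifier
    "</iati-activities>"
)
ok, bad = best_effort_import(sample)
print(len(ok), len(bad))  # prints: 1 1 (one imported, one flagged)
```

The `rejected` list is exactly what point 2 would report back to publishers, so the two suggestions complement each other.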

where, which, when?

Let’s not relax these requirements; they were created for a valid reason. Relaxing them goes against the consensus reached (in this thread, nota bene) and avoids the real issue: solving data quality first.

We should first and foremost address the latter, not start relaxing parts further down the data pipeline. It feels like going backwards, to be honest, and it invalidates the efforts and consensus in this thread. Perhaps more importantly, it goes against what the MA decided in the IATI Strategic Plan 2020-2025.

Exactly. @JoshStanley, @petyakangalova & others do this very often. Perhaps they can chip in with some stats here, or provide some aggregate figures from the Validator on schema-invalid files?

1 Like

Thanks Siem - I would suggest relaxing this requirement as I don’t think it really leads to an improvement in data quality. It just means that more activities are not available through the Datastore, so you have to use an alternative if you want to access a more complete set of data (see Bill’s comparison between D-Portal and the Datastore, above).

Practically speaking: over 10% of publishers currently have invalid files, including a number of large organisations. When the Datastore is launched, perhaps increased use will create stronger incentives for these publishers to begin to fix their data, but in the meantime, we are missing a lot of data.

I know. This is why I’ve come back to this thread because it was community consensus that led to that being in the spec.

Personally I have always held the opinion that the datastore’s only clients are users - developers and savvy analysts - who should be provided with as much usable data as possible. The discussion that this thread cut short at the outset was how could we define ‘usable data’ without compromising the overall integrity of the standard. Feedback loops for improving data quality and other supply side issues are important but should have nothing to do with the datastore.

By spending a bit of time across the community agreeing on a set of rules defining “usable data”.

Seems to me like this is another unresolved discussion of (huge simplification here to make a point):

  1. IATI is primarily a data interface standard…therefore defend the standard working as it should…do not allow validator rejected files in the DSv2
    vs
  2. IATI is a transparency initiative…so show more data, even if it makes it harder to operate the standard…

Of course the two are linked, and do not exclude each other, but I think it would help us to make a lot of decisions if there were a clearer answer to the first question…theory of change…what have we learnt from 10 years of the original Theory of Change, etc.

Ok, but perhaps we should first look back at the rationale for following the schema, as @stevieflow pointed out last year:

My argument here is simple: if we start to support data that is not valid against the schema, why have a schema? Or even - what support are we giving to data users, if we supply invalid data?

We first need to come up with a solid answer here, prior to opening up the Datastore to data that does not follow its schema. Relaxing this requirement also introduces schema fatigue: “well, the Datastore accepts everything anyway, so just leave our non-schema-valid IATI data invalid”. A pretty slippery slope.

I do get your point @bill_anderson on invalidating at the dataset level though, like the WB example you pointed out, so perhaps we could look at activity-level schema validation rather than dataset-level validation? I do agree that rejecting a complete dataset because of a single schema-invalid transaction should not be considered best practice.

2 Likes

Funny, we actually fixed the sequence of elements within an activity as part of the upgrade to 2.01…

A standard is for both producers and consumers, to make the exchange of information easier. We try to make it easier for producers by offering a schema (and hopefully a ruleset) that you can use to check your data before publishing. All with the intent to make it easier for more data consumers to use what is published.

The idea that the datastore just tries its best to process “anything” as a solution shifts the problem from the producer to the consumer. And it basically says: don’t try to develop your own IATI-consuming application, and feel free to publish just about anything.

We need to fix this by making data quality part of the publisher’s process. And so it needs to be adequately resourced and prioritised. Bombarding a technical department with emails won’t change a thing until management sees that this is a problem. It helps if they see that their data is missing.

This is what’s happening with grantees of the Dutch government: programme staff get called by grant managers that they are missing from the dashboard, and need to fix their data.

If an organisation like the World Bank is able to regularly update ~500MB across nearly 150 files, they should be able to do a simple schema validation step as part of their QA when publishing.

If it’s a matter of ordering the elements in the right way, I’d be happy to work with them on a simple XSLT script to do just that.
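For what that reordering amounts to: an XSLT would do it declaratively, but the same idea fits in a few lines of standard-library Python, sorting each activity’s children into the sequence the schema expects. The `SCHEMA_ORDER` list below is a shortened, illustrative subset of the 2.03 activity sequence, not the full schema.

```python
# Sort an activity's child elements into (a subset of) the schema sequence.
# Elements not in the list are pushed to the end; Python's stable sort keeps
# repeated elements (e.g. multiple transactions) in their original order.
from xml.etree import ElementTree as ET

SCHEMA_ORDER = [
    "iati-identifier", "reporting-org", "title", "description",
    "participating-org", "activity-status", "activity-date",
    "recipient-country", "sector", "budget", "transaction",
]

def reorder_activity(activity: ET.Element) -> None:
    rank = {name: i for i, name in enumerate(SCHEMA_ORDER)}
    ordered = sorted(activity, key=lambda el: rank.get(el.tag, len(SCHEMA_ORDER)))
    for child in list(activity):
        activity.remove(child)
    activity.extend(ordered)

activity = ET.fromstring(
    "<iati-activity>"
    "<title><narrative>Example</narrative></title>"
    "<iati-identifier>XM-EX-1</iati-identifier>"
    "</iati-activity>"
)
reorder_activity(activity)
print([el.tag for el in activity])  # prints: ['iati-identifier', 'title']
```

This only repairs ordering, of course; other schema errors would still need fixing at the source.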

But I assume their technical staff is already well aware of this.

My guess is: it’s not a priority; you can be #2 in the Aid Transparency Index even though you publish schema-invalid files. And the IATI Secretariat is happy to push data consumers to accept your data, so you don’t even have to fix it yourself.

To echo Matt:

  • Is IATI still a data standard to make it easier to exchange information between all kinds of parties?
  • Or is it a database offered by DI to please some users, and we don’t care that the EU, USAID, governments, multilateral NGO networks, project management software platforms, etc, also need or want to exchange information between systems?

Making sure you have schema-valid XML was solved over 20 years ago. We need to push publishers to make that part of their production systems, so that we can move on to including business rules compliance as well, and discuss actual business rules as part of the standard, instead of still being stuck at this basic level.

4 Likes

Agree. But you go on to focus solely on the supply side.

As a consumer I’m not interested in what producers should or shouldn’t be capable of doing. I just want the data. I’m not bothered whether my banana is malformed so long as it is edible.

I’m quite happy to hold my hand up and admit that for the best part of ten years I was part of a machinery (and community) that paid insufficient attention to users outside of our immediate supply chain. Now that I’m on the other side of the fence things look different …

This kind of (much-used) argument is fundamentally flawed. Improving data quality and maximising the use of what currently exists are two very separate ideas that actually reinforce each other.