IATI Datastore - what data should go in?

I understand @IATI-techteam in the above, so this is in no way to try to contradict that timeline. I’m putting it here for posterity when this conversation re-opens after the initial launch.

I’m in agreement with @bill_anderson, @David_Megginson and others on this. I am routinely asked why IATI doesn’t line up with official statistics, or why expected IATI data is missing, or why we can’t trust IATI data. In my opinion it isn’t satisfactory to say that a publisher fluffed one number in a dataset of thousands of activities and therefore all of that data is inaccessible, and that this is by design.

Removing valid data in this arbitrary way will undermine trust in IATI, frustrate both users and publishers, and reinforce existing narratives questioning the viability of the entire corpus of data we work to produce.

Regarding @Herman’s concern about the onus moving away from the publisher: I understand this principle, but at the moment we need pragmatism. I would be happy with a number of measures to re-establish that onus that don’t involve removing access to valuable data.

For example:

  1. We could institutionalise IATI Canary, making use of it a basic participation prerequisite for engagement.
  2. We could take this further by publishing response times to data validation issues, and possibly push for this to become a relevant metric in future transparency evaluations such as the Aid Transparency Index or the IATI Dashboard.
  3. We could include a flag within the XML to denote validity, and put garish, unsightly banners on relevant D-Portal pages or other presentation sites to make it clear that there are validation issues.
  4. We could celebrate the rapid engagement with and resolution of data validation issues in newsletters and official communications (if the publisher consents).
  5. We could have a public ‘caution’ list of publishers with invalid data.

I’m not seriously suggesting all of these, and some of them might seem extreme, but for me they are all sensible* compared to removing an unknown quantity of valid data from the one official data store.


*To add some numbers to this sentiment (see workings here):

  • There are currently ~982k activities.
  • If we take the publisher stats and add an activity-to-file ratio value, we can see that the top 25 publishers by number of activities published account for ~814k activities, about 82.89% of the total.
  • These activities are split amongst 2,234 files (meaning a total activity-to-file ratio of 364 among them).

The median activity-to-file ratio among them is 530, and the arithmetic mean is 1,657. The gap between the two is driven by our top five activity-to-file publishers:

  • GlobalGiving.org
  • UN Pooled Funds
  • Norad - Norwegian Agency for Development Cooperation
  • Food and Agriculture Organization of the United Nations (FAO)
  • The Global Alliance for Improved Nutrition

Together these five account for 38,000 activities spread between 5 files.
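(For anyone who wants to reproduce the workings, something like the Python sketch below is all that is involved. The per-publisher figures here are made-up placeholders; the real numbers come from the publisher stats export.)

```python
# Minimal sketch of the workings above; the (activities, files) pairs are
# placeholders, not the real publisher stats.
from statistics import mean, median

TOTAL_ACTIVITIES = 982_000  # approximate size of the corpus at the time of writing

top_publishers = {
    "publisher_a": (19_000, 1),  # hypothetical values
    "publisher_b": (8_500, 2),
    "publisher_c": (5_000, 1),
    # ... remaining top-25 publishers from the stats export ...
}

activities = sum(a for a, _ in top_publishers.values())
files = sum(f for _, f in top_publishers.values())
ratios = [a / f for a, f in top_publishers.values()]

print(f"share of all activities: {activities / TOTAL_ACTIVITIES:.2%}")
print(f"overall activity-to-file ratio: {activities / files:.0f}")
print(f"median ratio: {median(ratios):.0f}, mean ratio: {mean(ratios):.0f}")
```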

Going back to our top 25 publishers by activity count, it’s fairly clear that one validation error in any of these publishers will mean a serious loss of valid IATI data.

If GlobalGiving have one missing sector field or other schema error, we could lose nearly 2% of all IATI data pertaining to nearly 10,000 organisations.

EDIT: changing the sector example as per @JoshStanley’s correction.


Just to be clear, data quality issues such as sector percentages not adding up to 100 will not prevent the dataset from being ingested by the Datastore, as this is a Standard rule (a must), rather than something that is dictated by the Schema.


Thanks for clarifying, Josh - I’ve changed the example to reflect that 🙂


@rory_scott and @bill_anderson I understand the need to have as much data available as possible. As mentioned before, I am not against activity-level schema validation for ingesting XML data into the Datastore as such, provided there is an active feedback mechanism to the publisher (‘active’ meaning that no action is required from the publisher to be informed about data quality issues).

@rory_scott proposes a number of interesting feedback mechanisms, to which I would like to add one: sending an e-mail to the e-mail address provided at the activity level (iati-activities/iati-activity/contact-info/email), or, if there is no such e-mail address, sending it to the contact e-mail address as stored in the registry.
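A rough sketch of that fallback, assuming the activity element has already been parsed and that the Registry contact address is looked up separately (the function and names here are illustrative, not an existing Datastore API):

```python
# Illustrative only: prefer the e-mail published on the activity itself,
# otherwise fall back to the publisher's contact address from the Registry.
from lxml import etree

def feedback_address(activity: etree._Element, registry_email: str) -> str:
    """Return the address that validation feedback should be sent to."""
    # iati-activities/iati-activity/contact-info/email, relative to the activity
    email = activity.findtext("contact-info/email")
    if email and email.strip():
        return email.strip()
    return registry_email  # no e-mail on the activity: use the Registry contact
```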

I object to any solution which silently skips activities during processing without any notification to the user or the publisher of the data. Users will be kept in the dark about the completeness of the data, and publishers will be kept in the dark about the quality problems in their data.

One last thought: if a large publisher has just a few tiny errors in the many activities published, why not simply contact that publisher and ask them to correct the problem? I.m.o. it is this lack of active engagement between data users and publishers that causes a great deal of these problems.


@Herman - I don’t think we’re far off: we both agree that the datastore shouldn’t try to ingest an iati-activity if it’s malformed. We also both agree that there should eventually be a feedback mechanism to let data providers know when the datastore does not ingest one of their activities because it’s malformed (and explain why), though I also acknowledge that this last part might be a new work item that needs to be triaged and scheduled.

The difference is over whether one malformed activity in an iati-activities package should cause the datastore to reject (e.g.) 999 other, well-formed activities in the same package, assuming that the error doesn’t affect parsing outside of the activity element.

Since the grouping of activities is non-semantic (the datastore discards the grouping and stores activities individually in any case), I think that would be an overreaction. OTOH, if there is an error at the top level (e.g. the attributes on iati-activities), that might justify a wholesale rejection, because we can’t be sure that we’re applying those attributes to the individual activities correctly.
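To make that concrete, here is a minimal sketch of the skip-and-continue behaviour described above, assuming the package parses as XML at all (the function and structure are illustrative, not the actual Datastore implementation):

```python
# Illustrative only: skip malformed activities, keep the well-formed ones,
# and reject the whole file only if the top-level element itself is unusable.
from copy import deepcopy
from lxml import etree

def ingestable_activities(path: str, schema: etree.XMLSchema):
    root = etree.parse(path).getroot()

    # Top-level problem (e.g. wrong root element or missing version attribute):
    # reject the whole file, since the iati-activities attributes apply to
    # every activity inside it.
    if root.tag != "iati-activities" or root.get("version") is None:
        raise ValueError(f"{path}: top-level error, rejecting the whole file")

    for activity in root.iterfind("iati-activity"):
        # Re-wrap the single activity so the schema (which expects the
        # iati-activities root) can be applied to it on its own.
        envelope = etree.Element("iati-activities", attrib=dict(root.attrib))
        envelope.append(deepcopy(activity))
        if schema.validate(envelope):
            yield activity  # well-formed: ingest
        # else: skip this activity and record it for publisher feedback
```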


Given we are considering this, please can we properly document it in the rules/IATI approach etc so that it is not just an ad-hoc decision:

  1. Spell out the logic so that it can apply in other situations, i.e. we are establishing the concept that in core IATI tools, where there is a conflict between availability of data and XYZ, we prioritise availability of data. I suspect that if we apply this lens more widely, we might find a lot more ‘tweaks’ to make to IATI tools: e.g. because IATI does not replicate several key codelists, thousands of activities do not have machine-readable sector or result narratives. If we follow the logic we are applying for activities in the datastore, we are establishing the primacy of user access to data, so should IATI replicate many more codelists than it currently does?
  2. What do we think about transactions - should we really ditch a whole activity if just one transaction fails to validate?
  3. It seems like it is also not sensible to ditch a whole activity if e.g. one result element fails validation
  4. There are different types of failing validation - please can we be clear that ‘fail’ in this case applies just to the ‘must’ elements of the rules, or have I misunderstood this?

For what it is worth, I think that we shouldn’t do this - if a major publisher cannot promptly fix a problem with their data that causes thousands of activities to disappear, I think we should not be trying to fix this via the datastore. I do not think that ‘they will get an email’ is a serious proposition, at least until we have proof that it works - I have previously tested the response rate to IATI emails with mixed results. Also, surely major publishers should themselves be checking e.g. that their activities are available in IATI tools - again, if they are not doing that, I think IATI has a problem that needs a solution that lies outside the datastore.

Finally, maybe it would also be better to work this through with actual cases - do we have other examples in addition to the World Bank case (and what has happened in that situation)? Is it also not worth seeing what happens when the validator is launched? We spoke multiple times about all being aware that once implemented it will cause a huge increase in rejections - presumably we agreed then that we were ok with this and expected that after a short period of rapid fixes, everything would settle down?

@David_Megginson: agree that when we have activity-level validation we can skip the bad activities. But only if there is an active feedback mechanism - not ‘eventually’ as you are suggesting, but from the moment activity-level validation is implemented in the Datastore ingestion process.

Agree with @matmaxgeds that we should define which kind of errors lead to:

  • processing the individual activity even if there is an error in its content (e.g. a non-existent sector code)
  • skipping the whole activity
  • rejecting all of a specific publisher’s files (if more than x% of the activities have errors, malformed XML, or XSD errors).

The question remains, of course, what is acceptable as a maximum percentage of errors (0.1%, 1%, 10%, ...?).
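To make that concrete, a minimal sketch of the kind of tiering being discussed (the categories, names and threshold are placeholders, not agreed values):

```python
# Placeholder tiering, not an agreed policy.
from enum import Enum, auto

class Severity(Enum):
    CONTENT = auto()  # e.g. a non-existent sector code
    SCHEMA = auto()   # XSD failure on the individual activity
    FATAL = auto()    # malformed XML / file-level problem

def activity_action(severity: Severity) -> str:
    """What to do with a single activity, given the worst error found in it."""
    if severity is Severity.CONTENT:
        return "ingest"   # process it, but report the content error
    if severity is Severity.SCHEMA:
        return "skip"     # skip just this activity
    return "reject-file"  # FATAL: the whole file is unusable

MAX_ERROR_SHARE = 0.01  # placeholder threshold (1% of activities)

def file_action(total: int, with_errors: int) -> str:
    """Reject a publisher file outright if too many activities have errors."""
    if total == 0 or with_errors / total > MAX_ERROR_SHARE:
        return "reject"
    return "process"
```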

About the reservation @matmaxgeds has about e-mail feedback to the publisher: e-mail feedback is by no means a panacea. It is i.m.o. better to engage with a publisher directly, but this rarely happens (see also the recent Catalpa report). For lack of a better feedback mechanism, I would rather have a working e-mail feedback process than nothing at all.