I understand @IATI-techteam’s position in the above, so this is in no way an attempt to contradict that timeline. I’m putting it here for posterity when this conversation re-opens after the initial launch.
I’m in agreement with @bill_anderson, @David_Megginson and others on this. I am routinely asked why IATI doesn’t line up with official statistics, why expected IATI data is missing, or why we can’t trust IATI data. In my opinion it isn’t satisfactory to say that a publisher fluffed one number in a dataset of thousands of activities and therefore all of that data is inaccessible, and that this is by design.
The removal of arbitrary, valid data will undermine trust in IATI, frustrate both users and publishers, and exacerbate existing narratives about the viability of the entire corpus of data we work to produce.
Regarding @Herman’s concern about the onus moving away from the publisher: I understand this principle, but at the moment we need pragmatism. I would be happy with a number of measures to re-establish that onus that don’t involve removing access to valuable data.
For example:
- We could institutionalise IATI Canary, making use of it a basic participation prerequisite for engagement.
- We could take this further by publishing the response times to data validation issues, and possibly push for this to be a relevant metric in future transparency evaluations such as the Aid Transparency Index or the IATI Dashboard.
- We could include a flag within the XML to denote validity, and put garish, unsightly banners on relevant D-Portal pages or other presentation sites to make it clear that there are validation issues.
- We could celebrate the rapid engagement with and resolution of data validation issues in newsletters and official communications (if the publisher consents).
- We could have a public ‘caution’ list of publishers with invalid data.
I’m not seriously suggesting all of these, and some of them might seem extreme, but for me they are all sensible* compared to removing an unknown quantity of valid data from the one official data store.
*To add some numbers to this sentiment (see workings here):
- There are currently ~982k activities.
- If we take the publisher stats and add an activity-to-file ratio for each publisher, we can see that the top 25 publishers by number of activities published account for ~814k activities, about 82.89% of the total.
- These activities are split amongst 2,234 files (meaning a total activity-to-file ratio of 364 among them).
The median activity-to-file ratio among them is 530, and the arithmetic mean is 1,657. The mean is pulled this high by our top five publishers by activity-to-file ratio:
- GlobalGiving.org
- UN Pooled Funds
- Norad - Norwegian Agency for Development Cooperation
- Food and Agriculture Organization of the United Nations (FAO)
- The Global Alliance for Improved Nutrition
Together these five account for 38,000 activities spread across just five files.
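For anyone who wants to check the arithmetic, here is a minimal Python sketch using only the rounded figures quoted in this post (the real workings use the exact per-publisher stats, so treat the results as approximate). The variable and function names are mine, purely for illustration.

```python
# Rounded figures quoted in this post; the exact values live in the workings.
TOTAL_ACTIVITIES = 982_000   # ~982k activities overall
TOP25_ACTIVITIES = 814_000   # ~814k activities from the top 25 publishers
TOP25_FILES = 2_234          # files published by those 25 publishers
TOP5_ACTIVITIES = 38_000     # activities from the five publishers listed above
TOP5_FILES = 5               # files published by those five


def share(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole


def activities_per_file(activities, files):
    """Return the pooled activity-to-file ratio."""
    return activities / files


top25_share = share(TOP25_ACTIVITIES, TOTAL_ACTIVITIES)
top25_ratio = activities_per_file(TOP25_ACTIVITIES, TOP25_FILES)
top5_share = share(TOP5_ACTIVITIES, TOTAL_ACTIVITIES)
top5_ratio = activities_per_file(TOP5_ACTIVITIES, TOP5_FILES)

print(f"Top 25 share of all activities: {top25_share:.2f}%")
print(f"Top 25 pooled activities per file: {top25_ratio:.0f}")
print(f"Top 5 share of all activities: {top5_share:.2f}%")
print(f"Top 5 activities per file: {top5_ratio:.0f}")
```

Note that the pooled ratio of 364 (all activities divided by all files) is not the same thing as the median and mean quoted above, which I am assuming are taken over the 25 per-publisher ratios. On these rounded numbers, each of those five files holds 7,600 activities on average, so a single rejected file there puts thousands of activities at risk.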
Going back to our top 25 publishers by activity count, it’s fairly clear that one validation error in any of these publishers will mean a serious loss of valid IATI data.
If GlobalGiving have one missing sector field or other schema error, we could lose nearly 2% of all IATI data, pertaining to nearly 10,000 organisations.
EDIT: changing the sector example as per @JoshStanley’s correction.