Late to this discussion, but I have some thoughts from OpenAg’s recent Tool Accelerator Workshop in London, some already partially represented here, others newer/more controversial. (Obligatory disclaimer: as a newcomer to this community, I’m blissfully ignorant of all the reasons why things are the way they are. Not sure if that’s a help or a hindrance; probably both in equal measure:)
- As @rory_scott mentioned, there can and should be levels of validation (“checks out” vs. opinionated review). However, we’ve found that in our discussions about data quality with publishers, we get much further by focusing on concepts rather than fields. Out of that experience, we discussed an opinionated review that is much more appreciative in nature: “Here’s what your data can do: allow sub-national mapping thanks to all that wonderful location data you provided; allow granular analysis thanks to your classifications of project type and services; and maaaybe you could do a bit more on linking to donors/partners to improve traceability.” Something like that.
- While some data quality issues can only be addressed by the publisher, OpenAg is also investigating ways we can grab their pen and add value to the data on our own. This is represented by two early-stage concepts:
  1) an auto-classification algorithm that uses machine learning to match concepts from a controlled vocabulary (in our case, potentially an agricultural vocabulary like AGROVOC, though other sector-specific codelists would theoretically work just as well) against project descriptions (if available) or project documents (if linked);
  2) a text extraction and matching algorithm that searches docs/descriptions for names of regions/districts/zones/towns and tags each one it finds via fuzzy match, perhaps with some human validation or logic to filter out the mentions of Washington, DC or London, UK that inevitably appear in donor-produced documents.
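To make the second concept concrete, here is a minimal sketch of gazetteer-based fuzzy tagging using only Python’s stdlib `difflib`. Everything here is invented for illustration: the gazetteer entries, the blocklist, and the 0.85 similarity cutoff are placeholder assumptions, and a real implementation would load place names from an actual sub-national gazetteer and handle multi-word names and casing.

```python
import difflib

# Hypothetical gazetteer of sub-national place names; a real run would
# load district/zone/town names from an external gazetteer source.
GAZETTEER = ["Kampala", "Gulu", "Mbale", "Lira", "Jinja"]

# Places that show up in donor documents for non-project reasons
# (headquarters addresses and the like) and should be filtered out.
BLOCKLIST = {"Washington", "London"}

def tag_locations(text, cutoff=0.85):
    """Return gazetteer entries fuzzy-matched against tokens in `text`.

    A token tags a place if its similarity ratio to a gazetteer entry
    meets `cutoff`, so light misspellings still match.
    """
    tags = set()
    for token in text.replace(",", " ").replace(".", " ").split():
        if token in BLOCKLIST:
            continue
        match = difflib.get_close_matches(token, GAZETTEER, n=1, cutoff=cutoff)
        if match:
            tags.add(match[0])
    return sorted(tags)

# "Mballe" is a misspelling that still clears the fuzzy cutoff.
print(tag_locations("Trainings were held in Gulu and Mballe districts."))
# → ['Gulu', 'Mbale']
```

The blocklist is the crude version of the filtering logic mentioned above; the human-validation variant would instead surface low-confidence matches for review rather than dropping them outright.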
- Following on from that idea: if we can add value to the data, there are then two options. Privately hand it back to publishers and ask nicely if they want to import it into their systems, or (I can anticipate the tension this will generate, but hear me out) create a separate “IATI Improved” data store/registry that makes improved data immediately available without waiting for the publisher to approve it.
- Young Innovations also proposed a tool (which they discussed elsewhere I think, though I can’t find the blog link right now) that would allow data users to suggest changes and updates to things like generic org names and IDs.
Happy to share/discuss more next week.