Technical measures to improve/incentivise better data quality

Late to this discussion, but I have some thoughts from OpenAg’s recent Tool Accelerator Workshop in London, some already partially represented, others newer/more controversial. (Obligatory disclaimer: as a newcomer to this community, I’m blissfully ignorant of all the reasons why things are the way they are - not sure if this is a help or a hindrance, probably both in equal measure:)

  • As @rory_scott mentioned, there can and should be levels of validation (checks out vs. opinionated review). However, we’ve found that in our discussions about data quality with publishers, we get much farther focusing on concepts rather than fields. Out of that experience, we discussed an opinionated review that was much more appreciative in nature - “here’s what your data can do: allow for sub-national mapping thanks to all that wonderful location data you provided; allow for granular analysis thanks to your classifications of project type and services; maaaybe you could do a bit more on linking to donors/partners to improve traceability.” Something like that.
  • While some data quality issues can only be addressed by the publisher, OpenAg is also investigating ways we can grab their pen and add value to the data on our own. This is represented by two early-stage concepts: 1) an auto-classification algorithm that uses machine learning to match concepts (in our case, potentially from an agricultural vocabulary like AGROVOC, though other sector-specific codelists would theoretically work just as well) based on project descriptions (if available) or project documents (if linked); 2) a text extraction and matching algorithm that searches docs/descriptions for names of regions/districts/zones/towns and tags each one it finds via fuzzy match (a rough sketch of the matching step follows this list), perhaps with some human validation or logic to filter out the mentions of Washington, DC or London, UK that crop up in donor-produced documents.
  • Following on from that idea, if we can add value to the data, there are then two options: privately hand it back to publishers and ask nicely if they want to import it into their systems, or (I can anticipate the tension this will generate, but hear me out) create a separate “IATI Improved” data store/registry that makes improved data immediately available without waiting for the publisher to approve it.
  • Young Innovations also proposed a tool (which they discussed elsewhere I think, though I can’t find the blog link right now) that would allow data users to suggest changes and updates to things like generic org names and IDs.
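To make the second concept above a bit more concrete, here is a minimal sketch of the location-tagging step, assuming we already have a plain gazetteer of place names for one country and a free-text project description. The gazetteer entries, the HQ stop-list and the similarity cutoff are purely illustrative; a real implementation would still need the human validation mentioned above.

```python
# Minimal sketch of fuzzy location tagging against a gazetteer.
# Gazetteer entries, stop-list and cutoff are illustrative only.
import difflib
import re

GAZETTEER = ["Kaduna", "Kano", "Katsina", "Kebbi"]   # hypothetical admin names
HQ_STOPLIST = {"Washington", "London"}               # filter donor HQ mentions

def tag_locations(text, cutoff=0.85):
    """Return gazetteer entries fuzzily matched against proper-noun-like tokens."""
    tokens = re.findall(r"[A-Z][a-zA-Z-]+", text)    # naive candidate extraction
    tags = set()
    for token in tokens:
        if token in HQ_STOPLIST:
            continue
        matches = difflib.get_close_matches(token, GAZETTEER, n=1, cutoff=cutoff)
        if matches:
            tags.add(matches[0])
    return sorted(tags)

print(tag_locations("Trainings were delivered in Kadunna and Kano states."))
# ['Kaduna', 'Kano']
```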

Happy to share/discuss more next week.

Only seeing Reid’s input today.

This is somewhat similar to an idea starting to form in my mind about a different approach to data quality assessment. Rather than assessing elements/attributes individually, we could check whether the data is sufficient/good enough to provide answers to specific questions (i.e. data uses). For one thing, it would highlight the dependencies between data elements, and the importance of attributes (which tend to be overlooked in current assessment approaches). It would also make the impact of quality issues more concrete, especially if the questions are relevant to the publisher itself.

I believe a set of 8-10 questions would cover the most important use cases (and most elements/attributes). For instance:

  • How much will the publisher’s operational projects disburse in country x in the next 12 months? (this requires planned disbursements broken down by quarter)
  • Which national NGOs are involved in the delivery of the publisher’s activities? (this requires all the details on the implementing partner)
  • Has the publisher implemented projects in support of the following sectors (work needed to create a list of sectors where 5-digit codes are necessary, e.g. primary education, basic health infrastructure)? (this would test the use of 5-digit DAC codes instead of 3-digit)
  • Has the publisher implemented projects at the district/village/something level (choosing a relevant admin 3 level)? (this would test geographic data)
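To show how such a question could become an automated check, here is a minimal sketch against IATI 2.0x activity XML for the first question (planned disbursements broken down by quarter). The element and attribute names follow the published IATI standard; the file name, the omission of the country filter, and the pass/fail thresholds are my own simplifications.

```python
# Sketch of a question-based check: does each activity report planned
# disbursements in roughly quarterly periods overlapping the next 12 months?
from datetime import date, datetime, timedelta
import xml.etree.ElementTree as ET

def has_quarterly_planned_disbursements(activity, today):
    """True if the activity has ~quarterly planned disbursements that overlap
    the next 12 months (country filter omitted for brevity)."""
    horizon = today + timedelta(days=365)
    for pd in activity.findall("planned-disbursement"):
        start, end = pd.find("period-start"), pd.find("period-end")
        if start is None or end is None or not start.get("iso-date") or not end.get("iso-date"):
            continue
        s = datetime.strptime(start.get("iso-date"), "%Y-%m-%d").date()
        e = datetime.strptime(end.get("iso-date"), "%Y-%m-%d").date()
        overlaps_next_year = s <= horizon and e >= today
        is_roughly_quarterly = (e - s).days <= 95   # one quarter, give or take
        if overlaps_next_year and is_roughly_quarterly:
            return True
    return False

tree = ET.parse("activities.xml")                   # hypothetical file name
results = [has_quarterly_planned_disbursements(a, date.today())
           for a in tree.getroot().findall("iati-activity")]
print(f"{sum(results)}/{len(results)} activities pass the planned-disbursement check")
```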

I’d be happy to work offline with others to develop a set of questions and see how they work as a basis for a quality assessment.


I LOVE the idea of framing it by questions (we were doing the framing by “themes” but I think questions are more interoperable). I think OpenAg could contribute a batch of questions and we can see which ones stick, which ones need to be merged to serve as a multi-purpose quality check, and which ones are maybe a bit more sector-specific.

Shall we open a new thread in Discuss or just find each other (hopefully others) in the wilds of the interwebs?


Perhaps another thread on Discuss to signal this work is underway and invite others to join + document(s)* on Google Docs to co-create?

  • Not sure what would be the best basis to work from. It may be good to have one document as the master list of questions, but work in another document (or several) to explore & refine each one.

@reidmporter I like the idea of using machine learning in order to enrich the IATI data with additional classifications.

I think it is important though to adhere to the ‘Publish once, use often’ principle, avoiding republication of IATI data by anyone other than the original publisher. The original publisher is, and should remain, ultimately responsible for the quality of its own data.

Wouldn’t it be nice if this data enrichment (machine learning) software were made available as an API, enabling the original publisher to improve its own IATI dataset? Such a content-specific (e.g. agriculture) enrichment API could even be built into tools like Aidstream, providing suggestions for data improvement at the moment of registering or uploading data. That way the IATI data is not replicated and data quality is improved at the source.
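Purely to illustrate what that could look like from a publishing tool’s side, here is a hypothetical sketch of such an enrichment call. The endpoint, payload and response shape are invented for the example; no such API exists yet.

```python
# Hypothetical enrichment API call from a publishing tool. Endpoint URL,
# request payload and response shape are invented for illustration.
import requests

def suggest_classifications(description):
    """POST an activity description to a (hypothetical) enrichment service and
    return suggested vocabulary codes for the publisher to accept or reject."""
    resp = requests.post(
        "https://enrich.example.org/v1/classify",   # placeholder endpoint
        json={"text": description, "vocabulary": "AGROVOC"},
        timeout=10,
    )
    resp.raise_for_status()
    # Assumed response shape: {"suggestions": [{"code": "...", "label": "...", "confidence": 0.87}, ...]}
    return resp.json()["suggestions"]

for s in suggest_classifications("Irrigation support for smallholder maize farmers"):
    print(f'{s["label"]} ({s["confidence"]:.0%})')
```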

Absolutely agree, and we actually stepped away from the idea of auto-republishing for technical and principle-based reasons.

This is actually the approach we settled on: make the enhancement tools available via API, bake them into the publication process of donors and other publishers, and, if we do find that we can grab and enhance data automatically (or a research consultant does so as part of their contracted scope of work), provide it semi-publicly* for the data owner to validate/reingest/approve/publish themselves.

*Think Google Docs with link sharing: anyone with the link can access it, so technically it’s public, but it’s not discoverable; someone has to share the link with you, so it’s effectively private until you make it public.