Thanks for your thoughts @bill_anderson, @siemvaessen, @hayfield and @Herman
I would still strongly argue that preventing the publication of invalid data via the Registry would make a much-needed difference to the quality and usability of IATI data, which is key to IATI achieving the impact it seeks.
The key benefit of validating data on initial publication and each update is that it ensures we catch errors at the point of entry - and at the time when publishers are most invested in making their data available.
If such a measure is introduced, we must think carefully about the publisher experience. If a publisher attempts to publish invalid data via the Registry, a clear user interface should show them:
- What technical issues exist in their data
- Exactly where these issues appear in their file/s, and
- Clear guidance on how to fix each issue, so that they can publish their data.
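To illustrate the kind of check being described (this is a hypothetical sketch, not the Registry's actual implementation, and a real check would validate against the IATI schema rather than just well-formedness), a pre-publication step could reject a broken file at the point of entry and tell the publisher exactly where the problem is:

```python
# Hypothetical sketch of a publish-time check: reject malformed IATI XML
# and report the location of the error, instead of accepting the file and
# chasing the publisher later. Function and message wording are illustrative.
import xml.etree.ElementTree as ET

def check_before_publish(xml_text):
    """Return (ok, message); the message is what the UI would show."""
    try:
        ET.fromstring(xml_text)
        return True, "File is well-formed; continue publishing."
    except ET.ParseError as err:
        line, col = err.position
        # str(err) includes expat's description, e.g. "mismatched tag: ..."
        return False, f"Cannot publish. Fix line {line}, column {col}: {err}"

# A broken file: the activity element is never closed.
bad = "<iati-activities><iati-activity></iati-activities>"
ok, message = check_before_publish(bad)
print(ok, message)
```

The point of the sketch is the feedback loop: the error message arrives while the publisher is still at their keyboard, with a file location they can act on immediately.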
Other approaches (including scoring invalid data more harshly in publisher statistics) are good ideas but are inherently reactive: we would be chasing publishers to resolve issues (hours, weeks or months later?) after they have already moved on to other tasks. Previous experience along these lines has resulted in limited impact, whilst inefficiently consuming support resources.
With particular regard to publisher statistics, the effectiveness of that approach assumes publishers routinely look at, and care about, their performance there - which may or may not be the case.
I disagree that preventing publication of invalid data amounts to censorship. All serious attempts to make use of IATI data at any scale seem to have encountered significant data quality issues that limited outcomes. Given the renewed focus on data usage, we are saying that we want IATI to be a good quality and usable source of data, thus fulfilling IATI’s mission to ‘make information about aid spending easier to access, use and understand’.