Technical measures to improve/incentivise better data quality

Following the conversations in the thread TAG Tech Consult: Outline Agenda, this thread has been set up to discuss technical measures that could be introduced to incentivise and improve data quality.

Problem

The need for improved data quality follows recent shortfalls identified by real-world attempts to use IATI data, most notably Young Innovations' attempts to use Tanzanian IATI data. The Bangladesh IATI – AIMS import project, other threads on IATI Discuss and the IATI Dashboard also highlight further issues with the use of IATI data.

Discussion

@bill_anderson has suggested some specific ideas are needed regarding, for instance:

  • What should be validated?
  • How should it be validated?
  • What happens with invalid data?
  • How should the Registry handle invalid data?
  • How should the Dashboard handle invalid data?
  • How should Publisher Statistics handle invalid data?
  • Should good quality data be ‘kitemarked’?
  • NB If there is a need for discussions on tools, support, etc these belong on Day 2 and 3

One tangible proposal: Force initial publication and updates through the IATI Registry

@Herman has suggested on Twitter that most issues could be solved by making the IATI validation more restrictive and mandatory for publishing on the Registry.

This is a sensible idea and would stop publication of bad data at source. However, it would not necessarily make the situation better: at present, publishers post a URL to their self-hosted IATI XML file on the IATI Registry. This self-hosted file can then be modified any number of times without re-posting to the Registry, provided that the URL itself does not change. This means that checking a file is valid on initial publication does not guarantee it will still be valid after a publisher updates the file at that location.

To solve this problem - and if it is agreed that requiring initial publication and all updates to pass through the Registry (with validation on each publication) is a good idea - changes would need to be made to how updated data is accessed. Some options:

  1. All IATI data would be held on the IATI Registry

  2. Find a clever way to keep IATI XML files self-hosted but ‘cache’ each update. Each publication (initial or update) via the Registry would generate a unique Registry URL, which resolves to the data that was published at that point.
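
For option 2, one way of generating such per-publication URLs could be content addressing, i.e. deriving the Registry URL from a hash of the file as it stood when published. A rough sketch (the URL pattern, function name and storage are illustrative assumptions only):

import hashlib

def cache_publication(publisher_id, xml_bytes, cache):
    """Store a snapshot of the file as published and return a unique URL
    that will always resolve to exactly this version of the data."""
    digest = hashlib.sha256(xml_bytes).hexdigest()
    cache[digest] = xml_bytes  # 'cache' stands in for whatever store the Registry uses
    return f"https://registry.example.org/cache/{publisher_id}/{digest}.xml"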

Would be good to hear everyone’s thoughts on these technical measures.

Hi Dale,

We would be very supportive of efforts to improve data quality. From our perspective there is still too much garbage in, garbage out, which is a major headache.

Any initiative that could tackle this problem would be very welcome. I’d even say it should take precedence over anything else at the upcoming TAG, seeing that bad quality data is seemingly holding IATI hostage.

Looking forward to new insights and fruitful discussions at the TAG on data quality.

Siem

Hi Dale,
Thanks for setting up this thread. The questions raised by @bill_anderson make clear that there are several viewpoints to consider when discussing data quality.

The first two questions (What should be validated? How should it be validated?) can be linked to the question ‘What are we using IATI data for?’; in other words, what is the IATI use case? In the example of monitoring activity progress with IATI, for instance, it is critically important that references between activities are valid and complete. Only that way can you avoid double counting and provide insight into the network of activities funded for a specific goal or purpose. From the technical point of view there needs to be a way to determine whether a reference is valid or not. It would be worthwhile to compare the experiences of several IATI developers with the goal of taking stock of the major classes of data quality problems and their impact on the usability of IATI data. This could be a separate session (or sessions) at the TAG (e.g. a number of 20-minute presentations, each highlighting a specific use case).

The question ‘What happens with invalid data’ could be the topic of another separate session on the TAG. This session should also address the question ‘How should the registry handle invalid data’, since the registry is the official entry point for metadata on all publisher files.

The questions about the Dashboard and Publisher Statistics refer to the question of how to provide feedback to publishers about their data quality and how they appreciate data quality issues. The question can also be raised here whether technical means of providing feedback are a sufficient incentive for publishers to improve their data quality, or whether more is needed (e.g. specific support). This could be the topic of a session called ‘How to provide data quality feedback’.

With regard to the question ‘How should the Registry handle invalid data?’

At present, all that is published to the Registry is a URL that points to data rather than the data itself. This means that while it may be possible to undertake validation at the point of initial publication and ensure that data is fully valid at that point, the content of the file at the specified URL can be modified to contain invalid data at a later point.

In this situation, there are multiple ways that the Registry could respond.

  1. See that the data is no longer valid and prevent it from being accessed until it is made valid again
  2. Provide access to the last known-valid data
  3. Allow access to the invalid data, potentially marking it as invalid in some manner

In the first of these options, a very minor error could cause large amounts of useful data to become inaccessible. This does not appear to be an ideal situation.

The second option would require the Registry to store a copy of the data. This would mean weakening or abandoning the principle that publishers own their data, since the ‘official’ copy of data would not necessarily be the latest version provided by a publisher.

The third option would mean the Registry provides access to data that does not validate against the Standard, which is the situation that exists at present. An improved UI (Registry, Dashboard, elsewhere) could, however, better indicate when data is not valid while not modifying the role of the Registry or redefining who ‘owns’ data.
Other tools, such as an updated Datastore, could then pick up the task of restricting access to invalid data or filtering invalid elements / attributes, so that data users can know the data they are using is fully valid against the Standard (should they desire this).

IATI files should not be replaced without updating the metadata in the registry. Accurate metadata is key for the successful use of data. This could be checked automatically by saving the md5 checksum in the registry on initial load, and periodically scanning the corresponding IATI file to see if its md5 checksum matches. If not, mark the registry entry for that file as ‘invalid’.
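
A minimal sketch of how such a periodic checksum comparison could work (the field names and functions here are illustrative assumptions, not the actual Registry implementation):

import hashlib
import urllib.request

def file_md5(url):
    """Download the registered IATI file and return its md5 checksum."""
    with urllib.request.urlopen(url) as response:
        return hashlib.md5(response.read()).hexdigest()

def recheck_dataset(dataset):
    """Compare the file's current checksum against the one stored at
    registration time; flag the registry entry if the file has changed.
    'dataset' stands in for a registry record with hypothetical fields."""
    if file_md5(dataset["download_url"]) != dataset["registered_md5"]:
        dataset["status"] = "invalid"  # file was replaced without re-registering
    return dataset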

For determining whether or not a file has good enough data quality to qualify as usable in the registry, maybe the severity of the errors should be taken into consideration using traffic lighting:
Green - everything ok
Yellow - minor errors in the file (e.g. some invalid dates)
Red - major unacceptable errors in the file (e.g. duplicate activities, invalid references, non-existing code values, etc.). In that case the file will be rejected and marked as ‘invalid’ in the registry, and the publisher will be actively informed (e.g. automatically by an e-mail)
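
As a rough illustration (the error type names below are assumptions, not an agreed taxonomy), the traffic lighting could boil down to something like:

# Illustrative only: the error type names are placeholders
MAJOR_ERRORS = {"duplicate_activity", "invalid_reference", "non_existing_code"}
MINOR_ERRORS = {"invalid_date"}

def traffic_light(errors):
    """Return 'red', 'yellow' or 'green' for a list of error type strings."""
    if any(e in MAJOR_ERRORS for e in errors):
        return "red"     # reject: mark 'invalid' in the registry, e-mail the publisher
    if any(e in MINOR_ERRORS for e in errors):
        return "yellow"  # accept, but flag the minor issues
    return "green"       # everything ok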

I hear a number of voices calling for a far tougher regime than we have operated to date. I am in favour of this with one exception:

I am strongly opposed to data being ‘banned’ or excluded by the Registry. Firstly, this is not the way for an open data standard to do its business. Secondly, this could well deny users a huge amount of data that fails validation for relatively trivial reasons.

Flag invalid data with big warning signs. Punish it far more harshly in the publisher statistics. Name and shame. But don’t censor.

Thanks for your thoughts @bill_anderson, @siemvaessen, @hayfield and @Herman

I would still strongly argue that preventing the publication of invalid data via the Registry will make a needed difference to the quality and usability of IATI data, which is key to IATI achieving the impact it seeks.

The key benefit of validating data on initial publication and each update is that it ensures we catch errors at the point of entry - and at the time when publishers are most invested in making their data available.

If such a measure is introduced, we must think carefully about the publisher experience. If a user attempts to publish invalid data via the Registry, there should be a clear user interface to help users understand:

  1. If there are technical issues with their data
  2. Exactly where these issues appear in their file/s, and
  3. Clear guidance on how to fix each issue, so that they can publish their data.

Other approaches (including scoring invalid data more harshly in publisher statistics) are good ideas but are inherently reactive: we would be trying to chase publishers to resolve issues (hours/weeks/months later?) when they have already moved on to other tasks. Previous experience along these lines has resulted in limited impact, whilst inefficiently consuming support resources.

With particular regard to publisher statistics, the effectiveness of this approach assumes publishers routinely look at and care about their performance there - which may or may not be the case.

I disagree that preventing publication of invalid data amounts to censorship. It seems that all serious attempts to make use of IATI data at any scale have encountered significant data quality issues limiting outcomes. Given the renewed focus on data usage, we are saying that we want IATI to be a good quality and usable source of data, thus fulfilling IATI’s mission to ‘make information about aid spending easier to access, use and understand’.

@dalepotter I fully agree with you. Accepting bad data (= data with severe errors) in the registry will not only increase the workload to fix the data later but will also lead to wrong information. If we use IATI data as a means to improve aid effectiveness (as opposed to only improving transparency), bad data will be harmful because wrong conclusions can easily be drawn.

The key to data quality management is stopping errors as soon as possible. The registry is the natural place to do this, since all IATI data must pass through the registry.

Regarding the three options raised: as stated by @hayfield, option 3 feels like the status quo, which has proved ineffective against the ambition set. That would leave us with options 1 & 2, which could be read as a single option/solution as described by @dalepotter.

If the Registry were to be enriched with a Registry Validator, which ensures that incoming data is validated (not just against the XSD but against a wider range of validations, to be determined), we could solve this issue at the door rather than in the room. Solving this issue in the room has proved ineffective, and I propose we deviate from that road.

I also disagree that this method would censor data. It will effectively ensure the quality of data published to the Registry and avoid the current state of garbage in, garbage out.

So, I’d vote in favour of strict validation at the Registry.

I would also agree that it would be better to validate data as it is published and prevent the publication of any invalid data. I can echo from experience the comment made above by @dalepotter that publishers are indeed most likely to amend and correct their data when they are specifically focused on initially publishing or at the time when they are making updates.

I also agree with @siemvaessen about extending the validation beyond just schema validation, and our plans for enhancing the IATI Validator do include doing just that.

Hi Everyone!

Firstly, Bill is absolutely right to say that invalid data shouldn’t be banned, and I think that it might be helpful to flesh out one consideration so that people can reflect on it. I’m not wedded to the following idea, but I’d be interested to hear @siemvaessen and @Herman’s thoughts as users, and it might move this conversation along to other, better proposals :slight_smile:

Technical implementation

The registry gets a new protected field (i.e. set by sys, not the publisher) that has the following values:

  • 1: ‘invalid_xml’ - xml validation fails
  • 2: ‘invalid_IATI’ - xmllint schema validation fails
  • 3: ‘invalid_content’ (arithmetic errors etc.) - opinionated content validation fails
  • 4: ‘healthy_dataset’ - all good

Call this field ‘Health’

There is a daily batch process which does the following on every registered dataset:

# This is pseudocode! 'validate', 'add_to_md5_cache_register', 'md5' and
# 'md5_cache_register' are placeholders for real implementations.
def daily_check(dataset):
    if dataset.newly_registered:
        # the 'validate()' method would be an xmllint + content validator
        dataset.health = validate(dataset)
        add_to_md5_cache_register(dataset)
        # ...then do any initial setup of the dataset necessary
    elif md5(dataset) not in md5_cache_register:
        # the file has changed since the last check, so re-validate it
        dataset.health = validate(dataset)
        add_to_md5_cache_register(dataset)
        # ...plus other operations such as updating the activity count
        # and the `IATI data updated` time field
    else:
        # the file is unchanged: just record somewhere that the check has been run
        pass

In plain English: this checks whether a dataset has been recently registered (which is available in CKAN metadata).

  • If the dataset has been recently registered, the method runs a validator to define the ‘health’ field and adds an md5 hash string for that activity to a register.
  • If it hasn’t, the method checks to see if the md5 hash exists already
    • If it doesn’t, the method runs the checks and updates the cache
    • If it does, then nothing happens

In every step except the last one, the IATI Data Updated field would be updated, possibly along with other fields (see below for why)
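
For illustration only, here is a minimal sketch of what the validate() step might look like, using lxml for the schema check plus one example content rule (the schema path and the duplicate-identifier rule are assumptions, not the actual IATI validator):

from lxml import etree

def validate(xml_path, schema_path="iati-activities-schema.xsd"):
    """Return a 'health' value of 1-4 as defined above (sketch only)."""
    try:
        doc = etree.parse(xml_path)
    except etree.XMLSyntaxError:
        return 1  # invalid_xml
    schema = etree.XMLSchema(etree.parse(schema_path))
    if not schema.validate(doc):
        return 2  # invalid_IATI
    # one example of an opinionated content check: duplicate iati-identifiers
    identifiers = [el.text for el in doc.findall(".//iati-identifier")]
    if len(identifiers) != len(set(identifiers)):
        return 3  # invalid_content
    return 4      # healthy_dataset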

So far, so good? (@dalepotter, @hayfield - please let me know if I’m making naive assumptions about what is possible in CKAN.)

Direct uses in the IATI infrastructure

Consider the following metadata from an IRC_UK activity:

Imagine that ‘Health’ now sits in the Contents section (as does IATI data updated, ideally), and that all of the fields in that section are set by sys, not publisher, in a method similar to the one above.

This has the following advantages:

  1. The dashboard, datastore, and other systems which use the registry API could just decide to retrieve activities which have a health rating of 3 or higher, and simply count the files in the other categories for the sake of gathering statistics (without trying to parse them, or spending computation time on validating them) - see the sketch after this list.
  2. OIPA / other parsers could do something similar.
  3. No transparency is lost, i.e. all files are still accessible, and users who want to dredge through the invalid data can attempt to do so (though I’d say there’s little value, particularly for health < 2).
  4. Because the IATI data updated field would be trustworthy, a lot of computation time throughout the technical ecosystem could be saved; I’m sure the dashboard / datastore / OIPA run times could be cut in half, or possibly down below a quarter, just by skipping any datasets that haven’t been updated since their last run.
  5. With the right UX considerations, users could be given the choice between ‘clean’ IATI and ‘comprehensive’ IATI, where the former is good for data analysis, and the latter is good for transparency and accountability.
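
As an example of advantage 1, if the ‘Health’ value were exposed as a searchable field via the Registry’s CKAN search API (it is not today; the field name and query below are assumptions), a consumer could skip unhealthy datasets with a single call:

import requests

REGISTRY_SEARCH = "https://iatiregistry.org/api/3/action/package_search"

def healthy_datasets(min_health=3, rows=100):
    """Fetch registry datasets whose hypothetical 'health' field is at least
    min_health, so downstream tools can skip files with known problems."""
    params = {"fq": f"health:[{min_health} TO 4]", "rows": rows}
    response = requests.get(REGISTRY_SEARCH, params=params)
    return response.json()["result"]["results"]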

Thoughts? I look forward to some debates about this at the TAG :wink:

Thanks for this @rory_scott - good to see such detailed thinking on how a proposal could work in practice!

This suggestion has the benefit of adding some clear metadata to each dataset, and providing a framework whereby IATI secretariat and community tools could skip datasets with known technical issues.

Aside from this, I would contend that this proposal would have limited (if any) impact on addressing the data quality issues that occur within IATI datasets. I agree with @siemvaessen @Herman and @Wendy that we need to prevent bad data getting into the room. I would argue that the current approach (of allowing any data to be published and flagging up issues with data publishers) is clearly not working, whilst at the same time placing significant technical burdens on users and tool providers to make IATI data usable at scale.

Preventing new/updated invalid data being published may seem harsh but it would prevent data quality issues from occurring in new datasets. Changing the process so that all updates to already-published IATI datasets pass through the Registry will (albeit slowly) begin to resolve issues with current datasets.

Even if this change were to be agreed (presumably at the TAG, then through other IATI governance processes), it would be fair to assume there would be a grace period to enable the validation process to be established on the IATI Registry, alongside allowing publishers to adapt their processes. In the immediate term, your idea to set up a batch process and report on ‘dataset health’ is definitely worth taking forward. We’ve added a task to investigate setting something like this up to our weekly list of maintenance and improvement jobs :slight_smile:

Thanks @dalepotter and @rory_scott for this detailed sketch of the IATI validation process. Looks like a very good starting point for improving the technical part of assuring data quality. A couple of observations:

  • The most challenging ‘health’ category will be ‘invalid_content’. From the user’s point of view, depending on your use case, some content errors might be acceptable in one case but not in another. An example: a viz showing the financial relations between publishers might accept an error in the policy marker field, but certainly not in references to other activities. So the ‘invalid_content’ category needs to be more fine-grained to be usable in practice. I will try to come up with some classification based on the many types of error we encountered during the processing of IATI data.

  • Some errors, like duplicate activities published by one publisher, will cause great functional and technical problems. This class of errors should not be allowed in the registry at all. The same is true for XML or XSD validation errors: these can and should be fixed by the publisher.

  • Clear and active feedback to the publisher is critically important. What good is an automated validation process if the publisher is not aware that there are problems? Many publishers using Aidstream, for instance, understandably think everything is fine when they click on the ‘Publish’ button. Since Aidstream is very relaxed about data quality, these publishers are often unaware that there are serious problems with their data. The same is true for other publishing tools.

  • Some IATI publishers will need guidance to improve data quality. This will cost time and effort from IATI support. So an option might be to couple this kind of premium support to whether or not a publisher has paid its membership fee.

  • The transparency index now rewards completeness of publication. To add an incentive to improve data quality, an important part of the ranking should be coupled to the consistent publication of quality data, preferably measured on the basis of the health status of the data in the registry over a longer period of time.

@dalepotter Is it possible to assign this topic to the groups ‘Data quality’ and ‘Tag 2017’?

Thanks @Herman. It seems posts can only be assigned to one category. I’ve added this to #iati-tag-2017 for the time being, as I imagine we will want to discuss further at the event.

This is my initial attempt to classify the types of errors I’ve encountered frequently. A lot of them would fall under the category ‘content_errors’. This list is by no means complete. Please feel free to extend this list with your own experiences working with IATI data:

Code list errors
• Non existing codes published for code list fields

Data entry errors
• Frequent use of unlikely dates

Duplicate errors
• Duplicate data of one publisher in IATI files split by country
• Duplicate data of one publisher in one IATI file.
• Duplicate datasets with different publication times

Guideline errors
• Activity files containing multiple IATI versions in one file.
• Country and region percentages not adding up to 100%
• Inconsistent use of the organization roles.
• Missing organization names if no IATI identifier exists for the organization
• Organization identifiers not following IATI guidelines
• Sector percentages not adding up to 100%

Reference errors
• Missing and/or invalid funding activity ids in related activities.
• Missing IATI identifiers for known IATI organizations.
• Missing or invalid references (IATI activity identifiers) to providing activities for incoming transactions
• Use of non-existing IATI identifiers for organizations

Registration errors
• Incorrect registry metadata (activity files registered with an organization file type and vice versa)
• Registration of publisher organizations with the organization reference as publisher name.

Registry errors
• Numerous download errors, invalid files.
• Registering non validated data in the registry
• Registry entries with incomplete IATI files or even non-IATI files like HTML
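
For illustration, this classification could also be captured in a machine-readable form that a validator could build on (the category keys and error names below are only suggestions, not an agreed vocabulary):

# Sketch of the classification above in machine-readable form
ERROR_CLASSES = {
    "codelist":     ["non_existing_code"],
    "data_entry":   ["unlikely_date"],
    "duplicate":    ["duplicate_activity", "duplicate_dataset"],
    "guideline":    ["mixed_iati_versions", "percentages_not_100",
                     "invalid_org_identifier", "missing_org_name"],
    "reference":    ["missing_related_activity", "invalid_activity_reference"],
    "registration": ["wrong_file_type_metadata", "org_ref_as_publisher_name"],
    "registry":     ["download_error", "non_iati_file"],
}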

Similar to @Herman, I noted incorrect dates etc. when I explored legacy public aid flow.
@dalepotter: an extra option would be to support a preferred, default IATI-hosted copy of the hashed dataset with crowdsourced amendments, but maintain access to the registered default dataset.

I also think a donors tab to complement the publishers tab on the IATI Registry home page is needed in the roadmap, as it would not be obvious to a primary publisher that a secondary publisher has progressed datasets to prompt outreach.

This is a very interesting discussion, thanks for starting it. And thanks in particular to Herman for offering an initial typology of errors.

I tend to be with Bill on the issue of banning “bad quality” data, except perhaps in some extreme cases. As someone mentioned above, something may make data unusable for use case A but go unnoticed for use case B - if it’s banned, use case B doesn’t have it.

One reason not to go to such extremes is that, as far as I know, there has not been that much feedback to individual publishers on specific data quality issues, especially for issues like codelist, guideline and reference errors (to use Herman’s typology). In particular, there haven’t been a lot of demonstrations like the recent ones from Anjesh and Mark pointing out how very specific errors/problems had an impact on data use.

Suggestions have been made recently, here and in other threads, to create more active feedback mechanisms. Before concluding that flagging issues to publishers doesn’t work, perhaps we should try these ideas out. And, as was said repeatedly in the last TAG, we need to be very clear when defining and identifying quality issues, and linking them whenever possible with the related use case.

As an example, in the recent research on Tanzania data, we had to exclude all the data from DFID and The Netherlands (among many others) because the implementing partner name was not there (or not in the right place, which is the same for our purpose). On the other hand, Canada’s data would fail for Herman’s interest in cross-referencing activities. These examples may or may not be capital sins, depending on what you want to do with the data. I don’t think banning this data would help.

@YohannaLoucheur I agree that some errors have a greater impact depending on the use case. A distinction can be made between use-case-specific errors/omissions and errors which are use-case independent (e.g. publication of not-well-formed XML and invalid codes). I would be in favor of blocking the use-case-independent error types.

Leaving the current acceptance policy (‘anything goes’) in place will, in my opinion, not improve data quality. This will be detrimental to the use and the long-term success of IATI. This is demonstrated quite well by the many IATI pilots which suffered from bad data.

A careful implementation approach is necessary when validation rules are enforced: for the next decimal upgrades they would produce warnings, and after migrating to a new integer upgrade they would become blocking.

I have tried to capture this discussion and simplify it into some high-level options in this paper for Standards Day. If we can reach consensus at this level we can then explore the technology and capacity required to implement a coherent policy.