IATI Datastore - what data should go in?

stevieflow · January 29, 2019, 7:23pm

As part of the build of the new IATI datastore, there’s an important point for our community to consider: what goes into the datastore?

A common response might be “all published IATI data, surely?”, but I wanted to offer an alternative, which I think others will support.

In short, I propose that the IATI datastore initially pulls in data that is:

Valid to the IATI schema, and;
Published under an open licence, and;
Using version 2.0x of IATI

To be clear: this does not encompass all current published data. So - why limit ourselves? Here’s three reasons:

1 - Schema validation is something we are very used to

For those of us that grapple with data validation around IATI, we know it can often mean many things. The term “compliance with IATI” is often heard, but not universally agreed.

However, we have a very simple mechanism to help us: the IATI schema. The schema provides exactness on how data should be ordered and structure: it’s the minimum level of validation one should pass.

The online IATI validator has always provided a means to test data against the schema. It’s true that there are a range of further “validation” tests one could make, including a host of rulesets and codelist checking - and even extending to the coherence of organisation identifiers. However, for us to get a basic level, we should begin by working with data that passes this initial schema inspection.

My argument here is simple: if we start to support data that is not valid against the schema, why have a schema? Or even - what support are we giving to data users, if we supply invalid data?

2 - Data licencing supports data use

You might be surprised to see mention of data licencing in this proposal, as it can often be something that is added at the last moment of publication, whilst someone sets up their IATI Registry account. However, appropriate data licencing is an absolute must if we are to support people to successfully use IATI data.

In fact, we should really consider the datastore as a data user! In this light, it needs to be clear that they can access data under a licence that is permissive and open. When data is licenced under restricted grounds, then it cannot be reused (as that is what the licence says!).

My challenge: Why would we support the datastore to use data that has no licence for reuse?

3 - Version 1 of the standard is deprecated

The TAG meeting in Kathmandu supported the decision of IATI members to deprecate version 1 of the standard, meaning that from June 2019 publishers using this will not be supported.

Whilst it’s technically possible to convert data from version 1 to version 2 of the IATI standard, this would take up a limited resource on the datastore project we could deploy elsewhere.

My rationale: to get the support of the new datastore, organisations need to supply data in a version that is actively supported by the initiative.

Support for our principles?

A common thread between these three conditions is mutual support. We all want to support our data standard and data users via the datastore project. To do this, we must ensure that we respect the core protocols we have around the schemas, licences and versions for our standard. Given that the datastore represents a renewed focus on data quality and use, I can’t imagine a scenario where we would actively go against these.

Of course, there are currently a range of publishing organisations that would be omitted from inclusion in the datastore, in terms of failed schema tests, restrictive licencing and/or use of unsupported versions,. However, we should be careful to not start to cite examples in order to find reason for a relaxation of this criteria. I do believe this is a relatively low bar for entry - and that our community and tech team can provide positive support to those that need to remedy their data.

What next? I’m hoping those active in our community can support these principles, so that we can in turn endorse the data that makes its way into the datastore. Maybe respond with a quick nod of a approval to get us moving on…

My guess (based on informal discussions - see below) is that the first two principles are very agreeable, whilst there’s a dilemma about use of version 1 data. That seems fine - and is a reason for my separating v1 into a new point.

After this, we can start to extend our discussions around data validity, compliance and quality in other, more advanced, ways. But, I do hope colleagues are able to step back and agree that this initial benchmark is for the betterment of the initiative.

Disclosure: prior to posting this I quickly discussed these ideas with @bill_anderson, @KateHughes, @Herman @siemvaessen , @andylolz, @markbrough & @rory_scott - more as an valued and accessible sounding board rather than definitive answer (but thanks, nevertheless!)!

matmaxgeds · January 30, 2019, 11:49am

Hi @stevieflow - can you recommend a way to work out what this would mean in terms of what data the datastore would/wouldn’t return under the three rules you propose? Ideally this would be an IATI xml file that contains the data that would be rejected so I can scan it and make sure that I won’t miss what is being dropped. That seems the best way for me to be able to say something sensible/evidence based rather than just debating the principles.

andylolz · January 31, 2019, 12:11pm

Hi @matmaxgeds,

The following table shows the datasets that would not be pulled in according to this proposal (based on registry data from 2019-01-30.) The reason for exclusion is also listed. Clicking the reasons will show the list of datasets excluded for that reason.

Reason for exclusion	Dataset count	Activity count
Invalid XML	272	-
Version 1.0x	1,157	126,352
Invalid against (v2.0x) IATI schema	762	100,716
Closed license	51	411
License not specified	1,092	21,819

Note:

The groups above are mutually exclusive by definition – datasets can’t be in multiple groups.
Schema validation was performed at activity level (as suggested by Steven Flower). I found roughly 14,000 valid activities in invalid datasets. The datastore could exclude invalid activities, rather than invalid datasets.
It’s unclear whether datasets with an unspecified license would be excluded or not, so you can maybe ignore those ones.

stevieflow · January 31, 2019, 4:42pm

Thanks @andylolz - really very useful

@matmaxgeds does this data answer your question? I think you also mean some kind of function from the datastore, containing the excluded activities - but the stats ^^ are useful context for us.

matmaxgeds · January 31, 2019, 7:15pm

@andylolz huge thanks
@stevieflow - yes, I think it does. In Somalia we are building an aid management system that will allow users to use IATI data via the datstore - if these changes were implemented, we would lose access to e.g SIDA data: http://preview.iatistandard.org/index.php?url=http%3A//iati.openaid.se/xml/SO.xml - which answers my question about how significant this is…significant.

So from my side, I think the list of changes/principles is excellent. For me the problems to solve are:

IATI datastore data no longer = IATI registry data - this means that it will no longer be good enough to do research and share data and give the source as ‘IATI 2019-01-21’ but will now have to specify that it is from the datastore, and because of XYZ, ABC are excluded - pretty confusing for readers, but would be key because the amount dropped would be a significant difference especially for some publishers.
I presume that this change would need the same approval as a shift from 2.03 to 3.01 - what would the process for that be? And I presume this process should include a period where all offending publishers were contacted and helped to work through the (typically minor) tweaks needed to pass the tests? But who are we going to assign the time to do this - assuming we are talking several hundred publishers - and what would the cutoff be - 80, 90% of activities made compliant?
Putting the two above together, why not just apply this at the registry level - remove links to all files that do not pass the test and benefit from these principles throughout the whole IATI ecosystem, not just one small part. That would also do it at the moment of publishing which is by far the easiest stage to have a conversation with the actual person responsible for publishing, and would give far more leverage, if it is just on the datastore, then they can confirm to their boss/funder that they are publishing to IATI and not worry further.
I am worried about a situation where for those funders (e.g. Netherlands) that require IATI publishing, that if this move removed their recipients required IATI data from the datastore, it will mean those funders no longer use the datastore as their way of checking - and the datastore further loses the critical mass of IATI data use that it needs to exist.

In summary, I think this is a great idea (it would help the data use side hugely), so good in fact that it should be applied at the registry level, and the secretariat should dedicate resources to bring it about in a way that supports publishers, and those users/systems that currently use the datastore - not just have them as collateral damage of a good step forward.

Herman · January 31, 2019, 8:15pm

@stevieflow @matmaxgeds @andylolz Concerning missing license info: since IATI is open data, publications without licence should be considered open by default.

Concerning closed licenses: should they even be allowed on the registry. The whole IATI effort is about sharing open data. Attribution licences should not pose a problem: since IATI supports the ‘reporting organisation’ item, all IATI date can be attributed to the publisher.

A last thought about licences: shouldn’t we consider the datastore as IATI infrastructure instead of an IATI data use application?

andylolz · January 31, 2019, 8:46pm

I meant to add a note about Sida – their data has been offline for a couple of days, hence why they’re in the “invalid XML” category. It appears to be back up now. It’s v2.01, openly licensed and valid, so it would be included

Yes! I’m hopeful that’s the case. If so, then that’s great.

YohannaLoucheur · January 31, 2019, 9:28pm

These are very sensible principles to uphold for all the reasons outlined in Steven’s post.

No objection from Canada on #1 and #2 given how central this is to the whole IATI standard endeavour. Also agree with suggestions that not specifying a license should be considered open by default.

I understand the concerns about #3, but ultimately we need to move in this direction, for the same reasons that we have to deprecate 1.0x. Matt, you raise valid concerns about losing access to some data, but I don’t see this having as much of an impact as you anticipate. The few remaining active publishers using 1.0x are preparing to move to 2.0x. I can’t speak on their behalf, but it seems unlikely that the UK or Netherlands would accept a data file published in a deprecated version of the standard.

So the main issue for principle #3 would be files published in the past by now-inactive publishers - and there are a lot of them. I doubt it is used in aid management systems, as partner countries tend to focus on current and future data. Still, this older data can have tremendous value for some users e.g. evaluations, audits, historical trends, etc. If we were to concentrate on this specific use case, could we perhaps find solutions to maintain some form of access to 1.0x data?

Herman · February 1, 2019, 8:32am

The Netherlands IATI reporting guidelines require that publishers use IATI version 2.02 or higher. We are technically still processing 1.x IATI files though. Since 1.x from the information content point of view is largely a subset of 2.x, the continued processing of 1.x was in our case a very small technical effort because we choose to skip processing of 1.x which are depreciated in 2.x (e.g. some location elements).

My concern with not processing 1.x anymore is that the datastore can not be considered as an authorative source of IATI data anymore, since relevant data is missing. The decision wheter or not to process 1.x could i.m.o. be dependant on two criteria:

the number of active publishers who will not have migrated to 2.x on june 30 2019 (an active publisher defined as a publisher who publishes at least once each year);
the technical effort to additionally process 1.x data excluding the depreciated 1.x data-elements

bill_anderson · February 1, 2019, 8:37am

The datastore will do a one-off load of non-active* Version 1 activities.
I suspect most closed or missing licences are oversight, not deliberate. A job for Tech Team and community to address.
Personally I would load all valid non-active Version 1 activities irrespective of licence.
I also personally agree with @Herman that any data discoverable via the registry is de facto open. Publishing to an open data standard and insisting on licence restrictions (other than attribution) I would imagine to be legally questionable.

(* I agree with @Herman’s twitter definition of active meaning publishing at once a year. So all publishers who haven’t published (or refreshed) anything in the last year are non-active.)

siemvaessen · February 1, 2019, 12:11pm

According to IATI guidance, open data is a requirement, not some optional feature nor should it offer closed licensing options, right? But…

“As an open data standard, IATI requires you to make your data available under an open licence so it can be freely used. This is central to improving transparency and efficiency in all development cooperation and humanitarian work.”

But… her comes the contradiction:

“But if you don’t offer your data under a licence that sets out the terms of use, others won’t know what they’re allowed to do with it and it won’t be classed as ‘open data’. Data users would also need to contact you for permission each time they wanted to use some of your data.”

So, according to the guidance open licensing is an actual requirement, but publishers are allowed otherwise, e.g. closed licensing. This is very unclear. Who can/should clear this up?

Source: https://iatistandard.org/en/guidance/preparing-organisation/organisation-data-publication/how-to-license-your-data/

David_Megginson · February 1, 2019, 1:59pm

As I mentioned on Twitter, we ignore Postel’s fundamental law of the Internet – “be conservative in what you send, [but] liberal in what you accept” – at our own peril. If there’s any reasonable way we can keep accepting v.1 IATI from active reporters, then it might not be a bad idea to do so.

As for messy licenses, just as Wikipedia isn’t running out of paper, the IATI Datastore won’t be running out of index cards. Let’s take in as much data as we can, from anyone who wants to provide it, then we can flag “bad” data to exclude from the headline reports, leading indicators, and visualisation dashboards (so that there’s still a consequence to not being open).

D

siemvaessen · February 1, 2019, 2:40pm

Hmhh well, don’t think I agree here. The same could be argued for stuffing 100 doves into your sleeve, you probably could -a magician would-, but not sure how the condition of these doves would be after the trick. I don’t think this is about bad data perse, but rather the condition under how to re-use that data and how IATI provides guidance should have some say in this right? Just accepting anything defies the purpose of having guidance in the first place.

David_Megginson · February 1, 2019, 2:52pm

I agree about the reuse problem. That’s why I’d have the data excluded from common queries by default, and included only when the user explicitly opted in (e.g. “Include non-open data” option in the UI, or “&license=nonopen” in the API).

There are some use cases where non-open data is better than no data, but it’s OK to make the users do a bit of extra work in those cases (including demonstrating that they’re aware of the problem).

andylolz · February 1, 2019, 5:24pm

^^ Agreed / cool. Step #1 is this ticket, which would stem the tide of “license unspecified” data.

It sounds like there’s appetite for removing the option to publish closed IATI data going forward (FWIW I support this). Plus the number of activities published with a closed license is really small (see table above). If the option of a closed license were to be removed, I doubt it would be worth special casing for closed data in the datastore API.

stevieflow · February 1, 2019, 7:12pm

Thanks everyone to the detailed, considered and useful answers. It’s like a Technical Advisory Group!

Allow me for a moment to sit on my TAG chair cushion and undertake my duties. In amongst all these exciting conversations and (potential) tangents, I think this is where we are:

On schema validation - I see no objection.
On open licences - we seem to also agree on the principle, but see a contradiction in how an open data standard can accommodate closed licences.
On 1.0x, we seem less ready to “reject” that data - but think the deprecation of v1 should mean active publishers will make a plan to migrate to v2

There are a few tasks coming from this, it seems:

clarifying our guidance on closed licences
understanding why/how the Registry would allow them
thinking through how the Registry might apply some / all of these principles
considering how we make available / archive “non-active” version 1 publishers
understanding our position on limiting data, in an unlimited data world

But - as we break for the weekend (think of it as a coffee break in this energetic meeting we’re having, but the chance to get some actual fresh air) I’m hoping this is a adequate summary of where we are at.

bill_anderson · February 2, 2019, 7:53am

Yes.

The Datastore’s priority clients are data users who should reasonably expect to be served usable data.

In my opinion “usable data” sits somewhere in between schema-only validation and full validation against schema, codelists and rulesets.

We have agreed the first step: all activities from active publishers MUST validate against the schema.

BUT we haven’t yet provided the DS developers with guidance and a roadmap as to how to tolerate ruleset and codelist errors.

SJohns · February 8, 2019, 1:26pm

Good summary - going back to Andy’s original summary of the activities affected, would any exclusions be based on excluding activities rather than whole data files? Just thinking of the 600+ CSO publishers, some of whom have old activities going back to 2011 that won’t meet these criteria, but are part of the same datafile as newer activities that will be 2.0x and meet the criteria. They are not going to have the resources to go back and update older activities. And as many donors now link the payment of funds to the publication of data - it could be a real risk to them to have their datafile pulled completely. What would be the best advice you can give a CSO in advance of these changes?

stevieflow · February 9, 2019, 12:54pm

Hi @SJohns - thanks, it’s a very valid question

In terms of specific file having a mix of 1.0x and 2.0x activities within it, then I don’t think this is actually possible. The version attribute is only applicable at the <iati-activities> element, not the <iati-activity>, so it can only be declared once per file. It used to be different (in version 1.0x) - but was changed in the move to 2.01 (see changelog). @bill_anderson @IATI-techteam do you agree?

However, the point still remains that it could be possible to publish a file with a mix of valid and invalid activities (in the same version). I think @andylolz did some stats on this too…

andylolz · February 9, 2019, 5:26pm

@SJohns: pragmatically, I’d suggest any publisher that can’t go back and update old v1.0x data should ensure all new activities are created in a brand new v2.03 activity file. This means all future data will be “datastore compliant”. And perhaps at some point, the old v1.0x data could be one-off converted.

That’s true – in the stats above, schema validation was performed at activity level (i.e. rather than validate each dataset, I validated each activity.) So in practice this means the “activity count” is a count of invalid activities, rather than a count of all activities inside invalid datasets.