IATI Datastore - what data should go in?

Thanks @andylolz - really very useful

@matmaxgeds does this data answer your question? I think you also meant some kind of function of the datastore that lists the excluded activities - but the stats ^^ are useful context for us.

@andylolz huge thanks
@stevieflow - yes, I think it does. In Somalia we are building an aid management system that will allow users to use IATI data via the datastore - if these changes were implemented, we would lose access to e.g. Sida data: http://preview.iatistandard.org/index.php?url=http%3A//iati.openaid.se/xml/SO.xml - which answers my question about how significant this is… significant.

So from my side, I think the list of changes/principles is excellent. For me the problems to solve are:

  1. IATI datastore data would no longer equal IATI registry data - it would no longer be enough to do research, share data and cite the source as ‘IATI 2019-01-21’; you would have to specify that the data came from the datastore and that, because of XYZ, activities ABC are excluded. That is pretty confusing for readers, but it would be essential, because the amount dropped would make a significant difference, especially for some publishers.
  2. I presume that this change would need the same approval as a shift from 2.03 to 3.01 - what would the process for that be? And I presume this process should include a period in which all offending publishers were contacted and helped to work through the (typically minor) tweaks needed to pass the tests. But who would be given the time to do this - assuming we are talking about several hundred publishers - and what would the cutoff be: 80%, 90% of activities made compliant?
  3. Putting the two points above together, why not just apply this at the registry level - remove links to all files that do not pass the tests, and benefit from these principles throughout the whole IATI ecosystem, not just one small part of it? That would also apply the checks at the moment of publishing, which is by far the easiest stage at which to have a conversation with the actual person responsible for publishing, and would give far more leverage. If it is just on the datastore, publishers can still confirm to their boss/funder that they are publishing to IATI and not worry further.
  4. I am worried that, for funders (e.g. the Netherlands) that require IATI publishing, removing their recipients’ required IATI data from the datastore would mean those funders no longer use the datastore as their way of checking - and the datastore would further lose the critical mass of IATI data use that it needs to exist.

In summary, I think this is a great idea (it would help the data use side hugely), so good in fact that it should be applied at the registry level, and the secretariat should dedicate resources to bring it about in a way that supports publishers, and those users/systems that currently use the datastore - not just have them as collateral damage of a good step forward.

@stevieflow @matmaxgeds @andylolz Concerning missing licence info: since IATI is open data, publications without a licence should be considered open by default.

Concerning closed licences: should they even be allowed on the registry? The whole IATI effort is about sharing open data. Attribution licences should not pose a problem: since IATI supports the ‘reporting organisation’ element, all IATI data can be attributed to the publisher.

A last thought about licences: shouldn’t we consider the datastore as IATI infrastructure instead of an IATI data use application?


I meant to add a note about Sida – their data has been offline for a couple of days, which is why they’re in the “invalid XML” category. It appears to be back up now. It’s v2.01, openly licensed and valid, so it would be included :smile:

Yes! I’m hopeful that’s the case. If so, then that’s great.

These are very sensible principles to uphold for all the reasons outlined in Steven’s post.

No objection from Canada on #1 and #2 given how central this is to the whole IATI standard endeavour. Also agree with suggestions that not specifying a license should be considered open by default.

I understand the concerns about #3, but ultimately we need to move in this direction, for the same reasons that we have to deprecate 1.0x. Matt, you raise valid concerns about losing access to some data, but I don’t see this having as much of an impact as you anticipate. The few remaining active publishers using 1.0x are preparing to move to 2.0x. I can’t speak on their behalf, but it seems unlikely that the UK or Netherlands would accept a data file published in a deprecated version of the standard.

So the main issue for principle #3 would be files published in the past by now-inactive publishers - and there are a lot of them. I doubt these files are used in aid management systems, as partner countries tend to focus on current and future data. Still, this older data can have tremendous value for some users, e.g. evaluations, audits, historical trends, etc. If we were to concentrate on this specific use case, could we perhaps find solutions to maintain some form of access to 1.0x data?

The Netherlands IATI reporting guidelines require that publishers use IATI version 2.02 or higher. We are technically still processing 1.x IATI files, though. Since 1.x is, from the information-content point of view, largely a subset of 2.x, continuing to process 1.x was in our case a very small technical effort, because we chose to skip processing of the 1.x elements which are deprecated in 2.x (e.g. some location elements).

My concern with no longer processing 1.x is that the datastore could then not be considered an authoritative source of IATI data, since relevant data would be missing. The decision whether or not to process 1.x could i.m.o. depend on two criteria:

  1. the number of active publishers who will not have migrated to 2.x by 30 June 2019 (an active publisher being defined as a publisher who publishes at least once a year);
  2. the technical effort to additionally process 1.x data, excluding the deprecated 1.x data elements.
  • The datastore will do a one-off load of non-active* Version 1 activities.

  • I suspect most closed or missing licences are an oversight, not deliberate. A job for the Tech Team and community to address.

  • Personally I would load all valid non-active Version 1 activities irrespective of licence.

  • I also personally agree with @Herman that any data discoverable via the registry is de facto open. Publishing to an open data standard and insisting on licence restrictions (other than attribution) I would imagine to be legally questionable.

(* I agree with @Herman’s twitter definition of active, meaning publishing at least once a year. So all publishers who haven’t published (or refreshed) anything in the last year are non-active.)

According to IATI guidance, open data is a requirement, not an optional feature - nor should there be closed licensing options, right? But…

“As an open data standard, IATI requires you to make your data available under an open licence so it can be freely used. This is central to improving transparency and efficiency in all development cooperation and humanitarian work.”

But… here comes the contradiction:

“But if you don’t offer your data under a licence that sets out the terms of use, others won’t know what they’re allowed to do with it and it won’t be classed as ‘open data’. Data users would also need to contact you for permission each time they wanted to use some of your data.”

So, according to the guidance, open licensing is an actual requirement, yet publishers are allowed to do otherwise, e.g. use closed licensing. This is very unclear. Who can/should clear this up?

Source: https://iatistandard.org/en/guidance/preparing-organisation/organisation-data-publication/how-to-license-your-data/


As I mentioned on Twitter, we ignore Postel’s fundamental law of the Internet – “be conservative in what you send, [but] liberal in what you accept” – at our own peril. If there’s any reasonable way we can keep accepting v.1 IATI from active reporters, then it might not be a bad idea to do so.

As for messy licenses, just as Wikipedia isn’t running out of paper, the IATI Datastore won’t be running out of index cards. Let’s take in as much data as we can, from anyone who wants to provide it, then we can flag “bad” data to exclude from the headline reports, leading indicators, and visualisation dashboards (so that there’s still a consequence to not being open).

D

Hmhh, well, I don’t think I agree here. The same could be argued for stuffing 100 doves into your sleeve: you probably could - a magician would - but I’m not sure what the condition of those doves would be after the trick. I don’t think this is about bad data per se, but rather about the conditions under which that data is re-used, and the guidance IATI provides should have some say in this, right? Just accepting anything defies the purpose of having guidance in the first place.

I agree about the reuse problem. That’s why I’d have the data excluded from common queries by default, and included only when the user explicitly opted in (e.g. “Include non-open data” option in the UI, or “&license=nonopen” in the API).

There are some use cases where non-open data is better than no data, but it’s OK to make the users do a bit of extra work in those cases (including demonstrating that they’re aware of the problem).
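The opt-in idea above can be sketched in a few lines. Everything here is hypothetical: the `include_nonopen` flag (standing in for a `license=nonopen` query parameter), the `OPEN_LICENCES` set and the field names are illustrative assumptions, not part of any real Datastore API.

```python
# Hypothetical sketch: non-open data is excluded from query results unless
# the caller explicitly opts in. The licence codes and record shape are
# made up for illustration.
OPEN_LICENCES = {"cc-by", "cc0", "odc-pddl", "other-open"}

def filter_activities(activities, include_nonopen=False):
    """Return only openly licensed activities, unless the caller opts in
    to non-open data (e.g. via a ?license=nonopen query parameter)."""
    if include_nonopen:
        return list(activities)
    return [a for a in activities if a.get("licence") in OPEN_LICENCES]

sample = [
    {"id": "XX-1", "licence": "cc-by"},
    {"id": "XX-2", "licence": "notspecified"},
]

print([a["id"] for a in filter_activities(sample)])                  # ['XX-1']
print([a["id"] for a in filter_activities(sample, True)])  # ['XX-1', 'XX-2']
```

The point of the default is exactly as described above: users who need non-open data can still get it, but they have to demonstrate they know what they are asking for.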

^^ Agreed / cool. Step #1 is this ticket, which would stem the tide of “license unspecified” data.

It sounds like there’s appetite for removing the option to publish closed IATI data going forward (FWIW I support this). Plus the number of activities published with a closed license is really small (see table above). If the option of a closed license were to be removed, I doubt it would be worth special casing for closed data in the datastore API.


Thanks everyone for the detailed, considered and useful answers. It’s like a Technical Advisory Group!

Allow me for a moment to sit on my TAG chair cushion and undertake my duties. In amongst all these exciting conversations and (potential) tangents, I think this is where we are:

  1. On schema validation - I see no objection.
  2. On open licences - we seem to also agree on the principle, but see a contradiction in how an open data standard can accommodate closed licences.
  3. On 1.0x, we seem less ready to “reject” that data - but think the deprecation of v1 should mean active publishers will make a plan to migrate to v2

There are a few tasks coming from this, it seems:

  • clarifying our guidance on closed licences
  • understanding why/how the Registry would allow them
  • thinking through how the Registry might apply some / all of these principles
  • considering how we make available / archive “non-active” version 1 publishers
  • understanding our position on limiting data, in an unlimited data world

But - as we break for the weekend (think of it as a coffee break in this energetic meeting we’re having, but with the chance to get some actual fresh air) I’m hoping this is an adequate summary of where we are at.


Yes.

The Datastore’s priority clients are data users who should reasonably expect to be served usable data.

In my opinion “usable data” sits somewhere in between schema-only validation and full validation against schema, codelists and rulesets.

We have agreed the first step: all activities from active publishers MUST validate against the schema.

BUT we haven’t yet provided the DS developers with guidance and a roadmap as to how to tolerate ruleset and codelist errors.

Good summary - going back to Andy’s original summary of the activities affected, would any exclusions be based on excluding activities rather than whole data files? Just thinking of the 600+ CSO publishers, some of whom have old activities going back to 2011 that won’t meet these criteria, but are part of the same datafile as newer activities that will be 2.0x and meet the criteria. They are not going to have the resources to go back and update older activities. And as many donors now link the payment of funds to the publication of data - it could be a real risk to them to have their datafile pulled completely. What would be the best advice you can give a CSO in advance of these changes?

Hi @SJohns - thanks, it’s a very valid question :slight_smile:

In terms of specific file having a mix of 1.0x and 2.0x activities within it, then I don’t think this is actually possible. The version attribute is only applicable at the <iati-activities> element, not the <iati-activity>, so it can only be declared once per file. It used to be different (in version 1.0x) - but was changed in the move to 2.01 (see changelog). @bill_anderson @IATI-techteam do you agree?
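A quick way to check the point above for yourself - this is a minimal sketch, not an official IATI tool, using a made-up two-activity file - is to parse a file and look at where the `version` attribute actually lives:

```python
# Minimal sketch: the version attribute sits on the root <iati-activities>
# element, so a file can only declare one standard version; individual
# <iati-activity> elements carry no version of their own (since 2.01).
import xml.etree.ElementTree as ET

sample = """<iati-activities version="2.03">
  <iati-activity><iati-identifier>XX-1</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XX-2</iati-identifier></iati-activity>
</iati-activities>"""

root = ET.fromstring(sample)
print(root.get("version"))  # 2.03 -- declared once, file-wide

# Each activity has no version attribute of its own:
print([a.get("version") for a in root.findall("iati-activity")])  # [None, None]
```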

However, the point still remains that it could be possible to publish a file with a mix of valid and invalid activities (in the same version). I think @andylolz did some stats on this too…

@SJohns: pragmatically, I’d suggest any publisher that can’t go back and update old v1.0x data should ensure all new activities are created in a brand new v2.03 activity file. This means all future data will be “datastore compliant”. And perhaps at some point, the old v1.0x data could be one-off converted.

That’s true – in the stats above, schema validation was performed at activity level (i.e. rather than validate each dataset, I validated each activity). So in practice this means the “activity count” is a count of invalid activities, rather than a count of all activities inside invalid datasets.
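Activity-level checking of this kind can be sketched roughly as follows. This is a hedged illustration only: `is_valid_activity` is a stand-in placeholder requiring just an `iati-identifier`, whereas a real implementation would validate each activity against the IATI XSD (e.g. with an XML schema validator such as lxml).

```python
# Sketch of activity-level validation: each <iati-activity> is checked on
# its own, so one bad activity doesn't invalidate its siblings in the file.
import xml.etree.ElementTree as ET

def is_valid_activity(activity):
    # Placeholder rule for illustration; real validation would use the
    # IATI activity schema (XSD).
    return activity.find("iati-identifier") is not None

def count_invalid(xml_string):
    """Count invalid activities in a dataset, activity by activity."""
    root = ET.fromstring(xml_string)
    return sum(1 for a in root.findall("iati-activity")
               if not is_valid_activity(a))

dataset = """<iati-activities version="2.03">
  <iati-activity><iati-identifier>XX-1</iati-identifier></iati-activity>
  <iati-activity><title>missing identifier</title></iati-activity>
</iati-activities>"""

print(count_invalid(dataset))  # 1: only the second activity is rejected
```

Counting this way is what makes the “activity count” above a count of invalid activities rather than of all activities in invalid datasets.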

Slightly off-topic, but does IATI give a ‘lifespan’ estimate when new versions of the standard are created, i.e. as with an operating system, where updates are guaranteed for X number of years? It seems like it might be helpful/standard practice to say of a new version that it will not be deprecated (dropped from the registry/core tools) for X years, or until X date?


@stevieflow @andylolz thanks for clarifying. I was really thinking of this scenario - older, poorer-quality activities within the same datafile as activities of good quality - but I didn’t express it well!! So just thinking about Andy’s suggestion - how would it work for AidStream users?

AidStream users (who are using the full version) should click the button to upgrade to version 2.03 and then continue to add their data for current activities. If they have older, closed activities on AidStream that are poor quality (data missing/incomplete), they can convert them to draft activities in AidStream by editing them. This means that when they publish the datafile, only the current activities will show up, in a datafile tagged as iati-activities version=“2.03”.

This 2.03 datafile should (if no other issues) get pulled through to the new database without the older activities, which will no longer be publicly available. This should not therefore impact their funding (because the current activities are published) but will shorten their track record.

Then if an organisation has extra resources, they can go back and fix the older files if they want to show a longer track record.

For organisations with a small number of activities, this will be feasible to do. For organisations that use AidStream to publish many activities, for multiple donors, it’s going to be a headache, so the more time and warning you can give, the better.

Unfortunately, as soon as funders link an open, public good like IATI to withdrawing the funding an organisation receives to run its programmes (which vulnerable people depend on), it gets a lot more complicated than just excluding data and telling organisations to update it as and when.

@SJohns: I’ve replied on twitter with a suggested approach that doesn’t involve removing existing IATI data.