Data retention on the Data Store

The Data Store depends on whatever is available on the IATI Registry. We have come across a use case with @Wendy on the Humanitarian Data Portal (Grand Bargain) where data is no longer available in the Data Store once it becomes unavailable in the Registry.

While some publishers remove their data on purpose, data sometimes becomes unreachable by accident; either way, it is then removed from the Datastore.

What retention policy should we consider here? And should the Registry offer a publisher ‘kill switch’ that, when set to true, tells services further down the data pipeline (Datastore etc.) to remove the data, subject to a retention policy of, say, 30 days?
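
For illustration only, here is a minimal sketch of how a downstream service might honour such a flag, assuming the Registry exposed hypothetical `remove_requested` / `remove_requested_at` fields in a dataset's metadata (it does not today):

```python
from datetime import datetime, timedelta, timezone

RETENTION_PERIOD = timedelta(days=30)  # example retention window from the post above

def should_purge(dataset_metadata: dict, now=None) -> bool:
    """Return True once a dataset flagged for removal has passed the retention window.

    'remove_requested' and 'remove_requested_at' are hypothetical fields a Registry
    kill switch might expose in dataset metadata; they do not exist today.
    """
    now = now or datetime.now(timezone.utc)
    if not dataset_metadata.get("remove_requested", False):
        return False
    requested_at = datetime.fromisoformat(dataset_metadata["remove_requested_at"])
    return now - requested_at >= RETENTION_PERIOD

# e.g. should_purge({"remove_requested": True,
#                    "remove_requested_at": "2019-07-01T00:00:00+00:00"})
```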

1 Like

Wouldn’t removing the data from the Datastore and other tools also affect how we can report on historical data, since there would be gaps where data used to be?
Although I suppose that would be the case anyway whenever a publisher takes their file offline.

That aside, I would say a notification system that tells all the tools that a publisher has removed their data is what we should aim for, so that we can ensure consistency across tools.
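
To make that concrete, a removal notification could be as small as the sketch below; the event name and fields are purely illustrative, not an existing IATI or Registry format:

```python
import json
from datetime import datetime, timezone

def build_removal_notice(publisher_id: str, dataset_id: str, reason: str) -> str:
    """Build a 'dataset removed' event the Registry could push to every tool
    (field names are illustrative, not an existing IATI format)."""
    return json.dumps({
        "event": "dataset-removed",
        "publisher": publisher_id,
        "dataset": dataset_id,
        "reason": reason,  # e.g. "withdrawn by publisher" vs "published in error"
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

# e.g. build_removal_notice("aa-aaa", "aa-aaa-activities", "withdrawn by publisher")
```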

As far as I am aware, IATI doesn’t have any approach to historical data; it is up to publishers to decide whether they want to keep historical data available.

My assumption would be that if data is removed or becomes unreachable, it should immediately disappear from the datastore, as there is no way of knowing whether this was an error or on purpose, and I suspect that many publishers would ignore a kill switch.

If the DS kept it, and the publisher then replaced it with another file, there would be issues knowing which one was the right one, etc.

1 Like

Ah, I had not thought about the edge cases of mistakes, publishing the wrong file, or similar generic errors (in which case retaining faulty, broken data becomes detrimental).

I was referring more to data that was valid and was once used in calculations, and then gets removed by the publisher 2-3 years later; that would fundamentally change previous reports, and possibly reports built on comparisons as well (as the comparable data would have been removed).

Maybe there needs to be a way to clarify which datasets were actually published with mistakes and thus need removal, and which data can stay but has been withdrawn or is no longer relevant for other reasons?

@samuele-mattiuzzo - totally with you on the need to think about and decide whether IATI should keep historical data (and what that would involve), but I think it must have been discussed 10 years ago and probably needs solving in that context, not in the context of a problem it causes for the DS, as the implications are far wider. For example, if something was published as v1.03 and the Secretariat stored it for historical purposes, would it then be lost when v1 is deprecated (in which case only part of history is being kept), or would the Secretariat republish it in v2? I guess all these difficulties are why the current situation exists :slight_smile:

1 Like

Splitting this in three:

  1. As far as I am aware, the current position remains that the Datastore is an organised interface to the registry. It is not a curated database. If a publisher removes any data from the registry, it gets removed from the datastore. This is simple. The Datastore does not hold data (historical or otherwise) against the wishes of a publisher.
  2. Ascertaining whether a publisher has removed data or whether it is accidentally, temporarily unavailable is the challenge that @siemvaessen focuses on above, and that does require some rules that I don’t think have been clearly expressed anywhere.
  3. As per 1 above, I don’t think it is the responsibility of the Datastore to perpetually cope with deprecated versions of the standard - where a publisher has neither removed nor refreshed their data - but this is perhaps best dealt with as a separate validation issue.
2 Likes

Agree.

Correct, IATI needs to define a ruleset for this.

Agree.

My guess would be this might involve the 104 FTS files with URLs that have gone missing since July 30th?

As the new IATI Validator currently still uses its own independent harvesting backend (integration with the Datastore is underway), we use the “last seen alive” version of files (from July 29th).

This allows for reasonably robust availability of data independent of availability of source files (or even the Registry).
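
In rough terms, the “last seen alive” behaviour looks something like the sketch below; the file layout, cache location and use of the `requests` library are illustrative assumptions, not the Validator’s actual implementation:

```python
from pathlib import Path

import requests

CACHE_DIR = Path("cache")  # hypothetical local store of last-known-good files

def fetch_with_fallback(dataset_id: str, url: str) -> bytes:
    """Fetch the source file; on any download failure, fall back to the
    'last seen alive' copy saved during the previous successful run."""
    cached = CACHE_DIR / f"{dataset_id}.xml"
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
    except requests.RequestException:
        if cached.exists():
            return cached.read_bytes()  # serve the last-seen-alive version
        raise
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    cached.write_bytes(response.content)  # refresh the cached copy
    return response.content
```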

The “publisher kill switch” would come down to one of two options (a rough sketch follows after the list):

  1. Remove the dataset from the Registry: in that case, the dataset will disappear from the Validator at the next successful refresh from the Registry (currently scheduled every 3 hours).

  2. Remove an activity (or all data) from the data served via the URL of the dataset: in that case, the specific activity will disappear after the next successful file refresh and processing (currently scheduled every 8 hours).
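
To illustrate, both options boil down to a set difference computed at each refresh; this is only a sketch of the idea in Python, not the Validator’s actual code:

```python
def datasets_to_drop(registry_ids: set, stored_ids: set) -> set:
    """Option 1: a dataset no longer listed in the Registry is dropped at the
    next successful Registry refresh."""
    return stored_ids - registry_ids

def activities_to_drop(file_activity_ids: set, stored_activity_ids: set) -> set:
    """Option 2: an activity no longer present in the file served at the
    dataset's URL is dropped at the next successful file refresh."""
    return stored_activity_ids - file_activity_ids
```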

4 Likes

Happily, such a notification system exists:

1 Like

Back to my original subject: data retention. Currently, no policy (rules) on the matter is in place. What is the best way forward to introduce such a policy, and who needs to be involved? What is the mechanism to start this?

I’m in favour of consistency between tools.

So for instance: d-portal has a policy on this (@shi and xriss can elucidate). The validator also has a policy (as @rolfkleef outlines above) which sounds very similar to the one d-portal uses. I think (but would need to check) that the existing datastore policy is also similar.

So there’s already a precedent here. It would seem reasonable for the new datastore to follow the precedent already set (see: Principle of least astonishment).

Sure, but we need a formal rule, not informal precedent. Is there any audit trail of IATI discussion on this? It needs to be discussed beyond the realm of 2-3 tools, seeing that it touches on overall data consistency.

And it also does not provide a solution for data accidentally being unavailable on the Registry while some other tool (a website, for example) depends on another tool (the datastore, for example), and so on.

1 Like

Absolutely agree. I suppose a rule on this should be proposed, agreed, and then formally conveyed (e.g. via a page on iatistandard.org).

I don’t see anything in the draft ToR about it. This was certainly discussed as part of the Manchester developer workshop last year, but I’m afraid I’m not able to find documentation of that.

Thanks, @andylolz - yes, our process is similar to the validator’s:

  1. If a file download errors out, we use the last successful download. (This mostly deals with intermittent network errors.)
  2. We run a check: if the Registry looks suspiciously empty, we skip the Registry update and use the previous version.

Step 2 was added when the Registry hiccuped and emptied itself.
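
Roughly, step 2 amounts to a sanity check like the sketch below; the 50% threshold and the names are illustrative assumptions, not the Datastore’s actual values:

```python
def registry_looks_suspicious(new_count: int, previous_count: int,
                              min_fraction: float = 0.5) -> bool:
    """Return True if the fresh Registry listing has shrunk so much that it is
    more likely an outage than a real mass removal (threshold is an assumption)."""
    if previous_count == 0:
        return False  # nothing to compare against on the very first run
    return new_count < previous_count * min_fraction
```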

2 Likes

I am not convinced that the issue of data being accidentally/temporarily unavailable in the registry should be solved via the datastore, as this means that only tools that use the datastore as the source of their data benefit from it. If everyone needs this feature, perhaps it would be better inserted into the current ToR for the revisions to the registry; then it would be solved once (and with only one ruleset) for everyone?

Me neither; I never made that claim either. Again: this topic tries to address the lack of a data retention policy and how to formalise one. Who within IATI is responsible for shaping this policy? Perhaps @WendyThomas could shed light on the matter?

Where could I find that ToR revisions plan for the Registry?

1 Like

Please can I ask: did a Data Retention Policy for IATI products ever materialise?

Seems the last post from @siemvaessen was querying this with @WendyThomas / @IATI-techteam

@IATI-techteam @WendyThomas is there any update on this topic? Thanks

The IATI Technical Team has now written guidance for data owners setting out the process to be followed when they choose to remove data published to the IATI Standard. Please note that this is separate from data retention, which sets policies on, for instance, how long a dataset is retained in a specific application. We will look at retention in the future, but at the moment the priority focus was on developing data removal guidance.

Please do have a look at the IATI data removal guidance Discuss post and let us know if you have any clarification points.