How useful is the Registry "Data Search"?

stevieflow · October 29, 2018, 8:02pm

It would be great to hear if anyone relies on the IATI Registry “Data Search”, and specifically the file-level set up that some publishers follow: one IATI XML file per recipient-country.

The underlying logic, which I think has been in place since day one of IATI, is that the Data Search can give us all files that are about a certain recipient country, for example these 46 files for Bangladesh:

However - I think it’s well-known that this search is not definitive.

If we check d-portal (for example) we can see that 185 publishers actually have activities with this same recipient-country:

This is a big mismatch. It’s because many publishers might include all their activities in one / a few file(s) (Netherlands @Herman; Canada @YohannaLoucheur ) or also have multi-country activities that may be in a regional file (DFID @JohnAdams).

Therefore - what’s the point of the XML file-per-country and/or the IATI Registry Data Search? Does this actively help data users?

Added to this, the recipient-country filter on the Registry is populated with quite a few non-obvious entries:

Thoughts welcome. I can understand how something like the Datastore might eventually fix this, but we should also think about the Here and Now: there might be someone trying to use IATI data and quickly finding this not useful.

andylolz · October 29, 2018, 11:51pm

Agree – this particular filter shouldn’t exist on the registry. In fact, I think this attribute should also be removed from registry metadata. As you’ve pointed out, it potentially produces misleading results.

Splitting data into files by recipient country is merely convention / a convenience to avoid very large XML files. Not all publishers split their data in this way, nor should they be expected to.

Data users shouldn’t be directed to the registry for this purpose. It’s possible to use d-portal / OIPA / the datastore query builder / other tools for this purpose, all of which will filter data based on the actual declared recipient-country, rather than the dataset metadata.

stevieflow · October 30, 2018, 6:42am

Thanks @andylolz

Regarding the file-per-country publishing - I’m interested to know if anybody relies on this. For example: @markbrough @tdavis do AIMS imports work on this basis?

matmaxgeds · October 30, 2018, 9:03am

Hi @stevieflow - I use this occasionally (4 or 5 times a year perhaps) as a shortcut to seeing whether I can just download one file for the publisher-country combination I am interested in. I just did so for Uganda, requiring me to do searches for ‘UG’, ‘UGA’, ‘Uganda’, and ‘uganda’ - let alone check the numbered examples (tend to be for ‘regions’ if I remember correctly) that you point out.

I am currently supporting the development of an AIMS for UNDP in Somalia that uses IATI data, but there is no way we can use this feature, so we are trying to decide between using the datastore (will the endpoints change? will it return full, unmodified XML?) and the registry (wasted downloads of data we don’t need).

What I don’t understand is why there is standardisation on the two-letter ISO code ‘UG’ for recipient-country at the activity level, but not in the file metadata - or is this wishful thinking about the recipient-country coding as well?

I think that if it is not removed as a feature (or the codes standardised) then there should at least be an obvious note to point out that it can be very misleading in the data it returns unless you try all variations. A multi-select would also be very helpful.
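As a rough illustration of what standardising the codes could look like, here is a minimal Python sketch that maps the variants mentioned above (‘UG’, ‘UGA’, ‘Uganda’, ‘uganda’) onto a single two-letter code. The lookup tables and the function name normalise_country are made up for this example and only cover three countries; this is not how the Registry currently works.

```python
# Illustrative lookup tables only; a real version would cover all of ISO 3166-1.
ALPHA2 = {"UG", "BD", "KE"}
ALPHA3_TO_ALPHA2 = {"UGA": "UG", "BGD": "BD", "KEN": "KE"}
NAME_TO_ALPHA2 = {"UGANDA": "UG", "BANGLADESH": "BD", "KENYA": "KE"}


def normalise_country(value):
    """Map a free-text metadata value to a two-letter code, or None if unknown."""
    key = value.strip().upper()
    if key in ALPHA2:
        return key
    if key in ALPHA3_TO_ALPHA2:
        return ALPHA3_TO_ALPHA2[key]
    # Fall back to a case-insensitive match on the country name.
    return NAME_TO_ALPHA2.get(key)
```

With something like this applied to the metadata, one search for ‘UG’ would cover all four spellings, and values that return None (such as the numbered entries) could be flagged for review.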

Thanks,

Matt

markbrough · October 30, 2018, 9:32am

Hey @stevieflow – I actually find this feature quite useful, I think the same way as Matt. Although many donors don’t segment by country, many do, and it can quite often be useful for debugging. For example, if something is going wrong in the IATI import in Bangladesh I can take a quick look at worldbank-bd or dfid-bd and see how the data that eventually made it into the system (via the Datastore) compares with the originally published data (I guess it doesn’t necessarily require the country code drop-down, as I already know the package name I’m looking for).

Once the IATI Datastore is more reliable and we can be much more certain that nothing is being lost, maybe we can do away with this feature, but until then, I think it is quite useful.

tdavis · October 30, 2018, 6:52pm

Hi @stevieflow for our IATI-AIMS Importer, it uses the recipient country field to identify what country the activity goes to. So if multiple countries are included in a single file, it identifies that and prompts users to select which country they want to import for.
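For anyone wanting to picture this, below is a minimal sketch of reading recipient-country codes from the activities in a file rather than from the file metadata. It is not the IATI-AIMS Importer's actual code: the sample XML, the identifiers and the function names are invented for illustration, and only activity-level recipient-country elements are read.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for illustration.
SAMPLE_XML = """<iati-activities>
  <iati-activity>
    <iati-identifier>XX-EXAMPLE-1</iati-identifier>
    <recipient-country code="BD"/>
  </iati-activity>
  <iati-activity>
    <iati-identifier>XX-EXAMPLE-2</iati-identifier>
    <recipient-country code="UG" percentage="60"/>
    <recipient-country code="KE" percentage="40"/>
  </iati-activity>
</iati-activities>"""


def activity_countries(activity):
    """Return the set of recipient-country codes declared on one activity."""
    return {
        el.get("code").upper()
        for el in activity.findall("recipient-country")
        if el.get("code")
    }


def countries_in_file(xml_text):
    """List every country found in the file, e.g. to prompt the user to pick one."""
    root = ET.fromstring(xml_text)
    codes = set()
    for activity in root.iter("iati-activity"):
        codes |= activity_countries(activity)
    return sorted(codes)


def activities_for_country(xml_text, country_code):
    """Return the identifiers of activities that declare the given country."""
    root = ET.fromstring(xml_text)
    wanted = country_code.upper()
    return [
        activity.findtext("iati-identifier")
        for activity in root.iter("iati-activity")
        if wanted in activity_countries(activity)
    ]
```

If countries_in_file returns more than one code, a tool could ask the user which country to import and then pass that choice to activities_for_country.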

andylolz · November 2, 2018, 4:48pm

I dug into this a bit more…

21% of activity datasets (a total of 1,282) containing 24% of activities (a total of 263,337) do not list a recipient country in their registry metadata. That’s a large proportion of IATI activities that are unreachable via the recipient country search.

A further ~50,000 activities include a recipient-country/@code in the data that appears to disagree with the recipient country code in the dataset metadata. (Though note that I’ve reached this number in a bit of a hacky way… I haven’t attempted to harmonize 2 and 3 letter country codes, and I’ve ignored regions completely.)
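The script behind these numbers isn’t shown here, but the comparison can be sketched roughly as follows. The input structure (a list of dicts with a metadata_country value and one set of codes per activity) and the function name are assumptions for this example. Like the count above, it does not harmonise two- and three-letter codes and ignores regions.

```python
def count_mismatches(datasets):
    """Count activities with no metadata country, and activities whose own
    recipient-country codes do not include the dataset's metadata country."""
    no_metadata = 0
    mismatched = 0
    for dataset in datasets:
        declared = dataset["metadata_country"]
        for codes in dataset["activities"]:
            if not declared:
                no_metadata += 1
            elif codes and declared.upper() not in codes:
                # No harmonisation: 'UGA' in metadata will not match 'UG' in data.
                mismatched += 1
    return {"no_metadata_country": no_metadata, "mismatched": mismatched}


# Made-up example input.
EXAMPLE_DATASETS = [
    {"metadata_country": "BD", "activities": [{"BD"}, {"IN"}, set()]},
    {"metadata_country": None, "activities": [{"UG"}, {"UG", "KE"}]},
    {"metadata_country": "UGA", "activities": [{"UG"}]},
]
```

The third example dataset shows why a count like this is only approximate: ‘UGA’ and ‘UG’ refer to the same country but are counted as a disagreement.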

It sounds like people do use the recipient country filter, so I’ll roll back on my previous answer and suggest perhaps the solution would be to fix this, rather than remove it. This metadata could be set automatically (like other dataset metadata is currently, including data_updated), and could list all recipient-country / recipient-region codes in a dataset, rather than just one (as @matmaxgeds suggests). But note that for some datasets, this list will be quite long.
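A minimal sketch of what deriving that metadata automatically might look like is below. The output keys (recipient_countries, recipient_regions) and the sample XML are hypothetical, not existing registry fields, and only activity-level elements are read.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for illustration.
SAMPLE_DATASET = """<iati-activities>
  <iati-activity>
    <recipient-country code="UG"/>
    <recipient-country code="KE"/>
  </iati-activity>
  <iati-activity>
    <recipient-region code="298"/>
  </iati-activity>
  <iati-activity>
    <recipient-country code="ug"/>
  </iati-activity>
</iati-activities>"""


def derive_recipient_metadata(xml_text):
    """Collect every distinct recipient country and region code in a dataset."""
    root = ET.fromstring(xml_text)
    countries, regions = set(), set()
    for activity in root.iter("iati-activity"):
        # Direct children only, so transaction-level elements are not included.
        for el in activity.findall("recipient-country"):
            if el.get("code"):
                countries.add(el.get("code").upper())
        for el in activity.findall("recipient-region"):
            if el.get("code"):
                regions.add(el.get("code"))
    return {
        "recipient_countries": sorted(countries),
        "recipient_regions": sorted(regions),
    }
```

For a publisher with everything in one file, the countries list could run to well over a hundred entries, which is the “quite long” case mentioned above.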

andylolz · November 1, 2018, 8:04pm

This soooounds like the IATI-AIMS Importer may be using the XML, rather than the metadata (given that the metadata can include maximally one recipient country). If so: that’s good, given the metadata is currently misleading.

stevieflow · November 1, 2018, 9:36pm

Thanks @matmaxgeds @markbrough & @tdavis for your practical examples - really very helpful. Huge thanks @andylolz for the research - which does seem to support the theory that the Registry metadata is largely unreliable.

I’d suggest the following observations:

It seems lucky that some publishers segment their data in specific recipient-country files where they can: this seems to help people using the data, particularly when debugging/testing
But for someone casually browsing the IATI Registry, the idea that the recipient-country file metadata - expressed through the drop-down menu on the Data search page - would be precise seems mistaken.

As far as I know, the standard has nothing to say on file segmentation, except that any file should be 40MB or less (can anyone find that guidance?). I’d propose the following might be helpful:

When segmenting activities into files, please aim to avoid any intended usage via the file metadata. Services that use IATI data will work with data within any IATI activity.

(that ^^ isn’t the best, but hopefully someone can improve on this!)

andylolz · November 2, 2018, 12:30pm

I don’t think the guidance says anything on segmentation anymore (I couldn’t find it, anyway). Here’s the bit I think you’re referring to, from the archives:

It is recommended that publishers segment their files to ensure that no file is larger than 40MB in size. Whilst there is no definitive rule as to when a file is too large, we recommend a maximum file size in order to minimise any issues of file processing by both IATI and third-party data users.

With regard to segmentation, there is no definitive best practice for how files should be segmented. By country was historically the preferred option but this is no longer a requirement because published data can now be queried more easily via the IATI Datastore. The publisher should therefore carry out segmentation on whatever makes most sense to their publishing process. Examples might be by country, by region, by grant or loan type etc. As data volumes have increased some publishers are now segmenting by open and closed activities. This has the benefit that once created, there is no further requirement to reproduce or update a file of closed projects each time the Publisher updates their data.

matmaxgeds · November 2, 2018, 1:43pm

What would happen if we removed (or for now, ignored) the ‘recipient country’ field from the registry metadata? Presumably the registry search field would have to do the same as we all do, and check the recipient-country tag on each activity in each file - if that is achievable, I think it could be better. So the dropdown doesn’t get too complex, selecting ‘UG’ would show you all files that had at least one activity coded to ‘UG’, with no need for a dropdown that had ‘UG, KE, SO, RW etc’ just to be able to select one publisher’s East Africa file.
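To make the idea concrete, here is a small sketch of an index from country code to the files that contain at least one activity coded to that country. The dataset names and country sets are invented, and in practice the sets would come from scanning the recipient-country elements in each file.

```python
from collections import defaultdict

# Made-up input: dataset name -> country codes found on its activities.
EXAMPLE_DATASET_COUNTRIES = {
    "example-eastafrica": {"UG", "KE", "SO", "RW"},
    "example-ug": {"UG"},
    "example-bd": {"BD"},
}


def build_country_index(dataset_countries):
    """Invert the mapping so each country code points at the datasets that mention it."""
    index = defaultdict(set)
    for dataset_name, codes in dataset_countries.items():
        for code in codes:
            index[code.upper()].add(dataset_name)
    return index


def search(index, country_code):
    """Return all datasets with at least one activity coded to the country."""
    return sorted(index.get(country_code.upper(), set()))
```

Here, selecting ‘UG’ returns both the single-country file and the East Africa file, while the dropdown itself only needs one entry per country.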

A side question - is there any chance/way that the use of ISO two-letter country codes will be enforced (I assumed this was a requirement)? Do files that use three-letter codes or full names at least fail the data quality tests? Is there any process to notify publishers (ideally automatically) when they use their own three-letter code?

andylolz · November 2, 2018, 9:23pm

Absolutely! So, this potentially involves searching a lot of data. Therefore it’s probably best done asynchronously and cached, ready for use by the search. There’s currently a process that runs on the registry and does exactly this sort of asynchronous caching – it’s called the “IATI Archiver” (formerly IATI Harvester, I think). You can see it making metadata updates if you click the “History” tab on any dataset. For example…

The IATI Harvester (and metadata updating) is referenced a couple of times in the IATI Tech Audit write-up. One of those recommendations was: “Build additional scripts to manage metadata as a service external to the registry.” (emphasis mine.) Also, “Put certain ideas out on discuss for developers, i.e. ask if anyone wants to work on splitting off […] IATI Harvester”

I think updating this recipient country metadata could be included as part of this service. (In fact, I had to write a script that did pretty much this, in order to derive the information above.) I notice, though, that work is already underway by the registry supplier derilinx to improve or rewrite the existing service inside of the registry. Therefore, it would be great to get an understanding of current plans from the @IATI-techteam.

thea · November 7, 2018, 5:54pm

Hi all! A search for the reason behind multiple segmented files per country brought me here…

Aidstream actually still recommends doing this in its “publishing settings”, because the IATI standard suggests it. My original question is: does IATI still recommend this too? If not, perhaps Aidstream can update their recommendation on the site?

A lot of organisations publish one file for all their activities, and that makes searching for the recipient-country of the file, rather than in the activities themselves, very unreliable. Perhaps we should do what’s been mentioned before: have the search option rely on the recipient country on the activity rather than on the metadata on the file.

Interesting to see that people use the files per country to check errors. I don’t think I quite follow, but that’s probably because the project management database of Plan Netherlands produces a file for us, and validates it too, so we probably don’t work the same way.

From a data analyst’s view, I don’t really see an upside to the multiple files. I use the files for my visualisation in Power BI because the data store is sometimes down, and one file is always easier than multiple.

Herman · November 9, 2018, 3:59pm

I fully agree with @andylolz and @stevieflow. The technical splitting of publisher files by country might introduce problems, like publishing the same activity in multiple files because an activity is implemented in multiple countries. This will produce redundancy in a publisher’s IATI publication and it might therefore introduce inconsistencies and double counting.

Once there is a properly working datastore, there will no longer be any need to split XML files by country.

All in all splitting IATI files by country is i.m.o. not to be recommended.

matmaxgeds · March 26, 2019, 7:29pm

Did anything come of these discussions beyond a suggestion for the publishing guidance? There were some good suggestions e.g. making recipient-country field use the activity level codes, not just the file metadata?

I ask because I am using the registry search a lot at the moment, and it would be really helpful if, after selecting one of the filters, options that are no longer available were removed from the other filters.

Thanks!

stevieflow · April 3, 2019, 11:40am

Thanks for the prompt @matmaxgeds

I think it’d be very useful to hear from the @IATI-techteam in terms of any planned next steps. We’ve had a fruitful discussion above - but must be mindful that the Registry is supplied to the secretariat by an external vendor. @IATI-techteam how can we help in terms of our recommendations ^^ ?

KateHughes · April 4, 2019, 9:33am

The work on the registry is planned to start in the latter half of this quarter. We will need to scope it fully before we can start. HDX use “vanilla CKAN” and have their own scripts for the metadata, and the repo they have for this is very expansive.
We’ve had one chat with the team at HDX so far and will be talking to them again soon.
We’ll be publishing a quarterly update blog next week, and then you’ll be able to read more about our plans for the quarter.