Getting to a list of organisation references for IATI publishers

stevieflow · October 4, 2017, 12:52pm

Question: where can I get a list of organisation references for IATI publishers?

Answer - here’s a list, which took some steps to produce

Sounds simple? Well, it took us a few steps to get to this (big thanks to @bjwebb):

We first looked at the IATI Registry. Each publisher has a freetext field for their organisation reference. We found this to be inconsistent and unreliable: the dashboard confirms this.
We then looked at Activity data. The reporting-org reference is useful here. But it is repeated in every activity, and sometimes differs between them. It seemed a lot of overhead to check 4500+ files for around 500 identifiers.
So, finally, we landed at the Organisation file. Big surprise!

Yes. When discovering the preferred organisation reference for an IATI publisher, we found the Organisation files to be the best source. Specifically, we looked for instances where:

reporting-org/@ref matches the identifier in organisation-identifier (or iati-identifier in versions 1.0x)

(it’s entirely possible and feasible to reference many organisations in a single org file, but we focused on matches between the publisher and org reference initially)
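For anyone wanting to repeat the check, here is a minimal sketch of the matching rule described above, not the script we actually ran. The function name and the sample file are hypothetical; the element names (organisation-identifier, iati-identifier, reporting-org/@ref) are the ones mentioned in this post.

```python
import xml.etree.ElementTree as ET


def matching_org_refs(org_xml: str) -> list[str]:
    """Return identifiers where reporting-org/@ref equals the organisation's own identifier."""
    root = ET.fromstring(org_xml)
    matches = []
    for org in root.iter("iati-organisation"):
        # 2.0x files use organisation-identifier; 1.0x files used iati-identifier.
        identifier = org.findtext("organisation-identifier") or org.findtext("iati-identifier")
        reporting_org = org.find("reporting-org")
        if identifier is None or reporting_org is None:
            continue
        ref = (reporting_org.get("ref") or "").strip()
        if ref and ref == identifier.strip():
            matches.append(ref)
    return matches


# Hypothetical, heavily trimmed organisation file used only for illustration.
SAMPLE_ORG_FILE = """<iati-organisations version="2.02">
  <iati-organisation>
    <organisation-identifier>GB-GOV-1</organisation-identifier>
    <reporting-org ref="GB-GOV-1" type="10"/>
  </iati-organisation>
</iati-organisations>"""

print(matching_org_refs(SAMPLE_ORG_FILE))
```

A file that references several organisations would simply yield only those entries where the two values agree.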

We took a look into this (in July 2017), and found:

Of 555 IATI publishers, there were 406 IATI organisation files (73% of publishers provide an Org file)
Of these 406 publishers, 392 organisation identifiers match the reporting-org/@ref (97%)

So - when a publisher provides an Org file, it’s highly likely that this will be a definitive source for their organisation reference

So far, so good.

Next, we took a look at the prefixes for these references, to understand if these were available via the org-id project (which IATI supports). On doing this, we found:

Of 392 “matching” organisation identifiers, 333 started with a “recognised” prefix (85%)

So again - the trend seems to be that definitive, standardised and useful organisation references are highly likely to be found via the IATI Org file.
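As a rough illustration of the prefix check and the percentages quoted above, a sketch might look like the following. The prefix set is a small hand-picked subset for illustration only (the full register list lives with the org-id project), and the function names are hypothetical.

```python
# Illustrative subset only; this is NOT the full list of recognised prefixes.
RECOGNISED_PREFIXES = {"GB-GOV", "GB-CHC", "NL-KVK", "XM-DAC"}


def has_recognised_prefix(identifier: str, prefixes=RECOGNISED_PREFIXES) -> bool:
    """True if the identifier has the form '<prefix>-<something>' for a known prefix."""
    return any(
        identifier.startswith(prefix + "-") and len(identifier) > len(prefix) + 1
        for prefix in prefixes
    )


def share(part: int, whole: int) -> int:
    """Percentage of part in whole, rounded to the nearest whole number."""
    return round(100 * part / whole)


print(has_recognised_prefix("GB-GOV-1"))  # identifier example taken from this thread
print(share(406, 555), share(392, 406), share(333, 392))  # counts quoted in the post
```

Requiring the trailing hyphen avoids treating a value such as "GB-GOVERNMENT" as if it carried the GB-GOV prefix.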

What Next?

It seems obvious:

all publishers should provide an Org file
the definitive organisation reference for a publisher should be maintained in the Org file

It doesn’t seem a lot of effort to get to nearly 100% coverage in the above metrics. It would be useful to hear thoughts from others.

Last notes:

this isn’t a list of all the organisations mentioned via IATI data: we wanted to focus initially on the growing list of publishers
how such a list (if it is a list - @TimDavies thinks it’s more of a cache) is published, maintained and used is another topic. In doing this research, we output the identifiers as a spreadsheet, but also looked at how the IATI Organisation standard could be used.

Reichner · October 4, 2017, 2:02pm

Great job. Very valuable work has been done, and it is exactly this kind of groundwork (common code lists, alignment and validation of published data) that ensures IATI data can be used properly. I very much hope that this continues and that all publishers are requested to publish the org file with proper content. Mandatory validation of such basics is a must if IATI wants to achieve its challenging goals.

VincentVW · October 4, 2017, 2:29pm

Totally agree. In your list there are also some empty names due to orgs not reporting the “name” element in the org standard; it would be good to put emphasis on that too imo. At the moment it is harder than it should be for toolbuilders to just create a correct list of publishers (aka org ref + names) to be used as a filter.

At the moment the publisher name often differs from the organisation standard name element, which sometimes differs from their activities reporting-org/narrative. It might not sound like a big problem but it leads to usability problems when a user searches for a different name than what the application uses as name (had questions about this multiple times on IATI Studio).

stevieflow · October 4, 2017, 4:53pm

@Reichner @VincentVW many thanks

Yes. I didn’t mention it, but I wonder if we can say that the Org file (when the above conditions are met) is the canonical source of references for organisations who publish IATI data.

We can also look to refresh the research and get an updated “list”, but would be useful to understand if this approach and data is going to be useful to others.

stevieflow · October 24, 2017, 12:50pm

We could try to update and maintain this list as a community effort, but when speaking to others at the Members’ Assembly (@Reichner, @andylolz) we did think this “list” might be best considered key infrastructure for IATI, and could potentially be output via the IATI Registry.

@IATI-techteam: any thoughts?

bill_anderson · October 24, 2017, 2:22pm

I would prefer to see this as part of a shared, cross-standard list, maintained by (a sustainably funded) org-id.guide.

Taking this a step further I would like to see (a sustainably funded) org-id.guide provide a service that harvests all valid org-ids used in all places in all participating standards.

andylolz · November 27, 2018, 9:33pm

Here’s a first pass at an org-id finder, using (more or less) the methodology described above: https://andylolz.github.io/org-id-finder/

What do you have in mind here, @stevieflow? Does that roughly do the thing publishers need it to do?

reidmporter · October 25, 2017, 6:46pm

Spoke with @andylolz about this briefly offline, and tbh I’m sure I don’t fully appreciate the use case here (though I trust there is one)…But…

At a strategic level, I’m with @bill_anderson - this needs to be integrated with other infrastructure, sustainably financed, embedded with other complementary tools, etc. I return to my pub props - ketchup bottle, salt shaker, hot sauce (which then became water bottles at the last miniTAG): if we solve every problem individually, without tying those solutions together in some way, we’re increasing the maintenance burden for ourselves and adding to the overwhelmed feeling new publishers get when introduced to the ever-expanding list of tools and workarounds-as-tools. (I think this is aimed at a more forgiving clientele, but I’ll say it anyway.)
Tactically, this seems very similar, possibly overlapping in parts, but at the very least tangentially related to the work @anjesh and Young Innovations are doing with the org data clean up, org-data API service, and AidStream UI integration. Anjesh can say more, and @TimDavies is involved in both, so I don’t think there’s risk of duplicating work, but wanted people to be aware since they’re working in proximity to each other. See more: https://github.com/younginnovations/aidstream-org-data

stevieflow · October 26, 2017, 3:26pm

I just want to take a slight step back here to the original question

The reason (or use case) for this would be:

IATI publishing organisations are very likely to be talking about themselves and their involvement in activities
With traceability, a key and vital consideration is that organisations can talk about each other in consistent and unique ways (GB-GOV-1 rather than DFID or DfID or D.F.I.D)
IATI publishers self-identify. If we can be certain about what they use to identify themselves, then that can be useful to others, for reuse
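To make the second point concrete, here is a small sketch of how name variants could be resolved to one reference. The alias table holds only the example from this post, and the function names are hypothetical; a real lookup would need a maintained list of names per publisher.

```python
def normalise(name: str) -> str:
    """Lower-case the name and drop punctuation and spaces, so variants collapse together."""
    return "".join(ch for ch in name.casefold() if ch.isalnum())


# Single illustrative entry, taken from the example above.
ALIASES = {normalise("DFID"): "GB-GOV-1"}


def canonical_ref(name: str, aliases=ALIASES):
    """Return the organisation reference for a name variant, or None if it is unknown."""
    return aliases.get(normalise(name))


for variant in ("DFID", "DfID", "D.F.I.D"):
    print(variant, "->", canonical_ref(variant))
```

This only helps once we are confident about the reference on the right-hand side, which is the point of finding a reliable source for it.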

The intention here is to not solve wider issues about the canonical reference for organisations outside of the 555 (now 592) publishers, right now. We just wanted to test and find the most reliable source for being confident of the reference for the publishers. It turned out that is the organisation file, published by these very same organisations!

In terms of org-id.guide. Right now, that is a service to check/verify the first part of any reference (the GB-GOV or GB-GOV-1, for example). We used this in the methodology above. org-id.guide is not currently about hosting lists of specific organisation identifiers.

The IATI Registry is a listing of organisations that publish data with the IATI format. We’ve a list, which is automated by the software underneath it, has an API underneath it, and is administered by the core @IATI-techteam. If the IATI Registry can serve out (perhaps along the methodology outlined above) the identifiers for the publishers registered on its own system, in an authoritative way, then that seems very helpful for people such as @Reichner and others.

Honestly, it’s really fantastic that @andylolz has built something, and that @anjesh and team are integrating org referencing tools to AidStream. But, I also think building upon the core and common infrastructure we have in place already, is well worth consideration, for this particular question

andylolz · October 27, 2017, 3:35pm

Registry metadata is available via API, and includes the publisher’s name and publisher’s organisation identifier. Here’s a random example. As both a user and a publisher, I’ve found the overlap between the publisher metadata on the registry versus the information in the organisation file super confusing.

Pulling organisation data from organisation files into the registry metadata would be awesome. I think that would achieve the thing you’re talking about here, @stevieflow. The registry archiver already does a similar job when it comes to last updated timestamps for datasets, so there’s precedent for this (albeit it’s currently being fixed!) What do you think to this change, @IATI-techteam?

Then the registry would be providing a publisher-maintained, centralised list of organisation identifiers, available via API. Adding endpoints to make it queryable by organisation name and/or organisation identifier would also be brilliant.
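A rough sketch of reading those two values from the registry API follows. Several things here are assumptions rather than confirmed details: the base URL, the use of the standard CKAN `organization_show` action, the field names `title` and `publisher_iati_id`, and the publisher id `example-publisher`, which is made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL for the registry's CKAN-style action API.
REGISTRY_API = "https://iatiregistry.org/api/3/action"


def publisher_url(publisher_id: str) -> str:
    """Build the organization_show URL for one publisher."""
    return f"{REGISTRY_API}/organization_show?{urlencode({'id': publisher_id})}"


def extract_publisher(response: dict) -> dict:
    """Pull name and organisation reference out of an API response (field names assumed)."""
    result = response.get("result") or {}
    return {"name": result.get("title"), "org_ref": result.get("publisher_iati_id")}


def fetch_publisher(publisher_id: str) -> dict:
    """Fetch and parse one publisher record; makes a network request when called."""
    with urlopen(publisher_url(publisher_id)) as resp:
        return extract_publisher(json.load(resp))


print(publisher_url("example-publisher"))  # hypothetical publisher id
```

If organisation file data were pulled into this metadata, the same call would return the publisher-maintained reference rather than the freetext one.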

stevieflow · October 26, 2017, 3:28pm

Yes, agree.

(my emphasis) - but that’s the crucial bit for me.

andylolz · January 22, 2018, 10:41am

Bumping this:

What do you think, @IATI-techteam?

I use https://andylolz.github.io/org-id-finder/ quite often. It would be great if the registry could provide this service directly.

stevieflow · January 22, 2018, 3:53pm

+1

I see this as core infrastructure, and based on the needs of publishers.

anjesh · January 24, 2018, 10:41am

Hi Andy and all,

I am a bit late to the discussion here. As Reid mentioned earlier in the thread, we are doing something similar in AidStream, for AidStream users, where we are taking data from org xml and the publishers list in the registry. We are only consuming data that passes org-id.guide criteria or is present in iati-org-codelist; the rest is ignored even if it is included in org-xmls. We want this to be a controlled list instead of solely consuming org-xml files. There are a number of issues with org-xml files which might give wrong info to users. I randomly typed DFID and got this
Apparently this xml has that id https://aidstream.org/files/xml/stromme_ug-org.xml

We are putting extra eyes on this to avoid situations like this one, but there is still a chance of missing some as the number of orgs increases. So we call for suggestions from users as well to improve the data, like providing alternative names for organisations so that searching for DFID also gives results here: http://api.stage.aidstream.org/organisation It’s far from perfect, but we hope that this will at least help the majority of AidStream users to improve the data for a limited number of organisations to start with.

We are releasing this as part of a new AidStream feature solely targeting the participating-organisations data.

I would be very happy to collaborate and see how we can combine our forces on org-data.

Best
Anjesh.

andylolz · January 24, 2018, 2:43pm

Nice! Thanks for sharing, @anjesh! I’m excited about something like this being baked into AidStream.

Just to respond on this point:

So, I did it this way by design, mostly because I don’t have the time or desire to take ownership of someone else’s data issues. Funnily enough, I did exactly the same search as you last week, and found the same data issue. But instead of taking responsibility for the problem and fixing it myself, I was able to trace where the problem was, by clicking the source link:

I reported the issue (via zendesk) last week, and it’s currently with the publisher in question to fix.

Admittedly, that doesn’t help users in the meantime – the data is bad, and remains bad until the publisher fixes it. But once it’s fixed, it’s fixed for everyone. I’d encourage you to also bubble up the data issues you find back to the publishers.

stevieflow · January 24, 2018, 6:22pm

At the “Mini developers TAG” meeting today, I heard several people (I think) reiterate the need for a canonical list of verified organisation references (my words). I pointed to this thread on twitter, but want to flag again.

I also wanted to reiterate that the method we went through confirmed (as @anjesh describes) that the Organisation XML files seem to be our best initial source of these references. This isn’t an ID for every single organisation mentioned in IATI data, but it is a start.

And - I’m going to do that thing of tagging people I heard say (or at least listen to!) this: @rolfkleef @pelleaardema @Herman @siemvaessen @hayfield @bill_anderson @Imogen_Kutz @r_clements @JohnAdams

Herman · January 24, 2018, 7:22pm

A canonical list of activity ids would also be very helpful to implement validation of references to other activities. The lack of both the canonical org id and activity lists as a part of the IATI infrastructure causes quite some headaches and duplication of effort to do very basic data validation checks.

siemvaessen · January 25, 2018, 2:00pm

So, who will be managing this list to the extent that anyone can trust this list for it to be codified into a codelist?

Herman · January 25, 2018, 2:40pm

Any working solution should not have manual intervention, since new organisations and activities are frequently added. What would be helpful is that the existing organisation and activity codes are automatically extracted from the current IATI XML publications. Then these lists can be used to validate whether referenced organisation ids and activity ids actually exist.

Trust is implicit since an IATI publisher is responsible for its own data. So if you as a publisher publish an activity with a certain identifier, that is by definition the truth since you own the data for that activity.
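A minimal sketch of the automatic extract-then-validate idea described above, assuming activity files are already downloaded as XML strings. The function names and the sample identifiers (XX-EXAMPLE-...) are hypothetical; the elements used (iati-identifier, related-activity/@ref, participating-org/@ref) are standard IATI activity elements.

```python
import xml.etree.ElementTree as ET


def collect_activity_ids(activity_xmls):
    """Gather every iati-identifier found in the given activity files."""
    ids = set()
    for xml_text in activity_xmls:
        for activity in ET.fromstring(xml_text).iter("iati-activity"):
            identifier = (activity.findtext("iati-identifier") or "").strip()
            if identifier:
                ids.add(identifier)
    return ids


def unknown_references(activity_xml, known_activity_ids, known_org_ids):
    """List (element, ref) pairs whose ref is not in the extracted lists."""
    root = ET.fromstring(activity_xml)
    checks = (("related-activity", known_activity_ids), ("participating-org", known_org_ids))
    missing = []
    for tag, known in checks:
        for element in root.iter(tag):
            ref = (element.get("ref") or "").strip()
            if ref and ref not in known:
                missing.append((tag, ref))
    return missing


# Hypothetical, heavily trimmed activity file used only for illustration.
SAMPLE_ACTIVITY_FILE = """<iati-activities version="2.02">
  <iati-activity>
    <iati-identifier>XX-EXAMPLE-1-A</iati-identifier>
    <participating-org ref="XX-EXAMPLE-1" role="1"/>
    <participating-org ref="XX-UNKNOWN-9" role="4"/>
    <related-activity ref="XX-EXAMPLE-1-B" type="1"/>
  </iati-activity>
</iati-activities>"""

known_activities = collect_activity_ids([SAMPLE_ACTIVITY_FILE])
print(unknown_references(SAMPLE_ACTIVITY_FILE, known_activities, {"XX-EXAMPLE-1"}))
```

The known organisation ids would come from the Organisation files, as discussed earlier in the thread, so no manual list is involved.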

andylolz · January 25, 2018, 8:23pm

For clarity: In what sense does this thing fail to do what you’re looking for, @stevieflow / @siemvaessen / @Herman?

^^ I agree with this! That’s why it’s exactly how this thing works (for org IDs, at least.) For activity IDs, I guess you need to look to a datastore like this one (though I’m afraid I have no way to judge its trustworthiness!)

I guess you’re talking about the secretariat either funding, managing, running or endorsing this somehow. Is that right?