IATI Identifiers should not be allowed to contain special characters

@TimDavies I think the bigger problem lies with the ‘project number’ part of the identifier, though I take the point that this could crop up on the organisation side as well

@r_clements you make a strong case, but how do you maintain a link to the original project number:

  1. Is there a conversion (that works in both directions) that could be standardised?
  2. Or do you argue that the benefits of breaking that link outweigh the drawbacks?

Hello @bill_anderson, in this case we were lucky in that the project was not linked to any other IATI data sets, so I asked the organisation to fix it (i.e. remove the ’ and republish). I must apologise to the community here, as I didn’t realise that you’re not meant to do this, so I misinformed the publisher in this case.

There were other issues with their published data: projects were published four times a year because the ID was adapted to reflect the current financial quarter, causing duplication of data within the IATI network. Again, we have advised the organisation involved about this, and they are changing their data going forward so that they publish a project ID only once and adjust the finances rather than change the ID.

I think my real concern with the IATI identifiers is that, when we start to really improve our linking through the network via the transaction ref fields, a badly formed identifier (i.e. one that is not URL-compliant) is going to cause issues for anyone trying to track funds when they’re using API calls to return the data.

It’s not impossible to work round this issue, but I think we’re going to make the IATI data much more complex to work with than it needs to be. I suppose my core question is this: difficulties of changing the systems to enforce compliant behaviour aside, is it too much to ask project inputters to avoid using these specific characters when allocating IATI Identifiers?

Sorry, I went off a bit there. I think you’ve hit the nail on the head: in the first instance I would break the link, so 2., but would look to flag this in some way (Vincent’s IATI bug tracker could be adapted for this), so that part 1. can happen in conversation between the impacted organisations.

Do we know firsthand how many IATI identifiers contain special characters? I ask because I don’t have those numbers in front of me.

From an API (OIPA) perspective this has been an ongoing issue for years now. Non-compliance with URL/URI rules in many cases requires a custom approach on our side. Since others may take a different approach as well, this does not help interoperability if, for example, we were to align different systems in the IATI network.

Basically the issue is two-fold: an IATI org. identifier can contain special characters, and the additional identifier may contain special characters as well.

We would prefer this to be solved at the root of the chain -the standard itself- and not anywhere else.

Isn’t the root of the chain the institution’s own business rules for how they id their projects? IATI doesn’t have jurisdiction over this.

@r_clements just to be clear, when I talked about breaking the link I didn’t mean across IATI datasets, but between IATI and the reporting organisation’s project management system

From the IATI network perspective yes and no. I understand IATI does not have jurisdiction over their business rules, but I guess leaving this as is will continue to cause issues down the line.

What if IATI, from a data quality perspective, picks this up according to a convention / conversion (to be decided) that we can all agree to? As in your option 1, @bill_anderson. What would you propose?

If the consensus is Option 1 I would:

  1. Agree Regex rules
  2. Schema (3.01): Add Regex validation to all fields containing org and activity identifiers
  3. Guideline (2.03): Publishers should follow Regex rules
  4. Action: Add Regex rule and guidance to identify-org.net (@TimDavies ok?)
  5. Guideline (2.03): Publishers whose in-house business model involves use of invalid characters should provide a note (in the registry metadata?) on how users may be able to derive the original project id.
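To make steps 1 and 2 above a bit more concrete, here is a minimal sketch of what checking identifiers against an agreed rule might look like. The pattern itself is purely a placeholder (the thread has not agreed one), and the function names and the `XX-EXAMPLE` prefix are hypothetical.

```python
import re
from typing import List

# Placeholder rule, NOT an agreed IATI pattern: ASCII letters, digits,
# and a small set of punctuation marks.
IDENTIFIER_RULE = re.compile(r"[A-Za-z0-9._-]+")


def is_valid_identifier(identifier: str) -> bool:
    """Return True if the whole identifier matches the placeholder rule."""
    return IDENTIFIER_RULE.fullmatch(identifier) is not None


def invalid_characters(identifier: str) -> List[str]:
    """Return the distinct characters that the placeholder rule rejects."""
    return sorted({ch for ch in identifier if not IDENTIFIER_RULE.fullmatch(ch)})


for candidate in ["XX-EXAMPLE-12345", "XX-EXAMPLE-2017/123", "XX-EXAMPLE-2017/12'3"]:
    print(candidate, is_valid_identifier(candidate), invalid_characters(candidate))
```

Reporting the offending characters, rather than only pass/fail, would probably make the guidance in step 3 easier for publishers to act on.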

Just to add that re 5) in @bill_anderson post above, publishers can also use the other-identifier element to cross ref the activity to their own internal project identifier.

Good shout. So 5 should in fact be:

5 . Publishers whose in-house business model involves use of invalid characters should record the original identifier in the other-identifier element with @type="A1"

Question: Is this a should or a must?
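As an illustration of the revised point 5, this sketch builds an activity that carries a converted IATI identifier alongside the original internal project ID. The `@type="A1"` value comes from the post above; the `ref` attribute name, the helper function, and the example identifiers are assumptions made for the sake of the example.

```python
import xml.etree.ElementTree as ET


def build_activity(iati_identifier: str, internal_id: str) -> ET.Element:
    """Build a minimal activity element recording both identifiers."""
    activity = ET.Element("iati-activity")
    ET.SubElement(activity, "iati-identifier").text = iati_identifier
    # type="A1" is taken from the thread; the "ref" attribute is an assumption.
    ET.SubElement(activity, "other-identifier", {"ref": internal_id, "type": "A1"})
    return activity


# Hypothetical identifiers: the converted form is published, the original is kept.
activity = build_activity("XX-EXAMPLE-2017-123-456", "2017/123-456")
print(ET.tostring(activity, encoding="unicode"))
```

With something like this, a data user could recover the internal project number without needing to know the publisher's conversion rule.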

IATI isn’t a green field anymore. Since changing existing IATI identifiers will break references to organizations and activities of other publishers, I strongly oppose this change. It may have a huge impact on existing IATI data users.

As a rule you never change existing business identifiers. I can only think of two exceptions in this case:
1 - An IATI identifier has not been used by any other publisher: so it is safe to change the identifier or
2 - the proposed change is only applied to NEW IATI identifiers. The existing identifiers are left unchanged.

To estimate the impact of this change it would be nice to have some metrics on the use of IATI identifiers with invalid characters.
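One rough way to gather such metrics would be to scan published files and count how many identifiers contain characters outside the RFC 3986 "unreserved" set. This is only a sketch: the sample XML and the `XX-EXAMPLE` identifiers are made up, and a real scan would need to walk every file on the registry.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from string import ascii_letters, digits

# RFC 3986 unreserved characters: these never need percent-encoding in a URL.
UNRESERVED = set(ascii_letters + digits + "-._~")


def special_character_counts(xml_text: str) -> Counter:
    """Count, per character, how many identifiers contain it."""
    counts = Counter()
    for element in ET.fromstring(xml_text).iter("iati-identifier"):
        identifier = (element.text or "").strip()
        for ch in set(identifier) - UNRESERVED:
            counts[ch] += 1
    return counts


# Made-up sample data for illustration only.
SAMPLE = """<iati-activities>
  <iati-activity><iati-identifier>XX-EXAMPLE-1</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XX-EXAMPLE-2017/123</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XX-EXAMPLE-A'B/C</iati-identifier></iati-activity>
</iati-activities>"""

print(special_character_counts(SAMPLE))
```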

In the specific case I was talking about, the ID on IATI was actually a compound ID based on data from their internal system, and as they were using CSV2IATI to generate their data, the ID didn’t actually exist in the IATI form on their system.

They hadn’t realised what they were doing would cause a problem and were happy to change it when they did, so I think that there might be some naivety in the publishing community as to the issues that are being caused by IDs with characters that are not URL-compliant.

@bill_anderson - The only addition I would make to your list is that we identify organisations that are currently publishing IDs that are not URL-compliant and, if possible, give them a nudge to change the impacted ID(s) to something that is.

Strongly agree with @Herman, @bill_anderson and others on option 2. Proper URL encoding is much simpler than getting all publishers to replace characters in their project IDs. Slashes in project IDs sometimes have real meaning, and getting all organisations to implement RFC 3986 manually, rather than using libraries that do the same job, seems like a recipe for disaster to me.

For example, if the project ID is 2017/123-456, should both the publisher and the implementing partner be told that they need to remember to ignore the slash and turn it into some other character? Clearly that won’t always happen, so tools will always need to handle these characters, so why make people go to any effort? Even the conversation about what to do is complicated and going to add a lot of overhead.
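For what it's worth, here is a minimal sketch of the library approach using Python's standard `urllib.parse`. The project ID is the one from the example above; the `XX-EXAMPLE` org prefix and the API base URL are hypothetical.

```python
from urllib.parse import quote, unquote

# Hypothetical API endpoint, used only to show where the identifier ends up.
BASE_URL = "https://api.example.org/activities/"


def activity_url(iati_identifier: str) -> str:
    """Build a URL for an activity, percent-encoding the identifier."""
    # safe="" makes quote() encode "/" too, so it is not read as a path separator.
    return BASE_URL + quote(iati_identifier, safe="")


identifier = "XX-EXAMPLE-2017/123-456"
url = activity_url(identifier)
print(url)

# Decoding the path segment gives back the identifier exactly as published.
recovered = unquote(url[len(BASE_URL):])
print(recovered == identifier)
```

The identifier itself is never altered here; only its representation inside the URL changes, and the conversion works in both directions.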

I think we need a clearer explanation of why percent-encoding URL inputs is insufficient before undertaking what would be quite a disruptive step.

I think the point that comes out of this is that data usage is, or should be, primarily content-related. Finding the path of least resistance might produce a ‘better’ technical solution that appears to improve data quality (fewer errors), but does it achieve this at the expense of the meaning of the data?

Reactivating an old conversation, since I just stumbled on that regex in the 2.03 conversation. The current regex means that, in a Unicode context, this is a valid activity identifier:

XI-IATI-OCHADSC-:recycle:️xxx123…:slight_smile:

Perhaps it would be wise to revise at least to specify allowable Unicode character classes.
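To show the general effect, here is a small sketch comparing an exclusion-style pattern with an inclusion-style one. Both patterns are hypothetical stand-ins, not the actual regex from the 2.03 discussion, and the identifier is made up.

```python
import re

# Hypothetical exclusion-style rule: anything except a few URL-significant characters.
EXCLUSION_RULE = re.compile(r"[^/&|?]+")
# Hypothetical inclusion-style rule: only an explicit list of ASCII characters.
INCLUSION_RULE = re.compile(r"[A-Za-z0-9._-]+")

# Made-up identifier containing U+267B (a recycling symbol).
candidate = "XX-EXAMPLE-\u267bxxx123"

exclusion_ok = EXCLUSION_RULE.fullmatch(candidate) is not None
inclusion_ok = INCLUSION_RULE.fullmatch(candidate) is not None
print(exclusion_ok, inclusion_ok)
```

An exclusion rule accepts everything it does not explicitly name, which in a Unicode context is a very large set.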

D

@David_Megginson can you explain why this would be a problem? Is it just that some systems would fail to handle some unicode characters correctly? Also, could you suggest an alternative regex that would deal with this?

Aside: Back in Feb 2017, I sent a PR to fix this regex. This was merged earlier this year, but then discarded (presumably by accident) in this PR :frowning_face:

I think BNF (or similar) might be clearer than a regex, because of all the different regex flavours, but if we are sticking with regular expressions, then in Perl-style dialects (including Python’s regexes) we can use \w to match any alphanumeric character or underscore, \s to match any whitespace character, etc.

We also have to specify whether we’re allowing Unicode or just ASCII. I’m a huge Unicode (and UTF-8 encoding) fan, but even an experienced coder or DBA will often blow up a system and/or open security holes when they get an unexpected non-ASCII character in an identifier, etc. If we scan the registry and find that no one is using non-ASCII characters in identifiers, I’d suggest making the regex very explicit (inclusion rather than exclusion character groups) and issuing a guidance note along the lines of “this is what we meant, and what the registry will support”.

D

Hey @David_Megginson – I think this is a rare moment where I maybe have to disagree with you! There are a bunch of different perspectives and arguments in the thread above, but my argument is something like the following:

So IATI Identifiers should be composed of [Organisation ID]-[Organisation's internal project ID].

Having restrictions on IATI Identifiers means that we have to:

  1. restrict which characters an organisation has in its project ID in internal systems (which as @bill_anderson says is not something IATI has control over), or
  2. require organisations with non-permitted characters to convert those characters in a consistent way

I think 2. has several issues:

  • every other organisation referring to this IATI ID has to convert in exactly the same way, whereas probably at least sometimes people will make mistakes
  • it breaks the link between internal project IDs and IATI Identifiers
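A small sketch of how both issues could show up in practice. The two conversion functions and the `XX-EXAMPLE` prefix are hypothetical; they just stand in for two organisations that each picked a plausible replacement character.

```python
ORG_PREFIX = "XX-EXAMPLE"  # hypothetical organisation identifier


def publisher_convert(project_id: str) -> str:
    """The publisher's (hypothetical) choice: replace "/" with "-"."""
    return project_id.replace("/", "-")


def partner_convert(project_id: str) -> str:
    """A partner's (hypothetical) choice: replace "/" with "_"."""
    return project_id.replace("/", "_")


project_id = "2017/123-456"
published = f"{ORG_PREFIX}-{publisher_convert(project_id)}"
referenced = f"{ORG_PREFIX}-{partner_convert(project_id)}"

# The partner's reference no longer matches what the publisher published.
print(published, referenced, published == referenced)

# The conversion is also lossy: two distinct internal IDs become the same string,
# so the original project ID cannot be derived from the converted one.
collision = publisher_convert("2017/123") == publisher_convert("2017-123")
print(collision)
```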

In either case:

  • the benefits are difficult to ascertain, because as we have seen elsewhere in IATI, there will always be cases where organisations don’t implement this perfectly, so systems using the data will have to handle funny characters anyway (e.g. percent-encoding if using these identifiers in URLs)
  • we would have to have a long and very painful discussion about which characters exactly should be permitted… e.g. would we be excluding data from Chinese or Arabic systems which have non-ASCII identifiers? I don’t know…

Or have I got the wrong end of the stick of what you’re proposing here?


Good points, @markbrough, but I think there’s a risk of being overly cautious here. Yes, it is possible that there is a major enterprise computer system somewhere in the world that uses emojis in its database primary keys, but I’d suggest that it’s highly improbable, to the point that we can leave it out of consideration. Accented or non-Roman characters in a primary key are slightly less improbable, but anyone doing that would already have to convert them for interoperability with other systems.

On the other side of the scale, allowing non-alphanumeric, non-basic-punctuation characters opens a huge range of security holes in naive implementations, and a huge range of potential bugs. So we have to ask which cost is greater – accommodating a theoretically-possible edge case (that we could help a single org work around if it happened), or adding the potential for bugs and security holes in every IATI software implementation. There’s no zero-cost choice here.

(Note that I am a huge advocate of multilingual support in the human-readable data in IATI – titles, descriptions, etc – but not necessarily in the purely machine-readable stuff like XML tags, identifiers, etc).

D

By default, \w will match Unicode characters in Python 3. But Python’s re library has the re.ASCII flag, which would make \w do the right thing (if “the right thing” means ASCII-only).

[Just mentioning this here because I was previously unaware of re.ASCII].
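A quick sketch of the difference, using only the standard `re` module. The sample strings are made up, and `\w+` is used here just to show the flag's effect, not as a proposed identifier rule.

```python
import re

UNICODE_WORD = re.compile(r"\w+")          # default for str patterns in Python 3
ASCII_WORD = re.compile(r"\w+", re.ASCII)  # restricts \w to [a-zA-Z0-9_]

# Made-up samples: plain ASCII, an accented letter, and CJK characters.
samples = ["abc123", "caf\u00e9", "\u9879\u76ee42"]

results = {
    s: (UNICODE_WORD.fullmatch(s) is not None, ASCII_WORD.fullmatch(s) is not None)
    for s in samples
}
for sample, (unicode_ok, ascii_ok) in results.items():
    print(repr(sample), unicode_ok, ascii_ok)
```

Note that in both modes `\w` also matches the underscore, so an explicit character class may still be the clearer option.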
