IATI Identifiers should not be allowed to contain special characters

siemvaessen · March 2, 2017, 11:49am

From the IATI network perspective yes and no. I understand IATI does not have jurisdiction over their business rules, but I guess leaving this as is will continue to cause issues down the line.

What if IATI, from a data quality perspective, picks this up according to a yet-to-be-determined convention/conversion we can all agree on, as in your option 1, @bill_anderson? What would you propose?

bill_anderson · March 2, 2017, 12:08pm

If the consensus is Option 1 I would:

1. Agree Regex rules
2. Schema (3.01): Add Regex validation to all fields containing org and activity identifiers
3. Guideline (2.03): Publishers should follow Regex rules
4. Action: Add Regex rule and guidance to identify-org.net (@TimDavies ok?)
5. Guideline (2.03): Publishers whose in-house business model involves use of invalid characters should provide a note (in the registry metadata?) on how users may be able to derive the original project id.

Wendy · March 2, 2017, 12:24pm

Just to add that re 5) in @bill_anderson post above, publishers can also use the other-identifier element to cross ref the activity to their own internal project identifier.

bill_anderson · March 2, 2017, 12:31pm

Good shout. So 5 should in fact be:

5. Publishers whose in-house business model involves use of invalid characters should record the original identifier in the other-identifier element with @type="A1"

Question: Is this a should or a must?
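
A minimal Python sketch of what the revised point 5 could look like in practice. The organisation prefix `XX-EXAMPLE-1`, the helper name `build_activity` and the slash-to-hyphen conversion rule are all hypothetical, and the sketch assumes the 2.x layout in which other-identifier carries the value in a `ref` attribute.

```python
import xml.etree.ElementTree as ET


def build_activity(org_id: str, internal_id: str, replacement: str = "-") -> ET.Element:
    """Build a skeletal iati-activity that keeps the original project ID."""
    # Hypothetical conversion rule: swap each slash for `replacement`.
    safe_id = internal_id.replace("/", replacement)

    activity = ET.Element("iati-activity")
    ET.SubElement(activity, "iati-identifier").text = f"{org_id}-{safe_id}"

    # Record the untouched internal ID so users can recover it.
    ET.SubElement(
        activity,
        "other-identifier",
        {"ref": internal_id, "type": "A1"},
    )
    return activity


activity = build_activity("XX-EXAMPLE-1", "2017/123-456")
print(ET.tostring(activity, encoding="unicode"))
```

Any other publisher referring to this activity would still need to apply the same conversion rule, which is the weak point raised later in the thread.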

Herman · March 2, 2017, 12:38pm

IATI isn’t a green field anymore. Since changing existing IATI identifiers will break references to organizations and activities of other publishers, I strongly oppose this change. It may have a huge impact on existing IATI data users.

As a rule you never change existing business identifiers. I can only think of two exceptions in this case:
1 - An IATI identifier has not been used by any other publisher: so it is safe to change the identifier or
2 - the proposed change is only applied to NEW IATI identifiers. The existing identifiers are left unchanged.

To estimate the impact of this change it would be nice to have some metrics on the use of IATI identifiers with invalid characters.

r_clements · March 2, 2017, 12:44pm

In the specific case I was talking about, the ID on IATI was actually a compound ID based on data from their internal system, and as they were using CSV2IATI to generate their data, the ID didn’t actually exist in the IATI form on their system.

They hadn’t realised that what they were doing would cause a problem and were happy to change it when they did, so I think there might be some naivety in the publishing community about the issues caused by IDs with non-URL-compliant characters.

@bill_anderson - The only addition I would make to your list is that we identify organisations that are currently publishing IDs that are not URL-compliant and, if possible, give them a nudge to change the impacted ID(s) to something that is.

markbrough · March 2, 2017, 3:10pm

Strongly agree with @Herman, @bill_anderson and others on option 2. Proper URL encoding is much simpler than getting all publishers to replace characters in their project IDs. Slashes in project IDs sometimes have real meaning, and getting all organisations to implement RFC 3986 manually, rather than relying on libraries that do the same job, seems like a recipe for disaster to me.

For example, if the project ID is 2017/123-456, should both the publisher and the implementing partner be told that they need to remember to ignore the slash and turn it into some other character? Clearly that won’t always happen, so tools will always need to handle these characters, so why make people go to any effort? Even the conversation about what to do is complicated and going to add a lot of overhead.

I think we need a clearer explanation of why percent-encoding URL inputs is insufficient before undertaking what would be quite a disruptive step.
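
For illustration, a short Python sketch of the percent-encoding approach using the standard library. The base URL and the `XX-EXAMPLE-1` organisation prefix are made up for the example; only the project ID 2017/123-456 comes from the post above.

```python
from urllib.parse import quote, unquote

BASE_URL = "https://example.org/activity/"  # hypothetical endpoint


def activity_url(iati_identifier: str) -> str:
    """Put an identifier into a URL path segment without altering the ID itself."""
    # safe="" makes quote() encode "/" as well; by default it is left alone.
    return BASE_URL + quote(iati_identifier, safe="")


identifier = "XX-EXAMPLE-1-2017/123-456"
url = activity_url(identifier)
print(url)

# The consuming side recovers the exact original identifier.
recovered = unquote(url[len(BASE_URL):])
print(recovered == identifier)
```

The identifier stored in the data stays exactly as the publisher wrote it; only the URL form changes.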

bill_anderson · March 2, 2017, 3:45pm

I think the point that comes out of this is that data usage is, or should be, primarily content-related. Finding the path of least resistance might produce a ‘better’ technical solution that appears to improve data quality (fewer errors), but does it achieve this at the expense of the meaning of the data?

David_Megginson · May 13, 2019, 11:51am

Reactivating an old conversation, since I just stumbled on that regex in the 2.03 conversation. The current regex means that, in a Unicode context, this is a valid activity identifier:

XI-IATI-OCHADSC-️xxx123…️

Perhaps it would be wise to revise at least to specify allowable Unicode character classes.

D
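
A quick Python check of the behaviour described above. The thread does not quote the schema regex, so the exclusion-style pattern below is an assumption standing in for it; the emoji is likewise just a stand-in character.

```python
import re

# Assumed exclusion-style pattern: anything except a few reserved characters.
EXCLUSION_PATTERN = re.compile(r"[^/&|?]+")


def is_valid(identifier: str) -> bool:
    """True when the whole identifier matches the assumed pattern."""
    return EXCLUSION_PATTERN.fullmatch(identifier) is not None


emoji_identifier = "XI-IATI-OCHADSC-\U0001F600xxx123"
slash_identifier = "XI-IATI-OCHADSC-xxx/123"

print(is_valid(emoji_identifier))  # an emoji is not in the excluded set
print(is_valid(slash_identifier))  # a slash is
```

Because the pattern only lists what is forbidden, every character nobody thought to exclude is accepted.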

andylolz · May 13, 2019, 5:07pm

@David_Megginson can you explain why this would be a problem? Is it just that some systems would fail to handle some unicode characters correctly? Also, could you suggest an alternative regex that would deal with this?

Aside: Back in Feb 2017, I sent a PR to fix this regex. This was merged earlier this year, but then discarded (presumably by accident) in this PR

David_Megginson · May 21, 2019, 12:12pm

I think BNF (or similar) might be clearer than a regex, because of all the different regex flavours, but if we are sticking with regular expressions, then in POSIX-y dialects (including Python regex’s) we can use \w to match any alphanumeric character, \s to match any whitespace character, etc.

We also have to specify whether we’re allowing Unicode or just ASCII. I’m a huge Unicode (and UTF-8 encoding) fan, but even an experienced coder or DBA will often blow up a system and/or open security holes when they get an unexpected non-ASCII character in an identifier, etc. If we scan the registry and find that no one is using non-ASCII characters in identifiers, I’d suggest making the regex very explicit (inclusion rather than exclusion character groups) and issuing a guidance note along the lines of “this is what we meant, and what the registry will support”.

D
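
As a rough Python sketch of the "inclusion rather than exclusion" idea: the allowed set here (ASCII letters, digits, hyphen, underscore, full stop) is purely an assumption for illustration, not an agreed rule.

```python
import re

# Assumed allow-list, spelled out explicitly so it reads the same in any
# regex flavour and does not depend on what \w means in a given dialect.
INCLUSION_PATTERN = re.compile(r"[A-Za-z0-9._-]+")


def is_allowed(identifier: str) -> bool:
    """True when every character is in the explicit allow-list."""
    return INCLUSION_PATTERN.fullmatch(identifier) is not None


for candidate in (
    "XI-IATI-OCHADSC-xxx123",
    "XI-IATI-OCHADSC-caf\u00e9",
    "XI-IATI-OCHADSC-\U0001F600xxx123",
):
    print(repr(candidate), is_allowed(candidate))
```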

markbrough · May 22, 2019, 5:05pm

Hey @David_Megginson – I think this is a rare moment where I maybe have to disagree with you! There are a bunch of different perspectives and arguments in the thread above, but my argument is something like the following:

So IATI Identifiers should be composed of [Organisation ID]-[Organisation's internal project ID].

Having restrictions on IATI Identifiers means that we have to:

restrict which characters an organisation has in its project ID in internal systems (which as @bill_anderson says is not something IATI has control over), or
we require that organisations with non-permitted characters to convert those characters in a consistent way

I think 2. has several issues:

every other organisation referring to this IATI ID has to convert in exactly the same way, whereas probably at least sometimes people will make mistakes
it breaks the link between internal project IDs and IATI Identifiers

In either case:

the benefits are difficult to ascertain, because as we have seen elsewhere in IATI, there will always be cases where organisations don’t implement this perfectly – so systems using the data will therefore have to handle funny characters anyway (e.g percent-encoding if using these identifiers in URLs)
we would have to have a long and very painful discussion about which characters exactly should be permitted… e.g. would we be excluding data from Chinese or Arabic systems which have non-ASCII identifiers? I don’t know…

Or have I got the wrong end of the stick of what you’re proposing here?

David_Megginson · May 23, 2019, 1:42pm

Good points, @markbrough, but I think there’s a risk of being overly cautious here. Yes, it is possible that there is a major enterprise computer system somewhere in the world that uses emojis in its database primary keys, but I’d suggest that it’s highly improbable, to the point that we can leave it out of consideration. Accented or non-Roman characters in a primary key are slightly less improbable, but anyone doing that would already have to convert them for interoperability with other systems.

On the other side of the scale, allowing non-alphanumeric, non-basic-punctuation characters opens a huge range of security holes in naive implementations, and a huge range of potential bugs. So we have to ask which cost is greater – accommodating a theoretically-possible edge case (that we could help a single org work around if it happened), or adding the potential for bugs and security holes in every IATI software implementation. There’s no zero-cost choice here.

(Note that I am a huge advocate of multilingual support in the human-readable data in IATI – titles, descriptions, etc – but not necessarily in the purely machine-readable stuff like XML tags, identifiers, etc).

D

andylolz · June 10, 2019, 4:50pm

By default, \w will match Unicode characters in Python 3. But Python’s re library has the re.ASCII flag, which would make \w do the right thing (if “the right thing” means ASCII-only).

[Just mentioning this here because I was previously unaware of re.ASCII].
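
A small demonstration of the difference, with a made-up identifier containing an accented character:

```python
import re

identifier = "XX-EXAMPLE-1-caf\u00e9"  # hypothetical identifier ending in "é"
pattern = r"[\w-]+"

# Python 3 str patterns are Unicode-aware by default, so \w matches "é".
unicode_match = re.fullmatch(pattern, identifier) is not None

# With re.ASCII, \w is limited to [a-zA-Z0-9_], so "é" no longer matches.
ascii_match = re.fullmatch(pattern, identifier, flags=re.ASCII) is not None

print(unicode_match, ascii_match)
```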

Herman · June 12, 2019, 2:19pm

Changes in current IATI identifiers will produce havoc when using IATI data, especially when those identifiers are being used in other activities or by other publishers. I think the best we could do here is to make this a guideline, which could be checked and flagged by the data-validator (e.g. IATI identifier ABC contains non-standard characters) as a ‘warning’ class message.
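
A sketch of how such a warning-level check might look in Python. The function name and the set of "standard" characters are assumptions, and this is not how the actual data-validator is implemented.

```python
import string

# Assumed set of "standard" characters; the real rule is still to be agreed.
STANDARD_CHARS = set(string.ascii_letters + string.digits + "-_.")


def identifier_warnings(identifier: str) -> list[str]:
    """Return warning messages instead of rejecting the identifier."""
    unexpected = sorted({ch for ch in identifier if ch not in STANDARD_CHARS})
    if not unexpected:
        return []
    shown = ", ".join(repr(ch) for ch in unexpected)
    return [
        f"warning: IATI identifier {identifier!r} contains "
        f"non-standard characters: {shown}"
    ]


for message in identifier_warnings("XX-EXAMPLE-1-2017/123 456"):
    print(message)
```

Existing identifiers stay valid under this approach; publishers and users are only told that a given identifier may cause trouble downstream.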

David_Megginson · June 13, 2019, 11:19am

Very true, Herman. My question is whether it would involve a change to any existing identifiers. We’d have to crawl the registry to check.

andylolz · June 13, 2019, 10:11pm

Worth noting that there’s already a recommendation in the docs against using non-ASCII characters.

Crawling the registry to check for these is trivial to do using iatikit.

Here’s a gist containing the code and results.

In summary, there are currently 198 non-ASCII identifiers on the registry.

Worth mentioning that d-portal appears to cope with unicode in identifiers. The only exceptions relate to carriage returns in identifiers (which d-portal strips e.g. here) and angle brackets in identifiers (which d-portal gives up on e.g. here, from here). But neither of these are unicode issues.

David_Megginson · June 13, 2019, 10:47pm

Thanks, Andy. Did you see how many non-alphanumeric/basic punctuation characters there were?

D
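
One way that follow-up count could be produced, sketched in plain Python over a list of identifiers. The sample list is invented, the definition of "basic punctuation" is an assumption, and loading the real identifiers from the registry is left out.

```python
import string
from collections import Counter

# Assumed definition of "alphanumeric plus basic punctuation".
BASIC_CHARS = set(string.ascii_letters + string.digits + "-_./")


def tally_unusual(identifiers: list[str]) -> Counter:
    """Count every character that falls outside the assumed basic set."""
    counts: Counter = Counter()
    for identifier in identifiers:
        counts.update(ch for ch in identifier if ch not in BASIC_CHARS)
    return counts


# Invented sample data, standing in for identifiers pulled from the registry.
sample = [
    "XX-EXAMPLE-1-2017/123-456",
    "XX-EXAMPLE-1-A&B",
    "XX-EXAMPLE-1-caf\u00e9 1",
]
counts = tally_unusual(sample)
for char, count in counts.most_common():
    print(repr(char), count)
```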