IATI Identifiers should not be allowed to contain special characters

andylolz · February 21, 2017, 4:21pm

I’m puzzled about this! [^\/\&\|\?]+ will match an identifier that contains at least one character that isn’t a forward slash, ampersand, pipe or question mark.

So for instance, :-/ uh oh? :-| would match this regex, and therefore be considered a valid identifier, despite the special characters.

I wonder if this regex should instead be: ^[^\/\&\|\?]+$ i.e. the identifier can’t contain any forward slashes, ampersands, pipes or question marks.
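A minimal sketch of the difference, assuming the pattern is applied as a "find a match anywhere" search, as Python's `re.search` does. The helper names and the sample identifier `GB-1-12345` are made up for illustration; whether a given validator searches or full-matches depends on the engine it uses.

```python
import re

# The pattern as quoted above, and the anchored variant proposed above.
UNANCHORED = re.compile(r"[^\/\&\|\?]+")
ANCHORED = re.compile(r"^[^\/\&\|\?]+$")


def matches_unanchored(identifier: str) -> bool:
    # True as soon as one run of allowed characters is found anywhere.
    return UNANCHORED.search(identifier) is not None


def matches_anchored(identifier: str) -> bool:
    # True only if every character in the identifier is allowed.
    return ANCHORED.search(identifier) is not None


sample = ":-/ uh oh? :-|"
print(matches_unanchored(sample))      # True: ":-" alone satisfies the pattern
print(matches_anchored(sample))        # False: "/", "?" and "|" are rejected
print(matches_anchored("GB-1-12345"))  # True
```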

bjwebb · February 21, 2017, 7:01pm

Yes, I think you’re right. This was probably a mistake on my part a few years ago.

bill_anderson · March 2, 2017, 6:08am

We do not seem to have consensus on this issue. There would appear to be two approaches to use of characters in IATI identifiers.

1. Restrict characters allowed and fix regex rules.
2. Allow all valid characters and enforce URL encoding when using identifiers in URLs.

Could everyone provide a bottom-line response to these two options?

@Herman @r_clements @andylolz @bjwebb @matmaxgeds @markbrough @stevieflow @TimDavies

(Personally I am with @herman on the second option as the activity part of the identifier should reflect the identifier used in the publisher’s own system.)

matmaxgeds · March 2, 2017, 6:27am

Also with @Herman - the donor systems I have seen would have significant difficulty (huge increase in manual tweaking of fields required) to restrict these characters, potentially also making those fields less readable. Suggest enforcing URL encoding for all URLs, not just those with identifiers.

stevieflow · March 2, 2017, 8:30am

I agree (in more than 20 characters)

TimDavies · March 2, 2017, 10:04am

I prefer option (2).

The two other considerations that might point towards (1) however:

(a) Input systems that only allow alphanumeric characters. Often legacy systems will have restrictive validation on fields; or at least may not support full unicode in input fields for identifiers;
(b) Cases where identifiers are presented in a range of different ways (e.g. an Australian Business Number might be written as ‘123 123 123’ or ‘123.123.123’ or ‘123123123’ in different places, all for the same organisation).

These noted, I still support option (2), with the idea that we might need to provide guidance to users of organisation identifiers on some basic normalisation to apply to them to maximally ensure identifier matches.

(For example, I’m not aware of any identifier schemes in which ‘123/123/123’ and ‘123123123’ would identify different organisations - so it is generally safe internally when consuming data to strip out special characters to get the best chance of identifier matches)
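A rough sketch of that kind of consumer-side normalisation, assuming it is only used internally as a matching key and never republished as the identifier itself. The function name and the choice to keep only ASCII letters and digits are illustrative assumptions, not agreed guidance.

```python
import re


def normalise_for_matching(identifier: str) -> str:
    """Drop everything except ASCII letters and digits (assumed rule)."""
    return re.sub(r"[^0-9A-Za-z]", "", identifier)


variants = ["123 123 123", "123.123.123", "123/123/123", "123123123"]
keys = {normalise_for_matching(v) for v in variants}
print(keys)  # {'123123123'}
```

The original string would still need to be kept alongside the key, since the stripping cannot be reversed.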

r_clements · March 2, 2017, 10:21am

Hello everyone,

For me it’s still: 1. Restrict characters allowed and fix regex rules.

I don’t like to be the contrarian here, but I’m afraid I disagree and would like to highlight one of the issues.

An organisation whose data we wanted to consume published an identifier that went something like: GB-test-Sana’a and caused the team behind DevTracker and OIPA no end of headaches to try and formulate and process an ID that was usable from the existing data.

I think it would be far easier for any tool builder to have the option of excluding IATI identifiers that contain this type of character and I fail to see why allowing / ’ " = and & values within IATI identifiers adds any value to the standard.

For me it results in a significant overhead for anyone trying to build API tools on top of the data, that will be replicated by any technical team looking to build tools with IATI data.

As a consumer of IATI data, I would rather spend my time developing new features for our tools (e.g. DevTracker), instead of trying to mitigate the impact of IATI identifiers which contain special characters that break URL encoding.

@Herman @andylolz @bill_anderson @bjwebb @markbrough @stevieflow @VincentVW @siemvaessen @matmaxgeds

bill_anderson · March 2, 2017, 10:23am

@TimDavies I think the bigger problem lies with the ‘project number’ part of the identifier, though I take the point that this could crop up on the organisation side as well

bill_anderson · March 2, 2017, 10:30am

@r_clements you make a strong case, but how do you maintain a link to the original project number:

1. Is there a conversion (that works in both directions) that could be standardised?
2. Or do you argue that the benefits of breaking that link outweigh the drawbacks?

r_clements · March 2, 2017, 11:30am

Hello @bill_anderson, In this case we were lucky in that the project was not linked to any other IATI data sets, so I asked the organisation to fix it (i.e. remove the ’ and republish). I must apologise to the community here, as I didn’t realise that you’re not meant to do this, so I misinformed the publisher in this case.

There were other issues with their published data, projects were published four times a year as the ID was adapted to reflect the current financial quarter, causing duplication of data within the IATI network. Again we have advised the organisation involved about this and they are changing their data, going forward, so they only publish a project ID once and adjust the finances rather than change the ID.

I think my real concern with the IATI identifiers is that when we start to really improve our linking through the network, via the transaction ref fields, that a badly formed identifier (i.e. non url compliant) is going to cause issues for anyone trying to track funds when they’re using API calls to return the data.

It’s not impossible to work round this issue, but I think we’re going to make the IATI data much more complex to work with than it needs to be. I suppose my core question is: difficulties of changing the systems to enforce compliant behaviour aside, is it too much to ask project inputters to avoid using these specific characters when allocating IATI Identifiers?

r_clements · March 2, 2017, 11:33am

Sorry I went off a bit there: I think you’ve hit the nail on the head. In the first instance I would break the link, so 2., but would look to flag this in some way (Vincent’s IATI bug tracker could be adapted for this), so that part 1. can happen in conversation between the impacted organisations.

siemvaessen · March 2, 2017, 11:33am

Do we know firsthand how many IATI identifiers contain special characters? I don’t have those numbers in front of me.

From API (OIPA) perspective this has been an ongoing issue for years now. Non URL/URI compliance does in many cases require a custom approach from our perspective. Seeing how others may have a different approach as well, this does not add to interoperability if we for example were to align different systems in the IATI network.

Basically the issue is two-fold: an IATI org. identifier can contain special characters, plus the additional identifier may contain special characters as well.

We would prefer this to be solved at the root of the chain -the standard itself- and not anywhere else.

bill_anderson · March 2, 2017, 11:37am

Isn’t the root of the chain the institution’s own business rules for how they id their projects? IATI doesn’t have jurisdiction over this.

@r_clements just to be clear, when I talked about breaking the link I didn’t mean across IATI datasets, but between IATI and the reporting organisation’s project management system

siemvaessen · March 2, 2017, 11:49am

From the IATI network perspective yes and no. I understand IATI does not have jurisdiction over their business rules, but I guess leaving this as is will continue to cause issues down the line.

What if IATI, from a data quality perspective, picks this up according to a convention / conversion (still to be decided) that we can all agree to? As in your option 1, @bill_anderson, what would you propose?

bill_anderson · March 2, 2017, 12:08pm

If the consensus is Option 1 I would:

1. Agree Regex rules
2. Schema (3.01): Add Regex validation to all fields containing org and activity identifiers
3. Guideline (2.03): Publishers should follow Regex rules
4. Action: Add Regex rule and guidance to identify-org.net (@TimDavies ok?)
5. Guideline (2.03): Publishers whose in-house business model involves use of invalid characters should provide a note (in the registry metadata?) on how users may be able to derive the original project id.

Wendy · March 2, 2017, 12:24pm

Just to add that re 5) in @bill_anderson post above, publishers can also use the other-identifier element to cross ref the activity to their own internal project identifier.

bill_anderson · March 2, 2017, 12:31pm

Good shout. So 5 should in fact be:

5. Publishers whose in-house business model involves use of invalid characters should record the original identifier in the other-identifier element with @type="A1"
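As a rough illustration of what that could look like in published data, here is a sketch that builds an activity fragment with Python's standard `xml.etree.ElementTree`. Only the `other-identifier` element and `type="A1"` come from the post above; the `ref` attribute name, the surrounding element layout, and both identifier values are assumptions for illustration and should be checked against the schema version in use.

```python
import xml.etree.ElementTree as ET


def build_activity(iati_identifier: str, internal_id: str) -> ET.Element:
    """Build a minimal, illustrative iati-activity fragment."""
    activity = ET.Element("iati-activity")
    ET.SubElement(activity, "iati-identifier").text = iati_identifier
    # Assumption: the original project id is carried in a "ref" attribute.
    ET.SubElement(
        activity, "other-identifier", {"ref": internal_id, "type": "A1"}
    )
    return activity


# Hypothetical values: a restricted-character IATI identifier plus the
# publisher's own project number, which contains a slash.
activity = build_activity("XM-EXAMPLE-2017-123-456", "2017/123-456")
print(ET.tostring(activity, encoding="unicode"))
```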

Question: Is this a should or a must?

Herman · March 2, 2017, 12:38pm

IATI isn’t a green field anymore. Since changing existing IATI identifiers will break references to organizations and activities of other publishers, I strongly oppose this change. This change may have a huge impact on existing IATI data users.

As a rule you never change existing business identifiers. I can only think of two exceptions in this case:
1 - An IATI identifier has not been used by any other publisher: so it is safe to change the identifier or
2 - the proposed change is only applied to NEW IATI identifiers. The existing identifiers are left unchanged.

To estimate the impact of this change it would be nice to have some metrics on the use of IATI identifiers with invalid characters.
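One way such metrics could be gathered, sketched under assumptions: the identifiers have already been extracted into a plain list of strings, and "invalid" means the four characters from the regex at the top of the thread (forward slash, ampersand, pipe, question mark). The function name and sample identifiers are hypothetical.

```python
import re
from collections import Counter

# Assumed definition of "invalid": the four characters in the quoted regex.
FLAGGED = re.compile(r"[\/\&\|\?]")


def count_flagged(identifiers):
    """Return (number of affected identifiers, per-character counts)."""
    per_char = Counter()
    affected = 0
    for identifier in identifiers:
        found = set(FLAGGED.findall(identifier))
        if found:
            affected += 1
            # Count each character once per identifier it appears in.
            per_char.update(found)
    return affected, per_char


sample = ["XM-EXAMPLE-1", "XM-EXAMPLE-2017/123", "XM-EXAMPLE-A&B/7"]
affected, per_char = count_flagged(sample)
print(affected, dict(per_char))
```

Extending the character class (for example to quotes or `=`) would change the counts, so the definition used needs to be stated alongside any numbers reported.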

r_clements · March 2, 2017, 12:44pm

In the specific case I was talking about, the ID on IATI was actually a compound ID based on data from their internal system, and as they were using CSV2IATI to generate their data, the ID didn’t actually exist in the IATI form on their system.

They hadn’t realised what they were doing would cause a problem and were happy to change it when they did, so I think that there might be some naivety in the publishing community as to the issues that are being caused by IDs with non url compliant characters.

@bill_anderson - The only addition I would make to your list is that we identify organisations that are currently publishing IDs that are not URL-compliant and, if possible, give them a nudge to change the impacted ID(s) to something that is.

markbrough · March 2, 2017, 3:10pm

Strongly agree with @Herman, @bill_anderson and others on option 2. Proper URL encoding is much simpler than getting all publishers to replace characters in their project IDs. Slashes in project IDs sometimes have real meaning, and getting all organisations to manually implement RFC 3986 rather than use libraries that do the same job seems like a recipe for disaster to me.

For example, if the project ID is 2017/123-456, should both the publisher and the implementing partner be told that they need to remember to ignore the slash and turn it into some other character? Clearly that won’t always happen, so tools will always need to handle these characters, so why make people go to any effort? Even the conversation about what to do is complicated and going to add a lot of overhead.

I think we need a clearer explanation of why percent-encoding URL inputs is insufficient before undertaking what would be quite a disruptive step.
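For reference, a minimal sketch of library-based percent-encoding using Python's standard `urllib.parse`. The base URL and helper names are hypothetical, and the sketch only covers building and reading the URL on the client side; how a particular server or framework treats an encoded slash in a path is outside its scope.

```python
from urllib.parse import quote, unquote

BASE_URL = "https://api.example.org/activities/"  # hypothetical endpoint


def activity_url(identifier: str) -> str:
    # safe="" makes quote() encode "/" as well, so the identifier stays
    # a single path segment.
    return BASE_URL + quote(identifier, safe="")


def identifier_from_url(url: str) -> str:
    # Reverse step: strip the base and decode the percent-escapes.
    return unquote(url[len(BASE_URL):])


url = activity_url("2017/123-456")
print(url)                       # https://api.example.org/activities/2017%2F123-456
print(identifier_from_url(url))  # 2017/123-456
```

The round trip is lossless, so the identifier in the data can stay identical to the one in the publisher's own system.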