IATI Identifiers should not be allowed to contain special characters

r_clements · April 27, 2017, 9:00am

Hello,

We have had a lot of issues lately when trying to develop solutions on top of IATI data when the underlying IATI identifiers contain special characters like ’ and &.

The IATI standard states that identifiers should pass the following regex: [^/&|?]+

I think that this suggestion should be made into a requirement that identifiers must not contain special characters and that any IATI identifier that contains special characters will cause the activity to fail validation.

The standard also states that once an ID has been created that it must not be changed. I think in the case of special characters within IDs, however, that an exception should be made in order to improve the quality of the data contained within the registry.

Ross

stevieflow · December 30, 2016, 9:55am

Hi @r_clements

I agree that such identifiers then “break” uses of the data - especially when an identifier is a part of a URL - a / or a \ can present challenges / break things

One issue in terms of enforcing this is that the current IATI schema does not check such content - for example the format of an identifier. The rules you cite are additional resources - meaning that somebody validating their data via the schema (or the IATI Validator online) may “pass” validation, but be unaware / skip this additional good practice.

It would be great if others could indicate how such a process could become a part of the central resources

Wendy · January 4, 2017, 11:19am

Thanks for raising this issue @r_clements and I thought I should add that it is still our intention to enhance the IATI validator so that it carries out all specific validation and content checking (such as this for the activity identifier) as defined as part of the IATI Standard. Unfortunately work on the new validator has had to be paused due to other priorities but we hope to get going with it again in the near future.

Also you suggest that the activity identifier should not contain any special characters, so could we extend the regex to explicitly cover other characters that should not be used, e.g. ’ $ % * etc.? However, I assume that we would want to continue to allow ‘-’ hyphens to be used?

I also agree that whilst it is explicitly stated that an activity identifier once published should not change we may perhaps need to make an exception for when special characters have inadvertently been used. However, I would be interested to get the views of others (and especially data users) on this?

Herman · January 9, 2017, 9:19am

I am not sure the benefits of this change outweigh the costs. When using IATI identifiers in URLs you should always URL-encode the IATI identifier. That solves the problem.
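As a minimal sketch of what URL encoding an identifier looks like, here is a Python example using the standard library's `urllib.parse.quote`. The base URL and the identifier are made up for illustration and do not refer to any real API or publisher.

```python
from urllib.parse import quote

# Hypothetical API base URL, used only for illustration.
BASE_URL = "https://example.org/api/activities/"


def activity_url(iati_identifier: str) -> str:
    # safe="" makes quote() percent-encode "/" as well;
    # by default quote() leaves "/" untouched.
    return BASE_URL + quote(iati_identifier, safe="")


# Made-up identifier containing "/" and "&".
print(activity_url("XM-EXAMPLE-12/34&56"))
# https://example.org/api/activities/XM-EXAMPLE-12%2F34%2656
```

The receiving side then has to decode the path segment again before looking the activity up.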

If you still want to exclude special characters from the IATI identifier, this check should be part of the IATI validator, since this is the formal IATI conformance check. I would not change existing IATI identifiers, since that would cause all kinds of problems when relating activities (as we extensively do). The hyphen ‘-’ is a part of the existing guidelines. No reason i.m.o. to change that. Since this is a breaking change, it should be part of the next integer upgrade of the standard, only to be applied to new activities. That would not be trivial to implement though.

r_clements · January 9, 2017, 11:30am

Thank you to everyone for your thoughts on my initial question.

@Wendy - I think to clarify I would like to say that I had never thought that hyphen would be one of the characters that’s removed as DFID use it in all of our H2 IATI project identifiers and it doesn’t break URLs.

@Herman - The standard currently warns against the use of characters that would break URLs but I don’t think that it goes far enough and they should be explicitly banned from usage. Ideally the IATI validator would check for this but I think that firming up the guidance would be a good start.
The OIPA API / Devtracker ecosystem has been updated so that it returns project identifiers that are stripped of special characters. However, if you want to return raw JSON data from the OIPA API, then you’re in a position where you are forcing the interface to process low quality input (e.g. the / character in a project identifier) so that users get the expected project data returned to them.

I honestly believe that this issue is causing problems across the board for people trying to create tools on top of IATI data and means that developers are spending time trying to interpret bad data rather than improve the tools that they are developing.

Herman · January 31, 2017, 4:13pm

Hi @r_clements,
Yes I agree that the use of IATI should be as simple as possible in principle. Changing existing identifiers will have considerable impact though in all existing applications referring to activities of other publishers. So this is no trivial change. It would be interesting to know how many publishers are producing activity identifiers with non-standard characters, so we know what the impact of this change would be.
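One rough way to get at those numbers would be to scan published activity files and flag identifiers containing the characters named in the [^/&|?]+ rule. The sketch below assumes the usual `iati-activities` / `iati-activity` / `iati-identifier` nesting; the sample XML and the identifiers in it are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Made-up sample data standing in for a publisher's activity file.
SAMPLE = """<iati-activities>
  <iati-activity><iati-identifier>XM-EXAMPLE-001</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XM-EXAMPLE-002/A</iati-identifier></iati-activity>
  <iati-activity><iati-identifier>XM-EXAMPLE-003&amp;B</iati-identifier></iati-activity>
</iati-activities>"""

# The four characters excluded by the [^/&|?]+ rule.
DISALLOWED = re.compile(r"[/&|?]")


def flagged_identifiers(xml_text: str) -> list[str]:
    """Return the identifiers that contain at least one disallowed character."""
    root = ET.fromstring(xml_text)
    ids = [(el.text or "").strip() for el in root.iter("iati-identifier")]
    return [i for i in ids if DISALLOWED.search(i)]


print(flagged_identifiers(SAMPLE))
# ['XM-EXAMPLE-002/A', 'XM-EXAMPLE-003&B']
```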

The use of special characters throughout IATI should maybe be considered (such as the use of special characters in the names of IATI files in the registry).

matmaxgeds · January 31, 2017, 4:42pm

Hi all,

I am not really a programmer, but is an alternative not to tweak the IATI standard to enforce (or convert to) the use of character entity references, e.g. http://www.theukwebdesigncompany.com/articles/entity-escape-characters.php, instead of just the [^/&|?]+ regex, so that all characters can continue to be used and people’s parsers don’t break? E.g. “/” would be “&#47;”. This seems to me one of the advantages of raw IATI data being machine readable.

My hunch is that otherwise this could force publishers to maintain two names e.g. for projects in their own systems (one for their system including special characters, and another for publishing to IATI), and also remove lots of useful hyperlinks in the data, which would be a significant inconvenience and reduce the amount of data that is published.
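To make the idea concrete, here is a small Python sketch of converting those four characters to numeric character references and back. The function name and the sample identifier are made up for illustration; this is not something the standard currently specifies.

```python
import html

# The four characters excluded by the [^/&|?]+ rule.
SPECIAL = "/&|?"


def to_entities(identifier: str) -> str:
    # Replace each special character with its numeric character reference.
    return "".join(f"&#{ord(ch)};" if ch in SPECIAL else ch for ch in identifier)


encoded = to_entities("AB/12&34")  # made-up identifier
print(encoded)                 # AB&#47;12&#38;34
print(html.unescape(encoded))  # AB/12&34
```

One caveat: an XML parser resolves such references while parsing, so code that reads the parsed value will still see the original character.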

Matt

andylolz · February 21, 2017, 4:21pm

I’m puzzled about this! [^\/\&\|\?]+ will match an identifier that contains at least one character that isn’t a forward slash, ampersand, pipe or question mark.

So for instance, :-/ uh oh? :-| would match this regex, and therefore be considered a valid identifier, despite the special characters.

I wonder if this regex should instead be: ^[^\/\&\|\?]+$ i.e. the identifier can’t contain any forward slashes, ampersands, pipes or question marks.
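A quick way to see the difference, sketched in Python with the standard `re` module (the last identifier is a made-up example):

```python
import re

UNANCHORED = r"[^\/\&\|\?]+"
ANCHORED = r"^[^\/\&\|\?]+$"

candidate = ":-/ uh oh? :-|"

# Unanchored: succeeds as soon as any run of allowed characters is found.
print(bool(re.search(UNANCHORED, candidate)))    # True

# Anchored: the whole string must consist of allowed characters.
print(bool(re.search(ANCHORED, candidate)))      # False
print(bool(re.search(ANCHORED, "GB-1-123456")))  # True
```

In Python specifically, `re.fullmatch(UNANCHORED, value)` is a slightly stricter alternative to the anchored pattern, because `$` also matches just before a trailing newline.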

bjwebb · February 21, 2017, 7:01pm

Yes, I think you’re right. This was probably a mistake on my part a few years ago.

bill_anderson · March 2, 2017, 6:08am

We do not seem to have consensus on this issue. There would appear to be two approaches to use of characters in IATI identifiers.

1. Restrict characters allowed and fix regex rules.
2. Allow all valid characters and enforce URL encoding when using identifiers in URLs.

Could everyone provide a bottom-line response to these two options?

@Herman @r_clements @andylolz @bjwebb @matmaxgeds @markbrough @stevieflow @TimDavies

(Personally I am with @herman on the second option as the activity part of the identifier should reflect the identifier used in the publisher’s own system.)

matmaxgeds · March 2, 2017, 6:27am

Also with @Herman - the donor systems I have seen would have significant difficulty (huge increase in manual tweaking of fields required) to restrict these characters, potentially also making those fields less readable. Suggest enforcing URL encoding for all URLs, not just those with identifiers.

stevieflow · March 2, 2017, 8:30am

I agree (in more than 20 characters)

TimDavies · March 2, 2017, 10:04am

I prefer option (2).

The two other considerations that might point towards (1) however:

(a) Input systems that only allow alphanumeric characters. Often legacy systems will have restrictive validation on fields, or at least may not support full unicode in input fields for identifiers;
(b) Cases where identifiers are presented in a range of different ways (e.g. an Australian Business Number might be written as ‘123 123 123’ or ‘123.123.123’ or ‘123123123’ in different places, all for the same organisation).

These noted, I still support option (2), with the idea that we might need to provide guidance to users of organisation identifiers on some basic normalisation to apply to them to maximally ensure identifier matches.

(For example, I’m not aware of any identifier schemes in which ‘123/123/123’ and ‘123123123’ would identify different organisations - so it is generally safe internally when consuming data to strip out special characters to get the best chance of identifier matches)
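A minimal sketch of that kind of internal normalisation in Python. The function name is made up, and the choice to keep only ASCII letters and digits is an assumption; the result is meant as a comparison key only, not as a replacement for the published identifier.

```python
import re


def match_key(identifier: str) -> str:
    # Drop everything except ASCII letters and digits.
    return re.sub(r"[^0-9A-Za-z]", "", identifier)


variants = ["123 123 123", "123.123.123", "123123123", "123/123/123"]
print({match_key(v) for v in variants})  # {'123123123'}
```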

r_clements · March 2, 2017, 10:21am

Hello everyone,

For me it’s still: 1. Restrict characters allowed and fix regex rules.

I don’t like to be the contrarian here, but I’m afraid I disagree and would like to highlight one of the issues.

An organisation whose data we wanted to consume published an identifier that went something like: GB-test-Sana’a and caused the team behind DevTracker and OIPA no end of headaches to try and formulate and process an ID that was usable from the existing data.

I think it would be far easier for any tool builder to have the option of excluding IATI identifiers that contain this type of character and I fail to see why allowing / ’ " = and & values within IATI identifiers adds any value to the standard.

For me it results in a significant overhead for anyone trying to build API tools on top of the data, that will be replicated by any technical team looking to build tools with IATI data.

As a consumer of IATI data, I would rather spend my time working to develop new features for our tools (e.g. DevTracker), instead of trying to mitigate the impact of IATI identifiers which contain special characters that break URL encoding.

@Herman @andylolz @bill_anderson @bjwebb @markbrough @stevieflow @VincentVW @siemvaessen @matmaxgeds

bill_anderson · March 2, 2017, 10:23am

@TimDavies I think the bigger problem lies with the ‘project number’ part of the identifier, though I take the point that this could crop up on the organisation side as well

bill_anderson · March 2, 2017, 10:30am

@r_clements you make a strong case, but how do you maintain a link to the original project number:

1. Is there a conversion (that works in both directions) that could be standardised?
2. Or do you argue that the benefits of breaking that link outweigh the drawbacks?
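For illustration only, percent-encoding is one existing conversion that works in both directions. The sketch below uses Python's `urllib.parse` on an identifier like the GB-test-Sana'a example mentioned in this thread, written here with a plain apostrophe, which is an assumption. The function names are made up.

```python
from urllib.parse import quote, unquote


def to_url_safe(project_number: str) -> str:
    # safe="" so that "/" is percent-encoded as well.
    return quote(project_number, safe="")


def from_url_safe(encoded: str) -> str:
    return unquote(encoded)


original = "GB-test-Sana'a"
encoded = to_url_safe(original)
print(encoded)                             # GB-test-Sana%27a
print(from_url_safe(encoded) == original)  # True
```

Because a literal "%" is itself encoded as "%25", decoding always returns the original project number.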

r_clements · March 2, 2017, 11:30am

Hello @bill_anderson, in this case we were lucky in that the project was not linked to any other IATI data sets so I asked the organisation to fix it (i.e. remove the ’ and republish) - I must apologise to the community here as I didn’t realise that you’re not meant to do this, so I misinformed the publisher in this case.

There were other issues with their published data, projects were published four times a year as the ID was adapted to reflect the current financial quarter, causing duplication of data within the IATI network. Again we have advised the organisation involved about this and they are changing their data, going forward, so they only publish a project ID once and adjust the finances rather than change the ID.

I think my real concern with the IATI identifiers is that when we start to really improve our linking through the network, via the transaction ref fields, that a badly formed identifier (i.e. non url compliant) is going to cause issues for anyone trying to track funds when they’re using API calls to return the data.

It’s not impossible to work round this issue, but I think we’re going to make the IATI data much more complex to work with than it needs to be - I suppose my core question, difficulties of changing the systems to enforce compliance behaviour aside, is: is it too much to ask project inputters to avoid using these specific characters when allocating IATI Identifiers?

r_clements · March 2, 2017, 11:33am

Sorry I went off a bit there: I think you’ve hit the nail on the head. In the first instance I would break the link, so 2., but would look to flag this in some way (Vincent’s IATI bug tracker could be adapted for this), so that part 1. can happen in conversation between the impacted organisations.

siemvaessen · March 2, 2017, 11:33am

Do we know firsthand how many IATI identifiers contain special characters? I don’t have those numbers in front of me.

From API (OIPA) perspective this has been an ongoing issue for years now. Non URL/URI compliance does in many cases require a custom approach from our perspective. Seeing how others may have a different approach as well, this does not add to interoperability if we for example were to align different systems in the IATI network.

Basically the issue is two-fold: an IATI org. identifier can contain special characters, plus the additional identifier may contain special characters as well.

We would prefer this to be solved at the root of the chain -the standard itself- and not anywhere else.

bill_anderson · March 2, 2017, 11:37am

Isn’t the root of the chain the institution’s own business rules for how they id their projects? IATI doesn’t have jurisdiction over this.

@r_clements just to be clear, when I talked about breaking the link I didn’t mean across IATI datasets, but between IATI and the reporting organisation’s project management system.