Publishing application (excluded 2.03)

IATI-techteam · June 16, 2017, 3:40pm

This proposal is part of the 2.03 upgrade process, please comment by replying below.

Standard
Activity and Organisation

Schema Object
iati-activities/@publisher-app
iati-organisations/@publisher-app

Type of Change
Addition to Schema

Issue
Knowing what application has been used to generate IATI data is of use for two reasons. Firstly it allows users and publishers to track similarities in data quality issues across publishers. For example a problem that arises in the data of one publisher may well be replicated across all publishers using the same system. Secondly it will allow new publishers to find others who have already generated IATI data from a similar system.

Proposal

Add attributes iati-activities/@publisher-app and iati-organisations/@publisher-app
- Definition: A code describing the application used to publish this file. It must be a valid value in the PublisherApp codelist.
- Occurs: 0..1
Add a new non-embedded PublisherApp codelist
Initial Values:
- 11 - AidStream
- 12 - IATI Studio
- 13 - CSV2IATI
- 51 - Microsoft SQL Server
- 52 - SAP
- 53 - Oracle
- 54 - PeopleSoft
- 55 - Access Accounts
- 99 - Other
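
As an illustration only, the check implied by the proposal (an optional attribute whose value must be on the codelist) could be sketched as below. The function name `publisher_app_name` and the sample file are assumptions made for this sketch; the codes are the initial values listed in the proposal, and nothing here is part of the standard.

```python
import xml.etree.ElementTree as ET

# Initial codelist values as listed in the proposal.
PUBLISHER_APP_CODES = {
    "11": "AidStream",
    "12": "IATI Studio",
    "13": "CSV2IATI",
    "51": "Microsoft SQL Server",
    "52": "SAP",
    "53": "Oracle",
    "54": "PeopleSoft",
    "55": "Access Accounts",
    "99": "Other",
}


def publisher_app_name(xml_text):
    """Return the application name for the root element's publisher-app code.

    Returns None when the attribute is absent (it occurs 0..1) and raises
    ValueError when the code is not on the codelist.
    """
    root = ET.fromstring(xml_text)
    code = root.get("publisher-app")
    if code is None:
        return None
    if code not in PUBLISHER_APP_CODES:
        raise ValueError(f"unknown publisher-app code: {code!r}")
    return PUBLISHER_APP_CODES[code]


# Hypothetical minimal file, for illustration only.
sample = '<iati-activities version="2.03" publisher-app="11"></iati-activities>'
print(publisher_app_name(sample))  # prints "AidStream"
```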
Standards Day
There was agreement that the information would be useful, though more research needs to be undertaken. It is also metadata, so it may belong in the Registry rather than in the standard.

Links

docs.google.com

14. Improvements relating to both Activity and Organisation Standards

Name of paper: Improvements relating to both Activity and Organisation Standards
Lead author: Hayden Field / Bill Anderson
Links to other relevant information (Discuss forum; Website etc): Discuss:...

hayfield · May 15, 2017, 1:25pm

Calling the new field publishing-app (and associated Codelist PublishingApp) would better indicate that it is describing a process rather than an entity.

An Activity or Organisation may be defined without a parent root element. An Enhancer application (see below) may alternatively modify only one of several Activities within a dataset. As such, would propose this was added at the iati-activity level.

…or possibly at the higher level, with lower-level override (though that seems confusing).

Take a tool that is able to enhance an existing IATI XML file with additional information. Call it Enhancer. When you take a file generated by, say, AidStream and pass it through Enhancer, which value should be published?

By the base proposal, the implication is that you should state that the file was generated by Enhancer. Doing this means you lose information about how the data was originally generated. As such, it should be possible to state multiple applications involved in the process of generating the XML.

Would therefore propose that publishing-app should be an element rather than an attribute.

As an element, the order in which applications touched the dataset becomes unclear. As such, an additional attribute could be added to indicate that order.
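
A rough sketch of how a consumer might read such an element-based chain is below. The attribute name `order`, the placement inside iati-activity, and the sample values are all assumptions for illustration; the thread does not fix any of them.

```python
import xml.etree.ElementTree as ET

# Hypothetical activity: the element name follows the suggestion above,
# while the "order" attribute is an assumed name.
sample = """
<iati-activity>
  <publishing-app order="2">Enhancer</publishing-app>
  <publishing-app order="1">AidStream</publishing-app>
</iati-activity>
"""


def processing_chain(activity_xml):
    """Return application names sorted by their declared processing order."""
    activity = ET.fromstring(activity_xml)
    apps = activity.findall("publishing-app")
    apps.sort(key=lambda el: int(el.get("order")))
    return [el.text.strip() for el in apps]


print(processing_chain(sample))  # prints ['AidStream', 'Enhancer']
```

An explicit order attribute means the chain survives even if a tool rewrites the file and reorders the elements.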

There was a suggestion on Standards Day that browser User Agents could be used as inspiration. They are not a good example of how information can be represented well.

Herman · May 19, 2017, 2:31pm

Isn’t this proposal meant to identify the application producing the original IATI dataset (Publish once!)? Whatever you do afterwards with an IATI dataset shouldn’t change this.

Since this field describes a technical attribute of how the IATI data was produced and has nothing to do with the content itself, I prefer this information to be part of the metadata in the registry and not of the IATI file itself.

So I would suggest keeping this as simple as possible: do not change the standard, but change the metadata.

TimDavies · May 25, 2017, 6:23pm

This is a useful field to have in the standard - but I’m not sure if it needs a codelist - a string would suffice.

I suspect something more along the lines of a User Agent String as the value would give greater flexibility for applications to declare themselves - and their properties.

TimDavies · May 25, 2017, 6:25pm

Just noticed the user agent suggestion was addressed above.

Critique noted, but user agent strings are interpreted well by lots of applications - and are generally ‘good enough’ for the purpose they are put to - so I don’t think they should be discounted based on a tongue-in-cheek article.

hayfield · June 5, 2017, 12:17pm

Taking the concept of a non-Codelist-based string-field, an alternative proposal would be:

Add attributes iati-activities/@publishing-app and iati-organisations/@publishing-app
Base Type: xsd:string
Definition: A semi-colon separated list of application(s) used in the generation of this file.
Occurs: 0..1
Rules:
- Multiple applications must be separated by semi-colons (;).
- Application names must not contain semi-colons (;).
- Application names beyond the first must be appended to (rather than prepended to or inserted in the middle of) the list.
- There must not be a trailing semi-colon at the end of the list.
Guidelines:
- Application names should include a version, build number or equivalent.
- If multiple applications feed into the generation of a file, only those required to identify the source of a potential data problem should be stated.
Regex: The value must conform to the regular expression [regexTBC]. (Note: This would make some of the Rules redundant)
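
The Rules above could be checked with something like the following minimal sketch. The function names are hypothetical, the application names and versions in the usage lines are made up, and stripping whitespace around names is an assumption the rules do not state.

```python
def parse_publishing_apps(value):
    """Split a publishing-app value into application names, enforcing the rules."""
    if value.endswith(";"):
        raise ValueError("there must not be a trailing semi-colon")
    # Assumption: surrounding whitespace is not significant.
    names = [name.strip() for name in value.split(";")]
    if any(not name for name in names):
        raise ValueError("empty application name in list")
    return names


def append_app(value, app_name):
    """Append an application to the end of the list, never the start or middle."""
    if ";" in app_name:
        raise ValueError("application names must not contain semi-colons")
    if not value:
        return app_name
    parse_publishing_apps(value)  # reject an already-invalid list
    return f"{value};{app_name}"


# Hypothetical names and versions, for illustration only.
chain = append_app("AidStream 2.1", "Enhancer 0.3")
print(parse_publishing_apps(chain))
```

Because names are only ever appended, the position in the list doubles as the processing order.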

Herman · June 7, 2017, 3:32pm

@TimDavies The standard should i.m.o. be as simple and technology independent as possible. A data user should not need to care about what system produced the data. The registry itself provides ample means to add this kind of metadata. Why not solve it there and avoid adding another field to the standard and again increasing complexity?

TimDavies · June 7, 2017, 3:49pm

The registry does not answer the use-case of understanding the different applications that have handled data during a workflow - and over-centralises information.

This is essentially about data provenance - which is an important consideration when using any data. As a user, I have use-cases where I do care about the system that produced data - as there is no requirement on systems to support all the features of the standard - and so knowing which system produced data is important to understanding the data’s limitations.

Given this is information that would be added by tools - this has minimal impact on individual publishers.

Herman · June 7, 2017, 4:08pm

@TimDavies Can you give a real world example? I am very curious, since I have been working for two years now very intensively with many IATI files from many different publishers produced by many different systems, and I have not missed this functionality.

TimDavies · June 7, 2017, 4:26pm

We’re currently doing technical assistance for a range of donors to improve the publication of agriculture-related aid activities.

This involves encouraging use of additional classifications, and making sure activities include location data.

It will also involve providing tooling that could help enhance datasets.

In order to identify the best interventions with each publisher, in terms of investing in updates to commonly used tools, or identifying who will find it difficult to provide particular data due to the tools they are using (AidStream / CSV to IATI / internal platform etc.) it would be useful to query across data to look at which tools are generating agriculture-related activities.

Then, as we work on enhancing data, I would like to be able to clearly indicate that the data has been processed through the tooling we’re working on, in terms of managing the provenance chain.

Bart_Stevens · June 8, 2017, 7:28am

Instead of using a fixed list that has to be updated centrally, why not follow the logic of the activity/organisation identifiers? Creators of publishing apps could register their application and get a unique identifier for their app. When saving changes, the app would then check the last value in the publisher-app element and when it’s a different identifier (meaning that the file was created/modified by another app) it would add its own identifier. If you add a date stamp then you would get a history of the modifications made to the file.
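
The "check the last value, add your own if different" logic could look roughly like the sketch below. The attribute names `ref` and `date` and the identifiers `app-a` and `app-b` are invented for illustration; no registration scheme for application identifiers exists in the thread.

```python
import xml.etree.ElementTree as ET
from datetime import date


def record_app(root, app_id, today=None):
    """Append a publisher-app entry unless app_id was the last app recorded.

    Returns True when a new entry was added, False otherwise.
    """
    entries = root.findall("publisher-app")
    if entries and entries[-1].get("ref") == app_id:
        return False  # same app as last time: nothing to add
    entry = ET.SubElement(root, "publisher-app")
    entry.set("ref", app_id)  # assumed attribute name
    entry.set("date", (today or date.today()).isoformat())  # date stamp
    return True


root = ET.fromstring("<iati-activities/>")
record_app(root, "app-a", date(2017, 6, 8))  # first save by app-a
record_app(root, "app-a")                    # app-a again: no new entry
record_app(root, "app-b")                    # different app: entry added
print([e.get("ref") for e in root.findall("publisher-app")])
```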

I do agree that this information should be included in the file, not in the registry. If I work on the file within my organisation before it has been registered, it may be useful to know what different tools have been used. Also if the information is stored in the file, you can automate the process. If you have to store it separately in the registry then you have to do it manually and you’ll have more work to do.

petyakangalova · June 8, 2017, 9:12am

I would just like to add a comment and echo @TimDavies’s comments that from a publisher support side there will be huge benefits from having information about the publishing applications being used. The IATI technical team supports a high number of publishers and at the moment we are unable to identify which publishing tools organisations are using (except for CSV2IATI and AidStream). This is normally the first question I would ask publishers so that I am able to identify how their data has been created, how they are using the standard and what potential challenges they might face in improving their data.

Also, with the CSV2IATI decommissioning project this will be a very timely exercise to incorporate any of the tools that are currently in development. I am sure this information will also be useful for those developing the tools so that they are able to identify which organisations are using them and make any improvements that would be fit for the specific organisations using that tool.

Herman · June 8, 2017, 12:55pm

@TimDavies Thanks for your examples. It clarifies why you would like to see this information in the file and not the registry meta-data.

One aspect of the discussion I am still very doubtful about: the notion that IATI data is reprocessed and republished after the original publication as IATI data again, thereby creating an audit trail of data processing applications.

Unless I fail to understand the above proposal, this is i.m.o. a violation of the ‘Publish once, use often’ principle. This principle is very important for multiple reasons, such as:

- avoiding duplication of work
- avoiding inconsistencies because the same IATI data are being maintained in different places
- assuring the uniqueness of IATI activities
- making clear who is ultimately responsible for the data quality

In summary: if you want to enrich existing IATI data and use it in an application, that’s fine. It creates i.m.o. problems though if you subsequently republish the enriched/altered IATI data as a new IATI dataset, since it introduces the problems mentioned above.

Therefore I fail to see the need to have an ‘audit trail’ of applications used to process the IATI data. It should be sufficient to just publish the name of the application which originally produced the IATI data (in the IATI file itself).

IATI-techteam · June 16, 2017, 4:11pm

This topic has not been included in this upgrade, to allow for further discussion ahead of possible inclusion in the next upgrade.

If you feel that this should still be included in the current upgrade, please do respond here.

andylolz · June 22, 2017, 2:56pm

This is a good point that’s not made in the proposal / explicitly in this discussion! That is: these two publishing applications do currently address this problem (i.e. files generated by them are easily identifiable as such) albeit in an unofficial way.

They do so by adding a “Generated By” comment near the top of the IATI files they generate. Other publishing apps do the same e.g. “Generated By EUDEVFIN” also appears, as well as a couple of others (I like this example). Some of these point to custom scripts.

Anyway – so this is already done in an unofficial way for >20% of IATI files. This isn’t exactly a demonstrable use case… but I do think it suggests that if this attribute were added, it would likely be put to good use.
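
For anyone wanting to pick up the unofficial “Generated By” comments described above, a minimal sketch follows. The exact comment wording varies between tools, so the pattern, the `head_chars` cut-off, and the sample file are assumptions. A regular expression over the raw text is used because the default `xml.etree.ElementTree` parser discards comments.

```python
import re

# Assumed pattern: an XML comment starting with "Generated By".
GENERATED_BY = re.compile(
    r"<!--\s*Generated By\s+(.*?)\s*-->", re.IGNORECASE | re.DOTALL
)


def generated_by(xml_text, head_chars=2000):
    """Return the text after "Generated By" in a comment near the top, or None."""
    match = GENERATED_BY.search(xml_text[:head_chars])
    return match.group(1) if match else None


# Hypothetical file header, for illustration only.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<!-- Generated By AidStream -->
<iati-activities version="2.02"></iati-activities>
"""
print(generated_by(sample))  # prints "AidStream"
```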

IATI-techteam · October 25, 2017, 1:49pm