Flatten-tool: New approach for Spreadsheet to IATI conversion

This post is mostly a write up of the lightening talk I gave at the TAG, with some additional info on how the work is progressing.

CSV2IATI is going to be retired this year, so at Open Data Services Co-operative we’ve been looking at alternate approaches for converting spreadsheets to IATI XML.

For OCDS (Open Contracting Data Standard) and 360Giving, we built flatten-tool. This is a command line tool, and python library, for converting from flat spreadsheet formats to nested structured formats (JSON or XML). We’re working on adding IATI XML support to this.

Key features of flatten-tool:

No mapping tool

Instead of a mapping tool, we require that headings contain an xpath (or close to an xpath) for the part of the IATI XML the column corresponds to.

e.g. this spreadsheet:

becomes this XML:

Most users won’t need to worry about xpath though, as we’ll have a pre-made template for them to download and populate. However, the approach does also allow custom templates to be easily crafted for particular use cases.

Since all information is contained in the spreadsheet, it will be possible to upgrade to a new IATI standard version by just changing the data, without any change required to the tool.

One-to-many relationships

In IATI, one activity usually has multiple child elements of certain types (e.g. multiple transactions). flatten-tool supports multiple ways of representing this:

  • Multiple columns
  • Separate sheets, linked by the IATI Identifier in both sheets
  • Repeated rows, for compatibility with CSV2IATI

e.g. multiple columns might be convenient for mulitiple recipient countries:

but a separate sheet might be more convenient for multiple transactions:

Excel workbook (XLSX) support, in addition to CSV

CSV and XLSX are the supported input file formats.

CSVs exported from Excel often have problems - e.g. encodings and date formats. By supporting XLSX we can more easily get at the original data. It’s also an easy way to supply multiple sheets with one-to-many relationships (above).

Progress so far

Basic XML support has been merged into flatten-tool, and there’s instructions for using it to produce IATI XML in the docs. We’re already using this code internally to help people publish IATI data. As @johnadams would say, we’re eating our own dog food.

There’s a bunch of issues still to be worked on. The most pressing, which we’re working on currently, are sorting the XML correctly, and allowing spreadsheets to specify the IATI standard version. We’re currently working around these by manually adding the version number, and sorting with this script.

Future plans

We’re working on a web interface, so that users can upload a file, instead of having to use a commandline tool.

We want this web interface to have integrated validation. Additionally validation messages will list the cell in the spreadsheet that caused the error, allowing data to be fixed more easily. @stevieflow made a presentation about this approach to the IATI Members’ Assembly. Here’s a screenshot of what this currently looks like for 360Giving:

For OCDS and 360Giving we also have conversion to spreadsheet (which means it’s possible to roundtrip, ie. edit and then convert back to XML). We think this could be useful for IATI too, but at the moment we’re focusing on conversion from spreadsheet first, because that appears to have the greater need.

Super cool.

Was just going to suggest doing this the HXL way, with a human-readable header row followed by the flatten-tool-readable xpath-like header row. But I see @TimDavies has already suggested it :slight_smile:

Sorry if this is a silly question… Are you talking about specifying the output version? If so, it seems odd to store this in with the spreadsheet. If you’re instead talking about the IATI version used by the spreadsheet, isn’t it possible to infer this from the headers / data?

Oh yeah for sure, this would be really great. Loads of applications for this.

Forgive my IATI naïvité - can you give me some examples? I’m working on all the product owner-ey bits of making all this happen!

On the contrary! I’m fairly new to IATI, so I may well be wrong!

I was specifically thinking about CSV export from the datastore (but I suppose CSV export from d-portal, too?) Getting flattened data out so it can be manipulated using familiar tools (e.g. Excel) is helpful for a set of data users.

I’ve seen some issues about the way the datastore does CSV export. If it were to use flatten-tool that could at least help in standardising IATI to spreadsheet export across tools.

Yes, it’s the IATI version used by the spreadsheet. Currently all info in the spreadsheet is assumed to be about iati-activity and child elements, whereas @version is found on the iati-activities root element. What we’re working on is letting people add an extra tab for these root level attributes.

Thanks for posting this update, especially for those who were not at the TAG lightning talk session. Good to hear of progress too!

Just reading though the discussion on how versions will be specified, it implies that the output XML can be produced to any versions of IATI Standard (v1.01 to v2.02) - is this correct?

I know @Wendy was curious about this a few weeks ago.

Yup! xpaths in headings mean that you can specify where in the tree the value should go so, for example, if you had a sheet which had in it:

+-----------------+--------------------+-------------------------------------+
| iati-identifier | ...other values... | transaction/transaction/description |
+-----------------+--------------------+-------------------------------------+
| GB-TEST-1       |                ... | Test description for fake activity  |
+-----------------+--------------------+-------------------------------------+

This would result in something like:

<iati-activities version="1.05" etc...>
  <iati-activity>
    <iati-identifier>GB-TEST-1</iati-identifier>
    <!-- other columns -->
      <transaction>
        <description>Test description for fake activity</description>
      </transaction
    <!-- other columns -->
  </iati-activity>
</iati-activities>

Change that to:

+-----------------+--------------------+-----------------------------------------------+
| iati-identifier | ...other values... | transaction/transaction/description/narrative |
+-----------------+--------------------+-----------------------------------------------+
| GB-TEST-1       |                ... | Test description for fake activity            |
+-----------------+--------------------+-----------------------------------------------+

And it would be more like:

<iati-activities version="2.01" etc...>
  <iati-activity>
    <iati-identifier>GB-TEST-1</iati-identifier>
    <!-- other columns -->
      <transaction>
        <description>
          <narrative>Test description for fake activity</narrative>
        </description>
      </transaction
    <!-- other columns -->
  </iati-activity>
</iati-activities>

Though I’ve skimmed over the version declaration and other elements.

Ultimately it’s just very flexible, but would require a good template for each version or a good level of IATI knowledge to make a spreadsheet work well i.e. knowing the difference between a 1.05 transaction description and a 2.01 one.

Hope this is clear :slight_smile:

Thanks for the extra detail. Looking forward to giving it a go sometime…

Re templates - would it make sense to just have a v1.05 template, plus a v2.02 template?

Therefore – unless a user specifically wants another point version (and wants to do the work to figure out which fields didn’t exist in earlier versions) – the tool outputs at these versions.

When v2.03 is released, just add any new fields into the sheet and the default output options are then v1.05 and v2.03.