CSV API Guidlines

bjwebb · August 19, 2014, 1:42pm

It occurs to me that we should maybe have some general guidelines for working with/producing CSVs. Currently the IATI Datastore, Dashboard, Codelist API, the Registry and the CSV Conversion tool linked to from the registry produce csvs. The Registry, CSV2IATI and AidStream also accept CSVs as inputs. If I have missed any active CSV producing/consuming IATI tools/APIs, please post them below.

Guidlines might include might include:

What format should be used (ie. what should the separator be, and what type of escaping should be used)
How should new lines be treated (the behaviour of quoted new lines seems to vary a lot)
What encoding should be used (I would suggest UTF-8)

If these choices are consistent between multiple APIs, then they can share one piece of user guidance (ie. How to open this file in your spreadsheet program of choice). The most common user mistake I see is opening UTF-8 encoded files as windows code page.

Dragon · August 4, 2014, 11:52pm

The CSV RFC, RFC 4180, specifies CRLF line endings: some generic tools don’t work with just LF - I can’t remember which off the top of my head at the moment.

Encoding is a pain. As far as I can see, Unicode is the only sane option from the point of view of guaranteeing that any combinations of names can be represented, particularly since there’s no way of specifying an encoding in CSV. Yet Excel doesn’t make it easy to open UTF-8 files correctly. (UTF-8 also has the useful property of not incorrectly decoding things which aren’t UTF-8 encoding, and being relatively easy to detect when decoded as another encoding: has one non-ASCII character become many? It was UTF-8 to start with).

If you’re genuinely producing CSVs, then I’d stick to the RFC for escaping/separators to help support loading into generic tools. If you can guarantee the contents of your files won’t contain certain characters, and will never contain newlines in data, then pipe- or semi-colon separated files are a sensible approach, too.

It might be worth making a simple test CSV file that people can load into their system / save back out. Something as simple as:

"newline CRLF
","newline LF
",noquote,"doublequote ""","singlequote '","comma ,","ÚTF-8"

and checking that the output matches the input exactly. (The UTF-8 bit should be multilingual, ideally)

[The above is all just my opinion from having worked with CSVs quite a bit - please critique harshly :)]