Working with BOMs (and MS Excel)

Excel is unfortunately both the most commonly used software for viewing CSV data and the worst software there is for it. The main issue is encoding: Excel uses the local system’s default encoding, which changes from one installation to another.

The only way (that I know of) to force Excel to use the right encoding when opening a CSV file is to write the data in a Unicode encoding and add a BOM (byte order mark) at the start of it.
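At the byte level, a BOM is just a short marker placed before the actual data. The sketch below uses only the JDK to show what a UTF-8 CSV file with a BOM looks like. The helper `withUtf8Bom` and the file name are made up for illustration and are not part of kantan.csv.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files

// Hypothetical helper (not part of kantan.csv): prefixes CSV text with the UTF-8 BOM.
def withUtf8Bom(csv: String): Array[Byte] = {
  // The UTF-8 BOM is the three bytes EF BB BF.
  val bom = Array(0xEF, 0xBB, 0xBF).map(_.toByte)
  bom ++ csv.getBytes(StandardCharsets.UTF_8)
}

// Writes a two-row, one-column CSV file that starts with the BOM.
val path = Files.createTempFile("excel-friendly", ".csv")
Files.write(path, withUtf8Bom("ニコラ\r\nリノド\r\n"))
```

Doing this by hand for every file is tedious and easy to forget, which is the part a library can take care of.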

Since version 0.1.18, kantan.csv has full support for BOMs, enabled by importing the following package:

import kantan.codecs.resource.bom._

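Once that’s done, all IO operations performed by kantan.csv will be BOM aware: when writing, the BOM that matches the in-scope Codec is written first, and when reading, a BOM found at the start of the data takes precedence over the in-scope Codec.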
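To make the idea of BOM-aware reading concrete, here is a minimal sketch using only the JDK. It is not kantan.csv's actual implementation: `decodeBomAware` is a hypothetical helper, and it only recognises the UTF-8 BOM (UTF-16 and UTF-32 BOMs are left out for brevity).

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical helper: decodes bytes, letting a UTF-8 BOM override the
// charset requested by the caller.
def decodeBomAware(bytes: Array[Byte], fallback: Charset): String = {
  val utf8Bom = Array(0xEF, 0xBB, 0xBF).map(_.toByte)

  if (bytes.take(utf8Bom.length).sameElements(utf8Bom))
    // BOM found: skip it and decode the rest as UTF-8, ignoring the fallback.
    new String(bytes.drop(utf8Bom.length), StandardCharsets.UTF_8)
  else
    // No BOM: trust the caller's charset.
    new String(bytes, fallback)
}

// UTF-8 data with a BOM, read while asking for ISO-8859-1.
val bytes = Array(0xEF, 0xBB, 0xBF).map(_.toByte) ++ "ニコラ".getBytes(StandardCharsets.UTF_8)
decodeBomAware(bytes, StandardCharsets.ISO_8859_1) // "ニコラ"
```

The extra check on every read is small here, but it does have to happen for each resource that is opened.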

For example:

import kantan.csv._
import kantan.csv.ops._
import scala.io.Codec

// Let kantan.csv know that data should be written in UTF-8
implicit val codec: Codec = Codec.UTF8

// Our input is in katakana, characters that cannot be encoded using ISO-LATIN-1.
val input = List("ニコラ", "リノド")

// File in which we'll be writing the CSV data.
val out = java.io.File.createTempFile("kantan.csv", ".csv")

// Writes input using , as a column separator.
out.writeCsv(input, rfc)

Since we’ve imported kantan.codecs.resource.bom._, out contains the UTF-8 BOM. We can verify that by attempting to read it with an incompatible encoding:

def readIso() = {
  // ISO-LATIN-1 cannot be used to read our file, since it does not support katakana.
  implicit val codec: Codec = Codec.ISO8859

  out.readCsv[List, String](rfc)
}

readIso()
// res1: List[ReadResult[String]] = List(
//   Right(value = "ニコラ"),
//   Right(value = "リノド")
// )
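For contrast, this is roughly what happens when UTF-8 bytes are decoded as ISO-LATIN-1 with no BOM to correct the choice. The sketch uses only the JDK and does not involve kantan.csv.

```scala
import java.nio.charset.StandardCharsets

// Each of the three katakana characters takes three bytes in UTF-8.
val utf8Bytes = "ニコラ".getBytes(StandardCharsets.UTF_8)

// ISO-8859-1 maps every byte to exactly one character, so the nine bytes
// come back as nine unrelated characters instead of three katakana.
val garbled = new String(utf8Bytes, StandardCharsets.ISO_8859_1)

garbled.length      // 9
garbled == "ニコラ"  // false
```

Getting the original katakana back from the read above therefore suggests that the BOM, rather than the in-scope ISO-LATIN-1 codec, decided the encoding.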

Note that these behaviours are disabled by default: BOMs are advised against, and looking for them (and interpreting them when found) has a performance cost.

