Error handling

There are many ways of dealing with parse errors in kantan.csv. This tutorial shows the most common strategies, but it essentially boils down to knowing how Either (the underlying type of ReadResult) works.
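Since everything below relies on Either, here is a minimal refresher that uses only the Scala standard library (2.13 or later). The parseFlag helper and the String error type are hypothetical stand-ins made up for illustration, not part of kantan.csv.

```scala
object EitherBasics {
  // Hypothetical stand-in for a decoding step:
  // Left carries the error, Right carries the decoded value.
  def parseFlag(raw: String): Either[String, Boolean] =
    raw.toBooleanOption.toRight(s"'$raw' is not a valid Boolean")

  val ok: Either[String, Boolean]  = parseFlag("true")
  val bad: Either[String, Boolean] = parseFlag("28")

  // map only touches the Right side; a Left passes through unchanged.
  val negated: Either[String, Boolean]  = ok.map(!_)
  val stillBad: Either[String, Boolean] = bad.map(!_)

  // fold collapses both cases into a single value.
  def describe(e: Either[String, Boolean]): String =
    e.fold(err => s"failed: $err", flag => s"decoded: $flag")
}
```

Pattern matching on Left and Right works just as well as fold; the strategies in this tutorial are all variations on these few operations.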

All the examples here are going to be using the following data:

1,Nicolas,true
2,Kazuma,28
3,John,false

Note how the second row’s third column is not of the same type as that of the other rows.

Let’s first declare the basic things we need to decode such a CSV file (the other kantan.csv tutorials cover these in detail if anything here is unfamiliar):

import kantan.csv._
import kantan.csv.ops._
import kantan.csv.generic._

case class Person(id: Int, name: String, flag: Boolean)

val rawData: java.net.URL = getClass.getResource("/dodgy.csv")

Throw on errors

The simplest, least desirable error handling mechanism is to ignore the possibility of failure and allow exceptions to be thrown. This is achieved by using asUnsafeCsvReader:

scala.util.Try(rawData.asUnsafeCsvReader[Person](rfc).toList)
// res0: util.Try[List[Person]] = Failure(
//   exception = TypeError(message = "'28' is not a valid Boolean")
// )

Note that this is hardly ever an acceptable solution. In idiomatic Scala, we pretend that exceptions don’t exist and rely on encoding errors in return types. Still, unsafe readers can be useful, for example when writing one-off scripts for which reliability and maintainability are not a concern.

Drop errors

Another common, if not always viable, strategy is to use collect to simply drop whatever rows failed to decode:

rawData.asCsvReader[Person](rfc).collect { case Right(a) => a }.toList
// res1: List[Person] = List(
//   Person(id = 1, name = "Nicolas", flag = true),
//   Person(id = 3, name = "John", flag = false)
// )

collect is a bit like a filter and a map rolled into one: it lets us ignore all failures and extract the value of all successes in a single step.

This is achieved in an entirely safe way, validated at compile time.
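To see why collect is the safer choice, the sketch below compares it with a separate filter and map on plain Either values. It uses only the standard library; the Result alias and the sample rows are hypothetical stand-ins for ReadResult and decoded CSV rows.

```scala
object CollectSketch {
  // Hypothetical stand-in for ReadResult, with a String error type.
  type Result[A] = Either[String, A]

  val rows: List[Result[Int]] = List(Right(1), Left("bad row"), Right(3))

  // filter then map: the compiler no longer knows that only Rights remain,
  // so the extraction needs a fallback that should never run.
  val viaFilterMap: List[Int] =
    rows.filter(_.isRight).map(_.getOrElse(sys.error("unreachable")))

  // collect: the pattern both selects the Rights and extracts their values.
  val viaCollect: List[Int] = rows.collect { case Right(a) => a }

  // The same call works on an Iterator, so rows can be dropped while streaming.
  val streamed: List[Int] = rows.iterator.collect { case Right(a) => a }.toList
}
```

All three values contain the same elements; only the collect versions avoid the unreachable fallback.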

Fail if at least one row fails to decode

When not streaming data, a good option is to fail as soon as a single row fails to decode, that is, to turn a List[ReadResult[A]] into a ReadResult[List[A]]. This is done through ReadResult’s sequence method:

ReadResult.sequence(rawData.readCsv[List, Person](rfc))
// res2: Either[ReadError, List[Person]] = Left(
//   value = TypeError(message = "'28' is not a valid Boolean")
// )

The only real downside to this approach is that it requires loading all the data into memory.
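For intuition, a sequence operation over plain Either values can be sketched as below. This is an illustrative standard-library version, not kantan.csv's actual implementation, and the generic error type E is an assumption made for the example.

```scala
object SequenceSketch {
  // Returns the first Left in list order if there is one;
  // otherwise gathers every Right value, preserving order.
  def sequence[E, A](results: List[Either[E, A]]): Either[E, List[A]] =
    results.foldRight[Either[E, List[A]]](Right(Nil)) { (result, acc) =>
      for {
        a    <- result // a Left here wins over anything accumulated so far
        tail <- acc
      } yield a :: tail
    }
}
```

Because the whole list is folded before a result is produced, the sketch also makes the memory cost visible: every row has to be available before the answer is known.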

Use more flexible types to prevent errors

Our problem here is that the flag field of our Person class is not always of the same type - some rows have it as a boolean, others as an Int. This is something that the Either type is well suited for, so we could rewrite Person as follows:

case class SafePerson(id: Int, name: String, flag: Either[Boolean, Int])

We can now load the whole data without an error:

rawData.readCsv[List, SafePerson](rfc)
// res3: List[ReadResult[SafePerson]] = List(
//   Right(value = SafePerson(id = 1, name = "Nicolas", flag = Left(value = true))),
//   Right(value = SafePerson(id = 2, name = "Kazuma", flag = Right(value = 28))),
//   Right(value = SafePerson(id = 3, name = "John", flag = Left(value = false)))
// )
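The idea behind decoding into Either[Boolean, Int] can be pictured as trying one type and falling back to the other. The sketch below shows this for a single cell with the standard library only; decodeFlag and its error message are hypothetical, and the code is not how kantan.csv implements its decoders.

```scala
object FlexibleFlagSketch {
  // Try Boolean first (Left), then Int (Right); fail only if neither fits.
  def decodeFlag(raw: String): Either[String, Either[Boolean, Int]] = {
    val asBoolean: Option[Either[Boolean, Int]] = raw.toBooleanOption.map(Left(_))
    val asInt: Option[Either[Boolean, Int]]     = raw.toIntOption.map(Right(_))
    asBoolean.orElse(asInt).toRight(s"'$raw' is neither a Boolean nor an Int")
  }
}
```

With this shape, "true" and "28" both decode successfully, while a cell such as "maybe" still produces an error.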

Following the same general idea, one could use Option for fields that are not always set.
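As a rough illustration of the Option variant, the sketch below treats an empty cell as "not set" and decodes anything else as a Boolean. It uses the standard library only; decodeOptionalFlag is a hypothetical helper, and reading an empty cell as None is an assumption made for this example.

```scala
object OptionalFlagSketch {
  // Assumption: an empty cell means the field is not set.
  def decodeOptionalFlag(raw: String): Either[String, Option[Boolean]] =
    if (raw.isEmpty) Right(None)
    else raw.toBooleanOption.map(b => Option(b)).toRight(s"'$raw' is not a valid Boolean")
}
```

Note that Option only absorbs missing values: a cell such as "28" is present but of the wrong type, so it still fails to decode.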

This strategy is not always possible, but it is good to keep in mind for those cases where it can be applied.

