There are many ways of dealing with parse errors in kantan.csv. This tutorial shows the most common strategies, but
they all essentially boil down to knowing how Either (the underlying type of ReadResult) works.
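As a quick refresher on Either itself, here is a minimal sketch that uses only the Scala 2.13 standard library. The parseFlag and describe helpers are hypothetical names introduced for illustration; they are not part of kantan.csv.

```scala
object EitherBasics {
  // Hypothetical stand-in for a decoding function: Right on success,
  // Left carrying an error message on failure.
  def parseFlag(cell: String): Either[String, Boolean] =
    cell.toBooleanOption.toRight(s"'$cell' is not a valid Boolean")

  // Pattern matching makes both outcomes explicit.
  def describe(result: Either[String, Boolean]): String = result match {
    case Right(flag) => s"decoded $flag"
    case Left(error) => s"failed: $error"
  }

  // Either is right-biased: map transforms successes and leaves failures untouched.
  val negated: Either[String, Boolean] = parseFlag("true").map(!_)
  val failed: Either[String, Boolean]  = parseFlag("28").map(!_)
}
```

Every strategy discussed here is a variation on deciding what to do with the Left case.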
All the examples here are going to be using the following data:
1,Nicolas,true
2,Kazuma,28
3,John,false
Note how the second row’s third column is not of the same type as that of the other rows.
Let’s first declare the basic things we need to decode such a CSV file (see this if it does not make sense to you):
import kantan.csv._
import kantan.csv.ops._
import kantan.csv.generic._
case class Person(id: Int, name: String, flag: Boolean)
val rawData: java.net.URL = getClass.getResource("/dodgy.csv")
The simplest, least desirable error handling mechanism is to ignore the possibility of failure and allow exceptions
to be thrown. This is achieved by using asUnsafeCsvReader:
scala.util.Try(rawData.asUnsafeCsvReader[Person](rfc).toList)
// res0: util.Try[List[Person]] = Failure(
// exception = DecodeError(message = "For input string: \"28\"")
// )
Note that this is hardly ever an acceptable solution. In idiomatic Scala, we pretend that exceptions don’t exist and rely on encoding errors in return types. Still, unsafe readers can be useful - when writing one-off scripts for which reliability or maintainability are not an issue, for example.
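To make the contrast between thrown exceptions and errors encoded in return types concrete, the following sketch relies on the standard library only. unsafeFlag and safeFlag are hypothetical helpers, not kantan.csv functions.

```scala
import scala.util.Try

object UnsafeVsSafe {
  // Throws an IllegalArgumentException for anything but "true" or "false":
  // the signature gives no hint that this can fail.
  def unsafeFlag(cell: String): Boolean = cell.toBoolean

  // Same logic, but the possibility of failure is now visible in the return type.
  def safeFlag(cell: String): Either[Throwable, Boolean] =
    Try(unsafeFlag(cell)).toEither
}
```

Callers of safeFlag cannot get at the Boolean without deciding what to do about the Left case, which is the property the safe readers in the rest of this tutorial rely on.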
Another common, if not always viable, strategy is to use collect to simply drop whatever rows failed to decode:
rawData.asCsvReader[Person](rfc).collect { case Right(a) => a }.toList
// res1: List[Person] = List(
// Person(id = 1, name = "Nicolas", flag = true),
// Person(id = 3, name = "John", flag = false)
// )
collect is a bit like a filter and a map rolled into one: it allows us to ignore every failure and extract the value of every success in a single pass.
This is achieved in an entirely safe way, validated at compile time.
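The relationship between collect, filter and map can be sketched on a plain list of Either values, with no CSV involved. The rows value below is made-up sample data.

```scala
object CollectDemo {
  // Made-up decoding results: two successes and one failure.
  val rows: List[Either[String, Int]] = List(Right(1), Left("bad row"), Right(3))

  // collect: the partial function both selects the successes and unwraps them.
  val collected: List[Int] = rows.collect { case Right(a) => a }

  // The filter-then-map equivalent needs a fallback that the compiler
  // cannot prove unreachable, which is what collect lets us avoid.
  val filteredThenMapped: List[Int] =
    rows.filter(_.isRight).map(_.getOrElse(sys.error("unreachable")))
}
```

Both values hold List(1, 3), but only the collect version needs no escape hatch.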
When not streaming data, a good option is to fail if a single row fails to decode: that is, to turn a
List[ReadResult[A]] into a ReadResult[List[A]]. This is done through ReadResult’s sequence method:
ReadResult.sequence(rawData.readCsv[List, Person](rfc))
// res2: Either[ReadError, List[Person]] = Left(
// value = TypeError(message = "'28' is not a valid Boolean")
// )
The only real downside to this approach is that it requires loading all of the data into memory.
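For intuition, here is one way such a sequence operation could be written over plain Either values. This is a sketch under the assumption that the first error encountered should win; it is not kantan.csv's actual implementation.

```scala
object SequenceDemo {
  // Hypothetical helper: returns every decoded value if all rows succeeded,
  // or the leftmost error otherwise.
  def sequence[E, A](results: List[Either[E, A]]): Either[E, List[A]] =
    results.foldRight[Either[E, List[A]]](Right(Nil)) { (result, acc) =>
      for {
        a  <- result // a failure in the current row wins...
        as <- acc    // ...otherwise a failure further down the list does
      } yield a :: as
    }

  val allGood: List[Either[String, Int]] = List(Right(1), Right(2))
  val someBad: List[Either[String, Int]] = List(Right(1), Left("first"), Left("second"))
}
```

Because the input is a fully materialised List, the sketch also shows why this strategy does not combine with streaming: every row must have been read before a result can be returned.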
Our problem here is that the flag field of our Person class is not always of the same type: some rows have it as a
Boolean, others as an Int. This is something that the Either type is well suited for, so we could rewrite
Person as follows:
case class SafePerson(id: Int, name: String, flag: Either[Boolean, Int])
We can now load the whole data without an error:
rawData.readCsv[List, SafePerson](rfc)
// res3: List[ReadResult[SafePerson]] = List(
// Right(value = SafePerson(id = 1, name = "Nicolas", flag = Left(value = true))),
// Right(value = SafePerson(id = 2, name = "Kazuma", flag = Right(value = 28))),
// Right(value = SafePerson(id = 3, name = "John", flag = Left(value = false)))
// )
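The idea behind decoding such a field can be sketched with the standard library alone: try the first type, then fall back to the second. decodeFlag is a hypothetical helper, and trying Boolean before Int is an assumption made to match the output above, not a description of kantan.csv's internals.

```scala
object EitherField {
  // Hypothetical cell decoder: Left for Booleans, Right for Ints,
  // and an error message if the cell is neither.
  def decodeFlag(cell: String): Either[String, Either[Boolean, Int]] = {
    val asBoolean: Option[Either[Boolean, Int]] = cell.toBooleanOption.map(Left(_))
    val asInt: Option[Either[Boolean, Int]]     = cell.toIntOption.map(Right(_))

    asBoolean.orElse(asInt).toRight(s"'$cell' is neither a Boolean nor an Int")
  }
}
```

Note that the outer Either still reports failure: a cell such as "maybe" would be rejected, so this widens what is accepted rather than disabling validation.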
Following the same general idea, one could use Option for fields that are not always set.
This strategy is not always possible, but is good to keep in mind for those cases where it can be applied.
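The Option variant can be sketched the same way. decodeAge is a hypothetical helper, and treating a blank cell as "not set" is an assumption made for this example.

```scala
object OptionalField {
  // Hypothetical cell decoder: a blank cell is a legitimate absence,
  // while a non-blank cell must still be a valid Int.
  def decodeAge(cell: String): Either[String, Option[Int]] =
    if (cell.trim.isEmpty) Right(None)
    else cell.toIntOption.map(Option(_)).toRight(s"'$cell' is not a valid Int")
}
```

As with the Either field, a missing value becomes ordinary data while a malformed one is still reported as an error.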