fi.pelam.csv

table

package table

This package contains the whole table oriented API for processing CSV data.

See the reader and writer classes for more information.

Source
package.scala
See also

TableWriter

TableReader

Type Members

  1. final case class CellType[+RT, +CT](rowType: RT, colType: CT) extends Product with Serializable

    The "type" of a cell in a Table is considered a pair of objects each identifying the row type and the column type respectively.

    This type concept should not be confused with Scala types in general. Row and column types are typically just some suitable (case) objects defined by the client code.

    Cell types are used in TableReader to determine which actual subtype of Cell should be used for each position in the table.

    RT

    the type of all row types used by the client code.

    CT

    the type of all column types used by the client code.

    rowType

    instance of row type defined by client code.

    colType

    instance of column type defined by client code.
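
    As a minimal sketch of the idea, the row and column types below are plain case objects and a cell type is simply their pair. RowKind, ColKind, SketchCellType and describe are hypothetical names made up for this illustration. SketchCellType is a local stand-in that only mirrors the shape of CellType; it is not the library class.

```scala
// Hypothetical client-side row and column types; the names are illustrative only.
sealed trait RowKind
case object HeaderRow extends RowKind
case object DataRow extends RowKind

sealed trait ColKind
case object NameCol extends ColKind
case object PriceCol extends ColKind

// Local stand-in mirroring the shape of CellType[+RT, +CT]; not the library class.
final case class SketchCellType[+RT, +CT](rowType: RT, colType: CT)

object CellTypeSketch {
  // Decide how a raw cell should be treated based on its (row type, column type) pair.
  def describe(cellType: SketchCellType[RowKind, ColKind]): String = cellType match {
    case SketchCellType(DataRow, PriceCol) => "parse as number"
    case SketchCellType(HeaderRow, _)      => "keep as column label"
    case _                                 => "keep as plain string"
  }
}
```

    Because the pair is an ordinary case class, a partial function over it can single out exactly the positions that need special handling and leave every other cell untouched.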

  2. final class DetectingTableReader[RT, CT, M <: LocaleTableMetadata[M]] extends AnyRef

    This CSV format detection heuristic tries to read the input CSV with varying parameters until a Table is produced with no errors or all combinations are exhausted. In the latter case the Table with least errors is returned.

    For simpler usage you can skip initialMetadata in the constructor by using the apply method defined in the companion object.

    The interface of this class is similar to the one in TableReader, but this class creates multiple TableReader instances under the hood.
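
    The search strategy described above can be sketched without the library. The code below is an illustration under stated assumptions, not the implementation of DetectingTableReader: Attempt, detect and fieldCountErrors are hypothetical names, and a read attempt is reduced to a function that returns an error count.

```scala
object DetectionSketch {
  // Hypothetical outcome of one read attempt: the parameters tried and the error count.
  final case class Attempt[P](params: P, errorCount: Int)

  // Try candidates lazily; stop at the first error-free attempt, otherwise
  // fall back to the attempt with the fewest errors. Returns None if there are no candidates.
  def detect[P](candidates: Seq[P])(read: P => Int): Option[Attempt[P]] = {
    val attempts = candidates.iterator.map(p => Attempt(p, read(p)))
    var best: Option[Attempt[P]] = None
    while (attempts.hasNext && !best.exists(_.errorCount == 0)) {
      val attempt = attempts.next()
      // Keep the earlier attempt on ties.
      if (best.forall(_.errorCount > attempt.errorCount)) best = Some(attempt)
    }
    best
  }

  // Usage: pick the separator that splits every sample line into three fields.
  val sampleLines = Seq("data;300D;1,234.0", "data;SLS AMG;234,567.89")

  def fieldCountErrors(separator: Char): Int =
    sampleLines.count(_.split(separator).length != 3)

  val detected: Option[Attempt[Char]] = detect(Seq(',', ';'))(fieldCountErrors)
  // detected.map(_.params) is Some(';')
}
```

    A strict error criterion matters here: the more input a wrong parameter combination rejects, the more reliably the correct combination stands out.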

    Example on detection of CSV format

    This example detects and parses an unusual CSV format in which the separator is a semicolon, as used in the Finnish locale among others, but numeric data is formatted in English style. The column types are defined on the first row and the row type is defined by the first column.

    import fi.pelam.csv.table._
    import fi.pelam.csv.cell._
    import TableReaderConfig._
    
    val validColTypes = Set("header", "model", "price")
    
    // Setup a DetectingTableReader which will try combinations of CSV formatting types
    // to understand the data.
    val reader = DetectingTableReader[String, String](
    
      tableReaderMaker = { (metadata) => new TableReader(
    
        // An implicit from the object TableReaderConfig converts the string
        // to a function providing streams.
        openStream =
          "header;model;price\n" +
          "data;300D;1,234.0\n" +
          "data;SLS AMG;234,567.89",
    
        // Make correct metadata end up in the final Table
        tableMetadata = metadata,
    
        // First column specifies row types
        rowTyper = makeRowTyper({
          case (CellKey(_, 0), rowType) => rowType
        }),
    
        // Column type is specified by the first row.
        // Type names are checked and error is generated for unknown
        // column types by errorOnUndefinedCol.
        // This strictness is what enables the correct detection of CSV format.
        colTyper = errorOnUndefinedCol(makeColTyper({
          case (CellKey(0, _), colType) if validColTypes.contains(colType) => colType
        })),
    
        cellUpgrader = makeCellUpgrader({
          case CellType("data", "price") => DoubleCell.parserForLocale(metadata.dataLocale)
        }))
      }
    )
    
    val table = reader.readOrThrow()
    
    // Get values from cells in column with type "model" on rows with type "data".
    table.getSingleCol("data", "model").map(_.value).toList
    // Will give List("300D", "SLS AMG")
    
    // Get values from cells in column with type "price" on rows with type "data".
    table.getSingleCol("data", "price").map(_.value).toList
    // Will give List(1234.0, 234567.89)

    RT

    The client specific row type.

    CT

    The client specific column type.

    M

    The type of the metadata parameter. Must be a subtype of LocaleTableMetadata. This is used to manage the character set, separator, cellTypeLocale and dataLocale combinations when attempting to read the CSV data from the input stream.

  3. case class LocaleMetadata(dataLocale: Locale = Locale.ROOT, cellTypeLocale: Locale = Locale.ROOT, charset: Charset = CsvConstants.defaultCharset, separator: Char = CsvConstants.defaultSeparatorChar) extends LocaleTableMetadata[LocaleMetadata] with Product with Serializable

    dataLocale
    cellTypeLocale

  4. trait LocaleTableMetadata[T <: LocaleTableMetadata[T]] extends TableMetadata

  5. final case class SimpleMetadata(charset: Charset = CsvConstants.defaultCharset, separator: Char = CsvConstants.defaultSeparatorChar) extends TableMetadata with Product with Serializable

    Simplest implementation of TableMetadata.

  6. final case class Table[RT, CT, M <: TableMetadata](cells: IndexedSeq[IndexedSeq[Cell]], rowTypes: SortedBiMap[RowKey, RT], colTypes: SortedBiMap[ColKey, CT], metadata: M) extends Product with Serializable

    This class is an immutable container for Cells with optional row and column types. The ideas in the API roughly follow popular spreadsheet programs.

    The cells are stored in rows, which are numbered conceptually from top to bottom, and then in columns, which are numbered from left to right.

    The row and column types are an additional abstraction with the purpose of simplifying machine reading of complex spreadsheets.

    This class is part of the higher-level API for reading, writing and processing CSV data.

    The simpler stream-based API is enough for many scenarios, but if several different sets of data will be pulled from the same CSV file and the structure of the CSV file is not rigid, this API may be a better fit.

    Several methods are provided for getting cells based on the row and column types.

    Example

    This example constructs a table directly, although usually it is done via a TableReader. In this example, simple String values are used for row and column types, although usually an enumeration or case object type solution is cleaner and safer.

    val table = Table(
      List(StringCell(CellKey(0, 0), "name"),
        StringCell(CellKey(0, 1), "value"),
        StringCell(CellKey(1, 0), "foo"),
        IntegerCell(CellKey(1, 1), 1),
        StringCell(CellKey(2, 0), "bar"),
        IntegerCell(CellKey(2, 1), 2)
      ),
    
      SortedBiMap(RowKey(0) -> "header",
        RowKey(1) -> "data",
        RowKey(2) -> "data"),
    
      SortedBiMap(ColKey(0) -> "name",
        ColKey(1) -> "number")
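    )
    
    // Get values from cells in column with type "name" on rows with type "data".
    table.getSingleCol("data", "name").map(_.value).toList
    // Will give List("foo", "bar")
    
    // Get values from cells in column with type "number" on rows with type "data".
    table.getSingleCol("data", "number").map(_.value).toList
    // Will give List(1, 2)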

    Note on row and column numbers

    Internally rows and columns have zero-based index numbers, but in some cases, such as the toString methods of Cell and CellKey, the index numbers are represented similarly to popular spreadsheet programs. In that case row numbers are one-based and column numbers are alphabetic.
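
    To make the two numbering schemes concrete, here is a small sketch of the usual spreadsheet convention. colLabel and cellLabel are hypothetical helpers written for this illustration, and the exact strings produced by the toString methods of Cell and CellKey may differ.

```scala
object SpreadsheetLabelSketch {
  // Convert a zero-based column index to letters: 0 -> "A", 25 -> "Z", 26 -> "AA".
  def colLabel(colIndex: Int): String = {
    require(colIndex >= 0, "column index must be non-negative")
    @annotation.tailrec
    def loop(n: Int, acc: String): String =
      if (n < 0) acc
      else loop(n / 26 - 1, ('A' + n % 26).toChar.toString + acc)
    loop(colIndex, "")
  }

  // Combine the letters with a one-based row number, e.g. (0, 0) -> "A1".
  def cellLabel(rowIndex: Int, colIndex: Int): String =
    s"${colLabel(colIndex)}${rowIndex + 1}"
}
```

    With this convention the zero-based position (row 1, column 1) is shown as "B2".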

    RT

    The client specific row type.

    CT

    The client specific column type.

    M

    The type of the metadata parameter. Must be a subtype of TableMetadata. This specifies the character set and separator to use when reading the CSV data from the input stream.

    cells

    All cells in a structure of nested IndexedSeqs. The order is first rows, then columns.

    rowTypes

    A bidirectional map mapping rows to their row types and vice versa. Multiple rows can have the same type.

    colTypes

    A bidirectional map mapping columns to their column types and vice versa. Multiple columns can have the same type.

    metadata

    User extensible metadata that is piggybacked in the Table instance.

  7. trait TableMetadata extends AnyRef

    Base class for metadata attached to Table.

    The idea is that client code can extend this trait and piggyback whatever extra data it needs onto Table instances.

    One example is the details of the CSV format used. They are convenient to keep with the Table data in case the user needs to save a modified version of the original CSV file from which the data was read.

    Another use for this metadata mechanism is during the process of autodetecting details of the CSV format by DetectingTableReader.

    This trait has two values that TableReader can use directly.

    For more complex format detection heuristics, this can be inherited and extended with values that a more custom detection algorithm then tries to detect.
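
    The piggybacking idea can be sketched as follows. SketchTableMetadata is a local stand-in for this trait, assumed here to expose the charset and separator values seen in SimpleMetadata. SourceFileMetadata, its sourceFile field and the UTF-8 and comma defaults are hypothetical choices for this illustration.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Local stand-in for the TableMetadata trait; assumed to expose charset and separator.
trait SketchTableMetadata {
  def charset: Charset
  def separator: Char
}

// Hypothetical client metadata that carries the source file name along with the table.
final case class SourceFileMetadata(
  charset: Charset = StandardCharsets.UTF_8, // assumed default, not taken from CsvConstants
  separator: Char = ',',                     // assumed default, not taken from CsvConstants
  sourceFile: String = "") extends SketchTableMetadata
```

    Keeping the metadata a case class means a modified copy, for example with a different separator for saving, can be made with copy while the original stays unchanged.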

  8. case class TableProjection[RT, CT, M <: TableMetadata](baseTable: Table[RT, CT, M], rows: SortedSet[RowKey] = SortedSet(), cols: SortedSet[ColKey] = SortedSet()) extends Product with Serializable

    Part of the API to "project" a Table. The idea is to pick rows and columns in a fluent and immutable way, and then get a table with just the selected rows and columns. This is useful, for example, for displaying or logging only certain data.

    Example:

    import TableProjection._ // Import implicit toTable and toProjection
    
    val table: Table = ...
    println(table.withColTypes(Name, Price).withRowTypes(Item))
    
    // The inverse may also be useful for removing some data
    
    println(table.withColTypes(Comments).inverse)

  9. class TableReader[RT, CT, M <: TableMetadata] extends AnyRef

    This class is part of the higher-level API for reading, writing and processing CSV data.

    The simpler stream-based API is enough for many scenarios, but if several different sets of data will be pulled from the same CSV file and the structure of the CSV file is not rigid, this API may be a better fit.

    The result of calling the read method on this class is an instance of the Table class. The table is an immutable data structure for holding and processing data in a spreadsheet-like format.

    TableWriter is the counterpart of this class for writing a Table out, for example to disk.

    Code Example

    This example parses a small bit of CSV data in which column types are defined on the first row.

    import fi.pelam.csv.table._
    import fi.pelam.csv.cell._
    import TableReaderConfig._
    
    // Create a TableReader that parses a small bit of CSV data in which the
    // column types are defined on the first row.
    val reader = new TableReader[String, String, SimpleMetadata](
    
      // An implicit from the object TableReaderConfig converts the string
      // to a function providing streams.
      openStream =
        "product,price,number\n" +
        "apple,0.99,3\n" +
        "orange,1.25,2\n" +
        "banana,0.80,4\n",
    
      // The first row is the header, the rest are data.
      rowTyper = makeRowTyper({
        case (CellKey(0, _), _) => "header"
        case _ => "data"
      }),
    
      // First row defines column types.
      colTyper = makeColTyper({
        case (CellKey(0, _), colType) => colType
      }),
    
      // Convert cells on the "data" rows in the "number" column to integer cells.
      // Convert cells on the "data" rows in the "price" column to decimal cells.
      cellUpgrader = makeCellUpgrader({
        case CellType("data", "number") => IntegerCell.defaultParser
        case CellType("data", "price") => DoubleCell.defaultParser
      }))
    
    // Run the reader to obtain the Table used below.
    val table = reader.readOrThrow()
    
    // Get values from cells in column with type "product" on rows with type "data".
    table.getSingleCol("data", "product").map(_.value).toList
    // Will give List("apple", "orange", "banana")
    
    // Get values from cells in column with type "price" on rows with type "data".
    table.getSingleCol("data", "price").map(_.value).toList
    // Will give List(0.99, 1.25, 0.8)

    CSV format detection heuristics

    One simple detection heuristic is implemented in DetectingTableReader.

    Because it is impossible to deduce, without extra knowledge, whether correct parameters such as the character set were used in reading a CSV file, this class supports custom format detection algorithms implemented by client code.

    The table reading is split into stages so that a format detection heuristic can lock some variables during the earlier stages and then proceed to later stages. Unfortunately there is currently no example or implementation of this idea.

    Locking some variables and then proceeding results in a more efficient algorithm than an exhaustive search of the full set of combinations (character set, locale, separator etc.).

    The actual detection heuristic is handled outside this class. The idea is that the detection heuristic class uses this class repeatedly with varying parameters until some criterion is met. The criterion for ending detection could be that zero errors are detected. If no combination of parameters gives zero errors, the heuristic could pick the solution that reached the latest stage before producing errors and, among those, the one with the fewest errors.

    Stages

    RT

    The client specific row type.

    CT

    The client specific column type.

    M

    The type of the metadata parameter. Must be a subtype of TableMetadata. This specifies the character set and separator to use when reading the CSV data from the input stream.

    Note

    This is about the internal structure of TableReader processing.

    The table reading process may fail and terminate at any stage. In that case an incomplete Table object is returned together with the errors detected so far.

    The table reading is split into the following four stages to allow implementing format detection heuristics in a structured manner.

    • csvReadingStage Parse CSV byte data to cells. Depends on charset and separator provided via the metadata parameter.
    • rowTypeDetectionStage Detect row types (hard coded or based on cell contents). The rowTyper parameter is used in this stage.
    • colTypeDetectionStage Detect column types (hard coded or based on row types and cell contents). The colTyper parameter is used in this stage.
    • cellUpgradeStage Upgrade cells based on cell types, which are combined from row and column types. The cellUpgrader parameter is used in this stage.
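
    The staged processing listed above can be pictured with a small self-contained sketch. It is not the TableReader implementation: State, Stage and run are hypothetical names, and each stage is reduced to a function from state to state.

```scala
object StagePipelineSketch {
  // Minimal stand-in for the reading state: stage reached and errors found so far.
  final case class State(stageNumber: Int = 0, errors: Vector[String] = Vector.empty)

  type Stage = State => State

  // Run the stages in order; skip all remaining stages once one has reported errors.
  def run(stages: Seq[Stage], initial: State = State()): State =
    stages.foldLeft(initial) { (state, stage) =>
      if (state.errors.nonEmpty) state
      else stage(state).copy(stageNumber = state.stageNumber + 1)
    }
}
```

    If, for example, the third of four stages reports an error, the result has stage number 3 and the fourth stage never runs, which is what lets a detection heuristic compare how far each attempt got.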
  10. final case class TableReaderEvaluator[RT, CT, M <: TableMetadata] extends Product with Serializable

    This class models a process where several differently constructed TableReader instances are tried and the result from the one with least, preferably zero, errors is picked.

    RT

    The client specific row type.

    CT

    The client specific column type.

    M

    The type of the metadata parameter. Must be a subtype of TableMetadata. This specifies the character set and separator to use when reading the CSV data from the input stream.

  11. final case class TableReadingError(msg: String, cell: Option[Cell] = None) extends Product with Serializable

    Various phases in TableReader produce these when building a Table object from input fails. CellParsingErrors are converted to these errors in TableReader.

  12. final case class TableReadingErrors(stageNumber: Int = 0, errors: IndexedSeq[TableReadingError] = IndexedSeq()) extends Ordered[TableReadingErrors] with Product with Serializable

    Captures errors that happen inside TableReader. Instances are ordered by increasing success. This ordering is used in format detection heuristics to pick the solution that produces the best (least bad) results.

    stageNumber

    The number of stage reached in TableReader. Before any stages are run this is zero. After the first stage this is 1 etc.

    errors

    List of errors. All errors are from same stage, because TableReader stops after first stage that produces errors.
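
    One way to realize such an ordering is sketched below. It assumes that reaching a later stage counts as more successful and that, within the same stage, fewer errors count as more successful; the actual comparison in this class may differ. Errors, bySuccess and best are hypothetical names.

```scala
object ErrorOrderingSketch {
  // Stand-in for TableReadingErrors: stage reached plus the error messages from that stage.
  final case class Errors(stageNumber: Int = 0, messages: IndexedSeq[String] = IndexedSeq())

  // "Increasing success": a later stage ranks higher; within a stage, fewer errors rank higher.
  implicit val bySuccess: Ordering[Errors] =
    Ordering.by((e: Errors) => (e.stageNumber, -e.messages.size))

  // The best attempt is the maximum under this ordering (throws on an empty sequence).
  def best(attempts: Seq[Errors]): Errors = attempts.max
}
```

    Expressing the rule as an Ordering means a detection heuristic can simply take the maximum over all attempts instead of comparing stage numbers and error counts by hand.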

  13. final case class TableReadingState[RT, CT](cells: IndexedSeq[Cell] = IndexedSeq(), rowTypes: RowTypes[RT] = SortedBiMap[RowKey, RT](), colTypes: ColTypes[CT] = SortedBiMap[ColKey, CT](), errors: TableReadingErrors = TableReadingErrors()) extends StageResult[TableReadingState[RT, CT]] with Product with Serializable

    This is an internal class used to thread state through the stages in TableReader.

  14. final class TableWriter[RT, CT, M <: TableMetadata] extends AnyRef

    This class writes a Table as CSV to the given OutputStream.

    The stream is closed at the end. This can be used to write CSV files. The CSV format is taken from Table.metadata.

    The contents of each cell are formatted according to the cell's serializedString.

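    import java.io.ByteArrayOutputStream
    import fi.pelam.csv.cell._
    import fi.pelam.csv.table._
    // assertEquals is assumed to be the JUnit assertion.
    import org.junit.Assert.assertEquals
    
    // Build a one-row table with two string cells.
    val table = Table[String, String, SimpleMetadata](IndexedSeq(
      StringCell(CellKey(0,0), "foo"),
      StringCell(CellKey(0,1), "bar")))
    
    val writer = new TableWriter(table)
    
    val outputStream = new ByteArrayOutputStream()
    
    // Write the CSV data; the stream is closed at the end.
    writer.write(outputStream)
    
    // Decode the written bytes with the charset stored in the table metadata.
    val written = new String(outputStream.toByteArray(), table.metadata.charset)
    
    assertEquals("foo,bar\n", written)

    RT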

    Client specified object type used for typing rows in CSV data.

    CT

    Client specified object type used for typing columns in CSV data.

    M

    A user-customizable metadata type that can piggyback additional information on the table object.

Value Members

  1. object DetectingTableReader

  2. object Table extends Serializable

  3. object TableProjection extends Serializable

  4. object TableReader

    Contains type definitions for various types used in constructing a TableReader instance.

  5. object TableReaderConfig

    A set of functions that map various things used as parameters for fi.pelam.csv.table.TableReader.

    Idea is to allow various simpler ways of configuring the TableReader.

  6. object TableReadingError extends Serializable

  7. object TableReadingErrors extends Serializable

  8. object TableUtil

    Collection of helper methods for Table and TableProjection implementation.
