The "type" of a cell in a Table is a pair of objects that identify the row type and the column type, respectively.
This type concept should not be confused with general Scala types. Row and column types are typically just some suitable (case) objects defined by the client code.
Cell types are used in TableReader to decide which actual subtypes of Cell should be used for each position in the table.
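The idea of a cell type as a (row type, column type) pair can be sketched without the library. Everything below is hypothetical illustration: the case objects, the `CellTypePair` alias and the `parse` helper are invented names, not part of the CellType or TableReader API.

```scala
object CellTypeSketch {
  // Hypothetical client-defined row and column types (plain case objects).
  sealed trait RowType
  case object Header extends RowType
  case object Data extends RowType

  sealed trait ColType
  case object Name extends ColType
  case object Price extends ColType

  // A cell type is simply the pair (row type, column type).
  type CellTypePair = (RowType, ColType)

  // Maps selected cell types to a parser for the raw cell string.
  val parsers: PartialFunction[CellTypePair, String => Any] = {
    case (Data, Price) => (raw: String) => raw.toDouble
  }

  // Cell types without a parser keep the raw string unchanged.
  def parse(cellType: CellTypePair, raw: String): Any =
    parsers.lift(cellType).map(parser => parser(raw)).getOrElse(raw)
}
```

Using case objects instead of strings lets the compiler catch misspelled type names, which is presumably why they are the typical choice.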
The type of all row types used by the client code.
The type of all column types used by the client code.
Instance of the row type defined by client code.
Instance of the column type defined by client code.
This CSV format detection heuristic tries to read the input CSV with varying parameters until a Table is produced with no errors or all combinations are exhausted. In the latter case, the Table with the fewest errors is returned.
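The search described above can be sketched roughly as follows. This is a minimal, library-independent illustration: `Params`, `Attempt`, `candidates` and `detect` are hypothetical names, and the candidate list is an assumption, not the set of combinations DetectingTableReader actually tries.

```scala
import java.nio.charset.{Charset, StandardCharsets}

object DetectionLoopSketch {
  final case class Params(charset: Charset, separator: Char)
  final case class Attempt(params: Params, errorCount: Int)

  // Hypothetical candidate combinations; a real reader would also vary locales.
  val candidates: Seq[Params] = for {
    charset <- Seq(StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1)
    separator <- Seq(',', ';')
  } yield Params(charset, separator)

  // Tries candidates one at a time and stops at the first error-free attempt.
  // If no attempt is error free, the attempt with the fewest errors is returned.
  def detect(read: Params => Attempt): Option[Attempt] = {
    val attempts = candidates.iterator.map(read)
    var best: Option[Attempt] = None
    while (attempts.hasNext && !best.exists(_.errorCount == 0)) {
      val next = attempts.next()
      if (best.forall(_.errorCount > next.errorCount)) best = Some(next)
    }
    best
  }
}
```

Because the iterator is consumed lazily, no further reads are attempted once an error-free result is found.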
For simpler usage, you can skip initialMetadata in the constructor by using the apply method defined in the companion object.
The interface of this class is similar to the one in TableReader, but this class creates multiple TableReader instances under the hood.
This example detects and parses an unusual CSV format in which the separator is the one used in at least the Finnish locale, but numeric data is formatted in English style. The column types are defined on the first row and the row type is defined by the first column.
import fi.pelam.csv.table._
import fi.pelam.csv.cell._
import TableReaderConfig._

val validColTypes = Set("header", "model", "price")

// Setup a DetectingTableReader which will try combinations of CSV formatting types
// to understand the data.
val reader = DetectingTableReader[String, String](
  tableReaderMaker = { (metadata) => new TableReader(

    // An implicit from the object TableReaderConfig converts the string
    // to a function providing streams.
    openStream =
      "header;model;price\n" +
      "data;300D;1,234.0\n" +
      "data;SLS AMG;234,567.89",

    // Make correct metadata end up in the final Table
    tableMetadata = metadata,

    // First column specifies row types
    rowTyper = makeRowTyper({
      case (CellKey(_, 0), rowType) => rowType
    }),

    // Column type is specified by the first row.
    // Type names are checked and an error is generated for unknown
    // column types by errorOnUndefinedCol.
    // This strictness is what enables the correct detection of CSV format.
    colTyper = errorOnUndefinedCol(makeColTyper({
      case (CellKey(0, _), colType) if validColTypes.contains(colType) => colType
    })),

    cellUpgrader = makeCellUpgrader({
      case CellType("data", "price") => DoubleCell.parserForLocale(metadata.dataLocale)
    }))
  }
)

val table = reader.readOrThrow()

// Get values from cells in column with type "model" on rows with type "data."
table.getSingleCol("data", "model").map(_.value).toList
// Will give List("300D", "SLS AMG")

// Get values from cells in column with type "price" on rows with type "data."
table.getSingleCol("data", "price").map(_.value).toList
// Will give List(1234, 234567.89)
The client specific row type.
The client specific column type.
The type of the metadata parameter. Must be a subtype of LocaleTableMetadata.
This is used to manage the character set, separator, cellTypeLocale and dataLocale combinations when attempting to read the CSV data from the input stream.
Simplest implementation of TableMetadata.
This class is an immutable container for Cells with optional row and column types. The ideas in the API roughly follow popular spreadsheet programs.
The cells are stored in rows, which are numbered conceptually from top to bottom, and then in columns, which are numbered from left to right.
The row and column types are an additional abstraction with the purpose of simplifying machine reading of complex spreadsheets.
This class is part of the higher-level API for reading, writing and processing CSV data.
The simpler stream-based API is enough for many scenarios, but if several different sets of data will be pulled from the same CSV file and the structure of the CSV file is not rigid, this API may be a better fit.
Several methods are provided for getting cells based on the row and column types. For example
This example constructs a table directly, although usually it is done via a TableReader. In this example, simple String values are used for row and column types, although usually an enumeration or case object type solution is cleaner and safer.
val table = Table(
  List(StringCell(CellKey(0, 0), "name"),
    StringCell(CellKey(0, 1), "value"),
    StringCell(CellKey(1, 0), "foo"),
    IntegerCell(CellKey(1, 1), 1),
    StringCell(CellKey(2, 0), "bar"),
    IntegerCell(CellKey(2, 1), 2)
  ),
  SortedBiMap(RowKey(0) -> "header",
    RowKey(1) -> "data",
    RowKey(2) -> "data"),
  SortedBiMap(ColKey(0) -> "name",
    ColKey(1) -> "number")
)

table.getSingleCol("name", "data").map(_.value).toList
// Will give List("foo","bar")

table.getSingleCol("number", "data").map(_.value).toList
// Will give List(1,2)
Internally, rows and columns have zero-based index numbers, but in some cases, such as the toString methods of Cell and CellKey, the index numbers are represented similarly to popular spreadsheet programs. In that case, row numbers are one-based and column numbers are alphabetic.
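As an illustration of the spreadsheet-style numbering, the following sketch converts zero-based indices to one-based row numbers and alphabetic column names. The helper names are hypothetical, and two details are assumptions rather than documented behavior: that columns past "Z" continue as "AA", "AB" and so on, as in common spreadsheet programs, and that a combined name puts the column letters first. The exact text produced by the library's toString methods is not shown here.

```scala
object SpreadsheetIndexSketch {
  // Converts a zero-based column index to letters: 0 -> "A", 25 -> "Z", 26 -> "AA".
  def colName(index: Int): String = {
    require(index >= 0, "column index must be non-negative")
    @scala.annotation.tailrec
    def loop(n: Int, acc: String): String = {
      val letter = ('A' + n % 26).toChar
      if (n < 26) letter.toString + acc
      else loop(n / 26 - 1, letter.toString + acc)
    }
    loop(index, "")
  }

  // Converts a zero-based row index to a one-based row number.
  def rowName(index: Int): String = (index + 1).toString

  // Assumed spreadsheet-style cell name, e.g. row 2, column 1 -> "B3".
  def cellName(rowIndex: Int, colIndex: Int): String =
    colName(colIndex) + rowName(rowIndex)
}
```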
The client specific row type.
The client specific column type.
The type of the metadata parameter. Must be a subtype of TableMetadata.
This specifies the character set and separator to use when reading the CSV data from the input stream.
All cells in a structure of nested IndexedSeqs. The order is first rows, then columns.
A bidirectional map mapping rows to their row types and vice versa. Multiple rows can have the same type.
A bidirectional map mapping columns to their column types and vice versa. Multiple columns can have the same type.
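The two directions of such a map can be pictured with plain Scala collections. This is only a conceptual sketch using hypothetical values; it is not how the library's SortedBiMap is implemented.

```scala
object BiMapSketch {
  // Forward direction: zero-based row index -> row type.
  val rowTypes: Map[Int, String] = Map(0 -> "header", 1 -> "data", 2 -> "data")

  // Reverse direction: row type -> sorted indices of the rows having that type.
  // Several rows may share a type, so each type maps to a collection of rows.
  val rowsByType: Map[String, Vector[Int]] =
    rowTypes.toVector
      .groupBy { case (_, rowType) => rowType }
      .map { case (rowType, pairs) => rowType -> pairs.map(_._1).sorted }
}
```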
User-extensible metadata that is piggybacked on the Table instance.
Base class for metadata attached to Table.
The idea is that client code can extend this trait and piggyback arbitrary extra data on Table instances.
One example is the details of the CSV format used. They are convenient to keep with the Table data in case the user needs to save a modified version of the original CSV file from which the data was read.
Another use for this metadata mechanism is during the process of autodetecting details of the CSV format by DetectingTableReader.
This trait has two values that TableReader can use directly.
For more complex format detection heuristics, this can be inherited and extended with values that a more custom detection algorithm then tries to detect.
Part of the API to "project" a Table. The idea is to pick rows and columns in a fluent and immutable way, and then get a table with just the selected rows and columns. This is useful, for example, for displaying or logging only certain data.
Example:
import TableProjection._ // Import implicit toTable and toProjection

val table: Table = ...

println(table.withColTypes(Name, Price).withRowTypes(Item))

// The inverse may also be useful for removing some data
println(table.withColTypes(Comments).inverse)
This class is part of the higher-level API for reading, writing and processing CSV data.
The simpler stream-based API is enough for many scenarios, but if several different sets of data will be pulled from the same CSV file and the structure of the CSV file is not rigid, this API may be a better fit.
The result of calling the read method on this class is an instance of the Table class.
The table is an immutable data structure for holding and processing data in a spreadsheet-like format.
TableWriter is the counterpart of this class. It writes a Table out, for example to disk.
This example parses a small bit of CSV data in which column types are defined on the first row.
import fi.pelam.csv.table._
import fi.pelam.csv.cell._
import TableReaderConfig._

// Create a TableReader that parses a small bit of CSV data in which the
// column types are defined on the first row.
val reader = new TableReader[String, String, SimpleMetadata](

  // An implicit from the object TableReaderConfig converts the string
  // to a function providing streams.
  openStream =
    "product,price,number\n" +
    "apple,0.99,3\n" +
    "orange,1.25,2\n" +
    "banana,0.80,4\n",

  // The first row is the header, the rest are data.
  rowTyper = makeRowTyper({
    case (CellKey(0, _), _) => "header"
    case _ => "data"
  }),

  // First row defines column types.
  colTyper = makeColTyper({
    case (CellKey(0, _), colType) => colType
  }),

  // Convert cells on the "data" rows in the "number" column to integer cells.
  // Convert cells on the "data" rows in the "price" column to decimal cells.
  cellUpgrader = makeCellUpgrader({
    case CellType("data", "number") => IntegerCell.defaultParser
    case CellType("data", "price") => DoubleCell.defaultParser
  }))

// Read the table, as in the DetectingTableReader example.
val table = reader.readOrThrow()

// Get values from cells in column with type "product" on rows with type "data."
table.getSingleCol("data", "product").map(_.value).toList
// Will give List("apple", "orange", "banana")

// Get values from cells in column with type "price" on rows with type "data."
table.getSingleCol("data", "price").map(_.value).toList
// Will give List(0.99, 1.25, 0.8)
One simple detection heuristic is implemented in DetectingTableReader.
Since it is impossible to deduce, without any extra knowledge, whether correct parameters such as the character set were used in reading a CSV file, this class supports implementing a custom format detection algorithm in client code.
The table reading is split into stages to allow implementing format detection heuristics that lock some variables during the earlier stages and then proceed to later stages. Unfortunately, there is currently no example or implementation of this idea.
Locking some variables and then proceeding results in a more efficient algorithm than an exhaustive search of the full set of combinations (character set, locale, separator etc).
The actual detection heuristic is handled outside this class. The idea is that the detection heuristic class uses this class repeatedly with varying parameters until some criterion is met. The criterion for ending detection could be that zero errors are detected. If no combination of parameters gives zero errors, the heuristic could pick the solution whose errors occurred in the latest stage and, among those, the one with the fewest errors.
The client specific row type.
The client specific column type.
The type of the metadata parameter. Must be a subtype of TableMetadata.
This specifies the character set and separator to use when reading the CSV data from the input stream.
This is about the internal structure of TableReader processing.
The table reading is split into four stages.
The table reading process may fail and terminate at any stage. In that case, an incomplete Table object is returned together with the errors detected so far.
The table reading is split into stages to allow implementing format detection heuristics in a structured manner.
csvReadingStage
Parse CSV byte data to cells. Depends on charset and separator provided via the metadata parameter.
rowTypeDetectionStage
Detect row types (hard coded or based on cell contents). The rowTyper parameter is used in this stage.
colTypeDetectionStage
Detect column types (hard coded or based on row types and cell contents). The colTyper parameter is used in this stage.
cellUpgradeStage
Upgrade cells based on cell types, which are combined from row and column types. The cellUpgrader parameter is used in this stage.
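The staged structure, where reading stops after the first stage that reports errors, can be sketched as a fold over stage functions. This is a simplified, library-independent illustration: `State`, `Stage`, `csvReading`, `checkWidths` and `run` are hypothetical names, and the two stages shown are stand-ins, not the four TableReader stages.

```scala
object StagedReadSketch {
  // State threaded through the stages: rows so far, stages completed, errors found.
  final case class State(
      rows: Vector[Vector[String]],
      stage: Int = 0,
      errors: Vector[String] = Vector.empty)

  type Stage = State => State

  // Hypothetical first stage: split raw text into cells using the separator.
  def csvReading(input: String, separator: Char): Stage = state =>
    state.copy(rows = input.split('\n').toVector.map(_.split(separator).toVector))

  // Hypothetical later stage: every row must have as many cells as the first row.
  val checkWidths: Stage = state => {
    val width = state.rows.headOption.map(_.size).getOrElse(0)
    val found = state.rows.zipWithIndex.collect {
      case (row, i) if row.size != width =>
        s"Row ${i + 1} has ${row.size} cells, expected $width"
    }
    state.copy(errors = found)
  }

  // Runs the stages in order; stages after the first failing one are skipped.
  def run(stages: Seq[Stage]): State =
    stages.foldLeft(State(Vector.empty)) { (state, next) =>
      if (state.errors.nonEmpty) state
      else next(state).copy(stage = state.stage + 1)
    }
}
```

The returned state carries both the partial result and the number of the stage reached, which is the information a detection heuristic needs to compare attempts.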
This class models a process where several differently constructed TableReader instances are tried and the result from the one with the fewest, preferably zero, errors is picked.
The client specific row type.
The client specific column type.
The type of the metadata parameter. Must be a subtype of TableMetadata.
This specifies the character set and separator to use when reading the CSV data from the input stream.
Various phases in TableReader produce these when building a Table object from input fails. CellParsingErrors are converted to these errors in TableReader.
Captures errors that happen inside TableReader. Instances of this class are ordered by increasing success. This ordering is used in format detection heuristics to pick the solution that produces the best (least bad) results.
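One way to express an "increasing success" ordering is sketched below, assuming that reaching a later stage counts as more successful and that, within the same stage, fewer errors is better. `ReadErrors`, `increasingSuccess` and `best` are hypothetical names, and the library's actual comparison may differ in detail.

```scala
object ErrorOrderingSketch {
  // Hypothetical stand-in: the stage reached and the errors from that stage.
  final case class ReadErrors(stageNumber: Int, errors: List[String])

  // Later stage is better; within a stage, fewer errors is better.
  implicit val increasingSuccess: Ordering[ReadErrors] =
    Ordering.by((e: ReadErrors) => (e.stageNumber, -e.errors.size))

  // The most successful attempt is the maximum under this ordering.
  def best(results: Seq[ReadErrors]): Option[ReadErrors] =
    if (results.isEmpty) None else Some(results.max)
}
```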
The number of the stage reached in TableReader. Before any stages are run, this is zero. After the first stage, this is 1, and so on.
List of errors. All errors are from the same stage, because TableReader stops after the first stage that produces errors.
This is an internal class used to thread state through the stages in TableReader.
This class writes a Table as CSV to the given OutputStream.
The stream is closed at the end. This can be used to write CSV files. The CSV format is taken from Table.metadata.
The contents of each cell are formatted according to that cell's serializedString.
import java.io.ByteArrayOutputStream

val table = Table[String, String, SimpleMetadata](IndexedSeq(
  StringCell(CellKey(0,0), "foo"),
  StringCell(CellKey(0,1), "bar")))

val writer = new TableWriter(table)

val outputStream = new ByteArrayOutputStream()

writer.write(outputStream)

val written = new String(outputStream.toByteArray(), table.metadata.charset)

assertEquals("foo,bar\n", written)
Client specified object type used for typing rows in CSV data.
Client specified object type used for typing columns in CSV data.
A user-customizable metadata type that can piggyback additional information on the table object.
Contains type definitions for various types used in constructing a TableReader instance.
A set of functions that map various things used as parameters for fi.pelam.csv.table.TableReader.
The idea is to allow various simpler ways of configuring the TableReader.
Collection of helper methods for Table and TableProjection implementation.
This package contains the whole table-oriented API for processing CSV data.
See the reader and writer classes for more information.
TableWriter
TableReader