Partial function used in rowTypeDetectionStage
Partial function used in colTypeDetectionStage
Partial function used in cellUpgradeStage
The main method in this class. Can be called several times. The input stream is opened and closed once per call.
If there are no errors, TableReadingErrors.noErrors is true.
Returns a pair with a TableReader and TableReadingErrors.
This method extends the basic read method with exception-based error handling, which may be useful in smaller applications that don't expect or handle errors in input.
A RuntimeException will be thrown when an error is encountered.
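As a rough illustration of how exception-based handling can be layered on top of a pair-returning read, consider the minimal sketch below. SketchErrors and this readOrThrow helper are hypothetical stand-ins written for this example; they are not the library's actual types or implementation.

```scala
object ReadOrThrowSketch {
  // Hypothetical stand-in for TableReadingErrors; only noErrors is modelled.
  final case class SketchErrors(messages: List[String]) {
    def noErrors: Boolean = messages.isEmpty
  }

  // Wraps a pair-returning read: returns the result when there are no
  // errors, otherwise throws a RuntimeException listing the messages.
  def readOrThrow[A](read: () => (A, SketchErrors)): A = {
    val (result, errors) = read()
    if (errors.noErrors) result
    else throw new RuntimeException(errors.messages.mkString("; "))
  }
}
```

A caller that cannot recover from bad input can then use the result directly and let the exception propagate.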
This class is part of the higher level API for reading, writing and processing CSV data.
The simpler stream-based API is enough for many scenarios, but if several different sets of data will be pulled from the same CSV file and the structure of the CSV file is not rigid, this API may be a better fit.
The result of calling the read method on this class will be an instance of the Table class. The table is an immutable data structure for holding and processing data in a format like that of a spreadsheet program. TableWriter is the counterpart of this class, used for example for writing a Table out to disk.
Code Example
This example parses a small bit of CSV data in which column types are defined on the first row.
CSV format detection heuristics
One simple detection heuristic is implemented in DetectingTableReader.
Since deducing whether correct parameters like character set were used in reading a CSV file without any extra knowledge is impossible, this class supports implementing a custom format detection algorithm by client code.
The table reading is split into stages to allow implementing format detection heuristics that lock some variables during the earlier stages and then proceed to later stages. Unfortunately there is currently no example or implementation of this idea.
Locking some variables and then proceeding results in a more efficient algorithm than an exhaustive search of the full set of combinations (character set, locale, separator etc.).
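To make the efficiency claim concrete, here is a small worked count. It is only an illustration: the candidate lists are made up, and it assumes that the character set and separator can be locked once the first stage succeeds, while the locale only matters in a later stage.

```scala
object StagedSearchSketch {
  // Hypothetical candidate values; not taken from the library.
  val charsets: Seq[String] = Seq("UTF-8", "ISO-8859-1", "UTF-16")
  val separators: Seq[Char] = Seq(',', ';', '\t')
  val locales: Seq[String] = Seq("en", "fi", "de", "fr")

  // Exhaustive search reads the input once per full combination: 3 * 3 * 4 = 36.
  val exhaustiveAttempts: Int = charsets.size * separators.size * locales.size

  // Staged search first tries charset and separator pairs (3 * 3 = 9),
  // locks the best pair, and then varies only the locale (4 more): 13.
  val stagedAttempts: Int = charsets.size * separators.size + locales.size
}
```

The gap widens quickly as more variables or candidate values are added, since the exhaustive count is a product and the staged count is a sum of smaller products.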
The actual detection heuristic is handled outside this class. The idea is that the detection heuristic class uses this class repeatedly with varying parameters until some criterion is met. The criterion for ending detection could be that zero errors are detected. If no combination of parameters gives zero errors, then the heuristic could just pick the solution that got furthest, that is, the one whose errors occurred in the latest stage, and among those the one with the fewest errors.
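The selection rule described above can be sketched as follows. This is a hypothetical outline, not the code of DetectingTableReader: Attempt, detect and the tryRead callback are names invented for this example, and tryRead is assumed to report the last stage reached (a higher number means a later stage) together with the error count.

```scala
object DetectionSketch {
  // Hypothetical summary of one read attempt with the given parameters.
  final case class Attempt[P](params: P, stage: Int, errorCount: Int)

  // Tries candidates in order and stops at the first one with zero errors.
  // Otherwise keeps the attempt that failed in the latest stage, breaking
  // ties by the fewest errors. Returns None only for an empty candidate list.
  def detect[P](candidates: Seq[P])(tryRead: P => (Int, Int)): Option[Attempt[P]] = {
    @annotation.tailrec
    def loop(rest: List[P], best: Option[Attempt[P]]): Option[Attempt[P]] = rest match {
      case Nil => best
      case p :: tail =>
        val (stage, errors) = tryRead(p)
        val attempt = Attempt(p, stage, errors)
        if (errors == 0) Some(attempt)
        else {
          val better = best match {
            case Some(b) if b.stage > stage || (b.stage == stage && b.errorCount <= errors) => b
            case _ => attempt
          }
          loop(tail, Some(better))
        }
    }
    loop(candidates.toList, None)
  }
}
```

In a real heuristic, tryRead would construct a reader with the candidate parameters, call read, and derive the stage and error count from the returned errors object.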
Stages
The client-specific row type.
The client-specific column type.
The type of the metadata parameter. Must be a subtype of TableMetadata. This specifies the character set and separator to use when reading the CSV data from the input stream.
This is about the internal structure of TableReader processing.
The table reading is split into four stages.
The table reading process may fail and terminate at any stage. In that case an incomplete Table object will be returned together with the errors detected so far.
The table reading is split into stages to allow implementing format detection heuristics in a structured manner.
csvReadingStage
Parse CSV byte data to cells. Depends on charset and separator provided via the metadata parameter.
rowTypeDetectionStage
Detect row types (hard coded or based on cell contents). The rowTyper parameter is used in this stage.
colTypeDetectionStage
Detect column types (hard coded or based on row types and cell contents). The colTyper parameter is used in this stage.
cellUpgradeStage
Upgrade cells based on cell types, which are combined from row and column types. The cellUpgrader parameter is used in this stage.
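The four stages can be pictured with the simplified, self-contained sketch below. It does not use the library's real types or signatures: Cell, the shapes of rowTyper, colTyper and cellUpgrader, and the naive line splitting (no quoting support, no error reporting) are all assumptions made for illustration. It reads a small input whose column types are given on the first row.

```scala
object StagesSketch {
  // Hypothetical, simplified cell: position plus raw text.
  final case class Cell(row: Int, col: Int, text: String)

  // Stage 1: parse CSV text to cells using the given separator.
  def csvReadingStage(input: String, separator: Char): Vector[Cell] =
    for {
      (line, r) <- input.split('\n').toVector.zipWithIndex
      (text, c) <- line.split(separator).toVector.zipWithIndex
    } yield Cell(r, c, text)

  // Stage 2: row typer as a partial function; the first row is the header.
  val rowTyper: PartialFunction[Cell, String] = {
    case Cell(0, _, _) => "header"
    case _ => "data"
  }

  // Stage 3: column types are taken from the cell contents of the first row.
  def colTyper(cells: Vector[Cell]): Map[Int, String] =
    cells.collect { case Cell(0, c, text) => c -> text }.toMap

  // Stage 4: upgrade cells based on (row type, column type, raw text).
  val cellUpgrader: PartialFunction[(String, String, String), Any] = {
    case ("data", "number", text) => text.toInt
  }

  // Runs the stages in order; cells without a matching upgrade stay as text.
  def read(input: String, separator: Char): Vector[Any] = {
    val cells = csvReadingStage(input, separator)
    val rowTypes = cells.collect {
      case c if rowTyper.isDefinedAt(c) => c.row -> rowTyper(c)
    }.toMap
    val colTypes = colTyper(cells)
    cells.map { c =>
      val key = (rowTypes.getOrElse(c.row, ""), colTypes.getOrElse(c.col, ""), c.text)
      cellUpgrader.applyOrElse(key, (_: (String, String, String)) => c.text)
    }
  }
}
```

For the input "name,number\nfoo,1\nbar,2" with a comma separator, only the two data cells in the "number" column are upgraded to integers; every other cell keeps its text.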