
Working with CSV

CSV format

  • delimiter (default ,): value used to separate individual fields in the input.

  • quote (default "): value used for escaping values where the field delimiter is part of the value (e.g. the value " a , b " is parsed as a , b).

  • quoteEscape (default "): value used for escaping the quote character inside an already escaped value (e.g. the value " "" a , b "" " is parsed as " a , b ").

  • charToEscapeQuoteEscaping (default \0): value used for escaping the quote escape character, when quote and quote escape are different (e.g. the value "\\ \" a , b \" \\" is parsed as \ " a , b " \, if quote = ", quoteEscape = \ and charToEscapeQuoteEscaping = \).
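To make the roles of delimiter, quote and quoteEscape concrete, here is a minimal sketch of a line splitter in plain Java. It is only an illustration of the rules listed above, not the library's implementation; the class name CsvFormatSketch and its split method are hypothetical, and charToEscapeQuoteEscaping is left out for brevity.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: NOT how univocity-parsers is implemented.
class CsvFormatSketch {

    // Splits a single line into fields, honoring the quote and quoteEscape characters.
    static List<String> split(String line, char delimiter, char quote, char quoteEscape) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                boolean nextIsQuote = i + 1 < line.length() && line.charAt(i + 1) == quote;
                if (c == quoteEscape && nextIsQuote) {
                    field.append(quote); // escaped quote: emit one literal quote
                    i++;                 // and skip the quote that was escaped
                } else if (c == quote) {
                    inQuotes = false;    // closing quote
                } else {
                    field.append(c);     // delimiters inside quotes are plain text
                }
            } else if (c == quote && field.length() == 0) {
                inQuotes = true;         // opening quote at the start of a field
            } else if (c == delimiter) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        // The delimiter inside the quoted field is kept as part of the value.
        System.out.println(split("a,\"b,c\",d", ',', '"', '"')); // prints [a, b,c, d]
    }
}
```

With quote and quoteEscape both set to ", the input " "" a , b "" " comes back as a single field containing " a , b " (surrounded by one space on each side), which matches the quoteEscape example in the list above.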

CSV parser settings

Escaping quote escape characters

In CSV, quotes inside quoted values must be escaped. For example, the sequence \" represents the quote character inside a quoted value. But what if your quoted value ends with a backslash? In that case you need to escape the escape character itself. Consider the following input in escape.csv:

"You are \"beautiful\""
"Yes, \\\"in the inside\"\\"

To parse this properly, you need to set charToEscapeQuoteEscaping:

// quotes inside quoted values are escaped as \"
settings.getFormat().setQuoteEscape('\\');

// but if two backslashes are found before a quote symbol they represent a single backslash.
settings.getFormat().setCharToEscapeQuoteEscaping('\\');

This way the data will be correctly processed as:

[You are "beautiful"]    
[Yes, \"in the inside"\]

By default, if you define an escape character that is different from the character used for quoting values, the CharToEscapeQuoteEscaping will be the same as the escape character.
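The unescaping rule described in this section can be sketched in a few lines of plain Java. This is a simplified illustration under the settings used above (quote = ", quoteEscape = \ and charToEscapeQuoteEscaping = \), not the parser's actual code; QuoteEscapeSketch and unescape are hypothetical names, and the method only handles the text found between the enclosing quotes.

```java
// Illustrative sketch only: NOT how univocity-parsers is implemented.
class QuoteEscapeSketch {

    // Unescapes the content found between the opening and closing quotes of a value.
    static String unescape(String content, char quote, char quoteEscape, char escapeEscape) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            char next = i + 1 < content.length() ? content.charAt(i + 1) : '\0';
            if (c == escapeEscape && next == quoteEscape) {
                out.append(quoteEscape); // escaped escape character, e.g. \\ becomes \
                i++;
            } else if (c == quoteEscape && next == quote) {
                out.append(quote);       // escaped quote, e.g. \" becomes "
                i++;
            } else {
                out.append(c);           // any other character is copied as-is
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Second line of escape.csv, without the enclosing quotes.
        String content = "Yes, \\\\\\\"in the inside\\\"\\\\";
        System.out.println(unescape(content, '"', '\\', '\\'));
    }
}
```

Checking the escape-of-escape case before the escaped-quote case is what lets a value end with a backslash: the final \\ is consumed as a pair, so the backslash is never mistaken for an escape of the closing quote.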

Keeping escape sequences

If you want to simply “split” the input values in your CSV, and keep the escape sequences intact for further processing, you should turn on the keepEscapeSequences feature:

//Now we want to keep the escape sequences. We should see the backslash before the quotes.
settings.setKeepEscapeSequences(true);

CsvParser parser = new CsvParser(settings);

List<String[]> allRows = parser.parseAll(getReader("/examples/european.csv"));

The result will be:

1 [1997, Ford, E350, ac; abs; moon, 3000,00]
-----------------------
2 [1999, Chevy, Venture \"Extended Edition\", null, 4900,00]
-----------------------
3 [1996, Jeep, Grand Cherokee, MUST SELL!
air; moon roof; loaded, 4799,00]
-----------------------
4 [1999, Chevy, Venture \"Extended Edition; Very Large\", null, 5000,00]
-----------------------
5 [null, null, Venture \"Extended Edition\", null, 4900,00]
-----------------------
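As a rough illustration of what keeping escape sequences means, the sketch below splits one line and either unescapes quotes or copies the escape sequence verbatim. It is a simplified stand-in written in plain Java, not the library's code: KeepEscapesSketch is a hypothetical name, the sample line is an assumption (the contents of european.csv are not shown here), and empty fields come back as empty strings rather than null.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: NOT how univocity-parsers is implemented.
class KeepEscapesSketch {

    static List<String> split(String line, char delimiter, char quote,
                              char quoteEscape, boolean keepEscapeSequences) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            boolean nextIsQuote = i + 1 < line.length() && line.charAt(i + 1) == quote;
            if (inQuotes && c == quoteEscape && nextIsQuote) {
                if (keepEscapeSequences) {
                    field.append(c);     // keep the escape character itself
                }
                field.append(quote);     // the quote is always part of the value
                i++;
            } else if (c == quote) {
                inQuotes = !inQuotes;    // opening or closing quote is dropped
            } else if (c == delimiter && !inQuotes) {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        // Assumed sample line in the "European" style: ; as delimiter, \ as quote escape.
        String line = "1999;Chevy;\"Venture \\\"Extended Edition\\\"\";;4900,00";
        System.out.println(split(line, ';', '"', '\\', true));
        System.out.println(split(line, ';', '"', '\\', false));
    }
}
```

The only difference between the two calls is whether the backslash in front of each escaped quote survives in the third field.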

Format auto-detection

It’s not uncommon to have CSV files from different sources fed into your data processing pipeline, and these might come with different configurations. If you can’t know the line endings, column delimiters and quotation characters beforehand, give the built-in CSV format auto-detection a try, as it is likely to do the trick in most cases:

//turns on automatic detection of line separators, column separators, quotes & quote escapes
settings.detectFormatAutomatically();

CsvParser parser = new CsvParser(settings);

List<String[]> rows;
//First, the CSV we've been using to demonstrate all examples.
println("Data in /examples/example.csv:");
rows = parser.parseAll(getReader("/examples/example.csv"));
printRows(rows, false);

//Then, the same data but in European style (column separator is ; and decimals are separated by ,). We also escaped quotes with \ instead of using double quotes
println("\nData in /examples/european.csv:");
rows = parser.parseAll(getReader("/examples/european.csv"));
printRows(rows, false);

//Let's see the detected format:
println("\nFormat detected in /examples/european.csv:");
CsvFormat detectedFormat = parser.getDetectedFormat();
println(detectedFormat);

The code above produces:

Data in /examples/example.csv:
Printing 6 rows
Row 1 (length 5): [Year, Make, Model, Description, Price]
Row 2 (length 5): [1997, Ford, E350, ac, abs, moon, 3000.00]
Row 3 (length 5): [1999, Chevy, Venture "Extended Edition", null, 4900.00]
Row 4 (length 5): [1996, Jeep, Grand Cherokee, MUST SELL!
air, moon roof, loaded, 4799.00]
Row 5 (length 5): [1999, Chevy, Venture "Extended Edition, Very Large", null, 5000.00]
Row 6 (length 5): [null, null, Venture "Extended Edition", null, 4900.00]

Data in /examples/european.csv:
Printing 6 rows
Row 1 (length 5): [Year, Make, Model, Description, Price]
Row 2 (length 5): [1997, Ford, E350, ac; abs; moon, 3000,00]
Row 3 (length 5): [1999, Chevy, Venture "Extended Edition", null, 4900,00]
Row 4 (length 5): [1996, Jeep, Grand Cherokee, MUST SELL!
air; moon roof; loaded, 4799,00]
Row 5 (length 5): [1999, Chevy, Venture "Extended Edition; Very Large", null, 5000,00]
Row 6 (length 5): [null, null, Venture "Extended Edition", null, 4900,00]

Format detected in /examples/european.csv:
CsvFormat:
        Comment character=#
        Field delimiter=;
        Line separator (normalized)=\n
        Line separator sequence=\n
        Quote character="
        Quote escape character=\
        Quote escape escape character=null
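To give an intuition for how a column delimiter can be guessed from a sample of the input, here is a deliberately simple heuristic in plain Java: pick the candidate character that occurs the same non-zero number of times, outside quotes, on every sample line. This is an assumption-laden sketch and not the algorithm used by detectFormatAutomatically(); the names DelimiterDetectionSketch, countOutsideQuotes and detectDelimiter are hypothetical, and escaped quotes are not handled.

```java
// Illustrative heuristic only: NOT the detection algorithm used by univocity-parsers.
class DelimiterDetectionSketch {

    // Counts how often the candidate appears outside of quoted sections.
    static int countOutsideQuotes(String line, char candidate, char quote) {
        int count = 0;
        boolean inQuotes = false;
        for (char c : line.toCharArray()) {
            if (c == quote) {
                inQuotes = !inQuotes;
            } else if (c == candidate && !inQuotes) {
                count++;
            }
        }
        return count;
    }

    // Returns the candidate with the same non-zero count on every line, or '\0' if none qualifies.
    // When several candidates qualify, the one with the highest count wins.
    static char detectDelimiter(String[] lines, char[] candidates, char quote) {
        char best = '\0';
        int bestCount = 0;
        if (lines.length == 0) {
            return best;
        }
        for (char candidate : candidates) {
            int expected = countOutsideQuotes(lines[0], candidate, quote);
            boolean consistent = expected > 0;
            for (int i = 1; consistent && i < lines.length; i++) {
                consistent = countOutsideQuotes(lines[i], candidate, quote) == expected;
            }
            if (consistent && expected > bestCount) {
                best = candidate;
                bestCount = expected;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] sample = {
            "Year;Make;Model;Description;Price",
            "1997;Ford;E350;\"ac; abs; moon\";3000,00"
        };
        System.out.println(detectDelimiter(sample, new char[] {',', ';', '\t'}, '"')); // prints ;
    }
}
```

In the sample, the comma appears only in the decimal value of the second line, so its count is not consistent across lines, while the semicolon appears four times outside quotes on both lines.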

Exploring the various quote and escape handling settings

There are so many variations and corner cases when processing CSV that we decided to create an example with all of them.


Note: the parse method in the following example simply creates a new CsvParser with the given settings, parses the input line, and prints out the parsed String[].

To make it easier to read, the output of each call to parse() in the code below is shown as a comment on the line that follows it.

settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);
settings.getFormat().setLineSeparator("\r\n");

//let's quote values with single quotes
settings.getFormat().setQuote('\'');
settings.getFormat().setQuoteEscape('\'');

//Line separators are normalized by default. This means they are all converted to \n, including line separators found within quoted values.
parse("value 1,'line 1\r\nline 2',value 3", settings, "Normalizing line endings");
//result: [value 1, line 1\nline 2, value 3]

//You can disable this behavior to keep the original line separators in parsed values.
settings.setNormalizeLineEndingsWithinQuotes(false);
parse("value 1,'line 1\r\nline 2',value 3", settings, "Without normalized line endings");
//result: [value 1, line 1\r\nline 2, value 3]

//Values that contain a quote character, but are not enclosed within quotes, are read as-is
parse("value 1,I'm NOT a quoted value,value 3", settings, "Value with a quote, not enclosed");
//result: [value 1, I'm NOT a quoted value, value 3]

//But if your input comes with escaped quotes, and is not enclosed within quotes you'll get the escape sequence
parse("value 1,I''m NOT a quoted value,value 3", settings, "Value with quote, escaped, not enclosed");
//result: [value 1, I''m NOT a quoted value, value 3]

//Turn on escapeUnquotedValues to correctly unescape this sort of input
settings.setEscapeUnquotedValues(true);
parse("value 1,I''m NOT a quoted value,value 3", settings, "Value with quote, escaped, not enclosed, processing escape");
//result: [value 1, I'm NOT a quoted value, value 3]

//As usual, when you parse values that have escaped characters, such as the quote, you get the unescaped result.
parse("value 1,'I''m a quoted value',value 3", settings, "Enclosed value, quote escaped");
//result: [value 1, I'm a quoted value, value 3]

//But in some cases you might want to get the original text, character by character, including the original escape sequence
settings.setKeepEscapeSequences(true);
parse("value 1,'I''m a quoted value',value 3", settings, "Enclosed value, quote escaped, keeping escape sequences");
//result: [value 1, I''m a quoted value, value 3]

//By default, the parser handles broken quote escapes, so it won't complain about "I'm" not being escaped properly (should be "I''m").
parse("value 1,'I'm a broken quoted value',value 3", settings, "Enclosed value, broken quote escape");
//result: [value 1, I'm a broken quoted value, value 3]

//But you can disable this and get exceptions instead.
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.RAISE_ERROR);
try {
    parse("value 1,'Hey, I'm a broken quoted value',value 3", settings, "This will blow up");
} catch (TextParsingException exception) {
    //The exception will give you better details about what went wrong, and where.
    println("Quote escape error. Parser stopped after reading: [" + exception.getParsedContent() + "] of column " + exception.getColumnIndex());
}

For reference, the entire output produced by the code is:


Normalizing line endings:
    [value 1, line 1\nline 2, value 3]

Without normalized line endings:
    [value 1, line 1\r\nline 2, value 3]

Value with a quote, not enclosed:
    [value 1, I'm NOT a quoted value, value 3]

Value with quote, escaped, not enclosed:
    [value 1, I''m NOT a quoted value, value 3]

Value with quote, escaped, not enclosed, processing escape:
    [value 1, I'm NOT a quoted value, value 3]

Enclosed value, quote escaped:
    [value 1, I'm a quoted value, value 3]

Enclosed value, quote escaped, keeping escape sequences:
    [value 1, I''m a quoted value, value 3]

Enclosed value, broken quote escape:
    [value 1, I'm a broken quoted value, value 3]
Quote escape error. Parser stopped after reading: [Hey, I] of column 1
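The difference between tolerating a broken quote escape and raising an error can be sketched as follows. This plain Java example only mimics the two behaviours shown above for a value quoted with single quotes and escaped by doubling the quote; it is not the library's implementation, and UnescapedQuoteSketch, readQuoted and the raiseError flag are hypothetical names.

```java
// Illustrative sketch only: NOT how univocity-parsers is implemented.
class UnescapedQuoteSketch {

    // Reads the quoted value that starts at index 0 of the input (the opening quote).
    // A quote is treated as the closing quote only when followed by the delimiter or the end of input.
    static String readQuoted(String input, char delimiter, char quote, boolean raiseError) {
        StringBuilder value = new StringBuilder();
        for (int i = 1; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c != quote) {
                value.append(c);
                continue;
            }
            boolean atEnd = i + 1 == input.length();
            char next = atEnd ? '\0' : input.charAt(i + 1);
            if (atEnd || next == delimiter) {
                return value.toString();   // proper closing quote
            } else if (next == quote) {
                value.append(quote);       // doubled quote: escaped quote
                i++;
            } else if (raiseError) {
                throw new IllegalStateException(
                        "Unescaped quote found after reading: [" + value + "]");
            } else {
                value.append(quote);       // lenient mode: keep the quote as plain text
            }
        }
        return value.toString();           // input ended without a closing quote
    }

    public static void main(String[] args) {
        String broken = "'Hey, I'm a broken quoted value',value 3";
        System.out.println(readQuoted(broken, ',', '\'', false));
        try {
            readQuoted(broken, ',', '\'', true);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In lenient mode the stray quote in I'm is kept and reading continues up to the quote that precedes the delimiter; in strict mode the sketch stops at the stray quote and reports the content read so far, [Hey, I].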


Bugs, contributions & support

If you find a bug, please report it on GitHub or send us an email at parsers@univocity.com.

We try our best to eliminate all bugs as soon as possible, and you’ll rarely see a bug open for more than 24 hours after it’s reported. We do our best to answer all questions. Enhancements and suggestions are implemented on a best-effort basis.

Feel free to submit your contribution via pull requests. Any little bit is appreciated, from improvements to the documentation to a full-blown rewrite from scratch.

For commercial support, customizations or anything in between, please contact support@univocity.com.

Thank you for using our parsers!

The univocity team.