Working with TSV

Parsing TSV files

To parse TSV files, simply use a TsvParser. As we keep saying, the API is essentially the same for every parser.

This is the input:

# TSV's can also have comments
# Multi-line records are escaped with \n.
# Accepted escape sequences are: \n, \t, \r and \\

Year    Make    Model    Description    Price
1997    Ford    E350    ac, abs, moon    3000.00
1999    Chevy    Venture "Extended Edition"        4900.00

# Look     a multi line value. And blank rows around it!

1996    Jeep    Grand Cherokee    MUST SELL!\nair, moon roof, loaded    4799.00
1999    Chevy    Venture "Extended Edition, Very Large"        5000.00
        Venture "Extended Edition"        4900.00

This is the code:

TsvParserSettings settings = new TsvParserSettings();
settings.getFormat().setLineSeparator("\n");

// creates a TSV parser
TsvParser parser = new TsvParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(getReader("/examples/example.tsv"));

The output will be:

1 [Year, Make, Model, Description, Price]
-----------------------
2 [1997, Ford, E350, ac, abs, moon, 3000.00]
-----------------------
3 [1999, Chevy, Venture "Extended Edition", null, 4900.00]
-----------------------
4 [1996, Jeep, Grand Cherokee, MUST SELL!
air, moon roof, loaded, 4799.00]
-----------------------
5 [1999, Chevy, Venture "Extended Edition, Very Large", null, 5000.00]
-----------------------
6 [null, null, Venture "Extended Edition", null, 4900.00]
-----------------------

TSV format

The TSV format lets you set the escape character used for values that contain a line feed, carriage return, tab or backslash (written as \n, \r, \t and \\).

  • escapeChar (default \): value used to escape special characters in TSV.
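To make the escape sequences concrete, here is a minimal plain-Java sketch of how a single field value could be escaped and unescaped. It is an illustration only, not the library's implementation: the class name TsvEscapeSketch and its methods are hypothetical, and leaving unknown sequences untouched is an assumption of this sketch.

```java
/** Illustrative only: hypothetical helper, not part of univocity-parsers. */
final class TsvEscapeSketch {

    /** Encodes line feeds, tabs, carriage returns and the escape character itself. */
    static String escape(String value, char escapeChar) {
        StringBuilder out = new StringBuilder(value.length() + 8);
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            switch (ch) {
                case '\n': out.append(escapeChar).append('n'); break;
                case '\t': out.append(escapeChar).append('t'); break;
                case '\r': out.append(escapeChar).append('r'); break;
                default:
                    if (ch == escapeChar) {
                        out.append(escapeChar).append(escapeChar);
                    } else {
                        out.append(ch);
                    }
            }
        }
        return out.toString();
    }

    /** Decodes the sequences produced by escape(). */
    static String unescape(String field, char escapeChar) {
        StringBuilder out = new StringBuilder(field.length());
        for (int i = 0; i < field.length(); i++) {
            char ch = field.charAt(i);
            if (ch == escapeChar && i + 1 < field.length()) {
                char next = field.charAt(++i);
                switch (next) {
                    case 'n': out.append('\n'); break;
                    case 't': out.append('\t'); break;
                    case 'r': out.append('\r'); break;
                    default:
                        if (next == escapeChar) {
                            out.append(escapeChar);
                        } else {
                            // Assumption of this sketch: unknown sequences are kept as-is.
                            out.append(ch).append(next);
                        }
                }
            } else {
                out.append(ch);
            }
        }
        return out.toString();
    }
}
```

With the default escape character, calling unescape on the Description field of the Jeep record in the example input would turn the two characters \ and n back into a real line break, which is why that record prints across two lines in the output.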

TSV parser settings

Line joining

By default, the TsvParser assumes that values containing line breaks have their line endings escaped as a literal \ character followed by an n or r character. This way, all the data of a single record is represented on a single (and potentially long) line of text.

However, this is not always the case: some files actually “break” the contents across multiple lines, placing the escape character immediately before the line ending. To parse or write files in this style, enable the lineJoiningEnabled flag:

// Let's write 3 values to a TSV; one of them has a line break.
String[] values = new String[]{"Value 1", "Breaking [\n] here", "Value 3"};

TsvWriterSettings writerSettings = new TsvWriterSettings();
writerSettings.getFormat().setLineSeparator("\n");

// In TSV, we can have line separators escaped with a backslash before a line break. In this case the current
// line will be joined with the next line.
writerSettings.setLineJoiningEnabled(true);

// Let's write the values and see what the data looks like:
String writtenLine = new TsvWriter(writerSettings).writeRowToString(values);
println("Written data\n------------\n" + writtenLine);

// To parse, we just use the same configuration:
TsvParserSettings parserSettings = new TsvParserSettings();
parserSettings.setLineJoiningEnabled(true);
parserSettings.getFormat().setLineSeparator("\n");

TsvParser parser = new TsvParser(parserSettings);

//Let's parse the contents we've just written:
values = parser.parseLine(writtenLine);

println("\nParsed elements\n---------------");
println("First: " + values[0]);
println("Second: " + values[1]);
println("Third: " + values[2]);

The output will be:

Written data
------------
Value 1    Breaking [\
] here    Value 3

Parsed elements
---------------
First: Value 1
Second: Breaking [
] here
Third: Value 3
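The joining step itself can be pictured with a small plain-Java sketch. It shows the general idea only and is not how the library is implemented: the class LineJoiningSketch and its method names are hypothetical, and the sketch assumes "\n" as the line separator.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: hypothetical helper, not part of univocity-parsers. */
final class LineJoiningSketch {

    /**
     * Groups physical lines into logical records. A line that ends with an
     * unescaped escape character is joined with the line that follows it,
     * and the line break is kept as part of the value.
     */
    static List<String> joinLines(String input, char escapeChar) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : input.split("\n")) {
            if (endsWithUnescapedEscape(line, escapeChar)) {
                // Drop the trailing escape character and restore the line break.
                current.append(line, 0, line.length() - 1).append('\n');
            } else {
                current.append(line);
                records.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            // The input ended on a joined line: keep what was collected.
            records.add(current.toString());
        }
        return records;
    }

    /** An odd number of trailing escape characters means the last one is unescaped. */
    private static boolean endsWithUnescapedEscape(String line, char escapeChar) {
        int count = 0;
        for (int i = line.length() - 1; i >= 0 && line.charAt(i) == escapeChar; i--) {
            count++;
        }
        return count % 2 == 1;
    }
}
```

Applied to the written data above, the two physical lines collapse into one record, and splitting that record on the tab character yields the same three values the parser returned.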

Further Reading

Feel free to proceed to the following sections (in any order).

Bugs, contributions & support

If you find a bug, please report it on GitHub or send us an email at parsers@univocity.com.

We try our best to eliminate all bugs as soon as possible, and you’ll rarely see a bug open for more than 24 hours after it’s reported. We do our best to answer all questions. Enhancements and suggestions are implemented on a best-effort basis.

Feel free to submit your contribution via pull requests. Any little bit is appreciated, from improvements to the documentation to a full-blown rewrite from scratch.

For commercial support, customizations or anything in between, please contact support@univocity.com.

Thank you for using our parsers!

The univocity team.