Working with Fixed-Width

Parsing fixed-width files

All functionality available for the CSV file format is also available for the fixed-width format (and for any other parser we introduce in the future).

In the example.txt fixed-width file, we chose to fill the unwritten spaces with underscores (‘_’), so in the parser settings we set the padding to underscore:


YearMake_Model___________________________________Description_____________________________Price___
1997Ford_E350____________________________________ac, abs, moon___________________________3000.00_
1999ChevyVenture "Extended Edition"______________________________________________________4900.00_
1996Jeep_Grand Cherokee__________________________MUST SELL!
air, moon roof, loaded_______4799.00_
1999ChevyVenture "Extended Edition, Very Large"__________________________________________5000.00_
_________Venture "Extended Edition"______________________________________________________4900.00_

Use FixedWidthFields to define the length of each field in the input. With that information you can create the FixedWidthParserSettings, then instantiate and run the parser:

// creates the sequence of field lengths in the file to be parsed
FixedWidthFields lengths = new FixedWidthFields(4, 5, 40, 40, 8);

// creates the default settings for a fixed width parser
FixedWidthParserSettings settings = new FixedWidthParserSettings(lengths);

//sets the character used for padding unwritten spaces in the file
settings.getFormat().setPadding('_');
settings.getFormat().setLineSeparator("\n");

// creates a fixed-width parser with the given settings
FixedWidthParser parser = new FixedWidthParser(settings);

// parses all rows in one go.
List<String[]> allRows = parser.parseAll(getReader("/examples/example.txt"));

The output will be:

1 [Year, Make, Model, Description, Price]
-----------------------
2 [1997, Ford, E350, ac, abs, moon, 3000.00]
-----------------------
3 [1999, Chevy, Venture "Extended Edition", null, 4900.00]
-----------------------
4 [1996, Jeep, Grand Cherokee, MUST SELL!
air, moon roof, loaded, 4799.00]
-----------------------
5 [1999, Chevy, Venture "Extended Edition, Very Large", null, 5000.00]
-----------------------
6 [null, null, Venture "Extended Edition", null, 4900.00]
-----------------------
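
To make the role of the field lengths and the padding character concrete, here is a minimal sketch in plain Java that slices one line by hand. It is not how the library is implemented: the class FixedWidthSliceSketch and its slice, strip and pad helpers are hypothetical names used only for illustration.

```java
import java.util.Arrays;

// Illustrative only: a hand-rolled slicer, not part of the univocity API.
class FixedWidthSliceSketch {

    // Cuts `line` into consecutive fields of the given lengths and strips
    // the padding character from both ends of each field.
    static String[] slice(String line, int[] lengths, char padding) {
        String[] fields = new String[lengths.length];
        int start = 0;
        for (int i = 0; i < lengths.length; i++) {
            int from = Math.min(start, line.length());
            int to = Math.min(start + lengths[i], line.length());
            fields[i] = strip(line.substring(from, to), padding);
            start += lengths[i];
        }
        return fields;
    }

    // Removes leading and trailing padding; a field made only of padding becomes null.
    static String strip(String raw, char padding) {
        int from = 0;
        int to = raw.length();
        while (from < to && raw.charAt(from) == padding) from++;
        while (to > from && raw.charAt(to - 1) == padding) to--;
        return from == to ? null : raw.substring(from, to);
    }

    // Helper for building sample input: right-pads `value` up to `width`.
    static String pad(String value, int width, char padding) {
        StringBuilder sb = new StringBuilder(value);
        while (sb.length() < width) sb.append(padding);
        return sb.toString();
    }

    public static void main(String[] args) {
        int[] lengths = {4, 5, 40, 40, 8};
        // Rebuilds a line shaped like the second row of example.txt.
        String line = "1997" + pad("Ford", 5, '_') + pad("E350", 40, '_')
                + pad("ac, abs, moon", 40, '_') + pad("3000.00", 8, '_');
        System.out.println(Arrays.toString(slice(line, lengths, '_')));
    }
}
```

Run as is, this prints [1997, Ford, E350, ac, abs, moon, 3000.00]. A field that holds nothing but padding comes back as null, which mirrors the null values in the output above.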

All the rest is the same as with CSV parsers. You can use all RowProcessors for annotations, conversions, master-detail records and anything else we (or you) might introduce in the future.

We created a set of examples using fixed-width parsing in FixedWidthParserExamples.java.

Fixed width format

In addition to the default format definition, the fixed-width format contains:

  • padding (default ' ', i.e. a single space): value used for filling unwritten spaces. This default can be overridden by individual column paddings defined in FixedWidthFields.

Fixed-width settings

// For the sake of the example, we will not read the last 8 characters (for the Price column).
// We will also NOT set the padding character to '_' so the output makes more sense for reading
// and you can see what characters are being processed
FixedWidthParserSettings parserSettings = new FixedWidthParserSettings(new FixedWidthFields(4, 5, 40, 40 /*, 8*/));

//the file used in the example uses '\n' as the line separator sequence.
//the line separator sequence is defined here to ensure systems such as MacOS and Windows
//are able to process this file correctly (classic Mac OS uses '\r'; Windows uses '\r\n').
parserSettings.getFormat().setLineSeparator("\n");

// The fixed-width parser settings share most of the settings available for CSV.
// These are the only extra settings you need:

// If a row has more characters than what is defined, skip them until the end of the line.
parserSettings.setSkipTrailingCharsUntilNewline(true);

// If a record has fewer characters than expected and a new line is found,
// this record is considered parsed. Data in the next row will be parsed as a new record.
parserSettings.setRecordEndsOnNewline(true);

RowListProcessor rowProcessor = new RowListProcessor();

parserSettings.setProcessor(rowProcessor);
parserSettings.setHeaderExtractionEnabled(true);

FixedWidthParser parser = new FixedWidthParser(parserSettings);
parser.parse(getReader("/examples/example.txt"));

List<String[]> rows = rowProcessor.getRows();

The parser output with such configuration for parsing the example.txt file will be:

1 [1997, Ford_, E350____________________________________, ac, abs, moon___________________________]
-----------------------
2 [1999, Chevy, Venture "Extended Edition"______________, ________________________________________]
-----------------------
3 [1996, Jeep_, Grand Cherokee__________________________, MUST SELL!]
-----------------------
4 [air,, moon, roof, loaded_______4799.00_]
-----------------------
5 [1999, Chevy, Venture "Extended Edition, Very Large"__, ________________________________________]
-----------------------
6 [____, _____, Venture "Extended Edition"______________, ________________________________________]
-----------------------

Because recordEndsOnNewline is set to true, lines 3 and 4 are considered different records instead of a single, multi-line record. To clarify: in line 4, the value of the first column is ‘air,’, the second column has value ‘moon’, and the third is ‘roof, loaded_______4799.00_’.
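
The effect of these two settings on a single physical line can be sketched in a few lines of plain Java. This is only a simplified model under our own assumptions, not the parser's actual code; ShortRecordSketch and sliceLine are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: models the two settings above on one physical line.
class ShortRecordSketch {

    // Slices one line by field length. It stops as soon as the line runs out
    // (the "record ends on newline" idea) and ignores characters left over
    // after the last field (the "skip trailing chars until newline" idea).
    static List<String> sliceLine(String line, int[] lengths) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int length : lengths) {
            if (start >= line.length()) {
                break; // the line ended early: the record is complete
            }
            int end = Math.min(start + length, line.length());
            // Whitespace is trimmed; '_' is kept because no padding was configured.
            fields.add(line.substring(start, end).trim());
            start = end;
        }
        return fields;
    }

    public static void main(String[] args) {
        int[] lengths = {4, 5, 40, 40};
        System.out.println(sliceLine("air, moon roof, loaded_______4799.00_", lengths));
    }
}
```

For the continuation line of the Jeep record this prints [air,, moon, roof, loaded_______4799.00_], the same three values described above.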

Per-field alignment and padding configuration

Fields can be individually configured to work with a given alignment or padding, different from the default provided in the FixedWidthFormat:

FixedWidthFields fields = new FixedWidthFields();
//"id" has length of 5 characters, is aligned to the right and unwritten spaces should be represented as 0
fields.addField("id", 5, FieldAlignment.RIGHT, '0');

//"code" is aligned to the center, and padded with '_'
fields.addField("code", 20, FieldAlignment.CENTER, '_');

//name and quantity use the default padding defined in the settings (further below).
fields.addField("name", 15, FieldAlignment.LEFT);
fields.addField("quantity", 5, FieldAlignment.CENTER); //"quantity" has more than 5 characters. This header will be truncated.
fields.addField("total", 5, FieldAlignment.RIGHT, '0');

FixedWidthWriterSettings writerSettings = new FixedWidthWriterSettings(fields);

//this is the default padding to use to represent unwritten spaces.
writerSettings.getFormat().setPadding('.');

//The following settings will override the individual column padding and alignment when writing headers only.
//we want to write header rows, but use the default padding for them.
writerSettings.setUseDefaultPaddingForHeaders(true);
//we also want to align headers to the left.
writerSettings.setDefaultAlignmentForHeaders(FieldAlignment.LEFT);

//Let's create the writer
FixedWidthWriter writer = new FixedWidthWriter(writerSettings);

//Writing the headers into a formatted String.
String headers = writer.writeHeadersToString();

//And a few records
String line1 = writer.writeRowToString(new String[]{"45", "ABC", "cool thing", "3", "135"});
String line2 = writer.writeRowToString(new String[]{"8000", "XYZ", "expensive thing", "1", "8000"});

//Let's see what they look like:
println(headers);
println(line1);
println(line2);

//We should be able to parse these records as well. Let's give this a try
FixedWidthParserSettings parserSettings = new FixedWidthParserSettings(fields);
parserSettings.setFormat(writerSettings.getFormat());

FixedWidthParser parser = new FixedWidthParser(parserSettings);
String[] record1 = parser.parseLine(line1);
String[] record2 = parser.parseLine(line2);

println("\nParsed:");
println(Arrays.toString(record1));
println(Arrays.toString(record2));

Output:

id...code................name...........quanttotal
00045_________ABC________cool thing.......3..00135
08000_________XYZ________expensive thing..1..08000

Parsed:
[45, ABC, cool thing, 3, 135]
[8000, XYZ, expensive thing, 1, 8000]
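
If you want to reason about how a value ends up in its column, the following plain Java sketch shows one way alignment, padding and truncation could be combined. AlignmentSketch, Align and fit are hypothetical names, and how an odd number of leftover characters is split for centered values is an assumption of this sketch, not documented library behavior.

```java
// Illustrative only: a hypothetical helper, not the univocity implementation.
class AlignmentSketch {

    enum Align { LEFT, CENTER, RIGHT }

    // Fits `value` into exactly `width` characters: longer values are
    // truncated, shorter ones are padded according to the alignment.
    static String fit(String value, int width, Align align, char padding) {
        if (value.length() >= width) {
            return value.substring(0, width);
        }
        int missing = width - value.length();
        // Assumption: for CENTER, an odd leftover character goes on the left.
        int left = align == Align.RIGHT ? missing
                 : align == Align.CENTER ? (missing + 1) / 2
                 : 0;
        StringBuilder out = new StringBuilder(width);
        for (int i = 0; i < left; i++) out.append(padding);
        out.append(value);
        while (out.length() < width) out.append(padding);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(fit("45", 5, Align.RIGHT, '0'));         // id
        System.out.println(fit("cool thing", 15, Align.LEFT, '.')); // name
        System.out.println(fit("3", 5, Align.CENTER, '.'));         // quantity
        System.out.println(fit("quantity", 5, Align.LEFT, '.'));    // truncated header
    }
}
```

With these inputs, "45" becomes 00045, "3" becomes ..3.. and the header "quantity" is cut down to quant, which matches the corresponding columns in the output above.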

Parsing multi-schema files using lookahead/lookbehind

Parsing with lookahead

It’s very common for fixed-width inputs to contain rows that use different formats. The format of each row is usually identified by a lookahead value: the first few characters of the record, which tell the parser what field lengths to use for it.

FixedWidthParserSettings settings = new FixedWidthParserSettings();
settings.getFormat().setLineSeparator("\n");

//We are going to parse the multi_schema.txt file, with a lookahead value in front of each record
//Let's define the format used to store clients' records
FixedWidthFields clientFields = new FixedWidthFields();
clientFields.addField("Lookahead", 2); //here we will store the look ahead value in a column
clientFields.addField("ClientID", 7, FieldAlignment.RIGHT, '0');
clientFields.addField("Name", 20);

//Here's the format used for client accounts:
FixedWidthFields accountFields = new FixedWidthFields();
accountFields.addField("ID", 7, FieldAlignment.RIGHT, '0'); //here the account ID will be prefixed by the lookahead value
accountFields.addField("Bank", 4);
accountFields.addField("AccountNumber", 10);
accountFields.addField("Swift", 7);

//If a record starts with C#, it's a client record, so we associate "C#" with the client format
settings.addFormatForLookahead("C#", clientFields);

//And here we associate "A#" with the account format
settings.addFormatForLookahead("A#", accountFields);

//We can now parse all rows
FixedWidthParser parser = new FixedWidthParser(settings);
List<String[]> rows = parser.parseAll(getReader("/examples/multi_schema.txt"));

Output:

1 [C#, 321, Miss Foo]
-----------------------
2 [A#23234, HSBC, 123433-000, HSBCAUS]
-----------------------
3 [A#00234, HSBC, 222343-130, HSBCCAD]
-----------------------
4 [C#, 322, Mr Bar]
-----------------------
5 [A#01234, CITI, 213343-130, CITICAD]
-----------------------
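
Conceptually, lookahead selection boils down to comparing the start of each line with the registered values and picking the matching set of field lengths. The sketch below models that idea in plain Java. LookaheadSketch, matches and select are hypothetical names, the input lines are made up (the contents of multi_schema.txt are not shown here), and the '?' wildcard follows the "any character" meaning used elsewhere in this tutorial.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: a simplified model of lookahead-based format selection.
class LookaheadSketch {

    // Returns true when `line` starts with `lookahead`; '?' matches any character.
    static boolean matches(String lookahead, String line) {
        if (line.length() < lookahead.length()) {
            return false;
        }
        for (int i = 0; i < lookahead.length(); i++) {
            char expected = lookahead.charAt(i);
            if (expected != '?' && expected != line.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    // Picks the field lengths registered for the first matching lookahead,
    // or `defaultLengths` (possibly null) when nothing matches.
    static int[] select(Map<String, int[]> formats, int[] defaultLengths, String line) {
        for (Map.Entry<String, int[]> entry : formats.entrySet()) {
            if (matches(entry.getKey(), line)) {
                return entry.getValue();
            }
        }
        return defaultLengths;
    }

    public static void main(String[] args) {
        Map<String, int[]> formats = new LinkedHashMap<>();
        formats.put("C#", new int[]{2, 7, 20});    // client: Lookahead, ClientID, Name
        formats.put("A#", new int[]{7, 4, 10, 7}); // account: ID, Bank, AccountNumber, Swift

        // Hypothetical input lines, invented for this sketch.
        String[] lines = {"C#0000321Miss Foo", "A#23234HSBC123433-000HSBCAUS"};
        for (String line : lines) {
            System.out.println(line + " -> " + select(formats, null, line).length + " fields");
        }
    }
}
```

Passing a non-null defaultLengths to select gives the "default format" behavior described in the next section: rows that match no lookahead fall back to it.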

Parsing with lookahead, and a default format

In other cases some rows might have a lookahead, but others don’t, and should be parsed using a default format:

//In some cases the input records might not have a lookahead value. On the multi_schema2.txt file,
//only client records have a lookahead. If no other lookahead is matched, the parser will switch back to
//the default field format. Here, the format used by account records will be used as default.
FixedWidthParserSettings settings = new FixedWidthParserSettings(accountFields);
settings.getFormat().setLineSeparator("\n");

//Let's again define the format used to store clients' records
FixedWidthFields clientFields = new FixedWidthFields();
clientFields.addField("Lookahead", 2); //here we will store the look ahead value in a column
clientFields.addField("ClientID", 7, FieldAlignment.RIGHT, '0');
clientFields.addField("Name", 20);

//If a record starts with any character followed by '#' (such as C#), it's a client record, so we associate "?#" with the client format.
//Any other record will be parsed using the default format
settings.addFormatForLookahead("?#", clientFields);

//Let's parse all rows now
FixedWidthParser parser = new FixedWidthParser(settings);
List<String[]> rows = parser.parseAll(getReader("/examples/multi_schema2.txt"));

Output:

1 [C#, 321, Miss Foo]
-----------------------
2 [23234, HSBC, 123433-000, HSBCAUS]
-----------------------
3 [234, HSBC, 222343-130, HSBCCAD]
-----------------------
4 [C#, 322, Mr Bar]
-----------------------
5 [1234, CITI, 213343-130, CITICAD]
-----------------------

Parsing with lookbehind

Sometimes the format of a row can also be defined by a lookbehind, where the first few characters of the previous row identify the format of the following rows:

//We can also specify a lookbehind value to determine which format to use when parsing the input.

//If a record starts with C#, it's a client record, so we associate "C#" with the client format.
settings.addFormatForLookahead("C#", clientFields);
//If a previously parsed record starts with any character followed by '#' (such as C#), but the current one doesn't, then we are processing accounts. Let's use the account format.
settings.addFormatForLookbehind("?#", accountFields);

//Let's parse all rows now
FixedWidthParser parser = new FixedWidthParser(settings);
List<String[]> rows = parser.parseAll(getReader("/examples/multi_schema2.txt"));

Output:

1 [C#, 321, Miss Foo]
-----------------------
2 [23234, HSBC, 123433-000, HSBCAUS]
-----------------------
3 [234, HSBC, 222343-130, HSBCCAD]
-----------------------
4 [C#, 322, Mr Bar]
-----------------------
5 [1234, CITI, 213343-130, CITICAD]
-----------------------
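
One way to picture lookbehind is as a small piece of state carried from row to row: once a marker row has been seen, the rows after it keep using the associated format until another marker shows up. The plain Java sketch below models only that idea, under our own simplifying assumptions (a single marker, plain prefix matching, made-up input lines). LookbehindSketch and assignFormats are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: a simplified model of lookahead plus lookbehind selection.
class LookbehindSketch {

    // Labels each line with the format that would be used to parse it.
    // A line starting with `marker` is a client record (lookahead match).
    // Any other line uses the format selected by the last marker row seen
    // (lookbehind match), or "default" if no marker row has been seen yet.
    static List<String> assignFormats(List<String> lines, String marker) {
        List<String> labels = new ArrayList<>();
        String formatAfterMarker = "default";
        for (String line : lines) {
            if (line.startsWith(marker)) {
                labels.add("client");
                formatAfterMarker = "account"; // rows after a client are accounts
            } else {
                labels.add(formatAfterMarker);
            }
        }
        return labels;
    }

    public static void main(String[] args) {
        // Hypothetical input lines, invented for this sketch.
        List<String> lines = Arrays.asList(
                "C#0000321Miss Foo",
                "0023234HSBC123433-000HSBCAUS",
                "0000234HSBC222343-130HSBCCAD",
                "C#0000322Mr Bar",
                "0001234CITI213343-130CITICAD");
        System.out.println(assignFormats(lines, "C#"));
    }
}
```

For these five lines the sketch prints [client, account, account, client, account], the same sequence of record types as in the output above.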

Writing multi-schema files using lookahead/lookbehind

If you need to write multi-schema fixed-width files, you will probably need to use lookahead/lookbehind values and assign a row format to each value.

Writing with lookahead

//Here's the format used for client accounts:
FixedWidthFields accountFields = new FixedWidthFields();
accountFields.addField("ID", 10); //account value includes the lookahead value.
accountFields.addField("Bank", 8);
accountFields.addField("AccountNumber", 15);
accountFields.addField("Swift", 12);

//Format for clients' records
FixedWidthFields clientFields = new FixedWidthFields();
clientFields.addField("Lookahead", 5); //clients have their lookahead in a separate column
clientFields.addField("ClientID", 15, FieldAlignment.RIGHT, '0'); //let's pad client ID's with leading zeroes.
clientFields.addField("Name", 20);

FixedWidthWriterSettings settings = new FixedWidthWriterSettings();
settings.getFormat().setLineSeparator("\n");
settings.getFormat().setPadding('_');

//If a record starts with C#, it's a client record, so we associate "C#" with the client format.
settings.addFormatForLookahead("C#", clientFields);

//Rows starting with any character followed by 'A' should be written using the account format
settings.addFormatForLookahead("?A", accountFields);

StringWriter out = new StringWriter();

//Let's write
FixedWidthWriter writer = new FixedWidthWriter(out, settings);

writer.writeRow(new Object[]{"C#",23234, "Miss Foo"});
writer.writeRow(new Object[]{"#A23234", "HSBC", "123433-000", "HSBCAUS"});
writer.writeRow(new Object[]{"^A234", "HSBC", "222343-130", "HSBCCAD"});
writer.writeRow(new Object[]{"C#",322, "Mr Bar"});
writer.writeRow(new Object[]{"@A1234", "CITI", "213343-130", "CITICAD"});

writer.close();

print(out);

Output:

C#___000000000023234Miss Foo____________
#A23234___HSBC____123433-000_____HSBCAUS_____
^A234_____HSBC____222343-130_____HSBCCAD_____
C#___000000000000322Mr Bar______________
@A1234____CITI____213343-130_____CITICAD_____
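
When writing, the same selection happens on the first value of each row instead of on the start of a line. The plain Java sketch below mimics that for the two formats defined above. LookaheadWriterSketch, writeRow and fit are hypothetical names, the widths are copied from the snippet, and the way the "?A" wildcard is checked is our own simplification.

```java
// Illustrative only: a hand-rolled writer, not part of the univocity API.
class LookaheadWriterSketch {

    static final int[] CLIENT_WIDTHS = {5, 15, 20};      // Lookahead, ClientID, Name
    static final int[] ACCOUNT_WIDTHS = {10, 8, 15, 12}; // ID, Bank, AccountNumber, Swift

    // Chooses a format from the first value of the row, then pads every value to its width.
    static String writeRow(Object[] row, char padding) {
        String first = String.valueOf(row[0]);
        boolean client = first.startsWith("C#");                          // lookahead "C#"
        boolean account = first.length() >= 2 && first.charAt(1) == 'A';  // lookahead "?A"
        if (!client && !account) {
            throw new IllegalArgumentException("No format for row starting with " + first);
        }
        int[] widths = client ? CLIENT_WIDTHS : ACCOUNT_WIDTHS;
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < widths.length; i++) {
            String value = i < row.length ? String.valueOf(row[i]) : "";
            // Only ClientID (index 1 of the client format) is right-aligned and zero-padded.
            boolean zeroPadded = client && i == 1;
            out.append(fit(value, widths[i], zeroPadded, zeroPadded ? '0' : padding));
        }
        return out.toString();
    }

    // Truncates or pads `value` to exactly `width` characters.
    static String fit(String value, int width, boolean alignRight, char padding) {
        if (value.length() >= width) {
            return value.substring(0, width);
        }
        StringBuilder pad = new StringBuilder();
        for (int i = value.length(); i < width; i++) pad.append(padding);
        return alignRight ? pad + value : value + pad;
    }

    public static void main(String[] args) {
        System.out.println(writeRow(new Object[]{"C#", 23234, "Miss Foo"}, '_'));
        System.out.println(writeRow(new Object[]{"^A234", "HSBC", "222343-130", "HSBCCAD"}, '_'));
    }
}
```

Client rows come out 40 characters wide (5 + 15 + 20) and account rows 45 characters wide (10 + 8 + 15 + 12), in line with the output above.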

Writing with lookahead, and a default format

//As accounts don't have a lookahead value, we use their format as the default.
FixedWidthWriterSettings settings = new FixedWidthWriterSettings(accountFields);
settings.getFormat().setLineSeparator("\n");
settings.getFormat().setPadding('_');

//If a record starts with C#, it's a client record, so we associate "C#" with the client format.
//Any other row will be written using the default format (for accounts)
settings.addFormatForLookahead("C#", clientFields);

StringWriter out = new StringWriter();

//Let's write
FixedWidthWriter writer = new FixedWidthWriter(out, settings);

writer.writeRow(new Object[]{"C#",23234, "Miss Foo"});
writer.writeRow(new Object[]{"23234", "HSBC", "123433-000", "HSBCAUS"});
writer.writeRow(new Object[]{"234", "HSBC", "222343-130", "HSBCCAD"});
writer.writeRow(new Object[]{"C#",322, "Mr Bar"});
writer.writeRow(new Object[]{"1234", "CITI", "213343-130", "CITICAD"});

writer.close();

print(out);

Output:

C#___000000000023234Miss Foo____________
23234_____HSBC____123433-000_____HSBCAUS_____
234_______HSBC____222343-130_____HSBCCAD_____
C#___000000000000322Mr Bar______________
1234______CITI____213343-130_____CITICAD_____

Writing with lookbehind

//If a record starts with C#, it's a client record, so we associate "C#" with the client format.
settings.addFormatForLookahead("C#", clientFields);
//If a record written previously had a C#, but the current doesn't, then we are writing accounts. Let's use the account format.
settings.addFormatForLookbehind("C#", accountFields);

Output:

C#___000000000023234Miss Foo____________
23234_____HSBC____123433-000_____HSBCAUS_____
234_______HSBC____222343-130_____HSBCCAD_____
C#___000000000000322Mr Bar______________
1234______CITI____213343-130_____CITICAD_____

Bugs, contributions & support

If you find a bug, please report it on GitHub or send us an email at parsers@univocity.com.

We try our best to eliminate all bugs as soon as possible and you’ll rarely see a bug open for more than 24 hours after it’s reported. We do our best to answer all questions. Enhancements/suggestions are implemented on a best effort basis.

Feel free to submit your contribution via pull requests. Any little bit is appreciated, from improvements on documentation to a full-blown rewrite from scratch.

For commercial support, customizations or anything in between, please contact support@univocity.com.

Thank you for using our parsers!

The univocity team.