
Other Row Processors

Extracting more than one Java bean from a single input row

The MultiBeanListProcessor allows you to collect instances of multiple annotated classes from each row parsed from the input.
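
TestBean and AnotherTestBean are not listed in this tutorial. A minimal sketch of what these annotated classes might look like, inferred from the output printed further down (annotations, types and formats here are assumptions):

class TestBean {
    // if the field comes in empty, "0" is read instead
    @Parsed(defaultNullRead = "0")
    Integer quantity;

    @Parsed
    String comments;

    @Parsed
    BigDecimal amount;

    // accepts a few representations of true/false
    @BooleanString(trueStrings = {"yes", "y"}, falseStrings = {"no", "n"})
    @Parsed
    Boolean pending;
}

class AnotherTestBean {
    // date format assumed from the "10/Oct/2001" value printed below
    @Format(formats = "dd/MMM/yyyy")
    @Parsed
    Date date;

    // only 'y' or 'n' are accepted here, which is why one record fails to parse
    @BooleanString(trueStrings = {"y"}, falseStrings = {"n"})
    @Parsed
    Boolean pending;
}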

// The MultiBeanProcessor allows multiple bean instances to be created from a single input record.
// in this example, we will create instances of TestBean and AnotherTestBean.
// Here we use a MultiBeanListProcessor which is a convenience class that implements the
// abstract beanProcessed() method of MultiBeanProcessor to add each instance to a list.
MultiBeanListProcessor processor = new MultiBeanListProcessor(TestBean.class, AnotherTestBean.class);

CsvParserSettings settings = new CsvParserSettings();

// one of the records in the input won't be compatible with AnotherTestBean: the field "pending"
// only accepts 'y' or 'n' as valid representations of true or false. As we want to continue processing, let's ignore the error.
settings.setProcessorErrorHandler(new RowProcessorErrorHandler() {
    @Override
    public void handleError(DataProcessingException error, Object[] inputRow, ParsingContext context) {
        //ignore the error.
    }
});

// we also need to grab the headers from our input file
settings.setHeaderExtractionEnabled(true);

//let's configure the parser to use our MultiBeanListProcessor
settings.setProcessor(processor);

CsvParser parser = new CsvParser(settings);

//and parse everything.
parser.parse(getReader("/examples/bean_test.csv"));

// we can get all beans parsed from the input as a map, where each key is a bean type
// and the value is the list of corresponding bean instances
Map<Class<?>, List<?>> beans = processor.getBeans();

// or we can get the lists of beans processed individually by providing the type:
List<TestBean> testBeans = processor.getBeans(TestBean.class);
List<AnotherTestBean> anotherTestBeans = processor.getBeans(AnotherTestBean.class);

//Let's have a look:
println("TestBeans\n----------------");
for (TestBean testBean : testBeans) {
    println(testBean);
}

//We expect one of the instances here to be null
println("\nAnotherTestBeans\n----------------");
for (AnotherTestBean anotherTestBean : anotherTestBeans) {
    println(anotherTestBean);
}

Here’s the result:

TestBeans
----------------
TestBean [quantity=1, comments=?, amount=555.999, pending=true]
TestBean [quantity=0, comments=" something ", amount=null, pending=false]

AnotherTestBeans
----------------
null
AnotherTestBean [date=10/Oct/2001, pending=false]

Processing rows in parallel

As of univocity-parsers 1.4.0, you can easily process rows in a separate thread as they are parsed. All you have to do is wrap your RowProcessor in a ConcurrentRowProcessor:

parserSettings.setProcessor(new ConcurrentRowProcessor(rowProcessor));

Note that this won't always produce faster processing times. univocity-parsers is highly optimized, and processing your data sequentially will still be faster than processing it in parallel in many cases. We recommend profiling your particular scenario before deciding whether to use this feature.
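
For context, here's a minimal sketch of a complete setup, assuming the example.csv input used elsewhere in this tutorial (the RowListProcessor is used here just for illustration):

// collect rows while the parser reads the input
RowListProcessor rowProcessor = new RowListProcessor();

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.getFormat().setLineSeparator("\n");
parserSettings.setHeaderExtractionEnabled(true);

// wrap the processor so rows are handed over to a separate thread for processing
parserSettings.setProcessor(new ConcurrentRowProcessor(rowProcessor));

CsvParser parser = new CsvParser(parserSettings);
parser.parse(getReader("/examples/example.csv"));

// by the time parse() returns, all rows have been processed
List<String[]> rows = rowProcessor.getRows();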

Parsing simple master-detail style files

Use MasterDetailProcessor or MasterDetailListProcessor to produce MasterDetailRecord objects. A simple example of a master-detail file can be found in master_detail.csv.

Each MasterDetailRecord holds a master record row and its list of associated detail rows.
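
For reference, the master_detail.csv input could look like this (reconstructed from the output shown further down; the header names are assumed):

Item,Amount
Item1,50
Item2,40
Item3,10
Total,100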

// 1st, Create a RowProcessor to process all "detail" elements
ObjectRowListProcessor detailProcessor = new ObjectRowListProcessor();

// converts values in the "Amount" column (position 1 in the file) to Integer.
detailProcessor.convertIndexes(Conversions.toInteger()).set(1);

// 2nd, create a MasterDetailListProcessor to identify whether or not a row is a master row.
// the row placement argument indicates whether the master row occurs before or after a sequence of "detail" rows.
MasterDetailListProcessor masterRowProcessor = new MasterDetailListProcessor(RowPlacement.BOTTOM, detailProcessor) {
    @Override
    protected boolean isMasterRecord(String[] row, ParsingContext context) {
        //Returns true if the parsed row is the master row.
        //In this example, rows that have "Total" in the first column are master rows.
        return "Total".equals(row[0]);
    }
};
// We want our master rows to store BigIntegers in the "Amount" column
masterRowProcessor.convertIndexes(Conversions.toBigInteger()).set(1);

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.setHeaderExtractionEnabled(true);

// Set the RowProcessor to the masterRowProcessor.
parserSettings.setProcessor(masterRowProcessor);
parserSettings.getFormat().setLineSeparator("\n");

CsvParser parser = new CsvParser(parserSettings);
parser.parse(getReader("/examples/master_detail.csv"));

// Here we get the MasterDetailRecord elements.
List<MasterDetailRecord> rows = masterRowProcessor.getRecords();
MasterDetailRecord masterRecord = rows.get(0);

// The master record has one master row and multiple detail rows.
Object[] masterRow = masterRecord.getMasterRow();
List<Object[]> detailRows = masterRecord.getDetailRows();

After printing the master row and its detail rows, the output is:

[Total, 100]
=======================
1 [Item1, 50]
-----------------------
2 [Item2, 40]
-----------------------
3 [Item3, 10]
-----------------------

Reading columns instead of rows

Since univocity-parsers 1.3.0, a few special types of RowProcessor are available to collect the values of columns instead of rows: ColumnProcessor collects the column values as plain Strings, while ObjectColumnProcessor converts them to Objects.

To avoid memory problems when processing large inputs, batched versions of these processors are also available: BatchedColumnProcessor and BatchedObjectColumnProcessor. These return the column values accumulated after processing a batch of a given number of rows.

Here are some examples of how to use them:
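
The first example reads the example.csv file used elsewhere in this tutorial, while the batched example reads a fixed-width example.txt containing the same data. The CSV contents can be assumed to look roughly like this (reconstructed from the outputs below; quoting details may differ):

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""",,4900.00
1996,Jeep,Grand Cherokee,"MUST SELL!
air, moon roof, loaded",4799.00
1999,Chevy,"Venture ""Extended Edition, Very Large""",,5000.00
,,"Venture ""Extended Edition""",,4900.00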

Read all columns at once

CsvParserSettings parserSettings = new CsvParserSettings();
parserSettings.getFormat().setLineSeparator("\n");
parserSettings.setHeaderExtractionEnabled(true);

// To get the values of all columns, use a column processor
ColumnProcessor rowProcessor = new ColumnProcessor();
parserSettings.setProcessor(rowProcessor);

CsvParser parser = new CsvParser(parserSettings);

//This will kick in our column processor
parser.parse(getReader("/examples/example.csv"));

//Finally, we can get the column values:
Map<String, List<String>> columnValues = new TreeMap<String, List<String>>(rowProcessor.getColumnValuesAsMapOfNames());

Let’s see the output. Each line displays a column name followed by the values parsed for that column:

Description -> [ac, abs, moon, null, MUST SELL!
air, moon roof, loaded, null, null]
Make -> [Ford, Chevy, Jeep, Chevy, null]
Model -> [E350, Venture "Extended Edition", Grand Cherokee, Venture "Extended Edition, Very Large", Venture "Extended Edition"]
Price -> [3000.00, 4900.00, 4799.00, 5000.00, 4900.00]
Year -> [1997, 1999, 1996, 1999, null]

Read columns in batches

//To process larger inputs, we can use a batched column processor ('settings' here is a FixedWidthParserSettings instance created beforehand).
//Here we set the batch size to 3, meaning we'll get the column values of at most 3 rows in each batch.
settings.setProcessor(new BatchedColumnProcessor(3) {

    @Override
    public void batchProcessed(int rowsInThisBatch) {
        List<List<String>> columnValues = getColumnValuesAsList();

        println(out, "Batch " + getBatchesProcessed() + ":");
        int i = 0;
        for (List<String> column : columnValues) {
            println(out, "Column " + (i++) + ":" + column);
        }
    }
});

FixedWidthParser parser = new FixedWidthParser(settings);
parser.parse(getReader("/examples/example.txt"));

Here we print the column values from each batch of 3 rows. As we have 5 rows in the input, the last batch will have 2 values per column:

Batch 0:
Column 0:[1997, 1999, 1996]
Column 1:[Ford, Chevy, Jeep]
Column 2:[E350, Venture "Extended Edition", Grand Cherokee]
Column 3:[ac, abs, moon, null, MUST SELL!
air, moon roof, loaded]
Column 4:[3000.00, 4900.00, 4799.00]
Batch 1:
Column 0:[1999, null]
Column 1:[Chevy, null]
Column 2:[Venture "Extended Edition, Very Large", Venture "Extended Edition"]
Column 3:[null, null]
Column 4:[5000.00, 4900.00]

Reading columns while converting the parsed content to Objects

// ObjectColumnProcessor converts the parsed values and stores them in columns
// Use BatchedObjectColumnProcessor to process columns in batches
ObjectColumnProcessor rowProcessor = new ObjectColumnProcessor();

// converts values in the "Price" column (index 4) to BigDecimal
rowProcessor.convertIndexes(Conversions.toBigDecimal()).set(4);

// converts the values in the "Make", "Model" and "Description" columns to lower case, and converts the value "chevy" to null.
rowProcessor.convertFields(Conversions.toLowerCase(), Conversions.toNull("chevy")).set("Make", "Model", "Description");

// converts the values in the "year" column (at index 0) to BigInteger. Nulls are converted to BigInteger.ZERO.
rowProcessor.convertFields(new BigIntegerConversion(BigInteger.ZERO, "0")).set("year");

// settings for the TSV parser (declared here so the snippet is self-contained)
TsvParserSettings parserSettings = new TsvParserSettings();
parserSettings.getFormat().setLineSeparator("\n");
parserSettings.setHeaderExtractionEnabled(true);

parserSettings.setProcessor(rowProcessor);

TsvParser parser = new TsvParser(parserSettings);

//the rowProcessor will be executed here.
parser.parse(getReader("/examples/example.tsv"));

//Let's get the column values:
Map<Integer, List<Object>> columnValues = rowProcessor.getColumnValuesAsMapOfIndexes();

Now we will print the column indexes and their values:

0 -> [1997, 1999, 1996, 1999, 0]
1 -> [ford, null, jeep, null, null]
2 -> [e350, venture "extended edition", grand cherokee, venture "extended edition, very large", venture "extended edition"]
3 -> [ac, abs, moon, null, must sell!
air, moon roof, loaded, null, null]
4 -> [3000.00, 4900.00, 4799.00, 5000.00, 4900.00]

Handling complex multi-schema files

It’s not uncommon for input files to contain records of different schemas grouped together. In these cases, rows have different formats with different fields, generally with a value that identifies what each row contains, to guide the parsing process.

For example, you could have one row for a “client” entity, followed by one or more rows containing “purchases” made by this client, where each “purchase” could be followed by “items”, and so on.

To handle this sort of complexity, we built a special type of RowProcessor called InputValueSwitch. It allows you to associate a value with a RowProcessor. When parsing, if the value of a given column in the parsed row matches the value you provided, the parser will switch to your RowProcessor to handle the specific format of that row.

The InputValueSwitch has a rowProcessorSwitched method that can be overridden to notify you of a change in the row format.
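
The example below parses some of the rows into instances of a Car class that is not listed in this tutorial. A minimal sketch of what such a bean might look like (field names and types inferred from the output further down; the actual class may differ):

public class Car {
    @Parsed
    private Integer year;

    @Parsed
    private String make;

    @Parsed
    private String model;

    @Parsed
    private BigDecimal price;

    // getters and setters omitted for brevity
}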

The following example demonstrates the power of the InputValueSwitch when processing this sort of input:

//Here's a master-detail schema, with rows using different formats. Rows starting with MASTER are master rows
//and they are followed by one or more car records.
StringReader input = new StringReader("" +
        "MASTER,Value 1,Value 2\n" +
        "2012,Toyota,Camry,10000\n" +
        "2014,Toyota,Camry,12000\n" +
        "MASTER,Value 3\n" +
        "1982,Volkswagen,Beetle,3500"
);

//We are going to collect all MASTER records using a processor for lists of Object[]
final ObjectRowListProcessor masterElementProcessor = new ObjectRowListProcessor();
//And rows with car information will be parsed into a list of Car objects.
final BeanListProcessor<Car> carProcessor = new BeanListProcessor<Car>(Car.class);

//We create a switch based on the first column of the input. The value at column 0 of each row
//will be compared against the values registered for each processor to determine which one handles the row.
InputValueSwitch inputSwitch = new InputValueSwitch(0) {

    // the input value switch allows you to override the rowProcessorSwitched method so you can decide
    // what should happen when a different type of row is found in the input.
    @Override
    public void rowProcessorSwitched(RowProcessor from, RowProcessor to) {
        //let's associate the list of parsed Car instances with the MASTER record parsed before.
        //when the current row processor is not the "carProcessor" anymore, it means we parsed all cars under a MASTER record
        if (from == carProcessor) {

            List<Object[]> masterRows = masterElementProcessor.getRows();
            int lastMasterRowIndex = masterRows.size() - 1;
            Object[] lastMasterRow = masterRows.get(lastMasterRowIndex);

            //let's expand the last master row
            Object[] masterRowWithCars = Arrays.copyOf(lastMasterRow, 3);

            //and add our parsed car list in the last element:
            masterRowWithCars[2] = new ArrayList<Car>(carProcessor.getBeans());

            masterRows.set(lastMasterRowIndex, masterRowWithCars);

            //clear the list of cars so it's empty for the next batch of cars under another master row.
            carProcessor.getBeans().clear();
        }
    }
};

//Rows whose value at column 0 is "MASTER" will be processed using the masterElementProcessor.
inputSwitch.addSwitchForValue("MASTER", masterElementProcessor, 1, 2); //we are not interested in the "MASTER" value at column 0, let's select which columns to get values from

//Car records don't have an identifier in their rows. In this case the input switch will use a "default" processor
//to handle the input. As all records that don't start with "MASTER" are Car records, we use the car processor as the default.
inputSwitch.setDefaultSwitch(carProcessor, "year", "make", "model", "price"); //the header list here matches the fields annotated in the Car class.

//All we need now is to set the row processor of our parser, which is the input switch.
CsvParserSettings settings = new CsvParserSettings();
settings.setProcessor(inputSwitch);

settings.getFormat().setLineSeparator("\n");

//and parse!
CsvParser parser = new CsvParser(settings);
parser.parse(input);

//Let's have a look at our master rows which contain lists of cars:
List<Object[]> rows = masterElementProcessor.getRows();

for (Object[] row : rows) {
    println(Arrays.toString(row));
}

Here’s the result:

[Value 1, Value 2, [Car: {year=2012, make=Toyota, model=Camry, price=10000}, Car: {year=2014, make=Toyota, model=Camry, price=12000}]]
[Value 3, null, [Car: {year=1982, make=Volkswagen, model=Beetle, price=3500}]]

Writing multi-schema outputs

Similarly to the InputValueSwitch for parsing multi-schema inputs, the OutputValueSwitch lets you write multi-schema outputs with relative ease:

//creates a switch that will use a different row processor for writing each row, based on the value of the "type" column (the first column of the output).
OutputValueSwitch writerSwitch = new OutputValueSwitch("type");

// If the value is "SUPER", we want to use an ObjectRowWriterProcessor.
// Field names "type", "field1" and "field2" will be associated with this row processor
writerSwitch.addSwitchForValue("SUPER", new ObjectRowWriterProcessor(), "type", "field1", "field2");

// If the value is "DUPER", another ObjectRowWriterProcessor will be used.
// Field names "type", "A", "B" and "C" will be used here
writerSwitch.addSwitchForValue("DUPER", new ObjectRowWriterProcessor(), "type", "A", "B", "C");

CsvWriterSettings settings = new CsvWriterSettings();

// configure the writer to use the switch
settings.setRowWriterProcessor(writerSwitch);
//rows with fewer values than expected will be expanded, i.e. empty columns will be written
settings.setExpandIncompleteRows(true);

settings.getFormat().setLineSeparator("\n");
settings.setHeaderWritingEnabled(false);

StringWriter output = new StringWriter();
CsvWriter writer = new CsvWriter(output, settings);

Map<String, Object> duperValues = new HashMap<String, Object>();
duperValues.put("type", "DUPER");
duperValues.put("A", "value A");
duperValues.put("B", "value B");
duperValues.put("C", "value C");

writer.processRecord(new Object[]{"SUPER", "Value 1", "Value 2"}); //writing an array
writer.processRecord(duperValues); //writing a map

duperValues.remove("A"); //no data for column "A"
duperValues.put("B", 5555); //updating the value of B
duperValues.put("D", null); //not included, will be ignored

writer.processRecord(duperValues);
writer.processRecord(new Object[]{"SUPER", "Value 3"}); //no value for column "field2", an empty column will be written

writer.close();

print(output.toString());

This will write the following:

SUPER,Value 1,Value 2
DUPER,value A,value B,value C
DUPER,,5555,value C
SUPER,Value 3,

As the OutputValueSwitch works with instances of RowWriterProcessor, you can use different annotated Java beans to write rows in different formats, alongside an ObjectRowWriterProcessor:

//creates a switch that will use a different row processor for writing each row, based on the value of the "type" column (the first column of the output).
OutputValueSwitch writerSwitch = new OutputValueSwitch("type");

// If the value is "SUPER", we want to use an ObjectRowWriterProcessor.
// Field names "type", "field1" and "field2" will be associated with this row processor
writerSwitch.addSwitchForValue("SUPER", new ObjectRowWriterProcessor(), "type", "field1", "field2");

//we are going to write instances of Car
writerSwitch.addSwitchForType(Car.class); //you can also define specific fields to write by giving a list of header names/column indexes.

CsvWriterSettings settings = new CsvWriterSettings();

// configure the writer to use the switch
settings.setRowWriterProcessor(writerSwitch);

settings.getFormat().setLineSeparator("\n");
settings.setHeaderWritingEnabled(false);

StringWriter output = new StringWriter();
CsvWriter writer = new CsvWriter(output, settings);


writer.processRecord(new Object[]{"SUPER", "Value 1", "Value 2"}); //writing an array

//Here's our car
Car car = new Car();
car.setYear(2012);
car.setMake("Toyota");
car.setModel("Camry");
car.setPrice(new BigDecimal("10000"));
writer.processRecord(car);

//And another car
car.setYear(2014);
car.setPrice(new BigDecimal("12000"));
writer.processRecord(car);

writer.processRecord(new Object[]{"SUPER", "Value 3"}); //no value for column "field2", an empty column will be written

writer.close();

print(output.toString());

The result will be:

SUPER,Value 1,Value 2
2012,,Toyota,Camry,10000
2014,,Toyota,Camry,12000
SUPER,Value 3

Further Reading

Feel free to proceed to the other sections of this tutorial, in any order.

Bugs, contributions & support

If you find a bug, please report it on GitHub or send us an email at parsers@univocity.com.

We try our best to eliminate all bugs as soon as possible, and you’ll rarely see a bug open for more than 24 hours after it’s reported. We do our best to answer all questions. Enhancements and suggestions are implemented on a best-effort basis.

Feel free to submit your contributions via pull requests. Every little bit is appreciated, from improvements to the documentation to a full-blown rewrite from scratch.

For commercial support, customizations or anything in between, please contact support@univocity.com.

Thank you for using our parsers!

The univocity team.