Reading data into Java beans

Both univocity-html-parser and univocity-parsers support the same annotations to generate Java beans from rows parsed from the input. You can follow this tutorial to learn about all supported annotations for both libraries.

All annotations you can use are in package com.univocity.parsers.annotations.

The HTML parser has a couple of additional annotations that handle nested collections and maps of objects, which are showcased in the last two sections of this tutorial.

Let’s get started.

Creating a class with annotations

Every attribute of your classes that should be populated with values collected from a given result must be annotated with one of:

  • @Parsed - to bind the attribute of your class to a field of your results.

  • @Nested - to handle nested classes when your attribute is of a custom type (i.e. a class) that has @Parsed and @Nested attributes.

  • @Linked - to handle nested collections, arrays and maps (currently available to the HTML parser only).

@Parsed is the core annotation to use and is the only one that actually maps fields in your results to an attribute.

Let’s assume your data has the following structure (shown in a fixed-width format for clarity):

profile_id  username        followers
123.........theuser........200.......
1...........admin....................

You can create a class such as:

class ProfileByFieldName {

    @Parsed(field = "profile_id")
    private Long profileId;

    @Parsed
    private String username;

    @Parsed
    private int followers;

    @Override
    public String toString() {
        return "ProfileByFieldName{" +
                "profileId=" + profileId +
                ", username='" + username + '\'' +
                ", followers=" + followers +
                '}';
    }
}

This maps attribute names to the corresponding headers of your data. Names are mapped automatically when they are the same in both your data and your class. In the example above, the column profile_id doesn’t match the attribute name profileId, so we had to map the field name explicitly with @Parsed(field = "profile_id").

Alternatively, you can map your attributes to the position of each column of your input:

class ProfileByFieldPosition {

    @Parsed(index = 0)
    private Long profileId;

    @Parsed(index = 1)
    private String username;

    @Parsed(index = 2)
    private int followers;

    @Override
    public String toString() {
        return "ProfileByFieldPosition{" +
                "profileId=" + profileId +
                ", username='" + username + '\'' +
                ", followers=" + followers +
                '}';
    }
}

Lastly, if you intend to use the same class to hold values of different sets of results, with different column names, each attribute can be mapped to multiple possible names:

class ProfileByMultipleFieldNames {

    @Parsed(field = {"profile_id", "id"})
    private Long profileId;

    @Parsed(field = {"username", "user"})
    private String username;

    @Parsed
    private int followers;

    @Override
    public String toString() {
        return "ProfileByMultipleFieldNames{" +
                "profileId=" + profileId +
                ", username='" + username + '\'' +
                ", followers=" + followers +
                '}';
    }
}

With multiple field names defined, the input data shown earlier should produce the following instances:

ProfileByMultipleFieldNames{profileId=123, username='theuser', followers=200}
ProfileByMultipleFieldNames{profileId=1, username='admin', followers=0}

A different set of input rows can also be processed:

id          user           created_at  fees     type  admin  stars
123.........theuser........2015-10-30...$12,90..U.....no.....143..
1...........admin..........21/05/2001....$0,00..S.....yes....?....

This will produce the same result except for the attribute “followers”, which is not present in this input and therefore keeps its default value of 0:

ProfileByMultipleFieldNames{profileId=123, username='theuser', followers=0}
ProfileByMultipleFieldNames{profileId=1, username='admin', followers=0}
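Conceptually, mapping an attribute to several possible names boils down to finding which of the candidate names is present in the headers of the current input. The following is only a sketch of that idea in plain Java, not the library's implementation: the class and method names are hypothetical, and both the exact (case-sensitive) comparison and the "first candidate wins" order are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class FieldNameResolutionSketch {

    // Returns the column index of the first candidate name found among the headers,
    // or -1 when none of the candidates is present (the attribute is then left untouched).
    public static int resolveColumn(List<String> headers, String... candidateNames) {
        for (String candidate : candidateNames) {
            int index = headers.indexOf(candidate);
            if (index >= 0) {
                return index;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        List<String> first = Arrays.asList("profile_id", "username", "followers");
        List<String> second = Arrays.asList("id", "user", "created_at", "fees", "type", "admin", "stars");

        System.out.println(resolveColumn(first, "profile_id", "id"));  // 0
        System.out.println(resolveColumn(second, "profile_id", "id")); // 0
        System.out.println(resolveColumn(second, "followers"));        // -1
    }
}
```

A result of -1 corresponds to the “followers” case above: no column matches, so the int attribute keeps its default value.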

Exploring the available annotations

Let’s have a look again at the following user profile data:

id          user           created_at  fees     type  admin  stars
123.........theuser........2015-10-30...$12,90..U.....no.....143..
1...........admin..........21/05/2001....$0,00..S.....yes....?....

To map each column to a class attribute, we need to handle dates and amounts that may come in different formats. We also need to convert “yes” and “no” to a proper boolean value. On top of that, there’s a “type” column that could be mapped to an enum.

This actually involves very little work. We can simply define a class such as:

class Profile {

    enum Type {
        USER('U'),
        SYSTEM('S');

        public final char typeCode;

        Type(char typeCode) {
            this.typeCode = typeCode;
        }
    }

    @Parsed
    private Long id;

    @Parsed
    @Trim
    @UpperCase
    private String user;

    @Parsed(field = "created_at")
    @Format(formats = {"yyyy-MM-dd", "dd/MM/yyyy"}, options = "locale=en;lenient=false")
    private Date createdAt;

    @Parsed
    @Replace(expression = "\\$", replacement = "")
    @Format(formats = {"#0,00"}, options = "decimalSeparator=,")
    private BigDecimal fees;

    @Parsed
    @BooleanString(trueStrings = {"yes", "y"}, falseStrings = {"no", "n"})
    private boolean admin;

    @Parsed
    @EnumOptions(customElement = "typeCode")
    private Type type;

    @Parsed(field = "stars")
    @NullString(nulls = {"?", "N/A"})
    private Integer stars;

    @Override
    public String toString() {
        return "Profile{" +
                "id=" + id +
                ", user='" + user + '\'' +
                ", createdAt=" + createdAt +
                ", fees=" + fees +
                ", admin=" + admin +
                ", type=" + type +
                ", stars=" + stars +
                '}';
    }

    // getters omitted (getId() and getUser() are used by later examples)
}

And the objects generated from that input will be:

Profile{id=123, user='THEUSER', createdAt=Fri Oct 30 00:00:00 CST 2015, fees=12.90, admin=false, type=USER, stars=143}
Profile{id=1, user='ADMIN', createdAt=Mon May 21 00:00:00 CST 2001, fees=0.00, admin=true, type=SYSTEM, stars=null}

Let’s explain each annotation:

  • Trim: trims any whitespace around a value;

  • UpperCase: upper cases the value;

  • LowerCase: lower cases the value;

  • Replace: performs a String.replaceAll operation on the value, matching the regular expression provided in the expression property and replacing each match with the replacement string provided. In our example we used it to remove the ‘$’ character from values in the “fees” column.

  • NullString: converts a String in the input record to null, as it represents a missing value. In our example, values “?” or “N/A” will be converted to null.

  • Format: applies to Date, Calendar or any numeric type attribute and accepts multiple formats. The formats and options to use depend on the attribute type.

    • with dates the formats you define are handled by Java’s SimpleDateFormat so they must conform to the patterns accepted by that class. The options allow you to configure properties of the SimpleDateFormat that is created internally. You must provide key-value pairs separated by a ;. In the given example “locale=en” will internally execute new SimpleDateFormat("yyyy-MM-dd", new Locale("en")); and “lenient=false” executes simpleDateFormat.setLenient(false);.

    • numeric attributes should have formats that conform to the rules of Java’s DecimalFormat. Similarly to the date patterns, options will set attributes of the DecimalFormat instance used to parse the input values, as well as the associated DecimalFormatSymbols.

  • BooleanString: accepts sets of multiple Strings that represent either true or false. If a record collected by the parser has a value that matches one of the trueStrings, the attribute will be set to true; if it matches one of the falseStrings, it will be set to false. If neither set matches, a DataProcessingException will be raised informing you that no boolean value can be determined.

  • EnumOptions: converts the plain String in the input record to an enumeration value. Enumerations are mapped automatically if one of the following matches with the input value, in order of precedence:

    • the name of a value of the enum;

    • the ordinal value (i.e. the input value is a numeric representation of the enumeration’s ordinal);

    • the toString() of a value of the enum.

    This order can be redefined in the selectors attribute of EnumOptions.

    In the example we used, the Type enumeration has a typeCode character that represents each human-readable value. The records we are processing use that character, and we can easily map it by setting the customElement in @EnumOptions. The parser will discover what values to use based on the typeCode and map them to the Type values accordingly.
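To make the Format and Replace descriptions more concrete, here is a rough plain-JDK sketch of the operations they correspond to: trying several date patterns in order with a non-lenient SimpleDateFormat, and stripping the ‘$’ before parsing an amount whose decimal separator is a comma. This is not the library's code. The class and method names are hypothetical, and the "0.00" number pattern is a choice made for this sketch rather than the pattern used in the Profile class.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class ConversionMechanicsSketch {

    // Dates: try each pattern in order with a strict (non-lenient) SimpleDateFormat.
    public static Date parseDate(String value, String... patterns) {
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH); // "locale=en"
            format.setLenient(false);                                                // "lenient=false"
            try {
                return format.parse(value);
            } catch (ParseException e) {
                // this pattern did not match; fall through to the next one
            }
        }
        throw new IllegalArgumentException("No pattern matches date: " + value);
    }

    // Amounts: remove '$' with a regular expression, then parse using ',' as the decimal separator.
    public static BigDecimal parseAmount(String value) {
        String cleaned = value.replaceAll("\\$", "");
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ROOT);
        symbols.setDecimalSeparator(',');                          // "decimalSeparator=,"
        DecimalFormat format = new DecimalFormat("0.00", symbols); // pattern assumed for this sketch
        format.setParseBigDecimal(true);                           // return BigDecimal instead of Double/Long
        try {
            return (BigDecimal) format.parse(cleaned);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a valid amount: " + value, e);
        }
    }

    public static void main(String[] args) {
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        System.out.println(iso.format(parseDate("2015-10-30", "yyyy-MM-dd", "dd/MM/yyyy")));
        System.out.println(iso.format(parseDate("21/05/2001", "yyyy-MM-dd", "dd/MM/yyyy")));
        System.out.println(parseAmount("$12,90"));
        System.out.println(parseAmount("$0,00"));
    }
}
```

Setting lenient to false matters when several patterns are listed: a lenient parser may accept a value under the wrong pattern by rolling out-of-range fields over instead of failing and letting the next pattern be tried.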

Using your own conversions in annotations

You can create your own implementation of the Conversion interface so it can be used in fields annotated with Parsed.

A varargs constructor can be used on your implementation so it can also be initialized with String... args.

For example, the following custom implementation splits a String into a String[] based on a given delimiter.

class Splitter implements Conversion<String, String[]> {

    private String separator;

    public Splitter(String... args) {
        if(args.length == 0){
            separator = ",";
        } else {
            separator = args[0];
        }
    }

    @Override
    public String[] execute(String input) {
        if(input == null){
            return new String[0];
        }
        return input.split(separator);
    }

    @Override
    public String revert(String[] input) {
        StringBuilder out = new StringBuilder();
        for (String value : input) {
            if (out.length() > 0) {
                out.append(separator);
            }
            out.append(value);
        }
        return out.toString();
    }
}
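One thing to keep in mind with a conversion like this: String.split interprets its argument as a regular expression, so a separator such as "|" or "." would not behave as a literal delimiter. The sketch below (a hypothetical helper, independent of the Conversion interface) shows one way to treat the separator literally using Pattern.quote.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class LiteralSplitSketch {

    // Splits on the separator taken literally, even if it contains regex metacharacters.
    public static String[] splitLiteral(String input, String separator) {
        if (input == null) {
            return new String[0];
        }
        return input.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        // As a regex, "|" matches the empty string, so this splits between every character.
        System.out.println(Arrays.toString("cheater|snitch".split("|")));

        System.out.println(Arrays.toString(splitLiteral("cheater|snitch", "|"))); // [cheater, snitch]
        System.out.println(Arrays.toString(splitLiteral("cheater;snitch", ";"))); // [cheater, snitch]
    }
}
```

For the ‘;’ separator used in this tutorial the plain split call is sufficient, since ‘;’ has no special meaning in regular expressions.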

Now, assuming the parser produces the following records:

id          user                    words
123.........theuser........cheater;snitch
222.........otheruser..............camper

We can create a class such as:

class Offender {

    @Nested
    private Profile profile;

    @Parsed
    @Convert(conversionClass = Splitter.class, args = ";")
    private String[] words;

    @Override
    public String toString() {
        return "Offender{" +
                "id=" + profile.getId() +
                ", user='" + profile.getUser() +
                "', words=" + Arrays.toString(words) +
                '}';
    }
}

Which will make use of our custom conversion to split words in the words column based on the ‘;’ separator. Our Offender objects will be populated with:

Offender{id=123, user='THEUSER', words=[cheater, snitch]}
Offender{id=222, user='OTHERUSER', words=[camper]}

We also used the Nested annotation to reuse the existing Profile class. We are going to talk more about it in the next section.

Annotations in methods

All annotations can be put on the methods of your class instead of directly into its fields. The BetterOffender class below makes the previous Offender example that depends on a custom conversion look silly:

class BetterOffender {

    @Nested
    private Profile profile;

    private String[] words;

    @Parsed(field = "words")
    public void setWords(String words) {
        this.words = words.split(";");
    }

    @Override
    public String toString() {
        return "BetterOffender{" +
                "id=" + profile.getId() +
                ", user='" + profile.getUser() +
                "', words=" + Arrays.toString(words) +
                '}';
    }
}

Just keep in mind that if you intend to use the same class for writing, you will want to annotate the “getter” method as well so the attribute value can be read from your objects.

Nested classes

The @Nested annotation works to populate nested types in your class structure from a single input row. This means that the nested attributes have their headers collected and joined together with the headers of the parent class.

Let’s look again at the BetterOffender class:

class BetterOffender {

    @Nested
    private Profile profile;

    private String[] words;

    @Parsed(field = "words")
    public void setWords(String words) {
        this.words = words.split(";");
    }

    @Override
    public String toString() {
        return "BetterOffender{" +
                "id=" + profile.getId() +
                ", user='" + profile.getUser() +
                "', words=" + Arrays.toString(words) +
                '}';
    }
}

This class structure is capable of reading the following input columns of a given result:

  • id, user, created_at, fees, admin, type, stars in the Profile class (as defined in the annotations in each attribute of Profile)

PLUS

  • words from the BetterOffender class itself (the field words is defined in the annotation of the setWords method)

So this input:

id          user                    words
123.........theuser........cheater;snitch
222.........otheruser..............camper

Can be parsed into instances of BetterOffender, whose toString will show:

BetterOffender{id=123, user='THEUSER', words=[cheater, snitch]}
BetterOffender{id=222, user='OTHERUSER', words=[camper]}

Multiple nested attributes

In some cases you will want more than one @Nested attribute of the same type. For example, consider an AddressBook that has two nested Address attributes:

class AddressBook {

    @Nested
    private Address mailingAddress;

    @Nested
    private Address mainAddress;
}

Where the Address class is simply:

class Address {

    @Parsed
    private String street;

    @Parsed
    private String city;

    @Parsed
    private String state;

    @Override
    public String toString() {
        return "Address{" +
                "street='" + street + '\'' +
                ", city='" + city + '\'' +
                ", state='" + state + '\'' +
                '}';
    }
}

The problem here is that mailingAddress and mainAddress have the same type and consequently the same field names. Luckily we can transform the field names used by each nested Address with the help of a HeaderTransformer.

Assume the parsed records have this structure:

mail_street      mail_city  mail_state   main_street      main_city  main_state
12 Some St.......Sydney............NSW...23 The St........Brisbane..........QLD
450 Other St.....Adelaide...........SA...2 William Av.....Hobart............TAS

As the column names are prefixed with mail for the mailing address and main for the main address, we can implement a HeaderTransformer such as:

class AddressTypeTransformer extends HeaderTransformer {

    private String prefix;

    public AddressTypeTransformer(String... args) {
        prefix = args[0];
    }

    @Override
    public String transformName(Field field, String name) {
        return prefix + "_" + name;
    }
}

Which will insert a given prefix to the field names in our nested Address attributes. The AddressBook class must be updated to make use of our AddressTypeTransformer:

class AddressBook {

    @Nested(headerTransformer = AddressTypeTransformer.class, args = "mail")
    private Address mailingAddress;

    @Nested(headerTransformer = AddressTypeTransformer.class, args = "main")
    private Address mainAddress;

    @Override
    public String toString() {
        return "AddressBook{" +
                "mailing=" + mailingAddress +
                ", main=" + mainAddress +
                '}';
    }
}

With this, mailingAddress.street will get the values from column mail_street, and mainAddress.street from main_street. The other attributes will be mapped similarly.
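As a quick illustration of the naming rule only, the following plain-Java sketch applies the same `prefix + "_" + name` expression used in transformName to the three Address field names. The class and method here are hypothetical and do not involve the HeaderTransformer API itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderPrefixSketch {

    // Applies the same rule as AddressTypeTransformer.transformName to a list of field names.
    public static List<String> prefixed(String prefix, List<String> fieldNames) {
        List<String> out = new ArrayList<>();
        for (String name : fieldNames) {
            out.add(prefix + "_" + name);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> addressFields = Arrays.asList("street", "city", "state");
        System.out.println(prefixed("mail", addressFields)); // [mail_street, mail_city, mail_state]
        System.out.println(prefixed("main", addressFields)); // [main_street, main_city, main_state]
    }
}
```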

When converting the input records to Java beans, we get the correct mapping:

AddressBook{mailing=Address{street='12 Some St', city='Sydney', state='NSW'}, main=Address{street='23 The St', city='Brisbane', state='QLD'}}
AddressBook{mailing=Address{street='450 Other St', city='Adelaide', state='SA'}, main=Address{street='2 William Av', city='Hobart', state='TAS'}}

Avoiding repetitive annotations

In many cases a class can have multiple attributes that require the usage of the same set of annotations, over and over. For example, an input such as:

id   created_at  updated_at  deleted_at
123..2010-12-20..2018-02-05..2018-04-01
456..2011-01-29..2017-06-26......-.....

All dates are formatted using the yyyy-MM-dd format, or shown as a ‘-’ if the date is not available (i.e. null).

When dealing with this, the first approach people tend to take is to define a class with repetitive annotations. It usually ends up looking something like this:

class DatesRepetitive {

    @Parsed
    private Long id;

    @Parsed(field = "created_at")
    @NullString(nulls = "-")
    @Format(formats = "yyyy-MM-dd")
    private Date createdAt;

    @Parsed(field = "updated_at")
    @NullString(nulls = "-")
    @Format(formats = "yyyy-MM-dd")
    private Date updatedAt;

    @Parsed(field = "deleted_at")
    @NullString(nulls = "-")
    @Format(formats = "yyyy-MM-dd")
    private Date deletedAt;

    // remaining members omitted
}

Maybe it’s not so bad to add the same 3 annotations to every dated field in your class, but it can get much worse if you need to add the same annotations in multiple classes, and many more fields. To avoid repetition, you can define a meta-annotation instead.

A meta-annotation is your own annotation type which you can use in place of multiple annotations. You can create a @MyCompanyDate annotation type easily and use in your classes to make them look like this:

class DatesWithMetaAnnotation {

    @Parsed
    private Long id;

    @MyCompanyDate(field = "created_at")
    private Date createdAt;

    @MyCompanyDate(field = "updated_at")
    private Date updatedAt;

    @MyCompanyDate(field = "deleted_at")
    private Date deletedAt;

    // remaining members omitted
}

The definition of @MyCompanyDate is relatively straightforward:

//default annotation stuff - you'll probably just want to copy and paste this
@Retention(RetentionPolicy.RUNTIME)
@Inherited
@Target({ElementType.FIELD, ElementType.METHOD, ElementType.ANNOTATION_TYPE})

//your common annotations go here
@Parsed
@NullString(nulls = "-")
@Format(formats = "yyyy-MM-dd")

public @interface MyCompanyDate {

    @Copy(to = Parsed.class) //copies the value provided in `MyCompanyDate.field` to `Parsed.field`
    String field() default "";

} 

So all you have to do is to add the group of annotations you use everywhere to the meta-annotation declaration itself.

We added a field property to the meta-annotation with the intent of allowing you to map attributes to a custom column name of your results. This works exactly like the field property of the @Parsed annotation because the @Copy annotation copies the value set in @MyCompanyDate.field into @Parsed.field.

The Copy annotation can copy any value of any property of your meta-annotation. Where property names are different, you need to provide the target property name. For example, we could add a position attribute to the meta-annotation and copy its value to Parsed.index with:

@Copy(to = Parsed.class, property = "index")
int position() default -1;

This is very useful when you find yourself using the same annotations over and over. You can even create meta-annotations using other meta-annotations.
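For readers curious about why the "default annotation stuff" is needed, the sketch below shows in plain Java how a framework could discover an annotation that sits on another annotation type. It is only an illustration of the reflection mechanism, not how univocity implements it: @Marker and @CompanyDate are hypothetical stand-ins (think of @Marker as something like @NullString), and the lookup only goes one level deep.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

public class MetaAnnotationSketch {

    // Hypothetical stand-in for a library annotation.
    // ANNOTATION_TYPE lets it be placed on other annotations; RUNTIME keeps it visible to reflection.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.FIELD, ElementType.ANNOTATION_TYPE})
    public @interface Marker {
        String value();
    }

    // Hypothetical meta-annotation: it carries @Marker so that fields don't have to.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @Marker("-")
    public @interface CompanyDate {
    }

    public static class Sample {
        @CompanyDate
        public String createdAt;
    }

    // Looks for @Marker directly on the field, then on each annotation present on the field.
    public static Marker findMarker(Class<?> type, String fieldName) {
        Field field;
        try {
            field = type.getField(fieldName);
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException("No public field named " + fieldName, e);
        }
        Marker direct = field.getAnnotation(Marker.class);
        if (direct != null) {
            return direct;
        }
        for (Annotation annotation : field.getAnnotations()) {
            Marker viaMeta = annotation.annotationType().getAnnotation(Marker.class);
            if (viaMeta != null) {
                return viaMeta;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(findMarker(Sample.class, "createdAt").value()); // prints "-"
    }
}
```

Without RUNTIME retention on both annotation types, the reflective lookup would find nothing, which is why those lines are worth copying as they are.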

Validation

You can add a @Validate annotation to any attribute or method of your class to perform basic data validations. By default, nulls and blanks (i.e. the empty string "" or values made up only of whitespace, such as " \n") are not allowed.


    @Parsed
    @Validate
    public String notNullNotBlank;
    
    @Parsed
    @Validate(nullable = true)
    public String nullButNotBlank;
    
    @Parsed
    @Validate(allowBlanks = true)
    public String notNullButBlank;

You can also specify acceptable values:

    @Parsed
    @Validate(oneOf = {"a", "b"})
    public String a_or_b;

And invalid values:

    @Parsed
    @Validate(noneOf = {"a", "b"})
    public String not_a_nor_b;

Finally, you can also enforce a given format using regular expressions:

    @Parsed
    @Validate(matches = "^[^\\d\\s]+$")
    public String noDigitsNoSpaces;
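To get a feel for what these rules accept and reject, the checks can be tried out in isolation with plain Java. The helper below is a hypothetical sketch, not the library's validation code. In particular, it assumes the regular expression must match the entire value, and it treats null as a failed match.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class ValidationRulesSketch {

    private static final Pattern NO_DIGITS_NO_SPACES = Pattern.compile("^[^\\d\\s]+$");

    // matches = "^[^\\d\\s]+$": one or more characters, none of them a digit or whitespace
    public static boolean noDigitsNoSpaces(String value) {
        return value != null && NO_DIGITS_NO_SPACES.matcher(value).matches();
    }

    // oneOf = {...}: the value must be one of the listed strings
    public static boolean oneOf(String value, String... accepted) {
        return Arrays.asList(accepted).contains(value);
    }

    // noneOf = {...}: the value must not be any of the listed strings
    public static boolean noneOf(String value, String... rejected) {
        return !Arrays.asList(rejected).contains(value);
    }

    public static void main(String[] args) {
        System.out.println(noDigitsNoSpaces("theuser"));  // true
        System.out.println(noDigitsNoSpaces("the user")); // false
        System.out.println(noDigitsNoSpaces("user42"));   // false
        System.out.println(oneOf("a", "a", "b"));         // true
        System.out.println(noneOf("c", "a", "b"));        // true
    }
}
```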

When a validation fails, a DataValidationException will be thrown. You can handle such errors with a ProcessorErrorHandler to discard invalid records or use a RetryableErrorHandler to try and recover from the error, as described in the Recovering from errors section of the univocity-parsers tutorial.

Custom validations

You can also create complex validators by implementing the Validator interface. For example, the following class rejects negative numbers (zero is accepted):

    public class Positive implements Validator<Integer> {
        @Override
        public String validate(Integer value) {
            if (value < 0) {
                return "value must be positive or zero";
            }
            return null; //all good, no validation error messages to return
        }
    }

All we need now is to instruct the @Validate annotation to use our Positive validation:

    @Parsed
    @Validate(validators = Positive.class)
    public int positive;

Note that the validators property accepts multiple validation classes, which allows you to compose any custom validation sequence at will.
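The idea of a validation sequence can be pictured with a small plain-Java sketch. It uses java.util.function.Function as a stand-in for the Validator interface, following the same contract of returning an error message or null. How the library actually combines the results of several validators is not covered here, so the "first error wins" behaviour, the null handling and the EVEN rule are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class ValidatorChainSketch {

    // Stand-in validators: each returns an error message, or null when the value is acceptable.
    public static final Function<Integer, String> POSITIVE =
            value -> (value != null && value < 0) ? "value must be positive or zero" : null;

    public static final Function<Integer, String> EVEN =
            value -> (value != null && value % 2 != 0) ? "value must be even" : null;

    // Runs the validators in order and returns the first error message, or null if all pass.
    public static String firstError(Integer value, List<Function<Integer, String>> validators) {
        for (Function<Integer, String> validator : validators) {
            String error = validator.apply(value);
            if (error != null) {
                return error;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<Function<Integer, String>> chain = Arrays.asList(POSITIVE, EVEN);
        System.out.println(firstError(-3, chain)); // value must be positive or zero
        System.out.println(firstError(3, chain));  // value must be even
        System.out.println(firstError(4, chain));  // null
    }
}
```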

Linked results

The @Linked annotation is currently available only to the HTML parser. It allows you to populate single objects, arrays, collections and maps of different types from results that are linked to a parent record. The association among records can exist naturally (via results collected through link following), or be established manually by joining results collected by multiple entities in the same page.

Proceed to Reading linked results into java beans to learn more.