Reading data into java beans
Both univocity-html-parser and univocity-parsers support the same annotations to generate Java beans from rows parsed from the input. You can follow this tutorial to learn about all supported annotations for both libraries.
All annotations you can use are in the package com.univocity.parsers.annotations.
The HTML parser has a couple of additional annotations that handle nested collections and maps of objects, which are showcased in the last two sections of this tutorial.
Let’s get started.
Creating a class with annotations
Every attribute of your classes that should be populated with values collected from a given result must be annotated with one of:
- @Parsed - to bind the attribute of your class to a field of your results.
- @Nested - to handle nested classes when your attribute is of a custom type (i.e. a class) that has @Parsed and @Nested attributes.
- @Linked - to handle nested collections, arrays and maps (currently available to the HTML parser only).
@Parsed is the core annotation to use and is the only one that actually maps fields in your results to an attribute.
Let’s assume your data has the following structure (shown in a fixed-width format for clarity):
profile_id username followers
123.........theuser........200.......
1...........admin....................
You can create a class such as:
class ProfileByFieldName {
    @Parsed(field = "profile_id")
    private Long profileId;

    @Parsed
    private String username;

    @Parsed
    private int followers;

    @Override
    public String toString() {
        return "ProfileByFieldName{" +
                "profileId=" + profileId +
                ", username='" + username + '\'' +
                ", followers=" + followers +
                '}';
    }
}
This maps attribute names to the corresponding headers of your data. Names are automatically mapped if they are the same in both your data and in your class. In the example above the column profile_id doesn’t match the attribute name profileId, so we had to explicitly map the field name with @Parsed(field = "profile_id").
Alternatively, you can map your attributes to the position of each column of your input:
class ProfileByFieldPosition {
    @Parsed(index = 0)
    private Long profileId;

    @Parsed(index = 1)
    private String username;

    @Parsed(index = 2)
    private int followers;

    @Override
    public String toString() {
        return "ProfileByFieldPosition{" +
                "profileId=" + profileId +
                ", username='" + username + '\'' +
                ", followers=" + followers +
                '}';
    }
}
Lastly, if you intend to use the same class to hold values of different sets of results, with different column names, each attribute can be mapped to multiple possible names:
class ProfileByMultipleFieldNames {
    @Parsed(field = {"profile_id", "id"})
    private Long profileId;

    @Parsed(field = {"username", "user"})
    private String username;

    @Parsed
    private int followers;

    @Override
    public String toString() {
        return "ProfileByMultipleFieldNames{" +
                "profileId=" + profileId +
                ", username='" + username + '\'' +
                ", followers=" + followers +
                '}';
    }
}
With multiple field names defined, the input data shown earlier should produce the following instances:
ProfileByMultipleFieldNames{profileId=123, username='theuser', followers=200}
ProfileByMultipleFieldNames{profileId=1, username='admin', followers=0}
A different set of input rows can also be processed:
id user created_at fees type admin stars
123.........theuser........2015-10-30...$12,90..U.....no.....143..
1...........admin..........21/05/2001....$0,00..S.....yes....?....
This will produce the same result, except that the “followers” attribute is left at 0 because that column is not present in this input:
ProfileByMultipleFieldNames{profileId=123, username='theuser', followers=0}
ProfileByMultipleFieldNames{profileId=1, username='admin', followers=0}
Exploring the available annotations
Let’s have a look again at the following user profile data:
id user created_at fees type admin stars
123.........theuser........2015-10-30...$12,90..U.....no.....143..
1...........admin..........21/05/2001....$0,00..S.....yes....?....
To populate a class attribute from each column, we need to handle dates and amounts that may come in different formats. We also need to convert “yes” and “no” to a proper boolean value. On top of that, there’s a “type” column that could be mapped to an enum.
This actually involves very little work. We can simply define a class such as:
class Profile {
    enum Type {
        USER('U'),
        SYSTEM('S');

        public final char typeCode;

        Type(char typeCode) {
            this.typeCode = typeCode;
        }
    }

    @Parsed
    private Long id;

    @Parsed
    @Trim
    @UpperCase
    private String user;

    @Parsed(field = "created_at")
    @Format(formats = {"yyyy-MM-dd", "dd/MM/yyyy"}, options = "locale=en;lenient=false")
    private Date createdAt;

    @Parsed
    @Replace(expression = "\\$", replacement = "")
    @Format(formats = {"#0,00"}, options = "decimalSeparator=,")
    private BigDecimal fees;

    @Parsed
    @BooleanString(trueStrings = {"yes", "y"}, falseStrings = {"no", "n"})
    private boolean admin;

    @Parsed
    @EnumOptions(customElement = "typeCode")
    private Type type;

    @Parsed(field = "stars")
    @NullString(nulls = {"?", "N/A"})
    private Integer stars;

    @Override
    public String toString() {
        return "Profile{" +
                "id=" + id +
                ", user='" + user + '\'' +
                ", createdAt=" + createdAt +
                ", fees=" + fees +
                ", admin=" + admin +
                ", type=" + type +
                ", stars=" + stars +
                '}';
    }
    //
And the objects generated from that input will be:
Profile{id=123, user='THEUSER', createdAt=Fri Oct 30 00:00:00 CST 2015, fees=12.90, admin=false, type=USER, stars=143}
Profile{id=1, user='ADMIN', createdAt=Mon May 21 00:00:00 CST 2001, fees=0.00, admin=true, type=SYSTEM, stars=null}
Let’s explain each annotation:
- Trim: trims any whitespace around a value;
- UpperCase: upper cases the value;
- LowerCase: lower cases the value;
- Replace: performs a String.replaceAll operation on the value, matching the regular expression provided in the expression property and replacing it with the replacement string provided. In our example we used it to remove the ‘$’ character from values in the “fees” column.
- NullString: converts a String in the input record to null, as it represents a missing value. In our example, values “?” or “N/A” will be converted to null.
- Format: applies to Date, Calendar or any numeric type attribute and accepts multiple formats. The formats and options to use depend on the attribute type.
  - with dates, the formats you define are handled by Java’s SimpleDateFormat, so they must conform to the patterns accepted by that class. The options allow you to configure properties of the SimpleDateFormat that is created internally. You must provide key-value pairs separated by a ;. In the given example “locale=en” will internally execute new SimpleDateFormat("yyyy-MM-dd", new Locale("en")); and “lenient=false” executes simpleDateFormat.setLenient(false);.
  - numeric attributes should have formats that conform to the rules of Java’s DecimalFormat. Similarly to the date patterns, options will set attributes of the DecimalFormat instance used to parse the input values, as well as the associated DecimalFormatSymbols.
- BooleanString: accepts sets of multiple Strings that represent either true or false. If the records collected by the parser have a value that matches one of the trueStrings, the attribute will be set to true. Otherwise, false will be determined from the set of falseStrings. If neither set matches, a DataProcessingException will be raised informing you that no boolean value can be determined.
- EnumOptions: converts the plain String in the input record to an enumeration value. Enumerations are mapped automatically if one of the following matches the input value, in order of precedence:
  - the name of a value of the enum;
  - the ordinal value (i.e. the input value is a numeric representation of the enumeration’s ordinal);
  - the toString() of a value of the enum.
  This order can be redefined in the selectors attribute of EnumOptions.
  In the example we used, the Type enumeration has a typeCode character that represents each human-readable value. The records we are processing use that character, and we can easily map it by setting the customElement in @EnumOptions. The parser will discover what values to use based on the typeCode and map them to the Type values accordingly.
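The effect that the Format options are described as having can be reproduced with the JDK classes alone. The sketch below is not the library's code: FormatOptionsSketch, parseDate and parseAmount are hypothetical names, Locale.ENGLISH stands in for new Locale("en"), and the numeric pattern "#0.00" was chosen for this sketch rather than taken from the example. It tries each date pattern in turn with lenient parsing switched off, and parses an amount whose decimal separator is a comma.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class FormatOptionsSketch {

    // Tries every pattern in order and returns the first successful parse.
    static Date parseDate(String input, String... patterns) {
        for (String pattern : patterns) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.ENGLISH); // "locale=en"
            format.setLenient(false);                                                // "lenient=false"
            try {
                return format.parse(input);
            } catch (ParseException e) {
                // this pattern does not fit the input, try the next one
            }
        }
        throw new IllegalArgumentException("Unparseable date: " + input);
    }

    // Parses an amount such as "12,90" where ',' is the decimal separator.
    static BigDecimal parseAmount(String input) {
        DecimalFormatSymbols symbols = new DecimalFormatSymbols(Locale.ENGLISH);
        symbols.setDecimalSeparator(',');   // "decimalSeparator=,"
        symbols.setGroupingSeparator('.');  // keep the two separators distinct
        DecimalFormat format = new DecimalFormat("#0.00", symbols);
        format.setParseBigDecimal(true);    // return BigDecimal instead of Double/Long
        try {
            return (BigDecimal) format.parse(input);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable amount: " + input, e);
        }
    }

    public static void main(String[] args) {
        SimpleDateFormat iso = new SimpleDateFormat("yyyy-MM-dd");
        System.out.println(iso.format(parseDate("2015-10-30", "yyyy-MM-dd", "dd/MM/yyyy")));
        System.out.println(iso.format(parseDate("21/05/2001", "yyyy-MM-dd", "dd/MM/yyyy")));
        System.out.println(parseAmount("12,90"));
    }
}
```

With lenient parsing disabled, an impossible date such as 2015-02-30 is rejected instead of being rolled over into March, which is usually what you want when the input may mix formats.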
Using your own conversions in annotations
You can create your own implementation of the Conversion interface so it can be used in fields annotated with Parsed.
A varargs constructor can be used on your implementation so it can also be initialized with String... args.
For example, the following custom implementation splits a String into a String[] based on a given delimiter.
class Splitter implements Conversion<String, String[]> {
    private String separator;

    public Splitter(String... args) {
        if (args.length == 0) {
            separator = ",";
        } else {
            separator = args[0];
        }
    }

    @Override
    public String[] execute(String input) {
        if (input == null) {
            return new String[0];
        }
        return input.split(separator);
    }

    @Override
    public String revert(String[] input) {
        StringBuilder out = new StringBuilder();
        for (String value : input) {
            if (out.length() > 0) {
                out.append(separator);
            }
            out.append(value);
        }
        return out.toString();
    }
}
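One caveat applies to the separator: String.split treats its argument as a regular expression, so a separator such as | or . would not be matched literally by the Splitter above. The sketch below uses only the JDK to show the difference; SeparatorSketch and its methods are made-up names for the illustration, and quoting with Pattern.quote is one possible way to get literal matching, not something the library requires.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class SeparatorSketch {

    // Same call the Splitter makes: the separator is interpreted as a regex.
    static String[] splitAsRegex(String input, String separator) {
        return input.split(separator);
    }

    // Quotes the separator first, so every character in it is matched literally.
    static String[] splitLiterally(String input, String separator) {
        return input.split(Pattern.quote(separator));
    }

    public static void main(String[] args) {
        // ';' has no special meaning in a regex, so both approaches agree
        System.out.println(Arrays.toString(splitAsRegex("cheater;snitch", ";")));
        System.out.println(Arrays.toString(splitLiterally("cheater;snitch", ";")));

        // '|' means "or" in a regex, so the unquoted split does not yield two words
        System.out.println(Arrays.toString(splitAsRegex("cheater|snitch", "|")));
        System.out.println(Arrays.toString(splitLiterally("cheater|snitch", "|")));
    }
}
```

If your separators are always plain characters such as ; or , the original Splitter behaves as expected.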
Now, assuming the parser produces the following records:
id user words
123.........theuser........cheater;snitch
222.........otheruser..............camper
We can create a class such as:
class Offender {
    @Nested
    private Profile profile;

    @Parsed
    @Convert(conversionClass = Splitter.class, args = ";")
    private String[] words;

    @Override
    public String toString() {
        return "Offender{" +
                "id=" + profile.getId() +
                ", user='" + profile.getUser() +
                "', words=" + Arrays.toString(words) +
                '}';
    }
}
This will make use of our custom conversion to split the words in the words column based on the ‘;’ separator. Our Offender objects will be populated with:
Offender{id=123, user='THEUSER', words=[cheater, snitch]}
Offender{id=222, user='OTHERUSER', words=[camper]}
We also used the Nested annotation to reuse the existing Profile class. We are going to talk more about it in the next section.
Annotations in methods
All annotations can be put on the methods of your class instead of directly on its fields. The BetterOffender class below produces the same result as the previous Offender example without needing a custom conversion:
class BetterOffender {
    @Nested
    private Profile profile;

    private String[] words;

    @Parsed(field = "words")
    public void setWords(String words) {
        this.words = words.split(";");
    }

    @Override
    public String toString() {
        return "BetterOffender{" +
                "id=" + profile.getId() +
                ", user='" + profile.getUser() +
                "', words=" + Arrays.toString(words) +
                '}';
    }
}
Just keep in mind that if you intend to use the same class for writing, you will want to annotate the “getter” method as well so the attribute value can be read from your objects.
Nested classes
The @Nested annotation works to populate nested types in your class structure from a single input row. This means that the nested attributes have their headers collected and joined together with the headers of the parent class.
Let’s look again at the BetterOffender class:
class BetterOffender {
    @Nested
    private Profile profile;

    private String[] words;

    @Parsed(field = "words")
    public void setWords(String words) {
        this.words = words.split(";");
    }

    @Override
    public String toString() {
        return "BetterOffender{" +
                "id=" + profile.getId() +
                ", user='" + profile.getUser() +
                "', words=" + Arrays.toString(words) +
                '}';
    }
}
This class structure is capable of reading the following input columns of a given result:
- id, user, created_at, fees, admin, type, stars in the Profile class (as defined in the annotations in each attribute of Profile)
PLUS
- words from the BetterOffender class itself (the field words is defined in the annotation of the setWords method)
So this input:
id user words
123.........theuser........cheater;snitch
222.........otheruser..............camper
Can be parsed into instances of BetterOffender, whose toString will show:
BetterOffender{id=123, user='THEUSER', words=[cheater, snitch]}
BetterOffender{id=222, user='OTHERUSER', words=[camper]}
Multiple nested attributes
In some cases you will want to use the same type for more than one @Nested attribute. For example, consider the AddressBook that has two nested Address attributes:
class AddressBook {
    @Nested
    private Address mailingAddress;

    @Nested
    private Address mainAddress;
}
Where the Address class is simply:
class Address {
    @Parsed
    private String street;

    @Parsed
    private String city;

    @Parsed
    private String state;

    @Override
    public String toString() {
        return "Address{" +
                "street='" + street + '\'' +
                ", city='" + city + '\'' +
                ", state='" + state + '\'' +
                '}';
    }
}
The problem here is that mailingAddress and mainAddress have the same type and consequently the same field names. Luckily we can transform the field names used by each nested Address with the help of a HeaderTransformer.
Suppose the parsed records have this structure:
mail_street mail_city mail_state main_street main_city main_state
12 Some St.......Sydney............NSW...23 The St........Brisbane..........QLD
450 Other St.....Adelaide...........SA...2 William Av.....Hobart............TAS
As the column names are prefixed with mail for the mailing address and main for the main address, we can implement a HeaderTransformer such as:
class AddressTypeTransformer extends HeaderTransformer {
    private String prefix;

    public AddressTypeTransformer(String... args) {
        prefix = args[0];
    }

    @Override
    public String transformName(Field field, String name) {
        return prefix + "_" + name;
    }
}
This will prepend a given prefix to the field names in our nested Address attributes. The AddressBook class must be updated to make use of our AddressTypeTransformer:
class AddressBook {
    @Nested(headerTransformer = AddressTypeTransformer.class, args = "mail")
    private Address mailingAddress;

    @Nested(headerTransformer = AddressTypeTransformer.class, args = "main")
    private Address mainAddress;

    @Override
    public String toString() {
        return "AddressBook{" +
                "mailing=" + mailingAddress +
                ", main=" + mainAddress +
                '}';
    }
}
With this, mailingAddress.street will get the values from column mail_street, and mainAddress.street from main_street. The other attributes will be mapped similarly.
When converting the input records to java beans, we get the correct mapping:
AddressBook{mailing=Address{street='12 Some St', city='Sydney', state='NSW'}, main=Address{street='23 The St', city='Brisbane', state='QLD'}}
AddressBook{mailing=Address{street='450 Other St', city='Adelaide', state='SA'}, main=Address{street='2 William Av', city='Hobart', state='TAS'}}
Avoiding repetitive annotations
In many cases a class can have multiple attributes that require the usage of the same set of annotations, over and over. For example, an input such as:
id created_at updated_at deleted_at
123..2010-12-20..2018-02-05..2018-04-01
456..2011-01-29..2017-06-26......-.....
All dates are formatted using the yyyy-MM-dd format, or shown as a ‘-’ if the date is not available (i.e. null).
When dealing with this, the obvious first approach is to define a class with repetitive annotations. It usually becomes something like this:
class DatesRepetitive {
    @Parsed
    private Long id;

    @Parsed(field = "created_at")
    @NullString(nulls = "-")
    @Format(formats = "yyyy-MM-dd")
    private Date createdAt;

    @Parsed(field = "updated_at")
    @NullString(nulls = "-")
    @Format(formats = "yyyy-MM-dd")
    private Date updatedAt;

    @Parsed(field = "deleted_at")
    @NullString(nulls = "-")
    @Format(formats = "yyyy-MM-dd")
    private Date deletedAt;
    //
Maybe it’s not so bad to add the same 3 annotations to every dated field in your class, but it can get much worse if you need to add the same annotations in multiple classes, and many more fields. To avoid repetition, you can define a meta-annotation instead.
A meta-annotation is your own annotation type which you can use in place of multiple annotations. You can easily create a @MyCompanyDate annotation type and use it in your classes to make them look like this:
class DatesWithMetaAnnotation {
    @Parsed
    private Long id;

    @MyCompanyDate(field = "created_at")
    private Date createdAt;

    @MyCompanyDate(field = "updated_at")
    private Date updatedAt;

    @MyCompanyDate(field = "deleted_at")
    private Date deletedAt;
    //
The definition of @MyCompanyDate is relatively straightforward:
//default annotation stuff - you'll probably just want to copy and paste this
@Retention(RetentionPolicy.RUNTIME)
@Inherited
@Target({ElementType.FIELD, ElementType.METHOD, ElementType.ANNOTATION_TYPE})

//your common annotations go here
@Parsed
@NullString(nulls = "-")
@Format(formats = "yyyy-MM-dd")
public @interface MyCompanyDate {

    @Copy(to = Parsed.class) //copies the value provided in `MyCompanyDate.field` to `Parsed.field`
    String field() default "";
}
So all you have to do is to add the group of annotations you use everywhere to the meta-annotation declaration itself.
We added a field property to the meta-annotation with the intent of allowing you to map attributes to a custom column name of your results. This works exactly like the field property of the @Parsed annotation because the @Copy annotation copies the value set in @MyCompanyDate.field into @Parsed.field.
The Copy annotation can copy any value of any property of your meta-annotation. Where property names are different, you need to provide the target property name. For example, we could add a position attribute to the meta-annotation and copy its value to Parsed.index with:
@Copy(to = Parsed.class, property = "index")
int position() default -1;
This is very useful when you find yourself using the same annotations over and over. You can even create meta-annotations using other meta-annotations.
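Meta-annotations are possible because plain Java reflection exposes the annotations placed on an annotation type. The sketch below shows only that general mechanism with the JDK; it is not how the library itself is implemented, and Marker, CompanyDate, Bean and markerValueOf are all hypothetical names invented for the illustration.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;

class MetaAnnotationSketch {

    // Hypothetical stand-in for a library annotation such as @NullString.
    @Retention(RetentionPolicy.RUNTIME)
    @Target({ElementType.FIELD, ElementType.ANNOTATION_TYPE})
    @interface Marker {
        String value();
    }

    // Hypothetical meta-annotation: it carries @Marker on its own declaration.
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @Marker("-")
    @interface CompanyDate {
    }

    static class Bean {
        @CompanyDate
        String createdAt;
    }

    // Looks for @Marker directly on the field, then on the field's annotations.
    static String markerValueOf(Class<?> type, String fieldName) {
        try {
            Field field = type.getDeclaredField(fieldName);
            Marker direct = field.getAnnotation(Marker.class);
            if (direct != null) {
                return direct.value();
            }
            for (Annotation annotation : field.getAnnotations()) {
                Marker viaMeta = annotation.annotationType().getAnnotation(Marker.class);
                if (viaMeta != null) {
                    return viaMeta.value();
                }
            }
            return null; // no marker found, directly or through a meta-annotation
        } catch (NoSuchFieldException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(markerValueOf(Bean.class, "createdAt")); // prints -
    }
}
```

This is also why the meta-annotation needs RUNTIME retention and ANNOTATION_TYPE among its targets if it is to be nested inside other meta-annotations.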
Validation
You can add a @Validate annotation to any attribute or method of your class to perform basic data validations. By default, nulls and blanks (i.e. the empty string "" or values that contain only whitespace, such as " \n") are not allowed.
@Parsed
@Validate
public String notNullNotBlank;
@Parsed
@Validate(nullable = true)
public String nullButNotBlank;
@Parsed
@Validate(allowBlanks = true)
public String notNullButBlank;
You can also specify acceptable values:
@Parsed
@Validate(oneOf = {"a", "b"})
public String a_or_b;
And invalid values:
@Parsed
@Validate(noneOf = {"a", "b"})
public String not_a_nor_b;
Finally, you can also enforce a given format using regular expressions:
@Parsed
@Validate(matches = "^[^\\d\\s]+$")
public String noDigitsNoSpaces;
When a validation fails, a DataValidationException will be thrown. You can handle such errors with a ProcessorErrorHandler to discard invalid records or use a RetryableErrorHandler to try and recover from the error, as described in the Recovering from errors section of the univocity-parsers tutorial.
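To see which values the pattern from the matches example lets through, you can try it with java.util.regex directly. This is a minimal sketch with hypothetical names (MatchesSketch, isValid); it assumes the whole value must match, which is what the ^ and $ anchors in the pattern ask for anyway.

```java
import java.util.regex.Pattern;

class MatchesSketch {

    // Same expression as in the @Validate(matches = ...) example.
    private static final Pattern NO_DIGITS_NO_SPACES = Pattern.compile("^[^\\d\\s]+$");

    // True when the value has at least one character and none is a digit or whitespace.
    static boolean isValid(String value) {
        return value != null && NO_DIGITS_NO_SPACES.matcher(value).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValid("admin"));    // true
        System.out.println(isValid("user123"));  // false: contains digits
        System.out.println(isValid("the user")); // false: contains a space
        System.out.println(isValid(""));         // false: '+' needs at least one character
    }
}
```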
Custom validations
You can also create complex validators by implementing the Validator interface. For example, the following class rejects negative numeric values:
public class Positive implements Validator<Integer> {
    @Override
    public String validate(Integer value) {
        //guard against null so that unboxing cannot throw a NullPointerException
        if (value != null && value < 0) {
            return "value must be positive or zero";
        }
        return null; //all good, no validation error messages to return
    }
}
All we need now is to instruct the @Validate annotation to use our Positive validation:
@Parsed
@Validate(validators = Positive.class)
public int positive;
Note that the validators property accepts multiple validation classes, which allows you to compose any custom validation sequence at will.
Linked results
The @Linked annotation is currently available only to the HTML parser, and allows you to populate single objects, arrays, collections and maps of different types from results that are linked to a parent record. The association among records can exist naturally (via results collected through link following), or manually by joining results collected by multiple entities in the same page.
Proceed to Reading linked results into java beans to learn more.