Reading linked results into Java beans
This tutorial demonstrates how to use the @Linked annotation to populate Java beans from records that are associated with a parent record. These associations can arise naturally through link following, but they can also be defined manually through the link operations provided by the Results returned by the parser. The link operation is also available from each individual Result element of a Results object. For simplicity, the following examples use manually linked records.
To help you understand how linked results can be mapped into objects, we built an example based on an input HTML page. The page contains a table with fuel prices from different petrol stations on different days of the week. We will use this example to populate the results into a class structure that organizes the data conveniently, but first we need to extract the information in some meaningful way.
The HTML code of this page is:
<table border="1">
<thead>
<tr>
<th></th>
<th>MON</th>
<th>TUE</th>
<th>WED</th>
</tr>
</thead>
<tbody>
<tr>
<th>Retailer</th>
<td>Caltex</td>
<td>Shell</td>
<td>Texaco</td>
</tr>
<tr>
<th>Prices</th>
<td>
<table>
<tr>
<td>UL</td>
<td>$1.32</td>
</tr>
<tr>
<td>D</td>
<td>$1.36</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>UL</td>
<td>$1.40</td>
</tr>
<tr>
<td>D</td>
<td>$1.33</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>UL</td>
<td>$1.52</td>
</tr>
<tr>
<td>D</td>
<td>$1.40</td>
</tr>
</table>
</td>
</tr>
<tr>
<th>Retailer</th>
<td>Shell</td>
<td>Texaco</td>
<td>BP</td>
</tr>
<tr>
<th>Prices</th>
<td>
<table>
<tr>
<td>UL</td>
<td>$1.10</td>
</tr>
<tr>
<td>D</td>
<td>$1.55</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>UL</td>
<td>$1.23</td>
</tr>
<tr>
<td>D</td>
<td>$1.90</td>
</tr>
</table>
</td>
<td>
<table>
<tr>
<td>UL</td>
<td>$1.69</td>
</tr>
<tr>
<td>D</td>
<td>$1.01</td>
</tr>
</table>
</td>
</tr>
</tbody>
<tfoot>
<tr>
<th>Reviewer</th>
<td><span>Bob</span><span>(001)</span></td>
<td><span>Adam</span><span>(007)</span></td>
<td><span>Jade</span><span>(099)</span></td>
</tr>
</tfoot>
</table>
As there is no particular CSS class or element ID in this document, the parsing code has to be built to collect data based on the HTML structure. We tried to explain each relevant matching rule in the comments:
HtmlEntityList htmlEntityList = new HtmlEntityList();
// captures all days of the week listed
HtmlEntitySettings dayOfWeek = htmlEntityList.configureEntity("dayOfWeek");
dayOfWeek.addField("day") // any <th> with at least 3 characters of text in them in the first <thead> of the document.
.matchFirst("thead").match("th").withText("???").getText();
// reads all petrol station names
HtmlEntitySettings petrolStation = htmlEntityList.configureEntity("petrolStation");
petrolStation.addField("petrol_station_name") //any <td> somewhere after a <th> with text "Retailer"
.match("td").precededBy("th").withText("Retailer").getText();
// collects all reviewer names and their IDs
HtmlEntitySettings reviewer = htmlEntityList.configureEntity("reviewer");
reviewer.addField("reviewer_name") //first <span> inside a <td> of a <tfoot>
.match("tfoot").match("td").matchFirst("span").getText();
reviewer.addField("reviewer_id") //last <span> inside a <td> of a <tfoot>; strips the surrounding '(' and ')' from the value
.match("tfoot").match("td").matchLast("span").getText().transform(s -> s.substring(1, s.length() - 1));
// collects fuel prices
HtmlEntitySettings fuel = htmlEntityList.configureEntity("fuel");
// creates a reusable path: matches every <tr> of any <table> that is contained in an outer <table>
PartialPath fuelPricePath = fuel.newPath().match("table").match("table").match("tr");
fuelPricePath.addField("fuel_type") // fields are added to the partial path
.matchFirst("td").getText(); // continue matching from where the path ends. Gets the text from the first <td>
fuelPricePath.addField("price")
.matchLast("td").getText(); // The fuel price is in the last <td> of the <tr> matched by the partial path.
// Now we add fields to our "fuel" that already exist in the other entities created above.
// We will use these fields to join records of each entity ('dayOfWeek', 'petrolStation', 'reviewer' and 'fuel')
fuel.addPersistentField("day") // captures the day of week of each fuel price listed.
//matches a <table> contained by an outer <table>. Then goes up to the first <th> that is above the inner <table> and grabs its text.
.match("table").match("table").upToHeader("th").getText();
fuel.addPersistentField("petrol_station_name")// captures the petrol station name of each fuel price listed.
// match a <td> that contains a <table>, then collects the text in the row above this <td>, from the same column.
.match("td").parentOf("table").getTextAbove();
fuel.addPersistentField("reviewer_id")// collects the reviewer ID under each fuel price listed.
// finds a <table> contained by a <table>. From the inner <table> look down to a <td> inside a <tfoot>
// that <td> should be in the same column of the inner <table>.
.match("table").match("table").downToFooter("td").containedBy("tfoot")
// From that <td>, gets the last <span> then collect the text between '(' and ')'
.matchLast("span").getText().transform(s -> s.substring(1, s.length() - 1));
//That's it, let's parse the HTML and get the results to see what data we get.
FileProvider inputFile = new FileProvider("documentation/tutorial/html/annotations/linkedEntityTest.html");
Results<HtmlParserResult> result = new HtmlParser(htmlEntityList).parse(inputFile);
This code will collect the following information for each one of the entities defined:
[ dayOfWeek ]
day__
MON
TUE
WED
[ fuel ]
fuel_type__price__day__petrol_station_name__reviewer_id__
UL $1.32 MON Caltex 001
D $1.36 MON Caltex 001
UL $1.40 TUE Shell 007
D $1.33 TUE Shell 007
UL $1.52 WED Texaco 099
D $1.40 WED Texaco 099
UL $1.10 MON Shell 001
D $1.55 MON Shell 001
UL $1.23 TUE Texaco 007
D $1.90 TUE Texaco 007
UL $1.69 WED BP 099
D $1.01 WED BP 099
[ petrolStation ]
petrol_station_name__
Caltex
Shell
Texaco
Shell
Texaco
BP
[ reviewer ]
reviewer_name__reviewer_id__
Bob 001
Adam 007
Jade 099
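The reviewer_id values listed above come without parentheses because of the transform applied to the last <span>. As a minimal standalone sketch (plain Java, no parser involved; the class name StripParentheses is only illustrative), this is what that lambda does to the raw text:

```java
import java.util.function.Function;

class StripParentheses {
    // Same lambda body as the transform used in the tutorial:
    // it drops the first and the last character of the text.
    static final Function<String, String> STRIP = s -> s.substring(1, s.length() - 1);

    public static void main(String[] args) {
        for (String raw : new String[]{"(001)", "(007)", "(099)"}) {
            System.out.println(raw + " -> " + STRIP.apply(raw));
        }
    }
}
```

Note that substring(1, s.length() - 1) drops the first and last characters whatever they are, and throws a StringIndexOutOfBoundsException for text shorter than two characters, so it relies on every ID being wrapped in parentheses.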
We can now link the records together with:
// fuel records are linked to each day of week
// from each "dayOfWeek" record we can now get the corresponding "fuel" records
result.link("dayOfWeek", "fuel");
// petrol station and reviewer records are linked to each fuel record
// from each "fuel" record we can now obtain the corresponding "petrolStation" and "reviewer" record.
result.link("fuel", "petrolStation", "reviewer");
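Conceptually, linking resembles a join on the fields that two entities have in common, such as "day" for dayOfWeek and fuel. The sketch below models that idea in plain Java with records held in maps. It is only a mental model under that assumption: LinkSketch, record and linked are hypothetical names, and the parser's actual linking logic may work differently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LinkSketch {
    // Builds a record from alternating field names and values.
    static Map<String, String> record(String... pairs) {
        Map<String, String> out = new LinkedHashMap<>();
        for (int i = 0; i < pairs.length; i += 2) {
            out.put(pairs[i], pairs[i + 1]);
        }
        return out;
    }

    // Assumption for illustration: a child is "linked" to a parent when both
    // records hold the same value in the shared field.
    static List<Map<String, String>> linked(Map<String, String> parent,
                                            List<Map<String, String>> children,
                                            String field) {
        List<Map<String, String>> out = new ArrayList<>();
        String expected = parent.get(field);
        for (Map<String, String> child : children) {
            if (expected != null && expected.equals(child.get(field))) {
                out.add(child);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map<String, String>> fuel = List.of(
            record("fuel_type", "UL", "price", "$1.32", "day", "MON"),
            record("fuel_type", "UL", "price", "$1.40", "day", "TUE"),
            record("fuel_type", "D", "price", "$1.55", "day", "MON"));
        Map<String, String> monday = record("day", "MON");
        // Prints only the fuel records whose "day" is MON.
        System.out.println(linked(monday, fuel, "day"));
    }
}
```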
Now that we have records linked to one another, we can explore the @Linked and @Group annotations to populate complex class structures.
Basic classes
First, we define a few basic classes. Let’s start with a Reviewer class to keep the information we see at the footer of the price table:
class Reviewer {
@Parsed(field = "reviewer_id")
public String id;
@Parsed(field = "reviewer_name")
public String name;
@Override
public String toString() {
return name + '-' + id;
}
}
The Reviewer class has its attributes mapped to fields “reviewer_id” and “reviewer_name” of the “reviewer” entity configured in the HtmlEntityList defined earlier.
For the fuel prices, we can begin by defining a FuelType enumeration to describe each fuel type code:
enum FuelType {
UNLEADED("UL"),
DIESEL("D");
public final String code;
FuelType(String code) {
this.code = code;
}
@Override
public String toString() {
return code;
}
}
Which we use on a Price class to store the fuel type and its price:
class Price {
@Parsed(field = "fuel_type")
public FuelType fuelType;
@Replace(expression = "\\$", replacement = "")
@Parsed
public BigDecimal price;
@Override
public String toString() {
return fuelType.name() + " = $" + price;
}
}
Here we are using the @Replace annotation to remove any dollar sign from the parsed value so it can be converted into a BigDecimal.
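To make the two conversions used by Price more concrete, here is a small sketch in plain Java that does by hand what the annotations are relied on for: it looks up a FuelType from its code and strips the dollar sign before building a BigDecimal. The helper names fuelTypeOf and priceOf are made up for this illustration, and matching the code against toString() is an assumption made for this sketch; the parser may resolve "UL" to UNLEADED differently.

```java
import java.math.BigDecimal;

class PriceConversionSketch {
    enum FuelType {
        UNLEADED("UL"), DIESEL("D");

        final String code;

        FuelType(String code) {
            this.code = code;
        }

        @Override
        public String toString() {
            return code;
        }
    }

    // Assumption: the constant is chosen by comparing the parsed text with toString().
    static FuelType fuelTypeOf(String code) {
        for (FuelType type : FuelType.values()) {
            if (type.toString().equals(code)) {
                return type;
            }
        }
        throw new IllegalArgumentException("Unknown fuel type code: " + code);
    }

    // Same regular expression and replacement as the @Replace annotation on Price.
    static BigDecimal priceOf(String raw) {
        return new BigDecimal(raw.replaceAll("\\$", ""));
    }

    public static void main(String[] args) {
        System.out.println(fuelTypeOf("UL").name() + " = $" + priceOf("$1.32"));
    }
}
```

Without the replacement step, new BigDecimal("$1.32") would fail with a NumberFormatException, which is why the dollar sign has to go before the conversion.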
The @Linked annotation
Finally, we can define the PriceDetails
class:
class PriceDetails {
// Maps records with headers [fuel_type, price, petrol_station_name]
// each having 0 or 1 linked records with headers [reviewer_id, reviewer_name]
// These headers are defined in the @Parsed annotations of classes `Reviewer` and `Price`
/**
* For each record with data for a `PriceDetails` object, we expect to obtain 0 or 1
* linked records from an entity named "reviewer". The linked record will be converted
* to an instance of the `Reviewer` class
*
* As the attribute name matches the entity name, 'entity = "reviewer"' could have been omitted
*/
@Linked(entity = "reviewer")
public Reviewer reviewer;
/**
* The nested `Price` attribute has fields "fuel_type" and "price".
* Each record with data for a `PriceDetails` object is expected to have
* fields named "fuel_type" and "price", which will be used to populate the
* attributes of an instance of `Price`
*/
@Nested
public Price price;
/**
* Each record with data for a `PriceDetails` object is expected to also have
* a field named "petrol_station_name", whose value will be used to set this
* "name" attribute
*/
@Parsed(field = "petrol_station_name")
public String name;
}
Here, the @Linked(entity = "reviewer") annotation establishes that each record converted to PriceDetails is expected to be linked to records of an entity named “reviewer”. When the entity property is omitted, the entity name is derived from the attribute name, so a plain @Linked would work the same here.
In this first example the attribute is neither a collection nor an array type, so each parent record can have either 0 or 1 linked “reviewer” records.
This class is configured to work with records that have a field named “petrol_station_name”, plus the field names of the @Nested Price attribute: “fuel_type” and “price” (as seen in the definition of class Price earlier).
Now we can read the HtmlParserResult of our “fuel” entity with this code to obtain a list of PriceDetails:
// Let's convert the records of entity "fuel" into `PriceDetails` beans
List<PriceDetails> prices = result.get("fuel").getBeans(PriceDetails.class);
// Now we can print out all price details
for (PriceDetails petrolStation : prices) {
print(petrolStation.name + " -> " + petrolStation.price);
println(" | Reviewed by: " + petrolStation.reviewer.name + " (" + petrolStation.reviewer.id + ")");
}
This will print out the following:
Caltex -> UNLEADED = $1.32 | Reviewed by: Bob (001)
Caltex -> DIESEL = $1.36 | Reviewed by: Bob (001)
Shell -> UNLEADED = $1.40 | Reviewed by: Adam (007)
Shell -> DIESEL = $1.33 | Reviewed by: Adam (007)
Texaco -> UNLEADED = $1.52 | Reviewed by: Jade (099)
Texaco -> DIESEL = $1.40 | Reviewed by: Jade (099)
Shell -> UNLEADED = $1.10 | Reviewed by: Bob (001)
Shell -> DIESEL = $1.55 | Reviewed by: Bob (001)
Texaco -> UNLEADED = $1.23 | Reviewed by: Adam (007)
Texaco -> DIESEL = $1.90 | Reviewed by: Adam (007)
BP -> UNLEADED = $1.69 | Reviewed by: Jade (099)
BP -> DIESEL = $1.01 | Reviewed by: Jade (099)
This output doesn’t include any information about the day of the week. Fortunately, we linked the “fuel” records to each “dayOfWeek” record earlier, so we can create a class such as:
class DailyPriceList {
@Parsed(field = "day")
public String dayOfWeek;
@Linked(entity = "fuel", type = PriceDetails.class, container = ArrayList.class)
public List<PriceDetails> priceDetails;
}
Here the @Linked annotation tells us what our setup is: this class will work with a HtmlParserResult whose top-level records have a field named “day”. Each one of these records is expected to have a linked entity named “fuel” (specified by the entity property of the annotation).
The “fuel” records linked to each “day” will be read to generate instances of PriceDetails (specified by the type property in the annotation). Instances of PriceDetails will be stored in an ArrayList (specified by the container property of the annotation).
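A class token such as ArrayList.class is usually turned into an instance through reflection. How the annotation processor does this internally is not covered in this tutorial, so treat the following as a sketch under that assumption only: newContainer is a hypothetical helper, not part of the parser's API. It illustrates why a concrete class with a no-argument constructor is a natural choice for the container property.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedList;

class ContainerSketch {
    // Hypothetical helper: creates an empty collection from a class token
    // by calling its no-argument constructor reflectively.
    @SuppressWarnings({"unchecked", "rawtypes"})
    static <T> Collection<T> newContainer(Class<? extends Collection> containerType) {
        try {
            return (Collection<T>) containerType.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot instantiate " + containerType.getName(), e);
        }
    }

    public static void main(String[] args) {
        Collection<String> prices = newContainer(ArrayList.class);
        prices.add("UL $1.32");
        prices.add("D $1.36");
        System.out.println(prices.getClass().getSimpleName() + " " + prices);

        // Any concrete collection with a no-argument constructor works the same way.
        Collection<String> other = newContainer(LinkedList.class);
        System.out.println(other.getClass().getSimpleName() + " " + other);
    }
}
```

In this sketch, passing an interface such as List.class ends up in the catch branch, because an interface has no constructor to call.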
Now we can take the records of entity dayOfWeek and use them to generate a list of DailyPriceList instances:
// As each "dayOfWeek" record has "fuel" records, we can obtain a list of `DailyPriceList` beans
List<DailyPriceList> pricesPerDay = result.get("dayOfWeek").getBeans(DailyPriceList.class);
// Now we can print out the price details of each day.
for (DailyPriceList priceList : pricesPerDay) {
println("* Petrol prices on " + priceList.dayOfWeek);
for (PriceDetails petrolStation : priceList.priceDetails) {
print("\t" + petrolStation.name + " -> " + petrolStation.price);
println(" | Reviewed by: " + petrolStation.reviewer.name + " (" + petrolStation.reviewer.id + ")");
}
println("----------------------");
}
Which produces the following output:
* Petrol prices on MON
Caltex -> UNLEADED = $1.32 | Reviewed by: Bob (001)
Caltex -> DIESEL = $1.36 | Reviewed by: Bob (001)
Shell -> UNLEADED = $1.10 | Reviewed by: Bob (001)
Shell -> DIESEL = $1.55 | Reviewed by: Bob (001)
----------------------
* Petrol prices on TUE
Shell -> UNLEADED = $1.40 | Reviewed by: Adam (007)
Shell -> DIESEL = $1.33 | Reviewed by: Adam (007)
Texaco -> UNLEADED = $1.23 | Reviewed by: Adam (007)
Texaco -> DIESEL = $1.90 | Reviewed by: Adam (007)
----------------------
* Petrol prices on WED
Texaco -> UNLEADED = $1.52 | Reviewed by: Jade (099)
Texaco -> DIESEL = $1.40 | Reviewed by: Jade (099)
BP -> UNLEADED = $1.69 | Reviewed by: Jade (099)
BP -> DIESEL = $1.01 | Reviewed by: Jade (099)
----------------------
The annotation processor is very powerful and allows for plenty of flexibility on your class structure. You can also use maps, as shown next.
The @Group annotation
The @Group annotation configures how a Map instance should be created and how its keys are populated. It’s used in conjunction with the @Linked annotation, which determines how the map values are populated.
Let’s create a PetrolStation class, essentially the same as the previous PriceDetails but without Price information:
class PetrolStation implements Comparable<PetrolStation> {
@Parsed(field = "petrol_station_name")
public String name;
@Linked
public Reviewer reviewer;
public PetrolStation(){
}
@Override
public String toString() {
return name + " - Reviewer: " + reviewer;
}
// we're going to use PetrolStation as the keys of a TreeMap, so we implemented
// the `Comparable` interface
@Override
public int compareTo(PetrolStation o) {
return this.toString().compareTo(o.toString());
}
}
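Because compareTo delegates to toString(), two PetrolStation instances with the same name and reviewer are treated as the same TreeMap key, and keys are ordered alphabetically by that string. The following standalone sketch shows this with a simplified stand-in class (Station is a hypothetical name; it keeps the reviewer as a plain String instead of a Reviewer object):

```java
import java.util.Map;
import java.util.TreeMap;

class StationOrderSketch {
    // Simplified stand-in for PetrolStation: ordering delegates to toString().
    static class Station implements Comparable<Station> {
        final String name;
        final String reviewer;

        Station(String name, String reviewer) {
            this.name = name;
            this.reviewer = reviewer;
        }

        @Override
        public String toString() {
            return name + " - Reviewer: " + reviewer;
        }

        @Override
        public int compareTo(Station o) {
            return this.toString().compareTo(o.toString());
        }
    }

    public static void main(String[] args) {
        Map<Station, String> byStation = new TreeMap<>();
        byStation.put(new Station("Texaco", "Jade-099"), "first insert");
        byStation.put(new Station("BP", "Jade-099"), "second insert");
        // This key compares as equal to the previous one, so the map keeps
        // a single "BP" entry and only replaces its value.
        byStation.put(new Station("BP", "Jade-099"), "third insert");

        // Keys are iterated in sorted order: BP comes before Texaco.
        System.out.println(byStation.keySet());
        System.out.println(byStation.size());
    }
}
```

This ordering is consistent with the grouped output shown later in this tutorial, where BP is listed before Texaco on WED even though Texaco appears first in the HTML.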
Now, instead of a DailyPriceList, we define a DailyPriceMap where each PetrolStation is associated with a list of Price objects:
class DailyPriceMap {
@Parsed(field = "day")
public String dayOfWeek;
@Group(key = PetrolStation.class, container = TreeMap.class)
@Linked(entity = "fuel", type = Price.class, container = ArrayList.class)
public Map<PetrolStation, List<Price>> pricesPerStation;
}
Again, the @Linked annotation tells what the map values should be: each top-level record of the HtmlParserResult must have a linked entity named “fuel” (specified by the entity property of the annotation).
The linked “fuel” records will be read to generate instances of Price (specified by type). The instances of Price will be stored in an ArrayList (specified by container). This ArrayList becomes the value associated with each key generated according to the @Group configuration.
The @Group annotation determines that the keys of the map should be populated using our PetrolStation class (specified by the key property in the annotation), and the map should be an instance of TreeMap (specified by container).
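In plain Java terms, the combination of @Group and @Linked described above yields a structure similar to what the following loop builds by hand: a TreeMap keyed by station whose values are ArrayList instances. This is only a sketch of the resulting shape, using strings in place of the PetrolStation and Price beans; GroupSketch and pricesPerStation are hypothetical names and none of this is parser code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupSketch {
    // Each row is {station name, fuel type, price}, as in the "fuel" records of one day.
    static Map<String, List<String>> pricesPerStation(List<String[]> rows) {
        Map<String, List<String>> out = new TreeMap<>();   // plays the role of the @Group container
        for (String[] row : rows) {
            out.computeIfAbsent(row[0], key -> new ArrayList<>()) // plays the role of the @Linked container
               .add(row[1] + " " + row[2]);
        }
        return out;
    }

    public static void main(String[] args) {
        // The "fuel" rows collected for MON in this tutorial.
        List<String[]> monday = List.of(
            new String[]{"Caltex", "UL", "$1.32"},
            new String[]{"Caltex", "D", "$1.36"},
            new String[]{"Shell", "UL", "$1.10"},
            new String[]{"Shell", "D", "$1.55"});

        for (Map.Entry<String, List<String>> e : pricesPerStation(monday).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```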
Finally, we can convert each record of entity dayOfWeek into a DailyPriceMap instance:
// Here, each `DailyPriceMap` instance has the day of the week and a Map<PetrolStation, List<Price>>
List<DailyPriceMap> pricesPerDay = result.get("dayOfWeek").getBeans(DailyPriceMap.class);
for (DailyPriceMap priceList : pricesPerDay) {
println("* Petrol prices on " + priceList.dayOfWeek);
for (Map.Entry<PetrolStation, List<Price>> e : priceList.pricesPerStation.entrySet()) {
PetrolStation petrolStation = e.getKey();
List<Price> prices = e.getValue();
println("\t" + petrolStation.name + " | Reviewed by: " + petrolStation.reviewer.name + "(" + petrolStation.reviewer.id + ")");
for (Price price : prices) {
println("\t\tPrice of " + price.fuelType.name() + ": $" + price.price);
}
}
println("----------------------");
}
Which prints out the price results grouped by each day of the week:
* Petrol prices on MON
Caltex | Reviewed by: Bob(001)
Price of UNLEADED: $1.32
Price of DIESEL: $1.36
Shell | Reviewed by: Bob(001)
Price of UNLEADED: $1.10
Price of DIESEL: $1.55
----------------------
* Petrol prices on TUE
Shell | Reviewed by: Adam(007)
Price of UNLEADED: $1.40
Price of DIESEL: $1.33
Texaco | Reviewed by: Adam(007)
Price of UNLEADED: $1.23
Price of DIESEL: $1.90
----------------------
* Petrol prices on WED
BP | Reviewed by: Jade(099)
Price of UNLEADED: $1.69
Price of DIESEL: $1.01
Texaco | Reviewed by: Jade(099)
Price of UNLEADED: $1.52
Price of DIESEL: $1.40
----------------------
Further reading
That’s it. With just a couple of annotations you should be able to generate almost any sort of class relationship that represents the data you get from a HtmlParserResult.
Feel free to proceed to the following sections (in any order).
- Introduction to the univocity HTML parser
- Fields and matching rules
- Reading data into java beans
- Pagination
- Link following
- Downloads and historical data management
- Resource Downloading
- Listening to parser actions
If you find a bug
We take errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. As soon as the adjustments are made, a new SNAPSHOT build is generated so you (and anyone affected by the bug) can proceed with your work.
If you find a bug don’t hesitate to report an issue here. You can also submit feature requests or any other improvements there.
We are happy to help if you have any questions regarding how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.
We can work for you
If you don’t have the resources or don’t really want to waste time coding we can build a custom solution for you using our products. We deliver quickly as we know the ins and outs of everything we are dealing with. Send us an e-mail to sales@univocity.com with your requirements and we’ll be happy to assist.
The univocity team.