Reading linked results into Java beans

This tutorial demonstrates how to use the @Linked annotation to populate Java beans from records that are associated with a parent record. Associations among records can occur naturally through link following, but they can also be defined manually through the link operations provided by the Results returned by the parser. The link operation is also available from each individual Result element of a Results object. For simplicity, the following examples use manually linked records.

To help you understand how linked results can be mapped into objects, we built an example based on an HTML page that contains a table of fuel prices from different petrol stations on different days of the week. We will use this example to populate a class structure that organizes the data conveniently, but first we need to extract the information in a meaningful way.

The HTML code of this page is:

<table border="1">
    <thead>
    <tr>
        <th></th>
        <th>MON</th>
        <th>TUE</th>
        <th>WED</th>
    </tr>
    </thead>
    <tbody>
    <tr>
        <th>Retailer</th>
        <td>Caltex</td>
        <td>Shell</td>
        <td>Texaco</td>
    </tr>
    <tr>
        <th>Prices</th>
        <td>
            <table>
                <tr>
                    <td>UL</td>
                    <td>$1.32</td>
                </tr>
                <tr>
                    <td>D</td>
                    <td>$1.36</td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td>UL</td>
                    <td>$1.40</td>
                </tr>
                <tr>
                    <td>D</td>
                    <td>$1.33</td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td>UL</td>
                    <td>$1.52</td>
                </tr>
                <tr>
                    <td>D</td>
                    <td>$1.40</td>
                </tr>
            </table>
        </td>

    </tr>
    <tr>
        <th>Retailer</th>
        <td>Shell</td>
        <td>Texaco</td>
        <td>BP</td>
    </tr>
    <tr>
        <th>Prices</th>
        <td>
            <table>
                <tr>
                    <td>UL</td>
                    <td>$1.10</td>
                </tr>
                <tr>
                    <td>D</td>
                    <td>$1.55</td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td>UL</td>
                    <td>$1.23</td>
                </tr>
                <tr>
                    <td>D</td>
                    <td>$1.90</td>
                </tr>
            </table>
        </td>
        <td>
            <table>
                <tr>
                    <td>UL</td>
                    <td>$1.69</td>
                </tr>
                <tr>
                    <td>D</td>
                    <td>$1.01</td>
                </tr>
            </table>
        </td>
    </tr>
    </tbody>
    <tfoot>
    <tr>
        <th>Reviewer</th>
        <td><span>Bob</span><span>(001)</span></td>
        <td><span>Adam</span><span>(007)</span></td>
        <td><span>Jade</span><span>(099)</span></td>
    </tr>
    </tfoot>
</table>

As there is no particular CSS class or element ID in this document, the parsing code has to be built to collect data based on the HTML structure. We tried to explain each relevant matching rule in the comments:

HtmlEntityList htmlEntityList = new HtmlEntityList();

// captures all days of the week listed
HtmlEntitySettings dayOfWeek = htmlEntityList.configureEntity("dayOfWeek");
dayOfWeek.addField("day") // any <th> with at least 3 characters of text in them in the first <thead> of the document.
        .matchFirst("thead").match("th").withText("???").getText();

// reads all petrol station names
HtmlEntitySettings petrolStation = htmlEntityList.configureEntity("petrolStation");
petrolStation.addField("petrol_station_name") //any <td> somewhere after a <th> with text "Retailer"
        .match("td").precededBy("th").withText("Retailer").getText();

// collects all reviewer names and their IDs
HtmlEntitySettings reviewer = htmlEntityList.configureEntity("reviewer");
reviewer.addField("reviewer_name") //first <span> inside a <td> of a <tfoot>
        .match("tfoot").match("td").matchFirst("span").getText();
reviewer.addField("reviewer_id") //last <span> inside a <td> of a <tfoot>, strips the surrounding '(' and ')'
        .match("tfoot").match("td").matchLast("span").getText().transform(s -> s.substring(1, s.length() - 1));

// collects fuel prices
HtmlEntitySettings fuel = htmlEntityList.configureEntity("fuel");
// creates a reusable path: matches every <tr> of any <table> that is contained in an outer <table>
PartialPath fuelPricePath = fuel.newPath().match("table").match("table").match("tr");
fuelPricePath.addField("fuel_type") // fields are added to the partial path
        .matchFirst("td").getText(); // continue matching from where the path ends. Gets the text from the first <td>
fuelPricePath.addField("price")
        .matchLast("td").getText(); // The fuel price is in the last <td> of the <tr> matched by the partial path.

// Now we add fields to our "fuel" that already exist in the other entities created above.
// We will use these fields to join records of each entity ('dayOfWeek', 'petrolStation', 'reviewer' and 'fuel')

fuel.addPersistentField("day") // captures the day of week of each fuel price listed.
        //matches a <table> contained by an outer <table>. Then goes up to the first <th> that is above the inner <table> and grabs its text.
        .match("table").match("table").upToHeader("th").getText();

fuel.addPersistentField("petrol_station_name")// captures the petrol station name of each fuel price listed.
        // match a <td> that contains a <table>, then collects the text in the row above this <td>, from the same column.
        .match("td").parentOf("table").getTextAbove();

fuel.addPersistentField("reviewer_id")// collects the reviewer ID under each fuel price listed.
        // finds a <table> contained by a <table>. From the inner <table> look down to a <td> inside a <tfoot>
        // that <td> should be in the same column of the inner <table>.
        .match("table").match("table").downToFooter("td").containedBy("tfoot")
        // From that <td>, gets the last <span> then collect the text between '(' and ')'
        .matchLast("span").getText().transform(s -> s.substring(1, s.length() - 1));

//That's it, let's parse the HTML and get the results to see what data we get.
FileProvider inputFile = new FileProvider("documentation/tutorial/html/annotations/linkedEntityTest.html");

Results<HtmlParserResult> result = new HtmlParser(htmlEntityList).parse(inputFile);

This code will collect the following information for each one of the entities defined:

[ dayOfWeek ]
day__
MON  
TUE  
WED  

[ fuel ]
fuel_type__price__day__petrol_station_name__reviewer_id__
UL         $1.32  MON  Caltex               001          
D          $1.36  MON  Caltex               001          
UL         $1.40  TUE  Shell                007          
D          $1.33  TUE  Shell                007          
UL         $1.52  WED  Texaco               099          
D          $1.40  WED  Texaco               099          
UL         $1.10  MON  Shell                001          
D          $1.55  MON  Shell                001          
UL         $1.23  TUE  Texaco               007          
D          $1.90  TUE  Texaco               007          
UL         $1.69  WED  BP                   099          
D          $1.01  WED  BP                   099          

[ petrolStation ]
petrol_station_name__
Caltex               
Shell                
Texaco               
Shell                
Texaco               
BP                   

[ reviewer ]
reviewer_name__reviewer_id__
Bob            001          
Adam           007          
Jade           099
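
The reviewer_id values above come out as "001" rather than "(001)" because of the lambda passed to transform in the matching rules. As a minimal sketch in plain Java (java.util.function.Function is used here as a stand-in for whatever functional type the parser accepts, and the class name is hypothetical), the same lambda can be tried in isolation:

```java
import java.util.function.Function;

class ReviewerIdTransform {

    // Same lambda used in the "reviewer_id" rules: drops the first and the last character.
    static final Function<String, String> STRIP_PARENTHESES = s -> s.substring(1, s.length() - 1);

    public static void main(String[] args) {
        System.out.println(STRIP_PARENTHESES.apply("(001)")); // prints 001
        System.out.println(STRIP_PARENTHESES.apply("(099)")); // prints 099
    }
}
```

Note that the lambda removes the first and last characters whatever they are, and that substring throws a StringIndexOutOfBoundsException for text shorter than two characters, so a real rule may need a guard if a <span> can be empty.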

We can now link the records together with:

// fuel records are linked to each day of week
// from each "dayOfWeek" record we can now get the corresponding "fuel" records
result.link("dayOfWeek", "fuel");

// petrol station and reviewer records are linked to each fuel record
// from each "fuel" record we can now obtain the corresponding "petrolStation" and "reviewer" record.
result.link("fuel", "petrolStation", "reviewer");
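
The tutorial does not spell out how link matches records. The comments in the entity configuration only indicate that fields shared by two entities (such as "day") are used to join them. As a mental model only, and not the library's implementation, the sketch below groups child rows under each parent value using plain Java collections. The class name, the link helper and the row layout are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class LinkSketch {

    // Hypothetical helper: attaches each child row to the parent whose value
    // equals the child's value in the given join column.
    static Map<String, List<String[]>> link(List<String> parentValues, List<String[]> children, int joinColumn) {
        Map<String, List<String[]>> linked = new LinkedHashMap<>();
        for (String parent : parentValues) {
            linked.put(parent, new ArrayList<>());
        }
        for (String[] child : children) {
            List<String[]> bucket = linked.get(child[joinColumn]);
            if (bucket != null) { // children without a matching parent stay unlinked
                bucket.add(child);
            }
        }
        return linked;
    }

    public static void main(String[] args) {
        List<String> days = Arrays.asList("MON", "TUE", "WED");
        // A few "fuel" rows: fuel_type, price, day
        List<String[]> fuel = Arrays.asList(
                new String[]{"UL", "$1.32", "MON"},
                new String[]{"D", "$1.36", "MON"},
                new String[]{"UL", "$1.40", "TUE"},
                new String[]{"UL", "$1.10", "MON"});

        // Prints: MON -> 3 fuel record(s), TUE -> 1 fuel record(s), WED -> 0 fuel record(s)
        for (Map.Entry<String, List<String[]>> e : link(days, fuel, 2).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue().size() + " fuel record(s)");
        }
    }
}
```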

Now that we have records linked one to the other, we can explore the @Linked and @Group annotations to populate complex class structures.

Basic classes

First, we define a few basic classes. Let’s start with a Reviewer class to keep the information we see at the footer of the price table:

class Reviewer {

    @Parsed(field = "reviewer_id")
    public String id;

    @Parsed(field = "reviewer_name")
    public String name;

    @Override
    public String toString() {
        return name + '-' + id;
    }
}

The Reviewer class has its attributes mapped to fields “reviewer_id” and “reviewer_name” of the “reviewer” entity configured in the HtmlEntityList defined earlier.

For the fuel prices, we can begin by defining a FuelType enumeration to describe each fuel type code:

enum FuelType {

    UNLEADED("UL"),
    DIESEL("D");

    public final String code;

    FuelType(String code) {
        this.code = code;
    }

    @Override
    public String toString() {
        return code;
    }
}
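
The parsed "fuel_type" values are the codes "UL" and "D", not the constant names. The tutorial does not describe how the parser matches a code to a constant, so the following is only a plain-Java sketch of such a lookup. The fromCode helper and the wrapper class are hypothetical and not part of the tutorial's enum.

```java
class FuelTypeLookup {

    enum FuelType {
        UNLEADED("UL"),
        DIESEL("D");

        final String code;

        FuelType(String code) {
            this.code = code;
        }

        // Hypothetical helper: finds the constant whose code equals the given text.
        static FuelType fromCode(String code) {
            for (FuelType type : values()) {
                if (type.code.equals(code)) {
                    return type;
                }
            }
            throw new IllegalArgumentException("Unknown fuel type code: " + code);
        }
    }

    public static void main(String[] args) {
        System.out.println(FuelType.fromCode("UL").name()); // prints UNLEADED
        System.out.println(FuelType.fromCode("D").name());  // prints DIESEL
    }
}
```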

Which we use on a Price class to store the fuel type and its price:

class Price {

    @Parsed(field = "fuel_type")
    public FuelType fuelType;

    @Replace(expression = "\\$", replacement = "")
    @Parsed
    public BigDecimal price;

    @Override
    public String toString() {
        return fuelType.name() + " = $" + price;
    }
}

Here we are using the @Replace annotation to remove any dollar sign from the parsed value so it can be converted into a BigDecimal.
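
To see why the dollar sign has to go, here is a minimal plain-Java sketch of the same cleanup, using the regular expression and replacement given in the @Replace annotation. The class and method names are hypothetical, and the sketch only mirrors the effect of the annotation, not how the parser applies it.

```java
import java.math.BigDecimal;

class PriceCleanup {

    // Removes every '$' (regex "\\$", replacement "") and converts the rest to a BigDecimal.
    static BigDecimal toPrice(String raw) {
        String cleaned = raw.replaceAll("\\$", "");
        return new BigDecimal(cleaned);
    }

    public static void main(String[] args) {
        System.out.println(toPrice("$1.32")); // prints 1.32
        System.out.println(toPrice("$1.40")); // prints 1.40
    }
}
```

Without the replacement, new BigDecimal("$1.32") would fail with a NumberFormatException.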

The @Linked annotation

Finally, we can define the PriceDetails class:

class PriceDetails {

    // Maps records with headers [fuel_type, price, petrol_station_name]
    // each having 0 or 1 linked records with headers [reviewer_id, reviewer_name]
    // These headers are defined in the @Parsed annotations of classes `Reviewer` and `Price`

    /**
     * For each record with data for a `PriceDetails` object, we expect to obtain 0 or 1
     * linked records from an entity named "reviewer". The linked record will be converted
     * to an instance of the `Reviewer` class
     *
     * As the attribute name matches the entity name, 'entity = "reviewer"' could have been omitted
     */
    @Linked(entity = "reviewer")
    public Reviewer reviewer;

    /**
     * The nested `Price` attribute has fields "fuel_type" and "price".
     * Each record with data for a `PriceDetails` object is expected to have
     * fields named "fuel_type" and "price", which will be used to populate the
     * attributes of an instance of `Price`
     */
    @Nested
    public Price price;

    /**
     * Each record with data for a `PriceDetails` object is expected to also have
     * a field named "petrol_station_name", whose value will be used to set this
     * "name" attribute
     */
    @Parsed(field = "petrol_station_name")
    public String name;

}

Here, the @Linked(entity = "reviewer") annotation establishes that each record converted to PriceDetails must have been linked to an entity named “reviewer”. The entity name is derived from the attribute name, so it works the same as if we had omitted entity = "reviewer" and used just @Linked instead.

In this first example the attribute is neither a collection nor an array, so each parent record can have either 0 or 1 “reviewer” records linked to it.

This class is configured to work with records that have a field named “petrol_station_name”, plus the field names of the @Nested Price attribute: “fuel_type” and “price” (as seen in the definition of class Price earlier).

Now we can read the HtmlParserResult of our “fuel” entity with this code to obtain a list of PriceDetails:

// Let's convert the records of entity "fuel" into `PriceDetails` beans
List<PriceDetails> prices = result.get("fuel").getBeans(PriceDetails.class);

// Now we can print out all price details
for (PriceDetails petrolStation : prices) {
    print(petrolStation.name + " -> " + petrolStation.price);
    println(" | Reviewed by: " + petrolStation.reviewer.name + " (" + petrolStation.reviewer.id + ")");
}

This will print out the following:

Caltex -> UNLEADED = $1.32 | Reviewed by: Bob (001)
Caltex -> DIESEL = $1.36 | Reviewed by: Bob (001)
Shell -> UNLEADED = $1.40 | Reviewed by: Adam (007)
Shell -> DIESEL = $1.33 | Reviewed by: Adam (007)
Texaco -> UNLEADED = $1.52 | Reviewed by: Jade (099)
Texaco -> DIESEL = $1.40 | Reviewed by: Jade (099)
Shell -> UNLEADED = $1.10 | Reviewed by: Bob (001)
Shell -> DIESEL = $1.55 | Reviewed by: Bob (001)
Texaco -> UNLEADED = $1.23 | Reviewed by: Adam (007)
Texaco -> DIESEL = $1.90 | Reviewed by: Adam (007)
BP -> UNLEADED = $1.69 | Reviewed by: Jade (099)
BP -> DIESEL = $1.01 | Reviewed by: Jade (099)

However, this output has no information about the day of the week. Fortunately, we linked each record of fuel to dayOfWeek earlier, so we can create a class such as:

class DailyPriceList {

    @Parsed(field = "day")
    public String dayOfWeek;

    @Linked(entity = "fuel", type = PriceDetails.class, container = ArrayList.class)
    public List<PriceDetails> priceDetails;

}

Here the @Linked annotation tells us what our setup is: this class will work with a HtmlParserResult whose top-level records have a field named “day”. Each one of these records is expected to have a linked entity named “fuel” (specified by the entity property of the annotation).

The “fuel” records linked to each “day” will be read to generate instances of PriceDetails (specified by the type property in the annotation). Instances of PriceDetails will be stored in an ArrayList (specified by the container property of the annotation).

Now we can get the price results of each dayOfWeek and use it to generate a list of DailyPriceList instances:

// As each "dayOfWeek" record has "fuel" records, we can obtain a list of `DailyPriceList` beans
List<DailyPriceList> pricesPerDay = result.get("dayOfWeek").getBeans(DailyPriceList.class);

// Now we can print out the price details of each day.
for (DailyPriceList priceList : pricesPerDay) {
    println("* Petrol prices on " + priceList.dayOfWeek);
    for (PriceDetails petrolStation : priceList.priceDetails) {
        print("\t" + petrolStation.name + " -> " + petrolStation.price);
        println(" | Reviewed by: " + petrolStation.reviewer.name + " (" + petrolStation.reviewer.id + ")");
    }
    println("----------------------");
}

Which produces the following output:

* Petrol prices on MON
    Caltex -> UNLEADED = $1.32 | Reviewed by: Bob (001)
    Caltex -> DIESEL = $1.36 | Reviewed by: Bob (001)
    Shell -> UNLEADED = $1.10 | Reviewed by: Bob (001)
    Shell -> DIESEL = $1.55 | Reviewed by: Bob (001)
----------------------
* Petrol prices on TUE
    Shell -> UNLEADED = $1.40 | Reviewed by: Adam (007)
    Shell -> DIESEL = $1.33 | Reviewed by: Adam (007)
    Texaco -> UNLEADED = $1.23 | Reviewed by: Adam (007)
    Texaco -> DIESEL = $1.90 | Reviewed by: Adam (007)
----------------------
* Petrol prices on WED
    Texaco -> UNLEADED = $1.52 | Reviewed by: Jade (099)
    Texaco -> DIESEL = $1.40 | Reviewed by: Jade (099)
    BP -> UNLEADED = $1.69 | Reviewed by: Jade (099)
    BP -> DIESEL = $1.01 | Reviewed by: Jade (099)
----------------------

The annotation processor is very powerful and allows for plenty of flexibility on your class structure. You can also use maps, as shown next.

The @Group annotation

The @Group annotation configures how a Map instance should be created and how its keys are populated. It’s used in conjunction with the @Linked annotation to determine how the map values are populated.

Let’s create a PetrolStation class, essentially the same as the previous PriceDetails but without the Price information:

class PetrolStation implements Comparable<PetrolStation> {

    @Parsed(field = "petrol_station_name")
    public String name;

    @Linked
    public Reviewer reviewer;

    public PetrolStation(){
    }

    @Override
    public String toString() {
        return name + " - Reviewer: " + reviewer;
    }

    // we're going to use PetrolStation as the keys of a TreeMap, so we implemented
    // the `Comparable` interface
    @Override
    public int compareTo(PetrolStation o) {
        return this.toString().compareTo(o.toString());
    }
}

Now, instead of a DailyPriceList, we define a DailyPriceMap where each PetrolStation is associated with a list of Price objects:

class DailyPriceMap {

    @Parsed(field = "day")
    public String dayOfWeek;

    @Group(key = PetrolStation.class, container = TreeMap.class)
    @Linked(entity = "fuel", type = Price.class, container = ArrayList.class)
    public Map<PetrolStation, List<Price>> pricesPerStation;

}

Again, the @Linked annotation tells what the map values should be: each top-level record of the HtmlParserResult must have a linked entity named “fuel” (specified by the entity property of the annotation).

The linked “fuel” records will be read to generate instances of Price (specified by type). The instances of Price will be stored in an ArrayList (specified by container). Each ArrayList becomes the value associated with a key generated according to the @Group configuration.

The @Group annotation determines that the keys of the map should be populated using our PetrolStation class (specified by the key property in the annotation), and the map should be an instance of TreeMap (specified by container).
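
The resulting shape is a sorted map whose values are lists. As a rough illustration in plain Java (not the annotation processor itself), the sketch below groups a few rows by station name into a TreeMap of ArrayLists. The class name, the groupByStation helper and the row layout are hypothetical, and a plain String key stands in for PetrolStation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class GroupSketch {

    // Hypothetical helper. Each row holds: station name, fuel type, price.
    // Keys are kept sorted by the TreeMap; each value is an ArrayList of prices.
    static Map<String, List<String>> groupByStation(List<String[]> fuelRows) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] row : fuelRows) {
            grouped.computeIfAbsent(row[0], key -> new ArrayList<>()).add(row[1] + " = " + row[2]);
        }
        return grouped;
    }

    public static void main(String[] args) {
        // The WED rows, in the order they appear in the document
        List<String[]> wednesday = Arrays.asList(
                new String[]{"Texaco", "UL", "$1.52"},
                new String[]{"Texaco", "D", "$1.40"},
                new String[]{"BP", "UL", "$1.69"},
                new String[]{"BP", "D", "$1.01"});

        // prints {BP=[UL = $1.69, D = $1.01], Texaco=[UL = $1.52, D = $1.40]}
        System.out.println(groupByStation(wednesday));
    }
}
```

Because a TreeMap orders its keys, "BP" comes before "Texaco" even though Texaco appears first in the input. With PetrolStation keys the order is decided by its compareTo method instead.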

Finally, we can convert each record of entity dayOfWeek to obtain DailyPriceMap instances:

// Here, each `DailyPriceMap` instance has the day of the week and a Map<PetrolStation, List<Price>>
List<DailyPriceMap> pricesPerDay = result.get("dayOfWeek").getBeans(DailyPriceMap.class);
for (DailyPriceMap priceList : pricesPerDay) {
    println("* Petrol prices on " + priceList.dayOfWeek);
    for (Map.Entry<PetrolStation, List<Price>> e : priceList.pricesPerStation.entrySet()) {
        PetrolStation petrolStation = e.getKey();
        List<Price> prices = e.getValue();

        println("\t" + petrolStation.name + " | Reviewed by: " + petrolStation.reviewer.name + "(" + petrolStation.reviewer.id + ")");
        for (Price price : prices) {
            println("\t\tPrice of " + price.fuelType.name() + ": $" + price.price);
        }
    }
    println("----------------------");
}

Which prints out the price results grouped by each day of the week:

* Petrol prices on MON
    Caltex | Reviewed by: Bob(001)
        Price of UNLEADED: $1.32
        Price of DIESEL: $1.36
    Shell | Reviewed by: Bob(001)
        Price of UNLEADED: $1.10
        Price of DIESEL: $1.55
----------------------
* Petrol prices on TUE
    Shell | Reviewed by: Adam(007)
        Price of UNLEADED: $1.40
        Price of DIESEL: $1.33
    Texaco | Reviewed by: Adam(007)
        Price of UNLEADED: $1.23
        Price of DIESEL: $1.90
----------------------
* Petrol prices on WED
    BP | Reviewed by: Jade(099)
        Price of UNLEADED: $1.69
        Price of DIESEL: $1.01
    Texaco | Reviewed by: Jade(099)
        Price of UNLEADED: $1.52
        Price of DIESEL: $1.40
----------------------

Further reading

That’s it. With just a couple of annotations you should be able to generate almost any sort of class relationship that represents the data you get from a HtmlParserResult.


If you find a bug

We deal with errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. A new SNAPSHOT build will be generated so you (and anyone affected by the bug) can proceed with your work as soon as the adjustments are made.

If you find a bug don’t hesitate to report an issue here. You can also submit feature requests or any other improvements there.

We are happy to help if you have any questions about how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.

We can work for you

If you don’t have the resources or don’t really want to waste time coding, we can build a custom solution for you using our products. We deliver quickly as we know the ins and outs of everything we are dealing with. Send an e-mail to sales@univocity.com with your requirements and we’ll be happy to assist.

The univocity team.

www.univocity.com