
Link following

This section demonstrates how the parser can capture bits and pieces of information available in different pages, and aggregate everything into records of any given entity.

The input to parse

Suppose we want to capture user details, starting from a page that lists the users. The HTML of this page is simply:

<ul>
    <li>
        <span>
            <a href="./profiles/1123.html">Jon Smith</a>
        </span>
    </li>
    <li>
        <span>
            <a href="./profiles/3321.html">Maggie Jones</a>
        </span>
    </li>
    <li>
        <span>
            <a href="./profiles/8821.html">James Miller</a>
        </span>
    </li>
    <li>
        <span>
            <a href="./profiles/2315.html">Sofia Fischer</a>
        </span>
    </li>
</ul>

Each link points to the corresponding user profile page, which for demonstration purposes is just a table whose HTML could not be simpler:

<table>
    <tr>
        <td class="label">Username:</td>
        <td class="value">jsmith</td>
    </tr>
    <tr>
        <td class="label">Age:</td>
        <td class="value">25</td>
    </tr>
    <tr>
        <td class="label">Location:</td>
        <td class="value">Adelaide, South Australia</td>
    </tr>
    <tr>
        <td class="label">Profile created on:</td>
        <td class="value">25/4/2008</td>
    </tr>
</table>
<div>
    <a href="../list.html">Go Back</a>
</div>

Finally, we are ready to start coding.

To follow a specific link, all you need is to define a field that targets a URL. Let’s look at the code to build a User entity with fields name, profileUrl, username, age, location and created, where profileUrl has the link to be followed.

HtmlEntityList entityList = new HtmlEntityList();

HtmlEntitySettings user = entityList.configureEntity("User");
user.addField("name").match("a").getText();

// The 'profileUrl' field has a link to the next page with user details. We want to follow that link.
HtmlLinkFollower profileFollower = user.addField("profileUrl")
        .match("a")
        .getAttribute("href")
        .followLink();

// We just add fields to the follower object. As the link follower comes from the "User" entity, the fields added
// here end up in the "User" entity.
getValueFromLabel(profileFollower, "username", "Username");
getValueFromLabel(profileFollower, "age", "Age");
getValueFromLabel(profileFollower, "location", "Location");
getValueFromLabel(profileFollower, "created", "Profile created on");

An HtmlLinkFollower is created by the definition of the profileUrl field. The parser is instructed to collect the href value of each link displayed in the first page, and to follow that link.

The HtmlLinkFollower is associated with the User entity, and any rows collected from the sub-pages it visits will be linked to the parent User row.
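Note that the href values in the list page are relative (./profiles/1123.html), so following them means resolving each one against the location of the page it came from. The sketch below illustrates that resolution step using only java.net.URI. It is not the parser's own code; the LinkResolutionSketch class, its resolve method and the example.com location are hypothetical and used for illustration only.

```java
import java.net.URI;

class LinkResolutionSketch {

    // Resolves an href found in a page against the location of that page.
    static URI resolve(String pageLocation, String href) {
        return URI.create(pageLocation).resolve(href);
    }

    public static void main(String[] args) {
        // Hypothetical location of the list page; any absolute URI behaves the same way.
        String listPage = "http://example.com/tutorial/list.html";

        // The link to a profile, as written in the list page.
        URI profile = resolve(listPage, "./profiles/1123.html");
        System.out.println(profile); // http://example.com/tutorial/profiles/1123.html

        // The "Go Back" link, as written in the profile page.
        System.out.println(resolve(profile.toString(), "../list.html")); // http://example.com/tutorial/list.html
    }
}
```

The same rule explains why the "Go Back" link of a profile page, ../list.html, leads back to the initial list.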

In the remainder of the code, we simply keep adding fields and matching rules to the link follower, so it can collect the data in the detailed profile pages. The getValueFromLabel method is defined as:

private void getValueFromLabel(HtmlLinkFollower follower, String fieldName, String key) {
    follower.addField(fieldName)
            .match("td").classes("value") // matches a table cell with class "value"
            .precededImmediatelyBy("td").classes("label").withText(key) // only if it is immediately preceded by a cell with class "label" and the given text
            .getOwnText(); // captures the text of the matched cell
}

As the data collected for the User entity comes from multiple pages, the values captured by the link follower are linked to the values captured from the first page. After running the parser, the results can be traversed like this:

FileProvider input = new FileProvider("documentation/tutorial/html/example_007/list.html", "UTF-8");
HtmlParserResult users = new HtmlParser(entityList).parse(input).get("User");

for (HtmlRecord record : users.iterateRecords()) { // iterates over the first level of results, captured from the first page
    String[] parentRow = record.getValues(); // first level rows have values for fields `name` and `profileUrl`

    HtmlParserResult linkedData = record.getLinkedFieldData(); // each record holds the results captured from the linked page

    // print the rows linked to the parent row...
}

The output will be:

name___________profileUrl____________
Jon Smith      ./profiles/1123.html  
  + username__age__location___________________created____
  | jsmith    25   Adelaide, South Australia  25/4/2008  

Maggie Jones   ./profiles/3321.html  
  + username__age__location_______created____
  | mag       52   Austin, Texas  16/2/2010  

James Miller   ./profiles/8821.html  
  + username__age__location___________created____
  | jmiller   31   Detroit, Michigan  10/3/2007  

Sofia Fischer  ./profiles/2315.html  
  + username__age__location_________created___
  | fish      19   Berlin, Germany  1/7/2015

Combining results

In the example above, as each linked page returns only a single row, it probably makes more sense to join the parent and its linked row together. This can be easily accomplished by adding the following line before running the parser:

profileFollower.setNesting(Nesting.JOIN);

The setNesting method determines how the rows produced by an HtmlLinkFollower are associated with the parent row. The JOIN nesting option combines the values of the parent row with the values of each linked row. When multiple linked rows are involved, the result is a cartesian product: a parent row [a,b] joined with linked rows [k,l] and [x,y] results in two rows, [a,b,k,l] and [a,b,x,y].
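The JOIN behavior described above can be illustrated with a few lines of plain Java. This is only a sketch of the row arithmetic, not the library's implementation; the JoinNestingSketch class and its join method are hypothetical names, and the sketch does not model what the parser does when a link yields no rows at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class JoinNestingSketch {

    // Appends the values of each linked row to a copy of the parent row,
    // producing one joined row per linked row.
    static List<String[]> join(String[] parent, List<String[]> linkedRows) {
        List<String[]> out = new ArrayList<>();
        for (String[] linked : linkedRows) {
            String[] joined = Arrays.copyOf(parent, parent.length + linked.length);
            System.arraycopy(linked, 0, joined, parent.length, linked.length);
            out.add(joined);
        }
        return out;
    }

    public static void main(String[] args) {
        String[] parent = {"a", "b"};
        List<String[]> linked = Arrays.asList(new String[]{"k", "l"}, new String[]{"x", "y"});

        for (String[] row : join(parent, linked)) {
            System.out.println(Arrays.toString(row));
        }
        // prints:
        // [a, b, k, l]
        // [a, b, x, y]
    }
}
```

In the user example each profile page yields exactly one row, so every parent row produces a single joined row.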

Check the Nesting enumeration javadoc to learn about the different nesting options supported.

So our code becomes:

...
profileFollower.setNesting(Nesting.JOIN);

FileProvider input = new FileProvider("documentation/tutorial/html/example_007/list.html", "UTF-8");
HtmlParserResult result = new HtmlParser(entityList).parse(input).get("User");

List<String[]> users = result.getRows();

The rows returned for the User entity will be:

name___________profileUrl____________username__age__location___________________created____
Jon Smith      ./profiles/1123.html  jsmith    25   Adelaide, South Australia  25/4/2008  
Maggie Jones   ./profiles/3321.html  mag       52   Austin, Texas              16/2/2010  
James Miller   ./profiles/8821.html  jmiller   31   Detroit, Michigan          10/3/2007  
Sofia Fischer  ./profiles/2315.html  fish      19   Berlin, Germany            1/7/2015

If required, the linked contents are still available through record.getLinkedFieldData(), so you can get the original records collected from the followed page with fields “username”, “age”, “location” and “created”.

You can also add entities to link followers with as many nested levels of link followers as required. Handling this sort of structure is almost effortless. Let’s expand the previous example to demonstrate.

The updated example

Expanding the previous example, let’s add a link from each user profile page to an address list, for example:

<table>
    <tr>
        <td class="label">Username:</td>
        <td class="value">jsmith</td>
    </tr>
    <tr>
        <td class="label">Age:</td>
        <td class="value">25</td>
    </tr>
    <tr>
        <td class="label">Location:</td>
        <td class="value">Adelaide, South Australia <a href="./addresses/1123.html">Choose another</a></td>
    </tr>
    <tr>
        <td class="label">Profile created on:</td>
        <td class="value">25/4/2008</td>
    </tr>
</table>
<div>
    <a href="../list.html">Go Back</a>
</div>

Every user profile page now has a link like <a href="./addresses/1123.html">Choose another</a>, where the href attribute points to the address list associated with the profile.

The HTML of the profile address pages looks like this:

<table border="1">
    <tr>
        <td class="label">Address</td>
        <td class="label">Personal</td>
        <td class="label">Business</td>
        <td class="label">Mailing</td>
    </tr>
    <tr>
        <td class="value">Somewhere, in the world</td>
        <td><input name="type_1" type="radio" checked disabled></td>
        <td><input name="type_1" type="radio" disabled></td>
        <td><input name="type_1" type="radio" disabled></td>
    </tr>
    <tr>
        <td class="value">Somewhere else, in the world</td>
        <td><input name="type_2" type="radio" disabled></td>
        <td><input name="type_2" type="radio" disabled></td>
        <td><input name="type_2" type="radio" disabled checked></td>
    </tr>
</table>
<div>
    <a href="../1123.html">Go Back</a>
</div>

What we want to do now is to aggregate all information about users and their addresses by making the parser visit 3 pages for each user.

Capturing data into a linked entity

With our input to parse ready, we want to define an Address entity whose data will be linked to each profile. To do this, we add a new link follower to the profileFollower created earlier.

Every HtmlLinkFollower has its own HtmlEntityList, which always includes the parent entity. You can add new entities to the entity list of a link follower:

// let's get the link follower created earlier back
HtmlLinkFollower profileFollower = entityList.getEntity("User").getRemoteFollowers().get("profileUrl");

// now we want to follow the link that points to a page with user addresses
HtmlLinkFollower addressFollower = profileFollower.addField("addressUrl").match("a")
        .withExactText("Choose another")
        .getAttribute("href")
        .followLink();

// Create a new "Address" entity. The results will be linked to the parent `profileFollower`.
HtmlEntitySettings address = addressFollower.getEntityList().configureEntity("Address");

// Gets the content of all cells under the "Address" column
address.addField("address")
        .match("td")
        .underHeader("td").withExactText("Address")
        .getText();

// Finds a checked radio button and returns the text in the header of the corresponding column
address.addField("type")
        .match("input").attribute("type", "radio") //matches radio buttons
        .attribute("checked") //matches only checked radio buttons
        .getHeadingText(); //gets the text of the first row of the table, at the same column

// Now we add a field to the follower itself. It will be added to the main "User" entity and will
// store the number of addresses associated with each user.
addressFollower.addField("address_count")
        .match("table") //just match a <table> and give the node to you so you can work with the DOM
        .getElement(new HtmlElementTransformation() { //you can use a lambda instead
            @Override
            public String transform(HtmlElement table) {
                //you must work with the matched element to return a String.
                //To get the number of addresses, we can query all <tr> elements of the current <table>.
                List<HtmlElement> rows = table.query("tr");

                //Subtract the first row from the total as it's a heading row. The result must be a String.
                return String.valueOf(rows.size() - 1);
            }
        });

// we don't want to have an "addressUrl" field in the user records.
// REPLACE_JOIN will replace the column of the link that was followed with the results obtained.
addressFollower.setNesting(Nesting.REPLACE_JOIN);

A few new matching rules were introduced in the example above. Review the comments in the code to learn more about them.
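To make the idea behind the "type" field more concrete, here is a minimal sketch in plain Java of what it means to return the heading of the column where a checked radio button was found. It does not use the parser and says nothing about how getHeadingText is implemented; the HeadingTextSketch class, its headingOfChecked method and the boolean arrays that stand in for the table rows are all hypothetical.

```java
class HeadingTextSketch {

    // Returns the heading of the first column flagged as checked in the given row,
    // or null if no column is checked.
    static String headingOfChecked(String[] headings, boolean[] checkedByColumn) {
        for (int col = 0; col < checkedByColumn.length; col++) {
            if (checkedByColumn[col]) {
                return headings[col];
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // First row of the address table shown earlier.
        String[] headings = {"Address", "Personal", "Business", "Mailing"};

        // One flag per column; the "Address" column never holds a radio button.
        boolean[] firstAddress = {false, true, false, false};
        boolean[] secondAddress = {false, false, false, true};

        System.out.println(headingOfChecked(headings, firstAddress));  // Personal
        System.out.println(headingOfChecked(headings, secondAddress)); // Mailing
    }
}
```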

We can now run the parser and let it take care of everything. The following code shows how you can traverse the results to get the relevant rows and print them out.

//parse and print the results.
FileProvider input = new FileProvider("documentation/tutorial/html/example_008/list.html", "UTF-8");

HtmlParserResult users = new HtmlParser(entityList).parse(input).get("User");
for (HtmlRecord user : users.iterateRecords()) {
    println(Arrays.toString(user.getValues())); //the values collected by all followers are joined in a single row

    // As we configured the parser to join rows, the linked "Address" entity is available from the "User" record
    HtmlParserResult addressResults = user.getLinkedEntityData().get("Address");

    for (HtmlRecord addr : addressResults.iterateRecords()) {
        println("  * " + Arrays.toString(addr.getValues()));
    }
    println();
}

This will produce:

[Jon Smith, ./profiles/1123.html, jsmith, 25, Adelaide, South Australia, 25/4/2008, 2]
  * [Somewhere, in the world, Personal]
  * [Somewhere else, in the world, Mailing]

[Maggie Jones, ./profiles/3321.html, mag, 52, Austin, Texas, 16/2/2010, 0]

[James Miller, ./profiles/8821.html, jmiller, 31, Detroit, Michigan, 10/3/2007, 2]
  * [201 Fake St, Zambia 12311, Business]
  * [201 Fake St, Cayman Islands 12311, Business]

[Sofia Fischer, ./profiles/2315.html, fish, 19, Berlin, Germany, 1/7/2015, 1]
  * [Radelaide, Straya, Mailing]

This is quite usable, especially considering that the data is scattered across three different pages.

Notice how the Address entity data is accessible from the first level of User records. That’s because the rows have been joined with addressFollower.setNesting(Nesting.REPLACE_JOIN).
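The effect of REPLACE_JOIN on the columns of a row can be sketched in plain Java as follows. This is an illustration based on the code comment and the output shown above, not the library's implementation: the ReplaceJoinSketch class and its replaceJoin method are hypothetical, and placing the linked values exactly where the link column used to be is an assumption. In this example the link column is the last one, so the result matches the output either way.

```java
import java.util.Arrays;

class ReplaceJoinSketch {

    // Builds a row where the column that held the followed link is replaced
    // by the values collected from the linked page.
    static String[] replaceJoin(String[] parent, int linkColumn, String[] linked) {
        String[] out = new String[parent.length - 1 + linked.length];
        // values before the link column
        System.arraycopy(parent, 0, out, 0, linkColumn);
        // values collected from the linked page take the place of the link
        System.arraycopy(linked, 0, out, linkColumn, linked.length);
        // values after the link column
        System.arraycopy(parent, linkColumn + 1, out, linkColumn + linked.length, parent.length - linkColumn - 1);
        return out;
    }

    public static void main(String[] args) {
        // Profile values followed by the "addressUrl" link, before the link is followed.
        String[] profile = {"jsmith", "25", "Adelaide, South Australia", "25/4/2008", "./addresses/1123.html"};

        // The address follower contributes a single "address_count" value.
        String[] linked = {"2"};

        System.out.println(Arrays.toString(replaceJoin(profile, 4, linked)));
        // prints: [jsmith, 25, Adelaide, South Australia, 25/4/2008, 2]
    }
}
```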

Going through the result hierarchy

If for some reason you need to walk through the entire hierarchy of records, the Address rows will still be available from the results captured from the profile page, as demonstrated here:

//parse and print the results.
FileProvider input = new FileProvider("documentation/tutorial/html/example_008/list.html", "UTF-8");

HtmlParserResult users = new HtmlParser(entityList).parse(input).get("User");
for (HtmlRecord user : users.iterateRecords()) {
    println(Arrays.toString(user.getValues())); //the values collected by all followers are joined in a single row

    //get the records collected by the first link follower
    HtmlParserResult profileResults = user.getLinkedFieldData();

    for (HtmlRecord profile : profileResults.iterateRecords()) {
        //the profile details were already joined with the parent row, so we ignore that data here.

        //we want the addresses collected by the second link follower. They are linked to each profile record.
        HtmlParserResult addressResults = profile.getLinkedEntityData().get("Address");

        for (HtmlRecord addr : addressResults.iterateRecords()) {
            println("  * " + Arrays.toString(addr.getValues()));
        }
    }
    println();
}

This will produce the same output as before.

Reading the results into java beans

As the relationships among records can become intricate, it’s usually simpler to recreate a class structure that represents them. For example, the following classes:

class User {

    @Parsed
    private String name;

    @Parsed
    private String username;

    @Parsed
    private int age;

    @Parsed(field = "address_count")
    private int addressCount;

    @Parsed
    private String location;

    @Parsed
    @Format(formats = "dd/M/yyyy")
    private java.util.Date created;

    @Linked(entity = "Address", type = UserAddress.class)
    private List<UserAddress> addresses;

    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        out.append(name)
                .append(" (").append(username).append(")")
                .append(", age ").append(age)
                .append(", location=").append(location)
                .append(" - Created on ").append(new SimpleDateFormat("yyyy-MMM-dd").format(created));

        out.append("\n").append(addressCount).append(" addresses");

        for (UserAddress address : addresses) {
            out.append("\n * ").append(address);
        }
        return out.toString();
    }
}

And:

class UserAddress {

    public enum Type {
        BUSINESS,
        MAILING,
        PERSONAL
    }

    @Parsed
    @UpperCase
    private Type type;

    @Parsed
    private String address;

    @Override
    public String toString() {
        return type + ": " + address;
    }
}

Can be populated with:

//parse and print the results.
FileProvider input = new FileProvider("documentation/tutorial/html/example_008/list.html", "UTF-8");

HtmlParserResult users = new HtmlParser(entityList).parse(input).get("User");

List<User> userList = users.getBeans(User.class);
for (User user : userList) {
    println(user);
    println();
}

Which conveniently prints out:

Jon Smith (jsmith), age 25, location=Adelaide, South Australia - Created on 2008-Apr-25
2 addresses
 * PERSONAL: Somewhere, in the world
 * MAILING: Somewhere else, in the world

Maggie Jones (mag), age 52, location=Austin, Texas - Created on 2010-Feb-16
0 addresses

James Miller (jmiller), age 31, location=Detroit, Michigan - Created on 2007-Mar-10
2 addresses
 * BUSINESS: 201 Fake St, Zambia 12311
 * BUSINESS: 201 Fake St, Cayman Islands 12311

Sofia Fischer (fish), age 19, location=Berlin, Germany - Created on 2015-Jul-01
1 addresses
 * MAILING: Radelaide, Straya
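Two conversions are at work in this output: the created value is read with the pattern dd/M/yyyy and printed with yyyy-MMM-dd, and the heading text captured for type (for example "Personal") ends up as an enum constant. The sketch below reproduces both with the plain JDK, as an approximation of what the annotations ask for rather than the library's own conversion code. The ConversionSketch class and its methods are hypothetical, and the explicit Locale.ENGLISH is an assumption made so that month names print in English.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

class ConversionSketch {

    enum Type { BUSINESS, MAILING, PERSONAL }

    // Parses a date written as day/month/year and prints it as year-month-day.
    static String reformatDate(String text) {
        try {
            Date parsed = new SimpleDateFormat("dd/M/yyyy", Locale.ENGLISH).parse(text);
            return new SimpleDateFormat("yyyy-MMM-dd", Locale.ENGLISH).format(parsed);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not a dd/M/yyyy date: " + text, e);
        }
    }

    // Upper-cases the captured text before looking up the enum constant.
    static Type toType(String headingText) {
        return Type.valueOf(headingText.toUpperCase(Locale.ENGLISH));
    }

    public static void main(String[] args) {
        System.out.println(reformatDate("25/4/2008")); // 2008-Apr-25
        System.out.println(reformatDate("1/7/2015"));  // 2015-Jul-01
        System.out.println(toType("Personal"));        // PERSONAL
    }
}
```

Single-digit days such as the one in 1/7/2015 are accepted when parsing, while formatting with dd pads the day to two digits.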

Further reading

The sections Reading data into java beans and Reading linked results into java beans provide the ins and outs of using the annotations provided by the library in order to convert your data into object structures.

The univocity team.

www.univocity.com