Introduction to the univocity HTML parser

The univocity-html-parser is our solution for extracting and aggregating information from tricky HTML pages and websites. Any project you write with our parser should be a breeze to code and maintain in comparison to any other HTML processing solution you find or come up with.

Motivation

We built this library internally over 5 years to help us build custom-made solutions for clients who needed to extract data from large, complex sets of HTML pages with lots of data points to collect and “stitch” together. It became so useful that we decided to release it as a product that everyone can use and benefit from.

What does the univocity-html-parser do differently from <insert your preferred tool here>?

  1. It’s declarative: most of the work you have to do is to declare matching rules to narrow down the HTML nodes you want to get data from. You don’t have to manually traverse a tree structure of HTML nodes - the DOM - to locate and extract some piece of information (well, you still can but 99% of the time you won’t need to do that).

  2. The data collected from each matching rule is combined into rows in a way that eliminates the problem of having to write code to aggregate the information into records you can use. Usually these rows come out ready to insert into a database, but you can also convert them into Java Beans using annotations. The data on each row can come from different pages, or from different sections of the same page.

  3. It can extract resources such as images, CSS and JavaScript used by the HTML, like your browser does when you save a page to your hard drive. Your stored HTML files can be rendered locally and share the resources collected in the past instead of downloading a new set of resource files every time a new page is saved.

  4. Historical data processing is supported. You can easily organize the files downloaded by the parser in a way that makes sense to you (by date, time, batch ID, etc) and re-run a parsing process over copies of HTML pages pulled in the past. This allows you to make adjustments to your code as you go and collect data that you might have missed or that should be structured differently.

  5. Pagination is handled almost transparently. Just define a paginator and tell it to update URL parameters, “click” the link to the next page of results, or capture form elements to submit a POST request to the next page.

  6. Depending on the structure of the HTML you need to work with, you can easily implement a change detection mechanism to automatically identify modifications on HTML pages which can potentially have new data points your program might be missing.

  7. It saves you lots and lots of time! Our software comes with a 14-day trial which should be more than enough for you to complete rather large projects.

We recommend that you follow all sections of this tutorial first to get a grasp of what you can do with the univocity-html-parser. The source code of all examples shown in our tutorials is here for you to check out and run locally. Let’s get started!

Running the tutorial examples locally

If you are interested in seeing the examples shown here in action you can clone the source code of the tutorials using git with:

  git clone https://github.com/univocity/univocity-html-parser-tutorial.git

Then create a new project “from existing sources” using your preferred IDE. Select the univocity-html-parser-tutorial folder created in the step above and open it as a Maven project.

The examples in the tutorial are TestNG unit tests. If you use Eclipse, make sure you have the TestNG plugin installed.

The Tutorial class has a main method that fires up the license manager if you need it.

If you have any trouble executing the tutorials, send us an e-mail at support@univocity.com.

Tutorial

If you have worked with HTML scraping before, you already know that the bulk of the work when extracting data from HTML is in writing the code to target specific elements of the page and get their contents in a usable fashion. The univocity-html-parser aims to drastically reduce the amount of work required there.

This first tutorial is a quick introduction on how to use the parser and there is a lot to explore. We also created detailed tutorials for each feature group of the parser. Check them out after going through this tutorial:

In addition, we put a lot of effort into writing detailed and self-explanatory javadocs.

If you have any questions or suggestions, just send an e-mail to parsers@univocity.com. We usually reply in less than 24 hours.

The source code and all files used in the following sections of this tutorial are available on github:

Let’s get started!

Basic configuration and parsing

Many websites are built using HTML that is more or less structured around elements with IDs and CSS classes. The more structured the HTML, the easier it is to handle. Take, for example, the (simplified) snippet of search results from an online store. It’s simply a sequence of table cells (<td>) with products and their prices:

The HTML looks like this:

<td>
    <div class="product">
        <a href="/catalog/products/60311495/" style="pointer-events:none;">
            <span class="prodName">IKEA PS 2014</span>
            <span class="prodDesc">Pendant lamp&nbsp;</span>
            <span class="prodPrice">$99</span>
        </a>
    </div>
</td>
<td>
    <div class="product">
        <a href="/catalog/products/80228607/" style="pointer-events:none;">
            <span class="prodName">STOCKHOLM</span>
            <span class="prodDesc">Pendant lamp&nbsp;</span>
            <span class="prodPrice">$108</span>
        </a>
    </div>
</td>

To parse this (or any other HTML) with our parser, the first step is to define one or more entities to store the data. Each entity can have multiple fields, and when the parser runs each entity will hold its own rows. Entities are created through an HtmlEntityList, like this:

// Create a new instance
HtmlEntityList entityList = new HtmlEntityList();

// Configure an entity named "items"
HtmlEntitySettings items = entityList.configureEntity("items");

So we have an items entity. The HtmlEntitySettings provides the methods to define fields and configure the entity. Let’s collect the name, description and price of each item in the search results of the sample HTML shown earlier. All we have to do is specify the fields of our items entity. Each field has a name and requires a path to the relevant HTML elements:

// Let's add a few fields to the "items" entity.
// A field must have a name and matching rules associated with it
items.addField("name")
        .match("span")         // match any span
        .classes("prodName") // with a class "prodName"
        .getText(); //if a <span> with the "prodName" class is found, get the text from inside the <span>.

// Next we add a "description" field, now matching <span> elements with class "prodDesc"
items.addField("description").match("span").classes("prodDesc").getText();

// Next we create a "price" field, and we clean up the price to remove the unwanted dollar sign
items.addField("price").match("span").classes("prodPrice").getText()
        .transform(price -> price.trim().replaceAll("\\$", ""));
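
The function passed to transform is ordinary Java string handling, so it can be checked on its own before wiring it into a field. A minimal sketch using only the JDK; the class name PriceCleanup and the constant cleanPrice are illustrative names, not part of the parser's API:

```java
import java.util.function.Function;

class PriceCleanup {
    // Same expression as the lambda given to transform(...) above.
    // "\\$" is a regular expression that matches a literal dollar sign.
    static final Function<String, String> cleanPrice =
            price -> price.trim().replaceAll("\\$", "");

    public static void main(String[] args) {
        System.out.println(cleanPrice.apply("$99"));     // prints 99
        System.out.println(cleanPrice.apply("  $108 ")); // prints 108
    }
}
```

The dollar sign has to be escaped because replaceAll takes a regular expression, where an unescaped $ means "end of input".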

Once our entity is configured with its fields, we can create a HtmlParser and put it to work:

// Create a parser for our entities
HtmlParser parser = new HtmlParser(entityList);

// Now we can parse the input. The `FileProvider` helps to load files from the classpath.
FileProvider input = new FileProvider("documentation/tutorial/html/example_001.html", "UTF-8");

// Call the `parse` method to parse the input and get the results.
Results<HtmlParserResult> results = parser.parse(input);

The Results produced by the parser map entity names to their corresponding data. They also allow you to join the data of multiple entities if required (we’ll save that for later). Let’s get the results of our “items” entity:

HtmlParserResult itemResults = results.get("items");

// The results have headers, rows and other information
String[] headers = itemResults.getHeaders();
List<String[]> rows = itemResults.getRows();

If we print out the headers and rows, the output will be:

name__________description___price__
IKEA PS 2014  Pendant lamp  99     
STOCKHOLM     Pendant lamp  108
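
The printing code itself is not shown in this tutorial. As a rough sketch, the headers and rows returned above could be laid out in the same fixed-width style with plain JDK code. TablePrinter, pad and format are hypothetical helpers written for this illustration, and the rows are hard-coded from the output above instead of coming from the parser:

```java
import java.util.Arrays;
import java.util.List;

class TablePrinter {
    // Pads `value` on the right with `fill` until it is `width` characters long.
    static String pad(String value, int width, char fill) {
        StringBuilder out = new StringBuilder(value);
        while (out.length() < width) {
            out.append(fill);
        }
        return out.toString();
    }

    // Lays out headers and rows in columns two characters wider than the longest value.
    static String format(String[] headers, List<String[]> rows) {
        int[] widths = new int[headers.length];
        for (int i = 0; i < headers.length; i++) {
            widths[i] = headers[i].length();
            for (String[] row : rows) {
                String cell = row[i] == null ? "" : row[i];
                widths[i] = Math.max(widths[i], cell.length());
            }
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < headers.length; i++) {
            out.append(pad(headers[i], widths[i] + 2, '_')); // headers are padded with underscores
        }
        for (String[] row : rows) {
            out.append('\n');
            for (int i = 0; i < headers.length; i++) {
                out.append(pad(row[i] == null ? "" : row[i], widths[i] + 2, ' '));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-ins for itemResults.getHeaders() and itemResults.getRows().
        String[] headers = {"name", "description", "price"};
        List<String[]> rows = Arrays.asList(
                new String[]{"IKEA PS 2014", "Pendant lamp", "99"},
                new String[]{"STOCKHOLM", "Pendant lamp", "108"});
        System.out.println(format(headers, rows));
    }
}
```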

Time to try dealing with something more intricate.

Exploring a bit more

As configuring fields and their paths represents the bulk of the work required to use our parser, we had to make sure matching rules are easy to build and flexible enough to handle anything. Let’s move on to something a bit messier, such as the page below:

In this example we want to collect information about companies and their addresses. Even though the page looks simple, there are no elements with IDs or CSS classes. This makes collecting the information properly with conventional tools quite a painful experience (try it yourself). The HTML looks like this:

<div>
    <span>Company No:&nbsp;&nbsp;</span><span><b>123</b></span>
    <table>
        <tr>
            <td>Legal name:&nbsp;&nbsp;<b><span>My Corporation</span></b></td>
        </tr>
        <tr>
            <td><br/><b>Business address</b></td>
        </tr>
        <tr>
            <td>
                <table border="1">
                    <tr>
                        <td colspan="2">Street 1:<br/><span>MARKET PLAZA - AIR TOWER</span></td>
                        <td colspan="2">Street 2:<br/><span>25th floor</span></td>
                    </tr>
                    <tr>
                        <td>City:<br/><span>Sydney</span></td>
                        <td>State:<br/><SPAN>New South Wales</SPAN></td>
                        <td>Country:<br/><SPAN>Australia</SPAN></td>
                        <td>Postal Code:<br/><span>2222</span></td>
                    </tr>
                </table>
            </td>
        </tr>
        <tr>
            <td><br/><b>Mailing address</b></td>
        </tr>
        <tr>
            <td>
                <table border="1">
                    <tr>
                        <td colspan="2">Street 1:<br/><span>1 Some street</span></td>
                        <td colspan="2">Street 2:<br/><span></span></td>
                    </tr>
                    <tr>
                        <td>City:<br/><span>Sydney</span></td>
                        <td>State:<br/><SPAN>New South Wales</SPAN></td>
                        <td>Country:<br/><SPAN>Australia</SPAN></td>
                        <td>Postal Code:<br/><span>2220</span></td>
                    </tr>
                </table>
            </td>
        </tr>
    </table>
    <br/>
    <br/>
    <span>Company No:&nbsp;&nbsp;</span><span><b>456</b></span>
    <table>
        <tr>
            <td>Legal name:&nbsp;&nbsp;<b><span>Other Corporation</span></b></td>
        </tr>
        <tr>
            <td><br/><b>Business address</b></td>
        </tr>
        <tr>
            <td><br/>Not available</td>
        </tr>
        <tr>
            <td><br/><b>Mailing address</b></td>
        </tr>
        <tr>
            <td>
                <table border="1">
                    <tr>
                        <td colspan="2">Street 1:<br/><span>2 George St</span></td>
                        <td colspan="2">Street 2:<br/><span></span></td>
                    </tr>
                    <tr>
                        <td>City:<br/><span>Adelaide</span></td>
                        <td>State:<br/><SPAN>South Australia</SPAN></td>
                        <td>Country:<br/><SPAN>Australia</SPAN></td>
                        <td>Postal Code:<br/><span>5000</span></td>
                    </tr>
                </table>
            </td>
        </tr>
    </table>
</div>

The univocity-html-parser can handle this input quite cleanly. Let’s define a “company” entity and its fields first:

// We just need company ID and name.
HtmlEntitySettings company = entityList.configureEntity("company");

company.addField("company_id")
        .match("div")  //look for a <div>
        .match("span") //inside or after that <div>, find a <span>
        .precededImmediatelyBy("span").withText("company no") //see if it is preceded by another <span>
        // containing the text "company no"
        .getText(); //returns the text of the last matched element in the path (a <span>).

company.addField("name")
        .match("span") //look for any <span>
        .childOf("b") //the <span> must be a child of a <b> element
        .childOf("td").withText("legal name") //the <b> element must be a child of a
        // <td> element with text "legal name"
        .getText();  //returns the text of the last matched element in the path (a <span>).

The above snippet introduces some new rules for matching elements. Notice how the match method can be followed by another match, i.e. .match("div").match("span"). This helps narrow down the search to only elements that appear one after the other. Any match method allows the specification of further constraints to be applied on top of the last matched element. In the company_id field, for example, a <span> is only considered if it is preceded by another <span> which in turn must have the text “company no”. Notice that the withText constraint performs a case-insensitive search.

Once an element is matched by the parser, the last element in a sequence of match rules will have its contents collected. The getText method returns the plain text inside the matched element. If this element has children, their text will also be collected. You can also use getFollowingText to get any text after the matched element, and many other alternatives.

Next, the “address” entity definition:

// The address entity has quite a few fields.
HtmlEntitySettings address = entityList.configureEntity("address");

// We want to know the ID of the company that "owns" each address.
address.addPersistentField("company_id") //a "persistent" field retains its value across all rows.
        .match("div")
        .match("span")
        .precededBy("span").withText("company no")
        .getText();

address.addField("type")
        .match("tr").withText("* address") //look, a wildcard
        .getText();

// The HTML structure is the same for street 1 & 2. Different rules can get to the same content, as demonstrated
address.addField("street1").match("td").withText("Street 1").match("span").getText();
address.addField("street2").match("span").precededByText("Street 2").getText();

// We created a method to configure the remaining fields as the structure is the same
// and only the text preceding each value changes
addField(address, "city", "City:");
addField(address, "state", "State:");
addField(address, "country", "Country:");
addField(address, "zip", "Postal Code:");

There’s not a lot of novelty here, except for the addPersistentField method. A persistent field is a field that retains its value across all rows produced for an entity. Every time a new row is produced by the parser, the values of persistent fields from the previous row are copied into the new one. The values of these fields will only change if a new match is found for the persistent field.
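
The carry-over behaviour of a persistent field can be pictured as a fill-forward over the rows of an entity. The sketch below is only a model of that description in plain Java, not the parser's implementation; PersistentFieldSketch and carryForward are hypothetical names, and a null cell stands for "no new match was found for this row":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PersistentFieldSketch {
    // Copies the last non-null value of column `index` into rows where that column is null.
    static List<String[]> carryForward(List<String[]> rows, int index) {
        List<String[]> out = new ArrayList<>();
        String last = null;
        for (String[] row : rows) {
            String[] copy = row.clone();
            if (copy[index] != null) {
                last = copy[index]; // a new match replaces the retained value
            } else {
                copy[index] = last; // no new match: reuse the value of the previous row
            }
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"123", "Business address"},
                new String[]{null, "Mailing address"},
                new String[]{"456", "Business address"},
                new String[]{null, "Mailing address"});
        for (String[] row : carryForward(rows, 0)) {
            System.out.println(String.join("  ", row));
        }
    }
}
```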

We also used the * wildcard on the withText rule of field type. The * matches any sequence of characters. You can also use the ? wildcard, which matches any one character.
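
To get a feel for what the two wildcards accept, they can be modelled with java.util.regex. This is a hedged sketch and not how the parser implements withText: WildcardSketch and its methods are hypothetical, and the choice to search anywhere in the text (instead of requiring the whole text to match) is an assumption inferred from the “company no” example above:

```java
import java.util.regex.Pattern;

class WildcardSketch {
    // Translates a wildcard expression into a case-insensitive regular expression:
    // '*' becomes ".*" (any sequence), '?' becomes "." (any one character),
    // everything else is quoted so it is matched literally.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        StringBuilder literal = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*' || c == '?') {
                if (literal.length() > 0) {
                    regex.append(Pattern.quote(literal.toString()));
                    literal.setLength(0);
                }
                regex.append(c == '*' ? ".*" : ".");
            } else {
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            regex.append(Pattern.quote(literal.toString()));
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // Assumption of this sketch: the expression may match anywhere inside the text.
    static boolean matches(String wildcard, String text) {
        return toPattern(wildcard).matcher(text).find();
    }

    public static void main(String[] args) {
        System.out.println(matches("* address", "Business address")); // true
        System.out.println(matches("* address", "Not available"));    // false
        System.out.println(matches("Street ?", "Street 1:"));         // true
        System.out.println(matches("company no", "Company No:"));     // true
    }
}
```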

To make the code less repetitive, a reusable method was created to take advantage of the fact that most address fields have the same structure:

private void addField(FieldDefinition address, String fieldName, String textToMatch) {
    address.addField(fieldName).match("td").withText(textToMatch).match("span").getText();
}

We recommend structuring your code with reusable methods like the one above, so it stays easy to write, read and maintain.

After running the parser with the given configuration, the results are:

[ address ]
company_id__type______________street1___________________street2_____city______state____________country____zip___
123         Business address  MARKET PLAZA - AIR TOWER  25th floor  Sydney    New South Wales  Australia  2222  
123         Mailing address   1 Some street                         Sydney    New South Wales  Australia  2220  
456         Business address                                                                                    
456         Mailing address   2 George St                           Adelaide  South Australia  Australia  5000  

[ company ]
company_id__name_______________
123         My Corporation     
456         Other Corporation

While this result is acceptable, we would like to prevent having a row with no details for the business address.

We can narrow down the matching rules for addresses with the help of groups.

Groups

A Group demarcates sections of the HTML page where matching rules are valid. Fields within a group will only be populated if the parser is processing elements from inside the given group. Still using our address example shown earlier, we can define groups that target only the elements that appear between the <tr> elements with “Business address” and “Mailing address”, like this:

Group businessAdressGroup = address.newGroup()
    .startAt("tr").withExactTextMatchCase("Business address")
    .endAt("tr").withTextMatchCase("Mailing address");

businessAdressGroup.addField("street2").match("span").precededByText("Street 2").getText();
businessAdressGroup.addField("type", "B");
...

In the example above, the field street2 will only be populated from elements inside the group boundaries. We also set the field type to have the value “B”. Any rows produced from this group will have type = B.
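
The effect of group boundaries can be modelled without any HTML at all. In the simplified sketch below the page is flattened into a list of text lines, and a value is only collected while the scan is between a start marker and an end marker. GroupSketch and valuesInside are hypothetical names and the line-based model is an assumption made for illustration; the real parser works on HTML elements:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class GroupSketch {
    // Collects the values of lines starting with `prefix`, but only for lines found
    // after a `start` marker and before the next `end` marker.
    static List<String> valuesInside(List<String> lines, String start, String end, String prefix) {
        List<String> out = new ArrayList<>();
        boolean inside = false;
        for (String line : lines) {
            if (line.equals(start)) {
                inside = true;          // group entered
            } else if (line.equals(end)) {
                inside = false;         // group left
            } else if (inside && line.startsWith(prefix)) {
                out.add(line.substring(prefix.length()).trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> page = Arrays.asList(
                "Business address", "Street 1: MARKET PLAZA - AIR TOWER", "City: Sydney",
                "Mailing address", "Street 1: 1 Some street", "City: Sydney",
                "Business address", "Not available",
                "Mailing address", "Street 1: 2 George St", "City: Adelaide");

        // Only the first company has a street inside its business address section.
        System.out.println(valuesInside(page, "Business address", "Mailing address", "Street 1:"));
        // prints [MARKET PLAZA - AIR TOWER]
    }
}
```

In this model the second “Business address” section contains nothing that matches, so it contributes no value, which mirrors the goal of not producing an empty business address row.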

Here is the complete re-definition of the address entity using groups for business and mailing addresses:

HtmlEntitySettings address = entityList.configureEntity("address");

address.addPersistentField("company_id")
        .match("div").match("span").precededBy("span").withText("company no")
        .getText();

// business address group starts at <tr> with text "Business address"
// the group ends when the <tr> with "Mailing address" is reached.
Group businessAdressGroup = address.newGroup()
        .startAt("tr").withExactTextMatchCase("Business address")
        .endAt("tr").withTextMatchCase("Mailing address");

// Any rows produced from within this group will have field "type" set to "B" (for business addresses)
businessAdressGroup.addField("type", "B");

// Add fields to the group directly so their matching rules execute only when the group is entered.
addAddressFieldsToGroup(businessAdressGroup);

// Mailing address group starts from <tr> with text "Mailing address"
// Here we identify the group end using the closing tag </tr>, i.e. the group ends when the <tr> that
// follows the "Mailing address" heading is closed.
Group mailingAddressGroup = address.newGroup()
        .startAt("tr").withExactTextMatchCase("Mailing address")
        .endAtClosing("tr").precededBy("tr").withExactTextMatchCase("Mailing address");

//Any rows produced from within this group will have field "type" set to "M" (for mailing addresses)
mailingAddressGroup.addField("type", "M");

// Now we can add the address fields to this group too.
addAddressFieldsToGroup(mailingAddressGroup);

The mailingAddressGroup is defined using a different approach to determine the boundaries. This one starts when a “Mailing address” row is found, and ends when the <tr> tag that contains all address fields is closed, i.e. when the </tr> is reached.

The addAddressFieldsToGroup() method is simply defined as:

private void addAddressFieldsToGroup(Group group){
    addField(group, "street1", "Street 1:");
    addField(group, "street2", "Street 2:");
    addField(group, "city", "City:");
    addField(group, "state", "State:");
    addField(group, "country", "Country:");
    addField(group, "zip", "Postal Code:");
}

Notice that the fields were added to each group. These fields become part of the address entity itself and there’s no limit on how many different paths/groups are associated with a given field name.

Now, when running the parser the following results should be produced:

[ address ]
company_id__type__street1___________________street2_____city______state____________country____zip___
123         B     MARKET PLAZA - AIR TOWER  25th floor  Sydney    New South Wales  Australia  2222  
123         M     1 Some street                         Sydney    New South Wales  Australia  2220  
456         M     2 George St                           Adelaide  South Australia  Australia  5000  

[ company ]
company_id__name_______________
123         My Corporation     
456         Other Corporation

This is pretty usable now. As the address entity includes a company_id column, the rows of each entity could just be dumped into a database. However, using a database would be overkill if you only need to perform basic row grouping/joining. That’s where the Results object produced by the parser comes into play.

Combining results

When you call parse, the HtmlParser returns a Results object which allows you to manage the associations among rows of multiple entities. Using the example provided earlier, you can link rows of different entities based on the values of fields that are common to them. As company and address have a company_id field, we can simply write:

results.link("company", "address");

Now, when going through the company results, we have access to the addresses linked to each company (i.e. addresses that have the same company_id):

// links rows of address to company based on values of fields with the same name - "company_id" in this example.
results.link("company", "address");

HtmlParserResult companies = results.get("company"); // each company record will now have linked results
for (HtmlRecord company : companies.iterateRecords()) { // iterate over each company record
    String companyName = company.getString("name"); // using the record, we can get fields by name
    Long companyId = company.getLong("company_id"); // values can be read with the appropriate type

    println("Addresses of company: " + companyName + " (" + companyId + ")");
    Results<HtmlParserResult> linkedEntities = company.getLinkedEntityData(); // returns all results linked to
    // the current "company" record

    HtmlParserResult companyAddresses = linkedEntities.get("address"); //get the addresses linked to the company.
    // print company addresses ...
}

This code will produce the following output:

Addresses of company: My Corporation (123)
123  B  MARKET PLAZA - AIR TOWER  25th floor  Sydney  New South Wales  Australia  2222  
123  M  1 Some street                         Sydney  New South Wales  Australia  2220  

Addresses of company: Other Corporation (456)
456  M  2 George St    Adelaide  South Australia  Australia  5000

Which is really handy. We can also join the results with:

HtmlParserResult joinedResult = results.join("company", "address");

This produces a new HtmlParserResult object with rows containing the columns of all joined entities. Printing out the joinedResult above yields:

company_id__name_______________type__street1___________________street2_____city______state____________country____zip___
123         My Corporation     B     MARKET PLAZA - AIR TOWER  25th floor  Sydney    New South Wales  Australia  2222  
123         My Corporation     M     1 Some street                         Sydney    New South Wales  Australia  2220  
456         Other Corporation  M     2 George St                           Adelaide  South Australia  Australia  5000
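
Conceptually, this is a join on the field the two entities have in common. The sketch below reproduces the idea with plain collections and a reduced set of columns. It is not the parser's implementation: JoinSketch and join are hypothetical names, the key is assumed to be the first column of both row types, and parents without matching children are dropped, which may or may not be what the parser does:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class JoinSketch {
    // Joins parent and child rows whose first column (the shared key) is equal.
    // Each output row holds the parent columns followed by the child columns without the key.
    static List<String[]> join(List<String[]> parents, List<String[]> children) {
        Map<String, List<String[]>> childrenByKey = new LinkedHashMap<>();
        for (String[] child : children) {
            childrenByKey.computeIfAbsent(child[0], key -> new ArrayList<>()).add(child);
        }
        List<String[]> out = new ArrayList<>();
        for (String[] parent : parents) {
            List<String[]> matches = childrenByKey.getOrDefault(parent[0], Collections.emptyList());
            for (String[] child : matches) {
                List<String> row = new ArrayList<>(Arrays.asList(parent));
                row.addAll(Arrays.asList(child).subList(1, child.length));
                out.add(row.toArray(new String[0]));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // company rows: company_id, name
        List<String[]> companies = Arrays.asList(
                new String[]{"123", "My Corporation"},
                new String[]{"456", "Other Corporation"});
        // address rows (reduced): company_id, type, city
        List<String[]> addresses = Arrays.asList(
                new String[]{"123", "B", "Sydney"},
                new String[]{"123", "M", "Sydney"},
                new String[]{"456", "M", "Adelaide"});
        for (String[] row : join(companies, addresses)) {
            System.out.println(String.join("  ", row));
        }
    }
}
```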

The results can also be read directly into instances of annotated classes, as demonstrated in the next section.

Annotated Java beans

Instead of iterating over raw records of HtmlParserResult like we’ve been doing until now, you can collect the results as instances of classes that use a few annotations. All you need to do is to add the Parsed annotation to the fields of your class, like this:

class Address {

    public enum Type {
        BUSINESS('B'),
        MAILING('M');

        public final char code;

        Type(char code) {
            this.code = code;
        }
    }

    @Parsed
    @EnumOptions(customElement = "code")
    private Type type;

    @Parsed
    @Trim
    @LowerCase
    private String street1;

    @Parsed
    @LowerCase
    private String street2;

    @Parsed
    private String city;

    @Parsed
    private String state;

    @Parsed
    @UpperCase
    private String country;

    @Parsed(field = "zip")
    private long postCode;

    @Override
    public String toString() {
        return type + ": " + street1 + (street2 == null ? "" : " " + street2) + ", " + city + " - " + state + ", " + country + " " + postCode;
    }
}

Now you can easily obtain a List of Address with:

List<Address> addresses = results.get("address").getBeans(Address.class);

All annotations supported by univocity-parsers are available on the HTML parser as well. The HTML parser also provides the Linked and Group annotations for processing nested collections and maps. For example, we can create a Company class with a list of addresses:

class Company {

    @Parsed(field = "company_id")
    public Long id;

    @Parsed
    public String name;

    @Linked(entity = "address", type = Address.class)
    public List<Address> addresses;

    @Override
    public String toString() {
        StringBuilder out = new StringBuilder();
        out.append("Company ").append(id).append(": ").append(name);

        if (addresses != null && addresses.size() > 0) {
            for (Address address : addresses) {
                out.append("\n * ").append(address);
            }
        }

        return out.toString();
    }
}

The Linked annotation refers to an entity that must be available from the parent record. In this case, the address results must have been linked to company so that each company record has addresses linked to it.

For example:

HtmlParserResult companies = results.get("company");
HtmlParserResult addresses = results.get("address");

companies.link(addresses); //links addresses to companies

List<Company> companyList = companies.getBeans(Company.class);

for (Company company : companyList) {
    println(company);
}

This will produce:

Company 123: My Corporation
 * BUSINESS: market plaza - air tower 25th floor, Sydney - New South Wales, AUSTRALIA 2222
 * MAILING: 1 some street, Sydney - New South Wales, AUSTRALIA 2220
Company 456: Other Corporation
 * MAILING: 2 george st, Adelaide - South Australia, AUSTRALIA 5000

Linking results explicitly is usually required when processing multiple entities from the same page, like we have been doing with the companies and addresses. However, most websites will have the information you need scattered across multiple pages. The “Link following” section demonstrates how to use the parser to collect data organized like this.

Rate limiting

To minimize the chance of unintentionally abusing a target server and having your access to it blocked, the parser waits for some time - 15 milliseconds by default - before making a new request to process a linked resource. This time can be configured (and disabled) from the HtmlParserSettings:

long intervalInMillis = htmlEntityList.getParserSettings().getRemoteInterval();
htmlEntityList.getParserSettings().setRemoteInterval(0); //disables the internal rate limiter.

If an interval is defined, all threads responsible for following links and/or downloading resources such as images will wait for the given interval, counted from the time the previous thread made a request.

This especially affects HtmlLinkFollowers, but the interval can be adjusted between requests if you define a NextInputHandler, i.e.

linkFollower.setNextLinkHandler(new NextInputHandler<RemoteContext>() { //you can use a lambda here
    @Override
    public void prepareNextCall(RemoteContext remoteContext) {
        RateLimiter rateLimiter = remoteContext.getRateLimiter();
        
        long threadsWaiting = rateLimiter.getWaitingCount();

        if(threadsWaiting > 10){
            rateLimiter.setInterval(0); //disables the rate limiter
        } else if(threadsWaiting > 4){
            rateLimiter.decreaseWaitTime(10); //removes 10 milliseconds from the configured wait time
        } else if(threadsWaiting < 2){
            rateLimiter.increaseWaitTime(10); //adds 10 milliseconds to the configured wait time
        }
    }
});

This is also available when using a HtmlPaginator:

paginator.setPaginationHandler(new NextInputHandler<PaginationContext>() {
        @Override
        public void prepareNextCall(PaginationContext remoteContext) {
            RateLimiter rateLimiter = remoteContext.getRateLimiter();
                    
            // ... same as before.
        }
    }
); 

In both cases, the RemoteContext provides the active RateLimiter, which lets you know how many threads are waiting to connect to a remote server, and also allows you to increase or decrease the wait time of the next remote call. Once the request executes, any increase/decrease is reset to 0. Use setInterval to update the interval or disable it altogether.
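
The interplay between the base interval and the one-off adjustments can be modelled with a small class. This is only a sketch of the behaviour described above and not the parser's RateLimiter: IntervalLimiterSketch and its reserve method are hypothetical, the other method names simply mirror the ones used in this section, and time is passed in explicitly to keep the example deterministic:

```java
class IntervalLimiterSketch {
    private long interval;       // base wait between requests in ms; 0 disables waiting
    private long adjustment;     // one-off change applied to the next request only
    private long lastRequest;    // time (ms) at which the previous request was allowed to run
    private boolean hasPrevious; // false until the first request is recorded

    IntervalLimiterSketch(long interval) {
        this.interval = interval;
    }

    synchronized void setInterval(long interval) {
        this.interval = interval;
    }

    synchronized void increaseWaitTime(long millis) {
        adjustment += millis;
    }

    synchronized void decreaseWaitTime(long millis) {
        adjustment -= millis;
    }

    // Returns how long a request arriving at time `now` has to wait, and records the request.
    synchronized long reserve(long now) {
        long wait = 0;
        if (interval > 0 && hasPrevious) {
            long earliest = lastRequest + Math.max(0, interval + adjustment);
            wait = Math.max(0, earliest - now);
        }
        lastRequest = now + wait;
        hasPrevious = true;
        adjustment = 0; // increases/decreases apply to a single request, then reset
        return wait;
    }

    public static void main(String[] args) {
        IntervalLimiterSketch limiter = new IntervalLimiterSketch(15);
        System.out.println(limiter.reserve(0));  // 0: first request never waits
        System.out.println(limiter.reserve(5));  // 10: must run 15 ms after the previous one
        limiter.increaseWaitTime(10);
        System.out.println(limiter.reserve(15)); // 25: 15 ms interval plus the one-off 10 ms
        System.out.println(limiter.reserve(40)); // 15: the adjustment was reset
        limiter.setInterval(0);
        System.out.println(limiter.reserve(56)); // 0: rate limiting disabled
    }
}
```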

Also note that the parser settings cascade, so the order in which the configuration is set matters. For example:

This affects the current page and all linked pages:

HtmlEntityList  entityList = new HtmlEntityList();

//disables the internal rate limiter.
entityList.getParserSettings().setRemoteInterval(0); 
// downloads files referenced by the HTML such as image, javascript and .css files.
entityList.getParserSettings().fetchResourcesBeforeParsing(new FetchOptions());

... configure linkFollowers

This affects the current page only, and none of the linked pages:

HtmlEntityList  entityList = new HtmlEntityList();

... configure linkFollowers

//disables the internal rate limiter.
entityList.getParserSettings().setRemoteInterval(0); 
// downloads files referenced by the HTML such as image, javascript and .css files.
entityList.getParserSettings().fetchResourcesBeforeParsing(new FetchOptions());

Further reading

This is the end of the basic tutorial and by now you should be able to understand the basics of how the parser works in general.

Feel free to proceed to the following sections (in any order).

If you find a bug

We take errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. A new SNAPSHOT build will be generated so you (and anyone affected by the bug) can proceed with your work as soon as the adjustments are made.

If you find a bug, don’t hesitate to report an issue here. You can also submit feature requests or any other improvements there.

We are happy to help if you have any questions regarding how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.

We can work for you

If you don’t have the resources or don’t really want to spend time coding, we can build a custom solution for you using our products. We deliver quickly as we know the ins and outs of everything we are dealing with. Send an e-mail to sales@univocity.com with your requirements and we’ll be happy to assist.

The univocity team.

www.univocity.com