Introduction to the univocity HTML parser
The univocity-html-parser is our solution to extract and aggregate information from tricky HTML pages and websites. Any project you write with our parser should be a breeze to code and to maintain in comparison to any other HTML processing solution you find or come up with.
Motivation
We built this library internally over 5 years to assist us building custom-made solutions for our clients who needed to extract data from large, complex sets of HTML pages with lots of data points to collect and “stitch” together. It became so useful that we decided to release it as a product that everyone can use and benefit from.
What does the univocity-html-parser do differently from <insert your preferred tool here>?
- It’s declarative: most of the work you have to do is to declare matching rules to narrow down the HTML nodes you want to get data from. You don’t have to manually traverse a tree structure of HTML nodes - the DOM - to locate and extract some piece of information (well, you still can but 99% of the time you won’t need to do that).
- The data collected from each matching rule is combined into rows in a way that eliminates the problem of having to write code to aggregate the information into records you can use. Usually these rows come out ready to insert into a database, but you can also convert them into Java Beans using annotations. The data on each row can come from different pages, or from different sections of the same page.
- It can extract resources such as images, CSS and JavaScript used by the HTML, like your browser does when you save a page to your hard drive. Your stored HTML files can be rendered locally and share the same resources collected in the past instead of downloading a new set of resource files every time a new page is saved.
- Historical data processing is supported. You can easily organize the files downloaded by the parser in a way that makes sense to you (by date, time, batch ID, etc) and re-run a parsing process over copies of HTML pages pulled in the past. This allows you to make adjustments to your code as you go and collect data that you might have missed or that should be structured differently.
- Pagination is handled almost transparently. Just define a paginator and tell it to update URL parameters, “click” the link to the next page of results, or capture form elements to submit a POST request to the next page.
- Depending on the structure of the HTML you need to work with, you can easily implement a change detection mechanism to automatically identify modifications on HTML pages which can potentially have new data points your program might be missing.
- It saves you lots and lots of time! Our software comes with a 14-day trial which should be more than enough for you to complete rather large projects.
We recommend that you go through all sections of this tutorial first to get a grasp of what you can do with the univocity-html-parser. The source code of all examples shown in our tutorials is available for you to check out and run locally. Let’s get started!
Running the tutorial examples locally
If you are interested in seeing the examples shown here in action you can clone the source code of the tutorials using git with:
git clone https://github.com/univocity/univocity-html-parser-tutorial.git
Then create a new project “from existing sources” using your preferred IDE. Select the univocity-html-parser-tutorial folder created in the step above and open it as a Maven project.
The examples in the tutorial are TestNG unit tests. If you use Eclipse, make sure you have the TestNG plugin installed.
The Tutorial class has a main method that fires up the license manager if you need it.
If you have any trouble executing the tutorials, send us an e-mail at support@univocity.com.
Tutorial
If you worked with HTML scraping before, you already know that the bulk of the work when extracting data from HTML is in writing the code to target specific elements of the page and get their contents in a usable fashion. The univocity-html-parser aims to drastically reduce the amount of work required there.
This first tutorial is a quick introduction on how to use the parser and there is a lot to explore. We also created detailed tutorials for each feature group of the parser. Check them out after going through this tutorial:
- Fields and matching rules
- Reading data into java beans
- Reading linked results into java beans
- Pagination
- Link following
- Downloads and historical data management
- Resource Downloading
- Listening to parser actions
In addition, we put a lot of effort into writing detailed and self-explanatory javadocs.
If you have any questions or suggestions, just send an e-mail to parsers@univocity.com. We usually reply in less than 24 hours.
The source code and all files used in the following sections of this tutorial are available on github:
Let’s get started!
Basic configuration and parsing
Many websites are built using HTML that is more or less structured around elements with IDs and CSS classes. The more structured the HTML, the easier it is to handle. Take, for example, the (simplified) snippet of search results from an online store. It’s simply a sequence of table cells (<td>) with products and their prices.
The HTML looks like this:
<td>
<div class="product">
<a href="/catalog/products/60311495/" style="pointer-events:none;">
<span class="prodName">IKEA PS 2014</span>
<span class="prodDesc">Pendant lamp </span>
<span class="prodPrice">$99</span>
</a>
</div>
</td>
<td>
<div class="product">
<a href="/catalog/products/80228607/" style="pointer-events:none;">
<span class="prodName">STOCKHOLM</span>
<span class="prodDesc">Pendant lamp </span>
<span class="prodPrice">$108</span>
</a>
</div>
</td>
To parse this (or any other HTML) with our parser, the first step is to define one or more entities to store the data. Each entity can have multiple fields added to it, and when the parser runs each entity will hold its own rows. The code to create a list of entities depends on an HtmlEntityList and looks like this:
// Create a new instance
HtmlEntityList entityList = new HtmlEntityList();
// Configure an entity named "items"
HtmlEntitySettings items = entityList.configureEntity("items");
So we have an items entity. The HtmlEntitySettings provides the methods to define fields and configure the entity. Let’s collect the name, description and price of each item in the search results of the sample HTML shown earlier. All we have to do is to specify the fields of our items entity. Each field has a name and requires a path to the relevant HTML elements:
// Let's add a few fields to the "items" entity.
// A field must have a name and matching rules associated with it
items.addField("name")
.match("span") // match any span
.classes("prodName") // with a class "prodName"
.getText(); //if a <span> with the "prodName" class is found, get the text from inside the <span>.
// Next we add a "description" field, now matching <span> elements with class "prodDesc"
items.addField("description").match("span").classes("prodDesc").getText();
// Next we create a "price" field, and we clean up the price to remove the unwanted dollar sign
items.addField("price").match("span").classes("prodPrice").getText()
.transform(price -> price.trim().replaceAll("\\$", ""));
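The transformation attached to the price field is ordinary string handling, so it can be tried on its own before it is wired into a field. A minimal sketch that applies the same expression to a couple of sample values; the class and method names (PriceCleanupSketch, cleanPrice) are made up for this illustration and are not part of the parser's API:

```java
import java.util.function.Function;

// Hypothetical helper class, used only to try the transformation in isolation.
class PriceCleanupSketch {

    // Same expression used by the "price" field: trim first, then drop every dollar sign.
    static final Function<String, String> CLEAN_PRICE =
            price -> price.trim().replaceAll("\\$", "");

    static String cleanPrice(String raw) {
        return CLEAN_PRICE.apply(raw);
    }

    public static void main(String[] args) {
        System.out.println(cleanPrice("$99"));
        System.out.println(cleanPrice("  $108 "));
    }
}
```

Because trimming happens before the replacement, an input such as "$ 99" would keep the space that follows the dollar sign. Swap the two calls if your data looks like that.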
Once our entity is configured with its fields, we can create a HtmlParser and put it to work:
// Create a parser for our entities
HtmlParser parser = new HtmlParser(entityList);
// Now we can parse the input. The `FileProvider` helps to load files from the classpath.
FileProvider input = new FileProvider("documentation/tutorial/html/example_001.html", "UTF-8");
// Call the `parse` method to parse the input and get the results.
Results<HtmlParserResult> results = parser.parse(input);
The Results produced by the parser map entity names to their corresponding data. They also allow you to join data of multiple entities if required (we’ll save that for later). Let’s get the results of our “items” entity.
HtmlParserResult itemResults = results.get("items");
// The results have headers, rows and other information
String[] headers = itemResults.getHeaders();
List<String[]> rows = itemResults.getRows();
If we print out the headers and rows, the output will be:
name__________description___price__
IKEA PS 2014  Pendant lamp  99
STOCKHOLM     Pendant lamp  108
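The printing step itself is not shown in this tutorial. A minimal sketch of one way to produce that layout from the headers and rows, assuming every row has one value per header; TablePrinterSketch is a hypothetical helper written for this example, not something the parser provides:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical printing helper; not part of the parser's API.
class TablePrinterSketch {

    // Each column is as wide as its longest value (or header) plus two characters.
    // Assumes every row has exactly one value per header.
    static List<String> format(String[] headers, List<String[]> rows) {
        int[] widths = new int[headers.length];
        for (int i = 0; i < headers.length; i++) {
            widths[i] = headers[i].length();
            for (String[] row : rows) {
                String value = row[i] == null ? "" : row[i];
                widths[i] = Math.max(widths[i], value.length());
            }
            widths[i] += 2;
        }

        List<String> lines = new ArrayList<>();
        lines.add(pad(headers, widths, '_'));            // headers are filled with underscores
        for (String[] row : rows) {
            // values are filled with spaces; trailing spaces are dropped
            lines.add(pad(row, widths, ' ').replaceAll("\\s+$", ""));
        }
        return lines;
    }

    private static String pad(String[] values, int[] widths, char filler) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            String value = values[i] == null ? "" : values[i];
            out.append(value);
            for (int j = value.length(); j < widths[i]; j++) {
                out.append(filler);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] headers = {"name", "description", "price"};
        List<String[]> rows = Arrays.asList(
                new String[]{"IKEA PS 2014", "Pendant lamp", "99"},
                new String[]{"STOCKHOLM", "Pendant lamp", "108"});
        format(headers, rows).forEach(System.out::println);
    }
}
```

In your own code you would pass itemResults.getHeaders() and itemResults.getRows() instead of the hard-coded sample values.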
Time to try dealing with something more intricate.
Exploring a bit more
As configuring fields and their paths represents the bulk of the work required to use our parser, we had to make sure that rules are easy to build and flexible enough to handle anything. Let’s move on to something a bit messier.
In this example we want to collect information about companies and their addresses. Even though the page looks simple, there are no elements with IDs or CSS classes. This makes collecting the information properly using conventional tools quite a painful experience (try it for yourself). The HTML looks like this:
<div>
<span>Company No: </span><span><b>123</b></span>
<table>
<tr>
<td>Legal name: <b><span>My Corporation</span></b></td>
</tr>
<tr>
<td><br/><b>Business address</b></td>
</tr>
<tr>
<td>
<table border="1">
<tr>
<td colspan="2">Street 1:<br/><span>MARKET PLAZA - AIR TOWER</span></td>
<td colspan="2">Street 2:<br/><span>25th floor</span></td>
</tr>
<tr>
<td>City:<br/><span>Sydney</span></td>
<td>State:<br/><SPAN>New South Wales</SPAN></td>
<td>Country:<br/><SPAN>Australia</SPAN></td>
<td>Postal Code:<br/><span>2222</span></td>
</tr>
</table>
</td>
</tr>
<tr>
<td><br/><b>Mailing address</b></td>
</tr>
<tr>
<td>
<table border="1">
<tr>
<td colspan="2">Street 1:<br/><span>1 Some street</span></td>
<td colspan="2">Street 2:<br/><span></span></td>
</tr>
<tr>
<td>City:<br/><span>Sydney</span></td>
<td>State:<br/><SPAN>New South Wales</SPAN></td>
<td>Country:<br/><SPAN>Australia</SPAN></td>
<td>Postal Code:<br/><span>2220</span></td>
</tr>
</table>
</td>
</tr>
</table>
<br/>
<br/>
<span>Company No: </span><span><b>456</b></span>
<table>
<tr>
<td>Legal name: <b><span>Other Corporation</span></b></td>
</tr>
<tr>
<td><br/><b>Business address</b></td>
</tr>
<tr>
<td><br/>Not available</td>
</tr>
<tr>
<td><br/><b>Mailing address</b></td>
</tr>
<tr>
<td>
<table border="1">
<tr>
<td colspan="2">Street 1:<br/><span>2 George St</span></td>
<td colspan="2">Street 2:<br/><span></span></td>
</tr>
<tr>
<td>City:<br/><span>Adelaide</span></td>
<td>State:<br/><SPAN>South Australia</SPAN></td>
<td>Country:<br/><SPAN>Australia</SPAN></td>
<td>Postal Code:<br/><span>5000</span></td>
</tr>
</table>
</td>
</tr>
</table>
</div>
The univocity-html-parser can handle this input quite cleanly. Let’s define a “company” entity and its fields first:
// We just need company ID and name.
HtmlEntitySettings company = entityList.configureEntity("company");
company.addField("company_id")
.match("div") //look for a <div>
.match("span") //inside or after that <div>, find a <span>
.precededImmediatelyBy("span").withText("company no") //see if it is preceded by another <span>
// containing the text "company no"
.getText(); //returns the text of the last matched element in the path (a <span>).
company.addField("name")
.match("span") //look for any <span>
.childOf("b") //the <span> must be a child of a <b> element
.childOf("td").withText("legal name") //the <b> element must be a child of a
// <td> element with text "legal name"
.getText(); //returns the text of the last matched element in the path (a <span>).
The above snippet introduces some new rules for matching elements. Notice how the match method can be followed by another match, i.e. .match("div").match("span"). This helps narrow down the search to only elements that appear one after the other. Any match method allows the specification of further constraints to be applied on top of the last matched element. In the company_id field, for example, a <span> is only considered if it is preceded by another <span> which in turn must have the text “company no”. Notice that the withText constraint performs a case-insensitive search.
Once an element is matched by the parser, the last element in a sequence of match rules will have its contents collected. The getText method returns the plain text inside the matched element. If this element has children, their text will also be collected. You can also use getFollowingText to get any text after the matched element, and many other alternatives.
Next, the “address” entity definition:
// The address entity has quite a few fields.
HtmlEntitySettings address = entityList.configureEntity("address");
// We want to know the ID of the company that "owns" each address.
address.addPersistentField("company_id") //a "persistent" field retains its value across all rows.
.match("div")
.match("span")
.precededBy("span").withText("company no")
.getText();
address.addField("type")
.match("tr").withText("* address") //look, a wildcard
.getText();
// The HTML structure is the same for street 1 & 2. Different rules can get to the same content, as demonstrated
address.addField("street1").match("td").withText("Street 1").match("span").getText();
address.addField("street2").match("span").precededByText("Street 2").getText();
// We created a method to configure the remaining fields as the structure is the same
// and only the text preceding each value changes
addField(address, "city", "City:");
addField(address, "state", "State:");
addField(address, "country", "Country:");
addField(address, "zip", "Postal Code:");
There’s not a lot of novelty here, except for the addPersistentField method. A persistent field is a field that retains its value across all rows produced for an entity. Every time a new row is produced by the parser, the values of persistent fields from the previous row are copied into the new one. The values of these fields will only change if a new match is found for the persistent field.
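The carry-over behaviour is easy to picture with plain arrays. A minimal sketch, assuming rows are String arrays in which null means "no new match for this column"; PersistentFieldSketch is a hypothetical illustration and not how the parser is implemented:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the "persistent field" behaviour; not the parser's implementation.
class PersistentFieldSketch {

    // Rows arrive with null in the persistent column whenever no new match was found.
    // The last matched value is copied into those rows.
    static List<String[]> carryForward(List<String[]> rows, int persistentColumn) {
        List<String[]> out = new ArrayList<>();
        String lastValue = null;
        for (String[] row : rows) {
            String[] copy = row.clone();
            if (copy[persistentColumn] != null) {
                lastValue = copy[persistentColumn]; // new match: the value changes from here on
            } else {
                copy[persistentColumn] = lastValue; // no match: reuse the previous row's value
            }
            out.add(copy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
                new String[]{"123", "Business address"},
                new String[]{null, "Mailing address"},
                new String[]{"456", "Business address"},
                new String[]{null, "Mailing address"});
        for (String[] row : carryForward(rows, 0)) {
            System.out.println(String.join(" ", row));
        }
    }
}
```

This is why every address row ends up with the ID of the company it belongs to, even though the company number appears only once per company in the HTML.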
We also used the * wildcard on the withText rule of field type. The * matches any sequence of characters. You can also use the ? wildcard, which matches any one character.
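If you are unsure what a wildcard expression will accept, it can help to try it outside the parser first. The sketch below translates * and ? into a case-insensitive regular expression. It is a hypothetical illustration only: WildcardSketch is not part of the parser, and since this tutorial does not describe whether an expression has to cover the whole text of an element, the sketch simply assumes a full match.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of '*' and '?' wildcards; not the parser's implementation.
class WildcardSketch {

    // Translates a wildcard expression into a case-insensitive regular expression.
    static Pattern toPattern(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char ch : wildcard.toCharArray()) {
            if (ch == '*') {
                regex.append(".*");  // any sequence of characters
            } else if (ch == '?') {
                regex.append('.');   // exactly one character
            } else {
                regex.append(Pattern.quote(String.valueOf(ch))); // everything else is literal
            }
        }
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    // Assumption made for this sketch: the expression must cover the whole (trimmed) text.
    static boolean matchesWholeText(String wildcard, String text) {
        return toPattern(wildcard).matcher(text.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWholeText("* address", "Business address"));
        System.out.println(matchesWholeText("* address", "Not available"));
        System.out.println(matchesWholeText("Street ?", "street 1"));
    }
}
```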
To make the code less repetitive, a reusable method was created to take advantage of the fact that most address fields have the same structure:
private void addField(FieldDefinition address, String fieldName, String textToMatch) {
address.addField(fieldName).match("td").withText(textToMatch).match("span").getText();
}
We recommend structuring your code with reusable methods like the one above, so that it stays easy to write, read and maintain.
After running the parser with the given configuration, the results are:
[ address ]
company_id__type______________street1___________________street2_____city______state____________country____zip___
123         Business address  MARKET PLAZA - AIR TOWER  25th floor  Sydney    New South Wales  Australia  2222
123         Mailing address   1 Some street                         Sydney    New South Wales  Australia  2220
456         Business address
456         Mailing address   2 George St                           Adelaide  South Australia  Australia  5000
[ company ]
company_id__name_______________
123         My Corporation
456         Other Corporation
While this result is acceptable, we would like to prevent having a row with no details for the business address.
We can narrow down the matching rules for addresses with the help of groups.
Groups
A Group demarcates sections of the HTML page where matching rules are valid. Fields within a group will only be populated if the parser is processing elements from inside the given group. Still using our address example shown earlier, we can define groups that target only the elements that appear between the <tr> elements with “Business address” and “Mailing address”, like this:
Group businessAdressGroup = address.newGroup()
.startAt("tr").withExactTextMatchCase("Business address")
.endAt("tr").withTextMatchCase("Mailing address");
businessAdressGroup.addField("street2").match("span").precededByText("Street 2").getText();
businessAdressGroup.addField("type", "B");
...
In the example above, the field street2 will only be populated from elements inside the group boundaries. We also set the field type to have the value “B”. Any rows produced from this group will have type = B.
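To picture what the boundaries do, think of the page as a sequence of elements in which rules are switched on at the start element and off again at the end element. A minimal sketch of that idea in plain Java, assuming each element is reduced to its text; GroupBoundarySketch is a hypothetical illustration and not the parser's implementation:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration of group boundaries; not the parser's implementation.
class GroupBoundarySketch {

    // Returns the elements seen after a "start" element and before the next "end" element.
    static List<String> insideGroup(List<String> elements, Predicate<String> start, Predicate<String> end) {
        List<String> collected = new ArrayList<>();
        boolean inside = false;
        for (String element : elements) {
            if (inside && end.test(element)) {
                inside = false;          // boundary reached: rules are switched off
            } else if (!inside && start.test(element)) {
                inside = true;           // group entered: rules are switched on
            } else if (inside) {
                collected.add(element);  // only elements within the boundaries are considered
            }
        }
        return collected;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
                "Legal name: My Corporation", "Business address", "Street 2: 25th floor",
                "Mailing address", "Street 2:",
                "Legal name: Other Corporation", "Business address", "Not available",
                "Mailing address", "Street 2:");
        List<String> business = insideGroup(rows,
                "Business address"::equals,              // exact, case-sensitive start
                text -> text.contains("Mailing address")); // end of the group
        business.forEach(System.out::println);
    }
}
```

Only the content between the two headings is visible to the group. For the second company that content is "Not available", which none of the address field rules match, so the group has nothing to put into a row.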
Here is the complete re-definition of the address entity using groups for business and mailing addresses:
HtmlEntitySettings address = entityList.configureEntity("address");
address.addPersistentField("company_id")
.match("div").match("span").precededBy("span").withText("company no")
.getText();
// business address group starts at <tr> with text "Business address"
// the group ends when the <tr> with "Mailing address" is reached.
Group businessAdressGroup = address.newGroup()
.startAt("tr").withExactTextMatchCase("Business address")
.endAt("tr").withTextMatchCase("Mailing address");
// Any rows produced from within this group will have field "type" set to "B" (for business addresses)
businessAdressGroup.addField("type", "B");
// Add fields to the group directly so their matching rules execute only when the group is entered.
addAddressFieldsToGroup(businessAdressGroup);
// Mailing address group starts from <tr> with text "Mailing address"
// Here we identify the group end using the closing tag </tr>, i.e. the group ends when the <tr> that
// follows the "Mailing address" heading is closed.
Group mailingAddressGroup = address.newGroup()
.startAt("tr").withExactTextMatchCase("Mailing address")
.endAtClosing("tr").precededBy("tr").withExactTextMatchCase("Mailing address");
//Any rows produced from within this group will have field "type" set to "M" (for mailing addresses)
mailingAddressGroup.addField("type", "M");
// Now we can add the address fields to this group too.
addAddressFieldsToGroup(mailingAddressGroup);
The mailingAddressGroup is defined using a different approach to determine the boundaries. This one starts when a “Mailing address” row is found, and ends when the <tr> tag that contains all address fields is closed, i.e. when the </tr> is reached.
The addAddressFieldsToGroup() method is simply defined as:
private void addAddressFieldsToGroup(Group group){
addField(group, "street1", "Street 1:");
addField(group, "street2", "Street 2:");
addField(group, "city", "City:");
addField(group, "state", "State:");
addField(group, "country", "Country:");
addField(group, "zip", "Postal Code:");
}
Notice that the fields were added to each group. These fields become part of the address entity itself and there’s no limit on how many different paths/groups are associated with a given field name.
Now, when running the parser the following results should be produced:
[ address ]
company_id__type__street1___________________street2_____city______state____________country____zip___
123         B     MARKET PLAZA - AIR TOWER  25th floor  Sydney    New South Wales  Australia  2222
123         M     1 Some street                         Sydney    New South Wales  Australia  2220
456         M     2 George St                           Adelaide  South Australia  Australia  5000
[ company ]
company_id__name_______________
123         My Corporation
456         Other Corporation
This is pretty usable now. As the address entity includes a company_id column, the rows of each entity could just be dumped into a database. However, using a database would be overkill if you only need to perform basic row grouping/joining. That’s where the Results object produced by the parser comes into play.
Combining results
Once you call parse, the HtmlParser returns a Results object which allows you to manage the associations among rows of multiple entities. Using the example provided earlier, you can link rows of a given entity based on the values of fields that are common to them. As company and address have a company_id field, we can simply write:
results.link("company", "address");
Now, when going through the company results, we have access to the addresses linked to each company (i.e. addresses that have the same company_id):
// links rows of address to company based on values of fields with the same name - "company_id" in this example.
results.link("company", "address");
HtmlParserResult companies = results.get("company"); // each company record will now have linked results
for (HtmlRecord company : companies.iterateRecords()) { // iterate over each company record
String companyName = company.getString("name"); // using the record, we can get fields by name
Long companyId = company.getLong("company_id"); // values can be read with the appropriate type
println("Addresses of company: " + companyName + " (" + companyId + ")");
Results<HtmlParserResult> linkedEntities = company.getLinkedEntityData(); // returns all results linked to
// the current "company" record
HtmlParserResult companyAddresses = linkedEntities.get("address"); //get the addresses linked to the company.
// print company addresses ...
}
This code will produce the following output:
Addresses of company: My Corporation (123)
123 B MARKET PLAZA - AIR TOWER 25th floor Sydney New South Wales Australia 2222
123 M 1 Some street Sydney New South Wales Australia 2220
Addresses of company: Other Corporation (456)
456 M 2 George St Adelaide South Australia Australia 5000
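Conceptually, linking amounts to grouping the address rows under the company row that shares the same company_id. A minimal sketch of that idea in plain Java, assuming rows are simple String arrays; LinkSketch and its link method are hypothetical names and say nothing about how Results.link works internally:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of linking rows by a shared field; not the parser's implementation.
class LinkSketch {

    // Groups child rows under the parent key they share (e.g. "company_id").
    static Map<String, List<String[]>> link(List<String[]> parents, int parentKey,
                                            List<String[]> children, int childKey) {
        Map<String, List<String[]>> linked = new LinkedHashMap<>();
        for (String[] parent : parents) {
            linked.put(parent[parentKey], new ArrayList<>());
        }
        for (String[] child : children) {
            List<String[]> bucket = linked.get(child[childKey]);
            if (bucket != null) {   // children without a matching parent are ignored
                bucket.add(child);
            }
        }
        return linked;
    }

    public static void main(String[] args) {
        List<String[]> companies = Arrays.asList(
                new String[]{"123", "My Corporation"},
                new String[]{"456", "Other Corporation"});
        List<String[]> addresses = Arrays.asList(
                new String[]{"123", "B", "MARKET PLAZA - AIR TOWER"},
                new String[]{"123", "M", "1 Some street"},
                new String[]{"456", "M", "2 George St"});
        link(companies, 0, addresses, 0).forEach((id, rows) ->
                System.out.println(id + " has " + rows.size() + " address(es)"));
    }
}
```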
Which is really handy. We can also join the results with:
HtmlParserResult joinedResult = results.join("company", "address");
This produces a new HtmlParserResult object with rows containing the columns of all joined entities. Printing out the joinedResult above yields:
company_id__name_______________type__street1___________________street2_____city______state____________country____zip___
123         My Corporation     B     MARKET PLAZA - AIR TOWER  25th floor  Sydney    New South Wales  Australia  2222
123         My Corporation     M     1 Some street                         Sydney    New South Wales  Australia  2220
456         Other Corporation  M     2 George St                           Adelaide  South Australia  Australia  5000
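The joined rows have the shape you would get from a join on the shared column, with that column appearing only once. A minimal sketch of such a join over String arrays, assuming the key is a single column; JoinSketch is a hypothetical illustration, not a description of how results.join is implemented:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration of joining rows on a shared column; not the parser's implementation.
class JoinSketch {

    static List<String[]> join(List<String[]> left, int leftKey, List<String[]> right, int rightKey) {
        List<String[]> joined = new ArrayList<>();
        for (String[] l : left) {
            for (String[] r : right) {
                if (Objects.equals(l[leftKey], r[rightKey])) {
                    List<String> row = new ArrayList<>(Arrays.asList(l));
                    for (int i = 0; i < r.length; i++) {
                        if (i != rightKey) {
                            row.add(r[i]); // the shared column is not repeated
                        }
                    }
                    joined.add(row.toArray(new String[0]));
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> companies = Arrays.asList(
                new String[]{"123", "My Corporation"},
                new String[]{"456", "Other Corporation"});
        List<String[]> addresses = Arrays.asList(
                new String[]{"123", "B", "MARKET PLAZA - AIR TOWER"},
                new String[]{"123", "M", "1 Some street"},
                new String[]{"456", "M", "2 George St"});
        for (String[] row : join(companies, 0, addresses, 0)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```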
The results can also be read directly into instances of annotated classes, as demonstrated in the next section.
Annotated java beans
Instead of iterating over raw records of HtmlParserResult like we’ve been doing until now, you can collect the results as instances of classes that use a few annotations. All you need to do is to add the Parsed annotation to the fields of your class, like this:
class Address {
public enum Type {
BUSINESS('B'),
MAILING('M');
public final char code;
Type(char code) {
this.code = code;
}
}
@Parsed
@EnumOptions(customElement = "code")
private Type type;
@Parsed
@Trim
@LowerCase
private String street1;
@Parsed
@LowerCase
private String street2;
@Parsed
private String city;
@Parsed
private String state;
@Parsed
@UpperCase
private String country;
@Parsed(field = "zip")
private long postCode;
@Override
public String toString() {
return type + ": " + street1 + (street2 == null ? "" : " " + street2) + ", " + city + " - " + state + ", " + country + " " + postCode;
}
}
Now you can easily obtain a List of Address with:
List<Address> addresses = results.get("address").getBeans(Address.class);
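The type field of Address deserves a closer look: the parsed column holds "B" or "M", while the bean holds a Type constant. The EnumOptions annotation with customElement = "code" is what points the conversion at the code field of the enumeration. Done by hand, the lookup would resemble the sketch below; EnumCodeSketch and fromCode are hypothetical names used for illustration, and the way blank or unknown values are handled here is an assumption of the sketch rather than documented behaviour.

```java
// Hypothetical illustration of matching an enum constant by a custom "code" element.
class EnumCodeSketch {

    enum Type {
        BUSINESS('B'),
        MAILING('M');

        final char code;

        Type(char code) {
            this.code = code;
        }
    }

    // Returns the constant whose code equals the given value.
    // Assumption of this sketch: null, blank or unknown values yield null.
    static Type fromCode(String value) {
        if (value == null || value.trim().length() != 1) {
            return null;
        }
        char code = value.trim().charAt(0);
        for (Type type : Type.values()) {
            if (type.code == code) {
                return type;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(fromCode("B"));
        System.out.println(fromCode("M"));
    }
}
```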
All annotations supported by univocity-parsers are available on the HTML parser as well. The HTML parser also provides the additional Linked and Group annotations, to support the processing of nested collections and maps. For example, we can create a Company class with a list of addresses:
class Company {
@Parsed(field = "company_id")
public Long id;
@Parsed
public String name;
@Linked(entity = "address", type = Address.class)
public List<Address> addresses;
@Override
public String toString() {
StringBuilder out = new StringBuilder();
out.append("Company ").append(id).append(": ").append(name);
if (addresses != null && addresses.size() > 0) {
for (Address address : addresses) {
out.append("\n * ").append(address);
}
}
return out.toString();
}
}
The Linked annotation refers to an entity that must be available from the parent record. In this case, the address results must have been linked to company so that each company record has addresses linked to it.
For example:
HtmlParserResult companies = results.get("company");
HtmlParserResult addresses = results.get("address");
companies.link(addresses); //links addresses to companies
List<Company> companyList = companies.getBeans(Company.class);
for (Company company : companyList) {
println(company);
}
This will produce:
Company 123: My Corporation
* BUSINESS: market plaza - air tower 25th floor, Sydney - New South Wales, AUSTRALIA 2222
* MAILING: 1 some street, Sydney - New South Wales, AUSTRALIA 2220
Company 456: Other Corporation
* MAILING: 2 george st, Adelaide - South Australia, AUSTRALIA 5000
Linking results explicitly is usually required when processing multiple entities from the same page like we have been doing with the companies and addresses. However, most websites will have the information you need scattered across multiple pages. The Link following section demonstrates how to use the parser to collect data organized like this.
Rate limiting
To minimize the chance of unintentionally abusing a target server and having your access to it blocked, the parser waits for some time - 15 milliseconds by default - before making a new request to process a linked resource. This time can be configured (and disabled) from the HtmlParserSettings:
long intervalInMillis = htmlEntityList.getParserSettings().getRemoteInterval();
htmlEntityList.getParserSettings().setRemoteInterval(0); //disables the internal rate limiter.
If an interval is defined, all threads responsible for following links and/or downloading resources such as images, will wait for the given interval, counting from the time since the previous thread made a request.
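To make the "counting from the previous request" part concrete, here is a minimal sketch of an interval-based limiter in plain Java. It is a hypothetical illustration: IntervalLimiterSketch, reserve and acquire are names invented for this example, and nothing here describes the internals of the parser's own RateLimiter.

```java
// Hypothetical illustration of interval-based rate limiting; not the parser's RateLimiter.
class IntervalLimiterSketch {

    private final long intervalMillis;
    private long nextAllowedTime = Long.MIN_VALUE;

    IntervalLimiterSketch(long intervalMillis) {
        this.intervalMillis = intervalMillis;
    }

    // Returns how long a caller arriving at 'now' has to wait, and reserves its slot.
    // Each slot starts one interval after the previous one.
    synchronized long reserve(long now) {
        if (intervalMillis <= 0) {
            return 0; // an interval of 0 disables the limiter
        }
        long startAt = Math.max(now, nextAllowedTime);
        nextAllowedTime = startAt + intervalMillis;
        return startAt - now;
    }

    // Blocks the calling thread for the time returned by reserve().
    void acquire() throws InterruptedException {
        long wait = reserve(System.currentTimeMillis());
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        IntervalLimiterSketch limiter = new IntervalLimiterSketch(15);
        for (int i = 0; i < 3; i++) {
            limiter.acquire();
            System.out.println("request " + i + " at " + System.currentTimeMillis());
        }
    }
}
```

With an interval of 15 ms, two threads arriving at the same instant are spaced out: the first proceeds immediately and the second waits 15 ms. A thread that arrives after a long idle period does not wait at all.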
This affects especially HtmlLinkFollowers, but the interval can be adjusted among requests if you define a NextInputHandler, i.e.
linkFollower.setNextLinkHandler(new NextInputHandler<RemoteContext>() { //you can use a lambda here
@Override
public void prepareNextCall(RemoteContext remoteContext) {
RateLimiter rateLimiter = remoteContext.getRateLimiter();
long threadsWaiting = rateLimiter.getWaitingCount();
if(threadsWaiting > 10){
rateLimiter.setInterval(0); //disables the rate limiter
} else if(threadsWaiting > 4){
rateLimiter.decreaseWaitTime(10); //removes 10 milliseconds from the configured wait time
} else if(threadsWaiting < 2){
rateLimiter.increaseWaitTime(10); //adds 10 milliseconds to the configured wait time
}
}
});
This is also available when using a HtmlPaginator:
paginator.setPaginationHandler(new NextInputHandler<PaginationContext>() {
@Override
public void prepareNextCall(PaginationContext remoteContext) {
RateLimiter rateLimiter = remoteContext.getRateLimiter();
// ... same as before.
}
}
);
In both cases, the RemoteContext provides the active RateLimiter, which lets you know how many threads are waiting to connect to a remote server, and also allows you to increase or decrease the wait time of the next remote call. Once the request executes, any increase/decrease of the wait time is reset to 0. Use setInterval to update the interval or disable it altogether.
Also note that the parser settings cascade, so the order in which the configuration is set matters. For example:
This affects the current page and all linked pages:
HtmlEntityList entityList = new HtmlEntityList();
//disables the internal rate limiter.
entityList.getParserSettings().setRemoteInterval(0);
// downloads files referenced by the HTML such as image, javascript and .css files.
entityList.getParserSettings().fetchResourcesBeforeParsing(new FetchOptions());
// ... configure linkFollowers
This affects the current page only, and none of the linked pages:
HtmlEntityList entityList = new HtmlEntityList();
// ... configure linkFollowers
//disables the internal rate limiter.
entityList.getParserSettings().setRemoteInterval(0);
// downloads files referenced by the HTML such as image, javascript and .css files.
entityList.getParserSettings().fetchResourcesBeforeParsing(new FetchOptions());
Further reading
This is the end of the basic tutorial. By now you should understand the basics of how the parser works.
Feel free to proceed to the following sections (in any order).
- Fields and matching rules
- Reading data into java beans
- Reading linked results into java beans
- Pagination
- Link following
- Downloads and historical data management
- Resource Downloading
- Listening to parser actions
If you find a bug
We take errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. A new SNAPSHOT build will be generated so you (and anyone affected by the bug) can proceed with your work as soon as the adjustments are made.
If you find a bug don’t hesitate to report an issue here. You can also submit feature requests or any other improvements there.
We are happy to help if you have any questions regarding how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.
We can work for you
If you don’t have the resources or don’t really want to spend time coding, we can build a custom solution for you using our products. We deliver quickly as we know the ins and outs of everything we are dealing with. Send an e-mail to sales@univocity.com with your requirements and we’ll be happy to assist.
The univocity team.