A complete framework with all the features you need to implement simple and complex HTML parsing projects

Activate your trial license until July 31st to get a 25% discount code for your first purchase

Write clean, readable code

Our powerful, fluent API lets you write clean code that targets data points anywhere in your page:

address.addField("company_id").match("div").match("span").precededBy("span").withText("company no").getText();
address.addField("type").match("tr").withText("* address").getText(); // using the * wildcard
address.addField("street1").match("td").withText("Street 1").match("span").getText();
address.addField("street2").match("span").precededByText("Street 2").getText();

And obtain the values collected into rows ready for your database:

Results<HtmlParserResult> results = parser.parse(input);
HtmlParserResult addresses = results.get("address");

String[] headers = addresses.getHeaders();
List<String[]> rows = addresses.getRows();
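
As a rough illustration of what "ready for your database" can look like, the sketch below turns a header array and a list of rows into a parameterized INSERT executed as a JDBC batch. `RowInserter`, `buildInsertSql` and `insertAll` are hypothetical helper names and not part of the parser library. The sketch also assumes that every row has one value per header and that the table and column names come from trusted code.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for illustration only; not part of the parser library.
final class RowInserter {

    // Builds a statement such as "INSERT INTO address (company_id, type) VALUES (?, ?)".
    // Table and column names are concatenated as-is, so they must come from trusted code.
    static String buildInsertSql(String table, String[] headers) {
        String columns = String.join(", ", headers);
        String placeholders = String.join(", ", Collections.nCopies(headers.length, "?"));
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    // Binds every row to the prepared statement and sends all rows as one JDBC batch.
    // Assumes each row holds exactly one value per header.
    static void insertAll(Connection connection, String table,
                          String[] headers, List<String[]> rows) throws SQLException {
        try (PreparedStatement statement = connection.prepareStatement(buildInsertSql(table, headers))) {
            for (String[] row : rows) {
                for (int i = 0; i < headers.length; i++) {
                    statement.setString(i + 1, row[i]); // JDBC parameters are 1-based
                }
                statement.addBatch();
            }
            statement.executeBatch();
        }
    }
}
```

With the headers and rows obtained above, a call such as `RowInserter.insertAll(connection, "address", headers, rows)` would write all collected addresses in one round trip, given an open `java.sql.Connection`.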

Or convert them into Java beans using annotations:

public class Company {
    private String companyName;

    @Linked(entity = "address", type = Address.class)
    public List<Address> companyAddresses;
}

For example:

// companies with linked addresses
List<Company> companies = results.get("company").getBeans(Company.class);
// or just all addresses collected by the parser, for all companies
List<Address> allAddresses = results.get("address").getBeans(Address.class);

Built-in pagination support

Pagination is handled for you, even if you are parsing historical files stored offline.

// follows up to 3 extra pages of results
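
To make the idea of a follow limit concrete, here is a minimal, library-independent sketch of the general technique: start at the first page and follow "next page" links until the limit is reached or no further page exists. `PaginationSketch`, `collectPages` and the `nextPageOf` function are hypothetical names used for illustration. This is not the parser's pagination API, only an outline of the behaviour described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of "follow up to N extra pages"; not the library's API.
final class PaginationSketch {

    // Returns the first page plus at most maxExtraPages pages reached through "next" links.
    // nextPageOf returns the location of the following page, or null when there is none.
    static List<String> collectPages(String firstPage, int maxExtraPages,
                                     Function<String, String> nextPageOf) {
        List<String> visited = new ArrayList<>();
        visited.add(firstPage);
        String current = firstPage;
        for (int followed = 0; followed < maxExtraPages; followed++) {
            String next = nextPageOf.apply(current);
            if (next == null || visited.contains(next)) {
                break; // no further page, or a link back to a page already seen
            }
            visited.add(next);
            current = next;
        }
        return visited;
    }
}
```

Because the next page is supplied by a function, the same loop works whether pages are fetched from a live site or read back from files stored on disk.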

Following links and joining the data available in linked pages is straightforward:

HtmlLinkFollower profileFollower = user.addField("profileUrl").match("a").getAttribute("href").followLink();

//add fields to link follower
//join values collected from linked page into the corresponding "user" record

Store and re-process HTML pages

Just define a directory structure that makes sense for you, taking advantage of the supported file name patterns:

//parse with:
parserSettings.setFileNamePattern("{date, yyyy-MMM-dd}/file_{page}.html");


//to re-parse files pulled in the past:

//the parser will locate the appropriate files, including paginated results and followed links
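
For readers curious how a pattern such as "{date, yyyy-MMM-dd}/file_{page}.html" could map to a concrete path, the sketch below expands the two placeholders with standard Java classes. `FileNamePatternSketch` and `expand` are hypothetical names, and the expansion rules shown here (English month names, a default date format of yyyy-MM-dd, unknown placeholders left untouched) are assumptions for illustration rather than the parser's actual behaviour.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration of file name pattern expansion; not the library's implementation.
final class FileNamePatternSketch {

    // Matches "{name}" or "{name, format}".
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\\{\\s*(\\w+)\\s*(?:,\\s*([^}]+?)\\s*)?\\}");

    static String expand(String pattern, LocalDate date, int page) {
        Matcher matcher = PLACEHOLDER.matcher(pattern);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String name = matcher.group(1);
            String format = matcher.group(2); // null when no format was given
            String value;
            if (name.equals("date")) {
                String dateFormat = format != null ? format : "yyyy-MM-dd"; // assumed default
                value = DateTimeFormatter.ofPattern(dateFormat, Locale.ENGLISH).format(date);
            } else if (name.equals("page")) {
                value = String.valueOf(page);
            } else {
                value = matcher.group(); // leave unknown placeholders untouched
            }
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Under these assumptions, page 2 of a download made on 5 January 2015 would be stored in a per-day directory, which keeps each run's paginated files together and makes older runs easy to find again.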

Download page resources

Render pages offline by downloading any CSS, JavaScript, images, etc.:

FetchOptions fetchOptions = new FetchOptions();

//stores all resources in a central place to prevent re-downloading them for each HTML page visited


And more

Try the univocity-html-parser library free for 14 days. Download here.