A complete framework with all the features you need to implement simple and complex HTML parsing projects

Activate your trial license until July 31st to get a 25% discount code for your first purchase


Write clean, readable code

Our powerful, fluent API lets you write clean code to target data points anywhere in a page:

address.addField("company_id").match("div").match("span").precededBy("span").withText("company no").getText();
address.addField("type").match("tr").withText("* address").getText(); // using the * wildcard
address.addField("street1").match("td").withText("Street 1").match("span").getText();
address.addField("street2").match("span").precededByText("Street 2").getText();
address.addField("country").match("table").match("table").upToHeader("th").getText();
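As an illustration of how these rules map onto markup, the "company_id" field above captures the text of a span that is preceded by another span containing "company no". The fragment below is a hypothetical input, not taken from the library's documentation:

```html
<!-- hypothetical input: the "company_id" rule would capture "123456" -->
<div>
    <span>company no</span>
    <span>123456</span>
</div>
```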

Then obtain the collected values as rows, ready for your database:

Results<HtmlParserResult> results = parser.parse(input);
HtmlParserResult addresses = results.get("address");

String[] headers = addresses.getHeaders();
List<String[]> rows = addresses.getRows();
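The headers/rows pair maps naturally onto an SQL insert. The sketch below is not part of the library; it assumes the headers and rows were already extracted, and simply builds a parameterized statement from them (the table and column names are hypothetical):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RowsToSql {

    // Builds e.g. "INSERT INTO address (type, street1) VALUES (?, ?)" from the headers.
    static String buildInsert(String table, String[] headers) {
        String columns = String.join(", ", headers);
        String placeholders = String.join(", ", Collections.nCopies(headers.length, "?"));
        return "INSERT INTO " + table + " (" + columns + ") VALUES (" + placeholders + ")";
    }

    public static void main(String[] args) {
        // Hypothetical values standing in for addresses.getHeaders() / addresses.getRows()
        String[] headers = {"type", "street1", "country"};
        List<String[]> rows = Arrays.asList(
                new String[]{"postal address", "123 Main St", "US"},
                new String[]{"billing address", "456 Oak Ave", "UK"});

        System.out.println(buildInsert("address", headers));
        // With JDBC, each row would then bind to the placeholders:
        // for (String[] row : rows) {
        //     for (int i = 0; i < row.length; i++) stmt.setString(i + 1, row[i]);
        // }
    }
}
```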


Or convert them into Java beans, using annotations:

public class Company {
    @Trim
    @UpperCase
    @Parsed
    private String companyName;

    @Linked(entity = "address", type = Address.class)
    public List<Address> companyAddresses;
}

For example:

// companies with linked addresses
List<Company> companies = result.get("company").getBeans(Company.class);
// or just all addresses collected by the parser, for all companies
List<Address> allAddresses = result.get("address").getBeans(Address.class);

Built-in pagination support

Pagination is handled for you, even if you are parsing historical files stored offline.

paginator.setNextPage()
        .match("span").id("nextPage")
        .match("a").getAttribute("href");
        
// follows up to 3 extra pages of results
paginator.setFollowCount(3);

Following links and joining the data available in linked pages is straightforward:

HtmlLinkFollower profileFollower = user.addField("profileUrl").match("a").getAttribute("href").followLink();

//add fields to link follower
profileFollower.addField("location")
    .match("td").classes("value").precededImmediatelyBy("td").classes("label").withText("Location").getOwnText();
    
//join values collected from linked page into the corresponding "user" record
profileFollower.setNesting(Nesting.JOIN);
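Conceptually, JOIN nesting merges the fields collected from the linked page into the parent record, producing a single flat row. The sketch below illustrates that merge in plain Java; it is independent of the library, and the field names are hypothetical:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class JoinNestingDemo {

    // Merges fields collected from a followed link into the parent row
    // (the idea behind Nesting.JOIN: one combined record per user).
    static Map<String, String> join(Map<String, String> userRow, Map<String, String> linkedRow) {
        Map<String, String> joined = new LinkedHashMap<>(userRow);
        joined.putAll(linkedRow);
        return joined;
    }

    public static void main(String[] args) {
        Map<String, String> user = new LinkedHashMap<>();
        user.put("profileUrl", "/users/jdoe");

        Map<String, String> profile = new LinkedHashMap<>();
        profile.put("location", "Dublin");

        // Prints the combined record: {profileUrl=/users/jdoe, location=Dublin}
        System.out.println(join(user, profile));
    }
}
```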

Store and re-process HTML pages

Just define a directory structure that makes sense for you, taking advantage of the supported file name patterns:

//parse with:
parserSettings.setDownloadContentDirectory("{user.home}/Downloads/realEstate/");
parserSettings.setFileNamePattern("{date, yyyy-MMM-dd}/file_{page}.html");
parserSettings.setDownloadOverwritingEnabled(false);

...

//to re-parse files pulled in the past:
parserSettings.setParseDate("2015-Mar-27");

//the parser will locate the appropriate files, including paginated results and followed links
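To illustrate what the {date} and {page} placeholders in the pattern above expand to, here is a standalone sketch of the naming scheme. It is not the library's implementation, just a demonstration of how a parse date and page number resolve to a file path:

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

public class FileNamePatternDemo {

    // Expands a "{date, yyyy-MMM-dd}/file_{page}.html"-style pattern
    // for a given date and page number.
    static String resolve(Date date, int page) {
        String day = new SimpleDateFormat("yyyy-MMM-dd", Locale.ENGLISH).format(date);
        return day + "/file_" + page + ".html";
    }

    public static void main(String[] args) {
        Date parseDate = new GregorianCalendar(2015, Calendar.MARCH, 27).getTime();
        System.out.println(resolve(parseDate, 1)); // 2015-Mar-27/file_1.html
        System.out.println(resolve(parseDate, 2)); // 2015-Mar-27/file_2.html
    }
}
```

Setting the parse date to "2015-Mar-27" then selects everything under that day's directory, page by page.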

Download page resources

Render pages offline by downloading any CSS, JavaScript, images, and other resources they reference:

FetchOptions fetchOptions = new FetchOptions();

//stores all resources in a central place to prevent re-downloading them for each HTML page visited
fetchOptions.setSharedResourceDir("{user.home}/Downloads/realEstate/cache");

parserSettings.fetchResourcesBeforeParsing(fetchOptions);
 

And more

Try the univocity-html-parser library free for 14 days. Download here.