
Fields and matching rules

This tutorial details how the parser collects data for the fields you define, and explores a few gotchas you might run into when first using the parser.

How matching rules work

Before we start looking at the code, it’s important to understand how the parser works so you are able to get sensible results.

A matching rule is defined by a sequence of HTML tag names. For example:

  • match("tr") will match all <tr> elements in the HTML tree.
  • match("tr").match("span") will match all <span> elements that are inside a <tr> element.
  • match("tr").match("span").match("b") will match all <b> elements if they are in the hierarchy of a <span> element, or if they are siblings of said <span>, where the <span> appears before each <b> element and is inside a <tr>.

Essentially, every match operation will find zero or more nodes. Chaining match operations makes the parser look into the hierarchy of each previously matched node, and into the siblings that follow it.

Consider this input:

<table>
    <tr>
        <td>
            <span>This is a <b>bold</b> statement</span>
        </td>
    </tr>
    <tr>
        <td>
            <b>I'm ignored</b>
            <span>An even<b>bolder</b>statement</span>
            <b>wow</b></td>
    </tr>
</table>

And the following code, which uses the matching rule match("tr").match("span").match("b") mentioned earlier:

FileProvider input = new FileProvider("documentation/tutorial/html/matching_rules.html", "UTF-8");

// Parse an input HTML into a tree structure.
HtmlElement root = HtmlParser.parseTree(input);
List<HtmlElement> expectedElements = root.query()
        .match("tr").match("span").match("b")
        .getElements();

for (HtmlElement e : expectedElements) {
    println(e);
}

Will return the following nodes:

<b>bold</b>
<b>bolder</b>
<b>wow</b>
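
To make the traversal easier to picture, here is a minimal sketch in plain Java of the rule described above: each match step searches below every previously matched node and then in the siblings that follow it. The Node class and the match helper are hypothetical names invented for this illustration. This is not the parser's actual implementation, which applies more rules than this model does. It only reproduces the three nodes of this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Simplified model of chained match() calls; NOT the univocity implementation.
class MatchSemanticsSketch {

    // Hypothetical stand-in for an HTML element: a tag, some text and child nodes.
    static final class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();
        Node parent;

        Node(String tag, String text, Node... kids) {
            this.tag = tag;
            this.text = text;
            for (Node kid : kids) {
                kid.parent = this;
                children.add(kid);
            }
        }
    }

    // Adds every descendant of 'start' with the given tag, in document order.
    static void collectDescendants(Node start, String tag, Set<Node> out) {
        for (Node child : start.children) {
            if (child.tag.equals(tag)) {
                out.add(child);
            }
            collectDescendants(child, tag, out);
        }
    }

    // One match(tag) step: search below each previously matched node,
    // then in the siblings that follow it (and below those siblings).
    static List<Node> match(List<Node> previous, String tag) {
        Set<Node> out = new LinkedHashSet<>(); // keeps order, drops duplicates
        for (Node node : previous) {
            collectDescendants(node, tag, out);
            if (node.parent == null) {
                continue;
            }
            List<Node> siblings = node.parent.children;
            for (int i = siblings.indexOf(node) + 1; i < siblings.size(); i++) {
                Node sibling = siblings.get(i);
                if (sibling.tag.equals(tag)) {
                    out.add(sibling);
                }
                collectDescendants(sibling, tag, out);
            }
        }
        return new ArrayList<>(out);
    }

    // Runs match("tr").match("span").match("b") over a model of the sample table.
    static List<String> demo() {
        Node row1 = new Node("tr", "", new Node("td", "",
                new Node("span", "This is a statement", new Node("b", "bold"))));
        Node row2 = new Node("tr", "", new Node("td", "",
                new Node("b", "I'm ignored"),
                new Node("span", "An even statement", new Node("b", "bolder")),
                new Node("b", "wow")));
        Node root = new Node("html", "", new Node("table", "", row1, row2));

        List<Node> matched = List.of(root);
        for (String tag : new String[]{"tr", "span", "b"}) {
            matched = match(matched, tag);
        }
        List<String> texts = new ArrayList<>();
        for (Node b : matched) {
            texts.add(b.text);
        }
        return texts;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [bold, bolder, wow]
    }
}
```

The first <b> of the second row is skipped because it appears before the <span>, so it is neither inside the <span> nor one of its following siblings.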

After each call to match("tagname") you can define one or more constraints to select only elements you are interested in. The following example adds a few basic constraints to the matching rules defined earlier:

List<HtmlElement> expectedElements = root.query()
        .match("tr").under("tr").withText("* a bold")
        .match("span").match("b").withText("wow")
        .getElements();

In this example, we only match <tr> elements that are under a <tr> with any text that starts with anything (denoted by a *) followed by the words “a bold”, followed by anything. Note that withText matches the starting text, i.e. writing .withText("* a bold") is equivalent to .withText("* a bold*"). If such a <tr> is found, we look for a <b> inside a <span> of the given <tr>, where the text inside <b> must start with “wow”.

The matched elements will now be:

<b>wow</b>

All available constraints are extensively documented in the BasicElementFilter interface.
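
The wildcard behaviour of withText can be approximated with a regular expression. The sketch below is only a model written for this tutorial, and textMatches and toPattern are hypothetical helpers, not part of the parser API. It assumes that * stands for any run of characters, that a trailing * is implied, and that the comparison ignores case (inferred from the later examples, where withText("company no:") matches the text “Company No:”).

```java
import java.util.regex.Pattern;

// Rough model of the withText(...) wildcard rule; NOT the library's own code.
class WithTextSketch {

    // Translates an expression such as "* a bold" into a regular expression.
    static Pattern toPattern(String expression) {
        String[] literals = expression.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < literals.length; i++) {
            if (i > 0) {
                regex.append(".*"); // each '*' matches any run of characters
            }
            regex.append(Pattern.quote(literals[i]));
        }
        regex.append(".*"); // implied trailing wildcard: only the start must match
        // Assumption: matching is case-insensitive.
        return Pattern.compile(regex.toString(), Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    }

    static boolean textMatches(String expression, String text) {
        return toPattern(expression).matcher(text).matches();
    }

    public static void main(String[] args) {
        System.out.println(textMatches("* a bold", "This is a bold statement")); // true
        System.out.println(textMatches("wow", "wow"));                           // true
        System.out.println(textMatches("wow", "oh wow"));                        // false
        System.out.println(textMatches("company no:", "Company No:  "));         // true
    }
}
```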

How result rows are built

When you define an entity and its fields, the parser will allocate slots for each one of the fields you defined in an internal array.

Let’s have a look at a very simple input, which has the following HTML:

<div>
    <span>Company No:&nbsp;&nbsp;</span><span><b>123</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>univocity</span></b></span>
    <hr/>

    <span>Company No:&nbsp;&nbsp;</span><span><b>456</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>FourFiveSix</span></b></span>
    <hr/>

    <span>Company No:&nbsp;&nbsp;</span><span><b>789</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>My Crow Soft</span></b></span>
</div>

To collect company numbers and names, we can define a company entity and add fields to it. Each field has at least one sequence of matching rules associated with it, and they all terminate with a content reading operation (all possible operations are described in ContentReader).

The following matching rules can be used to parse the input HTML shown above:

// entities are defined in an entity list.
HtmlEntityList entityList = new HtmlEntityList();

// here we define the company entity.
HtmlEntitySettings company = entityList.configureEntity("company");

// creates a field "id" for company numbers
company.addField("id")
        .match("span").withText("company no:") // match any <span> with text "company no"
        .matchNext("span")
        .getText(); //match only the next <span> after finding a <span> with text "company no"

// creates a field "name" for company names
company.addField("name")
        .match("span").withText("Legal name:") // match any <span> with text "legal name"
        .matchNext("b")
        .getText(); //match only the next <b> after finding a <span> with text "legal name"

// create a parser instance
HtmlParser parser = new HtmlParser(entityList);

// define the input file to parse
FileProvider input = new FileProvider("documentation/tutorial/html/gotchas/example_001.html", "UTF-8");

// then parse to get the results
HtmlParserResult result = parser.parse(input).get("company");

When the parser is put to run it should produce the following result:

id___name__________
123  univocity     
456  FourFiveSix   
789  My Crow Soft

But to better understand what the parser actually did, we can read the logs it prints out.

Reading the parser log

When the parser executes, it builds an HTML tree, which is essentially the same as calling:

// parses the input into a tree of HtmlElement.
HtmlElement htmlTreeRoot = HtmlParser.parseTree(input);

With the HTML structure ready, a new thread is started for every entity defined in your HtmlEntityList. These threads are responsible for walking over the HTML elements and finding matches for the rules defined in each field of your entity. The log will show the threads that are started with this message:

[HTML Parse[company]] INFO  - [company] - Parsing process started

Which essentially states that a thread for the entity named company is running. You can configure how many threads are created by the parser with:

// limits the number of threads to 4
entityList.getParserSettings().setParserThreadCount(4);

Soon after the company thread starts, the following messages will pop up in the log:

[HTML Parse[company]] DEBUG - [company.id] = '123' | Matched <span>123</span> using: span.matchText(company no:*).matchNext(span).getText
[HTML Parse[company]] DEBUG - [company.name] = 'univocity' | Matched <b>univocity</b> using: span.matchText(legal name:*).matchNext(b).getText

The messages inform you that company.id and company.name have been populated with values 123 and univocity, respectively. The HTML tags that were matched are also displayed as well as the matching rule that found them.

This is what the parser did behind the scenes for the company.id field:

Once the parser visited node <span>Company No:&nbsp;&nbsp;</span>, the first match() operation verified that the <span> actually starts with text “company no”. The first part of the matching rule is satisfied. The parser then visited the next HTML element and found <span><b>123</b></span>. The second part of the matching rule uses matchNext(), which establishes that a <span> element must exist immediately after the previously matched element. As <span><b>123</b></span> is in fact the next element after <span>Company No:&nbsp;&nbsp;</span>, the getText() operation is applied over this last matched element, which produces the value 123.

When the next sibling node is visited (i.e. <br/>), the last part of the matching rule is cleared. While there are sibling nodes, the parser typically continues trying to match the last part of the matching rule. However, as matchNext binds the previous node to the next one, it would never find a match again. So the first part of the matching rule is cleared as well. The next section demonstrates this behaviour more clearly.

The behavior for field company.name is essentially the same.

As printing out HTML nodes can be expensive (they may have many nested child elements, lots of attributes, etc.), the log messages produce shortened descriptions of the nodes matched. The actual node in the input HTML is <span><b>123</b></span>, but the log shows the matched tag with any attributes and the text inside the tag as if rendered by a browser, hence the log displays the message Matched <span>123</span>. Read the log message as “the parser found a <span> element with text 123”, which was matched using the rule: span.matchText(company no:*).matchNext(span).getText.

Now, at this time the parser has the following information stored:

id___name__________
123  univocity

But there won’t be any row submitted to the output just yet. The parser will proceed and find the next value for the field company.id:

[HTML Parse[company]] DEBUG - [company.id] = '456' | Matched <span>456</span> using: span.matchText(company no:*).matchNext(span).getText

As the parser already has the value 123 under company.id, the values it accumulated so far will be submitted to your HtmlParserResult, as the log shows:

[HTML Parse[company]] INFO  - Submitting row 1 of entity 'company':[
    id = 123
    name = univocity
]

Internally, the parser now has the following information stored:

id___name__________
456  null

It will then proceed to match the remaining values. Every time it finds that a field has already been populated, the values accumulated will be sent to the HtmlParserResult. Once the parsing thread has visited all nodes of the HTML tree, it submits the last row with any values that have been accumulated, as shown in the log:

[HTML Parse[company]] INFO - [company] - Parsing process ended
[HTML Parse[company]] INFO - [company] - Submitting last record with collected data
[HTML Parse[company]] INFO - Submitting row 3 of entity 'company':[
    id = 789
    name = My Crow Soft
]  

How you approach the matching rules for each field matters. The parser will always submit a row once it finds that a field has already been populated, and then prepares the next row. If the matching rule associated with a field produces one value but the rule of another field produces multiple values, you will get multiple rows that look “broken”. Let’s have a look at common situations and how to understand and fix them.
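
The row-building rule can be summarised in a few lines of plain Java. The sketch below is a simplified model and not the parser's internal code: buildRows is a hypothetical helper that receives the matched values as (field, value) pairs in the order the parser would find them, submits the current row whenever a field is already populated, and submits whatever is left at the end.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of how rows are accumulated; NOT the univocity implementation.
class RowBuilderSketch {

    // Each event is a {field, value} pair, in the order the values are matched.
    static List<Map<String, String>> buildRows(String[][] events) {
        List<Map<String, String>> rows = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        for (String[] event : events) {
            String field = event[0];
            String value = event[1];
            if (current.containsKey(field)) {
                // The field already holds a value: submit the row and start a new one.
                rows.add(current);
                current = new LinkedHashMap<>();
            }
            current.put(field, value);
        }
        if (!current.isEmpty()) {
            rows.add(current); // last record with the data collected so far
        }
        return rows;
    }

    public static void main(String[] args) {
        // Balanced input: each field yields one value per company.
        System.out.println(buildRows(new String[][]{
                {"id", "123"}, {"name", "univocity"},
                {"id", "456"}, {"name", "FourFiveSix"}}));
        // [{id=123, name=univocity}, {id=456, name=FourFiveSix}]

        // Lopsided input: "name" yields two values for a single "id".
        System.out.println(buildRows(new String[][]{
                {"id", "123"}, {"name", "univocity"}, {"name", "univocity software"}}));
        // [{id=123, name=univocity}, {name=univocity software}]
    }
}
```

In the second call the extra name has no id to go with, which is the kind of “broken” row discussed in the rest of this tutorial.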

Gotchas

Usually defining paths to the values you want to collect is a straightforward matter, but depending on the nature of your input you may get confusing results. Let’s explore common situations that might arise.

Results with random data

Let’s come back to the input presented earlier and its HTML:

<div>
    <span>Company No:&nbsp;&nbsp;</span><span><b>123</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>univocity</span></b></span>
    <hr/>

    <span>Company No:&nbsp;&nbsp;</span><span><b>456</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>FourFiveSix</span></b></span>
    <hr/>

    <span>Company No:&nbsp;&nbsp;</span><span><b>789</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>My Crow Soft</span></b></span>
</div>

Now let’s make a slight modification to the matching rules. Here we use match instead of matchNext (see /*GOTCHA!*/ markers):

HtmlEntitySettings company = entityList.configureEntity("company");

// creates a field "id" for company numbers
company.addField("id")
        .match("span").withText("company no:") // match any <span> with text "company no"
        .match("span") /*GOTCHA!*/
        .getText(); // match any <span> after finding a <span> with text "company no"

// creates a field "name" for company names
company.addField("name")
        .match("span").withText("Legal name:") // match any <span> with text "legal name"
        .match("b") /*GOTCHA!*/
        .getText(); // match any <b> after finding a <span> with text "legal name"

This will produce a very surprising output:

id________________________name__________
123                                     
Legal name: univocity     univocity     
Company No:                             
456                                     
Legal name: FourFiveSix   FourFiveSix   
Company No:                             
789                                     
Legal name: My Crow Soft  My Crow Soft

What is going on here? We wanted to match a <span> with text "company no:" and from there match another <span> to get its text. The answer to why the results are “broken” is in the documentation of the match method:

Matches a given tag name at any distance from the current element. Navigates through sibling and children nodes.

Which is exactly what the parser is doing. Look at the logs:

[HTML Parse[company]] DEBUG - [company.id] = '123' | Matched <span>123</span> using: span.matchText(company no:*).span.getText

So far so good, the first company.id is 123.

But then the field company.id is populated again, receiving the value Legal name: univocity, as shown in the log:

[HTML Parse[company]] DEBUG  - [company.id] = 'Legal name: univocity' | Matched <span>Legal name: univocity</span> using: span.matchText(company no:*).span.getText
[HTML Parse[company]] INFO - Submitting row 1 of entity 'company':[
    id = 123
    name = null
]

What the parser did behind the scenes for the company.id field is notably different from what happened with the original code:

Once the parser visited node <span>Company No:&nbsp;&nbsp;</span>, the first match() operation verified that the <span> actually starts with text “company no”. Like before, the first part of the matching rule is satisfied. But now the second part of the matching rule uses match(), which establishes that a <span> element must exist anywhere after the previously matched element.

So <span><b>123</b></span> is matched and 123 is collected. Then <br/> is visited and the last part of the matching rule is cleared. The parser still marks the first bit (i.e. span.matchText(company no:*)) as matched. <br/> is ignored as no rules cater for it.

Finally, we reach node <span>Legal name:&nbsp;&nbsp;<b><span>univocity</span></b></span>. As the first part of the matching rule is still marked as matched, the parser runs the second part, which is just match("span"). A <span> is exactly what the parser visited, so the text of this <span> element is collected and assigned to field company.id.

The parser has the value 123 for company.id already, so it submits the data collected so far, producing:

id________________________name__________
123                       null          

The next record is prepared to be filled with the unwanted Legal name: univocity value for company.id, and internally this is stored:

id________________________name__________
Legal name: univocity     null          

The rules of field company.name execute now against the same element: <span>Legal name:&nbsp;&nbsp;<b><span>univocity</span></b></span>.

Here, span.matchText(legal name:*) matches. The parser moves on to the child nodes of this element and finds the inner <b><span>univocity</span></b>, satisfying the next condition (i.e. match("b")) and capturing the text.

Therefore the record is built to have:

id________________________name__________
Legal name: univocity     univocity

When the second <span>Company No:&nbsp;&nbsp;</span> is visited the rules of field company.id kick in, and as the parser is still processing neighbouring nodes of the first <span>Company No:&nbsp;&nbsp;</span> element, only the last part of the matching rule is executed. So match("span") runs over it, producing the next broken row:

id________________________name__________
Company No:               null

And so on until all elements are visited.

A quick method to figure out if your matching rules are collecting the elements you want is to test them against an HTML tree, shown next.

Testing individual matching rules

If you find yourself in a situation where results are not coming out as expected, you can verify each matching rule independently against an HTML tree to see what HtmlElements are coming out. Using the previously shown “broken” code to match company numbers:

FileProvider input = new FileProvider("documentation/tutorial/html/gotchas/example_001.html", "UTF-8");

// Parse an input HTML into a tree structure.
HtmlElement root = HtmlParser.parseTree(input);

// You can query the nodes of the tree using the same matching rule API used when defining fields of an entity.
List<HtmlElement> unexpectedElements = root.query()
        .match("span").withText("company no:")
        .match("span") /*GOTCHA!*/
        .getElements();

// Which is useful to identify any issues in your matching rules
for (HtmlElement e : unexpectedElements) {
    println(e);
}

This will print out the HTML of all matched elements, and you can quickly realize it matched unwanted HTML nodes:

<span><b>123</b></span>
<span>Legal name:&nbsp;&nbsp;<b><span>univocity</span></b></span>
<span>Company No:&nbsp;&nbsp;</span>
<span><b>456</b></span>
<span>Legal name:&nbsp;&nbsp;<b><span>FourFiveSix</span></b></span>
<span>Company No:&nbsp;&nbsp;</span>
<span><b>789</b></span>
<span>Legal name:&nbsp;&nbsp;<b><span>My Crow Soft</span></b></span>

Now that we know how the match operation behaves, we can quickly update the matching rules to use matchNext:

List<HtmlElement> expectedElements = root.query()
        .match("span").withText("company no:")
        .matchNext("span")
        .getElements();

for (HtmlElement e : expectedElements) {
    println(e);
}

To obtain the desired output:

<span><b>123</b></span>
<span><b>456</b></span>
<span><b>789</b></span>
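
The difference between the two operations can also be modelled without the parser. The sketch below flattens the children of the <div> into a list of (tag, rendered text) pairs and compares an “anywhere after” search with an “immediately after” search. All names in it (Element, matchAnywhereAfter, matchNext, sample) are hypothetical and exist only for this illustration. It looks at top-level siblings only, so it is a simplification of what match and matchNext really do.

```java
import java.util.ArrayList;
import java.util.List;

// Toy comparison of "match" versus "matchNext" over a flat list of siblings.
// NOT the univocity implementation.
class MatchVsMatchNextSketch {

    static final class Element {
        final String tag;
        final String text;

        Element(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    // True for a <span> whose text starts with the given prefix, ignoring case.
    static boolean isAnchor(Element e, String prefix) {
        return e.tag.equals("span") && e.text.regionMatches(true, 0, prefix, 0, prefix.length());
    }

    // Like match(tag): once an anchor was seen, every later sibling with the tag matches.
    static List<String> matchAnywhereAfter(List<Element> siblings, String prefix, String tag) {
        List<String> out = new ArrayList<>();
        boolean anchored = false;
        for (Element e : siblings) {
            if (anchored && e.tag.equals(tag)) {
                out.add(e.text);
            }
            if (isAnchor(e, prefix)) {
                anchored = true;
            }
        }
        return out;
    }

    // Like matchNext(tag): only the sibling immediately after an anchor can match.
    static List<String> matchNext(List<Element> siblings, String prefix, String tag) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < siblings.size(); i++) {
            Element next = siblings.get(i + 1);
            if (isAnchor(siblings.get(i), prefix) && next.tag.equals(tag)) {
                out.add(next.text);
            }
        }
        return out;
    }

    // The children of the <div> in the example, with their rendered text.
    static List<Element> sample() {
        List<Element> list = new ArrayList<>();
        String[][] companies = {{"123", "univocity"}, {"456", "FourFiveSix"}, {"789", "My Crow Soft"}};
        for (int i = 0; i < companies.length; i++) {
            if (i > 0) {
                list.add(new Element("hr", ""));
            }
            list.add(new Element("span", "Company No:"));
            list.add(new Element("span", companies[i][0]));
            list.add(new Element("br", ""));
            list.add(new Element("span", "Legal name: " + companies[i][1]));
        }
        return list;
    }

    public static void main(String[] args) {
        System.out.println(matchAnywhereAfter(sample(), "company no:", "span")); // 8 unwanted matches
        System.out.println(matchNext(sample(), "company no:", "span"));          // [123, 456, 789]
    }
}
```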

Results with mixed data

In some rare cases you may end up with results in a single row when they should appear in separate rows. Let’s change our HTML slightly to demonstrate. The company number and name may now be absent from the HTML, which looks like this:

<div>
    <span>Company No:&nbsp;&nbsp;</span><span><b>123</b></span>
    <hr/>

    <span>Legal name:&nbsp;&nbsp;<b><span>FourFiveSix</span></b></span>
    <hr/>

    <span>Company No:&nbsp;&nbsp;</span><span><b>789</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>My Crow Soft</span></b></span>
</div>

The matching rules we originally used seem to be sufficient at first:

entity.addField("id")
        .match("span").withText("company no:")
        .matchNext("span").getText();

entity.addField("name")
        .match("span").withText("Legal name:")
        .matchNext("b").getText();

HtmlParser parser = new HtmlParser(entityList);

FileProvider input = new FileProvider("documentation/tutorial/html/gotchas/example_002.html", "UTF-8");
HtmlParserResult result = parser.parse(input).get("company");

But the output is not what we want:

id___name__________
123  FourFiveSix   
789  My Crow Soft

Looking at the logs helps us to understand what is going wrong:

[HTML Parse[company]] DEBUG - [company.id] = '123' | Matched <span>123</span> using: span.matchText(company no:*).matchNext(span).getText

So the values collected so far for our first record are:

id________________________name__________
123                       null

Which looks fine. But the parser won’t find a <span> with “Legal name” in the first section and will visit the next section, which in turn doesn’t have a <span> with “Company No”. Nothing happens until the parser finally finds the <span> with “Legal name”, which we can see is in the next section. It will happily match this element and collect its text:

[HTML Parse[company]] DEBUG - [company.name] = 'FourFiveSix' | Matched <b>FourFiveSix</b> using: span.matchText(legal name:*).matchNext(b).getText

As it never visited a <span> with “Legal name” before, the first record becomes:

id________________________name__________
123                       FourFiveSix

To fix this, we need to somehow inform the parser when it’s time to submit the values it collected. Looking at the HTML, we know individual company information is separated by <hr>. We can tell the parser to submit the record when a <hr> is found by defining a record trigger. This is easily done with a single line of code:

// uses a record trigger to notify the parser this is the end of the record and a new one may be created.
entity.addRecordTrigger().match("hr");

Now the parser knows when to submit a company record to produce the correct output:

id___name__________
123                
     FourFiveSix   
789  My Crow Soft

The log will tell you what happened under the hood:

[HTML Parse[company]] DEBUG - [company.id] = '123' | Matched <span>123</span> using: span.matchText(company no:*).matchNext(span).getText
[HTML Parse[company]] DEBUG - [company] - Record trigger activated | Matched <hr/> using: hr
[HTML Parse[company]] INFO  - Submitting row 1 of entity 'company':[
    id = 123
    name = null
]
[HTML Parse[company]] DEBUG - [company.name] = 'FourFiveSix' | Matched <b>FourFiveSix</b> using: span.matchText(legal name:*).matchNext(b).getText
[HTML Parse[company]] DEBUG - [company] - Record trigger activated | Matched <hr/> using: hr
[HTML Parse[company]] INFO  - Submitting row 2 of entity 'company':[
    id = null
    name = FourFiveSix
]
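
The effect of the record trigger can be reproduced with a small model in plain Java. This is a sketch for illustration only: buildRows and the TRIGGER marker are hypothetical, and the model assumes a trigger submits the current row only when the row holds at least one value. It feeds the values of this example, in document order, through the row-building rule with and without the trigger.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of a record trigger; NOT the univocity implementation.
class RecordTriggerSketch {

    // Hypothetical marker for "an <hr> element was visited".
    static final String TRIGGER = "<hr>";

    static List<Map<String, String>> buildRows(String[][] events, boolean useTrigger) {
        List<Map<String, String>> rows = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        for (String[] event : events) {
            if (event[0].equals(TRIGGER)) {
                // Assumption: an empty row is never submitted by a trigger.
                if (useTrigger && !current.isEmpty()) {
                    rows.add(current);
                    current = new LinkedHashMap<>();
                }
                continue;
            }
            if (current.containsKey(event[0])) {
                // Field already populated: submit and start the next row.
                rows.add(current);
                current = new LinkedHashMap<>();
            }
            current.put(event[0], event[1]);
        }
        if (!current.isEmpty()) {
            rows.add(current); // last record with the data collected so far
        }
        return rows;
    }

    // Values of the example above, in the order they appear in the HTML.
    static String[][] sample() {
        return new String[][]{
                {"id", "123"}, {TRIGGER},
                {"name", "FourFiveSix"}, {TRIGGER},
                {"id", "789"}, {"name", "My Crow Soft"}};
    }

    public static void main(String[] args) {
        System.out.println(buildRows(sample(), false));
        // [{id=123, name=FourFiveSix}, {id=789, name=My Crow Soft}]
        System.out.println(buildRows(sample(), true));
        // [{id=123}, {name=FourFiveSix}, {id=789, name=My Crow Soft}]
    }
}
```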

That was easy. The next example is easier.

Results with missing data

In some cases the input to parse will generate one value for a field, but multiple values for other fields. Take for example the once again updated HTML, where companies can have more than one name but only one company number:

<div>
    <span>Company No:&nbsp;&nbsp;</span><span><b>123</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>univocity</span></b></span>
    <span>Legal name:&nbsp;&nbsp;<b><span>univocity software</span></b></span>
    <hr/>

    <span>Company No:&nbsp;&nbsp;</span><span><b>456</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>FourFiveSix</span></b></span>
    <hr/>

    <span>Company No:&nbsp;&nbsp;</span><span><b>789</b></span>
    <br/>
    <span>Legal name:&nbsp;&nbsp;<b><span>My Crow Soft</span></b></span>
    <span>Legal name:&nbsp;&nbsp;<b><span>Microsoft</span></b></span>
</div>

The matching rules of the previous example almost do the job:

entity.addField("id")
        .match("span").withText("company no:")
        .matchNext("span").getText();

entity.addField("name")
        .match("span").withText("Legal name:")
        .matchNext("b").getText();

entity.addRecordTrigger().match("hr");

But the output is now:

id___name________________
123  univocity           
     univocity software  
456  FourFiveSix         
789  My Crow Soft        
     Microsoft

The company.id is populated on the first row of each company only because the parser clears the stored values every time a new row is submitted. To keep the values parsed previously, you can define the id field as persistent:

entity.addPersistentField("id") //persistent fields don't lose their values until they are overwritten
        .match("span").withText("company no:")
        .matchNext("span").getText();

Now the results will be:

id___name________________
123  univocity           
123  univocity software  
456  FourFiveSix         
789  My Crow Soft        
789  Microsoft

Notice that a persistent field is also silent, meaning that any previous values will be simply overwritten and no new records will be submitted if a previous value already exists in the field.
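
As a rough illustration of what “persistent” and “silent” mean here, the model below extends the row-building rule with a set of persistent field names. It is a sketch under our own assumptions, not the parser's code: buildRows, carryOver and TRIGGER are hypothetical names, persistent values are copied into each new row, and a row is only submitted if something was written to it since the previous submission.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Simplified model of persistent fields; NOT the univocity implementation.
class PersistentFieldSketch {

    // Hypothetical marker for "an <hr> element was visited".
    static final String TRIGGER = "<hr>";

    static List<Map<String, String>> buildRows(String[][] events, Set<String> persistent) {
        List<Map<String, String>> rows = new ArrayList<>();
        Map<String, String> current = new LinkedHashMap<>();
        boolean dirty = false; // true once a value was written to the current row
        for (String[] event : events) {
            boolean trigger = event[0].equals(TRIGGER);
            // Persistent fields are silent: overwriting them never submits a row.
            boolean collision = !trigger
                    && !persistent.contains(event[0])
                    && current.containsKey(event[0]);
            if ((trigger || collision) && dirty) {
                rows.add(current);
                current = carryOver(current, persistent);
                dirty = false;
            }
            if (!trigger) {
                current.put(event[0], event[1]);
                dirty = true;
            }
        }
        if (dirty) {
            rows.add(current); // last record with the data collected so far
        }
        return rows;
    }

    // Starts the next row with the values of the persistent fields only.
    static Map<String, String> carryOver(Map<String, String> row, Set<String> persistent) {
        Map<String, String> next = new LinkedHashMap<>();
        for (String field : persistent) {
            if (row.containsKey(field)) {
                next.put(field, row.get(field));
            }
        }
        return next;
    }

    // Values of the example above, in the order they appear in the HTML.
    static String[][] sample() {
        return new String[][]{
                {"id", "123"}, {"name", "univocity"}, {"name", "univocity software"}, {TRIGGER},
                {"id", "456"}, {"name", "FourFiveSix"}, {TRIGGER},
                {"id", "789"}, {"name", "My Crow Soft"}, {"name", "Microsoft"}};
    }

    public static void main(String[] args) {
        System.out.println(buildRows(sample(), Set.of()));     // second and fifth rows have no id
        System.out.println(buildRows(sample(), Set.of("id"))); // every row carries its company id
    }
}
```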

Further reading

That’s it! We hope this tutorial gave you enough insight about the parser and how to circumvent the main complications you may face when writing your code.

Feel free to proceed to any of the following sections (in any order).

If you find a bug

We deal with errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. A new SNAPSHOT build will be generated so you (and anyone affected by the bug) can proceed with your work as soon as the adjustments are made.

If you find a bug don’t hesitate to report an issue here. You can also submit feature requests or any other improvements there.

We are happy to help if you have any questions regarding how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.

We can work for you

If you don’t have the resources or don’t really want to waste time coding we can build a custom solution for you using our products. We deliver quickly as we know the ins and outs of everything we are dealing with. Send us an e-mail to sales@univocity.com with your requirements and we’ll be happy to assist.

The univocity team.

www.univocity.com