Listening to parser actions

When the HTML parser executes, it builds a tree data structure of HtmlElement and then traverses this structure trying to find which HtmlElements are matched by the rules you specified for each field of your entities.

You receive notifications from the parser via a HtmlParserListener, an abstract class with the following callback methods:

  • parsingStarted - called when the HtmlParser begins parsing a web page.

  • parsingEnded - called when the HtmlParser stops running.

  • elementVisited - called every time a new HtmlElement is visited by the parser. This method will be called at least once for each node of the HTML tree.

  • elementMatched - called every time the path to a field is matched and one or more values are collected by the parser for one or more fields.

Each one of these methods receives a HtmlParsingContext, which holds information about the current state of the parser.

You can take advantage of this to implement all sorts of custom solutions. The most useful one, we believe, is the ability to identify missing data points and detect changes made to HTML forms that you have been parsing.
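To make the callback sequence described above concrete, here is a toy model of it. The Node and Listener types are hypothetical stand-ins written for this sketch, not univocity classes, and the depth-first order, as well as elementVisited firing before elementMatched, is an assumption about how such a traversal might behave.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: Node and Listener are hypothetical stand-ins, not univocity classes.
class CallbackOrderSketch {

    static class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag, Node... kids) {
            this.tag = tag;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    interface Listener {
        void parsingStarted();
        void elementVisited(Node node);
        void elementMatched(Node node);
        void parsingEnded();
    }

    // Walks the tree depth-first and notifies the listener along the way.
    static void parse(Node root, Listener listener) {
        listener.parsingStarted();
        visit(root, listener);
        listener.parsingEnded();
    }

    private static void visit(Node node, Listener listener) {
        listener.elementVisited(node);
        // "td" stands in for whatever matching rule has been configured.
        if ("td".equals(node.tag)) {
            listener.elementMatched(node);
        }
        for (Node child : node.children) {
            visit(child, listener);
        }
    }

    // Records the name of every callback, in the order it was invoked.
    static List<String> run() {
        List<String> events = new ArrayList<>();
        Node page = new Node("table", new Node("tr", new Node("td")));
        parse(page, new Listener() {
            public void parsingStarted() { events.add("parsingStarted"); }
            public void elementVisited(Node n) { events.add("elementVisited:" + n.tag); }
            public void elementMatched(Node n) { events.add("elementMatched:" + n.tag); }
            public void parsingEnded() { events.add("parsingEnded"); }
        });
        return events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this model every node triggers elementVisited, while elementMatched fires only for nodes that satisfy the rule, which is the asymmetry the detection technique in this section relies on.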

Detecting changes in web pages

So you created your first version of a solution that parses 1000 different data points from a bunch of online forms and you think you finished the job. The problem is that websites tend to change over time and the code that used to collect all pieces of data last year may suddenly start to miss details here and there.

One way to notice that things have changed is to look for columns that are now always empty, but you can never be sure: you may simply be dealing with rarely used fields.

To make things trickier, you may also be missing new data points that have been introduced and the parser is not configured to pick them up. Unless you have someone actively looking at potentially long and tedious forms day in and day out, you will never know that there’s new information available in them.

With a custom HtmlParserListener you can leverage what you know about the page structure being dealt with and identify new or missing data points that may be of interest.
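At its core, the idea boils down to comparing what you expect to find on a page with what is actually there. The sketch below illustrates that comparison with plain sets of label names. It does not use the parser at all, and the "Last login" label is a made-up example of a newly introduced data point.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

// Illustration only: compares the labels a parser is configured for with the labels seen on a page.
class LabelDriftSketch {

    // Labels present on the page that the parser is not configured to collect.
    static Set<String> newLabels(Set<String> expected, Set<String> seen) {
        Set<String> result = new TreeSet<>(seen);
        result.removeAll(expected);
        return result;
    }

    // Labels the parser is configured to collect that no longer appear on the page.
    static Set<String> missingLabels(Set<String> expected, Set<String> seen) {
        Set<String> result = new TreeSet<>(expected);
        result.removeAll(seen);
        return result;
    }

    static Set<String> expectedLabels() {
        return new TreeSet<>(Arrays.asList("Username", "Age", "Location", "Profile created on"));
    }

    // Hypothetical page: "Profile created on" disappeared and "Last login" was introduced.
    static Set<String> seenLabels() {
        return new TreeSet<>(Arrays.asList("Username", "Age", "Location", "Last login"));
    }

    public static void main(String[] args) {
        System.out.println("New: " + newLabels(expectedLabels(), seenLabels()));
        System.out.println("Missing: " + missingLabels(expectedLabels(), seenLabels()));
    }
}
```

A listener gives you the "seen" side of this comparison for free, because the parser already visits every element of the page.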

The input to parse

Let’s revisit the input shown in section Following a link, which has links to pages such as this:

<table>
    <tr>
        <td class="label">Username:</td>
        <td class="value">jsmith</td>
    </tr>
    <tr>
        <td class="label">Age:</td>
        <td class="value">25</td>
    </tr>
    <tr>
        <td class="label">Location:</td>
        <td class="value">Adelaide, South Australia</td>
    </tr>
    <tr>
        <td class="label">Profile created on:</td>
        <td class="value">25/4/2008</td>
    </tr>
</table>
<div>
    <a href="../list.html">Go Back</a>
</div>

Here is the basic link following and parsing implementation:

HtmlEntityList entityList = new HtmlEntityList();

HtmlEntitySettings user = entityList.configureEntity("User");
user.addField("name").match("a").getText();

// The 'profileUrl' field has a link to the next page with user details. We want to follow that link.
HtmlLinkFollower profileFollower = user.addField("profileUrl")
        .match("a")
        .getAttribute("href")
        .followLink();

// We just add fields to the follower object. As the link follower comes from the "User" entity, the fields added
// here end up in the "User" entity.
getValueFromLabel(profileFollower, "username", "Username");
getValueFromLabel(profileFollower, "age", "Age");
getValueFromLabel(profileFollower, "location", "Location");
getValueFromLabel(profileFollower, "created", "Profile created on");

A simple detection mechanism

The code we’ve seen targets any element that has a CSS class called “value”. Our HtmlParserListener will use the elementVisited method to store any HtmlElement that has been visited and contains this “value” CSS class. Every time a HtmlElement is matched, we remove it from the set of elements stored previously. At the end of the parsing process, any HtmlElement left in the set was visited but never matched by the parser. The implementation is so simple it hurts:

class SimpleValueDetector extends HtmlParserListener {

    /**
     * Keeps track of all possible relevant HTML elements that could contain data.
     */
    protected Set<HtmlElement> visitedSet = new HashSet<>();

    /**
     * The report is generated at the end of the parsing process, listing any data point that might have been missed by
     * the parser.
     */
    private List<String> unmatchedReport = new ArrayList<>();

    @Override
    public void parsingStarted(HtmlParsingContext context) {
        visitedSet.clear();
    }

    @Override
    public void elementVisited(HtmlElement element, HtmlParsingContext context) {
        // the parser will visit every element of the input HTML and call your elementVisited() implementation.
        // let's collect any possible elements of interest.
        if (element.classes().contains("value")) {
            visitedSet.add(element);
        }
    }

    @Override
    public void elementMatched(HtmlElement element, HtmlParsingContext context) {
        // when an element is matched, it means the parser matched a path to a field and collected its value.
        // We can remove the element from the set because our code only matches elements with CSS class="value".
        visitedSet.remove(element);
    }

    @Override
    public void parsingEnded(HtmlParsingContext context) {
        String source = getDocumentSourceName(context);
        for (HtmlElement unmatchedElement : visitedSet) {
            unmatchedReport.add(source + ": " + unmatchedElement.toString());
        }
    }

    protected String getDocumentSourceName(HtmlParsingContext context) {
        // gets the source URL or File.
        Object source = context.documentSource();
        if (source instanceof File) {
            source = ((File) source).getName();
        }
        return source.toString();
    }

    public List<String> getUnmatchedElementReport() {
        return unmatchedReport;
    }
}

One important thing to note here is that all instances of HtmlElement are destroyed after parsingEnded is called. You can’t keep them stored for further processing afterwards, as their information is no longer available. This is by design, to prevent memory leaks. That’s why we generate the report entries inside the parsingEnded method instead of simply returning the visitedSet used by our SimpleValueDetector class.

Now we just need to make the parser use it:

// Let's get the "User" entity associated with the profile URL
HtmlEntitySettings user = entityList
        .getEntity("User")
        .getRemoteFollower("profileUrl")
        .getEntity("User");

// Assign our custom value detector to it.
SimpleValueDetector valueDetector = new SimpleValueDetector();
user.setListener(valueDetector);

// Parse the users. Our custom listener will be called when the parser runs.
FileProvider input = new FileProvider("documentation/tutorial/html/example_007/list.html", "UTF-8");
HtmlParserResult users = new HtmlParser(entityList).parse(input).get("User");

// Let's see if there is any value we missed from the profile pages
List<String> missed = valueDetector.getUnmatchedElementReport();
if (missed.isEmpty()) {
    println("All possible values have been captured by the parser");
} else {
    println("Values *not* captured by the parser:");
    for (String entry : missed) {
        println(entry);
    }
}

Notice that our SimpleValueDetector is associated only with the User entity that parses the profile links. Each entity configured for a particular HTML input can have a different listener implementation. Keep in mind that the parser will process the HTML structure in parallel, so if you reuse your HtmlParserListener implementation for multiple entities, make sure it is thread-safe (i.e. any shared state must be synchronized).

Running the parser now will produce the following output:

Values *not* captured by the parser:
3321.html: <td class="value">26/6/2012</td>
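On the thread-safety note above: one possible way to protect the state of a shared listener is to swap the HashSet and ArrayList for concurrent or synchronized collections. The sketch below is an assumption-laden illustration, not parser code. It tracks plain String identifiers instead of HtmlElement instances and uses a thread pool to imitate concurrent callbacks.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Illustration only: listener-like state that tolerates concurrent callbacks.
class ThreadSafeStateSketch {

    // Concurrent set: add/remove may be called from several threads at once.
    private final Set<String> visited = ConcurrentHashMap.newKeySet();

    // Synchronized list: each add() is atomic.
    private final List<String> report = Collections.synchronizedList(new ArrayList<>());

    void onVisited(String id) { visited.add(id); }

    void onMatched(String id) { visited.remove(id); }

    void onEnded(String source) {
        for (String id : visited) {
            report.add(source + ": " + id);
        }
    }

    // Four workers each "visit" 1000 ids and "match" every other one.
    // Returns the number of ids left unmatched.
    static int run() {
        ThreadSafeStateSketch state = new ThreadSafeStateSketch();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int t = 0; t < 4; t++) {
            final int worker = t;
            pool.submit(() -> {
                for (int i = 0; i < 1000; i++) {
                    String id = worker + "-" + i;
                    state.onVisited(id);
                    if (i % 2 == 0) {
                        state.onMatched(id);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        state.onEnded("demo");
        return state.report.size();
    }

    public static void main(String[] args) {
        System.out.println("Unmatched entries: " + run());
    }
}
```

Thread-safe containers only cover part of the problem. A shared listener that clears its set in parsingStarted, as SimpleValueDetector does, could still wipe entries that belong to another document being parsed at the same time, so keeping the state separate per document is worth considering.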

Working with the parser context

The HtmlParsingContext available in every callback method of a HtmlParserListener provides more information about the current parser state. We can expand our SimpleValueDetector to also report the nodes that have been matched for a given field:

class ValueDetector extends SimpleValueDetector {

    /**
     * For the sake of the example, let's report all matched elements as well.
     */
    private List<String> matchedReport = new ArrayList<>();

    @Override
    public void elementVisited(HtmlElement element, HtmlParsingContext context) {
        super.elementVisited(element, context); // super stores elements with a "value" class

        // it's easy to look for <input> and <select> elements as well.
        String tag = element.tagName();
        if ("input".equals(tag) || "select".equals(tag)) {
            visitedSet.add(element);
        }
    }

    @Override
    public void elementMatched(HtmlElement element, HtmlParsingContext context) {
        super.elementMatched(element, context); // super removes matched elements

        // let's report the matched elements
        reportMatchedElements(context);
    }

    private void reportMatchedElements(HtmlParsingContext context) {
        // Let's report the matched fields and the sequence of elements the parser found.
        String source = getDocumentSourceName(context);

        // The context object returns the sequence of elements matched for one or more fields.
        // This only works in the scope of the elementMatched() method.
        Map<String, HtmlElement[]> matchedFields = context.getMatchedElements();

        for (Map.Entry<String, HtmlElement[]> e : matchedFields.entrySet()) {
            // field names are used as the map keys.
            String fieldName = e.getKey();

            // append the file name parsed
            StringBuilder tmp = new StringBuilder(source).append(" - ");

            // then the field name
            tmp.append(fieldName).append(": ");

            // the sequence of elements matched for the given field comes as the value of the map entry.
            HtmlElement[] elementPath = e.getValue();

            // prints only the tag names of all matched elements, except the last one.
            int i = 0;
            for (; i < elementPath.length - 1; i++) {
                //print the matched HTML tag
                String tag = elementPath[i].tagName();
                tmp.append("<").append(tag).append(">");
            }
            // prints the outer HTML of the last element matched
            tmp.append(elementPath[i].toString());

            matchedReport.add(tmp.toString());
        }
    }

    public List<String> getMatchedElementReport() {
        return matchedReport;
    }
}

Now, after parsing we can also get a matched element report with:

// We've also collected the elements matched, taking advantage of the information available from the
// context object the parser sends to the listener.
println("\nElements matched by the parser:");
List<String> matched = valueDetector.getMatchedElementReport();
for (String entry : matched) {
    println(entry);
}

Which prints out:

Values *not* captured by the parser:
3321.html: <td class="value">26/6/2012</td>

Elements matched by the parser:
1123.html - username: <td class="value">jsmith</td>
1123.html - age: <td class="value">25</td>
1123.html - location: <td class="value">Adelaide, South Australia</td>
1123.html - created: <td class="value">25/4/2008</td>
3321.html - username: <td class="value">mag</td>
3321.html - age: <td class="value">52</td>
3321.html - location: <td class="value">Austin, Texas</td>
3321.html - created: <td class="value">16/2/2010</td>
8821.html - username: <td class="value">jmiller</td>
8821.html - age: <td class="value">31</td>
8821.html - location: <td class="value">Detroit, Michigan</td>
8821.html - created: <td class="value">10/3/2007</td>
2315.html - username: <td class="value">fish</td>
2315.html - age: <td class="value">19</td>
2315.html - location: <td class="value">Berlin, Germany</td>
2315.html - created: <td class="value">1/7/2015</td>
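A report like this can also be post-processed to check that every page yielded the same fields. The following sketch assumes the "source - field: html" entry layout produced by the ValueDetector above. The class and method names are made up for illustration, and the sample data deliberately omits the created entry of 3321.html to simulate a page that has changed.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustration only: checks that each source in a matched-element report has all expected fields.
class MatchedReportCheckSketch {

    // Groups field names by source, assuming entries shaped like "source - field: html".
    static Map<String, Set<String>> fieldsBySource(List<String> report) {
        Map<String, Set<String>> result = new TreeMap<>();
        for (String entry : report) {
            int dash = entry.indexOf(" - ");
            if (dash < 0) {
                continue; // not in the expected layout
            }
            int colon = entry.indexOf(": ", dash + 3);
            if (colon < 0) {
                continue; // not in the expected layout
            }
            String source = entry.substring(0, dash);
            String field = entry.substring(dash + 3, colon);
            result.computeIfAbsent(source, k -> new TreeSet<>()).add(field);
        }
        return result;
    }

    // Lists, per source, the expected fields with no report entry.
    // Sources without any entry at all do not show up here.
    static Map<String, Set<String>> missingFields(List<String> report, Set<String> expected) {
        Map<String, Set<String>> missing = new TreeMap<>();
        for (Map.Entry<String, Set<String>> e : fieldsBySource(report).entrySet()) {
            Set<String> absent = new TreeSet<>(expected);
            absent.removeAll(e.getValue());
            if (!absent.isEmpty()) {
                missing.put(e.getKey(), absent);
            }
        }
        return missing;
    }

    static Map<String, Set<String>> demo() {
        // Sample entries; the "created" line of 3321.html is left out on purpose.
        List<String> report = Arrays.asList(
                "1123.html - username: <td class=\"value\">jsmith</td>",
                "1123.html - age: <td class=\"value\">25</td>",
                "1123.html - location: <td class=\"value\">Adelaide, South Australia</td>",
                "1123.html - created: <td class=\"value\">25/4/2008</td>",
                "3321.html - username: <td class=\"value\">mag</td>",
                "3321.html - age: <td class=\"value\">52</td>",
                "3321.html - location: <td class=\"value\">Austin, Texas</td>");
        Set<String> expected = new TreeSet<>(Arrays.asList("username", "age", "location", "created"));
        return missingFields(report, expected);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```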

Implementing your custom HtmlParserListener is relatively straightforward and with a bit of creativity allows you to greatly expand the parser’s capabilities. We hope you find it useful.

If you find a bug

We take errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. A new SNAPSHOT build will be generated so you (and anyone affected by the bug) can proceed with your work as soon as the adjustments are made.

If you find a bug, don’t hesitate to report an issue here. You can also submit feature requests or any other improvements there.

We are happy to help if you have any questions about how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.

We can work for you

If you don’t have the resources or don’t really want to spend time coding, we can build a custom solution for you using our products. We deliver quickly as we know the ins and outs of everything we are dealing with. Send an e-mail to sales@univocity.com with your requirements and we’ll be happy to assist.

The univocity team.

www.univocity.com