Downloads and historical data management

The parser can be configured to store each page it parses, including paginated results and followed links, and to re-run a parsing process against these stored files at a later time. All you have to do is define a directory structure that makes sense for your situation, taking advantage of the supported file name patterns.

The HtmlParserSettings provides two basic attributes to configure where copies of the HTML pages visited should be stored:

  • downloadContentDirectory: defines the root directory where all files downloaded by your project will be located. For example: parserSettings.setDownloadContentDirectory("{user.home}/Downloads/realEstate/"); will configure the parser to download all files under the current user’s “Downloads” directory, in the “realEstate” subdirectory. You can use any system property as well - just put it between { and }. The folder structure will be created automatically by the parser if it doesn’t exist.

  • fileNamePattern: defines the pattern of subdirectories and files stored under the given downloadContentDirectory. For example: parserSettings.setFileNamePattern("{date, yyyy-MMM-dd}/results_{page}.html"); will create the “2015-Mar-20” subdirectory under “{user.home}/Downloads/realEstate/”, with a file named results_1.html. The {page} pattern refers to the number of pages visited by the parser. If there’s pagination involved, you should see results_2.html, results_3.html and so on. All patterns will be presented in the next section.
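To make the {user.home} placeholder more concrete, here is a minimal sketch of how a system property placeholder could be expanded using only the Java standard library. It is not the parser's actual implementation: the class name SystemPropertyPaths and the method expand are hypothetical, and leaving unknown placeholders (such as {date, ...}) untouched is an assumption made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only (not part of the parser's API).
class SystemPropertyPaths {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{([^{}]+)\\}");

    // Replaces every {name} with the value of the system property of that name.
    // Placeholders without a matching system property are left untouched (assumption).
    static String expand(String path) {
        Matcher m = PLACEHOLDER.matcher(path);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = System.getProperty(m.group(1));
            String replacement = value != null ? value : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints the current user's home directory followed by /Downloads/realEstate/
        System.out.println(expand("{user.home}/Downloads/realEstate/"));
    }
}
```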

In addition to the download directory and the file name pattern, the HtmlParserSettings provides the following options:

  • parseDate: configures the parser to use a given date to evaluate any given fileNamePattern that has a {date} pattern - this will make it look for stored files and parse anything that was saved at the given date.

  • batchId: allows you to provide your own batch identifier to evaluate any given fileNamePattern that has a {batch} pattern.

  • downloadOverwritingEnabled: controls whether to overwrite any previously downloaded files when the file name pattern points to an existing file. For example, if the pattern is “{date, yyyy-MMM-dd}/results_{page}.html”, only files downloaded today will be overwritten. If set to false, the parser will run over the existing downloaded files and - if applicable - download any remaining files as needed. This allows long-running processes to be stopped and resumed later.

  • downloadEnabled: enables/disables any sort of download when the parser executes. If you are processing historical files, you probably want to have downloadEnabled=false to prevent the parser from attempting to download any new files, which can happen if you change your code over time. As a safeguard, setting a parseDate or a batchId will automatically disable downloads of any sort (to download anyway, you will have to set downloadEnabled to true explicitly).
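The way downloadEnabled and downloadOverwritingEnabled interact can be summarized as a small decision function. The sketch below only restates the rules described above in plain Java. The class DownloadDecision, the Action enum and the SKIP outcome (for a missing file while downloads are disabled) are assumptions made for illustration, and the real parser may handle that last case differently.

```java
// Hypothetical model of the download rules, for illustration only.
class DownloadDecision {
    enum Action { DOWNLOAD, READ_LOCAL, SKIP }

    static Action decide(boolean downloadEnabled, boolean overwritingEnabled, boolean fileExists) {
        if (!downloadEnabled) {
            // Historical processing: never touch the remote site.
            // What happens to a missing file is an assumption here.
            return fileExists ? Action.READ_LOCAL : Action.SKIP;
        }
        if (fileExists && !overwritingEnabled) {
            // Resume support: reuse what was already downloaded.
            return Action.READ_LOCAL;
        }
        // Either the file is missing or overwriting is allowed.
        return Action.DOWNLOAD;
    }

    public static void main(String[] args) {
        System.out.println(decide(true, false, true));  // READ_LOCAL
        System.out.println(decide(true, false, false)); // DOWNLOAD
        System.out.println(decide(true, true, true));   // DOWNLOAD
        System.out.println(decide(false, true, true));  // READ_LOCAL
    }
}
```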

File name patterns

The following patterns are recognized in the fileNamePattern:

  • {batch} prints your custom batch ID. Example: /tmp/{batch}/page_{page} where the batch ID is set to “abc” will produce /tmp/abc/page_1.html

  • {date} prints the current date or your custom parse date. A date mask can be provided to configure how the date should be displayed (using the patterns accepted by java.text.SimpleDateFormat). Examples:
    • /tmp/file_{date, yyyy-MMM-dd}.pdf prints the formatted date /tmp/file_2016-Dec-25.pdf
    • /tmp/file_{date} prints the timestamp in milliseconds of the date /tmp/file_1482586200000.pdf

  • {page} prints the current page number from the HtmlPaginator. The page number can be padded with a given number of leading zeroes. Examples:
    • /tmp/file{page, 4}: prints /tmp/file0001.html, /tmp/file0321.html, etc
    • /tmp/file{page}: prints /tmp/file1.html, /tmp/file2.html, /tmp/file543.html, etc
    • /tmp/file{page, 2}: prints /tmp/file01.html, /tmp/file89.html, /tmp/file289.html, etc

  • {url}: prints part of the current URL being visited, the URL itself where each part is a directory, or a flattened representation of the URL. For example, given the relative URL /Property/307634/EST6886/Springfield
    • Using a 0-based index to select a section of the URL path: /tmp/{url, 2}.html prints the third section of the URL, producing /tmp/est6886.html
    • Flattening the URL: /tmp/{url, flat} flattens the URL to a single file name, producing /tmp/property_307634_est6886_springfield.html
    • Creating sub-directories based on the URL path: /tmp/{url} creates a directory structured based on the URL, producing /tmp/Property/307634/EST6886/Springfield.html
  • {$query_param}: prints the value of a given query parameter. For example, to print the q parameter in http://google.com/search?q=cup, the file name pattern /tmp/search_{$q}.html will produce /tmp/search_cup.html

  • {parent} is used when sub-pages are processed by a HtmlLinkFollower. The link follower has its own parser settings for you to provide a fileNamePattern. For example, assume the parser uses /{batch}/results_{page}.html, and it has a HtmlLinkFollower configured with {parent}/{url}.html. If batch “ZZZ” is run and the link follower opens a link with “href=/Property/307757/EBF22889/Humewood”, the HTML read from that link will be saved as: /ZZZ/results_1/Property/307757/EBF22889/Humewood.html

  • {entry} prints the record number associated with a link follower. For example, in a search result page, each result entry visited by a link follower will have a record number associated with it. The first link visited by the follower will have the entry number 1, the second link visited by the follower will have entry number 2, and so on. A file name pattern with {parent}/details_{entry}.html on the link follower, where the parser is using /tmp/search_1.html, prints /tmp/search_1/details_1.html, /tmp/search_1/details_2.html, etc. The entry number can be padded with a given number of leading zeroes. Examples:
    • /tmp/file{entry, 4}: prints /tmp/file0001.html, /tmp/file0321.html, etc
    • /tmp/file{entry}: prints /tmp/file1.html, /tmp/file2.html, /tmp/file543.html, etc
    • /tmp/file{entry, 2}: prints /tmp/file01.html, /tmp/file89.html, /tmp/file289.html, etc
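The date, padding and URL patterns listed above map onto a few standard Java operations. The following is a rough sketch of the equivalent logic, not the parser's own code: the class PatternPieces and its methods are hypothetical names. It also assumes that a single URL section keeps its original case (as in the file names of the worked example later in this document) and that the flattened form is lower-cased (as in the {url, flat} example above).

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Date;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical helpers mirroring the file name patterns, for illustration only.
class PatternPieces {
    // {date, mask}: formats the date with a java.text.SimpleDateFormat mask.
    static String date(Date date, String mask) {
        return new SimpleDateFormat(mask, Locale.ENGLISH).format(date);
    }

    // {page, n} and {entry, n}: pads the number with leading zeroes up to the given width.
    static String padded(int number, int width) {
        return width > 0
                ? String.format(Locale.ROOT, "%0" + width + "d", number)
                : Integer.toString(number);
    }

    // {url, i}: selects one section of the URL path using a 0-based index.
    static String urlSection(String path, int index) {
        return sections(path)[index];
    }

    // {url, flat}: joins all path sections into a single lower-case name.
    static String urlFlat(String path) {
        return String.join("_", sections(path)).toLowerCase(Locale.ROOT);
    }

    private static String[] sections(String path) {
        String trimmed = path.startsWith("/") ? path.substring(1) : path;
        return trimmed.split("/");
    }

    public static void main(String[] args) {
        Date christmas = new GregorianCalendar(2016, Calendar.DECEMBER, 25).getTime();
        String url = "/Property/307634/EST6886/Springfield";
        System.out.println("/tmp/file_" + date(christmas, "yyyy-MMM-dd") + ".pdf"); // /tmp/file_2016-Dec-25.pdf
        System.out.println("/tmp/file" + padded(1, 4) + ".html");   // /tmp/file0001.html
        System.out.println("/tmp/file" + padded(289, 2) + ".html"); // /tmp/file289.html
        System.out.println("/tmp/" + urlSection(url, 2) + ".html"); // /tmp/EST6886.html
        System.out.println("/tmp/" + urlFlat(url) + ".html");       // /tmp/property_307634_est6886_springfield.html
    }
}
```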

Time for a demonstration with actual code.

Historical data processing example

Let’s say we want to analyze trends of real estate prices around a region. This will require us to store property price information over time.

We’ve created the following example based on the structure typically seen in real estate websites, which lists properties for sale in a given region using the following URL (adapted to run on a mock test server):

"http://localhost:8086/Property/Residential?search=&location={LOCATION_CODE}&proptype=&min=&max=&minbed=&maxbed=&formsearch=true&page=1"

Where LOCATION_CODE is a parameter that takes the code of a given region.

The input to parse

Our mock server works with region 22008, so hitting the URL with LOCATION_CODE set to 22008 will produce a page of property search results.

The HTML has the following structure:

<div id="galleryView">
    <ul>
        <li lat="0" lng="0" listingid="307928">
            <div class="listingContent">
                <h2 title="A Home That Offers Tranquility And Peace - The Sands"><a href="/Property/307928/EOJ22396/Fountains-Estate" style="pointer-events:none;">A Home That Offers Tranquility And Peace - The Sands</a></h2>
                <h3 title="Fountains Estate">Fountains Estate</h3>
                <div class="propFeatures">
                    <h3>$1,460,000</h3>
                    <a href="./EOJ22396.html" style="color:red">View</a>
                </div>
            </div>

        </li>
        <li lat="0" lng="0" listingid="307712">
            <div class="listingContent">
                <h2><a href="/Property/307712/EFB3925/St-Francis-On-Sea-Phase-I-I" style="pointer-events:none;">Uninterrupted Views!</a></h2>
                <div class="listAddress">
                    <h3>St Francis On Sea Phase I I</h3>
                </div>
                <p>Uninterrupted views of the ocean forever! This beautiful plot is waiting for you to built and live your dream. Prime spot.
                    Call us today.</p>
                <div class="propFeatures">
                    <h3>$880,000</h3>
                    <a href="./EFB3925.html" style="color:red">View</a>
                </div>
            </div>
        </li>
    </ul>
</div>
<div id="pager">
    <ul style="list-style: none;">
        <li class="pagerPrev" style="float:left"></li>
        <li class="pagerCount" style="float:left">
            Page:<span>1</span><em>|</em>
            <a href="/Property/Residential?search=&amp;location=22008&amp;proptype=&amp;min=&amp;max=&amp;minbed=&amp;maxbed=&amp;formsearch=true&amp;page=2" style="pointer-events:none;">2</a><em>|</em>
            <a href="/Property/Residential?search=&amp;location=22008&amp;proptype=&amp;min=&amp;max=&amp;minbed=&amp;maxbed=&amp;formsearch=true&amp;page=3" style="pointer-events:none;" >3</a><em>|</em>
            <a href="/Property/Residential?search=&amp;location=22008&amp;proptype=&amp;min=&amp;max=&amp;minbed=&amp;maxbed=&amp;formsearch=true&amp;page=4" style="pointer-events:none;" >4</a> ...
            <a href="/Property/Residential?search=&amp;location=22008&amp;proptype=&amp;min=&amp;max=&amp;minbed=&amp;maxbed=&amp;formsearch=true&amp;page=156" style="pointer-events:none;">156</a><em>|</em>
            <a href="/Property/Residential?search=&amp;location=22008&amp;proptype=&amp;min=&amp;max=&amp;minbed=&amp;maxbed=&amp;formsearch=true&amp;page=157" style="pointer-events:none;">157</a>
        </li>
        <li class="pagerNext" style="float:left; pointer-events:none;"><a class="btn" href="/Property/Residential?search=&amp;location=22008&amp;proptype=&amp;min=&amp;max=&amp;minbed=&amp;maxbed=&amp;formsearch=true&amp;page=2">Next</a></li>
    </ul>
</div>
<br/><br/>
<div>
    <a href="./results_0002.html" style="color:red">View page 2</a>
</div>

Storing the first level of results

First, we want to save the search results of the day.

For brevity, we’re going through the first 2 pages of results. The following code configures where the parser should save the search result. The paginator is also configured:

HtmlEntityList entityList = new HtmlEntityList();

HtmlParserSettings parserSettings = entityList.getParserSettings();
parserSettings.setDownloadContentDirectory("{user.home}/Downloads/realEstate/");
parserSettings.setFileNamePattern("{date, yyyy-MMM-dd}/location_{$location}_page_{page, 4}.html");

// won't overwrite local files. Allows stopping and resuming the process.
parserSettings.setDownloadOverwritingEnabled(false);

//configure the paginator
HtmlPaginator paginator = entityList.getPaginator();
paginator.setCurrentPageNumber()
        .match("div").id("pager")
        .match("li").classes("pagerCount")
        .matchFirst("span").getText();

paginator.setNextPage()
        .match("li").classes("pagerNext")
        .matchFirst("a").getAttribute("href");

//Configure the paginator to visit one page of results after the first
paginator.setFollowCount(1);

There are no entities configured yet, but we can already run the parser to save the search result files and see what they look like:

String url = "http://localhost:8086/Property/Residential?search=&location={LOCATION_CODE}&proptype=&min=&max=&minbed=&maxbed=&formsearch=true&page=1";
// The search query lists all properties at a given location
UrlReaderProvider search = new UrlReaderProvider(url);

HtmlParser parser = new HtmlParser(entityList);

// We just need to set the location code and parse the URL. All files will be saved locally. Running the parser
// again on the same day will simply run over the stored files instead of actually going to the remote site.
search.getRequest().setUrlParameter("LOCATION_CODE", "22008");
parser.parse(search);

After running this code, under your user home you should find the directory Downloads/realEstate/ with a sub-directory named after the current date. Inside this directory, you will find the HTML files of the first two pages of results: location_22008_page_0001.html and location_22008_page_0002.html, which match the pattern defined in the code: location_{$location}_page_{page, 4}. Here, {$location} is replaced with the value of the location query parameter in the URL, and {page, 4} is the page number, padded with leading zeros to a length of 4.
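As a rough illustration of how such a file name comes together, the sketch below extracts the location query parameter from the search URL and combines it with a zero-padded page number. This is plain Java written for this explanation, not the parser's implementation: QueryParamFileName, queryParam and fileName are hypothetical names. The page number is passed in separately, since {page} is the page count tracked by the paginator rather than the page parameter of the URL.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class QueryParamFileName {
    // Returns the value of the given query parameter, or null if it is absent.
    static String queryParam(String url, String name) {
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            String[] keyValue = pair.split("=", 2);
            if (keyValue[0].equals(name)) {
                return keyValue.length > 1 ? keyValue[1] : "";
            }
        }
        return null;
    }

    // Mirrors the pattern location_{$location}_page_{page, 4}.html
    static String fileName(String url, int page) {
        return "location_" + queryParam(url, "location")
                + "_page_" + String.format(Locale.ROOT, "%04d", page) + ".html";
    }

    public static void main(String[] args) {
        String url = "http://localhost:8086/Property/Residential?search=&location=22008"
                + "&proptype=&min=&max=&minbed=&maxbed=&formsearch=true&page=1";
        System.out.println(fileName(url, 1)); // location_22008_page_0001.html
        System.out.println(fileName(url, 2)); // location_22008_page_0002.html
    }
}
```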

Running the code to parse again on the same date will not overwrite the already downloaded files, as parserSettings.setDownloadOverwritingEnabled(false); will prevent the download. The parser will only download again if you delete one or all of the files then run the parser again. For example, if you delete the file location_22008_page_0002.html and run the parser, the logs will print out the following:

Skipping content download from 'http://localhost:8086/Property/Residential?search=&location=22008&proptype=&min=&max=&minbed=&maxbed=&formsearch=true&page=1'. File already exists: '/home/jbax/Downloads/realEstate/2018-Mar-28/location_22008_page_0001.html'
Reading data from local file '/home/jbax/Downloads/realEstate/2018-Mar-28/location_22008_page_0001.html' (skipped download from http://localhost:8086/Property/Residential?search=&location=22008&proptype=&min=&max=&minbed=&maxbed=&formsearch=true&page=1)

Then:

Parsing next page from: http://localhost:8086/Property/Residential?search=&location=22008&proptype=&min=&max=&minbed=&maxbed=&formsearch=true&page=2
Downloading content from 'http://localhost:8086/Property/Residential?search=&location=22008&proptype=&min=&max=&minbed=&maxbed=&formsearch=true&page=2' into '/home/jbax/Downloads/realEstate/2018-Mar-28/location_22008_page_0002.html'
Opening URL 'http://localhost:8086/Property/Residential?search=&location=22008&proptype=&min=&max=&minbed=&maxbed=&formsearch=true&page=2' (HTTP method GET)

Now we are ready to download and parse the details of each property listed in the search results.

Storing linked pages

The next step is to visit the link to each property to collect its details. A property details page has the following HTML structure:

<div id="content">
    <div id="listingDetail" data-listingid="307743">
        <div id="detailTitle">
            <h1>A Home That Offers Tranquility And Peace - The Sands</h1>
            <h2 class="detailAddress">Fountains Estate</h2>
            <div class="propFeatures">
                <h3 id="listingViewDisplayPrice">$1,460,000</h3>
                <ul id="detailFeatures">
                    <li title="2 Bedrooms" class="bdrm"><span>2</span> <img src="./Images/Icons/bedroom.svg" alt="" style="height: 25px;"></li>
                    <li title="2 Bathrooms" class="bthrm"><span>2</span> <img src="./Images/Icons/bathroom.svg" alt=""  style="height: 25px;"></li>
                </ul>
            </div>
        </div>
        <div class="listingInfo">
            <span><strong>Listing Number:</strong> EOJ22396</span>
        </div>
        <div class="listDetailWrapper">
            <div class="left-column">
                <div class="section property-information-section">
                    <div class="heading">Property Information</div>
                    <div class="content">
                        <div>
                            <div class="read-more-wrap property-information">
                                <ul>
                                    <li>
                                        <span class="heading">Floor Area: </span>173 sqm
                                    </li>
                                    <li>
                                        <span class="heading">Land Size: </span>440 sqm
                                    </li>
                                    <li>
                                        <span class="heading">Property condition: </span>New
                                    </li>
                                    <li>
                                        <span class="heading">Property Type: </span>House
                                    </li>
                                    <li>
                                        <span class="heading">Garaging / carparking: </span>Double lock-up
                                    </li>
                                    <li>
                                        <span class="heading">Construction: </span>Plaster
                                    </li>
                                    <li>
                                        <span class="heading">Window/door frames: </span>Aluminium
                                    </li>
                                    <li>
                                        <span class="heading">Living area: </span>Open plan
                                    </li>
                                    <li>
                                        <span class="heading">Main bedroom: </span>Double and Built-in-robe
                                    </li>
                                    <li>
                                        <span class="heading">Ensuite: </span>Separate shower, Bath
                                    </li>
                                    <li>
                                        <span class="heading">Bedroom 2: </span>Double and Built-in / wardrobe
                                    </li>
                                    <li>
                                        <span class="heading">Main bathroom: </span>Separate shower
                                    </li>

                                </ul>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div>
<div>
    <a href="./results_0001.html" style="color:red">Back to results</a>
</div>

The following code creates a HtmlLinkFollower that downloads the HTML with the details of each property:

HtmlEntitySettings houses = entityList.configureEntity("houses");

HtmlLinkFollower houseDetails = houses.addField("propertyDetailsLink")
        .match("div").id("galleryView")
        .match("div").classes("listingContent")
        .matchNext("h2").matchNext("a").getAttribute("href")
        .followLink();
houseDetails.setNesting(Nesting.JOIN);

// We need to visit each link of the search results to get the details of each house available for sale.
// The details page of each property will be saved along with the initial search results. The third element
// of the URL path can be used to name the files.
houseDetails.getParserSettings().setFileNamePattern("{parent}/../{url, 2}.html");

houseDetails.addField("id").match("strong").withExactText("Listing Number:").getFollowingText();
houseDetails.addField("address").match("h2").classes("detailAddress").getText();
houseDetails.addField("price").match("h3").id("listingViewDisplayPrice").getText();

PartialPath info = houseDetails.newPath().match("ul").id("detailFeatures");
info.addField("bedrooms").match("li").classes("bdrm").matchNext("span").getText();
info.addField("bathrooms").match("li").classes("bthrm").matchNext("span").getText();

info = houseDetails.newPath().match("div").classes("property-information").match("li").matchNext("span").classes("heading");
info.addField("landSize").matchCurrent().withText("Land size").getFollowingText();
info.addField("propertyType").matchCurrent().withText("Property type").getFollowingText();

The relevant line for historical data processing is:

houseDetails.getParserSettings().setFileNamePattern("{parent}/../{url, 2}.html")

This configures the parser settings of the HtmlLinkFollower to save the HTML of the links it visits under {parent} - essentially meaning we want the HTML to be under the same directory we’ve been using to store the search results obtained during the current day.

The pattern we used has /../ to indicate that the detail pages read by the link follower should be stored alongside the search results. We did that because, by default, the parser generates subdirectories named after the parent HTML file that has the links visited by the link follower. For example, as location_22008_page_0001.html has the first page of results, the subdirectory location_22008_page_0001/ would be created, and inside it the HTML files of the two property details links visited from that first page.

Finally, {url, 2} captures the third section of the visited URL path. For example, one of the links visited is <a href="/Property/307712/EFB3925/St-Francis-On-Sea-Phase-I-I">Uninterrupted Views!</a>, and the third section of its path is EFB3925, so {url, 2}.html will result in a file named EFB3925.html being created.
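The effect of /../ in the pattern is the same as ordinary path normalization. The following sketch uses java.nio.file to show how {parent}/../{url, 2}.html could collapse into the directory of the search results. It is an approximation for illustration only: the class ParentPattern and its resolve method are hypothetical, and deriving the {parent} directory by stripping the .html extension from the parent file name is an assumption based on the description above.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, for illustration only.
class ParentPattern {
    // Resolves "{parent}/../{url, 2}.html" for a given parent file and URL section.
    static Path resolve(Path parentFile, String urlSection) {
        String name = parentFile.getFileName().toString();
        // {parent}: a directory named after the parent file, without the extension (assumption).
        String parentDirName = name.endsWith(".html")
                ? name.substring(0, name.length() - ".html".length())
                : name;
        Path parentDir = parentFile.resolveSibling(parentDirName);
        // "/../" steps back out of that directory; normalize() removes the redundant hop.
        return parentDir.resolve("..").resolve(urlSection + ".html").normalize();
    }

    public static void main(String[] args) {
        Path searchResults = Paths.get("realEstate", "2018-Mar-28", "location_22008_page_0001.html");
        // The detail page ends up next to the search results, in realEstate/2018-Mar-28
        System.out.println(resolve(searchResults, "EFB3925"));
    }
}
```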

Running the code above will produce the following output:

propertyDetailsLink___________________________________id________address______________________price_______bedrooms__bathrooms__landSize__propertyType__
/Property/307928/EOJ22396/Fountains-Estate            EOJ22396  Fountains Estate             $1,460,000  2         2          440 sqm   House         
/Property/307712/EFB3925/St-Francis-On-Sea-Phase-I-I  EFB3925   St Francis On Sea Phase I I  $880,000                         739 sqm   Land          
/Property/307743/EMC26962/West-Bank                   EMC26962  West Bank                    $890,000    3         2          607 sqm   House         
/Property/307927/ECT37343/Cambridge-West              ECT37343  Cambridge West               $450,000                         536 sqm   Land

And the files under <your home>/Downloads/realEstate/<current date>/ will be:

  • location_22008_page_0001.html
  • location_22008_page_0002.html
  • ECT37343.html
  • EFB3925.html
  • EMC26962.html
  • EOJ22396.html

If you run the parser again on the same day, it will process these local files instead of re-downloading them.

Re-processing historical data

If you run the parser every day, it will create new directories with the files collected for that day. To re-process a previously parsed result, you can just provide the date with:

//use a formatted date string, in a format that matches the pattern you specified
entityList.getParserSettings().setParseDate("2015-Mar-27");
//or
Calendar yesterday = Calendar.getInstance(); // current date
yesterday.add(Calendar.DAY_OF_MONTH, -1); // minus 1 day (add() also adjusts month and year; roll() would not)
entityList.getParserSettings().setParseDate(yesterday);

HtmlParser parser = new HtmlParser(entityList);

You can then parse the same URL, without any other code changes, and the local files stored in the corresponding date directory will be used to collect the results.
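When computing a past date with java.util.Calendar, keep in mind that add and roll behave differently at month boundaries: add adjusts the month and year as needed, while roll only changes the given field. The sketch below demonstrates the difference and formats the result with the yyyy-MMM-dd mask used in this example. The class YesterdayDate and its methods are hypothetical names used only for this illustration.

```java
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

// Hypothetical helper, for illustration only.
class YesterdayDate {
    // Formats the date using the same mask as the {date, yyyy-MMM-dd} pattern.
    static String format(Calendar date) {
        return new SimpleDateFormat("yyyy-MMM-dd", Locale.ENGLISH).format(date.getTime());
    }

    // add() moves across month and year boundaries.
    static Calendar minusOneDayWithAdd(Calendar date) {
        Calendar copy = (Calendar) date.clone();
        copy.add(Calendar.DAY_OF_MONTH, -1);
        return copy;
    }

    // roll() wraps around within the same month, leaving month and year unchanged.
    static Calendar minusOneDayWithRoll(Calendar date) {
        Calendar copy = (Calendar) date.clone();
        copy.roll(Calendar.DAY_OF_MONTH, -1);
        return copy;
    }

    public static void main(String[] args) {
        Calendar firstOfMarch = new GregorianCalendar(2015, Calendar.MARCH, 1);
        System.out.println(format(minusOneDayWithAdd(firstOfMarch)));  // 2015-Feb-28
        System.out.println(format(minusOneDayWithRoll(firstOfMarch))); // 2015-Mar-31
    }
}
```

For selecting the previous day's directory, add is therefore the safer choice.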

Notice that when you call setParseDate or setBatchId no downloads will be performed by default, preventing any changes to your historical data. You’ll have to use setDownloadEnabled(true) explicitly to enable downloading any missing file.

If you are managing your own historical directory structure, without a {date} or {batch} pattern, you will probably want to call setDownloadEnabled(false); explicitly in your code when re-processing old files. Otherwise, if changes are made to the parsing code over time, it may find new links to download and introduce new files into the historical directories (which you may or may not want, according to your requirements).

Further reading

Now that you know how to organize, store and reparse your HTML, you can proceed to any of the other sections (in any order).

If you find a bug

We deal with errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. A new SNAPSHOT build will be generated so you (and anyone affected by the bug) can proceed with your work as soon as the adjustments are made.

If you find a bug, don’t hesitate to report an issue. You can also submit feature requests or any other improvements the same way.

We are happy to help if you have any questions regarding how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.

We can work for you

If you don’t have the resources or don’t really want to spend time coding, we can build a custom solution for you using our products. We deliver quickly as we know the ins and outs of everything we are dealing with. Send an e-mail to sales@univocity.com with your requirements and we’ll be happy to assist.

The univocity team.

www.univocity.com