Resource Downloading

Typically, when you use the parser to capture an HTML page (and linked pages), only the raw HTML is retrieved. If you configure the parser to save these files locally, a browser will probably not render them properly, because the HTML depends on CSS files and other resources, such as images, that are not available offline.

For example, take an HTML snippet with:

<link rel="stylesheet" type="text/css" href="style.css"/>

When the page is saved locally as /tmp/test.html, a browser will try to find a style.css file beside test.html. As the CSS file was not downloaded, the browser cannot render the page as it appears on the live website.
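The lookup the browser performs here is ordinary relative URL resolution. The sketch below uses only java.net.URI from the JDK, not the parser, to show where the same href points when the page is served from a website and when it is opened from /tmp/test.html. The class name and the example.com host are made up for illustration.

```java
import java.net.URI;

// Illustration only: plain JDK code, independent of the parser.
final class RelativeLinkDemo {

    // Resolves an href relative to the location of the page that contains it.
    static URI resolve(String pageLocation, String href) {
        return URI.create(pageLocation).resolve(href);
    }

    public static void main(String[] args) {
        // Hypothetical live page: the stylesheet is looked up on the web server.
        System.out.println(resolve("http://example.com/test.html", "style.css"));

        // Same page saved locally: the stylesheet is now expected beside /tmp/test.html.
        System.out.println(resolve("file:///tmp/test.html", "style.css").getPath());
    }
}
```

In both cases the href is unchanged; only the location of the containing page differs, which is why the saved copy looks for a file that was never downloaded.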

Fortunately, you can configure the parser to fetch any resources used by a page so your local files become more usable.

Collecting page resources

We’re going to build upon the example used in Storing linked pages, which visits pages with real estate details and saves the HTML locally into {user.home}/Downloads/realEstate/{date}.


The linked pages display a little icon beside the number of bedrooms and bathrooms:

<div id="content">
    <div id="listingDetail" data-listingid="307743">
        <div id="detailTitle">
            <h1>Solid Built Home in West Bank</h1>
            <h2 class="detailAddress">West Bank</h2>
            <div class="propFeatures">
                <h3 id="listingViewDisplayPrice">$890,000</h3>
                <ul id="detailFeatures">
                    <li title="3 Bedrooms" class="bdrm"><span>3</span> <img src="./Images/Icons/bedroom.svg" alt="" style="height: 25px;"></li>
                    <li title="2 Bathrooms" class="bthrm"><span>2</span> <img src="./Images/Icons/bathroom.svg" alt=""  style="height: 25px;"></li>
                </ul>
            </div>
        </div>
        <div class="listingInfo">
            <span><strong>Listing Number:</strong> EMC26962</span>
        </div>
        <div class="listDetailWrapper">
            <div class="left-column">
                <div class="section property-information-section">

                    <div class="heading">Property Information</div>
                    <div class="content">
                        <div>
                            <div class="read-more-wrap property-information">
                                <ul>
                                    <li>
                                        <span class="heading">Land Size: </span>607 sqm
                                    </li>
                                    <li>
                                        <span class="heading">Property condition: </span>Good
                                    </li>
                                    <li>
                                        <span class="heading">Property Type: </span>House
                                    </li>
                                    <li>
                                        <span class="heading">Garaging / carparking: </span>Single lock-up
                                    </li>
                                    <li>
                                        <span class="heading">Living area: </span>Formal lounge
                                    </li>
                                    <li>
                                        <span class="heading">Main bedroom: </span>Double
                                    </li>
                                    <li>
                                        <span class="heading">Ensuite: </span>Separate shower
                                    </li>
                                    <li>
                                        <span class="heading">Bedroom 2: </span>Double
                                    </li>
                                    <li>
                                        <span class="heading">Bedroom 3: </span>Double
                                    </li>
                                    <li>
                                        <span class="heading">Main bathroom: </span>Bath
                                    </li>
                                    <li>
                                        <span class="heading">Workshop: </span>Combined
                                    </li>
                                </ul>
                            </div>
                        </div>
                    </div>
                </div>
            </div>
        </div>
    </div>
</div>
<div>
    <a href="./results_0002.html" style="color:red">Back to results</a>
</div>

Once the HTML is downloaded by the parser, the images won’t show up in the browser, because the image files themselves were not saved.

We want to collect these images into a shared “cache” folder, so that the copy of each page we store locally can reference the images in our “cache”. This is easily configured with the help of the FetchOptions class:

// configure the fetch operation
FetchOptions fetchOptions = new FetchOptions();

// all resources of all pages to be stored under a "cache" folder.
fetchOptions.setSharedResourceDir("{user.home}/Downloads/realEstate/cache");

// use a download handler to control what to download - the `DownloadContext` provides many options, check the javadoc
fetchOptions.setDownloadHandler(new DownloadHandler() {
    @Override
    public void nextDownload(DownloadContext context) {
        print("[" + context.parentHtmlFile().getName() + "] ");
        //we don't want to fetch linked html pages.
        if ("html".equals(context.targetFileExtension())) {
            println("skip download from " + context.downloadUrl() + " into " + context.targetRelativePath());
            context.skipDownload();
            return;
        }

        println("download from " + context.downloadUrl() + " into " + context.targetRelativePath());
    }
});

// tell the parser to fetch the resources using our configuration.
entityList.getParserSettings().fetchResourcesBeforeParsing(fetchOptions);

// we also need to force the parser to overwrite the local files stored previously, otherwise it won't touch the existing files.
entityList.getParserSettings().setDownloadOverwritingEnabled(true);

Now, when the parser executes, it will also download any files referenced by the page. Note that the parser changes the HTML content to point to the local resources, so it will not be exactly like the page originally downloaded. For example, a link such as

<link rel="stylesheet" type="text/css" href="style.css"/> 

will become:

<link rel="stylesheet" type="text/css" href="c:/Users/You/Documents/whatever/cache/style.css"/> 

If the page has a <base> tag, it will be removed as well.

A DownloadHandler can be used to skip unwanted files, change their locations and, as in our example, print out what is going to happen to each of the linked resources found in each visited page.
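The decision a handler makes can be kept in a small helper that is easy to test on its own. The sketch below is plain Java with no parser dependency: ResourceFilter and shouldSkip are hypothetical names, and the set of skipped extensions is an assumption. Inside nextDownload you would call context.skipDownload() whenever shouldSkip(context.targetFileExtension()) returns true.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of the parser API.
final class ResourceFilter {

    // Assumed list of extensions we do not want to fetch (linked pages rather than page resources).
    private static final Set<String> SKIPPED = new HashSet<>(Arrays.asList("html", "htm"));

    // Returns true when a resource with the given file extension (no leading dot) should be skipped.
    static boolean shouldSkip(String extension) {
        if (extension == null) {
            return false; // no extension: let the download proceed
        }
        return SKIPPED.contains(extension.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println("html skipped? " + shouldSkip("html"));
        System.out.println("svg skipped? " + shouldSkip("svg"));
    }
}
```

Keeping the rule outside the anonymous handler class lets you extend the list of extensions without touching the fetch configuration.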

The example code produces the following output:

[location_22008_page_0001.html] skip download from http://localhost:8086/EOJ22396.html into ../cache/EOJ22396.html
[location_22008_page_0001.html] skip download from http://localhost:8086/EFB3925.html into ../cache/EFB3925.html
[location_22008_page_0001.html] skip download from http://localhost:8086/results_0002.html into ../cache/results_0002.html
[EOJ22396.html] download from http://localhost:8086/Images/Icons/bedroom.svg into ../cache/Images/Icons/bedroom.svg
[EOJ22396.html] download from http://localhost:8086/Images/Icons/bathroom.svg into ../cache/Images/Icons/bathroom.svg
[EOJ22396.html] skip download from http://localhost:8086/results_0001.html into ../cache/results_0001.html
[EFB3925.html] skip download from http://localhost:8086/results_0001.html into ../cache/results_0001.html
[location_22008_page_0002.html] skip download from http://localhost:8086/EMC26962.html into ../cache/EMC26962.html
[location_22008_page_0002.html] skip download from http://localhost:8086/ECT37343.html into ../cache/ECT37343.html
[location_22008_page_0002.html] skip download from http://localhost:8086/results_0001.html into ../cache/results_0001.html
[EMC26962.html] download from http://localhost:8086/Images/Icons/bedroom.svg into ../cache/Images/Icons/bedroom.svg
[EMC26962.html] download from http://localhost:8086/Images/Icons/bathroom.svg into ../cache/Images/Icons/bathroom.svg
[EMC26962.html] skip download from http://localhost:8086/results_0002.html into ../cache/results_0002.html
[ECT37343.html] skip download from http://localhost:8086/results_0002.html into ../cache/results_0002.html

After running the code you should also find the following files in your filesystem:

{user.home}/Downloads/realEstate/cache/Images/Icons/bathroom.svg
{user.home}/Downloads/realEstate/cache/Images/Icons/bedroom.svg

If you use a browser to open the stored HTML files under {user.home}/Downloads/realEstate/{date}, the local images will show up beside the number of bathrooms/bedrooms. The local HTML file will have image links that look like this:

<ul id="detailFeatures"> 
<li title="3 Bedrooms" class="bdrm"><span>3</span> <img src="../cache/Images/Icons/bedroom.svg" alt="" style="height: 25px;"></li> 
<li title="2 Bathrooms" class="bthrm"><span>2</span> <img src="../cache/Images/Icons/bathroom.svg" alt="" style="height: 25px;"></li> 
</ul>
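The ../cache/ prefix in these links is simply the path from the folder holding the page to the shared resource folder. The following is a minimal sketch of that path arithmetic using java.nio from the JDK. The directory names are hypothetical stand-ins for {user.home}/Downloads/realEstate/{date} and the cache folder, and the code does not claim to mirror how the parser computes its links.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Illustration only: plain JDK code, independent of the parser.
final class RelativeCacheLink {

    // Computes the link a page stored in pageDir needs in order to reach a file in the shared cache.
    static String linkFrom(String pageDir, String cachedFile) {
        Path relative = Paths.get(pageDir).relativize(Paths.get(cachedFile));
        // Use forward slashes so the result is usable in an HTML attribute on any OS.
        return relative.toString().replace('\\', '/');
    }

    public static void main(String[] args) {
        // Hypothetical locations of a dated page folder and a cached image.
        System.out.println(linkFrom(
                "/home/you/Downloads/realEstate/2020-01-01",
                "/home/you/Downloads/realEstate/cache/Images/Icons/bedroom.svg"));
    }
}
```

Because the link is relative, the realEstate folder can be moved or copied as a whole without breaking the stored pages.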

You can control which resources of a page are fetched, and how, with the help of a DownloadHandler as shown in the example above. The parser calls your implementation of this callback interface with a DownloadContext object, which provides information about the download and lets you filter out unwanted files.

Saving a web page like a browser does

You can also save web pages, or sections of them, like a browser does (typically via File -> Save Page As… or similar). All you have to do is call the fetchResources method directly on the root HtmlElement of an HTML tree:

UrlReaderProvider url = new UrlReaderProvider("http://www.univocity.com");

//parse the web page into a HTML element tree.
HtmlElement root = HtmlParser.parseTree(url);

//configure fetch options as needed
FetchOptions fetchOptions = new FetchOptions();

// flatten directories (generates long file names, no subdirectories)
fetchOptions.flattenDirectories(true);

// you can use the fetch output object to list all downloaded files
FetchOutput output = root.fetchResources("{user.home}/Downloads/univocity.html", "UTF-8", fetchOptions);

// let's list the downloaded files
Map<File, URL> downloadedFiles = output.getResourceMap();

for (Map.Entry<File, URL> e : downloadedFiles.entrySet()) {
    println(e.getKey().getAbsolutePath() + " downloaded from " + e.getValue());
}

The downloaded files and their source URLs are available from the FetchOutput. The code above should print something like this:

/home/jbax/Downloads/univocity_files/s_files_1_0393_5225_t_5_assets_favicon.ico downloaded from http://cdn.shopify.com/s/files/1/0393/5225/t/5/assets/favicon.ico
/home/jbax/Downloads/univocity_files/s_files_1_0393_5225_t_5_assets_ajax-load.gif downloaded from http://cdn.shopify.com/s/files/1/0393/5225/t/5/assets/ajax-load.gif
/home/jbax/Downloads/univocity_files/s_files_1_0393_5225_t_5_assets_main.js downloaded from http://cdn.shopify.com/s/files/1/0393/5225/t/5/assets/main.js
/home/jbax/Downloads/univocity_files/s_files_1_0393_5225_t_5_assets_blank.gif downloaded from http://cdn.shopify.com/s/files/1/0393/5225/t/5/assets/blank.gif
/home/jbax/Downloads/univocity_files/s_files_1_0393_5225_t_5_assets_slide_2.jpg downloaded from http://cdn.shopify.com/s/files/1/0393/5225/t/5/assets/slide_2.jpg
...
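The flattened names in this output look like the path of each source URL with the directory separators replaced by underscores. The sketch below reproduces that naming for the sample URLs using only the JDK. It is an assumption inferred from the output above, not the parser's actual implementation, and FlatNames and flatten are hypothetical names.

```java
import java.net.URI;

// Illustration only: mimics the file names seen in the sample output.
final class FlatNames {

    // Turns the path of a resource URL into a single file name by replacing '/' with '_'.
    static String flatten(String url) {
        String path = URI.create(url).getPath();
        if (path.startsWith("/")) {
            path = path.substring(1); // drop the leading slash so the name does not start with '_'
        }
        return path.replace('/', '_');
    }

    public static void main(String[] args) {
        System.out.println(flatten("http://cdn.shopify.com/s/files/1/0393/5225/t/5/assets/favicon.ico"));
    }
}
```

A scheme like this keeps files from different remote folders apart without creating subdirectories, at the cost of long file names.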

Any CSS file downloaded will also have its resources fetched. For example, the original CSS rules:

#pageheader .search-box .search-form i {
  width: 22px;
  height: 22px;
  background: url('//cdn.shopify.com/s/files/1/0393/5225/t/5/assets/social_spr_white.png?17530279042161470113') -221px center no-repeat;
  top: 7px;
  left: 2px;
  position: absolute;
  z-index: 1;
}

#pageheader .search-box.focus .search-form i {
  background-image: url('//cdn.shopify.com/s/files/1/0393/5225/t/5/assets/social_spr_darkgrey.png?17530279042161470113');
}

will become something like this:

#pageheader .search-box .search-form i {
  width: 22px;
  height: 22px;
  background: url('s_files_1_0393_5225_t_5_assets_social_spr_white.png?17530279042161470113') -221px center no-repeat;
  top: 7px;
  left: 2px;
  position: absolute;
  z-index: 1;
}

#pageheader .search-box.focus .search-form i {
  background-image: url('s_files_1_0393_5225_t_5_assets_social_spr_darkgrey.png?17530279042161470113');
}

This ensures the local file can be rendered almost like the actual live web page.
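To make the CSS transformation concrete, here is a rough sketch of how url(...) references like the ones above could be rewritten with a regular expression. It only handles single-quoted absolute or protocol-relative URLs, it keeps the query string as in the sample output, and it is an illustration written for this tutorial rather than the parser's own code. CssUrlRewriter and rewrite are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a simplified stand-in for the rewriting shown above.
final class CssUrlRewriter {

    // Matches url('//host/path?query') or url('http(s)://host/path?query').
    // Group 1 is the path after the host, group 2 the optional query string.
    private static final Pattern CSS_URL =
            Pattern.compile("url\\('(?:https?:)?//[^/']+/([^'?]*)(\\?[^']*)?'\\)");

    static String rewrite(String css) {
        Matcher m = CSS_URL.matcher(css);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String flatName = m.group(1).replace('/', '_');
            String query = m.group(2) == null ? "" : m.group(2);
            // quoteReplacement protects any '$' or '\' in the generated text
            m.appendReplacement(out, Matcher.quoteReplacement("url('" + flatName + query + "')"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite(
                "background-image: url('//cdn.shopify.com/s/files/1/0393/5225/t/5/assets/social_spr_darkgrey.png?17530279042161470113');"));
    }
}
```

Real stylesheets also use double-quoted and unquoted url() values, relative paths and @import rules, which this sketch deliberately ignores.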

If you find a bug

We take errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. A new SNAPSHOT build will be generated so you (and anyone affected by the bug) can proceed with your work as soon as the adjustments are made.

If you find a bug, don’t hesitate to report an issue. Feature requests and suggestions for other improvements are welcome as well.

We are happy to help if you have any questions regarding how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.

The univocity team.

www.univocity.com