
Pagination

Almost every website has some sort of pagination mechanism that allows you to load more results from a given initial result page.

Typically, the mechanism behind the pagination involves simply a link to the next/previous page. If the link is followed, the next subset of results is displayed, along with another link to the next/previous page. In other cases, a POST request is needed because the server expects heaps of unintelligible parameters and values hidden in the page in order to track the state of a user session.

The HtmlPaginator provided by the parser assists in handling the simple and the complex cases. It’s available from every HtmlEntityList and HtmlLinkFollower through the getPaginator() method. Let’s demonstrate how it works.

The input to parse

Suppose you ran a search for the “univocity” keyword on some search engine, and it returned results spread across 3 pages. The HTML of the first page is:

<html>
    <head>
        <title>Search: Univocity - Page 1</title>
        <link rel="stylesheet" type="text/css" href="style.css"/>
    </head>
    <body>
        <div id="resultsContainer">
            <div class="advert block">
                <span>
                    <a href="" style="pointer-events:none;">Click Here for great deals</a>
                </span>
            </div>
            <div class="result block">
                <span>
                    <a href="https://github.com/univocity/univocity-parsers" style="pointer-events:none;">
                        GitHub - univocity/univocity-parsers
                    </a>
                </span>
            </div>
            <div class="result block">
                <span>
                    <a href="https://en.wikipedia.org/wiki/Univocity_of_being" style="pointer-events:none;">
                        Univocity of being - Wikipedia
                    </a>
                </span>
            </div>
            <div class="result block">
                <span>
                    <a href="https://www.merriam-webster.com/dictionary/univocity" style="pointer-events:none;">
                        Univocity | Definition of Univocity by Merriam-Webster
                    </a>
                </span>
            </div>
        </div>
        <div id="pageControl">
            <span id="nextPage">
                <a href="page2.html">Next Page</a>
            </span>
        </div>
    </body>
</html>

Now we need to configure the parser to visit the next page of results indicated in <a href="page2.html">Next Page</a>.

Basic pagination

The HtmlPaginator has special fields and, as with any other entity, you can define paths to assign values to these fields. The setNextPage() method is the most important one: it lets you build a path to the link that leads to the next page of results:

HtmlEntityList entityList = new HtmlEntityList();

//Configure the paginator
HtmlPaginator paginator = entityList.getPaginator();
paginator.setNextPage()
        .match("span").id("nextPage")
        .match("a").getAttribute("href"); // captures the link that goes to the next page

That’s all we need for the pagination of the given example. Now we can focus on collecting the data:

// Configure the entity that collects search results:
HtmlEntitySettings search = entityList.configureEntity("search");

PartialPath resultPath = search.newPath().match("div").classes("result");

resultPath.addField("title").match("a").getText();
resultPath.addField("link").match("a").getAttribute("href");

// Give the parser the first page to process.
FileProvider firstPage = new FileProvider("documentation/tutorial/html/example_009/page1.html", "UTF-8");

// It will visit all pages of the search results
HtmlParserResult searchResults = new HtmlParser(entityList).parse(firstPage).get("search");

The results will have all relevant rows available from the 3 result pages:

title_____________________________________________________________________link______________________________________________________________________________________
GitHub - univocity/univocity-parsers                                      https://github.com/univocity/univocity-parsers                                            
Univocity of being - Wikipedia                                            https://en.wikipedia.org/wiki/Univocity_of_being                                          
Univocity | Definition of Univocity by Merriam-Webster                    https://www.merriam-webster.com/dictionary/univocity                                      
univocity - ETL, data integration and data synchronization for Java       https://www.univocity.com/                                                                
The best & fastest CSV parser for Java. With TSV & Fixed ... - univocity  https://www.univocity.com/pages/about-parsers                                             
CsvParserSettings (univocity-parsers 1.3.0 API)                           http://docs.univocity.com/parsers/1.3.0/com/univocity/parsers/csv/CsvParserSettings.html  
univocity-parsers - speed and flexibility for all text formats            https://www.univocity.com/pages/parsers-features                                          
univocity - Wiktionary                                                    https://en.wiktionary.org/wiki/univocity                                                  
Deleuze, Spinoza and Univocity | Deontologistics                          https://deontologistics.wordpress.com/2009/08/03/deleuze-spinoza-and-univocity/

The paginator is built to target live websites but here it is working against files stored locally. You can check the Downloads and historical data management tutorial later to learn more about how to store, organize and reparse HTML.

Usually you’ll want to limit the number of pages to follow, so let’s add this line to visit only 1 page after the first and stop there:

paginator.setFollowCount(1);

The results will now contain only the data from the first 2 pages:

title_____________________________________________________________________link______________________________________________________________________________________
GitHub - univocity/univocity-parsers                                      https://github.com/univocity/univocity-parsers                                            
Univocity of being - Wikipedia                                            https://en.wikipedia.org/wiki/Univocity_of_being                                          
Univocity | Definition of Univocity by Merriam-Webster                    https://www.merriam-webster.com/dictionary/univocity                                      
univocity - ETL, data integration and data synchronization for Java       https://www.univocity.com/                                                                
The best & fastest CSV parser for Java. With TSV & Fixed ... - univocity  https://www.univocity.com/pages/about-parsers                                             
CsvParserSettings (univocity-parsers 1.3.0 API)                           http://docs.univocity.com/parsers/1.3.0/com/univocity/parsers/csv/CsvParserSettings.html
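
The effect of the follow count can be pictured with a small stand-alone sketch. It does not use the parser at all: the FollowCountSketch class, its crawl method and the map of page names are hypothetical stand-ins for result pages and their "next page" links, and the only assumption is that a follow count of N means at most N pages are visited after the first one.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FollowCountSketch {

    // Visits the first page, then follows "next" links until a page has no next
    // link or the follow count is used up. A negative follow count means "no limit".
    static List<String> crawl(Map<String, String> nextLinks, String firstPage, int followCount) {
        List<String> visited = new ArrayList<>();
        String current = firstPage;
        int followed = 0;
        while (current != null) {
            visited.add(current);
            if (followCount >= 0 && followed >= followCount) {
                break; // limit reached: do not follow any further link
            }
            current = nextLinks.get(current); // null when the page has no next link
            followed++;
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical three-page result set: each page maps to its "next page" link.
        Map<String, String> nextLinks = new LinkedHashMap<>();
        nextLinks.put("page1.html", "page2.html");
        nextLinks.put("page2.html", "page3.html");

        System.out.println(crawl(nextLinks, "page1.html", -1)); // no limit: all three pages
        System.out.println(crawl(nextLinks, "page1.html", 1));  // first page plus one more
    }
}
```

With no limit the loop ends on its own once a page has no next link, which mirrors the first run above; with a follow count of 1 it stops after the second page.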

Time to explore a few additional settings that are commonly required to handle pagination in the real world.

Intercepting the next request

Every time a page is processed, the parser will check whether the paginator points to a next page to be parsed. Before actually making the request it gives control back to the user to manipulate the request if required. For example, this snippet processes product names and prices from a search result for the keyword “mouse”:

// Configures the paginator to follow through a list of search result pages
HtmlPaginator paginator = entities.getPaginator();
paginator.setNextPage().match("a").id("pagnNextLink")
        .classes("pagnNext").getAttribute("href");

// Collect rows from up to 2 pages after the first search results page.
paginator.setFollowCount(2);

// Print out the details of the request to be made to the next page.
// You can modify the request at will if you need.
paginator.setPaginationHandler(new NextInputHandler<HtmlPaginationContext>() {
    @Override
    public void prepareNextCall(HtmlPaginationContext remoteContext) {
        //this is the request ready to go to the next page.
        UrlReaderProvider next = remoteContext.getNextRequest();

        //the context object has additional information.
        int pageNumber = remoteContext.getPageCount() + 1;

        println("Going to page " + pageNumber + ": " + next.getUrl());
        println("Headers: " + next.getRequest().getHeaders());
        println();
    }
});

// creates a new request
UrlReaderProvider url = new UrlReaderProvider("http://localhost:8086/s/field-keywords=mouse");

// configure the request
url.getRequest().setUserAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/20100101 Firefox/49.0");
url.getRequest().setHeader("Accept-Language", "en-US,en;q=0.5");

// run the parser against the first page. The paginator will kick in and get the results of another 2 result pages.
new HtmlParser(entities).parse(url);

The code in this example runs against a mock server on localhost, so the code inside the pagination handler produces:

Going to page 2: http://localhost:8086/s/ref=sr_pg_2/162-3301318-6508331?rh=i%3Aaps%2Ck%3Amouse&page=2&keywords=mouse&ie=UTF8&qid=1478752936&spIA=B00BIFNTMC,B00Y20UI1K,B01B1QBK78,B01MA15TCE
Headers: {Accept-Encoding=gzip, User-Agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/20100101 Firefox/49.0, Accept-Language=en-US,en;q=0.5}

Going to page 3: http://localhost:8086/s/ref=sr_pg_3?rh=i%3Aaps%2Ck%3Amouse&page=3&keywords=mouse&ie=UTF8&qid=1478752938&spIA=B01M22S0SE,B01M6XU1M5,B00E290JRE,B01IMYD5JI,B00BIFNTMC,B00Y20UI1K,B01B1QBK78,B01MA15TCE
Headers: {Accept-Encoding=gzip, User-Agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:49.0) Gecko/20100101 Firefox/49.0, Accept-Language=en-US,en;q=0.5}

Notice that the headers of the initial request are still used on the subsequent requests for the next page of results. You can add/remove headers or generate a different request altogether from within a NextInputHandler.

URL based pagination

Another common implementation of pagination is having the page number as part of the URL, for example:

https://www.somewebsite.com/sch/i.html?_from=R40&_sacat=0&_nkw=memory+stick&_pgn=1&_skc=50&rt=nc
                                                           ^^^^              ^^^^

Here _pgn is the page number and _nkw holds the keywords for a search.

You can use the pagination handler to update the _pgn parameter of the UrlReaderProvider. This is very easy to do with the support for parameterized URLs in the UrlReaderProvider. We can replace the values “1” and “memory+stick” in the original URL with parameter names. The URL could be written as:

https://www.somewebsite.com/sch/i.html?_from=R40&_sacat=0&_nkw={search_key}&_pgn={page_number}&_skc=50&rt=nc
                                                          ^^^^              ^^^^

With the search_key and page_number parameters defined, we can process the pagination like this:

HtmlPaginator paginator = entityList.getPaginator();

// Let's control the pagination ourselves
paginator.setPaginationHandler(new NextInputHandler<HtmlPaginationContext>() {
    @Override
    public void prepareNextCall(HtmlPaginationContext paginationContext) {
        UrlReaderProvider nextPage = paginationContext.getNextRequest();

        int nextPageNumber = paginationContext.getPageCount() + 1;

        if (nextPageNumber <= 5) {
            //update the page number parameter.
            nextPage.getRequest().setUrlParameter("page_number", nextPageNumber);

            //print the next URL
            println("Next page: " + nextPage.getRequest().getUrl());
        } else {
            println("Request not modified, pagination will stop: " + nextPage.getRequest().getUrl());
        }

    }
});

// This is our initial URL (with parameters)
String url = "http://localhost:8086/sch/i.html?_from=R40&_sacat=0&_nkw={search_key}&_pgn={page_number}&_skc=50&rt=nc";
UrlReaderProvider input = new UrlReaderProvider(url);

// Set the page_number
input.getRequest().setUrlParameter("page_number", 1);

// Set the search_key
input.getRequest().setUrlParameter("search_key", "memory stick");

// Run the parser.
new HtmlParser(entityList).parse(input);

When executed, the pagination handler will print out the following:

Next page: http://localhost:8086/sch/i.html?_from=R40&_sacat=0&_nkw=memory+stick&_pgn=2&_skc=50&rt=nc
Next page: http://localhost:8086/sch/i.html?_from=R40&_sacat=0&_nkw=memory+stick&_pgn=3&_skc=50&rt=nc
Next page: http://localhost:8086/sch/i.html?_from=R40&_sacat=0&_nkw=memory+stick&_pgn=4&_skc=50&rt=nc
Next page: http://localhost:8086/sch/i.html?_from=R40&_sacat=0&_nkw=memory+stick&_pgn=5&_skc=50&rt=nc
Request not modified, pagination will stop: http://localhost:8086/sch/i.html?_from=R40&_sacat=0&_nkw=memory+stick&_pgn=5&_skc=50&rt=nc

When the parser collected the results of the first page it called the NextInputHandler for the first time. The URL then got updated from _pgn=1 to _pgn=2. After the handler returned, the paginator “saw” the request was modified and executed it. Once that page of results got processed, it called the handler again. This process continued until the request didn’t change (_pgn was set to 5 and from there it never changed) so the pagination process stopped.

REMEMBER: The paginator will keep running for as long as the call to the pagination handler modifies the request. Keep this in mind to prevent running infinite pagination loops. You can also use paginator.setFollowCount(); to make the paginator stop.
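
The stop condition described above can be reproduced outside the parser with a few lines of plain Java. This is only a sketch under stated assumptions: UrlTemplateSketch, expand and paginate are hypothetical names, the placeholder substitution is a simple string replacement rather than the parser's own implementation, and "the request was modified" is approximated by comparing the expanded URLs.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class UrlTemplateSketch {

    // Replaces each {name} placeholder with its URL-encoded value.
    static String expand(String template, Map<String, Object> parameters) {
        String url = template;
        for (Map.Entry<String, Object> e : parameters.entrySet()) {
            String value = URLEncoder.encode(String.valueOf(e.getValue()), StandardCharsets.UTF_8);
            url = url.replace("{" + e.getKey() + "}", value);
        }
        return url;
    }

    // Produces the URLs that would be requested, stopping as soon as the
    // "handler" step leaves the URL unchanged.
    static List<String> paginate(String template, String searchKey, int lastPage) {
        Map<String, Object> parameters = new LinkedHashMap<>();
        parameters.put("search_key", searchKey);
        parameters.put("page_number", 1);

        List<String> requested = new ArrayList<>();
        String current = expand(template, parameters);
        requested.add(current);

        while (true) {
            int nextPageNumber = requested.size() + 1;
            if (nextPageNumber <= lastPage) {
                parameters.put("page_number", nextPageNumber); // the handler updates the parameter
            }
            String next = expand(template, parameters);
            if (next.equals(current)) {
                break; // request not modified: pagination stops
            }
            requested.add(next);
            current = next;
        }
        return requested;
    }

    public static void main(String[] args) {
        String template = "http://localhost:8086/sch/i.html?_from=R40&_sacat=0"
                + "&_nkw={search_key}&_pgn={page_number}&_skc=50&rt=nc";
        paginate(template, "memory stick", 5).forEach(System.out::println);
    }
}
```

If the condition that guards the update never becomes false, the URL changes on every round and the loop never ends, which is exactly the infinite pagination the note above warns about.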

POST based pagination

Some websites implement pagination using POST requests. Most of these contain a <form> in the HTML, and the request for the next page depends on collecting all relevant values from that form and submitting them using the POST HTTP method. The URL rarely changes: only the form gets updated with new values after each request for the next page.

Getting this sort of pagination working via code can be tricky, especially when the website relies on JavaScript to add or remove data parameters from the POST body dynamically. The way to get it working is to always use a browser to inspect the network activity and discover which parameters are being sent in the POST requests.

Handling pagination implemented with ASP.NET

The following example demonstrates how to work with pagination of a website built with ASP.NET. This tutorial should help you to get started but every website will work differently. You’ll have to figure out how to generate proper POST requests by trial and error.

Class TutorialAspNet has the full implementation which you can use and adapt for your case.

Finding the form

Usually the pagination is controlled using values in a <form> element. One easy approach to get all values of the form is to simply access the target URL with a GET request and extract the form values. Let’s assume you want to get to http://somePageThatUsesAsp.com/frmSearch.aspx.

That page shows a search form. In the browser, the “Alabama” state has been selected. Let’s try to perform the same search using code:

UrlReaderProvider firstAccess = new UrlReaderProvider("http://somePageThatUsesAsp.com/frmSearch.aspx");

// Open the URL and parse the web page into an HTML structure.
HtmlElement pageRoot = HtmlParser.parseTree(firstAccess);

// locate the form
HtmlElement form = pageRoot.query().match("form").id("aspnetForm").getElement();

// get the values in the form
Map<String, String[]> data = form.inputValues();

Time to stop and verify the data parameters and their values. Printing the data map above we get:

__EVENTTARGET = 
__EVENTARGUMENT = 
__VIEWSTATE = /wEPDwUKMTU0OTkzNjExNg8WBB4JU29ydE9yZ ... a very long string
__VIEWSTATEGENERATOR = 32423F7A
ctl00$ContentPlaceHolder1$btnAccept = Ok
ctl00$ContentPlaceHolder1$search = rdbCityState
ctl00$ContentPlaceHolder1$txtCity = 
ctl00$ContentPlaceHolder1$drpState = 
ctl00$ContentPlaceHolder1$txtZip = 
ctl00$ContentPlaceHolder1$drpRadius = 1
ctl00$ContentPlaceHolder1$drpBuilingType = 
ctl00$ContentPlaceHolder1$drpCountry = 
ctl00$ContentPlaceHolder1$dpdCandaStates = 
ctl00$ContentPlaceHolder1$txtFirmname = 
ctl00$ContentPlaceHolder1$hdnTabShow = 0
ctl00$ContentPlaceHolder1$hdnTotalRows = 

Going back to the browser and inspecting the request it made, we see that we got almost everything, but some data parameters did not come from our form: they are introduced in the browser via JavaScript. Luckily it’s easy to add them in code:

data.put("__ASYNCPOST", new String[]{"true"});
data.put("ctl00$ScriptManager1", new String[]{"ctl00$ScriptManager1|ctl00$ContentPlaceHolder1$btnSearch"});
data.put("ctl00$ContentPlaceHolder1$btnSearch", new String[]{"Search"});

Lastly, we need to set the state parameter:

// choosing state "Alabama"
data.put("ctl00$ContentPlaceHolder1$drpState", new String[]{"AL"}); 

Finally, the data to be sent for the first request is ready and we can submit a POST request:

// Clone the request configuration, and prepare a POST request to get the first page of results
UrlReaderProvider search = firstAccess.clone();
HttpRequest request = search.getRequest();
request.setHttpMethodType(HttpMethodType.POST);

// Set the data parameters. Our request is ready.
request.setDataParameters(data);
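
When comparing these data parameters against what the browser sends, it helps to know how such a map ends up in the POST body. The sketch below assumes the form is submitted as application/x-www-form-urlencoded, which is the usual default for HTML forms but is not stated above; FormBodySketch and encode are hypothetical names and are not part of the parser.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class FormBodySketch {

    // Encodes the parameters as name=value pairs joined by '&'.
    // A name that has several values is repeated once per value.
    static String encode(Map<String, String[]> data) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String[]> e : data.entrySet()) {
            String name = URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8);
            for (String value : e.getValue()) {
                String encoded = URLEncoder.encode(value == null ? "" : value, StandardCharsets.UTF_8);
                body.add(name + "=" + encoded);
            }
        }
        return body.toString();
    }

    public static void main(String[] args) {
        // A few of the parameters from the form above, in insertion order.
        Map<String, String[]> data = new LinkedHashMap<>();
        data.put("__EVENTTARGET", new String[]{""});
        data.put("ctl00$ContentPlaceHolder1$drpState", new String[]{"AL"});
        data.put("__ASYNCPOST", new String[]{"true"});
        System.out.println(encode(data));
    }
}
```

Browser network tools often display the decoded names (with "$"), while the raw body carries them encoded (with "%24"), so make sure you compare like with like.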

The server will return a very interesting response: a pipe-separated string with updated form values and the search results in HTML. Here is a stripped down version of the full response:

25081|updatePanel|ctl00_ContentPlaceHolder1_pnlgrdSearchResult|
<div>
    <div style="font-weight: bold;">
        <span id="ctl00_ContentPlaceHolder1_lblRowCountMessage">1 - 20 of 195 Results</span></div>
    <input type="hidden" name="ctl00$ContentPlaceHolder1$hdnTotalRows" id="ctl00_ContentPlaceHolder1_hdnTotalRows" value="195" />
    <!-- Lots of html with results -->
</div>
<table>
    <tr>
        <td>
            <a disabled="disabled" class="dis_class" style="display:inline-block;width:50px;">&lt;&lt; first</a>
            <a disabled="disabled" class="dis_class" style="display:inline-block;width:50px;">&lt; prev</a>
            <a class="LinkPaging" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdSearchResult$ctl23$ctl02','')" style="display:inline-block;background-color:#E2E2E2;width:20px;">1</a>
            <a class="LinkPaging" href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdSearchResult$ctl23$ctl03','')" style="display:inline-block;width:20px;">2</a>
            <!-- more links for the remaining pages -->
        </td>
    </tr>
</table>
|0|hiddenField|__EVENTTARGET||0|hiddenField|__EVENTARGUMENT||343908|hiddenField|__VIEWSTATE|/wEPDwU... a very long string ...1Pni
|8|hiddenField|__VIEWSTATEGENERATOR|32423F7A|121|asyncPostBackControlIDs||ctl00$ContentPlaceHolder1$btnSearch,ctl00$ContentPlaceHolder1$btnfrmSearch,ctl00$ContentPlaceHolder1$tmrLoadSearchResults|0|

The results can be easily parsed. Here is the code to read company names and cities:

// Let's create our "companies" entity:
HtmlEntitySettings companies = entityList.configureEntity("companies");

// The results come from a table with an ID like "ctl00_ContentPlaceHolder1_grdSearchResult"
// Let's create a partial path that matches this table.
PartialPath table = companies.newPath().match("table").id("ctl00_*_grdSearchResult"); // notice the wildcard

// Names are in <a> elements with crazy IDs such as "ctl00_ContentPlaceHolder1_grdSearchResult_ctl02_hpFirmName"
// We use a wildcard to match any <a> in the table whose ID ends with "_hpFirmName"
table.addField("Name").match("a").id("*_hpFirmName").getText();

// Same story for city names
table.addField("City").match("span").id("*_lblCity").getText();

The results of this first page can be collected with:

HtmlParserResult result = new HtmlParser(entityList).parse(search).get("companies");

Finally, we can focus on making the pagination work.

Running the pagination

Looking at the HTML of the search results, it’s easy to identify that the current page button is highlighted. To go to the next page, we need to reproduce what happens in the browser when the next page button is clicked.

The highlighted page has a style attribute setting background-color:#E2E2E2. Each page button has a href attribute that looks like javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdSearchResult$ctl23$ctl03',''). We need to get the string ctl00$ContentPlaceHolder1$grdSearchResult$ctl23$ctl03 from that attribute to build the POST request for the next page.
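
As a stand-alone illustration of that extraction step, the following sketch pulls the first quoted argument out of such an href with a regular expression. PostBackTargetSketch and eventTarget are hypothetical names; the tutorial code uses a substringBetween helper instead, whose source is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PostBackTargetSketch {

    // Captures the first single-quoted argument of a __doPostBack(...) call.
    private static final Pattern POST_BACK = Pattern.compile("__doPostBack\\('([^']*)'");

    // Returns the event target, or null when the href is not a __doPostBack link.
    static String eventTarget(String href) {
        if (href == null) {
            return null;
        }
        Matcher m = POST_BACK.matcher(href);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String href = "javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdSearchResult$ctl23$ctl03','')";
        System.out.println(eventTarget(href));
    }
}
```

Anchoring the pattern on __doPostBack( makes the sketch return null for ordinary links, which is a convenient signal that there is no further page to request.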

We can add a field to the paginator to collect all this for us:

HtmlPaginator paginator = entityList.getPaginator();

paginator.addField("nextPageTarget") // we use this custom paginator field in the pagination handler
        .match("tr").classes("footer_grid") // looks only at <a> elements of the pager row
        .match("a").attribute("style", "*background-color:#E2E2E2*") // finds the highlighted page - wildcards help a lot.
        .matchFirst("a").classes("LinkPaging") // matches the link that comes after
        .getAttribute("href") // the href looks like javascript:__doPostBack('ctl00$ContentPlaceHolder1$grdSearchResult$ctl23$ctl03','')
        .transform(target -> substringBetween(target, "'", "'")); // we want to get "ctl00$ContentPlaceHolder1$grdSearchResult$ctl23$ctl03"

Now we need to define a pagination handler to configure the POST request.

paginator.setPaginationHandler(new NextInputHandler<HtmlPaginationContext>() {
        @Override
        public void prepareNextCall(HtmlPaginationContext pagination) {

            String nextPageTarget = pagination.readField("nextPageTarget");
            if(nextPageTarget == null){ // after going through all result pages, our custom "nextPageTarget" field will be null
                pagination.stop(); //so we stop the pagination
                return;
            }

            //The following code simply produces the same request that your browser would with javascript.
            String ajaxResponse = pagination.getCurrentResponse().getContent();

            // From that pipe separated string, we need to get the updated value for "__VIEWSTATE"
            String viewState = substringBetween(ajaxResponse, "__VIEWSTATE|", "|");

            // In our request for the next page, we must send the updated __VIEWSTATE
            HttpRequest request = pagination.getNextRequest().getRequest();
            request.setDataParameter("__VIEWSTATE", viewState);

            // The value collected by our field "nextPageTarget" is also required to build the next page request.
            request.setDataParameter("__EVENTTARGET", nextPageTarget);
            request.setDataParameter("ctl00$ScriptManager1", "ctl00$ContentPlaceHolder1$pnlgrdSearchResult|" + nextPageTarget);

            // These parameters must be removed from the POST request or else the server will return the first page again.
            request.removeDataParameter("ctl00$ContentPlaceHolder1$btnAccept");
            request.removeDataParameter("ctl00$ContentPlaceHolder1$btnfrmSearch");
            request.removeDataParameter("ctl00$ContentPlaceHolder1$btnSearch");

            // Compare the request made via code against what your browser sends. The body of the POST request must
            // have the same keys and values.
            System.out.println(request.printDetails());
        }
    });

To be able to generate the correct request for the next page of results, the pipe-separated response received from the server must be parsed in order to obtain the updated value for __VIEWSTATE. Calling pagination.getCurrentResponse().getContent(); will return the body of the response.

In addition, pagination.getNextRequest() will return a pre-configured POST request based on the previous response. The data parameters of the previous form will still be there, and any cookies returned in the response will be set on this next request object. All we have to do is update the necessary parameters and remove what’s not needed.
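
In the sample response shown earlier, the number in front of each record matches the length of its content (8 for 32423F7A, for instance), which suggests a length|type|id|content| layout. Assuming that layout holds, the hidden fields can also be read by length rather than by searching for the next pipe, which keeps working when the content itself contains a pipe. DeltaResponseSketch and hiddenFields are hypothetical names, and the response in main is made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class DeltaResponseSketch {

    // Reads records of the form length|type|id|content| and returns the content
    // of every "hiddenField" record, keyed by its id.
    static Map<String, String> hiddenFields(String response) {
        Map<String, String> fields = new LinkedHashMap<>();
        int pos = 0;
        while (pos < response.length()) {
            int lengthEnd = response.indexOf('|', pos);
            if (lengthEnd < 0) break;
            int typeEnd = response.indexOf('|', lengthEnd + 1);
            if (typeEnd < 0) break;
            int idEnd = response.indexOf('|', typeEnd + 1);
            if (idEnd < 0) break;

            int length;
            try {
                length = Integer.parseInt(response.substring(pos, lengthEnd).trim());
            } catch (NumberFormatException e) {
                break; // not the layout this sketch assumes
            }
            int contentStart = idEnd + 1;
            int contentEnd = contentStart + length;
            if (length < 0 || contentEnd > response.length()) break;

            String type = response.substring(lengthEnd + 1, typeEnd);
            String id = response.substring(typeEnd + 1, idEnd);
            if ("hiddenField".equals(type)) {
                fields.put(id, response.substring(contentStart, contentEnd));
            }
            pos = contentEnd + 1; // skip the '|' that closes the record
        }
        return fields;
    }

    public static void main(String[] args) {
        // Shortened, made-up response in the same layout; note the '|' inside the panel content.
        String response = "10|updatePanel|pnl|<b>a|b</b>|0|hiddenField|__EVENTTARGET||"
                + "7|hiddenField|__VIEWSTATE|/wEPDwU|";
        System.out.println(hiddenFields(response).get("__VIEWSTATE"));
    }
}
```

For a value such as __VIEWSTATE, which normally contains no pipe characters, searching between "__VIEWSTATE|" and the next "|" as in the handler above gives the same result with less code.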

With the code introduced above, the program will now be able to obtain the results and run the pagination without a hitch.

The results should look like:

Name_________________________________________City___________________________
Formworks Architects, Inc.                   Birmingham, AL 35233-3503      
Godwin Jones Architecture & Interior Design  Montgomery, AL 36117-3599      
Revolutionary Architecture                   Birmingham, AL 35243-2547      
Rob Walker Architects, LLC                   Birmingham, AL 35233-2317      
TAG/The Architects Group Inc                 Mobile, AL 36609-5402          
... and many more

Phew! We hope this example gives you a better understanding of what’s involved in processing paginated results where POST requests are required. The solution is always specific to the website, and getting things working takes some patience and trial and error. The key is to make sure every request your program generates has the same parameters a browser would send. Use request.printDetails() to compare your requests against what your browser produces.

Further reading

That should have covered everything you need to be able to process paginated results. Let us know if you think anything was left missing.


If you find a bug

We take errors very seriously and stop the world to fix bugs in less than 24 hours whenever possible. It’s rare to have known issues dangling around for longer than that. A new SNAPSHOT build will be generated so you (and anyone affected by the bug) can proceed with your work as soon as the adjustments are made.

If you find a bug don’t hesitate to report an issue here. You can also submit feature requests or any other improvements there.

We are happy to help if you have any questions about how to use the parser for your specific use case. Just send us an e-mail with the details and we’ll reply as soon as humanly possible.

We can work for you

If you don’t have the resources or don’t really want to waste time coding we can build a custom solution for you using our products. We deliver quickly as we know the ins and outs of everything we are dealing with. Send us an e-mail to sales@univocity.com with your requirements and we’ll be happy to assist.

The univocity team.

www.univocity.com