Forge requests with XHR


Sometimes there is data on a website that you are unable to scrape: sending a request with the URL of the page triggers a response that does not contain that data. The reason is that additional requests need to be sent to fetch it, which is what a browser would automatically do for you; the result would be a page enriched with data assembled from secondary requests. Those requests are typically forged by JavaScript code making use of XMLHttpRequest (XHR), an API that lets a page exchange data with a server without reloading, and a cornerstone of interactive websites.

Let’s assume that you do not use a headless browser here; what you want is a simple solution that does not involve running any JavaScript. How could you simulate the client-side code responsible for those XHR requests without diving into a heavy study of JavaScript? Well, what about forgetting about JavaScript and doing some reverse engineering instead? Let’s forge requests with XHR!

After all, any request sent to a web server is an HTTP request, so it’s only a matter of knowing what the different parts of a request should be. This is easily done with the developer tools of a browser, where you can study how client and server talk to each other. The crawling strategy then becomes a mix of page requests and XHR requests.

The immediate benefit of going that route is that only the requests that matter are sent; the rest can be ignored. There is no need to worry about retrieving CSS or executing blocks of JavaScript: this method goes straight to the point. Another advantage is that the program can use the same technical environment as for page requests. Keep using your libraries or your framework of choice and you should be all set!

Let’s see XHR in action through an example. As of 2020, when visiting an article on Wikipedia, mousing over any link to another article displays a snippet of that other article without leaving the current page. The text in the snippet is actually the first paragraph of the article. My goal is to download all the snippets that can appear on the page of the article about XHR, because why not. One obvious way to proceed is to visit every linked article and store the first paragraph for each one of them. While it sure can work, it does not take advantage of the snippet feature, which, you guessed it, makes use of XHR.

A quick look at the network tab in the developer tools teaches us that a request with the following characteristics is sent each time we mouse over an article link:

  • Method: GET
  • URL: https://en.wikipedia.org/api/rest_v1/page/summary/HTML [example]
  • Specific header: X-Requested-With: XMLHttpRequest

It turns out that the last header, commonly used for XHR, is not needed here, as the Wikipedia server will not mind if we do not add it; I will therefore ignore it for the sake of simplicity.

I have to discard any link that does not generate a snippet, which in particular includes links to the current page, the home page, and the meta pages. The corresponding response is in JSON format: the snippet text is found under the key extract, and the title of the article under the key title.
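To make this concrete, here is a minimal sketch of that XHR request, using the requests library rather than Scrapy. The endpoint and the extract and title keys come straight from the observations above; the function name and the error handling are my own additions.

```python
import requests

def fetch_snippet(page_name):
    """Forge the snippet XHR request for one article; return (title, extract)."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/summary/{page_name}"
    # As noted above, the X-Requested-With header can safely be omitted.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    data = response.json()
    return data["title"], data["extract"]

print(fetch_snippet("HTML"))
```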

My program uses Scrapy’s API to instantiate a scraping engine. Scrapy is a popular web scraping framework written in Python, and I use it in all my projects. I make sure to have only one concurrent request at any time and set the download delay between requests to 1 second, which is fast enough. Wikipedia makes sure, for the most part, that there are no duplicate links in their articles, but Scrapy is set to ignore duplicate requests by default anyway.
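In Scrapy terms, that throttling maps onto two settings; the names below are actual Scrapy settings, and the values are the ones just described:

```python
# Throttling as described above. Duplicate-request filtering needs no
# setting: Scrapy's default dupefilter (RFPDupeFilter) already handles it.
custom_settings = {
    "CONCURRENT_REQUESTS": 1,  # only one request in flight at any time
    "DOWNLOAD_DELAY": 1,       # one second between requests
}
```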

I first hit the URL https://en.wikipedia.org/wiki/XMLHttpRequest, and then proceed to find all relevant article links in the page. For each of these links I find the page name thanks to this simple regex: r'^/wiki/(.+)', from which I create an XHR request as described above.
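Putting it all together, here is a sketch of what such a spider could look like. The spider name, the link-filtering heuristic (dropping the current page, the home page, and namespaced meta pages), and the output fields are my assumptions for illustration, not the actual code linked below.

```python
import re
import scrapy

class SnippetSpider(scrapy.Spider):
    name = "snippets"
    start_urls = ["https://en.wikipedia.org/wiki/XMLHttpRequest"]
    custom_settings = {
        "CONCURRENT_REQUESTS": 1,
        "DOWNLOAD_DELAY": 1,
    }

    def parse(self, response):
        for href in response.css("a::attr(href)").getall():
            match = re.match(r"^/wiki/(.+)", href)
            if match is None:
                continue
            page_name = match.group(1)
            # Discard links that yield no snippet: the current page, the
            # home page, and namespaced meta pages (Wikipedia:, Help:, ...).
            if page_name in ("XMLHttpRequest", "Main_Page") or ":" in page_name:
                continue
            yield scrapy.Request(
                f"https://en.wikipedia.org/api/rest_v1/page/summary/{page_name}",
                callback=self.parse_summary,
            )

    def parse_summary(self, response):
        data = response.json()
        yield {"title": data["title"], "extract": data["extract"]}
```

Running this with scrapy runspider would yield one item per snippet, with any duplicate summary requests silently dropped by the default dupefilter.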

The code can be found here.

Of course I did not need to rely on XHR for this particular task, as the same content is found at the top of every linked article. Replacing the XHR requests with page requests would have come at the cost of more parsing complexity though: parsing JSON is clearly easier than parsing HTML.
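For the sake of comparison, here is what the page-request alternative could look like. The XPath targeting the lead paragraph is an assumption about Wikipedia’s markup, and it is exactly the kind of fragile parsing that the JSON route avoids:

```python
import requests
from parsel import Selector  # the selection library used by Scrapy

# Hypothetical page-request alternative: fetch the whole article HTML and
# extract the first non-empty paragraph. The XPath below is a guess at
# Wikipedia's markup and may break if that markup changes.
html = requests.get("https://en.wikipedia.org/wiki/HTML", timeout=10).text
first_paragraph = (
    Selector(text=html)
    .xpath('(//div[contains(@class, "mw-parser-output")]/p[normalize-space()])[1]')
    .xpath("string()")
    .get()
)
print(first_paragraph)
```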

But in many cases there is no choice: the data is only accessible through XHR. We can take advantage of the fact that XHR URLs, which really are API entry points, are usually more stable over time than page URLs and therefore may need less reworking; the same goes for their respective contents. On the negative side, obfuscation might in some cases make it hard to figure out all the elements composing the XHR request. For instance, it might be difficult to trace back the origin of some obscure ID without first investigating some JavaScript, even if this is rare.

All in all, forging XHR requests ourselves is a very valuable tool to have in our toolset.