BUG FIX: API call to CSV of the latest run



##Changes to Latest run CSV API.
28th July 2016 11:10 GMT

An update from @dominik.deren

We have identified an issue in our implementation of the link to the latest CSV file for runs. The issue manifested itself for CSV files larger than 6 MB, by presenting an error message:

[body size is too long]

It was caused by the limitations of the underlying service, which was not able to handle files above this limit.
We fixed that limitation, and now you can successfully download files without worrying about their size. But this fix required us to break the contract of this API. So if you are using the link

https://data.import.io/extractor/<EXTRACTOR_ID>/csv/latest?_apikey=<YOUR_API_KEY>

programmatically, you might want to read on to understand the changes.

First of all, note that the changes should be invisible if you are using this link in Google Spreadsheets or directly in the browser. Both of those environments handle the changes without any modifications on your end. Where you might experience issues is when you are using this link through tools like curl or httpie, or through other programmatic ways of processing this data.

Our modification changes the default response of this API from directly returning the requested file to returning a redirect to the file. In technical terms, it now returns an HTTP Status Code 302 (Found) rather than HTTP Status Code 200 (OK), and it includes an HTTP Location header whose value is the direct address of the file. Browsers and Google Spreadsheets follow such redirects by default, which is why this change is invisible there. Other tools do not, and you might need to tell your tool explicitly to follow them.
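If you want to confirm from your own code what the API now returns, a minimal Python sketch such as the one below may help. It sends a single request and deliberately does not follow the redirect, so the 302 status and the Location header stay visible. The function name fetch_without_following is hypothetical and only used for illustration; the sketch relies on the standard library's http.client, which never follows redirects on its own.

```python
import http.client
from urllib.parse import urlsplit


def fetch_without_following(url):
    """Send one GET request and return (status, location) without following redirects."""
    parts = urlsplit(url)
    # http.client never follows redirects, so a 302 stays visible to the caller.
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc)
    else:
        conn = http.client.HTTPConnection(parts.netloc)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body before closing the connection
        # The Location header is None when the response is not a redirect.
        return response.status, response.getheader("Location")
    finally:
        conn.close()
```

Called with the link above (with your own extractor ID and API key filled in), this should report status 302 together with the address of the file, rather than 200 and the CSV data.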

To show how the response from the API is changing, here are two example responses: one from before the change and one from after it.

BEFORE

$ curl -v https://data.import.io/extractor/<EXTRACTOR_ID>/csv/latest\?_apikey\=<MY_API_KEY>
*   Trying 52.200.101.104...
* Connected to data.import.io (52.200.101.104) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.import.io
* Server certificate: COMODO RSA Domain Validation Secure Server CA
* Server certificate: COMODO RSA Certification Authority
* Server certificate: AddTrust External CA Root
> GET /extractor/<EXTRACTOR_ID>/csv/latest?_apikey=<MY_API_KEY> HTTP/1.1
> Host: data.import.io
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 200 OK
< Access-Control-Allow-Credentials: true
< Content-Disposition: attachment; filename=latest.csv
< Content-Type: text/csv
< Date: Thu, 28 Jul 2016 08:39:35 GMT
< Server: openresty/1.9.7.3
< Vary: Accept-Encoding, Origin
< Via: 1.1 3711209710286c6c954f8418c2d0a852.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: gXZJPtymDQOrnNPlgQr-7Eyxim_RGJnWFDaloTKwm8VK0BIksRl1og==
< x-amzn-RequestId: cfe3af3b-549e-11e6-84a5-d144af3a6a46
< X-Cache: Miss from cloudfront
< Content-Length: 13969
< Connection: keep-alive
<
"Productpadding link","Productpadding link_link","Prodimg image","Prodimg image_link","Prodname value","Prodname value_link","Proddesc value","Proddesc value_link","Prodprice price","Prodprice price_link","Proddimension value 1","Moreoptions link","Moreoptions link_link","Addtobasket label","Addtobasket label_link","Savetolist label","Savetolist label_link"
"TORBJÖRN
Swivel chair
£30
More data follows...

AFTER

$ curl -v https://data.import.io/extractor/<EXTRACTOR_ID>/csv/latest\?_apikey\=<MY_API_KEY>
*   Trying 52.87.149.228...
* Connected to data.import.io (52.87.149.228) port 443 (#0)
* TLS 1.2 connection using TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256
* Server certificate: *.import.io
* Server certificate: Amazon
* Server certificate: Amazon Root CA 1
* Server certificate: Starfield Services Root Certificate Authority - G2
> GET /extractor/<EXTRACTOR_ID>/csv/latest?_apikey=<MY_API_KEY> HTTP/1.1
> Host: data.import.io
> User-Agent: curl/7.43.0
> Accept: */*
>
< HTTP/1.1 302 Found
< Access-Control-Allow-Credentials: true
< Content-Type: application/json
< Date: Thu, 28 Jul 2016 08:41:35 GMT
< Location: https://store.import.io/store/crawlRun/<RUN_ID>/_attachment/csv/<RUN_ATTACHMENT_ID>?_apikey=<MY_API_KEY>
< Server: openresty/1.9.7.3
< Vary: Accept-Encoding, Origin
< Via: 1.1 36e16637a2b5592f1b01e48a4949ddd6.cloudfront.net (CloudFront)
< X-Amz-Cf-Id: LKBz7KISYypAp5sqvXUfWuft1PUn5miW1SDnTY8QqvozeHaIaPyVwg==
< x-amzn-RequestId: 1784f008-549f-11e6-ab78-afbed84cc28b
< X-Cache: Miss from cloudfront
< Content-Length: 318
< Connection: keep-alive
<
* Connection #0 to host data.import.io left intact
{"location":"https://store.import.io/store/crawlRun/<RUN_ID>/_attachment/csv/<RUN_ATTACHMENT_ID>?_apikey=<MY_API_KEY>"}%

As you can see, the current response doesn’t return the data itself; it returns only a link to the data file. The link is available both in the Location: <FILE_ADDRESS> HTTP header and in the response body as a JSON object {"location": "<FILE_ADDRESS>"}.
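If your client cannot follow redirects on its own, you can read the file address yourself from either place. The Python sketch below is one possible way to do that, assuming you already have the response headers as a mapping and the raw response body; the function name extract_file_address is hypothetical. It prefers the Location header and falls back to the JSON body.

```python
import json


def extract_file_address(headers, body):
    """Return the direct file address from a 302 response, or None if absent.

    `headers` maps header names to values; `body` is the raw response body
    (bytes or str). The Location header is preferred, with the JSON body
    {"location": "<FILE_ADDRESS>"} used as a fallback.
    """
    # HTTP header names are case-insensitive, so compare in lower case.
    for name, value in headers.items():
        if name.lower() == "location":
            return value
    if isinstance(body, bytes):
        body = body.decode("utf-8")
    try:
        return json.loads(body)["location"]
    except (ValueError, KeyError, TypeError):
        # Body was empty, not JSON, or did not contain a "location" key.
        return None
```

The returned address can then be requested directly to download the CSV file.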
Most tools have a special parameter that makes them follow such redirects automatically. To name a few:

CURL

For curl to automatically download the file as before, you have to add the -L -H "Accept-Encoding: gzip" --compressed parameters. -L tells curl to follow redirects automatically, using the Location header, and -H adds an additional request header telling the server that you can accept a gzipped response. Lastly, --compressed tells curl to decompress the response so that you can see the data. The full command would look like this:

curl -v -L -H "Accept-Encoding: gzip" --compressed https://data.import.io/extractor/<EXTRACTOR_ID>/csv/latest\?_apikey\=<YOUR_API_KEY> > data.csv

This will request the file, follow the redirect, tell the server that we can accept a gzipped response, decompress it on arrival, and save the data to the file data.csv.
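For those processing the data from a script rather than from the command line, the same steps can be sketched in Python with only the standard library. This is an illustration under assumptions, not an official client: the function name download_latest_csv is hypothetical, and the sketch assumes the server marks a gzipped response with a Content-Encoding: gzip header. urllib.request follows 302 redirects by default, which plays the role of curl's -L.

```python
import gzip
import shutil
import urllib.request


def download_latest_csv(url, destination):
    """Download the CSV behind `url` to `destination`, following redirects."""
    # Like -H "Accept-Encoding: gzip": tell the server gzip is acceptable.
    request = urllib.request.Request(url, headers={"Accept-Encoding": "gzip"})
    # urlopen follows the 302 redirect automatically, like curl -L.
    with urllib.request.urlopen(request) as response:
        stream = response
        # Like --compressed: decompress only if the server really sent gzip.
        encoding = response.headers.get("Content-Encoding", "")
        if encoding.lower() == "gzip":
            stream = gzip.GzipFile(fileobj=response, mode="rb")
        with open(destination, "wb") as out:
            shutil.copyfileobj(stream, out)
```

Calling download_latest_csv with the link above (with your own extractor ID and API key filled in) and "data.csv" as the destination should leave the decompressed CSV in data.csv, much like the curl command.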

HTTPIE

Httpie is a tool that does a lot of heavy lifting for you by default. If we want to do the same operation with it, we only need to add one parameter --follow, which tells it to start following the redirects. The full command will look like this:

http --follow https://data.import.io/extractor/<EXTRACTOR_ID>/csv/latest\?_apikey\=<YOUR_API_KEY> > data.csv

This will again request the file and follow the redirects. Httpie accepts gzipped responses by default and decompresses them on the fly. Finally, we save the response into the file data.csv.

