How can I avoid getting my IP address blocked?


#1

I am trying to get data from a couple of sites. One site that I want to get data from has about 3,000 pages of relevant data, mostly in the form of lists. I know the URLs, so I thought I would run a bulk extraction with either the legacy extractor software or the new extractor.

My biggest question is: are the extractors for the legacy system and the new system run on import.io servers, or are they run locally? And how do sites see the extractors? Do they see them as browsers, as data scrapers, or as something else?

Should I break up the data collection over several days, weeks, etc. to avoid getting either the import.io servers or my own IP address blocked?

I also have another site that I want to extract data from, and I know the URLs. The data is on about 70,000+ pages. How can I do it so that neither the import.io servers nor my own IP address gets blocked? Again, should I spread the extraction across multiple days, weeks, or months?

Any help would be appreciated. Thanks!


#2

I am also interested in knowing the answer to this.


#3

Hi Guys,

On the Web Platform, there is a rotating IP pool - queries are not run on your local machine.

You shouldn’t need to split your queries up - we’ll handle the rate limiting for you - we wouldn’t want to bring a website down or get you blocked! :slight_smile:
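For anyone curious what rate limiting looks like in general, here is a minimal Python sketch of the idea. It is not how the import.io platform does it internally, and it is only relevant if you were fetching pages with your own script. The names `throttled` and `estimate_hours` and the 2-second interval are made up for illustration; a polite interval depends on the site.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


def throttled(
    urls: Iterable[str],
    min_interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[str]:
    """Yield URLs no faster than one per `min_interval` seconds.

    `clock` and `sleep` are injectable so the pacing can be tested
    without actually waiting.
    """
    last: Optional[float] = None
    for url in urls:
        if last is not None:
            # Wait out whatever remains of the interval since the last URL.
            wait = min_interval - (clock() - last)
            if wait > 0:
                sleep(wait)
        last = clock()
        yield url


def estimate_hours(pages: int, min_interval: float) -> float:
    """Rough lower bound on total run time, ignoring download time."""
    return pages * min_interval / 3600
```

You would loop with `for url in throttled(my_urls, 2.0):` and fetch each page inside the loop. At one request every 2 seconds (an assumed figure), `estimate_hours(70000, 2.0)` comes out to roughly 39 hours, and 3,000 pages to under 2 hours, which gives a feel for whether a job needs to be spread over days.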

To do that kind of volume you will need a premium account. You can sign up here: https://www.import.io/standard-plans/

Alex