I am trying to get data from a couple of sites. One site that I want to get data from has about 3,000 pages of relevant data, and the data on these pages is mostly lists. I know the URLs, so I thought I would run a bulk extract with either the legacy extractor software or the new extractor.
My biggest question is: do the extractors for the legacy system and the new system run on import.io servers, or do they run locally? And how do sites see the extractors: as browsers, as data scrapers, or as something else?
Should I break up the data collection over several days or weeks to avoid getting either the import.io servers or my own IP address blocked?
I also have another site that I want to extract data from, and I know the URLs there too. The data is on 70,000+ pages. How can I do it so that neither the import.io servers nor my own IP address gets blocked? Again, should I spread the extraction across multiple days, weeks, or months?
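To give a sense of the scale I am asking about, here is a rough Python sketch of how long each job would take if pages were fetched one at a time with a fixed pause between requests. The delay values (5 and 30 seconds) and the function name `crawl_duration_days` are my own assumptions for illustration; I do not know what rate import.io actually uses or what either site would tolerate.

```python
SECONDS_PER_DAY = 86_400


def crawl_duration_days(pages: int, delay_seconds: float) -> float:
    """Days needed to fetch `pages` URLs sequentially with a fixed pause between requests."""
    return pages * delay_seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Page counts are from my two sites; the delays are assumed, not recommended values.
    for pages in (3_000, 70_000):
        for delay in (5, 30):
            days = crawl_duration_days(pages, delay)
            print(f"{pages:>6} pages, {delay:>2}s between requests: {days:.1f} days")
```

This only accounts for the pauses, not for page load or extraction time, so the real durations would be somewhat longer.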
Any help would be appreciated. Thanks!