Extract all the text from multiple websites


#1

Hi,

I just want a very simple extractor that pulls all the readable text from a large number of websites (over 1000), but I have not been able to build one.
When I define the extractor, even if I remove all the extra bits and select only the text regions, it fails to get all the text when I run it on other websites.

Please let me know if there is a simple method to extract just the text elements in a single column.


#2

Hi,
You can use a generic XPath to scrape all the text from different websites, e.g. //text()
If you have any doubts, contact me at meetmearun21@gmil.com

Arun


#3

Thanks, yes, I ended up specifying XPaths too. I used the following XPath, which selects every text node except those inside script and style elements: //text()[not(parent::script) and not(parent::style)]
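
For anyone who wants to try the same filter outside the extractor tool, here is a rough sketch in Python. It does not evaluate the XPath itself; it approximates the same rule (keep text nodes, skip anything inside script or style) using the standard library's html.parser. The names VisibleTextExtractor and extract_text are made up for this example, and unlike a raw //text() it also drops whitespace-only nodes.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text that is not inside <script> or <style> (hypothetical helper)."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self.chunks = []       # non-empty text pieces, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Assumption: whitespace-only text is not wanted, so it is dropped here.
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the list of readable text pieces found in an HTML string."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style></head>"
        "<body><h1>Hello</h1><script>var x = 1;</script><p>World</p></body></html>"
    )
    # One text piece per line, which gives the single-column layout asked for above.
    print("\n".join(extract_text(sample)))
```

Downloading the 1000+ pages is not shown here; each page's HTML would be passed to extract_text as a string.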