Lead Generation: Combining dealer data from two pages ("dig deeper")


#1

Hey import.io community!

I am very new to this and tried to generate a leads list for a project I am working on. I tried reading through the “Collect Data from airbnb listings and dig deep” topic, which seems very similar to what I want to achieve. Unfortunately the link describing the method is dead… hence my question.

First I made a list from Mobile.de. It shows the listings with basic info, but you have to click through to find the website information and the other offers the dealer has. I managed the first import through import.io, but it took a lot of time to manually attach the webpage links afterwards. I was hoping there is a better/faster way to do this.

This time I want to do it for chrono24.com.

First link shows all dealers:
http://www.chrono24.com/search/haendler.htm

Once you click one of the dealers you get more info:
http://www.chrono24.com/dealer/juwelierburger/index.htm

I want to combine the two so I can ideally see the dealer’s basic info, their offers, and their email address. If that is not possible, then at least the dealer info from the first long list page, plus the website link to their page taken from the “deep click”. All this with the least amount of manual work, as we’re looking at 6,000 dealers to attach a weblink to.

Hope someone can help me, I would greatly appreciate it!


#2

This is a classic use case for “chaining” extractors. First you need to build an extractor that gets all of the URLs from the first page.

For your first one, it’s a little tricky because you have multiple pages. This is easily solved, though. We’ll start with the first link, which is basically your list of dealers:

http://www.chrono24.com/search/haendler.htm

That link only shows 30 dealers per page, though. So scroll down to the bottom and click on “120”. Now you’ll see that your URL has changed to:
http://www.chrono24.com/search/haendler.htm?pageSize=120

This is good because it will use fewer queries to get the list you are looking for. Now we need to build a list of URLs that will show import.io all of the URLs you need to get data from. So let’s click on “Page 2” at the bottom and pay attention to how the URL changes:
http://www.chrono24.com/search/haendler.htm?pageSize=120&showpage=2

This one is pretty easy because you can see that the page number is shown at the very end of the URL as “2”. The subsequent pages will look like this:

http://www.chrono24.com/search/haendler.htm?pageSize=120&showpage=2
http://www.chrono24.com/search/haendler.htm?pageSize=120&showpage=3
http://www.chrono24.com/search/haendler.htm?pageSize=120&showpage=4
…and so on.

The next step is to figure out how many pages there are. We can see at the bottom of the page that there are 19 pages. So you need to build your extractor to have the 19 URLs to the different pages.
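As an aside, outside of import.io the same 19 page URLs could be generated with a few lines of script. This is just a sketch of the pattern described above (the base URL and page count come straight from this thread; the function name is my own):

```python
# Base listing URL with the larger page size, as discovered above.
BASE = "http://www.chrono24.com/search/haendler.htm?pageSize=120"

def page_urls(last_page):
    """Page 1 is the bare URL; pages 2..last_page append &showpage=N."""
    urls = [BASE]
    urls += [f"{BASE}&showpage={n}" for n in range(2, last_page + 1)]
    return urls

urls = page_urls(19)
print(len(urls))   # 19 URLs, one per result page
print(urls[1])     # http://www.chrono24.com/search/haendler.htm?pageSize=120&showpage=2
```

The URL generator built into import.io does exactly this kind of range expansion for you, as described in the next steps.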

So click “New Extractor” and paste this into the URL field: http://www.chrono24.com/search/haendler.htm?pageSize=120

Let import.io do some thinking and you’ll see the page come up. Now all you really need is the URL from the “Underline” column. This will give your second extractor all of the URLs it needs to pull the data you are looking for. We’re not done yet, though. Now we need to add the remaining URLs for all of the pages. Go ahead and click the down arrow next to “Save and Run” and click “Save”. We don’t want to run this extractor yet.

You will be brought back to your extractor details screen. import.io has a wonderful tool called the URL generator. We’ll use this to generate all of the links you need. First we need to give import.io a couple of URLs to use. Paste this link into the field where it says “Enter or Paste URL here…”: http://www.chrono24.com/search/haendler.htm?pageSize=120&showpage=2

Click “Save URLs.” Now click “Show URL Generator.” Click “Edit” and overwrite the URL in the list with: http://www.chrono24.com/search/haendler.htm?pageSize=120&showpage=2

import.io tries to figure out which parameter in that link controls the page number. It thinks the “120” is the page number, but it’s actually the “2” at the end. This is easy to fix. Click the ‘x’ next to the parameter 1 field, then simply double-click on the “2” at the end of the URL. You’ll be presented with some controls for creating the list of URLs. You already have your first and second page in the list, so you can set the range from 3 to 19 with a step of 1. Then click “Add to list.” Now click “Save URLs.”

Now you have an extractor that will build your list of URLs for you. Pretty slick.

Run this extractor. You can also click this link to get the extractor we just built:
https://dash.import.io/ecf87edc-2a4a-4384-96a0-67cf3e127243

Now we need to create the second extractor to go get all of the details you’re looking for. Click “New Extractor” and paste a URL from one of the dealers into the popup. Let’s start with this link: http://www.chrono24.com/dealer/juwelierburger/index.htm

Once the webpage comes up, click on the “Website” tab in the upper right-hand corner of the page. This lets us see what data we’re extracting. Click “Delete All Columns” to give us a fresh start. Now, this page is interesting because it has a “Show More” link at the bottom. This is annoying, but very easy to bypass: click the “Styles On” button in the upper left-hand corner of the screen. This turns off all of the fancy web styling that can mess up our extraction. Now create some columns and select the data you want to put in them. When you are finished, save the extractor.

Now we need to connect this extractor to the first extractor you created; this is called “chaining.”

You should be back in the extractor details screen now. Click on the dropdown that says “An explicit list of URLs” and select “URLs from another extractor.” In the “Search for extractor by name” field, type the name of the first extractor you created and select it. Now we need to tell it where it will find the URLs. You can see above that the first extractor we made had the link for the dealer information in the “Underline” column. So click on “URL Column” and select “Underline.”
You can see the extractor we just created by clicking on this link: https://dash.import.io/5872abdf-2f4c-4e50-bd75-168fd7eae694
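To make the “chaining” idea concrete outside the import.io UI: step one turns listing pages into a column of detail URLs, and step two runs over each of those URLs. A minimal sketch of that flow, using made-up stand-in data and function names rather than real import.io internals or live HTTP calls:

```python
def list_extractor(list_pages):
    """Step 1: turn listing-page records into dealer-detail URLs
    (this plays the role of the "Underline" column)."""
    urls = []
    for page in list_pages:
        urls.extend(page["dealer_links"])
    return urls

def detail_extractor(url, fetch):
    """Step 2: pull the fields you care about from one dealer page.
    `fetch` stands in for whatever retrieves and parses the page."""
    page = fetch(url)
    return {"url": url, "email": page.get("email"), "website": page.get("website")}

# Fake data standing in for real pages, just to show the chaining flow:
pages = [{"dealer_links": ["/dealer/a/index.htm", "/dealer/b/index.htm"]}]
fake_site = {
    "/dealer/a/index.htm": {"email": "a@example.com", "website": "a.example.com"},
    "/dealer/b/index.htm": {"email": "b@example.com", "website": "b.example.com"},
}
rows = [detail_extractor(u, fake_site.get) for u in list_extractor(pages)]
print(rows)  # one row of dealer details per URL from step 1
```

Chaining in import.io wires these two steps together for you: the second extractor’s URL list is simply the first extractor’s output column.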

That’s it! You’re ready to go. Go ahead and run this extractor and watch the magic happen. As it runs, you can click on the “eye” icon to preview the data you’re grabbing. Hope that helps!


#3

Hey Mark,

Thank you so much for the clear and elaborate description. I really appreciate you taking the time to help me out!

I have gone through the steps of your description. I know I could have used the links you set up for me, but by doing it myself I learn how to do it a bit better. The extractor is running as we speak. I will let you know what the result is tomorrow when I dig into the data it is extracting. So far everything seems fine.

Once more, a big fat THANK YOU! This saves me heaps of time and effort!


#4

Not a problem. I have had a lot of help from the tech guys at import so I’m just trying to pay it forward. Hopefully it all made sense.

Mark


#5

Your mini tutorial was great. The results do show up slightly messy for the second crawler. For some reason it took a few dealer links 2 or 3 times and spread several data points I wanted to collect over 3 lines in Excel. I am not sure what caused it to do this, and it seems very random. Some are perfectly aligned, others have 2 lines, others 3. But the result is workable, so I’m still happy.

Thanks again, and I hope you don’t mind me hitting you up if I get stuck in the future.


#6

Yeah, sometimes you have to play with it a little bit to get the formatting just right, but I usually take care of that in Excel. Happy I could help.
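For anyone who would rather script the Excel cleanup: if the dealer URL repeats on each fragment row (an assumption; check your export), the split rows can be merged by grouping on the URL and taking the first non-empty value per column. A rough sketch with hypothetical column names:

```python
def collapse(rows, key="url"):
    """Merge rows that share the same `key`, keeping the first
    non-empty value seen for every column."""
    merged = {}
    for row in rows:
        slot = merged.setdefault(row[key], {})
        for col, val in row.items():
            if val and not slot.get(col):
                slot[col] = val
    return list(merged.values())

# Two fragments of the same dealer, as they might appear in the export:
rows = [
    {"url": "d1", "name": "Juwelier", "email": ""},
    {"url": "d1", "name": "", "email": "x@example.com"},
]
print(collapse(rows))  # one merged row per dealer
```

This is only worth it if the hand-fixing in Excel gets tedious at 6,000 dealers.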

Mark


#7

Hi Mark, can we do something similar with the trial version, or is this possible only with licensed versions? I am trying to fetch a lot of data from Twitter, and I need to scroll down in Twitter to show all results on the page, but the extractor only pulls data as initially loaded on the page, i.e. based on some 30–40 tweets. There is no pagination in Twitter. And I am using a trial version.


#8

Hey Harmeet,

I commented on your other post too. I did the above tutorial, kindly written by Mark, on a trial version. The crawler blew past my trial limit by 1,700 queries (2,200 in total, and the limit is 500 a month). But as long as it is one crawler you set up, they still give you the full results. I think next month I am back to 500 fresh opportunities to scrape, but I might have to wait another 4 months; I just started using this, so I can’t tell you.

To answer your second half about the infinite scroll, please read this link:
https://guide.import.io/dealing-with-infinite-scroll.html

It is just in the normal “help section” of import.io, so it’s not very hard to find. Make sure you check there first next time, as they describe it pretty well. ALMOST as well as Mark does, haha.

Hope this helps!


#9

Hey Leadgen, thanks a ton for your timely response. My concern is with Twitter: I am still unable to find the page element in the Network or the Elements tab. Maybe it’s me, or these guys have hidden it really well. I was searching Twitter with a hashtag like “pepsi” and the results were going beyond 1 page.


#10

Hey Harmeet,

I don’t use Twitter, so I am not able to log in and see the Pepsi results you are aiming for. Hopefully someone else can get you the help you require. I will try to find something, and if I get something valuable I will let you know. If you don’t hear anything, I wasn’t able to find it.

Good luck with finding an answer.


#11

Thanks Leadgen, will post if I find something.


#12

Hi Guys!

Twitter is a toughie.

The infinite scroll on Twitter isn’t quite as friendly as on some other sites, where you are able to generate pagination through the methods you have been discussing.

We are developing a new feature in the tool that will look to make these kinds of things easier. I can’t say too much at this stage, but watch this space!

Alex


#13

Thanks, Bamford, for the response.