Crawler fails to find target pages, even when explicitly added as starting pages



The government recently required colleges and trade schools to publish Gainful Employment disclosures using an official HTML/jQuery template.

I trained a crawler on 5 of the pages, and it worked beautifully as long as JavaScript was enabled. Great. For the starting URLs I provided a list of:

  1. the 5 training pages
  2. 5,000 institutions’ URLs that I want to scrape

For the “Where to scrape data from?” field, I wanted to tell it “any page under those URLs that looks like the training pages or has Gedt.html or gedt.html anywhere in the URL”. Not knowing exactly how to do that, my best guess was:

{any}
{alpha}.{alpha}.edu/
{alpha}.edu/
{alpha-num}.{alpha}.edu/
{alpha-num}.{alpha}.edu/{any}
{any}.{any}.edu/
{any}.edu/
{any}.{any}.edu/{any}
{any}.edu/{any}
{any}.edu/{any}/{any}
{alpha}.{alpha}.edu/{alpha}-{alpha}/{words}{any?}Gedt.html{query-string?}$
{alpha}.{alpha}.edu/{alpha}-{alpha}/{words}{any?}Gedt.html{query-string?}
{alpha}.{alpha}.edu/{alpha}-{alpha}/{words}{any?}Gedt.html
{any}.{any}.edu/{any}-{any}/{any?}Gedt.html
{any}.edu/{any}-{any}/{any?}Gedt.html
{any}.{any}.edu/{any}/{any?}Gedt.html
{any}.edu/{any}/{any?}Gedt.html
{any}.{any}.edu/{any?}Gedt.html
{any}.edu/{any?}Gedt.html
{any}.edu/{any?}Gedt.html
{any}.{any}.edu/{any}/{any?}Gedt.html{any}
{any}.edu/{any}/{any?}Gedt.html{any}
{any}.{any}.edu/{any?}Gedt.html{any}
{any}.edu/{any?}Gedt.html{any}
{any}.edu/{any?}Gedt.html{any}
{any}
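
To make the intent concrete, here is a minimal Python sketch of the URL rule I'm trying to express. It is not the crawler's own pattern syntax, just an ordinary regular expression, and it only covers the "Gedt.html or gedt.html anywhere in the URL" half of the rule. The function name and the sample URLs are hypothetical, made up for illustration.

```python
import re

# Matches "Gedt.html" or "gedt.html" anywhere in the URL (path or query string).
GEDT_PATTERN = re.compile(r"[Gg]edt\.html")


def is_target_url(url: str) -> bool:
    """Return True if the URL contains Gedt.html or gedt.html."""
    return GEDT_PATTERN.search(url) is not None


# Hypothetical sample URLs, for illustration only.
samples = [
    "http://www.example.edu/programs/nursing/Gedt.html",
    "http://example.edu/ge/welding-gedt.html?year=2014",
    "http://www.example.edu/about/index.html",
]

for url in samples:
    print(url, is_target_url(url))
```

The first two sample URLs should count as target pages and the third should not.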

It ran for hours and produced 0 rows. It didn’t even extract data from the training pages, even though their URLs were explicitly listed among the starting pages.

What am I doing wrong? What can I try?