Let's further explore how to crawl the web.
With our first spider, Ike, we saw how to process a static list of URLs. This is great if you know all the URLs of the pages you want to scrape.
What happens, though, when you want to start following links that are included on the page itself? Scrapy has some helpful methods for handling these situations, with the LinkExtractor and CrawlSpider classes.
A word of caution here before we crawl down this path: we need to be aware of the overwhelming amount of data and sites that are connected on the web. Writing a spider that gets and follows all the links on each followed web page can lead to a program that never ends. Also, with the idea that any given site is only six clicks away from any other site, sending a spider on a massive crawling task can potentially lead to some sites that are way off our originally intended topic. We should look at setting up some rules for our spider to follow as well.
The CrawlSpider class from Scrapy is set up a bit differently than the spider we wrote in the last video. It has the same overall concept, but instead of a start_requests method, we define allowed_domains and start_urls. Then we'll define a set of rules for the spider to follow. This lets us tell the spider which links to match or not, whether to follow them or not, and how to parse the information. Let's take a look at how to implement these concepts in a new spider.
Let's create a new file in our spiders folder and call it crawler.py. We need a few imports: from scrapy.linkextractors, import LinkExtractor, and from scrapy.spiders, we want to import CrawlSpider and Rule.
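As a minimal sketch, the top of crawler.py might look like this:

```python
# crawler.py -- imports for a spider that extracts and follows links
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
```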
Next, we define our class, this time inheriting from CrawlSpider. We'll name this one after another famous horse, Whirlaway. Perhaps not quite the same as Ike from Charlotte's Web, but a winner in his own right. When using the CrawlSpider class, we can set a few parameters for it to follow. Let's start with an allowed_domains limit, to prevent our spider from getting too far out of control. So we do allowed_domains; we can pass in a list, and for ours we'll just do treehouse-projects.github.io. Next, we define a place to start. So we do start_urls, and we want treehouse-projects.github.io/horse-land.
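Put together, the start of the class might look roughly like this sketch; the class name, the name attribute, and the https:// scheme on the start URL are reasonable guesses rather than anything Scrapy requires:

```python
class WhirlawaySpider(CrawlSpider):
    # The name we'll pass to `scrapy crawl` on the command line
    name = 'whirlaway'

    # Keep the spider from wandering off to other domains
    allowed_domains = ['treehouse-projects.github.io']

    # Where the crawl begins
    start_urls = ['https://treehouse-projects.github.io/horse-land/']
```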
Now we can define our rules. We'll use the LinkExtractor class and pass in a regular expression of links to follow or ignore. So our rules will be a Rule, with a LinkExtractor and our regular expression. Then we tell our rule how to parse the information by assigning the callback parameter to the method name; let's use parse_horses, so callback='parse_horses'. Then we tell the rule whether it's okay to follow the links: follow=True. And let's clean this up a little bit and drop these down onto new lines.
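After tidying it onto separate lines, the rule might look something like this sketch, continuing the class from above; the allow pattern here is only a placeholder, since the exact regular expression depends on which links on the site you want to match:

```python
    # Continuing inside the WhirlawaySpider class sketched above
    rules = (
        Rule(
            # Placeholder pattern: only follow links under /horse-land/
            LinkExtractor(allow=r'horse-land'),
            callback='parse_horses',  # method that handles each matched page
            follow=True,              # keep following links found on those pages
        ),
    )
```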
Now we can define our parsing method. parse_horses will take self and the response, and we'll grab the page URL and the page title. We can use CSS to select specific page elements. The result of running response.css('title') is a list-like object called a SelectorList, which represents a list of Selector objects that wrap around XML or HTML elements and allow you to run further queries to fine-tune the selection or extract the data. For this example, let's just print out the URL and title. We'll print the page URL with format(url), and we'll print the page title.
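A first pass at that method might look like the sketch below, still selecting the whole title element (which is why the output comes out messy in a moment); extract_first() is one way to pull the data out of the SelectorList:

```python
    # Continuing inside the WhirlawaySpider class sketched above
    def parse_horses(self, response):
        # The URL of the page this response came from
        url = response.url

        # response.css('title') returns a SelectorList; extract_first()
        # pulls out the first match, tags and all at this point
        title = response.css('title').extract_first()

        print('Page URL: {}'.format(url))
        print('Page title: {}'.format(title))
```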
Go to the terminal, and we'll ask Scrapy to crawl our site: crawl Whirlaway. We need to be in the right directory, then crawl Whirlaway. And there's our information. Scroll up here. So again, we see that we got a 404 when it was looking for the robots.txt. Page URL, page title: it's kinda messy. We can clean that up a little bit.
We only want to extract the text directly inside the title element, so let's change that up here. So for the title, we want the text, and we want to extract it.
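The tweak might look like this; whether you use extract() (a list of every match) or extract_first() (just the first one) is a judgment call here, and either tidies up the output:

```python
        # Grab only the text inside the <title> element, not the tag itself
        title = response.css('title::text').extract_first()
```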
And we'll run it again. Come up here: there's our page title, that's much better. Also note here in the output that Scrapy found those external links but filtered them out, since they fall outside our allowed_domains. Thanks, Scrapy.
Well done, you've written two different spiders now: one that follows links that we provide, and one that extracts links from a site and follows them based on rules we set. These are both very powerful tools for scraping data from the web. Being able to get the information is a major task, and we've seen how easy scraping makes it. In the next stage, let's take a look at how to handle some other common tasks, such as handling forms and interacting with APIs.