Let's use the Python library Scrapy to create a spider to crawl the web.
Additional Resources
- Python List Comprehensions
- Scrapy Response object
Inside the spiders folder, let's create
a new file to crawl our sample horse site.
0:00
So Spiders > File > New File > Python,
horse.py.
0:06
To start with, we need to import scrapy.
0:14
And then we'll write a class,
we'll call it HorseSpider,
0:20
which will inherit from scrapy.Spider.
0:26
Now, we need to give our HorseSpider
a name; let's call it ike,
0:30
after the horse in Charlotte's Web.
0:34
Spider names must be unique
within a scrapy project.
0:39
That way, scrapy knows which spider
to run in the project.
0:42
We'll use it to run our
spider in just a bit.
0:46
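So far, horse.py looks something like this minimal sketch:

```python
import scrapy


class HorseSpider(scrapy.Spider):
    # Spider names must be unique within a scrapy project
    name = "ike"
```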
There are two functions
we need to write in here.
0:48
The first is start_requests, which defines
the initial requests to be made and,
0:51
if applicable, how to follow links.
0:55
So we'll define start_requests.
0:59
For now, we'll just pass.
1:01
The other function is parse, which tells
the spider how extracted data is to
1:03
be parsed.
1:08
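With both stubs in place, the class might look like this; parse's response parameter follows scrapy's standard callback signature:

```python
class HorseSpider(scrapy.Spider):
    name = "ike"

    def start_requests(self):
        # Defines the initial requests and, if applicable, how to follow links
        pass

    def parse(self, response):
        # Tells the spider how extracted data is to be parsed
        pass
```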
Inside start_requests, we provide
a list of URLs that we want to process.
1:11
So urls, and we pass in a list.
1:17
So we'll pass in our index and
our mustang.html pages.
1:21
The whole URL is treehouse-projects.
1:26
github.io/horseland/index.html.
1:33
We'll paste that in and
change it to mustang.
1:43
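Inside start_requests, the list might look like the following; the https:// scheme is an assumption, since the narration only spells out the domain and paths:

```python
def start_requests(self):
    urls = [
        "https://treehouse-projects.github.io/horseland/index.html",
        "https://treehouse-projects.github.io/horseland/mustang.html",
    ]
```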
Then we need to return a scrapy.Request.
1:49
This is a list comprehension.
1:52
It's going to create a new list of
requests by looping through each of our URLs.
1:54
More on list comprehensions in the teacher's notes.
2:00
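As a quick illustration, a list comprehension builds a new list from an existing iterable in a single expression:

```python
# Same result as a for loop that appends n * n for each n
squares = [n * n for n in range(5)]  # [0, 1, 4, 9, 16]
```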
So we wanna return a list
of scrapy.Request.
2:01
We want our url to be url,
our callback is gonna be self.parse.
2:10
We want that for url in urls.
2:17
This line loops through our urls list,
creating a request for each URL that will be handled by the parse method.
2:21
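Putting it together, a sketch of the finished start_requests:

```python
def start_requests(self):
    urls = [
        "https://treehouse-projects.github.io/horseland/index.html",
        "https://treehouse-projects.github.io/horseland/mustang.html",
    ]
    # Build one Request per URL; scrapy calls self.parse with
    # each response as it is downloaded
    return [scrapy.Request(url=url, callback=self.parse) for url in urls]
```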
Let's update that method to do something.
2:26
We could do a lot of
things inside this method.
2:31
How you parse the data on
a site will be highly dependent
2:33
on the purpose of your project, since
every use case can be a little different.
2:37
For now,
let's just save the entire HTML file.
2:42
So we'll define a url, set to response.url.
2:46
This response object
represents an HTTP response
2:50
from the request we
made in start_requests.
2:54
It's usually downloaded by the downloader
and fed to the spiders for processing.
2:57
See the teacher's notes for additional
documentation on scrapy's response object.
3:03
So with our url,
we wanna get a specific page.
3:08
We'll split it on the last slash there,
3:13
and we'll call our file name horses.
3:20
We'll format that with our page and
we'll print out what the URL is,
3:25
And then we'll save our page.
3:36
I'm going to just write
the entire response body.
3:44
Then we'll print out the saved file name.
3:49
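Here's one plausible reading of the finished parse method; the horses- filename prefix and the exact print wording are assumptions based on the narration:

```python
def parse(self, response):
    url = response.url
    # Grab everything after the last slash, e.g. "index.html"
    page = url.split("/")[-1]
    # Filename prefix "horses", formatted with the page name (assumed)
    filename = "horses-{}".format(page)
    print("URL is: {}".format(url))
    # Save the entire raw response body to disk
    with open(filename, "wb") as f:
        f.write(response.body)
    print("Saved file {}".format(filename))
```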
Nice. Now, in a terminal window,
we navigate to our spider's directory.
3:57
And tell scrapy to crawl
using our spider name.
4:11
So we do scrapy crawl ike.
4:15
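That is, from the project directory:

```
scrapy crawl ike
```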
If we look at the output in our terminal,
we can find,
4:19
if we come up here a little bit,
right in here.
4:23
We see that the spider looked for
4:29
our robots.txt file, which it didn't
find since the site doesn't have one.
4:30
See this 404 code here?
4:35
Unlike our robots.txt, the pages
4:37
that we included in our URLs list were
found and saved by the parse method.
4:41
There's the URLs, there's the file names,
4:46
we'll come back up here,
there they are, very nice.
4:50
Great work on writing your first spider.
4:55
We saw the two methods that a scrapy
spider needs, start_requests and parse.
4:58
We put a list of URLs in
the start_requests method and
5:05
had it loop through that list and
process each URL with the parse method.
5:08
We could have our parse method
do something more powerful
5:13
than just saving the entire file.
5:16
But this is a nice start.
5:18
Next up though,
5:20
let's see how to write a spider that will
crawl more URLs than what we give it.
5:21