Let's use the Python library Scrapy to create a spider to crawl the web.
Additional Resources
- Python List Comprehensions
- Scrapy Response object
Inside the spiders folder, let's create
a new file to crawl our sample horse site.
0:00
So Spiders > File > New File > Python,
horse.py.
0:06
To start with, we need to import scrapy.
0:14
And then we'll write a class,
we'll call it HorseSpider,
0:20
which will inherit from scrapy.Spider.
0:26
Now, we need to give our HorseSpider
a name; let's call it ike,
0:30
after the horse in Charlotte's Web.
0:34
Spider names must be unique
within a scrapy project.
0:39
That way, scrapy knows which spider
to run in the project.
0:42
We'll use it to run our
spider in just a bit.
0:46
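So far, horse.py looks something like this minimal sketch:

```python
import scrapy


class HorseSpider(scrapy.Spider):
    # Spider names must be unique within a scrapy project
    name = "ike"
```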
There are two functions
we need to write in here.
0:48
The first is start_requests, which defines
the initial requests to be made and,
0:51
if applicable, how to follow links.
0:55
So we'll define start_requests.
0:59
For now, we'll just pass.
1:01
The other function is parse, which tells
the spider how extracted data is to
1:03
be parsed.
1:08
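With both stubs in place, the class might look like this; parse's response parameter follows scrapy's standard callback signature:

```python
class HorseSpider(scrapy.Spider):
    name = "ike"

    def start_requests(self):
        # Defines the initial requests and, if applicable, how to follow links
        pass

    def parse(self, response):
        # Tells the spider how extracted data is to be parsed
        pass
```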
Inside start_requests, we provide
a list of URLs that we want to process.
1:11
So urls, and we pass in a list.
1:17
So we'll pass in our index and
our mustang.html pages.
1:21
The whole URL is treehouse-projects.
1:26
github.io/horseland/index.html.
1:33
We'll paste that in and
change it to mustang.
1:43
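Inside start_requests, the list might look like the following; the https:// scheme is an assumption, since the narration only spells out the domain and paths:

```python
def start_requests(self):
    urls = [
        "https://treehouse-projects.github.io/horseland/index.html",
        "https://treehouse-projects.github.io/horseland/mustang.html",
    ]
```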
Then we need to return a scrapy.Request.
1:49
This is a list comprehension.
1:52
It's going to create a new list of
requests by looping through each of our URLs.
1:54
More on list comprehensions in the teacher's notes.
2:00
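As a quick illustration, a list comprehension builds a new list from an existing iterable in a single expression:

```python
# Same result as a for loop that appends n * n for each n
squares = [n * n for n in range(5)]  # [0, 1, 4, 9, 16]
```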
So we wanna return a list
of scrapy.Request.
2:01
We want our url to be url,
our callback is gonna be self.parse.
2:10
We want that for url in urls.
2:17
This line loops through our urls list,
creating a request for each URL that will be handled by the parse method.
2:21
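Putting it together, a sketch of the finished start_requests:

```python
def start_requests(self):
    urls = [
        "https://treehouse-projects.github.io/horseland/index.html",
        "https://treehouse-projects.github.io/horseland/mustang.html",
    ]
    # Build one Request per URL; scrapy calls self.parse with
    # each response as it is downloaded
    return [scrapy.Request(url=url, callback=self.parse) for url in urls]
```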
Let's update that method to do something.
2:26
We could do a lot of
things inside this method.
2:31
How you parse the data on
a site will be highly dependent
2:33
on the purpose of your project, since
every use case can be a little different.
2:37
For now,
let's just save the entire HTML file.
2:42
So we'll define a url, set to response.url.
2:46
This response object
represents an HTTP response
2:50
from the request we
made in start_requests.
2:54
It's usually downloaded by the downloader
and fed to the spiders for processing.
2:57
See the teacher's notes for additional
documentation on scrapy's response object.
3:03
So with our url,
we wanna get a specific page.
3:08
We'll split it on the last slash there,
3:13
and we'll call our file name horses.
3:20
We'll format that with our page and
we'll print out what the URL is,
3:25
And then we'll save our page.
3:36
I'm going to just write
the entire response body.
3:44
Then we'll print out the saved file name.
3:49
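Here's one plausible reading of the finished parse method; the horses- filename prefix and the exact print wording are assumptions based on the narration:

```python
def parse(self, response):
    url = response.url
    # Grab everything after the last slash, e.g. "index.html"
    page = url.split("/")[-1]
    # Filename prefix "horses", formatted with the page name (assumed)
    filename = "horses-{}".format(page)
    print("URL is: {}".format(url))
    # Save the entire raw response body to disk
    with open(filename, "wb") as f:
        f.write(response.body)
    print("Saved file {}".format(filename))
```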
Nice. Now, in a terminal window,
we navigate to our spider's directory.
3:57
And tell scrapy to crawl
using our spider name.
4:11
So we do scrapy crawl ike.
4:15
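That is, from the project directory:

```
scrapy crawl ike
```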
If we look at the output in our terminal,
we can find,
4:19
if we come up here a little bit,
right in here.
4:23
We see that the spider looked for
4:29
our robots.txt file, which it didn't
find since the site doesn't have one.
4:30
See this 404 code here?
4:35
Unlike our robots.txt, the pages
4:37
that we included in our URLs list were
found and saved by the parse method.
4:41
There's the URLs, there's the file names,
4:46
we'll come back up here,
there they are, very nice.
4:50
Great work on writing your first spider.
4:55
We saw the two methods that a scrapy
spider needs, start_requests and parse.
4:58
We put a list of URLs in
the start_requests method and
5:05
had it loop through that list and
process each URL with the parse method.
5:08
We could have our parse method
do something more powerful
5:13
than just saving the entire file.
5:16
But this is a nice start.
5:18
Next up though,
5:20
let's see how to write a spider that will
crawl more URLs than what we give it.
5:21