
You have completed Scraping Data From the Web!


Using Scrapers for Site Testing

5:52 with Ken Alger

Web scraping doesn't have to entirely be about scraping data for processing. Web scraping tools can be used to test websites as well.

Teacher's Notes

Testing in Python
Introduction to Selenium


Using web scraping tools doesn't just have to be for gathering data. 0:00

It can be used to test a site as well. 0:04

Testing your code is a great development practice to get into. 0:07

Writing unit tests, and 0:11

combining them with a web scraper, can be a powerful way to test a site. 0:12

You can check to make sure that a page's title is as expected, 0:17

or that all of the content resides in an element with a specific CSS class. 0:20
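
The two checks just mentioned (page title, and content living inside an element with a given CSS class) could look roughly like this. It is only a sketch: the markup in SAMPLE_HTML and the helper names title_matches and content_inside_class are made up for illustration and are not from the video.

```python
from bs4 import BeautifulSoup

# Hypothetical page markup, used only so the sketch runs offline.
SAMPLE_HTML = """
<html>
  <head><title>Horse Land</title></head>
  <body>
    <div class="content"><h1>Horse Land</h1><p>Welcome.</p></div>
  </body>
</html>
"""


def title_matches(soup, expected):
    """Return True if the page's <title> text equals expected."""
    return soup.title is not None and soup.title.get_text(strip=True) == expected


def content_inside_class(soup, css_class, tags=("h1", "p")):
    """Return True if every tag of the given kinds has an ancestor with css_class."""
    return all(
        tag.find_parent(class_=css_class) is not None
        for tag in soup.find_all(list(tags))
    )


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(title_matches(soup, "Horse Land"))
print(content_inside_class(soup, "content"))
```

Keeping the checks as small functions that return True or False makes them easy to drop into unit test assertions later.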

If you need a refresher on testing in Python, 0:26

check the teacher's notes for some great resources. 0:29

Let's head back to our sample site, and 0:33

use unit tests to make sure it has the elements that we expected it to have. 0:34

Let's go back to our horse site. 0:40

We'll check to see if it's a stable version 0:41

of what we're expecting it to have. 0:44

Let's go over and create a new Python file. 0:46

We'll call it horse_test.py, and we'll bring in our imports. 0:52

We need urllib.request here to bring in urlopen. 0:59

We'll bring in BeautifulSoup, and 1:06

since we're running the unit test we'll need to import unittest. 1:08

Next, we define our class and setup information. 1:14

So we'll call the class TestHorseLand, which inherits from unittest.TestCase. 1:18

We'll set our soup, to start with, equal to None, and then we define setUpClass. 1:26

Since it's a class method, it won't take self; it takes cls instead. 1:34
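
As a rough illustration of why setUpClass takes the class rather than self, here is a standard-library-only sketch. The class name TestSharedFixture and the stand-in page string are hypothetical; the point is only that setUpClass runs once and every test method reads the shared class attribute.

```python
import unittest


class TestSharedFixture(unittest.TestCase):
    # Shared by every test in the class; filled in once by setUpClass.
    page = None

    @classmethod
    def setUpClass(cls):
        # Runs once before any test method. It receives the class (cls),
        # not an instance, so there is no self here.
        cls.page = "<h1>Horse Land</h1>"  # stand-in for a downloaded page

    def test_page_is_loaded(self):
        self.assertIsNotNone(TestSharedFixture.page)

    def test_page_has_heading(self):
        self.assertIn("<h1>", TestSharedFixture.page)
```

Saved in a file, this can be run with `python -m unittest` followed by the file name. Doing the expensive work once per class matters more when the setup is a network request.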

We'll pass in our url, 1:37

treehouse-projects.github.io/horse-land/index.html. 1:41

Then we define our soup object. 1:54

It's going to be BeautifulSoup, urlopen, pass in the URL, 1:58

and we want the html.parser again. 2:03

Now, let's test that the h1 text is what we're expecting it to be. 2:08

So we'll define a test for header1. 2:13

We want header1 to be equal to our TestHorseLand.soup.find. 2:19

We want to grab the h1, and get_text. 2:27

Next, we want to make sure that header1, that we're capturing here, 2:32

is equal to what our string should be. 2:36

In our case, Horse Land. 2:39

So we would do self.assertEqual, pass in the string 2:41

that we want, Horse Land, and header1. 2:46

Then we'll do our dunder main check here, and we'll run unittest.main. 2:51

And when we run this, we get an OK, and the test passed, very nice. 3:00

Another way to test sites is with a package called Selenium, 3:06

which is designed specifically for website testing. 3:10

It can be installed in PyCharm, the same as BeautifulSoup, or 3:14

it can be installed with Pipenv. 3:17

I've included a link to the installation information 3:19

in the teacher's notes, as well. 3:22

One additional step you'll need is the driver for your preferred browser. 3:24

Follow the instructions on the page to get it set up. 3:28

Let's create a new file to show off Selenium. 3:31

So we can close this and create another new Python 3:34

file, horse_test_selenium. 3:40

So we'll be using BeautifulSoup again. 3:48

And from selenium, we want to import webdriver. 3:52

We'll also want to import the time module, to allow the page to fully load. 3:59

So next, we want to tell our webdriver which browser to use. 4:05

I'm using Chrome, so I'll set that up. 4:09

Then we tell the driver to go get our page. 4:16

Horse-land, back to index.html. 4:25

Let's have our script wait a few seconds before we process anything, 4:30

just to give the JavaScript time to run and load the horse images on the page. 4:33

We do time.sleep, pass in 5, that should give us plenty of time. 4:39

Now, we can utilize BeautifulSoup to parse the page. 4:44

Let's just print out the HTML, to see if we get the images. 4:47

Recall from an earlier video, when we did this, 4:51

we just got an empty, unordered list, 4:54

because BeautifulSoup doesn't run JavaScript. 4:56
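
To see the empty-list effect without opening a browser, here is a small sketch. The markup in RAW_HTML is hypothetical (the real page's HTML is not shown in the transcript); it only imitates a list that a script would fill in after the page loads.

```python
from bs4 import BeautifulSoup

# Hypothetical raw HTML: the list is empty until a script adds items to it.
RAW_HTML = """
<ul id="horses"></ul>
<script>/* JavaScript would add the list items here */</script>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
horse_list = soup.find("ul", id="horses")

# BeautifulSoup only parses the text it is given and never executes the
# script, so the list has no items.
print(horse_list.find_all("li"))
```

This is why the raw download alone is not enough for pages that build their content in the browser.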

The driver object has a property called page_source, 4:59

which gets us the source of the page at the time it was read. 5:03

So we'll say page_html equals driver.page_source, and 5:07

we can use that with BeautifulSoup. 5:11

We'll pass in the page_html, we'll use our html.parser again, 5:16

and we'll pretty-print our soup. 5:22

Then, we want to make sure we close the driver. 5:28

And let's run our script, and there we go! 5:32

We see all of our images and page content. 5:42
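
The Selenium script described above might be sketched as follows. It is a reconstruction from the narration, not the video's exact code: the helper names render_page and main are made up, the https:// scheme is assumed, and driver.quit() is used as one way to close the driver. Selenium is imported inside main so the rest of the file can be loaded without it.

```python
import time

from bs4 import BeautifulSoup

# Assumption: the https:// scheme is not stated in the video.
URL = "https://treehouse-projects.github.io/horse-land/index.html"


def render_page(driver, url, wait_seconds=5):
    """Load url in the browser, wait for scripts to run, and return the parsed page."""
    driver.get(url)
    # Crude fixed pause so the page's JavaScript has time to add the images.
    time.sleep(wait_seconds)
    page_html = driver.page_source  # HTML as the browser currently sees it
    return BeautifulSoup(page_html, "html.parser")


def main():
    # Imported here so the module loads even where Selenium is not installed.
    from selenium import webdriver

    driver = webdriver.Chrome()  # needs Chrome and a matching driver set up
    try:
        soup = render_page(driver, URL)
        print(soup.prettify())
    finally:
        # Always shut the browser down, even if something above fails.
        driver.quit()
```

Calling main() opens Chrome, waits five seconds, and pretty-prints the rendered HTML. Because render_page only needs an object with get and page_source, it can also be exercised with a stand-in driver in tests.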

We could now put our scraping skills to use in many productive ways. 5:45
