Web scraping doesn't have to be only about collecting data for processing. The same tools can be used to test websites as well.
- Testing in Python
- Introduction to Selenium
Using web scraping tools doesn't
just have to be for gathering data.
0:00
It can be used to test a site as well.
0:04
Testing your code is a great
development practice to get into.
0:07
Writing unit tests, and
0:11
combining them with a web scraper,
can be a powerful tool for testing a site.
0:12
You can check to make sure that
a page's title is as expected,
0:17
or that all of the content resides in
an element with a specific CSS class.
0:20
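As a rough illustration of those two checks, here is a minimal sketch that runs against an inline HTML string rather than a live page. The markup, the page title, the `content` class name, and both helper functions are hypothetical and are not taken from the video.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for a real page.
SAMPLE_HTML = """
<html>
  <head><title>Horse Land</title></head>
  <body>
    <div class="content">
      <h1>Horse Land</h1>
      <p>Welcome!</p>
    </div>
  </body>
</html>
"""


def page_title(soup):
    """Return the text of the <title> element, stripped of whitespace."""
    return soup.title.get_text(strip=True)


def all_inside_class(soup, tag_names, css_class):
    """Return True if every matching tag has an ancestor with css_class."""
    for tag in soup.find_all(tag_names):
        if tag.find_parent(class_=css_class) is None:
            return False
    return True


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
```

In a unit test, these helpers could back assertions such as `self.assertEqual("Horse Land", page_title(soup))`.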
If you need a refresher
on testing in Python,
0:26
check the teacher's notes for
some great resources.
0:29
Let's head back to our sample site, and
0:33
use unit tests to make sure it has
the elements that we expected it to have.
0:34
Let's go back to our horse site.
0:40
We'll check to see that
the site still has
0:41
what we're expecting it to have.
0:44
Let's go over and create a new file,
a new Python file.
0:46
We'll call it horse_test.py,
and we'll bring in our imports.
0:52
We need urllib.request here to bring in urlopen.
0:59
We'll bring in BeautifulSoup, and
1:06
since we're running the unit test
we'll need to import unittest.
1:08
Next, we define our class and
setup information.
1:14
So we'll call the class TestHorseLand,
which inherits from unittest.TestCase.
1:18
We'll set our soup, to start with, equal
to None, and then we define a setUpClass.
1:26
And in this case, it's a class method, so it takes cls instead of self.
1:34
We'll pass in our url,
1:37
treehouse-projects.github.io/horse-land/index.html.
1:41
Then we define our soup object.
1:54
It's going to be BeautifulSoup,
urlopen, pass in the URL,
1:58
and we want the html.parser again.
2:03
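The setup described so far might look roughly like the sketch below. The `https://` scheme on the URL is an assumption (the video only gives the host and path), and storing the result through `cls.soup` is one way to write it; the video may assign to `TestHorseLand.soup` directly.

```python
import unittest
from urllib.request import urlopen

from bs4 import BeautifulSoup


class TestHorseLand(unittest.TestCase):
    # Shared by every test in the class; filled in once by setUpClass.
    soup = None

    @classmethod
    def setUpClass(cls):
        # Runs once before any test method in the class.
        # The https:// scheme is an assumption, not stated in the video.
        url = "https://treehouse-projects.github.io/horse-land/index.html"
        # Parse the response with Python's built-in HTML parser and
        # store the result on the class so tests can reuse it.
        cls.soup = BeautifulSoup(urlopen(url), "html.parser")
```

Because `setUpClass` is a class method, the page is fetched once for the whole class rather than once per test.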
Now, let's test that the h1 text
is what we're expecting it to be.
2:08
So we'll define a test for header1.
2:13
We want header1 to be equal to
our TestHorseLand.soup.find.
2:19
We want to grab the h1, and get_text.
2:27
Next, we want to make sure that header1,
that we're capturing here,
2:32
is equal to what our string should be.
2:36
In our case, Horse Land.
2:39
So we would do self.assertEqual,
pass in our string
2:41
that we want, Horse Land,
and compare it to header1.
2:46
And do our dunder main check here,
and we'll run unittest.main.
2:51
And when we run this, we get an OK,
and the test passed, very nice.
3:00
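Here is a minimal sketch of the header test that can run offline. It swaps the live `urlopen` call for an inline HTML string, and the names `SAMPLE_HTML`, `TestHorseLandOffline`, and `run_tests` are hypothetical. The video ends its file with `unittest.main()` under the dunder main check; this sketch uses a test runner instead so the result comes back as a value.

```python
import unittest

from bs4 import BeautifulSoup

# Hypothetical stand-in for the page; the video fetches it with urlopen.
SAMPLE_HTML = "<html><body><h1>Horse Land</h1></body></html>"


class TestHorseLandOffline(unittest.TestCase):
    soup = None

    @classmethod
    def setUpClass(cls):
        # Parse the inline markup once for all tests in the class.
        cls.soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

    def test_header1(self):
        # Grab the first h1 and compare its text to the expected string.
        header1 = TestHorseLandOffline.soup.find("h1").get_text()
        self.assertEqual("Horse Land", header1)


def run_tests():
    """Load and run the test case, returning the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(TestHorseLandOffline)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_tests()
```

If the h1 text on the page ever changes, `assertEqual` fails and reports both strings, which is what makes this useful as a site check.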
Another method to test sites is
with a package called selenium,
3:06
which is designed specifically for
website testing.
3:10
It can be installed in PyCharm,
the same way as BeautifulSoup, or
3:14
it can be installed with Pipenv.
3:17
I've included a link to
the installation information
3:19
in the teacher's notes, as well.
3:22
One additional step you'll need is
the driver for your preferred browser.
3:24
Follow the instructions on
the page to get it set up.
3:28
Let's create a new file
to show off selenium.
3:31
So we can close this,
and create another new Python
3:34
file, horse_test_selenium.
3:40
So we'll be using BeautifulSoup again.
3:48
And from selenium,
we want to import webdriver.
3:52
We'll also want to import the time module,
to allow the page to fully load.
3:59
So next, we want to tell our
webdriver which browser to use.
4:05
I'm using Chrome, so I'll set that up.
4:09
Then we tell the driver
to go get our page.
4:16
Horse-land, back to index.html.
4:25
Let's have our script wait a few seconds,
before we process anything.
4:30
Just to give the JavaScript time to run,
and load the horse images on the page.
4:33
We do time.sleep, pass in 5,
that should give us plenty of time.
4:39
Now, we can utilize
BeautifulSoup to parse the page.
4:44
Let's just print out the HTML,
to see if we get the images.
4:47
Recall from an earlier video,
when we did this,
4:51
we just got an empty unordered list,
4:54
because BeautifulSoup doesn't wait for
JavaScript.
4:56
The driver object has
a property called page_source,
4:59
which gets us the source of
the page at the time it was read.
5:03
So we'll say page_html equals
driver.page_source, and
5:07
we can use that with BeautifulSoup.
5:11
We'll pass in the page_html,
we'll use our html.parser again,
5:16
and we'll pretty-print our soup.
5:22
Then, we want to make
sure we close the driver.
5:28
And let's run our script, and there we go!
5:32
We see all of our images and page content.
5:42
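Put together, the Selenium steps might look something like the following sketch. The function names and the `https://` scheme are assumptions, and the sketch calls `driver.quit()` to shut the browser down, whereas the video only says to close the driver. Selenium is imported inside the function so the parsing half can be tried without a browser driver installed.

```python
import time

from bs4 import BeautifulSoup

# The https:// scheme is an assumption; the video gives only host and path.
URL = "https://treehouse-projects.github.io/horse-land/index.html"


def fetch_rendered_html(url, wait_seconds=5):
    """Open url in Chrome, wait for JavaScript to run, return the page source."""
    from selenium import webdriver  # needs selenium and a Chrome driver

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Give the page's JavaScript time to load the images.
        time.sleep(wait_seconds)
        # page_source is a property, so there are no parentheses.
        return driver.page_source
    finally:
        # Always shut the browser down, even if something above fails.
        driver.quit()


def prettify_html(page_html):
    """Parse an HTML string with BeautifulSoup and return it pretty-printed."""
    soup = BeautifulSoup(page_html, "html.parser")
    return soup.prettify()
```

With a browser driver set up, `print(prettify_html(fetch_rendered_html(URL)))` would be one way to print the rendered page.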
We could now put our scraping skills
to use in many productive ways.
5:45