We've seen how to scrape data from a single page. Now let's see how we can capture links on one page and follow them to process additional pages.
Additional Resources
- Regular Expressions in Python
[MUSIC]
We've seen how we can scrape data from a single page and isolate all the links on that page. We can utilize that and start moving off a single page and onto multiple pages, or crawling the web. The internet constitutes over 4.5 billion pages connected together with hyperlinks. Web crawling is, for our purposes, the practice of moving between these connected web pages and crawling along the paths of hyperlinks. This is where the power of automation comes into play. We can write our application to look at a page, scrape the data, then follow the links, if any, on that page and scrape the next page, and so on. Most web pages have both internal and external links on them.
Before we saddle up again and get going in our code, let's think about web crawling at a high level. We need to scrape a given page and generate a list of links to follow. It's often a good idea to determine whether a link is internal or external and keep track of them separately, so we'll go through the list of links and separate them into internal and external lists. We'll check to see if we already have a link recorded, and if so, it will be ignored; if we don't have a record of seeing a particular link, we'll add it to our list. We'll also look at how to leverage the power of regular expressions to account for things like URLs. If you need a refresher on regular expressions in Python, and I know I occasionally do, check the teacher's notes.
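As a quick illustration of the piece we'll lean on here, a compiled pattern can be reused and, as we'll see shortly, handed to Beautiful Soup to match attribute values. This tiny sketch isn't from the video; it just shows the idea with a slightly stricter pattern than the one typed later:

```python
import re

# Compile once, reuse the pattern object; here we anchor on a .html suffix.
html_link = re.compile(r"\.html$")

print(bool(html_link.search("mustang.html")))           # True
print(bool(html_link.search("https://example.com/")))   # False
```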
When we last looked at scraper.py, we were getting all of the links from our Horse Land main page. Let's see how we can round up these links and put them to use. Looking at the output from our previous run of scraper.py, we're getting this internal link here for mustang.html and then all of these external links. We can separate those out and follow them.
First, let's make a new file, a new Python file, and call it soup_follow_scraper. I told you I'm bad at naming things. We can minimize this. And we'll bring in our imports: from urllib.request we want urlopen, and from bs4 we import BeautifulSoup. We'll also be using regular expressions, so let's import re to take care of that.
Let's make an internal links function that will take a link URL: internal_links. We'll need to open our URL to define our html with urlopen. Inside here, we'll pass in the start of the URL and format it with the internal URL we scraped from the page. Our URL, in our case, is treehouse-projects.github.io/horse-land, plus our string formatter, and we'll format it with the link URL. Next, we create our Beautiful Soup object: soup is BeautifulSoup, pass in our html, and we'll use the same HTML parser we've been using, html.parser. And we want to return the link from the soup object with soup.find. We'll look for the anchor tags and use the href attribute of the find method with a regular expression to get just the links that, in our case, end in .html. Inside here, re.compile, and our pattern is .html.
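Typed out, the function described above might look roughly like the sketch below. The variable names and exact pattern are my reading of the video rather than a copy of it, and the base URL is assumed to be the Horse Land project page mentioned earlier:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re


def internal_links(link_url):
    # Build the full address from the site's base path plus the scraped link.
    html = urlopen(
        "https://treehouse-projects.github.io/horse-land/{}".format(link_url)
    )
    soup = BeautifulSoup(html, "html.parser")
    # Return the first anchor tag whose href matches the .html pattern.
    return soup.find("a", href=re.compile(".html"))
```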
Let's put it to use. So if dunder name equals dunder main, we want our urls to be internal_links, and we'll pass in our starting URL to the internal_links method; in our case, it's index.html. Then we'll do a while loop: while the length of our urls is greater than 0, we want to capture the URL's href. Now, we could do a lot of processing here, but for now let's just print out the page information we get, print(page). Then we'll add a little bit of formatting, a couple of new lines in there, and then we'll call our internal links method again for the next link, internal_links(page). Let's run it and see it in action.
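Continuing the sketch in the same file, the main block just described could look something like this, pulling the href out of the tag that internal_links returns. This is the version that, as we're about to see, bounces between index.html and mustang.html forever:

```python
if __name__ == "__main__":
    urls = internal_links("index.html")
    while len(urls) > 0:
        # Capture the href from the anchor tag we got back.
        page = urls.get("href")
        print(page)
        print("\n\n")
        # Follow the link we just found and scrape the next page.
        urls = internal_links(page)
```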
Well, there we have it. It's doing what we asked, but it's in an infinite loop: index.html is finding the link to mustang.html, which is finding the link back to index.html, which is, well, you get the point. Let's add in a list to keep track of our pages; call it site_links. Then we'll adjust our while loop: if page is not in site_links, we'll add the page to our list, site_links.append(page). We can indent all that and give us some more space. Otherwise, we'll just break. And let's run it again. Page is not defined; we'll pull that out. I started my if statement too soon. There we go, and we get the links that we were expecting.
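With site_links added, the adjusted block might look something like this sketch; when we hit a page we've already recorded, we break out of the loop instead of crawling in circles:

```python
site_links = []

if __name__ == "__main__":
    urls = internal_links("index.html")
    while len(urls) > 0:
        page = urls.get("href")
        if page not in site_links:
            # First time seeing this page: record it, print it, then follow it.
            site_links.append(page)
            print(page)
            print("\n\n")
            urls = internal_links(page)
        else:
            # We've already crawled this page, so stop.
            break
```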
6:34
External links are handled in a similar
fashion, you do find the base url path,
6:37
and then, with regex define the pattern
you're looking for and follow the links.
6:42
I'll saddle you with the responsibility to give it a try and post your solution in the community. Don't worry, I'm sure you can rein it in.