Let's look at two Beautiful Soup methods, `find()` and `find_all()`, in greater detail.
0:00 Welcome back.
0:01 We just saw how to utilize the `find_all()` method to find all instances of a particular item on the page.
0:08 We can use the `find()` method to find the first instance of an item.
0:14 We can change this to `find()`, get rid of our for loop here, and run it.
0:30 I should probably have changed that name to just div, ah well.
0:33 Naming things is always a challenge for me.
0:36 We're getting all of the info back for that particular div element.
0:40 The featured one on the page here, in this case.
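The difference can be sketched like this; the HTML here is a stand-in for the demo site, so the `featured` class and the content are assumptions:

```python
from bs4 import BeautifulSoup

html = """
<div class="featured"><h2>Mustangs</h2><p>Wild horses of the west.</p></div>
<div class="featured"><h2>Burros</h2><p>More info.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list of every matching element...
divs = soup.find_all("div", class_="featured")
print(len(divs))  # 2

# ...while find() returns only the first match (or None if nothing matches).
first = soup.find("div", class_="featured")
print(first.h2)  # <h2>Mustangs</h2>
```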
0:43 What if we just want the header text in here?
0:46 Since it's a child element of that div, we can chain elements together.
0:53 Let's comment this out.
0:57 Close this down.
1:01 We want `featured_header = soup.find`.
1:08 We want the div with class featured.
1:17 We just want the h2 element.
1:20 And we'll print the featured header.
1:26 Nice.
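That chaining step might look something like this, with the markup assumed from the demo page:

```python
from bs4 import BeautifulSoup

html = '<div class="featured"><h2>Mustangs</h2><p>Wild horses.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Chain from the featured div down to its child h2 element.
featured_header = soup.find("div", class_="featured").h2
print(featured_header)  # <h2>Mustangs</h2>
```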
1:27 But we still have our tag elements in there.
1:30 From a data cleanliness standpoint, it would be great if we could get rid of those, right?
1:35 Well, there's a convenient method for that, called `get_text()`.
1:49 Yippee, we got some text from our site.
1:52 We scraped it out.
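Appending `get_text()` to that same chain strips the tags away; again, the markup is a stand-in for the demo site:

```python
from bs4 import BeautifulSoup

html = '<div class="featured"><h2>Mustangs</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# get_text() strips the surrounding tags, leaving just the text content.
featured_header = soup.find("div", class_="featured").h2
print(featured_header.get_text())  # Mustangs
```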
1:53 There's a bit of a gotcha to watch out for with this `get_text()` method, though.
1:57 It strips away the tags from whatever we're working with, leaving just a block of text.
2:02 Let's take away this h2 element from our text value to see what I mean.
2:12 While this is perhaps more readable for us, it makes it much more challenging to process going forward.
2:18 If we wanted to select the mustangs, or the text about them, at this point it would be more of a challenge.
2:24 The thing to remember about `get_text()` is to use it as the last step in the scraping process.
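Here is a small sketch of that gotcha: calling `get_text()` on the whole div (rather than a specific child like the h2) flattens every child tag into one block of text, so the structure we'd need for further selection is gone. The markup is assumed:

```python
from bs4 import BeautifulSoup

html = '<div class="featured"><h2>Mustangs</h2><p>Wild horses of the west.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Dropping the .h2 step and calling get_text() on the entire div
# merges the heading and paragraph into a single undifferentiated string.
print(soup.find("div", class_="featured").get_text())
# MustangsWild horses of the west.
```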
2:31 We've seen that the `find()` method returns the first occurrence of an item in a Beautiful Soup object.
2:36 It is basically the `find_all()` method with the limit of results set to one.
2:41 Let's look at the parameters these methods take.
2:45 Name, which looks for tags with certain names, such as title or div.
2:50 Attrs, which allows for searching for a specific CSS class.
2:54 We'll take a look at this here shortly.
2:57 Recursive: by default, `find()` and `find_all()` examine all descendants of a tag.
3:03 If we set recursive to False, they will only look at the direct children of the tag.
3:09 String (or text), which allows for the searching of strings instead of tags.
3:15 Kwargs, which allows searching on other items, such as a CSS ID.
3:20 Limit: the `find_all()` method also accepts a limit argument to limit the number of results returned.
3:26 As I mentioned, `find()` is a `find_all()` with a limit set to one.
3:30 We can pass in a string, a list, a regular expression, the value True, or even a function to the name, string, or kwargs arguments to further enhance the searching capabilities.
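A few of those parameters can be sketched together; the page structure here is made up for illustration:

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="main">
  <h1>Horse Land</h1>
  <div><h2>Mustangs</h2></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# recursive=False only checks direct children, so the nested h2 is skipped.
print(soup.find("div", id="main").find_all("h2", recursive=False))  # []

# A regular expression as the name matches h1 and h2; limit caps the results.
headers = soup.find_all(re.compile("^h[12]$"), limit=1)
print(headers)  # [<h1>Horse Land</h1>]

# Passing True matches every tag on the page.
print(len(soup.find_all(True)))  # 4
```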
3:43 Let's take a look at the attrs argument to search for the CSS class, and print out all references to this primary button class, which is this button down here.
3:54 Come back over here to our code, let's comment this out.
4:00 So, for button in `soup.find_all`.
4:07 Gonna look for a class, and that class was `button button--primary`.
4:20 And we'll just print the buttons out.
4:27 And there it is.
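That attrs search might look like this; the anchor markup is assumed, but the `button button--primary` class is the one from the demo page:

```python
from bs4 import BeautifulSoup

html = """
<a class="button button--primary" href="/signup">Sign Up</a>
<a class="button" href="/more">Learn More</a>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs matches on the exact string value of the class attribute,
# so only the element with both classes is found.
for button in soup.find_all(attrs={"class": "button button--primary"}):
    print(button)
```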
4:29 Since class is a reserved word in Python, and searching for items on a page based on class is a frequent task, Beautiful Soup provides a shortcut for that.
4:39 We can change our code to use a special keyword argument, class underscore.
4:44 So we can take all this out, remove our closing curly bracket, and we get the same result with a bit less typing.
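With the same assumed markup as before, the `class_` keyword version looks like this:

```python
from bs4 import BeautifulSoup

html = """
<a class="button button--primary" href="/signup">Sign Up</a>
<a class="button" href="/more">Learn More</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) sidesteps Python's reserved word
# and gives the same result as attrs={"class": ...}.
for button in soup.find_all(class_="button button--primary"):
    print(button)
```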
4:58 Another very common task, which will be useful when we want to move from one page to another, is to get all of the hyperlinks on a page.
5:07 We can navigate into a specific tag and use the `get()` method to extract specific information.
5:13 Minimize that.
5:17 Again, we'll comment this out, just for clarity.
5:21 So, for link in `soup.find_all`, we'll look for all the anchor elements.
5:29 And then we'll print out all of the href attributes.
5:34 So link, and we'll get the hrefs.
5:42 We can look at these patterns to determine internal and external links.
5:47 Definitely a handy thing to do.
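The link-collecting loop can be sketched like this; the URLs are placeholders, not the demo site's real links:

```python
from bs4 import BeautifulSoup

html = """
<a href="/horses/mustangs">Mustangs</a>
<a href="https://example.com/partners">Partners</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Loop over every anchor element and pull out its href attribute.
# Relative paths like /horses/mustangs are internal links;
# full URLs with another domain are external.
for link in soup.find_all("a"):
    print(link.get("href"))
```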
5:50 Beautiful Soup is a very powerful tool, and we've just scratched the surface of its power.
5:55 But we've seen how we can use Python to read a webpage and get very specific data from the HTML.
6:01 It can take a bit of work to decipher the page structure, but that is time well spent for data collection.
6:08 Before we get too much further into collecting data from websites, we should talk about some other things to think about to be good data wranglers.
6:16 I'll see you all back here in a bit, and have a look.