Let's look at two Beautiful Soup methods, `find()` and `find_all()`, in greater detail.
0:00 Welcome back.
0:01 We just saw how to utilize the `find_all()` method to find all instances of a particular item on the page.
0:08 We can use the `find()` method to find the first instance of an item.
0:14 We can change this to `find()`, get rid of our for loop here, and run it.
0:30 I should probably have changed that name to just div, ah well.
0:33 Naming things is always a challenge for me.
0:36 We're getting all of the info back for that particular div element.
0:40 The featured one on the page here, in this case.
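The difference can be sketched like this; the HTML here is a stand-in for the demo site, so the `featured` class and the content are assumptions:

```python
from bs4 import BeautifulSoup

html = """
<div class="featured"><h2>Mustangs</h2><p>Wild horses of the west.</p></div>
<div class="featured"><h2>Burros</h2><p>More info.</p></div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a list of every matching element...
divs = soup.find_all("div", class_="featured")
print(len(divs))  # 2

# ...while find() returns only the first match (or None if nothing matches).
first = soup.find("div", class_="featured")
print(first.h2)  # <h2>Mustangs</h2>
```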
0:43 What if we just want the header text in here?
0:46 Since it's a child element of that div, we can chain elements together.
0:53 Let's comment this out.
0:57 Close this down.
1:01 We want `featured_header = soup.find`.
1:08 We want the div with class featured.
1:17 We just want the h2 element.
1:20 And we'll print the featured header.
1:26 Nice.
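That chaining step might look something like this, with the markup assumed from the demo page:

```python
from bs4 import BeautifulSoup

html = '<div class="featured"><h2>Mustangs</h2><p>Wild horses.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Chain from the featured div down to its child h2 element.
featured_header = soup.find("div", class_="featured").h2
print(featured_header)  # <h2>Mustangs</h2>
```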
1:27 But we still have our tag elements in there.
1:30 From a data cleanliness standpoint, it would be great if we could get rid of those, right?
1:35 Well, there's a convenient method for that, called `get_text()`.
1:49 Yippee, we got some text from our site.
1:52 We scraped it out.
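Appending `get_text()` to that same chain strips the tags away; again, the markup is a stand-in for the demo site:

```python
from bs4 import BeautifulSoup

html = '<div class="featured"><h2>Mustangs</h2></div>'
soup = BeautifulSoup(html, "html.parser")

# get_text() strips the surrounding tags, leaving just the text content.
featured_header = soup.find("div", class_="featured").h2
print(featured_header.get_text())  # Mustangs
```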
1:53 There's a bit of a gotcha to watch out for with this `get_text()` method, though.
1:57 It strips away the tags from whatever we're working with, leaving just a block of text.
2:02 Let's take away this h2 element from our text value to see what I mean.
2:12 While this is perhaps more readable for us, it makes it much more challenging to process going forward.
2:18 If we wanted to select the mustangs, or the text about them, at this point it would be more of a challenge.
2:24 The thing to remember about `get_text()` is to use it as the last step in the scraping process.
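Here is a small sketch of that gotcha: calling `get_text()` on the whole div (rather than a specific child like the h2) flattens every child tag into one block of text, so the structure we'd need for further selection is gone. The markup is assumed:

```python
from bs4 import BeautifulSoup

html = '<div class="featured"><h2>Mustangs</h2><p>Wild horses of the west.</p></div>'
soup = BeautifulSoup(html, "html.parser")

# Dropping the .h2 step and calling get_text() on the entire div
# merges the heading and paragraph into a single undifferentiated string.
print(soup.find("div", class_="featured").get_text())
# MustangsWild horses of the west.
```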
2:31 We've seen that the `find()` method returns the first occurrence of an item in a Beautiful Soup object.
2:36 It is basically the `find_all()` method with the limit of results set to one.
2:41 Let's look at the parameters these methods take.
2:45 Name, which looks for tags with certain names, such as title or div.
2:50 Attrs, which allows for searching for a specific CSS class.
2:54 We'll take a look at this here shortly.
2:57 Recursive: by default, `find()` and `find_all()` examine all descendants of a tag.
3:03 If we set recursive to False, they will only look at the direct children of the tag.
3:09 String (or text), which allows for the searching of strings instead of tags.
3:15 Kwargs, which allows searching on other items, such as a CSS ID.
3:20 Limit: the `find_all()` method also accepts a limit argument to limit the number of results returned.
3:26 As I mentioned, `find()` is a `find_all()` with a limit set to one.
3:30 We can pass in a string, a list, a regular expression, the value True, or even a function to the name, string, or kwargs arguments to further enhance the searching capabilities.
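A few of those parameters can be sketched together; the page structure here is made up for illustration:

```python
import re
from bs4 import BeautifulSoup

html = """
<div id="main">
  <h1>Horse Land</h1>
  <div><h2>Mustangs</h2></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# recursive=False only checks direct children, so the nested h2 is skipped.
print(soup.find("div", id="main").find_all("h2", recursive=False))  # []

# A regular expression as the name matches h1 and h2; limit caps the results.
headers = soup.find_all(re.compile("^h[12]$"), limit=1)
print(headers)  # [<h1>Horse Land</h1>]

# Passing True matches every tag on the page.
print(len(soup.find_all(True)))  # 4
```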
3:43 Let's take a look at the attrs argument to search for the CSS class, and print out all references to this primary button class, which is this button down here.
3:54 Come back over here to our code, let's comment this out.
4:00 So, for button in `soup.find_all`.
4:07 Gonna look for a class, and that class was `button button--primary`.
4:20 And we'll just print the buttons out.
4:27 And there it is.
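That attrs search might look like this; the anchor markup is assumed, but the `button button--primary` class is the one from the demo page:

```python
from bs4 import BeautifulSoup

html = """
<a class="button button--primary" href="/signup">Sign Up</a>
<a class="button" href="/more">Learn More</a>
"""
soup = BeautifulSoup(html, "html.parser")

# attrs matches on the exact string value of the class attribute,
# so only the element with both classes is found.
for button in soup.find_all(attrs={"class": "button button--primary"}):
    print(button)
```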
4:29 Since class is a reserved word in Python, and searching for items on a page based on class is a frequent task, Beautiful Soup provides a shortcut for that.
4:39 We can change our code to use a special keyword argument, class underscore.
4:44 So we can take all this out, remove our closing curly bracket, and we get the same result with a bit less typing.
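With the same assumed markup as before, the `class_` keyword version looks like this:

```python
from bs4 import BeautifulSoup

html = """
<a class="button button--primary" href="/signup">Sign Up</a>
<a class="button" href="/more">Learn More</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) sidesteps Python's reserved word
# and gives the same result as attrs={"class": ...}.
for button in soup.find_all(class_="button button--primary"):
    print(button)
```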
4:58 Another very common task, which will be useful when we want to move from one page to another, is to get all of the hyperlinks on a page.
5:07 We can navigate into a specific tag and use the `get()` method to extract specific information.
5:13 Minimize that.
5:17 Again, we'll comment this out, just for clarity.
5:21 So, for link in `soup.find_all`, we'll look for all the anchor elements.
5:29 And then we'll print out all of the href attributes.
5:34 So link, and we'll get the hrefs.
5:42 We can look at these patterns to determine internal and external links.
5:47 Definitely a handy thing to do.
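The link-collecting loop can be sketched like this; the URLs are placeholders, not the demo site's real links:

```python
from bs4 import BeautifulSoup

html = """
<a href="/horses/mustangs">Mustangs</a>
<a href="https://example.com/partners">Partners</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Loop over every anchor element and pull out its href attribute.
# Relative paths like /horses/mustangs are internal links;
# full URLs with another domain are external.
for link in soup.find_all("a"):
    print(link.get("href"))
```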
5:50 Beautiful Soup is a very powerful tool, and we've just scratched the surface of its power.
5:55 But we've seen how we can use Python to read a webpage and get very specific data from the HTML.
6:01 It can take a bit of work to decipher the page structure, but that is time well spent for data collection.
6:08 Before we get too much further into collecting data from websites, we should talk about some other things to think about to be good data wranglers.
6:16 I'll see you all back here in a bit, and have a look.