Getting up and going with the Scrapy library.
Additional Resources
- Scrapy website
- Scrapy installation guide
The script we built to read the horse website is a basic web-crawling bot that scrapes data from a site. Python has a great library available that provides a more full-featured way to quickly extract the data you need from websites. With Scrapy, we write the rules describing the data we want extracted and let it do the rest. Let's get Scrapy installed and then set up our first spider project.
Let's look at the Scrapy installation guide. We see that it runs on Python 2.7 and on Python 3.4 and higher, and that it can be installed using conda or from PyPI with pip.
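For reference, the installation guide's two options look like this on the command line (the conda command uses the conda-forge channel, as the guide recommends):

    pip install scrapy
    conda install -c conda-forge scrapy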
Let's add this package to our project in PyCharm: search for Scrapy and install the package. If you find there are issues with your installation, check the platform-specific installation notes in the Scrapy documentation for additional information. Once it's finished installing, you can come out of here.
Go to a Terminal window and let's create a new spider. We'll call it AraneaSpider. Aranea is the name of one of Charlotte's children in the classic children's book Charlotte's Web; it's also the genus name of one of my personal favorite spiders, the orb weaver. So if we do scrapy startproject AraneaSpider, it creates our spider project for us. Running this command handles creating the directory structure and setup for a Scrapy project.
Let's see what Scrapy has provided for us. We'll minimize this. Here, under our folder, there's a scrapy.cfg file, which handles deployment configuration, and a project Python module from which we'll import our code. There are also some generated stub files whose names are pretty descriptive: items, middlewares, pipelines, and settings each hold their respective configuration code. Next is the spiders directory; this is where we'll put our spiders.
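The generated layout looks roughly like this (with your project name in place of AraneaSpider):

    AraneaSpider/
        scrapy.cfg
        AraneaSpider/
            __init__.py
            items.py
            middlewares.py
            pipelines.py
            settings.py
            spiders/
                __init__.py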
Let's talk a little bit about what a couple of these files are used for. items.py is used to define a model for the data we scrape. Scrapy spiders can return scraped data as Python dicts, but as you know, dicts lack structure. We can use items.py to create structured containers to hold the data we get from a site.
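As a quick sketch of what that looks like, here's a hypothetical item model for our horse data (the HorseItem name and its fields are made up for illustration):

    import scrapy

    class HorseItem(scrapy.Item):
        # Each Field() declares a named slot for scraped data
        name = scrapy.Field()
        color = scrapy.Field()

A spider can then return HorseItem instances instead of bare dicts, and assigning to a field you didn't declare raises an error, which is the structure plain dicts lack.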
middlewares.py allows custom functionality to be built that customizes the responses sent to spiders.
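For instance, a minimal downloader middleware, sketched here with a hypothetical class name, could inspect each response on its way to a spider (it would still need to be enabled in settings.py to take effect):

    class LoggingMiddleware:
        # Called once per response before it reaches the spider
        def process_response(self, request, response, spider):
            spider.logger.info(f"Got {response.status} from {response.url}")
            return response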
pipelines.py is used to customize the processing of data. For example, you could write a pipeline stage that cleanses the HTML, then passes the data down the processing pipeline to be validated, and then stores the information in a database. Each step along the data-processing path can be put into the pipeline.
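Each step is a class with a process_item method. This sketch (the class name and required field are hypothetical) mirrors the validation step described above; like middlewares, a pipeline runs only once it's listed in ITEM_PIPELINES in settings.py:

    from scrapy.exceptions import DropItem

    class ValidationPipeline:
        # Drop any item that's missing required data
        def process_item(self, item, spider):
            if not item.get("name"):
                raise DropItem("Missing name")
            return item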
settings.py allows for the behavior of Scrapy components to be customized.
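For example, a few commonly tweaked settings (the values here are illustrative):

    BOT_NAME = "AraneaSpider"
    ROBOTSTXT_OBEY = True   # respect each site's robots.txt
    DOWNLOAD_DELAY = 1      # seconds to wait between requests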
In our next video, let's write our first spider. I'll see you shortly.