sharing my code: web crawling

Question

Hi, here's my code solution, which we were encouraged to share with the community. Hope it works for everyone!

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen('https://treehouse-projects.github.io/horse-land/index.html')
soup = BeautifulSoup(html.read(), 'html.parser')

for link in soup.find_all('a'):
    print(link.get('href'))

site_links = []

# Here's a function that will receive a website's internal link and parse the html on that page.

def internal_links(linkURL):
    html = urlopen('https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    # Escape the dot so it matches a literal '.'; find() returns only the first match, or None.
    return soup.find('a', href=re.compile(r'\.html$'))


if __name__ == '__main__':
    urls = internal_links("index.html")

    # find() returns a single Tag or None, so test for None instead of a length.
    while urls is not None:
        page = urls.attrs['href']
        if page not in site_links:
            site_links.append(page)

            print(page)
            print('\n=============\n')
            urls = internal_links(page)
        else:
            break

Answer 1 · 2020-11-06T07:14:46Z

Nice work, your logic makes sense. However, I believe your solution for printing the external links will also include 'mustang.html', which is an internal link. To display only the external links, that internal link has to be filtered out.
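
One way to do that filtering, as a rough sketch rather than the course's official solution: resolve each href against the page URL and compare hosts. The names is_external, external_links and sample_hrefs are made up for illustration, and the Wikipedia URL is just a placeholder for any link that leaves the site.

```python
from urllib.parse import urljoin, urlparse

# Base page from the original post.
BASE_URL = 'https://treehouse-projects.github.io/horse-land/index.html'


def is_external(href, base_url=BASE_URL):
    """Return True if href points to a different host than base_url."""
    # urljoin resolves relative links such as 'mustang.html' against the base page.
    absolute = urljoin(base_url, href)
    return urlparse(absolute).netloc != urlparse(base_url).netloc


def external_links(hrefs, base_url=BASE_URL):
    """Keep only the hrefs that leave the site, skipping missing values."""
    return [href for href in hrefs if href and is_external(href, base_url)]


# Hypothetical sample values; in the crawler they would come from
# [link.get('href') for link in soup.find_all('a')].
sample_hrefs = ['mustang.html', 'https://en.wikipedia.org/wiki/Horse', None]
print(external_links(sample_hrefs))
```

With the sample list, only the Wikipedia URL is printed, because 'mustang.html' resolves to the same host as the base page. Note that this treats anything on another host (or with no host at all, such as a mailto: link) as external, which may or may not be what you want.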

Mark Chesney

Linda Shum