Jason Tran
7,393 Points

Unable to find and list all external links
Hi, I've written a function to scrape the external links from Treehouse's horse website. However, my function only retrieves the first external link from the webpage (https://en.wikipedia.org/wiki/Horse) and then continues on to find the first external link in the next webpage. For example:
https://treehouse-projects.github.io/horse-land/index.html
===============================================
https://www.biodiversitylibrary.org/page/726976
===============================================
https://about.biodiversitylibrary.org
===============================================
https://biodiversitylibrary.org/
and so on....
How would I go about finding and listing external links that exist only on the first webpage (in this case the Treehouse horse webpage)? For instance, I would like my final site_links list to be the following:
https://en.wikipedia.org/wiki/Horse
=========================================
https://commons.wikimedia.org/wiki/Horse_breeds
=========================================
https://commons.wikimedia.org/wiki/Horse_breeds
=========================================
https://creativecommons.org/licenses/by-sa/3.0/
My code is the following:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

site_links = []


def external_links(linkURL):
    linkURL = re.sub('https://', '', linkURL)
    html = urlopen('https://{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('a', href=re.compile('(^https://)'))


if __name__ == '__main__':
    urls = external_links('treehouse-projects.github.io/horse-land/index.html')
    while len(urls) > 0:
        page = urls.attrs['href']
        if page not in site_links:
            site_links.append(page)
            print(page)
            print('\n===================================\n')
            urls = external_links(page)
        else:
            break
Thanks, any help is greatly appreciated!!
1 Answer
Beau Genereux
5,362 Points

This is my solution, but:
- I did not check whether it parses any external links on mustang.html
- I am pretty sure there are more straightforward ways to code it, but this is where I am for now
- I wanted to use a two-dimensional array all_links[[internal_links],[external_links]] but was already a bit challenged (see the sketch after the code below)
Hope it helps :)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

# CHALLENGE: Fetch all external urls.

# Empty lists to store the links.
links_int = []
links_ext = []


# Fetches all links on a page and stores the new external ones.
def all_links(linkURL):
    # print('## inside def all_links()')
    html = urlopen('https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    # Get all the links on the page.
    for link in soup.find_all('a'):
        while len(links_int) > 0:
            page = link.attrs['href']
            if page not in links_int and page not in links_ext:
                links_ext.append(page)
                # print(page)
            else:
                break


# Fetches internal links only.
def internal_links(linkURL):
    html = urlopen('https://treehouse-projects.github.io/horse-land/{}'.format(linkURL))
    soup = BeautifulSoup(html, 'html.parser')
    return soup.find('a', href=re.compile('(.html)$'))


# I read about this but don't understand it yet.
if __name__ == '__main__':
    # print('## inside if __name__')
    urls = internal_links('index.html')
    while len(urls) > 0:
        page = urls.attrs['href']
        if page not in links_int:
            links_int.append(page)
            # print(page)
            # print('\n')
            urls = internal_links(page)
        else:
            all_links(page)
            break
    print('links_int =', links_int)
    print('links_ext =', links_ext)
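On the two-dimensional array idea mentioned above, here is a minimal sketch of how both kinds of links could live in one nested list instead of two globals. The all_links container, the collect_links helper, the BASE constant, and the https?:// regex are all assumptions for illustration, not part of the solution above:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

BASE = 'https://treehouse-projects.github.io/horse-land/'

# all_links[0] holds internal links, all_links[1] holds external links.
all_links = [[], []]


def collect_links(page_name):
    # Fetch one page and sort its anchors into the nested list.
    html = urlopen(BASE + page_name)
    soup = BeautifulSoup(html, 'html.parser')
    for anchor in soup.find_all('a', href=True):
        href = anchor['href']
        # Bucket 1 (external) for absolute http(s) links, bucket 0 (internal) otherwise.
        bucket = 1 if re.match('^https?://', href) else 0
        if href not in all_links[bucket]:
            all_links[bucket].append(href)


collect_links('index.html')
print('internal =', all_links[0])
print('external =', all_links[1])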
Josh Stephens
13,529 Points

I think what is happening is that you are using find() in your external_links function, which only gets the first anchor tag. However, if you change it to findAll() you get back a list with all the anchors for that page. I wrote a recursive version in haste but it scraped a little too much; if you want to try and correct it, here it is