doug james
129 Points
HTML-TABLE Scraping
I am trying to write an HTML table scraper function, but it is not working as expected. I tried it on a Wikipedia table, and the output is just blank lines.
Here is the code:
import pandas as pd
import requests
from bs4 import BeautifulSoup

def scrape_table(url):
    soup = BeautifulSoup(requests.get(url).text)
    poll_table = soup.find('table')
    headers = [header.text for listing in poll_table.find_all('thead') for header in listing.find_all('th')]
    raw_data = {header:[] for header in headers}
    for rows in soup.find_all('tbody'):
        for row in rows.find_all('tr'):
            if len(row) != len(headers): continue
            for idx, cell in enumerate(row.find_all('td')):
                if row.find_all('td'):
                    raw_data[headers[idx]].append(cell.text)
                else:
                    raw_data[headers[idx]].append('')
    return pd.DataFrame(raw_data)
1 Answer
Chris Freeman
Treehouse Moderator 68,454 Points
Hey doug james, a very interesting question. The short answer is there is no "thead" to find.
After inspecting the page source (using "show source", not the page inspector), and also dumping the values of soup and poll_table to a file for inspection, there is no "thead" to be found. The wiki markup source for the page has no explicit "thead" mechanism either. I suspect that the thead is added dynamically by JavaScript when the page loads, which would explain why "thead" is not in the scraped data.
The wiki source lists the table classes as "wikitable sortable", but the element inspector shows the table classes as "wikitable sortable jquery-tablesorter". So that might be a key to where the "thead" is being inserted.
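To make that check concrete, here is a small sketch of the same inspection done in code. The markup below is a made-up stand-in for the server-sent HTML (the column names and values are hypothetical, not taken from the real page), and it assumes the header cells sit in the first row of the tbody, which is the situation described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup shaped like the raw (pre-JavaScript) HTML of a
# sortable wiki table: header cells are in the first <tr>, no <thead>.
RAW_HTML = """
<table class="wikitable sortable">
<tbody>
<tr><th>Pollster</th><th>Date</th></tr>
<tr><td>Example A</td><td>2020-01-01</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(RAW_HTML, "html.parser")
poll_table = soup.find("table")

# Classes as they appear in the raw HTML (no "jquery-tablesorter" yet).
table_classes = poll_table.get("class")

# Count the elements the original scraper depends on.
thead_count = len(poll_table.find_all("thead"))
th_count = len(poll_table.find_all("th"))

# Same header lookup as in the question: it walks <thead> elements only.
headers = [th.text for thead in poll_table.find_all("thead")
           for th in thead.find_all("th")]
raw_data = {header: [] for header in headers}

print("classes:", table_classes)
print("thead elements:", thead_count)
print("th cells:", th_count)
print("headers found via thead:", headers)
print("raw_data:", raw_data)
```

With no thead in the parsed document, the headers list comes back empty, so raw_data has no columns and the resulting DataFrame is empty, which matches the blank output in the question.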
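In general, it's better to use the "page source" (Ctrl-U in Firefox) as the structure map when scraping, rather than the element inspector. requests only receives the HTML the server sends, so anything JavaScript adds afterward will not be in the scraped data.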
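One possible way to adapt the scraper to the page source structure is to take the headers from the first row that contains th cells instead of looking for a thead. This is only a sketch under that assumption: parse_table and SAMPLE are hypothetical names, the sample markup is invented, and rows whose cell count does not match the headers (for example, rows using colspan) are simply skipped.

```python
from bs4 import BeautifulSoup


def parse_table(html):
    """Return {header: [cell text, ...]} for the first table in html."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        return {}

    rows = table.find_all("tr")
    # Treat the first row containing <th> cells as the header row,
    # since the raw page source has no <thead> to rely on.
    header_row = next((row for row in rows if row.find("th")), None)
    if header_row is None:
        return {}

    headers = [th.get_text(strip=True) for th in header_row.find_all("th")]
    data = {header: [] for header in headers}

    for row in rows:
        if row is header_row:
            continue
        cells = row.find_all("td")
        # Skip rows that do not line up with the headers (e.g. colspan rows).
        if len(cells) != len(headers):
            continue
        for header, cell in zip(headers, cells):
            data[header].append(cell.get_text(strip=True))
    return data


# Invented sample input, shaped like server-sent HTML without a <thead>.
SAMPLE = """
<table class="wikitable sortable"><tbody>
<tr><th>Pollster</th><th>Date</th></tr>
<tr><td>Example A</td><td>2020-01-01</td></tr>
<tr><td colspan="2">footnote row</td></tr>
<tr><td>Example B</td><td>2020-02-01</td></tr>
</tbody></table>
"""

result = parse_table(SAMPLE)
print(result)
```

To use it the way the question does, pass in requests.get(url).text and wrap the returned dict in pd.DataFrame. Note that duplicate header names would collapse into a single column here, and nested tables are not handled.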
Post back if you have more questions. Good luck!!