Web Scraping with Pandas
Web scraping is a technique for extracting data from websites. Pandas provides the read_html() function to extract tabular data directly from HTML pages into DataFrames. This tutorial demonstrates how to scrape and analyze web-based data using Pandas.
Reading HTML Tables
The read_html() function automatically parses the tables on an HTML page. It returns a list of DataFrames, one for each table found. Here's an example:
import pandas as pd
# Read tables from a web page
tables = pd.read_html("https://example.com/table-page")
# Display the first table
table1 = tables[0]
print(table1)
Output: The first table from the webpage is loaded into a DataFrame.
Explanation: The read_html() function parses all HTML tables at the specified URL. Each table is returned as a DataFrame in a list, so you can access individual tables by index.
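read_html() also accepts raw HTML directly, which is handy for experimenting without a live website; since pandas 2.1 the HTML string should be wrapped in a StringIO object. A minimal sketch (the table contents are made up, and an HTML parser library such as lxml, html5lib, or BeautifulSoup must be installed for read_html() to work):

```python
from io import StringIO

import pandas as pd

# A small HTML document containing one table (stand-in for a real page)
html = """
<table>
  <tr><th>Product</th><th>Price</th></tr>
  <tr><td>Widget</td><td>9.99</td></tr>
  <tr><td>Gadget</td><td>19.99</td></tr>
</table>
"""

# read_html() returns a list of DataFrames, one per table found
tables = pd.read_html(StringIO(html))
df = tables[0]

print(len(tables))          # 1
print(df.columns.tolist())  # ['Product', 'Price']
```

Rows marked with th cells become the DataFrame's header automatically, and numeric columns such as Price are converted to numeric dtypes.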
Specifying Match Criteria
If a webpage contains multiple tables, you can filter them using the match parameter. This ensures you only scrape tables containing specific keywords. Here's an example:
# Read tables containing the keyword "Sales"
tables = pd.read_html("https://example.com/table-page", match="Sales")
# Display the filtered table
print(tables[0])
Output: Only tables containing the keyword "Sales" are parsed and loaded.
Explanation: The match parameter filters tables based on the presence of a specific keyword, ensuring you only extract relevant data.
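The filtering behavior can be seen with two inline tables, only one of which mentions "Sales". This is a sketch with made-up table contents, and it requires an HTML parser library (lxml, html5lib, or BeautifulSoup) just like any other read_html() call:

```python
from io import StringIO

import pandas as pd

html = """
<table><tr><th>Region</th><th>Sales</th></tr>
       <tr><td>North</td><td>100</td></tr></table>
<table><tr><th>Name</th><th>Age</th></tr>
       <tr><td>Ada</td><td>36</td></tr></table>
"""

# Without match: every table on the page is returned
all_tables = pd.read_html(StringIO(html))

# With match: only tables whose text contains "Sales" are kept
sales_tables = pd.read_html(StringIO(html), match="Sales")

print(len(all_tables), len(sales_tables))  # 2 1
```

match is treated as a regular expression, so more complex patterns work as well.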
Handling Web Scraping Errors
If a website does not contain valid HTML tables or the URL is inaccessible, errors may occur. Handle these scenarios using a try-except block. Here’s an example:
try:
    tables = pd.read_html("https://example.com/nonexistent-page")
    print(tables[0])
except ValueError:
    print("No tables found on the page.")
except Exception as e:
    print(f"An error occurred: {e}")
Output: Error messages are displayed if no tables are found or if other issues occur.
Explanation: The try-except block handles errors gracefully by printing a meaningful message when no tables are found or the webpage is inaccessible.
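The same pattern can be wrapped in a small reusable helper that returns an empty list instead of crashing. fetch_tables is a hypothetical name (not part of pandas), and the sketch assumes an HTML parser library is installed:

```python
from io import StringIO

import pandas as pd

def fetch_tables(source):
    """Return all HTML tables found in `source`, or [] if none are found
    or the source is unreachable."""
    try:
        return pd.read_html(source)
    except ValueError:
        # read_html() raises ValueError when the page has no <table> elements
        print("No tables found on the page.")
        return []
    except Exception as e:
        print(f"An error occurred: {e}")
        return []

# A document with no <table> elements triggers the ValueError branch
tables = fetch_tables(StringIO("<p>No tables here.</p>"))
print(len(tables))  # 0
```

Callers can then loop over the result unconditionally, since the helper always returns a list.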
Saving Scraped Data
Once scraped, you can save the DataFrame to various file formats such as CSV or Excel for further analysis. Here’s an example:
# Save the first table to a CSV file
table1.to_csv("scraped_data.csv", index=False)
print("Scraped data saved to scraped_data.csv")
Output: The scraped table is saved as scraped_data.csv.
Explanation: The to_csv() method saves the scraped DataFrame to a CSV file, allowing you to preserve and share the data for further use.
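A quick round trip confirms that the export preserves the data; here a small hand-built DataFrame stands in for a scraped table:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for a table returned by read_html()
table1 = pd.DataFrame({"Region": ["North", "South"], "Sales": [100, 150]})

# Save without the index, then read the file back to verify
out = Path(tempfile.mkdtemp()) / "scraped_data.csv"
table1.to_csv(out, index=False)

reloaded = pd.read_csv(out)
print(reloaded.equals(table1))  # True
```

Passing index=False keeps the row index out of the file, so the reloaded DataFrame matches the original exactly.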
Key Takeaways
- HTML Tables: Use read_html() to parse tables from web pages into DataFrames.
- Filtering: Use the match parameter to extract tables containing specific keywords.
- Error Handling: Implement try-except blocks to manage missing tables or inaccessible pages.
- Data Export: Save scraped data to files for further analysis.
- Efficiency: Scrape and manage web-based data efficiently using Pandas' built-in methods.