Advanced Web Scraping Techniques
Scraping modern websites can be challenging. Many rely heavily on JavaScript to load data dynamically, or they use infinite scrolling to display only a fraction of the total content at a time. Selenium and other tools help render these dynamic pages, while handling pagination or infinite scroll requires careful strategy.
Key Topics
- Scraping Dynamic Content with Selenium and BeautifulSoup
- Handling Pagination and Infinite Scrolling
- Scraping Websites Protected by Login Forms
- Scraping AJAX-Powered Websites
Scraping Dynamic Content with Selenium and BeautifulSoup
Selenium automates real web browsers (e.g., Chrome, Firefox), allowing you to render JavaScript-driven pages just like a human user. You can then feed the rendered HTML into BeautifulSoup for parsing.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument('--headless') # Run in headless mode
# Make sure to download the appropriate ChromeDriver
browser = webdriver.Chrome(options=chrome_options)
browser.get("https://dynamicexample.com")
page_source = browser.page_source
soup = BeautifulSoup(page_source, "html.parser")
# Extract data using BeautifulSoup
data = soup.select("div.content")
for item in data:
    print(item.get_text())
browser.quit()
Explanation: Selenium launches a headless Chrome browser, navigates to the page, and loads all dynamic content. We then grab the final rendered HTML with browser.page_source and parse it using BeautifulSoup.
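Rendering is not always instant: if the data is injected only after additional scripts run, reading page_source immediately after get() can return an incomplete page. Below is a minimal sketch using Selenium's explicit waits, where div.content stands in for whatever element signals that the page has finished loading:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = Options()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(options=chrome_options)
browser.get("https://dynamicexample.com")
# Wait up to 10 seconds for the target element to appear before reading the HTML
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.content"))
)
page_source = browser.page_source
browser.quit()
Explicit waits are generally more reliable than fixed time.sleep() pauses because they return as soon as the condition is met.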
Handling Pagination and Infinite Scrolling
Many sites split results across multiple pages or dynamically load more content as the user scrolls. For multi-page structures, you can increment through URLs or click a "Next" button via Selenium until all data is collected. Infinite scroll might require repeatedly sending scroll commands and capturing newly loaded sections.
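For simple numbered pagination you often don't need a browser at all. Here is a minimal sketch that walks page-numbered URLs with requests; the ?page= parameter and the div.item selector are assumptions, so check the real site's URL pattern and markup first.
import requests
from bs4 import BeautifulSoup
all_items = []
for page in range(1, 6):  # Pages 1 through 5; adjust to the site's page count
    response = requests.get("https://example.com/listing", params={"page": page})
    if response.status_code != 200:
        break  # Stop when a page is missing or the server rejects the request
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.select("div.item")
    if not items:
        break  # An empty page means there are no more results
    all_items.extend(item.get_text(strip=True) for item in items)
print(len(all_items), "items collected")
Infinite scrolling, by contrast, usually requires a real browser that can trigger the scroll events, as in the following script.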
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
browser.get("https://example.com/infinite-scroll")
# Scroll to the bottom of the page multiple times
for _ in range(10): # Adjust range for more scrolling
    browser.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
    time.sleep(2)  # Wait for new content to load
page_source = browser.page_source
browser.quit()
# Parse the final HTML with BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")
data = soup.select("div.content")
for item in data:
    print(item.get_text())
Explanation: This script scrolls to the bottom of the page multiple times to load new content. Adjust the range and sleep time based on the site's behavior and loading speed.
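A fixed number of scrolls can stop too early or waste time once the feed is exhausted. A common refinement, sketched here without reference to any particular site, is to keep scrolling until the page height stops growing:
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://example.com/infinite-scroll")
last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Give the site time to append new content
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # Nothing new was loaded, so we have reached the end
    last_height = new_height
page_source = browser.page_source
browser.quit()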
Scraping Websites Protected by Login Forms
Sites that require authentication often have login forms. You can either programmatically submit the form via requests (if it’s a simple POST form) or use Selenium to automate the login flow. Either way, you’ll need to manage sessions or cookies to maintain a logged-in state.
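For a plain HTML form that posts credentials directly to the server, a requests.Session keeps the login cookies for later requests. This is a minimal sketch: the /login URL and the username/password field names are assumptions borrowed from the Selenium example below, and sites that use CSRF tokens or JavaScript-driven logins need extra handling.
import requests
from bs4 import BeautifulSoup
session = requests.Session()
login_data = {"username": "your_username", "password": "your_password"}
# The session stores any cookies set by the login response
response = session.post("https://example.com/login", data=login_data)
response.raise_for_status()
# Later requests reuse those cookies, so protected pages become accessible
protected = session.get("https://example.com/protected-page")
soup = BeautifulSoup(protected.text, "html.parser")
for item in soup.select("div.protected-content"):
    print(item.get_text())
If the login form is rendered or submitted by JavaScript, fall back to Selenium as shown next.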
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
browser.get("https://example.com/login")
# Fill in the login form
username = browser.find_element(By.NAME, "username")
password = browser.find_element(By.NAME, "password")
username.send_keys("your_username")
password.send_keys("your_password")
password.send_keys(Keys.RETURN)
# Wait for login to complete
time.sleep(5)
# Navigate to the protected page
browser.get("https://example.com/protected-page")
page_source = browser.page_source
browser.quit()
# Parse the final HTML with BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")
data = soup.select("div.protected-content")
for item in data:
    print(item.get_text())
Explanation: This script automates the login process using Selenium, then navigates to a protected page and extracts data. Adjust the element selectors and URLs based on the actual site structure.
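Once Selenium has logged in, you can also hand its cookies to a requests session and fetch further pages without the browser overhead. A sketch of that handoff, assuming it runs before browser.quit() in the script above:
import requests
session = requests.Session()
# Copy the authenticated cookies from the Selenium browser into the requests session
for cookie in browser.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))
# Requests made through the session now carry the logged-in cookies
response = session.get("https://example.com/protected-page")
print(response.status_code)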
Scraping AJAX-Powered Websites
AJAX requests load data asynchronously without refreshing the page. You can intercept these requests using your browser's developer tools (the Network tab) to find the API endpoints they hit, then use requests to fetch data directly from those endpoints.
import requests
api_url = "https://example.com/api/data"
params = {
    "param1": "value1",
    "param2": "value2"
}
response = requests.get(api_url, params=params)
if response.status_code == 200:
    data = response.json()
    for item in data['results']:
        print(item)
else:
    print("Failed to retrieve data. Status code:", response.status_code)
Explanation: This script sends a GET request to an API endpoint used by the website's AJAX calls, then processes the JSON response. Adjust the URL and parameters based on the actual API.
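Some endpoints reject requests that don't look like the site's own AJAX calls, or split their results across pages. The sketch below adds common request headers and walks a page parameter; the header values and the page/results keys are assumptions, so copy the real ones from the Network tab of your browser's developer tools.
import requests
api_url = "https://example.com/api/data"
headers = {
    "User-Agent": "Mozilla/5.0",           # Some APIs block the default requests user agent
    "X-Requested-With": "XMLHttpRequest",  # Often sent by the site's own AJAX calls
}
all_results = []
page = 1
while True:
    response = requests.get(api_url, headers=headers, params={"page": page})
    if response.status_code != 200:
        break
    results = response.json().get("results", [])
    if not results:
        break  # An empty page means everything has been fetched
    all_results.extend(results)
    page += 1
print(len(all_results), "records fetched")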
Key Takeaways
- JavaScript Content: Use Selenium or similar tools to load pages that rely heavily on JS.
- Pagination & Scroll: Automate multi-page or infinite scrolling logic to capture all data.
- Authentication: Manage sessions and cookies when dealing with login-protected areas.
- AJAX Requests: Intercept and replicate AJAX calls to fetch data directly from API endpoints.