Advanced Web Scraping Techniques
Scraping modern websites can be challenging. Many rely heavily on JavaScript to load data dynamically, or they use infinite scrolling to display only a fraction of the total content at a time. Selenium and other tools help render these dynamic pages, while handling pagination or infinite scroll requires careful strategy.
Key Topics
- Scraping Dynamic Content with Selenium and BeautifulSoup
- Handling Pagination and Infinite Scrolling
- Scraping Websites Protected by Login Forms
- Scraping AJAX-Powered Websites
Scraping Dynamic Content with Selenium and BeautifulSoup
Selenium automates real web browsers (e.g., Chrome, Firefox), allowing you to render JavaScript-driven pages just like a human user. You can then feed the rendered HTML into BeautifulSoup for parsing.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup
chrome_options = Options()
chrome_options.add_argument('--headless') # Run in headless mode
# Make sure to download the appropriate ChromeDriver
browser = webdriver.Chrome(options=chrome_options)
browser.get("https://dynamicexample.com")
page_source = browser.page_source
soup = BeautifulSoup(page_source, "html.parser")
# Extract data using BeautifulSoup
data = soup.select("div.content")
for item in data:
    print(item.get_text())
browser.quit()
Explanation: Selenium launches a headless Chrome browser, navigates to the page, and loads all dynamic content. We then grab the final rendered HTML with browser.page_source and parse it using BeautifulSoup.
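Rendering is not always instant: if the data is injected only after additional scripts run, reading page_source immediately after get() can return an incomplete page. Below is a minimal sketch using Selenium's explicit waits, where div.content stands in for whatever element signals that the page has finished loading:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
chrome_options = Options()
chrome_options.add_argument('--headless')
browser = webdriver.Chrome(options=chrome_options)
browser.get("https://dynamicexample.com")
# Wait up to 10 seconds for the target element to appear before reading the HTML
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.content"))
)
page_source = browser.page_source
browser.quit()
Explicit waits are generally more reliable than fixed time.sleep() pauses because they return as soon as the condition is met.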
Handling Pagination and Infinite Scrolling
Many sites split results across multiple pages or dynamically load more content as the user scrolls. For multi-page structures, you can increment through URLs or click a "Next" button via Selenium until all data is collected. Infinite scroll might require repeatedly sending scroll commands and capturing newly loaded sections.
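For simple numbered pagination you often don't need a browser at all. Here is a minimal sketch that walks page-numbered URLs with requests; the ?page= parameter and the div.item selector are assumptions, so check the real site's URL pattern and markup first.
import requests
from bs4 import BeautifulSoup
all_items = []
for page in range(1, 6):  # Pages 1 through 5; adjust to the site's page count
    response = requests.get("https://example.com/listing", params={"page": page})
    if response.status_code != 200:
        break  # Stop when a page is missing or the server rejects the request
    soup = BeautifulSoup(response.text, "html.parser")
    items = soup.select("div.item")
    if not items:
        break  # An empty page means there are no more results
    all_items.extend(item.get_text(strip=True) for item in items)
print(len(all_items), "items collected")
Infinite scrolling, by contrast, usually requires a real browser that can trigger the scroll events, as in the following script.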
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
browser.get("https://example.com/infinite-scroll")
# Scroll to the bottom of the page multiple times
for _ in range(10): # Adjust range for more scrolling
    browser.find_element(By.TAG_NAME, 'body').send_keys(Keys.END)
    time.sleep(2)  # Wait for new content to load
page_source = browser.page_source
browser.quit()
# Parse the final HTML with BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")
data = soup.select("div.content")
for item in data:
    print(item.get_text())
Explanation: This script scrolls to the bottom of the page multiple times to load new content. Adjust the range and sleep time based on the site's behavior and loading speed.
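A fixed number of scrolls can stop too early or waste time once the feed is exhausted. A common refinement, sketched here without reference to any particular site, is to keep scrolling until the page height stops growing:
import time
from selenium import webdriver
browser = webdriver.Chrome()
browser.get("https://example.com/infinite-scroll")
last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)  # Give the site time to append new content
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break  # Nothing new was loaded, so we have reached the end
    last_height = new_height
page_source = browser.page_source
browser.quit()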
Scraping Websites Protected by Login Forms
Sites that require authentication often have login forms. You can either programmatically submit the form via requests (if it’s a simple POST form) or use Selenium to automate the login flow. Either way, you’ll need to manage sessions or cookies to maintain a logged-in state.
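For a plain HTML form that posts credentials directly to the server, a requests.Session keeps the login cookies for later requests. This is a minimal sketch: the /login URL and the username/password field names are assumptions borrowed from the Selenium example below, and sites that use CSRF tokens or JavaScript-driven logins need extra handling.
import requests
from bs4 import BeautifulSoup
session = requests.Session()
login_data = {"username": "your_username", "password": "your_password"}
# The session stores any cookies set by the login response
response = session.post("https://example.com/login", data=login_data)
response.raise_for_status()
# Later requests reuse those cookies, so protected pages become accessible
protected = session.get("https://example.com/protected-page")
soup = BeautifulSoup(protected.text, "html.parser")
for item in soup.select("div.protected-content"):
    print(item.get_text())
If the login form is rendered or submitted by JavaScript, fall back to Selenium as shown next.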
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
browser.get("https://example.com/login")
# Fill in the login form
username = browser.find_element(By.NAME, "username")
password = browser.find_element(By.NAME, "password")
username.send_keys("your_username")
password.send_keys("your_password")
password.send_keys(Keys.RETURN)
# Wait for login to complete
time.sleep(5)
# Navigate to the protected page
browser.get("https://example.com/protected-page")
page_source = browser.page_source
browser.quit()
# Parse the final HTML with BeautifulSoup
soup = BeautifulSoup(page_source, "html.parser")
data = soup.select("div.protected-content")
for item in data:
    print(item.get_text())
Explanation: This script automates the login process using Selenium, then navigates to a protected page and extracts data. Adjust the element selectors and URLs based on the actual site structure.
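Once Selenium has logged in, you can also hand its cookies to a requests session and fetch further pages without the browser overhead. A sketch of that handoff, assuming it runs before browser.quit() in the script above:
import requests
session = requests.Session()
# Copy the authenticated cookies from the Selenium browser into the requests session
for cookie in browser.get_cookies():
    session.cookies.set(cookie["name"], cookie["value"], domain=cookie.get("domain"))
# Requests made through the session now carry the logged-in cookies
response = session.get("https://example.com/protected-page")
print(response.status_code)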
Scraping AJAX-Powered Websites
AJAX requests load data asynchronously without refreshing the page. You can intercept these requests using your browser's developer tools (the Network tab) to find the API endpoints they hit, then use requests to fetch data directly from those endpoints.
import requests
api_url = "https://example.com/api/data"
params = {
    "param1": "value1",
    "param2": "value2"
}
response = requests.get(api_url, params=params)
if response.status_code == 200:
    data = response.json()
    for item in data['results']:
        print(item)
else:
    print("Failed to retrieve data. Status code:", response.status_code)
Explanation: This script sends a GET request to an API endpoint used by the website's AJAX calls, then processes the JSON response. Adjust the URL and parameters based on the actual API.
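Some endpoints reject requests that don't look like the site's own AJAX calls, or split their results across pages. The sketch below adds common request headers and walks a page parameter; the header values and the page/results keys are assumptions, so copy the real ones from the Network tab of your browser's developer tools.
import requests
api_url = "https://example.com/api/data"
headers = {
    "User-Agent": "Mozilla/5.0",           # Some APIs block the default requests user agent
    "X-Requested-With": "XMLHttpRequest",  # Often sent by the site's own AJAX calls
}
all_results = []
page = 1
while True:
    response = requests.get(api_url, headers=headers, params={"page": page})
    if response.status_code != 200:
        break
    results = response.json().get("results", [])
    if not results:
        break  # An empty page means everything has been fetched
    all_results.extend(results)
    page += 1
print(len(all_results), "records fetched")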
Key Takeaways
- JavaScript Content: Use Selenium or similar tools to load pages that rely heavily on JS.
- Pagination & Scroll: Automate multi-page or infinite scrolling logic to capture all data.
- Authentication: Manage sessions and cookies when dealing with login-protected areas.
- AJAX Requests: Intercept and replicate AJAX calls to fetch data directly from API endpoints.