How do I scrape data from responsive web designs?
Scraping data from responsive web designs presents unique challenges because content structure, CSS classes, and even element visibility can change based on screen size and device type. Responsive websites dynamically adapt their layout, which means your scraping strategy must account for these variations to reliably extract data across different viewport configurations.
Understanding Responsive Design Challenges
Responsive web designs use CSS media queries, flexible grids, and dynamic content loading to provide optimal user experiences across devices. For web scrapers, this creates several challenges:
- Dynamic CSS classes: Elements may have different classes for mobile vs desktop views
- Hidden elements: Content might be hidden on certain screen sizes using display: none or visibility: hidden
- Layout shifts: Element positioning and hierarchy can change dramatically
- Content prioritization: Some content may only appear on specific device sizes
- JavaScript-dependent rendering: Mobile layouts often rely heavily on JavaScript for functionality
Setting Up Simple HTML DOM for Responsive Scraping
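Simple HTML DOM is a server-side PHP HTML parser: it does not execute JavaScript or apply CSS, so it sees the markup exactly as the server sent it. That is often enough for responsive sites, because content hidden by media queries is still present in the HTML; the work lies in choosing selectors that hold up across layouts.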
Basic Setup with User Agent Rotation
<?php
require_once 'simple_html_dom.php';

class ResponsiveScraper {
    private $user_agents = [
        'desktop' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
        'mobile' => 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1',
        'tablet' => 'Mozilla/5.0 (iPad; CPU OS 14_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Mobile/15E148 Safari/604.1'
    ];

    public function fetchHTML($url, $device_type = 'desktop') {
        // Fall back to the desktop user agent for unknown device types
        $user_agent = $this->user_agents[$device_type] ?? $this->user_agents['desktop'];
        $context = stream_context_create([
            'http' => [
                'header' => "User-Agent: " . $user_agent . "\r\n",
                'timeout' => 30
            ]
        ]);
        // file_get_contents() returns false on network or HTTP errors
        $html = @file_get_contents($url, false, $context);
        if ($html === false) {
            return false;
        }
        // str_get_html() also returns false for empty or oversized input
        return str_get_html($html);
    }
}

$scraper = new ResponsiveScraper();
$dom = $scraper->fetchHTML('https://example.com', 'mobile');
?>
Strategies for Responsive Data Extraction
1. Multi-Viewport Scraping Approach
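A purely responsive site sends the same HTML to every device and adapts it with CSS, so changing the user agent only matters when the server also performs device detection (adaptive or dynamic serving). Many sites mix both techniques, so scrape the same page with several user agents and compare what comes back: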
<?php
function scrapeMultipleViewports($url) {
    $scraper = new ResponsiveScraper();
    $results = [];
    foreach (['desktop', 'tablet', 'mobile'] as $device) {
        $dom = $scraper->fetchHTML($url, $device);
        if ($dom) {
            $results[$device] = [
                'title' => $dom->find('title', 0)->plaintext ?? '',
                'main_content' => $dom->find('.main-content, #content, .content', 0)->plaintext ?? '',
                'navigation' => extractNavigation($dom, $device),
                'sidebar' => $dom->find('.sidebar, .side-nav', 0)->plaintext ?? ''
            ];
        }
    }
    return $results;
}

function extractNavigation($dom, $device) {
    // Look for common responsive navigation patterns
    $nav_selectors = [
        '.navbar ul li a',
        '.mobile-menu a',
        '.hamburger-menu a',
        'nav a',
        '.menu-item a'
    ];
    $nav_items = [];
    foreach ($nav_selectors as $selector) {
        $elements = $dom->find($selector);
        foreach ($elements as $element) {
            if (!empty(trim($element->plaintext))) {
                $nav_items[] = [
                    'text' => trim($element->plaintext),
                    'href' => $element->href ?? '',
                    'device' => $device,
                    'selector' => $selector
                ];
            }
        }
    }
    return $nav_items;
}
?>
2. CSS Class Pattern Recognition
Responsive sites often use predictable CSS class patterns. Create robust selectors that account for these variations:
<?php
function extractWithResponsiveSelectors($dom) {
    // Common responsive patterns
    $responsive_patterns = [
        // Bootstrap-style responsive classes
        'content' => [
            '.col-12',
            '.col-md-8',
            '.col-lg-6',
            '.content',
            '.main-content',
            'main'
        ],
        'sidebar' => [
            '.col-md-4',
            '.col-lg-3',
            '.sidebar',
            '.aside',
            'aside'
        ],
        'navigation' => [
            '.navbar-nav',
            '.nav-menu',
            '.mobile-nav',
            '.hamburger-menu',
            '.menu'
        ]
    ];
    $extracted_data = [];
    foreach ($responsive_patterns as $section => $selectors) {
        foreach ($selectors as $selector) {
            $element = $dom->find($selector, 0);
            if ($element && !empty(trim($element->plaintext))) {
                $extracted_data[$section] = [
                    'content' => trim($element->plaintext),
                    'html' => $element->outertext,
                    'selector_used' => $selector
                ];
                break; // Use first successful match
            }
        }
    }
    return $extracted_data;
}
?>
3. Handling Hidden and Collapsed Content
Many responsive designs hide content using CSS. Because Simple HTML DOM does not apply CSS, hidden elements are still present in the parsed document; you can find them by looking for the class names that frameworks commonly use to hide or collapse content:
<?php
function extractHiddenContent($dom) {
    $hidden_content = [];
    // Look for commonly hidden elements
    $hidden_selectors = [
        '.hidden-xs',       // Bootstrap hidden on mobile
        '.hidden-sm',       // Bootstrap hidden on small screens
        '.d-none',          // Bootstrap 4+ display none
        '.collapse',        // Collapsible content
        '.accordion-body',  // Accordion content
        '.tab-content',     // Tab content
        '.dropdown-menu'    // Dropdown menus
    ];
    foreach ($hidden_selectors as $selector) {
        $elements = $dom->find($selector);
        foreach ($elements as $element) {
            if (!empty(trim($element->plaintext))) {
                $hidden_content[] = [
                    'selector' => $selector,
                    'content' => trim($element->plaintext),
                    'html' => $element->outertext
                ];
            }
        }
    }
    return $hidden_content;
}
?>
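Because no stylesheet is applied, class names are the only hint about when an element is actually shown. As a rough sketch in JavaScript (the language of the Puppeteer example in this article), the hypothetical helper visibilityByBreakpoint estimates per-breakpoint visibility from Bootstrap 4/5 display utilities such as d-none and d-md-block. It assumes the mobile-first rule that a d-{breakpoint}-{value} class applies from that breakpoint upward, and it ignores custom CSS, print utilities, and conflicting classes at the same breakpoint.

```javascript
// Breakpoints in mobile-first order, as used by Bootstrap 4/5 display utilities.
const BREAKPOINTS = ['xs', 'sm', 'md', 'lg', 'xl', 'xxl'];

// Hypothetical helper: estimate at which breakpoints an element is displayed,
// looking only at its d-* utility classes (no stylesheet is evaluated).
function visibilityByBreakpoint(className) {
  const overrides = {}; // breakpoint -> true (shown) / false (hidden)
  for (const token of className.split(/\s+/).filter(Boolean)) {
    if (token.startsWith('d-print-')) continue; // print utilities are out of scope
    // Matches "d-none", "d-block", "d-md-none", "d-lg-flex", ...
    const match = /^d-(?:(sm|md|lg|xl|xxl)-)?([a-z-]+)$/.exec(token);
    if (!match) continue;
    const breakpoint = match[1] || 'xs';
    overrides[breakpoint] = match[2] !== 'none';
  }

  const result = {};
  let visible = true; // assumption: shown unless a utility class says otherwise
  for (const breakpoint of BREAKPOINTS) {
    if (breakpoint in overrides) visible = overrides[breakpoint];
    result[breakpoint] = visible;
  }
  return result;
}
```

For instance, visibilityByBreakpoint('d-none d-md-block') reports the element as hidden at xs and sm and shown from md upward. The same logic ports directly to PHP if you want to annotate the results of extractHiddenContent.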
JavaScript Alternative for Dynamic Content
For heavily JavaScript-dependent responsive sites, consider using headless browsers. Here's a Node.js example using Puppeteer that complements Simple HTML DOM:
const puppeteer = require('puppeteer');

class ResponsivePuppeteerScraper {
  constructor() {
    this.viewports = {
      mobile: { width: 375, height: 667 },
      tablet: { width: 768, height: 1024 },
      desktop: { width: 1200, height: 800 }
    };
  }

  async scrapeResponsive(url) {
    const browser = await puppeteer.launch();
    const results = {};
    try {
      for (const [device, viewport] of Object.entries(this.viewports)) {
        const page = await browser.newPage();
        await page.setViewport(viewport);
        // Set appropriate user agent (tablet uses the desktop string here)
        const userAgent = device === 'mobile'
          ? 'Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15'
          : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36';
        await page.setUserAgent(userAgent);
        await page.goto(url, { waitUntil: 'networkidle2' });
        // Wait for responsive transformations to complete.
        // page.waitForTimeout() was removed in recent Puppeteer releases, so use a plain timer.
        await new Promise(resolve => setTimeout(resolve, 2000));
        results[device] = await page.evaluate(() => {
          return {
            title: document.title,
            // offsetParent is null for elements hidden with display: none
            visibleContent: Array.from(document.querySelectorAll('*'))
              .filter(el => el.offsetParent !== null)
              .map(el => ({
                tag: el.tagName,
                class: el.className,
                text: el.textContent?.trim().substring(0, 100)
              }))
              .filter(item => item.text && item.text.length > 0)
          };
        });
        await page.close();
      }
    } finally {
      // Close the browser even if a page fails to load
      await browser.close();
    }
    return results;
  }
}

// Usage
(async () => {
  const scraper = new ResponsivePuppeteerScraper();
  const data = await scraper.scrapeResponsive('https://example.com');
  console.log(JSON.stringify(data, null, 2));
})();
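Once each viewport has been scraped, the per-device results can be compared to see which content is limited to one layout. The sketch below is one possible way to do that. findDeviceOnlyText is a hypothetical helper, and it assumes the result shape produced by scrapeResponsive above (a visibleContent array of objects with a text field). Because that text is truncated to 100 characters, the comparison works on snippets, not on full content.

```javascript
// Hypothetical helper: list the text snippets that are visible on exactly one device.
// `results` maps a device name to { visibleContent: [{ text: string }, ...] }.
function findDeviceOnlyText(results) {
  const devicesByText = new Map(); // snippet -> Set of devices that show it
  for (const [device, data] of Object.entries(results)) {
    for (const item of data.visibleContent || []) {
      if (!devicesByText.has(item.text)) {
        devicesByText.set(item.text, new Set());
      }
      devicesByText.get(item.text).add(device);
    }
  }

  // Start with an empty list per device so every device appears in the output
  const exclusive = {};
  for (const device of Object.keys(results)) {
    exclusive[device] = [];
  }
  for (const [text, devices] of devicesByText) {
    if (devices.size === 1) {
      exclusive[[...devices][0]].push(text);
    }
  }
  return exclusive;
}
```

A non-empty list for one device is a hint that the site prioritizes or hides that content at other sizes, which is worth checking before you settle on a single viewport for production scraping.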
Best Practices for Responsive Scraping
1. Implement Fallback Selectors
Always provide multiple selector options to handle different responsive states:
<?php
function robustElementSelection($dom, $selectors, $fallback_text = '') {
    foreach ($selectors as $selector) {
        $element = $dom->find($selector, 0);
        if ($element && !empty(trim($element->plaintext))) {
            return [
                'content' => trim($element->plaintext),
                'selector' => $selector,
                'success' => true
            ];
        }
    }
    return [
        'content' => $fallback_text,
        'selector' => null,
        'success' => false
    ];
}

// Example usage
$title_selectors = [
    'h1.page-title',
    '.title-wrapper h1',
    '.hero-title',
    'h1',
    '.main-title'
];
$title_data = robustElementSelection($dom, $title_selectors, 'Title not found');
?>
2. Monitor and Adapt to Layout Changes
Responsive sites frequently update their CSS frameworks and breakpoints. Implement monitoring to detect when your scraping patterns need updates:
<?php
function validateScrapingResults($results, $expected_fields) {
    $validation_report = [
        'success_rate' => 0,
        'missing_fields' => [],
        'recommendations' => []
    ];
    $successful_extractions = 0;
    foreach ($expected_fields as $field) {
        if (isset($results[$field]) && !empty($results[$field]['content'])) {
            $successful_extractions++;
        } else {
            $validation_report['missing_fields'][] = $field;
        }
    }
    // Guard against division by zero when no fields are expected
    $total = count($expected_fields);
    $validation_report['success_rate'] =
        $total > 0 ? ($successful_extractions / $total) * 100 : 0;
    if ($validation_report['success_rate'] < 80) {
        $validation_report['recommendations'][] =
            'Consider updating selectors or using headless browser approach';
    }
    return $validation_report;
}
?>
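A success rate only shows that something was extracted. A layout change can also show up as a different fallback selector matching than on the previous run. The following JavaScript sketch illustrates that idea under stated assumptions: detectSelectorDrift is a hypothetical helper, and both arguments are assumed to be objects keyed by section name with a selector_used field, mirroring the output of extractWithResponsiveSelectors after it has been serialized to JSON.

```javascript
// Hypothetical helper: compare two extraction runs and report sections whose
// matching selector changed, or that were no longer extracted at all.
function detectSelectorDrift(previous, current) {
  const report = { changed: [], lost: [] };
  for (const [section, before] of Object.entries(previous)) {
    const after = current[section];
    if (!after) {
      // The section was found last time but not in this run
      report.lost.push(section);
    } else if (after.selector_used !== before.selector_used) {
      // A different fallback selector matched: the markup probably changed
      report.changed.push({
        section,
        from: before.selector_used,
        to: after.selector_used
      });
    }
  }
  return report;
}
```

Storing the previous run next to the scraped data keeps this check cheap. A non-empty changed or lost list is a reasonable trigger for reviewing selectors before the success rate drops.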
3. Handle Common Responsive Frameworks
Different CSS frameworks have distinct patterns. Tailor your approach accordingly:
<?php
function detectResponsiveFramework($dom) {
    // Prefix-style classes (col-, btn-, mdc-, ...) need an attribute substring
    // selector: ".col-" would only match an element whose class is exactly "col-".
    // Substring matching is a heuristic and can over-match (e.g. "w-" inside "row-").
    $framework_indicators = [
        'bootstrap' => ['.container', '.row', '[class*="col-"]', '[class*="btn-"]', '.navbar'],
        'foundation' => ['.grid-container', '.grid-x', '.cell'],
        'bulma' => ['.container', '.columns', '.column'],
        'tailwind' => ['.flex', '.grid', '[class*="w-"]', '[class*="h-"]', '[class*="p-"]', '[class*="m-"]'],
        'material' => ['[class*="mdc-"]', '[class*="mat-"]']
    ];
    $detected_frameworks = [];
    foreach ($framework_indicators as $framework => $indicators) {
        $matches = 0;
        foreach ($indicators as $indicator) {
            if (!empty($dom->find($indicator))) {
                $matches++;
            }
        }
        if ($matches >= 2) {
            $detected_frameworks[] = $framework;
        }
    }
    return $detected_frameworks;
}
?>
Advanced Techniques
1. Combining Multiple Data Sources
Merge data from different viewport scrapes to create a comprehensive dataset:
<?php
function mergeResponsiveData($viewport_results) {
    $merged_data = [
        'title' => '',
        'content' => [],
        'navigation' => [],
        'metadata' => []
    ];
    foreach ($viewport_results as $device => $data) {
        // Use the longest/most complete title
        if (strlen($data['title']) > strlen($merged_data['title'])) {
            $merged_data['title'] = $data['title'];
        }
        // Combine unique navigation items
        foreach ($data['navigation'] as $nav_item) {
            $exists = false;
            foreach ($merged_data['navigation'] as $existing_item) {
                if ($existing_item['text'] === $nav_item['text']) {
                    $exists = true;
                    break;
                }
            }
            if (!$exists) {
                $merged_data['navigation'][] = $nav_item;
            }
        }
        // Store device-specific content
        $merged_data['content'][$device] = $data['main_content'];
    }
    return $merged_data;
}
?>
2. Performance Optimization
When scraping responsive sites across multiple viewports, every URL is fetched several times, so cache results and pause between requests:
<?php
// Extends ResponsiveScraper so that $this->fetchHTML() is available
class OptimizedResponsiveScraper extends ResponsiveScraper {
    private $cache = [];
    private $concurrent_limit = 3; // Not used yet: the requests below run sequentially

    public function scrapeWithCaching($urls, $devices = ['desktop', 'mobile']) {
        $results = [];
        foreach ($urls as $url) {
            // Separate the parts so different URL/device combinations cannot collide
            $cache_key = md5($url . '|' . implode(',', $devices));
            if (isset($this->cache[$cache_key])) {
                $results[$url] = $this->cache[$cache_key];
                continue;
            }
            $viewport_results = [];
            foreach ($devices as $device) {
                $dom = $this->fetchHTML($url, $device);
                if ($dom) {
                    $viewport_results[$device] = $this->extractData($dom);
                }
                // Brief delay to avoid overwhelming the server
                usleep(500000); // 0.5 seconds
            }
            $results[$url] = $viewport_results;
            $this->cache[$cache_key] = $viewport_results;
        }
        return $results;
    }

    private function extractData($dom) {
        return extractWithResponsiveSelectors($dom);
    }
}
?>
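The PHP class above declares a concurrency limit but fetches sequentially. For the Puppeteer route, a bounded worker pool is one common way to apply such a limit. The sketch below is a minimal illustration and not part of any library: runWithLimit is a hypothetical helper that takes an array of functions returning promises and keeps at most limit of them in flight, returning results in input order.

```javascript
// Hypothetical helper: run async task functions with at most `limit` in flight.
async function runWithLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0; // index of the next task to start

  async function worker() {
    while (next < tasks.length) {
      // Reading and incrementing happen synchronously, so no two workers
      // can claim the same index.
      const index = next++;
      results[index] = await tasks[index]();
    }
  }

  // Start up to `limit` workers; each one pulls tasks until none are left.
  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

With the scraper from the JavaScript section, the tasks could be built as urls.map(url => () => scraper.scrapeResponsive(url)). Note that a single rejected task rejects the whole call, so wrap each task in its own try/catch if partial results are acceptable.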
Testing Responsive Scraping
Command Line Testing
Use curl to request the page with different user agents and check whether the server returns different HTML. curl has no viewport, so this only reveals server-side device detection, not CSS-driven layout changes:
# Fetch with a mobile user agent and save the response
curl -s -o mobile_response.html \
  -H "User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 14_6 like Mac OS X) AppleWebKit/605.1.15" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
  https://example.com
# Fetch with a desktop user agent and save the response
curl -s -o desktop_response.html \
  -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36" \
  -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" \
  https://example.com
# Compare responses
diff mobile_response.html desktop_response.html
Automated Testing Framework
<?php
class ResponsiveScrapeValidator {
    private $test_cases = [];

    public function addTestCase($url, $expected_elements) {
        $this->test_cases[] = [
            'url' => $url,
            'expected' => $expected_elements
        ];
    }

    public function runTests() {
        $results = [];
        foreach ($this->test_cases as $test) {
            $viewport_data = scrapeMultipleViewports($test['url']);
            $test_result = [
                'url' => $test['url'],
                'passed' => true,
                'failures' => []
            ];
            foreach ($test['expected'] as $device => $elements) {
                // A device is missing from the results when its fetch failed
                $device_data = $viewport_data[$device] ?? [];
                foreach ($elements as $element) {
                    if (!$this->elementExists($device_data, $element)) {
                        $test_result['passed'] = false;
                        $test_result['failures'][] = "Missing $element in $device view";
                    }
                }
            }
            $results[] = $test_result;
        }
        return $results;
    }

    private function elementExists($data, $element) {
        // empty() already covers the "key not set" case
        return !empty($data[$element]);
    }
}

// Usage
$validator = new ResponsiveScrapeValidator();
$validator->addTestCase('https://example.com', [
    'mobile' => ['title', 'navigation'],
    'desktop' => ['title', 'navigation', 'sidebar']
]);
$test_results = $validator->runTests();
foreach ($test_results as $result) {
    echo "Test for {$result['url']}: " . ($result['passed'] ? 'PASSED' : 'FAILED') . "\n";
    if (!$result['passed']) {
        foreach ($result['failures'] as $failure) {
            echo " - $failure\n";
        }
    }
}
?>
Common Responsive Patterns
1. Bootstrap Grid System
<?php
function extractBootstrapContent($dom) {
    $bootstrap_data = [];
    // Extract content from Bootstrap grid
    $containers = $dom->find('.container, .container-fluid');
    foreach ($containers as $container_index => $container) {
        $rows = $container->find('.row');
        foreach ($rows as $row_index => $row) {
            $columns = $row->find('[class*="col-"]');
            foreach ($columns as $col_index => $column) {
                // Include the container index so rows from different
                // containers do not overwrite each other
                $key = "container_{$container_index}_row_{$row_index}_col_{$col_index}";
                $bootstrap_data[$key] = [
                    'classes' => $column->class,
                    'content' => trim($column->plaintext),
                    'html' => $column->outertext
                ];
            }
        }
    }
    return $bootstrap_data;
}
?>
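The classes value collected for each column encodes how wide the column is at each breakpoint. As an illustrative JavaScript sketch, the hypothetical helper parseBootstrapColumns turns a class string into a breakpoint-to-width map. It assumes Bootstrap's numeric col-{n} and col-{breakpoint}-{n} naming and skips classes without a number, such as col, col-md, or col-auto.

```javascript
// Hypothetical helper: map Bootstrap column classes to per-breakpoint widths
// (in twelfths of the row). Classes without a numeric width are ignored.
function parseBootstrapColumns(className) {
  const widths = {};
  for (const token of className.split(/\s+/).filter(Boolean)) {
    // "col-12" -> xs (all sizes), "col-md-8" -> md and up, "col-xs-6" (Bootstrap 3) -> xs
    const match = /^col-(?:(xs|sm|md|lg|xl|xxl)-)?(\d{1,2})$/.exec(token);
    if (match) {
      widths[match[1] || 'xs'] = Number(match[2]);
    }
  }
  return widths;
}
```

A column with the classes col-12 col-md-8 col-lg-6 comes out as { xs: 12, md: 8, lg: 6 }: full width on small screens, two thirds from md, and half from lg. That is usually enough to tell main content from a sidebar.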
2. CSS Grid Detection
<?php
function extractGridLayouts($dom) {
    $grid_containers = $dom->find('.grid, [style*="display: grid"], [style*="display:grid"]');
    $grid_data = [];
    foreach ($grid_containers as $index => $container) {
        $grid_items = $container->children();
        $grid_data["grid_$index"] = [
            'container_classes' => $container->class,
            'item_count' => count($grid_items),
            'items' => []
        ];
        foreach ($grid_items as $item_index => $item) {
            $grid_data["grid_$index"]['items'][] = [
                'position' => $item_index,
                'classes' => $item->class,
                'content_preview' => substr(trim($item->plaintext), 0, 100)
            ];
        }
    }
    return $grid_data;
}
?>
Error Handling and Debugging
1. Comprehensive Error Logging
<?php
class ResponsiveScrapingLogger {
    private $log_file;

    public function __construct($log_file = 'responsive_scraping.log') {
        $this->log_file = $log_file;
    }

    public function logViewportScrape($url, $device, $success, $data = null, $error = null) {
        $log_entry = [
            'timestamp' => date('Y-m-d H:i:s'),
            'url' => $url,
            'device' => $device,
            'success' => $success,
            // count() throws a TypeError for non-countable values in PHP 8
            'data_extracted' => is_array($data) ? count($data) : 0,
            'error' => $error
        ];
        // One JSON object per line
        file_put_contents($this->log_file, json_encode($log_entry) . "\n", FILE_APPEND);
    }

    public function getFailedScrapingAttempts() {
        if (!file_exists($this->log_file)) {
            return [];
        }
        $lines = file($this->log_file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
        $failed_attempts = [];
        foreach ($lines as $line) {
            $entry = json_decode($line, true);
            // Skip malformed lines instead of failing on them
            if (is_array($entry) && empty($entry['success'])) {
                $failed_attempts[] = $entry;
            }
        }
        return $failed_attempts;
    }
}
?>
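Since the logger writes one JSON object per line, the log can be summarized by any tool that reads JSON lines. As a small JavaScript sketch, the hypothetical helper summarizeLog counts total and failed attempts per device from the log text. It assumes the device and success fields written by logViewportScrape and silently skips lines that are not valid JSON.

```javascript
// Hypothetical helper: summarize a JSON-lines scraping log per device.
// Returns e.g. { mobile: { total: 2, failed: 1 }, desktop: { total: 1, failed: 0 } }.
function summarizeLog(logText) {
  const summary = {};
  for (const line of logText.split('\n')) {
    if (!line.trim()) continue; // skip blank lines

    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip malformed lines
    }
    if (entry === null || typeof entry !== 'object') continue;

    const device = entry.device || 'unknown';
    if (!summary[device]) {
      summary[device] = { total: 0, failed: 0 };
    }
    summary[device].total++;
    if (!entry.success) {
      summary[device].failed++;
    }
  }
  return summary;
}
```

In Node.js the input would typically come from fs.readFileSync('responsive_scraping.log', 'utf8'). A device whose failed count climbs while the others stay flat often points to a layout or user-agent-specific change.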
2. Debug Helper Functions
<?php
function debugResponsiveElements($dom, $device_type) {
    echo "=== Debug Report for $device_type ===\n";
    echo "Total elements: " . count($dom->find('*')) . "\n";

    // Check for responsive indicators
    $responsive_classes = [
        'hidden-xs', 'hidden-sm', 'hidden-md', 'hidden-lg',
        'd-none', 'd-block', 'd-inline',
        'mobile-only', 'desktop-only', 'tablet-only'
    ];
    foreach ($responsive_classes as $class) {
        $elements = $dom->find(".$class");
        if (!empty($elements)) {
            echo "Found " . count($elements) . " elements with class '$class'\n";
        }
    }

    // Check navigation elements
    $nav_elements = $dom->find('nav, .navbar, .menu, .navigation');
    echo "Navigation elements found: " . count($nav_elements) . "\n";

    // Check for common layout elements
    $layout_elements = [
        'header' => 'header, .header',
        'main' => 'main, .main, .content',
        'sidebar' => '.sidebar, .aside, aside',
        'footer' => 'footer, .footer'
    ];
    foreach ($layout_elements as $name => $selector) {
        $elements = $dom->find($selector);
        echo ucfirst($name) . " elements: " . count($elements) . "\n";
    }
    echo "========================\n\n";
}
?>
Conclusion
Scraping responsive web designs requires a multi-faceted approach that accounts for varying layouts, hidden content, and device-specific presentations. While Simple HTML DOM provides excellent HTML parsing capabilities, combining it with strategic user agent rotation, robust selector patterns, and fallback mechanisms creates a reliable scraping solution.
For sites with heavy JavaScript dependencies or complex responsive behaviors, consider integrating a headless browser such as Puppeteer to render dynamic content. For sites that load content dynamically, it also helps to understand how to handle AJAX requests using Puppeteer.
Remember to implement monitoring and validation systems to ensure your responsive scraping strategies remain effective as websites evolve their designs and frameworks. Regular testing across multiple viewports and maintaining flexible selector strategies will help your scrapers adapt to changing responsive design patterns over time.