How can I select elements based on their href or src attributes?
Selecting elements based on their `href` or `src` attributes is a fundamental skill in web scraping and automation. CSS attribute selectors provide a powerful way to target elements by their attribute values, letting you precisely identify links, images, scripts, and other resources on web pages.
Understanding CSS Attribute Selectors
CSS attribute selectors use square brackets (`[]`) to match elements based on their attributes and values. These selectors work with any HTML attribute, making them particularly useful for targeting elements with specific `href` and `src` values.
Basic Attribute Selector Syntax
```css
/* Select elements with a specific attribute */
[attribute]

/* Select elements with an exact attribute value */
[attribute="value"]

/* Select elements where the attribute contains a substring */
[attribute*="substring"]

/* Select elements where the attribute starts with a string */
[attribute^="prefix"]

/* Select elements where the attribute ends with a string */
[attribute$="suffix"]
```
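All five matcher forms can be tried quickly with BeautifulSoup, which supports CSS selectors through `select()`. A minimal sketch (the HTML snippet and URLs are made up for illustration):

```python
from bs4 import BeautifulSoup

html = """
<a href="/about">About</a>
<a href="https://github.com/example/repo">Repo</a>
<a href="/files/report.pdf">Report</a>
<a>No href</a>
"""
soup = BeautifulSoup(html, "html.parser")

print(len(soup.select('a[href]')))              # has the attribute at all
print(len(soup.select('a[href="/about"]')))     # exact value
print(len(soup.select('a[href*="github"]')))    # contains substring
print(len(soup.select('a[href^="https://"]')))  # starts with prefix
print(len(soup.select('a[href$=".pdf"]')))      # ends with suffix
```

Note that `a[href]` matches three of the four anchors here, since the last one has no `href` attribute at all.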
Selecting Elements by href Attributes
The `href` attribute is most commonly used on anchor tags (`<a>`) and link tags (`<link>`). Here are several techniques for selecting elements based on their `href` values:
Exact Match Selection
Select links with an exact `href` value:
```css
/* Select a link to a specific page */
a[href="https://example.com/about"]

/* Select relative links */
a[href="/contact"]
```
JavaScript Implementation:
```javascript
// Using querySelector for a single element
const specificLink = document.querySelector('a[href="https://example.com/about"]');

// Using querySelectorAll for multiple elements
const contactLinks = document.querySelectorAll('a[href="/contact"]');
console.log('Found contact links:', contactLinks.length);
```
Python with BeautifulSoup:
```python
from bs4 import BeautifulSoup

html = """
<html>
<body>
    <a href="https://example.com/about">About</a>
    <a href="/contact">Contact</a>
    <a href="mailto:info@example.com">Email</a>
</body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')

# Select exact href matches
about_link = soup.select('a[href="https://example.com/about"]')
contact_links = soup.select('a[href="/contact"]')

print(f"About links found: {len(about_link)}")
print(f"Contact links found: {len(contact_links)}")
```
Partial Match Selection
Select links whose `href` attributes contain specific substrings:
```css
/* Select all links containing 'github' */
a[href*="github"]

/* Select all PDF download links */
a[href*=".pdf"]

/* Select all links whose URL contains 'https://' */
a[href*="https://"]
```
JavaScript Example:
```javascript
// Find all GitHub links
const githubLinks = document.querySelectorAll('a[href*="github"]');

// Find all PDF links
const pdfLinks = document.querySelectorAll('a[href*=".pdf"]');

// Process each GitHub link
githubLinks.forEach(link => {
  console.log('GitHub link:', link.href, 'Text:', link.textContent);
});
```
Prefix and Suffix Matching
Target links that start or end with specific patterns:
```css
/* Select all external HTTPS links */
a[href^="https://"]

/* Select all mailto links */
a[href^="mailto:"]

/* Select all links ending with specific file extensions */
a[href$=".zip"]
a[href$=".docx"]
```
Python Example:
```python
import requests
from bs4 import BeautifulSoup

# Fetch and parse a webpage
url = "https://example.com"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')

# Find all external HTTPS links
external_links = soup.select('a[href^="https://"]')

# Find all email links
email_links = soup.select('a[href^="mailto:"]')

# Find all download links
download_links = soup.select('a[href$=".zip"], a[href$=".pdf"], a[href$=".docx"]')

print(f"External links: {len(external_links)}")
print(f"Email links: {len(email_links)}")
print(f"Download links: {len(download_links)}")

# Extract href values
for link in download_links:
    print(f"Download: {link.get('href')} - {link.get_text(strip=True)}")
```
Selecting Elements by src Attributes
The `src` attribute is used on elements such as `<img>`, `<script>`, `<iframe>`, and `<video>`. Here's how to select these elements based on their source URLs:
Image Selection
```css
/* Select images from a specific domain */
img[src*="cdn.example.com"]

/* Select images with specific file extensions */
img[src$=".jpg"]
img[src$=".png"]
img[src$=".webp"]

/* Select images from relative paths */
img[src^="/images/"]
```
JavaScript Implementation:
```javascript
// Find all CDN images
const cdnImages = document.querySelectorAll('img[src*="cdn.example.com"]');

// Find all PNG images
const pngImages = document.querySelectorAll('img[src$=".png"]');

// Extract image information
const imageData = Array.from(cdnImages).map(img => ({
  src: img.src,
  alt: img.alt,
  width: img.naturalWidth,
  height: img.naturalHeight
}));

console.log('CDN Images:', imageData);
```
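The same `src` selectors work server-side with BeautifulSoup. A minimal sketch, assuming a hypothetical CDN host and image paths:

```python
from bs4 import BeautifulSoup

html = """
<img src="https://cdn.example.com/img/hero.png" alt="Hero">
<img src="/images/logo.png" alt="Logo">
<img src="https://other.example.com/photo.jpg" alt="Photo">
"""
soup = BeautifulSoup(html, "html.parser")

# Find images served from the CDN, and all PNGs regardless of host
cdn_images = soup.select('img[src*="cdn.example.com"]')
png_images = soup.select('img[src$=".png"]')

# Collect src/alt pairs, mirroring the JavaScript example
image_data = [{"src": img["src"], "alt": img.get("alt", "")} for img in cdn_images]
print(len(cdn_images), len(png_images))
```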
Script and Resource Selection
```css
/* Select external JavaScript files */
script[src^="https://"]

/* Select specific analytics scripts */
script[src*="google-analytics"]
script[src*="gtag"]

/* Select CSS files from a CDN */
link[href*="cdn.jsdelivr.net"]
```
Python Example with Selenium:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

# Set up a headless Chrome driver
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(options=chrome_options)

try:
    driver.get("https://example.com")

    # Find all external scripts
    external_scripts = driver.find_elements(By.CSS_SELECTOR, 'script[src^="https://"]')

    # Find all images from a specific path
    local_images = driver.find_elements(By.CSS_SELECTOR, 'img[src^="/images/"]')

    # Extract script sources
    script_sources = [script.get_attribute('src') for script in external_scripts]

    print("External scripts found:")
    for src in script_sources:
        print(f"  - {src}")

    print(f"\nLocal images found: {len(local_images)}")
finally:
    driver.quit()
```
Advanced Attribute Selection Techniques
Combining Multiple Attribute Selectors
You can combine multiple attribute selectors for more precise targeting:
```css
/* Select HTTPS images with a PNG extension */
img[src^="https://"][src$=".png"]

/* Select external PDF links */
a[href^="https://"][href$=".pdf"]

/* Select images from a specific domain that have alt text */
img[src*="example.com"][alt]
```
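Chained attribute selectors require every condition to hold on the same element, which works the same way in BeautifulSoup. A short sketch with made-up URLs:

```python
from bs4 import BeautifulSoup

html = """
<a href="https://example.com/guide.pdf">External PDF</a>
<a href="/local/guide.pdf">Local PDF</a>
<a href="https://example.com/">Home</a>
<img src="https://example.com/a.png" alt="Chart">
<img src="https://example.com/b.png">
"""
soup = BeautifulSoup(html, "html.parser")

# Only the first link is both external (https://) and a PDF
external_pdfs = soup.select('a[href^="https://"][href$=".pdf"]')

# Only the first image is a PNG that also carries an alt attribute
captioned_pngs = soup.select('img[src$=".png"][alt]')

print(len(external_pdfs), len(captioned_pngs))
```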
Case-Insensitive Matching
Use the `i` flag for case-insensitive attribute matching:
```css
/* Case-insensitive extension matching */
a[href$=".PDF" i]
img[src$=".JPG" i]
```
JavaScript Example:
```javascript
// Modern browsers support case-insensitive selectors
const pdfLinks = document.querySelectorAll('a[href$=".pdf" i]');

// Fallback for older browsers
const allLinks = document.querySelectorAll('a[href]');
const pdfLinksManual = Array.from(allLinks).filter(link =>
  link.href.toLowerCase().endsWith('.pdf')
);

console.log('PDF links found:', pdfLinks.length || pdfLinksManual.length);
```
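In Python, soupsieve (the selector engine behind BeautifulSoup's `select()`) also accepts the CSS `i` flag; a manual lowercase comparison works as a portable fallback. A sketch with made-up file names:

```python
from bs4 import BeautifulSoup

html = """
<a href="/report.PDF">Report</a>
<a href="/notes.pdf">Notes</a>
<a href="/image.png">Image</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Case-insensitive matching via the CSS4 `i` flag
pdf_links = soup.select('a[href$=".pdf" i]')

# Portable fallback: filter manually on the lowercased attribute
pdf_links_manual = [a for a in soup.select('a[href]')
                    if a["href"].lower().endswith(".pdf")]

print(len(pdf_links), len(pdf_links_manual))
```

Both approaches should agree, catching `.PDF` and `.pdf` alike.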
Practical Web Scraping Examples
Extracting All Media Resources
```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def extract_media_resources(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')

    # Extract all images
    images = []
    for img in soup.select('img[src]'):
        src = img.get('src')
        absolute_url = urljoin(url, src)
        images.append({
            'url': absolute_url,
            'alt': img.get('alt', ''),
            'type': 'image'
        })

    # Extract all videos
    videos = []
    for video in soup.select('video[src], source[src]'):
        src = video.get('src')
        if src:
            absolute_url = urljoin(url, src)
            videos.append({
                'url': absolute_url,
                'type': 'video'
            })

    # Extract external scripts
    scripts = []
    for script in soup.select('script[src^="https://"]'):
        scripts.append({
            'url': script.get('src'),
            'type': 'script'
        })

    return {
        'images': images,
        'videos': videos,
        'scripts': scripts
    }

# Usage
media_data = extract_media_resources('https://example.com')
print(f"Found {len(media_data['images'])} images")
print(f"Found {len(media_data['videos'])} videos")
print(f"Found {len(media_data['scripts'])} external scripts")
```
Link Analysis and Categorization
```javascript
function analyzePageLinks() {
  const links = document.querySelectorAll('a[href]');
  const analysis = {
    internal: [],
    external: [],
    email: [],
    telephone: [],
    downloads: []
  };

  links.forEach(link => {
    const href = link.href;
    const text = link.textContent.trim();

    if (href.startsWith('mailto:')) {
      analysis.email.push({ href, text });
    } else if (href.startsWith('tel:')) {
      analysis.telephone.push({ href, text });
    } else if (href.match(/\.(pdf|zip|doc|docx|xls|xlsx)$/i)) {
      analysis.downloads.push({ href, text });
    } else if (href.startsWith(window.location.origin)) {
      analysis.internal.push({ href, text });
    } else if (href.startsWith('http')) {
      analysis.external.push({ href, text });
    }
  });

  return analysis;
}

// Usage
const linkAnalysis = analyzePageLinks();
console.log('Link Analysis:', linkAnalysis);
```
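The same categorization can be done server-side. A hedged Python sketch that mirrors the JavaScript logic, resolving relative links against an assumed base URL (the sample HTML and base are made up):

```python
import re
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def analyze_page_links(html, base_url):
    """Categorize links by href pattern, mirroring the browser version."""
    soup = BeautifulSoup(html, "html.parser")
    analysis = {"internal": [], "external": [], "email": [],
                "telephone": [], "downloads": []}
    for link in soup.select("a[href]"):
        # Resolve relative URLs the way the browser's link.href does
        href = urljoin(base_url, link["href"])
        entry = {"href": href, "text": link.get_text(strip=True)}
        if href.startswith("mailto:"):
            analysis["email"].append(entry)
        elif href.startswith("tel:"):
            analysis["telephone"].append(entry)
        elif re.search(r"\.(pdf|zip|docx?|xlsx?)$", href, re.I):
            analysis["downloads"].append(entry)
        elif href.startswith(base_url):
            analysis["internal"].append(entry)
        elif href.startswith("http"):
            analysis["external"].append(entry)
    return analysis

html = """
<a href="/about">About</a>
<a href="https://other.com/">Other</a>
<a href="mailto:info@example.com">Email</a>
<a href="/files/report.pdf">Report</a>
"""
result = analyze_page_links(html, "https://example.com")
print({k: len(v) for k, v in result.items()})
```

Note that `urljoin` leaves `mailto:` and `tel:` URLs untouched, so the scheme checks still work after resolution.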
Integration with Web Scraping Tools
When handling authentication in Puppeteer, you might need to select login form elements by their `action` attributes:
```javascript
// Puppeteer example for form selection
await page.goto('https://example.com/login');

// Wait for the login form, selected by its action attribute
await page.waitForSelector('form[action*="login"]');

// Fill form fields
await page.type('input[name="username"]', username);
await page.type('input[name="password"]', password);

// Submit the form
await page.click('button[type="submit"]');
```
For complex navigation scenarios, such as when you need to interact with DOM elements in Puppeteer, attribute selectors help identify specific navigation elements:
// Select navigation links by href patterns
const navLinks = await page.$$eval('nav a[href^="/"]', links =>
links.map(link => ({
href: link.href,
text: link.textContent.trim()
}))
);
console.log('Navigation links:', navLinks);
Best Practices
- **Use specific selectors:** Combine attribute selectors with element types for better performance and specificity.
- **Handle relative URLs:** Always consider both absolute and relative URLs when matching `href` attributes.
- **Escape special characters:** Use proper escaping for attribute values containing special characters.
- **Mind performance:** Attribute selectors can be slower than ID or class selectors, so use them judiciously.
- **Check cross-browser compatibility:** Test case-insensitive selectors across different browsers and versions.
Conclusion
Selecting elements by their `href` and `src` attributes provides powerful capabilities for web scraping and automation tasks. Whether you're extracting links, analyzing media resources, or navigating complex web applications, CSS attribute selectors offer the precision and flexibility needed for effective element targeting. Master these techniques to build more robust and reliable web scraping solutions.