Sitemaps play a crucial role in SEO and web crawling by providing a structured list of URLs that a website wants search engines to index. Instead of scraping a website by following links page by page, crawling the sitemap is a much faster way to discover all available URLs.
However, manually parsing sitemaps comes with challenges. Websites often use index sitemaps, which link to multiple smaller sitemaps, requiring extra steps to handle them correctly. Some sitemaps may also contain thousands of URLs, making it tedious to extract everything efficiently.
This is where ultimate-sitemap-parser (usp) simplifies the process. It is a Python library that automatically fetches a site's sitemaps, parses the XML, and recursively follows nested index sitemaps.
This guide will show you how to use ultimate-sitemap-parser to crawl the ASOS sitemap and quickly extract all the available URLs.
Before we begin crawling sitemaps, ensure you have the following set up:
To run the script, you need Python installed on your system. You can check whether it is already available with:
python3 --version
Next, install the ultimate-sitemap-parser (usp) library using pip:
pip install ultimate-sitemap-parser
Now that we have ultimate-sitemap-parser installed, let's dive into how to use it to extract URLs from the ASOS sitemap. We'll cover the basic functionality of usp and highlight some of its key features.
The ultimate-sitemap-parser library provides a simple way to fetch and parse a website's sitemap. Instead of manually handling XML parsing, usp does it for us with just a few lines of code.
Let’s start by extracting all URLs from the ASOS sitemap:
from usp.tree import sitemap_tree_for_homepage
# Define the target website
url = "https://www.asos.com/"
# Fetch and parse the sitemap
tree = sitemap_tree_for_homepage(url)
# Extract and print all URLs
for page in tree.all_pages():
    print(page.url)
One of the biggest advantages of usp is its ability to automatically handle nested index sitemaps. Some websites, like ASOS, structure their sitemaps into multiple sub-sitemaps (e.g., product pages, categories, blog pages). usp will discover the index sitemap, fetch each sub-sitemap it references, and recursively parse them all.
The script above will return all URLs across all sitemaps without needing any extra code.
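To see what usp is saving us from, here is a sketch of the manual alternative: parsing a sitemap index by hand with Python's standard xml.etree.ElementTree. The XML snippet and example.com URLs below are illustrative, not ASOS's real sitemap structure.

```python
import xml.etree.ElementTree as ET

# A minimal, illustrative sitemap index -- the URLs are made up
index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://www.example.com/sitemap-products.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-categories.xml</loc></sitemap>
</sitemapindex>"""

# Sitemap elements live in the sitemaps.org namespace, which must be mapped explicitly
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(index_xml)

# Each <sitemap><loc> entry is a sub-sitemap a manual crawler would have to
# fetch and parse in turn -- this is the step usp automates for us
sub_sitemaps = [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]
print(sub_sitemaps)
```

With usp, none of this namespace bookkeeping or recursive fetching is our problem.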
If you're interested in extracting specific types of URLs (e.g., only product pages), you can filter them based on patterns.
For example, if ASOS product URLs contain /product/, we can modify our script like this:
# Extract and filter only product page URLs
product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]
# Print filtered URLs
for url in product_urls:
    print(url)
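A plain substring check can over-match, since "product" might also appear in a query string or an unrelated path. A slightly stricter sketch checks the path segments instead, using urllib.parse from the standard library. The sample URLs below are hypothetical stand-ins for tree.all_pages() results:

```python
from urllib.parse import urlparse

# Hypothetical URLs standing in for the pages returned by tree.all_pages()
sample_urls = [
    "https://www.asos.com/product/some-shirt/prd/12345",
    "https://www.asos.com/category/shoes",
    "https://www.asos.com/help?topic=product",  # "product" only in the query string
]

def is_product_url(url: str) -> bool:
    # Match "product" only as a path segment, not anywhere in the URL
    return "product" in urlparse(url).path.split("/")

product_urls = [u for u in sample_urls if is_product_url(u)]
print(product_urls)  # only the first URL matches
```

Adapt the segment name to whatever pattern the site you are crawling actually uses.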
Instead of printing URLs, we might want to store them for later use. Here’s how to save them to a CSV file:
import csv
from usp.tree import sitemap_tree_for_homepage
# Define the target website
url = "https://www.asos.com/"
# Fetch and parse the sitemap
tree = sitemap_tree_for_homepage(url)
# Extract all URLs
urls = [page.url for page in tree.all_pages()]
# Save URLs to a CSV file
csv_filename = "asos_sitemap_urls.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])  # Write header
    for url in urls:
        writer.writerow([url])
print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")
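Large sites sometimes list the same URL in more than one sub-sitemap, so it can be worth deduplicating before saving. A minimal sketch, using a made-up URL list in place of the real tree.all_pages() output:

```python
# Sample list with a repeat -- stand-in for [page.url for page in tree.all_pages()]
urls = [
    "https://www.asos.com/page-a",
    "https://www.asos.com/page-b",
    "https://www.asos.com/page-a",  # duplicate from an overlapping sub-sitemap
]

# dict.fromkeys keeps the first occurrence of each URL and preserves
# insertion order (guaranteed for dicts since Python 3.7)
unique_urls = list(dict.fromkeys(urls))
print(f"{len(urls)} URLs, {len(unique_urls)} unique")
```

Using dict.fromkeys rather than set() keeps the URLs in crawl order, which makes the resulting CSV easier to diff between runs.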
In this blog, we explored how to efficiently crawl sitemaps using the ultimate-sitemap-parser (usp) library. Instead of manually handling XML parsing and dealing with nested index sitemaps, usp simplifies the process with just a few lines of code.
We demonstrated how to fetch and parse a sitemap, filter the extracted URLs by pattern, and save the results to a CSV file, while usp automatically handles nested sitemaps. With usp, crawling sitemaps becomes effortless, making it a great choice for web scraping, SEO analysis, and website auditing.
If you want to explore more about usp, check out its official documentation and the ultimate-sitemap-parser repository on GitHub. Whether you're building a large-scale scraper or just need quick access to a website's URLs, usp is a powerful tool to have in your Python toolkit. Happy Scraping!