The Easy Way to Crawl Sitemaps with Python

Python, Feb-07-2025 · 5 mins read

Sitemaps play a crucial role in SEO and web crawling by providing a structured list of URLs that a website wants search engines to index. Instead of scraping a website by following links page by page, crawling the sitemap is a much faster way to discover all available URLs.

However, manually parsing sitemaps comes with challenges. Websites often use index sitemaps, which link to multiple smaller sitemaps, requiring extra steps to handle them correctly. Some sitemaps may also contain thousands of URLs, making it tedious to extract everything efficiently.
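To see why manual parsing is tedious, here is a rough sketch of the do-it-yourself approach using Python's standard-library XML parser. The sitemap fragment is inlined for illustration (a real script would fetch it over HTTP first); note how the sitemap XML namespace must be spelled out in every tag lookup:

```python
import xml.etree.ElementTree as ET

# A minimal sitemap fragment, inlined for illustration.
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page-1</loc></url>
  <url><loc>https://example.com/page-2</loc></url>
</urlset>"""

# Sitemaps live in their own XML namespace, which must accompany
# every lookup -- one of the chores a dedicated parser removes.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

root = ET.fromstring(SITEMAP_XML)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
print(urls)  # ['https://example.com/page-1', 'https://example.com/page-2']
```

And this only covers a single flat sitemap; index sitemaps, gzip-compressed files, and malformed XML all add further manual work.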

This is where ultimate-sitemap-parser (usp) simplifies the process. It is a Python library designed to automatically:

  • Fetch and parse XML sitemaps.
  • Handle nested index sitemaps without extra code.
  • Extract all URLs efficiently with a simple function call.

This guide will show you how to use ultimate-sitemap-parser to crawl the ASOS sitemap and quickly extract all the available URLs.

Prerequisites

Before we begin crawling sitemaps, ensure you have the following set up:

1. Install Python

To run the script, you need Python installed on your system. If you haven’t installed it yet:

  • Download and install the latest version of Python from the official website.
  • Verify the installation by running the following command in your terminal or command prompt:
python3 --version

2. Install ultimate-sitemap-parser

Next, install the ultimate-sitemap-parser (usp) library using pip:

pip install ultimate-sitemap-parser

Crawling Sitemaps with `ultimate-sitemap-parser` (usp)

Now that we have ultimate-sitemap-parser installed, let’s dive into how to use it to extract URLs from the ASOS sitemap. We’ll cover the basic functionality of usp and highlight some of its key features.

1. Fetching the Sitemap and Extracting URLs

The ultimate-sitemap-parser library provides a simple way to fetch and parse a website’s sitemap. Instead of manually handling XML parsing, usp does it for us with just a few lines of code.

Let’s start by extracting all URLs from the ASOS sitemap:

from usp.tree import sitemap_tree_for_homepage

# Define the target website
url = "https://www.asos.com/"

# Fetch and parse the sitemap
tree = sitemap_tree_for_homepage(url)

# Extract and print all URLs
for page in tree.all_pages():
    print(page.url)

2. Handling Nested Sitemaps

One of the biggest advantages of usp is its ability to automatically handle nested index sitemaps. Some websites, like ASOS, structure their sitemaps into multiple sub-sitemaps (e.g., product pages, categories, blog pages). usp will:

  • Identify index sitemaps and fetch their child sitemaps.
  • Recursively extract URLs from all linked sitemaps.

The script above will return all URLs across all sitemaps without needing any extra code.
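To make it concrete what "recursively extract URLs" involves, here is a minimal sketch of that recursion done by hand. The documents are inlined in a dictionary as stand-ins for HTTP fetches, and the URLs are illustrative; usp performs this descent (plus fetching, gzip handling, and error recovery) for you:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Inline stand-ins for fetched documents: an index sitemap pointing
# at two child sitemaps (real code would download each one).
DOCS = {
    "https://example.com/sitemap_index.xml": """<?xml version="1.0"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap_products.xml</loc></sitemap>
  <sitemap><loc>https://example.com/sitemap_blog.xml</loc></sitemap>
</sitemapindex>""",
    "https://example.com/sitemap_products.xml": """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/product/1</loc></url>
</urlset>""",
    "https://example.com/sitemap_blog.xml": """<?xml version="1.0"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/hello</loc></url>
</urlset>""",
}

def collect_urls(sitemap_url):
    """Recursively collect page URLs, descending into index sitemaps."""
    root = ET.fromstring(DOCS[sitemap_url])
    if root.tag.endswith("sitemapindex"):
        urls = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            urls.extend(collect_urls(loc.text))
        return urls
    return [loc.text for loc in root.findall("sm:url/sm:loc", NS)]

all_urls = collect_urls("https://example.com/sitemap_index.xml")
print(all_urls)  # ['https://example.com/product/1', 'https://example.com/blog/hello']
```

With usp, all of this collapses into the single `tree.all_pages()` call shown earlier.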

3. Extracting Only a Subset of URLs

If you're interested in extracting specific types of URLs (e.g., only product pages), you can filter them based on patterns.

For example, if ASOS product URLs contain /product/, we can modify our script like this:

# Extract and filter only product page URLs
product_urls = [page.url for page in tree.all_pages() if "/product/" in page.url]

# Print filtered URLs
for url in product_urls:
    print(url)
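One caveat: a plain substring check can over-match, for example when "/product/" appears in a query string rather than the path. A slightly more precise filter parses each URL and checks only its path component. The sample URLs below are illustrative; in the real script they would come from `tree.all_pages()`:

```python
from urllib.parse import urlparse

# Illustrative URLs standing in for tree.all_pages() results.
urls = [
    "https://www.asos.com/product/12345",
    "https://www.asos.com/blog/new-season",
    "https://www.asos.com/help?return=/product/999",
]

# Match "/product/" only in the URL path, ignoring query strings.
product_urls = [u for u in urls if "/product/" in urlparse(u).path]
print(product_urls)  # ['https://www.asos.com/product/12345']
```

For most sitemaps the simple substring check is fine, but the path-based version avoids surprises on sites with search or redirect URLs.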

4. Storing URLs in a File

Instead of printing URLs, we might want to store them for later use. Here’s how to save them to a CSV file:

import csv
from usp.tree import sitemap_tree_for_homepage

# Define the target website
url = "https://www.asos.com/"

# Fetch and parse the sitemap
tree = sitemap_tree_for_homepage(url)

# Extract all URLs
urls = [page.url for page in tree.all_pages()]

# Save URLs to a CSV file
csv_filename = "asos_sitemap_urls.csv"
with open(csv_filename, "w", newline="", encoding="utf-8") as file:
    writer = csv.writer(file)
    writer.writerow(["URL"])  # Write header
    for url in urls:
        writer.writerow([url])

print(f"Extracted {len(urls)} URLs and saved to {csv_filename}")
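Large sites sometimes list the same URL in more than one sub-sitemap, so it can be worth deduplicating before writing the file. A small sketch, using an illustrative list in place of the `tree.all_pages()` output:

```python
# Illustrative list; in the full script this comes from tree.all_pages().
urls = [
    "https://www.asos.com/a",
    "https://www.asos.com/b",
    "https://www.asos.com/a",  # duplicate listed by a second sub-sitemap
]

# Deduplicate while preserving first-seen order
# (dict keys keep insertion order in Python 3.7+).
unique_urls = list(dict.fromkeys(urls))
print(unique_urls)  # ['https://www.asos.com/a', 'https://www.asos.com/b']
```

Writing `unique_urls` instead of `urls` to the CSV keeps the output file free of repeats.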

Conclusion

In this blog, we explored how to efficiently crawl sitemaps using the ultimate-sitemap-parser (usp) library. Instead of manually handling XML parsing and dealing with nested index sitemaps, usp simplifies the process with just a few lines of code.

We demonstrated:

  • How to extract URLs from the ASOS sitemap.
  • How usp automatically handles nested sitemaps.
  • How to store extracted URLs in a CSV file for further analysis.

With usp, crawling sitemaps becomes effortless, making it a great choice for web scraping, SEO analysis, and website auditing.

Next Steps

If you want to explore more about usp, check out its official documentation and GitHub repository.

Whether you're building a large-scale scraper or just need quick access to a website’s URLs, usp is a powerful tool to have in your Python toolkit. Happy Scraping!