Web scraping has evolved from simple rule-based extraction to more advanced techniques that rely on large language models (LLMs) for context-aware data extraction. ScrapegraphAI is at the forefront of this evolution, enabling web scraping through powerful LLMs from providers like OpenAI and Gemini, as well as locally hosted models served through Ollama. In this blog, we'll dive into what ScrapegraphAI is, how it works, and walk through a real-world example of scraping data from a website with proxy integration.
ScrapegraphAI is a robust, open-source web scraping framework that leverages large language models to dynamically extract data from websites. Unlike traditional scrapers that rely on rigid CSS selectors or XPath, ScrapegraphAI uses LLMs to interpret and extract structured data from a wide range of sources, including dynamic web pages and files such as PDFs. Simply specify the information you're after and let ScrapegraphAI do the heavy lifting, providing a more flexible and lower-maintenance option than traditional scraping tools. A key feature of ScrapegraphAI is its ability to let users define a schema for the data they want to extract: you specify a structured format for your output, and ScrapegraphAI shapes the extracted data to match it.
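ScrapegraphAI itself accepts a Pydantic model through the `schema` parameter of its pipelines. As a library-agnostic sketch of the same idea, the snippet below (stdlib only; the `Book` dataclass and `to_book` helper are hypothetical names, not part of ScrapegraphAI) shows how loosely keyed scraped dicts can be coerced into one fixed shape:

```python
from dataclasses import dataclass, fields

# Hypothetical target schema. ScrapegraphAI takes a Pydantic model via its
# `schema` parameter; this stdlib sketch only illustrates the idea of
# forcing raw scraped dicts into a fixed, predictable shape.
@dataclass
class Book:
    name: str
    price: str
    availability: str
    reviews: str

def to_book(raw: dict) -> Book:
    """Map a loosely keyed scraped dict onto the Book schema."""
    key_map = {
        "Book Name": "name",
        "Price": "price",
        "Availability": "availability",
        "Reviews": "reviews",
    }
    kwargs = {key_map.get(k, k): v for k, v in raw.items()}
    # Fill any missing fields with a placeholder so the shape stays stable
    for f in fields(Book):
        kwargs.setdefault(f.name, "NA")
    return Book(**kwargs)

scraped = {"Book Name": "Sharp Objects", "Price": "£47.82", "Availability": "In stock"}
book = to_book(scraped)
print(book.name, book.price, book.reviews)  # → Sharp Objects £47.82 NA
```

Whichever form the schema takes, the benefit is the same: downstream code can rely on every record having the same fields, even when the LLM omits one.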
One of the standout features of ScrapegraphAI is its flexibility in choosing LLMs: it supports cloud providers such as OpenAI and Gemini, as well as local models served through Ollama.
ScrapegraphAI offers several standard scraping pipelines to fit various needs, including SmartScraperGraph for single-page extraction, SearchGraph for scraping across pages surfaced by a search query, and SpeechGraph, which turns scraped content into audio.
In the next section, we’ll focus on the SmartScraperGraph, which allows for single-page scraping by simply providing a prompt and a source URL.
To follow along, you'll need to install a few dependencies. You can do this by running the following commands:
pip install scrapegraphai openai python-dotenv
playwright install
Once you’ve installed these, make sure you have your OpenAI API Key ready. Store it in a .env file to keep your credentials secure:
OPENAI_APIKEY=your_openai_api_key
Let's say we want to extract information about all the books on Books to Scrape, including each book's name, price, availability, and reviews.
Here’s a code example using ScrapegraphAI’s SmartScraperGraph pipeline:
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Load the OpenAI API key from the .env file
load_dotenv()
openai_key = os.getenv("OPENAI_APIKEY")

# Define the configuration for the LLM
graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o-mini",
    },
}

prompt = """
Extract all the books from this website including
- Book Name
- Price
- Availability
- Reviews
"""

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt=prompt,
    source="https://books.toscrape.com/",
    config=graph_config
)

# Run the scraper and print the result
if __name__ == '__main__':
    result = smart_scraper_graph.run()
    print(result)
The configuration selects the GPT-4o-mini model by providing the API key and specifying the model name. The run() method starts the scraping process, and the result is printed as a list of dictionaries, each containing details about a book. Here's an example of what the output might look like:
[
    {'Book Name': 'A Light in the Attic', 'Price': '£51.77', 'Availability': 'In stock', 'Reviews': 'NA'},
    {'Book Name': 'Tipping the Velvet', 'Price': '£53.74', 'Availability': 'In stock', 'Reviews': 'NA'},
    {'Book Name': 'Soumission', 'Price': '£50.10', 'Availability': 'In stock', 'Reviews': 'NA'},
    {'Book Name': 'Sharp Objects', 'Price': '£47.82', 'Availability': 'In stock', 'Reviews': 'NA'},
    # ... more books ...
]
As you can see, the scraper successfully pulls details for each book in a structured format, ready for use in your data pipeline.
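Once the result is in this list-of-dicts form, plain standard-library Python is enough to clean and export it. The sketch below (the `parse_price` helper and the two sample records are illustrative, not part of ScrapegraphAI) sums the prices and writes the records to CSV:

```python
import csv
import io

# Sample of the structure SmartScraperGraph returned above
books = [
    {"Book Name": "A Light in the Attic", "Price": "£51.77", "Availability": "In stock", "Reviews": "NA"},
    {"Book Name": "Tipping the Velvet", "Price": "£53.74", "Availability": "In stock", "Reviews": "NA"},
]

def parse_price(price: str) -> float:
    """Strip the currency symbol and convert to float (hypothetical helper)."""
    return float(price.lstrip("£$€"))

total = sum(parse_price(b["Price"]) for b in books)

# Write the records to CSV, e.g. for a downstream data pipeline
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=books[0].keys())
writer.writeheader()
writer.writerows(books)
print(f"{len(books)} books, total £{total:.2f}")  # → 2 books, total £105.51
```

Swap `io.StringIO` for a real file handle to persist the CSV to disk.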
When scraping at scale or targeting websites with anti-scraping measures, integrating proxies becomes essential to avoid IP bans, captchas, and rate-limiting. Using proxies not only provides anonymity but also ensures that you can scrape large amounts of data without interruptions.
One of the best options for this is residential proxies, as they come from real residential IP addresses, making them harder to detect and block.
Residential proxies from ProxyScrape are perfect for web scraping scenarios, especially when targeting websites with strict anti-scraping measures. We offer rotating IP addresses from various locations, ensuring that your requests appear as if they are coming from real users. This helps to bypass restrictions, evade bans, and ensure continuous access to the data you need.
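Just like the API key, proxy credentials belong in your .env file rather than in code. The small helper below assembles the Playwright-style proxy dict used in the next example from environment variables; the function name and the `PROXY_SERVER`/`PROXY_USERNAME`/`PROXY_PASSWORD` variable names are assumptions for illustration, so adjust them to your own setup:

```python
import os

def proxy_from_env() -> dict:
    """Assemble a Playwright-style proxy dict from environment variables.

    PROXY_SERVER / PROXY_USERNAME / PROXY_PASSWORD are hypothetical
    variable names -- rename them to match your own .env layout.
    """
    proxy = {"server": os.getenv("PROXY_SERVER", "rp.proxyscrape.com:6060")}
    username = os.getenv("PROXY_USERNAME")
    password = os.getenv("PROXY_PASSWORD")
    if username and password:
        proxy["username"] = username
        proxy["password"] = password
    return proxy

# The returned dict plugs straight into graph_config["loader_kwargs"]["proxy"]
print(proxy_from_env())
```

Keeping the credentials in the environment means the same script works unchanged across machines and keeps secrets out of version control.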
Now let’s see how proxies are integrated with ScrapegraphAI:
import os
from dotenv import load_dotenv
from scrapegraphai.graphs import SmartScraperGraph

# Load the OpenAI API key from the .env file
load_dotenv()
openai_key = os.getenv("OPENAI_APIKEY")

# Define the configuration with proxy integration
graph_config = {
    "llm": {
        "api_key": openai_key,
        "model": "openai/gpt-4o-mini",
    },
    "loader_kwargs": {
        "proxy": {
            "server": "rp.proxyscrape.com:6060",
            "username": "your_username",
            "password": "your_password",
        },
    },
}

prompt = """
Extract all the books from this website including
- Book Name
- Price
- Availability
- Reviews
"""

# Create the SmartScraperGraph instance
smart_scraper_graph = SmartScraperGraph(
    prompt=prompt,
    source="https://books.toscrape.com/",
    config=graph_config
)

# Run the scraper and print the result
if __name__ == '__main__':
    result = smart_scraper_graph.run()
    print(result)
The proxy settings are passed through the loader_kwargs key in the graph_config. Here, you define your proxy server address, username, and password.

In this blog, we explored the power of ScrapegraphAI, a modern web scraping tool that uses large language models (LLMs) to extract structured data from websites intelligently. We walked through its key features, including various scraping pipelines like the SmartScraperGraph, and provided a practical example of scraping book data from a website using OpenAI's GPT-4o-mini model.
Also, we showed how to integrate proxies, especially ProxyScrape's residential proxies. Proxies are crucial for staying anonymous, bypassing restrictions, and maintaining data access, especially with sites that use anti-scraping tactics like IP bans or rate limits.
By integrating ProxyScrape's residential proxies, you ensure your web scraping activities are more efficient, secure, and scalable, even on the most challenging websites.