In today’s world, everyone makes use of new technologies, and web scraping gives you automated access to structured data on the web. One technique you will often run into when scraping modern sites is infinite scrolling.
Infinite scrolling, also known as endless scrolling, is a web design technique in which websites use AJAX or JavaScript to load additional content dynamically when the user scrolls down to the bottom of the webpage. This technique gained popularity due to its success on social media sites. For instance, infinite scrolling on Twitter is produced through asynchronous loading: Twitter makes AJAX calls after the page has loaded and continuously appends new content as the user scrolls. Though infinite scrolling has many advantages, it is not recommended for goal-oriented finding tasks that require people to locate particular content.
Let’s first understand the benefits of scraping infinite scrolling pages.
Following are some of the reasons to scrape infinite scrolling pages.
Apart from the above benefits of scraping infinite scrolling pages, there are some cons as well:
Let’s see how to scrape infinite scrolling pages using Python with the steps below.
You need to import the Selenium library.
from selenium import webdriver
from selenium.webdriver.common.by import By  # used later to locate elements
from selenium.webdriver.common.keys import Keys
import time
Here you have to choose the browser that you want to use. We will go with Chrome as it offers more options than Firefox.
def get_selenium():
    # Run Chrome headless, in incognito mode, and ignore certificate errors
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    return driver
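For a quick sanity check, you can use the helper like this (the URL here is just a placeholder):

driver = get_selenium()
driver.get("https://example.com")  # placeholder URL
print(driver.title)  # confirms the page actually loaded
driver.quit()  # release the browser when you are done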
The headless argument mentioned above is pretty important. Selenium will not open Chrome in a new window when it runs headless in Python. However, if you encounter a problem while scraping, you can comment out the headless option and see what’s going on in Chrome and what is loaded on the page.
We can omit the two flags, i.e., ignore-certificate-errors and incognito.
If you encounter a captcha or a cookie banner that prevents your page from loading, you can click OK and proceed to the page normally. However, if the browser closes unexpectedly, you can use time.sleep() to pause the code and give yourself ample time to debug.
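For instance, while debugging you might temporarily build the driver with the headless flag commented out and pause long enough to inspect the page. A sketch of that variant (get_selenium_debug and the URL are illustrative):

def get_selenium_debug():
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    # options.add_argument('--headless')  # disabled so the Chrome window stays visible
    return webdriver.Chrome(options=options)

driver = get_selenium_debug()
driver.get("https://example.com")  # placeholder URL
time.sleep(60)  # keep the window open for a minute to inspect what loaded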
You need to inspect your page’s HTML structure to handle infinite scrolling, following the steps below.
You can consider the example below for a better understanding.
selenium = get_selenium()
selenium.get("your/url")
last_elem = ''
while True:
    # CSS selector for the last item currently loaded in the list
    current_last_elem = "#my-div > ul > li:last-child"
    # scroll the last element into view to trigger loading of more content
    scroll = "document.querySelector('" + current_last_elem + "').scrollIntoView();"
    selenium.execute_script(scroll)
    time.sleep(3)  # give the page time to load the new content
    # read the text of the last element after the scroll
    current_elem = selenium.find_element(By.CSS_SELECTOR, current_last_elem).text
    if last_elem == current_elem:
        # the last element did not change, so no new content was loaded
        break
    else:
        last_elem = current_elem
In the above code, we used JavaScript inside Python: the scroll command is a plain JavaScript snippet that Selenium executes in the browser.
Here, if you are building the URL from a search keyword, the call takes the form
selenium.get("your/url.com/{0}".format(keyword))
and the scroll command is assembled as
"document.querySelector('" + .. + "').scrollIntoView();"
The format has to be exactly right, so pay attention to the single and double quotes and the escape characters: the JavaScript snippet is wrapped in Python double quotes, so the CSS selector inside it must use single (or escaped) quotes.
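If the quoting becomes error-prone, note that Selenium can also pass a Python element object straight into the script as arguments[0], which avoids building the selector string by hand. A minimal sketch, reusing the same illustrative selector:

last_li = selenium.find_element(By.CSS_SELECTOR, "#my-div > ul > li:last-child")
# the element is injected into the script as arguments[0], so no quoting is needed
selenium.execute_script("arguments[0].scrollIntoView();", last_li)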
Some of the frequently occurring problems when doing infinite scrolling are as follows:
It is possible to trigger a JS script from within Python and get a list back as a result.
For instance, we can use the code below to get the sources from all the images on the page.
js_script = '''\
// collect the src of every image on the page into a list
var jslist = [];
document.querySelectorAll('img').forEach(i => jslist.push(i.src));
return jslist;
'''
python_list = selenium.execute_script(js_script)
In the above code, the JavaScript collects the src of every image on the page into a list and returns that list to Python.
We can use the same approach for the href links by selecting all the anchor tags instead of the images.
Afterwards, we can run the script with selenium.execute_script() and store the value returned by the JS in a Python variable, i.e., python_list. The link version looks like the sketch below.
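A minimal sketch of the href variant (the variable name href_list is illustrative):

js_script = '''\
// collect the href of every anchor tag on the page
var jslist = [];
document.querySelectorAll('a').forEach(a => jslist.push(a.href));
return jslist;
'''
href_list = selenium.execute_script(js_script)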
This is how we can scrape infinite scrolling pages using Python.
You know that a proxy is a third-party server that acts as an intermediary between a client requesting a resource and a server providing that resource. If you want to use proxies with Selenium and Python, you can use the following lines of code.
# hostname and port are the address and port of your proxy server
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s:%s' % (hostname, port))
driver = webdriver.Chrome(options=chrome_options)
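To confirm that traffic is actually going through the proxy, you can load a page that echoes the IP address it sees. A sketch with placeholder proxy details:

hostname = "proxy.example.com"  # placeholder proxy host
port = "8080"  # placeholder proxy port
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server=%s:%s' % (hostname, port))
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://httpbin.org/ip")  # httpbin returns the origin IP it sees
print(driver.page_source)
driver.quit()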
For handling infinite scrolling, you can use scroll-proxy, which supports programmatic scrolling of the scrollable views within a view hierarchy. If you use npm, you can install scroll-proxy with the command below. We will use JavaScript to demonstrate scroll-proxy.
npm install scroll-proxy --save
After installing scroll-proxy, you can instantiate a ScrollProxy object using the below code.
var myScroll = new ScrollProxy();
You can see we did not pass any arguments to the ScrollProxy constructor because it will report actions by default when the user scrolls the page.
However, if you want to get updates when the user scrolls inside some specific HTML element, you have to pass it into the constructor.
var myDiv = document.querySelector('.scrollable');
var myDivScroll = new ScrollProxy(myDiv);
Below are some reasons to use proxies while scraping infinite scrolling.
We discussed that infinite scrolling is preferred when the user isn’t looking for specific information. News websites and social media feeds that constantly generate new content can benefit from infinite scrolling. On the other hand, business pages and e-commerce sites aren’t good candidates for infinite scrolling, as users there seek specific information. Further, we discussed the steps involved in scraping infinite scrolling pages using Selenium. We can also use rotating residential proxies for handling infinite scrolling, as they help avoid captchas and filter out suspicious header requests.
We hope you now have a solid understanding of how to scrape infinite scrolling pages using Python.