Use Selenium with Python to Target the XPath of a Particular Object
Extract data from each item in a list of items instead of the whole webpage.
Introduction
Earlier this year, I needed to write a program for my friend to collect the data of one of the NFT collections on the NFTrade site. He wanted the following data: every NFT currently for sale in the collection, and each NFT's price in US dollars based on the current market price of BNB, the cryptocurrency the NFTs are listed in. He also wanted it all listed out line by line in a CSV file that he could sort and manipulate.
Unfortunately, the NFTrade website does not have a public API so instead of writing a Node.js script to fetch the data via HTTP calls, I built a small script to navigate to the website page and actually "scrape" the data off of it.
Having not written a web scraper before, I chose to write the program in Python as it seems to be a very popular programming language choice for a task such as this. While I built this scraper, the project requirements evolved and got more complex, and I learned a bunch of useful new techniques when using Python, which I'll be sharing in a series of posts over the coming months.
After I'd settled on using the Selenium Python package to start a Selenium WebDriver instance and scrape the data from NFTrade, I hit a snag: I was collecting all the individual NFT data to loop through and pull the details from, but each time the loop ran it only collected the data from the first NFT in the list.
I was stumped and turned to the Internet for help and (once again) Stack Overflow came through for me.
When extracting scraped web data using Selenium WebDriver's XPath to target the data, a period (.) must be added in front of the XPath in order to search inside a particular element instead of the whole document. I'll show you how to do it in this post.
NOTE: I am not normally a Python developer so my code examples may not be the most efficient or elegant Python code ever written, but they get the job done.
Selenium Python package
For my project, I ended up using the Selenium Python package because it can scrape websites with dynamically loaded data, which is how NFTrade works. A user lands on a collection of NFTs and as they scroll down the page, periodically more NFTs are loaded into the browser window via JavaScript.
Selenium Python operates the Selenium WebDriver software and drives a browser just like a user would, and although its original use was for automated end-to-end testing of software, it can also be used to scrape data off of live web pages.
If you'd like to read more about how to use Selenium Python for this, I encourage you to visit my first blog post on this subject where I go in depth on it.
What is XPath?
Among the many useful tools included with the Selenium Python package are the find_element() and find_elements() methods, used together with locator strategies like By.XPATH.
The find_element methods do what their name implies: find an element (or elements) given a By strategy and a locator. By accepts element IDs, names, attributes, XPaths, class names, etc. And XPath is a syntax used to navigate through the elements and attributes of a standard XML document (or webpage).
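XPath's navigation model can be tried out without a browser at all. As a rough sketch, Python's standard-library xml.etree.ElementTree understands a small subset of XPath (the markup below is illustrative, not NFTrade's actual HTML):

```python
import xml.etree.ElementTree as ET

# Illustrative markup, not NFTrade's actual HTML; ElementTree
# supports a limited subset of XPath expressions.
page = ET.fromstring(
    '<html><body><div id="main"><p>hello</p></div></body></html>'
)

# Walk the tree by element name and attribute, XPath-style:
# ".//div[@id='main']/p" means "any descendant div with id='main',
# then its direct child p".
element = page.find('.//div[@id="main"]/p')
print(element.text)  # hello
```

Selenium's By.XPATH accepts the same kind of expression, but evaluates it against the live page in the browser.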
Due to the NFTrade site's structure, I used XPath expressions to identify all the individual NFTs on the page and scrape the data from each NFT to include later in my CSV file.
The Problem: XPath was targeting the whole document
After I'd written the code to initially fire up my Selenium WebDriver instance, navigate to the NFTrade site, and load the NFTs into the browser that I wanted to scrape the data from, I had a list of NFT info I needed to slim down to just the data points I wanted to include in the CSV.
NOTE: If you'd like to see how to scrape the browser data in-depth, read my first blog post here.
For me, this was things like the NFT ID number (part of the NFT card's name) and the NFT's sale price in BNB.
Inside of the __main__ method in my Python script, I'd scraped the data from the webpage with the get_cards() method, and then I wanted to loop through the NFT data I'd collected and extract the data points from each NFT card with my get_nft_data() method.
Here's the __main__ method code for reference:
for_sale_scraper.py
if __name__ == '__main__':
    scraper = ForSaleNFTScraper()
    cards = scraper.get_cards(max_card_count=200)
    card_data = []
    for card in cards:
        info = scraper.get_nft_data(card)
        card_data.append(info)
    # pprint the card data to ensure we're getting the correct data out of each card
    pprint(card_data)
And here is the code for my get_nft_data() method:
def get_nft_data(self, nft_data):
    """Extracts and prints out NFT specific data."""
    nft_name_element = self.driver.find_element(By.XPATH, '//div[contains(@class, "Item_itemName__ckoHR")]')
    nft_name = nft_name_element.get_attribute("innerHTML")
    nft_id = nft_name.partition('#')[-1]
    nft_price_element = nft_data.find_element(By.XPATH, '//div[contains(@class, "Item_itemPriceValueTxt__lblqJ")]')
    nft_price = nft_price_element.get_attribute("innerHTML")
    return {
        'id': int(nft_id), 'price': nft_price
    }
What my get_nft_data() method is supposed to do is:
- Use an XPath within each card to get the NFT's name via get_attribute("innerHTML") (innerHTML targets the text content of the element, which includes its ID number at the end of the name), partition() the string into a tuple based on the # included in the name, and take the last element in the tuple (which is the ID number).
- It should also get the NFT's price (also using get_attribute("innerHTML")) via a second XPath within the NFT card.
- Finally, return those two values together as a new object with the keys id and price.
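The partition() step is easy to see in isolation with a made-up card name in the "NAME #ID" shape described above:

```python
# A made-up card name in the "NAME #ID" shape described above.
nft_name = "NFT_CARD #1234"

# partition() splits on the first '#' and returns a 3-tuple:
# (text before the separator, the separator itself, text after).
parts = nft_name.partition('#')
print(parts)  # ('NFT_CARD ', '#', '1234')

# The last element of the tuple is the ID number as a string.
nft_id = int(parts[-1])
print(nft_id)  # 1234
```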
In theory, I wanted that to happen for every NFT I'd collected into my card_data list. In practice, I got 200 elements that all contained the ID and price data from the very first card in my card_data list.
Not quite what I was hoping for.
The Solution: How to restrict XPath inside a particular element
After several failed variations of the code above and searching through many Stack Overflow posts without success, I finally wrote my own SO post explaining my situation and asking for help from the greater web development community.
If you'd like to see my original Stack Overflow post and the helpful responses provided, here is a link.
Just over 30 minutes after posting my question, a kind soul answered it and got me moving forward again. That's the power of the web dev community at its finest.
Below is the corrected get_nft_data() code that actually gets the data from each individual NFT as the data is looped over. I also added some comments between lines of code to explain what's happening at each step.
def get_nft_data(self, nft_data):
    """Extracts and prints out card specific data."""
    # get full card name "NFT_CARD #1234" by XPATH
    nft_name_element = nft_data.find_element(By.XPATH, './/div[contains(@class, "Item_itemName__ckoHR")]')
    nft_name = nft_name_element.text
    # parse out just ID number from name
    nft_id = nft_name.partition('#')[-1]
    # get nft recently sold value by XPATH
    nft_bnb_sale_price = nft_data.find_elements(By.XPATH, './/div[contains(@class, "Item_itemPriceValueTxt__lblqJ")]')
    # if there is a for sale price, take it
    if nft_bnb_sale_price:
        nft_price = nft_bnb_sale_price[0].text
    else:
        # if there's no value, just put None in place of a value
        nft_price = None
    return {
        'id': int(nft_id),
        'nft_price': nft_price
    }
In this version of the Python code, there are three main differences.
- The first is that instead of using self.driver.find_element, this code uses nft_data.find_element. Substituting nft_data for self.driver restricts the XPath search to a particular element.
- Second, inside each find_element method referencing By.XPATH, the XPath being passed has a . in front of it. So '//div[contains(@class, "Item_itemName__ckoHR")]' becomes './/div[contains(@class, "Item_itemName__ckoHR")]'. The dot (.) restricts the XPath's search to the inside of a particular element (or "context node"). If the . is not included, XPath searches the whole document, which is why it was always finding the values from the first NFT element each time the loop ran.
- Last, the user suggested I could use .text to get the NFT name and BNB price instead of having to write out the lengthier .get_attribute("innerHTML") each time to reach the text in the NFT, which was a nice improvement in code readability.
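The context-node behavior is easy to demonstrate without a browser. Python's standard-library ElementTree honors the same leading-dot convention, so this sketch (with illustrative card markup, not NFTrade's real class names) shows why each per-card search now stays inside its own card:

```python
import xml.etree.ElementTree as ET

# Two illustrative "cards", each with its own name element.
doc = ET.fromstring(
    '<html>'
    '<div class="card"><span class="name">NFT #1</span></div>'
    '<div class="card"><span class="name">NFT #2</span></div>'
    '</html>'
)

cards = doc.findall('.//div[@class="card"]')

# The leading dot roots each search at the card itself, so every
# iteration sees only that card's own name element.
names = [card.find('.//span[@class="name"]').text for card in cards]
print(names)  # ['NFT #1', 'NFT #2']
```

Without the dot, a full XPath engine like Selenium's evaluates '//div[...]' from the document root regardless of which element the search was called on, which reproduces the first-card-every-time bug.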
The solution also mentioned there was a chance that some of the NFTs collected from the page might not have a price listed (NFTrade displays all NFTs in a collection, not just the ones for sale), and recommended wrapping the code that gets the nft_price in an if / else block: if the price is present, it will be collected and returned, and if it's not, the method will return None in place of the value and not throw an error.
Hence this code for checking the sale price:
# if there is a for sale price, take it
if nft_bnb_sale_price:
    nft_price = nft_bnb_sale_price[0].text
else:
    # if there's no value, just put None in place of a value
    nft_price = None
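This guard works because find_elements (unlike find_element, which raises an exception on no match) returns a list, and an empty list is falsy in Python. A browser-free sketch of the same pattern, using ElementTree's findall as a stand-in since it also returns an empty list when nothing matches:

```python
import xml.etree.ElementTree as ET

# An illustrative card that has a name but no price element.
card = ET.fromstring('<div><span class="name">NFT #7</span></div>')

# findall, like Selenium's find_elements, returns [] on no match
# rather than raising an exception.
matches = card.findall('.//span[@class="price"]')
print(matches)  # []

# An empty list is falsy, so the else branch supplies None.
price = matches[0].text if matches else None
print(price)  # None
```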
Test the refactored code
With my newly refactored get_nft_data() method at the ready, it was time to test it out in my Python script.
As a reminder, my __main__ method looked like this:
if __name__ == '__main__':
    scraper = ForSaleNFTScraper()
    cards = scraper.get_cards(max_card_count=200)
    card_data = []
    for card in cards:
        info = scraper.get_nft_data(card)
        card_data.append(info)
    # pprint the card data to ensure we're getting the correct data out of each card
    pprint(card_data)
This time, when I ran the script from the command line with python for_sale_scraper.py, here's a screenshot of the output I received.
As you can see from the image, I got an array of items, and each item in the array had a different id and nft_price from the others. Now the get_nft_data() method was working correctly, targeting the next NFT in the card_data array with each successive iteration of the loop and pulling out the data specific to that card.
Success!
With that hurdle overcome, I was ready to move on to the next steps of this project: converting the NFT's BNB prices into the current USD prices and assembling them into a CSV spreadsheet. Those tasks will be covered in detail in upcoming blog posts.
Conclusion
When I was asked to put together a spreadsheet of all the NFTs for sale in a particular collection on NFTrade, I ended up using Python to build a website scraper to accomplish it and learned a lot of new problem-solving techniques in the process.
I managed to load and collect all the NFT data from a web page with the assistance of the Selenium Python package, but I got thoroughly stuck when trying to iterate through my data to extract the ID and price for each NFT from it.
Luckily, the Stack Overflow community came through for me and taught me about the finer points of using Selenium WebDriver's XPath to target specific elements within a page instead of the whole page when looking for particular pieces of data, which got me unstuck and on my way to building my spreadsheet of data.
Thank goodness for the knowledge sharing of the online web development community - I am truly grateful to be able to turn to them when I've exhausted my own ideas to solve a problem.
Check back in a few weeks — I’ll be writing more blogs about the problems I had to solve while building this Python website scraper in addition to other topics on JavaScript or something else related to web development.
Thanks for reading. I hope learning how to restrict XPath searches to a particular element on a page instead of the entire page proves helpful for you in the future like it was for me.
References & Further Resources
- Selenium Python docs
- Selenium WebDriver docs
- NFTrade website
- XPath documentation
- Original Stack Overflow post
- Previous blog post about scraping data from a lazy-loading website using Selenium Python
Want to be notified first when I publish new content? Subscribe to my newsletter.