Use Selenium with Python to Target the XPath of a Particular Object
Extract data from each item in a list of items instead of the whole webpage.
Introduction
Earlier this year, I needed to write a program for my friend to collect the data of one of the NFT collections on the NFTrade site. He wanted the following data: every NFT currently for sale in the collection, and each NFT's price in US dollars based on the current market price of BNB, the cryptocurrency the NFTs are listed in. He also wanted it all listed out line by line in a CSV file that he could sort and manipulate.
Unfortunately, the NFTrade website does not have a public API so instead of writing a Node.js script to fetch the data via HTTP calls, I built a small script to navigate to the website page and actually "scrape" the data off of it.
Having not written a web scraper before, I chose to write the program in Python as it seems to be a very popular programming language choice for a task such as this. While I built this scraper, the project requirements evolved and got more complex, and I learned a bunch of useful new techniques when using Python, which I'll be sharing in a series of posts over the coming months.
After I'd settled on using the Selenium Python package to start a Selenium WebDriver instance and scrape the data from NFTrade, I hit a snag: I was collecting all the individual NFT data to loop through and pull the details from, but each time the loop ran it only collected the data from the first NFT in the list.
I was stumped and turned to the Internet for help and (once again) Stack Overflow came through for me.
When extracting scraped web data using Selenium WebDriver's XPath to target the data, a period (.) must be added in front of the XPath in order to search inside a particular element instead of the whole document. I'll show you how to do it in this post.
NOTE: I am not normally a Python developer so my code examples may not be the most efficient or elegant Python code ever written, but they get the job done.
Selenium Python package
For my project, I ended up using the Selenium Python package because it can scrape websites with dynamically loaded data, which is how NFTrade works. A user lands on a collection of NFTs and as they scroll down the page, periodically more NFTs are loaded into the browser window via JavaScript.
Selenium Python operates the Selenium WebDriver software and drives a browser just like a user would, and although its original use was for automated end-to-end testing of software, it can also be used to scrape data off of live web pages.
If you'd like to read more about how to use Selenium Python for this, I encourage you to visit my first blog post on this subject where I go in depth on it.
What is XPath?
Among the many useful tools included with the Selenium Python package are the find_element() and find_elements() methods, used together with locator strategies like By.XPATH.
The find_element methods do what their name implies: find an element (or elements) given a By strategy and a locator. By accepts element IDs, names, attributes, XPaths, class names, etc. And XPath is a syntax used to navigate through the elements and attributes of a standard XML document (or webpage).
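XPath's navigation model can be tried out without a browser at all. As a rough sketch, Python's standard-library xml.etree.ElementTree understands a small subset of XPath (the markup below is illustrative, not NFTrade's actual HTML):

```python
import xml.etree.ElementTree as ET

# Illustrative markup, not NFTrade's actual HTML; ElementTree
# supports a limited subset of XPath expressions.
page = ET.fromstring(
    '<html><body><div id="main"><p>hello</p></div></body></html>'
)

# Walk the tree by element name and attribute, XPath-style:
# ".//div[@id='main']/p" means "any descendant div with id='main',
# then its direct child p".
element = page.find('.//div[@id="main"]/p')
print(element.text)  # hello
```

Selenium's By.XPATH accepts the same kind of expression, but evaluates it against the live page in the browser.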
Due to the NFTrade site's structure, I used XPath expressions to identify all the individual NFTs on the page and scrape the data from each NFT to include later in my CSV file.
The Problem: XPath was targeting the whole document
After I'd written the code to initially fire up my Selenium WebDriver instance, navigate to the NFTrade site, and load the NFTs into the browser that I wanted to scrape the data from, I had a list of NFT info I needed to slim down to just the data points I wanted to include in the CSV.
NOTE: If you'd like to see how to scrape the browser data in-depth, read my first blog post here.
For me, this was things like the NFT ID number (part of the NFT card's name) and the NFT's sale price in BNB.
Inside of the __main__ method in my Python script, I'd scraped the data from the webpage with the get_cards() method, and then I wanted to loop through the NFT data I'd collected and extract the data points from each NFT card with my get_nft_data() method.
Here's the __main__ method code for reference:
for_sale_scraper.py
if __name__ == '__main__':
    scraper = ForSaleNFTScraper()
    cards = scraper.get_cards(max_card_count=200)
    card_data = []
    for card in cards:
        info = scraper.get_nft_data(card)
        card_data.append(info)
    # pprint the card data to ensure we're getting the correct data out of each card
    pprint(card_data)
And here is the code for my get_nft_data() method:
def get_nft_data(self, nft_data):
    """Extracts and prints out NFT specific data."""
    nft_name_element = self.driver.find_element(By.XPATH, '//div[contains(@class, "Item_itemName__ckoHR")]')
    nft_name = nft_name_element.get_attribute("innerHTML")
    nft_id = nft_name.partition('#')[-1]
    nft_price_element = nft_data.find_element(By.XPATH, '//div[contains(@class, "Item_itemPriceValueTxt__lblqJ")]')
    nft_price = nft_price_element.get_attribute("innerHTML")
    return {
        'id': int(nft_id), 'price': nft_price
    }
What my get_nft_data() method is supposed to do is:
- Use an XPath within each card to get the NFT's name via get_attribute("innerHTML") (innerHTML targets the text content of the element, which includes its ID number at the end of the name), partition() the string into a tuple based on the # included in the name, and take the last element in the tuple (which is the ID number).
- It should also get the NFT's price (also using get_attribute("innerHTML")) via a second XPath within the NFT card.
- Finally, return those two values together as a new object with the keys id and price.
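The partition() step is easy to see in isolation with a made-up card name in the "NAME #ID" shape described above:

```python
# A made-up card name in the "NAME #ID" shape described above.
nft_name = "NFT_CARD #1234"

# partition() splits on the first '#' and returns a 3-tuple:
# (text before the separator, the separator itself, text after).
parts = nft_name.partition('#')
print(parts)  # ('NFT_CARD ', '#', '1234')

# The last element of the tuple is the ID number as a string.
nft_id = int(parts[-1])
print(nft_id)  # 1234
```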
In theory, I wanted that to happen for every NFT I'd collected into my card_data list. In practice, I got 200 elements that all contained the ID and price data from the very first card in my card_data list.
Not quite what I was hoping for.
The Solution: How to restrict XPath inside a particular element
After several failed variations of the code above and searching through many Stack Overflow posts without success, I finally wrote my own SO post explaining my situation and asking for help from the greater web development community.
If you'd like to see my original Stack Overflow post and the helpful responses provided, here is a link.
Just over 30 minutes after posting my question, a kind soul answered it and got me moving forward again. That's the power of the web dev community at its finest.
Below is the corrected get_nft_data() code that actually gets the data from each individual NFT as the data is looped over. I also added some comments between lines of code to explain what's happening at each step.
def get_nft_data(self, nft_data):
    """Extracts and prints out card specific data."""
    # get full card name "NFT_CARD #1234" by XPATH
    nft_name_element = nft_data.find_element(By.XPATH, './/div[contains(@class, "Item_itemName__ckoHR")]')
    nft_name = nft_name_element.text
    # parse out just ID number from name
    nft_id = nft_name.partition('#')[-1]
    # get nft recently sold value by XPATH
    nft_bnb_sale_price = nft_data.find_elements(By.XPATH, './/div[contains(@class, "Item_itemPriceValueTxt__lblqJ")]')
    # if there is a for sale price, take it
    if nft_bnb_sale_price:
        nft_price = nft_bnb_sale_price[0].text
    else:
        # if there's no value, just put None in place of a value
        nft_price = None
    return {
        'id': int(nft_id),
        'nft_price': nft_price
    }
In this version of the Python code, there are three main differences.
- The first is that instead of using self.driver.find_element, this code uses nft_data.find_element. Substituting nft_data for self.driver restricts the XPath search to a particular element.
- Second, inside each find_element method referencing By.XPATH, the XPath being passed has a . in front of it. So '//div[contains(@class, "Item_itemName__ckoHR")]' becomes './/div[contains(@class, "Item_itemName__ckoHR")]'. The dot (.) restricts the XPath's search to the inside of a particular element (or "context node"). If the . is not included, XPath searches the whole document, which is why it was always finding the values from the first NFT element each time the loop ran.
- Last, the user suggested I could use .text to get the NFT name and BNB price instead of having to write out the lengthier .get_attribute("innerHTML") each time to reach the text in the NFT, which was a nice improvement in code readability.
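The context-node behavior is easy to demonstrate without a browser. Python's standard-library ElementTree honors the same leading-dot convention, so this sketch (with illustrative card markup, not NFTrade's real class names) shows why each per-card search now stays inside its own card:

```python
import xml.etree.ElementTree as ET

# Two illustrative "cards", each with its own name element.
doc = ET.fromstring(
    '<html>'
    '<div class="card"><span class="name">NFT #1</span></div>'
    '<div class="card"><span class="name">NFT #2</span></div>'
    '</html>'
)

cards = doc.findall('.//div[@class="card"]')

# The leading dot roots each search at the card itself, so every
# iteration sees only that card's own name element.
names = [card.find('.//span[@class="name"]').text for card in cards]
print(names)  # ['NFT #1', 'NFT #2']
```

Without the dot, a full XPath engine like Selenium's evaluates '//div[...]' from the document root regardless of which element the search was called on, which reproduces the first-card-every-time bug.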
The solution also mentioned there was a chance that some of the NFTs collected from the page might not have a price listed (NFTrade displays all NFTs in a collection, not just the ones for sale), and recommended wrapping the code that gets the nft_price in an if / else block: if the price is present, it will be collected and returned, and if it's not, the method will return None in place of the value and not throw an error.
Hence this code for checking the sale price:
# if there is a for sale price, take it
if nft_bnb_sale_price:
    nft_price = nft_bnb_sale_price[0].text
else:
    # if there's no value, just put None in place of a value
    nft_price = None
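This guard works because find_elements (unlike find_element, which raises an exception on no match) returns a list, and an empty list is falsy in Python. A browser-free sketch of the same pattern, using ElementTree's findall as a stand-in since it also returns an empty list when nothing matches:

```python
import xml.etree.ElementTree as ET

# An illustrative card that has a name but no price element.
card = ET.fromstring('<div><span class="name">NFT #7</span></div>')

# findall, like Selenium's find_elements, returns [] on no match
# rather than raising an exception.
matches = card.findall('.//span[@class="price"]')
print(matches)  # []

# An empty list is falsy, so the else branch supplies None.
price = matches[0].text if matches else None
print(price)  # None
```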
Test the refactored code
With my newly refactored get_nft_data() method at the ready, it was time to test it out in my Python script.
As a reminder, my __main__ method looked like this:
if __name__ == '__main__':
    scraper = ForSaleNFTScraper()
    cards = scraper.get_cards(max_card_count=200)
    card_data = []
    for card in cards:
        info = scraper.get_nft_data(card)
        card_data.append(info)
    # pprint the card data to ensure we're getting the correct data out of each card
    pprint(card_data)
This time, when I ran the script from the command line with python for_sale_scraper.py, here's a screenshot of the output I received.
As you can see from the image, I got an array of items, and each item in the array had a different id and nft_price from the others. Now the get_nft_data() method was working correctly, targeting the next NFT in the card_data array with each successive iteration of the loop and pulling out the data specific to that card.
Success!
With that hurdle overcome, I was ready to move on to the next steps of this project: converting the NFT's BNB prices into the current USD prices and assembling them into a CSV spreadsheet. Those tasks will be covered in detail in upcoming blog posts.
Conclusion
When I was asked to put together a spreadsheet of all the NFTs for sale in a particular collection on NFTrade, I ended up using Python to build a website scraper to accomplish it and learned a lot of new problem-solving techniques in the process.
I managed to load and collect all the NFT data from a web page with the assistance of the Selenium Python package, but I got thoroughly stuck when trying to iterate through my data to extract the ID and price for each NFT from it.
Luckily, the Stack Overflow community came through for me and taught me about the finer points of using Selenium WebDriver's XPath to target specific elements within a page instead of the whole page when looking for particular pieces of data, which got me unstuck and on my way to building my spreadsheet of data.
Thank goodness for the knowledge sharing of the online web development community - I am truly grateful to be able to turn to them when I've exhausted my own ideas to solve a problem.
Check back in a few weeks — I’ll be writing more blogs about the problems I had to solve while building this Python website scraper in addition to other topics on JavaScript or something else related to web development.
Thanks for reading. I hope learning how to restrict XPath searches to a particular element on a page instead of the entire page proves helpful for you in the future like it was for me.
References & Further Resources
- Selenium Python docs
- Selenium WebDriver docs
- NFTrade website
- XPath documentation
- Original Stack Overflow post
- Previous blog post about scraping data from a lazy-loading website using Selenium Python
Want to be notified first when I publish new content? Subscribe to my newsletter.