Using the Screaming Frog SEO Spider and OpenAI Embeddings to Map Related Pages at Scale

Posted 23 September, 2024 by Gus Pelogia in Screaming Frog SEO Spider, AI

Since Screaming Frog SEO Spider version 20.0 was released, SEOs can connect Screaming Frog and OpenAI for several use cases, including extracting embeddings from URLs.

Using embeddings is a powerful way to map URLs at scale, at high speed and low cost. In this blog post, we’ll explain step by step what they are and how to map them using Screaming Frog, ChatGPT (OpenAI API) and Google Colab. This post is a more complete version of my original post, gathering more use cases and feedback from SEOs who tried it.

After your crawl, all you need to do is upload a sheet and you’ll receive another one back, listing each source URL alongside its related pages. It’s that easy!

This article is a guest contribution from Gus Pelogia, Senior SEO Product Manager at Indeed.


Use Cases

Before we dive into the how, let’s explain the why. Mapping pages at scale has several use cases, such as:

  • Related pages, if you have a section on your website where you list related articles or suggested reads on the same topic
  • Internal linking beyond matching anchor text, so your links have better context because the page topics are related
  • Page tagging or clustering, for cases where you want to create link clusters or simply understand performance per topic, not per single page
  • Keyword relevance, as described on the iPullRank blog, where they explain a method to find the ideal page to rank for a keyword based on keyword and page content

What Are Embeddings?

Let’s get it straight from the horse’s mouth. According to Google on their Machine Learning (ML) crash course:

Embeddings make it easier to do machine learning on large inputs like sparse vectors representing words. Ideally, an embedding captures some of the semantics of the input by placing semantically similar inputs close together in the embedding space. An embedding can be learned and reused across models.

In my own SEO words: an embedding is a long list of numbers that represents the meaning of the content on a page, so pages about similar topics end up with similar numbers.

If this is still not clear, don’t get caught up on the concept. You can still find similar pages without knowing the theory.


What Is Cosine Similarity?

At this point, you have thousands of embeddings mapped. Each URL has hundreds or even thousands of long decimal numbers separated by commas. The next step is to understand cosine similarity. As this iPullRank article puts it: “The measure of relevance is the function of distance between embeddings”.

In my own SEO words: with embeddings, you transformed pages into numbers. With cosine similarity, you’re finding how topically close these numbers/words/pages are. Using the Google Colab script (more on it later) you can choose how many similar pages you want to put next to each other.

You’re matching the whole page content, not just the title or a small section, so the proximity is a lot more accurate.
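
To make the idea concrete, here is a minimal sketch of cosine similarity in plain Python. The three "pages" and their 3-number embeddings are made-up toy values (real embeddings are far longer), and the function name is just illustrative.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Toy embeddings, invented for illustration only.
page_running_shoes = [0.9, 0.1, 0.0]
page_trail_shoes = [0.8, 0.2, 0.1]
page_tax_advice = [0.0, 0.1, 0.9]

similar = cosine_similarity(page_running_shoes, page_trail_shoes)
unrelated = cosine_similarity(page_running_shoes, page_tax_advice)
print(f"running vs trail shoes: {similar:.3f}")
print(f"running shoes vs tax advice: {unrelated:.3f}")
```

The two shoe pages score close to 1 (very similar), while the shoe page and the tax page score close to 0 (unrelated). That score is what gets ranked to pick the related pages.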


Using Screaming Frog + OpenAI to Extract Embeddings

Here’s where things start getting more hands-on. First of all, you need to get an OpenAI API key and add some credit to your account. I’ve extracted embeddings from 50,000 URLs for less than $5 USD, so it’s not expensive at all.
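
If you want a rough budget before crawling, the arithmetic is simple. This is only a back-of-the-envelope sketch: the function name is made up, and both the average page length and the price per million tokens are placeholder assumptions, so check your own content and OpenAI’s current pricing.

```python
def estimate_embedding_cost(url_count, avg_tokens_per_page, usd_per_million_tokens):
    """Rough cost: total tokens sent, priced per million tokens."""
    total_tokens = url_count * avg_tokens_per_page
    return total_tokens / 1_000_000 * usd_per_million_tokens

# Placeholder inputs, not measured values.
cost = estimate_embedding_cost(
    url_count=50_000,
    avg_tokens_per_page=1_500,     # assumed average page length
    usd_per_million_tokens=0.02,   # assumed price, may be out of date
)
print(f"Estimated cost: ${cost:.2f}")
```

With these placeholder inputs the estimate is about $1.50 (75 million tokens at $0.02 per million), which is in line with the “less than $5” experience above.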

Open Screaming Frog and turn JavaScript rendering on. From the menu, go to Configuration > Crawl Config > Rendering > JavaScript.

Then, head to Configuration > Custom > Custom JavaScript.

Lastly, select Add from Library > (ChatGPT) Extract embeddings […], then click on “JS” to open the code and add your OpenAI API key.

Now you can run the crawl as usual and embeddings will be collected. If you want to save a bit of time, untick everything on Configuration > Crawl and Extraction since you won’t look at internal links, page titles or other content or technical aspects of a website.


Using LLMs to Create a Python Script

Once your crawl is done, it’s time to use ChatGPT again to create the code for your tool. Ask something along the lines of: “Give me a Python code that allows me to map [5] related pages using cosine similarity. I’ll upload a spreadsheet with URLs + Embeddings on this tool. The code will be placed on Google Colab”.

You can try it yourself or use my existing Related Pages Script to upload your sheet directly, reverse engineer the prompt or make improvements. The tool will ask you to upload your CSV file (the Custom JavaScript export created by Screaming Frog). The sheet should have two headers:

  • URL
  • Embeddings

Once it processes the data, it’ll automatically download another CSV with Page Source and Related Pages columns.
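
For reference, here is a minimal sketch of what such a script might look like. It is not the Related Pages Script itself, just one possible take using only Python’s standard library. The function names, the `example.com` URLs and the toy embeddings are assumptions for illustration; the “URL”, “Embeddings”, “Page Source” and “Related Pages” headers come from the steps above.

```python
import csv
import io
import math

def parse_embedding(text):
    """Turn '0.1,0.2,...' (optionally wrapped in [ ]) into a list of floats."""
    return [float(v) for v in text.strip().strip("[]").split(",") if v.strip()]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def related_pages(rows, top_n=5):
    """Map each URL to its top_n most similar other URLs."""
    pages = [(r["URL"], parse_embedding(r["Embeddings"])) for r in rows]
    result = {}
    for url, vec in pages:
        scored = [(cosine_similarity(vec, other_vec), other_url)
                  for other_url, other_vec in pages if other_url != url]
        scored.sort(reverse=True)  # highest similarity first
        result[url] = [other_url for _, other_url in scored[:top_n]]
    return result

# Stand-in for the uploaded sheet: made-up URLs and toy embeddings.
sample_csv = io.StringIO(
    'URL,Embeddings\n'
    'https://example.com/running-shoes,"0.9,0.1,0.0"\n'
    'https://example.com/trail-shoes,"0.8,0.2,0.1"\n'
    'https://example.com/tax-advice,"0.0,0.1,0.9"\n'
)
rows = list(csv.DictReader(sample_csv))
mapping = related_pages(rows, top_n=2)

# Write the result in the two-column shape described above.
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["Page Source", "Related Pages"])
for url, related in mapping.items():
    writer.writerow([url, "; ".join(related)])
print(output.getvalue())
```

In Colab you would read the uploaded file instead of the in-memory sample. This version compares every page with every other page, which is fine for small crawls; for tens of thousands of URLs, a vectorised library such as NumPy or scikit-learn will likely be much faster.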

As with anything AI related, you’ll still want to manually review everything before you make any drastic changes.


Common Issues

While this is an easy-to-use tool, some problems might come up. Here are the ones I’ve seen so far:

  • The script can’t find your columns. Rename the headers in your Screaming Frog export to “URL” and “Embeddings”
  • The CSV file has URLs without embeddings, such as crawled images or 404 pages, which don’t generate embeddings. Make sure every row has a valid URL and a visible embedding
  • The crawl speed is too high and you start getting errors from OpenAI. Decrease the crawl speed, go grab a coffee and let it do its work
  • OpenAI has many models and some page crawls might fail due to the number of output tokens requested. Generate your API key using gpt-4o mini (up to 16,384 tokens), which allows twice as many tokens as gpt-4 (8,192 tokens). If some pages still fail, remove them from the crawl
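
The first two issues can be handled before you upload anything. Below is a small sketch, using only the standard library, that renames the headers and drops rows without a usable embedding. The input header names (“Address”, “Embeddings Column”), the URLs and the error text are hypothetical; swap in whatever your own export contains.

```python
import csv
import io

def is_valid_embedding(text):
    """True if text parses as a non-empty, comma-separated list of numbers."""
    parts = [p for p in (text or "").strip().strip("[]").split(",") if p.strip()]
    if not parts:
        return False
    try:
        for p in parts:
            float(p)
    except ValueError:
        return False
    return True

def clean_rows(rows, url_header, embeddings_header):
    """Rename headers to URL/Embeddings and separate out unusable rows."""
    kept, dropped = [], []
    for row in rows:
        url = (row.get(url_header) or "").strip()
        emb = (row.get(embeddings_header) or "").strip()
        if url.startswith(("http://", "https://")) and is_valid_embedding(emb):
            kept.append({"URL": url, "Embeddings": emb})
        else:
            dropped.append(url)
    return kept, dropped

# Made-up export: one good page, one image without an embedding,
# and one page where an error message landed in the embedding column.
export = io.StringIO(
    'Address,Embeddings Column\n'
    'https://example.com/guide,"0.12,-0.03,0.44"\n'
    'https://example.com/logo.png,\n'
    'https://example.com/missing,"Error: 404"\n'
)
kept, dropped = clean_rows(list(csv.DictReader(export)), "Address", "Embeddings Column")
print(len(kept), "rows kept; dropped:", dropped)
```

Reviewing the dropped list is also a quick way to spot pages that failed during the crawl and may need to be re-crawled at a lower speed.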

Gus Pelogia is a Senior SEO Product Manager at Indeed, the #1 job site in the world. He has spoken at events such as BrightonSEO and LondonSEOXL. Gus is also a contributor to Moz, Wix and other well-known industry blogs. Working in cross-functional teams including content creators, UX designers, engineers, data scientists and product managers, he aims to make SEO accessible and easy to understand. His work is focused on impact, buy-in and processes from ideation to release and measurement, avoiding staying in the SEO bubble. Prior to his SEO career, Gus graduated in journalism from Faculdade Cásper Líbero in Brazil and published two books about the independent alternative rock scene in Brazil.

2 Comments

  • Hello! Very interesting article.

    How big a website should be for this to make sense?

    I mean, would it make sense on a 15.000 urls big site?

    Thanks in advance

    Javier

  • Lisa 2 months ago

    Hi, Gus:

    I ran the crawl, but I’m getting a response back for the content type text/css, Status: Blocked by Client inspector.

    Any ideas what I can do?

    Lisa
