Search Results for:
*
Robots.txt
The robots.txt file sits at the root of a domain, for example: https://www.screamingfrog.co.uk/robots.txt. It provides crawling instructions to bots visiting the site, which they follow voluntarily. In this guide, we’ll explore why you should have a robots.txt,...
Page Titles
Writing a good page title is an essential skill for anyone in SEO, as page titles help both users and search engines understand the purpose of a page. In this guide we take you through the fundamentals, as well as more...
Search function
The search box in the top right of the interface allows you to search all visible columns. It defaults to regular text search of the ‘Address’ column, but allows you to switch to regex, choose from a variety of predefined...
Exclude
Configuration > Exclude The exclude configuration allows you to exclude URLs from a crawl by using partial regex matching. A URL that matches an exclude is not crawled at all (it’s not just ‘hidden’ in the interface). This will mean...
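As a rough illustration of partial regex matching, the sketch below filters a URL list so that any URL containing a match for an exclude pattern is dropped before crawling. The patterns, the URLs and the helper name is_excluded are hypothetical, and Python's re module stands in for whatever regex engine the tool itself uses, so treat this as an analogy rather than the tool's implementation.

```python
import re

# Hypothetical exclude patterns; each one only needs to match part of the URL.
exclude_patterns = [r"/blog/", r"\?price="]

# Hypothetical URLs discovered during a crawl.
discovered_urls = [
    "https://example.com/",
    "https://example.com/blog/post-1",
    "https://example.com/shop?price=low",
    "https://example.com/contact",
]


def is_excluded(url, patterns):
    """Return True if any pattern matches anywhere in the URL (partial match)."""
    return any(re.search(pattern, url) for pattern in patterns)


# Excluded URLs are never queued, rather than being hidden afterwards.
urls_to_crawl = [u for u in discovered_urls if not is_excluded(u, exclude_patterns)]
print(urls_to_crawl)
```

Using re.search rather than re.fullmatch is what makes the match partial: the pattern /blog/ excludes every URL that contains that path segment without needing a leading or trailing .* wildcard.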
URL rewriting
Configuration > URL Rewriting The URL rewriting feature allows you to rewrite URLs on the fly. For the majority of cases, the ‘remove parameters’ and common options (under ‘options’) will suffice. However, we do also offer an advanced regex replace...
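To make the idea of removing parameters concrete, here is a minimal sketch that strips named query parameters from a URL using Python's standard library. The function name remove_parameters, the parameter names and the example URL are assumptions for illustration only; this is not how the feature is implemented in the tool.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def remove_parameters(url, names_to_remove):
    """Return the URL with the named query parameters stripped out."""
    parts = urlsplit(url)
    # Keep every key/value pair whose key is not in the removal set.
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in names_to_remove
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


# Hypothetical example: drop a session identifier but keep the other parameter.
rewritten = remove_parameters(
    "https://example.com/page?sessionid=123&colour=red", {"sessionid"}
)
print(rewritten)  # https://example.com/page?colour=red
```

Rewriting URLs this way means two addresses that differ only by a tracking or session parameter collapse into one, which is typically why such parameters are removed before crawling.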
Robots.txt
The Screaming Frog SEO Spider is robots.txt compliant. It obeys robots.txt in the same way as Google. It will check the robots.txt of the subdomain(s) and follow (allow/disallow) directives specifically for the Screaming Frog SEO Spider user-agent, if not Googlebot...
Crawling
The Screaming Frog SEO Spider is free to download and use for crawling up to 500 URLs at a time. For £199 a year you can buy a licence, which removes the 500 URL crawl limit. A licence also provides...
How do I extract multiple matches of a regex?
If you want all the H1s from the following HTML:
<html>
<head>
<title>2 h1s</title>
</head>
<body>
<h1>h1-1</h1>
<h1>h1-2</h1>
</body>
</html>
Then we can use:
<h1>(.*?)</h1>
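The same pattern can be tried outside the tool. The sketch below applies it to the sample HTML with Python's re module, which is only a stand-in for the tool's own regex engine; the variable names are illustrative.

```python
import re

# The sample HTML from the question above, collapsed onto one line.
html = (
    "<html> <head> <title>2 h1s</title> </head> "
    "<body> <h1>h1-1</h1> <h1>h1-2</h1> </body> </html>"
)

# The lazy group (.*?) stops at the first closing </h1> after each opening tag.
h1_pattern = re.compile(r"<h1>(.*?)</h1>")

# findall returns the captured group for every non-overlapping match.
h1_texts = h1_pattern.findall(html)
print(h1_texts)  # ['h1-1', 'h1-2']
```

Each element of the returned list corresponds to one match, so two H1 elements in the page yield two extracted values.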
Why is my regex extracting more than expected?
If you are using a regex such as .* that contains a greedy quantifier, you may end up matching more than you want. The solution is to use a lazy quantifier instead, such as .*?. For example, if you are trying to...
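The difference between the two quantifiers is easy to see side by side. The sketch below uses a made-up HTML fragment (not the truncated example from the answer) and Python's re module as a stand-in for the tool's regex engine.

```python
import re

# Hypothetical fragment containing two headings on one line.
fragment = "<h1>h1-1</h1> <h1>h1-2</h1>"

# Greedy: .* grabs as much as possible, so the match runs to the LAST </h1>.
greedy_matches = re.findall(r"<h1>(.*)</h1>", fragment)

# Lazy: .*? grabs as little as possible, so each match ends at the NEXT </h1>.
lazy_matches = re.findall(r"<h1>(.*?)</h1>", fragment)

print(greedy_matches)  # ['h1-1</h1> <h1>h1-2']
print(lazy_matches)    # ['h1-1', 'h1-2']
```

The greedy version returns a single over-long match that swallows the markup between the two headings, while the lazy version returns each heading separately.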
How does the Spider treat robots.txt?
The SEO Spider is robots.txt compliant. It checks robots.txt in the same way as Google. It will check the robots.txt of the (sub)domain and follow any directives specifically for Googlebot, or for all user-agents. You are able to adjust the...
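The idea of following directives for a specific user-agent, and falling back to the rules for all user-agents otherwise, can be sketched with Python's standard urllib.robotparser module. The robots.txt content and URLs below are invented for illustration, and Python's parser is not guaranteed to resolve rules exactly as Google or the SEO Spider does.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt with one group for Googlebot and one for everyone else.
robots_lines = [
    "User-agent: Googlebot",
    "Disallow: /private/",
    "",
    "User-agent: *",
    "Disallow: /tmp/",
]

parser = RobotFileParser()
parser.parse(robots_lines)  # parse in-memory lines, so no network request is made

# Googlebot is matched by its own group, so only /private/ is blocked for it.
googlebot_private = parser.can_fetch("Googlebot", "https://example.com/private/page")
googlebot_tmp = parser.can_fetch("Googlebot", "https://example.com/tmp/page")

# Any other bot falls back to the wildcard group, so only /tmp/ is blocked.
otherbot_private = parser.can_fetch("OtherBot", "https://example.com/private/page")
otherbot_tmp = parser.can_fetch("OtherBot", "https://example.com/tmp/page")

print(googlebot_private, googlebot_tmp, otherbot_private, otherbot_tmp)
```

In this sketch a bot with its own group ignores the wildcard group entirely, which is why Googlebot may fetch /tmp/ while other bots may not.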