Proper way to crawl a code-built website with Screaming Frog

Hello everyone,

I’m currently working on a client’s website and have run into a rather puzzling issue. When I conduct a full site crawl, it reveals an unexpectedly large number of URLs.

To refine the results, I attempted to crawl only the sitemap. This approach yielded a more manageable number of URLs, but I’m worried that some crucial pages might be overlooked.


Do you have any advice on how to balance the extensive crawl results with a sitemap that may be missing some important pages?


2 responses to “Proper way to crawl a code-built website with Screaming Frog”

  1. When you’re using Screaming Frog or any other SEO tool to crawl a website built with code, especially one where you’re seeing discrepancies between a full-site crawl and a sitemap crawl, there are several strategies you can employ to ensure you’re not missing important pages. Here’s how you can tackle this issue:

    1. Understanding Crawls vs. Sitemaps

    • Crawl: When you initiate a full-site crawl, Screaming Frog systematically visits every page that’s accessible via internal links on your website. This often includes pages that you may not want to be indexed (e.g., admin pages, duplicate content).

    • Sitemap: A sitemap is a guideline provided by the website’s developers, specifying which pages they consider important. However, it may not include all the pages that should be indexed, especially if it’s not regularly updated.

    2. Analyzing the Full Crawl

    • Assess Large Numbers of URLs: Evaluate whether there are unexpected patterns or anomalies. Are there pages with parameterized URLs, session IDs, or pagination issues? These can artificially inflate the number of crawlable URLs (a quick tally sketch follows this section).

    • Discover Orphans: Use the crawl data to find orphan pages (pages that aren’t linked to internally but still exist). This can help identify potentially important pages that are missing in the sitemap.
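
    To put numbers on that, one option is to export the crawl’s URL list and tally query-parameter keys outside the tool. Here is a minimal sketch, assuming a plain one-URL-per-line export saved as crawl_urls.txt (the filename is hypothetical):

```python
from collections import Counter
from urllib.parse import urlparse, parse_qs

# crawl_urls.txt: one URL per line, exported from the crawl
# (the filename is a stand-in for whatever your export is called).
with open("crawl_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

param_counts = Counter()
for url in urls:
    for key in parse_qs(urlparse(url).query):
        param_counts[key] += 1

# Parameters attached to many URLs (session IDs, sort orders, filters)
# are the usual culprits behind an inflated crawl.
for key, count in param_counts.most_common(10):
    print(f"{key}: {count} URLs")
```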

    3. Evaluate the Sitemap

    • Check for Completeness: Ensure the sitemap is complete and includes all necessary URLs, especially if it’s generated dynamically by the website. Use Screaming Frog’s Sitemap tab to import and validate your XML sitemap against the current crawl results (a diff sketch follows this section).

    • Compare Against Analytics: Use tools like Google Analytics and Google Search Console to identify high-traffic pages not listed in your sitemap. These are likely important and should be included.
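
    If it helps to do that comparison outside the tool, a set difference in both directions covers the orphan check and the completeness check at once. A minimal sketch, assuming the sitemap uses the standard sitemap namespace and reusing the hypothetical crawl_urls.txt export from above:

```python
import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Collect every <loc> entry from the XML sitemap.
sitemap_urls = {
    loc.text.strip()
    for loc in ET.parse("sitemap.xml").findall(".//sm:loc", NS)
    if loc.text
}

# The same one-URL-per-line crawl export used above.
with open("crawl_urls.txt") as f:
    crawled_urls = {line.strip() for line in f if line.strip()}

# In the sitemap but never reached via internal links: likely orphans.
print("possible orphans:", sorted(sitemap_urls - crawled_urls)[:20])

# Crawled but absent from the sitemap: candidates to add if they matter.
print("missing from sitemap:", sorted(crawled_urls - sitemap_urls)[:20])
```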

    4. Filtering and Configurations in Screaming Frog

    • Apply Custom Exclusions/Inclusions: Use Screaming Frog’s Include and Exclude options to skip URL patterns (like session IDs or pagination parameters) that aren’t meant to be indexed, narrowing the crawl to the pages that matter (example patterns follow this section).

    • Adjust Crawl Depth: Limit the crawl depth to focus on the most significant pages (those fewer than three clicks from the homepage) if the website has a deep navigational structure.
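
    Since the Exclude configuration takes regular expressions matched against the full URL, it can save a wasted crawl to sanity-check your patterns first. A minimal sketch of the kind of patterns meant here, tested in Python (both the patterns and the URLs are illustrative, not site-specific):

```python
import re

# Illustrative exclude patterns: session IDs, sort/filter
# parameters, and deep pagination paths.
exclude_patterns = [
    r".*[?&]sessionid=.*",
    r".*[?&]sort=.*",
    r".*/page/\d+/?$",
]

test_urls = [
    "https://example.com/products?sessionid=abc123",
    "https://example.com/blog/page/7/",
    "https://example.com/services",
]

for url in test_urls:
    verdict = "exclude" if any(re.fullmatch(p, url) for p in exclude_patterns) else "keep"
    print(verdict, url)
```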

    5. Cross-reference with Other Tools

    • Use Multiple Sources: Cross-reference data from Bing Webmaster Tools, Ahrefs, or any other SEO tools that can highlight external links pointing to pages your crawl or the sitemap may have missed.
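
    Whatever mix of tools you use, the mechanics come down to merging flat URL exports and flagging the URLs that only one source knows about. A minimal sketch, assuming one-URL-per-line exports from each source (all filenames hypothetical):

```python
# Flat one-URL-per-line exports from each source (filenames hypothetical).
sources = {
    "crawl": "crawl_urls.txt",
    "sitemap": "sitemap_urls.txt",
    "search_console": "gsc_urls.txt",
}

seen = {}
for name, path in sources.items():
    with open(path) as f:
        for line in f:
            url = line.strip()
            if url:
                seen.setdefault(url, set()).add(name)

# A URL known to only one source is the interesting case: it is either
# noise in that source or a page the other sources are missing.
for url, found_in in sorted(seen.items()):
    if len(found_in) == 1:
        print(f"{url}  (only in: {next(iter(found_in))})")
```
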
  2. Great post! It’s a common challenge to balance thorough crawling with the risk of missing important pages. One approach I recommend is to initially perform a full crawl and analyze the URLs detected. This can help you identify any patterns or outliers, such as duplicate URLs, parameters, or paginated content, that could be skewing your results.

    Once you grasp the full scope, you can refine your crawl settings in Screaming Frog by excluding specific parameters or even using the URL exclusion filters to focus on your critical pages. Additionally, consider utilizing the “Custom Extraction” feature to pull in relevant on-page elements, which can provide further insights into content that might not be included in the sitemap.
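
    Custom Extraction accepts XPath, CSS selectors, or regex, so it can be worth prototyping an expression against a single page before committing to a crawl. A minimal sketch using the third-party requests and lxml packages, with example.com standing in for the client’s site:

```python
import requests
from lxml import html

# Hypothetical page; substitute one of the client's URLs.
resp = requests.get("https://example.com/some-page", timeout=10)
tree = html.fromstring(resp.content)

# The same XPath expressions can be pasted into Custom Extraction.
canonical = tree.xpath('//link[@rel="canonical"]/@href')
meta_robots = tree.xpath('//meta[@name="robots"]/@content')

print("canonical:  ", canonical[0] if canonical else "none")
print("meta robots:", meta_robots[0] if meta_robots else "none")
```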

    Lastly, I’d suggest cross-referencing your sitemap with the full crawl results. Ensure that high-priority pages are listed in the sitemap and check for any discrepancies. By integrating these methods, you should be able to create a more accurate representation of the site while minimizing the risk of overlooking important content. Happy crawling!
