Reviewing and Improving My robots.txt

Quick Review Needed for a robots.txt File

Is Anything Incorrect?

Hello everyone,

I’m looking for a quick review of the robots.txt file content below. Is there anything obviously wrong with it? Any feedback would be appreciated. Thank you! 🙂

```plaintext
Sitemap: https://www.mysite.com.hk/sitemap.xml

User-agent: AdsBot-Google
Disallow:

User-agent: Googlebot-Image
Disallow:

User-agent: dotbot
Disallow: /

User-agent: BLEXBot
Disallow: /

User-agent: Barkrowler
Disallow: /

User-agent: serpstatbot
Disallow: /

User-agent: GeedoBot
Disallow: /

User-agent: MJ12bot
Disallow: /

User-agent: DataForSeoBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: VelenPublicWebCrawler
Disallow: /

User-agent: TurnitinBot
Disallow: /

User-agent: Riddler
Disallow: /

# Directories

User-agent: *
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /includes/
Disallow: /lib/
Disallow: /magento/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /stats/
Disallow: /var/
Disallow: /ApiPhp/
Disallow: /SID=
Disallow: /review/
Disallow: /productreviewscollection/
Disallow: */questionanswerscollection/

# Paths (clean URLs)

Disallow: /index.php/
Disallow: /catalog/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /customer/
Disallow: /customize/
Disallow: /media/oms/
Disallow: /catalogutils/
Disallow: /sendfriend/

# Disable all queries

Disallow: /?
Allow: /?p=*

# Files

Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_A
```
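If it helps, rules like the ones above can be sanity-checked locally with Python’s standard-library `urllib.robotparser` before deploying. A quick sketch (the paths and domain are just examples; note this parser only does simple prefix matching and does not implement Google’s `*`/`$` wildcard extensions, so a line like `Disallow: /*.php$` is not evaluated the way Googlebot would evaluate it):

```python
# Sanity-check a robots.txt rule set with Python's stdlib parser.
from urllib import robotparser

RULES = """\
User-agent: *
Disallow: /checkout/
Disallow: /customer/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# Matches the /checkout/ prefix rule, so this URL is blocked.
print(rp.can_fetch("*", "https://www.mysite.com.hk/checkout/cart/"))
# No rule matches, and unmatched paths are allowed by default.
print(rp.can_fetch("*", "https://www.mysite.com.hk/products/shirt"))
```

This catches obvious mistakes (a path that doesn’t block what you think it does) even though it isn’t a full substitute for a Google-compatible tester.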


2 responses to “Reviewing and Improving My robots.txt”

  1. Certainly! Here’s a detailed review of the provided robots.txt file:

    Overview

    A robots.txt file is used to instruct web crawlers about which parts of your site they are allowed to visit or index. It’s a crucial part of website management for SEO and resource management. Let’s go through the content provided.

    Analysis

    1. Sitemap Specification:

    ```plaintext
    Sitemap: [https://www.mysite.com.hk/sitemap.xml](https://www.networldsports.com.hk/sitemap.xml)
    ```

    • Issue: The link text and the actual URL do not match.
    • Recommendation: Ensure you use the correct URL for the sitemap to help search engines locate it.

    ```plaintext
    Sitemap: https://www.mysite.com.hk/sitemap.xml
    ```

    Remove the Markdown link syntax if it is a plain-text file, as it is unnecessary there.

    2. Bot Directives:

    ```plaintext
    User-agent: AdsBot-Google
    Disallow:
    ```

    • AdsBot-Google, Googlebot-Image: Allowing all access, which is typically fine unless specific pages should be blocked.

    ```plaintext
    User-agent: dotbot
    Disallow: /
    ```

    • Blocking Selected Bots: You have several bots fully disallowed, which seems intentional for dotbot, BLEXBot, Barkrowler, and others. This is fine if they shouldn’t crawl any part of your site.

    • General Directions:

    ```plaintext
    User-agent: *
    ```

    • Blocking Specific Directories and Paths: This is where you specify directories that should not be crawled by any bot. Make sure these paths are entered correctly and reflect the directories and files you actually want to hide from crawlers.

    • Path Blocking and File Restrictions:

    ```plaintext
    Disallow: /404/
    Disallow: /cgi-bin/
    ```

    • Specific Directories: It appears you are blocking access to backend and administrative directories and PHP files, which is a common and generally wise strategy.
    • PHP Files:

      ```plaintext
      Disallow: /*.php$
      ```

      • Check for Exceptions: Ensure any public-facing PHP endpoints required for your site (like AJAX handlers) aren’t accidentally blocked.
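      One way to handle such exceptions is an explicit `Allow` rule alongside the blanket `Disallow`. A sketch (the `/ajax/search.php` path is hypothetical; in Google’s matcher, when `Allow` and `Disallow` both match a URL, the longer, more specific rule wins):

      ```plaintext
      User-agent: *
      Disallow: /*.php$
      # Hypothetical public endpoint kept crawlable: this Allow rule is
      # longer than the Disallow above, so it wins for this URL.
      Allow: /ajax/search.php$
      ```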
  2. It looks like you’ve put a lot of thought into your `robots.txt` file! Here are a few insights that might help refine it further:

    1. **Check Your Sitemap URL**: The sitemap link appears as `https://www.mysite.com.hk/sitemap.xml` in your `robots.txt`, but your post suggests a different domain (`https://www.networldsports.com.hk/`). Make sure to update this accordingly to ensure search engines can locate your sitemap effectively.

    2. **Block Specific Bots with Caution**: While it’s great to restrict unwanted crawlers like `MJ12bot` and `BLEXBot`, consider whether you need to block them entirely. For instance, if some bots bring value by indexing content that can lead to legitimate traffic, you might want to allow them or at least monitor their impact.

    3. **Order of Rules**: The order of user-agent directives can matter. Be mindful of the wildcard user-agent `*` coming after specific ones. Specific rules are usually prioritized, meaning any bot that matches a specific rule won’t even reach the broader instructions beneath it.

    4. **Test for Syntax Errors**: Make sure there are no unintentional syntax errors in your disallowed paths. Some entries, such as `/SID=` and `/productreviewscollection/`, appear to be split across lines by stray markup, which could cause parsing issues. You might want to clean up this formatting to ensure it doesn’t lead to potential mishaps.

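    On the group-matching point in item 3: a crawler selects the single most specific `User-agent` group that names it and ignores all others, so the rules under `*` do not also apply to it. For example, using `MJ12bot` from the file above:

    ```plaintext
    # MJ12bot matches this group and stops here; it never reads the
    # * group, so the /checkout/ rule below does not apply to it.
    User-agent: MJ12bot
    Disallow: /

    User-agent: *
    Disallow: /checkout/
    ```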
