Quick Review Needed for a robots.txt File: Is Anything Incorrect?
Hello everyone,
I’m looking for a quick review of the robots.txt file content below. Is there anything obviously wrong with it? Any feedback would be appreciated. Thank you!
```plaintext
Sitemap: https://www.mysite.com.hk/sitemap.xml
User-agent: AdsBot-Google
Disallow:
User-agent: Googlebot-Image
Disallow:
User-agent: dotbot
Disallow: /
User-agent: BLEXBot
Disallow: /
User-agent: Barkrowler
Disallow: /
User-agent: serpstatbot
Disallow: /
User-agent: GeedoBot
Disallow: /
User-agent: MJ12bot
Disallow: /
User-agent: DataForSeoBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: VelenPublicWebCrawler
Disallow: /
User-agent: TurnitinBot
Disallow: /
User-agent: Riddler
Disallow: /
# Directories
User-agent: *
Disallow: /404/
Disallow: /app/
Disallow: /cgi-bin/
Disallow: /includes/
Disallow: /lib/
Disallow: /magento/
Disallow: /pkginfo/
Disallow: /report/
Disallow: /stats/
Disallow: /var/
Disallow: /ApiPhp/
Disallow: /SID=
Disallow: /review/
Disallow: /productreviewscollection/
Disallow: */questionanswerscollection/
# Paths (clean URLs)
Disallow: /index.php/
Disallow: /catalog/
Disallow: /catalog/product_compare/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /catalogsearch/
Disallow: /checkout/
Disallow: /control/
Disallow: /customer/
Disallow: /customize/
Disallow: /media/oms/
Disallow: /catalogutils/
Disallow: /sendfriend/
# Disable all queries
Disallow: /?
Allow: /?p=*
# Files
Disallow: /cron.php
Disallow: /cron.sh
Disallow: /error_log
Disallow: /install.php
Disallow: /LICENSE.html
Disallow: /LICENSE.txt
Disallow: /LICENSE_A
```
2 responses to “Reviewing and Improving My robots.txt”
Certainly! Here’s a detailed review of the provided robots.txt file:

Overview

A robots.txt file is used to instruct web crawlers about which parts of your site they are allowed to visit or index. It’s a crucial part of website management for SEO and resource management. Let’s go through the content provided.

Analysis
```plaintext
Sitemap: [https://www.mysite.com.hk/sitemap.xml](https://www.networldsports.com.hk/sitemap.xml)
```

```plaintext
Sitemap: https://www.mysite.com.hk/sitemap.xml
```

Remove the Markdown link syntax if it is a plain-text file, as it is unnecessary there.
```plaintext
User-agent: AdsBot-Google
Disallow:
```

```plaintext
User-agent: dotbot
Disallow: /
```
Blocking Selected Bots: You have several bots fully disallowed, which seems intentional for dotbot, BLEXBot, Barkrowler, and others. This is fine if they shouldn’t crawl any part of your site.

General Directions:
```plaintext
User-agent: *
```
Blocking Specific Directories and Paths: This is where you specify directories that should not be crawled by any bot. Make sure these paths are entered correctly and reflect the directories and files you actually want hidden from crawlers.
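One quick way to check that these paths behave as intended is Python’s standard-library parser. This is only a first-pass sanity check, not a full simulation of Google’s behavior: `urllib.robotparser` does plain prefix matching and does not support the `*`/`$` wildcards. The rules below are a small fragment of the file under review, pasted inline.

```python
from urllib.robotparser import RobotFileParser

# A fragment of the rules under review, pasted inline for testing.
rules = """\
User-agent: *
Disallow: /checkout/
Disallow: /customer/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

base = "https://www.mysite.com.hk"
for path in ("/checkout/cart", "/customer/account", "/catalog/widget"):
    # can_fetch() returns False for the two disallowed prefixes,
    # True for the untouched /catalog/ path.
    print(path, rp.can_fetch("AnyBot", base + path))
```

Pasting the whole file in the same way lets you spot a typo in a `Disallow:` path before a crawler does.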
Path Blocking and File Restrictions:
```plaintext
Disallow: /404/
Disallow: /cgi-bin/
```
PHP Files:
```plaintext
Disallow: /*.php$
```
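For reference, here is how Google’s documented wildcard rules would interpret that directive (`$` anchors the end of the URL, so trailing segments or query strings escape the rule; the example URLs are illustrative, not taken from the file):

```plaintext
Disallow: /*.php$

/cron.php            -> blocked   (URL ends in .php)
/index.php/          -> crawlable (trailing slash, $ does not match)
/install.php?step=1  -> crawlable (query string after .php)
```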
It looks like you’ve put a lot of thought into your `robots.txt` file! Here are a few insights that might help refine it further:
1. **Check Your Sitemap URL**: The sitemap link appears as `https://www.mysite.com.hk/sitemap.xml` in your `robots.txt`, but your post suggests a different domain (`https://www.networldsports.com.hk/`). Make sure to update this accordingly to ensure search engines can locate your sitemap effectively.
2. **Block Specific Bots with Caution**: While it’s great to restrict unwanted crawlers like `MJ12bot` and `BLEXBot`, consider whether you need to block them entirely. For instance, if some bots bring value by indexing content that can lead to legitimate traffic, you might want to allow them or at least monitor their impact.
3. **Order of Rules**: Be mindful of the wildcard user-agent `*` coming after the specific groups. A crawler obeys the single most specific `User-agent` group that matches it, so any bot matching one of your named groups won’t even reach the broader instructions beneath it. If you want the general path blocks to also apply to a named bot, repeat them inside that bot’s group.
4. **Test for Syntax Errors**: Make sure there are no unintentional syntax errors in your disallowed paths. Stray markup or line breaks inside paths (like `/SID=` and `/productreviewscollection/`) could cause parsing issues. You might want to clean up this formatting to ensure it doesn’t lead to potential mishaps.
5. **
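The group-selection behavior described in point 3 can be demonstrated with Python’s standard-library parser (a minimal sketch; the rules are hypothetical, mirroring the structure of the file under review with one named group followed by a catch-all):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: a named group for GPTBot, then a catch-all group.
rules = """\
User-agent: GPTBot
Disallow: /private/

User-agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# GPTBot matches its own group, so the blanket "Disallow: /" never applies to it.
print(rp.can_fetch("GPTBot", "https://www.mysite.com.hk/public/"))    # True
print(rp.can_fetch("GPTBot", "https://www.mysite.com.hk/private/x"))  # False

# Any other bot falls through to the "*" group and is blocked everywhere.
print(rp.can_fetch("SomeOtherBot", "https://www.mysite.com.hk/public/"))  # False
```

This is why fully blocked bots like `dotbot` in the file above ignore every rule in the `*` section.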