Improving SEO with Dynamic Rendering for WebAssembly Sites

Bot detection and server-side routing

To ensure search engine bots are served the pre-generated static pages, we place the static files directory produced by the web-scraping script from the previous page into the WebRoot directory, next to Blazor's index.html.
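
For reference, the resulting layout on the web host looks roughly like this. The _framework folder stands in for Blazor's standard published output, and about.html is just a placeholder for whichever pages the scraping script actually produced:

Example WebRoot layout

WebRoot/
├── .htaccess          <- rewrite rules shown below
├── index.php          <- bot/user routing script shown below
├── index.html         <- Blazor WebAssembly entry point
├── _framework/        <- published Blazor assemblies and runtime
└── static/            <- pre-rendered HTML from the scraping script
    ├── index.html
    └── about.html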

For an Apache HTTP Server setup, we modify the .htaccess file to route all traffic through an index.php script. This script handles the routing decision: whether the request should be answered with static content (for bots) or with index.html (for users). The .htaccess file also enforces automatic HTTPS redirection.

.htaccess Rewrite redirection


# Enforce HTTPS for all traffic
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}%{REQUEST_URI} [L,R=301]

# Route all requests to index.php for handling if corresponding file or directory does not exist
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^(.*)$ /index.php [L]

Apache Configuration

With this setup, all requests for the Blazor app are funneled through the index.php file. In index.php, we analyze the requested URL and either serve the static files to bots or serve index.html to users, where Blazor’s client-side routing takes over; this is explained in depth further down.

Index.php script


<?php
	error_reporting(0);
	ini_set('display_errors', 0);

	use Jaybizzle\CrawlerDetect\CrawlerDetect;

	require_once __DIR__ . '/composer/vendor/autoload.php';
	$currentAbsoluteDirectory = __DIR__;
	$CrawlerDetect = new CrawlerDetect;
	// Check the user agent of the current 'visitor'
	if($CrawlerDetect->isCrawler()) 
	{
		// true if crawler user agent detected
		$staticDirectory = '/static';
		$request_uri = $_SERVER['REQUEST_URI'];
		$document_path = parse_url($request_uri, PHP_URL_PATH);
		$sanitizedRequestedPath = str_replace(['..', '.', ',', '*'], '', $document_path); // strip traversal characters so a malicious bot cannot use readfile to escape the static directory
		$sanitizedRequestedPath = str_replace(['//'], '/', $sanitizedRequestedPath); // collapse double slashes left over from the stripping above

		$requestedPath = trim($sanitizedRequestedPath, '/'); // Remove leading/trailing slashes
		if ($requestedPath === '') 
		{
			$requestedPath = 'index';
		}	
		$absoluteStaticDirectoryNavigation = realpath($currentAbsoluteDirectory . $staticDirectory);
		$staticFilePath = $absoluteStaticDirectoryNavigation . '/' . $requestedPath . '.html';
		$resolvedPath = realpath($staticFilePath); // resolve the absolute path on the filesystem (false if the file does not exist)
		$staticIsAtStart = (strpos($resolvedPath, $absoluteStaticDirectoryNavigation) === 0); // confirm the resolved path still points inside the /static directory and nowhere else
		if($absoluteStaticDirectoryNavigation && $resolvedPath && $staticIsAtStart)
		{
			readfile($resolvedPath);
		}
		else
		{
			echo "Invalid bot/crawler navigation, please follow only links";
		}
	}
	else
	{
		// a non-bot user is served the Blazor app, i.e. the index.html file
		$blazorAppPath = $currentAbsoluteDirectory . '/index.html';
		$realBlazorAppPath = realpath($blazorAppPath);
		if($realBlazorAppPath)
		{
			readfile($realBlazorAppPath);
		}
		else
		{
			echo "Blazor app not found";
		}
	}
?>

PHP

How the script works

Now, the PHP script uses the CrawlerDetect library to analyze the user-agent of incoming requests and determine if the request is coming from a bot or a human user. CrawlerDetect identifies a wide variety of bots and web crawlers based on their user-agent strings. If the request is from a bot, the script proceeds to analyze the requested URL and attempts to resolve it to a corresponding static file in the static directory.
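
As a standalone illustration (a minimal sketch, separate from the site's index.php), CrawlerDetect can also be handed a user-agent string explicitly, which makes it easy to test the detection outside of a live request:

CrawlerDetect test sketch

<?php
	use Jaybizzle\CrawlerDetect\CrawlerDetect;

	require_once __DIR__ . '/composer/vendor/autoload.php';

	$CrawlerDetect = new CrawlerDetect;

	// isCrawler() optionally accepts a user-agent string instead of
	// reading it from the current request's headers
	var_dump($CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)')); // bool(true)
	var_dump($CrawlerDetect->isCrawler('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36')); // bool(false)

	// getMatches() returns the matched part of the user agent, e.g. "Googlebot"
	echo $CrawlerDetect->getMatches();
?>

PHP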

If the static file exists, it is served directly to the bot via PHP's readfile function. This bypasses any additional client-side redirects, ensuring that the bot gets the static content directly. This step is crucial for SEO, as it allows search engines to index the static HTML content, improving visibility and ranking.
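
One refinement worth considering, which the script above does not include: sending an explicit Content-Type header and a 404 status when no static file resolves, so crawlers treat missing pages as missing instead of indexing the warning text under a 200 response. A sketch of how the serving branch could be adjusted:

Possible refinement of the serving branch

		if($absoluteStaticDirectoryNavigation && $resolvedPath && $staticIsAtStart)
		{
			header('Content-Type: text/html; charset=utf-8');
			readfile($resolvedPath);
		}
		else
		{
			http_response_code(404); // signal "not found" instead of returning 200 with warning text
			echo "Invalid bot/crawler navigation, please follow only links";
		}

PHP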

Security measures against malicious paths

The script implements security measures to prevent malicious bots from exploiting vulnerabilities. It sanitizes the requested path to remove unusual character sequences like .., *, or //, which could otherwise allow a bot to escape the static directory and read other parts of the server's file system. It also verifies (after sanitization) that the resolved path lies within the static directory. This ensures that the requested file path remains strictly inside the static directory, preventing bots from bypassing the .htaccess rules and gaining unauthorized access to sensitive files. If a bot attempts to access a location outside the allowed directory or provides an invalid path, the script returns a warning message: "Invalid bot/crawler navigation, please follow only links."
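
To make the containment check concrete, here is a small, hypothetical standalone sketch reusing the same sanitization and realpath logic as the script, showing how a traversal attempt fails to resolve:

Path containment demonstration

<?php
	// Hypothetical helper mirroring the logic in index.php
	function resolveStaticFile(string $uriPath, string $staticRoot)
	{
		$path = str_replace(['..', '.', ',', '*'], '', $uriPath); // strip traversal characters
		$path = str_replace(['//'], '/', $path); // collapse double slashes
		$path = trim($path, '/');
		if ($path === '')
		{
			$path = 'index';
		}
		$candidate = $staticRoot . '/' . $path . '.html'; // only *.html files can ever be read
		$resolved = realpath($candidate); // false if the file does not exist
		// prefix check: the resolved path must still start inside the static root
		if ($resolved !== false && strpos($resolved, $staticRoot) === 0)
		{
			return $resolved;
		}
		return false;
	}

	$staticRoot = realpath(__DIR__ . '/static');
	var_dump(resolveStaticFile('/about', $staticRoot)); // the static file path, if /static/about.html exists
	var_dump(resolveStaticFile('/../../etc/passwd', $staticRoot)); // bool(false): '..' is stripped and nothing outside /static resolves
?>

PHP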

Serving humans

On the other hand, if the request comes from a human user (not a bot), the script serves the index.html file, which is the entry point for the Blazor application. At this point, Blazor takes over, and client-side routing and dynamic rendering are handled in the browser via WebAssembly, with no additional server-side intervention required.

Final Thoughts on script and approach

This setup enables the website to deliver static content directly to search engine bots for improved indexing, while providing a full interactive experience to human users via Blazor. The PHP script effectively distinguishes between bots and users, ensuring both functionality and security, while preventing unauthorized file access.

Despite its advantages, I’m curious why this solution isn’t more widely adopted. Is there a flaw, or is it considered overly complex? To me, it seems like a straightforward and effective solution to host client-side heavy applications on a traditional web host. I’ll need to test it further to determine if it’s truly as efficient as it appears.

I also want to note that this script/approach is not specific to Blazor; it should work with any client-side-heavy application that needs to be prerendered for SEO.

In the end, the result is that Google correctly indexes our website from the static files, as seen in the picture below:

[Image: Website description working for satoma.se, showing the correct description from the static version]
