Improving SEO with Dynamic Rendering for WebAssembly Sites

2024-12-01

Dynamic Rendering SEO Optimization for Blazor WebAssembly

This website is built using Blazor WebAssembly, a framework that allows for running C# code directly in the browser via WebAssembly. Unlike traditional server-side rendering, where the server is responsible for delivering the entire page content, Blazor WebAssembly delegates much of the work to the client. This means that when a user visits the site, the browser is responsible for making requests to APIs, running JavaScript, and rendering the content on the page.

While this approach offers many benefits, such as improved interactivity and a smoother user experience, it also presents challenges. Since much of the content is rendered on the client side, search engine bots may have difficulty properly indexing the website. This can result in incomplete indexing, where search engines are unable to crawl or fully understand the content, potentially impacting SEO performance.

For example, when Google fails to index or correctly parse the content of your website, the search result for a Blazor site can end up looking like the following:

Website description reading: "Loading website... An unhandled error has occurred"

The Challenge and My Solution

After exploring several options, I found that many solutions were either too expensive or overly complicated for my needs. Paid services like Prerender.io offer dynamic rendering as a service, but they require redirecting requests to an external server to generate the static content, adding complexity and potentially increasing costs. Additionally, many solutions required significant changes to my infrastructure, which didn't fit well with the simplicity of a traditional web host setup where you just want to upload your files and have everything run smoothly.

Other solutions, such as running the website as a Blazor Server app, where content is rendered on the server and sent to the client, didn't work for my use case. While Blazor Server can solve the issue of rendering content server-side, it usually does not work with traditional web hosts; it would mean relying on an external platform to host the server-side rendering. For example, platforms like Azure Web Apps (e.g., azurewebsites.net) are often used for Blazor Server applications, which require a more complex setup and an ongoing dependency on a third-party service. I wanted to avoid this dependency. Instead, I preferred to keep things simple, with a traditional web host where I could upload everything to my own domain without needing to rely on external services or complex server-side infrastructure.

To solve this, I created a custom solution using Puppeteer and Node.js. Puppeteer is a headless browser automation tool that allows you to programmatically control a browser, simulate user interactions, and capture fully rendered pages. I wrote a script that can run against a local version of my Blazor WebAssembly website, hosted through a Visual Studio project at the usual https://localhost:7273.

The script essentially loads the website as if it were a real user, waits for the content to load and render, and then captures a static version of the fully rendered HTML. This static version can be saved and served specifically to search engine bots or used for SEO purposes, ensuring that they receive the same content that a user would see when interacting with the live site. This approach was simple, cost-effective, and didn't require altering my existing hosting infrastructure.
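
To make that concrete, here is a minimal sketch of the core idea for a single page. The helper name and URL are just for illustration and error handling is omitted; the full script further down builds on exactly this pattern:

const puppeteer = require('puppeteer');

// Minimal sketch: load one client-rendered page, wait for it to settle,
// and capture the fully rendered HTML as a string.
async function captureRenderedHtml(url)
{
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    // 'networkidle0' waits until no network requests remain, which gives
    // the Blazor WebAssembly app time to download, boot and render.
    await page.goto(url, { waitUntil: 'networkidle0', timeout: 60000 });
    const html = await page.content(); // serialized DOM after client-side rendering
    await browser.close();
    return html;
}

captureRenderedHtml('https://localhost:7273/').then(html => console.log(html.length));

JavaScript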

It's important to note that this technique is not the same as malicious cloaking. Instead, it's recognized by Google as dynamic rendering in the official Google docs, a method that allows search engines to view the same content as users, ensuring accurate indexing without the deceptive practices associated with cloaking. It is worth noting, however, that it does use the same techniques as cloaking; the main differentiating factor is whether it is malicious or not.

The Core Script and How It Works

The script uses Puppeteer for headless browsing and JSDOM for HTML manipulation to ensure search engine bots index the fully rendered content of a Blazor WebAssembly website. Here's how it works:

  1. Duplication avoidance using Canonical Links: The script relies on canonical URLs to avoid duplicate pages, ensuring that each page is rendered from its authoritative version. If a page is accessible via multiple URLs, only the canonical URL (from the <link rel="canonical"> tag) is processed. This avoids duplicates and ensures proper indexing.

  2. Resolving links locally and live: The canonicalBaseUrl parameter you provide (e.g., https://www.satoma.se) determines how internal links are handled. Even if the website is hosted on a local server (e.g., localhost), the script will only follow links that match the canonical base URL, ensuring no external URLs are crawled.

    The script always transforms fully qualified URLs (e.g., https://localhost:7273/page1) into relative URLs. This allows the script to crawl and generate the correct structure for local testing or generation, while still working seamlessly for the live website when hosted on the canonical domain.

    Together, these two behaviors mean:

    • The script only follows internal links, ensuring each link matches the specified canonicalBaseUrl.

    • When the site is hosted locally, the script automatically converts fully qualified URLs into relative ones, making it suitable for local testing and content generation. For example, if the site is hosted on "localhost" but a link points to the full URL (e.g., https://www.satoma.se/page1), the script will resolve it to a relative URL (/page1), as long as the canonicalBaseUrl correctly points to www.satoma.se.

    • External links or incorrectly formatted URLs are ignored, ensuring the crawl remains focused on the website's content (a short sketch of this URL handling follows the list).

  3. Rendering and Saving Pages: The script loads each page, waits for it to fully render, then strips out the now-unnecessary <script> tags (since search engine bots don't need them). It saves the fully rendered HTML to disk, to be served later specifically to bots, ensuring that bots index the content exactly as users would experience it, with no dynamic elements that could hinder SEO.

  4. Automatic Sitemap: Because the script has already crawled the entire website, it also generates a sitemap.xml for you, in case you want to use it. It is not particularly smart: it sets each page's priority based on how many slashes (/) appear in the relative path and sets the timestamp to the current time (this priority calculation is also shown in the sketch after this list).
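
To make points 2 and 4 a bit more concrete, here is a rough illustration of how the getCleanRelativeUrl helper (from urlToFilePath.js, shown further down) and the sitemap priority behave; the example paths are hypothetical:

const { getCleanRelativeUrl } = require('./urlToFilePath.js');

// Links on the canonical domain and links on the local host both collapse
// to the same relative form:
console.log(getCleanRelativeUrl('https://www.satoma.se/', 'https://www.satoma.se/page1'));      // '/page1'
console.log(getCleanRelativeUrl('https://localhost:7273/', 'https://localhost:7273/projects')); // '/projects'

// Sitemap priority (see calculatePriority in the main script): every slash in the
// relative path lowers the priority by 0.1, with a floor of 0.1:
//   ''              -> 1.0  (the root page)
//   '/page1'        -> 0.9
//   '/projects/seo' -> 0.8

JavaScript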

After the script has finished running, it will have generated a file structure for you that you can serve to bots, looking something like the following:

The script's rendered static files
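
If, for example, the crawl found the pages /, /page1 and /projects/seo (hypothetical paths), the turnUrlIntoFilePath helper shown further down would lay the rendered files out roughly like this, with sitemap.xml written separately to the working directory:

static/
    index.html
    page1.html
    projects/
        seo.html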

Note: the following script allows you to generate a static version of your website; however, the issue of redirecting only bots to this static version remains. Go to the next page to see how to do that!


The script implementation

The main script (scrapePages.js):

const fs = require('fs');
const path = require('path');
const puppeteer = require('puppeteer');
const urlLib = require('url');
const jsdom = require("jsdom");
const tldjs = require('tldjs');
const { JSDOM } = jsdom;
const { turnUrlIntoFilePath, CleanDirectory, getCleanRelativeUrl, isValidUrl, isSameDomain } = require('./urlToFilePath.js');

async function ScrapeRenderAllPages(startUrl, canonicalBaseUrl, baseDir) 
{
    /*Setup*/
    const absoluteBaseDir = path.join(__dirname, baseDir);
    CleanDirectory(absoluteBaseDir);
    const baseUrl = new URL(startUrl).origin; // Extract the base URL
    console.log("baseUrl: " + baseUrl);
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    const visitedUrls = new Set(); // only contains absolutes
    const pagesToRender = [baseUrl]; // start url
    // visit pages
    while (pagesToRender.length > 0)
    {
        const url = pagesToRender.shift();
        // clean url from trailing slashes
        const urlWithoutTrailing = url.replace(/\/$/, '');
        if (visitedUrls.has(urlWithoutTrailing)) 
        {
            continue;
        }
        console.log("Visiting url : " + urlWithoutTrailing);
        visitedUrls.add(urlWithoutTrailing);
        // ----
        console.log(`Rendering: ${urlWithoutTrailing}`);
        await page.goto(urlWithoutTrailing, { waitUntil: 'networkidle0', timeout: 60000 });
        const canonicalUrl = await page.$eval('link[rel="canonical"]', (el) => el.href).catch(() => urlWithoutTrailing);
        const cleanRelCanonicalUrl = getCleanRelativeUrl(canonicalBaseUrl, canonicalUrl);
        console.log("Clean relative canonical Url: " + cleanRelCanonicalUrl);
        // save the page, as it is valid and has not been visited before
        // get file path
        const filePath = turnUrlIntoFilePath(baseUrl, cleanRelCanonicalUrl, absoluteBaseDir);
        const dir = path.dirname(filePath);
        if (!fs.existsSync(dir)) 
        {
            fs.mkdirSync(dir, { recursive: true });
        }
        // Save the rendered HTML
        const html = await page.content();
        // Create a DOM element from the HTML content
        const dom = new JSDOM(html);
        const document = dom.window.document;
        // strip out all the <script> tags
        const scripts = document.querySelectorAll('script');
        scripts.forEach(script => script.remove());
        // Get the modified HTML without the <script> tags
        const htmlWithoutScript = document.documentElement.outerHTML;
        // save it
        fs.writeFileSync(filePath, htmlWithoutScript);
        console.log(`Saved: ${filePath}`);
        // find all the links on the page; NOTE: Puppeteer automatically resolves these to absolute URLs
        const links = await page.$$eval('a[href]', (anchors) => anchors.map((a) => a.href));
        for (const link of links) 
        {
            /*
                First check if the link has a domain at all
            */
            const potentialLinkedDomain = tldjs.getDomain(link);
            if (potentialLinkedDomain != null)
            {
                // it is associated with a domain, so potentially not relative to our domain, check it
                if (!isSameDomain(canonicalBaseUrl, link)) {
                    console.log("Is Not same domain: " + link);
                    // it is a link to an external website
                    continue;
                }
            }
            else
            {
                /*
                    Here Puppeteer will have resolved any links to absolute links. We only really get here
                    if we run against localhost, since tldjs doesn't consider localhost a valid domain. We strip
                    the domain down again (since Puppeteer has resolved it for us) and check that we have a valid
                    relative navigation without any odd modifiers; these are relative to our host, not the canonicalBaseUrl.
                */
                const cleanedLocalLink = getCleanRelativeUrl(baseUrl, link);
                // could be a relative link here; check that it doesn't contain any invalid characters such as '*' or '.'
                if (cleanedLocalLink.includes('*') || cleanedLocalLink.includes('.'))
                {
                    console.log("relative link contains invalid modifiers: " + cleanedLocalLink);
                    continue;
                }
            }
            // at this point links can point to our canonical website or be relative to our local host; strip both cases so that only the relative part remains
            const cleanedFromBase = getCleanRelativeUrl(baseUrl, link);
            const cleanedFromCanonical = getCleanRelativeUrl(canonicalBaseUrl, cleanedFromBase);
            // we can then recreate an absolute link by appending our base to the url 
            const absoluteUrl = urlLib.resolve(baseUrl, cleanedFromCanonical); // won't resolve some edge cases like mailto: links and other such garbage; it will just return cleanedFromCanonical. This is not guaranteed, which is why we do an extra check with isValidUrl later on

            // Only add internal links (same domain) and not visited URLs
            if (absoluteUrl.startsWith(baseUrl) && !visitedUrls.has(absoluteUrl))  // make sure it starts with our base and we haven't visited it before
            {

                if(isValidUrl(absoluteUrl)) // make sure url is valid 
                {
                    pagesToRender.push(absoluteUrl);
                }
            }
        }
    }
    return visitedUrls;
}

function calculatePriority(relativeUrl)
{
    console.log("calculating prio for: " + relativeUrl);
    // count the number of slashes after our baseurl
    const slashCount = ((relativeUrl.match(/\//g) || []).length); 
    // more slashes mean a lower priority, as the page is further from the root
    const priority = Math.max(1.0 - (slashCount * 0.1), 0.1); // 0.1 is lowest prio
    return priority;
}

function createSitemap(urls, baseUrl, canonicalWebUrl)
{
    const sitemap = [];
    urls.forEach(currUrl =>
    {
        const cleanedFromBase = getCleanRelativeUrl(baseUrl, currUrl);
        const fullyCleanedRelative = getCleanRelativeUrl(canonicalWebUrl, cleanedFromBase);
        const prio = calculatePriority(fullyCleanedRelative);
        const canonicalAbsolute = urlLib.resolve(canonicalWebUrl, fullyCleanedRelative); // make an absolute URL using the canonical url
        sitemap.push(
            {
                url: canonicalAbsolute,
                priority: prio
            });
    });

    return sitemap;
}

function convertSitemapToXML(sitemap)
{
    let xml = '<?xml version="1.0" encoding="UTF-8"?>\n<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n';
    const currentDate = new Date().toLocaleDateString('en-CA'); // 'en-CA' format is YYYY-MM-DD
    sitemap.forEach(entry => {
        xml += `  <url>\n    <loc>${entry.url}</loc>\n    <priority>${entry.priority}</priority>\n    <lastmod>${currentDate}</lastmod> \n  </url>\n`;
    });
    xml += '</urlset>';
    return xml;
}
const baseDirectory = "static";

const webserverUrl = "https://localhost:7273/";
const canonicalWebUrl = "https://www.satoma.se/";

ScrapeRenderAllPages(webserverUrl, canonicalWebUrl, baseDirectory).then((visitedUrls) => 
{
    console.log("All pages have been rendered");

    const sitemap = createSitemap(visitedUrls, webserverUrl, canonicalWebUrl);
    const xmlContent = convertSitemapToXML(sitemap);
    fs.writeFileSync('sitemap.xml', xmlContent, 'utf8');

    process.exit(0);
});

JavaScript

Supporting functions for the main script (urlToFilePath.js):

const path = require('path');
const fs = require('fs');
const tldjs = require('tldjs');

function isValidUrl(urlToBeChecked)
{
    /* Skip fragment/anchor links (URLs containing '#'), since they don't point to separate pages */
    return !urlToBeChecked.includes('#');
}

function isSameDomain(baseDomain, urlToBeChecked)
{
    const baseDomainReal = tldjs.getDomain(baseDomain);
    const toBeCheckedReal = tldjs.getDomain(urlToBeChecked);
    if (baseDomainReal === null || toBeCheckedReal === null)
    {
        return false;
    }
    return (baseDomainReal === toBeCheckedReal);
}

function getCleanRelativeUrl(BaseUrl, SubUrl)
{
    const BaseUrlWithoutProtocol = BaseUrl.replace(/^https?:\/\//, '');
    const SubUrlWithoutProtocol = SubUrl.replace(/^https?:\/\//, '');
    /*Strip away trailing slashes*/
    const BaseUrlWithoutTrailSlash = BaseUrlWithoutProtocol.replace(/\/$/, '');
    const SubUrlWithoutTrailSlash = SubUrlWithoutProtocol.replace(/\/$/, '');
    /*Strip away invalid chars*/
    const BaseUrlFullyStripped = BaseUrlWithoutTrailSlash.replace(/[*?"<>|]/g, '');
    const SubUrlFullyStripped = SubUrlWithoutTrailSlash.replace(/[*?"<>|]/g, '');
    /*Get relative sub url from base*/
    const RelativeSubUrl = SubUrlFullyStripped.replace(BaseUrlFullyStripped, '');
    return RelativeSubUrl;
}

function turnUrlIntoFilePath(BaseUrl, SubUrl, BaseDir) // BaseUrl is e.g. https://www.satoma.se/
{
    /* Get the relative sub url from the base (protocol stripping happens inside getCleanRelativeUrl) */
    const RelativeSubUrl = getCleanRelativeUrl(BaseUrl, SubUrl);
    /*If no subnavigation we are on index*/
    let returnPath;
    if (RelativeSubUrl === '') 
    {
        returnPath = path.join(BaseDir, 'index.html');
    }
    else
    {
        const FilePath = path.join(BaseDir, RelativeSubUrl);
        const AsHtmlDoc = FilePath.endsWith('.html') ? FilePath : FilePath + '.html';
        returnPath = AsHtmlDoc;
    }
    console.log("Return path: " + returnPath);
    return returnPath;
}

// Remove the output directory if it exists, then recreate it empty
function CleanDirectory(dirPath) 
{
    if (fs.existsSync(dirPath)) 
    {
        fs.rmSync(dirPath, { recursive: true, force: true });
    }
    fs.mkdirSync(dirPath); 
}

module.exports = {
    turnUrlIntoFilePath, 
    CleanDirectory,
    getCleanRelativeUrl,
    isValidUrl,
    isSameDomain
};
JavaScript
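
To try this yourself (assuming Node.js and npm are available), place both files in the same folder, install the dependencies with npm install puppeteer jsdom tldjs, start the Blazor WebAssembly site locally (for example from Visual Studio at https://localhost:7273), and run node scrapePages.js. The rendered pages end up in the static folder next to the script, and sitemap.xml is written to the working directory.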
