Notes on Edward Tufte’s Presenting Data and Information  

Here are my notes from today’s event by renowned statistician Edward Tufte, author of The Visual Display of Quantitative Information and Envisioning Information. They are primarily for my own reference but perhaps of interest to others.

A dramatic start

No announcement, no preamble. The lights went out and a visually striking video showing a representation of music started. Conversations were immediately hushed and devices put away. An effective technique to get attention and signal an absolute start.

Charts and tables

Sorting: Find a sort for your data that makes sense. Treat it as another axis and don’t waste it with the alphabet.

Sparse columns: Remove sparsely populated columns from tables. Special events should be specially annotated.

Linking lines: Always annotate them to describe the interaction. Prefer verbs over nouns, as the nouns merely restate the taxonomy.

Information does not fit in a tree. The web is successful because Tim Berners-Lee understood this and made links the connections between content. “Vague, but exciting”


Content is not clean. Data that shows behavior in a perfect way has likely been manipulated.

Human beings over-detect clusters and conspiracies. They find links between unrelated events, especially in sequences (serial correlation). Sports commentators, given any series of scores, will develop a false narrative to explain it. They’ll find a reason for 7 wins in a row even though random data produces such sequences.

Self-monitoring is a farce because people can’t keep their own score. Once something is measured it becomes a target and will be subsequently gamed and fudged as needed.

You can make many models that fit the data you are given. A model may work well for past and current data, but how long it keeps working is highly variable. This effect is referred to as shrinkage; no model lasts forever.

Big data is not a substitute for traditional data collection and analysis. Google famously treated it as one when they created Google Flu Trends, which tried to spot the spread of flu from search terms; it has been seriously criticized by Forbes and the New York Times.


Do not jump to conflict or character assassination. Your motives are likely no better (or worse).

How many nice comments wiped out a bad one? Ten… a hundred?

There is evil in the world but it probably does not exist in your day-to-day life.

A deck of slides

A deck is inefficient. It is easy for the presenter but hard for the audience, who are waiting for something they can use (“a diamond in the swamp”). Slow reveals further reduce the information density, and people will check out when it gets low.

Prefer spatially adjacent data (a document) over temporally stacked (slides). The often-cited limit of 7±2 items was for temporal retention so limiting a page to this number of items is actually the opposite of what that research was telling us. We can cope with much more data if it is all on-page together.

Meetings and presentations

Do not be afraid of paper.

Prepare a document in advance but do not send it; instead spend 30 minutes at the start of the meeting reading it in silence (known as a study hall). People can read faster than you can talk, they can go back and forth as needed and skip what they already know, and latecomers are less disruptive. Amazon famously uses this with its 6-page narrative memo system.

Never go meta in your presentation – stick to the content. Respect your audience and do not presume to know them, or you may find yourself pandering or lowering your expectations. Instead present the data to the best of your ability. Many complicated things are explained to millions of people all the time. You can’t teach if you have low expectations. Negativity and positivity are self-fulfilling.

Does your audience understand and trust you? Credibility is eroded not just by lying but by cherry picking. Evidence of cherry picking includes data that looks too good to be true and hiding the source of the data behind excuses such as copyright, proprietary information or other secrets. Why would a conclusion be open when the data needs to be secret? It’s likely a misrepresentation of the data for their own ends.

Note a few words when somebody asks you a question to make sure your answer stays on topic. If you don’t know the answer be honest but suggest where you would start looking for the answer. Never heckle or waste time correcting minutiae.

Doctor’s trip

A trip to the doctor’s office is a presentation. Write down your list before you go in and make them listen: they normally interrupt after 22 seconds and want to consider each item individually. If you let them, you’ll give up before you reach the end of your list and they may not see the connected pattern of the whole.


Every document needs an abstract. It should spell out as simply as possible:

  1. What the problem is
  2. Who cares
  3. What the solution is

If you can’t write this then you don’t have a document and you’re not saying anything.


Real scientists use LaTeX. There are thousands of templates, including official ones for well-known journals. Online tools like Overleaf can reduce the barrier to entry. LaTeX source looks like this:

\title{My presentation matters}
 Sample of LaTeX
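That line only sets the title. For context, a minimal complete document (an illustrative sketch of my own, not from the talk) that would compile with pdflatex or on Overleaf looks like this:

\documentclass{article}
\title{My presentation matters}
\begin{document}
\maketitle
Body text goes here.
\end{document}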

R is another alternative but it’s considered hard even by people who use LaTeX.


We are taught to read to extract facts to pass exams at school. We need to practice reading for enjoyment, reading to spot new information, to extract what we want, to form new opinions and ideas, to loot & hack.

Immediately skip words you don’t understand: there won’t be a test – you’re not at school.


Design does not belong to ‘other people’. Support thinking with analytical design and do whatever it takes to explain the data.

Why do bird books use illustrations? Because the authors want to help you spot the birds: with art they can exaggerate the differences as well as produce a generic version of each bird.

Nature magazine has some of the best designed visualizations around. Openness, pride and space constraints all help. (DNA only got 1.5 pages) The New York Times also often produces interesting visualizations of data.

User interface

Use the ideas proven by large successful sites on the web. Do not be swayed by arguments that your users won’t understand. Millions of users already do.

Touch is the next generation of user interface. It allows the chrome (interface junk) to be jettisoned. No scrollbars, no buttons, no cursor, no zoom. Pure information experiences, and this came not from academia, finance or medicine but from the consumer space.

“The future of interface design… is information design.” Edward Tufte – Seattle, August 4 2015

The original UI metaphors at Xerox PARC on the Alto were around a single document. Instead we have application-owned silos of data. The elegance was lost because companies want to control the content you create with their tools. They isolate your content so they can profit.

Hierarchies are still used for web design because they mimic the organization paying the bill. Organizations see themselves this way and do not focus on how their customers work and what they need. Famous examples include the Treasury Department burying tax forms seven levels deep despite them being a top user request, and the XKCD strip about university web sites. People on the inside have a skewed perspective of what the outside looks like.

The density of user interfaces is increasing which allows for richer visualizations especially when combined with animation or video. It is hard to get right.

Time window events with Apache Spark Streaming  

If you’re working with Spark Streaming you might run into an interesting problem if you want to output an event based on a number of messages within a specific time period.

For example: I want to send a security alert if I see 10 DDOS attempts to an IP address in a 5 minute window.


reduceByKeyAndWindow allows us to choose the IP address for the key and 5 minutes for the window. If we wanted to then collect the sourceIp and the timestamp it would look like this:

val messageLimit = 10
val messageWindow = Minutes(5)
val ssc = new StreamingContext(conf, Minutes(1)) // conf is your existing SparkConf

// ... create the DStream of messages from the Kafka consumer (e.g. via KafkaUtils), then:
    .filter(m => m.securityType == "DDOS")
    .map(m => m.targetIp -> Seq((m.timestamp, m.sourceIp)))
    .reduceByKeyAndWindow({ (x, y) => x ++ y }, messageWindow)
    .filter(g => g._2.length >= messageLimit)
    .foreachRDD(m => m.foreach(createAlertEvent))



The problem is your event will fire many times as the stateless RDD is re-run every batch period.

The simplest solution would be to make the batch interval the same as your message window size but that causes more problems, namely:

  • Your job can’t perform any other triggers on the source data at a shorter interval
  • You won’t know about these alerts until some time after they happen (in this case 5 minutes)

Keeping external state would be terrible, and neither Spark counters (accumulators) nor globals are much use here.


We need to do two things:

  1. Stop the RDD re-running and instead use streaming state. We can do this by using the reduceByKeyAndWindow overload that allows us to specify the inverse function for removing data as it goes out of window.
  2. Introduce a small amount of in-RDD state that can be used to identify when the event is cleared and when it should fire again.

Let us assume there is a class to handle part 2 named WindowEventTrigger that provides add and remove methods as well as a boolean triggerNow flag that identifies when the event should re-fire. Our RDD body would now look like this:

    .filter(m => m.securityType == "DDOS")
    .map(m => m.targetIp -> new WindowEventTrigger((m.timestamp, m.sourceIp), messageLimit))
    .reduceByKeyAndWindow(_ add _, _ remove _, messageWindow)
    .filter(t => t._2.triggerNow) // only alert when the trigger has just fired
    .foreachRDD(m => m.foreach(createAlertEvent))

How this works is actually quite simple. We have a case class called WindowEventTrigger that we map into the stream for each incoming message. It then:

  1. Tracks incoming messages and if it hits the level sets the flag and makes note of the event
  2. Tracks outgoing messages and resets when the event that caused the trigger goes out of window

By switching to this stateful windowed reduce, Spark will need to persist state in case executors go down or it is necessary to shuffle data between them. Ensure your StreamingContext object has a checkpoint folder set to reliable storage such as HDFS.
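For example, using the ssc context from the earlier snippet (the checkpoint path here is just a placeholder):

// reduceByKeyAndWindow with an inverse function requires checkpointing to be enabled
ssc.checkpoint("hdfs:///checkpoints/ddos-alerts")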

WindowEventTrigger class

Here is the WindowEventTrigger class for your enjoyment.

case class WindowEventTrigger[T] private(eventsInWindow: Seq[T], triggerNow: Boolean, private val lastTriggeredEvent: Option[T], private val triggerLevel: Int) {
  def this(item: T, triggerLevel: Int) = this(Seq(item), false, None, triggerLevel)

  // Combine state from two windows and fire the first time the trigger level is reached
  def add(incoming: WindowEventTrigger[T]): WindowEventTrigger[T] = {
    val combined = eventsInWindow ++ incoming.eventsInWindow
    val shouldTrigger = lastTriggeredEvent.isEmpty && combined.length >= triggerLevel
    val triggeredEvent = if (shouldTrigger) combined.drop(triggerLevel - 1).headOption else lastTriggeredEvent
    new WindowEventTrigger(combined, shouldTrigger, triggeredEvent, triggerLevel)
  }

  // Remove state that has aged out of the window; reset once the triggering event itself has gone
  def remove(outgoing: WindowEventTrigger[T]): WindowEventTrigger[T] = {
    val reduced = eventsInWindow.filterNot(y => outgoing.eventsInWindow.contains(y))
    val triggeredEvent = if (lastTriggeredEvent.isDefined && outgoing.eventsInWindow.contains(lastTriggeredEvent.get)) None else lastTriggeredEvent
    new WindowEventTrigger(reduced, false, triggeredEvent, triggerLevel)
  }
}
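A quick standalone check of the semantics (a sketch of my own with made-up values, not part of the streaming job):

val first = new WindowEventTrigger(("10.0.0.1", 1L), 2)   // one event, below the trigger level of 2
val second = new WindowEventTrigger(("10.0.0.2", 2L), 2)  // a second event arrives in the window
val fired = first add second
println(fired.triggerNow)                 // true: the level was reached for the first time
println((fired remove first).triggerNow)  // false: events ageing out never re-fire the trigger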

Happy streaming,


Table per hierarchy in Azure Table Storage  

If you’re coming from an ORM background to Azure Table Storage you might be wondering how to map class hierarchies to tables.

Documentation on the topic is hard to find unless you know the magic class name EntityResolver, which you can find by digging into the Azure Storage Client for .NET source code.

Let’s say we have a basic blog style system (minimal fields shown):

public class Content {
  public string Id { get; set; }
  public string Title { get; set; }
}

public class BlogPost : Content {
  public List<string> Topics { get; set; }
}

public class Page : Content {
  public string Slug { get; set; }
}

The trick is to create an instance of EntityResolver<T> where T is your base class, e.g. Content. Strangely EntityResolver’s signature requires that T implements new(), so you can’t make your base class abstract.

Firstly we need to add to our base class some kind of identifier for the type – in ORM terms this is referred to as a discriminator. Then we’d override that in the subtypes to ensure new instances get the correct type set on insertion.

Let’s say we want to store all of these in a single table called ‘content’; we would typically write a small helper class to handle the cloud table setup and storage. With the discriminator added to the base class and overridden in each subtype, the classes now look like this:

public class Content {
  public string Id { get; set; }
  public string Title { get; set; }
  public virtual string ContentType { get; set; }
}

public class BlogPost : Content {
  public List<string> Topics { get; set; }
  public override string ContentType {
    get { return "blog"; }
    set { }
  }
}

public class Page : Content {
  public string Slug { get; set; }
  public override string ContentType {
    get { return "page"; }
    set { }
  }
}

With just that change you can actually start inserting rows into Azure Table Storage but querying them back will always result in Content types and saving those back will result in data loss!

We can however help the CloudTable client materialize the correct results by creating an EntityResolver:

EntityResolver<Content> contentResolver = (partitionKey, rowKey, timestamp, properties, etag) => {
    var contentType = properties["ContentType"].StringValue;
    switch (contentType) {
        case "blog": return new BlogPost();
        case "page": return new Page();
        default: throw new NotSupportedException(String.Format("Unknown ContentType '{0}'", contentType));
    }
};
Which is then passed into operations that materialize results. Note that some signatures don’t accept a resolver, so find one that does even if it means supplying a default OperationContext. For example:

var query = table.CreateQuery<Content>().Where(c => c.PartitionKey == yearMonth);
var results = table.ExecuteQuery(query.AsTableQuery(), contentResolver, myRequestOptions, myOperationContext);

Given that these entity resolvers are essential to correctly materializing your results without data loss, it’s worth wrapping the CloudTable client with the necessary setup, table creation and entity resolver.
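A minimal sketch of what such a wrapper might look like (the ContentTable class and ForMonth method are illustrative names of my own, not part of any official API; it assumes the contentResolver shown above and that Content ultimately derives from TableEntity):

// using System.Collections.Generic; using System.Linq;
// using Microsoft.WindowsAzure.Storage;
// using Microsoft.WindowsAzure.Storage.Table;
// using Microsoft.WindowsAzure.Storage.Table.Queryable;
public class ContentTable {
  private readonly CloudTable table;
  private readonly EntityResolver<Content> resolver;

  public ContentTable(CloudStorageAccount account, EntityResolver<Content> contentResolver) {
    table = account.CreateCloudTableClient().GetTableReference("content");
    table.CreateIfNotExists();
    resolver = contentResolver;
  }

  // Every read goes through the resolver so rows always materialize as the correct subtype
  public IEnumerable<Content> ForMonth(string yearMonth) {
    var query = table.CreateQuery<Content>().Where(c => c.PartitionKey == yearMonth).AsTableQuery();
    return table.ExecuteQuery(query, resolver);
  }
}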


Quality of SSL protection for US financial institutions  

Troy Hunt put together a list of top Australian banks and their SSL rating using the Qualys SSL Server Test that reveals the somewhat depressing state of SSL security of various banks down under.

This got me wondering how US financial institutions stack up and I thought I’d share:

Update Nov 2015: Lots of great progress by many of the institutions, with the exceptions of KeyBank (still showing the POODLE vulnerability), Union (needs to support newer tech), Mint (lacking overall, considering they’re a tech company) and Citibank (being lame by blacklisting SSL Labs).