<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Cosmic Bytes</title>
  <subtitle></subtitle>
  <link href="https://cosmicbyt.es/feed.xml" rel="self"/>
  <link href="https://cosmicbyt.es/"/>
  
    <updated>2024-07-04T00:00:00Z</updated>
  
  <id>https://cosmicbyt.es</id>
  <author>
    <name>Nezar Tarbin</name>
    <email>nizar.t96@gmail.com</email>
  </author>
  
    
    <entry>
      <title>RSYNC as a Backup Server: Simple, Powerful and Flexible</title>
      <link href="https://cosmicbyt.es/posts/rsync-backup-server/"/>
      <updated>2023-02-13T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/rsync-backup-server/</id>
      <content type="html">
        <![CDATA[
      <h2 id="summary" tabindex="-1">Summary</h2>
<p>This post covers using RSYNC to back up files (notes, media, etc.) from multiple devices to a server: its benefits, how to set it up, and examples of its extensibility and flexibility.</p>
<h2 id="pre-requisites" tabindex="-1">Pre-requisites</h2>
<p>There are no prerequisites for reading and understanding this article. To actually implement it, however, you will need some kind of self-hosted server, and some familiarity with the command line or basic system administration is preferred (though you can probably get away without it if you are determined).</p>
<h2 id="introduction%3A-why-have-a-backup-server%3F" tabindex="-1">Introduction: Why Have a Backup Server?</h2>
<p>With our increasing reliance on digital data, resiliency against accidental or malicious loss has become increasingly important. No one likes to lose their photos full of memories that they hold dear!</p>
<p>Saving your data on just a single device (for example, your smartphone) without backups puts you at risk of losing this data. This can happen for a number of reasons:</p>
<ul>
<li>hardware failure</li>
<li>losing the device</li>
<li>disasters (electric surge, fires or floods).</li>
</ul>
<p><strong>The Solution?</strong> Backing up your data and replicating it to another device!</p>
<p>Now, my post is not talking about a typical backup solution. Typical backup solutions are usually:</p>
<ul>
<li>Third party services: Google Drive, Apple iCloud, Dropbox, etc.</li>
<li>full-featured self-hosted services: <a href="https://github.com/nextcloud/server">NextCloud</a>,  <a href="https://github.com/haiwen/seafile">SeaFile</a></li>
</ul>
<p>Instead, we will talk about another method using RSYNC: a self-hosted solution granting you many benefits in flexibility, customization, and control, at the expense of being a more DIY solution. We will explore that in detail shortly!</p>
<p>With that out of the way, let's talk more about RSYNC.</p>
<h2 id="what-is-rsync%3F" tabindex="-1">What is RSYNC?</h2>
<h3 id="basic-usage" tabindex="-1">Basic Usage</h3>
<p>RSYNC is &quot;a fast, versatile, remote (and local) file-copying tool&quot; according to its manual page.</p>
<p>In the most simple case, rsync is similar to the <code>cp</code> unix utility: it can copy a file from one path to another.</p>
<pre><code>rsync /path/to/file /path/to/destination
</code></pre>
<h3 id="remote-transfer-capability" tabindex="-1">Remote Transfer Capability</h3>
<p>One of rsync's unique capabilities is its ability to copy or transfer files over a network connection.</p>
<p>This is crucial for our backup system, since we will be backing up files to a separate device, and one of the best ways to do this is network transfer.</p>
<p>rsync must be installed on both devices.</p>
<p>There are two ways rsync can transfer files remotely:</p>
<ol>
<li>One device must be running the rsync daemon, and the second device runs the rsync command</li>
<li>One device runs the rsync command and uses SSH for the connection; the second device only needs rsync installed and SSH working.</li>
</ol>
<p>For more information and details on how to transfer files remotely, please reference rsync's manual page or the many resources on the internet going over this.</p>
<h3 id="rsync-replicates-filesystem-state" tabindex="-1">RSYNC Replicates Filesystem State</h3>
<p>Although basic usage shows rsync acting very much like <code>cp</code>, it can be a lot more powerful. I found it very helpful to think of rsync as a tool that replicates filesystem state (or part of said state) between one path and another (on the same device or two different devices), rather than just a basic copying tool. This is different from how the <code>cp</code> utility works. This may not make sense yet, so let me demonstrate.</p>
<p>Suppose I have two different paths. One path contains a set of files, the other contains another set of files, but there are some similarities:</p>
<pre><code>ls /path/to/dir1
#output:
file1 file2

ls /path/to/dir2
#output:
file2 file3
</code></pre>
<p>As we can see, file2 is present in both directories, while file1 is only present in dir1 and file3 is only present in dir2.</p>
<p>If we use the cp command such as in:</p>
<pre><code>cp --recursive /path/to/dir1/. /path/to/dir2/

ls /path/to/dir2
# output
file1 file2 file3
</code></pre>
<p>The cp command copies all files in dir1 to dir2, overwriting any files with the same name.</p>
<p>Here is what rsync could have done instead:</p>
<ol>
<li><strong>Only copy files from dir1 that are not in dir2 (no overwriting).</strong> Ignore any files that are present in both directories (even if content is different!)</li>
<li><strong>Only copy files from dir1 that are not in dir2.</strong> For files that are present in both directories, only copy them over if the content is different. You can be even more granular, copying a file over only if its edit timestamp is more recent, for example.</li>
<li><strong>Delete files in dir2 if they are not present in dir1.</strong></li>
</ol>
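<p>To make the three behaviors concrete, here is a small sketch using real rsync flags (<code>--ignore-existing</code>, <code>--update</code>, <code>--delete</code>) on throwaway local directories; the paths and contents are purely illustrative, and rsync must be installed:</p>
<pre><code># Set up two toy directories (temporary, illustrative paths)
src=$(mktemp -d); dst=$(mktemp -d)
echo &quot;one&quot; &gt; &quot;$src/file1&quot;; echo &quot;two&quot; &gt; &quot;$src/file2&quot;
echo &quot;stale old content&quot; &gt; &quot;$dst/file2&quot;; echo &quot;three&quot; &gt; &quot;$dst/file3&quot;

# 1. Copy only files missing from dst; never overwrite existing ones:
rsync --recursive --ignore-existing &quot;$src/&quot; &quot;$dst/&quot;

# 2. Copy, but skip files whose copy in dst is newer:
rsync --recursive --update &quot;$src/&quot; &quot;$dst/&quot;

# 3. Make dst mirror src, deleting files (like file3) not present in src:
rsync --recursive --delete &quot;$src/&quot; &quot;$dst/&quot;
</code></pre>
<p>After the last command, <code>dst</code> holds exactly <code>file1</code> and <code>file2</code> with the contents from <code>src</code>. The trailing slash on <code>&quot;$src/&quot;</code> means &quot;the contents of src&quot; rather than the directory itself.</p>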
<p>This is part of what I mean by &quot;replicating state&quot; of a filesystem. It is not only copying files, it is defining a specific state that the filesystem should be in.</p>
<p>And there are many more options! (you can find them all in the rsync manual page). You can ask rsync to copy and preserve file ownership, edit timestamps, file permissions, or you can ask it to change file permissions. RSYNC grants you very fine-grained control over the process, and the possibilities are endless!</p>
<p>I did not go too in-depth here, as my goal is only to quickly showcase the powers of RSYNC and convince you it is fit for this job. There are many resources on using RSYNC out there, including its detailed manual page, so I urge you to check them out.</p>
<h3 id="rsync's-diffing-algorithm-and-network-performance" tabindex="-1">RSYNC's Diffing Algorithm and Network Performance</h3>
<p>As we saw in the previous section, RSYNC is able to skip copying certain files based on certain conditions (they are already present in the destination path, their content is unchanged, they have not been edited since the last sync, etc.).</p>
<p>RSYNC achieves this with a diffing algorithm that is very fast (and configurable). This is very useful for our use case. Suppose you have a photo gallery on your smartphone that you want to back up and sync to your backup server. If every backup run copied EVERY FILE over the network, it would take a while, and that is not efficient. RSYNC can instead copy only the files that are not already present on the server. This way, it transfers far fewer files, and the process is much faster.</p>
<p>Given RSYNC has fairly low overhead, this makes the transfers quite fast.</p>
<h2 id="why-rsync%3F-why-not-nextcloud-or-other-self-hosted-solutions%3F" tabindex="-1">Why RSYNC? Why not NextCloud or Other Self-Hosted Solutions?</h2>
<p>RSYNC is not the best option for everyone, and other alternatives may very well be a more appropriate option for you.</p>
<h3 id="reasons-not-to-use-rsync" tabindex="-1">Reasons not to use RSYNC</h3>
<ul>
<li>it is a backup solution only. It cannot help you browse your backed up files, delete them, share them, etc.</li>
<li>more DIY than alternatives (if you don't like DIY)</li>
<li>Primarily CLI-based (if you don't like CLIs)</li>
<li>can lead to issues if mis-configured</li>
<li>By default, may not have a feature you want, but its customizability makes any missing feature easier to DIY</li>
</ul>
<p>It is important to note that for many people, those are pros, not cons.</p>
<h3 id="why-rsync%3F" tabindex="-1">Why RSYNC?</h3>
<ul>
<li>Focuses on one task and optimizes it really well (Unix philosophy)</li>
<li>fast and performant</li>
<li>versatile, flexible, and rich with features and CLI options</li>
<li>Built-in support for SSH for traffic encryption and remote user authentication</li>
<li>widely available with minimal requirements. Pre-installed on many Linux distributions!</li>
<li>has a diffing algorithm, allowing it to transfer only files that have not already been transferred</li>
<li>the directory structure of transferred files is under the user's control</li>
</ul>
<h2 id="why-not-syncthing%3F" tabindex="-1">Why not SyncThing?</h2>
<p><a href="https://github.com/syncthing/syncthing">Syncthing</a> is an open source remote file sync program. It runs on multiple devices and synchronizes files in one or several directories across them (no need for a centralized server).</p>
<p>Syncthing is great! It is not wrong to use Syncthing; it all depends on your needs. While it has a lot of overlap with rsync for our use case, there are some subtle differences:</p>
<ul>
<li>Syncthing replicates the exact state of a directory between different devices. RSYNC grants you fine-grained control over what to copy, as discussed in more detail earlier.</li>
<li>With Syncthing, deleting a file on one device deletes it on all synced devices. RSYNC makes that behavior optional, so you can prevent accidental deletions.</li>
<li>Syncthing is encrypted using TLS, and has strong authentication. RSYNC can optionally be encrypted and authenticated with SSH, but otherwise needs a custom solution for encryption and authentication.</li>
<li>Syncthing has a lot of features, such as version control, that RSYNC does not have included and would need a separate program for.</li>
<li>RSYNC is much more scriptable and interoperable with other Linux programs and utilities.</li>
</ul>
<p>Now let's get down to business and implement our RSYNC backup system!</p>
<h2 id="basics-architecture-for-rsync-backups" tabindex="-1">Basic Architecture for RSYNC Backups</h2>
<p>Here is what we need:</p>
<ul>
<li>a server with rsync installed (and either an SSH daemon or the rsync daemon running)</li>
<li>a device with data you want to back up, and rsync installed</li>
<li>a network connecting both devices (this could be your home wifi)</li>
<li>invoke rsync to back up! (or set it up to run automatically)</li>
<li>... profit?</li>
</ul>
<p>Wait, is it really that simple? That's it?
Yeah! Well, kinda... This is more of a basic (but working!) example. It is quite sufficient, but it also leaves you a lot of room to build on. First, we will follow those steps and set up something basic, then explore some options for customizations and add-ons (though I can't go through all of them, because the possibilities are endless!)</p>
<h2 id="installing-rsync" tabindex="-1">Installing RSYNC</h2>
<h3 id="linux" tabindex="-1">Linux</h3>
<p>If you are using Linux, there is a decent chance rsync is already installed. Just try:</p>
<pre><code>rsync --version
</code></pre>
<p>If it does not error, you're in luck!</p>
<p>Otherwise, you can install it using your distribution's package manager. Some examples below:</p>
<pre><code># Debian, Ubuntu and some of their derivatives
sudo apt install rsync

# Arch and derivatives
sudo pacman -S rsync

# Gentoo
emerge rsync
</code></pre>
<p>It can likely be installed by other means as well; you can find resources on this on the internet.</p>
<h3 id="android-using-syncopoli" tabindex="-1">Android using Syncopoli</h3>
<p>There are two ways I know to run rsync on Android: Syncopoli and Termux.</p>
<p><a href="https://gitlab.com/fengshaun/syncopoli">Syncopoli</a> is an Android-native client for the rsync protocol.</p>
<ul>
<li>Has a GUI</li>
<li>can schedule the triggering of the rsync job</li>
<li>has most of rsync's features, but is not one-to-one with native rsync, so some (mostly uncommon) features are not available.</li>
</ul>
<h3 id="android-using-termux" tabindex="-1">Android using Termux</h3>
<p>The second way to use rsync on Android is <a href="https://github.com/termux/termux-app">Termux</a>.</p>
<p>Termux is amazing. It is a terminal emulator for Android that lets you explore the world of the Linux terminal from your Android device, and this includes rsync.</p>
<ul>
<li>CLI instead of GUI. This means more typing, but that can be automated.</li>
<li>Access to many other Linux programs and utilities, and rsync can easily be chained with them.</li>
<li>Can be scheduled with crontab and Tasker</li>
<li>You get the same rsync as the one on Linux with all its features</li>
</ul>
<h3 id="other" tabindex="-1">Other</h3>
<p>You can probably find installation guides for other platforms; I will not include them here. The Android instructions are a bit hard to find, so I thought they were worth including.</p>
<h2 id="configuring-server-side-rsync" tabindex="-1">Configuring Server-Side RSYNC</h2>
<p>There are two basic ways to set up RSYNC on the server side:</p>
<ul>
<li>the server has an SSH daemon set up and is reachable over SSH</li>
<li>the server has the rsync daemon running</li>
</ul>
<p>Both methods have pros and cons.</p>
<h4 id="ssh" tabindex="-1">SSH</h4>
<ul>
<li>SSH gives you SSH encryption and authentication by default for rsync transfers, which is a big security benefit.</li>
<li>SSH requires that the client have access to a user with shell access. Depending on your security model, this may be okay, or it may be less secure.</li>
<li>SSH may be slower due to the encryption.</li>
<li>requires no rsync-specific setup. If SSH is working and rsync is installed, clients can begin using it immediately.</li>
</ul>
<h4 id="rsync-daemon" tabindex="-1">RSYNC Daemon</h4>
<ul>
<li>Has many features that are otherwise unavailable (such as chroot)</li>
<li>does not require shell access, which can be a security benefit depending on security model</li>
<li>transfers are unencrypted. This can be mitigated in many ways, such as with secure tunnels (SSH tunnel, TLS tunnel, etc.), but that is a more complex setup</li>
<li>weak authentication with weakly encrypted password exchange. Can also be mitigated like above</li>
<li>requires a dedicated daemon to be running in the background.</li>
</ul>
<p>Setting up SSH is out of scope for this post. Many readers have it set up already, and there are many guides on the internet covering it.</p>
<h3 id="setting-up-rsync-daemon" tabindex="-1">Setting up RSYNC Daemon</h3>
<p>To set up the rsync daemon, you first need to write an <code>rsyncd.conf</code> file to configure it.</p>
<p>Here is an example <code>rsyncd.conf</code> file with some descriptive comments:</p>
<pre><code>chroot = false

[modulename]
path=/path/to/destination
read only = false

[module2]
...
</code></pre>
<p>In the example above, [modulename] defines a module that clients can interact with. You can have multiple modules. a module is usually associated with one path for clients to rsync with, and a specified set of rules (such as our <code>read only = false</code>). Any rules defined under the <code>[modulename]</code> only apply to that module. Rules that are defined above all the modules apply to all the modules (such as our <code>chroot</code>).</p>
<p>I highly recommend exploring the manual page for rsync and rsyncd.conf. There are many options that you will find useful.</p>
<p>Before jumping to the next section, make sure you create the directory where the client will sync data to, if it does not exist already.</p>
<h2 id="invoking-from-client-side-(linux-or-termux)" tabindex="-1">Invoking from Client Side (Linux or Termux)</h2>
<p>Now that we have rsync installed, we can invoke it!</p>
<p>If your server is using SSH, then we can invoke rsync like:</p>
<pre><code>rsync [options] /path/to/local/directory remoteuser@hostname:/path/to/remote/directory
</code></pre>
<p>If your server is using rsync daemon, you can instead do:</p>
<pre><code>rsync [options] /path/to/local/directory rsync://hostname:port/modulename
</code></pre>
<p>There are other ways of invoking rsync that are documented in the manual pages.</p>
<p>What should we put in options? Well, that is up to you. One very common option is <code>--archive</code>, which combines several of the most commonly used flags (recursion plus preserving permissions, timestamps, ownership, and symlinks). With it, rsync copies over the files in the source directory that are either not present in the destination or whose size or modification time differ; files that pass that quick check are not transferred.</p>
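<p>As a quick, hedged illustration (local throwaway paths, so no server is needed), here is <code>--archive</code> copying a small tree while preserving its structure; note that a trailing slash on the source means &quot;the contents of this directory&quot;:</p>
<pre><code># Local sketch of an --archive run (temporary, illustrative paths)
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p &quot;$src/photos&quot;
echo &quot;pic&quot; &gt; &quot;$src/photos/a.jpg&quot;

# Recursively copy, preserving permissions, timestamps, symlinks, etc.
rsync --archive &quot;$src/&quot; &quot;$dst/&quot;
</code></pre>
<p>Adding <code>--dry-run --verbose</code> to any rsync invocation prints what would be transferred without changing anything, which is a good habit before a first real backup run.</p>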
<h2 id="conclusion-and-future-considerations" tabindex="-1">Conclusion and Future Considerations</h2>
<p>Well there we have it! We have made a basic backup setup with rsync and discussed its benefits and weaknesses. This setup, although basic, is still very powerful and capable, but there is still a lot that we can improve:</p>
<ul>
<li>Scheduling: This can be done with crontab or systemd timers (Linux) and Tasker (Android)</li>
<li>Trigger by file watch: Instead of invoking rsync at time intervals, you can use a file watcher that invokes rsync when a file changes. This is particularly useful for notes. It is doable on Linux, but I am not sure if it is possible on Android.</li>
<li>Viewing Remote Files: We can use programs to view our files on the server from a remote client. <a href="https://github.com/awesome-selfhosted/awesome-selfhosted#photo-and-video-galleries">here is a list of self-hostable image &amp; video gallery programs</a></li>
<li>Use an unprivileged user (security enhancement): use a locked-down user on the server side so that the client can't mistakenly do too much damage</li>
<li>Run the RSYNC daemon as an unprivileged user: many run the rsync daemon as root. Some (like me) run it as an unprivileged user. This means you lose the chroot feature, but a potential vulnerability in rsync would be less destructive.</li>
<li>Use secure tunnel with RSYNC Daemon: As we discussed, rsync daemon is unencrypted and uses weak authentication. This is usually less of a problem for local network transfers, but you can use secure tunnels to mitigate it. Whether it is VPN, SSH tunnel, or other options, rsync can do them all! I might write a post about this as information about it is a bit scarce on the internet.</li>
</ul>
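<p>For example, the scheduling idea above could look like this as a crontab entry (hypothetical paths and hostname; run <code>crontab -e</code> to edit your crontab):</p>
<pre><code># Every day at 03:00, back up the notes directory to the server
0 3 * * * rsync --archive /home/user/notes/ backupuser@backupserver:/srv/backups/notes/
</code></pre>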
<p>Thank you for reading! This is one of my first posts, so please give me feedback if you have it.</p>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>Customizing Find and Replace in the Command Line</title>
      <link href="https://cosmicbyt.es/posts/find-and-replace-customize-cli/"/>
      <updated>2023-03-19T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/find-and-replace-customize-cli/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>In this post, we will build commands for performing find and replace operations via the command line. We will first build it for a basic case, explore the customizations available to us, and build on top of that for more complex cases, such as performing the operation on a directory, previewing changes before committing them, adding a confirmation prompt, etc.</p>
<p>The goal of this article is not to just teach you how to find and replace in the command line. It is also aimed at teaching general command line skills and concepts, exploring the powers of the command line and the wealth of customizations it grants us.</p>
<p><strong>NOTE:</strong> I recommend you follow along. Open your terminal, create a test folder / directory and create random test files to apply the learnings you read here. Learning is best done by doing!</p>
<h2 id="why-not-use-text-editor-%2F-ide%3F" tabindex="-1">Why Not Use Text Editor / IDE?</h2>
<p>Just about all text editors and IDEs have a find and replace feature. Those are great if you are already in the middle of editing a file, but in my opinion, using the command line is better and faster (once you get used to it) for cases such as:</p>
<ul>
<li>You are not already in a text editor, and you only want to perform find and replace.</li>
<li>You want to apply the find and replace operation to multiple files. This is especially useful in a large codebase.</li>
</ul>
<p>You may disagree and find that the IDE is preferable. There's only one way to find out, so buckle up and get ready to see it in action!</p>
<h2 id="using-sed-to-find-and-replace" tabindex="-1">Using SED to Find and Replace</h2>
<p>sed is the ultimate utility for find and replace (though not the only one). sed, short for <code>Stream EDitor</code>, takes a text input (which can be a file) and performs filtering and transformations on it as specified by the user. The output can optionally be directed to a file (including the same file it was read from) or piped into another program; by default it goes to standard output.</p>
<p>sed has many options for text manipulation commands, but the most common is the <code>s</code> command, standing for substitute. It is so popular that it is the only sed command many people know, and some even think it is sed's only functionality. sed has many other commands, and you can find them by reading the info page (run <code>info sed</code> in the command line, or visit the gnu website).</p>
<h3 id="basic-usage" tabindex="-1">Basic Usage</h3>
<p>To use sed for performing find and replace on an input, you can run the following:</p>
<pre><code>sed [options] 's/[regex-pattern]/[text-to-replace-with]/[flags]' [input-file]
</code></pre>
<p>The part in single quotes is called the &quot;command&quot; or &quot;script&quot; for sed. It starts with <code>s</code>, denoting substitute. The <code>regex-pattern</code> is a regex pattern for the text to replace. <code>text-to-replace-with</code> is, self-descriptively, what sed will replace the matched text with.</p>
<p>Basic example:</p>
<pre><code>sed 's/patternToReplace/replaceWith/' somefile.txt
</code></pre>
<p>OR:</p>
<pre><code>echo 'random text with patternToReplace here' | sed 's/patternToReplace/replaceWith/'
</code></pre>
<h3 id="sed's-flags" tabindex="-1">SED's Flags</h3>
<p>Flags modify <code>sed</code>'s behavior, such as denoting how many of the matched patterns to replace, whether or not to be case insensitive, printing the line with a match, and some more advanced features, such as executing the result of the substitution as a shell command.</p>
<h3 id="replace-first-n-occurrences" tabindex="-1">Replace the Nth Occurrence</h3>
<p>sed processes its input line by line. Using a number as a flag tells sed to replace only the Nth match on each line (not the first N matches, as is sometimes assumed).</p>
<pre><code>sed 's/replaceThis/replaceWith/2' somefile.txt
</code></pre>
<p>The above command replaces <em>only</em> the second occurrence of &quot;replaceThis&quot; on each line with the value &quot;replaceWith&quot;. Without the flag, sed replaces only the first occurrence on each line. GNU sed also lets you combine a number with <code>g</code> (e.g. <code>2g</code>) to replace every occurrence from the Nth onward.</p>
<h3 id="replace-all-occurrences" tabindex="-1">Replace All Occurrences</h3>
<p>To replace all occurrences on each line, we use the <code>g</code> flag, denoting global.</p>
<pre><code>sed 's/replaceThis/replaceWith/g' somefile.txt
</code></pre>
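<p>A quick way to see the per-line flag behavior in action (safe to paste into any shell with sed available):</p>
<pre><code>echo 'aaa aaa aaa' | sed 's/aaa/bbb/'    # no flag: first occurrence only  -&gt; bbb aaa aaa
echo 'aaa aaa aaa' | sed 's/aaa/bbb/2'   # numeric flag: second occurrence -&gt; aaa bbb aaa
echo 'aaa aaa aaa' | sed 's/aaa/bbb/g'   # g flag: every occurrence        -&gt; bbb bbb bbb
</code></pre>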
<h3 id="find-and-replace-in-place-(overwrite-file)" tabindex="-1">Find and Replace In-Place (Overwrite File)</h3>
<p>By default, when <code>sed</code> performs a find and replace (or any operation) on a file or text input, it prints the result to standard output. It does NOT save the modifications to the file. To save it in the same file we used as input, we can use the <code>-i</code> or <code>--in-place</code> flag:</p>
<pre><code>sed -i 's/replacethis/replacewith/' somefile.txt
</code></pre>
<blockquote>
<p><strong>NOTE For MacOS and BSD Users:</strong> The BSD version of <code>sed</code> (also used on MacOS) is slightly different than the GNU version of <code>sed</code> found on most linux systems. The <code>-i</code> flag is one of those differences, where the flag requires a value (at least an empty one). I recommend you check the documentation, but the command will look something like:</p>
</blockquote>
<blockquote>
<pre><code>sed -i '' -e 's/replacethis/replacewith/' somefile.txt
</code></pre>
</blockquote>
<blockquote>
<p>Alternatively, you can always install GNU <code>sed</code> on MacOS and BSD (available on homebrew and other package managers).</p>
</blockquote>
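<p>To try the in-place flag on a throwaway file (GNU <code>sed</code> assumed; the path is temporary and illustrative):</p>
<pre><code>f=$(mktemp)
echo &quot;replacethis stays&quot; &gt; &quot;$f&quot;
sed -i 's/replacethis/replacewith/' &quot;$f&quot;
cat &quot;$f&quot;   # replacewith stays
</code></pre>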
<h3 id="save-the-substitution-result-to-new-file" tabindex="-1">Save The Substitution Result to New File</h3>
<p>If, instead of overwriting to the same file, we want to save it to a new file, we just redirect the command's result:</p>
<pre><code>sed 's/replacethis/replacewith/' somefile.txt &gt; newfile.txt
</code></pre>
<p>The <code>&gt; newfile.txt</code> syntax is the redirect operator, which redirects a command's standard output to a file (overwriting it if it already exists).</p>
<h3 id="there-are-many-more-options!" tabindex="-1">There Are Many More Options!</h3>
<p>This is only a subset of all features available in sed, and not even all the options for the substitute command! My intention is not to duplicate sed's documentation, but rather introduce you to <code>sed</code> and cover relevant concepts.</p>
<h3 id="alternatives-to-sed" tabindex="-1">Alternatives to SED</h3>
<p>sed is a great tool for this task, but are there alternatives? Of the popular utilities, <code>sed</code> is the simplest. More complex alternatives may grant you more features depending on what you are looking for, but that comes at a cost. You could even do what sed does with a full-featured scripting language that has a wealth of libraries, like Python or Node.js, but for simple tasks that is overkill and requires a lot more code.</p>
<p>However, there are tools and languages that require only a bit more code than sed while optionally offering many more features. awk is one example, as we will discuss below; perl and raku are full-featured scripting languages with strong text-editing capabilities that can perform substitutions with little code.</p>
<h4 id="awk" tabindex="-1">AWK</h4>
<p>A very common alternative to sed is awk. Like sed, awk is a text manipulation tool. There are a few differences:</p>
<ul>
<li>awk is able to recognize and manipulate delimited text such as tabulated data, columned text, csv, etc. sed has no concept or recognition of this beyond its regex capabilities.</li>
<li>awk is a much more complete programming language than sed. It has loops, conditionals, data structures like arrays, and the like.</li>
<li>Consequently, awk can perform analytics tasks such as calculating sums or counting data, or perform transformations onto the columns, such as removing certain columns or combining them.</li>
</ul>
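<p>For a taste of what that looks like, here is a small awk sketch that sums the second column of some whitespace-delimited input (something sed has no direct notion of):</p>
<pre><code>printf 'alice 3\nbob 5\ncarol 4\n' | awk '{ total += $2 } END { print total }'   # prints 12
</code></pre>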
<h2 id="recursive-find-and-replace-(in-a-directory-and-all-sub-directories)" tabindex="-1">Recursive Find and Replace (In a Directory and All Sub-directories)</h2>
<p>Now this is all great, but so far it is only slightly more impressive than a text editor's find and replace. Let us see how the command line can extend the powers we already uncovered to apply the same functionality to multiple files.</p>
<p>The beauty of the shell (and I suppose most programming languages) is that we have constructs that allow us to apply a command to a dynamic input or operand (input here is the file(s) we are applying find and replace to). This dynamic input can be our list of many files. The question is, which files do we want to perform find and replace on? There are a few situations I have in mind:</p>
<ul>
<li>Apply to all files in current directory</li>
<li>Apply to all files in current directory and sub-directories (recursion)</li>
<li>Apply only to files in a directory that have certain metadata (file extension, name, when they were last edited, etc.)</li>
<li>Apply only to files in a directory based on the file's content (i.e. the file's content has a pattern match)</li>
</ul>
<p>There are many command line utilities that can help us here. I will focus on two: <code>find</code> and <code>grep</code>.</p>
<h2 id="find%3A-search-and-filter-files-by-metadata" tabindex="-1">find: Search and Filter Files by Metadata</h2>
<p>The find utility searches through all files and directories in a given directory based on their metadata. By default, it prints the file paths to standard output.</p>
<p>It has many useful options and filters. Things like: file or directory name, creation timestamp, modification timestamp, path, file size, etc. I will not go through them all (check the documentation or other resources).</p>
<h3 id="basic-usage-1" tabindex="-1">Basic Usage</h3>
<pre><code>find path/to/directory -type f -name &quot;*.md&quot;
</code></pre>
<ul>
<li>The <code>-type f</code> flag says: only return files, not directories.</li>
<li>The  <code>-name &quot;*.md&quot;</code> flag says: only return files whose name ends in <code>.md</code>, meaning a markdown file. The <code>*</code> is a wild card match character.</li>
<li>The command above will therefore return files inside <code>path/to/directory</code> that meet the two criteria above.</li>
</ul>
<p>We can even apply this to multiple directories:</p>
<pre><code>find path/to/directory path/to/second/directory -type f -name &quot;*.md&quot;
</code></pre>
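<p>You can try this out on a throwaway tree (temporary, illustrative paths):</p>
<pre><code>dir=$(mktemp -d)
mkdir &quot;$dir/sub&quot;
touch &quot;$dir/notes.md&quot; &quot;$dir/photo.png&quot; &quot;$dir/sub/todo.md&quot;
find &quot;$dir&quot; -type f -name &quot;*.md&quot;   # prints the paths of notes.md and sub/todo.md
</code></pre>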
<h2 id="executing-a-command-based-on-find-'s-output" tabindex="-1">Executing a Command Based on find's Output</h2>
<p>Executing a command based on another command's output is very common in the command line world. This is very often done via <code>piping</code>: directing a command's standard output to the standard input of another command. We did this earlier with <code>sed</code> and <code>echo</code>:</p>
<pre><code>echo 'random text with PatternToReplace here' | sed 's/PatternToReplace/ReplaceWithThis/'
</code></pre>
<p>The <code>|</code> is the pipe operator.</p>
<p>This will not work with find. find will likely output more than one file, so we cannot simply pipe the result into sed; sed would treat the file paths as text input rather than as filenames whose contents we want to modify. There are a few alternatives:</p>
<h3 id="command-substitution" tabindex="-1">Command Substitution</h3>
<p>The <code>$()</code> is the shell command substitution syntax. We can use:</p>
<pre><code>sed -i 's/somePattern/ReplaceWith/g' $(find some/path -type f)
</code></pre>
<p>What command substitution does is it executes the command inside the parentheses, and substitutes the result in its place instead. So the shell will substitute <code>$(find some/path -type f)</code> with the file paths, and only then will it execute the sed command.</p>
<p>One advantage of this method is that we invoke the sed command only once, with multiple files as input. I haven't benchmarked this, but invoking a command once with several files as arguments can perform better than one invocation per file. In some cases, though, that option is not available, and a command can only accept one file as an argument. One caveat: the shell splits the substituted output on whitespace, so this approach breaks on paths containing spaces; the options below handle those more safely.</p>
<h3 id="piping-with-xargs" tabindex="-1">Piping with xargs</h3>
<p>We said earlier that we cannot simply pipe the output of find into sed. However, we can pipe the output into a utility called <code>xargs</code>. <code>xargs</code> reads the output of a command like <code>find</code> and uses it to build the argument list for another command (in our case <code>sed</code>), invoking that command with as many of the collected arguments at a time as will fit. Here is how we can do it:</p>
<pre><code>find path/to/directory -type f -print0 | xargs -0 sed -i 's/somePattern/replaceWith/g'
</code></pre>
<p><code>xargs</code> takes the output of the <code>find</code> command before the pipe operator and runs the <code>sed</code> command with <code>find</code>'s output paths as its arguments.</p>
<p>Notice the <code>-print0</code> and <code>-0</code> flags. <code>-print0</code> tells <code>find</code> to separate its output with a null character, and <code>-0</code> tells <code>xargs</code> to split its input on that same null character. This way, <code>xargs</code> can reliably tell one file path from another, even when a path contains spaces or newlines.</p>
<p>Like other command line utilities, <code>xargs</code> is very customizable and can fit many use cases; its man page documents many more options.</p>
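<p>For instance, when a command accepts only one file per invocation, <code>xargs -n 1</code> runs it once per path. A small sketch, using <code>wc -l</code> as a stand-in command and made-up file names:</p>

```shell
# Create two throwaway files (hypothetical names).
mkdir -p xdemo
printf 'a\nb\n' > xdemo/one.txt
printf 'c\n'   > xdemo/two.txt

# -n 1 forces one invocation of wc per file, so we get one count per line.
find xdemo -type f -print0 | xargs -0 -n 1 wc -l
```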
<h3 id="using-find's-built-in--exec-option" tabindex="-1">Using find's Built-in -exec Option</h3>
<p>The options above are great and usable on various command line utilities. But <code>find</code> has its own built-in functionality for executing a command for every path matching its filters. This option is the <code>-exec</code> option.</p>
<pre><code>find path/to/directory -type f -name &quot;*.md&quot; -exec sed -i 's/somePattern/replaceWith/g' {} \;
</code></pre>
<p>We simply put the command after <code>-exec</code>.</p>
<p>However, notice two unusual pieces of syntax. First, the <code>{}</code>: this tells <code>find</code> where to insert each matched path as an argument to the <code>sed</code> command. When executing the command, <code>find</code> replaces <code>{}</code> with the path. Second, the <code>\;</code> is just an escaped semicolon. The semicolon <code>;</code> tells <code>find</code> where the <code>-exec</code> command ends, which is useful if you want to tack other options onto the <code>find</code> command; it must be escaped so the shell does not interpret it itself.</p>
<p>One thing to note is that you can have multiple <code>-exec</code> in the same find command! There's a lot of potential customizations that can be made for more complex situations.</p>
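<p>One related detail from <code>find</code>'s documentation: terminating <code>-exec</code> with <code>+</code> instead of <code>\;</code> makes <code>find</code> pass the matched paths in batches (much like <code>xargs</code>) rather than running the command once per file. A sketch with made-up names, assuming GNU <code>sed</code>:</p>

```shell
# Two throwaway markdown files containing the pattern (hypothetical names).
mkdir -p edemo
printf 'somePattern\n' > edemo/a.md
printf 'somePattern\n' > edemo/b.md

# The + terminator appends as many paths as fit into one sed invocation.
find edemo -type f -name "*.md" -exec sed -i 's/somePattern/replaceWith/g' {} +
cat edemo/a.md edemo/b.md   # prints "replaceWith" twice
```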
<h2 id="adding-a-confirmation-prompt" tabindex="-1">Adding a Confirmation Prompt</h2>
<p>Suppose we want to find and replace, but we are not sure we want to replace every instance of the pattern we match. One way of solving this is previewing each change and asking the user to confirm it.</p>
<p>There are several ways of doing this. We can either script the confirmation and preview on our own, or we can use another of <code>find</code>'s built-in functionalities.</p>
<h3 id="find's--ok-option" tabindex="-1">find's -ok Option</h3>
<p><code>find</code>'s <code>-ok</code> option behaves very similarly to <code>-exec</code>, with one difference: before executing the given command, the user is shown the exact command (after substitution) and asked to confirm it. Only then is the command executed.</p>
<pre><code>find path/to/directory -type f -ok sed -i 's/somePattern/replaceWith/g' {} \;
</code></pre>
<p>But this only asks for confirmation. Before confirming, I would ideally want to look at the file and its contents. How can we do this?</p>
<h2 id="previewing-changes" tabindex="-1">Previewing Changes</h2>
<p>As we said earlier, <code>find</code> supports chaining multiple <code>-exec</code>s in order. This applies to <code>-ok</code> as well. So, we can add a command that would preview the change without modifying the file, then use <code>-ok</code> to ask the user if they would like to proceed.</p>
<h3 id="using-grep-to-show-state-before-modification" tabindex="-1">Using grep To Show State Before Modification</h3>
<p><code>grep</code> is an amazing and commonly used tool. It searches a file for a pattern and prints to standard output the line(s) containing it. We can use <code>grep</code> to show the state of the lines to be modified right before we modify them. It will also highlight the part of each line that will get modified. Since <code>grep</code> also uses regex, we can use the same regex we used with <code>sed</code>.</p>
<pre><code>find path/to/directory -type f -exec grep somePattern {} \; -ok sed -i 's/somePattern/replaceWith/g' {} \;
</code></pre>
<p>The above command is very similar to the previous one, except we added an <code>-exec</code> flag with the <code>grep</code> command. <span style="font-weight: 900; font-style: italic">The order here is very important</span>. The <code>grep</code> command executes first; <em>then</em> the <code>-ok</code> command is shown to the user and executed upon confirmation, as discussed earlier.</p>
<p><code>grep</code> has many options that can be explored. One common example is the <code>--context</code> flag, which shows lines before and after the match. For example, <code>grep --context 2 --regexp somePattern [file-name]</code> shows two lines before and after each match. If you want only lines before the match or only lines after it, there are <code>--before-context</code> and <code>--after-context</code> respectively.</p>
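<p>To make the behavior concrete, here is a small sketch of <code>--context</code> on a throwaway file (file name made up):</p>

```shell
# A three-line file where only the middle line matches.
printf 'line1\nmatch here\nline3\n' > gdemo.txt

# --context 1 prints one line before and one line after each match,
# so all three lines appear in the output.
grep --context 1 'match' gdemo.txt
```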
<h2 id="conclusion" tabindex="-1">Conclusion</h2>
<p>This concludes the first chapter of our endeavor to explore find-and-replace in the command line. We surveyed the many options for customizing these commands and built up a command-line script that performs find-and-replace recursively, with a confirmation prompt and a preview to avoid unintended destructive behavior. The commands we built together are powerful on their own, but I hope this post also helped you explore the endless possibilities for customization, so you can fine-tune them to your needs or apply them to other commands, even ones unrelated to find-and-replace!</p>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>Unix Permissions - Part 1: Anatomy of Permissions</title>
      <link href="https://cosmicbyt.es/posts/unix-permissions-part-1-anatomy/"/>
      <updated>2023-04-20T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/unix-permissions-part-1-anatomy/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p><code>permission denied</code>: an error message that every command line user has faced before. Often, users solve this by running <code>sudo [previous command]</code> (sometimes without considering the consequences!). The error arises from the <strong>traditional Unix permissions system</strong>: the security mechanism that POSIX operating systems use by default to protect the system from malicious actors, compromised processes, and even user mistakes. Whether you aim to secure your system or to work productively around an existing permission on your Linux or macOS computer, learning how Unix permissions work is a crucial part of system administration and day-to-day command line usage.</p>
<h2 id="summary" tabindex="-1">Summary</h2>
<p>In this article, we will talk about the traditional Unix permission system: what it is exactly and how it works under the hood. We will break the system down into its individual pieces, discuss each one, and see how they all tie together, to gain a full understanding of how it all works.</p>
<p>This is the first part of a series; the next parts will explore the suite of command-line tools available for working with this security system, and look at an example of using Unix permissions to architect our system's security model. Please check them out as well!</p>
<h2 id="pre-requisites" tabindex="-1">Pre-requisites</h2>
<p>Basic familiarity with a Unix-like system (Linux, macOS, or BSD) and the command line is encouraged, but not absolutely necessary.</p>
<h2 id="anatomy-and-model-of-unix-permissions" tabindex="-1">Anatomy and Model of Unix Permissions</h2>
<p>There are 5 main entities in traditional Unix Permissions:</p>
<ul>
<li><strong>User:</strong> The elementary unit. Has permissions associated with it</li>
<li><strong>Group:</strong> a group of users</li>
<li><strong>File:</strong> the entity we are imposing protections on</li>
<li><strong>Process:</strong> Gets assigned a user to gain its permissions</li>
<li><strong>File Permissions:</strong> denotes which users (and groups) have which accesses to a certain file</li>
</ul>
<p>To understand how each of those work together, let us explore them in detail one-by-one.</p>
<h3 id="user" tabindex="-1">User</h3>
<p>When you first hear the word &quot;user&quot;, you might imagine an actual person that uses the computer. This is how most people use them in Microsoft Windows, and indeed, Unix systems originally used them the same way. Most early computers were shared by multiple people, and the traditional Unix permission system was deliberately designed to secure users from each other. Our modern threats of malware or spyware from the internet were non-existent at the time.</p>
<p>However, times have changed. Nowadays, most computers are not multi-user, and in cases which they are, other users using the same machine are the least of your security worries. Threats coming from the internet (malware, hackers, etc) or vulnerabilities from the many applications installed on your system are far more important.</p>
<p>But the user model can still satisfy modern needs if we think of users differently. Instead of associating a user with a person, we conceptualize it as a unit that has a collection of permissions, and can be assigned to a process (including a login shell session) so that said process adopts that user's permissions. Upon attempting to access a file, the operating system will check the permissions associated with the file, and confirm that the user has the necessary permissions to perform the operation.</p>
<h3 id="group" tabindex="-1">Group</h3>
<p>A group is, as you'd expect, a group of users. It is a many-to-many relationship; a user can belong to multiple groups, and a group can have multiple users.</p>
<p>Like users, groups also have permissions associated with them. If a user is part of a group, they are granted the group's permissions.</p>
<p>As we will see later, groups are a key component in allowing many-to-many relationships between users and their permissions to access files. It is only with groups that we are able to grant accesses to a file to more than one user.</p>
<h3 id="permissions" tabindex="-1">Permissions</h3>
<p>Permissions are the core component of Unix permissions. Each file has three types of permissions:</p>
<ul>
<li><strong>Read:</strong> Ability to open the file and read its content (or copy it)</li>
<li><strong>Write:</strong> Ability to modify the file's content, completely overwrite it, or delete it</li>
<li><strong>Execute:</strong> Ability to run as a process (relevant for programs or scripts)</li>
</ul>
<p>These permissions are defined separately for each &quot;scope&quot;. A scope is a certain grouping of users. Each file has 3 scopes associated with it:</p>
<ul>
<li>The <strong>user</strong> that owns the file, or the owner.</li>
<li>The file's <strong>group</strong>, or group owner.</li>
<li><strong>other</strong>: All other users not included in the other scopes</li>
</ul>
<p>For example, for a specific file, we can say:</p>
<ul>
<li>The (user) owner has read and write access to the file.</li>
<li>The users belonging to the file's group have read access to the file. They can check out its content but cannot modify anything.</li>
<li>All other users have no permissions to this file. They cannot see its content or modify it</li>
</ul>
<p>This would be an example of a list of permissions associated with a file.</p>
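<p>As a preview of tools covered later in this series, here is roughly how that example looks on a real system (the file name is hypothetical; <code>chmod 640</code> sets exactly the permissions described):</p>

```shell
# Create a file and give it: owner read+write, group read, others nothing.
touch family-photos.txt
chmod 640 family-photos.txt

# The first column of ls -l shows the permissions: -rw-r-----
ls -l family-photos.txt
```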
<h3 id="files" tabindex="-1">Files</h3>
<p>Files are the element that Unix permissions system aims to protect. As we saw earlier, each file has a set of permissions associated, defining which users can do what. This is dictated by the file's owner and group, which are also defined separately for each file.</p>
<p>Upon creating a new file, it is usually assigned the owner and group based on the process or shell session that created it, but they can be changed. Setting the right owner and group is an important step in devising a security model to protect a certain file or your system as a whole. We will discuss this more deeply in later sections.</p>
<h2 id="how-it-really-works" tabindex="-1">How it Really Works</h2>
<p>Let us now describe how the operating system manages access in traditional Unix permissions. We will take a scenario where a certain user attempts to access a certain file. How exactly can we predict if the access will be granted?</p>
<h4 id="initial-setup" tabindex="-1">Initial Setup</h4>
<ul>
<li>We start with a regular Linux system</li>
<li>The system already has a set of files on its filesystem, and a set of users and groups defined</li>
<li>Each file on the system has a user owner and a group owner</li>
<li>Each file will have defined a list of permissions, denoting:
<ul>
<li>What accesses (read, write, or execute) the user owner has</li>
<li>What accesses the group owner has</li>
<li>What accesses all other users have</li>
</ul>
</li>
</ul>
<h4 id="putting-unix-permissions-to-test!" tabindex="-1">Putting Unix Permissions to Test!</h4>
<p>Let's see what it looks like when the Unix permissions on the system we set up above are put to the test.</p>
<ul>
<li>Suppose a user (or a process) tries to access a file; the following happens:</li>
<li>The system determines the type of access requested. Is it read, write, or execute?</li>
<li>Next, the system determines which scope the user falls into
<ul>
<li>Is the user the user owner of the file? Then the scope is &quot;user&quot;</li>
<li>Does the user belong to the group owner of the file? Then the scope is &quot;group&quot;</li>
<li>If none of the above, the scope is &quot;other&quot;</li>
</ul>
</li>
<li>The system fetches the permissions associated with the scope determined in the previous step</li>
<li>The system checks whether the requested access is in the list of permissions for that scope</li>
<li>If the system verifies that the requested access is in the permissions, then access is granted. Otherwise, access is rejected.</li>
</ul>
<blockquote>
<p>NOTE: This is a conceptual model to help us understand how it works, and is not meant to represent the exact implementation behind the scenes.</p>
</blockquote>
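<p>We can watch this decision procedure play out with a quick sketch (the file name is hypothetical; <code>chmod</code> and <code>ls -l</code> are covered in the next part of this series):</p>

```shell
# A file only its owner may read and write.
touch secret.txt
chmod 600 secret.txt
ls -l secret.txt          # mode column reads -rw-------

# As the owner, our scope is "user" and read is permitted, so this succeeds.
cat secret.txt
# A user falling into the "other" scope would instead get "Permission denied"
# (unless they are root, which bypasses these checks).
```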
<h2 id="conclusion" tabindex="-1">Conclusion</h2>
<p>We now have a conceptual model of how Unix permissions work, which will be very handy when troubleshooting issues or securing a system or file. Whenever you come across a complex issue, imagining the steps the system took to arrive at the error message it gave you can be the key to finding the solution, or the key to finding the vulnerability in your security model.</p>
<p><a href="/posts/unix-permissions-part-2-cli-apps/">In part 2 of this series,</a> we will dive into command line tools that help us work with permissions, such as setting or modifying permissions, or setting or modifying file owners or groups. Please check it out and give me feedback if you have it!</p>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>Unix Permissions - Part 2: chmod and Other Permission Command-line Apps</title>
      <link href="https://cosmicbyt.es/posts/unix-permissions-part-2-cli-apps/"/>
      <updated>2023-05-14T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/unix-permissions-part-2-cli-apps/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>In <a href="/posts/unix-permissions-part-1-anatomy/">the previous part of this series,</a> we learned about Unix permissions and how they work, and established a conceptual model for understanding them. In this section, we will explore the practical side: How do we actually use Unix Permissions? How can I set or change permissions following the model from earlier?</p>
<p>We will look at various command-line applications that work with Unix permissions, and we will take a deep dive into two of the most important ones: <code>chmod</code> and <code>chown</code>.</p>
<h2 id="chmod" tabindex="-1">chmod</h2>
<p><code>chmod</code> is a utility to define or modify permissions of a file.</p>
<p>Recall from the previous part how every file has permissions for 3 access types (read, write and execute) for each of 3 scopes (user, group and others).</p>
<p><code>chmod</code> is precisely the utility that helps us set and define those permissions! It is the most important tool that we will discuss.</p>
<p>The command uses the following syntax:</p>
<pre><code>chmod [OPTIONS] [MODE],[ANOTHER MODE]... [FILENAME]
</code></pre>
<p>The &quot;MODE&quot; is the important part here. It is basically a way to express the &quot;permissions&quot; that we want to define or change.</p>
<p>There are two different syntaxes for MODE: symbolic mode and octal mode. Symbolic mode is verbose, whereas octal mode uses numbers to make a shorter command. Let us explore them both.</p>
<h3 id="symbolic-mode-syntax" tabindex="-1">Symbolic MODE Syntax</h3>
<p>We can summarize symbolic mode syntax as follows: <code>[scope][operator][accesses]</code>.</p>
<p><code>[scope]</code> is the same scope we discussed when we spoke about permissions earlier. The scope is either the owner, group, or others.</p>
<p>There is one more scope that <code>chmod</code> allows: &quot;all&quot;, which denotes all of the scopes. Using it is equivalent to defining the same permissions separately for each scope, but in a single MODE instead.</p>
<p>Each scope is denoted by a single letter in the symbolic MODE syntax:</p>
<ul>
<li><code>u</code> for the user owner</li>
<li><code>g</code> for the group owner</li>
<li><code>o</code> for all other users</li>
<li><code>a</code> for all the above</li>
</ul>
<p><code>[accesses]</code> is the access types we spoke of previously, and they are also each denoted by single letters:</p>
<ul>
<li>read, denoted by <code>r</code></li>
<li>write, denoted by <code>w</code></li>
<li>execute, denoted by <code>x</code></li>
</ul>
<p>Finally, the <code>operator</code> tells us what exactly to do with the access types for the specific scope:</p>
<ul>
<li><code>+</code> to add access to existing accesses</li>
<li><code>-</code> to remove from existing accesses</li>
<li><code>=</code> to set the accesses. In other words, those not listed in our mode will be removed</li>
</ul>
<p>Each mode has exactly one scope and one operator, but can have one or more of the accesses. In other words, a mode can change only one access type for a given scope, or it can change all 3 for that scope.</p>
<p>To change multiple scopes, you need to write a separate symbolic mode for each.</p>
<p>For example, if we want to add read and write access for the group owner, we can do the following:</p>
<pre><code>chmod g+rw someFile.txt
</code></pre>
<p>We used the plus <code>+</code> operator here, which as we said, <em>adds</em> accesses and does not set them. This means that if the file had execute access for the group owner, it would still be there. If we want to make sure only the accesses we declare in the mode persist, we must use the equal <code>=</code> operator:</p>
<pre><code>chmod g=rw someFile.txt
</code></pre>
<p>Suppose we also want to remove write access from &quot;all other users&quot; at the same time. To do this, we can add another mode; modes are comma-separated.</p>
<pre><code>chmod g=rw,o-rw someFile.txt
</code></pre>
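<p>We can verify the combined effect with <code>ls -l</code>. A sketch starting from a known state (file name hypothetical):</p>

```shell
touch someFile.txt
chmod 644 someFile.txt          # known starting point: -rw-r--r--
chmod g=rw,o-rw someFile.txt    # set group to rw, strip r and w from others
ls -l someFile.txt              # mode column now reads -rw-rw----
```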
<p>It is important to note that there are other options for MODE that I did not discuss here. Please feel free to read the documentation for <code>chmod</code> to learn more interesting things you can do with permissions.</p>
<h3 id="octal-mode-syntax" tabindex="-1">Octal MODE Syntax</h3>
<p>Octal MODE syntax is the most commonly used syntax. It simplifies MODE syntax to use only 3 digits. Each digit represents which permissions are to be defined for each scope. We provide 3 digits to define the permissions for the 3 different scopes.</p>
<p>The 3 digits define permissions for the 3 scopes in the following order:</p>
<ol>
<li>User</li>
<li>Group</li>
<li>Others</li>
</ol>
<p>or:</p>
<pre><code>chmod [user digit][group digit][others digit] someFile.txt
</code></pre>
<p>With each digit denoting the access types we want to set for the given scope.</p>
<p>How do we determine which number goes with which access type? It goes like this:</p>
<ul>
<li>1 for <code>x</code> or execute access</li>
<li>2 for <code>w</code> or write access</li>
<li>4 for <code>r</code> or read access</li>
<li>0 for no access</li>
</ul>
<p>So, if I want write access for the user, read access for the group, but no access for the rest, we can do:</p>
<pre><code>chmod 240 someFile.txt
</code></pre>
<p>Keep in mind, the permissions here are being <em>set</em>, not added to existing permissions. In other words, if &quot;other users&quot; had any access before, it goes away. If the group had write or execute access before, that goes away too.</p>
<p>But what if we want more than one access? what if we want both read <em>and</em> write access for the user? Don't sweat it, we are in luck! The numbers were chosen carefully so that they can represent assigning multiple permissions in a single digit. To do so, simply add the digits together for the permissions you want. The numbers were chosen such that every sum is unique, and <code>chmod</code> can deduce which accesses you intend to give with just a single digit.</p>
<p>If we want both read and write access, we just add the digits for read and write: 2+4=6.</p>
<p>So if we want read and write access for user, read access for group, and read access for all others:</p>
<pre><code>chmod 644 someFile.txt
</code></pre>
<p>What if we want read, write and execute for the user, read and write for the group, but only read for everyone else?</p>
<pre><code>chmod 764 someFile.txt
</code></pre>
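<p>To read permissions back in octal form, GNU <code>stat</code> (Linux) offers <code>-c '%a'</code>; on macOS/BSD the rough equivalent is <code>stat -f '%Lp'</code>. A sketch (file name hypothetical, GNU stat assumed):</p>

```shell
touch someFile.txt
chmod 764 someFile.txt
stat -c '%a' someFile.txt   # prints: 764
```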
<p>Congratulations! You just learned how to use <code>chmod</code>!</p>
<blockquote>
<p><strong>NOTE:</strong> Only the file's owner and the root user are able to execute <code>chmod</code> on a file.</p>
</blockquote>
<h2 id="chown" tabindex="-1">chown</h2>
<p>Standing for &quot;change owner&quot;, <code>chown</code> does exactly what you just read: it takes a file and changes its owner. This includes the user owner and, optionally, the group.</p>
<p>This is very important. While <code>chmod</code> allows us to change the &quot;permissions&quot; that are applied to a file's owner and group, it is just as necessary that we are able to set these owners and groups. This defines the set of users for which we are setting or granting permissions.</p>
<p>To change the owner with <code>chown</code>, we can:</p>
<pre><code>chown nezar someFile.txt
</code></pre>
<p>which sets the owner of the file to the user &quot;nezar&quot;.</p>
<p>To change the group as well, we have to separate the user and group by a colon <code>:</code></p>
<pre><code>chown nezar:movies someFile.txt
</code></pre>
<p>This changes the owner to &quot;nezar&quot; and the group to &quot;movies&quot;.</p>
<blockquote>
<p><strong>NOTE:</strong> Only the root user can change the owner of a file. The owner <em>can</em> change the group, but only to another group they belong to; otherwise, root privileges are required.</p>
</blockquote>
<h2 id="other-tools" tabindex="-1">Other Tools</h2>
<p>There are a few other tools that are helpful. Below are examples of some of those tools. I will not cover them in depth, but feel free to explore them on your own.</p>
<ul>
<li><strong><code>ls -l</code></strong>: The <code>ls</code> command lists all files in a directory. The <code>-l</code> option displays more details on each file, including its permissions</li>
<li><strong>useradd:</strong> Creates a new user</li>
<li><strong>groupadd:</strong> Creates a new group</li>
<li><strong>usermod and groupmod:</strong> Modify existing users and groups. Helpful for assigning a user to a group, or changing the groups a user belongs to</li>
<li><strong>userdel and groupdel:</strong> Delete users and groups</li>
<li><strong>chgrp:</strong> very similar to <code>chown</code> but only for changing a group</li>
<li><strong>umask:</strong> Sets the mask that determines the default permissions of newly created files in the current session</li>
<li><strong>newgrp:</strong> Change the group assigned to the current login session</li>
<li><strong>su:</strong> Change the current user or run a command as another user</li>
<li><strong>sudo and doas:</strong> Run commands as another user (typically root), with more fine-grained policy control than <code>su</code></li>
</ul>
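<p>As a quick taste of one of these, <code>umask</code> masks bits out of the permissions given to newly created files. A sketch (file name hypothetical):</p>

```shell
# Mask out group write and all access for others on newly created files.
umask 027

# Regular files are created from a base of 666 (rw-rw-rw-);
# applying the 027 mask leaves 640.
touch masked.txt
ls -l masked.txt   # mode column reads -rw-r-----
```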
<h2 id="conclusion" tabindex="-1">Conclusion</h2>
<p><a href="/posts/unix-permissions-part-3-devising-security-model">In the next part of this series,</a> we will look at how we can use what we learned so far to secure a file or a system using Unix Permissions. Please remember to provide feedback, it will help me write better in the future!</p>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>Unix Permissions - Part 3: Devising a Security Model</title>
      <link href="https://cosmicbyt.es/posts/unix-permissions-part-3-devising-security-model/"/>
      <updated>2023-06-07T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/unix-permissions-part-3-devising-security-model/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>In the <a href="/posts/unix-permissions-part-2-cli-apps/">previous parts of this series,</a> we learned how the permission system really works and explored various CLI applications that work with Unix permissions. In this last part of the series, we will use those learnings to secure files and systems.</p>
<p>Our increasing reliance on the internet and on computers holding sensitive data makes digital security ever more important. We will look at where Unix permissions fit in a general threat scenario: which threat models they can defend against, and which they cannot.</p>
<h2 id="threat-models-unix-permissions-can-defends-against" tabindex="-1">Threat Models Unix Permissions Can Defend Against</h2>
<p>Unix permissions can defend against attacks that target files on the file system, as that is what they are designed to protect. They can protect these files from:</p>
<ul>
<li>A process or application that is malicious, buggy or compromised</li>
<li>A malicious or incompetent user who has shell access (physically or remotely via SSH)</li>
</ul>
<p>If a file is protected with the right permissions, the malicious actor could be prevented from reading or stealing the contents of a file, modifying or overwriting the content, or deleting the file entirely.</p>
<h2 id="threat-models-where-unix-permissions-will-not-help" tabindex="-1">Threat Models Where Unix Permissions Will Not Help</h2>
<ul>
<li><strong>An attacker who has root shell access:</strong> root access bypasses Unix permissions</li>
<li><strong>A process that has root access</strong></li>
<li><strong>An attacker that has physical access to the hardware:</strong> With physical access, the attacker can separate the hardware from the operating system, bypassing the operating system's protections. They can connect the hardware to another computer to access the files, for example.</li>
<li><strong>Data not on the filesystem:</strong> Traditional permissions can only protect files. Data in network transit or displayed on a graphical application displayed on a monitor are not covered.</li>
</ul>
<p>As we can see, Unix permissions are necessary but not sufficient. Other security layers can cover the threat models where Unix permissions fall short, but they are out of scope for this article.</p>
<h2 id="devising-a-security-model-using-traditional-unix-permissions" tabindex="-1">Devising a Security Model Using Traditional Unix Permissions</h2>
<p>Now that we understand Unix permissions and how to work with them, we can actually start using them! What we have learned so far is sufficient for working with individual files or directories on a case-by-case basis, but what if we take a step back and look at the big picture of securing our system?</p>
<p>We already have the necessary skills for interacting with the permission system, but the system is flexible enough that there are several ways to use traditional permissions to devise a security model. These range from giving everything permissive permissions, making matters as easy as possible with little regard to security, to going totally overboard at the cost of convenience and ease of use.</p>
<p>I will discuss with you a security model that I recommend. It may be overkill for some, but it would still be valuable to learn, and you can feel free to adjust it yourself as you see fit, as it is easy to only partially apply this strategy.</p>
<h3 id="least-privilege-principle" tabindex="-1">Least Privilege Principle</h3>
<p>The principle of least privilege tells us that every user must be given only the minimum permissions necessary to perform their tasks. This limits the damage any malicious or compromised actor can cause and the data they can access. The set of things an attacker could reach is sometimes referred to as the <strong>attack surface</strong>.</p>
<p>If a user is given more permissions than they need, there is little benefit to doing so, while the attack surface grows if that user is ever compromised. Reducing permissions to the minimum needed is therefore beneficial.</p>
<p>Moreover, this means that having one or few users have all or most permissions is undesired. Instead, having more users, each having a smaller subset of permissions, can lead to increased security.</p>
<p>Imagine, for example, that a process got compromised, and the attacker can now manipulate this process. If we did not follow the principle of least privilege, the process could end up having the ability to cause a lot of damage or compromise a lot of data. The attacker can manipulate data, steal it, or control other processes. But if least privilege principle was followed, even with a compromised process, the attacker may end up with little to nothing of interest to do.</p>
<p>We model our system so that an attacker ends up in exactly this situation: even if they compromise a process somehow, the process is so limited that the damage they can do is limited as well. Even when we believe a process is unlikely to be compromised, we can never be sure, and there is little sacrifice in applying least privilege anyway.</p>
<p>So how exactly do we use traditional permissions to implement least privilege principle?</p>
<h3 id="assigning-users-to-access-patterns" tabindex="-1">Assigning Users to Access Patterns</h3>
<p>Let us start with our users. Recall that users are the elementary unit that holds permissions and requests access based on them. Therefore, this is the entity whose permissions we want to limit as much as possible. We associate users with certain access patterns, such as:</p>
<ul>
<li>a certain program or process we expect to run, such as a web-server</li>
<li>a script that will get auto-triggered</li>
<li>a remote device or script accessing our system via SSH</li>
<li>an actual user / person</li>
</ul>
<p>Each of these should have its own user, with only the least amount of permissions needed to perform its tasks. Whenever a new user is created, carefully determine which accesses it needs and grant only those.</p>
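<p>For example, a dedicated least-privilege account for a web server might be created like this (the account name is made up; this requires root, and flag details and the <code>nologin</code> path vary between distributions):</p>

```shell
# -r creates a system account (IDs drawn from the system range);
# a nologin shell prevents anyone from logging in interactively as it.
useradd -r -s /usr/sbin/nologin webserver
```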
<h3 id="granting-permissions-via-groups" tabindex="-1">Granting Permissions Via Groups</h3>
<p>Now that we figured out how to distribute users, let's give them the permissions they need!</p>
<p>Recall that a user can gain permission to a file by being part of one of three scopes: the user owner of a file, part of the group owner of the file, or all other users.</p>
<p>If we want to grant a user access to a file, our natural inclination might be to just make that user the owner of the file, and define permissions as such. This model is insufficient in many cases, as there can only be one owner per file, which deprives us of the ability to grant access to more than one user.</p>
<p>It would also not be wise to rely on the &quot;all other users&quot; scope, as this deprives us of the granularity of choosing which users are granted access to a file. Yes, we do not want to grant access to only one user, but we also don't want to grant access to unvetted users.</p>
<p>This leaves us with the last scope: the group. Controlling access by setting the file's group permissions, and adding the users we want to authorize to that group, gives us the best balance. We can control which users get access, and what access they get.</p>
<p>This is how we arrive at our approach: to give a user permissions to a file, they must belong to the file's group.</p>
<h3 id="modeling-and-organizing-groups" tabindex="-1">Modeling and Organizing Groups</h3>
<p>Since we grant a user access to a file by adding them to the file's group, we should organize groups accordingly: associate each group with a single category of access, or more precisely, with a set of files that share common access expectations.</p>
<p>Let's take an example. Suppose we have a directory that includes a set of private notes. To follow the model we devised, we would create a group for all of those notes. We will conveniently call it &quot;notes&quot;:</p>
<pre><code># create the group
groupadd notes
# assign the notes directory to the notes group
chgrp notes notes/
</code></pre>
<p>So now, whenever we want to grant any user access to &quot;notes&quot;, we would just add it to that group. To do this with the user &quot;nezar&quot;:</p>
<pre><code>usermod --append --groups notes nezar
</code></pre>
<p>and now the user nezar is granted access to notes!</p>
<p>Of course, we must remember to set the permissions for users and groups correctly, and to choose a good user owner for the directory:</p>
<pre><code># read/write for owner and group; X adds execute only on
# directories (so they stay traversable), nothing for others
chmod --recursive u=rwX,g=rwX,o= notes/
# set root as the user owner; no trailing colon, so the
# group we set with chgrp is left untouched
chown --recursive root notes/
</code></pre>
<p>The above makes the user owner and the group able to read and write the files (and traverse the directories), while all other users can neither read, write nor execute them. Then, it sets root as the user owner. Note that a numeric mode like <code>660</code> would also strip the execute bit from the directory itself, making it impossible to enter; the symbolic <code>X</code> avoids this by applying execute only to directories.</p>
<p>And we can follow this model for all other files and directories. To summarize:</p>
<ul>
<li>identify a semantic grouping of files that would have common permissions for a set of users</li>
<li>create a group with a name semantically describing that group of files, similar to what we did with our &quot;notes&quot;</li>
<li>assign the files to the group with <code>chgrp</code> or <code>chown</code></li>
<li>define the permissions for the files with <code>chmod</code></li>
<li>add users to the group using <code>usermod --append --groups ...</code> to grant them access</li>
</ul>
<h2 id="limitations-and-alternatives-of-unix-permissions" tabindex="-1">Limitations and Alternatives of Unix Permissions</h2>
<p>The traditional Unix permissions model is simple yet powerful. Its biggest benefit is that it is the default on most Unix-like systems, but sometimes it is too simple and limited.</p>
<p>For example, traditional permissions may allow us to define permissions for a group of users, but they do not allow us to set different permissions for different groups of users. We can only set permissions for the 3 scopes granted to us: the user owner, the group, and all other users. The first can only fit a single user, and the last grants access to everyone else without any granularity. This means we cannot define distinct permissions for two or more specific groups of users.</p>
<p>Thankfully, there are alternatives. I will list some of them here without covering them in this post, but look out for future posts, as I may write about them!</p>
<ul>
<li>POSIX Access Control Lists</li>
<li>AppArmor</li>
<li>SELinux</li>
</ul>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>Demystifying Containers and Docker</title>
      <link href="https://cosmicbyt.es/posts/demistifying-containers-part-1/"/>
      <updated>2023-06-09T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/demistifying-containers-part-1/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>Containers have swiftly taken over the software development world over the past decade, saving many software developers tons of time. But they are often taken for granted, as they &quot;just work&quot; and stay out of your way with their black box magic.</p>
<p>But when the time comes that you have to mess with it, unfamiliarity can bring a lot of friction, and you may wonder why your team is introducing this seemingly unnecessary complexity. &quot;Why can't we just run the app locally like normal people?!&quot; you may think. Well today, I am here to tell you that containers provide you a ton of value!</p>
<h2 id="summary" tabindex="-1">Summary</h2>
<p>In this series, we will break the mystery surrounding the black box magic of containers. In this post in particular, we will start by asking ourselves: what are containers, and what is a container actually doing when it is running our application inside of it? What difference does it make to run the application locally vs. running it in a container? What problems does containerization solve?</p>
<p>We will focus on standard containers as defined by the Open Container Initiative, of which Docker is part. There are other types of containers that are out of scope for this post.</p>
<h2 id="application-dependence-on-environment" tabindex="-1">Application Dependence on Environment</h2>
<p>When developing software, you are not only dealing with your application's code. There are piles of dependencies on which your code is stacked and relies to behave as expected. I am not only talking about dependencies that your code explicitly lists in your favorite package manager; there are many other dependencies that are less visible. For example:</p>
<ul>
<li>the programming language's runtime or compiler</li>
<li>the operating system and distribution (Debian, Alpine, CentOS, Red Hat, etc.)</li>
<li>the web server or reverse proxy (nginx, Apache, etc.)</li>
<li>operating system libraries and applications (glibc, the shell, etc.)</li>
<li>cryptographic dependencies (e.g. OpenSSL, GPG)</li>
<li>the database (PostgreSQL, MySQL, MongoDB, etc.)</li>
</ul>
<p>And there are likely many others, depending on the application. Each of those has its own explicit and implicit dependencies. Consider that a difference in the versions of one of those dependencies, or in their configuration or setup, may cause a variation or failure in your application's execution. Imagine how hard it is to track down the source of this change in behavior.</p>
<p>This is only exacerbated if the application is running on a computer that is not even yours. How could you possibly know the environment details of a remote computer where your application is running?</p>
<h2 id="reproducibility-of-environments" tabindex="-1">Reproducibility of Environments</h2>
<p>The natural solution to an application's reliance on its environment is to tell the user how to set up the correct environment via a setup guide (README.md). This is harder for consumer-facing applications, as it relies on the technical competency of the user. Oftentimes, however, and especially for enterprise applications (which are the focus of this post), the &quot;user&quot; is another developer (or DevOps engineer) using or deploying the application.</p>
<p>However, even when the user is technically competent, there is still large room for error, and it is almost impossible to anticipate every possible mistake the user can make.</p>
<p>There is no telling what is installed on the user's computer. With millions of software applications out there, do you really know how each of them will affect your application?</p>
<p>And even if you do, the constantly changing state of your application, and updates to the environment, would require constant re-evaluation of this setup guide. Will you go through re-setting up your environment every time you make changes to your codebase?</p>
<p>Another computer may have a different operating system, a different set of applications they have pre-installed, or certain configurations. Will your setup guide cover all those edge cases or tell the user if their edge case is not covered?</p>
<p>If only there was a way to automatically set up an environment without requiring user intervention ... and somehow make this environment isolated, such that it is controlled, predictable, and does not affect the rest of the computer ... and somehow make all of this portable and very easy for someone to take and run on their own computer ...</p>
<p>That is precisely what Containerization technology aims to do!</p>
<h2 id="what-are-containers" tabindex="-1">What are Containers</h2>
<p>For ease of understanding, let's break down the individual pieces of the definition. A container:</p>
<ul>
<li>is an isolated and restricted environment</li>
<li>encapsulates an application and <strong>all</strong> of its dependencies (down to OS libraries, <a href="#containers-share-a-single-host-kernel">but not the kernel</a>)</li>
<li>behaves predictably regardless of the underlying infrastructure</li>
<li>runs on a host Linux kernel shared with other containers</li>
</ul>
<p>The above are what we need for this first part of the series. For the next parts of this series, we will unpack that containers:</p>
<ul>
<li>have standardized tools and methods to operate (create, run, stop, snapshot, etc.)</li>
<li>can be uploaded, downloaded, and run by a runtime without modification</li>
</ul>
<p>Containers are specified in greater detail in the <a href="https://github.com/opencontainers/runtime-spec/blob/main/spec.md">Open Container Initiative Specification</a>, which is an effort to standardize containerization technology, of which Docker is a part.</p>
<p>Let's break down this definition further.</p>
<h2 id="isolation" tabindex="-1">Isolation</h2>
<p>Containers are isolated environments, and this is one of the most important concepts. Isolation helps keep the environment controlled and the application's execution predictable. What exactly does this mean? What exactly is isolated?</p>
<p>Container isolation occurs at multiple levels: the file system, processes, user IDs, the network, system resources (RAM, CPU, etc.), and others.</p>
<p>To achieve this, containers use existing Linux features: Namespaces and Control groups (cgroups). cgroups allow isolation and restriction of system resources, while namespaces handle most of the rest.</p>
<h3 id="file-system-isolation" tabindex="-1">File System Isolation</h3>
<p>Containers provide an isolated file system. In other words, the container has its own file system that is separate from the host computer's and other containers' file systems.</p>
<p>To better understand this, think of the file system of your current computer. On Linux, if you look at the root path, you will find something like this:</p>
<pre class="language-bash"><code class="language-bash"><span class="token punctuation">[</span>user@computer /<span class="token punctuation">]</span>$ <span class="token function">ls</span> /<br>bin   dev  home  mnt         proc run   srv   sys  usr<br>boot  etc  lib   lost+found  opt  root  sbin  tmp  var</code></pre>
<p>Now imagine a similar structure replicated, but under a subdirectory.</p>
<pre class="language-bash"><code class="language-bash"><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token function">mkdir</span> /home/user/sub-directory<br><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token builtin class-name">cd</span> /home/user/sub-directory<br><span class="token comment"># create similar top-level directories</span><br><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token function">mkdir</span> bin dev home mnt <span class="token punctuation">..</span>.</code></pre>
<p>So now we have a structure that is very similar to our root, except all the folders are empty. But it is starting to resemble its own small computer inside our big one!</p>
<p>To take this further, we can use <strong>change-root</strong>, or <code>chroot</code>. change-root changes the root directory. The root directory is the directory accessible by the <code>/</code> path, like when you do <code>cd /</code> or <code>ls /</code> like we did above. But chroot allows us to change this (temporarily) to another sub-directory to perform certain commands.</p>
<p>For chroot to work, we need at least one program to be present so we can run it. We will use <code>bash</code>, which can usually be found under <code>/bin/bash</code>.</p>
<pre class="language-bash"><code class="language-bash"><span class="token comment"># print current working directory</span><br><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token builtin class-name">pwd</span><br>/home/user/sub-directory<br><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token function">cp</span> <span class="token parameter variable">-a</span> /bin/bash ./bin/bash<br><span class="token comment"># copy libraries used by bash</span><br><span class="token comment"># For convenience, we will copy the entire lib directory</span><br><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token function">cp</span> <span class="token parameter variable">-a</span> /lib ./<br><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token function">cp</span> <span class="token parameter variable">-a</span> /lib64 ./<br><span class="token comment"># chroot syntax: chroot [path to chroot into] [command to run]</span><br><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token function">chroot</span> ./ /bin/bash<br>bash-5.1<span class="token comment">#</span></code></pre>
<blockquote>
<p>To exit the change-rooted environment, use the <code>exit</code> command.</p>
</blockquote>
<p>Now we have entered the change-rooted environment! It is not very useful in its current state, though. If we try to run most things, chances are they will error out:</p>
<pre class="language-bash"><code class="language-bash">bash-5.1<span class="token comment"># ls</span><br>bash: ls: <span class="token builtin class-name">command</span> not found</code></pre>
<p>The <code>ls</code> command does not exist in our isolated environment. This means that whenever we load our own application in here, it will only use programs it can find in the isolated environment, which is exactly what we are looking for!</p>
<blockquote>
<p><strong>Note:</strong> While <code>chroot</code> is a useful analogy, it does not create secure file system isolation, and it is possible for a malicious program to escape and affect the host file system. Moreover, contrary to popular belief, standard containers don't use <code>chroot</code> anymore; they use <code>pivot_root</code> instead.</p>
</blockquote>
<blockquote>
<p><strong>Note 2:</strong> change-rooting is not the only mechanism used for file system isolation. Namespace mount isolation is also used. It is very similar to process isolation, which also utilizes namespaces, and is discussed in the next section.</p>
</blockquote>
<h3 id="process-isolation" tabindex="-1">Process Isolation</h3>
<p>A process in Linux is a running program. When a program is run, it is assigned a process ID, which helps keep track of it and perform operations like stopping or pausing it, launching spin-off processes, allocating resources, and so on.</p>
<p>To check currently running processes, we can use the <code>ps</code> command.</p>
<pre class="language-bash"><code class="language-bash"><span class="token punctuation">[</span>user@computer ~<span class="token punctuation">]</span>$ <span class="token function">ps</span> <span class="token parameter variable">-au</span><br>    PID TTY          TIME CMD<br>    <span class="token number">345</span> tty1     00:00:10 sway<br>    <span class="token number">398</span> tty1     00:00:02 swaybar<br>    <span class="token number">610</span> tty1     00:00:00 swaybg<br>   <span class="token number">1841</span> tty1     00:00:01 <span class="token function">vim</span><br>   <span class="token number">2155</span> tty1     00:00:05 <span class="token function">bash</span><br>   <span class="token number">3001</span> tty1     00:00:08 <span class="token function">node</span></code></pre>
<p>Containers use a Linux feature called namespaces, with which Linux can create a brand new <em>process namespace</em> for the container. You can imagine it as an isolated, blank space for the container's processes and process IDs. Processes of other containers (and the host) cannot be &quot;seen&quot; by processes inside the container, nor influenced or manipulated by them. It looks as if the container's processes are the only ones that exist. You can verify this by running the <code>ps</code> command inside the container to see which processes are visible inside.</p>
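<p>You can also peek at namespaces from the host. Every process lists the namespaces it belongs to under <code>/proc/&lt;pid&gt;/ns</code>, and two processes in the same namespace point at the same inode. A small, unprivileged sketch:</p>

```shell
# each entry is a symlink naming a namespace type and its inode
ls -l /proc/self/ns/

# the PID namespace of the current shell, e.g. "pid:[4026531836]";
# a containerized process would show a different inode here
readlink /proc/$$/ns/pid
```

<p>Comparing this output for a process inside a container and one outside it shows two different inodes, i.e. two separate process namespaces.</p>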
<p>Linux namespaces support more than just process namespaces! Let's talk about a few of the important ones.</p>
<h3 id="network-namespace" tabindex="-1">Network Namespace</h3>
<p>Containers also use network namespaces for network isolation. This helps prevent network collisions and cross-container interference. For example, suppose you have two applications that both want to listen on port 3000. Since network ports are namespaced, you can avoid this collision: both containers can listen on port 3000 in their own network namespaces, which are isolated from each other.</p>
<p>We can also map port 3000 from each of those containers to a different port on the host, giving them access to and from the wider network.</p>
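<p>To see the collision that network namespaces prevent, here is a small sketch using Python's built-in HTTP server as a stand-in for two applications (an assumed example; any two programs binding the same port behave the same way):</p>

```shell
# start one server on port 3000 in the background
python3 -m http.server 3000 >/dev/null 2>&1 &

sleep 1

# in a shared network namespace, a second bind on the same port fails
# (on Linux: "OSError: [Errno 98] Address already in use")
python3 -m http.server 3000

# clean up the background server
kill $!
```

<p>Inside two separate network namespaces, as containers provide, both binds succeed, and each container's port 3000 can then be mapped to a different host port.</p>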
<h3 id="user-namespace" tabindex="-1">User Namespace</h3>
<p>Containers can create their own users and groups that are separate from other containers and the host system. I spoke extensively about users and groups in my <a href="/posts/unix-permissions-part-1-anatomy/">series on traditional Unix permissions</a>.</p>
<h3 id="control-groups" tabindex="-1">Control Groups</h3>
<p>Control groups, or cgroups, are another Linux feature, which allows isolating and restricting system resource utilization by containers. This lets us limit resource usage by containers, or ensure that they are provided sufficient resources to run.</p>
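<p>As a quick, unprivileged peek, every Linux process can report which cgroup it belongs to, and on a cgroup-v2 system the resource controls a container runtime sets appear as files under <code>/sys/fs/cgroup</code>:</p>

```shell
# which cgroup does the current shell belong to?
cat /proc/self/cgroup

# on cgroup v2, resource controls appear as files (e.g. memory.max);
# this requires a cgroup mount, and paths vary by distribution
ls /sys/fs/cgroup/ | head
```
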
<h2 id="containers-share-a-single-host-kernel" tabindex="-1">Containers Share a Single Host Kernel</h2>
<p>Although containers strive to create isolated environments, their isolation stops at the kernel level. Everything above the kernel, such as operating system libraries and userland applications, is isolated, but the kernel itself is not.</p>
<blockquote>
<p><strong>Wait a minute, what even is a kernel?</strong><br>
The kernel is the program that acts as the first layer interacting with the hardware. It abstracts many of the varying hardware features and makes them available to userland programs via its API.<br>
Moreover, the kernel is also what makes it possible to run multiple applications at once and schedule them. It manages resource allocation between those different processes. The kernel is very important, and most of our code is written with the assumption that there is a kernel handling those tasks.</p>
</blockquote>
<p>Instead, containers share the host kernel if it is Linux; on other operating systems, a single Linux kernel is run in a lightweight virtual machine, and all the containers share that one kernel.</p>
<p>Virtual Machines (VMs) are another tool used for creating isolated environments. They emulate the hardware, and each VM has its own instance of a kernel running, providing deeper isolation.</p>
<h4 id="why-not-use-vms-then%3F" tabindex="-1">Why not Use VMs then?</h4>
<p>VMs are much slower, since emulating hardware and running a separate kernel is a resource-intensive task. It creates subpar performance for not that much added benefit: the kernel is rarely the cause of software incompatibility, and it behaves largely the same across different machines.</p>
<p>Moreover, containerization is a technology for software development productivity. Its main purpose is easing the software development and deployment cycle, and so container tooling provides tons of features that make containers very easy to use.</p>
<p>While VMs provide a ton of added security, that is usually not why containers are used. Containers are used to make development easier: to ensure that the execution of our application is predictable and reproducible. We don't like surprises in production!</p>
<h2 id="conclusion" tabindex="-1">Conclusion</h2>
<p>Containers save developers a ton of wasted time. They allow us to keep the execution of our applications predictable, and to avoid the issues caused by differing system environments across different computers.</p>
<p>We saw how containers create isolated environments using Linux features. They isolate the file system, the process space, and others, in order to rule out the possibility of interference from the host environment.</p>
<p>This demystifies what a running container looks like, and what exactly the runtime does to achieve this. In the next part of this series, we will look at how containers are created and built, and how this build process is guaranteed to yield the output we expect. Stay tuned!</p>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>What is the Fediverse, and Frequently Asked Questions</title>
      <link href="https://cosmicbyt.es/posts/what-is-fediverse-faq/"/>
      <updated>2023-06-09T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/what-is-fediverse-faq/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>Social media is a constantly evolving space, with new ideas coming out all the time and new platforms overtaking others or establishing new unexplored territories. I would even argue that social media has become a core part of the internet, overtaking the early internet dominated by individual websites and forums.</p>
<p>The Fediverse is one of those ideas in social media with a very interesting perspective, both from a social and a technical point of view. While the Fediverse has not overtaken mainstream social media by any means, it has occupied a firm niche with millions of active users (according to <a href="https://the-federation.info">The-Federation.info</a>), and the number seems to increase in sudden bursts in response to certain events, like Elon Musk's purchase of Twitter and Reddit charging for API usage by third-party clients.</p>
<p>The biggest problem with the Fediverse is that it is intimidating and mysterious to newcomers, and seasoned users frequently repeat terms that sound like jargon to those unfamiliar. Terms like <code>instance</code> and <code>federates</code> seem to confuse people, and there's a degree of choice paralysis among new users regarding which &quot;instance&quot; to choose when creating an account.</p>
<h2 id="summary" tabindex="-1">Summary</h2>
<p>In this article, I hope to demystify some of the Fediverse's concepts, as a curious observer myself with limited experience. We will talk about what the Fediverse is, how it works on a high level, some frequently asked questions, and how to get started if you are interested. I will not be discussing the ActivityPub protocol or any implementations, as this currently falls outside of my scope of knowledge.</p>
<h2 id="what-is-the-fediverse%3F" tabindex="-1">What is the Fediverse?</h2>
<p>&quot;Fediverse&quot; describes the world of federated social networks and platforms. Think platforms like Facebook (social media), Twitter (micro-blogging), YouTube (video sharing), Medium (blogs and articles), WhatsApp (instant messaging), etcetera.</p>
<p>A federated platform is one where an <strong>instance</strong> of said platform can interact and interconnect with other instances of the same platform, or instances of another federated platform.</p>
<p>While each instance of the platform could function independently as a full-fledged social platform, multiple instances communicating with each other act as if they were a single platform, with content flowing between them. They do this through a process called <strong>Federation</strong>.</p>
<p>If this does not make sense yet, let's try an example!</p>
<h3 id="can-you-give-an-example%3F" tabindex="-1">Can you give an example?</h3>
<p>Suppose you are using a federated platform, and let us take Mastodon as an example, which is a federated micro-blogging platform (imitating Twitter).</p>
<p>Suppose I sign up on one instance of Mastodon and create an account (yes, there are many instances of Mastodon!). Let us assume this Mastodon instance is instance-A.com. I can use Mastodon as expected, posting to my profile and interacting with other users on instance-A.com.</p>
<p>Suppose you want to use Mastodon too, but happened to sign up on instance-B.com, a different instance from the one I signed up on. While you can similarly interact with other users on instance-B, seeing their posts and them seeing yours, you can actually even interact with me on instance-A! You can search for my profile and find it, follow me, comment on my posts and share them, and the fact that our instances are hosted separately is no obstacle.</p>
<h4 id="self-hosted-private-instances-federate-too!" tabindex="-1">Self-hosted Private Instances Federate too!</h4>
<p>In fact, I can even choose to <strong>self-host</strong> my own instance, where I am the only user. I can just install Mastodon on my server, run it, point it at the domain cosmic-mastodon.com, and suddenly I can interact with you on instance-B!</p>
<h4 id="federating-with-other-platforms" tabindex="-1">Federating with Other Platforms</h4>
<p>Suppose I am not a huge fan of Twitter-like micro-blogging. I can decide to instead use another platform, as long as it uses the same protocol (ActivityPub is the most common, and used by Mastodon). Say, for example, I want video content and decide to use PeerTube, which is modeled after YouTube. I can post my videos onto a PeerTube instance, and you can subscribe to my account from your Mastodon account on instance-B.com without any issue! Or say you want a more threaded interface, similar to Reddit. You can use KBin or Lemmy, and again follow my account from there, or even follow an entire Lemmy community (similar to a subreddit).</p>
<h2 id="federation%3A-how-do-instances-talk-with-each-other%3F" tabindex="-1">Federation: How do Instances Talk with Each Other?</h2>
<p>The process of different instances of federated platforms communicating with each other to allow users from each instance to interact with each other and exchange each others' content is what is called <strong>Federation</strong>.</p>
<p>This is possible despite the fact that different instances are hosted separately and enjoy an amount of independence and control over themselves. Each instance only directly controls itself.</p>
<p>At the same time, federated platforms usually agree on a <strong>protocol</strong>, or a set of standards for how to communicate with each other (sort of like a language). This allows different instances to be inter-connected despite the separation boundary. They can send content to and from each other, and because they speak the same language (using same protocol), they can integrate this content as if it originated from inside their instance, and give the user the experience of a single inter-connected platform.</p>
<h2 id="what-is-a-federation-protocol-(e.g.-activitypub)%3F" tabindex="-1">What is a Federation Protocol (e.g. ActivityPub)?</h2>
<p>A federation protocol is a set of standards for how an instance of a federated platform can communicate with others to form an inter-connected network. You can imagine it like a &quot;language&quot; that different instances speak, so that they can know what to expect when attempting to communicate with another instance.</p>
<p>As an analogy, suppose you and I want to agree to a <strong>protocol</strong> for our long distance communications. We might agree on the following protocol:</p>
<ul>
<li>For every communication, I will send you a single page via mail.</li>
<li>On top of that page, there will be a big text at the top. That's the title.</li>
<li>On the page, there will be one line that says &quot;sender: [name]&quot;, where [name] is someone's name. That's the name of the person sending you the letter (most likely me!)</li>
<li>There should be a paragraph that starts with &quot;summary:&quot;. That's the letter's summary.</li>
</ul>
<p>... and so on. Communication protocols are more or less similar, but involve more computing and network concepts. For example, if one federated instance sends a post to another, how would it send the replies or comments to the post? Should they be expected in the initial communication? What happens if the communication is too large and the network times out? What happens if the user viewing the content is blocked? And so on.</p>
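<p>In practice, ActivityPub messages are JSON objects following the ActivityStreams vocabulary. As a simplified sketch (the actor URL and content are made up for illustration, and real messages carry more fields), a new post travels between instances as something like:</p>

```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Create",
  "actor": "https://instance-a.com/users/nezar",
  "to": ["https://www.w3.org/ns/activitystreams#Public"],
  "object": {
    "type": "Note",
    "content": "Hello from instance A!"
  }
}
```

<p>The receiving instance knows, by the protocol, that a <code>Create</code> activity carrying a <code>Note</code> object means &quot;a user published a post&quot;, and can integrate it as if it were local.</p>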
<h2 id="what-is-an-%22instance%22-of-a-federated-platform%3F" tabindex="-1">What is an &quot;Instance&quot; of a Federated Platform?</h2>
<p>An &quot;instance&quot; of a federated platform is exactly what it sounds like. I can go and install a federated platform like Lemmy on my server, and have an entire instance of Lemmy with all its features. The only difference between my instance and others is the content that's on it, and which other instances federate with it or block it (more on this below).</p>
<h3 id="instances-are-self-contained" tabindex="-1">Instances Are Self-Contained</h3>
<p>An instance can function entirely on its own. In fact, many platforms allow turning off federation, and some instances do so. This way, the instance is isolated and does not connect with other instances, yet it remains a fully functional platform.</p>
<h3 id="instances-control-themselves-independently" tabindex="-1">Instances Control Themselves Independently</h3>
<p>Even with federation, instances have a great amount of control over themselves. An instance stores and controls its own data, can control its communication with other instances, and can control its own underlying resources. This theoretically allows for:</p>
<ul>
<li>Controlling who gets to sign up and create an account on an instance</li>
<li>Banning or deleting users from their own instance</li>
<li>Banning users on other instances, but only from their own instance</li>
<li>Banning (or de-federating) entire instances from their own instance</li>
<li>Deleting content from their own instance</li>
<li>Scaling up or down hosting resources for better performance or cost savings</li>
</ul>
<h3 id="instances-can't-exert-the-same-control-on-other-instances" tabindex="-1">Instances Can't Exert the Same Control on Other Instances</h3>
<p>While an instance can influence other instances indirectly, like de-federating (disallowing communication with another instance), banning users on other instances from their own instance, and so on, control over other instances is very limited:</p>
<ul>
<li>Can't delete posts on other instances from users on other instances</li>
<li>Can't control communication between two other instances</li>
<li>Can't ban or delete users from their home instances; it can only block them from its own instance</li>
</ul>
<h2 id="what-does-it-mean-for-a-an-instance-to-%22federate%22-or-%22de-federate%22-with-another-instance%3F" tabindex="-1">What does it mean for an instance to &quot;federate&quot; or &quot;de-federate&quot; with another instance?</h2>
<p>When an instance &quot;federates&quot; with another, it means it opens up communication with that other instance, allowing content from the other instance to be visible and interact-able from the first instance.</p>
<p>Consequently, when an instance de-federates from another, they cease this communication, making posts from the other instance invisible and uninteract-able from the first.</p>
<h3 id="you-may-still-see-posts-from-a-de-federated-instance" tabindex="-1">You May Still See Posts from a De-federated Instance</h3>
<p>Note that if there are three instances, A, B, and C, where both A and B federate with C but A and B do not federate with each other, it <em>may</em> still be possible for content from B to be seen on A and vice versa. For example, a user from instance A may comment on a post from instance C. That comment lives on instance C, and can thus be visible to instance B as well.</p>
<h3 id="instance-typically-federate-with-all-by-default-(black-list)" tabindex="-1">Instances Typically Federate with All by Default (Blacklist)</h3>
<p>Most instances federate with all other instances by default, and blacklist certain instances when necessary. Some follow whitelist federation instead, meaning an instance must be explicitly allowed before federation occurs.</p>
<h2 id="which-instance-should-i-sign-up-on%3F-does-it-make-a-difference%3F" tabindex="-1">Which instance should I sign up on? Does it make a difference?</h2>
<p>Choosing a first instance is what seems to stop most people from starting. While there are some differences across instances, most of the time it will not change much, and it is hard to choose a <em>wrong</em> instance. Any instance will theoretically give you access to the entire platform, assuming it is not a widely disliked instance that most other instances block.</p>
<p>So what difference does it make? There are a couple of things:</p>
<h3 id="account-sign-up-requirements-vary-across-instances" tabindex="-1">Account Sign-up Requirements Vary Across Instances</h3>
<p>Some instances have open sign-ups; others have them completely closed. Some instances require you to fill out a form; others require you to verify via email. Larger instances typically have more restrictions, since the task of moderating them becomes more difficult, and restrictions help curb spam.</p>
<h3 id="some-instances-have-larger-federation-networks-than-others" tabindex="-1">Some instances have larger federation networks than others</h3>
<p>Although in theory, an instance should have access to all other instances on the network via federation, in practice there are a couple of things preventing this:</p>
<ul>
<li>Your instance has blocked / de-federated with a number of instances</li>
<li>Your instance is blocked / was de-federated from by a number of other instances</li>
<li>Your instance implements whitelist federation, meaning only explicitly allowed instances are federated with</li>
</ul>
<p>This is usually not a big concern, unless you join a notorious instance that everyone blocks, or you for some reason have an interest in content on those notorious instances.</p>
<p>In general, if there is someone you really want to follow, then you want to make sure your instance federates with theirs.</p>
<h3 id="rules-and-moderation" tabindex="-1">Rules and Moderation</h3>
<p>Different instances will have different rules and moderation styles. Some are really strict, others are fairly lax.</p>
<p>However, instances that are too lax might get blocked by other instances, if they create moderation problems for them.</p>
<h3 id="performance-and-scaling" tabindex="-1">Performance and Scaling</h3>
<p>Hosting an instance costs money, and as the number of users grows, scaling issues start to occur. This can sometimes be remedied by increasing resources or by optimizations in the platform itself. Smaller instances tend to fare much better in terms of scaling.</p>
<h3 id="verdict" tabindex="-1">Verdict</h3>
<p>We should not overcomplicate this matter, so I think choosing an instance can be done in the following steps:</p>
<ul>
<li>Search the web for a list comparison of instances of the platform you'd like. There is <a href="https://instances.social/list">instances.social for Mastodon</a> and <a href="https://github.com/maltfield/awesome-lemmy-instances">awesome-lemmy-instances for Lemmy</a> for example.</li>
<li>Choose a random instance! Some of these lists surface random ones at the top anyway</li>
<li>Check that it matches your criteria, as it most likely will. Do the rules match your preferences? Is the sign-up process easy enough? Is the federation network sufficient for you? Is the uptime high enough?</li>
<li>Don't aim for popular instances with a high number of users. They are less likely to have open registration, precisely to curb their scaling issues.</li>
</ul>
<h2 id="what-are-the-benefits-of-federated-platforms%3F" tabindex="-1">What are the Benefits of Federated Platforms?</h2>
<p><em>I plan to add more details to this answer in the future</em></p>
<p>Why even consider federated platforms? There are multiple reasons!</p>
<ul>
<li>Accessing or Interacting with Content from Any Platform or Client</li>
<li>Removing reliance on single instances of a platform</li>
<li>Resisting takedowns by avoiding single point of failure</li>
<li>Decentralizing control of social platforms</li>
<li>Ability to self-host, allowing users greater degree of control</li>
</ul>
<h2 id="alternatives-to-federated-platforms%3F" tabindex="-1">Alternatives to Federated Platforms?</h2>
<p><em>TO BE ANSWERED LATER</em></p>
<h2 id="conculsion" tabindex="-1">Conclusion</h2>
<p>I hope this article helped answer some of your questions about the Fediverse. I will continue adding to this as I learn more.</p>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>Securing SSH With Passwordless Public Key Authentication</title>
      <link href="https://cosmicbyt.es/posts/ssh-pub-key-auth/"/>
      <updated>2023-06-22T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/ssh-pub-key-auth/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>When you launch a shell session to a remote computer or server via SSH, the default behavior is to prompt you for a password. This is called &quot;Password Authentication&quot;. Public key cryptography gives us an alternate authentication method with added security, convenience, and scalability benefits. This method is called &quot;Public Key Authentication&quot;, or &quot;SSH Key Authentication&quot;, and it is the topic of this article.</p>
<h2 id="summary" tabindex="-1">Summary</h2>
<p>We will talk about the motivation behind public key authentication. Why should you use it instead of password authentication? We will then explain what it is and how it works, then go through the setup process (which is fairly easy). We will then end with a discussion on best practices and security considerations.</p>
<p>Feel free to skip around if you only want the how-to.</p>
<h2 id="why-use-public-key-authentication%3F" tabindex="-1">Why Use Public Key Authentication?</h2>
<h3 id="passwords-suck" tabindex="-1">Passwords Suck</h3>
<p>The biggest and most important reason to use SSH keys is that <strong>passwords suck</strong>. Seriously, passwords have a ton of problems and are also annoying to use. You have to make them strong and complex, you have to use different ones for each account and use case, but at the same time you also have to remember them all. It's a bit contradictory, isn't it?</p>
<p>Depending on passwords for authenticating SSH means your security is only as good as your password's strength, uniqueness, and your users' handling of the password.</p>
<p>SSH cryptographic keys are random and longer than any password you will memorize, except you don't have to memorize them!</p>
<h3 id="having-to-type-a-strong-password-every-time-you-authenticate" tabindex="-1">Having to Type a Strong Password Every Time You Authenticate</h3>
<p>Having to type a password every time you authenticate is tedious, especially when the password is long and includes different types of characters for added strength. While this is more about convenience, many users resort to unsafe workarounds, such as keeping the password in an unencrypted note for easy copy-pasting.</p>
<p>While there are work-arounds that make it easier, such as making a password not required for a certain duration of time after it is entered, that does not eliminate the issue entirely, and in some cases may compromise security.</p>
<h3 id="passwords-prone-to-human-error" tabindex="-1">Passwords Prone to Human Error</h3>
<p>We already discussed some of these above. Humans make a lot of mistakes in favor of convenience when handling passwords: writing them down on paper where others can see, writing them in notes on shared computers, sending them via unencrypted messages, making them weak, relating them to personal information (like using your dog's name), and so on.</p>
<h3 id="passwords-struggle-at-scale" tabindex="-1">Passwords Struggle at Scale</h3>
<p>Consider the issues that arise at scale, as in enterprise settings. For example, you have to share the password, and you have to do so via a secure method, which most people typically do not.</p>
<p>Even methods deemed secure have issues. It only takes one person messing up to compromise the security of the system, and suddenly everyone has to use a new password.</p>
<p>With SSH keys, every person, user, or connecting device can have their own key. Any of these keys can be invalidated without affecting the others.</p>
<h3 id="ssh-keys-allow-secure-automation" tabindex="-1">SSH Keys Allow Secure Automation</h3>
<p>SSH keys allow for authentication without manual steps. As long as the client already has the right private key in an expected location, running the <code>ssh</code> command can authenticate immediately. This allows for secure automated tasks (including via shell scripts) involving SSH connections, which are common in enterprise settings.</p>
<h3 id="passwords-are-sent-over-to-the-server" tabindex="-1">Passwords are Sent Over to the Server</h3>
<p>This is not a significant concern, as OpenSSH employs state-of-the-art cryptography to secure and authenticate the session, even when using passwords. While it is true that, with password authentication, the client must send the password to the server, the password is strongly encrypted in transit, and it would be highly unlikely for someone to retrieve it by intercepting the connection.</p>
<p>Still, it is worth mentioning that with public key authentication, the key used for authentication technically never has to leave your computer, encrypted or not.</p>
<p>How does it work then, if it does not have to be sent over? How does the server know I am who I am? Let's find out in the next section!</p>
<h2 id="how-it-works" tabindex="-1">How it Works</h2>
<p>SSH Public Key Authentication, as evident from its name, uses Public Key Cryptography or Asymmetric Cryptography. This is contrasted with Private Key Cryptography, or Symmetric Cryptography.</p>
<h3 id="symmetric-cryptography" tabindex="-1">Symmetric Cryptography</h3>
<p>Private key cryptography involves the creation of a single (cryptographic) <strong>key</strong> called the private key. This key is usually stored in a file. If you look at it, you'll find it is a string of random characters. You can think of it as one really strong and long password that you don't have to memorize.</p>
<p>This key is used to encrypt data (whether it is messages, files, passwords, etc), but it is also the same key that can decrypt it. This is why it is called <strong>symmetric encryption</strong>. The same key is used for both encryption and decryption.</p>
<h3 id="public-key-cryptography" tabindex="-1">Public Key Cryptography</h3>
<p>Public key cryptography, on the other hand, usually involves a pair of two keys: one is called the private key, the other the public key. The private key must be kept secret at all costs; keeping it secret is vital to the mechanism's security. Exposing the public key, by contrast, is not nearly as dangerous, and sometimes poses insignificant risk depending on the situation.</p>
<p>In public key cryptography, if a message is encrypted via one of the key pair, it can only be decrypted by the other key, and vice versa. For example, if we encrypt a message using the private key, we cannot decrypt that message even with the private key. But we can decrypt it using the public key, and only with the public key.</p>
<h4 id="encryption-can-prove-identity" tabindex="-1">Encryption can Prove Identity</h4>
<p>While encryption can be used to conceal and secure the content of a message, it can also be used to prove identity. This is sometimes called a <strong>Cryptographic Signature</strong> or <strong>Digital Signature</strong>. How does it work exactly?</p>
<p>Suppose I create a cryptographic public-private key pair (we will see how to do this later below). I give you the public key, while I keep the private key. Then, suppose I send you a message I encrypted using the private key. You decrypt it successfully, and are able to read it. Yay! But wait. What about proving identity?</p>
<p>Well, there is something else that happened here. The fact that you were even <strong>able to</strong> decrypt the message proves something: this message was encrypted using the private key. Not any private key, but the one that corresponds to the public key I gave you! This proves to you that this message came from me, and only me (or, possibly, someone who stole my private key).</p>
<p>Alternatively, suppose someone else sends you a message pretending to be me. If they do not have my private key, then when you attempt to decrypt the message with the public key I gave you, it will fail to produce a readable message. This proves to you that whoever sent the message is most likely not me.</p>
<p>This is crucial, because SSH uses this characteristic of asymmetric cryptography for authentication. It is a way for a client to prove that they are who they claim to be: a client that is authorized to SSH into the system.</p>
<h3 id="how-ssh-public-key-authentication-works" tabindex="-1">How SSH Public Key Authentication Works</h3>
<p>When using public key cryptography with SSH, first of all, we have to have our key pair as we discussed earlier.</p>
<p>The public key must be stored on the remote server that we would connect to or SSH into. This server may have several public keys belonging to more than one entity that may attempt to SSH into it.</p>
<p>The client holds the private key, which it will use to prove its identity to the server when it attempts to connect. If the client's private key matches a public key on the server, authentication should succeed. But how does that happen exactly?</p>
<p>The exact mechanism varies by protocol, which can be configured. But here is an example of how it commonly happens.</p>
<p>When the client initiates the SSH connection, the two machines first perform a handshake to derive a shared secret key. This key does not grant the client access yet, but is used to secure the subsequent communication.</p>
<p>Assuming the right configurations are set (discussed further down), the client creates a message based on the session ID, and signs it with the private key it has. It then sends both the signed message (but without session ID) and the corresponding <strong>public key</strong> to the server (alongside other data like username, protocol, etc).</p>
<p>On the server side, the supplied public key is checked to verify if it is in the authorized list of public keys for the supplied username. Then, the server prepends the session ID back to the signed message, and attempts to verify the signature. If that is successful, the server grants the client access, and a shell session is launched!</p>
<h2 id="using-ssh-public-key-authentication" tabindex="-1">Using SSH Public Key Authentication</h2>
<h3 id="generating-ssh-key-pair" tabindex="-1">Generating SSH Key Pair</h3>
<p>On the client (the device that will initiate the SSH connection to the server), use <code>ssh-keygen</code> to generate an SSH key pair. You can use the <code>-t</code> option to specify the type of key generated. The default depends on your OpenSSH version (historically <code>rsa</code>, with recent releases defaulting to <code>ed25519</code>); <code>ed25519</code> is a recommended choice.</p>
<pre><code>nezar@cosmicbytes ~ $ ssh-keygen -t ed25519
</code></pre>
<p>You'll get the following output, asking you to choose the file where the key will be saved. If you hit enter, it will save it in the default path it specifies.</p>
<pre><code>Generating public/private ed25519 key pair.
Enter file in which to save the key (/home/nezar/.ssh/id_ed25519): 
</code></pre>
<p>It will then ask you to enter an optional passphrase. The passphrase provides added security by encrypting the key, making it accessible only by entering the passphrase, but it does require you to type it whenever you initiate an SSH connection. If you wish to keep things passwordless, leave it empty and just hit Enter.</p>
<pre><code>Enter passphrase (empty for no passphrase):
Your identification has been saved in /home/nezar/.ssh/id_ed25519
Your public key has been saved in /home/nezar/.ssh/id_ed25519.pub
</code></pre>
<p>Now we have our key pair generated!</p>
<h3 id="adding-the-public-key-to-the-server's-authorized-list" tabindex="-1">Adding the Public Key to the Server's Authorized List</h3>
<p>Next, we need to tell the server that if it ever gets an SSH client attempting to authenticate with the private key we just generated, it should accept it. To do this, we add the public key we generated on the client to the server's authorized list.</p>
<h4 id="using-ssh-copy-id" tabindex="-1">Using <code>ssh-copy-id</code></h4>
<p>If you can already SSH to the server from the client (presumably via password), this can be automated using the <code>ssh-copy-id</code> utility. You use it just like you would use the <code>ssh</code> command, but instead of launching a shell session, it sends the public key to the server and adds it to the authorized list.</p>
<pre><code>nezar@cosmicbytes ~ $ ssh-copy-id [user]@[server_address]
</code></pre>
<p>Replace <code>[server_address]</code> with the address of the server, and <code>[user]</code> with the remote user name on that server that you wish to authenticate SSH sessions as (remove the square brackets).</p>
<p>If the server address is 192.168.1.50 and the user is &quot;maple&quot;, then the command would be:</p>
<pre><code>nezar@cosmicbytes ~ $ ssh-copy-id maple@192.168.1.50
</code></pre>
<h4 id="manually" tabindex="-1">Manually</h4>
<p>If you are unable to connect to the server via SSH, you can transfer the key manually. To do this, transfer the <strong>public key file ending with <code>.pub</code></strong> to the server using your preferred secure transfer method. The file should be found under <code>/home/[your_user]/.ssh/</code> (where <code>[your_user]</code> is your user name on the client). Make sure not to transfer or expose the private key (the file that does not end with <code>.pub</code>), as exposing it could compromise your security.</p>
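<p>If nothing else, you can simply display the public key and copy its contents over whatever secure channel you have available; it is a single line of text. For example, assuming the <code>ed25519</code> key path from earlier:</p>
<pre><code># print the PUBLIC key so it can be copied and pasted
nezar@cosmicbytes ~ $ cat ~/.ssh/id_ed25519.pub
</code></pre>
<p>Whatever method you use, only the contents of the <code>.pub</code> file should ever leave the client.</p>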
<p>After transferring the file to the server using your preferred method, launch a terminal or shell session on the server using the user you want the client to gain SSH authorization on.</p>
<p>Check if the home directory of the user has the <code>.ssh</code> directory. If not, make sure to create it with the proper permissions.</p>
<pre><code># make sure you're in the home directory
maple@server ~ $ cd ~
# check whether the .ssh directory exists
maple@server ~ $ test -d &quot;.ssh&quot; &amp;&amp; echo &quot;.ssh dir exists!&quot; || echo &quot;.ssh dir doesn't exist&quot;
# create the directory if it does not exist (-p makes this safe to re-run)
maple@server ~ $ mkdir -p .ssh
# set the right permissions
maple@server ~ $ chmod 700 .ssh
</code></pre>
<p>Add the content of the public key to the <code>authorized_keys</code> file under the <code>.ssh</code> directory. Make sure to append so as to avoid unintentionally erasing previous entries in there.</p>
<pre><code>maple@server ~ $ cat path/to/key.pub &gt;&gt; ~/.ssh/authorized_keys
# make sure the right permissions are set on the file
maple@server ~ $ chmod 600 ~/.ssh/authorized_keys
</code></pre>
<p>And now we have SSH public key authentication working! If you face any issues, restart the SSH daemon to make sure the changes took effect, or restart the machines altogether. Then attempt to SSH again and observe the new authentication method in action.</p>
<h3 id="disabling-password-authentication" tabindex="-1">Disabling Password Authentication</h3>
<p>You can disable password authentication entirely on the server for enhanced security. This way, only SSH keys can be used. To do this, we need to set the <code>PasswordAuthentication</code> option in <code>sshd_config</code> to <code>no</code>.</p>
<p>Open the file <code>/etc/ssh/sshd_config</code> in your preferred text editor and add the following line:</p>
<pre><code>PasswordAuthentication no
</code></pre>
<p>You can do this on a per-user basis using the <code>Match User</code> statement:</p>
<pre><code>Match User maple
    PasswordAuthentication no
</code></pre>
<p>Make sure you place this at the end of the file, so that options that follow are not also restricted to the user <code>maple</code> (a <code>Match</code> block applies until the next <code>Match</code> or the end of the file).</p>
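<p>Configuration changes only take effect after the daemon reloads its configuration. A sketch of validating and applying the change, assuming a systemd-based Linux server (on Debian/Ubuntu the service may be named <code>ssh</code> rather than <code>sshd</code>):</p>
<pre><code># check the configuration for syntax errors before applying it
maple@server ~ $ sudo sshd -t
# restart the SSH daemon to apply the change
maple@server ~ $ sudo systemctl restart sshd
</code></pre>
<p>Keep your current session open while you test the new configuration from another terminal; if something is wrong, you can still revert the change from the session that is already open.</p>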
<h2 id="future-considerations-and-best-practices" tabindex="-1">Future Considerations and Best Practices</h2>
<p>We have successfully added Public Key Authentication to our SSH setup. While this comes with both security and convenience benefits, it is not free of pitfalls and cons.</p>
<p>For example, a common pitfall is improper key management. This can get especially difficult as usage scales, as in an enterprise setting. When managing a large number of keys, it is important to keep track of which ones are in use and to remove unused ones.</p>
<p>It is also important to ensure that each key is granted a limited set of privileges based on what is needed. Privileges should be routinely revised to ensure a key's user is not granted more or fewer permissions than needed.</p>
<p>Key rotation is an important part of key management, and forgetting it is another common pitfall. Just like passwords, keys should be rotated too! This ensures that unused keys or secretly compromised keys are expired and their threats are eliminated.</p>
<p>Use passphrases wisely. Passphrases have a different security model than typical passwords: they are only used to unlock a key that lives on a single device. They can delay an attacker who obtains the private key file, forcing them to brute-force the passphrase before they can use the key. Rotating passphrases is also advised.</p>
<p>It may also be worth considering SSH certificates instead of keys. They can offer benefits for enterprise scaling, but they come with their own pros and cons. We will discuss them in a future article.</p>
<p>Thanks for reading!</p>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>Messaging in Distributed Systems (Pub-Sub, Message Queues, etc)</title>
      <link href="https://cosmicbyt.es/posts/messaging-distributed-systems/"/>
      <updated>2023-11-15T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/messaging-distributed-systems/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>Messaging is a powerful design pattern in distributed systems, allowing us to decouple system components and enforce greater resiliency and reliability in communication between system components. Typically, one of many dedicated messaging technologies (or message-oriented middleware) is used, with common examples being Kafka, RabbitMQ, and managed services like AWS SQS or GCP's Pub-Sub. But messaging is not suitable for all systems and use cases. While it can greatly simplify certain types of distributed systems and solve many problems, using them in the wrong place can over-complicate the system, make it more expensive, and introduce performance bottlenecks. Before using a messaging technology, it is key to understand the tradeoffs it offers.</p>
<p>In this article, we will discuss messaging at a high level, gaining an understanding of what it really is, its capabilities and limitations. While messaging is a broad term, encompassing more specific patterns such as message queuing, publish-subscribe messaging, and data streaming, they share many aspects and key concepts, and we will focus on messaging as a whole. I will discuss those specific patterns and specific technologies in-depth in future articles.</p>
<h2 id="what-is-a-messaging-service" tabindex="-1">What is a Messaging Service</h2>
<p>In distributed systems, a messaging service is a middleware component that handles receiving and sending messages between other components of the system. It is essentially a communication middle-man, abstracting away communication between components and simplifying it. It takes messages from one or more system components, and relays them to one or more other components.</p>
<p><code>| Message Producer (A)  |  ----&gt;  | Message Broker |  ----&gt;  | Message Consumer (B) |</code></p>
<h2 id="why-use-messaging%3F" tabindex="-1">Why Use Messaging?</h2>
<p>At first glance, messaging might seem like an unnecessary addition to the system; we are just inserting a middle-man where we could establish communication directly. Why not have the message producer send the message directly to the consumer? This question is valid, and in many cases, that is what you should do if the communication between two components is simple. But a messaging service will typically do a lot more than merely relay messages, such as persisting messages, retrying failures, and solving complicated routing patterns. Let's take a deeper look at what it is capable of.</p>
<h3 id="example" tabindex="-1">Example</h3>
<p>Before we dive deeper, let us consider an example where using a message broker is useful. Think of this example as you read the rest of the article, as it may help deepen your understanding.</p>
<p>Suppose we have a video sharing platform (similar to TikTok). After a video is uploaded, we want to run it through our community guidelines scanner, and we decided to do this asynchronously. So after the upload completes, we post a message to our message broker, and eventually this message gets fed into the scanner.</p>
<p>This is not the only way to architect this, but we take this approach as an example. We also assume that the community guideline scanner can be a long running or CPU-intensive process.</p>
<p><code> Video Uploader ----&gt; Message Queue ----&gt; Community Guidelines Scanner</code></p>
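<p>To make this concrete, here is a deliberately naive sketch of the queue idea in shell, where a spool directory stands in for the broker. This is an illustration of the concept only; a real system would use a dedicated broker (RabbitMQ, Kafka, SQS, etc.), and all names here are invented:</p>
<pre><code>#!/bin/sh
# Naive file-based message queue: one file per message in a spool directory.
QUEUE_DIR=$(mktemp -d)
SEQ=0

# Producer (the "video uploader"): drop a message file and return immediately.
produce() {
    SEQ=$((SEQ+1))
    echo "$1" > "$QUEUE_DIR/$(printf '%04d' "$SEQ")-msg"
}

# Consumer (the "guidelines scanner"): drain messages in arrival order.
consume() {
    for msg in "$QUEUE_DIR"/*-msg; do
        if [ -e "$msg" ]; then
            echo "scanning: $(cat "$msg")"
            rm "$msg"    # acknowledge by deleting once processed
        fi
    done
}

produce "video-123"
produce "video-456"
consume    # scans video-123, then video-456, in arrival order
</code></pre>
<p>The producer returns as soon as the file is written, and messages sit in the directory until the consumer gets to them, which already hints at the asynchrony and persistence benefits discussed below.</p>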
<p>Now let us dive deeper into what a message broker can possibly do for us.</p>
<h3 id="decoupling-system-components" tabindex="-1">Decoupling System Components</h3>
<p>By using a messaging middleware, services A and B are now decoupled, and the contract between them is simplified. Service A only has to concern itself with sending a message to the middleware, while service B only has to concern itself with receiving one. Beyond that, neither service is concerned with the details or implementation of the other; they only have to agree on the message format or protocol. In our example, the video uploader is no longer concerned with the community guidelines scanner. It only needs to focus on uploading videos and sending a message to the middleware.</p>
<p>&quot;What do you mean they don't have to concern themselves with other details or implementation? I don't use messaging middleware and I have never had to deal with that!&quot;. Bear with me, I discuss that in the remaining points.</p>
<h3 id="asynchronous-workflows" tabindex="-1">Asynchronous Workflows</h3>
<p>If service A wants to call service B to continue execution, and the call can be asynchronous, a message middleware brings many benefits. Service A can just send a message to the middleware, and it is now free to process other requests, while the middleware handles telling B to process the remainder of the workflow asynchronously, or merely stores the message until B asks for it. Service A no longer has to concern itself with whether B has received the message, or when B is done processing it. This is delegated to the message middleware.</p>
<blockquote>
<p>Following our example, as we said earlier, the video uploader does not have to deal with the community guidelines scanner or even delivering the message to it successfully. We do not have to wait for the long running process. We are immediately free to handle more video uploads again.</p>
</blockquote>
<h3 id="handling-retries" tabindex="-1">Handling Retries</h3>
<p>When one service A has to communicate or call another service B, any failures in service B may lead to loss of data or workflow breakages. Without a message middleware, service A would have to be concerned with making sure the delivery of the message to service B was successful, and that service B was executed successfully. It is often preferable to free service A to process other requests, and offload this task to a message middleware to handle it. A message middleware is able to retry delivering a message until it is processed successfully. It is also capable of ensuring that retrying failures does not impact other traffic negatively.</p>
<blockquote>
<p>Following the example, we would want failures in the guidelines scanner to be retried, but we would not want the video uploader to handle that. A message broker takes over handling retries.</p>
</blockquote>
<h3 id="prioritization-and-ordering" tabindex="-1">Prioritization and Ordering</h3>
<p>Some message middlewares allow for prioritizing the consumption of messages based on a specific ordering algorithm. An ordering scheme may be as simple as processing messages in the order they were received (&quot;First In, First Out&quot;), or something more complex, like sorting by a priority assigned by the message producer.</p>
<blockquote>
<p>In our example, consider if we implemented prioritization. Maybe we want videos coming from users with a record of guideline violations to have higher priority in being scanned. Our message broker can ensure that higher priority videos will get scanned first.</p>
</blockquote>
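<p>As a toy illustration of prioritization (not how production brokers implement it), a priority can be encoded into a sortable key so the consumer always drains the most urgent message first. Here filenames in a spool directory carry the priority digit, and all names are invented:</p>
<pre><code>#!/bin/sh
# Toy priority queue: the filename starts with a priority digit, and the
# shell glob expands in sorted order, so "0-..." drains before "5-...".
PQ_DIR=$(mktemp -d)
PQ_SEQ=0

enqueue() {    # usage: enqueue PRIORITY MESSAGE   (0 = most urgent)
    PQ_SEQ=$((PQ_SEQ+1))
    echo "$2" > "$PQ_DIR/$1-$(printf '%04d' "$PQ_SEQ")"
}

drain() {
    for f in "$PQ_DIR"/*; do
        if [ -e "$f" ]; then
            cat "$f"
            rm "$f"
        fi
    done
}

enqueue 5 "video from a user in good standing"
enqueue 0 "video from a repeat violator"
drain    # the priority-0 video comes out first despite arriving later
</code></pre>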
<h3 id="routing" tabindex="-1">Routing</h3>
<p>A service might seek to communicate with many receivers, sometimes an undefined or changing number of them. This can complicate the service, and it can be simpler to send a message to a message broker and delegate the communication to it. The broker then handles delivering the message to the right receivers. Receivers can even pull messages themselves, which saves the sender from having to keep track of them at all. Messaging can thus greatly simplify communication routing. An almost identical use case is a receiver wanting to receive messages from an undefined or changing number of senders.</p>
<p>Moreover, when operating at scale, there may be a changing number of instances of a given service as it scales up or down. While there are other solutions to this, a message broker can simplify this task as we just explained.</p>
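<p>As a minimal sketch of fan-out routing, a broker can copy each message into one inbox per subscriber, so the producer never needs to know who, or how many, the receivers are. The directory and service names below are invented:</p>
<pre><code>#!/bin/sh
# Pub-sub fan-out: the broker keeps one inbox directory per subscriber
# and copies every published message into each of them.
BROKER=$(mktemp -d)
mkdir "$BROKER/scanner-inbox" "$BROKER/indexer-inbox"
PUB_SEQ=0

publish() {
    PUB_SEQ=$((PUB_SEQ+1))
    for inbox in "$BROKER"/*-inbox; do
        echo "$1" > "$inbox/$(printf '%04d' "$PUB_SEQ")"
    done
}

publish "video-123 uploaded"
# each subscriber now has its own copy to consume at its own pace
ls "$BROKER/scanner-inbox"
</code></pre>
<p>Adding a new subscriber is just adding another inbox; the producer's code does not change, which is exactly the decoupling messaging is meant to provide.</p>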
<h3 id="deduplication" tabindex="-1">Deduplication</h3>
<p>Message middlewares can often handle deduplicating messages. If the same message was sent multiple times, but it is necessary to process a message only once, deduplication can be enacted. Moreover, a message middleware can ensure that a received message is relayed to only one consumer, and only once (or only until successfully processed once). This way, we prevent duplicate processing of a message.</p>
<blockquote>
<p>Following our example, we would not want to waste compute resources scanning the same video twice, since scanning may be expensive.</p>
</blockquote>
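<p>A minimal sketch of consumer-side deduplication, assuming each message carries a unique ID. A real middleware would track this state durably rather than in memory:</p>

```python
# IDs of messages we have already processed successfully
processed_ids = set()

def handle_once(message_id, process):
    """Process a message only if its ID has not been seen before."""
    if message_id in processed_ids:
        return False  # duplicate: dropped, not processed again
    process(message_id)
    processed_ids.add(message_id)  # marked only after success
    return True
```
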
<h3 id="handling-traffic-spikes" tabindex="-1">Handling Traffic Spikes</h3>
<p>One use case where message middlewares can be quite helpful is when handling traffic spikes, especially if they're unpredictable and we are working with long-running or intensive processes or asynchronous workflows. By using a message middleware, we can queue and persist the messages until they are processed, without worrying about not having enough instances up to handle our traffic.</p>
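<p>The buffering idea can be sketched with an in-memory queue; a real middleware would persist the backlog to disk, but the principle is the same:</p>

```python
from queue import Queue

# A burst of messages arrives faster than consumers can handle
buffer = Queue()
for i in range(100):  # sudden spike of 100 messages
    buffer.put(f"message-{i}")

# The consumer drains the backlog at its own pace; nothing is dropped
consumed = []
while not buffer.empty():
    consumed.append(buffer.get())
```
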
<h3 id="scaling-independently" tabindex="-1">Scaling Independently</h3>
<p>By decoupling system components, not only have we separated concerns, but we also allow the services to scale independently of each other. One service might be long-running and intensive, with larger scaling requirements than the rest.</p>
<h3 id="persisting-messages-for-offline-consumers" tabindex="-1">Persisting Messages for Offline Consumers</h3>
<p>Imagine if, in our example, the guidelines scanner went down. We would still want to scan all these videos when it comes back up. A message broker can persist these messages long term, until they are consumed. This is similar to how if you were offline for a while, you would still receive messages on your phone that were sent while you were offline, once you come back online.</p>
<h2 id="limitations-of-messaging-middleware%3A-when-not-to-use-them%3F" tabindex="-1">Limitations of Messaging Middleware: When NOT To Use Them?</h2>
<p>While messaging can be a worthy addition to your system, it is not without its setbacks and limitations. In general, it is rare for an addition to a distributed system to not come with its hurdles. Let us discuss them:</p>
<h3 id="limitations-of-most-additions-to-a-distributed-system" tabindex="-1">Limitations of Most Additions to a Distributed System</h3>
<p>The following applies to many distributed system components. I will not discuss them in detail, as they deserve their own discussion:</p>
<ul>
<li>Over-complicating the system</li>
<li>Increasing cost</li>
<li>Incurring latency costs: A message has to travel through greater distances and more components</li>
<li>Performance costs</li>
<li>Introducing extra points of failure that have to be maintained</li>
<li>Introducing more paths of application flow that must be kept track of</li>
<li>More components to configure, and unless it is a managed service, we also have to manage and maintain its infrastructure</li>
</ul>
<h3 id="synchronous-workflows" tabindex="-1">Synchronous Workflows</h3>
<p>A message middleware is unsuitable for synchronous workflows. It is often the case that a service A invokes a service B synchronously. This could be because service A uses the response from B to continue its processing, or it could be because service A handles the response to a client as part of a request-response cycle.</p>
<p>Message middlewares are great for asynchronous workflows, and oftentimes it might be beneficial to convert your synchronous workflow into an asynchronous one, either partially or completely. But this is not always possible, and one has to evaluate the benefits of such a conversion.</p>
<h3 id="latency" tabindex="-1">Latency</h3>
<p>Although latency increases are common to many distributed system components, some message middlewares incur greater latency costs than others.</p>
<p>In general, a message middleware will often persist the message rather than merely route it. A message received by the middleware may take time to become available for consumption or processing. The middleware has to first persist the message, and a distributed message middleware may have to replicate the persisted message to multiple clusters.</p>
<p>This is why most message middlewares are used in asynchronous workflows.</p>
<p>However, it is important to point out that this does not apply to all message middlewares. Some are designed to handle stream processing at incredibly low latency, but still garner the benefits of the messaging pattern. I will discuss this in a future article about Apache Kafka.</p>
<h2 id="conclusion" tabindex="-1">Conclusion</h2>
<p>Let us revise our definition of messaging in distributed systems. Recall that a message middleware is a service that acts as a communication middle-man between two or more system components. Other system components can send messages to the messaging service, or receive (or pull) messages from it.</p>
<p>But this was not sufficient to truly grasp the power of messaging. A messaging service abstracts away the complications of communication in distributed systems, and allows system components to simplify communication by only sending or receiving messages. The message middleware can handle retrying failures, persisting messages, and simplify routing communication to multiple receivers or instances of a receiver. It can allow the system to scale better, be more reliable, and prevent error states.</p>
<p>Before using a message broker, remember that they are not always trivial to set up and maintain, and they may add performance bottlenecks to your system. Make sure to consider the tradeoffs before putting in the effort!</p>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>The Finite State Machine</title>
      <link href="https://cosmicbyt.es/posts/finite-state-machine/"/>
      <updated>2024-01-06T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/finite-state-machine/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>The Finite State Machine is a conceptual <a href="https://en.wikipedia.org/wiki/Model_of_computation">model of computation</a> known for its simplicity. Although it has significance in the study of <a href="https://en.wikipedia.org/wiki/Automata_theory">automata</a> and <a href="https://en.wikipedia.org/wiki/Theory_of_computation">computation theory</a>, it also has practical significance, with applications in lexical analysis, text parsing, user interfaces, event-driven systems, machine learning, and more. Finite state machines are found in everyday code, and recognizing them can often improve or simplify that code. You have probably written a finite state machine without even knowing it!</p>
<h2 id="what-is-it-a-finite-state-machine%3F" tabindex="-1">What is a Finite State Machine?</h2>
<p>A Finite State Machine is a model of computation: an <a href="https://en.wikipedia.org/wiki/Automaton">abstract machine</a> that performs computation and solves problems via algorithms. What makes it stand out from other models of computation is that it can only be in one state at a time, out of a finite (limited) number of pre-defined states. The machine takes an input, reading its elements one by one (think characters of a word or string), and changes its state based on each element. The machine has a <strong>transition function</strong>, which tells it how to change its state based on the input.</p>
<blockquote>
<p>We study abstract machines instead of real ones to simplify our studies. Abstract machines are close enough to model real-world cases, but setting aside real-world restrictions simplifies the model and helps us reason about it and arrive at conclusions more easily.</p>
</blockquote>
<p><strong>Why limit ourselves to such a basic model</strong> when we have such powerful computing machines? Well, it turns out that this simple model bears properties that we would not have with a more powerful model, like the Turing Machine or the RAM model found in everyday computers.</p>
<p>To understand this abstract talk better, let's take an example.</p>
<h2 id="example%3A-detecting-consecutive-errors" tabindex="-1">Example: Detecting Consecutive Errors</h2>
<p>We have an e-commerce site that allows customers to place orders. We have received complaints about the site's performance, and we suspect the search end-point is the offending one. We tolerate one-off errors occurring due to network conditions, but two or more errors in a row are not acceptable. We want to investigate the logs of the search end-point to see if it has had two or more errors in a row.</p>
<p>We have a list of all our application logs over the past month to scan for consecutive errors. We can write an algorithm using a typical programming language to do this. Below is a Python example:</p>
<h3 id="imperative-code-solution" tabindex="-1">Imperative Code Solution</h3>
<pre class="language-python"><code class="language-python"><span class="token keyword">def</span> <span class="token function">are_errors_consecutive</span><span class="token punctuation">(</span>logs<span class="token punctuation">)</span><span class="token punctuation">:</span><br>    consecutive_errors <span class="token operator">=</span> <span class="token number">0</span><br>    <span class="token keyword">for</span> log <span class="token keyword">in</span> logs<span class="token punctuation">:</span><br>        <span class="token keyword">if</span> is_error_log<span class="token punctuation">(</span>log<span class="token punctuation">)</span><span class="token punctuation">:</span><br>            consecutive_errors <span class="token operator">+=</span> <span class="token number">1</span> <span class="token comment"># increment by one</span><br>        <span class="token keyword">else</span><span class="token punctuation">:</span><br>            consecutive_errors <span class="token operator">=</span> <span class="token number">0</span> <span class="token comment"># reset back to 0</span><br>        <span class="token keyword">if</span> consecutive_errors <span class="token operator">>=</span> <span class="token number">2</span><span class="token punctuation">:</span><br>            <span class="token keyword">return</span> <span class="token boolean">True</span><br>    <span class="token keyword">return</span> <span class="token boolean">False</span></code></pre>
<p>We first initialize a variable <code>consecutive_errors</code>, then start a for-loop through the input. The variable holds our state throughout the loop. For each log, we check if it is an error, and update the state variable by incrementing it if an error was found, or resetting it to 0 if a non-error log is encountered. If we see that there have already been two consecutive errors, we return early from the function.</p>
<blockquote>
<p>We used <code>is_error_log()</code> since we did not specify what our input looks like exactly. Its implementation depends on the form of the input.</p>
</blockquote>
<p>The program we just wrote is very similar to a finite state machine! Let us explore the finite state machine implementation.</p>
<h3 id="finite-state-machine-solution" tabindex="-1">Finite State Machine Solution</h3>
<p>A finite state machine can only be in one state at a time. Thankfully, our function already used one state variable with only a finite number of possibilities: 0, 1, and 2 errors detected.</p>
<p>Two states are of special interest: the starting state and the accept state. Our starting state is 0, and the accept state is 2 consecutive errors detected. If execution ends with the machine in the accept state, then the input as a whole is accepted. That is the &quot;return value&quot; of the finite state machine: whether the input is accepted or not.</p>
<blockquote>
<p>Real world implementations of the finite state machine may deviate slightly from these definitions. For example, we may have the final state as the return value rather than a binary accept or reject. Those machines would still closely resemble the abstract finite state machine.</p>
</blockquote>
<h4 id="transition-function" tabindex="-1">Transition Function</h4>
<p>An important part of the state machine is how it moves from one state to another. The finite state machine reads the elements of the input one by one. For every element, it may change its state based on the input element and its current state. The rule defining how the machine changes state is called the &quot;transition function&quot;.</p>
<p>In our example, we can describe the transition function as follows:</p>
<pre><code>- state = 0, input = non-error -&gt; state = 0
- state = 0, input = error     -&gt; state = 1
- state = 1, input = non-error -&gt; state = 0
- state = 1, input = error     -&gt; state = 2
- state = 2, input = non-error -&gt; state = 2
- state = 2, input = error     -&gt; state = 2
</code></pre>
<p>One last difference is that the finite state machine does not return early. It reads the input in its entirety until it is done.</p>
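<p>The transition function maps naturally onto a lookup table. Here is a sketch in Python, encoding each input element as a boolean (<code>True</code> for an error log); the names are mine, not part of any formal notation:</p>

```python
# (current state, is_error) -> new state, as described above
TRANSITIONS = {
    (0, False): 0, (0, True): 1,
    (1, False): 0, (1, True): 2,
    (2, False): 2, (2, True): 2,
}

def run_fsm(inputs):
    """Run the machine over the whole input; accept iff the final state is 2."""
    state = 0  # starting state
    for is_error in inputs:
        state = TRANSITIONS[(state, is_error)]
    return state == 2
```

<p>Note that the loop always consumes the entire input and only checks the state at the end, unlike the imperative version's early return.</p>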
<p>Congratulations, we just devised a finite state machine! Let us walk through a sample execution.</p>
<h4 id="sample-execution-of-finite-state-machine" tabindex="-1">Sample Execution of Finite State Machine</h4>
<p>Let's take an example input. For simplicity, let us define a non-error log as 0 and an error log as 1. Then, let us take the following input example: <code>00100110</code></p>
<ul>
<li>The machine starts with its starting state as 0 (state = 0)</li>
<li>it reads the first input element as 0. Based on the transition function, the new state remains at 0 (state = 0)</li>
<li>same as above happens for the second input element, as it is also a 0 (state = 0)</li>
<li>the third input = 1. Based on transition function, state moves to 1. (state = 1)</li>
<li>input = 0. Based on transition function, state moves to 0 (state = 0)</li>
<li>input = 0, new state = 0 (state = 0)</li>
<li>input = 1, new state = 1 (state = 1)</li>
<li>input = 1, new state = 2 (state = 2)</li>
<li>input = 0, new state remains at 2 (state = 2)</li>
</ul>
<p>We finish execution in an accept state, so our input is accepted!</p>
<blockquote>
<p>Notice how it is easier to reason about the algorithm by thinking of it as a bunch of states and transitions between them, rather than an imperative loop. While imperative loops are a common encounter for programmers and make a lot of sense, state machines are even easier to make sense of for non-technical people. All it takes is thinking of states and transitions. The difference in ease of understanding may be more apparent in more complex problems compared to our simple example.</p>
</blockquote>
<h2 id="definition-of-finite-state-machine" tabindex="-1">Definition of Finite State Machine</h2>
<p>To define a specific finite state machine, we must define the following:</p>
<ul>
<li><strong>An Alphabet:</strong> The range of values it accepts as input elements</li>
<li><strong>States:</strong> The finite list of possible states the machine can hold</li>
<li><strong>Starting State</strong></li>
<li><strong>Accept State</strong></li>
<li><strong>Transition Function:</strong> Defines how the machine changes state, given an input element and the current state as input to the function</li>
</ul>
<p>Our example above can be defined as follows:</p>
<ul>
<li><strong>An Alphabet:</strong> The set <code>[&quot;error log&quot;, &quot;non-error log&quot;]</code></li>
<li><strong>States:</strong> <code>[0,1,2]</code></li>
<li><strong>Starting State:</strong> 0</li>
<li><strong>Accept State:</strong> 2</li>
<li><strong>Transition Function:</strong> Defined above in the transition function section</li>
</ul>
<p>A deterministic finite state machine defined by the above holds a single state and reads an input, changing state based on its transition function as it goes.</p>
<p>What we listed above are all the components needed to define a Finite State Machine.</p>
<h2 id="you-have-built-finite-state-machines-without-noticing!" tabindex="-1">You Have Built Finite State Machines Without Noticing!</h2>
<p>Notice how the example we presented was originally solved using a traditional loop in the common imperative style of code. If you're a programmer, you have probably written many Finite State Machines without even noticing! This is one of the important aspects of learning models of computation: it is not only about choosing which model to use, but also about identifying them, which helps with conceptualizing and visualizing problems, as well as recognizing the properties you can apply to them.</p>
<h2 id="characteristics-of-finite-state-machines" tabindex="-1">Characteristics of Finite State Machines</h2>
<h3 id="limited-memory-and-good-performance" tabindex="-1">Limited Memory and Good Performance</h3>
<p>The &quot;finite state&quot; part means that the machine's memory is bounded. This is in contrast to many other models of computation where memory is unbounded. This makes the Finite State Machine very performant in many cases, and useful in certain low level programming such as embedded devices and network devices.</p>
<p>This also means that the Finite State Machine is unsuitable for applications where finite memory is insufficient. For example, consider the problem of verifying whether an input has a certain number of 0's followed by the same exact number of 1's. This is impossible for a Finite State Machine to solve, as it requires unbounded memory, counting an unbounded number of 0's in order to verify that the number of 1's is equal.</p>
<blockquote>
<p><strong>Note:</strong> nearly all real world problems have some memory bound, even if unrealistically high. So you can still use Finite State Machines there, but it just may not be the best idea to do so.</p>
</blockquote>
<h3 id="simplicity" tabindex="-1">Simplicity</h3>
<p>The simplicity of the Finite State Machine makes it easy to reason about. Modeling problems using it can help you identify superfluous or error states, use reduction methods to simplify the problem, or make use of the other characteristics we discuss below.</p>
<h3 id="easy-to-construct-complex-machines-from-simpler-ones" tabindex="-1">Easy to Construct Complex Machines from Simpler Ones</h3>
<p>Every Finite State Machine can be expressed as a combination of simpler ones. In fact, every one of them can be broken down into the most basic Finite State Machines, i.e. machines that accept if the input is one specific character and reject otherwise.</p>
<p>If we take our error log example, the machine we devised can be thought of as a combination of simpler machines: one that accepts any sequence of logs, one that accepts exactly two consecutive error logs, and another that accepts any sequence of logs, concatenated together. The middle machine can in turn be broken down into an elementary machine, one that accepts the single input 1 (an error log) and rejects everything else, concatenated with itself.</p>
<p>This has important implications, one of which probably affects your every day programming life: Regular Expressions.</p>
<h4 id="regular-expressions" tabindex="-1">Regular Expressions</h4>
<p>A regular expression is nothing more than a way of expressing, using elementary characters, the class of inputs that a Finite State Machine must accept. For example, the regular expression <code>/yes/</code> can be thought of as the combination of <code>/y/</code>, <code>/e/</code>, and <code>/s/</code>, or of the finite state machines that recognize each of those individually. This is called the &quot;concat&quot; operation. The union operation can be expressed as <code>/[abc]/</code>, which accepts an input that is &quot;a&quot;, &quot;b&quot;, or &quot;c&quot;.</p>
<p>Regular expressions are indeed built on top of finite state machines, at least conceptually (modern implementations tend to have extensions).</p>
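<p>As a toy illustration of building recognizers from elementary ones, here is a sketch where predicate functions stand in for explicit state machines; the function names are my own, not standard terminology:</p>

```python
def char_machine(ch):
    """Elementary recognizer: accepts exactly the one-character string ch."""
    return lambda s: s == ch

def concat(m1, m2):
    """Accepts s if it splits into a prefix accepted by m1 and a suffix accepted by m2."""
    return lambda s: any(m1(s[:i]) and m2(s[i:]) for i in range(len(s) + 1))

def union(*machines):
    """Accepts s if any of the combined machines accepts it."""
    return lambda s: any(m(s) for m in machines)

# /yes/ as the concatenation of three elementary machines
yes = concat(char_machine("y"), concat(char_machine("e"), char_machine("s")))
# /[abc]/ as the union of three elementary machines
abc = union(char_machine("a"), char_machine("b"), char_machine("c"))
```

<p>Real regex engines compile to explicit automata rather than trying every split, but the compositional idea is the same.</p>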
<h3 id="equivalence-of-deterministic-and-non-deterministic-finite-state-machines" tabindex="-1">Equivalence of Deterministic and Non-Deterministic Finite State Machines</h3>
<p>Most of what we have seen is about deterministic Finite State Machines, in which transition functions have deterministic outcomes. That is, given a current state and an input, there is always one possible non-null state outcome. Moreover, every state has a transition for every input-state combination.</p>
<p>Non-deterministic Finite State Machines do not have this requirement. In other words, a transition could have zero, one, or more output states for the same input. How does this make sense? Well, when multiple outputs are encountered, you can imagine the process splitting, with two new state machine processes born out of it, one for each state. Execution continues as normal, and one of these &quot;processes&quot; ending in an accept state is sufficient for the entire input to be accepted. If this does not make sense, I will be writing a separate blog post on Non-Deterministic Finite State Machines with examples, so stay tuned! Feel free to check the <a href="https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton">Wikipedia article</a> in the meantime.</p>
<p>The key takeaway: it may seem that Non-Deterministic FSMs (Finite State Machines) are more powerful and capable of solving more problems than their deterministic counterparts. In reality, every Non-Deterministic FSM has an equivalent deterministic FSM. Not only that, there is a specific mechanism for converting a Non-Deterministic FSM into a deterministic one.</p>
<p>This is quite important. Some problems are much easier to conceptualize with a non-deterministic model, but can then be converted to a deterministic model, preserving the simplicity of the latter.</p>
<h3 id="the-pumping-lemma" tabindex="-1">The Pumping Lemma</h3>
<p>Although out of scope for this article, regular languages have an important mathematical property that they all share called the <a href="https://en.wikipedia.org/wiki/Pumping_lemma_for_regular_languages">Pumping Lemma</a>. The key takeaway is that this makes it possible to mathematically prove whether a language (set of inputs) is regular (solvable by finite state machine) without having to search for a specific implementation.</p>
<h2 id="applications-of-finite-state-machines" tabindex="-1">Applications of Finite State Machines</h2>
<p>Although important in theory, Finite State Machines have practical applications. They are commonly used in lexical parsing, network devices, machine learning and others.</p>
<p>Moreover, as we saw earlier, you might write a program that is a Finite State Machine without noticing it! You don't have to go out of your way to write programs as FSMs to gain their benefits. Sometimes, it is enough to recognize them in existing problems and solutions. It'll make it easier to map the program's state to the real-world representation. You might represent state and its transitions in a clearer, more maintainable and readable way. Moreover, being more intentional about state management will help you avoid errors in the long run.</p>
<p>Some examples that you are more likely to run into day-to-day:</p>
<ul>
<li><strong>Event-Driven Programming:</strong> Event-driven programming is increasingly common in distributed systems, and such systems can often be modeled as state machines</li>
<li><strong>Reactive UI State:</strong> If you have ever created a React.JS component, you very likely modeled your component state as a state machine</li>
<li><strong>Regular Expressions (Regex):</strong> Explained in a previous section</li>
</ul>

    ]]>
      </content>
    </entry>
  
    
    <entry>
      <title>Reactivity and Reactive Programming</title>
      <link href="https://cosmicbyt.es/posts/reactivity-and-reactive-programming/"/>
      <updated>2024-07-04T00:00:00Z</updated>
      <id>https://cosmicbyt.es/posts/reactivity-and-reactive-programming/</id>
      <content type="html">
        <![CDATA[
      <h2 id="introduction" tabindex="-1">Introduction</h2>
<p>Reactive programming has become increasingly popular in the frontend programming world, with the advent of modern browser UI frameworks such as React and Vue, nearly all of which boast &quot;Reactivity&quot; as a core feature.</p>
<p>While it is popular in the web browser world, it is not inherent to it. Reactive programming is a general programming paradigm that can be applied anywhere, without a UI involved. So if you are not a frontend developer, do not be discouraged!</p>
<p>In my opinion, reactive programming is what throws off many backend developers when they attempt frontend programming. Reactive programming can be confusing to the newcomer, as it completely changes the flow of a program. But once you get familiar with it, it can unlock powerful programming patterns and change the way you write your program.</p>
<h2 id="what-is-reactivity%3F" tabindex="-1">What is reactivity?</h2>
<p>Reactivity is a declarative programming paradigm in which changes propagate across <strong>layers of data</strong> automatically and implicitly. It is declarative because the programmer does not have to specify how to propagate data changes and keep them in sync.</p>
<p>Does this not make sense yet? Let us demonstrate with examples!</p>
<h2 id="example%3A-spreadsheets" tabindex="-1">Example: Spreadsheets</h2>
<p>If you are not familiar with spreadsheet software such as Microsoft Excel, skip this section.</p>
<p>Spreadsheet software such as Microsoft Excel has a feature similar to reactivity, so let us check it out. Suppose you have a list of expenses that you are splitting with some friends. You want to know how much you have spent on each expense. Your spreadsheet could look like this:</p>
<table>
<thead>
<tr>
<th>Expense Name</th>
<th>Cost</th>
<th>My Share</th>
<th>Number of Friends</th>
</tr>
</thead>
<tbody>
<tr>
<td>Restaurant</td>
<td>70</td>
<td>14</td>
<td>5</td>
</tr>
<tr>
<td>Transportation</td>
<td>30</td>
<td>6</td>
<td></td>
</tr>
<tr>
<td>Tickets</td>
<td>400</td>
<td>80</td>
<td></td>
</tr>
<tr>
<td>Total</td>
<td>500</td>
<td>100</td>
<td></td>
</tr>
</tbody>
</table>
<p>In spreadsheet software, you often have the ability to insert formulas that perform calculations for you. For example, we can tell it that the &quot;My Share&quot; column equals the corresponding (same-row) &quot;Cost&quot; divided by &quot;Number of Friends&quot; (which applies regardless of row). We can also tell it that &quot;Total&quot; is the sum of the column above it.</p>
<p>Although this feature saves us some calculations that would otherwise be performed manually, it has another important benefit: data binding. If I change one of the values in &quot;Cost&quot;, the spreadsheet recognizes there is a data dependency and that it must update other data fields, because we previously gave it a formula stating that these fields are calculated from &quot;Cost&quot;. Suppose we found additional restaurant costs and the restaurant expense is now 120 instead of 70. We only need to update the &quot;Cost&quot; column, and if we did things right, the spreadsheet will automatically update &quot;My Share&quot;:</p>
<table>
<thead>
<tr>
<th>Expense Name</th>
<th>Cost</th>
<th>My Share</th>
<th>Number of Friends</th>
</tr>
</thead>
<tbody>
<tr>
<td>Restaurant</td>
<td>120</td>
<td>24</td>
<td>5</td>
</tr>
<tr>
<td>Transportation</td>
<td>30</td>
<td>6</td>
<td></td>
</tr>
<tr>
<td>Tickets</td>
<td>400</td>
<td>80</td>
<td></td>
</tr>
<tr>
<td>Total</td>
<td>550</td>
<td>110</td>
<td></td>
</tr>
</tbody>
</table>
<p>We did not have to explicitly tell the spreadsheet to update these columns. This occurs automatically.</p>
<h2 id="example%3A-code" tabindex="-1">Example: Code</h2>
<p>Suppose I have 3 pieces of data: A, B, C. Think of them as variables holding the data. Now imagine there is a relationship between them. For example: A = B + C. Here is some sample code:</p>
<pre class="language-javascript"><code class="language-javascript"><span class="token keyword">let</span> b <span class="token operator">=</span> <span class="token number">1</span><br><span class="token keyword">let</span> c <span class="token operator">=</span> <span class="token number">2</span><br><span class="token keyword">let</span> a <span class="token operator">=</span> b <span class="token operator">+</span> c <span class="token comment">// a = 3</span></code></pre>
<p>Now suppose the value of C changes:</p>
<pre class="language-javascript"><code class="language-javascript"><span class="token keyword">if</span> <span class="token punctuation">(</span>some_condition<span class="token punctuation">)</span> <span class="token punctuation">{</span><br>    c <span class="token operator">=</span> <span class="token number">4</span><br><span class="token punctuation">}</span></code></pre>
<p>Now, C is 4 and B remains at 1, but A also remains at 3. It is interesting to note that the statement <code>a = b + c</code> no longer holds, because if we were to recalculate it with the new value of C, we would get 5.</p>
<p>Well that's not a problem, we can just recalculate A again right?</p>
<pre class="language-javascript"><code class="language-javascript">// --- old code ---<br>let b = 1<br>let c = 2<br>let a = b + c // a = 3<br>if (some_condition) {<br>    c = 4<br>}<br>// --- new code ---<br>a = b + c</code></pre>
<p>But what if C changes again? We would have to add <code>a = b + c</code> everywhere that happens. Surely this is bad programming. What if we could bind A to changes in C, such that whenever C changes, A is recalculated and remains in sync? That is essentially reactive programming. Let us demonstrate:</p>
<pre class="language-javascript"><code class="language-javascript">// library code
class ReactiveValue {
    constructor(val, updater = null, callback = null) {
        if (val == null) {
            if (updater == null) {
                throw Error('Cannot have both value and updater as null');
            }
            // derive the initial value from the updater
            val = updater();
        }
        this.val = val;
        this.updater = updater;
        this.callback = callback;
    }

    getValue() {
        return this.val;
    }

    // recompute this value from its updater, if it has one
    update() {
        if (this.updater != null) {
            this.setValue(this.updater());
        }
    }

    setValue(newVal) {
        this.val = newVal;
        // notify the dependent so the change propagates
        if (this.callback != null) {
            this.callback();
        }
    }

    setCallback(callback) {
        this.callback = callback;
    }
}

// application code

let b = new ReactiveValue(1);
let c = new ReactiveValue(2);
// a is a derived value: its updater recomputes it from b and c
let a = new ReactiveValue(null, function () {
    return b.getValue() + c.getValue();
});

// when b or c changes, tell a to recompute itself
// (the arrow functions matter: passing a.update directly would lose `this`)
b.setCallback(() => a.update());
c.setCallback(() => a.update());

console.log(a.getValue()); // 3
c.setValue(4);
console.log(a.getValue()); // 5</code></pre>
<p>Now, any time <code>b</code> or <code>c</code> is updated, <code>a</code> updates as well, without us explicitly specifying how and when to do so each time. Cool, right?</p>
<h2 id="different-implementations" tabindex="-1">Different Implementations</h2>
<p>To achieve reactivity, our library (the <code>ReactiveValue</code> class) needed a way to know the dependencies between pieces of data. In our implementation, we had to explicitly tell the program which values are reactive and what their dependencies are. Many implementations figure out the dependencies automatically: some treat every value as reactive, while others analyze the code to decide which variables need to be. The point is, do not get stuck on any specific implementation; take ours as one way to demonstrate the concept of reactivity.</p>
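<p>As an illustration of the automatic approach, here is a minimal sketch of dependency tracking in the style popularized by libraries such as Vue and SolidJS. The names (<code>signal</code>, <code>effect</code>) and the API here are illustrative assumptions for this sketch, not any real library's interface:</p>
<pre class="language-javascript"><code class="language-javascript">// illustrative sketch, not a real library API
// the effect currently being run, if any
let currentEffect = null;

function signal(initial) {
    let value = initial;
    const subscribers = new Set();
    return {
        get() {
            // reading inside an effect registers that effect as a dependent
            if (currentEffect != null) subscribers.add(currentEffect);
            return value;
        },
        set(next) {
            value = next;
            // re-run every effect that has read this signal
            subscribers.forEach((fn) => fn());
        },
    };
}

function effect(fn) {
    currentEffect = fn;
    fn(); // the first run registers dependencies via get()
    currentEffect = null;
}

const b = signal(1);
const c = signal(2);
let a;
effect(() => { a = b.get() + c.get(); });

console.log(a); // 3
c.set(4);
console.log(a); // 5</code></pre>
<p>Notice that we never told <code>a</code> what it depends on: simply reading <code>b</code> and <code>c</code> inside the effect was enough to record the dependencies.</p>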
<h2 id="conclusion" tabindex="-1">Conclusion</h2>
<p>So let us summarize: What is reactivity?</p>
<p>It is a programming paradigm in which the program automatically and implicitly updates data; in other words, without explicit direction from the programmer on how and when to update it. Implementations look different from one library to another, and even application code that uses these libraries will vary widely. The point is that, from the user's perspective, there is some form of automatic propagation of data changes.</p>
<p>In typical imperative programming style, the program flows step by step, line by line. When we assign a value to a variable, it is said and done. We have the confidence that it will never change unless explicitly changed. In reactive programming, the values may change, but the data relations are preserved. This may be unintuitive to a newcomer, and it may not make sense for certain programs. Use it in the right situations, and you will find it to be powerful and useful.</p>
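<p>To make the contrast concrete, here is the same sum in both styles, in plain JavaScript (the reactive side is simulated with a getter purely for illustration):</p>
<pre class="language-javascript"><code class="language-javascript">// imperative: sum is computed once and never changes on its own
let x = 1, y = 2;
let sum = x + y;   // 3, and it stays 3
y = 4;
console.log(sum);  // still 3

// reactive (simulated with a getter): the relation sum = x + y is preserved
const state = {
    x: 1,
    y: 2,
    get sum() { return this.x + this.y; }, // recomputed on every read
};
state.y = 4;
console.log(state.sum); // 5</code></pre>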

    ]]>
      </content>
    </entry>
  
</feed>