A brainfuck interpreter with serialised memory
Invented in 1993 by Urban Müller, brainfuck is a language designed to have a very small compiler.
As I am learning Rust, I thought that writing an interpreter for brainfuck would be a great starting project. The final interpreter fits neatly in a single file, but it is no code golf winner. The main resource I used for this project is the Esolang wiki page.
Environment and syntax
Brainfuck involves manipulating the values of one big array. Traditionally, this array was 30,000 cells long. However, modern implementations seem to vary in array length and cell capacity (we'll address this later).
To manipulate these values, brainfuck instructions use a pointer initialised to the first element of the array. The language specifies instructions to move the pointer left and right along the array, as well as to increment and decrement the value of the cell under the pointer.
In fact, there are only 8 valid brainfuck instructions, all one character long. All other characters in the source code are ignored as comments. The following is a table describing them.
| Instruction | Description |
|---|---|
| > | Move the pointer right |
| < | Move the pointer left |
| + | Increment the cell |
| - | Decrement the cell |
| . | Output the numeric value of the cell as a character |
| , | Read input and set the cell to the numeric value of the character |
| [ | Jump past the matching ] if the cell under the pointer is 0 |
| ] | Jump back to the matching [ if the cell under the pointer is non-zero |
Input is read one character at a time, with the character's numeric value written to the cell under the pointer; output works the other way around, printing the character corresponding to the current cell's numeric value.
Notably, the only control flow in the language is the pair of square brackets, which move the execution pointer depending on the value of the current cell.
With this set of instructions and an infinitely long tape, brainfuck is a Turing-complete language. I think this is quite neat. Armed with this information, I designed a simple interpreter in Rust.
Implementation
I first set up the interpreter to take in a single argument - the brainfuck source code file name.
IO and argument handling in Rust
This was before the days I found out about clap. I will be using it for future projects, as it saves writing out tedious panics like:
let args: Vec<String> = std::env::args().collect();
if args.len() != 2 {
    panic!(
        "[ERROR] Incorrect number of arguments provided. 1 expected, {} provided.",
        args.len() - 1
    );
}
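For comparison, here is a rough sketch of what the same check could look like with clap's derive API (assuming clap 4 with the derive feature enabled; the struct and field names here are just placeholders):

use clap::Parser;

// Sketch: clap derives the argument parser from this struct.
#[derive(Parser)]
struct Args {
    /// Path to the brainfuck source file.
    source_file: String,
}

fn main() {
    let args = Args::parse();
    // clap prints a usage message and exits if the argument is missing,
    // which is exactly the boilerplate the panic above reimplements.
    println!("Interpreting {}", args.source_file);
}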
Once the file is read into memory, I simply filter out any characters that are not part of the instruction set (i.e. comments). Rust's predicate/functional programming syntax makes this quite elegant.
const BRAINFUCK_CHARS: [char; 8] = ['>', '<', '+', '-', '.', ',', '[', ']'];
commands.retain(|c| BRAINFUCK_CHARS.contains(&c));
Going with tradition, I set the memory tape size to 30,000 cells in a mutable vector, then iterated through the instructions and executed them. The positions of the matching control flow instructions (i.e. the square brackets) can be precomputed first, which keeps the main loop simple.
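As a minimal sketch, the tape and its pointer boil down to a zero-initialised vector and an index (the names memory and pointer here match the snippets later in this post):

// The traditional 30,000-cell tape, zero-initialised. Cells are u8, as discussed below.
let mut memory: Vec<u8> = vec![0u8; 30_000];
// The data pointer starts at the first cell.
let mut pointer: usize = 0;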
Matching brackets and stacks
I used a stack-based algorithm to compute the matching brackets. For each '[', its index is pushed onto the stack; for each ']', the latest index is popped off the stack, which is the position of the matching bracket. The function returns a vector of Option<usize>, containing None where the instruction at that position is not a square bracket, and Some(position) where position is the index of the matching bracket.
Of course, this necessitates creating a vector the same size as the instruction list, which is somewhat wasteful. A tidier solution would be to return the same list without any of the None entries: while scanning for square brackets, the code only has to keep a running count of brackets seen, and that count yields the index into the compact matching-bracket list.
// Compute the positions of the matching bracket for each bracket.
// Returns a vector of matching bracket positions where relevant.
fn loop_closures(commands: &[char]) -> Vec<Option<usize>> {
    let mut stack: Vec<usize> = Vec::new();
    let mut closure: Vec<Option<usize>> = vec![None; commands.len()];
    for (current_index, command) in commands.iter().enumerate() {
        if *command == '[' {
            stack.push(current_index);
        }
        if *command == ']' {
            assert_ne!(
                stack.len(),
                0,
                "[SYNTAX ERROR] Unmatched loop pairs at position {}.",
                current_index
            );
            let matching_index: usize = stack.pop().unwrap();
            closure[current_index] = Some(matching_index);
            closure[matching_index] = Some(current_index);
        }
    }
    assert_eq!(
        stack.len(),
        0,
        "[SYNTAX ERROR] Unmatched loop pair(s) at {:#?}",
        stack
    );
    closure
}
Interpretation and ASCII
Interpretation was simply done by pattern matching on each instruction character and executing the corresponding command. This involved changing the pointer, modifying the cell, performing a jump, or doing IO.
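Roughly, the dispatch loop looks like the following sketch, built on the memory and pointer from earlier and the result of loop_closures (the variable names and the wrapping arithmetic on the u8 cells are choices made for this sketch; the IO arms are filled in below):

let closures: Vec<Option<usize>> = loop_closures(&commands);
let mut read_head: usize = 0;
while read_head < commands.len() {
    match commands[read_head] {
        '>' => pointer += 1,
        '<' => pointer -= 1,
        '+' => memory[pointer] = memory[pointer].wrapping_add(1),
        '-' => memory[pointer] = memory[pointer].wrapping_sub(1),
        '.' => { /* output the current cell, shown below */ }
        ',' => { /* read one character of input, shown below */ }
        // Jump past the matching ']' when the current cell is 0.
        '[' if memory[pointer] == 0 => read_head = closures[read_head].unwrap(),
        // Jump back to the matching '[' when the current cell is non-zero.
        ']' if memory[pointer] != 0 => read_head = closures[read_head].unwrap(),
        _ => {}
    }
    read_head += 1;
}

After a jump, the read_head += 1 at the bottom of the loop steps past the bracket that was just jumped to.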
For the sake of IO, I chose to make the array contain u8 values, giving $2^8=256$ possible values per cell. These can be easily cast to and from ASCII characters, and I was able to easily implement input:
// Read input and set as cell value.
let input_first_char: char = input[0];
memory[pointer] = input_first_char as u8;
and output:
// Print current cell value as ASCII char.
print!("{}", memory[pointer] as char);
Adding serialisation
That's basically it. All the instructions and the environment needed to interpret brainfuck source code are there.
I also wanted to add a little extension to this project, as I was not entirely satisfied with the measly 30,000 cells in the memory array. I tried scaling up to the maximum usize value, and (expectedly) ran into issues allocating such a large contiguous chunk of memory. Because I wanted to maximise the tape size while keeping memory consumption relatively lightweight, I chose to serialise the memory in blocks using the serde crate.
To implement this, I check when the pointer moves beyond the bounds of the current memory block, i.e. past the last cell of the block or below $0$. When this happens, the current memory block is serialised and stored under its identifier. Block identifiers are tracked using a u128, and the correct memory block is deserialised back into memory, or created where needed.
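The boundary check itself can be sketched as a small helper that replaces the plain pointer moves from the earlier loop. Here MAX_MEMORY_INDEX is the 30,000-cell block length used below, and the returned block number plays the role of changing_file_number; the exact shape of this helper is an illustration rather than the code as written:

// A sketch of a pointer move that detects block boundaries.
// Returns the new (pointer, block number) pair.
fn move_pointer(pointer: usize, block: u128, moving_right: bool) -> (usize, u128) {
    if moving_right {
        if pointer + 1 == MAX_MEMORY_INDEX {
            // Ran off the right edge: continue at the first cell of the next block.
            (0, block + 1)
        } else {
            (pointer + 1, block)
        }
    } else if pointer == 0 {
        // Ran off the left edge: continue at the last cell of the previous block.
        (MAX_MEMORY_INDEX - 1, block - 1)
    } else {
        (pointer - 1, block)
    }
}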
This creates a continuous memory tape whose length is $2^{128} \times 30,000$ cells (very large), successfully increasing the effective length. This bit of code does the block swap tidily.
if changing_file_number != current_file_number {
    // Serialise the current mem block and write it to the mem/ folder.
    let writing_bytes: &Bytes = Bytes::new(&memory);
    let writing_file_name: String = current_file_number.to_string() + "mem.txt";
    let mut writing_file: File = File::create("mem/".to_owned() + &*writing_file_name)
        .expect("Unable to create mem file");
    writing_file.write_all(writing_bytes).expect("Unable to write to mem file");

    // Deserialise the new mem block if available. If not, create a new block.
    let reading_file_name: String = changing_file_number.to_string() + "mem.txt";
    // Look in the same mem/ folder that blocks are written to.
    match File::open("mem/".to_owned() + &*reading_file_name) {
        Ok(mut reading_file) => {
            // read_to_end appends, so clear the old block's contents first.
            memory.clear();
            reading_file.read_to_end(&mut memory)
                .expect("Unable to read mem block");
        }
        Err(_) => {
            // File does not exist. Create a new zeroed block in memory.
            memory = vec![0; MAX_MEMORY_INDEX];
        }
    }
}
read_head += 1;
current_file_number = changing_file_number;
}
In fact, we can quickly work out the effective tape length. With a memory block size of 30,000 cells and $2^{128}$ blocks addressable by a u128, this yields a tape of $30,000 \times 2^{128}$ cells:
10,208,471,007,628,153,903,901,238,222,953,046,343,680,000
Pretty cool.
Conclusion
Overall - a nice gentle introduction to Rust!
One performance bottleneck would arise if the pointer repeatedly crosses the boundary between two memory blocks, since every crossing triggers expensive IO and serialisation. Keeping the neighbouring memory blocks (and as many blocks as a loop potentially spans) loaded at all times would remove this potential performance issue.
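One way this could look, purely as a sketch of the idea rather than something implemented here: keep recently used blocks in an in-memory map keyed by their block number, and only serialise a block out to disk when it is evicted (MAX_MEMORY_INDEX is the block length as before).

use std::collections::HashMap;

// Sketch of an in-memory cache of blocks keyed by their block number.
// Blocks would only be serialised to disk when evicted from the map.
struct BlockCache {
    blocks: HashMap<u128, Vec<u8>>,
}

impl BlockCache {
    // Fetch a block mutably, creating a zeroed one if it has never been touched.
    fn block_mut(&mut self, number: u128) -> &mut Vec<u8> {
        self.blocks
            .entry(number)
            .or_insert_with(|| vec![0u8; MAX_MEMORY_INDEX])
    }
}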