Things I learnt while working on ZulipFS
I came across FUSE recently, which brought up an idea: what if I could access a Zulip instance as a filesystem?
This led to the creation of ZulipFS, where channels are represented as directories and topics as files within those directories. The usage instructions and code for the project are available here. This post talks about things I learnt about FUSE and filesystems in general, and the design choices I made in the process.
How FUSE works in a nutshell
FUSE lets you mount a folder containing files and folders. These could be actual files and folders (e.g. SSHFS, where files from a remote machine are mounted over the network) or virtual files (e.g. TabFS, where information from browser tabs is presented as files and folders). You write your own implementations of the relevant file-related system calls, and FUSE will run your implementation instead of the standard one.
The system calls I implemented in ZulipFS are:
- read and write, for reading and writing files
- readdir, to list directory contents
- getattr, which returns the metadata for each file and directory
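As a toy illustration of these operations, here is a minimal in-memory stand-in. It does not use the actual FUSE bindings (the real ZulipFS subclasses the fuse-python API), and the class name, paths, and contents are all made up for this sketch; write is omitted for brevity.

```python
import errno
import stat

class MiniFS:
    """Toy stand-in for a FUSE filesystem: channel directories
    containing topic files, with hard-coded contents."""

    def __init__(self):
        # channel -> topic -> file contents
        self.tree = {'general': {'welcome': b'hello world\n'}}

    def getattr(self, path):
        parts = [p for p in path.split('/') if p]
        if len(parts) < 2:                  # root or a channel directory
            if parts and parts[0] not in self.tree:
                return -errno.ENOENT
            return {'st_mode': stat.S_IFDIR | 0o755, 'st_size': 0}
        channel, topic = parts
        try:
            return {'st_mode': stat.S_IFREG | 0o644,
                    'st_size': len(self.tree[channel][topic])}
        except KeyError:
            return -errno.ENOENT

    def readdir(self, path):
        parts = [p for p in path.split('/') if p]
        names = self.tree if not parts else self.tree.get(parts[0], {})
        return ['.', '..'] + list(names)

    def read(self, path, size, offset):
        channel, topic = [p for p in path.split('/') if p]
        return self.tree[channel][topic][offset:offset + size]

fs = MiniFS()
print(fs.readdir('/general'))             # ['.', '..', 'welcome']
print(fs.read('/general/welcome', 5, 0))  # b'hello'
```

The real callbacks have the same shape: FUSE passes in a path (plus size/offset for read), and your code decides what "file" that path corresponds to.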
File metadata is more important than I thought
When I implemented and tested reading a message from a topic, only part of a message got printed. I checked if the API call was returning a partial message, but that seemed fine.
Turns out, I had the file size set to 512 bytes, as I was working off of this example code. So read checks the size of a file and prints only that many bytes, which makes sense!
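The truncation is easy to reproduce with a toy version of the read path (the message and the reported size here are made up for illustration):

```python
# Toy model of why a too-small st_size truncates output: the kernel
# never requests bytes beyond the size reported by getattr.
message = b"Hello from Zulip! This part never gets shown."
reported_size = 16  # a wrong, hard-coded size from getattr

def read(buf, size, offset):
    # what a FUSE read callback typically does
    return buf[offset:offset + size]

# Read requests are capped at reported_size bytes.
print(read(message, reported_size, 0))  # b'Hello from Zulip'
```

No matter how long the real message is, only the first reported_size bytes ever reach the user.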
I now had to figure out a way to get the file size before the message is read.
Reading and knowing a file are two different things
The file size is set in the getattr function, which is called each time a file is read or listed. So the Zulip API would be called twice for a single file read - once in getattr to get the length of the message, and then in read to display the message itself.
For this, I created a function called get_topic that can be called by both getattr and read:
def get_topic(self, channel, topic):
    channel_id = self.channels[channel]['stream_id']
    # returns the ID of the last message along with the topic name
    topicslist = self.client.get_stream_topics(channel_id)['topics']
    self.topics[channel] = { self.normalize(t['name']): t for t in topicslist }
    # get the message contents using the message ID
    message = self.client.get_raw_message(self.topics[channel][topic]['max_id'])
    message_fmt = f"""[{datetime.fromtimestamp(message['message']['timestamp'])}] {message['message']['sender_full_name']}
{message['raw_content']}
""".encode()
    self.topics[channel][topic] = {
        'last_message': message_fmt,
        'last_timestamp': float(message['message']['timestamp']),
    }
    return self.topics[channel][topic]
Making the same API calls twice felt a bit excessive for reading, but it was okay as long as it wasn't slowing things down. Then I tried listing all the topics in a channel via ls, and things slowed down… A LOT.
Why, you ask? The function handling directory listing, readdir, calls getattr for EACH FILE in the directory. If a channel has 300 topics, that's 300 API calls before ls completes execution. To add to the chaos, get_topic above uses two API calls instead of one, which means 600 API calls before ls completes execution. I had to find ways to optimize this.
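The fan-out is easy to model. This simulation just mirrors the arithmetic above; the function bodies are stand-ins, not the actual ZulipFS code:

```python
# Toy model of the ls fan-out: readdir calls getattr once per topic,
# and each pre-optimization get_topic made two API calls.
api_calls = 0

def get_topic():
    global api_calls
    api_calls += 2  # get_stream_topics + get_raw_message

def getattr_for(topic):
    # getattr needs the message length, so it fetches the whole message
    get_topic()

def readdir(topics):
    for topic in topics:
        getattr_for(topic)

readdir(range(300))
print(api_calls)  # 600
```

The cost of ls scales linearly with the number of topics, multiplied by the number of API calls hidden inside each getattr.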
Lazy loading files?!?!?
The first optimization attempt was to remove the get_topic call from getattr and call it only in read. I placed an exception block in getattr, which would assign a file size of 65535 bytes on mount; a subsequent read would fill the hash map with the correct values, which getattr would use the next time it's called.
def getattr(self, path):
    # ...snip...
    # topic/file
    try:
        channel, topic = path[1:].split('/')
        try:
            timestamp = self.topics[channel][topic]['last_timestamp']
        except KeyError:
            timestamp = now
        try:
            st.st_size = len(self.topics[channel][topic]['last_message'])
        except KeyError:
            st.st_size = 65535
    # ...snip...
This worked initially, but caused problems when I wanted to append new messages instead of just displaying the last one. After lots of trial and error, a question popped up in my head: what if I don't create all topic files right away, and add them only after someone tries to read or list them?
This seemed like a great idea as it would significantly reduce the number of API calls made at once. Things might slow down eventually as you read more and more topics, but it would still be faster than trying to list all topics at once.
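The idea can be sketched as a cache that is only filled on first access. The class and names below are hypothetical, not the actual ZulipFS code, and the fetch callable stands in for the API call:

```python
# Sketch of lazy loading: topic entries are created on first access
# instead of all at mount time.
class TopicCache:
    def __init__(self, fetch):
        self.fetch = fetch   # callable hitting the API (one call per topic)
        self.topics = {}     # channel -> topic -> entry
        self.calls = 0       # counts how often the API was hit

    def get(self, channel, topic):
        entry = self.topics.setdefault(channel, {}).get(topic)
        if entry is None:    # first read/list of this topic
            self.calls += 1
            entry = self.fetch(channel, topic)
            self.topics[channel][topic] = entry
        return entry

cache = TopicCache(lambda c, t: {'last_message': b'hi', 'last_timestamp': 0.0})
cache.get('general', 'intro')
cache.get('general', 'intro')  # served from the cache, no extra API call
print(cache.calls)  # 1
```

Instead of 300 calls up front, you pay one call per topic, and only for the topics you actually touch.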
Another optimization I was able to make was combining the two API calls into one using get_messages, which powers Zulip's search functionality. I can pass it the name of a channel and topic and ask it to return the last message of that topic. If either the channel or the topic doesn't exist, it'll return an empty result.
def get_topic(self, channel, topic):
    request = {
        "anchor": "newest",
        "num_before": 1,
        "num_after": 0,
        "narrow": [
            {"operator": "channel", "operand": self.channels[channel]['name']},
            {"operator": "topic", "operand": self.zulip_name(topic)},
        ],
        "apply_markdown": False,
    }
    try:
        message = self.client.get_messages(request)['messages'][0]
        message_fmt = f"""[{datetime.fromtimestamp(message['timestamp'])}] {message['sender_full_name']}
{message['content']}
""".encode()
        self.topics[channel][topic] = {
            'last_message': message_fmt,
            'last_timestamp': float(message['timestamp']),
        }
    except IndexError:
        # channel or topic doesn't exist
        pass
    # if a channel or topic doesn't exist, this statement will cause an
    # exception in the function where this is called.
    return self.topics[channel][topic]
These optimizations made things fast enough that I could call get_topic from getattr again, so I could get rid of the extra try/except blocks:
# topic/file
try:
    channel, topic = path[1:].split('/')
    t = self.get_topic(channel, topic)
    st.st_mode = stat.S_IFREG | 0o644
    st.st_nlink = 1
    st.st_size = len(t['last_message'])
    st.st_mtime = t['last_timestamp']
except (KeyError, ValueError):
    return -errno.ENOENT
Appending new messages
I presented the pre-optimization version at the weekly Recurse Center presentations, and fellow batchmates Nolen and Kevin O suggested adding the ability to read new messages from a topic as they arrive by running tail -f on the file. This seemed like a good idea, and more useful than displaying just the last message.
I initially thought appending would require implementing a system call, but it was easier than I thought - if the timestamp of the current message is newer than the previous one, I append the new message to the end of the previous one in get_topic
. I also needed an additional check for whether the topic had been read before or not, to initialize the file for the first time.
The try block in get_topic now looks like this:
try:
    message = self.client.get_messages(request)['messages'][0]
    timestamp = float(message['timestamp'])
    message_fmt = f"""[{datetime.fromtimestamp(message['timestamp'])}] {message['sender_full_name']}
{message['content']}
""".encode()
    if topic not in self.topics[channel]:
        # First message in file
        self.topics[channel][topic] = {
            'last_message': message_fmt,
            'last_timestamp': timestamp,
        }
    else:
        # Subsequent messages appended to file
        if timestamp > self.topics[channel][topic]['last_timestamp']:
            self.topics[channel][topic] = {
                'last_message': self.topics[channel][topic]['last_message'] + b"\n" + message_fmt,
                'last_timestamp': timestamp,
            }
except IndexError:
    # channel or topic doesn't exist
    pass
Filename gotchas
One of the earliest errors I encountered was displaying names that had slashes in them. On Linux and other Unix-based OSes, a slash is considered a directory delimiter rather than part of a filename. One thing I'd seen certain apps do is replace special characters with their URL-encoded versions, so I replaced all instances of / with %2F.
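A quick round-trip check of the substitution (the channel name here is made up):

```python
# Escape '/' in Zulip names so they are valid single filenames,
# and undo it when talking back to the API.
name = "design/UX feedback"

encoded = name.replace('/', '%2F')
print(encoded)  # design%2FUX feedback

# the reverse mapping recovers the original Zulip name
decoded = encoded.replace('%2F', '/')
print(decoded == name)  # True
```

Since %2F is exactly what URL encoding uses for a slash, the names also stay recognizable to anyone used to seeing them in links.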
Emojis are another set of characters that are inconvenient to type in a terminal. I initially thought of getting rid of them, but then I realized that looking up the channel name would become tricky.
I remembered seeing textual representations of emojis, which it turns out are called shortcodes; they're written as text between colons. For example, the shortcode for 📝 is :memo:, and these are understood by Zulip. Python has an emoji package that converts emojis to shortcodes and vice versa.
With that, I had two functions to convert Zulip names to a valid filename and vice versa.
def file_name(self, name):
    return emoji.demojize(name.replace('/', '%2F'))

def zulip_name(self, name):
    return emoji.emojize(name.replace('%2F', '/'))
Thanks to Sophia for reviewing a draft of this post.