Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robots themselves, but a lot of it actually runs in the backend: remote control, path finding, matching robots to customers, fleet health management, and also interactions with customers and merchants. All of this has to run 24×7 without interruption and scale dynamically to match the workload.
Starship’s SRE team is responsible for providing the cloud infrastructure and platform services for running these backend services. We have standardized on Kubernetes for our microservices, running on top of AWS. MongoDB is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For asynchronous messaging, Kafka is the preferred platform and we use it for almost everything, from robot data to shipping video streams. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. CI/CD is handled by Jenkins.
A good portion of SRE’s time is spent maintaining and improving the Kubernetes infrastructure. Kubernetes is our main deployment platform and there is always something to improve, be it fine-tuning autoscaling settings, adding pod disruption policies or optimizing the use of spot instances. Sometimes it is like laying bricks: simply installing a Helm chart to provide a particular piece of functionality. But often the “bricks” must be carefully picked and evaluated (is Loki good for log management, is a service mesh a thing and, if so, which one), and sometimes the functionality does not exist in the world and has to be written from scratch. When that happens we usually turn to Python and Golang, but also Rust and C when needed.
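As an example of the brick-laying side of this work, a pod disruption policy is just a small manifest. A minimal sketch of a PodDisruptionBudget is below; the service name and labels are illustrative, not from our actual deployments:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: route-service-pdb        # hypothetical service name
spec:
  minAvailable: 2                # keep at least 2 pods up during voluntary disruptions
  selector:
    matchLabels:
      app: route-service         # must match the pods' labels
```

With this in place, node drains (e.g. during spot instance reclaims or cluster upgrades) will not evict pods below the stated floor.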
Another major piece of infrastructure that SRE is responsible for is data and databases. Starship started out with a single monolithic MongoDB, an architecture that has worked well so far. However, as the business grows we need to revisit this architecture and start thinking about supporting robots by the thousands. Apache Kafka is part of the scaling story, but we also need to figure out sharding, regional clustering and microservice database architecture. On top of that we are constantly developing tools and automation to manage the existing database infrastructure. Examples: add MongoDB observability with a custom sidecar proxy to analyze database traffic, enable PITR support for databases, automate regular failover and recovery tests, collect metrics for Kafka re-sharding, enable data retention.
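Replication health is typical of the metrics this kind of database tooling collects. A minimal sketch in Python: the field names follow the shape of MongoDB’s `replSetGetStatus` output, but the function here just operates on a plain dict, so the exact wiring to a live cluster is left out:

```python
from datetime import datetime

def replication_lag_seconds(status: dict) -> float:
    """Seconds by which the most-lagged secondary trails the primary.

    `status` is expected to look like a (simplified) replSetGetStatus
    document: {"members": [{"stateStr": ..., "optimeDate": datetime}, ...]}.
    """
    primary = next(m["optimeDate"] for m in status["members"]
                   if m["stateStr"] == "PRIMARY")
    secondaries = [m["optimeDate"] for m in status["members"]
                   if m["stateStr"] == "SECONDARY"]
    # The largest gap is what matters for failover safety.
    return max((primary - t).total_seconds() for t in secondaries)
```

A number like this, exported as a Prometheus gauge, is what alerting on unhealthy replicas would key off.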
Finally, one of the most important goals of Site Reliability Engineering is to minimize downtime for Starship’s production. While SRE is occasionally called out to deal with infrastructure outages, the more impactful work is done on preventing outages and ensuring that we can recover quickly. This can be a very broad topic, ranging from having rock-solid K8s infrastructure all the way to engineering practices and business processes. Great opportunities to make an impact!
A day in the life of an SRE
Arrive at work, some time between 9 and 10 (sometimes working remotely). Grab a cup of coffee, check Slack messages and emails. Review the alerts that fired during the night, see if there is anything interesting there.
Find that MongoDB connection latencies have spiked during the night. Dig into the Prometheus metrics with Grafana to see whether this happens during backups. Why is this suddenly a problem, we’ve run those backups for ages? It turns out that we compress backups very aggressively to save on network and storage costs, and this consumes all available CPU. It looks like the load on the database has grown just enough to make this noticeable. It is happening on a standby node, so production is not affected, although it is still a problem should the primary fail. Add a backlog item to fix it.
While at it, tweak the MongoDB probe code (Golang) to add more histogram buckets for a better understanding of the latency distribution. Run the Jenkins pipeline to roll the new probe out to production.
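The probe itself is Go, but the bucket arithmetic is the same in any Prometheus client. A small Python sketch of how observations land in cumulative "le" buckets, the way Prometheus exposes histograms; adding more bucket boundaries is exactly what sharpens the latency picture:

```python
import bisect

def bucket_counts(observations, buckets):
    """Return cumulative per-bucket counts, Prometheus 'le' style.

    `buckets` are sorted upper bounds; an implicit +Inf bucket is appended,
    so the last count always equals the total number of observations.
    """
    counts = [0] * (len(buckets) + 1)          # last slot is the +Inf bucket
    for v in observations:
        counts[bisect.bisect_left(buckets, v)] += 1   # v <= bound -> that bucket
    cumulative, total = [], 0
    for c in counts:                            # Prometheus buckets are cumulative
        total += c
        cumulative.append(total)
    return cumulative
```

With only two coarse buckets a 3 ms and a 20 ms request look identical; with finer boundaries they separate, which is the whole point of the probe change.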
There’s a standup meeting at 10am; share your updates with the team and learn what others are up to: setting up monitoring for a VPN server, instrumenting a Python app with Prometheus, setting up ServiceMonitors for external services, debugging MongoDB connectivity problems, piloting canary deployments with Flagger.
After the meeting, resume the work planned for the day. One of the things I planned to do today is to set up an additional Kafka cluster in a test environment. We run Kafka on Kubernetes, so it should be straightforward to take the existing cluster’s YAML files and tweak them for the new cluster. Or, on second thought, should we use Helm instead, or maybe there is a good Kafka operator available by now? No, not going there: too much magic, I want more explicit control over my StatefulSets. Raw YAML it is. An hour and a half later the new cluster is running. The setup was fairly straightforward; only a config change was needed for the init containers that register Kafka brokers in DNS, plus a short bash script to set up accounts in Zookeeper and generate certificates for the applications. One bit that was left dangling is setting up Kafka Connect to capture database change log events: it turns out the test database is not running in replica set mode and Debezium cannot get an oplog out of it. Backlog it and move on.
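The Debezium limitation is a documented one: its MongoDB connector reads the oplog, which only exists when mongod runs as a replica set. Even a single-node test database can satisfy this. A sketch of the likely fix (the replica set name is illustrative):

```yaml
# mongod.conf: enable the oplog by running as a single-node replica set
replication:
  replSetName: rs0   # illustrative name
```

After restarting mongod with this setting, a one-time `rs.initiate()` in the mongo shell brings the replica set (and the oplog) up.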
Now it’s time to prepare a scenario for the wheel of misfortune exercise. At Starship we run these to improve our understanding of the systems and to share troubleshooting techniques. It works by breaking some part of the system (usually in test) and having some unfortunate person try to troubleshoot and mitigate the problem. In this case I will set up a load test with hey to overload the microservice for route calculations. Deploy it as a Kubernetes job called “haymaker” and hide it well enough so that it does not immediately show up in the Linkerd service mesh (yes, evil). Later run the “wheel” exercise and take note of any gaps we have in playbooks, metrics, alerts, etc.
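Deploying such a load test as a Kubernetes Job might look roughly like the sketch below. The job name comes from the exercise; the container image and target URL are illustrative, not taken from our actual setup:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: haymaker                  # innocuous-sounding, by design
spec:
  template:
    spec:
      containers:
      - name: haymaker
        image: example/hey:latest # illustrative image providing the hey binary
        # hey flags: -z duration, -c concurrent workers
        args: ["-z", "10m", "-c", "50", "http://route-service:8080/route"]
      restartPolicy: Never
  backoffLimit: 0                 # one shot; do not retry the "attack"
```

Running it as a Job rather than a Deployment means it terminates on its own after the configured duration, which is convenient for a timed exercise.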
In the last few hours of the day, block all interruptions and try to get some coding done. I have reimplemented the Mongoproxy BSON parser as streaming asynchronous (Rust + Tokio) and want to figure out how well it works with real data. It turns out there is a bug somewhere in the parser’s guts and I need to add deep logging to flush it out. Find a great tracing library for Tokio and get carried away with it.
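The actual parser is Rust + Tokio, but the core streaming problem, splitting an incoming byte stream into complete BSON documents without waiting for the whole message, can be sketched in a few lines of Python (synchronous here, unlike the real thing). The only BSON fact it relies on is that every document begins with a little-endian int32 giving the document’s total length, including the length field itself:

```python
import struct

class BsonFramer:
    """Accumulate raw bytes and split out complete BSON documents."""

    def __init__(self):
        self.buf = bytearray()

    def feed(self, chunk: bytes) -> list:
        """Add a chunk from the wire; return any now-complete documents."""
        self.buf.extend(chunk)
        docs = []
        while len(self.buf) >= 4:
            (length,) = struct.unpack_from("<i", self.buf)
            # Minimum valid BSON doc is 5 bytes (int32 length + trailing 0x00).
            if length < 5 or len(self.buf) < length:
                break  # malformed or incomplete; a real parser would error/wait
            docs.append(bytes(self.buf[:length]))
            del self.buf[:length]
        return docs
```

Fields inside each document are then parsed separately; the framing layer above is where streaming bugs (off-by-one on partial reads, documents split across chunks) tend to hide.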
Disclaimer: the events described here are based on a true story. Not all of it happened on the same day. Some meetings and interactions with colleagues have been edited. We are hiring.