Avoid Data Bottlenecks – Data Mesh at Starship | by Taavi Pungas | Starship Technologies


Taavi Pungas

One gigabyte of data for a bag of groceries. This is what you get when you make robotic deliveries. That’s a lot of data, especially if you repeat it more than a million times, as we have.

But the rabbit hole goes deeper. The data is also incredibly diverse: robot sensor and image data, user interactions with our apps, transactional data from orders, and more. The use cases are equally diverse, ranging from training deep neural networks to creating polished visualizations for our business partners, and everything in between.

So far, we’ve been able to handle all of this complexity with our centralized data team. By now, however, continued exponential growth has led us to look for new ways of working to keep up the pace.

We found the data mesh paradigm to be the best way forward. I will describe Starship’s take on the data mesh below, but first, let’s briefly summarize the approach and why we decided to go with it.

What is a data mesh?

The data mesh framework was first described by Zhamak Dehghani. The paradigm rests on the following core concepts: data products, data domains, data platform, and data governance.

The main purpose of the data mesh framework is to help large organizations eliminate data engineering bottlenecks and deal with complexity. It therefore addresses many details that are relevant in an enterprise setting, from data quality, architecture, and security to governance and organizational structure. As it stands, only a few companies have publicly announced that they follow the data mesh paradigm, all of them large multi-billion-dollar enterprises. Nevertheless, we believe it can be applied successfully in smaller companies as well.

Data mesh at Starship

Do the data work close to the people who produce or use the data

To run hyperlocal robotic delivery marketplaces around the world, we need to turn a wide variety of data into valuable products. The data comes from robots (such as telemetry, routing decisions, ETAs), merchants and customers (through their apps, orders, offerings, etc.), and all operational aspects of the business (from brief remote operator tasks to the global logistics of spare parts and robots).

This diversity of use cases is one of the main reasons we were attracted to the data mesh approach: we want to do the data work very close to the people who produce or use the data. By following data mesh principles, we hope to meet our teams’ diverse data needs while keeping central oversight reasonably light.

Since Starship is not yet at enterprise scale, it is not practical for us to implement every aspect of a data mesh. Instead, we have settled on a simplified approach that makes sense for us now and puts us on the right path for the future.

Data products

Define what your data products are, each with an owner, an interface, and users

Applying product thinking to our data is the foundation of the whole approach. We think of anything that exposes data to other users or processes as a data product. It can expose its data in any form: as a BI dashboard, a Kafka topic, a data warehouse view, a response from a predictive microservice, and so on.

A simple example of a data product at Starship could be a BI dashboard for site leads to track their site’s business volume. A more elaborate example would be a self-service pipeline that lets robot software engineers send any kind of driving information from robots to our data lake.

However, we do not treat our data warehouse (actually a Databricks lakehouse) as a single product, but as a platform that supports a number of interconnected products. Such granular products are usually owned by the data scientists and engineers who build and maintain them, not by dedicated product managers.

The product owner is expected to know who their users are and what needs the product addresses for them, and based on that, to define and live up to the quality expectations for the product. Perhaps as a result, we have begun to pay more attention up front to interfaces, the components that are crucial for usability but laborious to change.

Most importantly, understanding the users and the value each product creates for them makes it much easier to prioritize between ideas. This is important in a startup context, where you need to move quickly and do not have time to perfect everything.
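
As a purely illustrative sketch (not Starship’s actual tooling), the idea that every data product has an owner, an interface, and users could be captured in a small catalogue like the one below. The class, the field names, the helper function, and the example entries are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """One entry in a hypothetical data product catalogue."""
    name: str
    owner: str        # person accountable for the product's quality
    interface: str    # how the data is exposed, e.g. "BI dashboard"
    users: list[str] = field(default_factory=list)


def missing_fields(product: DataProduct) -> list[str]:
    """Return the names of the required attributes that are still empty."""
    problems = []
    if not product.owner.strip():
        problems.append("owner")
    if not product.interface.strip():
        problems.append("interface")
    if not product.users:
        problems.append("users")
    return problems


# Hypothetical entries, loosely modelled on the two examples in the text.
catalogue = [
    DataProduct("site_business_volume", "alice", "BI dashboard", ["site leads"]),
    DataProduct("robot_driving_data_pipeline", "bob", "self-service pipeline"),
]

for product in catalogue:
    print(product.name, missing_fields(product))
```

A check like this only confirms that the three attributes are filled in; whether the owner really understands the users’ needs remains a human question.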

Data domains

Group your data products into domains that reflect the organizational structure of the company

Before we became aware of the data mesh model, we had been successfully using a format of lightly embedded data scientists at Starship for a while. In effect, some key teams had a data team member working with them part-time, whatever that meant in each particular team.

We went on to define data domains in line with our organizational structure, this time taking care to cover every part of the company. After mapping data products to domains, we assigned a data team member to curate each domain. This person is responsible for looking after the entire set of data products in the domain. Some of these are owned by the same person, some by other engineers on the domain team, and some even by other data team members (e.g. for resourcing reasons).

There are many things we like about our domain setup. First and foremost, every area of the company now has a person looking after its data architecture. Given the subtleties of each domain, this is possible only because we have divided up the work.

Creating structure in our data products and interfaces has also helped us make better sense of our data world. For example, in a situation with more domains than data team members (currently 19 vs. 7), we are now doing a better job of ensuring that each of us works on an interrelated set of topics. And we now realize that to alleviate growing pains, we need to minimize the number of interfaces used across domain boundaries.
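
To make the point about interfaces crossing domain boundaries concrete, here is a minimal sketch that counts such interfaces. It assumes that products and their dependencies are recorded somewhere in a simple form. The product names, domain names, and dependencies are invented for illustration.

```python
from collections import Counter

# Hypothetical mapping of data products to domains.
product_domain = {
    "robot_telemetry": "autonomy",
    "eta_model": "routing",
    "order_facts": "marketplace",
    "order_dashboard": "marketplace",
    "site_dashboard": "operations",
}

# Hypothetical dependencies as (consumer product, producer product) pairs.
dependencies = [
    ("eta_model", "robot_telemetry"),
    ("order_dashboard", "order_facts"),
    ("site_dashboard", "order_facts"),
    ("site_dashboard", "eta_model"),
]


def cross_domain_interfaces(deps, domains):
    """Count dependencies whose producer and consumer sit in different domains."""
    counts = Counter()
    for consumer, producer in deps:
        pair = (domains[producer], domains[consumer])
        if pair[0] != pair[1]:
            counts[pair] += 1
    return counts


crossings = cross_domain_interfaces(dependencies, product_domain)
print(sum(crossings.values()), "of", len(dependencies), "interfaces cross a domain boundary")
for (producer_domain, consumer_domain), n in sorted(crossings.items()):
    print(f"{producer_domain} -> {consumer_domain}: {n}")
```

Tracking a number like this over time is one possible way to notice when domain boundaries are drawn in the wrong place.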

Finally, a more subtle bonus of using data domains: we now feel that we have a recipe for dealing with all sorts of new situations. Whenever a new initiative comes along, it is much clearer to everyone where it belongs and who should run with it.

There are also some open questions. While some domains naturally lean towards exposing source data and others towards consuming and transforming it, there are some that do a substantial amount of both. Should we split them up when they get too big? Or should we have subdomains within the larger ones? We will have to make these decisions down the road.

Data platform

Empower the people building your data products by standardizing without centralizing

The goal of the data platform at Starship is straightforward: to make it possible for a single data person (usually a data scientist) to take care of a domain end to end, that is, to keep the central data platform team out of the day-to-day work. This requires providing domain engineers and data scientists with good tooling and standard building blocks for their data products.

Does this mean you need a full data platform team for the data mesh approach? Not really. Our data platform team consists of a single data platform engineer, who in parallel spends half their time embedded in a domain. The main reason we can be so lean in data platform engineering is our choice of Spark + Databricks as the core of our data platform. Our earlier, more traditional data warehouse architecture placed a significant data engineering overhead on us because of the diversity of our data domains.

We found it useful to make a clear distinction within the data stack between the components that are part of the platform and everything else. Some examples of what we provide to domain teams as part of our data platform:

  • Databricks + Spark as a working environment and a versatile compute platform;
  • One-liner functions for data ingestion, such as from Mongo collections or Kafka topics;
  • An Airflow instance for scheduling data pipelines;
  • Templates for building and deploying predictive models as microservices;
  • Cost tracking of data products;
  • BI and visualization tools.
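
The helper functions themselves are not shown in this post. Purely as an illustration of what a one-liner ingestion helper could look like, the sketch below wraps Spark’s built-in Kafka batch source. The function name, the default broker address, and the choice of options are hypothetical, and it assumes a SparkSession with the Kafka connector available.

```python
def read_kafka_topic(spark, topic, bootstrap_servers="kafka:9092"):
    """Load a whole Kafka topic as a batch Spark DataFrame.

    Hypothetical helper: `spark` is an existing SparkSession, and the
    default broker address is a placeholder, not a real endpoint.
    """
    return (
        spark.read.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        .option("startingOffsets", "earliest")  # read from the start of the topic
        .load()
    )


# With such a helper in place, a domain data scientist would only write:
# orders = read_kafka_topic(spark, "orders")
```

The design idea is that connection details live in one shared place, so each domain only has to name the topic it wants.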

As a general approach, our goal is to standardize as much as makes sense in our current context, even the bits we know will not stay standardized forever. As long as it helps productivity right now and does not centralize any part of the process, we’re happy. And of course, some elements are currently missing from the platform entirely. For example, tooling for data quality assurance, data discovery, and data lineage is something we have left for the future.

Data governance

Strong personal ownership supported by feedback loops

Having fewer people and teams is actually an asset in some aspects of governance; for example, it is much easier to make decisions. On the other hand, our core governance question is also a direct consequence of our size. If there is a single data person in every domain, they cannot be expected to be an expert in every possible technical aspect. However, they are the only person with a detailed understanding of their domain. How can we maximize the likelihood of them making good choices within their domain?

Our answer: through a culture of ownership, discussion, and feedback within the team. We have borrowed liberally from the management philosophy at Netflix and cultivated the following:

  • Personal responsibility for the outcome (of one’s products and domains);
  • Seeking different opinions before making decisions, especially those that affect other domains;
  • Soliciting feedback and code reviews both as a quality mechanism and as an opportunity for personal growth.

We have also made a few specific agreements on how we approach quality, written down our best practices (including naming conventions), and so on. However, we believe that good feedback loops are the key to turning the guidelines into reality.

These principles also apply outside the “building” work of our data team, which has been the focus of this blog post. Clearly, there is much more to how our data scientists create value in the company than providing data products.

One final thought about governance: we will keep iterating on the way we work. There will never be a single “best” way to work, and we know we have to adapt over time.

Final words

That’s it! These were the four core data mesh concepts as applied at Starship. As you can see, we’ve found an approach to the data mesh that suits us as a nimble growth-stage company. If this sounds appealing in your context, I hope reading about our experience has been helpful.

If you would like to join us, check out our careers page for a list of open positions. Or check out our YouTube channel to learn more about our world-leading robotic delivery service.

Contact me if you have any questions or concerns and let’s learn from each other!


