Introduction

In recent years, the field of the Internet of Things (IoT) has seen an increasing amount of attention. Traditionally, the distance between internet-connected clients and servers has been relatively large, both geographically and topologically. These clients, typically computers, would connect to a set of centralized servers to perform some action or retrieve resources. Nowadays, the internet is populated much more densely, with a growing number of smart connected IoT devices as clients. This scale-up pushes service providers to move their services closer to these clients in order to keep latency low and to avoid single points of failure. Usually this comes in the form of Content Delivery Networks (CDNs) or computations performed at the edge of the network. Instead of all clients connecting to a single centralized set of servers, often deployed on other continents, they now connect to regional deployments, sometimes even within the same city. This is not only beneficial for the clients: distributing the load across the globe also enables more fine-grained performance and thus cost optimizations. Typically, these edge deployments are not very powerful and are limited in their ability to scale, as their main purpose is not to replace the centralized servers but rather to complement them [1]. It is thus important to understand the performance characteristics of any software deployed on the edge to ensure good reliability and performance when serving potentially tens of thousands of clients.

Another field that has been widely explored in the wake of emerging cloud systems is that of engineering practices, especially around the DevOps methodology and the observability of distributed systems. In 2010, Sigelman et al. published a technical report about Dapper, the large-scale distributed tracing infrastructure developed at Google at that time [2]. Since then, a lot of development effort has gone into cloud native systems. Along with it, the tooling for distributed tracing evolved with the development of tools like OpenTelemetry1. Traditionally, distributed tracing has been applied inside private clusters to enable software engineers to accurately pinpoint performance bottlenecks and reliability issues in distributed systems by gaining insights into processes that span multiple services. In these cloud environments, a typical deployment would be used by a couple of hundred to a thousand clients simultaneously.

However, to our knowledge, not much research has gone into the combination of these two fields, namely highly distributed tracing of IoT devices. This combination would allow identifying reliability or performance problems in the infrastructure effectively, as these devices live the furthest from any centralized server. Since they often run in the hands of the customers themselves, and thus are the most susceptible to different kinds of failures, they offer a unique vantage point from which to observe infrastructure. To enable this, distributed tracing infrastructure needs to be deployed closer to the edge of the network and be made available to potentially tens of thousands of clients on the internet, an order of magnitude more than what a typical deployment handles today.

In this paper we want to explore the feasibility of this approach by benchmarking the maximum throughput of the OpenTelemetry Collector, the reference implementation of the OpenTelemetry standard. We are especially interested in the performance characteristics when deployed on commodity hardware equipped with limited resources, as is often the case in edge computing environments.

The rest of the paper is structured as follows. In Section 2 we provide an overview of the specifics of distributed tracing and OpenTelemetry in particular. Section 3 describes our study design, which we explain in more detail in Section 4. In Section 5 we present the results of our study, which we discuss in Section 6 before concluding the paper and giving a brief outlook on future work in Section 7.

Background

Whereas traditionally services were created in a monolithic architecture, many service providers now opt for a so-called microservice architecture. In this architecture, a single application is split into smaller functional units that can be scaled independently and dynamically. This often leads to a significant reduction in cost, but is usually considered more complex and thus more prone to failures. The advancements around cloud native microservice architectures also increased the need for better tooling to collect telemetry from applications deployed in this way. Being able to access the logs or metrics of a single service is no longer enough to debug issues, as services often depend on one another.

Distributed tracing is, at its core, a method to enable visibility into the performance and reliability of processes spanning multiple dependent services. Its most basic concept is that of a "trace". A trace is a directed, acyclic graph of spans. Spans, in turn, simply encode the duration a certain action takes. The whole system comes together by services recording spans for certain actions and sending these spans to the centralized tracing infrastructure together with a traceID. This traceID is either generated or passed between services when communication related to the same action happens. Each service is responsible for recording its own spans, sending them to a centralized tracing service, and finally passing the traceID on to other services. Finally, all the spans collected from different services can be correlated by their traceID and thus grant visibility into processes that span multiple services. These larger collections of spans are referred to as traces. The fact that only a traceID needs to be passed between services makes distributed tracing a lightweight addition to existing communication protocols. On the other hand, tracing, the process of recording spans within a service, comes at a cost for the service. Existing tooling is often intrusive in that it requires code to be added for each span that should be recorded.
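
To make this concrete, the following Go sketch shows how a service might record a nested span and pass the trace context on to a downstream service over HTTP using the OpenTelemetry SDK. It is only a minimal illustration: the service name, span names and URL are hypothetical, and the global tracer provider and propagator are assumed to be configured elsewhere.

    package main

    import (
        "context"
        "net/http"

        "go.opentelemetry.io/otel"
        "go.opentelemetry.io/otel/propagation"
    )

    func handleOrder(ctx context.Context) {
        // Obtain a tracer; a real setup would configure a tracer provider and
        // exporter elsewhere, otherwise spans are recorded as no-ops.
        tracer := otel.Tracer("order-service")

        // Root span for this action; a traceID is generated if ctx carries none.
        ctx, span := tracer.Start(ctx, "process-order")
        defer span.End()

        // Child span, nested under "process-order" through the context.
        ctx, child := tracer.Start(ctx, "validate-order")
        child.End()

        // Only the trace context (traceID, parent span ID) is passed downstream.
        req, _ := http.NewRequestWithContext(ctx, http.MethodPost, "http://payment-service/charge", nil)
        otel.GetTextMapPropagator().Inject(ctx, propagation.HeaderCarrier(req.Header))
        _, _ = http.DefaultClient.Do(req) // the downstream service records its own spans under the same traceID
    }

    func main() {
        handleOrder(context.Background())
    }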

Figure 1 depicts an example of a trace as it might have been recorded by two services (blue and orange).

Open source tools like Jaeger2, Zipkin3 or OpenTracing4 emerged over the last decade to solve similar problems around distributed tracing. Aside from these, a number of vendors sell products using proprietary protocols around the collection of traces. Many of them go beyond the core of distributed tracing by allowing developers to augment spans with additional metadata, such as attaching events or whole log lines to spans. However, as these different tools serve very similar use cases, the need for a single standard emerged to reduce the cost of adoption associated with supporting multiple different protocols and the risk of vendor lock-in.

OpenTelemetry1 is an attempt to consolidate these different standards for logs, metrics and traces into a single set of APIs, tools and SDKs. In this paper we focus only on the OpenTelemetry Collector (OTEL). Its purpose is to provide an interface to multiple different tracing protocols in a single application. Tools like Jaeger can then send all their traces to OTEL, which transforms them according to the standard and sends them off to another service for processing. While on the surface this looks like a simple pipeline for trace data, the collector goes beyond that by offering a set of so-called "span processors". These are processing stages that can perform additional transformations on the incoming data, such as adding or removing metadata from spans, sampling spans based on different attributes, or deriving new telemetry data such as metrics or logs from the incoming data.
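
As a rough mental model only (this is not the collector's actual processor API; the types and names below are invented for illustration), such a pipeline can be pictured as a chain of functions applied to each batch of spans, mirroring the receiver, processor and exporter stages:

    package main

    import "fmt"

    // Span is a deliberately simplified stand-in for an OTLP span.
    type Span struct {
        Name       string
        Attributes map[string]string
    }

    // A stage transforms a batch of spans and passes the result on.
    type stage func([]Span) []Span

    // pipeline chains stages in order, the way receiver -> processors -> exporter are chained.
    func pipeline(stages ...stage) stage {
        return func(batch []Span) []Span {
            for _, s := range stages {
                batch = s(batch)
            }
            return batch
        }
    }

    func main() {
        // A stage that tags every span, standing in for a span processor.
        tag := func(batch []Span) []Span {
            for i := range batch {
                batch[i].Attributes["env"] = "edge"
            }
            return batch
        }
        p := pipeline(tag)
        out := p([]Span{{Name: "demo", Attributes: map[string]string{}}})
        fmt.Println(out[0].Attributes["env"]) // edge
    }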

While OpenTelemetry provides SDKs and implementations for clients, as well as the collector, at the time of writing it does not provide any kind of analytics service that allows for storing, querying and analyzing the collected traces. For this, a third-party tool that accepts telemetry data according to the standard needs to be used.

Study Design

In order to assess the fitness of the OpenTelemetry Collector (OTEL) for an edge-deployed scenario, we conduct a maximum throughput benchmark. We are especially interested in the following three dimensions of its performance:

  • What is the number of concurrent clients that a single OTEL instance can handle?

  • At what sustained rate can this OTEL instance accept traces without producing errors?

  • Does OTEL introduce any additional latency, especially under different configurations?

Besides the performance under simple forwarding workloads we also take a look at two special configurations.

The first configuration is one in which OTEL has to mutate the incoming data. A sensible use case for this would be clients unintentionally sending unwanted metadata, for example sensitive data like customer credentials. In such a scenario, this unwanted metadata should be stripped from all spans as soon as possible to not violate any regulations.
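
In the collector this kind of mutation is configured declaratively through a processor; the following standalone Go sketch only illustrates the underlying operation, with the span representation and the list of sensitive keys invented for the example.

    package main

    import "fmt"

    // Span is a simplified span carrying only what we need here.
    type Span struct {
        Name       string
        Attributes map[string]string
    }

    // scrub removes all attributes whose keys are considered sensitive.
    func scrub(batch []Span, sensitive map[string]bool) []Span {
        for _, s := range batch {
            for key := range s.Attributes {
                if sensitive[key] {
                    delete(s.Attributes, key)
                }
            }
        }
        return batch
    }

    func main() {
        spans := []Span{{
            Name:       "login",
            Attributes: map[string]string{"user.password": "hunter2", "http.method": "POST"},
        }}
        scrub(spans, map[string]bool{"user.password": true})
        fmt.Println(spans[0].Attributes) // map[http.method:POST]
    }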

Another scenario we look at is sampling. The use case for this would be an attempt to limit the amount of incoming data. This can have different reasons, for example to reduce the load on systems by dropping data that is not interesting.

For this, our study is designed as follows. We deploy a single OTEL instance on a virtual machine running in Google Cloud. On another virtual machine, our benchmarking tool benchd is running. This tool creates a number of workers that each generate and send full traces to the collector. All workers run concurrently in their own thread and do not influence each other in any way. The collector in turn is configured to send all traces back to benchd to measure the roundtrip time for each trace sent. The number of clients created as well as various parameters of the generated traces can be configured and randomized.

Overall, we run four different kinds of benchmarks, which we call "benchmark plans":

  1. The basic plan is a raw performance benchmark with minimal configuration and randomization. It is used to establish a baseline, as well as verify the setup under maximum load.

  2. The realistic plan generates a realistic load profile with a variable number of spans per trace, as well as extra attributes added to each trace.

  3. The mutate plan is one in which the clients add a special attribute to a fixed percentage of traces. A matching configuration is deployed in OTEL that causes it to drop the special attribute.

  4. The sample plan is the same as the realistic one, but now OTEL is configured to sample only 25% of all incoming traces. Specifically, we perform what is commonly referred to as "head sampling" or "probabilistic sampling", which samples traces at random as they come in. This kind of sampling bases its decision solely on the traceID, which makes it very efficient, especially as more data with the same traceID arrives (see the sketch after this list). Its counterpart, "tail sampling", works by making a decision based on a fully received trace instead. Depending on the parameters, it can sample based on different metadata found throughout the trace. This allows for fine-grained control over the sampled data. For example, the collector can be configured to always sample all traces that contain an error. However, this approach incurs a severe performance cost, as decisions can only be made on a full trace, which has to be buffered in memory while more data arrives (hence the name "tail sampling").
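
A minimal sketch of such a traceID-based sampling decision is shown below. It assumes a hash-like comparison of part of the traceID against a threshold derived from the sampling ratio; the collector's actual probabilistic sampler uses a different hashing scheme, so this only illustrates the principle that the decision is deterministic per traceID.

    package main

    import (
        "encoding/binary"
        "fmt"
        "math"
    )

    // sampled decides from the traceID alone whether a trace is kept, so every
    // span carrying the same traceID gets the same decision without buffering.
    func sampled(traceID [16]byte, ratio float64) bool {
        // Interpret part of the traceID as a number and compare it to a threshold.
        v := binary.BigEndian.Uint64(traceID[8:])
        return float64(v) < ratio*float64(math.MaxUint64)
    }

    func main() {
        id := [16]byte{0xde, 0xad, 0xbe, 0xef}
        fmt.Println(sampled(id, 0.25)) // keeps roughly 25% of traceIDs
    }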

For each of these workloads, the benchmark consists of two phases. In the first phase we create a variable workload to find the supposed maximum throughput of the system. In this case, benchd creates new clients at a fixed rate, specifically 5 clients every second. The benchmark runs until OTEL cannot reliably accept traces anymore. The indicators for this are that the OTEL process itself crashes (due to running out of memory, for example), becomes too slow to accept or forward traces in a timely manner, or actively rejects new incoming traces.
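
Under the assumption that the ramp-up is driven by a simple timer, phase one can be pictured as in the sketch below; the worker function and the stop condition are placeholders, not benchd's actual code.

    package main

    import (
        "fmt"
        "time"
    )

    // worker is a placeholder for a benchd worker that generates and sends traces.
    func worker(id int) { /* generate traces, send, wait, log */ }

    // collectorUnhealthy stands in for the stop condition described above
    // (OTEL crashes, stalls, or rejects traces). Here it simply caps the run.
    func collectorUnhealthy(activeWorkers int) bool { return activeWorkers >= 20 }

    func main() {
        const perSecond = 5 // new workers started every second
        active := 0
        ticker := time.NewTicker(time.Second)
        defer ticker.Stop()

        for range ticker.C {
            for i := 0; i < perSecond; i++ {
                active++
                go worker(active)
            }
            fmt.Println("active workers:", active)
            if collectorUnhealthy(active) {
                break
            }
        }
    }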

In the second phase, we create a static workload with a fixed number of concurrent clients sending traces. The exact number is based on the maximum we saw in phase one, shortly before OTEL stops behaving as expected. The workload in this phase then runs for a fixed time of thirty minutes in order to assess the sustainability of the peak performance from the previous phase.

Implementation Details

In this section we explain in detail the implementation of our benchmark. Specifically, we describe the intrinsics of the benchmarking infrastructure as well as the design of benchd, our benchmarking client.

Benchmarking Infrastructure

As described in Section 3, our benchmark runs on top of virtual machines in Google Cloud. For this we use Terraform5 to create the machines and all supporting infrastructure such as firewall rules. It is also used to generate and deploy all necessary configurations. Unlike what is common in many cloud environments, we opted not to deploy OTEL in a container or use any orchestration tool such as Kubernetes, so as not to impair the performance of the service, which could skew our benchmarking results [3].

In total, we deploy three virtual machines inside a single region and availability zone. We use version 0.43 of the OpenTelemetry Collector in the “contrib” distribution6. This distribution is special in that it contains the set of processors needed to run the collector in the sampling and mutating configuration. OTEL is running on commodity hardware, an instance of type e2-standard-2 with 2 vCPU and 8 GiB of memory, running Debian 11. The only performance optimization we apply is to raise the file descriptor limits for the OTEL process to allow it to make the best use of the hardware.

The instance running our benchmarking client benchd uses an e2-standard-4 with 4 vCPU and 16 GiB of memory. This is to ensure that benchd does not become a bottleneck in our measurements. Similarly, we raise the file descriptor limits here.

Additionally, we deploy a third instance running monitoring software that collects various metrics from both the instances running OTEL and benchd every second. These metrics include system metrics like CPU time, memory usage, disk utilization and network utilization, as well as process metrics like the current number of active clients, the current send rate and latency in benchd, or accepted and rejected spans in OTEL. This helps us identify failures or bottlenecks that could invalidate our benchmark early. It also enables us to run quick exploratory benchmarks to test assumptions without having to analyze the results file created by benchd. Lastly, we can correlate the system metrics with certain behaviour in OTEL.

The Benchmarking Client

As mentioned in Section 3, our benchmarking client benchd runs a number of concurrent clients we call workers. Each worker is essentially a loop that performs the following four operations: 1) generate a trace, 2) flush the trace to the collector, 3) wait for the trace to return back to benchd, and 4) log the results. Traces are generated by recursively calling a function that generates spans. This causes all spans to be nested, creating a "parent-child" relationship. While this is a realistic model for single-threaded applications, it eliminates the possibility of creating "sibling spans", which are created in parallel with one another. Each trace sent produces exactly one log line containing the following information: a timestamp for the log line itself, the ID of the worker, the final state of the request, the parameters of the trace that was generated, the time the current loop iteration started, and finally the start and end times of both the sending and receiving operations. All timestamps are collected with millisecond precision. Aside from that, each request can finish in one of several states: success, send timeout, send error, receive timeout, worker shutdown.
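
Since the internals of benchd are only described at this level, the following sketch merely illustrates the worker loop and the recursive trace generation outlined above; all names, signatures and the simplified span type are invented for the example.

    package main

    import (
        "fmt"
        "math/rand"
        "time"
    )

    // Span mirrors the nested "parent-child" structure produced by recursive generation.
    type Span struct {
        Name     string
        Duration time.Duration
        Child    *Span
    }

    // genTrace builds a trace by calling itself recursively, so every span is
    // nested under the previous one; sibling spans cannot be expressed this way.
    func genTrace(spans int, maxDur time.Duration) *Span {
        if spans == 0 {
            return nil
        }
        return &Span{
            Name:     fmt.Sprintf("span-%d", spans),
            Duration: time.Duration(rand.Int63n(int64(maxDur))),
            Child:    genTrace(spans-1, maxDur),
        }
    }

    // worker runs the generate -> flush -> wait -> log loop a fixed number of times.
    func worker(id, iterations int, send, await func(*Span) error) {
        for i := 0; i < iterations; i++ {
            start := time.Now()
            trace := genTrace(1+rand.Intn(10), 100*time.Millisecond)

            state := "success"
            if err := send(trace); err != nil {
                state = "send error"
            } else if err := await(trace); err != nil {
                state = "receive timeout"
            }
            // One log line per trace: timestamp, worker ID, final state, iteration start.
            fmt.Println(time.Now().UnixMilli(), id, state, start.UnixMilli())
        }
    }

    func main() {
        send := func(*Span) error { return nil }  // stand-in for flushing to the collector
        await := func(*Span) error { return nil } // stand-in for waiting for the trace to return
        worker(1, 3, send, await)
    }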

This fine-grained logging not only allows us to perform an analysis along multiple dimensions, but can also be used to recreate the results easily by feeding the result log back into benchd. In order to get comparable and meaningful results, we resample all log lines to a granularity of 1 second.
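
As an illustration of this resampling step, events can simply be grouped by their timestamps truncated to full seconds; the sketch below assumes the timestamps have already been parsed from the log.

    package main

    import (
        "fmt"
        "time"
    )

    // resample counts how many log events fall into each 1 second bucket.
    func resample(timestamps []time.Time) map[time.Time]int {
        buckets := make(map[time.Time]int)
        for _, ts := range timestamps {
            buckets[ts.Truncate(time.Second)]++
        }
        return buckets
    }

    func main() {
        now := time.Now()
        events := []time.Time{now, now.Add(120 * time.Millisecond), now.Add(1100 * time.Millisecond)}
        for sec, n := range resample(events) {
            fmt.Println(sec.Format(time.RFC3339), n)
        }
    }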

Results

In this section we present the results for each of the four benchmark plans as described in Section 3. Table 1 shows the parameters which were used to generate the workload for each plan. All of these values are an upper bound for the respective parameter. At runtime, each worker randomly selects a number for each parameter between 0 and the configured number. These numbers are then logged by the worker.

Table 1: Benchmark Plan Parameters
Plan      | Spans | Span Duration | Extra Attributes | Risk
basic     | 10    | 100ms         | 0                | 0%
realistic | 20    | 250ms         | 10               | 0%
mutate    | 20    | 250ms         | 10               | 50%
sample    | 20    | 250ms         | 10               | 0%

For these parameters, Spans refers to the maximum number of spans contained within a single trace. Span Duration is the maximum duration of a single span within the trace. Extra Attributes refers to the maximum number of extra attributes added to each span in the trace. Finally, Risk describes the percentage with which a "risky attribute" is added to a random span within the trace. For example, in the mutate plan, each trace has a 50% chance of containing the risky attribute and thus being mutated by the collector.
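
Read together with Table 1, the per-trace randomization can be sketched as follows; the type and field names are invented for illustration and not taken from benchd.

    package main

    import (
        "fmt"
        "math/rand"
    )

    // Plan holds the upper bounds from Table 1.
    type Plan struct {
        MaxSpans      int
        MaxDurationMs int
        MaxExtraAttrs int
        Risk          float64 // probability of adding the risky attribute
    }

    // TraceParams are the values actually used for one generated trace.
    type TraceParams struct {
        Spans      int
        DurationMs int
        ExtraAttrs int
        Risky      bool
    }

    // randomize draws each parameter uniformly between 0 and its configured bound.
    func randomize(p Plan) TraceParams {
        return TraceParams{
            Spans:      rand.Intn(p.MaxSpans + 1),
            DurationMs: rand.Intn(p.MaxDurationMs + 1),
            ExtraAttrs: rand.Intn(p.MaxExtraAttrs + 1),
            Risky:      rand.Float64() < p.Risk,
        }
    }

    func main() {
        mutate := Plan{MaxSpans: 20, MaxDurationMs: 250, MaxExtraAttrs: 10, Risk: 0.5}
        fmt.Printf("%+v\n", randomize(mutate))
    }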

Basic Plan

The workload generated in the basic plan is very synthetic and small. Fig. 2 shows the number of active workers in blue and the traces sent by these workers over time. The red dashed line marks the first occurrence of an error during the benchmark. We can see this happening shortly before the 20 minute runtime mark. At this point, about 5800 workers are sending at a cumulative rate of about 2500 traces per second.

We can also see that for the first 4 minutes the sending rate steadily increases, until it reaches a tipping point at just over 3000 workers and a sending rate of 1500 traces per second. After that, the rate becomes increasingly unstable, fluctuating between 2000 and 2500 traces per second.

Fig. 3 and Fig. 4 depict the trace rate and the send and receive latencies, respectively. We can see that while the median send latency is relatively stable at under 20ms, the 90th and 95th percentile values are significantly higher, at about 60ms and 80ms. In contrast, the receive latency steadily increases, even when the sending rate itself begins to stagnate. This indicates that the collector is fully saturated at this point and that new traces are not passed on at the same rate at which they arrive.

Realistic Plan

The realistic plan creates a workload that is less synthetic than the basic one, with twice as many spans per trace, a higher span duration and additional attributes added to the traces. Fig. 5 shows the number of active workers and the trace rate over time. We can see that the first error again appears shortly before the 20 minute runtime mark. It is also visible that, compared to the basic plan in Fig. 2, the trace rate increases more slowly. While the basic plan achieves a throughput of about 2000 traces per second after about 4 minutes, the realistic plan needs about 12 minutes. At the same time, the trace rate in Fig. 5 remains stable for a much longer time. As the number of workers increases at the same rate as in the basic plan, we can conclude that the instability in trace rate we see later in the benchmarks does not stem from the number of workers sending traces, but is purely dependent on the trace rate itself.

Fig. 6 shows the trace rate and send latency in different percentiles for the realistic plan. We can see that compared to the basic plan, the send latency at the end of the run is about twice as high for the 90th and 95th percentile, while the median latency is about the same at around 20ms to 25ms.

The trace rate compared with the receive latency for the realistic plan is depicted in Fig. 7. Compared to the basic plan, the receive latency is significantly lower, at about 1000ms to 1200ms in the 95th percentile at a rate of 2000 traces per second, versus 1500ms to 2000ms for the basic plan.

Mutate Plan

As shown in Table 1, the mutate plan is identical to the realistic plan, with the only difference being that 50% of traces contain a special attribute that is filtered out by the collector. Fig. 8, 9, and 10 again show the number of active workers, the trace rate, the send latency, and the receive latency. We can see that the numbers for all of these metrics are nearly identical to those of the realistic plan. This makes sense, because the workload is identical, too. On the other hand, one might expect that mutating the traces before sending them back to the benchmarking client would incur some kind of processing overhead that should be visible in a higher receive latency. Fig. 10, however, shows that the 95th percentile of the receive latency barely exceeds 1000ms, while its counterpart in the realistic plan in Fig. 7 clearly shows peaks about 20% higher than that.

Sustained Load Comparison

As described in Section 3, for each of the plans, we execute a second, 30 minute long run to verify the sustainability of the detected peak performance. In Fig. 2, 5 and 8 we can see that the peak load for all three plans is at about 5000 workers.

Fig. 11 shows the trace rates of each plan when run for 30 minutes using 5000 workers. In order to denoise the output, we apply a 60 second moving average. We can clearly see that the maximum sustained throughput of the basic plan is significantly higher, at a rate of 2200 to 2600 traces per second. As already seen before, the realistic and mutate plans are nearly identical. However, we can also see that the mutate plan appears to achieve a slightly higher trace rate on average.

Fig. 12 shows the same underlying data as Fig. 11. This time, however, we apply a 30 second moving average window. We can now see a periodic pattern with a cycle period of about 1 minute. Interestingly, this pattern is perfectly synchronized across all runs.

Sample Plan

Due to the architecture of our benchmarking client, we consider the sample plan separately from the others. In this configuration, the collector drops 75% of all incoming traces. Only the 25% that are sampled actually return to benchd. Since benchd has no way to identify which traces are dropped by the collector, the only way for it to detect this is to run into a receive timeout while waiting for a trace to return. Because of this, metrics like trace throughput and receive latency cannot be compared to other configurations without caveats.

Instead, we first look at the rate of successfully received traces compared to the number of receive timeouts. Fig. 13 shows these rates in the 30 minute sustained workload with 5000 active workers. We can see that the number of receive timeouts, at about 1800 traces per second, is three times as high as the number of successfully received traces. This matches exactly the expectation that 75% of sent traces result in a receive timeout because they are not sampled. We can also calculate that the resulting overall send throughput is about 2400 traces per second. In this regard, the sample plan performs better than the realistic and mutate plans, but slightly worse than the basic plan.

Fig. 14 shows the rate of successfully received traces and their latency in the same sustained workload scenario. Comparing the receive latency of this run at a receive rate of 500 to 600 traces per second with all other plans, we can see that the sample plan performs significantly better, with the 99th percentile being at least ten times lower than in all other plans.

Discussion

From the results presented in Section 5 we can learn several things about the OpenTelemetry Collector.

Firstly, comparing the basic and realistic plans shows that the maximum sustainable throughput of our system under test primarily depends on the size of the payload. Contrary to our assumption, performing additional transformations on the traces does not incur a measurable performance cost. In fact, the opposite appears to be true, as the configuration that mutates the input traces produced a slightly higher throughput. However, the results in this regard are far from exhaustive. In our benchmark, we perform only one mutation on the input data, by removing a single attribute. We deem this a realistic scenario, but performing more extensive processing might still yield a very different result that is in line with the assumption that additional processing comes at a cost.

Secondly, our results show that under high load the collector can introduce up to 2 seconds of latency when forwarding traces. This should be considered in architectures where traces need to be made available in real time. It is especially important in architectures where a hierarchy of collectors is used, as latency can quickly accumulate at each step.

On the other hand, our results clearly show that sampling is an effective way to reduce the latency with which the collector forwards received traces. This should not be surprising, as parts of the data are simply discarded here. In many situations this can be a viable trade-off though, as often only a small fraction of traces is interesting. As already briefly mentioned in Section 3, tail sampling could be a solution to this problem by making sure to sample based on attributes deemed interesting. However, this decision should be heavily informed by the overall tracing architecture. It is usually not advisable to perform tail sampling as the first step in a distributed tracing pipeline, as traces are not yet complete at this stage and the sampling decision can be influenced by related spans from other applications.

A surprising result is the cyclic pattern we identified when measuring sustained throughput. Neither the documentation of the collector nor its codebase indicates any kind of buffering or batching of results taking place [4]. The shape of the measurement supports this finding. Batching or buffering usually works by filling up a fixed-size buffer and clearing it all at once. This would produce a shape similar to a sawtooth with a fixed amplitude. Our measurements, on the other hand, show a pattern more akin to a sine wave with varying amplitude.

Our results also reveal shortcomings in the study design, specifically when using a sampling configuration. Because of the way benchd is designed, it cannot detect that a trace was discarded other than by timing out while waiting for a response. This directly influences each worker's output speed, as a worker only sends a new trace after the previous one has returned or timed out. An improved architecture could decouple the load generation and receiving components so that they work independently. Each trace sent would be logged along with its traceID. Independently, all incoming traces would also be logged with their traceID. This would shift the work of correlating sent and received traces to a later stage in the process, but would presumably yield a more accurate workload generation.

Conclusion and Future Work

As the internet continues to grow in new ways, with an ever increasing number of smart devices being used every day, so does the need for tooling that can help developers measure the performance and reliability of these devices. Distributed tracing, with its focus on giving a holistic view of the dynamics of distributed systems, lends itself especially well to this. By performing a maximum throughput benchmark under multiple configurations that represent real-world use cases, we were able to uncover several performance characteristics of the OpenTelemetry Collector, the reference implementation of the OpenTelemetry standard. First, we show that the amount of input data per trace has the biggest influence on the maximum sustainable throughput. Secondly, we show that mutations, at least on a small scale, do not negatively impact the performance of the collector. Contrary to our assumptions, the collector performed slightly better when removing a certain attribute from incoming traces. Furthermore, we show that sampling can be a viable trade-off when compute resources for the collector are scarce. Not only is the collector able to accept traces at a higher rate when sampling, we also see a significant speedup in the time it takes for the collector to send sampled traces on. Finally, we show that under all configurations the collector can sustain the peak load of 5000 clients sending between 2000 and 3000 traces per second, with up to 20 spans each, for at least 30 minutes while running on commodity hardware with 2 vCPU and 8 GiB of memory.

All these results lead us to conclude that the OpenTelemetry Collector is a good choice for deployment on the edge of the network, an environment that is often resource constrained and requires high performance to cope with the load created by thousands of clients.

To paint a clearer picture, future research can be conducted by benchmarking a wider range of configurations. Especially different sampling strategies and data mutations should be explored further. Besides that, entirely new processing methods, such as deriving metrics from incoming data, filtering data based on regular expression matches against attributes, grouping, batching, and routing, are worth looking at. All of these can have valid use cases in an edge deployment and can significantly impact the performance of the collector. Another opportunity for future research is to eliminate the shortcomings of the current design of our benchmarking client. Namely, this could include splitting the client into two functional pieces that are solely responsible for sending and receiving telemetry data, respectively, as described in Section 6. Two other possible improvements are the ability to send "sibling spans" as described in Section 4, as well as sending incomplete traces or related spans from multiple clients. Lastly, we deem it worthwhile to also benchmark different tools such as Zipkin or Jaeger. OpenTelemetry has emerged as a consolidating standard, however this should not be the only deciding factor when designing a system that enables developers to perform highly distributed tracing on smart devices.

References

[1] W. Yu et al., “A Survey on the Edge Computing for the Internet of Things,” IEEE Access, vol. 6, pp. 6900–6919, 2018, doi: 10.1109/ACCESS.2017.2778504.

[2] B. H. Sigelman et al., "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure," Google Technical Report, 2010.

[3] B. Ruan, H. Huang, S. Wu, and H. Jin, “A Performance Study of Containers in Cloud Environment,” in Advances in Services Computing, vol. 10065, G. Wang, Y. Han, and G. Martínez Pérez, Eds. Cham: Springer International Publishing, 2016, pp. 343–356. doi: 10.1007/978-3-319-49178-3_27.

[4] The OpenTelemetry Authors, "OpenTelemetry Documentation," Feb. 2022. Available: https://opentelemetry.io/docs/


  1. https://opentelemetry.io ↩︎

  2. https://www.jaegertracing.io ↩︎

  3. https://zipkin.io ↩︎

  4. https://opentracing.io ↩︎

  5. https://www.terraform.io/ ↩︎

  6. https://github.com/open-telemetry/opentelemetry-collector-contrib ↩︎