How To Use OpenTelemetry in Laravel 11

with automatic (partial) instrumentation on Debian

 

Introduction

Laravel is a popular open source PHP framework for rapid web application development. It offers many tools out of the box, such as simple API development, full-stack development, background job processing, authentication and authorisation, and a powerful object-relational mapper (ORM) for database interaction. It prefers convention over configuration to let you focus on coding and getting things done.

Observability is how we gain insight into a program’s internal state. We create, or instrument, three signals called metrics, traces, and logs that work together to build an understanding of how applications perform and behave. Among the information we can obtain are which API endpoints are slow and which ones are most popular. Errors can happen, and when they do, we want to know exactly where, even if our application is split into microservices. All logs are centralised and easily searchable, making debugging a production application a breeze.

One such observability tool we can use is an open source project called OpenTelemetry. It is a set of tools, APIs, and SDKs that make up this open observability ecosystem. As a bonus, it is vendor-neutral by nature, which means you are free to mix and match between tools and companies. Fortunately for us, its SDKs include PHP. The OpenTelemetry PHP section details what you need in order to install OpenTelemetry in your environment. It lists three requirements: PHP 8.0+, pecl, and composer. Under automatic instrumentation, there is a series of commands to add OpenTelemetry to the Slim framework and to generic PHP applications. There aren’t any specific instructions for Laravel, so we need to rely on bits and pieces of information scattered around the official documentation and other places.

This blog post intends to be a one-page, best-effort reference for installing OpenTelemetry in a Laravel application, complete with a dashboard in Grafana. I have been writing this since March this year, and I felt vindicated showing the long form (as I often have in this blog) when I saw a recent survey saying that more than 60% of users want detailed tutorials and reference implementations to instrument OpenTelemetry into their application.

We will build a fully instrumented Laravel application from scratch, step by step. Then we explore how to make sense of all the data (signals) using a Grafana dashboard. However, I did not achieve everything I wanted. Running the server with Swoole does not transmit traces reliably. The OpenTelemetry metrics library does not have internal state storage, which requires us to use another library to compensate. Nevertheless, this is a good starting point to build upon.

By the end of this post, you will have a basic starting point towards observability and I hope you will gain an appreciation of how useful OpenTelemetry can be for your Laravel application.

Table of Contents

Architecture

Before diving into automatic instrumentation, let us view a high-level architecture of what we will be building. Tracing is instrumented automatically using the official OpenTelemetry PHP SDK. Trace signals are sent from Laravel to otel-collector, where they are processed and then emitted to a supported backend storage and viewer, in our case Jaeger.

We choose Prometheus, a time-series database, to store metrics data. Because a typical PHP program follows a share-nothing request lifecycle, metrics data such as counters cannot be remembered (stored) between requests, which causes counters to always reset to zero. The official OpenTelemetry PHP SDK does not come with a storage adapter (nor does it intend to). For this reason, we are going to instrument metrics using Prometheus’ own library instead, which supports storing data in APCu or Redis. Looking at the architecture diagram below, you will also notice that the arrow points from right to left because Prometheus pulls metrics data instead of Laravel sending data to Prometheus.

For logs, we use Promtail to tail the log files created by Laravel and send them to Loki, a log aggregation system. The OpenTelemetry SDK already comes with log support and can also send logs to Loki. That may be the preferable way, because the fewer components we need to deal with, the better. I am demonstrating one more way just to show you can be flexible in implementing this.

Grafana is our visualisation platform. It gives us a dashboard that not only provides a quick glance at how our application is doing, but also lets us perform investigative work. To demonstrate that, synthetic load is needed. Finally, Postgres is used to demonstrate the OpenTelemetry SDK’s capability of capturing database queries, and Redis is used to store metrics data between requests.

Architecture

New Laravel 11 Project

This section offers several approaches to getting or creating this project.

  1. If you already have one running, skip to the Prepare Server section.

  2. If you want an already instrumented Laravel API, follow the instructions at https://codeberg.org/gmhafiz/observability and skip to Explore Grafana Dashboard.

  3. If you prefer to start from scratch, follow the instructions below.

You need composer installed to create a new Laravel 11 project. Let us simply name the project laravel.

sudo apt install composer

Before Laravel can be installed, you may need several PHP extension packages.

sudo apt install php-{mbstring,curl,dom,pgsql,xml}

The following command creates a new Laravel project.

composer create-project laravel/laravel:^11.0 laravel
cd laravel

We want to demonstrate that SQL query statements are included in tracing, so I highly recommend setting up a database. If you already have one running, edit your environment variables to reflect the correct values.

vim .env
DB_CONNECTION=pgsql
DB_HOST=127.0.0.1
DB_PORT=5432
DB_DATABASE=laravel
DB_USERNAME=postgres
DB_PASSWORD=password

If you prefer the container way, a Postgres database can be created in one line. Then edit your .env accordingly.

docker run --name postgres_db --restart=always -e POSTGRES_PASSWORD=password --publish 5432:5432 -d postgres:17

It takes a few seconds for the database to be up. Once it is, create a new database.

docker exec -it postgres_db psql -U postgres -c "CREATE DATABASE laravel;"

If you have not got Redis yet, this is a good place to bring it up using a Redis-compatible replacement called Valkey.

docker run --name redis --restart=always --publish 6379:6379 -d valkey/valkey:7.2

Laravel requires a few tables to be created so run database migrations with:

php artisan migrate

Setup is now complete, so let us run a quick PHP web server with

php artisan serve

Open http://localhost:8000 and you should see Laravel’s initial page. If you get a ‘No application encryption key has been specified’ error, that is because the APP_KEY environment variable is empty. Generate one with:

php artisan key:generate

Voila! If you see the following screenshot on your browser, it means you have successfully installed an initial Laravel application.

laravel initial page

Small API

Let us create one API endpoint that calls the database. If you are working with an existing project, you can skip to the Infrastructure section.

Laravel 11 does not come with traditional API routes in the default installation, but enabling them is only one command away.

php artisan install:api

Before we write our API route and controller, let us create mock data. To generate it, we use a seeder. Open database/seeders/DatabaseSeeder.php and create a thousand mock users.

<?php

namespace Database\Seeders;

use App\Models\User;
use Illuminate\Database\Seeder;

class DatabaseSeeder extends Seeder
{
    /**
     * Seed the application's database.
     */
    public function run(): void
    {
        User::factory(1000)->create();
    }
}

Once you are happy with the seeder, run it with the db:seed artisan command.

php artisan db:seed

To create our controller, we take advantage of the artisan command to scaffold a typical User controller.

php artisan make:controller UserController -r

In our User controller at app/Http/Controllers/UserController.php, we return a random number of users, so that we see varied metrics data in the Grafana dashboard later. Note that we could have used Eloquent, but we aren’t doing anything fancy with this query.

use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Log;

class UserController extends Controller
{
    /**
     * Display a listing of the resource.
     */
    public function index()
    {
        Log::info('in user controller');
    
        $amount = mt_rand(1, 1000);
        Log::debug('random amount is: ' . $amount);

        return DB::table('users')
            ->select(['id', 'name', 'email', 'email_verified_at', 'created_at', 'updated_at'])
            ->orderByDesc('id')
            ->limit($amount)
            ->get();
    }

Make sure we register the /api/users route in routes/api.php

<?php

use App\Http\Controllers\UserController;
use Illuminate\Http\Request;
use Illuminate\Support\Facades\Route;

Route::resource('users', UserController::class); // <-- add this line and its imports

We are now ready to test this endpoint. Simply do a curl to confirm that the API is working as intended.

curl http://localhost:8000/api/users

You should get a JSON response like this:

[
    {
      "id": 100,
      "name": "Jovan Rosenbaum",
      "email": "manuel.pacocha@example.com",
      "email_verified_at": "2024-03-20T07:54:31.000000Z",
      "created_at": "2024-03-20T07:54:31.000000Z",
      "updated_at": "2024-03-20T07:54:31.000000Z"
    },
    {
    ...etc

~~

As you can see, developing an endpoint is simple: it involved a simple query builder and defining an API route for it. Of course, a real-world application involves authentication, authorisation, request validation, output JSON formatting, testing, and more, but those are not the focus of this post.

Prepare Server

Before installing the OpenTelemetry packages for Laravel, your computer needs a PHP extension called opentelemetry. Your operating system may come with an easy way to install it; regardless, I am showing you the pecl way.

Please note that much of the server preparation headache can be avoided by using containers instead. But that may not be an option, so the long way is demonstrated here.

If you have not got pecl installed yet, install it using these two packages.

sudo apt install php-pear php-dev

Then simply use pecl to install opentelemetry.

sudo pecl install opentelemetry

Since PHP cannot remember values between requests thanks to its share-nothing model, accumulating a request count cannot be done without storing the value somewhere. The OpenTelemetry SDK does not define a way to do this, so we have to resort to a different library to capture metrics. That library defaults to Redis for storage, but you can also use APCu to store metric values between requests.

# optional
sudo pecl install apcu

To enable these extensions, we need to add them to our php.ini configuration. The cleanest way is to add files inside the conf.d directory.

echo "extension=opentelemetry" | sudo tee -a /etc/php/8.2/cli/conf.d/20-opentelemetry.ini

# optional
echo "extension=apcu" | sudo tee -a /etc/php/8.2/cli/conf.d/20-apcu.ini

Finally, install the Redis PHP extension.

sudo apt install php-redis

Always confirm that the extension is enabled. Check with:

php -m | grep "opentelemetry\|apcu\|redis"

No matter what Linux distribution or operating system you use, the output must display the installed PHP extensions.

opentelemetry
apcu
redis

Installing a PHP extension is probably the hardest part of this whole post, and there is no shortage of tools out there that try to simplify things, like Homestead, Valet, Herd, devcontainers, Sail, etc. It is also worth mentioning that some optional extensions are recommended, especially mbstring, grpc, and protobuf.

Settings

OpenTelemetry settings can be controlled using environment variables. Open .env and add the following lines. Note that this is not an exhaustive list; the OpenTelemetry SDK comes with default settings which you can override.

OTEL_PHP_AUTOLOAD_ENABLED=true
OTEL_SERVICE_NAME=laravel

OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318
OTEL_EXPORTER_OTLP_PROTOCOL=http/json

OTEL_TRACES_EXPORTER=otlp
OTEL_METRIC_EXPORTER=none
OTEL_LOGS_EXPORTER=otlp

OTEL_TRACES_SAMPLER_ARG=1

At a minimum, OTEL_PHP_AUTOLOAD_ENABLED must be set to true. I highly recommend setting the endpoint and protocol used to emit OpenTelemetry data. In our case we use port 4318, which sends the data in JSON format. If you can, use port 4317 to send data through the gRPC protocol instead.

The rest are optional but let us take a look at what we can customise.

Setting a service name identifies the signals generated by this particular application. This is useful when we need to differentiate between services, say a Node.js app or a Spring Boot service, versus Laravel.

When it comes to exporting the signals, note that OTEL_METRIC_EXPORTER is disabled because we are going to use a Prometheus library for metrics generation.

Traces are typically sampled because many of them are identical and we do not need identical data points. For that reason, the value is often set to something low such as 0.1. In our case, we set it to 1 because we want to see all traces that are generated, for demonstration purposes. To learn more about these settings, visit https://opentelemetry.io/docs/languages/sdk-configuration/
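
For example, in production you would typically keep only a fraction of traces. A minimal sketch using the standard SDK sampler variables (assuming the parent-based ratio sampler) looks like this:

# .env: keep roughly 10% of traces, respecting the caller's sampling decision
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1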

Instrument

Instrumenting means enabling our application to produce the three signals—tracing, metrics, and logs. It involves installing the necessary packages through composer and, where necessary, adding some code to the codebase so it can generate these signals to be consumed elsewhere.

Tracing

We instrument tracing using OpenTelemetry by installing several packages. First, allow composer plugin discovery, because this is how the OpenTelemetry SDK automatically registers itself with Laravel. By default, Laravel already has this setting enabled, but to be sure, run this line:

composer config allow-plugins.php-http/discovery true

Check composer.json and the “php-http/discovery” value should be true.

"config": {
    "optimize-autoloader": true,
    "preferred-install": "dist",
    "sort-packages": true,
    "allow-plugins": {
        "pestphp/pest-plugin": true,
        "php-http/discovery": true <-- auto discovery
    }
},

Now we are ready to install the packages for OpenTelemetry auto-instrumentation. open-telemetry/sdk is the main package we need; it implements the OpenTelemetry API so telemetry data is generated in the correct format. The open-telemetry/exporter-otlp package exports that data over the wire in OTLP format, which means the guzzle packages are needed to perform the network calls. The opentelemetry-auto-laravel package hooks into your Laravel request-response lifecycle to generate the signals automatically.

If you want to track requests between microservices, we also want to handle W3C trace context propagation. This context passes a single trace ID between microservices so we can track the request lifecycle across all services. For that we need to install the PSR-15 and PSR-18 packages to handle incoming and outgoing HTTP requests respectively (a sketch of an outgoing call follows the install commands below).

composer require open-telemetry/sdk
composer require open-telemetry/exporter-otlp
composer require guzzlehttp/guzzle
composer require php-http/guzzle7-adapter
composer require open-telemetry/opentelemetry-auto-laravel

# for distributed tracing 
composer require open-telemetry/opentelemetry-auto-psr15
composer require open-telemetry/opentelemetry-auto-psr18
composer require open-telemetry/opentelemetry-auto-guzzle
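
To illustrate, here is a hedged sketch of a controller method that calls another service with Guzzle. With the PSR-18 and Guzzle auto-instrumentation packages above installed, the W3C traceparent header should be injected into the outgoing request for you; the downstream URL and method name here are made up for illustration.

use GuzzleHttp\Client;

public function show(string $id)
{
    // The Guzzle/PSR-18 auto-instrumentation wraps this client call in a span
    // and propagates the current trace context via the traceparent header,
    // so the downstream service continues the same distributed trace.
    $client = new Client(['base_uri' => 'http://localhost:9000']); // hypothetical downstream service

    $response = $client->get('/api/orders/' . $id);

    return response()->json(json_decode($response->getBody()->getContents()));
}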

Optionally, we can attach logs to each trace by instrumenting monolog, which is the package Laravel uses for logging. The advantage of this approach is that we let the OpenTelemetry SDK automatically correlate logs with a trace.

composer require \
  monolog/monolog \
  open-telemetry/opentelemetry-logger-monolog

If you choose gRPC to send the data, install a few additional packages. The gRPC PHP extension is also required; make sure that the gRPC version installed by composer is the same as the version you install with pecl.

composer require grpc/grpc@1.57.0
sudo pecl install grpc-1.57.0
composer require open-telemetry/transport-grpc

Then switch to gRPC transport.

# .env
#OTEL_EXPORTER_OTLP_ENDPOINT=http://0.0.0.0:4318
#OTEL_EXPORTER_OTLP_PROTOCOL=http/json
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317
OTEL_EXPORTER_OTLP_PROTOCOL=grpc

Our chosen architecture, on the other hand, does a bit more, as detailed below. We choose a log pusher called Promtail to push the logs to a log aggregator called Loki. This completely bypasses otel-collector. The downside is that these logs won’t be correlated with a trace by default, so additional work needs to be done to correlate logs with a trace, which will be explained in the Logs section.

Nesting Traces

Up until now, our Laravel application has been instrumented for tracing automatically thanks to the open-telemetry/opentelemetry-auto-laravel package. However, this package only creates one trace for each request. That’s cool, but typically we do not write everything in the controller class alone. Plus, it is more interesting to see a trace with spans grouped by child functions so it follows the logical structure of the code. For example, we could relegate database calls to a repository or service class in another function. That function can call yet another function, and as a result you get a tree-like structure of your request. This gives us an opportunity to time not only individual functions, but also whole groups of child functions!

trace with nested span

To achieve this, we can either pass the $tracer instance around as a function parameter, or re-retrieve the $tracer through a provider in each child class. In the user controller—app/Http/Controllers/UserController.php—obtain a tracer and create a new span under the trace started by auto-instrumentation. A tracer is obtained from the global default. A span is created by giving it a name (I prefer using the function’s name) via spanBuilder()->startSpan(). Then it needs to be activated to signify that this span is scoped to this function. It does look strange that creating a span does not activate it, but this is what the API spec says for languages with implicit Context propagation, like PHP.

use Illuminate\Http\Request;
use OpenTelemetry\API\Globals;

public function index(Request $request)
{
    $tracer = Globals::tracerProvider()->getTracer('io.opentelemetry.contrib.php');
    $root = $tracer->spanBuilder('index')->startSpan();
    $scope = $root->activate();

Then in each child function, you do the same three lines of code—re-retrieve $tracer from the global provider.

If you prefer not to do it that way, the $tracer instance can be passed from the parent into child classes as a function parameter.

// pass $tracer to child function
$this->userRepo->all($tracer);

Finally, both $root and $scope need to be closed. detach() resets the Context associated with this scope while end() signals that this span’s operation has ended.

$scope->detach();
$root->end();

It is a little bit of work because at least four lines of code need to be added to each function in order to obtain nested spans. This is unlike auto-instrumentation in Java, where nested spans are added to each function without writing a single line of code. On the flip side, you might not want to create spans for every single function, because you might not care about instrumenting certain functions.

Before closing off this section, another modification I like to do is to wrap the instrumenting code in a try..finally block, where both detach() and end() are moved into the finally block to ensure they always run.
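
Putting it all together, here is a minimal sketch of what a child method could look like; the UserRepository class and the span name are just illustrative.

use Illuminate\Support\Collection;
use Illuminate\Support\Facades\DB;
use OpenTelemetry\API\Globals;

class UserRepository
{
    public function all(): Collection
    {
        // Re-retrieve the tracer from the global provider and open a child span.
        $tracer = Globals::tracerProvider()->getTracer('io.opentelemetry.contrib.php');
        $span = $tracer->spanBuilder('UserRepository::all')->startSpan();
        $scope = $span->activate();

        try {
            return DB::table('users')->get();
        } finally {
            // Always detach the scope and end the span, even if the query throws.
            $scope->detach();
            $span->end();
        }
    }
}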

Metrics

The OpenTelemetry SDK supports all three signals for PHP, but we opt for the Prometheus library to instrument metrics instead because of one additional important feature: it has internal state storage to remember metrics data between requests, which the OpenTelemetry SDK lacks (though this is under investigation). Having something like Swoole or RoadRunner might help, because the Laravel application is kept running constantly instead of rebooting for each request, but I haven’t been able to make OpenTelemetry work with it reliably yet.

Nevertheless, install the Prometheus client with the following command. As always, the full code is at https://codeberg.org/gmhafiz/observability/src/branch/master/laravel

composer require promphp/prometheus_client_php

To generate metrics data, we intercept each request using a middleware. We will be writing a custom middleware that records each request’s latency, grouped by HTTP method and URI path.

To create a middleware in Laravel:

php artisan make:middleware PrometheusMiddleware

Open app/Http/Middleware/PrometheusMiddleware.php and, in the handle() function, measure the latency from the time Laravel boots up until the request has been handled. To do this, we take the start time (LARAVEL_START, or the request time as a fallback) and subtract it from microtime(true) once $next($request) has returned. The resulting difference is multiplied by one thousand because we want the unit to be milliseconds.

public function handle(Request $request, Closure $next): Response
{
    $startTime = defined('LARAVEL_START') ? LARAVEL_START : $request->server('REQUEST_TIME_FLOAT');
    
    $result = $next($request);
    
    $duration = $startTime ? ((microtime(true) - $startTime) * 1000) : null;

I am simply going to record one kind of metric, a histogram. The histogram buckets group each API route’s requests by latency. This means we will have exponential buckets for /api/users as defined by Histogram::exponentialBuckets(1, 2, 15), which produces 15 buckets starting at 1 ms and doubling each time (1, 2, 4, ..., 16384 ms). These buckets are further differentiated by HTTP method because we want to distinguish metrics for a GET from a PUT. All of them are labelled with the string ‘Laravel’ to identify this application (job). Just like in the user controller, I like putting this block of code in a try..catch..finally block because, in the code block below, there are at least five things that can go wrong.

use App\Http\Controllers\Controller;
use Illuminate\Support\Facades\Log;
use Prometheus\CollectorRegistry;
use Prometheus\Histogram;

public function handle(Request $request, Closure $next): Response
{
    ...
    
    $registry = $this->registry;
    
    $method = $request->method();
    $uri = $request->route()->uri();
    
    try {
        $histogram = $registry->getOrRegisterHistogram(
            'http_server_duration',
            'milliseconds',
            'latency histogram grouped by path in milliseconds',
            ['job', 'method', 'path'],
            Histogram::exponentialBuckets(1, 2, 15),
        );

        $histogram->observe($duration, ['Laravel', $method, $uri]);
    } catch (\Throwable $e) {
        // Never fail the request just because metrics could not be recorded.
        Log::error('failed to record request metrics: ' . $e->getMessage());
    } finally {
        return $result;
    }
}

To finish setting up this middleware, let us take a look at its configuration. The Prometheus library uses Redis storage by default, so its collector registry can be accessed using \Prometheus\CollectorRegistry::getDefault(). However, you may want to configure the address to Redis, or you may be satisfied with APCu only. We also need an instance of CollectorRegistry that we can use both here in this middleware and in the /api/metrics endpoint which we will build in a moment. Laravel 11 added a new way to register class bindings, which can be confusing. I found it hard to figure out the changes that came with Laravel 11 even with the documentation, so I hope the following helps not only with configuring the Prometheus storage adaptor, but also with using providers in Laravel 11 in general.

Let us start with step 1, which is to create a new provider. The command below creates a file at app/Providers/PrometheusProvider.php and adds a new entry to bootstrap/providers.php. You no longer need config/app.php to register the class.

php artisan make:provider PrometheusProvider

In the register() function, we register the CollectorRegistry class as a singleton and give it the Redis storage adaptor as an argument. The values for the Redis configuration come from the config() global helper, which reads config/database.php.

use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

public function register(): void
{
    $this->app->singleton(CollectorRegistry::class, static function () {
        $redis = new Redis([
            'host' => config('database.redis.default.host') ?? 'localhost',
            'port' => config('database.redis.default.port') ?? 6379,
            'password' => config('database.redis.default.password') ?? null,
            'timeout' => 0.1, // in seconds
            'read_timeout' => '10', // in seconds
            'persistent_connections' => config('database.redis.default.persistent', false),
        ]);

        return new CollectorRegistry($redis);
    });
}

Once we have this CollectorRegistry class, our middleware can access it by injecting it as a dependency through its constructor. Reopen the middleware at app/Http/Middleware/PrometheusMiddleware.php and add this constructor.

use Prometheus\CollectorRegistry;
use Prometheus\Storage\Redis;

class PrometheusMiddleware
{
    private CollectorRegistry $registry;
    
    public function __construct(CollectorRegistry $registry)
    {
        $this->registry = $registry;
    }

Our Prometheus middleware is complete, but it won’t be triggered on each request until it is registered with Laravel. To make Laravel intercept all requests with our Prometheus middleware, we add it to the Laravel bootstrap handler. Open bootstrap/app.php and add the following line.

use App\Http\Middleware\PrometheusMiddleware;

return Application::configure(basePath: dirname(__DIR__))
    ->withRouting(
        web: __DIR__.'/../routes/web.php',
        api: __DIR__.'/../routes/api.php',
        commands: __DIR__.'/../routes/console.php',
        health: '/up',
    )
    ->withMiddleware(function (Middleware $middleware) {
        $middleware->append(PrometheusMiddleware::class); // <-- our custom handler
    })
    ->withExceptions(function (Exceptions $exceptions) {
        //
    })->create();

The last step is to allow the Prometheus service to scrape metrics data from Laravel. To do this, we expose an API endpoint that returns this metrics data.

Create a new controller and open the file at app/Http/Controllers/Metrics/MetricsController.php.

php artisan make:controller Metrics/MetricsController

The index() function returns metrics data in Prometheus’ plaintext format.

<?php

namespace App\Http\Controllers\Metrics;

use App\Http\Controllers\Controller;
use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;

class MetricsController extends Controller
{
    private CollectorRegistry $registry;

    public function __construct(CollectorRegistry $registry)
    {
        $this->registry = $registry;
    }

    public function index()
    {
        $renderer = new RenderTextFormat();
        $result = $renderer->render($this->registry->getMetricFamilySamples(), true);

        header('Content-type: '.RenderTextFormat::MIME_TYPE);
        echo $result;
    }
}

To expose a /metrics endpoint, open routes/api.php file and add the following route.

use App\Http\Controllers\Metrics\MetricsController;

Route::get('/metrics', [MetricsController::class, 'index']);

In a production setting, you would want to protect this route by adding some authentication.
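
For example, one option (a sketch, assuming Sanctum, which install:api sets up) is to reuse Laravel’s auth middleware, or to restrict access by IP with a custom middleware of your own:

use App\Http\Controllers\Metrics\MetricsController;

// Only authenticated callers (e.g. your Prometheus scraper with a token) may read metrics.
Route::get('/metrics', [MetricsController::class, 'index'])
    ->middleware('auth:sanctum');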

Logs

We could stop right here. Only metrics and tracing need to be instrumented explicitly, because logs are already instrumented alongside traces (in OTLP format) thanks to the open-telemetry/opentelemetry-logger-monolog package we installed.

However, in our chosen architecture, emitting logs is independent of Laravel and OpenTelemetry. My approach is to have some kind of agent or log pusher like Fluent Bit or Promtail that pushes the logs from Laravel’s log files to a centralised datastore such as Loki. The 12-factor app methodology says logs should be treated as a stream and written to stdout, and we certainly can configure Laravel to log to stdout. The advantage of doing it that way is that we do not have to worry about Laravel using disk I/O for writing logs, but the downside is that you have another tool to install, configure, and maintain.
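
Should you prefer that 12-factor route instead of tailing files, a minimal sketch is to switch Laravel to its built-in stderr channel:

# .env
LOG_CHANNEL=stderr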

Keep in mind that logs instrumented in OTLP format have the advantage of being correlated with traces automatically, thanks to the unique trace ID present in both the trace and the log. This unique ID is absent from the default Laravel log output. To get something similar, we will attach the trace ID to each log line ourselves. To do this, we need to write a custom Laravel log formatter.

Custom Laravel Log

When any log is written, our custom class intercepts the log handler and formats the line to our liking. The trace ID can be acquired from the global tracer provider. All that is left is to decide where to place the trace ID in our formatted string.

Create a file in app/Logging/Loki.php and add the following code.


<?php

namespace App\Logging;

use Illuminate\Log\Logger;
use Monolog\Formatter\LineFormatter;
use OpenTelemetry\API\Globals;

class Loki
{
    public function __invoke(Logger $logger): void
    {
        $traceID = $this->getTraceID();

        foreach ($logger->getHandlers() as $handler) {
            $handler->setFormatter(new LineFormatter(
                "[%datetime%] %channel%.%level_name%: ".$traceID." %message% %context% %extra%\n"
            ));
        }
    }

    private function getTraceID(): string
    {
        $tracer = Globals::tracerProvider()->getTracer('');
        $root = $tracer->spanBuilder('')->startSpan();

        return $root->getContext()->getTraceId();
    }
}

What is left is to make Laravel use our newly created custom log formatter. Go to config/logging.php and find the ‘single’ channel. Add the following line along with its import.

use App\Logging\Loki;

'single' => [
    'driver' => 'single',
    'tap' => [Loki::class], // <-- add this line
    'path' => storage_path('logs/laravel.log'),
    'level' => env('LOG_LEVEL', 'debug'),
    'replace_placeholders' => true,
],

The result is a log line that looks like this. Now that we have the trace ID, we can filter for this string and retrieve all logs scoped to a request!

[2024-04-07T11:19:40.975030+00:00] local.INFO: d550726cabd8b98ea5137ab6d73f6aa9 random amount is  [66] []

JSON log

The line formatter is great, although I prefer structured logs myself. Fortunately, customising the log format to JSON is easy. To do so, we create a class that extends NormalizerFormatter. We have free rein over which fields to include, such as a more precise timestamp, the application name, and the environment. The application name is useful to differentiate between different applications’ logs, and including the environment value can be useful when production and staging logs are mixed together in your centralised log system. Then, a call to the toJson() function formats our log object into a JSON log line.

Create a file at app/Logging/CustomJson.php with the following content.

<?php

namespace App\Logging;

use Monolog\Formatter\NormalizerFormatter;
use Monolog\LogRecord;

class CustomJson extends NormalizerFormatter
{
    private string $traceID;

    public function __construct($traceID = '')
    {
        parent::__construct('Y-m-d\TH:i:s.uP');

        $this->traceID = $traceID;
    }

    public function format(LogRecord $record): string
    {
        $recordData = parent::format($record);

        $message = [
            'datetime' => $recordData['datetime'],
        ];

        if ($this->traceID !== '') {
            $message['traceID'] = $this->traceID;
        }

        if (isset($recordData['message'])) {
            $message['message'] = $recordData['message'];
        }

        if (isset($recordData['level'])) {
            $message['level'] = $recordData['level'];
        }

        if (\count($recordData['context']) > 0) {
            $message['context'] = $recordData['context'];
        }

        if (\count($recordData['extra']) > 0) {
            $message['extra'] = $recordData['extra'];
        }
        $message['extra']['hostname'] = (string) gethostname();
        $message['extra']['app'] = config('app.name');
        $message['extra']['env'] = config('app.env');

        return $this->toJson($message)."\n";
    }
}

Then a small adjustment needs to be made in the app/Logging/Loki.php file so it uses our new custom formatter.

foreach ($logger->getHandlers() as $handler) {
    if (config('logging.json_format')) {
        $handler->setFormatter(new CustomJson($traceID));
    } else {
        $handler->setFormatter(new LineFormatter(
            '[%datetime%] %channel%.%level_name%: '.$traceID." %message% %context% %extra%\n"
        ));
    }
}

Since we use the global config() helper function, we need to define this key in the config/logging.php file.

<?php

use App\Logging\Loki;
use Monolog\Handler\NullHandler;
use Monolog\Handler\StreamHandler;
use Monolog\Handler\SyslogUdpHandler;
use Monolog\Processor\PsrLogMessageProcessor;

return [

    /*
     * Whether to format log output as a normal single line key value or switch to a JSON format
     */
    'json_format' => env('LOG_JSON', false),
    
... cut for brevity    
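
The JSON format can then be toggled from the environment:

# .env
LOG_JSON=true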

The result is the same log output, but formatted as JSON.

{
    "datetime": "2024-10-06T22:42:37.307234+00:00",
    "traceID": "d4a890b60f7f705f5e6c34f4144df925",
    "level": 200, 
    "message": "random amount is ",
    "context": [
        11
    ],
    "extra": {
        "hostname": "debianDesktop",
        "app": "Laravel",
        "env": "local"
    }
}

~~

All in all, instrumenting both metrics and logs is easy and only needs to be done once, without touching the rest of your codebase. Thanks to OpenTelemetry automatic instrumentation, all requests are traced automatically.

Infrastructure

This section brings up the infrastructure needed to transport, store, and visualise your observability data on your own computer.

Setting up the infrastructure locally by hand is tedious, so I highly recommend going the docker-compose route at https://codeberg.org/gmhafiz/observability. And if you do, at least read the #instrument section to familiarise yourself with how to instrument OpenTelemetry in a Laravel application.

The five main infrastructure components we are going to bring up are:

  1. Loki for logs and Promtail for log pushing
  2. Jaeger for tracing
  3. Prometheus for metrics
  4. otel-collector to receive signals from the application and forward them to the backends
  5. Grafana for visualisation

Brave souls may continue to the next section.

Loki Logs

To push Laravel logs to a log aggregator like Loki, we include promtail in the root directory and have it run in the background. We only require one promtail binary and one configuration file for this.

Install both loki and promtail using your preferred method https://grafana.com/docs/loki/latest/setup/install/.

wget https://github.com/grafana/loki/releases/download/v3.2.1/loki-linux-amd64.zip
unzip loki-linux-amd64.zip

Create a loki config file loki-config.yaml and paste the following content.

# loki-config.yaml

auth_enabled: false

server:
  http_listen_port: 3100
  grpc_listen_port: 9096

limits_config:
    allow_structured_metadata: true

common:
  path_prefix: /tmp/loki
  storage:
    filesystem:
      chunks_directory: /tmp/loki/chunks
      rules_directory: /tmp/loki/rules
  replication_factor: 1
  ring:
    instance_addr: 127.0.0.1
    kvstore:
      store: inmemory

schema_config:
  configs:
    - from: 2024-06-01
      store: tsdb
      object_store: filesystem
      schema: v13
      index:
        prefix: index_
        period: 24h

ruler:
  alertmanager_url: http://localhost:9093

# By default, Loki will send anonymous, but uniquely-identifiable usage and configuration
# analytics to Grafana Labs. These statistics are sent to https://stats.grafana.org/
#
# Statistics help us better understand how Loki is used, and they show us performance
# levels for most users. This helps us prioritize features and documentation.
# For more information on what's sent, look at
# https://github.com/grafana/loki/blob/main/pkg/usagestats/stats.go
# Refer to the buildReport method to see what goes into a report.
#
# If you would like to disable reporting, uncomment the following lines:
#analytics:
#  reporting_enabled: false

Loki version 3 requires schema version 13 with the tsdb store in its config file. For demonstration and development purposes, we store the logs on the local file system, but in a production setting Loki recommends storing them in object storage such as AWS S3, Ceph, or MinIO. We expose two listening ports, 3100 and 9096, for HTTP and gRPC respectively. Lastly, if we choose to send logs to Loki through OTLP, we need to allow structured metadata to be stored.

Run Loki with

./loki-linux-amd64 -config.file=loki-config.yaml

Check that your installation is successful by querying its API and confirming it returns a 200 OK response. It can take some time to become ready, so let us proceed with Promtail in the meantime.

curl -v http://localhost:3100/ready

For Promtail, I opt for the single binary for Linux:

wget https://github.com/grafana/loki/releases/download/v3.2.1/promtail-linux-amd64.zip
unzip promtail-linux-amd64.zip

We need to tell Promtail where your Laravel logs are located, as well as where to forward them.

In its configuration file, the important bit is the list of jobs under ‘scrape_configs’. Here we name the job ‘laravel-opentelemetry’ and provide an absolute path to the Laravel logs, so adjust accordingly. In our case, it recursively reads all files that end with ‘.log’ under Laravel’s ‘storage/logs’. The destination for the logs is under the ‘clients’ section, which accepts an array. Here we define Loki’s HTTP API address for log pushing. In our case we only send to one place, our local Loki installation, but you could push to more than one destination, such as additional Loki instances.

Open promtail-local.yaml, paste the following, and change the path to your Laravel log location.

# promtail-local.yaml

server:
  http_listen_port: 9080
  grpc_listen_port: 0
positions:
  filename: /tmp/promtail-positions.yaml
clients:
  - url: http://localhost:3100/loki/api/v1/push
scrape_configs:
  - job_name: laravel-opentelemetry
    static_configs:
      - targets:
          - localhost
        labels:
          job: laravel-opentelemetry
          __path__: /var/www/storage/logs/**/*.log # adjust to where your laravel directory is located

Run it with:

./promtail-linux-amd64 -config.file=promtail-local.yaml

Jaeger Tracing

Jaeger can also be run as a single binary. It is able to receive OTLP data through ports 4317 and 4318, but these conflict with the ports used by the otel-collector we are going to run soon, so we map Jaeger to different ports instead.

wget https://github.com/jaegertracing/jaeger/releases/download/v1.63.0/jaeger-1.63.0-linux-amd64.tar.gz

tar -xzvf jaeger-1.63.0-linux-amd64.tar.gz

jaeger-1.63.0-linux-amd64/jaeger-all-in-one --collector.otlp.grpc.host-port :44317 --collector.otlp.http.host-port :44318

Note that all traces which are created will be lost when Jaeger is restarted because they are stored in memory. To keep the traces, configure a storage backend.

Prometheus Metrics

For simplicity’s sake, we run Prometheus as a docker container on the host network. Before we run it, let us customise its configuration file.

We define only one scrape job. We give it a name, as well as how long we want it to wait between scrapes.

‘static_configs’ accepts a list of addresses to be scraped. Since we run Laravel using php artisan serve, the address it exposes is http://localhost:8000. It is important to run this container on the host network because Prometheus needs to be able to reach this localhost address. If ‘metrics_path’ is not defined, it defaults to ‘/metrics’, so we need to set it to ‘/api/metrics’ instead because that is the metrics endpoint we exposed with Laravel.

# prometheus.yml

scrape_configs:
  - job_name: 'laravel'
    scrape_interval: 10s
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/api/metrics'

Take this prometheus.yml config file and mount it into the container. Run the following command and Prometheus will run at its default port of 9090.

docker run \
  --name prometheus \
  --rm \
  --network=host \
  -v $HOME/Downloads/prometheus.yml:/etc/prometheus/prometheus.yml \
  -d \
  prom/prometheus:v3.0.0

Check if Prometheus API is running using curl and it should return a 200 OK status.

curl -v http://localhost:9090/-/healthy

We are choosing to store metrics data inside Redis, so if you have not brought it up earlier, install Valkey with the following command.

docker run --name valkey --restart=always --publish 6379:6379 -d valkey/valkey:7.2

Otel-Collector

Both trace and log signals are funnelled into otel-collector, a program that receives, processes, and exports telemetry data. We do not strictly need it, since telemetry signals can be exported directly from Laravel to Jaeger. The reasons we have this intermediary program are many: it can scale independently of Laravel and Jaeger, and it includes features like batching, retries, filtering, and more.

Interestingly, this is not the only OTLP-compatible collector. For example, there is a contributor distribution based on otel-collector called otel-collector-contrib, and another from Grafana called Alloy.

Before running this program, let us take a look at its config file. It has four main sections: ‘receivers’ to define how telemetry signals are received, ‘processors’ to perform batching, filtering, retries, etc., ‘exporters’ to define where we want to emit processed data, and ‘service’ where we define how data flows through pipelines.

Note that in the ‘service’ section, both ‘traces’ and ‘logs’ pipelines are defined but not ‘metrics’. This is because Prometheus scrapes data straight from Laravel, bypassing otel-collector. This brings us to a processor called ‘filter/ignore’. Since Prometheus scrapes the data every ten seconds, the Laravel OpenTelemetry SDK will generate a trace for each scrape, and we filter those out based on the path.

# otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: localhost:4317
      http:
        endpoint: localhost:4318

exporters:
  debug:
    verbosity: detailed

  otlp:
    endpoint: localhost:44317 # because we mapped Jaeger otlp to this port 
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
    sending_queue:
      enabled: false

  otlphttp:
    endpoint: http://localhost:3100/otlp/v1/logs

processors:
  # https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor
  batch:

  memory_limiter:
    limit_mib: 4000
    spike_limit_mib: 1000
    check_interval: 5s

  # https://github.com/open-telemetry/opentelemetry-collector-contrib/blob/main/processor/filterprocessor/README.md
  filter/ignore:
    traces:
      span:
        - attributes["http.route"] == "api/metrics"

extensions:
  health_check:
  pprof:
    endpoint: localhost:1888
  zpages:
    endpoint: localhost:55679

service:
  extensions: [pprof, zpages, health_check]
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, memory_limiter, filter/ignore]
      exporters: [debug, otlp]
    logs:
      receivers: [otlp]
      processors: [batch, memory_limiter]
      exporters: [debug, otlphttp]

Run otel-collector-contrib with

docker run \
  -v $HOME/Downloads/otel-collector-config.yaml:/etc/otelcol-contrib/config.yaml \
  --name otel-collector \
  --network=host \
  -d \
   otel/opentelemetry-collector-contrib:0.114.0

Grafana

Grafana can be used to visualise all OpenTelemetry signals that we have. To install, I will show the apt way because why not.

sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install grafana
sudo systemctl daemon-reload && sudo systemctl start grafana-server && sudo systemctl enable grafana-server

Explore Grafana Dashboard

Now that we have infrastructure up and Laravel instrumented, it is time to explore what insights we can gain from these signals.

Logs

Open Grafana at http://localhost:3000 and log in as admin/admin. To add Loki as a Grafana data source, go to ‘Connections’ and click the ‘Data sources’ link in the side menu.

Add data source to Grafana

Find Loki in the list

Loki

The only setting you need to provide is Loki’s endpoint, which is http://localhost:3100

Enter Loki endpoint

Scroll down and click the ‘Save & test’ button to verify connectivity.

Test

To view the logs, either click on the ‘Explore view’ link or go to ‘Explore’ menu and select Loki from the dropdown box.

explore loki

To filter only Laravel logs, select from these two dropdown boxes, click the ‘Run query’ button and voila, you see a stream of Laravel logs complete with their trace IDs!

To filter logs

Grafana has a nice query builder form to help build the search and filter you want. If you prefer typing Loki’s query language (LogQL) instead, simply switch to ‘Code’. The equivalent query to fetch all Laravel logs is {job="laravel-opentelemetry"}.

all laravel logs

If we know which trace ID we are interested in, we can simply search for that string. Here I am using LogQL, and Loki filters the two logs with that trace ID.
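
With the trace ID from the earlier log line, the LogQL is a simple line filter along these lines:

{job="laravel-opentelemetry"} |= "d550726cabd8b98ea5137ab6d73f6aa9"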

search logs by trace ID

Tracing

Once Jaeger is running, add it as a Grafana data source just like we did with Loki.

Add Jaeger as Grafana data source

The address for Jaeger is http://localhost:16686. Hit the ‘Save & test’ button, and go to the Explore view.

Once you are in the Jaeger explore view, there is a dropdown box under the ‘Search’ query type where you can filter by ‘laravel’. Sometimes it does not appear on first view and I have to refresh the whole page before Grafana detects the new ‘laravel’ trace service.

Select laravel

Click the ‘Run query’ button and you will see a list of traces for all the Laravel API calls you have made within the time period.

List of laravel trace

Click on any one of them and you will see detailed trace information along with any spans, if they exist. The OpenTelemetry PHP SDK instruments database queries, so you are able to see the queries and how long they took.

In the detailed view, we can see the whole request-response took about 45 milliseconds. The SQL query started at around the 16 millisecond mark and took the majority of the time at 28 milliseconds.

shows sql query made on this API

Remember that we logged the random number inside UserController.php? That log line is also included in the span. If you click on the first span named ’laravel GET /api/users’, the log information is stored under resource events.

Log in span

Metrics

This brings us to the final observability signal, metrics, and we are going to see some cool line charts. I will show how the histogram we created gives us useful insights into how /api/users performs and how often it is hit. To view this metric, go to Grafana at http://localhost:3000 and add Prometheus as a data source. The address is simply http://localhost:9090

Add Prometheus address in Grafana

Initially, you won’t be able to see any metric data for the /api/users endpoint until you have hit it at least once. Making one request is fine, but it is better to simulate real-world traffic so we can perform basic traffic analysis. To this end, let us generate some synthetic load using k6.

Load testing with k6

Before we run any load testing, ensure that a single request works.

curl -v http://localhost:8000/api/users

If everything looks alright, install k6 and create a test script. There is a maximum of eight virtual users (VUs) across three stages, each stage with a pre-defined number of VUs, so this will complete in ten minutes.

import http from 'k6/http';

export const options = {
    vus: 8,

    stages: [
        { duration: '2m', target: 4 }, // traffic ramp-up to 4 VUs over 2 minutes
        { duration: '6m', target: 8 }, // stay around 8 VUs for 6 minutes
        { duration: '2m', target: 0 }, // ramp-down to 0 VUs
    ],
};

export default function () {
    http.get('http://localhost:8000/api/users');
}

Save it as load-test.js and run with

k6 run load-test.js

k6 will return a result that looks like this.

          /\      |‾‾| /‾‾/   /‾‾/   
     /\  /  \     |  |/  /   /  /    
    /  \/    \    |     (   /   ‾‾\  
   /          \   |  |\  \ |  (‾)  | 
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: load-test.js
        output: -

     scenarios: (100.00%) 1 scenario, 8 max VUs, 40s max duration (incl. graceful stop):
              * default: 8 looping VUs for 10s (gracefulStop: 30s)


     data_received..................: 780 kB 74 kB/s
     data_sent......................: 14 kB  1.4 kB/s
     http_req_blocked...............: avg=145.12µs min=86.44µs med=116.41µs max=599.83µs p(90)=171.83µs p(95)=509.15µs
     http_req_connecting............: avg=72.76µs  min=47.73µs med=66.18µs  max=145.46µs p(90)=97.45µs  p(95)=111.22µs
     http_req_duration..............: avg=509.18ms min=68.18ms med=515.53ms max=572.06ms p(90)=544.65ms p(95)=547.53ms
       { expected_response:true }...: avg=509.18ms min=68.18ms med=515.53ms max=572.06ms p(90)=544.65ms p(95)=547.53ms
     http_req_failed................: 0.00%  ✓ 0         ✗ 161
     http_req_receiving.............: avg=29.02ms  min=27.19ms med=28.5ms   max=39.1ms   p(90)=30.95ms  p(95)=32.04ms 
     http_req_sending...............: avg=38.61µs  min=19.88µs med=35.32µs  max=421.94µs p(90)=49.54µs  p(95)=57.4µs  
     http_req_tls_handshaking.......: avg=0s       min=0s      med=0s       max=0s       p(90)=0s       p(95)=0s      
     http_req_waiting...............: avg=480.12ms min=39.6ms  med=486.88ms max=543.81ms p(90)=514.34ms p(95)=518.02ms
     http_reqs......................: 161    15.349624/s
     iteration_duration.............: avg=509.42ms min=69.14ms med=515.72ms max=572.31ms p(90)=544.86ms p(95)=547.8ms 
     iterations.....................: 161    15.349624/s
     vus............................: 8      min=8       max=8
     vus_max........................: 8      min=8       max=8

Ensure no request has failed by checking http_req_failed. The two most important metrics are iteration_duration and iterations, which give latency and throughput respectively.

Let us return to Prometheus.

Latency traffic analysis using metrics

Just like tracing, the metric can be selected using the Grafana form builder. Search for the name of our histogram, http_server_duration, and you should see three metrics, suffixed with the unit we defined (milliseconds), which were created automatically.

Starting with ‘count’, it is a counter, so it can be used to find how many times this endpoint was hit. ‘Sum’ is the total time, in our case milliseconds, spent in this endpoint; this is used to find endpoint latency. That brings us to ‘bucket’, which gives us fine-grained latency data for each API endpoint grouped into latency buckets.
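
For instance, the ‘count’ series on its own already gives throughput; a query along these lines plots requests per second for this application:

sum(rate(http_server_duration_milliseconds_count{exported_job="Laravel"}[$__rate_interval]))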

metrics

The first metric we want to craft is latency. The total duration is stored in the ‘sum’ metric. To calculate latency, we divide the rate of the duration (sum) by the rate of how many times the /api/users endpoint has been visited. Thus our PromQL is as follows.

rate(http_server_duration_milliseconds_sum{exported_job="Laravel"} [$__rate_interval])
/
rate(http_server_duration_milliseconds_count{exported_job="Laravel"} [$__rate_interval])

The above PromQL looks like a normal math formula with a numerator and a denominator. The [$__rate_interval] is a special Grafana variable that is guaranteed to be at least four times the scrape interval. We also constrain the metric to the ‘Laravel’ value of the ‘exported_job’ label.

The resulting line chart is as follows. The X-axis is time while the Y-axis gives us the latency in milliseconds. Notice that latency grew towards the end as our load tester increased the number of concurrent users.

/api/users endpoint latency

Service Level Objective: /api/users must be below 40ms 95% of the time

Let’s say we have a requirement that the GET /api/users endpoint needs to respond below 40 milliseconds 95% of the time. We don’t have a bucket defined at 40 milliseconds, so let us fudge it by using the 32 millisecond bucket instead. That means we need to find the proportion of requests below that latency over all requests. To find the proportion, we select buckets which are less than or equal to (le) 32 milliseconds and divide by all requests, i.e. http_server_duration_milliseconds_bucket{exported_job="Laravel",le="32"} / http_server_duration_milliseconds_bucket{exported_job="Laravel",le="+Inf"}. The increase() function calculates the increase over a one-minute sliding window, while sum() adds up all the values. The numerator is multiplied by 100 to give a nice percentage, and the whole numerator is divided by all requests using the +Inf bucket. The resulting PromQL is as follows.

sum(
  increase(http_server_duration_milliseconds_bucket{exported_job="Laravel",le="32"}[1m])
) * 100 /
sum(
  increase(http_server_duration_milliseconds_bucket{exported_job="Laravel",le="+Inf"}[1m])
)

Percentage of /api/users endpoint below 40 milliseconds

Finding out which endpoints are popular can give an indication of frequently visited pages. The histogram we made also counted request frequency, so let us see how we can craft a PromQL query to obtain this information.

We can reuse the counter created automatically with our histogram. I am not sure if this is the best metric to use because we could have created a dedicated counter metric for this. Nevertheless, we apply the increase() function to the http_server_duration_milliseconds_count metric to calculate the increase in this counter over a specific range. This range is a special Grafana variable which you can adjust using the time picker in the top right. Then sum() is applied, and you should get about the same value as the number of requests k6 made.

sum(increase(http_server_duration_milliseconds_count{exported_job="Laravel"}[$__range]))

As you can see, the simple histogram we created is powerful, and PromQL is very flexible.

Epilogue

While this blog post is rather long, auto-instrumenting with OpenTelemetry is actually quite easy. Just have the OpenTelemetry PHP extension installed, add some composer packages, and enable it in the environment settings and you are done. The same goes for logs, which are picked up automatically, although the way I chose is different since I have another process that picks them up, and that brought extra modifications because I wanted trace ID correlation. Metrics cannot be done through OpenTelemetry because of the storage issue, but Prometheus is mature enough to be used on its own.

The initial time investment in observability looks daunting, but metrics, tracing, and logs each only need to be instrumented once. Arguments can be made about nested tracing, because each new child function isn’t automatically traced so you have to add the code manually.

There isn’t much change in Laravel 11 apart from the new way of registering service providers and some files being moved around. I have been playing with this OpenTelemetry SDK and the promphp library since March and, compatibility-wise, it is fine. It used to be harder with fragmented opentracing libraries, so I am glad OpenTelemetry has unified all these signals into stable SDKs.

What’s Next

What is not demonstrated in this post is that OpenTelemetry instrumentation is also able to trace a request along with its background jobs. This is a powerful tool to have because, without it, you can easily lose context when you have thousands of background jobs and no way of knowing which request each one came from. This is done by adding the trace context to the job payload; when the job is picked up, that context is extracted and thus the same parent trace is reused.
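
A rough sketch of the idea (not the exact code from the repository; the ProcessOrder job is hypothetical) is to inject the current context into the job payload with the SDK’s W3C propagator and extract it again in the worker:

use OpenTelemetry\API\Trace\Propagation\TraceContextPropagator;

// When dispatching: serialise the current trace context into the job payload.
$carrier = [];
TraceContextPropagator::getInstance()->inject($carrier);
ProcessOrder::dispatch($order, $carrier); // hypothetical job that stores $carrier

// Inside the job's handle(): restore the context so spans created here
// become children of the original request's trace.
$context = TraceContextPropagator::getInstance()->extract($this->carrier);
$scope = $context->activate();
try {
    // ... do the actual work
} finally {
    $scope->detach();
}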

Talking about observability is not complete if we do not touch upon distributed tracing. In the diagram below, we have a Spring Boot API that calls the Laravel API server, which in turn calls a Go API to return some data.

Distributed Tracing

When each of these services is instrumented the same way, the same trace ID is used in all three services, and as a result we get a nice graph showing exactly how the call flows. Both the background job tracing and distributed tracing code are in the repository.

Another potent advantage of OpenTelemetry is the ability to correlate data between the three signals. In the sample repository, you can jump from a log straight to a trace using a ‘derived field’ with the click of a button. This is possible because both share the same unique trace ID.
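
As a hedged sketch, a derived field can be declared when provisioning the Loki data source in Grafana; the regex matches the traceID key in our JSON log lines, and the datasource UID placeholder must point at your Jaeger data source:

# grafana datasource provisioning (illustrative)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    url: http://localhost:3100
    jsonData:
      derivedFields:
        - name: TraceID
          matcherRegex: '"traceID":"(\w+)"'
          url: '$${__value.raw}'
          datasourceUid: <your-jaeger-datasource-uid>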

There you have it. If you made it this far, thank you for reading. Please check out my other blog posts too!

Observability

A deeper primer and a vendor-neutral approach

During development, we follow a process and have many tools, including functional tests (unit, integration, end-to-end), static analysers, performance regression tests, and others, before we ship to production. In spite of developers' best efforts, bugs do occur in production. Typical tools we have during development may not be applicable to a production setting. What we need is some visibility into our program's internal state and how it behaves in production. For example, error logs can tell us what is going on and include what the user request looked like. A slow endpoint will need to be looked at, and it would be great if we could find out which one and pinpoint exactly where the offending part of the codebase is. These data, or signals, are important to give us insight into what is going on with our applications. For the rest of this post, we will look at available tools that give us this visibility and help us solve this problem.

There are many tools in this crowded space that provide visibility into not only how a program behaves, but also report errors while it runs in production. You may have heard of the term APM, or application performance monitoring, with solutions like Splunk, Datadog, Amazon X-Ray, Sentry.io, the ELK stack, and many others that provide complete (or partial) end-to-end solutions to peek under the hood and give developers the tools to understand a program's behaviour. These solutions work great. But sometimes we want the flexibility of switching to another vendor. It may be because we want features other vendors have, or it may be for cost-saving. When you try to adopt one of these solutions, you might find it hard to migrate later because of the extensive changes needed throughout a codebase, leaving you feeling locked into one ecosystem.

Thus came OpenTelemetry. It merges the efforts of OpenCensus and OpenTracing of yesteryear into a single standard anyone can adopt, so switching between vendors becomes easier. The term OpenTelemetry in this post encompasses the APIs, SDKs and tools that make up this ecosystem. Over the years, to my surprise, OpenTelemetry has been adopted by heavy hitters including the companies I mentioned above, as well as countless startups in the industry. This is great because when switching becomes easy, vendors have a greater incentive to provide a better service, and that only benefits developers and your company's stakeholders. This is the promise of a vendor-neutral approach championed by cncf.io to ensure innovations are accessible to everyone.

Today, OpenTelemetry has advanced enough that I am comfortable recommending it to any software programmer. OpenTelemetry has come a long way (and evolved over the years), so in this post we will give a basic explanation of what OpenTelemetry is and an overview of how its observability signals work under the hood. Then we talk about how its collector tool works and how it helps achieve vendor-neutrality.

Table of Contents

A comprehensive, official demo is located at https://opentelemetry.io/docs/demo, while a minimal example is available at https://codeberg.org/gmhafiz/observability

Observability Signals

Before we go deeper into implementation, we need to know what we are measuring or collecting. In the OpenTelemetry world, these are called signals, which include metrics, tracing, and logs. These signals are the three pillars of observability.

Three Greek columns, each labelled with metrics, tracing, and logs. There is a fourth, much smaller pillar labelled Profile. Image adapted from https://www.hiclipart.com/free-transparent-background-png-clipart-bohsm

Figure 1: Greek pillars are what I imagine every time I hear the term three pillars of observability.

Logs are basically the printf or console.log() output that displays internal messages or the state of a program. This is what developers are most familiar with, and it provides invaluable information when debugging.

Metrics represent aggregated measurements of your system. For example, they can measure how big the traffic to your site is and its latency. These are useful indicators of how your system performs. The measurements can be drilled down to each endpoint. Your system could show 100% uptime judging by your /ping endpoint, but that does not tell you how each endpoint performs. Other than performance, you might be interested in other measurements like the current number of active logged-in users, or the number of user registrations in the past 24 hours.

Tracing may be the least understood signal as it is relatively new. Being able to pinpoint where exactly an error is happening is great (something logs can do), but the call that arrives at that subsystem can originate from multiple places, maybe even from a different microservice. Moreover, an error can be unique to a request, so being able to trace the whole lifecycle of that request is invaluable for finding out how the error came to be.

These are the central signals that we will dive into deeply. There is a fourth signal called profiling. Profiling has been with us for decades, and it is an invaluable development tool that gives us the means to see how much RAM and how many CPU cycles a particular function uses, down to the line number. Profiling in this context refers to having this data live in a production setting (just like the other three signals), and across microservices! For now, profiling as a signal is in its infancy, so we will focus on the other three.

Visualisation

Before diving deeper into the signals, it would be great to visualise what the end product could look like. The screenshot below is an example of an observability dashboard we can have. It gives us a quick glance at important information such as how our endpoints are doing, what logs are being printed, and some hardware statistics.

observability dashboard Figure 2: Single pane of view to observability.

Now we will look into each of the pillars, so let us start with the most understood signal, namely logging.

Logging

When you have a program deployed on a single server, it is easy to view the logs by simply SSH-ing into the server and navigating to the logs directory. You may need some shell scripting skills to find and filter the lines pertaining to the errors you are interested in. The problem comes when you deploy your program to multiple servers. Now you need to SSH into several servers to find the errors you are looking for. What's worse is when you have multiple microservices, and you need to go through all of them just to find the error lines you want.

Obi-Wan meme saying these aren’t the servers you are looking for. Figure 3: Have you ever missed the error lines you are interested in, in spite of your shell scripting skills?

An easy solution is to simply send all of these logs to a central server. There are two ways of doing this: either the program pushes its logs to the central server, or another process does the collecting and batching and sends the logs on. The second approach is what the 12-factor app recommends. We treat logs as a stream of data and write them to stdout, to be pushed to another place where we can tail and filter as needed. The responsibility of managing logs is handed over to another program. To push the logs, the industry standard is fluentd, but there are other tools like its rewrite Fluent Bit, otel-collector, logstash, and Promtail. Since this post is about OpenTelemetry, we will look at its tool called otel-collector.
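
For a Laravel application, for instance, treating logs as a stream can be as simple as switching the log channel, since Laravel ships with a stderr channel in its default config/logging.php. A sketch (the Laravel post above keeps file-based logs and lets another process pick them up instead):

# .env
LOG_CHANNEL=stderr

With that, the container runtime captures everything written to stderr, and a collector such as otel-collector or Promtail can pick it up from the container log path mentioned below.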

Diagram showing logging infrastructure. Each application logs to stdout which are then picked up by Promtail. Promtail then sends the logs to Loki Figure 4: No matter where logs are emitted, they can be channelled to otel-collector.

In the diagram above, the Java Spring Boot app does not save its logs to a file. Instead, we have the option of easy auto-instrumentation with the OpenTelemetry javaagent, which automatically sends the logs to otel-collector.

A containerised application like the Python Django app can have its logs tailed and sent to otel-collector too. In reality, logs in containerised apps live in a place like /var/lib/docker/containers/<container_id>/*.log anyway. Likewise, PHP frameworks like Laravel that save logs to files can have them sent to otel-collector the same way.

Like many tools, otel-collector is flexible with where it can retrieve logs from. The log collection is not limited to API applications. Syslog and logs from the database can also be emitted to otel-collector.

This single otel-collector accepts logs through a standardised API. This means any vendor that supports OpenTelemetry can read your log output.

Choose any log vendor you like Figure 5: Choose any logging backend you like, as long as it supports the otlp format.

Now that you have all logs in one place, you can easily search and filter by application or log level. OpenTelemetry automatically associates each log entry with its origin, which makes this an easy task.

The screenshot below shows a snippet of what a log aggregator like Loki can do. Here, I picked only Laravel logs that contain the word ‘ErrorException’, parsed them as JSON, and kept only entries where the log level is greater than 200. This query language is not standardised yet, but it already looks more readable than writing shell scripts with tail, awk, and grep.
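
The query behind that screenshot looks roughly like this LogQL sketch, assuming the Laravel log lines are shipped as JSON and carry Monolog's numeric level:

{job="laravel"} |= "ErrorException" | json | level > 200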

query logs by searching and filtering Figure 6: Great user interface that allows searching using a readable query language.

Filtering logs based on origin is not the only task OpenTelemetry supports. You will also be interested in finding all logs from a particular request. Each request can have a unique identifier, and we can attach this unique key to every log line of that request. In the log lines below, a unique string (bcdf53g), right after the log level, is associated with a single request lifecycle that retrieves a list of authors from the database.

...
[2023-11-16 13:15:48] INFO bcdf53g listing authors
[2023-11-16 13:15:48] INFO e6c8af8 listing books
[2023-11-16 13:15:50] INFO bcdf53g retrieving authors from database
[2023-11-16 13:15:51] ERROR bcdf53g database error ["[object] (Illuminate\\Database\\QueryException(code: 1045):
...

Now you can filter the logs for that particular request to get a better understanding of how the error came about.

{job="laravel"} |= bcdf53g

This returns only relevant log lines to you and eliminates the noise you do not care about.

...
[2023-11-16 13:15:48] INFO bcdf53g listing authors
[2023-11-16 13:15:50] INFO bcdf53g retrieving authors from database
[2023-11-16 13:15:51] ERROR bcdf53g database error ["[object] (Illuminate\\Database\\QueryException(code: 1045):
...

More details about this unique key are covered in the tracing section. Also note that the logs do not always have to be in JSON format. The plain log line format shown above is still fine, and most vendors provide the ability to filter and search it.

~~

Before moving on to the next signal, there is one thing that must be mentioned: SDK availability.

The OpenTelemetry SDK is fantastic, provided it exists for your programming language. The logs SDK recently reached its stable 1.0 version and is generally available for most major programming languages.

instrument sdk status Figure 7: https://opentelemetry.io/docs/instrumentation/#status-and-releases

But there are some glaring omissions in this table. Considering many observability programs are written in Go, a logs SDK for that language is missing. To make matters worse, Go has no easy auto-instrumentation the way Java does. There are eBPF-based solutions to overcome this, like what Odigos and Grafana Beyla are doing, which are worth keeping an eye on. Python is another major programming language whose logs support is still at the ‘Experimental’ stage. Nevertheless, there is always an alternative, such as using a log forwarder like fluentd or Promtail. You might want to ensure the tools you use are otlp-compatible so that you do not have to re-instrument your code base in the future.

Just like Figure 4, but with an arrow added from logs to Promtail, then Promtail to Loki, with a label on Promtail saying: for SDKs without logging support yet. Figure 8: Alternative pathway for pushing logs into your backend using Promtail.

Metrics

According to the 4 Golden Signals for monitoring systems, we measure traffic, latency, errors, and saturation. For metrics, we care about the RED method: Rate, Errors, and Duration.

Just like logging where logs are channelled to a central place, metrics data are also aggregated through otel-collector and then exposed as an endpoint. By default, metrics data are accessed from otel-collector through http://localhost:8889/metrics. This is the endpoint that a vendor like Prometheus uses to scrape metrics data at a regular interval. Let us have a look at what these metrics data look like.
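
On the Prometheus side, the scrape is a plain job pointed at that port; a minimal sketch, assuming Prometheus can reach the collector under the hostname otel-collector:

# prometheus.yml
scrape_configs:
  - job_name: otel-collector
    scrape_interval: 10s
    static_configs:
      - targets: ['otel-collector:8889']

Now, on to the data itself.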

Let us make a single request to a freshly instrumented application. For example:

curl -v http://localhost:3080/uuid

Then visit the endpoint exposed by otel-collector at http://localhost:8889/metrics to view metrics data. The data being returned is in plaintext (text/plain) with a certain format.

Figure 9: Raw metrics data collected by otel-collector.

For now, we are interested in finding out the request counter. So do a search with Ctrl+F for the metric name called http_server_duration_milliseconds_count.

http_server_duration_milliseconds_count{http_method="GET",http_route="/uuid",http_scheme="http",http_status_code="200",job="java_api",net_host_name="java-api",net_host_port="8080",net_protocol_name="http",net_protocol_version="1.1"} 1

This metric contains several labels in a key="value" format, separated by commas. They tell us that this metric records the /uuid endpoint for a GET request, and that it belongs to a job called java_api. This job label is important because other APIs could emit the same metric name, so we need a way to differentiate the data. At the end of the line, we have a value of 1. Re-run the API call once more and watch how this value changes.

curl -v http://localhost:3080/uuid

Refresh and look for the same metric for /uuid endpoint. You will see that the value has changed.

http_server_duration_milliseconds_count{...cut for brevity...} 2

Notice that at the end of the line, the value has changed from 1 to 2.

So how does this measure the rate? A rate is simply a measurement of change over time. Some time passed between the first time you called the /uuid endpoint and the second. Prometheus records the counter value at each scrape and divides the change by that duration to work out the rate. For example, if the counter goes from 1 to 2 over a 15-second window, the rate is (2 - 1) / 15, roughly 0.067 requests per second. Easy math!

formula for a rate which is after minus before, then divided by time Figure 10: Formula to finding out the rate is simply the delta over time.

~~

What about latency? The amount of time it took to complete a request is stored in buckets. To understand this, let us take it step by step. For this API, the latency is stored as a histogram. Let us construct a hypothetical one:

metrics with explicit buckets Figure 11: Hypothetical cumulative histogram that stores request duration into buckets.

If a request took 97 milliseconds, it is placed in the fourth bar from the left because it falls between 25 and 100 milliseconds. So the count increases for the bar labelled ‘100’.

Anything lower than five milliseconds will be placed in the ‘5’ bucket. On the other extreme, any requests that take longer than three seconds get placed in the infinity(‘inf’) bucket.

Take a look back at the http://localhost:8889/metrics endpoint. I deleted many labels keeping only the relevant ones, so it is easier to see:

http_server_duration_milliseconds_bucket{http_route="/uuid",le="0"} 0
http_server_duration_milliseconds_bucket{http_route="/uuid",le="5"} 0
http_server_duration_milliseconds_bucket{http_route="/uuid",le="10"} 1
http_server_duration_milliseconds_bucket{http_route="/uuid",le="25"} 1
http_server_duration_milliseconds_bucket{http_route="/uuid",le="50"} 1
http_server_duration_milliseconds_bucket{http_route="/uuid",le="75"} 1
http_server_duration_milliseconds_bucket{http_route="/uuid",le="100"} 1
http_server_duration_milliseconds_bucket{http_route="/uuid",le="250"} 2
http_server_duration_milliseconds_bucket{http_route="/uuid",le="500"} 2

From the raw data above, we can see that one request took between 5 and 10 milliseconds, and another between 100ms and 250ms. What may look strange is that buckets such as ‘250’ and ‘500’ also have values. The value for bucket ‘10’ is one because one request's latency fell between 5ms and 10ms. The bucket for ‘250’ counts that same request as well, because a request that took less than 10ms is, of course, also less than or equal (le) to 250ms; together with the second request that landed between 100ms and 250ms, its count becomes two. The same reasoning applies to bucket ‘500’. This type of histogram, which Prometheus uses, is called cumulative, and the ‘le’ labels are predefined, fixed boundaries.

This is a simple example, but it should give you enough intuition about how request durations are stored in histogram buckets.
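
These cumulative buckets are also what latency percentiles are computed from. A sketch of the usual PromQL pattern, estimating the 95th percentile for /uuid over the last five minutes:

histogram_quantile(0.95, sum by (le) (rate(http_server_duration_milliseconds_bucket{http_route="/uuid"}[5m])))

Because the buckets have fixed boundaries, the result is an estimate interpolated within the matching ‘le’ bucket.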

~~

Now that we understand how Prometheus calculates rate and latency, let us visualise them. But first, we need a query language to translate what we want into something Prometheus understands. This query language is called PromQL.

Go to Prometheus’ web UI at http://localhost:9090 and type in the following PromQL query. We reuse the same metric, http_server_duration_milliseconds_count, wrap it with the special rate() function, and ask for a sliding calculation over a one-minute window.

rate( http_server_duration_milliseconds_count{ job="java_api" } [1m] )

Using PromQL like above, we get fancy stuff like line charts for latency and throughput. You can view the line chart by clicking on the ‘Graph’ tab. But visualising from the provided Grafana dashboard looks nicer.

Two charts. One for latency showing two lines: a green 200 OK and a yellow 500 for the uuid API. Second chart showing the throughput for the /uuid API. Figure 12: Line charts showing latency and throughput for /uuid endpoint.

The top half of the screenshot above shows latency for the /uuid endpoint. The green line shows 200 HTTP status responses while the yellow line shows 500 HTTP status responses. Successful responses stay below 1 millisecond while error responses are around 4 milliseconds. The lines can be hovered over to see more details.

The bottom half shows throughput, or the rate per second. The graph shows throughput increasing over time, which reflects our synthetic load generator slowly ramping up the number of users.

Prometheus support in your architecture

Web applications are not the only thing Prometheus can measure. Your managed database may already come with Prometheus support. Typically, such metrics are exposed by the service at a /metrics endpoint, so check the documentation of your software of choice and you might be able to play with Prometheus right now.

Caveat

Metrics are great, but the way they are collected comes with a downside. If you look closely at both the rate and latency calculations, there is always a mismatch between when something happened and when Prometheus recorded it. This is just the nature of Prometheus because it samples state at regular intervals; it does not record every single event the way InfluxDB does. On the flip side, storing metrics data becomes cheap.

More

We have only talked about one type of measurement, the histogram, but Prometheus also supports other metric types such as counters and gauges.

Tracing

As mentioned at the start of this post, knowing where an error occurred is great, but what is better is knowing the path it took to arrive at that particular line of code. If you think this sounds like a stack trace, it is close! A stack trace shows the functions that were called to arrive at a specific piece of offending code. What is so special about tracing in observability? You are not limited to a single program; you have the ability to follow a request's flow across all of your microservices. This is called distributed tracing.

But first, let us start with basic tracing and visualise it on a time axis below through the Grafana UI.

a time axis tracing from go api which is displayed by jaeger through grafana Figure 13: Name and duration of twelve spans for this single trace of this request.

There is a lot going on, but let us focus on what’s important. Every single trace (a unique request) has a random identifier represented as a hex value called the trace ID. In the diagram above, it is 385cfe8270777be20b840671bc246e50. This trace ID is randomly generated for each request by the OpenTelemetry library.

Under the ‘Trace’ section, we see a chart with twelve horizontal, dual-coloured bars. These bars are called spans, and they belong to the single trace above. A span can mean anything, but typically we create one span per unit of work. The span variable you see in the code block below was created for the ListUUID() function. With a manual instrumentation approach, we have to write code to create a span, let the program do the work, then call span.End() before the function exits to actually emit the span to the otel-collector.

// api.go

import (
	"net/http"

	"go.opentelemetry.io/otel"
)

func (s *Server) ListUUID(w http.ResponseWriter, r *http.Request) {
	tracer := otel.Tracer("")
	ctx, span := tracer.Start(r.Context(), "ListUUID")
	defer span.End()

	// do the work, passing ctx to downstream calls so child spans join this trace
	_ = ctx
}

A span can belong to another span. In the trace graph above, we created a child span called “call microservice” under “ListUUID”. And because we performed an HTTP call to the Java API, a span was automatically created for us.

Once the request reached the Java API, all spans were created automatically thanks to auto-instrumentation. We can see the functions that were called, as well as any SQL queries that were made.

Each span not only shows a parent-child relationship, but also its duration. This is invaluable for spotting possible bottlenecks across all of your microservices. We can see the majority of time was spent on the span called ‘SELECT db.uuid’, which took 21.98 milliseconds out of the 23.44 milliseconds total for this request. That span can be clicked to display more details in an expanded view.

diagram showing more details of a single span Figure 14: Each span can be clicked to reveal more information.

Here we see several attributes including the database query. At the bottom, we see this span’s identifier which is 7df1eaa514d8605d.

Thanks to the visualisation of spans, it is easy to spot which part of the code took the most time. Drilling down into slow code is great, but a 23.4-millisecond response time for a user-facing request is nothing to be concerned about. The user interface also lets us filter for slow requests using the search button. For example, we can put a minimum duration of 500 milliseconds into the search filter form.

use grafana UI to filter traces which are 500 milliseconds or more Figure 15: Spans can be filtered to your criteria.

This way, we can catch slow requests, and have the ability to see the breakdown of which part of the code took most of the time.

Automatic Instrumentation

So tracing is great. But manually inserting instrumentation code into each function can be tedious. We did manual instrumentation for the Go API, which was fine because there was not much to instrument. Fortunately, in several languages, code can be instrumented automatically without touching your codebase.

In languages that depend on an interpreter (like Python) or a virtual machine (like this Java demo API), instrumentation can be injected at runtime to capture these OpenTelemetry signals. For example, in Java, simply supply a path to the Java agent and set any configuration through either the startup command line or environment variables, without touching your existing codebase.

java -javaagent:path/to/opentelemetry-javaagent.jar -Dotel.service.name=your-service-name -jar myapp.jar

Programs that compile to native binaries are harder. In this case, eBPF-based solutions like Odigos and Grafana Beyla can be used.

Sampling Traces

Unlike Prometheus, which aggregates records, we could store every single trace. Storing all of them will likely blow your storage budget, and you will find that many individual traces are nearly identical. For that reason, you might want to sample the traces, say, only storing ten percent of them. Be careful with naive sampling, though, because a trace you are interested in might lose its parent context when traces are selected at random. For that reason, the trace SDK provides several sampling rules (https://opentelemetry.io/docs/concepts/sdk-configuration/general-sdk-configuration/#otel_traces_sampler). Sampling this way still means it is possible to miss an error happening in the system.
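
As a concrete sketch, a parent-aware ratio sampler that keeps roughly ten percent of traces can be enabled through the standard SDK environment variables described at that link:

export OTEL_TRACES_SAMPLER=parentbased_traceidratio
export OTEL_TRACES_SAMPLER_ARG=0.1

The parent-based part means a child span inherits whatever sampling decision its parent already made, which avoids the broken-trace problem mentioned above.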

Distributed Tracing

As demonstrated in the demo repository, I have shown distributed tracing across two microservices. To achieve this, each service must be instrumented with the OpenTelemetry SDK. Then, when making a call to another microservice, the span context is attached to the outgoing request for the receiving end to extract and consume.

~~

I hope this demystifies what tracing is. There is more to learn, and a great place to look is the documentation at https://opentelemetry.io/docs/concepts/signals/traces/.

Single Collector

Implementing OpenTelemetry in a program is termed instrumenting. It can be done either manually or automatically. Manual means writing code inside your functions to create and emit traces and spans.

Automatic instrumentation means touching the code either not at all or only minimally, depending on the language. Approaches include eBPF, runtime injection like Java's javaagent, or an extension in PHP's case.

In either case, vendor-neutrality means your application only needs to be instrumented once using an open standard SDK. You may change vendors but your codebase stays untouched. All these signals are funnelled into a single collector before they are dispersed to OpenTelemetry-compatible destinations. Thanks to this SDK, you have fewer things to worry about when moving to another vendor.

As this is an open standard, otel-collector is not the only collector available. Alternatives like Grafana Agent exist too.

Now that we have OpenTelemetry signals in otlp format, any vendor that understands this protocol can be used. You have a growing choice of vendors at your disposal, open source or commercial, including Splunk, Grafana, SigNoz, Elasticsearch, Honeycomb, Lightstep, DataDog and many others.

observability signals are sent to a single place called otel-collector, then forwarded to any vendor you like. Figure 16: A single collector to forward observability signals to any vendor you like.

Having a single component may not look good architecture-wise because it can become a bottleneck. In this context, ‘single’ only means standardising on one format and entry point so that anyone can write to and read from it; it does not have to be a single instance.

One deployment strategy is to run multiple instances of otel-collector and scale horizontally by putting a load balancer in front of them. Another popular approach is to have one otel-collector sit next to each of your programs as a sidecar. You can even put a queue between otel-collector and your vendors to handle load spikes.

Otel-Collector Components

As everything goes through otel-collector, let us go one layer deeper into its components, namely receivers, processors, exporters, and pipelines.

Receivers are how otel-collector receives observability signals. In the config file, we define a receiver called otlp with both gRPC and HTTP as its protocols. By default, otel-collector listens on port 4317 for gRPC and 4318 for HTTP.

# otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
      http:
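
Those defaults can also be spelled out explicitly if the collector needs to listen on a specific interface or port; a sketch:

# otel-collector-config.yaml

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318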

Processors are how data is transformed inside the collector before being sent out. You can optionally batch the data and set a memory limit. The full list of processors is in https://github.com/open-telemetry/opentelemetry-collector/tree/main/processor and the contrib repository. A rate limiter is not one of the available processors, so at higher scale, a third-party queue sitting between otel-collector and your vendors might be needed to alleviate possible bottlenecks.

# otel-collector-config.yaml

processors:
  batch:
  memory_limiter:
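
Depending on the collector version, the memory_limiter processor expects its limits to be spelled out rather than left empty; a sketch with illustrative values:

# otel-collector-config.yaml

processors:
  batch:
  memory_limiter:
    check_interval: 1s
    limit_mib: 400
    spike_limit_mib: 100

The collector documentation also recommends placing memory_limiter first in a pipeline's processor list, before batch.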

Exporters are how we define where to send these signals. Here we have three destinations. Metrics are sent to Prometheus, which is fast becoming the de-facto industry standard for metrics. Tracing, labelled as otlp, is sent to Jaeger, while logs are sent to Loki via its push URL. The debug exporter simply prints whatever passes through to the collector's console. As you can see, switching your backend from one to another is as easy as swapping a new exporter (vendor) into this file.

# otel-collector-config.yaml

exporters:
  prometheus:
    endpoint: otel-collector:8889

  debug:
    verbosity: detailed

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

  otlp:
    endpoint: jaeger-all-in-one:4317

If you want to emit to multiple vendors, that can be done too. Simply add a suffix separated by a slash (like /2) to the exporter name. Below, we choose to send logs to both loki and dataprepper by adding an ‘otlp/logs’ exporter.

# otel-collector-config.yaml

exporters:
  prometheus:
    endpoint: otel-collector:8889

  debug:
    verbosity: detailed

  loki:
    endpoint: http://loki:3100/loki/api/v1/push

  otlp/logs:
    endpoint: dataprepper:21892
    tls:
      insecure: true

  otlp:
    endpoint: jaeger-all-in-one:4317

Metrics data are exposed as an endpoint at otel-collector:8889/metrics. For vendors that support the OpenTelemetry protocol (otlp), these metrics can instead be pushed straight from otel-collector. For example, metrics can be pushed to Prometheus using its http://prometheus:9090/otlp/v1/metrics endpoint.

Pipelines are the binding component that defines how data flows from one part to another. Each of traces, metrics, and logs describes where to receive data from, what processing needs to be done, and where to send the data.

Lastly, extensions can be added to monitor otel-collector's own health and performance. Examples listed in the repository include profiling (pprof), zPages, and others.

# otel-collector-config.yaml

service:

  extensions: [pprof, zpages, health_check]
  
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, memory_limiter]
      exporters: [debug, otlp]
    metrics:
      receivers: [otlp]
      processors: [batch, memory_limiter]
      exporters: [debug, prometheus]
    logs:
      receivers: [otlp]
      exporters: [debug, loki]

Having a single component that funnels all data through otel-collector is great because when you want to switch to another log vendor, you simply add that vendor's exporter to the logs pipeline's exporters array in this configuration file.

Vendor Neutrality

A great deal of effort has been made to ensure vendor-neutrality in terms of instrumentation and OpenTelemetry protocol support. A standard SDK means you can instrument your code once, whether automatically or manually. Vendors supporting otlp means you can easily pick another vendor of your choosing by adding or swapping an exporter in your yaml file. The other two important parts of achieving vendor-neutrality are dashboards and alerting. Note that these components are not part of OpenTelemetry, but it is important to discuss them as part of the ecosystem as a whole.

Both Prometheus and Jaeger have their own UIs at http://localhost:9090 and http://localhost:16686 respectively. However, it is easier to have all information on one screen rather than shuffling between different tabs. Grafana makes this easy, and it comes with a lot of bells and whistles too. It gives me the information I want, and it looks great. However, are those visualisations portable if I want to switch to another vendor?

Take the case of Prometheus. Data visualisation is done using its query language, PromQL. While it may be the dominant metrics solution, competing vendors might have different ideas about the DSL used to create visualisations. The same goes for querying logs; there isn't a standard yet. For this, a working group to create a standardised, unified query language has been started.

The second concern is alerting. It is crucial because when an issue arises in your application, say latency for a specific endpoint passes a certain threshold, it needs to be acted upon. Metrics like Mean Time to Acknowledge (MTTA), Mean Time to Resolve (MTTR), and others can be crucial for your service level agreement (SLA). Performing within SLA margins keeps both customers and pockets happy.

Alerting rules you have made in one vendor might not be portable to another since a standard does not exist.

Conclusion

In this post, we have learned about the three important OpenTelemetry signals: logs, metrics, and tracing. OpenTelemetry SDKs make it easy to instrument your application in your favourite language. Then we talked about otel-collector, which receives, transforms, and emits these signals through a standard API. Vendors that support the OpenTelemetry protocol give us the freedom to pick and choose however we like without worrying about re-instrumenting our codebase.

The approach OpenTelemetry is taking achieves vendor-neutrality, which benefits everyone. For developers, it removes the headache of re-coding. For business owners, it can be a cost-saving measure. For vendors, rising popularity means more potential customers coming into this observability space.

For many years, the OpenTelemetry project has been the second most active CNCF project, right behind Kubernetes, amongst hundreds. It is maturing fast, and it is great to see the industry working together for the common good.

hundreds of projects in CNCF.io Figure 17: OpenTelemetry belongs to CNCF.

Further Reads

Spec Overview https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/overview.md

CNCF projects https://landscape.cncf.io/