Distributed Tracing
Implement tracing across multiple processes and workers in your Python app.
Distributed applications, especially those involving multiple Python processes, threads, or asynchronous workers, introduce unique challenges for tracing. While the SGP Tracing SDK automatically manages span context within a single Python process (using `contextvars`), this automatic propagation does not extend across process boundaries or when work is explicitly dispatched to new, independent execution contexts.
The SGP backend expects well-formed trace data with clear parent-child relationships. Without careful management, there is a strong chance of race conditions or orphaned spans if child spans are reported before their parents, leading to incomplete or broken traces in the UI.
This guide outlines key strategies for effective tracing in multi-process and multi-worker environments.
Understanding Context Propagation
When using context managers (`with tracing.create_span(...)`), the SDK automatically sets the current span and trace in a context-local variable. However, the new execution context will not automatically inherit the `trace_id` or `parent_id` from the originating process when you:

- Spawn a new Python process (e.g., using `multiprocessing`).
- Enqueue a task to a background job queue (e.g., Celery, RQ).
- Dispatch work to a separate thread pool where contextvars might not propagate by default (though `threading` typically handles this better than `multiprocessing`).

In these cases, you must explicitly pass the context yourself.
Strategies for Distributed Tracing
One Trace Per Worker (Simplest)
For parallel work where strict hierarchical linking of every operation across workers isn’t necessary, the easiest approach is to create an independent trace for each worker or process.
You can then use a `group_id` to logically link these independent traces together, allowing you to see all related activity in the Traces page, even if they don’t form a single, continuous trace hierarchy. This is ideal for scenarios where workers process independent units of work concurrently.
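A minimal sketch of this pattern, using only the standard library: one shared `group_id` is generated up front and handed to every worker, while each worker produces its own unrelated `trace_id`. The `run_worker` helper and the SDK call in its comment are illustrative assumptions, not the SDK's actual API:

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

def run_worker(group_id: str, item: int) -> dict:
    # In a real worker you would start an independent trace tagged with the
    # shared group, e.g. tracing.create_span(..., group_id=group_id)
    # (call shape assumed).
    trace_id = str(uuid.uuid4())  # each worker gets its own, unrelated trace
    return {"group_id": group_id, "trace_id": trace_id, "result": item * 2}

group_id = str(uuid.uuid4())  # one shared ID links the traces in the UI
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(lambda i: run_worker(group_id, i), range(3)))
```

Because each trace is self-contained, there is no parent-child ordering to worry about between workers; the `group_id` is pure metadata.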
Extending a Trace Across Workers
If you need to maintain a single, continuous trace hierarchy where operations performed in separate workers are direct children of a span in the main process (or another worker), you must manually propagate the tracing context.
This involves:

- Retrieving the `trace_id` and the `span_id` of the parent span in the originating process.
- Passing these IDs to the new worker/process (e.g., as function arguments or message queue payload fields).
- Using these explicit IDs when creating new spans in the worker, ensuring they are correctly linked as children.
Example: Passing Context via Function Arguments
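A minimal sketch of the pattern: the originating process reads the active span's IDs and passes them to the worker as ordinary arguments. The `trace_id`/`parent_id` keyword names in the comments, and the `span.trace_id`/`span.id` attribute names, are assumptions based on the IDs described above, not confirmed SDK signatures:

```python
import uuid

def process_item(trace_id: str, parent_span_id: str, payload: int) -> dict:
    """Runs in a separate worker; receives tracing context as plain arguments."""
    # Create the worker's span with explicit IDs so it is linked as a child
    # of the originating span (keyword names assumed):
    #   with tracing.create_span(name="process_item", trace_id=trace_id,
    #                            parent_id=parent_span_id):
    #       ...
    return {"trace_id": trace_id, "parent_id": parent_span_id, "result": payload * 2}

# In the originating process: read the active span's IDs (attribute names
# assumed) and hand them to the worker, e.g. as multiprocessing args or
# fields in a Celery/RQ task payload.
parent_trace_id = str(uuid.uuid4())  # stand-in for span.trace_id
parent_span_id = str(uuid.uuid4())   # stand-in for span.id
record = process_item(parent_trace_id, parent_span_id, 21)
```

The same dictionary of IDs works unchanged as a message-queue payload, since both values are plain strings.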
Important Considerations
- `tracing.flush_queue()`: Always call `tracing.flush_queue()` or `span.flush()` in the originating process before enqueuing a job or spawning a process that creates child spans. This helps ensure the parent span (and any preceding spans in that context) is sent to the backend before its children, reducing the chance of broken traces.
- Worker Initialization: Each independent Python process (e.g., a new `multiprocessing.Process`) will have its own tracing queue manager. Ensure `tracing.init()` is called within each worker’s entry point if you expect it to send tracing data. This typically means calling `tracing.init()` at the start of the function that the worker executes.
- Error Handling: In distributed systems, be diligent with error handling and ensure spans are ended correctly, even if an exception occurs. Context managers (`with tracing.create_span(...)`) handle this automatically within their scope.
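Putting the first two considerations together, a worker entry point and its dispatcher might look like the sketch below. The SDK calls appear only as comments because their exact signatures may differ; the runnable skeleton is standard `multiprocessing`:

```python
import multiprocessing

def worker_entry(trace_id: str, parent_span_id: str) -> dict:
    # A fresh process has its own tracing queue manager, so initialize the
    # SDK at the top of the entry point before creating any spans:
    #   tracing.init()
    #   with tracing.create_span(name="worker-step", trace_id=trace_id,
    #                            parent_id=parent_span_id):
    #       ...  # the context manager ends the span even if this raises
    return {"trace_id": trace_id, "parent_id": parent_span_id}

def dispatch_worker(trace_id: str, parent_span_id: str) -> multiprocessing.Process:
    # Flush pending spans first so the parent reaches the backend before
    # any children the worker reports:
    #   tracing.flush_queue()
    proc = multiprocessing.Process(
        target=worker_entry, args=(trace_id, parent_span_id)
    )
    proc.start()
    return proc
```

Note that `worker_entry` receives the context as plain arguments, exactly as in the function-argument example above, so the same function works as a Celery/RQ task body.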