Documentation

Everything you need to integrate Watchtower AI into your ML pipeline.

Installation

Install the Watchtower SDK from PyPI using pip:

pip install watchtower

Requirements: Python 3.8+ • The SDK depends on requests, pandas, and numpy.

Configuration

The SDK uses environment variables for zero-config setup in production. Set these before running your application:

# Required: Your project's API key from the Watchtower dashboard
export WATCHTOWER_API_KEY="your_project_api_key"

# Required for cloud: Your deployed backend URL
export WATCHTOWER_API_URL="https://watchtower-ai-production-604f.up.railway.app"

If WATCHTOWER_API_URL is not set, the SDK defaults to http://localhost:8000.

Tip: You can also pass api_key and endpoint directly to any monitor constructor. Environment variables are simply the recommended approach for production.

Quick Start

Here's the fastest way to start logging data:

import pandas as pd
from watchtower.monitor import WatchtowerInputMonitor

# Initialize (reads WATCHTOWER_API_KEY and WATCHTOWER_API_URL from env)
monitor = WatchtowerInputMonitor(project_name="My ML Project")

# Load and send your data
df = pd.read_csv("production_data.csv")
response = monitor.log(df)
print(response)

That's it! Your data is now being monitored for drift and quality issues on the Watchtower dashboard.

SDK 1: Feature Monitoring — WatchtowerInputMonitor

This is the primary SDK for monitoring tabular/structured data. Use it to log feature vectors (model inputs) so Watchtower can detect data drift, validate data quality, and alert you when your production data deviates from training data.

Constructor

Parameter Type Required Description
project_name str Yes The name of your project (must match the project created on the dashboard).
api_key str No API key. Falls back to WATCHTOWER_API_KEY env var.
endpoint str No Backend URL. Falls back to WATCHTOWER_API_URL env var.

Usage & Examples

Logging a Pandas DataFrame

import pandas as pd
from watchtower.monitor import WatchtowerInputMonitor

monitor = WatchtowerInputMonitor(
    project_name="Credit Scoring v2",
    api_key="your_project_api_key",
    endpoint="https://watchtower-ai-production-604f.up.railway.app"
)

df = pd.DataFrame({
    "age": [25, 34, 45, 52, 61],
    "income": [45000, 78000, 92000, 55000, 110000],
    "credit_score": [680, 720, 750, 630, 800],
    "loan_amount": [15000, 25000, 35000, 10000, 50000]
})

response = monitor.log(df, stage="model_input")
print(response)

Logging with Custom Metadata

from datetime import datetime

response = monitor.log(
    features=df,
    stage="model_input",
    event_time=datetime(2026, 2, 13, 12, 0, 0),
    metadata={"batch_id": "batch_042", "environment": "production"}
)

Supported Data Formats: The log() method accepts Pandas DataFrames, Python dictionaries, lists of dicts, and NumPy arrays. All are automatically serialized.
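For intuition, here is a sketch of how those formats could be normalized into row records before upload. This is illustrative only, not the SDK's internal code; the `to_records` helper and the positional `f0, f1, …` column names for unnamed NumPy arrays are our own assumptions.

```python
import numpy as np
import pandas as pd

def to_records(features):
    """Illustrative sketch: normalize the formats log() accepts into a
    list of dicts. Not the SDK's actual serialization code."""
    if isinstance(features, pd.DataFrame):
        return features.to_dict(orient="records")
    if isinstance(features, dict):
        return [features]                      # a single row
    if isinstance(features, np.ndarray):
        # Unnamed columns get hypothetical positional names: f0, f1, ...
        return [
            {f"f{i}": float(v) for i, v in enumerate(row)}
            for row in np.atleast_2d(features)
        ]
    return list(features)                      # already a list of dicts
```

Under this sketch, `monitor.log({"age": 25, "income": 45000})` and `monitor.log(np.array([[25, 45000]]))` both end up as one-row batches.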

Drift Detection Tests

Once enough data is ingested, Watchtower automatically runs the following statistical tests to detect drift between your baseline (training) data and current (production) data:

  • Mean Shift (Statistical): Measures the relative change in the mean value of each feature. A large shift indicates the central tendency of your data has changed.
  • Median Shift (Statistical): Measures the relative change in the median. More robust to outliers than the mean, and useful for skewed distributions.
  • Variance Shift (Statistical): Detects changes in the spread/dispersion of your data. A widening or narrowing variance often signals upstream data pipeline issues.
  • Kolmogorov-Smirnov Test (Distribution): A non-parametric test that compares the entire cumulative distributions. If the p-value falls below the threshold, the distributions are statistically different.
  • Population Stability Index, PSI (Distribution): Quantifies how much the distribution has shifted. PSI < 0.1 = no drift, 0.1–0.25 = moderate drift, > 0.25 = significant drift.
  • Model-Based Drift (ML-Based): Trains a RandomForest classifier to distinguish between baseline and current data. If the classifier's accuracy exceeds the 50% threshold (i.e., it separates the two datasets better than chance), drift is detected.
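The statistical shift tests boil down to a relative-change comparison against a threshold. A minimal stdlib-only sketch (the `relative_shift` helper is ours; the 10% and 20% cutoffs match the defaults in the threshold table below):

```python
from statistics import mean, pvariance

def relative_shift(baseline, current, statistic):
    """Relative change in a summary statistic between two samples."""
    base = statistic(baseline)
    return abs(statistic(current) - base) / abs(base)

baseline = [25, 34, 45, 52, 61]
current = [31, 42, 55, 63, 74]   # everything drifted upward

mean_drift = relative_shift(baseline, current, mean) > 0.10       # mean_threshold
var_drift = relative_shift(baseline, current, pvariance) > 0.20   # variance_threshold
```

With real data you would run this per feature column, just as Watchtower does server-side.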

Threshold Configuration

Watchtower uses sensible defaults for all drift thresholds. You can customize them per-project via the dashboard or the API.

Threshold Default Value Description
mean_threshold 0.10 (10%) Maximum allowed relative change in mean before flagging drift.
median_threshold 0.10 (10%) Maximum allowed relative change in median.
variance_threshold 0.20 (20%) Maximum allowed relative change in variance.
ks_pvalue_threshold 0.05 If p-value is below this, the KS test flags drift.
psi_thresholds [0.1, 0.25] PSI severity bands: < 0.1 = None, 0.1–0.25 = Moderate, > 0.25 = High.
psi_bins 10 Number of histogram bins used for PSI calculation.
min_samples 50 Minimum data points required for valid statistical tests.
alert_threshold 2 Number of individual test failures needed to trigger an overall drift alert.
model_based_drift_threshold 0.50 RandomForest accuracy above this value indicates drift.
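To make the psi_bins and psi_thresholds settings concrete, here is a textbook PSI calculation. This is the standard formula, not necessarily Watchtower's exact implementation; the small floor constant is our own guard against log(0).

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples (textbook formula)."""
    # Bin edges come from the baseline distribution
    edges = np.histogram_bin_edges(baseline, bins=bins)
    edges[0], edges[-1] = -np.inf, np.inf   # catch out-of-range current values
    base_pct = np.histogram(baseline, bins=edges)[0] / len(baseline)
    curr_pct = np.histogram(current, bins=edges)[0] / len(current)
    # Floor both so empty bins don't produce log(0) or division by zero
    base_pct = np.clip(base_pct, 1e-6, None)
    curr_pct = np.clip(curr_pct, 1e-6, None)
    return float(np.sum((curr_pct - base_pct) * np.log(curr_pct / base_pct)))
```

An unchanged distribution scores 0; a strongly shifted one lands well above the 0.25 "significant drift" band.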

Data Quality Checks

Beyond drift, Watchtower automatically performs quality checks on every batch of data you log:

  • Missing Values: Identifies columns with null/NaN values and reports the percentage per column.
  • Duplicate Rows: Detects and counts duplicate records in the batch.
  • Schema Validation: Verifies that the number of columns and their data types match the expected schema from the first batch.
  • LLM Interpretation: An AI-powered summary of drift results, explaining what changed and why it matters.
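The first three checks are easy to reproduce locally as a pre-flight step before logging. A sketch in plain pandas; the `quality_report` helper and its report keys are ours, not part of the SDK:

```python
import pandas as pd

def quality_report(df, expected_columns=None):
    """Illustrative local version of Watchtower's batch quality checks."""
    report = {
        # Percentage of null/NaN values per column (Missing Values check)
        "missing_pct": (df.isna().mean() * 100).to_dict(),
        # Count of exact duplicate rows in the batch (Duplicate Rows check)
        "duplicate_rows": int(df.duplicated().sum()),
    }
    if expected_columns is not None:
        # Schema check: column names must match the first batch's schema
        report["schema_ok"] = list(df.columns) == list(expected_columns)
    return report
```

Running this before `monitor.log(df)` lets you catch a broken upstream join without waiting for the dashboard alert.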

SDK 2: Prediction Monitoring — WatchtowerModelMonitor

Use this SDK to monitor your model outputs and performance metrics over time. It supports both classification and regression models.

Constructor

Parameter Type Required Description
project_name str Yes The name of your project.
api_key str No API key. Falls back to env var.
endpoint str No Backend URL. Falls back to env var.
model_type str No "classification" or "regression".

Usage & Examples

Logging Predictions with Metrics (Classification)

from watchtower.monitor import WatchtowerModelMonitor

model_monitor = WatchtowerModelMonitor(
    project_name="Fraud Detector",
    model_type="classification",
    api_key="your_project_api_key",
    endpoint="https://watchtower-ai-production-604f.up.railway.app"
)

# Log predictions along with current performance metrics
predictions = [0, 1, 0, 0, 1, 1, 0, 1]

response = model_monitor.log(
    predictions=predictions,
    accuracy=0.92,
    precision=0.89,
    recall=0.95,
    f1_score=0.91,
    roc_auc=0.96,
    metadata={"batch_id": "eval_batch_7"}
)
print(response)

Logging Predictions with Metrics (Regression)

model_monitor = WatchtowerModelMonitor(
    project_name="House Price Predictor",
    model_type="regression"
)

predictions = [250000, 180000, 320000, 410000]

response = model_monitor.log(
    predictions=predictions,
    mae=12500.0,
    mse=225000000.0,
    rmse=15000.0,
    r2_score=0.87
)

Supported Metrics

Classification

  • accuracy — Overall correctness (0–1)
  • precision — True positives / predicted positives
  • recall — True positives / actual positives
  • f1_score — Harmonic mean of precision & recall
  • roc_auc — Area under the ROC curve

Regression

  • mae — Mean Absolute Error
  • mse — Mean Squared Error
  • rmse — Root Mean Squared Error
  • r2_score — R-squared coefficient
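If you don't already have these numbers, they are quick to compute from predictions and ground truth (in practice you would likely use sklearn.metrics). A pure-Python sketch for the regression metrics; the `regression_metrics` helper is our own:

```python
import math

def regression_metrics(y_true, y_pred):
    """Compute the regression metrics WatchtowerModelMonitor.log() accepts."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1 - (mse * n) / ss_tot
    return {"mae": mae, "mse": mse, "rmse": math.sqrt(mse), "r2_score": r2}

metrics = regression_metrics(
    y_true=[260000, 175000, 330000, 400000],
    y_pred=[250000, 180000, 320000, 410000],
)
# The dict unpacks straight into the logger:
# model_monitor.log(predictions=y_pred, **metrics)
```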

SDK 3: LLM Monitoring — WatchtowerLLMMonitor

Designed for Generative AI / LLM applications. Log every prompt-response pair and get automated analysis for toxicity, response quality, token usage, and semantic drift.

Constructor

Parameter Type Required Description
api_key str Yes API key for authentication.
project_name str Yes The name of your LLM project.
endpoint str No Backend URL. Defaults to http://localhost:8000.
timeout int No Request timeout in seconds. Default: 60.

Usage & Examples

Logging an LLM Interaction

from watchtower.llm_monitor import WatchtowerLLMMonitor

llm_monitor = WatchtowerLLMMonitor(
    api_key="your_api_key",
    project_name="Customer Support Bot",
    endpoint="https://watchtower-ai-production-604f.up.railway.app"
)

response = llm_monitor.log_interaction(
    input_text="How do I reset my password?",
    response_text="Navigate to Settings > Security > Reset Password. You will receive a confirmation email.",
    metadata={
        "model": "gpt-4",
        "latency_ms": 320,
        "user_id": "user_abc123",
        "session_id": "sess_789"
    }
)
print(response)

Batch Logging in a Loop

# Log multiple interactions from a conversation
conversations = [
    {"input": "What are your hours?", "output": "We are open 9 AM - 5 PM, Mon-Fri."},
    {"input": "Can I speak to a manager?", "output": "I'll transfer you to our management team."},
]

for conv in conversations:
    llm_monitor.log_interaction(
        input_text=conv["input"],
        response_text=conv["output"]
    )
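A common pattern is to wrap your model call so logging can never be forgotten. A sketch: `monitored_generate` is our own helper, and `generate` stands in for whatever callable produces your model's response.

```python
def monitored_generate(llm_monitor, generate, prompt, **metadata):
    """Call the model, then log the prompt/response pair to Watchtower."""
    response_text = generate(prompt)
    llm_monitor.log_interaction(
        input_text=prompt,
        response_text=response_text,
        metadata=metadata or None,   # omit metadata when none was passed
    )
    return response_text
```

Usage might look like `answer = monitored_generate(llm_monitor, call_my_model, "What are your hours?", model="gpt-4")`.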

Evaluation Features

When you log LLM interactions, Watchtower automatically evaluates them on the backend:

  • 🛡️ Toxicity Detection (Safety): Each response is scanned using the Detoxify library. Scores above the configurable threshold (default: 0.5) are flagged as toxic.
  • 📊 Token Length Tracking (Performance): Response token lengths are tracked over time. Sudden increases or decreases in verbosity can signal model behavior changes.
  • 📉 Token Length Drift (Distribution): Compares average token lengths between baseline and monitoring windows. Drift threshold default: 15% change.
  • 🧠 LLM Judge Evaluation (AI-Powered): Uses a secondary LLM to evaluate response quality, relevance, and hallucination risk with configurable thresholds.
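The token-length drift check reduces to comparing window averages against the 15% default threshold. A sketch, using whitespace splitting as a stand-in tokenizer (Watchtower's actual tokenizer is not specified here) and a helper name of our own:

```python
def token_length_drift(baseline_responses, current_responses, threshold=0.15):
    """Flag drift when average token counts differ by more than `threshold`."""
    def avg_tokens(responses):
        # Whitespace split as a crude stand-in for real tokenization
        return sum(len(r.split()) for r in responses) / len(responses)

    base = avg_tokens(baseline_responses)
    change = abs(avg_tokens(current_responses) - base) / base
    return {"relative_change": change, "drift": change > threshold}
```

A bot that suddenly answers in five words instead of fifty would trip this check long before users complain.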