Documentation
Everything you need to integrate Watchtower AI into your ML pipeline.
Installation
Install the Watchtower SDK from PyPI using pip:
pip install watchtower
This installs the SDK along with its dependencies: requests, pandas, and numpy.
Configuration
The SDK uses environment variables for zero-config setup in production. Set these before running your application:
# Required: Your project's API key from the Watchtower dashboard
export WATCHTOWER_API_KEY="your_project_api_key"
# Required for cloud: Your deployed backend URL
export WATCHTOWER_API_URL="https://watchtower-ai-production-604f.up.railway.app"
If WATCHTOWER_API_URL is not set, the SDK defaults to http://localhost:8000.
You can also pass api_key and endpoint directly to any monitor constructor. Environment variables are simply the recommended approach for production.
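The precedence described above (explicit argument, then environment variable, then localhost default) can be sketched in plain Python. This is an illustration only; resolve_endpoint and resolve_api_key are hypothetical helpers, not part of the SDK:

```python
import os

DEFAULT_ENDPOINT = "http://localhost:8000"

def resolve_endpoint(endpoint=None):
    """Illustrative precedence: explicit argument > WATCHTOWER_API_URL > localhost."""
    return endpoint or os.environ.get("WATCHTOWER_API_URL") or DEFAULT_ENDPOINT

def resolve_api_key(api_key=None):
    """Illustrative precedence: explicit argument > WATCHTOWER_API_KEY env var."""
    return api_key or os.environ.get("WATCHTOWER_API_KEY")
```

With neither an argument nor the environment variable set, resolve_endpoint() falls back to http://localhost:8000, matching the default noted above.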
Quick Start
Here's the fastest way to start logging data:
import pandas as pd
from watchtower.monitor import WatchtowerInputMonitor
# Initialize (reads WATCHTOWER_API_KEY and WATCHTOWER_API_URL from env)
monitor = WatchtowerInputMonitor(project_name="My ML Project")
# Load and send your data
df = pd.read_csv("production_data.csv")
response = monitor.log(df)
print(response)
That's it! Your data is now being monitored for drift and quality issues on the Watchtower dashboard.
SDK 1: Feature Monitoring — WatchtowerInputMonitor
This is the primary SDK for monitoring tabular/structured data. Use it to log feature vectors (model inputs) so Watchtower can detect data drift, validate data quality, and alert you when your production data deviates from training data.
Constructor
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_name | str | Yes | The name of your project (must match the project created on the dashboard). |
| api_key | str | No | API key. Falls back to the WATCHTOWER_API_KEY env var. |
| endpoint | str | No | Backend URL. Falls back to the WATCHTOWER_API_URL env var. |
Usage & Examples
Logging a Pandas DataFrame
import pandas as pd
from watchtower.monitor import WatchtowerInputMonitor
monitor = WatchtowerInputMonitor(
    project_name="Credit Scoring v2",
    api_key="your_project_api_key",
    endpoint="https://watchtower-ai-production-604f.up.railway.app"
)
df = pd.DataFrame({
"age": [25, 34, 45, 52, 61],
"income": [45000, 78000, 92000, 55000, 110000],
"credit_score": [680, 720, 750, 630, 800],
"loan_amount": [15000, 25000, 35000, 10000, 50000]
})
response = monitor.log(df, stage="model_input")
print(response)
Logging with Custom Metadata
from datetime import datetime
response = monitor.log(
features=df,
stage="model_input",
event_time=datetime(2026, 2, 13, 12, 0, 0),
metadata={"batch_id": "batch_042", "environment": "production"}
)
The log() method accepts Pandas DataFrames, Python dictionaries, lists of dicts, and NumPy arrays. All are automatically serialized.
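As an illustration of the accepted input shapes, each of the following normalizes to the same tabular form. This mirrors what the serialization does conceptually; the SDK's internals may differ:

```python
import numpy as np
import pandas as pd

# A dict of column -> values
as_dict = {"age": [25, 34], "income": [45000, 78000]}

# A list of row dicts
as_records = [{"age": 25, "income": 45000}, {"age": 34, "income": 78000}]

# A NumPy array (column names are positional in this case)
as_array = np.array([[25, 45000], [34, 78000]])

df_from_dict = pd.DataFrame(as_dict)
df_from_records = pd.DataFrame(as_records)
df_from_array = pd.DataFrame(as_array, columns=["age", "income"])

# All three produce an equivalent 2x2 table
assert df_from_dict.equals(df_from_records)
```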
Drift Detection Tests
Once enough data is ingested, Watchtower automatically runs the following statistical tests to detect drift between your baseline (training) data and current (production) data:
Mean Shift [Statistical]
Measures the relative change in the mean value of each feature. A large shift indicates the central tendency of your data has changed.
Median Shift [Statistical]
Measures the relative change in the median. More robust to outliers than the mean, useful for skewed distributions.
Variance Shift [Statistical]
Detects changes in the spread/dispersion of your data. A widening or narrowing variance often signals upstream data pipeline issues.
Kolmogorov-Smirnov Test [Distribution]
A non-parametric test that compares the entire cumulative distribution. If the p-value falls below the threshold, the distributions are statistically different.
Population Stability Index (PSI) [Distribution]
Quantifies how much the distribution has shifted. PSI < 0.1 = no drift, 0.1–0.25 = moderate drift, > 0.25 = significant drift.
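The PSI calculation can be sketched in a few lines of NumPy. This is a simplified illustration using equal-width bins derived from the baseline; Watchtower's own binning strategy may differ:

```python
import numpy as np

def population_stability_index(baseline, current, bins=10):
    """PSI = sum((p_i - q_i) * ln(p_i / q_i)) over histogram bins."""
    edges = np.histogram_bin_edges(baseline, bins=bins)
    p, _ = np.histogram(baseline, bins=edges)
    q, _ = np.histogram(current, bins=edges)
    # Convert counts to proportions; a small epsilon avoids division by zero
    eps = 1e-6
    p = np.clip(p / p.sum(), eps, None)
    q = np.clip(q / q.sum(), eps, None)
    return float(np.sum((p - q) * np.log(p / q)))

rng = np.random.default_rng(0)
baseline = rng.normal(0, 1, 1000)
identical = rng.normal(0, 1, 1000)   # same distribution, fresh sample
shifted = rng.normal(1.5, 1, 1000)   # mean shifted by 1.5 standard deviations

print(population_stability_index(baseline, identical))  # below 0.1 (no drift)
print(population_stability_index(baseline, shifted))    # above 0.25 (significant)
```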
Model-Based Drift [ML-Based]
Trains a RandomForest classifier to distinguish between baseline and current data. If the classifier's accuracy exceeds the 0.50 threshold, drift is detected.
Threshold Configuration
Watchtower uses sensible defaults for all drift thresholds. You can customize them per-project via the dashboard or the API.
| Threshold | Default Value | Description |
|---|---|---|
| mean_threshold | 0.10 (10%) | Maximum allowed relative change in mean before flagging drift. |
| median_threshold | 0.10 (10%) | Maximum allowed relative change in median. |
| variance_threshold | 0.20 (20%) | Maximum allowed relative change in variance. |
| ks_pvalue_threshold | 0.05 | If the p-value is below this, the KS test flags drift. |
| psi_thresholds | [0.1, 0.25] | PSI severity bands: < 0.1 = None, 0.1–0.25 = Moderate, > 0.25 = High. |
| psi_bins | 10 | Number of histogram bins used for PSI calculation. |
| min_samples | 50 | Minimum data points required for valid statistical tests. |
| alert_threshold | 2 | Number of individual test failures needed to trigger an overall drift alert. |
| model_based_drift_threshold | 0.50 | RandomForest accuracy above this value indicates drift. |
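How the relative-change thresholds combine into an overall alert can be sketched as follows. This is a simplified illustration of the documented defaults, not the backend's actual code:

```python
import numpy as np

def run_drift_checks(baseline, current,
                     mean_threshold=0.10,
                     median_threshold=0.10,
                     variance_threshold=0.20,
                     alert_threshold=2):
    """Flag overall drift once alert_threshold individual tests fail."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)

    def rel_change(before, after):
        return abs(after - before) / abs(before) if before != 0 else float(after != before)

    failures = {
        "mean": bool(rel_change(baseline.mean(), current.mean()) > mean_threshold),
        "median": bool(rel_change(np.median(baseline), np.median(current)) > median_threshold),
        "variance": bool(rel_change(baseline.var(), current.var()) > variance_threshold),
    }
    n_failed = sum(failures.values())
    return {"failures": failures, "n_failed": n_failed,
            "drift_alert": n_failed >= alert_threshold}

baseline = [10.0, 12.0, 11.0, 13.0, 12.0]
drifted = [15.0, 17.0, 16.0, 18.0, 17.0]  # same spread, shifted center
print(run_drift_checks(baseline, drifted))
```

Here the mean and median tests both fail while variance passes, so the two failures meet the default alert_threshold and trigger an alert.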
Data Quality Checks
Beyond drift, Watchtower automatically performs quality checks on every batch of data you log:
- Missing Values: Identifies columns with null/NaN values and reports the percentage per column.
- Duplicate Rows: Detects and counts duplicate records in the batch.
- Schema Validation: Verifies that the number of columns and their data types match the expected schema from the first batch.
- LLM Interpretation: An AI-powered summary of drift results, explaining what changed and why it matters.
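The first three checks can be reproduced locally with pandas. This is an illustration of what the backend reports, not its actual implementation; quality_report is a hypothetical helper:

```python
import numpy as np
import pandas as pd

def quality_report(df, expected_dtypes=None):
    """Missing-value percentages, duplicate count, and an optional schema check."""
    report = {
        "missing_pct": (df.isna().mean() * 100).round(2).to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }
    if expected_dtypes is not None:
        # Schema check: same columns in the same order, with matching dtypes
        report["schema_ok"] = (
            list(df.columns) == list(expected_dtypes)
            and all(str(df[c].dtype) == expected_dtypes[c] for c in expected_dtypes)
        )
    return report

batch = pd.DataFrame({
    "age": [25, 34, np.nan, 25],
    "income": [45000, 78000, 92000, 45000],
})
print(quality_report(batch, {"age": "float64", "income": "int64"}))
```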
SDK 2: Prediction Monitoring — WatchtowerModelMonitor
Use this SDK to monitor your model outputs and performance metrics over time. It supports both classification and regression models.
Constructor
| Parameter | Type | Required | Description |
|---|---|---|---|
| project_name | str | Yes | The name of your project. |
| api_key | str | No | API key. Falls back to env var. |
| endpoint | str | No | Backend URL. Falls back to env var. |
| model_type | str | No | "classification" or "regression". |
Usage & Examples
Logging Predictions with Metrics (Classification)
from watchtower.monitor import WatchtowerModelMonitor
model_monitor = WatchtowerModelMonitor(
project_name="Fraud Detector",
model_type="classification",
api_key="your_project_api_key",
endpoint="https://watchtower-ai-production-604f.up.railway.app"
)
# Log predictions along with current performance metrics
predictions = [0, 1, 0, 0, 1, 1, 0, 1]
response = model_monitor.log(
predictions=predictions,
accuracy=0.92,
precision=0.89,
recall=0.95,
f1_score=0.91,
roc_auc=0.96,
metadata={"batch_id": "eval_batch_7"}
)
print(response)
Logging Predictions with Metrics (Regression)
model_monitor = WatchtowerModelMonitor(
project_name="House Price Predictor",
model_type="regression"
)
predictions = [250000, 180000, 320000, 410000]
response = model_monitor.log(
predictions=predictions,
mae=12500.0,
mse=225000000.0,
rmse=15000.0,
r2_score=0.87
)
Supported Metrics
Classification
- accuracy — Overall correctness (0–1)
- precision — True positives / predicted positives
- recall — True positives / actual positives
- f1_score — Harmonic mean of precision & recall
- roc_auc — Area under the ROC curve
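If you are not using scikit-learn, the threshold-based classification metrics can be computed by hand before logging. A minimal sketch for binary 0/1 labels; roc_auc needs probability scores and is omitted here:

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for binary 0/1 labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1_score": f1,
    }

metrics = classification_metrics([0, 1, 0, 1, 1], [0, 1, 0, 0, 1])
print(metrics)
```

The resulting dict can be unpacked straight into model_monitor.log(predictions=..., **metrics).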
Regression
- mae — Mean Absolute Error
- mse — Mean Squared Error
- rmse — Root Mean Squared Error
- r2_score — R-squared coefficient
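Similarly, the regression metrics can be computed by hand. A minimal sketch; in practice you would likely use sklearn.metrics:

```python
def regression_metrics(y_true, y_pred):
    """MAE, MSE, RMSE, and R-squared for numeric targets."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = mse ** 0.5
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R^2 = 1 - SS_res / SS_tot; SS_res is mse * n
    r2 = 1 - (mse * n) / ss_tot if ss_tot else 0.0
    return {"mae": mae, "mse": mse, "rmse": rmse, "r2_score": r2}

metrics = regression_metrics([250000, 180000, 320000], [240000, 190000, 325000])
print(metrics)
```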
SDK 3: LLM Monitoring — WatchtowerLLMMonitor
Designed for Generative AI / LLM applications. Log every prompt-response pair and get automated analysis for toxicity, response quality, token usage, and semantic drift.
Constructor
| Parameter | Type | Required | Description |
|---|---|---|---|
| api_key | str | Yes | API key for authentication. |
| project_name | str | Yes | The name of your LLM project. |
| endpoint | str | No | Backend URL. Defaults to http://localhost:8000. |
| timeout | int | No | Request timeout in seconds. Default: 60. |
Usage & Examples
Logging an LLM Interaction
from watchtower.llm_monitor import WatchtowerLLMMonitor
llm_monitor = WatchtowerLLMMonitor(
api_key="your_api_key",
project_name="Customer Support Bot",
endpoint="https://watchtower-ai-production-604f.up.railway.app"
)
response = llm_monitor.log_interaction(
input_text="How do I reset my password?",
response_text="Navigate to Settings > Security > Reset Password. You will receive a confirmation email.",
metadata={
"model": "gpt-4",
"latency_ms": 320,
"user_id": "user_abc123",
"session_id": "sess_789"
}
)
print(response)
Batch Logging in a Loop
# Log multiple interactions from a conversation
conversations = [
{"input": "What are your hours?", "output": "We are open 9 AM - 5 PM, Mon-Fri."},
{"input": "Can I speak to a manager?", "output": "I'll transfer you to our management team."},
]
for conv in conversations:
llm_monitor.log_interaction(
input_text=conv["input"],
response_text=conv["output"]
)
Evaluation Features
When you log LLM interactions, Watchtower automatically evaluates them on the backend:
Toxicity Detection [Safety]
Each response is scanned using the Detoxify library. Scores above the configurable threshold (default: 0.5) are flagged as toxic.
Token Length Tracking [Performance]
Response token lengths are tracked over time. Sudden increases or decreases in verbosity can signal model behavior changes.
Token Length Drift [Distribution]
Compares average token lengths between baseline and monitoring windows. Drift threshold default: 15% change.
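The token-length drift check can be sketched as a relative change in average length. This illustration uses simple whitespace tokenization; the backend's tokenizer and windowing may differ:

```python
def token_length_drift(baseline_responses, current_responses, threshold=0.15):
    """Compare average token counts between two windows of responses."""
    def avg_tokens(responses):
        return sum(len(r.split()) for r in responses) / len(responses)

    base_avg = avg_tokens(baseline_responses)
    curr_avg = avg_tokens(current_responses)
    change = abs(curr_avg - base_avg) / base_avg
    return {"baseline_avg": base_avg, "current_avg": curr_avg,
            "relative_change": change, "drift": change > threshold}

baseline = ["Navigate to Settings to reset it.", "We are open nine to five."]
current = ["Yes.", "No."]  # responses suddenly became terse
print(token_length_drift(baseline, current))
```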
LLM Judge Evaluation [AI-Powered]
Uses a secondary LLM to evaluate response quality, relevance, and hallucination risk with configurable thresholds.