ml — Machine Learning

Machine learning tools for commodity market analysis. Key applications:

Isolation Forest — outlier detection in thin market trade data (critical for index integrity)
Random Forest — grade surface regression with non-linear interactions
K-means — market segmentation within clusters
Gradient Boosting — price prediction for proxy construction

Key Functions

isolationForest(X, nTrees=100, contamination=0.05, seed=None): Isolation forest for anomaly detection. Returns {scores, isAnomaly, threshold}. Primary use: flag potentially non-arm’s-length trades before index calculation. Low (negative) anomaly scores mark the trades that are easiest to isolate from the rest; treat them as candidates for review, since an outlying score alone does not establish price manipulation.
anomalyScore(forest, X): Score new observations against a trained isolation forest.
regressionTree(X, y, maxDepth=5, minSamplesSplit=2) / decisionTree(...): Decision tree for regression or classification.
predictTree(tree, X): Predict using a trained tree model.
randomForest(X, y, nTrees=100, maxDepth=5, seed=None): Random forest regressor. Returns {predictions, featureImportances}. Use for grade surface regression when moisture, protein, test weight, and dockage interact non-linearly.
gradientBoosting(X, y, nEstimators=100, learningRate=0.1, maxDepth=3, seed=None): Gradient boosting regressor.
kmeans(X, nClusters=3, nInit=10, maxIter=300, seed=None): K-means clustering. Returns {labels, centers, inertia}. Use to segment markets within a cluster by price level and volatility regime.
knn(X, y, nNeighbors=5): K-nearest neighbours regressor/classifier.
naiveBayes(X, y): Gaussian naive Bayes classifier.
logisticRegression(X, y, maxIter=1000, tol=1e-4): Binary logistic regression classifier. Use for regime classification.
pca(X, nComponents=None): Principal component analysis (same as dimension.pca; included here for pipeline convenience).
lda(X, y, nComponents=None): Linear discriminant analysis (LDA) classifier.
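
The contamination argument and the returned threshold are typically linked: the threshold is the score cut-off that flags roughly that fraction of observations. The sketch below illustrates the idea in plain Python. It assumes the convention noted for isolationForest (lower score means more anomalous); flag_lowest_scores and the sample scores are hypothetical, not part of sipQuant, and the library's exact thresholding rule may differ.

```python
def flag_lowest_scores(scores, contamination=0.05):
    """Flag the lowest-scoring fraction of observations (hypothetical helper).

    Assumes lower scores mean "more anomalous". Returns (threshold, flags),
    where flags[i] is True when scores[i] <= threshold. Tied scores at the
    threshold are all flagged, so slightly more than the target can be marked.
    """
    if not 0.0 < contamination < 1.0:
        raise ValueError("contamination must be between 0 and 1")
    # Number of observations to flag, with at least one.
    n_flag = max(1, round(contamination * len(scores)))
    threshold = sorted(scores)[n_flag - 1]
    flags = [s <= threshold for s in scores]
    return threshold, flags


# Made-up scores for ten trades; index 7 is the easiest to isolate.
scores = [0.12, 0.10, 0.08, 0.11, 0.13, 0.12, 0.09, -0.21, 0.11, 0.12]
threshold, flags = flag_lowest_scores(scores, contamination=0.10)
flagged = [i for i, is_flagged in enumerate(flags) if is_flagged]
print(threshold, flagged)  # prints: -0.21 [7]
```

With ten trades and contamination=0.10, exactly one trade is flagged, so the setting acts as a review budget rather than an estimate of how many trades are actually suspect.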

import sipQuant as sq
import numpy as np

# --- Outlier detection for index integrity ---
# Trade features (price, volume): one potentially non-arm's-length outlier at index 7
trade_features = np.array([
    [187.5, 500.0],  # price, volume
    [185.0, 300.0],
    [189.0, 250.0],
    [186.5, 400.0],
    [188.0, 350.0],
    [187.0, 320.0],
    [188.5, 280.0],
    [215.0, 50.0],   # suspicious: high price, low volume
    [186.0, 360.0],
    [187.8, 310.0],
])

forest = sq.ml.isolationForest(
    trade_features,
    nTrees=100,
    contamination=0.10,  # expect ~10% anomalies in this thin market
    seed=42,
)

for i, (score, flag) in enumerate(zip(forest['scores'], forest['isAnomaly'])):
    status = 'ANOMALY - REVIEW' if flag else 'clean'
    print(f"Trade {i}: score={score:.3f}  {status}")

# --- Grade surface regression ---
# Features: moisture %, test_weight, dockage %, protein %
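rng = np.random.default_rng(42)  # seeded so the synthetic data is reproducible
grade_features = rng.normal([14.0, 42.0, 1.5, 12.0], [1.0, 2.0, 0.5, 1.0], (200, 4))
# Synthetic, noise-free linear price surface
prices = (185.0 + 3*(14 - grade_features[:, 0]) + 0.5*grade_features[:, 1]
          - 2*grade_features[:, 2] + 0.8*grade_features[:, 3])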

rf_model = sq.ml.randomForest(grade_features, prices, nTrees=200, maxDepth=6, seed=42)
print(f"Feature importances: {rf_model['featureImportances'].round(3)}")
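# Expected: moisture dominates. By construction it drives about 77% of the
# price variance (coefficient^2 * sd^2 = 9 of 11.64); test weight and dockage
# contribute about 9% each, protein under 6%.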

# --- Market segmentation ---
# Segment markets by weekly avg price and avg volume (synthetic data)
market_rng = np.random.default_rng(42)  # seeded for reproducibility
market_stats = market_rng.normal([185, 300], [15, 100], (50, 2))
km = sq.ml.kmeans(market_stats, nClusters=3, seed=42)
print(f"Market cluster labels: {km['labels']}")
print(f"Cluster centers:\n{km['centers'].round(1)}")
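
K-means relies on Euclidean distance, so a feature with a larger numeric spread dominates the clustering. In the segmentation example above, volume (standard deviation about 100) would outweigh price (about 15). The following is a minimal sketch of z-score standardization in NumPy, one common remedy. The standardize helper is hypothetical and not a sipQuant function, and whether sq.ml.kmeans already scales its inputs internally is not stated here.

```python
import numpy as np


def standardize(X):
    """Z-score each column: subtract its mean, divide by its standard deviation.

    Hypothetical helper. Returns (Z, mean, std) so that cluster centers found
    on Z can be mapped back to original units with centers * std + mean.
    """
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # leave constant columns unscaled
    return (X - mean) / std, mean, std


rng = np.random.default_rng(0)
market_stats = rng.normal([185, 300], [15, 100], (50, 2))  # price, volume
Z, mean, std = standardize(market_stats)

print(Z.mean(axis=0).round(6))  # each column now has mean close to 0
print(Z.std(axis=0).round(6))   # and standard deviation close to 1
```

The standardized array Z could then be passed to the clustering call in place of the raw statistics, with the resulting centers converted back to price and volume units via centers * std + mean.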