ml — Machine Learning
Machine learning tools for commodity market analysis. Key applications:
Isolation Forest — outlier detection in thin market trade data (critical for index integrity)
Random Forest — grade surface regression with non-linear interactions
K-means — market segmentation within clusters
Gradient Boosting — price prediction for proxy construction
Key Functions
isolationForest(X, nTrees=100, contamination=0.05, seed=None)
Isolation forest for anomaly detection. Returns {scores, isAnomaly, threshold}. Primary use: flag potentially non-arm's-length trades before index calculation. Low (negative) anomaly scores mark trades that may reflect price manipulation and should be reviewed.
anomalyScore(forest, X)
Score new observations against a trained isolation forest.
regressionTree(X, y, maxDepth=5, minSamplesSplit=2) / decisionTree(...)
Decision tree for regression or classification.
predictTree(tree, X)
Predict using a trained tree model.
randomForest(X, y, nTrees=100, maxDepth=5, seed=None)
Random forest regressor. Returns {predictions, featureImportances}. Use for grade surface regression when moisture, protein, test weight, and dockage interact non-linearly.
gradientBoosting(X, y, nEstimators=100, learningRate=0.1, maxDepth=3, seed=None)
Gradient boosting regressor.
kmeans(X, nClusters=3, nInit=10, maxIter=300, seed=None)
K-means clustering. Returns {labels, centers, inertia}. Use to segment markets within a cluster by price level and volatility regime.
knn(X, y, nNeighbors=5)
K-nearest neighbours regressor/classifier.
naiveBayes(X, y)
Gaussian naive Bayes classifier.
logisticRegression(X, y, maxIter=1000, tol=1e-4)
Binary logistic regression classifier. Use for regime classification.
pca(X, nComponents=None)
PCA (same as dimension.pca; included here for pipeline convenience).
lda(X, y, nComponents=None)
LDA classifier.
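The contamination parameter of isolationForest and the returned threshold and isAnomaly fields are related. The sketch below shows one common convention, similar to scikit-learn's: the threshold is placed so that roughly the lowest-scoring contamination fraction of observations is flagged. Whether sipQuant computes its threshold exactly this way is an assumption, and the helper name flag_lowest_scores and the example scores are made up for illustration.

```python
def flag_lowest_scores(scores, contamination):
    """Flag the lowest-scoring fraction of observations.

    Assumes lower scores mean "more anomalous". Returns (threshold, flags).
    Hypothetical helper, not part of sipQuant.
    """
    if not 0.0 < contamination <= 0.5:
        raise ValueError("contamination must be in (0, 0.5]")
    # Number of observations to flag, at least one.
    n_flag = max(1, round(contamination * len(scores)))
    # The threshold is the n_flag-th smallest score.
    threshold = sorted(scores)[n_flag - 1]
    # Ties at the threshold are all flagged, so the count can exceed n_flag.
    flags = [s <= threshold for s in scores]
    return threshold, flags


# Made-up scores for ten trades (NOT sipQuant output); index 7 is the outlier.
example_scores = [0.08, 0.10, 0.05, 0.09, 0.11, 0.12, 0.07, -0.21, 0.10, 0.12]
threshold, flags = flag_lowest_scores(example_scores, contamination=0.10)
flagged = [i for i, f in enumerate(flags) if f]
print(f"threshold={threshold:.2f}, flagged trades: {flagged}")
```

With ten trades and contamination=0.10, exactly one trade is flagged. In a thin market this means the contamination setting effectively fixes how many trades go to review, so it is worth choosing deliberately rather than leaving it at the default.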
import sipQuant as sq
import numpy as np
# --- Outlier detection for index integrity ---
# Trade prices: one potentially non-arm's-length outlier at index 7
trade_features = np.array([
    [187.5, 500.0],  # price, volume
    [185.0, 300.0],
    [189.0, 250.0],
    [186.5, 400.0],
    [188.0, 350.0],
    [187.0, 320.0],
    [188.5, 280.0],
    [215.0, 50.0],   # suspicious: high price, low volume
    [186.0, 360.0],
    [187.8, 310.0],
])
forest = sq.ml.isolationForest(
    trade_features,
    nTrees=100,
    contamination=0.10,  # expect ~10% anomalies in this thin market
    seed=42,
)
for i, (score, flag) in enumerate(zip(forest['scores'], forest['isAnomaly'])):
    status = 'ANOMALY - REVIEW' if flag else 'clean'
    print(f"Trade {i}: score={score:.3f} {status}")
# --- Grade surface regression ---
# Features: moisture %, test_weight, dockage %, protein %
np.random.seed(42)  # fixed seed so the synthetic data is reproducible
grade_features = np.random.normal([14.0, 42.0, 1.5, 12.0], [1.0, 2.0, 0.5, 1.0], (200, 4))
# Synthetic price: linear in each grade factor
prices = (185.0 + 3*(14 - grade_features[:, 0]) + 0.5*grade_features[:, 1]
          - 2*grade_features[:, 2] + 0.8*grade_features[:, 3])
rf_model = sq.ml.randomForest(grade_features, prices, nTrees=200, maxDepth=6, seed=42)
print(f"Feature importances: {rf_model['featureImportances'].round(3)}")
# Expected: moisture dominates; test weight and dockage contribute about equally; protein least
# --- Market segmentation ---
# Segment markets by weekly avg price and avg volume
market_stats = np.random.normal([185, 300], [15, 100], (50, 2))
km = sq.ml.kmeans(market_stats, nClusters=3, seed=42)
print(f"Market cluster labels: {km['labels']}")
print(f"Cluster centers:\n{km['centers'].round(1)}")
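As a rough sanity check on the grade surface example above, the share of price variance driven by each feature can be worked out directly from the synthetic model's coefficients and standard deviations. This sketch assumes the features are independent (true for the synthetic data) and that reported importances roughly track variance shares, which is a heuristic and not a guarantee for any particular forest implementation.

```python
# Coefficients and standard deviations copied from the synthetic grade data above.
# Moisture enters as 3*(14 - moisture), so its coefficient is -3.
features = ["moisture", "test_weight", "dockage", "protein"]
coefs = [-3.0, 0.5, -2.0, 0.8]
stds = [1.0, 2.0, 0.5, 1.0]

# For independent features in a linear model, each feature contributes
# (coefficient * std)^2 to the variance of the price.
contrib = [(c * s) ** 2 for c, s in zip(coefs, stds)]
total = sum(contrib)
shares = {name: v / total for name, v in zip(features, contrib)}

for name, share in shares.items():
    print(f"{name:12s} {share:.3f}")
```

Moisture accounts for about 77% of the variance, test weight and dockage for about 9% each, and protein for about 5%. Importances far from this ordering would suggest checking the feature column order or the model settings.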