ml — Machine Learning
======================

.. module:: sipQuant.ml

Machine learning tools for commodity market analysis. Key applications:

- **Isolation Forest** — outlier detection in thin market trade data (critical for index integrity)
- **Random Forest** — grade surface regression with non-linear interactions
- **K-means** — market segmentation within clusters
- **Gradient Boosting** — price prediction for proxy construction

Key Functions
-------------

``isolationForest(X, nTrees=100, contamination=0.05, seed=None)``
    Isolation forest for anomaly detection. Returns ``{scores, isAnomaly, threshold}``.
    **Primary use**: flag potentially non-arm's-length trades before index calculation.
    Low anomaly scores (negative) indicate potential price manipulation.

``anomalyScore(forest, X)``
    Score new observations against a trained isolation forest.

``regressionTree(X, y, maxDepth=5, minSamplesSplit=2)`` / ``decisionTree(...)``
    Decision tree for regression or classification.

``predictTree(tree, X)``
    Predict using a trained tree model.

``randomForest(X, y, nTrees=100, maxDepth=5, seed=None)``
    Random forest regressor. Returns ``{predictions, featureImportances}``.
    Use for grade surface regression when moisture, protein, test weight, and
    dockage interact non-linearly.

``gradientBoosting(X, y, nEstimators=100, learningRate=0.1, maxDepth=3, seed=None)``
    Gradient boosting regressor.

``kmeans(X, nClusters=3, nInit=10, maxIter=300, seed=None)``
    K-means clustering. Returns ``{labels, centers, inertia}``.
    Use to segment markets within a cluster by price level and volatility regime.

``knn(X, y, nNeighbors=5)``
    K-nearest neighbours regressor/classifier.

``naiveBayes(X, y)``
    Gaussian naive Bayes classifier.

``logisticRegression(X, y, maxIter=1000, tol=1e-4)``
    Binary logistic regression classifier. Use for regime classification.

``pca(X, nComponents=None)``
    PCA (same as ``dimension.pca`` — included here for pipeline convenience).

``lda(X, y, nComponents=None)``
    LDA classifier.

.. code-block:: python

   import sipQuant as sq
   import numpy as np

   # --- Outlier detection for index integrity ---
   # Trade prices: one potentially non-arm's-length outlier at index 7
   trade_features = np.array([
       [187.5, 500.0],  # price, volume
       [185.0, 300.0],
       [189.0, 250.0],
       [186.5, 400.0],
       [188.0, 350.0],
       [187.0, 320.0],
       [188.5, 280.0],
       [215.0, 50.0],   # suspicious: high price, low volume
       [186.0, 360.0],
       [187.8, 310.0],
   ])

   forest = sq.ml.isolationForest(
       trade_features,
       nTrees=100,
       contamination=0.10,  # expect ~10% anomalies in this thin market
       seed=42,
   )

   for i, (score, flag) in enumerate(zip(forest['scores'], forest['isAnomaly'])):
       status = 'ANOMALY - REVIEW' if flag else 'clean'
       print(f"Trade {i}: score={score:.3f}  {status}")

   # --- Grade surface regression ---
   # Features: moisture %, test_weight, dockage %, protein %
   grade_features = np.random.normal([14.0, 42.0, 1.5, 12.0], [1.0, 2.0, 0.5, 1.0], (200, 4))
   prices         = 185.0 + 3*(14 - grade_features[:,0]) + 0.5*grade_features[:,1] \
                          - 2*grade_features[:,2] + 0.8*grade_features[:,3]

   rf_model = sq.ml.randomForest(grade_features, prices, nTrees=200, maxDepth=6, seed=42)
   print(f"Feature importances: {rf_model['featureImportances'].round(3)}")
   # Expected: moisture and dockage dominate

   # --- Market segmentation ---
   # Segment markets by weekly avg price and avg volume
   market_stats = np.random.normal([185, 300], [15, 100], (50, 2))
   km = sq.ml.kmeans(market_stats, nClusters=3, seed=42)
   print(f"Market cluster labels: {km['labels']}")
   print(f"Cluster centers:\n{km['centers'].round(1)}")