EUR/USD Forecasting with Random Forest — A Machine Learning Approach

TL;DR — Raw price data alone isn’t useful. We engineered 36 features (returns, volatility, RSI, lag windows), used a proper time-based split to avoid leakage, and trained a Random Forest reaching R² = 0.95 and 74.76% directional accuracy. Built by a team of three, each handling a key part: data, modeling, and evaluation.


Motivation

Forex prediction is one of the classic hard problems in ML: the signal-to-noise ratio is brutal, markets are non-stationary, and naive models overfit almost immediately. We chose EUR/USD as our target for this project because it’s the most liquid currency pair in the world, which means cleaner data and fewer anomalous spikes.

The goal was not to build a trading bot, but to practice the full supervised learning pipeline: data collection → feature engineering → model selection → evaluation → deployment.


Architecture Overview

The project is split into three clean classes, each owned by a different “role” in the team:

EnhancedDataEngineer      →  collect_data()  +  create_enhanced_features()
OptimizedModelDeveloper   →  prepare_data()  +  train_model()  +  evaluate_model()
ModelEvaluator            →  create_plots()

This separation made it easy to swap the model later (e.g. replace Random Forest with Gradient Boosting) without touching the data layer.


Data Collection

Data is pulled live from Yahoo Finance using yfinance:

eurusd = yf.download('EURUSD=X', start=start_date, end=end_date, progress=False)

Two additional correlated assets are also fetched as exogenous features:

| Symbol | Asset | Rationale |
| --- | --- | --- |
| GC=F | Gold | Safe-haven flows correlate with EUR |
| CL=F | Oil | Petrodollar dynamics affect USD supply |

If any download fails, the pipeline falls back to synthetic data so training is never blocked.
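The fallback could look roughly like the sketch below. It is not the project's code: `load_close`, `synthetic_close`, and their parameters are hypothetical names, and the downloader is passed in as a plain callable (in the real pipeline it would wrap the `yf.download` call) so the example runs without network access. It assumes a failed download can show up either as an exception or as an empty result, and checks for both.

```python
import math
import random
from typing import Callable, List, Tuple


def synthetic_close(n_days: int, start_price: float = 1.10,
                    daily_vol: float = 0.005, seed: int = 42) -> List[float]:
    """Generate a geometric random walk as a stand-in price series."""
    rng = random.Random(seed)
    prices = [start_price]
    for _ in range(n_days - 1):
        # Multiply by exp(noise) so prices stay strictly positive
        prices.append(prices[-1] * math.exp(rng.gauss(0.0, daily_vol)))
    return prices


def load_close(download: Callable[[], List[float]],
               n_days: int) -> Tuple[List[float], bool]:
    """Return (prices, is_synthetic), falling back to synthetic data.

    The fallback triggers when `download` raises or returns nothing.
    """
    try:
        prices = list(download())
        if prices:
            return prices, False
    except Exception as exc:  # e.g. network errors or an unknown symbol
        print(f"download failed ({exc}); using synthetic data")
    return synthetic_close(n_days), True


def offline_download() -> List[float]:
    """Simulates a failed download for the demo below."""
    raise ConnectionError("no network")


demo_prices, demo_is_synthetic = load_close(offline_download, n_days=30)
```

Returning the `is_synthetic` flag alongside the prices is a deliberate choice in this sketch: any metric computed on synthetic data says nothing about EUR/USD, so downstream code should be able to tell the two cases apart.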


Feature Engineering

This was the most impactful step. Raw close prices are nearly useless for a tree model — what matters is change, momentum, and context.

Features created (30+ total)

| Category | Features |
| --- | --- |
| Returns | return_1d, return_3d, return_7d |
| Volatility | rolling std over 7 and 14 days |
| Moving averages | MA(7), MA(21), MA(50) + crossover ratios |
| Momentum / RSI | RSI(14) + overbought/oversold signal |
| Lag features | price and return lags at 1, 2, 3, 5, 7, 14 days |
| Rolling stats | 7-day high, low, range |
| Temporal | day of week, month, quarter |
| Exogenous | Gold/Oil returns and 7-day MAs |
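To make a few of these categories concrete, here is a minimal pandas sketch. It is an illustration under assumptions, not the body of `create_enhanced_features()`: the function name `make_features`, the column names, and the choice of next-day close as the target are all hypothetical, and only a subset of the listed features is built. The demo series at the end is made up and is not real EUR/USD data.

```python
import numpy as np
import pandas as pd


def make_features(close: pd.Series) -> pd.DataFrame:
    """Build a subset of the listed features from a daily close series."""
    feats = pd.DataFrame(index=close.index)
    ret_1d = close / close.shift(1) - 1

    # Returns over several horizons
    for days in (1, 3, 7):
        feats[f"return_{days}d"] = close / close.shift(days) - 1

    # Rolling volatility of daily returns
    for window in (7, 14):
        feats[f"volatility_{window}d"] = ret_1d.rolling(window).std()

    # Moving averages plus one crossover ratio
    for window in (7, 21, 50):
        feats[f"ma_{window}"] = close.rolling(window).mean()
    feats["ma_7_21_ratio"] = feats["ma_7"] / feats["ma_21"]

    # Lagged prices and returns (shift moves past values forward in time)
    for lag in (1, 2, 3, 5, 7, 14):
        feats[f"close_lag_{lag}"] = close.shift(lag)
        feats[f"return_lag_{lag}"] = ret_1d.shift(lag)

    # Calendar features (requires a DatetimeIndex)
    feats["day_of_week"] = close.index.dayofweek
    feats["month"] = close.index.month
    feats["quarter"] = close.index.quarter

    # Assumed target: the next day's close
    feats["target"] = close.shift(-1)

    # Drop warm-up rows (rolling windows) and the last row (no target)
    return feats.dropna()


# Demo on a made-up, steadily rising series of 120 business days
dates = pd.date_range("2023-01-02", periods=120, freq="B")
demo_close = pd.Series(1.10 + 0.001 * np.arange(120), index=dates)
demo = make_features(demo_close)
```

Every feature here is computed from the current and earlier rows only; the target is the single column that looks forward.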

The RSI is computed manually to avoid pulling in a separate technical-analysis library:

\[RSI = 100 - \frac{100}{1 + RS}, \quad RS = \frac{\text{avg gain}}{\text{avg loss}}\]
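A manual implementation of the formula might look like the sketch below. It assumes simple averages of gains and losses over the last 14 daily changes; the post does not say whether simple averaging or Wilder's smoothing is used, so treat that as a guess. The function name `rsi` and the handling of a zero average loss are also choices made for this example.

```python
from typing import List, Optional


def rsi(prices: List[float], period: int = 14) -> List[Optional[float]]:
    """RSI from simple averages of gains and losses over `period` changes.

    The first `period` entries are None because the window is not full yet.
    """
    # changes[j] is the move from prices[j] to prices[j + 1]
    changes = [b - a for a, b in zip(prices, prices[1:])]
    out: List[Optional[float]] = [None] * min(period, len(prices))

    for i in range(period, len(prices)):
        window = changes[i - period:i]  # the `period` changes ending at day i
        avg_gain = sum(c for c in window if c > 0) / period
        avg_loss = sum(-c for c in window if c < 0) / period

        if avg_loss == 0:
            # RS is undefined; by convention here: 100 if there were any
            # gains, 50 if the price did not move at all
            out.append(100.0 if avg_gain > 0 else 50.0)
        else:
            rs = avg_gain / avg_loss
            out.append(100.0 - 100.0 / (1.0 + rs))
    return out


rising = [1.0 + 0.01 * i for i in range(30)]
falling = [2.0 - 0.01 * i for i in range(30)]
rsi_rising = rsi(rising)
rsi_falling = rsi(falling)
```

A steadily rising series pins the RSI at the top of its range and a steadily falling one at the bottom, which is a quick sanity check for any implementation.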

Model

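from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(
    n_estimators=200,        # number of trees
    max_depth=15,            # cap tree depth to limit overfitting
    min_samples_split=10,    # minimum samples needed to split a node
    min_samples_leaf=4,      # minimum samples in each leaf
    max_features="sqrt",     # features considered at each split
    bootstrap=True,          # each tree sees a bootstrap sample
    n_jobs=-1,               # use all CPU cores
    random_state=42,         # reproducible results
)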

Why Random Forest over a simple regression?

  • Handles non-linear interactions between features (e.g. RSI × volatility)
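  • Robust to outliers; missing values are only handled natively in recent scikit-learn releases, so NaNs from rolling windows are best dropped or imputed first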
  • Built-in feature importance for interpretability
  • No need to scale features

The train/test split is chronological (no shuffle) to avoid look-ahead bias — the last 20% of dates form the test set.
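A chronological split is only a slice at the 80% mark, as in this minimal sketch. The helper name `chronological_split` is hypothetical, and the rows are assumed to be sorted by date already.

```python
from typing import Sequence, Tuple


def chronological_split(rows: Sequence,
                        test_fraction: float = 0.2) -> Tuple[Sequence, Sequence]:
    """Split date-ordered rows so the most recent ones form the test set."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    n_test = int(len(rows) * test_fraction)
    cut = len(rows) - n_test
    # No shuffling: everything before `cut` is strictly older than the test set
    return rows[:cut], rows[cut:]


days = list(range(10))  # stand-in for 10 date-ordered observations
train_days, test_days = chronological_split(days)
```

With scikit-learn, `train_test_split(X, y, test_size=0.2, shuffle=False)` produces the same kind of cut.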


Results

Metrics

The headline metrics reported for the model are R² = 0.95 and a directional accuracy of 74.76%.


Evaluation Charts

Figure: evaluation report, 4 panels. From top-left: actual vs predicted time series · scatter with R² · error distribution · top-12 feature importances

The error distribution is centred very close to zero with no heavy tail on either side — a good sign that the model is not systematically biased.

The most important features are (unsurprisingly) recent lag values and short-term moving average crossovers. The RSI signal and volatility features contribute meaningful lift on top.


Limitations

  • Warm-up window: features like MA_50 require 50 days of history, so the model cannot produce predictions in a true real-time setting until that much history is available.
  • Non-stationarity: market regimes shift, so fitted parameters go stale and the model has to be retrained periodically.
  • No macro features: interest rate differentials (ECB vs Fed), CPI prints, and geopolitical events are not captured.
  • This is not a trading signal: high R² on price levels is expected because prices are autocorrelated. Directional accuracy is the more honest measure of predictive skill.
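The point about R² can be checked on simulated data. The sketch below builds a pure random walk (not EUR/USD data) and scores a "tomorrow equals today" forecast, which has no predictive skill at all. The definition of directional accuracy used here, the share of days where the predicted move from the previous close has the same sign as the actual move, is an assumption; the post does not spell out its exact formula.

```python
import random
from typing import List


def r_squared(actual: List[float], predicted: List[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def sign(x: float) -> int:
    return (x > 0) - (x < 0)


def directional_accuracy(previous: List[float], actual: List[float],
                         predicted: List[float]) -> float:
    """Share of days where predicted and actual moves share the same sign."""
    hits = sum(
        sign(p - prev) == sign(a - prev)
        for prev, a, p in zip(previous, actual, predicted)
    )
    return hits / len(actual)


# Simulated random walk: each step is independent noise
rng = random.Random(0)
prices = [1.10]
for _ in range(2000):
    prices.append(prices[-1] + rng.gauss(0.0, 0.005))

previous, actual = prices[:-1], prices[1:]
naive = previous  # persistence forecast: tomorrow's price equals today's

r2_naive = r_squared(actual, naive)
da_naive = directional_accuracy(previous, actual, naive)
```

On price levels the persistence forecast scores a high R² because each day's price sits close to the previous one, yet it never predicts a move, so its directional accuracy under this definition is essentially zero. Persistence is a useful baseline to compare any reported R² against.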

Source code: github.com/ozyns/EUR-USD-Forecasting-with-Random-Forest

This post is licensed under CC BY 4.0 by the author.