EUR/USD Forecasting with Random Forest — A Machine Learning Approach

Posted May 2, 2025 Updated May 2, 2026

By ozyns

3 min read

TL;DR — Raw price data alone isn’t useful. We engineered 36 features (returns, volatility, RSI, lag windows), used a proper time-based split to avoid leakage, and trained a Random Forest reaching R² = 0.95 and 74.76% directional accuracy. Built by a team of three, each handling a key part: data, modeling, and evaluation.

Motivation

Forex prediction is one of the classic hard problems in ML — the signal-to-noise ratio is brutal, markets are non-stationary, and naive models overfit almost immediately. I chose EUR/USD as my target for this project because it’s the most liquid currency pair in the world, which means cleaner data and fewer anomalous spikes.

The goal was not to build a trading bot, but to practice the full supervised learning pipeline: data collection → feature engineering → model selection → evaluation → deployment.

Architecture Overview

The project is split into three clean classes, each owned by a different “role” in the team:

EnhancedDataEngineer      →  collect_data()  +  create_enhanced_features()
OptimizedModelDeveloper   →  prepare_data()  +  train_model()  +  evaluate_model()
ModelEvaluator            →  create_plots()

This separation made it easy to swap the model later (e.g. replace Random Forest with Gradient Boosting) without touching the data layer.

Data Collection

Data is pulled live from Yahoo Finance using yfinance:

  
eurusd = yf.download('EURUSD=X', start=start_date, end=end_date, progress=False)

Two additional correlated assets are also fetched as exogenous features:

Symbol	Asset	Rationale
GC=F	Gold	Safe-haven flows correlate with EUR
CL=F	Oil	Petrodollar dynamics affect USD supply

If any download fails, the pipeline falls back to synthetic data so training is never blocked.

Feature Engineering

This was the most impactful step. Raw close prices are nearly useless for a tree model — what matters is change, momentum, and context.

Features created (30+ total)

Category	Features
Returns	`return_1d`, `return_3d`, `return_7d`
Volatility	rolling std over 7 and 14 days
Moving averages	MA(7), MA(21), MA(50) + crossover ratios
Momentum / RSI	RSI(14) + overbought/oversold signal
Lag features	price and return lags at 1, 2, 3, 5, 7, 14 days
Rolling stats	7-day high, low, range
Temporal	day of week, month, quarter
Exogenous	Gold/Oil returns and 7-day MAs

The RSI is computed manually to avoid any library dependency:

\[RSI = 100 - \frac{100}{1 + RS}, \quad RS = \frac{\text{avg gain}}{\text{avg loss}}\]

Model

  
RandomForestRegressor(
    n_estimators    = 200,
    max_depth       = 15,
    min_samples_split = 10,
    min_samples_leaf  = 4,
    max_features    = 'sqrt',
    bootstrap       = True,
    n_jobs          = -1,
    random_state    = 42
)

Why Random Forest over a simple regression?

Handles non-linear interactions between features (e.g. RSI × volatility)
Robust to outliers and missing values
Built-in feature importance for interpretability
No need to scale features

The train/test split is chronological (no shuffle) to avoid look-ahead bias — the last 20% of dates form the test set.

Results

Loading chart…

Metrics

Loading metrics…

Evaluation Charts

From top-left: actual vs predicted time series · scatter with R² · error distribution · top-12 feature importances

The error distribution is centred very close to zero with no heavy tail on either side — a good sign that the model is not systematically biased.

The most important features are (unsurprisingly) recent lag values and short-term moving average crossovers. The RSI signal and volatility features contribute meaningful lift on top.

Limitations

Look-ahead risk — features like MA_50 require 50 days of history, so the model cannot be used in a true real-time setting without a warm-up window.
Non-stationarity — the model is retrained periodically; old parameters decay as market regimes shift.
No macro features — interest rate differentials (ECB vs Fed), CPI prints, and geopolitical events are not captured.
This is not a trading signal — high R² on price levels is expected because prices are autocorrelated. Directional accuracy is the more honest measure of predictive skill.

Source code: github.com/ozyns/EUR-USD-Forecasting-with-Random-Forest

Machine Learning, Finance

This post is licensed under CC BY 4.0 by the author.