From ARIMA to Transformers: A Practical Guide to Statistical, Machine Learning, and Deep Learning Forecasting

Parsa Nikpour
University of Tehran

This article serves as a complete guide to choosing the right model for time series forecasting. It progresses from simple linear baselines to complex deep learning architectures, providing a practical framework for selecting the best tool for the task.

Model Selection Depends on the Forecasting Task

Even a small forecasting error can lead to millions of dollars in excess inventory and wasted capital. A retail company that underestimates demand faces empty shelves and lost customers, while an energy provider that misjudges consumption may cause blackouts. In some situations, such errors can even cost human lives or harm the environment. In today’s data-driven world, every business relies heavily on predictions. From managing supply chains to staffing call centers and balancing a nation’s energy grid, the ability to accurately forecast the future has become a necessity.

The common thread in all these examples is that they rely on time series data: data points collected sequentially over time. While the broader field of Time Series Analysis involves understanding the underlying structure of this data (such as detecting anomalies or explaining causes), our focus here is on Time Series Forecasting (TSF), which is of critical practical importance. But why is this task so difficult? Unlike static spreadsheets, time series data has memory: yesterday’s stock price directly affects today’s. These data often hide underlying patterns such as seasonality (like the summer peak in ice cream sales) and long-term trends. A simple moving average, for instance, might be good at identifying long-term growth trends but is completely blind to sudden, nonlinear movements.

This brings us to a key insight: there is no single best model for all time series forecasting problems. Even the most advanced and powerful models aren’t always the right choice. The secret to accurate time series forecasting lies in choosing the right tool for each specific task.

This article provides a practical framework to understand these methods and helps the reader select the model that best fits their data and business goals.

The Surprising Power of Simple Models: Linear and State-Space

Given that the secret to time series forecasting lies in matching the model to the task, the first step is always to establish a baseline. This is the interpretable and cost-effective benchmark that any complex, black box model must beat. Before an analyst reaches for a neural network, they must start with these foundational statistical models.

  • Linear Models: The Surprising Contenders 
    At their core, linear models approach time series analysis with a simple, powerful assumption: that the future is a straightforward, weighted combination of past trends and seasonal patterns. While it’s tempting to dismiss these models as too simple for today’s complex data, recent research has challenged that assumption. A recent paper (Zeng et al., 2022) sent waves through the community by demonstrating that linear-based models (like DLinear) could outperform massive, complex Transformer architectures on many standard benchmarks.
    This back-to-basics movement highlights that complexity is not a virtue in itself. For a vast number of business applications, like forecasting sales for a single product or a stable inventory, a simple linear model is not just a suitable starting point; it is often the best-performing, most robust, and most cost-effective solution (a minimal lag-regression sketch follows this list).
  • State-Space Models: Peeking Under the Hood
    Another foundational family, called state-space models, addresses a different kind of problem. These models are designed for systems that change over time and are driven by hidden (latent) states that cannot be directly observed.
    Consider a GPS system in a car. In addition to predicting the next position based on the previous one, it maintains an internal state that includes the current location, speed, and direction, and it continuously updates this internal belief as new satellite data arrives. The Kalman Filter is the most famous example of this type of model.
    In the business world, these models are highly useful for understanding systems that are constantly changing and evolving. For example, they can model the health of a machine from sensor readings or track the value of a customer (as a latent state) as it evolves with each purchase or interaction. These models do more than forecast a number; they provide a way to model the entire living system that generates that number (a toy Kalman filter sketch also follows this list).
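To make the linear idea concrete, here is a minimal sketch (not the DLinear architecture from Zeng et al., 2022) that predicts the next value of a synthetic series as a learned weighted sum of its last twelve observations. The lag count, synthetic data, and holdout size are illustrative choices only.

```python
# A minimal lag-regression baseline: the next value is a weighted sum of the
# previous n_lags observations. Illustrative sketch, not the DLinear model.
import numpy as np
from sklearn.linear_model import LinearRegression

def make_lag_matrix(series, n_lags):
    """Turn a 1-D series into (X, y) pairs of lagged inputs and next values."""
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y

# Synthetic example: trend plus seasonality plus noise.
rng = np.random.default_rng(0)
t = np.arange(200)
series = 0.1 * t + 5 * np.sin(2 * np.pi * t / 12) + rng.normal(0, 0.5, size=t.size)

X, y = make_lag_matrix(series, n_lags=12)
model = LinearRegression().fit(X[:-24], y[:-24])   # train on all but the last 24 points
print("holdout MAE:", np.abs(model.predict(X[-24:]) - y[-24:]).mean())
```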
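And to illustrate the state-space idea, here is a toy one-dimensional Kalman filter that tracks a hidden, slowly drifting level behind noisy observations. The noise variances and synthetic data are assumptions made purely for illustration.

```python
# A toy 1-D Kalman filter: the hidden state is a slowly drifting "true level"
# that we only see through noisy measurements. Parameters are illustrative.
import numpy as np

def kalman_1d(observations, process_var=1e-2, obs_var=1.0):
    """Return the filtered estimate of the hidden level after each observation."""
    estimate, variance = observations[0], 1.0    # initial belief about the level
    filtered = []
    for z in observations:
        variance += process_var                  # predict: the level may have drifted
        gain = variance / (variance + obs_var)   # how much to trust the new observation
        estimate += gain * (z - estimate)        # update the belief toward the observation
        variance *= (1 - gain)                   # uncertainty shrinks after the update
        filtered.append(estimate)
    return np.array(filtered)

rng = np.random.default_rng(0)
true_level = np.cumsum(rng.normal(0, 0.1, 100))   # hidden state, never observed directly
observed = true_level + rng.normal(0, 1.0, 100)   # noisy sensor readings
print(kalman_1d(observed)[-5:])                   # filtered estimates of the hidden level
```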
Figure 1: The Components of a Time Series

Classic Time Series Models: ARIMA and Extensions

For decades, one model has stood as the undisputed champion of classical forecasting: ARIMA, which stands for AutoRegressive Integrated Moving Average. This model combines three key ideas to handle data with a clear internal structure, or what statisticians call auto-correlation (how a value relates to its own past values over time).

The AutoRegressive part captures the past effect, assuming that the next value in the series has a linear relationship with its previous values. The Integrated component deals with stationarity. This is a statistical tweak to stabilize the data; in simple terms, it removes an underlying trend to make the data easier to model. For example, instead of trying to predict the ever-rising price of a stock, this component would focus on predicting the change in price from one day to the next. Finally, the Moving Average part captures short-term shocks or random disturbances. Figure 1 shows the common patterns found in time series data.
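As a hedged sketch of how this looks in practice, the snippet below fits an ARIMA model with statsmodels on a synthetic trending series; the (1, 1, 1) order is an illustrative placeholder, not a recommendation for real data.

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
# A trending series: the Integrated part (d = 1) differences it once, so the AR and
# MA parts model the step-to-step *changes* rather than the raw, ever-rising level.
series = np.cumsum(rng.normal(0.2, 1.0, 300))

fit = ARIMA(series, order=(1, 1, 1)).fit()   # (AR order p, differencing d, MA order q)
print(fit.forecast(steps=7))                 # point forecasts for the next 7 steps
```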

ARIMA performs particularly well with small and relatively stable datasets, such as predicting a single restaurant’s monthly revenue or estimating passenger numbers on a specific bus route. A popular extension, SARIMA (Seasonal ARIMA), adds a seasonal component to account for recurring patterns. This enhancement makes the model even more robust and dependable for a wide range of common business forecasting challenges.
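A corresponding sketch for SARIMA adds a seasonal (P, D, Q, m) term; the period m = 12 below assumes monthly data with a yearly cycle, and all orders are again illustrative placeholders.

```python
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
months = np.arange(120)
# Synthetic monthly sales with trend, a yearly cycle, and noise.
monthly_sales = 100 + 0.5 * months + 10 * np.sin(2 * np.pi * months / 12) + rng.normal(0, 2, 120)

fit = SARIMAX(monthly_sales, order=(1, 1, 1), seasonal_order=(1, 1, 1, 12)).fit(disp=False)
print(fit.forecast(steps=12))   # one full seasonal cycle ahead
```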

Machine Learning Approaches: Regression and Beyond

Unlike models such as ARIMA, which are built around the concept of time, the machine learning (ML) approach treats forecasting as an advanced regression problem. The core mechanism of these models is feature engineering.

A machine learning model such as a gradient boosting tree (e.g., XGBoost or LightGBM) has no understanding of concepts like “yesterday” or “last month”; we have to teach it these temporal notions. To do so, the time series is transformed into a static table with features such as lag features (e.g., yesterday’s value), window features (e.g., the average of the last seven days), and calendar features (e.g., day of week or holiday flags).
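The sketch below illustrates this workflow on synthetic daily sales data, building lag, rolling-window, and calendar features and fitting scikit-learn’s gradient boosting regressor as a stand-in for XGBoost or LightGBM; all column names and window sizes are illustrative choices.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor

rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=730, freq="D")
# Synthetic daily sales with a yearly cycle, a weekend bump, and noise.
df = pd.DataFrame({"sales": 50 + 10 * np.sin(2 * np.pi * idx.dayofyear / 365)
                            + 5 * (idx.dayofweek >= 5) + rng.normal(0, 2, len(idx))},
                  index=idx)

# Lag features: what happened 1, 7, and 28 days ago.
for lag in (1, 7, 28):
    df[f"lag_{lag}"] = df["sales"].shift(lag)
# Window feature: the average of the previous 7 days.
df["roll_mean_7"] = df["sales"].shift(1).rolling(7).mean()
# Calendar features: day of week and month.
df["dayofweek"] = df.index.dayofweek
df["month"] = df.index.month
df = df.dropna()

X, y = df.drop(columns="sales"), df["sales"]
model = HistGradientBoostingRegressor().fit(X[:-30], y[:-30])   # hold out the last 30 days
print("holdout MAE:", np.abs(model.predict(X[-30:]) - y[-30:]).mean())
```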

Besides having data and building the model, developing a robust AI infrastructure is essential. What systems will you use to evaluate performance, both offline during development and online in production? How will you monitor for model drift or downtime? Can your infrastructure ensure low latency?

This approach allows ML models to process hundreds of features simultaneously and capture complex, nonlinear relationships. They also perform well when external data is incorporated, such as competitor pricing, marketing campaigns, or even weather conditions, whereas classical models are often limited in this regard.

However, this very strength is also their main weakness. Since these models are not sequence-aware, their perspective is confined to the features we define for them, and they lack the grasp of temporal dependencies and long-term trends that ARIMA or state-space models capture naturally. As a result, this approach excels at short-term predictions (e.g., forecasting the next few days or weeks), where recent, lagged data is the strongest signal, but it is less reliable for longer horizons (e.g., forecasting several months or a full year out).

Deep Learning: Adding Memory to Understand Sequences

The previous machine learning approach struggles with long-term forecasts precisely because it is not sequence-aware. This is the exact problem deep learning models were designed to solve. Instead of treating data as a static table, this class of models is built to understand the flow of time. The fundamental innovation in this field is the Recurrent Neural Network (RNN). Unlike standard machine learning models that process all inputs at once, an RNN has a type of memory. It reads data step by step, and at each step, it combines new information with a summary of all the data it has seen so far. This allows the model to be truly sequence-aware, making it a natural fit for time series data.

However, simple RNNs suffer from short-term memory limitations and often forget important information from the distant past. This led to the development of more advanced and popular architectures. One such model is the LSTM (Long Short-Term Memory), an enhanced version of the RNN whose gating mechanisms let it retain information over much longer spans. Another, the GRU (Gated Recurrent Unit), is a simpler variant of the LSTM that trains faster while delivering similar performance.
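As a rough illustration, the PyTorch sketch below defines a small LSTM forecaster that reads a 24-step window and predicts the next value; the layer sizes, window length, and training loop are illustrative settings, not tuned recommendations.

```python
import torch
import torch.nn as nn

class LSTMForecaster(nn.Module):
    def __init__(self, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):                 # x: (batch, window_length, 1)
        output, _ = self.lstm(x)          # hidden summary at every time step
        return self.head(output[:, -1])   # predict from the last step's summary

# Toy data: sliding 24-step windows of a sine wave, each labeled with its next value.
t = torch.arange(0, 200, dtype=torch.float32)
series = torch.sin(0.2 * t)
windows = torch.stack([series[i:i + 24] for i in range(len(series) - 25)]).unsqueeze(-1)
targets = torch.stack([series[i + 24] for i in range(len(series) - 25)]).unsqueeze(-1)

model = LSTMForecaster()
optim = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
for epoch in range(50):
    optim.zero_grad()
    loss = loss_fn(model(windows), targets)
    loss.backward()
    optim.step()
print("final training MSE:", loss.item())
```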

These models are capable of learning from vast amounts of data and excel at modeling complex, nonlinear dependencies in chaotic systems, such as stock markets, real-time sensor data, or natural language. When your dataset is large and the underlying patterns are deep and subtle, an LSTM or GRU can be a powerful and efficient choice.

Transformers and Modern Architectures

The latest and most-hyped models to enter the forecasting arena are Transformers. Originally designed for natural language processing (the architecture behind today’s LLMs), their power comes from a mechanism called self-attention. Unlike an LSTM, which processes data sequentially (one step at a time), the attention mechanism can look at the entire data sequence at once. It learns to assign "attention scores" to different time steps, which allows it, for example, to spot a relationship between a sales spike last year and a holiday promotion last week. This makes Transformers exceptionally powerful for massive datasets and long-range forecasting, where the dependencies are spread across a vast horizon.
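A minimal sketch of the underlying computation, with illustrative dimensions and random (untrained) projection weights, shows how every time step receives an attention score for every other step:

```python
# Scaled dot-product self-attention over a time-series window. In a trained
# Transformer the projections are learned; here they are random for illustration.
import torch
import torch.nn.functional as F

batch, seq_len, d_model = 1, 96, 16
x = torch.randn(batch, seq_len, d_model)            # embedded time-series window

W_q, W_k, W_v = (torch.randn(d_model, d_model) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v                  # queries, keys, values

scores = Q @ K.transpose(-2, -1) / d_model ** 0.5    # (batch, seq_len, seq_len)
attn = F.softmax(scores, dim=-1)                     # attention scores between time steps
context = attn @ V                                   # each step is a weighted mix of all steps
print(attn.shape, context.shape)
```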

However, this power comes at a steep computational cost. More importantly, their real-world value in TSF is a topic of intense debate. The same paper that praised linear models (Zeng et al., 2022) suggested that the original attention mechanism, built for language, might just be capturing noise in time series data.

This has not stopped progress. It has inspired a new wave of research to create Transformer-based architectures specifically for time series. Models like PatchTST, which breaks the time series into patches (Nie et al., 2022), are now being designed to overcome these challenges, proving that while the original Transformer may have been a flawed tool for this job, the core idea of attention is driving the field forward.
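As a rough illustration of the patching idea (not the full PatchTST model), the snippet below cuts a univariate series into overlapping patches that would each become one token for a Transformer; the patch length and stride are illustrative choices, not the paper’s settings.

```python
import torch

series = torch.randn(512)            # a single univariate series
patch_len, stride = 16, 8

patches = series.unfold(0, patch_len, stride)   # shape: (num_patches, patch_len)
print(patches.shape)                            # each row is one "word" for the Transformer
```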

In this article, we briefly introduced time series forecasting models, from the simplest to the most complex. However, the fundamental principle still holds: there is no single best model. The right choice depends on your data, the accuracy you need, and the cost you can afford. Table 1 serves as a guide for selecting the most suitable model.

Table 1: A Practical Comparison of Modern Forecasting Models

References: 

Nie, Y., Nguyen, N. H., Sinthong, P., Kalagnanam, J., 2022. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. arXiv preprint arXiv:2211.14730. doi:10.48550/arXiv.2211.14730.

Zeng, A., Chen, M., Zhang, L., Xu, Q., 2022. Are Transformers Effective for Time Series Forecasting? arXiv preprint arXiv:2205.13504. doi:10.48550/arXiv.2205.13504.

Acknowledgements: We would like to thank Ronak Tiwari for taking the time to review this article. Photo credit goes to Markus Spiske for the header photo and Jakub Żerdzicki for the footer photo.