Time series data is among the most common data types we encounter in daily life. Financial markets, weather, satellite measurements, and household electricity consumption all produce data recorded at regular intervals, and all of them can be studied as time series. In data science, time series analysis offers a range of models that can be plugged in to produce forecasts. For instance, forecasting makes it possible to monitor and regulate electricity consumption. In the same way, analysing stock prices can help a person decide when to buy or sell shares. Time series can be explored through interactive plots: line graphs, scatter plots, bar graphs, and many others reveal the trend in the data. Multinational companies now build interactive dashboards to compare projected results with the present scenario. Time series analysis plays a prominent role in understanding recorded data and in processing it to forecast with better accuracy.
For time series analysis, the Python libraries pandas and NumPy provide a wide range of predefined functions. Both support mathematical operations over data frames (a set of features and attributes laid out like a spreadsheet). Python's built-in datetime module is another option, but it is less convenient for this kind of work. With datetime you need to define the pattern, <month, day, year> or <day, month, year>, for each date/time column or feature, whereas pandas and NumPy can parse common date/time formats (such as ISO 8601 strings) directly. You can also set the date/time column as the index and perform numerous operations on it. NumPy provides its own date type, datetime64.
Forecasts depend heavily on the parameters with which the model is run. Choosing adequate parameters is the backbone of every time series model. Hyperparameters are tuned before the final model is created, by testing candidate values for better accuracy, and a loop is then defined to optimize the model.
In this write-up, we will go through the data processing segment of time series analysis. Data processing is the first phase of time series analysis: the data is processed and cleaned before any analysis takes place. The data is checked for null values. We generally don't drop missing data, since doing so would affect the results. Instead, missing values are filled using measures of central tendency or interpolation. The aim is to generate unbiased values, but central tendency methods and basic interpolation can introduce some bias. Other imputation methods that aim to reduce this bias include Newton's interpolation, hot deck imputation, cold deck imputation, and stochastic regression imputation.
pandas provides a predefined function for interpolation. DataFrame.interpolate() fills the null values in a data frame using one of several interpolation techniques, rather than simply dropping rows or columns. These techniques can still introduce some bias into the data. More advanced interpolating techniques include the following.
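Before turning to those techniques, the basic DataFrame.interpolate() call can be sketched as follows. The column name "reading" and the values are made up for illustration; only the interpolate() call itself comes from the text.

```python
import numpy as np
import pandas as pd

# Hypothetical daily readings with two missing values (illustrative data only).
idx = pd.date_range("2021-01-01", periods=6, freq="D")
df = pd.DataFrame({"reading": [10.0, np.nan, 14.0, np.nan, 18.0, 20.0]}, index=idx)

# Count the nulls before filling them.
n_missing = int(df["reading"].isna().sum())

# Linear interpolation treats the rows as equally spaced and fills each gap
# from its two neighbours.
filled = df.interpolate(method="linear")
print(n_missing)
print(filled)
```

Linear interpolation is only one of the methods the function accepts; which one is appropriate depends on the data.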
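· Newton Interpolation: The Newton polynomial through the points (x_0, y_0), ..., (x_k, y_k) can be written as:
N(x) = [y_0] + [y_0, y_1](x − x_0) + ... + [y_0, ..., y_k](x − x_0)(x − x_1)...(x − x_{k−1}),
where [y_0, ..., y_j] denotes the j-th divided difference.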
Newton's interpolation takes two arrays, x and y, and the value xi at which you want to interpolate. You write a function for this that returns the interpolated values along with their relative errors in a multidimensional array. To understand the errors better, visualize them: boxplots help identify mild and extreme outliers, and the extreme outliers can then be eliminated.
We import pandas to create the data frame, NumPy for the Newton polynomial calculations, and matplotlib.pyplot for visualization. We can then plot Newton polynomials of different orders and compare them with the actual function f(x) = ln(x).
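A minimal sketch of such a function is shown below, using divided differences and only NumPy. The function name newton_interpolate and the sample points are assumptions made for illustration, not the original implementation, and the plotting step is left out.

```python
import numpy as np

def newton_interpolate(x, y, xi):
    """Evaluate the Newton polynomial through the points (x, y) at xi."""
    x = np.asarray(x, dtype=float)
    coef = np.array(y, dtype=float)  # copy; overwritten with divided differences
    n = len(x)
    # After pass j, coef[k] holds the divided difference [y_{k-j}, ..., y_k] for k >= j.
    for j in range(1, n):
        coef[j:] = (coef[j:] - coef[j - 1:-1]) / (x[j:] - x[:n - j])
    # Horner-style evaluation of the Newton form.
    xi = np.asarray(xi, dtype=float)
    result = np.full_like(xi, coef[-1])
    for k in range(n - 2, -1, -1):
        result = result * (xi - x[k]) + coef[k]
    return result

# Interpolate ln(x) from four known points and compare with the true values.
x_nodes = np.array([1.0, 2.0, 3.0, 4.0])
y_nodes = np.log(x_nodes)
xi = np.array([1.5, 2.5, 3.5])
estimate = newton_interpolate(x_nodes, y_nodes, xi)
relative_error = np.abs(estimate - np.log(xi)) / np.abs(np.log(xi))

# One row per query point: xi, interpolated value, relative error.
print(np.column_stack([xi, estimate, relative_error]))
```

The relative errors in the last column are the values one would pass to a boxplot to look for outliers.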
NumPy handles dates and times more efficiently than Python's built-in datetime objects. The NumPy data type is called datetime64 to distinguish it from Python's datetime.
We see the dtype listed as ‘datetime64[D]’. This tells us that NumPy applied a day-level date precision. If we want, we can pass in different measurements, such as [h] for hour or [Y] for year.
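The behaviour described above can be reproduced with a short sketch. The date strings are arbitrary examples.

```python
import numpy as np

# With plain date strings, NumPy infers day-level precision.
dates = np.array(["2020-03-15", "2020-03-16", "2020-03-17"], dtype="datetime64")
print(dates.dtype)

# The precision can be set explicitly: [h] for hours, [Y] for years.
hours = np.array(["2020-03-15", "2020-03-16"], dtype="datetime64[h]")
years = np.array(["2020-03-15", "2021-03-16"], dtype="datetime64[Y]")
print(hours)
print(years)
```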
Just as np.arange(start, stop, step) can be used to produce an array of evenly spaced integers, we can pass a dtype argument to obtain an array of dates. Remember that the stop date is exclusive. By omitting the step value we obtain every value at the chosen precision.
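As a small illustration of both cases (the dates are arbitrary):

```python
import numpy as np

# Weekly steps at day precision; the stop date 2018-06-23 is excluded.
weekly = np.arange("2018-06-01", "2018-06-23", 7, dtype="datetime64[D]")
print(weekly)

# No step given: one value per unit of the precision, here one per year.
yearly = np.arange("1968", "1976", dtype="datetime64[Y]")
print(yearly)
```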
Pandas Datetime Index
When working with pandas data frames, we usually handle time series through a DatetimeIndex. pandas has many functions and methods for working with time series. The simplest way to build a DatetimeIndex is with the pd.date_range() function.
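A minimal sketch, with an arbitrary start date, that builds seven consecutive days:

```python
import pandas as pd

# Seven dates, one calendar day apart, starting from an arbitrary date.
idx = pd.date_range("2018-07-08", periods=7, freq="D")
print(idx)
```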
DatetimeIndex Frequencies: When we used pd.date_range() above, we had to pass in a frequency parameter ‘D’. This created a series of 7 dates spaced one day apart. Another way is to convert incoming text with the pd.to_datetime() method:
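For example, day-first text can be converted by stating the format explicitly. The strings below are made up; passing format is a choice made for this sketch so that the day/month order is unambiguous.

```python
import pandas as pd

# Day/month/year text; the explicit format removes any ambiguity
# about which number is the day and which is the month.
idx = pd.to_datetime(["2/1/2018", "3/1/2018"], format="%d/%m/%Y")
print(idx)
```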
A third way is to pass a list or an array of datetime objects to the pd.DatetimeIndex() constructor:
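A sketch using a NumPy array of day-precision dates (arbitrary values):

```python
import numpy as np
import pandas as pd

# Dates stored by NumPy at day-level precision.
some_dates = np.array(["2016-03-15", "2017-05-24", "2018-08-09"], dtype="datetime64[D]")

idx = pd.DatetimeIndex(some_dates)
print(idx)
print(idx.dtype)  # the stored unit depends on the installed pandas version
```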
Notice that even though the dates came into pandas with day-level precision, pandas stores them at a finer precision in the expectation that we might want this later on. Older pandas releases always use nanoseconds; newer releases may choose a different unit. To set an existing column as the index, use .set_index(): df.set_index('Date', inplace=True)
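To find the index label of the smallest or largest value in a column, run .idxmin() or .idxmax() on df['column']. Series.argmin() and .argmax() were deprecated for this purpose in older pandas releases, and in current releases they return integer positions rather than labels. On the index itself we still use .argmin() and .argmax().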
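The following sketch puts these calls together on a tiny, made-up data frame. The column names 'Date' and 'Close' follow the text; the values are hypothetical.

```python
import pandas as pd

# Hypothetical prices, deliberately not sorted by date.
df = pd.DataFrame({
    "Date": pd.to_datetime(["2020-01-03", "2020-01-01", "2020-01-04", "2020-01-02"]),
    "Close": [10.0, 11.5, 9.0, 11.0],
})
df = df.set_index("Date")

# Index labels (dates) of the lowest and highest closing price.
low_label = df["Close"].idxmin()
high_label = df["Close"].idxmax()

# Integer positions of the earliest and latest date in the index.
first_pos = df.index.argmin()
last_pos = df.index.argmax()
print(low_label, high_label, first_pos, last_pos)
```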
A common operation with time-series data is resampling based on the time series index.
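Syntax (from older pandas releases; several of these parameters, such as how, fill_method, loffset, and base, have been removed in later versions): DataFrame.resample(rule, how=None, axis=0, fill_method=None, closed=None, label=None, convention='start', kind=None, loffset=None, limit=None, base=0, on=None, level=None)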
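rule: the offset string or object representing the target conversion
axis: int, optional, default 0
closed: {'right', 'left'}, which side of the bin interval is closed
label: {'right', 'left'}, which bin edge to label the bucket with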
convention: For PeriodIndex only, controls whether to use the start or end of rule.
loffset: Adjust the resampled time labels
base: For frequencies that evenly subdivide 1 day, the “origin” of the aggregated intervals. For example, for the ‘5min’ frequency, the base could range from 0 through 4. Defaults to 0.
on: For a DataFrame, column to use instead of index for resampling. Column must be DateTime-like.
level: For a MultiIndex, level (name or number) to use for resampling. The level must be DateTime-like.
When calling .resample() you first need to pass in a rule parameter, then you need to call some sort of aggregation function. The rule parameter describes the frequency with which to apply the aggregation function (daily, monthly, yearly, etc.). The aggregation function is needed because resampling requires some mathematical rule for combining the rows in each period (mean, sum, count, etc.). The rule is passed in using an "offset alias", for example 'D' (calendar day), 'W' (weekly), 'M' (month end), 'Q' (quarter end), or 'A' (year end).
The above data is the stock price of Starbucks. 'Close' is the closing price of the share, and 'Volume' is the volume traded on that particular date. Resampling rule 'A' takes all the data points in a given year, applies the aggregation function (in this case the mean), and reports the result against the last day of that year.
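A yearly resample can be sketched as follows. The data frame here is synthetic and only stands in for the Starbucks data. The sketch passes the offset object pd.offsets.YearEnd() instead of the string 'A', because the spelling of the year-end alias differs between pandas versions (recent releases use 'YE').

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the stock data: two years of daily values.
idx = pd.date_range("2015-01-01", "2016-12-31", freq="D")
df = pd.DataFrame(
    {
        "Close": np.linspace(40.0, 60.0, len(idx)),
        "Volume": np.arange(len(idx)) % 5 + 1,
    },
    index=idx,
)

# Year-end rule plus an aggregation: one row per year, labelled 31 December.
yearly_mean = df.resample(pd.offsets.YearEnd()).mean()
print(yearly_mean)
```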
In the data cleaning process, df.shift() moves the values in a column down or up by a given number of periods. It is useful for time series with a frequency, for calculating the difference between consecutive rows, and for calculating the difference in days for time series data.
When shifting with the freq argument, make sure the index of the DataFrame is a date or datetime index; otherwise pandas raises a NotImplementedError. The series can be shifted in two directions: a positive value shifts the values downwards (forward in time), and a negative value shifts them upwards (backward in time).
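The different kinds of shift can be sketched on a small series. The values are made up for illustration.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=5, freq="D")
s = pd.Series([10.0, 12.0, 11.0, 15.0, 14.0], index=idx)

# Positive periods: values move down one row; the first row becomes NaN.
down = s.shift(1)

# Negative periods: values move up one row; the last row becomes NaN.
up = s.shift(-1)

# Difference between consecutive rows.
daily_change = s - s.shift(1)

# With freq, the index moves instead of the values, so nothing is lost.
moved_index = s.shift(1, freq="D")
print(pd.DataFrame({"s": s, "down": down, "up": up, "change": daily_change}))
print(moved_index)
```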
Rolling and Expanding
Two of the window functions in pandas are rolling and expanding. Before calculating a rolling or expanding mean, make sure the data has no null values.
Expanding: The expanding window function takes into account everything from the start of the time series up to each point in time.
The above graph depicts the expanding mean of the 'Close' of the Starbucks stock. min_periods is set to 30, so the expanding mean is reported only once at least 30 observations (roughly a month of data) are available. The plot shows an upward trend for the stock.
It doesn’t help much to visualize an expanding operation against the daily data, since all it gives us is a picture of the “stability” or “volatility” of a stock.
Rolling: A common process with time series is to create data based on a rolling mean. The idea is to divide the data into “windows” of time, and then calculate an aggregate function for each window. In this way, we obtain a simple moving average.
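Both window types can be sketched on a short synthetic series. The window sizes are chosen small here so the result is easy to check by hand; they are not the values used for the stock data.

```python
import numpy as np
import pandas as pd

idx = pd.date_range("2021-01-01", periods=10, freq="D")
close = pd.Series(np.arange(1.0, 11.0), index=idx)  # 1, 2, ..., 10

# Rolling: mean of the latest 3 values; NaN until 3 values are available.
sma3 = close.rolling(window=3).mean()

# Expanding: mean of everything seen so far, reported once 2 values exist.
exp_mean = close.expanding(min_periods=2).mean()

print(pd.DataFrame({"close": close, "sma3": sma3, "expanding": exp_mean}))
```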
A time series can be described by either an additive model or a multiplicative model. Every series comprises a level and some unwanted error, or noise. Components like trend and seasonality are optional. In the additive model the components are added together as follows:
y(t) = Level + Trend + Seasonality + Noise
An additive model is linear: changes over time are consistently made by the same amount, so the trend is a straight line. A linear seasonality has the same frequency and amplitude in every cycle.
In the multiplicative model the components are multiplied together:
y(t) = Level * Trend * Seasonality * Noise
Unlike the additive model, a multiplicative model is non-linear, for example quadratic or exponential. In such models the changes increase or decrease over time, so the trend appears as a curve. A non-linear seasonality has an increasing or decreasing frequency and/or amplitude over time.
ETS models (Error, Trend, Seasonality) cover a wide variety of models, including exponential smoothing, trend methods, and ETS decomposition. As we begin working with endogenous data ("endog" for short) and start to develop forecasting models, it helps to identify and isolate the factors working within the system that influence behavior. Here "endogenous" refers to internal factors, while "exogenous" relates to external forces. These models fall under the category of state-space models and include decomposition (described below) and exponential smoothing (described in an upcoming section). The decomposition of a time series attempts to isolate individual components such as error, trend, and seasonality (ETS).
Statsmodels, a Python library of statistical models that includes tools for time series analysis, provides a seasonal decomposition tool we can use to separate the different components. We apply an additive model when the trend looks roughly linear and the seasonality and trend components seem constant over time. A multiplicative model is more appropriate when the series increases (or decreases) at a non-linear rate.
The above graph is for the airline passengers from 1949 to 1961. Based on this chart, it looks like the trend in the earlier days is increasing at a higher rate than just linear.
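To show what an additive decomposition does, here is a hand-rolled sketch in pandas. It is not the statsmodels implementation, only an illustration of the classical idea (moving-average trend, per-month seasonal means, leftover residual), and it runs on a synthetic series rather than the airline data.

```python
import numpy as np
import pandas as pd

# Synthetic monthly series: linear trend plus a repeating yearly pattern.
idx = pd.date_range("2015-01-01", periods=48, freq="MS")
t = np.arange(48)
pattern = np.array([-3, -2, 0, 2, 3, 2, 0, -2, -3, -1, 1, 3], dtype=float)  # sums to 0
y = pd.Series(100 + 0.5 * t + np.tile(pattern, 4), index=idx)

period = 12

# Trend: centered moving average. An even period needs a second 2-term
# average so that the window is centered on an actual observation.
trend = y.rolling(period).mean().rolling(2).mean().shift(-(period // 2))

# Seasonality: average detrended value for each calendar month, centered on 0.
detrended = y - trend
month_means = detrended.groupby(detrended.index.month).mean()
month_means = month_means - month_means.mean()
seasonal = pd.Series(month_means.loc[y.index.month].to_numpy(), index=y.index)

# Residual: whatever trend and seasonality do not explain.
residual = y - trend - seasonal
print(pd.DataFrame({"trend": trend, "seasonal": seasonal, "residual": residual}).head(14))
```

For a multiplicative model the same steps apply with division in place of subtraction. The trend is undefined for the first and last six months because the centered window does not fit there.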
Exponentially Weighted Moving Averages (EWMA)
The basic Simple Moving Average (SMA) has some weaknesses:
· Smaller windows will lead to more noise, rather than signal
· It will always lag by the size of the window
· It will never reach the full peak or valley of the data, because of the averaging.
· Does not inform you about possible future behavior, all it does is describe trends in your data.
· Extreme historical values can skew your SMA significantly
To help fix some of these issues, we can use an exponentially weighted moving average (EWMA). EWMA reduces the lag effect of SMA by putting more weight on values that occurred more recently, hence the name. The amount of weight applied to the most recent values depends on the parameters used in the EWMA and on the number of periods given the window size.
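The EWMA is computed as y_t = (Σ_{i=0..t} w_i x_{t−i}) / (Σ_{i=0..t} w_i), where x_t is the input value, w_i is the applied weight, and y_t is the output. How the weight term w_i is defined depends on the adjust parameter you provide to the .ewm() method. When adjust=True (the default), weighted averages are calculated using weights w_i = (1 − α)^i.
When adjust=False, the moving average is calculated recursively as y_0 = x_0 and y_t = (1 − α) y_{t−1} + α x_t, which is equivalent to using weights w_i = α(1 − α)^i for i < t and w_i = (1 − α)^i for i = t.
The smoothing factor α (0 < α ≤ 1) can be given directly or derived from a span (α = 2/(span + 1)), a center of mass (α = 1/(1 + com)), or a half-life (α = 1 − exp(ln(0.5)/halflife)). We have to pass precisely one of these into the .ewm() function.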
For the above graph, we have considered the span value = 12 as per the passenger data.
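A short sketch of the call with span=12, next to a 12-period simple moving average. The series is synthetic and only stands in for the passenger data.

```python
import numpy as np
import pandas as pd

# Made-up monthly values standing in for the passenger counts.
idx = pd.date_range("2000-01-01", periods=36, freq="MS")
x = pd.Series(100.0 + 2.0 * np.arange(36), index=idx)

span = 12
alpha = 2 / (span + 1)  # the alpha that corresponds to this span

# Simple moving average: NaN until 12 values are available.
sma_12 = x.rolling(window=12).mean()

# EWMA: passing span or the equivalent alpha gives the same result.
ewma_span = x.ewm(span=span, adjust=False).mean()
ewma_alpha = x.ewm(alpha=alpha, adjust=False).mean()

print(pd.DataFrame({"x": x, "SMA-12": sma_12, "EWMA-12": ewma_span}).head(15))
```

Unlike the SMA, the EWMA produces a value from the very first observation.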
Comparing SMA to EWMA
Simple Exponential Smoothing
The above example employed Simple Exponential Smoothing with one smoothing factor α. Unfortunately, this technique does a poor job of forecasting when there is a trend in the data as seen above. In the next section, we’ll look at Double and Triple Exponential Smoothing with the Holt-Winters Methods.
In Exponentially Weighted Moving Averages (EWMA) we applied Simple Exponential Smoothing using just one smoothing factor 𝛼 (alpha). This failed to account for other contributing factors like trend and seasonality. Now, we’ll look at Double and Triple Exponential Smoothing with the Holt-Winters Methods.
In Double Exponential Smoothing (aka Holt’s Method) we introduce a new smoothing factor 𝛽 (beta) that addresses trend:
Because we haven’t yet considered seasonal fluctuations, the forecasting model is simply a straight sloped line extending from the most recent data point. With Triple Exponential Smoothing (aka the Holt-Winters Method) we introduce a smoothing factor 𝛾 (gamma) that addresses seasonality:
Here L represents the number of divisions per cycle. In our case, looking at monthly data that displays a repeating pattern each year, we would use L = 12. In general, higher values for α, β, and γ (values closer to 1) place more emphasis on recent data.
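To make the role of β concrete, here is a hand-written sketch of double exponential smoothing in one common textbook formulation. The function name holt_linear, the initialisation, and the exact arrangement of the update equations are assumptions for illustration; the formulas that originally accompanied this section, and the statsmodels implementation, may be arranged differently.

```python
import numpy as np

def holt_linear(x, alpha, beta, horizon):
    """Double exponential smoothing; returns forecasts for 1..horizon steps ahead."""
    x = np.asarray(x, dtype=float)
    # Simple initialisation (an assumption): first value and first difference.
    level, trend = x[0], x[1] - x[0]
    for value in x[1:]:
        prev_level = level
        # Level: blend the new observation with the previous one-step forecast.
        level = alpha * value + (1 - alpha) * (prev_level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # The forecast is a straight line from the last level with the last slope.
    return level + trend * np.arange(1, horizon + 1)

# On perfectly linear data the forecast continues the same line.
series = 5 + 2 * np.arange(10)  # 5, 7, ..., 23
print(holt_linear(series, alpha=0.5, beta=0.3, horizon=3))
```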
Simple Exponential Smoothing
A variation of the statsmodels Holt-Winters function provides Simple Exponential Smoothing. We'll show that it performs the same calculation of the weighted moving average as the pandas .ewm() method:
y_0 = x_0
y_t = (1 − α) y_{t−1} + α x_t
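The recursion can be checked directly against pandas. The sketch below writes the two formulas out by hand instead of calling statsmodels; the function name simple_exp_smoothing and the sample values are illustrative.

```python
import numpy as np
import pandas as pd

def simple_exp_smoothing(x, alpha):
    """Apply y_0 = x_0 and y_t = (1 - alpha) * y_{t-1} + alpha * x_t."""
    y = np.empty(len(x))
    y[0] = x[0]
    for t in range(1, len(x)):
        y[t] = (1 - alpha) * y[t - 1] + alpha * x[t]
    return y

x = pd.Series([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])
alpha = 2 / (12 + 1)  # the alpha that corresponds to span=12

by_hand = simple_exp_smoothing(x.to_numpy(), alpha)
by_pandas = x.ewm(alpha=alpha, adjust=False).mean().to_numpy()
print(np.column_stack([by_hand, by_pandas]))
```

With adjust=False the two columns agree, which is the equivalence the text refers to.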
Here we can see that Double Exponential Smoothing is a much better representation of the time-series data. Let’s see if using a multiplicative seasonal adjustment helps.
Although minor, it does appear that a multiplicative adjustment gives better results. Note that the green line almost completely overlaps the original data.
Triple Exponential Smoothing
Triple Exponential Smoothing, the method most closely associated with Holt-Winters, adds support for both trends and seasonality in the data.
"The more you learn and write, the more you grow": this is the key to holding your place in the data science world. To understand the subject professionally and systematically, a strong connection with a mentor is essential. I would like to thank Mr. Venkatesh Gauri Shankar, Assistant Professor, Department of Information Technology, Manipal University Jaipur, Rajasthan, India, for guiding me on the path of data science. For more insights on time series analysis, you can contact him at firstname.lastname@example.org. His contribution to the field of data science is highly commendable.
There is always a lot more to learn from you, sir. Thank you.