Monday, September 17, 2012

Introductory Time Series with R

---------------Notes and shortcuts------------
> data(AirPassengers)
> AP = AirPassengers
> AP

> www ="http://www.massey.ac.nz/~pscowper/data/pounds_nz.dat"
> z = read.table(www, header=T)
> z.ts = ts(z, st=1991, fr=4)

Reading columns of a series
plot(YTL[,3])   # plot the 3rd column of the series

 
gld = read.csv("http://ichart.finance.yahoo.com/table.csv?s=6012.KL&ignore=.csv", stringsAsFactors=F) 
 


colnames(gld)
[1] "Date"      "Open"      "High"     
[4] "Low"       "Close"     "Volume"   
[7] "Adj.Close"
 
gld$Open          #Access the Open column 
gld$Adj.Close     #Access last column from CSV file
 
 

ts.plot(z.ts, predict(HoltWinters(z.ts), n.ahead=4*12), lty=1:2)

---------CHAPTER 1----------


When a variable is measured sequentially in time over a fixed interval (the sampling interval), the data form a time series.

Observations that have been collected in the past over fixed sampling intervals
form an historical time series.

A sequence of random variables taken over fixed sampling intervals is sometimes referred to as a discrete-time stochastic process.

> data(AirPassengers)
> AP = AirPassengers


aggregate sums the observations within each year, which removes the seasonal effect
cycle extracts the season (the position within each year) for each item of data

> plot(aggregate(AP), ylab="Annual passengers/1000s")
> boxplot(AP ~ cycle(AP), names=month.abb)
> layout(1:2)
> plot(aggregate(AP)); boxplot(AP ~ cycle(AP))

Read external data
> www = "http://www.massey.ac.nz/~pscowper/data/cbe.dat"
> cbe = read.table(www, header=T)
> cbe[1:4,]

To create a time series object:
ts(1:120, start=c(1990, 1), end=c(1993, 8), freq=12)   # the 120 values are truncated to the 44 months from Jan 1990 to Aug 1993

For a data file like the one above (the cbe columns are choc, beer, elec):
> elec.ts = ts(cbe[,3], start=1958, freq=12)
> beer.ts = ts(cbe[,2], start=1958, freq=12)
> choc.ts = ts(cbe[,1], start=1958, freq=12)

Plot the three columns together after converting each to a time series:
plot(cbind(elec.ts, beer.ts, choc.ts),
main="Chocolate, Beer, and Electricity Production: 1958-1990")

The three series constitute a multiple time series. There are many functions
in R for handling more than one series; to obtain the intersection of two series that overlap in time, use ts.intersect:

ap.elec = ts.intersect(AP, elec.ts)
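The intersection can then be used, for example, to measure the association between the two series over their overlapping period (a sketch; ap.elec is from the command above):
> ap = ap.elec[,1]; elec = ap.elec[,2]
> plot(as.vector(ap), as.vector(elec))    # scatter plot over the overlapping period
> cor(ap, elec)                           # correlation of the overlapping values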


The term non-stationary is used to describe a time series that has underlying trends or seasonal effects.

Covariance
a measure of the linear association between two variables:
sum((x - mean(x))*(y - mean(y)))/(n - 1)
cov(x, y)
If y tends to decrease as x increases, the covariance tends to be negative; if x and y tend to increase together, the covariance is positive.

Correlation measures the linear association between a pair of variables (x, y):
cor(x, y)
A value of +1 or -1 indicates an exact linear association; a value of 0 indicates no linear association.
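A quick numerical check that the formula above agrees with R's built-in functions, and that correlation is just covariance rescaled (an illustrative sketch with made-up vectors a and b):
> a = c(1, 2, 4, 7); b = c(2, 3, 5, 9)           # made-up data for illustration
> n = length(a)
> sum((a - mean(a))*(b - mean(b)))/(n - 1)       # covariance by hand
> cov(a, b)                                      # same value from the built-in function
> cov(a, b)/(sd(a)*sd(b))                        # correlation is covariance divided by the two sds
> cor(a, b)                                      # same value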

In R the lag k autocorrelations can be obtained from the autocorrelation
function acf. By default, acf plots the autocorrelations against lag; this plot is called the correlogram.
acf(x)

For example, the lag 1 autocorrelation for x is:
> acf(x)$acf[2]   # index 2 because acf(x)$acf[1] is the lag 0 autocorrelation, which is always 1

pairs of values taken at a time difference of lag 1 will tend to have a high correlation when there are trends in the data.
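To see this, the lag 1 autocorrelation is approximately the correlation of the pairs (x[t-1], x[t]); the two values differ slightly because acf uses the overall mean and divides by n (a sketch; the seed and series are chosen here purely for illustration):
> set.seed(2)
> x = cumsum(rnorm(100))        # a trending series (a random walk)
> cor(x[-1], x[-length(x)])     # correlation of pairs one step apart: high
> acf(x, plot=FALSE)$acf[2]     # lag 1 autocorrelation: similarly high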

Correlation is dimensionless, so there are no units for the y-axis.
The dotted lines represent the 5% significance level for the statistical test of ρk = 0.
Any correlations that fall outside these lines are ‘significantly’ different from zero.

**Usually a trend in the data will show in the correlogram as a slow decay in the autocorrelations as the lag increases.

mean and variance -> ‘centre’ and the ‘spread’

Stationary time series {xt}: if the mean and the autocovariance do not change with time, then {xt} is called a second-order stationary process. Since the variance is the autocovariance at lag 0, this means the variance and the autocorrelations also do not change with time.

A simple additive decomposition model (also called a classical decomposition model):
xt = mt + st + et
where xt is the observed series, mt is the trend, st is the seasonal effect, and et is the remainder or residual series.

In R the functions decompose and stl estimate trends and seasonal effects, using methods based on moving averages and local polynomial regression (loess) respectively.

plot(decompose(elec.ts))
Nesting the functions within plot, using plot(decompose()) or plot(stl()), produces a single figure showing the original series xt and the decomposed series: mt, st and et

If the seasonal effect tends to increase as the trend increases, a multiplicative model may be more appropriate: xt = mt * st + et.
If the residual series is also multiplicative (xt = mt * st * et), the data can be transformed using the natural logarithm to produce an additive model: log(xt) = log(mt) + log(st) + log(et).
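decompose supports the multiplicative form directly through its type argument, which can be compared against the log-transform approach (a sketch using elec.ts from above):
> plot(decompose(elec.ts, type="mult"))   # multiplicative decomposition of the electricity series
> plot(decompose(log(elec.ts)))           # additive decomposition of the logged series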


In forecasting, one approach is to use the past values as predictor variables in a model that gives more weight to the more recent observations. This approach is the basis of models built on exponential smoothing.

The Holt-Winters procedure provides a prediction for xn+1 by updating estimates of the level, trend, and seasonal effect. Turning off the trend and seasonal updates leaves simple exponential smoothing:
> z.hw = HoltWinters(z.ts, beta=FALSE, gamma=FALSE)   # beta=0, gamma=0 in older versions of R
> z.hw$alpha
> plot(z.hw)

In general, exponential smoothing provides good predictions when the underlying process is a random walk or a random walk with moving average error terms.

The predict function in R can be used with the fitted model to make forecasts into the future
> AP.hw = HoltWinters(AP, seasonal="mult")
> plot(AP.hw)
> AP.predict = predict(AP.hw, n.ahead=4*12)
> ts.plot(AP, AP.predict, lty=1:2)


SUMMARY
ts produces a time series object
ts.plot produces a time plot for one or more series
ts.intersect creates the intersection of two or more time series
cycle returns the season for each value in a series
time returns the times for each value in a series
acf returns the correlogram
decompose decomposes a series into the components: trend, seasonal effect, and residual
HoltWinters estimates the parameters of the Holt-Winters or exponential smoothing model
predict forecasts future values



---------CHAPTER 2---------
A residual error et is the difference between an observed value and the value predicted by a model at time t. If the model captures the structure in the data, the residual series should resemble white noise.

Time series simulated using a model are sometimes called synthetic series, to distinguish them from observed historical series.

rnorm(100) is used to simulate 100 independent standard Normal variables, which is equivalent to simulating a Gaussian white noise series of length 100
> set.seed(1)
> e = rnorm(100)
> plot(e, type="l")

set.seed is used to provide a starting point (or seed) in the simulations, thus ensuring that the simulations can be reproduced
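A quick demonstration of this reproducibility (sketch):
> set.seed(1); rnorm(3)   # some fixed numbers
> set.seed(1); rnorm(3)   # the same numbers again
> rnorm(3)                # different numbers: the stream has moved on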

Histogram of a simulated white noise series, with the standard normal density superimposed:
> x = seq(-3,3, len=1000)
> hist(rnorm(100), prob=T); points(x, dnorm(x), type="l")

Correlogram of a simulated white noise series. The underlying autocorrelations are all zero (except at lag 0); any sample values that fall outside the significance lines (expected at about 1 lag in 20) are due to sampling variation.
> acf(rnorm(100))


Differencing adjacent terms of a series can transform a non-stationary series to a stationary series

The following commands can be used to simulate random walk data for x.
> x = e = rnorm(1000)
> for (t in 2:1000) x[t] = x[t-1] + e[t]
> plot(x, type="l")

The ‘for’ loop then generates the random walk x[t] = x[t-1] + e[t] (Equation (2.3) in the book).

**The first order differences of a random walk are a white noise series,
so the correlogram of the series of differences can be used to assess whether a given series is a random walk.
> acf(diff(x))

Fitting exponential smoothing with a trend but no seasonal term to the exchange rate series, then checking its residuals:
> z.hw = HoltWinters(z.ts, gamma=FALSE)   # gamma=0 in older versions of R
> acf(resid(z.hw))

The partial autocorrelation at lag k is the correlation that results after removing the effect of any correlations due to previous terms.

Hence, a plot of the partial autocorrelations can be useful when determining the order of an underlying AR process. In R, the function pacf can be used to calculate the partial autocorrelations of a time series and produce a plot of the partial
autocorrelations against lag (the ‘partial-correlogram’).
> set.seed(1)
> x = e = rnorm(100)
> for (t in 2:100) x[t] = 0.7*x[t-1] + e[t]
> plot(x); acf(x); pacf(x)

An AR(p) model can be fitted to data in R using the ar function
> x.ar = ar(x, method="mle")

The method “mle” used in the fitting procedure above is based on maximising the likelihood function

Akaike Information Criterion (AIC; Akaike, 1974), which penalises models with too many parameters:
AIC = -2 × log-likelihood + 2 × number of parameters
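ar chooses the order p by minimising the AIC, and the fitted object records the AIC of each candidate order relative to the best (a sketch; x.yw is a name introduced here, using the default Yule-Walker fit on the x simulated above):
> x.yw = ar(x)      # order chosen by minimising the AIC
> x.yw$order        # the selected order
> x.yw$aic          # AIC for each candidate order, relative to the minimum (0 = best)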

> z.ar = ar(z.ts)   # an AR model fitted to the exchange rate series z.ts (assumed from the earlier examples)
> acf(z.ar$res[-1])
"-1" removes the first item from the residual series. (For a fitted AR(1) model the first item has no predicted value, because there is no observation at t = 0.)

For the annual means of the global temperature series (global.ts in the book's examples):
> global.ar = ar(aggregate(global.ts, FUN=mean), method="mle")
> mean(aggregate(global.ts, FUN=mean))
> global.ar$order; global.ar$ar
> acf(global.ar$res[-(1:global.ar$order)], lag=50)


SUMMARY
set.seed sets a seed for the random number generator, enabling a simulation to be reproduced
rnorm simulates a white noise series
diff creates a series of first-order differences
ar fits the best-fitting AR(p) model
pacf returns the partial autocorrelations and the partial correlogram


-----------CHAPTER 3 --------------
A trend is stochastic (random) when the underlying cause is not understood and can only be attributed to high serial correlation with random error. Trends of this type, which are common in *financial series, can be simulated in R using models such as the *random walk or *autoregressive process (Chapter 2).

Deterministic trends and seasonal variation can be modelled using regression (so called because there is a greater understanding of the underlying cause of the trend).

A linear model will often provide a good approximation even when the underlying process is non-linear.

Differencing turns out to be a very useful general procedure because it can ‘remove’ both stochastic and deterministic trends.
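A sketch illustrating this: a series with both a deterministic linear trend and a stochastic random-walk trend becomes stationary after a single difference (all names here are illustrative):
> set.seed(1)
> w = cumsum(rnorm(200))      # stochastic trend: a random walk
> x = 5 + 0.5*(1:200) + w     # add a deterministic linear trend
> layout(1:2)
> plot(x, type="l")           # clearly non-stationary
> plot(diff(x), type="l")     # 0.5 + white noise: stationary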

Linear models are usually fitted by minimising the sum of squared errors, which can be achieved in R using the function lm.
> set.seed(1)
> z = e = rnorm(100, sd=20)
> for (t in 2:100) z[t] = 0.8*z[t-1] + e[t]
> t = 1:100
> x = 50 + 3*t + z
> x.lm = lm(x ~ t)
> coef(x.lm)
> confint(x.lm)

confint extracts an approximate 95% confidence interval for the estimated parameters.
The function summary can be used to obtain a standard regression output; to observe this, type summary(x.lm) after entering the commands above.

> summary(x.lm)$coef
> summary(x.lm)$sigma
> summary(x.lm)$r.sq

The coefficients and their standard errors with respective t-tests, the residual standard deviation, and R-squared can all be extracted this way.

> acf(resid(lm(temp ~ time(temp))))   # temp is a temperature series from the book's examples
If the autocorrelation in the residual series is not accounted for, then the standard errors used to create confidence intervals, such as those above, or any t-ratios, will be underestimated.

> library(nlme)
> x.gls = gls(x ~ t, cor = corAR1(0.8))   # assume AR(1) errors with lag 1 autocorrelation 0.8
The primary use of GLS here is the estimation of the standard errors and the adjustment of the confidence intervals for the parameters.
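Comparing the two fits shows the effect: the point estimates are similar, but the GLS standard errors, and hence the intervals, are wider (a sketch continuing the simulated example above; intervals is nlme's confidence-interval function):
> sqrt(diag(vcov(x.lm)))    # standard errors from ordinary least squares
> sqrt(diag(vcov(x.gls)))   # larger standard errors once AR(1) errors are allowed for
> intervals(x.gls)          # confidence intervals adjusted for the autocorrelation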


TO BE CONTINUED PAGE 64


SUMMARY
lm fits a linear (regression) model
coef extracts the parameter estimates from a fitted model
confint returns a (95%) confidence interval for the parameters of a fitted model
summary gets basic summary information
gls fits a linear model using generalised least squares (allowing for autocorrelated residuals); located in the nlme library
factor returns variables in the form of ‘factors’ or indicator variables
predict returns predictions (or forecasts) from a fitted model
nls fits a non-linear model by least squares
