
CSCI351 Assignment 4

Due by: 11:59PM on April 22, 2022 (Friday)

Instruction:

Show your work (at least 50% penalty otherwise)

Submit a single PDF or Word document containing all your answers to the corresponding assignment folder under D2L (at least 10% penalty otherwise)

Make sure you submitted the intended file. It is recommended that you download what has been uploaded and double-check that the correct document has been submitted.

You can submit as many times as you want, but only the last submission will be graded. If the last submission is made after the deadline, there will be a late submission penalty.

No plagiarism: Do not copy and paste anything from textbooks or other resources to answer questions (zero points will be given otherwise).

No resubmission/extension request will be accepted.

Problem 1. Firewall Configuration (30 pt.)

Suppose a home network with a network address of 129.100.5.* (i.e., 129.100.5.0/24). The IP addresses of the Web server and email server are 129.100.5.1 and 129.100.5.2, respectively. Assume that the Web server's port number is TCP/80 and the email server's port number is TCP/25. Configure the firewall table to implement the following ruleset.

Ruleset:

1. Allow external users to access the internal web server.

2. Internal users can access external web servers (i.e., TCP/80).

3. Allow external users to access the internal email server (to send email in).

4. Internal users can access a specific external service with TCP port number 8080

(i.e., TCP/8080).

5. Allow external UDP traffic with destination port number 6700 to the internal

network (i.e., UDP/6700).

6. Everything not previously allowed is explicitly denied.

Rule | Type | Source Address | Dest. Address | Dest. Port | Action
-----+------+----------------+---------------+------------+-------
  1  |      |                |               |            |
  2  |      |                |               |            |
  3  |      |                |               |            |
  4  |      |                |               |            |
  5  |      |                |               |            |
  6  |      |                |               |            |

Problem 2. Intrusion Detection (20 pt.)

Suppose there are 9,900 benign (negative) and 100 malicious (positive) events in a certain time interval (i.e., 10,000 events in total). From the evaluation of a newly developed IDS system, the developer observed that 10 (out of 9,900) benign events were misclassified as intrusions, while 3 (out of 100) malicious events were wrongly classified as benign.

a. (10 pt.) Construct the confusion matrix (as shown in the slide of Possible Alarm

Outcomes).

                 | Intrusion Attack | No Intrusion Attack
Alarm Sounded    |                  |
No Alarm Sounded |                  |

b. (5 pt.) Calculate accuracy using the formula given.

Accuracy = (TP+TN) / (TP+TN+FP+FN)

c. (5 pt.) Calculate false positive rate using the formula given.

False positive rate = FP / (TN+FP)
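As a numerical check on parts (a)-(c), the counts in the problem can be plugged directly into the two formulas above. The Python sketch below is my own illustration of the arithmetic, not part of the assignment hand-out:

```python
# Counts given in the problem statement.
positives = 100    # malicious events
negatives = 9900   # benign events
FN = 3             # malicious events wrongly classified as benign
FP = 10            # benign events wrongly classified as intrusions

TP = positives - FN   # malicious events correctly flagged
TN = negatives - FP   # benign events correctly passed

accuracy = (TP + TN) / (TP + TN + FP + FN)
false_positive_rate = FP / (TN + FP)

print(accuracy)             # 0.9987
print(false_positive_rate)  # about 0.00101
```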


Eco 309-Assignment 4

Total points 125. Submit through D2L in Word/Excel format, single or multiple files. Must show

all Excel work, answer all parts of the question, and write necessary explanations to earn full

points. You have to use Excel to solve this numerical problem (project).

The following table gives the data for per capita income in thousands of US dollars, along with the percentage of the labor force in Agriculture, the average years of schooling of the population over 25 years of age, and an overall health index, for 15 developed countries in 2015 (data modified for educational purposes).

Develop a multiple regression model for per capita income (dependent variable) using Excel and answer

the questions below the table. You can use symbols Y, X1, X2 and X3 for the variables in your calculation.

Show your computer output.

Country | Per capita income | % of labor in | Average yrs of | Overall Health
number  | (thousand US$)    | Agriculture   | schooling      | Index
   1    |        32         |      12       |       7        |      70
   2    |        42         |      10       |      12        |      80
   3    |        41         |       8       |      11        |      77
   4    |        38         |       9       |       9        |      75
   5    |        40         |      10       |      10        |      76
   6    |        52         |       5       |      16        |      95
   7    |        44         |       7       |      10        |      82
   8    |        41         |       7       |      11        |      80
   9    |        46         |       6       |      13        |      85
  10    |        50         |       8       |      15        |      92
  11    |        47         |       7       |      11        |      90
  12    |        53         |       4       |      16        |      98
  13    |        48         |       6       |      14        |      90
  14    |        47         |       7       |      12        |      85
  15    |        50         |       6       |      15        |      90

Find the Y-intercept and the slopes for the three independent variables and interpret them. Predict the per capita income when the percentage of the labor force in Agriculture is only 4, average years of schooling is 15, and the Health Index is 100. Find the overall explanatory power (Coefficient of Determination) of the model and interpret it. Also find the adjusted Coefficient of Determination and interpret it. Find the standard error of estimate. From the ANOVA table find SSR, SSE and SST and the F-value. Perform the F-test and comment on the overall usefulness of the model. What are the degrees of freedom for Regression, Error and Total? Perform t-tests for the statistical significance of the individual coefficients. Plot the errors or residuals by country and comment on any visible pattern. Plot the errors separately by each explanatory variable and comment on the visible patterns with respect to heteroscedasticity.

Bivariate Linear Regression Analysis

Introduction

Most often, we are interested in the relationships among several variables. For example, we may be interested in the relationship of quantity demanded with its own price, the prices of substitutes and complements, and income. Similarly, we may be interested in the impact on imports of

National Income, Exchange rate, Inflation, etc. In real life, any socio-economic variable has some

relationships with a host of other variables. At the outset, we generally start with the study of

bivariate relationships (two variables at a time) and extend to multivariate relationships. Hence in

the present chapter, our focus is studying relationships between two variables at a time. The

scatter plot gives a visual impression about the nature of the relationship between two variables:

whether it is positive (upward moving as the variable on the X-axis increases) or negative

(downward), and whether it is linear or non-linear.

Numerical Example of a Demand Curve Estimation

Suppose we have ten paired observations on the quantity sold and price for some product as

follows:

Table 12.1: Quantity Sold at Various Prices

P:  2   2   4   4   5   7   7   9  10  10
Q: 63  61  50  49  44  35  34  26  20  18

Scatter Plot of Price and Quantity

The above plot exhibits a negative (or downward) slope implying that the Y-axis variable

(quantity demanded) falls as the variable on the X-axis (price-axis) increases. In Microeconomics,


we see this as a general rule to show a downward sloping demand curve. Moreover, we see that

the relationship is nearly (not perfectly) linear. We see that smaller quantities demanded are

associated with higher prices. But no quantitative information about the strength of the

relationship is provided by the scatter plot. Besides, the scatter plot cannot help make predictions

on Y based on knowledge about X (or vice versa). These issues with the scatter plot create the

need for some quantitative measures, as discussed below.

12.4 Covariation and Covariance

Let us introduce three abbreviations:

SSxx = Σ(X - X̄)² = ΣX² - (ΣX)²/n = ΣX² - n(X̄)² is the symbol for Variation in X. If we divide Variation by its degree of freedom (here n-1), then we get Variance.

SSyy = Σ(Y - Ȳ)² = ΣY² - (ΣY)²/n = ΣY² - n(Ȳ)² is the symbol for Variation in Y.

SSxy = Σ(X - X̄)(Y - Ȳ) = ΣXY - (ΣX)(ΣY)/n = ΣXY - n·X̄·Ȳ is the symbol for Covariation of X and Y.

A numerical estimate of the strength of the linear relationship between two variables X and Y is

the Covariation (SSxy), which is simply the sum of cross products of deviations of X and Y from

their respective means. If we divide SSxy by its degree of freedom (n-1), we get the Covariance

(sxy). It is like the average covariation. In symbols, Covariance (X,Y) = sxy = SSxy /(n-1)

To have an unbiased estimator of the true but unknown population covariance, we divide by n-1.

Let us denote Price as X and Quantity as Y. For the above example we have:

Calculation Table for Covariance

P = Xi   Q = Yi   Xi - X̄   Yi - Ȳ   (Xi - X̄)(Yi - Ȳ)
  2        63       -4        23           -92
  2        61       -4        21           -84
  4        50       -2        10           -20
  4        49       -2         9           -18
  5        44       -1         4            -4
  7        35        1        -5            -5
  7        34        1        -6            -6
  9        26        3       -14           -42
 10        20        4       -20           -80
 10        18        4       -22           -88
Total: 60 (X̄ = 6)   400 (Ȳ = 40)   0   0   SSxy = -439

Thus, the covariation is -439 and the covariance is sxy = -439/9 = -48.7778. These negative values

imply a negative relation between the two variables under study. The larger the numerical values,

the stronger is the positive or negative relationship. But there are problems in determining

whether a value is large or small, as discussed below.
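The hand computations in the table above can be verified in a few lines of Python; this is a standard-library sketch of the same arithmetic (variable names are my own):

```python
# Table 12.1 data: price (P) and quantity sold (Q).
P = [2, 2, 4, 4, 5, 7, 7, 9, 10, 10]
Q = [63, 61, 50, 49, 44, 35, 34, 26, 20, 18]
n = len(P)

x_bar = sum(P) / n   # mean price = 6
y_bar = sum(Q) / n   # mean quantity = 40

# Covariation: sum of cross-products of deviations from the means.
SS_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(P, Q))

# Covariance: covariation divided by its degrees of freedom (n - 1).
s_xy = SS_xy / (n - 1)

print(SS_xy)  # -439.0
print(s_xy)   # about -48.778
```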

The first issue is that these two measures are unit dependent. For example, a change of units from kilograms to grams for one of the variables leads to a thousand-fold change in the numerical value of the Covariation and Covariance. It would be complicated to compare the estimated relations between

two sets of similar variables expressed in different units (say across countries).


The Correlation Coefficient

Statisticians have defined a measure called the Correlation Coefficient denoted by the Greek letter

ρ (rho) for the population parameter and r for the corresponding sample statistic. The sample correlation coefficient formula is r = sxy/(sx·sy) = SSxy/√(SSxx·SSyy). Thus, the correlation coefficient is simply the Covariance divided by the product of the sample standard deviations of the two variables. Since both numerator and denominator involve X and Y units, the resultant

value is a unit free number. Moreover, this number can range between two fixed boundaries: -1

and +1.

If r is precisely -1, then we have a case of perfect negative correlation. The scatter plot will be a

downward sloping “perfect” straight line. If r happens to be precisely zero, we have zero

correlation (that is, no linear relation). However, the zero correlation could be a result of perfect

non-linear relation like the circular relation. The only definitive conclusion we can draw from

zero correlation is that there is no linear relation between the two variables. We can supplement

our understanding here by looking at the scatter plot. If it looks like a rectangle or scatters without

any apparent relation (positive or negative), we can conclude that the two variables are not

related. If there is a non-linear relation like circular relation, that would be visible in the scatter

plot. If r is precisely +1, we have a case of perfect positive correlation. The scatter plot will be an

upward sloping “perfect” straight line.

Generally, the correlation coefficient is other than these three extreme values. If it is

between -1 and 0, it is a case of a negative correlation. The closer is r to -1, the stronger is

the negative correlation, and the closer it is to 0, the weaker is the correlation. Similarly, if r

is between 0 and 1, then we have a positive correlation. The closer it is to 1, the stronger is

the positive correlation, and the closer it is to 0, the weaker is the relation. Let us calculate

the correlation coefficient and other sample statistics for the data given above.

Calculations for Correlation Coefficient

P = Xi   Q = Yi   Xi - X̄   Yi - Ȳ   (Xi - X̄)(Yi - Ȳ)   (Xi - X̄)²   (Yi - Ȳ)²
  2        63       -4        23           -92              16           529
  2        61       -4        21           -84              16           441
  4        50       -2        10           -20               4           100
  4        49       -2         9           -18               4            81
  5        44       -1         4            -4               1            16
  7        35        1        -5            -5               1            25
  7        34        1        -6            -6               1            36
  9        26        3       -14           -42               9           196
 10        20        4       -20           -80              16           400
 10        18        4       -22           -88              16           484
Total: 60 (X̄ = 6)   400 (Ȳ = 40)   0   0   SSxy = -439   SSxx = 84   SSyy = 2308

The sample standard deviations are sx = √(84/9) = √9.333 = 3.055 and sy = √(2308/9) = √256.44 = 16.01. The coefficient of correlation is rxy = SSxy/√(SSxx·SSyy) = -439/√(84·2308) = -0.99703, which is close to -1, showing a strong negative correlation. This was also the


impression we had looking at the scatter plot above. But now we have a concrete numerical

measure. There are tests for testing whether the sample correlation coefficient is significantly

different from zero or some other specified value. The descriptive Statistics of the two variables

using Data analysis are shown below.

Descriptive Statistics
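These summary statistics can also be reproduced outside Excel. The following standard-library Python sketch is my own check of the numbers computed above, not the Excel output:

```python
import math

P = [2, 2, 4, 4, 5, 7, 7, 9, 10, 10]
Q = [63, 61, 50, 49, 44, 35, 34, 26, 20, 18]
n = len(P)
x_bar, y_bar = sum(P) / n, sum(Q) / n

SS_xx = sum((x - x_bar) ** 2 for x in P)                      # 84
SS_yy = sum((y - y_bar) ** 2 for y in Q)                      # 2308
SS_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(P, Q))  # -439

s_x = math.sqrt(SS_xx / (n - 1))   # sample std. dev. of P, about 3.055
s_y = math.sqrt(SS_yy / (n - 1))   # sample std. dev. of Q, about 16.01

r = SS_xy / math.sqrt(SS_xx * SS_yy)
print(round(r, 5))  # -0.99703
```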

The correlation coefficient gives a good idea about the (linear) relation between variables much

better than a simple scatter plot or Covariance. But it has limitations too. The first issue is that a

correlation does not by itself imply any cause-and-effect type relation. Two variables may be

highly correlated but have no causal relation at all. For example, sales of sunglasses and sales of ice-cream are highly correlated because both are driven by a common variable: the weather

condition. But you cannot “cause” an increase in the sales of ice-cream by increasing the sales of

sunglasses or vice versa. Hence, such relations are called spurious correlations. Another issue

with the coefficient of correlation is that it cannot help us predict one variable from the

knowledge of other variables. Prediction and forecasting are important statistical exercises in

business. The Regression Analysis enables us to do that.

The Bivariate Simple Linear Regression Model

In this basic model, we assume that changes in a variable Y (called the dependent variable) can

be forecasted based on the known values of another variable X, called the explanatory (or


independent) variable. We are not claiming the existence of a cause-and-effect relationship. We

are only asserting that we can predict the values of Y if we know the X values. It may be a case

like the sunglasses and ice-cream example cited above, but for Regression, it is sufficient that we

can estimate the sales of ice-cream if we know sales of sunglasses.

The theoretical bivariate linear regression model is Yi = β0 + β1Xi + εi for i = 1, 2, …, n. The

subscript i corresponds to the observation number. This equation is called the Regression

Equation. Its geometrical counterpart (plot) is called the Regression Line.

Greek letters denote the parameters of this equation. Each observation pair Yi and Xi is a pair of

random variables with some joint probability distribution. We make the following basic

assumptions underlying the model.

1. The variable on the L.H.S. is called the Dependent variable with the generic notation Y. It is

supposed to be the variable about which we make a prediction (forecast) based on the knowledge

about the explanatory variable X on the R.H.S. of the equation.

2. We are assuming a linear relationship. The variables are in linear form (power one), and there

are no squares, square roots, logs, etc. If there is a non-linear relation, we transform the variables

so that we end up with a linear relationship between the transformed variables.

3. The first parameter, β0, is called the constant term or Intercept and is the expected Y value when X is zero. In many situations, zero may not be a possible or valid value for X. In such cases, β0 is not very meaningful but may still indicate Y's value when X is near zero.

4. The second parameter, β1, is called the slope coefficient and is generally the focus of

Regression Analysis. If it is positive, the Regression line is upward sloping. If it is negative, the

Regression line is downward sloping. If it is zero, then the Regression line is horizontal: a rare

case when the expected value of Y is constant, that is, X has no role to play. The value of

β1 shows the rate of change in Y for a small increment in X. In calculus, it is the derivative of Y

with respect to X.

5. The first two terms of the bivariate regression, β0 + β1Xi, constitute the deterministic or

systematic part in contrast to the unsystematic last term.

6. The last term (epsilon) has many names: residual or error or random disturbance or stochastic

term. We will generally call it the error term. It is an acknowledgment that the model is not

perfect. Every economic variable has hundreds of other variables influencing its value. We cannot

include all possible variables in any Regression (in bivariate, we select only one such explanatory

variable) either because we don't have sufficient information about them or want to keep the

model manageable. Therefore, this error term includes the influences of all such knowingly or

unknowingly omitted variables. This term also includes the influence of measurement errors we

almost always commit even in the included variables. We make several assumptions about its

behavior so that we can make some meaningful predictions about the variations in Y. Some of the

important assumptions about the error term are listed below (also called the Classical Linear

Model Assumptions):

i. The explanatory variables Xi are assumed to be uncorrelated with the error terms εi. That is,

the deterministic (or systematic) and stochastic parts of the regression equation are uncorrelated.

ii. The mean of the error term is zero: E(εi) = 0. The underlying

belief is that some of the omitted variables will have a positive influence and some negative, but


they will cancel out on balance. The average error is expected to be zero. This assumption assures

unbiasedness in our estimates of the regression coefficients. Consequently, E(Yi) = β0 + β1E(Xi)

because the third term drops out. Geometrically speaking, the regression line passes through the

average values of Y and X.

iii. It is assumed that the error term variance remains fixed from observation to observation:

Homoscedasticity property. Symbolically, E(εi²) = σ², a fixed value for all i. A violation of this

property (or assumption) is called “Heteroscedasticity,” which creates doubt in the statistical

tests of the coefficients’ significance. We need to make corrections or adjustments for this issue to

have confidence in our tests.

iv. There is no linear relation among error terms in different observations. The error terms are

uncorrelated. Symbolically, E(εiεj) = 0 for all i ≠ j. This assumption is called "No

Autocorrelation or No Serial Correlation.” If this assumption is violated (we have

Autocorrelation or Serial correlation), we again need to make corrections and adjustments to have

confidence in our testing.

v. For traditional hypothesis testing based on t-test and F-test, we assume normality in the

distribution of error terms.

These assumptions can be compactly expressed as εi ~ iid N(0, σ²) for all i. In other words, the

error terms are Identically, Independently, and Normally distributed with mean 0 and (fixed)

variance σ².
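The compact iid-normal statement can be made concrete with a tiny simulation. The sketch below is illustrative only; the parameter values are made up, loosely echoing the demand example, and the variable names are my own:

```python
import random

random.seed(1)  # reproducible draws

beta0, beta1, sigma = 71.36, -5.23, 1.31   # illustrative parameter values
X = [2, 2, 4, 4, 5, 7, 7, 9, 10, 10]

# Each Y is the deterministic part beta0 + beta1*X plus an iid N(0, sigma^2) error.
errors = [random.gauss(0, sigma) for _ in X]
Y = [beta0 + beta1 * x + e for x, e in zip(X, errors)]

# With many draws, the average error settles near its expected value of zero.
many = [random.gauss(0, sigma) for _ in range(100_000)]
print(abs(sum(many) / len(many)) < 0.05)  # True
```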

The Ordinary Least Squares (O.L.S.) Estimation

Under the above assumption we derive a set of formulas called the Ordinary Least Squares

formulas which minimize the Sum of Squared Errors (SSE). Let us first introduce the following

symbols (popular in the literature).

i. English letters indicate the estimators of the unknown parameters in Greek letters. Thus, the

estimator of εi will be denoted by ei. The estimators of β0 and β1 will be denoted by b0 and b1,

respectively. Since the variables are themselves expressed in English letters, their estimators will

require some other style. The sample estimate (or estimator) of Y will be denoted by Ŷ, called Y-hat. Remember that X is treated as a given or predetermined explanatory variable and, therefore,

does not have an estimator.

ii. The SSE = Σ(Yi - Ŷi)² = Σei² is the sum of squared deviations of the actual values of the Dependent variable from the estimated values. Here Yi - Ŷi = ei is the estimate of the error in using the regression model for the ith observation.

The O.L.S. method (also called the Least Squares Method) estimates the parameters in such a way

that S.S.E. is minimized (that is, the total error is minimized). It can be easily shown, using

Calculus rules for minimization, that the estimators that minimize the S.S.E. are:

b1 = SSxy/SSxx = Σ(X - X̄)(Y - Ȳ) / Σ(X - X̄)² : the OLS estimator of the slope

b0 = Ȳ - b1X̄ : the OLS estimator of the Intercept

Using the results in Table 12.3 above, we have X̄ = 6, Ȳ = 40, SSxy = -439, SSxx = 84, and SSyy = 2308. So, b1 = SSxy/SSxx = -439/84 = -5.2262 and b0 = Ȳ - b1X̄ = 40 - (-5.2262)(6) = 40 + 31.3572 = 71.3572. The estimated Regression Equation is: Ŷ = 71.3572 - 5.2262X, or Ŷ = 71.3572 - 5.2262P.

To predict the quantity demanded when the price is 12, we simply plug the explanatory variable's given value into the estimated equation: Ŷ at P = 12 is 71.3572 - 5.2262 × 12 = 8.6428. At a price equal to $12 we expect that the quantity demanded will be only 8.6428 units of the product.
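The OLS arithmetic can be checked with a short standard-library Python sketch (my own check; names are mine):

```python
P = [2, 2, 4, 4, 5, 7, 7, 9, 10, 10]
Q = [63, 61, 50, 49, 44, 35, 34, 26, 20, 18]
n = len(P)
x_bar, y_bar = sum(P) / n, sum(Q) / n

SS_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(P, Q))  # -439
SS_xx = sum((x - x_bar) ** 2 for x in P)                      # 84

b1 = SS_xy / SS_xx        # slope: -439/84, about -5.2262
b0 = y_bar - b1 * x_bar   # intercept: about 71.357

def predict(price):
    """Predicted quantity from the fitted line: b0 + b1 * price."""
    return b0 + b1 * price

print(round(predict(12), 3))  # 8.643 units demanded at a price of 12
```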

Interpretations of the Estimated Regression Equation

Intercept: The intercept term shows that the quantity demanded would be 71.3572 units if the

price dropped to zero (or the good were made free). Since price is rarely zero, this has little

practical importance. But it can still give an idea of how large the demand could be if the price

were quite low.

Slope: The slope coefficient’s negative sign shows a negative relationship between the quantity

demanded and the price. Its value shows that the quantity demanded is expected to drop by about

5.2262 units if the price increased by one unit (or one dollar per unit of the good). The individual

observation variations are supposed to occur because of the noise created by the error term.

Overall Explanatory Power of the Regression Model

One popular measure of the Explanatory power or the “Goodness of Fit” of the Regression Model

is the Coefficient of Determination, denoted by R². Some authors use the symbol r², but I recommend adopting capital R², because the square of the correlation coefficient and the Coefficient of Determination are equal in bivariate Regression but not in multiple regression.

R2 = (Variation or Change in Y explained by the Regression)/(Total Variation in Y)

The total variation or change in Y is measured by SSyy and is also denoted as SST (Total Sum of Squared deviations). Its formula is already given above. Thus, Total Variation in Y = SST = SSyy = Σ(Yi - Ȳ)². The explained variation in Y is denoted by SSR (Sum of Squares due to Regression) and is equal to b1·SSxy.

We can break the total variation in Y as: SST = SSR + SSE. So, SSE can easily be calculated by subtracting SSR from SST. In the above example, SST = SSyy = 2308 and SSR = (-5.2262)(-439) = 2294.302. Therefore SSE = 2308 - 2294.302 = 13.698. Thus, out of a total variation of 2308 in Y, 2294.302 is explained by the Regression model and 13.698 is unexplained variation or total error. Hence the coefficient of determination is: R² = SSR/SST = 2294.302/2308 = 0.9941. This value of R² indicates that the Regression model can explain or capture 99.41% of the variation in the Dependent variable (quantity demanded). We will learn about testing the statistical significance of R² later. One problem with the simple Coefficient of Determination is

that it always increases when more explanatory variables are added, whether the added variables

are relevant or not. Therefore, Statisticians have devised an Adjusted Coefficient of

Determination, which penalizes for adding irrelevant variables by considering the loss in degree

of freedom when more parameters have to be estimated using the same data set. We will talk

about it in Multiple Regression in the next chapter.
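The decomposition SST = SSR + SSE and the resulting R² can be verified numerically; here is a standard-library Python sketch of the same calculation (my own check):

```python
P = [2, 2, 4, 4, 5, 7, 7, 9, 10, 10]
Q = [63, 61, 50, 49, 44, 35, 34, 26, 20, 18]
n = len(P)
x_bar, y_bar = sum(P) / n, sum(Q) / n

SS_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(P, Q))  # -439
SS_xx = sum((x - x_bar) ** 2 for x in P)                      # 84
SS_yy = sum((y - y_bar) ** 2 for y in Q)                      # 2308

b1 = SS_xy / SS_xx
SST = SS_yy          # total variation in Y
SSR = b1 * SS_xy     # explained variation, about 2294.30
SSE = SST - SSR      # unexplained variation, about 13.70

R2 = SSR / SST
print(round(R2, 4))  # 0.9941
```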


Standard Errors and Tests of Significance

The coefficient of Determination tells us about the overall explanatory power of the whole model.

But we also want to know the statistical significance of the individual parameters. For example,

we would like to know whether the slope coefficient is significantly different from zero (or some

other specified value) or whether the intercept is zero (that is, the regression line passes through

the origin). Since the population variances are unknown, we know from the previous chapter that

we will have to perform a t-test. To perform a t-test, we need to know the standard errors (or the

estimates of standard deviations) and the degree of freedom.

The estimated constant variance of the error term, σ², is called the Mean Square Error

(MSE) and is given by SSE/(n-2) for bivariate Regression. The denominator is the degree of

freedom. We subtract two from the number of observations because two parameters (the

intercept and the slope) have to be estimated using the data to find this estimate. The square

root of MSE is called the standard error of estimate or the standard error of the

Regression or simply the standard error denoted by se.

For the above example, we already found SSE = 13.698. The MSE = 13.698/(10-2) = 1.7123 and the standard error of estimate se = √1.7123 = 1.3085. The standard errors of the regression parameters are based on se. The standard error for b1, denoted sb1, is se/√SSxx = 1.3085/√84 = 0.1428. In the t-test, we generally perform a two-tailed test with the following hypotheses: H0: β1 = 0 and H1: β1 ≠ 0. The calculated t-value is simply b1/sb1, and the degree of freedom for bivariate Regression is n-2. In the above example tb1 = -5.2262/0.1428 = -36.60, with absolute value 36.60. The two-tailed critical value of t for eight df, even for a 1% significance level, is 3.355. Thus, the Null Hypothesis is strongly rejected. We can conclude that the slope coefficient is (highly) significantly different from zero.

Sometimes we perform a one-tailed test when we have a guide from the underlying theory about the relationship's direction. In this example, microeconomic theory tells us that a demand curve is negatively sloped. So, we may want to test: H0: β1 ≥ 0 and H1: β1 < 0. But we know that the two-tailed critical value is larger than the one-tailed. Since the numerical value of our calculated t far exceeds even the two-tailed test value, we do not need the one-tailed test. The formula for the standard error of b0 is sb0 = sb1·√(ΣX²/n). Generally, our focus is on the slope parameters. The Excel Regression results using Data Analysis are given below. The results show that both coefficients have very low p-values, indicating very high statistical significance. The F-test in the ANOVA table (discussed in detail in the next chapter) also indicates the high statistical significance of the Coefficient of Determination (R²). The residual output table shows the estimated errors in each of the ten observations (the observed minus the predicted values of the dependent variable). The regression model minimizes the sum of squares of these errors, which is the SSE mentioned above. Note that the SSE divided by the degree of freedom (MSE) is also the error term's estimated variance, while the square root of MSE is the standard error.
The Excel Results are given below. (Go to Data, Data Analysis, Regression; check Labels; click on the variable names and drag to the end of the data. Use the Q column for the Y input and the P column for the X input.)
Excel Output of Bivariate Linear Regression
Estimated Residuals using Excel
Multiple Regression Model- Lecture note by Dr. Guru-Gharana
Introduction
The bivariate Regression has one dependent and one independent (or explanatory) variable. Clearly,
bivariate Regression involves omission of many independent variables which could be useful in explaining
the changes in the dependent variable because every variable is related to and affected by a host of other
variables. The impacts of these omissions are accumulated in the error or residual term. We would like this error term to be as small as possible. One way to reduce this error term is to include some
additional explanatory variables or Predictors. If we carefully choose additional explanatory variables our
model gets closer to the real-world relationship between the dependent variable Y and its predictors. We
will still have some error left because we cannot practically include all relevant predictors in the Business
and Economics field. We just want to minimize its size as much as we can. Note that including more
explanatory variables does not always improve the model. If the included variables happen to be
irrelevant variables the model may deteriorate in its performance as additional errors (noises) and burden
of estimation may worsen the model. So, being parsimonious is prudent.
The Multiple Regression Equation and Classical (Least Squares) Linear Regression
Assumptions
The general form of the Multiple Regression Equation is:

Yi = β0 + β1X1i + β2X2i + … + βkXki + εi for i = 1, 2, …, n

with the following classical Assumptions. (Note that the Assumptions can be correctly and precisely expressed using Matrix Notation, but this course does not involve Matrix Algebra. Therefore, I will translate the assumptions into simple language without involving Matrix Notation. They may not be rigorous, but they will be practically correct.) The following are called Classical Linear Regression Model Assumptions in the context of Multiple Regression:
1. Linearity: The dependent variable Y is related to a set of k Explanatory variables X1, …, Xk, which
together help explain changes in Y. The relation is assumed linear in the variables. The first k+1 terms in the
above equation constitute the explained part in contrast to the unexplained part or the error term.
2. Direction of causation or Predictability: we assume the direction runs from the Xs towards Y, not the other way round.
3. Predicted Y and error: The predicted value of the dependent variable is denoted by Ŷi and is expressed as Ŷi = b0 + b1X1i + b2X2i + … + bkXki, also called the fitted Regression equation. Here bj is the sample estimator of the corresponding population parameter βj. The plot of this explained or predicted or fitted
Regression Equation will be a multidimensional surface (hyperplane) instead of a Regression Line as in the
bivariate regression. It cannot be demonstrated graphically for more than three dimensions. Even for three
dimensions we need solid graphical structures.
The estimated error is the difference between the actual (observed) value of Y and its corresponding
predicted value: ei = Yi - Ŷi. Note that it is not the difference between actual values and the mean of Y (the
deviation value of Y from its mean).
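To make the definition ei = Yi - Ŷi concrete, here is a pure-Python sketch of a k = 2 multiple regression on a small made-up dataset (the data and names are illustrative only, not the assignment data). It solves the normal equations (X'X)b = X'y and checks that the residuals behave as OLS requires:

```python
# Illustrative data: 5 observations, a constant column, and two predictors.
X = [
    [1, 1, 2],
    [1, 2, 1],
    [1, 3, 4],
    [1, 4, 3],
    [1, 5, 6],
]
y = [5.1, 4.8, 9.2, 9.0, 12.9]

def matmul(A, B):
    """Matrix product of A (m x p) and B (p x q)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    m = len(A)
    M = [row[:] + [rhs[i]] for i, row in enumerate(A)]
    for i in range(m):
        p = max(range(i, m), key=lambda r: abs(M[r][i]))
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, m):
            f = M[r][i] / M[i][i]
            for c in range(i, m + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * m
    for i in range(m - 1, -1, -1):
        x[i] = (M[i][m] - sum(M[i][c] * x[c] for c in range(i + 1, m))) / M[i][i]
    return x

Xt = [list(col) for col in zip(*X)]                          # X transposed
XtX = matmul(Xt, X)                                          # X'X (3 x 3)
Xty = [sum(r * yi for r, yi in zip(row, y)) for row in Xt]   # X'y
b = solve(XtX, Xty)                                          # [b0, b1, b2]

y_hat = [sum(bj * xj for bj, xj in zip(b, row)) for row in X]
resid = [yi - fi for yi, fi in zip(y, y_hat)]                # e_i = Y_i - Yhat_i

# With a constant term included, OLS residuals sum to (numerically) zero.
print(abs(sum(resid)) < 1e-9)  # True
```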
4. Intercept: The first parameter β0 is called the constant term or the Intercept and is the value of the
dependent variable when all explanatory variables are zero. This may or may not make sense, since some of
the explanatory variables may not have zero as a meaningful value.
5. Slopes: There are k slope parameters. The parameter βj is the slope with respect to the jth explanatory variable. If the jth explanatory variable increases by one unit, the dependent variable is expected to increase (if the slope is positive) or decrease (if the slope is negative) by βj units, keeping
all other explanatory variables fixed. This is like the partial equilibrium analysis of Economics wherein
the phrase Ceteris Paribus is used. Mathematically, it is the partial derivative of Y with respect to the jth X
variable. Some or all the slope parameters could be negative, positive or zero, depending on the underlying
relationship. In other words, Y may have negative relation with some, positive relation with some and no
relation with some of the variables.
6. No Perfect Multicollinearity: The explanatory variables are assumed to have no perfect linear
relation among themselves. If there is perfect Multicollinearity, then the Regression Estimation breaks down
(because the relevant matrix cannot be inverted). This is a new assumption in Multiple Regression not
needed in bivariate regression wherein we had only one explanatory variable. If the X variables have
strong linear relation (but not perfect), then we have high Multicollinearity. In that case we can estimate the
Regression equation, but the error in estimation of the individual parameters will be very large (inflated) and t-tests of significance will be invalid. Therefore, we hope that our model is not infected with this problem,
and that the X variables do not have strong linear relation among themselves. There are many tests for the
severity of this problem, but beyond the scope of this course. A simple test is to examine the correlation
matrix. If two or more explanatory variables seem to have high negative or positive correlation, we can
suspect high Multicollinearity. But lack of bilaterally high correlation is not a guarantee of lack of high
Multicollinearity when all X variables are considered together. Another simple test is to examine the
Variance Inflation Factor (VIF) provided with some computer results (MegaStat too has this option). If any
VIF is 10 or more, we can suspect high Multicollinearity.
7. The assumptions about the error term are very similar to the bivariate case:
i. Xs and errors uncorrelated: The explanatory variables Xj are assumed to be uncorrelated with the error terms εi. That is, the deterministic (or systematic) and stochastic parts of the regression equation are uncorrelated.
ii. Expected value of error is Zero: T