ECO 309 Texas A&M University Economy Questions


CSCI351 Assignment 4
Due by: 11:59PM on April 22, 2022 (Friday)
Instruction:
• Show your work (at least 50% penalty otherwise)
• Submit a single PDF or word document containing all your answers to the
corresponding assignment folder under D2L (at least 10% penalty otherwise)
• Make sure you submitted the intended document. It is recommended that you
download what has been uploaded and double-check that the correct document has
been submitted.
• You can submit as many times as you want, but only the last submission will be
graded. If the last submission is made after the deadline, there will be a late
submission penalty.
• No plagiarism: Do not copy and paste anything from textbooks and other resources to
answer questions (zero points will be given otherwise).
• No resubmission/extension request will be accepted.
Problem 1. Firewall Configuration (30 pt.)
Suppose a home network has a network address of 129.100.5.* (i.e.,
129.100.5.0/24). The IP addresses of the Web server and email server are 129.100.5.1
and 129.100.5.2, respectively. Assume that the Web server’s port number is TCP/80 and
the email server’s port number is TCP/25. Configure the firewall table to implement the
following ruleset.
Ruleset:
1. Allow external users to access the internal web server.
2. Internal users can access external web servers (i.e., TCP/80).
3. Allow external users to access the internal email server (to send email in).
4. Internal users can access a specific external service with TCP port number 8080
(i.e., TCP/8080).
5. Allow external UDP traffic with destination port number 6700 to the internal
network (i.e., UDP/6700).
6. Everything not previously allowed is explicitly denied.
Rule | Type | Source Address | Dest. Address | Dest. Port | Action
-----+------+----------------+---------------+------------+-------
  1  |      |                |               |            |
  2  |      |                |               |            |
  3  |      |                |               |            |
  4  |      |                |               |            |
  5  |      |                |               |            |
  6  |      |                |               |            |
Problem 2. Intrusion Detection (20 pt.)
Suppose there are 9,900 benign (negative) and 100 malicious (positive) events in a certain time
interval (i.e., 10,000 events in total). From the evaluation of a newly developed IDS
system, the developer observed that 10 (out of 9,900) benign events were misclassified
as intrusions, while 3 (out of 100) malicious events were wrongly classified as benign.
a. (10 pt.) Construct the confusion matrix (as shown in the slide of “Possible Alarm
Outcomes”).
                 | Intrusion Attack | No Intrusion Attack
Alarm Sounded    |                  |
No Alarm Sounded |                  |
b. (5 pt.) Calculate accuracy using the formula given.
Accuracy = (TP+TN) / (TP+TN+FP+FN)
c. (5 pt.) Calculate false positive rate using the formula given.
False positive rate = FP / (TN+FP)
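As a sketch (not part of the required submission), the given formulas can be checked in Python. The TP/TN/FP/FN labeling below follows the standard convention that "positive" means malicious/intrusion, which is an assumption about how the course slide defines the matrix:

```python
# Sketch: plugging the event counts above into the given formulas.
# Label convention (an assumption): positive = malicious/intrusion.
TP = 100 - 3    # malicious events correctly flagged
FN = 3          # malicious events classified as benign
FP = 10         # benign events misclassified as intrusions
TN = 9900 - 10  # benign events correctly passed

accuracy = (TP + TN) / (TP + TN + FP + FN)
false_positive_rate = FP / (TN + FP)
print(accuracy, false_positive_rate)
```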
Eco 309-Assignment 4
Total points 125. Submit through D2L in Word/Excel format, single or multiple files. Must show
all Excel work, answer all parts of the question, and write necessary explanations to earn full
points. You have to use Excel to solve this numerical problem (project).
The following table gives the data for per capita income in thousands of US dollars with the percentage
of the labor force in Agriculture, the average years of schooling of the population over 25 years of age,
and overall health index for 15 developed countries in 2015 (data modified for educational purpose).
Develop a multiple regression model for per capita income (dependent variable) using Excel and answer
the questions below the table. You can use symbols Y, X1, X2 and X3 for the variables in your calculation.
Show your computer output.
Country number | Per capita income (thousands US$) | % of labor in Agriculture | Average yrs of schooling | Overall Health Index
 1 | 32 | 12 |  7 | 70
 2 | 42 | 10 | 12 | 80
 3 | 41 |  8 | 11 | 77
 4 | 38 |  9 |  9 | 75
 5 | 40 | 10 | 10 | 76
 6 | 52 |  5 | 16 | 95
 7 | 44 |  7 | 10 | 82
 8 | 41 |  7 | 11 | 80
 9 | 46 |  6 | 13 | 85
10 | 50 |  8 | 15 | 92
11 | 47 |  7 | 11 | 90
12 | 53 |  4 | 16 | 98
13 | 48 |  6 | 14 | 90
14 | 47 |  7 | 12 | 85
15 | 50 |  6 | 15 | 90
Find the Y-intercept and slopes for the three independent variables and interpret them. Predict the per
capita income when the percentage of the labor force in Agriculture is only 4, average years of
schooling is 15, and Health Index is 100. Find the overall explanatory power (Coefficient of Determination) of the model
and interpret it. Also find the adjusted coefficient of Determination and interpret it. Find the standard
error of estimate. From the ANOVA table find SSR, SSE and SST and the F-value. Perform the F-test and
comment on the overall usefulness of the model. What are the degrees of freedom for Regression, Error
and Total? Perform t-test for the statistical significance of individual coefficients. Plot the errors or
residuals by countries and comment on the visible pattern. Plot the errors separately by each
explanatory variable and comment on the visible patterns with respect to heteroscedasticity.
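The assignment requires Excel, but as a rough cross-check the same least-squares fit can be sketched in Python with numpy. The variable names Y, X1, X2, X3 follow the symbols suggested above; treat this as an illustration, not the required Excel output:

```python
import numpy as np

# Data from the table above: Y = per capita income, X1 = % labor in
# Agriculture, X2 = average years of schooling, X3 = overall health index.
Y  = np.array([32, 42, 41, 38, 40, 52, 44, 41, 46, 50, 47, 53, 48, 47, 50], float)
X1 = np.array([12, 10,  8,  9, 10,  5,  7,  7,  6,  8,  7,  4,  6,  7,  6], float)
X2 = np.array([ 7, 12, 11,  9, 10, 16, 10, 11, 13, 15, 11, 16, 14, 12, 15], float)
X3 = np.array([70, 80, 77, 75, 76, 95, 82, 80, 85, 92, 90, 98, 90, 85, 90], float)

# Design matrix with an intercept column, solved by ordinary least squares.
X = np.column_stack([np.ones_like(Y), X1, X2, X3])
b, *_ = np.linalg.lstsq(X, Y, rcond=None)

Y_hat = X @ b
SSE = np.sum((Y - Y_hat) ** 2)
SST = np.sum((Y - Y.mean()) ** 2)
R2 = 1 - SSE / SST
print(b, R2)  # b holds the intercept and the three slopes
```

The first element of `b` is the Y-intercept and the remaining three are the slopes for X1, X2, and X3, in that order.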
Bivariate Linear Regression Analysis
Introduction
Most often, we are interested in the relationships among several variables. For example, we are
interested in the relationships that quantity demanded has with its own price, the prices of substitutes,
the prices of complements, and income. Similarly, we may be interested in the impact on imports of
National Income, Exchange rate, Inflation, etc. In real life, any socio-economic variable has some
relationships with a host of other variables. At the outset, we generally start with the study of
bivariate relationships (two variables at a time) and extend to multivariate relationships. Hence in
the present chapter, our focus is studying relationships between two variables at a time. The
scatter plot gives a visual impression about the nature of the relationship between two variables:
whether it is positive (upward moving as the variable on the X-axis increases) or negative
(downward), and whether it is linear or non-linear.
Numerical Example of a Demand Curve Estimation
Suppose we have ten paired observations on the quantity sold and price for some product as
follows:
Table 12.1: Quantity Sold at Various Prices
P:  2   2   4   4   5   7   7   9  10  10
Q: 63  61  50  49  44  35  34  26  20  18
Scatter Plot of Price and Quantity
The above plot exhibits a negative (or downward) slope implying that the Y-axis variable
(quantity demanded) falls as the variable on the X-axis (price-axis) increases. In Microeconomics,
we see this as a general rule to show a downward sloping demand curve. Moreover, we see that
the relationship is nearly (not perfectly) linear. We see that smaller quantities demanded are
associated with higher prices. But no quantitative information about the strength of the
relationship is provided by the scatter plot. Besides, the scatter plot cannot help make predictions
on Y based on knowledge about X (or vice versa). These issues with the scatter plot create the
need for some quantitative measures, as discussed below.
12.4 Covariation and Covariance
Let us introduce three abbreviations:
SSxx = Σ(X − X̄)² = ΣX² − (ΣX)²/n = ΣX² − n(X̄)² is the symbol for "Variation" in X. If we
divide Variation by its degree of freedom (here n−1), then we get "Variance".
SSyy = Σ(Y − Ȳ)² = ΣY² − (ΣY)²/n = ΣY² − n(Ȳ)² is the symbol for Variation in Y.
SSxy = Σ(X − X̄)(Y − Ȳ) = ΣXY − (ΣX)(ΣY)/n = ΣXY − n·X̄·Ȳ is the symbol for Covariation of
X and Y.
A numerical estimate of the strength of the linear relationship between two variables X and Y is
the Covariation (SSxy), which is simply the sum of cross products of deviations of X and Y from
their respective means. If we divide SSxy by its degree of freedom (n-1), we get the Covariance
(sxy). It is like the average covariation. In symbols, Covariance (X,Y) = sxy = SSxy /(n-1)
To have an unbiased estimator of the true but unknown population covariance, we divide by n-1.
Let us denote Price as X and Quantity as Y. For the above example we have:
Calculation Table for Covariance

P = Xi          | Q = Yi          | Xi − X̄ | Yi − Ȳ | (Xi − X̄)(Yi − Ȳ)
 2              | 63              |  −4    |  23    |  −92
 2              | 61              |  −4    |  21    |  −84
 4              | 50              |  −2    |  10    |  −20
 4              | 49              |  −2    |   9    |  −18
 5              | 44              |  −1    |   4    |   −4
 7              | 35              |   1    |  −5    |   −5
 7              | 34              |   1    |  −6    |   −6
 9              | 26              |   3    | −14    |  −42
10              | 20              |   4    | −20    |  −80
10              | 18              |   4    | −22    |  −88
Total 60, X̄ = 6 | Total 400, Ȳ = 40 | 0   |   0    | SSxy = −439
Thus, the covariation is -439 and the covariance is sxy =-439/9 = -48.7778. These negative values
imply a negative relation between the two variables under study. The larger the numerical values,
the stronger is the positive or negative relationship. But there are problems in determining
whether a value is large or small, as discussed below.
The first issue is that these two measures are unit dependent. For example, a change of the unit
from kg to mg for one of the variables leads to a thousand times change in the numerical value of
Covariation and Covariance. It would be complicated to compare the estimated relations between
two sets of similar variables expressed in different units (say across countries).
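The covariation and covariance computed in the table above can be reproduced with a short sketch:

```python
# Sketch: covariation (SSxy) and covariance for the price/quantity data above.
P = [2, 2, 4, 4, 5, 7, 7, 9, 10, 10]
Q = [63, 61, 50, 49, 44, 35, 34, 26, 20, 18]
n = len(P)
p_bar = sum(P) / n   # 6
q_bar = sum(Q) / n   # 40

SSxy = sum((x - p_bar) * (y - q_bar) for x, y in zip(P, Q))
cov_xy = SSxy / (n - 1)   # divide by n-1 for an unbiased estimator
print(SSxy, cov_xy)       # -439.0 and about -48.7778
```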
The Correlation Coefficient
Statisticians have defined a measure called the Correlation Coefficient, denoted by the Greek letter
ρ (rho) for the population parameter and r for the corresponding sample statistic. The sample
correlation coefficient formula is r = sxy/(sx·sy) = SSxy/√(SSxx·SSyy). Thus, the correlation
coefficient is simply the ratio of the Covariance to the product of the sample standard deviations
of the two variables. Since both numerator and denominator involve X and Y units, the resultant
value is a unit free number. Moreover, this number can range between two fixed boundaries: -1
and +1.
If r is precisely -1, then we have a case of perfect negative correlation. The scatter plot will be a
downward sloping “perfect” straight line. If r happens to be precisely zero, we have zero
correlation (that is, no linear relation). However, the zero correlation could be a result of perfect
non-linear relation like the circular relation. The only definitive conclusion we can draw from
zero correlation is that there is no linear relation between the two variables. We can supplement
our understanding here by looking at the scatter plot. If it looks like a rectangle or scatters without
any apparent relation (positive or negative), we can conclude that the two variables are not
related. If there is a non-linear relation like circular relation, that would be visible in the scatter
plot. If r is precisely +1, we have a case of perfect positive correlation. The scatter plot will be an
upward sloping “perfect” straight line.
Generally, the correlation coefficient is other than these three extreme values. If it is
between -1 and 0, it is a case of a negative correlation. The closer is r to -1, the stronger is
the negative correlation, and the closer it is to 0, the weaker is the correlation. Similarly, if r
is between 0 and 1, then we have a positive correlation. The closer it is to 1, the stronger is
the positive correlation, and the closer it is to 0, we have a weaker relation. Let us calculate
the correlation coefficient and other sample statistics for the data given above.
Calculations for Correlation Coefficient

P = Xi          | Q = Yi            | Xi − X̄ | Yi − Ȳ | (Xi − X̄)(Yi − Ȳ) | (Xi − X̄)² | (Yi − Ȳ)²
 2              | 63                |  −4    |  23    |  −92             |  16       |  529
 2              | 61                |  −4    |  21    |  −84             |  16       |  441
 4              | 50                |  −2    |  10    |  −20             |   4       |  100
 4              | 49                |  −2    |   9    |  −18             |   4       |   81
 5              | 44                |  −1    |   4    |   −4             |   1       |   16
 7              | 35                |   1    |  −5    |   −5             |   1       |   25
 7              | 34                |   1    |  −6    |   −6             |   1       |   36
 9              | 26                |   3    | −14    |  −42             |   9       |  196
10              | 20                |   4    | −20    |  −80             |  16       |  400
10              | 18                |   4    | −22    |  −88             |  16       |  484
Total 60, X̄ = 6 | Total 400, Ȳ = 40 |   0    |   0    | SSxy = −439      | SSxx = 84 | SSyy = 2308
The sample standard deviations are sx = √(84/9) = √9.333 = 3.055 and sy = √(2308/9) = √256.44 =
16.01. The coefficient of correlation is rxy = SSxy/√(SSxx·SSyy) = −439/√(84×2308) = −0.99703,
which is close to −1, showing a strong negative correlation. This was also the
impression we had looking at the scatter plot above. But now we have a concrete numerical
measure. There are tests for testing whether the sample correlation coefficient is significantly
different from zero or some other specified value. The descriptive Statistics of the two variables
using Data analysis are shown below.
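The same sums of squares give the correlation coefficient directly; this short sketch mirrors the hand calculation:

```python
import math

# Sketch: sample correlation coefficient r = SSxy / sqrt(SSxx * SSyy),
# using the sums of squares from the table above (n = 10 observations).
SSxy, SSxx, SSyy = -439.0, 84.0, 2308.0
n = 10

s_x = math.sqrt(SSxx / (n - 1))    # sample std. dev. of P, about 3.055
s_y = math.sqrt(SSyy / (n - 1))    # sample std. dev. of Q, about 16.01
r = SSxy / math.sqrt(SSxx * SSyy)  # about -0.99703
print(s_x, s_y, r)
```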
Descriptive Statistics
The correlation coefficient gives a good idea about the (linear) relation between variables much
better than a simple scatter plot or Covariance. But it has limitations too. The first issue is that a
correlation does not by itself imply any cause-and-effect type relation. Two variables may be
highly correlated but have no causal relation at all. For example, sunglasses sales and ice-cream sales are highly correlated because they are both driven by a common variable: the weather
condition. But you cannot “cause” an increase in the sales of ice-cream by increasing the sales of
sunglasses or vice versa. Hence, such relations are called spurious correlations. Another issue
with the coefficient of correlation is that it cannot help us predict one variable from the
knowledge of other variables. Prediction and forecasting are important statistical exercises in
business. The Regression Analysis enables us to do that.
The Bivariate Simple Linear Regression Model
In this basic model, we assume that changes in a variable Y (called the dependent variable) can
be forecasted based on the known values of another variable X, called the explanatory (or
independent) variable. We are not claiming the existence of a cause-and-effect relationship. We
are only asserting that we can predict the values of Y if we know the X values. It may be a case
like the sunglasses and ice-cream example cited above, but for Regression, it is sufficient that we
can estimate the sales of ice-cream if we know sales of sunglasses.
The theoretical bivariate linear regression model is Yi = β0 + β1Xi + εi for i = 1, 2, …, n. The
subscript i corresponds to the observation number. This equation is called the Regression
Equation. Its geometrical counterpart (plot) is called the Regression Line.
Greek letters denote the parameters of this equation. Each observation pair (Xi, Yi) is a pair of
random variables with some joint probability distribution. We make the following basic
assumptions underlying the model.
1. The variable on the L.H.S. is called the Dependent variable with the generic notation Y. It is
supposed to be the variable about which we make a prediction (forecast) based on the knowledge
about the explanatory variable X on the R.H.S. of the equation.
2. We are assuming a linear relationship. The variables are in linear form (power one), and there
are no squares, or square roots, log, etc. If there is a non-linear relation, we transform the variables
so that we end up with a linear relationship between the transformed variables.
3. The first parameter, β0, is called the constant term or Intercept and is the expected Y value
when X is zero. In many situations, zero may not be a possible or valid value for X. In such cases,
β0 is not very meaningful but may still indicate Y's value when X is near zero.
4. The second parameter, β1, is called the slope coefficient and is generally the focus of
Regression Analysis. If it is positive, the Regression line is upward sloping. If it is negative, the
Regression line is downward sloping. If it is zero, then the Regression line is horizontal: a rare
case when the expected value of Y is constant, that is, X has no role to play. The value of
β1 shows the rate of change in Y for a small increment in X. In calculus, it is the derivative of Y
with respect to X.
5. The first two terms of the bivariate regression, β0 + β1Xi, constitute the deterministic or
systematic part, in contrast to the unsystematic last term.
6. The last term (epsilon) has many names: residual or error or random disturbance or stochastic
term. We will generally call it the error term. It is an acknowledgment that the model is not
perfect. Every economic variable has hundreds of other variables influencing its value. We cannot
include all possible variables in any Regression (in bivariate, we select only one such explanatory
variable) either because we don’t have sufficient information about them or want to keep the
model manageable. Therefore, this error term includes the influences of all such knowingly or
unknowingly omitted variables. This term also includes the influence of measurement errors we
almost always commit even in the included variables. We make several assumptions about its
behavior so that we can make some meaningful predictions about the variations in Y. Some of the
important assumptions about the error term are listed below (also called the Classical Linear
Model Assumptions):
i. The explanatory variables Xi are assumed to be uncorrelated with the error terms εi. That is,
the deterministic (or systematic) and stochastic parts of the regression equation are uncorrelated.
ii. The mean of the error term is zero: E(εi) = 0. The underlying
belief is that some of the omitted variables will have a positive influence and some negative, but
they will cancel out on balance. The average error is expected to be zero. This assumption assures
unbiasedness in our estimates of the regression coefficients. Consequently, E(Yi) = β0 + β1E(Xi)
because the third term drops out. Geometrically speaking, the regression line passes through the
average values of Y and X.
iii. It is assumed that the error term variance remains fixed from observation to observation: the
Homoscedasticity property. Symbolically, E(εi²) = σ², a fixed value for all i. A violation of this
property (or assumption) is called "Heteroscedasticity," which creates doubt in the statistical
tests of the coefficients' significance. We need to make corrections or adjustments for this issue to
have confidence in our tests.
iv. There is no linear relation among error terms in different observations. The error terms are
uncorrelated. Symbolically, E(εiεj) = 0 for all i ≠ j. This assumption is called "No
Autocorrelation" or "No Serial Correlation." If this assumption is violated (we have
Autocorrelation or Serial Correlation), we again need to make corrections and adjustments to have
confidence in our testing.
v. For traditional hypothesis testing based on the t-test and F-test, we assume normality in the
distribution of error terms.
These assumptions can be compactly expressed as εi ~ iid N(0, σ²) for all i; in other words, the
error terms are Identically, Independently, and Normally distributed with mean 0 and (fixed)
variance σ².
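A quick way to internalize εi ~ iid N(0, σ²) is to simulate data that satisfies the classical assumptions. The parameter values below are arbitrary, chosen only for illustration:

```python
import numpy as np

# Sketch: simulate Y = beta0 + beta1*X + eps with iid Normal errors.
rng = np.random.default_rng(0)
beta0, beta1, sigma = 71.0, -5.0, 1.5   # arbitrary illustrative values
n = 1000

X = rng.uniform(1, 10, size=n)
eps = rng.normal(0.0, sigma, size=n)    # iid N(0, sigma^2)
Y = beta0 + beta1 * X + eps

# With many observations the sample error mean should be near E(eps) = 0
# and the sample error variance near sigma^2 (homoscedasticity).
print(eps.mean(), eps.var(ddof=1))
```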
The Ordinary Least Squares (O.L.S.) Estimation
Under the above assumptions, we derive a set of formulas, called the Ordinary Least Squares
formulas, which minimize the Sum of Squared Errors (SSE). Let us first introduce the following
symbols (popular in the literature).
i. English letters indicate the estimators of the unknown parameters in Greek letters. Thus, the
estimator of εi will be denoted by ei. The estimators of β0 and β1 will be denoted by b0 and b1,
respectively. Since the variables are themselves expressed in English letters, their estimators will
require some other style. The sample estimate (or estimator) of Y will be denoted by Ŷ, called
Y-hat. Remember that X is treated as a given or predetermined explanatory variable and, therefore,
does not have an estimator.
ii. The SSE = Σ(Yi − Ŷi)² = Σ ei² is the sum of squared deviations of the actual values of the
Dependent variable from the Estimated values. Here Yi − Ŷi = ei is the estimate of the error in
using the regression model for the ith observation.
The O.L.S. method (also called the Least Squares Method) estimates the parameters in such a way
that S.S.E. is minimized (that is, the total error is minimized). It can be easily shown, using
Calculus rules for minimization, that the estimators that minimize the S.S.E. are:
b1 = SSxy/SSxx = Σ(X − X̄)(Y − Ȳ) / [Σ(X − X̄)²]: the OLS estimator of the slope
b0 = Ȳ − b1·X̄: the OLS estimator of the Intercept
Using the results in Table 12.3 above, we have X̄ = 6, Ȳ = 40, SSxy = −439, SSxx = 84, and SSyy =
2308. So, b1 = SSxy/SSxx = −439/84 = −5.2262 and b0 = Ȳ − b1·X̄ = 40 − (−5.2262)(6) = 40 +
31.3572 = 71.3572. The estimated Regression Equation is: Ŷ = 71.3572 − 5.2262X, or
Q̂ = 71.3572 − 5.2262P.
To predict the quantity demanded when the price is 12, we simply plug the explanatory
variable's given value into the estimated equation: Ŷ at P = 12 is 71.3572 −
5.2262 × 12 = 8.6428. At a price equal to $12, we expect that the quantity demanded will be only
8.6428 units of the product.
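The OLS arithmetic above can be checked with a few lines; this is a sketch mirroring the hand calculation rather than Excel's output:

```python
# Sketch: OLS slope and intercept for the price/quantity demand data.
P = [2, 2, 4, 4, 5, 7, 7, 9, 10, 10]
Q = [63, 61, 50, 49, 44, 35, 34, 26, 20, 18]
n = len(P)
p_bar, q_bar = sum(P) / n, sum(Q) / n   # 6 and 40

SSxx = sum((x - p_bar) ** 2 for x in P)
SSxy = sum((x - p_bar) * (y - q_bar) for x, y in zip(P, Q))

b1 = SSxy / SSxx            # slope, about -5.2262
b0 = q_bar - b1 * p_bar     # intercept, about 71.3572
q_hat_at_12 = b0 + b1 * 12  # predicted quantity at price 12, about 8.64
print(b1, b0, q_hat_at_12)
```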
Interpretations of the Estimated Regression Equation
Intercept: The intercept term shows that the quantity demanded would be 71.3572 units if the
price dropped to zero (or the good were made free). Since price is rarely zero, this has little
practical importance. But it can still give an idea of how large the demand could be if the price
were quite low.
Slope: The slope coefficient’s negative sign shows a negative relationship between the quantity
demanded and the price. Its value shows that the quantity demanded is expected to drop by about
5.2262 units if the price increased by one unit (or one dollar per unit of the good). The individual
observation variations are supposed to occur because of the noise created by the error term.
Overall Explanatory Power of the Regression Model
One popular measure of the Explanatory power or the “Goodness of Fit” of the Regression Model
is the Coefficient of Determination, denoted by R2. Some authors use the symbol r2, but I
recommend adopting capital R2 for the Coefficient of Determination. The square of the
correlation coefficient and the Coefficient of Determination are equal in Bivariate Regression
but not in multiple regression.
R2 = (Variation or Change in Y explained by the Regression)/(Total Variation in Y)
The total variation or change in Y is measured by SSyy and is also denoted as SST (total Sum of
Squared deviations). Its formula is already given above. Thus, Total Variation in Y = SST =
SSyy = Σ(Yi − Ȳ)². The explained variation in Y is denoted by SSR (Sum of
Squares due to Regression) and is equal to b1·SSxy.
We can break the total variation in Y as: SST = SSR + SSE. So, SSE can easily be
calculated by subtracting SSR from SST. In the above example, SST = SSyy = 2308, SSR =
(−5.2262)(−439) = 2294.302, and therefore SSE = 2308 − 2294.302 = 13.698. Thus, out of a total
variation of 2308 in Y, 2294.302 is explained by the Regression model and 13.698 is unexplained
variation or total error. Hence the coefficient of determination is: R2 = SSR/SST = 2294.302/2308
= 0.9941. This value of R2 indicates that the Regression model can explain or capture 99.41% of
the variation in the Dependent variable (quantity demanded). We will learn about testing the
statistical significance of R2 later. One problem with the simple Coefficient of Determination is
that it always increases when more explanatory variables are added, whether the added variables
are relevant or not. Therefore, Statisticians have devised an Adjusted Coefficient of
Determination, which penalizes for adding irrelevant variables by considering the loss in degree
of freedom when more parameters have to be estimated using the same data set. We will talk
about it in Multiple Regression in the next chapter.
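Continuing the sketch, the decomposition SST = SSR + SSE and the resulting R² for this example:

```python
# Sketch: goodness-of-fit decomposition for the demand-curve regression,
# using the sums of squares derived earlier.
SSxy, SSxx, SSyy = -439.0, 84.0, 2308.0

b1 = SSxy / SSxx   # OLS slope
SST = SSyy         # total variation in Y
SSR = b1 * SSxy    # explained variation, about 2294.3
SSE = SST - SSR    # unexplained variation, about 13.7

R2 = SSR / SST     # about 0.9941
print(SSR, SSE, R2)
```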
Standard Errors and Tests of Significance
The coefficient of Determination tells us about the overall explanatory power of the whole model.
But we also want to know the statistical significance of the individual parameters. For example,
we would like to know whether the slope coefficient is significantly different from zero (or some
other specified value) or whether the intercept is zero (that is, the regression line passes through
the origin). Since the population variances are unknown, we know from the previous chapter that
we will have to perform a t-test. To perform a t-test, we need to know the standard errors (or the
estimates of standard deviations) and the degree of freedom.
The estimated constant variance of the error term, σ², is called the Mean Square Error
(MSE) and is given by SSE/(n−2) for bivariate Regression. The denominator is the degree of
freedom. We subtract two from the number of observations because two parameters (the
intercept and the slope) have to be estimated from the data to find this estimate. The square
root of MSE is called the "standard error of estimate" or the standard error of the
Regression, or simply the standard error, denoted by se.
For the above example, we already found SSE = 13.698. The MSE = 13.698/(10−2) = 1.7123 and
the standard error of estimate se = √1.7123 = 1.3085. The standard errors of the regression
parameters are based on se. The standard error for b1, denoted sb1, is se/√SSxx = 1.3085/√84 =
0.1428. In the t-test, we generally perform a two-tailed test with the following hypotheses: H0: β1 =
0 and H1: β1 ≠ 0. The calculated t-value is simply b1/sb1, and the degree of freedom for
bivariate Regression is n−2. In the above example, tb1 = −5.2262/0.1428 = −36.60, with absolute
value 36.60. The two-tailed critical value of t for eight df, even for a 1% significance level, is
3.355. Thus, the Null Hypothesis is strongly rejected. We can conclude that the slope coefficient is
(highly) significantly different from zero.
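The standard error and t-statistic above follow from the same quantities; as a sketch:

```python
import math

# Sketch: standard error of estimate and t-test for the slope, using the
# SSE, SSxx, and b1 values derived above (n = 10 observations).
n = 10
SSE, SSxx, b1 = 13.698, 84.0, -5.2262

MSE = SSE / (n - 2)         # mean square error, df = n - 2 = 8
se = math.sqrt(MSE)         # standard error of estimate, about 1.3085
sb1 = se / math.sqrt(SSxx)  # standard error of b1, about 0.1428
t = b1 / sb1                # about -36.6
print(se, sb1, t)
```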
Sometimes we perform a one-tailed test when we have guidance from the underlying theory about
the relationship's direction. In this example, microeconomic theory tells us that a demand
curve is negatively sloped. So, we may want to test: H0: β1 ≥ 0 and H1: β1 < 0. But we know that the two-tailed critical value is larger than the one-tailed. Since the numerical value of our calculated t far exceeds even the two-tailed test value, we do not need the one-tailed test. The formula for the standard error of b0 is sb0 = sb1·√(ΣX²/n). Generally, our focus is on the slope parameters.

The Excel Regression results using Data Analysis are given below. The results show that both coefficients have pretty low p-values, indicating very high statistical significance. The F-test in the ANOVA table (discussed in detail in the next chapter) also indicates the high statistical significance of the Coefficient of Determination (R2). The residual output table shows the estimated errors in each of the ten observations (the observed minus the predicted values of the dependent variable). The regression model minimizes the sum of squares of these errors, which is the SSE mentioned above. Note that the SSE divided by the degree of freedom (MSE) is also the error term's estimated variance, while the square root of MSE is the standard error. The Excel results are given below. (Go to Data, Data Analysis, Regression, check Labels, and click from the variable names and drag until the end of data. Use the column of Q for Y input and the column of P for X input.)

Excel Output of Bivariate Linear Regression

Estimated Residuals using Excel

Multiple Regression Model - Lecture note by Dr. Guru-Gharana

Introduction

The bivariate Regression has one dependent and one independent (or explanatory) variable. Clearly, bivariate Regression involves the omission of many independent variables which could be useful in explaining the changes in the dependent variable, because every variable is related to and affected by a host of other variables. The impacts of these omissions are accumulated in the error or residual term. We would like this error term to be as small as possible.
One way to reduce this error term is to include some additional explanatory variables or Predictors. If we carefully choose additional explanatory variables, our model gets closer to the real-world relationship between the dependent variable Y and its predictors. We will still have some error left because we cannot practically include all relevant predictors in the Business and Economics field. We just want to minimize its size as much as we can. Note that including more explanatory variables does not always improve the model. If the included variables happen to be irrelevant, the model may deteriorate in its performance, as additional errors (noises) and the burden of estimation may worsen the model. So, being parsimonious is prudent.

The Multiple Regression Equation and Classical (Least Squares) Linear Regression Assumptions

The general form of the Multiple Regression Equation is: Yi = β0 + β1X1i + β2X2i + … + βkXki + εi for i = 1, 2, …, n, with the following classical Assumptions. (Note that the Assumptions can be correctly and precisely expressed using Matrix Notation. But this course does not involve Matrix Algebra. Therefore, I will translate the assumptions into simple language without involving Matrix Notation. They may not be rigorous, but they will be practically correct.) The following are called Classical Linear Regression Model Assumptions in the context of Multiple Regression:

1. Linearity: The dependent variable Y is related to a set of k Explanatory variables X1, …, Xk which together help explain changes in Y. The relation is assumed linear in the variables. The first k+1 terms in the above equation constitute the explained part, in contrast to the unexplained part or the error term.

2. Direction of causation or Predictability: we assume the direction from the X's towards Y, not the other way round.

3. Predicted Y and error: The predicted value of the dependent variable is denoted by Ŷi and is expressed as: Ŷi = b0 + b1X1i + b2X2i + … + bkXki, also called the fitted Regression equation. Here bj is the sample estimator of the corresponding population parameter βj. The plot of this explained or predicted or fitted Regression Equation will be a multidimensional surface (hyperplane) instead of a Regression Line as in the bivariate regression. It cannot be demonstrated graphically for more than three dimensions. Even for three dimensions we need solid graphical structures. The estimated error is the difference between the actual (observed) value of Y and its corresponding predicted value: ei = Yi − Ŷi. Note that it is not the difference between actual values and the mean of Y (the deviation of Y from its mean).

4. Intercept: The first parameter β0 is called the "constant term" or the "Intercept" and is the value of the dependent variable when all explanatory variables are zero. This may or may not make sense, since some of the explanatory variables may not have zero as a meaningful value.

5. Slopes: There are k slope parameters. The parameter βj is the slope with respect to explanatory variable j. If the jth explanatory variable increases by one unit, the dependent variable is expected to increase (if the slope is positive) or decrease (if the slope is negative) by βj units, keeping all other explanatory variables fixed. This is like the partial equilibrium analysis of Economics, wherein the phrase Ceteris Paribus is used. Mathematically, it is the partial derivative of Y with respect to the jth X variable. Some or all of the slope parameters could be negative, positive, or zero, depending on the underlying relationship. In other words, Y may have a negative relation with some, a positive relation with some, and no relation with some of the variables.

6. No Perfect Multicollinearity: The explanatory variables are assumed to have no perfect "linear" relation among themselves.
If there is perfect Multicollinearity, then the Regression Estimation breaks down (because the relevant matrix cannot be inverted). This is a new assumption in Multiple Regression, not needed in bivariate regression, wherein we had only one explanatory variable. If the X variables have a strong linear relation (but not perfect), then we have high Multicollinearity. In that case we can estimate the Regression equation, but the error in estimation of individual parameters will be very large (inflated) and t-tests of significance will be invalid. Therefore, we hope that our model is not infected with this problem, and that the X variables do not have a strong linear relation among themselves. There are many tests for the severity of this problem, but they are beyond the scope of this course. A simple test is to examine the correlation matrix. If two or more explanatory variables seem to have high negative or positive correlation, we can suspect high Multicollinearity. But lack of bilaterally high correlation is not a guarantee of the lack of high Multicollinearity when all X variables are considered together. Another simple test is to examine the Variance Inflation Factor (VIF) provided with some computer results (MegaStat too has this option). If any VIF is 10 or more, we can suspect high Multicollinearity.

7. The assumptions about the error term are very similar to the bivariate case:

i. X's and errors uncorrelated: The explanatory variables Xj are assumed to be uncorrelated with the error terms εi. That is, the deterministic (or systematic) and stochastic parts of the regression equation are uncorrelated.

ii. Expected value of error is Zero: T