
Multiple regression in MS Excel

REPORT

Assignment: Consider a regression analysis procedure based on data (sales price and living space) for 23 real estate objects.

The "Regression" operating mode is used to calculate the parameters of the linear regression equation and check its adequacy to the process under study.

To run a regression analysis in MS Excel, choose the Data Analysis command from the Tools menu and select the Regression analysis tool.

In the dialog box that appears, set the following parameters:

1. Input Y Range is the range containing the values of the dependent (resulting) variable. It must consist of a single column.

2. Input X Range is the range of cells containing the values of the factors (independent variables). There must be no more than 16 input columns.

3. The Labels checkbox is selected if the first row of the range contains column titles.

4. The Confidence Level checkbox is activated if you need to enter a confidence level other than the default in the field next to it. It is used to test the significance of the coefficient of determination R² and of the regression coefficients.

5. Constant is Zero. This checkbox must be selected if the regression line should pass through the origin (a0 = 0).

6. Output Range / New Worksheet / New Workbook - specify the address of the upper-left cell of the output range.

7. The checkboxes in the Residuals group are selected if the corresponding columns or plots need to be included in the output range.

8. The Normal Probability Plots checkbox must be selected if you want a scatter plot of the observed Y values against the automatically generated percentile intervals to be placed on the sheet.

After clicking OK, the report appears in the specified output range.

Using the data analysis tools, we will perform a regression analysis of the original data.

The Regression analysis tool fits the parameters of a regression equation by the least squares method. Regression is used to analyze the effect of one or more explanatory variables on a single dependent variable.
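For readers who prefer a scriptable route, the same kind of report can be produced outside Excel. Below is a minimal sketch in R; the file name realty.csv and the column names price, area and appraisal are hypothetical stand-ins for the 23 real estate records, not names taken from the assignment.

# Hypothetical file with the 23 observations: price, area, appraisal
estate <- read.csv("realty.csv")

# Fit the linear regression of price on the two factors
model <- lm(price ~ area + appraisal, data = estate)

# summary() reproduces the Excel report: multiple R, R-squared,
# adjusted R-squared, standard error, the ANOVA F-test and the coefficients
summary(model)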

REGRESSION STATISTICS TABLE

Multiple R is the square root of the coefficient of determination (R-squared). It is also called the correlation index or the multiple correlation coefficient. It expresses the degree of dependence between the independent variables (X1, X2) and the dependent variable (Y), and it takes values in the range from zero to one. In our case it equals 0.7, which indicates a substantial relationship between the variables.

R-squared (the coefficient of determination), also called the measure of certainty, characterizes the quality of the fitted regression line, namely the degree of agreement between the original data and the regression model (the calculated data). The measure of certainty always lies within the interval [0, 1].

In our case, the R-squared value is 0.48, i.e. almost 50%, which indicates a weak fit of the regression line to the original data. Since the obtained R-squared = 48% < 75%, we can also conclude that the fitted regression cannot be used for forecasting. Thus, the model explains only 48% of the variation in price, which suggests either that the chosen factors are insufficient or that the sample is too small.

Adjusted R-squared (normalized R-squared) is the same coefficient of determination, but corrected for the sample size:

Adjusted R-squared = 1 − (1 − R-squared) · (n − 1) / (n − k),


where n is the number of observations and k is the number of estimated parameters. The adjusted R-squared is preferable when new regressors (factors) are added, because adding them always increases the plain R-squared even when the model does not actually improve. Since in our case the obtained value equals 0.43 (which differs from R-squared by only 0.05), we can speak of high confidence in the R-squared coefficient.
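As a quick check of the figure quoted above (a hedged calculation, assuming k counts the intercept plus the two regressors, so k = 3 and n = 23): 1 − (1 − 0.48)·(23 − 1)/(23 − 3) = 1 − 0.52·1.1 ≈ 0.43.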

Standard error shows the quality of the approximation of the observations. In our case the error is 5.1. As a percentage: 5.1 / (57.4 − 40.1) = 0.294 ≈ 29% (the model is considered acceptable when the standard error is < 30%).

Observations- indicates the number of observed values ​​(23).

ANALYSIS OF VARIANCE (ANOVA) TABLE

To assess the regression equation, the F-statistic is determined - a characteristic of the accuracy of the regression equation equal to the ratio of the part of the variance of the dependent variable that is explained by the regression equation to the unexplained (residual) part of the variance.

The df column gives the number of degrees of freedom k.

For the regression, this is the number of regressors (factors) - X1 (area) and X2 (estimate), i.e. k = 2.

For the residual, this is n − (m + 1), i.e. the number of original data points (23) minus the number of coefficients for the variables (2) and minus the intercept (1).

The SS column contains sums of squared deviations from the mean of the resulting variable. It presents:

the regression sum of squares - the deviations of the theoretical values calculated by the regression equation from the mean of the resulting variable;

the residual sum of squares - the deviations of the original values from the theoretical values;

the total sum of squares - the deviations of the original values from the mean of the resulting variable.

The larger the regression sum of squared deviations (or the smaller the residual sum), the better the regression equation approximates the original point cloud. In our case, the residual amount is about 50%. Therefore, the regression equation is a very weak approximation to the cloud of original points.

In column MS- unbiased sample variances, regression and residual.

The F column contains the value of the test statistic, calculated to check the significance of the regression equation.

To carry out a statistical test of the significance of the regression equation, a null hypothesis is formulated about the absence of a relationship between the variables (all coefficients for the variables are equal to zero) and the level of significance is selected.

The significance level is the acceptable probability of making a type I error, i.e. rejecting a correct null hypothesis. Making a type I error here means concluding from the sample that a relationship between the variables exists in the general population when in fact it does not. The significance level is typically taken to be 5%. Comparing the obtained value F = 9.4 with the table value F_cr = 3.5 (with 2 and 20 degrees of freedom, respectively), we can say that the regression equation is significant (F > F_cr).

The Significance F column gives the probability of obtaining the observed value of the test statistic. Since in our case this value equals 0.00123, which is less than 0.05, we can say that the regression equation (the dependence) is significant with a probability of 95%.
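These two numbers can be cross-checked with R's F-distribution functions (this is only a verification of the report's figures, F = 9.4 with 2 and 20 degrees of freedom):

qf(0.95, df1 = 2, df2 = 20)                     # critical value at the 5% level, about 3.49
pf(9.4, df1 = 2, df2 = 20, lower.tail = FALSE)  # p-value of the observed statistic, about 0.0012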

The two columns described above show the reliability of the model as a whole.

The following table contains the coefficients for the regressors and their estimates.

The Y-intercept row is not associated with any regressor; it is the free (constant) term of the equation.

The Coefficients column contains the values of the coefficients of the regression equation. Thus, we obtained the equation:

Y = 25.6 + 0.009X1 + 0.346X2

The regression equation must pass through the center of the original point cloud: 13.02 ≤ M(b) ≤ 38.26.

Next, we compare the Coefficients and Standard Error columns pairwise. It can be seen that in our case all coefficients exceed their standard errors in absolute value. This may indicate that the regressors are important, but it is only a rough analysis. The t-statistic column contains a more accurate assessment of the significance of the coefficients.

Column t-statistic contains the t-test values ​​calculated by the formula:

t = (Coefficient) / (Standard error)

This criterion has a Student distribution with the number of degrees of freedom

n- (k + 1) = 23- (2 + 1) = 20

According to the Student's table, we find t_table = 2.086. Comparing each t with t_table, we find that the coefficient of the regressor X2 is insignificant.

The p-value column gives the probability that the critical value of the test statistic (Student's statistic) exceeds the value calculated from the sample. The p-values are compared with the chosen significance level (0.05). It can be seen that only the coefficient of the regressor X2, with p = 0.08 > 0.05, can be considered insignificant.

The Lower 95% and Upper 95% columns show the confidence limits at the 95% confidence level. Each coefficient has its own limits: Coefficient ± t_table · Standard error.

Confidence intervals are plotted only for statistically significant values.
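A sketch of how such limits could be computed by hand in R, using the critical t value mentioned above; the coefficient and standard error below are placeholder numbers, not values from the report:

t_crit <- qt(0.975, df = 20)   # 2.086, the Student's table value used above

b  <- 0.35                     # a coefficient estimate (placeholder)
se <- 0.12                     # its standard error (placeholder)
c(lower = b - t_crit * se, upper = b + t_crit * se)

# For a model fitted with lm(), the same limits come from confint(model, level = 0.95)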

Unlike a functional relationship y = f(x), where each value of the independent variable x corresponds to exactly one value of the quantity y, in a regression relationship the same value of x may, depending on the case, correspond to different values of y. If for each value x = xi one observes ni values yi1, ..., yini of the quantity y, then the dependence of the arithmetic means ȳi = (yi1 + ... + yini)/ni on x = xi is a regression in the statistical sense of the term.

This term was first used in statistics by Francis Galton (1886) in connection with the study of the inheritance of human physical characteristics. Human height was taken as one of the characteristics; it was found that, on the whole, the sons of tall fathers, not surprisingly, turned out to be taller than the sons of short fathers. More interesting was that the variation in the height of the sons was smaller than the variation in the height of the fathers. This is how the tendency of sons' heights to return toward the average (regression to mediocrity) was revealed, hence the term "regression". The fact was demonstrated by calculating the average height of the sons of fathers who are 56 inches tall, the average height of the sons of fathers who are 58 inches tall, and so on. The results were then plotted on a plane, with the average height of the sons on the ordinate and the height of the fathers on the abscissa. The points lie (approximately) on a straight line with a positive angle of inclination less than 45°; it is important that the regression was linear.

So, suppose there is a sample from the two-dimensional distribution of a pair of random variables (X, Y). A straight line in the (x, y) plane serves as a sample analogue of the regression function y = E(Y | X = x).

In this example, the regression of Y on X is a linear function. If the regression of Y on X differs from linear, then the given equations are a linear approximation of the true regression equation.

In general, regression from one random variable to another does not have to be linear. It is also not necessary to be limited to a couple of random variables. Statistical problems of regression are associated with determining the general form of the regression equation, constructing estimates of unknown parameters included in the regression equation, and testing statistical hypotheses about regression. These problems are considered in the framework of regression analysis.

A simple example of the regression of Y on X is the relationship Y = u(X) + ε, where u(x) = E(Y | X = x) and the random variables X and ε are independent. This representation is useful when planning an experiment to study the functional relationship y = u(x) between the non-random quantities y and x. In practice, the regression coefficients in the equation y = u(x) are usually unknown and are estimated from experimental data.

Linear regression (propedeutics)

Let us represent the dependence of y on x as a first-order linear model:

y = β0 + β1·x + ε.

We will assume that the values of x are determined without error, β0 and β1 are model parameters, and ε is an error whose distribution obeys the normal law with zero mean and constant variance σ². The values of the parameters β are not known in advance and must be determined from a set of experimental values (xi, yi), i = 1, ..., n. Thus we can write

ŷi = b0 + b1·xi,   yi = ŷi + ei,

where ŷi is the value predicted by the model for a given x, b0 and b1 are sample estimates of the model parameters, and ei are the approximation errors.

The least squares method gives the following formulas for the parameters of this model and their standard errors:

b1 = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²,   b0 = ȳ − b1·x̄,

s(b1) = se / sqrt(Σ(x − x̄)²),   s(b0) = se · sqrt(1/n + x̄² / Σ(x − x̄)²);

here the mean values x̄ and ȳ are determined as usual, and se² = Σ(y − ŷ)² / (n − 2) denotes the residual regression variance, which is an estimate of the variance σ² if the model is correct.
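A minimal sketch of these formulas in R with a small made-up data set, cross-checked against the built-in lm() fit:

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)                    # made-up observations

b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b0 <- mean(y) - b1 * mean(x)

e   <- y - (b0 + b1 * x)                            # approximation errors
se2 <- sum(e^2) / (length(x) - 2)                   # residual variance estimate
sb1 <- sqrt(se2 / sum((x - mean(x))^2))             # standard error of the slope

c(b0 = b0, b1 = b1, s_b1 = sb1)
coef(lm(y ~ x))                                     # the same b0 and b1 from lm()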

The standard errors of the regression coefficients are used in the same way as the standard error of the mean - to construct confidence intervals and test hypotheses. For example, Student's test is used to test the hypothesis that a regression coefficient equals zero, i.e. that it is insignificant for the model. The Student statistic is t = b / s(b). If the probability of the obtained value with n − 2 degrees of freedom is small enough, for example < 0.05, the hypothesis is rejected. Conversely, if there are no grounds to reject the hypothesis that, say, b1 equals zero, there is reason to reconsider whether the desired regression exists, at least in this form, or to collect additional observations. If the free term b0 equals zero, the straight line passes through the origin and the estimate of the slope is

b = Σ(x·y) / Σx²,

and its standard error is

s(b) = se / sqrt(Σx²).

Usually the true values of the regression coefficients β0 and β1 are not known; only their estimates b0 and b1 are available. In other words, the true regression line may run differently from the one built from the sample data. One can compute a confidence region for the regression line. For any value of x the corresponding values of y are distributed normally, with mean equal to the value of the regression equation. The uncertainty of this estimate is characterized by the standard error of the regression:

s(ŷ) = se · sqrt(1/n + (x − x̄)² / Σ(x − x̄)²).

One can now compute the 100·(1 − α)-percent confidence interval for the value of the regression equation at the point x:

ŷ ± t(1 − α/2, n − 2) · s(ŷ),

where t(1 − α/2, n − 2) is the corresponding quantile of the Student distribution. The figure shows a regression line built on 10 points (solid points), as well as the 95% confidence region of the regression line, bounded by dashed lines. With 95% probability it can be asserted that the true line lies somewhere inside this region. Put differently, if we collected similar data sets (indicated by circles) and plotted regression lines from them (indicated in blue), then in 95 cases out of 100 these lines would stay within the confidence region. Note that some points fall outside the confidence region. This is quite natural, since we are talking about the confidence region of the regression line, not of the values themselves. The scatter of the values is the sum of the scatter around the regression line and the uncertainty of the position of the line itself, namely:

s²(pred) = se² · (1/m + 1/n + (x − x̄)² / Σ(x − x̄)²).

Here m is the number of repeated measurements of y at the given x. The 100·(1 − α)-percent confidence interval (forecast interval) for the mean of m values of y is then

ŷ ± t(1 − α/2, n − 2) · s(pred).

In the figure this 95% confidence region for m = 1 is bounded by solid lines. It contains 95% of all possible values of the quantity y in the studied range of x.
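In R both bands discussed above are available from predict(); a small sketch with made-up data (the point is the interval argument, not the particular numbers):

x <- 1:10
y <- 2 + 0.5 * x + rnorm(10, sd = 0.4)          # made-up observations
fit <- lm(y ~ x)

new <- data.frame(x = 5.5)
predict(fit, new, interval = "confidence")       # band for the regression line itself
predict(fit, new, interval = "prediction")       # wider band for individual future values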


The following example uses the data file Poverty. sta. You can open it using the File menu by choosing the Open command; most likely this data file is located in the / Examples / Datasets directory. Data are based on a comparison of 1960 and 1970 census results for a random sample of 30 counties. County names are entered as case identifiers.

The following information for each variable is provided in the Variable Specification Editor spreadsheet (available when you select All Variable Specification ... from the Data menu).

Purpose of the study. We will analyze the correlates of poverty (ie predictors that are "strongly" correlated with the percentage of families living below the poverty line). Thus, we will consider variable 3 (Pt_Poor) as a dependent or criterion variable, and all other variables as independent variables or predictors.

Initial analysis. When you choose the Multiple Regression command from the Analyze menu, the start panel of the Multiple Regression module opens. You can define a regression equation by clicking the Variables button on the Quick tab of the launch pad of the Multiple Regression module. In the Variable Selection window that appears, select Pt_Poor as the dependent variable, and all other variables in the dataset as independent variables. In the Additional tab, also check the Show descriptive statistics, corr. matrices.



Now click OK on this dialog box and the View Descriptive Statistics dialog box will open. Here you can view the means, standard deviations, correlations, and covariances between variables. Note that this dialog is accessible from almost all subsequent windows in the Multiple Regression module, so you can always go back to look at the descriptive statistics for specific variables.

Distribution of variables. First, let's examine the distribution of the dependent variable Pt_Poor by county. Click Average & Std Deviations to display the table of results.


Select Histograms from the Graphics menu to build a histogram for the Pt_Poor variable (in the Advanced tab of the 2M Histograms dialog box, set the Number of categories in the Category row option to 16). As you can see below, the distribution of this variable is somewhat different from the normal distribution. Correlation coefficients can be significantly overestimated or underestimated if there are significant outliers in the sample. However, although the two counties (the two rightmost columns) have a higher percentage of households living below the poverty line than would be expected from the normal distribution, they still seem to be “within the margin” to us.



This decision is somewhat subjective; The rule of thumb is that concern is only required when the observation (or observations) are outside the range given by the mean ± 3 standard deviations. In this case, it is prudent to repeat the critical (in terms of the effect of outliers) part of the analysis with and without outliers in order to ensure that they do not affect the nature of cross-correlations. You can also view the distribution of this variable by clicking the Span Plot button on the Advanced tab of the View Descriptive Statistics dialog box by selecting the Pt_Poor variable. Next, select the Median / Quartile / Range option in the Range Plots dialog box and click the OK button.


(Note that a specific method for calculating the median and quartiles can be selected for the entire "system" in the Options dialog box on the Tools menu.)

Scatter plots. If there are a priori hypotheses about the relationship between certain variables, it may be helpful at this stage to derive the corresponding scatterplot. For example, consider the relationship between population change and the percentage of households below the poverty line. It would be natural to expect that poverty leads to population migration; thus, there should be a negative correlation between the percentage of families living below the poverty line and population change.

Return to the View Descriptive Statistics dialog box and click the Correlations button on the Quick tab to display the table of results with the correlation matrix.



Correlations between variables can also be displayed in a matrix scatterplot. The scatter matrix for selected variables can be obtained by clicking the Correlation Matrix Plot button on the Advanced tab of the Descriptive Statistics View dialog box and then selecting the variables of interest.

Specifying the multiple regression. To perform the regression analysis, all you need to do is click OK in the View Descriptive Statistics dialog box and go to the Multiple Regression Results window. A standard regression analysis (with an intercept) will be performed automatically.

Viewing the results. The Multiple Regression Results dialog box is shown below. The overall multiple regression equation is highly significant (see the chapter Basic Concepts of Statistics for a discussion of statistical significance testing). Thus, knowing the values of the explanatory variables, one can "predict" the poverty indicator better than by guessing purely at random.



Regression coefficients. To find out which explanatory variables contribute more to the prediction of poverty, examine the regression (or B) coefficients. Click the Summary Regression Table button on the Quick tab of the Multiple Regression Results dialog box to display a table of results with these coefficients.



This table shows the standardized regression coefficients (Beta) and the usual regression coefficients (B). Beta coefficients are the coefficients obtained if all variables are first standardized to mean 0 and standard deviation 1. Thus, the magnitudes of the Beta coefficients allow the relative contribution of each independent variable to the prediction of the dependent variable to be compared. As seen in the results table above, Pop_Chng, Pt_Rural and N_Empld are the most important predictors of poverty; of these, only the first two are statistically significant. The regression coefficient for Pop_Chng is negative, i.e. the smaller the population growth, the more families live below the poverty line in the corresponding county. The regression contribution of Pt_Rural is positive, i.e. the larger the percentage of rural population, the higher the poverty rate.
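For readers without STATISTICA, a hedged sketch of how the same B and Beta coefficients could be computed in R; the data frame poverty and its numeric columns are assumptions mirroring the tutorial's variables.

# Assume a numeric data frame `poverty` with Pt_Poor, Pop_Chng, Pt_Rural, N_Empld, ...
fit_raw  <- lm(Pt_Poor ~ ., data = poverty)                          # usual B coefficients
fit_beta <- lm(Pt_Poor ~ ., data = as.data.frame(scale(poverty)))    # all variables standardized

coef(fit_raw)    # B coefficients
coef(fit_beta)   # Beta coefficients; the intercept is ~0 by construction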

Partial correlations. Another way to examine the contributions of each independent variable to predicting the dependent variable is to calculate partial and semi-partial correlations (click the Partial Correlation button on the Advanced tab of the Multiple Regression Results dialog box). Partial correlations are correlations between the corresponding independent variable and the dependent variable, adjusted for other variables. Thus, it is the correlation between the residuals after adjusting for the explanatory variables. Partial correlation represents the independent contribution of the corresponding independent variable to the prediction of the dependent variable.



Semi-partial correlations are correlations between the corresponding independent variable, adjusted for other variables, and the original (unadjusted) dependent variable. Thus, the semi-partial correlation is the correlation of the corresponding independent variable after adjustment for other variables, and the unadjusted baseline values ​​of the dependent variable. In other words, the square of the semi-partial correlation is a measure of the percentage of the total variance self-explained by the corresponding independent variable, while the square of the partial correlation is the measure of the percentage of residual variance that is accounted for after adjusting the dependent variable for the explanatory variables.

In this example, the partial and semi-partial correlations have similar values. However, sometimes their values can differ significantly (the semi-partial correlation is always smaller). If the semi-partial correlation is very small while the partial correlation is relatively large, then the corresponding variable may have its own "part" in explaining the variability of the dependent variable (i.e. a "part" not explained by the other variables). However, in practical terms this part may be small and represent only a small fraction of the total variability (see, for example, Lindeman, Merenda, and Gold, 1980; Morrison, 1967; Neter, Wasserman, and Kutner, 1985; Pedhazur, 1973; or Stevens, 1986).
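A sketch of how the partial and semi-partial correlations for one predictor (Pop_Chng, with only two of the other predictors kept for brevity) could be computed from residuals in plain R, again on the assumed poverty data frame:

pred_adj <- lm(Pop_Chng ~ Pt_Rural + N_Empld, data = poverty)   # predictor adjusted for the others
dv_adj   <- lm(Pt_Poor  ~ Pt_Rural + N_Empld, data = poverty)   # dependent variable adjusted too

cor(resid(pred_adj), resid(dv_adj))       # partial correlation: both sides adjusted
cor(resid(pred_adj), poverty$Pt_Poor)     # semi-partial: only the predictor adjusted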

Residual analysis. After fitting a regression equation, it is always useful to examine the resulting predicted values and residuals. For example, extreme outliers can significantly bias the results and lead to erroneous conclusions. On the Residuals/Predicted/Observed tab, click the Residual Analysis button to go to the corresponding dialog box.

Line-by-line plot of residuals. This option of the dialog box lets you select one of the possible types of residuals for a line-by-line (casewise) plot. Typically, the raw (non-standardized) or standardized residuals should be examined to identify extreme observations. In our example, select the Residuals tab and click the Residual Row Plotting button; by default a plot of the raw residuals is built, but you can change the type of residuals in the corresponding field.



The scale used in the line-by-line plot in the left-most column is in sigma terms, i.e. standard deviation of residuals. If one or more observations fall outside the ± 3 * sigma range, then it is likely that the relevant observations should be excluded (easily achieved by selection criteria) and the analysis performed again to ensure that there is no bias in the key results caused by these outliers in the data.

Line plot of outliers. A quick way to identify outliers is to use the Outlier Plot option on the Outliers tab. You can choose to view all standard residuals outside the ±2-5 sigma range, or view the 100 most extreme cases, as selected in the Outlier Type field on the Outliers tab. When using the Standard Residual (>2*sigma) option, no outliers are noticeable in our example.

Mahalanobis distances. Most statistics textbooks devote space to outliers and residuals with respect to the dependent variable. However, the role of outliers in the set of explanatory variables is often overlooked. On the independent-variable side there is a list of variables that participate, with different weights (regression coefficients), in predicting the dependent variable. The independent variables can be thought of as points in a multidimensional space in which each observation can be placed. For example, if you have two explanatory variables with equal regression coefficients, you can build a scatterplot of the two variables and place each observation on that plot. You can then mark the point corresponding to the means of both variables (called the centroid) and calculate the distance from each observation to this centroid in this two-dimensional space; this is the conceptual idea behind the Mahalanobis distance. Now let us look at these distances, sorted by magnitude, to identify extreme observations among the independent variables. In the Outlier Type field, check the Mahalanobis distances option and click the outlier line-plot button. The resulting plot shows the Mahalanobis distances sorted in descending order.
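The same distances can be computed outside STATISTICA with base R; a sketch on the assumed predictor columns:

X <- poverty[, c("Pop_Chng", "Pt_Rural", "N_Empld")]        # predictor block (assumed names)

d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))    # squared Mahalanobis distances
sort(d2, decreasing = TRUE)[1:5]                            # the most extreme cases first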



Note that Shelby County appears to stand out in some way compared to other counties on the graph. Looking at the raw data, you find that Shelby County is actually a much larger county, with more people involved in farming (N_Empld) and a much larger African American population. It would probably make sense to express these numbers as percentages rather than absolute values, in which case Shelby's Mahalanobis distance from other counties would not be that great in this example. However, we found Shelby County to be a clear outlier.

Deleted residuals. Another very important statistic for assessing the scale of the outlier problem is the deleted residuals. These are defined as the standardized residuals of the corresponding observations that would be obtained if those observations were excluded from the analysis. Recall that the multiple regression procedure fits a straight line to express the relationship between the dependent and independent variables. If one of the observations is an obvious outlier (like Shelby County in these data), the regression line tends to be pulled toward that outlier in order to account for it as much as possible. As a result, if the corresponding observation is excluded, a completely different regression line (and B coefficients) is obtained. Therefore, if the deleted residual differs greatly from the standardized residual, you have reason to believe that the regression results are significantly biased by the corresponding observation. In this example, the deleted residual for Shelby County is an outlier that significantly affects the analysis. You can build a scatterplot of the residuals against the deleted residuals using the Residuals vs. deleted residuals option on the Scatterplots tab. The outlier is clearly visible in the scatterplot below.
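In R the closest analogue is the studentized deleted residual, available directly for a fitted model; a sketch on the assumed poverty fit:

fit <- lm(Pt_Poor ~ ., data = poverty)

del <- rstudent(fit)      # residuals with each case deleted from its own fit
std <- rstandard(fit)     # ordinary standardized residuals

plot(std, del)            # a point far off the diagonal flags an influential case
abline(0, 1, lty = 2)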


STATISTICA provides an interactive outlier-removal tool (Brush on the graphics toolbar). It allows you to experiment with removing outliers and to see their effect on the regression line immediately. When this tool is activated, the cursor changes to a cross and the Brushing dialog box appears next to the graph. You can (temporarily) interactively exclude individual data points from the graph by checking (1) the Auto update option and (2) the Turn off option in the Operation block, and then clicking the point you want to remove with the cross-hair cursor.


Note that deleted points can be restored by clicking the Undo All button in the Brushing dialog box.

Normal probability plots. A large number of additional plots are available from the Residual Analysis window. Most of them are fairly easy to interpret. Here, however, we will interpret the normal probability plot, since it is most often used to check the validity of the regression assumptions.

As noted earlier, multiple linear regression assumes a linear relationship between the variables in the equation and a normal distribution of residuals. If these assumptions are violated, final conclusions may not be accurate. The normal probability plot of residuals clearly shows the presence or absence of large deviations from the stated assumptions. Click the Normal button on the Probability plots tab to draw this plot.


This plot is constructed as follows. First, the regression residuals are ranked. For the ordered residuals, z-scores (i.e. values of the standard normal distribution) are computed under the assumption that the data are normally distributed. These z-values are plotted along the y-axis of the graph.

If the observed residuals (plotted along the X-axis) are normally distributed, then all values ​​will be located on the graph near a straight line; on this graph, all points lie very close to a straight line. If the residuals are not normally distributed, then they will deviate from the line. Outliers can also appear on this graph.
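The equivalent check in R is a two-liner on any fitted model (reusing the fit object from the sketch above):

qqnorm(resid(fit))   # ranked residuals against normal quantiles
qqline(resid(fit))   # reference line; systematic departure signals non-normality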

If the model does not fit the data well and the plotted points seem to form some structure around the regression line (for example, the cloud of observations takes an S-shape), it may be useful to apply a transformation of the dependent variable (for example, a log transformation to pull in the tail of the distribution, etc.; see also the short discussion of the Box-Cox and Box-Tidwell transformations in the Notes and Technical Information section). A discussion of such techniques is outside the scope of this manual (Neter, Wasserman and Kutner, 1985, p. 134, offer an excellent discussion of transformations as a means of dealing with non-normality and nonlinearity). All too often, however, researchers simply accept their data without trying to look closely at its structure or to check it against their assumptions, which leads to erroneous conclusions. For this reason, one of the main challenges facing the developers of the user interface of the Multiple Regression module was to simplify the (graphical) analysis of residuals as much as possible.

The main purpose of regression analysis is to determine the analytical form of the relationship in which the change of the resulting attribute is due to the influence of one or more factor attributes, while all other factors that also affect the resulting attribute are held at constant, average values.
Regression Analysis Tasks:
a) Establishing the form of dependence. Regarding the nature and form of the relationship between the phenomena, distinguish between positive linear and nonlinear and negative linear and nonlinear regression.
b) Determining the regression function in the form of a mathematical equation of one type or another and establishing the influence of explanatory variables on the dependent variable.
c) Estimation of unknown values ​​of the dependent variable. Using the regression function, you can reproduce the values ​​of the dependent variable within the interval of specified values ​​of the explanatory variables (i.e., solve the interpolation problem) or estimate the process flow outside the specified interval (i.e., solve the extrapolation problem). The result is an estimate of the value of the dependent variable.

Paired regression is an equation of the relationship between two variables y and x: y = f(x), where y is the dependent variable (resulting indicator) and x is the independent explanatory variable (factor attribute).

Distinguish between linear and non-linear regressions.
Linear regression: y = a + bx + ε
Nonlinear regressions are divided into two classes: regressions that are nonlinear with respect to the explanatory variables included in the analysis, but linear in the estimated parameters, and regressions that are nonlinear in the estimated parameters.
Regressions non-linear in explanatory variables:

Regressions nonlinear in the estimated parameters:

The construction of a regression equation reduces to estimating its parameters. To estimate the parameters of regressions that are linear in the parameters, the method of least squares (OLS) is used. OLS gives parameter estimates for which the sum of squared deviations of the actual values of the resulting attribute y from the theoretical values ŷ is minimal, i.e.

Σ(y − ŷ)² → min.
For linear equations, and for nonlinear equations reducible to linear form, the following system of normal equations is solved with respect to a and b:

n·a + b·Σx = Σy,
a·Σx + b·Σx² = Σy·x.

You can also use the ready-made formulas that follow from this system:

b = (mean(y·x) − mean(y)·mean(x)) / (mean(x²) − mean(x)²),   a = mean(y) − b·mean(x).

The closeness of the relationship between the studied phenomena is estimated by the linear pair correlation coefficient for linear regression:

r = b·σx/σy,

and by the correlation index for nonlinear regression:

R = sqrt(1 − Σ(y − ŷ)² / Σ(y − ȳ)²).

An assessment of the quality of the constructed model will be given by the coefficient (index) of determination, as well as the average approximation error.
The average approximation error is the average deviation of the calculated values from the actual ones:

Ā = (1/n)·Σ| (y − ŷ)/y |·100%.
The admissible limit of values ​​is no more than 8-10%.
The average elasticity coefficient shows by how many percent, on average across the population, the result y changes from its average value when the factor x changes by 1% from its average value:

Ē = b·(x̄/ȳ).
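From a fitted pair regression this is a one-line computation; a sketch with made-up x and y values:

x <- c(45, 47, 55, 57, 59, 60, 62)
y <- c(69, 54, 49, 60, 61, 57, 55)      # made-up values

b <- coef(lm(y ~ x))[2]
b * mean(x) / mean(y)                    # average elasticity coefficient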

The analysis of variance decomposes the variance of the dependent variable:

Σ(y − ȳ)² = Σ(ŷ − ȳ)² + Σ(y − ŷ)²,

where Σ(y − ȳ)² is the total sum of squared deviations;
Σ(ŷ − ȳ)² is the sum of squared deviations due to regression ("explained" or "factorial");
Σ(y − ŷ)² is the residual sum of squared deviations.
The proportion of the variance explained by the regression in the total variance of the resulting attribute y is characterized by the coefficient (index) of determination R²:

R² = Σ(ŷ − ȳ)² / Σ(y − ȳ)² = 1 − Σ(y − ŷ)² / Σ(y − ȳ)².

The coefficient of determination is the square of the correlation coefficient or correlation index.

The F-test for the quality of the regression equation consists of testing the hypothesis H₀ that the regression equation and the indicator of the closeness of the relationship are statistically insignificant. For this, the actual value F_fact is compared with the critical (tabular) value F_table of Fisher's F-test. F_fact is determined from the ratio of the factorial and residual variances, each calculated per one degree of freedom:

F_fact = (R² / (1 − R²)) · ((n − m − 1) / m),
where n is the number of units in the population; m is the number of parameters for variables x.
F_table is the maximum possible value of the test statistic under the influence of random factors for the given degrees of freedom and significance level α. The significance level α is the probability of rejecting a correct hypothesis, given that it is true. Usually α is taken to be 0.05 or 0.01.
If F_table < F_fact, then H₀ - the hypothesis about the random nature of the estimated characteristics - is rejected, and their statistical significance and reliability are recognized. If F_table > F_fact, then the hypothesis H₀ is not rejected and the regression equation is recognized as statistically insignificant and unreliable.
To assess the statistical significance of the regression and correlation coefficients, Student's t-test is applied and confidence intervals for each indicator are calculated. The hypothesis H₀ about the random nature of the indicators is put forward, i.e. that they differ from zero insignificantly. The significance of the regression and correlation coefficients is assessed with Student's t-test by comparing their values with the magnitude of their random errors:

t_a = a / m_a;   t_b = b / m_b;   t_r = r / m_r.
The random errors of the linear regression parameters and of the correlation coefficient are determined by the formulas:

m_a = S_resid · sqrt(Σx²) / (n·σx),   m_b = S_resid / (σx·sqrt(n)),   m_r = sqrt((1 − r²) / (n − 2)),

where S_resid = sqrt(Σ(y − ŷ)² / (n − 2)) is the residual standard deviation.
Comparing the actual and critical (tabular) values ​​of t-statistics - t table and t fact - we accept or reject the hypothesis H o.
The relationship between Fisher's F-test and Student's t-statistic (for paired regression) is expressed by the equality

t_b² = F.
If t_table < t_fact, then H₀ is rejected, i.e. a, b and r_xy do not differ from zero by chance but were formed under the influence of the systematically acting factor x. If t_table > t_fact, then the hypothesis H₀ is not rejected and the random nature of the formation of a, b or r_xy is recognized.
To construct the confidence intervals, we determine the marginal error Δ for each indicator:

Δa = t_table · m_a,   Δb = t_table · m_b.

The formulas for calculating the confidence intervals are as follows:

γa = a ± Δa,   i.e.   a − Δa ≤ a* ≤ a + Δa;
γb = b ± Δb,   i.e.   b − Δb ≤ b* ≤ b + Δb.
If zero falls within the confidence interval, i.e. the lower limit is negative, and the upper one is positive, then the estimated parameter is taken to be zero, since it cannot simultaneously take on both positive and negative values.
The forecast value ŷp is determined by substituting the corresponding (forecast) value xp into the regression equation. The mean standard error of the forecast is calculated:

m_ŷp = σ_resid · sqrt(1 + 1/n + (xp − x̄)² / Σ(x − x̄)²),

where σ_resid = sqrt(Σ(y − ŷ)² / (n − m − 1)),

and the confidence interval of the forecast is constructed:

ŷp − Δŷp ≤ ŷp* ≤ ŷp + Δŷp,

where Δŷp = t_table · m_ŷp.

Solution example

Problem No. 1. For seven territories of the Urals region in 199X, the values of two attributes are known.
Table 1.
Required: 1. To characterize the dependence of y on x, calculate the parameters of the following functions:
a) linear;
b) power-law (you first need to perform the procedure for linearizing the variables by taking the logarithm of both parts);
c) exponential;
d) equilateral hyperbola (you also need to figure out how to pre-linearize this model).
2. Evaluate each model in terms of the mean approximation error and Fisher's F-test.

Solution (Option # 1)

1a. To calculate the parameters a and b of the linear regression (the calculation can be done with a calculator),
we solve the system of normal equations given above for a and b.
Based on the initial data, we calculate:
No.  y     x     y·x      x²       y²      ŷx    y−ŷx   Ai, %
1 68,8 45,1 3102,88 2034,01 4733,44 61,3 7,5 10,9
2 61,2 59,0 3610,80 3481,00 3745,44 56,5 4,7 7,7
3 59,9 57,2 3426,28 3271,84 3588,01 57,1 2,8 4,7
4 56,7 61,8 3504,06 3819,24 3214,89 55,5 1,2 2,1
5 55,0 58,8 3234,00 3457,44 3025,00 56,5 -1,5 2,7
6 54,3 47,2 2562,96 2227,84 2948,49 60,5 -6,2 11,4
7 49,3 55,2 2721,36 3047,04 2430,49 57,8 -8,5 17,2
Total 405,2 384,3 22162,34 21338,41 23685,76 405,2 0,0 56,7
Mean (Total/n) 57,89 54,90 3166,05 3048,34 3383,68 X X 8,1
σ 5,74 5,86 X X X X X X
σ² 32,92 34,34 X X X X X X


Regression equation: ŷ = 76.88 − 0.35x. With an increase in the average daily wage by 1 ruble, the share of expenses on the purchase of food products decreases on average by 0.35 percentage points.
Let's calculate the linear pair correlation coefficient:

The relationship is moderate and inverse.
Let's define the coefficient of determination:

The 12.7% variation in the result is explained by the variation in factor x. Substituting the actual values ​​into the regression equation X, determine the theoretical (calculated) values . Let's find the value of the average approximation error:

On average, the calculated values ​​deviate from the actual ones by 8.1%.
Let's calculate the F-criterion:

Since 1 < F < ∞, we should consider F⁻¹.
The obtained value indicates that the hypothesis H₀ about the random nature of the revealed dependence should be accepted: the parameters of the equation and the indicator of the closeness of the relationship are statistically insignificant.
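The hand calculation above can be cross-checked in R using the y and x columns from the table; the conclusions do not change:

y <- c(68.8, 61.2, 59.9, 56.7, 55.0, 54.3, 49.3)
x <- c(45.1, 59.0, 57.2, 61.8, 58.8, 47.2, 55.2)

fit <- lm(y ~ x)
coef(fit)                                  # intercept ~ 76.88, slope ~ -0.35
cor(x, y)^2                                # coefficient of determination ~ 0.127
mean(abs((y - fitted(fit)) / y)) * 100     # average approximation error ~ 8.1%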
1b. The construction of a power-law model is preceded by the procedure of linearization of variables. In the example, linearization is done by taking the logarithm of both sides of the equation:


i.e. Y = C + b·X, where Y = lg y, X = lg x, C = lg a.

For calculations, we use the data in Table. 1.3.

Table 1.3

No.  Y      X      YX      Y²      X²      ŷx    y−ŷx   (y−ŷx)²   Ai, %
1 1,8376 1,6542 3,0398 3,3768 2,7364 61,0 7,8 60,8 11,3
2 1,7868 1,7709 3,1642 3,1927 3,1361 56,3 4,9 24,0 8,0
3 1,7774 1,7574 3,1236 3,1592 3,0885 56,8 3,1 9,6 5,2
4 1,7536 1,7910 3,1407 3,0751 3,2077 55,5 1,2 1,4 2,1
5 1,7404 1,7694 3,0795 3,0290 3,1308 56,3 -1,3 1,7 2,4
6 1,7348 1,6739 2,9039 3,0095 2,8019 60,2 -5,9 34,8 10,9
7 1,6928 1,7419 2,9487 2,8656 3,0342 57,4 -8,1 65,6 16,4
Total 12,3234 12,1587 21,4003 21,7078 21,1355 403,5 1,7 197,9 56,3
Mean 1,7605 1,7370 3,0572 3,1011 3,0194 X X 28,27 8,0
σ 0,0425 0,0484 X X X X X X X
σ 2 0,0018 0,0023 X X X X X X X

Let's calculate C and b:


We get a linear equation: .
Having performed its potentiation, we get:

Substituting the actual values ​​into this equation X, we get the theoretical values ​​of the result. Based on them, we will calculate the indicators: the tightness of the connection - the correlation index and the average approximation error

The characteristics of the power-law model indicate that it describes the relationship somewhat better than a linear function.
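The log-log fit can be reproduced in R from the same x and y vectors as in the previous sketch (base-10 logarithms, as in the table):

fit_pow <- lm(log10(y) ~ log10(x))        # linearized power model: lg y = C + b*lg x
a <- 10^coef(fit_pow)[1]                  # back-transform the intercept
b <- coef(fit_pow)[2]

yhat <- a * x^b                           # fitted values of the power model
mean(abs((y - yhat) / y)) * 100           # average approximation error, about 8%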

1c. The construction of the exponential curve equation ŷ = a·bˣ is preceded by the procedure of linearizing the variables by taking logarithms of both sides of the equation: lg ŷ = lg a + x·lg b.

For calculations, we use the data in the table.

No.  Y      x      Yx       Y²      x²       ŷx    y−ŷx   (y−ŷx)²   Ai, %
1 1,8376 45,1 82,8758 3,3768 2034,01 60,7 8,1 65,61 11,8
2 1,7868 59,0 105,4212 3,1927 3481,00 56,4 4,8 23,04 7,8
3 1,7774 57,2 101,6673 3,1592 3271,84 56,9 3,0 9,00 5,0
4 1,7536 61,8 108,3725 3,0751 3819,24 55,5 1,2 1,44 2,1
5 1,7404 58,8 102,3355 3,0290 3457,44 56,4 -1,4 1,96 2,5
6 1,7348 47,2 81,8826 3,0095 2227,84 60,0 -5,7 32,49 10,5
7 1,6928 55,2 93,4426 2,8656 3047,04 57,5 -8,2 67,24 16,6
Total 12,3234 384,3 675,9974 21,7078 21338,41 403,4 -1,8 200,78 56,3
Mean 1,7605 54,9 96,5711 3,1011 3048,34 X X 28,68 8,0
σ 0,0425 5,86 X X X X X X X
σ 2 0,0018 34,339 X X X X X X X

The values of the regression parameters a and b were:


A linear equation is obtained: . Let's potentiate the resulting equation and write it in the usual form:

We estimate the tightness of the connection through the correlation index:

  • Tutorial

Statistics have recently received strong PR support from newer and noisy disciplines - Machine Learning and Big Data... Those who seek to ride this wave need to make friends with regression equations... At the same time, it is advisable not only to learn 2-3 tricks and pass the exam, but to be able to solve problems from everyday life: to find the relationship between variables, and ideally, to be able to distinguish a signal from noise.



For this purpose we will use the R programming language and development environment, which is perfectly suited to such tasks. Along the way, we will check what a Habr post's rating depends on, using the statistics of our own articles.

Introduction to Regression Analysis

If there is a correlation between the variables y and x, it becomes necessary to determine the functional relationship between the two quantities. The dependence of the mean value of y on x is called the regression of y on x.


Regression analysis is based on least squares method (OLS), according to which a function is taken as the regression equation such that the sum of the squares of the differences is minimal.



Karl Gauss discovered, or rather recreated, OLS at the age of 18, but the results were first published by Legendre in 1805. According to unverified accounts, the method was known in ancient China, migrated from there to Japan and only then reached Europe. The Europeans made no secret of it and put it to good use, determining with its help the trajectory of the dwarf planet Ceres in 1801.


The form of the function is, as a rule, specified in advance, and the optimal values of the unknown parameters are selected using OLS. The measure of the scatter of values around the regression is the residual variance

s² = Σ(y − ŷ)² / (n − k),

  • k is the number of coefficients in the system of regression equations.

Most often, a linear regression model is used, and all nonlinear dependencies are brought to a linear form using algebraic tricks, various transformations of the variables y and x.

Linear regression

The linear regression equation can be written as

y = β0 + β1·x + ε

or, in matrix form,

y = Xβ + ε,

where:

  • y - dependent variable;
  • x is an independent variable;
  • β - coefficients to be found using the least squares method;
  • ε - error, unexplained error and deviation from the linear relationship;


A random variable y can thus be interpreted as the sum of two terms: the value predicted by the model, ŷ, and the unexplained deviation ε.

Another key concept is the coefficient of determination R².


Linear Regression Constraints

In order to use a linear regression model, some assumptions are needed about the distribution and properties of the variables.



How do you find out that the above conditions are not met? Well, first of all, it is often seen with the naked eye on the graph.


Variance heterogeneity (heteroscedasticity)


With an increase in variance with an increase in the independent variable, we have a graph in the shape of a funnel.



In some cases nonlinearity of the regression can also be seen quite clearly on the plot.


Nevertheless, there are also quite strict formal ways to determine whether the conditions of linear regression are met, or violated.




Multicollinearity is usually diagnosed with the variance inflation factor, VIF = 1 / (1 − R²j), where R²j is the coefficient of determination between the j-th factor and the other factors. If at least one of the VIFs is > 10, it is quite reasonable to assume the presence of multicollinearity.
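A sketch of this check in R, computed directly from the definition; the hist data frame and regmodel used later in this post are assumed here, and the car package's vif() would give the same numbers for the whole model at once:

# VIF for the regressor `reads`: 1 / (1 - R^2 of reads regressed on the other factors)
r2_reads <- summary(lm(reads ~ comm + faves + fb + bytes, data = hist))$r.squared
1 / (1 - r2_reads)

# car::vif(regmodel) returns the VIF of every regressor in the fitted model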


Why is it so important that all of the above conditions hold? It all comes down to the Gauss-Markov theorem, according to which OLS estimates are unbiased and efficient only if these conditions are met.

How to overcome these limitations

Violating one or more of these restrictions is not a death sentence.

  1. The non-linearity of the regression can be overcome by transforming the variables, for example, through the natural logarithm function ln.
  2. In the same way, it is possible to solve the problem of inhomogeneous variance, using ln, or sqrt transformations of the dependent variable, or using a weighted OLS.
  3. To eliminate the problem of multicollinearity, the variable elimination method is used: highly correlated explanatory variables are removed from the regression, which is then re-estimated. The criterion for selecting the variables to exclude is the correlation coefficient. Another way to solve the problem is to replace the variables that exhibit multicollinearity with a linear combination of them. This does not exhaust the list; stepwise regression and other methods also exist.

Unfortunately, not all conditional violations and linear regression defects can be eliminated using the natural logarithm. If there is autocorrelation of disturbances for example, it’s better to take a step back and build a new and better model.

Linear regression of Habr upvotes

So, enough theoretical baggage: let's build the model itself.
For a long time I was curious what the green number that indicates a post's rating on Habr depends on. Having collected all the available statistics for my own posts, I decided to run them through a linear regression model.


Loads data from a tsv file.


> hist <- read.table("~/habr_hist.txt", header = TRUE)
> hist
points reads comm faves  fb bytes
    31 11937   29    19  13 10265
    93 34122   71    98  74 14995
    32 12153   12   147  17 22476
    30 16867   35    30  22  9571
    27 13851   21    52  46 18824
    12 16571   44   149  35  9972
    18  9651   16    86  49 11370
    59 29610   82    29 333 10131
    26  8605   25    65  11 13050
    20 11266   14    48   8  9884
...
  • points - article rating.
  • reads - number of views.
  • comm - number of comments.
  • faves - number of bookmarks.
  • fb - shares on social networks (fb + vk).
  • bytes - length in bytes.

Checking multicollinearity.


> cor(hist)
          points     reads        comm       faves         fb      bytes
points 1.0000000 0.5641858  0.61489369  0.24104452 0.61696653 0.19502379
reads  0.5641858 1.0000000  0.54785197  0.57451189 0.57092464 0.24359202
comm   0.6148937 0.5478520  1.00000000 -0.01511207 0.51551030 0.08829029
faves  0.2410445 0.5745119 -0.01511207  1.00000000 0.23659894 0.14583018
fb     0.6169665 0.5709246  0.51551030  0.23659894 1.00000000 0.06782256
bytes  0.1950238 0.2435920  0.08829029  0.14583018 0.06782256 1.00000000

Contrary to my expectations, the rating depends most strongly not on the number of views of the article but on comments and shares on social networks. I also assumed that the number of views and comments would correlate more strongly, but the dependence is quite moderate - there is no need to exclude any of the explanatory variables.


Now the actual model itself, we use the lm function.


regmodel <- lm(points ~ ., data = hist)
summary(regmodel)

Call:
lm(formula = points ~ ., data = hist)

Residuals:
    Min      1Q  Median      3Q     Max
-26.920  -9.517  -0.559   7.276  52.851

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.029e+01  7.198e+00   1.430   0.1608
reads       8.832e-05  3.158e-04   0.280   0.7812
comm        1.356e-01  5.218e-02   2.598   0.0131 *
faves       2.740e-02  3.492e-02   0.785   0.4374
fb          1.162e-01  4.691e-02   2.476   0.0177 *
bytes       3.960e-04  4.219e-04   0.939   0.3537
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.65 on 39 degrees of freedom
Multiple R-squared: 0.5384, Adjusted R-squared: 0.4792
F-statistic: 9.099 on 5 and 39 DF, p-value: 8.476e-06

In the first line we specify the parameters of the linear regression. The formula points ~ . defines points as the dependent variable and all remaining variables as regressors. You can define a single independent variable via points ~ reads, or a set of variables via points ~ reads + comm.


Let us now proceed to deciphering the results obtained.




You can try to improve the model somewhat by smoothing out non-linear factors: comments and posts on social networks. Let's replace the values ​​of the variables fb and comm with their powers.


> hist$fb = hist$fb^(4/7)
> hist$comm = hist$comm^(2/3)

Let's check the values ​​of the linear regression parameters.


> regmodel <- lm(points ~ ., data = hist)
> summary(regmodel)

Call:
lm(formula = points ~ ., data = hist)

Residuals:
    Min      1Q  Median      3Q     Max
-22.972 -11.362  -0.603   7.977  49.549

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.823e+00  7.305e+00   0.387  0.70123
reads      -6.278e-05  3.227e-04  -0.195  0.84674
comm        1.010e+00  3.436e-01   2.938  0.00552 **
faves       2.753e-02  3.421e-02   0.805  0.42585
fb          1.601e+00  5.575e-01   2.872  0.00657 **
bytes       2.688e-04  4.108e-04   0.654  0.51677
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.21 on 39 degrees of freedom
Multiple R-squared: 0.5624, Adjusted R-squared: 0.5062
F-statistic: 10.02 on 5 and 39 DF, p-value: 3.186e-06

As we can see, the model has generally become more responsive: the parameters tightened up and became silkier, the F-statistic increased, and so did the adjusted coefficient of determination.


Let's check whether the conditions of applicability of the linear regression model are met. The Durbin-Watson test checks for autocorrelation of the disturbances.


> library(lmtest)
> dwtest(hist$points ~ ., data = hist)

        Durbin-Watson test

data:  hist$points ~ .
DW = 1.585, p-value = 0.07078
alternative hypothesis: true autocorrelation is greater than 0

And finally, a check for heteroscedasticity (non-constant variance) using the Breusch-Pagan test.


> bptest(hist$points ~ ., data = hist)

        studentized Breusch-Pagan test

data:  hist$points ~ .
BP = 6.5315, df = 5, p-value = 0.2579

Finally

Of course, our linear regression model of Habr topic ratings turned out to be not the most successful. We were able to explain no more than half of the variability in the data. The factors would need to be reworked to get rid of the non-constant variance, and the autocorrelation situation is also unclear. In general, the data are not enough for any serious assessment.


But on the other hand, this is good. Otherwise, any hastily written troll post on Habré would automatically gain a high rating, but fortunately this is not the case.

Used materials

  1. A. I. Kobzar, Applied Mathematical Statistics. Moscow: Fizmatlit, 2006.
  2. William H. Greene, Econometric Analysis.
