AP Statistics Ch 3
Things to know before starting this guide...
- response variable: measures the outcome of the study (y)
- explanatory variable: the variable that explains or influences changes in a response variable (x)
IMPORTANT!!! Just because a variable is an explanatory variable, that DOES NOT mean that the explanatory variable causes the change.
As for solving the problems, the procedure is more or less the same: plot data, look for patterns and deviations from those patterns. This time, compose an equation to describe the pattern.
Part 1: Scatterplots and Correlation
- the most effective way to display the relationship between two quantitative variables is to create a scatterplot
- it shows the relationship between two variables measured on the same individuals
- to interpret scatterplots...
- look at overall pattern and deviations (outliers), then describe direction (positive or negative association), form, and strength
- if slope is positive, then it has a positive association; if slope is negative, then it has a negative association
- the form can be linear (quite common), parabolic, etc...
- the strength can be strong or weak or relatively strong...
- to add another categorical variable into the graph, do so by using something different, like a different shape or color
- the direction and strength of a linear relationship are described by the correlation r
- can be anywhere from -1 to 1; values near -1 or 1 are the strongest, and the closer r is to 0, the weaker the linear relationship; r is exactly -1 or 1 only if the points lie perfectly on a straight line
- negative associations have negative correlations; positive associations have positive correlations
- can be given by the calculator
- can only be trusted and used when the relationship is linear (if it isn’t, we will have to convert the relationship to linear, which will be explained in the ch 4 notes)
- it is the same no matter which variable you call explanatory and which you call response
- does not change when we change the units of measurement, and r itself has no unit of measurement; it is just a number
- when can we use correlation
- as stated above, only trust, and therefore use, when relationship is linear, no matter how strong the relationship is
- requires the variables to be quantitative
- r is not resistant (a few outliers can change it a lot), and it is not a complete summary of two-variable data
- need to give the means and standard deviations of both x and y along with r for the summary to be complete
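To check the facts about r listed above yourself, here is a minimal Python sketch. The data are made up for illustration, and `correlation` is just a helper written for these notes (it uses the standard definition of r as the sum of the products of the z-scores divided by n - 1); on the exam you would get r from the calculator.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    z_products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(z_products) / (n - 1)

# Made-up data: hours studied (x) and quiz score (y) for six students
hours = [1, 2, 3, 4, 5, 6]
score = [52, 60, 61, 70, 74, 83]

r = correlation(hours, score)
r_swapped = correlation(score, hours)                    # swap explanatory and response
r_minutes = correlation([60 * h for h in hours], score)  # change units: hours -> minutes
r_perfect = correlation([1, 2, 3], [10, 8, 6])           # points exactly on a falling line

print(round(r, 3), round(r_swapped, 3), round(r_minutes, 3), round(r_perfect, 3))
```

The first three values come out the same (swapping the variables or changing units does not change r), and the last one is -1 because those points are perfectly linear with a negative slope.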
Part 2: Least Squares Regression
- when a relationship is linear, summarize using a regression line (a model for the data)
- can only be used when we have an explanatory and response variable
- describes how the response variable changes as the explanatory variable changes
- can compose with the equation y = a + bx
- basically the same as y = mx + b, but statisticians use the y = a + bx version
- how to interpret
- b is the slope, which means that each time x increases by one unit, the predicted y changes by b (it goes up if b is positive and down if b is negative)
- can’t determine how big of an effect explanatory variable has on response variable by just looking at the slope
- a is the y-int, which means y = a when x = 0
- it is possible for the y-int to not make sense; for example, take the equation y = 23 + 4x, where x = number of apple trees harvested and y = number of apples harvested. The y-int does not make sense because it says that if we harvested 0 apple trees, we would still have harvested 23 apples
- predicting based on the line
- we can predict based on the line, but the prediction can be inaccurate, and therefore untrustworthy, if we extrapolate, meaning we plug in an x that is outside the range of x-values used to make the line
- least-squares regression line
- regression lines drawn by eye vary from person to person, so our predictions of y-values will have errors >>> need least-squares regression line (LSRL)
- LSRL makes the sum of the squared vertical distances of the points from the line (and therefore the squared y-value prediction errors) as small as possible
- how to find equation
- y-hat = a + bx
- b = r(s_y/s_x), which is (correlation) times (standard deviation of y)/(standard deviation of x)
- line has to pass through point (mean of x, mean of y), so a = (mean of y) - b(mean of x)
- IMPORTANT! Be sure to include y-hat sign for any FRQ question that asks for predictions! The y-hat shows that the calculated value is only a prediction and is not the actual observed response of y
- this equation can also be found by your calculator
- what if on an FRQ they give you a chart full of numbers (computer regression output) instead of the data?
- then, the value of the slope will be the coefficient of the predictor that isn’t the constant (the explanatory variable), and the y-int will be the coefficient of the constant
- the percent of the variation in y that is explained by the equation is given by the R-sq, NOT the R-sq (adj)
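The "how to find equation" steps above can be sketched in Python as follows. This is only an illustration: the five data points are made up, and `lsrl` is a helper name invented for these notes. It finds r first, then uses b = r(s_y/s_x) and the fact that the line passes through (mean of x, mean of y).

```python
from statistics import mean, stdev

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + bx, using b = r(s_y/s_x) and a = y-bar - b(x-bar)."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # correlation r from the usual sum-of-products formula
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    b = r * s_y / s_x      # slope
    a = y_bar - b * x_bar  # intercept, forced by the point (x-bar, y-bar)
    return a, b

# Made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = lsrl(x, y)
print(f"y-hat = {a:.2f} + {b:.2f}x")  # y-hat = 2.20 + 0.60x

# Plugging in the mean of x gives back the mean of y
y_hat_at_mean = a + b * mean(x)
print(round(y_hat_at_mean, 2), mean(y))
```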
- residuals
- no line passes through every point, so we look at how far the data points fall from the LSRL; if the LSRL is a good model, these deviations should be small
- these deviations are called residuals, the difference between observed value of response variable and the value predicted by LSRL
- residual = y - y-hat (observed minus predicted)
- the sum of the residuals from the LSRL is always zero (so their mean is zero too)
- residual plot
- makes it easier to study residuals
- can be made by calculator
- if residual plot has clear pattern or is fan-shaped, then a linear model is not a good model
- can plot residuals against either the explanatory values or the predicted y-values; the pattern will be the same either way (just flipped left to right if the slope is negative) because y-hat is linearly related to x
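Here is a small Python sketch of the residual facts above, using made-up data. The helper `fit_lsrl` is written just for these notes; it gets the slope from the sum of (x - x-bar)(y - y-bar) divided by the sum of (x - x-bar) squared, which gives the same value as r(s_y/s_x).

```python
def fit_lsrl(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + bx."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b

# Made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_lsrl(x, y)
predicted = [a + b * xi for xi in x]
residuals = [yi - y_hat for yi, y_hat in zip(y, predicted)]  # observed - predicted

print([round(e, 2) for e in residuals])
print(round(sum(residuals), 10))  # zero, up to rounding error
```

Plotting `residuals` against `x` (or against `predicted`) would give the residual plot; with only five made-up points there is no real pattern to read, so the sketch stops at the numbers.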
- R² (can be described as R-sq in FRQs): how well line fits data
- numerical way to see if LSRL is good model
- called coefficient of determination
- shows what percent of the variation in the response variable is explained by the LSRL
- in other words, R² is a tool to see how well the LSRL fits, so the closer R² is to 1 (if calculator gives the 0.xxxx version) or 100%, the better the LSRL fits the data
- so basically, R² is a measure of the strength of the LSRL
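One common way to compute the coefficient of determination is 1 - SSE/SST, where SSE is the sum of squared residuals from the LSRL and SST is the sum of squared deviations of y from its mean. The notes don't give this formula, so treat the Python sketch below as an assumption-labeled illustration with made-up data and helper names invented here. It also checks that the result equals the square of the correlation r.

```python
def fit_lsrl(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + bx."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a, b = fit_lsrl(x, y)
y_bar = sum(y) / len(y)
x_bar = sum(x) / len(x)

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # variation left over after the line
sst = sum((yi - y_bar) ** 2 for yi in y)                     # total variation in y
r_squared = 1 - sse / sst

# Correlation r, computed directly, for comparison
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
r = sxy / (sxx * sst) ** 0.5

print(round(r_squared, 4), round(r ** 2, 4))  # the two values match
```

Here about 60% of the variation in y is explained by the LSRL; the remaining 40% is the leftover scatter of the points around the line.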
- Using LSRL
- which variable we assign as explanatory or response DOES matter! If we switch the roles of the variables, LSRL will be different!
- remember that the slope of the LSRL and the correlation are closely related
- slope = (correlation) times (standard deviation of y)/(standard deviation of x)
- remember that LSRL ALWAYS passes through point (mean of x, mean of y)
- correlation r describes only straight line relationships
- if describing how well the LSRL fits the data, use r²!
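To see why the choice of explanatory and response variable matters for the LSRL (even though it doesn't matter for r), here is a minimal Python sketch with made-up data. `fit_lsrl` is a helper written for these notes, not a calculator or textbook function.

```python
def fit_lsrl(xs, ys):
    """Least-squares intercept a and slope b for predicting ys from xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Made-up data for illustration
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

a_yx, b_yx = fit_lsrl(x, y)  # x explanatory, y response
a_xy, b_xy = fit_lsrl(y, x)  # roles switched: y explanatory, x response

print(f"y-hat = {a_yx:.2f} + {b_yx:.2f}x")
print(f"x-hat = {a_xy:.2f} + {b_xy:.2f}y")

# If switching roles gave the same line, the second slope would be 1 / b_yx
print(round(1 / b_yx, 3), round(b_xy, 3))
```

The two lines are different because the first one minimizes squared errors in the y direction and the second minimizes squared errors in the x direction. Both still pass through (mean of x, mean of y).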
Part 3: Review about correlation and regression
- correlation and LSRL only describe linear relationships; while they can be made for other relationships, they are only useful, and therefore reliable, for linear relationships
- extrapolation produces unreliable results
- r and r² are not resistant
- LSRL is not resistant
- can be significantly changed by influential points, which are usually points that are extreme in the x direction with no other points near them; these points pull the LSRL towards themselves
- outliers and influential points: what are they in LSRL
- outliers: any point that lies outside the overall pattern of the relationship; some can have large residuals, and some can have small residuals
- influential points: if we remove these points, it can significantly change the LSRL model; these points are actually usually outliers in the x-direction (too far to the right, or too far to the left)
- the only way to make sure a point is influential is to make an LSRL with the suspected point, then an LSRL without the point; if the change is significant, then the point is influential
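The "with the point, then without the point" check can be sketched in Python like this. The data are made up (the last point is deliberately far out in the x direction), and `fit_lsrl` is a helper name invented for these notes.

```python
def fit_lsrl(xs, ys):
    """Least-squares intercept a and slope b for y-hat = a + bx."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    return y_bar - b * x_bar, b

# Made-up data; the suspected point (15, 3) is extreme in the x direction
x = [1, 2, 3, 4, 5, 15]
y = [2, 4, 5, 4, 5, 3]

a_with, b_with = fit_lsrl(x, y)                  # LSRL with the suspected point
a_without, b_without = fit_lsrl(x[:-1], y[:-1])  # LSRL without it

print(f"with:    y-hat = {a_with:.2f} + {b_with:.2f}x")
print(f"without: y-hat = {a_without:.2f} + {b_without:.2f}x")
```

In this made-up example the slope drops from clearly positive to roughly flat when the point is included, so the point is influential. How big a change counts as "significant" is a judgment call that depends on the context of the data.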
- lurking variables
- relationships between two variables are often understood only when considering other variables that are not initially present/stated in the data; these “other variables” are called lurking variables
- they are variables that are neither the response nor explanatory variables in the LSRL; they should be considered before making conclusions based on correlations or regressions
- when writing an answer, for either a test or the AP Exam, NEVER say this variable “causes” that variable to change
- say that the variables are associated; remember, association DOES NOT automatically mean causation
- correlations based on averaged data
- be careful; correlations based on averages (meaning that one of the variables states the average ___ ) are usually much higher than correlations based on data for individuals, because averaging hides the variation among individuals, so they are not reliable for describing individuals
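A small Python sketch of why averaged data give higher correlations. Everything here is made up for illustration: four x-values with three individuals each, and a helper `correlation` written for these notes.

```python
from statistics import mean

def correlation(xs, ys):
    """Correlation r from sums of squares and products."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up data: for each x-value, the y-values of three individuals
groups = {1: [1, 4, 10], 2: [3, 8, 10], 3: [4, 9, 11], 4: [6, 10, 17]}

# Correlation using every individual
ind_x = [x for x, ys in groups.items() for _ in ys]
ind_y = [y for ys in groups.values() for y in ys]
r_individuals = correlation(ind_x, ind_y)

# Correlation using only the average y for each x-value
avg_x = list(groups.keys())
avg_y = [mean(ys) for ys in groups.values()]
r_averages = correlation(avg_x, avg_y)

print(round(r_individuals, 3), round(r_averages, 3))
```

The correlation of the averages is much closer to 1 because the spread among individuals at each x-value disappears once they are averaged.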
More vocab and context for ch 3