AP Statistics

AP Statistics Ch 3

Things to know before starting this guide...

-response variables: the outcome of the study (y)

-explanatory variables: the variables that explain or influence changes in a response variable (x)

IMPORTANT!!! Calling a variable explanatory DOES NOT mean that it causes the change in the response variable.

As for solving the problems, the procedure is more or less the same: plot data, look for patterns and deviations from those patterns. This time, compose an equation to describe the pattern.

Part 1: Scatterplots and Correlation

most effective way to display the relationship between two quantitative variables is to create a scatterplot

both variables must be measured on the same individuals

to interpret scatterplots...

look at overall pattern and deviations (outliers), then describe direction (positive or negative association), form, and strength

if y tends to increase as x increases, then it has a positive association; if y tends to decrease as x increases, then it has a negative association
the form can be linear (quite common), parabolic, etc...
the strength can be strong or weak or relatively strong...

to add a categorical variable to the graph, use a different shape or color for each category
strength (and direction) of a linear relationship is described by correlation r

can be anywhere from -1 to 1; r = -1 and r = 1 are the strongest; the closer r is to 0, the weaker the linear relationship; also, r can equal -1 or 1 only if the relationship is perfectly linear
negative associations have negative correlation; positive associations have positive correlation
can be given by the calculator
can only be trusted and used when the relationship is linear (if it isn’t, we will have to transform the data to make the relationship linear, which will be explained in the ch 4 notes)
it is the same no matter which variable you call explanatory and which you call response
does not change when we change the units of measurement, and r itself has no unit of measurement; it is just a number
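
A minimal Python sketch of these properties, using made-up data (the hours/score numbers and the helper name `correlation` are assumptions for illustration only). It computes r from its definition, then checks that r stays the same when x and y swap roles and when the units of x change.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Correlation r of paired quantitative data (illustrative helper)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: hours studied (x) and quiz score (y)
hours = [1, 2, 3, 4, 5]
score = [2, 4, 5, 4, 5]

r = correlation(hours, score)              # about 0.77
r_swapped = correlation(score, hours)      # swap explanatory and response
minutes = [60 * h for h in hours]          # same data, different units
r_minutes = correlation(minutes, score)

print(round(r, 4), round(r_swapped, 4), round(r_minutes, 4))
```

All three values agree, which is why r has no units and does not care which variable is called x.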

when can we use correlation

as stated above, only trust, and therefore use, r when the relationship is linear; r does not describe a curved relationship, no matter how strong that relationship is
requires the variables to be quantitative
it is not resistant (outliers can change it a lot), and is not a complete summary of the two-variable data

need to give the means and standard deviations of both x and y along with r for the summary of the data to be complete

Part 2: Least Squares Regression

when a relationship is linear, summarize using a regression line (a model for the data)

can only be used when we have an explanatory and response variable
describes how the response variable changes as the explanatory variable changes
can be written as the equation y = a + bx

basically the same as y = mx + b, but statisticians use the y = a + bx version

how to interpret

b is the slope, which means that each time x increases by one unit, the predicted y changes by b (goes up if b is positive, down if b is negative)

can’t determine how big of an effect explanatory variable has on response variable by just looking at the slope, because the size of the slope depends on the units of measurement

a is the y-int, which means the predicted y = a when x = 0

it is possible for the y-int to not make sense, such as in the equation y = 23 + 4x for apples harvested, where y = apples harvested from the apple trees and x = number of apple trees harvested. The y-int does not make sense because it says that if we harvested 0 apple trees, we would still have harvested 23 apples

predicting based on the line

we can predict based on the line, but the prediction can be inaccurate, and therefore untrustworthy, when we extrapolate, meaning we use an x that is outside the range of x-values used to make the line
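
A small Python sketch of predicting from a line while flagging extrapolation. The line y-hat = 23 + 4x is the apple example from these notes; the observed range of 5 to 40 trees and the function name `predict` are assumptions made up for this example.

```python
def predict(a, b, x, x_min, x_max):
    """Return (y_hat, is_extrapolation) for the line y-hat = a + b*x.

    x_min and x_max are the smallest and largest x-values in the data
    that were used to make the line.
    """
    y_hat = a + b * x
    is_extrapolation = x < x_min or x > x_max
    return y_hat, is_extrapolation

# Assumed data range: the line was made from orchards with 5 to 40 trees
inside = predict(23, 4, 10, x_min=5, x_max=40)    # (63, False)
outside = predict(23, 4, 100, x_min=5, x_max=40)  # (423, True)

print(inside, outside)
```

The second prediction still gives a number, but the flag warns that x = 100 is far outside the data, so the number should not be trusted.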

least-squares regression line

regression lines drawn by eye vary from person to person, so we need one line that makes the errors in predicting y as small as possible >>> need least-squares regression line (LSRL)

LSRL makes the sum of the squared vertical distances of the points from the line (the squared y-value prediction errors) as small as possible
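
A rough Python check of the least-squares idea, using made-up data: the LSRL should give a smaller sum of squared vertical distances than any nearby line. The helper names `lsrl` and `sse` and the data are assumptions for this sketch.

```python
from statistics import mean

def lsrl(xs, ys):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    a = y_bar - b * x_bar
    return a, b

def sse(a, b, xs, ys):
    """Sum of squared vertical distances from the points to the line."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl(xs, ys)            # a is about 2.2, b is about 0.6
best = sse(a, b, xs, ys)       # about 2.4

# Nudge the intercept and slope a little and recompute the sum of squares
others = [sse(a + da, b + db, xs, ys)
          for da in (-0.5, 0, 0.5)
          for db in (-0.2, 0, 0.2)
          if (da, db) != (0, 0)]

print(best, min(others))
```

Every nudged line has a larger sum of squares than the LSRL.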

how to find equation

y-hat = a + bx

b = (correlation) x (standard deviation of y)/(standard deviation of x)
line has to pass through point (mean of x, mean of y), so a = (mean of y) - b x (mean of x)
IMPORTANT! Be sure to include the y-hat symbol in any FRQ that asks for predictions! The y-hat shows that the calculated value is only a prediction and is not the actual observed response y

this equation can also be found by your calculator
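
For reference, the slope and intercept formulas above can be sketched in Python from the summary statistics. The data are made up, and the names (`s_x`, `s_y`, `y_hat`) are just labels chosen for this example.

```python
from statistics import mean, stdev

# Made-up data
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

n = len(xs)
x_bar, y_bar = mean(xs), mean(ys)
s_x, s_y = stdev(xs), stdev(ys)     # sample standard deviations

# Correlation: average product of the standardized x and y values
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(xs, ys)) / (n - 1)

b = r * s_y / s_x                   # slope = r * (s_y / s_x)
a = y_bar - b * x_bar               # line passes through (x_bar, y_bar)

def y_hat(x):
    """Predicted response for a given x (a prediction, not an observed y)."""
    return a + b * x

print(round(a, 3), round(b, 3), round(y_hat(x_bar), 3))
```

Here the line works out to about y-hat = 2.2 + 0.6x, and plugging in the mean of x gives back the mean of y.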

what if an FRQ gives you a table of computer regression output (a chart full of numbers with a row for the constant and a row for the predictor, plus R-sq and R-sq (adj))?

then, the value of the slope will be the coefficient of the predictor that isn’t the constant, and the y-int will be the coefficient of the constant
the percent of the variation in y that is explained by the equation is given by the R-sq, NOT the R-sq (adj)

residuals

we can spot deviations by looking at how the data points are placed around the LSRL; deviations from the LSRL are supposed to be small

these deviations are called residuals, the difference between the observed value of the response variable and the value predicted by the LSRL

residual = y - “y-hat”

sum of the residuals from the LSRL of a set of data is always zero
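
A short Python sketch of the residual formula, using made-up data and an assumed helper `lsrl`. It computes residual = y - y-hat for every point and checks that the residuals add up to zero (apart from tiny rounding error).

```python
from statistics import mean

def lsrl(xs, ys):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a, b = lsrl(xs, ys)

# residual = observed y minus predicted y
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
total = sum(residuals)

print([round(e, 2) for e in residuals], round(total, 10))
```

A positive residual means the point is above the line; a negative residual means it is below.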

residual plot

makes it easier to study residuals
can be made by calculator

http://answers.yahoo.com/question/index?qid=20101018124256AAtMaLC

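if the residual plot has a clear pattern (such as a curve) or is fan-shaped, then a linear model is not a good model; a good model leaves random scatter around 0
can plot the residuals against either the explanatory values or the predicted y-values; the pattern will be the same either way (just flipped left to right if the slope is negative) because y-hat is linearly related to x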
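
To see what a "clear pattern" looks like in numbers, here is a Python sketch that fits a line to made-up curved data (y = x squared) and lists the residuals in order of x. The data and the helper name `lsrl` are assumptions for illustration.

```python
from statistics import mean

def lsrl(xs, ys):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Curved (not linear) made-up data: y = x^2
xs = [1, 2, 3, 4, 5]
ys = [1, 4, 9, 16, 25]

a, b = lsrl(xs, ys)                               # y-hat = -7 + 6x
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
signs = ["+" if e > 0 else "-" for e in residuals]

print([round(e, 2) for e in residuals], signs)
```

The residuals go positive, negative, negative, negative, positive, which is a U shape instead of random scatter, so a line is not a good model for these data.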

r² (shown as R-sq in FRQs): how well the line fits the data

numerical way to see if LSRL is good model
called coefficient of determination
shows what percent of the variation in y is explained by the LSRL

in other words, r² is a tool to see how well the LSRL fits the data, so the closer r² is to 1 (if calculator gives the 0.xxxx version) or 100%, the better the LSRL fits

so basically, r² is a measure of the strength of the LSRL
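
One way to see where the coefficient of determination comes from, sketched in Python with made-up data: compare the squared prediction errors of the LSRL with the squared errors from just guessing the mean of y every time. The variable names (`sse`, `sst`) are labels chosen for this example.

```python
from math import sqrt
from statistics import mean

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

x_bar, y_bar = mean(xs), mean(ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)

b = sxy / sxx
a = y_bar - b * x_bar

sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))  # errors using the LSRL
sst = syy                                                  # errors using only the mean of y

r_squared = 1 - sse / sst       # fraction of the variation in y explained
r = sxy / sqrt(sxx * syy)       # correlation, for comparison

print(round(r_squared, 4), round(r ** 2, 4))
```

Both numbers come out to about 0.6, so roughly 60% of the variation in y is explained by the line, and squaring the correlation gives the same value.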

Using LSRL

which variable we assign as explanatory or response DOES matter! If we switch the roles of the variables, the LSRL will be different!
remember that the slope of the LSRL and the correlation are closely related

slope = (correlation) x (standard deviation of y)/(standard deviation of x)

remember that LSRL ALWAYS passes through point (mean of x, mean of y)
correlation r describes only straight-line relationships

to describe how well the LSRL fits the data, use r²!
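
The first point in this list (switching explanatory and response changes the LSRL) can be checked with a quick Python sketch. The data and the helper name `lsrl` are made up for illustration.

```python
from statistics import mean

def lsrl(explanatory, response):
    """Intercept and slope of the least-squares line for predicting response."""
    e_bar, r_bar = mean(explanatory), mean(response)
    slope = (sum((e - e_bar) * (r - r_bar) for e, r in zip(explanatory, response))
             / sum((e - e_bar) ** 2 for e in explanatory))
    return r_bar - slope * e_bar, slope

xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

a_yx, b_yx = lsrl(xs, ys)   # predict y from x: y-hat = 2.2 + 0.6x
a_xy, b_xy = lsrl(ys, xs)   # predict x from y: x-hat = -1 + 1y

print(round(a_yx, 3), round(b_yx, 3), round(a_xy, 3), round(b_xy, 3))
```

Solving the second line for y gives y = 1 + x, which is not the same line as y-hat = 2.2 + 0.6x, even though the correlation is the same in both directions.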

Part 3: Review about correlation and regression

correlation and LSRL only describe linear relationships; while they can be calculated for other relationships, they are only useful, and therefore reliable, for linear relationships
extrapolation produces unreliable results
r and r² are not resistant
LSRL is not resistant

can be significantly changed by influential points, which are often points that are extreme in the x direction with no other points near them; these points pull the LSRL towards themselves

outliers and influential points: what are they in LSRL

outliers: any point that lies outside the overall pattern of the relationship; some can have large residuals, and some can have small residuals
influential points: points that significantly change the LSRL model if we remove them; these points are usually outliers in the x-direction (too far to the right, or too far to the left)

the only way to make sure a point is influential is to find the LSRL with the suspected point, then the LSRL without the point; if the change is significant, then the point is influential
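
The with-and-without check described above can be sketched in Python. The data are made up, and the suspected point (20, 2) was chosen for this example because it sits far to the right of the other x-values.

```python
from statistics import mean

def lsrl(xs, ys):
    """Intercept a and slope b of the least-squares line y-hat = a + b*x."""
    x_bar, y_bar = mean(xs), mean(ys)
    b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
         / sum((x - x_bar) ** 2 for x in xs))
    return y_bar - b * x_bar, b

# Made-up data; the last point is the suspected influential point
xs = [1, 2, 3, 4, 5, 20]
ys = [2, 4, 5, 4, 5, 2]

a_with, b_with = lsrl(xs, ys)                   # LSRL with the suspected point
a_without, b_without = lsrl(xs[:-1], ys[:-1])   # LSRL without it

print(round(b_with, 3), round(b_without, 3))
```

Without the point the slope is about 0.6; with it the slope drops to about -0.09. One point flips the direction of the line, so it is clearly influential.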

lurking variables

relationships between two variables are often understood only when considering other variables that are not initially present/stated in the data; these “other variables” are called lurking variables

they are variables that are neither the response nor explanatory variables in the LSRL; they should be considered before making conclusions based on correlations or regressions

when writing an answer, for either a test or the AP Exam, NEVER say this variable “causes” that variable to change

say that the variables are associated; remember, association DOES NOT automatically mean causation

correlations based on averaged data

not reliable for describing individuals; correlations based on averages (meaning that one of the variables states the average ___ of a group) are usually much higher than correlations based on the data for the individuals, because averaging hides the variation among individuals
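
A small Python sketch of this effect with made-up data: nine individuals in three groups (x = 1, 2, 3). It compares the correlation for the individuals with the correlation after replacing each group by its average y. The helper name `correlation` and all numbers are assumptions for illustration.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Correlation r of paired quantitative data (illustrative helper)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data for individuals
x_individual = [1, 1, 1, 2, 2, 2, 3, 3, 3]
y_individual = [0, 1, 2, 1, 2, 3, 2, 3, 4]

# Average the y-values within each x group
groups = {}
for x, y in zip(x_individual, y_individual):
    groups.setdefault(x, []).append(y)
x_group = sorted(groups)
y_group_mean = [mean(groups[x]) for x in x_group]

r_individuals = correlation(x_individual, y_individual)   # about 0.71
r_averages = correlation(x_group, y_group_mean)           # 1.0

print(round(r_individuals, 4), round(r_averages, 4))
```

The averages fall exactly on a line (r = 1), while the individuals only give r of about 0.71, so the averaged data overstate how strong the relationship is for individuals.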

More vocab and context for ch 3

http://quizlet.com/24821872/cloud-ap-statistics-chapter-3-flash-cards/


Sunday, January 5, 2014
