Sunday, January 5, 2014

AP Statistics Ch 3

Things to know before starting this guide...

-response variable: measures the outcome of the study (y)
-explanatory variable: the variable that explains or influences changes in the response variable (x)
IMPORTANT!!! Calling a variable explanatory DOES NOT mean that it causes the change in the response variable.
As for solving the problems, the procedure is more or less the same: plot data, look for patterns and deviations from those patterns. This time, compose an equation to describe the pattern.

Part 1: Scatterplots and Correlation

  • most effective way to display relationship between two variables is to create a scatterplot
    • a scatterplot shows the relationship between two quantitative variables measured on the same individuals
  • to interpret scatterplots...
    • look at overall pattern and deviations (outliers), then describe direction (positive or negative association), form, and strength
      • if slope is positive, then it has a positive association; if slope is negative, then it has a negative association
      • the form can be linear (quite common), parabolic, etc...
      • the strength can be strong or weak or relatively strong...
  • to add another categorical variable into the graph, do so by using something different, like a different shape or color
  • direction and strength of a linear relationship are measured by the correlation r
    • r can be anywhere from -1 to 1; values near -1 or 1 are the strongest, and the closer r is to 0, the weaker the linear relationship; r equals exactly -1 or 1 only if the points lie perfectly on a straight line
    • a negative association has a negative correlation; a positive association has a positive correlation
    • can be given by the calculator
    • can only be trusted and used when the relationship is linear (if it isn’t, we will have to convert the relationship to linear, which will be explained in the ch 4 notes)
    • it is the same no matter which variable you call explanatory and which you call response
    • does not change when we change the units of measurement, and r itself has no unit of measurement; it is just a number
  • when can we use correlation
    • as stated above, only trust, and therefore use, when relationship is linear, no matter how strong the relationship is
    • requires the variables to be quantitative
    • it is not resistant (outliers can change it a lot), and it is not a complete summary of two-variable data
      • for a complete summary, give the means and standard deviations of both x and y along with r
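
To see the facts about r in action, here is a minimal Python sketch. It assumes r is computed as the sum of the products of the z-scores divided by n - 1 (the usual textbook formula). The function name and the height/weight numbers are made up for illustration. It checks that r stays the same when x and y are swapped and when the units change.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    n = len(xs)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (n - 1)

# Made-up data, for illustration only
heights_in = [60, 62, 65, 68, 70, 72]
weights_lb = [110, 120, 140, 155, 160, 180]

r = correlation(heights_in, weights_lb)

# Swapping explanatory and response does not change r
r_swapped = correlation(weights_lb, heights_in)

# Changing units (inches to centimeters) does not change r
heights_cm = [h * 2.54 for h in heights_in]
r_cm = correlation(heights_cm, weights_lb)

print(round(r, 4), round(r_swapped, 4), round(r_cm, 4))
```

All three printed values should match, and r is positive because weight goes up as height goes up in this made-up data.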

Part 2: Least Squares Regression

  • when a relationship is linear, summarize using a regression line (a model for the data)
    • can only be used when we have an explanatory and response variable
    • describes how the response variable changes as the explanatory variable changes
    • can compose with the equation y = a + bx
      • basically the same as y = mx + b from algebra, but statisticians write y = a + bx, so here b is the slope and a is the y-int
  • how to interpret
    • b is the slope: each time x increases by one unit, the predicted y changes by b (it goes up if b is positive and down if b is negative)
      • can’t tell how strong the relationship is just by looking at the size of the slope, because the slope depends on the units of x and y
    • a is the y-int: the predicted value of y when x = 0
      • the y-int does not always make sense; for example, take y = 23 + 4x, where y = number of apples harvested and x = number of apple trees harvested from; the y-int says that harvesting from 0 trees gives 23 apples, which is impossible
  • predicting based on the line
    • we can predict y by plugging x into the line, but the prediction is untrustworthy when we extrapolate, meaning we use an x outside the range of x-values that the line was made from
  • least-squares regression line
    • regression lines drawn by eye vary from person to person, so their predictions of y-values will have different errors >>> need least-squares regression line (LSRL)
      • LSRL makes the sum of the squared vertical distances from the points to the line (the squared y-value prediction errors) as small as possible
    • how to find equation
      • y-hat = a + bx
        • b = (correlation) x (standard deviation of y)/(standard deviation of x)
        • line has to pass through point (mean of x, mean of y), so a = (mean of y) - b x (mean of x)
        • IMPORTANT! Be sure to include the y-hat sign on any FRQ that asks for predictions! The y-hat shows that the calculated value is only a prediction and is not the actual observed value of y
      • this equation can also be found by your calculator
    • what if an FRQ gives you computer regression output (a table full of numbers) instead of the data
      • then the slope is the coefficient of the predictor that isn’t the constant, and the y-int is the coefficient of the constant
      • the percent of the variation in y that is explained by the equation is given by the R-sq, NOT the R-sq (adj)
  • residuals
    • we can see deviations by looking at how far the data points are from the LSRL; for a good model, these deviations should be small
      • these deviations are called residuals, the difference between observed value of response variable and the value predicted by LSRL
        • residual = y - “y-hat”
    • the sum (and so the mean) of the residuals from the LSRL is always zero
  • residual plot
    • makes it easier to study residuals
    • can be made by calculator
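    • if the residual plot shows random scatter around zero, a linear model fits well; if it has a clear pattern (like a curve), a linear model is not a good model; if it is fan-shaped, predictions will be less accurate where the residuals spread out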
    • can plot residual against either explanatory value or predicted y-values; shape will be the same either way because y-hat is linearly related to x
  • r² (can be described as R-sq in FRQs): how well line fits data
    • numerical way to see if LSRL is good model
    • called coefficient of determination
    • shows what percent of the variation in y is explained by the LSRL
      • in other words, r² is a tool to see how well the LSRL predicts, so the closer r² is to 1 (if calculator gives the 0.xxxx version) or 100%, the better the LSRL fits
    • so basically, r² is a measure of strength of the LSRL
  • Using LSRL
    • which variable we assign as explanatory or response DOES matter! If we switch the roles of the variables, LSRL will be different!
    • remember that the slope of LSRL and the correlation are closely related
      • slope = (correlation) x (standard deviation of y)/(standard deviation of x)
    • remember that LSRL ALWAYS passes through point (mean of x, mean of y)
    • correlation r describes only straight line relationships
      • to describe how well the LSRL fits the data, use r²!
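
The Part 2 formulas can be tried out with a short Python sketch. It builds the LSRL from slope = r x (standard deviation of y)/(standard deviation of x) and the fact that the line passes through (mean of x, mean of y). Then it checks that the residuals sum to zero, that 1 - (sum of squared residuals)/(total squared variation in y) equals r squared, and that switching x and y gives a different line. The function names and the hours/scores data are made up for illustration.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Correlation r: sum of z-score products divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + bx."""
    b = correlation(xs, ys) * stdev(ys) / stdev(xs)  # slope = r * s_y / s_x
    a = mean(ys) - b * mean(xs)  # line passes through (x-bar, y-bar)
    return a, b

# Made-up data, for illustration only
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 60, 57, 70, 68, 80]

a, b = lsrl(hours, scores)
predicted = [a + b * x for x in hours]                       # y-hat values
residuals = [y - y_hat for y, y_hat in zip(scores, predicted)]

# r-squared as the fraction of variation in y explained by the line
sse = sum(e ** 2 for e in residuals)
sst = sum((y - mean(scores)) ** 2 for y in scores)
r_squared = 1 - sse / sst

# Switching explanatory and response gives a different line
a_swapped, b_swapped = lsrl(scores, hours)

print("y-hat =", round(a, 3), "+", round(b, 3), "x")
print("sum of residuals:", round(sum(residuals), 6))
print("r-squared:", round(r_squared, 4))
print("slope after swapping:", round(b_swapped, 4), "vs 1/b:", round(1 / b, 4))
```

If the swapped line were just the original line solved for x, its slope would be 1/b. It is not, which is why the choice of explanatory variable matters.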

Part 3: Review about correlation and regression
  • correlation and LSRL only describe linear relationships; while they can be made for other relationships, they are only useful, and therefore reliable, for linear relationships
  • extrapolation produces unreliable results
  • r and r² are not resistant
  • LSRL is not resistant
    • can be significantly changed by influential points, which are often points that are extreme in the x direction with no other points near them; these points pull the LSRL towards themselves
  • outliers and influential points: what are they in LSRL
    • outliers: any point that lies outside the overall pattern of the relationship; some can have large residuals, and some can have small residuals
    • influential points: if we remove these points, it can significantly change the LSRL model; these points are actually usually outliers in the x-direction (too far to the right, or too far to the left)
      • the only way to make sure a point is influential is to make an LSRL with the suspected point, then an LSRL without the point; if the change is significant, then the point is influential
  • lurking variables
    • relationships between two variables are often understood only when considering other variables that are not initially present/stated in the data; these “other variables” are called lurking variables
      • they are variables that are neither the response nor explanatory variables in the LSRL; they should be considered before making conclusions based on correlations or regressions
  • when writing an answer, for either a test or the AP Exam, NEVER say this variable “causes” that variable to change
    • say that the variables are associated; remember, association DOES NOT automatically mean causation
  • correlations based on averaged data
    • can be misleading; correlations based on averages (where a variable is the average ___ for a group) are usually much higher than correlations based on the data for individuals, so don’t apply them to individuals
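
The influential point check from Part 3 (fit the LSRL with the suspected point, then without it) can be sketched in Python like this. The data is made up for illustration, with one suspected point placed far to the right of the others, and the helper name lsrl is just a label for this sketch. The slope here is computed from sums of deviations, which gives the same value as r x (standard deviation of y)/(standard deviation of x).

```python
from statistics import mean

def lsrl(xs, ys):
    """Return (a, b) for y-hat = a + bx using sums of deviations."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx            # same slope as r * s_y / s_x
    a = y_bar - b * x_bar    # line passes through (x-bar, y-bar)
    return a, b

# Made-up data, for illustration only
xs = [1, 2, 3, 4, 5, 6]
ys = [52, 60, 57, 70, 68, 80]

# Suspected point: extreme in the x direction, no other points near it
suspect = (20, 60)

a_without, b_without = lsrl(xs, ys)
a_with, b_with = lsrl(xs + [suspect[0]], ys + [suspect[1]])

print("without point: y-hat =", round(a_without, 2), "+", round(b_without, 2), "x")
print("with point:    y-hat =", round(a_with, 2), "+", round(b_with, 2), "x")
```

In this made-up data, the slope drops a lot once the far-right point is included, so that point would count as influential.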

More vocab and context for ch 3

