Wednesday, February 5, 2014

AP Statistics Ch 10: Estimating with Confidence


Before starting, you should know...
  • statistical inference: methods for drawing conclusions about a population based on data from a sample
  • since different samples can lead to different conclusions, we can’t be sure our conclusion is correct; we can only use probability to express how strong or weak the conclusion is
  • two types of statistical inference
    • confidence intervals (ch 10)
      • estimate value of population parameter
    • significance tests (ch 11)
      • assess evidence for claim about population
    • both report probabilities that describe what would happen if we used the inference method many times
    • both rely on the predictable long-run behavior that shows up over many trials
    • both are most reliable when the sample or experiment is randomized


Part 1: Confidence intervals: the basics


  • sample mean will vary slightly if we take different samples of the same size from the same population
  • what the mean of a sample tells us about the mean of the population: the confidence interval
    • rare for mean of sample (x-bar) to be exactly equal to mean of population (mu), so there will be some error in the estimate; the sampling distribution of x-bar tells us how big that error is likely to be
      • distribution of x-bar gets close to Normal in large samples, so use the 68-95-99.7 rule: find the standard deviation of x-bar, which is o-- / sqrt(n) (o-- = sigma, the population standard deviation); x-bar is within 2 of these standard deviations of mu in 95% of samples, which means mu is within 2 standard deviations of x-bar in those same samples
        • in 95% of samples, mu is between x-bar - 2 x o-- / sqrt(n) and x-bar + 2 x o-- / sqrt(n)
          • this interval is called a 95% confidence interval; a confidence interval: range of numbers computed from a sample by a method that captures the parameter in a certain percentage of all samples
          • so it is very likely, but not certain, that our sample has mu between x-bar - 2 x o-- / sqrt(n) and x-bar + 2 x o-- / sqrt(n)
      • narrows the estimate down to an interval that contains mu for 95% of all samples; its width shows the error we can expect when estimating mu
    • for 5% of samples, mu will not be in the confidence interval, so we are 95% confident that mu is in the interval, aka we used a method that captures mu for 95% of samples
      • don’t know whether the sample we got is one of the 5% or the 95%
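
The "95% of samples" idea above can be checked with a small simulation. This is only a sketch: the population values (mu = 50, o-- = 10), the sample size, and the function name `coverage` are made up for illustration, and the population is assumed to be exactly Normal.

```python
import random
from math import sqrt

def coverage(mu=50.0, sigma=10.0, n=25, reps=10_000, seed=1):
    """Fraction of simulated samples whose interval
    x-bar +/- 2 * sigma / sqrt(n) contains the true mean mu."""
    rng = random.Random(seed)           # fixed seed so the run is repeatable
    half_width = 2 * sigma / sqrt(n)    # 2 standard deviations of x-bar
    hits = 0
    for _ in range(reps):
        # draw one SRS of size n from a Normal population and average it
        xbar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if xbar - half_width <= mu <= xbar + half_width:
            hits += 1
    return hits / reps

result = coverage()
print(result)  # should land close to 0.95
```

Each individual interval either contains mu or it doesn't; the 95% describes how often the method works over many samples.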
  • Level C confidence interval
    • two parts: confidence interval and confidence level C
      • confidence interval: has the form estimate + / - margin of error
        • + / - = plus or minus
      • confidence level C: usually 90% or higher
        • expressed in decimal form when finding z* (shown below) or t* (shown in part 2)
  • calculating confidence interval for mu when we know the standard deviation o-- but not the mean
    • only construct confidence intervals when data is from SRS, sampling distribution of x-bar is approximately Normal, and individual observations are independent
      • SRS: data comes from SRS of size n from the population of interest
      • Normal: sampling distribution of the sample mean is approximately Normal; true if the population is Normal, or if the sample is large (n at least 30, by the Central Limit Theorem)
      • Independence: required to use the standard deviation formula o-- / sqrt(n); observations are independent if sampling w/ replacement (rare), or sampling w/o replacement when population is at least ten times as large as sample
    • first, use Normal curve and z-scores; the z-score we need is called z* in this situation
      • determine the confidence level C; the area left over is 1 - C, so each tail gets area (1 - C)/2
      • look up z* in Table A; mark it on the Normal curve; since Normal curve is symmetric, the matching value on the other side is the negative of the first one
        • second value = -z*, drawn to the left of the mean
        • the area between -z* and z* is the confidence level C
      • you now have the critical value z* (and its mirror image -z*)
  • now the Level C confidence interval for mu is
    • from x-bar - z* x o-- / sqrt(n) to x-bar + z* x o-- / sqrt(n)
      • z* is chosen so that the area between -z* and z* is C
    • exact when the population distribution is exactly Normal; only approximate otherwise, and then the sample needs to be large enough for the distribution of x-bar to be close to Normal
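
As a rough sketch of the calculation above, the code below gets z* from Python's standard library instead of Table A and builds the interval. The function names `z_star` and `z_interval` and the example numbers (x-bar = 100, o-- = 15, n = 36) are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist

def z_star(conf_level):
    """Critical value z* that leaves area (1 - C)/2 in each tail
    of the standard Normal curve."""
    return NormalDist().inv_cdf(1 - (1 - conf_level) / 2)

def z_interval(xbar, sigma, n, conf_level=0.95):
    """Level C interval for mu when sigma (o--) is known:
    x-bar +/- z* x sigma / sqrt(n)."""
    moe = z_star(conf_level) * sigma / sqrt(n)   # margin of error
    return xbar - moe, xbar + moe

print(round(z_star(0.90), 3), round(z_star(0.95), 3), round(z_star(0.99), 3))
low, high = z_interval(xbar=100, sigma=15, n=36, conf_level=0.95)
print(low, high)  # roughly 95.1 to 104.9
```

The three printed critical values should match the usual table values for 90%, 95%, and 99% confidence (about 1.645, 1.960, and 2.576).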
  • how to calculate Level C confidence interval based on data
    • identify the population and the parameter you want to know about
    • check the conditions and decide which method you will use to construct the confidence interval
    • carry out the calculations
    • interpret the results in context
      • sentence structure: We are ___ % confident that the true mean of the ______ is between ___ and ___
  • behavior of confidence intervals
    • margin of error (MOE) changes as choice of confidence level changes
    • MOE gets smaller when...
      • z* is smaller, which means a lower confidence level
      • o-- gets smaller, but that is very difficult to do since o-- is a property of the population
      • n gets larger; however, because of the square root, we need four times as many observations to cut the MOE in half
  • determining sample size
    • w/ enough observations, can get both high confidence and small MOE
    • to find the sample size necessary, set (desired z*) x (o--) / sqrt(n) less than or equal to the specified margin of error m
      • solving for n gives n greater than or equal to (z* x o-- / m)^2
    • always round up when dealing with sample size
    • the sample size determines the margin of error; the population size doesn’t (as long as the population is much larger than the sample)
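
The margin of error behavior and the sample size formula can be tried out in a few lines. This is a sketch with made-up numbers (o-- = 15, desired margin of error m = 3); `margin_of_error` and `sample_size_for_mean` are illustrative names, not from the textbook.

```python
from math import ceil, sqrt
from statistics import NormalDist

def margin_of_error(sigma, n, conf_level=0.95):
    """MOE = z* x sigma / sqrt(n) for a known sigma (o--)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return z * sigma / sqrt(n)

def sample_size_for_mean(sigma, m, conf_level=0.95):
    """Smallest n with z* x sigma / sqrt(n) <= m,
    i.e. n >= (z* x sigma / m)^2, always rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil((z * sigma / m) ** 2)

# Quadrupling n (25 -> 100) cuts the margin of error in half.
print(margin_of_error(15, 25), margin_of_error(15, 100))

# Observations needed for a margin of error of at most 3 at 95% confidence.
n_needed = sample_size_for_mean(sigma=15, m=3)
print(n_needed)
```

Rounding up matters: one observation fewer than `n_needed` leaves the margin of error slightly above m.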
  • remember
    • data must be from an SRS of the population, formed by random selection (applies to random observations too); the method discussed here can’t be used on samples formed by more complicated methods than SRS
    • formulas cannot rescue badly collected data and can’t produce trustworthy conclusions from that data; outliers can also distort the results, so look for them and correct them or justify removing them before computing the interval
    • shape of population distribution can affect results; the level C confidence interval depends on the distribution of x-bar being Normal, so with a small sample from a skewed or otherwise non-Normal population, the true confidence level will differ from the one you calculate; however, when sample size is at least 15, the confidence level is not affected much by non-Normal shapes unless there are extreme outliers or strong skewness
    • standard deviation must be known
  • MOE (margin of error) tells how much error to expect b/c of chance variation; it only covers random sampling error and will not fix bias, non-response, and other practical errors
  • every method of inference has some kind of warning and condition
    • conditions rarely fully met when dealing with these problems in real life; data should be judged and analyzed first
  • just because we are x% confident, we cannot say that there is an x% chance that mu is in the interval we calculated; mu either is or is not in that interval, so the probability that mu is in it is 0 or 1
    • once we have computed an interval, no randomness remains; the x% describes how often the method works over many samples


Part 2: Estimating population mean


  • do not know o-- but still want to calculate a confidence interval
    • need to estimate o-- first; use sample’s standard deviation s to estimate
      • o-- is approximately s
      • change o-- / sqrt(n) to s / sqrt(n)
      • s / sqrt(n) is called the standard error of the sample mean; a standard error is what we get when the standard deviation of a statistic is estimated from data
  • still need to check the same three conditions to use this method: data from SRS, Normality, and independent observations
    • for Normality: if sample size is at least 30, count it as approximately Normal; if not, Normality must either be stated in the problem or checked by graphing the data and looking at the shape
  • t distribution
    • when we substitute s / sqrt(n) for o-- / sqrt(n), the standardized statistic (x-bar - mu) / (s / sqrt(n)) no longer follows a Normal distribution, which matters most when the sample is really small
      • it follows a t distribution, and t* will be our critical value
        • similar in shape to the Normal distribution, but the spread and the area in the tails of a t distribution are greater than those of the Normal distribution
      • gets closer and closer to Normal as sample size (and therefore df) grows
      • extra info: a z-distribution is the distribution when we use z* as our critical value
    • there is a different t distribution for each sample size
    • identified by degrees of freedom, or df; for these procedures df = n - 1
    • use the t-distribution table (Table B critical value t) to look up t*
      • df down the side and confidence level C along the bottom; upper-tail probability (area above the critical value) across the top
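
To see why the heavier tails matter, the simulation sketched below builds intervals x-bar +/- (critical value) x s / sqrt(n) from small samples (n = 5, so df = 4) and counts how often they capture mu. The population (Normal with mu = 0, o-- = 1), the function name `coverage_with_s`, and the seed are assumptions for illustration; 2.776 is the 95% table value of t* for df = 4.

```python
import random
from math import sqrt
from statistics import mean, stdev

def coverage_with_s(critical_value, n=5, mu=0.0, sigma=1.0, reps=10_000, seed=2):
    """Fraction of simulated samples whose interval
    x-bar +/- critical_value * s / sqrt(n) contains mu,
    where s is the sample standard deviation."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        moe = critical_value * stdev(sample) / sqrt(n)
        if abs(mean(sample) - mu) <= moe:
            hits += 1
    return hits / reps

z_cov = coverage_with_s(1.960)   # z* for 95% confidence
t_cov = coverage_with_s(2.776)   # t* for 95% confidence, df = 4
print(z_cov, t_cov)  # z* falls short of 95%; t* lands near 95%
```

Using z* with s makes the intervals too narrow for small samples, which is the gap the larger t* value closes.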
  • one-sample t confidence interval
    • like the level C confidence interval from part 1, with t* in place of z* and s in place of o--
    • equation: x-bar + / - t* (s)/sqrt(n)
      • + / - means plus or minus
    • interval is exact if the population distribution is exactly Normal, and approximately correct for large samples otherwise
    • still use the four steps to calculate interval for data:
      • identify the population and the parameter you want to know about
      • check the conditions and decide which method you will use to construct the confidence interval
      • carry out the calculations
      • interpret the results in context
        • sentence structure: We are ___ % confident that the true mean of the ______ is between ___ and ___
    • general form: estimate + / - t* x (SE of estimate)
      • SE = standard error
    • to find t*, only need to know level C and df
      • if df does not show on table C, choose the greatest df on the table that is less than the actual df >>> gives us a slightly wider confidence interval than we need
    • one-sample problems are less common than comparative ones like paired t procedures, b/c comparative studies are more convincing
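
A minimal sketch of the one-sample t interval follows. The data are invented, the function name `t_interval` is illustrative, and t* is passed in by hand from the t table because Python's standard library has no t distribution (2.365 is the 95% value for df = 7).

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(data, t_star):
    """One-sample t interval: x-bar +/- t* x s / sqrt(n).
    t_star must be looked up for df = n - 1 and the chosen level C."""
    n = len(data)
    xbar = mean(data)
    se = stdev(data) / sqrt(n)    # standard error of the sample mean
    moe = t_star * se
    return xbar - moe, xbar + moe

data = [2, 4, 4, 4, 5, 5, 7, 9]            # made-up SRS, n = 8, so df = 7
low, high = t_interval(data, t_star=2.365)
print(low, high)  # roughly 3.21 to 6.79
```

With these numbers the interpretation sentence would read: we are 95% confident that the true mean is between about 3.21 and 6.79.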
  • paired t procedures
    • compare observations of two treatments in matched pairs design or of before-and-after measurements on same subjects
    • uses one-sample t procedures on the differences within each pair
    • parameter mu is the mean difference in observations between
      • responses to 2 treatments in matched-pair subjects in population
      • responses to 2 treatments given to the same individuals in a population
      • before-and-after measurements of all individuals in population (both measurements carried out on same individual)
      • so calculate the difference for each pair, then use the mean of the differences for x-bar and their standard deviation for s
    • sentence structure: I am ___% confident that the actual mean difference in ______ for the population is between ___ and ___
      • positive and negative numbers do make a difference! keep track of the order of subtraction
    • since many paired t problems do not have samples from an SRS, can often only say the data show evidence of an effect for the subjects studied, and cannot generalize the conclusion to the population
    • diff between random selection and random assignment
      • random selection >>> conclusion about population
      • random assignment >>> shows whether there is evidence treatment caused effect
    • DO NOT calculate as if there are two samples: the pairing means the two sets of observations are not independent of each other, and two-sample methods assume the samples were chosen independently
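
The paired procedure reduces to a one-sample t interval on the differences, which the sketch below makes explicit. The before/after numbers are made up, `paired_t_interval` is an illustrative name, and t* = 2.776 is the 95% table value for df = 4.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_interval(before, after, t_star):
    """Paired t interval for the mean difference (after - before).
    Works on the n differences, so df = n - 1."""
    if len(before) != len(after):
        raise ValueError("paired data need the same number of observations")
    diffs = [a - b for b, a in zip(before, after)]   # one difference per subject
    n = len(diffs)
    d_bar = mean(diffs)                               # plays the role of x-bar
    moe = t_star * stdev(diffs) / sqrt(n)
    return d_bar - moe, d_bar + moe

before = [10, 12, 9, 14, 11]    # hypothetical measurements on 5 subjects
after = [13, 14, 10, 18, 12]    # same subjects, measured again
low, high = paired_t_interval(before, after, t_star=2.776)
print(low, high)  # roughly 0.58 to 3.82
```

Both endpoints are positive here, so the sign tells us the "after" values tend to be larger; swapping the order of subtraction would flip the signs.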
  • robust t procedures
    • since no population in real life is exactly Normal, t confidence intervals are not exact in practice
    • procedures’ usefulness depends on how strongly they are affected by lack of Normality
      • robust: when an inference procedure’s calculations remain accurate and are not much affected when one condition for use is violated >>> confidence interval still accurate
    • not robust against outliers, especially if small sample >>> can’t claim the stated confidence level
    • if no outliers, can be pretty robust against non-Normality
      • if skewed, large sample size can fix (b/c Central Limit Theorem)
    • small samples: always check shape and outliers
  • rules for t procedures
    • except for small samples, the SRS condition is more important than Normality
    • sample size less than 15 >>> only use t procedures if data look close to Normal and there are no outliers
    • sample size at least 15 >>> can use t procedures unless there are outliers or strong skewness
    • sample size at least 30 >>> can use t procedures even for clearly skewed distributions
    • if sample is biased (not randomly selected) or the data are actually the whole population, don’t use t procedures


Part 3: Estimating population proportion

  • interested in proportion of population that fits some requirement, which we will call success
  • conditions for inference
    • inference is based on the sampling distribution of the statistic p-hat
    • in large samples, sample proportion p-hat is close to population proportion p
    • standard error of p-hat will be sqrt((p-hat x (1 - p-hat))/n)
    • confidence interval becomes
      • estimate + / - z* SE
    • to use z procedures: data obtained by SRS, distribution of p-hat approximately Normal (counts as Normal when n(p-hat) and n(1 - p-hat) are both at least ten), and outcomes independent
      • conditions are stated in terms of p-hat because p is unknown
  • z procedures for proportions
    • level C confidence interval = p-hat + / - z* sqrt((p-hat x (1 - p-hat))/n)
    • z* is still the critical value that leaves area (1-C)/2 in each tail of the Normal curve
    • interpretation sentence structure: I am ___% confident that the proportion of _____ lies between ___ and ___.
    • margin of error only describes random sampling error
      • there are other sources of error that can’t be accounted for by margin of error
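
The interval for a proportion can be sketched the same way as the interval for a mean. The counts below (60 successes out of 100) are invented, and `proportion_interval` is an illustrative name; the function refuses to run when the Normal condition (at least 10 successes and 10 failures) fails.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, conf_level=0.95):
    """Level C interval for p: p-hat +/- z* x sqrt(p-hat(1 - p-hat)/n)."""
    # n(p-hat) is the count of successes, n(1 - p-hat) the count of failures
    if successes < 10 or n - successes < 10:
        raise ValueError("need at least 10 successes and 10 failures")
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)   # critical value z*
    se = sqrt(p_hat * (1 - p_hat) / n)                   # standard error of p-hat
    return p_hat - z * se, p_hat + z * se

low, high = proportion_interval(successes=60, n=100)
print(low, high)  # roughly 0.504 to 0.696
```

The SRS and independence conditions still have to be judged from how the data were collected; code can only check the counts.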
  • inference toolbox summary
    • step 1: identify the population and the parameter of interest
    • step 2: do they pass the conditions? (Normality, SRS, Independence?)
    • step 3: decide on a method and then calculate
    • step 4: interpret what you found out
  • sample size
    • goal: estimate the parameter with a chosen confidence level and margin of error
    • set z* sqrt((p* x (1 - p*))/n) to be less than or equal to the margin of error m you desire, then solve for n: n is at least (z* / m)^2 x p* x (1 - p*), rounded up
      • to get p*, you can base estimate on experience... do several values of p* to cover range of p-hat if you are doing this
      • can also set p* to 0.5 b/c margin of error is largest at p* = 0.5 >>> real margin of error for any p-hat value will be no larger than planned
      • if p* is from 0.3 to 0.7, use p* = 0.5; if not, then using p* = 0.5 will give larger sample than needed, so estimate based on experience
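
Solving the inequality for n gives n >= (z* / m)^2 x p* x (1 - p*), which the sketch below applies. The margin of error of 0.03 is a made-up planning value and `sample_size_for_proportion` is an illustrative name.

```python
from math import ceil
from statistics import NormalDist

def sample_size_for_proportion(m, conf_level=0.95, p_star=0.5):
    """Smallest n with z* x sqrt(p*(1 - p*)/n) <= m, rounded up.
    p_star = 0.5 is the conservative guess (largest margin of error)."""
    z = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)
    return ceil((z / m) ** 2 * p_star * (1 - p_star))

n_conservative = sample_size_for_proportion(m=0.03)            # p* = 0.5
n_from_guess = sample_size_for_proportion(m=0.03, p_star=0.2)  # p* from experience
print(n_conservative, n_from_guess)
```

The conservative guess asks for noticeably more observations than p* = 0.2 does, which is why an experience-based p* is worth using when p is expected to be far from 0.5.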

Websites I found helpful: