































A comprehensive guide to statistical analysis, focusing on categorical and quantitative data, data visualization, and hypothesis testing. Topics covered include histograms, skewed data, crf plots, statistical analysis, standard deviation, bivariate data, data transformation, simpson's paradox, experimental and observational studies, sampling methods, treatments, confounding variables, lurking variables, placebo effect, relative frequencies, random variables, discrete random variables, geometric distribution, probabilities, expected value, variance, standard deviation, chi-square independence or homogeneity test, one-sample mean z test, two-sample mean t test, proportion z interval, t-interval for slope, confidence intervals, and properties of the normal curve.
Typology: Exams
What is a dotplot?
A graphical display that shows one dot for each data point. It is best suited to small data sets; it is most often drawn for quantitative data, although dots can also be stacked over categories.

What's the difference between categorical and quantitative data?
Categorical data falls into categories, whereas quantitative data takes numerical values.

What is a bar chart?
A display for categorical data that shows the frequency or percentage for each category.

What are histograms?
Histograms are good for large quantitative data sets. The numbers on the horizontal axis either mark the left and right edges of each bar (the bar shows how much data falls between those values) or sit at the center of each bar (the bar shows how much data falls at that value). Sometimes the vertical axis is just the frequency, but often it is the relative frequency (i.e., count/total).

What do relative areas in histograms mean?
Relative areas correspond to relative frequencies (i.e., if 10% of a histogram's area lies between 25 and 26, then 10% of the data falls between 25 and 26).
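The link between bar heights and relative frequencies described above can be checked with a few lines of Python. This is only a sketch: the data values, the bin width of 1, and the helper name `relative_frequencies` are invented for illustration.

```python
from collections import Counter

def relative_frequencies(values, bin_width, start):
    """Map each equal-width bin (left, right) to its share of the data."""
    counts = Counter()
    for v in values:
        # Index of the bin [start + i*width, start + (i+1)*width) holding v
        counts[int((v - start) // bin_width)] += 1
    total = len(values)
    return {
        (start + i * bin_width, start + (i + 1) * bin_width): counts[i] / total
        for i in sorted(counts)
    }

# Hypothetical measurements, made up for this example
data = [24.1, 24.8, 25.2, 25.5, 25.9, 26.3, 26.4, 26.7, 27.1, 27.9]
freqs = relative_frequencies(data, bin_width=1, start=24)

for (left, right), share in freqs.items():
    print(f"{left}-{right}: {share:.0%} of the data")
```

With equal-width bins, each bar's share of the total area equals its relative frequency, so the shares always add up to 100%.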
What's a stemplot (stem-and-leaf plot)?
It has stems, which are the leading digit(s), and leaves, which are the remaining digit of each number. For example, depending on context 5|7 could be 57, 5.7, or some other variant, which is why a key must always be included. It's good for looking at individual values in small data sets.

What is important in analyzing visual data displays?
SOCS (Shape, Outliers, Center, Spread).
Shape: How is the data shaped (skewed left/right, symmetric, bimodal, etc.)? Are there any clusters (subgroups that the data falls into)? Are there any gaps in the data set?
Outliers: Are there any outliers within the data set?
Center: Give the mean or median, a value near the middle of the data.
Spread: What is the range or IQR (if it's easy to find) of the data set?

What is a mode? How do modes relate to unimodal/bimodal data sets?
A mode is a major peak in the data (the most repeated value). A unimodal data set has just one mode, whereas a bimodal data set has two.

What are some possible descriptions of shapes within distributions?
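Going back to the stemplot described above, here is a minimal sketch of how one could be built by hand in Python. It assumes non-negative whole numbers with the tens digit as the stem and the ones digit as the leaf; the scores and the function name `stemplot` are made up for illustration.

```python
from collections import defaultdict

def stemplot(values):
    """Return a text stemplot: stem = tens digit(s), leaf = ones digit."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    lines = []
    # Include empty stems so that gaps in the data stay visible
    for stem in range(min(stems), max(stems) + 1):
        leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    lines.append("Key: 5 | 7 means 57")
    return "\n".join(lines)

scores = [57, 62, 65, 65, 71, 74, 78, 93]  # hypothetical test scores
plot = stemplot(scores)
print(plot)
```

Keeping the empty stem for the 80s makes the gap in the data easy to spot, which is one of the Shape questions in SOCS.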
Descriptive statistics means summarizing averages, the shape of a distribution, etc., while statistical analysis means drawing inferences from limited data.

What are the two main ways of measuring center?
The median (the middle number of a set when arranged in order; with an even count, the average of the two middle numbers) and the mean (the sum of the values divided by the number of values).

When does it make more sense to use the median over the mean?
When there are outliers whose influence we want to minimize. We say the median is RESISTANT to outliers (which means it's largely unaffected by them).

What are the notations for mean of a population and mean of a sample?
The population mean is written μ (mu) and the sample mean is written x̄ (x bar). The sample mean usually assumes a simple random sample. The mean is computed by ∑x/n.

What are the ways of describing variability/dispersion of the measurements?
There are two ways of computing this. Way 1: simply take out the upper and lower quarters of the data and subtract.
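The idea of resistance to outliers can be seen numerically. The sketch below assumes the quartile-based spread above is the interquartile range (IQR) and uses the "median of each half" convention for Q1 and Q3, which is only one of several conventions in use. The salary figures and the helper names are invented for illustration.

```python
from statistics import mean, median

def quartiles(values):
    """Q1 and Q3 as the medians of the lower and upper halves of the data."""
    ordered = sorted(values)
    half = len(ordered) // 2  # the middle value is left out when n is odd
    return median(ordered[:half]), median(ordered[-half:])

def summary(values):
    q1, q3 = quartiles(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
    }

salaries = [30, 32, 35, 36, 38, 40, 41, 45]  # hypothetical, in thousands
with_outlier = salaries[:-1] + [450]         # replace the top value with an outlier

clean = summary(salaries)
skewed = summary(with_outlier)
for name in clean:
    print(f"{name}: {clean[name]} -> {skewed[name]}")
```

In this made-up data the mean and range jump when the outlier is introduced, while the median and IQR stay exactly where they were.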
The first and ninth deciles have percentile ranks of 10% and 90%.

What is the formula for a z score?
z = (x - mean) / (standard deviation). This shows the number of standard deviations a value lies away from the mean. Also, if you're given a z score, the mean, and the standard deviation, you can solve for the x value: x = mean + z × (standard deviation).

What is the empirical rule?
The empirical rule says that for symmetric, bell-shaped data, about 68% of the data lies within one standard deviation of the mean, about 95% lies within 2 standard deviations of the mean, and about 99.7% lies within 3 standard deviations of the mean.

How is the empirical rule related to range?
The empirical rule can flag arithmetic errors, as the range should be somewhere between roughly 4 and 6 times the standard deviation.

How does skewed data affect how the mean compares to the median?
If data is skewed to the left, the mean is usually lower than the median. If data is skewed to the right, the mean is usually higher than the median.

What is a boxplot?
It displays the five-number summary: a whisker out to the highest value, a line at Q3, a line at the median, a line at Q1, and a whisker out to the lowest value. Alternatively, outliers can be drawn as individual dots, with the whiskers extending only to the highest and lowest values not considered to be outliers.

What is the effect on mean, median, range, and standard deviation of adding a certain amount or multiplying by a certain amount to every value in the data set?
Adding: changes the mean and median by that amount but doesn't change the range or standard deviation.
Multiplying: changes the mean and median by that factor, and changes the range and standard deviation by the absolute value of that factor.

What are some graphical methods of comparing distributions?
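The adding and multiplying rules above are easy to confirm on a small data set. This is a rough sketch with made-up numbers; it uses the population standard deviation (`statistics.pstdev`) only because it gives round values here, and the sample standard deviation behaves the same way.

```python
from statistics import mean, median, pstdev

def describe(values):
    """Mean, median, range, and (population) standard deviation of a list."""
    return {
        "mean": mean(values),
        "median": median(values),
        "range": max(values) - min(values),
        "sd": pstdev(values),
    }

data = [2, 4, 4, 4, 5, 5, 7, 9]      # hypothetical values
shifted = [x + 10 for x in data]     # add 10 to every value
scaled = [x * 3 for x in data]       # multiply every value by 3

original = describe(data)
after_add = describe(shifted)
after_mult = describe(scaled)

for label, stats in [("original", original), ("+10", after_add), ("x3", after_mult)]:
    print(label, stats)
```

Adding 10 moves the mean and median up by 10 and leaves the range and standard deviation alone, while multiplying by 3 triples all four.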
What is r²?
r² is called the coefficient of determination and gives the percentage of the variation in y explained by the linear relationship with x. One must be careful when finding r from r²: the square root could be positive or negative, and r must take the same sign as the slope.

What is the least squares regression line?
It's the best-fitting line in the sense that it minimizes the sum of the squared residuals. Its equation can be determined because it passes through the point (x bar, y bar), the means of x and y. The slope is b1 = r × (sy/sx), where sy is the standard deviation of y and sx is the standard deviation of x; the intercept is then b0 = y bar - b1 × x bar.

What is the equation for the line comparing z scores of y to z scores of x?
zy = r × zx, where zy is the predicted z score of y.

What's the difference between interpolation and extrapolation?
Interpolation predicts inside the range of your data, which is generally safe. Extrapolation predicts outside the range of your data and is risky, as you don't know whether the linear trend will continue.

What does y hat really indicate?
The mean predicted y for each x value (a given x could have a variety of y values, so y hat simply gives their predicted mean).

What is a residual plot?
A residual is the observed value minus the predicted value (y - ŷ). A residual plot puts the residuals on the y axis and the x values on the x axis.

What are the mean and standard deviation of the residuals?
The mean of the residuals is always 0. The standard deviation of the residuals is given by the following formula: s = sqrt( Σ(y - ŷ)² / (n - 2) ). It indicates the size of a typical residual. In computer output, it's given by S.

What are you looking for in a residual plot?
Small, balanced residuals that don't show any kind of curve or pattern.

What are outliers and influential points in regression?
Outliers deviate from the overall pattern. Influential points sharply change the slope of the regression line.

How do you transform data to make it linear?
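The regression facts above (slope from r, the line passing through the means, residuals averaging 0, and the standard deviation of the residuals) can be tied together in a short sketch. The five data points and the function name `least_squares` are invented for illustration, and the residual standard deviation uses the usual n - 2 divisor.

```python
from math import sqrt
from statistics import mean, stdev

def least_squares(xs, ys):
    """Return (slope, intercept, r) using b1 = r * sy / sx."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    # Correlation: average product of z scores (with an n - 1 divisor)
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * s_x * s_y)
    slope = r * s_y / s_x
    intercept = y_bar - slope * x_bar   # forces the line through (x_bar, y_bar)
    return slope, intercept, r

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical responses

slope, intercept, r = least_squares(xs, ys)
predicted = [intercept + slope * x for x in xs]
residuals = [y - y_hat for y, y_hat in zip(ys, predicted)]   # observed - predicted
s = sqrt(sum(e ** 2 for e in residuals) / (len(xs) - 2))     # typical residual size

# z-score form of the line: predicted z_y equals r times z_x
z_x = (xs[-1] - mean(xs)) / stdev(xs)
z_y_hat = (predicted[-1] - mean(ys)) / stdev(ys)

print(f"y hat = {intercept:.2f} + {slope:.2f}x, r^2 = {r ** 2:.2f}, s = {s:.3f}")
```

For this made-up data the residuals sum to zero (up to rounding) and `z_y_hat` matches `r * z_x`, which is the z-score form of the same line.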
attack severity). This information can be displayed in a bar chart.

What are conditional relative frequencies?
They are found by dividing each cell count by the marginal frequency of its row or column. For example, you could divide the number of nonfatal heart attacks with low cholesterol by the total number of nonfatal heart attacks. This information can be displayed in side-by-side bar charts or in segmented bar charts in order to gauge association.

What is perfect independence in two-way contingency tables?
Perfect independence is when the conditional relative frequencies all match up. However, even if two variables are completely independent, a sample may not show perfect independence because of chance variation.

What is Simpson's paradox?
Simpson's paradox is when the results from a combined grouping contradict the results for the individual groups (due to lurking variables). For example, if there are two doctors and you're comparing survival rates, you may initially conclude that one doctor is better than the other (based on combined survival rate). However, if you split these groups
into good and bad condition of the patients that they're treating, you may come to the opposite conclusion.

What is a census? What are the advantages/disadvantages of a census?
A census is a complete enumeration of the population. It's ideal because you manage to capture everybody. However, it can be very time-consuming and costly. Also, it would be far better to take a sample and do it well than to conduct a poorly run census.

What is a sample survey?
A sample survey takes just a part of the whole population to survey.

What's necessary for a good sample survey?
Avoiding bias, which is frequently achieved by randomization. Also, a large sample size gives more reliable results (NOTE: it's the actual size that matters, not the percentage of the population. A group of 500 in a population of 100,000 is just as good as a group of 500 in a population of 1,000,000).

What is an experiment?
The researchers divide subjects into appropriate groups. Most often there is a treatment group, which receives the treatment, and a control group, which does not (often receiving a placebo instead).

What are the facets of a well designed experiment?
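Returning to the two-doctor illustration of Simpson's paradox above, the reversal is easier to believe once you see numbers. The counts below are entirely hypothetical, chosen so that Doctor A mostly treats patients in poor condition; the names `records`, `rate`, and `combined_rate` are just for this sketch.

```python
# (survived, treated) counts for each doctor and patient condition; made-up data
records = {
    "Doctor A": {"good": (19, 20), "poor": (48, 80)},
    "Doctor B": {"good": (72, 80), "poor": (10, 20)},
}

def rate(survived, treated):
    """Survival rate within one group (a conditional relative frequency)."""
    return survived / treated

def combined_rate(groups):
    """Survival rate after pooling all of a doctor's patients together."""
    survived = sum(s for s, _ in groups.values())
    treated = sum(t for _, t in groups.values())
    return survived / treated

overall = {doc: combined_rate(groups) for doc, groups in records.items()}
by_condition = {
    doc: {cond: rate(*counts) for cond, counts in groups.items()}
    for doc, groups in records.items()
}

for doc in records:
    print(doc, f"overall {overall[doc]:.0%}",
          {cond: f"{p:.0%}" for cond, p in by_condition[doc].items()})
```

In this made-up table Doctor A has the higher survival rate within each condition but the lower rate overall, because patient condition is the lurking variable.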
What is a simple random sample? What are some ways to get a simple random sample?
In a simple random sample, every possible group of n individuals has an equal chance of being the sample, so every participant also has an equal chance of being selected. The best ways to generate a simple random sample are via random digit tables or by having a computer generate random samples. One thing you have to be careful of is that you might not have a complete listing of the population, in which case randomness is not ensured.

Are other sampling techniques (stratified, cluster, etc.) just subsets of simple random sampling?
NO! In these techniques, not every possible group of individuals has an equal chance of being selected, even when each individual does.

What is sampling error?
No matter how well designed a survey is, it still uses a sample statistic to estimate a population parameter, so we're always bound to have some error. Generally, the error tends to be smaller when the sample size is larger, unless the survey was badly conducted.

What are some common types of biases?
Bias is defined as a tendency to favor certain members of a population. The following are the main types of bias:
Household bias- only one member of a household responds, so large households are
underrepresented.
Nonresponse bias- people don't respond to surveys or are too difficult to contact, thus creating a source of bias.
Quota sampling bias- interviewers are free to choose whom to interview as long as they fill quotas (e.g., a specific percentage Catholic, a specific percentage African-American, etc.).
Response bias- people may be untruthful when responding, especially when they're not anonymous and their views are unsavory.
Selection bias- for example, a newspaper interviewed only people with cars and telephones before a presidential election and predicted a landslide victory for the wrong person, because the people owning cars and telephones were wealthy and tended to vote Republican.
Size bias- for instance, having a student pick a coin out of a bag to estimate the monetary value, or throwing a dart at a map to pick a state. This favors large coins, large states, etc.
Undercoverage bias- inadequate representation of part of the population. For instance, phone surveys to landlines left out people who only had cell phones. Another instance of this is convenience samples,
school classes to survey.
Multistage sampling- there are two or more steps, each of which involves any of the other sampling techniques. For instance, some organizations randomly select nationwide locations, then randomly pick neighborhoods in each of these locations, then randomly pick households in each of these neighborhoods.

What is an experiment vs. an observational study vs. a survey?
An experiment is when a treatment or change is assigned. An observational study is when we observe or measure something that is already occurring. A sample survey is a particular type of observational study in which we look at a sample of the population.

What are explanatory and response variables? What are treatments?
Explanatory variables (called factors) are what is being changed or tested and are believed to have an effect on the response variable (which is being measured). Treatments consist of factor-level combinations (for instance, two factors with 3 levels each give 3 × 3 = 9 treatments).
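Since treatments are factor-level combinations, they can be listed mechanically. The sketch below uses two hypothetical factors with three levels each, so it produces 3 × 3 = 9 treatments; the factor names and levels are invented for illustration.

```python
from itertools import product

# Hypothetical factors and their levels
factors = {
    "fertilizer": ["none", "low", "high"],
    "watering": ["daily", "weekly", "monthly"],
}

# Every treatment is one level chosen from each factor
treatments = [dict(zip(factors, combo)) for combo in product(*factors.values())]

for number, treatment in enumerate(treatments, start=1):
    print(number, treatment)
print("number of treatments:", len(treatments))
```

The number of treatments is the product of the numbers of levels, so a design with one two-level factor and one three-level factor would instead give 2 × 3 = 6.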
What is confounding? What are lurking variables? How can both of these effects be overcome?
Confounding is when it's unclear which variable is causing a given set of results (for instance, if two or more variables are being altered at once). A lurking variable is a variable driving two other variables (for instance, those with larger shoe sizes have higher reading levels not because of their shoe size but because of the lurking variable of age). This is also described as common response, in that both measured variables are responding to the lurking variable.

What is a control group? What is the placebo effect? How can the placebo effect be minimized?
A control group is one that doesn't receive the treatment, while the treatment group does. People can be randomly assigned to the control and treatment groups in order to minimize confounding and lurking variables. The placebo effect is when people respond to any treatment (for instance, they might report that a sugar pill makes them feel much better). This can be overcome by either single-blinding, in which the subjects don't know what they're receiving, or double-blinding, in which neither subjects nor