Statistical Thinking

“To compete and win, we must redouble our efforts – not only in the quality of our goods and services, but in the quality of our thinking, in the quality of our response to customers, in the quality of our decision making, in the quality of everything we do.” – E.S. Woolard, E.I. du Pont de Nemours & Co.

 

Statistics can be defined as the science of collection, presentation, analysis, and reasonable interpretation of data.

Statistics provides a rigorous scientific method for gaining insight into data. When there are many measurements, simply looking at the raw data rarely gives an informative account. Statistics, however, can give an overall picture of the data through graphical presentation or numerical summarisation, regardless of the number of data points. Besides summarising data, another important task of statistics is to make inferences and to predict relationships between variables.

Statistical Description of Data

Statistics describes a numeric set of data by its:

  • Center
  • Variability
  • Shape

Statistics describes a categorical set of data by:

  • Frequency
  • Percentage
  • Proportion of each category
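As a rough illustration of these three summaries, the sketch below counts a small made-up list of category labels. The sample data and the helper name `categorical_summary` are assumptions introduced only for this example.

```python
from collections import Counter

# Hypothetical sample: a severity grade recorded for ten patients.
grades = ["mild", "severe", "mild", "moderate", "mild",
          "moderate", "mild", "severe", "mild", "moderate"]


def categorical_summary(values):
    """Return {category: (frequency, proportion, percentage)}."""
    counts = Counter(values)
    total = len(values)
    return {
        category: (count, count / total, 100 * count / total)
        for category, count in counts.items()
    }


summary = categorical_summary(grades)
for category, (freq, prop, pct) in summary.items():
    print(f"{category:<9} frequency={freq}  proportion={prop:.2f}  percentage={pct:.0f}%")
```

The proportion and the percentage carry the same information on different scales (0 to 1 versus 0 to 100), so in practice one of the two is usually reported alongside the frequency.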

Statistical Terms and Definitions

Variable – any characteristic of an individual or entity. A variable can take different values for different individuals. Variables can be categorical or quantitative.

  • Nominal – Categorical variables with no inherent order or ranking sequence, such as names or classes (e.g., gender). Values may be written as numerals but carry no numerical meaning (e.g., I, II, III). The only operation that can be applied to nominal variables is enumeration (counting).
  • Ordinal – Variables with an inherent rank or order, e.g. mild, moderate, severe. Values can be compared for equality and for order (greater or less), but not for how much greater or less.
  • Interval – Values of the variable are ordered as in Ordinal and, additionally, differences between values are meaningful; however, the scale has no absolute zero. Calendar dates and temperatures on the Fahrenheit scale are examples. Addition and subtraction are meaningful operations, but multiplication and division are not.
  • Ratio – Variables with all properties of Interval plus an absolute, non-arbitrary zero point, e.g. age, weight, temperature (Kelvin). Addition, subtraction, multiplication, and division are all meaningful operations.
  • Distribution – (of a variable) tells us what values the variable takes and how often it takes these values.
    • Unimodal – having a single peak
    • Bimodal – having two distinct peaks
    • Symmetric – left and right halves are mirror images
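The difference between the Interval and Ratio scales described above can be made concrete with a small arithmetic sketch. The two temperature readings and the helper name `fahrenheit_to_kelvin` are assumptions chosen for illustration.

```python
def fahrenheit_to_kelvin(temp_f):
    """Convert a Fahrenheit reading (interval scale) to Kelvin (ratio scale)."""
    return (temp_f - 32.0) * 5.0 / 9.0 + 273.15


# Hypothetical readings.
cool_f, warm_f = 40.0, 80.0

# Differences are meaningful on an interval scale.
difference_f = warm_f - cool_f

# This ratio depends on the arbitrary zero point of the Fahrenheit scale,
# so "twice as hot" is not a meaningful statement.
naive_ratio = warm_f / cool_f

# Kelvin has an absolute zero, so the ratio of the same two readings is meaningful.
kelvin_ratio = fahrenheit_to_kelvin(warm_f) / fahrenheit_to_kelvin(cool_f)

print(f"difference in Fahrenheit: {difference_f}")
print(f"naive Fahrenheit ratio:   {naive_ratio:.2f}")
print(f"Kelvin ratio:             {kelvin_ratio:.3f}")
```

The same pair of readings gives a ratio of 2 in Fahrenheit but only about 1.08 in Kelvin, which is why division is treated as meaningful only on a ratio scale.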

Data Presentation

Two types of statistical presentation of data:

  • Graphical Presentation: We look for the overall pattern and for striking deviations from that pattern. The overall pattern is usually described by the shape, center, and spread of the data. An individual value that falls outside the overall pattern is called an outlier. Bar diagrams and pie charts are used for categorical variables. Histograms, stem-and-leaf plots, and box plots are used for numerical variables.
  • Numerical Presentation: A fundamental concept in summary statistics is that of a central value for a set of observations and the extent to which the central value characterizes the whole set of data. Measures of central value such as the mean or median must be coupled with measures of data dispersion (e.g., average distance from the mean) to indicate how well the central value characterizes the data as a whole.

Methods of Center Measurement

Center measurement is a summary measure of the overall level of a dataset. Commonly used methods are mean, median, mode, geometric mean etc.
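As a minimal sketch, the measures named above can be computed with Python's standard `statistics` module. The data values are made up, and 100 is included deliberately as an extreme value to show how differently the mean and the median react to it.

```python
import statistics

# Hypothetical observations; 100 is a deliberate extreme value.
data = [2, 5, 5, 10, 100]

mean = statistics.mean(data)                # sum of observations / number of observations
median = statistics.median(data)            # middle value of the ordered data
mode = statistics.mode(data)                # most frequently observed value
geo_mean = statistics.geometric_mean(data)  # n-th root of the product (positive data only)

# The same measures with the extreme value removed, for comparison.
trimmed = [x for x in data if x != 100]
mean_trimmed = statistics.mean(trimmed)
median_trimmed = statistics.median(trimmed)

print(f"mean={mean}, median={median}, mode={mode}, geometric mean={geo_mean:.2f}")
print(f"without 100: mean={mean_trimmed}, median={median_trimmed}")
```

Removing the single extreme value moves the mean from 24.4 to 5.5, while the median stays at 5. `statistics.geometric_mean` requires Python 3.8 or later.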

Mean: The sum of all the observations divided by the number of observations.

Median: The middle value in an ordered sequence of observations. That is, to find the median we need to order the data set and then find the middle value. With an even number of observations, the median is the average of the two middle values.

Mode: The value that is observed most frequently. The mode is undefined for sequences in which no observation is repeated.

The median is less sensitive to outliers (extreme scores) than the mean and is thus a better measure than the mean for highly skewed distributions.

Methods of Variability Measurement

Variability (or dispersion) measures the amount of scatter in a dataset.

Commonly used methods are range, variance, standard deviation, interquartile range, and coefficient of variation:

  • Range: The difference between the largest and the smallest observations. The range of 10, 5, 2, 100 is (100 - 2) = 98. It is a crude measure of variability.
  • Variance: The variance of a set of observations is the average of the squares of the deviations of the observations from their mean.
  • Standard Deviation: Square root of the variance
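  • Quartiles: The ordered data can be divided into four regions, each holding about a quarter of the observations, that together cover the total range of observed values. The cut points between these regions are known as quartiles. The interquartile range is the difference between the third and the first quartile.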
  • Coefficient of Variation: The standard deviation of the data divided by its mean.
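The measures in this list can be sketched with the standard `statistics` module, reusing the four observations from the range example. Two assumptions are worth flagging: the variance here is the population form (the plain average of squared deviations, as defined above), and the quartile cut points come from the default method of `statistics.quantiles`, which is only one of several conventions and can give slightly different values from other software.

```python
import statistics

# The observations used in the range example.
data = [10, 5, 2, 100]

data_range = max(data) - min(data)            # largest minus smallest observation
variance = statistics.pvariance(data)         # average squared deviation from the mean
std_dev = statistics.pstdev(data)             # square root of the variance
q1, q2, q3 = statistics.quantiles(data, n=4)  # quartile cut points (default method)
iqr = q3 - q1                                 # interquartile range
cv = std_dev / statistics.mean(data)          # coefficient of variation

print(f"range={data_range}")
print(f"variance={variance:.4f}, standard deviation={std_dev:.4f}")
print(f"quartiles=({q1}, {q2}, {q3}), IQR={iqr}")
print(f"coefficient of variation={cv:.3f}")
```

When the data are a sample from a larger population, `statistics.variance` and `statistics.stdev` (which divide by n - 1 instead of n) are the usual choice.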

Shape of Data

Shape of data is measured by:

  • Skewness: Measures the asymmetry of the data
    • Positive or right skewed: Longer right tail
    • Negative or left skewed: Longer left tail
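  • Kurtosis: Measures the peakedness of the distribution of the data. The kurtosis of a normal distribution is 3, so the excess kurtosis (kurtosis minus 3), which is what many software packages report, is 0 for a normal distribution.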
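A minimal sketch of both shape measures, assuming the simple moment-based (population) definitions: skewness as the third central moment divided by the 1.5 power of the second, and kurtosis as the fourth central moment divided by the square of the second, with 3 subtracted so that a normal distribution scores 0. Statistical packages often apply small-sample corrections, so their numbers may differ slightly. The function names and data are hypothetical.

```python
def central_moment(values, order):
    """Average of (x - mean) ** order over all observations."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** order for x in values) / len(values)


def skewness(values):
    """Moment-based skewness: m3 / m2 ** 1.5."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Hypothetical data sets.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 10]        # long right tail
left_skewed = [-x for x in right_skewed]  # mirror image: long left tail

print(f"symmetric:    skewness={skewness(symmetric):.3f}")
print(f"right skewed: skewness={skewness(right_skewed):.3f}")
print(f"left skewed:  skewness={skewness(left_skewed):.3f}")
print(f"symmetric:    excess kurtosis={excess_kurtosis(symmetric):.3f}")
```

The sign of the skewness matches the direction of the longer tail: positive for the right-skewed data, negative for its mirror image, and zero for the symmetric data.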

“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write.”
– H G Wells