What the Syllabus Covers
The most basic distinction in data analysis is between quantitative (numerical) and qualitative (categorical) data. The distinction determines which descriptive statistics, graphs, and inferential tests are appropriate.
Previous-year question (PYQ) patterns: (a) classify a given variable as quantitative or qualitative, (b) sub-classify it as discrete/continuous (quantitative) or nominal/ordinal (qualitative), (c) match each data type to its measure of central tendency (mode/median/mean), and (d) identify the Stevens scale (NOIR: nominal, ordinal, interval, ratio).
Quantitative Data
Quantitative data are numerical measurements — values that can be added, averaged, and ordered.
Two Types
Type | Description | Examples
Discrete | Countable integers only | Number of students, number of cars, defective items
Continuous | Any value on a scale, including fractions | Height (cm), weight (kg), time (s), temperature
Stevens’ Scales — Recap
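Scale | Examples | Permissible operations | Typical statistics
Nominal | Gender, religion | =, ≠ | Mode, χ²
Ordinal | Rank, Likert | <, > | Median, Spearman’s ρ
Interval | Temperature °C, IQ | +, − | Mean, SD, Pearson’s r, t-test
Ratio | Height, weight, income | +, −, ×, ÷ | All of the above, plus geometric mean and CV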
Nominal and Ordinal are qualitative; Interval and Ratio are quantitative.
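As a small illustration of why the scale matters, the sketch below uses only Python's standard `statistics` module to take the mode of a nominal variable and the median of an ordinal one. The sample values are invented for illustration, and `median_low` is one reasonable choice here, not the only one.

```python
from statistics import mode, median_low

# Hypothetical nominal data: blood groups have no order, so only the mode applies.
blood_groups = ["O", "A", "B", "O", "AB", "O", "A"]
modal_group = mode(blood_groups)

# Hypothetical ordinal data: Likert codes from 1 (strongly disagree) to 5 (strongly agree).
# median_low returns an observed category instead of averaging two codes.
likert = [2, 4, 4, 5, 3, 4, 1, 5]
median_response = median_low(likert)

print(modal_group)      # most frequent category
print(median_response)  # middle category after sorting
```

Averaging the blood-group labels is impossible, and averaging the Likert codes would assume equal spacing between categories, which an ordinal scale does not guarantee.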
Central Tendency, Dispersion, Shape
Central tendency — Mean (arithmetic, geometric, harmonic) · Median · Mode.
Dispersion — Range · Quartile deviation · Mean deviation · Variance · Standard deviation · Coefficient of variation.
Shape — Skewness (asymmetry) · Kurtosis (peakedness).
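The simpler dispersion measures listed above can be sketched with the standard library as follows. The data values are illustrative, and the quartiles depend on the convention used: `statistics.quantiles` defaults to its "exclusive" method, so other textbooks may give slightly different quartiles for the same small sample.

```python
from statistics import mean, quantiles

data = [5, 8, 12, 15, 20]  # illustrative values

# Range: largest value minus smallest value.
data_range = max(data) - min(data)

# Quartile deviation: half the distance between the first and third quartiles.
q1, q2, q3 = quantiles(data, n=4)
quartile_deviation = (q3 - q1) / 2

# Mean deviation: average absolute distance from the mean.
centre = mean(data)
mean_deviation = sum(abs(x - centre) for x in data) / len(data)

print(data_range, quartile_deviation, mean_deviation)
```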
Three Means
Arithmetic Mean (AM) = Σx / n. Most-used. Sensitive to outliers.
Geometric Mean (GM) = ⁿ√(x₁ × x₂ × … × xₙ). Used for growth rates and ratios.
Harmonic Mean (HM) = n / Σ(1/x). Used for averaging rates (e.g., average speed over equal distances).
Order: AM ≥ GM ≥ HM (for positive numbers).
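A quick numerical check of the three means and their ordering, using Python's standard `statistics` module. The two speeds are a made-up example of the "equal distances" case, where the harmonic mean gives the true average speed.

```python
from statistics import mean, geometric_mean, harmonic_mean

# Hypothetical trip: the same distance covered at 40 km/h and then at 60 km/h.
speeds = [40, 60]

am = mean(speeds)            # arithmetic mean: 50
gm = geometric_mean(speeds)  # geometric mean: square root of 40 * 60, about 48.99
hm = harmonic_mean(speeds)   # harmonic mean: 2 / (1/40 + 1/60) = 48

# For positive numbers the ordering AM >= GM >= HM always holds.
assert am >= gm >= hm
print(am, round(gm, 2), hm)
```

The arithmetic mean (50 km/h) overstates the average speed because more time is spent on the slower leg.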
Standard Deviation and Variance
Variance: σ² = Σ(xᵢ − μ)² / N for a population; s² = Σ(xᵢ − x̄)² / (n − 1) for a sample.
Standard Deviation σ = √Variance.
Coefficient of Variation CV = (σ / x̄) × 100 %. Useful for comparing dispersion across datasets with different units.
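The formulas above map directly onto Python's standard `statistics` functions. This is a minimal sketch with illustrative data; the CV here is computed from the population standard deviation, which is an assumption (some texts use the sample version).

```python
from statistics import mean, pvariance, variance, pstdev

data = [5, 8, 12, 15, 20]  # illustrative values; the mean is 12

pop_var = pvariance(data)   # squared deviations divided by n
samp_var = variance(data)   # squared deviations divided by n - 1
pop_sd = pstdev(data)       # square root of the population variance

# Coefficient of variation: SD as a percentage of the mean (unit-free).
cv_percent = pop_sd / mean(data) * 100

print(pop_var, samp_var, round(pop_sd, 3), round(cv_percent, 2))
```

The squared deviations sum to 138, so the population variance is 138 / 5 = 27.6 and the sample variance is 138 / 4 = 34.5.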
Empirical / Normal Distribution Rule
In a normal distribution:
- ~68 % of values lie within μ ± 1σ.
- ~95 % lie within μ ± 2σ.
- ~99.7 % lie within μ ± 3σ.
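The three percentages can be recovered from the normal cumulative distribution function. A minimal sketch using `statistics.NormalDist` from the standard library; the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution lying within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The exact figures are about 68.3 %, 95.4 %, and 99.7 %, which is why the rule is usually quoted with the "~" sign.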
Skewness and Kurtosis
Skewness — asymmetry. Positive (long right tail; Mean > Median > Mode). Negative (long left tail; Mean < Median < Mode). Symmetric (Mean = Median = Mode).
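Kurtosis: traditionally described as peakedness, although it is driven mainly by the weight of the tails. Mesokurtic (normal; kurtosis = 3), Leptokurtic (sharp peak, heavy tails; kurtosis > 3), Platykurtic (flat, light tails; kurtosis < 3).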
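Both shape measures can be computed from central moments. The sketch below uses the moment-based population formulas (skewness = m₃ / m₂^1.5, kurtosis = m₄ / m₂²), which is one common convention; other definitions exist, for example excess kurtosis, which subtracts 3. The helper names and data are invented for illustration.

```python
from statistics import mean, median

def central_moment(values, order):
    """Average of (x - mean) ** order over all values."""
    centre = mean(values)
    return sum((x - centre) ** order for x in values) / len(values)

def skewness(values):
    """Third central moment scaled by the SD cubed; 0 for symmetric data."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Fourth central moment scaled by the variance squared; 3 for a normal curve."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]  # one large value creates a long right tail

print(skewness(symmetric), kurtosis(symmetric))
print(skewness(right_skewed), mean(right_skewed), median(right_skewed))
```

For the right-skewed sample the skewness is positive and the mean (4.75) sits above the median (3), matching the pattern described above.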
Qualitative Data
Qualitative data are categorical — values that label categories, not amounts.
Two Types
Type | Description | Examples
Nominal | Categories with no inherent order | Religion, blood group, gender, state
Ordinal | Categories with a meaningful order | Likert (strongly agree → strongly disagree), education level, severity
Statistics for Qualitative Data
Central tendency: Mode (nominal); Median (ordinal; the mode also applies and makes the fewest assumptions).
Dispersion: Frequency distribution, percentages, mode-based diversity.
Association: Cramér’s V, phi (φ) coefficient, Goodman-Kruskal lambda, Kendall’s tau (ordinal).
Tests: Chi-square (χ²), Fisher’s exact test (small samples), Mann-Whitney U (ordinal), Wilcoxon, Kruskal-Wallis.
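To make the chi-square statistic and Cramér's V concrete, here is a minimal hand computation for a 2 × 2 table. The counts are invented for illustration, and the sketch computes only the test statistic and the effect size, not the p-value.

```python
from math import sqrt

# Hypothetical 2 x 2 contingency table (rows: group, columns: choice).
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence = row total * column total / n.
chi_square = 0.0
for i, row in enumerate(observed):
    for j, count in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (count - expected) ** 2 / expected

# Cramér's V rescales chi-square to the 0-1 range; for a 2 x 2 table it equals |phi|.
smaller_dim = min(len(observed), len(observed[0]))
cramers_v = sqrt(chi_square / (n * (smaller_dim - 1)))

print(round(chi_square, 3), round(cramers_v, 3))
```

Every expected count here is 25, so χ² = 4 × (5² / 25) = 4.0 and V = √(4 / 100) = 0.2. With 1 degree of freedom the 5 % critical value is about 3.84, so this hypothetical association would just reach significance while remaining weak in size.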
Qualitative Research vs Qualitative Data
Qualitative DATA (this sub-unit): categorical values, usually summarised as counts (e.g., “55 men, 45 women”).
Qualitative RESEARCH (Topic 8): a depth-oriented method that uses words, narratives, and observations as data.
Both are “qualitative” but in different senses: the first is about measurement scale; the second is about research approach.
Mixed Data and Coding
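Dummy coding: categorical → 0/1 indicator variables (k − 1 indicators for k categories; the omitted category is the reference).
Effect coding: like dummy coding, but the reference category is coded −1 on every indicator, giving values of −1/0/1.
One-hot encoding: one 0/1 indicator per category (k indicators); common in ML.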
Likert scales are ordinal; commonly treated as interval for parametric tests.
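The three coding schemes can be sketched in plain Python. This assumes the usual conventions (k indicators for one-hot, k − 1 for dummy and effect coding); the category list, the choice of reference category, and the function names are all hypothetical.

```python
categories = ["Mumbai", "Delhi", "Chennai"]  # k = 3 categories
reference = "Chennai"                        # hypothetical reference category

def one_hot(value):
    """One 0/1 indicator per category (k columns)."""
    return [1 if value == c else 0 for c in categories]

def dummy(value):
    """k - 1 indicators; the reference category is coded as all zeros."""
    return [1 if value == c else 0 for c in categories if c != reference]

def effect(value):
    """Like dummy coding, but the reference category is coded -1 on every indicator."""
    if value == reference:
        return [-1] * (len(categories) - 1)
    return dummy(value)

for city in categories:
    print(city, one_hot(city), dummy(city), effect(city))
```

Dropping one column in dummy coding avoids perfect collinearity with the intercept in a regression model, which is why statistics texts prefer it to full one-hot encoding.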
Comparing the Two — Side by Side
Aspect | Quantitative | Qualitative
Data | Numbers | Categories
Scales | Interval, Ratio | Nominal, Ordinal
Operations | Arithmetic | Equality / order only
Central tendency | Mean, Median, Mode | Mode, Median (ordinal)
Dispersion | SD, Variance, CV | Frequency distribution, %
Charts | Histogram, line, scatter | Bar, pie
Test for association | Pearson’s r, regression | χ², Cramér’s V
Tests for difference | t-test, ANOVA | χ², Fisher’s exact
Software examples | SPSS, R, Stata | NVivo, ATLAS.ti (qual research)
flowchart TB
D{Data} --> QN[Quantitative<br/>Numerical]
D --> QL[Qualitative<br/>Categorical]
QN --> DC[Discrete<br/>Count]
QN --> CO[Continuous<br/>Measurement]
QL --> NM[Nominal<br/>No order]
QL --> OR[Ordinal<br/>Order, no equal intervals]
DC --> S1[Interval/Ratio]
CO --> S1
NM --> S2[Nominal]
OR --> S3[Ordinal]
classDef default fill:#003366,color:#ffffff,stroke:#ffcc00,stroke-width:3px,rx:10px,ry:10px;
Worked Examples — Classify the Variable
Blood group (A, B, AB, O): Qualitative · Nominal.
Education level (primary, secondary, tertiary): Qualitative · Ordinal.
Age in years: Quantitative · Ratio · Continuous (often reported as discrete).
Number of children: Quantitative · Ratio · Discrete.
Temperature in °C: Quantitative · Interval · Continuous.
Temperature in Kelvin: Quantitative · Ratio · Continuous.
Likert satisfaction (1–5): Qualitative · Ordinal (often treated as interval).
Marks out of 100: Quantitative · Interval (some argue ratio) · Discrete as recorded, though often treated as continuous.
Choosing the Right Statistic
Scale | Central tendency | Dispersion | Association | Tests
Nominal | Mode | Diversity index | Cramér’s V, φ | χ², Fisher’s exact
Ordinal | Median (Mode) | IQR | Spearman’s ρ, Kendall’s τ | Mann-Whitney, Wilcoxon, Kruskal-Wallis
Interval | Mean | SD | Pearson’s r | t-test, ANOVA
Ratio | Mean (GM, HM where appropriate) | SD, CV | Pearson’s r | t-test, ANOVA, regression
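For the ordinal case, Spearman’s ρ is Pearson’s r applied to ranks. The sketch below implements that idea in plain Python, giving tied values their average rank; the function names and the sample data are hypothetical.

```python
def ranks(values):
    """Rank values from 1 upward, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block while the next sorted value is tied with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def pearson(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical ordinal data: education level (1-3) and satisfaction (1-5).
education = [1, 1, 2, 2, 3, 3]
satisfaction = [2, 1, 3, 4, 4, 5]
print(round(spearman(education, satisfaction), 3))
```

Because only rank order enters the calculation, any monotonic relationship (for example y = x³) yields ρ = 1 even though it is not linear.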
Common Mistakes
Using the mean on ordinal data without justification (e.g., averaging satisfaction codes).
Using a t-test on nominal data (use χ² instead).
Confusing discrete quantitative (count) with ordinal (rank).
Treating a 0 in Celsius as “no temperature” — Celsius has no true zero; Kelvin does.
Ignoring outliers when reporting the mean.
Confusing qualitative data with qualitative research.
Using bar charts for continuous data (use histograms; bars do not touch in a bar chart but do in a histogram).
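The Celsius mistake in the list above can be checked numerically. A minimal sketch with invented temperatures: differences survive the change from Celsius to Kelvin, but ratios do not, which is what separates an interval scale from a ratio scale.

```python
def celsius_to_kelvin(celsius):
    """Shift the zero point from the freezing point of water to absolute zero."""
    return celsius + 273.15

cool_c, warm_c = 10.0, 20.0  # hypothetical readings in °C

# Differences are meaningful on an interval scale and are the same in both units.
diff_c = warm_c - cool_c
diff_k = celsius_to_kelvin(warm_c) - celsius_to_kelvin(cool_c)

# Ratios are not: 20 °C is not "twice as hot" as 10 °C.
ratio_c = warm_c / cool_c                                        # 2.0, an artefact of the arbitrary zero
ratio_k = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)  # about 1.035

print(diff_c, round(diff_k, 2), ratio_c, round(ratio_k, 3))
```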
Theory Anchors
Name | Year | Contribution
S.S. Stevens | 1946 | NOIR scales
Karl Pearson | early 20th c. | Correlation; chi-square
R.A. Fisher | 1925, 1935 | ANOVA, F-test, design of experiments
Charles Spearman | 1904 | Rank correlation (ordinal)
G. Udny Yule | early 20th c. | Yule’s Q (qualitative association)
Maurice Kendall | 1938 | Kendall’s tau (ordinal)
Harald Cramér | 1946 | Cramér’s V (nominal association)
John W. Tukey | 1977 | Exploratory Data Analysis (EDA); boxplot
C.R. Rao | 20th c. | Cramér-Rao bound; Indian statistician
Florence Nightingale | 1858 | Polar (coxcomb) charts; pioneer of statistical visualisation
Practice Questions
Which of the following is QUALITATIVE data?
A Height in cm
B Number of cars
C Religion
D Temperature
Correct Option: C
Religion — category, no inherent order = nominal qualitative.
"Number of defective items in a batch" is:
A Discrete quantitative
B Continuous quantitative
C Nominal qualitative
D Ordinal qualitative
Correct Option: A
Countable integers = discrete.
Likert-scale satisfaction (1 = strongly disagree to 5 = strongly agree) is:
A Nominal
B Ordinal
C Interval
D Ratio
Correct Option: B
Ordinal — order, but unequal "psychological" intervals. (Often treated as interval for parametric tests.)
Temperature measured in Celsius is on which Stevens scale?
A Nominal
B Ordinal
C Interval
D Ratio
Correct Option: C
Equal intervals, but 0 °C ≠ absence of temperature → interval. Kelvin would be ratio.
The MOST appropriate measure of central tendency for nominal data is:
A Mean
B Median
C Mode
D Variance
Correct Option: C
Mode — the only measure that makes sense for unordered categories.
For a set of positive numbers, the relationship among the three means is:
A AM ≥ GM ≥ HM
B HM > GM > AM
C GM > AM > HM
D AM = GM = HM
Correct Option: A
For positive numbers: AM ≥ GM ≥ HM, with equality only when all values are equal.
In a normal distribution, approximately what % of values fall within μ ± 2σ?
A 50 %
B 68 %
C 95 %
D 99.7 %
Correct Option: C
68-95-99.7 rule. 1σ → 68 %, 2σ → 95 %, 3σ → 99.7 %.
In a positively-skewed distribution:
A Mean > Median > Mode
B Mean < Median < Mode
C Mean = Median = Mode
D Mean = Mode > Median
Correct Option: A
Long right tail pulls the mean to the right of the median, which is to the right of the mode: Mean > Median > Mode.
The empirical relationship for a moderately-skewed distribution is:
A Mode = 3 Median − 2 Mean
B Mean = 3 Mode − 2 Median
C Median = Mean × Mode
D Mode = Mean × Median
Correct Option: A
Mode ≈ 3 × Median − 2 × Mean for moderately skewed distributions.
A distribution with a sharper peak than the normal curve is called:
A Mesokurtic
B Leptokurtic
C Platykurtic
D Skewed
Correct Option: B
Leptokurtic = sharper peak. Mesokurtic = normal; Platykurtic = flatter.
To test the association between two NOMINAL variables (e.g., gender and voting choice), the appropriate test is:
A t-test
B Pearson's r
C Chi-square
D ANOVA
Correct Option: C
Chi-square (χ²) for categorical-categorical association.
Spearman's rank correlation (ρ) is most appropriate for:
A Two nominal variables
B Two ordinal variables
C Two interval / ratio variables
D One nominal and one ratio variable
Correct Option: B
Spearman's ρ uses rank order — perfect for ordinal data or non-linear monotonic relationships.
The Coefficient of Variation (CV) is defined as:
A σ / x̄
B (σ / x̄) × 100 %
C x̄ / σ
D σ²
Correct Option: B
CV = (σ / x̄) × 100 %. Allows comparison of relative dispersion across datasets with different units.
The arithmetic mean of 5, 8, 12, 15, 20 is:
Correct Option: B
(5 + 8 + 12 + 15 + 20) / 5 = 60 / 5 = 12.
The median of 12, 5, 18, 7, 22 is:
Correct Option: B
Sorted: 5, 7, 12, 18, 22 → middle = 12.
A distribution with two distinct modes is called:
A Unimodal
B Bimodal
C Multimodal
D Amodal
Correct Option: B
Bimodal = two modes; multimodal = more than two.
Converting a qualitative variable like "city = Mumbai/Delhi/Chennai" into three 0/1 indicator variables is called:
A Likert scaling
B One-hot / dummy encoding
C Normalisation
D Standardisation
Correct Option: B
One-hot / dummy encoding — categorical → indicator variables for use in quantitative models.
To visualise the distribution of a CONTINUOUS quantitative variable, the BEST chart is:
A Bar chart
B Pie chart
C Histogram
D Word cloud
Correct Option: C
Histogram — bars touch, representing continuous intervals. Bar charts (gaps between bars) are for categorical data.
For ordinal data, the appropriate test for two independent groups is:
A Independent t-test
B Mann-Whitney U
C Chi-square
D Pearson's r
Correct Option: B
Mann-Whitney U — non-parametric, for ordinal or non-normal interval data.
Match each scale with its appropriate central-tendency measure:
Scale | Measure
(i) Nominal | (a) Mean
(ii) Ordinal | (b) Mean (incl. GM, HM)
(iii) Interval | (c) Mode
(iv) Ratio | (d) Median
A (i)-c, (ii)-d, (iii)-a, (iv)-b
B (i)-a, (ii)-b, (iii)-c, (iv)-d
C (i)-b, (ii)-c, (iii)-d, (iv)-a
D (i)-d, (ii)-a, (iii)-b, (iv)-c
Correct Option: A
Nominal → Mode; Ordinal → Median; Interval → Mean; Ratio → Mean (including GM, HM).
Quick Recall
Quantitative: numerical (Interval, Ratio); Qualitative: categorical (Nominal, Ordinal).
Quantitative sub-types: Discrete (count) · Continuous (measurement).
Qualitative sub-types: Nominal (no order) · Ordinal (order).
Stevens’ scales (mnemonic NOIR): Nominal · Ordinal · Interval · Ratio.
3 properties of quantitative data: Central tendency · Dispersion · Shape.
3 Means: AM (= Σx/n) · GM (=ⁿ√Πx) · HM (= n/Σ(1/x)); AM ≥ GM ≥ HM.
Median, Mode, Quartiles. Empirical: Mode ≈ 3 Median − 2 Mean (moderate skew).
Variance σ² · SD σ · CV = σ/x̄ × 100 %.
68-95-99.7 rule for normal distribution.
Skewness: Positive (Mean > Median > Mode) · Negative (Mean < Median < Mode) · Symmetric (Mean = Median = Mode).
Kurtosis: Mesokurtic (normal) · Leptokurtic (sharp) · Platykurtic (flat).
Central tendency by scale: Nominal → Mode · Ordinal → Median · Interval/Ratio → Mean.
Correlation by scale: Nominal → Cramér’s V/φ · Ordinal → Spearman’s ρ, Kendall’s τ · Interval/Ratio → Pearson’s r.
Tests by scale: Nominal → χ², Fisher’s exact · Ordinal → Mann-Whitney, Wilcoxon, Kruskal-Wallis · Interval/Ratio → t-test, ANOVA, regression.
Coding: Dummy/one-hot encoding for categorical → quantitative analysis.
Bar chart = categorical (gaps between bars). Histogram = continuous (bars touch).
Qualitative DATA ≠ Qualitative RESEARCH — different senses of the word.
Indian statisticians: P.C. Mahalanobis (founded the Indian Statistical Institute, ISI, in 1931); C.R. Rao (Cramér-Rao bound).