29  Sources, acquisition and classification of Data

29.1 What the Syllabus Covers

This sub-unit has three examined heads:

  1. Sources of data — primary vs secondary; internal vs external.
  2. Acquisition — methods of collecting / observing data.
  3. Classification — organising raw data into structured form.

The most-repeated PYQ patterns are: (a) distinguishing primary vs secondary, (b) matching the right acquisition method to a research question, (c) recognising the four bases of classification (geographic, chronological, qualitative, quantitative), and (d) naming Indian statistical agencies (NSO, NSSO, CSO, MoSPI, RBI, RGI, NITI Aayog).

29.2 What “Data” Is

Data are facts, observations or measurements recorded for analysis. They become information only after processing and interpretation. Statistical work moves through: Source → Acquisition → Classification → Analysis → Interpretation → Presentation.

Tip: Data, Information, Knowledge — The DIKW Pyramid

Data → Information (processed data) → Knowledge (information + context + meaning) → Wisdom (applied knowledge). Source: Russell Ackoff (1989).

29.3 Sources of Data

29.3.1 Primary vs Secondary

Tip: Primary vs Secondary Sources
Source | Definition | Examples | Strength | Limitation
Primary | Collected first-hand by the researcher | Survey, interview, observation, experiment | Tailored, recent, controlled | Expensive, slow
Secondary | Collected by someone else for another purpose | Census, government reports, journal articles, databases | Cheap, fast, large scale | May not fit research need; outdated

29.3.2 Internal vs External

Tip: Internal vs External Sources
  • Internal — sourced from within the organisation (sales records, HR data, accounts).
  • External — sourced from outside (government, industry bodies, market research firms).

29.3.3 Major Indian Sources of Secondary Data

Tip: Indian Statistical & Data Agencies
Agency | Full form | Output
MoSPI | Ministry of Statistics and Programme Implementation | Apex statistical ministry
NSO (2019 merger) | National Statistical Office | Merged CSO + NSSO under MoSPI
CSO | Central Statistics Office | National accounts, GDP, IIP, CPI
NSSO | National Sample Survey Office | Household consumption, employment, health surveys
RGI / ORGI | Office of the Registrar General & Census Commissioner, India | Census (every 10 years); Vital Statistics; SRS
RBI | Reserve Bank of India | Banking, monetary, balance-of-payments data
NITI Aayog | National Institution for Transforming India | SDG India Index, policy data
NCRB | National Crime Records Bureau | Crime statistics
CGHS / MoHFW | Central Government Health Scheme / Ministry of Health and Family Welfare | Health & demographic data; NFHS
DGCIS | Directorate General of Commercial Intelligence and Statistics | Foreign trade data
IIP | Index of Industrial Production | Industrial output
AISHE | All India Survey on Higher Education | Higher-education data (MoE)
NIRF | National Institutional Ranking Framework | HEI rankings
IMD | India Meteorological Department | Weather, climate data
ISRO (Bhuvan) | Indian Space Research Organisation; Bhuvan geoportal | Geospatial data
EAC-PM | Economic Advisory Council to PM | Economic analysis
OGD Platform | Open Government Data Platform (data.gov.in) | Open datasets

29.3.4 International Sources

Tip: International Statistical Sources
  • UN Statistics Division — World Statistics Pocketbook.
  • World Bank — World Development Indicators.
  • IMF — World Economic Outlook, Government Finance Statistics.
  • WHO — Global Health Observatory.
  • OECD — Education at a Glance; PISA.
  • UNESCO Institute for Statistics — Education and culture.
  • ILO — Labour statistics.
  • FAO — Agriculture and food.

29.4 Acquisition (Methods of Data Collection)

29.4.1 Primary Data — Six Standard Methods

Tip: Six Standard Primary Methods
Method | What it does | Best for
Direct personal investigation | Researcher meets each respondent | Small samples; sensitive topics
Indirect oral investigation | Witnesses or third parties questioned | When respondent unavailable
Schedules through enumerators | Trained field-workers carry the form | Census, large rural surveys
Mailed / online questionnaire | Respondent fills the form | Large, literate samples
Local correspondents | Reporters in different localities | Regular feed (e.g., agricultural prices)
Observation | Watching behaviour or events | Ethnography, classroom interaction

29.4.2 Survey vs Experiment vs Observation

Tip: Three Major Acquisition Approaches
  • Survey — descriptive; respondents asked about themselves.
  • Experiment — manipulative; IV manipulated under control.
  • Observation — non-intrusive; researcher watches.

(Detailed coverage in Topic 8.)

29.4.3 Modes of Data Collection (CAPI/CATI/CAWI)

Tip: Computer-Assisted Modes
  • CAPI — Computer-Assisted Personal Interviewing (tablet in the field).
  • CATI — Computer-Assisted Telephone Interviewing.
  • CAWI — Computer-Assisted Web Interviewing (Google Forms, SurveyMonkey).
  • CASI — Computer-Assisted Self-Interviewing (sensitive topics).
  • PAPI — Paper-and-Pencil Interviewing.

29.4.4 Sampling — Quick Recap

(Detailed in Topic 9.) Probability sampling (simple random, stratified, systematic, cluster, multi-stage, PPS, i.e. probability proportional to size) gives every unit a known chance of selection and so allows statistical generalisation. Non-probability sampling (convenience, purposive, quota, snowball) does not.
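As a rough illustration of three of the probability designs named above, the sketch below draws a simple random, a systematic and a stratified sample from a hypothetical frame of 100 numbered households. The function names, the fixed seed and the urban/rural split are assumptions made for the example, not part of the syllabus text.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Simple random sampling: every unit has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def systematic_sample(population, n, seed=0):
    """Systematic sampling: random start, then every k-th unit (k = N // n)."""
    k = len(population) // n
    start = random.Random(seed).randrange(k)
    return [population[start + i * k] for i in range(n)]

def stratified_sample(strata, fraction, seed=0):
    """Stratified sampling: draw the same fraction at random from each stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, units in strata.items():
        size = max(1, round(len(units) * fraction))
        sample[name] = rng.sample(units, size)
    return sample

households = list(range(1, 101))  # hypothetical sampling frame
srs = simple_random_sample(households, 10)
sys_sample = systematic_sample(households, 10)
strat = stratified_sample(
    {"urban": households[:40], "rural": households[40:]}, 0.1
)
```

In each design the selection probability of a unit can be worked out in advance, which is what separates these methods from convenience or quota sampling.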

29.5 Classification of Data

Classification is the systematic arrangement of raw data into classes or categories with common characteristics.

29.5.1 Four Bases of Classification

Tip: Four Bases of Classification (Croxton & Cowden)
Basis | Categories formed by | Example
Geographical / Spatial | Place | State-wise literacy
Chronological / Temporal | Time | Population census, year by year
Qualitative | Attribute (non-numeric) | Gender, religion, marital status
Quantitative / Numerical | Numeric value | Income brackets, marks
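The four bases can be seen side by side by classifying the same raw records four different ways. The sketch below is only illustrative: the records, the income brackets and the helper `income_bracket` are hypothetical.

```python
from collections import Counter

# Hypothetical raw records: (state, year, marital status, monthly income in rupees)
records = [
    ("Kerala", 2011, "married", 18000),
    ("Kerala", 2021, "single", 32000),
    ("Bihar", 2011, "married", 9000),
    ("Bihar", 2021, "married", 14000),
    ("Punjab", 2021, "single", 27000),
]

def income_bracket(income):
    """Quantitative basis: place a numeric value in an income class."""
    if income < 10000:
        return "below 10000"
    if income < 20000:
        return "10000-20000"
    return "20000 and above"

geographical = Counter(state for state, _, _, _ in records)    # by place
chronological = Counter(year for _, year, _, _ in records)     # by time
qualitative = Counter(status for _, _, status, _ in records)   # by attribute
quantitative = Counter(income_bracket(inc) for _, _, _, inc in records)  # by value
```

The raw data never change; only the characteristic used to form the classes does.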

29.5.2 Qualitative Classification — Simple vs Manifold

Tip: Sub-types of Qualitative Classification
  • Simple (dichotomous) — single attribute, two categories (e.g., male / female).
  • Manifold (multi-attribute) — multiple attributes combined (e.g., gender × literacy × urban-rural).

29.5.3 Quantitative Classification — Discrete vs Continuous

Tip: Sub-types of Quantitative Classification
  • Discrete — values are countable integers (number of children).
  • Continuous — values fall on a continuum (height, weight, temperature).

29.5.4 Frequency Distribution

A frequency distribution organises quantitative data into classes (intervals) and shows the count in each.

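Tip: Frequency Distribution Terms
  • Class interval — e.g., 10–20, 20–30.
  • Class limits — the lower and upper values stated for a class (10 and 20 in the class 10–20).
  • Class boundaries — true limits that close the gaps between inclusive classes (e.g., 10–19 and 20–29 become 9.5–19.5 and 19.5–29.5).
  • Class width — upper boundary − lower boundary.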
  • Class mark / midpoint — (upper + lower) / 2.
  • Inclusive (10–19, 20–29) vs exclusive (10–20, 20–30) class methods.
  • Open-ended class — e.g., “above 90”.
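The terms above can be checked on a small worked case. The following is a minimal sketch, assuming hypothetical marks and two helper functions that are not from the source: one counts values into exclusive classes (upper limit excluded), the other converts inclusive class limits into class boundaries.

```python
def exclusive_frequency_table(values, lower, width, n_classes):
    """Count values into exclusive classes [lower, lower + width), and so on."""
    table = []
    for i in range(n_classes):
        lo = lower + i * width
        hi = lo + width
        count = sum(1 for v in values if lo <= v < hi)  # upper limit excluded
        table.append(
            {"class": f"{lo}-{hi}", "midpoint": (lo + hi) / 2, "frequency": count}
        )
    return table

def inclusive_boundaries(lower_limit, upper_limit, gap=1):
    """Class boundaries of an inclusive class: move each limit out by half the gap."""
    return (lower_limit - gap / 2, upper_limit + gap / 2)

marks = [12, 20, 29, 30, 35, 18, 24, 10, 39, 30]  # hypothetical marks
table = exclusive_frequency_table(marks, lower=10, width=10, n_classes=3)
boundaries = inclusive_boundaries(10, 19)          # (9.5, 19.5)
width = boundaries[1] - boundaries[0]              # 10.0
```

Note that the two marks of 30 fall in the class 30-40, not 20-30, which is exactly the exclusive-method convention.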

29.5.5 Number of Classes — Sturges’ Rule

Herbert Sturges (1926): k ≈ 1 + 3.322 log₁₀(N), where k = number of classes and N = number of observations. Used as a rough rule.
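A quick way to get a feel for the rule is to evaluate it for a few sample sizes. In the sketch below, rounding the result up to a whole number is an assumption made for convenience; the rule itself only gives an approximate k.

```python
import math

def sturges_classes(n):
    """Suggested number of classes by Sturges' rule, rounded up (an assumption)."""
    return math.ceil(1 + 3.322 * math.log10(n))

for n in (50, 100, 1000):
    print(n, sturges_classes(n))
```

For N = 100 the formula gives 1 + 3.322 × 2 = 7.644, so about 8 classes; the number of classes grows slowly even when N grows tenfold.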

29.5.6 Frequencies

Tip: Types of Frequency
  • Absolute frequency — raw count.
  • Relative frequency — count / total.
  • Cumulative frequency — running total of counts; plotted as a less-than or greater-than ogive.
  • Percentage frequency — relative × 100.
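The four frequency types can all be derived from one column of raw counts. The sketch below uses hypothetical classes and counts; the variable names are illustrative only.

```python
from itertools import accumulate

classes = ["10-20", "20-30", "30-40"]  # hypothetical exclusive classes
absolute = [3, 3, 4]                   # raw count in each class

total = sum(absolute)
relative = [f / total for f in absolute]          # count / total
percentage = [100 * f / total for f in absolute]  # relative x 100
less_than = list(accumulate(absolute))            # running total from the first class
greater_than = list(accumulate(reversed(absolute)))[::-1]  # running total from the last class
```

The `less_than` and `greater_than` columns are the values one would plot to draw the two ogives.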

29.6 Tabulation

Tabulation is the orderly arrangement of classified data into rows and columns.

Tip: Parts of a Statistical Table
  1. Table number.
  2. Title — concise, clear.
  3. Head-note — additional explanatory note.
  4. Stub — row labels.
  5. Caption — column labels.
  6. Body — actual data.
  7. Source note — credit data source.
  8. Footnote — clarifications.

29.6.1 Types of Table

Tip: Six Types of Statistical Table
  • Simple / one-way — one characteristic.
  • Two-way — two characteristics cross-classified.
  • Manifold — three or more characteristics.
  • Reference table — general purpose, large.
  • Summary table — derived measures (means, totals).
  • Frequency table — frequency distribution.
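To connect the table parts with the table types, the sketch below builds a small two-way table from hypothetical respondent records and prints it with a title, a caption row (column labels) and a stub (row labels). The data and layout widths are assumptions for illustration.

```python
from collections import Counter

# Hypothetical respondents: (gender, literacy status)
respondents = [
    ("female", "literate"), ("female", "literate"), ("female", "illiterate"),
    ("male", "literate"), ("male", "illiterate"), ("male", "literate"),
]

counts = Counter(respondents)         # cross-classified counts (the body)
stub = ["female", "male"]             # row labels
caption = ["literate", "illiterate"]  # column labels

lines = ["Table 1. Respondents by gender and literacy (hypothetical data)"]
lines.append(f"{'':<10}" + "".join(f"{c:>12}" for c in caption))
for row in stub:
    lines.append(f"{row:<10}" + "".join(f"{counts[(row, c)]:>12}" for c in caption))

table_text = "\n".join(lines)
print(table_text)
```

Adding a third characteristic (say urban or rural) to each record would turn this two-way table into a manifold one.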

29.7 Stevens’ Scales — Recap

(Detailed in Topic 8.)

Tip: NOIR Scales of Measurement

Nominal · Ordinal · Interval · Ratio — increasing in informational richness.

29.8 Data Quality and Errors

Tip: Quality Dimensions
  • Validity — measures what it claims to.
  • Reliability — produces consistent results.
  • Completeness, accuracy, timeliness — the data are whole, correct and current.

Tip: Two Error Categories
  • Sampling error — random difference between sample and population.
  • Non-sampling error — instrument bias, coverage failure, non-response, data-entry mistakes. Cannot be cured by larger samples.
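Why a larger sample helps with one error type but not the other can be sketched numerically. The example below assumes the textbook standard error of a sample mean under simple random sampling, σ/√n, which is not stated in this section, and uses hypothetical values for the population standard deviation and for a constant instrument bias.

```python
import math

def standard_error(sigma, n):
    """Typical size of sampling error in a sample mean (simple random sampling)."""
    return sigma / math.sqrt(n)

sigma = 20.0  # hypothetical population standard deviation
bias = 5.0    # hypothetical constant non-sampling bias, e.g. a faulty instrument

for n in (100, 400, 1600):
    # Sampling error shrinks as n grows; the bias term stays the same.
    print(n, standard_error(sigma, n), bias)
```

Quadrupling the sample size halves the sampling error each time, while the bias of 5.0 is untouched: only better instruments, coverage and fieldwork reduce it.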

29.9 Big Data Vocabulary (Brief)

Tip: Five V’s of Big Data

Volume · Velocity · Variety · Veracity · Value.

  • Volume — size of data set.
  • Velocity — speed of generation.
  • Variety — structured, semi-structured, unstructured.
  • Veracity — trustworthiness.
  • Value — usefulness.

Some lists add Variability and Visualisation.

29.10 Data Visualisation Preview

(Detailed in Topic 30.) Quick names: bar chart, pie chart, line graph, histogram, ogive, scatter plot, box plot, heatmap, GIS map, dashboard. Tools: Excel, Tableau, Power BI, R/ggplot2, Python/matplotlib, D3.js, QGIS, Bhuvan (ISRO).

29.11 Theory Anchors

Tip: Concepts and Persons
Concept / Person | Year / Context | Contribution
C.R. Kothari | 1985 | Standard Indian textbook on data collection
Croxton & Cowden | mid-20th c. | Four bases of classification
Herbert Sturges | 1926 | Sturges’ formula for class number
Stanley Smith Stevens | 1946 | NOIR scales
Russell Ackoff | 1989 | DIKW pyramid
R.A. Fisher | 1925, 1935 | Statistical methodology
Indian Statistical System | 1949 onward | Mahalanobis, ISI; NSS, CSO
P.C. Mahalanobis | 1950 | Father of Indian statistics; founded ISI 1931
MoSPI | 1999 (re-org) | Indian statistical ministry
NSO | 2019 | CSO + NSSO merger
OGD Platform | 2012 | data.gov.in

29.12 Practice Questions

Q 01 Primary Easy

Data collected first-hand by the researcher for a specific purpose are called:

  • A. Primary data
  • B. Secondary data
  • C. Tertiary data
  • D. Reference data
Correct Option: A
Primary data — collected by the researcher for the research at hand.
Q 02 Secondary Easy

The Census of India is an example of:

  • A. Primary data
  • B. Secondary data
  • C. Tertiary data
  • D. Experimental data
Correct Option: B
For researchers reusing it, the Census is secondary data. (For RGI conducting it, it is primary.)
Q 03 Indian Agency Medium

The Census of India is conducted by:

  • A. NSSO
  • B. CSO
  • C. Registrar General of India
  • D. RBI
Correct Option: C
RGI (Registrar General & Census Commissioner of India). Census conducted every 10 years.
Q 04 NSO Medium

The NSO (National Statistical Office) was formed in 2019 by merging:

  • A. RBI and CSO
  • B. CSO and NSSO
  • C. NSSO and RGI
  • D. NITI Aayog and MoSPI
Correct Option: B
CSO + NSSO → NSO under MoSPI, 2019.
Q 05 MoSPI Easy

MoSPI stands for:

  • A. Ministry of Science and Public Implementation
  • B. Ministry of Statistics and Programme Implementation
  • C. Ministry of Statistics and Planning Information
  • D. Ministry of Sample Survey and Population Index
Correct Option: B
Ministry of Statistics and Programme Implementation — apex statistical ministry, Government of India.
Q 06 Classification Medium

Classifying data state-wise (Maharashtra, Karnataka, Tamil Nadu …) is a:

  • A. Geographical classification
  • B. Chronological classification
  • C. Qualitative classification
  • D. Quantitative classification
Correct Option: A
By place / region = Geographical / spatial.
Q 07 Classification Medium

Data classified by religion or marital status is a:

  • A. Geographical classification
  • B. Chronological classification
  • C. Qualitative classification
  • D. Quantitative classification
Correct Option: C
By attribute = Qualitative. By value = quantitative; by time = chronological.
Q 08 Variable Medium

"Number of children in a family" is a:

  • A. Qualitative variable
  • B. Discrete quantitative variable
  • C. Continuous quantitative variable
  • D. Nominal variable
Correct Option: B
Counted in integers; no fractional values → discrete quantitative.
Q 09 Frequency Hard

In an exclusive classification, the class "20–30" includes a value of:

  • A. 10
  • B. 19
  • C. 29
  • D. 30
Correct Option: C
In the exclusive method, the upper limit is EXCLUDED. The class 20–30 includes 20 to 29.99…; 30 goes to the next class.
Q 10 Sturges Hard

Sturges' rule for the number of classes in a frequency distribution is:

  • A. k = √N
  • B. k = 1 + 3.322 log₁₀ N
  • C. k = N/10
  • D. k = 2 log₂ N
Correct Option: B
k = 1 + 3.322 log₁₀(N) (Sturges 1926).
Q 11 Bases Medium

Croxton and Cowden recognise how many bases of classification?

  • A. Two
  • B. Three
  • C. Four
  • D. Six
Correct Option: C
Four: Geographical · Chronological · Qualitative · Quantitative.
Q 12 Acquisition Medium

A door-to-door enumerator with a tablet asking questions and recording answers in real time is using:

  • A. CATI
  • B. CAPI
  • C. CAWI
  • D. CASI
Correct Option: B
CAPI = Computer-Assisted Personal Interviewing.
Q 13 DIKW Hard

In the DIKW pyramid, the layer immediately above "Data" is:

  • A. Information
  • B. Wisdom
  • C. Knowledge
  • D. Analytics
Correct Option: A
DIKW: Data → Information → Knowledge → Wisdom (Russell Ackoff, 1989).
Q 14 Big Data Medium

The "5 Vs" of big data are:

  • A. Volume, Velocity, Variety, Veracity, Value
  • B. Validity, Variety, Volume, Vector, Velocity
  • C. Vision, Velocity, Value, Verification, Volume
  • D. Validation, Visualisation, Velocity, Volume, Variety
Correct Option: A
Volume · Velocity · Variety · Veracity · Value.
Q 15 Indian Agency Hard

India's foreign trade statistics are published by:

  • A. DGCIS
  • B. NSSO
  • C. CSO
  • D. RBI
Correct Option: A
DGCIS = Directorate General of Commercial Intelligence and Statistics, Kolkata.
Q 16 OGD Medium

India's Open Government Data platform is hosted at:

  • A. data.gov.in
  • B. india.gov.in
  • C. mca.gov.in
  • D. digitalindia.gov.in
Correct Option: A
data.gov.in — OGD platform, launched 2012 by MeitY + NIC.
Q 17 Sampling Medium

Which is NOT probability sampling?

  • A. Simple random
  • B. Stratified random
  • C. Cluster sampling
  • D. Quota sampling
Correct Option: D
Quota is non-probability. Random/stratified/cluster are probability.
Q 18 Mahalanobis Hard

"The Father of Indian Statistics", founder of the Indian Statistical Institute (1931), is:

  • A. C.R. Rao
  • B. P.C. Mahalanobis
  • C. R.A. Fisher
  • D. P.V. Sukhatme
Correct Option: B
Prasanta Chandra Mahalanobis — founder of ISI (Kolkata, 1931); architect of the Mahalanobis Plan (Second Five-Year Plan).
Q 19 Table Medium

In a statistical table, the labels of ROWS are called:

  • A. Stub
  • B. Caption
  • C. Title
  • D. Body
Correct Option: A
Stub = row labels. Caption = column labels.
Q 20 Match Hard

Match each Indian agency with its primary output:

(i) RGI | (a) Household surveys
(ii) RBI | (b) Census
(iii) NSSO | (c) Crime statistics
(iv) NCRB | (d) Banking and BoP

  • A. (i)-b, (ii)-d, (iii)-a, (iv)-c
  • B. (i)-a, (ii)-b, (iii)-c, (iv)-d
  • C. (i)-c, (ii)-d, (iii)-a, (iv)-b
  • D. (i)-d, (ii)-c, (iii)-b, (iv)-a
Correct Option: A
RGI → Census; RBI → Banking/BoP; NSSO → Household surveys; NCRB → Crime statistics.

29.13 Quick Recall

Important: Quick recall
  • Data = raw facts; processed → Information → Knowledge → Wisdom (DIKW, Ackoff 1989).
  • Sources: Primary (first-hand) vs Secondary (already collected). Internal vs External.
  • Indian sources: MoSPI (apex); NSO = CSO + NSSO (2019 merger). RGI = Census; RBI = monetary/BoP; NCRB = crime; DGCIS = foreign trade; NFHS (MoHFW); AISHE (MoE); NIRF; NITI Aayog; IMD weather; Bhuvan (ISRO); OGD data.gov.in.
  • International: UN Statistics Division · World Bank · IMF · WHO · OECD · UNESCO · ILO · FAO.
  • 6 primary methods: Direct personal · Indirect oral · Schedules through enumerators · Mailed/online questionnaire · Local correspondents · Observation.
  • Modes: PAPI · CAPI · CATI · CAWI · CASI.
  • 3 approaches: Survey · Experiment · Observation.
  • 4 bases of classification (Croxton & Cowden): Geographical · Chronological · Qualitative · Quantitative.
  • Qualitative: simple (dichotomous) vs manifold. Quantitative: discrete vs continuous.
  • Frequency distribution: class interval, limits, boundaries, width, mark; inclusive vs exclusive; open-ended.
  • Sturges’ rule (1926): k = 1 + 3.322 log₁₀ N.
  • Frequencies: absolute · relative · cumulative · percentage.
  • Table parts (8): number · title · head-note · stub · caption · body · source note · footnote.
  • Stevens’ scales: N · O · I · R.
  • Data quality: Validity · Reliability · Completeness · Accuracy · Timeliness.
  • Errors: Sampling (random; shrinks as N grows) vs Non-sampling (bias; not cured by a larger N).
  • Big data 5 Vs: Volume · Velocity · Variety · Veracity · Value.
  • Mahalanobis (P.C.) — Father of Indian Statistics; founded ISI Kolkata 1931.