In this statistical vacuum, all eyes are on the Centre for Monitoring Indian Economy (CMIE) and its Consumer Pyramids Household Survey (CPHS). CMIE is a private agency engaged in the collection and compilation of Indian statistics, for sale. CPHS is a periodic survey, conducted three times a year in successive four-month ‘waves’ since January 2014. It is based on ‘an all-India representative sample of over 170,000 households’ according to CMIE’s website. Further, this is meant to be a panel dataset, largely tracking the same households over time, though response rates vary, and new households are frequently added to compensate for attrition.
Except for its hefty price, the CPHS dataset is, or at least sounds like, a researcher’s dream. It has become a veritable barometer of the Indian economy, closely watched for data on income, expenditure and employment in particular. Research papers based on CPHS data are also mushrooming. CPHS even stayed the course, we are told, during the Covid-19 crisis. The country owes a handsome debt to CMIE for coming to the rescue of its battered statistical system.
Is it really true, however, that CPHS is a ‘robust, nationally representative and panel survey of households,’ as a June 2021 World Bank discussion paper puts it, echoing many similar descriptions of this survey in influential articles?
Consider this: according to CPHS, adult literacy (15-49 years) was 100% in urban areas and 99% in rural areas in late 2019. That is too good to be true. It suggests that CPHS is biased towards better-off households.
The plot thickens when we compare literacy rates at different points of time. Four years earlier, in late 2015, the literacy rate in the same age group was only 83% according to CPHS data. Could it really be that adult illiteracy was wiped out within four years? Seems unlikely.
We can pursue this matter by looking at literacy for the same cohorts over this period. For instance, we can compare the 15-49 age group in late 2019 with the 11-45 age group in late 2015. These two groups correspond to the same cohort. If CPHS is mostly a panel dataset, the literacy rate of this cohort should be much the same in 2015 and 2019. But, in fact, it rises in successive waves, from 84% in 2015 to 99% in 2019. This suggests that the CPHS sample became increasingly biased towards better-off households over time.
The bias already applied in late 2015, judging from a comparison with the earlier NFHS-4. The CPHS estimate of adult literacy (15-49 years) at that time is 6 percentage points higher than the NFHS-4 estimate for 2015-16, for both men and women. The bias is also evident from data on household assets. According to CPHS, for instance, 98% of households had electricity in late 2015, 93% had water within the house, 89% had a television, and 42% had a fridge. The corresponding figures from NFHS-4 are much lower: 88%, 67%, 67% and 30% respectively.
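To make the size of these gaps explicit, the household figures quoted above can be subtracted item by item. The short Python sketch below does only that. The dictionary and function names are illustrative, and the numbers are the ones cited in this article, not fresh estimates.

```python
# Late-2015 household figures quoted in the text (percent of households).
cphs_2015 = {
    "electricity": 98,
    "water within the house": 93,
    "television": 89,
    "fridge": 42,
}
nfhs4 = {
    "electricity": 88,
    "water within the house": 67,
    "television": 67,
    "fridge": 30,
}


def gaps_in_points(first, second):
    """Return first minus second, item by item, in percentage points."""
    return {item: first[item] - second[item] for item in first}


gaps = gaps_in_points(cphs_2015, nfhs4)
for item, gap in gaps.items():
    print(f"{item}: CPHS exceeds NFHS-4 by {gap} percentage points")
```

On these figures the gap ranges from 10 percentage points (electricity) to 26 percentage points (water within the house).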
There is no guarantee that NFHS-4 is more reliable than CPHS. But at least we know that it is a nationally representative survey, and the NFHS-4 figures also look more plausible than their CPHS counterparts. Further, the NFHS-4 literacy figures are consistent with 2011 census data for the same cohorts, but CPHS literacy figures are not — they are too high.
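The same-cohort comparisons used here rest on simple arithmetic: people aged 15-49 in late 2019 were aged 11-45 four years earlier. A minimal Python sketch of that check follows, assuming person-level records with hypothetical `age` and `literate` fields. The toy records are made up for illustration and are not CPHS data, and survey weights are ignored for simplicity.

```python
def cohort_age_range(age_min, age_max, years_back):
    """Age range that the same birth cohort occupied `years_back` years earlier."""
    return age_min - years_back, age_max - years_back


def literacy_rate(records, age_min, age_max):
    """Unweighted share of literate people aged age_min..age_max (inclusive)."""
    in_range = [r for r in records if age_min <= r["age"] <= age_max]
    if not in_range:
        raise ValueError("no records in the requested age range")
    return sum(1 for r in in_range if r["literate"]) / len(in_range)


# Made-up records for illustration only (NOT CPHS data).
toy_wave_2015 = [
    {"age": 10, "literate": True},   # outside the 11-45 range
    {"age": 11, "literate": True},
    {"age": 30, "literate": False},
    {"age": 45, "literate": True},
    {"age": 46, "literate": False},  # outside the 11-45 range
]

# The 15-49 group of late 2019, traced back four years to late 2015.
earlier_min, earlier_max = cohort_age_range(15, 49, years_back=4)
rate_2015 = literacy_rate(toy_wave_2015, earlier_min, earlier_max)
print(f"Cohort aged {earlier_min}-{earlier_max} in 2015: {rate_2015:.0%} literate")
```

In a panel that largely tracks the same households, the rate computed this way for the earlier wave should be close to the rate for the 15-49 group in the later wave. A large jump points to a change in who is being surveyed.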
As stated earlier, it seems the CPHS bias towards better-off households increased over time. By 2019, the bias was truly embarrassing, judging from similar comparisons with NFHS-5 data for the 11 major states where that survey is on track. Consider Bihar. According to CPHS, 100% of households in Bihar had electricity in late 2019, 100% had water within the house, 98% had a toilet, and 95% had a TV. Paradise! The corresponding NFHS-5 figures are much lower, and far more plausible (96%, 89%, 62% and 35% respectively). Bihar is just one state. But a similar contrast emerges for these 11 states together (see table).
Another clue emerges from comparisons with the Periodic Labour Force Survey (PLFS) 2018-19 presented in the Azim Premji University State of Working India 2021 report (bit.ly/2SQvcqO). These suggest that CPHS overestimates average labour earnings by a long margin — perhaps 50% or so in rural areas.
In short, far from being nationally representative, the CPHS sample is heavily biased towards better-off households, and quite likely, the bias is growing over time. The bias is, perhaps, not surprising, since the sampling method apparently consists of surveying the ‘main street’ first in each sample village or enumeration block, and proceeding to inner streets only if the sample size requires it. If only for this reason, poor households are bound to be under-represented.
Incidentally, we noticed this bias in CMIE data in a recent review of evidence on the economic impact of Covid-19. A series of household surveys focusing on informal sector workers and their families strongly suggest that employment, income, expenditure and food intake remained well below pre-lockdown levels throughout 2020. CPHS, by contrast, suggests fairly rapid recovery soon after the national lockdown. This apparent contradiction is readily resolved if we bear in mind that poor households are grossly under-represented in CPHS data.
All this is just a sample of statistical issues that call for urgent scrutiny, given the prominent role of CMIE data in economic discussions today. The first step is for CMIE to reassert or retract its claim that CPHS is a nationally representative survey (a fair expectation, surely, from an agency that charges $180,000 per wave for adding a one-minute question in the survey). After that, let a hundred voices boom.