The most important distinction that must be made about samples is whether they are based on:

Note: Much of the content in the first half of this module is presented in a 38-minute lecture by Professor Lisa Sullivan. The lecture is available below, and a transcript of the lecture is also available.


Sampling individuals from a population into a sample is a critically important step in any biostatistical analysis, because we are making generalizations about the population based on that sample. When selecting a sample from a population, it is important that the sample is representative of the population, i.e., the sample should be similar to the population with respect to key characteristics. For example, studies have shown that the prevalence of obesity is inversely related to educational attainment (i.e., persons with higher levels of education are less likely to be obese). Consequently, if we were to select a sample from a population in order to estimate the overall prevalence of obesity, we would want the educational level of the sample to be similar to that of the overall population in order to avoid an over- or underestimate of the prevalence of obesity.  

There are two types of sampling: probability sampling and non-probability sampling. In probability sampling, each member of the population has a known probability of being selected. In non-probability sampling, members of the population are selected by a non-random method, so the probability of selection is unknown.

Probability Sampling

Simple Random Sampling

In simple random sampling, one starts by identifying the sampling frame, i.e., a complete list or enumeration of all of the population elements (e.g., people, houses, phone numbers, etc.). Each of these is assigned a unique identification number, and elements are selected at random to determine the individuals to be included in the sample. As a result, each element has an equal chance of being selected, and the probability of being selected can be easily computed. This sampling strategy is most useful for small populations, because it requires a complete enumeration of the population as a first step.

Many introductory statistical textbooks contain tables of random numbers that can be used to ensure random selection, and statistical computing packages can be used to determine random numbers. Excel, for example, has a built-in function that can be used to generate random numbers.
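As a minimal sketch of the procedure described above, the Python snippet below draws a simple random sample using the standard library's random module in place of a random number table. The frame of 1,000 ID numbers, the sample size of 50, and the seed are illustrative assumptions, not values from the text.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct elements from the sampling frame, each equally likely."""
    rng = random.Random(seed)      # a seed makes the draw reproducible
    return rng.sample(frame, n)    # sampling without replacement

# Hypothetical sampling frame: ID numbers 1..1000 assigned to population elements.
frame = list(range(1, 1001))
sample = simple_random_sample(frame, n=50, seed=2024)

# Under simple random sampling every element has inclusion probability n/N.
inclusion_probability = len(sample) / len(frame)
print(sorted(sample)[:5], inclusion_probability)
```

Because the frame must be listed in full before any draw can be made, the sketch also shows why this approach suits populations that can be completely enumerated.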

Systematic Sampling

Systematic sampling also begins with the complete sampling frame and assignment of unique identification numbers. However, in systematic sampling, subjects are selected at fixed intervals, e.g., every third or every fifth person is selected. The spacing or interval between selections is determined by the ratio of the population size to the sample size (N/n). For example, if the population size is N=1,000 and a sample size of n=100 is desired, then the sampling interval is 1,000/100 = 10, so every tenth person is selected into the sample. The selection process begins by selecting the first person at random from the first ten subjects in the sampling frame using a random number table; every tenth subject after that is then selected.
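The selection rule for N=1,000 and n=100 can be sketched in Python as follows. This is one possible implementation under stated assumptions: the frame is a hypothetical list of ID numbers, the random start comes from the random module rather than a random number table, and the function name is illustrative.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select every k-th element after a random start, with k = N // n."""
    k = len(frame) // n                 # sampling interval (N/n, rounded down)
    rng = random.Random(seed)
    start = rng.randrange(k)            # random position among the first k elements
    return frame[start::k]

frame = list(range(1, 1001))            # hypothetical ID numbers 1..1000
sample = systematic_sample(frame, n=100, seed=7)
print(len(sample), sample[:3])
```

Only the starting point is random; once it is chosen, every other selection is fixed by the interval.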

If the desired sample size is n=175, then the sampling interval is 1,000/175 = 5.7, so we round this down to five and take every fifth person. Once the first person is selected at random, every fifth person is selected from that point on through the end of the list.
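A quick arithmetic check of this case, written as a short Python sketch: rounding the interval down to five and running through the whole list of 1,000 yields 200 selections, which is more than the 175 originally desired. The text does not say how the surplus is handled, so the snippet only counts the selections.

```python
N, n = 1000, 175
interval = N // n                      # 1000/175 is about 5.71; rounded down gives 5

# Count the selections for every possible random start among the first five positions.
counts = [len(range(start, N, interval)) for start in range(interval)]
print(interval, counts)
```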

With systematic sampling like this, it is possible to obtain non-representative samples if there is a systematic arrangement of individuals in the population. For example, suppose that the population of interest consisted of married couples and that the sampling frame was set up to list each husband and then his wife. Selecting every tenth person (or using any even-numbered interval) would result in selecting all males or all females, depending on the starting point. This is an extreme example, but one should consider all potential sources of systematic bias in the sampling process.
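The married-couples example can be reproduced with a small Python sketch. The frame of 500 couples and the helper name are assumptions made for illustration; the point is only that an even interval lands on the same position within every couple.

```python
# Hypothetical frame of 500 couples, each listed husband ("M") first, then wife ("F").
frame = [sex for _ in range(500) for sex in ("M", "F")]

def systematic_sample(frame, interval, start):
    """Take every interval-th element beginning at position start."""
    return frame[start::interval]

# An even interval of 10 picks only one sex, depending on the starting point.
from_first = set(systematic_sample(frame, interval=10, start=0))
from_second = set(systematic_sample(frame, interval=10, start=1))

# An odd interval of 5 alternates between husbands and wives.
odd_interval = set(systematic_sample(frame, interval=5, start=0))
print(from_first, from_second, odd_interval)
```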

Stratified Sampling

In stratified sampling, we split the population into non-overlapping groups or strata (e.g., men and women, people under 30 years of age and people 30 years of age and older), and then sample within each stratum. The purpose is to ensure adequate representation of subjects in each stratum.

Sampling within each stratum can be by simple random sampling or systematic sampling. For example, if a population contains 70% men and 30% women, and we want the sample to have the same composition, we can stratify by sex and sample men and women in those proportions. If the desired sample size is n=200, then n=140 men and n=60 women could be sampled, either by simple random sampling or by systematic sampling.
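The 70%/30% example can be sketched in Python using proportional allocation followed by simple random sampling within each stratum. The population of 1,000 labelled individuals, the function name, and the seed are assumptions for illustration only.

```python
import random

def stratified_sample(strata, n, seed=None):
    """Proportional allocation, then a simple random sample within each stratum."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n_stratum = round(n * len(members) / total)   # proportional share of n
        sample[name] = rng.sample(members, n_stratum)
    return sample

# Hypothetical population of 1,000: 70% men and 30% women, as in the example.
strata = {
    "men": [f"M{i}" for i in range(700)],
    "women": [f"W{i}" for i in range(300)],
}
sample = stratified_sample(strata, n=200, seed=1)
sizes = {name: len(members) for name, members in sample.items()}
print(sizes)
```

With other proportions the rounded stratum sizes may not add up exactly to n, so a real design would need a rule for distributing the remainder.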

Non-Probability Sampling

There are many situations in which it is not possible to generate a sampling frame, and the probability that any individual is selected into the sample is unknown. What is most important, however, is selecting a sample that is representative of the population. In these situations non-probability samples can be used. Some examples of non-probability samples are described below.

Convenience Sampling

In convenience sampling, we select individuals into our sample based on their availability to the investigators rather than selecting subjects at random from the entire population. As a result, the extent to which the sample is representative of the target population is not known. For example, we might approach patients seeking medical care at a particular hospital in a waiting or reception area. Convenience samples are useful for collecting preliminary or pilot data, but they should be used with caution for statistical inference, since they may not be representative of the target population.

Quota Sampling

In quota sampling, we determine a specific number of individuals to select into our sample in each of several specific groups. This is similar to stratified sampling in that we develop non-overlapping groups and sample a predetermined number of individuals within each. For example, suppose our desired sample size is n=300, and we wish to ensure that the distribution of subjects' ages in the sample is similar to that in the population. We know from census data that approximately 30% of the population are under age 20; 40% are between 20 and 49; and 30% are 50 years of age and older. We would then sample n=90 persons under age 20, n=120 between the ages of 20 and 49 and n=90 who are 50 years of age and older.

Age Group    Distribution in Population    Quota to Achieve n=300
<20          30%                           n=90
20-49        40%                           n=120
50+          30%                           n=90

Sampling proceeds until these totals, or quotas, are reached. Quota sampling is different from stratified sampling, because in a stratified sample individuals within each stratum are selected at random. Quota sampling achieves a representative age distribution, but it isn't a random sample, because the sampling frame is unknown. Therefore, the sample may not be representative of the population.
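The quota procedure for the age example can be sketched in Python as follows. The stream of 600 arrivals is invented for illustration, and accepting people in arrival order is an assumed stand-in for whatever non-random selection the investigators actually use.

```python
import random

def quota_sample(arrivals, quotas):
    """Accept people in arrival order until every group's quota is filled."""
    counts = {group: 0 for group in quotas}
    sample = []
    for person_id, group in arrivals:
        if counts[group] < quotas[group]:
            sample.append((person_id, group))
            counts[group] += 1
        if counts == quotas:            # all quotas reached: stop sampling
            break
    return sample

# Quotas for n=300 from the population shares in the example (30%, 40%, 30%).
shares = {"<20": 30, "20-49": 40, "50+": 30}          # percentages
quotas = {group: 300 * pct // 100 for group, pct in shares.items()}

# Hypothetical stream of 600 people (200 per age group) in shuffled arrival order.
rng = random.Random(0)
groups = list(shares) * 200
rng.shuffle(groups)
arrivals = list(enumerate(groups))

sample = quota_sample(arrivals, quotas)
print(quotas, len(sample))
```

The sample matches the population's age distribution by construction, but who ends up in it depends entirely on who happened to arrive first, not on a random draw from a sampling frame.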


Non-probability sampling is a method of selecting units from a population using a subjective (i.e. non-random) method. Since non-probability sampling does not require a complete survey frame, it is a fast, easy and inexpensive way of obtaining data. However, in order to draw conclusions about the population from the sample, one must assume that the sample is representative of the population. This is often a risky assumption to make in the case of non-probability sampling due to the difficulty of assessing whether the assumption holds. In addition, since elements are chosen arbitrarily, there is no way to estimate the probability of any one element being included in the sample. Also, no assurance is given that each item has a chance of being included, making it impossible either to estimate sampling variability or to identify possible bias.

In general, official statistical agencies around the world have used probability sampling as their preferred tool to meet information needs about a population of interest. In the last few years, however, there has been research on how to apply non-probability sampling in official statistics, and the use of other data sources has been increasingly explored. There are five key reasons behind this trend:

  • the decline in response rates in probability surveys;
  • the high cost of data collection;
  • the increased burden on respondents;
  • the desire for access to real-time statistics, and
  • the surge of non-probability data sources such as web surveys and social media.

Some have suggested the possibility of a shift in the paradigm and traditional approach to statistics. However, data from non-probability sources have a few challenges with respect to data quality, including the potential presence of participation and selection bias. Therefore, data collected using non-probability sampling should be used with extra caution.

The commonly used non-probability sampling methods include the following.

Convenience or haphazard sampling

Units are selected in an arbitrary manner with little or no planning involved. Haphazard sampling assumes that the population units are all alike, so that any unit may be chosen for the sample. An example of haphazard sampling is the vox pop survey where the interviewer selects any person who happens to walk by. Unfortunately, unless the population units are truly similar, selection is subject to the biases of the interviewer and whoever happened to walk by at the time of sampling.

Volunteer sampling

In this method, all respondents are volunteers. Generally, volunteers must be screened so as to get a set of characteristics suitable for the purposes of the survey (e.g. individuals with a particular disease). This method can be subject to large selection biases, but is sometimes necessary. For example, for ethical reasons, volunteers with particular medical conditions may have to be solicited for some medical experiments.

Another example of volunteer sampling is callers to a radio or television show, when an issue is discussed and listeners are invited to call in to express their opinions. Only the people who care strongly enough about the subject one way or another tend to respond. The silent majority does not typically respond, resulting in a large selection bias. Volunteer sampling is often used to select individuals for focus groups or in-depth interviews (i.e. for qualitative testing, where no attempt is made to generalize to the whole population).

Judgment sampling

With this method, sampling is done based on previous ideas of population composition and behaviour. An expert with knowledge of the population decides which units in the population should be sampled. In other words, the expert purposely selects what is considered to be a representative sample. Judgment sampling is subject to the researcher’s biases and is perhaps even more biased than haphazard sampling.

Since any preconceptions the researcher has are reflected in the sample, large biases can be introduced if these preconceptions are inaccurate. However, it can be useful in exploratory studies, for example in selecting members for focus groups or in-depth interviews to test specific aspects of a questionnaire.

Quota sampling

This is one of the most common forms of non-probability sampling. Sampling is done until a specific number of units (quotas) for various subpopulations have been selected. Quota sampling is a means for satisfying sample size objectives for the subpopulations.

The quotas may be based on population proportions. For example, if there are 100 men and 100 women in the population and a sample of 20 are to be drawn, 10 men and 10 women may be interviewed. Quota sampling can be considered preferable to other forms of non-probability sampling (e.g. judgment sampling) because it forces the inclusion of members of different subpopulations.

Quota sampling is somewhat similar to stratified sampling, which is probability sampling, in that similar units are grouped together. However, it differs in how the units are selected. In probability sampling, the units are selected randomly while in quota sampling a non-random method is used—it is usually left up to the interviewer to decide who is sampled. Contacted units that are unwilling to participate are simply replaced by units that are, in effect ignoring nonresponse bias. Market researchers often use quota sampling (particularly for telephone surveys) instead of stratified sampling to survey individuals with particular socio-economic profiles. This is because compared with stratified sampling, quota sampling is relatively inexpensive and easy to administer and has the desirable property of satisfying population proportions. However, it disguises potentially significant selection bias.

As with all other non-probability sample designs, in order to make inferences about the population, it is necessary to assume that persons selected are similar to those not selected. Such strong assumptions are rarely valid.

Snowball or network sampling

Suppose a researcher wishes to find rare individuals in the population, and already knows of the existence of some of these individuals and how to contact them. One approach is to contact those individuals and simply ask them if they know anyone like themselves, then contact those people, etc. The sample grows like a snowball rolling down a hill, ideally until it includes virtually everybody with that characteristic. Snowball sampling is useful for rare or hard-to-reach populations, such as people with disabilities, homeless people, drug users, or others who may not belong to an organised group, and people such as musicians, painters, or poets who are not readily identified on a survey list frame. However, some individuals or subgroups may have no chance of being sampled. Generalizing the conclusions to the whole population requires assumptions that are usually not met.
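The way a snowball sample grows, and why some individuals can have no chance of being sampled, can be illustrated with a small Python sketch. The referral network and all names in it are hypothetical, and following every referral in full is a simplifying assumption.

```python
from collections import deque

def snowball_sample(referrals, seeds):
    """Follow referrals outward from the initial contacts, wave by wave."""
    reached = set(seeds)
    queue = deque(seeds)
    while queue:
        person = queue.popleft()
        for contact in referrals.get(person, []):
            if contact not in reached:
                reached.add(contact)
                queue.append(contact)
    return reached

# Hypothetical referral network: who names whom when asked.
referrals = {
    "Ana": ["Ben", "Chloe"],
    "Ben": ["Dev"],
    "Chloe": ["Ana"],
    "Dev": [],
    "Eli": ["Fay"],   # Eli and Fay are not connected to the initial contacts
    "Fay": ["Eli"],
}
sample = snowball_sample(referrals, seeds=["Ana"])
print(sorted(sample))
```

Eli and Fay share the characteristic of interest but are never named by anyone reachable from the starting contact, so they cannot enter the sample.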

Crowdsourcing

Crowdsourcing has been defined slightly differently by researchers from various areas. Despite the multiplicity of definitions for crowdsourcing, one constant has been the broadcasting of a problem to the public, and an open call for contributions to help solve the problem. Members of the public submit solutions that are then owned by the entity (e.g. individuals, companies, or organizations), which originally broadcast the problem. Crowdsourcing is channelling the experts’ desire to solve a problem and then freely sharing the answer with everyone.

As part of Statistics Canada’s modernization, crowdsourcing has become an innovative way to collect valuable information for statistical purposes. By using crowdsourcing as the only collection method, surveys can be executed quickly with reduced cost and response burden. To better understand the challenges associated with crowdsourcing and to ensure that the results are good quality, methods are being developed to compare and validate the data with other sources of complementary data. A couple of examples are outlined below.

  • As part of the OpenStreetMap (OSM) pilot project, which was completed in March 2018, crowdsourced geographic information was collected by mapping the building footprints in the Ottawa, Ontario and Gatineau, Quebec areas. The network and experience of this pilot project helped to launch the Building Canada 2020 initiative (BC2020), aimed at mapping all building footprints of Canada on OSM by the year 2020.
  • During the COVID-19 pandemic, Statistics Canada developed a series of initiatives to generate data and analysis quickly and effectively via crowdsourcing, to help fill the data gaps on the economic and social impact of COVID-19 on Canadians. For example, the survey Impacts of COVID-19 on Canadians collected data from April 3 to 9, 2020. Close to 200,000 people living in Canada voluntarily answered the survey, which focused on behaviour and attitudes related to COVID-19. A series of results was then released over the following weeks.

Web panels

A web panel (or online or internet panel) could be defined as an access panel of people willing to respond to web questionnaires. It contains a sample of potential respondents who declare that they will cooperate for future data collection if selected. A web panel survey is a survey utilizing samples from web panels.

Web panels can be seen as sampling frames for web panel surveys. All persons in the panels must have up-to-date e-mail addresses. Recruitment for web panels can be made in different ways. Respondents can be sourced from offline channels: telephone, TV ads, radio ads, ads in newspapers and magazines, addressed letters, outdoor posters, customer registers, etc. Respondents can also be sourced from online channels: e-mails, websites, banners, community sites, member programs, etc. Often, many channels are used in order to achieve the necessary diversity. After the recruitment, a profile survey is conducted in order to collect information on the new participants to the panel. The recruitment can be done using either probability-based or self-recruited panels. In practice, the distinction between these two may not be very important if the nonresponse rate is very high for the probability-based panels. Sometimes incentives, such as gift cards or souvenirs, are used to attract people and boost response rates. Web panels are often used for marketing research or pilot studies.

During the COVID-19 pandemic, Statistics Canada developed a new web panel survey, the Canadian Perspectives Survey Series (CPSS), to get timely information about how Canadians are coping with COVID-19. More than 4,600 people in the 10 provinces responded to this survey between March 29 and April 3. Unlike most web panels, CPSS is a probabilistic panel based on the Labour Force Survey (LFS), as some respondents agreed to complete short online questionnaires following their participation in the LFS. CPSS enables Statistics Canada to collect important information from Canadians more efficiently, more rapidly and at a lower cost, compared with traditional survey methods.

Advantages and disadvantages of non-probability sampling

Advantages
  • Quick and convenient
    As a general rule, non-probability samples can be constituted quickly, which allows the survey to be launched, executed and finished in shorter times.
  • Inexpensive
    It usually takes an interviewer only a few hours to conduct such a survey. As well, non-probability samples are generally not spread out geographically, so travelling expenses for interviewers are low. In web panels or crowdsourcing, no interviewers are necessary. Tracing and persuasion of non-respondents are either not required or less demanding.
  • Reduce respondent burden
    In the case of volunteer sampling or crowdsourcing, respondents volunteer to participate in the survey without being solicited personally.
Disadvantages
  • Selection bias
    Making inferences about the population requires strong assumptions about the similarity between the sample and the population, even though the respondents are self-selected. Because selection bias is present in all non-probability samples, these are often dangerous assumptions to make. When results are to be generalized to the whole population, probability sampling should be used instead.
  • Noncoverage (undercoverage) bias
    Since some units in the population can have no chance of being included in the sample, noncoverage bias results. For example, people without the internet at home might never be selected for a web panel and may differ from those with the internet.
  • Difficulty of assessing the quality
    It is impossible to determine the probability that a unit in the population is selected for the sample, so reliable estimates and estimates of sampling error cannot be computed.