
**50+ Statistics Questions for Data Science Interview & Machine Learning**

Statistics is fundamental to data science. To understand a dataset thoroughly, we rely on descriptive statistics concepts: the type of data, scales of measurement, central tendency, dispersion and so on. These concepts help us understand the central value of the data and how far the individual observations are spread around that central value. In any data science project, the first step after the data has been cleaned should be exploratory data analysis (EDA).
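
As a rough illustration of central tendency and dispersion, here is a minimal sketch using only Python's standard `statistics` module. The `incomes` list is made-up data, chosen so that one large value pulls the mean away from the median.

```python
import statistics

# Hypothetical sample: monthly incomes (in thousands) with one unusually large value
incomes = [32, 35, 36, 38, 40, 41, 43, 45, 47, 120]

# Central tendency
mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# Dispersion
stdev_income = statistics.stdev(incomes)  # sample standard deviation (divides by n - 1)
q1, q2, q3 = statistics.quantiles(incomes, n=4)  # quartiles, default "exclusive" method
iqr = q3 - q1  # interquartile range

print("mean:", mean_income)
print("median:", median_income)
print("standard deviation:", round(stdev_income, 2))
print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
```

In this made-up sample the mean sits well above the median because of the single large value, which is why the median and the IQR are often preferred for skewed data.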

In most practical settings it is next to impossible to collect data on the entire population, because doing so is time-consuming and costly. Instead, we take samples and, on the basis of these samples, draw conclusions about the population from which they were drawn. Here we make use of inferential statistics concepts such as hypothesis testing.
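
The idea of drawing conclusions about a population from a sample can be sketched as follows. This is only an illustration under stated assumptions: the population is synthetic (simulated from a normal distribution with mean 50 and standard deviation 10), the sample size of 100 is arbitrary, and the interval uses the normal approximation.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is reproducible

# Synthetic "population"; in practice the full population is rarely observed
population = [random.gauss(50, 10) for _ in range(100_000)]

# Simple random sample without replacement
sample = random.sample(population, k=100)

sample_mean = statistics.mean(sample)
# Standard error of the mean: sample standard deviation / sqrt(sample size)
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

# Approximate 95% confidence interval for the population mean
z = statistics.NormalDist().inv_cdf(0.975)  # roughly 1.96
ci_low = sample_mean - z * standard_error
ci_high = sample_mean + z * standard_error

print("sample mean:", round(sample_mean, 2))
print("95% CI:", round(ci_low, 2), "to", round(ci_high, 2))
print("population mean:", round(statistics.mean(population), 2))
```

Only the 100 sampled values are used to build the interval; the population mean is printed at the end purely to compare against it.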

In this post we have tried to include the frequently asked statistics questions from data science interviews. We will keep extending the list so that it becomes exhaustive over time. Your feedback would be of immense help. We would love to hear from you.

- What is a dataset, and what are the different types of variables you see in a dataset?

- Briefly explain the difference between primary and secondary data.

- What are the different types of sampling techniques? How is snowball sampling different from judgmental sampling?

- Can you explain a scenario in which we should use cross-sectional data and quota sampling?

- Can you compare the different scales of measurement?

- Which scale of measurement will be used to record the following?

  - Name of a city
  - Rating given by people to 5 flavors of ice cream on a scale of 1 to 10, 10 being the highest
  - Income of an individual
  - PIN code of a district
  - Roll number of a person

- What is the difference between a population and a sample?

- What is the difference between a parameter and a statistic? What symbols are used to denote the parameters and statistics for mean, standard deviation and variance?

- You are given a project to study the prevalence of malnourishment among children studying in class 1 in India. Define the population. How would you take a sample from that population?

- What is random sampling?

- Do all observations have an equal chance of being selected in random sampling?

- How is descriptive statistics useful in understanding a given dataset?

- What are the important measures of central tendency? Which measure of central tendency would you use for skewed data?

- Which measure of central tendency is the same as the 2nd quartile?

- Which measure of central tendency is also referred to as a locational average?

- What are the different measures of dispersion, and what advantage does standard deviation have over variance?
- What is the IQR, and what advantage does it have over other dispersion measures?

- What do you understand by the terms skewness and kurtosis?

- If a distribution is positively skewed, on which side of the distribution will you see the tail?

- If the relative kurtosis of a distribution is positive, will it have fatter or thinner tails?
- What will be the skewness of a dataset whose mean, median and mode coincide?

- What is the shape of a normal distribution, and what is the empirical rule associated with normally distributed data?
- If you want to visualize the age variable, would you use a histogram or a bar plot?

- What is the difference between a histogram and a bar plot?

- What is a frequency polygon and what does it tell us?

- What is a boxplot and at what distance do we draw the whiskers?

- Can a boxplot also tell us about the skewness of a distribution?

- How can you identify outliers in a dataset?

- Suppose you want to show the relationship between the height and weight of a person. Which plot would you use?
- What is the use of a contingency table?

- If you have data about the students enrolled in a college in science, commerce and humanities, and you want to visualize the total enrollment as well as the proportion of students enrolled in each stream, which plot would you use?

- What is the difference between descriptive statistics and inferential statistics?

- Is there any difference between sample distribution and sampling distribution?

- What are the parameters of a normal distribution?

- How is a standard normal distribution different from any other normal distribution?

- What is the total area under a standard normal distribution curve?

- Can you explain what a discrete probability distribution is and how it differs from a continuous probability distribution?

- Will the spread of a sample distribution be greater than, less than or equal to the spread of the sampling distribution?

- What is standard error?

- What is the Central Limit Theorem?

- What are the assumptions of the Central Limit Theorem?

- What are some examples of normally distributed data?

- Have you observed the Covid graph over time? Which probability distribution does it follow?

- What is a confidence interval around the mean, and why is it important in hypothesis testing?

- How do you define the null and alternative hypotheses? Explain with an example.

- Why do we do a Z test? What is the shape of a Z distribution?

- If the Z score of an observation is 8.25, what can you say about whether it is an outlier?

- When do we use a T test over a Z test, and what is the shape of a T distribution?

- What is the relationship between the sample size and the shape of the T distribution?

- Scientists have developed a new medication for weight loss. The medicine comes in 3 dosages: 50 mg, 100 mg and 200 mg. You want to find out whether there is any difference in the impact these 3 dosages have on weight loss. Which test would you perform?

- When do we use a Z test of proportion?

- What do you understand by p-value in a Z test?

- When do we use a chi square goodness of fit test?

- What is the relationship between the p-value and the alpha value?

- What is the difference between a single-tailed test and a double-tailed test?

- What are Type 1 and Type 2 errors in statistics? What is meant by the power of a test?

- Have you heard of the Pareto principle? Can you explain it with an example?

**Written by:**

Kunal

*Data science consultant and trainer (IndiQa Analytics LLP)*

*PCP in Business Analytics, IIM (K)*

*Certificate Program in Data Science, IIT (M)*

*MSc Nottingham Trent University*

*MBA Fore School of Management, New Delhi*