Testing statistical hypotheses

In this article we touch on data science, and we can't cover today's topic without Python. We will look at hypothesis testing. Hypotheses offer a more data-driven approach to business decision making: hypothesis testing is a decision-making mechanism based on statistical inference, and it provides a framework for answering a specific question.

Imagine a fish farm, and we want to know whether the fish population in that area has changed. To avoid counting the entire population, we can take samples from areas of the habitat and compare them with the previous period. For the null hypothesis we take the figure from the past period: the current population equals it. The alternative hypothesis states the opposite: the value is not equal to that figure.
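To make the idea concrete, here is a minimal sketch of how such a check could look in Python, assuming a one-sample t-test; the counts and the past-period value are made-up numbers, not data from the article:

```python
import numpy as np
from scipy import stats

# Hypothetical fish counts from sampled areas of the habitat (made-up data)
sample_counts = np.array([118, 102, 95, 130, 110, 99, 105, 121, 90, 112])

# H0: the mean count equals the past-period figure; H1: it does not
past_period_mean = 100

t_stat, p_value = stats.ttest_1samp(sample_counts, past_period_mean)
print(f"t = {t_stat:.3f}, p-value = {p_value:.3f}")
```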

Next comes the choice of a suitable statistical test (criterion), but first you need to understand what statistical power is. It is the probability that a given test will correctly reject the null hypothesis, i.e. the ability of the test to detect differences where they really exist.
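The article does not show a power calculation, but as an illustration, here is a minimal sketch using statsmodels, assuming an independent two-sample t-test, a medium effect size and the conventional 5% significance level:

```python
from statsmodels.stats.power import TTestIndPower

# Power analysis for an independent two-sample t-test (illustrative numbers):
# how many observations per group are needed to detect a medium effect
# (Cohen's d = 0.5) with 80% power at a 5% significance level?
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required sample size per group: {n_per_group:.1f}")
```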

Then we select the allowable relative error of the interval, i.e. how far the measured value may deviate from the true value. Let's say we want to be 95% sure that the true number of fish lies within plus or minus 5% of the calculated value.
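As an illustration of this step (not shown in the article), here is a minimal sketch of a 95% confidence interval and the resulting relative margin of error, again with made-up counts:

```python
import numpy as np
from scipy import stats

# Hypothetical sampled fish counts (made-up data)
counts = np.array([118, 102, 95, 130, 110, 99, 105, 121, 90, 112])

mean = counts.mean()
sem = stats.sem(counts)  # standard error of the mean

# 95% t-based confidence interval around the sample mean
low, high = stats.t.interval(0.95, df=len(counts) - 1, loc=mean, scale=sem)

print(f"mean = {mean:.1f}, 95% CI = ({low:.1f}, {high:.1f})")
print(f"relative margin of error: ±{(high - mean) / mean:.1%}")
```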

So let’s look at the steps of testing hypotheses:

  1. Formulate the hypotheses (null and alternative)
  2. Select the appropriate statistical test (t-test, chi-squared test)
  3. Choose the allowable error (significance level)
  4. Collect the data
  5. Analyze
  6. Make a decision

Example

Let’s determine if Tesla’s share price distributions for 2018 and 2019 (NASDAQ) are the same.

First, let's plot the price distributions for 2018 and 2019 separately, and then compare them with each other.
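The article does not show how the data was loaded or plotted; a possible sketch with pandas and matplotlib, assuming a hypothetical CSV export of daily closing prices named tsla.csv with Date and Close columns:

```python
import pandas as pd
import matplotlib.pyplot as plt

# "tsla.csv" is a hypothetical export of daily NASDAQ data with
# "Date" and "Close" columns; adjust to however the data was obtained.
df = pd.read_csv("tsla.csv", parse_dates=["Date"])
prices_2018 = df.loc[df["Date"].dt.year == 2018, "Close"]
prices_2019 = df.loc[df["Date"].dt.year == 2019, "Close"]

# Histogram of closing prices for each year, side by side
fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharex=True)
axes[0].hist(prices_2018, bins=30)
axes[0].set_title("Tesla closing price, 2018")
axes[1].hist(prices_2019, bins=30)
axes[1].set_title("Tesla closing price, 2019")
for ax in axes:
    ax.set_xlabel("Price, USD")
    ax.set_ylabel("Frequency")
plt.tight_layout()
plt.show()
```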

The 2018 distribution has a mean of 63.19.

The 2019 distribution has a mean of 55.05. Here there is a small spike to the right of 80–90 per share. Before running the test, let's filter the sample to remove this outlier.
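The filtering step is not shown in the article either; continuing the sketch above, one way to drop the spike (the threshold of 80 is an assumption based on the description):

```python
# Keep only the 2019 prices below the spike region; the threshold is an
# assumption, since the article only says the spike lies around 80-90.
prices_2019_filtered = prices_2019[prices_2019 < 80]
```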

A combined plot of the two distributions shows that, at first glance, they are different.

Let's compare them using Student's t-test. For this, a small function was written in Python: its first argument is the first distribution, the second is the second distribution, and the third is the allowable error (alpha). Inside, it calls the ttest_ind function from the SciPy package.

Then comes a condition: if the p-value of the t-test is greater than alpha, H0 is not rejected; otherwise it is rejected.
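The function itself is not reproduced in the article; judging by the description and the printed output, it likely looked something like the sketch below (the name compare_distributions and the final call are assumptions):

```python
from scipy.stats import ttest_ind

def compare_distributions(data1, data2, alpha):
    """Compare two samples with Student's t-test; alpha is the allowable error."""
    stat, p = ttest_ind(data1, data2)
    print('Statistics=%.3f, p-value=%.3f' % (stat, p))
    if p > alpha:
        print('Same distribution (H0 is not rejected)')
    else:
        print('Different distribution (reject H0)')

# Assumed call with the 2018 prices, the filtered 2019 prices and a 5% alpha
compare_distributions(prices_2018, prices_2019_filtered, 0.05)
```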

Test result:

Statistics=-2.469, p-value=0.018

Different distribution (reject H0)

Accordingly, we reject H0 and accept the alternative hypothesis that the two distributions are different. We can also see that the 2019 data is more stable, that is, less volatile than in 2018.
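The volatility claim can be checked directly, for example by comparing the standard deviation (or the coefficient of variation) of each year's prices; a minimal sketch, reusing the arrays from the earlier sketches:

```python
import numpy as np

# Coefficient of variation (std relative to the mean) as a simple
# volatility measure for each year's closing prices.
for label, prices in [("2018", prices_2018), ("2019", prices_2019)]:
    cv = np.std(prices, ddof=1) / np.mean(prices)
    print(f"{label}: mean={np.mean(prices):.2f}, "
          f"std={np.std(prices, ddof=1):.2f}, CV={cv:.2%}")
```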

CaseWare IDEA will help you carry out this analysis with a single click on the program's tab: our specialists have taken care of that and have developed many other “push-button” solutions for you.