How to check for data normality using kstest?

Suppose I have a data set of about 100 numbers, as listed below. How do I properly determine whether this data set is normally distributed using kstest()? The documentation says to subtract the mean and then divide by the standard deviation before passing the data to kstest(), but do I need to do that in this case?
Dataset = [64 66 80 66 76 55 57 72 76 68 81 70 82 80 71 74 83 80 76 78 72 74 76 65 61 75 68 80 88 73 76 71 70 74 70 76 66 72 80 75 81 82 84 86 71 82 77 78 80 78 88 77 73 72 74 68 75 62 65 71 72 75 72 75 76 73 81 71 61 61 71 81 73 67 77 77 80 57 70 73 80 75 70 75 74 70 68 80 85 81 71 80 80 78 75 75 80 76 82 75 57];
PS: I only need to test whether the data is normal, and I must use kstest to do it.

Accepted Answer

Rik on 16 Sep 2021
If you want to test if your data is from a standard normal distribution you should not change it before calling kstest.
If you want to test if your data is normally distributed (but not necessarily from the standard normal distribution), you will first have to standardize it by subtracting the sample mean and dividing by the sample standard deviation.
Which of the two is relevant for your case depends on your context. I'm guessing you want the second one, otherwise you don't need the test.
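The second option can be sketched as follows. This is a minimal example rather than a definitive recipe: the variable names z, h, and p are my own, and the data is copied from the question. Because the mean and standard deviation are estimated from the same sample, the p-value returned by kstest is only approximate here; lillietest (listed under "See also" in the kstest help) is designed for that situation, if you are ever free to use it.

```matlab
% Data from the question, copied verbatim (101 values).
Dataset = [64 66 80 66 76 55 57 72 76 68 81 70 82 80 71 74 83 80 76 78 ...
           72 74 76 65 61 75 68 80 88 73 76 71 70 74 70 76 66 72 80 75 ...
           81 82 84 86 71 82 77 78 80 78 88 77 73 72 74 68 75 62 65 71 ...
           72 75 72 75 76 73 81 71 61 61 71 81 73 67 77 77 80 57 70 73 ...
           80 75 70 75 74 70 68 80 85 81 71 80 80 78 75 75 80 76 82 75 ...
           57];

% Standardize: subtract the sample mean, divide by the sample standard deviation.
z = (Dataset - mean(Dataset)) / std(Dataset);

% Test the standardized values against the standard normal N(0,1),
% which is the default null distribution of kstest.
[h, p] = kstest(z);

% h = 0: do not reject normality at the 5% level; h = 1: reject it.
fprintf('h = %d, p = %.4g\n', h, p);
```

If h comes back as 0, the test found no evidence against normality at the 5% level, which is not the same as proving the data is normal.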
  2 Comments
DANIEL KONG LEN HAO on 18 Sep 2021
Alright, thank you! I was only looking to test for a normal distribution. One more question: does a smaller p-value (probability) in the K-S test mean the data is more likely or less likely to be normally distributed?
Rik on 18 Sep 2021
That is easy to determine: since your data is absolutely not from a standard normal distribution, you can pass your unaltered data to kstest and look at the result. You can also read the documentation:
help kstest
 KSTEST Single sample Kolmogorov-Smirnov goodness-of-fit hypothesis test.
    H = KSTEST(X) performs a Kolmogorov-Smirnov (K-S) test to determine if a
    random sample X could have come from a standard normal distribution,
    N(0,1). H indicates the result of the hypothesis test:
       H = 0 => Do not reject the null hypothesis at the 5% significance level.
       H = 1 => Reject the null hypothesis at the 5% significance level.

    X is a vector representing a random sample from some underlying
    distribution, with cumulative distribution function F. Missing
    observations in X, indicated by NaNs (Not-a-Number), are ignored.

    [H,P] = KSTEST(...) also returns the asymptotic P-value P.

    [H,P,KSSTAT] = KSTEST(...) also returns the K-S test statistic KSSTAT
    defined above for the test type indicated by TAIL.

    [H,P,KSSTAT,CV] = KSTEST(...) returns the critical value of the test CV.

    [...] = KSTEST(X,'PARAM1',val1,'PARAM2',val2,...) specifies one or
    more of the following name/value pairs:

        Parameter   Value
        'alpha'     A value ALPHA between 0 and 1 specifying the
                    significance level. Default is 0.05 for 5% significance.
        'CDF'       CDF is the c.d.f. under the null hypothesis. It can
                    be specified either as a ProbabilityDistribution object
                    or as a two-column matrix. Default is the standard
                    normal, N(0,1).
        'Tail'      A string indicating the type of test. The one-sample
                    K-S test tests the null hypothesis that F = CDF (that
                    is, F(x) = CDF(x) for all x) against the alternative
                    specified by TAIL:
                       'unequal' -- "F not equal to CDF" (two-sided test) (Default)
                       'larger'  -- "F > CDF" (one-sided test)
                       'smaller' -- "F < CDF" (one-sided test)

    Let S(X) be the empirical c.d.f. estimated from the sample vector X, F(X)
    be the corresponding true (but unknown) population c.d.f., and CDF be the
    known input c.d.f. specified under the null hypothesis. For TAIL =
    'unequal', 'larger', and 'smaller', the test statistics are
    max|S(X) - CDF(X)|, max[S(X) - CDF(X)], and max[CDF(X) - S(X)],
    respectively.

    In the matrix version of CDF, column 1 contains the x-axis data and
    column 2 the corresponding y-axis c.d.f data. Since the K-S test
    statistic will occur at one of the observations in X, the calculation is
    most efficient when CDF is only specified at the observations in X. When
    column 1 of CDF represents x-axis points independent of X, CDF is
    're-sampled' at the observations found in the vector X via interpolation.
    In this case, the interval along the x-axis (the column 1 spread of CDF)
    must span the observations in X for successful interpolation.

    The decision to reject the null hypothesis is based on comparing the
    p-value P with ALPHA, not by comparing the statistic KSSTAT with the
    critical value CV. CV is computed separately using an approximate formula
    or by interpolation in a table. The formula and table cover the range
    0.01<=ALPHA<=0.2 for two-sided tests and 0.005<=ALPHA<=0.1 for one-sided
    tests. CV is returned as NaN if ALPHA is outside this range. Since CV is
    approximate, a comparison of KSSTAT with CV may occasionally lead to a
    different conclusion than a comparison of P with ALPHA.

    See also KSTEST2, LILLIETEST, CDFPLOT.

    Documentation for kstest
       doc kstest
[h,p]=kstest([64 66 80 66 76 55 57 72 76 68 81 70 82 80 71 74 83 80 76 78 72 74 76 65 61 75 68 80 88 73 76 71 70 74 70 76 66 72 80 75 81 82 84 86 71 82 77 78 80 78 88 77 73 72 74 68 75 62 65 71 72 75 72 75 76 73 81 71 61 61 71 81 73 67 77 77 80 57 70 73 80 75 70 75 74 70 68 80 85 81 71 80 80 78 75 75 80 76 82 75 57])
h = logical
1
p = 3.2646e-90
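So you can see your answer here: a small p-value means the data is less likely to come from the distribution being tested. In this call that distribution is the standard normal, N(0,1), because the data was passed in unaltered.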
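To see how the p-value drives the decision, here is a hedged sketch that runs the raw data twice: once against the default N(0,1), and once against a normal distribution with the sample's own mean and standard deviation, supplied through the 'CDF' option described in the help text. The variable names (alpha, pd, hRaw, pRaw, hFit, pFit) are my own choices, and fitting the parameters from the same sample makes the second p-value approximate.

```matlab
% Data from the question, copied verbatim (101 values).
Dataset = [64 66 80 66 76 55 57 72 76 68 81 70 82 80 71 74 83 80 76 78 ...
           72 74 76 65 61 75 68 80 88 73 76 71 70 74 70 76 66 72 80 75 ...
           81 82 84 86 71 82 77 78 80 78 88 77 73 72 74 68 75 62 65 71 ...
           72 75 72 75 76 73 81 71 61 61 71 81 73 67 77 77 80 57 70 73 ...
           80 75 70 75 74 70 68 80 85 81 71 80 80 78 75 75 80 76 82 75 ...
           57];

alpha = 0.05;  % significance level (the kstest default)

% 1) Raw data against the default null distribution N(0,1).
[hRaw, pRaw] = kstest(Dataset, 'alpha', alpha);

% 2) Raw data against a normal distribution with the sample mean and
%    standard deviation, passed as a probability distribution object.
pd = makedist('Normal', 'mu', mean(Dataset), 'sigma', std(Dataset));
[hFit, pFit] = kstest(Dataset, 'CDF', pd, 'alpha', alpha);

% kstest rejects the null hypothesis (h = 1) exactly when p < alpha.
fprintf('vs N(0,1):         h = %d, p = %.4g\n', hRaw, pRaw);
fprintf('vs N(mean, std^2): h = %d, p = %.4g\n', hFit, pFit);
```

The first call reproduces the rejection shown above. The second is the 'CDF' counterpart of standardizing the data before calling kstest, so it answers the "is it normal at all" question rather than the "is it standard normal" one.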

Release

R2018b
