Mathematics desk

Welcome to the Wikipedia Mathematics Reference Desk Archives

April 15

What does it mean when we say that data is normally distributed?

Suppose we have a set $S$ of data, possibly infinite. When we say that the data is normally distributed, what do we mean mathematically? My understanding is that we consider the probability space $(\mathbb {R} ,{\mathcal {B}},P)$, where ${\mathcal {B}}$ is the sigma-algebra of all real Borel sets and $P(B)=\int _{B}f(x)\,dx$, with $f(x)$ the normal PDF whose mean and variance are given by the descriptive statistics of the data. On this probability space we further take the identity function as a random variable $X$. The statement that the data is normally distributed then means (as I understand it) that the frequency polygon of the data approximately coincides with the graph of the PDF. I have not been able to find this written anywhere, however, so I am asking whether my understanding is correct. If it is, can it be made more rigorous? If not, what is the correct meaning of saying that the data is normally distributed? -- Abdul Muhsy talk 07:26, 15 April 2022 (UTC)

The term is often used somewhat loosely, not to say sloppily, when the authors actually should have said that the sample distribution is not significantly different from a (best-fit) normal distribution, as might be revealed by a normality test. Probably, the test they applied was the squinting test: squinting their eyes and noticing some similarity. In such cases one should not be surprised they are not more precise. Are you aware of cases where the claim is applied to an infinite set? --Lambiam 09:37, 15 April 2022 (UTC)
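As a rough illustration of "not significantly different from a (best-fit) normal distribution", here is a minimal sketch that measures the largest gap between the empirical CDF of a sample and the CDF of the normal distribution fitted to that sample (the Kolmogorov-Smirnov distance). The function name, the two simulated samples, and the seed are assumptions made for the example, not anything from the thread. The sketch only computes a distance; a proper normality test would also compare it against critical values that account for the parameters having been estimated from the same data.

```python
import random
from statistics import NormalDist, fmean, stdev

def ks_distance_to_fitted_normal(sample):
    """Largest gap between the empirical CDF and the best-fit normal CDF."""
    xs = sorted(sample)
    n = len(xs)
    # "Best fit" here means: plug in the sample mean and sample standard deviation.
    fitted = NormalDist(fmean(xs), stdev(xs))
    gap = 0.0
    for i, x in enumerate(xs):
        cdf = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        gap = max(gap, cdf - i / n, (i + 1) / n - cdf)
    return gap

# Hypothetical data: one sample drawn from a normal law, one from a skewed law.
rng = random.Random(0)
normal_sample = [rng.gauss(10.0, 2.0) for _ in range(2000)]
skewed_sample = [rng.expovariate(1.0) for _ in range(2000)]

d_normal = ks_distance_to_fitted_normal(normal_sample)
d_skewed = ks_distance_to_fitted_normal(skewed_sample)
print(f"normal sample: D = {d_normal:.4f}")
print(f"skewed sample: D = {d_skewed:.4f}")
```

A small distance does not prove normality; it only says this particular check found no large discrepancy, which is the sense of "not significantly different" used above.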

Thanks. The first option you give (apply a normality test to the data and, if there is insufficient evidence to reject normality, call it normal) is rigorous enough, but a person not well versed in statistical theory will probably not understand the justification easily. Your second criterion, squinting, is somewhat similar to my understanding and seems more satisfying visually. To clarify, are we taking the frequency polygon (after standardizing the data), squinting, and then 'seeing' whether it resembles the PDF of N(0,1)? If so, is standardizing necessary? If we are just looking at the data (and not at any visual representation of it), what exactly are we looking for? Thirdly, to answer your question, I think it is possible to construct a number whose digits follow any given distribution, but I am not competent enough to fully understand the proof [1]. -- Abdul Muhsy talk 11:43, 15 April 2022 (UTC)

To start with the last point, if we have an infinite sequence $(s_{1},s_{2},\ldots )$, we can take the initial segments $I_{1}=(s_{1})$, $I_{2}=(s_{1},s_{2})$, ..., $I_{n}=(s_{1},s_{2},\ldots ,s_{n})$, ..., and hope that the distributions of $I_{1},I_{2},\ldots$ converge to a limit distribution. This may or may not be the case, depending on the sequence. If we are only given the set $S=\textstyle {\bigcup _{i}}\{s_{i}\}$, there is not enough information to assign a limit distribution; a new sequence $(s'_{1},s'_{2},\ldots )$ obtained by reshuffling the elements of the old sequence can have a definite but different limit distribution, while containing the same set of elements $S=\textstyle {\bigcup _{i}}\{s'_{i}\}$.
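The reshuffling point can be checked on a toy case. The following is a minimal sketch, with all names chosen for the example: it lists the positive integers in two different orders, each containing every integer exactly once, and tracks one simple summary of the initial segments, namely the fraction of even terms. That fraction stands in for "the distribution of $I_{n}$" here; it is an assumption of the illustration, not a full distribution.

```python
from itertools import count, islice

def natural_order():
    """Yield 1, 2, 3, 4, ...: every positive integer exactly once."""
    return count(1)

def two_odds_then_one_even():
    """Yield 1, 3, 2, 5, 7, 4, ...: the same integers, reshuffled."""
    odds = count(1, 2)
    evens = count(2, 2)
    while True:
        yield next(odds)
        yield next(odds)
        yield next(evens)

def even_fraction(sequence, n):
    """Fraction of even values in the initial segment of length n."""
    segment = islice(sequence, n)
    return sum(1 for s in segment if s % 2 == 0) / n

for n in (30, 3000, 30000):
    a = even_fraction(natural_order(), n)
    b = even_fraction(two_odds_then_one_even(), n)
    print(f"n = {n}: natural order {a:.4f}, reshuffled {b:.4f}")
```

Both sequences run through the same set of values, yet the fraction of even terms settles near one half in the first ordering and near one third in the second, so the set alone does not determine a limit.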

I don't know how squinters reach their conclusions; you'd have to ask them personally. I expect that if you ask one hundred statisticians to graph a normal distribution by hand, at least 99 will be seriously off, mostly because the tails will be far too fat. If true, this does not bode well for outlier tests based on sample estimates of the dispersion applied to a population distribution assumed normal. --Lambiam 17:42, 15 April 2022 (UTC)
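To get a feel for how thin the tails of a normal curve really are, here is a small sketch using Python's standard library. The helper names are made up for the example. It reports the height of the standard normal density at a few distances from the mean, as a fraction of the peak height, together with the two-sided tail probability.

```python
from statistics import NormalDist

standard_normal = NormalDist(0.0, 1.0)
peak = standard_normal.pdf(0.0)

def relative_height(z):
    """Height of the standard normal curve at z, as a fraction of its peak."""
    return standard_normal.pdf(z) / peak

def two_sided_tail(z):
    """Probability of landing more than z standard deviations from the mean."""
    return 2.0 * (1.0 - standard_normal.cdf(z))

for z in (1, 2, 3, 4):
    print(f"z = {z}: height = {relative_height(z):.5f} of peak, "
          f"P(|Z| > z) = {two_sided_tail(z):.5f}")
```

At three standard deviations the curve is already down to about 1% of its peak height, which is lower than many hand-drawn bell curves suggest.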

It doesn't have any mathematically precise meaning to say that observed data are normally distributed. It's actually the process that generates the data that gives rise to a probability distribution. It's still useful in practice to talk about observed data being normally distributed, and I trust without having read it that your discussion with Lambiam has covered the appropriate points.

But if you're bringing up Borel sets, you're talking about a different level of the discussion. This gets into philosophy and interpretation of probability pretty fast. --Trovatore (talk) 18:21, 15 April 2022 (UTC)