• Jeremiah
    1.5k
    "I did not intend you to feel intimidated or patronised, so please try to be less aggressive." (fdrake)

    I treat everyone the same; I equally dislike everyone, you are not special to me, and therefore I see no reason to treat you any differently. Also, I am not interested in your excuses for such sloppy citation; if you hold a college degree you should know how to cite properly regardless of your area of study. You should know that referencing a whole course is just dumb, and it is what people typically do when they are evading. If you don't like my attitude, then don't talk with me; it is that simple, and my feelings will not be hurt.

    I am glad you are a statistician, and to be honest I don't think we are that far apart; I just find some of your word choices confusing. However, it should be pointed out that as a base principle I reject notions of yielding to what may be seen as a greater authority. I admit you would hold that edge, but I can't surrender my own process of reasoning to someone I know nothing about.

    Please clarify what you mean by "random variable." Or don't; if you don't like me enough to respond, I will understand.
  • fdrake
    6.6k


    Ok. This is a random variable:

    Let (O,S,P) be a probability space, where O is a set of outcomes and S a sigma algebra on the set of outcomes; then a random variable X is defined as a measurable function from (O,S,P) to some set of values R. A measurable function is a function X such that the pre-image of every measurable set is measurable. Typically, we have:

    O be (some subset of) the real numbers, S be the Borel sigma algebra on O, and R be the real numbers. Measurability here is defined with respect to the Lebesgue measure on the measure spaces (O,S) and (R,B(R)), where B(R) is the set of Borel sets on R. In this case we have a real probability space and a real-valued random variable X.

    A probability measure is a measure P such that P(O)=1. It inherits the familiar properties of probability distributions from introductory courses by virtue of being a measure. The missing properties are entailed by something called sigma-additivity: if you take a countable sequence of pairwise disjoint sets C_n in S, then P(Union of C_n over n) = Sum(P(C_n) over n).
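
    To make the definitions concrete, here is a minimal sketch in Python of a toy probability space on a finite outcome set (my own illustration, not taken from the reference below; on a finite set the power set serves as the sigma algebra, so every function is measurable):

    ```python
    # Toy illustration of (O, S, P) and a random variable X on a finite outcome set.
    from itertools import combinations

    O = ['HH', 'HT', 'TH', 'TT']                      # outcomes of two coin tosses
    S = [set(c) for r in range(len(O) + 1)            # sigma algebra: here, the power set
         for c in combinations(O, r)]
    P = lambda A: len(A) / len(O)                     # uniform probability measure, P(O) = 1

    X = lambda o: o.count('H')                        # random variable: number of heads

    # Pre-image of a set of values, e.g. {2}, is measurable (it lies in S):
    preimage = {o for o in O if X(o) == 2}
    assert preimage in S
    print(P(preimage))                                # 0.25 = P(X = 2)

    # Sigma-additivity on disjoint sets:
    A, B = {'HH'}, {'HT', 'TH'}
    assert P(A | B) == P(A) + P(B)
    ```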

    If you would like a quick reference to these ideas: see this. Chapter 3 contains the definition of measures and measurable spaces (also covering discrete random variables, which I haven't done here). Chapter 8 studies probability measures.
  • Jeremiah
    1.5k


    Thanks for the reply, I'll have to examine it later when I have time for a response.
  • fdrake
    6.6k


    The reason I used an intuitive rather than a mathematical description was that nobody knows what sigma algebras and measurable functions are.
  • Jeremiah
    1.5k


    Personally, I just think writing is not your strong suit.

    Tell me, how is this:

    "A random variable is a mapping from a collection of possible events and rules for combining them (called a sigma algebra) to a set of values it may take. More formally, a random variable is a measurable mapping from a probability space to a set of values it can take." (fdrake)

    Not the same thing I said? When I said:
    "the frequency of possible outcomes from a repeated random event." (Jeremiah)
  • fdrake
    6.6k
    The notion that it is 'repeated' is absent from the measure theoretic definition. The 'long-run frequency' interpretation of probability is actually contested.
  • Jeremiah
    1.5k


    The repeated is how we approximate the true value.
  • fdrake
    6.6k
    I'll respond in another thread.
  • Jeremiah
    1.5k


    I probably won't respond. This thread is about chance, and our discussion is relative to that. I see no reason to start another thread.
  • Jeremiah
    1.5k


    Btw, I would very much like to know how you plan to do statistics without repeated random events.
  • Jeremiah
    1.5k
    Do you plan to just guess your numbers?

    I think the mean is 12, yep 12 just feels right.
  • Jeremiah
    1.5k
    Wait I am getting a new reading from my crystal ball, the mean is definitely 13.
  • fdrake
    6.6k
    I replied to you in a new thread.
  • fdrake
    6.6k
    Honestly, direct mockery and non-collaboration from the frequentist paradigm of statistics was something that troubled the field up until (and after) the development of Markov Chain Monte Carlo - though Bayesians definitely shot back. After the discovery of Markov Chain Monte Carlo, the Bayesian interpretation of probability had to be taught as respectable because of the sheer pragmatic power of Bayesian methods. If you sum up the citations of MCMC's founding papers, you get somewhere in the region of 40k. Nowadays you can find a whole host of post-Bayesian/frequentist papers that, for example, translate frequentist risk results in terms of implicit prior distributions, or look at the frequentist/large-sample properties of Bayesian estimators.
  • Jeremiah
    1.5k
    Even Monte Carlo depends on repeated random events. All of statistics does.
  • fdrake
    6.6k


    Markov Chain Monte Carlo uses a Bayesian interpretation of probability. The methods vary, but they all compute Bayesian quantities (such as 'full conditionals' for a Gibbs Sampler). If you've not read the thread I made in response to you, please do so. You're operating from a position of non-familiarity with a big sub-discipline of statistics.
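
    For anyone curious what a 'full conditional' looks like in practice, here is a minimal Gibbs sampler sketch in Python for a bivariate normal with correlation rho (an illustrative toy of my own, not any particular applied model):

    ```python
    # Minimal Gibbs sampler for a standard bivariate normal with correlation rho.
    # Each step draws one coordinate from its full conditional given the other.
    import numpy as np

    rng = np.random.default_rng(0)
    rho = 0.8                      # illustrative correlation value
    n_iter = 5000
    x, y = 0.0, 0.0
    samples = np.empty((n_iter, 2))

    for i in range(n_iter):
        # Full conditional of x given y: Normal(rho * y, 1 - rho**2)
        x = rng.normal(rho * y, np.sqrt(1 - rho**2))
        # Full conditional of y given x: Normal(rho * x, 1 - rho**2)
        y = rng.normal(rho * x, np.sqrt(1 - rho**2))
        samples[i] = (x, y)

    print(np.corrcoef(samples[1000:].T))   # empirical correlation is close to rho after burn-in
    ```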
  • Jeremiah
    1.5k
    Btw, for those looking in: repeated random events are the actual data.
  • fdrake
    6.6k
    Is it really that difficult to disentangle these two ideas?

    1) Statistics deals with random events.
    2) The interpretation of probability.
  • Jeremiah
    1.5k
    Tell me, how do you plan on making your probability distribution without data?
  • fdrake
    6.6k
    If copy-pasting the response in the other thread to you is what it takes to get you to stop being petulant, here it is:

    This is a response to @Jeremiah from the 'Chance, Is It Real?' thread.

    Pre-amble: I'm going to assume that someone reading it knows roughly what an 'asymptotic argument' is in statistics. I will also gloss over the technical specifics of estimating things in Bayesian statistics, instead trying to suggest their general properties in an intuitive manner. However, it is impossible to discuss the distinction between Bayesian and frequentist inference without some technicality, so it is unlikely that someone without a basic knowledge of statistics will understand this post fully.

    [Collapsed reveal: rough definition of 'asymptotic argument' (content not shown)]

    In contemporary statistics, there are two dominant interpretations of probability.

    1) That probability is always proportional to the long-term frequency of a specified event.
    2) That probability is the quantification of uncertainty about the value of a parameter in a statistical model.

    (1) is usually called the 'frequentist interpretation of probability' and (2) the 'Bayesian interpretation of probability', though there are others. Each of these philosophical positions has numerous consequences for how data is analysed. I will begin with a brief history of the two viewpoints.

    The frequentist idea of probability can trace its origin to Ronald Fisher, who gained his reputation partly through the analysis of genetics in terms of probability - being a founding father of modern population genetics - and partly through the design and analysis of comparative experiments, developing the analysis of variance (ANOVA) method for their analysis. I will focus on the developments resulting from the latter, eliding technical detail. Bayesian statistics is named after Thomas Bayes, the discoverer of Bayes' Theorem, which arose in analysing games of chance. More technical details are provided later in the post. Suffice it to say for now that Bayes' Theorem is the driving force behind Bayesian statistics, and that it contains a quantity called the prior distribution, whose interpretation is incompatible with frequentist statistics.

    The ANOVA is an incredibly commonplace method of analysis in applications today. It allows experimenters to ask questions about the variation of a quantitative observation over a set of categorical experimental conditions.

    For example, in agricultural field experiments: 'Which of these fertilisers is the best?'

    The application of fertilisers is termed a 'treatment factor'; say there are 2 fertilisers called 'Melba' and 'Croppa', then the 'treatment factor' has two levels (values it can take), 'Melba' and 'Croppa'. Assume we have one field treated with Melba, and one with Croppa. Each field is divided into (say) 10 units, and after the crops are fully grown, the total mass of vegetation in each unit will be recorded. An ANOVA allows us to (try to) answer the question 'Is Croppa better than Melba?'. This is done by computing the mean of the vegetation mass for each field and comparing the difference in means with the observed variation in the masses. Roughly: if the difference in mean masses for Croppa and Melba (Croppa - Melba) is large compared to how variable the masses are, we can say there is evidence that Croppa is better than Melba. How?

    This is done by means of a hypothesis test. At this point we depart from Fisher's original formulation and move to the more modern development by Neyman and Pearson (which is now the industry standard). A hypothesis test is a procedure for taking a statistic, like 'the difference between Croppa and Melba', and assigning a probability to it. This probability is obtained by assuming a base experimental condition called 'the null hypothesis', several 'modelling assumptions', and an asymptotic argument.

    In the case of this ANOVA, these are roughly:

    A) Modelling assumptions: variation between treatments only manifests as variations in means, any measurement imprecision is distributed Normally (a bell curve).
    B) Null hypothesis: There is no difference in mean yields between Croppa and Melba
    C) Asymptotic argument: assume that B is true; then what is the probability of observing a difference in yields at least as large as the one in the experiment, assuming we had an infinitely large sample or infinitely many repeated samples? We can find this through the use of the Normal distribution (or, more specifically for ANOVAs, a derived F distribution, but this specificity doesn't matter).

    The combination of B and C is called a hypothesis test.

    The frequentist interpretation of probability is used in C. This is because a probability is assigned to the observed difference by calculating on the basis of 'what if we had an infinite sample size or infinitely many repeated experiments of the same sort?' and the derived distribution for the problem (what defines the randomness in the model).
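
    To make steps A, B and C concrete, here is a sketch of that frequentist analysis in Python, with invented yield numbers (the data are purely illustrative):

    ```python
    # Frequentist comparison of the two fertilisers with invented yields (mass per unit).
    # A one-way ANOVA with two groups is equivalent to a two-sample t-test (F = t**2).
    from scipy import stats

    melba  = [41.2, 39.8, 43.1, 40.5, 42.0, 38.9, 41.7, 40.2, 42.8, 39.5]
    croppa = [44.0, 43.2, 45.1, 42.7, 44.8, 43.9, 45.5, 42.1, 44.3, 43.6]

    f_stat, p_value = stats.f_oneway(melba, croppa)
    print(f_stat, p_value)
    # The p-value is the probability, under the null hypothesis of equal means and the
    # modelling assumptions, of an F statistic at least this large - the sampling-distribution
    # (asymptotic) argument described in C.
    ```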

    An alternative method of analysis, a Bayesian analysis, would allow the same modelling assumptions (A), but would base its conclusions on the following procedure:

    A) the same as before
    B) Define what is called a prior distribution on the error variance.
    C) Fit the model using Bayes Theorem.
    D) Calculate the odds ratio of the statement 'Croppa is better than Melba' to 'Croppa is worse than or equal to Melba' using the derived model.

    I will elide the specifics of fitting a model using Bayes Theorem. Instead I will provide a rough sketch of a general procedure for doing so below. It is more technical, but still only a sketch to provide an approximate idea.

    Bayes theorem says that for two events A and B and a probability evaluation P:
    P(A|B) = P(B|A)P(A) / P(B)
    where P(A|B) is the probability that A happens given that B has already happened, the conditional probability of A given B. If we also allow P(B|A) to depend on the data X, we can obtain P(A|B,X), which is called the posterior distribution of A.

    For our model, we would have P(B|A) be the likelihood as obtained in frequentist statistics (modelling assumptions), in this case a normal likelihood given the parameter A = the noise variance of the difference between the two quantities. And P(A) is a distribution the analyst specifies without reference to the specific values obtained in the data, supposed to quantify the a priori uncertainty about the noise variance of the difference between Croppa and Melba. P(B) is simply a normalising constant to ensure that P(A|B) is indeed a probability distribution.

    Bayesian inference instead replaces the assumptions B and C with something called the prior distribution and the likelihood, Bayes' Theorem, and an odds ratio. The prior distribution for the ANOVA is a guesstimate of how variable the measurements are without looking at the data (again, an approximate idea; there is a huge literature on this). This guess is a probability distribution over all the values that are sensible for the measurement variability. This whole distribution is called the prior distribution for the measurement variability. It is then combined with the modelling assumptions to produce a distribution called the 'posterior distribution', which plays the same role in inference as the modelling assumptions and the null hypothesis do in the frequentist analysis. This is because the posterior distribution then allows you to produce estimates of how likely the hypothesis 'Croppa is better than Melba' is compared to 'Croppa is worse than or equal to Melba'; this comparison is called an odds ratio.

    The take-home message is that in a frequentist hypothesis test we are trying to infer the unknown fixed value of a population parameter (the difference between the Croppa and Melba means), whereas in Bayesian inference we are trying to infer the posterior distribution of the parameters of interest (the difference between the Croppa and Melba mean weights and the measurement variability). Furthermore, the assignment of an odds ratio in Bayesian statistics does not have to depend on an asymptotic argument relating the null hypothesis and alternative hypothesis to the modelling assumptions. Also, it is impossible to specify a prior distribution through frequentist means (it does not represent the long-run frequency of any event, nor an observation of it).
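
    As a sketch of the Bayesian side with the same invented yields, here is a deliberately simplified conjugate calculation in Python. Note the simplifications: the prior is placed on the mean difference rather than on the error variance as described above, and the noise variance is plugged in from the data, purely to keep the illustration short:

    ```python
    # Bayesian sketch: normal-normal conjugate model for the mean difference
    # delta = Croppa - Melba, with the noise variance treated as known.
    import numpy as np
    from scipy import stats

    melba  = np.array([41.2, 39.8, 43.1, 40.5, 42.0, 38.9, 41.7, 40.2, 42.8, 39.5])
    croppa = np.array([44.0, 43.2, 45.1, 42.7, 44.8, 43.9, 45.5, 42.1, 44.3, 43.6])

    d_hat = croppa.mean() - melba.mean()
    se2 = melba.var(ddof=1) / len(melba) + croppa.var(ddof=1) / len(croppa)

    # Prior on delta: Normal(0, 10**2), a vague prior chosen without looking at the data.
    prior_mean, prior_var = 0.0, 100.0

    # Conjugate update: the posterior for delta is also Normal.
    post_var = 1.0 / (1.0 / prior_var + 1.0 / se2)
    post_mean = post_var * (prior_mean / prior_var + d_hat / se2)

    p_better = 1 - stats.norm.cdf(0, loc=post_mean, scale=np.sqrt(post_var))
    print(p_better / (1 - p_better))   # posterior odds of 'Croppa is better than Melba'
    ```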

    Without arguing which is better, this should hopefully clear up (to some degree) my disagreement with @Jeremiah and perhaps provide something interesting to think about for the mathematically inclined.
  • Jeremiah
    1.5k
    You are and have been avoiding my question.
  • fdrake
    6.6k
    There are lots of motivations for choosing prior distributions. Broadly speaking they can be chosen in two ways: through expert information and previous studies, or to induce a desirable property in the inference. In the first sense, you can take a posterior distribution, or a distribution from a frequentist paper, from previous studies through moment matching, or alternatively you can elicit quantiles from experts. In the second sense, there are many possible reasons for choosing a prior distribution.

    Traditionally, priors were chosen to be 'conjugate' to their likelihoods because that made analytic computation of posterior distributions possible. In the spirit of 'equally possible things are equally probable', there are families of uninformative prior distributions which are alleged to express a lack of knowledge about the parameter values in the likelihood; as examples, you can look at entropy-maximizing priors, the Jeffreys prior, or uniform priors with large support. Alternatively, priors can be chosen on asymptotic frequentist principles; for this you can look at reference priors.

    Motivated by the study of random effect models, there is often a need to make the inference more conservative than a model component with an uninformative prior typically allows. If, for example, you have a random effect model for a single factor (with <5 levels), posterior estimates of the variance and precision of the random effect will be unstable [in the sense of huge variance]. This issue has been approached in numerous ways, depending on the problem type at hand. For example, in spatial statistics, when estimating the correlation function of the Matern field (a spatial random effect) in addition to other effects, the correlation parameter can be shrunk towards 1. This can be achieved by defining a prior on the scale of an information-theoretic difference (like the Kullback-Leibler divergence or Hellinger distance). More recently, a family of prior distributions called hypergeometric inverted beta distributions has been proposed for 'top level' variance parameters in random effect models, with the celebrated Half-Cauchy prior on the standard deviation being a popular choice for regularization.
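
    As a small illustration of the 'conjugate' and 'uninformative' choices mentioned above, here is a Beta-Binomial sketch in Python (the prior parameters and data are invented for the example):

    ```python
    # Conjugate Beta-Binomial update: with a Beta(a, b) prior on a proportion p and
    # k successes in n trials, the posterior is Beta(a + k, b + n - k) in closed form.
    from scipy import stats

    a, b = 2.0, 2.0          # prior pseudo-counts (illustrative choice)
    k, n = 7, 10             # invented data: 7 successes in 10 trials

    posterior = stats.beta(a + k, b + n - k)
    print(posterior.mean(), posterior.interval(0.95))

    # A Jeffreys prior for the same model is Beta(1/2, 1/2); a uniform prior is Beta(1, 1).
    jeffreys_posterior = stats.beta(0.5 + k, 0.5 + n - k)
    print(jeffreys_posterior.mean())
    ```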
  • Jeremiah
    1.5k


    Personally, I don't think you are making the connection here; I mean, at certain points you are in agreement with me and I don't think you realize that. You think this is statistics, but it is not; this is, for lack of a better word, philosophy. You make these long-winded, jargon-filled proclamations that are completely off the mark. I mean, bringing up Monte Carlo was a face-palm moment. I think your problem is that you are not thinking about this philosophically. It is the age-old debate: does the string have length because that is an objective property of the string, or does it have length because we created the ruler? The same holds for probability: if it is not derived from the real world for application in the real world, is it really a measurement? I am not discarding the conceptual components, but saying that alone they are incomplete.


    *** Edit - Auto-spell hijacked one of my words.
  • fdrake
    6.6k
    The long post I made details a philosophical distinction between frequentist and Bayesian interpretations of probability. To summarize: the meaning of probability doesn't have to depend on long-run frequency. This was our central disagreement. I provided an interpretation of probability which is accepted in the literature showing exactly that.

    If you read the other thread, you would also see I made a comment saying that the differences in probability interpretation occur roughly on the level of parameter estimation and the interpretation of parameters - the mathematical definition of random variables and probability measures has absolutely nothing to say about whether probability 'is really frequentist' or 'is really Bayesian'. I gave you an argument and references to show that the definition of random variables and probability doesn't depend on the fundamental notions you said that it did.

    Furthermore, the reason I posted the technical things that I did was to give you some idea of the contemporary research on the topics and the independence of fundamental statistical concepts from philosophical interpretations of probability. If you were not a statistics student I would have responded completely differently [in the intuitive manner you called me to task for].

    This is relevant because most topics in the philosophy of statistics have been rendered outdated and out of touch with contemporary methods. The choice of prior distribution for a statistical model doesn't have to be a distribution (look at the Jeffreys prior), i.e. it doesn't even have to be a probability measure. Statistical models don't have to result in a proper distribution without constraints (look at Besag models and other 'intrinsic' ones), and in frequentist inference they don't have to depend solely on the likelihood [look at penalized regression, like the LASSO]. What does it even mean to say 'statistics is about probability and sequences of random events' when contemporary topics don't even NEED a specific parametric model (look at splines in generalized additive models), or even necessarily to output a distribution? How can we think of a 'book of preferences' for an agent, as classical probability/utility arguments go when founding expert-elicited distributions, when in practice statistical analysis allows non-distributions, or distributions without an expectation or variance, to represent individuals' beliefs about centrality and variability?
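
    As an aside, a minimal sketch of the penalized-regression point in Python (simulated data; the penalty weight is an arbitrary illustrative choice). The LASSO minimises the residual sum of squares plus an L1 penalty on the coefficients, so the estimate is not determined by the likelihood alone:

    ```python
    # LASSO: minimise (residual sum of squares) + alpha * ||beta||_1 on simulated data.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 20))
    true_beta = np.zeros(20)
    true_beta[:3] = [2.0, -1.5, 1.0]           # only 3 of 20 predictors matter
    y = X @ true_beta + rng.normal(scale=0.5, size=100)

    fit = Lasso(alpha=0.1).fit(X, y)
    print(np.round(fit.coef_, 2))              # most coefficients are shrunk exactly to zero
    ```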

    You then asked how to choose a probability distribution without using data in some way. I responded by describing various ways people do this in practice, in the only context where it occurs: choosing a prior. Of course the statistical model depends on the data; that's how you estimate its parameters.

    I have absolutely no interest in rehearsing a dead argument about which interpretation of probability is correct when it has little relevance to the contemporary structure of statistics. I evinced this by giving you a few examples of Bayesian methods being used to analyse frequentist problems [implicit priors] and frequentist asymptotics being used to analyse Bayesian methods.
  • fdrake
    6.6k
    With regard to regularization and shrinkage, these are reasons to choose a distribution not because of the way it represents the probability of events, but because the means of representation induces properties in models. That is, they are probability distributions whose interpretation isn't done in terms of the probability of certain events; their interpretation is 'this way of portraying the probability of certain events induces certain nice properties in my model'.
  • Jeremiah
    1.5k
    Yet my question was not how to choose a distribution.
  • Jeremiah
    1.5k
    As I said, off the mark.
  • fdrake
    6.6k
    "Tell me, how do you plan on making your probability distribution without data?" (Jeremiah)

    T_T
  • fdrake
    6.6k
    Ok, what actually is your question, what do you think we disagree on?
  • Jeremiah
    1.5k


    There is a difference between choosing and making.