New Approach to Approximating the Cumulative Function for the t-Distribution
IRAQI JOURNAL OF STATISTICAL SCIENCES
Volume 20, Issue 2, December 2023, Pages 82-89
Document Type: Research Paper
DOI: 10.33899/iqjoss.2023.0181184
Author
Ahmad Najem AL-SHALLAWI
Department of Statistics & Informative techniques, Northern Technical University, Mosul, Iraq.
Abstract
This paper approximates the cumulative distribution function (CDF) of the t distribution, which can be represented as a mixture of the normal and gamma distributions. The study uses the approximate formula for the normal distribution proposed by Polya in 1945. Applying the resulting formula at various points and comparing the results with the tabulated values of the t distribution shows that the absolute error between the two sets of values is negligible, although it increases slightly with higher degrees of freedom. The absolute errors also remain the same when several points are selected at the same degrees of freedom. These findings are of practical use in statistical analysis, since they offer a way of obtaining CDF values of the t-distribution that saves time and effort.
Highlights
In this research, a new approximation to the CDF of the t distribution was derived, and its values were compared with the original CDF values using (Matlab, 22). We concluded that:
Keywords
Cumulative distribution function; Polya's formula; t-distribution
Full Text
Introduction
The t-distribution is one of the most important distributions in statistical analysis. It closely resembles the normal distribution and is used in its place when the population standard deviation is unknown or when the sample size is less than 30 [1]. Its density curve is symmetric about the mean but has heavier tails than the normal curve. A single parameter, the degrees of freedom, determines the shape of the distribution [2]. The t distribution is a special case of the generalized hyperbolic distribution [3]. It is used in many fields: the t-test detects significant differences between sample means, t-based confidence intervals are built for the difference between two population means, and the distribution also appears in linear regression and Bayesian analysis [4]. The main aim of this research is to approximate the cumulative function of the t-distribution with a simple formula that is easier to apply in practice than the classical expression, which depends on a hypergeometric function (as will be mentioned).
Materials and Methods
Theoretical Part
The probability density function of the t-distribution is:
$$f(t) = \frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)} \left(1 + \frac{t^{2}}{v}\right)^{-\frac{v+1}{2}}, \qquad -\infty < t < \infty$$
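where v is the degrees of freedom, a positive integer, and Γ(·) is the complete gamma function. The cumulative function gives the probability that a random variable T takes a value less than or equal to a given value t. It is written as [1]:
$$F(t) = \frac{1}{2} + t\,\frac{\Gamma\left(\frac{v+1}{2}\right)}{\sqrt{v\pi}\,\Gamma\left(\frac{v}{2}\right)}\; {}_2F_1\!\left(\frac{1}{2}, \frac{v+1}{2}; \frac{3}{2}; -\frac{t^{2}}{v}\right)$$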
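The hypergeometric form of the CDF can be checked numerically against direct integration of the density. The following is a minimal sketch, not the author's Matlab code: the function names are illustrative, the 2F1 value comes from a truncated Gauss series (valid only for t² < v), and the reference value comes from Simpson's rule.

```python
import math

def t_pdf(t, v):
    """Density of Student's t with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def t_cdf_numeric(t, v):
    """F(t) = 1/2 + integral of the density from 0 to t (symmetry about 0)."""
    return 0.5 + simpson(lambda s: t_pdf(s, v), 0.0, t)

def hyp2f1(a, b, c, z, terms=200):
    """Truncated Gauss series for 2F1(a, b; c; z); only valid for |z| < 1."""
    total, term = 1.0, 1.0
    for k in range(terms):
        term *= (a + k) * (b + k) / ((c + k) * (k + 1)) * z
        total += term
    return total

def t_cdf_hyp(t, v):
    """Hypergeometric form of the CDF, restricted here to t*t < v."""
    if t * t >= v:
        raise ValueError("the series used in this sketch needs t*t < v")
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return 0.5 + t * c * hyp2f1(0.5, (v + 1) / 2, 1.5, -t * t / v)

# Compare the two evaluations at a few (t, v) pairs.
for t, v in [(0.5, 1), (1.0, 5), (-1.5, 10)]:
    print(t, v, t_cdf_numeric(t, v), t_cdf_hyp(t, v))
```

The two columns should agree to many decimal places. The need for a series (or a special-function library) even in this small check illustrates why a simpler closed-form approximation is attractive.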
Here 2F1 denotes a special case of the hypergeometric function [5].
Review of approximations to the CDF
Many formulas have been proposed to approximate the CDF of the t-distribution. Reference [6] lists various approximations to the cumulative function of the t-distribution and proposes a simple approximation of F(x; ν) as:
For n = 1 and n = 2, they suggested the following exact formulae:
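For reference, the standard closed forms of the t CDF for 1 and 2 degrees of freedom are F(x; 1) = 1/2 + arctan(x)/π and F(x; 2) = 1/2 + x / (2√(2 + x²)). Whether [6] writes them in exactly this form is an assumption. The sketch below, with illustrative function names, checks each closed form by confirming that its numerical derivative matches the t density.

```python
import math

def t_pdf(t, v):
    """Density of Student's t with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)

def cdf_v1(x):
    """Closed-form CDF for 1 degree of freedom (Cauchy case)."""
    return 0.5 + math.atan(x) / math.pi

def cdf_v2(x):
    """Closed-form CDF for 2 degrees of freedom."""
    return 0.5 + x / (2.0 * math.sqrt(2.0 + x * x))

def slope(cdf, x, h=1e-4):
    """Central-difference derivative of a CDF at x."""
    return (cdf(x + h) - cdf(x - h)) / (2.0 * h)

# The derivative of each closed form should reproduce the density.
for x in (-2.0, 0.0, 0.7, 3.0):
    print(x, slope(cdf_v1, x) - t_pdf(x, 1), slope(cdf_v2, x) - t_pdf(x, 2))
```

The printed differences should be close to zero, limited only by the finite-difference step.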
They also reported the absolute error for many values of v and x. Reference [6] further proposed a formula, based on neural networks, for computing the CDF of the t-distribution, and compared the candidate approximations by absolute error in order to select the one with the minimum absolute error.
Mixed Model
Mixed models, or compound distributions, are important in modelling many phenomena because they are more flexible than standard distributions, and many researchers have studied them in both the continuous and the discrete case. They are used for data that standard statistical distributions cannot represent adequately, because the nature of such data requires the extra flexibility of a mixed distribution [7]. A mixed model therefore consists of two or more probability distributions [8], and the component distributions need not belong to the same family [9]. If Z and Y are two independent random variables, where Z is normally distributed with mean 0 and variance σ² = 1 and Y has a chi-square distribution with n degrees of freedom, then:
$$T = \frac{Z}{\sqrt{Y/n}} \sim t(n)$$
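The construction of a t variable from a standard normal Z and an independent chi-square Y can be illustrated by simulation. This is a minimal sketch under stated assumptions: the function names and sample size are illustrative, the chi-square draw uses the fact that a chi-square variable with n degrees of freedom is a gamma variable with shape n/2 and scale 2, and the comparison value is the standard closed form of the t CDF for n = 2.

```python
import math
import random

def sample_t(n, rng):
    """Draw one value of Z / sqrt(Y / n) with Z ~ N(0, 1) and Y ~ chi-square(n)."""
    z = rng.gauss(0.0, 1.0)
    y = rng.gammavariate(n / 2.0, 2.0)  # chi-square(n) = Gamma(shape n/2, scale 2)
    return z / math.sqrt(y / n)

def empirical_cdf(x, n, size=100_000, seed=0):
    """Share of simulated values that are <= x."""
    rng = random.Random(seed)
    hits = sum(sample_t(n, rng) <= x for _ in range(size))
    return hits / size

# Closed-form CDF for n = 2 at x = 1: 1/2 + 1 / (2 * sqrt(3)).
exact_n2 = 0.5 + 1.0 / (2.0 * math.sqrt(3.0))
print(empirical_cdf(1.0, 2), exact_n2)
```

The simulated proportion should agree with the closed-form value up to Monte Carlo error.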
According to [10], the t-distribution is characterized by its degrees of freedom n. It can also be represented as a mixed distribution, which is more versatile than the standard form: the mixture combines the normal distribution with the inverse gamma distribution and so supports a broader range of applications. Let the random variable X follow a t-distribution with location µ, scale σ² and v degrees of freedom: X ~ t(µ, σ², v)
X can be expressed as:
where τ follows an inverse gamma (IG) distribution and Z ~ N(0, σ²). Z is independent of τ, and τ takes positive values (τ > 0) with the following probability density function [11]:
This formula is used in statistical modelling with the t-distribution in both conventional and Bayesian statistics [12].
Approximation of the cumulative function of the t-distribution
In this part, we propose a new approximation to the cumulative function of the t-distribution, using the mixed model of the normal and gamma distributions in equation (4a).
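The mixed representation can be verified numerically. The sketch below assumes the standard parameterization, which is not confirmed by the text as extracted: µ = 0, σ = 1, and τ ~ IG(v/2, v/2), so that u = 1/τ is gamma distributed with shape v/2 and rate v/2. Averaging the conditional normal density N(x; 0, 1/u) over that gamma density should then reproduce the t density. Function names are illustrative.

```python
import math

def t_pdf(t, v):
    """Density of Student's t with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def mixture_density(x, v, n=20000):
    """Integrate N(x; 0, 1/u) * Gamma(u; shape v/2, rate v/2) over u > 0."""
    const = (v / 2) ** (v / 2) / (math.gamma(v / 2) * math.sqrt(2 * math.pi))

    def integrand(u):
        # sqrt(u) comes from the normal factor, u**(v/2 - 1) from the gamma
        # factor; the two exponentials combine into exp(-u * (x*x + v) / 2).
        return const * u ** ((v - 1) / 2) * math.exp(-u * (x * x + v) / 2)

    upper = 120.0 / (x * x + v)  # beyond this point the integrand is negligible
    return simpson(integrand, 0.0, upper, n)

for x, v in [(1.0, 5), (-0.7, 3)]:
    print(x, v, mixture_density(x, v), t_pdf(x, v))
```

Both columns should match closely, which is the property the approximation in this part relies on: the t CDF is the conditional normal CDF averaged over the mixing distribution.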
The cumulative function of the t distribution is: F(x) = P(X ≤ x)
As stated above, the t-distribution can be represented as a mixed distribution:
where the first factor is the conditional normal distribution and the second is the gamma distribution. We use Polya's formula to approximate the normal distribution function (Hermuz, 1990) [13]:
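Polya's approximation to the standard normal CDF is commonly written as Φ(z) ≈ ½[1 + √(1 − exp(−2z²/π))] for z ≥ 0, with the mirrored value for z < 0. That this is the exact form of equation (8) is an assumption. The minimal sketch below, with illustrative function names, compares it with the exact normal CDF computed from the error function.

```python
import math

def polya_cdf(z):
    """Polya-type approximation to the standard normal CDF."""
    r = math.sqrt(1.0 - math.exp(-2.0 * z * z / math.pi))
    return 0.5 * (1.0 + math.copysign(r, z))

def normal_cdf(z):
    """Exact standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Largest absolute error on a grid over [-5, 5].
grid = [i / 100.0 for i in range(-500, 501)]
max_err = max(abs(polya_cdf(z) - normal_cdf(z)) for z in grid)
print("largest absolute error on the grid:", max_err)
```

The approximation involves only an exponential and a square root, which is what makes it convenient to integrate against a mixing distribution.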
Because the approximation formula to be derived relies on the mixed distribution, we apply Polya's formula in equation (8) within the mixture, together with the gamma distribution. Here the variable Z is conditioned on the variable τ, following the representation of the t-distribution in equation (4a), as follows:
Where:
Where:
Using the Taylor series:
We set:
Substituting (14), (15) and (16) into (13), we get:
But:
Then:
and:
Substituting (18) and (19) into (9), we get:
Substituting (10) and (11) into (20) and integrating with respect to τ, we get the marginal cumulative function:
Where:
and:
Equation (21) is therefore the final formula for approximating the cumulative function of the t-distribution.
Practical Side
As an application of equation (21), we follow this algorithm:
1- Determine the degrees of freedom.
2- Choose two values for a and b. We chose standard values for applying equation (21), such as (-1, 1).
3- After selecting these two values, choose a value for z (negative and positive values are tried).
4- Compute the value of the cumulative function between (a, b).
5- Compare the value from step 4 with the tabulated value taken from the statistical tables, and find the difference between them.
6- If the difference is large, choose another value for z and repeat the calculation until the result is close to the tabulated value.
Thus, whenever different values are chosen for a and b, or for the degrees of freedom, the value of the cumulative function is recalculated. The researcher identified the value of z that achieved the best result for equation (21), compared with the tabulated value, at (-1, 1), (-1, 0) and (0, 1) for many degrees of freedom. The algorithm was applied using Matlab 2020a.
The Algorithm: As an application of equation (21), we follow the steps below:
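The search over z described above can be sketched as a simple grid search. This is an illustration only and is written in Python rather than Matlab. The function toy_cdf is a hypothetical stand-in, a Polya-type curve with a tunable constant z, and is not equation (21). The "tabulated" value is replaced by numerical integration of the t density, and all names and the candidate grid are assumptions.

```python
import math

def t_pdf(t, v):
    """Density of Student's t with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def exact_prob(a, b, v):
    """Reference value of P(a < T < b), standing in for the tabulated value."""
    return simpson(lambda t: t_pdf(t, v), a, b)

def toy_cdf(x, z):
    """HYPOTHETICAL stand-in for equation (21); requires z >= 0."""
    r = math.sqrt(1.0 - math.exp(-z * x * x))
    return 0.5 * (1.0 + math.copysign(r, x))

def calibrate(a, b, v, candidates, approx_cdf=toy_cdf):
    """Return the candidate z with the smallest absolute error, and that error."""
    target = exact_prob(a, b, v)
    errors = {z: abs(approx_cdf(b, z) - approx_cdf(a, z) - target)
              for z in candidates}
    best = min(errors, key=errors.get)
    return best, errors[best]

candidates = [i / 1000.0 for i in range(1, 1001)]  # illustrative grid for z
for a, b in [(-1.0, 1.0), (-1.0, 0.0), (0.0, 1.0)]:
    print((a, b), calibrate(a, b, 5, candidates))
```

Passing a different approx_cdf callable, for instance an implementation of equation (21), would reuse the same loop unchanged.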
Through this procedure, we found that the value 0.3334 for z gave the smallest difference from the tabulated value and therefore the best outcome for equation (21). The results are shown in Table (1).
Table (1): Comparison between the original CDF and the approximated CDF
The first column of Table (1) lists the chosen degrees of freedom, the second column gives the results of the proposed equation (21), and the third column gives the tabulated cumulative value taken from the statistical tables. The fourth column gives the absolute error (AE) between the value from the proposed equation and the tabulated value. Table (1) shows that the tabulated value is close to the value given by equation (21), and that the error is equal at the tested truncation points (-1, 0) and (0, 1). This means that the value of the cumulative function on the left side equals its value on the right side.
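The equality of the errors at (-1, 0) and (0, 1) noted for Table (1) follows from the symmetry of the t density about zero. A minimal sketch, with illustrative function names and degrees of freedom chosen only as examples, computes the reference probabilities on both sides by numerical integration:

```python
import math

def t_pdf(t, v):
    """Density of Student's t with v degrees of freedom."""
    c = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return c * (1 + t * t / v) ** (-(v + 1) / 2)

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def interval_prob(a, b, v):
    """P(a < T < b) for a t variable with v degrees of freedom."""
    return simpson(lambda t: t_pdf(t, v), a, b)

# The left and right halves carry the same probability for every v.
for v in (1, 5, 30):
    left = interval_prob(-1.0, 0.0, v)
    right = interval_prob(0.0, 1.0, v)
    print(v, left, right, interval_prob(-1.0, 1.0, v))
```

Any approximation that is itself symmetric in x will therefore show the same absolute error on (-1, 0) as on (0, 1).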
References
[1] Kotz S, Nadarajah S. (2004): "Multivariate t-distributions and their applications", Cambridge University Press.
[2] Hurst S. (1995): "The characteristic function of the Student t distribution", Centre for Mathematics and its Applications, School of Mathematical Sciences.
[3] Frhr. Ernst August v. Hammerstein (2010): "Generalized hyperbolic distributions: Theory and applications to CDO pricing", Department of Mathematical Stochastics, Faculty of Mathematics and Physics, Albert-Ludwigs-University Freiburg, Germany.
[4] Nadarajah S, Kotz S. (2008): "Estimation methods for the multivariate t distribution", Acta Applicandae Mathematicae, 102(1):99-118.
[5] Bagdasaryan A. (2009): "A note on the 2F1 hypergeometric function", arXiv preprint arXiv:0912.0917.
[6] Yerukala R, Boiroju NK, Reddy MK. (2013): "Approximations to the t-distribution", International Journal of Statistika and Mathematika, 8(1).
[7] Nascimento A, Rêgo LC, Silva JW. (2022): "Compound truncated Poisson gamma distribution for understanding multimodal SAR intensities", Journal of Applied Statistics, 1-20.
[8] Booth JG, Casella G, Friedl H, Hobert JP. (2003): "Negative binomial loglinear mixed models", Statistical Modelling, 3(3):179-91.
[9] Garcia V, Nielsen F. (2010): "Simplification and hierarchical representations of mixtures of exponential families", Signal Processing, 90(12):3197-212.
[10] Weisstein EW. (2001): "Student's t-Distribution", https://mathworld.wolfram.com/.
[11] Arellano-Valle RB, Bolfarine H. (1995): "On some characterizations of the t-distribution", Statistics & Probability Letters, 25(1):79-85.
[12] Arellano-Valle RB, Castro LM, González-Farías G, Muñoz-Gajardo KA. (2012): "Student-t censored regression model: properties and inference", Statistical Methods & Applications, 21(4):453-73.
[13] Hermuz, Amir Hanna (1990): "Mathematical Statistics", Directorate of Printing and Publishing House, University of Mosul.