Thursday, April 7, 2016

Predicting Frequency of Factors

I have mentioned the Erdős–Kac theorem before on my Riemann Hypothesis blog, in "Addendum on Erdős–Kac Theorem", "More on Erdős–Kac Theorem" and "A Simple Example".

Basically this states that if ω(n) is the number of (distinct) prime factors of n, then the probability distribution of

  \frac{\omega(n) - \log\log n}{\sqrt{\log\log n}}

tends to the standard normal distribution (as n becomes large).
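
For illustration, here is a rough Python sketch of the statistic involved, using simple trial division to count the distinct prime factors of a sample number (714 = 2 x 3 x 7 x 17):

    from math import log, sqrt

    def omega(n):
        """Count the distinct prime factors of n by trial division."""
        count, p = 0, 2
        while p * p <= n:
            if n % p == 0:
                count += 1
                while n % p == 0:
                    n //= p
            p += 1
        if n > 1:              # whatever remains is itself prime
            count += 1
        return count

    n = 714                    # 714 = 2 x 3 x 7 x 17, so omega(714) = 4
    z = (omega(n) - log(log(n))) / sqrt(log(log(n)))
    print(omega(n), round(z, 3))   # 4 and roughly 1.54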

What is fascinating about this result is that it allows one, in principle, to calculate for any value of n the number of composites with 2, 3, 4, 5, ..., k factors respectively. Indeed we can also attempt to use it as a means of calculating the number of primes (with 1 factor). However, as we shall see, estimates for the primes based on the normal distribution are not at all reliable.


I will now attempt to illustrate using a simple example.

When n = 1,000,000,000, the average (or mean) number of (distinct) prime factors is given by log log n = 3.031, with the standard deviation given by its square root = 1.741 (correct to 3 decimal places in each case).
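
These two figures can be checked quickly in Python (natural logarithms throughout):

    from math import log, sqrt

    n = 1_000_000_000
    mu = log(log(n))          # mean number of distinct prime factors: 3.031...
    sigma = sqrt(mu)          # standard deviation: 1.741...
    print(round(mu, 3), round(sigma, 3))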

We can use the normal distribution (with accompanying tables) to estimate, for each number of factors, the corresponding percentage of numbers in the region of n.

So, as 3 is roughly the average number of (distinct) factors per number, we would perhaps expect the largest percentage to relate to this category.

However, because the normal is a continuous distribution, we must apply a continuity correction, so that the discrete value 3 is represented by the interval between 2.5 and 3.5 under the normal curve.

So the number of standard deviations that 2.5 lies from the mean is given by (3.031 - 2.5)/1.741 = .30 to 2 decimal places (as my tables are only correct to 2 decimal places).
The corresponding area under the normal curve (between 2.5 and the mean) = .5 - .3821 = .1179.

Then the corresponding number of standard deviations to which 3.5 corresponds = (3.5 - 3.031)/1.741 = .27.

And the area that this corresponds to is .5 - .3936 = .1064.

Therefore the total area corresponding to numbers with 3 factors = .1179 + .1064 = .2243.

This would suggest therefore that 22% of numbers (rounded to the nearest whole percentage) would contain 3 factors.

The corresponding figures for 4, 5, 6 and 7 factors were then calculated as 19%, 12%, 5% and 2% respectively.

However there is an asymmetry operating which is not reflected by the use of the normal distribution in this case. Though we can extend indefinitely on the RHS to calculate the probabilities of numbers with an ever-increasing number of factors, this does not apply equally on the LHS, where only two further possibilities remain, i.e. numbers with 2 factors and with 1 factor respectively.

If we assume the normal distribution in this case (though this is not strictly justified), the figures for numbers containing 2 factors and 1 factor (i.e. the primes) work out as 19% and 12% respectively.
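
All of the above normal-based estimates can be reproduced directly (without tables) from the normal CDF with the same continuity correction. The following is just a rough Python sketch for checking purposes; because it uses the exact CDF rather than 2-decimal tables, its output can differ from the figures above by a percentage point or so.

    from math import erf, log, sqrt

    def normal_cdf(x, mu, sigma):
        """Cumulative distribution function of N(mu, sigma^2)."""
        return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

    n = 1_000_000_000
    mu = log(log(n))          # approx 3.031
    sigma = sqrt(mu)          # approx 1.741

    # Continuity correction: k distinct factors <-> the interval (k - 0.5, k + 0.5)
    for k in range(1, 8):
        p = normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)
        print(k, f"{100 * p:.0f}%")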

I then did a quick empirical study (over a range of 500 from 1,000,000,001 to 1,000,000,500) to calculate the percentage totals that actually occurred in the number system around n (for numbers with these factor totals).
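
A rough Python sketch of how such a tally can be carried out (one possible approach, using simple trial division, which is quite feasible for numbers of this size):

    from collections import Counter

    def omega(n):
        """Count the distinct prime factors of n by trial division."""
        count, p = 0, 2
        while p * p <= n:
            if n % p == 0:
                count += 1
                while n % p == 0:
                    n //= p
            p += 1
        if n > 1:
            count += 1
        return count

    start, size = 1_000_000_001, 500
    tallies = Counter(omega(m) for m in range(start, start + size))
    for k in sorted(tallies):
        print(k, f"{100 * tallies[k] / size:.0f}%")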

No. of Factors    Normal Dist. Est. (%)    Empirical Estimate (%)
      1                    12                        4
      2                    20                       18
      3                    22                       29
      4                    19                       27
      5                    12                       14
      6                     5                        6
      7                     2                        1


When the empirical estimates (of actual occurrence) are compared with those based on calculation through the normal distribution, notable discrepancies occur.

In particular, there is a considerable over-estimate of the frequency of primes (where actual occurrence is only about 1/3 of that predicted using the normal distribution).

Then for the middle scores (3 and 4) actual occurrence is somewhat in excess of that predicted by the normal distribution.

In truth, the assumption of a regular normal distribution does not properly apply in this case, because the average number of (distinct) prime factors per number (3) is still exceedingly low.

However, in order for the average number of factors to become large enough that the normal curve is indeed a suitable approximation, the value of n, which in our example is 10^9, needs to be dramatically increased.


For example, for the average number of (distinct) prime factors to equal 10, n must be of the order of 10^10,000.
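
As a rough check on this order of magnitude: log log n = 10 requires n = e^(e^10), and its size in decimal digits can be computed without overflow as follows.

    from math import exp, log

    # log log n = 10  =>  n = exp(exp(10)); work with log10(n) to avoid overflow
    log10_n = exp(10) / log(10)
    print(round(log10_n))     # log10(n) is about 9,566, i.e. n is of the broad order of 10^10,000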

Therefore, especially in the case of the estimation of primes, the normal distribution could never be used as an accurate measure. Strictly, there is always an unavoidable asymmetry with respect to the LHS of the normal curve. Thus whereas the number of factors greater than the average can theoretically increase without limit, the corresponding number of factors less than the average must necessarily end at 1.

Rather, the normal curve is best used, where the average number of (distinct) factors is relatively high, to predict the percentage of numbers likely to occur within specified ranges of that average. However, once again, it will be less accurate for extreme outliers, especially when attempting to calculate the number of primes (with just one factor).

Even with the average no. of distinct factors as low as 10, we could estimate pretty accurately for example the % of numbers that would contain exactly 10 distinct factors (12% approx).

We could also say quite accurately that in this case we would expect to find about 68% (approx) within 1 standard deviation of the mean (i.e. between 7 and 13 factors).

However, once again the estimate for primes is wildly inaccurate.

Based on the assumption of the normal distribution, we would estimate on average about 1 in 435 numbers as prime, whereas in actual fact it is about 1 in 23,000.
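
These figures can again be checked directly from the normal CDF, with the actual density of primes near n taken from the prime number theorem (about 1 in log n):

    from math import erf, log, sqrt

    def normal_cdf(x, mu, sigma):
        return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

    mu = 10.0                 # average number of distinct prime factors
    sigma = sqrt(mu)          # about 3.162

    p_exactly_10 = normal_cdf(10.5, mu, sigma) - normal_cdf(9.5, mu, sigma)
    p_within_1sd = normal_cdf(mu + sigma, mu, sigma) - normal_cdf(mu - sigma, mu, sigma)
    p_one_factor = normal_cdf(1.5, mu, sigma) - normal_cdf(0.5, mu, sigma)

    print(f"{100 * p_exactly_10:.1f}%")        # about 12.6%
    print(f"{100 * p_within_1sd:.1f}%")        # about 68.3%
    print(f"1 in {round(1 / p_one_factor)}")   # roughly 1 in 440 (2-decimal tables give about 1 in 435)
    print(f"1 in {round(10_000 * log(10))}")   # actual density of primes near 10^10,000: about 1 in 23,000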
