Treating Strings as Numbers
primarily for the purpose of finding the average value and the standard deviation of a list of strings

Quick Summary:

Convert each string into a sum of base-256 fractions, then do the math on the numbers, and finally convert back to a string.



It's easy to do math on numbers, so if each string could be converted to a single number, you could use the existing facilities for doing math on numbers. Of course, every letter of a string can be converted to it's numeric ASCII value, but that just leaves you with a series of integers (one for each letter of the original string), not a single number. The problem is how to combine the numbers in this series so that the resulting single number sorts the same way as the original string (when compared to other strings), and averages in a way that makes sense.


Strings and fractional decimals grow to the right, while whole numbers grow to the left.

As a string grows to the right, it's position in a sorted list would tend to change less and less, because the position is primarily determined by the initial characters. But decimal numbers equal to or greater than 1.0 grow to the left, and adding more digits would continually increase the magnitude of the number, thereby changing it's position in a sorted list drastically.

Example:

If you start with the word super and then start growing it by adding characters to the right, until it becomes the word supercali, and then finally supercalifragilisticexpialidoceous, it's sort order might change like this:


rich
super

superant
supercalifornia
supercar


rich
superant
supercali
supercalifornia
supercar


rich
superant
supercalifornia
supercalifragilisticexpialidoceous
supercar


It's position is simply further narrowed down as more characters are added, but it will continue to stay in among the words that start with the letter “s”.



Example:

If you start with the number 6, and you add random digits to it, it's position in a sorted list will change dramatically, each time a new digit is added:

4
5
6
10
100
1000


4
5
10
67
100
1000


4
5
10
100
672
1000


There is no limit to how far it can jump as more digits are added.

But fractional decimal numbers grow to the right, just like strings, and their value does not change greatly as new digits are added to the right, because their value is most affected by their initial digits, the same way that strings are sorted primarily by their initial characters.

Example:

0.4
0.5
0.6
0.7
0.8


0.4
0.5
0.67
0.7
0.8


0.4
0.5
0.672
0.7
0.8

So if each string could be converted to a fraction, those fractions would sort the same way as strings do: both would have positions primarily influenced by their initial members, and the addition of more characters or digits on the right, would not greatly change their positions.


We can't jam 26 a-z values into a column of base-10.

But there is not enough space in the base 10 arithmetic place value system to convert each ASCII symbol into a numeric value in a column to the right of a decimal point:

decimal

s

u

p

e

r

.

115

117

112

101

114

point

tenths

hundredths

thousandths

ten-thousandths

hundred-thousandths

Because the ASCII value of the letter “s” is 115, two of it's digits would spill over into the ones and tens places. And because the ASCII value of the letter “u” is 117, two of it's values would spill over into the tenths and ones places. The range of printable ASCII characters is just too big; even if you ignored all punctuation and converted all text into lower-case, you would still have to jam the values 1 through 26 into a single column in a base 10 place value system.

So instead of trying to convert the first character into tenths, and the second into hundredths, and the third into thousandths, you can convert the first character into a fraction of 256ths, and the second into a fraction of 2562, and the third into a fraction of 2563 and so on.

By starting with the denominator of 256, you are guaranteed that the numerator will never be bigger than the denominator, so the value will never spill over into the next colum. Just as the next column after tenths in the base 10 place value system is 10-2, the next denominator we will use is 2562. This will also ensure that no matter what the next character is, it will never spill over into other columns.

s

u

p

e

r

115/256

+ 117/2562

+ 112/2563

+ 101/2564

+ 114/2565

0.44921875

0.0017852783203125

0.00000667572021484375

0.00000002351589500904083251953125

0.000000000103682396002113819122314453125

0.44921875

0.4510040283203125

0.45101070404052734375

0.45101072755642235279083251953125

0.451010727660104748792946338653564453125

The decimal value of each fraction is shown in the third row. The cumulative sum generated by adding the fractions are shown in the fourth row. So the numerical value for the string “super” is the final sum in the fifth column: 0.451010727660104748792946338653564453125. Because they are displayed in base 10, the values generated by adding characters to the right do actually spill over into the columns on the left, but this is only because these numbers are displayed as base 10 decimal fractions.

Example:

The numerical values for the above strings are:


rich - 0.44692059792578220367431640625
super - 0.451010727660104748792946338653564453125

superant - 0.45101072766045091633202446246286854147911071777344
supercalifornia - 0.45101072766045779971477713843341916799545288085938
supercar - 0.45101072766045779971477713843341916799545288085938


rich - 0.44692059792578220367431640625
superant - 0.45101072766045091633202446246286854147911071777344
supercali - 0.45101072766045779971477713843341916799545288085938
supercalifornia - 0.45101072766045779971477713843341916799545288085938
supercar - 0.45101072766045779971477713843341916799545288085938


rich - 0.44692059792578220367431640625
superant - 0.45101072766045091633202446246286854147911071777344
supercalifornia - 0.45101072766045779971477713843341916799545288085938
supercalifragilisticexpialidoceous - 0.45101072766045779971477713843341916799545288085938
supercar - 0.45101072766045779971477713843341916799545288085938


The digit at which a number starts to differ from it's successor is indicated. For the longer text strings, the difference lies beyond the floating point resolution of my machine.

This scheme apparently does create numbers which sort in the same order as the original strings.


But what about averages?

The letter “a” converts to 0.37890625. The letter “c” converts to 0.38671875. The average of those two numbers is 0.3828125, which is the number that the letter “b” convers into. So the letters “a” and “c” average out to the letter “b”, which is the behaviour you would expect.

Using slightly longer strings: “aaa” = 0.380392134189605712890625, “aba” = 0.380407392978668212890625, and “aca” = 0.380422651767730712890625. The average of these three numbers is 0.380407392979, which is nearly exactly the string aba”, as you would expect. The reason it is different at all is because of floating point round-off errors. The average of "aaa", "aba" and "aca" is "aba", as you would expect.


Converting back to strings.

Because the denominator of each character is 256 times larger than the one before it, multiplying the fraction by 256 should bring just one character up above the decimal point. The fractional part of this new number can be saved off, and the non-fractional part can be converted directly back into an ASCII character. This three step process of a) multiplying by 256, b) saving the fractional part of the product, and c) converting the non-fractional part directly into an ASCII character can be repeatedly applied to re-generate all the characters in the original string.

Example:

Up above, the string "super" was found to be 0.451010727660104748792946338653564453125.

a) 256 * 0.451010727660104748792946338653564453125 = 115.45874628098682
b) Save .45874628098682 for use in step a) of the next iteration
c) 115 corresponds to the ASCII character "s"


a) 256 * .45874628098682 = 117.43904793262593
b) Save .43904793262593 for use in step a) of the next iteration
c) 117 corresponds to the ASCII character "u"


a) 256 * .43904793262593 = 112.39627075223808
b) Save .39627075223808 for use in step a) of the next iteration
c) 112 corresponds to the ASCII character "p"


a) 256 * .39627075223808 = 101.44531257294848
b) Save .44531257294848 for use in step a) of the next iteration
c) 101 corresponds to the ASCII character "e"


a) 256 * .44531257294848 = 114.00001867481087
b) Save .00001867481087 for use in step a) of the next iteration
c) 114 corresponds to the ASCII character "r"


a) 256 * .00001867481087 = 0.0047807515827199996
b) Save .0047807515827199996 for use in step a) of the next iteration
c) 0 corresponds to the NULL character. Make a decision to stop here.




What's the use?

If you have a very long list of strings, you might want to generate a box plot histogram, so you would need to calculate the average and standard deviation of a list of strings.

You might want to know if most of the entries are in the first half of the alphabet or in the last half.

You might want to display the values graphically.


Implementation.

I found that I could use base-128 instead of base-256 because I'm only interested in printable ASCII characters.

Python implementation:

def float_to_text(some_float):
        import math
        some_float *= 128

        ret_val = ""

        the_int = int(some_float)%256

        while the_int != 0 and len(ret_val) < 128:
                ret_val += chr(the_int)

                some_float -= the_int
                some_float *= 128
                the_int = int(some_float)%256

        return ret_val
def text_to_float(some_string):
        import math
        string_len = len(some_string)

        ret_val = 0.0

        for string_index in range(0, string_len):
                the_char = some_string[string_index]
                numerator = ord(the_char)
                denominator = math.pow(128, string_index + 1)

                ret_val += float(numerator)/float(denominator)

        return ret_val




C implementation:

#include <stdio.h>
#include <math.h>

double text_to_float(char * some_string)

{
        double ret_val = 0.0;
        double numerator;

        double denominator;
        int index = 1;
        
        while ( *some_string != '\0')

        {
                numerator = *some_string;
                denominator = pow(128, index);

                
                ret_val += numerator / denominator;
                some_string ++;

                index ++;
        }
        
        return ret_val;
}

int main ( int argc, char ** argv )

{
        double the_double = 0.0;
        the_double = text_to_float ( argv[1] );

        printf("%50.50f", the_double);
        return 0;
}

#include <stdio.h>
#include <math.h>

char * float_to_text(double some_float)

{
        char *ret_val = (char * )malloc(128 * sizeof(char));

        int the_index = 0;
        char the_char;
        
        some_float *= 128;

        the_char = (char)(((int)some_float) % 128);

        while ( (the_char != '\0') && (the_index < 128))

        {
                ret_val[the_index] = the_char;
                some_float -= the_char;

                some_float *= 128;
                the_char = (char)(((int)some_float) % 128);

                the_index ++;
        }
        
        return ret_val;
}

int main ( int argc, char ** argv )

{
        double the_float;
        char * ret_val;
        sscanf ( argv[1], "%lf", &the_float);

        ret_val = float_to_text(the_float);
        
        printf("%s\n", ret_val);

        return 0;
}



The best implementation is to use the Postgres Numeric datatype, which offers unlimited precision.