Treating
Strings as Numbers
primarily for the purpose of finding
the average value and the standard deviation of a list
of strings
|
Quick Summary: Convert each string into a sum of base-256 fractions, then do the math on the numbers, and finally convert back to a string. |
It's easy to do math on numbers, so if each string could be converted to a single number, you could use the existing facilities for doing math on numbers. Of course, every letter of a string can be converted to it's numeric ASCII value, but that just leaves you with a series of integers (one for each letter of the original string), not a single number. The problem is how to combine the numbers in this series so that the resulting single number sorts the same way as the original string (when compared to other strings), and averages in a way that makes sense.
Strings and fractional decimals grow to the right, while whole numbers grow to the left.
As a string grows to the right, it's position in a sorted list would tend to change less and less, because the position is primarily determined by the initial characters. But decimal numbers equal to or greater than 1.0 grow to the left, and adding more digits would continually increase the magnitude of the number, thereby changing it's position in a sorted list drastically.
|
Example: If you start with the word super and then start growing it by adding characters to the right, until it becomes the word supercali, and then finally supercalifragilisticexpialidoceous, it's sort order might change like this: rich rich rich It's position is simply further narrowed down as more characters are added, but it will continue to stay in among the words that start with the letter “s”. |
|
Example: If you start with the number 6, and you add random digits to it, it's position in a sorted list will change dramatically, each time a new digit is added: 4 4 4 There is no limit to how far it can jump as more digits are added. |
But fractional decimal numbers grow to the right, just like strings, and their value does not change greatly as new digits are added to the right, because their value is most affected by their initial digits, the same way that strings are sorted primarily by their initial characters.
|
Example: 0.4 0.4 0.4 |
So if each string could be converted to a fraction, those fractions would sort the same way as strings do: both would have positions primarily influenced by their initial members, and the addition of more characters or digits on the right, would not greatly change their positions.
We can't jam 26 a-z values into a column of base-10.
But there is not enough space in the base 10 arithmetic place value system to convert each ASCII symbol into a numeric value in a column to the right of a decimal point:
|
decimal |
s |
u |
p |
e |
r |
|---|---|---|---|---|---|
|
. |
115 |
117 |
112 |
101 |
114 |
|
point |
tenths |
hundredths |
thousandths |
ten-thousandths |
hundred-thousandths |
Because the ASCII value of the letter “s” is 115, two of it's digits would spill over into the ones and tens places. And because the ASCII value of the letter “u” is 117, two of it's values would spill over into the tenths and ones places. The range of printable ASCII characters is just too big; even if you ignored all punctuation and converted all text into lower-case, you would still have to jam the values 1 through 26 into a single column in a base 10 place value system.
So instead of trying to convert the first character into tenths, and the second into hundredths, and the third into thousandths, you can convert the first character into a fraction of 256ths, and the second into a fraction of 2562, and the third into a fraction of 2563 and so on.
By starting with the denominator of 256, you are guaranteed that the numerator will never be bigger than the denominator, so the value will never spill over into the next colum. Just as the next column after tenths in the base 10 place value system is 10-2, the next denominator we will use is 2562. This will also ensure that no matter what the next character is, it will never spill over into other columns.
|
s |
u |
p |
e |
r |
|---|---|---|---|---|
|
115/256 |
+ 117/2562 |
+ 112/2563 |
+ 101/2564 |
+ 114/2565 |
|
0.44921875 |
0.0017852783203125 |
0.00000667572021484375 |
0.00000002351589500904083251953125 |
0.000000000103682396002113819122314453125 |
|
0.44921875 |
0.4510040283203125 |
0.45101070404052734375 |
0.45101072755642235279083251953125 |
0.451010727660104748792946338653564453125 |
The decimal value of each fraction is shown in the third row. The cumulative sum generated by adding the fractions are shown in the fourth row. So the numerical value for the string “super” is the final sum in the fifth column: 0.451010727660104748792946338653564453125. Because they are displayed in base 10, the values generated by adding characters to the right do actually spill over into the columns on the left, but this is only because these numbers are displayed as base 10 decimal fractions.
|
Example: The numerical values for the above strings are: rich -
0.44692059792578220367431640625 rich -
0.44692059792578220367431640625 rich -
0.44692059792578220367431640625 The digit at which a number starts to differ from it's successor is indicated. For the longer text strings, the difference lies beyond the floating point resolution of my machine. |
This scheme apparently does create numbers which sort in the same order as the original strings.
But what about averages?
The letter “a” converts to 0.37890625. The letter “c” converts to 0.38671875. The average of those two numbers is 0.3828125, which is the number that the letter “b” convers into. So the letters “a” and “c” average out to the letter “b”, which is the behaviour you would expect.
Using slightly longer strings: “aaa” = 0.380392134189605712890625, “aba” = 0.380407392978668212890625, and “aca” = 0.380422651767730712890625. The average of these three numbers is 0.380407392979, which is nearly exactly the string “aba”, as you would expect. The reason it is different at all is because of floating point round-off errors. The average of "aaa", "aba" and "aca" is "aba", as you would expect.
Converting back to strings.
Because the denominator of each character is 256 times larger than the one before it, multiplying the fraction by 256 should bring just one character up above the decimal point. The fractional part of this new number can be saved off, and the non-fractional part can be converted directly back into an ASCII character. This three step process of a) multiplying by 256, b) saving the fractional part of the product, and c) converting the non-fractional part directly into an ASCII character can be repeatedly applied to re-generate all the characters in the original string.
|
Example: Up above, the string "super" was found to be 0.451010727660104748792946338653564453125. a) 256 *
0.451010727660104748792946338653564453125 = 115.45874628098682 a) 256 * .45874628098682
= 117.43904793262593 a) 256 * .43904793262593
= 112.39627075223808 a) 256 * .39627075223808
= 101.44531257294848 a) 256 * .44531257294848
= 114.00001867481087 a) 256 * .00001867481087
= 0.0047807515827199996 |
What's the use?
If you have a very long list of strings, you might want to generate a box plot histogram, so you would need to calculate the average and standard deviation of a list of strings.
You might want to know if most of the entries are in the first half of the alphabet or in the last half.
You might want to display the values graphically.
Implementation.
I found that I could use base-128 instead of base-256 because I'm only interested in printable ASCII characters.
|
Python implementation: def float_to_text(some_float):
import math
some_float *= 128
ret_val = ""
the_int = int(some_float)%256
while the_int != 0 and len(ret_val) < 128:
ret_val += chr(the_int)
some_float -= the_int
some_float *= 128
the_int = int(some_float)%256
return ret_val
def text_to_float(some_string):
import math
string_len = len(some_string)
ret_val = 0.0
for string_index in range(0, string_len):
the_char = some_string[string_index]
numerator = ord(the_char)
denominator = math.pow(128, string_index + 1)
ret_val += float(numerator)/float(denominator)
return ret_val
|
|
C implementation: #include <stdio.h>
#include <math.h>
double text_to_float(char * some_string)
{
double ret_val = 0.0;
double numerator;
double denominator;
int index = 1;
while ( *some_string != '\0')
{
numerator = *some_string;
denominator = pow(128, index);
ret_val += numerator / denominator;
some_string ++;
index ++;
}
return ret_val;
}
int main ( int argc, char ** argv )
{
double the_double = 0.0;
the_double = text_to_float ( argv[1] );
printf("%50.50f", the_double);
return 0;
}
#include <stdio.h>
#include <math.h>
char * float_to_text(double some_float)
{
char *ret_val = (char * )malloc(128 * sizeof(char));
int the_index = 0;
char the_char;
some_float *= 128;
the_char = (char)(((int)some_float) % 128);
while ( (the_char != '\0') && (the_index < 128))
{
ret_val[the_index] = the_char;
some_float -= the_char;
some_float *= 128;
the_char = (char)(((int)some_float) % 128);
the_index ++;
}
return ret_val;
}
int main ( int argc, char ** argv )
{
double the_float;
char * ret_val;
sscanf ( argv[1], "%lf", &the_float);
ret_val = float_to_text(the_float);
printf("%s\n", ret_val);
return 0;
}
|
The best implementation is to use the Postgres Numeric datatype, which offers unlimited precision.