# Converting Text to Binary

To encipher text-based messages using binary numbers, you'll need a standard convention for representing letters as numbers. Up until now, the numerical representation of letters has always used decimal numbers. Keeping the same numerical values, but writing them as binary numbers, yields the following table.

| Character | Decimal | Binary | Character | Decimal | Binary |
|-----------|---------|--------|-----------|---------|--------|
| A | 0 | 00000 | N | 13 | 01101 |
| B | 1 | 00001 | O | 14 | 01110 |
| C | 2 | 00010 | P | 15 | 01111 |
| D | 3 | 00011 | Q | 16 | 10000 |
| E | 4 | 00100 | R | 17 | 10001 |
| F | 5 | 00101 | S | 18 | 10010 |
| G | 6 | 00110 | T | 19 | 10011 |
| H | 7 | 00111 | U | 20 | 10100 |
| I | 8 | 01000 | V | 21 | 10101 |
| J | 9 | 01001 | W | 22 | 10110 |
| K | 10 | 01010 | X | 23 | 10111 |
| L | 11 | 01011 | Y | 24 | 11000 |
| M | 12 | 01100 | Z | 25 | 11001 |
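
The table above can be reproduced with a short Python snippet, using the built-in format() function with the '05b' specifier to produce 5-bit binary strings:

```python
# Assign A=0 through Z=25 and print each letter's 5-bit binary code
for i, letter in enumerate('ABCDEFGHIJKLMNOPQRSTUVWXYZ'):
    print(letter, i, format(i, '05b'))
```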

Each of the 26 uppercase letters can be represented by a 5-bit number. There are a few unused 5-bit numbers (26-31) that we can assign to other characters if we wish, for example to include some punctuation.

| Character | Decimal | Binary |
|-----------|---------|--------|
| . | 26 | 11010 |
| ! | 27 | 11011 |
| ? | 28 | 11100 |
| ( | 29 | 11101 |
| ) | 30 | 11110 |
| - | 31 | 11111 |

Note that these choices are arbitrary. Someone else may choose to use these remaining 5-bit numbers in a completely different way. Several standards exist for converting between text and numbers. We'll explore a few in the following sections.

## ASCII

The American Standard Code for Information Interchange (ASCII) was one of the first widely used standards for representing text in computers as binary numbers, dating back to the 1960s. Its 7-bit binary codes allow for 128 different characters. ASCII grew out of telegraph codes, and the first 32 characters (0-31 in decimal) are not printable characters but control characters that determine how a device such as a teleprinter should operate. For example, character 10 represents the "line feed" function, which causes a printer to advance its paper; character 9 represents "horizontal tab"; and character 8 represents "backspace".

An old ASCII (or USASCII, as it was sometimes called) code chart is found below. The column determines the left-most 3 bits of a character's code, while the row determines the right-most 4 bits. For example, A would be 1000001 and t would be 1110100.
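
The chart's arithmetic can be checked with a couple of lines of Python: the built-in ord() function returns a character's code point, and the '07b' format specifier writes it as a 7-bit binary number.

```python
# ord() gives the ASCII code; format(..., '07b') writes it with 7 bits
print(format(ord('A'), '07b'))  # 1000001
print(format(ord('t'), '07b'))  # 1110100
```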

## Unicode

As computers evolved and eventually overtook the telegraph for everyday communications, 8-bit representations became preferred. 8-bit numbers worked well with the newer 8-, 16-, 32-, and now 64-bit processors found in computers. The one additional bit allowed for 128 additional character choices. As a result, ASCII evolved into many different variations that retained the original 128 characters but made very different choices for the new 128. Some variations were regional (ISCII in India, VISCII in Vietnam); others added characters that could be used to draw computer graphics. It wasn't until the early 1990s that a standard built on 8-bit units was widely adopted: the Universal Coded Character Set (Unicode) Transformation Format, also known as UTF-8. As of September 2019, UTF-8 is used by 94.0% of all web pages in the world.

One benefit of UTF-8 is that you can use multiple 8-bit codes together to generate even more characters. In fact, emojis can be represented with Unicode characters. The Smiling Face with Sunglasses Emoji 😎 is represented as 11110000 10011111 10011000 10001110.
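
Those four bytes can be verified in Python; the snippet below uses the built-in str.encode() method, which produces a string's UTF-8 bytes:

```python
# Print each UTF-8 byte of the emoji as an 8-bit binary number
emoji = '\U0001F60E'  # 😎 Smiling Face with Sunglasses, code point U+1F60E
print(' '.join(format(byte, '08b') for byte in emoji.encode('utf-8')))
# 11110000 10011111 10011000 10001110
```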

While incredibly powerful and customizable, Unicode is more complicated than we need to illustrate how binary operations can encrypt text-based messages.

## Base64

While ASCII and Unicode are impressive for the number of different characters they can represent with 8 bits, in this course we'll focus on using smaller, 6-bit numbers to keep examples easier to understand. Fortunately, there's a standard for what might be considered the essential printable characters. It consists of the 26 uppercase letters, the 26 lowercase letters, the 10 numerals, and the + and / symbols. This set of 64 characters is known as Base64 and is widely used when sending and receiving information over the internet. Base64's primary use is to convert binary information into text so it can be sent through many established text-based communications channels such as email and HTML. When received, the text is turned back into binary, where it might represent an image file, audio file, or any other file that can be read by a computer.
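
As a quick sketch using Python's standard base64 module (not needed for the rest of the chapter), three bytes of binary data become exactly four Base64 characters:

```python
import base64

# Three bytes (24 bits) split into four 6-bit groups: 000000 000001 000010 000011
data = bytes([0b00000000, 0b00010000, 0b10000011])
print(base64.b64encode(data).decode())  # ABCD
```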

We'll go against the norm and use the Base64 table below to convert text to binary for use in our ciphers for the remainder of this chapter.

| Index | Binary | Char | Index | Binary | Char | Index | Binary | Char |
|-------|--------|------|-------|--------|------|-------|--------|------|
| 0 | 000000 | A | 23 | 010111 | X | 46 | 101110 | u |
| 1 | 000001 | B | 24 | 011000 | Y | 47 | 101111 | v |
| 2 | 000010 | C | 25 | 011001 | Z | 48 | 110000 | w |
| 3 | 000011 | D | 26 | 011010 | a | 49 | 110001 | x |
| 4 | 000100 | E | 27 | 011011 | b | 50 | 110010 | y |
| 5 | 000101 | F | 28 | 011100 | c | 51 | 110011 | z |
| 6 | 000110 | G | 29 | 011101 | d | 52 | 110100 | 0 |
| 7 | 000111 | H | 30 | 011110 | e | 53 | 110101 | 1 |
| 8 | 001000 | I | 31 | 011111 | f | 54 | 110110 | 2 |
| 9 | 001001 | J | 32 | 100000 | g | 55 | 110111 | 3 |
| 10 | 001010 | K | 33 | 100001 | h | 56 | 111000 | 4 |
| 11 | 001011 | L | 34 | 100010 | i | 57 | 111001 | 5 |
| 12 | 001100 | M | 35 | 100011 | j | 58 | 111010 | 6 |
| 13 | 001101 | N | 36 | 100100 | k | 59 | 111011 | 7 |
| 14 | 001110 | O | 37 | 100101 | l | 60 | 111100 | 8 |
| 15 | 001111 | P | 38 | 100110 | m | 61 | 111101 | 9 |
| 16 | 010000 | Q | 39 | 100111 | n | 62 | 111110 | + |
| 17 | 010001 | R | 40 | 101000 | o | 63 | 111111 | / |
| 18 | 010010 | S | 41 | 101001 | p | | | |
| 19 | 010011 | T | 42 | 101010 | q | | | |
| 20 | 010100 | U | 43 | 101011 | r | | | |
| 21 | 010101 | V | 44 | 101100 | s | | | |
| 22 | 010110 | W | 45 | 101101 | t | | | |

While we'll be working with 6-bit numbers in Base64, the methods described in the remainder of the chapter would still work with numbers represented by more or fewer than 6 bits.

## Using Python to Convert Between Base64 and Binary

Python has a built-in data type for storing binary data. However, it requires a careful understanding of syntax and operations that pertain to binary, far beyond the scope of this course. As such, instead of using the binary data type, this course will store binary information as strings of 1's and 0's. To facilitate quick conversions between Base64 characters and 6-bit binary, use the following functions.
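
As a minimal sketch of this string-based approach, Python's built-in int(bits, 2) and format(num, '06b') convert between a string of bits and an integer:

```python
# Binary stored as an ordinary string of '0' and '1' characters
bits = '000101'
num = int(bits, 2)         # interpret the string as a base-2 number
print(num)                 # 5
print(format(num, '06b'))  # 000101 (back to a 6-bit string)
```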

### charToBinary()

This function takes in a single base64 character and returns the corresponding 6-bit binary representation as a string.

Notes:

• If a string with more than 1 character is input to the function, it will only convert the first character.
• If a non base64 character is input to the function, it will return an empty string.
• The output will always be a 6-bit binary number, even if fewer bits are needed to represent the character.

```python
def charToBinary(char):
    # Only convert the first character of a longer string
    if len(char) > 1:
        char = char[0]
    if char in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ':
        return '{:06b}'.format( ord(char) - 65 )
    elif char in 'abcdefghijklmnopqrstuvwxyz':
        return '{:06b}'.format( ord(char) - 71 )
    elif char in '0123456789':
        return '{:06b}'.format( ord(char) + 4 )
    elif char == '+':
        return '{:06b}'.format( 62 )
    elif char == '/':
        return '{:06b}'.format( 63 )
    else:
        return ''
```

```python
print( charToBinary('A') )      # 000000
print( charToBinary('z') )      # 110011
print( charToBinary('+') )      # 111110
print( charToBinary('zebra') )  # 110011 (only the first character is converted)
print( charToBinary('?') )      # prints an empty line ('?' is not a Base64 character)
```

### binaryToChar()

This function takes in a string containing a 6-bit binary number and returns the corresponding Base64 character as a string.

Notes:

• The function will strip any spaces in the input string.
• If the input string contains less than 6 bits, the function will pad the input out to 6-bits by using 0's.
• If the input string contains more than 6 bits, the function will return an empty string.
• The output will always be a single base64 character.

```python
def binaryToChar(binary):
    # Remove any spaces, then pad short inputs out to 6 bits
    binary = binary.replace(' ', '')
    if len(binary) < 6:
        binary = binary.zfill(6)
    if len(binary) > 6:
        return ''
    num = int(binary, 2)
    if (num >= 0) and (num <= 25):
        return chr(num + 65)
    elif (num >= 26) and (num <= 51):
        return chr(num + 71)
    elif (num >= 52) and (num <= 61):
        return chr(num - 4)
    elif num == 62:
        return '+'
    elif num == 63:
        return '/'
    else:
        return ''
```

```python
print( binaryToChar('000101') )   # F
print( binaryToChar('101') )      # F (padded to 000101)
print( binaryToChar('0100101') )  # prints an empty line (more than 6 bits)
print( binaryToChar('100111') )   # n
```


### XOR()

This function takes two strings that both contain binary data of arbitrary length and returns a single string that represents the XOR of the input strings.

Notes:

• The function will strip any spaces from the input strings.
• The output will be padded to be equal in length to the longer input string.

```python
def XOR(binary1, binary2):
    # Remove any spaces from the inputs
    binary1 = binary1.replace(' ', '')
    binary2 = binary2.replace(' ', '')
    # Convert to integers, XOR, then write the result back as a binary string
    result = format(int(binary1, 2) ^ int(binary2, 2), 'b')
    # Pad the result to the length of the longer input
    return result.zfill(max(len(binary1), len(binary2)))
```

```python
print( XOR( '1110', '0001' ) )  # 1111
print( XOR( '1110', '1' ) )     # 1111
print( XOR( '10110 01000 01100 10010 10100 01001', '11010 11001 00011 11010 11001 00011' ) )
# 011001000101111010000110101010
```
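
One property worth noting for the ciphers ahead, sketched here with Python's integer ^ operator rather than the XOR() helper above: XOR is its own inverse, so applying the same key twice recovers the original bits.

```python
message = 0b010111                  # 'X' in the Base64 table
key     = 0b101010
cipher  = message ^ key             # encrypt by XORing with the key
print(format(cipher, '06b'))        # 111101
print(format(cipher ^ key, '06b'))  # 010111 (decrypting with the same key)
```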