

This project creates a program that uses a genetic algorithm to solve substitution ciphers by analyzing trigram frequency in English text. The program assumes a master file of at least 10,000 words and a cipher text of at least one paragraph. The algorithm involves an initial population of 30 sets of letters, selection based on fitness value, crossover, mutation, and fitness measurement. The objective is to translate the cipher text to plain text with 85% accuracy.
The basic substitution cipher can be attacked in many ways. This program focuses on solving it with a genetic algorithm, a technique derived from Darwin's theory of natural selection.

What is the substitution cipher?
In this cipher, each letter is replaced by another letter, while spaces and punctuation are left unchanged. For example,
Now during this time Shahrazad had borne king Shahriyar three sons.
Epilogue, Tales from the Thousand and One Nights (plain text)

Why is the program using the Genetic Algorithm?
The Genetic Algorithm is easy to apply to a wide range of problems. There is a similarity between the English alphabet and the genes in DNA: the English alphabet has a discrete set of 26 letters, while real DNA has an alphabet of 4 letters, A, G, T and C. This makes the Genetic Algorithm a natural method to consider when dealing with ciphers. The result can be very good even if it is not optimal. The objective is to translate the cipher text to plain text with only 85% accuracy; the remaining incorrect letter replacements can be deduced by a human reader. This suggests that what is good for nature may also be good for AI.
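To make the substitution cipher described earlier concrete, the following minimal sketch applies a 26-letter key to a text and then reverses it. The helper names `make_key` and `substitute` are hypothetical, and preserving upper and lower case is an assumption of this sketch; the source only states that spaces and punctuation stay unchanged.

```python
import random
import string

ALPHABET = string.ascii_lowercase


def make_key(rng):
    """Return a random permutation of the 26 letters (hypothetical helper)."""
    letters = list(ALPHABET)
    rng.shuffle(letters)
    return "".join(letters)


def substitute(text, source, target):
    """Replace each letter of `source` with the letter at the same position
    in `target`. Spaces and punctuation are not in the table, so they pass
    through unchanged. Keeping letter case is an assumption of this sketch."""
    table = str.maketrans(source + source.upper(), target + target.upper())
    return text.translate(table)


rng = random.Random(0)  # fixed seed so the run is repeatable
key = make_key(rng)
plain = "Shahrazad had borne king Shahriyar three sons."
cipher = substitute(plain, ALPHABET, key)      # encrypt with the key
recovered = substitute(cipher, key, ALPHABET)  # decrypt by reversing the mapping
```

Solving the cipher amounts to finding the key without knowing it, which is the search the genetic algorithm performs.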
The program is intended for solving English ciphers by using the frequency of trigrams in Standard English. It assumes an alphabet of 26 letters and discards spaces and punctuation. It may also work for other languages if the same restrictions apply. It needs 2 input files to construct the trigram frequency statistics. One is the master file, which should be at least 10,000 words long. The other is the cipher text, which should be at least one paragraph long.
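The statistics gathering described above can be sketched as follows: strip everything except the 26 letters, then count every run of three consecutive letters (a trigram). This is a minimal sketch, not the project's actual code. The function names are hypothetical, the text is passed in as a string rather than read from a file, and a dictionary of relative frequencies is an assumption of this sketch.

```python
from collections import Counter


def normalize(text):
    """Keep only the letters a-z, lowercased; drop spaces and punctuation."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")


def trigram_frequencies(text):
    """Map each trigram to its relative frequency in the normalized text."""
    letters = normalize(text)
    # Slide a window of width 3 over the letter stream.
    counts = Counter(letters[i:i + 3] for i in range(len(letters) - 2))
    total = sum(counts.values())
    if total == 0:  # fewer than 3 letters: no trigrams at all
        return {}
    return {trigram: n / total for trigram, n in counts.items()}


sample_freqs = trigram_frequencies("The cat, the hat.")
```

In the sample, the normalized text is "thecatthehat", which contains 10 trigrams, and "the" accounts for 2 of them. Because spaces are discarded, trigrams such as "att" that span word boundaries are counted as well.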
What is trigram frequency?
In Standard English, not every sequence of 3 letters occurs equally often. For example, "the" and "and" are the most frequent 3-letter words in English. By counting the occurrences of every run of 3 consecutive letters, the program builds statistics of the trigram frequency.

THE COMPLETE ALGORITHM OF THE PROGRAM

Initial Population
The population size in the program is 30. The size of the population does affect the performance of the program to a certain degree, but bigger is not necessarily better. Many books suggest that a good population size is in the range of 20-50. The population consists of 30 sets of letters, each called a "Genome." Each genome holds the 26 letters in random order, with each letter standing in for another letter. No letter may be repeated within a genome.

Selection
The selection process starts by drawing a random number in the range 0-1 and randomly picking a pair of genomes from the current population. The fitness value of each genome is then looked up. If the value exceeds the random number, the genome is selected as a parent of new offspring; otherwise it is discarded and the process picks another genome. Because a higher fitness value gives a better chance of passing this test, genomes with high fitness produce more offspring than genomes with low fitness.

Cross Over
After 2 parents are selected, another random number is drawn, this time in the range 0-25. This number is the crossover point at which the two parents exchange their letters. The offspring copies the first parent's letters up to the crossover point; the rest of the set is filled with the letters from the second parent that have not already been copied from the first parent.

Mutation
Every 5 generations, a randomly selected pair of letters in each genome of that generation is swapped to create the mutation effect.
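The four steps above can be sketched as follows. This is a minimal sketch under stated assumptions, not the project's actual code. All function names are hypothetical. Parents are drawn one at a time with a fresh random threshold per attempt, and a `max_tries` fallback (not in the source) avoids an endless loop when every fitness is near zero. The crossover point is treated as the number of letters copied from the first parent, and generations are assumed to be numbered so that every multiple of 5 mutates.

```python
import random
import string

ALPHABET = string.ascii_lowercase
POPULATION_SIZE = 30   # population size stated in the text
MUTATION_INTERVAL = 5  # mutate every 5 generations, as stated in the text


def random_genome(rng):
    """A genome is a random permutation of the 26 letters, with no repeats."""
    letters = list(ALPHABET)
    rng.shuffle(letters)
    return letters


def initial_population(rng):
    return [random_genome(rng) for _ in range(POPULATION_SIZE)]


def select_parent(population, fitness, rng, max_tries=1000):
    """Pick random genomes until one's fitness exceeds a random number in 0-1.
    The max_tries fallback is an assumption of this sketch."""
    for _ in range(max_tries):
        candidate = rng.choice(population)
        if fitness(candidate) > rng.random():
            return candidate
    return max(population, key=fitness)


def crossover(parent_a, parent_b, rng):
    """Copy parent_a up to a random point in 0-25, then append the letters of
    parent_b that were not copied, in parent_b's order."""
    point = rng.randint(0, 25)
    head = parent_a[:point]
    used = set(head)
    tail = [letter for letter in parent_b if letter not in used]
    return head + tail


def mutate(genome, rng):
    """Swap one randomly chosen pair of positions in place."""
    i, j = rng.sample(range(len(genome)), 2)
    genome[i], genome[j] = genome[j], genome[i]


def next_generation(population, fitness, generation, rng):
    """Breed a full replacement population, mutating on every 5th generation."""
    children = [
        crossover(select_parent(population, fitness, rng),
                  select_parent(population, fitness, rng), rng)
        for _ in range(len(population))
    ]
    if generation % MUTATION_INTERVAL == 0:
        for child in children:
            mutate(child, rng)
    return children
```

Filling the tail only with letters not yet used keeps every offspring a valid permutation, which a plain one-point crossover would not guarantee.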
Fitness
The fitness of each genome is measured by the gap between two trigram frequency distributions: the one obtained after translating the cipher text with the genome's letters, and the standard trigram frequency statistics from the master file. The bigger the gap, the lower the fitness value. This process is a little complicated. It starts by opening the master file, recording the frequency of every trigram, and saving the frequency counts in a multidimensional array. The same process is then repeated on the cipher text.
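One way to turn this gap into a number is sketched below. The source does not give the exact formula, so several choices here are assumptions: the gap is taken as the sum of absolute differences between relative trigram frequencies, the fitness is `1 - gap / 2` so that it falls in the 0-1 range used by selection, and position `i` of the genome is taken to hold the plain letter for the i-th cipher letter. A dictionary stands in for the multidimensional array, and the inputs are assumed to be lowercase letters only.

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase


def trigram_frequencies(letters):
    """Relative frequency of each trigram in a string of lowercase letters."""
    counts = Counter(letters[i:i + 3] for i in range(len(letters) - 2))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {trigram: n / total for trigram, n in counts.items()}


def decode(cipher_letters, genome):
    """Translate cipher letters to candidate plain letters using the genome.
    Assumption: genome[i] is the plain letter for cipher letter ALPHABET[i]."""
    table = str.maketrans(ALPHABET, "".join(genome))
    return cipher_letters.translate(table)


def fitness(genome, cipher_letters, master_freqs):
    """Return a value in 0-1; a smaller gap to the master statistics
    gives a higher fitness."""
    decoded_freqs = trigram_frequencies(decode(cipher_letters, genome))
    trigrams = set(decoded_freqs) | set(master_freqs)
    gap = sum(abs(decoded_freqs.get(t, 0.0) - master_freqs.get(t, 0.0))
              for t in trigrams)
    # Two frequency distributions differ by at most 2 in total, so gap / 2
    # lies in 0-1; the clamp only guards against floating-point rounding.
    return max(0.0, 1.0 - gap / 2.0)
```

With this scaling, a genome that decodes the text into exactly the master distribution scores 1, and a genome whose decoded trigrams never appear in the master file scores 0.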