Genetic Algorithm for Solving Substitution Ciphers using Trigram Frequency - Prof. David P, Study Guides, Projects, Research of Computer Science

A project for creating a program that uses a genetic algorithm to solve substitution ciphers by analyzing trigram frequency in English text. The program assumes a master file of at least 10,000 words and a cipher text of at least one paragraph. The algorithm involves an initial population of 30 sets of letters, selection based on fitness value, crossover, mutation, and fitness measurement. The objective is to translate the cipher text to plain text with 85% accuracy.

Uploaded on 08/09/2009

Artificial Intelligence Project
CS 627 - Fall 2002
Sawipa Sakulchareon
SURVIVAL OF THE FITTEST AND SIMPLE SUBSTITUTION CIPHERS
INTRODUCTION
The simple substitution cipher can be attacked in many ways. This program focuses on
solving the cipher with a genetic algorithm, a method derived from Darwin's theory of
natural selection.
What is the substitution cipher?
In this cipher, each letter is replaced by another letter, leaving spaces and punctuation
unchanged.
For example,
"PCQ VMJIPD LHIK LISE KHAHJAWAV HAV ZCIPE EIPD KHAHJIUAJ
KHJEE KCPK."
EFIRCDME, LAREK IJCS LHE LHCMKAPV APV CPE PIDHLK
(cipher)
Now during this time Shahrazad had borne king Shahriyar three sons.
Epilogue, Tales from the Thousand and One Nights
(plain text)
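The mechanics of a simple substitution cipher can be sketched in a few lines of Python. This is only an illustration, not the project's code: the function names `make_key`, `encrypt`, and `decrypt` are hypothetical, and the key is a random permutation rather than the one used in the example above.

```python
import random
import string

ALPHABET = string.ascii_uppercase  # the 26 letters A-Z


def make_key(rng):
    """Return a random permutation of the alphabet (hypothetical helper)."""
    letters = list(ALPHABET)
    rng.shuffle(letters)
    return "".join(letters)


def encrypt(plain, key):
    """Replace each letter by its partner in the key; other characters pass through."""
    return plain.upper().translate(str.maketrans(ALPHABET, key))


def decrypt(cipher, key):
    """Invert the substitution by mapping key letters back to the alphabet."""
    return cipher.upper().translate(str.maketrans(key, ALPHABET))


rng = random.Random(0)          # fixed seed so the run is repeatable
key = make_key(rng)
message = "NOW DURING THIS TIME"
secret = encrypt(message, key)
print(secret)
print(decrypt(secret, key))
```

Spaces and punctuation are left unchanged because `str.translate` only touches the characters listed in the table.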
Why does the program use a genetic algorithm?
Genetic algorithms are easy to apply to a wide range of problems. There is a similarity
between the English alphabet and the genetic code: the English alphabet is a discrete
set of 26 letters, while real DNA uses an alphabet of 4 letters (A, G, T, C). This
makes a genetic algorithm a natural method to try when dealing with ciphers. The result
can be very good even if it is not optimal. The objective is to translate the cipher to
plain text with only 85% accuracy. A human reader can then guess the rest of the cipher
by deducing which letters were replaced incorrectly. It suggests that what is good for
nature may also be good for AI.
SCOPE AND OBJECTIVE
The program is intended for solving ciphers in English by using the frequency of
trigrams in standard English. The program assumes an alphabet of 26 letters and
discards spaces and punctuation. It may also work for other languages if the same
restrictions apply. It needs 2 input files to construct the trigram frequency statistics.
One input is the master file, which should be at least 10,000 words long. The other
is the cipher text, which should be at least one paragraph long.

What is trigram frequency?

In standard English, not every sequence of 3 letters occurs equally often. For example, "the" and "and" are the most frequent 3-letter words in English. By counting every run of 3 consecutive letters, the program builds statistics of the trigram frequency.

THE COMPLETE ALGORITHM OF THE PROGRAM

Initial Population

The population size in the program is 30. The population size does affect the performance of the program to a certain degree, but bigger is not necessarily better. Many books suggest that a good population size is in the range of 20-50. The population consists of 30 sets of letters, each called a "genome". Each genome holds the 26 letters in random order, each one standing in for another letter. No letter may appear twice in a genome.

Selection

The selection process starts by drawing a random number in the range 0-1 and randomly picking a pair of genomes from the current population. The fitness value of each genome is then looked up. If the value exceeds the random number, the genome is selected as a parent of new offspring; otherwise it is discarded and the process picks another genome. Because a higher fitness value gives a better chance of passing this test, genomes with high fitness produce more offspring than genomes with low fitness.

Cross Over

After 2 parents are selected, another random number is drawn, this time in the range 0-25. This number is the crossover point at which the two parents exchange their letters. The offspring copies the first parent's letters up to the crossover point; the rest of the set is filled with the letters from the second parent that have not already been copied from the first parent.

Mutation

Every 5 generations, a pair of letters in each genome of that generation is swapped to create the mutation effect. The pair to swap is selected at random.
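The operators described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the original program: all function names are hypothetical, the fitness function is assumed to return a value between 0 and 1, parents are tested one at a time rather than as a pair, and the crossover point is treated as the number of letters copied from the first parent.

```python
import random
import string

ALPHABET = string.ascii_uppercase
POPULATION_SIZE = 30  # population size stated in the text


def random_genome(rng):
    """A genome is a random ordering of the 26 letters, with no repeats."""
    letters = list(ALPHABET)
    rng.shuffle(letters)
    return letters


def select_parent(population, fitness_of, rng, max_tries=1000):
    """Pick a random genome and keep it if its fitness beats a random threshold.

    Assumption: fitness_of(genome) returns a value in [0, 1]. The max_tries
    fallback is not in the text; it only guards against an endless loop.
    """
    candidate = rng.choice(population)
    for _ in range(max_tries):
        if fitness_of(candidate) > rng.random():
            return candidate
        candidate = rng.choice(population)
    return candidate


def crossover(first, second, rng):
    """Copy the head of the first parent, then fill in from the second parent."""
    point = rng.randint(0, 25)           # crossover point in the range 0-25
    head = first[:point]
    used = set(head)
    tail = [letter for letter in second if letter not in used]
    return head + tail                   # still a permutation of 26 letters


def mutate(genome, rng):
    """Swap one randomly chosen pair of letters."""
    i, j = rng.sample(range(len(genome)), 2)
    mutated = list(genome)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


rng = random.Random(2002)
population = [random_genome(rng) for _ in range(POPULATION_SIZE)]
child = crossover(population[0], population[1], rng)
mutant = mutate(child, rng)
print("".join(child))
print("".join(mutant))
```

Filling the tail only with letters not yet used is what keeps every offspring a valid substitution key, since a plain cut-and-splice of two permutations would usually duplicate some letters and drop others.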
Fitness

The fitness of each genome is measured by the gap between two trigram frequency distributions: the one obtained from the cipher text after translating it with the genome, and the standard one recorded from the master file. The bigger the gap, the lower the fitness value. This process is a little complicated. It starts by opening the master file, recording the frequency of every trigram, and saving the counts in a multidimensional array. The same process is then repeated on the cipher text.
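One way to picture the trigram statistics and the fitness measure is the sketch below. The text does not give the exact formula for the gap, so this example makes its own assumptions: the gap is the sum of absolute differences between relative trigram frequencies, fitness is scaled so that a perfect match gives 1.0, and a `Counter` stands in for the multidimensional array. All names are hypothetical.

```python
from collections import Counter
import string

ALPHABET = string.ascii_uppercase


def trigram_frequencies(text):
    """Relative frequency of every run of 3 consecutive letters.

    Spaces and punctuation are discarded first, as the program assumes.
    """
    letters = [ch for ch in text.upper() if ch in ALPHABET]
    counts = Counter("".join(letters[i:i + 3]) for i in range(len(letters) - 2))
    total = sum(counts.values())
    return {tri: n / total for tri, n in counts.items()} if total else {}


def translate(cipher_text, genome):
    """Assumption: genome[i] is the plain letter proposed for cipher letter ALPHABET[i]."""
    return cipher_text.upper().translate(str.maketrans(ALPHABET, "".join(genome)))


def fitness(genome, cipher_text, master_freq):
    """Assumed measure: 1.0 when the distributions match, lower as the gap grows."""
    candidate_freq = trigram_frequencies(translate(cipher_text, genome))
    trigrams = set(master_freq) | set(candidate_freq)
    gap = sum(abs(master_freq.get(t, 0.0) - candidate_freq.get(t, 0.0))
              for t in trigrams)
    return 1.0 - gap / 2.0   # gap is at most 2 for two probability distributions


master = "THE CAT AND THE HAT"            # tiny stand-in for the master file
master_freq = trigram_frequencies(master)
identity = list(ALPHABET)                 # genome that changes nothing
shifted = identity[1:] + identity[:1]     # genome that shifts every letter by one
print(fitness(identity, master, master_freq))
print(fitness(shifted, master, master_freq))
```

A real run would build `master_freq` from a master file of at least 10,000 words; the short string here only keeps the example self-contained.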