Civil War Diary Code
by aweiss814
Basic story:
I’m a PhD student at UC Berkeley. The history department recently got hold of a Civil War diary, but a number of passages are written in code. We can’t decipher it, so we were hoping someone among the public could. Anyone want to take a shot at deciphering this–or, also helpful, let me know what code you think it is? Images attached.
Some context: this is a Union soldier who appears to have seen action in a number of battles throughout the war. It is unlikely that this is military code, since we suspect that he is writing about his superior officers and would not want them to be able to decipher it.
***I am currently in the process of reading the diary to determine the whole context. I will update this website with more information as soon as I finish.
Can you say something about the context? And plus it is not quite readable…
It’s quite possible that this is a simple substitution cipher. It seems like there are too many different symbols for that, but I think it’s just a matter of determining which symbols are variations of the same symbol. However, the segment in one image (the one with the diary entry beginning “21 Boys”) appears to be in an entirely different cipher from the rest. That one’s not going to be solvable unless there is more of it.
I wonder if it’s a different cipher or, as Adrienne suggested, just a particular line of code containing the symbols used to represent proper nouns, places, etc.?
What have you already tried? Given the use case, I’d guess that this is a substitution cipher with a shift, possibly with some pictograms interspersed to represent proper nouns/places. Based on how halting the handwriting looks I’d guess that the translation won’t be error-free.
It does look like a substitution cipher. Some suggestions to start:
1) Assign a character to each symbol, and transcribe the text to a text file using these characters. In some cases it may be hard to know if a drawing is a single symbol or two symbols; just make a guess for now. Err on the side of assuming they are two symbols.
2) Count the number of occurrences of each symbol, and compare to the frequencies in English or, if you also have a computer transcription of the rest of the text, use those frequencies. The most common symbols likely match the most common English letters.
3) The first letter of each line is likely the first letter of a word, so you can look at the most frequent such letters, and compare to the most frequent letters in English (just measure the size of each section of the dictionary).
4) Look at every sequence of 2 and 3 letters, and count their occurrences. The most common 2-letter sequence in English is “th”, and the most common 2-letter word is likely “of”. The most common 3-letter pattern is likely “the”.
5) The comma-looking symbol is likely a space, a word separator. If that’s the case, looking at the first and last letters of each word can help you guess the most frequent ones.
Unfortunately these suggestions would work well only if you had a lot of text. You may not have enough for them to work, but it’s worth a try.
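Steps 2 and 4 above can be sketched in a few lines of Python once the symbols have been transcribed as names. This is only a minimal sketch: the symbol names come from one commenter's transcription of image 3 in this thread, the helper `ngram_counts` is a made-up name, and the sample is far too short to say anything about the real cipher.

```python
from collections import Counter

# Symbol names as one commenter transcribed the start of image 3.
# The naming scheme is that transcriber's convention, not an established one.
line = ("capital-l backslash divided-by c-slash-slash capital-i comma "
        "capital-i equivalent divided-by slash-slash-backslash-backslash comma")
symbols = line.split()

def ngram_counts(tokens, n):
    """Count every run of n consecutive symbols (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

unigrams = Counter(symbols)           # step 2: how often each symbol occurs
bigrams = ngram_counts(symbols, 2)    # step 4: 2-symbol sequences
trigrams = ngram_counts(symbols, 3)   # step 4: 3-symbol sequences

for sym, count in unigrams.most_common(3):
    print(sym, count)
```

Note that this counts sequences across the comma symbol too; if the comma really is a word separator, it may be better to count n-grams within each word separately.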
Everyone’s right about not enough text. I’m going to scan the whole diary next week and post it to this site–hopefully that will help provide context and enough of the symbols to be decipherable
Maybe this gives results if you compare it with the letter frequencies of the text.
Can it be typed up?
Unfortunately, in response to various questions about this, it can’t be typed up because many or even most of the characters don’t exist on a keyboard.
Just use numbers instead of the weird symbols when you type them up.
The actual characters you use don’t matter, as long as they match the ciphertext’s ordering.
Just make a key. For example 1 = “squiggly text with line”, 2 = “squiggly text with tail”, etc. Since it’s probably a substitution cipher, all that matters is that the order is maintained.
Actually you probably could type all of these characters with the appropriate Unicode-enabled software. As a former History major turned geek, I would strongly recommend that you compare these symbols closely to the Unicode standard charts. I took an initial look and many of the symbols used appear to match ones in the Unicode chart for supplemental mathematical operators (http://www.unicode.org/charts/PDF/U2A00.pdf).
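A minimal sketch of that kind of key, assuming each symbol has first been given some consistent description (the descriptions and the `build_key` helper below are made up for illustration):

```python
def build_key(tokens):
    """Assign 1, 2, 3, ... to each distinct symbol, in order of first appearance."""
    key = {}
    for tok in tokens:
        if tok not in key:
            key[tok] = len(key) + 1
    return key

# Hypothetical descriptions of four symbols as they appear on the page.
descriptions = ["squiggle-with-line", "squiggle-with-tail",
                "squiggle-with-line", "cross"]

key = build_key(descriptions)
encoded = [key[d] for d in descriptions]

print(key)      # the legend: description -> number
print(encoded)  # the ciphertext as numbers, same order as the original
```

The numbers carry no meaning of their own; what matters is that the same symbol always gets the same number and the sequence is preserved.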
I think that may be a clue to understanding the code. If you can figure out where the symbols came from you might be closer to cracking the code.
It looks a lot like a substitution cipher. I’d say a first step would be to create a chart of all symbols. Write the symbols so they are clear and readable. As it is now, in some cases I can’t tell if I am looking at the same symbol or a different one. Actually, re-writing the entire contents of the letters may be a good idea, as that might provide some context.
This looks very reminiscent of one cipher I learned about as a kid, Pigpen, a substitution cipher used since before the 18th Century as well as various so-called “Freemason” or “Rosicrucian” Ciphers.
Without knowing more about the Union soldier’s background, it’s hard to tell more, but it’s possible that he mixed and matched different ciphers either on purpose or as a self-taught pidgin based on limited exposure to multiple systems.
http://en.wikipedia.org/wiki/Pigpen_cipher
http://www.borderschess.org/Ciphers.htm
If it is a substitution cipher like Joseph suggests, the way to decipher it is to count the frequency of all the symbols and then try to replace symbols with letters corresponding to how frequently different letters occur in the English language.
Since this is a handwritten (quite messy) text with very specialized symbols, this task has to be done manually. To make the text easier to analyze, I suggest starting by digitizing the cipher. Just use numbers instead of the symbols. The challenge will look a lot more appealing to most computer programmers with just a string of numbers to analyze. 😉
If this is something that is really important I suggest you also transcribe the text written by the same person. This way an analyst will get a good data source for comparison.
I should mention that I am not very familiar with cryptanalysis, but I do know some of the basics. And in cases like this (cryptography schemes invented by non-professionals), that will often suffice.
The deletions might be helpful, as presumably the following text will bear some relation to the deleted text.
Please go talk to the EECS professors at Berkeley who study security. Paxson, Wagner, and Song are good places to start. Even though they probably won’t have time to help you personally, any of them will know either a grad student or another professor who will have time to help you.
Yup–got a bunch of grad students on it already, actually. Not sure if they’re studying with the aforementioned professors, but it is possible!
Do you have the original paper copy of the diary?
If yes, does it look like the coded markings were applied with the same writing instrument as the rest of the diary?
This would rule out the case of some other person adding noise at some other time in the last 150 years.
Good question, Jon. Yes, a look at the original diary confirms that the writing of the normal letters and the code is the same, with the same instrument, by the same hand, ostensibly at the same time.
Took a look at the code. It’s definitely not a simple substitution cipher. In fact, I think it’s gibberish.
I just started getting into crypto a few months ago, so take my writing with a grain of salt.
http://aakil.info/articles/the-union-soldiers-diary-code-is-probably-gibberish.html
I thought that, as a History student, I should respond to a few of your comments. First of all, it was very common for soldiers to write in their diaries in some kind of code (they had a number of reasons for it, the primary being the desire to hide their private thoughts from prying soldiers’ eyes–there wasn’t much privacy among troops and soldiers were known to be quite covetous of any reading material to relieve tedium), so it is very likely that this is more than gibberish. Secondly, it was not possible in the American army to simply put a soldier “up at bayonet point” to demand the meaning of coded words. It is highly unlikely that a superior officer would have cared what was being written about him, as long as the soldier had not taken any action against him. More importantly, though, a soldier had a right to write what he wished in his diary (under the First Amendment), unless there was a suspicion that he was working for the enemy. As long as he did his duty as a soldier and raised none of these suspicions, writing in code was common enough that superior officers would not have felt the need to demand he decode it.
About the various letters that appear three times in a row: when this happens, one or more is capitalized, which could connote a totally different letter. More importantly, however, I think it possible that various symbols that appear only once may connote proper nouns: names of places, battles, officers, etc. If so, this makes it exceptionally hard to crack if there is no key, but it is still possible.
I simply do not think that a soldier would have a reason to write in gibberish, especially as the rest of the diary is totally lucid.
Interesting. I’ll definitely add this to the post and rethink some points.
But one thing is for certain: it’s definitely not a simple substitution cipher, as some people have been saying.
I tend not to agree with Aakil’s main points. Just because something is not in the English lexicon doesn’t mean it’s without meaning. For instance, the repeated letters could be a compass heading or perhaps a proper name or location.
Brute force it! Write a program to try all substitution ciphers until one of them results in words that are English.
Are you sure that the encrypted language is English? I am almost on the verge of a key here, but the result might suggest that it is not English.
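For a sense of scale, here is a back-of-the-envelope calculation of how many keys "all substitution ciphers" means. It assumes a plain 26-letter alphabet where each symbol maps to exactly one letter, and a purely hypothetical test rate of one billion keys per second; more than 26 symbols would only make the number larger.

```python
import math

# Every substitution key is one ordering of the 26 letters.
keys = math.factorial(26)

rate = 10**9                          # hypothetical: keys tested per second
seconds_per_year = 60 * 60 * 24 * 365
years = keys / (rate * seconds_per_year)

print(f"{keys:.3e} possible keys")
print(f"about {years:.1e} years at {rate:.0e} keys per second")
```

So trying every key blindly is probably out of reach; a search guided by the frequency counts suggested earlier in the thread would have to prune most of that space.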
I hope that he is not changing the key.
I have come across more than 26 symbols, which could mean that the numbers are encrypted as well. This makes analysis based on the letter distribution of the English language difficult.
However, I am fairly sure that the “divide” symbol (the one with a horizontal line with a dot above and below the line) stands for the letter “i”, because it seems to have been used a lot as a single word. It could also be an “a”, but a person is more likely to write the letter “i” than “a” in a diary.
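One way to check a claim like this against a transcription is to split the symbol stream into words and tally the symbols that stand alone. The sketch below assumes the comma-like symbol is a word separator (a guess made earlier in the thread) and uses a short fragment of one commenter's transcription of image 1; `split_words` is a made-up helper name, and a fragment this small is not evidence for or against any letter.

```python
from collections import Counter

def split_words(tokens, separator="comma"):
    """Group symbols into words, treating `separator` as a space (an assumption)."""
    words, current = [], []
    for tok in tokens:
        if tok == separator:
            if current:
                words.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        words.append(current)
    return words

# Fragment of the thread's transcription of image 1 (left, upper).
line = ("capital-l divided-by capital-i comma equivalent comma lower-j comma "
        "capital-f slash-slash-backslash capital-i")

words = split_words(line.split())
singles = Counter(w[0] for w in words if len(w) == 1)

print(singles)  # symbols that appear as one-symbol words, with counts
```

Run over a full transcription, the most frequent one-symbol words would be the natural candidates for “I” and “a”.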
I am at work right now so cannot do further detailed analysis. I will post a detailed breakdown of each symbol when I get home. I hope it helps the others.
Do you know the name of the soldier? It will help.
First three images:
:::: image 1, left:
s-tac-toe equals minus-dot seven ex comma slash-slash-backslash gamma capital-l lower-j ex slash
:::: image 1, right, downward
capital-t capital-i equals four parallel-lines slash-slash-backslash equals-slash comma s-tac-toe slash-slash-backslash slash-slash-backslash-backslash comma
:::: image 1, left, upper
capital-l divided-by capital-i comma equivalent comma lower-j comma capital-f slash-slash-backslash capital-i squared-capital-n zee slash-slash-backslash-backslash comma divided-by slash-slash comma slash-backslash-backslash slash-slash minus-dot slash-slash-backslash
:::: image 2
plus-dot plus leaning-heart upsidedown-t minus vertical-line ex comma
leaning-heart capital-m capital-i capital-a minus three-peaks comma vertical-line crap capital-b close-bracket plus script-j script-s lower-d comma
u-bar three-peaks capital-i
:::: image 3
capital-l backslash divided-by c-slash-slash capital-i comma capital-i equivalent divided-by slash-slash-backslash-backslash comma capital-im minus-dot three-horizontal-two-vertical ex comma equals zee capital-l
l-in-l 11-over-1 comma y-slash-slash slash-slash-backslash-backslash comma capital-i divided-by comma minus-lower-dot c-omega slash-slash slash-i 11-over-1 slash-slash square-c equals capital-l equivalent slash-slash comma
capital-l slash-slash slash comma l-on-l plus slash-backslash-backslash slash-slash-backslash-backslash comma 1-slash-1 11-over-1 capital-z comma capital-i equals ex comma j divided-by c-slash-slash slash-slash capital-l
divided-by slash-slash ex comma
Based on the frequencies and possible use in the cipher, it looks like “capital-i” could be a “t”. There are positions in image 3 where it looks like “capital-i” and “divided-by” could be “to”. Also in image 3, “capital-i”, “equals” and “ex” could be “the”.
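A guess like this is easy to test mechanically by applying a partial key and leaving everything else as a placeholder. In this sketch the mapping is just the guess above, the comma is assumed to be a word separator, `apply_guess` is a made-up helper, and the fragment joins two short excerpts from the image 3 transcription.

```python
# The tentative guesses from the comment above; nothing here is confirmed.
guess = {"capital-i": "t", "divided-by": "o", "equals": "h", "ex": "e"}

def apply_guess(tokens, mapping, separator="comma", unknown="?"):
    """Substitute guessed letters, show '?' for symbols with no guess yet."""
    out = []
    for tok in tokens:
        if tok == separator:
            out.append(" ")          # assumed word break
        else:
            out.append(mapping.get(tok, unknown))
    return "".join(out).strip()

# Two excerpts from the image 3 transcription, joined for illustration.
fragment = "capital-i divided-by comma capital-i equals ex comma"

print(apply_guess(fragment.split(), guess))  # prints "to the"
```

Running the same function over the full transcription shows how much of the text stays as “?” and whether the guessed letters land in plausible positions.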
First four. I’m going to have to quit soon! ….
:::: image 4 (repeats image 2)
plus-dot plus leaning-heart upsidedown-t minus vertical-line ex comma
leaning-heart capital-m capital-i capital-a minus three-peaks comma vertical-line crap capital-b close-bracket plus script-j script-s lower-d comma
u-bar three-peaks capital-i
Given below are some observations from screenshots 6 and 7. The number denotes how many times the symbol was used, and the letter after that is what I think that symbol denotes.
Divide Sign = 22 = i
// = 12
X = 12 = e
# = 8
capital i = 8 = s
/ = 7
3 horizontal lines = 6
L = 5
Partial # sign = 5
Z = 3 = r
Lambda sign looking like X = 3 = L
Please let me know any developments. Thanks!
Take my words with a grain of salt.
I would take this to an expert for handwriting analysis. You have access to the actual handwriting, and an expert can tell you things such as “Is he drawing the characters or writing them?” Since it is written in pen, they can look at it with a magnifying glass to see whether there are pauses in the strokes that would indicate this. I don’t see a transcript of the text itself, but as said above, perhaps this is just gibberish and he intentionally wanted his thoughts and feelings kept private? How was this diary discovered? Was it passed down from generation to generation, or was it found discarded in the wilderness somewhere? There are characters or symbols that do not repeat, and for the more complicated ones, if he was using a key, he would most certainly have had to ‘draw’ them rather than write them. The more frequently the complicated symbols are used, the more he would undoubtedly begin to write them from memory rather than draw each one, so I would compare a symbol used at the beginning of the text with the same symbol used later in the text and see if there are any differences, such as his handwriting getting sloppier as a result of writing the characters from memory rather than drawing them from a key. But like I said, there are people who dedicate their entire careers to analyzing handwriting, and this method of tackling the mystery will, I’m sure, at least give you a better understanding of whether it is just gibberish or a real cipher.
Just some thoughts that ran through my head as I was looking at this.
This is a guess on my side:
Since a lot of the symbols resemble mathematical symbols, and without any information about the person who wrote this, I think the writer might have been a well-educated person, probably aware of statistical analysis and coding the most common letters with two (or even more) different symbols.
I also think it’s possible that the symbols do not stand for the English alphabet, depending heavily on the writer’s situation or rather on his ancestry. He might have been able to speak (and write) the language of the country his family came from.
Other languages have ‘additional letters’, like umlauts (ö, ä, ü, and ß, which is not an umlaut but a way to denote a double s) in German, or letters (mostly vowels) with diacritics in the Romance languages (like French, Spanish and Portuguese) or Turkish. French actually has five ways to write the letter ‘e’, four of them with diacritics (é, è, ê, ë).
Lastly, I’m not even sure we can assume he used a language based on the Latin alphabet here. It could be the Cyrillic alphabet. However, I think we can rule out most of the Asian writing systems. Simply not enough symbols 😉