## Arithmetic Operations on Random Fields

Recovering steganograms after lossy operations.

Whether a steganogram can be recovered from a modified image depends on the operations executed during the modification process. It is advisable to add a header to simplify identification.

#### Binary Operations

A bit is encoded in a random field by its position, radius and predefined probability. These parameters can vary during the encoding as long as the mapping is disjoint. In the following image the top left dot is the original bit, followed by AND, OR and XOR operations with fields of changing probabilities:

Combining two random fields with the binary operations AND or OR creates a result in which the original bit (0/1) is still recoverable. If the encoding parameters are known, a complementary random field can be generated and the encoded bit can be erased with an XOR operation (last image). To complicate the erasure, the probability for the 1 bit can vary during the encoding. Any recoverable pseudo-random function can be chosen, but a noise map, such as a Perlin map, creates unobtrusive variations:
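The effect can be sketched with a one-dimensional toy model. The field size, probabilities and thresholds below are illustrative assumptions, not the values used in the images above:

```python
import random

random.seed(7)
FIELD_SIZE = 1000  # samples per field; illustrative only

def encode_bit(bit, p_one=0.9, p_zero=0.1):
    """Encode one bit as a Bernoulli random field biased towards 1 or 0."""
    p = p_one if bit == 1 else p_zero
    return [1 if random.random() < p else 0 for _ in range(FIELD_SIZE)]

def field_mean(field):
    return sum(field) / len(field)

noise = [random.randint(0, 1) for _ in range(FIELD_SIZE)]  # unbiased, p = 0.5

one_field, zero_field = encode_bit(1), encode_bit(0)

# AND and OR shift both means, but the encoded bias stays distinguishable:
and_one = [a & b for a, b in zip(one_field, noise)]
and_zero = [a & b for a, b in zip(zero_field, noise)]

# XOR with an unbiased field erases the bias (the mean moves towards 0.5):
erased = [a ^ b for a, b in zip(one_field, noise)]
```

After an AND the means drop for both bit values, but their ordering survives, which is what makes the bit recoverable; after the XOR the field is statistically indistinguishable from plain noise.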

#### Complex Operations

Complex image operations like scale, rotate or blur are destructive as soon as the encoded bits are removed or coalesced:

By training a Bayesian network on the header data the remaining bits can be recovered.

#### Demo

The following video contains a steganogram with the technique described above and the previous blog post:

## KungleASAP: The News Trend Web Frontend Sources are now available via GitHub

I uploaded the first part of my old News Trend System to GitHub: https://github.com/yousry/KungleAsap. This package contains the web frontend and a trend detection algorithm based on Markov chains. The project was stopped because text mining became increasingly difficult/illegal under local and international laws. The project contains Scala sources and the necessary glue-code for Lift.

The project can be built with the provided Project Object Model (POM) file for Maven.

## Steganography: Hidden Behind Insignificance

This article is a follow-up to my post about anomaly detection. It explores a (hopefully) new possibility to hide data in arbitrary digital media. Because this technique can be used in real-time environments (a GPU version is available), a possible application area could be the creation of a new form of hidden service. This service could be set up on top of any available infrastructure, for example social networks or image hosts which allow the public distribution of media files or live media streams.

#### Better than Cryptography: Cryptography & Steganography

One characteristic of modern cryptography is that its output is indistinguishable from random noise; in other words, it is entropic in the sense of Shannon entropy. If this noise can be mixed with, and later separated from, the background noise of multimedia data like audio or video, a perfect hideout for sensitive data is found.

A well-known representative of steganography applications is jsteg. Jsteg encodes a steganogram by replacing the LSB (least significant bit) of the image data with a bit of the hidden message. It has already been shown that LSB-encoded data can be unmasked by visual attacks and statistical tests like the χ² method. It may be assumed that this detection process can already be automated by neural networks.

The following algorithm can be classified as Exploiting Modification Direction (EMD) steganography (with single-bit modifications) but has additional probabilistic properties that allow non-destructive re-encoding with lossy compression algorithms (for example a reduction from 16 to 8 bit color depth).

I will describe this technique in the following paragraphs. The work on this algorithm is not finished because necessary utility functions like DCT filtering for JPEG compression are not yet implemented.

The key points are:

• The encrypted steganogram is scattered by a low discrepancy sequence (evenly distributed pseudo random numbers) with indefinite length.
• Message bits are encoded by random fields.
• The re-encoding stability can be defined by two parameters: the domain depth and the field size.

#### Prerequisites

For reasons of simplicity I’m using a basic XOR cipher. The following text:

should be encrypted with the password:

The resulting bit-field is calculated with:
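A minimal sketch of such an XOR cipher; the plaintext `"hello world"` and the password `"key"` below are placeholders, not the original example values:

```python
def xor_bits(text: str, password: str) -> list:
    """XOR every plaintext byte with the repeating password, return a bit list."""
    key = password.encode()
    bits = []
    for i, byte in enumerate(text.encode()):
        cipher = byte ^ key[i % len(key)]
        bits.extend((cipher >> k) & 1 for k in range(7, -1, -1))  # MSB first
    return bits

def bits_to_text(bits, password: str) -> str:
    """Inverse operation: XOR is its own inverse."""
    key = password.encode()
    data = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        data.append(byte ^ key[(i // 8) % len(key)])
    return data.decode()

textBits = xor_bits("hello world", "key")  # placeholder text and password
```

Because XOR is its own inverse, applying the same password a second time recovers the plaintext.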

Instead of using a single bit for data encoding, a probabilistic space is used. The dimensionality should be matched to the destination data. In the case of images, two-dimensional areas are used and the 0 and 1 bits from textBits are replaced by probabilities:

These values depend on the domain depth, the steganogram size and the lossy-ness of the expected recompression methods. If the variance is too small, the payload could be destroyed during transportation.

#### Encoding Verification

The encoding process can be visualized with a histogram of the color values. Here is the histogram from the original (I’m sorry for the missing scales):

#### Value compression

The general form to create an evenly compressed domain of the initial dataset in ℝ is:

i = floor(i / steps) * steps

For example with steps = 2 the LSB would be removed.
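The compression step above can be sketched directly:

```python
import math

def compress(value: int, steps: int) -> int:
    """i = floor(i / steps) * steps -- evenly compress the value domain."""
    return math.floor(value / steps) * steps

# With steps = 2 every result is even, i.e. the least significant bit is cleared.
```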

Here is the compressed image:

This image shows the value changes (red + / blue -):

The histogram of the modified image looks like this:

#### Probabilistic Field distribution with Low Discrepancy Sequences

Instead of a successive list, a low discrepancy sequence (lds) with a stipulated start value is used to evenly distribute the probabilistic areas over the image. The minimal distance of any two given data points defines the size (in this case the radius) of an event. This value decreases linearly with the size of the data to encode.
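The text does not fix a particular sequence; a Halton sequence is one common low discrepancy sequence and could serve as a sketch:

```python
def halton(index: int, base: int) -> float:
    """Radical-inverse (van der Corput) value of `index` in `base`, in [0, 1)."""
    result, fraction = 0.0, 1.0
    while index > 0:
        fraction /= base
        result += fraction * (index % base)
        index //= base
    return result

def halton_points(n: int, start: int = 1):
    """n evenly distributed 2D points; a stipulated start value shifts the sequence."""
    return [(halton(i, 2), halton(i, 3)) for i in range(start, start + n)]
```

Scaling each point by the image dimensions yields the event centers; the start value acts as the shared secret for the placement.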

#### The Generation of the random field with the payload

The dataset is now updated like this:
If a data point lies outside of an event, a random bit is generated and set to 1 with the probability p = 0.5. If a data point lies inside of an event (inside an lds-defined field) and the event is 0, p = probeZero; finally, if the data point lies inside of an event and the event is 1, p = probeOne.
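The update rule can be sketched as follows; the field dimensions, radius and the values for probeOne/probeZero are illustrative assumptions:

```python
import random

random.seed(1)
PROBE_ONE, PROBE_ZERO = 0.8, 0.2  # assumed values for probeOne / probeZero

def generate_field(width, height, events, radius):
    """events: (x, y, bit) tuples placed by the lds. Returns a 2D bit-field."""
    field = []
    for y in range(height):
        row = []
        for x in range(width):
            p = 0.5  # outside every event: plain unbiased noise
            for ex, ey, bit in events:
                if (x - ex) ** 2 + (y - ey) ** 2 <= radius ** 2:
                    p = PROBE_ONE if bit == 1 else PROBE_ZERO
                    break
            row.append(1 if random.random() < p else 0)
        field.append(row)
    return field

field = generate_field(32, 32, [(8, 8, 1), (24, 24, 0)], radius=5)
```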

#### Results

The generated bit-field looks like this:

This is the resulting image with its payload:

and the slightly changed histogram:

#### Decoding

The decoding step is the inverse function of the value compression. The remaining part (for example the modulo) defines the state of a data point inside the probabilistic field: p(i) = sum(v, v in ldsBits(i)) / fieldSize.

As long as the bit-depth of the data is not reduced beyond the initial step size (for example step size = 256, i.e. 8 bit) and the image is not resized below sqrt(radius) > 1, the payload can be retrieved.
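Assuming the bit state survives as the remainder modulo the step size, a decoder for a single field could look like this:

```python
def decode_bit(values, steps=2):
    """values: the pixels inside one lds field after a compression round-trip.
    The remainder modulo `steps` carries the bit state; average and threshold."""
    remainders = [1 if v % steps else 0 for v in values]
    p = sum(remainders) / len(remainders)
    return 1 if p > 0.5 else 0
```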

#### Attacks and Forensic Results

Besides the statistical methods for identifying modifications that I already described in an earlier article, a visual verification can be performed.

For example Difference, Error and Noise analysis:

Error Analysis

Noise Analysis

The original dataset can be downloaded from here: https://virtualhorde.com/MEDIA/Steganography-Results.tgz and an early draft of the algorithm with grids instead of an lds is available here: https://gist.github.com/yousry/2aa849e26791d99e8641800751840ceb

## What the Average Private Donor should know about the latest DNC Hack

New datasets from the Democratic National Committee were published this week. A hacker called Guccifer 2.0 released an archive that contained, among other things, 125 Excel sheets and 38 PDF files. Unfortunately the data has not been anonymized, and comprehensive information about private donors is now public.

The earliest entries date back to the year 2008. The good news is that no credit card information could be found. I will give a brief summary of the content that can be found during a short search (about 10 minutes).

## Primary data

This section contains information that can be immediately obtained from the datasets.

In total, 3,219,237 unique email addresses could be identified. Most of them could be associated with:

• A name
• Gender: About 48.35% of all identities could be identified as female.
• Date and amount of a donation.

For donations above a certain contribution level, the residence and address are also included.

## Deducible information

The datasets can be further analyzed with natural language processing (an error rate of up to 20%, i.e. about 643,847 faulty records, can be expected).

## Nationality

A record is searched for a country name, id or domain name.
For example, 177 German (ger, de) residents and 2 residents from Vatican City (vs) could be identified.

## Employer

Email addresses not provided by public email services can be associated with an employer. For example: 6542 identities could be identified as civil servants.

Thereof:

• NASA: 91
• CIA: 8
• NSA: 8
• FBI: 3

## Anomalies

Are the records genuine? My experimental approach to this question is to compare word frequencies between similar sources. In this case, I used the frequency of occurrences of company and CEO names and compared them against Google Trends.

The results are in no way significant, but interestingly Apple was an almost perfect match (most unlikely to be manipulated).

## What is missing

Phone Numbers: Phone numbers could not be found by regular expressions.

## Summary

• Over 3 million unique email addresses can be easily identified from this leak.
• Many addresses can be associated with names, gender, employment and residence.

## Feedback: The alleged NSA malware developers are at risk to be identified

First of all, I would like to thank my web hoster for maintaining the availability of my home page over the last few days. The user numbers increased considerably after the last post. If my page nevertheless becomes unavailable, an archived version is accessible via: https://web.archive.org/web/*/https://www.yousry.de/

I published the last post on different news aggregators and as a result could not follow every discussion in real time. I would like to add one remark here in a central place.

### My new Project: Anomaly Detection

The trigger for this idea was an article about electoral fraud via voting machines. Unfortunately, it did not contain any further information about the detection process. As a result, I started experimenting with the usual methods like hypothesis testing, σ-distances and so on.

I consider my knowledge base in this field as average. I had the “luck” to repeat the statistics course for my intermediate diploma and acquired some additional knowledge about stochastics for my informatics diploma (statistics := about probability, stochastics := about randomness). I came up with ideas like event frequency tests and partitioned outcome distribution tests.

As next step I tried to think of methods to delay the detection of a manipulation or, described as stochastic process, hide information inside entropy:

Finally I tried to figure out, if the detection prevention itself is detectable.

#### From Experiments to Data evaluation

To test my algorithms, I started with simple n-dimensional point clouds and later switched to mass-data collections, primarily from the financial sector (mostly quarterly results from corporations and banks) and Internet traffic (mostly advertisement). Before significant deviations can be identified, it is necessary to create a model.

If significant deviations are detected in different datasets it is necessary to identify the cause and update the existing model if necessary.

#### Adaption for the real world

Two months ago it became evident that my recent projects were not going to be successful, and I searched for new products and business models. In retrospect it was obvious that the end customer market for applications and software is supersaturated. I could neither give my software away for free nor “force” the installation of my apps. Luck, as suggested by some people, as the final parameter wasn’t on my side.

At the same time I noticed that the methods for online advertisement had dramatically changed. Ad networks added advanced features for customer tagging and identification. More importantly, public organizations showed a growing interest in monitoring their citizens. That these interests have a close match should be obvious.

The idea of an intelligent proxy emerged that not only blocks passive (like an ad-blocker) and active tracking attempts but also obfuscates online activities that could be classified as noteworthy.

### Source Code

I am preparing an open source release of my recent source code if there is enough interest. I will announce the procedure in a later post.

## The alleged NSA malware developers are at risk to be identified

On August 15th a hacker group (Shadow Brokers) released an archive which was claimed to contain malware developed and distributed by the NSA. The assertion wasn’t officially confirmed.

The archive with the sha256 digest: cf840f3d9bfb72eccf950ef5f91a01124b3e15cbf6f65373a90b856388abf666 is distributed via sharehosters. Besides encrypted files, the archive contained partially accessible data. The unencrypted parts are also available on GitHub.

In this post, I deal exclusively with the Linux binaries (ELF 32-bit code). The collected data for this article is of a purely static nature. I will only show the simple steps to obtain the information. I’m using my slightly dusty NLP knowledge for the necessary queries, but most results can also be obtained with simple regular expressions, shell scripts and some help from GitHub.

## Some Background information

Legally obtained/installed software contains a signature. In case of Windows (64Bit) and macOS this signature is additionally verified and identifiable by Microsoft or Apple. Most Linux distributions use the web of trust for a verified software distribution.

Conversely, the software in this archive has to be installed without the approval of the user. It does not show up in any logs.

Its purpose is the setup of a backdoor to bypass a firewall. A backdoor allows unauthorized online access to a system to collect or upload data.

The software went unnoticed until now, which clearly shows the high quality of the product and gives first indications about the producer.

The developers are security, network and Linux experts.

## Creating a digital fingerprint

Only a sufficient number of parameters permit a clear identification.

Fingerprint libraries for web browsers (example: Valve) use more than 20 parameters for a unique identification. However, in contrast to 3.5 billion Internet users, only a few hundred experts have to be identified.

## A closer look at the software

The binaries are partially unstripped. In this case they still contain all debugging information. Libraries are statically linked. The first thing to do, is to extract all readable text like variable names, constants or method names.

find . -type f -exec strings {} \; >> allText.txt;

Remove the noise:

sed -r -i '/^.{,5}$/d' allText.txt

Sort the output and remove duplicates:

sort -u allText.txt > corpus.txt

The simple corpus is now ready to use. (If you have further questions or are interested in scripts or a short live presentation, feel free to contact me.)

## Library names

If the list of symbolic names (generated with nm) is inconclusive you can also search your text file. Method and variable names like X509V3_get_section provide clues for the used libraries.

This leads to a second indication: how widespread is the use of the libraries? In the case of libgcrypt this filter would not be sufficient, but several other libraries have only received a few GitHub stars.

## Finding names in the Code

If you don’t have access to an English name database here is the result of a name search:

Ada ,Ai ,Al ,An ,Asa ,Camellia ,Chan ,Del ,Delta ,Don ,Echo ,Ed ,Elba ,Else ,Era ,Eric ,Ha ,Hai ,Hal ,Ike ,In ,June ,Lai ,Len ,Long ,Lu ,Mac ,Major ,Manual ,Many ,Mark ,Max ,May ,Min ,Numbers ,Ok ,Oscar ,Page ,Ping ,Precious ,Rose ,See ,September ,Sha ,So ,Soon ,Su ,Ta ,Tiny ,Tu ,Ty ,Un ,Val ,Vi ,Will ,Youn

I will not release the result for combined surnames with first names. This would lead to the next fingerprint. LinkedIn will show you the professional discipline, GitHub the shared libraries and their publicity.

Additionally you can search for email addresses with a regular expression like this:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
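In Python the same pattern can be applied directly to the corpus text (the sample address below is of course made up):

```python
import re

# the regular expression from above, made case-insensitive
EMAIL_RE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)

def find_emails(text: str) -> list:
    """Return every email-like token found in the given text."""
    return EMAIL_RE.findall(text)
```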

## NLP Author identification

Up until now it was only possible to identify the developers of the used libraries. If the field of application for a library is also limited, it could indicate an indirect connection to the malware development.

To find direct clues for who the authors are, I use a variant of NLP author identification.

To identify the author of a text, the frequency of bigrams and triplets is normally analyzed, but this next corpus is further limited. It contains only the debugging code of the executables. It can be accessed with gdb, for example:

info functions
disassemble /m main

The corpus contains only words and short sentence fragments.

It is therefore necessary to adopt a modified technique.

#### Observation

The provided binaries are written in C and Assembler. The code conventions for these languages are not very strict and have changed over time. The variable names contained in the debugging code can consequently be analyzed.

#### Examples: Different styles of variable naming

Underline or capitalization for hyphenation:
aVariable vs a_Variable

All uppercase global variables:

Language slip:
All vs. Alle (unintentionally switched to German)

Abbreviations: missing consonants, word beginnings, etc.
attr vs attributes

Uppercase Abbreviations
CMS_encrypt vs cms_encrypt

Word form (verb vs. noun)
constructed vs construction

countersignature vs counterSignature

Describing a process
cr_cancels_micro_mode

Last but not least: Typos
Distrubution, occured
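A rough sketch of how some of the style categories above could be extracted automatically from identifiers (the category names and rules are my own, not a published classifier):

```python
import re

def naming_style(identifier: str) -> str:
    """Assign an identifier to one of the naming-style categories listed above."""
    if re.fullmatch(r"[A-Z][A-Z0-9_]*", identifier):
        return "upper"       # all-uppercase global / macro style
    if "_" in identifier:
        return "underscore"  # underline hyphenation (a_Variable)
    if re.search(r"[a-z][A-Z]", identifier):
        return "camel"       # capitalization hyphenation (aVariable)
    return "plain"           # short names and abbreviations (attr)
```

Counting these categories over all identifiers of a binary yields one more coordinate of the fingerprint.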

#### Data matching

GitHub offers a flexible and easy-to-use API. Unfortunately it is not possible to use regular expressions. It is therefore beneficial to limit the search with the fingerprints already collected from the libraries.

For example:

(Word/Discoveries)

occured 1544
occured + network + C 7

After the number of repositories has been narrowed down, they can be cloned and scanned with regular expressions.

An accumulation of certain expressions indicates a hit:

#([A-Z]*)_.* > #(.*)_.*
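One possible reading of this pattern is collecting `#`-prefixed expressions whose prefix before the underscore is all uppercase; a sketch in Python (the exact interpretation of the regex is my assumption):

```python
import re

# '#' token with an all-uppercase prefix before the underscore
UPPER_PREFIX = re.compile(r"#([A-Z]+)_\w*")

def upper_prefix_hits(source: str) -> list:
    """Collect the uppercase prefixes of '#XXX_...' style expressions."""
    return UPPER_PREFIX.findall(source)
```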

#### To summarize the assumptions

• The developers of the malware are leading experts in the area of Linux, Network and Security development.
• They were discovered and not trained.

Because the archive contains a collection of applications, the calculated result set is reasonably small for further investigations.

## YAGL3 (OpenGL/Vulkan) final update status

Since the new render pipeline with Vulkan support still misses crucial features like function pointers and reflections, OpenGL Version 4.1 (+ extensions) remains the default. The missing API features are still in development, but a lack of demand makes it necessary to halt any further development. Here are some final screenshots of the last build: