## What the Average Private Donor Should Know About the Latest DNC Hack

New datasets from the Democratic National Committee were published this week. A hacker called Guccifer 2.0 released an archive that contained, among other things, 125 Excel sheets and 38 PDF files. Unfortunately, the data has not been anonymized, and comprehensive information about private donors is now public.

The earliest entries date back to 2008. The good news is that no credit card information could be found. Below is a brief summary of the content that can be found in a short search (about 10 minutes).

## Primary data

This section contains information that can be obtained immediately from the datasets.

In total, 3,219,237 unique email addresses could be identified. Most of them could be associated with:

• A name
• A gender: about 48.35% of all identities could be identified as female.
• The date and amount of a donation.

For donations above a certain contribution level, the residence and address are also included.

## Deducible information

The datasets can be further analyzed with natural language processing. (An error rate of up to 20%, i.e. roughly 643,847 faulty records, can be expected.)

## Nationality

A record is searched for a country name, ID or domain name.
For example, 177 German (ger, de) residents and 2 residents of Vatican City (vs) could be identified.
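
The lookup described above can be sketched in a few lines. The country table and record format here are illustrative assumptions, not the data or code used for the article:

```python
# Sketch: classify donor records by nationality hints (country name,
# ISO code, or email TLD). The mapping is a tiny made-up subset.
import re

COUNTRY_HINTS = {
    "de": "Germany", "ger": "Germany", "germany": "Germany",
    "va": "Vatican City", "vatican": "Vatican City",
}

def guess_country(record: str):
    # Check the email TLD first, then free-text tokens.
    m = re.search(r"@[\w.-]+\.([a-z]{2,3})\b", record.lower())
    if m and m.group(1) in COUNTRY_HINTS:
        return COUNTRY_HINTS[m.group(1)]
    for token in re.findall(r"[a-z]+", record.lower()):
        if token in COUNTRY_HINTS:
            return COUNTRY_HINTS[token]
    return None

print(guess_country("Max Muster <max@example.de>, Berlin, Germany"))  # Germany
```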

## Employer

Email addresses not provided by public email services can be associated with an employer. For example, 6,542 identities could be identified as civil servants.

Among them:

• NASA: 91
• CIA: 8
• NSA: 8
• FBI: 3
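
A minimal sketch of this kind of domain lookup; the provider and agency lists below are assumed examples, not the lists used for the article:

```python
# Sketch: infer an employer from a non-public email domain.
PUBLIC_PROVIDERS = {"gmail.com", "yahoo.com", "hotmail.com", "aol.com"}
AGENCY_DOMAINS = {"nasa.gov": "NASA", "cia.gov": "CIA",
                  "nsa.gov": "NSA", "fbi.gov": "FBI"}

def employer_from_email(address: str):
    domain = address.rsplit("@", 1)[-1].lower()
    if domain in PUBLIC_PROVIDERS:
        return None  # public provider, no employer information
    # Fall back to the raw domain for unknown corporate addresses.
    return AGENCY_DOMAINS.get(domain, domain)

print(employer_from_email("jane.doe@nasa.gov"))  # NASA
print(employer_from_email("jdoe@gmail.com"))     # None
```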

## Anomalies

Are the records genuine? My experimental approach to this question is to compare word frequencies between similar sources. In this case, I used the frequency of occurrences of company and CEO names and compared them against Google Trends.

The results are in no way significant, but interestingly, Apple was an almost perfect match (and thus most unlikely to be manipulated).
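
Such a frequency comparison can be approximated with a rank correlation. The sketch below uses made-up counts and a plain Spearman coefficient, not the exact method of the post:

```python
# Sketch: compare name frequencies from a leak with an external
# popularity ranking via Spearman rank correlation (pure stdlib,
# assumes no tied counts).
def rank(values):
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for r, i in enumerate(order):
        ranks[i] = r + 1
    return ranks

def spearman(xs, ys):
    rx, ry = rank(xs), rank(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

leak_counts  = [520, 310, 120, 90, 40]  # invented occurrence counts
trend_scores = [100, 70, 30, 25, 10]    # invented external scores
print(spearman(leak_counts, trend_scores))  # 1.0 for identical ordering
```

A value near 1 means the leak's word frequencies order the names the same way as the external source, which is what an unmanipulated record set should show.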

## What is missing

Phone numbers: no phone numbers could be found with regular expressions.

## Summary

• Over 3 million unique email addresses can be easily identified from this leak.
• Many addresses can be associated with names, gender, employment and residence.

## Feedback: The alleged NSA malware developers are at risk of being identified

First of all, I would like to thank my web hoster for maintaining the availability of my home page over the last few days. User numbers increased considerably after the last post. Should my page nevertheless become unavailable, an archived version is accessible via: https://web.archive.org/web/*/https://www.yousry.de/

I published the last post on different news aggregators and, as a result, could not follow every discussion in real time. I would therefore like to add a few remarks here in one central place.

### My new Project: Anomaly Detection

The trigger for this idea was an article about electoral fraud via voting machines. Unfortunately, it did not contain any further information about the detection process. As a result, I started experimenting with the usual methods like hypothesis testing, σ-distances and so on.

I consider my knowledge in this field average. I had the “luck” of having to repeat the statistics course for my intermediate diploma and acquired some additional knowledge about stochastics for my informatics diploma (statistics := about probability, stochastics := about randomness). I came up with ideas like event-frequency and partitioned outcome-distribution tests.

As a next step, I tried to think of methods to delay the detection of a manipulation or, described as a stochastic process, to hide information inside entropy.

Finally, I tried to figure out whether the detection prevention itself is detectable.

#### From Experiments to Data evaluation

To test my algorithms, I started with simple n-dimensional point clouds and later switched to mass-data collections, primarily from the financial sector (mostly quarterly results of corporations and banks) and Internet traffic (mostly advertisements). Before significant deviations can be identified, it is necessary to create a model.

If significant deviations are detected in different datasets, it is necessary to identify the cause and, if necessary, update the existing model.

#### Adaptation to the real world

Two months ago it became evident that my recent projects were not going to be successful, and I searched for new products and business models. In retrospect, it was obvious that the end-customer market for applications and software is supersaturated. I could neither give my software away for free nor “force” the installation of my apps. Luck, suggested by some people as the final parameter, wasn’t on my side.

At the same time, I noticed that the methods for online advertisement had changed dramatically. Ad networks added advanced features for customer tagging and identification. More importantly, public organizations showed a growing interest in monitoring their citizens. That these interests closely match should be obvious.

The idea of an intelligent proxy emerged that not only blocks passive (like an ad blocker) and active tracking attempts but also obfuscates online activities that could be classified as noteworthy.

### Source Code

I am preparing an open-source release of my recent source code, provided there is enough interest. I will announce the procedure in a later post.

## The alleged NSA malware developers are at risk of being identified

On August 15th, a hacker group (the Shadow Brokers) released an archive claimed to contain malware developed and distributed by the NSA. The assertion wasn’t officially confirmed.

The archive with the SHA-256 digest cf840f3d9bfb72eccf950ef5f91a01124b3e15cbf6f65373a90b856388abf666 is distributed via share hosters. Besides encrypted files, the archive contained partially accessible data. The unencrypted parts are also available on GitHub.

In this post, I deal exclusively with the Linux binaries (ELF 32-bit code). The data collected for this article is of a static nature. I will only show the simple steps needed to obtain the information. I’m using my slightly dusty NLP knowledge for the necessary queries, but most results can also be obtained with simple regular expressions, shell scripts and some help from GitHub.

## Some Background information

Legally obtained and installed software contains a signature. In the case of Windows (64-bit) and macOS, this signature is additionally verified by, and identifiable to, Microsoft or Apple. Most Linux distributions use a web of trust for verified software distribution.

Conversely, the software in this archive has to be installed without the approval of the user. It does not show up in any logs.

Its purpose is to set up a backdoor that bypasses a firewall. A backdoor allows unauthorized online access to a system to collect or upload data.

The software went unnoticed until now, which clearly shows the high quality of the product and gives a first indication about the producer.

The developers are security, network and Linux experts.

## Creating a digital fingerprint

Only a sufficient number of parameters permits a clear identification.

Fingerprint libraries for web browsers (example: Valve) use more than 20 parameters for a unique identification. However, in contrast to 3.5 billion Internet users, only a few hundred experts have to be identified here.

## A closer look at the software

The binaries are partially unstripped; in this case they still contain all debugging information. Libraries are statically linked. The first thing to do is to extract all readable text, like variable names, constants or method names:

```shell
find . -type f -exec strings {} \; > allText.txt
```

Remove the noise:

```shell
sed -r -i '/^.{,5}$/d' allText.txt
```

Sort the output and remove duplicates:

```shell
sort -u allText.txt > corpus.txt
```

The simple corpus is now ready to use. (If you have further questions or are interested in scripts or a short live presentation, feel free to contact me.)

## Library names

If the list of symbolic names (generated with nm) is inconclusive, you can also search your text file. Method and variable names like X509V3_get_section provide clues about the libraries used.

This leads to a second indication: how widespread is the use of these libraries? In the case of libgcrypt this filter would not be sufficient, but several other libraries have only received a few GitHub stars.
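
Such a symbol scan amounts to a prefix lookup over the corpus; the prefix table in this sketch is a small assumed subset, not a complete mapping:

```python
# Sketch: match extracted symbol names against known library prefixes.
LIB_PREFIXES = {
    "X509": "OpenSSL", "EVP_": "OpenSSL", "SSL_": "OpenSSL",
    "gcry_": "libgcrypt", "gnutls_": "GnuTLS",
}

def libraries_in(corpus_lines):
    hits = set()
    for line in corpus_lines:
        for prefix, lib in LIB_PREFIXES.items():
            if prefix in line:
                hits.add(lib)
    return hits

print(libraries_in(["X509V3_get_section", "gcry_cipher_open"]))
```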

## Finding names in the Code

If you don’t have access to an English name database, here is the result of a name search:

Ada ,Ai ,Al ,An ,Asa ,Camellia ,Chan ,Del ,Delta ,Don ,Echo ,Ed ,Elba ,Else ,Era ,Eric ,Ha ,Hai ,Hal ,Ike ,In ,June ,Lai ,Len ,Long ,Lu ,Mac ,Major ,Manual ,Many ,Mark ,Max ,May ,Min ,Numbers ,Ok ,Oscar ,Page ,Ping ,Precious ,Rose ,See ,September ,Sha ,So ,Soon ,Su ,Ta ,Tiny ,Tu ,Ty ,Un ,Val ,Vi ,Will ,Youn

Able ,Ack ,Agent ,Alert ,Alias ,All ,Alt ,Andersen ,Anon ,App ,Appl ,Ar ,Arch ,Area ,Arena ,Arp ,Ash ,Back ,Bad ,Bak ,Base ,Bash ,Be ,Been ,Begin ,Below ,Best ,Big ,Bigger ,Bio ,Blank ,Block ,Body ,Bogus ,Boot ,Both ,Bottom ,Bound ,Bounds ,Box ,Brace ,Bracket ,Brand ,But ,Call ,Can ,Cancel ,Carrier ,Case ,Catching ,Certain ,Chain ,Channel ,Char ,Chars ,Check ,Child ,Choice ,Chown ,Cid ,Cisco ,Class ,Clause ,Clear ,Client ,Clock ,Close ,Code ,Colon ,Comment ,Common ,Console ,Constant ,Cool ,Core ,Corp ,Counter ,Counts ,Cross ,Cui ,Current ,Dam ,Dates ,Days ,Dec ,December ,Delay ,Dest ,Distance ,Dk ,Do ,Doing ,Done ,Door ,Double ,Down ,Driver ,Due ,Dul ,During ,Ear ,Early ,Ede ,Ehl ,Elem ,End ,Ends ,Enter ,Even ,Every ,Exe ,Fail ,Fake ,Fast ,Fetch ,Few ,Field ,Fields ,File ,Files ,Fill ,Filter ,Fini ,First ,Fix ,Flash ,Font ,Force ,Forget ,Form ,Forward ,Found ,Frame ,Free ,Freed ,From ,General ,Getting ,Given ,Go ,Goes ,Going ,Good ,Grand ,Guard ,Guess ,Halt ,Handler ,Hard ,Has ,Hash ,Haven ,Head ,Headings ,Hellman ,Helper ,Hidden ,High ,Hint ,Holes ,Home ,Hook ,Hooks ,Host ,Hu ,Huge ,Human ,Ing ,Ip ,Job ,Jobs ,July ,Jump ,Just ,Kea ,Keep ,Key ,Keys ,Kill ,Kind ,Ko ,La ,Lab ,Lam ,Landing ,Large ,Larger ,Last ,Left ,Legacy ,Less ,Letters ,Level ,Levels ,Like ,Line ,Lines ,Link ,Linker ,Links ,List ,Listen ,Little ,Lock ,Locks ,Longest ,Loop ,Low ,Magic ,Mail ,Main ,Manner ,Mar ,Marker ,Marking ,Mask ,Master ,Matter ,Media ,Memory ,Merchant ,Mini ,Minor ,Mode ,Model ,Mole ,Montgomery ,More ,Moss ,Most ,Mount ,Mounts ,Much ,Must ,Nee ,Needs ,Neither ,Net ,Never ,New ,Nist ,No ,Nop ,Notice ,November ,Null ,Number ,Oakley ,Off ,Old ,Older ,On ,Or ,Ord ,Other ,Over ,Pack ,Pages ,Pair ,Parent ,Part ,Pascal ,Pass ,Pax ,Payment ,Peer ,People ,Pi ,Pipe ,Pipes ,Place ,Plain ,Plan ,Point ,Pointer ,Points ,Pool ,Pop ,Port ,Ports ,Post ,Power ,Press ,Prime ,Primes ,Proto ,Public ,Push ,Py ,Quiet ,Ram ,Rand ,Range ,Raw ,Re ,Read ,Reader ,Reading ,Real ,Reason ,Reasons ,Record 
,Records ,Red ,Register ,Right ,Ring ,Rm ,Ro ,Rom ,Room ,Root ,Round ,Rounds ,Route ,Row ,Rule ,Running ,Sa ,Safe ,Safi ,Salt ,Save ,Screen ,Search ,Second ,Section ,Seed ,Seeds ,Seek ,Seen ,Self ,Sep ,Server ,Service ,Session ,Sessions ,Setting ,Share ,Shell ,Shells ,Short ,Show ,Side ,Simonsen ,Simple ,Sing ,Single ,Six ,Slot ,Small ,Smaller ,Space ,Spare ,Sport ,Square ,Stabs ,Stack ,Stage ,Stager ,Stamp ,Standard ,Start ,State ,Storage ,Store ,Stream ,Streams ,Strength ,String ,Strong ,Style ,Such ,Sudo ,Sum ,Tables ,Tag ,Tags ,Takes ,Te ,Temp ,Test ,Than ,Them ,Then ,Times ,To ,Too ,Tools ,Top ,Touch ,Trace ,Trad ,Treat ,True ,Try ,Unavailable ,Us ,Vars ,Ve ,Via ,View ,Wait ,Wall ,Want ,Warn ,Warning ,Well ,Whack ,While ,Wide ,Wild ,Win ,Wince ,Winning ,Wins ,Wish ,Word ,Work ,Xu ,You ,Zone
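
A name search of this kind is essentially an intersection between the corpus words and a first-name database; the database in this sketch is a tiny illustrative subset:

```python
# Sketch: intersect corpus words with a first-name database.
NAME_DB = {"Ada", "Eric", "Mark", "Oscar", "Rose"}  # assumed subset

def find_names(corpus_words):
    return sorted(NAME_DB.intersection(corpus_words))

corpus = ["Ada", "Pointer", "Eric", "Stack", "Session"]
print(find_names(corpus))  # ['Ada', 'Eric']
```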

I will not release the results for surnames combined with first names. This would lead to the next fingerprint: LinkedIn will show you the professional discipline, GitHub the shared libraries and their popularity.

Additionally, you can search for email addresses with a regular expression like this:

```
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
```
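
The same search in Python; re.IGNORECASE lets the all-caps character classes match lowercase addresses too:

```python
# Find email addresses in extracted text with the regex above.
import re

EMAIL_RE = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b",
                      re.IGNORECASE)

text = "contact dev1@example.org or admin@test.example.com"
print(EMAIL_RE.findall(text))
```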

## NLP Author identification

Up until now, it was only possible to identify the developers of the libraries used. If the field of application for a library is limited, this could indicate an indirect connection to the malware development.

To find direct clues for who the authors are, I use a variant of NLP author identification.

To identify the author of a text, the frequency of bigrams and trigrams is normally analyzed, but this next corpus is further limited: it contains only the debugging code of the executables. It can be accessed with gdb, for example:

```
info functions
disassemble /m main
```

The corpus contains only words and short sentence fragments. It is therefore necessary to use an adapted technique.

#### Observation

The provided binaries are written in C and assembler. The code conventions for these languages are not very strict and have changed over time. Consequently, the variable names contained in the debugging code can be analyzed.

#### Examples: Different styles of variable naming

Underscore or capitalization for hyphenation:
aVariable vs a_Variable

All uppercase global variables:

Language slips:
All vs. Alle (unintentionally switched to German)

Abbreviations: missing consonants, word beginnings, etc.
attr vs attributes

Uppercase abbreviations:
CMS_encrypt vs cms_encrypt

Changes over time:
constructed vs construction

countersignature vs counterSignature

Describing a process
cr_cancels_micro_mode

Last but not least, typos:
Distrubution, occured
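
These style differences can be turned into simple features; the category labels in this sketch are my own, not an established taxonomy:

```python
# Sketch: derive a naming-style feature from an identifier found in
# the debugging information.
import re

def style_of(name: str) -> str:
    if name.isupper():
        return "ALL_CAPS"
    if "_" in name:
        return "snake_case"
    if re.search(r"[a-z][A-Z]", name):
        return "camelCase"
    return "plain"

for ident in ["a_Variable", "aVariable", "CMS_ENCRYPT", "attr"]:
    print(ident, style_of(ident))
```

Counting these categories per binary yields a small style profile that can be compared against public repositories.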

#### Data matching

GitHub offers a flexible and easy-to-use API. Unfortunately, it is not possible to use regular expressions with it. It is therefore beneficial to limit the search with the fingerprints already collected from the libraries.

For example:

(search term → repositories found)

• occured → 1544
• occured + network + C → 7

After the number of repositories has been narrowed down, they can be cloned and scanned with regular expressions.

An accumulation of certain expressions indicates a hit:

```
#([A-Z]*)_.* > #(.*)_.*
```
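
A local scan over cloned repositories might look like this sketch; the pattern is an assumed example of such a characteristic style marker, not the one from the post:

```python
# Sketch: count how often a characteristic expression occurs in the
# source lines of a cloned repository; an accumulation suggests a hit.
import re

PATTERN = re.compile(r"#define\s+([A-Z]+)_")  # assumed style marker

def count_hits(lines):
    return sum(1 for line in lines if PATTERN.search(line))

sample = ["#define MAX_LEN 64", "#define min_len 4", "int x;"]
print(count_hits(sample))  # 1
```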

#### To summarize the assumptions

• The developers of the malware are leading experts in the area of Linux, Network and Security development.
• They were discovered, not trained.

Because the archive contains a collection of applications, the calculated result set is reasonably small for further investigation.

## YAGL3 (OpenGL/Vulkan) final update status

Since the new render pipeline with Vulkan support still lacks crucial features like function pointers and reflections, OpenGL version 4.1 (+ extensions) remains the default. The missing API features are still in development, but a lack of demand makes it necessary to halt any further development. Here are some final screenshots of the last build:

## Technique: Time-lapse videos segmented by circular sectors (and a small question)

Instead of playing a time-lapse video frame by frame, it is also possible to show all frames at once. To accomplish this, the screen is partitioned into segments, and each frame covers only a part of it. The screen partitions work like cut masks: their position always corresponds to the same position inside the frame.

To reintroduce an animation effect, each frame segment can be moved. As a result, the segmented video contains the same information as the original time-lapse recording; it is only presented differently.

This technique is useful to show small changes over time. Rapid changes, as they can occur during camera panning, produce image errors.

It is interesting to observe how the ambient light is influenced by the passing clouds, while small movements (the swaying plants) still behave like in the original video.

## Calculation of the frame segments

The noteworthy part of the algorithm is the frame selection for the currently processed segment:

• I prefer normalized vector coordinates for my work. All screen coordinates are mapped into the range ([0..1]/[0..1]).
• The circle center c and an up-vector are chosen. In this case, I selected the bottom center position for the circle and the negative x-axis as up-vector.
• For each pixel, the normalized direction vector p (from c to the pixel) is calculated.
• The angle between p and the up-vector is calculated. The result lies between 0 and π.

In short:

$$\frac{\operatorname{acos}\left((\vec{p} - \vec{c}) \cdot \vec{\text{up}}\right)}{\pi} \cdot |\text{frames}|$$

calculates the frame for the active segment.
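
A direct implementation of the steps above, with normalization and clamping added for numerical safety (coordinates and vectors as described, the function name is my own):

```python
# Sketch: map a pixel to a frame index via the angle between (p - c)
# and the up-vector.
import math

def frame_for_pixel(p, c, up, n_frames):
    dx, dy = p[0] - c[0], p[1] - c[1]
    length = math.hypot(dx, dy) or 1.0
    # Normalize before the dot product so acos stays in range.
    cos_a = (dx * up[0] + dy * up[1]) / length
    cos_a = max(-1.0, min(1.0, cos_a))
    angle = math.acos(cos_a)  # 0..pi
    return min(int(angle / math.pi * n_frames), n_frames - 1)

# Circle center at bottom center, up-vector along the negative x-axis.
c, up = (0.5, 0.0), (-1.0, 0.0)
print(frame_for_pixel((0.0, 0.0), c, up, 5000))  # leftmost pixel -> frame 0
```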

## Reference

I recently found a video using this technique on a social media website (a harbor during a tide). Unfortunately, I was not able to find this video again to acknowledge the creator.

## Complexity

While viewing the reference video, I realized that the creation could be tricky because of the huge quantity of data that has to be accessed. My original time-lapse video from above consists of 5000 single full-HD images. Loading the corresponding image for each pixel of the screen would create a data transfer of 20.85 TB (1920 × 1080 × 3 × 4 × 1280 × 720 bytes) in the worst case. The creation of an animation would be unreasonable on current desktop computers. Therefore it was necessary to create a smart caching strategy.
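
The worst-case figure can be reproduced directly from the factors in the parentheses:

```python
# Reproducing the worst-case estimate: one full-HD RGB frame loaded
# per output pixel of a 1280x720 render target.
frame_bytes = 1920 * 1080 * 3            # one 24-bit full-HD frame
output_pixels = 1280 * 720
total = frame_bytes * output_pixels * 4  # the factor 4 from the post
print(total / 2**40)  # ~20.85 (TiB)
```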

I estimated that with the naive approach (loading the calculated frame for each pixel) about two hours would be necessary to render a single image. Reversing the process (loading each image of the original animation and drawing only the pixels inside the corresponding segment) reduced the render time to 17 minutes. As a final optimization, a fixed-size (12 GB) cache was introduced. One half was preallocated by a coarse-grained frame selection; the remaining part was handled by a concurrent thread. Fragment pixels whose frames were missing from the preallocated cache were sorted by their frame number and deferred. After a specified time limit, the remaining pixels were assigned to the nearest available (cached) frame. With some other minor modifications (stretching the time segments by a factor of 2), the render time was reduced to a split second per image.

## My request for one minute of your attention

Over the last months, my already low sales on Apple's App Store dropped steadily to zero. In times of omnipresent ad blockers and click fraud, I nevertheless tried my luck with advertisement and, as a result, have never burned my money so senselessly. Under the current conditions, it has become pointless to continue application development.

So I sincerely ask you to consider the purchase of one of my graphics applications: