Feedback: The alleged NSA malware developers are at risk to be identified

First of all, I would like to thank my web host for keeping my home page available over the last few days. Visitor numbers increased considerably after the last post. Should my page nevertheless become unavailable, an archived version is accessible via: https://web.archive.org/web/*/https://www.yousry.de/

I published the last post on several news aggregators and, as a result, could not follow every discussion in real time. I would therefore like to add a remark in one central place.

My new Project: Anomaly Detection

I mentioned my new project and was asked for additional information:

The trigger for this idea was an article about electoral fraud via voting machines. Unfortunately, it did not contain any further information about the detection process. As a result, I started experimenting with the usual methods such as hypothesis testing, σ-distances and so on.
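As an illustration of the σ-distance idea, here is a minimal Python sketch. It is not the code used for the project; the function names and the threshold of three standard deviations are assumptions chosen for the example.

```python
from statistics import mean, stdev


def sigma_distances(values):
    """Return the distance of each value from the sample mean, in standard deviations."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation, needs at least two values
    if sigma == 0:
        return [0.0 for _ in values]
    return [(value - mu) / sigma for value in values]


def flag_outliers(values, threshold=3.0):
    """Return the indices whose sigma-distance exceeds the (assumed) threshold."""
    return [
        index
        for index, distance in enumerate(sigma_distances(values))
        if abs(distance) > threshold
    ]
```

Note that with very few samples no value can exceed a threshold of 3σ, so a test like this only becomes meaningful with a reasonably large data set.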

I consider my knowledge in this field to be average. I had the “luck” of repeating the statistics course for my intermediate diploma and acquired some additional knowledge of stochastics for my informatics diploma (Statistics := about probability, Stochastic := about randomness). I came up with ideas such as event-frequency tests and partitioned outcome-distribution tests.

As a next step, I tried to think of methods to delay the detection of a manipulation or, described as a stochastic process, to hide information inside entropy:

[Video: random generators]

Finally, I tried to figure out whether the detection prevention is itself detectable.

From Experiments to Data evaluation

To test my algorithms, I started with simple n-dimensional point clouds and later switched to mass-data collections, primarily from the financial sector (mostly quarterly results from corporations and banks) and Internet traffic (mostly advertisement). Before significant deviations can be identified, it is necessary to create a model.
Here is an imaginary example for advertisement costs:

Advertisement turnover example.

If significant deviations are detected in different datasets, the cause has to be identified and, where appropriate, the existing model updated.

Adaption for the real world

Two months ago it became evident that my recent projects were not going to be successful, and I searched for new products and business models. In retrospect, it was obvious that the end-customer market for applications and software is oversaturated. I could neither give my software away for free nor “force” the installation of my apps. Luck, which some people suggested as the final parameter, wasn’t on my side.

At the same time I noticed that the methods for online advertisement had dramatically changed. Ad networks added advanced features for customer tagging and identification. More importantly, public organizations showed a growing interest in monitoring their citizens. That these interests have a close match should be obvious.

This led to the idea of an intelligent proxy that not only blocks passive tracking attempts (like an ad blocker) and active ones, but also obfuscates online activities that could be classified as noteworthy.

Source Code

I will prepare an open-source release of my recent source code if there is enough interest. I will announce the procedure in a later post.

The alleged NSA malware developers are at risk to be identified

Step 1 of 6

On August 15th, a hacker group (Shadow Brokers) released an archive that was claimed to contain malware developed and distributed by the NSA. The assertion has not been officially confirmed.

The archive with the sha256 digest: cf840f3d9bfb72eccf950ef5f91a01124b3e15cbf6f65373a90b856388abf666 is distributed via sharehosters. Besides encrypted files, the archive contained partially accessible data. The unencrypted parts are also available on GitHub.
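Before working with a downloaded copy, it is worth checking it against the digest above. A minimal Python sketch follows; the function names are only chosen for this example, and the file path is whatever name your download has.

```python
import hashlib

# Digest quoted in the text above.
EXPECTED_SHA256 = "cf840f3d9bfb72eccf950ef5f91a01124b3e15cbf6f65373a90b856388abf666"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large archives do not have to fit into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` has the expected SHA-256 digest."""
    return sha256_of_file(path) == expected.lower()
```

A mismatch only tells you that the file differs from the one described here; it says nothing about which of the two copies is the original.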

In this post, I deal exclusively with the Linux binaries (ELF 32-bit code). The data collected for this article are purely static in nature. I will only show the simple steps to obtain the information. I’m using my slightly dusty NLP knowledge for the necessary queries, but most results can also be obtained with simple regular expressions, shell scripts and some help from GitHub.

Some Background information

Legally obtained/installed software contains a signature. In the case of Windows (64-bit) and macOS, this signature is additionally verified and identifiable by Microsoft or Apple. Most Linux distributions use a web of trust for verified software distribution.

Conversely, the software in this archive has to be installed without the approval of the user. It does not show up in any logs.

Its purpose is to set up a backdoor that bypasses a firewall. A backdoor allows unauthorized online access to a system in order to collect or upload data.

The software went unnoticed until now, which clearly shows the high quality of the product and gives a first indication of its producer.

The developers are security, network and Linux experts.

Creating a digital fingerprint

Only a sufficient number of parameters permits a clear identification.

Fingerprinting libraries for web browsers (example: Valve) use more than 20 parameters for a unique identification. However, in contrast to 3.5 billion Internet users, only a few hundred experts have to be identified.

A closer look at the software

Some of the binaries are unstripped; these still contain all debugging information. Libraries are statically linked. The first thing to do is to extract all readable text such as variable names, constants or method names.

find . -type f ! -name allText.txt -exec strings {} \; >> allText.txt

Remove the noise:

sed -r -i '/^.{,5}$/d' allText.txt

Sort the output and remove duplicates:

sort -u allText.txt > corpus.txt

The simple corpus is now ready to use. (If you have further questions or are interested in scripts or a short live presentation, feel free to contact me.)
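The same three steps can also be done in a single pass. The following Python sketch is an approximation, not a replacement for the commands above: it assumes that runs of printable ASCII characters are a good enough stand-in for the output of strings, and its sort order may differ from the locale-dependent order of sort -u.

```python
import re
from pathlib import Path

# Runs of at least six printable ASCII characters; shorter runs are treated as noise,
# mirroring the sed filter that drops lines of five characters or fewer.
PRINTABLE_RUN = re.compile(rb"[\x20-\x7e]{6,}")


def extract_strings(data):
    """Return all printable runs found in a bytes object, in order of appearance."""
    return [match.decode("ascii") for match in PRINTABLE_RUN.findall(data)]


def build_corpus(root):
    """Collect the unique printable runs of every regular file below `root`, sorted."""
    corpus = set()
    for path in Path(root).rglob("*"):
        if path.is_file():
            corpus.update(extract_strings(path.read_bytes()))
    return sorted(corpus)
```

Calling build_corpus(".") in the extracted archive directory returns the corpus as a list of strings, which can then be written to corpus.txt or queried directly.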

Library names

If the list of symbolic names (generated with nm) is inconclusive, you can also search your text file. Method and variable names like X509V3_get_section provide clues about the libraries used.

This leads to a second indicator: how widespread the use of each library is. In the case of libgcrypt this filter would not be selective enough, but several other libraries have received only a few GitHub stars.

Finding names in the Code

If you don’t have access to an English name database here is the result of a name search:

Ada ,Ai ,Al ,An ,Asa ,Camellia ,Chan ,Del ,Delta ,Don ,Echo ,Ed ,Elba ,Else ,Era ,Eric ,Ha ,Hai ,Hal ,Ike ,In ,June ,Lai ,Len ,Long ,Lu ,Mac ,Major ,Manual ,Many ,Mark ,Max ,May ,Min ,Numbers ,Ok ,Oscar ,Page ,Ping ,Precious ,Rose ,See ,September ,Sha ,So ,Soon ,Su ,Ta ,Tiny ,Tu ,Ty ,Un ,Val ,Vi ,Will ,Youn

Able ,Ack ,Agent ,Alert ,Alias ,All ,Alt ,Andersen ,Anon ,App ,Appl ,Ar ,Arch ,Area ,Arena ,Arp ,Ash ,Back ,Bad ,Bak ,Base ,Bash ,Be ,Been ,Begin ,Below ,Best ,Big ,Bigger ,Bio ,Blank ,Block ,Body ,Bogus ,Boot ,Both ,Bottom ,Bound ,Bounds ,Box ,Brace ,Bracket ,Brand ,But ,Call ,Can ,Cancel ,Carrier ,Case ,Catching ,Certain ,Chain ,Channel ,Char ,Chars ,Check ,Child ,Choice ,Chown ,Cid ,Cisco ,Class ,Clause ,Clear ,Client ,Clock ,Close ,Code ,Colon ,Comment ,Common ,Console ,Constant ,Cool ,Core ,Corp ,Counter ,Counts ,Cross ,Cui ,Current ,Dam ,Dates ,Days ,Dec ,December ,Delay ,Dest ,Distance ,Dk ,Do ,Doing ,Done ,Door ,Double ,Down ,Driver ,Due ,Dul ,During ,Ear ,Early ,Ede ,Ehl ,Elem ,End ,Ends ,Enter ,Even ,Every ,Exe ,Fail ,Fake ,Fast ,Fetch ,Few ,Field ,Fields ,File ,Files ,Fill ,Filter ,Fini ,First ,Fix ,Flash ,Font ,Force ,Forget ,Form ,Forward ,Found ,Frame ,Free ,Freed ,From ,General ,Getting ,Given ,Go ,Goes ,Going ,Good ,Grand ,Guard ,Guess ,Halt ,Handler ,Hard ,Has ,Hash ,Haven ,Head ,Headings ,Hellman ,Helper ,Hidden ,High ,Hint ,Holes ,Home ,Hook ,Hooks ,Host ,Hu ,Huge ,Human ,Ing ,Ip ,Job ,Jobs ,July ,Jump ,Just ,Kea ,Keep ,Key ,Keys ,Kill ,Kind ,Ko ,La ,Lab ,Lam ,Landing ,Large ,Larger ,Last ,Left ,Legacy ,Less ,Letters ,Level ,Levels ,Like ,Line ,Lines ,Link ,Linker ,Links ,List ,Listen ,Little ,Lock ,Locks ,Longest ,Loop ,Low ,Magic ,Mail ,Main ,Manner ,Mar ,Marker ,Marking ,Mask ,Master ,Matter ,Media ,Memory ,Merchant ,Mini ,Minor ,Mode ,Model ,Mole ,Montgomery ,More ,Moss ,Most ,Mount ,Mounts ,Much ,Must ,Nee ,Needs ,Neither ,Net ,Never ,New ,Nist ,No ,Nop ,Notice ,November ,Null ,Number ,Oakley ,Off ,Old ,Older ,On ,Or ,Ord ,Other ,Over ,Pack ,Pages ,Pair ,Parent ,Part ,Pascal ,Pass ,Pax ,Payment ,Peer ,People ,Pi ,Pipe ,Pipes ,Place ,Plain ,Plan ,Point ,Pointer ,Points ,Pool ,Pop ,Port ,Ports ,Post ,Power ,Press ,Prime ,Primes ,Proto ,Public ,Push ,Py ,Quiet ,Ram ,Rand ,Range ,Raw ,Re ,Read ,Reader ,Reading ,Real ,Reason ,Reasons ,Record 
,Records ,Red ,Register ,Right ,Ring ,Rm ,Ro ,Rom ,Room ,Root ,Round ,Rounds ,Route ,Row ,Rule ,Running ,Sa ,Safe ,Safi ,Salt ,Save ,Screen ,Search ,Second ,Section ,Seed ,Seeds ,Seek ,Seen ,Self ,Sep ,Server ,Service ,Session ,Sessions ,Setting ,Share ,Shell ,Shells ,Short ,Show ,Side ,Simonsen ,Simple ,Sing ,Single ,Six ,Slot ,Small ,Smaller ,Space ,Spare ,Sport ,Square ,Stabs ,Stack ,Stage ,Stager ,Stamp ,Standard ,Start ,State ,Storage ,Store ,Stream ,Streams ,Strength ,String ,Strong ,Style ,Such ,Sudo ,Sum ,Tables ,Tag ,Tags ,Takes ,Te ,Temp ,Test ,Than ,Them ,Then ,Times ,To ,Too ,Tools ,Top ,Touch ,Trace ,Trad ,Treat ,True ,Try ,Unavailable ,Us ,Vars ,Ve ,Via ,View ,Wait ,Wall ,Want ,Warn ,Warning ,Well ,Whack ,While ,Wide ,Wild ,Win ,Wince ,Winning ,Wins ,Wish ,Word ,Work ,Xu ,You ,Zone

I will not release the results of combining surnames with first names, as this would lead to the next fingerprint. LinkedIn will show you the professional discipline, GitHub the shared libraries and their popularity.
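A name search of this kind can be sketched in a few lines of Python. This is a simplified illustration, not the NLP tooling used for the lists above: identifiers are split at underscores, digits and camelCase boundaries, and every part is compared with a name list. The name list itself (known_names) is a placeholder you would have to supply.

```python
import re

# Either an all-caps run (such as "CMS") or a word with an optional leading capital.
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+")


def split_identifier(identifier):
    """Split a camelCase or PascalCase token into capitalized words."""
    return [word.capitalize() for word in WORD.findall(identifier)]


def find_names(corpus_lines, known_names):
    """Return every entry of `known_names` that occurs as a word in the corpus."""
    known = {name.capitalize() for name in known_names}
    hits = set()
    for line in corpus_lines:
        # Underscores, digits and punctuation all act as separators.
        for token in re.split(r"[^A-Za-z]+", line):
            hits.update(word for word in split_identifier(token) if word in known)
    return sorted(hits)
```

As the lists above show, a plain dictionary match produces many false positives (ordinary words that also happen to be names), so the result is only a starting point for further filtering.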

Additionally, you can search for email addresses with a (case-insensitive) regular expression like this:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b
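Because the character classes in this expression only list uppercase letters, it has to be applied case-insensitively. A minimal Python sketch, with a function name chosen for this example:

```python
import re

# The expression from the text, compiled case-insensitively.
EMAIL = re.compile(r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,}\b", re.IGNORECASE)


def find_emails(lines):
    """Return the unique, lowercased email addresses found in an iterable of lines."""
    found = set()
    for line in lines:
        found.update(match.group(0).lower() for match in EMAIL.finditer(line))
    return sorted(found)
```

Applied to the lines of corpus.txt, this returns each address once, regardless of how it was capitalized in the binary.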

NLP Author identification

Up until now it was only possible to identify the developers of the libraries used. If the field of application of a library is also limited, this could indicate an indirect connection to the malware development.

To find direct clues as to who the authors are, I use a variant of NLP author identification.

To identify the author of a text, the frequency of bigrams and trigrams is normally analyzed, but this next corpus is more limited. It contains only the debugging information of the executables, which can be accessed with gdb. For example:

info functions
disassemble /m main

The corpus contains only words and short sentence fragments.

It is therefore necessary to adopt a modified technique.

Observation

The provided binaries are written in C and assembler. The coding conventions for these languages are not very strict and have changed over time. The variable names contained in the debugging information can consequently be analyzed.

Examples: Different styles of variable naming

Underscore or capitalization as word separator:
aVariable vs a_Variable

All uppercase global variables:
uploaddata vs UPLOADDATA

Language slip:
All vs. Alle (unintentionally switched to German)

Abbreviations: missing consonants, word beginning etc.
attr vs attributes

Uppercase Abbreviations
CMS_encrypt vs cms_encrypt

Time
constructed vs construction

Simplified reading of long words
countersignature vs counterSignature

Describing a process
cr_cancels_micro_mode

Last but not least: Typos
Distrubution, occured
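Some of these style features can be counted automatically. The following Python sketch is only an illustration: the style categories and their patterns are assumptions made for this example and cover just the capitalization and separator habits, not abbreviations, language slips or typos.

```python
import re
from collections import Counter

# Hypothetical style categories, checked in this order.
STYLE_PATTERNS = [
    ("UPPER_CASE", re.compile(r"^[A-Z][A-Z0-9]*(_[A-Z0-9]+)*$")),
    ("snake_case", re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)+$")),
    ("camelCase", re.compile(r"^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$")),
    ("PascalCase", re.compile(r"^([A-Z][a-z0-9]+)+$")),
    ("lowercase", re.compile(r"^[a-z][a-z0-9]*$")),
]


def classify(identifier):
    """Return the first matching style name, or "mixed" if no pattern fits."""
    for name, pattern in STYLE_PATTERNS:
        if pattern.match(identifier):
            return name
    return "mixed"


def style_profile(identifiers):
    """Count how often each naming style occurs in a list of identifiers."""
    return Counter(classify(identifier) for identifier in identifiers)
```

The profile of one binary can then be compared with the profile of a candidate repository. Identifiers such as a_Variable or CMS_encrypt end up in the "mixed" bucket, which is often the most distinctive one.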

Data matching

GitHub offers a flexible and easy-to-use API. Unfortunately, it is not possible to search with regular expressions. It is therefore beneficial to narrow the search using the fingerprints already collected from the libraries.

For example:

Search terms             Discoveries
occured                  1544
occured + network + C    7

After the number of repositories has been narrowed down, they can be cloned and scanned with regular expressions.

An accumulation of certain expressions indicates a hit:

#([A-Z]*)_.* > #(.*)_.*
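Scanning a cloned repository can be sketched as follows. This Python example does not reproduce the expression above; it merely counts a few illustrative fingerprints (two of the typos and the uppercase-prefix naming style mentioned earlier). The fingerprint names, the patterns and the list of file suffixes are assumptions made for this example.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical fingerprints taken from the style examples in this post.
FINGERPRINTS = {
    "typo_occured": re.compile(r"occured", re.IGNORECASE),
    "typo_distrubution": re.compile(r"distrubution", re.IGNORECASE),
    "upper_prefix": re.compile(r"\b[A-Z]{2,}_[a-z]\w*"),  # e.g. CMS_encrypt
}


def score_text(text):
    """Count the occurrences of each fingerprint in one piece of source text."""
    return Counter(
        {name: len(pattern.findall(text)) for name, pattern in FINGERPRINTS.items()}
    )


def score_repository(root, suffixes=(".c", ".h", ".s", ".S")):
    """Sum the fingerprint counts over all source files below `root`."""
    total = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            total += score_text(path.read_text(errors="ignore"))
    return total
```

A repository in which several independent fingerprints accumulate is a candidate for a closer manual look; a single match on its own is not meaningful.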

To summarize the assumptions

  • The developers of the malware are leading experts in the area of Linux, Network and Security development.
  • They were discovered (recruited for their existing expertise) and not trained.

Because the archive contains a collection of applications, the calculated result set is reasonably small for further investigation.

YAGL3 (OpenGL/Vulkan) final update status

Since the new render pipeline with Vulkan support still lacks crucial features like function pointers and reflections, OpenGL Version 4.1 (+ extensions) remains the default. The missing API features are still in development, but a lack of demand makes it necessary to halt any further development. Here are some final screenshots of the last build:

[Screenshots: YA3A, YA3B, YA3C, YA3D, YA3E]

Technique: Time-lapse videos segmented by circular sectors (and a small question)

Instead of playing a time-lapse video frame by frame, it is also possible to show all frames at once. To accomplish this, the screen is partitioned into segments, and each frame covers only one of them. The screen partitions work like cut masks. Their position always corresponds to the position inside the frame.

To reintroduce an animation effect, each frame segment can be moved. As a result, the segmented video contains the same information as the original time-lapse recording; it is only presented differently.

This technique is useful to show small changes over time. Rapid changes, as they can occur during camera panning, produce image errors.

[Video: Time-lapse videos segmented by circular sectors]

If you prefer YouTube, you can follow this link: https://youtu.be/FYsOhG1V4B0

It is interesting to observe how the ambient light is influenced by the passing clouds, while small movements (the swaying plants) still behave as in the original video.

Calculation of the frame segments

The noteworthy part of the algorithm is the frame selection for the currently processed segment:

  • I prefer normalized vector coordinates for my work. All screen coordinates are mapped into the range ([0..1]/[0..1]).
  • The circle center c and an up-vector are chosen. In this case, I selected the bottom center position for the circle and the negative x-axis as up-vector.
  • For each pixel p, the normalized vector from c to p is calculated.
  • The angle between this vector and the up-vector is calculated. The result lies between 0..π.

In short:

$$\frac{\arccos\left((\vec{p} - \vec{c}) \cdot \vec{\text{up}}\right)}{\pi} \cdot |\text{frames}|$$

calculates the frame for the active segment.
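The selection rule can be written down as a small Python function. This is a sketch under stated assumptions, not the original implementation: the y-axis is assumed to point upwards so that the bottom center is (0.5, 0.0), the up-vector is assumed to have unit length, and the result is clamped to the last valid frame index.

```python
import math


def frame_for_pixel(p, center=(0.5, 0.0), up=(-1.0, 0.0), frame_count=5000):
    """Map a pixel in normalized coordinates to the index of the frame it shows."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        return 0  # the center itself has no direction; pick the first frame
    # Dot product of the normalized direction with the (unit length) up-vector.
    cos_angle = (dx * up[0] + dy * up[1]) / length
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding errors
    angle = math.acos(cos_angle)  # lies in 0..pi
    return min(int(angle / math.pi * frame_count), frame_count - 1)
```

With these defaults the first frame appears along the left edge of the bottom line, the last frame along the right edge, and the frames in between fan out around the bottom center.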

Reference

I recently found a video using this technique on a social media website (a harbor during a tide). Unfortunately, I was not able to find this video again and acknowledge the creator.

Complexity

While viewing the reference video, I realized that its creation could be tricky because of the huge quantity of data that has to be accessed. My original time-lapse video from above consists of 5000 individual full-HD images. Loading the corresponding image for each pixel of the screen would, in the worst case, create a data volume of about 20.86 TiB (1920 * 1080 * 3 * 4 * 1280 * 720 bytes). Creating an animation this way would be unreasonable on current desktop computers. Therefore, it was necessary to devise a smart caching strategy.
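The worst-case figure can be reproduced with a few lines of Python. The interpretation of the factors is an assumption: 1920 * 1080 is the size of one source image, 3 * 4 is presumably three color channels at four bytes each, and 1280 * 720 is presumably the number of output pixels.

```python
FRAME_WIDTH, FRAME_HEIGHT = 1920, 1080  # one full-HD source image
BYTES_PER_PIXEL = 3 * 4                 # assumption: 3 channels, 4 bytes each
OUT_WIDTH, OUT_HEIGHT = 1280, 720       # assumption: resolution of the rendered image

bytes_per_frame = FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_PIXEL
# Worst case: a complete source image is loaded for every single output pixel.
worst_case_bytes = bytes_per_frame * OUT_WIDTH * OUT_HEIGHT

worst_case_tib = worst_case_bytes / 2**40   # binary terabytes (TiB)
worst_case_tb = worst_case_bytes / 10**12   # decimal terabytes (TB)

print(f"{worst_case_tib:.2f} TiB, {worst_case_tb:.2f} TB")
```

The product is 22,932,357,120,000 bytes, which corresponds to roughly 20.86 TiB or 22.93 TB, in line with the figure quoted above.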

I estimated that with the naive approach (loading the calculated frame for each pixel) about two hours would be necessary to render a single image. Reversing the process, that is, loading each image of the original animation and drawing only the pixels inside the corresponding segment, reduced the render time to 17 minutes. As a final optimization, a fixed-size (12 GB) cache was introduced. One half was preallocated by a coarse-grained frame selection; the remaining part was handled by a concurrent thread. Fragment pixels whose frames were missing from the preallocated cache were sorted by their frame number, and their processing was delayed. After a specified time limit, the remaining pixels were assigned to the nearest available (cached) frame. With some other minor modifications (stretching the time segments by a factor of 2), the render time was reduced to a fraction of a second per image.
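Two of these ideas are easy to show in isolation: reversing the lookup so that each source frame is touched only once, and falling back to the nearest cached frame. The Python sketch below is a simplified illustration with made-up function names; it leaves out the concurrent loader thread, the time limit and the cache itself.

```python
from bisect import bisect_left
from collections import defaultdict


def group_pixels_by_frame(frame_of_pixel):
    """Invert a pixel -> frame mapping so that each frame has to be loaded only once."""
    pixels_of_frame = defaultdict(list)
    for pixel, frame in frame_of_pixel.items():
        pixels_of_frame[frame].append(pixel)
    # Sorted by frame number, so the frames can be read sequentially.
    return dict(sorted(pixels_of_frame.items()))


def nearest_cached_frame(frame, cached_frames):
    """Return the cached frame number closest to `frame`.

    `cached_frames` must be a sorted, non-empty list of frame numbers.
    """
    position = bisect_left(cached_frames, frame)
    candidates = cached_frames[max(0, position - 1):position + 1]
    return min(candidates, key=lambda cached: abs(cached - frame))
```

Grouping the pixels first turns millions of random image loads into one sequential pass over the frames, which is where most of the speed-up comes from.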

My request for one minute of your attention

Over the last months, my already low sales on Apple’s App Store dropped steadily to zero. In times of omnipresent ad blockers and click fraud I nevertheless tried my luck with advertising and, as a result, have never burned my money so senselessly. Under the current conditions, it has become pointless to continue application development.

So I sincerely ask you to consider the purchase of one of my graphics applications:

Bundle Small
Virtual Mannequin
Forced Perspective
Color Essence

Alternatively, if you need a free license for a review, please contact me.

Many thanks in advance,

Yousry Abdallah
yousry@protonmail.ch

Video: Cellular automaton

The cells in this demo are not static but flow in a determined direction. Similar to simulated annealing, the cells try to emit their energy to neighboring cells. The quantity of the transmission is limited. The simulation area is closed (the left edge connects to the right and the top to the bottom). My application tried to save all images at once; therefore, the animation is limited to 1000 frames:

Quantum Transfer

Deploying Linux OpenGL – Applications with AppImage

toplAppImage

The setup of AppImageKit was straightforward and took about 10 minutes, following the instructions in a short README. During tests it became clear that the AppImage binary contains an ISO image, which is in turn mounted as a hidden read-only directory in /tmp. It should be taken into account that the directory name contains a random part, so any references to resources (audio/video) have to be updated dynamically at each start. As with macOS (sandboxes) or iOS, I had to disable several patch/update features.
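Resolving resource paths at start-up can be sketched as follows. The example is in Python for brevity and is not taken from the application itself. It assumes that the mount point is exported in an environment variable named APPDIR, which AppImage runtimes commonly do; if your setup does not provide it, the directory of the executable is used as a fallback.

```python
import os
import sys
from pathlib import Path


def resource_root(environ=os.environ, fallback=None):
    """Return the directory that contains the bundled resources for this start."""
    app_dir = environ.get("APPDIR")  # assumption: set to the current mount point
    if app_dir:
        return Path(app_dir)
    if fallback is not None:
        return Path(fallback)
    # Outside of an AppImage: use the directory of the running program.
    return Path(sys.argv[0]).resolve().parent


def resource_path(relative, environ=os.environ, fallback=None):
    """Build the absolute path of a resource given relative to the resource root."""
    return resource_root(environ, fallback) / relative
```

Because the path is computed on every start, the random part of the mount directory never has to be stored anywhere.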

Collecting the necessary libraries

AppImage does not collect the necessary library dependencies on its own. Therefore, it is necessary to copy these by hand. I wrote a small shell script to simplify the process (→GIST).
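The core of such a collection step is reading the output of ldd. The following Python sketch is not the script from the gist; it only parses ldd output that has already been captured as text. The list of libraries that stay on the host system is an assumption made for this example.

```python
import re

# Matches lines such as:
#   libpng16.so.16 => /usr/lib/x86_64-linux-gnu/libpng16.so.16 (0x00007f1234560000)
LDD_LINE = re.compile(r"^\s*(\S+)\s+=>\s+(/\S+)\s+\(0x[0-9a-fA-F]+\)\s*$")

# Assumption: basic system and driver libraries are not bundled.
HOST_LIBRARIES = ("libc.so", "libm.so", "libdl.so", "libpthread.so", "libGL.so")


def parse_ldd_output(text):
    """Return a dict that maps library names to resolved paths."""
    libraries = {}
    for line in text.splitlines():
        match = LDD_LINE.match(line)
        if match:
            libraries[match.group(1)] = match.group(2)
    return libraries


def libraries_to_bundle(text, skip_prefixes=HOST_LIBRARIES):
    """Return only the libraries that should be copied into the AppImage directory."""
    return {
        name: path
        for name, path in parse_ldd_output(text).items()
        if not name.startswith(skip_prefixes)
    }
```

The returned paths can then be copied into the library directory of the AppDir. Lines without a resolved path, for example "not found" entries, are ignored and should be checked by hand.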

Experimental Build

An experimental image build can be downloaded from here: (→itchio). It can be executed from the command line. Due to a lack of time, extensive testing wasn’t possible.

Coloring Picture: Draco

A copy drawn from memory, based on the zoology book Historia Animalium, Volume 2, by Ulisse Aldrovandi, released in 1630:
Draco

Almost four hundred years ago, it was a known fact that dragons are related to snakes and elephants and live in swamps in Africa. Their main diet consists of buffaloes. Attentive readers, however, have certainly noticed that the head bears a suspicious similarity to a mouse head (minus the snakeskin and a few bumps), that the legs, or arms, are pretty useless because they are away from the center of mass, and that the wings are a little bit too small to carry the weight of the dragon.

Nevertheless, most people still believed in dragons, because it was written in a book.

Update June 30
The reference image:
DracoRef