Many of our algorithms are implemented in a way
that enables them to execute many calculations simultaneously. It is
called parallelization. In my short IT career I have seen many
erroneous thinking “If 8 simultaneous threads speeded up my
algorithm 6 times than I'll increase number of threads and this will
give me even better results! Great!” Unfortunately, Nope …
If more than N threads are trying to run in
parallel on N core processor this is physically impossible. In
this case (most often) operating system “cuts” your algorithm in
a few pieces trying to switch between them giving impression that we
can create arbitrary number of threads with impunity.
Unfortunately this switching (named context
switching) comes with a cost:
Ideal multitasking
multitasking with context switching
All of us can easily
understand that context switching involves some time consuming logic
but why the hell tasks with context switching are running little bit
longer? Anter alia because context switch clears TLB
(http://en.wikipedia.org/wiki/Translation_lookaside_buffer)
and “dirty” CPU caches. That's why we often should take a look at
hardware and “mechanical sympathy”.
So, what we really want is
more cores/processors but companies like Intel are selling us in
reasonable price maximum 8 cores. These processors are fast but
sometimes
we want to have little bit slower but two or four times more of them.
A few months ago I started looking for cheap multimulti processor
computer and I found Parallella (http://www.parallella.org/).
Parallella supercomputer
Parallella computer costs 99$ and contains:
- 16 cores Epiphany coprocessor (64 cores en
route),
- 2 cores ARM-A9 host processor (with FPGA logic),
- 1 GB SDRAM,
- Gigabit Ethernet, 2x uUSB, uHDMI, uSD.
Impressed? I was.
Epiphany architecture
Heart of Parallela is Epiphany coprocessor. It
consists of 16/64 RISC Processors connected together by mesh
network-on-chip.
The main challenge in
programming on Epiphany is to create the correct algorithm to offload
host processor. You have multiple CPUs available but currently no
operating system and what is more important no memory fences
(http://en.wikipedia.org/wiki/Memory_barrier).
Memory access architecture is referred as “weak ordering of loads
and stores”. The guarantees are:
- Load operations complete before the returned data is used by a subsequent instruction,
- Load operations using data previously written use the updated values,
- Store operations eventually propagate to their ultimate destination.
Why the hell am I writing
about this? Because the Java Memory Model
(http://www.slideshare.net/michalwarecki/java-memory-model-23207253)
requires specified rules when using monitors
and volatile
keyword.
E.g. x86 processors contains locked
instruction which allows us to be
sure that sequential write and read to the same variable will be
executed in this order.
Epiphany doesn't have such instructions but if we
want to be Java Language Specification (JLS) compatible we need to
implement it somehow. Such JLS requirements will be the hardest to
implement.
Epiphany SDK
Epiphany comes with very good Software Development
Kit which makes programming easier. It consists of:
- Epiphany Multicore Development IDE (based on Eclipse),
- C/C++ Compiler (E-GCC),
- Assembler (E-AS),
- Linked (E-LD),
- Instruction set simulator (E-RUN),
- Hardware Connection Server (E-SRVER),
- Loader Utility (E-LOADER),
- Debugger (E-DBG),
- Utility libraries.
This is enough to create powerful programs.
Great but can I run Java on it?
Not yet but we can help to make it work. Java has
support for many architectures but Epiphany is not enough popular to
interest large companies like Oracle or Redhat. I'm a big fun of Java
and all hardware novelties so I decided to try implementing it on my
own. Currently, I'm working on engineering Epiphany assembly code
generator which is based on OpenJDK Graal JiT Compiler. Another
challenge is to create efficient code dispatcher and Java API which
would make programming easier and safer. It has to be designed for
heterogeneous architectures like Parallella. Here comes OpenJDK
Sumatra project started by AMD to support GPU/APU offload handling.
Implementation of communication by a shared memory
buffers between main Java process and offloaded code will be just a
pleasure :-)
OpenJDK Graal
Purpose of Graal project is to exposure
functionalities to make dynamic compilers and interpreters in pure
Java. Currently, there is possibility to generate assembly code for
architectures:
- AMD64,
- Sparc,
- PTX,
- HSAIL.
I want to make it possible to generate Epiphany
assembly code. People often ask me, “why Graal?” so the simplest
answers are:
- because it will be much more quicker to implement such code generator (e. g. why would we need to reinvent the weel?),
- because it has a very good architectural design,
- because I can refer to the existing implementations,
- because it is supported by Oracle and AMD.
Code example for HSAIL:
Java code:
public static void testMulThreeArrays(int[] out, int[] ina, int[] inb, int gid) { out[gid] = ina[gid] * inb[gid]; }
Generated HSAIL code:
kernel &run ( kernarg_u64 %_arg0, kernarg_u64 %_arg1, kernarg_u64 %_arg2 ) { ld_kernarg_u64 $d6, [%_arg0]; ld_kernarg_u64 $d2, [%_arg1]; ld_kernarg_u64 $d1, [%_arg2]; workitemabsid_u32 $s8, 0; @L0: ld_global_s32 $s0, [$d2 + 12]; cmp_ge_b1_u32 $c0, $s8, $s0; cbr $c0, @L1; @L2: ld_global_s32 $s0, [$d1 + 12]; cmp_ge_b1_u32 $c0, $s8, $s0; cbr $c0, @L3; @L4: ld_global_s32 $s0, [$d6 + 12]; cmp_ge_b1_u32 $c0, $s8, $s0; cbr $c0, @L5; @L6: cvt_s64_s32 $d0, $s8; mul_s64 $d0, $d0, 4; add_u64 $d2, $d2, $d0; ld_global_s32 $s0, [$d2 + 16]; cvt_s64_s32 $d3, $s8; mul_s64 $d3, $d3, 4; add_u64 $d1, $d1, $d3; ld_global_s32 $s3, [$d1 + 16]; mul_s32 $s3, $s3, $s0; cvt_s64_s32 $d8, $s8; mul_s64 $d8, $d8, 4; add_u64 $d6, $d6, $d8; st_global_s32 $s3, [$d6 + 16]; ret; @L1: ret; @L3: ret; @L5: ret; };
Cool! I'm not familiar with HSAIL but it is very
readable. How to generate such code:
private void test(String snippet) { StructuredGraph graph = parse(snippet); HSAILCompilationResult compResult = HSAILCompilationResult .getHSAILCompilationResult(graph); compResult.dumpCompilationResult(); }
where snippet is a method name. Nice and
simple. I'll skip parse implementation because it is not
necessary here.
Main Graal compiler internal components:
- architecture: in this component we define how hardware architecture (registers etc.) looks like. For example this is my draft implementation of Epiphany:
/** * Represents the Epiphany architecture. */ public class Epiphany extends Architecture { /** * Epiphany architecture does not provide any implicit memory barrier. */ public static final int NO_MEM_BAR = 0x0000; /** * The general purpose registers can all be used in an integral context as well as in a floating-point context. * There is no distinction per FPU and CPU so CPU category is used. */ public static final Register.RegisterCategory CPU = new Register.RegisterCategory("CPU"); public static final Register.RegisterCategory FPU = new Register.RegisterCategory("FPU"); //Argument / result / scratch registers public static final Register a1 = new Register(1, 1, "r0", CPU); public static final Register a2 = new Register(2, 2, "r1", CPU); public static final Register a3 = new Register(3, 3, "r2", CPU); public static final Register a4 = new Register(4, 4, "r3", CPU); //Registers variable public static final Register v1 = new Register(5, 5, "r4", CPU); public static final Register v2 = new Register(6, 6, "r5", CPU); public static final Register v3 = new Register(7, 7, "r6", CPU); public static final Register v4 = new Register(8, 8, "r7", CPU); public static final Register v5 = new Register(9, 9, "r8", CPU); public static final Register v6 = new Register(10, 10, "r9", CPU); public static final Register v7 = new Register(11, 11, "r10", CPU); //Variable Register / Frame Pointer public static final Register v8 = new Register(12, 12, "r11", CPU); //Intra - procedure call scratch register public static final Register pcs = new Register(13, 13, "r12", CPU); //Stack Pointer public static final Register sp = new Register(14, 14, "r13", CPU); //Link Register public static final Register lr = new Register(15, 15, "r14", CPU); //General use public static final Register r15 = new Register(16, 16, "r15", CPU); public static final Register r16 = new Register(17, 17, "r16", CPU); public static final Register r17 = new Register(18, 18, "r17", CPU); public static final Register r18 = new Register(19, 19, "r18", CPU); public static final Register r19 = new Register(20, 20, "r19", CPU); public static final Register r20 = new Register(21, 21, "r20", CPU); public static final Register r21 = new Register(22, 22, "r21", CPU); public static final Register r22 = new Register(23, 23, "r22", CPU); public static final Register r23 = new Register(24, 24, "r23", CPU); public static final Register r24 = new Register(25, 25, "r24", CPU); public static final Register r25 = new Register(26, 26, "r25", CPU); public static final Register r26 = new Register(27, 27, "r26", CPU); public static final Register r27 = new Register(28, 28, "r27", CPU); //Reserved for constants public static final Register r28 = new Register(29, 29, "r28", CPU); public static final Register r29 = new Register(30, 30, "r29", CPU); public static final Register r30 = new Register(31, 31, "r30", CPU); public static final Register r31 = new Register(32, 32, "r31", CPU); //General use public static final Register r32 = new Register(33, 33, "r32", CPU); public static final Register r33 = new Register(34, 34, "r33", CPU); public static final Register r34 = new Register(35, 35, "r34", CPU); public static final Register r35 = new Register(36, 36, "r35", CPU); public static final Register r36 = new Register(37, 37, "r36", CPU); public static final Register r37 = new Register(38, 38, "r37", CPU); public static final Register r38 = new Register(39, 39, "r38", CPU); public static final Register r39 = new Register(40, 40, "r39", CPU); public static final Register r40 = new Register(41, 41, "r40", CPU); public static final Register r41 = new Register(42, 42, "r41", CPU); public static final Register r42 = new Register(43, 43, "r42", CPU); public static final Register r43 = new Register(44, 44, "r43", CPU); public static final Register r44 = new Register(45, 45, "r44", CPU); public static final Register r45 = new Register(46, 46, "r45", CPU); public static final Register r46 = new Register(47, 47, "r46", CPU); public static final Register r47 = new Register(48, 48, "r47", CPU); public static final Register r48 = new Register(49, 49, "r48", CPU); public static final Register r49 = new Register(50, 50, "r49", CPU); public static final Register r50 = new Register(51, 51, "r50", CPU); public static final Register r51 = new Register(52, 52, "r51", CPU); public static final Register r52 = new Register(53, 53, "r52", CPU); public static final Register r53 = new Register(54, 54, "r53", CPU); public static final Register r54 = new Register(55, 55, "r54", CPU); public static final Register r55 = new Register(56, 56, "r55", CPU); public static final Register r56 = new Register(57, 57, "r56", CPU); public static final Register r57 = new Register(58, 58, "r57", CPU); public static final Register r58 = new Register(59, 59, "r58", CPU); public static final Register r59 = new Register(60, 60, "r59", CPU); public static final Register r60 = new Register(61, 61, "r60", CPU); public static final Register r61 = new Register(62, 62, "r61", CPU); public static final Register r62 = new Register(63, 63, "r62", CPU); public static final Register r63 = new Register(64, 64, "r63", CPU); public static final Register[] gprRegisters = { r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r32, r33, r34, r35, r36, r37, r38, r39, r40, r41, r42, r43, r44, r45, r46, r47, r48, r49, r50, r51, r52, r53, r54, r55, r56, r57, r58, r59, r60, r61, r62, r63 }; public static final Register[] allRegisters = { a1, a2, a3, a4, v1, v2, v3, v4, v5, v6, v7, v8, pcs, sp, lr, r15, r16, r17, r18, r19, r20, r21, r22, r23, r24, r25, r26, r27, r28, r29, r30, r31, r32, r33, r34, r35, r36, r37, r38, r39, r40, r41, r42, r43, r44, r45, r46, r47, r48, r49, r50, r51, r52, r53, r54, r55, r56, r57, r58, r59, r60, r61, r62, r63 }; public Epiphany() { /** * About machine code call displacement offset: * Epiphany has relative as well as absolute branching. Each offers a 16-bit and 32-bit opcodes. * For relative branch, the offset is either 8-bit or 24-bit (signed) accordingly. * It can also branch based on an absolute register value, where the offset can be 32-bit. */ super("Epiphany", 4, ByteOrder.LITTLE_ENDIAN, false, allRegisters, NO_MEM_BAR, 1, r63.encoding + 1, 4); } @Override public boolean canStoreValue(Register.RegisterCategory category, PlatformKind platformKind) { if (!(platformKind instanceof Kind)) { return false; } Kind kind = (Kind) platformKind; if (category == CPU) { switch (kind) { case Boolean: case Byte: case Char: case Short: case Int: case Long: case Object: return true; } } else if (category == FPU) { switch (kind) { case Float: case Double: return true; } } return false; } @Override public PlatformKind getLargestStorableKind(Register.RegisterCategory category) { if (category == CPU) { return Kind.Long; } else if (category == FPU) { return Kind.Double; } else { return Kind.Illegal; } } }
- compiler: as the name implies, this components is used for code compilation,
- ASM: generates architecture-specific assembly code,
- LIR: aims to manage intermediate representation graph.
OpenJDK Sumatra
This project is even more interesting. The primary
purpose of it is to use graphics processing unit (GPU) and
accelerated processing unit (APU – i.e. Epiphany) in JVM. It is so
awesome that it creeps me out :-).
Graphics cards have computations power of
thousands of GFlops, so not using this power is just a waste of
money. But heterogeneous architecture is not so easy to implement.
Guys from AMD and Oracle wants to make it easier to offload logic
written in JVM based languages to GPUs and APUs. Currently, it is
possible to redirect code but in the future such offload will be
possible dynamically (i.e. using C2).
Example of redirecting Java code:
IntStread.range(0, players.length).parallel().forEach(n → { Player p = players[n]; if(p.getAb() > 0) { p.setBa((float)p.getHits() / (float) p.getAb()); } else { p.setBa((float)0.0); } });
I this example, we can see that each calculation
of player batting average is executed in parallel on different
processing unit where n is the unique identifier of this unit.
Summary
If you like what I do and would like
to participate, please send me a message. It will be motivating!