std-proposals: Parallelized instructions in language in CPU

From: Wesley Oliver <wesley.olis_at_[hidden]>
Date: Mon, 18 Jan 2021 10:12:18 +0200

Hi,

I would like to look at how to achieve, for other kinds of data, the same
performance that C++ can achieve with character data that is physically
terminated by a '\0' in its last position. Data that is null-terminated
doesn't require range checking.

So my question is: how could we improve things such that the typical
conditional bounds-checking statements for an int array or similar could be
reduced, or written in a slightly different form, so that we achieve the
same performance as with null-terminated data?

For this to happen, it would require improved compilers and also,
hypothetically, investigating new CPU wiring or logic that by 2025 or so
would give us a massive performance boost, once we have figured out how to
write code that takes many performance knocks in a better way, by reduction
or whatever the technique is, to reduce the number of instructions required.

However, I really need the technical people at the likes of Arm, and those
working on the C++ language, to look at what is possible to reduce this and
improve the performance of these few common cases with new techniques.

There was a website for Arm that was more open to outside development and
to giving the standard direction; however, I can't find it right now. If you
have or know of links or a place to start, please pop me the link.

What I would like to do is as follows. Imagine a normal loop:

for (int i = currentSearchPos, j = 0; i < maxlen; i++, j++) {

if (numArray[i] == numArray2[j]) {
// do something.
}

}

As you can see, the core of the code above can't be reduced to what is
below (the match function at the end of this mail). There, the conditional
if statements can in places be combined into one, because the code can rely
on the data to terminate the looping, whereas normal data structures can't,
since every value has meaning.
So how could we look at optimizing the above to achieve the same
performance as below?
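
To make the contrast concrete, here is a minimal sketch in C++. The function
names starts_with_bounded and starts_with_terminated are hypothetical and
only for illustration. The bounded version needs an explicit range check on
the data in addition to the data comparison, while in the null-terminated
version the data comparison also covers the end of str, because '\0' can
never equal a non-terminator character of the prefix.

```cpp
#include <cstddef>

// Length-delimited data: every int value is meaningful, so the end of
// `data` has to be detected with a separate index-versus-length test.
bool starts_with_bounded(const int* data, std::size_t data_len,
                         const int* prefix, std::size_t prefix_len) {
    for (std::size_t i = 0; i < prefix_len; ++i) {
        if (i >= data_len) return false;         // range check on data
        if (data[i] != prefix[i]) return false;  // data check
    }
    return true;
}

// Null-terminated data: the value '\0' is reserved as an end marker.
// There is no range check on `str`: if `str` ends first, its '\0'
// differs from *prefix (which is not '\0' inside the loop) and the
// data check itself returns false.
bool starts_with_terminated(const char* str, const char* prefix) {
    while (*prefix != '\0') {
        if (*str != *prefix) return false;       // data check doubles as range check
        ++str;
        ++prefix;
    }
    return true;
}
```

For example, starts_with_terminated("mat", "match") returns false without
ever comparing an index against a length.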

Here are a couple of my ideas.
A watch register: with enough registers, one could do the i < maxlen
check by just storing directly to the registers used for i and for the
maxlen comparison, even if the step were i += 2.
However, the same would be needed for maxLength, as the structure could be
lengthened.

So the idea I have from the above is conditional statements that could be
parallelized without changing the logic of the program. Think of it this
way: both
numArray[i] == numArray2[j]
i < maxlen
could be evaluated at the same time and their true/false results combined
by wiring, so that no additional instruction is required to evaluate the
AND of the two results. That would allow one to achieve the same
performance as below.
The problem comes if the code is more complex: then the registers will have
to do a lot more load and store operations, as the program can't just be
run from the registers. So I guess the question then is which registers,
based on code hit rate and usage, should be used to reduce the overall code
length and the number of load/store operations to and from registers.

So is there a way in which one could combine range/bounds checking
implicitly with data evaluation? Or could the CPU intelligently and
internally have an additional bit that flags the end of a data stream, so
that it could implicitly raise a flag which could be combined with the
evaluation logic?

So think that when loading data into the CPU, the last item of data would
be flagged as its end of range, which allows a CPU instruction to use just
data evaluation, as the CPU gives implicit boundary checking. The problem
comes if the data grows or shrinks: then this flag would have to move, so
the solution wouldn't work well for that type of code.
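
A software analogue of such an end marker already exists: the classic
sentinel search. The sketch below is only an illustration of that existing
technique, not of the proposed hardware flag, and the name
find_with_sentinel is hypothetical. It also shows the drawback mentioned
above: the marker has to be written and removed again whenever the data
changes size.

```cpp
#include <cstddef>
#include <vector>

// Returns the index of the first element equal to `key`, or the original
// size of `values` if `key` is absent. The vector is taken by non-const
// reference because the end marker is written into it temporarily.
std::size_t find_with_sentinel(std::vector<int>& values, int key) {
    values.push_back(key);        // end marker: guarantees the loop stops
    std::size_t i = 0;
    while (values[i] != key) {    // single data test, no index-versus-size test
        ++i;
    }
    values.pop_back();            // remove the marker, restoring the data
    return i;                     // i == values.size() means "not found"
}
```

The loop body contains one comparison per element; the range check has
been folded into the data itself, at the cost of mutating the container.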

The question might also come up with a language like JavaScript, in which
the boundaries are checked automatically on access, with some extra code
slowing down the raw performance.

So the next question would be how to get such a language to also benefit
in performance, by allowing the range check on access not to happen.
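
In C++ terms, one way to picture this is the difference between checking
every access and checking the whole range once before the loop. The sketch
below uses std::vector::at for the per-access check; the function names
sum_checked and sum_hoisted are hypothetical. Whether a particular compiler
or JavaScript engine performs this hoisting on its own is not something I
am claiming here.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Checked on every access: at() compares the index against size() each time.
long long sum_checked(const std::vector<int>& v, std::size_t first, std::size_t count) {
    long long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += v.at(first + i);
    }
    return total;
}

// Checked once: validate the whole range up front, then use unchecked access.
long long sum_hoisted(const std::vector<int>& v, std::size_t first, std::size_t count) {
    if (first > v.size() || count > v.size() - first) {
        throw std::out_of_range("sum_hoisted: range exceeds vector size");
    }
    const int* p = v.data() + first;
    long long total = 0;
    for (std::size_t i = 0; i < count; ++i) {
        total += p[i];            // no per-element range check
    }
    return total;
}
```

Both functions reject an invalid range with std::out_of_range, but the
second one pays for the check once rather than once per element.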

Think of it like a parallel if statement, evaluated like a compound
register: range check and conditional together in one cycle, a hybrid that
can be used in code evaluation to do the range checking in parallel. Maybe
a hybrid register could convert a range check into a data check, but that
would still need 2 cycles.

So I would like to have a discussion with like-minded people who would
like to improve the language, CPU and compiler interface with ideas to
reduce range-checking overhead.

I am thinking of possibly a new syntax that would allow a condition to be
evaluated and range checked on 2 variables, so basically
4 variables for range checking and then two for the condition, where these
statements could be executed in parallel.
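
This is not the proposed syntax, but as a point of comparison there is an
existing trick that already folds a two-sided range check into a single
comparison by using unsigned wrap-around. A minimal sketch, assuming
lo <= hi; the function names are hypothetical:

```cpp
// Straightforward form: two comparisons combined with a logical AND.
bool in_range_two_tests(int x, int lo, int hi) {
    return lo <= x && x < hi;
}

// Single-comparison form, valid when lo <= hi. Unsigned arithmetic wraps
// modulo 2^N, so any x below lo produces a large offset that fails the
// same test that rejects x >= hi.
bool in_range_one_test(int x, int lo, int hi) {
    const unsigned offset = static_cast<unsigned>(x) - static_cast<unsigned>(lo);
    const unsigned width = static_cast<unsigned>(hi) - static_cast<unsigned>(lo);
    return offset < width;
}
```

When lo and hi are loop invariants, width can be computed once outside the
loop, leaving one subtraction and one comparison per range check.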

C/C++ allows the comma syntax for additional statements.

What if, for {statements that are separated by a comma}, the compiler
would look at executing them in parallel to one another, and if that is not
supported, then sequentially? The thing is, one needs the overhead of 4
variables to reduce the number of load and store operations when there is
too much other logic, unless we change how data can be accessed, with
sequencing and stride skipping, so that one could execute directly on data
without it having to be placed in a register first.
We would just be writing code where we can see the micro sequential
improvements, and it would allow us to make use of them if we as humans see
that they exist.

Also, one of the reasons that one may want to run a program across
multiple cores is that there are more registers available, reducing loads
and stores, and it also allows for parallel evaluation. Maybe in future
cores could support micro-block parallel execution across cores
to solve these issues and reduce overheads, as the cores just need wires to
combine the parallel results.

So basically this is my current area of interest, apart from async/await
improvements, which I feel Microsoft got wrong and Java got right: it needs
to be left to the compiler to decide whether to inline or async a method.

I would appreciate any feedback or direction here on how one could
improve this situation.

I guess my mind will continue to tick away at the problem as life goes on,
on how this situation could be improved.

// Counts the occurrences of "matchme" in the null-terminated string str.
int match(const char* str) {

const char* matchme = "matchme"; // the literal already ends in '\0'

int countMatch = 0;

for (const char* ch = str; *ch != '\0'; ch++)
{
  // could make this an inline function.

  const char* chs = ch; // start comparing at the current position
  const char* mech = matchme;
  // No index range check: the '\0' in either string ends the comparison.
  // If str ends first, its '\0' differs from *mech and the loop stops.
  while (*mech != '\0' && *chs == *mech) {
     mech++; chs++;
  }
  if (*mech != '\0') {
     continue; // mismatch: skip the rest of the parent loop body
  }
  countMatch++; // reached the end of matchme, so this was a successful match
}
return countMatch;
}

Kind Regards,

Wesley Oliver

-- 
----
GitHub:https://github.com/wesleyolis
LinkedIn:https://www.linkedin.com/in/wesley-walter-anton-oliver-85466613b/
Blog/Website:https://sites.google.com/site/wiprogamming/Home
Skype: wezley_oliver
MSN messenger: wesley.olis_at_[hidden]

Received on 2021-01-18 02:12:35