Date: Wed, 20 Jan 2021 09:47:26 +0200
Hi,
Thank you for this information; I hadn't come across it before.
Unfortunately, things keep changing and my main focus is not processor
design. I have to focus on other areas and, in my spare time, look into how
processors have advanced beyond the two weeks of VHDL processor design we
did at university in 2007, and beyond assembly, which is not all that
difficult; it just requires a lot of manual reading of each instruction's
specific flags and details to ensure a 100% correct solution. Let's just
say I did pretty well in this area.
As the article says, it was limited to about 5 micro-ops per cycle back
then, whatever it may be now, and there are two kinds of optimization:
compiler and processor. I do see that micro-ops come with some predefined
determination: only combinations of micro-operations that have been
predetermined are supported.
A `masync` or `mparallel` execution-block constraint in a high-level
language, loaded with the list of micro-operation combinations supported at
that level, could allow an IDE to limit, suggest, and validate possible new
patterns against the existing predefined ones for compiler and processor
code optimization. Someone might then see new ways, or, for a piece of an
algorithm, a trade-off that neither the compiler nor the processor has an
optimization for, but that the human does.
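To make the block idea concrete, here is a purely hypothetical sketch of
what such a constraint might look like. `masync` is my own invented
keyword, not real C++, and nothing in this fragment would compile today:

```
// Hypothetical syntax, not real C++: the programmer asserts that the
// statements inside the block touch disjoint registers/data, so the
// toolchain (and the IDE) may validate them against the processor's
// predefined micro-op combinations and freely issue them in parallel.
masync {
    sum   += a[i];                    // data work
    count += 1;                       // loop-overhead bookkeeping
    found  = found || (a[i] == key);  // independent comparison
}
```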
Maybe hear my argument out here: C/C++ is supposed to be high-level, but as
low-level as possible, in my current understanding. However, the language
doesn't allow a modern IDE to expose lower-level processor design at a high
level, where a human could possibly find optimizations that have yet to be
thought of.
If an IDE were to support this concept, then upon compilation there could
be a crowd-sourced set of solutions to different patterns and algorithms,
which could be analysed and later included in compilers as automatic
optimization patterns that the specific individuals researching these
patterns may not have seen yet.
But out of the 8 billion people in the world, there may be a few more out
there who, with this functionality exposed, can take advantage of it.
Statistics from compiled, distributed programs could also be sent back to
the crowd-sourcing initiative, to determine the hit rate of different
patterns in real programs, for general-purpose computing optimizations.
I know there is the pipeline and all the rest. My one thought on exposing
`masync{}` around some code is that a more modern processor, whose
instruction set has a pre-processing pipeline, could process instructions
faster and parallelize them based on this constraint as much as possible
across registers and instructions (input, output -> result branch). It is
almost like converting these to micro-operations, but it sits in a dubious
middle ground.
Think of it this way: if I had 16 64-bit data-store registers and 8
selector buses, then I could run any combination of operations that uses
the 8 selector buses, provided there is instruction control logic available
to parallelize them. Say the instruction control logic can operate on all 8
buses at once, or on a subset with a stride; then I could run 4 comparison
units using 2 buses per unit, driving all 8 buses and 4 instructions in
parallel. All 8 buses are wired up to all 16 data-store registers, for
concurrency.
Maybe the processor does this parallelization all the time and just has
fixed sync-point blocks injected as an instruction, or the opposite.
Anyway, it is just an idea. I need to find the right deep dive into the ARM
instruction set for myself now, which I see at a glance is now quite
different in some ways. I hope this maybe opens you up to alternative ways
of thinking about how one could go about things, and to maybe
crowd-sourcing more optimizations.
Kind Regards,
Wesley Oliver
On Tue, Jan 19, 2021 at 6:51 PM Thiago Macieira via Std-Proposals <
std-proposals_at_[hidden]> wrote:
> On Monday, 18 January 2021 08:59:09 PST Wesley Oliver via Std-Proposals
> wrote:
> > Only way I see that happen in the future is if a processor can execute
> two
> > instructions simultaneous, involving different registers.
>
> They do that. Intel processors for the past 10 years have been able to
> execute
> up to 5 instructions (micro-ops) per cycle. That's the theoretical maximum
> and
> won't happen all the time, but it's fairly common to see code execute 2 or
> 3
> instructions per cycle. I believe this is common on AArch64 systems too
> but I
> don't have direct experience.
>
> This is especially true of loop-overhead instructions, since more often
> than
> not they don't depend on the data itself. Range checking certainly
> qualifies
> for this. In fact, because the processor runs pipelined, it will have
> concluded the loop has terminated or not terminated 10 to 100 cycles
> before
> the instructions doing the actual work have reached there.
>
> And this is even more true of short loops like yours because the entire
> work
> fits in the CPU structures designed to detect loops and execute them
> REALLY
> fast. See https://www.anandtech.com/show/2594/4 (and note how this
> article is
> from 2008!).
>
> So please study the state of the art first. Your initiative to improve
> things
> is commendable, but if you propose things that were done 15 years ago,
> you'll
> be wasting your time.
>
> [For the nitpickers: yes, I know LSD was turned off in some
> microarchitectures
> because it had problems. That doesn't invalidate the point.]
>
> --
> Thiago Macieira - thiago (AT) macieira.info - thiago (AT) kde.org
> Software Architect - Intel DPG Cloud Engineering
>
>
>
> --
> Std-Proposals mailing list
> Std-Proposals_at_[hidden]
> https://lists.isocpp.org/mailman/listinfo.cgi/std-proposals
>
--
----
GitHub: https://github.com/wesleyolis
LinkedIn: https://www.linkedin.com/in/wesley-walter-anton-oliver-85466613b/
Blog/Website: https://sites.google.com/site/wiprogamming/Home
Skype: wezley_oliver
MSN messenger: wesley.olis_at_[hidden]
Received on 2021-01-20 01:47:42