Subject: [std-proposals] Generic/Auto types in C (chain from '.' operate on pointers)
From: J Decker (d3ck0r_at_[hidden])
Date: 2020-03-10 18:04:01


TL;DR: musings on the behavior of JS JIT compilation vs. C AOT compilation, and
side roads in between

On Tue, Mar 10, 2020 at 6:13 AM Bjarne Stroustrup <bjarne@**.com> wrote:

>
> On 3/10/2020 5:01 AM, J Decker via Liaison wrote:
>
> This really sort of opens the door for auto types in C. I know 'auto'
> is currently a reserved word like 'register'; maybe 'var' would be a good
> substitute.
>
> Eliminating the pointer vs. object distinction has nothing to do with
> "auto".
>
> Before re-purposing "auto" in C++, we did a survey of many millions of
> lines of code and found "auto" use limited to test programs and a couple of
> bugs (really a couple, not a couple of hundreds).
>
> I still think that the motivation for the suggested change is weak and
> weakly articulated. Who benefits and how? What's the cost (tools, teaching,
> etc.)?
>
That email was quickly written, because it was off-topic to the thread and
audience. It starts with using '.' as a universal member-access
operator, which is what 'this sort of opens the door...' referred to; the
context of the rest of the message was about weakly typed pointers with
member accesses when using ->.

function f( var a, var b ) {
    a.me = &b.next;
    b.next = a;
}

Obviously, in the above code, under the current C/C++ rules, all of those
member accesses are on a structure by value and not through a pointer, so
you're limited in what you can pass to that function (although in C++ the
above could be a & (reference), it can't be a pointer). And really, passing
a structure by value is entirely the wrong thing to do in the above case;
the code is instead written using the proposed '.' operator to dereference
the pointers.

Some further thoughts I had on functions: in C++, function overloading
is done with name mangling, which is effectively handled on a case-by-case
basis. There's not a lot of reason that these functions have to be named.
I was envisioning more of a function as a group of entry points that could
be selected from, if there is more than one combination of inputs; but there
is some work that would have to be done in the dynamic-linking case: some way
to communicate which mangled names result. Alternatively, it could be
that auto functions can only be used within a library unit, and cannot
themselves be exported, since resolving the types would require knowing the
caller before the library is ever loaded into memory; certainly well-known
type-conversion entry points could still be exported by a developer.

A function pointer to a function with abstract arguments can't really be a
thing, because it's not a singular function; perhaps access to such a
function could be assigned to a function pointer which itself has the proper
argument types specified, especially if using a cast.

Maybe some historical personal context...
I've spent a lot of time in C/C#, dabbled in C++, and went back to C, while
keeping 'this', but as a named parameter to functions that operate on that
type of thing, and manually name-mangling overloaded functions like
ApplyTransformToMatrix and ApplyTransformToVector. I went to C#, tried to
compile my C as CLR/CLI metacode, and found that C++/CLI's '^' type, which
is a 'reference', is entirely incompatible with any existing pointer code;
everything would have to be wrapped to be exposed, and there just wasn't
that much utility to be gained from that. The CLI/CLR environment (compile
to meta-code) is C++ only, and has no C compiler.
The library that I have had for a long time ( https://github.com/d3x0r/sack )
wasn't tiny, but 99% of the conflicts came from the automatic cast of void*
to a pointer type that C does, especially where memory is allocated, which
C++ requires to be an explicit cast. Going through that process I found
several tiny bugs in corner cases where the types passed and converted
weren't specified correctly; but there were some 200k+ errors fixed, mostly
by adopting the macros #define New(type) ((type*)malloc(sizeof(type))) and
#define NewArray( type, count ) ((type*)malloc((count)*sizeof(type))), and
then, instead of using void* for callback instance parameters, using
uintptr_t (once MS got C99; it used to be called PTRSZVAL (pointer-size
value)). This causes the same sort of type-conversion error in C as in C++,
so the casts get applied the same way. (Technically, the macro doesn't call
'malloc' but 'Allocate', which is
void *AllocateEx( size_t [/*in debug*/, char const *, int ] ) with
#define Allocate(n) AllocateEx( n [/*in debug*/, __FILE__, __LINE__ ] ), so
I know what code is responsible for having created a thing; usually it's
not actually where the allocation happened, but the caller of the thing
that did the allocation that's to blame.) I also decorated modules within
the library with namespace SACK { namespace timers { ... } } instead of
extern "C" {, which improved the ability to generate documentation with
Doc-o-matic ( http://sack.sf.net ).

I've known of JS for a very long time, but never really dabbled in it,
figuring one day they would get a 'real' language to do that job. Then I
was stumbling around and found a Douglas Crockford video, 'JavaScript - the
Good Parts', from around 2013(?). I also learned there was 'Node', which
would let me just run scripts, because not everything we do has a GUI; and
the scripts written there could also be used on webpages, so finally
'write once, run anywhere' is really looking feasible. Along that line
there's also WASM, which serves for 'compile once, run anywhere'. Plus, Node
is a single executable dependency. And really, what I learned is ES2015
(ECMAScript), not really JS, because a ton of features were added that make
it a better, more solid language, and the old things are destined to be
broken.

So, JS and C feel a lot alike, in that it's fairly trivial to port
algorithms back and forth; the data structures are a different matter. If
you entirely omit 'eval()' from JS, it is nearly C (okay, it's not at all,
in any way, calm down :) ). Since the things written in the script aren't
going to change, there's only one compilation that really NEEDS to be done;
although JS is, every time, compiled anew.

In the larger view, however, there's a lot I've learned about the
simplicity of development when you don't have to re-specify type
information, when really, the information you entered already defined the
type of the thing. 123 - it's a number. 54.0 - it's a float. "asdf" - it's
a string. { } - it's an object, made up of all these simple types; although
there are a lot of people in the world who prefer to have that decoration
so tools can help them choose appropriate things(?). TypeScript and Dart,
for example, both bring the type information (back) to JS, but they both
transpile TO JS. ('Transpile' is a new word to the C/C++ world, I'm sure;
it means to take source of one kind and turn it into source of another
kind; there are lots of dialects of languages that compile to JS but are
not themselves JS.) In that compilation the type information gets lost
anyway, so was it really needed in the first place? Yes, again, to provide
advisories to developers that this thing takes this sort of other thing as
an operand. Transpilation can also be a minimizing step that turns all the
identifiers used internally, and not public, into single-character tokens;
this reduces the size of the product transported, and some expressions that
are 'const' can be pre-evaluated so lots of chunks of code get omitted -
nearly as good as the C preprocessing directive #if FEATURE_ENABLED.

Maybe it's a view gathered from writing a C preprocessor (for when
compilers didn't support __VA_ARGS__, and something else?), and then parsing
JSON, but it wasn't really until I started actually using JSON in JS
that I got an appreciation for the expressiveness of the data itself being
its own type. https://github.com/d3x0r/jsox#pretty-images I wanted to
extend JSON slightly, and found some variations, like JSON5 (
https://github.com/json5/json5#json5--json-for-humans ), which adds
comments and several other convenience features, like unquoted identifiers
in objects: {a:"apple"} vs pure JSON { "a":"apple" }. Things like
comments (// ... \n or /* ... */, # ... \n ) are a simple token analysis
to add. But then you can also express dates as numbers, if you add just a
few more tokens to number parsing (which already includes '.+-eE'; add
':ZT') and have a complex 'Date()' type. I added typed objects like
'type{ .. }': since a string followed by an object without a colon or comma
is an invalid parsing state in JSON, it was a simple place to add an
exception handler. This let me revive objects as a specific prototype.
Although in C 'prototype' doesn't really mean a lot, it is still a string
returned in the parsed graph to indicate the object type, if one cared to
inspect it. There are some convenience types like TypedArray binary buffers
and Map(), which, when serialized to JSON, looks like an object with
key:value entries (it's sort of 'just an object'), but as a map it is kept
as a general key-lookup store with its own properties, different from an
object.

This is the list of 'types' that can be expressed in JSOX, although with
tagged objects/strings user types also exist... the value container mostly
just tracks the string that was parsed in, although it does have space
for a largest_int/largest_float binary representation of the value; many
of the values are just the primitive valueType enum alone, like
true/false. JSOX, like JSON, expresses only the data-structure part of JS,
and does not transport code (algorithms).

  -------

enum jsox_value_types {
  JSOX_VALUE_UNDEFINED = -1
, JSOX_VALUE_UNSET = 0
, JSOX_VALUE_NULL   //= 1 no data
, JSOX_VALUE_TRUE   //= 2 no data
, JSOX_VALUE_FALSE  //= 3 no data
, JSOX_VALUE_STRING //= 4 string
, JSOX_VALUE_NUMBER //= 5 string + result_d | result_n
, JSOX_VALUE_OBJECT //= 6 contains
, JSOX_VALUE_ARRAY  //= 7 contains

  // up to here is supported in JSON
, JSOX_VALUE_NEG_NAN      //= 8 no data
, JSOX_VALUE_NAN          //= 9 no data
, JSOX_VALUE_NEG_INFINITY //= 10 no data
, JSOX_VALUE_INFINITY     //= 11 no data
, JSOX_VALUE_DATE         //= 12 comes in as a number, string is data
, JSOX_VALUE_BIGINT       //= 13 string data, needs bigint library to process... ('n'-suffixed number)
, JSOX_VALUE_EMPTY        //= 14 no data; used in [,,,] as placeholder of empty
, JSOX_VALUE_TYPED_ARRAY  //= 15 string is base64 encoding of bytes
, JSOX_VALUE_TYPED_ARRAY_MAX = JSOX_VALUE_TYPED_ARRAY + 12 //= 27 string is base64 encoding of bytes
};

struct jsox_value_container {
  char *name;        // name of this value (if it's contained in an object)
  size_t nameLen;
  enum jsox_value_types value_type; // value from above indicating the type of this value
  char *string;      // the string value of this value (strings and number types only)
  size_t stringLen;

  int float_result;  // boolean whether to use result_n or result_d
  union {
    double result_d;
    int64_t result_n;
    //struct json_value_container *nextToken;
  };
  PDATALIST contains;   // list of struct json_value_container that this contains
  PDATALIST *_contains; // actual source datalist(?)
  char *className;      // if VALUE_OBJECT or VALUE_TYPED_ARRAY, this may be non-NULL, indicating what the class name is
  size_t classNameLen;
};

-------

So, now that there's this technology developed for JS, where the compilation
is done just-in-time, especially on hot paths, the engine internally must
know all the types of the things it's been operating on, so if it receives
something new, the compiled code has to change; this is potentially an issue
with compiled C code (var-typed returns)...

There are a lot of meta-sizes of types in C/C++ - char, short, long, long
long - which the programmer has to be able to select. In a certain extreme
(everything var), the precision required could be analyzed by the compiler
based on the inputs it knows a thing can receive. But there are times when
the specific size of a value is required to be specified, as required for
serialization for communication. This is why I'm doing my object-storage
driver in C and not JS; although JS has Typed Arrays that provide a literal
view of values in memory, like Uint8Array() or Uint32Array(), and they
could be used to generate array-like objects for access to set specific
bits in specific places, that surely cannot compile into
'uint8_t disk[4096]; disk[1234] = n;'. And certainly there are occasions...

var myFunc( var a, var b ) {
   var d, e, f;
   // I mean, it's just code; the only thing you don't specify HERE is the type...
}

int main( void ) {
   struct {
      int d;
      float f;
      char e[123];
   } data1, data2;
   myFunc( &data1, &data2 ); // all the type information comes from here.
}

Automatic return types are an issue, since, depending on the value of a
string perhaps, the result type would be different, and non-deterministic; I
would elect that a first cut should prevent automatic returns. The following
routine could instead return a user type like the above
jsox_value_container, but shouldn't put that burden on the language...

// Maybe, for functions, the old K&R style of declaration - without the
// following type declarations - could default to var? zlib still uses
// K&R-style declarations, but those have type information.
var parseString( /*var*/ jsonToken ) {
   // if the token looks like a number, return (int)atoi( jsonToken );
   // if it has quotes, return (char*)stripQuotes( jsonToken );
}

var doSomething() {
    var a = parseString( "1234" );
    var b = parseString( "\"Hello World\"" );
    if( strcmp( a, b ) == 0 ) { // this would fail: strcmp, as defined,
                                // takes a char*, and the type of 'a' is not,
                                // at this time, a char* but would be an int;
                                // even if it worked
    }
}

The expression 'parseString' is not a valid function pointer, although
'(var(*)(char*))parseString' might be. The whole return-type-as-var idea is
also only a new thought; it would require returning some sort of
value-container type, maybe, which would be more like a variant in BASIC...
(along the way I sort of talked myself out of variant returns)

The 'advantage' of not specifying types everywhere is that if I change
and add a field to an object, or use an entirely differently constructed
object, everything that worked before still works. The notable
disadvantage is the cognitive overhead of having to remember what sort of
thing some function expects. In JS, the nested function context helps
manage this, as do in-struct function definitions, and that is not
something C has.

There's a finite graph produced when a source is compiled; this graph
includes all the sources and sinks of all data structures, so it should be
possible to check that the objects passed to the routines indeed have all
the fields mentioned where actually used, or else an error would be
generated.

I suppose the other feature that JS has, which C lacks, is immediate
declaration of objects, like '{ member: "value", date: time() };', which
inherits the field types from the specified code or function results.
Working with JS, the pre-definition of 'struct xyz { ... }' and then the
specification of struct xyz* or struct xyz EVERYWHERE seems really
redundant, when I know that the JS interpreter is able to infer that,
because that's what the data passed to it was... so I'd just have to
specify the definition in one place... and that's perhaps even as much a
token in the language as '123' is.

However, none of the above is really possible as long as -> is required for
some objects and '.' for others... because you'd have to write two
functions like the first...

function f_stat( var a, var b ) {
    a.me = &b.next;
    b.next = a;
}
function f_ptr( var a, var b ) {
    a->me = &b->next;
    b->next = a;
}

Or maybe the '*' is forced to remain external, so you specify 'var *' and
'var'; but that starts to defeat the whole 'automaticness' of variables.

And none of this is really much more than musing about the ways code is
executed on a machine. JS is actually able to run at native speed on
hardware, without interpretation; it is also able to receive dynamic packets
from a network - an arbitrary string defining an object structure - and
remain able to dynamically look up the fields and build types so that
existing hot-path compiled code can quickly chew on the data.

I would think that, within the scope of a C program, the lifetime of
objects would be knowable by the compiler by traversing the graph of
references to each object. But probably having 'malloc', and allowing
the programmer to script new objects that can't be tracked, would
break that?

There is certainly some waste in having private routines that are only
usable internally, when those same sorts of primitives are probably useful
to other consumers of a library - but then again, probably not; if they
were, one could provide something like standard iterable interfaces to
them.



STD-PROPOSALS list run by herb.sutter at gmail.com

Standard Proposals Archives on Google Groups