TL;DR musings on the behavior of the JS JIT compiler vs. C AOT compilation, and sideroads in between

On Tue, Mar 10, 2020 at 6:13 AM Bjarne Stroustrup <bjarne@**.com> wrote:


On 3/10/2020 5:01 AM, J Decker via Liaison wrote:
  This really sort of opens the door for auto types in C.  I know 'auto' is currently a reserved word like 'register';  maybe 'var' would be a good substitute.

Eliminating the pointer vs. object distinction has nothing to do with "auto".

Before re-purposing "auto" in C++, we did a survey of many millions of lines of code and found "auto" use limited to test programs and a couple of bugs (really a couple, not a couple of hundreds).

I still think that the motivation for the suggested change is weak and weakly articulated. Who benefits and how? What's the cost (tools, teaching, etc.)?

That email was quickly written, because it was off-topic for the thread and its audience.  This starts with using '.' as a universal member access operator, which is what 'this really sort of opens the door...' referred to; the context of the rest of the message was weakly typed pointers, with member accesses that currently require ->.

function f( var a, var b ) {
    a.me = &b.next;
    b.next = a;    
}

Obviously, in the above code, under the current C/C++ reading, all of those member accesses are on a structure by value, not through a pointer, so you're limited in what you can pass to that function (although in C++ the parameters could be & references, they can't be pointers).  And really, passing a structure by value is entirely the wrong thing to do in this case; the function is instead written using the proposed '.' operator, which would also dereference pointers.
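For concreteness, here's a minimal sketch of what the pointer reading of f demands in today's C; the struct and its field names are illustrative (note 'me' stores the address of another node's 'next' field, so it is a pointer-to-pointer):

struct node {
    struct node **me;   /* holds the address of some node's 'next' field */
    struct node *next;
};

/* today's spelling: '->' is required because a and b are pointers */
void f( struct node *a, struct node *b ) {
    a->me = &b->next;
    b->next = a;
}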

Some further thoughts on functions: in C++, function overloading is done with name mangling, which is effectively a case-by-case mechanism.  Is there really much reason these functions have to be individually named?  I was envisioning a function more as a group of entry points that could be selected from, when there is more than one combination of inputs; but there is some work that would have to be done for the dynamic linking case - some way to communicate which mangled names result.  Alternatively, 'auto' functions could be usable only within a library unit and not themselves exportable, since resolving the types would require knowing the caller before the library is ever loaded into memory; certainly a developer can export well-known type-conversion entry points explicitly.
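For what it's worth, C11's _Generic already gives a compile-time taste of 'a function as a group of entry points' - selection happens by argument type, with no name mangling - though it only works through a macro and a fixed list of entry points; a minimal sketch:

#include <stdio.h>

static void print_int( int i )       { printf( "int: %d\n", i ); }
static void print_double( double d ) { printf( "double: %g\n", d ); }

/* one 'function' that is really a group of entry points,
   selected at compile time by the argument's type */
#define print(x) _Generic( (x), int: print_int, double: print_double )(x)

int main( void ) {
    print( 42 );    /* selects print_int */
    print( 3.14 );  /* selects print_double */
    return 0;
}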

A function pointer to a function with abstract arguments can't really be a thing, because it isn't a single function; perhaps one resolved entry point of such a function could be assigned to a function pointer that has the proper argument types specified, especially via a cast.

Maybe some historical personal context... 
I've spent a lot of time in C/C#, dabbled in C++, and went back to C, keeping 'this' but as a named parameter to the functions that operate on that type of thing, and manually name-mangling overloaded functions like ApplyTransformToMatrix and ApplyTransformToVector.  I went to C#, tried to compile my C as CLR/CLI metacode, and found that C++/CLI's '^' reference type is entirely incompatible with any existing pointer code; everything would have to be wrapped to be exposed, and there just wasn't that much utility to be gained from it.  The CLI/CLR environment (compile to metacode) is C++ only, and has no C compiler.

The library that I've had for a long time ( https://github.com/d3x0r/sack ) wasn't tiny, and 99% of the conflicts came from the automatic cast of void* to a pointer type that C does, especially where memory is allocated, and which C++ requires to be explicit.  Going through that process I found several tiny bugs in corner cases where the type passed and converted wasn't specified correctly; but there were some 200k+ errors to fix, mostly handled by adopting the macros #define New(type) (type*)malloc(sizeof(type)) and #define NewArray( type, count ) (type*)malloc(count*sizeof(type)), and then, instead of using void* for callback instance parameters, using uintptr_t (before MS got C99 it was called PTRSZVAL - pointer-size value).  This causes the same sort of type conversion error in C as in C++, so the casts get applied the same way.

(Technically the macro isn't over 'malloc' but over 'Allocate', which is void* AllocateEx( size_t [/*in debug*/, char const *, int ] ), wrapped as #define Allocate(n) AllocateEx( n [/*in debug*/, __FILE__, __LINE__ ] ), so I know what code is responsible for having created a thing; and usually it's not actually where the allocation happened, but the caller of the thing that did the allocation, that's to blame.)  I also decorated modules within the library with namespace SACK { namespace timers { ... } } instead of extern "C" { ... }, which improved the ability to generate documentation with Doc-o-matic ( http://sack.sf.net )
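Pulled out of the prose, the allocation wrappers look roughly like this (a simplified sketch; in the real library the file/line parameters exist only in debug builds, and AllocateEx does leak tracking rather than plain malloc):

#include <stdlib.h>

void *AllocateEx( size_t n, char const *file, int line ) {
    (void)file; (void)line;  /* the real version records these for diagnostics */
    return malloc( n );
}

#define Allocate( n )           AllocateEx( (n), __FILE__, __LINE__ )
#define New( type )             ((type*)Allocate( sizeof( type ) ))
#define NewArray( type, count ) ((type*)Allocate( (count) * sizeof( type ) ))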

I've known of JS for a very long time, but never really dabbled in it, figuring one day they would get a 'real' language to do that job.  Then I was stumbling around and found a Douglas Crockford video, 'JavaScript - The Good Parts', from around 2013(?).  I also learned there was Node, which would let me just run scripts, because not everything we do has a GUI; and the scripts written there could also be used on webpages, so finally 'write once, run anywhere' is really looking feasible.  And along that line there's WASM, which serves for 'compile once, run anywhere'.  Plus, Node is a single-executable dependency.  Really, what I learned is ES2015 (ECMAScript), not JS as such, because a ton of features were added that make it a better, more solid language, and the old things are destined to be broken.

So, JS and C feel a lot alike, in that it's fairly trivial to port algorithms back and forth; the data structures are a different matter.  If you entirely omit 'eval()' from JS, it is nearly C (okay, it's not at all, in any way, calm down :) ).  Since the things written in the script aren't going to change, there's only one compilation that really NEEDS to be done; although JS is, always, every time, compiled anew.

In the larger view, however, there's a lot I've learned about the simplicity of development when you don't have to re-specify type information, when the information you entered already defined the type of the thing.  123 - it's a number.  54.0 - it's a float.  "asdf" - it's a string.  { } - it's an object, made up of all these simple types.  Although there are a lot of people in the world who prefer to have that decoration, so tools can help them choose appropriate things(?).  TypeScript and Dart both bring the type information (back) to JS, but they both transpile TO JS.  ('Transpile' is a new word to the C/C++ world, I'm sure; it means taking source of one language and producing source of another; there are lots of dialects of languages that compile to JS but are not themselves JS.)  In that compilation the type information gets lost anyway, so was it really needed in the first place?  Yes, again, to advise developers that this thing takes this sort of other thing as an operand.  Transpilation can also be a minimizing step that renames all the identifiers used internally, and not public, into single-character tokens; this reduces the size of the product transported, and some expressions that are 'const' can be pre-evaluated so whole chunks of code get omitted - nearly as good as the C preprocessor directive #if FEATURE_ENABLED.
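The C side of that comparison, for reference - the preprocessor drops the guarded block before compilation, much as a minifier drops code guarded by a pre-evaluated 'const':

#define FEATURE_ENABLED 0

#if FEATURE_ENABLED
/* this block never reaches the compiler when the flag is 0 */
void feature_code( void ) { }
#endif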

Maybe it's a view gathered from writing a C preprocessor (for when compilers didn't support __VA_ARGS__), and then parsing JSON, but it wasn't really until I started actually using JSON in JS that I got an appreciation for the expressiveness of the data itself being its own type.  https://github.com/d3x0r/jsox#pretty-images  I wanted to extend JSON slightly, and found some existing variations, like JSON5 ( https://github.com/json5/json5#json5--json-for-humans ), which adds comments and several other convenience features like unquoted identifiers in objects: {a:"apple"} vs pure JSON { "a":"apple" }.  Things like comments ( // ... \n or /* ... */, # ... \n ) are a simple token analysis to add.  But then you can also express dates as numbers, if you add just a few more tokens to number parsing (which already includes '.+-eE'; add ':ZT'), and get a complex 'Date()' type.  I added typed objects like 'type{ ... }': since a string followed by an object, without a colon or comma, is an invalid parsing state in JSON, it was a simple place to add an exception handler.  This lets me revive objects with a specific prototype.  Although 'prototype' doesn't really mean a lot in C, it is still a string returned in the parsed graph to indicate the object type, if one cares to inspect it.  There are also convenience types like TypedArray binary buffers and Map(), which, when serialized, looks like an object with key:value entries (it's sort of 'just an object'), but as a map is kept as a general key-lookup store with properties different from an object's.
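An illustrative snippet using the extensions just described (unquoted keys, comments, a date literal, a typed object; the 'vec2' prototype name is made up for the example):

{
    a : "apple",                   // unquoted identifier, line comment
    when : 2020-03-10T06:13:00Z,   // date: number parsing extended with ':ZT'
    pos : vec2{ x:1, y:2 }         // typed object, revived with prototype 'vec2'
}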

This is the list of 'types' that can be expressed in JSOX (with tagged objects/strings, user types also exist).  The value container mostly just tracks the string that was parsed in, although it does have space for a largest_int / largest_float binary representation of the value; many values are just the primitive valueType enum alone, like true/false.  JSOX, like JSON, expresses only the data-structure part of JS; it does not transport code (algorithms).

  ------- 
 
enum jsox_value_types {
    JSOX_VALUE_UNDEFINED = -1
    , JSOX_VALUE_UNSET = 0
    , JSOX_VALUE_NULL         //= 1 no data
    , JSOX_VALUE_TRUE         //= 2 no data
    , JSOX_VALUE_FALSE        //= 3 no data
    , JSOX_VALUE_STRING       //= 4 string
    , JSOX_VALUE_NUMBER       //= 5 string + result_d | result_n
    , JSOX_VALUE_OBJECT       //= 6 contains
    , JSOX_VALUE_ARRAY        //= 7 contains

    // up to here is supported in JSON
    , JSOX_VALUE_NEG_NAN      //= 8 no data
    , JSOX_VALUE_NAN          //= 9 no data
    , JSOX_VALUE_NEG_INFINITY //= 10 no data
    , JSOX_VALUE_INFINITY     //= 11 no data
    , JSOX_VALUE_DATE         //= 12 comes in as a number; string is the data
    , JSOX_VALUE_BIGINT       //= 13 string data; needs a bigint library to process ('n'-suffixed number)
    , JSOX_VALUE_EMPTY        //= 14 no data; used in [,,,] as a placeholder for empty
    , JSOX_VALUE_TYPED_ARRAY  //= 15 string is base64 encoding of bytes
    , JSOX_VALUE_TYPED_ARRAY_MAX = JSOX_VALUE_TYPED_ARRAY + 12  //= 27; the typed-array subtypes, string is base64 encoding of bytes
};

struct jsox_value_container {
    char *name;           // name of this value (if it's contained in an object)
    size_t nameLen;
    enum jsox_value_types value_type; // value from above indicating the type of this value
    char *string;         // the string value of this value (string and number types only)
    size_t stringLen;

    int float_result;     // boolean: whether to use result_d or result_n
    union {
        double result_d;
        int64_t result_n;
        //struct json_value_container *nextToken;
    };
    PDATALIST contains;   // list of struct jsox_value_container that this contains
    PDATALIST *_contains; // actual source datalist(?)
    char *className;      // if VALUE_OBJECT or VALUE_TYPED_ARRAY, this may be non-NULL, naming the class
    size_t classNameLen;
};

-------

So, now that there's this technology developed for JS where compilation is done just-in-time, especially on hot paths, it internally must know the types of all the things it's been operating on; if it receives something new, the compiled code has to change.  This is potentially an issue for compiled C code (a 'var'-typed return)...

There are a lot of meta-sizes of types in C/C++ - char, short, long, long long - which the programmer has to be able to select.  In a certain extreme (everything 'var'), the precision required could be analyzed by the compiler based on the inputs it knows a thing can receive; but there are times when the specific size of a value is required, as when serializing for communication.  This is why I'm doing my object-storage driver in C and not JS.  JS does have TypedArrays that provide a literal view of values in memory, like Uint8Array() or Uint32Array(), and they could be used to build array-like objects that set specific bits in specific places, but that surely cannot compile down to 'uint8_t disk[4096]; disk[1234] = n;'.
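A small sketch of that serialization case, where the explicit sizes are the point and no inference could substitute for them (the layout and offsets are illustrative):

#include <stdint.h>
#include <string.h>

struct record {
    uint32_t id;     /* exactly 4 bytes on the disk/wire */
    uint8_t  flags;  /* exactly 1 byte */
};

void store( uint8_t disk[4096], struct record const *r ) {
    memcpy( &disk[1234], &r->id, sizeof r->id );  /* fixed offset, fixed size */
    disk[1238] = r->flags;
}

And certainly there are occasions where nearly everything could be 'var':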


var myFunc(  var a, var b ) {
   var d, e, f;
   // I mean, it's just code, the only thing you don't specify HERE is the type... 
}

int main( void ) {
   struct {
      int d;     
      float f;
      char e[123];
   } data1, data2;
   myFunc( &data1, &data2 );  // all the type information comes from here.
}
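In effect, the call site would drive the compiler to stamp out something like the following concrete instance (the mangled name and struct tag are hypothetical):

struct data_t { int d; float f; char e[123]; };

/* the instance the compiler would effectively generate for
   myFunc( &data1, &data2 ) above */
void myFunc_data_t_data_t( struct data_t *a, struct data_t *b ) {
    /* the locals d, e, f resolve to concrete types here too */
}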

Automatic return types are an issue, since, depending on the value of a string perhaps, the result type differs and is non-deterministic; I would elect that a first cut prevent automatic returns.  The following routine could instead return a user type like the jsox_value_container above, but that burden shouldn't be put on the language...

// maybe, on functions, the old K&R style of untyped parameter declaration, without the following type declarations, could default to var?  zlib still carries K&R-style declarations, but with the type information.
var parseString( /*var*/ jsonToken ) {
   // if token looks like a number, return (int)atoi( jsonToken );
   // if it has quotes return (char*)stripQuotes( jsonToken );
}

var doSomething() {
    var a = parseString( "1234" );
    var b = parseString( "\"Hello World\"" );
    if( strcmp( a, b ) == 0 ) { // this would fail: strcmp, as defined, takes char*, and the type of 'a' at this point is not char* but int; even if it worked
    }
}

The expression 'parseString' is not a valid function pointer, although '(var(*)(char*))parseString' might be.  The whole return-type-as-var idea is also only a new thought; it would require returning some sort of value container type, more like a variant in BASIC...  (along the way I sort of talked myself out of variant returns)
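A sketch of what such a value-container return could amount to in today's C, along the lines of the jsox_value_container above (the names are hypothetical, and every caller has to check the tag before use):

#include <stdlib.h>
#include <string.h>

enum val_type { VAL_INT, VAL_STRING };

struct val {
    enum val_type type;  /* tag: which union member is live */
    union {
        int   n;
        char *s;
    };
};

struct val parseString( char const *token ) {
    struct val v;
    if( token[0] == '"' ) {
        size_t len = strlen( token ) - 2;  /* strip surrounding quotes */
        v.type = VAL_STRING;
        v.s = (char*)malloc( len + 1 );
        memcpy( v.s, token + 1, len );
        v.s[len] = 0;
    } else {
        v.type = VAL_INT;
        v.n = atoi( token );
    }
    return v;
}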

The 'advantage' of not specifying types everywhere is that if I change an object - add a field, or use an entirely differently constructed object - everything that worked before still works.  The notable disadvantage is the cognitive overhead of having to keep track of what sort of thing a function expects.  In JS, the nested function context helps manage this, as does in-object function definition; C has neither.

There's a finite graph produced when a source is compiled; this graph includes all the sources and sinks of all data structures, so the compiler should be able to check that the objects passed to a routine indeed have all the fields actually used, or generate an error.

I suppose the other feature JS has, which C lacks, is immediate declaration of objects, like '{ member: "value", date: time() }', which inherits the field types from the specified code or function result.  Working with JS, the pre-definition of 'struct xyz { ... }' and then the specification of struct xyz* or struct xyz EVERYWHERE seems really redundant, when I know the JS interpreter is able to infer it because that's what the data passed to it was; I'd only have to specify the definition in one place... perhaps even as a literal token in the language, like '123'.
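For the record, C99 gets partway there with compound literals and designated initializers, though the struct still has to be declared somewhere first; a small sketch:

#include <time.h>

struct xyz { char const *member; time_t date; };

void use( void ) {
    /* the one place the shape is declared; the literal infers nothing,
       but at least it reads like the JS object */
    struct xyz obj = (struct xyz){ .member = "value", .date = time( NULL ) };
    (void)obj;
}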

However, none of the above is really possible as long as -> is required for some objects and '.' for others... because you'd have to write two functions like the first...

function f_stat( var a, var b ) {
    a.me = &b.next;
    b.next = a;
}
function f_ptr( var a, var b ) {
    a->me = &b->next;
    b->next = a;
}

Or maybe the '*' is forced to remain explicit, so you specify 'var *' vs 'var'; but that starts to defeat the whole 'automaticness' of variables.

And none of this is really much more than musing about the ways code gets executed on a machine.  JS is actually able to run at native speed on hardware, without interpretation; it is also able to receive dynamic packets from a network - an arbitrary string defining an object structure - and remain able to look up the fields dynamically, building types so that existing hot-path compiled code can quickly chew on the data.

I would think that within the scope of a C program the lifetime of objects would be knowable by the compiler, by traversing the graph of references to each object.  But probably 'malloc', and letting the programmer construct new objects that none of this can track, would break that?

There is certainly some waste in having private routines that are only usable internally, when those same sorts of primitives could probably be useful to other consumers of a library; but then again, probably not - and if they were, one could provide standard iterable interfaces to them or something.