Belfast 2019 SG19 (Machine Learning), Friday afternoon after break

Automatic Differentiation

Present:

* Benjamin Saks
* Vincent Reverdy
* Antony Peacock
* Duygu Cakmak
* Domagoj Šarić
* Guilherme Hartmann
* Vito Castellana
* Rob Simpson
* Michael Wong (chair)
* Mihail Mihailov
* Lukasz Wojakowski
* Matti Rintala
* Matthias Kretz
* Marco Foco (coauthor)
* Mark Hoemmen (mfh) (scribe)
* Morris Hafner (MH)
* Mark Zeren
* Vasil Vasilev (speaker)

[Scribe note: This is a presentation not based on a paper submitted to the pre-Belfast 2019 mailing.]

http://wiki.edg.com/pub/Wg21belfast/SG19/SG19_Automatic_Differentiation_in_C.pdf

VV: [There are 3 ways to compute derivatives: numerical, symbolic, and algorithmic. Algorithmic differentiation fixes the drawbacks of the previous two, at the cost of introducing extra software dependencies.]
VR: Why do you say that algorithmic differentiation introduces extra software dependencies? [Does it] currently [do so]?
VV: If you want to produce the derivative, then you need software support -- either operator overloading, or compile-time transformation.
VR: OK. You need something on top of compilers to be able to ....
VV: Yes.
AP: It can also be done by hand.
VV: Yes. That's a human dependency[, as opposed to a software dependency.]
AP: Not saying it's a good idea.
VV: [Algorithmic/Automatic differentiation (1 of 2)]
VV: [Algorithmic/Automatic differentiation (2 of 2): Implementations based on operator overloading/source transformation]
MF: [copresenting] [Dual numbers] are a mathematical tool [for expressing forward automatic differentiation].
MF: [AD with Dual Numbers (1 of 2)]
VV: Just to add: For example, the way you want to express the derivative computation in pure mathematics, you can express it in the field of dual numbers. It "overloads" the ... operator ... to decompose naturally. .... This is a pure mathematical property which we use to simplify the ... algorithm.
VR: What's the difference between mathematical and symbolic differentiation?
VV: Symbolic is like Mathematica -- type in an expression, get its derivative as an expression. Automatic differentiation gets you a function that returns a value.
MF: [clarifies]
VV: Think of a template pattern. Rules -> pattern with guarantees. This is the same: You have a pattern, the original function ... you get a guarantee that the other function that's produced is the first derivative.
MF: [AD with Dual Numbers (2 of 2)] This slide tells you that for any function you have, you will get the derivative.
MW: You're still just calculating f(x + epsilon).
MF: Yes [clarifies].
MF: [AD with Dual Numbers -- code]
mfh: What does the "eps" method do?
MF: It takes the epsilon part of the dual number.
MZ: It's like the imag function for complex numbers.
VV: [Clad: Clang C/C++ AD [automatic differentiation] plugin] The template metaprogramming approach to AD requires a new type. This doesn't fly very well with legacy code, especially if the legacy code has numerical functions that people are nervous about changing. Source-to-source transformation is another approach, but most such tools have custom parsers -- they work on a subset of C or a very small subset of C++. It's hard to write a parser for C++. Clad is a Clang plugin. It lets us use Clang's parser, so we have full access to C++. Some constructs are challenging to parse, but not hard to differentiate.
VV: [Clad: Source Transformation]
VV: [Clad: Implementation] Clad operates on the Clang AST. AST transformation with clang::StmtVisitor. Two modes of operation: inside the compiler (thus, a dependency on Clang), or produce a file with the derivatives. Compile once, produce the file, then build it with the application. But the latter implies a problem: if you modify the application, you'll need to regenerate the file.
VV: [AD Transformation. Chain Rule] The Chain Rule for differential calculus assumes you can split the computation into atomic operations. The Chain Rule is symmetric. This will change computational complexity, as we'll see.
VV: [Forward mode: clad::differentiate] Symmetry means we could "fuse" the two function bodies (from differentiating one way or the other). We could also compute both f and its derivative at the same time.
MF: To go back to the dual numbers: with the dual numbers, you obtain the same result. The result always contains the original function and the first derivative. You can have different kinds of dual numbers and get multiple derivatives at the same time.
VV: Complexity of the function is a function of its input parameters. When it comes to machine learning, we can have millions of parameters. This is not nice for this use case -- computational complexity would be proportional to the number of inputs.
MW: The feature vector.
VV: Yes.
MF: Yes.
VV: [Forward mode: clad::differentiate (2 of 2)]
VV: [clad::differentiate] This is what you get if you call Clad. We do the transformation pre-optimization, so it's OK to leave extra expressions and constants in place.
VV: [Reverse mode: clad::gradient] The chain rule is symmetric, so if you reverse the way you generate the derivative, you'll start accumulating the other way around, so you'll reuse intermediate values. Now the computational complexity does not depend on the dimension of the input, but on the dimension of the output. This is very useful if you have a reducing function. ....
VR: So here you're transforming sine into cosine, so you would have a list of transformation rules in your tool?
VV: Yes.
VR: I'm worried this would result in an explosion of rules -- e.g., the special functions in C++.
VV: Let's defer the discussion....
MF: The idea is to be able to provide, together with the function, a way to express "and this is the derivative of the function."
VV: Those are for the atomic functions, "atomic" in terms of differentiation rules -- not a great many. They don't have to be part of the language; they can go in a custom namespace.
I can [also] do it when I have a more performant function -- I don't want to use the automatically generated function, but a custom optimized one.
RS: There are some functions which are not differentiable[; what does AD do then?]
VV: AD cannot run everything. People need to write functions in a way that's AD-friendly. For example, if the function has a removable discontinuity but is otherwise differentiable... [control flow...]. Also, for pow2(x). Suppose [I write it like this]: return x==3 ? 9 : x*x. How do I differentiate that? The function's author needs to be aware of AD. There is good theory about that.
VR: Not all functions are AD'able, correct?
VV: That's correct; not all functions are AD'able. E.g., printf.
MF: Research is ongoing. I would rather start from something simple, and eventually increase the power of these tools when the theory comes for doing more things.
VR: Let's say you have these functions as part of C++, through a keyword. When we started with constexpr, we had a lot of rules about what was allowed inside a constexpr function. I guess this would be the same.
MF: Yes -- that's the example I usually use.
VV: [Reverse mode: clad::gradient (2 of 2)] You are reusing computation -- making good use of intermediate results. This [reuse] is very important when you go to 100s or 1000s of dimensions.
VV: [clad::gradient] Forward mode gives you the value of the derivative in the given direction -- here in reverse mode you get the values in all dimensions -- a vector out.
VR: The examples showed that some of them have loops -- in this one, you have an if statement. Can you differentiate if statements? In that case, your function is defined piecewise, so the resulting differentiated function you get is defined piecewise.
VV: Yes. The theory says, if you don't have a derivative at a point, then you don't have a value at that point.
MF: When you have a condition, you know whether the point is included or not, so if you have a conditional on an independent variable, then you know the part to which the point belongs -- the side [from which you approach the point].
VV: Research has proven that if you have control flow, you may safely differentiate.
RK: printf has a derivative somehow -- it returns a value, the number of printed characters. [The room laughs.]
VV: This gives me ideas [haha].
VV: [What can be differentiated] Loops -- fine when the loop depends on the dependent variable / direction; otherwise differentiability is questionable, but we can diagnose [Scribe: I think that means "emit a diagnostic" for] all these cases.
VV: [Benchmarks: in High-Energy Physics Uses]
VV: [Benchmarks [outside of high-energy physics]] There is some theory for how to optimize intermediate results. This is what clad::push etc. is about. Speedups vary depending on the input and the function.
VR: If one of your variables is a std::vector and you do push_back, this is not differentiable, right?
VV: [not sure] It might be mathematically possible.
VR: If you're allocating memory....
VV: At the moment, I think we cannot differentiate std::vector objects.
VR: Let's say you have a C array and you realloc it; how would you handle that?
VV: If the C array is realloc'd, the derivative would also be realloc'd.
MW: Regarding the slide, is [the] "Numerical" [function] generated?
VV: No, it's hand written. If you want to do numerical differentiation, you must write this algorithm.
MW: Ah, that's just the comparison.
VV: Hand writing is the fallback if the tool can't differentiate but a human can.
RS: Just recently I was trying to do some numerical differentiation -- I found it completely unreliable, due to underflow. Adaptive approaches [for avoiding underflow etc.] were much more complicated. I think your comparison is pessimistic -- the automatic approach is better.
VV: "Numerical" is just what we have to do as a fallback.
Stabilization and consulting with a mathematician may be needed for the numerical approach.
VV: The thing we could not figure out: the theory also says that the gradient computation of the initial function cannot require more than 4x the operations of the original function. I don't have at the moment a reference for the proof of this theorem. But it says that you have an asymptotic upper bound on the number of operations of the gradient -- which is a good property, because it gives you scalability of the gradient.
VV: [Future Work] .... I could generate GPU code. This was the motivation of the project.
VV: [Thank you!] https://github.com/vgvassilev/clad. www.autodiff.org.
RK: Could you show Slide 25 again? [Benchmarks] You are tracking the iteration value because you do not have a polynomial model ... size of loop?
VV: In some cases we can, but the default is that you track. .... If you infinite. .. you can just expand it ... there is no loop ... but here ....
DS: The 3 approaches you mentioned at the beginning -- numeric, symbolic, automatic -- to me, this looks like a hybrid of the last two. For example, you fall back to symbolic for math functions like sine and cosine.
VV: Yes.
DS: Especially after watching EWG bikeshedding yesterday, I suspect the chances of putting this into the language are very small. The most likely approach to succeed would be a pure library approach -- e.g., the expression templates thing I mentioned this morning (see the CppCon 2019 talk by Joel Falcou and Vincent Reverdy, "EDSL Infinity Wars: Mainstreaming Symbolic Computation"). If you want some help from the language, maybe try the reflection proposal -- expand it to include analysis of functions and function attributes -- mark a function as "differentiable" to get diagnostics from the compiler. Either of those sounds doable, but an arbitrary sublanguage in the compiler itself sounds impossible [i.e., unlikely to get through WG21].
VV: The library-only solution would mean operator overloading.
DS: If you do it with expression templates -- you could do the same analysis at compile time -- you would write your function with types -- and from that expression type, ....
MF: We considered those two cases. For the library approach, you can't differentiate "if" without expressing it in a different way, as a function. E.g., select_condition.
MZ: If the ternary were overloadable, would you be happy? [Answer was "yes."]
mfh: Our library-based solution has a function "if_then_else", like that.
DS: We have a reflection proposal, but it doesn't deal with functions.
MF: Correct, and it would require standardizing the AST -- a way to introspect the code inside the function.
DS: But if you restrict the functions that could be analyzed in this way.
MF: Yes, that's always possible -- an attribute to mark a function as differentiable, and an error on trying to differentiate a function that isn't. I don't think SG7 will standardize the AST any time soon.
VV: In some cases, in the code that I've seen, things are more complicated than just a function call. Things have state ... really need ....
VR: I've been working with Joel on symbolic ... expression templates. Type LaTeX in C++, get expressions. We thought about differentiation. I think a pure library approach is probably doable. It would require users to change a few things, like "if" and "for," but if we do that, then I'm almost entirely sure that it could be done in a library-only approach. It would be slow to compile despite optimization efforts, and we would lose the opportunity to reuse code. It would make people type their code in a special way if they want it to be AD'able, while in your approach, if the function meets the requirements, it would just work.
MF: The library approach implies generic functions -- "auto" parameters everywhere to extract the AST.
MZ: Both are possible -- inject auto ... or from ... ?
DS: Do you mean having the function as a type?
MF: [Explains]
Mark: How does this compete with the work being done in [the] Swift [programming language] for AD?
VV: I've been in contact occasionally with Richard W?, who worked on that in Swift.
Mark: I'm using the word "compete" -- I chose that word.
VV: Swift bragged about being an AD language.
Mark: This functionality is important for the longevity of the language, to be useful for scientists. Also units -- how would you interoperate with units -- things that you use to attach to integral types -- the chronos of the world -- do you have any idea how you would tell the compiler that this is a simple thing you would treat just like a double, but ....
VV: If there is a desugaring step in the compiler....
Mark: Perhaps an enum backed by a double, or a double in a struct; not sure how MP would implement it. "Treat it like you used to treat double...." [Scribe note: This sounds scary if there are quantities that are, say, both lengths but have different units. We shouldn't strip away those scaling factors for AD.]
VV: I know it's implementable, but I want to reduce the complexity of the implementation. There is nothing that precludes us from [this], but if it becomes too complicated, it would be less likely to be adopted as a feature.
Mark: These two features are closely related in my mind.
AP: There's precedent for the library solution -- AC DCO (?). A template library, an algorithmically differentiable type....
VV: Yes, I think I know about this. It uses operator overloading.
VR: Can you provide special rules on how you transform your functions? With Joel, we provide a way for users to provide their own simplifying formulas -- like implementing a compiler optimizer. Users can, e.g., decide that this function should be derived in one way and that one in another way.
VV: The rules of differentiation you wouldn't want to change. But [if?] we have a custom namespace -- if some symbolic mapping [exists] ..., then we just pick it up. This helps us if we do not have the source code of the function.
This is done in the interpreter -- we have a C++ interpreter. This is a way to support differentiating functions that are invisible to the interpreter.
VR: I would really ... I'm not saying that the library version would be better than the language version -- if you give us a language version, we could do crazy things and wouldn't need Mathematica any more -- but if we don't have a language solution, I am pretty sure we could do it at the library level. If EWG forbade a language solution, I'm pretty sure we could do most of the stuff at the library level.
MF: Regarding structs -- a struct can be expanded if it's an input, and compressed if it's an output. Like having multiple parameters.
Mark: It's just knowing that a particular user-defined type should be unwrapped automatically and just treated as a double.
MF: Yes, probably it's generalizable.
Mark: It's like adding an attribute to the struct. ...
MM: In the variant with language support, what kind of language support -- just for AD, or something more general?
VV: I don't know. There are disadvantages to both. If just for AD, it's not consistent with the rest of the language. If we wait for a generic solution, we risk losing people from the machine learning community from C++. This is a feature very useful for machine learning.
MF: What I would like to try to produce is something that could work as a language feature first, and then, if an SG7 tool standardized the AST, it could be transformed into a library.
DS: Besides reflection, we have the JIT proposal [Hal]; that could be used to analyze and differentiate functions without source code.
MF: No -- it works on the AST. We don't do source analysis.
DS: This would be a minimal addition to the language, if we had JIT.
MW: It's not clear to me that the JIT proposal will succeed.
VV: The JIT -- I've been in contact with Hal; we'll be working together on the JIT side. But I don't see how this would map immediately to JIT. Some people have done it in LLVM IR.
The disadvantage of the lower level: you lose language information -- e.g., seeing that you can't differentiate. You can't give the user a diagnostic in that case. That's why we use the AST.
mfh: Trying to work at the JIT level means you might lose accuracy, if you step into a black-box function (say sine) and differentiate the numerical approximation of the function.
VV: We could have a "numerical differentiation" function -- the compiler could differentiate if possible, else fall back to numerical.
GH: I think if it's a library approach, LEWG could look into it and propose a keyword. It's easier for LEWG to push into EWG. Improving a library to a language feature is a good approach.
MW: Any votes or questions?
MF: Yes. As VV said, an interesting approach could be starting with numeric differentiation, with AD on top of that -- to substitute. There is interest in proposing a paper here for numeric differentiation -- it's a function.
MW: A library.
VV: I would like to propose a poll -- would the audience like to pursue this general direction?
MW: It's not clear what "this direction" is.
MZ: The feature in some way, in C++.
MF: That means AD, either as a library, or as a language feature.
VV: Re numeric -- I'm not sure if it would be useful. I'm not sure whether it can be generic enough to be usable by all the potential math users.
MF: Can we also have one poll for the pure language approach? Implementation as a language feature or as a library is quite different. The library has drawbacks, as I mentioned.
VV: Do we agree that a language feature is worth pursuing, rather than a library feature?
DS: Maybe you could loosen it a bit? There are options in between, or other options.
MF: I think a "language feature" is anything, including JIT or reflection extensions.
Mark: I'm assuming the job of this room is to advance machine learning in C++, not to worry about language changes. Don't worry about what EWG would do.
VV: "Language feature" means "working with atomic types" -- without the need for custom types -- no operator overloading. I don't want the Clang plugin in the standard -- I want the implementation to work with doubles and floats. Operator overloading would require introducing a new type, which would not work well with existing code.
MW: I'm satisfied with this poll as it is; it does not specify the exact methodology.

----

MW: Potential poll: Pursue this feature -- AD in language or in library -- in C++.
BS: For this poll, do we mind if it's a mix of language and library?
Mark: Too much detail.
MW: This poll is permissive.

__POLL: Pursue this feature -- AD in language or in library -- in C++.__

|SF|F|N|A|SA|
|-|-|-|-|-|
|12|6|1|0|0|

__Attendance:__ 20
__Authors in room:__ 2
__Authors' votes:__ SF

__POLL: Do we want a paper on adding a library-based numerical differentiation feature, not excluding AD?__

|SF|F|N|A|SA|
|-|-|-|-|-|
|1|10|8|0|0|

__Attendance:__ 20
__Authors in room:__ 2
__Authors' votes:__ SF, F

BS: How do we interpret this poll?
MW: In SGs, because we're all of like mind, it's rare for SGs to vote things down. I'm almost discounting the negative votes[, if they are outweighed by positives]. As proposals rise [through the process], dissent will be magnified. If the SG approves, I try to get you to get the work done and vote it out, so it gets exposed as fast as possible, before you get bogged down in ....
MW: Proposed POLL: We want to pursue AD as the feature, in the language.
MM: Does this mean the entire functionality is built into the language?
MW: It would need some sort of ....
Mark: One way to do the generic thing is to bring it in as part of reflection.
[uncertainty about poll]
MF: To clarify -- it must be done as a language extension to C++20 [not regarding future language extensions].
VV: I'm worried about being bound to reflection or to any other generic approach. This would delay everything by five years.
Mark: If this room polls on "in the language," you can make a decision about [what that means].
MW: Yes, I agree.

__POLL: We want to pursue AD as the feature, in the language.__

|SF|F|N|A|SA|
|-|-|-|-|-|
|3|6|7|2|1|

__Attendance:__ 20
__Authors in room:__ 2

Strongly Against: It's science fiction -- such a niche that it will be a waste of time -- we have a lot of other problems to solve. Some language features for this would be strange in the C++ standard.
VV: If there were a language feature, it would be minimal.
MZ: But you still would have to specify everything, and that specification would be major.
DS: We have a spaceship operator; we can have science fiction [haha].
VR: I voted against, for the same reason as the person above. To me, this is too specific. Maybe a more general version, based on reflection.
MF: But that was included in this poll.
GH: ....
RK: But then it's a library.
Mark: I voted strongly for, because I think a library will fail. It doesn't make the language competitive. A library solution is just telling people to learn a new language that's super slow to compile and generates long errors.
MF: There are already several library solutions.
MH: I was on the verge of voting against. AD is still such a research topic. The cost of adding a language feature is higher than that of a library feature, so there's greater danger involved. If we had a library solution, it would compile slowly, though the metaprogramming features in the language are improving, so that might get better. There are also existing library solutions and libraries that use metaprogramming.
MZ: A library solution might get an optimized compiler implementation -- it may not be fully library -- and could thus be competitive.
VV: To clarify: Could you define what's "recent research"?
MH: [In recent years.] Is there a general-purpose language with AD built in?
MF: Swift -- an experimental branch, but it will go into the language. Other examples: PyTorch (a language built atop Python), Julia (implemented in the JIT -- an SSA version of AD).
GH: TensorFlow, AD in C++ -- very noisy.
VR: Maybe something that could happen is that we could design a library solution, and compiler implementers might ask SG7 to help solve the problem with assistance from new language features.
Mark: Or you could add a language feature that drives reflection faster.
VV: Research into AD happened in the late '80s to early '90s. Also, we have better [compiler] tools now. [More recent research builds on existing theory.] A more modern research topic: automatic integration.
MW: I would love to have this in the language. I went Neutral because I didn't understand all the tradeoffs. ....
VV: I also sympathize with the hybrid approach, where you do things with templates or what not -- we hope to get compiler support, to make things not explode -- a second or third derivative of an Eigen thing might explode the stack ....
MF: To speculate: a library solution will be voted down by LEWGI or LEWG, on the basis that there's no urge to have this in the standard library for C++, and that it could be an external library.
mfh: I like this solution just as a library or Clang plugin, but I feel that this is a powerful tool that can be dangerous in the hands of a naive user. This makes me neutral about pushing to put it into the standard library.
VR: Regarding the library solution being voted down -- I need to talk with Joel, but I think we could provide a framework that could be used to do more stuff than just AD, and in that case, it would have a higher chance of being accepted -- in the same way that having a serialization of the AST would have more chance... it would be far more generic than just a specific AD tool.
MW: I'll close this down; it's 17:30. I'm really excited by this.