Showing posts tagged “pl design”. Show All
14th
2008
Apr
permalink

The Phase Concept

Anyone who’s been following my blog for a while may have seen a pattern by now. Everything I’ve written about programming languages has a theme, which when extrapolated, has one logical conclusion: to create a compiler for a programming language that is a good tool for creating other programming languages (possibly mini languages otherwise known as APIs or DSLs) with a GUI editor that is aware of the semantics of the language and whose target language is well-established with many existing tools.

The compiler project I mentioned last post is a start of that. However, it is by no means the final product. First of all, I conjecture that an s-expression-based source language will lend itself as a good target language for a graphical code-editor built later.

Secondly, the idea of having PHP as the compiler’s target language was based on the desire to take advantage of the hordes of PHP code already out there for creating web apps. However, after creating an initial prototype, it became obvious that PHP’s lack of support for closures is a huge obstacle in creating the compiler whose source language has closures. I can’t imagine not having closures, so PHP is out. (It’s not that it can’t be done, but it would be significantly more work to compile away the closures.) This is good though, because it forces me to re-write (i.e. re-design) the compiler.

I also decided against compiling to Python. Even though I have good feelings towards it, I can’t justify using it when it restricts closures to be one-liners in an otherwise imperative language. I was also considering Common Lisp as a target language. The thing is, using it as a target language leaves this new language with all the same problems that Common Lisp has, and so in a way that would defeat the purpose of building on top of something supported by armies of coders. Put another way, CL’s armies are significantly smaller than the armies of other languages.

As much as I don’t want to admit this, Ruby is starting to look like the best option for a target language.

So for those of you wondering if I’m going to release my little prototype, I see no reason to. It was written in Haskell as a proof of concept. The s-expression parser was taken from the Lisp interpreter I wrote, and I simply added the translation to PHP.

I have concerns though about certain features like eval. My first inclination was to include it, as I plan on having something like macros à la Lisp. That could slow down the development of a prototype, and thus feedback, so I may cut it from the first version. Including eval creates a bootstrapping problem. It requires me to either write the compiler in the target language or include enough language primitives to implement eval in the source language itself and re-write eval in my new language. This is a sad cut, but it’s necessary to get a feel for the language quickly.

So what is this “language” I keep referring to? What’s special about it? What will its purpose be? It’s just an idea I’ve been toying with, and this prototyping is meant to try to figure out if it’s a good idea or not.

Every language lends itself to writing code in a certain way. Java, for example, lends itself to writing code in an object-oriented way. You could, however, write Java code that looks more like garbage-collected C code with classes used only as namespaces. Or you could write functional code in Java, passing around “functors” built out of anonymous classes. But the reason people tend towards writing object-oriented code in Java is because Java lends itself to an OO design. It makes writing OO code cheap — so cheap that it changes the way you think about algorithms so that they fit an OO model.

But me, I already think of everything as a compiler. I see every program as a compilation from inputs to outputs. A giant function if you will. Of course, when a program is big, you break it up into multiple functions, each with its own inputs and outputs. On a larger scale, you break up groups of functions among modules, where each module’s interface defines a mini DSL, and each module’s implementation is the compiler for it.

In this way, every program is a composition of mini compilers between mini languages. Oftentimes data in a program will pass through many intermediate stages as it flows from input to output through various transformations. In the same way that C++ code gets compiled to C code, then to object code, and then finally to machine code, each stage that data flows through is a compilation phase.

With a C++ compiler, the data happens to be C++ source code which gets translated into machine code. However, a clock program is a compiler from the OS’s API for retrieving the system’s time to a graphical readout of the time. A database engine is a compiler from SQL statements (select, update, delete, etc.) to result sets. (Order of execution is significant, as updates affect the results of compiling select statements in the future.)

A text editor is an advanced compiler with many phases of compilation. Ignoring the transformation (or compilation) of keystrokes to key codes at the hardware and OS levels, text editors transform key presses (and perhaps mouse input) into formatted text, formatted text into the graphical layout of the formatted text, formatted text into linear byte-streams for saving, formatted text into postscript or something suitable for printing.

I already see everything as a compiler, so why not have a language that lends itself to writing programs in this paradigm. A language that makes it cheap to express computations as the composition of multiple phases of translations from one language to another.

It’s all about dataflow and how that data changes form as it passes from one phase to the next. So for now, “phase” is its code name.

Tags:
29th
2008
Feb
permalink

Stand on the Shoulders of Giants

I feel like I haven’t even written a useful piece of code in ages because the mere thought of boilerplate code stops me in my tracks. This is one of the main motivations behind creating a new language free of hindering boilerplate code. But then you start running into other problems.

The egotistic developer will believe that creating a new programming language is the key, secretly striving for the silver bullet, even though he would never openly admit it because he is not even conscious that he is doing it. Some people believe that the silver bullet is Lisp, but for some strange reason, it has failed to slay the dragon for the past 50 years and counting.

The more seasoned developer, unambitious and static, will believe that creating a new language is a waste of time due to how many wheels have to be re-invented in a new language before it even approaches the usefulness of existing languages. Not to mention the fact that any feature could simply have been implemented in (or on top of) the existing language in the first place.

I think the only thing that makes sense is to strike a balance with something that allows (an order of magnitude) more succinctness and extensibility without sacrificing the millions upon millions of man-hours already spent by armies of programmers. We can stand on the shoulders of giants like IBM, Microsoft, Sun, and Google, and move forward without taking a giant leap back.1

On that note, we have people building things like Instapaper. And Instapaper is great; I use it. But it’s silly that I now use 2 completely separate bookmarking services that are oblivious to each other.

Of course, upon closer inspection and further use, I realized that the way I use del.icio.us and the way I use Instapaper are completely orthogonal. That is, the set of links that I save to del.icio.us and the set of links I save to Instapaper are completely disjoint. Occasionally I will first save a link to Instapaper, read the article, and then save it to del.icio.us, but that is a rare exception.

However, it’s completely obvious to me that the functionality of Instapaper is a proper subset of the functionality of del.icio.us. In other words, everything that Instapaper can do, del.icio.us can do and more.

Why then didn’t Marco build Instapaper over del.icio.us? It seems like a perfect match.

With the Web 2.0 craze, APIs started popping up everywhere. Even tools like Pipes and Popfly to wire those services together. But who is using them?

Instapaper’s value is solely in its amazingly simple interface. The reason I use it differently from del.icio.us is because its interface affords to different things. It makes different operations cheap, and that changes the way I think about those operations. But why doesn’t Instapaper integrate with my del.icio.us account, simply tagging things with a “read-later” tag? When it comes to bookmarking, del.icio.us is king; there’s no disputing that. Instapaper wouldn’t have less value if it were built on top of del.icio.us, it would actually have more.

And this, in my humble opinion, is the next step for the web. People will finally start realizing that most of the functionality in their little web app to-be is already done — not in a library — but in a web service.2

Services that do single things — and do them right — will be indispensable to the web. And a glue language will be in high demand. But personally, as an entrepreneur, I’m not looking towards Yahoo Pipes or Microsoft Popfly (yes, in direct opposition to the idea of standing on the shoulders of giants). I want to own my mashups, and those services don’t give me that at all.3


1. And this is why I have been very interested in Clojure — an extremely practical Lisp built on top of the JVM.
2. For example, I will almost never have to create my own charts, as Google already did it. Ditto for maps.
3. If you haven’t heard me practically shouting this already, I want a glue language for the web! I will personally thank anyone who builds one that doesn’t suck. But people, please don’t comment here saying X is already that glue language. [Ruby on Rails people, I’m looking in your direction. ;-) ]
3rd
2008
Feb
permalink

The Cost of a New Language

As I described in my post on tool-making, the most sensible thing to do is to minimize total cost to reach your goal. Accounting for all costs, it occurred to me that creating new tools such as programming languages can inadvertently increase total cost by reducing the usefulness of other tools designed to be used with other languages.

For example, gdb is an extremely useful tool for working with C code. But if you start using Java, the usefulness — or the amount of cost savings — of gdb is reduced so much that it is practically useless. What you really need is a debugger designed for Java. If you didn’t have a debugger for Java, your total cost may have increased as a result of switching to Java. (I’m exaggerating of course, because it’s not just gdb but other tools as well. And they all add up.)

I recently became aware of Google Web Toolkit (GWT) which basically compiles Java to JavaScript for use with AJAX web apps. At first this sounds right. Gain static typing and a unified server- and client-side language. However, this didn’t sit right in my mind and I wasn’t sure why. Then I realized it troubled me because, in a way, GWT was compiling from a lower-level language to a higher-level language.1 And the thought of that is just absurd. Once you take into account the reduction of usefulness of tools though that results from switching from Java to JavaScript, it all makes sense.

So when does it ever make sense for a small software company to create a new programming language? I’m beginning to think the answer is: never. Although there are inspiring words saying how much more productive a good programming language can be, there are also stories of people working with such languages their entire lives and how they became disillusioned.

The only sense I can make of all this is that people love to hear stories like Paul Graham’s that hold the carrot of hope out in front; the idea of creating the ultimate programming language is extremely ego-boosting. But at the end of the day, if your goal is to create great software that you can live off of, then there are very real costs to using less popular languages, and the trade-off does not often make sense. GWT on the other hand, works because it enables rather than disables the use of well-developed tools of another language. A new language is just a tool, and it should help, not hinder, overall.

…Another burst of insight came to me. Say you created a new source language S but wanted to exploit the abundant, well-developed tools in a target language T. What if, when you wrote your compiler from S to T, you also wrote a translator for the tools. In other words, think of the API or data-structure that the tool works on as T’, and create another compiler from S to T’. It all goes back to the idea that everything is a compiler.

I don’t know if this is something feasible. Again, it’s more of a what-if question. Is there a way to compile away the costs of a new language?


1. Another reason GWT seems weird to me is because instead of its source language being under-developed, its target language is.
Tags:
30th
2007
Dec
permalink

The Greatest Contribution to Programming Languages

Thinking about tools, it occurred to me that the greatest contribution to programming languages is not any single language. No, not Lisp, C, ML, Haskell, Qi, or any of the more obscure yet equally great languages.

The greatest contribution someone could make to the programming language and compiler community   is not to make a great programming language, but to make the tools necessary for making new programming languages cheap — so cheap, that programming languages and the tools that allow their efficient use (e.g. compilers, debuggers, code-editors, etc.) become abundant.

And once they become abundant, people will start to see how interfaces are everything, how everything is a language for a compiler, and how there’s no real answer to code’s worst enemy — size — other than abstracting away parts of it and then specifying which parts you need in a particular situation via a language.1

What tools would make implementing new programming languages cheap? Let’s make a list.2

  • Parsing and Printing (Bi-directional Mappings between Text and Abstract Syntax)
  • Tree/Graph Editor
  • Inference Engine (Theorem Prover)
  • Data-flow Analysis
  • Incremental Transformation Engine (Rule-based Transformation, Incremental Compilation, Execution Stepper)
  • Dependency Analysis
  • Unified Runtime System (Memory Allocation, Garbage Collection, Runtime Safety Verification)

Some more details…


Parsing and Printing

This may be a first-step before a true code-editor is created, but I suspect that it will still be desirable in some cases to stringize code. In addition, I imagine parsing code, perhaps not entire programs but small code-blocks, will still be useful in a code editor.

What is completely obvious to a compiler-writer (in fact, I knew of someone in college whose side project was exactly this) yet others seem completely oblivious to, is that you can make “skins” for your code to format it differently in text, but have it parse down to the same underlying representation (i.e. abstract syntax). This way, Joe can use curly braces to specify blocks while I can use whitespace indentation, and it will represent the same code. The same for every other parsing variation, small or large.

What’s cool about this is that even though Joe and I are typing in code differently, I can view Joe’s code in my format and vice versa. So no more conforming to (practically uncheckable) coding standards enforced by your tyrannical employer.

I imagine specifying these via bi-directional mappings between strings and abstract syntax data-structures. In the past, I’ve always had to write one piece of code to parse and another piece of code to print, and that’s just silly. Of course, there are corner-cases which are specific to one as opposed to the other, and I imagine some kind of declarative DSL for specifying these.

Tree/Graph Editor

Code at its core is a tree. Once it is parsed, it is represented internally as an abstract-syntax tree, or absyn tree. If you consider the semantics of the language though, variable references for example, refer back to nodes closer to the root of the tree. So in this sense, it is a graph.

The core of most IDEs today is a text-editor. This is silly though since the programmer is actually editing code, which is this tree or graph data-structure. In particular, when you edit the name of a variable reference, the editor should distinguish between changing what that node refers to and re-naming the node, as overwriting a variable in text changes what the node refers to; to re-name, you have to manually replace all references to it, including taking into account scoping rules of the language. The editor should understand when you’re trying to wrap an expression in another expression, instead of relying on you to go to the beginning, add an open-paren, go to the end, add a close-paren.

Sure, it’s a matter of convenience, but it’s also a matter of error-prone-ness. I can’t count the number of times I’ve had bugs due to copying and pasting textual code. If the editor I was using was aware that the structure I was copying was a graph, it would indicate to me which nodes the pasted code were referring to, allowing me to see my bug immediately.

The editor must be aware of the semantics of the language you are editing and give continuous feedback of changes made. If it weren’t aware of the semantics of the language, how would it give you feedback on what you are editing that is specific to the language you are using. Thus, your code editor must be tied very tightly with your compiler. However, a general tree-editor would be a huge step in the right direction.

Inference Engine

An inference engine is critical for any code-level automation. For example, type checking, type inference, and semantic-preserving optimization. This is no simple feat to create, as it is, at its core, a theorem prover.

Instead of re-writing a theorem prover for every compiler, a compiler writer should only need to specify the axioms and rules of inference for his language to the industrial-strength theorem-prover. Simple things like constant folding and more complicated things like induction-variable elimination can fall out of this.

Data-flow Analysis

Of course, some standard optimizations like dead-code elimination (if your language is eager) and semantic checks like whether variables have been initialized (if your language allows such things) rely on data-flow analysis. Some compilers even use this to optimize memory allocation, as in whether to allocate something on the stack or the heap, by determining whether a reference to an object may escape a local block.

Again, this is another instance of something that can be done once generally, but it should also be tied in with the incremental transformation engine to allow for only re-compiling or re-executing pieces of code that have changed. This could also be extremely useful in a static sense where the editor displays to the programmer what pieces of code are affected by the flow of changes made (since the last save or last revision for example).

Incremental Transformation Engine

This is the meat of the compiler: the transformation engine. It transforms source code to object code. Or in the case of an interpreter, source code to immediately-executed instructions. It must be incremental in the global sense in that it must handle incremental building for a more responsive development environment. But it must also be incremental in the local sense in that it must allow for the programmer to pause, step, observe, or otherwise control the transformation for debugging purposes.

Home-grown interpreters usually start out all-or-nothing; they either transform or they don’t. The reason is simply because the most straight-forward and clean implementation is a recursive traversal of an absyn tree. This leaves no room for stepping through code and displaying the current state to the programmer. A debugger is an after-thought, and rightly so, as its cost outweighs its benefit when a language implementor is creating a language for himself or others who have access to the language the compiler is written in.

I imagine writing pattern-matching rules for the transformation engine similar to the way someone might write a recursive traversal of an absyn tree in ML, however, the transformation engine will automatically handle incremental stepping and observing of the transformation process for interaction with the programmer when debugging. In addition, I imagine many additions to this language compared to ML which allow short-hand for specifying common cases of transformation, like only specifying the differences in base cases and the engine recursively applying the transformations to all other nodes.

Dependency Analysis

Based on dependency analysis, the transformation engine can automatically determine which code to (re-)compile and in which order. This should include external dependencies as well.

In many projects I’ve worked on in the past, circular dependencies have prevented standard tools from working as expected. As a work-around, we were forced to transform our module layouts and handle the dependencies manually. Although circular dependencies in general can lead to confusing module-interfaces, there are some cases when it just makes sense, and the transformation engine should handle circular dependencies by simply using a two-pass algorithm.

Unified Runtime System

This has got to be the most depressing topic for all language implementors. If you implement your own runtime system for memory allocation and garbage collection, what hope do you ever have of your language interacting with the rest of the world — something that is necessary both for production apps and popular adoption. If you use an existing runtime system, you are stuck on its platform and forced to use its (often unsuitable) virtual machine.

Parrot, looking good. Perhaps this is on its way already.

As for the runtime safety verification, I would love to see something that used proof-carrying code. For those who don’t know about this, it is a way to guarantee runtime safety in the way that traditional sandbox-like VMs do at runtime, but statically, so that runtime overhead is eliminated and native machine-code can be run directly with all the same safety guarantees.3


All of these tools are non-language-specific, and they would help make creating new programming languages economically viable for software developers and startups in particular. In the end, this would advance the field of programming languages immensely, more than the development of any single language could.


1. I mean “language” in the most general sense, as it’s sometimes suitable to abstract something away into a library and have your language be function calls into that library. Other times it makes sense to have a full-blown Turing-complete language. Anyway, this is why PL design is so important — it’s really being done everywhere even though most people don’t realize it.
2. What’s missing from this list?
3. Peter Lee, a professor I worked for in college, was one of the inventors of proof-carrying code, and it turns out something so amazingly useful is practically a special case of type checking. However, this also requires proof-generation, another use of a good theorem prover. In the sci-fi stories that talk about computers programming themselves — they’re talking about theorem provers. …It’s sad for us CSers. Most people are simply unaware of the amazing things currently in development.
Tags: