October 15, 2022
In the beginning was code (nice TED talk here).
Way back in the 1980s, the client-server programming paradigm came into being. This changed the then long-standing paradigm of a monolithic application or tool being served to users packaged with its application logic. There was now a need to decompose applications into “backend” logic and “frontend” logic.
This surfaced a new kind of application called web apps, which slowly started replacing desktop applications. With this came different meanings of “the stack”, such as Ruby on Rails, Laravel, Python Django, LAMP, XAMPP and more! This meant more fragmentation in the ecosystem, but also several “communities” of developers.
This is when the Full Stack Developer emerged. Instead of hiring separate backend, frontend and database developers, organizations started focussing on hiring full-stack developers who had experience across the tech stack that the organization had chosen. In the early days of Web 2.0, most web applications were just hosted on rented Linux boxes from a hosting service provider (like GoDaddy).
The new full stack developer beyond 2015 started looking like this
Let's shift laterally to talk about Machine Learning. Contrary to popular belief, Machine Learning is a very old field, older than the web itself. The term was coined in 1959 in a paper titled Some Studies in Machine Learning Using the Game of Checkers.
The community made systematic strides forward - with the invention of the first neural network soon after in 1960, and in physical form at that. Since this was also around the birth of computing, most of the community was focussed on algorithms, such as A* and breadth-first search. The focus of this artificial intelligence phase was not on data, but on efficient solutions to known problems. Meanwhile, the “learn from data” paradigm was inching forward - with the creation of backpropagation, which itself was borrowed from control theory (Gradient Theory of Optimal Flight Paths). This resulted in the creation of the first practical neural network - LeNet (Backpropagation Applied to Handwritten Zip Code Recognition). The neural network community continued to move, mostly laterally rather than forward, and only slowly, because it soon hit compute limits. ConvNets also suffered from AT&T breaking up (Yann LeCun’s rant on ConvNets being snatched away from him).
While neural network research slowed down, "machine learning" as a term started gaining popularity. The focus went back to fundamental statistics such as hyperplanes (SVM), Bayesian approaches, distributions and graphical models. What could not be achieved in an end-to-end system such as a neural network was now achieved by handcrafting features. Feature detection and extraction became commonplace terms, with algorithms such as HOG, LBP, SIFT and many more being published. A lot of emphasis was placed on “understanding data” and improving the amount of information that could be extracted from raw data. However, this approach was very manual and handcrafted for specific purposes - the extraction of information was not guided by the distribution of the data. Proxy methods such as Bag of Words materialized, wherein many of these handcrafted features were extracted and a clustering-based mechanism acted as an adaptation to a given dataset. This phase also created a distinct separation between the feature extraction phase and the machine learning phase, and a distinct pipeline emerged. One drawback of this generation was that the machine learning algorithms did not scale with data. As the quantity of data being collected increased, these pipelines plateaued without being able to “absorb” large-scale data.
In the background, Python was created in the early 90s and went on to gain popularity in the late 2000s.
It all started in 2012. Three essential ingredients were in place. One - a large-scale dataset called ImageNet was created by Fei-Fei Li and Co. in 2009. Two - ConvNet’s patent had lapsed in 2007, so future iterations of convolutional networks could be developed freely. Three - gaming really picked up, with amazing games like Crysis and Skyrim that demanded powerful GPU cards. What happened as a result was the organic combination of these raw materials. AlexNet came into being, which can essentially be summarized as follows - “Let’s take a convolutional network and make it deeper because now we have more compute, and let’s train it on a supermassive dataset, because now we have ImageNet”. Its authors achieved amazing results on ImageNet and ushered in a new generation - which took a while to pick up because of the cost of compute. What this new deep learning generation achieved was the ability to scale the performance of one’s model with large data.
In parallel, in the rest of the computing world, several things were happening
This new phase, where we are right now, is all about leveraging the cloud for incredible applications and machine learning models. This means that machine learning models have to be inherently distributable. This triggered the creation of frameworks such as Horovod and Spark, and PyTorch started supporting distributed training natively. These frameworks also meant that the fixed cost for new companies reduced dramatically as more of the cloud was adopted. Models that might have required 5 weeks to train on a big GPU now required just 1 day on ~50 GPUs. These constructs meant that the end-to-end training time for models reduced dramatically, which translated into more iterations of models in production. Paradigms such as Data-centric AI arose, which put the focus back on data and the movement of data. Likewise, cloud-native machine learning frameworks were born - such as SageMaker and Databricks.
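To make "inherently distributable" a little more concrete, here is a minimal sketch of synchronous data-parallel training, the general idea behind distributed training in tools like Horovod and PyTorch. It is a toy under stated assumptions, not how those frameworks are implemented: the model is a one-parameter linear fit, the "workers" are plain Python lists processed in a loop, and every name (`shard`, `local_gradient`, `all_reduce_mean`, `train`) is hypothetical.

```python
# Toy sketch of synchronous data-parallel SGD for the model y = w * x.
# All function names are hypothetical; workers are simulated sequentially.

def local_gradient(w, batch):
    """Gradient of the mean squared error over one worker's shard."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

def shard(data, num_workers):
    """Split data into equal contiguous shards (assumes an even split;
    any leftover tail would be dropped)."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]

def all_reduce_mean(values):
    """Stand-in for the all-reduce step: average the per-worker gradients."""
    return sum(values) / len(values)

def train(data, num_workers, lr=0.01, steps=200):
    w = 0.0
    shards = shard(data, num_workers)
    for _ in range(steps):
        # In a real system each gradient is computed on a different device.
        grads = [local_gradient(w, s) for s in shards]
        w -= lr * all_reduce_mean(grads)
    return w

data = [(float(x), 3.0 * x) for x in range(1, 9)]  # generated with w = 3
w_single = train(data, num_workers=1)
w_parallel = train(data, num_workers=4)
print(w_single, w_parallel)
```

Because the shards are equally sized, averaging the per-shard gradients gives the same update as one full-batch gradient, so both runs land on the same weight; the parallel version simply spreads the per-step work across workers. Taken literally, and assuming the same kind of GPU in both cases, the rough figures above (5 weeks on one GPU versus 1 day on ~50) would correspond to about 70% scaling efficiency, since 35 days / 50 is 0.7 days in the ideal case.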
A notable development started happening in deep learning research: machine learning research became more active in companies than in academia.
Finally, coming to the purpose of this blog post: a description of the full-stack machine learning person. As machine learning evolved in tandem with large-scale web infrastructure, it started leveraging the cloud and beyond. Likewise, since most machine learning today happens in industry, a lot of practical machine learning is driven by product and therefore by value generation. This also means that the end goal is not a research paper, and hence the code that backs the machine learning model has to be optimized, maintainable, scalable and deployed as a product or into a product. This creates the need for a skill set that is “full stack” in ML and well versed across the stack.
Tangentially, a lot of machine learning now happens on IoT devices (such as Alexa and Nest), which means that there is a need for a distinct optimization (or quantization) phase to make ML inference cheap and fast. As a product becomes more mature and its processes become more fixed, data pipelines may be built to automate as many processes as possible.
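As a rough illustration of what a quantization phase does, the sketch below maps floating-point weights to 8-bit integers with a single shared scale (symmetric quantization). This is one simple scheme chosen for illustration, not the method of any particular toolchain; the function names and the sample weights are made up.

```python
# Minimal sketch of symmetric int8 quantization with one shared scale.
# Function names and sample weights are hypothetical.

def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using one scale for all weights."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float values from the integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Each weight now fits in one byte instead of four, at the cost of a rounding error of at most half the scale per weight. Small weights suffer most: here 0.003 rounds to zero, which is the kind of accuracy trade-off a dedicated optimization phase has to manage.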
Today’s full stack ML engineer works across the stack, from ideation all the way to deployment. In today’s complex cloud infrastructure world, this means getting down and dirty with the cloud and distributed compute. Let me walk you through the layers (libraries/frameworks linked are representative examples).
Replace X library/framework/language with your-favourite-alternative.