AutoML
I recently wanted to play around with Kaggle again, so I made a fresh account to hide what I worked on two years ago and dived into a playground competition. These competitions in particular don’t demand any “novel technique” to place well, and they made for a light reintroduction back into the sport.
I had heard about AutoML and understood the principles behind it, but I had never actually tried it. To no surprise, it worked, and well enough that I found myself asking: “What’s Left?”
What’s Left?
Again, to no surprise, there is still plenty left for a “machine learning engineer” to do.
AutoML currently handles tabular workflows and does model selection quite well, but it is still lacking in workflows for other forms of data. The reason it does its job so well is that it encodes the best practices for a given problem type, and where it doesn’t, it falls back on set presets.
AutoML workflows usually automate the “manual” steps that are fine being automated, like generating interaction terms, normalizing the data, and general “feature engineering” tasks. Specifically, it will generate different feature sets, train a single model type on each, and then repeat this for every model type.
In the end, it ensembles the model predictions and voilà, you have a pretty good model. No need to understand optimization, validation, or model formulae: just run an AutoML workflow and it will produce an effective model. So what is left?
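To make that loop concrete, here is a minimal sketch of the idea using scikit-learn, with toy data and only two feature sets and two model types. It is my own illustration of the pattern, not how any particular AutoML library implements it:

```python
# Minimal sketch of the "feature sets x model types -> ensemble" loop on toy data.
# Real AutoML tools add far more: missing-value handling, per-model hyperparameter
# search, stacking, time budgets, and so on.
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# "Manual" preprocessing that is fine being automated: normalization and
# interaction terms, each defining a different feature set.
feature_sets = {
    "raw": StandardScaler(),
    "interactions": make_pipeline(
        PolynomialFeatures(degree=2, interaction_only=True), StandardScaler()
    ),
}
model_types = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Train every (feature set, model type) combination and collect test probabilities.
probs = []
for feats in feature_sets.values():
    for model in model_types.values():
        pipe = make_pipeline(clone(feats), clone(model))
        pipe.fit(X_tr, y_tr)
        probs.append(pipe.predict_proba(X_te)[:, 1])

# Ensemble by averaging the predicted probabilities.
ensemble_pred = (np.mean(probs, axis=0) > 0.5).astype(int)
print("ensemble accuracy:", (ensemble_pred == y_te).mean())
```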
“Actual” Feature Engineering
If AutoML were that good at everything, it would be used everywhere. That isn’t what I see when browsing different Kaggle competitions. What I see in “real” competitions worth their prize money is that feature engineering gets you 80% of the way while model selection gets you 20%, a familiar Pareto split.
Feature engineering here refers to the fact that these competitions are not working with tabular data; they are working with image, sound, textual, or incredibly multivariate forms of data. These are the forms of data that demand more “novel” or “ingenious” approaches when creating model features. And this effort comes on top of needing to know how to measure the right data to predict a given phenomenon in the first place.
The competition I have been playing around with involves distinguishing between a real and a fake pair of text elements. It is possible to build a logistic discriminator and use the labeled data to create a model that assesses a single text input, but the higher-scoring models don’t do that. They use “Siamese” networks to extract dot-product, cosine, or Euclidean features from the two text inputs, which are then fed to a downstream model.
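As a rough sketch of that Siamese-style setup, assuming the sentence-transformers library and a placeholder encoder checkpoint (not necessarily what the top solutions actually use):

```python
# Siamese-style features: apply the *same* encoder to both texts (that is the
# "Siamese" part), derive pairwise similarity/distance features, and feed them
# to a simple downstream classifier.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

def pair_features(texts_a, texts_b):
    a = encoder.encode(texts_a)  # shape (n, d)
    b = encoder.encode(texts_b)
    dot = np.sum(a * b, axis=1)
    cosine = dot / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    euclidean = np.linalg.norm(a - b, axis=1)
    return np.column_stack([dot, cosine, euclidean])

# Toy labeled pairs: 1 for a real pair, 0 for a fake one.
texts_a = ["the cat sat on the mat", "gradient descent converges"]
texts_b = ["a cat is sitting on a mat", "bananas are yellow"]
y = np.array([1, 0])

clf = LogisticRegression().fit(pair_features(texts_a, texts_b), y)
```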
The point being made here is that once you start working with non-tabular data, you have to get inventive with how you pre- and post-process the model inputs and outputs. This is where “actual” feature engineering begins, and it’s something I don’t think AutoML workflows can perform, as it’s a very bespoke task. That is, until we have general repeatable workflows for things like time series or text comparisons.
Model Interpretation
The second thing that you shouldn’t expect AutoML to do well is hypothesis testing.
Everyone treats AI/ML as a field whose purpose is to create predictive models, as if that were the be-all and end-all. But statistical models were originally built to test hypotheses, not to chase 100% accuracy. I mean, even the humble t-test can be studied as a “generalized linear model”.
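To make that t-test claim concrete, here is a small sketch on simulated data, assuming scipy and statsmodels: a two-sample t-test and a linear model with a group indicator give the same statistic and p-value.

```python
# A two-sample t-test is equivalent to fitting y ~ 1 + group and testing the
# slope on the group indicator (assuming equal variances, as the classic test does).
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(loc=0.0, scale=1.0, size=50)
b = rng.normal(loc=0.5, scale=1.0, size=50)

# Classic two-sample t-test.
t_stat, p_val = stats.ttest_ind(a, b)

# The same test as a linear model: outcome regressed on a 0/1 group dummy.
y = np.concatenate([a, b])
group = np.concatenate([np.zeros_like(a), np.ones_like(b)])
fit = sm.OLS(y, sm.add_constant(group)).fit()

# fit.tvalues[1] matches t_stat (up to sign) and fit.pvalues[1] matches p_val.
print(t_stat, p_val)
print(fit.tvalues[1], fit.pvalues[1])
```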
Like, what ML person actually looks at the “statistical significance” of their parameters rather than their model’s performance metrics?
The point here is just to say that stats people build models with a different telos than the one AutoML tools build models with. But also to say that building predictive models often isn’t about “generating knowledge”1, unless you are trying to extract a function via the universal approximation theorem.
Compute and Data
Another thing that AutoML can’t “solve” is the problem of compute and data. In some Kaggle competitions, it isn’t even the feature engineering that is holding people back from scoring better; it is access to compute and data.
For example, there is a text-classification challenge where the inputs are math questions and solution explanations from pre-K through high school students. The feature work amounts to taking all the tabular features and generating a prompt for an LLM to “classify” into some label, which, for the record, is different from text generation, and mostly just requires tacking a classification “head” onto a pre-built model.
In this case, better scores are not achieved by training models locally or finding a cleverer feature; it boils down to which existing 7B+ model has already been trained specifically on mathematical data and how you can lightly tune it for this dataset. It would be both impractical and probably unethical to retrain a model of that size from scratch.
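For a sense of what “tacking on a head” looks like, here is a minimal sketch with the Hugging Face transformers API. The checkpoint, label count, and example row are placeholders; a small model stands in for the math-pretrained 7B+ checkpoints people actually reach for:

```python
# Load a pretrained checkpoint with a fresh classification head, build a prompt
# from the tabular fields, and classify it. Lightly tuning this head (or a small
# adapter) on the competition data is the "slight tuning" step.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: in practice this would be a math-pretrained 7B+ checkpoint.
checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

row = {"question": "What is 3/4 + 1/8?",
       "explanation": "Convert to eighths: 6/8 + 1/8 = 7/8."}
prompt = f"Question: {row['question']}\nStudent explanation: {row['explanation']}"

inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits  # one score per label
predicted_label = logits.argmax(dim=-1).item()
```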
Suggestions
learn math/statistics?
A lot of companies only hire PhD or master’s students for ML-related roles because they want you to understand why the models work well and to be able to select the best predictive model.
The added point is that to do “good” model witchcrafting, specifically NN witchcrafting, you have to understand what it means to change each part when creating a bespoke model.
learn computer science?
There is a lot of fruitful research going on in converting mathematical and statistical theory into algorithms, and fast algorithms at that: optimizing the “compute”, whether through hardware or software means.
infrastructure?
The deployment of machine learning models and algorithms is also something to look into: MLOps, though I have little knowledge of its idiosyncrasies myself.
-
1. If adding what I ate for lunch as a feature increases model performance, it will be kept.