HACKER Q&A
📣 rahimnathwani

What data are you using to finetune LLMs?


There have been a number of posts on HN recently about how to fine-tune LLMs. These posts talk mainly about:

- different methods of fine-tuning (full retraining, LoRA)

- different base models

- data sets (e.g. Alpaca)

- objectives (creativity, instruction following)

I haven't seen much discussion of people fine-tuning an LLM on domain-specific data, e.g.

- medical records

- standup comedy jokes

- internal corporate data

So, are any of you fine-tuning your LLMs using such niche data? I'd love to hear about your experiences and motivations!

Even if you're working with proprietary datasets, I'm still interested. After all, knowing what you're doing won't allow us to duplicate it, as we don't have access to the same data.


  👤 PaulHoule Accepted Answer ✓
Yoshinon, my smart RSS reader, trains a classifier to predict “Will I like this article?”. My current production classifier runs the article through a BERT-like embedding and applies an SVM (classical machine learning) trained on my last 40 days of judgements. I have also fine-tuned BERT-like models to do the same job, with (so far) roughly the same performance, except that training takes 30 minutes instead of 30 seconds. I also trained a similar model to predict how many votes a headline will get on HN, which I don’t like as well as my bag-of-words predictor of whether a headline will get > 10 votes.
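
For anyone curious, the embed-then-SVM setup looks roughly like this. It's an illustrative sketch, not the actual Yoshinon code: it assumes sentence-transformers for the BERT-like embedding and scikit-learn for the SVM, and the model name and training examples are placeholders.

```python
# Illustrative sketch of the embed-then-SVM pipeline (not the real
# Yoshinon code; model name and examples are placeholders).
from sentence_transformers import SentenceTransformer
from sklearn.svm import SVC

# Judgements from the last ~40 days: article text plus like/dislike label.
texts = [
    "Show HN: a smarter RSS reader",
    "Arsenal sign new striker",
    "Fine-tuning T5 on a single GPU",
    "Premier League weekend roundup",
]
labels = [1, 0, 1, 0]  # 1 = liked, 0 = disliked

# The BERT-like encoder stays frozen; embedding is the expensive step.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X = encoder.encode(texts)

# The daily retrain is just refitting the SVM on the embeddings,
# which takes seconds rather than the ~30 minutes of a full fine-tune.
clf = SVC()
clf.fit(X, labels)

# Score a fresh article: a positive margin means "probably like it".
margin = clf.decision_function(encoder.encode(["Ask HN: fine-tuning data?"]))
print(margin[0])
```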

Part of the problem is that these are both ill-defined problems. (I started out being mad that I was getting so many articles about ‘Arsenal’; after doing the feature engineering to understand why the classifier I was using at the time (also bag-of-words) couldn’t learn that I hated soccer and loved the NFL, I became a soccer fan.) One of these days I am going to try a crisper classification problem; I also want to try fine-tuning a T5.
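
The bag-of-words headline predictor I mentioned is even simpler. A hypothetical version, again assuming scikit-learn, with made-up headlines and labels:

```python
# Hypothetical bag-of-words predictor for "will this headline get more
# than 10 votes on HN?": tf-idf features plus logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

headlines = [
    "Show HN: I built an RSS reader",
    "My notes on monads",
    "Ask HN: What data are you using to finetune LLMs?",
    "Startup press release roundup",
]
over_10_votes = [1, 0, 1, 0]  # made-up labels

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    LogisticRegression(),
)
model.fit(headlines, over_10_votes)

print(model.predict_proba(["Show HN: fine-tuning on my own data"])[0, 1])
```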

For now, Yoshinon trains a new model every day, but I am still using my old classifier because it never screws up.