Training a model on all HN data?
I just had a thought, maybe dang could chime in. Has anyone considered training or fine-tuning a model on all of the Hacker News discussions?
It's relatively straightforward to download all HN submissions and comments via BigQuery and then fine-tune an LLM; there's just not much point to it.
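Rough sketch of the pipeline, assuming you've already pulled rows from the public `bigquery-public-data.hacker_news.full` dataset (the actual query needs google-cloud-bigquery and GCP credentials, so the sample rows below are stand-ins with the same field names):

```python
import json

# Stand-in for rows pulled from BigQuery's public HN dataset
# (`bigquery-public-data.hacker_news.full`); field names match that schema.
SAMPLE_ROWS = [
    {"id": 1, "type": "story", "title": "Ask HN: Favorite debugging trick?", "text": None, "parent": None},
    {"id": 2, "type": "comment", "title": None, "text": "printf, every time.", "parent": 1},
    {"id": 3, "type": "comment", "title": None, "text": "Orphan reply.", "parent": 99},
]

def to_finetune_records(rows):
    """Pair each top-level comment with its story title as a prompt/completion
    example. A sketch -- real use would also walk nested reply chains."""
    stories = {r["id"]: r for r in rows if r["type"] == "story"}
    records = []
    for r in rows:
        if r["type"] != "comment":
            continue
        story = stories.get(r["parent"])
        if story is None:
            continue  # parent isn't a story in this batch (nested reply, etc.)
        records.append({"prompt": story["title"], "completion": r["text"]})
    return records

# One JSON object per line -- the JSONL shape most fine-tuning APIs accept.
jsonl = "\n".join(json.dumps(rec) for rec in to_finetune_records(SAMPLE_ROWS))
```

From there it's just feeding the JSONL into whatever fine-tuning stack you prefer.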
You can safely assume all modern LLMs have been trained in part on HN data.
HN was part of the training set for ChatGPT. But it might be interesting to train or fine-tune on HN alone. You could weight examples by karma, or, conversely, use the results to identify shortcomings in the karma system.
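Karma weighting could be as simple as sampling training examples with probability proportional to score. A sketch (field names are hypothetical):

```python
import random

def karma_weighted_sample(comments, k, seed=0):
    """Draw k training examples with probability proportional to karma.
    Karma is floored at 1 so zero-score comments can still appear."""
    rng = random.Random(seed)
    weights = [max(c["karma"], 1) for c in comments]
    return rng.choices(comments, weights=weights, k=k)

comments = [
    {"text": "insightful take", "karma": 200},
    {"text": "throwaway quip", "karma": 1},
]
sample = karma_weighted_sample(comments, k=1000)
```

With a 200:1 weight ratio the high-karma comment dominates the sample, which is exactly the kind of bias you'd then be probing if you wanted to study the karma system itself.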