Improving LLM Performance

Question

I'm trying to figure out whether it's possible to use LLMs to categorize the internet. Using back-of-the-napkin math, if it takes a few seconds per web page, it would cost $XXX,XXX+ to process the Common Crawl. Does anyone have tips on speeding up LLMs? Is it possible to use LLMs to train a cheaper student model? Thanks!
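For anyone who wants to redo the napkin math, here is a rough sketch of the arithmetic. Every input below (page count, seconds per page, hourly GPU price) is a placeholder assumption rather than a measured figure, and `estimate_cost` is just an illustrative helper name:

```python
def estimate_cost(num_pages, seconds_per_page, dollars_per_gpu_hour):
    """Return (gpu_hours, dollars) for one sequential pass over num_pages."""
    gpu_hours = num_pages * seconds_per_page / 3600
    return gpu_hours, gpu_hours * dollars_per_gpu_hour


# Placeholder inputs: swap in your own page count, latency, and pricing.
gpu_hours, dollars = estimate_cost(
    num_pages=1_000_000_000,    # assumed number of pages to classify
    seconds_per_page=2.0,       # assumed LLM latency per page
    dollars_per_gpu_hour=1.0,   # assumed rental price
)
print(f"{gpu_hours:,.0f} GPU-hours, roughly ${dollars:,.0f}")
```

The cost is linear in all three inputs, so halving the latency or dropping half of the pages before the LLM sees them each cut the bill in half.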

throwaway888abc · Accepted Answer

You should have a look at DMOZ: https://dmoz-odp.org/ It feels so old to write this, but back in the day this one was important. Found this on Kaggle: https://www.kaggle.com/datasets/shawon10/url-classification-...

PaulHoule · Answer

Common Crawl is full of real junk; you'd need some kind of classifier just to pick out the stuff that's worth classifying...
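One way such a pre-filter could look, as a minimal sketch: label a small sample of pages once (for example with the expensive LLM), then train a cheap student model on those labels and run it over everything else. The class below is a plain multinomial Naive Bayes with add-one smoothing, written with the standard library only. The training texts, the "junk"/"keep" labels, and all names are made up for illustration; a real filter would need far more data.

```python
import math
import re
from collections import Counter, defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.doc_counts = Counter()              # label -> number of docs
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            self.doc_counts[label] += 1
            for tok in tokenize(text):
                self.word_counts[label][tok] += 1
                self.vocab.add(tok)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokenize(text):
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical teacher-labeled sample (in practice, labels from the LLM).
texts = [
    "buy cheap pills online casino bonus click here",
    "click here to win free casino bonus now",
    "tutorial on training neural networks with python",
    "how to configure a postgres database index",
]
labels = ["junk", "junk", "keep", "keep"]

student = NaiveBayes().fit(texts, labels)
```

With this in place, `student.predict(page_text)` costs microseconds per page, so only pages labeled "keep" would need to go on to the slower model.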