Another alternative would be some kind of excitement recognizer or monologue recognizer: get a machine learning model to find the moments that seem like a good back-and-forth or a solid exposition, then pad each of those clips by a minute or two on both ends.
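To make the padding part concrete, here is a rough sketch. It assumes the recognizer (whatever it ends up being) hands back a list of `(start, end)` times in seconds; the function name `pad_and_merge` and the 90-second pad are made up for illustration. It pads every clip, clamps it to the length of the episode, and merges clips that end up overlapping.

```python
def pad_and_merge(segments, pad_s, duration_s):
    """Pad (start, end) clips by pad_s seconds on both ends and merge overlaps.

    segments   -- hypothetical recognizer output: (start, end) pairs in seconds
    pad_s      -- padding to add on each side, in seconds
    duration_s -- total length of the episode, in seconds
    """
    # Pad each clip, clamp it to the episode bounds, and sort by start time.
    padded = sorted(
        (max(0, start - pad_s), min(duration_s, end + pad_s))
        for start, end in segments
    )

    merged = []
    for start, end in padded:
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous clip: extend that clip.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Example: three detected moments in a 60-minute episode, padded by 90 seconds.
clips = pad_and_merge([(300, 360), (400, 450), (3500, 3590)], pad_s=90, duration_s=3600)
print(clips)  # [(210, 540), (3410, 3600)]
```

The first two moments are close enough that their padded clips overlap, so they come out as one longer clip, and the last one gets cut off at the end of the episode.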
But that raises the issue that in podcasts, people often build on parts of the conversation that happened much earlier. So it may just be difficult to sample longer podcasts in general. Perhaps you could try listening to 5-minute podcasts, or something like that.