HACKER Q&A
📣 didgetmaster

Does anyone need data 'shuffled' in a large DB table?


If you have relational data (stored in a table, a CSV file, or a JSON file), there might be a need to mix it all up for testing or anonymity purposes.

So you might have a table with 20 columns (e.g. name, address, city, state, country, order_date, etc.) and a few million rows of customer data. Is there value in being able to mix up some or all of the data within a single column, multiple columns, or all the columns? The values would remain intact, but just assigned to different rows in a random manner.

Doing this on a conventional row-oriented database could be very costly in time and I/O operations, but I have invented a new kind of data manager that uses key-value pairs to assign values to form relational tables. I just realized that I could implement a feature to shuffle the data very quickly (e.g. reorder all the data in a million-row table in just a few seconds).

So a simple table with just 3 rows:

name|city|state
John|New York|New York
Bob|Miami|Florida
Jane|Dallas|Texas

might, after a shuffle, look like this instead:

name|city|state
Jane|New York|Florida
Bob|Dallas|New York
John|Miami|Texas
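
For concreteness, here is a rough Python sketch of the column-wise shuffle itself. This is just an in-memory illustration of the operation, not the key-value data manager; the shuffle_columns helper and the toy rows are made up for the example.

    import random

    # Toy table matching the example above: a list of rows, each a dict.
    rows = [
        {"name": "John", "city": "New York", "state": "New York"},
        {"name": "Bob",  "city": "Miami",    "state": "Florida"},
        {"name": "Jane", "city": "Dallas",   "state": "Texas"},
    ]

    def shuffle_columns(table, columns=None):
        """Return a copy of the table with each chosen column permuted independently."""
        if not table:
            return []
        columns = columns or list(table[0].keys())
        shuffled = [dict(row) for row in table]   # copy so the input stays intact
        for col in columns:
            values = [row[col] for row in shuffled]
            random.shuffle(values)                # random permutation of this column only
            for row, value in zip(shuffled, values):
                row[col] = value
        return shuffled

    for row in shuffle_columns(rows):
        print(row["name"], row["city"], row["state"], sep="|")

Every value survives the shuffle, but the row it lands in is random, so cross-column relationships (e.g. John / New York / New York) are deliberately broken.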

I don't want to implement this feature if no one sees a reasonable need to do something like this.


  👤 jonahbenton Accepted Answer ✓
Generally, that technique was pretty common 20 years ago, but it is not a good strategy for either anonymization or for testing/capturing variations. It is actually a bad strategy for many/most anonymization use cases, and its use would count as a data breach. There is so much good tooling now for synthetic generation just at the data set level. And usually there are custom semantics "embedded" in the values and arities of data relationships in multi-dataset contexts, where a random "mixing" of real data would cause unhelpful, impossible breakage. So: no, don't implement it; there are better uses for your time.