Edit: Guys this is just an example. I'm looking for a general solution. It could be 10 million rows.
I had a similar problem at work where I needed to run some formulas over a 5 GB CSV file. Excel can't handle more than 1M rows, and a database through the command line is too clunky. I did try splitting the CSV into multiple files first, but applying formulas across multiple files isn't easy. Eventually I built a desktop GUI wrapper on SQLite, and it grew into Superintendent.app (now powered by DuckDB).
The newest version supports "Ask AI", which can be used to "ask your CSVs anything" and to have the AI write/rewrite SQL for you. It has been pretty fun to tell it to "Expand *", "format all dates to something like May 3, 2024", and so on.
I don't know the API or a recommended scripting language. This would be a good task for ChatGPT or an equivalent. Enough to get started.
edit: I asked ChatGPT; it recommended Python and pandas for interacting with Excel.
import pandas as pd
# Load the data from an Excel file, assuming headers are in the first row by default
data = pd.read_excel('path/to/your/file.xlsx')
# Define the number of records per chunk
chunk_size = 500
# Split the data into chunks and write each chunk to a new Excel file
for i in range(0, len(data), chunk_size):
# Extract the chunk of data based on the current index range
chunk = data.iloc[i:i + chunk_size]
# Write the chunk to a new Excel file, including headers as column names
chunk.to_excel(f'output_{i // chunk_size + 1}.xlsx', index=False)
I asked about the first row (the header), and it claims pandas includes that in each chunk. That tracks: read_excel takes the first row as column names, and to_excel writes them back out by default.
It's at least a place to start; you'd want to iterate further with real code and tests.
library(tidyverse)
library(readxl)
library(writexl)
read_excel("file.xlsx") %>%
  group_by(group_id = (row_number() - 1) %/% 20) %>%  # 0-based, so every group gets exactly 20 rows
  group_walk(~ write_xlsx(.x, paste0("file_", .y$group_id, ".xlsx")))
edit: updated to write xlsx instead of csv
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
# chunk.pl -- split csv files into chunks
# Usage message and exit if needed
if (!@ARGV || $ARGV[0] eq '-h') {
print "Usage: $0 input_csv [chunk_size] [output_filename_format] [separator]\n";
print "Example: $0 input.csv 500 'input-%08d.csv' ','\n";
exit;
}
# Set command-line arguments
my ($INFILE, $CHUNKSIZE, $FMT, $SEP) = @ARGV;
$CHUNKSIZE //= 500;
$FMT //= "data-%08d.csv";
$SEP //= ",";
# Initialize CSV, file handles, and counters
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1, sep_char => $SEP, eol => "\n" });
my ($i, $f, $out) = (0, 1, undef);
open my $in, "<:encoding(UTF-8)", $INFILE or die "Cannot open $INFILE: $!";
# Main loop
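# Note: the first line (the header) is treated like any other row, so it
# only ends up in the first chunk.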
while (my $row = $csv->getline($in)) {
if ($i % $CHUNKSIZE == 0) {
close $out if defined $out;
open $out, ">:encoding(UTF-8)", sprintf($FMT, $f++) or die "Cannot open output file: $!";
}
$csv->print($out, $row) or die "Failed to write row: $!";
$i++;
}
# Clean up: close file handles
close $out if defined $out;
close $in;
' Split Sheet1 into 500-row workbooks, repeating the header row in each
Dim rng As Range: Set rng = Sheet1.UsedRange
Dim rowCount As Long: rowCount = rng.Rows.Count
Dim colCount As Long: colCount = rng.Columns.Count
Const SIZE As Long = 500
Dim i As Long
For i = 2 To rowCount Step SIZE
    ' Don't read past the last row on the final chunk
    Dim n As Long: n = WorksheetFunction.Min(SIZE, rowCount - i + 1)
    Dim wb As Workbook: Set wb = Workbooks.Add()
    wb.Sheets(1).Range("A1").Resize(1, colCount).Value = rng.Resize(1).Value
    wb.Sheets(1).Range("A2").Resize(n, colCount).Value = rng.Offset(i - 1).Resize(n, colCount).Value
    Call wb.SaveAs("C:\Temp\" & i & ".xlsx", xlOpenXMLWorkbook)
    wb.Close SaveChanges:=False
Next
(require 'csv-mode)
(let ((input-file "bigfile.xls")
(output-dir "chunked/")
(chunk-size 1000))
(cl-labels ((read-csv-file (filename)
(with-temp-buffer
(insert-file-contents filename)
(csv-mode)
(csv-parse-buffer t)))
(write-csv-file (filename data)
(with-temp-buffer
(csv-mode)
(dolist (row data)
(csv-insert-list row))
(write-region (point-min) (point-max) filename)))
(chunk-data (data chunk-size)
(let (result)
(while data
(push (cl-subseq data 0 (min chunk-size (length data))) result)
(setq data (nthcdr chunk-size data)))
(nreverse result))))
(let* ((data (read-csv-file input-file))
(chunks (chunk-data data chunk-size))
(file-count 1))
(dolist (chunk chunks)
(let ((output-file (expand-file-name (format "chunk-%04d.csv" file-count) output-dir)))
(write-csv-file output-file chunk)
(setq file-count (1+ file-count))))
(message "Processing complete. %d files created." (1- file-count)))))
(This is a joke. Do not use this.)
I would strongly consider dumping rows into SQL, for more "natural" selection.
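A minimal sketch of that with Python's stdlib (the file, database, and table names are just placeholders), assuming the CSV has a header row:

import csv
import sqlite3

con = sqlite3.connect("rows.db")
with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    cols = ", ".join(f'"{h}"' for h in header)
    marks = ", ".join("?" for _ in header)
    # Untyped columns: everything lands as TEXT, so cast in queries if needed
    con.execute(f'CREATE TABLE IF NOT EXISTS data ({cols})')
    con.executemany(f"INSERT INTO data VALUES ({marks})", reader)
    con.commit()

# ...and now selection really is natural
for row in con.execute("SELECT * FROM data LIMIT 5"):
    print(row)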
For readable/non-binary files, there are standard tools like split, awk, etc.
If it had 100k rows, I'd be out of my depth, so I'd hit google.
It can't be. Excel's maximum row limit is only 1,048,576 per sheet.
Opening and working on an Excel file with a few million rows can need more RAM than anticipated, especially depending on its size/complexity.
The quickest way I'd start: convert it to a CSV, read the file in, and rewrite it out 500 lines at a time.
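Roughly, in Python with the stdlib csv module (filenames made up), streaming the input and repeating the header in every chunk:

import csv

CHUNK = 500  # data rows per output file

with open("input.csv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f)
    header = next(reader)
    out = None
    for n, row in enumerate(reader):
        if n % CHUNK == 0:           # time to start a new chunk file
            if out:
                out.close()
            out = open(f"part_{n // CHUNK + 1:04d}.csv", "w", newline="", encoding="utf-8")
            writer = csv.writer(out)
            writer.writerow(header)  # repeat the header in each file
        writer.writerow(row)
    if out:
        out.close()

Memory use stays flat, since only one row is held at a time.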
If you're worried about data not fitting in memory, then stream the file. It seems like the Java API has support for this, surely other languages do too.
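Python can stream xlsx too, for what it's worth: openpyxl's read-only and write-only modes parse and emit rows lazily instead of materializing the whole workbook. A rough sketch, assuming a big.xlsx where you want cell values (not formulas):

from openpyxl import Workbook, load_workbook

CHUNK = 500

src = load_workbook("big.xlsx", read_only=True)  # rows are parsed lazily
rows = src.active.iter_rows(values_only=True)
header = next(rows)

def flush(buf, part):
    wb = Workbook(write_only=True)  # output is streamed as well
    ws = wb.create_sheet()
    ws.append(header)
    for r in buf:
        ws.append(r)
    wb.save(f"chunk_{part:04d}.xlsx")

part, buf = 0, []
for row in rows:
    buf.append(row)
    if len(buf) == CHUNK:
        part += 1
        flush(buf, part)
        buf = []
if buf:
    flush(buf, part + 1)
src.close()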
This turns into a massive graph-theory nightmare if there are lots of cell references going on.
Xlsx or xls?
If it’s xlsx, stream the file, chunking 500 rows into each new file, watching for ref A1, then injecting that into previous files.
The xlsx format is a zip file of xml, not so bad to work with once you start.
I should note, this works best if you don’t care about the order of the rows. They technically aren’t required to be in order (I think).
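You can see that structure with any zip reader, e.g. Python's stdlib (filename assumed):

import zipfile

with zipfile.ZipFile("book.xlsx") as z:
    print(z.namelist())                       # the XML parts in the archive
    xml = z.read("xl/worksheets/sheet1.xml")  # the first sheet's rows
    # Each <row> carries its index in an r="..." attribute, which is why
    # the rows aren't strictly required to appear in order.
    print(xml[:200])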
Not even going to ask AI. Waiting for a bash one-liner before OP reminds us they are on corporate Windows machine.
If the input is not a spreadsheet, use a tool appropriate to the scale of the data store. Provide an API for access from Excel if end users need Excel.
If the input is an Excel file but the integrity of the data as a spreadsheet doesn't matter, treat it as text: export to CSV and use the standard text tools.
If it is an Excel already, Excel provides end users with access from Excel. If the Excel is likely to outgrow Excel, use a tool appropriate to the scale of the data. Provide an api for access from Excel if end users need Excel.
If it-is-Excel is your problem, breaking it up into multiple Excels is slicing the magic broom: every piece comes back as another broom. Good luck.
If you actually had to do it 'properly', there are a ton of options:
- do it old school with a VBA macro
- use the newer JS macro stuff (Office Scripts)
- xlsx files are just a zip of XML, so you could do it in pretty much any language