If you can get away with storing your stuff in text files, you should store stuff in text files. Text files are the shit because everything’s a text file.
- Files are readable. Anyone can read a file. Even if encodings and separators get in the way, you can just haxxor around that with a neat script. You can easily distribute files across a network using a simple httpd.
- Files are easy to back up. And restore. It’s just a file! You can put some of the files on one disk, some on the other. Hey! Why not stick some files in the cloud? The cloud will know what to do with them.
- Files are manageable. It’s easy to see if you have files or don’t have files. ‘Today I have 500 files, yesterday I only had 20. What’s with all the new files?’ And if you get too many files, you can just compress some (`gzip` does it in place) and put them somewhere safe. Or delete them, if you don’t like them. You’ll get your hard drive space back instantly.
- Files are reliable. I have great difficulty imagining you’ll be able to re-read your mysql dumps in 20 years’ time. By contrast, I can’t imagine you’ll ever have any problems reading your text files.
- Files are all around you. Pretty much everything is awesome at reading files. There are no drivers to learn or break. Everyone reads and writes files so that’s generally on the short list for early optimization for any operating system, file system, VM, programming language, etc.
Files are AWESOME!
Appending to a file is a breeze. Reading is a breeze. For reasonably sized files, you can use a whole bunch of free, existing and reliable tools, like `grep`, `cut`, `sort`, `uniq`, `wc`, etc. Check out this little example called `data.txt`:

```
v1;Alice;Burt AB;SmltbXkgSG9mZmE=
v1;Bob;Burt AB;Kg==
v1;Alice;Somewhere Else;aHR0cDovL2dvby5nbC94VWFocA==
v1;Carol;Somewhere Else;QnJhZCBQaXR0
```

Wanna know how many entries there are in `data.txt`?

```
$ wc -l data.txt
4 data.txt
```

How many unique people are there? (i.e., using ‘;’ as a delimiter, pick the second field, sort these values, figure out the uniques and count the number of lines)

```
$ cut -d';' -f2 data.txt | sort | uniq | wc -l
3
```

How many places has each person worked at? (i.e., using ‘;’ as a delimiter, pick the second field, sort these values, figure out the uniques and tell me how many times each unique appears)

```
$ cut -d';' -f2 data.txt | sort | uniq -c
   2 Alice
   1 Bob
   1 Carol
```

Which ones have never worked at Burt? (i.e., find all lines that don’t include Burt AB and pick the second field of each line, using ‘;’ as delimiter)

```
$ grep -v "Burt AB" data.txt | cut -d';' -f2
Alice
Carol
```

But that’s easy. What if it wasn’t? Imagine you have millions of these rows and you want to be able to get at the data (in this case base64 encoded) quickly. What you want to do is create an index of all offsets within a file and keep that in memory so that you can quickly seek to the exact byte in your file. Have a little look here.

```ruby
# indexer
name_index = Hash.new { |h, k| h[k] = [] }
File.open('data.txt') do |f|
  until f.eof?
    # Grab the position before reading, so it points at the start of the line.
    pos, name = f.pos, f.readline.split(';')[1]
    name_index[name] << pos
  end
end
puts name_index
```

How simple was that? It’s brilliant! Before we read a line, we save the current file position. Then we parse the line for the name and save the offset in a list of offsets for that name. We can now save this index in the file `data_name.idx` or keep it in memory. Keeping track of different files is also a piece of cake: just bundle the file name together with the index. Reading is quick and painless.

```ruby
# reader
def read_data(file, offset)
  File.open(file) do |f|
    f.seek(offset, IO::SEEK_SET)     # jump straight to the start of the line
    f.readline.chomp.split(';')[3]   # the fourth field is the base64 payload
  end
end

name_index['Alice'].each { |offset| puts 'Alice worked with ' + read_data('data.txt', offset) }
```

You can easily adjust the indexer to index as you write.
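Here’s a rough sketch of what indexing as you write could look like. It assumes the same `v1;name;company;payload` line layout as `data.txt`. The `IndexedLog` class, its method names and the choice of JSON for the index file are all made up for illustration; none of it comes from the indexer above.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical helper: appends records to a data file and keeps the
# name index up to date as it writes.
class IndexedLog
  attr_reader :name_index

  def initialize(data_path, index_path)
    @data_path = data_path
    @index_path = index_path
    # Reuse a saved index if there is one, otherwise start empty.
    saved = File.exist?(index_path) ? JSON.parse(File.read(index_path)) : {}
    @name_index = Hash.new { |h, k| h[k] = [] }.merge!(saved)
  end

  # Append one line and record the byte offset where it starts.
  def append(name, company, payload)
    File.open(@data_path, 'ab') do |f|
      offset = f.size # the current size is where the new line will begin
      f.write("v1;#{name};#{company};#{payload}\n")
      @name_index[name] << offset
    end
  end

  # Write the index to disk so it survives a restart (JSON is an assumption).
  def save_index
    File.write(@index_path, JSON.generate(@name_index))
  end

  # Seek straight to each indexed line and return its fourth field.
  def payloads_for(name)
    return [] unless @name_index.key?(name)

    File.open(@data_path, 'rb') do |f|
      @name_index[name].map do |offset|
        f.seek(offset, IO::SEEK_SET)
        f.readline.chomp.split(';')[3]
      end
    end
  end
end

# Usage, in a throwaway directory:
Dir.mktmpdir do |dir|
  log = IndexedLog.new(File.join(dir, 'data.txt'), File.join(dir, 'data_name.idx'))
  log.append('Alice', 'Burt AB', 'SmltbXkgSG9mZmE=')
  log.append('Bob', 'Burt AB', 'Kg==')
  log.append('Alice', 'Somewhere Else', 'aHR0cDovL2dvby5nbC94VWFocA==')
  log.save_index
  puts log.payloads_for('Alice')
end
```

The writer never has to re-scan the file: it already knows where each new line starts, because that is simply the file size before the write. This sketch assumes a single writer. With several processes appending at once, the size check and the write would need a lock around them.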
So, why doesn’t everyone use files then?
Well… I guess.. editing them can be a bit of a hassle. Renaming Alice in the above `data.txt` quickly and efficiently.. ‘requires some finesse’. And you sort of screw up your entire index if you do that: change the length of one line and the byte offset of every line after it shifts. So that kinda sucks. You COULD of course just keep adding lines and keep track of outdated entries by datestamps and indexes, but.. Hassle.

Also, there’s the thing with things getting too big for a single file. Or a single machine. This is probably a good thing though, since you really don’t want to be building a system that relies on having access to all your data reliably, coherently and instantly in a single place at all times. But it isn’t really what you’d call ‘convenient’, is it?
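For what it’s worth, here is a minimal sketch of the ‘just keep adding lines’ approach to renaming. It assumes the index is a plain in-memory hash of name to offsets, like the one the indexer builds. The `rename` function is hypothetical. It also skips the datestamp part entirely, so rebuilding the index from the file alone would pick up the outdated lines again.

```ruby
require 'tmpdir'

# Hypothetical helper: "renames" a person by appending corrected copies of
# their lines and repointing the in-memory index at those copies. The old
# lines are never touched, so every existing offset stays valid.
def rename(path, name_index, old_name, new_name)
  old_offsets = name_index.delete(old_name) || []
  File.open(path, 'r+b') do |f|
    old_offsets.each do |offset|
      f.seek(offset, IO::SEEK_SET)
      version, _old, company, payload = f.readline.chomp.split(';')
      f.seek(0, IO::SEEK_END) # the new copy goes at the end of the file
      (name_index[new_name] ||= []) << f.pos
      f.write([version, new_name, company, payload].join(';') + "\n")
    end
  end
  old_offsets # offsets of the now-outdated lines, handy for a later cleanup
end

# Usage, in a throwaway directory:
Dir.mktmpdir do |dir|
  path = File.join(dir, 'data.txt')
  File.write(path, "v1;Alice;Burt AB;SmltbXkgSG9mZmE=\nv1;Bob;Burt AB;Kg==\n")
  name_index = { 'Alice' => [0], 'Bob' => [34] }
  rename(path, name_index, 'Alice', 'Alicia')
  p name_index
end
```

Nothing in the file moves, so the index stays correct. The price is dead lines piling up until you rewrite the file and re-index it. Hassle, as promised.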
Man! This has evolved into a monolith of a post. Jesus. Tell you what! I’m gonna have a stack of tens on my desk in the morning. Anyone who’s read this far can just come by, share a nod and possibly a quiet fist-bump and help themselves to the ol’ tenner. Karl Gustavsgatan 1A, Gothenburg.
[1] this is vital because you WILL change the format at some point, it’s inevitable