I have acquired several very large files. Specifically, CSVs of 100+ GB.
I want to search for text in these files faster than manually running grep.
To do this, I need to index the files, right? Would something like Aleph be good for this? It seems like the right tool…
https://github.com/alephdata/aleph
Any other tools for doing this?
I’ve used Java Scanner objects to do this extremely efficiently, with minimal memory required, even with multiple parallel searches (sketch below). Indexing is only worthwhile if you’ll search the data many times and don’t know in advance exactly what you’ll be searching for. For a one-time search it’s not going to help; grep is honestly going to be faster and more efficient for most one-off searches.
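Here's a minimal sketch of that Scanner approach, just to make it concrete (the class name and command-line arguments are placeholders, not anything standard). It streams the file line by line, so memory use stays roughly constant no matter how big the CSV is:

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

public class StreamSearch {
    public static void main(String[] args) throws FileNotFoundException {
        // Both arguments are placeholders: the path to the big CSV and
        // the literal text to search for.
        String path = args[0];
        String needle = args[1];

        long lineNo = 0;
        // Scanner reads the file through an internal buffer, so only one
        // line is held in memory at a time regardless of file size.
        try (Scanner in = new Scanner(new File(path))) {
            while (in.hasNextLine()) {
                String line = in.nextLine();
                lineNo++;
                if (line.contains(needle)) {
                    System.out.println(lineNo + ": " + line);
                }
            }
        }
    }
}
```

For parallel searches you'd just run one of these per file (or per file chunk); since each pass is disk-bound anyway, as noted below, a handful of them is usually enough to saturate the drive.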
The initial indexing or searching of the files will be bottlenecked by the speed of the disk the files are on, no matter what you do. Indexing only helps because future searches can run against a smaller, faster-to-access structure instead of rescanning the raw files.
So it greatly depends on what you need to search for and how often. The trade-off is the memory and storage the index consumes, and it only pays off across multiple searches of the data you chose to index from the files in the first pass.