I have acquired several very large files. Specifically, CSVs of 100+ GB.
I want to search for text in these files faster than manually running grep.
To do this, I need to index the files, right? Would something like Aleph be good for this? It seems like the right tool…
https://github.com/alephdata/aleph
Any other tools for doing this?
An RDBMS shines on get-by-ID lookups. Queries where the value starts with the search term should also work well, because they can use an ordinary index. Queries where the term sits in the middle of the value generally don’t perform well, since the database usually has to scan every row. So if you’re querying on exact values it’ll go pretty smoothly, but if you’re querying on ‘deniro’ while the value contains ‘bob deniro’ it’ll be less performant. Since it’s just for personal use that might not matter too much, and it’s possible it works well enough for your case.
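To make the exact / starts-with / in-the-middle difference concrete, here is a small sketch using SQLite from Python's standard library. The table `people`, the index `idx_people_name` and the sample rows are made up for illustration, and other databases may plan these queries differently. It asks SQLite for its query plan: "SEARCH" means the index is used, "SCAN" means every row is read.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory table with an ordinary index on name."""
    conn = sqlite3.connect(":memory:")
    # NOCASE collation lets SQLite use the index for case-insensitive LIKE 'abc%'.
    conn.execute(
        "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT COLLATE NOCASE)"
    )
    conn.execute("CREATE INDEX idx_people_name ON people(name)")
    conn.executemany(
        "INSERT INTO people(name) VALUES (?)",
        [("bob deniro",), ("al pacino",), ("bobby fischer",)],
    )
    return conn


def query_plan(conn, sql):
    """Return the human-readable plan steps SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]  # 4th column holds the description


conn = make_demo_db()
exact = "SELECT id FROM people WHERE name = 'bob deniro'"
prefix = "SELECT id FROM people WHERE name LIKE 'bob%'"
middle = "SELECT id FROM people WHERE name LIKE '%deniro%'"

for label, sql in [("exact", exact), ("prefix", prefix), ("middle", middle)]:
    print(label, query_plan(conn, sql))
```

On a three-row table the scan is instant, of course; on 100+ GB the scan is the part that hurts.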
Elasticsearch is well known for text searches and being incredibly flexible with queries and filtering. https://www.elastic.co/
Manticore is one that’s been on my check-it-out list for I don’t know how long. It looks great imo: https://manticoresearch.com/
OpenSearch: https://opensearch.org/
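As far as I know, the reason these search engines handle the ‘deniro’ inside ‘bob deniro’ case well is that they split text into words and keep an inverted index (word -> rows containing it). The toy version below is only meant to show the idea; it is not how any of these products is actually implemented, and the function names are made up.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"\w+")


def tokenize(text):
    """Split text into lowercase word tokens."""
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_inverted_index(rows):
    """Map each token to the set of row numbers that contain it."""
    index = defaultdict(set)
    for row_id, text in enumerate(rows):
        for token in tokenize(text):
            index[token].add(row_id)
    return index


def search(index, query):
    """Return sorted row numbers containing every word in the query."""
    tokens = tokenize(query)
    if not tokens:
        return []
    # Intersect the posting sets; a word missing from the index yields no hits.
    matches = set.intersection(*(index.get(token, set()) for token in tokens))
    return sorted(matches)


rows = ["bob deniro", "al pacino", "Robert Deniro Jr"]
index = build_inverted_index(rows)
print(search(index, "deniro"))
```

A lookup is a dictionary access per query word, no matter where the word sits in the value, which is why this scales so much better than scanning.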
Disclaimer: I haven’t really used any RDBMS extensively for years, so it’s possible some of them have since added more performant full-text search support.
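One example I do know of is SQLite's FTS5 extension. Below is a rough sketch of streaming a CSV into an FTS5 table in batches and then searching it. It assumes your Python's SQLite build includes FTS5 (most do, but not all), and the two-column layout `name,note`, the table name `docs` and the function names are made up for illustration. I haven't tried this at 100+ GB.

```python
import csv
import io
import sqlite3


def load_csv_into_fts(conn, csv_file, batch_size=10_000):
    """Stream a two-column CSV (name, note) into an FTS5 table in batches."""
    conn.execute("CREATE VIRTUAL TABLE docs USING fts5(name, note)")
    reader = csv.reader(csv_file)
    next(reader)  # skip the header row
    insert_sql = "INSERT INTO docs(name, note) VALUES (?, ?)"
    batch, total = [], 0
    for row in reader:
        batch.append(row)
        if len(batch) >= batch_size:
            conn.executemany(insert_sql, batch)
            total += len(batch)
            batch.clear()
    if batch:  # flush whatever is left over
        conn.executemany(insert_sql, batch)
        total += len(batch)
    conn.commit()
    return total


def search_fts(conn, query):
    """Return the name column of every row matching the full-text query."""
    rows = conn.execute(
        "SELECT name FROM docs WHERE docs MATCH ? ORDER BY rowid", (query,)
    )
    return [row[0] for row in rows]


sample = io.StringIO("name,note\nbob deniro,actor\nal pacino,actor\n")
conn = sqlite3.connect(":memory:")
load_csv_into_fts(conn, sample)
print(search_fts(conn, "deniro"))
```

For the real files you would pass `open(path, newline="")` instead of the `StringIO` and connect to a file on disk rather than `:memory:`.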
Aleph also seems to be able to cross-reference data between documents. I don’t think any of the ones listed above do this, but I also don’t know if it’s part of your requirements.