diff --git a/rainfallwrangler/README.md b/rainfallwrangler/README.md
index bed227e..bdaf791 100644
--- a/rainfallwrangler/README.md
+++ b/rainfallwrangler/README.md
@@ -2,23 +2,67 @@

 > Wrangles rainfall radar and water depth data into something sensible.

-
 This Node.js-based tool is designed for wrangling rainfall, heightmap, and water depth data into something that the image semantic segmentation model that is the main feature of this repository can understand. The reason for this is efficiency: nothing less than a set of `.tfrecord` files for reading in parallel is sufficient if one wants the model to train in a reasonable length of time.

-TODO: Write a guide for this tool here.
-

 ## System requirements
+ - Linux (Windows *may* work but is untested. You will probably have a bad day if you use Windows)
+ - [Node.js](https://nodejs.org/en) v16+
+ - Python 3.8+ (encoding .tfrecord files, as all existing `npm` packages for doing this *suck*)
+ - Experience with the terminal
+ - Lots of time and patience



 ## Getting started
+This tool, unlike [`nimrod-data-downloader`](https://www.npmjs.com/package/nimrod-data-downloader) and [`terrain50-cli`](https://www.npmjs.com/package/terrain50-cli), is not published to `npm`. This is because of the rather niche use-case this tool has.
+To get started, first clone this git repository:
+
+```bash
+git clone git@github.com:sbrl/research-rainfallradar.git;
+cd research-rainfallradar/rainfallwrangler;
+```
+
+Then, install dependencies:
+
+```bash
+npm install
+pip3 install --user -r requirements.txt
+```
+
+The entrypoint for the tool is at `src/index.mjs`. Call it like so:
+
+```bash
+src/index.mjs --help
+```
+
+It has 4 subcommands:
+
+- **recordify:** Converts a `.asc` heightmap, a concatenated `.asc` water depths file (output from [HAIL-CAESAR](https://github.com/sbrl/HAIL-CAESAR)), and a [`nimrod-data-downloader`](https://www.npmjs.com/package/nimrod-data-downloader) rainfall radar directory into an intermediate `.jsonl.gz` dataset. Defaults to putting 4096 samples per file.
+- **uniq:** Deduplicates samples across an entire `.jsonl.gz` dataset. Basically hashes all samples with SHA256, marks duplicate hashes for deletion, and then files through all files in the dataset to remove those slated for deletion.
+- **recompress:** Recompresses a `.jsonl.gz` dataset to ensure that (by default, 4096) samples are in each file. Needed after `uniq` since `uniq` can leave different numbers of records in each file.
+- **jsonl2tfrecord:** Converts the aforementioned `.jsonl.gz` dataset into a `.tfrecord` dataset that the DeepLabV3+ model can understand.
+
+All of these subcommands, where possible, operate in parallel. The general workflow is:
+
+1. `recordify`
+2. `uniq`
+3. `recompress`
+4. `jsonl2tfrecord`
+
+Full help for each command is available if you call `--help`:
+
+```bash
+src/index.mjs --help # Show general help for everything
+src/index.mjs recordify --help # Show specific help for the recordify subcommand
+```



 ## Contributing
+Contributions are very welcome - both issues and pull requests! Please mention in any pull requests that you release your work under the AGPL-3 (see below).



 ## Licence
-Same as that of the main repository. TODO expand on this.
\ No newline at end of file
+Same as that of the main repository. All the code in this repository is released under the GNU Affero General Public License 3.0 unless otherwise specified. The full license text is included in the [`LICENSE.md` file](./LICENSE.md) in this repository. GNU [have a great summary of the licence](https://www.gnu.org/licenses/#AGPL) which I strongly recommend reading before using this software.
\ No newline at end of file
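
For readers of this diff, the `uniq` and `recompress` steps described in the README text above can be sketched roughly as follows. This is only an illustration and not the tool's actual code (which is Node.js and works on `.jsonl.gz` files). Everything here is an assumption made for the example: the function names, holding each "file" in memory as a list of dicts, hashing a sorted-key JSON encoding of each sample, and keeping the first copy of a duplicate. The README only states that samples are hashed with SHA256, that duplicates are marked and then removed, and that files are afterwards refilled to a fixed size.

```python
import hashlib
import json
from typing import Dict, List, Set, Tuple

Sample = Dict[str, object]


def sample_hash(sample: Sample) -> str:
    """SHA-256 hex digest of a sorted-key JSON encoding of one sample.

    Sorting keys is an assumption so that key order does not affect the hash.
    """
    encoded = json.dumps(sample, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(encoded.encode("utf-8")).hexdigest()


def find_duplicates(files: List[List[Sample]]) -> Set[Tuple[int, int]]:
    """Pass 1: mark (file index, sample index) of every repeated sample."""
    seen: Set[str] = set()
    to_delete: Set[Tuple[int, int]] = set()
    for file_index, samples in enumerate(files):
        for sample_index, sample in enumerate(samples):
            digest = sample_hash(sample)
            if digest in seen:
                to_delete.add((file_index, sample_index))
            else:
                seen.add(digest)
    return to_delete


def remove_marked(
    files: List[List[Sample]], to_delete: Set[Tuple[int, int]]
) -> List[List[Sample]]:
    """Pass 2: rewrite each file without the samples marked for deletion."""
    return [
        [s for j, s in enumerate(samples) if (i, j) not in to_delete]
        for i, samples in enumerate(files)
    ]


def rechunk(
    files: List[List[Sample]], samples_per_file: int = 4096
) -> List[List[Sample]]:
    """Redistribute samples so every file except possibly the last is full."""
    flat = [sample for samples in files for sample in samples]
    return [
        flat[start:start + samples_per_file]
        for start in range(0, len(flat), samples_per_file)
    ]


if __name__ == "__main__":
    # Tiny hypothetical dataset: three "files" with repeated samples.
    dataset = [
        [{"id": 1}, {"id": 2}, {"id": 1}],
        [{"id": 2}, {"id": 3}],
        [{"id": 4}, {"id": 1}, {"id": 5}],
    ]
    deduplicated = remove_marked(dataset, find_duplicates(dataset))
    print([len(f) for f in deduplicated])  # file sizes are now uneven
    print([len(f) for f in rechunk(deduplicated, samples_per_file=2)])
```

The uneven file sizes left after deduplication are why the README lists `recompress` as the step that follows `uniq`.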