# receipt-parser

> Receipt/invoice parser. Takes a list of PDF/images → pdftotext/tesseract → Anthropic Claude API for extraction → contenteditable HTML. We use the Claude Haiku model to reduce cost while maintaining high accuracy.

The receipt-parser is a tool that takes PDF files or images of receipts and invoices as input, and extracts the relevant information using optical character recognition (OCR) and the Anthropic Claude API. The extracted data is then presented in an easy-to-read, editable HTML format.

## System Requirements
- Node.js
- Bash
- Tesseract (`sudo apt install tesseract-ocr`)
- pdftotext (`sudo apt install poppler-utils`)
- Anthropic API key: <https://www.anthropic.com/api>
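
A quick pre-flight check can confirm that the command-line tools listed above are installed before running the parser. The sketch below is illustrative only and is not part of this repository: `missing_tools` is a hypothetical helper, and the tool names passed to it (`node`, `tesseract`, `pdftotext`) are assumed from the pipeline this README describes.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of this repository): print the name of every
# command given as an argument that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        # command -v exits non-zero when the command is not available
        command -v "$tool" > /dev/null 2>&1 || echo "$tool"
    done
}

# Example: check the tools this README assumes the parser needs
missing="$(missing_tools node tesseract pdftotext)"
if [ -n "$missing" ]; then
    # $missing is deliberately unquoted so the names print on one line
    echo "Missing tools:" $missing
fi
```

If the check prints nothing, every tool was found.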

## Usage
First, clone the repository: `git clone https://github.com/your-username/receipt-parser.git`

Then, install the required dependencies: `npm install`

Obtain an Anthropic API key from <https://www.anthropic.com/api>

Run the script like so:

```bash
./index.sh /path/to/receipt1.pdf /path/to/receipt2.pdf ...
```

The script takes a list of files as arguments.

For each PDF or image file (any image format Tesseract supports), the script will:

- Extract the text, using `pdftotext` or Tesseract OCR
- Convert that text into a machine-readable JSON object with the Anthropic Claude API
- Generate an HTML file containing the extracted data
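
The text extraction step above can be sketched roughly as follows. This is a minimal illustration and not the code in `index.sh`: the helper names `pick_extractor` and `extract_text` are hypothetical, and choosing the tool by file extension is an assumption about how PDFs and images might be told apart.

```bash
#!/usr/bin/env bash
# Hypothetical helper: choose a text extractor from the file extension.
# Anything that is not a .pdf (in any letter case) is treated as an image.
pick_extractor() {
    case "$1" in
        *.[Pp][Dd][Ff]) echo "pdftotext" ;;
        *)              echo "tesseract" ;;
    esac
}

# Hypothetical helper: write the text of one input file to stdout.
extract_text() {
    case "$(pick_extractor "$1")" in
        # "-" as the output file tells pdftotext to write to stdout
        pdftotext) pdftotext -layout "$1" - ;;
        # "stdout" as the output base tells Tesseract to write to stdout
        tesseract) tesseract "$1" stdout ;;
    esac
}
```

Calling `extract_text /path/to/receipt1.pdf` would then print the receipt's text, ready to be sent on to the extraction step.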

### Nautilus Script
To register the script as a Nautilus script (for easy right-click access in the file manager), follow these steps:

From the root of this repository, run this command:

```bash
ln -s "$(pwd)/src/index.sh" ~/.local/share/nautilus/scripts/parse-receipts
```

Then, restart Nautilus by running `nautilus -q` in the terminal.

After restarting Nautilus, you should be able to right-click on any PDF or image file and select "Scripts" > "parse-receipts" to run the receipt parser on the selected files.
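
If the entry does not show up in the "Scripts" menu, it may help to confirm that the symlink exists, points at a real file, and is executable (Nautilus expects scripts to be executable). The sketch below is a hypothetical troubleshooting helper, not part of this repository; the name `check_nautilus_script` and its status strings are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: report the state of a Nautilus script entry.
# $1 = path to the entry, e.g. ~/.local/share/nautilus/scripts/parse-receipts
check_nautilus_script() {
    local entry="$1"
    if [ ! -L "$entry" ]; then
        echo "missing"          # no symlink at that path
    elif [ ! -e "$entry" ]; then
        echo "broken"           # symlink exists, but its target does not
    elif [ ! -x "$entry" ]; then
        echo "not-executable"   # target exists but lacks the execute bit
    else
        echo "ok"
    fi
}

check_nautilus_script "$HOME/.local/share/nautilus/scripts/parse-receipts"
```

A result of `broken` usually means the repository was moved after the link was created, in which case re-running the `ln -s` command from the new location should fix it.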


## Contributing
Contributions are very welcome - both issues and pull requests! Please mention in your pull request that you release your work under the GPL-3.0 (see below).

See [CONTRIBUTING.md](./CONTRIBUTING.md) for a guide on what to expect when submitting a pull request or issue to this project.

If you're feeling that way inclined, the sponsor button at the top of the page (if you're on GitHub) will take you to my [Liberapay profile](https://liberapay.com/sbrl), where you can donate to say an extra thank you :-)


## License
This project is released under the GNU General Public License v3.0. The full license text is included in the `LICENSE` file in this repository. TLDRLegal has a [great summary](https://www.tldrlegal.com/license/gnu-general-public-license-v3-gpl-3) of the license if you're interested.