Metadata-Version: 2.1
Name: langchain-utils
Version: 0.9.0
Summary: Utilities built upon the langchain library
Home-page: https://github.com/tddschn/langchain-utils
License: MIT
Keywords: langchain,utils,LLM,prompts,CLI
Author: Teddy Xinyuan Chen
Author-email: 45612704+tddschn@users.noreply.github.com
Requires-Python: >=3.11,<4.0
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Topic :: Utilities
Requires-Dist: apple-vision-utils (>=1.0.2,<2.0.0) ; sys_platform == "darwin"
Requires-Dist: bs4 (>=0.0.1,<0.0.2)
Requires-Dist: langchain (>=0.0.351,<0.0.352)
Requires-Dist: langchain-core (==0.1.1)
Requires-Dist: markdownify (>=0.11.6,<0.12.0)
Requires-Dist: pymupdf (>=1.21.1,<2.0.0)
Requires-Dist: pync[notify] (>=2.0.3,<3.0.0)
Requires-Dist: pyperclip (>=1.8.2,<2.0.0)
Requires-Dist: pytube (>=15.0.0,<16.0.0)
Requires-Dist: selenium (>=4.8.3,<5.0.0)
Requires-Dist: tiktoken (>=0.3.3,<0.4.0)
Requires-Dist: tqdm (>=4.65.0,<5.0.0)
Requires-Dist: typing-extensions (==3.10.0.2)
Requires-Dist: unstructured (==0.5.2)
Requires-Dist: youtube-transcript-api (>=0.5.0,<0.6.0)
Project-URL: Bug Tracker, https://github.com/tddschn/langchain-utils/issues
Project-URL: Repository, https://github.com/tddschn/langchain-utils
Description-Content-Type: text/markdown

# langchain-utils

LangChain Utilities

- [langchain-utils](#langchain-utils)
  - [Prompt generation using LangChain document loaders](#prompt-generation-using-langchain-document-loaders)
    - [Demos](#demos)
    - [`pandocprompt`](#pandocprompt)
    - [`urlprompt`](#urlprompt)
    - [`pdfprompt`](#pdfprompt)
    - [`ytprompt`](#ytprompt)
    - [`textprompt`](#textprompt)
    - [`htmlprompt`](#htmlprompt)
  - [Installation](#installation)
    - [pipx](#pipx)
    - [pip](#pip)
  - [Develop](#develop)

## Prompt generation using LangChain document loaders

Do you find yourself frequently copy-pasting texts from the web / PDFs / other documents into ChatGPT?

If yes, these tools are for you!

Optimized to feed into a chat interface (like ChatGPT) manually in one or multiple (to get around context length limits) goes.

Basically, the prompts generated look like this:

```python
REPLY_OK_IF_YOU_READ_TEMPLATE = '''
Below is {what}, reply "OK" if you read:

"""
{content}
"""
'''.strip()
```

You can feed it directly to a chat interface like ChatGPT, and ask follow up questions about it.

See [`prompts.py`](./langchain_utils/prompts.py) for other variations.

### Demos

- Loading `https://github.com/tddschn/langchain-utils` and copy to clipboard:

<!-- create a video tag with https://user-images.githubusercontent.com/45612704/231729153-341bd962-28cc-40a3-af8b-91e038ccaf6c.mp4 -->

<video src="https://user-images.githubusercontent.com/45612704/231729153-341bd962-28cc-40a3-af8b-91e038ccaf6c.mp4" controls width="100%"></video>

- Load 3 pages of a pdf file, open each part for inspection before copying, and optionally merge 3 pages into 2 prompts that wouldn't go over the `gpt-3.5-turbo`'s context length limit with langchain's `TokenTextSplitter`.

<!-- for https://user-images.githubusercontent.com/45612704/231731553-63cf3cef-a210-4761-8ca3-dd47bedc3393.mp4 -->

<video src="https://user-images.githubusercontent.com/45612704/231731553-63cf3cef-a210-4761-8ca3-dd47bedc3393.mp4" controls width="100%"></video>

### `pandocprompt`

```
$ pandocprompt --help

usage: pandocprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                    [-P PARTS [PARTS ...]] [-r] [-R]
                    [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                    [-w WHAT] [-M] [--from PANDOC_FROM_FORMAT]
                    [--to PANDOC_TO_FORMAT]
                    [PATH ...]

Get prompts from arbitrary files. You need to have `pandoc` installed and in
$PATH, it will be used to convert source files to desired (hopefully textual)
format. Common use cases: Getting prompts from EPub books or several TeX
files.

positional arguments:
  PATH                  Paths to the text files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processes list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF
                        content in the prompt (default: the content of a
                        document)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  --from PANDOC_FROM_FORMAT
                        The format that is passed to -f in pandoc (default:
                        None)
  --to PANDOC_TO_FORMAT
                        The format that is passed to -t in pandoc. gfm-
                        raw_html means GitHub Flavored Markdown with raw HTML
                        stripped. (default: gfm-raw_html)

```
### `urlprompt`

```
$ urlprompt --help

usage: urlprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                 [-P PARTS [PARTS ...]] [-r] [-R]
                 [--print-percentage-non-ascii] [-n] [--out OUT] [-w WHAT]
                 [-M] [-j] [-g] [--github-path GITHUB_PATH]
                 [--github-revision GITHUB_REVISION] [--substack]
                 URL

Get a prompt consisting the text content of a webpage

positional arguments:
  URL                   URL to the webpage

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processes list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF
                        content in the prompt (default: the content of a
                        webpage)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  -j, --javascript      Use JavaScript to render the page (default: False)
  -g, --github          Load the raw file from a GitHub URL (default: False)
  --github-path GITHUB_PATH
                        Path to the GitHub file (default: README.md)
  --github-revision GITHUB_REVISION
                        Revision for the GitHub file (default: master)
  --substack            Load from a Substack URL and convert it to Markdown
                        (default: False)

```
### `pdfprompt`

```
$ pdfprompt --help

usage: pdfprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                 [-P PARTS [PARTS ...]] [-r] [-R]
                 [--print-percentage-non-ascii] [-n] [--out OUT]
                 [-p PAGES [PAGES ...]] [-l PAGE_SLICE] [-M] [-w WHAT] [-o]
                 [-O] [-L OCR_LANGUAGE]
                 PDF Path

Get a prompt consisting the text content of a PDF file

positional arguments:
  PDF Path              Path to the PDF file

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processes list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -p PAGES [PAGES ...], --pages PAGES [PAGES ...]
                        Only include specified page numbers (default: None)
  -l PAGE_SLICE, --page-slice PAGE_SLICE
                        Use Python slice syntax to select page numbers (e.g.
                        1:3, 1:10:2, etc.) (default: None)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF
                        content in the prompt (default: the content of a PDF
                        file)
  -o, --fallback-ocr    Use OCR as fallback if no text detected on page,
                        please set TESSDATA_PREFIX environment variable to the
                        path of your tesseract data directory (default: False)
  -O, --force-ocr       Force OCR on all pages (default: False)
  -L OCR_LANGUAGE, --ocr-language OCR_LANGUAGE
                        Language to use for Tesseract OCR (like eng, chi_sim,
                        chi_tra, chi_tra_vert etc.)) (default: eng)

```
### `ytprompt`

```
$ ytprompt --help

usage: ytprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                [-P PARTS [PARTS ...]] [-r] [-R]
                [--print-percentage-non-ascii] [-n] [--out OUT]
                YouTube URL

Get a prompt consisting Title and Transcript of a YouTube Video

positional arguments:
  YouTube URL           YouTube URL

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processes list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)

```
### `textprompt`

```
$ textprompt --help

usage: textprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                  [-P PARTS [PARTS ...]] [-r] [-R]
                  [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                  [-w WHAT] [-M]
                  [PATH ...]

Get a prompt from text files

positional arguments:
  PATH                  Paths to the text files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processes list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF
                        content in the prompt (default: the content of a
                        document)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)

```
### `htmlprompt`

```
$ htmlprompt --help

usage: htmlprompt [-h] [-V] [-c] [-e] [-m model] [-S] [-s chunk_size]
                  [-P PARTS [PARTS ...]] [-r] [-R]
                  [--print-percentage-non-ascii] [-n] [--out OUT] [-C]
                  [-w WHAT] [-M]
                  [PATH ...]

Get a prompt from html files

positional arguments:
  PATH                  Paths to the html files, or stdin if not provided
                        (default: None)

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -c, --copy            Copy the prompt to clipboard (default: False)
  -e, --edit            Edit the prompt and copy manually (default: False)
  -m model, --model model
                        Model to use. This only affects the chunk size. Use -S
                        to disable splitting (infinite chunk size). (default:
                        gpt-4-32k)
  -S, --no-split        Do not split the prompt into multiple parts (use this
                        if the model has a really large context size)
                        (default: False)
  -s chunk_size, --chunk-size chunk_size
                        Chunk size when splitting transcript, also used to
                        determine whether to split, defaults to 1/2 of the
                        context length limit of the model (default: None)
  -P PARTS [PARTS ...], --parts PARTS [PARTS ...]
                        Parts to select in the processes list of Documents
                        (default: None)
  -r, --raw             Wraps the content in triple quotes with no extra text
                        (default: False)
  -R, --raw-no-quotes   Output the content only (default: False)
  --print-percentage-non-ascii
                        Print percentage of non-ascii characters (default:
                        False)
  -n, --dry-run         Dry run (default: False)
  --out OUT             Output file (default: None)
  -C, --from-clipboard  Load text from clipboard (default: False)
  -w WHAT, --what WHAT  Initial knowledge you want to insert before the PDF
                        content in the prompt (default: the text content of a
                        html file)
  -M, --merge           Merge contents of all pages before processing
                        (default: False)

```

## Installation

### pipx

This is the recommended installation method.

```
$ pipx install langchain-utils
```

### [pip](https://pypi.org/project/langchain-utils/)

```
$ pip install langchain-utils
```

## Develop

```
$ git clone https://github.com/tddschn/langchain-utils.git
$ cd langchain-utils
$ poetry install
```
