Metadata-Version: 2.4
Name: tocount
Version: 0.2
Summary: ToCount: Lightweight Token Estimator
Home-page: https://github.com/openscilab/tocount
Download-URL: https://github.com/openscilab/tocount/tarball/v0.2
Author: ToCount Development Team
Author-email: tocount@openscilab.com
License: MIT
Project-URL: Source, https://github.com/openscilab/tocount
Keywords: token tokenizer estimation llm ml nlp
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: Natural Language :: English
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python :: 3.7
Classifier: Programming Language :: Python :: 3.8
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Intended Audience :: Developers
Classifier: Intended Audience :: Education
Classifier: Intended Audience :: End Users/Desktop
Classifier: Intended Audience :: Manufacturing
Classifier: Intended Audience :: Science/Research
Classifier: Topic :: Education
Classifier: Topic :: Scientific/Engineering
Classifier: Topic :: Scientific/Engineering :: Information Analysis
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Requires-Python: >=3.7
Description-Content-Type: text/markdown
License-File: LICENSE
License-File: AUTHORS.md
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: download-url
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: license-file
Dynamic: project-url
Dynamic: requires-python
Dynamic: summary


<div align="center">
    <h1>ToCount: Lightweight Token Estimator</h1>
    <br/>
    <a href="https://badge.fury.io/py/tocount"><img src="https://badge.fury.io/py/tocount.svg" alt="PyPI version"></a>
    <a href="https://codecov.io/gh/openscilab/tocount"><img src="https://codecov.io/gh/openscilab/tocount/branch/dev/graph/badge.svg?token=T9T0EPB3V2"></a>
    <a href="https://www.python.org/"><img src="https://img.shields.io/badge/built%20with-Python3-green.svg" alt="built with Python3"></a>
    <a href="https://github.com/openscilab/tocount"><img alt="GitHub repo size" src="https://img.shields.io/github/repo-size/openscilab/tocount"></a>
    <a href="https://discord.gg/X8ExzygDGf"><img src="https://img.shields.io/discord/1064533716615049236.svg" alt="Discord Channel"></a>
</div>

----------


## Overview
<p align="justify">
ToCount is a lightweight and extensible Python library for estimating token counts from text inputs using both rule-based and machine learning methods. Designed for flexibility, speed, and accuracy, ToCount provides a unified interface for different estimation strategies, making it ideal for tasks like prompt analysis, token budgeting, and optimizing interactions with token-based systems.
</p>

<table>
    <tr>
        <td align="center">PyPI Counter</td>
        <td align="center">
            <a href="https://pepy.tech/projects/tocount">
                <img src="https://static.pepy.tech/badge/tocount">
            </a>
        </td>
    </tr>
    <tr>
        <td align="center">GitHub Stars</td>
        <td align="center">
            <a href="https://github.com/openscilab/tocount">
                <img src="https://img.shields.io/github/stars/openscilab/tocount.svg?style=social&label=Stars">
            </a>
        </td>
    </tr>
</table>
<table>
    <tr> 
        <td align="center">Branch</td>
        <td align="center">main</td>
        <td align="center">dev</td>
    </tr>
    <tr>
        <td align="center">CI</td>
        <td align="center">
            <img src="https://github.com/openscilab/tocount/actions/workflows/test.yml/badge.svg?branch=main">
        </td>
        <td align="center">
            <img src="https://github.com/openscilab/tocount/actions/workflows/test.yml/badge.svg?branch=dev">
        </td>
    </tr>
</table>


## Installation

### PyPI
- Check the [Python Packaging User Guide](https://packaging.python.org/installing/)
- Run `pip install tocount==0.2`
### Source code
- Download [Version 0.2](https://github.com/openscilab/tocount/archive/v0.2.zip) or [Latest Source](https://github.com/openscilab/tocount/archive/dev.zip)
- Run `pip install .`

## Models

### Rule-Based


| Model Name                 |   MAE   |     MSE     |   R²   |
|----------------------------|---------|-------------|--------|
| `RULE_BASED.UNIVERSAL`     | 106.70  | 381,647.81  | 0.8175 |
| `RULE_BASED.GPT_4`         | 152.34  | 571,795.89  | 0.7266 |
| `RULE_BASED.GPT_3_5`       | 161.93  | 652,923.59  | 0.6878 |

### Tiktoken R50K

| Model Name                        |   MAE   |     MSE     |   R²   |
|-----------------------------------|---------|-------------|--------|
| `TIKTOKEN_R50K.LINEAR_ALL`        |  71.38  | 183,897.01  | 0.8941 |
| `TIKTOKEN_R50K.LINEAR_ENGLISH`    |  23.35  |  14,127.92  | 0.9887 |


ℹ️ The training and testing data are taken from the Lmsys-chat-1m [1] and Wildchat [2] datasets.
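
The columns in the tables above are standard regression metrics that compare estimated token counts with reference counts: mean absolute error (MAE), mean squared error (MSE), and the coefficient of determination (R²). As a minimal sketch of how such figures can be computed (the function name `regression_metrics` and the sample numbers are illustrative assumptions, not ToCount's evaluation code or data):

```python
from typing import Sequence, Tuple


def regression_metrics(
    actual: Sequence[float], predicted: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (MAE, MSE, R^2) for predicted vs. actual token counts.

    Hypothetical helper for illustration; not part of the ToCount API.
    """
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")

    # Per-sample error: estimate minus reference count.
    errors = [p - a for a, p in zip(actual, predicted)]

    mae = sum(abs(e) for e in errors) / n   # average size of the error, in tokens
    mse = sum(e * e for e in errors) / n    # average squared error, in tokens squared

    # R^2 = 1 - (residual sum of squares / total sum of squares).
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot else float("nan")

    return mae, mse, r2


# Made-up sample values, used only to exercise the function.
mae, mse, r2 = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
print(mae, mse, round(r2, 3))
```

MAE is expressed in tokens, while MSE is in squared tokens, so an MSE much larger than the square of the MAE suggests that a small number of inputs account for most of the error.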

## Usage

```pycon
>>> from tocount import estimate_text_tokens, TextEstimator
>>> estimate_text_tokens("How are you?", estimator=TextEstimator.RULE_BASED.UNIVERSAL)
4
```
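
For token budgeting, an estimator can be wrapped in a small helper that stops adding text once an estimated limit would be exceeded. The sketch below is illustrative only: `select_within_budget` and `rough_estimate` are hypothetical names, and `rough_estimate` is a crude stand-in (roughly one token per four characters) so that the example runs without ToCount installed.

```python
from typing import Callable, Iterable, List, Tuple


def select_within_budget(
    texts: Iterable[str],
    budget: int,
    estimate: Callable[[str], int],
) -> Tuple[List[str], int]:
    """Keep texts in order until the next one would exceed the token budget.

    Returns the kept texts and the estimated tokens they use.
    Hypothetical helper; not part of the ToCount API.
    """
    kept: List[str] = []
    used = 0
    for text in texts:
        cost = estimate(text)
        if used + cost > budget:
            break  # the next text would not fit, so stop here
        kept.append(text)
        used += cost
    return kept, used


def rough_estimate(text: str) -> int:
    """Stand-in estimator (assumption): about one token per four characters."""
    return max(1, len(text) // 4)


prompts = [
    "How are you?",
    "Tell me a joke.",
    "Summarize this long document please.",
]
kept, used = select_within_budget(prompts, budget=8, estimate=rough_estimate)
print(kept, used)
```

To use ToCount instead of the stand-in, pass `lambda text: estimate_text_tokens(text, estimator=TextEstimator.RULE_BASED.UNIVERSAL)` as the `estimate` argument.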

## Issues & bug reports

To report a bug or request a feature, open an issue and describe it, or send an email to [tocount@openscilab.com](mailto:tocount@openscilab.com "tocount@openscilab.com"). We'll check it as soon as possible.

- Please complete the issue template

You can also join our Discord server:

<a href="https://discord.gg/X8ExzygDGf">
  <img src="https://img.shields.io/discord/1064533716615049236.svg?style=for-the-badge" alt="Discord Channel">
</a>

## References

<blockquote>1- Zheng, Lianmin, et al. "Lmsys-chat-1m: A large-scale real-world llm conversation dataset." International Conference on Learning Representations (ICLR) 2024 Spotlights.</blockquote>

<blockquote>2- Zhao, Wenting, et al. "Wildchat: 1m chatgpt interaction logs in the wild." International Conference on Learning Representations (ICLR) 2024 Spotlights.</blockquote>

## Show your support


### Star this repo

Give a ⭐️ if this project helped you!

### Donate to our project
If you like our project, and we hope you do, please consider supporting us. Our project is not, and never will be, run for profit. We need the money only so that we can continue doing what we do ;-)

<a href="https://openscilab.com/#donation" target="_blank"><img src="https://github.com/openscilab/tocount/raw/main/otherfiles/donation.png" width="270" alt="ToCount Donation"></a>

# Changelog
All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](http://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.html).

## [Unreleased]
## [0.2] - 2025-10-02
### Added
- `TIKTOKEN_R50K.LINEAR_ALL` model
- `TIKTOKEN_R50K.LINEAR_ENGLISH` model
### Changed
- `README.md` updated
## [0.1] - 2025-08-30
### Added
- `RULE_BASED.UNIVERSAL` model
- `RULE_BASED.GPT_4` model
- `RULE_BASED.GPT_3_5` model


[Unreleased]: https://github.com/openscilab/tocount/compare/v0.2...dev
[0.2]: https://github.com/openscilab/tocount/compare/v0.1...v0.2
[0.1]: https://github.com/openscilab/tocount/compare/8385d46...v0.1
