Releases: benbrandt/text-splitter

v0.28.0

05 Sep 05:30
d2c22d6

What's Changed

  • Updated tokenizers to v0.22

Python

  • Minimum Python version updated to 3.10

Full Changelog: v0.27.0...v0.28.0

v0.27.0

28 May 07:58
bada552

What's New

  • Updated tiktoken-rs to v0.7.0

Full Changelog: v0.26.0...v0.27.0

v0.26.0

09 May 15:01

What's New

  • Updated to icu v2.0 for all Unicode segmentation.
  • Minimum Rust version updated to 1.82.0

Full Changelog: v0.25.1...v0.26.0

v0.25.1

25 Mar 21:59

What's New

  • Use the memchr crate instead of regex for the parsing phase in TextSplitter. This should improve parsing performance when scanning for newline characters. #650
  • Implement the ChunkSizer trait automatically for many more wrappers and references to types that already implement ChunkSizer #649
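A minimal std-only sketch of how such blanket implementations work. The trait and sizer here are simplified stand-ins that mirror the crate's `ChunkSizer` and `Characters` names, not the actual definitions; the point is that once `&T` and `Box<T>` get blanket impls, any wrapped sizer can be passed wherever a sizer is expected:

```rust
// A simplified stand-in for text-splitter's ChunkSizer trait.
trait ChunkSizer {
    fn size(&self, chunk: &str) -> usize;
}

// A stand-in for the crate's character-counting sizer.
struct Characters;

impl ChunkSizer for Characters {
    fn size(&self, chunk: &str) -> usize {
        chunk.chars().count()
    }
}

// Blanket impls: any reference to a sizer is itself a sizer…
impl<'a, T: ChunkSizer + ?Sized> ChunkSizer for &'a T {
    fn size(&self, chunk: &str) -> usize {
        (**self).size(chunk)
    }
}

// …and so is any boxed sizer, including boxed trait objects.
impl<T: ChunkSizer + ?Sized> ChunkSizer for Box<T> {
    fn size(&self, chunk: &str) -> usize {
        (**self).size(chunk)
    }
}

// A generic consumer that only asks for `impl ChunkSizer`.
fn measure(sizer: impl ChunkSizer, chunk: &str) -> usize {
    sizer.size(chunk)
}

fn main() {
    let sizer = Characters;
    // A reference works thanks to the &T blanket impl.
    assert_eq!(measure(&sizer, "hello"), 5);
    // A boxed trait object works thanks to the Box<T> impl.
    let boxed: Box<dyn ChunkSizer> = Box::new(sizer);
    assert_eq!(measure(boxed, "hello"), 5);
    println!("ok");
}
```

This is what lets callers hand a splitter a `&MyTokenizer` or `Box<dyn ChunkSizer>` without writing adapter code themselves.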

Full Changelog: v0.25.0...v0.25.1

v0.25.0

22 Mar 06:52

Breaking Changes

Rust

  • Remove support for the rust-tokenizers crate. This crate hasn't been updated in several years and brings in dependencies with security warnings.

What's New

  • Use a faster encoding method for the tokenizers library, which improves performance when using Hugging Face tokenizers.

Full Changelog: v0.24.2...v0.25.0

v0.24.2

19 Mar 15:33
11f12e4

Fixes

  • Python packages now target a newer version of libc, which should fix header file issues with tree-sitter #638

What's New

  • MSRV updated to 1.81.0

Full Changelog: v0.24.1...v0.24.2

v0.24.1

24 Feb 09:46

What's Changed

Added a new chunk_char_indices method to the Rust splitters in #607

use text_splitter::{Characters, ChunkCharIndex, TextSplitter};

let text = "\r\na̐éö̲\r\n";
let splitter = TextSplitter::new(3);
let chunks = splitter.chunk_char_indices(text).collect::<Vec<_>>();

assert_eq!(
    vec![
        ChunkCharIndex {
            chunk: "a̐é",
            byte_offset: 2,
            char_offset: 2
        },
        ChunkCharIndex {
            chunk: "ö̲",
            byte_offset: 7,
            char_offset: 5
        }
    ],
    chunks
);

This pulls the logic from the Python bindings down into the core library. Tracking character offsets is more expensive than tracking byte offsets alone, and for most usage in Rust, byte offsets are sufficient.

However, when interfacing with other languages or systems that require character offsets, this will track the character offsets for you, accounting for any trimming that may have occurred.
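As a std-only illustration of that bookkeeping (`char_offset` here is a hypothetical helper for explanation, not part of the crate's API): the character offset is simply the number of `char`s preceding the byte offset, and the two diverge as soon as the text contains multi-byte UTF-8 characters.

```rust
// Hypothetical helper: derive a character offset from a byte offset.
fn char_offset(text: &str, byte_offset: usize) -> usize {
    // Count the chars that precede the byte offset; for multi-byte
    // UTF-8 text the two offsets diverge.
    text[..byte_offset].chars().count()
}

fn main() {
    // The same text as the example above, written with explicit escapes:
    // "a" + combining candrabindu, precomposed "é", "o" + two combining marks.
    let text = "\r\na\u{0310}\u{00E9}o\u{0308}\u{0332}\r\n";
    // The first chunk starts at byte 2 ("\r\n" is two single-byte chars),
    // so byte offset and char offset agree.
    assert_eq!(char_offset(text, 2), 2);
    // The second chunk starts at byte 7 but only char 5, because the
    // preceding accented characters take multiple bytes each.
    assert_eq!(char_offset(text, 7), 5);
    println!("ok");
}
```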

Full Changelog: v0.24.0...v0.24.1

v0.24.0

15 Feb 07:26
ae97576

What's Changed

Update to pulldown-cmark 0.13.0 to improve Markdown parsing.

Full Changelog: v0.23.0...v0.24.0

v0.23.0

09 Feb 07:56
0a22ee0

What's Changed

Update to tree-sitter v0.25

Full Changelog: v0.22.0...v0.23.0

v0.22.0

17 Jan 10:15
217fb50

Breaking Changes

  • Revert the change to special token behavior introduced in v0.21. It had many unintended side effects and does not appear to be recommended for chunking.

Full Changelog: v0.21.0...v0.22.0