Releases: benbrandt/text-splitter

v0.28.0

05 Sep 05:30
d2c22d6

What's Changed

  • Updated tokenizers to v0.22

Python

  • Minimum Python version updated to 3.10

Full Changelog: v0.27.0...v0.28.0

v0.27.0

28 May 07:58
bada552

What's New

  • Updated tiktoken-rs to v0.7.0

Full Changelog: v0.26.0...v0.27.0

v0.26.0

09 May 15:01

What's New

  • Updated to icu v2.0 for all Unicode segmentation.
  • Minimum Rust version updated to 1.82.0

Full Changelog: v0.25.1...v0.26.0

v0.25.1

25 Mar 21:59

What's New

  • Use the memchr crate instead of regex for the parsing phase in TextSplitter. This should improve parsing performance when scanning for newline characters. #650
  • Implement the ChunkSizer trait automatically for many more wrappers and references to types that already implement ChunkSizer #649
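A minimal std-only sketch of how such blanket implementations work. The trait and sizer here are simplified stand-ins that mirror the crate's `ChunkSizer` and `Characters` names, not the actual definitions; the point is that once `&T` and `Box<T>` get blanket impls, any wrapped sizer can be passed wherever a sizer is expected:

```rust
// A simplified stand-in for text-splitter's ChunkSizer trait.
trait ChunkSizer {
    fn size(&self, chunk: &str) -> usize;
}

// A stand-in for the crate's character-counting sizer.
struct Characters;

impl ChunkSizer for Characters {
    fn size(&self, chunk: &str) -> usize {
        chunk.chars().count()
    }
}

// Blanket impls: any reference to a sizer is itself a sizer…
impl<'a, T: ChunkSizer + ?Sized> ChunkSizer for &'a T {
    fn size(&self, chunk: &str) -> usize {
        (**self).size(chunk)
    }
}

// …and so is any boxed sizer, including boxed trait objects.
impl<T: ChunkSizer + ?Sized> ChunkSizer for Box<T> {
    fn size(&self, chunk: &str) -> usize {
        (**self).size(chunk)
    }
}

// A generic consumer that only asks for `impl ChunkSizer`.
fn measure(sizer: impl ChunkSizer, chunk: &str) -> usize {
    sizer.size(chunk)
}

fn main() {
    let sizer = Characters;
    // A reference works thanks to the &T blanket impl.
    assert_eq!(measure(&sizer, "hello"), 5);
    // A boxed trait object works thanks to the Box<T> impl.
    let boxed: Box<dyn ChunkSizer> = Box::new(sizer);
    assert_eq!(measure(boxed, "hello"), 5);
    println!("ok");
}
```

This is what lets callers hand a splitter a `&MyTokenizer` or `Box<dyn ChunkSizer>` without writing adapter code themselves.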

Full Changelog: v0.25.0...v0.25.1

v0.25.0

22 Mar 06:52

Breaking Changes

Rust

  • Remove support for the rust-tokenizers crate. This crate hasn't been updated in several years and brings in dependencies with security warnings.

What's New

  • Use a faster encoding method for the tokenizers library, which improves performance when using Hugging Face tokenizers.

Full Changelog: v0.24.2...v0.25.0

v0.24.2

19 Mar 15:33
11f12e4

Fixes

  • Python packages now target a newer version of libc, which should fix header file issues with tree-sitter #638

What's New

  • MSRV updated to 1.81.0

Full Changelog: v0.24.1...v0.24.2

v0.24.1

24 Feb 09:46

What's Changed

Added a new chunk_char_indices method to the Rust splitters in #607

use text_splitter::{Characters, ChunkCharIndex, TextSplitter};

let text = "\r\na̐éö̲\r\n";
let splitter = TextSplitter::new(3);
let chunks = splitter.chunk_char_indices(text).collect::<Vec<_>>();

assert_eq!(
    vec![
        ChunkCharIndex {
            chunk: "a̐é",
            byte_offset: 2,
            char_offset: 2
        },
        ChunkCharIndex {
            chunk: "ö̲",
            byte_offset: 7,
            char_offset: 5
        }
    ],
    chunks
);

This pulls the logic from the Python bindings down into the core library. Tracking character offsets is more expensive than tracking byte offsets alone, and for most usage in Rust, byte offsets are sufficient.

However, when interfacing with other languages or systems that require character offsets, this will track the character offsets for you, accounting for any trimming that may have occurred.
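As a std-only illustration of that bookkeeping (`char_offset` here is a hypothetical helper for explanation, not part of the crate's API): the character offset is simply the number of `char`s preceding the byte offset, and the two diverge as soon as the text contains multi-byte UTF-8 characters.

```rust
// Hypothetical helper: derive a character offset from a byte offset.
fn char_offset(text: &str, byte_offset: usize) -> usize {
    // Count the chars that precede the byte offset; for multi-byte
    // UTF-8 text the two offsets diverge.
    text[..byte_offset].chars().count()
}

fn main() {
    // The same text as the example above, written with explicit escapes:
    // "a" + combining candrabindu, precomposed "é", "o" + two combining marks.
    let text = "\r\na\u{0310}\u{00E9}o\u{0308}\u{0332}\r\n";
    // The first chunk starts at byte 2 ("\r\n" is two single-byte chars),
    // so byte offset and char offset agree.
    assert_eq!(char_offset(text, 2), 2);
    // The second chunk starts at byte 7 but only char 5, because the
    // preceding accented characters take multiple bytes each.
    assert_eq!(char_offset(text, 7), 5);
    println!("ok");
}
```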

Full Changelog: v0.24.0...v0.24.1

v0.24.0

15 Feb 07:26
ae97576

What's Changed

Update to pulldown-cmark 0.13.0 to improve Markdown parsing.

Full Changelog: v0.23.0...v0.24.0

v0.23.0

09 Feb 07:56
0a22ee0

What's Changed

Update to tree-sitter v0.25

Full Changelog: v0.22.0...v0.23.0

v0.22.0

17 Jan 10:15
217fb50

Breaking Changes

  • Revert the change to special token behavior introduced in v0.21. It had many unintended side effects and does not appear to be recommended for chunking.

Full Changelog: v0.21.0...v0.22.0