Introduction

pytrees-rs is a high-performance Python library for optimal and greedy decision tree learning, powered by Rust. The library brings state-of-the-art decision tree algorithms to Python developers. It provides both optimal and fast approximate solutions, giving you the flexibility to choose between guaranteed optimality and computational efficiency.

Key Features

Optimal Decision Trees

  • DL8.5 Algorithm: Guarantees globally optimal decision trees within constraints
  • Mathematical Proof: Every solution comes with optimality guarantees
  • Flexible Constraints: Control depth, minimum support, training time, and custom objectives

High-Performance Greedy Algorithms

  • LGDT Algorithm: Fast approximate solutions for large datasets using different search strategies
  • Multiple Search Strategies: Depth-2 optimal error or Depth-2 optimal information gain
  • Scalable: Handles datasets with many samples and large tree depths efficiently

Advanced Features

  • Rule-based Optimization: Fine-tune search behavior with discrepancy, gain, and purity rules
  • Custom Error Functions: Define specialized objectives for your domain
  • Interpretable Clustering: Decision tree-based unsupervised learning
  • Tree Visualization: Built-in support for Graphviz visualization

Python-First Design

  • Scikit-learn Compatible: Drop-in replacement for sklearn decision trees
  • Pythonic API: Intuitive interface following Python conventions
  • Comprehensive Documentation: Extensive guides, examples, and API reference
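Because the estimators follow the scikit-learn API, they drop into sklearn utilities such as cross-validation and pipelines. The sketch below uses sklearn's own DecisionTreeClassifier as a runnable stand-in; with pytrees-rs installed, an estimator like DL85Classifier(max_depth=3, min_sup=5) can be substituted directly (after binarizing the features, as the examples in this book do).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Binary features, matching what the pytrees-rs examples in this book use
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X = (X > 0).astype(float)

# Stand-in estimator; with pytrees-rs installed, use e.g.
# DL85Classifier(max_depth=3, min_sup=5) here instead
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.3f}")
```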

Quick Example

from pytrees import DL85Classifier, LGDTClassifier
from sklearn.datasets import make_classification

# Generate sample data and binarize the features
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X = (X > 0).astype(float)

# Optimal decision tree
optimal_clf = DL85Classifier(max_depth=4, min_sup=10)
optimal_clf.fit(X, y)
print(f"Optimal accuracy: {optimal_clf.score(X, y):.3f}")

# Less greedy decision tree
greedy_clf = LGDTClassifier(max_depth=4, min_sup=10)
greedy_clf.fit(X, y)
print(f"Greedy accuracy: {greedy_clf.score(X, y):.3f}")

# Visualize the optimal tree (requires the graphviz package)
dot_string = optimal_clf.to_dot()
try:
    import graphviz
    graphviz.Source(dot_string).view()
except ImportError:
    print("Install graphviz to visualize the tree: pip install graphviz")

Core Algorithms

DL8.5: Optimal Decision Tree Learning

The DL8.5 algorithm finds globally optimal decision trees using dynamic programming and caching branch-and-bound search. Unlike greedy methods, DL8.5 guarantees the best possible tree within your specified constraints.

Key Characteristics:

  • Optimality: Given enough time, it is guaranteed to find the best solution
  • Flexibility: Supports custom error functions and constraints
  • Efficiency: Advanced caching and pruning techniques
  • Completeness: Will find a solution if one exists within limits
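To make the caching idea concrete, here is a toy, pure-Python sketch (not the pytrees-rs implementation): a depth-bounded exhaustive search whose results are memoized on the set of samples reaching a node. This is the support-set caching that DL8.5 performs, in miniature, on an XOR-style dataset where no single split helps but a depth-2 tree is perfect.

```python
from functools import lru_cache

# XOR-style toy data: no single split reduces the error, but depth 2 is perfect
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 0]
N_FEATURES = 2

def leaf_error(idx):
    """Misclassification error of the best constant prediction on samples idx."""
    labels = [y[i] for i in idx]
    return len(labels) - max(labels.count(0), labels.count(1))

@lru_cache(maxsize=None)
def best_error(idx, depth):
    """Minimum training error of any tree of depth <= `depth` on samples idx.
    Memoizing on idx is the support-set caching idea: the same subset of
    samples is often reached through different split orders."""
    err = leaf_error(idx)
    if depth == 0 or err == 0:
        return err
    for f in range(N_FEATURES):
        left = tuple(i for i in idx if X[i][f] == 0)
        right = tuple(i for i in idx if X[i][f] == 1)
        if not left or not right:
            continue  # split separates nothing; skip it
        err = min(err, best_error(left, depth - 1) + best_error(right, depth - 1))
    return err

root = tuple(range(len(X)))
print("depth 1:", best_error(root, 1))  # 2 errors: no depth-1 tree fits XOR
print("depth 2:", best_error(root, 2))  # 0 errors: the search finds the XOR tree
```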

Academic Reference: Aglin, G., Nijssen, S., & Schaus, P. (2020). Learning optimal decision trees using caching branch-and-bound search. AAAI Conference on Artificial Intelligence.

LDS-DL8.5 and CADL8.5: Anytime Optimal Learning

These enhanced versions of DL8.5 provide anytime optimization: they return progressively better solutions as training time increases. They prevent search stagnation and ensure you always get the best tree found within your time budget.

Key Features:

  • Anytime Property: Improves solution quality over time
  • Limited Discrepancy Search: Imposes an exploration budget on each node of the search space, producing good solutions early that the search then improves on (LDS-DL8.5)
  • Top-k: Similar to LDS, but constrains the number of features explored at each node
  • Restart Strategy: Prevents getting stuck in local regions
  • Time-bounded: Respects strict time constraints
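The discrepancy idea in miniature: if a heuristic ranks the choices at each decision point, a "discrepancy" is any step taken against the top-ranked choice, and LDS visits paths in order of how many discrepancies they contain. A standalone toy sketch, not the pytrees-rs search itself:

```python
from itertools import product

DEPTH = 3  # number of sequential binary decisions in this toy search

def discrepancies(path):
    """Count how often a path goes against the heuristic's preferred choice (0)."""
    return sum(path)

# Visit all decision paths in order of discrepancy count: the heuristic's own
# path first, then every path that deviates once, then twice, and so on.
paths = sorted(product([0, 1], repeat=DEPTH), key=discrepancies)
for p in paths[:4]:
    print(p, "discrepancies:", discrepancies(p))
```

A budgeted search would stop after a fixed discrepancy count and restart with a larger one, which is what gives the anytime behavior.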

Academic References:

  • [1]
  • [2]

LGDT: Less Greedy Decision Tree Learning

LGDT provides fast approximate solutions by looking ahead two levels instead of making purely greedy choices. This approach significantly improves upon traditional greedy algorithms while maintaining computational efficiency.

Key Innovations:

  • Two-level Lookahead: Less myopic than traditional greedy methods
  • Efficient Data Structures: Reversible sparse bitsets for speed
  • Multiple Search Strategies: Adaptable to different scenarios
  • Scalability: Handles large datasets efficiently
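The bitset trick behind such data structures can be illustrated with plain Python integers: the set of samples reaching a node becomes a bitmask, so splitting on a binary feature and counting classes reduce to bitwise ANDs and popcounts. A conceptual sketch only; the Rust implementation additionally makes the bitsets sparse and reversible so backtracking avoids copying.

```python
# Samples reaching a node, represented as a bitmask over sample indices
y = [0, 1, 1, 0, 1, 0]           # class labels
feat = [1, 1, 0, 0, 1, 0]        # one binary feature, per sample

def to_mask(bits):
    """Pack a 0/1 list into an integer bitmask (bit i = sample i)."""
    return sum(1 << i for i, b in enumerate(bits) if b)

n = len(y)
support = (1 << n) - 1           # every sample reaches the root
feat_mask = to_mask(feat)
pos_mask = to_mask(y)

# Splitting on the feature is a bitwise AND against the support mask
right = support & feat_mask                    # samples with feature == 1
left = support & ~feat_mask & ((1 << n) - 1)   # samples with feature == 0

# Class counts at a node are popcounts of mask intersections
print("feature=1 samples:", bin(right).count("1"))             # 3
print("of which positive:", bin(right & pos_mask).count("1"))  # 2
```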

Academic Reference:

Getting Started

Ready to experience optimal decision trees? Here are your next steps:

  1. Installation: Set up pytrees-rs
  2. Quick Start Guide: Build your first optimal or less greedy tree
  3. Python Library: Explore the complete API
  4. Examples: See ways to use the library

Installation

This guide will help you install pytrees-rs on your system. We provide multiple installation methods to suit different needs and environments.

Prerequisites

Before installing pytrees-rs, ensure you have the following:

Python Requirements

  • Python 3.10 or higher
  • pip package manager
  • NumPy (automatically installed with pytrees-rs)

For Source Installation

  • Rust toolchain (1.70.0 or higher)
  • Git (for cloning the repository)

Installing Rust

If you need to install Rust, use the official installer:

# Install Rust using rustup (recommended)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Restart your shell or run:
source ~/.cargo/env

# Verify installation
rustc --version
cargo --version

Alternative methods:

  • macOS: brew install rust
  • Ubuntu/Debian: sudo apt install rustc cargo
  • Windows: Download from rustup.rs

Installation Methods

Method 1: PyPI Installation

pip install pytrees-rs

Method 2: Building from Source (Current Method)

This is currently the primary installation method:

# Clone the repository
git clone https://github.com/haroldks/pytrees-rs.git
cd pytrees-rs

# Build the Rust binary
cargo build --release

# Install Python package
cd pytrees-rs  # Navigate to Python package directory
pip install .

Method 3: Binary Installation Only

If you only need the command-line interface:

# Clone and build
git clone https://github.com/haroldks/pytrees-rs.git
cd pytrees-rs
cargo build --release

# Create symbolic link (Unix-based systems)
ln -s $(pwd)/target/release/dtrees_rs $HOME/.local/bin/dtrees_rs

# Or copy to system path
sudo cp target/release/dtrees_rs /usr/local/bin/

Basic Usage

After installation, verify that pytrees-rs is working:

Python Library Test

# Test basic import
import pytrees
from pytrees import DL85Classifier, LGDTClassifier

# Quick functionality test
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X = (X > 0).astype(float)
clf = DL85Classifier(max_depth=2, min_sup=5)
clf.fit(X, y)
print(f"Accuracy: {clf.score(X, y):.3f}")

Command Line Test

# Test binary installation
dtrees_rs --help

Next Steps

Once installed successfully:

  1. Quick Start Guide: Build your first optimal tree
  2. Python Library: Explore the API
  3. Examples: See real applications

Quick Start Guide

5-Minute Tutorial

Basic Classification

from pytrees import DL85Classifier, LGDTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X = (X > 0).astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Optimal decision tree (guaranteed best solution)
optimal_clf = DL85Classifier(max_depth=3, min_sup=10)
optimal_clf.fit(X_train, y_train)
optimal_pred = optimal_clf.predict(X_test)
optimal_accuracy = accuracy_score(y_test, optimal_pred)

print(f"Optimal Tree Accuracy: {optimal_accuracy:.3f}")
print(f"Training Time: {optimal_clf.results.duration:.3f}s")

# Greedy decision tree (fast approximate solution)
greedy_clf = LGDTClassifier(max_depth=3, min_sup=10)
greedy_clf.fit(X_train, y_train)
greedy_pred = greedy_clf.predict(X_test)
greedy_accuracy = accuracy_score(y_test, greedy_pred)

print(f"Greedy Tree Accuracy: {greedy_accuracy:.3f}")
print(f"Training Time: {greedy_clf.results.duration:.3f}s")

Tree Visualization

from pytrees import DL85Classifier, LGDTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X = (X > 0).astype(float)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

optimal_clf = DL85Classifier(max_depth=3, min_sup=10)
optimal_clf.fit(X_train, y_train)

dot_string = optimal_clf.to_dot()

# Save to file
with open('tree.dot', 'w') as f:
    f.write(dot_string)

# Or use with graphviz (if installed)
try:
    import graphviz
    graph = graphviz.Source(dot_string)
    graph.view()
    print("Tree visualization saved as tree.png")
except ImportError:
    print("Install graphviz for automatic visualization: pip install graphviz")

Advanced Configuration

from pytrees import DL85Classifier, ExposedHeuristic, ExposedGainRule, ExposedPurityRule
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X = (X > 0).astype(float)
# Advanced DL85 with rules and heuristics
advanced_clf = DL85Classifier(
    max_depth=4,
    min_sup=5,
    max_time=300.0,
    heuristic=ExposedHeuristic.InformationGain,
    gain=ExposedGainRule(min_gain=0.01, epsilon=1e-4),
    purity=ExposedPurityRule(min_purity=0.9)
)
advanced_clf.fit(X, y)
print(advanced_clf.score(X, y))

Next Steps

Now that you've got the basics down:

  1. Python Library: Explore the complete API
  2. Examples: See more real-world applications
  3. Key Concepts: Understand the algorithms and theory
  4. Advanced Features: Rules, custom functions, and optimization