Introduction
pytrees-rs is a high-performance Python library for optimal and greedy decision tree learning, powered by Rust. It brings state-of-the-art decision tree algorithms to Python developers, providing both optimal and fast approximate solutions so you can choose between guaranteed optimality and computational efficiency.
Key Features
Optimal Decision Trees
- DL8.5 Algorithm: Guarantees globally optimal decision trees within constraints
- Provable Optimality: Every returned tree is guaranteed to be the best achievable under the given constraints
- Flexible Constraints: Control depth, minimum support, training time, and custom objectives
High-Performance Greedy Algorithms
- LGDT Algorithm: Fast approximate solutions for large datasets using different search strategies
- Multiple Search Strategies: Depth-2 optimal error or Depth-2 optimal information gain
- Scalable: Handles large numbers of samples and deeper trees efficiently
Advanced Features
- Rule-based Optimization: Fine-tune search behavior with discrepancy, gain, and purity rules
- Custom Error Functions: Define specialized objectives for your domain
- Interpretable Clustering: Decision tree-based unsupervised learning
- Tree Visualization: Built-in support for Graphviz visualization
Python-First Design
- Scikit-learn Compatible: Drop-in replacement for sklearn decision trees
- Pythonic API: Intuitive interface following Python conventions
- Comprehensive Documentation: Extensive guides, examples, and API reference
Quick Example
from pytrees import DL85Classifier, LGDTClassifier
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X = (X > 0).astype(float)  # binarize features
# Optimal decision tree
optimal_clf = DL85Classifier(max_depth=4, min_sup=10)
optimal_clf.fit(X, y)
print(f"Optimal accuracy: {optimal_clf.score(X, y):.3f}")
# Less greedy decision tree
greedy_clf = LGDTClassifier(max_depth=4, min_sup=10)
greedy_clf.fit(X, y)
print(f"Greedy accuracy: {greedy_clf.score(X, y):.3f}")
# Visualize the optimal tree
dot_string = optimal_clf.to_dot()
try:
    import graphviz
    graphviz.Source(dot_string).view()
except Exception as e:
    print(f"Failed to visualize tree: {e}")
Core Algorithms
DL8.5: Optimal Decision Tree Learning
The DL8.5 algorithm finds globally optimal decision trees using dynamic programming and branch-and-bound search. Unlike greedy methods, DL8.5 guarantees the best possible tree within your specified constraints.
Key Characteristics:
- Optimality: Guaranteed to find the best possible tree, given enough time
- Flexibility: Supports custom error functions and constraints
- Efficiency: Advanced caching and pruning techniques
- Completeness: Will find a solution if one exists within limits
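The caching-plus-search idea behind these characteristics can be illustrated with a small self-contained sketch (pure Python, not the pytrees-rs implementation): a depth-bounded exhaustive search whose subproblems, keyed by the subset of examples reached, are memoized so each subset is solved only once.

```python
# Conceptual sketch of caching branch-and-bound search on a toy binary dataset.
from functools import lru_cache

# Tiny binary dataset: rows are (feature tuple, label)
DATA = [
    ((0, 0), 0), ((0, 1), 0), ((1, 0), 1), ((1, 1), 1),
    ((1, 1), 0), ((0, 0), 1),
]
N_FEATURES = 2

def leaf_error(idx):
    """Misclassification error of predicting the majority class on subset idx."""
    labels = [DATA[i][1] for i in idx]
    return len(labels) - max(labels.count(0), labels.count(1))

@lru_cache(maxsize=None)  # the cache is keyed by the reached subset, as in DL8.5
def best_error(idx, depth):
    """Minimum error of any tree of depth <= depth on the subset idx."""
    err = leaf_error(idx)
    if depth == 0 or err == 0:
        return err
    for f in range(N_FEATURES):
        left = tuple(i for i in idx if DATA[i][0][f] == 0)
        right = tuple(i for i in idx if DATA[i][0][f] == 1)
        if not left or not right:
            continue  # split separates nothing, skip it
        cand = best_error(left, depth - 1) + best_error(right, depth - 1)
        err = min(err, cand)  # a real branch-and-bound would also prune here
    return err

root = tuple(range(len(DATA)))
print(best_error(root, 2))  # → 2 (two conflicting duplicate rows are unavoidable)
```

The real solver additionally prunes branches whose lower bound exceeds the best error found so far; this sketch only shows the memoized exhaustive core.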
Academic Reference: Aglin, G., Nijssen, S., & Schaus, P. (2020). Learning optimal decision trees using caching branch-and-bound search. AAAI Conference on Artificial Intelligence.
LDS-DL8.5 and CADL8.5: Anytime Optimal Learning
An enhanced version of DL8.5 that provides anytime optimization: it returns progressively better solutions as training time increases, prevents search stagnation, and ensures you always get the best tree found within your time budget.
Key Features:
- Anytime Property: Improves solution quality over time
- Limited Discrepancy Search: Imposes an exploration budget on each node of the search space, yielding good early solutions that are then progressively refined (LDS-DL8.5)
- Top-k: Similar to LDS, but constrains the number of features explored at each node
- Restart Strategy: Prevents getting stuck in local regions
- Time-bounded: Respects strict time constraints
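The discrepancy budget can be sketched abstractly (pure Python, not the pytrees-rs API): limited discrepancy search bounds how often a path may deviate from the heuristic's preferred first choice, so small budgets reach good heuristic solutions quickly, and raising the budget progressively covers the whole search space.

```python
# Count the leaves of a binary search tree reachable under a discrepancy budget.
from itertools import product

def lds_leaves(depth, budget):
    """Leaves reachable with at most `budget` deviations (choice 1)
    from the heuristic's preferred choice (0)."""
    return [p for p in product((0, 1), repeat=depth) if sum(p) <= budget]

depth = 4
for budget in range(depth + 1):
    print(budget, len(lds_leaves(depth, budget)))
# budget 0 visits only the purely greedy path (1 leaf);
# budget = depth covers all 2^4 = 16 leaves
```

Iterating over increasing budgets is what gives the algorithm its anytime property: each pass finishes quickly and the best tree seen so far only improves.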
Academic References:
- [1]
- [2]
LGDT: Less Greedy Decision Tree Learning
LGDT provides fast approximate solutions by looking ahead two levels instead of making purely greedy choices. This approach significantly improves upon traditional greedy algorithms while maintaining computational efficiency.
Key Innovations:
- Two-level Lookahead: Less myopic than traditional greedy methods
- Efficient Data Structures: Reversible sparse bitsets for speed
- Multiple Search Strategies: Adaptable to different scenarios
- Scalability: Handles large datasets efficiently
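The benefit of two-level lookahead can be shown on XOR-like data with a small sketch (pure Python, not the pytrees-rs implementation): the purely greedy criterion picks a root that only looks good one step ahead, while the depth-2 lookahead root enables a perfect tree.

```python
ROWS = [  # (f0, f1, f2, label); label = f0 XOR f1, f2 is a noisy copy of label
    (0, 0, 0, 0), (0, 0, 0, 0), (0, 1, 1, 1), (0, 1, 1, 1),
    (1, 0, 1, 1), (1, 0, 0, 1), (1, 1, 0, 0), (1, 1, 1, 0),
]

def leaf_error(rows):
    """Misclassification error of predicting the majority class."""
    labels = [r[-1] for r in rows]
    return len(labels) - max(labels.count(0), labels.count(1))

def split(rows, f):
    return [r for r in rows if r[f] == 0], [r for r in rows if r[f] == 1]

def greedy_root(rows, feats):
    """Feature minimizing the immediate (one-step) split error."""
    return min(feats, key=lambda f: sum(leaf_error(s) for s in split(rows, f)))

def depth2_error(rows, root, feats):
    """Best error of a depth-2 tree with the given root split."""
    total = 0
    for side in split(rows, root):
        total += min(sum(leaf_error(s) for s in split(side, f)) for f in feats)
    return total

feats = (0, 1, 2)
g = greedy_root(ROWS, feats)                               # picks the noisy f2
best = min(feats, key=lambda f: depth2_error(ROWS, f, feats))  # picks f0
print("greedy root:", g, "depth-2 error:", depth2_error(ROWS, g, feats))   # f2 → 2
print("lookahead root:", best, "depth-2 error:", depth2_error(ROWS, best, feats))  # f0 → 0
```

On this data the greedy criterion chooses f2 because XOR features show no one-step error reduction, yet only a root on f0 (or f1) yields a zero-error depth-2 tree, which is exactly the myopia the two-level lookahead avoids.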
Academic Reference:
Getting Started
Ready to experience optimal decision trees? Here are your next steps:
- Installation: Set up pytrees-rs
- Quick Start Guide: Build your first optimal or less greedy tree
- Python Library: Explore the complete API
- Examples: See ways to use the library
Installation
This guide will help you install pytrees-rs on your system. We provide multiple installation methods to suit different needs and environments.
Prerequisites
Before installing pytrees-rs, ensure you have the following:
Python Requirements
- Python 3.10 or higher
- pip package manager
- NumPy (automatically installed with pytrees-rs)
For Source Installation
- Rust toolchain (1.70.0 or higher)
- Git (for cloning the repository)
Installing Rust
If you need to install Rust, use the official installer:
# Install Rust using rustup (recommended)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
# Restart your shell or run:
source ~/.cargo/env
# Verify installation
rustc --version
cargo --version
Alternative methods:
- macOS: brew install rust
- Ubuntu/Debian: sudo apt install rustc cargo
- Windows: Download from rustup.rs
Installation Methods
Method 1: PyPI Installation
pip install pytrees-rs
Method 2: Building from Source (Current Method)
This is currently the primary installation method:
# Clone the repository
git clone https://github.com/haroldks/pytrees-rs.git
cd pytrees-rs
# Build the Rust binary
cargo build --release
# Install Python package
cd pytrees-rs # Navigate to Python package directory
pip install .
Method 3: Binary Installation Only
If you only need the command-line interface:
# Clone and build
git clone https://github.com/haroldks/pytrees-rs.git
cd pytrees-rs
cargo build --release
# Create symbolic link (Unix-based systems)
ln -s $(pwd)/target/release/dtrees_rs $HOME/.local/bin/dtrees_rs
# Or copy to system path
sudo cp target/release/dtrees_rs /usr/local/bin/
Basic Usage
After installation, verify that pytrees-rs is working:
Python Library Test
# Test basic import
import pytrees
from pytrees import DL85Classifier, LGDTClassifier
# Quick functionality test
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
X = (X > 0).astype(float)  # binarize features
clf = DL85Classifier(max_depth=2, min_sup=5)
clf.fit(X, y)
print(f"Accuracy: {clf.score(X, y):.3f}")
Command Line Test
# Test binary installation
dtrees_rs --help
Next Steps
Once installed successfully:
- Quick Start Guide: Build your first optimal tree
- Python Library: Explore the API
- Examples: See real applications
Quick Start Guide
5-Minute Tutorial
Basic Classification
from pytrees import DL85Classifier, LGDTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X = (X > 0).astype(float)  # binarize features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Optimal decision tree (guaranteed best solution)
optimal_clf = DL85Classifier(max_depth=3, min_sup=10)
optimal_clf.fit(X_train, y_train)
optimal_pred = optimal_clf.predict(X_test)
optimal_accuracy = accuracy_score(y_test, optimal_pred)
print(f"Optimal Tree Accuracy: {optimal_accuracy:.3f}")
print(f"Training Time: {optimal_clf.results.duration:.3f}s")
# Greedy decision tree (fast approximate solution)
greedy_clf = LGDTClassifier(max_depth=3, min_sup=10)
greedy_clf.fit(X_train, y_train)
greedy_pred = greedy_clf.predict(X_test)
greedy_accuracy = accuracy_score(y_test, greedy_pred)
print(f"Greedy Tree Accuracy: {greedy_accuracy:.3f}")
print(f"Training Time: {greedy_clf.results.duration:.3f}s")
Tree Visualization
from pytrees import DL85Classifier, LGDTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Generate sample data
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X = (X > 0).astype(float)  # binarize features
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
optimal_clf = DL85Classifier(max_depth=3, min_sup=10)
optimal_clf.fit(X_train, y_train)
dot_string = optimal_clf.to_dot()
# Save to file
with open('tree.dot', 'w') as f:
    f.write(dot_string)
# Or use with graphviz (if installed)
try:
    import graphviz
    graph = graphviz.Source(dot_string)
    graph.view()
    print("Tree visualization rendered and opened")
except ImportError:
    print("Install graphviz for automatic visualization: pip install graphviz")
Advanced Configuration
from pytrees import DL85Classifier, ExposedHeuristic, ExposedGainRule, ExposedPurityRule
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2, random_state=42)
X = (X > 0).astype(float)
# Advanced DL85 with rules and heuristics
advanced_clf = DL85Classifier(
    max_depth=4,
    min_sup=5,
    max_time=300.0,
    heuristic=ExposedHeuristic.InformationGain,
    gain=ExposedGainRule(min_gain=0.01, epsilon=1e-4),
    purity=ExposedPurityRule(min_purity=0.9)
)
advanced_clf.fit(X, y)
print(advanced_clf.score(X, y))
Next Steps
Now that you've got the basics down:
- Python Library: Explore the complete API
- Examples: See more real-world applications
- Key Concepts: Understand the algorithms and theory
- Advanced Features: Rules, custom functions, and optimization