Introduction
Pytrees-rs is a library of decision tree learning algorithms implemented using dynamic programming.
The currently available algorithms include:
DL8.5: Optimal Decision Tree Learning Algorithm
DL8.5 is an algorithm for learning optimal decision trees, supporting misclassification error by default as well as user-provided objective functions. The method uses a branch-and-bound approach with caching to prune the search space and speed up exploration. For further insights, refer to the relevant paper and the native implementation. Both implementations incorporate concepts introduced by Murtree, another optimal decision tree learning algorithm.
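The core idea can be sketched in a few lines. This is a hedged toy illustration, not the pytrees-rs implementation: it shows only the dynamic-programming recursion with caching on instance sets, while the real algorithm adds branch-and-bound pruning with lower/upper bounds and a trie or hashmap cache.

```python
from functools import lru_cache

# Toy binary dataset: XOR labels, so no depth-1 tree is perfect.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
Y = [0, 1, 1, 0]

def leaf_error(rows):
    """Misclassification error of the best constant prediction on `rows`."""
    ones = sum(Y[i] for i in rows)
    return min(ones, len(rows) - ones)

@lru_cache(maxsize=None)  # memoise sub-problems on (instance set, depth)
def best_error(rows, depth):
    """Lowest error of any tree with at most `depth` levels of splits."""
    err = leaf_error(rows)
    if depth == 0 or err == 0:
        return err
    for f in range(len(X[0])):
        left = frozenset(i for i in rows if X[i][f] == 0)
        right = rows - left
        if left and right:  # skip splits that separate nothing
            err = min(err, best_error(left, depth - 1) + best_error(right, depth - 1))
    return err

all_rows = frozenset(range(len(X)))
print(best_error(all_rows, 1))  # prints 2: XOR cannot be solved at depth 1
print(best_error(all_rows, 2))  # prints 0: a depth-2 tree is perfect
```

Because identical instance sets recur under different split orders, the cache lets the search reuse sub-tree solutions instead of recomputing them.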
LDS-DL8.5: Anytime Optimal Decision Tree Learning Algorithm
An anytime decision tree learning algorithm built upon DL8.5. It addresses a weakness of dynamic programming approaches by preventing the search from stagnating in one part of the search space, and it guarantees a tree of reasonable quality within a time budget by leveraging Limited Discrepancy Search and a carefully designed restart strategy. For in-depth details, refer to the associated paper.
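The Limited Discrepancy Search ingredient can be sketched as follows. This is an illustrative sketch of LDS in general (names and encoding are mine, not from pytrees-rs): choices are explored in waves, first following the heuristic everywhere (zero discrepancies), then allowing one deviation, then two, so that good trees far from the heuristic path are reached early instead of only after one subtree is exhausted.

```python
from itertools import combinations

def lds_order(depth):
    """Yield all 0/1 choice vectors of length `depth`, grouped by the number
    of deviations (1s) from the heuristic's preferred choice (0)."""
    for k in range(depth + 1):          # k = number of allowed discrepancies
        for positions in combinations(range(depth), k):
            vector = [0] * depth
            for p in positions:
                vector[p] = 1           # deviate from the heuristic here
            yield vector

for v in lds_order(3):
    print(v)  # [0, 0, 0] first, then the three one-discrepancy vectors, ...
```

In LDS-DL8.5 this ordering is combined with a restart strategy, so the best tree found so far keeps improving until the time limit is hit.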
LGDT: Less Greedy Decision Tree Learning Algorithm
LGDT employs the reversible sparse bitset data structure also found in DL8.5 and LDS-DL8.5. Additionally, it incorporates Murtree's specialized depth-2 algorithm to produce decision trees with a less myopic evaluation. Unlike purely greedy algorithms such as CART and C4.5, LGDT assesses error or information gain over two levels to determine the best local split. For a more in-depth understanding, refer to the accompanying paper.
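The difference between one-level and two-level evaluation can be shown on a toy dataset. This sketch is my own illustration (not pytrees-rs code): the labels are the XOR of features 0 and 1, and feature 2 is a decoy that looks best to a one-level greedy score but leads to a worse tree than the two-level score prefers.

```python
# XOR on features 0 and 1, plus a decoy feature 2 that looks good greedily.
X = [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 1)]
Y = [0, 1, 1, 0]

def leaf_error(rows):
    """Misclassification error of the best constant prediction on `rows`."""
    ones = sum(Y[i] for i in rows)
    return min(ones, len(rows) - ones)

def split(rows, f):
    left = [i for i in rows if X[i][f] == 0]
    right = [i for i in rows if X[i][f] == 1]
    return left, right

def greedy_score(rows, f):
    # One-level evaluation, as in purely greedy learners.
    left, right = split(rows, f)
    return leaf_error(left) + leaf_error(right)

def two_level_score(rows, f):
    # Two-level evaluation: best error with one extra split on each side.
    total = 0
    for part in split(rows, f):
        best = leaf_error(part)
        for g in range(len(X[0])):
            l2, r2 = split(part, g)
            if l2 and r2:
                best = min(best, leaf_error(l2) + leaf_error(r2))
        total += best
    return total

rows = list(range(len(X)))
print(min(range(3), key=lambda f: greedy_score(rows, f)))     # prints 2: greedy picks the decoy
print(min(range(3), key=lambda f: two_level_score(rows, f)))  # prints 0: part of the XOR
```

Scoring each candidate by what is achievable one level deeper is what makes the choice less myopic; Murtree's specialized depth-2 algorithm makes this two-level evaluation cheap in practice.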
Installation
Before installing Pytrees-rs, make sure to have the rust toolchain installed.
Building From Source
git clone git@github.com:haroldks/pytrees-rs.git
cd pytrees-rs
cargo build --release
These commands create a binary (dtrees_rs) in release mode (with compiler optimizations). If desired, you can create a symbolic link to this binary in your local binary directory (assuming you are on a Unix-based OS).
ln -s ./target/release/dtrees_rs $HOME/.local/bin
From crates.io (TBA)
Command line arguments
Usage: dtrees-rs [OPTIONS] --input <INPUT> <COMMAND>
Commands:
dl85 DL8.5 Optimal search Algorithm with no depth limit and classification error as criterion.
d2-odt Optimal depth 2 algorithms using Error or Information as criterion
lgdt Less greedy decision tree approach using a misclassification or information gain tree as a sliding window
help Print this message or the help of the given subcommand(s)
Options:
-i, --input <INPUT> Dataset input file path
--print-stats Printing Statistics and Constraints
--print-tree Printing Tree
-h, --help Print help
-V, --version Print version
The binary offers several subcommands depending on the search algorithm used.
- dl85: Generate decision trees using DL8.5.
- d2-odt: Generate only optimal decision trees with a depth of 2, using either classification error or information gain as the cost function.
- lgdt: Generate less greedy (lookahead) decision trees.
Learn an Optimal Decision Tree using DL8.5
Usage: dtrees-rs --input <INPUT> dl85 [OPTIONS] --depth <DEPTH>
Options:
-s, --support <SUPPORT>
Minimum support [default: 1]
-d, --depth <DEPTH>
Maximum depth
--sorting-once
Sort features based on the heuristic only at the root (if set) rather than at each node
--specialization <SPECIALIZATION>
Use Murtree Specialization Algorithm [default: none] [possible values: murtree, none]
--lb <LOWER_BOUND_HEURISTIC>
Lower bound heuristic strategy [default: none] [possible values: similarity, none]
-b, --branching <BRANCHING>
Branching type [default: none] [possible values: dynamic, none]
--cache <CACHE_TYPE>
[default: trie] [possible values: trie, hashmap]
--cache-init-size <CACHE_INIT_SIZE>
Cache init size: the reserved starting size of the cache [default: 0]
--init-strategy <INIT_STRATEGY>
Cache Initialization strategy [default: none] [possible values: dynamic-allocation, user-allocation, none]
-h, --heuristic <HEURISTIC>
Sorting heuristic [default: none] [possible values: information-gain, information-gain-ratio, gini-index, none]
--max-error <MAX_ERROR>
Tree error initial upper bound [default: inf]
-t, --timeout <TIMEOUT>
Maximum time allowed to the search
-h, --help
Print help
Learning Time-Constrained DL8.5 (TBA)
Learning Depth-2 Optimal decision Trees
Usage: dtrees-rs --input <INPUT> d2-odt [OPTIONS]
Options:
-s, --support <SUPPORT> Minimum support [default: 1]
-d, --depth <DEPTH> Maximum depth. The algorithm is optimized for depths 1 and 2 and does not support larger values [default: 2]
-o, --objective <OBJECTIVE> Objective to optimise. Error or Information Gain [default: error] [possible values: error, information-gain]
-h, --help Print help
Learning Less Greedy Decision Trees
Usage: dtrees-rs --input <INPUT> lgdt [OPTIONS] --depth <DEPTH>
Options:
-s, --support <SUPPORT> Minimum support [default: 1]
-d, --depth <DEPTH> Maximum depth
-o, --objective <OBJECTIVE> Objective function to optimize [default: error] [possible values: error, information-gain]
-h, --help Print help
A Python interface is available on PyPI. It lets you use the implemented algorithms from a Python script through a simple, familiar API that is compatible with Scikit-Learn pipelines.
Installing From Pypi
Assuming you have Python 3.9 or 3.10 installed, precompiled versions of the library are available on PyPI for macOS and Linux operating systems. Binaries for Windows versions will be available in the future. To install the library, use:
pip install pytrees-rs
Installing from source
After cloning the project as described above, you'll find a subdirectory named pytrees-rs containing the source code of the interface.
To install the interface in your Python environment, follow these instructions:
pip install --upgrade pip
cd pytrees-rs
pip install .
The Python API allows you to do several things.
Simple Example
import numpy as np
from sklearn.model_selection import train_test_split

from pytrees import DL85Classifier

# Load a dataset where the first column is the label and the rest are binary features
dataset = np.genfromtxt("../test_data/anneal.txt", delimiter=" ")
X, y = dataset[:, 1:], dataset[:, 0]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

clf = DL85Classifier(min_sup=1, max_depth=2, max_time=600)
clf.fit(X_train, y_train)

train_accuracy = clf.accuracy_
test_accuracy = clf.score(X_test, y_test)