An open-source, low-code machine learning library in Python

🚀 Version 2.3.6 out now! Check out the release notes here.

Welcome to PyCaret

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is an end-to-end machine learning and model management tool that speeds up the experiment cycle considerably and makes you more productive.

Compared with other open-source machine learning libraries, PyCaret is a low-code alternative that can replace hundreds of lines of code with only a few. This makes experiments much faster and more efficient. PyCaret is essentially a Python wrapper around several machine learning libraries and frameworks, such as scikit-learn, XGBoost, LightGBM, CatBoost, spaCy, Optuna, Hyperopt, Ray, and a few more.

The design and simplicity of PyCaret are inspired by the emerging role of citizen data scientists, a term first used by Gartner. Citizen Data Scientists are power users who can perform both simple and moderately sophisticated analytical tasks that would previously have required more technical expertise.

Important Links
⭐ Tutorials New to PyCaret? Check out our official notebooks!
📋 Example Notebooks Example notebooks created by the community.
📙 Blog Tutorials and articles by contributors.
📚 Documentation The detailed API docs of PyCaret.
📺 Video Tutorials Our video tutorials from various events.
📢 Discussions Have questions? Engage with the community and contributors.
🛠️ Changelog Changes and version history.
🌳 Roadmap PyCaret's software and community development plan.

Installation

PyCaret's default installation only installs hard dependencies as listed in the requirements.txt file.

pip install pycaret

To install the full version:

pip install pycaret[full]

Supervised Workflow

Classification Regression

Unsupervised Workflow

Clustering Anomaly Detection

PyCaret ⚡ NEW ⚡ Time Series Module

PyCaret's new time series module is now available in beta. Staying true to the simplicity of PyCaret, it is consistent with our existing API and comes fully loaded with functionality: statistical testing, model training and selection (30+ algorithms), model analysis, automated hyperparameter tuning, experiment logging, deployment on cloud, and more. All of this takes only a few lines of code, just like the other modules of PyCaret. If you would like to give it a try, check out our official quick start notebook.

📚 Time Series Docs

❓ Time Series FAQs

🚀 Features and Roadmap

The module is still in beta. We are adding new functionality every day and doing weekly pip releases. Please create a separate Python environment to avoid dependency conflicts with the main pycaret package. The final release of this module will be merged into the main pycaret package in the next major release.

pip install pycaret-ts-alpha

Who should use PyCaret?

PyCaret is an open source library that anybody can use. In our view the ideal target audience of PyCaret is:

  • Experienced Data Scientists who want to increase productivity.
  • Citizen Data Scientists who prefer a low code machine learning solution.
  • Data Science Professionals who want to build rapid prototypes.
  • Data Science and Machine Learning students and enthusiasts.

PyCaret on GPU

With PyCaret >= 2.2, you can train models on GPU and speed up your workflow by 10x. To train models on GPU simply pass use_gpu = True in the setup function. There is no change in the use of the API, however, in some cases, additional libraries have to be installed as they are not installed with the default version or the full version. As of the latest release, the following models can be trained on GPU:

  • Extreme Gradient Boosting (requires no further installation)
  • CatBoost (requires no further installation)
  • Light Gradient Boosting Machine (requires GPU installation)
  • Logistic Regression, Ridge Classifier, Random Forest, K Neighbors Classifier, K Neighbors Regressor, Support Vector Machine, Linear Regression, Ridge Regression, Lasso Regression (require cuML >= 0.15)

License

PyCaret is completely free and open-source and licensed under the MIT license.

Comments
  • Introduction of GitHub Actions

    I think it is a good idea to run automated unit tests at commit time. This project currently uses Travis, but I feel that GitHub Actions has a higher affinity with development on GitHub. The two could even be used together as a trial. Please consider it.

    Example: https://github.com/stanfordmlgroup/ngboost

    Along with that, I would like to automate tasks such as https://github.com/pycaret/pycaret/pull/336.

  • Difference in Screen Printouts between 2.0.0 and 2.2.0

    Note: I am using pycaret in a Databricks notebook to tackle a classification problem.

    When doing a classification task using compare_models, I would see running results like this as the job progressed:

    image

    However, with the upgrade to 2.2.0, I no longer see the informative table, but this token from pandas:

    image

    This is a pandas styler object that is not rendering correctly in the notebook.

    So, my question is, what do I need to do to render this table properly inline? Is there a pandas setting or pycaret parameter that needs to be set?

    Snippet of current code:

    from pycaret.classification import *
    
    EXP = setup(data=data, 
                target='Tier', 
                ignore_features=['imdbId', 'DV', 'log(DV)'],
                train_size=0.8,
                silent=True)
    
    top5Models = compare_models(exclude=excludeAlgs, 
                                    sort='MCC',
                                    fold=kFolds,
                                    turbo=True,
                                    verbose=True,
                                    n_select=topN)
    

    Thanks for all timely help

  • MemoryError: Unable to allocate 116. GiB for an array with shape (15554265202,) and data type int64

    I am using a table with 1092 rows and 127 features plus one categorical target column and I am getting this error above. Any idea what could be going wrong here? The table contains integer values and the target column has a total of two classes.

     from pycaret.classification import * 
     clf1 = setup(data = data, 
             target = 'ORG',
             silent = False)
    > MemoryError: Unable to allocate 116. GiB for an array with shape (15554265202,) and data type int64
    

    I think, the data is not that big. Am I doing something wrong?
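
A quick back-of-the-envelope check, offered as a sketch rather than a diagnosis, shows how far the reported allocation is from the size of the table. The numbers come from the report above; the 8-byte element size follows from the int64 dtype in the error message, and the helper name `array_bytes` is made up for this example.

```python
# Compare the array named in the error message with the raw table size.
BYTES_PER_INT64 = 8
GIB = 1024 ** 3
MIB = 1024 ** 2


def array_bytes(n_elements: int, itemsize: int = BYTES_PER_INT64) -> int:
    """Bytes needed for a dense array of n_elements items."""
    return n_elements * itemsize


# Shape reported in the MemoryError: (15554265202,)
requested = array_bytes(15_554_265_202)

# The table itself: 1092 rows x (127 features + 1 target column),
# assuming every value were stored as int64.
table = array_bytes(1092 * (127 + 1))

print(f"requested: {requested / GIB:.1f} GiB")
print(f"table:     {table / MIB:.2f} MiB")
```

The request is roughly 116 GiB while the table would need about 1 MiB, which suggests the large array is created by some intermediate step rather than by the data itself.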

  • Error message installing to Jupyter Notebooks

    I was able to install PyCaret on Colab but not in Jupyter Notebooks via Anaconda. The PyCaret installation starts and almost completes, but each time, regardless of how I install it (pip or conda, terminal or notebook), I get this error message:
    ERROR: Command errored out with exit status 1: command: /Applications/anaconda3/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/4c/zjb0c1753g70dgrrsb3xy6c80000gn/T/pip-install-qbhjy_h2/pycaret/setup.py'"'"'; file='"'"'/private/var/folders/4c/zjb0c1753g70dgrrsb3xy6c80000gn/T/pip-install-qbhjy_h2/pycaret/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base pip-egg-info cwd: /private/var/folders/4c/zjb0c1753g70dgrrsb3xy6c80000gn/T/pip-install-qbhjy_h2/pycaret/ Complete output (8 lines): running egg_info creating pip-egg-info/pycaret.egg-info writing pip-egg-info/pycaret.egg-info/PKG-INFO writing dependency_links to pip-egg-info/pycaret.egg-info/dependency_links.txt writing requirements to pip-egg-info/pycaret.egg-info/requires.txt writing top-level names to pip-egg-info/pycaret.egg-info/top_level.txt writing manifest file 'pip-egg-info/pycaret.egg-info/SOURCES.txt' error: package directory 'pycaret' does not exist ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output

    Please advise, and thanks.

  • classification 'predict model' error

    predict_model in the classification module produces erroneous output when an unseen pandas dataframe is passed. The number of rows in the input and in the output differ.

  • display association rules plots in streamlit

    issue #884

    Changes

    Added display_format

    def plot_model(model, plot="2d", scale=1, display_format=None):
    .
    .
    .
        if display_format=='streamlit':
            st.write(fig)
        else:
            fig.show()
    
    

    Notes

    • Didn't include import streamlit as st since this will only work inside streamlit anyway
    • Didn't use st.plotly_chart since st.write can plot many types of charts and won't have to be changed if plotly is dropped later

    Update

    • Added display_format to anywhere plot_model can be called 😃
    • Error handling

    Additional:

    In nlp.py where fig.iplot() is used, I had to change asFigure = save_param to asFigure = True because streamlit needs a plotly object to interpret (https://discuss.streamlit.io/t/cufflinks-in-streamlit/2232/2)

    .
    .
    .
    if display_format=='streamlit':
        fig = df.iplot(asFigure=True) # plotly obj needs to be returned for streamlit to interpret
        st.write(fig)
    else:
        df.iplot()
    
  • Bayesian Hyperparameter Optimization

    Hi, I was wondering if we could use Bayesian hyperparameter optimization instead of a random grid search. This would speed up tuning and let us search a much larger grid more systematically. This enhancement could come together with the ability to supply a custom grid for tuning.

    Thanks

  • Time Series Plot Model (Frequency Components)

    Add plot models for Spectral Density and FFT

    plot_model(plot='spectrogram')
    plot_model(plot='welch')
    plot_model(plot='fft')
    

    All three can be achieved through scipy functions. We just need to plot them using plotly.
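
As a rough illustration of the FFT view only, here is a dependency-free sketch that computes a naive discrete Fourier transform and reports the dominant period of a series. It stands in for the SciPy routines mentioned above and is not the proposed implementation; the function names `dft_magnitudes` and `dominant_frequency` and the synthetic sine series are assumptions made for this example.

```python
import cmath
import math


def dft_magnitudes(signal):
    """Return |X[k]| for k = 0..n//2 using the O(n^2) DFT definition."""
    n = len(signal)
    mags = []
    for k in range(n // 2 + 1):
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal)
        )
        mags.append(abs(total))
    return mags


def dominant_frequency(signal, sample_rate=1.0):
    """Frequency (cycles per unit time) of the largest non-DC component."""
    mags = dft_magnitudes(signal)
    # Skip k = 0, which only carries the mean of the series.
    k = max(range(1, len(mags)), key=lambda i: mags[i])
    return k * sample_rate / len(signal)


# Hypothetical series: 120 steps with a 12-step seasonal cycle.
series = [math.sin(2 * math.pi * t / 12) for t in range(120)]
freq = dominant_frequency(series)
print(f"dominant period: {1 / freq:.1f} steps")
```

In practice an FFT routine would replace the quadratic loop, and the magnitudes would be handed to plotly for display.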

  • Unable to remove NaNs (missing values) using PyCaret

    I am unable to impute NaNs (missing values) with the mean or a constant using PyCaret. The documentation says it does this by default. However, I have tried both approaches (manual and automatic) and neither works. I am using my own car sales dataset.

    clf1 = setup(data = car_data, target = 'Price', numeric_imputation='mean', categorical_imputation='mode', train_size = 0.5)
    clf1 = setup(data = car_data, target = 'Price', categorical_features = ['Make', 'Colour'])
    
  • Prediction Probabilities in Classification

    predict_model() returns the label and score. Can we also return the prediction probabilities?

    I did it personally by simply saving the sklearn pipe and model objects, and then running them using sklearn methods.

  • Unable to install Pycaret

    Description: While trying to install the PyCaret package, I am getting the error below:

    ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'C:\Users\SasanapY\AppData\Local\Continuum\anaconda3\Anaconda64bit\Library\bin\tbbmalloc.dll' Consider using the --user option or check the permissions.

    Expected behavior: I expected the pycaret package to install.

    Actual behavior: Got the error above.

    Versions: OS - Windows 10, Python Version - 3.7.4

  • Please add evaluate_model for visualizing all the plots in time_series, like train_test_split, ts, forecast, decomposition.

    Describe the feature you want to add to this project

    Hi PyCaret team @ngupta23, @moezali1 and @Yard1, please add evaluate_model for visualizing all the plots in time_series, such as train_test_split, ts, forecast, decomposition, etc. I have seen this feature for classification and regression. If it is added to time_series as well, it would be good to see all the plots in one place.

    Describe your proposed solution

    ...

    Describe alternatives you've considered, if relevant

    ...

    Additional context

    ...

  • Implement Solution 3 as described in #3202

    Solution 3: Estimate the hyperparameters using only the train part of the first CV split

    • This will require us to update self._get_y_data to include another split type. This will be taken up in a separate development.

    Originally posted by @ngupta23 in https://github.com/pycaret/pycaret/issues/3202#issuecomment-1367589752

  • [ENH]: Handling sessions with Pycaret

    Describe the feature you want to add to this project

    I am having difficulty understanding how to properly use sessions in my code. When I define a new setup, I am unable to specify the USI value. I have attempted to solve this issue by using the following code:

    s = setup(data, target = 'quality', session_id=42, experiment_name = exp_name, log_experiment=True)
    s.set_config('USI', version_name)
    

    However, this creates a log before the USI is set, causing the USI values of the runs to not match the Session Initialized ... value. Additionally, when I run a new training and create a new setup, it is nested within the previous one.

    Screenshot 2022-12-29 at 15 59 39

    I would like to be able to properly manage the session so that, for example, if I run compare_models() today and want to fine-tune a model from this session tomorrow, I can insert it into the same session.

    Describe your proposed solution

    I would like to see functions such as list_sessions(), get_session(), and also have better control over nested sessions. These functions and improved control would allow me to more effectively manage and manipulate sessions in my code.

  • Added DagsHub Logger Support

    Describe the changes you've made

    1. Log data and model artifacts to DagsHub Storage.
    2. Log experiments to an MLFlow remote hosted by DagsHub.

    Type of change

    Please delete options that are not relevant.

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Code style update (formatting, local variables)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    Checklist:

    • [x] My code follows the style guidelines of this project.
    • [x] I have performed a self-review of my own code.
    • [x] I have commented my code, particularly in hard-to-understand areas.
    • [ ] I have made corresponding changes to the documentation.
    • [ ] My changes generate no new warnings.
    • [ ] I have added tests that prove my fix is effective or that my feature works.
    • [ ] New and existing unit tests pass locally with my changes.
    • [ ] Any dependent changes have been merged and published in downstream modules.
  • Return unmodified dataframe in `predict_model`

    Signed-off-by: Antoni Baum

    Related Issue or bug

    Info about Issue or bug

    Closes https://github.com/pycaret/pycaret/issues/3178

    Describe the changes you've made

    Returns non-transformed dataframe instead of transformed one in predict_model, matching the behavior in Pycaret 2.0.

    Type of change

    Please delete options that are not relevant.

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Code style update (formatting, local variables)
    • [ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
    • [ ] This change requires a documentation update

    How Has This Been Tested?

    Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce.

    Describe if there is any unusual behaviour of your code(Write NA if there isn't)

    A clear and concise description of it.

    Checklist:

    • [ ] My code follows the style guidelines of this project.
    • [ ] I have performed a self-review of my own code.
    • [ ] I have commented my code, particularly in hard-to-understand areas.
    • [ ] I have made corresponding changes to the documentation.
    • [ ] My changes generate no new warnings.
    • [ ] I have added tests that prove my fix is effective or that my feature works.
    • [ ] New and existing unit tests pass locally with my changes.
    • [ ] Any dependent changes have been merged and published in downstream modules.

    Screenshots

    Original | Updated :--------------------: |:--------------------: original screenshot | updated screenshot |

  • A jarman/issue1766

    Related Issue or bug

    Hi @AJarman

    I discussed this with @moezali1 and @Yard1 in the meeting on 20220409 and this is the recommendation. We should add an argument to setup called forecast_limit: Optional[List[Union[int, float]]] = None. By default, no limits are applied (same as today's functionality). If users want to set limits, they can do it as follows:

    forecast_limit = None # default - no limits
    forecast_limit = [0, 10000000] # lower and upper limit
    forecast_limit = [0, None]  # lower limit only
    forecast_limit = [None, 10000000] # upper limit only
    

    Coding Actions:

    • Inside setup, there will have to be a check to make sure that if a list is passed, it has exactly 2 entries and the types are correct.

    • If this is enabled, we should add a step to the transformation pipeline. The order of transformations will be as follows

      • Impute
      • Limit
      • Transform
      • Scale
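
The check described in the first coding action could look roughly like the sketch below. This is an illustration under stated assumptions, not the code that was merged: the function name `validate_forecast_limit` is hypothetical, and the final "lower must be smaller than upper" check is an extra assumption that the discussion above does not mention.

```python
from numbers import Real
from typing import List, Optional, Union

Bound = Optional[Union[int, float]]


def validate_forecast_limit(
    forecast_limit: Optional[List[Bound]],
) -> Optional[List[Bound]]:
    """Check a [lower, upper] pair; None on either side means 'no limit'."""
    if forecast_limit is None:
        return None  # default: no limits applied
    if not isinstance(forecast_limit, list):
        raise TypeError("forecast_limit must be a list of two values or None")
    if len(forecast_limit) != 2:
        raise ValueError("forecast_limit must have exactly 2 entries")
    for bound in forecast_limit:
        # bool is a subclass of int, so reject it explicitly.
        if bound is not None and (
            isinstance(bound, bool) or not isinstance(bound, Real)
        ):
            raise TypeError("each limit must be an int, a float, or None")
    lower, upper = forecast_limit
    # Assumption for this sketch: a two-sided limit must be a proper interval.
    if lower is not None and upper is not None and lower >= upper:
        raise ValueError("lower limit must be smaller than upper limit")
    return forecast_limit


print(validate_forecast_limit([0, None]))
```

Running the validation once inside setup would keep the later pipeline step free of type checks.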

    Future enhancements and user customization

    In the future, we will also extend the following arguments to accept Transformer objects directly so any user wanting to set custom arguments, can pass the object directly (this is similar to what the other modules such as regression and classification do in pycaret). This will do away with the need for the kwargs in setup.

    transform_target = "box_cox"  # Today
    transform_target = BoxCoxTransformer(limits = Sth, method="pearsonr")  # In the future (in lieu of transform_kwargs)
    
    forecast_limit = [0, None] # current proposal
    forecast_limit = ScaledLogitTransformer(0, None) # In the future (in lieu of transform_kwargs)
    

    Let me know if this makes sense. Would be happy to discuss more if needed.

    Thanks!

    Closes #1766

    Describe the changes you've made

    I have implemented the sktime ScaledLogitTransformer in the time series module. This is a pipeline step added after imputation and before transformation and scaling. It currently only affects the target variable and is specified using a list of two values. Implementing the same for exogenous variables would also be straightforward; however, at the moment I'm not sure how this would function in practice, given that they would presumably need different limits. I have left a stubbed-out option (elif) that would be used to pass an sktime estimator directly, as specified in the future requirements. At present this raises a specific NotImplementedError, rather than the generic TypeError raised for any other type.
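
To make the idea of the limit step concrete, here is a minimal pure-Python sketch of a scaled logit transform and its inverse as commonly defined. It is not the sktime implementation, whose exact behaviour should be checked against its documentation; the function names `scaled_logit` and `inverse_scaled_logit` and the sample values are assumptions made for this example.

```python
import math


def scaled_logit(x, lower=None, upper=None):
    """Map a value inside (lower, upper) onto the whole real line."""
    if lower is not None and upper is not None:
        return math.log((x - lower) / (upper - x))
    if lower is not None:
        return math.log(x - lower)
    if upper is not None:
        return -math.log(upper - x)
    return x  # no limits: identity


def inverse_scaled_logit(y, lower=None, upper=None):
    """Map any real value back into the limits."""
    if lower is not None and upper is not None:
        return (upper * math.exp(y) + lower) / (math.exp(y) + 1)
    if lower is not None:
        return math.exp(y) + lower
    if upper is not None:
        return upper - math.exp(-y)
    return y


# A model forecasts in the transformed space; whatever it predicts,
# the back-transformed value lands between the limits (here 0 and 100).
for y in (-50.0, 0.0, 50.0):
    print(y, inverse_scaled_logit(y, lower=0, upper=100))
```

Because the model is fitted on the transformed target, the limits are enforced by the back-transformation rather than by clipping the forecasts afterwards.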

    Type of change

    Please delete options that are not relevant.

    • [x] New feature (non-breaking change which adds functionality)

    How Has This Been Tested?

    Unit tests have been added which use the airline dataset from conftest.py as a fixture. These test a combination of relevant inputs.

    Describe if there is any unusual behaviour of your code(Write NA if there isn't)

    NA

    Checklist:

    • [x] My code follows the style guidelines of this project.
    • [x] I have performed a self-review of my own code.
    • [x] I have commented my code, particularly in hard-to-understand areas.
    • [x] I have made corresponding changes to the documentation.
    • [x] My changes generate no new warnings.
    • [x] I have added tests that prove my fix is effective or that my feature works.
    • [ ] New and existing unit tests pass locally with my changes.
    • [x] Any dependent changes have been merged and published in downstream modules.

    Screenshots

    Original | Updated :--------------------: |:--------------------: original screenshot | updated screenshot |
