Date: 2023-10-19
See Discord Binding for project context.
BERTopic on AMD GPU using ROCm
Testing BERTopic Run Times and Results
- Intel i7-9700: 673.2816 seconds
- T4 on Google Colab: 80.7281 seconds
- Cohere API: 358.6873 seconds, 35 cents USD, 3,534,005 calls
- GTX 1060 6GB: 75.4157 seconds
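To make the comparison easier to read, the timings can be turned into speedup factors relative to the CPU run. This is only a quick sketch: the numbers are copied from the results above, and the dictionary labels are my own.

```python
# Timings copied from the results above, in seconds.
timings = {
    "Intel i7-9700 (CPU)": 673.2816,
    "Google Colab T4": 80.7281,
    "Cohere API": 358.6873,
    "GTX 1060 6GB": 75.4157,
}

baseline = timings["Intel i7-9700 (CPU)"]

# Speedup = CPU time / run time, so values above 1 are faster than the CPU run.
speedups = {name: baseline / seconds for name, seconds in timings.items()}

# Print fastest first.
for name, factor in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {factor:.2f}x")
```

These are single runs, so the factors are rough indications rather than careful benchmarks.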
For additional context, see: What is the length of the BERTopic default dataset from sklearn?
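The dataset size can also be checked directly. Below is a minimal sketch, assuming scikit-learn is installed and the dataset can be downloaded on first use. `corpus_stats` and `report_20newsgroups` are hypothetical helper names, not part of any library.

```python
def corpus_stats(docs):
    """Return document count, empty-document count, and mean length in characters."""
    n_docs = len(docs)
    n_empty = sum(1 for d in docs if not d.strip())
    mean_chars = sum(len(d) for d in docs) / n_docs if n_docs else 0.0
    return {"n_docs": n_docs, "n_empty": n_empty, "mean_chars": mean_chars}


def report_20newsgroups():
    """Load the same corpus the scripts in these notes use and print its stats."""
    # Imported here so corpus_stats() works without scikit-learn installed.
    from sklearn.datasets import fetch_20newsgroups

    docs = fetch_20newsgroups(
        subset="all", remove=("headers", "footers", "quotes")
    )["data"]
    print(corpus_stats(docs))
```

Calling `report_20newsgroups()` should report 18,846 documents for `subset='all'`, which is the total scikit-learn documents for this dataset. Stripping headers, footers, and quotes leaves some documents empty, which is why the helper counts those separately.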
CPU Script
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
import timeit
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic()
start_time = timeit.default_timer()
topics, probs = topic_model.fit_transform(docs)
elapsed = timeit.default_timer() - start_time
print(f"Elapsed: {elapsed:.4f} seconds")
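Every script in these notes repeats the same start/stop timing pattern, so it could be factored into a small helper. This is a sketch only: `time_call` is a hypothetical name, and the `sum` workload is a stand-in so the example runs without BERTopic installed.

```python
import timeit


def time_call(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed_seconds)."""
    start = timeit.default_timer()
    result = fn(*args, **kwargs)
    elapsed = timeit.default_timer() - start
    return result, elapsed


# Tiny stand-in workload in place of topic_model.fit_transform(docs).
total, elapsed = time_call(sum, range(1_000_000))
print(f"sum={total}, elapsed={elapsed:.4f} seconds")
```

With BERTopic the call would look like `(topics, probs), elapsed = time_call(topic_model.fit_transform, docs)`. Note that model construction and dataset loading stay outside the timed region, as in the original scripts.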
Google Colab T4 Script
!pip install bertopic
import timeit
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# topic_model = BERTopic()
start_time = timeit.default_timer()
topics, probs = topic_model.fit_transform(docs)
elapsed = timeit.default_timer() - start_time
print(f"Elapsed: {elapsed:.4f} seconds")
Cohere API Script (Requires Production API Key) - COSTS $0.35 USD
# !pip install cohere
# !pip install bertopic
import cohere
from bertopic import BERTopic
from bertopic.backend import CohereBackend
import timeit
from sklearn.datasets import fetch_20newsgroups
client = cohere.Client("PRODUCTION_API_KEY")
embedding_model = CohereBackend(client)
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
topic_model = BERTopic(embedding_model=embedding_model)
start_time = timeit.default_timer()
topics, probs = topic_model.fit_transform(docs)
elapsed = timeit.default_timer() - start_time
print(f"Elapsed: {elapsed:.4f} seconds")
Installation Guide - RAPIDS Docs
GTX 1060 6GB
wget https://bootstrap.pypa.io/get-pip.py
python3 get-pip.py
rm get-pip.py
sudo apt install python3-dev
sudo apt install build-essential
sudo apt install nvidia-modprobe
python3 -m pip install bertopic
import timeit
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
topic_model = BERTopic(embedding_model="all-MiniLM-L6-v2")
docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']
# topic_model = BERTopic()
start_time = timeit.default_timer()
topics, probs = topic_model.fit_transform(docs)
elapsed = timeit.default_timer() - start_time
print(f"Elapsed: {elapsed:.4f} seconds")
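The GPU scripts rely on the embedding model picking up the GPU on its own. To be sure a run is not silently falling back to the CPU, the device can be chosen explicitly. This is a sketch under assumptions: `pick_device` and `build_topic_model` are hypothetical helper names, and it assumes `torch` and `sentence-transformers` are installed alongside `bertopic`.

```python
def pick_device():
    """Return "cuda" when PyTorch sees a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch available, so nothing can run on a GPU.
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def build_topic_model(device=None):
    """Build BERTopic with the embedding model pinned to an explicit device."""
    # Imported lazily so pick_device() works without these packages.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    device = device or pick_device()
    embedder = SentenceTransformer("all-MiniLM-L6-v2", device=device)
    return BERTopic(embedding_model=embedder)


print("Embedding device:", pick_device())
```

ROCm builds of PyTorch also expose AMD GPUs through the `torch.cuda` API, so the same check should apply there. This only pins the embedding step; UMAP and HDBSCAN still run on the CPU unless GPU versions of them are swapped in.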
RTX 3090
TODO
Issues with BERTopic on ROCm
RuntimeError: HIP error: invalid device function
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing HIP_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
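As the error message suggests, setting `HIP_LAUNCH_BLOCKING=1` should make the stack trace point at the call that actually failed. The sketch below sets it before PyTorch is imported and prints which backend the installed build targets. `describe_torch_backend` is a hypothetical helper name. My assumption, not confirmed in these notes, is that this error often means the installed ROCm build has no kernels compiled for the GPU's architecture, so the reported HIP version and device name are the first things worth checking.

```python
import os

# Set before importing torch so the HIP runtime sees it at initialisation.
os.environ.setdefault("HIP_LAUNCH_BLOCKING", "1")


def describe_torch_backend():
    """Collect basic facts about the installed PyTorch build and visible GPU."""
    try:
        import torch
    except ImportError:
        return {"torch": None}
    info = {
        "torch": torch.__version__,
        # torch.version.hip is a version string on ROCm builds and None otherwise.
        "hip": getattr(torch.version, "hip", None),
        "gpu_available": torch.cuda.is_available(),
    }
    if info["gpu_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
    return info


print(describe_torch_backend())
```

If `hip` comes back as `None`, the installed wheel is not a ROCm build at all, which would be a different problem from the error above.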
- python - Numba needs NumPy 1.20 or less for shapley import - Stack Overflow
- SystemError: initialization of _internal failed without raising an exception · openai/whisper · Discussion #1103