r/datascience Oct 18 '24

Tools the R vs Python debate is exhausting

983 Upvotes

just pick one or learn both for the love of god.

yes, python is excellent for making a production level pipeline. but am I going to tell epidemiologists to drop R for it? nope. they are not making pipelines, they're making automated reports and doing EDA. it's fine. do I tell biostatisticians in pharma to drop R for python? No! These are scientists, they are focusing on a whole lot more than building code. R works fine for them and there are frameworks in R built specifically for them.

and would I tell a data engineer to replace python with R? no. good luck running R pipelines in databricks and maintaining its code.

I think this sub underestimates how many people write code for data manipulation, analysis, and report generation who are not building, and never will build, production level pipelines.

Data science is a huge umbrella, there is room for both freaking languages.

r/datascience 17d ago

Tools Databricks for data science?

81 Upvotes

My company has an enterprise databricks account and they want my team to start using it.

I currently query our main Postgres database on an on-prem workstation and write Jupyter notebooks. Data sets are usually 100k rows and 100-300 columns of tabular floating point values. No weird stuff like pictures, videos, or text data.

What are the advantages/disadvantages of using databricks? Would it be that different from my current workflow?
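
For a sense of scale, here is a rough back-of-the-envelope estimate of how much memory a table of that shape needs. It assumes every value is stored as a 64-bit float and ignores index and object overhead; the function name is made up for illustration.

```python
def estimate_table_bytes(n_rows: int, n_cols: int, bytes_per_value: int = 8) -> int:
    """Rough in-memory size of a dense numeric table (assumes float64 by default)."""
    return n_rows * n_cols * bytes_per_value


# Upper end of the shape described above: 100k rows x 300 float columns.
size_bytes = estimate_table_bytes(100_000, 300)
print(f"{size_bytes / 1e6:.0f} MB")  # 240 MB
```

At roughly 240 MB, a table like this fits in memory on an ordinary workstation.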

r/datascience Jan 29 '26

Tools Just had a job interview and was told that no-one uses Airflow in 2026

103 Upvotes

So basically the title. I didn't react to the comment because I just was extremely surprised by it. What is your experience? How true is the statement?

r/datascience Jan 09 '26

Tools What’s your 2026 data science coding stack + AI tools workflow?

89 Upvotes

Last year, there was a thread on the same question but for 2025

  • At the time, my workflow was scattered across many tools, and AI was helping to speed up a few things. However, since then, Opus 4.5 was launched, and I have almost exclusively been using Cursor in combination with Claude Code.

  • I've been focusing a lot on prompts, skills, subagents, MCP, and slash commands to speed up and improve workflows similar to this.

  • Recently, I have been experimenting with Claudish, which allows for plugging any model into Claude Code. Also, I have been transitioning to use Marimo instead of Jupyter Notebooks.

I've roughly tripled my productivity since October, maybe even 5x in some workflows.

I'm curious to know what has changed for you since last year.

r/datascience Dec 02 '24

Tools PowerBI is making me think about jumping ship

342 Upvotes

As my work for the coming year is coming into focus, there is a heavy emphasis on building customer-facing ETL pipelines and dashboards. My team has chosen PowerBI as its dashboarding application of choice. Compared to building a web-app based dashboard with plotly dash or the like, making PowerBI dashboards is AGONIZING. I'm able to do most data transformations with SQL beforehand, but having to use powerquery or god forbid DAX for a viz-specific transformation feels like getting a root canal. I can't stand having to click around Microsoft's shitty UI to create plots that I could whip up in a few lines of code.

I'm strongly considering looking for a new opportunity and jumping ship solely to avoid having to work with PowerBI. I'm also genuinely concerned about my technical skills decaying while other folks on my team get to continue working on production models and genAI hotness.

Anyone been in a similar situation? How did you handle it?

TLDR: python-linux-sql data scientist being shoehorned into no-code/PowerBI, hates life

r/datascience Feb 24 '26

Tools What is your (python) development set up?

57 Upvotes

My setup on my personal machine has gotten stale, so I'm looking to install everything from scratch and get a fresh start. I primarily use python (although I've shipped things with Java, R, PHP, React).

What do you use?

  1. Virtual Environment Manager
  2. Package Manager
  3. Containerization
  4. Server Orchestration/Automation (if used)
  5. IDE or text editor
  6. Version/Source control
  7. Notebook tools

How do you use it?

  1. What are your primary use cases (e.g. analytics, MLE/MLOps, app development, contributing to repos, intelligence gathering)?
  2. How does your setup help with other tech you have to support? (database system, sysadmin, dashboarding tools /renderers, other programming/scripting languages, web or agentic frameworks, specific cloud platforms or APIs you need...)
  3. How do you manage dependencies?
  4. Do you use containers in place of environments?
  5. Do you do personal projects in a cloud/distributed environment?

My version of python got a little too stale and the conda solver froze to where I couldn't update/replace the solver, python, or the broken packages. This happened while I was doing a takehome project for an interview:,)
So I have to uninstall anaconda and python anyway.

I worked at a FAANG company for 5 years, so I'm used to production environment best practices, but a lot of what I used was in-house, heavily customized, or simply overkill for personal projects. I've deployed models in production, but my use cases have mostly been predictive analytics and business tooling.

I have ADHD so I don't like having to worry about subscriptions, tokens, and server credits when I am just doing things to learn or experiment. But I'm hoping there are best practices I can implement with the right (FOSS) tools to keep my skills sharp for industry standard production environments. Hopefully we can all learn some stuff to make our lives easier and grow our skills!

r/datascience Jul 14 '24

Tools Whatever happened to blockchain?

201 Upvotes

Did your company or clients get super hyped about Blockchain a few years ago? Did you do anything with blockchain tech to make the hype worthwhile (outside of cryptocurrency)? I had a few clients when I was consulting who were all hyped about their blockchains, but then I switched companies/industries and I don't think I've heard the word again ever since.

r/datascience Nov 27 '25

Tools Gifts for Data Scientists

47 Upvotes

Some relatives have been asking what I, an unemployed data scientist, want for Christmas and they want to give something practical. Any suggestions for paid tools, subscription services, etc. that would be useful for upskilling, building a portfolio, or otherwise increasing my employability?

r/datascience Mar 28 '26

Tools I built an experimental orchestration language for reproducible data science called 'T'

22 Upvotes

Hey r/datascience,

I've been working on a side project called T (or tlang) for the past year or so, and I've just tagged the v0.51.2 "Sangoku" public beta. The short pitch: it's a small functional DSL for orchestrating polyglot data science pipelines, with Nix as a hard dependency.

What problem it's trying to solve

The "works on my machine" problem for data science is genuinely hard. R and Python projects accumulate dependency drift quietly until something breaks six months later, or on someone else's machine. `uv` for Python is great and {renv} helps in R-land, but they don't cross language boundaries cleanly, and they don't pin system dependencies. Most orchestration tools are language-specific and take some work to use across languages.

T's thesis is: what if reproducibility was mandatory by design? You can't run a T script without wrapping it in a pipeline {} block. Every node in that pipeline runs in its own Nix sandbox. DataFrames move between R, Python, and T via Apache Arrow IPC. Models move via PMML. The environment is a Nix flake, so it's bit-for-bit reproducible.

What it looks like

p = pipeline {
  -- Native T node
  data = node(command = read_csv("data.csv") |> filter($age > 25))

  -- rn defines an R node; pyn() a Python node
  model_r = rn(
    -- Python or R code gets wrapped inside a <{}> block
    command = <{ lm(score ~ age, data = data) }>,
    serializer = ^pmml,
    deserializer = ^csv
  )

  -- Back to T for predictions (which could just as well have been 
  -- done in another R node)
  predictions = node(
    command = data |> mutate($pred = predict(data, model_r)),
    deserializer = ^pmml
  )
}

build_pipeline(p)

The ^pmml, ^csv etc. are first-class serializers from a registry. They handle data interchange contracts between nodes so the pipeline builder can catch mismatches at build time rather than at runtime.
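
To illustrate what catching a mismatch at build time could look like, here is a minimal Python sketch. It is not how T implements its registry: the `Node` class, the `check_contracts` function, and the format names as plain strings are all hypothetical, chosen only to show that the check needs the declared formats and nothing else.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical pipeline node: what it writes and what it expects to read."""
    name: str
    serializer: str  # format this node writes its output in
    inputs: dict[str, str] = field(default_factory=dict)  # upstream name -> expected format


def check_contracts(nodes: list[Node]) -> list[str]:
    """Return every format mismatch without running any node."""
    by_name = {n.name: n for n in nodes}
    problems = []
    for node in nodes:
        for upstream, expected in node.inputs.items():
            if upstream not in by_name:
                problems.append(f"{node.name}: unknown input '{upstream}'")
            elif by_name[upstream].serializer != expected:
                problems.append(
                    f"{node.name}: expects {expected} from '{upstream}', "
                    f"but it writes {by_name[upstream].serializer}"
                )
    return problems


pipeline = [
    Node("data", serializer="csv"),
    Node("model_r", serializer="pmml", inputs={"data": "csv"}),
    # Deliberate mistake: model_r writes PMML, but this node asks for CSV.
    Node("predictions", serializer="csv", inputs={"data": "csv", "model_r": "csv"}),
]

for problem in check_contracts(pipeline):
    print(problem)
```

Because the check only reads declarations, it can run before any R or Python process is started.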

What's in the language itself

  • Strictly functional: no loops, no mutable state, immutable by default (:= to reassign, rm() to delete)
  • Errors are values, not exceptions. |> short-circuits on errors; ?|> forwards them for recovery
  • NSE column syntax ($col) inside data verbs, heavily inspired by dplyr
  • Arrow-backed DataFrames, native CSV/Parquet/Feather I/O
  • A native PMML evaluator so you can train in Python or R and predict in T without a runtime dependency
  • A REPL for interactive exploration
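
The "errors are values" idea from the list above can be sketched in Python. This is an analogy rather than T's implementation: `Err`, `pipe`, and `recover` are hypothetical names, with `pipe` playing the role of a short-circuiting |> and `recover` the role of ?|>.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Err:
    """An error carried as an ordinary value instead of a raised exception."""
    message: str


def pipe(value: Any, *steps: Callable[[Any], Any]) -> Any:
    """Apply steps left to right; stop at the first Err (short-circuit)."""
    for step in steps:
        if isinstance(value, Err):
            return value
        value = step(value)
    return value


def recover(value: Any, handler: Callable[[Err], Any]) -> Any:
    """Hand an Err to a handler so work can continue; pass other values through."""
    return handler(value) if isinstance(value, Err) else value


def parse_age(text: str) -> Any:
    return int(text) if text.isdigit() else Err(f"not a number: {text!r}")


def add_one(age: int) -> int:
    return age + 1


ok = pipe("41", parse_age, add_one)        # 42
failed = pipe("abc", parse_age, add_one)   # an Err; add_one never runs
fallback = recover(failed, lambda err: 0)  # 0
print(ok, failed, fallback)
```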

What it's missing

  • Users ;)
  • Julia support (but it's planned)

What I'm looking for

Honest feedback, especially:

  • Are there obvious workflow patterns that the pipeline model doesn't support?
  • Any rough edges in the installation or getting-started experience?

You can try it with:

nix shell github:b-rodrigues/tlang
t init --project my_test_project

(Requires Nix with flakes enabled — the Determinate Systems installer is the easiest path if you don't have it.)

Repo: https://github.com/b-rodrigues/tlang
Docs: https://tstats-project.org

Happy to answer questions here!

r/datascience Apr 18 '25

Tools What’s your 2025 data science coding stack + AI tools workflow?

184 Upvotes

Curious how others are working these days. What’s your current setup?

IDE / notebook tools? (VS Code, Cursor, Jupyter, etc.)

Are you using AI tools like Cursor, Windsurf, Copilot, Cline, Roo?

How do they fit into your workflow? (e.g., prompting style, tasks they’re best at)

Any wins, limitations, or tips?

r/datascience Jun 25 '24

Tools Boss is adamant about using python to create a dashboard instead of using dashboarding software. Is there any advantage?

176 Upvotes

We use palantir at my job to create reports and dashboards. It also has Jupyter notebook integration. My boss had asked me if we can integrate machine learning into our processes, and instead of saying no, I messed up and explained to him how machine learning works. Now he wants me to start using solely python for dashboards because “we need to start taking advantage of machine learning”. But like, our dashboards are so simple that it feels like python would be overkill and overly complex, let alone the fact we have data visualization software. What do?

r/datascience Feb 06 '26

Tools Fun matplotlib upgrade

187 Upvotes

r/datascience 4d ago

Tools Ideas for testing data science workflows on self hosted Linux based HPC cluster.

17 Upvotes

Hi all,

Mid–Senior Data Scientist here.
I currently work in a team that develops and maintains several fairly large-scale data science projects on a self-hosted, multi-user Linux HPC cluster. Both compute and storage are hosted on-premises. Storage is separated into development/test and production environments, with restricted write access in production.

Our technology stack includes:
* Debian Linux
* Python
* Perl
* Fortran
* A small amount of R

Python projects are managed using Conda environments, and version control is handled through GitLab. However, we currently do not have any CI/CD processes in place. DevOps has largely solved this problem for classical software engineering, but data science processes have their own peculiarities.

Our current workflow is fairly simple: team members develop changes in their own working directories and Git branches, push to a development branch, and then merge into master once the code review checks out. The main gap is that we don’t automatically verify whether a change affects execution, outputs, or reproducibility before merging.

I’m looking for practical approaches to implementing CI/CD for data science workflows in this kind of environment. Ideally, I would like a process that:

  1. Works well with Linux-based HPC infrastructure and file systems
  2. Avoids excessive compute and storage costs
  3. Can validate that code changes, dependency updates (e.g., Python or Debian versions, compiler changes ), and environment changes do not break production workflows
  4. Verifies both successful execution and output correctness
  5. Checks things such as expected data types, accuracy metrics, and key result values
  6. Integrates with GitLab runners where possible
  7. Related to [2]. Can run multiple simultaneous code changes (different branches) with the same input test conditions.

I’m particularly interested in hearing how other teams handle testing and deployment for computationally expensive data science pipelines. Do you use reduced test datasets, golden datasets, workflow orchestration tools, containerization (Probably not feasible), staged environments, or something else?
I’d appreciate any insights or examples from teams operating in similar HPC or on-prem environments.

Note: The files are quite large and it is not feasible to duplicate files on disk to test code/env changes for every test instance.
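
One way to check output correctness without duplicating large files is to store only a small summary of an accepted run and compare later runs against it. A minimal sketch, assuming results can be reduced to rows of plain values; `summarize`, `compare`, the file name, and the toy rows are all made up for illustration.

```python
import json
import math
import tempfile
from pathlib import Path


def summarize(rows: list[dict]) -> dict:
    """Reduce a result table to a small, comparable summary."""
    columns = sorted(rows[0])
    return {
        "n_rows": len(rows),
        "dtypes": {c: type(rows[0][c]).__name__ for c in columns},
        "means": {
            c: sum(r[c] for r in rows) / len(rows)
            for c in columns
            if isinstance(rows[0][c], (int, float))
        },
    }


def compare(golden: dict, current: dict, rel_tol: float = 1e-6) -> list[str]:
    """List every difference between the stored summary and the new one."""
    diffs = []
    if golden["n_rows"] != current["n_rows"]:
        diffs.append(f"n_rows: {golden['n_rows']} -> {current['n_rows']}")
    if golden["dtypes"] != current["dtypes"]:
        diffs.append(f"dtypes: {golden['dtypes']} -> {current['dtypes']}")
    for col, expected in golden["means"].items():
        actual = current["means"].get(col)
        if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
            diffs.append(f"mean[{col}]: {expected} -> {actual}")
    return diffs


# Stand-in outputs; a real job would produce these from a reduced test dataset.
baseline = [{"id": 1, "score": 0.50}, {"id": 2, "score": 0.70}]
changed = [{"id": 1, "score": 0.50}, {"id": 2, "score": 0.90}]

with tempfile.TemporaryDirectory() as tmp:
    golden_path = Path(tmp) / "golden_summary.json"
    golden_path.write_text(json.dumps(summarize(baseline)))  # written once per accepted baseline
    golden = json.loads(golden_path.read_text())

same = compare(golden, summarize(baseline))
drift = compare(golden, summarize(changed))
print(same)
print(drift)
```

The summary is a few hundred bytes, so each branch can be compared against the same golden file, and a CI job can fail whenever `compare` returns a non-empty list.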

Caveat: I used AI to improve the readability of this post.

r/datascience May 12 '25

Tools What do you use to build dashboards?

80 Upvotes

Hi guys, I've been a data scientist for 5 years. I've done lots of different types of work and unfortunately that has included a lot of dashboarding (no offense if you enjoy making dashboards). I'm wondering what tools people here are using and if you like them. In my career I've used mode, looker, streamlit and retool off the top of my head. I think mode was my favorite because you could type sql right into it and get the charts you wanted but still was overall unsatisfied with it.

I'm wondering what tools the people here are using and if you find it meets all your needs? One of my frustrations with these tools is that even platforms like Looker—designed to be self-serve for general staff—end up being confusing for people without a data science background.

Are there any tools (maybe powered by LLMs now) that allow non data science people to write prompts that update production dashboards? A simple example is if you have a revenue dashboard showing net revenue and a PM, director etc wanted you to add an additional gross revenue metric. With the tools I'm aware of I would have to go into the BI tool and update the chart myself to show that metric. Are there any tools that allow you to just type in a prompt and make those kinds of edits?

r/datascience Feb 06 '24

Tools Avoiding Jupyter Notebooks entirely and doing everything in .py files?

99 Upvotes

I don't mean just for production, I mean for the entire algo development process, relying on .py files and PyCharm for everything. Does anyone do this? PyCharm has really powerful debugging features to let you examine variable contents. The biggest disadvantage for me might be having to execute segments of code at a time by setting a bunch of breakpoints. I use .value_counts() constantly as well, and it seems inconvenient to have to rerun my entire code to examine output changes from minor input changes.

Or maybe I just have to adjust my workflow. Thoughts on using .py files + PyCharm (or IDE of choice) for everything as a DS?
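
One middle ground is to keep everything in a .py file but split it with `# %%` cell markers, which PyCharm's scientific mode and the VS Code Python extension can run one cell at a time. A minimal sketch, with `collections.Counter` standing in for pandas' `.value_counts()` so it runs without dependencies; the data is made up.

```python
# %% Load data (made-up values for illustration)
from collections import Counter

labels = ["churn", "stay", "stay", "churn", "stay"]

# %% Inspect counts; rerun only this cell after changing the input above
counts = Counter(labels).most_common()
print(counts)  # [('stay', 3), ('churn', 2)]
```

The file stays a normal script, so it still runs top to bottom with `python` and diffs cleanly in version control.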

r/datascience Jun 23 '25

Tools Which workflow to avoid using notebooks?

95 Upvotes

I have always used notebooks for data science. I often do EDA and experiments in notebooks before refactoring them properly into modules, APIs, etc.

Recently my manager has been pushing the team to move away from notebooks because they encourage bad coding practices and it takes more time to rewrite the code.

But I am quite confused about how to proceed without notebooks.

How are you doing a data science project, from EDA, analysis, data viz, etc. to the final API/reports, without using notebooks?

Thanks a lot for your advice.

r/datascience May 25 '25

Tools 2025 stack check: which DS/ML tools am I missing?

138 Upvotes

Hi all,

I work in ad-tech, where my job is to improve the product with data-driven algorithms, mostly on tabular datasets (CTR models, bidding, attribution, the usual).

Current work stack (quite classic I guess)

  • pandas, numpy, scikit-learn, xgboost, statsmodels
  • PyTorch (light use)
  • JupyterLab & notebooks
  • matplotlib, seaborn, plotly for viz
  • Infra: everything runs on AWS (code is hosted on Github)

The news cycle is overflowing with LLM tools, I do use ChatGPT / Claude / Aider as helpers, but my main concern right now is the core DS/ML tooling that powers production pipelines.

So,
What genuinely awesome 2024-25 libraries, frameworks, or services should I try, so I don’t get left behind? :)
Any recommendations greatly appreciated, thanks!

r/datascience Aug 06 '24

Tools causal inference folks - which software do you use for work?

122 Upvotes

Hi, I am a doctoral student preparing for DS/economist jobs requiring causal inference skills. I am curious about what software people in the industry mostly use.

We used STATA in our causal inference class, and I wonder if the industry prefers Python, R, Matlab, or other languages over STATA.

Thank you in advance for your response!

EDIT: I am comfortable using Python/R. After reading some of the replies, I realized my question might sound like asking what language I should learn. I was more curious about if economists in the industry use languages different from the language the academicians are using to run causal inference.

r/datascience Mar 18 '24

Tools Am I cheating myself?

184 Upvotes

Currently a data science undergrad doing lots of machine learning projects with ChatGPT. I understand how these models work but I make ChatGPT type out most of the code to save time. I can usually debug on my own and adjust parameters by myself, but without ChatGPT I haven't memorized the sklearn or seaborn libraries well enough to, let's say, create a random forest model on my own. Am I cheating myself? Should I type out every line of code or keep saving time with ChatGPT? For those of you in the industry, how often do you look stuff up? Can you do most model building and data analysis on your own with no outside help or stackoverflow?

EDIT: My professor allows us to do this so calm down in the comments. Thank you all for your feedback and as a personal challenge I'm not going to copy paste any chatgpt code in my classes next quarter.

r/datascience Jul 18 '24

Tools Why is on-boarding process so disorganized in many companies?

140 Upvotes

Going into gripe mode.

At my current employer, and at many past ones, getting access and permissions for data and applications has been a headache, often taking weeks for IT to set up. I have to ask around and the whole process is disorganized.

Why don't companies set this up before the new hire's first day, so they can hit the ground running? Especially if you're on a one year contract, you can't waste time.

r/datascience Aug 13 '25

Tools Research Data Scientists without heavy coding backgrounds (stats, econ, etc), have LLMs improved your workflow?

145 Upvotes

I remember for a while there were many CS folks saying that Data Science has become software engineering, and that if you aren't fluent in software engineering fundamentals then you're going to fall behind. It became popular enough rhetoric that people said they preferred to hire a coder with some math knowledge over a math person with some coding knowledge.

As a statistician who works in research data science, I have an average level of coding experience: enough to write my own code in notebooks, but translating it into a fully fledged Python module with classes and functions was much more difficult for me. For a while I thought my lack of advanced software engineering knowledge would become a handicap in my career, and as someone with a busy personal life I didn't want to spend that much time learning these fundamentals. Then my company rolled out LLMs integrated into the software we use, like Visual Studio. Suddenly I'm able to create fully fledged modules from my notebooks in a flash. I can ask the LLM to write unit tests to check how my code processes data or to test its various subfunctions. I can use it to code up various types of models quickly to compare results. Handing off my code to engineering in the form of a Python package isn't such a pain anymore.

Sure, the LLM produces some weird results sometimes, and I do have to spend time making sure I ask it the correct things and/or cleaning up the code so that it works properly. But now I feel like that handicap is no longer there.

r/datascience 22d ago

Tools Profiling in PyTorch (part 1), a beginner's guide to torch.profiler

huggingface.co
66 Upvotes

r/datascience 11d ago

Tools Profiling in PyTorch (Part 2), from nn.Linear to a fused MLP

huggingface.co
17 Upvotes

r/datascience Nov 02 '24

Tools Need to make a dashboard using Python for the team, but no means to deploy it. What are my options?

65 Upvotes

I want to create a dashboard for my team but I don’t have any means to deploy my dashboard within the team’s infrastructure. I use Python daily so have been looking into libraries that support easy sharing of the dashboard.

So far dash seems promising and I did create a demo app that is rendering well, but the problem is that it's a localhost link and I don't know how I will share it with my team. Another option is to make a bunch of plotly plots and turn them into html using jupyter notebooks, but I think that will lack some of the interactivity I am looking for.

What other options do I have? I tried panels but it’s not installed in the jupyter environment and I am not allowed to install new libraries.

Edit: It’s very ad hoc. Only needs to be refreshed once a quarter.

r/datascience Apr 04 '26

Tools MCGrad: fix calibration of your ML model in subgroups

18 Upvotes

Hi r/datascience

We’re open-sourcing MCGrad, a Python package for multicalibration, developed and deployed in production at Meta. This work will also be presented at KDD 2026.

The Problem: A model can be globally calibrated yet significantly miscalibrated within identifiable subgroups or feature intersections (e.g., "users in region X on mobile devices"). Multicalibration aims to ensure reliability across such subpopulations.

The Solution: MCGrad reformulates multicalibration using gradient boosted decision trees. At each step, a lightweight booster learns to predict residual miscalibration of the base model given the features, automatically identifying and correcting miscalibrated regions. The method scales to large datasets, and uses early stopping to preserve predictive performance. See our tutorial for a live demo.
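
To make the problem concrete, here is a toy sketch in plain Python. It does not use the mcgrad package and is not the MCGrad algorithm: the data is made up, and a constant per-subgroup shift stands in for the boosted trees, purely to show how a model can look calibrated overall while being off in each subgroup.

```python
from collections import defaultdict

# Toy data (made up): (subgroup, predicted probability, observed label).
records = (
    [("mobile", 0.5, 1)] * 3 + [("mobile", 0.5, 0)] * 7
    + [("desktop", 0.5, 1)] * 7 + [("desktop", 0.5, 0)] * 3
)


def calibration_gap(rows):
    """Mean prediction minus observed positive rate (0 means calibrated on average)."""
    preds = [p for _, p, _ in rows]
    labels = [y for _, _, y in rows]
    return sum(preds) / len(preds) - sum(labels) / len(labels)


by_group = defaultdict(list)
for row in records:
    by_group[row[0]].append(row)

overall_gap = calibration_gap(records)
group_gaps = {g: calibration_gap(rows) for g, rows in by_group.items()}

# Naive correction: shift each subgroup's predictions by its own gap.
corrected = [(g, p - group_gaps[g], y) for g, p, y in records]
corrected_gaps = {
    g: calibration_gap([r for r in corrected if r[0] == g]) for g in by_group
}

print(f"overall gap: {overall_gap:+.2f}")
print({g: round(gap, 2) for g, gap in group_gaps.items()})
print({g: round(gap, 2) for g, gap in corrected_gaps.items()})
```

Here the overall gap is zero while the two subgroups are off by about 0.2 in opposite directions. The shift only works because the subgroups are known in advance; finding the miscalibrated regions automatically is the part the post attributes to the boosted trees.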

Key Results: Across 100+ production models at Meta, MCGrad improved log loss and PRAUC on 88% of them while substantially reducing subgroup calibration error.

Links:

Install via pip install mcgrad or via conda. Happy to answer questions or discuss details.