Seth is a recent graduate of Johns Hopkins University with a B.S. in Molecular and Cellular Biology, combining academic excellence with extensive research and community service experience. His undergraduate research focused on sex-specific differences in orofacial cleft risk using pediatric genomics data, and he has presented this work at several scientific conferences. Currently, Seth works as a Research Assistant at the Johns Hopkins Bloomberg School of Public Health, on the Data Integration and Resource Center team of the NIH’s Acute to Chronic Pain Signatures (A2CPS) Consortium, where he prepares and analyzes multimodal data to advance the field of pain science.

Disclaimer: This interview has been edited for clarity and flow. It is based on a recorded conversation but is not a word-for-word transcript. Some responses have been paraphrased or restructured to improve readability while preserving the intent and meaning of the discussion.



Let’s kick things off with a quick introduction! I know a lot of your background already, but feel free to introduce yourself however you'd like.


Sure! I’m Seth Berke. I did my undergrad at Johns Hopkins, where I discovered a passion for genomics and biomedical computing. My very first research project involved working on Cavatica, analyzing whole-genome sequences from the Gabriella Miller Kids First Pediatric Research Program. We were looking for sex-specific genetic loci related to orofacial cleft susceptibility in children.

That project got me into research and cloud-based computing. Since Kids First is part of the NIH Common Fund, I ended up connecting with the A2CPS project — that’s the Acute to Chronic Pain Signatures program — which studies why some people develop chronic pain after surgery while others don’t. That role broadened my experience into full multi-omics research, including proteomics and harmonization of diverse data types.

Now, I’m heading into a PhD program, having shifted fully from medicine into research. It's been a journey across data types and phenotypes — and I've learned a lot along the way.


That’s a great overview — and I appreciate how clearly you explain everything, even the acronyms! Let’s talk about your latest paper. What’s the focus?


This one’s really exciting for me — it’s tied to that very first research project I mentioned. The paper focuses on how to move custom analysis pipelines into the cloud. At the time, most tools available on the cloud were pre-installed, standardized software — things like BCFtools or SAMtools — and our lab was using a custom, in-house tool built for case-parent trio data. The problem was: the data lived in the cloud, but our pipeline ran locally on our cluster.

So, we had two options: either duplicate the data to our cluster (which isn’t ideal from a data stewardship standpoint), or learn to run our custom pipeline in the cloud. That became my mission during undergrad: to figure out how to bring specialized Bioconductor-based workflows into a cloud-based environment like Cavatica.
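
To make that concrete for readers who are new to this: the usual first step in making a custom pipeline cloud-compatible is packaging the tool and its dependencies into a Docker image that the platform can pull. The sketch below is purely illustrative and is not the pipeline from the paper; the base-image tag, package names, and script name are placeholders.

    # Minimal sketch (hypothetical): containerize a custom Bioconductor-based
    # script so that a cloud platform can pull and run it.
    FROM bioconductor/bioconductor_docker:RELEASE_3_18

    # Install the R/Bioconductor packages the pipeline depends on
    # (placeholder package names; substitute the real dependencies).
    RUN R -e "BiocManager::install(c('VariantAnnotation', 'GenomicRanges'))"

    # Bake the in-house analysis script into the image. The run command is
    # supplied later by the platform's tool description, not by the image.
    COPY run_trio_analysis.R /opt/run_trio_analysis.R

Keeping the run command out of the image and letting the platform’s tool description supply it tends to make the same container reusable across different workflows.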


That’s a challenge a lot of groups are facing. People often publish the results, but not the process behind getting there. I love that your paper digs into that.


Exactly — that’s what we wanted to highlight. The scientific results will be published in a separate paper. This one is really about the how: What are the foundational principles that someone with limited cloud experience needs to understand? How do you take a custom workflow and make it cloud-compatible?

We also discuss why this is important. For instance, dbGaP datasets receive thousands of access requests, often for the same data. Everyone downloads it separately and runs their own analysis on local clusters. That kind of duplication isn't sustainable. We think the cloud — if used thoughtfully and securely — is the next step in streamlining and democratizing this kind of work.


What were some of the biggest blockers in getting everything up and running?


Three big ones come to mind:

  1. The learning curve — just understanding how cloud computing works conceptually. It’s very different from local clusters.

  2. Metadata issues — the dataset we were working with relied heavily on pedigree information, but that wasn’t readily available on dbGaP or in the Cavatica project. We had to reach out to the original PI to get what we needed.

  3. Containerization — specifically Docker integration. Ensuring all the necessary dependencies were packaged correctly took some trial and error. (A sketch of what this looks like in practice follows below.)

Those challenges are what make documentation and onboarding so important.
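
For context on the Docker point above: on platforms built with Seven Bridges technology, CAVATICA among them, a containerized tool is typically exposed through a Common Workflow Language (CWL) description that points the platform at the image and declares the tool’s inputs and outputs. A minimal, hypothetical wrapper for a container like the one sketched earlier might look roughly like this (image name, script path, input names, and output file are placeholders):

    # Minimal, hypothetical CWL tool description: it tells the platform which
    # container image to pull and what the tool's inputs and outputs are.
    cwlVersion: v1.2
    class: CommandLineTool
    baseCommand: [Rscript, /opt/run_trio_analysis.R]
    requirements:
      DockerRequirement:
        dockerPull: myorg/trio-analysis:0.1   # placeholder image name
    inputs:
      vcf:
        type: File
        inputBinding:
          position: 1
      pedigree:
        type: File
        inputBinding:
          position: 2
    outputs:
      results:
        type: File
        outputBinding:
          glob: results.tsv   # whatever the script writes out

From there, the platform handles staging the project’s files into the container and collecting whatever matches the declared outputs.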


Absolutely. That’s what we hear all the time at conferences — metadata, interoperability, and containerization are always the pain points. Did you find any resources especially helpful?


Definitely the office hours and workshops. Hands-on support was far more helpful than just reading documentation. In retrospect, documentation is helpful — but it’s hard to start there if you’re new to cloud computing. Having a real project in mind was key. We weren’t just learning in a vacuum — we had an actual goal, and that helped everything click.


Totally agree — it’s like learning to code without a project. So much harder to stay motivated or even know where to begin. Any advice for other labs thinking about making the leap to the cloud?


Yeah, a few things:

  • Weigh the pros and cons based on the data you’re working with. If your datasets are large and shared across institutions, the cloud is probably the right move. But if you’re working with small, internal datasets, local clusters might still make sense.

  • Choose your platform wisely. There’s still a bit of vendor lock-in between cloud platforms like Seven Bridges and Terra. Each has its own terminology and workflow system, so the learning curve resets slightly if you switch.

  • Stick with one platform once you commit — it reduces overhead.

  • And finally: have fun with it. CAVATICA, for example, has a really intuitive UI once you get into it. Don’t be afraid to just play around — you’re not going to break anything.