Join us for our weekly series of short talks: nf-core/bytesize.

Just 15 minutes + questions, we focus on topics about using and developing nf-core pipelines. These are recorded and made available at https://nf-co.re, helping to build an archive of training material. Got an idea for a talk? Let us know on the #bytesize Slack channel!

In this week’s bytesize talk, Marcel Ribeiro-Dantas (@mribeirodantas) is talking about his efforts to translate nf-core/nextflow training material into other languages including Portuguese and Spanish.

Video transcription **Note: The content has been edited for reader-friendliness**

0:01 Hello everyone to today’s bytesize. I’m very glad that Marcel could make it and he is going to talk about his work on translating nf-core and Nextflow training material. The stage is yours.

0:18 Hello everyone. Thanks Franziska and everyone else for having me. It’s great to be here. It’s great to be able to contribute to the bytesize initiative. Today I’m going to talk mostly about translating training material, but I’m going to also mention other efforts related to localization, translating content on Nextflow and nf-core, right? My name is Marcel. I’m a developer advocate at Seqera Labs for Nextflow and nf-core. You can reach me on Twitter and on GitHub with mribeirodantas as my handle. The first thing that we’re going to talk about today is “what is localization?”, which means translating content so that people who speak different languages can understand it and contribute.

1:12 At some point, a lot of people reached out to me, Phil and Chris and other people asking for translated content, asking for an easier way for people who don’t feel so confident in English to be able to contribute and to follow nf-core and Nextflow. At some point it became really clear to us that we had to start working on that, to try to find something that could help these people. Make them more comfortable and confident to learn Nextflow, to be able to contribute, to discuss. It became clear that we had to do something, but then where to start? What should we translate? When? How? A lot of people started showing up to contribute during the mentorship. I think two or three people mentioned they wanted to help translate content and give talks in their native language, where they live.

2:08 Some people suggested translating the documentation. Some people suggested other things. A lot of ideas came up. We knew we had to do something, but then it was important to be smart about the way we would do that. I mean, we are not too many people. Even though a lot of people showed up to contribute, it would still be like 6, 7, 8, 10 people. Not a lot of people. We have a huge amount of documentation, a huge amount of content. We had to be smart about how to use these people to be able to translate content. The original idea of translating documentation was risky because the documentation is not only very large, it also changes very often. Even if we managed to translate everything, like all the documentation of Nextflow and nf-core, the chance that it would soon be outdated was very high. In a few days, probably, the documentation would already be different in English. We would have to translate it again into the other language. Even though we have some paid contributors like me or Chris and Phil… For Portuguese, it would be fine, because it could be watched very closely. But if you’re talking about Hindi or some other language in which nobody on the team is fluent, it would be very hard to make sure it would be updated. Having outdated material was a serious risk because it could give a totally different impression to the community that maybe we don’t value the translation, we don’t value the organization, we don’t care about them. It was very risky to translate something that would be good now, but then it would be very bad in a few days or weeks or months.

3:55 Where to start? The idea that came up was to translate training material because they’re much more stable, they don’t change as often, and they’re not that long. Also they’re easier to translate because they are informal language, you’re explaining something. Sometimes the documentation is very straight, and unless you know very much about the technology you don’t feel so confident about changing that. For the training material, it’s more welcoming for newcomers to contribute, it doesn’t change as often, and it’s not as long as the documentation. In the end our cost-benefit analysis convinced us that indeed the training material was the best place to start with the translation.

4:39 At the beginning, the main training material that we had in general was the training.seqera.io website. It was originally written and maintained by Seqera Labs. It used AsciiDoc, which was okay, but there were several limitations and issues we wanted to overcome, which was very tricky with AsciiDoc. At some point, very recently, someone showed us Material for MkDocs and suggested that we should use Markdown, which we were already using in a lot of parts of the community, and he made a huge effort to convert everything to Material for MkDocs. Then other people joined, and we were able to convert the whole training content to this new technology that was much easier to change, much more powerful, and much more beautiful. Now that we’ve been using Material for MkDocs, we decided to turn the training material into more of a community effort. It’s now hosted at training.nextflow.io. Even though the original content was created by Seqera Labs, it was donated to the community, it belongs to the community, it’s maintained by the community, and it’s much easier and much faster for the community to contribute.

6:06 We have indeed a lot of people that contributed compared to the past, and by having this new format and this new domain with new people, with new technology, it was exactly the time where this opportunity for localization came up, because a lot of people wanted to contribute around the same time we were doing this conversion. Then we were able to find this plugin called mkdocs-static-i18n (i18n stands for internationalization), which, once it’s set up, makes it very easy to maintain and contribute different versions of the content in other languages. That’s how we have now the training material. It’s hosted at training.nextflow.io, it’s very similar to the original content that we had before in training.seqera.io, but it’s more beautiful, it’s better organized, and it’s been improved a lot in the past two or three months. Of course, now we have it in other languages also.

7:04 Once we set it up, we were able to write this TRANSLATING.md file in the GitHub repository, explaining how you can contribute, the contributing model that we have. It’s obvious for people who are already experienced with GitHub, but for people who are not, it’s very important to have it very clear. After all, a lot of times people that contribute to content, to translation, to documentation, are not very skilled with GitHub. We made sure that it was very clear how to do that in the TRANSLATING.md file. You make a fork of the GitHub repository of the training material, you work on your local copy of your fork, and you check the mkdocs.yaml file to see if the language that you want to contribute is already set up there. It’s very easy, you will see that soon. Then you go to the original file, which is in English, and you just duplicate it, adding the language code of your language, which you can check in the link that is shown in this TRANSLATING.md. Once you’ve done that, you just change the new file, filename.pt.md, for example. You keep changing it, and in the end you just open a pull request with that change. Same thing for image files or any other type of file. You just add the language code and everything else is going to be taken care of by this plugin. You’re free to tag all the contributors with the at symbol (@) to request a review. Pull requests need at least one review before they are merged. There’s an important note at the end, I’m going to talk about it in the next slide so it’s clear to you. Let’s just skip that for now.
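The naming rule described above (duplicate the English file and add the language code before the extension) can be sketched in a few lines of Python. This is only an illustration of the convention mentioned in the talk, not code from the training repository; the function name `translated_path` and the example paths under `docs/` are hypothetical.

```python
from pathlib import PurePosixPath


def translated_path(original: str, lang_code: str) -> str:
    """Return the path a translated copy of `original` would have.

    Follows the suffix convention from the talk:
    debugging.md -> debugging.pt.md (the same rule is used for images).
    """
    path = PurePosixPath(original)
    # Insert the language code between the file stem and its extension.
    return str(path.with_name(f"{path.stem}.{lang_code}{path.suffix}"))


# Hypothetical paths, used only to show the naming rule.
print(translated_path("docs/debugging.md", "pt"))
print(translated_path("docs/img/overview.png", "es"))
```

In practice you would copy the English file to the returned path and then translate its content in place, as described in TRANSLATING.md.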

8:56 Here’s one example of the first contribution that was done in Spanish. As you can see in the bottom here, the mkdocs.yaml file, it’s just a matter of… the image is a bit larger than the screen… you put your language code at some point below languages, you give a name to your language, and the code is here. That’s what you do, and you create a file with the language code in its name, and you can just put the content at the place where the English content was.

9:31 This is the graph of contribution that we had for the whole history of the training material. It’s very old, right, so it goes even, I think, over two years ago. At some point there were some small changes, as you can see here, September in 2021, and so on. In around April, 2022, you see there was some change here. I think this was right when Chris was hired, Chris Hakkaart. One of the first things he did… no, maybe I think it was around… I think Chris joined in June. In July and June, Chris and I started working on that, mostly Chris, because I hadn’t joined Seqera yet. I was doing some contributions as part of your effort. Then in October, I was already in Seqera, I joined at the end of July, I started contributing a lot. There was a lot of contribution around the end of the year, close to the training that we had. We had an nf-core training in October 2022. It was a lot of work to make sure that the training material was better at this time. Sorry, sorry, sorry. I was in the wrong year. It was here. It makes more sense now. Here, you see in July, there were a lot of contributions, but this was mostly Chris, then I. I think the other Chris also contributed a bit, and Phil also. But you see this huge help here, this huge amount of contributions was mostly by one, two, three people maybe. It was really like a lot of stuff that was easy to fix, like typos, and very few people working on that. Even though it looks like a lot of contributions, for the community size, it’s not a lot, because it was employees of Seqera, it was Seqera content, low-hanging fruit, let’s say. Even though it looks nice, we had lots of contributions here, it was just paid people doing that, very few people, and so on. It was October before training, and during training, and then we have this year now. You see, it’s a lot of contributions for a longer period, and a lot of people involved, and most people were volunteers.

11:50 But this is the thing I like, this is the real community appearing, in my opinion. As you can see now, we have over 20 contributors, over 20 people who have contributed to this repository, which is training material. We have people who have worked or work at Seqera, but we have lots of people who have no ties to Seqera, people like Yara from Brazil, João from Brazil. You have, I forgot the name of this researcher, he’s from South Africa, but you can see the other contributors, you have people from all over the world. You have Anabella from Argentina, you have Pablo from Spain, you have people from all over the world that contributed in this period to improve the material, but also through this translation. I really like this chart, because it shows that for a long time we’ve been working on the training material, but ever since it became a community material, and we opened up this opportunity for localization, a lot of people came to help. It’s just amazing to see how many people have been using this training material in other languages, and commented, and shared, and did local trainings, and all these things. It’s very nice to see that it’s being used. It’s very nice not only that it brought a lot of people to work together, but also that a lot of people are using it and enjoying it.

13:11 The current status that we have is that the English content is of very nice quality, because it’s been improved a lot, a lot of people have delved into the English content to make sure everything is working. Multiple trainings have occurred, so everything is working, everything is fine. We were able to translate the whole training material to Portuguese, and it’s up to date. It’s translated, and it matches the English version 100%. For the French version, it’s about 30% translated, and the part that is translated is up to date. In Spanish, it’s about 2% and up to date, but actually there are some pull requests that are being reviewed, so the number is actually higher.

13:52 How do we know that? 100% is easy, right, because it’s everything. But how do we know the 2%, the 30%, how do we track that? So there’s a technology called GitLocalize that’s very interesting. GitLocalize watches a GitHub repository, it watches the files that match a glob that you define, and it pays attention to when the translated files are updated and when the original files are updated. For example, you have debugging.md for the debugging section of the training material, then you have debugging.pt.md for the Portuguese translation, debugging.fr.md for the French translation, debugging.es.md for the Spanish translation, and so on. By watching that very closely, every time you change the original English file and the equivalent file in the other languages doesn’t change, GitLocalize will show that the translation is incomplete. Now it’s only 99% or 98% translated, and it watches what are called segments, so that very small parts of the file are tracked to make sure that you know where it was changed. By using that, we can watch very closely these tables for every language, we can have editors and moderators for every language, it’s a very powerful tool and it’s free in the plan that we have. The way they make money is by providing this bridge between software developers and professional translators. In this case, we translated it ourselves, but if we needed translation services, that’s how they make money.
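To make the tracking idea concrete, here is a minimal Python sketch that computes a rough, file-level coverage figure from the naming convention above. It is not how GitLocalize works internally: GitLocalize tracks segments inside each file and also notices when the English source changes, whereas this sketch only checks whether a translated sibling file exists. The function names and the file list are hypothetical.

```python
from pathlib import PurePosixPath


def translation_status(files: list[str], lang_code: str) -> dict[str, bool]:
    """Map each English Markdown file to whether a translated sibling exists."""
    present = set(files)
    status = {}
    for name in files:
        path = PurePosixPath(name)
        # Skip non-Markdown files and files that are themselves translations,
        # e.g. debugging.pt.md, whose stem "debugging.pt" still contains a dot.
        if path.suffix != ".md" or "." in path.stem:
            continue
        sibling = path.with_name(f"{path.stem}.{lang_code}.md")
        status[name] = str(sibling) in present
    return status


def coverage_percent(status: dict[str, bool]) -> float:
    """Percentage of English files that have a translated sibling."""
    if not status:
        return 0.0
    return 100.0 * sum(status.values()) / len(status)


# Hypothetical repository listing, for illustration only.
files = [
    "docs/index.md", "docs/index.pt.md", "docs/index.fr.md",
    "docs/debugging.md", "docs/debugging.pt.md", "docs/debugging.fr.md",
    "docs/setup.md", "docs/setup.pt.md",
    "docs/containers.md", "docs/containers.pt.md",
]
for lang in ("pt", "fr", "es"):
    print(lang, coverage_percent(translation_status(files, lang)))
```

A file-level count like this overstates coverage when a translated file exists but is outdated, which is exactly the gap that segment-level tracking is meant to close.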

15:34 We have an issue right now with GitLocalize because of the way that it was originally designed: you use the platform to translate, it has an integration with GitHub, and then you open a pull request from the GitLocalize platform, and it’s opened in the GitHub repository. We don’t do that because it breaks the Markdown code for Material for MkDocs that we’re using, so we have this limitation. We have been talking to them to make sure this is going to work better in the near future. They are very responsive, we are working on that, but so far because of that, sometimes we have some weird things like the 101% here and saying that it’s incomplete, even though it’s not. It’s a very good tool to watch and manage this translation effort, but it’s still not perfect for our needs, so we’ve been in close contact with them to make sure it’s going to be more tailored for our situation, and I believe it’s going to be soon because they’re very responsive, they really want to help, it’s very nice.

16:32 One thing we also did was to create a project board on GitHub, to make sure it was easy to manage everything. As you can imagine, in the weeks before the training, we had so many different things we wanted to change and improve and test, and then we had people that wanted to contribute with Spanish and French and Portuguese and fix things in English. I was in charge of this initiative for translation, so I made sure that everything was organized and that it was easy to find and understand who was doing what. I created this project board to be able to organize everything, and it was a very good initiative. If you haven’t used GitHub project boards before, I think it’s a good thing to use to make sure that you know what’s ready to review, what’s been done, using tags and labels, and who’s assigned to what, so I think it was a good experience.

17:24 It was not only written material. We had the nf-core training in March now, the last one, and for the first time ever, we had a training delivered in multiple languages at the same time. We had English, as always, that was done by Chris Hakkaart. We had a Portuguese version that was done by me. We had a Spanish version that was done by Júlia Mir Pedrol and Gisela Gabernet. We had the French version given by Maxime Garcia, and the Hindi version given by Abhinav Sharma. This is not only interesting and very amazing, in my opinion, because it was done in multiple languages, but also this was the most complete training that nf-core has ever done. Basically, the full training material right now, which is very long, we gave it from the very beginning to the end, plus a significant part of the nf-core documentation. It was four days, with a two-hour session each day, very complete, in a very nice rhythm in my opinion, mostly because it’s on YouTube, so you can just change the speed or pause if you want. In my opinion, it was a very nice training, very nice content. All these people did an amazing job, and it was very nice to watch some of their talks. I was very happy at the end that we were able to conduct this effort to have it in multiple languages, and I already saw that a lot of people were very happy about it, using it. Obviously, most people are going to watch in English, be they native speakers of English or not, just because it’s more, I mean, you just go to the English, you don’t even know they’re in other languages. But some people watch in the other languages, it was clear that a lot of people were able to benefit from the training, and otherwise they wouldn’t be because they are not so confident in English, they are shy, they don’t feel confident, they just would rather watch it in a different language. Now we have it.

19:24 It’s not all the languages, of course, but a lot of languages that are very common around our community. We also had another effort, which was the regional Latin America channel on the Nextflow and nf-core Slack. A lot of people from Latin America had requested, for a long time, a place to be able to speak in Portuguese and Spanish, and to try to grow the community. As you probably know, in North America and Europe, the Nextflow and nf-core community is huge, it’s very large. But if you go to other places, like Middle East, Africa, Asia and Pacific and Latin America and the Caribbean, the community is much smaller. The Latin American community specifically, it was growing on Slack compared to the other ones, but people were still a bit disconnected and some people started to ask for a channel where they would be more comfortable to speak in Portuguese or Spanish or to ask more beginner questions, let’s say, and at some point we got some requests, we decided, okay, let’s try this out. Now we have this channel, which is shared between the two Slacks. You just have to be in one of them to be able to participate in this channel and comment and read content. We have seen a lot of initiatives there. I mean, the translation initiative mostly started there. It was very nice to see this happening. That’s it. I welcome any question you may have about nf-core or anything else. You can contribute in this first link, which is the GitHub repository of the training material. You’re welcome to join the Nextflow or the nf-core Slacks through these next two links, and I’m also open for any type of contact by email. Thanks for having me and thanks for paying attention.

21:12 (host) Thank you very much. Really thank you for this impressive work. I think this will help a lot to get some people in the nf-core boat that would otherwise not have dared.

(speaker) Definitely. Yeah.

(host) Do we have any questions from the audience? It doesn’t seem to be the case, then I would like to thank you again, and I would like to thank the audience for listening in. As usual, I also would like to thank the Chan Zuckerberg Initiative for funding our talks and also actually these translation efforts are partly funded by the Chan Zuckerberg Initiative. Thank you very much.

(speaker) Thank you. Have a nice day, everyone.