Reddit DevOps
Reddit DevOps. #devops
Thanks @reddit2telegram and @r_channels
Have you managed to make a self-service EC2 portal?

What I mean is this: we have a lot of devs constantly testing things. We were a very small infra/DevOps team until lately and so most of them are used to having admin privileges and just kinda using the dev account as a playground.

The result, unsurprisingly, is dozens of instances with no clear owner, and we're unsure if it's even needed. Most surely aren't.

We're in the process of implementing tagging and other such identifiers, but an ambitious goal of ours is to allow people to spin up instances but with more "guardrails", and using Terraform.

I imagine this can be accomplished with a Flask frontend collecting variables, building a tfvars file, and then passing it into Terraform, or something similar. But of course that sounds difficult and hard to maintain, to say the least.
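For what it's worth, the tfvars-building step is the easy part. A minimal sketch of that idea (all names here are hypothetical: the allowed instance types, the variable names, and the tfvars filename are assumptions, not anything from a real portal):

```python
import json
import subprocess

# Whitelisted instance types the portal exposes; everything else stays fixed
# in the Terraform module. This is the "guardrail" part.
ALLOWED_INSTANCE_TYPES = {"t3.micro", "t3.small", "t3.medium"}

def build_tfvars(owner, instance_type, ttl_hours=24):
    """Validate portal input and render a tfvars JSON payload."""
    if instance_type not in ALLOWED_INSTANCE_TYPES:
        raise ValueError(f"instance type {instance_type!r} not allowed")
    return json.dumps({
        "owner_tag": owner,             # tagging enforced by construction
        "instance_type": instance_type,
        "ttl_hours": ttl_hours,         # a reaper job can terminate after expiry
    }, indent=2)

def apply(tfvars_json, workdir="."):
    """Write the tfvars file and hand off to Terraform (assumes terraform on PATH)."""
    with open(f"{workdir}/portal.auto.tfvars.json", "w") as f:
        f.write(tfvars_json)
    subprocess.run(["terraform", "apply", "-auto-approve"], cwd=workdir, check=True)
```

The real complexity is everything around this (state isolation per request, queuing, cleanup), which is why most teams end up reaching for an existing runner rather than hand-rolling the Flask part.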

I wanted to ask if any of you have done this successfully, or if you know of some good - ideally free - software that can do it. Or is it a fool's errand to try to wrangle unpredictable needs into a template like this? Is this taking "self-service" too far in the name of cutting down on technical debt? Will we just create more debt for ourselves at the end of the day?

https://redd.it/qhxtlc
@r_devops
The process of declining a git push.

I'm trying to understand the process of automatically declining a push that has failed CI tests.

Does the flow go like the following?

* someone attempts to push changes
* the remote branch sees the change but doesn't accept it yet, and routes it to a CI server
* the CI server runs tests and, based on the return code, decides whether to accept the push

If not, what happens after the push? What component in git doesn't accept the push yet, and how does it interact with the CI tests?

Thanks ahead!

https://redd.it/qhvh4p
@r_devops
How do I config php-fpm properly?

Ok, so I checked the Apache configs on the server where I can get websites running and the configs on the server where Varnish keeps returning 503 and 500, and I found they were the same. The only difference is php-fpm, but I can't think of a reason why that would be the case.

[root@webdev01 ~]# sudo netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address   State    PID/Program name
tcp        0      0 127.0.0.2:80            0.0.0.0:*         LISTEN   1679/varnishd
tcp        0      0 172.31.23.5:80          0.0.0.0:*         LISTEN   1644/nginx
tcp        0      0 127.0.0.1:80            0.0.0.0:*         LISTEN   1620/httpd
tcp        0      0 0.0.0.0:22              0.0.0.0:*         LISTEN   1177/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*         LISTEN   1439/master
tcp        0      0 172.31.23.5:443         0.0.0.0:*         LISTEN   1644/nginx
tcp        0      0 127.0.0.1:443           0.0.0.0:*         LISTEN   1620/httpd
tcp        0      0 127.0.0.1:6082          0.0.0.0:*         LISTEN   1678/varnishd
tcp        0      0 127.0.0.1:11211         0.0.0.0:*         LISTEN   1155/memcached
tcp        0      0 127.0.0.1:6379          0.0.0.0:*         LISTEN   1072/redis-server 1
tcp        0      0 :::22                   :::*              LISTEN   1177/sshd
tcp        0      0 :::3306                 :::*              LISTEN   1315/mysqld
[root@webdev01 ~]#

This is the server where it's working, and we don't see php-fpm.

[centos@staging script]$ sudo /usr/sbin/php-fpm
[28-Oct-2021 15:17:31] ERROR: An another FPM instance seems to already listen on /var/run/php-fpm/php5-fcgi-staging01.sock
[28-Oct-2021 15:17:31] ERROR: FPM initialization failed

So it's running on a Unix socket? But for some reason I don't see it listening on a TCP port? Are those different things?

[root@webdev01 ~]# sudo service php-fpm status
php-fpm (pid 1455) is running...

So it's running.

On the server where I can't get it running, I have:

[centos@staging03 script]$ sudo netstat -plnt
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address   State    PID/Program name
tcp        0      0 127.0.0.2:80            0.0.0.0:*         LISTEN   2624/varnishd
tcp        0      0 127.0.0.1:80            0.0.0.0:*         LISTEN   2580/httpd
tcp        0      0 172.31.22.60:80         0.0.0.0:*         LISTEN   1582/nginx
tcp        0      0 0.0.0.0:22              0.0.0.0:*         LISTEN   1290/sshd
tcp        0      0 127.0.0.1:25            0.0.0.0:*         LISTEN   1544/master
tcp        0      0 127.0.0.1:443           0.0.0.0:*         LISTEN   2580/httpd
tcp        0      0 127.0.0.1:6082          0.0.0.0:*         LISTEN   2623/varnishd
tcp        0      0 127.0.0.1:9000          0.0.0.0:*         LISTEN   3397/php-fpm
tcp        0      0 127.0.0.1:11211         0.0.0.0:*         LISTEN   1268/memcached
tcp        0      0 127.0.0.1:6379          0.0.0.0:*         LISTEN   1061/redis-server 1
tcp        0      0 :::22                   :::*              LISTEN   1290/sshd
tcp        0      0 :::3306                 :::*              LISTEN   1422/mysqld

I looked inside /etc/php-fpm.d and found this file:

[php5-fcgi-elvis]
listen = /var/run/php-fpm/php5-fcgi-elvis.sock
listen.allowed_clients = 127.0.0.1
user = elvis
;group = elvis
pm = dynamic
pm.max_children = 50
pm.start_servers = 14
pm.min_spare_servers = 14
pm.max_spare_servers = 25
pm.max_requests = 500
catch_workers_output = yes
request_slowlog_timeout = 8
slowlog = /var/log/php-fpm/www-slow.log
php_admin_value[error_log] = /var/log/php-fpm/www-error.log
php_admin_flag[log_errors] = on
php_value[session.save_handler] = files
php_value[session.save_path] = /var/lib/php/session
listen.owner = apache
listen.group = apache
listen.mode = 0666

And it's almost the same as the one on the faulty server:

[php5-fcgi-staging03]
listen = /var/run/php-fpm/php5-fcgi-staging03.sock
listen.allowed_clients = 127.0.0.1
user = staging03
;group = staging03
pm = dynamic
pm.max_children = 13
pm.start_servers = 4
pm.min_spare_servers = 4
pm.max_spare_servers = 7
pm.max_requests = 500
catch_workers_output = yes
request_slowlog_timeout = 8
slowlog = /var/log/php-fpm/www-slow.log
php_admin_value[error_log] = /var/log/php-fpm/www-error.log
php_admin_flag[log_errors] = on
php_value[session.save_handler] = files
php_value[session.save_path] = /var/lib/php/session
listen.owner = apache
listen.group = apache
listen.mode = 0666

However, I also found this www.conf file:

[www]
group = apache
listen = 127.0.0.1:9000
listen.allowed_clients = 127.0.0.1
pm = dynamic
pm.max_children = 50
pm.start_servers = 5
pm.min_spare_servers = 5
pm.max_spare_servers = 35
php_admin_value[error_log] = /var/log/php-fpm/www-error.log
php_admin_flag[log_errors] = on
php_value[session.save_handler] = files
php_value[session.save_path] = /var/lib/php/session
php_value[soap.wsdl_cache_dir] = /var/lib/php/wsdlcache

So would deleting this www.conf file solve the problem? I suspect there are additional steps; I just don't have the full picture of what I can check and what is actually wrong.

https://redd.it/qhxtin
@r_devops
DevOps Bootcamp

Hello!

Has anyone taken the bootcamp below?

https://www.techworld-with-nana.com/devops-bootcamp

I saw some of her videos on YouTube and she seems very knowledgeable. I was wondering if anyone could recommend her Bootcamp.

Thanks

https://redd.it/qi2w9z
@r_devops
Open-source Wireguard VPN automation with Wiretrustee

Hey folks,

I've been making a few posts about Wiretrustee on Reddit (mostly channels related to self-hosting), but for some reason never did it here :)

We got lots of positive feedback about the project from individuals who are self-hosting the solution or using the free managed version for private use cases (e.g. connecting RPis, building home networks, private Minecraft servers, etc.).

I'd love to hear your opinion about the project. Maybe you'd have some cool use cases or maybe point out something that is missing. I'm also curious about the VPN needs of small/medium IT/Engineering teams.

Your feedback will help to further develop the project.

Briefly, about Wiretrustee (the details can be found on GitHub):

Wiretrustee is an open-source VPN platform built on top of WireGuard® making it easy to create secure private networks for your organization or home.

It requires zero configuration effort leaving behind the hassle of opening ports, complex firewall rules, VPN gateways, and so forth.

Wiretrustee automates Wireguard-based networks, offering a management layer with:

* Centralized peer IP management with a UI dashboard.
* Automatic peer discovery and configuration.
* UDP hole punching to establish peer-to-peer connections behind NAT and firewalls, without a public static IP.
* Connection relay fallback in case a peer-to-peer connection is not possible.
* Open source; can be self-hosted.
* Works on Linux, Mac, Windows, and ARM devices.

Future plans:

* Multitenancy.
* DNS.
* Client application SSO with MFA.
* Access controls.
* Activity monitoring.

Let me know what you think. Thank you!

Disclaimer

I'm the author and contributor to the project.

https://redd.it/qi9hej
@r_devops
A 100-day plan to learn and upskill for job opportunities in DevOps (CI/CD tools, Docker, Kubernetes, Ansible, Cloud, Terraform, Grafana & more!)

Are you looking for a job in DevOps? Have you decided to upskill yourself and start looking for DevOps roles? I have created a study plan for you. Check it out and let me know if it is feasible for you.

Study 2 hours a day for the next 100 days. The main areas of focus are system administration, programming, DevOps tools, and cloud platforms. Most of the topics below are covered in video series. You will find these materials range from introductory to advanced, and they are better than books and paid lectures.

The breakdown is as follows:

System Administration : Focused on RHCSA/RHCE -- 15 mins per day

Programming: Learn enough for scripting on Python, Go, Ruby . -- 1 hour per day

DevOps Tools: Jenkins/GitLab, Docker,Ansible, Kubernetes, Terraform -- 20 mins per day

Cloud: AWS/Azure/GCP -- 15 mins per day

Monitoring: Prometheus, Splunk, Grafana -- 10 mins per day

If you are able, it would be wise to learn these 5 topics in parallel; or you can concentrate on one at a time, complete it, and then move on to the next.

Suggestions, feedback, criticism all are welcome.

Ask yourself: are you serious about becoming a DevOps engineer in 2021? If yes, then click the Subscribe button now and spend quality, consistent time developing your skills. Finally, good luck; well, no, it's not about luck, it's more about discipline...

https://redd.it/qid6ap
@r_devops
Move on or stay?

So, I recently got into a DevOps role at a company I've been with for 2 years (I'm the only DevOps engineer). I'd say I'm a strong 3rd-line/senior sysadmin/light dev. I'm in charge of two companies' platforms (which is odd, I know, but we're in that growing phase), both fully cloud-based, one with hilarious amounts of microservices/pipelines etc. which I build and maintain. As part of an overall strategy we want to implement more automation in our environments, which is great. So we got some outside consultancy from a DevOps group, and they are planned to come in and do the work.

My thing with each company I join is that I like going in at the point where they're growing very quickly and becoming quite large; I end up with way more responsibility and experience than I would have had at a giant company. This has worked great over the past couple of years, and I've reached the point where I would cash out again in terms of experience with the big DevOps push strategy. But I'm wondering, with the 3rd-party guys coming in, is there any point in me being there if they're going to do everything anyway?

https://redd.it/qicpqx
@r_devops
Using PowerShell to interact with REST APIs

I have a new post about using PowerShell to interact with REST APIs.

https://seehad.tech/2021/10/29/use-powershell-to-interact-with-rest-apis/

Crafting the API request relies on reviewing the (hopefully) well documented API body structure and requirements for using an access token and how to craft GET or POST methods.
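The same pattern holds outside PowerShell: an Authorization header carrying the access token, plus the right HTTP verb and a JSON body for POST. A rough standard-library Python sketch of assembling such a request (the endpoint and token below are placeholders, not from the post):

```python
import json
import urllib.request

def build_request(url, token, method="GET", body=None):
    """Assemble an authenticated REST request; pass the result to urlopen()."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(url, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if data is not None:
        req.add_header("Content-Type", "application/json")
    return req

# Usage (placeholder endpoint):
# with urllib.request.urlopen(build_request("https://example.com/api/items", token)) as resp:
#     items = json.load(resp)
```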

You can also interact with Azure using its API; here is the supporting documentation: https://docs.microsoft.com/en-us/rest/api/azure/

Besides Postman, what other visual API collaboration/testing tools are out there worth exploring?

https://redd.it/qif9lk
@r_devops
How do you manage server credentials and logins for 100s of servers/VPSs?

Our company develops products and hosts them on VPSs that we create. We have roughly 100-150 such clients at the moment, so we have 100-150 VPSs to manage (10-15 of them require active work). How can we manage this many VPSs efficiently? Currently I use WinSCP, store the credentials there, and log in when required. Is there a better, more efficient way to do this?
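One low-tech step up from storing credentials in WinSCP is key-based SSH plus a generated ssh_config, so that `ssh client042` just works and no passwords are stored at all. A sketch of generating the config from an inventory (the inventory format, aliases, and key path here are all made up for illustration):

```python
def render_ssh_config(inventory):
    """Render ssh_config Host blocks from an inventory of {alias: (host, user)}."""
    blocks = []
    for alias, (hostname, user) in sorted(inventory.items()):
        blocks.append(
            f"Host {alias}\n"
            f"    HostName {hostname}\n"
            f"    User {user}\n"
            f"    IdentityFile ~/.ssh/vps_ed25519\n"
        )
    return "\n".join(blocks)

# Example inventory; in practice this would come from your provisioning system.
inventory = {
    "client001": ("203.0.113.10", "deploy"),
    "client002": ("203.0.113.11", "deploy"),
}
print(render_ssh_config(inventory))
```

Beyond that, a bastion with short-lived SSH certificates (e.g. via HashiCorp Vault's SSH secrets engine or Teleport) scales better than static keys, since revoking access no longer means touching 150 machines.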

https://redd.it/qitc3n
@r_devops
Should password file be scalable?

When hosting an encrypted passwords file that the source code accesses to retrieve passwords/keys (either via HashiCorp Vault or a custom solution made in Python), should the file be hosted on a single server referenced by the code (while, of course, being monitored, audited, and backed up to a different server), or should it somehow be distributed across multiple places to avoid a heavy load on the file system?

It's hard for me to imagine that any code would have to read so much from a passwords file that it'd cause a problem on the filesystem.

(I have thought about an idea where the code caches the password, and only if the cached password fails does it read from the encrypted passwords file, but the question still remains.)
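That cache-then-fallback idea from the parenthetical can be sketched in a few lines; this is a hypothetical shape, not any particular library's API (the `fetch` and `validate` callables are assumptions standing in for "read the encrypted file" and "try the credential"):

```python
_cache = {}

def get_secret(name, fetch, validate):
    """Return a secret, preferring the in-memory cache; re-fetch only when
    the cached value no longer validates (e.g. after a rotation)."""
    if name in _cache and validate(_cache[name]):
        return _cache[name]
    value = fetch(name)      # the expensive read: encrypted file, Vault, etc.
    _cache[name] = value
    return value
```

With this pattern the backing store sees roughly one read per process per rotation, which supports the intuition that a single (monitored, backed-up) server is rarely a throughput bottleneck for secrets.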

Is there some best practice I'm missing?

Thanks ahead!

https://redd.it/qj2d57
@r_devops
Resources for learning Kafka

Do you know any good resources for learning Kafka as a DevOps engineer? Something covering the basics of configuring Kafka instances and how it works?

https://redd.it/qj2bgh
@r_devops
Elastic Cloud is really good for the price. My Team's Journey...



If you are a relatively small shop and you don't have a ton of traffic volume, I recommend looking into Elastic Cloud. Between the resource cost of managing our own Elasticsearch instances and the savings we got from centralizing logs + APM + infra metrics in one place, I found it extremely inexpensive for what you get.

Our Datadog pricing came out to about $2k/month all in (logs, metrics, APM) for just our AWS environment. It's $1k/month with Elastic Cloud, which covers twice as many hosts including our on-prem environment, because it's all resource-based. We were able to migrate our on-prem Elasticsearch and Prometheus instances to Elastic Cloud. New Relic would have been cheaper if we were really small, because they charge per user. In summary, we moved all of the following to Elastic Cloud for $1k/month:

1. Two self-hosted Elasticsearch instances (AWS & on-prem)
2. One Prometheus instance (replaced with Elasticsearch metrics via data streams & Elastic Agent)
3. Datadog for ~50 hosts, with infra monitoring on all, logging on some, and APM on most

I have a few complaints... If you don't start out with Elasticsearch experience, your journey is going to be a pain, and they don't hold your hand unless you pay a lot of money. Datadog makes it much easier, and their support is more responsive even if you are a small shop. Datadog also has a nicer UI in my opinion. Elastic Agent is also new, so you have to use Filebeat if you do *anything* non-standard with your logs. They also have very few integrations compared to Datadog/New Relic; we had to write our own webhook interface for some things, such as Opsgenie alerts.

https://redd.it/qj7fz9
@r_devops
Guide to secure a server/vps

What resources or guides would you suggest for a developer who needs to set up and secure a web server?

I have basically collected this much:

* SSH
  * use keys/certs
  * disable root login
  * change the port (contested)
  * fail2ban
* Accounts
  * principle of least privilege (use specific accounts only for what they're needed for)
  * don't run as root
* Firewall
  * only keep the minimal ports open (http, https, ssh) using ufw or iptables
  * SELinux or alternatives (advanced)
* Orchestration concerns (maybe not related to the title)
  * do it over a private subnet
  * use SSH even then
* Secrets management
  * don't store API keys or certs on disk if possible; load them into memory
  * use virtualization to isolate hosts in case the webservers are compromised

* Misc
  * take an inventory of running services and installed software
  * keep only what you need
* Logging/perf monitoring
  * email/Slack for realtime notifications
  * back up your logs in close to real time (in case of compromise, for example)
* Always update
* Secure your individual applications (nginx, db, node, etc.)
* Advanced
  * specific distros like Alpine or Void, or build your own
  * way smaller attack surface
  * musl libc
  * busybox

Cool references I found are:

* Linode/DigitalOcean documentation (basic)
* Arch Linux docs in general, specifically on security/hardening, and the docs of other distros
* A lot of guides in GitHub repos, but none are authoritative or guaranteed up to date

https://redd.it/qjc1jw
@r_devops
What's the best way to deal with config drift from GUI usage?

Azure's GUI is good. At least good enough that some devs (including me) simply _forget_ IaC exists and use the GUI to make the small modifications necessary for ops. Maybe a scale-up of a database here. Maybe changing some permissions there.
The friction of a new PR to the IaC seems to be so high that people just aren't keeping it updated. Fast forward one year and everything's out of whack and we can't replicate any environments.

The simplest solution to implement is a human-process level one, where we simply exhort everyone to update the IaC when they change something. Clearly that hasn't really worked.

The solution that might work better is a drift detector, and maybe auto-applying IaC so devs are forced to PR any changes to the code. But clearly, the devs don't enjoy applying changes to things using code (since they're human too, and everyone likes GUIs) and I'm looking for something better.

I'm thinking that the drift detector should detect changes and automatically open a pull request against the IaC codebase, for modification and acceptance by the owners, since they already made the changes in the GUI. Perhaps they copy-paste configs to some other envs and merge the PR.
If they reject the PR, the drift is corrected automatically. If they accept it, no further work is necessary by the maintainers, and they don't feel like the effort and time they spent updating things in the GUI is wasted.
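On the detection half: with Terraform, `terraform plan -detailed-exitcode` returns 0 when there are no changes, 2 when the plan has changes (i.e. drift), and 1 on error. A hedged sketch of the decision loop around that (the PR-opening call is a placeholder for whatever your forge's API provides):

```python
import subprocess

def check_drift(workdir):
    """Run `terraform plan -detailed-exitcode`: 0 = no changes,
    2 = drift detected, 1 = plan itself errored."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    return result.returncode

def handle(exit_code, open_pr):
    if exit_code == 0:
        return "clean"
    if exit_code == 2:
        open_pr()       # placeholder: push the observed config as a PR for the owners
        return "drift"
    return "error"      # exit code 1: plan failed; surface it instead of guessing
```

Run on a schedule, this gives you the "drift becomes a PR" workflow without anyone having to remember to check.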


I've looked at older posts like
- [https://www.reddit.com/r/devops/comments/cgcstz/show\_reddit\_configuration\_to\_automatically\_detect/](https://www.reddit.com/r/devops/comments/cgcstz/show_reddit_configuration_to_automatically_detect/): Not Azure, core reco is just not using the GUI. Not great UX IMO, see above.
- https://www.reddit.com/r/devops/comments/60n5qa/how_do_you_manage_configuration_drift/ - this one is too low level for me, but configuration management DB and drift detectors are a good idea.

Overall, UIs like the ones Pulumi or env0.com provide don't seem to be exactly this either. Env0 is close, but they seem to provide their own GUI for specific things instead of re-using current workflows.

Disclaimer - this might be a problem specific to Azure, where the GUI is good enough to use but Azure's IaC support is bad enough to prevent full usage of tools like Az-templates/TF/Pulumi.

https://redd.it/qjgft1
@r_devops
Atlantis with Azure DevOps Server

We are using the on-premise version of Azure DevOps Server 2020. I am having trouble getting Atlantis to authenticate with a git repo hosted on our Azure DevOps Server. I would appreciate any help you can offer.

The first challenge I had was that the on-premise version does not set the Request-ID header in the webhook that is sent to Atlantis. This was fairly easily resolved by running an instance of HAProxy in front and adding the header.

The second challenge was that there are a few hard-coded references to dev.azure.com which works for the cloud version of Azure DevOps, but not the self-hosted one. Thankfully this has been resolved in the Atlantis repo in the last couple weeks. It hasn't been released yet, but I was able to use a dev build of the container.

Now I'm stuck with the Git authentication. In the pull request I get an error that says 'fatal: authentication failed'. The comment from Atlantis says that it tried to run this command (personal info redacted):

git clone --branch dev --depth=1 --single-branch https://[username]:[token]@[our_on_prem_url]/[site_collection]/[project]/_git/[repo]

From the command line on my dev machine, that command also fails. I tried all sorts of combinations of username:password, username:token, username:base64-token, etc. All failed. I am able to get Git to authenticate when setting the authorization header like this:

git -c http.extraheader="AUTHORIZATION: Basic abcdefghi" clone --branch dev --depth=1 --single-branch https://[our_on_prem_url]/[site_collection]/[project]/_git/[repo]

From what I have read, this is because it is trying to use NTLM authentication when the basic authorization header is not set.
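For reference, the Basic value in that header is just base64 of `user:password`; with Azure DevOps PATs the username part is commonly left empty or set to any string, with the PAT as the password. A quick way to build the value (the PAT below is obviously a placeholder):

```python
import base64

def basic_auth_header(user, token):
    """Build the value for an 'Authorization: Basic ...' header."""
    return "Basic " + base64.b64encode(f"{user}:{token}".encode()).decode()

# e.g. git -c http.extraheader="AUTHORIZATION: <value>" clone ...
header = basic_auth_header("", "my-pat")
```

If the hand-built value matches what worked in the `http.extraheader` test, the remaining question is only why the credential embedded in the clone URL takes the NTLM path instead.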

Has anyone got Atlantis to work with the on-premise version of Azure DevOps Server? I have the webhooks and pull request commenting working, so I think this is my last hurdle before I can have Atlantis run Terraform.

I also had the same problem with ArgoCD. I got around that by using their SSH option for connecting to the repo and that has worked great so far.

https://redd.it/qj8omw
@r_devops
Humblebundle the ultimate DevOps bundle (books)

Hello everyone,

What do you think of the DevOps ultimate bundle?

25 books for approx. 15 dollars. Link to the bundle

Does anyone have experience with those books from Packt or can give an opinion/recommendation for a beginner?

Thank you.

https://redd.it/qikfk0
@r_devops
Consul HA structure

I'm trying to learn about service meshes with Consul, and I'm trying to understand the minimal setup needed for high-availability failover to work.

If I have two servers that run code, and two servers that host Vault (one active and the other standby), do I just install a Consul agent on each of the 4 servers with the logic "if the active Vault fails, go to standby"? Would this be enough for HA?

Or do I need additional servers on top of that, such as Consul servers that would handle all that logic? Like this

Huge thanks ahead!

https://redd.it/qjl2ju
@r_devops
Build with Github actions

Hello comrades, I was playing around with GitHub workflows for a few days and now have a real-world use case for them, but I'm not sure if it will be possible to achieve. In short, I have a multi-stage Dockerfile that I want to "translate" into GitHub Actions.
For example: I have a Scala app in /scala-app-dir that I want to build with sbt, then I want the built folder to be copied into the container; then I have an Elixir app that I also need to compile, copying the binary into the container. I want to use one reusable workflow for building and compiling, and one workflow for deployment which will then call the build one. So my question is: will I be able to use the output from the build workflow in the deploy one, specifically in the Dockerfile, where I want to do something like:


COPY /app-binary-from-build-workflow /app-dir

As the app binary will be created by another workflow (reusable).


I will be more than happy with some starting point at least, or maybe you have such experience.


Many thanks!

https://redd.it/qjkeus
@r_devops
Jenkins over TFS

I want to convince my team to use Jenkins Enterprise Edition instead of TFS; please share some good value points.

https://redd.it/qivlt7
@r_devops
Curious about configuration management tools, since I have learned only Ansible and am starting Terraform next week:

What makes Ansible different from other configuration management tools?

https://redd.it/qir4g5
@r_devops