Original author: Bilal Aslam
Compiled by: Li Zhe (李哲)
Key point: Bilal Aslam, co-founder and Chief Product Officer of Appuri, shares the painful experience he and his team had with Amazon ECS.
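At Appuri, our ETL pipeline, API and UI are all made up of a large number of small, single-purpose services. We started with large monolithic repos and gradually moved to a microservices pattern, not out of any philosophical bias but because it suited the way we work. There are pros and cons, but by and large microservices have worked well for us. We are not here to debate microservices, though. We are here to describe our painful experience with Amazon EC2 Elastic Container Service (ECS) and how we pulled back from the brink by moving to Kubernetes.
Disclaimer: overall, we really like AWS products. Also, each company's experience with ECS differs. Segment, for example, had a very pleasant experience with ECS and none of our complaints.
I am very fond of managed services. For example, we do not run our own Postgres server; we use Amazon RDS. We do not run our own hypervisor or bare-metal servers either; we use Amazon EC2. Ideally, you buy managed services from a provider so that you can focus on building more differentiated added value, and everyone wins. In fact, we have had exactly this experience with many managed service providers.
In June 2015, we began looking into a PaaS on which to deploy our services. I wanted a Docker-based managed service that still left us a degree of control. As an AWS customer, we considered Amazon Elastic Beanstalk and the brand-new Amazon EC2 ECS.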
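The advantages of Amazon ECS were:
- Docker containers can be launched quickly and easily.
- ECS is aware of multiple Availability Zones.
- It supports rolling deploys, which gives true zero-downtime deployment.
- API clients. All AWS services have API clients for every language we use.
- ECS works with clusters of EC2 instances. We therefore do not have to learn a new PaaS; we only need to install the ECS agent on any EC2 instance running Amazon Linux and join it to an ECS cluster.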
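First impression
Our first impression on seeing an ECS demo was that it lacked many key features:
- No service discovery. In ECS, the substitute for service discovery is the built-in load balancers. This is the only way to run a network-accessible service in ECS: even with a single instance you must run an ELB. For a microservice architecture, this adds cost with every service you deploy.
- No central config. ECS has no way to pass configuration to services (i.e. Docker containers) other than environment variables. So how do I pass the same environment variable to every service? Only by copying and pasting.
- Mediocre CLI. Compared with competitors such as Kubernetes, the ECS CLI is mediocre. You can scale from the command line (aws ecs update-service --desired-count N), but the ECS CLI is not very powerful.
Despite all these missing core features, we chose to go ahead with ECS.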
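The moment we regretted it
The moment of regret came when we discovered that environment variables were being leaked to CloudTrail, and on to other third-party services that consume CloudTrail events and logs.
We posted on the forum, and the ECS team's reply missed the point. Apparently they do not consider environment variables to be sensitive information.
We could have built more infrastructure to encrypt secrets with Amazon Key Management Service (KMS) and decrypt them at service start. In fact, this is exactly what Convox does. But with so much interesting work to do in our own domain, why would we build that infrastructure?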
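The moment that broke us
During nearly a year of using ECS, we followed every feature release, actively opened GitHub issues, and so on. In the end, we still gave up on ECS for the following reasons:
- The ECS agent disconnects frequently, which makes it impossible to launch new containers. ECS installs an agent on every EC2 instance to interact with the Amazon API and with Docker. This agent disconnects often, and deployments then fail, which is fatal for our service deployments. Although the corresponding issue has been closed, the problem keeps happening. On our clusters it occurs at least twice a day. Despite our best efforts, we have not been able to find the root cause. To my knowledge, the ECS team has still not resolved it.
Our Slack search results capture only a small fraction of the reports of this problem. It occurred so often that we had to restart the agents regularly to work around it.
When you have to restart a service every hour to fix its bugs, you are bound to go crazy.
- Lack of attention to GitHub issues. Many feature and customer requests in the GitHub issues have received no attention from Amazon ECS.
- Bad architecture. ECS lacks many of the fundamentals that modern deployment and operations infrastructure requires.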
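Goodbye ECS, hello Kubernetes
After much grumbling about ECS, we decided to try Kubernetes (k8s). After two weeks with it, we are very satisfied. This open source project is well suited to deployments and operations at scale. Its CLI, its service discovery and its configuration management are all a pleasure to use. We did run into an odd issue where kube-proxy did not route traffic correctly, but a restart fixed it and it has not come back. So far, we have not regretted our choice.
Original English text: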
Here at Appuri, we have a large number of small, single-purpose services that make up our ETL pipeline, API and UI. We started from large, monolithic repos and gradually migrated to this microservices pattern, not because of any philosophical bias but because it fit our work style. By and large, this has worked well with all the known pros and cons of microservices. But I’m not here to debate microservices. I’m here to tell you about our nightmare on Amazon EC2 Elastic Container Service (ECS) and how we saved ourselves by moving to Kubernetes.
NOTE: In general, we love AWS. Also, your mileage with ECS may vary. For example, Segment had a great experience with ECS and apparently none of our complaints.
There’s also the wonderful Convox project which contains a lot of great workflows on top of ECS. When we started using ECS, Convox wasn’t far enough along to meet our needs.
And so, it begins, with a love of managed services
I love managed services. For example, we don’t run our own Postgres server – we use Amazon RDS. We also don’t run our own hypervisor or bare metal servers, we use Amazon EC2. With managed services, you trade control for peace of mind and, in an ideal world, you can focus on building differentiated value add. Everyone wins. In fact, we have had exactly this experience with most managed services.
In June 2015, we started looking into a PaaS where we could deploy our services. I wanted to stay close to Docker, but maintain a degree of control. As an AWS customer, we considered Amazon Elastic Beanstalk and the shiny new Amazon EC2 Elastic Container Service (ECS).
Amazon ECS fit the bill because of several promises:
- With ECS, you simply launch Docker containers.
- ECS is aware of multiple availability zones (AZs). As long as EC2 instances are set up in multiple AZs, ECS will try to distribute containers to maintain high availability.
- You can do rolling deploys. Neato, deployments with zero downtime!
- API clients. All AWS services have (sadly auto-generated) API clients for all languages we use.
- ECS works with vanilla EC2 instances. This is a nice plus, as we don’t have to learn a new PaaS – just install the ECS agent on any plain old EC2 instance running Amazon Linux and have it join an ECS cluster.
First impression: wow, it’s missing a LOT of stuff.
My first impression on seeing an ECS demo was how much it was missing. We use a lot of AWS services and are well aware of how Amazon releases incremental updates. That’s all good, we do that, too. However, it was sad to see that these key features were missing:
- No service discovery. In ECS, the recommended way to do service discovery is to use internal load balancers. This is a bigger deal than it sounds, because an internal ELB is the only way to run a network-accessible service in ECS: even with a single instance you HAVE to run an ELB for the service to be discoverable. For a microservice architecture, this adds cost with every service you deploy, even though you add no hardware.
- No central config. ECS doesn’t have a way to pass configuration to services (i.e. Docker containers) other than with environment variables. Great, how do I pass the same environment variable to every service? Copy and paste it. We considered setting up Consul, but instead decided to stick with native ECS environment variables to start using the service.
- Mediocre CLI. Compared to competitors like Kubernetes, ECS has a mediocre CLI at best. You can scale from the command line (aws ecs update-service --desired-count N) but the ECS CLI is just not very powerful.
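To make the "copy and paste" complaint about central config concrete, here is a minimal sketch of one way to avoid it: merging a shared set of variables into each container definition before registering it. The `environment` list of name/value pairs follows the ECS container definition format; the service names, images, variable names and the `with_shared_env` helper are all hypothetical, and this is not something the original post describes doing.

```python
from copy import deepcopy

# Hypothetical settings that every service should receive.
SHARED_ENV = {"LOG_LEVEL": "info", "STATSD_HOST": "statsd.internal"}


def with_shared_env(container_def, shared_env):
    """Return a copy of a container definition with shared variables merged in.

    Variables already set on the container take precedence over the shared ones.
    """
    merged = dict(shared_env)
    for pair in container_def.get("environment", []):
        merged[pair["name"]] = pair["value"]
    result = deepcopy(container_def)
    # ECS container definitions carry the environment as name/value pairs.
    result["environment"] = [
        {"name": name, "value": value} for name, value in sorted(merged.items())
    ]
    return result


# Hypothetical container definitions for two services.
api = {
    "name": "api",
    "image": "example/api:1",
    "environment": [{"name": "LOG_LEVEL", "value": "debug"}],
}
worker = {"name": "worker", "image": "example/worker:1"}

for definition in (api, worker):
    print(with_shared_env(definition, SHARED_ENV)["environment"])
```

This only moves the duplication into a build step; the values still end up as plain environment variables on every service.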
Despite these missing features, we decided to move ahead.
I have made a huge mistake
Our first “oh crap” moment with ECS in production was when we noticed that it was leaking environment variables to CloudTrail, and on to DataDog and other third party services that consume CloudTrail events and logs. ECS, like a good AWS citizen, logs events to CloudTrail. When you start a new service, it logs the service definition including environment variables to CloudTrail!
We opened a forum post and the response from the team wasn’t on target. Apparently they don’t believe in treating environment variables as sensitive quantities.
Now, we could have built yet more infrastructure to encrypt secrets using Amazon Key Management Service (KMS) and decrypt them at service start – in fact, this is exactly what Convox does. But why would we build this infrastructure when there was so much more interesting work in our domain to do?
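For readers wondering what "decrypt them at service start" might look like, the following is a rough sketch under stated assumptions, not a description of how Convox or Appuri implemented it. It assumes encrypted values are stored base64-encoded in environment variables behind a made-up `kms:` prefix. The `resolve_secrets` helper is hypothetical; the only real API used is boto3's KMS `decrypt` call, which is imported lazily because it needs boto3 and AWS credentials at runtime.

```python
import base64

PREFIX = "kms:"  # hypothetical marker for encrypted values


def resolve_secrets(environ, decrypt):
    """Return a copy of environ with every 'kms:'-prefixed value decrypted.

    `decrypt` takes ciphertext bytes and returns plaintext bytes.
    """
    resolved = {}
    for name, value in environ.items():
        if value.startswith(PREFIX):
            ciphertext = base64.b64decode(value[len(PREFIX):])
            resolved[name] = decrypt(ciphertext).decode("utf-8")
        else:
            resolved[name] = value
    return resolved


def kms_decrypt(ciphertext):
    """Decrypt with AWS KMS; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    response = boto3.client("kms").decrypt(CiphertextBlob=ciphertext)
    return response["Plaintext"]
```

A service would call something like `resolve_secrets(os.environ, kms_decrypt)` once at startup. With this arrangement only ciphertext appears in the service definition, so that is all a log of the definition would contain.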
What killed ECS for us
We ran ECS in production for nearly a year. In that time, we watched every single feature announcement, participated in opening GitHub issues and so on. Finally, we gave up on ECS when these issues remained unaddressed:
- ECS agent disconnects periodically, making it impossible to launch new containers. Recall that ECS works by installing an agent on every EC2 instance that’s part of an ECS cluster. This agent interacts with the Amazon API as well as Docker. This agent has a horrible tendency to disconnect, and when this happens your deployments will fail – this kills your services. This problem is tracked in this GitHub issue and despite it being a closed issue, we have seen it happen repeatedly. It happens at least twice a day on our clusters and despite our best efforts, we haven’t been able to nail the root cause. To my knowledge, it remains unaddressed by the ECS team.
Our Slack search results show just some of the times we’ve seen this problem happen. This problem became so pervasive that we started restarting agents periodically to get around the failure.
You know you’re going crazy when you restart a service every hour to fix its bugs.
- Lack of traction on GitHub issues. This issue is an example of how many features and customer requests remain unaddressed. This issue is the most commented feature for a year and remains unaddressed. Incidentally, we hit this issue as well.
- Bad architecture. I expect modern deployment and operations infrastructure to support 12 factor apps in a meaningful, robust way. ECS simply lacks the fundamentals.
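As an illustration of the agent-disconnect problem in the first item above, here is a small sketch that picks out instances whose agent is reported as disconnected, so that only those agents need a restart. It assumes a response shaped like the output of the ECS DescribeContainerInstances call, whose `containerInstances` entries include `ec2InstanceId` and `agentConnected`. The instance IDs and the `disconnected_instances` helper are made up, and the post does not say how the periodic restarts were actually implemented.

```python
def disconnected_instances(response):
    """Return the EC2 instance IDs whose ECS agent is not connected."""
    return [
        instance["ec2InstanceId"]
        for instance in response.get("containerInstances", [])
        if not instance.get("agentConnected", False)
    ]


# Hypothetical, trimmed-down DescribeContainerInstances response.
sample_response = {
    "containerInstances": [
        {"ec2InstanceId": "i-0aaa", "agentConnected": True},
        {"ec2InstanceId": "i-0bbb", "agentConnected": False},
    ]
}

print(disconnected_instances(sample_response))
```

A check like this could run from a scheduled job and trigger a restart only where needed, rather than restarting every agent on a timer.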
Adios ECS, hello Kubernetes
After much grumbling at ECS, we decided to try out Kubernetes (k8s). Having flipped the switch in production two weeks ago, we are delighted. It seems that the contributors to this open source project really thought through deployments and operations at scale. From the CLI to service discovery and configuration management, it has been a pleasure to use. We ran into an odd issue with kube-proxy not routing traffic correctly, but a restart fixed the issue and it hasn’t cropped up since. We haven’t looked back!
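As a rough illustration of the service discovery mentioned above, the sketch below reads the environment variables that Kubernetes injects into a pod for each Service, of the form `<NAME>_SERVICE_HOST` and `<NAME>_SERVICE_PORT`. The service name `user-api`, the address values and the `service_address` helper are hypothetical; the post does not say whether Appuri used these variables or cluster DNS.

```python
import os


def service_address(service_name, environ=os.environ):
    """Look up a Kubernetes Service through its injected environment variables.

    A Service named "user-api" is exposed as USER_API_SERVICE_HOST and
    USER_API_SERVICE_PORT in pods started after the Service was created.
    """
    prefix = service_name.upper().replace("-", "_")
    host = environ.get(prefix + "_SERVICE_HOST")
    port = environ.get(prefix + "_SERVICE_PORT")
    if host is None or port is None:
        raise LookupError("no environment variables found for " + service_name)
    return host, int(port)


# Hypothetical environment, standing in for what a pod would see.
fake_environ = {
    "USER_API_SERVICE_HOST": "10.0.0.12",
    "USER_API_SERVICE_PORT": "8080",
}
print(service_address("user-api", fake_environ))
```

No per-service load balancer is needed for this lookup, which is the contrast with the ELB-based approach described earlier.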