EC2 Instance Lifecycle and Troubleshooting


Notes on the EC2 Instance Lifecycle and commonly encountered Troubleshooting scenarios.

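Terminology

  • Instance store-backed: the root device lives on instance store, so its data is lost once the instance shuts down (these instances can only be terminated, not stopped).
  • EBS-backed: the root device is an EBS volume, so its data survives a stop.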

EC2 Instance Lifecycle

The diagram below shows the EC2 Instance State flow for both kinds of instance, EBS-backed and instance store-backed:

Image source: Instance Lifecycle

Actions

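  • Instance Launch
    • Creates a new EC2 Instance from an AMI. The initial state is pending.
  • Instance Stop and Start (EBS-backed instances only)
    • Performing these two actions usually moves the Instance onto different hardware.
    • When AWS schedules system maintenance (it shows up under EC2 Console -> Events, and you also receive an AWS Maintenance notification), for example because aging physical hardware is being retired, running stop followed by start is usually enough: the Instance comes back up on a different, healthy host.
    • If the Instance State is running, a stop takes it through running -> stopping -> stopped. A stopped Instance is no longer billed for instance usage, but its EBS volumes are still billed.
    • A start takes the Instance through pending -> running.
  • Instance Termination
    • Instance State: shutting-down -> terminated
    • Whether the Root Volume is deleted depends on its DeleteOnTermination attribute (true by default). For other volumes the attribute defaults to false, so they remain after termination and have to be deleted manually.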
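
The transitions listed above can be written down as a small lookup table. This is only a sketch for reasoning about the lifecycle: the names NEXT_STATES and is_valid_path are made up for illustration, and the table covers only the transitions mentioned in this article, not every edge in the AWS diagram.

```python
# Instance state transitions mentioned in the Actions list above.
# Hypothetical helper names; not part of any AWS SDK.
NEXT_STATES = {
    "pending": {"running"},                    # launch or start finishes
    "running": {"stopping", "shutting-down"},  # stop (EBS-backed only) or terminate
    "stopping": {"stopped"},
    "stopped": {"pending"},                    # start (EBS-backed only)
    "shutting-down": {"terminated"},
    "terminated": set(),                       # final state
}


def is_valid_path(states):
    """Return True if every consecutive pair of states is an allowed transition."""
    return all(
        nxt in NEXT_STATES.get(cur, set())
        for cur, nxt in zip(states, states[1:])
    )


if __name__ == "__main__":
    # A stop followed by a start on an EBS-backed instance.
    print(is_valid_path(["running", "stopping", "stopped", "pending", "running"]))
    # A stopped instance cannot jump straight to running.
    print(is_valid_path(["stopped", "running"]))
```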

Troubleshooting

Instance start and stop problems

A launch is verified by two checks: System Status Checks and Instance Status Checks. The instance counts as successfully started only when both pass.

System Status Checks

  • Possible causes
    • Loss of network connectivity
    • Loss of system power
    • Software issues on the physical host
    • Hardware issues on the physical host
  • Solutions
    • Stop, then Start
    • Wait for AWS to fix it …

Instance Status Checks

Usually a software or network problem.

  • Possible causes
    • Failed system status checks
    • Incorrect networking or startup configuration
    • Exhausted memory
    • Corrupted file system
    • Incompatible kernel
  • Solutions
    • reboot
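
The two checks and their usual fixes can be combined into a simple triage rule. The sketch below assumes input shaped like one element of the InstanceStatuses list returned by the EC2 DescribeInstanceStatus API (SystemStatus.Status and InstanceStatus.Status). The function name suggest_action and its return labels are hypothetical.

```python
def suggest_action(status):
    """Map one DescribeInstanceStatus entry to the fix suggested in this article.

    The return values ("stop-start", "reboot", "none", "wait") are
    illustrative labels, not AWS API values.
    """
    system = status["SystemStatus"]["Status"]
    instance = status["InstanceStatus"]["Status"]

    if system == "impaired":
        # Problem on the physical host: stop and start to land on another host.
        return "stop-start"
    if instance == "impaired":
        # Software or network problem inside the instance: try a reboot first.
        return "reboot"
    if system == "ok" and instance == "ok":
        return "none"
    # e.g. "initializing" or "insufficient-data": checks have not finished yet.
    return "wait"


if __name__ == "__main__":
    sample = {
        "InstanceId": "i-b4e0xxxx",
        "SystemStatus": {"Status": "ok"},
        "InstanceStatus": {"Status": "impaired"},
    }
    print(suggest_action(sample))
```

A system check failure is tested first because the list above names failed system status checks as one cause of a failed instance check.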

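Instance shuts down again right after starting

Go to EC2 Console -> select the Instance -> Status Checks and look for the error message:

  • InsufficientInstanceCapacity: AWS does not have enough capacity right now. Either wait and retry, or start with a different instance type for the time being …
  • InstanceLimitExceeded: the account limits have been exceeded. Open a Support Ticket to request an increase.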
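
The "try another instance type" workaround can be sketched as a fallback loop. Everything here is an assumption for illustration: LaunchError stands in for whatever exception your SDK raises (with boto3, that would be a ClientError whose error code you inspect), and launch is any callable you supply that starts an instance of the given type.

```python
class LaunchError(Exception):
    """Hypothetical stand-in for an SDK error carrying an EC2 error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def launch_with_fallback(launch, instance_types):
    """Try each instance type in order until one launches.

    Only InsufficientInstanceCapacity triggers a fallback. Any other error,
    such as InstanceLimitExceeded, is re-raised because switching types
    does not necessarily help and a limit increase is needed instead.
    """
    for instance_type in instance_types:
        try:
            return launch(instance_type)
        except LaunchError as err:
            if err.code != "InsufficientInstanceCapacity":
                raise
    # Every candidate type was out of capacity.
    raise LaunchError("InsufficientInstanceCapacity")


if __name__ == "__main__":
    def fake_launch(instance_type):
        # Pretend the first choice has no capacity.
        if instance_type == "m4.large":
            raise LaunchError("InsufficientInstanceCapacity")
        return "launched " + instance_type

    print(launch_with_fallback(fake_launch, ["m4.large", "m4.xlarge"]))
```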

Alarm and Actions for Status Checks (CloudWatch)

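If you run into Status Check problems, you can handle them with Create Status Check Alarm, which can trigger one of the following actions:

  • reboot
  • stop
  • terminate
  • recover

Enabling this is recommended for important services.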
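
The same alarm can be created through the CloudWatch PutMetricAlarm API instead of the console. The sketch below only builds the parameter dictionary. The function name, the alarm name format, and the Period and EvaluationPeriods values are assumptions to adjust. Pairing recover with StatusCheckFailed_System and the other actions with StatusCheckFailed_Instance is a choice made for this sketch.

```python
ALLOWED_ACTIONS = {"reboot", "stop", "terminate", "recover"}


def status_check_alarm_params(instance_id, region, action="recover"):
    """Build keyword arguments for CloudWatch PutMetricAlarm.

    Hypothetical helper: thresholds and naming are illustrative defaults.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError("unsupported action: " + action)

    # recover only applies to system status check failures.
    if action == "recover":
        metric = "StatusCheckFailed_System"
    else:
        metric = "StatusCheckFailed_Instance"

    return {
        "AlarmName": f"{instance_id}-status-check-{action}",
        "Namespace": "AWS/EC2",
        "MetricName": metric,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Maximum",
        "Period": 60,            # seconds per datapoint (assumed value)
        "EvaluationPeriods": 2,  # consecutive failing periods (assumed value)
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        # EC2 action ARN of the form arn:aws:automate:<region>:ec2:<action>
        "AlarmActions": [f"arn:aws:automate:{region}:ec2:{action}"],
    }


if __name__ == "__main__":
    params = status_check_alarm_params("i-b4e0xxxx", "ap-northeast-1")
    print(params["MetricName"], params["AlarmActions"][0])
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.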

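Cannot connect to the Instance (network)

  • Network configuration checks
    • Check the Security Group and Network ACL settings
    • Check whether the subnet is a public or private subnet
    • Confirm that SSH (22) or RDP (3389) is open
    • On linux, check whether the network interface has any unusual configuration
    • Use the correct IP: the EIP for a public subnet, the private IP for a private subnet
  • Private Key
    • Wrong private key (pem file)
    • Wrong file permissions on the private key
    • Wrong login user: each linux distribution uses a different account. ubuntu logs in as ubuntu, the AWS linux AMI uses ec2-user, and RHEL uses root or ec2-user.
  • System condition
    • For the t2 family, confirm that enough CPU Credits remain
    • Check whether the CPU is fully loaded

Note: this is usually unrelated to IAM Roles or Policies.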
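
Two of the checks above are easy to script from the client side: whether the port answers at all, and whether the private key file has permissions that ssh will accept. This is a minimal sketch using only the Python standard library. The function names are hypothetical, and the permission test assumes a Unix-like system where ssh rejects keys readable by group or others.

```python
import os
import socket
import stat


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable, or name resolution failed.
        return False


def key_permissions_ok(path):
    """Return True if the key file grants no access to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0


if __name__ == "__main__":
    # Replace with the instance address you are trying to reach.
    print(port_is_open("127.0.0.1", 22))
```

A closed port points toward the Security Group, Network ACL, or subnet items, while an open port with a rejected login points toward the key or user name items.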

Scheduled Events

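These usually mean hardware maintenance, or a hardware problem that the AWS Data Center is about to service, so the Instances running on that hardware need to move to other machines. The administrator usually receives an email like the following: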

Dear Amazon EC2 Customer,

One or more of your Amazon EC2 instances is scheduled for maintenance on 2016-03-29 for 2 hours starting at 12:00 UTC. During this time, the following instances in the ap-northeast-1 region will be unavailable and then rebooted:

    i-b4e0xxxx

Your instances will return to normal operations after maintenance is complete and all of your configuration settings will be retained. To continue normal operation and avoid any unavailability or reboots during this time, you can migrate the instances listed above to replacement instances. Replacement instances will not be affected by this scheduled maintenance. Otherwise, no action is generally required on your part (certain underlying system components may change at reboot time, and your operating system may prompt you to install additional software/drivers post-reboot as a result). If your instance is part of an auto-scaling group, then it will automatically be terminated and replaced by a newly launched instance during the maintenance window.

You can see more information on this maintenance in the AWS Management Console at https://console.aws.amazon.com/ec2/home?region=ap-northeast-1#s=Events.

Additional information about maintenance events, including how to migrate to replacement instances, can be found at http://docs.amazonwebservices.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html.

We perform maintenance regularly to ensure that the EC2 service continues uninterrupted for our customers. In most cases, maintenance can be performed without service interruption. When maintenance cannot be performed without service interruption, we work hard to keep any impact as brief as possible.

If you have any questions or concerns, you can contact the AWS Support Team on the community forums and via AWS Premium Support at: http://aws.amazon.com/support.

Sincerely,
Amazon Web Services

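When you receive an email like this, go to EC2 Console -> Events to see which instance is affected and what the Event Type is.

Most maintenance events can be cleared with just a stop and start, and the notification email also contains detailed instructions.

  • Types of Scheduled Events: Instance stop, Instance retirement, Reboot, System maintenance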
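
The Events page can also be read programmatically, since the EC2 DescribeInstanceStatus API returns an Events list per instance. The sketch below assumes that response shape (InstanceStatuses, Events, Code, Description, NotBefore) and skips events whose description is marked "[Completed]" or "[Canceled]". The function name pending_events is hypothetical and the sample data is made up.

```python
from datetime import datetime, timezone


def pending_events(response):
    """List (instance_id, event_code, not_before) for events still pending.

    response: a dict shaped like the DescribeInstanceStatus result.
    """
    found = []
    for status in response.get("InstanceStatuses", []):
        for event in status.get("Events", []):
            description = event.get("Description", "")
            # Finished or cancelled events carry a bracketed prefix.
            if description.startswith(("[Completed]", "[Canceled]")):
                continue
            found.append(
                (status["InstanceId"], event["Code"], event.get("NotBefore"))
            )
    return found


if __name__ == "__main__":
    # Made-up sample mirroring the notification email in this article.
    sample = {
        "InstanceStatuses": [
            {
                "InstanceId": "i-b4e0xxxx",
                "Events": [
                    {
                        "Code": "system-reboot",
                        "Description": "scheduled reboot",
                        "NotBefore": datetime(2016, 3, 29, 12, 0, tzinfo=timezone.utc),
                    }
                ],
            }
        ]
    }
    for instance_id, code, not_before in pending_events(sample):
        print(instance_id, code, not_before)
```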

Reference: Scheduled Events for Your Instances

For more Troubleshooting, see: Troubleshooting Instances with Failed Status Checks
