I've run into a very strange issue with AppEngine behavior. We are currently migrating from the 1.9 runtime to 1.12, as 1.9 is going to be obsolete after October 1st. Everything went smoothly: all APIs were migrated, and the result was tested for several days on a staging environment (a separate Google Cloud project).
After going to production, everything was fine for about 6 hours (!); after that, all requests started timing out. The previous 1.9 version had been kept alive, so traffic was rerouted to it.
This does not appear to be a load problem. After the incident, I deployed the Go 1.12 version without making it the active version and accessed it privately. Even a single request cannot be processed in a timely manner if the application handler has to make any outbound request: to Storage, to Datastore, or even a plain HTTP request. I narrowed it down to the code below: putting it in main before everything else produces a timeout after deploy!
Environment: AppEngine Standard, Go 1.12, so no urlfetch. Example code (no incoming request is even needed; multiple URLs were tested, and all of them work locally and on the staging Google Cloud project with the same code):
func main() {
	fmt.Printf("Testing request to external server\n")
	resp, err := http.Get("https://www.example.com")
	if err != nil {
		fmt.Printf("Error making request to external server: %v\n", err)
	} else {
		// resp is nil when err is non-nil, so only touch the body on success.
		data, _ := ioutil.ReadAll(resp.Body)
		resp.Body.Close()
		// Print at most the first 255 bytes of the response.
		if len(data) > 255 {
			fmt.Printf("Content: %v\n", string(data[:255]))
		} else {
			fmt.Printf("Content: %v\n", string(data))
		}
	}
	...
	log.Fatal(http.ListenAndServe(fmt.Sprintf(":%s", port), r))
}
This results in the following messages during instance startup after deploy:
Error making request to external server: Get https://www.example.com: dial tcp 123.123.123.123: i/o timeout
Start program failed: failed to detect app after start: ForAppStart(): [aborted, context canceled. subject:"app/valid" Timeout:30m0s, attempts:6073 aborted, context canceled. subject:"app/invalid" Timeout:30m0s, attempts:6074]
(The second line is understandable: it just cannot connect and waits too long.)
The weirdest thing is that deploying this code to the staging environment (exactly the same file) works. It seems as if GAE ran out of connection attempts, or something like that, on the production environment after several hours (this is a fairly high-load project) and cannot proceed.
What could be causing this behavior? Is it a bug? What could be leaking, and how can I diagnose it? I'm running out of ideas and have googled for hours; any help would be appreciated.