{
	"id": "21c3dae9-d56d-4ffd-b108-304a264362df",
	"created_at": "2026-04-06T01:30:50.491486Z",
	"updated_at": "2026-04-10T03:24:17.99189Z",
	"deleted_at": null,
	"sha1_hash": "5ca65a6165a66309151b15bfc4e53038c83e96f2",
	"title": "Jobs",
	"llm_title": "",
	"authors": "",
	"file_creation_date": "0001-01-01T00:00:00Z",
	"file_modification_date": "0001-01-01T00:00:00Z",
	"file_size": 300281,
	"plain_text": "Jobs\r\nArchived: 2026-04-06 01:23:13 UTC\r\nJobs represent one-off tasks that run to completion and then stop.\r\nA Job creates one or more Pods and will continue to retry execution of the Pods until a specified number of them\r\nsuccessfully terminate. As pods successfully complete, the Job tracks the successful completions. When a\r\nspecified number of successful completions is reached, the task (ie, Job) is complete. Deleting a Job will clean up\r\nthe Pods it created. Suspending a Job will delete its active Pods until the Job is resumed again.\r\nA simple case is to create one Job object in order to reliably run one Pod to completion. The Job object will start a\r\nnew Pod if the first Pod fails or is deleted (for example due to a node hardware failure or a node reboot).\r\nYou can also use a Job to run multiple Pods in parallel.\r\nIf you want to run a Job (either a single task, or several in parallel) on a schedule, see CronJob.\r\nRunning an example Job\r\nHere is an example Job config. It computes π to 2000 places and prints it out. 
It takes around 10s to complete.\r\napiVersion: batch/v1\r\nkind: Job\r\nmetadata:\r\n  name: pi\r\nspec:\r\n  template:\r\n    spec:\r\n      containers:\r\n      - name: pi\r\n        image: perl:5.34.0\r\n        command: [\"perl\", \"-Mbignum=bpi\", \"-wle\", \"print bpi(2000)\"]\r\n      restartPolicy: Never\r\n  backoffLimit: 4\r\nYou can run the example with this command:\r\nkubectl apply -f https://kubernetes.io/examples/controllers/job.yaml\r\nThe output is similar to this:\r\njob.batch/pi created\r\nhttps://kubernetes.io/docs/concepts/workloads/controllers/job/\r\nPage 1 of 26\n\nCheck on the status of the Job with kubectl :\r\nkubectl describe job pi\r\nkubectl get job pi -o yaml\r\nName: pi\r\nNamespace: default\r\nSelector: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c\r\nLabels: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c\r\n batch.kubernetes.io/job-name=pi\r\n ...\r\nAnnotations: batch.kubernetes.io/job-tracking: \"\"\r\nParallelism: 1\r\nCompletions: 1\r\nStart Time: Mon, 02 Dec 2019 15:20:11 +0200\r\nCompleted At: Mon, 02 Dec 2019 15:21:16 +0200\r\nDuration: 65s\r\nPods Statuses: 0 Running / 1 Succeeded / 0 Failed\r\nPod Template:\r\n Labels: batch.kubernetes.io/controller-uid=c9948307-e56d-4b5d-8302-ae2d7b7da67c\r\n batch.kubernetes.io/job-name=pi\r\n Containers:\r\n pi:\r\n Image: perl:5.34.0\r\n Port: \u003cnone\u003e\r\n Host Port: \u003cnone\u003e\r\n Command:\r\n perl\r\n -Mbignum=bpi\r\n -wle\r\n print bpi(2000)\r\n Environment: \u003cnone\u003e\r\n Mounts: \u003cnone\u003e\r\n Volumes: \u003cnone\u003e\r\nEvents:\r\n Type Reason Age From Message\r\n ---- ------ ---- ---- -------\r\n Normal SuccessfulCreate 21s job-controller Created pod: pi-xf9p4\r\n Normal Completed 18s job-controller Job completed\r\napiVersion: batch/v1\r\nkind: Job\r\nmetadata:\r\n  annotations:\r\n    batch.kubernetes.io/job-tracking: \"\"\r\n  ...\r\n  creationTimestamp: 
\"2022-11-10T17:53:53Z\"\r\n generation: 1\r\n labels:\r\n batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223\r\n batch.kubernetes.io/job-name: pi\r\n name: pi\r\n namespace: default\r\n resourceVersion: \"4751\"\r\n uid: 204fb678-040b-497f-9266-35ffa8716d14\r\nspec:\r\n backoffLimit: 4\r\n completionMode: NonIndexed\r\n completions: 1\r\n parallelism: 1\r\n selector:\r\n matchLabels:\r\n batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223\r\n suspend: false\r\n template:\r\n metadata:\r\n creationTimestamp: null\r\n labels:\r\n batch.kubernetes.io/controller-uid: 863452e6-270d-420e-9b94-53a54146c223\r\n batch.kubernetes.io/job-name: pi\r\n spec:\r\n containers:\r\n - command:\r\n - perl\r\n - -Mbignum=bpi\r\n - -wle\r\n - print bpi(2000)\r\n image: perl:5.34.0\r\n imagePullPolicy: IfNotPresent\r\n name: pi\r\n resources: {}\r\n terminationMessagePath: /dev/termination-log\r\n terminationMessagePolicy: File\r\n dnsPolicy: ClusterFirst\r\n restartPolicy: Never\r\n schedulerName: default-scheduler\r\n securityContext: {}\r\n terminationGracePeriodSeconds: 30\r\nstatus:\r\n active: 1\r\n ready: 0\r\nhttps://kubernetes.io/docs/concepts/workloads/controllers/job/\r\nPage 3 of 26\n\nstartTime: \"2022-11-10T17:53:57Z\"\r\n uncountedTerminatedPods: {}\r\nTo view completed Pods of a Job, use kubectl get pods .\r\nTo list all the Pods that belong to a Job in a machine readable form, you can use a command like this:\r\npods=$(kubectl get pods --selector=batch.kubernetes.io/job-name=pi --output=jsonpath='{.items[*].metadata.name}\r\necho $pods\r\nThe output is similar to this:\r\npi-5rwd7\r\nHere, the selector is the same as the selector for the Job. 
The --output=jsonpath option specifies an expression\r\nwith the name from each Pod in the returned list.\r\nView the standard output of one of the pods:\r\nkubectl logs $pods\r\nAnother way to view the logs of a Job:\r\nkubectl logs jobs/pi\r\nThe output is similar to this:\r\n3.1415926535897932384626433832795028841971693993751058209749445923078164062862089986280348253421170679821480865\r\nWriting a Job spec\r\nAs with all other Kubernetes config, a Job needs apiVersion , kind , and metadata fields.\r\nWhen the control plane creates new Pods for a Job, the .metadata.name of the Job is part of the basis for naming\r\nthose Pods. The name of a Job must be a valid DNS subdomain value, but this can produce unexpected results for\r\nthe Pod hostnames. For best compatibility, the name should follow the more restrictive rules for a DNS label.\r\nEven when the name is a DNS subdomain, the name must be no longer than 63 characters.\r\nA Job also needs a .spec section.\r\nJob Labels\r\nJob labels will have batch.kubernetes.io/ prefix for job-name and controller-uid .\r\nhttps://kubernetes.io/docs/concepts/workloads/controllers/job/\r\nPage 4 of 26\n\nPod Template\r\nThe .spec.template is the only required field of the .spec .\r\nThe .spec.template is a pod template. It has exactly the same schema as a Pod, except it is nested and does not\r\nhave an apiVersion or kind .\r\nIn addition to required fields for a Pod, a pod template in a Job must specify appropriate labels (see pod selector)\r\nand an appropriate restart policy.\r\nOnly a RestartPolicy equal to Never or OnFailure is allowed.\r\nPod selector\r\nThe .spec.selector field is optional. In almost all cases you should not specify it. See section specifying your\r\nown pod selector.\r\nParallel execution for Jobs\r\nThere are three main types of task suitable to run as a Job:\r\n1. 
Non-parallel Jobs\r\nnormally, only one Pod is started, unless the Pod fails.\r\nthe Job is complete as soon as its Pod terminates successfully.\r\n2. Parallel Jobs with a fixed completion count:\r\nspecify a non-zero positive value for .spec.completions .\r\nthe Job represents the overall task, and is complete when there are .spec.completions successful\r\nPods.\r\nwhen using .spec.completionMode=\"Indexed\" , each Pod gets a different index in the range 0 to\r\n.spec.completions-1 .\r\n3. Parallel Jobs with a work queue:\r\ndo not specify .spec.completions , default to .spec.parallelism .\r\nthe Pods must coordinate amongst themselves or an external service to determine what each should\r\nwork on. For example, a Pod might fetch a batch of up to N items from the work queue.\r\neach Pod is independently capable of determining whether or not all its peers are done, and thus that\r\nthe entire Job is done.\r\nwhen any Pod from the Job terminates with success, no new Pods are created.\r\nonce at least one Pod has terminated with success and all Pods are terminated, then the Job is\r\ncompleted with success.\r\nonce any Pod has exited with success, no other Pod should still be doing any work for this task or\r\nwriting any output. They should all be in the process of exiting.\r\nFor a non-parallel Job, you can leave both .spec.completions and .spec.parallelism unset. When both are\r\nunset, both are defaulted to 1.\r\nhttps://kubernetes.io/docs/concepts/workloads/controllers/job/\r\nPage 5 of 26\n\nFor a fixed completion count Job, you should set .spec.completions to the number of completions needed. 
You\r\ncan set .spec.parallelism , or leave it unset and it will default to 1.\r\nFor a work queue Job, you must leave .spec.completions unset, and set .spec.parallelism to a non-negative\r\ninteger.\r\nFor more information about how to make use of the different types of job, see the job patterns section.\r\nControlling parallelism\r\nThe requested parallelism ( .spec.parallelism ) can be set to any non-negative value. If it is unspecified, it\r\ndefaults to 1. If it is specified as 0, then the Job is effectively paused until it is increased.\r\nActual parallelism (number of pods running at any instant) may be more or less than requested parallelism, for a\r\nvariety of reasons:\r\nFor fixed completion count Jobs, the actual number of pods running in parallel will not exceed the number\r\nof remaining completions. Higher values of .spec.parallelism are effectively ignored.\r\nFor work queue Jobs, no new Pods are started after any Pod has succeeded -- remaining Pods are allowed\r\nto complete, however.\r\nIf the Job Controller has not had time to react.\r\nIf the Job controller failed to create Pods for any reason (lack of ResourceQuota , lack of permission, etc.),\r\nthen there may be fewer pods than requested.\r\nThe Job controller may throttle new Pod creation due to excessive previous pod failures in the same Job.\r\nWhen a Pod is gracefully shut down, it takes time to stop.\r\nCompletion mode\r\nFEATURE STATE: Kubernetes v1.24 [stable]\r\nJobs with fixed completion count - that is, jobs that have non null .spec.completions - can have a completion\r\nmode that is specified in .spec.completionMode :\r\nNonIndexed (default): the Job is considered complete when there have been .spec.completions\r\nsuccessfully completed Pods. In other words, each Pod completion is homologous to each other. Note that\r\nJobs that have null .spec.completions are implicitly NonIndexed .\r\nIndexed : the Pods of a Job get an associated completion index from 0 to .spec.completions-1 . 
The\r\nindex is available through four mechanisms:\r\nThe Pod annotation batch.kubernetes.io/job-completion-index .\r\nThe Pod label batch.kubernetes.io/job-completion-index (for v1.28 and later). Note the feature\r\ngate PodIndexLabel must be enabled to use this label, and it is enabled by default.\r\nAs part of the Pod hostname, following the pattern $(job-name)-$(index) . When you use an\r\nIndexed Job in combination with a Service, Pods within the Job can use the deterministic hostnames\r\nhttps://kubernetes.io/docs/concepts/workloads/controllers/job/\r\nPage 6 of 26\n\nto address each other via DNS. For more information about how to configure this, see Job with Pod-to-Pod Communication.\r\nFrom the containerized task, in the environment variable JOB_COMPLETION_INDEX .\r\nThe Job is considered complete when there is one successfully completed Pod for each index. For more\r\ninformation about how to use this mode, see Indexed Job for Parallel Processing with Static Work\r\nAssignment.\r\nNote:\r\nAlthough rare, more than one Pod could be started for the same index (due to various reasons such as node\r\nfailures, kubelet restarts, or Pod evictions). In this case, only the first Pod that completes successfully will count\r\ntowards the completion count and update the status of the Job. The other Pods that are running or completed for\r\nthe same index will be deleted by the Job controller once they are detected.\r\nHandling Pod and container failures\r\nA container in a Pod may fail for a number of reasons, such as because the process in it exited with a non-zero exit\r\ncode, or the container was killed for exceeding a memory limit, etc. If this happens, and the\r\n.spec.template.spec.restartPolicy = \"OnFailure\" , then the Pod stays on the node, but the container is re-run.\r\nTherefore, your program needs to handle the case when it is restarted locally, or else specify\r\n.spec.template.spec.restartPolicy = \"Never\" . 
See pod lifecycle for more information on restartPolicy .\r\nAn entire Pod can also fail, for a number of reasons, such as when the pod is kicked off the node (node is\r\nupgraded, rebooted, deleted, etc.), or if a container of the Pod fails and the .spec.template.spec.restartPolicy\r\n= \"Never\" . When a Pod fails, then the Job controller starts a new Pod. This means that your application needs to\r\nhandle the case when it is restarted in a new pod. In particular, it needs to handle temporary files, locks,\r\nincomplete output and the like caused by previous runs.\r\nBy default, each pod failure is counted towards the .spec.backoffLimit limit, see pod backoff failure policy.\r\nHowever, you can customize handling of pod failures by setting the Job's pod failure policy.\r\nAdditionally, you can choose to count the pod failures independently for each index of an Indexed Job by setting\r\nthe .spec.backoffLimitPerIndex field (for more information, see backoff limit per index).\r\nNote that even if you specify .spec.parallelism = 1 and .spec.completions = 1 and\r\n.spec.template.spec.restartPolicy = \"Never\" , the same program may sometimes be started twice.\r\nIf you do specify .spec.parallelism and .spec.completions both greater than 1, then there may be multiple\r\npods running at once. Therefore, your pods must also be tolerant of concurrency.\r\nIf you specify the .spec.podFailurePolicy field, the Job controller does not consider a terminating Pod (a pod\r\nthat has a .metadata.deletionTimestamp field set) as a failure until that Pod is terminal (its .status.phase is\r\nFailed or Succeeded ). However, the Job controller creates a replacement Pod as soon as the termination\r\nbecomes apparent. 
Once the pod terminates, the Job controller evaluates .backoffLimit and\r\n.podFailurePolicy for the relevant Job, taking this now-terminated Pod into consideration.\r\nhttps://kubernetes.io/docs/concepts/workloads/controllers/job/\r\nPage 7 of 26\n\nIf either of these requirements is not satisfied, the Job controller counts a terminating Pod as an immediate failure,\r\neven if that Pod later terminates with phase: \"Succeeded\" .\r\nPod backoff failure policy\r\nThere are situations where you want to fail a Job after some amount of retries due to a logical error in\r\nconfiguration etc. To do so, set .spec.backoffLimit to specify the number of retries before considering a Job as\r\nfailed.\r\nThe .spec.backoffLimit is set by default to 6, unless the backoff limit per index (only Indexed Job) is specified.\r\nWhen .spec.backoffLimitPerIndex is specified, then .spec.backoffLimit defaults to 2147483647\r\n(MaxInt32).\r\nFailed Pods associated with the Job are recreated by the Job controller with an exponential back-off delay (10s,\r\n20s, 40s ...) capped at six minutes.\r\nThe number of retries is calculated in two ways:\r\nThe number of Pods with .status.phase = \"Failed\" .\r\nWhen using restartPolicy = \"OnFailure\" , the number of retries in all the containers of Pods with\r\n.status.phase equal to Pending or Running .\r\nIf either of the calculations reaches the .spec.backoffLimit , the Job is considered failed.\r\nNote:\r\nIf your Job has restartPolicy = \"OnFailure\" , keep in mind that your Pod running the job will be terminated\r\nonce the job backoff limit has been reached. This can make debugging the Job's executable more difficult. 
We\r\nsuggest setting restartPolicy = \"Never\" when debugging the Job or using a logging system to ensure output\r\nfrom failed Jobs is not lost inadvertently.\r\nBackoff limit per index\r\nFEATURE STATE: Kubernetes v1.33 [stable] (enabled by default)\r\nWhen you run an indexed Job, you can choose to handle retries for pod failures independently for each index. To\r\ndo so, set the .spec.backoffLimitPerIndex to specify the maximal number of pod failures per index.\r\nWhen the per-index backoff limit is exceeded for an index, Kubernetes considers the index as failed and adds it to\r\nthe .status.failedIndexes field. The succeeded indexes, those with successfully executed Pods, are recorded\r\nin the .status.completedIndexes field, regardless of whether you set the backoffLimitPerIndex field.\r\nNote that a failing index does not interrupt execution of other indexes. Once all indexes finish for a Job where you\r\nspecified a backoff limit per index, if at least one of those indexes did fail, the Job controller marks the overall Job\r\nas failed, by setting the Failed condition in the status. The Job gets marked as failed even if some, potentially\r\nnearly all, of the indexes were processed successfully.\r\nYou can additionally limit the maximal number of indexes marked failed by setting the .spec.maxFailedIndexes\r\nfield. When the number of failed indexes exceeds the maxFailedIndexes field, the Job controller triggers\r\ntermination of all remaining running Pods for that Job. 
Once all pods are terminated, the entire Job is marked\r\nfailed by the Job controller, by setting the Failed condition in the Job status.\r\nHere is an example manifest for a Job that defines a backoffLimitPerIndex :\r\napiVersion: batch/v1\r\nkind: Job\r\nmetadata:\r\n  name: job-backoff-limit-per-index-example\r\nspec:\r\n  completions: 10\r\n  parallelism: 3\r\n  completionMode: Indexed # required for the feature\r\n  backoffLimitPerIndex: 1 # maximal number of failures per index\r\n  maxFailedIndexes: 5 # maximal number of failed indexes before terminating the Job execution\r\n  template:\r\n    spec:\r\n      restartPolicy: Never # required for the feature\r\n      containers:\r\n      - name: example\r\n        image: python\r\n        command: # The Job fails as there is at least one failed index\r\n                 # (all even indexes fail in here), yet all indexes\r\n                 # are executed as maxFailedIndexes is not exceeded.\r\n        - python3\r\n        - -c\r\n        - |\r\n          import os, sys\r\n          print(\"Hello world\")\r\n          if int(os.environ.get(\"JOB_COMPLETION_INDEX\")) % 2 == 0:\r\n            sys.exit(1)\r\nIn the example above, the Job controller allows for one restart for each of the indexes. When the total number of\r\nfailed indexes exceeds 5, then the entire Job is terminated.\r\nOnce the job is finished, the Job status looks as follows:\r\nkubectl get -o yaml job job-backoff-limit-per-index-example\r\nstatus:\r\n  completedIndexes: 1,3,5,7,9\r\n  failedIndexes: 0,2,4,6,8\r\n  succeeded: 5 # 1 succeeded pod for each of 5 succeeded indexes\r\n  failed: 10 # 2 failed pods (1 retry) for each of 5 failed indexes\r\n  conditions:\r\n  - message: Job has failed indexes\r\n    reason: FailedIndexes\r\n    status: \"True\"\r\n    type: FailureTarget\r\n  - message: Job has failed indexes\r\n    reason: FailedIndexes\r\n    status: \"True\"\r\n    type: Failed\r\nThe Job controller adds the FailureTarget Job condition to trigger Job termination and cleanup. 
When all of the\r\nJob Pods are terminated, the Job controller adds the Failed condition with the same values for reason and\r\nmessage as the FailureTarget Job condition. For details, see Termination of Job Pods.\r\nAdditionally, you may want to use the per-index backoff along with a pod failure policy. When using per-index\r\nbackoff, there is a new FailIndex action available which allows you to avoid unnecessary retries within an\r\nindex.\r\nPod failure policy\r\nFEATURE STATE: Kubernetes v1.31 [stable] (enabled by default)\r\nA Pod failure policy, defined with the .spec.podFailurePolicy field, enables your cluster to handle Pod failures\r\nbased on the container exit codes and the Pod conditions.\r\nIn some situations, you may want to have a better control when handling Pod failures than the control provided by\r\nthe Pod backoff failure policy, which is based on the Job's .spec.backoffLimit . These are some examples of use\r\ncases:\r\nTo optimize costs of running workloads by avoiding unnecessary Pod restarts, you can terminate a Job as\r\nsoon as one of its Pods fails with an exit code indicating a software bug.\r\nTo guarantee that your Job finishes even if there are disruptions, you can ignore Pod failures caused by\r\ndisruptions (such as preemption, API-initiated eviction or taint-based eviction) so that they don't count\r\ntowards the .spec.backoffLimit limit of retries.\r\nYou can configure a Pod failure policy, in the .spec.podFailurePolicy field, to meet the above use cases. 
This\r\npolicy can handle Pod failures based on the container exit codes and the Pod conditions.\r\nHere is a manifest for a Job that defines a podFailurePolicy :\r\napiVersion: batch/v1\r\nkind: Job\r\nmetadata:\r\n  name: job-pod-failure-policy-example\r\nspec:\r\n  completions: 12\r\n  parallelism: 3\r\n  template:\r\n    spec:\r\n      restartPolicy: Never\r\n      containers:\r\n      - name: main\r\n        image: docker.io/library/bash:5\r\n        command: [\"bash\"] # example command simulating a bug which triggers the FailJob action\r\n        args:\r\n        - -c\r\n        - echo \"Hello world!\" \u0026\u0026 sleep 5 \u0026\u0026 exit 42\r\n  backoffLimit: 6\r\n  podFailurePolicy:\r\n    rules:\r\n    - action: FailJob\r\n      onExitCodes:\r\n        containerName: main # optional\r\n        operator: In # one of: In, NotIn\r\n        values: [42]\r\n    - action: Ignore # one of: Ignore, FailJob, Count\r\n      onPodConditions:\r\n      - type: DisruptionTarget # indicates Pod disruption\r\nIn the example above, the first rule of the Pod failure policy specifies that the Job should be marked failed if the\r\nmain container fails with the 42 exit code. The following are the rules for the main container specifically:\r\nan exit code of 0 means that the container succeeded\r\nan exit code of 42 means that the entire Job failed\r\nany other exit code represents that the container failed, and hence the entire Pod. The Pod will be re-created if the total number of restarts is below backoffLimit . 
If the backoffLimit is reached, the entire\r\nJob fails.\r\nNote:\r\nBecause the Pod template specifies a restartPolicy: Never , the kubelet does not restart the main container in\r\nthat particular Pod.\r\nThe second rule of the Pod failure policy, specifying the Ignore action for failed Pods with condition\r\nDisruptionTarget excludes Pod disruptions from being counted towards the .spec.backoffLimit limit of\r\nretries.\r\nNote:\r\nIf the Job failed, either by the Pod failure policy or Pod backoff failure policy, and the Job is running multiple\r\nPods, Kubernetes terminates all the Pods in that Job that are still Pending or Running.\r\nThese are some requirements and semantics of the API:\r\nif you want to use a .spec.podFailurePolicy field for a Job, you must also define that Job's pod template\r\nwith .spec.restartPolicy set to Never .\r\nthe Pod failure policy rules you specify under spec.podFailurePolicy.rules are evaluated in order. Once\r\na rule matches a Pod failure, the remaining rules are ignored. When no rule matches the Pod failure, the\r\ndefault handling applies.\r\nyou may want to restrict a rule to a specific container by specifying its name\r\nin spec.podFailurePolicy.rules[*].onExitCodes.containerName . When not specified, the rule applies to\r\nall containers. When specified, it should match one of the container or initContainer names in the Pod\r\ntemplate.\r\nyou may specify the action taken when a Pod failure policy is matched by\r\nspec.podFailurePolicy.rules[*].action . Possible values are:\r\nFailJob : use to indicate that the Pod's job should be marked as Failed and all running Pods should\r\nbe terminated.\r\nIgnore : use to indicate that the counter towards the .spec.backoffLimit should not be\r\nincremented and a replacement Pod should be created.\r\nCount : use to indicate that the Pod should be handled in the default way. 
The counter towards the\r\n.spec.backoffLimit should be incremented.\r\nFailIndex : use this action along with backoff limit per index to avoid unnecessary retries within\r\nthe index of a failed pod.\r\nNote:\r\nWhen you use a podFailurePolicy , the job controller only matches Pods in the Failed phase. Pods with a\r\ndeletion timestamp that are not in a terminal phase ( Failed or Succeeded ) are considered still terminating. This\r\nimplies that terminating pods retain a tracking finalizer until they reach a terminal phase. Since Kubernetes 1.27,\r\nKubelet transitions deleted pods to a terminal phase (see: Pod Phase). This ensures that deleted pods have their\r\nfinalizers removed by the Job controller.\r\nNote:\r\nStarting with Kubernetes v1.28, when Pod failure policy is used, the Job controller recreates terminating Pods\r\nonly once these Pods reach the terminal Failed phase. This behavior is similar to podReplacementPolicy:\r\nFailed . For more information, see Pod replacement policy.\r\nWhen you use the podFailurePolicy , and the Job fails due to the pod matching the rule with the FailJob\r\naction, then the Job controller triggers the Job termination process by adding the FailureTarget condition. For\r\nmore details, see Job termination and cleanup.\r\nSuccess policy\r\nWhen creating an Indexed Job, you can define when a Job can be declared as succeeded using a\r\n.spec.successPolicy , based on the pods that succeeded.\r\nhttps://kubernetes.io/docs/concepts/workloads/controllers/job/\r\nPage 12 of 26\n\nBy default, a Job succeeds when the number of succeeded Pods equals .spec.completions . These are some\r\nsituations where you might want additional control for declaring a Job succeeded:\r\nWhen running simulations with different parameters, you might not need all the simulations to succeed for\r\nthe overall Job to be successful.\r\nWhen following a leader-worker pattern, only the success of the leader determines the success or failure of\r\na Job. 
Examples of this are frameworks like MPI and PyTorch etc.\r\nYou can configure a success policy, in the .spec.successPolicy field, to meet the above use cases. This policy\r\ncan handle Job success based on the succeeded pods. After the Job meets the success policy, the job controller\r\nterminates the lingering Pods. A success policy is defined by rules. Each rule can take one of the following forms:\r\nWhen you specify the succeededIndexes only, once all indexes specified in the succeededIndexes\r\nsucceed, the job controller marks the Job as succeeded. The succeededIndexes must be a list of intervals\r\nbetween 0 and .spec.completions-1 .\r\nWhen you specify the succeededCount only, once the number of succeeded indexes reaches the\r\nsucceededCount , the job controller marks the Job as succeeded.\r\nWhen you specify both succeededIndexes and succeededCount , once the number of succeeded indexes\r\nfrom the subset of indexes specified in the succeededIndexes reaches the succeededCount , the job\r\ncontroller marks the Job as succeeded.\r\nNote that when you specify multiple rules in the .spec.successPolicy.rules , the job controller evaluates the\r\nrules in order. 
Once the Job meets a rule, the job controller ignores remaining rules.\r\nHere is a manifest for a Job with successPolicy :\r\napiVersion: batch/v1\r\nkind: Job\r\nmetadata:\r\n  name: job-success\r\nspec:\r\n  parallelism: 10\r\n  completions: 10\r\n  completionMode: Indexed # Required for the success policy\r\n  successPolicy:\r\n    rules:\r\n    - succeededIndexes: 0,2-3\r\n      succeededCount: 1\r\n  template:\r\n    spec:\r\n      containers:\r\n      - name: main\r\n        image: python\r\n        command: # Provided that at least one of the Pods with 0, 2, and 3 indexes has succeeded,\r\n                 # the overall Job is a success.\r\n        - python3\r\n        - -c\r\n        - |\r\n          import os, sys\r\n          if os.environ.get(\"JOB_COMPLETION_INDEX\") == \"2\":\r\n            sys.exit(0)\r\n          else:\r\n            sys.exit(1)\r\n      restartPolicy: Never\r\nIn the example above, both succeededIndexes and succeededCount have been specified. Therefore, the job\r\ncontroller will mark the Job as succeeded and terminate the lingering Pods when any of the specified indexes, 0,\r\n2, or 3, succeeds. The Job that meets the success policy gets the SuccessCriteriaMet condition with a\r\nSuccessPolicy reason. After the removal of the lingering Pods is issued, the Job gets the Complete condition.\r\nNote that succeededIndexes is represented as intervals separated by commas, where each interval lists the first\r\nand last element of the series, separated by a hyphen.\r\nNote:\r\nWhen you specify both a success policy and some terminating policies such as .spec.backoffLimit and\r\n.spec.podFailurePolicy , once the Job meets either policy, the job controller respects the terminating policy and\r\nignores the success policy.\r\nJob termination and cleanup\r\nWhen a Job completes, no more Pods are created, but the Pods are usually not deleted either. 
Keeping them\r\naround allows you to still view the logs of completed pods to check for errors, warnings, or other diagnostic\r\noutput. The job object also remains after it is completed so that you can view its status. It is up to the user to delete\r\nold jobs after noting their status. Delete the job with kubectl (e.g. kubectl delete jobs/pi or kubectl\r\ndelete -f ./job.yaml ). When you delete the job using kubectl , all the pods it created are deleted too.\r\nBy default, a Job will run uninterrupted unless a Pod fails ( restartPolicy=Never ) or a Container exits in error\r\n( restartPolicy=OnFailure ), at which point the Job defers to the .spec.backoffLimit described above. Once\r\n.spec.backoffLimit has been reached the Job will be marked as failed and any running Pods will be terminated.\r\nAnother way to terminate a Job is by setting an active deadline. Do this by setting the\r\n.spec.activeDeadlineSeconds field of the Job to a number of seconds. The activeDeadlineSeconds applies to\r\nthe duration of the job, no matter how many Pods are created. Once a Job reaches activeDeadlineSeconds , all of\r\nits running Pods are terminated and the Job status will become type: Failed with reason: DeadlineExceeded .\r\nNote that a Job's .spec.activeDeadlineSeconds takes precedence over its .spec.backoffLimit . 
Therefore, a\r\nJob that is retrying one or more failed Pods will not deploy additional Pods once it reaches the time limit specified\r\nby activeDeadlineSeconds , even if the backoffLimit is not yet reached.\r\nExample:\r\napiVersion: batch/v1\r\nkind: Job\r\nmetadata:\r\n  name: pi-with-timeout\r\nspec:\r\n  backoffLimit: 5\r\n  activeDeadlineSeconds: 100\r\n  template:\r\n    spec:\r\n      containers:\r\n      - name: pi\r\n        image: perl:5.34.0\r\n        command: [\"perl\", \"-Mbignum=bpi\", \"-wle\", \"print bpi(2000)\"]\r\n      restartPolicy: Never\r\nNote that both the Job spec and the Pod template spec within the Job have an activeDeadlineSeconds field.\r\nEnsure that you set this field at the proper level.\r\nKeep in mind that the restartPolicy applies to the Pod, and not to the Job itself: there is no automatic Job\r\nrestart once the Job status is type: Failed . That is, the Job termination mechanisms activated with\r\n.spec.activeDeadlineSeconds and .spec.backoffLimit result in a permanent Job failure that requires manual\r\nintervention to resolve.\r\nTerminal Job conditions\r\nA Job has two possible terminal states, each of which has a corresponding Job condition:\r\nSucceeded: Job condition Complete\r\nFailed: Job condition Failed\r\nJobs fail for the following reasons:\r\nThe number of Pod failures exceeded the specified .spec.backoffLimit in the Job specification. For\r\ndetails, see Pod backoff failure policy.\r\nThe Job runtime exceeded the specified .spec.activeDeadlineSeconds .\r\nAn indexed Job that used .spec.backoffLimitPerIndex has failed indexes. For details, see Backoff limit\r\nper index.\r\nThe number of failed indexes in the Job exceeded the specified spec.maxFailedIndexes . For details, see\r\nBackoff limit per index.\r\nA failed Pod matches a rule in .spec.podFailurePolicy that has the FailJob action. 
For details about\r\nhow Pod failure policy rules might affect failure evaluation, see Pod failure policy.\r\nJobs succeed for the following reasons:\r\nThe number of succeeded Pods reached the specified .spec.completions .\r\nThe criteria specified in .spec.successPolicy are met. For details, see Success policy.\r\nIn Kubernetes v1.31 and later, the Job controller delays the addition of the terminal conditions, Failed or\r\nComplete , until all of the Job Pods are terminated.\r\nIn Kubernetes v1.30 and earlier, the Job controller added the Complete or the Failed Job terminal conditions\r\nas soon as the Job termination process was triggered and all Pod finalizers were removed. However, some Pods\r\nwould still be running or terminating at the moment that the terminal condition was added.\r\nIn Kubernetes v1.31 and later, the controller only adds the Job terminal conditions after all of the Pods are\r\nterminated. You can control this behavior by using the JobManagedBy and the JobPodReplacementPolicy feature gates\r\n(both enabled by default).\r\nTermination of Job pods\r\nThe Job controller adds the FailureTarget condition or the SuccessCriteriaMet condition to the Job to trigger\r\nPod termination after a Job meets either the success or failure criteria.\r\nFactors like terminationGracePeriodSeconds might increase the amount of time from the moment that the Job\r\ncontroller adds the FailureTarget condition or the SuccessCriteriaMet condition to the moment that all of the\r\nJob Pods terminate and the Job controller adds a terminal condition ( Failed or Complete ).\r\nYou can use the FailureTarget or the SuccessCriteriaMet condition to evaluate whether the Job has failed or\r\nsucceeded without having to wait for the controller to add a terminal condition.\r\nFor example, you might want to decide when to create a replacement Job that replaces a failed Job. 
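The distinction between the early-signal conditions (FailureTarget, SuccessCriteriaMet) and the terminal conditions (Failed, Complete) can be checked from a parsed Job status like so (an illustrative sketch; the function name and return labels are assumptions, the condition type names come from the Job API):

```python
# Illustrative sketch: classify a Job from its status.conditions,
# distinguishing early signals from terminal conditions.
def job_outcome(status):
    # Collect the condition types whose status is "True".
    active = {c["type"] for c in status.get("conditions", [])
              if c.get("status") == "True"}
    if "Failed" in active or "Complete" in active:
        return "terminal"          # all Pods are terminated
    if "FailureTarget" in active or "SuccessCriteriaMet" in active:
        return "terminating"       # outcome decided, Pods may still be shutting down
    return "running"
```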
If you replace\r\nthe failed Job when the FailureTarget condition appears, your replacement Job runs sooner, but could result in\r\nPods from the failed and the replacement Job running at the same time, using extra compute resources.\r\nAlternatively, if your cluster has limited resource capacity, you could choose to wait until the Failed condition\r\nappears on the Job, which would delay your replacement Job but would ensure that you conserve resources by\r\nwaiting until all of the failed Pods are removed.\r\nClean up finished jobs automatically\r\nFinished Jobs are usually no longer needed in the system. Keeping them around in the system will put pressure on\r\nthe API server. If the Jobs are managed directly by a higher-level controller, such as CronJobs, the Jobs can be\r\ncleaned up by CronJobs based on the specified capacity-based cleanup policy.\r\nTTL mechanism for finished Jobs\r\nFEATURE STATE: Kubernetes v1.23 [stable]\r\nAnother way to clean up finished Jobs (either Complete or Failed ) automatically is to use a TTL mechanism\r\nprovided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of\r\nthe Job.\r\nWhen the TTL controller cleans up the Job, it deletes the Job in a cascading fashion, i.e. it deletes its dependent objects,\r\nsuch as Pods, together with the Job. 
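The TTL rule for .spec.ttlSecondsAfterFinished can be sketched as a simple eligibility check (an illustrative sketch only; the function and parameter names are assumptions, and times are plain seconds rather than Kubernetes timestamps):

```python
# Illustrative sketch of the TTL-after-finished rule for a Job.
def eligible_for_ttl_cleanup(finished_at, now, ttl_seconds_after_finished):
    # Unset field: the TTL controller never cleans this Job up.
    if ttl_seconds_after_finished is None:
        return False
    # 0 means eligible immediately after the Job finishes.
    return now >= finished_at + ttl_seconds_after_finished
```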
Note that when the Job is deleted, its lifecycle guarantees, such as finalizers,\r\nwill be honored.\r\nFor example:\r\napiVersion: batch/v1\r\nkind: Job\r\nmetadata:\r\n  name: pi-with-ttl\r\nspec:\r\n  ttlSecondsAfterFinished: 100\r\n  template:\r\n    spec:\r\n      containers:\r\n      - name: pi\r\n        image: perl:5.34.0\r\n        command: [\"perl\", \"-Mbignum=bpi\", \"-wle\", \"print bpi(2000)\"]\r\n      restartPolicy: Never\r\nThe Job pi-with-ttl will be eligible to be automatically deleted 100 seconds after it finishes.\r\nIf the field is set to 0 , the Job will be eligible to be automatically deleted immediately after it finishes. If the\r\nfield is unset, this Job won't be cleaned up by the TTL controller after it finishes.\r\nNote:\r\nIt is recommended to set the ttlSecondsAfterFinished field because unmanaged Jobs (Jobs that you created\r\ndirectly, and not indirectly through other workload APIs such as CronJob) have a default deletion policy of\r\norphanDependents , causing Pods created by an unmanaged Job to be left around after that Job is fully deleted.\r\nEven though the control plane eventually garbage collects the Pods from a deleted Job after they either fail or\r\ncomplete, sometimes those lingering Pods may cause cluster performance degradation or, in the worst case, cause the\r\ncluster to go offline due to this degradation.\r\nYou can use LimitRanges and ResourceQuotas to place a cap on the amount of resources that a particular\r\nnamespace can consume.\r\nJob patterns\r\nThe Job object can be used to process a set of independent but related work items. These might be emails to be\r\nsent, frames to be rendered, files to be transcoded, ranges of keys in a NoSQL database to scan, and so on.\r\nIn a complex system, there may be multiple different sets of work items. 
Here we are just considering one set of\r\nwork items that the user wants to manage together: a batch job.\r\nThere are several different patterns for parallel computation, each with strengths and weaknesses. The tradeoffs\r\nare:\r\nOne Job object for each work item, versus a single Job object for all work items. One Job per work item\r\ncreates some overhead for the user and for the system to manage large numbers of Job objects. A single Job\r\nfor all work items is better for large numbers of items.\r\nNumber of Pods created equals number of work items, versus each Pod can process multiple work items.\r\nWhen the number of Pods equals the number of work items, the Pods typically require less modification\r\nto existing code and containers. Having each Pod process multiple work items is better for large numbers\r\nof items.\r\nSeveral approaches use a work queue. This requires running a queue service, and modifications to the\r\nexisting program or container to make it use the work queue. Other approaches are easier to adapt to an\r\nexisting containerised application.\r\nWhen the Job is associated with a headless Service, you can enable the Pods within a Job to communicate\r\nwith each other to collaborate in a computation.\r\nThe tradeoffs are summarized here, with columns 2 to 4 corresponding to the above tradeoffs. The pattern names\r\nare also links to examples and more detailed description.\r\nPattern | Single Job object? | Fewer pods than work items? | Use app unmodified?\r\nQueue with Pod Per Work Item | ✓ | | sometimes\r\nQueue with Variable Pod Count | ✓ | ✓ |\r\nIndexed Job with Static Work Assignment | ✓ | | ✓\r\nJob with Pod-to-Pod Communication | ✓ | sometimes | sometimes\r\nJob Template Expansion | | | ✓\r\nWhen you specify completions with .spec.completions , each Pod created by the Job controller has an identical\r\nspec . 
This means that all pods for a task will have the same command line and the same image, the same\r\nvolumes, and (almost) the same environment variables. These patterns are different ways to arrange for pods to\r\nwork on different things.\r\nThis table shows the required settings for .spec.parallelism and .spec.completions for each of the patterns.\r\nHere, W is the number of work items.\r\nPattern | .spec.completions | .spec.parallelism\r\nQueue with Pod Per Work Item | W | any\r\nQueue with Variable Pod Count | null | any\r\nIndexed Job with Static Work Assignment | W | any\r\nJob with Pod-to-Pod Communication | W | W\r\nJob Template Expansion | 1 | should be 1\r\nAdvanced usage\r\nSuspending a Job\r\nFEATURE STATE: Kubernetes v1.24 [stable]\r\nWhen a Job is created, the Job controller will immediately begin creating Pods to satisfy the Job's requirements\r\nand will continue to do so until the Job is complete. However, you may want to temporarily suspend a Job's\r\nexecution and resume it later, or start Jobs in suspended state and have a custom controller decide later when to\r\nstart them.\r\nTo suspend a Job, you can update the .spec.suspend field of the Job to true; later, when you want to resume it\r\nagain, update it to false. Creating a Job with .spec.suspend set to true will create it in the suspended state.\r\nIn Kubernetes 1.35 or later the .status.startTime field is cleared on Job suspension when the\r\nMutableSchedulingDirectivesForSuspendedJobs feature gate is enabled.\r\nWhen a Job is resumed from suspension, its .status.startTime field will be reset to the current time. This\r\nmeans that the .spec.activeDeadlineSeconds timer will be stopped and reset when a Job is suspended and\r\nresumed.\r\nWhen you suspend a Job, any running Pods that don't have a status of Completed will be terminated with a\r\nSIGTERM signal. 
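A worker Pod can catch that SIGTERM to checkpoint in-progress work before exiting. A minimal illustrative sketch (the `progress` dict and handler name are assumptions; a real worker would persist its state to durable storage):

```python
# Illustrative worker sketch: handle the SIGTERM sent when a Job is
# suspended, so in-progress work can be checkpointed before exit.
import signal

progress = {"items_done": 0, "checkpointed": False}

def on_sigterm(signum, frame):
    # In a real worker: write progress to durable storage, then exit cleanly.
    progress["checkpointed"] = True

# Register the handler; the controller's SIGTERM now triggers a checkpoint.
signal.signal(signal.SIGTERM, on_sigterm)
```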
The Pod's graceful termination period will be honored and your Pod must handle this signal in\r\nthis period. This may involve saving progress for later or undoing changes. Pods terminated this way will not\r\ncount towards the Job's completions count.\r\nAn example Job definition in the suspended state looks like this:\r\nkubectl get job myjob -o yaml\r\napiVersion: batch/v1\r\nkind: Job\r\nmetadata:\r\n  name: myjob\r\nspec:\r\n  suspend: true\r\n  parallelism: 1\r\n  completions: 5\r\n  template:\r\n    spec:\r\n      ...\r\nYou can also toggle Job suspension by patching the Job using the command line.\r\nSuspend an active Job:\r\nkubectl patch job/myjob --type=strategic --patch '{\"spec\":{\"suspend\":true}}'\r\nResume a suspended Job:\r\nkubectl patch job/myjob --type=strategic --patch '{\"spec\":{\"suspend\":false}}'\r\nThe Job's status can be used to determine if a Job is suspended or has been suspended in the past:\r\nkubectl get jobs/myjob -o yaml\r\napiVersion: batch/v1\r\nkind: Job\r\n# .metadata and .spec omitted\r\nstatus:\r\n  conditions:\r\n  - lastProbeTime: \"2021-02-05T13:14:33Z\"\r\n    lastTransitionTime: \"2021-02-05T13:14:33Z\"\r\n    status: \"True\"\r\n    type: Suspended\r\n  startTime: \"2021-02-05T13:13:48Z\"\r\nThe Job condition of type \"Suspended\" with status \"True\" means the Job is suspended; the lastTransitionTime\r\nfield can be used to determine how long the Job has been suspended for. If the status of that condition is \"False\",\r\nthen the Job was previously suspended and is now running. 
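The three possible readings of the "Suspended" condition can be expressed as a small classifier (an illustrative sketch; the function name and return labels are assumptions, the condition type and status values come from the Job status shown above):

```python
# Illustrative sketch: classify a Job's suspension state from its
# status.conditions list, per the "Suspended" condition semantics.
def suspension_state(conditions):
    for c in conditions:
        if c.get("type") == "Suspended":
            # "True" -> currently suspended; "False" -> suspended before, running now.
            return "suspended" if c.get("status") == "True" else "resumed"
    # No "Suspended" condition at all: the Job has never been stopped.
    return "never-suspended"
```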
If such a condition does not exist in the Job's status,\r\nthe Job has never been stopped.\r\nEvents are also created when the Job is suspended and resumed:\r\nkubectl describe jobs/myjob\r\nName: myjob\r\n...\r\nEvents:\r\n Type Reason Age From Message\r\n ---- ------ ---- ---- -------\r\n Normal SuccessfulCreate 12m job-controller Created pod: myjob-hlrpl\r\n Normal SuccessfulDelete 11m job-controller Deleted pod: myjob-hlrpl\r\n Normal Suspended 11m job-controller Job suspended\r\n Normal SuccessfulCreate 3s job-controller Created pod: myjob-jvb44\r\n Normal Resumed 3s job-controller Job resumed\r\nThe last four events, particularly the \"Suspended\" and \"Resumed\" events, are a direct result of toggling the\r\n.spec.suspend field. In the time between these two events, we see that no Pods were created, but Pod creation\r\nrestarted as soon as the Job was resumed.\r\nMutable Scheduling Directives\r\nFEATURE STATE: Kubernetes v1.27 [stable]\r\nIn most cases, a parallel job will want the pods to run with constraints, like all in the same zone, or all either on\r\nGPU model x or y but not a mix of both.\r\nThe suspend field is the first step towards achieving those semantics. 
Suspend allows a custom queue controller to\r\ndecide when a job should start; however, once a job is unsuspended, a custom queue controller has no influence\r\non where the pods of a job will actually land.\r\nThis feature allows updating a Job's scheduling directives before it starts, which gives custom queue controllers\r\nthe ability to influence pod placement while at the same time offloading actual pod-to-node assignment to kube-scheduler.\r\nThe fields in a Job's pod template that can be updated are node affinity, node selector, tolerations, labels,\r\nannotations and scheduling gates.\r\nMutable Scheduling Directives for suspended Jobs\r\nFEATURE STATE: Kubernetes v1.35 [alpha] (disabled by default)\r\nIn Kubernetes 1.34 or earlier, mutating a Pod's scheduling directives is allowed only for suspended Jobs that have\r\nnever been unsuspended before. In Kubernetes 1.35, this is allowed for any suspended Job when the\r\nMutableSchedulingDirectivesForSuspendedJobs feature gate is enabled.\r\nAdditionally, this feature gate enables clearing of the .status.startTime field on Job suspension.\r\nMutable Pod resources for suspended Jobs\r\nFEATURE STATE: Kubernetes v1.35 [alpha] (disabled by default)\r\nA cluster administrator can define admission controls in Kubernetes, modifying the resource requests or limits for\r\na Job, based on policy rules.\r\nWith this feature, Kubernetes also lets you modify the pod template of a suspended Job, to change the resource\r\nrequirements of the Pods in the Job. 
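The "only these pod-template fields may change" rule for mutable scheduling directives can be sketched as a simple validation check (a hypothetical sketch, not the API server's actual validation; the field-name strings and function name are assumptions chosen to mirror the list of mutable fields given earlier):

```python
# Hypothetical validation sketch: a pod-template update on a suspended Job
# is allowed only if it touches the mutable scheduling fields.
MUTABLE_FIELDS = {"nodeAffinity", "nodeSelector", "tolerations",
                  "labels", "annotations", "schedulingGates"}

def update_allowed(changed_fields):
    # Every changed field must be in the mutable set.
    return set(changed_fields) <= MUTABLE_FIELDS
```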
This is different from in-place Pod resize which lets you update resources,\r\none Pod at a time, for Pods that are already running.\r\nThe client that sets the new resource requests or limits can be different from the client that initially created the\r\nJob, and does not need to be a cluster administrator.\r\nSpecifying your own Pod selector\r\nNormally, when you create a Job object, you do not specify .spec.selector . The system defaulting logic adds\r\nthis field when the Job is created. It picks a selector value that will not overlap with any other jobs.\r\nHowever, in some cases, you might need to override this automatically set selector. To do this, you can specify the\r\n.spec.selector of the Job.\r\nBe very careful when doing this. If you specify a label selector which is not unique to the pods of that Job, and\r\nwhich matches unrelated Pods, then pods of the unrelated job may be deleted, or this Job may count other Pods as\r\ncompleting it, or one or both Jobs may refuse to create Pods or run to completion. If a non-unique selector is\r\nchosen, then other controllers (e.g. ReplicationController) and their Pods may behave in unpredictable ways too.\r\nKubernetes will not stop you from making a mistake when specifying .spec.selector .\r\nHere is an example of a case when you might want to use this feature.\r\nSay Job old is already running. You want existing Pods to keep running, but you want the rest of the Pods it\r\ncreates to use a different pod template and for the Job to have a new name. You cannot update the Job because\r\nthese fields are not updatable. Therefore, you delete Job old but leave its pods running, using kubectl delete\r\njobs/old --cascade=orphan . 
Before deleting it, you make a note of what selector it uses:\r\nkubectl get job old -o yaml\r\nThe output is similar to this:\r\nkind: Job\r\nmetadata:\r\n  name: old\r\n  ...\r\nspec:\r\n  selector:\r\n    matchLabels:\r\n      batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002\r\n  ...\r\nThen you create a new Job with name new and you explicitly specify the same selector. Since the existing Pods\r\nhave label batch.kubernetes.io/controller-uid=a8f3d00d-c6d2-11e5-9f87-42010af00002 , they are controlled\r\nby Job new as well.\r\nYou need to specify manualSelector: true in the new Job since you are not using the selector that the system\r\nnormally generates for you automatically.\r\nkind: Job\r\nmetadata:\r\n  name: new\r\n  ...\r\nspec:\r\n  manualSelector: true\r\n  selector:\r\n    matchLabels:\r\n      batch.kubernetes.io/controller-uid: a8f3d00d-c6d2-11e5-9f87-42010af00002\r\n  ...\r\nThe new Job itself will have a different uid from a8f3d00d-c6d2-11e5-9f87-42010af00002 . Setting\r\nmanualSelector: true tells the system that you know what you are doing and to allow this mismatch.\r\nJob tracking with finalizers\r\nFEATURE STATE: Kubernetes v1.26 [stable]\r\nThe control plane keeps track of the Pods that belong to any Job and notices if any such Pod is removed from the\r\nAPI server. To do that, the Job controller creates Pods with the finalizer batch.kubernetes.io/job-tracking .\r\nThe controller removes the finalizer only after the Pod has been accounted for in the Job status, allowing the Pod\r\nto be removed by other controllers or users.\r\nElastic Indexed Jobs\r\nFEATURE STATE: Kubernetes v1.31 [stable] (enabled by default)\r\nYou can scale Indexed Jobs up or down by mutating both .spec.parallelism and .spec.completions together\r\nsuch that .spec.parallelism == .spec.completions . 
When scaling down, Kubernetes removes the Pods with\r\nhigher indexes.\r\nUse cases for elastic Indexed Jobs include batch workloads which require scaling an indexed Job, such as MPI,\r\nHorovod, Ray, and PyTorch training jobs.\r\nDelayed creation of replacement pods\r\nFEATURE STATE: Kubernetes v1.34 [stable] (enabled by default)\r\nBy default, the Job controller recreates Pods as soon as they either fail or are terminating (have a deletion\r\ntimestamp). This means that, at a given time, when some of the Pods are terminating, the number of running Pods\r\nfor a Job can be greater than parallelism or greater than one Pod per index (if you are using an Indexed Job).\r\nYou may choose to create replacement Pods only when the terminating Pod is fully terminal (has status.phase:\r\nFailed ). To do this, set .spec.podReplacementPolicy: Failed . The default replacement policy depends on\r\nwhether the Job has a podFailurePolicy set. With no Pod failure policy defined for a Job, omitting the\r\npodReplacementPolicy field selects the TerminatingOrFailed replacement policy: the control plane creates\r\nreplacement Pods immediately upon Pod deletion (as soon as the control plane sees that a Pod for this Job has\r\ndeletionTimestamp set). For Jobs with a Pod failure policy set, the default podReplacementPolicy is Failed ,\r\nand no other value is permitted. See Pod failure policy to learn more about Pod failure policies for Jobs.\r\nkind: Job\r\nmetadata:\r\n  name: new\r\n  ...\r\nspec:\r\n  podReplacementPolicy: Failed\r\n  ...\r\nProvided your cluster has the feature gate enabled, you can inspect the .status.terminating field of a Job. 
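The defaulting rule for podReplacementPolicy described above reduces to a short function (an illustrative sketch; the function and parameter names are assumptions, the policy value strings come from the Job API):

```python
# Sketch of the defaulting rule for .spec.podReplacementPolicy.
def effective_pod_replacement_policy(pod_replacement_policy=None,
                                     has_pod_failure_policy=False):
    # An explicitly set value wins.
    if pod_replacement_policy is not None:
        return pod_replacement_policy
    # With a Pod failure policy, the default (and only permitted value) is Failed;
    # otherwise replacements are created as soon as Pods are terminating or failed.
    return "Failed" if has_pod_failure_policy else "TerminatingOrFailed"
```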
The\r\nvalue of the field is the number of Pods owned by the Job that are currently terminating.\r\nkubectl get jobs/myjob -o yaml\r\napiVersion: batch/v1\r\nkind: Job\r\n# .metadata and .spec omitted\r\nstatus:\r\n  terminating: 3 # three Pods are terminating and have not yet reached the Failed phase\r\nDelegation of managing a Job object to external controller\r\nFEATURE STATE: Kubernetes v1.35 [stable] (enabled by default)\r\nThis feature allows you to disable the built-in Job controller, for a specific Job, and delegate reconciliation of the\r\nJob to an external controller.\r\nYou indicate the controller that reconciles the Job by setting a custom value for the spec.managedBy field: any\r\nvalue other than kubernetes.io/job-controller . The value of the field is immutable.\r\nNote:\r\nWhen using this feature, make sure the controller indicated by the field is installed, otherwise the Job may not be\r\nreconciled at all.\r\nNote:\r\nWhen developing an external Job controller, be aware that your controller needs to operate in a fashion conformant\r\nwith the definitions of the API spec and status fields of the Job object.\r\nPlease review these in detail in the Job API. We also recommend that you run the e2e conformance tests for the\r\nJob object to verify your implementation.\r\nFinally, when developing an external Job controller, make sure it does not use the batch.kubernetes.io/job-tracking finalizer, which is reserved for the built-in controller.\r\nAlternatives\r\nBare Pods\r\nWhen the node that a Pod is running on reboots or fails, the pod is terminated and will not be restarted. However,\r\na Job will create new Pods to replace terminated ones. For this reason, we recommend that you use a Job rather\r\nthan a bare Pod, even if your application requires only a single Pod.\r\nReplication Controller\r\nJobs are complementary to Replication Controllers. 
A Replication Controller manages Pods which are not\r\nexpected to terminate (e.g. web servers), and a Job manages Pods that are expected to terminate (e.g. batch tasks).\r\nAs discussed in Pod Lifecycle, Job is only appropriate for pods with RestartPolicy equal to OnFailure or\r\nNever .\r\nNote:\r\nIf RestartPolicy is not set, the default value is Always .\r\nSingle Job starts controller Pod\r\nAnother pattern is for a single Job to create a Pod which then creates other Pods, acting as a sort of custom\r\ncontroller for those Pods. This allows the most flexibility, but may be somewhat complicated to get started with\r\nand offers less integration with Kubernetes.\r\nAn advantage of this approach is that the overall process gets the completion guarantee of a Job object, but\r\nmaintains complete control over what Pods are created and how work is assigned to them.\r\nWhat's next\r\nLearn about Pods.\r\nRead about different ways of running Jobs:\r\nCoarse Parallel Processing Using a Work Queue\r\nFine Parallel Processing Using a Work Queue\r\nUse an indexed Job for parallel processing with static work assignment\r\nCreate multiple Jobs based on a template: Parallel Processing using Expansions\r\nFollow the links within Clean up finished jobs automatically to learn more about how your cluster can\r\nclean up completed and/or failed tasks.\r\nJob is part of the Kubernetes REST API. Read the Job object definition to understand the API for jobs.\r\nRead about CronJob , which you can use to define a series of Jobs that will run based on a schedule,\r\nsimilar to the UNIX tool cron .\r\nPractice how to configure handling of retriable and non-retriable pod failures using podFailurePolicy ,\r\nbased on the step-by-step examples.\r\nSource: https://kubernetes.io/docs/concepts/workloads/controllers/job/",
	"extraction_quality": 1,
	"language": "EN",
	"sources": [
		"MITRE"
	],
	"references": [
		"https://kubernetes.io/docs/concepts/workloads/controllers/job/"
	],
	"report_names": [
		"job"
	],
	"threat_actors": [
		{
			"id": "eb3f4e4d-2573-494d-9739-1be5141cf7b2",
			"created_at": "2022-10-25T16:07:24.471018Z",
			"updated_at": "2026-04-10T02:00:05.002374Z",
			"deleted_at": null,
			"main_name": "Cron",
			"aliases": [],
			"source_name": "ETDA:Cron",
			"tools": [
				"Catelites",
				"Catelites Bot",
				"CronBot",
				"TinyZBot"
			],
			"source_id": "ETDA",
			"reports": null
		}
	],
	"ts_created_at": 1775439050,
	"ts_updated_at": 1775791457,
	"ts_creation_date": 0,
	"ts_modification_date": 0,
	"files": {
		"pdf": "https://archive.orkl.eu/5ca65a6165a66309151b15bfc4e53038c83e96f2.pdf",
		"text": "https://archive.orkl.eu/5ca65a6165a66309151b15bfc4e53038c83e96f2.txt",
		"img": "https://archive.orkl.eu/5ca65a6165a66309151b15bfc4e53038c83e96f2.jpg"
	}
}