
DataLossError (see above for traceback): corrupted record at 12 #13463

Closed
huangrandong opened this issue Oct 3, 2017 · 79 comments
Labels
stat:awaiting response Status - Awaiting response from author

Comments

@huangrandong

I have a big problem. I use a TFRecord file to feed data into my TensorFlow program, but after the program has been running for a while it raises a DataLossError:

System information

OS Platform and Distribution : Linux Ubuntu 14.04
TensorFlow installed from : Anaconda
TensorFlow version : 1.3.0
Python version: 2.7.13
CUDA/cuDNN version: 8.0 / 6.0
GPU model and memory: Pascal TITAN X

Describe the problem

2017-10-03 19:45:43.854601: W tensorflow/core/framework/op_kernel.cc:1192] Data loss: corrupted record at 12
Traceback (most recent call last):
File "east_quad_train_backup.py", line 416, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "east_quad_train_backup.py", line 330, in main
Training()
File "east_quad_train_backup.py", line 312, in Training
feed_dict={learning_rate: lr})
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 12
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,512,512,3], [?,128,128,9]], output_types=[DT_UINT8, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"]]
[[Node: gradients/Tile_grad/Shape/_23 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_442_gradients/Tile_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Caused by op u'IteratorGetNext', defined at:
File "east_quad_train_backup.py", line 416, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "east_quad_train_backup.py", line 330, in main
Training()
File "east_quad_train_backup.py", line 251, in Training
batch_image, batch_label = iterator.get_next()
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 304, in get_next
name=name))
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 379, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

DataLossError (see above for traceback): corrupted record at 12
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,512,512,3], [?,128,128,9]], output_types=[DT_UINT8, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"]]
[[Node: gradients/Tile_grad/Shape/_23 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_442_gradients/Tile_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Thanks to anyone who can answer this question.

@cy89

cy89 commented Oct 9, 2017

@huangrandong is this problem repeatable, or did it happen just one time?

@cy89 cy89 added the stat:awaiting response Status - Awaiting response from author label Oct 9, 2017
@huangrandong
Author

@cy89, thank you for your response. This problem has happened many times and comes up whenever I run my program, but I cannot reproduce it deterministically. The cause may be my machine's configuration: the same program runs on another machine without showing the error.

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Oct 9, 2017
@reedwm
Member

reedwm commented Oct 12, 2017

Can you post a small example that will cause the DataLossError after running it for a while, so that we can see what the problem is?

@reedwm reedwm added the stat:awaiting response Status - Awaiting response from author label Oct 12, 2017
@huangrandong
Author

@reedwm my code writes NumPy arrays into a TFRecord file and reads them back from the same file. This is my code:

Code that creates the TFRecord file:

img_tfrecord_name = image_base_name + ".tfrecord"
writer = tf.python_io.TFRecordWriter(new_label_path + img_tfrecord_name)
label_concate = np.concatenate((score_map, x1_offset, y1_offset,
                                x2_offset, y2_offset, x3_offset,
                                y3_offset, x4_offset, y4_offset), axis=-1)
org_train_image = cv2.imread(org_train_images_path + img_name)
org_train_image_resize = cv2.resize(org_train_image,
                                    (input_image_size, input_image_size))
assert org_train_image_resize.shape == (512, 512, 3)
org_train_image_resize = org_train_image_resize.astype(np.uint8)
org_train_image_resize_raw = org_train_image_resize.tostring()
label_concate = label_concate.astype(np.float32)
label_concate_raw = label_concate.tostring()
example = tf.train.Example(
    features=tf.train.Features(
        feature={'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[org_train_image_resize_raw])),
                 'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_concate_raw]))}))
serialized = example.SerializeToString()
writer.write(serialized)
print 'writer ', img_name, ' DOWN!'
writer.close()

Code that reads the TFRecord file:

def _parse_function_for_train(example_proto):
    features = {'image': tf.FixedLenFeature((), tf.string, default_value=""),
                'label': tf.FixedLenFeature((), tf.string, default_value="")}
    parsed_features = tf.parse_single_example(example_proto, features)
    image_raw_out = parsed_features['image']
    label_raw_out = parsed_features['label']
    image_out = tf.decode_raw(image_raw_out, tf.uint8)
    label_out = tf.decode_raw(label_raw_out, tf.float32)
    image_out = tf.reshape(image_out, [512, 512, 3])
    label_out = tf.reshape(label_out, [128, 128, 9])
    return image_out, label_out

def CreateTrainDataset():
    train_image_label_tfrecord_list = ["t1.tfrecord", "t2.tfrecord", ......]
    train_dataset = tf.contrib.data.TFRecordDataset(train_image_label_tfrecord_list)
    train_dataset = train_dataset.map(_parse_function_for_train)
    batched_train_dataset = train_dataset.batch(512)
    return batched_train_dataset

batched_train_dataset = CreateTrainDataset()
iterator = batched_train_dataset.make_initializable_iterator()
batch_image, batch_label = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer)
After the above code runs for some iterations, the DataLossError comes out.

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Oct 13, 2017
@reedwm
Member

reedwm commented Oct 13, 2017

@huangrandong can you post a complete, self-contained example I can copy to a text file and run? In the code above, image_base_name is not defined.

@saxenasaurabh @vrv, any idea what the problem could be?

@reedwm reedwm added the stat:awaiting response Status - Awaiting response from author label Oct 13, 2017
@huangrandong
Author

huangrandong commented Oct 14, 2017

@reedwm you can define the variables that the code doesn't define. The code puts a NumPy image array and a label array into the TFRecord file, then reads the two arrays back from that file.

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Oct 14, 2017
@reedwm
Member

reedwm commented Oct 16, 2017

It's much easier to quickly reproduce these issues if I have a self-contained example without having to define variables. Perhaps the issue only occurs for certain values of x1_offset, for example. So can you please add a complete example?

@guillaumekln
Contributor

I also had reports of this error which appears to occur randomly during the training. It happened on multiple occasions and with different reported offsets (see OpenNMT/OpenNMT-tf#19).

To investigate the issue, I wrote a small script that repeatedly loops over the same TFRecord dataset that threw the error and applies the same processing as done during training. However, I was not able to reproduce it, indicating that no records are corrupted in the file and something else is going on during training.

Any pointers to better investigate the issue would be appreciated.
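For reference, a minimal sketch of that kind of re-reading check, in case it helps others try to reproduce. The filenames and the parse function below are placeholders, not the actual training code:

import tensorflow as tf

filenames = ["train-00000.tfrecord"]  # placeholder: the files that threw the error

def parse_fn(example_proto):
    # Placeholder: apply the same parsing used during training here.
    features = {"image": tf.FixedLenFeature((), tf.string, default_value="")}
    return tf.parse_single_example(example_proto, features)

dataset = tf.data.TFRecordDataset(filenames).map(parse_fn).repeat(10)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    count = 0
    while True:
        try:
            sess.run(next_element)
            count += 1
        except tf.errors.OutOfRangeError:
            print("read %d records without a DataLossError" % count)
            break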

@rjbruin

rjbruin commented Nov 14, 2017

Same problem here. For several different sets of TFRecord files we get this error at random times during training.

@homink

homink commented Nov 14, 2017

I have reproduced the error at the same record location. The first and third runs hit the error in the middle of 'Filling up shuffle buffer', and the second hit it at the beginning of that step. In my case, the error looks highly related to the shuffle buffer, although changing the buffer size didn't help. I hope this is helpful for debugging.

[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918
[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log1
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918
[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log2
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918

@FirefoxMetzger

FirefoxMetzger commented Nov 21, 2017

Allow me to further complicate matters. (Although I am not 100% sure that it is the same issue)

I have some custom data and know that the TFRecord is not corrupt, because I've iterated over it (using the same code) successfully before. Now I've encountered the same situation that homink described.
After restarting my machine it is again working as intended.

Assuming that it is related, is there any caching involved when reading the .tfrecord? Either from tensorflow, python or the OS? (I am currently running it on Win10)

@tjvandal

@FirefoxMetzger I too am having this issue, so I tried restarting my machine as you did, and it did not fix the problem. I'm using Ubuntu 16.04.

@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue? Please update the label and/or status accordingly.

@reedwm
Member

reedwm commented Dec 20, 2017

/CC @mrry @saxenasaurabh, any ideas what the issue could be? This is hard to debug without a small example that reproduces the issue.

@reedwm reedwm added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Dec 20, 2017
@mrry
Contributor

mrry commented Dec 20, 2017

AFAICT, this problem only affects ZLIB-compressed TFRecord files (because that is the sole source of "corrupted record at" in an error message). The source indicates a CRC mismatch. I'm a little surprised that none of the code snippets mention ZLIB compression.

/CC @saxenasaurabh @rohan100jain, who last touched the ZLIB-related code in that file.
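For anyone who is using compression, one thing worth ruling out is a mismatch between the writer and the reader settings, since a framing/CRC mismatch from that looks exactly like a corrupted record. A minimal sketch of keeping the two sides consistent (the file name and payload are placeholders; TF 1.x API):

import tensorflow as tf

path = "data.tfrecord.zlib"  # placeholder file name

# Write with ZLIB compression.
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB)
with tf.python_io.TFRecordWriter(path, options=options) as writer:
    writer.write(b"placeholder serialized tf.train.Example")

# Read it back: the compression_type must match what was used for writing,
# otherwise the record framing/CRC check fails and TensorFlow reports
# "corrupted record at ...".
dataset = tf.data.TFRecordDataset([path], compression_type="ZLIB")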

@guillaumekln
Contributor

I can confirm that the issue was encountered without any compression configured, unless compression is the default (which it is not, AFAIK).

@mrry
Contributor

mrry commented Dec 21, 2017

Pardon my mistake, indeed there are other code paths that can print that message, and each of them is related to a CRC mismatch.

@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@tjvandal

tjvandal commented Jan 6, 2018

Any more thoughts on this? It's a big issue for me, but I don't know where to start debugging. Each time I reprocess my data the errors appear in different locations. Sometimes it takes a couple of training epochs to occur.

@amj

amj commented Jan 20, 2018

/sub

This is happening to us as well, any ideas?

Edit to add: We are using zlib compression, reading a bunch of files off GCS with interleave and shuffling them into one large Dataset; as a result, there's no way to catch the error and try and carry on.

Is it possible this is some GCS transient? I'm also having trouble repeating it with the same data.

@muayyad-alsadi

muayyad-alsadi commented Feb 6, 2019

does the .repeat() understand that?

  dataset = dataset.repeat()

@sjain-stanford

@guillaumekln thanks for the pointer to tf.data.experimental.ignore_errors. I do have a follow-up question on that:

How does it handle tf.errors.OutOfRangeError - does it ignore that too? I use this to track the end of my dataset (validation). When I ignore errors, it seems that the validation loop is stuck upon reaching the end and sess.run doesn't yield anything at that point.

dataset = dataset.apply(tf.data.experimental.ignore_errors())

@guillaumekln
Contributor

does the .repeat() understand that?

I think it does.

How does it handle tf.errors.OutOfRangeError - does it ignore that too? I use this to track the end of my dataset (validation). When I ignore errors, it seems that the validation loop is stuck upon reaching the end and sess.run doesn't yield anything at that point.

Not sure about this. The following snippet does raise the OutOfRangeError exception:

import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.apply(tf.data.experimental.ignore_errors())

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    while True:
        print(sess.run(next_element))

@sjain-stanford

sjain-stanford commented Feb 12, 2019

@guillaumekln You're right, what I'm seeing may not have to do with OutOfRangeError, but execution stalls indefinitely when it encounters corrupt data within a TFRecord, despite using tf.data.experimental.ignore_errors(). I've created #25700 with minimal code to reproduce what I'm seeing. Have you encountered this before?

UPDATE: #25700 (comment)

I think the issue was that, when ignore_errors is used, the same file will repeat as the file_index is not moved forward to completion.

Bug fix by @yongtang in #25705

@yuleung

yuleung commented Apr 20, 2019

(Quoting the TFRecord writer/reader code from @huangrandong's earlier comment in this thread.)

Don't use org_train_image_resize_raw = org_train_image_resize.tostring(); use org_train_image_resize_raw = org_train_image_resize.tobytes() instead.
In my case, this change solved the problem.

@LionnelBall

(Quoting @yuleung's reply above: replace .tostring() with .tobytes() when converting the arrays.)

Why does this modification solve the problem?

@LionnelBall

LionnelBall commented Apr 24, 2019

In my case, I solved this problem in this way:
https://gist.github.com/ed-alertedh/9f49bfc6216585f520c7c7723d20d951
Several TFRecord files were corrupted and can be found using the code in that gist. After removing the corrupted files, everything goes well in training.
The checking process prints something like this:
validating train_feat/391072.tfrecord
error in train_feat/391072.tfrecord at record 391064
corrupted record at 12
validating train_feat/391073.tfrecord
validating train_feat/391074.tfrecord
validating train_feat/391075.tfrecord
validating train_feat/391076.tfrecord
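For anyone who can't open the gist, here is a minimal sketch of the same idea (the glob pattern is a placeholder; TF 1.x API): iterate over every record of every file and catch the DataLossError.

import glob
import tensorflow as tf

for path in sorted(glob.glob("train_feat/*.tfrecord")):  # placeholder pattern
    print("validating " + path)
    try:
        # Iterating forces every record to be read and CRC-checked.
        for _ in tf.python_io.tf_record_iterator(path):
            pass
    except tf.errors.DataLossError as e:
        print("error in %s: %s" % (path, e))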

@eecshope

I ran into this problem once today while using the Google Colab GPU runtime, and I fixed it by simply restarting my notebook. Here is the address of my notebook:
https://colab.research.google.com/drive/1aqKWeqKGSDUiTJFmmlS47MVd8XiFEJWe#scrollTo=jU3twrhvfe9r

fperezgamonal referenced this issue in fperezgamonal/flownet2-tf May 13, 2019
@kimlaintu

I had this issue. After executing sudo sh -c "sync; echo 1 > /proc/sys/vm/drop_caches" or restarting my machine, the problem was fixed temporarily and I could run a few epochs, but the issue came back some time later. Finally, I suspected my RAM and used memtest86 (https://www.youtube.com/watch?v=9_xFNojChNA) to test each module. It turned out that one of my RAM sticks was faulty. I have never had this problem again after removing the faulty RAM.

@decewei

decewei commented Jul 17, 2019

I just restarted my computer and it works. I don't know what the problem is; it could be a memory issue.

@panfeng-hover

Fixed by increasing the number of TFRecord shards.

To check the TFRecord files:

import glob
import os
import tensorflow as tf

total_images = 0
train_files = sorted(glob.glob(os.path.join(tfrecord_path, '*')))
for idx, file in enumerate(train_files):
    try:
        total_images += sum([1 for _ in tf.io.tf_record_iterator(file)])  # Check for corrupted TFRecords
    except:
        print("{}: {} is corrupted".format(idx, file))
print("Succeed, no corrupted tf records found for {} images".format(total_images))

@caozhanxu

I encountered this problem on macOS. The reason was that my directory contained a .DS_Store file, but my code only needs the record data, so once I filtered out the .DS_Store file it ran successfully.
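A minimal sketch of that kind of filtering (the directory name is a placeholder): only pass files that actually end in .tfrecord to the dataset, which skips .DS_Store and other stray files.

import glob
import tensorflow as tf

filenames = sorted(glob.glob("records_dir/*.tfrecord"))  # placeholder directory
dataset = tf.data.TFRecordDataset(filenames)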

@CNugteren
Contributor

CNugteren commented Aug 13, 2019

I encountered an issue where the TFRecords were occasionally genuinely corrupt (verified with the code posted above) when producing them in parallel, but not when producing them in a single process. It turned out that the list of TFRecords to be produced was not unique, which would occasionally make two processes write to the same file on disk at the same time, causing corruption. So if you encounter this issue and you are using parallelism, double-check that your dataset doesn't contain duplicate items.
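A cheap guard against that failure mode is to assert that the list of output file names is unique before launching the parallel writers; a minimal sketch (output_paths is a placeholder name for that list):

from collections import Counter

duplicates = [path for path, count in Counter(output_paths).items() if count > 1]
assert not duplicates, "these files would be written by more than one process: %s" % duplicates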

@Yossarian0916

I also encountered DataLossError: corrupted record at 0 when I used a gzip-compressed TFRecord I had generated to build a training dataset. After I removed the gzip compression when writing the TFRecord, the error was gone. I assume this error was related to the gzip compression type used when generating the TFRecord.
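A related thing to rule out before dropping compression entirely: if a GZIP-compressed TFRecord is read without telling the reader about the compression, the framing looks corrupted from the very first record. A minimal sketch of configuring the reader (the file name is a placeholder):

import tensorflow as tf

dataset = tf.data.TFRecordDataset(["data.tfrecord.gz"], compression_type="GZIP")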

@pank2210

I was getting DataLossError (see above for traceback): Attempted to pad to a smaller size than the input element.
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[8], [8,100], [8,100,4], [8,100,4], [8,100], [8,100], [8,100], [8,100], [8,512,512,3], [8], [8], [8], [8,3]], output_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_INT64, DT_BOOL, DT_FLOAT, DT_FLOAT, DT_STRING, DT_INT32, DT_STRING, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: RandomShuffle_14/_12093 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_8060_RandomShuffle_14", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

I tried everything given in this thread but none of it worked. First of all, iterating over the TFRecords I was not able to detect any corrupt record, so the records were not corrupt, yet I was still getting this error. After multiple trials and errors I found that my current training set had up to 324 boxes for a few images, so all I had to do was update the max-boxes parameter for training:
train_config: {
  batch_size: 4
  max_number_of_boxes: 325
This solved the problem. In a few older versions or other variants of the Object Detection API this param may be under input_reader.
Hope this will help.
I used the code below for record validation.

def val_fun1(filenames):
    dataset = tf.data.TFRecordDataset(filenames)
    #dataset = dataset.apply(tf.data.experimental.ignore_errors())
    dataset = dataset.batch(64)
    dataset = dataset.repeat(1)

    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    with tf.Session() as sess:
        i = 0
        while True:
            try:
                print(i, sess.run(next_element).shape)
                i = i + 1
            except tf.errors.OutOfRangeError:
                print("Dataset complete")
                break

thanks...Pankaj

@Navids71

(Quoting @LionnelBall's comment above about finding the corrupted files with the gist script.)

Hello my friend, I am new to Python. Could you help me with how to run this script?

@Lannister-Xiaolin

I was facing the same problem, but it happened only after training for several epochs.

@wsu13

wsu13 commented Aug 27, 2020

(Quoting @kimlaintu's comment above about drop_caches and the faulty RAM.)

I encountered the same problem. "sync; echo 1 > /proc/sys/vm/drop_caches" works for me.

@dasfinux

(Quoting @Yossarian0916's comment above about the gzip compression issue.)

Thank you, your answer resolved a problem that had bothered me for three hours.

@xeisberg

(Quoting @Yossarian0916's comment above about the gzip compression issue.)

I had a similar problem while preparing my data as gzip files to later train a BERT model.
The problem was solved by simply gunzipping them (gunzip -r pre*), after which I could train the BERT model without a problem. Oddly, creating the data without gzip did not work for me: I had to first create the data with gzip and then unzip it; it was not possible to create the TFRecord files directly.

@isrishtisingh

My problem with this is that I really had a corrupted TFRecord file: I was sending it to another machine, the sending process was interrupted, but the partial file remained there. I didn't notice that it was just part of a file ...
So you can check your TFRecord files with some simple processing:

import tensorflow as tf
import glob

total_images = 0
train_files = sorted(glob.glob('./train*.tfrecord'))
for f_i, file in enumerate(train_files):
    print(f_i)
    total_images += sum([1 for _ in tf.python_io.tf_record_iterator(file)])

This code raises an exception when it reaches the corrupted tfrecord file (exception was triggered in tf.python_io.tf_record_iterator(file)).

This is really gold for finding the corrupted files. Thanks!

Where should I add this code? Before training?
I am really new to this, so I apologize if it's too basic.
