
DataLossError (see above for traceback): corrupted record at 12 #13463

Closed
huangrandong opened this issue Oct 3, 2017 · 79 comments
Labels
stat:awaiting response Status - Awaiting response from author

Comments

@huangrandong

I have a big problem. I use a TFRecord file to feed data into my TensorFlow program, but after the program has been running for a while it raises a DataLossError:

System information

OS Platform and Distribution : Linux Ubuntu 14.04
TensorFlow installed from : Anaconda
TensorFlow version : 1.3.0
Python version: 2.7.13
CUDA/cuDNN version: 8.0 / 6.0
GPU model and memory: Pascal TITAN X

Describe the problem

2017-10-03 19:45:43.854601: W tensorflow/core/framework/op_kernel.cc:1192] Data loss: corrupted record at 12
Traceback (most recent call last):
File "east_quad_train_backup.py", line 416, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "east_quad_train_backup.py", line 330, in main
Training()
File "east_quad_train_backup.py", line 312, in Training
feed_dict={learning_rate: lr})
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 12
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,512,512,3], [?,128,128,9]], output_types=[DT_UINT8, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"]]
[[Node: gradients/Tile_grad/Shape/_23 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_442_gradients/Tile_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Caused by op u'IteratorGetNext', defined at:
File "east_quad_train_backup.py", line 416, in
tf.app.run(main=main, argv=[sys.argv[0]] + unparsed)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "east_quad_train_backup.py", line 330, in main
Training()
File "east_quad_train_backup.py", line 251, in Training
batch_image, batch_label = iterator.get_next()
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/contrib/data/python/ops/dataset_ops.py", line 304, in get_next
name=name))
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_dataset_ops.py", line 379, in iterator_get_next
output_shapes=output_shapes, name=name)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/t/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

DataLossError (see above for traceback): corrupted record at 12
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[?,512,512,3], [?,128,128,9]], output_types=[DT_UINT8, DT_FLOAT], _device="/job:localhost/replica:0/task:0/cpu:0"]]
[[Node: gradients/Tile_grad/Shape/_23 = _HostRecvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/gpu:0", send_device="/job:localhost/replica:0/task:0/cpu:0", send_device_incarnation=1, tensor_name="edge_442_gradients/Tile_grad/Shape", tensor_type=DT_INT32, _device="/job:localhost/replica:0/task:0/gpu:0"]]

Thanks to anyone who can answer this question.

@cy89

cy89 commented Oct 9, 2017

@huangrandong is this problem repeatable, or did it happen just one time?

@cy89 cy89 added the stat:awaiting response Status - Awaiting response from author label Oct 9, 2017
@huangrandong
Author

@cy89, thank you for your response. This problem has happened many times and comes up whenever I run my program, but I cannot reproduce it deterministically. The cause may be my machine's configuration: the same program runs on another machine without showing the error.

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Oct 9, 2017
@reedwm
Member

reedwm commented Oct 12, 2017

Can you post a small example that will cause the DataLossError after running it for a while, so that we can see what the problem is?

@reedwm reedwm added the stat:awaiting response Status - Awaiting response from author label Oct 12, 2017
@huangrandong
Author

@reedwm my code writes NumPy arrays into a TFRecord file and reads them back from the same file. This is my code:

Code that creates the TFRecord file:

img_tfrecord_name = image_base_name + ".tfrecord"
writer = tf.python_io.TFRecordWriter(new_label_path + img_tfrecord_name)
label_concate = np.concatenate((score_map, x1_offset, y1_offset,
                                x2_offset, y2_offset, x3_offset,
                                y3_offset, x4_offset, y4_offset), axis=-1)
org_train_image = cv2.imread(org_train_images_path + img_name)
org_train_image_resize = cv2.resize(org_train_image,
                                    (input_image_size, input_image_size))
assert org_train_image_resize.shape == (512, 512, 3)
org_train_image_resize = org_train_image_resize.astype(np.uint8)
org_train_image_resize_raw = org_train_image_resize.tostring()
label_concate = label_concate.astype(np.float32)
label_concate_raw = label_concate.tostring()
example = tf.train.Example(
    features=tf.train.Features(
        feature={'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[org_train_image_resize_raw])),
                 'label': tf.train.Feature(bytes_list=tf.train.BytesList(value=[label_concate_raw]))}))
serialized = example.SerializeToString()
writer.write(serialized)
print 'writer ', img_name, ' DOWN!'
writer.close()

Code that reads the TFRecord file:

def _parse_function_for_train(example_proto):
    features = {'image': tf.FixedLenFeature((), tf.string, default_value=""),
                'label': tf.FixedLenFeature((), tf.string, default_value="")}
    parsed_features = tf.parse_single_example(example_proto, features)
    image_raw_out = parsed_features['image']
    label_raw_out = parsed_features['label']
    image_out = tf.decode_raw(image_raw_out, tf.uint8)
    label_out = tf.decode_raw(label_raw_out, tf.float32)
    image_out = tf.reshape(image_out, [512, 512, 3])
    label_out = tf.reshape(label_out, [128, 128, 9])
    return image_out, label_out

def CreateTrainDataset():
    train_image_label_tfrecord_list = ["t1.tfrecord", "t2.tfrecord", ......]
    train_dataset = tf.contrib.data.TFRecordDataset(train_image_label_tfrecord_list)
    train_dataset = train_dataset.map(_parse_function_for_train)
    batched_train_dataset = train_dataset.batch(512)
    return batched_train_dataset

batched_train_dataset = CreateTrainDataset()
iterator = batched_train_dataset.make_initializable_iterator()
batch_image, batch_label = iterator.get_next()
with tf.Session() as sess:
    sess.run(iterator.initializer)
After the above code runs for some iterations, the DataLossError comes out.

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Oct 13, 2017
@reedwm
Member

reedwm commented Oct 13, 2017

@huangrandong can you post a complete, self-contained example I can copy to a text file and run? In the code above, image_base_name is not defined.

@saxenasaurabh @vrv, any idea what the problem could be?

@reedwm reedwm added the stat:awaiting response Status - Awaiting response from author label Oct 13, 2017
@huangrandong
Author

huangrandong commented Oct 14, 2017

@reedwm you can define the variables that the code doesn't define. The code puts a NumPy image array and a label array into the TFRecord file, then reads the two arrays back from that file.

@aselle aselle removed the stat:awaiting response Status - Awaiting response from author label Oct 14, 2017
@reedwm
Member

reedwm commented Oct 16, 2017

It's much easier to quickly reproduce these issues if I have a self-contained example without having to define variables. Perhaps the issue only occurs for certain values of x1_offset, for example. So can you please add a complete example?

@guillaumekln
Contributor

I also had reports of this error which appears to occur randomly during the training. It happened on multiple occasions and with different reported offsets (see OpenNMT/OpenNMT-tf#19).

To investigate the issue, I wrote a small script that repeatedly loops over the same TFRecord dataset that threw the error and applies the same processing as done during training. However, I was not able to reproduce it, indicating that no records are corrupted in the file and something else is going on during training.

Any pointers to better investigate the issue would be appreciated.
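For reference, a minimal sketch of that kind of re-reading check, in case it helps others try to reproduce. The filenames and the parse function below are placeholders, not the actual training code:

import tensorflow as tf

filenames = ["train-00000.tfrecord"]  # placeholder: the files that threw the error

def parse_fn(example_proto):
    # Placeholder: apply the same parsing used during training here.
    features = {"image": tf.FixedLenFeature((), tf.string, default_value="")}
    return tf.parse_single_example(example_proto, features)

dataset = tf.data.TFRecordDataset(filenames).map(parse_fn).repeat(10)
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    count = 0
    while True:
        try:
            sess.run(next_element)
            count += 1
        except tf.errors.OutOfRangeError:
            print("read %d records without a DataLossError" % count)
            break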

@rjbruin

rjbruin commented Nov 14, 2017

Same problem here. For several different sets of TFRecord files we get this error at random times during training.

@homink

homink commented Nov 14, 2017

I have reproduced the error at the same record location. The first and third runs hit the error in the middle of 'Filling up shuffle buffer', and the second hit it at the beginning of that step. In my case, the error looks highly related to the shuffle buffer, although changing the buffer size didn't help. I hope this is helpful for debugging.

[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918
[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log1
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918
[kwon@ssi-dnn-slave-002 wsj_kaldi_tf]$ grep DataLossError wsj.log2
tensorflow.python.framework.errors_impl.DataLossError: corrupted record at 3449023918
DataLossError (see above for traceback): corrupted record at 3449023918

@FirefoxMetzger

FirefoxMetzger commented Nov 21, 2017

Allow me to further complicate matters. (Although I am not 100% sure that it is the same issue)

I have some custom data and know that the TFRecord is not corrupt, because I've iterated over it (using the same code) successfully before. Now I've encountered the same situation that homink described.
After restarting my machine it is again working as intended.

Assuming that it is related, is there any caching involved when reading the .tfrecord? Either from tensorflow, python or the OS? (I am currently running it on Win10)

@tjvandal

@FirefoxMetzger I too am having this issue, so I tried restarting my machine as you did, and it did not fix the problem. I'm using Ubuntu 16.04.

@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting response label was assigned. Is this still an issue? Please update the label and/or status accordingly.

@reedwm
Member

reedwm commented Dec 20, 2017

/CC @mrry @saxenasaurabh, any ideas what the issue could be? This is hard to debug without a small example that reproduces the issue.

@reedwm reedwm added stat:awaiting tensorflower Status - Awaiting response from tensorflower and removed stat:awaiting response Status - Awaiting response from author labels Dec 20, 2017
@mrry
Contributor

mrry commented Dec 20, 2017

AFAICT, this problem only affects ZLIB-compressed TFRecord files (because that is the sole source of "corrupted record at" in an error message). The source indicates a CRC mismatch. I'm a little surprised that none of the code snippets mention ZLIB compression.

/CC @saxenasaurabh @rohan100jain, who last touched the ZLIB-related code in that file.
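For anyone who is using compression, one thing worth ruling out is a mismatch between the writer and the reader settings, since a framing/CRC mismatch from that looks exactly like a corrupted record. A minimal sketch of keeping the two sides consistent (the file name and payload are placeholders; TF 1.x API):

import tensorflow as tf

path = "data.tfrecord.zlib"  # placeholder file name

# Write with ZLIB compression.
options = tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB)
with tf.python_io.TFRecordWriter(path, options=options) as writer:
    writer.write(b"placeholder serialized tf.train.Example")

# Read it back: the compression_type must match what was used for writing,
# otherwise the record framing/CRC check fails and TensorFlow reports
# "corrupted record at ...".
dataset = tf.data.TFRecordDataset([path], compression_type="ZLIB")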

@guillaumekln
Contributor

I can confirm that the issue was encountered without any compression configured, unless compression is the default (which it is not, AFAIK).

@mrry
Contributor

mrry commented Dec 21, 2017

Pardon my mistake, indeed there are other code paths that can print that message, and each of them is related to a CRC mismatch.

@tensorflowbutler
Member

It has been 14 days with no activity and the awaiting tensorflower label was assigned. Please update the label and/or status accordingly.

@tjvandal

tjvandal commented Jan 6, 2018

Any more thoughts on this? It's a big issue for me, but I don't know where to start debugging. Each time I reprocess my data the errors appear in different locations. Sometimes it takes a couple of training epochs to occur.

@amj

amj commented Jan 20, 2018

/sub

This is happening to us as well, any ideas?

Edit to add: We are using zlib compression, reading a bunch of files off GCS with interleave and shuffling them into one large Dataset; as a result, there's no way to catch the error and try and carry on.

Is it possible this is some GCS transient? I'm also having trouble repeating it with the same data.

@muayyad-alsadi

muayyad-alsadi commented Feb 6, 2019

does the .repeat() understand that?

  dataset = dataset.repeat()

@sjain-stanford

@guillaumekln thanks for the pointer to tf.data.experimental.ignore_errors. I do have a follow-up question on that:

How does it handle tf.errors.OutOfRangeError - does it ignore that too? I use this to track the end of my dataset (validation). When I ignore errors, it seems that the validation loop is stuck upon reaching the end and sess.run doesn't yield anything at that point.

dataset = dataset.apply(tf.data.experimental.ignore_errors())

@guillaumekln
Contributor

does the .repeat() understand that?

I think it does.

How does it handle tf.errors.OutOfRangeError - does it ignore that too? I use this to track the end of my dataset (validation). When I ignore errors, it seems that the validation loop is stuck upon reaching the end and sess.run doesn't yield anything at that point.

Not sure about this. The following snippet does raise the OutOfRangeError exception:

import tensorflow as tf

dataset = tf.data.Dataset.range(10)
dataset = dataset.apply(tf.data.experimental.ignore_errors())

iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
with tf.Session() as sess:
    while True:
        print(sess.run(next_element))

@sjain-stanford

sjain-stanford commented Feb 12, 2019

@guillaumekln You're right, what I'm seeing may not have to do with OutOfRangeError, but execution stalls indefinitely when it encounters corrupt data within a TFRecord, despite using tf.data.experimental.ignore_errors(). I've created #25700 with minimal code to reproduce what I'm seeing. Have you encountered this before?

UPDATE: #25700 (comment)

I think the issue was that, when ignore_errors is used, the same file will repeat as the file_index is not moved forward to completion.

Bug fix by @yongtang in #25705

@yuleung

yuleung commented Apr 20, 2019

(Quoting the TFRecord writer/reader code from @huangrandong's earlier comment in this thread.)

Don't use org_train_image_resize_raw = org_train_image_resize.tostring(); use org_train_image_resize_raw = org_train_image_resize.tobytes() instead.
In my case, this change solved the problem.

@LionnelBall

(Quoting @yuleung's reply above: replace .tostring() with .tobytes() when converting the arrays.)

Why does this modification solve the problem?

@LionnelBall

LionnelBall commented Apr 24, 2019

In my case, I solved this problem in this way:
https://gist.github.com/ed-alertedh/9f49bfc6216585f520c7c7723d20d951
Several TFRecord files were corrupted and can be found using the code in that gist. After removing the corrupted files, everything goes well in training.
The checking process prints something like this:
validating train_feat/391072.tfrecord
error in train_feat/391072.tfrecord at record 391064
corrupted record at 12
validating train_feat/391073.tfrecord
validating train_feat/391074.tfrecord
validating train_feat/391075.tfrecord
validating train_feat/391076.tfrecord
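For anyone who can't open the gist, here is a minimal sketch of the same idea (the glob pattern is a placeholder; TF 1.x API): iterate over every record of every file and catch the DataLossError.

import glob
import tensorflow as tf

for path in sorted(glob.glob("train_feat/*.tfrecord")):  # placeholder pattern
    print("validating " + path)
    try:
        # Iterating forces every record to be read and CRC-checked.
        for _ in tf.python_io.tf_record_iterator(path):
            pass
    except tf.errors.DataLossError as e:
        print("error in %s: %s" % (path, e))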

@eecshope

I ran into this problem once today while using the Google Colab GPU runtime, and I fixed it by simply restarting my notebook. Here is the address of my notebook:
https://colab.research.google.com/drive/1aqKWeqKGSDUiTJFmmlS47MVd8XiFEJWe#scrollTo=jU3twrhvfe9r

fperezgamonal referenced this issue in fperezgamonal/flownet2-tf May 13, 2019
@kimlaintu

I had this issue. After executing sudo sh -c "sync; echo 1 > /proc/sys/vm/drop_caches" or restarting my machine, the problem was fixed temporarily and I could run a few epochs, but the issue came back some time later. Finally, I suspected my RAM and used memtest86 (https://www.youtube.com/watch?v=9_xFNojChNA) to test each module. It turned out that one of my RAM sticks was faulty. I have never had this problem again after removing the faulty RAM.

@decewei

decewei commented Jul 17, 2019

I just restarted my computer and it works. I don't know what the problem is; it could be a memory issue.

@panfeng-hover

Fixed by increasing the number of TFRecord shards.

To check the TFRecord files:

import glob
import os
import tensorflow as tf

total_images = 0
train_files = sorted(glob.glob(os.path.join(tfrecord_path, '*')))
for idx, file in enumerate(train_files):
    try:
        total_images += sum([1 for _ in tf.io.tf_record_iterator(file)])  # Check for corrupted TFRecords
    except:
        print("{}: {} is corrupted".format(idx, file))
print("Succeed, no corrupted tf records found for {} images".format(total_images))

@caozhanxu

I encountered this problem on macOS. The reason was that my directory contained a .DS_Store file, but my code only needs the record data, so once I filtered out the .DS_Store file it ran successfully.
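A minimal sketch of that kind of filtering (the directory name is a placeholder): only pass files that actually end in .tfrecord to the dataset, which skips .DS_Store and other stray files.

import glob
import tensorflow as tf

filenames = sorted(glob.glob("records_dir/*.tfrecord"))  # placeholder directory
dataset = tf.data.TFRecordDataset(filenames)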

@CNugteren
Contributor

CNugteren commented Aug 13, 2019

I encountered an issue where the TFRecords were occasionally genuinely corrupt (verified with the code posted above) when producing them in parallel, but not when producing them in a single process. It turned out that the list of TFRecords to be produced was not unique, which would occasionally make two processes write to the same file on disk at the same time, causing corruption. So if you encounter this issue and you are using parallelism, double-check that your dataset doesn't contain duplicate items.
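A cheap guard against that failure mode is to assert that the list of output file names is unique before launching the parallel writers; a minimal sketch (output_paths is a placeholder name for that list):

from collections import Counter

duplicates = [path for path, count in Counter(output_paths).items() if count > 1]
assert not duplicates, "these files would be written by more than one process: %s" % duplicates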

@Yossarian0916

I also encountered DataLossError: corrupted record at 0 when I used a gzip-compressed TFRecord I had generated to build a training dataset. After I removed the gzip compression when writing the TFRecord, the error was gone. I assume this error was related to the gzip compression type used when generating the TFRecord.
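A related thing to rule out before dropping compression entirely: if a GZIP-compressed TFRecord is read without telling the reader about the compression, the framing looks corrupted from the very first record. A minimal sketch of configuring the reader (the file name is a placeholder):

import tensorflow as tf

dataset = tf.data.TFRecordDataset(["data.tfrecord.gz"], compression_type="GZIP")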

@pank2210

I was getting DataLossError (see above for traceback): Attempted to pad to a smaller size than the input element.
[[Node: IteratorGetNext = IteratorGetNextoutput_shapes=[[8], [8,100], [8,100,4], [8,100,4], [8,100], [8,100], [8,100], [8,100], [8,512,512,3], [8], [8], [8], [8,3]], output_types=[DT_STRING, DT_FLOAT, DT_FLOAT, DT_FLOAT, DT_INT64, DT_INT64, DT_BOOL, DT_FLOAT, DT_FLOAT, DT_STRING, DT_INT32, DT_STRING, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"]]
[[Node: RandomShuffle_14/_12093 = _Recvclient_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device_incarnation=1, tensor_name="edge_8060_RandomShuffle_14", tensor_type=DT_INT64, _device="/job:localhost/replica:0/task:0/device:GPU:0"]]

I tried everything given in this thread but none of it worked. First of all, iterating over the TFRecords I was not able to detect any corrupt record, so the records were not corrupt, yet I was still getting this error. After multiple trials and errors I found that my current training set had up to 324 boxes for a few images, so all I had to do was update the max-boxes parameter for training:
train_config: {
  batch_size: 4
  max_number_of_boxes: 325
This solved the problem. In a few older versions or other variants of the Object Detection API this param may be under input_reader.
Hope this will help.
I used the code below for record validation.

def val_fun1(filenames):
    dataset = tf.data.TFRecordDataset(filenames)
    #dataset = dataset.apply(tf.data.experimental.ignore_errors())
    dataset = dataset.batch(64)
    dataset = dataset.repeat(1)

    iterator = dataset.make_one_shot_iterator()
    next_element = iterator.get_next()

    with tf.Session() as sess:
        i = 0
        while True:
            try:
                print(i, sess.run(next_element).shape)
                i = i + 1
            except tf.errors.OutOfRangeError:
                print("Dataset complete")
                break

thanks...Pankaj

@Navids71

(Quoting @LionnelBall's comment above about finding the corrupted files with the gist script.)

Hello my friend, I am new to Python. Could you help me with how to run this script?

@Lannister-Xiaolin

I was facing the same problem, but it happened only after training for several epochs.

@wsu13

wsu13 commented Aug 27, 2020

(Quoting @kimlaintu's comment above about drop_caches and the faulty RAM.)

I encountered the same problem. "sync; echo 1 > /proc/sys/vm/drop_caches" works for me.

@dasfinux

(Quoting @Yossarian0916's comment above about the gzip compression issue.)

Thank you, your answer resolved a problem that had bothered me for three hours.

@xeisberg

(Quoting @Yossarian0916's comment above about the gzip compression issue.)

I had a similar problem while preparing my data as gzip files to later train a BERT model.
The problem was solved by simply gunzipping them (gunzip -r pre*), after which I could train the BERT model without a problem. Oddly, creating the data without gzip did not work for me: I had to first create the data with gzip and then unzip it; it was not possible to create the TFRecord files directly.

@isrishtisingh

My problem with this is that I really had a corrupted TFRecord file: I was sending it to another machine, the sending process was interrupted, but the partial file remained there. I didn't notice that it was just part of a file ...
So you can check your TFRecord files with some simple processing:

import tensorflow as tf
import glob

total_images = 0
train_files = sorted(glob.glob('./train*.tfrecord'))
for f_i, file in enumerate(train_files):
    print(f_i)
    total_images += sum([1 for _ in tf.python_io.tf_record_iterator(file)])

This code raises an exception when it reaches the corrupted tfrecord file (exception was triggered in tf.python_io.tf_record_iterator(file)).

This is really gold for finding the corrupted files. Thanks!

Where should I add this code? Before training?
I am really new to this, so I apologize if it's too basic.
