
spark input_file_name() not working in cobrix #221

Closed
kriswijnants opened this issue Dec 6, 2019 · 20 comments

@kriswijnants commented Dec 6, 2019

Hi,

Thank you for creating and maintaining Cobrix. It's a tool we discovered recently, and we plan to implement it in our cloud data platform for our mainframe project.

Just a small question: we noticed that Spark's input_file_name() function always returns blanks when using Cobrix, in combination with the option("is_record_sequence", "true") option.

spark.read.format("cobol").option("copybook", "/mnt/inputMDP/BIWA_GUTEX/Copybooks/"+dbutils.widgets.get("version")+"/GAGUSECO_20070115.txt").option("is_record_sequence", "true").load("/mnt/inputMDP/BIWA_GUTEX/Datafiles/"+dbutils.widgets.get("version")+"/GA-GA324001*").withColumn("ISN_Source", input_file_name).createOrReplaceTempView("vw_gutex_GA")

Do you notice the same behaviour? Is there any chance to get this working?

Keep up the good work!

Regards,

Kris

@yruslan (Collaborator) commented Dec 10, 2019

Thanks for reporting the issue!

Looks interesting. Will take a look.

@yruslan yruslan added the accepted Accepted for implementation label Dec 10, 2019
@yruslan yruslan self-assigned this Dec 10, 2019
yruslan added a commit that referenced this issue Dec 10, 2019
yruslan added a commit that referenced this issue Dec 10, 2019
@yruslan yruslan added the enhancement New feature or request label Dec 10, 2019
@yruslan (Collaborator) commented Dec 10, 2019

I can confirm the issue. Indeed, for variable-record-size files input_file_name() returns an empty string. That is due to the way we handle sparse index creation to parallelize the reading of such files.

It will take a while to fix this properly (we probably need to create a custom RDD), but we can add a workaround that generates a column with the input file name for each record. That's what we are going to do first. It would look like this:

.option("with_input_file_name_col", "ISN_Source")

@yruslan (Collaborator) commented Dec 10, 2019

Just to double-check: which Spark version are you using?

We are planning to release Cobrix 2.0.0 first, and all further changes will be made there. Note that it will support Spark 2.4 or above.

@kriswijnants (Author) commented Dec 10, 2019 via email

@yruslan (Collaborator) commented Dec 10, 2019

Great! Cobrix 2.0.0 is planned to be released this week, and the workaround for this issue can be expected sometime next week.

@kriswijnants (Author) commented Dec 10, 2019 via email

@yruslan yruslan added the help wanted Extra attention is needed label Dec 12, 2019
yruslan added a commit that referenced this issue Dec 16, 2019
yruslan added a commit that referenced this issue Dec 17, 2019
yruslan added a commit that referenced this issue Dec 17, 2019
@yruslan (Collaborator) commented Dec 17, 2019

This should be fixed in the latest snapshot. Please try:

```xml
<dependency>
    <groupId>za.co.absa.cobrix</groupId>
    <artifactId>spark-cobol_2.11</artifactId>
    <version>2.0.1-SNAPSHOT</version>
</dependency>
```

and let me know if the issue is fixed.
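For sbt users, the equivalent coordinates would presumably be the following; note that resolving a -SNAPSHOT build also needs a snapshots resolver (both lines are an assumption, not something stated in this thread):

```scala
// Assumed sbt equivalent of the Maven snippet above; the resolver URL is an assumption.
resolvers += "Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
libraryDependencies += "za.co.absa.cobrix" %% "spark-cobol" % "2.0.1-SNAPSHOT"
```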

@yruslan yruslan added this to the 2.0.1 milestone Dec 17, 2019
@kriswijnants (Author) commented Dec 17, 2019 via email

@yruslan (Collaborator) commented Dec 18, 2019

Forgot to mention: in order to get the input file name for each record of a variable-record-length file, a workaround is used. In your case the option looks like this:

```scala
.option("with_input_file_name_col", "ISN_Source")
```

I'd also recommend adding

```scala
.option("pedantic", "true")
```

so that unrecognized options cause errors instead of being silently ignored.
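Putting the two options together, the read from the original report would then look roughly like this (a sketch; the reader generates ISN_Source itself, so the withColumn call goes away):

```scala
spark.read.format("cobol")
  .option("copybook", "/mnt/inputMDP/BIWA_GUTEX/Copybooks/" + dbutils.widgets.get("version") + "/GAGUSECO_20070115.txt")
  .option("is_record_sequence", "true")
  .option("pedantic", "true")                        // misspelled options now fail fast
  .option("with_input_file_name_col", "ISN_Source")  // replaces withColumn + input_file_name()
  .load("/mnt/inputMDP/BIWA_GUTEX/Datafiles/" + dbutils.widgets.get("version") + "/GA-GA324001*")
  .createOrReplaceTempView("vw_gutex_GA")
```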

@yruslan yruslan closed this as completed Dec 20, 2019
@kriswijnants (Author) commented Dec 20, 2019 via email

@yruslan (Collaborator) commented Dec 20, 2019

Hi Kris,

Linking a snapshot version requires additional configuration in .m2/settings.xml, and it can be even harder on managed clusters.

Try setting the version to 2.0.1, which was released today.

And please let me know if it worked for you.

Thank you,
Ruslan

@kriswijnants (Author) commented Dec 20, 2019 via email

@bart-at-qqdatafruits commented Feb 20, 2020

## Environment

Docker: jupyter/all-spark-notebook:latest + Apache Toree (Scala)

## Issue

When using

```scala
.option("file_start_offset", "600")
.option("file_end_offset", "600")
```

input_file_name() no longer works.

### Anonymized extract

```
%AddDeps za.co.absa.cobrix spark-cobol_2.11 2.0.3 --transitive
```

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val sparkBuilder = SparkSession.builder().appName("Example")
val spark = sparkBuilder.getOrCreate()

spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK.txt")
  .option("file_start_offset", "600")
  .option("file_end_offset", "600")
  .load("file:///home/jovyan/data/BRAND/initial_transformed/FILEPATTERN*")
  .withColumn("DPSource", callUDF("get_file_name", input_file_name()))
```

```scala
cobolDataframe
  //.filter("RECORD.ID % 2 = 0") // filter the even values of the nested field 'RECORD_LENGTH'
  .take(20)
  .foreach(v => println(v))
```

@kriswijnants (Author) commented Feb 20, 2020 via email

@yruslan (Collaborator) commented Feb 20, 2020

Hi Kris,
When you use file offsets, a different reader is used. Use the workaround for this case instead of input_file_name():

```scala
.option("with_input_file_name_col", "DPSource")
```

@kriswijnants (Author) commented Feb 21, 2020 via email

@bart-at-qqdatafruits

Hi Ruslan,

"with_input_file_name_col" seems be intended for "is_record_sequence = true" only.

In this case I have a copy book (fixed lenth) where the copybook does not mention the Header and footer.

Possibly actions I should take are:

  • get rid off the header and footer in a pre-prosessing (a less clean solution, to be avoided)
  • try to rewrite the copybook to accomodate header and footer (ideal solution, maybe as it should) consisting of several record types. I will look into this next.
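For what it's worth, a sketch of how the second option might look once the copybook describes the record types. This assumes Cobrix's documented segment_field / segment_filter options apply to this case; COPYBOOK_MULTIRECORD.txt, RECORD_TYPE, and the value "D" are hypothetical placeholders, not from this thread:

```scala
// Hypothetical sketch: assumes the rewritten copybook has a discriminator field
// (RECORD_TYPE) shared by the header, detail, and footer layouts.
val details = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  .option("pedantic", "true")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK_MULTIRECORD.txt")
  .option("is_record_sequence", "true")
  .option("segment_field", "RECORD_TYPE")          // field that identifies the record layout
  .option("segment_filter", "D")                   // keep only "detail" records, drop header/footer
  .option("with_input_file_name_col", "DPSource")  // works when is_record_sequence = true
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
```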

I value your opinion. Mainframe code can be messy; it is a trade-off between handling source particularities out of the box and keeping the Cobrix code maintainable.

Thanks in advance,

Regards, Bart

A test of your suggestion:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

spark.udf.register("get_file_name", (path: String) => path.split("/").last)

val cobolDataframe = spark
  .read
  .format("za.co.absa.cobrix.spark.cobol.source")
  //.option("is_record_sequence", "true")
  //.option("generate_record_id", "true") // for comparison with the unconverted (Windows) file only
  .option("pedantic", "true")
  //.option("with_input_file_name_col", "DPSourceTemp")
  .option("copybook", "file:///home/jovyan/data/BRAND/COPYBOOK.txt")
  .option("file_start_offset", "600")
  .option("file_end_offset", "600")
  .option("with_input_file_name_col", "DPSourceTemp")
  .load("file:///home/jovyan/data/BRAND/initial_transformed")
```

The result:

```
java.lang.IllegalArgumentException: Option 'with_input_file_name_col' is supported only when 'is_record_sequence' = true.
  at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.validateSparkCobolOptions(CobolParametersParser.scala:467)
  at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersParser$.parse(CobolParametersParser.scala:209)
  at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:56)
  at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:48)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
```

@yruslan (Collaborator) commented Feb 21, 2020

Interesting. I will take a look. I think this can be easily fixed so that with_input_file_name_col would work in your case.

@yruslan (Collaborator) commented Feb 21, 2020

Opened #252 to continue the discussion there, since the incompatibility between with_input_file_name_col and file_start_offset is a separate issue.

@kriswijnants (Author) commented Feb 21, 2020 via email
