
Posts

Showing posts from February, 2020



Access AWS S3 or HCP HS3 (Hitachi) using Hadoop or HDFS or Distcp

Create Credentials File for S3 Keys

hadoop credential create fs.s3a.access.key -value <Access_KEY> -provider localjceks://file/$HOME/aws-dev-keys.jceks
hadoop credential create fs.s3a.secret.key -value <Secret_KEY> -provider localjceks://file/$HOME/aws-dev-keys.jceks

Where:
<Access_KEY> - S3 access key
<Secret_KEY> - S3 secret key

Note - this creates a file named aws-dev-keys.jceks in the home directory on the local file system. Put this file on HDFS for distributed access.

To list the stored credentials, execute the command below:

hadoop credential list -provider localjceks://file/$HOME/aws-dev-keys.jceks

List files in an S3 bucket with the Hadoop shell:

hdfs dfs -Dhadoop.security.credential.provider.path=jceks://hdfs/myfilelocation/aws-dev-keys.jceks -ls s3a://s3bucketname/
hdfs dfs -Dfs.s3a.access.key=<Access_KEY> -Dfs.s3a.secret.key=<Secret_KEY> -ls s3a://aa-daas-ookla/

Note - Similarly, other hadoop/ ...
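The title also mentions DistCp; a minimal sketch of a copy from HDFS to S3 using the same credentials file is below. The source path /data/src and the destination bucket are placeholders, and for Hitachi HCP HS3 the fs.s3a.endpoint property would point at the HCP gateway instead of AWS.

hadoop distcp \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/myfilelocation/aws-dev-keys.jceks \
  /data/src s3a://s3bucketname/dest

# For HCP HS3, additionally pass the endpoint (hypothetical hostname):
#   -Dfs.s3a.endpoint=hcp.example.com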

Install AWS CLI in a Virtual Environment

Create a virtual environment for your project:

mkdir $HOME/py36venv
python3 -m venv $HOME/py36venv

Activate the Python 3.6 virtual environment:

source $HOME/py36venv/bin/activate

Install the AWS command line and configure it:

pip install awscli
chmod 755 $HOME/py36venv/bin/aws
aws --version
aws configure
AWS Access Key ID [None]: ----------------------
AWS Secret Access Key [None]: ----+----+---------------
Default region name [None]: us-east-2
Default output format [None]:

Example usage:

aws s3 ls
aws s3 sync local_dir/ s3://my-s3-bucket
aws s3 sync s3://my-s3-bucket local_dir/
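Since aws s3 sync only transfers files that are new or changed, a dry run is a safe way to preview what would be copied before running it for real; the bucket name below is the same placeholder used above.

aws s3 sync local_dir/ s3://my-s3-bucket --dryrun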

spark.sql.utils.AnalysisException: cannot resolve 'INPUT__FILE__NAME'

I have a Hive SQL query:

select regexp_extract(`unenriched`.`input__file__name`, '[^/]*$', 0) `SRC_FILE_NM` from dl.table1;

This query fails when run with Spark:

spark.sql.utils.AnalysisException: u"cannot resolve 'INPUT__FILE__NAME' given input columns: ..."

Analysis - INPUT__FILE__NAME is a Hive-specific virtual column and is not supported in Spark.

Solution - Spark provides an input_file_name() function which works in a similar way:

SELECT input_file_name() FROM df

but it requires Spark 2.0 or later to work correctly.
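As a sketch of the fix in PySpark, the same filename extraction can be done with the built-in input_file_name and regexp_extract functions. This assumes a SparkSession named spark (as in the PySpark shell) and reuses the table and column names from the failing query above as placeholders.

from pyspark.sql.functions import input_file_name, regexp_extract

# Read the Hive table; dl.table1 is the table from the failing query above.
df = spark.table("dl.table1")

# Extract the trailing file name from the full input path,
# mirroring the original regexp_extract(..., '[^/]*$', 0).
df = df.withColumn("SRC_FILE_NM", regexp_extract(input_file_name(), "[^/]*$", 0))
df.select("SRC_FILE_NM").show(truncate=False)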