Network Security Internet Technology Development Database Servers Mobile Phone Android Software Apple Software Computer Software News IT Information

In addition to Weibo, there is also WeChat

Please pay attention

WeChat public account

Shulou

What is the format of Data Lake Analytics + OSS data files?

2025-01-15 Update From: SLTechnology News&Howtos shulou NAV: SLTechnology News&Howtos > Servers >

Share

Shulou(Shulou.com)05/31 Report--

In this issue, the editor will bring you about the format processing of Data Lake Analytics + OSS data files. The article is rich in content and analyzed and described from a professional point of view. I hope you can get something after reading this article.

0. Preface

Data Lake Analytics is a Serverless-based interactive query and analysis service on the cloud. Users can use standard SQL statements to query and analyze the data stored on OSS and TableStore without moving.

In addition to plain text files (for example, CSV,TSV, etc.), other formats of data files stored on OSS can also be queried and analyzed using Data Lake Analytics, including geographic JSON data of ORC, PARQUET, JSON, RCFILE, AVRO and even ESRI specifications, and files that can be matched with regular expressions.

Here is a detailed description of how to use Data Lake Analytics (hereinafter referred to as DLA) for analysis based on the file format stored on OSS. DLA has built-in implementations of SerDe (short for Serialize/Deserilize for serialization and deserialization) to handle file data. Users do not need to write their own programs, but can basically choose one or more SerDe in DLA to match the data file format on your OSS.

1. Storage format and SerDe

Users can build tables based on the data files stored on OSS and specify the format of the data files through STORED AS.

For example,

CREATE EXTERNAL TABLE nation (N_NATIONKEY INT, N_NAME STRING, N_REGIONKEY INT, N_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://test-bucket-julian-1/tpch_100m/nation'

After a successful table creation, you can use the SHOW CREATE TABLE statement to view the original table creation statement.

Mysql > show create table nation +- - - - - -+ | Result | +- - - - - - -+ | CREATE EXTERNAL TABLE `nation` (`n_ nationkey` int `nname` string, `nregionkey` int `n_ comment`string) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS `TEXTFILE`location' oss://test-bucket-julian-1/tpch_100m/nation' | + - - - - - -+ 1 row in set (1.81 sec)

The following table lists the file formats that DLA already supports. When creating a table for files in the following formats, you can directly use STORED AS,DLA to select the appropriate SERDE/INPUTFORMAT/OUTPUTFORMAT.

Storage format

Description

STORED AS TEXTFILE

Data files are stored in plain text format. The default file type.

Each row in the file corresponds to a record in the table.

STORED AS ORC

The data file is stored in ORC format.

STORED AS PARQUET

The data file is stored in PARQUET format.

STORED AS RCFILE

The data file is stored in RCFILE format.

STORED AS AVRO

The data file is stored in AVRO format.

STORED AS JSON

Data files are stored in JSON (except geographic JSON data files for Esri ArcGIS).

While specifying STORED AS, you can also specify SerDe (used to parse the data file and map to the DLA table), special column delimiters, etc., according to the characteristics of the specific file.

The later part will be explained further.

two。 Example 2.1 CSV file

CSV files, which are essentially plain text files, can be used with STORED AS TEXTFILE.

Columns are separated by commas and can be represented by ROW FORMAT DELIMITED FIELDS TERMINATED BY','.

Ordinary CSV file

For example, the content of the data file oss://bucket-for-testing/oss/text/cities/city.csv is

Beijing,China,010ShangHai,China,021Tianjin,China,022

A table statement can be

CREATE EXTERNAL TABLE city (city STRING, country STRING, code INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY', 'STORED AS TEXTFILE LOCATION' oss://bucket-for-testing/oss/text/cities'; uses OpenCSVSerde__ to process fields referenced by quotation marks

You should pay attention to the following points when using OpenCSVSerde:

Users can specify field separators, field content reference symbols, and escape characters for the fields of a row, for example: WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\")

Line separators embedded in fields are not supported

All fields define the STRING type

For other data types, you can convert them using functions in SQL.

For example,

CREATE EXTERNAL TABLE test_csv_opencsvserde (id STRING, name STRING, location STRING, create_date STRING, create_timestamp STRING, longitude STRING, latitude STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'with serdeproperties (' separatorChar'=',','quoteChar'=' ", 'escapeChar'='\\') STORED AS TEXTFILE LOCATION 'oss://test-bucket-julian-1/test_csv_serde_1'; custom delimiter

You need a custom column separator (FIELDS TERMINATED BY), escape character (ESCAPED BY), and line Terminator (LINES TERMINATED BY).

It needs to be specified in the table statement.

ROW FORMAT DELIMITED FIELDS TERMINATED BY't 'ESCAPED BY'\\ 'LINES TERMINATED BY'\ n' ignores HEADER in the CSV file

In csv files, there is sometimes HEADER information that needs to be ignored when the data is read. At this point, you need to define skip.header.line.count in the table-building statement.

For example, the data file oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl contains the following:

N_NATIONKEY | N_NAME | N_REGIONKEY | N_COMMENT0 | ALGERIA | 0 | haggle. Carefully final deposits detect slyly agai | 1 | ARGENTINA | 1 | al foxes promise slyly according to the regular accounts. Bold requests alon | 2 | BRAZIL | 1 | y alongside of the pending deposits. Carefully special packages are about the ironic forges. Slyly special | 3 | CANADA | 1 | eas hang ironic, silent packages. Slyly regular packages are furiously over the tithes. Fluffily bold | 4 | EGYPT | 4 | y above the carefully unusual theodolites. Final dugouts are quickly across the furiously regular d | 5 | ETHIOPIA | 0 | ven packages wake quickly. Regu |

The corresponding table-building statements are:

CREATE EXTERNAL TABLE nation_header (N_NATIONKEY INT, N_NAME STRING, N_REGIONKEY INT, N_COMMENT STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY'| 'STORED AS TEXTFILE LOCATION' oss://my-bucket/datasets/tpch/nation_csv/nation_header.tbl'TBLPROPERTIES ("skip.header.line.count" = "1")

The value x of skip.header.line.count is related to the actual number of rows n of the data file as follows:

When xencrypn, DLA will filter out all the contents of the file when it reads the file.

2.2 TSV Fil

Similar to CSV files, files in TSV format are plain text files with the delimiter Tab between columns.

For example, the content of the data file oss://bucket-for-testing/oss/text/cities/city.tsv is

Beijing China 010ShangHai China 021Tianjin China 022

A table statement can be

CREATE EXTERNAL TABLE city (city STRING, country STRING, code INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY'\ t 'STORED AS TEXTFILE LOCATION' oss://bucket-for-testing/oss/text/cities';2.3 Multi-character data Field Separator File

Assuming that the delimiter of your data field contains multiple characters, you can use the following example table statement, where the data field delimiter for each row is "| |" and can be replaced with your specific delimiter string.

ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'with serdeproperties ("field.delim" = "| |")

Example:

CREATE EXTERNAL TABLE test_csv_multidelimit (id STRING, name STRING, location STRING, create_date STRING, create_timestamp STRING, longitude STRING, latitude STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.MultiDelimitSerDe'with serdeproperties ("field.delim" = "| |") STORED AS TEXTFILE LOCATION' oss://bucket-for-testing/oss/text/cities/';2.4 JSON file

JSON files that can be processed by DLA are usually stored in plain text format, and the data files need to be encoded in UTF-8.

In the JSON file, each line must be a complete JSON object.

For example, the following file format is not acceptable

{"id": 123," name ":" jack "," c3 ":" 2001-02-03 12:34:56 "} {" id ": 456," name ":" rose "," c3 ":" 1906-04-18 05:12:00 "} {" id ": 789," name ":" tom "," c3 ":" 2001-02-03 12:34:56 "} {" id ": 234," name ":" alice " "c3": "1906-04-18 05:12:00"}

It needs to be rewritten as:

{"id": 123," name ":" jack "," c3 ":" 2001-02-03 12:34:56 "} {" id ": 456," name ":" rose "," c3 ":" 1906-04-18 05:12:00 "} {" id ": 789," name ":" tom "," c3 ":" 2001-02-03 12:34:56 "} {" id ": 234," name ":" alice " "C3": "1906-04-18 05:12:00"} does not contain nested JSON data

Table-building sentences can be written

CREATE EXTERNAL TABLE T1 (id int, name string, c3 timestamp) STORED AS JSONLOCATION 'oss://path/to/t1/directory'; contains nested JSON files

Define nested JSON data using struct and array structures.

For example, user raw data (note: whether nested or not, a complete piece of JSON data can only be placed on one line before it can be processed by Data Lake Analytics:

{"DocId": "Alibaba", "User_1": {"Id": 1234, "Username": "bob1234", "Name": "Bob", "ShippingAddress": {"Address1": "969 Wenyi West St.", "Address2": null "City": "Hangzhou", "Province": "Zhejiang"}, "Orders": [{"ItemId": 6789, "OrderDate": "11ax 11max 2017"} {"ItemId": 4352, "OrderDate": "12 Universe 12 Placement 2017"]}}

After formatting using the online JSON formatting tool, the data content is as follows:

{"DocId": "Alibaba", "User_1": {"Id": 1234, "Username": "bob1234", "Name": "Bob", "ShippingAddress": {"Address1": "969 Wenyi West St.", "Address2": null, "City": "Hangzhou" "Province": "Zhejiang"}, "Orders": [{"ItemId": 6789, "OrderDate": "11 ItemId 11and2017"}, {"ItemId": 4352 "OrderDate": "12 Compact 12 Universe 2017"]}}

The table statement can be written as follows (Note: the path specified in LOCATION must be the directory where the JSON data file is located, and all JSON files in this directory can be recognized as the data of the table):

CREATE EXTERNAL TABLE json_table_1 (docid string, user_1 struct

< id:INT, username:string, name:string, shippingaddress:struct< address1:string, address2:string, city:string, province:string >

, orders:array

< struct< itemid:INT, orderdate:string >

> >) STORED AS JSONLOCATION 'oss://xxx/test/json/hcatalog_serde/table_1/'

Query the table:

Select * from json_table_1 +- -- + | docid | user_1 | +-+-- -+ | Alibaba | [1234 Bob1234, Bob, [969 Wenyi West St., null, Hangzhou, Zhejiang], [[6789, 11/11/2017], [4352 ]] | +-+-+

For nested structures defined by struct, you can use "." Make hierarchical object references, and for array structures defined by array, you can make object references through "[array subscript]" (note: array subscript starts at 1).

Select DocId, User_1.Id, User_1.ShippingAddress.Address1, User_1.Orders [1] .ItemIdfrom json_table_1where User_1.Username = 'bob1234' and User_1.Orders [2] .OrderDate =' 12 Greater 2017' +-+ | DocId | id | address1 | _ col3 | +-+ | Alibaba | 1234 | 969 Wenyi West St. | 6789 | | +-+ use JSON function to process data |

For example, store the nested JSON value of "value_string" as a string:

{"data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com", "ts": 1524550275112, "value_string": "{\" appName\ ":\"\ "apps\": [],\ "checksum\":\ "50fa0540b430904ee78dff07c7350e1c\",\ "clusterMap\": {\ "DEFAULT\": {\ "defCkport\": 80,\ "defIPPort\": 80,\ "healthCheckTask\": null,\ "healthChecker\": {\ "checkCode\": \ "curlHost\":\ ",\" curlPath\ ":\" / status.taobao\ ",\" type\ ":\" HTTP\ "},\" name\ ":\" DEFAULT\ ",\" nodegroup\ ":\" sitegroup\ ":\"\ ",\" submask\ ":\" 0.0.0.0and0\ ",\" syncConfig\ ": {\" appName\ ":\" trade-ma\ ",\" nodegroup\ ":\" tradema\ " \ "pubLevel\":\ "publish\",\ "role\":\ "site\":\ "\"},\ "useIPPort4Check\": true},\ "disabledSites\": [],\ "enableArmoryUnit\": false,\ "enableClientBeat\": false,\ "enableHealthCheck\": true,\ "enabled\": true,\ "envAndSites\":\ "\",\ "invalidThreshold\": 0.6,\ "ipDeleteTimeout\": 1800000,\ "lastModifiedMillis\": 1524550275107,\ "localSiteCall\": true \ "localSiteThreshold\": 0.8,\ "name\":\ "biz.alibaba.com\",\ "nodegroup\":\ ",\" owners\ ": [\" junlan.zx\ ",\" Zhang San\ ",\" Li Si\ ",\" cui.yuanc\ "],\" protectThreshold\ ": 0,\" requireSameEnv\ ": false,\" resetWeight\ ": false,\" symmetricCallType\ ": null,\" symmetricType\ ":\" warehouse\ ",\" tagName\ ":\" ipGroup\ " \ "tenantId\":\ "\",\ "tenants\": [],\ "token\":\ "1cf0ec0c771321bb4177182757a67fb0\",\ "useSpecifiedURL\": false} "}"

After formatting using the online JSON formatting tool, the data content is as follows:

{"data_key": "com.taobao.vipserver.domains.meta.biz.alibaba.com", "ts": 1524550275112, "value_string": "{\" appName\ ":\"\ "apps\": [],\ "checksum\":\ "50fa0540b430904ee78dff07c7350e1c\",\ "clusterMap\": {\ "DEFAULT\": {\ "defCkport\": 80,\ "defIPPort\": 80,\ "healthCheckTask\": null \ "healthChecker\": {\ "checkCode\": 200,\ "curlHost\":\ "\",\ "curlPath\":\ "/ status.taobao\",\ "type\":\ "HTTP\"},\ "name\":\ "DEFAULT\",\ "nodegroup\":\ "\",\ "sitegroup\":\ "\",\ "submask\":\ "0.0.0.0and0\",\ "syncConfig\": {\ "appName\":\ "trade-ma\" \ "nodegroup\":\ "tradema\",\ "pubLevel\":\ "publish\",\ "role\":\ ",\" site\ ":\"\ "},\" useIPPort4Check\ ": true},\" disabledSites\ ": [],\" enableArmoryUnit\ ": false,\" enableClientBeat\ ": false,\" enableHealthCheck\ ": true,\" enabled\ ": true,\" envAndSites\ ":\"\ ",\" invalidThreshold\ ": 0.6,\" ipDeleteTimeout\ ": 1800000 \ "lastModifiedMillis\": 1524550275107,\ "localSiteCall\": true,\ "localSiteThreshold\": 0.8,\ "name\":\ "biz.alibaba.com\",\ "nodegroup\":\ ",\" owners\ ": [\" junlan.zx\ ",\" Zhang San\ ",\" Li Si\ ",\" cui.yuanc\ "],\" protectThreshold\ ": 0,\" requireSameEnv\ ": false,\" resetWeight\ ": false,\" symmetricCallType\ ": null \ "symmetricType\":\ "warehouse\",\ "tagName\":\ "ipGroup\",\ "tenantId\":\ "tenants\": [],\ "token\":\ "1cf0ec0c771321bb4177182757a67fb0\",\ "useSpecifiedURL\": false} "}"

The statement of the table is

CREATE external TABLE json_table_2 (data_key string, ts bigint, value_string string) STORED AS JSONLOCATION 'oss://xxx/test/json/hcatalog_serde/table_2/'

After the table is built, you can make a query:

Select * from json_table_2 + - - - - - - - - -+ | data_key | ts | value_string | | +-+ -+- - - - - - - - -- + | com.taobao.vipserver.domains.meta.biz.alibaba.com | 1524550275112 | {"appName": "" "apps": [], "checksum": "50fa0540b430904ee78dff07c7350e1c", "clusterMap": {"DEFAULT": {"defCkport": 80, "defIPPort": 80, "healthCheckTask": null, "healthChecker": {"checkCode": 200," curlHost ":"," curlPath ":" / status.taobao "," type ":" HTTP "}," name ":" DEFAULT "," nodegroup ":", "sitegroup": "", "submask": "0.0.0.0and0" "syncConfig": {"appName": "trade-ma", "nodegroup": "tradema", "pubLevel": "publish", "role": "", "site": "}," useIPPort4Check ": true}}," disabledSites ": []," enableArmoryUnit ": false," enableClientBeat ": false," enableHealthCheck ": true," enabled ": true," envAndSites ":"," invalidThreshold ": 0.6," ipDeleteTimeout ": 1800000," lastModifiedMillis ": 1524550275107," localSiteCall ": true," localSiteThreshold ": "name": "biz.alibaba.com", "nodegroup": "", "owners": ["junlan.zx", "Zhang San", "Li Si", "cui.yuanc"], "protectThreshold": 0, "requireSameEnv": false, "resetWeight": false, "symmetricCallType": null, "symmetricType": "warehouse", "tagName": "ipGroup", "tenantId": "," tenants ": []," token ":" 1cf0ec0c771321bb4177182757a67fb0 " "useSpecifiedURL": false} | +-+ - - - - - - - - -+

The following SQL example shows how to use common JSON functions such as json_parse,json_extract_scalar,json_extract:

Mysql > select json_extract_scalar (json_parse (value),'$.girls [1]') from json_table_2 +-+ | _ col0 | +-+ | Zhang San | +-+ mysql > select json_extract_scalar (json_obj.json_col,'$.DEFAULT.submask') from (select json_extract (json_parse (value),'$.clusterMap') json_objwhere json_extract_scalar (json_obj.json_col,'$.DEFAULT.healthChecker.curlPath') ='/ status.taobao'' +-+ | _ col0 | +-+ | 0.0.0.0 mysql 0 | +-+ mysql > with json_obj as (select json_extract (json_parse (value),'$.clusterMap') as json_col from json_table_2) select json_extract_scalar (json_obj.json_col,'$.DEFAULT.submask') from json_obj where json_extract_scalar (json_obj.json_col) '$.DEFAULT.healthChecker.curlPath') ='/ status.taobao' +-+ | _ col0 | +-+ | 0.0.0.0 col0 0 | +-+ 2.5 ORC file

Optimized Row Columnar (ORC) is an optimized column storage file format supported by the Apache open source project Hive. Compared with CSV files, it can not only save storage space, but also get better query performance.

For ORC files, you only need to specify STORED AS ORC when you create the table.

For example,

CREATE EXTERNAL TABLE orders_orc_date (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE DATE, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) STORED AS ORC LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/orc_date/orders_orc';2.6 PARQUET files

Parquet is a column storage file format supported by the Apache open source project Hadoop.

When you use DLA to build a table, you need to specify STORED AS PARQUET.

For example,

CREATE EXTERNAL TABLE orders_parquet_date (O_ORDERKEY INT, O_CUSTKEY INT, O_ORDERSTATUS STRING, O_TOTALPRICE DOUBLE, O_ORDERDATE DATE, O_ORDERPRIORITY STRING, O_CLERK STRING, O_SHIPPRIORITY INT, O_COMMENT STRING) STORED AS PARQUET LOCATION 'oss://bucket-for-testing/datasets/tpch/1x/parquet_date/orders_parquet';2.7 RCFILE files

Record Columnar File (RCFile), column storage files, can effectively store relational table structures in distributed systems, and can be read and processed efficiently.

DLA needs to specify STORED AS RCFILE when creating a table.

For example,

CREATE EXTERNAL TABLE lineitem_rcfile_date (L_ORDERKEY INT, L_PARTKEY INT, L_SUPPKEY INT, L_LINENUMBER INT, L_QUANTITY DOUBLE, L_EXTENDEDPRICE DOUBLE, L_DISCOUNT DOUBLE, L_TAX DOUBLE, L_RETURNFLAG STRING, L_LINESTATUS STRING, L_SHIPDATE DATE, L_COMMITDATE DATE, L_RECEIPTDATE DATE, L_SHIPINSTRUCT STRING, L_SHIPMODE STRING L_COMMENT STRING) STORED AS RCFILELOCATION 'oss://bucke-for-testing/datasets/tpch/1x/rcfile_date/lineitem_rcfile'2.8 AVRO file

When DLA builds tables for AVRO files, you need to specify STORED AS AVRO, and the fields defined need to conform to the schema of the AVRO file.

If you are not sure that you can use the tools provided by Avro, you can get schema and build tables based on schema.

Download avro-tools-.jar from the Apache Avro official website to local, and execute the following command to get the schema of the Avro file:

Java-jar avro-tools-1.8.2.jar getschema / path/to/your/doctors.avro {"type": "record", "name": "doctors", "namespace": "testing.hive.avro.serde", "fields": [{"name": "number", "type": "int", "doc": "Order of playing the role"} {"name": "first_name", "type": "string", "doc": "first name of actor playing role"}, {"name": "last_name", "type": "string", "doc": "last name of actor playing role"}]}

The table statement is as follows. The name in fields corresponds to the column name in the table. Type needs to refer to the table in this document and convert it to the type supported by DLA.

CREATE EXTERNAL TABLE doctors (number int,first_name string,last_name string) STORED AS AVROLOCATION 'oss://mybucket-for-testing/directory/to/doctors'

In most cases, the type of Avro can be converted directly to the corresponding type in DLA. If this type is not supported in DLA, it will be converted to a similar type. Please refer to the following table for details:

Avro type corresponds to DLA type nullvoidbooleanbooleanintintlongbigintfloatfloatdoubledoublebytesbinarystringstringrecordstructmapmaplistarrayunionunionenumstringfixedbinary2.9 files that can be matched with regular expressions

Typically, files of this type are stored on the OSS in plain text format, with each row representing a record in the table, and each row can be matched with a regular expression.

For example, the Apache WebServer log file is this type of file.

The contents of a log file are:

127.0.0.1-frank [10/Oct/2000:13:55:36-0700] "GET / apache_pb.gif HTTP/1.0" 200 2326127.0.0.1-- [26/May/2009:00:00:00 + 0000] "GET / someurl/?track=Blabla (Main) HTTP/1.1" 200 5864-"Mozilla/5.0 (Windows; U; Windows NT 6.0) En-US) AppleWebKit/525.19 (KHTML, like Gecko) Chrome/1.0.154.65 Safari/525.19 "

Each line of file can be represented by the following regular expression, separated by spaces between columns:

([^] *) (-;\ [[^\]] *\]) ([^\ "] * |\" [^\ "] *\") (- | [0-9] *) (- | [0-9] *) (?: ([^\ "] * |\" [^\ "] *\") ([^\ "] * |\" [^\ "] *\")?

For the above file format, the table statement can be expressed as:

CREATE EXTERNAL TABLE serde_regex (host STRING, identity STRING, userName STRING, time STRING, request STRING, status STRING, size INT, referer STRING Agent STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'WITH SERDEPROPERTIES ("input.regex" = "([^] *) (- |\ [[^\]] *\]) ([^\"] * |\ "[^\"] *\ ") (- | [0-9] *) (- | [0-9] *) (?: ([^\") ] * |\ "[^\"] *\ ") ([^\"] * |\ "[^\"] *\ ")?) STORED AS TEXTFILELOCATION 'oss://bucket-for-testing/datasets/serde/regex'

Query result

Mysql > select * from serde_regex + -+-+-- -+ | host | identity | userName | time | request | status | size | referer | agent | | +-- +-- -+- -+ | 127.0.0.1 |-| frank | [10/Oct/2000:13:55:36-0700] | "GET / apache_pb.gif HTTP/1.0" | 2326 | NULL | NULL | | 127.0.0.1 |-|-| [26/May/2009:00:00:00 + 0000] | "GET / someurl/?track=Blabla (Main) HTTP/1.1" | 5864 |-| "Mozilla/5.0 (Windows) | U; Windows NT 6.0 En-US) AppleWebKit/525.19 (KHTML Like Gecko) Chrome/1.0.154.65 Safari/525.19 "| +-- +-- -+- -- Geographic JSON data file of + 2.10 Esri ArcGIS

DLA supports the SerDe processing of Esri ArcGIS's geographic JSON data files. For a description of this geographic JSON data format, please refer to: https://github.com/Esri/spatial-framework-for-hadoop/wiki/JSON-Formats

Example:

CREATE EXTERNAL TABLE IF NOT EXISTS california_counties (Name string, BoundaryShape binary) ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'STORED AS INPUTFORMAT' com.esri.json.hadoop.EnclosedJsonInputFormat'OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'LOCATION' oss://test_bucket/datasets/geospatial/california-counties/'

As you can see from the above example, DLA can support most files in open source storage formats. For the same data, using different storage formats, the size of the file stored in OSS, DLA query analysis speed will be quite different. It is recommended to use ORC format for file storage and query.

In order to get faster query speed, DLA is still in the process of optimization, and more data sources will be supported in the future, which will bring users a better big data analysis experience.

The above is the format of the Data Lake Analytics + OSS data file shared by the editor. If you happen to have similar doubts, please refer to the above analysis to understand. If you want to know more about it, you are welcome to follow the industry information channel.

Welcome to subscribe "Shulou Technology Information " to get latest news, interesting things and hot topics in the IT industry, and controls the hottest and latest Internet news, technology news and IT industry trends.

Views: 0

*The comments in the above article only represent the author's personal views and do not represent the views and positions of this website. If you have more insights, please feel free to contribute and share.

Share To

Servers

Wechat

© 2024 shulou.com SLNews company. All rights reserved.

12
Report