Many newcomers are unsure how to get started with Spark NLP. This article explains why the library has become so widely used and then walks through how to begin using it.
Application of AI in Enterprises
O'Reilly's annual report on AI adoption in the enterprise was released in February 2019. It surveyed more than 1,300 practitioners across multiple vertical industries, asking about the AI projects running in production at their companies, how those projects are applied, and how quickly AI is expanding into deep learning, human-machine interaction, knowledge graphs, and reinforcement learning.
The survey also asked which frameworks and tools respondents' companies mainly use for ML and AI. Summarizing the reported usage:
Spark NLP ranks seventh among all frameworks and tools, and it is by far the most popular NLP library, used twice as widely as spaCy. In fact, leaving aside general-purpose open source tools and cloud services, Spark NLP is the most popular AI tool after scikit-learn, TensorFlow, Keras, and PyTorch.
High accuracy, high performance and scalability
The survey result is consistent with the growing adoption of Spark NLP in healthcare, finance, life sciences, and recruiting over the past few years, and the root cause is a major shift in NLP technology itself.
High accuracy
Over the past three to five years, with the rise of deep learning for natural language, algorithm accuracy has kept improving, and traditional libraries such as spaCy, Stanford CoreNLP, NLTK, and OpenNLP simply cannot match these latest research results.
Driven by the pursuit of higher accuracy and performance, the industry keeps productizing the latest research. The published comparisons to date report F1 scores, benchmarked against spaCy's en_core_web_lg model.
High performance
Thanks to Apache Spark's optimizations, performance on both a single machine and a cluster comes very close to bare metal, and Spark NLP can be an order of magnitude faster than traditional NLP libraries, which are constrained by their older designs.
A year ago, O'Reilly published the most comprehensive benchmark of production-grade NLP libraries to date. Among its results is a comparison of training a simple pipeline in spaCy and in Spark NLP on a single-machine configuration (Intel i5, 4 cores, 16 GB of memory).
A major trend in deep learning is the use of GPUs for both training and inference. Because Spark NLP builds on TensorFlow, it can take full advantage of modern hardware platforms, from NVIDIA's DGX-1 to Intel's Cascade Lake processors. Traditional libraries, whether or not they use deep learning, would need their code rewritten to exploit these new hardware features, and it is exactly these features that improve NLP performance by an order of magnitude.
Scalability
In deep learning, it is increasingly critical to be able to train, run inference, and migrate an entire AI pipeline seamlessly from a single machine to a cluster. Spark NLP benefits from being built natively on Apache Spark ML: it scales out to any Spark cluster, and Spark's distributed execution planning and caching optimizations further improve Spark NLP's performance.
Production-grade code
Unlike research-oriented NLP libraries such as AllenNLP and NLP Architect, Spark NLP is built for enterprise production use.
Open source license agreement
Spark NLP is released under the Apache 2.0 license and is completely free for commercial use, unlike Stanford CoreNLP (which requires a paid license for commercial use) or spaCy's models (distributed under a CC ShareAlike license).
Python, Java, Scala support
Supporting multiple programming languages not only broadens Spark NLP's audience but also avoids shuffling data between processes. For example, spaCy supports only Python, so JVM-based applications must exchange data between the JVM process and a Python process, which adds architectural complexity and degrades performance.
Fast release cycle
In addition to community contributions, Spark NLP has a dedicated development team. Releases ship roughly twice a month, with 26 releases in 2018, and the Spark NLP community welcomes contributions of code, documentation, models, and issues.
Getting started with Python
One of the major design goals of Spark NLP 2.0 is to let users benefit from the Spark and TensorFlow platforms without needing to know either one. Users do not need to know what a Spark ML estimator or transformer is, nor what a TensorFlow graph or session is. Users can still build their own models with Spark NLP, but with minimal time and learning curve, and the 15 pre-trained pipelines and models that ship with the library cover most use cases.
The Python version of Spark NLP can be installed via pip or conda. For installation and configuration details for Jupyter and Databricks, see the installation page (https://nlp.johnsnowlabs.com/docs/en/install). Spark NLP runs in a wide range of environments, including Zeppelin, SageMaker, Azure, GCP, Cloudera, and vanilla Spark, with or without Kubernetes.
The following is a simple sentiment analysis example.
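The original code figure did not survive, so here is a minimal sketch of what such a pipeline looks like; the pipeline name ('analyze_sentiment') and the output key are taken from Spark NLP's published pre-trained English pipelines and may vary by version:

    # Sentiment analysis with a pre-trained pipeline.
    # Assumes the library is installed, e.g.: pip install spark-nlp pyspark
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()  # starts (or reuses) a Spark session

    # 'analyze_sentiment' is one of the pre-trained English pipelines
    pipeline = PretrainedPipeline('analyze_sentiment', lang='en')

    result = pipeline.annotate("Spark NLP made this surprisingly easy.")
    print(result['sentiment'])  # e.g. ['positive']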
The following is an example of named entity recognition using a BERT-based deep learning model.
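That figure is missing as well; a rough equivalent using one of the pre-trained deep learning NER pipelines might look like the sketch below. The exact name of the BERT-based variant differs across Spark NLP versions, so 'recognize_entities_dl' here is a stand-in:

    # Named entity recognition with a pre-trained deep learning pipeline.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()

    # Downloads the pipeline, its embeddings, and the NER model on first use.
    pipeline = PretrainedPipeline('recognize_entities_dl', lang='en')

    result = pipeline.annotate("Google was founded by Larry Page and Sergey Brin.")
    print(result['entities'])  # e.g. ['Google', 'Larry Page', 'Sergey Brin']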
The example code above can process large volumes of text on a Spark cluster. There are two key methods: annotate(), which takes a string as input, and transform(), which takes a Spark DataFrame as input.
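For example, a sketch of the DataFrame path, reusing the pipeline and Spark session from the sketch above and assuming the input column is named 'text' (which is what the pre-trained pipelines expect):

    # transform() runs the same pipeline over a Spark DataFrame,
    # so the work is distributed across the cluster.
    df = spark.createDataFrame(
        [("Google was founded in 1998.",),
         ("Amazon is based in Seattle.",)],
        ["text"],
    )
    annotated = pipeline.transform(df)
    annotated.select("entities.result").show(truncate=False)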
Scala
Spark NLP is written in Scala and operates directly on Spark DataFrames, with zero copying of data within the process, taking full advantage of Spark's execution planning and other optimizations, so it is very natural for Scala and Java developers to use.
Spark NLP is published to Maven, so users only need to add the Spark NLP dependency to use it. Spark NLP's OCR capabilities require installing additional dependencies. The following is a spell-check example.
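The spell-check figure is also missing. The original example was in Scala; for consistency with the rest of this article, here is an equivalent sketch in Python, assuming the 'check_spelling' pre-trained pipeline:

    # Spell checking with a pre-trained pipeline.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()
    pipeline = PretrainedPipeline('check_spelling', lang='en')

    result = pipeline.annotate("Plese alliow me tao introdduce myhelf.")
    print(result['checked'])  # the corrected tokens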
Learn more about Spark NLP
Spark NLP hides many complex details from users, which is why the code snippets above are so short, while still leaving room for users to customize things to their own needs. Spark NLP also includes deep optimizations for training NLP models. The following walks through, step by step, what the Python named-entity-recognition example does.
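The code being described did not survive extraction; reconstructed as a minimal sketch (pipeline name taken from the walkthrough below), it is roughly:

    # The pipeline the step-by-step walkthrough below refers to.
    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()                                         # step 1
    pipeline = PretrainedPipeline('explain_document_dl', lang='en')  # steps 2-3
    result = pipeline.annotate("Harry Potter is a great movie.")     # steps 4-6
    print(list(result.keys()))  # step 7: e.g. ['document', 'sentence',
                                #   'token', 'checked', 'lemma', 'stem',
                                #   'pos', 'embeddings', 'ner', 'entities']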
sparknlp.start() creates the Spark session.
PretrainedPipeline() loads the English version of the explain_document_dl pipeline, its pre-trained models, and their dependencies.
TensorFlow is started in the same JVM process as Spark; it loads the pre-trained embeddings and deep learning models (such as NER), which can be automatically distributed and shared across the cluster.
annotate() runs the NLP inference and schedules the algorithm for each stage of the pipeline.
The NER stage runs on TensorFlow, using a neural network that combines a bidirectional LSTM with a CNN.
The embeddings are used to convert contextual tokens into vectors during inference.
The final result is returned as a Python dictionary.
Give it a go
The Spark NLP homepage offers plenty of examples, documentation, and installation instructions. Spark NLP also provides a Docker image that makes it easy to set up a local environment. If you run into any problems, you can ask for help on the project's Slack.