Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies

David Talby and Claudiu Branzan offer a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records. Infrastructure components include Kafka, Spark Streaming, Spark, and Elasticsearch; data science components include spaCy, custom annotators, curated taxonomies, machine-learned dynamic ontologies, and real-time inferencing.


Talk Title	Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies
Speakers	David Talby (Pacific AI), Claudiu Branzan (Accenture)
Conference	Strata + Hadoop World
Conf Tag	Big Data Expo
Location	San Jose, California
Date	March 14-16, 2017
URL	Talk Page
Slides	Talk Slides
Video

A text-mining system must go way beyond indexing and search to appear truly intelligent. First, it should understand language beyond keyword matching. (For example, distinguishing between “Jane has the flu,” “Jane may have the flu,” “Jane is concerned about the flu," “Jane’s sister has the flu, but she doesn’t,” or “Jane had the flu when she was 9” is of critical importance.) This is a natural language processing problem. Second, it should “read between the lines” and make likely inferences even if they’re not explicitly written. (For example, if Jane has had a fever, a headache, fatigue, and a runny nose for three days, not as part of an ongoing condition, then she likely has the flu.) This is a semisupervised machine-learning problem. Third, it should automatically learn the right contextual inferences to make. (For example, learning on its own that fatigue is sometimes a flu symptom—only because it appears in many diagnosed patients—without a human ever explicitly stating that rule.) This is an association-mining problem, which can be tackled via deep learning or via more guided machine-learning techniques. David Talby and Claudiu Branzan lead a live demo of an end-to-end system that makes nontrivial clinical inferences from free-text patient records and provides real-time inferencing at scale. The architecture is built out of open source big data components: Kafka and Spark Streaming for real-time data ingestion and processing, Spark for modeling, and Elasticsearch for enabling low-latency access to results. The data science components include spaCy, a pipeline with custom annotators, machine-learning models for implicit inferences, and dynamic ontologies for representing and learning new relationships between concepts. Source code will be made available after the talk to enable you to hack away on your own.

Semantic natural language understanding at scale using Spark, machine-learned annotators, and deep-learned ontologies

Using R for scalable data analytics: From single machines to Hadoop Spark clusters

Zillow: Transforming real estate through big data and machine learning

Shifting left for continuous quality in an Agile data world

Sparklyr: An R interface for Apache Spark

Streams: Successfully transforming your business one millisecond at a time

Unified, portable, efficient: Batch and stream processing with Apache Beam (incubating)