Big Data Science – Telegram
Big Data Science channel gathers together all interesting facts about Data Science.
For cooperation: a.chernobrovov@gmail.com
💼https://news.1rj.ru/str/bds_job — channel about Data Science jobs and career
💻https://news.1rj.ru/str/bdscience_ru — Big Data Science [RU]
🦋Useful ML Services: Everypixel API for Image Recognition
We continue to get acquainted with useful ML tools. Meet the Everypixel API, a simple yet powerful visual recognition service that uses machine learning to understand images.
The API uses a set of pre-trained models that analyze images and return useful information. It tags images with relevant keywords, which helps with categorization and moderation, and it scores images by quality and aesthetic value. This makes it a good fit for online stores and marketplaces that need to enrich product and image data: images can be uploaded without hand-written descriptions, because the descriptions are filled in automatically. The generated keywords help with SEO, and image categorization improves search and catalog navigation.
Pros of Everypixel API:
• works even when the end user takes a picture at the wrong angle or in poor lighting conditions;
• sees images the way a person sees them;
• can create keywords associated with images;
• selects the best shot from several similar photos;
• can rate images from 0 to 100 depending on their quality.
Disadvantages of Everypixel API:
• the free plan is limited to 100 requests per day;
• cannot rate historical photographs, illustrations, or 3D renderings.
https://labs.everypixel.com/api
🌸TOP-15 Data Science conferences in April 2022:
• Apr 5-6, Healthcare NLP Summit (online training takes place Apr 12-15). https://www.nlpsummit.org/healthcare-2022/
• Apr 6, Google Data Cloud Summit. Virtual. https://cloudonair.withgoogle.com/events/summit-data-cloud-2022
• Apr 13-14, Unite 2022: The Collaborative Intelligence Summit. Atlanta, GA, USA. https://unite2022.com/
• Apr 13, Analytics Summit 2022. Cincinnati, OH, USA. https://web.cvent.com/event/c6511810-01df-4e56-8c98-9c649301e3e4/
• Apr 14-16, WAICF: World AI Cannes Festival. Cannes, France. https://worldaicannes.com/
• Apr 19-21, ODSC East: Open Data Science, Boston, MA, USA. https://odsc.com/boston/
• Apr 20, DSS Virtual: AI & ML in the Enterprise. Virtual. https://www.datascience.salon/virtual-ai-and-ml-enterprise/
• Apr 21-22, RE.WORK AI in Finance Summit. New York, NY, USA https://www.re-work.co/events/ai-in-finance-summit-new-york-2022
• Apr 21-22, RE.WORK AI in Insurance Summit. New York, NY, USA https://www.re-work.co/events/ai-in-insurance-summit-new-york-2022
• Apr 25-27, Data Governance, Quality, and Compliance https://tdwi.org/events/seminars/april/data-governance-quality-compliance/home.aspx
• Apr 25-26, Chief Data & Analytics Officers, APEX East. Fort Myers, FL, USA. https://cdao-apex-east.coriniumintelligence.com/
• Apr 25-29, International Conference on Learning Representations (ICLR) https://www.iclr.cc/Conferences/2022
• Apr 26-27, Insurance AI & Innovative Tech USA 2022. Chicago, IL, USA. https://events.reutersevents.com/insurance/insuranceai-usa
• Apr 27, 4-6PM GMT, Natural Language Generation: Financial services, humans + AI together. London, UK. https://www.meetup.com/london-nlg-meetup-group/events/284525082/
• Apr 27, Computer Vision Summit. San Jose, CA, USA. https://computervisionsummit.com/
🙌🏻Generation of 3D scenes from 2D photos with NVIDIA's NeRF
Inverse rendering uses AI to approximate the behavior of light in the real world, allowing a 3D scene to be reconstructed from a handful of 2D images taken from different angles. The NVIDIA research team has developed an approach that does this almost instantly by combining ultra-fast neural network training with fast rendering.
NVIDIA applied this approach to a popular new technology called Neural Radiance Fields, or NeRF. The result, dubbed Instant NeRF, is the fastest NeRF technique to date, achieving more than a 1000x speedup in some cases. The model needs only a few seconds to train on a few dozen still photos, plus the camera angles they were taken from, and can then render the resulting 3D scene in tens of milliseconds.
NeRFs use neural networks to represent and render realistic 3D scenes from an input collection of 2D images. Collecting data for a NeRF resembles the work of photographers on a red carpet: the neural network needs several dozen images taken from different positions around the scene, together with the camera position for each shot.
Creating a 3D scene with traditional methods typically takes hours or longer, depending on the complexity and resolution of the visualization. Bringing AI into the picture speeds things up. Early NeRF models rendered crisp, artifact-free scenes in a few minutes, but took hours to train. Instant NeRF cuts rendering time by several orders of magnitude. It relies on multi-resolution hash grid encoding, which is optimized to run efficiently on NVIDIA GPUs and makes it possible to get high-quality results from a small, fast neural network.
The model was developed using the NVIDIA CUDA Toolkit and the Tiny CUDA Neural Networks library. Because the network is so lightweight, it can be trained and run on a single NVIDIA GPU; it runs fastest on cards with NVIDIA Tensor Cores.
This technology will be useful for training robots and self-driving cars so that they can understand the size and shape of objects in the real world by capturing 2D images or video recordings of them. It can also be used in architecture and entertainment to quickly create digital representations of real environments that creators can modify and use.
https://blogs.nvidia.com/blog/2022/03/25/instant-nerf-research-3d-ai/
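To make the multi-resolution hash encoding idea more concrete, here is a simplified 2D sketch in plain Python. It is an illustration under assumptions, not NVIDIA's CUDA implementation: the class name HashEncoding2D, the table size, and the resolution schedule are hypothetical, and the feature tables hold small random values here, whereas in a real model they would be trained together with the small neural network.

```python
import math
import random

# Per-dimension multipliers for the XOR-based spatial hash of grid corners.
PRIMES = (1, 2654435761)


class HashEncoding2D:
    """Toy multi-resolution hash encoding for points in the unit square."""

    def __init__(self, levels=4, table_size=2**10, features=2,
                 base_res=4, growth=2.0, seed=0):
        rng = random.Random(seed)
        self.table_size = table_size
        self.features = features
        # One grid resolution per level: 4, 8, 16, 32 with the defaults.
        self.resolutions = [int(base_res * growth**level) for level in range(levels)]
        # One table of feature vectors per level (random here, trainable in practice).
        self.tables = [
            [[rng.uniform(-1e-4, 1e-4) for _ in range(features)]
             for _ in range(table_size)]
            for _ in range(levels)
        ]

    def _index(self, ix, iy):
        # Hash integer grid coordinates into a slot of the fixed-size table.
        return ((ix * PRIMES[0]) ^ (iy * PRIMES[1])) % self.table_size

    def encode(self, x, y):
        """Return the concatenated per-level features for a point in [0, 1]^2."""
        out = []
        for res, table in zip(self.resolutions, self.tables):
            fx, fy = x * res, y * res
            x0, y0 = math.floor(fx), math.floor(fy)
            tx, ty = fx - x0, fy - y0
            feat = [0.0] * self.features
            # Bilinear interpolation over the four surrounding grid corners.
            for dx, wx in ((0, 1 - tx), (1, tx)):
                for dy, wy in ((0, 1 - ty), (1, ty)):
                    corner = table[self._index(x0 + dx, y0 + dy)]
                    for k in range(self.features):
                        feat[k] += wx * wy * corner[k]
            out.extend(feat)
        return out


encoder = HashEncoding2D()
point_features = encoder.encode(0.3, 0.7)  # 4 levels x 2 features = 8 values
```

The resulting vector is what a small network would consume instead of raw coordinates; because each level is a fixed-size table, memory stays bounded even at fine resolutions.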
🗣👂🏻Noise Reduction in Quantum Computing: An MIT Study
Quantum computers are very sensitive to noise caused by imperfect control signals, environmental disturbances, and unwanted interactions between qubits. In classical neural networks, adding more parameters often improves model accuracy, but in variational quantum computing more parameters require more quantum gates, which introduces more noise. To address this, researchers at MIT have created QuantumNAS, a framework that identifies the most robust quantum circuit for a particular computational problem and generates a mapping pattern tailored to the qubits of the target quantum device. QuantumNAS is much less computationally intensive than other search methods and can identify quantum circuits that improve accuracy on machine learning and quantum chemistry problems.
First, a super-circuit is designed that contains all possible parameterized quantum gates in the design space. This circuit is trained and then used to search for circuit architectures with high noise tolerance. The process searches for quantum circuits and qubit mappings simultaneously using an evolutionary search algorithm. The algorithm generates several candidate circuits and qubit mappings, then evaluates their accuracy with a noise model or on a real machine. The results are fed back into the algorithm, which selects the best-performing parts and uses them to restart the process until it finds the best candidates. The developers have released the results of the study in the open source TorchQuantum library: https://github.com/mit-han-lab/torchquantum.
https://news.mit.edu/2022/quantum-circuits-robust-noise-0321
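The generate, evaluate, select, restart loop described above can be sketched with a generic evolutionary search. This is a toy illustration, not QuantumNAS or the TorchQuantum API: candidates are plain bit lists standing in for "gate on/off" choices, and toy_accuracy is a hypothetical scoring function in which every enabled gate costs some noise and only the first half of the gates are useful.

```python
import random


def toy_accuracy(genome):
    """Hypothetical score: useful gates add 1.0, every enabled gate costs 0.3 in noise."""
    half = len(genome) // 2
    return sum(genome[:half]) - 0.3 * sum(genome)


def evolutionary_search(fitness, genome_len, pop_size=20, generations=30,
                        keep=5, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(genome_len)]
                  for _ in range(pop_size)]
    history = []  # best score seen in each generation
    for _ in range(generations):
        # Evaluate all candidates and keep the best ones as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:keep]
        history.append(fitness(parents[0]))
        # Recombine parts of good candidates and mutate one position.
        children = []
        while len(children) < pop_size - keep:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, genome_len)
            child = a[:cut] + b[cut:]
            child[rng.randrange(genome_len)] ^= 1
            children.append(child)
        population = parents + children
    return max(population, key=fitness), history


best, history = evolutionary_search(toy_accuracy, genome_len=12)
```

Because the parents are carried over unchanged, the best score never decreases from one generation to the next; in the real setting the score would come from a noise model or a run on hardware rather than a formula.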
📝Auto-generate summaries from Google Docs
Google Docs now automatically generates suggested summaries of document content when available. While all users can add summaries, auto-generated suggestions are currently only available to Google Workspace business customers.
This is achieved with natural language understanding (NLU) and natural language generation (NLG) ML models, in particular Transformer and Pegasus. A popular technique for combining NLU and NLG is to train a machine learning model using sequence-to-sequence learning, where the input is the words of the document and the output is the words of the summary. The neural network learns to map input tokens to output tokens. Early applications of the sequence-to-sequence paradigm used recurrent neural networks (RNNs) for both the encoder and the decoder.
Transformers offered a promising alternative to RNNs because self-attention models long input and output dependencies better, which is critical when summarizing documents. However, these models require large amounts of manually labeled data for training, so the arrival of Transformers alone was not enough to make significant progress in document summarization.
Combining Transformers with self-supervised pre-training (BERT, GPT, T5) has led to major breakthroughs in many NLU problems for which limited labeled data is available. In self-supervised pre-training, the model uses large amounts of unlabeled text to learn general language understanding and generation capabilities. Then, in a subsequent fine-tuning step, the model learns to apply these abilities to a specific task, such as summarization or question answering.
Pegasus takes this idea one step further by introducing a pre-training objective tailored to abstractive summarization. In Pegasus pre-training, also called Gap Sentence Prediction (GSP), full sentences from unlabeled news articles and web documents are masked from the input, and the model is required to reconstruct them from the remaining unmasked sentences. In particular, GSP uses various heuristics to mask the sentences considered most important to the document, so that pre-training is as close to the summarization task as possible. Pegasus has achieved state-of-the-art results on a diverse set of summarization datasets.
Building on Transformer and Pegasus, the Google AI researchers carefully cleaned and filtered the fine-tuning data so that it contained training examples that were more consistent and reflected a coherent definition of a summary. Despite the reduction in the amount of training data, this resulted in a better model. The next problem was serving a high-quality model in production. Although the Transformer version of the encoder-decoder architecture is the dominant approach to training models for sequence-to-sequence tasks such as abstractive summarization, it can be inefficient and impractical to serve in real-world applications. The main inefficiency comes from the Transformer decoder, which generates the output summary token by token through autoregressive decoding. Decoding becomes noticeably slower as summaries get longer, because the decoder attends to all previously generated tokens at each step. RNNs are a more efficient architecture for decoding, since they do not apply self-attention over previous tokens as the Transformer does.
The researchers used knowledge distillation, which transfers knowledge from a large model to a smaller, more efficient one, to distill the Pegasus model into a hybrid architecture with a Transformer encoder and an RNN decoder, and reduced the number of RNN decoder layers to improve efficiency further. The resulting model has lower latency and a smaller memory footprint while maintaining the original quality.
https://ai.googleblog.com/2022/03/auto-generated-summaries-in-google-docs.html
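As a rough illustration of how a gap-sentence training pair can be built, the sketch below masks the sentence that shares the most words with the rest of the document. The word-overlap score is a simple stand-in for the importance heuristics mentioned above, and the function name and the "<mask_1>" placeholder are assumptions made for this example, not Pegasus code.

```python
import re

MASK = "<mask_1>"  # placeholder token name, chosen for this sketch


def _words(text):
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def gap_sentence_example(sentences, n_mask=1):
    """Mask the n_mask sentences that overlap most with the rest of the document."""
    scores = []
    for i, sentence in enumerate(sentences):
        rest = set()
        for j, other in enumerate(sentences):
            if j != i:
                rest |= _words(other)
        words = _words(sentence)
        # Fraction of this sentence's words that also occur elsewhere.
        scores.append(len(words & rest) / len(words) if words else 0.0)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:n_mask])
    model_input = " ".join(MASK if i in chosen else s
                           for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in chosen)
    return model_input, target


document = [
    "Modin is a pandas replacement.",
    "It was sunny on Tuesday.",
    "Modin scales pandas code to many cores.",
]
model_input, target = gap_sentence_example(document)
```

The model would be trained to produce target from model_input, which resembles writing a one-sentence summary of the remaining text.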
📝Dataframe validation with Pandera
In large DS projects, the Great Expectations framework can be used to validate datasets and check data quality. Smaller tasks, however, call for simpler tools, such as the lightweight Python library Pandera, which explicitly checks the contents of dataframes at runtime. Pandera lets you define a data schema once, using a class-based API with pydantic-style syntax, and use it to validate various types of dataframes, including pandas, dask, modin, and pyspark.pandas. You can check the types and properties of columns in a pd.DataFrame or values in a pd.Series, and perform more complex statistical validation such as hypothesis testing. You can also synthesize data from schema objects for property-based testing with pandas data structures.
Function decorators let you integrate validation into existing data analysis and processing pipelines. With lazy validation, Pandera collects all validation errors in a dataframe and reports them together instead of stopping at the first one. Finally, compatibility with other Python tools such as pydantic, fastapi, and mypy makes Pandera a useful tool for ML developers and data analysts.
Documentation: https://pandera.readthedocs.io/en/stable/
Example: https://towardsdatascience.com/validate-your-pandas-dataframe-with-pandera-2995910e564
💥Why you need Modin: Pandas alternative for fast big data processing
Handling large dataframes with Pandas is slow, because this Python library does not support working with data that does not fit in available memory. As a result, Pandas workflows that work well for prototyping on a few MB of data don't scale to tens or hundreds of GB of real data. Because it executes operations in a single thread and in RAM, Pandas is not well suited to processing really large datasets. An alternative is Modin, a Python library with a Pandas-like API that scales across all processor cores using the Dask or Ray engine.
Modin supports working with data that won't fit in memory, so you can comfortably work with hundreds of GB without worrying about substantial slowdowns or memory errors. With cluster and out-of-core support, Modin is a DataFrame library with strong single-node performance and high scalability in a cluster.
In local mode (no cluster), Modin creates and manages a local Dask or Ray cluster for execution. There is no need to specify how to distribute the data, or even to know how many cores the system has. In fact, you can keep using your existing Pandas code by simply changing the import statement from pandas to modin.pandas and get a significant speedup even on a single machine. Modin delivers speedups of up to 4x on a laptop with 4 physical cores.
Docs: https://modin.readthedocs.io/en/latest/index.html
Github: https://github.com/modin-project/modin
Z-scoring for simple and fast anomaly detection
Anomaly detection is a fairly common problem that covers many scenarios, from financial fraud to computer network failures. Some problems require complex machine learning models, but simpler and cheaper methods are often sufficient. For example, you may have sales data over a period of time and want to flag days with abnormally high volumes, or highlight customers with an abnormally high number of credit card swipes for risk review.
For such cases, a simple statistical method for flagging outliers, called Z-scoring, will do. The score equals the difference between the current value and the mean, divided by the standard deviation. Z-scoring assumes the values are approximately normally distributed. For skewed data, converting values to a logarithmic scale first can improve the ability of Z-scores to flag outliers, and can also help many ML models discern relationships.
In practice, implementing Z-scoring is very simple: it can be written as a small script or even a set of SQL queries to quickly get a lightweight MVP and test a hypothesis.
https://towardsdatascience.com/anomaly-detection-in-sql-2bcd8648f7a8
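As a small script, the method might look like the sketch below. It uses only the Python standard library; the sales figures and the threshold of 3 standard deviations are illustrative assumptions, not values from the linked article.

```python
from statistics import mean, pstdev


def zscores(values):
    """Return (x - mean) / standard deviation for every value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: nothing stands out
    return [(x - mu) / sigma for x in values]


def flag_outliers(values, threshold=3.0):
    """Return the positions whose absolute z-score exceeds the threshold."""
    return [i for i, z in enumerate(zscores(values)) if abs(z) > threshold]


# Hypothetical daily sales; the last day is abnormally high.
daily_sales = [102, 98, 101, 97, 103, 99, 100, 96, 104, 100, 98, 102, 500]
outlier_days = flag_outliers(daily_sales)
```

Note that on short series a single extreme value also inflates the mean and standard deviation, so the threshold may need to be lowered when only a handful of observations are available.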
🕸✍🏻3 Python-libraries for working with URLs
Processing URLs is a common task in practice: for example, compiling a list of the most frequently visited sites, or of the sites that may be visited from corporate computers during business hours. The following Python libraries help automate such tasks:
• Yarl - lets you extract components from a URL and provides a convenient class for parsing and modifying the address of a web resource. However, it works only with Python 3 and does not accept boolean values in the API: you need to convert booleans to strings yourself, using whatever convention you need. https://github.com/aio-libs/yarl
• Furl - makes parsing and manipulating URLs easier. The library has a wide range of features, but also some limitations. In particular, the furl object is mutable, so problems can arise when it is passed to outside code. https://github.com/gruns/furl
• URLObject - a utility class for manipulating URLs with a clean API that favors well-named methods over operator overloading. The object itself is immutable: each change creates a new URL object. However, the library performs no decoding or encoding, which users have to handle themselves. https://github.com/zacharyvoase/urlobject
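For comparison, the "most frequently visited sites" task can be sketched without any of these libraries, using only urllib.parse from the standard library. The function name and the sample list are made up for illustration; the three libraries above wrap the same kind of parsing in a more convenient API.

```python
from collections import Counter
from urllib.parse import urlsplit


def top_hosts(urls, n=3):
    """Count how often each host name appears and return the n most common."""
    hosts = []
    for url in urls:
        host = urlsplit(url).hostname  # lowercased; None when there is no host part
        if host:
            hosts.append(host)
    return Counter(hosts).most_common(n)


visited = [
    "https://github.com/aio-libs/yarl",
    "https://github.com/gruns/furl",
    "https://GitHub.com/zacharyvoase/urlobject",
    "https://pandera.readthedocs.io/en/stable/",
    "not a url",
]
ranking = top_hosts(visited)
```

Strings without a host part, such as the last entry, are skipped rather than counted.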
👆🏻Something about deduplication with DISTINCT
You can exclude duplicates from a result set by simply adding the DISTINCT keyword to the SQL query. However, this simple solution is not always the right one. To guarantee that a data set has no duplicates, the DBMS has to compare all rows with each other and filter out the repeats. This requires a lot of CPU and memory, because all the rows must be held and compared in memory, even if hashing is used at a low level. In addition, DISTINCT reduces the parallelism of the computation, which slows down query execution.
DISTINCT removes duplicates, but it does not fix the incorrect joins and filters that most often cause them in practice: for example, a CROSS JOIN, or using RANK instead of ROW_NUMBER, which produces duplicates when the window partition is poorly defined. See here for details with code examples: https://jmarquesdatabeyond.medium.com/sql-like-a-pro-please-stop-using-distinct-31bdb6481256
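The RANK versus ROW_NUMBER case can be reproduced with an in-memory SQLite database. This is a sketch with an invented orders table, and it assumes the bundled SQLite is version 3.25 or newer, which is when window functions were added. The goal is "the latest order per customer"; with RANK, a tie on the ordering column returns two rows for the same customer, and DISTINCT would not help because the rows differ in amount.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, order_day TEXT, amount INTEGER);
    INSERT INTO orders VALUES
        ('alice', '2022-04-01', 10),
        ('alice', '2022-04-02', 30),
        ('alice', '2022-04-02', 25),
        ('bob',   '2022-04-01', 15);
""")

# The same query shape is run twice, with a different window function each time.
QUERY = """
    SELECT customer, order_day, amount FROM (
        SELECT *, {fn}() OVER (
            PARTITION BY customer ORDER BY order_day DESC{tiebreak}
        ) AS rn
        FROM orders
    ) WHERE rn = 1
    ORDER BY customer, amount
"""

# RANK gives both of alice's 2022-04-02 orders rank 1, so she appears twice.
with_rank = conn.execute(QUERY.format(fn="RANK", tiebreak="")).fetchall()

# ROW_NUMBER with an explicit tie-breaker returns exactly one row per customer.
with_row_number = conn.execute(
    QUERY.format(fn="ROW_NUMBER", tiebreak=", amount DESC")
).fetchall()
conn.close()
```

Fixing the window definition removes the repetition at its source, so no DISTINCT pass over the result is needed.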
💥DataSpell: A professional data science development environment from JetBrains
Lacking a comfortable development environment in a lightweight Jupyter notebook? Need to write Python code in a reliable IDE with all DS libraries? Try DataSpell by JetBrains, a professional IDE like PyCharm that combines many popular data analysis and machine learning libraries with a powerful set of developer tools.
Released in 2020, today DataSpell is in demand by machine learning developers and data analysts around the world.
https://www.jetbrains.com/ru-ru/dataspell/
#test
What can be used to reduce the risk of an ML model overfitting?
Anonymous Quiz
• Normalization - 8%
• Regularization - 83%
• Normalization - 5%
• Optimization - 4%
☀️TOP-15 Data Science and ML conferences all over the World in May 2022:
• 5-6 May - The #1 MLOps Conference on the planet - Marriott Marquis, New York, NY https://rev.dominodatalab.com/
• 5-6 May - Data Innovation Summit 2022 - KISTAMÄSSAN, STOCKHOLM https://datainnovationsummit.com/
• 10-12 May - Wrangle Summit 2022 Virtual https://www.trifacta.com/events/wrangle-summit-2022/
• 11-12 May - Big Data & AI World. Frankfurt, Germany. https://www.bigdataworldfrankfurt.de/
• 12-13 May - The Data Science Conference. Chicago, IL, USA https://www.thedatascienceconference.com/
• 12 May, 9AM ET - Ontotext Demo-Day. Virtual. https://event.gotowebinar.com/event/bfd3b6ef-828c-46a1-a644-b4e785cece6c
• 15-18 May - FLAIRS-35: Special Track on Neural Networks and Data Mining, Jensen Beach, FL, USA. https://sites.google.com/view/flairs-35-nn-dm-track/home
• 17 May - The data dividend: reimagining data strategies to deepen insight. San Francisco, CA, USA https://events.economist.com/custom-events/the-data-dividend-san-francisco/
• 18 May - Data Science Mini Salon | AI and ML in Retail & E-Commerce. Virtual. https://www.datascience.salon/retail-and-ecommerce
• 23-25 May - TDWI Visualization, Dashboards, and Analytics Adoption https://tdwi.org/events/seminars/may/dashboards-visualization-analytics-adoption/home.aspx
• 24-25 May - Graph + AI Summit. Virtual. https://www.tigergraph.com/graphaisummit
• 24-25 May - Chief Data & Analytics Officers, Insurance US. New York, NY, USA. https://cdaoi.coriniumintelligence.com/
• 25-26 May - Data Reliability Engineering Conference. Virtual https://drecon.org/
• 26 May - Zero Gravity: A Modern Cloud Data Pipeline Event. Virtual. https://www.incorta.com/zerogravity
• 30 May - HeyGrowth - Yerevan, Armenia https://heygrowth.com/yerevan
💫Continuous Machine Learning: CML for CI/CD
Need to introduce CI/CD into the development of ML systems? Try CML, an open source CLI tool from Iterative.ai for implementing CI/CD within MLOps. It is suited to automating ML development workflows, including provisioning, model training and evaluation, comparing experiments across the project history, and monitoring changing datasets. CML is based on the following principles:
• GitLab or GitHub for managing ML experiments, with DVC for tracking model training and data changes;
• automated reports for machine learning experiments, with metrics and plots on every pull request, to support data-driven decisions;
• no additional services: only GitLab, Bitbucket or GitHub, Docker and DVC. Optionally, you can add cloud storage, as well as self-hosted or cloud runners such as AWS EC2 or MS Azure.
CML brings CI/CD-style automation into the workflow: most of the configuration is defined in a cml.yaml file stored in the repository. This file specifies which actions to take when a new feature branch is ready to be merged into the main branch. When a pull request is created, GitHub Actions picks up this workflow and performs the actions specified in the configuration file.
Source code: https://github.com/iterative/cml
Documentation: https://cml.dev/doc
Use case example: https://towardsdatascience.com/continuous-machine-learning-e1ffb847b8da
#test
Which method in Apache Spark works with the file system rather than RAM?
Anonymous Quiz
• partitionBy() - 46%
• coalesce() - 15%
• repartition() - 39%